web / HTMLDataExtractor / 0.2.2



Extracts HTML from a URL, a file, or raw input; optionally runs an XPath query; and returns the content as structured JSON.


Provide one of the following: a URL to be scraped, raw HTML, or a FILE from a data collection:




Any of these may be a list instead of a single value, for example:
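The original request examples are not preserved in this text, so the following is only a sketch of how such an input might be assembled. The key names "url", "html", "file", "xpath", and "format", and the helper build_input, are assumptions for illustration and are not confirmed by this page; check the algorithm's own examples before relying on them.

```python
import json


def build_input(url=None, html=None, file=None, xpath=None, fmt=None):
    """Build a request payload (hypothetical key names, see note above).

    Exactly one source (url, html, or file) must be given. Each source
    may be a single string or a list of strings.
    """
    sources = {"url": url, "html": html, "file": file}
    given = {key: value for key, value in sources.items() if value is not None}
    if len(given) != 1:
        raise ValueError("provide exactly one of url, html, or file")

    payload = dict(given)
    if xpath is not None:
        payload["xpath"] = xpath    # assumed key for the XPATH option
    if fmt is not None:
        payload["format"] = fmt     # assumed key for the FORMAT option
    return payload


# A single URL.
single = build_input(url="https://example.com")

# A list of URLs, with the optional query and format.
many = build_input(
    url=["https://example.com/a", "https://example.com/b"],
    xpath="//a",
    fmt="BadgerFish",
)

print(json.dumps(single))
print(json.dumps(many))
```

The helper rejects a call with no source or with more than one, which mirrors the "one of" wording above.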


You may also provide an XPATH query. If specified, only items matching the expression will be returned:
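To illustrate what "only items matching the expression" means, here is a small local sketch using Python's standard library. It is not the algorithm's implementation: xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, whereas the engine used by this algorithm is not stated here. The sample markup and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup.
HTML = """<html><body>
<div class="post"><a href="/one">One</a></div>
<div class="ad"><a href="/buy">Buy</a></div>
<div class="post"><a href="/two">Two</a></div>
</body></html>"""

root = ET.fromstring(HTML)

# Select only the links that sit directly inside <div class="post">.
links = root.findall(".//div[@class='post']/a")

hrefs = [a.get("href") for a in links]
titles = [a.text for a in links]

print(hrefs)
print(titles)
```

Only the two links under the "post" divs are kept; the link under the "ad" div does not match the expression and is left out.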


By default, results are returned as JSON using the Cobra convention. However, you can choose a different FORMAT:


  • Abdera: Use "attributes" for attributes, "children" for nodes
  • BadgerFish: Use "$" for text content, "@" to prefix attributes
  • Cobra: Use "attributes" for attributes (even when empty), "children" for nodes, values are strings
  • GData: Use "$t" for text content, attributes added as-is
  • Yahoo: Use "content" for text content, attributes added as-is
  • Parker: Use tail nodes for text content, ignore attributes
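The difference between conventions is easiest to see on one element. The sketch below hand-rolls simplified BadgerFish-style and Parker-style converters with the standard library; the function names and sample markup are illustrative assumptions, edge cases such as mixed content and namespaces are ignored, and the algorithm's actual output may differ in detail.

```python
import xml.etree.ElementTree as ET


def _add(node, key, value):
    """Store value under key, turning repeated keys into a list."""
    if key in node:
        if not isinstance(node[key], list):
            node[key] = [node[key]]
        node[key].append(value)
    else:
        node[key] = value


def badgerfish(elem):
    """BadgerFish-style: "@" prefixes attributes, "$" holds text content."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["$"] = text
    for child in elem:
        _add(node, child.tag, badgerfish(child)[child.tag])
    return {elem.tag: node}


def parker(elem):
    """Parker-style: leaf text becomes the value, attributes are ignored."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    node = {}
    for child in children:
        _add(node, child.tag, parker(child))
    return node


root = ET.fromstring('<p id="intro"><a href="/x">link</a></p>')

bf_result = badgerfish(root)
parker_result = parker(root)

print(bf_result)
print(parker_result)
```

In this sketch the BadgerFish-style result keeps both attributes ("@id", "@href") and marks the link text with "$", while the Parker-style result keeps only the text and drops the attributes and the root tag.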