web / HTMLDataExtractor / 0.2.2

HTMLDATAEXTRACTOR

Extracts HTML from a URL, a file, or raw input; optionally runs an XPath query; and returns the content as structured JSON.

INPUT

Provide one of the following: a URL to be scraped, raw HTML, or a FILE from a data collection:

{"URL":"http://algorithmia.com"}

{"HTML":"<html><p>hello!</p></html>"}

{"FILE":"data://.my/samples/index.html"}

Any of these may be a list instead of a single value, for example:

{"URL":["http://algorithmia.com","http://example.com"]}
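The input objects above can also be assembled in code. The following is a minimal sketch that only builds the JSON; `source_payload` and `SOURCE_KEYS` are hypothetical names introduced for illustration and are not part of the algorithm.

```python
import json

# Source keys named in this README.
SOURCE_KEYS = ("URL", "HTML", "FILE")


def source_payload(key, value):
    """Build an input object for one source key.

    `value` may be a single string or a list/tuple of strings.
    """
    if key not in SOURCE_KEYS:
        raise ValueError(f"unknown source key: {key!r}")
    if isinstance(value, (list, tuple)):
        value = list(value)  # normalise tuples so they serialise as JSON arrays
    return {key: value}


single = source_payload("HTML", "<html><p>hello!</p></html>")
batch = source_payload("URL", ["http://algorithmia.com", "http://example.com"])

print(json.dumps(single))
print(json.dumps(batch))
```

The resulting dictionaries serialise to the same JSON shown in the examples above.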

You may also provide an XPATH query. If specified, only items matching the expression are returned:

{"URL":"http://algorithmia.com","XPATH":"//p"}
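To get a feel for what an expression such as `//p` selects, it can be tried locally first. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so it approximates the idea rather than reproducing how this algorithm evaluates XPATH. `SAMPLE` and `select_text` are hypothetical names for illustration.

```python
import xml.etree.ElementTree as ET

# Small, well-formed sample document (an assumption for this example).
SAMPLE = "<html><body><p>hello!</p><div><p>bye</p></div></body></html>"


def select_text(markup, path):
    """Return the text of every element matching `path`, in document order."""
    root = ET.fromstring(markup)
    return [(el.text or "") for el in root.findall(path)]


# ElementTree needs a relative path on an element, so "//p" is written ".//p".
matches = select_text(SAMPLE, ".//p")
print(matches)
```

Both `<p>` elements match, including the one nested inside `<div>`, because `//` selects descendants at any depth.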

By default, results are returned as JSON using the Cobra convention. You can select a different convention with FORMAT:

{"URL":"http://algorithmia.com","FORMAT":"BadgerFish"}

  • Abdera: Use "attributes" for attributes, "children" for nodes
  • BadgerFish: Use "$" for text content, "@" to prefix attributes
  • Cobra: Use "attributes" for attributes (even when empty), "children" for nodes, values are strings
  • GData: Use "$t" for text content, attributes added as-is
  • Yahoo: Use "content" for text content, attributes added as-is
  • Parker: Use tail nodes for text content, ignore attributes
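The difference between conventions is easiest to see on a small element. The sketch below is based only on the one-line descriptions in the list above for BadgerFish and GData. It is not the algorithm's implementation, and details such as whitespace handling, mixed content, and repeated children are assumptions. `to_convention` is a hypothetical helper.

```python
import json
import xml.etree.ElementTree as ET


def to_convention(elem, text_key, attr_prefix):
    """Convert an element to a dict: attributes get `attr_prefix`,
    text content is stored under `text_key`."""
    node = {attr_prefix + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node[text_key] = text
    for child in elem:
        converted = to_convention(child, text_key, attr_prefix)[child.tag]
        if child.tag in node:
            # Assumption: repeated child tags are collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(converted)
        else:
            node[child.tag] = converted
    return {elem.tag: node}


root = ET.fromstring('<p class="x">hello!</p>')
badgerfish = to_convention(root, text_key="$", attr_prefix="@")
gdata = to_convention(root, text_key="$t", attr_prefix="")

print(json.dumps(badgerfish))
print(json.dumps(gdata))
```

Under these assumptions the same element yields `"@class"` and `"$"` keys for BadgerFish, and `"class"` and `"$t"` keys for GData.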