HTMLDATAEXTRACTOR
Extracts HTML from a URL, a file, or raw input; optionally runs an XPath query; and returns the content as structured JSON.
INPUT
Provide one of the following: a URL to be scraped, raw HTML, or a FILE from a data collection:
{"URL":"http://algorithmia.com"}
{"HTML":"<html><p>hello!</p></html>"}
{"FILE":"data://.my/samples/index.html"}
Any of these may be a list instead of a single value, for example:
{"URL":["http://algorithmia.com","http://example.com"]}
You may also provide an XPATH query. If specified, only items matching the expression are returned:
{"URL":"http://algorithmia.com","XPATH":"//p"}
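To make the effect of an XPATH filter concrete, here is a minimal local sketch of what an expression such as //p selects. It uses Python's standard xml.etree.ElementTree on a small well-formed sample and is not the service's own implementation; the sample markup and the select helper are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed, so the XML parser accepts it).
SAMPLE = "<html><p>hello!</p><div><p>nested</p></div></html>"


def select(root, xpath):
    """Return the elements under root that match a simple XPath expression.

    ElementTree supports only a subset of XPath and expects relative
    paths, so a leading "//" is rewritten to ".//".
    """
    if xpath.startswith("//"):
        xpath = "." + xpath
    return root.findall(xpath)


root = ET.fromstring(SAMPLE)
matches = [element.text for element in select(root, "//p")]
print(matches)
```

Both paragraphs are selected, including the one nested inside the div, because // matches at any depth. Real-world HTML is often not well-formed XML, so a dedicated HTML parser would be needed outside of this sketch.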
By default, results are returned as JSON using the Cobra convention. To use a different convention, set FORMAT to one of the values listed after this example:
{"URL":"http://algorithmia.com","FORMAT":"BadgerFish"}
- Abdera: Use "attributes" for attributes, "children" for nodes
- BadgerFish: Use "$" for text content, "@" to prefix attributes
- Cobra: Use "attributes" for attributes (even when empty), "children" for nodes, values are strings
- GData: Use "$t" for text content, attributes added as-is
- Yahoo: Use "content" for text content, attributes added as-is
- Parker: Use tail nodes for text content, ignore attributes
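The difference between conventions is easiest to see side by side. The following is a simplified sketch of two of them, BadgerFish and Parker, written by hand with the Python standard library. It is an illustration of the rules listed above, not the code this algorithm runs, and the exact output of the service may differ in details. The function names and the sample markup are assumptions.

```python
import json
import xml.etree.ElementTree as ET


def _add_child(node, tag, value):
    """Store value under tag, turning repeated tags into a list."""
    if tag in node:
        if not isinstance(node[tag], list):
            node[tag] = [node[tag]]
        node[tag].append(value)
    else:
        node[tag] = value


def badgerfish(elem):
    """BadgerFish-style: "$" holds text, "@" prefixes attribute names."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["$"] = text
    for child in elem:
        _add_child(node, child.tag, badgerfish(child)[child.tag])
    return {elem.tag: node}


def parker(elem):
    """Parker-style: leaf elements become plain text, attributes are ignored.

    In this sketch the element passed in is not wrapped under its own tag,
    so the root name does not appear in the result.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip() or None
    node = {}
    for child in children:
        _add_child(node, child.tag, parker(child))
    return node


# Hypothetical sample input with one attribute and one text node.
root = ET.fromstring('<html><p class="greeting">hello!</p></html>')
badgerfish_result = badgerfish(root)
parker_result = parker(root)
print(json.dumps(badgerfish_result))
print(json.dumps(parker_result))
```

For this sample, the BadgerFish-style result keeps the class attribute as "@class" next to the "$" text, while the Parker-style result reduces the paragraph to its text alone. Parker is therefore the most compact choice when attributes do not matter, and BadgerFish is a better fit when they must be preserved.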