util / Html2Text / 0.1.6

Table of Contents

  1. Introduction
  2. Examples
  3. Credits

Introduction

Takes a URL and extracts the text content of the page. It attempts to remove non-content text such as navigation and footer text.

Input:

  • (Required): Website URL.

Output:

  • Extracted text from website URL.

Examples

Example 1.

  • Parameter 1: Wikipedia article URL.
"https://en.wikipedia.org/wiki/Aziz_Sancar"

Output:

Aziz Sancar (born 8 September 1946) is a Turkish-American biochemist and molecular biologist specializing in DNA repair, cell cycle checkpoints, and circadian clock.[4] In 2015, he was awarded the Nobel Prize in Chemistry ... [24] Sancar is the second Turkish Nobel laureate after Orhan Pamuk, who is also an alumnus of Istanbul University.

Example 2.

  • Parameter 1: TechCrunch article URL.
"http://techcrunch.com/2015/03/12/algorithmia-launches-with-more-than-800-algorithms-on-its-marketplace/"

Output:

"Algorithmia, the startup that raised $2.4 million last August to connect academics building powerful algorithms and the app developers who could put them to use, just brought its marketplace out of private beta. More than 800 algorithms are available on the marketplace, providing the smarts needed to do various tasks in the fields of machine learning, audio and visual processing, and even computer vision. Algorithm developers can host their work on the site and charge a fee per-use to developers who integrate the algorithm into their own work. The platform encourages further additions to its library through a bounty system, which lets users request algorithms that researchers familiar with the field can contribute from their work or develop from scratch for a fee. To demonstrate the platform’s algorithm hosting tools, the Algorithmia team built a simple app using seven user-contributed algorithms that visualizes what a crawler does as it works through links to build the structure of a site."

Credits

This algorithm uses jsoup to scrape content from HTML.