Chessgecko / html2textPro / 0.2.0

Introduction

WebArticleScraper, as its name suggests, is an API that can scrape the main article out of any generic web page. It specializes in determining which text is relevant and which is not, which gives it a surprisingly low false-positive rate. As a result of its pickiness, there are occasional false negatives, so this API is recommended for users who need to scrape the vast majority of the relevant text on an article site and are OK with missing a sentence or two. Essentially, it performs better for data mining (where useless comments and advertisements should be ignored) than for consumer-facing applications, where 100% of the text must be scraped. As an additional feature, if the scraper handles a site incorrectly, there is an easy-to-use configuration page where you can explicitly tell it which information is useful and which to ignore. It can also recognize titles.
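
To make the false-positive versus false-negative trade-off concrete, here is a minimal sketch of how you might score any scraper's output against a hand-labelled copy of an article. It does not call this API. The function name `precision_recall` and the sample sentences are assumptions made up for illustration.

```python
def precision_recall(extracted, reference):
    """Compare scraped sentences against a hand-labelled reference set.

    Both arguments are lists of sentence strings (hypothetical inputs).
    """
    extracted_set = set(extracted)
    reference_set = set(reference)
    true_positives = len(extracted_set & reference_set)
    # Precision drops when irrelevant text (ads, comments) is kept,
    # i.e. when there are false positives.
    precision = true_positives / len(extracted_set) if extracted_set else 1.0
    # Recall drops when real article sentences are missed,
    # i.e. when there are false negatives.
    recall = true_positives / len(reference_set) if reference_set else 1.0
    return precision, recall


# Illustrative data: the scraper kept no junk but missed one sentence.
reference = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
extracted = ["Sentence one.", "Sentence two.", "Sentence three."]

precision, recall = precision_recall(extracted, reference)
print(precision, recall)  # 1.0 0.75
```

A picky scraper of the kind described above would aim for precision close to 1.0 while accepting a recall somewhat below 1.0.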


Examples: Test it on any web article or wiki page below!