This algorithm takes a web address and returns a summary of relevant structural details of the site. Specifically, it is intended to identify the relevant pages on a hotel website, returning selected metadata and the relative importance of various pages as measured by PageRank.
The returned information includes:
- url - the original address given, assumed to be the main page of the website.
- language - the language of the main page. See https://algorithmia.com/algorithms/nlp/LanguageIdentification for a guide to the returned language symbols.
- tags - important terms from the website.
- important pages - the pages on the site that are used for rooms, reservations/booking, photos, and location. Detection currently supports English, Spanish, Italian, German, and Portuguese.
- pageRanks - a list of pages on the site, ordered by PageRank. The higher the rank, the more likely the page is to be important.
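
To illustrate how a pageRanks-style ordering can arise, here is a minimal sketch of PageRank computed by power iteration over a small link graph. It is not this algorithm's actual implementation: the function name `page_rank`, the damping factor of 0.85, the iteration count, and the example hotel site links are all assumptions chosen for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Rank pages by power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns (page, score) pairs sorted from highest to lowest score.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives the same baseline share each round.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * ranks[page] / n
                for target in pages:
                    new_ranks[target] += share
        ranks = new_ranks

    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


# Hypothetical link structure for a small hotel website.
links = {
    "/": ["/rooms", "/booking", "/photos", "/location"],
    "/rooms": ["/", "/booking"],
    "/booking": ["/"],
    "/photos": ["/", "/rooms"],
    "/location": ["/"],
}

ranked = page_rank(links)
for page, score in ranked:
    print(f"{page}: {score:.3f}")
```

In this made-up graph the main page comes first because every other page links back to it, which matches the intuition that highly ranked pages are the ones most likely to be important.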