web / GetRecommendations / 0.4.68

This algorithm provides page recommendations for a domain. It is primarily geared toward use with the Algorithmia recommendation system, which provides more specificity, but it can also be used independently.

It takes as input a url and a required string. Only urls whose page source contains the required string are considered as recommendations. Note that the page source itself is inspected, not just the user-readable text. Examples of required strings include a long UUID embedded in the page's JavaScript, or simply the empty string "" to consider every page.

Get Recommendations maintains a permanent algorithm collection (see the documentation). When a url is sent to it, it checks whether any url from the same domain has been sent before. If not, the domain is explored for 200 urls or 2 minutes, whichever comes first, using /web/BreadthFirstSiteMap and filtering for urls whose source contains the required string. These urls are processed into recommendations as described below.
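The exploration step described above can be sketched roughly as follows. This is only an illustration, not the code behind /web/BreadthFirstSiteMap: the function name `explore_domain`, the `fetch` callable (assumed to return a page's source and its outgoing links), and the choice to count visited pages toward the 200-url limit are all assumptions.

```python
import time
from collections import deque
from urllib.parse import urlparse


def explore_domain(start_url, required, fetch, max_urls=200, max_seconds=120.0):
    """Breadth-first walk of one domain, keeping urls whose source contains `required`.

    `fetch(url)` is a hypothetical callable returning (page_source, outgoing_links).
    """
    domain = urlparse(start_url).netloc
    deadline = time.monotonic() + max_seconds
    queue = deque([start_url])
    seen = {start_url}
    visited = 0
    matches = []
    # Stop at 200 visited urls or 2 minutes, whichever comes first (assumed reading).
    while queue and visited < max_urls and time.monotonic() < deadline:
        url = queue.popleft()
        source, links = fetch(url)
        visited += 1
        # The raw page source is inspected; the empty string matches every page.
        if required in source:
            matches.append(url)
        for link in links:
            # Only follow links that stay on the same domain.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return matches
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library, so it can be exercised against an in-memory dictionary of pages.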

It also takes an optional third integer parameter, which dictates how many months in the past (relative to a given page) will be considered for recommendation. For example, if the parameter is 24, no page that is more than 24 months older than a given page will be recommended. Note, however, that recommendations are continually updated as new pages are added, and for any given page, anything published more recently than it is eligible as a recommendation, assuming some nonzero similarity. If this parameter is not supplied or is equal to -1, all pages will be considered.
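As a rough illustration of the month window, the eligibility rule might look like the sketch below. The function names and the whole-month granularity (day of month is ignored) are assumptions; the algorithm's actual date handling is not documented here.

```python
from datetime import date


def months_between(older, newer):
    """Whole calendar months from `older` to `newer`, ignoring day of month (assumed)."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def is_eligible(page_date, candidate_date, max_months=-1):
    """Return True if `candidate_date` falls inside the window for `page_date`."""
    # -1 (or a missing parameter) means every page is considered.
    if max_months is None or max_months == -1:
        return True
    # Anything published more recently than the page is always eligible.
    if candidate_date >= page_date:
        return True
    # Older pages are eligible only within `max_months` of the page.
    return months_between(candidate_date, page_date) <= max_months
```

With a window of 24, a candidate dated exactly 24 calendar months before the page is still eligible under this sketch, while one dated 25 months before is not.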

For each explored domain, the algorithm maintains a word count summary for each url and for the domain as a whole. For every new url in an existing domain, it scrapes the url for content using /web/AnalyzeURL and processes this into word statistics. It then generates keywords for each url using /nlp/KeywordsForDocumentSet and generates recommendations from these keywords using /nlp/KeywordSetSimilarity.
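To make the pipeline above concrete, here is a minimal sketch of the three stages: word counts, keywords, and keyword-set similarity. It does not reproduce /nlp/KeywordsForDocumentSet or /nlp/KeywordSetSimilarity, whose internals are not described here. As stand-ins, keywords are chosen by how over-represented a word is on the page relative to the domain, and similarity is plain Jaccard overlap; both choices, and all function names, are assumptions.

```python
import re
from collections import Counter


def word_counts(text):
    """Lower-cased word counts for one page's scraped content."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def keywords(page_counts, domain_counts, top_n=3):
    """Pick the words most over-represented on the page relative to the domain.

    Assumes `domain_counts` already includes this page's counts.
    """
    total_page = sum(page_counts.values())
    total_domain = sum(domain_counts.values())

    def score(word):
        page_share = page_counts[word] / total_page
        domain_share = domain_counts[word] / total_domain
        return page_share / domain_share

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(page_counts, key=lambda w: (-score(w), w))
    return set(ranked[:top_n])


def jaccard(a, b):
    """Stand-in similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Under this stand-in, two pages with no keywords in common score 0.0 and would not be recommended for each other, which matches the requirement of nonzero similarity.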

All recommendations are stored in a table, and any time the algorithm is queried with a url that is in this table, it returns the corresponding recommendations. Thus, the first call to a domain takes a few minutes and a call with a novel url takes up to a few seconds, but any subsequent call with the same url returns very quickly.
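The three timing cases above (new domain, novel url, known url) can be pictured with a small lookup sketch. The class name and the `explore` and `recommend` callables are hypothetical placeholders for the domain crawl and the per-url processing; the sketch also ignores the ongoing updating of existing recommendations as new pages arrive.

```python
from urllib.parse import urlparse


class RecommendationTable:
    """In-memory stand-in for the persistent recommendation table (assumed shape)."""

    def __init__(self, explore, recommend):
        self.explore = explore      # hypothetical: slow crawl of a new domain
        self.recommend = recommend  # hypothetical: per-url recommendation step
        self.table = {}
        self.domains = set()

    def get(self, url):
        # Known url: return the stored recommendations immediately.
        if url in self.table:
            return self.table[url]
        domain = urlparse(url).netloc
        # New domain: explore it once and store every discovered url.
        if domain not in self.domains:
            self.domains.add(domain)
            for found in self.explore(url):
                self.table[found] = self.recommend(found)
        # Novel url in a known domain: process just this url.
        if url not in self.table:
            self.table[url] = self.recommend(url)
        return self.table[url]
```

Only the first branch is cheap, which is why repeated calls with the same url return quickly while the first call for a domain pays for the full exploration.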