web / SiteMap / 0.1.7
Introduction
Starting from a given URL, this algorithm crawls pages within the same domain and returns a graph representing the link structure of the crawled site.
Input:
- (Required): Website URL.
- (Optional): Depth of search. (default = 2)
- (Optional): File location where the result is saved instead of being returned in the output. When this parameter is given, the output is an empty object. Useful when the output is larger than the 10 MB limit.
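The examples further down pass the input as a JSON list. A minimal sketch of building that list follows. It assumes the parameters are positional in the order listed above and that a save location must be preceded by an explicit depth. Neither point is confirmed by this README, and `build_input` is a hypothetical helper, not part of the algorithm.

```python
import json


def build_input(url, depth=None, save_to=None):
    """Build the positional input list, omitting trailing optional values.

    Assumptions (not confirmed by the README): parameters are positional in
    the order url, depth, save location, and a save location can only be
    passed together with an explicit depth.
    """
    if not url:
        raise ValueError("url is required")
    args = [url]
    if depth is not None or save_to is not None:
        # Fall back to the documented default depth of 2.
        args.append(2 if depth is None else depth)
    if save_to is not None:
        args.append(save_to)
    return args


print(json.dumps(build_input("https://algorithmia.com", 1)))
# prints ["https://algorithmia.com", 1]
```

The printed list matches the input used in Example 1.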
Output:
- An object that maps each crawled URL to the list of URLs found on that page that share its domain name.
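The README does not document the traversal itself, so the following is only a sketch of one way a depth-limited, same-domain crawl can produce such a graph. It is not the algorithm's actual implementation. It assumes that depth 1 means only the start page is fetched (consistent with the examples below, where the output has a single key) and that subdomains count as the same domain. `get_links` is a hypothetical callable that returns the raw links on a page, so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def same_site(url, root_host):
    """True if url's host is root_host or one of its subdomains (assumed rule)."""
    host = urlparse(url).hostname or ""
    return host == root_host or host.endswith("." + root_host)


def crawl(start_url, get_links, depth=2):
    """Breadth-first crawl; returns {page_url: [same-site links found on it]}.

    get_links(url) is a hypothetical callable returning the hrefs on a page.
    Assumption: depth=1 fetches only the start page, depth=2 also fetches
    the pages it links to, and so on.
    """
    root_host = urlparse(start_url).hostname
    if not root_host:
        raise ValueError("start_url must be an absolute URL")
    graph = {}
    seen = {start_url}
    queue = deque([(start_url, 1)])
    while queue:
        url, level = queue.popleft()
        links = []
        for href in get_links(url):
            # Resolve relative links and drop #fragments before comparing.
            absolute = urldefrag(urljoin(url, href)).url
            if same_site(absolute, root_host) and absolute not in links:
                links.append(absolute)
        graph[url] = links
        if level < depth:
            # Only pages above the depth limit contribute new pages to fetch.
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, level + 1))
    return graph


# Toy site used in place of real HTTP requests.
pages = {
    "https://example.com": ["/about", "https://blog.example.com/", "https://other.org/x"],
    "https://example.com/about": ["/", "/contact"],
}
print(crawl("https://example.com", lambda url: pages.get(url, []), depth=1))
```

In a real crawler, `get_links` would download the page and extract its anchor targets with an HTML parser.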
Examples
Example 1.
- Parameter 1: Algorithmia website URL.
- Parameter 2: Depth of search (1).
["https://algorithmia.com", 1]
Output:
{
  "http://algorithmia.com": [
    "http://developers.algorithmia.com",
    "https://algorithmia.com/terms",
    "https://algorithmia.com/algorithms/TimeSeries/OutlierDetection",
    "https://algorithmia.com/algorithms/opencv/FaceDetection",
    .....
    "https://algorithmia.com/about",
    "https://algorithmia.com/algodev",
    "https://algorithmia.com/algorithms/util/Url2Text",
    "https://algorithmia.com/privacy"
  ]
}
Example 2.
- Parameter 1: Archive.org website URL.
- Parameter 2: Depth of search (1).
["http://archive.org", 1]
Output:
{
  "http://archive.org": [
    "https://archive.org/details/library_of_congress",
    "https://archive.org/web/",
    "https://archive.org/details/microfilm",
    "https://blog.archive.org/category/announcements/",
    .....
    "https://archive.org/details/television",
    "https://archive.org/details/internetarcade",
    "https://archive.org/details/animationandcartoons",
    "https://archive.org/details/netlabels"
  ]
}
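The outputs above are objects whose keys are crawled pages and whose values are lists of links. As a sketch of how such a result might be consumed, the hypothetical helpers below flatten the graph into a unique URL list and into (source, target) pairs. The sample data is abridged from Example 1 and omits the elided entries.

```python
import json


def unique_urls(graph):
    """Return every URL in the graph (keys and link targets), sorted, without duplicates."""
    urls = set(graph)
    for links in graph.values():
        urls.update(links)
    return sorted(urls)


def edges(graph):
    """Return the graph as a list of (source, target) pairs."""
    return [(src, dst) for src, links in graph.items() for dst in links]


# Abridged from Example 1; the full output contains more links.
sample = json.loads("""
{
  "http://algorithmia.com": [
    "https://algorithmia.com/terms",
    "https://algorithmia.com/about",
    "https://algorithmia.com/privacy"
  ]
}
""")

print(len(unique_urls(sample)))  # prints 4
print(edges(sample)[0])
```

The edge list form is convenient for loading the result into a graph library or a spreadsheet.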
Credit
This algorithm was built using the jsoup library.