web / SiteMap / 0.1.7

Table of Contents

  1. Introduction
  2. Examples
  3. Credit

Introduction

Starting from a given URL, the algorithm crawls pages within the same domain and returns a graph representing the link structure of the crawled site.
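The crawl described above can be illustrated with a minimal Python sketch. It is not the algorithm's actual implementation (which uses a Java library); it only shows a depth-limited, breadth-first traversal under stated assumptions. The `get_links` callable is hypothetical and stands in for fetching a page and extracting its links, a depth of 1 is assumed to mean "visit only the start page", and "same domain" is assumed to include subdomains of the start URL's host.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def same_site(host, root):
    """Assumption: a host is on the same site if it equals root or is a subdomain of it."""
    return host == root or host.endswith("." + root)


def crawl(start_url, get_links, depth=2):
    """Breadth-first crawl restricted to the start URL's domain.

    get_links is a hypothetical callable: page URL -> iterable of hrefs.
    Returns {page_url: [same-site links found on that page]}.
    """
    root = urlparse(start_url).hostname or ""
    graph = {}
    queue = deque([(start_url, 1)])  # (url, level); the start page is level 1
    while queue:
        url, level = queue.popleft()
        if url in graph:  # already visited
            continue
        links = []
        for href in get_links(url):
            absolute = urljoin(url, href)  # resolve relative links
            host = urlparse(absolute).hostname or ""
            if same_site(host, root) and absolute not in links:
                links.append(absolute)
        graph[url] = links
        if level < depth:  # only follow links while within the depth limit
            queue.extend((link, level + 1) for link in links)
    return graph
```

Passing the link extractor as a parameter keeps the traversal logic separate from network access, so the sketch can be exercised with an in-memory dictionary of pages.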

Input:

  • (Required): Website URL.
  • (Optional): Depth of search (default: 2).
  • (Optional): File location to save the result to instead of listing it in the output. When this parameter is given, the output is an empty object. Useful when the output is larger than the 10 MB limit.

Output:

  • An object that maps each crawled URL to the list of URLs (with the same domain name) extracted from that page.
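The examples in this README pass the input as a JSON array of the form `[url, depth]`. The following Python sketch builds such an array. The helper name `build_input` is hypothetical, and placing the file location in the third position is an assumption; this README does not state where that parameter goes or what form the location takes.

```python
import json

DEFAULT_DEPTH = 2  # default depth stated in the parameter list


def build_input(url, depth=None, output_path=None):
    """Return the JSON text for a call (positional layout is an assumption)."""
    if not url:
        raise ValueError("a website URL is required")
    payload = [url]
    # Depth must be present if a later positional parameter follows it.
    if depth is not None or output_path is not None:
        payload.append(DEFAULT_DEPTH if depth is None else depth)
    if output_path is not None:
        payload.append(output_path)  # assumed third position
    return json.dumps(payload)


print(build_input("https://algorithmia.com", 1))
```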

Examples

Example 1.

  • Parameter 1: Algorithmia website URL.
  • Parameter 2: Depth of search (1).
["https://algorithmia.com", 1]

Output:

{
  "http://algorithmia.com": [
    "http://developers.algorithmia.com",
    "https://algorithmia.com/terms",
    "https://algorithmia.com/algorithms/TimeSeries/OutlierDetection",
    "https://algorithmia.com/algorithms/opencv/FaceDetection",
    .....
    "https://algorithmia.com/about",
    "https://algorithmia.com/algodev",
    "https://algorithmia.com/algorithms/util/Url2Text",
    "https://algorithmia.com/privacy"
  ]
}

Example 2.

  • Parameter 1: Archive.org website URL.
  • Parameter 2: Depth of search (1).
["http://archive.org", 1]

Output:

{
  "http://archive.org": [
    "https://archive.org/details/library_of_congress",
    "https://archive.org/web/",
    "https://archive.org/details/microfilm",
    "https://blog.archive.org/category/announcements/",
    .....
    "https://archive.org/details/television",
    "https://archive.org/details/internetarcade",
    "https://archive.org/details/animationandcartoons",
    "https://archive.org/details/netlabels"
  ]
}
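Once a result like the ones above has been parsed into a dictionary, it can be post-processed with a few lines of Python. The sketch below is only an illustration: the helper names are hypothetical, and the sample data is a short excerpt of the Example 2 output, not the full result.

```python
from urllib.parse import urlparse


def unique_urls(graph):
    """Return every URL in the graph (crawled pages and their links), sorted."""
    seen = set(graph)
    for links in graph.values():
        seen.update(links)
    return sorted(seen)


def count_by_host(graph):
    """Count the linked URLs per host name."""
    counts = {}
    for links in graph.values():
        for link in links:
            host = urlparse(link).hostname or ""
            counts[host] = counts.get(host, 0) + 1
    return counts


# Excerpt of the Example 2 output (truncated for illustration).
sample = {
    "http://archive.org": [
        "https://archive.org/details/library_of_congress",
        "https://archive.org/web/",
        "https://blog.archive.org/category/announcements/",
    ]
}
print(count_by_host(sample))
```

Grouping by host makes it easy to see how many of the returned links point to subdomains such as `blog.archive.org` rather than the main site.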

Credit

The algorithm was built using the jsoup library.