Scoping a crawl

Keep a clone inside the lines with depth, page, prefix, subdomain, and exclude controls.

By default kage crawls every in-scope page it can reach from the seed, staying on the seed's host. On a large site that can be a lot of pages. These flags bound the crawl.

Limit by count and depth

# Stop after 200 pages
kage clone example.com --max-pages 200

# Follow links at most three hops from the seed
kage clone example.com --max-depth 3

--max-depth 0 (the default) means unlimited depth; --max-pages 0 means unlimited pages. Combine them to put a hard ceiling on a run.
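
To make the interaction of the two limits concrete, here is a minimal Python sketch of a breadth-first crawl with a depth bound and a page bound. It is an illustration, not kage's implementation: the function name bounded_crawl, the in-memory links graph, and the choice to count the seed as a page at depth 0 are all assumptions.

```python
from collections import deque

def bounded_crawl(links, seed, max_depth=0, max_pages=0):
    """Breadth-first walk over a link graph (hypothetical helper).

    links maps a page to the pages it links to.
    A limit of 0 means unlimited, mirroring the flag semantics described above.
    Assumption: the seed is depth 0 and counts toward max_pages.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth and depth >= max_depth:
            continue  # page is at the depth limit, so its links are not followed
        for target in links.get(page, []):
            if target in seen:
                continue
            if max_pages and len(visited) >= max_pages:
                return visited  # page ceiling reached, stop the whole run
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited

# A tiny made-up site: / links to /a and /b, /a links deeper.
SITE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/b": []}
```

With both limits set, whichever is reached first ends the crawl, which is why combining them gives a hard ceiling.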

Limit by path

To clone just one section of a site, restrict the crawl to a path prefix:

kage clone example.com --scope-prefix /docs

Only pages whose path starts with /docs are followed. Assets are still fetched from wherever the page references them, so the section renders correctly.

To skip parts of a site, exclude path prefixes (repeatable):

kage clone example.com --exclude /archive --exclude /tags
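
The prefix and exclude rules can be pictured as a filter on each page URL's path. The Python sketch below assumes a literal "starts with" comparison, as the text describes; the function name path_in_scope is hypothetical, and kage's exact matching (for example, whether /docs also matches /docsify) may differ. It covers pages to follow only, not assets.

```python
from urllib.parse import urlsplit

def path_in_scope(url, scope_prefix="", excludes=()):
    """Return True if a page URL passes the prefix and exclude filters.

    Hypothetical helper; assumes plain string-prefix matching on the path.
    """
    path = urlsplit(url).path or "/"  # a bare host has an empty path
    if scope_prefix and not path.startswith(scope_prefix):
        return False  # outside the requested section
    # Reject the page if any excluded prefix matches.
    return not any(path.startswith(prefix) for prefix in excludes)
```

Under this assumption, a literal prefix of /docs would also admit /docsify; a trailing slash (/docs/) would be the stricter choice if that matters for your site.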

Subdomains

By default a clone stays on the exact seed host. To treat subdomains of the seed as in scope, add --subdomains:

kage clone example.com --subdomains

Now blog.example.com and docs.example.com are crawled too, each landing under its own host directory inside the mirror.
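
As a rough picture of the host check, the Python sketch below treats a host as in scope when it equals the seed host or, with subdomains enabled, ends with a dot followed by the seed host. The function name host_in_scope is hypothetical and the rule is an assumption about how such a check is commonly written, not a description of kage's code.

```python
def host_in_scope(host, seed_host, subdomains=False):
    """Return True if host is the seed host or, optionally, one of its subdomains.

    Hypothetical helper; comparison is case-insensitive and ignores a trailing dot.
    """
    host = host.lower().rstrip(".")
    seed_host = seed_host.lower().rstrip(".")
    if host == seed_host:
        return True
    # Require a leading dot so that notexample.com does not match example.com.
    return subdomains and host.endswith("." + seed_host)
```

The leading dot in the suffix test is the important detail: a bare suffix comparison would wrongly accept unrelated hosts that merely end in the same characters.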

Politeness

kage honours robots.txt and seeds the crawl from sitemap.xml by default. If you are cloning a site you control, or you have a good reason to ignore the robots rules, you can turn both behaviours off, but do so responsibly:

kage clone example.com --no-robots --no-sitemap
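
For a sense of what honouring robots.txt involves, the sketch below uses Python's standard urllib.robotparser to check a URL against a set of rules. The rules, the helper name allowed_by_robots, and the user-agent string "kage" are made up for illustration; how kage itself parses and applies robots.txt is not specified here.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="kage"):
    """Return True if the given robots.txt text permits fetching url.

    Hypothetical helper; the default user-agent string is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)

# Made-up rules: every crawler is asked to stay out of /private.
ROBOTS = """\
User-agent: *
Disallow: /private
"""
```

A crawler that honours these rules would skip anything under /private; ignoring robots.txt removes that check entirely, which is why it is best reserved for sites you control.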

Lazy-loaded media

On sites that load images as you scroll, only the above-the-fold images are captured unless you tell kage to scroll each page:

kage clone example.com --scroll

This makes each render a little slower but captures media that only loads on view.