The crawl utility starts a depth-first traversal of the web at the
specified URLs. It stores all JPEG images that match the configured
constraints. Crawl is fairly fast and allows for graceful termination.
After terminating crawl, it is possible to restart it at exactly
the same spot where it was terminated. Crawl keeps a persistent
database that allows multiple crawls without revisiting sites.
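
The following is a minimal sketch, in C, of the idea described above:
a depth-first traversal that records every visited URL in a persistent
database, so a restarted crawl resumes without revisiting sites. This
is not crawl's actual code; the fetch step is stubbed out, and a flat
text file named visited.db stands in for the real database (both names
are hypothetical).

    /*
     * Sketch only: depth-first crawl with a persistent visited set.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_URLS 4096
    #define DB_FILE  "visited.db"   /* hypothetical database file */

    static char *visited[MAX_URLS];
    static size_t nvisited;

    /* Return nonzero if url was seen in this run or a previous one. */
    static int seen(const char *url)
    {
        size_t i;

        for (i = 0; i < nvisited; i++)
            if (strcmp(visited[i], url) == 0)
                return 1;
        return 0;
    }

    /* Record url in memory and append it to disk, so a restarted
     * crawl picks up exactly where this one stopped. */
    static void mark_visited(const char *url, FILE *db)
    {
        if (nvisited < MAX_URLS)
            visited[nvisited++] = strdup(url);
        fprintf(db, "%s\n", url);
        fflush(db);
    }

    /* Load the URLs recorded by earlier runs, if any. */
    static void load_db(void)
    {
        char line[2048];
        FILE *fp;

        if ((fp = fopen(DB_FILE, "r")) == NULL)
            return;
        while (fgets(line, sizeof(line), fp) && nvisited < MAX_URLS) {
            line[strcspn(line, "\n")] = '\0';
            visited[nvisited++] = strdup(line);
        }
        fclose(fp);
    }

    /* Stub for the real work: fetch the page, save matching JPEGs,
     * and return a NULL-terminated array of links found on it. */
    static char **fetch_and_extract_links(const char *url)
    {
        printf("fetching %s\n", url);
        return NULL;
    }

    /* Depth-first: follow each link fully before its siblings. */
    static void crawl_dfs(const char *url, FILE *db)
    {
        char **links;
        size_t i;

        if (seen(url))
            return;
        mark_visited(url, db);
        if ((links = fetch_and_extract_links(url)) == NULL)
            return;
        for (i = 0; links[i] != NULL; i++) {
            crawl_dfs(links[i], db);
            free(links[i]);
        }
        free(links);
    }

    int main(int argc, char *argv[])
    {
        FILE *db;
        int i;

        load_db();
        if ((db = fopen(DB_FILE, "a")) == NULL) {
            perror(DB_FILE);
            return 1;
        }
        for (i = 1; i < argc; i++)
            crawl_dfs(argv[i], db);
        fclose(db);
        return 0;
    }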

The main reason for writing crawl was the lack of simple open source
web crawlers. Crawl is only a few thousand lines of code and fairly
easy to debug and customize.

Some of the main features:
- Saves encountered JPEG images
- Image selection based on regular expressions and size constraints
  (see the sketch after this list)
- Resume previous crawl after graceful termination
- Persistent database of visited URLs
- Very small and efficient code
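
The sketch below shows one way selection by regular expression and
size constraints can be implemented, using POSIX regex(3). The pattern
and the size bounds are invented examples, not crawl's defaults.

    /*
     * Hypothetical example of image selection by regular expression
     * and size constraints; the pattern and bounds below are made up.
     */
    #include <sys/types.h>
    #include <regex.h>
    #include <stdio.h>

    /* Keep an image only if its URL matches pattern and its size in
     * bytes falls within [minsize, maxsize]. */
    static int want_image(const char *url, size_t size,
        const char *pattern, size_t minsize, size_t maxsize)
    {
        regex_t re;
        int match;

        if (size < minsize || size > maxsize)
            return 0;
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
            return 0;
        match = regexec(&re, url, 0, NULL, 0) == 0;
        regfree(&re);
        return match;
    }

    int main(void)
    {
        /* Only JPEGs under /photos/, between 10 KB and 2 MB. */
        const char *pat = "/photos/.*\\.jpe?g$";

        printf("%d\n", want_image("http://example.com/photos/cat.jpg",
            50000, pat, 10240, 2097152));   /* 1: kept */
        printf("%d\n", want_image("http://example.com/icons/dot.jpg",
            50000, pat, 10240, 2097152));   /* 0: rejected */
        return 0;
    }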