crawler

A Web crawler.

Starts from a given URL and crawls web pages up to a specified depth.
Saves pages that contain a keyword (if one is provided) to an SQLite database.
Supports multi-threading.
Supports logging.
Supports self-testing.
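
The crawl-and-save behaviour described above can be sketched roughly as follows. This is a minimal single-threaded illustration, not the code in crawler.py: the names `crawl`, `save_pages`, `LINK_RE`, and the `pages` table are hypothetical, the `fetch` callable is injected so the sketch runs without network access, and it assumes that depth 0 means "only the start page".

```python
import re
import sqlite3
from collections import deque
from urllib.parse import urljoin

# Naive link extractor; a real crawler would likely use an HTML parser.
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def crawl(start_url, depth, fetch, keyword=None):
    """Breadth-first crawl from start_url, following links up to `depth` hops.

    `fetch` is any callable mapping a URL to its HTML text (or None on failure).
    Returns {url: html} for pages containing `keyword`, or all pages if no
    keyword is given. Assumption: depth 0 visits only the start page.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    saved = {}
    while queue:
        url, level = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        if keyword is None or keyword in html:
            saved[url] = html
        if level >= depth:
            continue  # do not follow links beyond the requested depth
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return saved


def save_pages(conn, pages):
    """Store pages in a hypothetical `pages` table on an open SQLite connection."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
        list(pages.items()),
    )
    conn.commit()
```

Passing `fetch` in as a parameter keeps the traversal logic separate from the HTTP layer, so the same function can be driven by a real downloader or by a dictionary of canned pages in a self-test.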

usage

main.py [-h] -u URL -d DEPTH [--logfile FILE] [--loglevel {1,2,3,4,5}]
               [--thread NUM] [--dbfile FILE] [--key KEYWORD] [--testself]

optional arguments:

  -h, --help            show this help message and exit
  -u URL                Specify the start URL
  -d DEPTH              Specify the crawling depth
  --logfile FILE        The log file path. Default: spider.log
  --loglevel {1,2,3,4,5}
                        The level of logging detail. Larger numbers record
                        more details. Default: 3
  --thread NUM          The number of threads. Default: 10
  --dbfile FILE         The SQLite file path. Default: data.sql
  --key KEYWORD         The keyword for crawling. Default: None. For more than
                        one word, quote them. Example: --key 'Hello world'
  --testself            Crawler self test
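
For readers who want to see how such an interface might be declared, here is a sketch of an `argparse` parser that accepts the options listed above with the same defaults. It is not the contents of options.py: `build_parser`, the `dest` names, and the `LOG_LEVELS` mapping from 1..5 to `logging` levels are assumptions made for illustration (the README only says that larger numbers record more detail).

```python
import argparse
import logging

# Assumed mapping: a larger number means a more verbose logging level.
LOG_LEVELS = {
    1: logging.CRITICAL,
    2: logging.ERROR,
    3: logging.WARNING,
    4: logging.INFO,
    5: logging.DEBUG,
}


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(prog="main.py", description="A Web crawler.")
    parser.add_argument("-u", dest="url", metavar="URL", required=True,
                        help="start URL")
    parser.add_argument("-d", dest="depth", metavar="DEPTH", type=int,
                        required=True, help="crawling depth")
    parser.add_argument("--logfile", metavar="FILE", default="spider.log",
                        help="log file path")
    parser.add_argument("--loglevel", type=int, choices=[1, 2, 3, 4, 5],
                        default=3, help="logging detail level")
    parser.add_argument("--thread", metavar="NUM", type=int, default=10,
                        help="number of threads")
    parser.add_argument("--dbfile", metavar="FILE", default="data.sql",
                        help="SQLite file path")
    parser.add_argument("--key", metavar="KEYWORD", default=None,
                        help="keyword to look for in pages")
    parser.add_argument("--testself", action="store_true",
                        help="run the self test")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["-u", "http://example.com", "-d", "2", "--key", "Hello world"]
)
level = LOG_LEVELS[args.loglevel]
```

Here `http://example.com` is only a placeholder URL. On a shell, the multi-word keyword would be quoted as shown in the option description, so that it arrives as a single argument.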
