- Start from a given URL and crawl web pages to a specified depth.
- Save pages that contain a keyword (if provided) into a database (a minimal sketch of this loop follows the list).
- Support multi-threading.
- Support logging.
- Support self-testing.
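The crawl described by the first two items can be pictured as a breadth-first walk over links. The following is a minimal single-threaded sketch, not the code in main.py; the seed URL, depth, and keyword are made-up values, and the real crawler additionally uses a thread pool and SQLite storage:

    import re
    import urllib.request
    from collections import deque

    def crawl(seed, depth, keyword=None):
        """Breadth-first crawl from `seed`, following links up to `depth`
        levels, returning URLs of pages whose HTML contains `keyword`."""
        seen, hits = {seed}, []
        queue = deque([(seed, 0)])
        while queue:
            url, level = queue.popleft()
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue  # skip unreachable pages
            if keyword is None or keyword in html:
                hits.append(url)  # the real crawler would store the page in SQLite here
            if level < depth:
                for link in re.findall(r'href="(https?://[^"]+)"', html):
                    if link not in seen:
                        seen.add(link)
                        queue.append((link, level + 1))
        return hits

    print(crawl("https://www.douban.com", 1, "group"))  # illustrative arguments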
The original code has since been modified; it now crawls information about Douban groups, including the group ID, creation time, and member count:
- Crawl the group's ID (GID).
- Using the obtained group ID, crawl the group's discussion list and save it to data/GID.txt.
- From the discussion list, crawl each topic's comments and generate the comment structure, saved to data/GID/topic_id.txt and structure/GID/topic_id.txt.
- Visualize each comment tree as image/GID/topic_id.jpg (a text-only sketch of reading the structure files follows this list).
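The on-disk format of the structure files is not specified above. Assuming each line of structure/GID/topic_id.txt holds a `comment_id parent_id` pair, with parent `0` marking a top-level comment (a hypothetical format), the tree can be walked like this, printing an indented text tree rather than the JPEG the project produces:

    from collections import defaultdict

    def print_tree(path):
        """Print a comment tree from a structure file, assuming each line
        holds 'comment_id parent_id' and parent '0' means top level."""
        children = defaultdict(list)
        with open(path) as f:
            for line in f:
                comment_id, parent_id = line.split()
                children[parent_id].append(comment_id)

        def walk(node, indent=0):
            for comment_id in children[node]:
                print("  " * indent + comment_id)
                walk(comment_id, indent + 1)

        walk("0")

    print_tree("structure/GID/topic_id.txt")  # hypothetical GID and topic_id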
main.py [-h] -u URL -d DEPTH [--logfile FILE] [--loglevel {1,2,3,4,5}]
[--thread NUM] [--dbfile FILE] [--key KEYWORD] [--testself]
-h, --help            show this help message and exit
-u URL                Specify the starting URL
-d DEPTH              Specify the crawling depth
--logfile FILE        The log file path. Default: spider.log
--loglevel {1,2,3,4,5}
                      The level of logging detail; a larger number records
                      more detail. Default: 3
--thread NUM          The number of threads. Default: 10
--dbfile FILE         The SQLite file path. Default: data.sql
--key KEYWORD         The keyword to crawl for. Default: None. For more than
                      one word, quote the phrase, e.g. --key 'Hello world'
--testself            Run the crawler's self-test
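For example, to crawl www.douban.com two levels deep with 5 threads while matching the phrase 'Hello world' (all values here are illustrative):

    main.py -u http://www.douban.com -d 2 --thread 5 --key 'Hello world'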