Crawls all the Chinese data from Baidu Baike, including the title and summary of every entry.
It was built because the Chinese entries on Wikipedia carry too little information.
It is an advanced version of "baike_spider" that greedily crawls all data and supports kill-and-continue checkpoints.
Suggestions and discussion are definitely welcome :)
A Chinese-corpus crawler for Baidu Baike
that can crawl the title and summary of every entry!
This is a version modified from "baike_spider",
enhanced with:
1. The ability to crawl ALL entries.
2. Checkpointing, so that the huge number of entries can be crawled across separate runs.
3. Batched saving that clears RAM between batches, to avoid hogging resources.
Discussion and advice are very welcome :)
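The kill-and-continue checkpointing in point 2 could be sketched roughly as below. This is a hypothetical illustration, not the repository's actual code: the function names, the pickle file format, and the pending/visited split are all assumptions.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of a kill-and-continue checkpoint: the crawler
# periodically pickles its frontier (pending URLs) and visited set, so a
# killed run can resume where it left off. Names are illustrative only.
import pickle

def save_checkpoint(path, pending, visited):
    """Persist crawl state so the process can be killed and resumed later."""
    with open(path, 'wb') as f:
        pickle.dump({'pending': list(pending), 'visited': list(visited)}, f)

def load_checkpoint(path):
    """Restore crawl state; returns (pending list, visited set)."""
    with open(path, 'rb') as f:
        state = pickle.load(f)
    return state['pending'], set(state['visited'])
```

On resume, the crawler would call `load_checkpoint` instead of starting from the seed URL, which is presumably what the `-load PATH` flag selects.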
Requirements: Python 2.7, urllib2, BeautifulSoup (bs4)

Setup: mkdir txt;mkdir urls

Usage: python locusts_main [-new] [-load PATH]
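The batched saving described above (point 3) might look like the following sketch: entries are buffered in memory and flushed to numbered files under txt/ once the buffer fills, then the buffer is cleared to release RAM. The class name, file naming scheme, and tab-separated record format are assumptions for illustration, not the repository's actual implementation.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of batched saving: buffer (title, summary) pairs in
# memory, write them to a numbered file in out_dir when the buffer reaches
# batch_size, then drop the buffer so RAM is freed between batches.
import io
import os

class BatchWriter(object):
    def __init__(self, out_dir, batch_size=1000):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffer = []
        self.batch_no = 0

    def add(self, title, summary):
        # One tab-separated record per entry (illustrative format).
        self.buffer.append(u'%s\t%s\n' % (title, summary))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, 'batch_%05d.txt' % self.batch_no)
        with io.open(path, 'w', encoding='utf-8') as f:
            f.writelines(self.buffer)
        self.buffer = []  # release the RAM held by buffered entries
        self.batch_no += 1
```

A final `flush()` before exit (or before saving a checkpoint) would make sure no buffered entries are lost when the process is killed.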