Crawl and graze all the data from Baidu Baike. It is an advanced version of "baike_spider", with greedy collection of all data and kill-and-continue checkpointing.

AngusKung/baidu_locusts

baidu_locusts

Crawl and graze all the Chinese data from Baidu Baike, including the title and summary of every entry.
It was made because the Chinese entries on Wikipedia contain too little information.
Furthermore, it is an advanced version of "baike_spider", with greedy collection of all data and kill-and-continue checkpoints.
Suggestions and discussion are definitely welcome :)
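To illustrate what "title and summary" extraction might look like, here is a minimal sketch using Python's stdlib html.parser. The actual project uses BeautifulSoup, and the `h1`/`lemma-summary` selectors are assumptions, not taken from the baidu_locusts source:

```python
from html.parser import HTMLParser

class LemmaExtractor(HTMLParser):
    """Collect an entry's title (<h1>) and summary (a div with a
    'lemma-summary'-style class); both selectors are assumptions."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.summary = ""
        self._in_title = False
        self._summary_depth = 0  # >0 while inside the summary div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._in_title = True
        if self._summary_depth:
            self._summary_depth += 1      # nested tag inside the summary
        elif tag == "div" and "lemma-summary" in attrs.get("class", ""):
            self._summary_depth = 1       # summary div opens here

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False
        if self._summary_depth:
            self._summary_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._summary_depth:
            self.summary += data

# Tiny stand-in for a fetched Baidu Baike page.
page = '<h1>词条</h1><div class="lemma-summary"><p>摘要文字。</p></div>'
parser = LemmaExtractor()
parser.feed(page)
print(parser.title, parser.summary.strip())
```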

A Chinese-corpus crawler for Baidu Baike
that can fetch the title and summary of every entry!
This is a version modified from "baike_spider",
with the following improvements:
1. The ability to crawl "all" entries.
2. Checkpoint saving, so the very large number of entries can be crawled across multiple sessions.
3. Saving to disk in batches and clearing RAM, to avoid hogging resources.
Discussion and feedback are very welcome :)
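The kill-and-continue checkpoints described above could be sketched roughly as follows. The JSON layout and file name are assumptions for illustration, not the project's actual on-disk format:

```python
import json
import os
import tempfile

class Checkpoint:
    """Persist the crawl frontier so a killed run can resume later;
    the JSON structure here is an assumption, not baidu_locusts' format."""
    def __init__(self, path):
        self.path = path
        self.todo = []      # URLs not yet crawled
        self.done = set()   # URLs already crawled

    def save(self):
        with open(self.path, "w") as f:
            json.dump({"todo": self.todo, "done": sorted(self.done)}, f)

    def load(self):
        with open(self.path) as f:
            state = json.load(f)
        self.todo = state["todo"]
        self.done = set(state["done"])

# Simulate a run that is killed after one entry, then resumed.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
cp = Checkpoint(path)
cp.todo = ["/item/a", "/item/b"]
cp.done.add(cp.todo.pop(0))   # crawl the first entry
cp.save()                     # state is on disk before the "kill"

resumed = Checkpoint(path)
resumed.load()                # a resumed run picks up here
print(resumed.todo)
```

Between checkpoints, results could similarly be flushed to disk in batches and the in-memory buffer cleared, which is how item 3 keeps RAM usage bounded.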

Requirements

Python 2.7
urllib2
BeautifulSoup (bs4)

Usage

mkdir txt;mkdir urls
python locusts_main [-new] [-load PATH]
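The two flags presumably select between starting a fresh crawl and resuming from a saved checkpoint. A minimal argument-parsing sketch of that usage line (the option semantics are assumed, not taken from locusts_main):

```python
import argparse

def build_parser():
    # Flags mirror the usage line above; their exact meaning is assumed.
    parser = argparse.ArgumentParser(prog="locusts_main")
    parser.add_argument("-new", action="store_true", dest="new",
                        help="start a fresh crawl from scratch")
    parser.add_argument("-load", metavar="PATH", dest="load",
                        help="resume from a previously saved checkpoint")
    return parser

args = build_parser().parse_args(["-load", "urls/checkpoint.json"])
print(args.new, args.load)   # False urls/checkpoint.json
```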
