Crawls all the Chinese data from Baidu Baike, including the title and summary of every entry.
It was built because the Chinese entries on Wikipedia carry too little information.
It is an advanced version of "baike_spider" that greedily crawls all data and supports kill-and-continue checkpoints.
Suggestions and discussion are definitely welcome :)
A Chinese-corpus crawler for Baidu Baike
that can crawl the title and summary of every entry!
This is a version modified from "baike_spider",
enhanced with:
1. The ability to crawl ALL entries.
2. Checkpointing, so that the huge number of entries can be crawled across separate runs.
3. Batched saving that clears RAM between batches, to avoid hogging resources.
Discussion and advice are very welcome :)
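The kill-and-continue checkpointing in point 2 could be sketched roughly as below. This is a hypothetical illustration, not the repository's actual code: the function names, the pickle file format, and the pending/visited split are all assumptions.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of a kill-and-continue checkpoint: the crawler
# periodically pickles its frontier (pending URLs) and visited set, so a
# killed run can resume where it left off. Names are illustrative only.
import pickle

def save_checkpoint(path, pending, visited):
    """Persist crawl state so the process can be killed and resumed later."""
    with open(path, 'wb') as f:
        pickle.dump({'pending': list(pending), 'visited': list(visited)}, f)

def load_checkpoint(path):
    """Restore crawl state; returns (pending list, visited set)."""
    with open(path, 'rb') as f:
        state = pickle.load(f)
    return state['pending'], set(state['visited'])
```

On resume, the crawler would call `load_checkpoint` instead of starting from the seed URL, which is presumably what the `-load PATH` flag selects.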
Requirements: Python 2.7, urllib2, BeautifulSoup (bs4)

Setup: mkdir txt;mkdir urls

Usage: python locusts_main [-new] [-load PATH]
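The batched saving described above (point 3) might look like the following sketch: entries are buffered in memory and flushed to numbered files under txt/ once the buffer fills, then the buffer is cleared to release RAM. The class name, file naming scheme, and tab-separated record format are assumptions for illustration, not the repository's actual implementation.

```python
# -*- coding: utf-8 -*-
# Hypothetical sketch of batched saving: buffer (title, summary) pairs in
# memory, write them to a numbered file in out_dir when the buffer reaches
# batch_size, then drop the buffer so RAM is freed between batches.
import io
import os

class BatchWriter(object):
    def __init__(self, out_dir, batch_size=1000):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffer = []
        self.batch_no = 0

    def add(self, title, summary):
        # One tab-separated record per entry (illustrative format).
        self.buffer.append(u'%s\t%s\n' % (title, summary))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, 'batch_%05d.txt' % self.batch_no)
        with io.open(path, 'w', encoding='utf-8') as f:
            f.writelines(self.buffer)
        self.buffer = []  # release the RAM held by buffered entries
        self.batch_no += 1
```

A final `flush()` before exit (or before saving a checkpoint) would make sure no buffered entries are lost when the process is killed.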