Parsing_WUBA

This program parses the listings at http://bj.58.com/sale.shtml

channel_extract.py

Extracts the link of each channel.
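As a rough illustration of the link-extraction step, the sketch below collects every anchor link from a page and resolves it against the start URL. It is not the code in channel_extract.py: it uses the standard library's `html.parser` as a stand-in for whatever parser the script uses, and `ChannelLinkParser`, `extract_channel_links`, the sample markup, and the `/diannao/` path are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

START_URL = "http://bj.58.com/sale.shtml"  # start page named in this README


class ChannelLinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative paths such as /shouji/ into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_channel_links(html, base_url=START_URL):
    """Return the unique links found in html, in order of first appearance."""
    parser = ChannelLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


# Hypothetical markup standing in for the real category menu.
SAMPLE_HTML = """
<ul>
  <li><a href="/shouji/">Phones</a></li>
  <li><a href="/diannao/">Computers</a></li>
  <li><a href="/shouji/">Phones (duplicate)</a></li>
</ul>
"""

channel_links = extract_channel_links(SAMPLE_HTML)
print(channel_links)
```

On the real page the anchors would probably need to be narrowed to the category menu with a more specific selector, which this sketch does not attempt.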

page_parsing.py

Gets the links of all items in a channel (for example, every item link under http://bj.58.com/shouji/) and stores them in MongoDB.
Using the stored item links, fetches the basic information of each item and stores it in MongoDB.

main.py

Calls channel_extract and page_parsing.

counts.py

Counts the number of documents in the item_list and item_info collections in the database.
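A minimal sketch of such a count, assuming the collections are PyMongo `Collection` objects (which provide `count_documents`). The helper `count_collections` and the in-memory `FakeCollection` are hypothetical and exist only so the example runs without a MongoDB server; the database name in the comment is also an assumption.

```python
def count_collections(item_list, item_info):
    """Return the number of documents in each of the two collections."""
    return {
        "item_list": item_list.count_documents({}),
        "item_info": item_info.count_documents({}),
    }


class FakeCollection:
    """In-memory stand-in for a PyMongo collection (illustration only)."""

    def __init__(self, docs):
        self.docs = docs

    def count_documents(self, query):
        # This stand-in only supports the empty filter used above.
        if query != {}:
            raise NotImplementedError("only the empty filter is supported")
        return len(self.docs)


counts = count_collections(
    FakeCollection([{"url": "a"}, {"url": "b"}]),
    FakeCollection([{"title": "x"}]),
)
print(counts)

# With a running server, the real collections could be passed instead:
#   from pymongo import MongoClient
#   db = MongoClient("localhost", 27017)["wuba"]  # database name is assumed
#   print(count_collections(db["item_list"], db["item_info"]))
```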

visualisation page

Uses the charts package to generate visualizations. It also includes some MongoDB operations:

db.collection.update(), e.g. item_info.update({'_id': i['_id']}, {'$set': {'area': area}}) - sets the area field of the document with the matching _id (update_database)
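The update call above can be sketched as follows. `Collection.update()` was deprecated in PyMongo 3.0 and removed in 4.0, so the commented usage shows `update_one` as the replacement. The helpers `build_area_update` and `apply_set`, and the sample document, are hypothetical; `apply_set` only mimics what `$set` does so the example runs without a server.

```python
def build_area_update(doc, area):
    """Build the (filter, update) pair that sets `area` on one document."""
    filter_doc = {"_id": doc["_id"]}
    update_doc = {"$set": {"area": area}}
    return filter_doc, update_doc


def apply_set(doc, update_doc):
    """Mimic $set locally: overwrite the listed fields, keep all others."""
    return {**doc, **update_doc["$set"]}


# Hypothetical document; the real item_info documents may look different.
item = {"_id": 1, "title": "phone", "area": None}

filter_doc, update_doc = build_area_update(item, "Chaoyang")
updated = apply_set(item, update_doc)
print(filter_doc, update_doc, updated)

# Against a real collection (PyMongo 3.0 or later):
#   item_info.update_one(filter_doc, update_doc)
```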

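db.collection.aggregate(pipeline), where
pipeline = [
{'$match': {'publish_time': '2017-12-13'}}, - filters documents, like find()
{'$group': {'_id': '$price', 'counts': {'$sum': 1}}}, - groups by price (the price becomes the _id) and counts the documents in each group (+1 per document)
{'$sort': {'counts': -1}}, - -1 sorts from large to small, 1 sorts from small to large
{'$limit': 3}, - keeps the first three results
]
This is much more efficient than an equivalent 'for' loop in Python, because the filtering and counting run inside MongoDB and only the final result is returned.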
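To make the effect of each stage concrete, the sketch below defines the same pipeline and reproduces its result in plain Python over a small in-memory sample. The sample documents and the helper `top_prices` are assumptions for illustration; the emulation is only meant to show what the pipeline computes, not how MongoDB executes it.

```python
from collections import Counter

# The pipeline from this README, one stage per line.
pipeline = [
    {"$match": {"publish_time": "2017-12-13"}},
    {"$group": {"_id": "$price", "counts": {"$sum": 1}}},
    {"$sort": {"counts": -1}},
    {"$limit": 3},
]


def top_prices(docs, publish_time, limit):
    """Plain-Python equivalent of the match/group/sort/limit stages."""
    matched = (d for d in docs if d.get("publish_time") == publish_time)  # $match
    counts = Counter(d["price"] for d in matched)                         # $group + $sum
    return [
        {"_id": price, "counts": n}
        for price, n in counts.most_common(limit)                         # $sort + $limit
    ]


# Hypothetical sample data; counts are chosen so that there are no ties.
sample = (
    [{"publish_time": "2017-12-13", "price": 100}] * 4
    + [{"publish_time": "2017-12-13", "price": 200}] * 3
    + [{"publish_time": "2017-12-13", "price": 300}] * 2
    + [{"publish_time": "2017-12-13", "price": 400}]
    + [{"publish_time": "2017-12-12", "price": 999}] * 5
)

result = top_prices(sample, "2017-12-13", 3)
print(result)

# Against a real collection the pipeline would be passed directly:
#   for row in item_info.aggregate(pipeline):
#       print(row)
```

When several prices share the same count, MongoDB does not guarantee their relative order after `$sort`, so the emulation and the server may order tied entries differently.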

