GitHub - xuweizhixin/Wei-s-crawler

bs4
downloads
calsize.py
crawler.py
crawler.pyc
getpage.py
getpage.pyc
pygoogle.py
pygoogle.pyc
readme.rtf
result.txt
studentMain.py
test.py
test.pyc


{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural

\f0\fs24 \cf0 Folders:\
bs4:\
This folder contains the Beautiful Soup library (bs4), which we use to parse HTML.\
downloads:\
This folder contains all the pages we have crawled and downloaded, saved in .html format.\
\
\
Files (source code):\
\
pygoogle.py:\
This is the API we use to get the top 10 results from the Google search engine.\
\
studentMain.py:\
This is the main entry point of our project. It takes the keywords, gets the top 10 results from Google, and prepares them as the starting points for the crawl.\
\
crawler.py:\
This file contains the two major functions of our project. get_all_link_keyword gets all the links within a particular URL and calculates a score for each link. crawl_web crawls pages in order of link priority and downloads them until it reaches the page limit.\
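The priority-driven crawl described above can be sketched roughly as follows. This is a minimal illustration and not the code in crawler.py: the names crawl_by_priority and get_scored_links are hypothetical, the scoring function is assumed to be supplied by the caller, and no page is actually downloaded.

```python
import heapq


def crawl_by_priority(seeds, get_scored_links, limit):
    """Visit URLs in descending score order until `limit` pages are visited.

    seeds: (url, score) pairs, for example the search-engine results.
    get_scored_links: hypothetical callable mapping a url to a list of
        (url, score) pairs found on that page.
    """
    seeds = list(seeds)
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score, url) for url, score in seeds]
    heapq.heapify(frontier)
    seen = set(url for url, _ in seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)  # a real crawler would download the page here
        for link, score in get_scored_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited
```

A heap keeps the highest-scoring unvisited link available at each step, and the seen set prevents the same URL from being queued twice.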
\
getpage.py:\
This file contains a function get_page, which skips the \'93bad\'94 links, checks robots.txt, and returns the pages that pass these tests.\
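The two checks mentioned above might look something like the sketch below, written for Python 3 with the standard library. It is not the code in getpage.py: the function names are hypothetical, and the README does not say what counts as a bad link, so the scheme and extension rules here are assumptions for illustration only.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumption: these extensions are illustrative, not the project's real list.
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip")


def is_bad_link(url):
    """Return True for links this sketch chooses not to fetch."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return True
    return parts.path.lower().endswith(SKIPPED_EXTENSIONS)


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check `url` against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Passing the robots.txt text in as a string keeps the check independent of the network. A crawler would typically fetch that file once per host and reuse it for every URL on the host.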
\
\
test.py:\
This file contains a function test, which checks the downloaded pages and returns the accuracy rate: (number of downloaded pages that contain the keywords) / (total number of pages). }
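The accuracy rate described above can be computed along these lines. This is a sketch under stated assumptions rather than the code in test.py: accuracy_rate is a hypothetical name, pages are passed in as strings, and a page is assumed to count only if it contains every keyword, compared case-insensitively.

```python
def accuracy_rate(pages, keywords):
    """Fraction of pages whose text contains every keyword."""
    if not pages:
        return 0.0  # avoid dividing by zero when nothing was downloaded
    wanted = [k.lower() for k in keywords]
    # Assumption: a page is a hit only when all keywords appear in it.
    hits = sum(1 for text in pages if all(k in text.lower() for k in wanted))
    return hits / len(pages)
```

If a page should count when it contains any one of the keywords, replacing all with any gives that looser definition.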

Languages

Python 100.0%
