xuweizhixin/Wei-s-crawler

{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural

\f0\fs24 \cf0 Folders:\
bs4:\
The bs4 folder contains the Beautiful Soup library, which we use to parse HTML. \
downloads:\
This folder contains all the pages we have crawled and downloaded, all in .html format. \
\
\
Files (source code):\
\
pygoogle.py:\
This is the API we use to get the top 10 results from the Google search engine. \
\
studentMain.py:\
This is the main function of the project. It takes the keywords, gets the top 10 results from Google, and prepares them for crawling. \
\
crawler.py:\
This contains the two major functions of our project. get_all_link_keyword gets all the links within a given URL and calculates a score for each link. crawl_web crawls the pages in order of link priority and downloads them until it hits the limit. \
\
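The priority-driven crawl described for crawler.py can be sketched as a best-first search over a heap. This is only an illustration under stated assumptions, not the project's actual code: score_link, crawl_by_priority, and the fetch_links callback are hypothetical names, and the scoring rule (count the keywords found in the link text) is a guess at what a link score could look like.

```python
import heapq

def score_link(link_text, keywords):
    # Hypothetical scoring rule: count how many keywords appear in the link text.
    text = link_text.lower()
    return sum(1 for kw in keywords if kw.lower() in text)

def crawl_by_priority(seeds, keywords, fetch_links, limit):
    # fetch_links(url) is a stand-in that returns a list of (url, link_text) pairs.
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = []
    counter = 0  # tie-breaker: equal scores pop in insertion order
    seen = set()
    for url in seeds:
        heapq.heappush(frontier, (0, counter, url))  # seeds start at priority 0
        counter += 1
        seen.add(url)
    visited = []
    while frontier and len(visited) < limit:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)  # a real crawler would download the page here
        for link, text in fetch_links(url):
            if link not in seen:
                seen.add(link)
                priority = -score_link(text, keywords)
                heapq.heappush(frontier, (priority, counter, link))
                counter += 1
    return visited
```

Passing fetch_links in as a parameter keeps the sketch runnable without network access; in a real crawl it would download the page and extract its links.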
getpage.py:\
This file contains a function get_page, which skips the \'93bad\'94 links, checks robots.txt, and fetches the pages that pass these tests.\
\
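The two checks described for getpage.py could look roughly like the sketch below, which uses the standard library's urllib.robotparser. The names is_bad_link, allowed_by_robots, and should_fetch are hypothetical, and the definition of a "bad" link (non-HTTP schemes and a few binary file extensions) is an assumption, since the README does not say which links the project rejects.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Assumed definition of a "bad" link; the project's actual rules are not stated.
BAD_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip", ".exe")

def is_bad_link(url):
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return True
    return parts.path.lower().endswith(BAD_EXTENSIONS)

def allowed_by_robots(url, robots_lines, user_agent="*"):
    # robots_lines is the robots.txt content already split into lines,
    # so this sketch runs without network access.
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)

def should_fetch(url, robots_lines):
    # A page is fetched only if it passes both tests.
    return not is_bad_link(url) and allowed_by_robots(url, robots_lines)
```

For a live site, RobotFileParser can also load the rules itself with set_url followed by read, instead of taking the lines as an argument.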
\
test.py:\
This file contains a function test, which checks the downloaded pages and returns the accuracy rate: (number of downloaded pages that contain the keywords)/(total pages). }
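The accuracy rate from test.py can be illustrated with a few lines of Python. This is a sketch under assumptions rather than the project's code: accuracy_rate is a hypothetical name, the pages are passed in as strings instead of being read from the downloads folder, and a page counts as a hit if it contains at least one keyword (the project may require all of them).

```python
def accuracy_rate(pages, keywords):
    # pages is a list of page texts; matching is case-insensitive.
    if not pages:
        return 0.0  # avoid dividing by zero when nothing was downloaded
    lowered = [kw.lower() for kw in keywords]
    hits = sum(
        1 for text in pages
        if any(kw in text.lower() for kw in lowered)
    )
    return hits / len(pages)
```

For example, if two of four downloaded pages mention the keyword, the function returns 0.5.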
