xuweizhixin/Wei-s-crawler
Folders:

bs4:
Contains the Beautiful Soup library, which we use to parse HTML.

downloads:
Contains all the pages we have crawled and downloaded, in .html format.

Files (source code):

pygoogle.py:
The API we use to fetch the top 10 results from the Google search engine.

studentMain.py:
The main entry point of the project: it takes the keywords, gets the top 10 results from Google, and prepares them for crawling.

crawler.py:
Contains the two major functions of the project. get_all_link_keyword collects all the links within a given URL and computes a score for each link. crawl_web crawls pages in order of link priority and downloads them until it hits the limit.

getpage.py:
Contains get_page, which skips "bad" links, checks robots.txt, and returns only the pages that pass these tests.

test.py:
Contains test, which checks the downloaded pages and returns the accuracy rate: (number of downloaded pages containing the keywords) / (total pages).
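The priority-driven crawl described above can be sketched roughly as follows. The function names mirror the README (get_all_link_keyword, crawl_web), but the scoring rule, the injected fetch callback, and the toy link extraction are assumptions for illustration, not the project's actual code (the real project parses links with Beautiful Soup).

```python
# Hypothetical sketch of the priority-driven crawl loop. The fetch(url)
# callback is an assumption made so the sketch stays self-contained;
# the real get_page also filters "bad" links and checks robots.txt.
import heapq

def get_all_link_keyword(html, keywords):
    """Toy link extractor/scorer: scan for href="..." attributes and
    score each link by how many keywords appear in its URL."""
    links = []
    start = 0
    while True:
        i = html.find('href="', start)
        if i == -1:
            break
        j = html.find('"', i + 6)
        url = html[i + 6:j]
        score = sum(1 for kw in keywords if kw in url)
        links.append((url, score))
        start = j
    return links

def crawl_web(seed_urls, keywords, fetch, limit):
    """Crawl highest-scored links first until `limit` pages are downloaded.
    heapq is a min-heap, so scores are pushed negated to pop the best
    link first. fetch(url) returns the page HTML, or None to skip."""
    heap = [(0, url) for url in seed_urls]
    heapq.heapify(heap)
    seen = set(seed_urls)
    downloaded = {}
    while heap and len(downloaded) < limit:
        _neg_score, url = heapq.heappop(heap)
        html = fetch(url)
        if html is None:          # rejected link: skip it
            continue
        downloaded[url] = html
        for link, score in get_all_link_keyword(html, keywords):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score, link))
    return downloaded
```

With a small in-memory page table as the fetch callback, a seed page whose keyword-bearing link outscores the others is crawled before them, which is the prioritization behavior the README describes.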