Scrapy_Projects

This repo contains all HPFB (Health Products and Food Branch) projects that scrape PDF/XML/HTTP/TXT/XLSX files with Scrapy. Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, and historical archival. All scraped items are indexed into Elasticsearch, a distributed search and analytics engine.
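
To make the "scrape, then index" flow concrete, here is a minimal sketch of a Scrapy-style item pipeline that batches scraped items into request bodies for Elasticsearch's `_bulk` endpoint. It is not taken from the projects in this repo: the class name `BulkIndexPipeline`, the index name `hpfb_documents`, the batch size, and the use of a `url` field as the document ID are all assumptions for illustration. The sketch only builds the newline-delimited JSON payloads and leaves sending them to the cluster out.

```python
import json


class BulkIndexPipeline:
    """Hypothetical Scrapy-style pipeline that buffers items for the _bulk API."""

    def __init__(self, index_name="hpfb_documents", batch_size=500):
        # index_name and batch_size are illustrative defaults, not repo values.
        self.index_name = index_name
        self.batch_size = batch_size
        self.buffer = []    # items waiting to be flushed
        self.payloads = []  # finished NDJSON bodies, ready to POST to /_bulk

    def process_item(self, item, spider):
        # Scrapy calls this once per scraped item; returning the item
        # hands it on to any later pipeline.
        self.buffer.append(dict(item))
        if len(self.buffer) >= self.batch_size:
            self.flush()
        return item

    def close_spider(self, spider):
        # Called when the spider finishes; flush whatever is left over.
        self.flush()

    def flush(self):
        if not self.buffer:
            return
        lines = []
        for doc in self.buffer:
            # Each document is preceded by an action line naming the index.
            action = {"index": {"_index": self.index_name}}
            if "url" in doc:
                # Assumption: the source URL serves as a stable document ID.
                action["index"]["_id"] = doc["url"]
            lines.append(json.dumps(action))
            lines.append(json.dumps(doc))
        # The bulk body must end with a trailing newline.
        self.payloads.append("\n".join(lines) + "\n")
        self.buffer = []
```

In a real Scrapy project a pipeline like this would be enabled through the `ITEM_PIPELINES` setting in `settings.py`, and each payload would be sent to the cluster with the content type `application/x-ndjson`.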

A reproducible example of the Scrapy architecture can be found here. For more information, please visit the Scrapy documentation.

To check document similarity using cosine similarity, please refer here. For language detection, please check Apache Tika.
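
For readers unfamiliar with cosine similarity, the following is a minimal standard-library sketch of the idea applied to two pieces of text. It is an illustration only, not the implementation the linked reference uses: the function names `tokenize` and `cosine_similarity`, the lowercase alphanumeric tokenizer, and the raw term-count weighting are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction; treat it as sharing nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

With raw counts the score ranges from 0.0 (no shared terms) to 1.0 (identical term distributions). Weighting terms by TF-IDF instead of raw counts is a common refinement when comparing many scraped documents.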

Before running any project, please update all directory paths to match your own environment.

Folders and files:
nas
pmscrapy
scrapexml
slsascrapy
superlist
superscrapy
.Rhistory
.gitignore
HPFB_Scrapy_Projects.Rproj
README.md

JasonHJiang/HPFB_Scrapy_Projects
