JasonHJiang/HPFB_Scrapy_Projects

Scrapy_Projects

This repo contains all HPFB (Health Products and Food Branch) projects that use Scrapy to scrape PDF, XML, HTTP, TXT, and XLSX files. Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival. All scraped items are indexed into Elasticsearch, a distributed search and analytics engine.
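As a rough illustration of the scrape-then-index flow described above, here is a minimal sketch of a Scrapy-style item pipeline. It is not taken from the projects in this repo. Scrapy item pipelines are plain classes exposing `process_item(item, spider)`; everything else here is an assumption made for the example: the class name `IndexingSketchPipeline`, the index name `hpfb_documents`, the `url` field, and the `send` callable, which stands in for a real Elasticsearch client call.

```python
import hashlib
import json


class IndexingSketchPipeline:
    """Scrapy-style item pipeline that hands each scraped item to an indexer."""

    def __init__(self, index_name="hpfb_documents", send=None):
        # index_name is a hypothetical index name, not one used by this repo.
        self.index_name = index_name
        # Documents are recorded here when no real indexer is supplied.
        self.sent = []
        # send(index, doc_id, document) is a hypothetical hook; in a real
        # project it would wrap a call to an Elasticsearch client.
        self.send = send or self._record

    def _record(self, index, doc_id, document):
        self.sent.append((index, doc_id, document))

    def process_item(self, item, spider):
        document = dict(item)
        # Derive a stable ID from the URL (assumed field) so that scraping
        # the same page twice overwrites the document instead of duplicating it.
        key = document.get("url") or json.dumps(document, sort_keys=True, default=str)
        doc_id = hashlib.sha1(key.encode("utf-8")).hexdigest()
        self.send(self.index_name, doc_id, document)
        # Pipelines return the item so later pipelines can process it too.
        return item
```

In a Scrapy project a class like this would be enabled through the `ITEM_PIPELINES` setting; injecting `send` keeps the sketch runnable without a live Elasticsearch cluster.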

A reproducible example of the Scrapy architecture can be found here. For more information, see the Scrapy documentation.

To check document similarity using cosine similarity, please refer here. For language detection, see Apache Tika.
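For readers unfamiliar with the technique, the following is a minimal sketch of cosine similarity over simple word-count vectors. It is not necessarily how the linked implementation works: the tokenizer, the use of raw term counts (rather than TF-IDF weights), and the function names `tokenize` and `cosine_similarity` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into runs of ASCII letters and digits (a simplifying choice).
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Return the cosine of the angle between two term-count vectors."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Dot product over the terms that both texts share.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)

    # Euclidean norms of each count vector.
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))

    # An empty text has no direction, so treat its similarity as zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because term counts are non-negative, the score falls between 0 (no shared terms) and 1 (identical term distributions), which makes it a convenient way to flag near-duplicate scraped documents.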

Before running any project, make sure to update all directory paths to match your own environment.
