cninfo-crawler-pdf-extracter

This is a crawler for PDF documents containing securities information from cninfo.com.cn. The project has three steps.
First, I crawl PDF documents from cninfo.com.cn.
Second, I use the pdfminer3k package to convert each PDF to HTML and separate paragraphs from tables.
Finally, I use whoosh + jieba to build a Chinese text search engine.
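
To illustrate the idea behind the search step, here is a minimal standard-library sketch of an inverted index over Chinese text. It is not the project's code. It swaps in overlapping character bigrams where the project uses jieba word segmentation, and a plain dictionary where the project uses a whoosh index. The class `TinyIndex`, the helper `bigrams`, the document ids, and the sample sentences are all hypothetical.

```python
from collections import defaultdict


def bigrams(text):
    """Split text into overlapping character bigrams, ignoring whitespace.

    Assumption: a stand-in for jieba segmentation, not what the project uses.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


class TinyIndex:
    """Hypothetical in-memory inverted index (stand-in for a whoosh index)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of paragraphs containing it
        self.docs = {}                    # paragraph id -> original text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for token in bigrams(text):
            self.postings[token].add(doc_id)

    def search(self, query):
        """Return sorted ids of paragraphs that contain every query bigram."""
        tokens = bigrams(query)
        if not tokens:
            return []
        result = None
        for token in tokens:
            ids = self.postings.get(token, set())
            result = set(ids) if result is None else result & ids
        return sorted(result)


# Made-up sample paragraphs, keyed by a hypothetical "file#paragraph" id.
index = TinyIndex()
index.add("a.pdf#p1", "公司董事会审议通过年度报告")
index.add("b.pdf#p3", "公司拟向全体股东派发现金红利")
index.add("c.pdf#p2", "年度报告披露了现金红利方案")

print(index.search("年度报告"))  # ['a.pdf#p1', 'c.pdf#p2']
print(index.search("现金红利"))  # ['b.pdf#p3', 'c.pdf#p2']
```

Bigram matching only checks that every bigram of the query occurs somewhere in a paragraph, so it can return paragraphs where the bigrams are not adjacent. A real word segmenter and a ranked index, such as the jieba and whoosh combination named above, avoid much of that noise.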

Details:

PROJECT_EXPLANATION_CHN.docx : Contest definition

useDownloadList.py : Extracts the download list from PROJECT_EXPLANATION_CHN.docx

reSearchControl.py : Main function for extracting information from documents

searchControl.py : Main function for extracting information from documents

dir : Utility function repository

fixedTrainingTxt, pdfStores, txtAndXls : Data storage
