sixgod: Python web page main-text extraction

Approach

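General-purpose web page main-text extraction based on a line-block distribution function.
Advantages: runs in linear time, builds no DOM tree, and does not depend on specific HTML tags.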

:::python

     from vampire.htmlextract import HtmlExtract  
     from vampire.utils import network  
     
     h = HtmlExtract()  
     html = network.get_html_string("http://finance.jfinfo.com/news/20131022/00311378.shtml")  
     print(h.get_text(html))  # returns the extracted news article body
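
The line-block idea named above can be illustrated with a minimal sketch. This is not the code inside `vampire`; it is one plausible reading of the technique, and the names `strip_tags`, `extract_main_text`, `block_width`, and `threshold` are hypothetical, as are their default values. The sketch removes all markup with regular expressions (no DOM tree), measures how much text each line carries, sums those lengths over a sliding block of consecutive lines, and keeps the densest contiguous run of non-empty blocks.

```python
import re


def strip_tags(html):
    """Remove scripts, styles, comments and all remaining tags; keep line breaks."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    return re.sub(r"<[^>]*>", "", html)


def extract_main_text(html, block_width=3, threshold=40):
    """Return the densest run of text lines (assumed defaults, tune per site)."""
    lines = [line.strip() for line in strip_tags(html).splitlines()]
    # Text length of each line, ignoring whitespace.
    lengths = [len(re.sub(r"\s+", "", line)) for line in lines]
    n = len(lines)
    # blocks[i] = characters in lines i .. i + block_width - 1.
    blocks = [sum(lengths[i:i + block_width]) for i in range(n)]

    best_start, best_end, best_total = 0, 0, 0
    i = 0
    while i < n:
        if blocks[i] > threshold:
            # A dense block starts here; extend until the blocks drop to zero.
            end = i
            while end < n and blocks[end] > 0:
                end += 1
            total = sum(lengths[i:end])
            if total > best_total:
                best_start, best_end, best_total = i, end, total
            i = end
        else:
            i += 1
    return "\n".join(line for line in lines[best_start:best_end] if line)
```

Because `block_width` is a small constant, every step is a single pass over the lines, which is where the linear running time comes from. Short navigation or footer lines never push a block over `threshold`, so they are left out as long as blank lines separate them from the article body.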
