Skip to content

jxqlovejava/sixgod

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

sixgod pyton 网页正文提取

思想

  • 基于行块分布函数的通用网页正文抽取
  • 优势: 线性时间、不建DOM树、与HTML标签无关

:::python

     from vampire.htmlextract import HtmlExtract  
     from vampire.utils import network  
     
     h = HtmlExtract()  
     html = network.get_html_string("http://finance.jfinfo.com/news/20131022/00311378.shtml")  
     print h.get_text(html) #返回新闻正文提取

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%