Python Document.get_clean_html Examples

Programming language: Python

Namespace/package name: readability.readability

Class/type: Document

Method/function: get_clean_html

Examples on hotexamples.com: 2

Python Document.get_clean_html: 2 examples found. These are the top-rated real-world Python examples of readability.readability.Document.get_clean_html, extracted from open source projects.

Example #1

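from readability.readability import Document

# html_to_text is not defined in this snippet; it is presumably a helper
# from the originating project.
def process_html(html):
    doc = Document(html)
    return {
        'content': doc.content(),
        'clean_html': doc.get_clean_html(),
        'short_title': doc.short_title(),
        # summary() returns HTML, so it is converted to plain text here.
        'summary': html_to_text(doc.summary()),
        'title': doc.title()
    }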
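
Example #1 calls html_to_text without showing its definition. The following is a minimal sketch of what such a helper might look like, assuming it only needs to drop tags and collapse whitespace. It is a hypothetical stand-in built on the standard library's html.parser, not the originating project's implementation, and the _TextExtractor class name is invented for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse every run of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical stand-in for the helper that Example #1 calls."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With a helper along these lines in scope, the html_to_text(doc.summary()) call in Example #1 would yield the article text without markup. Because chunks are joined with a space, text split by an inline tag in the middle of a word would gain a space; a real project would likely use a more careful converter.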

Example #2

# encoding:utf-8
# import html2text
import requests
import time
import re
from readability.readability import Document

url = "http://world.huanqiu.com/exclusive/2016-07/9209839.html"
# res = requests.get('http://finance.sina.com.cn/roll/2019-02-12/doc-ihrfqzka5034116.shtml')
res = requests.get(url)

st = time.time()
d = Document(res.content)

# Get the news headline
readable_title = d.short_title()
print(readable_title)
# Extract the main content and clean it
readable_article = d.summary()
# print(readable_article)

print(d.get_clean_html())

print("time: {}".format(time.time() - st))

# text_p = re.sub(r'</?div.*?>', '', readable_article)
# text_p = re.sub(r'((</p>)?<a href=.*?>|</a>(<p>)?)', '', text_p)
# text_p = re.sub(r'<select>.*?</select>', '', text_p)
# print(text_p)
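
The commented-out lines at the end of Example #2 hint at a regex post-processing step applied to the summary() output. As a rough sketch, the same three substitutions can be wrapped in a function and tried on a small hand-written string, so no network request is needed. The strip_wrappers name and the sample markup are assumptions made for illustration.

```python
import re


def strip_wrappers(article_html):
    """Apply the three substitutions from the commented-out lines."""
    # Drop opening and closing <div> tags but keep what they wrap.
    text = re.sub(r'</?div.*?>', '', article_html)
    # Drop anchor tags, plus a </p> directly before an opening <a>
    # or a <p> directly after a closing </a>.
    text = re.sub(r'((</p>)?<a href=.*?>|</a>(<p>)?)', '', text)
    # Drop <select> elements together with their contents.
    text = re.sub(r'<select>.*?</select>', '', text)
    return text


if __name__ == "__main__":
    sample = (
        '<div class="a"><p>Read <a href="http://example.com">this</a> now</p>'
        '<select><option>x</option></select></div>'
    )
    print(strip_wrappers(sample))
```

Since none of the patterns use re.DOTALL, a tag or a select element that spans several lines would not be matched. Regex-based HTML cleanup is fragile in general, so this is best treated as a quick experiment rather than a robust cleaner.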