Python readability_extract Examples

Programming language: Python

Namespace/package name: pressley.util

Method/function: readability_extract

Examples on hotexamples.com: 3

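Python readability_extract: 3 examples found. These are top-rated real-world Python examples of pressley.util.readability_extract, extracted from open source projects. In each example the function takes the raw content of an HTTP response and returns a (title, body) pair.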

Example #1

File: scrape.py Project: socoboy/pressley

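import logging

import requests
# Assumption: parse_mime_type comes from the python-mimeparse package.
from mimeparse import parse_mime_type
# Assumption: kill_control_characters lives next to readability_extract.
from pressley.util import kill_control_characters, readability_extract


def get_link_content(link):
    """Fetch a link and return its readable body text, or None if unusable."""
    try:
        response = requests.get(link)
        # A missing page is expected now and then: log it and skip the link.
        if response.status_code == 404:
            logging.warning(u"404 {0}".format(link))
            return None
        if response.status_code != 200:
            raise Exception(u"Unable to fetch release content: {0}".format(link))
    except requests.exceptions.InvalidURL as e:
        logging.warning(u"Invalid link {0}: {1}".format(link, e))
        return None

    content_type = response.headers.get('content-type')
    if not content_type:
        logging.warning(u"Response did not contain a Content-Type header: {0}".format(link))
        return None

    # Only HTML documents are passed on to the readability extractor.
    (mime_type, mime_subtype, mt_params) = parse_mime_type(content_type)
    if mime_type != 'text' or mime_subtype not in ('html', 'xhtml'):
        logging.warning(u"Skipping non-HTML link: {0}".format(link))
        return None

    if len(response.content) == 0:
        logging.warning(u"Server returned an empty body: {0}".format(link))
        return None

    # The extracted title is not needed here; only the cleaned body is returned.
    (title, body) = readability_extract(response.content)
    return kill_control_characters(body)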
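
The Content-Type check in Example #1 can be tried on its own, without a network call. The following is a minimal sketch, not code from pressley: `is_html_content_type` is a hypothetical helper that uses plain string operations instead of `parse_mime_type`, and it accepts the same two subtypes as the example.

```python
def is_html_content_type(content_type):
    """Return True if a Content-Type value names text/html or text/xhtml.

    Hypothetical helper, not part of pressley. It mirrors the subtype
    check in Example #1 using only string operations.
    """
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    mime_type, _, mime_subtype = media_type.partition('/')
    return mime_type == 'text' and mime_subtype in ('html', 'xhtml')


if __name__ == '__main__':
    print(is_html_content_type('text/html; charset=utf-8'))
    print(is_html_content_type('application/pdf'))
```

Keeping the check in a small pure function like this makes it easy to unit test separately from the fetching logic.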

Example #2

File: romneycampaign.py Project: sunlightlabs/pressley

    def body(self):
        if self._body is None:
            response = requests.get(self.url)
            response.raise_for_status()
            (_junk_title, body) = readability_extract(response.content)
            self._body = kill_control_characters(body)
        return self._body
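
Example #2 fetches and extracts the page only on the first call and reuses the stored result afterwards. A minimal sketch of that caching pattern follows. The class `CachedPage`, the injected `fetch` callable, and the URL are all hypothetical and stand in for the real request and extraction steps so the snippet runs without network access.

```python
class CachedPage(object):
    """Hypothetical stand-in for the class that owns body() in Example #2."""

    def __init__(self, url, fetch):
        self.url = url
        # fetch is any callable(url) -> text; injected so no network is needed.
        self._fetch = fetch
        self._body = None

    def body(self):
        # Do the expensive work on the first call only, then reuse the result.
        if self._body is None:
            self._body = self._fetch(self.url)
        return self._body


calls = []


def fake_fetch(url):
    """Record each call so the caching behaviour can be observed."""
    calls.append(url)
    return u"body of {0}".format(url)


page = CachedPage("http://example.com/release", fake_fetch)
page.body()
page.body()
print(len(calls))  # 1: the second call was served from the cache
```

One limitation of using `None` as the "not fetched yet" marker is that a fetch which legitimately returns `None` would be repeated on every call.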

Example #3

File: congressional_leadership.py Project: socoboy/pressley

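    def extract(self, link):
        # Raw response body; used by both the extractor and the date parser.
        content = requests.get(link).content
        (title, body) = readability_extract(content)
        # Look up the leader-specific date parser by name: parse_<leader>_date.
        parse_date = getattr(self, 'parse_%s_date' % self.extra['leader'])
        date = parse_date(body, content, link)
        doc = {'url': link,
               'title': title,
               'text': body,
               'date': date,
               'source': self.sources[self.index]}

        return doc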
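
Example #3 picks a date parser at run time by building a method name from `self.extra['leader']` and looking it up with `getattr`. The sketch below isolates that dispatch. The class `LeaderScraper`, the leader names `alpha` and `beta`, and the toy parsing rules are invented for illustration and do not come from pressley.

```python
class LeaderScraper(object):
    """Hypothetical scraper; leader names and parsing rules are made up."""

    def __init__(self, leader):
        self.extra = {'leader': leader}

    def parse_alpha_date(self, body):
        # Toy rule: the date is the first '|'-separated field.
        return body.split('|')[0].strip()

    def parse_beta_date(self, body):
        # Toy rule: the date is the last '|'-separated field.
        return body.split('|')[-1].strip()

    def parse_date(self, body):
        # Build the method name and look it up on the instance.
        method_name = 'parse_%s_date' % self.extra['leader']
        parser = getattr(self, method_name, None)
        if parser is None:
            raise ValueError('No date parser for leader %r' % self.extra['leader'])
        return parser(body)


if __name__ == '__main__':
    print(LeaderScraper('alpha').parse_date('2012-01-02 | some text'))
```

Passing a default to `getattr` and raising an explicit error is a design choice made in this sketch; the original example would instead raise `AttributeError` for an unknown leader.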