Example #1
File: tasks.py Project: dotkrnl/dotFoods
def process_list(url, title, raw_content, depth):
    if depth <= 0:
        logger.info('Reached depth limit: {0}'.format(title))
        return

    logger.info('Processing list {0}'.format(title))

    content = BeautifulSoup(
        bs_preprocess(raw_content), 'html.parser')

    # register this list page, creating the database row if it does not exist yet
    url_name = url_name_of(url)
    WikiList.objects.get_or_create(
        url_name=url_name,
        defaults={'title': title})

    # remove noisy links by applying each predicate in LINK_FILTERS in turn
    link_list = content.find_all('a')
    for f in LINK_FILTERS:
        link_list = [a for a in link_list if f(a)]

    # queue every remaining link for crawling, one level deeper
    for link in link_list:
        abs_url = urljoin(url, link['href'])
        crawl_page.delay(abs_url, url_name, depth - 1)
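
Note: the helpers referenced above (bs_preprocess, url_name_of, LINK_FILTERS and the crawl_page task) are defined elsewhere in dotkrnl/dotFoods. As a rough, hypothetical sketch of what the link filters and the queued crawl task could look like (every name and rule below is an assumption for illustration, not project code):

# Hypothetical sketch only -- the real definitions live elsewhere in the project.
from urllib.parse import urlparse

def _has_href(a):
    # keep only anchors that actually carry an href attribute
    return a.has_attr('href')

def _looks_like_article(a):
    # assumed rule: follow /wiki/<page> links and skip Special:, File:, etc.
    path = urlparse(a['href']).path
    return path.startswith('/wiki/') and ':' not in path

LINK_FILTERS = [_has_href, _looks_like_article]

# crawl_page.delay(...) suggests a Celery task; a matching signature might be:
# @app.task
# def crawl_page(url, parent_list, depth):
#     """Fetch url and dispatch to process_list / process_article."""
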
Example #2
File: tasks.py Project: dotkrnl/dotFoods
def process_article(url, title, raw_content, parent=None):
    logger.info('Processing article {0}'.format(title))

    # parse a preprocessed copy for metadata extraction (categories below)
    content = BeautifulSoup(
        bs_preprocess(raw_content), 'html.parser')

    # build a cleaned copy of the body: strip the "[edit]" section links
    # and the "jump to navigation" anchor that MediaWiki inserts
    body = BeautifulSoup(raw_content, 'html.parser')
    for edits in body.find_all(class_='mw-editsection'):
        edits.extract()
    jump_nav = body.find(id='jump-to-nav')
    if jump_nav is not None:
        jump_nav.extract()
    body = BeautifulSoup(str(body), 'html.parser')

    # create or fetch the page record, then refresh its stored body
    page, created = WikiPage.objects.get_or_create(
        url_name=url_name_of(url),
        defaults={
            'title': title,
            'origin': url})
    page.body = str(body)
    page.save()

    # attach the page to the list it was crawled from, if any
    if parent:
        wl, created = WikiList.objects.get_or_create(
            url_name=parent)
        page.lists.add(wl)

    # attach the page to every category in the normal category box;
    # uncategorised pages have no 'mw-normal-catlinks' element
    catlinks = content.find(id='mw-normal-catlinks')
    for cat in (catlinks.find_all('li') if catlinks else []):
        url_name = url_name_of_cat(cat.find('a')['href'])
        if not url_name:
            continue
        wc, created = WikiCategory.objects.get_or_create(
            url_name=url_name,
            defaults={'title': str(cat.text)})
        page.categories.add(wc)
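
Both snippets also depend on small URL helpers. A minimal sketch of what url_name_of and url_name_of_cat might do, assuming MediaWiki-style /wiki/<Page_name> and Category:<Name> URLs (behavior assumed for illustration, not taken from the project):

# Hypothetical sketch of the URL helpers used above; the real ones are project code.
from urllib.parse import urlparse, unquote

def url_name_of(url):
    # assumed: the page name is the last path segment of a /wiki/<Page_name> URL
    return unquote(urlparse(url).path.rstrip('/').rsplit('/', 1)[-1])

def url_name_of_cat(href):
    # assumed: category links look like /wiki/Category:<Name>; return the bare
    # name, or None so the caller can skip non-category links
    name = url_name_of(href)
    return name.split(':', 1)[1] if name.startswith('Category:') else None
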