Python fix_mojibake Examples

Programming language: Python

Namespace/package name: alert.lib.mojibake

Method/function: fix_mojibake

Examples at hotexamples.com: 4

Python fix_mojibake - 4 examples found. These are the top-rated real-world Python examples of alert.lib.mojibake.fix_mojibake, extracted from open source projects.

Example #1

File: tasks.py Project: ellliottt/courtlistener

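import subprocess

from alert.lib.mojibake import fix_mojibake
# subtask (used below) appears to be Celery's task-signature helper; import it
# from whichever Celery module matches the installed version.


def extract_from_pdf(doc, path, DEVNULL, callback=None):
    """Extract text from PDFs.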

    Here, we use pdftotext. If that fails, try to use tesseract under the
    assumption it's an image-based PDF. Once that is complete, we check for the
    letter e in our content. If it's not there, we try to fix the mojibake
    that ca9 sometimes creates.
    """
    process = subprocess.Popen(
        ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
        shell=False,
        stdout=subprocess.PIPE,
        stderr=DEVNULL
    )
    content, err = process.communicate()
    if content.strip() == '' and callback:
        # probably an image PDF. Send it to OCR
        result = subtask(callback).delay(path)
        success, content = result.get()
        if success:
            doc.extracted_by_ocr = True
        elif content == '' or not success:
            content = 'Unable to extract document content.'
    elif 'e' not in content:
        # It's a corrupt PDF from ca9. Fix it.
        content = fix_mojibake(unicode(content, 'utf-8', errors='ignore'))

    return doc, content, err
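
The branching described in the docstring above can be restated as a small pure function. This is a minimal Python 3 sketch, not code from courtlistener: the name `classify_extraction` and its three return labels are hypothetical, and it only decides which path to take rather than running OCR or calling `fix_mojibake`. Unlike the original, it also ignores whether an OCR callback was supplied.

```python
def classify_extraction(raw: bytes) -> str:
    """Decide how to treat pdftotext output (hypothetical helper).

    Returns one of three made-up labels:
      "ocr"      - nothing was extracted, so the PDF is probably image-based
      "mojibake" - text was extracted but contains no letter "e"
      "ok"       - text looks usable as-is
    """
    # Mirror the original's lenient decode: undecodable bytes are dropped.
    text = raw.decode("utf-8", errors="ignore")
    if text.strip() == "":
        return "ocr"
    if "e" not in text:
        # English prose almost always contains an "e"; its absence is the
        # cheap signal the original code uses to spot corrupt ca9 PDFs.
        return "mojibake"
    return "ok"
```

The check is a heuristic: a very short but valid text with no "e" (for example "OK") would also be labelled "mojibake", so it suits long documents such as court opinions better than short strings.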

Example #2

File: tasks.py Project: wmbutler/courtlistener

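import subprocess

from alert.lib.mojibake import fix_mojibake
# subtask (used below) appears to be Celery's task-signature helper; import it
# from whichever Celery module matches the installed version.


def extract_from_pdf(doc, path, DEVNULL, callback=None):
    """Extract text from PDFs.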

    Here, we use pdftotext. If that fails, try to use tesseract under the
    assumption it's an image-based PDF. Once that is complete, we check for the
    letter e in our content. If it's not there, we try to fix the mojibake
    that ca9 sometimes creates.
    """
    process = subprocess.Popen(
        ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"],
        shell=False,
        stdout=subprocess.PIPE,
        stderr=DEVNULL)
    content, err = process.communicate()
    if content.strip() == '' and callback:
        # probably an image PDF. Send it to OCR
        result = subtask(callback).delay(path)
        success, content = result.get()
        if success:
            doc.extracted_by_ocr = True
        elif content == '' or not success:
            content = 'Unable to extract document content.'
    elif 'e' not in content:
        # It's a corrupt PDF from ca9. Fix it.
        content = fix_mojibake(unicode(content, 'utf-8', errors='ignore'))

    return doc, content, err
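
Both versions of `extract_from_pdf` receive `DEVNULL` as a parameter, presumably because the Python 2 `subprocess` module has no `DEVNULL` constant (it was added in Python 3.3). A Python 3 sketch of the same call could look like the following. `pdftotext_command` and `run_and_capture` are hypothetical names, and the usage line runs the Python interpreter itself instead of pdftotext so that the example does not depend on that tool being installed.

```python
import subprocess
import sys
from typing import List


def pdftotext_command(path: str) -> List[str]:
    """Build the same argument list the examples above pass to Popen."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", path, "-"]


def run_and_capture(cmd: List[str]) -> bytes:
    """Run cmd without a shell, discard stderr, and return stdout as bytes."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,  # like the original, do not raise on a non-zero exit
    )
    return completed.stdout


# Stand-in command so the sketch runs without pdftotext installed.
output = run_and_capture([sys.executable, "-c", "print('hello')"])
```

With pdftotext available, `run_and_capture(pdftotext_command(path))` would return the extracted text as UTF-8 bytes, ready for the empty-output and missing-"e" checks.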

Example #3

File: fix_mojibake_in_cases_94.py Project: Andr3iC/courtlistener

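from alert.lib.mojibake import fix_mojibake
# conn (a Solr connection) and Document (a Django model) are defined or
# imported elsewhere in the original script.


def cleaner(simulate=False, verbose=True):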
    """Fix cases that have mojibake as a result of pdffactory 3.51."""

    # Find all the cases using Solr
    results_si = conn.raw_query(**{'q': u'ÚÑÎ', 'caller': 'mojibake',})
    for result in results_si:
        # For each document
        doc = Document.objects.get(pk=result['id'])
        if verbose:
            print "https://www.courtlistener.com" + doc.get_absolute_url()
        # Correct the text
        text = doc.plain_text
        doc.plain_text = fix_mojibake(text)

        # Save the case
        if not simulate:
            doc.save()
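
The `simulate` flag above is a dry-run switch: every matching document is processed, but nothing is saved unless simulation is off. The sketch below isolates that pattern in Python 3 with in-memory stand-ins. `FakeDoc`, `clean_docs`, and the `fixer` argument are hypothetical; `str.upper` stands in for `fix_mojibake`, whose implementation is not shown in these examples.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class FakeDoc:
    """In-memory stand-in for the project's Document model (hypothetical)."""
    plain_text: str
    saved_text: str = ""

    def save(self) -> None:
        # A real model would write to the database here.
        self.saved_text = self.plain_text


def clean_docs(docs: Iterable[FakeDoc], fixer: Callable[[str], str],
               simulate: bool = False) -> int:
    """Apply fixer to each document; persist only when simulate is False."""
    count = 0
    for doc in docs:
        doc.plain_text = fixer(doc.plain_text)
        if not simulate:
            doc.save()
        count += 1
    return count


dry = [FakeDoc("abc")]
clean_docs(dry, str.upper, simulate=True)   # text changes in memory only
real = [FakeDoc("abc")]
clean_docs(real, str.upper)                 # text changes and is saved
```

Passing the fixer in as an argument keeps the loop testable without a database or a Solr index.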

Example #4

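from alert.lib.mojibake import fix_mojibake
# conn (a Solr connection) and Document (a Django model) are defined or
# imported elsewhere in the original script.


def cleaner(simulate=False, verbose=True):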
    """Fix cases that have mojibake as a result of pdffactory 3.51."""

    # Find all the cases using Solr
    results_si = conn.raw_query(**{"q": u"ÚÑÎ", "caller": "mojibake",})
    for result in results_si:
        # For each document
        doc = Document.objects.get(pk=result["id"])
        if verbose:
            print "https://www.courtlistener.com" + doc.get_absolute_url()
        # Correct the text
        text = doc.plain_text
        doc.plain_text = fix_mojibake(text)

        # Save the case
        if not simulate:
            doc.save()