Python Corpus.replace_characters Examples

Programming language: Python

Namespace/package name: tmtoolkit.corpus

Class/type: Corpus

Method/function: replace_characters

Examples at hotexamples.com: 2

Python Corpus.replace_characters - 2 examples found. These are the top-rated real-world Python examples of tmtoolkit.corpus.Corpus.replace_characters, extracted from open source projects.

Frequently used methods

Corpus(20)

from_folder(8)

from_files(6)

from_builtin_corpus(5)

copy(3)

keys(3)

items(2)

builtin_corpora(2)

replace_characters(2)

apply(2)

add_files(1)

split_by_paragraphs(1)

from_pickle(1)

get_doc_labels(1)

from_zip(1)

from_tabular(1)

add_doc(1)

filter_characters(1)

to_pickle(1)

Example #1

File: test_corpus.py Project: ihavemanyquestions/tmtoolkit

from tmtoolkit.corpus import Corpus


def test_corpus_replace_characters_simple():
    c = Corpus({'doc1': 'ABC', 'doc2': 'abcDeF'})
    c.replace_characters({'a': None, 'C': 'c', 'e': ord('X')})

    assert c.docs == {
        'doc1': 'ABc',
        'doc2': 'bcDXF',
    }

    c.replace_characters({ord('A'): None})

    assert c.docs == {
        'doc1': 'Bc',
        'doc2': 'bcDXF',
    }

    c.replace_characters(str.maketrans('DXFY', '1234'))

    assert c.docs == {
        'doc1': 'Bc',
        'doc2': 'bc123',
    }

    c.replace_characters({})

    assert c.docs == {
        'doc1': 'Bc',
        'doc2': 'bc123',
    }
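The assertions above suggest that the table follows the conventions of Python's built-in `str.translate`: keys may be single characters or code points, and values may be `None` (delete), a string, or a code point. The following is a minimal standard-library sketch of that behavior on a plain dict of documents. It assumes this is roughly what happens per document; the source does not show the library's implementation, and `replace_characters_sketch` is a hypothetical helper, not part of tmtoolkit.

```python
def replace_characters_sketch(docs, translation_table):
    """Apply a character translation table to every document text.

    Hypothetical helper for illustration only (not the tmtoolkit implementation).
    `docs` maps document labels to strings.
    """
    # str.maketrans() with a single dict argument converts one-character
    # string keys to code points and leaves integer keys unchanged.
    table = str.maketrans(translation_table)
    return {label: text.translate(table) for label, text in docs.items()}


docs = {'doc1': 'ABC', 'doc2': 'abcDeF'}

# Mixed key/value styles, as in the test above.
step1 = replace_characters_sketch(docs, {'a': None, 'C': 'c', 'e': ord('X')})

# A table built from two equal-length strings; 'Y' does not occur and is ignored.
step2 = replace_characters_sketch(step1, str.maketrans('DXFY', '1234'))

# An empty table leaves every document unchanged.
step3 = replace_characters_sketch(step2, {})
```

Unlike the method in the test, which modifies the corpus in place, this sketch returns a new dict, which keeps each intermediate result available for inspection.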

Example #2

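File: bundestag18_tfidf.py Project: yushu-liu/tmtoolkit

# NOTE: the excerpt starts mid-script. The imports and the opening line of the dict below
# are reconstructed from how the names are used further down. Earlier table entries and
# the creation of `corpus` (a tmtoolkit Corpus) are not shown in the source.
import re
import string
from pprint import pprint

char_transl_table = {
    # ... (earlier entries not shown)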
    '̃': None,
    '̆': None,
    'ҫ': 'ç',    # they look the same but they aren't
    '‘': None,
    '’': None,
    '‚': ',',
    '“': None,
    '”': None,
    '„': None,
    '…': None,
    '\u202f': None,
    '�': None
}

print('replacing characters in each document of the corpus')
corpus.replace_characters(char_transl_table)

print('these non-ASCII characters are left:')
pprint(corpus.unique_characters - set(string.printable))

#%% Correct contractions

# some contractions have a stray space in between, like "EU -Hilfen" where it should be "EU-Hilfen"
# correct this by applying a custom function with a regular expression (RE) to each document in the corpus
pttrn_contraction_ws = re.compile(r'(\w+)(\s+)(-\w+)')

print('correcting wrong contractions')
# in each document text `t`, remove the RE group 2 (the stray white space "(\s+)") for each match `m`
corpus.apply(lambda t: pttrn_contraction_ws.sub(lambda m: m.group(1) + m.group(3), t))

#%% Create a TMPreproc object for token processing
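
The three steps of this excerpt (translate characters, list the non-ASCII characters that remain, remove stray whitespace before a hyphen) can be tried in isolation with the standard library only. The sketch below is an approximation under stated assumptions: the sample documents are invented, only a few entries of the translation table are used, and plain dict comprehensions stand in for `corpus.replace_characters`, `corpus.unique_characters`, and `corpus.apply`, whose internals the source does not show.

```python
import re
import string

# Hypothetical sample documents; not taken from the corpus used in the script.
docs = {
    'doc1': 'Die EU -Hilfen wurden „beschlossen“…',
    'doc2': 'Straße und Bund -Länder',
}

# Step 1: delete or replace unwanted characters (a small subset of the table above).
table = str.maketrans({'„': None, '“': None, '…': None, '‚': ','})
docs = {label: text.translate(table) for label, text in docs.items()}

# Step 2: collect the characters outside string.printable that are still present.
leftover = set().union(*map(set, docs.values())) - set(string.printable)

# Step 3: remove the whitespace (group 2) between a word and a following "-word".
pttrn_contraction_ws = re.compile(r'(\w+)(\s+)(-\w+)')
docs = {
    label: pttrn_contraction_ws.sub(lambda m: m.group(1) + m.group(3), text)
    for label, text in docs.items()
}

print(sorted(leftover))
print(docs)
```

One caveat of the pattern as written: matches cannot overlap, so in a chain such as "A -B -C" only the first gap is closed in a single pass, because "-B" is consumed by the first match.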