import re
import nltk
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import SnowballStemmer

# (These steps run for each `document` inside the per-document cleaning loop
# started earlier; `stopset`, `df`, `X`, and `doc_length` are defined there.)

# Remove all single-character words
document = re.sub(r'\s[a-zA-Z]\s', ' ', document)

# Substitute multiple spaces with a single space
document = re.sub(r'\s+', ' ', document, flags=re.I)

# Tokenize
document = WordPunctTokenizer().tokenize(document)

# Remove stopwords
document = [word for word in document if word not in stopset]

# Stem each token
stemmer = SnowballStemmer('english')
document = [stemmer.stem(t) for t in document]
doc_length.append(len(document))
document = ' '.join(document)

# Remove single characters that may have been created by tokenization
document = re.sub(r'\s[a-zA-Z]\s', ' ', document)

# Normalize some words of interest
document = document.replace('bp', 'bloodpressure')
document = document.replace('blood pressure', 'bloodpressure')
document = document.replace('ordered', 'order')

# Substitute multiple spaces with a single space
document = re.sub(r'\s+', ' ', document, flags=re.I)

X.append(document)

# After the loop: store the cleaned text back in the dataframe
df['incident'] = X

# Most common features after the stemming pre-processing
tokens = df.incident.str.cat(sep=' ')
tokens = WordPunctTokenizer().tokenize(tokens)  # shows there are 1,297,146 words in this corpus

# Count how many unique words there are
unique_words = nltk.FreqDist(tokens)  # shows 21,116 unique words
top_words = unique_words.most_common(50)

# Plotting the most common words
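# The plotting code itself is not shown in the original; below is a minimal
# sketch using matplotlib (an assumption - the original may have used a
# different library or styling) to chart the 50 most common words:
import matplotlib.pyplot as plt

words, counts = zip(*top_words)
plt.figure(figsize=(12, 6))
plt.bar(words, counts)
plt.xticks(rotation=90)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('50 most common words after pre-processing')
plt.tight_layout()
plt.show()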