def tokenize(self):
    # Clean the raw corpus and return both the vocabulary and the sentence splits.
    data_cleaner = DataCleaner(self.corpus)
    all_word, all_sentence_split = data_cleaner.clean_content()
    print('all_word')
    print(all_word)
    # print('all_sentence_split')
    # print(all_sentence_split)
    return all_word, all_sentence_split
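The Tokenizer class that owns this method is not shown here; below is a minimal usage sketch, assuming its constructor stores the raw text on self.corpus and that a local corpus.txt file exists (both are assumptions, mirroring how Example #2 reads its corpus).

from tokenizer import Tokenizer

# Illustrative only: the constructor signature and the file path are assumed.
with open('corpus.txt', encoding='utf-8') as f:
    tokenizer = Tokenizer(f.read().lower())
all_word, all_sentence_split = tokenizer.tokenize()  # also prints the vocabulary as a side effect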
Example #2
from tokenizer import Tokenizer
from data_cleaner import DataCleaner
import pandas as pd
import numpy as np
import pickle as pk

corpus_file = '../../data/corpus.txt'
file_to_save_vocab = '../../results/tokenization/vocabulary.txt'

file_to_save_corpus = '../../results/tokenization/corpus_split.csv'
# read the corpus from the file
with open(corpus_file, encoding="utf-8") as f:
    corpus = f.read().lower()
    print('------------------CORPUS------------------')
data_cleaner = DataCleaner(corpus)
all_words, all_sentences_split = data_cleaner.clean_content()
print('------------------vocabulary------------------------')
print(len(all_words))
# print (all_sentences_split)
words_to_save = []
# write one vocabulary word per line; the with-block closes the file automatically
with open(file_to_save_vocab, 'w', encoding="utf8") as file:
    for word in all_words:
        file.write(word + '\n')
print(len(all_words))
print('DONE')
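The script defines file_to_save_corpus and imports pandas but never writes the sentence splits out. A possible continuation is sketched below, reusing the pd import from above and assuming all_sentences_split is a list of per-sentence token lists (the column name and the space-joined format are assumptions, not from the original).

# Assumed shape: all_sentences_split is a list of token lists, one list per sentence.
split_df = pd.DataFrame({
    'sentence': [' '.join(tokens) for tokens in all_sentences_split]
})
split_df.to_csv(file_to_save_corpus, index=False, encoding='utf-8')
print('saved', len(split_df), 'sentences to', file_to_save_corpus)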