def normalize(text):
    """Normalize *text* for downstream matching.

    Pipeline: collapse extra whitespace, replace punctuation with spaces,
    lower-case, remove English stop words, then unescape HTML entities.

    Relies on the external ``Normalizr`` library and the ``xstr`` helper
    (presumably coerces None to "" — confirm against its definition).

    Returns the normalized text as a plain ``str``.
    """
    normalizr = Normalizr(language='en')
    normalizations = [
        'remove_extra_whitespaces',
        ('replace_punctuation', {'replacement': ' '}),
        'lower_case',
        # BUG FIX: the original passed the *string* 'False', which is truthy,
        # so stop-word removal effectively ran with ignore_case enabled.
        # Pass the boolean False to get the behaviour the author spelled out.
        ('remove_stop_words', {'ignore_case': False}),
    ]
    h = HTMLParser()
    text = normalizr.normalize(xstr(text), normalizations)
    # NOTE(review): HTMLParser.unescape() is deprecated and removed in
    # Python 3.9+; migrate to html.unescape() when imports can be touched.
    return str(h.unescape(text))
def normalisation(tweet):
    """Clean a raw tweet and return it as one spell-corrected string.

    Steps: lower-case; strip @mentions, HTML tags and #hashtags; squeeze
    runs of a repeated character down to two; collapse spaces; run the
    Normalizr pipeline (URLs, punctuation, emojis, hyphens, symbols,
    accents, stop words, extra whitespace); finally spell-correct each
    surviving token via the external ``correction`` helper.
    """
    lowered = tweet.lower()
    # Remove @mentions, inline HTML tags, then #hashtags, in that order.
    without_mentions = re.sub(r'(?:@[\w_]+)', '', lowered)
    without_html = re.sub(r'<[^>]+>', '', without_mentions)
    without_hashtags = re.sub(r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", '', without_html)
    # "loooove" -> "loove": cap any repeated character at two occurrences.
    squeezed = re.sub(r'(.)\1+', r'\1\1', without_hashtags)
    collapsed = re.sub(' +', ' ', squeezed)

    pipeline = [
        ('replace_urls', {'replacement': ' '}),
        ('replace_punctuation', {'replacement': ' '}),
        ('replace_emojis', {'replacement': ' '}),
        ('replace_hyphens', {'replacement': ' '}),
        ('replace_symbols', {'replacement': ' '}),
        'remove_accent_marks',
        'remove_stop_words',
        'remove_extra_whitespaces',
    ]
    normalized = Normalizr(language='en').normalize(collapsed, pipeline)
    # Spell-correct token by token, then stitch back into a single string.
    return " ".join(correction(word) for word in normalized.split())
bannedWords = ["", "rt", "amp"] for x in range(0, len(content)): stringList.append([]) stringList[x] = content[x].split(" ") #Used to store the index of the tweet that contains a word in the corpus dxInCorpus = -1 for x in range(0, len(content)): if (x % 100 == 0): print("tweet " + str(x) + " of " + str(len(content))) tweetWords = stringList[x] numWords = len(tweetWords) for i in range(0, numWords): word = normalizr.normalize(stringList[x][i].lower()) stringList[x][i] = word #numWordsInCorpus = 0; for i in range(0, numWords): word = stringList[x][i] #if (word in crpNodeList): # numWordsInCorpus = numWordsInCorpus + 1; #if (numWordsInCorpus > 1): for i in range(0, numWords): firstWord = stringList[x][i] for j in range(i + 1, numWords): secondWord = stringList[x][j] w = 1 #if (firstWord in crpNodeList or secondWord in crpNodeList): #if (firstWord in crpNodeList and secondWord in crpNodeList): if graph.has_edge(firstWord, secondWord):
}), ('replace_emojis', { 'replacement': ' ' }), ('replace_hyphens', { 'replacement': ' ' }), ('replace_symbols', { 'replacement': ' ' }), 'remove_accent_marks', 'remove_stop_words', 'remove_extra_whitespaces', ] arq_2.write(normalizr.normalize(texto, normalizations)) arq_2.close() arq.close() #calculando a quantidade total de palavras válidas porém repetidas da base. TOTAL : 4650 ''' arq_2 = open("FINAL_Entretenimento.txt", 'w') st = "" for z in arq_2: st += z z = z.split() print (z) print (len(z))