def main():
    """Compute and output tf-idf scores for words across a set of subreddit corpora.

    Pipeline: load regionalism data, read the subreddit list, build one word
    frequency distribution per subreddit, prune words that are globally rare,
    compute an IDF score for each surviving word, then emit per-subreddit
    tf-idf results via outputResults().
    """
    print("doing stuff")
    readRegionalisms()

    # Materialize the subreddit set as a list: sets aren't in a stable
    # ordering, and calculateIDF expects just a freqdist, not a
    # (sub name, freqdist) tuple — so we keep names and dists in two
    # parallel, order-aligned sequences.
    subreddits = list(readSubredditSet())

    # One frequency distribution per subreddit, same order as `subreddits`.
    # TODO: figure out how to also do bigrams.
    frequencies = [getFrequency(name) for name in subreddits]

    # Fold the per-subreddit distributions into one global distribution.
    totalfreq = FreqDist()
    for freqdist in frequencies:
        totalfreq = freqdist + totalfreq

    # Candidate vocabulary: every word seen in any corpus.
    num_corpora = len(frequencies)
    all_words = set(totalfreq.keys())

    # Remove all words that occur, on average, less than once per corpus
    # (total count below the number of corpora). Based on
    # http://www.nltk.org/_modules/nltk/probability.html#FreqDist.hapaxes
    all_words -= {word for word in totalfreq.keys() if totalfreq[word] < num_corpora}

    # Report how many words the pruning dropped: sum the number of words
    # whose total frequency r fell in [0, num_corpora).
    rnrdict = totalfreq.r_Nr()
    numremoved = 0
    for r in range(num_corpora):
        numremoved += rnrdict.get(r, 0)
    print(f"removed {numremoved} words from the set of words processed due to low frequency.")
    del rnrdict, numremoved

    # IDF score for every surviving word.
    idfdict = {
        word: calculateIDF(frequencies, word, len(subreddits))
        for word in all_words
    }

    # Frequencies live in the freqdists, each word's idf in idfdict —
    # compute per-subreddit tf-idf and stream each result straight out.
    for name, freqdist in zip(subreddits, frequencies):
        outputResults((name, calcTfidf(freqdist, all_words, idfdict)))