f = c * r^(-s)

where s and c are parameters that depend on the language and the text.
If you take the logarithm of both sides of this equation, you get:

log f = log c - s log r

So if you plot log f versus log r, you should get a straight line with
slope -s and intercept log c.

Write a program that reads a text from a file, counts word frequencies,
and prints one line for each word, in descending order of frequency,
with log f and log r. Use the graphing program of your choice to plot
the results and check whether they form a straight line. Can you
estimate the value of s?

Solution: http://www.greenteapress.com/thinkpython/code/zipf.py. To make
the plots, you might have to install matplotlib (see
http://matplotlib.org/).
'''

if __name__ == '__main__':
    print "Exercise 13:"
    mylist = process_file('emma.txt')
    myhist = histogram(mylist)
    sorted_list = convert_to_sorted_list(myhist)
    freq_list = []
    for freq in sorted_list:
        freq_list.append(freq[0])
    for idx, freq in enumerate(freq_list):
        print idx, freq
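
# A minimal sketch of the log-log relation described in the exercise text:
# given word frequencies, build the (log r, log f) points and estimate s as
# the negated least-squares slope. The names zipf_points, estimate_s and
# toy_freqs are hypothetical and are not part of zipf.py; the toy data is
# made up so that the expected slope is known in advance. It does not read
# emma.txt and is not meant to replace the linked solution.

```python
from __future__ import print_function  # lets the sketch run on Python 2 and 3

import math


def zipf_points(freqs):
    """Return (log r, log f) pairs for frequencies sorted in descending order.

    Ranks start at 1, so log r is defined for every word.
    """
    ordered = sorted(freqs, reverse=True)
    return [(math.log(rank), math.log(f))
            for rank, f in enumerate(ordered, start=1)]


def estimate_s(points):
    """Least-squares slope of log f against log r, negated to give s."""
    n = float(len(points))
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx


# Hypothetical frequencies that follow f = c * r**(-s) with c = 1000, s = 1.
toy_freqs = [1000.0 / r for r in range(1, 51)]
s = estimate_s(zipf_points(toy_freqs))
print("estimated s:", round(s, 3))
```

# With real text the points will only roughly follow a line, so the fitted
# value of s is an estimate and depends on which ranks are included.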
You should attempt this exercise before you go on; then you can download
my solution from http://www.greenteapress.com/thinkpython/code/markov.py.
You will also need http://www.greenteapress.com/thinkpython/code/emma.txt.
'''

if __name__ == '__main__':
    sortedbookwordslist = convert_to_sorted_list(book)
    print sortedbookwordslist
    print len(book), "different words were used."
    compare_to_wordlist("words.txt")
    choose_from_hist(sample_hist)
    hist = process_file('emma.txt')
    words = process_file('words.txt')
    diff = missing_from_words(hist, words)
    print "The words in the book that aren't in the word list are:"
    for word in diff:
        print word,
    myhist = chapter11.histogram(book)
    print "\n random word:", pick_random_word(myhist)