Here is a description of each Python file:
- extract_text_from_rotten_tomatoes.py
Created a script named extract_text_from_rotten_tomatoes.py
This script defines a function that accepts the URL to a movie on RottenTomatoes. It then creates a
text file that includes the following information for each review in the first 2 review pages for
the movie:
- the name of the critic
- the rating: 'rotten', 'fresh', or 'NA' if the review doesn't have a rating.
- the source of the review (e.g., 'New York Daily News'), or 'NA' if the review doesn't have a source.
- the text of the review, or 'NA' if the review doesn't have text.
- the date of the review, or 'NA' if the review doesn't have a date.
The file includes one line for each review. The reviews in the file appear in the same
order as they do on the website. The 5 values for each review are written in
the order listed above and are separated by a TAB.
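The output format described above can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes each review has already been scraped into a dict, and the names FIELDS, format_review, and write_reviews are hypothetical. The scraping step is left out because it depends on the site's current HTML.

```python
# Hypothetical field names, in the order the values are written to the file.
FIELDS = ("critic", "rating", "source", "text", "date")


def format_review(review):
    """Return one TAB-separated line for a review dict, with 'NA' for missing values."""
    values = []
    for field in FIELDS:
        value = review.get(field) or ""
        # Collapse whitespace so a stray TAB or newline in the text cannot break the layout.
        value = " ".join(value.split())
        values.append(value or "NA")
    return "\t".join(values)


def write_reviews(reviews, path):
    """Write one line per review, keeping the order in which the reviews were given."""
    with open(path, "w", encoding="utf-8") as out:
        for review in reviews:
            out.write(format_review(review) + "\n")
```

Collapsing whitespace inside each value is a design choice made here so that every line always splits into exactly 5 fields.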
- webcounter.py
Created a script called webcounter.py
- The script defines a function run() with 3 parameters: a link to a webpage and two words, w1 and w2.
- The function returns a set of all the words in the webpage that have a higher frequency than w1 but a
lower frequency than w2.
- Case is ignored.
- All non-letter characters are removed before counting.
- Stopwords are ignored.
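The counting rules above can be sketched as follows. This is a rough illustration under stated assumptions: it works on text that has already been downloaded instead of on a link, the helper name words_between and the tiny STOPWORDS set are hypothetical, and non-letter characters are replaced with spaces so that neighboring words do not merge.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; the real script may use a larger one.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}


def words_between(text, w1, w2, stopwords=STOPWORDS):
    """Return the set of words more frequent than w1 but less frequent than w2."""
    # Ignore case and replace every non-letter character with a space.
    cleaned = re.sub(r"[^a-z]", " ", text.lower())
    counts = Counter(w for w in cleaned.split() if w not in stopwords)
    # A word that never appears (or is a stopword) has frequency 0.
    low, high = counts[w1.lower()], counts[w2.lower()]
    return {w for w, c in counts.items() if low < c < high}
```

For example, in a text where 'bird' appears once, 'dog' twice, and 'cat' three times, calling words_between(text, 'bird', 'cat') would return a set containing only 'dog'.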
- getngrams.py
My script defines the following function:
processSentence(sentence, posLex, negLex, tagger): The parameters of this function are a sentence (a
string), a set of positive words, a set of negative words, and a POS tagger. The function returns a
list with all the 4-grams in the sentence that have the following structure:
not <any word> <pos/neg word> <noun>
For example: not a good idea
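The pattern above can be sketched as follows. This is only an illustration under assumptions not stated in the description: the tagger is assumed to expose a tag(tokens) method that returns (word, tag) pairs with Penn Treebank tags (as NLTK tagger objects do), the sentence is tokenized by whitespace, and each 4-gram is returned as a single space-joined string.

```python
def processSentence(sentence, posLex, negLex, tagger):
    """Return all 4-grams of the form: not <any word> <pos/neg word> <noun>."""
    # Assumption: simple lowercase whitespace tokenization.
    tokens = sentence.lower().split()
    # Assumption: tagger.tag(tokens) returns a list of (word, tag) pairs.
    tagged = tagger.tag(tokens)
    matches = []
    for i in range(len(tagged) - 3):
        (w1, _), (w2, _), (w3, _), (w4, tag4) = tagged[i:i + 4]
        is_sentiment_word = w3 in posLex or w3 in negLex
        # Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'NN'.
        if w1 == "not" and is_sentiment_word and tag4.startswith("NN"):
            matches.append(" ".join((w1, w2, w3, w4)))
    return matches
```

With a tagger that labels 'idea' as a noun and a positive lexicon containing 'good', the sentence 'This is not a good idea' would yield a list with the single 4-gram 'not a good idea'.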