- pip installation
pip install nltk
- Downloading nltk's components
>>> import nltk
>>> nltk.download('all')
example = "Hello Miss. Purva Singh! How are you? We are very excited to meet you !!!!"
example_words = ("draw", "drawing", "drew", "draws")
S.No. | Terminology | Description | Python Library | Examples |
---|---|---|---|---|
1 | Tokenizing | Tokenizing splits a character sequence into groups (tokens). There are two types: 1. Sentence tokenizer 2. Word tokenizer | `from nltk.tokenize import sent_tokenize, word_tokenize` `sent_tokenize(example)` `word_tokenize(example)` | SENTENCE TOKENIZER - Hello Miss. Purva Singh! How are you? We are very excited to meet you !!!! WORD TOKENIZER - 'Hello','Miss','.','Purva','Singh','!','How','are','you','?','We','are', 'very','excited','to','meet','you','!','!','!','!' |
2 | Corpora | A corpus (plural: corpora) is a large collection of texts. | `import nltk.corpus` | Medical journals, presidential speeches, any body of English-language text |
3 | Lexicon | A lexicon is a dictionary of words and their meanings. | | bull - to a financial investor, the first meaning of the word "bull" is someone who is confident about the market. bull - also an animal |
4 | Stop Words | Stop words are the extra filler words in a sentence that we do not need; for data analysis they carry little value. | `from nltk.corpus import stopwords` `set(stopwords.words("english"))` | 'Hello', 'Miss', '.', 'Purva', 'Singh', '!', 'How', '?', 'We', 'excited', 'meet', '!', '!', '!', '!' |
5 | Stemming | Words can appear in several variations, for example because of tense. Stemming normalizes them by reducing each word to its stem. | `from nltk.stem import PorterStemmer` `ps = PorterStemmer()` | `ps.stem()` takes one word at a time. `[ps.stem(w) for w in example_words]` gives `['draw', 'draw', 'drew', 'draw']` |
6 | Tagging | Part-of-speech tagging labels the words in a sentence as nouns, adjectives, verbs (including tense), and so on. | `part_of_speech_tag = nltk.pos_tag(tokenized_words)` | (('PRESIDENT', 'NNP'), ('members', 'NNS'), ('W.', 'NNP'), ('THE', 'DT')) 1. NNP - proper noun 2. DT - determiner 3. NNS - plural noun |
7 | Chunking | Chunking groups words based on a regular expression over their part-of-speech tags. | `chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""` | (Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP) (Chunk ADDRESS/NNP) (Chunk A/NNP JOINT/NNP SESSION/NNP) |
8 | Chinking | Chinking excludes words from a chunk. The chink pattern is written inside outward-facing curly braces, `}chink regexp{`, on its own line of the grammar after the chunk rule. | `chunkGram = r"""Chunk: {<.*>+}` (new line) `}<VB.?>*<NNP>+<NN>?{"""` | (Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP) |
9 | Named entity recognition | The main idea behind named entity recognition is to chunk "entities" such as people, places, things, locations, monetary figures, and more. | `named_entity = nltk.ne_chunk(tagged)` | ORGANIZATION - Caplan and Gold, WHO. PERSON - Purva Singh. LOCATION - Bhilai, Bangalore. DATE - June, 2019-06-29. TIME - two fifty a m, 1:30 p.m. |
10 | Lemmatizing | Lemmatizing is similar to stemming, but every word it produces is an actual word, which is not guaranteed with stemming. The `lemmatize` function takes an optional parameter `pos` (part of speech), which defaults to noun. | `from nltk.stem import WordNetLemmatizer` `lemmatizer = WordNetLemmatizer()` `lemmatizer.lemmatize("pretty")` | `lemmatizer.lemmatize("pretty")` gives pretty. `lemmatizer.lemmatize("drawing", pos='a')` gives drawing. `lemmatizer.lemmatize("better", pos='a')` gives good. `lemmatizer.lemmatize("better")` gives better |
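
The stemming, chunking and chinking rows can be tried out with a minimal sketch like the one below. It needs only the `nltk` package itself, with no downloaded data, because the tagged sentence is written by hand (an assumption for illustration, standing in for real `nltk.pos_tag` output). The helper `chunk_words` is a hypothetical name, not part of NLTK.

```python
from nltk import RegexpParser
from nltk.stem import PorterStemmer

# Stemming: PorterStemmer.stem() takes one word at a time.
example_words = ("draw", "drawing", "drew", "draws")
ps = PorterStemmer()
stems = [ps.stem(word) for word in example_words]
print(stems)  # ['draw', 'draw', 'drew', 'draw']

# Hand-tagged tokens (hypothetical), standing in for nltk.pos_tag() output.
tagged = [
    ("PRESIDENT", "NNP"), ("GEORGE", "NNP"), ("W.", "NNP"), ("BUSH", "NNP"),
    ("addressed", "VBD"), ("the", "DT"), ("nation", "NN"),
]


def chunk_words(grammar, tagged_tokens):
    """Parse tagged tokens with a chunk grammar and return the words of each Chunk."""
    tree = RegexpParser(grammar).parse(tagged_tokens)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees(lambda t: t.label() == "Chunk")
    ]


# Chunking: group optional adverbs and verbs, one or more proper nouns,
# and an optional singular noun.
chunk_grammar = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""

# Chinking: chunk everything first, then exclude the same tag sequence.
# The chink rule }...{ sits on its own line of the grammar.
chink_grammar = r"""
Chunk: {<.*>+}
       }<VB.?>*<NNP>+<NN>?{
"""

chunks = chunk_words(chunk_grammar, tagged)
chinked = chunk_words(chink_grammar, tagged)
print(chunks)
print(chinked)
```

With real text, `tagged` would come from `nltk.pos_tag(nltk.word_tokenize(sentence))`, which requires the tokenizer and tagger data to be downloaded first.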
- Create a Twitter developer account.
- Create an app by filling in all the required details.
- The confirmation email sometimes lands in your spam folder.
- After creating the app, the keys and tokens section lists your consumer key, consumer secret key, token key and token secret key.
- One of the reasons the Stream API returns 401 is that the Twitter account's time zone and the Ubuntu machine's time zone are not in sync.
- To check the current time zone in Ubuntu, run the `date` command.
- To check the time zone of your Twitter account, follow these steps:
  - Go to Twitter.
  - Click on your profile -> Settings and privacy -> Timezone.
  - Set the time zone to match your Ubuntu machine's time zone.
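
As a rough Python equivalent of the `date` check above, the sketch below prints the machine's local time, UTC time and UTC offset using only the standard library, so the values can be compared with the account setting. The function name `local_clock_report` is hypothetical, and the sketch only reports the local clock; it does not talk to Twitter.

```python
from datetime import datetime, timezone


def local_clock_report():
    """Return the machine's local time, UTC time, UTC offset and zone name."""
    local_now = datetime.now().astimezone()  # aware datetime in the system time zone
    utc_now = local_now.astimezone(timezone.utc)
    return {
        "local": local_now.strftime("%Y-%m-%d %H:%M:%S"),
        "utc": utc_now.strftime("%Y-%m-%d %H:%M:%S"),
        "utc_offset": local_now.strftime("%z"),  # for example +0530
        "zone_name": local_now.tzname(),
    }


report = local_clock_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If the reported offset or zone name differs from the one set in the account, that is the mismatch the steps above are meant to fix.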