```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2;
# use get_feature_names_out() instead.
print(vectorizer.get_feature_names_out())
# Output: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']

print(X.toarray())
# Output: [[0 1 1 1 0 0 1 0 1]
#          [0 2 0 1 0 1 1 0 1]
#          [1 0 0 1 1 0 1 1 1]
#          [0 1 1 1 0 0 1 0 1]]
```
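To make the counting concrete, here is a minimal pure-Python sketch of what `fit_transform` does under the hood. It is a simplified assumption-laden model, not scikit-learn's implementation: it lowercases the text, tokenizes with CountVectorizer's default pattern (words of two or more characters), builds a sorted vocabulary, and counts occurrences per document.

```python
import re


def fit_transform(corpus):
    """Sketch of CountVectorizer.fit_transform (simplified):
    lowercase, tokenize, build a sorted vocabulary, count tokens."""
    token_pattern = re.compile(r"\b\w\w+\b")  # CountVectorizer's default
    docs = [token_pattern.findall(doc.lower()) for doc in corpus]
    vocab = sorted({token for doc in docs for token in doc})
    index = {token: i for i, token in enumerate(vocab)}
    # One row of zeros per document, one column per vocabulary entry.
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for token in doc:
            row[index[token]] += 1
    return vocab, matrix


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocab, matrix = fit_transform(corpus)
print(vocab)
# ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(matrix[1])  # "document" appears twice in the second document
# [0, 2, 0, 1, 0, 1, 1, 0, 1]
```

This reproduces the matrix above; the real CountVectorizer additionally returns a SciPy sparse matrix and supports options like `stop_words` and `max_features`.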
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love cats!", "Dogs are great too.", "Fish are awesome."]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# Output: ['are' 'are awesome' 'are great' 'awesome' 'cats' 'dogs' 'dogs are'
#          'fish' 'fish are' 'great' 'great too' 'love' 'love cats' 'too']

print(X.toarray())
# Output: [[0 0 0 0 1 0 0 0 0 0 0 1 1 0]
#          [1 0 1 0 0 1 1 0 0 1 1 0 0 1]
#          [1 1 0 1 0 0 0 1 1 0 0 0 0 0]]
```

In this example, we have a corpus of three documents that discuss different animals. We specify an n-gram range of 1 to 2, which means that the CountVectorizer extracts both unigrams and bigrams. The resulting sparse matrix `X` has 3 rows (one per document) and 14 columns (one per feature). Note that "I" does not appear in the vocabulary because the default tokenizer ignores single-character tokens. The `get_feature_names_out()` method returns the unigrams and bigrams extracted from the corpus, and `toarray()` converts the sparse matrix to a dense NumPy array that is easier to read.
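The n-gram expansion itself is simple to sketch in pure Python: for each `n` in the range, slide a window of `n` consecutive tokens across the document and join each window with spaces. This is an illustrative simplification of what `ngram_range=(1, 2)` does, not scikit-learn's code.

```python
import re


def ngrams(tokens, ngram_range=(1, 2)):
    """For each n in ngram_range, emit every run of n consecutive
    tokens joined by spaces (unigrams, then bigrams, and so on)."""
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


token_pattern = re.compile(r"\b\w\w+\b")  # default tokenizer, 2+ chars
corpus = ["I love cats!", "Dogs are great too.", "Fish are awesome."]
docs = [ngrams(token_pattern.findall(doc.lower())) for doc in corpus]
vocab = sorted({gram for doc in docs for gram in doc})
print(len(vocab))  # 14
print(vocab)
# ['are', 'are awesome', 'are great', 'awesome', 'cats', 'dogs',
#  'dogs are', 'fish', 'fish are', 'great', 'great too', 'love',
#  'love cats', 'too']
```

Raising the upper bound (e.g. `ngram_range=(1, 3)`) adds trigrams as well; the vocabulary grows quickly with longer n-grams, which is why larger ranges are usually paired with pruning options such as `min_df`.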