from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
from sklearn.feature_extraction.text import CountVectorizer

# corpus is the list of documents defined in the previous example
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())

In this example, we combine CountVectorizer with the `ngram_range` parameter to build a vocabulary of both unigrams and bigrams. We again instantiate CountVectorizer, this time passing `ngram_range=(1,2)`, call `fit_transform` on the corpus to produce the sparse count matrix, and then print the learned feature names with `get_feature_names_out`. Older scikit-learn releases exposed this method as `get_feature_names`, which was deprecated in version 1.0 and removed in 1.2.

Overall, CountVectorizer is a quick way to turn raw text into a sparse matrix of token counts that other scikit-learn estimators can consume.