Python TfidfVectorizer.fit_transfrom Examples

Programming Language: Python

Namespace/Package Name: sklearn.feature_extraction.text

Class/Type: TfidfVectorizer

Method/Function: fit_transfrom

Examples at hotexamples.com: 1

Python TfidfVectorizer.fit_transfrom - 1 example found. This is a real-world Python example of sklearn.feature_extraction.text.TfidfVectorizer extracted from an open source project. Note that `fit_transfrom` is a misspelling: the actual scikit-learn method is `fit_transform`, and calling the misspelled name raises an AttributeError.

Example #1

File: similarity.py Project: vipulSharma18/Task_Practice

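import json

from sklearn.feature_extraction.text import TfidfVectorizer

# NOTE: issues_open, issues_closed and file_combined are assumed to be defined earlier in the file (not shown in this excerpt)
issues_combined = json.loads(file_combined.read())

# Extract only the issue bodies from the dictionaries into all_bodies.
# A Python list keeps its elements in insertion order, see: https://stackoverflow.com/questions/13694034/is-a-python-list-guaranteed-to-have-its-elements-stay-in-the-order-they-are-inse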

all_bodies = []
# To check for duplicates within issues_combined only, remove the following loop over issues_open
for issue, body in issues_open.items():
	all_bodies.append(body)
for issue, body in issues_combined.items():
	all_bodies.append(body)

# all_bodies contains redundant entries, but building it this way keeps the code easy to follow.
# Note that duplicated bodies do change the document frequencies, and therefore the IDF weights,
# so scores can differ slightly from those computed on a deduplicated corpus.
vectorizer = TfidfVectorizer(tokenizer=None)   # assumes the input files have already been normalized

X = vectorizer.fit_transform(all_bodies).todense()
# X now holds the TF-IDF representation of every issue body, one row per entry of all_bodies,
# so a row can be looked up directly instead of re-transforming a body each time it is compared

# Aliases for the dictionaries, so the code below does not need to be refactored
open_dict = issues_open
closed_dict = issues_closed
combined_dict = issues_combined

pos_open = 0   # index of the open issue currently under consideration
for opn_issue_num, open_body in open_dict.items():
	pos_combined = 0   # index of the combined issue currently under consideration
	for cmb_issue_num, combined_body in combined_dict.items():
		if cmb_issue_num == opn_issue_num:
			continue
		# Rows 0 to len(open_dict) - 1 of X hold the transformed bodies of the open issues
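
The excerpt stops before any similarity is computed. As a rough, self-contained sketch of how the row layout described above could be used, the following compares the open-issue rows against the combined-issue rows with cosine similarity. The sample dictionaries, the use of `cosine_similarity`, and the `best_matches` structure are assumptions made for illustration; they are not taken from the original project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the issue dictionaries (issue number -> body text).
issues_open = {
    "101": "app crashes when saving a file",
    "102": "dark mode colors are wrong",
}
issues_combined = {
    "101": "app crashes when saving a file",
    "102": "dark mode colors are wrong",
    "57": "crash while saving file to disk",
}

# Same row layout as in the excerpt: open-issue bodies first, then combined bodies.
all_bodies = list(issues_open.values()) + list(issues_combined.values())

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(all_bodies)  # sparse matrix, one row per body

n_open = len(issues_open)
# Rows: open issues. Columns: combined issues.
scores = cosine_similarity(X[:n_open], X[n_open:])

open_ids = list(issues_open)
combined_ids = list(issues_combined)

# For each open issue, find the most similar combined issue other than itself.
best_matches = {}
for i, open_id in enumerate(open_ids):
    best_id, best_score = None, -1.0
    for j, combined_id in enumerate(combined_ids):
        if combined_id == open_id:
            continue  # skip comparing an issue with itself
        if scores[i, j] > best_score:
            best_id, best_score = combined_id, scores[i, j]
    best_matches[open_id] = (best_id, best_score)

print(best_matches)
```

Computing the whole score matrix in one call avoids tracking `pos_open` and `pos_combined` by hand, since the row and column indices come directly from `enumerate`.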