
jgarciab/matchString


Database merging and string matching

Javier Garcia-Bernardo, 2017

CODE AND FIGURES HERE: match_strings.ipynb

TODO:

  • Make it more elegant/flexible; this is recycled code from many years ago.
  • Scale to big data by avoiding the comparison of every name in database 1 with every name in database 2. This can be achieved neatly with LSH forests (see here for the current implementation).
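The idea behind the second TODO item can be illustrated with a small candidate-generation sketch. It uses plain MinHash with banding over character bigrams, which is a simpler relative of an LSH forest and is not the author's implementation; the function names, the bigram choice, and the parameters `num_hashes` and `rows_per_band` are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def bigrams(name):
    """Set of character bigrams of a lower-cased name (never empty)."""
    s = name.lower()
    return {s[i:i + 2] for i in range(max(len(s) - 1, 1))}

def minhash(grams, num_hashes=32):
    """MinHash signature: for each seed, the smallest hash over all bigrams."""
    return tuple(
        min(int(hashlib.md5(f"{seed}:{g}".encode()).hexdigest(), 16) for g in grams)
        for seed in range(num_hashes)
    )

def candidate_pairs(names1, names2, num_hashes=32, rows_per_band=2):
    """Return (i, j) index pairs whose signatures agree on at least one band."""
    # Index database 2: every band of every signature goes into a bucket.
    buckets = defaultdict(list)
    for j, name in enumerate(names2):
        sig = minhash(bigrams(name), num_hashes)
        for b in range(0, num_hashes, rows_per_band):
            buckets[(b, sig[b:b + rows_per_band])].append(j)
    # Query with database 1: only names sharing a bucket become candidates.
    pairs = set()
    for i, name in enumerate(names1):
        sig = minhash(bigrams(name), num_hashes)
        for b in range(0, num_hashes, rows_per_band):
            for j in buckets.get((b, sig[b:b + rows_per_band]), []):
                pairs.add((i, j))
    return pairs

# Hypothetical names; only the returned pairs would be passed to the classifier.
names1 = ["Acme Corporation"]
names2 = ["acme corporation", "Zebra Ltd"]
pairs = candidate_pairs(names1, names2)
print(sorted(pairs))
```

Names with similar bigram sets are likely, but not guaranteed, to share a bucket, so this acts as a lossy pre-filter: more bands raise recall at the cost of more candidate pairs.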

Requirements:

  1. Libraries
pip install distance numpy pandas matplotlib scikit-learn seaborn python-Levenshtein
  2. Train and test set: two files with three columns each (string1, string2, 0/1 for match)
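As a minimal sketch of that three-column layout, the snippet below writes a few labeled pairs and reads them back. It assumes tab-separated files without a header row (the `sep="\t"` argument in the run example suggests tabs, but the header-less layout is a guess), and the company names are made up.

```python
import csv
import io

# Hypothetical example rows: (string1, string2, 1 if they match else 0).
rows = [
    ("Acme Corp.", "ACME Corporation", 1),
    ("Acme Corp.", "Apex Holdings", 0),
    ("Globex B.V.", "Globex BV", 1),
]

# Write the rows as tab-separated text (an in-memory buffer stands in for train.csv).
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)

# Read them back and convert the label column to int.
buffer.seek(0)
pairs = [(s1, s2, int(label)) for s1, s2, label in csv.reader(buffer, delimiter="\t")]
print(pairs)
```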

How to run it:

database1 = "./D/database_1.csv"
database2 = "./D/database_2.csv"
train_data_file = "./D/train.csv"
test_data_file = "./D/test.csv"

tfidf_matrix_train,dictTrain,tfidf_matrix_trainBigrams,dictTrainBigrams,lenGram = createTFIDF(database1,database2)
clf,clf2 = train(train_data_file,tfidf_matrix_train,dictTrain,tfidf_matrix_trainBigrams,dictTrainBigrams,lenGram,sep="\t")
predict = test(test_data_file,tfidf_matrix_train,dictTrain,tfidf_matrix_trainBigrams,dictTrainBigrams,lenGram,clf,clf2,sep="\t")
plot(predict)
  • You can then use clf (the SVM) to predict matches between any two strings. Use the ROC curve plot to choose your threshold (or let the algorithm find it, although the result will depend on your training set).
distances = find_distances(st1,st2)  # distance features for the pair (st1, st2)
clf.decision_function(np.array(distances,dtype=float))  # assumes these features are what the SVM scores
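One way to turn the ROC curve into a threshold is to pick the cut-off that maximizes Youden's J (true positive rate minus false positive rate). The sketch below does this in plain Python; the criterion is an assumption rather than necessarily what the notebook does, and the scores and labels are invented stand-ins for `clf.decision_function` outputs on a labeled test set.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cut-off."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted, labels))
        points.append((threshold, fp / negatives, tp / positives))
    return points

def best_threshold(scores, labels):
    """Pick the cut-off maximizing Youden's J = TPR - FPR."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]

# Hypothetical decision_function outputs and true labels for six test pairs.
scores = [2.1, 1.3, 0.4, -0.2, -0.9, -1.7]
labels = [1, 1, 1, 0, 0, 0]
threshold = best_threshold(scores, labels)
print(threshold)
```

Pairs whose score is at or above the chosen threshold would then be treated as matches. A stricter threshold trades recall for precision, which may be preferable when false merges are costly.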

About

Comparison of string-matching methods in Python
