A tool to predict a suitable place for a product in an existing taxonomy

caldweln/taxonomy-predict

taxonomy-predict

A tool to predict a suitable place for a product in an existing taxonomy.

Using data extracted from a product database, taxonomy-predict fits a tree of classifiers to the already categorized products.

This tree can then be used to classify uncategorized products.

This is a work in progress; preliminary results are below.
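
The idea of a tree of classifiers can be illustrated with a minimal sketch. This is not the repository's implementation: `TaxonomyNode` and `TokenVoteClassifier` are hypothetical names, product text is assumed to be the only feature, and the stand-in classifier only mimics a scikit-learn-style `fit`/`predict` interface so that the example runs without dependencies. Each node trains one classifier to choose among its child categories, and prediction walks from the root down to a leaf.

```python
from collections import Counter, defaultdict


class TokenVoteClassifier:
    """Tiny stand-in classifier with a scikit-learn-like fit/predict interface."""

    def fit(self, texts, labels):
        # Count how often each token appears under each label.
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.lower().split())
        return self

    def predict(self, texts):
        # Pick the label whose training tokens overlap most with the text.
        def score(label, text):
            return sum(self.token_counts[label][t] for t in text.lower().split())
        return [max(self.token_counts, key=lambda lab: score(lab, text))
                for text in texts]


class TaxonomyNode:
    """One taxonomy node; its classifier chooses among the child categories."""

    def __init__(self, make_classifier=TokenVoteClassifier):
        self.make_classifier = make_classifier
        self.classifier = None
        self.children = {}

    def fit(self, texts, paths):
        # Keep only products that still have a category below this node.
        pairs = [(t, p) for t, p in zip(texts, paths) if p]
        if not pairs:
            return self
        labels = [p[0] for _, p in pairs]
        if len(set(labels)) > 1:
            # Several children: train a classifier to pick between them.
            self.classifier = self.make_classifier().fit(
                [t for t, _ in pairs], labels)
        for label in set(labels):
            subset = [(t, p[1:]) for t, p in pairs if p[0] == label]
            child = TaxonomyNode(self.make_classifier)
            child.fit([t for t, _ in subset], [p for _, p in subset])
            self.children[label] = child
        return self

    def predict(self, text):
        # Walk down the tree, choosing one child per level.
        path, node = [], self
        while node.children:
            if node.classifier is not None:
                label = node.classifier.predict([text])[0]
            else:
                label = next(iter(node.children))  # single child, nothing to choose
            path.append(label)
            node = node.children[label]
        return path


texts = ["dark chocolate bar", "milk chocolate bar", "orange juice", "apple juice"]
paths = [["snacks", "chocolate", "dark"], ["snacks", "chocolate", "milk"],
         ["drinks", "juice", "orange"], ["drinks", "juice", "apple"]]

tree = TaxonomyNode().fit(texts, paths)
predicted = tree.predict("organic apple juice")
print(predicted)  # ['drinks', 'juice', 'apple']
```

A real classifier can be plugged in through `make_classifier`, provided it accepts the same inputs. This sketch always descends to a leaf, so it does not handle products whose correct category is an inner node.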

Setup

See setup.txt

Results

Training on a dataset of 55K products, with 5% reserved for validation, gave the following results:

[Figure: results]

[Figure: category length distribution]

Classifier              Train Time
LogisticRegression      14m44s
LinearSVC               13m42s
RandomForestClassifier  57m14s
MultinomialNB           15m02s
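
For reference, the 5% validation holdout mentioned above can be sketched as follows. The function name `holdout_split`, the fixed seed, and the use of a simple random shuffle are assumptions for illustration; the tool may split its data differently.

```python
import random


def holdout_split(items, validation_fraction=0.05, seed=0):
    """Shuffle the items and reserve a fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# 55K stand-in product ids; real products would be database records.
train, validation = holdout_split(range(55_000))
print(len(train), len(validation))  # 52250 2750
```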

Configuration

The classifier to use, file locations, and database settings are configured in etc/config_openfoodfacts.

Results were achieved with the following classifier configurations:

classifier_module='sklearn.linear_model',
classifier_name='LogisticRegression',
classifier_params={'C':1,'class_weight':'balanced'}

classifier_module='sklearn.svm',
classifier_name='LinearSVC',
classifier_params={'C':1,'class_weight':'balanced'}

classifier_module='sklearn.ensemble',
classifier_name='RandomForestClassifier',
classifier_params={'n_estimators':10}

classifier_module='sklearn.naive_bayes',
classifier_name='MultinomialNB',
classifier_params={'alpha':1}
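
One plausible way to turn such a module/name/params triple into a classifier instance is a dynamic import. The helper `build_classifier` below is a hypothetical sketch of that pattern, not necessarily how taxonomy-predict loads its configuration.

```python
import importlib


def build_classifier(classifier_module, classifier_name, classifier_params):
    """Import the module, look up the class by name, and instantiate it."""
    module = importlib.import_module(classifier_module)
    classifier_class = getattr(module, classifier_name)
    return classifier_class(**classifier_params)


config = dict(
    classifier_module='sklearn.linear_model',
    classifier_name='LogisticRegression',
    classifier_params={'C': 1, 'class_weight': 'balanced'},
)

try:
    classifier = build_classifier(**config)
except ImportError:
    classifier = None  # scikit-learn is not installed in this environment
```

With this pattern, switching classifiers only requires changing the three configuration values, since nothing in the code names a specific class.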

Notes

  • results were obtained on an Open Food Facts MongoDB data dump
  • only products whose category hierarchy has a length of at least 5 are considered
  • LogisticRegression requires about 10 GB of RAM on Open Food Facts (OFF) data
    • however, other classifiers may use considerably more
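
The hierarchy-length filter from the notes above could look roughly like this. The field name `categories_hierarchy` and the helper `has_deep_hierarchy` are assumptions for illustration; check the actual product documents for the field the tool reads.

```python
def has_deep_hierarchy(product, min_length=5, field="categories_hierarchy"):
    """Return True if the product's category hierarchy has enough levels."""
    return len(product.get(field) or []) >= min_length


# Made-up example records, not real Open Food Facts data.
products = [
    {"name": "a", "categories_hierarchy": ["c1", "c2", "c3", "c4", "c5"]},
    {"name": "b", "categories_hierarchy": ["c1", "c2"]},
    {"name": "c"},  # no category information at all
]

usable = [p["name"] for p in products if has_deep_hierarchy(p)]
print(usable)  # ['a']
```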

Disclaimer

No warranties, provided 'AS-IS'.
