A test web scraper and NLP classifier using BeautifulSoup and NLTK.
This test program uses product descriptions and category hierarchies scraped from Macy's to train a classifier that predicts the category of a given product description.
scrape.py runs the product scraper.
nlp.py runs the classification analysis.
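The feature names in the sample output below (`contains(tax)`, `contains(warranti)`) suggest one boolean feature per word in a description, with the words stemmed. A minimal sketch of that kind of feature extraction, using only the standard library, might look like this. The function name `description_features`, the regex tokenizer, and the small stop-word set are assumptions for illustration; stemming is left out, and nlp.py may do this differently.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "in", "for"}


def description_features(description):
    """Map a product description to {'contains(word)': True} features."""
    # Lowercase, then keep runs of letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return {f"contains({t})": True for t in tokens if t not in STOP_WORDS}


features = description_features("Stainless steel watch with a 42mm dial")
print(sorted(features))
```

Each description becomes a dictionary of this shape, paired with its category label, before training.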
# Install dependencies:
pip install -r requirements.txt
# For usage info:
python scrape.py --help
python nlp.py --help
# Run scraper (optionally specify only the categories of interest;
# see the --help option)
python scrape.py all
# Get coffee :)
# Run analysis
python nlp.py --category 1
>python nlp.py -c 1
Loading data from directory: ./data
Classifying product descriptions up to a product category hierarchy depth of 1.
9886 Data Samples.
Generating features...
Training Classifier...
Testing Accuracy...
Most Informative Features
contains(item) = False Beauty : Jewelr = 4171.1 : 1.0
contains(tax) = True Beauty : Jewelr = 3562.5 : 1.0
contains(card) = True Beauty : Jewelr = 3562.5 : 1.0
contains(minimum) = True Beauty : Jewelr = 3562.5 : 1.0
contains(veneer) = True furnit : Jewelr = 3529.9 : 1.0
contains(42) = True furnit : Jewelr = 3529.9 : 1.0
contains(consist) = True Kitche : Jewelr = 3355.6 : 1.0
contains(bevel) = True Kitche : Jewelr = 3355.6 : 1.0
contains(stain) = True Kitche : Jewelr = 3355.6 : 1.0
contains(area) = True Kitche : Jewelr = 3355.6 : 1.0
contains(feet) = True Kitche : Jewelr = 3355.6 : 1.0
contains(fair) = True Holida : Jewelr = 3299.2 : 1.0
contains(artisan) = True Holida : Jewelr = 3299.2 : 1.0
contains(price) = True Holida : Jewelr = 3299.2 : 1.0
contains(valu) = True Beauty : Jewelr = 2345.1 : 1.0
contains(72) = True Shower : Jewelr = 2284.1 : 1.0
contains(shower) = True Shower : Jewelr = 2284.1 : 1.0
contains(checkout) = True Beauty : Jewelr = 2137.5 : 1.0
contains(home) = True Kitche : Jewelr = 2013.4 : 1.0
contains(composit) = True Kitche : Jewelr = 2013.4 : 1.0
contains(origin) = True Kitche : Jewelr = 2013.4 : 1.0
contains(curl) = True Kitche : Jewelr = 2013.4 : 1.0
contains(charg) = True Beauty : Jewelr = 1634.2 : 1.0
contains(polyest) = True Shower : Jewelr = 1631.5 : 1.0
contains(import) = True Shower : Jewelr = 1631.5 : 1.0
Accuracy: 0.977755308392
A few examples:
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: for the home | Prediction: for the home
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: for the home | Prediction: for the home
Actual: for the home | Prediction: for the home
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: for the home | Prediction: for the home
Actual: Bed & Bath | Prediction: Bed & Bath
Actual: Bed & Bath | Prediction: Bed & Bath
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: Jewelry & Watches
Actual: Bed & Bath | Prediction: Bed & Bath
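Each row of "Most Informative Features" reads as a likelihood ratio. For example, `contains(tax) = True Beauty : Jewelr = 3562.5 : 1.0` says that, under the trained model, the feature value `contains(tax) = True` is about 3562 times more probable among descriptions in the first category than among those in the second (category names are cut to six characters in this listing). A sketch of how such a ratio can be computed from per-category counts follows. The counts are made up, and the add-0.5 smoothing is an assumption; the estimator nlp.py actually uses is not shown here.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """P(feature value | category) with additive (add-gamma) smoothing."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How much more probable the feature value is in category A than in B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Made-up counts: the word appears in 90 of 100 descriptions in category A
# and in 0 of 4000 descriptions in category B.
ratio = likelihood_ratio(90, 100, 0, 4000)
print(f"{ratio:.1f} : 1.0")
```

Smoothing keeps the ratio finite when a word never appears in one category, which is one way words seen in only a single category can end up with very large ratios.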
>python nlp.py -c 2
Loading data from directory: ./data
Classifying product descriptions up to a product category hierarchy depth of 2.
9886 Data Samples.
Generating features...
Training Classifier...
Testing Accuracy...
Most Informative Features
contains(tax) = True See Al : FINE J = 3886.8 : 1.0
contains(card) = True See Al : FINE J = 3886.8 : 1.0
contains(minimum) = True See Al : FINE J = 3886.8 : 1.0
contains(warranti) = True Watche : FINE J = 3768.0 : 1.0
contains(dial) = True Watche : FINE J = 3539.6 : 1.0
contains(valu) = True Skin C : FINE J = 3350.5 : 1.0
contains(consist) = True Kitche : FINE J = 3108.3 : 1.0
contains(area) = True Kitche : FINE J = 3108.3 : 1.0
contains(price) = True Holida : FINE J = 3056.0 : 1.0
contains(artisan) = True Holida : FINE J = 3056.0 : 1.0
contains(clock) = True Watche : FINE J = 2854.5 : 1.0
contains(numer) = True Watche : FINE J = 2854.5 : 1.0
contains(month) = True Makeup : FINE J = 2397.8 : 1.0
contains(item) = False GIFTS : FINE J = 2286.3 : 1.0
contains(import) = True Slipco : FINE J = 2265.3 : 1.0
contains(polyest) = True Shower : FINE J = 2115.7 : 1.0
contains(shower) = True Shower : FINE J = 2115.7 : 1.0
contains(help) = True Makeup : FINE J = 2078.1 : 1.0
contains(tip) = True Skin C : FINE J = 2058.7 : 1.0
contains(safe) = True Casual : FINE J = 2028.9 : 1.0
contains(tuck) = True Slipco : FINE J = 2009.2 : 1.0
contains(home) = True Quilts : FINE J = 1961.8 : 1.0
contains(sensit) = True GIFTS : FINE J = 1951.7 : 1.0
contains(case) = True Watche : FINE J = 1941.1 : 1.0
contains(origin) = True Kitche : FINE J = 1865.0 : 1.0
Accuracy: 0.864509605662
A few examples:
Actual: Home Decor | Prediction: Home Decor
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: Jewelry & Watches | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: Jewelry & Watches
Actual: Home Decor | Prediction: Home Decor
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: Bedding Basics | Prediction: Bedding Basics
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: Home Decor | Prediction: Home Decor
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: Jewelry & Watches
Actual: Jewelry & Watches | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
Actual: FINE JEWELRY | Prediction: FINE JEWELRY
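The `-c` option sets how deep into the category hierarchy the label is taken. In the depth-2 run above, some items are still labeled with the top-level "Jewelry & Watches", which is consistent with a product whose hierarchy is shallower than the requested depth keeping its deepest available category. A sketch of that rule, under that assumption, is below. The function name `label_at_depth` and the example paths are hypothetical, and nlp.py may derive labels differently.

```python
def label_at_depth(category_path, depth):
    """Return the category at the given 1-based depth, or the deepest one available."""
    if not category_path:
        raise ValueError("category_path must not be empty")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Clamp the depth to the length of the path, then convert to a 0-based index.
    index = min(depth, len(category_path)) - 1
    return category_path[index]


# Hypothetical category paths, ordered from the top level down.
deep_path = ["Jewelry & Watches", "FINE JEWELRY", "Earrings"]
shallow_path = ["Jewelry & Watches"]

print(label_at_depth(deep_path, 1))     # Jewelry & Watches
print(label_at_depth(deep_path, 2))     # FINE JEWELRY
print(label_at_depth(shallow_path, 2))  # Jewelry & Watches
```

Under this reading, a parent category and its own subcategory can both appear as labels at the same depth, which would account for the "Jewelry & Watches" versus "FINE JEWELRY" confusions in the examples.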
>python nlp.py -c 3
Loading data from directory: ./data
Classifying product descriptions up to a product category hierarchy depth of 3.
9886 Data Samples.
Generating features...
Training Classifier...
Testing Accuracy...
Most Informative Features
contains(item) = False Collec : Earrin = 1339.8 : 1.0
contains(tax) = True SHOP A : Earrin = 1308.8 : 1.0
contains(card) = True SHOP A : Earrin = 1308.8 : 1.0
contains(charg) = True SHOP A : Earrin = 1308.8 : 1.0
contains(minimum) = True SHOP A : Earrin = 1308.8 : 1.0
contains(clock) = True Clocks : Neckla = 1292.9 : 1.0
contains(import) = True Slipco : Earrin = 1291.9 : 1.0
contains(bracelet) = True Bracel : Neckla = 1269.4 : 1.0
contains(limit) = True Watche : Neckla = 1252.4 : 1.0
contains(cotton) = True Bath R : Neckla = 1197.7 : 1.0
contains(heat) = True Hair C : Earrin = 1163.4 : 1.0
contains(artisan) = True Gifts : Earrin = 1141.0 : 1.0
contains(ornament) = True Holida : Neckla = 1124.0 : 1.0
contains(half) = True Gifts : Earrin = 1070.8 : 1.0
contains(receiv) = True Gifts : Neckla = 1052.3 : 1.0
contains(origin) = True Kitche : Earrin = 1051.3 : 1.0
contains(comfort) = True Kitche : Earrin = 1051.3 : 1.0
contains(100) = True Kitche : Earrin = 1051.3 : 1.0
contains(...) = True Kitche : Earrin = 1051.3 : 1.0
contains(use) = True Skin C : Neckla = 1039.0 : 1.0
contains(consist) = True Kitche : Neckla = 1033.1 : 1.0
contains(slip) = True Kitche : Neckla = 1033.1 : 1.0
contains(brush) = True Skin C : Neckla = 1012.7 : 1.0
contains(candl) = True Candle : Neckla = 948.8 : 1.0
contains(aa) = True Clocks : Earrin = 847.0 : 1.0
Accuracy: 0.866531850354
A few examples:
Actual: Rings | Prediction: Rings
Actual: Shower Curtains & Accessories | Prediction: Shower Curtains & Accessories
Actual: Earrings | Prediction: Earrings
Actual: Earrings | Prediction: Jewelry & Watches
Actual: Bath Towels | Prediction: Bath Towels
Actual: Jewelry & Watches | Prediction: Earrings
Actual: Rings | Prediction: Rings
Actual: Earrings | Prediction: Earrings
Actual: Necklaces | Prediction: Necklaces
Actual: Bracelets | Prediction: Bracelets
Actual: Bath Towels | Prediction: Bath Towels
Actual: Earrings | Prediction: Earrings
Actual: Necklaces | Prediction: Necklaces
Actual: Candles & Home Fragrance | Prediction: Candles & Home Fragrance
Actual: Bracelets | Prediction: Bracelets
Actual: Necklaces | Prediction: Necklaces
Actual: Earrings | Prediction: Earrings
Actual: Earrings | Prediction: Earrings
Actual: Bowls & Vases | Prediction: Collections
Actual: Hair Care | Prediction: Hair Care
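The reported accuracy is presumably the fraction of held-out descriptions whose predicted category matches the actual one. Each of the three printed values is reproduced by a whole number of correct predictions out of 989, roughly 10% of the 9886 samples, so a 90/10 train/test split is a plausible reading, though the split nlp.py uses is not stated. The sketch below checks that arithmetic. The test-set size and the counts of correct predictions are inferred from the printed accuracies, not taken from the program.

```python
def accuracy(correct, total):
    """Fraction of test items whose predicted category matches the actual one."""
    return correct / total


TEST_SIZE = 989  # assumption: inferred from the printed accuracies

# depth -> (inferred number of correct predictions, accuracy printed by nlp.py)
runs = {
    1: (967, 0.977755308392),
    2: (855, 0.864509605662),
    3: (857, 0.866531850354),
}

for depth, (correct, reported) in runs.items():
    computed = accuracy(correct, TEST_SIZE)
    print(f"depth {depth}: {correct}/{TEST_SIZE} = {computed:.12f} (reported {reported})")
```

Accuracy drops from about 98% at depth 1 to about 86-87% at depths 2 and 3, where there are more categories and some of them overlap.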