Kaggle competition: Avito Demand Prediction Challenge
Data cleanup
- replace_na(list(image_top_1 = -1, price = -1)) %T>%
- text data (**** should you do it? ****): str_to_lower(txt) %>% str_replace_all("[^[:alpha:]]", " ") %>% str_replace_all("\\s+", " ") %>%
- price = log1p(price),
- text features: char_count, word_count, word_density, punctuation_count (see https://www.kaggle.com/codename007/avito-eda-fe-time-series-dt-visualization)
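Rough pandas version of the cleanup steps above, as a sketch rather than the code from the linked kernel. Assumptions: the helper names (clean_numeric, clean_text, add_text_stats) are made up here, word_density is taken to be char_count / (word_count + 1), and punctuation means ASCII punctuation only. One thing to watch: log1p(-1) is -inf, so the sketch applies log1p before filling missing prices with -1.

```python
import re
import string

import numpy as np
import pandas as pd

PUNCTUATION = set(string.punctuation)  # ASCII punctuation only (assumption)


def clean_numeric(df):
    """Log-transform price, then mark missing values with -1."""
    df = df.copy()
    # log1p first: NaN stays NaN, and we avoid log1p(-1) = -inf.
    df["price"] = np.log1p(df["price"]).fillna(-1)
    df["image_top_1"] = df["image_top_1"].fillna(-1)
    return df


def clean_text(txt):
    """Lowercase, turn runs of non-letters into one space, trim the ends."""
    if not isinstance(txt, str):
        return ""
    # \W, \d and _ together cover everything that is not a letter;
    # in Python 3 this is Unicode-aware, so Cyrillic letters are kept.
    return re.sub(r"[\W\d_]+", " ", txt.lower()).strip()


def add_text_stats(df, col):
    """Add char/word/punctuation counts for one raw text column."""
    df = df.copy()
    text = df[col].fillna("").astype(str)
    df[col + "_char_count"] = text.str.len()
    df[col + "_word_count"] = text.str.split().str.len()
    # Assumed definition; the +1 avoids division by zero on empty text.
    df[col + "_word_density"] = df[col + "_char_count"] / (df[col + "_word_count"] + 1)
    df[col + "_punctuation_count"] = text.apply(
        lambda s: sum(ch in PUNCTUATION for ch in s)
    )
    return df
```

Compute the count features on the raw text, before clean_text, otherwise punctuation_count is always 0.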
time features
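The notes do not say which time features to build. A minimal sketch, assuming the ad date lives in a column called activation_date and that calendar features (weekday, day of month, weekend flag) are what is wanted; add_time_features and the output column names are hypothetical.

```python
import pandas as pd


def add_time_features(df, date_col="activation_date"):
    """Derive simple calendar features from a date column (assumed name)."""
    df = df.copy()
    dates = pd.to_datetime(df[date_col])
    df["weekday"] = dates.dt.dayofweek        # Monday = 0 ... Sunday = 6
    df["day_of_month"] = dates.dt.day
    df["is_weekend"] = dates.dt.dayofweek >= 5
    return df
```

The train period is short, so weekday is probably more useful than month or year.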
# has_image: True when the image field holds a string (Python 3: str replaces unicode)
df['has_image'] = df['image'].apply(lambda image: isinstance(image, str)).astype('bool')
df.drop(['image'], axis=1, inplace=True)
# replace each text column with its length (0 when the value is missing)
for col in ['title', 'description']:
    df[col + '_length'] = df[col].apply(lambda txt: len(txt) if isinstance(txt, str) else 0).astype('uint32')
    df.drop([col], axis=1, inplace=True)
https://www.kaggle.com/wolfgangb33r/advanced-avito-prediction-xgboost-word-char-counts/code
Snapshot of Train Periods Dataset
importance figure