Alice.py is your primary teacher of data analysis and Kaggle.
This document introduces how to approach a Kaggle task. Think of it as a martial-arts manual of secret techniques. It mainly records theory in a form that is easy to remember; the concrete code lives in Alice.py and in tools.
First, we need to get to know the data: understand the meaning of each attribute and visualize distributions and relationships. This step helps us choose and generate features, and sometimes leads us to reformulate the problem entirely.
From a technical perspective, we have:
df.column.value_counts().plot(kind='bar')
to plot the value counts of a column.
df.groupby('Pclass')['Survived'].mean().plot(kind='bar')
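The two one-liners above can be sketched on a tiny frame. This is only a minimal illustration: the data is made up, and `plot_overview` is a hypothetical helper name, not something taken from Alice.py or tools. The tables are computed first so they can be inspected without a plotting backend.

```python
import pandas as pd

# Toy stand-in for a Titanic-like training set (values are made up).
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 1],
})

# How many rows fall into each class, ordered by class label.
class_counts = df["Pclass"].value_counts().sort_index()

# The mean of a 0/1 target per group is the survival rate of that group.
survival_rate = df.groupby("Pclass")["Survived"].mean()


def plot_overview():
    """Draw both tables as bar charts; needs matplotlib installed."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(8, 3))
    class_counts.plot(kind="bar", ax=axes[0], title="Rows per Pclass")
    survival_rate.plot(kind="bar", ax=axes[1], title="Survival rate per Pclass")
    fig.tight_layout()
    return fig
```

Note that recent pandas versions expect `kind='bar'` as a keyword argument rather than a positional one.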
Plot the KDE of a variable for each category.
series.plot(kind='kde')
Use plot_distribution, a violin plot, or side-by-side boxplots.
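A rough sketch of comparing a numeric column across categories follows. The column names and values are invented for illustration, and `plot_kde_by_category` is a hypothetical helper (it is not the `plot_distribution` mentioned above, whose signature is not given here). The per-category `describe()` table carries the same numbers a boxplot would draw.

```python
import pandas as pd

# Toy data: a numeric column ("Age") and a categorical one ("Survived").
df = pd.DataFrame({
    "Age":      [22, 38, 26, 35, 54, 2, 27, 14],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1],
})

# Numeric summary per category (count, mean, quartiles, ...).
age_summary = df.groupby("Survived")["Age"].describe()


def plot_kde_by_category(frame, value_col, category_col):
    """Overlay one KDE curve per category; needs matplotlib and scipy."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, group in frame.groupby(category_col):
        group[value_col].plot(kind="kde", ax=ax, label=f"{category_col}={label}")
    ax.set_xlabel(value_col)
    ax.legend()
    return fig
```

Calling `plot_kde_by_category(df, "Age", "Survived")` would draw one curve per value of `Survived` on shared axes.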
We have several ways to fill NA values:
Fill with 0, forward fill, or backward fill.
Find the most related columns and use the mean or median within the matching category as the fill value.
Train a model on some related columns and use its predictions to fill the NA.
Differentiate the entries with and without information in this column.
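The options above can be sketched as follows. The frame is made up, and the choice of `Pclass` and `Fare` as the "related columns" is an assumption for illustration. For the model-based fill, a straight line fitted with `np.polyfit` stands in for whatever model you would actually train.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in "Age" (values are made up).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [80.0, 70.0, 90.0, 8.0, 7.0, 9.0],
    "Age":    [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# 1. Constant, forward and backward fill.
age_zero = df["Age"].fillna(0)
age_ffill = df["Age"].ffill()
age_bfill = df["Age"].bfill()  # a trailing NA has nothing after it and stays NA

# 2. Median of the same category in a related column.
age_group_median = df["Age"].fillna(
    df.groupby("Pclass")["Age"].transform("median")
)

# 3. A model trained on a related column (here: a fitted straight line).
known = df["Age"].notna()
slope, intercept = np.polyfit(df.loc[known, "Fare"], df.loc[known, "Age"], deg=1)
age_model = df["Age"].fillna(slope * df["Fare"] + intercept)

# 4. Keep the fact that the value was missing as its own feature.
df["Age_missing"] = df["Age"].isna().astype(int)
```

The indicator column in step 4 can be combined with any of the fills, so the model still sees which rows were originally empty.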
Cut continuous values into discrete bins.
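A minimal sketch of binning with pandas. The bin edges and labels are arbitrary choices for illustration, not values taken from Alice.py.

```python
import pandas as pd

ages = pd.Series([2, 15, 22, 38, 54, 71], name="Age")

# Fixed, hand-chosen edges; intervals are right-closed, e.g. (12, 18].
age_band = pd.cut(
    ages,
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Quantile-based edges: each bin gets roughly the same number of rows.
age_tercile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```

`pd.cut` is the better fit when the edges carry meaning, while `pd.qcut` keeps the bins balanced when they do not.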
Extract information from strings.
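As an illustration, here is a sketch that pulls features out of name strings. It assumes names shaped like "Family, Title. Given names", which is how a Titanic-style `Name` column usually looks; the regular expression is an assumption and would need adjusting for other formats.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
], name="Name")

# Title: the text between the comma and the first period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Family name: everything before the comma.
family = names.str.split(",").str[0]

# A simple numeric feature derived from the same string.
name_length = names.str.len()
```

The extracted `title` can then be treated as an ordinary categorical feature.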
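Because the term is ambiguous, two things are covered here. The first is transforming skewed data so that it is closer to a normal distribution; the second is rescaling it so that its mean is 0 and its standard deviation is 1.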
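Both meanings can be sketched in a few lines. The values are made up, and `log1p` is only one common choice for reducing right skew (an assumption here; other transforms may suit other data better).

```python
import numpy as np
import pandas as pd

# Right-skewed toy values, e.g. a fare-like column (made up).
fare = pd.Series([0.0, 7.0, 8.0, 9.0, 15.0, 30.0, 80.0, 500.0], name="Fare")

# 1. Reduce skew: log1p compresses the long right tail and accepts 0.
fare_log = np.log1p(fare)

# 2. Standardize: subtract the mean, divide by the standard deviation.
#    ddof=0 uses the population formula.
fare_std = (fare_log - fare_log.mean()) / fare_log.std(ddof=0)
```

In a real task, the mean and standard deviation should be computed on the training data and reused unchanged on the test data.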
Things we probably need to adjust:
Use cross validation
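To make the idea concrete, here is a NumPy-only sketch of k-fold cross-validation. The helper names `k_fold_indices` and `cross_validate_linear` are hypothetical, the data is synthetic, and ordinary least squares stands in for the real model. In practice a library such as scikit-learn offers ready-made tools for this (for example `KFold` and `cross_val_score`).

```python
import numpy as np


def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, valid_idx) pairs that partition a shuffled index."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx


def cross_validate_linear(X, y, n_splits=5):
    """Return the validation mean squared error of each fold."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(X), n_splits):
        # Fit least squares (with intercept) on the training folds only.
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold.
        A_valid = np.column_stack([X[valid_idx], np.ones(len(valid_idx))])
        scores.append(float(np.mean((A_valid @ coef - y[valid_idx]) ** 2)))
    return scores


# Synthetic data: y = 3x + 1 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 1 + rng.normal(0, 0.1, size=50)

fold_mse = cross_validate_linear(X, y, n_splits=5)
mean_mse = float(np.mean(fold_mse))
```

Comparing `mean_mse` across parameter settings, rather than a single train/validation split, gives a steadier signal about which setting to keep.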