Apollo1840/United_Kagglers

Introduction

Alice.py is your primary teacher of data analysis and Kaggle.

This document introduces how to approach a Kaggle task. Think of it as a 武功秘籍 (a martial-arts manual). What is written here is mainly theory that is easy to memorize; the concrete code lives in Alice.py and in tools.

1, First step

First we need to get to know the data: understand what each attribute means and visualize distributions and relationships. This step helps us choose and generate features, and sometimes leads us to reformulate the problem entirely.

visualization

From a technical perspective, we have:

1) value_counts

df.column.value_counts().plot(kind='bar')

to count how often each value occurs in the column.

2) ratio comparison by:

df.groupby('Pclass')['Survived'].mean().plot(kind='bar')

3) kde plot

Plot the kernel density estimate (KDE) of a variable for each category.

series.plot(kind='kde')

Alternatively, use plot_distribution, a violin plot, or side-by-side box plots.
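The three views above can be sketched on a small table. This is a minimal illustration, not the code from Alice.py: the toy DataFrame and its values are made up, and only the column names `Pclass` and `Survived` come from the text. The plotting calls are left as comments because they need matplotlib (and scipy for the KDE).

```python
import pandas as pd

# Toy data (made up); column names mirror the Titanic-style example above.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
    "Age":      [38.0, 35.0, 27.0, 22.0, 4.0, 30.0],
})

# 1) How often each value occurs in a column.
class_counts = df["Pclass"].value_counts()

# 2) Ratio comparison: the mean of a 0/1 target per category is a rate.
survival_rate = df.groupby("Pclass")["Survived"].mean()

# 3) Numeric summary of a continuous column per category; a KDE plot
#    shows the same comparison as smooth curves.
age_by_outcome = df.groupby("Survived")["Age"].describe()

# Plotting (requires matplotlib; kind="kde" also requires scipy):
# class_counts.plot(kind="bar")
# survival_rate.plot(kind="bar")
# df.groupby("Survived")["Age"].plot(kind="kde", legend=True)
```

Computing the Series first and plotting second keeps the numbers available for inspection even where no plotting backend is installed.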

2, Clean the data

fillNA

There are several ways to fill NA values:

1) Trivial approaches:

Fill with 0, forward fill, or backward fill.

2) Categorical approach:

Find the most closely related column, group by it, and fill each missing value with the mean or median of its category.

3) Model approach:

Train a model on related columns and use its predictions to fill the NA values.
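The three approaches above might look like this in pandas. This is a sketch under assumptions: the toy data, the choice of `Pclass` as the related category, and the use of scikit-learn's `LinearRegression` as the filling model are all illustrative and not taken from Alice.py.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up) with two missing ages.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [80.0, 90.0, 100.0, 8.0, 10.0, 12.0],
    "Age":    [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# 1) Trivial approaches.
age_zero = df["Age"].fillna(0)
age_ffill = df["Age"].ffill()   # copy the previous valid value forward
age_bfill = df["Age"].bfill()   # copy the next valid value backward;
                                # a trailing NaN has no next value and stays NaN

# 2) Categorical approach: median age within each Pclass.
group_median = df.groupby("Pclass")["Age"].transform("median")
age_by_class = df["Age"].fillna(group_median)

# 3) Model approach: predict Age from related columns (assumed: Pclass, Fare).
known = df[df["Age"].notna()]
missing = df[df["Age"].isna()]
model = LinearRegression().fit(known[["Pclass", "Fare"]], known["Age"])
age_model = df["Age"].copy()
age_model.loc[missing.index] = model.predict(missing[["Pclass", "Fare"]])
```

Note that forward and backward fill depend on row order, so they mostly make sense for ordered data such as time series.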

3, Feature engineering

1) get_dummies

2) get_dummies_na

Distinguish entries that have a value in this column from entries that do not.

3) cut

Cut a continuous value into discrete bins.

4) String comprehension

Extract information from strings.
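The four techniques above can be sketched with plain pandas. Several things here are assumptions: `get_dummies_na` is taken to mean one-hot encoding with an extra indicator column for missing values, which `pd.get_dummies(..., dummy_na=True)` provides; the toy data, bin edges, labels, and the title-extracting regular expression are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made up) in the style of the Titanic dataset.
df = pd.DataFrame({
    "Embarked": ["S", "C", np.nan, "S"],
    "Age":      [4.0, 22.0, 38.0, 70.0],
    "Name":     ["Braund, Mr. Owen", "Cumings, Mrs. John",
                 "Heikkinen, Miss. Laina", "Allen, Mr. William"],
})

# 1) One-hot encode a categorical column.
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked")

# 2) Same, plus one extra column that marks rows where the value is missing.
dummies_na = pd.get_dummies(df["Embarked"], prefix="Embarked", dummy_na=True)

# 3) Cut a continuous value into discrete bins (right-closed intervals).
age_group = pd.cut(df["Age"], bins=[0, 12, 60, 120],
                   labels=["child", "adult", "senior"])

# 4) Extract information from a string: the title between ", " and ".".
title = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
```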

4, Preprocessing

1) normalize the data

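Because the term is ambiguous, it covers two things here. The first is reshaping skewed data toward a normal distribution; the second is shifting and scaling the data so that its mean is 0 and its standard deviation is 1.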
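Both meanings can be shown on one array. This is a minimal sketch: the values are made up, and `log1p` is just one common choice for reducing right skew (the text does not say which transform Alice.py uses).

```python
import numpy as np

# Made-up, right-skewed values (one large outlier).
fare = np.array([7.25, 8.05, 13.0, 26.0, 71.28, 512.33])

# Step 1: reduce skew. log1p(x) = log(1 + x) compresses large values
# and is defined at x = 0.
fare_log = np.log1p(fare)

# Step 2: standardize to mean 0 and standard deviation 1.
fare_scaled = (fare_log - fare_log.mean()) / fare_log.std()
```

When a train/test split is involved, the mean and standard deviation should be computed on the training data only and then reused for the test data.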

5, Build Model

Things we probably need to adjust:

2) C

3) penalty

4) tol
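The names `C`, `penalty`, and `tol` match parameters of scikit-learn's `LogisticRegression`; that this list refers to that estimator is an assumption, as are the synthetic data and the parameter values below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for a real Kaggle table).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(
    C=1.0,         # inverse regularization strength: smaller means stronger
    penalty="l2",  # type of regularization applied to the coefficients
    tol=1e-4,      # tolerance for the solver's stopping criterion
)
model.fit(X, y)
train_accuracy = model.score(X, y)
```

Training accuracy is only a sanity check here; it says little about how the model generalizes.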

6, Evaluate the model

Use cross-validation.
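A minimal sketch of cross-validation with scikit-learn follows. The estimator, the synthetic data, the 5 folds, and the accuracy metric are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for a real Kaggle table).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit and score the model on 5 different train/validation splits.
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")

# The mean estimates performance; the spread shows how stable it is.
mean_score, std_score = scores.mean(), scores.std()
```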

7, Model Augmentation

1) Bagging

2) VotingClassifier

3) grid search
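The three techniques above could be combined as follows with scikit-learn. This is a sketch, not the repository's implementation: the base estimators, the number of bagged trees, the hard-voting choice, and the grid of `C` values are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for a real Kaggle table).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1) Bagging: train many trees on bootstrap samples and aggregate them.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20,
                            random_state=0)

# 2) Voting: let several different models vote on each prediction.
voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("bag", bagging),
    ],
    voting="hard",  # majority vote on predicted labels
)
voting.fit(X, y)
predictions = voting.predict(X)

# 3) Grid search: try each candidate value with cross-validation.
grid = GridSearchCV(LogisticRegression(),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

Grid search can also wrap the voting or bagging estimator itself, at the cost of a longer search.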
