Introduction

Alice.py is your primary teacher of data analysis and Kaggle.

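This document introduces how to approach a Kaggle task. Think of it as a 武功秘籍 (a secret martial-arts manual). What is written here is mainly theory that is easy to memorize; the concrete code implementations live in Alice.py and in tools.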

1, First step

First we need to get to know the data: understand the meaning of each attribute and visualize some distributions and relations. This step helps us choose and generate features, and sometimes leads us to reformulate the problem entirely.

visualization

From a technical perspective we have:

1)value_counts

df.column.value_counts().plot(kind='bar')

to plot how often each value occurs in the column.

2)ratio compare by:

df.groupby('Pclass')['Survived'].mean().plot(kind='bar')
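As a minimal sketch of the two one-liners above, the following computes the tables that the bar charts would show. The toy DataFrame is made up for illustration; only the column names `Pclass` and `Survived` come from the text.

```python
import pandas as pd

# Toy data with Titanic-style column names (values invented for illustration).
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# 1) How often each value occurs in a column.
counts = df["Pclass"].value_counts()

# 2) Mean of a 0/1 target per category, i.e. the survival ratio per class.
rates = df.groupby("Pclass")["Survived"].mean()

print(counts)
print(rates)
# counts.plot(kind="bar") or rates.plot(kind="bar") draws the chart
# (this needs matplotlib to be installed).
```

Taking the mean of a 0/1 column per group is what turns it into a ratio, which is why it is useful for comparing categories of different sizes.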

3)kde plot

Plot the KDE of a variable for each category.

series.plot(kind='kde')

Alternatively, use plot_distribution, a violin plot, or multiple boxplots.

2, Clean the data

fillNA

There are several ways to fill NA values:

1) Trivial approaches:

Fill with 0, forward fill, or backward fill.
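The three trivial fills can be sketched with pandas as follows. The example Series is invented for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

filled_zero = s.fillna(0)   # replace every NA with 0
filled_ffill = s.ffill()    # carry the last seen value forward
filled_bfill = s.bfill()    # carry the next seen value backward

print(filled_zero.tolist())
print(filled_ffill.tolist())
print(filled_bfill.tolist())
```

Note that a backward fill cannot fill a trailing NA (and a forward fill cannot fill a leading one), because there is no neighbouring value to copy from.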

2) Categorical approach:

Find the columns most related to the one with missing values, group by them, and fill each NA with the mean or median of its group.
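A minimal sketch of the group-wise fill, assuming `Pclass` is the related column and `Age` is the column with gaps (both chosen here only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median Age of each Pclass, broadcast back to the original row order.
group_median = df.groupby("Pclass")["Age"].transform("median")

# Fill only the missing entries with the median of their own group.
df["Age"] = df["Age"].fillna(group_median)

print(df)
```

Using `transform` keeps the result aligned with the original rows, so it can be passed straight to `fillna`.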

3) Model approach:

Train a model on related columns, using the rows where the value is known, and use its predictions to fill the NA values.
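The idea can be sketched with a one-variable linear fit standing in for "some model". The choice of `np.polyfit`, the columns `Fare` and `Age`, and the data are all assumptions made for illustration; any regressor or classifier could take its place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Age":  [25.0, 30.0, np.nan, 40.0, 45.0],
})

known = df["Age"].notna()

# Fit Age ~ slope * Fare + intercept on the rows where Age is known.
slope, intercept = np.polyfit(df.loc[known, "Fare"], df.loc[known, "Age"], deg=1)

# Predict Age only for the rows where it is missing.
df.loc[~known, "Age"] = slope * df.loc[~known, "Fare"] + intercept

print(df)
```

The important part is the split: the model is trained only on rows with a known value and applied only to rows without one.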

3, Feature engineering

1) get_dummies

2) get_dummies_na

Differentiate the entries that have information in this column from those that do not.
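One way to get both behaviours in pandas is `pd.get_dummies` with `dummy_na=True`, which adds an extra indicator column for missing values. Whether the `get_dummies_na` mentioned above works exactly like this is an assumption; the column `Embarked` and its values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S"]})

# One indicator column per category, plus one for "value is missing".
dummies = pd.get_dummies(df["Embarked"], prefix="Embarked", dummy_na=True)

print(dummies)
```

Each row has exactly one indicator set, so a missing entry stays distinguishable from every real category.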

3) cut

Cut continuous values into discrete bins.
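A short sketch with `pd.cut`. The bin edges and labels below are arbitrary choices for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])

# Bins are right-closed by default: (0, 12], (12, 18], (18, 65], (65, 120].
age_group = pd.cut(
    ages,
    bins=[0, 12, 18, 65, 120],
    labels=["child", "teen", "adult", "senior"],
)

print(age_group.tolist())
```

The result is a categorical column, which can then be passed to get_dummies like any other category.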

4) String comprehension

Dig information out of string columns.
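As an illustration, a title can be pulled out of a Titanic-style name string with a regular expression. The name format "Surname, Title. Given names" and the pattern are assumptions for this sketch.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley",
])

# Capture the text between the comma and the first following period.
title = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(title.tolist())
```

The extracted title is a new categorical feature that was hidden inside free text.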

4, Preprocessing

1) normalize the data

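Because the term is ambiguous, it covers two things here. The first is to pull skewed data toward a normal distribution; the second is to give the data a mean of 0 and a standard deviation of 1.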
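Both steps can be sketched as follows. Using `np.log1p` to reduce skew is only one common option for non-negative data, chosen here as an assumption; the `fare` values are invented.

```python
import numpy as np
import pandas as pd

fare = pd.Series([0.0, 7.0, 8.0, 15.0, 30.0, 500.0])  # right-skewed

# Step 1: reduce the skew with a log transform (log1p handles zeros).
fare_log = np.log1p(fare)

# Step 2: standardize to mean 0 and standard deviation 1.
fare_std = (fare_log - fare_log.mean()) / fare_log.std(ddof=0)

print(fare.skew(), fare_log.skew())
print(fare_std.mean(), fare_std.std(ddof=0))
```

The log transform reduces skew but does not guarantee a normal distribution, so it is worth checking the skew again afterwards.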


Apollo1840/United_Kagglers


