#%%
import warnings
import pandas as pd
from sklearn import preprocessing
from Data import DataManager
from Analyser import Analyser

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 51)
output = 'output/'
size = (150, 100)
warnings.filterwarnings('ignore')
# %%
# 1. Read from Dataset
dataManager = DataManager()
dfFullData = dataManager.readData()
#%%
# 2. Analysing Data from Dataset
dfFullData.describe()
#%%
dfFullData.info()
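#%%
# Hedged sketch (not part of the original analysis): one possible way to back the
# observations noted in the next cell with numbers rather than plots alone.
# The helper name `quickChecks` and its `labelColumn` parameter are hypothetical;
# 'diagnosis' is the label column named in this script.
import pandas as pd


def quickChecks(df, labelColumn):
    """Return per-feature skewness, label class ratio and the missing-cell count."""
    numeric = df.select_dtypes(include='number')
    return {
        'skewness': numeric.skew(),                                  # far from 0 suggests a skewed feature
        'labelRatio': df[labelColumn].value_counts(normalize=True),  # share of each class in the label
        'missingCells': int(df.isnull().sum().sum()),                # 0 means there are no empty cells
    }

# Possible usage, assuming dfFullData contains a 'diagnosis' column:
# quickChecks(dfFullData, 'diagnosis')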
#%%
# Based on the distribution plot 01_INITIAL_DistPlot.png, the box plot 02_OUTLIER_BoxPlot.png, the data information above and the original dataset:
# 1. Some features in the dataset are not normally distributed, so a skew-correcting transformation has to be applied.
# 2. The true label, 'diagnosis', is binary and the ratio of Yes to No is disproportionate, so stratification has to be applied.
# 3. The range of numerical values in some features is wide, so these features have to be scaled down.
# 4. The true label, 'diagnosis', has to be converted to numbers via label encoding since it has only 2 values.
# 5. Some features have outliers, so the outliers have to be removed.
# 6. Features 'ID' and 'Unnamed' have to be dropped as they are not useful.
# 7. There are no empty cells.
# In addition, the following steps also have to be checked: