MODEL_ID = "preprocess_df"


#################### 
## Load data
#################### 
# English translations of the column names
train_cols_eng = ["id", "rent", "location", "access", "layout", "age", "direction", "area", "floor",
           "bath_toilet", "kitchen", "broad_com", "facility", "parking", "environment", "structure",
           "contract_period"]
test_cols_eng = ["id", "location", "access", "layout", "age", "direction", "area", "floor",
           "bath_toilet", "kitchen", "broad_com", "facility", "parking", "environment", "structure",
           "contract_period"]

train = pd.read_csv("./data/train.csv", names=train_cols_eng, header=0)
test = pd.read_csv("./data/test.csv", names=test_cols_eng, header=0)


#################### 
## Preprocess data
#################### 

train = preprocess_df(train)
print("Preprocessing done!")


#################### 
## Visualize
#################### 
profile = train.profile_report()
profile.to_file(f"./logs/visualization/{MODEL_ID}_profile.html")
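# If the .profile_report() accessor is unavailable (the package now ships as
# ydata-profiling), an equivalent report can be produced like this, assuming
# the ydata-profiling package is installed:
# from ydata_profiling import ProfileReport
# ProfileReport(train).to_file(f"./logs/visualization/{MODEL_ID}_profile.html")
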
Example #2

import numpy as np
import pandas as pd

train_cols_eng = ["id", "rent", "location", "access", "layout", "age", "direction", "area", "floor",
           "bath_toilet", "kitchen", "broad_com", "facility", "parking", "environment", "structure",
           "contract_period"]
test_cols_eng = ["id", "location", "access", "layout", "age", "direction", "area", "floor",
           "bath_toilet", "kitchen", "broad_com", "facility", "parking", "environment", "structure",
           "contract_period"]

train = pd.read_csv("./data/train.csv", names=train_cols_eng, header=0)
test = pd.read_csv("./data/test.csv", names=test_cols_eng, header=0)

use_cols = []

#################### 
## Preprocess data
#################### 

train_processed = preprocess_df(train)
test_processed = preprocess_df(test)

# handle outliers
train_processed.drop(20427, axis=0, inplace=True) # building age listed as 1019 years; unclear how to correct it, so drop the row
train_processed.loc[20231, "age_year"] = 52
train_processed.loc[20231, "age_in_months"] = 52 * 12 + 5 # listed as 520 years old; assumed to be a typo for 52 years

train_processed.loc[5775, "rent"] = 120350 # implausibly high rent given the listing; probably the wrong number of zeros
train_processed.loc[20926, "area"] = 43.01 # implausibly large area given the listing; probably the wrong number of zeros
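# The row indices above came from manual inspection; a quick filter like the one
# below (thresholds are arbitrary illustrations) surfaces similarly implausible rows.
print(train_processed[train_processed["age_year"] > 100][["age_year", "area", "rent"]])
print(train_processed[train_processed["area"] > 300][["area", "age_year", "rent"]])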

train_processed.reset_index(drop=True, inplace=True)
target = train_processed["rent"]
target_log = np.log1p(target)
train_processed.drop(["id", "rent"], axis=1, inplace=True)
test_processed.drop("id", axis=1, inplace=True)

Example #3

selected_features_df = select_model_features(agg_drug_df, student_categorical_col_list,
                                              student_numerical_col_list, PREDICTOR_FIELD)
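
# select_model_features is project-specific and not shown in this snippet. A minimal
# sketch, assuming it simply subsets the aggregated DataFrame to the chosen
# categorical and numerical columns plus the prediction target, could look like:
def select_model_features_sketch(df, categorical_col_list, numerical_col_list, predictor_field):
    selected_cols = categorical_col_list + numerical_col_list + [predictor_field]
    return df[selected_cols].copy()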


# ### Preprocess Dataset - Casting and Imputing  

# We will cast and impute the dataset before splitting so that these steps do not have to be repeated for each partition in the next step. Imputation merits deeper analysis of which features to impute and how, but for the sake of time we take a simple strategy: impute zero for numerical features only.
# 
# OPTIONAL: What are some potential issues with this approach? Can you recommend a better way and also implement it? (A sketch of one alternative follows the cell below.)

# In[37]:


processed_df = preprocess_df(selected_features_df, student_categorical_col_list, 
        student_numerical_col_list, PREDICTOR_FIELD, categorical_impute_value='nan', numerical_impute_value=0)
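
# One possible answer to the OPTIONAL question above (a sketch, not the course
# solution): always imputing zero distorts numerical features whose values are never
# legitimately zero and hides missingness from the model. A common alternative is to
# impute the training-set median and keep a missing-indicator column.
def impute_numerical_with_median_sketch(df, numerical_col_list, medians=None):
    df = df.copy()
    if medians is None:
        # compute medians on this frame (ideally the training partition only, to avoid leakage)
        medians = df[numerical_col_list].median()
    for col in numerical_col_list:
        df[col + "_missing"] = df[col].isna().astype(int)  # flag rows that were imputed
        df[col] = df[col].fillna(medians[col])
    return df, medians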


# ## Split Dataset into Train, Validation, and Test Partitions

# **Question 6**: To prepare the data for training and evaluating a deep learning model, we will split the dataset into three partitions, with the validation partition used for tuning model hyperparameters during training. A key requirement is ensuring that data does not accidentally leak across partitions.
# 
# Please complete the function below to split the input dataset into three partitions (train, validation, test) with the following requirements.
# - Approximately 60%/20%/20% train/validation/test split
# - Randomly sample different patients into each data partition
# - **IMPORTANT** Make sure that a patient's data is not in more than one partition, so that we can avoid possible data leakage.
# - Make sure that the total number of unique patients across the splits is equal to the total number of unique patients in the original dataset
# - Total number of rows in original dataset = sum of rows across all three dataset partitions

# In[38]:
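
# A possible sketch of such a splitter (not the graded solution). It samples unique
# patient IDs rather than rows, so no patient can appear in more than one partition;
# 'patient_nbr' is assumed to be the patient identifier column in this dataset.
def patient_dataset_splitter_sketch(df, patient_key='patient_nbr',
                                    val_fraction=0.2, test_fraction=0.2, seed=42):
    # shuffle the unique patient IDs reproducibly
    shuffled_ids = df[patient_key].drop_duplicates().sample(frac=1, random_state=seed).tolist()
    n = len(shuffled_ids)
    n_val = int(round(n * val_fraction))
    n_test = int(round(n * test_fraction))
    val_ids = set(shuffled_ids[:n_val])
    test_ids = set(shuffled_ids[n_val:n_val + n_test])
    # every row follows its patient into exactly one partition
    train_df = df[~df[patient_key].isin(val_ids | test_ids)].reset_index(drop=True)
    val_df = df[df[patient_key].isin(val_ids)].reset_index(drop=True)
    test_df = df[df[patient_key].isin(test_ids)].reset_index(drop=True)
    return train_df, val_df, test_df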