Python fetch_midwest_survey Examples

Programming language: Python

Namespace/package name: dirty_cat.datasets

Method/function: fetch_midwest_survey

Examples at hotexamples.com: 2

Python fetch_midwest_survey - 2 examples found. These are the top-rated real-world Python examples of dirty_cat.datasets.fetch_midwest_survey extracted from open source projects.
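
The two examples below use the return value of fetch_midwest_survey differently: the first reads `dataset['path']`, while the second passes the returned object straight to `pd.read_csv`. Which form applies presumably depends on the installed dirty_cat version. A minimal sketch of a helper that accepts either shape follows; `resolve_csv_path` is a hypothetical name, not part of dirty_cat, and the file path is a placeholder.

```python
import os


def resolve_csv_path(fetched):
    """Return a CSV path from either return shape used in the examples.

    Hypothetical helper, not part of dirty_cat: accepts a dict with a
    'path' key (as in example 1) or a plain path (as in example 2).
    """
    if isinstance(fetched, dict):
        return fetched["path"]
    return os.fspath(fetched)


# Placeholder values standing in for the result of fetch_midwest_survey().
placeholder = "data/midwest_survey.csv"
print(resolve_csv_path({"path": placeholder}))
print(resolve_csv_path(placeholder))
```

With such a helper, `pd.read_csv(resolve_csv_path(fetch_midwest_survey()))` would work for both return shapes shown on this page.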

Example #1

File: 03_fit_predict_plot_midwest_survey.py Project: world4jason/dirty_cat

an open-ended question, on which one-hot encoding does not work well.
The other columns are more traditional categorical or numerical
variables.

Let's see how different encodings of the dirty column affect the
score of a classification problem.

"""

################################################################################
# Loading the data
# ----------------
from dirty_cat.datasets import fetch_midwest_survey
import pandas as pd

dataset = fetch_midwest_survey()
df = pd.read_csv(dataset['path']).astype(str)

################################################################################
# The challenge with this data is that it contains a free-form input
# column, where people put whatever they want:
dirty_column = 'In your own words, what would you call the part of the country you live in now?'
print(df[dirty_column].value_counts()[-10:])

################################################################################
# Separating clean and dirty columns as well as a column we will try to predict
# ------------------------------------------------------------------------------

target_column = 'Location (Census Region)'
clean_columns = [
    'Personally identification as a Midwesterner?',
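
The docstring above says one-hot encoding does not work well on the free-form column. A rough sketch of why, in plain Python: one-hot encoding gives every distinct spelling its own column, whereas a character n-gram similarity treats near-duplicates as close. The answers below are invented for illustration, and the Jaccard similarity over 3-gram sets is a simplification, not dirty_cat's actual SimilarityEncoder implementation.

```python
# Hypothetical free-form answers, invented for illustration only.
answers = ["Midwest", "midwest", "Mid-west", "The Midwest", "South", "the south"]


def one_hot_columns(values):
    """One-hot encoding needs one column per distinct string."""
    return sorted(set(values))


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(len(one_hot_columns(answers)))            # 6: every spelling gets its own column
print(ngram_similarity("Midwest", "midwest"))   # 1.0: identical after lower-casing
print(ngram_similarity("Midwest", "Mid-west"))  # 0.5: five shared 3-grams out of ten
print(ngram_similarity("Midwest", "South"))     # 0.0: no shared 3-grams
```

Under one-hot encoding, "Midwest" and "Mid-west" are as unrelated as "Midwest" and "South"; the similarity score keeps the first pair close.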

Example #2

import pandas as pd
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

from dirty_cat import datasets
from dirty_cat import SimilarityEncoder

# encoding methods
encoder_dict = {
    'one-hot': OneHotEncoder(handle_unknown='ignore'),
    'similarity': SimilarityEncoder(similarity='ngram',
                                    handle_unknown='ignore'),
    'num': FunctionTransformer(None)
}

data_file = datasets.fetch_midwest_survey()

for method in ['one-hot', 'similarity']:
    # Load the data
    df = pd.read_csv(data_file).astype(str)

    target_column = 'Location (Census Region)'
    y = df[target_column].values.ravel()

    # Transform the data into a numerical matrix
    encoder_type = {
        'one-hot': [
            'Personally identification as a Midwesterner?', 'Illinois in MW?',
            'Indiana in MW?', 'Kansas in MW?', 'Iowa in MW?',
            'Michigan in MW?', 'Minnesota in MW?', 'Missouri in MW?',
            'Nebraska in MW?', 'North Dakota in MW?', 'Ohio in MW?',