def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a regression on the iris dataset, but reusing
    the existing code from assignment1. I am also including the species column as a one_hot_encoded
    value for the prediction. Use this to check how different the results are (score and
    predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe,
                                      list(ohe.get_feature_names()))
    '''
    !!!My Results!!!
    Comparing the two regression functions, the raw data typically scores slightly higher than the
    processed data, but the results are essentially the same.
    I believe this is for the same reasons I mentioned in the classification file: the data is essentially
    unchanged. The only preprocessing fixed the outliers and removed the missing values, and this dataset
    does not actually have many outliers or NaNs, so we are left with a very similar dataset.
    The normalization and one-hot encoding change the dataset in a way that may make it more efficient for
    a model to process, but they do not change the meaning of the data!
    So, although the datasets are essentially the same, I would still choose the preprocessed data, as the
    normalization and one-hot encoding may let the model process the data more efficiently than the raw
    data would.
    '''
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    model = simple_random_forest_regressor(X, y)
    print(model)
    return model
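The helper `simple_random_forest_regressor` is defined elsewhere in the assignment; a minimal sketch of what it might look like, assuming scikit-learn (the `_sketch` name and the exact hyperparameters are my assumptions, not the assignment's code):

```python
# Hypothetical sketch of the assignment's simple_random_forest_regressor helper.
# Assumes scikit-learn; the real helper may choose different parameters.
from typing import Dict

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def simple_random_forest_regressor_sketch(X, y) -> Dict:
    # Hold out a test split so the reported score is not measured on training data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    # For regressors, .score() returns the R^2 on the held-out split
    return {'model': model, 'score': model.score(X_test, y_test)}
```

The returned dict mirrors how the snippets above read `model['score']` after training.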
def random_forest_iris_dataset_again() -> Dict:
    """
    Run the result of the process iris again task of e_experimentation and discuss (1 sentence)
    the differences from the above results. Also, as above, use sepal_length as the label column and
    the one_hot_encoder to transform the categorical column into a usable format.
    Feel free to change your e_experimentation code (changes there will not be considered for grading
    purposes) to optimise the model (e.g. score, parameters, etc).
    """
    df = process_iris_dataset_again()

    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe,
                                      list(ohe.get_feature_names()))
    '''
    !!!Explanation!!!
    With this dataset, the model performs with much better accuracy than in the previous function. I believe
    the key difference between the two datasets is the addition of the large_sepal_length column. Although
    this column only added noise when predicting a feature like the flower's species (discussed in the
    classification file), I believe it benefits the model when predicting something like sepal_length. This
    is likely due to the correlation between sepal_length and the large_sepal_length feature, as they are
    closely related.
    It makes sense that this dataset would yield better accuracy, as it provides more information about the
    value we are trying to predict! As I discussed in a few places in the classification file, it is
    generally important to include data relevant to what is being predicted in order to train a good model.
    We see that here, as the large_sepal_length feature is directly relevant to the length of the sepal.
    '''
    X, y = df.iloc[:, 1:], df.iloc[:, 0]

    model = simple_random_forest_regressor(X, y)
    print(model)
    return model
def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a regression on the iris dataset, but reusing
    the existing code from assignment1. I am also including the species column as a one_hot_encoded
    value for the prediction. Use this to check how different the results are (score and
    predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe, list(ohe.get_feature_names()))

    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    return simple_random_forest_regressor(X, y)
def train_iris_dataset_again() -> Dict:
    """
    Run the result of the process iris again task of e_experimentation, but now using the
    decision tree regressor AND random_forest regressor. Return the one with lowest R^2.
    Use the same label column and one hot encoding logic as before.
    Discuss (1 sentence) what you found different between the results.
    Feel free to change your e_experimentation code (changes there will not be considered for grading
    purposes) to optimise the model (e.g. score, parameters, etc).
    """
    df = process_iris_dataset_again()

    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe,
                                      list(ohe.get_feature_names()))

    X, y = df.iloc[:, 1:], df.iloc[:, 0]

    dt = decision_tree_regressor(X, y)
    print(dt)

    rf = simple_random_forest_regressor(X, y)
    print(rf)
    '''
    Given the results, on average the Decision Tree and Random Forest regressors yield similar accuracy
    scores; it is hard to say which is the clear winner.
    This is very similar to comparing the two on the processed iris set in the classification file.
    I believe the iris dataset is quite well balanced, so the improvements and the reduction in overfitting
    that the random forest offers over the decision tree are somewhat mitigated.
    Because of this, I would say the decision tree has an advantage over the random forest, as it is more
    efficient in both time and space (it uses only one tree instead of many).
    As in the classification example, I would usually choose a random forest over a decision tree for its
    benefits in reducing overfitting through averaging many trees, but since those issues are not present
    here, the decision tree wins my pick!
    '''
    if rf['score'] > dt['score']:
        print('random forest wins!')
        return rf
    else:
        print('decision tree wins!')
        return dt
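The `decision_tree_regressor` helper compared above is also defined elsewhere; a minimal sketch under the same assumptions as the random-forest sketch (scikit-learn, a held-out test split, R^2 as the score; the `_sketch` name is hypothetical):

```python
# Hypothetical sketch of the assignment's decision_tree_regressor helper.
# A single tree: cheaper in time and space than a forest, but with no
# ensemble averaging to reduce overfitting.
from typing import Dict

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def decision_tree_regressor_sketch(X, y) -> Dict:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = DecisionTreeRegressor(random_state=0)
    model.fit(X_train, y_train)
    return {'model': model, 'score': model.score(X_test, y_test)}
```

Returning the same `{'model', 'score'}` shape keeps the two regressors directly comparable, as in the `rf['score'] > dt['score']` check above.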
def iris_clusters() -> Dict:
    """
    Let's use the iris dataset and clusterise it:
    """
    df = pd.read_csv(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    # Let's generate the clusters considering only the numeric columns first
    no_species_column = simple_k_means(df.iloc[:, :4])

    ohe = generate_one_hot_encoder(df['species'])
    df_ohe = replace_with_one_hot_encoder(df, 'species', ohe,
                                          list(ohe.get_feature_names()))

    # Notice that here I have binary columns, but I am using euclidean distance
    # to do the clustering AND the score evaluation. This is pretty bad.
    no_binary_distance_clusters = simple_k_means(df_ohe)

    # Finally, let's use just a label encoder for the species.
    # It is still bad to change the labels to numbers directly, because the distances between them do not make sense
    le = generate_label_encoder(df['species'])
    df_le = replace_with_label_encoder(df, 'species', le)
    labeled_encoded_clusters = simple_k_means(df_le)

    # See the result for yourself:
    print(no_species_column['score'], no_binary_distance_clusters['score'],
          labeled_encoded_clusters['score'])
    ret = no_species_column
    if no_binary_distance_clusters['score'] > ret['score']:
        print('no binary distance')
        ret = no_binary_distance_clusters
    if labeled_encoded_clusters['score'] > ret['score']:
        print('labeled encoded')
        ret = labeled_encoded_clusters
    return ret
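The `simple_k_means` helper is not shown in this file; a minimal sketch assuming scikit-learn's `KMeans` scored with the silhouette coefficient (the `_sketch` name, `k=3`, and the choice of silhouette are my assumptions):

```python
# Hypothetical sketch of the simple_k_means helper, assuming scikit-learn.
from typing import Dict

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def simple_k_means_sketch(X, n_clusters: int = 3) -> Dict:
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    # silhouette_score defaults to euclidean distance, which is exactly why
    # clustering over one-hot (binary) columns distorts the evaluation,
    # as the comments above point out.
    return {'model': model, 'score': silhouette_score(X, labels)}
```

The silhouette score lies in [-1, 1], so the `score` comparisons in `iris_clusters` remain meaningful across the three encodings.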
def your_choice() -> Dict:
    """
    Now choose one of the datasets included in the assignment1 (the raw one, before anything done to them)
    and decide for yourself a set of instructions to be done (similar to the e_experimentation tasks).
    Specify your goal (e.g. analyse the reviews of the amazon dataset), say what you did to try to achieve the goal
    and use one (or both) of the models above to help you answer that. Remember that these models are regression
    models, therefore it is useful only for numerical labels.
    We will not grade your result itself, but your decision-making and suppositions given the goal you decided.
    Use this as a small exercise of what you will do in the project.
    """
    '''
    !!!My Goal!!!
    I will be using the Iris dataset.
        I think this dataset is relatively simple and very well balanced, making it super fun to use because
        the resulting models so far have seemed quite robust.
    With this dataset, I want to run a regression which predicts the petal width of a given iris flower.
    To do this, I will preprocess the data in the following ways:
        - Extract petal_width for use as the target feature
        - Extract and one-hot encode the species column for the feature vector
        - Extract the petal_length column for the feature vector
        - Let's say for this specific problem, the iris researcher's student forgot to measure any information
        regarding the sepal - oops! I want to see how robust a model I can make using just the two features listed.
    I will train a decision tree model for this problem; as we saw before, I do not think I will need something as
    powerful as a random forest, since this dataset is very well balanced.
    I do, however, want to see the difference in accuracy when I include the sepal information in the dataset,
    say the student's mentor went back to record that information to complete the dataset (after firing the student).
    Thus I will use two decision trees and compare them at the end, to see if leaving out the sepal info has any
    effect! I will return the one with the better score.
    Let's see if the mentor made the right decision in firing the student ;)
    '''
    df = pd.read_csv(Path('..', '..', 'iris.csv'))

    ohe = generate_one_hot_encoder(df_column=df['species'])
    df = replace_with_one_hot_encoder(df=df,
                                      column='species',
                                      ohe=ohe,
                                      ohe_column_names=ohe.get_feature_names())

    y = df['petal_width']

    # student collected features and model
    columns = ['petal_length', 'x0_setosa', 'x0_versicolor', 'x0_virginica']
    x_student = df[columns]
    dt_student = decision_tree_regressor(X=x_student, y=y)

    # mentor collected features and model
    columns = [
        'sepal_length', 'sepal_width', 'petal_length', 'x0_setosa',
        'x0_versicolor', 'x0_virginica'
    ]
    x_mentor = df[columns]
    dt_mentor = decision_tree_regressor(X=x_mentor, y=y)

    print(dt_student)
    print(dt_mentor)
    '''
    !!!My Results!!!
    On average, the models have about the same score, around .88 to .93.
    I believe this is a real score, as the dataset is balanced and there should not be any contamination
    between the training and testing sets.
    That being said, the student's model has fewer features in its dataset, with just about the same score.
    I believe this supports the theory that, in order to predict petal width, data relating to the petal is
    very important.

    Upon further manipulation of the dataset, it seems that another feature which is truly crucial for this
    prediction is the species of the flower. The robustness of the model is very dependent on the presence
    of this feature as well. Thus the petal_length and species features are very important for this problem!

    I think I would still rehire the student, as they were able to make a model that scored just as well but
    used a smaller dataset (for this specific problem), and thus actually saved the lab a little bit of space,
    and perhaps a whole lot of time collecting data on iris flowers. The student's work also questioned the
    dataset, and allowed the "lab" to find out which features are the most important for this problem.
    I would rehire the student and give him a raise!!
    '''
    if dt_student['score'] > dt_mentor['score']:
        print('rehire the student!')
        return dt_student
    else:
        print('keep him fired!')
        return dt_mentor
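The student-vs-mentor comparison above can be generalized into a small helper that scores the same regressor over several named feature subsets. This is my own sketch, not part of the assignment; it assumes scikit-learn and a fixed split so the subsets are compared on the same held-out rows:

```python
# Hypothetical helper: score one DecisionTreeRegressor per named feature subset,
# so feature-subset comparisons (like student vs. mentor) share one code path.
from typing import Dict, List

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_feature_subsets(df: pd.DataFrame, target: str,
                            subsets: Dict[str, List[str]]) -> Dict[str, float]:
    scores = {}
    for name, columns in subsets.items():
        # Fixed random_state keeps the held-out rows identical across subsets
        X_train, X_test, y_train, y_test = train_test_split(
            df[columns], df[target], test_size=0.2, random_state=0)
        model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores
```

With this, the function above would reduce to one call with `{'student': [...], 'mentor': [...]}` and a comparison of the two returned scores.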
def your_choice() -> Dict:
    """
    Now choose one of the datasets included in the assignment1 (the raw one, before anything done to them)
    and decide for yourself a set of instructions to be done (similar to the e_experimentation tasks).
    Specify your goal (e.g. analyse the reviews of the amazon dataset), say what you did to try to achieve the goal
    and use one (or both) of the models above to help you answer that. Remember that these models are classification
    models, therefore it is useful only for categorical labels.
    We will not grade your result itself, but your decision-making and suppositions given the goal you decided.
    Use this as a small exercise of what you will do in the project.
    """
    '''
    !!!My Goal!!!
    I will be using the dataset "Geography".
    With this dataset, I want to find out if we can fit a model to predict the World Bank income group of a
    country given some geographical and bank-related features.
    To find this out, I will preprocess the data in the following ways:
        - Fix any missing data in the columns mentioned below
        - Extract and label encode the World Bank income groups column into the labels vector
        - Extract and one-hot encode the World bank region column into the features vector
        - Extract latitude into the features vector
        - Extract longitude into the features vector
    I will train both a Decision Tree and a Random Forest, and return the model with the greater accuracy.
    '''
    df = pd.read_csv(Path('..', '..', 'geography.csv'))

    '''
    !!!Explanation!!!
    The only rows with NaNs in the target columns were from the Vatican,
    so I replaced their null values with the values from Italy.
    I know they are technically separate, but until the dataset can be filled in, we will simply consider
    them the same.
    '''
    df['World bank region'].fillna(value='Europe & Central Asia', inplace=True)
    df['World bank, 4 income groups 2017'].fillna('High Income', inplace=True)

    le = generate_label_encoder(df_column=df['World bank, 4 income groups 2017'])
    df = replace_with_label_encoder(df=df, column='World bank, 4 income groups 2017', le=le)

    ohe = generate_one_hot_encoder(df_column=df['World bank region'])
    df = replace_with_one_hot_encoder(df=df, column='World bank region', ohe=ohe,
                                      ohe_column_names=ohe.get_feature_names())

    columns = ['Latitude', 'Longitude', 'x0_East Asia & Pacific', 'x0_Europe & Central Asia',
               'x0_Latin America & Caribbean', 'x0_Middle East & North Africa', 'x0_North America',
               'x0_South Asia', 'x0_Sub-Saharan Africa']
    X = df[columns]
    y = df['World bank, 4 income groups 2017']

    dt = decision_tree_classifier(X=X, y=y)
    #print(dt)
    rf = simple_random_forest_classifier(X=X, y=y)
    #print(rf)
    '''
    !!!My Results!!!
    Once again, on average the Decision Tree and Random Forest yield similar results.
    Their accuracies are quite low, ranging from around 50 to nearly 70 percent.
    I don't think a lot of overfitting is occurring here, as the dataset is well balanced and properly split
    into training and testing sets.
    The dataset lacks columns relating to the economy, wealth, or demographics of each country,
    so I believe more data may improve the model's ability to fit a mapping between a country's demographic
    and wealth data and its income group (the target label).
    Features that could be collected as additional data columns include average income, employment
    rate, tax information, and more!
    Although this model is just a start, I believe it could be beneficial to those figuring out economic
    policies or tax plans. The ability to use such a model while drafting plans to benefit a
    country's economy could be useful, given enough relevant training data :)
    '''
    if rf['accuracy'] > dt['accuracy']:
        #print('random forest wins')
        return rf
    else:
        #print('decision tree wins')
        return dt
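The label-encoder helpers used for the target column above are defined elsewhere; a minimal sketch assuming scikit-learn's `LabelEncoder` (the `_sketch` names are hypothetical). Note that, as the clustering snippet points out, this numeric mapping is fine for a *label* but imposes a meaningless ordering if used as a feature:

```python
# Hypothetical sketch of the generate_label_encoder /
# replace_with_label_encoder helpers, assuming scikit-learn.
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def generate_label_encoder_sketch(df_column: pd.Series) -> LabelEncoder:
    le = LabelEncoder()
    le.fit(df_column)
    return le


def replace_with_label_encoder_sketch(df: pd.DataFrame, column: str,
                                      le: LabelEncoder) -> pd.DataFrame:
    # Copy to avoid mutating the caller's frame, then map each category
    # to its integer code in place of the original strings
    df = df.copy()
    df[column] = le.transform(df[column])
    return df
```

`inverse_transform` recovers the original category names, which is handy when reporting per-class accuracy for the income groups.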