def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a regression on the iris dataset, but reusing the existing code from assignment1.
    I am also including the species column as a one_hot_encoded value for the prediction.
    Use this to check how different the results are (score and predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])
    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe, list(ohe.get_feature_names()))
    '''
    !!!My Results!!!
    When comparing the two regression functions, the raw data typically scores slightly higher than
    the processed data, but the results are essentially the same. I believe this is for the same
    reasons I mentioned in the classification file: the only preprocessing done here fixed outliers
    and removed missing values, and this dataset has very few of either, so we are left with a very
    similar dataset. Normalization and one hot encoding change the representation of the data in a
    way that may make it more efficient for a model to process, but they do not change its meaning!
    So, although the datasets are essentially the same, I would still choose the preprocessed data,
    as normalization and one hot encoding may let the model process it more efficiently than the
    raw data.
    '''
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    model = simple_random_forest_regressor(X, y)
    print(model)
    return model
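# The claim above that normalization changes the representation but not the meaning of the data
# can be sketched concretely. This is a standalone toy example, not the assignment's
# normalize_column helper:

```python
# Min-max normalization rescales a column into [0, 1], but the ordering of the
# values (and hence the information they carry) is preserved exactly.
import numpy as np

def min_max_normalize(col: np.ndarray) -> np.ndarray:
    """Rescale a numeric column into the [0, 1] range."""
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo)

raw = np.array([4.9, 5.1, 6.3, 7.0])
scaled = min_max_normalize(raw)
# Ranks are unchanged, so the information content of the column is intact.
assert (np.argsort(raw) == np.argsort(scaled)).all()
```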
def random_forest_iris_dataset_again() -> Dict:
    """
    Run the result of the process iris again task of e_experimentation and discuss (1 sentence)
    the differences from the above results. Also, as above, use sepal_length as the label column
    and the one_hot_encoder to transform the categorical column into a usable format.
    Feel free to change your e_experimentation code (changes there will not be considered for
    grading purposes) to optimise the model (e.g. score, parameters, etc).
    """
    df = process_iris_dataset_again()
    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe, list(ohe.get_feature_names()))
    '''
    !!!Explanation!!!
    With this dataset, the model performs with much better accuracy than in the previous function.
    I believe the key difference between the two datasets is the added large_sepal_length column.
    Although that column only added noise when predicting a feature like the flower's species
    (discussed in the classification file), it appears to benefit the model when predicting
    sepal_length. This is likely due to the correlation between sepal_length and the
    large_sepal_length feature, as they are closely related. It makes sense that this dataset
    yields better accuracy, since it provides more information about the value we are trying to
    predict! As I discussed in several places in the classification file, it is generally
    important to include data relevant to what is being predicted in order to train a good model;
    we see that here, as the large sepal length feature is directly relevant to the length of the
    sepal.
    '''
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    model = simple_random_forest_regressor(X, y)
    print(model)
    return model
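# The correlation argument above can be illustrated with a toy sketch. Both the sample values and
# the mean threshold are assumptions for illustration, not the real large_sepal_length
# construction from e_experimentation:

```python
import numpy as np

# Hypothetical sepal lengths; the indicator is thresholded on the mean, which
# is only an assumption about how large_sepal_length might be derived.
sepal_length = np.array([4.6, 4.9, 5.0, 5.8, 6.4, 6.7, 7.1, 7.7])
large_sepal_length = (sepal_length > sepal_length.mean()).astype(float)

# A derived indicator tracks the raw column it was thresholded from, so it
# carries real signal for a regressor predicting that column.
corr = np.corrcoef(sepal_length, large_sepal_length)[0, 1]
assert corr > 0.7
```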
def train_iris_dataset_again() -> Dict:
    """
    Run the result of the process iris again task of e_experimentation, but now using the
    decision tree regressor AND random_forest regressor. Return the one with lowest R^2.
    Use the same label column and one hot encoding logic as before.
    Discuss (1 sentence) what you found different between the results.
    Feel free to change your e_experimentation code (changes there will not be considered for
    grading purposes) to optimise the model (e.g. score, parameters, etc).
    """
    df = process_iris_dataset_again()
    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe, list(ohe.get_feature_names()))
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    dt = decision_tree_regressor(X, y)
    print(dt)
    rf = simple_random_forest_regressor(X, y)
    print(rf)
    '''
    Given the results, the Decision Tree and Random Forest regressors yield similar accuracy
    scores on average, and it is hard to say which is the clear winner. This mirrors the
    comparison of the two on the processed iris set in the classification file: I believe the
    iris dataset is quite well balanced, so the reduction in overfitting that the random forest
    offers over the decision tree is somewhat mitigated. Because of this, the decision tree has
    an advantage here, as it is more efficient in both time and space (it uses only one tree
    instead of many). As in the classification example, I would usually choose a random forest
    over a decision tree for its resistance to overfitting, but since that issue is not present
    here, the decision tree wins my pick!
    '''
    if rf['score'] > dt['score']:
        print('random forest wins!')
        return rf
    else:
        print('decision tree wins!')
        return dt
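# The "similar scores" observation above can be reproduced on synthetic data. This is a
# self-contained sketch that calls sklearn directly, standing in for the assignment's
# decision_tree_regressor and simple_random_forest_regressor helpers:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# A simple, low-noise regression problem: both models should fit it well.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.05, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
dt_score = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
rf_score = RandomForestRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
# On an easy, balanced problem the two R^2 scores land close together.
```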
def iris_clusters() -> Dict:
    """
    Let's use the iris dataset and cluster it:
    """
    df = pd.read_csv(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    # Let's generate the clusters considering only the numeric columns first
    no_species_column = simple_k_means(df.iloc[:, :4])

    ohe = generate_one_hot_encoder(df['species'])
    df_ohe = replace_with_one_hot_encoder(df, 'species', ohe, list(ohe.get_feature_names()))
    # Notice that here I have binary columns, but I am using euclidean distance for both the
    # clustering AND the score evaluation. This is pretty bad.
    no_binary_distance_clusters = simple_k_means(df_ohe)

    # Finally, let's use just a label encoder for the species.
    # It is still bad to map the labels to numbers directly, because the distances between them
    # do not make sense.
    le = generate_label_encoder(df['species'])
    df_le = replace_with_label_encoder(df, 'species', le)
    labeled_encoded_clusters = simple_k_means(df_le)

    # See the result for yourself:
    print(no_species_column['score'], no_binary_distance_clusters['score'], labeled_encoded_clusters['score'])
    ret = no_species_column
    if no_binary_distance_clusters['score'] > ret['score']:
        print('no binary distance')
        ret = no_binary_distance_clusters
    if labeled_encoded_clusters['score'] > ret['score']:
        print('labeled encoded')
        ret = labeled_encoded_clusters
    return ret
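# The two encoding caveats in the comments above can be made concrete with a tiny numeric sketch.
# The vectors below are toy values, independent of the assignment's encoder helpers:

```python
import numpy as np

# Label encoding imposes an arbitrary ordering on the species...
le = {'setosa': 0.0, 'versicolor': 1.0, 'virginica': 2.0}
# ...while one-hot encoding keeps every pair of species equally far apart.
ohe = {'setosa': np.array([1.0, 0.0, 0.0]),
       'versicolor': np.array([0.0, 1.0, 0.0]),
       'virginica': np.array([0.0, 0.0, 1.0])}

# Under label encoding, setosa -> virginica looks twice as far as
# setosa -> versicolor, even though the three species are equally different.
assert abs(le['virginica'] - le['setosa']) == 2 * abs(le['versicolor'] - le['setosa'])

# Under one-hot encoding every pair is sqrt(2) apart, but running euclidean
# k-means over such binary columns still mixes them with the numeric scales.
d1 = np.linalg.norm(ohe['setosa'] - ohe['virginica'])
d2 = np.linalg.norm(ohe['setosa'] - ohe['versicolor'])
assert np.isclose(d1, d2) and np.isclose(d1, np.sqrt(2))
```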
def your_choice() -> Dict:
    """
    Now choose one of the datasets included in the assignment1 (the raw one, before anything done to them)
    and decide for yourself a set of instructions to be done (similar to the e_experimentation tasks).
    Specify your goal (e.g. analyse the reviews of the amazon dataset), say what you did to try to achieve
    the goal and use one (or both) of the models above to help you answer that. Remember that these models
    are regression models, therefore it is useful only for numerical labels.
    We will not grade your result itself, but your decision-making and suppositions given the goal you decided.
    Use this as a small exercise of what you will do in the project.
    """
    '''
    !!!My Goal!!!
    I will be using the Iris dataset.
    This dataset is relatively simple and very well balanced, making it fun to work with because the
    resulting models so far have been quite robust. With this dataset, I want to run a regression that
    predicts the petal width of a given iris flower. To do this, I will preprocess the data in the
    following ways:
        - Extract petal_width for use as the target feature
        - Extract and one hot encode the species column for the feature vector
        - Extract the petal length column for the feature vector
        - Let's say that for this specific problem, the iris flower researcher student forgot to measure
          any information regarding the sepal - oops! I want to see how robust a model I can make using
          just the two features listed.
        - I will train a decision tree model for this problem; as we saw before, I do not think I need
          something as powerful as a random forest, since this dataset is very well balanced. I do,
          however, want to see the difference in accuracy when I include the sepal information in the
          dataset - say the student's mentor went back to record that information to complete the
          dataset (after firing the student).
    Thus I will use two decision trees and compare them at the end, to see if leaving out the sepal info
    has any effect! I will return the one with the better score. Let's see if the mentor made the right
    decision in firing the student ;)
    '''
    df = pd.read_csv(Path('..', '..', 'iris.csv'))
    ohe = generate_one_hot_encoder(df_column=df['species'])
    df = replace_with_one_hot_encoder(df=df, column='species', ohe=ohe, ohe_column_names=ohe.get_feature_names())
    y = df['petal_width']

    # student collected features and model
    columns = ['petal_length', 'x0_setosa', 'x0_versicolor', 'x0_virginica']
    x_student = df[columns]
    dt_student = decision_tree_regressor(X=x_student, y=y)

    # mentor collected features and model
    columns = ['sepal_length', 'sepal_width', 'petal_length', 'x0_setosa', 'x0_versicolor', 'x0_virginica']
    x_mentor = df[columns]
    dt_mentor = decision_tree_regressor(X=x_mentor, y=y)

    print(dt_student)
    print(dt_mentor)
    '''
    !!!My Results!!!
    On average, the models have about the same score, around .88 to .93. I believe this is a real score,
    as the dataset is balanced and there should not be any contamination between the training and testing
    sets. With that being said, the student's model has fewer features in its dataset with just about the
    same score. This may support the theory that, in order to predict petal width, data relating to the
    petal is very important. Upon further manipulation of the dataset, it seems that another truly
    crucial feature is the species of the flower; the robustness of the model is very dependent on its
    presence as well. Thus the petal_length and species features are very important for this problem!
    I would still rehire the student, as they were able to make a model that scored just as well but used
    a smaller dataset (for this specific problem), and thus actually saved the lab a little bit of space,
    and perhaps a whole lot of time collecting data on iris flowers.
    The work from the student also questioned the dataset, and allowed the "lab" to find out which
    features are most important for this problem. I would rehire the student and give him a raise!!
    '''
    if dt_student['score'] > dt_mentor['score']:
        print('rehire the student!')
        return dt_student
    else:
        print('keep him fired!')
        return dt_mentor
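# One way to back the "which features matter" reasoning above with evidence is to inspect the
# fitted tree's feature_importances_. This is a sketch on synthetic data, not the assignment's
# helpers or the real iris columns:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: feature 0 plays the role of petal_length, while the
# near-irrelevant features 1 and 2 play the role of the sepal columns.
rng = np.random.RandomState(0)
X = rng.uniform(size=(150, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.01, size=150)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
importances = tree.feature_importances_
# The informative feature should dominate the importance scores.
assert importances.argmax() == 0
```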
def your_choice() -> Dict:
    """
    Now choose one of the datasets included in the assignment1 (the raw one, before anything done to them)
    and decide for yourself a set of instructions to be done (similar to the e_experimentation tasks).
    Specify your goal (e.g. analyse the reviews of the amazon dataset), say what you did to try to achieve
    the goal and use one (or both) of the models above to help you answer that. Remember that these models
    are classification models, therefore it is useful only for categorical labels.
    We will not grade your result itself, but your decision-making and suppositions given the goal you decided.
    Use this as a small exercise of what you will do in the project.
    """
    '''
    !!!My Goal!!!
    I will be using the dataset "Geography".
    With this dataset, I want to find out if we can fit a model to predict the World Bank Income Group of
    a country given some geographical and bank related features.
    To find this out, I will preprocess the data in the following ways:
        - Fix any missing data in the columns mentioned below
        - Extract and label encode the World Bank groups column into the labels vector
        - Extract and one hot encode the World bank region column into the features vector
        - Extract latitude into the features vector
        - Extract longitude into the features vector
    I will train both a Decision Tree and a Random Forest, and return the model with the greater accuracy.
    '''
    df = pd.read_csv(Path('..', '..', 'geography.csv'))
    '''
    !!!Explanation!!!
    The only rows with NaNs in the target columns were from the Vatican, so I replaced their null values
    with the values from Italy. I know they are technically separate, but until the dataset can be filled
    in we will simply consider them the same.
    '''
    df['World bank region'].fillna(value='Europe & Central Asia', inplace=True)
    df['World bank, 4 income groups 2017'].fillna('High Income', inplace=True)

    le = generate_label_encoder(df_column=df['World bank, 4 income groups 2017'])
    df = replace_with_label_encoder(df=df, column='World bank, 4 income groups 2017', le=le)
    ohe = generate_one_hot_encoder(df_column=df['World bank region'])
    df = replace_with_one_hot_encoder(df=df, column='World bank region', ohe=ohe, ohe_column_names=ohe.get_feature_names())

    columns = ['Latitude', 'Longitude', 'x0_East Asia & Pacific', 'x0_Europe & Central Asia',
               'x0_Latin America & Caribbean', 'x0_Middle East & North Africa', 'x0_North America',
               'x0_South Asia', 'x0_Sub-Saharan Africa']
    X = df[columns]
    y = df['World bank, 4 income groups 2017']

    dt = decision_tree_classifier(X=X, y=y)
    # print(dt)
    rf = simple_random_forest_classifier(X=X, y=y)
    # print(rf)
    '''
    !!!My Results!!!
    Once again, on average the Decision Tree and Random Forest yield similar results. Their accuracies
    are quite low, ranging from around 50 to nearly 70 percent. I don't think a lot of overfitting is
    occurring here, as the dataset is well balanced and properly split into training and testing sets.
    The dataset lacks columns relating to the economy, wealth, or demographics of each country, so I
    believe more data could help the model fit a mapping between a country's demographic and wealth data
    and its income group (the target label). Additional data columns could include features such as
    average income, employment rate, tax information, and more! Although this model is just a start, I
    believe it could be beneficial to organisations working out economic policies or tax plans; with
    enough relevant training data, the ability to use a model like this while drawing up plans to benefit
    a country's economy could be very useful :)
    '''
    if rf['accuracy'] > dt['accuracy']:
        # print('random forest wins')
        return rf
    else:
        # print('decision tree wins')
        return dt
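# To put the 50 to 70 percent accuracy range above in context, it helps to compare it against the
# majority-class baseline. The label counts below are hypothetical, not the real geography data:

```python
import numpy as np

# Hypothetical income-group counts for roughly 195 countries (illustrative only).
y = np.array(['High'] * 50 + ['Upper middle'] * 55 + ['Lower middle'] * 50 + ['Low'] * 40)
_, counts = np.unique(y, return_counts=True)

# Accuracy of always predicting the most common class.
baseline = counts.max() / counts.sum()
# With four roughly balanced classes, even ~50% accuracy clearly beats guessing.
assert baseline < 0.5
```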