Code Example #1
def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a regression on the iris dataset, but reusing
    the existing code from assignment1. I am also including the species column as a one_hot_encoded
    value for the prediction. Use this to check how different the results are (score and
    predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe,
                                      list(ohe.get_feature_names()))
    '''
    !!!My Results!!!
    Comparing the two regression functions, the raw data typically has a slightly higher score than the processed
    data, but the results are essentially the same.
    I believe this is for the same reasons I mentioned in the classification file: the two datasets are nearly
    identical. The only preprocessing performed here fixes the outliers and removes the missing values, and this
    dataset does not actually have many outliers or NaNs, so we are left with a very similar dataset.
    The normalization and one hot encoding change the representation of the data in a way that may let a model
    process it more efficiently, but they do not actually change the meaning of the data!
    So, although the datasets are essentially the same, I would still choose to use the preprocessed data, as the
    normalization and one hot encoding may let the model process the data more efficiently than the raw data.
    '''
    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    model = simple_random_forest_regressor(X, y)
    print(model)
    return model
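
# A minimal sketch illustrating the point made in the results comment above: assuming
# normalize_column performs a linear rescaling such as z-score scaling (as described in the
# later examples), the ordering of values and their relative spacing are preserved, i.e. the
# representation changes but not the meaning.
import numpy as np

raw = np.array([4.7, 5.0, 6.3, 7.9])         # e.g. sepal lengths in cm
scaled = (raw - raw.mean()) / raw.std()      # z-score version of the same column

# The ordering is unchanged...
assert list(np.argsort(raw)) == list(np.argsort(scaled))
# ...and pairwise gaps are only multiplied by a constant factor (1 / std).
print(np.diff(raw) / np.diff(scaled))        # prints the same constant three times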
Code Example #2
def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a classification on the iris dataset, but reusing
    the existing code from assignment1. Use this to check how different the results are (score and
    predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        # Notice that I am now passing through all columns.
        # If your code does not handle normalizing categorical columns, do so now (just return the unchanged column)
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    X, y = df.iloc[:, :4], df.iloc[:, 4]
    le = generate_label_encoder(y)

    # Be careful to return a copy of the input with the changes, instead of changing inplace the inputs here!
    y_encoded = replace_with_label_encoder(y.to_frame(), column='species', le=le)
    rf = simple_random_forest_classifier(X, y_encoded['species'])

    '''
    !!!Explanation!!!
    The classifier in this function and the one in the previous function yield roughly the same score on average.
    I believe this is because the two datasets are essentially the same at this point: both have label-encoded classes.
    The only differences are that this function removes the NaNs and outliers, of which the dataset has few anyway,
    and also normalizes the dataset, which from my understanding does not change the values in relation to one another.
    The normalization may simply make the model in this function more efficient, and due to that potential boost in
    efficiency I would choose this function's model over the previous one.
    '''
    print(rf['accuracy'])
    return rf
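
# A hedged guess at what the label-encoding step above amounts to, assuming
# generate_label_encoder and replace_with_label_encoder are thin wrappers around
# sklearn.preprocessing.LabelEncoder; the real helpers come from assignment1 and are not
# shown here, so treat this only as an illustration.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'], name='species')
le = LabelEncoder().fit(species)                             # roughly generate_label_encoder
encoded = pd.DataFrame({'species': le.transform(species)})   # roughly replace_with_label_encoder
print(encoded['species'].tolist())                           # [0, 1, 2, 0]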
Code Example #3
def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a regression on the iris dataset, but reusing
    the existing code from assignment1. I am also including the species column as a one_hot_encoded
    value for the prediction. Use this to check how different the results are (score and
    predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    ohe = generate_one_hot_encoder(df['species'])
    df = replace_with_one_hot_encoder(df, 'species', ohe, list(ohe.get_feature_names()))

    X, y = df.iloc[:, 1:], df.iloc[:, 0]
    return simple_random_forest_regressor(X, y)
Code Example #4
def reusing_code_random_forest_on_iris() -> Dict:
    """
    Again I will run a classification on the iris dataset, but reusing
    the existing code from assignment1. Use this to check how different the results are (score and
    predictions).
    """
    df = read_dataset(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        # Notice that I am now passing through all columns.
        # If your code does not handle normalizing categorical columns, do so now (just return the unchanged column)
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    X, y = df.iloc[:, :4], df.iloc[:, 4]
    le = generate_label_encoder(y)

    # Be careful to return a copy of the input with the changes, instead of changing inplace the inputs here!
    y_encoded = replace_with_label_encoder(y.to_frame(),
                                           column='species',
                                           le=le)
    return simple_random_forest_classifier(X, y_encoded['species'])
Code Example #5
def iris_clusters() -> Dict:
    """
    Let's use the iris dataset and cluster it:
    """
    df = pd.read_csv(Path('..', '..', 'iris.csv'))
    for c in list(df.columns):
        df = fix_outliers(df, c)
        df = fix_nans(df, c)
        df[c] = normalize_column(df[c])

    # Let's generate the clusters considering only the numeric columns first
    no_species_column = simple_k_means(df.iloc[:, :4])

    ohe = generate_one_hot_encoder(df['species'])
    df_ohe = replace_with_one_hot_encoder(df, 'species', ohe,
                                          list(ohe.get_feature_names()))

    # Notice that here I have binary columns, but I am using euclidean distance for both the clustering AND the score evaluation.
    # This is pretty bad.
    no_binary_distance_clusters = simple_k_means(df_ohe)

    # Finally, let's use just a label encoder for the species.
    # It is still bad to change the labels to numbers directly because the distances between them do not make sense.
    le = generate_label_encoder(df['species'])
    df_le = replace_with_label_encoder(df, 'species', le)
    labeled_encoded_clusters = simple_k_means(df_le)

    # See the result for yourself:
    print(no_species_column['score'], no_binary_distance_clusters['score'],
          labeled_encoded_clusters['score'])
    ret = no_species_column
    if no_binary_distance_clusters['score'] > ret['score']:
        print('no binary distance')
        ret = no_binary_distance_clusters
    if labeled_encoded_clusters['score'] > ret['score']:
        print('labeled encoded')
        ret = labeled_encoded_clusters
    return ret
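
# One plausible sketch of what the simple_k_means helper used above might look like,
# assuming it wraps sklearn's KMeans and reports a silhouette score as its 'score' key;
# the actual implementation lives elsewhere in the assignment and may differ.
from typing import Dict
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def simple_k_means_sketch(X: pd.DataFrame, n_clusters: int = 3) -> Dict:
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    clusters = model.fit_predict(X)
    # Silhouette uses euclidean distance by default, which is exactly why feeding it
    # one-hot or label-encoded species columns (as compared above) is questionable.
    score = silhouette_score(X, clusters)
    return dict(model=model, score=score, clusters=clusters)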
Code Example #6
def cluster_life_expectancy() -> Dict:
    """
    Run the result of the process life_expectancy task of e_experimentation through the custom_clustering and discuss (max 3 sentences)
    the result (clusters and score) and also say any limitations (e.g. problems with metrics) that you find.
    We are not looking for an exact answer, we want to know if you really understand your choice and the results of custom_clustering.
    Once again, don't worry about the clustering technique implementation, but do analyse the data/result and check if the clusters make sense.
    """
    '''
    !!!My Goal!!!
    I would like to cluster the year, expectancy, and latitude columns.
    I believe this may show some trends in life expectancy with respect to the location of each country.
    I think it is important to consider location, as some parts of the world are much more developed than others
    and thus may have very different life expectancies.
    '''

    df = process_life_expectancy_dataset()
    '''
    !!!Explanation!!!
    As expectancy values cannot be made up, I simply remove the entries where expectancy is NaN.
    This only reduces the data set by around 10 rows.
    '''

    df['expectancy'].fillna(value=0, inplace=True)
    df = df[df['expectancy'] != 0]
    '''
    !!!Explanation!!!
    As labels are not required for an unsupervised task such as DBSCAN clustering, I remove the one-hot-encoded
    continent label columns (the x0_* columns) from the data set.
    '''

    df = df.drop(['x0_africa', 'x0_americas', 'x0_asia', 'x0_europe'], axis=1)
    '''
    !!!Explanation!!!
    Conceptually, I do not think the name column makes much sense to include in the clustering.
    We are trying to cluster locations based on life expectancy, and any location information the algorithm needs
    will come from the latitude data. The name is simply a label that we put on the latitude to make it easier for us
    to read, but it means little to the DBSCAN algorithm, so I remove it. It does not provide any additional
    information that may help this investigation.
    '''

    df = df.drop(['name'], axis=1)
    '''
    !!!Explanation!!!
    This investigation is intended to be independent of time, so the year column is dropped as well.
    '''

    df = df.drop(['year'], axis=1)
    '''
    !!!Explanation!!!
    I use the same normalization and PCA technique as before.
    The normalization standardizes the data onto the same scale (mean=0, std=1).
    The PCA reduces the dimensionality of the data by projecting it onto one dimension.
    Both steps should let the model run more efficiently.
    '''

    df['expectancy'] = normalize_column(df_column=df['expectancy'])
    df['latitude'] = normalize_column(df_column=df['latitude'])

    pca = PCA(n_components=1)
    projected = pca.fit_transform(X=df)
    '''
    !!!My Results!!!
    I was able to get a working clustering which yielded a score of 0.99. The commented-out code below is the method I
    used to search for my hyperparameters; the model I return uses the hyperparameters that seemed to yield the best
    results.

    I believe that clustering data based on location is an effective way of showing trends and correlations with
    respect to geography. For example, if a few places are all clustered together, it may indicate that those places
    all have a lower or higher life expectancy. This could lead to an investigation into common factors that
    correlate with having a high or low average expectancy. Such information could be extremely useful for
    international policies aimed at increasing overall life expectancy around the world.

    The limitation here is, as always, a lack of data! A few more fields representing the demographic and economic
    details of each country could improve this investigation. A lot of this data can also be found in the geography
    data set provided in the assignment folder, which would be a good start. If I were to continue with this
    investigation, I would integrate that data and see whether the clusters yield more information about the factors
    that affect life expectancy around the world.
    '''
    '''
    eps = [0.1, 0.001, 0.0001]
    mins = [5, 10, 15]
    for i in eps:
        for n in mins:
            model = custom_clustering(X=projected, eps=i, min_samples=n)
            print(model)
    '''

    model = custom_clustering(X=projected, eps=0.1, min_samples=15)
    return model
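
# A hedged guess at the shape of the custom_clustering helper used above, assuming it wraps
# sklearn's DBSCAN with the given eps/min_samples and reports a silhouette-style score; the
# real implementation is elsewhere in the assignment and may differ.
from typing import Dict
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score


def custom_clustering_sketch(X: np.ndarray, eps: float, min_samples: int) -> Dict:
    model = DBSCAN(eps=eps, min_samples=min_samples)
    clusters = model.fit_predict(X)
    # Silhouette is only defined when there are at least 2 clusters (label -1 is noise).
    n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
    score = silhouette_score(X, clusters) if n_clusters >= 2 else float('nan')
    return dict(model=model, score=score, clusters=clusters)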
Code Example #7
def cluster_amazon_video_game_again() -> Dict:
    """
    Run the result of the process amazon_video_game_again task of e_experimentation through the custom_clustering and discuss (max 3 sentences)
    the result (clusters and score) and also say any limitations (e.g. problems with metrics) that you find.
    We are not looking for an exact answer, we want to know if you really understand your choice and the results of custom_clustering.
    Once again, don't worry about the clustering technique implementation, but do analyse the data/result and check if the clusters make sense.
    """
    '''
    !!!My Goal!!!
    Similar to the last function, I am going to try to develop this function into a more concrete idea in order to focus
    my investigation.
    This data set relates to user information rather than product information, so I would like to see whether there are
    trends in each user's reviews as the number of reviews the user gives changes.
    I believe this can be valuable to a company as a starting point for user data analysis. A company could use
    something like this to develop a method of tracking users; with additional purchasing data they may be able to see
    the difference between what users review and what they simply purchase, and how frequently.
    This could give the company data to push specific products towards specific users, while also potentially revealing
    bot or scammer accounts on their site.
    I think we need to restrict ourselves to users with at least a certain number of reviews, as too few reviews will
    not really amount to any useful data. This threshold is a working hyperparameter that the company could choose and
    tune, but for now I will only look at users who have more than 4 reviews. Conveniently, this cuts down the data size
    quite a lot, as I was having difficulties with too much data otherwise.
    Due to the nature of the investigation, this function is quite similar to the previous one, so I reuse a lot of the
    code found there and refer back to the previous function in my explanations.
    '''

    df = process_amazon_video_game_dataset_again()
    '''
    !!!Explanation!!!
    Dropping the user duplicates for the same reasons I dropped the asin duplicates previously.
    '''

    df = df.drop_duplicates(subset=['user'])
    '''
    !!!Explanation!!!
    Dropping columns that are not relevant to the investigation, or will not make sense for clustering.
    '''
    df = df.drop(['user', 'asin', 'review', 'time'], axis=1)
    '''
    !!!Explanation!!!
    As mentioned in the description, only choose users that have more than 4 reviews
    '''

    df = df[df['user_count'] > 4]
    print(df)
    '''
    !!!Explanation!!!
    Normalize then PCA, same reasons as previously.
    '''

    df['user_count'] = normalize_column(df_column=df['user_count'])
    df['user_average'] = normalize_column(df_column=df['user_average'])
    pca = PCA(n_components=1)
    projected = pca.fit_transform(X=df)
    '''
    !!!The Limitations!!!
    I believe that for the scope of this specific investigation we are limited by the features we have. We would be
    able to learn more about user trends if we had some insight into their purchase history. That way we would have a
    better idea of whether users are reviewing products that they actually bought, we could relate that data back to
    the previous function to see whether a product's average review is similar to each user's review, and we could
    then perhaps try to find out why differences occur. With this data we would also be able to recommend other
    products the company has to offer, based on those results.

    Another limitation is simply the computational power of my laptop! There is a lot of data here, and a lot that I
    had to cut out for the efficiency of the model. If we had more machines to distribute the model over, and maybe a
    few nice GPUs, we could process a lot more data a lot faster!
    '''
    '''
    !!!My Results!!!
    With the data that I was able to run the model on, and using the parameters that I selected, the model yielded a
    score of 0.82. As mentioned before, I believe DBSCAN is a much better fit for this data set than the supervised
    classifiers and regressors, as this data set has a lot of noise and outliers while also being very large, with
    somewhat ambiguous labels in places.

    This seems good, and I think overall this is a solid start towards a larger product that puts users into clusters
    based on activity, in order to extract information which can lead the company to recommend other products to the
    'real' users based on their previous purchases and reviews. The overall program could also spot 'fake' users, bots
    and scammers who may be hyping up a product even though they will not actually purchase it. I believe both of the
    listed functionalities are very useful for a company trying to sell products online, as it can learn more about
    its customers and target its products based on what it has learned. Overall, these two functions begin to show the
    power of data science for businesses and companies that are able to collect data on their customers.
    '''

    model = custom_clustering(X=projected, eps=0.000302, min_samples=20)
    print(model)

    return model
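
# A hedged illustration of how per-user aggregates like 'user_count' and 'user_average'
# could be derived with pandas; process_amazon_video_game_dataset_again is implemented
# elsewhere and may well do this differently, so this is only a sketch of the idea.
import pandas as pd

reviews = pd.DataFrame({
    'user':   ['a', 'a', 'b', 'a', 'b'],
    'asin':   ['p1', 'p2', 'p1', 'p3', 'p2'],
    'review': [5, 4, 3, 5, 2],
})
per_user = (reviews.groupby('user')['review']
            .agg(user_count='count', user_average='mean')
            .reset_index())
reviews = reviews.merge(per_user, on='user')
# Filtering on review volume then mirrors the df[df['user_count'] > 4] step above.
print(reviews[reviews['user_count'] >= 3])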
Code Example #8
def cluster_amazon_video_game() -> Dict:
    """
    Run the result of the process amazon_video_game task of e_experimentation through the custom_clustering and discuss (max 3 sentences)
    the result (clusters and score) and also say any limitations (e.g. problems with metrics) that you find.
    We are not looking for an exact answer, we want to know if you really understand your choice and the results of custom_clustering.
    Once again, don't worry about the clustering technique implementation, but do analyse the data/result and check if the clusters make sense.
    """
    '''
    !!!My Goals!!!
    This question was a bit vague to me, so I decided that with my clusters I want to use this data set to begin to
    find a method of investigating any trends or interesting behaviours between a product's average review and how
    many users have reviewed that product. I think this method would be valuable to companies, which I discuss more in
    the My Results section of this function. The TA, Asal, helped me with the thought process behind the overall idea
    of this function as well as a few of the techniques within, so I must cite her here and thank her for her help :)
    '''

    df = process_amazon_video_game_dataset()
    '''
    !!!Explanation!!!
    This data set holds the number of reviews and the average review PER product, so each of those attributes is
    specific to a particular product. That also means that every row involving the same product, i.e. with the same
    'asin' id (which is the field we drop duplicates by), carries the same values. There is therefore a ton of
    duplication in this data set, and we can remove the redundant rows and save a lot of space.
    When we drop the duplicates, the row count goes from around one million to around fifty thousand! This saves a lot
    of time and space for our clustering models.
    '''

    df = df.drop_duplicates(subset=['asin'])
    '''
    !!!Explanation!!!
    Here I drop both the 'time' and 'asin' columns from the data set.
    The 'time' column has no relevance now that we have dropped the duplicates.
    As for the 'asin' column, I do not think it is very important here, as it does not make much sense for what I am
    trying to investigate. I want to look at whether the average rating is affected by how many times users rate a
    product, and the product id itself is not very relevant if we are looking at all products in general, especially
    since we need to normalize each column to ready it for PCA. Thus the 'asin' column gets dropped!
    '''

    df = df.drop('time', axis=1)
    df = df.drop('asin', axis=1)
    '''
    !!!Explanation!!!
    I decided to use PCA for dimensionality reduction because DBSCAN falls victim to the curse of dimensionality, so I
    bring my data set down to 1 dimension. From what I understand, PCA projects this 2-d data onto a one-dimensional
    line, prioritising the more important "component".

    I also needed to normalize my data to properly execute the PCA, which is shown below. The z-score normalization
    method I use scales the columns to be consistent with each other, so that they "speak the same language": each
    column is rescaled to fit a distribution with mean=0 and std_dev=1. It is important for the data to be on the same
    scale because, if not, the PCA algorithm runs the risk of treating whichever column simply has the larger values
    as the "principal component".
    '''

    df['count'] = normalize_column(df_column=df['count'])
    df['review'] = normalize_column(df_column=df['review'])
    pca = PCA(n_components=1)
    projected = pca.fit_transform(X=df)
    '''
    !!!My Results!!!
    The commented-out section of code is a sample of how I chose my eps and min_samples hyperparameters.
    The final model which I return is the one I chose to use; I typically get a score of around 0.75 with it.
    I believe this is a more believable result than the classification or regression ones. This data set is better
    suited to a clustering algorithm, especially one like DBSCAN which is good at handling noise. At the end of the
    day it is important to have a clear goal of what you want to do with a model, and without concrete labels, or at
    least more features to avoid duplication, it was difficult to process this data with a supervised model;
    clustering at least is able to show off some relationships that may exist between review counts and average
    rating for each product.

    An extension of this investigation might look into review trends for products on Amazon: both the average rating
    and number of reviews, but also data such as the total number of purchases per product. This would show whether
    customers are actually purchasing a product and leaving good reviews, or whether the seller is just hiring people
    to leave good reviews on their product on the Amazon website, which could perhaps lead to detection of phony
    reviews and scams. This may be valuable to a company as it can improve its overall quality assurance.
    '''
    '''
    eps = [0.0001, 0.0002, 0.000302, 0.0004, 0.0005]
    mins = [20, 30, 40, 50, 60, 100]
    for i in eps:
        for n in mins:
            model = custom_clustering(X=projected, eps=i, min_samples=n)
            print(model)
    '''

    model = custom_clustering(X=projected, eps=0.000302,
                              min_samples=20)  # gives close to 0.76 score
    return model
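
# A small self-contained version of the z-score + PCA step described above, assuming
# normalize_column performs standard z-score scaling; the real helper comes from assignment1
# and is not shown here, so treat this as an illustrative sketch only.
import pandas as pd
from sklearn.decomposition import PCA

toy = pd.DataFrame({'count': [3.0, 120.0, 45.0, 9.0],
                    'review': [4.5, 3.9, 4.1, 2.0]})

# z-score each column (mean 0, std 1) so that neither column dominates the PCA purely
# because of its magnitude.
toy = (toy - toy.mean()) / toy.std()

# Project the 2-d data onto its first principal component (a 1-d line).
projected = PCA(n_components=1).fit_transform(toy)
print(projected.shape)   # (4, 1)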