Example 1
    def safe_unique(self, parent_trace, df, column_name):
        '''
        A more robust alternative to plain Pandas for obtaining the list of unique values of a column in
        a DataFrame.

        In Pandas, one might typically do something like:

            df[column_name].unique()

        This has proven not robust enough in the Apodeixi code base because it can obscurely mask a defect
        elsewhere in Apodeixi, with a cryptic message like:

            'DataFrame' object has no attribute 'unique'

        The problem arises when there are "duplicates" (usually due to another defect in the code)
        among the columns of DataFrame df. Although column names are meant to be unique, Pandas
        tolerates the "user bad practice" of giving multiple columns the same name by treating the
        column index as based on objects, not strings. That effectively allows duplicates among the
        columns of DataFrame df, like so:

            UID |   Area    | UID-1 | Indicator |   UID-2   | Sub-Indicator | UID-2     | UID-3         | Space
            ---------------------------------------------------------------------------------------------------------
            A1  |  Adopt    | A1.I1 | throughput|           |               | A1.I1.S1  |               | tests
            A1  |  Adopt    | A1.I2 | latency   | A1.I2.SU1 |   interactive |           | A1.I2.SU1.S1  | tests  

        The second occurrence of "UID-2" should have been merged into the "UID-3" column, but we once had an Apodeixi
        defect that failed to do so, leaving two columns called "UID-2". This happened because Apodeixi was incorrectly
        using "UID-n" whenever the UID had exactly n tokens, which is not a unique acronym path if some of the entities
        are blank, as in the example above, where the first row has no sub-indicator.

        Upshot: the dataframe columns have "UID-2" duplicated, so an attempt to do 

                df["UID-2"]

        would produce a DataFrame, not a Series, so calling "unique()" on it would error out with a very cryptic
        message:

            'DataFrame' object has no attribute 'unique'

        Instead, this "more robust" method first checks how many columns are named `column_name`. If the name is
        missing or appears more than once, it raises an ApodeixiError with a hopefully less cryptic message.
        If the column is unique, it returns a list of its unique values.

        @param df A DataFrame. It is expected to have the `column_name` parameter as one of its columns.
        @param column_name A string, corresponding to the name of a column in the DataFrame
        '''
        if not isinstance(column_name, str):
            raise ApodeixiError(
                parent_trace,
                "Can't get unique values for a DataFrame's column because a '"
                + str(type(column_name)) +
                "' was provided as the column name instead of a string as expected")
        if not isinstance(df, _pd.DataFrame):
            raise ApodeixiError(
                parent_trace, "Can't get unique values for column '" +
                str(column_name) + "' because a '" + str(type(df)) +
                "' was provided instead of a DataFrame as expected")

        if len(column_name.strip()) == 0:
            raise ApodeixiError(
                parent_trace,
                "Can't get unique values for a DataFrame's column because column name provided is blank"
            )

        columns = list(df.columns)
        matches = [col for col in columns if col == column_name]

        if len(matches) == 0:
            raise ApodeixiError(
                parent_trace,
                "Can't get unique values in a DataFrame for column '" +
                str(column_name) + "' because it " +
                "is not one of the DataFrame's columns",
                data={"df columns": str(columns)})
        elif len(matches) > 1:
            raise ApodeixiError(
                parent_trace,
                "Can't get unique values in a DataFrame for column '" +
                str(column_name) + "' because it " +
                "appears multiple times as a column in the DataFrame",
                data={"df columns": str(columns)})

        # All is good, so now it is safe to call the Pandas unique() function
        return list(df[column_name].unique())
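
The failure mode described in the docstring can be reproduced with a small standalone sketch. The data below is made up for illustration, and `count_column_matches` is a hypothetical helper that mirrors the match-counting step of `safe_unique`; it is not part of Apodeixi.

```python
import pandas as pd

# Hypothetical data: two columns share the name "UID-2", as in the docstring's table.
df = pd.DataFrame(
    [["A1", "", "A1.I1.S1"], ["A1", "A1.I2.SU1", ""]],
    columns=["UID", "UID-2", "UID-2"],
)

# Selecting a duplicated name yields a DataFrame (one column per match), not a Series.
selection = df["UID-2"]
is_dataframe = isinstance(selection, pd.DataFrame)

# A DataFrame has no unique() method, which is why df["UID-2"].unique() fails.
has_unique = hasattr(selection, "unique")


def count_column_matches(frame, column_name):
    """Hypothetical helper: count the columns whose name equals column_name."""
    return sum(1 for col in frame.columns if col == column_name)


# Counting the matches up front lets the caller raise a clear error instead.
matches = count_column_matches(df, "UID-2")
```

A count of exactly one is the only case in which calling `unique()` on the selected column is safe, which is the condition `safe_unique` checks before delegating to Pandas.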