    def safe_unique(self, parent_trace, df, column_name):
        '''
        More robust implementation than Pandas for obtaining a list of the unique values of a column
        in a DataFrame.

        In Pandas, one would typically do something like:

                df[column_name].unique()

        This has proven not robust enough in the Apodeixi code base, because it can obscurely mask a defect
        elsewhere in Apodeixi with a cryptic message like:

                'DataFrame' object has no attribute 'unique'

        The problem arises when there are "duplicates" (usually due to another defect in the code) among
        the columns of DataFrame df. While technically speaking columns are "unique", the way Pandas handles
        the "user bad practice" of having multiple columns with the same name is to treat the column index as
        based on objects, not strings. That effectively allows duplicates among the columns of DataFrame df,
        like so:

            UID | Area  | UID-1 | Indicator  | UID-2     | Sub-Indicator | UID-2    | UID-3        | Space
            ----------------------------------------------------------------------------------------------
            A1  | Adopt | A1.I1 | throughput |           |               | A1.I1.S1 |              | tests
            A1  | Adopt | A1.I2 | latency    | A1.I2.SU1 | interactive   |          | A1.I2.SU1.S1 | tests

        The second occurrence of "UID-2" should have been merged into the "UID-3" column, but we once had an
        Apodeixi defect that failed to do so and instead produced two columns called "UID-2".
        This happened because Apodeixi was incorrectly using "UID-n" if the UID had exactly n tokens, which is
        not a unique acronym path if some of the entities are blank, as in the example above, where the first
        row has no sub-indicator.

        Upshot: the DataFrame columns have "UID-2" duplicated, so df["UID-2"] produces a DataFrame, not a
        Series, and calling unique() on it errors out with the very cryptic message:

                'DataFrame' object has no attribute 'unique'

        Instead, this "more robust" method checks whether the column in question appears exactly once among
        the DataFrame's columns. If it does not, it raises an ApodeixiError with a hopefully less cryptic
        message. If it does, it returns the column's unique values as a list.

        @param df           A DataFrame. It is expected to have the `column_name` parameter as one of its
                            columns.
        @param column_name  A string, corresponding to the name of a column in the DataFrame.
        '''
        # Validate the argument types before touching the DataFrame
        if not isinstance(column_name, str):
            raise ApodeixiError(parent_trace, "Can't get unique values for a DataFrame's column because a '"
                                                + str(type(column_name))
                                                + "' was provided as the column name instead of a string as expected")
        if not isinstance(df, _pd.DataFrame):
            raise ApodeixiError(parent_trace, "Can't get unique values for column '" + str(column_name)
                                                + "' because a '" + str(type(df))
                                                + "' was provided instead of a DataFrame as expected")

        if len(column_name.strip()) == 0:
            raise ApodeixiError(parent_trace, "Can't get unique values for a DataFrame's column because "
                                                + "column name provided is blank")

        # Count how many times the requested name appears among the columns. Anything other than
        # exactly one match is an error.
        columns             = list(df.columns)
        matches             = [col for col in columns if col == column_name]
        if len(matches) == 0:
            raise ApodeixiError(parent_trace, "Can't get unique values in a DataFrame for column '"
                                                + str(column_name) + "' because it "
                                                + "is not one of the DataFrame's columns",
                                                data = {"df columns":   str(columns)})
        elif len(matches) > 1:
            raise ApodeixiError(parent_trace, "Can't get unique values in a DataFrame for column '"
                                                + str(column_name) + "' because it "
                                                + "appears multiple times as a column in the DataFrame",
                                                data = {"df columns":   str(columns)})

        # All is good: df[column_name] is a Series, so it is safe to call the Pandas unique() function
        return list(df[column_name].unique())