flattenVectorUDT

Each vector-valued column in a Spark DataFrame is replaced by ordinary columns to facilitate use in PySpark.

Columns of type VectorUDT are identified, and each vector element is placed in its own column. For example, LogisticRegressionModel creates a column probability, which will be replaced by columns probability[0], probability[1], ... Note that these are not indexed expressions; instead, [0] is part of the column's name.
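
For illustration, here is what the schema change might look like for a hypothetical two-class LogisticRegressionModel output, and how the bracketed names are addressed afterwards (`df` is assumed to be the flattened dataframe):

```python
# Hypothetical schema before flattening:
#   |-- probability: vector (nullable = true)
# and after (a two-class model, so the vector has length 2):
#   |-- probability[0]: double (nullable = true)
#   |-- probability[1]: double (nullable = true)

# "probability[0]" is the literal column name, so SQL expressions need backticks:
df.selectExpr("`probability[0]`")
# while the DataFrame API can use the name directly:
df["probability[0]"]
```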

VectorAssembler, OneHotEncoder, and Model objects create User-Defined Types which are vector-valued: either DenseVector or SparseVector, the elements of which may be floats or integers.

The elements of these vectors are not directly indexable in PySpark. Moreover, when a pandas_udf is created using pyarrow and a VectorUDT is encountered, the conversion will not be optimized: the result is a pandas Series of Object, whose values must be converted row by row using .toArray() or .values. Working with these in pandas is tedious and error-prone.
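
A small sketch of that row-by-row unpacking, using a locally constructed Series of DenseVectors to stand in for what a pandas_udf would receive:

```python
import numpy as np
import pandas as pd
from pyspark.ml.linalg import DenseVector

# What arrives inside the pandas_udf: a Series of Python objects, not numeric data.
s = pd.Series([DenseVector([0.1, 0.9]), DenseVector([0.8, 0.2])])  # dtype: object

# Each vector must be unpacked individually, e.g. via .toArray() (or .values):
mat = np.stack(s.apply(lambda v: v.toArray()).to_list())  # ndarray of shape (2, 2)
```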

The vectorFlattenerUDT function provides the core functionality. Given a dataframe, it detects columns whose dataType is VectorUDT, inspects one row to determine each vector's length and value type, generates a udf that places each vector element in its own column, and uses a Spark SQL select to return a dataframe without VectorUDTs.
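
A minimal sketch of that approach; the function name, and the assumption that every element can be returned as DoubleType, are illustrative rather than the repository's actual code:

```python
from pyspark.ml.linalg import VectorUDT
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import DoubleType

def flatten_vector_udt(df):
    """Replace each VectorUDT column with one DoubleType column per element."""
    vector_cols = [f.name for f in df.schema.fields
                   if isinstance(f.dataType, VectorUDT)]
    if not vector_cols:
        return df

    # Inspect a single row to infer each vector's length.
    first = df.select(*vector_cols).first()
    lengths = {name: len(first[name]) for name in vector_cols}

    # A udf that extracts one element; works for DenseVector and SparseVector.
    element = udf(lambda v, i: float(v[i]), DoubleType())

    cols = []
    for f in df.schema.fields:
        if f.name in lengths:
            cols.extend(element(col(f.name), lit(i)).alias("{}[{}]".format(f.name, i))
                        for i in range(lengths[f.name]))
        else:
            cols.append(col(f.name))
    return df.select(*cols)
```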

If the metadata for the length and element type of a VectorUDT were consistently and reliably set, there would be no need to inspect a row in order to infer them.
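
When the metadata is present, Spark stores it under the ml_attr key of the column's metadata; a hedged sketch of preferring it before falling back to row inspection (the helper name is an assumption):

```python
def vector_length(df, name):
    """Prefer ML attribute metadata; fall back to inspecting one row."""
    meta = df.schema[name].metadata.get("ml_attr", {})
    if "num_attrs" in meta:
        return meta["num_attrs"]
    return len(df.select(name).first()[0])
```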

A Transformer class minimally wraps the above, as does an Estimator/Model pair. The fit method finds the names, lengths, and data types of the vector columns and makes them available as parameters of the Model object it emits.
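
A sketch of what the Transformer wrapper might look like, reusing flatten_vector_udt from above; the class name is illustrative, and the Estimator variant would run the row inspection once in fit and hand the results to the Model it returns:

```python
from pyspark.ml import Transformer

class FlattenVectorUDT(Transformer):
    """Stateless wrapper: the flattening is re-derived on every transform call."""
    def _transform(self, df):
        return flatten_vector_udt(df)
```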

A VectorReAssembler detects columns whose names end in [###] (where ### is a sequence of digits) and uses VectorAssembler to re-assemble them into a vector. A dataframe which is flattened and re-assembled will usually be identical to the initial dataframe. However, note that any SparseVector-valued columns will be re-assembled as DenseVector.
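
A hedged sketch of the detection and re-assembly logic; the regular expression and helper name are assumptions:

```python
import re
from pyspark.ml.feature import VectorAssembler

def reassemble_vectors(df):
    """Group columns named like name[0], name[1], ... back into one vector column."""
    pattern = re.compile(r"^(.*)\[(\d+)\]$")
    groups = {}
    for c in df.columns:
        m = pattern.match(c)
        if m:
            groups.setdefault(m.group(1), []).append((int(m.group(2)), c))
    for name, indexed in groups.items():
        ordered = [c for _, c in sorted(indexed)]  # restore element order
        df = (VectorAssembler(inputCols=ordered, outputCol=name)
              .transform(df)
              .drop(*ordered))
    return df
```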

Ideally, call flattenVectorUDT before a pandas_udf, apply some pandas transformations, drop unneeded columns, and re-assemble. pandas_udf could be decorated to do this automatically.
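
The round trip might look like this, using the sketches above and a hypothetical pandas_udf rescale (Spark 3 type-hint style); df is assumed to carry a flattened probability vector of length 2:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def rescale(s: pd.Series) -> pd.Series:
    # An arbitrary pandas transformation, now on a plain numeric Series.
    return (s - s.mean()) / s.std()

flat = flatten_vector_udt(df)                               # vectors -> double columns
flat = flat.withColumn("probability[1]", rescale(flat["probability[1]"]))
result = reassemble_vectors(flat)                           # ...and back into a vector
```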

In the long term, Scala code that projects each vector element into its own column, followed by a flatMap, together with a PySpark wrapper, might look a little cleaner.
