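from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

# Reuse the active session (or start one) so the snippet also runs outside the PySpark shell.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, 'male'), (1, 'female'), (2, 'male'), (3, 'female'), (4, 'other')],
    ['id', 'gender'],
)

# Fit the indexer on the gender column and append the numeric index as a new column.
indexer = StringIndexer(inputCol='gender', outputCol='indexed_gender')
indexed = indexer.fit(df).transform(df)
indexed.show()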
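+---+------+--------------+
| id|gender|indexed_gender|
+---+------+--------------+
|  0|  male|           1.0|
|  1|female|           0.0|
|  2|  male|           1.0|
|  3|female|           0.0|
|  4| other|           2.0|
+---+------+--------------+

Both `male` and `female` occur twice. Spark 3.0 and later break such frequency ties alphabetically, so `female` receives index 0.0 and `male` receives 1.0.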
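The way indices are assigned can be sketched in plain Python. This is only an illustration of the default `frequencyDesc` ordering with alphabetical tie-breaking, as documented for Spark 3.0 and later; it is not Spark's implementation, and `assign_indices` is a hypothetical helper introduced here.

```python
from collections import Counter


def assign_indices(values):
    """Order labels by descending frequency, breaking ties alphabetically."""
    # Nulls are ignored when the label set is built.
    counts = Counter(v for v in values if v is not None)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer emits indices as doubles, hence float().
    return {label: float(i) for i, label in enumerate(ordered)}


mapping = assign_indices(['male', 'female', 'male', 'female', 'other'])
print(mapping)
```

The most frequent labels get the lowest indices, so a label that appears only once, such as `other`, ends up last.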
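from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

# Reuse the active session (or start one) so the snippet also runs outside the PySpark shell.
spark = SparkSession.builder.getOrCreate()

# The last row has a null gender, which StringIndexer treats as invalid input.
df = spark.createDataFrame(
    [(0, 'male'), (1, 'female'), (2, 'male'), (3, 'female'), (4, 'other'), (5, None)],
    ['id', 'gender'],
)

# handleInvalid='skip' drops invalid rows during transform instead of raising an error.
indexer = StringIndexer(inputCol='gender', outputCol='indexed_gender', handleInvalid='skip')
indexed = indexer.fit(df).transform(df)
indexed.show()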
+---+------+--------------+
| id|gender|indexed_gender|
+---+------+--------------+
|  0|  male|           1.0|
|  1|female|           0.0|
|  2|  male|           1.0|
|  3|female|           0.0|
|  4| other|           2.0|
+---+------+--------------+

This example sets the `handleInvalid` parameter to 'skip', which drops rows whose input column is null instead of raising an error (the default, 'error'). The row with `id` 5 is therefore missing from the output.

Library: Both examples use PySpark.
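The effect of the `handleInvalid` modes can be sketched in plain Python as well. This is a hedged illustration of the documented behaviour ('error' raises, 'skip' drops the row, 'keep' places invalid values in an extra bucket at index `numLabels`), not Spark code; `index_column` and its parameters are hypothetical names.

```python
def index_column(values, labels, handle_invalid='error'):
    """Map values to label indices; None or unseen labels count as invalid."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    result = []
    for value in values:
        if value in lookup:
            result.append((value, lookup[value]))
        elif handle_invalid == 'skip':
            continue  # drop the row entirely
        elif handle_invalid == 'keep':
            # Invalid values share one extra bucket after the known labels.
            result.append((value, float(len(labels))))
        else:
            raise ValueError(f'invalid value: {value!r}')
    return result


labels = ['female', 'male', 'other']
column = ['male', 'female', 'other', None]
skipped = index_column(column, labels, 'skip')
kept = index_column(column, labels, 'keep')
print(skipped)
print(kept)
```

Choosing 'skip' silently shrinks the dataset, while 'keep' preserves every row at the cost of an extra category, so the right mode depends on whether downstream steps can tolerate missing rows.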