```python
from pyspark.ml.feature import StringIndexer

# df is assumed to be an existing DataFrame with "id" and "city" columns
indexer = StringIndexer(inputCol="city", outputCol="city_index")
indexed = indexer.fit(df).transform(df)
indexed.show()
```
```
+---+-----------+----------+
| id|       city|city_index|
+---+-----------+----------+
|  1|   New York|       0.0|
|  2|Los Angeles|       1.0|
|  3|    Chicago|       2.0|
|  4|    Chicago|       2.0|
|  5|   New York|       0.0|
|  6|Los Angeles|       1.0|
+---+-----------+----------+
```
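By default, `StringIndexer` orders labels by descending frequency (`stringOrderType="frequencyDesc"`): the most frequent label receives index 0.0, and ties are broken alphabetically. The following pure-Python sketch (not part of PySpark — `string_indexer_order` is a hypothetical helper) illustrates that ordering rule on a small sample:

```python
from collections import Counter

def string_indexer_order(values):
    """Sketch of StringIndexer's default "frequencyDesc" ordering:
    the most frequent label gets index 0.0; ties sort alphabetically."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically for equal counts
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

cities = ["Paris", "Paris", "Paris", "London", "London", "Tokyo"]
print(string_indexer_order(cities))
# {'Paris': 0.0, 'London': 1.0, 'Tokyo': 2.0}
```

Note that in the city example above all three cities appear exactly twice, so the tie-breaking rule decides the order.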
```python
from pyspark.ml.feature import StringIndexer

# df is assumed to be an existing DataFrame with "id" and "label" columns
indexer = StringIndexer(inputCol="label", outputCol="label_index")
indexed = indexer.fit(df).transform(df)
indexed.show()
```
```
+---+--------+-----------+
| id|   label|label_index|
+---+--------+-----------+
|  1|positive|        0.0|
|  2|negative|        1.0|
|  3|positive|        0.0|
|  4|positive|        0.0|
|  5|negative|        1.0|
|  6|positive|        0.0|
+---+--------+-----------+
```

This snippet converts the `label` column into a new column, `label_index`, containing numerical indices; here `StringIndexer` encodes the target labels for a classification task. The `fit()` and `transform()` methods are used in the same way as in the first example. Both examples demonstrate the `StringIndexer` transformer, which is found in the `pyspark.ml.feature` package.
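The fit/transform split used in both examples matters: `fit()` learns the label-to-index mapping from the data, and `transform()` applies that fixed mapping to any DataFrame. The toy class below (a pure-Python stand-in, not PySpark) sketches this contract, including the default `handleInvalid="error"` behaviour where a label unseen during fitting raises an error:

```python
from collections import Counter

class MiniStringIndexer:
    """Toy illustration of the StringIndexer fit/transform contract.
    Not PySpark: works on lists of plain dicts instead of DataFrames."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = None

    def fit(self, rows):
        # Learn the mapping: most frequent label -> 0.0, ties alphabetical
        counts = Counter(row[self.input_col] for row in rows)
        ordered = sorted(counts, key=lambda s: (-counts[s], s))
        self.mapping = {label: float(i) for i, label in enumerate(ordered)}
        return self

    def transform(self, rows):
        # A label absent from the fitted mapping raises KeyError,
        # mirroring handleInvalid="error"
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]

rows = [{"id": 1, "label": "positive"}, {"id": 2, "label": "negative"},
        {"id": 3, "label": "positive"}]
indexed = MiniStringIndexer("label", "label_index").fit(rows).transform(rows)
print(indexed[0]["label_index"])  # 0.0 -- "positive" is most frequent
```

Because the mapping is learned only once, fitting on training data and then transforming both training and test data guarantees consistent indices across the two sets.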