The `createRDD` function in the `pyspark.streaming.kafka.KafkaUtils` module is used to create an RDD (Resilient Distributed Dataset) from Kafka topics. RDDs are the fundamental data structure in Spark, and they allow for distributed processing of large datasets across a cluster of computers.
This function takes in the following parameters:

- `sc`: The `SparkContext` object, which is the main entry point for Spark functionality.
- `kafkaParams`: A dictionary of Kafka configuration properties, such as `metadata.broker.list` (the Kafka broker list) and other consumer settings.
- `offsetRanges`: A list of `OffsetRange` objects, each specifying a topic, a partition, and the starting and ending offsets to read. Because `createRDD` performs a bounded, batch-style read, the offset ranges are required rather than defaulting to the latest offset.
- Optional parameters include `leaders` (a mapping of topic-and-partition to broker leaders), `keyDecoder` and `valueDecoder` (which default to UTF-8 decoding), and `messageHandler` (a function applied to each message's full metadata).
The function returns an RDD of Kafka messages. By default, each message is represented as a `(key, value)` tuple decoded as UTF-8 strings; if a `messageHandler` is supplied, each element is whatever the handler returns from the `KafkaMessageAndMetadata` record, which exposes the topic name, partition ID, offset, key, and value.
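As a rough illustration of the call shape, here is a minimal sketch. It assumes the Kafka 0.8 integration package available in Spark 1.3 through 2.4, and the broker address, topic name (`events`), and offset range are placeholder assumptions, not values from the original examples.

```python
from pyspark import SparkContext
from pyspark.streaming.kafka import KafkaUtils, OffsetRange

sc = SparkContext(appName="kafka-batch-read")

# Kafka connection properties (placeholder broker address).
kafkaParams = {"metadata.broker.list": "localhost:9092"}

# Read offsets 0-99 of partition 0 of a hypothetical topic "events".
offsetRanges = [OffsetRange(topic="events", partition=0,
                            fromOffset=0, untilOffset=100)]

rdd = KafkaUtils.createRDD(sc, kafkaParams, offsetRanges)

# With the default decoders, each element is a (key, value) pair of strings
# (the key may be None if the producer did not set one).
for key, value in rdd.take(10):
    print(key, value)
```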
This function is commonly used alongside Spark Streaming applications to reprocess or backfill a known range of Kafka offsets as a batch job, complementing `createDirectStream`, which handles continuous, real-time consumption.
Python KafkaUtils.createRDD - 16 examples found. These are the top rated real world Python examples of pyspark.streaming.kafka.KafkaUtils.createRDD extracted from open source projects.