def process_rdd(time: time_, rdd: RDD) -> None:
    """Persist one micro-batch of tweets to the S3 bronze layer as JSON.

    Skips empty batches. Non-empty batches are wrapped in a single-column
    string DataFrame and written under a date-partitioned S3 prefix.

    :param time: batch timestamp supplied by the streaming framework.
    :param rdd: the batch RDD of raw tweet strings.
    """
    if rdd.isEmpty():
        return
    # Lazy %-style args: the message is only formatted if INFO is enabled.
    logging.info("----------- %s -----------", time)
    sql_context = get_sql_context_instance(rdd.context)
    tweets_df = sql_context.createDataFrame(rdd, StringType())
    # Bug fix: strftime must be called on the batch timestamp argument
    # `time`, not on `time_` (which is only its type annotation/alias).
    tweets_df.write.json(
        f"""s3a://jconf-2020/bronze/{time.strftime("%Y-%m-%d")}/{reverse_current_time_millis()}"""
    )
def __preprocessRdd(self, rdd: RDD):
    """Correct raw tweet records, strip <tweet> tags, and clean the text.

    :param rdd: RDD of raw tweet strings.
    :return: a cleaned DataFrame, or None when the corrected RDD is
             empty/absent (preserves the original best-effort contract).
    """
    rddc = rddCorrector()
    rdd = rdd.map(lambda l: rddc.correct(l))
    # Idiomatic identity/emptiness checks ("is not None", "not isEmpty()");
    # the two tag-stripping passes are fused into a single map.
    if rdd is not None and not rdd.isEmpty():
        rdd = rdd.map(lambda l: l.replace("<tweet>", "").replace("</tweet>", ""))
        df = DataFrameWorks().convertDataFrame(rdd, self.__spark)
        df = CleanText().clean(df, self.__spark)
        return df
    return None
def __convert_service_format(rdd: RDD) -> RDD:
    """Convert raw service records into the internal row format.

    Empty RDDs pass through untouched. Otherwise the rows are enriched
    with neighborhood data, key columns are hashed into ids, and the
    timestamp strings are parsed into integer epoch seconds.
    """
    if rdd.isEmpty():
        return rdd

    frame = rdd.toDF()
    frame = add_neighborhoods(frame)

    ts_format = "yyyy-MM-dd'T'HH:mm:ss.SSS"

    def _epoch_seconds(column_name):
        # Parse an ISO-style timestamp string into integer epoch seconds.
        return unix_timestamp(to_timestamp(column_name, ts_format)).cast(IntegerType())

    frame = (
        frame.withColumn("row_id", hasher(frame["row_id"]))
        .withColumn("category_id", hasher(frame["category"]))
        .withColumn("opened", _epoch_seconds("opened"))
        .withColumn("report_datetime", _epoch_seconds("report_datetime"))
        .withColumn("neighborhood_id", hasher(frame["neighborhood"]))
    )
    return frame.rdd
def __convert_service_format(rdd: RDD) -> RDD:
    """Convert raw service records into the internal row format.

    Empty RDDs pass through untouched. Otherwise key columns are hashed
    into ids and the raw timestamp strings are replaced by integer epoch
    seconds, dropping the intermediate string columns.
    """
    if rdd.isEmpty():
        return rdd

    data = rdd.toDF()
    # Derive neighborhoods from lat/lon: a lot of the records coming
    # from the API are missing neighborhood data.
    data = add_neighborhoods(data)

    ts_format = "yyyy-MM-dd'T'HH:mm:ss.SSS"

    def _epoch_seconds(column_name):
        # Parse an ISO-style timestamp string into integer epoch seconds.
        return unix_timestamp(to_timestamp(column_name, ts_format)).cast(IntegerType())

    data = (
        data.withColumn("category_id", hasher("category"))
        .withColumn("neighborhood_id", hasher("neighborhood"))
        .withColumn("opened", _epoch_seconds("openedStr"))
        .withColumn("updated", _epoch_seconds("updatedStr"))
        .drop("openedStr", "updatedStr")
    )
    return data.rdd