def persistLocal(self, dfName: str, df: DataFrame, persistType: str):
    """Persist the input DataFrame locally (memory/disk/none) and run
    `df.take(1)` to force materialisation."""
    df.persist(
        self.getSparkPersistType(persistTypStr=persistType.upper()))
    if self.__printcount is None:
        # Trigger an action so the persist is actually materialised
        df.take(1)
    return df
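# `getSparkPersistType` is called above but not defined in this excerpt. The
# sketch below shows what it plausibly does: map the upper-cased persist-type
# string to a pyspark StorageLevel. The accepted strings and the fallback
# level are assumptions, not the confirmed implementation.
def getSparkPersistType(self, persistTypStr: str):
    """Sketch: map "MEMORY"/"DISK"/"NONE" to a pyspark StorageLevel."""
    from pyspark import StorageLevel  # local import keeps the sketch self-contained

    mapping = {
        "MEMORY": StorageLevel.MEMORY_ONLY,
        "DISK": StorageLevel.DISK_ONLY,
        "NONE": StorageLevel.NONE,
    }
    # Assumed fallback: Spark's standard MEMORY_AND_DISK for unknown strings
    return mapping.get(persistTypStr, StorageLevel.MEMORY_AND_DISK)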
# Imports required below (normally placed at the top of the module)
from functools import reduce

from pyspark.sql import DataFrame, SparkSession


def _truth_space_table_old(
    df_labels_with_splink_scores: DataFrame,
    spark: SparkSession,
    threshold_actual: float = 0.5,
    score_colname: str = None,
):
    """Create a table of the ROC space, i.e. truth table statistics for each
    discrimination threshold

    Args:
        df_labels_with_splink_scores (DataFrame): A dataframe of labels and
            associated splink scores, usually the output of the
            truth.labels_with_splink_scores function
        threshold_actual (float, optional): Threshold to use in categorising
            clerical match scores into match or no match. Defaults to 0.5.
        score_colname (str, optional): Allows the user to explicitly state the
            name of the column in the Splink dataset containing the Splink
            score. If None, it will be inferred.

    Returns:
        DataFrame: Table of 'truth space', i.e. truth categories for each
            threshold level
    """

    # This dataframe is scanned once per threshold to generate the ROC curve,
    # so cache it
    df_labels_with_splink_scores.persist()

    score_colname = _get_score_colname(df_labels_with_splink_scores, score_colname)

    # Use percentiles of the distinct score values as candidate thresholds
    percentiles = [x / 100 for x in range(0, 101)]

    values_distinct = df_labels_with_splink_scores.select(score_colname).distinct()
    thresholds = values_distinct.stat.approxQuantile(score_colname, percentiles, 0.0)
    thresholds.append(1.01)
    thresholds = sorted(set(thresholds))

    roc_dfs = []
    for thres in thresholds:
        df_e_t = df_e_with_truth_categories(
            df_labels_with_splink_scores, thres, spark, threshold_actual, score_colname
        )
        df_roc_row = _summarise_truth_cats(df_e_t, spark)
        roc_dfs.append(df_roc_row)

    all_roc_df = reduce(DataFrame.unionAll, roc_dfs)

    return all_roc_df
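# A minimal usage sketch for the function above. The column names
# ("tf_adjusted_match_prob" for the Splink score, "clerical_match_score" for
# the human label) follow splink's conventions but are assumptions here;
# substitute whatever columns your labelled dataset actually uses.
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    labels = spark.createDataFrame(
        [
            ("a1", "b1", 0.95, 1.0),  # model and clerical review both say match
            ("a2", "b2", 0.40, 1.0),  # model is unsure, clerical says match
            ("a3", "b3", 0.05, 0.0),  # both say no match
        ],
        ["unique_id_l", "unique_id_r", "tf_adjusted_match_prob", "clerical_match_score"],
    )

    truth_space = _truth_space_table_old(labels, spark, threshold_actual=0.5)
    truth_space.show()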