Examples of get_tokenizers_for_blocking in Python

Programming language: Python

Namespace / package name: dmagellan.feature.tokenizers

Method / function: get_tokenizers_for_blocking

Examples on hotexamples.com: 2

get_tokenizers_for_blocking in Python: 2 examples found. These are the top-rated real-world examples of dmagellan.feature.tokenizers.get_tokenizers_for_blocking in Python, extracted from open source projects.

Example 1

File: autofeaturegen.py Project: kvpradap/dmagellan

def get_features_for_blocking(ltable, rtable):
    """
    This function automatically generates features that can be used for
    blocking purposes.

    Args:
        ltable,rtable (DataFrame): The pandas DataFrames for which the
            features are to be generated.

    Returns:
        A pandas DataFrame containing automatically generated features.

        Specifically, the DataFrame contains the following attributes:
        'feature_name', 'left_attribute', 'right_attribute',
        'left_attr_tokenizer', 'right_attr_tokenizer', 'simfunction',
        'function', 'function_source', and 'is_auto_generated'.

        Further, this function also sets the following global variables:
        _block_t, _block_s, _atypes1, _atypes2, and _block_c.

        The variable _block_t contains the tokenizers used and _block_s
        contains the similarity functions used for creating features.

        The variables _atypes1, and _atypes2 contain the attribute types for
        ltable and rtable respectively. The variable _block_c contains the
        attribute correspondences between the two input tables.

    Raises:
        AssertionError: If `ltable` is not of type pandas DataFrame.
        AssertionError: If `rtable` is not of type pandas DataFrame.

    Examples:
        >>> import py_entitymatching as em
        >>> A = em.read_csv_metadata('path_to_csv_dir/table_A.csv', key='ID')
        >>> B = em.read_csv_metadata('path_to_csv_dir/table_B.csv', key='ID')
        >>> block_f = em.get_features_for_blocking(A, B)

    Note:
        In the output DataFrame, two attributes demand some explanation:
        (1) function, and (2) is_auto_generated. The function, points to the
        actual Python function that implements the feature. Specifically, the
        function takes in two tuples (one from each input table) and returns
        a numeric value. The attribute is_auto_generated contains either True
        or False. The flag is True only if the feature is automatically
        generated by py_entitymatching. This is important because this flag
        is used to make some assumptions about the semantics of the
        similarity function used and use that information for scaling
        purposes.

    See Also:
        :meth:`py_entitymatching.get_attr_corres`,
        :meth:`py_entitymatching.get_attr_types`,
        :meth:`py_entitymatching.get_sim_funs_for_blocking`
        :meth:`py_entitymatching.get_tokenizers_for_blocking`
    """
    # Validate input parameters
    # # We expect the ltable to be of type pandas DataFrame
    if not isinstance(ltable, pd.DataFrame):
        logger.error('Input table A is not of type pandas DataFrame')
        raise AssertionError('Input table A is not of type pandas DataFrame')

    # # We expect the rtable to be of type pandas DataFrame
    if not isinstance(rtable, pd.DataFrame):
        logger.error('Input table B is not of type pandas dataframe')
        raise AssertionError('Input table B is not of type pandas dataframe')

    # Get the similarity functions to be used for blocking
    sim_funcs = sim.get_sim_funs_for_blocking()
    # Get the tokenizers to be used for blocking
    tok_funcs = tok.get_tokenizers_for_blocking()

    # Get the attr. types for ltable and rtable
    attr_types_ltable = au.get_attr_types(ltable)
    attr_types_rtable = au.get_attr_types(rtable)

    # Get the attr. correspondences between ltable and rtable
    attr_corres = au.get_attr_corres(ltable, rtable)

    # Get features based on attr types, attr correspondences, sim functions
    # and tok. functions
    feature_table = get_features(ltable, rtable, attr_types_ltable,
                                 attr_types_rtable, attr_corres,
                                 tok_funcs, sim_funcs)

    # Export important variables to global name space
    # em._match_t = tok_funcs
    # em._block_s = sim_funcs
    # em._atypes1 = attr_types_ltable
    # em._atypes2 = attr_types_rtable
    # em._block_c = attr_corres

    # Return the feature table
    return feature_table
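The excerpt above calls tok.get_tokenizers_for_blocking() but does not show what it returns. As a rough illustration only, the sketch below assumes the result is a dictionary mapping tokenizer names to callables that turn a string into a list of tokens. The function sketch_tokenizers_for_blocking, its parameters, the key names, and the unpadded q-gram behaviour are all hypothetical and are not taken from dmagellan.

```python
def make_qgram_tokenizer(q):
    """Return a function that splits a string into overlapping q-grams."""
    def tokenize(value):
        text = str(value)
        # No padding is applied here; a real library may pad the string.
        return [text[i:i + q] for i in range(len(text) - q + 1)]
    return tokenize


def make_delimiter_tokenizer(delimiter):
    """Return a function that splits a string on a delimiter."""
    def tokenize(value):
        # Drop empty tokens produced by repeated delimiters.
        return [tok for tok in str(value).split(delimiter) if tok]
    return tokenize


def sketch_tokenizers_for_blocking(q=(2, 3), delimiters=(' ',)):
    """Hypothetical stand-in: build a name -> tokenizer dictionary."""
    tokenizers = {}
    for n in q:
        tokenizers['qgm_%d' % n] = make_qgram_tokenizer(n)
    for i, delimiter in enumerate(delimiters):
        tokenizers['dlm_dc%d' % i] = make_delimiter_tokenizer(delimiter)
    return tokenizers


tok_funcs = sketch_tokenizers_for_blocking()
bigrams = tok_funcs['qgm_2']('abc')
words = tok_funcs['dlm_dc0']('data  cleaning tools')
```

A feature generator can then pair each tokenizer in such a dictionary with a similarity function for every pair of corresponding attributes, which is the role tok_funcs plays in the call to get_features above.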

Example 2

File: test_rb_tbls.py Project: kvpradap/tuning_tool

A.reset_index(inplace=True, drop=True)
B.reset_index(inplace=True, drop=True)

s = A.title.str.len().sort_values().index
A1 = A.reindex(s)
A1 = A1.reset_index(drop=True)

s = B.title.str.len().sort_values().index
B1 = B.reindex(s)
B1 = B1.reset_index(drop=True)

rb = RuleBasedBlocker()
feature_table = get_features_for_blocking(A, B)
sim = get_sim_funs_for_blocking()
tok = get_tokenizers_for_blocking()
block_f = get_features_for_blocking(A1, B1)
_ = rb.add_rule(['title_title_lev_dist(ltuple, rtuple) > 6'], block_f)
rb.set_table_attrs(['title'], ['title'])

input_tables = OrderedDict()
input_tables['ltable'] = A1
input_tables['rtable'] = B1

input_args = OrderedDict()
input_args['l_key'] = 'id'
input_args['r_key'] = 'id'
input_args['compute'] = True
input_args['show_progress'] = False
input_args['scheduler'] = multiprocessing.get
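The rule 'title_title_lev_dist(ltuple, rtuple) > 6' in the excerpt compares the title attribute of two tuples by Levenshtein (edit) distance. The sketch below is a self-contained approximation of what such a rule computes. It assumes, as in py_entitymatching-style rule-based blocking, that a pair is dropped when the rule evaluates to True; that semantics is not confirmed for tuning_tool. The helper names lev_dist and survives_blocking are hypothetical.

```python
def lev_dist(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def survives_blocking(ltuple, rtuple, threshold=6):
    """Assumed semantics: the pair is dropped when the rule is True."""
    rule_fires = lev_dist(ltuple['title'], rtuple['title']) > threshold
    return not rule_fires


close_pair = survives_blocking({'title': 'data cleaning'},
                               {'title': 'data cleaning!'})
far_pair = survives_blocking({'title': 'abc'},
                             {'title': 'zzzzzzzzzzzz'})
```

Under that assumption, pairs whose titles differ by more than six edits are removed from the candidate set before matching.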