Example #1
from glob import glob

# Project-local modules; the module paths below are assumed from the class
# names used in this snippet and may differ in the actual repository.
from reader import ReadFile
from parser import Parse
from indexer import Indexer
import utils


def run_engine(corpus_path, output_path, stemming=False):
    """
    Builds the retrieval model.
    Preprocess, parse and index corpus.
    :return: a tuple of number_of_documents in the corpus and average_document_length
    """

    number_of_documents = 0
    total_document_length = 0

    reader = ReadFile(corpus_path)
    parser = Parse()
    indexer = Indexer(output_path)

    # read all parquet data files
    files = glob(corpus_path + "/**/*.parquet", recursive=True)

    # read, parse, and index documents in batches; posting files are partitioned
    # by the letters of the English alphabet. A batch is the set of documents read
    # from two consecutive parquet files. Each batch's postings are first written
    # as partial posting files tagged with batch_index and later merged into one
    # coherent set of postings.
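    # For illustration only (the exact file-naming scheme is an assumption, not
    # taken from Indexer): postings for terms starting with "a" would live in an
    # "a" posting file, and batch 3's partial "a" file might be named "a_3"
    # until consolidation merges it into the final "a" file.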
    batch_index = 0
    file_index = 0
    while file_index < len(files):

        # batch two files at a time to reduce the disk-seek penalty
        first_file = files[file_index]
        first_documents_list = reader.read_file(first_file)

        if file_index + 1 < len(files):
            second_file = files[file_index + 1]
            second_documents_list = reader.read_file(second_file)
            documents_list = first_documents_list + second_documents_list

        else:  # odd number of files: the last batch contains a single file
            documents_list = first_documents_list

        file_index += 2

        # parse every document in the batch
        parsed_file = set()
        for document_as_list in documents_list:
            parsed_document = parser.parse_doc(document_as_list, stemming)
            parsed_file.add(parsed_document)
            total_document_length += parsed_document.doc_length
            number_of_documents += 1

        # index parsed documents
        indexer.index_batch(parsed_file, str(batch_index))

        batch_index += 1

    # calculate average document length (guard against an empty corpus)
    average_document_length = (total_document_length / number_of_documents
                               if number_of_documents else 0.0)

    # after indexing all non-entity terms in the corpus, index the legal
    # (i.e., validated) entities
    indexer.index_entities()

    # save the index dictionary to disk; output_path is used as a raw string
    # prefix here, so it should end with a path separator
    utils.save_obj(indexer.inverted_idx, output_path + "inverted_idx")

    # after indexing the whole corpus, consolidate all partial posting files
    indexer.consolidate_postings()

    return number_of_documents, average_document_length
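
A minimal usage sketch follows; the directory paths and the stemming flag value are hypothetical, chosen only for illustration:

if __name__ == "__main__":
    corpus_dir = "data/corpus"   # hypothetical tree containing the *.parquet files
    output_dir = "data/index/"   # hypothetical; keep the trailing separator, since it is used as a raw prefix

    num_docs, avg_len = run_engine(corpus_dir, output_dir, stemming=True)
    print(f"Indexed {num_docs} documents; average document length: {avg_len:.2f} tokens")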