Binarizer.binarize_bpe examples in Python

Programming language: Python

Namespace / package name: ncc.data.tools.binarizer

Class / Type: Binarizer

Method / Function: binarize_bpe

Examples on hotexamples.com: 2

Binarizer.binarize_bpe in Python: 2 examples found. These are the top-rated real-world examples of ncc.data.tools.binarizer.Binarizer.binarize_bpe in Python, extracted from open source projects.

Frequently used methods

binarize (4)

binarize_bpe (2)

find_offsets (2)

Example #1

File: preprocess_codebert.py Project: CGCL-codes/naturalcc

def make_binary_dataset(vocab: Dictionary, input_file, output_file, attr: str, num_workers: int):
    """make binary dataset"""
    LOGGER.info("[{}] Dictionary: {} types".format(attr, len(vocab) - 1))
    n_seq_tok = [0, 0]
    replaced = Counter()  # save un-recorded tokens

    def merge_result(worker_result):
        replaced.update(worker_result["replaced"])
        n_seq_tok[0] += worker_result["nseq"]
        n_seq_tok[1] += worker_result["ntok"]

    # split a file into different parts
    # if use multi-processing, we first process 2nd to last file
    # 1.txt -> 10 processor, 0(p0)(0-99), 100(p1)(100-199), ...
    offsets = Binarizer.find_offsets(input_file, num_workers)
    pool = None
    if num_workers > 1:
        # p1-pN -> (1 bin-txt, 1 idx), (N bin-txt, N idx)
        pool = Pool(processes=num_workers - 1)
        for worker_id in range(1, num_workers):
            prefix = "{}{}".format(output_file, worker_id)
            pool.apply_async(
                binarize,
                (args, input_file, vocab, prefix, attr, offsets[worker_id], offsets[worker_id + 1]),
                callback=merge_result,
            )
        pool.close()
    # process 1th file, if multi-processing available. If not, process all file
    # p0 -> 0,end
    ds_file = '{}.mmap'.format(output_file)
    ds = indexed_dataset.make_builder(
        ds_file, impl=args['preprocess']['dataset_impl'], vocab_size=len(vocab))
    merge_result(
        Binarizer.binarize_bpe(input_file, vocab, lambda t: ds.add_item(t), offset=0, end=offsets[1]))
    if num_workers > 1:
        # p1-pN
        pool.join()
        # merge sub-processors' index and data files into final files and delete them.
        for worker_id in range(1, num_workers):
            temp_file_path = "{}{}".format(output_file, worker_id)
            ds.merge_file_(temp_file_path)
            # idx, txt
            os.remove(indexed_dataset.data_file_path(temp_file_path))
            os.remove(indexed_dataset.index_file_path(temp_file_path))
    ds.finalize('{}.idx'.format(output_file))
    LOGGER.info(
        "[{}] {}: {} sents, {} tokens, BPE no replaced token".format(
            attr, input_file, n_seq_tok[0], n_seq_tok[1],
        ))
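
The example above relies on Binarizer.find_offsets to split the input file into one byte range per worker. The page does not show that method, so the following is only a minimal sketch of how such a split might work, assuming offsets are byte positions aligned to line starts. The names find_offsets_sketch and read_chunk are hypothetical and are not part of ncc; only the Python standard library is used.

```python
import os
import tempfile


def find_offsets_sketch(path, num_chunks):
    """Hypothetical stand-in for Binarizer.find_offsets.

    Returns num_chunks + 1 byte offsets, each aligned to the start of a line,
    so that chunk i covers the byte range [offsets[i], offsets[i + 1]).
    """
    size = os.path.getsize(path)
    chunk_size = size // num_chunks
    offsets = [0] * (num_chunks + 1)
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(chunk_size * i)
            f.readline()  # skip to the end of the line we landed in
            offsets[i] = f.tell()
    offsets[num_chunks] = size
    return offsets


def read_chunk(path, start, end):
    """Return the decoded lines whose first byte lies in [start, end)."""
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
    return lines


# Demo on a throwaway file with ten short lines (assumed data, not from ncc).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("".join("line {}\n".format(i) for i in range(10)).encode("utf-8"))
    demo_path = tmp.name

offsets = find_offsets_sketch(demo_path, 3)
chunks = [read_chunk(demo_path, s, e) for s, e in zip(offsets[:-1], offsets[1:])]
file_size = os.path.getsize(demo_path)
os.remove(demo_path)
```

In this sketch every line ends up in exactly one chunk, which is the property the worker loop in the example depends on when it passes offsets[worker_id] and offsets[worker_id + 1] to each worker.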

Example #2

File: preprocess_codebert.py Project: CGCL-codes/naturalcc

def binarize(args: Dict, filename: str, dict: Dictionary, out_file_prefix: str, attr: str, offset: int, end: int):
    """binarize function for multi-processing"""
    ds_file = '{}.mmap'.format(out_file_prefix)
    ds = indexed_dataset.make_builder(ds_file, impl=args['preprocess']['dataset_impl'], vocab_size=len(dict))

    def consumer(tensor):
        ds.add_item(tensor)

    res = Binarizer.binarize_bpe(filename, dict, consumer, offset=offset, end=end)
    ds.finalize('{}.idx'.format(out_file_prefix))
    return res
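
Both examples pass a consumer callback to binarize_bpe and later read the keys "nseq", "ntok" and "replaced" from its result. The sketch below illustrates that contract in isolation. It is not the ncc implementation: binarize_lines_sketch is a hypothetical function, the vocabulary is a plain dict, tokens are assumed to be whitespace-separated, and the input is a list of in-memory lines rather than a byte range of a file.

```python
from collections import Counter


def binarize_lines_sketch(lines, vocab, consumer, unk_id=0):
    """Hypothetical stand-in for Binarizer.binarize_bpe on in-memory lines.

    Maps each token to an id, hands the id list to `consumer`, and returns
    the same result keys that merge_result reads in Example #1.
    """
    nseq, ntok = 0, 0
    replaced = Counter()  # tokens that were not found in the vocabulary
    for line in lines:
        ids = []
        for tok in line.split():
            if tok not in vocab:
                replaced[tok] += 1
            ids.append(vocab.get(tok, unk_id))
        consumer(ids)
        nseq += 1
        ntok += len(ids)
    return {"nseq": nseq, "ntok": ntok, "replaced": replaced}


# Assumed toy vocabulary and two "chunks" standing in for two workers.
vocab = {"<unk>": 0, "def": 1, "return": 2, "x": 3}
items = []        # plays the role of ds.add_item
totals = [0, 0]   # [number of sequences, number of tokens]
unknown = Counter()

for chunk in (["def x", "return x"], ["return y"]):
    res = binarize_lines_sketch(chunk, vocab, items.append)
    # Same aggregation as merge_result in Example #1.
    totals[0] += res["nseq"]
    totals[1] += res["ntok"]
    unknown.update(res["replaced"])
```

Passing the sink as a callback keeps the tokenizing loop independent of the storage backend, which is presumably why the examples can use either a lambda or a named consumer function around ds.add_item.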