def make_binary_dataset(vocab: Dictionary, input_file, output_file, attr: str, num_workers: int):
    """Binarize a text attribute file into an mmap dataset (.mmap + .idx)."""
    LOGGER.info("[{}] Dictionary: {} types".format(attr, len(vocab) - 1))
    n_seq_tok = [0, 0]
    replaced = Counter()  # counts out-of-vocabulary tokens replaced by <unk>

    def merge_result(worker_result):
        replaced.update(worker_result["replaced"])
        n_seq_tok[0] += worker_result["nseq"]
        n_seq_tok[1] += worker_result["ntok"]

    # Split the input file into `num_workers` byte ranges aligned to line boundaries.
    # With multiprocessing, workers 1..N-1 binarize the 2nd through last chunks in
    # the background while the main process handles chunk 0
    # (e.g. with 10 workers: worker 0 gets bytes [0, 100), worker 1 gets [100, 200), ...).
    offsets = Binarizer.find_offsets(input_file, num_workers)
    pool = None
    if num_workers > 1:
        # each background worker writes its own temporary (data, index) file pair
        pool = Pool(processes=num_workers - 1)
        for worker_id in range(1, num_workers):
            prefix = "{}{}".format(output_file, worker_id)
            pool.apply_async(
                binarize,
                (args, input_file, vocab, prefix, attr, offsets[worker_id], offsets[worker_id + 1]),
                callback=merge_result,
            )
        pool.close()

    # The main process binarizes chunk 0 (the whole file when num_workers == 1)
    # directly into the final dataset builder. `args` is the module-level
    # preprocessing config.
    ds_file = '{}.mmap'.format(output_file)
    ds = indexed_dataset.make_builder(
        ds_file,
        impl=args['preprocess']['dataset_impl'],
        vocab_size=len(vocab),
    )
    merge_result(
        Binarizer.binarize_bpe(input_file, vocab, lambda t: ds.add_item(t), offset=0, end=offsets[1])
    )
    if num_workers > 1:
        # wait for workers 1..N-1, then merge their temporary data/index files
        # into the final dataset and delete them
        pool.join()
        for worker_id in range(1, num_workers):
            temp_file_path = "{}{}".format(output_file, worker_id)
            ds.merge_file_(temp_file_path)
            os.remove(indexed_dataset.data_file_path(temp_file_path))
            os.remove(indexed_dataset.index_file_path(temp_file_path))
    ds.finalize('{}.idx'.format(output_file))
    LOGGER.info(
        "[{}] {}: {} sents, {} tokens, no tokens replaced (BPE input)".format(
            attr, input_file, n_seq_tok[0], n_seq_tok[1],
        )
    )
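# A minimal, self-contained sketch of what an offset finder like
# Binarizer.find_offsets is assumed to do here: split a text file into
# `num_chunks` byte ranges aligned to line boundaries, so each worker can
# binarize its range independently. This is an illustrative assumption for
# readers, not the library's actual implementation; the helper name is
# hypothetical.
def _find_line_offsets_sketch(path, num_chunks):
    import os
    size = os.path.getsize(path)
    chunk_size = size // num_chunks
    offsets = [0] * (num_chunks + 1)
    with open(path, "rb") as f:
        for i in range(1, num_chunks):
            f.seek(i * chunk_size)
            f.readline()  # advance to the next line boundary
            offsets[i] = f.tell()
    offsets[num_chunks] = size
    return offsets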
def make_graph_bin_dataset(vocab: Dictionary, input_file, output_file, num_workers):
    """Binarize a graph (DGL) file; unlike make_binary_dataset, every chunk,
    including chunk 0, is handled by a pool worker."""
    offsets = Binarizer.find_offsets(input_file, num_workers)
    if num_workers > 1:
        # workers 0..N-1 each write their own temporary (data, index) file pair
        pool = Pool(processes=num_workers)
        for worker_id in range(num_workers):
            prefix = "{}{}".format(output_file, worker_id)
            pool.apply_async(
                binarize_dgl,
                (args, input_file, vocab, prefix, offsets[worker_id], offsets[worker_id + 1]),
            )
        pool.close()
        pool.join()  # wait for all workers; otherwise we may return before they finish
    else:
        prefix = "{}0".format(output_file)
        binarize_dgl(args, input_file, vocab, prefix, 0, -1)
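# A minimal sketch of the worker-pool pattern both functions above rely on,
# with toy stand-ins (names are hypothetical, not from the library): each
# async task processes one byte range, a callback merges per-worker counts,
# and pool.close() must be followed by pool.join() before the merged totals
# (or the workers' partial files) are read.
from multiprocessing import Pool  # already imported at module level here

def _toy_worker(worker_id, start, end):
    # stand-in for binarize / binarize_dgl: pretend each byte is one token
    return {"nseq": 1, "ntok": end - start}

def _run_pool_sketch(offsets):
    totals = {"nseq": 0, "ntok": 0}

    def merge(result):
        totals["nseq"] += result["nseq"]
        totals["ntok"] += result["ntok"]

    pool = Pool(processes=len(offsets) - 1)
    for wid in range(len(offsets) - 1):
        pool.apply_async(_toy_worker, (wid, offsets[wid], offsets[wid + 1]), callback=merge)
    pool.close()
    pool.join()  # block until every worker is done, so `totals` is complete
    return totals

# Usage (run under `if __name__ == "__main__":` so the pool can spawn safely):
#   _run_pool_sketch([0, 100, 200, 300])  -> {"nseq": 3, "ntok": 300}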