Code Example #1

This function turns a raw (context, question) pair into the padded
input tensors a BERT question-answering model expects.
import torch

# BertTokenizer and preprocess_tokenized_text are assumed to be
# importable from the NVIDIA BERT example code this snippet was
# extracted from.

def preprocess_text_input(
        context='Danielle is a girl who really loves her cat, Steve.',
        question='What cat does Danielle love?',
        vocab_file='DeepLearningExamples/PyTorch/LanguageModeling/BERT/vocab/vocab',
        max_seq_length=384,
        max_query_length=64,
        # n_best_size, max_answer_length and null_score_diff_threshold
        # are accepted here but unused; they only matter when
        # post-processing the model's predictions
        n_best_size=1,
        max_answer_length=30,
        null_score_diff_threshold=-11.0):
    tokenizer = BertTokenizer(vocab_file, do_lower_case=True, max_len=512)
    # whitespace-split the context; only the question is
    # WordPiece-tokenized at this point
    doc_tokens = context.split()
    query_tokens = tokenizer.tokenize(question)
    feature = preprocess_tokenized_text(doc_tokens,
                                        query_tokens,
                                        tokenizer,
                                        max_seq_length=max_seq_length,
                                        max_query_length=max_query_length)

    tensors_for_inference, tokens_for_postprocessing = feature

    # unsqueeze(0) adds a batch dimension of size 1
    input_ids = torch.tensor(tensors_for_inference.input_ids,
                             dtype=torch.long).unsqueeze(0)
    segment_ids = torch.tensor(tensors_for_inference.segment_ids,
                               dtype=torch.long).unsqueeze(0)
    input_mask = torch.tensor(tensors_for_inference.input_mask,
                              dtype=torch.long).unsqueeze(0)
    return (input_ids, segment_ids, input_mask)
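
A minimal usage sketch: the `model` variable is hypothetical and stands
for a BertForQuestionAnswering already loaded and put in eval mode, as
set up in Code Example #3 below.

import torch

input_ids, segment_ids, input_mask = preprocess_text_input()
with torch.no_grad():
    # model is assumed to be a loaded BertForQuestionAnswering
    start_logits, end_logits = model(input_ids, segment_ids, input_mask)
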
Code Example #2

The same preprocessing pipeline wrapped in a classmethod that also
runs the model and post-processes the logits into predictions.
    @classmethod
    def predict(cls,
                context,
                question,
                bing_key=None,  # unused in this excerpt
                max_seq_length=384,
                max_query_length=64,
                n_best_size=3,
                do_lower_case=True,
                can_give_negative_answer=True,
                max_answer_length=30,
                null_score_diff_threshold=-11.0):
        """For the input, do the predictions and return them.
        Args:
            input (a pandas dataframe): The data on which to do the predictions. There will be
                one prediction per row in the dataframe"""
        predictor_model = cls.get_predictor_model()

        doc_tokens = context.split()
        # vocab_file here (and device below) is assumed to be defined
        # at module level; neither is a parameter of this method
        tokenizer = BertTokenizer(vocab_file,
                                  do_lower_case=do_lower_case,
                                  max_len=max_seq_length)
        query_tokens = tokenizer.tokenize(question)
        feature = preprocess_tokenized_text(doc_tokens,
                                            query_tokens,
                                            tokenizer,
                                            max_seq_length=max_seq_length,
                                            max_query_length=max_query_length)
        tensors_for_inference, tokens_for_postprocessing = feature

        input_ids = torch.tensor(tensors_for_inference.input_ids,
                                 dtype=torch.long,
                                 device=device).unsqueeze(0)
        segment_ids = torch.tensor(tensors_for_inference.segment_ids,
                                   dtype=torch.long,
                                   device=device).unsqueeze(0)
        input_mask = torch.tensor(tensors_for_inference.input_mask,
                                  dtype=torch.long,
                                  device=device).unsqueeze(0)

        # run prediction
        with torch.no_grad():
            start_logits, end_logits = predictor_model(input_ids, segment_ids,
                                                       input_mask)

        # post-processing
        start_logits = start_logits[0].detach().cpu().tolist()
        end_logits = end_logits[0].detach().cpu().tolist()
        prediction = get_predictions(doc_tokens, tokens_for_postprocessing,
                                     start_logits, end_logits, n_best_size,
                                     max_answer_length, do_lower_case,
                                     can_give_negative_answer,
                                     null_score_diff_threshold)

        return prediction
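
Assuming the surrounding class provides get_predictor_model() and the
module defines vocab_file and device, a call looks like this (the class
name QAModel is hypothetical):

prediction = QAModel.predict(
    context='Danielle is a girl who really loves her cat, Steve.',
    question='What cat does Danielle love?')
print(prediction)
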
Code Example #3

A complete command-line inference script: argument parsing, seeding,
device selection, model loading, prediction, and post-processing.
import argparse
import json
import random

import numpy as np
import torch

# BertTokenizer, BertConfig, BertForQuestionAnswering,
# preprocess_tokenized_text and get_answer are assumed to come from
# the NVIDIA BERT example code this script belongs to.


def main():
    parser = argparse.ArgumentParser()

    ## Required parameters
    parser.add_argument(
        "--bert_model",
        default=None,
        type=str,
        required=True,
        help="Bert pre-trained model selected in the list: bert-base-uncased, "
        "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
        "bert-base-multilingual-cased, bert-base-chinese.")
    parser.add_argument("--init_checkpoint",
                        default=None,
                        type=str,
                        required=True,
                        help="The checkpoint file from pretraining")

    ## Other parameters
    parser.add_argument(
        "--verbose_logging",
        action='store_true',
        help=
        "If true, all of the warnings related to data processing will be printed. "
    )
    parser.add_argument("--seed", default=1, type=int)
    parser.add_argument(
        "--question",
        default=
        "Most antibiotics target bacteria and don't affect what class of organisms? ",
        type=str,
        help="question")
    parser.add_argument(
        "--context",
        default=
        "Within the genitourinary and gastrointestinal tracts, commensal flora serve as biological barriers by competing with pathogenic bacteria for food and space and, in some cases, by changing the conditions in their environment, such as pH or available iron. This reduces the probability that pathogens will reach sufficient numbers to cause illness. However, since most antibiotics non-specifically target bacteria and do not affect fungi, oral antibiotics can lead to an overgrowth of fungi and cause conditions such as a vaginal candidiasis (a yeast infection). There is good evidence that re-introduction of probiotic flora, such as pure cultures of the lactobacilli normally found in unpasteurized yogurt, helps restore a healthy balance of microbial populations in intestinal infections in children and encouraging preliminary data in studies on bacterial gastroenteritis, inflammatory bowel diseases, urinary tract infection and post-surgical infections. ",
        type=str,
        help="context")
    parser.add_argument(
        "--max_seq_length",
        default=384,
        type=int,
        help=
        "The maximum total input sequence length after WordPiece tokenization. Sequences "
        "longer than this will be truncated, and sequences shorter than this will be padded."
    )
    parser.add_argument(
        "--max_query_length",
        default=64,
        type=int,
        help=
        "The maximum number of tokens for the question. Questions longer than this will "
        "be truncated to this length.")
    parser.add_argument(
        "--n_best_size",
        default=1,
        type=int,
        help="The total number of n-best predictions to generate. ")
    parser.add_argument(
        "--max_answer_length",
        default=30,
        type=int,
        help=
        "The maximum length of an answer that can be generated. This is needed because the start "
        "and end predictions are not conditioned on one another.")
    parser.add_argument("--no_cuda",
                        action='store_true',
                        help="Whether not to use CUDA when available")
    parser.add_argument(
        "--do_lower_case",
        action='store_true',
        help=
        "Whether to lower case the input text. True for uncased models, False for cased models."
    )
    parser.add_argument(
        '--version_2_with_negative',
        action='store_true',
        help='If true, then the model can reply with "unknown". ')
    parser.add_argument(
        '--null_score_diff_threshold',
        type=float,
        default=-11.0,
        help=
        "If null_score - best_non_null is greater than the threshold predict 'unknown'. "
    )
    parser.add_argument(
        '--vocab_file',
        type=str,
        default=None,
        required=True,
        help="Vocabulary mapping/file BERT was pretrainined on")
    parser.add_argument("--config_file",
                        default=None,
                        type=str,
                        required=True,
                        help="The BERT model config")
    parser.add_argument('--fp16',
                        action='store_true',
                        help="use mixed-precision")
    parser.add_argument("--local_rank",
                        default=-1,
                        help="ordinal of the GPU to use")

    args = parser.parse_args()
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)

    if args.local_rank == -1 or args.no_cuda:
        device = torch.device("cuda" if torch.cuda.is_available()
                              and not args.no_cuda else "cpu")
    else:
        torch.cuda.set_device(args.local_rank)
        device = torch.device("cuda", args.local_rank)

    tokenizer = BertTokenizer(args.vocab_file,
                              do_lower_case=args.do_lower_case,
                              max_len=512)  # for bert large

    # Prepare model
    config = BertConfig.from_json_file(args.config_file)

    # Pad the vocabulary to a multiple of 8 so fp16 matrix multiplies
    # can use Tensor Cores, which need dimensions divisible by 8
    if config.vocab_size % 8 != 0:
        config.vocab_size += 8 - (config.vocab_size % 8)

    # initialize model
    model = BertForQuestionAnswering(config)
    model.load_state_dict(
        torch.load(args.init_checkpoint, map_location='cpu')["model"])
    model.to(device)
    if args.fp16:
        model.half()
    model.eval()

    print("question: ", args.question)
    print("context: ", args.context)
    print()

    # preprocessing
    doc_tokens = args.context.split()
    query_tokens = tokenizer.tokenize(args.question)
    feature = preprocess_tokenized_text(doc_tokens,
                                        query_tokens,
                                        tokenizer,
                                        max_seq_length=args.max_seq_length,
                                        max_query_length=args.max_query_length)

    tensors_for_inference, tokens_for_postprocessing = feature

    input_ids = torch.tensor(tensors_for_inference.input_ids,
                             dtype=torch.long).unsqueeze(0)
    segment_ids = torch.tensor(tensors_for_inference.segment_ids,
                               dtype=torch.long).unsqueeze(0)
    input_mask = torch.tensor(tensors_for_inference.input_mask,
                              dtype=torch.long).unsqueeze(0)

    # load tensors to device
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)

    # run prediction
    with torch.no_grad():
        start_logits, end_logits = model(input_ids, segment_ids, input_mask)

    # post-processing
    start_logits = start_logits[0].detach().cpu().tolist()
    end_logits = end_logits[0].detach().cpu().tolist()
    answer, answers = get_answer(doc_tokens, tokens_for_postprocessing,
                                 start_logits, end_logits, args)

    # print result
    print()
    print(answer)
    print()
    print(json.dumps(answers, indent=4))
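
Because the script is driven entirely by argparse, it can be exercised
from another script or notebook by faking sys.argv; every flag below is
defined in the parser above, and the file paths are placeholders:

import sys

sys.argv = [
    'inference.py',
    '--bert_model', 'bert-large-uncased',
    '--init_checkpoint', '/path/to/bert_qa.pt',
    '--vocab_file', '/path/to/vocab.txt',
    '--config_file', '/path/to/bert_config.json',
    '--do_lower_case',
]
main()
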
Code Example #4

The pre-processing half of a Triton Inference Server client: the same
feature extraction, but producing batched NumPy arrays instead of
torch tensors.
    # (the opening of this excerpt is truncated; the keyword arguments
    # below belong to the construction of the Triton client/context)
                             args.triton_model_name,
                             args.triton_model_version,
                             http_headers=args.http_headers,
                             verbose=args.verbose)

    print("question: ", args.question)
    print("context: ", args.context)
    print()

    # pre-processing
    tokenizer = BertTokenizer(args.vocab_file,
                              do_lower_case=args.do_lower_case,
                              max_len=512)  # for bert large

    doc_tokens = args.context.split()
    query_tokens = tokenizer.tokenize(args.question)
    feature = preprocess_tokenized_text(doc_tokens,
                                        query_tokens,
                                        tokenizer,
                                        max_seq_length=args.max_seq_length,
                                        max_query_length=args.max_query_length)

    tensors_for_inference, tokens_for_postprocessing = feature

    # [None, ...] prepends a batch dimension of 1, the NumPy
    # equivalent of unsqueeze(0) in the torch examples above
    dtype = np.int64
    input_ids = np.array(tensors_for_inference.input_ids,
                         dtype=dtype)[None, ...]  # make bs=1
    segment_ids = np.array(tensors_for_inference.segment_ids,
                           dtype=dtype)[None, ...]  # make bs=1
    input_mask = np.array(tensors_for_inference.input_mask,
                          dtype=dtype)[None, ...]  # make bs=1