def create_data(self):
    if self.split == 'train':
        self._create_vocab()
    else:
        self._load_vocab()

    print(f'Creating data for {self.split} split...')

    tokenizer = TweetTokenizer(preserve_case=False)
    sent_tokenizer = PunktSentenceTokenizer()
    DetectorFactory.seed = 0

    data = defaultdict(dict)
    df = pd.read_csv(self.raw_data_path)

    for _, row in df.iterrows():
        # Only keep English plot samples
        if detect(row['plot']) != 'en':
            continue

        tokens = tokenizer.tokenize(row['plot'])
        # Split the plot into separate sentences
        sentences = sent_tokenizer.sentences_from_tokens(tokens)

        # Generate a sample from each sentence
        for words in sentences:
            randn = np.random.uniform()
            # Keep only ~30 percent of the sentences (due to performance
            # limitations) and skip sentences longer than the maximum
            # sequence length
            if randn > 0.3 or len(words) > self.max_sequence_length - 1:
                continue

            # Input starts with <sos>; target ends with <eos>
            input = ['<sos>'] + words
            input = input[:self.max_sequence_length]

            target = words[:self.max_sequence_length - 1]
            target = target + ['<eos>']

            assert len(input) == len(target), "%i, %i" % (len(input), len(target))
            length = len(input)

            # Pad both sequences up to the maximum sequence length
            input.extend(['<pad>'] * (self.max_sequence_length - length))
            target.extend(['<pad>'] * (self.max_sequence_length - length))

            # Map tokens to vocabulary indices, falling back to <unk>
            input = [self.w2i.get(w, self.w2i['<unk>']) for w in input]
            target = [self.w2i.get(w, self.w2i['<unk>']) for w in target]

            id = len(data)
            data[id]['input'] = input
            data[id]['target'] = target
            data[id]['length'] = length

    # Serialize the samples as JSON and write them to disk
    with io.open(os.path.join(self.data_dir, self.data_file), 'wb') as data_file:
        data = json.dumps(data, ensure_ascii=False)
        data_file.write(data.encode('utf8', 'replace'))

    self._load_data(vocab=False)
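
# A sketch of the module-level imports this method relies on; the package
# choices (nltk for the tokenizers, langdetect for language detection,
# pandas/numpy for the data handling) are assumptions inferred from the
# calls above, not confirmed by this excerpt.
import io
import os
import json
from collections import defaultdict

import numpy as np
import pandas as pd
from langdetect import DetectorFactory, detect
from nltk.tokenize import TweetTokenizer
from nltk.tokenize.punkt import PunktSentenceTokenizer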