Python Tokenizer.alph_tokenize Examples

Programming language: Python

Namespace/package name: tokenizer_gen

Class/type: Tokenizer

Method/function: alph_tokenize

Examples at hotexamples.com: 3

Python Tokenizer.alph_tokenize - 3 examples found. These are the top-rated real-world Python examples of tokenizer_gen.Tokenizer.alph_tokenize extracted from open source projects.

Frequently used methods

Tokenizer(15)

i_tokenize(9)

alph_tokenize(3)

Example #1

    def search_mult(self, query, limit, offset):
        """Multiword search.

        :param query: input query
        :param limit: maximum number of files to return
        :param offset: number of matching files to skip
        :return: a dictionary with the file names of
        the files that contain all words of the query as the keys
        and all Positions in that file of the words of the query as the values.
        """
        self.query = query
        t = Tokenizer()
        res = []  # list for dictionaries of search results
        fs = []  # list for sets of names of files
        output = {}
        dic = self.db
        for i in t.alph_tokenize(query):
            entry = dic.get(i.tok)
            if entry is None:
                # a query word missing from the database: no file can match
                return output
            if entry not in res:
                res.append(entry)
        if not res:
            # no alphabetic tokens in the query; set.intersection() needs
            # at least one set
            return output
        # create list of sets of filenames for each word
        for f in res:
            fs.append(set(f.keys()))
        # get files that contain all the words of the query
        for r in sorted(set.intersection(*fs))[offset:offset + limit]:
            for item in res:
                output.setdefault(r, []).append(item[r])
        # sort positions by line and start index
        for el in output:
            output[el] = our_sort(output[el])
        return output
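
The core of the search above is an intersection of per-word file sets followed by a slice for paging. A minimal, self-contained sketch of that step is shown here. The function name files_with_all_words and the toy index are hypothetical and not part of the project; the index layout (word -> file name -> positions) is only assumed from how self.db is used above.

```python
def files_with_all_words(index, words, limit, offset):
    """Return up to `limit` names of files containing every word,
    after skipping the first `offset` matches (sorted by name)."""
    postings = []
    for word in words:
        entry = index.get(word)
        if entry is None:
            return []  # one missing word means no file can match
        postings.append(entry)
    if not postings:
        return []  # set.intersection() needs at least one set
    common = set.intersection(*(set(p) for p in postings))
    return sorted(common)[offset:offset + limit]


# Toy index (assumed layout): word -> {file name -> list of positions}
toy_index = {
    "red": {"a.txt": [(0, 0)], "b.txt": [(2, 4)], "c.txt": [(1, 0)]},
    "fox": {"a.txt": [(0, 4)], "c.txt": [(3, 8)]},
}

print(files_with_all_words(toy_index, ["red", "fox"], limit=1, offset=1))
# prints ['c.txt']
```

Sorting before slicing keeps the pages stable between calls, which is presumably why the original code sorts the intersected file names as well.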

Example #2

    def search_mult_stem(self, query, limit, offset):
        """Multiword search with stemming.

        :param query: input query
        :param limit: maximum number of files to return
        :param offset: number of matching files to skip
        :return: a dictionary with the file names of
        the files that contain all stems/lemmas of the query words as the keys
        and a generator of all Positions of the query words' stems/lemmas
        in that file as the values.
        """
        t = Tokenizer()
        stemmer = Stemmer_agent()
        res = []  # list for dictionaries of search results
        fs = []  # list for sets of names of files
        output = {}
        dic = self.db
        for i in t.alph_tokenize(query):
            # merge the positions of every stem of this word, per file
            stems = {}
            for st in stemmer.stem(i.tok):
                if st in dic:
                    for fn, positions in dic[st].items():
                        stems.setdefault(fn, []).extend(positions)
            res.append(stems)
        if not res:
            # no alphabetic tokens in the query; set.intersection() needs
            # at least one set
            return output
        # create list of sets of filenames for each word
        for f in res:
            fs.append(set(f.keys()))
        # get files that contain all the words of the query
        for r in sorted(set.intersection(*fs))[offset:offset + limit]:
            for item in res:
                output.setdefault(r, []).append(item[r])
        # sort positions by line and start index
        for el in output:
            output[el] = our_sort(output[el])
        return output
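
The stemming variant differs from the plain search mainly in how it builds each word's entry: the postings of all candidate stems are merged into one dictionary per query word. A small sketch of that merge follows. The name merge_stem_postings and the toy data are hypothetical, and the list of stems is passed in directly rather than produced by the project's Stemmer_agent, whose behavior is not shown in the source.

```python
def merge_stem_postings(index, stems):
    """Combine the postings of several stems into one
    {file name: positions} dictionary."""
    merged = {}
    for stem in stems:
        # stems that are not in the index contribute nothing
        for file_name, positions in index.get(stem, {}).items():
            merged.setdefault(file_name, []).extend(positions)
    return merged


# Toy index (assumed layout): stem -> {file name -> list of positions}
toy_index = {
    "run": {"a.txt": [(0, 0)]},
    "running": {"a.txt": [(2, 5)], "b.txt": [(1, 3)]},
}

merged = merge_stem_postings(toy_index, ["run", "running", "runn"])
print(merged)
# prints {'a.txt': [(0, 0), (2, 5)], 'b.txt': [(1, 3)]}
```

Because each merged list starts out as a fresh empty list, extending it does not modify the lists stored in the index.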

Example #3

File: tokenizer_gen_test.py Project: darya-den/search-server

    def test_symbol(self):
        t = Tokenizer()
        res = list(t.alph_tokenize('b'))
        gold = [Token('b', 0, 0, "alph")]
        self.assertEqual(res, gold)
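
The test above only shows that alph_tokenize yields token objects lazily for alphabetic input. To illustrate the general idea, here is a hypothetical stand-in that yields runs of alphabetic characters together with their start offsets. SimpleToken and alph_tokens are invented for this sketch; they are not the project's Token or Tokenizer, and the real Token fields (the two zeros and the "alph" tag in the test) are not reproduced here.

```python
from itertools import groupby
from typing import Iterator, NamedTuple


class SimpleToken(NamedTuple):
    """Hypothetical token: the text of the run and its start offset."""
    tok: str
    start: int


def alph_tokens(text: str) -> Iterator[SimpleToken]:
    """Yield each maximal run of alphabetic characters in `text`."""
    pos = 0
    for is_alpha, group in groupby(text, key=str.isalpha):
        chunk = "".join(group)
        if is_alpha:
            yield SimpleToken(chunk, pos)
        pos += len(chunk)  # advance past alphabetic and other runs alike


print(list(alph_tokens("b")))
# prints [SimpleToken(tok='b', start=0)]
```

Like the call in the test, the generator has to be wrapped in list() before it can be compared against an expected list.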