def _build_feature_extractor(self, mode, files):
    """Fit a character n-gram feature extractor on the questions in *files*.

    :param mode: ``'ngram'`` for a binary (1,2)-char CountVectorizer, or
        ``'tfidf'`` for a sublinear (1,2)-char TfidfVectorizer.
    :param files: iterable of file paths; one question per line, UTF-8
        encoded bytes.
    :raises ValueError: if *mode* is not one of the supported values.
        (Previously an unknown mode was silently ignored, leaving
        ``self.feature_extractor`` unset and failing later at use time.)

    Side effect: sets ``self.feature_extractor`` to the fitted vectorizer.
    """
    print('Build feature extraction...')
    corpus = []
    for path in files:
        with open(path, 'r') as f:
            for line in f:
                # Drop tabs/spaces and the trailing newline, then decode
                # the raw Python-2 byte string to unicode.
                question = line.replace('\t', '').replace(
                    ' ', '').strip('\n').decode('utf-8')
                question = QueryUtils.static_remove_cn_punct(str(question))
                # self.cut returns a space-joined token string; the
                # vectorizers below re-analyze it at the char level anyway.
                corpus.append(self.cut(question))
    if mode == 'ngram':
        bigram_vectorizer = CountVectorizer(
            ngram_range=(1, 2), min_df=0.0, max_df=1.0, analyzer='char',
            stop_words=[',', '?', '我', '我要'], binary=True)
        self.feature_extractor = bigram_vectorizer.fit(corpus)
    elif mode == 'tfidf':
        print_cn('use {0}'.format(mode))
        tfidf_vectorizer = TfidfVectorizer(
            analyzer='char', ngram_range=(1, 2), max_df=1.0, min_df=1,
            sublinear_tf=True)
        self.feature_extractor = tfidf_vectorizer.fit(corpus)
    else:
        # Fail fast instead of leaving self.feature_extractor undefined.
        raise ValueError('unknown feature extractor mode: {0}'.format(mode))
def _prepare_data(self, files):
    """Read labelled question files and build the training matrices.

    :param files: list of file paths; ``files[i]`` holds the questions for
        label ``self.named_labels[i]``, one question per line (UTF-8 bytes).
    :returns: ``(embeddings, labels, queries)`` where ``embeddings`` is a
        numpy array of word2vec embeddings (one row per unique question),
        ``labels`` is the multi-hot matrix produced by the fitted
        MultiLabelBinarizer, and ``queries`` is the parallel list of unique
        question strings.

    Side effect: sets ``self.mlb`` to the fitted MultiLabelBinarizer.

    NOTE(review): ``queries``, ``labels`` and ``embeddings`` are kept as
    parallel lists — their ordering must stay in lockstep, which is why a
    duplicate question only appends to ``labels[index]`` and never adds a
    new embedding row.
    """
    print('prepare data...')
    embeddings = list()
    queries = list()
    queries_ = dict()
    labels = list()
    mlb = MultiLabelBinarizer()
    # First pass: collect the raw questions per label.
    for index in xrange(len(files)):
        path = files[index]
        label = self.named_labels[index]
        queries_[label] = list()
        with open(path, 'r') as f:
            for line in f:
                # line = json.loads(line.strip().decode('utf-8'))
                # question = line['question']
                # Strip tabs/spaces and the trailing newline, then decode
                # the raw Python-2 byte string to unicode.
                question = line.replace('\t', '').replace(
                    ' ', '').strip('\n').decode('utf-8')
                question = QueryUtils.static_remove_cn_punct(str(question))
                tokens = QueryUtils.static_jieba_cut(question)
                # print_cn(tokens)
                # Skip lines that segment to nothing (e.g. punctuation-only).
                if len(tokens) == 0:
                    continue
                # cc=self.check_zero_tokens(tokens)
                # if not cc:
                #     continue
                queries_[label].append(question)
    # print len(queries_)
    # Second pass: deduplicate questions across labels, merging labels for
    # questions that appear in more than one file.
    for label, questions in queries_.iteritems():
        for question in questions:
            if question in queries and label not in labels[queries.index(
                    question)]:
                # print_cn(question)
                # Already seen under another label: just add this label.
                index = queries.index(question)
                labels[index].append(label)
            else:
                # print_cn(question)
                # First occurrence: record the question, its label set, and
                # its embedding (computed once per unique question).
                queries.append(question)
                labels.append([label])
                tokens = self.cut(question).split(' ')
                embedding = self.get_w2v_emb(tokens)
                embeddings.append(embedding)
    embeddings = np.array(embeddings)
    embeddings = np.squeeze(embeddings)
    self.mlb = mlb.fit(labels)
    labels = self.mlb.transform(labels)
    # print (embeddings.shape, len(queries))
    # print_cn(labels.shape)
    return embeddings, labels, queries
def cut(input_):
    """Tokenize *input_* with jieba accurate mode.

    Chinese punctuation is removed first; returns the tokens as a list.
    """
    cleaned = QueryUtils.static_remove_cn_punct(input_)
    return [token for token in jieba.cut(cleaned, cut_all=False)]
def cut(self, input_):
    """Segment *input_* with jieba full mode and return one string.

    Chinese punctuation is stripped first; the segments are joined with
    single spaces and passed through ``_uniout.unescape`` so the result is
    a readable UTF-8 string.
    """
    cleaned = QueryUtils.static_remove_cn_punct(input_)
    pieces = jieba.cut(cleaned, cut_all=True)
    joined = " ".join(pieces)
    return _uniout.unescape(str(joined), 'utf8')