# add the files with some of our classes so that they are available to the workers
sc.addFile("helpers.py")
sc.addFile("exctract_terms.py")
# now if we import these files they will also be available to the workers
from helpers import *
import exctract_terms as et
# other imports used in this snippet
import loadFiles as lf
import preProcess as pp
import numpy as np
from random import randint
from sklearn.feature_extraction.text import TfidfVectorizer

# load data: data is a list with the text of each document, Y is the respective class value
# (1: positive, 0: negative)
print "loading local data"
data, Y = lf.loadLabeled(trainF)

print "preprocessing"
# clean the data of numbers, html tags and punctuation (everything except "?!.";
# "?" and "!" are then replaced by ".")
pp.proc(data)

# m is the vectorizer that will produce the compressed tf-idf matrix;
# the terms are extracted with our own custom function
m = TfidfVectorizer(analyzer=et.terms)
'''
we need an array to distribute to the workers ...
the array should be the same size as the number of workers
we need one element per worker only
'''
ex = np.zeros(8)
rp = randint(0, 7)
ex[rp] = 1  # one random worker will be selected, so we set one random element to non-zero
md = sc.broadcast(m)        # broadcast the vectorizer so that it will be available to all workers
datad = sc.broadcast(data)  # broadcast the data
# execute the vectorizer on one random remote machine
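
# Illustrative sketch only, not the project's confirmed implementation: one way the
# "execute the vectorizer on one random remote machine" step could be completed, reusing
# the broadcast variables md/datad and the ex array defined above. The names fit_remote
# and fitted_vec are hypothetical.
def fit_remote(flag):
    # only the randomly selected worker (the one whose element of ex is 1) does the fit
    if flag == 1:
        vec = md.value                       # broadcast TfidfVectorizer
        tt = vec.fit_transform(datad.value)  # broadcast raw documents
        return [(vec, tt)]
    return []                                # the other workers return nothing

# one element of ex per partition; flatMap drops the empty results and collect() brings
# the fitted vectorizer and the tf-idf matrix back to the driver
fitted_vec, tt = sc.parallelize(list(ex), len(ex)).flatMap(fit_remote).collect()[0]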
import loadFiles as lf
import exctract_terms as et
import preProcess as pp
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

data, Y = lf.loadLabeled("./train")
# the analyzer and the preprocessor of TfidfVectorizer don't work together (it is one
# or the other), so we apply the preprocessing ourselves
pp.proc(data)
m = TfidfVectorizer(analyzer=et.terms)
tt = m.fit_transform(data)
rows, cols = tt.shape
# get_feature_names() returns a list of all the terms found;
# their number is the same as the number of columns
print "number of features :" + str(len(m.get_feature_names()))
# expected size of the non-compressed (dense) matrix, at 8 bytes per float64 entry
print "non compressed matrix expected size:" + str(rows * cols * 8 / (1024.0 * 1024 * 1024)) + "GB"
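
# Illustrative addition, not part of the original script: the memory actually used by the
# compressed sparse (CSR) matrix can be read from its underlying arrays, which makes the
# contrast with the dense estimate above concrete.
sparse_bytes = tt.data.nbytes + tt.indices.nbytes + tt.indptr.nbytes
print "compressed matrix actual size:" + str(sparse_bytes / (1024.0 * 1024 * 1024)) + "GB"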