def processdata(urllists, word_count_threshold, depth):
    # First pass: crawl every URL, remember how many pages each crawl produced
    # and collect all page texts so a shared vocabulary can be built.
    content = []
    nums = []
    nums.append(0)
    for url in urllists:
        crawler = webCrawler(url, depth)
        crawler.crawl()
        nums.append(len(crawler.data))
        content.extend(crawler.data)

    # Build the vocabulary (words above the count threshold) from all crawled pages.
    instance = features(word_count_threshold)
    word_counts, wordtoix = instance.extractwords(content)
    N = len(word_counts)

    # Turn the per-URL page counts into cumulative row offsets into the output matrix.
    for i in range(1, len(nums)):
        nums[i] = nums[i - 1] + nums[i]

    # Second pass: re-crawl each URL and fill its rows with bag-of-words features;
    # the last column holds the class id (1-based index of the URL in urllists).
    # Re-crawling assumes the same pages come back as in the first pass.
    cid = 0
    output = np.zeros((nums[-1], N + 1))
    for url in urllists:
        crawler = webCrawler(url, depth)
        crawler.crawl()
        currlen = len(crawler.data)
        feats = instance.bagofwords(crawler.data, word_counts, wordtoix)
        print feats.shape
        b = np.zeros((currlen, N + 1))
        print b[:, :-1].shape
        b[:, 0:N] = feats
        b[:, N] = cid + 1
        output[nums[cid]:nums[cid + 1], :] = b
        cid = cid + 1

    # Each row of the saved matrix is a feature vector followed by its class label.
    np.savetxt('test.out', output, delimiter=',')
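
# The training step itself is not shown in this file; the sketch below is one
# plausible way the matrix written by processdata() could be turned into the
# 'model.pkl' that the test script loads later.  The choice of LinearSVC, the
# function name train_from_output, and the default file names are assumptions,
# not part of the original code.
def train_from_output(path='test.out', model_path='model.pkl'):
    import numpy as np
    import joblib  # or: from sklearn.externals import joblib on older scikit-learn
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.svm import LinearSVC

    data = np.loadtxt(path, delimiter=',')
    X, y = data[:, :-1], data[:, -1]                    # bag-of-words counts / class ids
    X = TfidfTransformer().fit_transform(X).toarray()   # same tf-idf step used at test time
    clf = LinearSVC()
    clf.fit(X, y)
    joblib.dump(clf, model_path)
    return clf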
def getdata(urllists, depth):
    # Crawl every URL and return all collected page texts in a single list.
    # nums keeps the per-URL page counts but only content is returned.
    content = []
    nums = []
    nums.append(0)
    for url in urllists:
        #if url != "https://en.wikipedia.org/wiki/1990_RTHK_Top_10_Gold_Songs_Awards":
        #    continue
        crawler = webCrawler(url, depth)
        crawler.crawl()
        nums.append(len(crawler.data))
        content.extend(crawler.data)
    return content
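
# The vocabulary files read by the test script below ('dictionary.txt' and
# 'dict2idx.txt') are not created in this file.  This is a minimal sketch of
# how they could be produced from getdata(); the function name
# build_dictionary and its parameters are assumptions.
def build_dictionary(urllists, depth, word_count_threshold):
    import json
    content = getdata(urllists, depth)
    instance = features(word_count_threshold)
    word_counts, wordtoix = instance.extractwords(content)
    with open('dictionary.txt', 'w') as fwrite:
        json.dump(word_counts, fwrite)
    with open('dict2idx.txt', 'w') as fwrite:
        json.dump(wordtoix, fwrite)
    return word_counts, wordtoix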
# Test script: crawl a couple of pages, rebuild their bag-of-words features
# with the saved vocabulary, and classify them with the trained model.
urllists = []
urllists.append("https://en.wikipedia.org/wiki/Sandra_Bullock")
urllists.append("https://en.wikipedia.org/wiki/Far_East_scarlet-like_fever")

# Load the vocabulary (word counts and word-to-index map) saved as JSON next to this script.
filepath = os.path.dirname(os.path.realpath(__file__))
dictname = 'dictionary.txt'
dict2idx = 'dict2idx.txt'
with open(os.path.join(filepath, dictname), 'r') as fread:
    word_counts = json.load(fread)
with open(os.path.join(filepath, dict2idx), 'r') as fread:
    wordtoix = json.load(fread)

#np.random.shuffle
clf = joblib.load('model.pkl')

for url in urllists:
    crawler = webCrawler(url, 1)
    crawler.crawl()
    instance = features(word_count_threshold)  # word_count_threshold is assumed to be defined earlier in the file
    feats = instance.bagofwords(crawler.data, word_counts, wordtoix)
    X = feats
    #print X.shape
    print X.sum()  # fsum() fails on a 2-D array; sum all counts as a sanity check
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(X)
    X = tfidf.toarray()
    print X.sum()
    yhat = clf.predict(X)
    print yhat
print "finish page testing"
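
# Optional helper (an assumption, not part of the original script): map the
# predicted class ids back to the URLs they were derived from.  It relies on
# the labelling convention in processdata(), where pages crawled from
# urllists[i] get class id i + 1.
def label_to_url(yhat, train_urls):
    # class ids start at 1 for train_urls[0]
    return [train_urls[int(c) - 1] for c in yhat]

# Example usage after the loop above:
#   print label_to_url(yhat, urllists)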