def my_parse(self, response):
    log.msg('Parsing urls from %s' % response.url, level=log.INFO)

    # Each entry pairs an allowed URL pattern with a flag saying whether the
    # extracted link should be passed through clean_url() before storing.
    url_patterns = [
        # http://my.linkedin.com/directory/people/a.html
        (r'/directory/people/([a-z]|\@)\.html', False),
        # http://my.linkedin.com/directory/people/my/A1.html
        (r'/directory/people/my/[A-Z]\d+\.html', False),
        # http://my.linkedin.com/directory/people/my/ahamid-3.html
        # http://my.linkedin.com/directory/people/my/aan.html
        (r'/directory/people/my/[a-z]+(\-\d+)?\.html', False),
        # http://my.linkedin.com/pub/zarita-a-baharum/23/9a2/756
        (r'/pub/[a-z\-]+/[a-z0-9]+/[a-z0-9]+/[a-z0-9]+', True),
        # http://www.linkedin.com/in/levananh
        (r'/in/[a-z0-9]+$', True),
    ]

    links = []
    try:
        for pattern, needs_clean in url_patterns:
            lx = SgmlLinkExtractor(
                allow='(' + self.base_url + ')?' + pattern,
                deny=self.deny_re,
            )
            extracted = lx._extract_links(response.body, response.url, 'utf-8')
            extracted = lx._process_links(extracted)
            for link in extracted:
                found = clean_url(link.url) if needs_clean else link.url
                links.append(URL(main_url=response.url, found_urls=found))

        # Mark the country's root page with a '$' sentinel entry.
        if 'http://' + CountryCode.code in response.url:
            links.append(URL(main_url=response.url, found_urls='$'))
    except Exception:
        log.msg('Link extraction failed for %s' % response.url, level=log.ERROR)

    # If the current page is itself a profile page, extract the profile data.
    pub_re = [r'/pub/[a-z\-]+/[a-z0-9]+/[a-z0-9]+/[a-z0-9]+',
              r'/in/[a-z0-9]+']
    for pub in pub_re:
        if re.search(pub, response.url):
            self.extract(response)  # extract profiles

    self.db.insert_urls(links)
def _process_links(self, links):
    links = SgmlLinkExtractor._process_links(self, links)
    return links
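
# ---------------------------------------------------------------------------
# The methods above rely on a few project-local names that are not defined in
# this snippet: the URL item, the clean_url() helper, self.base_url,
# self.deny_re, CountryCode.code and self.db. The sketch below is only a
# guess at the URL item and clean_url(), inferred from how they are called
# above; the real definitions may differ. The imports are what the code needs
# under old (pre-1.0) Scrapy, where scrapy.log and SgmlLinkExtractor exist.

import re

from scrapy import log
from scrapy.item import Item, Field
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class URL(Item):
    # Assumed item layout, inferred from URL(main_url=..., found_urls=...).
    main_url = Field()
    found_urls = Field()


def clean_url(url):
    # Assumed helper: drop the query string and fragment so /pub/ and /in/
    # profile links are stored in a canonical form.
    return url.split('?', 1)[0].split('#', 1)[0]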