Python webgraphItem Examples

Programming language: Python

Namespace/package name: scrapy_webgraph.items

Method/function: webgraphItem

Examples at hotexamples.com: 2

Python webgraphItem - 2 examples found. These are the top-rated real-world Python examples of scrapy_webgraph.items.webgraphItem extracted from open source projects.

Example #1

File: webgraph_perso.py Project: brozi/webpage-graph

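 # Module-level imports this method relies on (assumed paths, legacy Scrapy API).
 from scrapy.selector import HtmlXPathSelector
 from scrapy.utils.url import urljoin_rfc
 from scrapy_webgraph.items import webgraphItem

 def parse_item(self, response):
     hxs = HtmlXPathSelector(response)
     i = webgraphItem()
     i['node'] = response.url  # the crawled page is the graph node
     # Debug output: show which page is being parsed.
     print("#######################")
     print(response.url)
     print("#######################")
     # i['http_status'] = response.status
     llinks = []
     for anchor in hxs.select('//a[@href]'):
         href = anchor.select('@href').extract()[0]
         # Keep only absolute links under the personal site; skip javascript: links.
         if (not href.lower().startswith("javascript")
                 and href.startswith("http://perso.ens-lyon.fr/baptiste.roziere/")):
             llinks.append(urljoin_rfc(response.url, href))
     i['edge'] = llinks  # outgoing links are the graph edges
     return i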
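
The filter in Example #1 tests the raw href against the site prefix before joining it with the page URL, so relative links never reach urljoin_rfc. A minimal standard-library sketch of the alternative order (resolve first, then filter) might look like the following. It is not the project's code: `extract_internal_links`, `_AnchorCollector`, and the sample inputs are hypothetical names introduced for illustration, and `html.parser` stands in for Scrapy's selector.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_internal_links(page_url, html, prefix):
    """Return absolute links under `prefix`, skipping javascript: hrefs."""
    collector = _AnchorCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        if href.strip().lower().startswith("javascript"):
            continue
        absolute = urljoin(page_url, href)  # resolve relative links first
        if absolute.startswith(prefix):
            links.append(absolute)
    return links
```

With this ordering, an anchor such as `<a href="cv.html">` on a page under the prefix is kept, whereas the prefix test on the raw href would discard it.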

Example #2

File: webgraph.py Project: brozi/webpage-graph

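 # Module-level imports this method relies on (assumed paths, legacy Scrapy API).
 from scrapy.selector import HtmlXPathSelector
 from scrapy.utils.url import urljoin_rfc
 from scrapy_webgraph.items import webgraphItem

 def parse_item(self, response):
     hxs = HtmlXPathSelector(response)
     i = webgraphItem()
     i['node'] = response.url  # the crawled page is the graph node
     # Debug output: show which page is being parsed.
     print("#######################")
     print(response.url)
     print("#######################")
     # i['http_status'] = response.status
     llinks = []
     seen = set()  # hrefs already recorded for this page
     for anchor in hxs.select('//a[@href]'):
         href = anchor.select('@href').extract()[0]
         # Keep each absolute cdiscount.com link once, in first-seen order.
         if href.startswith("http://www.cdiscount.com") and href not in seen:
             seen.add(href)
             llinks.append(urljoin_rfc(response.url, href))
     i['edge'] = llinks  # outgoing links are the graph edges
     return i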
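
Each item pairs a page URL ('node') with its outgoing links ('edge'). As a rough sketch of how such items might be combined downstream, the following merges them into an adjacency mapping while dropping duplicate links in first-seen order. `build_adjacency` and the sample records are hypothetical; plain dicts stand in for webgraphItem instances, and the source does not show how the project actually consumes its items.

```python
def build_adjacency(items):
    """Merge {'node': url, 'edge': [urls]} records into {url: [unique urls]}."""
    graph = {}
    seen = {}  # node -> set of links already recorded for that node
    for item in items:
        node = item["node"]
        targets = graph.setdefault(node, [])
        node_seen = seen.setdefault(node, set())
        for link in item["edge"]:
            if link not in node_seen:
                node_seen.add(link)
                targets.append(link)
    return graph
```

Keeping a set per node mirrors the `seen` bookkeeping in Example #2, extended across several items for the same page.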