
WebScraping Tools

This repo contains examples of web scraping in Python. The dependencies are lxml for DOM traversal and feedparser for RSS parsing.

There is also a utility class, scraptools, which groups common scraping operations (see the usage sketch after this list):

  • getDOM : Returns the DOM root element of the page at the given url
  • getElementsFromHTML : Returns a list of lxml elements matching a cssSelector in the given html source
  • getElementsFromUrl : Returns a list of lxml elements matching a cssSelector in the page fetched from url
  • getUrlContent : Gets the content of a url as a string
  • downloadResource : Downloads the content of a url to disk
  • saveResource : Saves data to a file in binary write mode
  • urlIterator : Yields successive page urls as long as the cssSelector finds a "next page" link
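
As a rough illustration, a few of these helpers could be built on lxml and the standard library as sketched below. The signatures and behaviour of the real scraptools functions may differ; the names, parameters, and example URL here are assumptions for illustration only (lxml's cssselect() also requires the cssselect package).

    import urllib.request
    import lxml.html

    def getUrlContent(url):
        # Fetch the raw bytes of the page and decode them to a string.
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def getElementsFromHTML(html, cssSelector):
        # Parse the HTML source and return the elements matching the CSS selector.
        dom = lxml.html.fromstring(html)
        return dom.cssselect(cssSelector)

    def getElementsFromUrl(url, cssSelector):
        # Convenience wrapper: fetch the page, then select elements from it.
        return getElementsFromHTML(getUrlContent(url), cssSelector)

    # Example: print the text of every link on a page (URL is a placeholder).
    for link in getElementsFromUrl("https://example.com", "a"):
        print(link.text_content())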

Examples:

  • Scrap_97ThingsProgrammer : Aggregates 97 good programming practices and generates a printer-friendly html page
  • Scrap_Eduportefolio : Gets the names of students attending Polytechnique Montreal
  • Scrap_GoogleImg : Downloads the top images for a Google Images search
  • Scrap_Imgur : Downloads individual images or a whole gallery
  • Scrap_Moodle : Recursively downloads all the files from the course pages on Moodle
  • Scrap_Nordelec : Gets information about the companies inside the Nordelec building
  • Scrap_PrenomMasc : Gets first names from a website
  • Scrap_Reddit : Parses posts from a subreddit
  • Scrap_RSS_titles : Gets the article titles of RSS feeds. Useful for a quick glance at the news from the console ;) (a minimal sketch follows this list)
  • Scrap_Tumblr : Gets pictures based on their tags
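
For instance, the RSS titles idea can be reproduced with feedparser in a few lines. This is a sketch, not the actual code of Scrap_RSS_titles; the function name and feed URL are assumptions.

    import feedparser

    def print_feed_titles(feed_url):
        # Parse the RSS/Atom feed and print each entry's title to the console.
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            print(entry.title)

    # Any RSS feed URL works here; this one is just an example.
    print_feed_titles("https://hnrss.org/frontpage")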
