isaacdlp/scraphacks

Scrap Hacks

Useful Web Scraping "Hacks"

Prices

Use pricescrap.py to compile a CSV file with asset prices and upload it to OpenFinance.

Installation

  • Install Python 3
  • Install lxml: pip install lxml
  • Install RoboBrowser: pip install robobrowser
  • Edit your OpenFinance credentials in pricescrap.json (see "General Requirements" below).

Usage

  • Add the assets that you want to scrape daily quotes for to the assets array.
  • The script offers support for the following data sources ...
  • ... but it can be easily extended to support other price providers
  • Run the program: python pricescrap.py
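
The CSV step can be pictured with a minimal sketch. The column layout that OpenFinance expects is not documented here, so the asset, date and price columns and the write_quotes helper below are hypothetical and only illustrate the idea:

```python
import csv


def write_quotes(rows, handle):
    """Write (asset, date, price) tuples as CSV to an open text handle."""
    writer = csv.writer(handle)
    # Hypothetical header: the real upload format may differ.
    writer.writerow(["asset", "date", "price"])
    for asset, date, price in rows:
        # Two decimals is an assumption, chosen only for readability.
        writer.writerow([asset, date, f"{price:.2f}"])
```

When writing to a real file, open it with newline="" as the csv module documentation recommends, for example open("prices.csv", "w", newline="").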

Duolingo

Use duolingoscrap.py to compile cheat sheets from Duolingo.

It demonstrates the use of Selenium and BeautifulSoup.

Installation

  • Install the Chrome Browser
  • Install Python 3
  • Install BeautifulSoup: pip install bs4
  • Install Selenium: pip install selenium
  • Install the required ChromeDriver on your system
  • Edit your Duolingo credentials in duolingoscrap.json (see "General Requirements" below).

Usage

  • Duplicate the template duolingo/duo.html with the name of the language you want to download (e.g. duolingo/Russian.html).
  • Optionally you can edit the parts that correspond to the name of the language and the flag:
<div class="_1sdh6 ljpAk">
    <div class="yZINH">
        <!-- This section has the flag -->
        <span class="_1eqxJ _3viv6 HCWXf _3PU7E _2XSZu"></span>
    </div>
    <div class="yZINH _1_vhy">
        <!-- This section has the name -->
        <h2>Russian</h2>
        <div><span>Cheat Sheet</span></div>
    </div>
</div>
  • Edit the languages dictionary inside duolingoscrap.py to associate Duolingo's extension for the language (e.g. ru) to your recently created file (e.g. duolingo/Russian.html).
languages = {
    "ru": "duolingo/Russian.html"
}
  • You have two options: download only selected lessons (add them to the lessons array), or the whole language (leave the lessons array empty). In the second case note that:

    • It will only scan the active language. If you have several languages in Duolingo, you need to switch to the language that you want to download.
    • It will only download up to your last available lesson. It can't download lessons you can't access yet (but you can run the program again at a later date).
    • It keeps track of the lessons you already downloaded and will not overwrite or duplicate them.
  • Run the program: python duolingoscrap.py
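
The choice between selected lessons and the whole language, together with the skip-what-is-already-downloaded rule, could be sketched as follows. The pending_lessons helper and its three arguments are hypothetical, and applying the skip rule in both modes is an assumption; duolingoscrap.py may organise this differently:

```python
def pending_lessons(selected, available, downloaded):
    """Return the lessons still to fetch, preserving their order."""
    # A non-empty selection wins; an empty one means "the whole language".
    wanted = list(selected) if selected else list(available)
    # Assumption: lessons already on disk are skipped in both modes.
    already = set(downloaded)
    return [lesson for lesson in wanted if lesson not in already]
```

Here available stands for the lessons the account can currently access, which is why a later run can pick up lessons that were locked before.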

Safari Books

Use safariscrap.py to download books and video lessons from SafariBooks.

It combines Selenium with BrowserMobProxy to observe and manipulate web traffic (required for video download) and PDFreactor to convert HTML books into PDF.

Installation

  • Follow all the installation instructions above, storing your credentials in safariscrap.json
  • Install Java (required by BrowserMobProxy)
  • Optionally, install PDFreactor.
    • If you use PDFreactor, set the pdfReactor variable in safariscrap.py accordingly:
pdfReactor = None
# pdfReactor = PDFreactor("http://localhost:9423/service/rest")

Usage

  • Search SafariBooks for what you want to download. You have two options:
    • Download a specific course like https://www.safaribooksonline.com/library/view/numpy-cookbook/9781849518925/.
      • The application will automatically detect whether it is a book or a video tutorial and proceed accordingly.
    • Download ALL courses from a given topic like https://www.safaribooksonline.com/topics/python
  • List any combination of topics and courses in any order in the courses array.
    • Useful Notes for topics with many courses:
      • If you list just one topic, you can choose the page it starts downloading from with the topicNum variable.
      • By default the program will NOT overwrite courses downloaded previously. You can switch this with the overwrite variable.
courses = [
    "https://www.safaribooksonline.com/library/view/python-data-structures/9781786467355/",
    "https://www.safaribooksonline.com/topics/java"
]

overwrite = False
topicNum = 0
  • Run the program: python safariscrap.py
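
Since the courses array mixes topic and course URLs, the two kinds have to be told apart somehow. A minimal sketch, assuming the URL shapes shown in the examples above (/topics/... and /library/view/...); the classify helper is hypothetical and is not necessarily how safariscrap.py decides:

```python
from urllib.parse import urlparse


def classify(url):
    """Label a SafariBooks URL as 'topic', 'course' or 'unknown'."""
    path = urlparse(url).path
    # Assumption: topic listings live under /topics/.
    if path.startswith("/topics/"):
        return "topic"
    # Assumption: single books and video courses live under /library/view/.
    if path.startswith("/library/view/"):
        return "course"
    return "unknown"
```

Telling a book from a video tutorial is a separate step that needs the page content, so it is not covered by this sketch.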

Drumeo

Use drumeoscrap.py to download songs, play-alongs and video lessons from Drumeo.

Very similar in requirements and functionality to Safari Books above, with two important additions:

  • Reads and writes session cookies to disk (in JSON format), avoiding unnecessary logins.
  • As well as whole video files, it can download *.ts video segments, assemble them and convert them to mpg (using FFmpeg).
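
Both additions can be sketched in a few lines. The helper names and file paths below are hypothetical; the only Selenium calls used are get_cookies() and add_cookie(), and the FFmpeg command is built but not executed. This is a sketch under those assumptions, not the code in drumeoscrap.py:

```python
import json
import shutil
from pathlib import Path


def save_cookies(driver, path):
    """Dump the browser session cookies to a JSON file."""
    Path(path).write_text(json.dumps(driver.get_cookies()), encoding="utf-8")


def load_cookies(driver, path):
    """Restore cookies from disk; return False if no file exists yet."""
    path = Path(path)
    if not path.exists():
        return False
    # Selenium only accepts cookies for the domain currently loaded,
    # so the site should be opened before calling this.
    for cookie in json.loads(path.read_text(encoding="utf-8")):
        driver.add_cookie(cookie)
    return True


def join_segments(segments, joined):
    """Concatenate *.ts segments, in the order given, into one file."""
    with open(joined, "wb") as out:
        for segment in segments:
            with open(segment, "rb") as part:
                shutil.copyfileobj(part, out)
    return joined


def ffmpeg_command(joined, output):
    """Build (not run) an FFmpeg call converting the joined file."""
    return ["ffmpeg", "-y", "-i", str(joined), str(output)]
```

The command list can be passed to subprocess.run once FFmpeg is installed. Plain concatenation works here because MPEG transport stream segments are designed to be played back to back.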

General Requirements

In all cases you need to create a credentials file named after the script (pricescrap.json, duolingoscrap.json, etc.) with the following structure:

{
  "username" : "<your_username>",
  "password" : "<your_password>"
}
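
Reading such a file takes only a few lines. The load_credentials helper below is a hypothetical sketch of that step and assumes nothing beyond the two keys shown above; the scripts themselves may read the file differently:

```python
import json
from pathlib import Path


def load_credentials(path):
    """Read a credentials file and return (username, password)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Fail early with a clear message if a key is missing.
    missing = [key for key in ("username", "password") if key not in data]
    if missing:
        raise KeyError(f"{path} is missing: {', '.join(missing)}")
    return data["username"], data["password"]
```

For example, username, password = load_credentials("pricescrap.json"). Because the file holds a plain-text password, it is worth keeping it out of version control.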
