(Work in progress)
Compass Searcher is a web scraper I built while taking online bootcamp classes at Lighthouse Labs (LHL). The online learning portal did not have any search functionality, so I built a web scraper to crawl and store all the pages in an indexed database for easy searching.
The web scraper was initially difficult to program because the learning portal requires the user to be logged in to view content. The scraper must first log in to GitHub, then log in to LHL. I overcame this obstacle by carefully sifting through the Network tab in Google Chrome's developer tools, looking for recurring data that pointed me to the right POST and GET requests to make in order to log in.
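Part of what those requests reveal is a set of hidden form fields (such as `authenticity_token`) that must be echoed back in the login POST. As a minimal stdlib sketch of that scraping step — the sample HTML below is fabricated for illustration, and real markup may order attributes differently, so a real scraper would need to match the live page:

```python
import re

def extract_hidden_fields(html: str) -> dict:
    """Pull hidden <input> fields (e.g. authenticity_token) out of a login form."""
    fields = {}
    # Assumes type="..." appears before name="..." in each tag, as in the sample.
    for match in re.finditer(
        r'<input[^>]*type="hidden"[^>]*name="([^"]+)"[^>]*value="([^"]*)"', html
    ):
        fields[match.group(1)] = match.group(2)
    return fields

# Fabricated sample resembling a login form's hidden fields.
sample = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123" />
  <input type="hidden" name="timestamp" value="1616000000" />
</form>
"""
print(extract_hidden_fields(sample))
```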
My initial plan was to build this app with a ReactJS frontend and an Express backend and host it on a basic VPS on DigitalOcean, but I switched to a Flask backend and a vanilla HTML and CSS frontend to save time. This project is currently hosted on PythonAnywhere.
Compass Searcher only provides a clickable link back to the target page and a short snippet (shorter than a tweet) to give context around the search term. It does not provide access to protected content (i.e. content that requires login).
Compass Searcher has a page scraper component that logs in to Compass and caches relevant parts of each page (using the calendar as an index). This information is stored in an indexed database which is refreshed in sync with course progress. When the user enters a search term, Compass Searcher queries the database and returns the pages that match, providing a very short snippet around the search term for context.
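The search-and-snippet step can be sketched with stdlib `sqlite3`. The `pages` table, its columns, and the sample row below are assumptions for illustration, not the app's actual schema:

```python
import sqlite3

def search_pages(conn, term: str, context: int = 60):
    """Return (title, url, snippet) for pages whose body contains `term`."""
    cur = conn.execute(
        "SELECT title, url, body FROM pages WHERE body LIKE ?",
        (f"%{term}%",),
    )
    results = []
    for title, url, body in cur:
        # Cut a short window of text around the first match for context.
        i = body.lower().find(term.lower())
        start, end = max(0, i - context), i + len(term) + context
        results.append((title, url, "..." + body[start:end] + "..."))
    return results

# Hypothetical schema and row for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT, url TEXT, body TEXT)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("Intro to Flexbox", "/days/w2d3", "Today we cover CSS flexbox layout in depth."),
)
print(search_pages(conn, "flexbox"))
```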
Compass Searcher uses a vanilla HTML and CSS frontend, a SQLite database, and Python Flask as the backend framework. Why Flask instead of Express?
- Flask integrates tightly with SQLAlchemy (via the Flask-SQLAlchemy extension), a tried-and-tested ORM whose parameterized queries provide much better security. (Protecting LHL assets!)
- Unlike Heroku, PythonAnywhere provides unrestricted uptime, and hosted apps do not go to sleep due to inactivity. Since PythonAnywhere runs Python, the backend had to be written in Python instead of JS.
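The security benefit mentioned above comes from parameterized queries, which SQLAlchemy issues under the hood. A small stdlib `sqlite3` demonstration of the idea (the table and the malicious input are fabricated for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT)")
conn.execute("INSERT INTO pages VALUES ('W1D1 Lecture')")

# A malicious search term that would break a naively concatenated query.
term = "x' OR '1'='1"

# Parameter binding treats the term as data, not SQL, so the injection
# attempt matches nothing instead of returning every row.
rows = conn.execute(
    "SELECT title FROM pages WHERE title LIKE ?", (f"%{term}%",)
).fetchall()
print(rows)
```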
You can use this online without any setup. The following is only needed if you want to run this app locally. Note that you need to supply your own GitHub username and password in the `.env` file. This is required for the scraper to log in to Compass.
- requests (scraper)
- beautifulsoup4 (scraper)
- lxml (scraper)
- python-dotenv
- flask
- flask-sqlalchemy
- flask-migrate
- flask-login
- flask-wtf
- Install dependencies: create a Python virtual environment using `venv` or `virtualenv` and activate the environment.
- Run `pip install -r requirements.txt` to automatically install the Python dependencies.
- Run `flask db init` to initialize the SQLite database.
- Run `flask db migrate` to generate the migration (setup) for the database using the app's models.
- Run `flask db upgrade` to apply the migration and complete the initialization.
- Run `python app.py` to start a development server, then in a separate terminal, run `python scraper/test.py offline` to do an offline test. If you're not getting any errors or empty arrays, your Flask app has been set up correctly!
! Current as of March 2021 - this method may change in the future.
- Make a copy of the `.env.example` file and rename it to `.env`.
- Log out of GitHub if you're currently logged in. Open the Network tab of your developer console on the GitHub login page, and then log back in to GitHub.
- In the Network tab, the first item should be `session`. Right-click on it and select `Copy`, then `Copy as cURL`. Paste into any text editing software and look for the line starting with `-H 'cookie:`. Copy everything from (not including) the first `:` to the last single quote `'` (not including the quote). Paste this value in the `.env` file, surrounded by single or double quotes. You should end up with `GIT_COOKIE='<your cookie here>'`.
- Scroll down to `Form Data` and look for the `authenticity_token`, `login`, `password`, `timestamp`, and `timestamp_secret` keys. Copy these values into the corresponding keys in your `.env` file.
- Once this is set up, run `python scraper/test.py online` in the root folder of this project directory. If you're not getting any errors or an empty array, the scraper has been set up correctly and is now ready to use. Run `python scraper/run.py` to start scraping and populating your database!
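The cookie-copying step above can also be scripted. A stdlib sketch of pulling the cookie value out of a pasted cURL command — the sample command below is fabricated and much shorter than a real one:

```python
import re

def cookie_from_curl(curl_command: str) -> str:
    """Extract the value of the -H 'cookie: ...' header from a copied cURL command."""
    match = re.search(r"-H 'cookie: ([^']*)'", curl_command)
    if match is None:
        raise ValueError("no cookie header found")
    return match.group(1)

# Fabricated example of a "Copy as cURL" result.
curl = """curl 'https://github.com/session' \\
  -H 'cookie: _gh_sess=abc; logged_in=yes' \\
  --data-raw 'login=me&password=secret'"""
print(cookie_from_curl(curl))
```

The returned string is exactly what the instructions say to paste into `GIT_COOKIE` in your `.env` file.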
Daryl Tang