GitHub - bansallab/roundup: Mining the web for data on cattle sales at US livestock markets.

Introduction

Project roundup is a collection of Python scripts that mine the web for data on cattle sales at US livestock markets (where animals are sold through live auctions). Data are pulled from livestock markets with websites that post data in the form of "market reports" or "sale reports". In particular, a website is a valid data source if the reports provide information on the location of a consignor (who brings cattle to market) or a buyer (who takes delivery from the market). Different livestock market websites provide remarkably consistent types of information, but in drastically varying formats. To systematically record cattle sales, a script is customized to "round up" data from each livestock market's website and write a commonly formatted CSV version to a locally stored archive.

Collaboration

Contribution to and use of this repository is de facto restricted to collaborators with access to the associated database, which includes the website URLs among other private information. For collaborators contributing to the project, the following instructions will help you get started writing a script to scrape a new website. The function scrape_util.get_market expects to find a working mysql client program and a configuration file for connecting to the database, with the connection options under the group heading [roundup-db] in ~/.my.cnf. See the MySQL reference manual for details.
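Before running a script, it can help to confirm that your option file actually contains the expected group. The following is a minimal sketch, not part of scrape_util: the helper name read_option_group and every value in the sample text are placeholders, and the exact options that scrape_util.get_market needs are not listed here. It also assumes a simple option file with no !include directives, which Python's configparser cannot read.

```python
import configparser


def read_option_group(text, group="roundup-db"):
    """Return the options found under [group] in MySQL option-file text.

    Hypothetical helper for checking a ~/.my.cnf by hand.
    """
    # allow_no_value: MySQL option files may contain bare flags such as "quick".
    # interpolation=None: a "%" in a password must not be treated specially.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    if not parser.has_section(group):
        raise KeyError("no [{}] group found".format(group))
    return dict(parser[group])


# Placeholder content: substitute the values given to you for the project database.
example = """
[roundup-db]
host = db.example.org
user = your_username
password = your_password
database = your_database
"""

options = read_option_group(example)
print(sorted(options))
```

To check your own file, pass in the text of ~/.my.cnf, for example pathlib.Path.home().joinpath('.my.cnf').read_text().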

Getting Started

Python and git are the two basic tools you need to contribute. If you do not use a package manager (e.g. APT or Homebrew), you can download binaries from git-scm.com and python.org. Install the latest versions (Python 3.x). (Note for Windows users: the default install options for git are acceptable, but feel free to uncheck integration with Windows Explorer. If you opt not to modify your PATH variables, use the installed "git bash" shell to execute the git commands below.)

At minimum, three Python packages are required: sqlalchemy, py-dateutil and BeautifulSoup4. Unless you are an experienced Python programmer (in which case install the packages however you want), install them from within Python (indicated by the Python prompt >>>):

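>>> import subprocess, sys
>>> # pip.main() was removed in pip 10, so run pip as a module instead
>>> subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sqlalchemy', 'py-dateutil', 'BeautifulSoup4'])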

Clone this Repository onto your Local Machine

Every file in this repository with a name like *_scrape.py is a script that converts the market reports found at a particular website to a CSV file. "Cloning" this repository copies all the files to your machine, giving you many examples to learn from and copy. From the command line of your shell (incl. git-bash on Windows), execute:

> cd /path/to/your/projects/
> git clone https://github.com/itcarroll/roundup.git
> cd roundup

The first step for adding a new *_scrape.py script is to obtain the unique identifier, <id>, associated with the market website in our private database. You will create a "branch" of the repository for your work, which will be merged into the master branch when the script is complete. Because every new branch should stem from the most up-to-date master, create your branch with the following:

> git checkout master
> git pull
> git checkout -b <branch_name>
> cp 214_scrape.py <id>_scrape.py

Use any branch name you want; it's not persistent. The pull command tells git to update your local repository from GitHub. The copied script is a template: open it up and start looking around.

When you successfully connect to the database and run your script, it will create folders <prefix>_scrape and <prefix>_scrape/dbased, where <prefix> is a friendlier string associated with the previously obtained <id>. The first holds newly written CSV files, with names patterned after <prefix>_YY-MM-DD.csv. The subfolder dbased holds CSV files copied from <prefix>_scrape after a cron job has imported them into the database. Because those market reports have already been imported, they should not be changed. Your script, which will take many iterations to perfect, should and will overwrite CSV files in <prefix>_scrape.
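The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the <prefix>_YY-MM-DD.csv pattern and the dbased subfolder; the function names are hypothetical, "demo" is a made-up prefix, and the real scripts handle the archive through scrape_util rather than through these helpers.

```python
from datetime import date, datetime
from pathlib import Path


def csv_name(prefix, sale_date):
    """Build a file name following the <prefix>_YY-MM-DD.csv pattern."""
    return "{}_{}.csv".format(prefix, sale_date.strftime("%y-%m-%d"))


def sale_date_from_name(name, prefix):
    """Recover the sale date from a file name written by csv_name()."""
    stem = Path(name).stem                # e.g. "demo_15-03-09"
    stamp = stem[len(prefix) + 1:]        # drop "<prefix>_"
    return datetime.strptime(stamp, "%y-%m-%d").date()


def is_imported(folder, name):
    """True if the CSV already sits in <folder>/dbased and must not be changed."""
    return (Path(folder) / "dbased" / name).exists()


prefix = "demo"                           # hypothetical prefix
folder = Path(prefix + "_scrape")
name = csv_name(prefix, date(2015, 3, 9))
print(folder / name, is_imported(folder, name))
```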

So that Ian knows you've gotten started, commit your new file to the git repository and push the branch upstream to GitHub.

> git add <id>_scrape.py
> git commit -m 'initial commit'
> git push -u origin <branch_name>

For all subsequent versions of your script, make a commit and push your work to GitHub like so:

> git commit -am 'some message about the changes to <id>_scrape.py'
> git push

Study the Details

Take a look at some online market reports, including the source (i.e. the raw HTML).

Study the 214_scrape.py Python script until you understand how it works.

Each function's docstring (the triple quoted text) describes its purpose.
Commented lines (preceded with #) describe the script's sections and/or logic.
The module scrape_util.py contains definitions used by all the *_scrape.py scripts, including the CSV headers for the data your script will collect.
The main() function executes the following sequence:
1. Load the current collection of reports available online.
2. Locate the collection of archived CSV files.
3. Iterate through each report to:
    1. Read the sale date (see 5_scrape.py for an example with multiple reports for a given day).
    2. Check the archive for an existing CSV file.
    3. Read the rows of the report into a list.
    4. Open a new CSV file and write a line for each row that represents a sale.
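The sequence in main() can be sketched as a single loop. This is not the code in 214_scrape.py: the function write_reports, the two column names, and the sample report are all made up for illustration (the real CSV headers live in scrape_util.py), and the reports are passed in as already parsed (date, rows) pairs rather than downloaded. The sketch assumes that "checking the archive" means skipping any report whose CSV is already in dbased.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def write_reports(reports, folder, prefix, header):
    """Write one CSV per report, skipping reports that were already imported.

    reports: iterable of (sale_date, rows), where each row is a dict keyed by header.
    Returns the names of the files written.
    """
    folder = Path(folder)
    dbased = folder / "dbased"
    dbased.mkdir(parents=True, exist_ok=True)
    written = []
    for sale_date, rows in reports:
        name = "{}_{}.csv".format(prefix, sale_date.strftime("%y-%m-%d"))
        if (dbased / name).exists():
            continue                      # already in the database: leave it alone
        # Files outside dbased are overwritten on every run.
        with open(folder / name, "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=header)
            writer.writeheader()
            for row in rows:
                writer.writerow(row)
        written.append(name)
    return written


# Hypothetical header and data, written to a throwaway directory.
header = ["consignor_city", "cattle_head"]
reports = [(date(2015, 3, 9), [{"consignor_city": "Wishek", "cattle_head": 12}])]
with tempfile.TemporaryDirectory() as tmp:
    print(write_reports(reports, tmp, "demo", header))
```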

Use the Python debugger (pdb) to execute specific segments. Here is an example debugger session:

>>> import pdb
>>> from _214_scrape import *
>>> pdb.runcall(main)
> /path/to/your/projects/roundup/wishek_scrape.py(131)main()
-> url = base_url + report_path
(Pdb) l
126  	            writer.writerow(sale)
127  	
128  	def main():
129  	
130  	    # Get URLs for all reports
131  ->	    url = base_url + report_path
132  	    soup = BeautifulSoup(urllib.request.urlopen(url).read())
133  	    report = [soup]
134  	
135  	    # Locate existing CSV files
136  	    archive = scrape_util.ArchiveFolder(argv)
(Pdb) n
> /path/to/your/projects/roundup/wishek_scrape.py(132)main()
-> soup = BeautifulSoup(urllib.request.urlopen(url).read())
(Pdb) url
'http://www.wisheklivestock.com/market.htm'
(Pdb)
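Once a page has been loaded into BeautifulSoup as in the session above, the next step is usually to pull out the links to individual reports. The sketch below shows one way to do that, under stated assumptions: the helper report_links, the sample HTML, the example.com URL, and the rule "keep links whose text mentions a report" are all hypothetical, and each real website needs its own rule. The parser is named explicitly ("html.parser") so BeautifulSoup does not have to guess one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def report_links(html, base_url):
    """Return absolute URLs for anchors whose text mentions a report."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        if "report" in anchor.get_text().lower():
            # Resolve relative links against the page they were found on.
            links.append(urljoin(base_url, anchor["href"]))
    return links


# Made-up page content standing in for a downloaded market page.
sample = """
<html><body>
<a href="reports/2015-03-09.htm">Market Report March 9</a>
<a href="contact.htm">Contact us</a>
</body></html>
"""

print(report_links(sample, "http://www.example.com/market.htm"))
```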

Study these tools!

Navigate to a website's HTML to grab the list of market report URLs. The BeautifulSoup package is instrumental, or you may require Selenium with the PhantomJS webdriver for pages heavy on JavaScript.
Sometimes the data are in a table, making the job easy. Sometimes each sale record is one string, which Python's regular expressions module will help you parse.
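When each sale record is a single string, a regular expression with named groups keeps the parsing readable. The example below is a sketch for one invented line format (consignor name, city, state, head count, cattle description, weight, price); the pattern, the field names, and the sample line are assumptions for illustration, and real reports will need their own pattern.

```python
import re

# Invented format: "Name, City ST  head description  weight  $price"
SALE_PATTERN = re.compile(
    r"(?P<name>[^,]+),\s*"
    r"(?P<city>.+?)\s+(?P<state>[A-Z]{2})\s+"
    r"(?P<head>\d+)\s+(?P<cattle>.+?)\s+"
    r"(?P<weight>[\d,]+)\s+"
    r"\$(?P<price>[\d,]+\.\d{2})$"
)


def parse_sale(line):
    """Return a dict of fields for a sale line, or None if the line is not a sale."""
    match = SALE_PATTERN.match(line.strip())
    if match is None:
        return None
    sale = match.groupdict()
    sale["head"] = int(sale["head"])
    sale["weight"] = int(sale["weight"].replace(",", ""))
    sale["price"] = float(sale["price"].replace(",", ""))
    return sale


print(parse_sale("John Doe, Wishek ND  12 blk strs  645  $212.50"))
print(parse_sale("Market Report for March 9"))
```

Returning None for lines that do not match makes it easy to skip headings and commentary while looping over a report.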

Note on Contributing

If you are new to Git, you might need these tutorials. You may want to set up an SSH key pair.
Commit frequently with short, descriptive messages: > git commit -am 'what I just did'
Test your script by carefully inspecting the generated CSV files for errors.
Issue a pull request (from here) to notify Ian when you need suggestions or the script is working!
To update your local copy with commits Ian pushes to the repo, run git checkout <branch_name> followed by git pull.
