Scraping Senate Financial Disclosures

Random update notes to myself

2020-02-27: I have no idea where I left off, other than remembering I should re-run scrapes once in awhile since older reports/senators may be removed from the Senate site from time to time. Last scrape was 2019-11-21

Prelim examination

Parse the extracted JSON data:

$ python efdsenate parse_raw

Note (2020-02-28): I have no idea what the above command refers to. Ignore it please.

How many paper records have been submitted, by year?

$ ack 'view/paper' -A1 data/state-indexes/*.json \
    | ack '"\d{2}/\d{2}/(\d{4})"' --output '$1' \
    | sort | uniq -c

Analyze the doc titles

cat data/parsed/state-indexes.csv \
    | csvgrep -c doc_url -r '/ptr' \
    | csvcut -c doc_title

 cat data/parsed/state-indexes.csv \
    | csvcut -c doc_url | ack -o '(?<=view/)\w+' | sort | uniq -c

# paper in 2019
cat data/parsed/state-indexes.csv \
    | csvgrep -c doc_url -r 'view/paper' \
    | csvgrep -c date -r '201[9]' \
    | csvgrep -c doc_type -r 'Annual' \
    | csvgrep -c doc_type -i -r 'Amendment' \
    | csvgrep -c filer_type -r '^Senator' \
    |   csvcut -c state,last_name,first_name,date,doc_type


cat data/parsed/state-indexes.csv \
    | csvgrep -c doc_url -r 'view/annual' \
    | csvgrep -c date -r '201[9]' \
    | csvgrep -c doc_type -r 'Annual' \
    | csvgrep -c doc_type -i -r 'Amendment' \
    | csvgrep -c filer_type -r '^Senator' \
    |   csvcut -c state,last_name,first_name,date,doc_type

 cat data/parsed/state-indexes.csv | csvgrep -c doc_type -r 'Periodic|Financial' | csvgrep -c doc_filetype -m 'paper' | csvcut -c last_name,first_name,date | ack '(.+?),(\d{4})' --output '$2  $1' | sort | uniq -c

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
archive		archive
data		data
docs/assets/images		docs/assets/images
efdsenate		efdsenate
tests		tests
.gitignore		.gitignore
ARTICLES.md		ARTICLES.md
NOTES.md		NOTES.md
README.md		README.md
errornotes.md		errornotes.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

archive

archive

data

data

docs/assets/images

docs/assets/images

efdsenate

efdsenate

tests

tests

.gitignore

.gitignore

ARTICLES.md

ARTICLES.md

NOTES.md

NOTES.md

README.md

README.md

errornotes.md

errornotes.md

Repository files navigation

Scraping Senate Financial Disclosures

Random update notes to myself

Prelim examination

About

Releases

Packages

Languages

Soncrates/scrape-senate-financial-disclosures

Folders and files

Latest commit

History

Repository files navigation

Scraping Senate Financial Disclosures

Random update notes to myself

Prelim examination

About

Resources

Stars

Watchers

Forks

Languages