CRAWLR

POC to scrape job listing sites

To run:

Right click parameterizedRun.py, select "run"

-or-

python parameterizedRun.py

Follow the prompt to add query parameters and the results will output to crawlr/data/

Complications:

The lack of insight into business rules relating to job postings makes
it difficult to write utility code that may reduce code duplication
across CareerBuilder / Monster / etc.

For example: company X has four similar positions to fill.  The final CSV
output should not combine postings across sites in the same file because
there's no way to know definitively whether these are unique positions
(did company X post two positions to CareerBuilder and two to Monster?
Did they duplicate all four postings on both sites?)  Assumtions about
the ways an employer creates a posting will likely result in misleading
output.

Another example: the "distance from searched zipcode" query parameter
accepts different values on Monster than it does on CareerBuilder (as
well as a different query parameter key.)  This makes it difficult to
reduce code duplication in in the model objects, and difficult to write
mindful code for future additions to the list of sites we scrape.

Proposed backlog:

Unit tests
Other job posting websites (Indeed / ZipRecruiter / etc.)
Further CSV handling: aggregate functions are currently handled manually via Google sheets, specifically counting job postings per company. This could be done within crawlr/scraper/processing/

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
cb		cb
monster		monster
proxy		proxy
scraper		scraper
.gitignore		.gitignore
README.md		README.md
definitions.py		definitions.py
definitions.pyc		definitions.pyc
parameterizedRun.py		parameterizedRun.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cb

cb

monster

monster

proxy

proxy

scraper

scraper

.gitignore

.gitignore

README.md

README.md

definitions.py

definitions.py

definitions.pyc

definitions.pyc

parameterizedRun.py

parameterizedRun.py

Repository files navigation

CRAWLR

To run:

Complications:

Proposed backlog:

About

Releases

Packages

Languages

NPCodeRat/crawlr

Folders and files

Latest commit

History

Repository files navigation

CRAWLR

To run:

Complications:

Proposed backlog:

About

Resources

Stars

Watchers

Forks

Languages