TDT4305:

My rather informal homework for TDT4305.
While imperfect, I uploaded this to benefit the pyspark community (there's not enough pyspark code in the wild). Also, I kind of like my own pseudo-TDD approach. See run_tasks.sh for examples.

The report/documentation was not that important, so it's not written formally. It does contain comparisons of running times with caching/broadcasting, which could be interesting. See ./doc/build/report.pdf.

Lesson Learnt:

I did my work in three iterations: I started with a small input, then a larger one, before I ran on the complete dataset. This was a mistake. I recommend sampling the dataset, but also trying out your algorithm on large inputs right away. You'll learn that, for example, caching is less effective for large datasets.
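
A rough sketch of how one could check this early, rather than discovering it on the last iteration: time the same two-pass job with and without `cache()` at several sample fractions, including the full input. This is not code from this repo. The names `time_two_passes`, `compare_caching` and `make_rdd` are made up for illustration, and the only RDD methods assumed are `sample`, `cache`, `unpersist` and `count`.

```python
import time


def time_two_passes(rdd):
    """Return the wall-clock seconds taken by two count() actions on one RDD."""
    start = time.perf_counter()
    rdd.count()  # first pass: computes the RDD (and fills the cache, if any)
    rdd.count()  # second pass: can reuse cached partitions if they fit in memory
    return time.perf_counter() - start


def compare_caching(make_rdd, fractions=(0.01, 0.1, 1.0), seed=42):
    """Time a two-pass job with and without cache() at several input sizes.

    make_rdd is a hypothetical zero-argument callable that returns a fresh
    RDD, so every measurement starts from an uncached lineage.
    """
    timings = {}
    for fraction in fractions:
        def sampled():
            rdd = make_rdd()
            # A fraction of 1.0 means "the complete dataset", so skip sampling.
            if fraction < 1.0:
                rdd = rdd.sample(False, fraction, seed)
            return rdd

        uncached = time_two_passes(sampled())

        cached_rdd = sampled().cache()
        cached = time_two_passes(cached_rdd)
        cached_rdd.unpersist()  # free executor memory before the next size

        timings[fraction] = {"uncached": uncached, "cached": cached}
    return timings
```

With a real SparkContext `sc`, `make_rdd` could be something like `lambda: sc.textFile(path)` for whatever input path you use. Comparing the `cached` and `uncached` numbers across fractions should show whether the benefit of caching shrinks as the input grows on your setup.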

andsild/TDT4305

Folders and files

Latest commit

History

Repository files navigation

TDT4305:

Lesson Learnt:

About

Resources

License

Stars

Watchers

Forks

Languages