
BigDataCourse

A README is available for each homework in its respective directory.

About the course

The course aims to provide a broad understanding of big data and current technologies for managing and processing it, with a focus on the urban environment. With storage and networking getting significantly cheaper and faster, big data sets can easily reach the hands of data enthusiasts with just a few mouse clicks. These enthusiasts could be policy makers, government employees, or managers who would like to draw insights and (business) value from big data. Thus, it is crucial for big data to be made available to non-expert users in such a way that they can process the data without needing a supercomputing expert. One such approach is to build big data programming frameworks that deal with big data in a paradigm as close as possible to the way they deal with “small data.” Such a framework should also be as simple as possible, even if not as efficient as custom-designed parallel solutions. Users should expect that if their code works within these frameworks for small data, it will also work for big data. General topics of this course include: big data ecosystems, parallel and streaming programming models, MapReduce, Hadoop, Spark, Pig, and NoSQL solutions. Hands-on labs and exercises will be offered throughout to bolster the knowledge learned in each module.

Tools used

  • Python 2 or above
  • Homeworks 5 and 6 are done on the NYU Hadoop cluster (called NYU Dumbo)

Installing Spark (Unix and Linux) (good luck to Windows users)

  • Step 1: Install Anaconda
  • Step 2: conda install pyspark
  • Run the pyspark command in a terminal; it should load the PySpark shell.
  • Note the install location using pip show pyspark | grep Location
  • To write and run Spark applications in a Jupyter notebook, enable the PySpark shell in Jupyter by running the following in a terminal, or by adding it to .bashrc in your home directory on a Linux system:

        export PYSPARK_DRIVER_PYTHON=$(which jupyter)
        export PYSPARK_DRIVER_PYTHON_OPTS=notebook
        export SPARK_HOME=[PySpark Location]/pyspark
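A minimal sanity check once the install is done, run in a notebook cell or as a plain script (a sketch; the app name is arbitrary):

```python
# Minimal PySpark sanity check (a sketch, assuming pyspark was installed as above).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("install-check").getOrCreate()

# Parallelize a small range and count it; prints 5 if Spark is wired up correctly.
print(spark.sparkContext.parallelize(range(5)).count())

spark.stop()
```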

Enabling and handling spatial data types in Spark

Handling Spatial Data — Python

  • conda install geopandas should install geopandas with all of its dependencies, such as pandas, fiona, shapely, pyproj, and rtree
  • Make sure the GDAL library is installed; on Ubuntu you can use apt:
  • sudo add-apt-repository -y ppa:ubuntugis/ppa
  • sudo apt update
  • sudo apt upgrade # if you already have gdal 1.11 installed
  • sudo apt install gdal-bin python-gdal python3-gdal
  • Most data can be stored in Python's native structures: tuples, lists, and dictionaries
  • Useful packages for handling spatial data (see the sketch after this list):
  • fiona: read/write GIS files (a thin API for the GDAL/OGR I/O library)
  • geopandas: extends pandas with geospatial support
  • json: a built-in package for reading JSON files, including GeoJSON
  • pyproj: map projections (conversion from one projection to another)
  • shapefile (pyshp): a lightweight package for reading shapefiles
  • shapely: provides geometric operations on the 2D plane (oblivious to geographic projections), based on GEOS, the geometry engine behind PostGIS
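A minimal sketch of several of these packages working together; the file name nyc.shp is a placeholder, not part of this repository:

```python
# Read a shapefile, reproject it, and inspect the geometries (a sketch).
import geopandas as gpd

gdf = gpd.read_file("nyc.shp")   # fiona/GDAL handle the file I/O
gdf = gdf.to_crs(epsg=4326)      # pyproj drives the reprojection
print(gdf.crs)                   # confirm the new coordinate reference system
print(gdf.geometry.head())       # shapely objects back the geometry column
```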

Handling Spatial Data — Conversion

GDAL/OGR provides a powerful command line tool for conversion (and transformation) of geospatial data, similar to ImageMagick’s convert: ogr2ogr

  • Convert a shapefile to GeoJSON: ogr2ogr -f GeoJSON nyc.geojson nyc.shp
  • Also reproject the data to EPSG:4326 coordinates (~GPS lat/lon): ogr2ogr -f GeoJSON -t_srs EPSG:4326 nyc.geojson nyc.shp (see the Python sketch after this list)
  • For further reading on R-trees, please refer to this and this
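Roughly the same conversion and reprojection can also be done from Python with geopandas (a sketch; it assumes nyc.shp exists in the working directory):

```python
# Sketch of the ogr2ogr conversion above, done via geopandas instead.
import geopandas as gpd

gdf = gpd.read_file("nyc.shp")                                  # read the shapefile
gdf.to_crs(epsg=4326).to_file("nyc.geojson", driver="GeoJSON")  # reproject and write GeoJSON
```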
