Predicting default with MapReduce

Which factors predict the risk of financial distress?

The input dataset is taken from Kaggle and consists of 10 variables and 150,000 samples.

Description of the files in the repository:

The list of bash commands to set up HDFS and run MapReduce: runjob
The script to be run during Map: mapper.py
The script to be run during Reduce: reducer.py
Module for data processing used by mapper and reducer: utilities.py
The dataset: data.txt
The report of the analysis: report.pdf

Data processing:

Map step: Each row in data.txt is filtered to remove unreasonable values (thresholds decided a priori) and cleaned of missing values. Then, following the specification of a logistic regression model, the gradient vector and the Hessian matrix are computed from the data. CLARIFICATION: stdin is read in full rather than line by line, so that preprocessing that needs whole-dataset statistics (e.g. substituting NaNs with average values) can be done within a single MapReduce job. A more scalable approach would be to run two MapReduce jobs: one to compute the population average of each feature, the other to train the logistic regression.
Reduce step: The reducer receives a gradient vector and a Hessian matrix from each mapper and combines them according to the Newton-Raphson formula (Chu et al.).
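
The two steps above can be illustrated with a minimal NumPy sketch. It is not the code in mapper.py or reducer.py: the function names `partial_stats` and `newton_step` are hypothetical, and the sign convention (gradient of the log-likelihood, Hessian taken as `X^T W X`) is an assumption. The point it shows is that the gradient and Hessian are sums over rows, so each mapper can compute them on its own chunk and the reducer only needs to add the pieces before applying one Newton-Raphson update.

```python
import numpy as np


def partial_stats(X, y, beta):
    """Map side (hypothetical helper): gradient and Hessian contributions
    of one chunk of rows, for the current coefficient vector beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted probabilities
    grad = X.T @ (y - p)                         # gradient of the log-likelihood
    hess = (X * (p * (1.0 - p))[:, None]).T @ X  # X^T W X, W = diag(p * (1 - p))
    return grad, hess


def newton_step(beta, partials):
    """Reduce side (hypothetical helper): sum the per-chunk contributions
    and apply one Newton-Raphson update."""
    grad = sum(g for g, _ in partials)
    hess = sum(h for _, h in partials)
    return beta + np.linalg.solve(hess, grad)
```

Under these assumptions, one training iteration corresponds to calling `partial_stats` once per chunk and passing the resulting list to `newton_step`. Repeating this with the updated `beta` until the summed gradient is close to zero yields the fitted coefficients.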
