This application is part of a series of solutions that aim to demonstrate how Big Data technologies can be used to create complex, real-life Big Data applications.
Specifically, with this application, we present how to acquire real-time data from REST APIs and store them in Hadoop HDFS.
The data source for this demo is earthquake data provided by USGS (U.S. Geological Survey, science for a changing world).
USGS provides a REST API which we will use to request earthquake data. Sample request in CSV format: earthquakes
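The exact endpoint used by the scripts is defined in the code; purely as an illustration, one publicly documented USGS endpoint that returns recent earthquakes in CSV format is:
https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv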
The steps to store these data in HDFS are the following (a minimal Python sketch of steps 1-3 is given after the list):
- Request the data from the Rest API.
- Pre-process the data to remove headers and format the earthquake date and time.
- Save the data temporarily on the host machine.
- Upload the data to HDFS.
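The following is a minimal Python sketch of steps 1-3, assuming the USGS all-hour CSV feed shown above and the data/earthquakes.csv output path that appears later in this guide; the actual scripts in the earthquakes folder may differ:

import csv
import io
import os
import requests

FEED_URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv"
OUTPUT_FILE = "data/earthquakes.csv"

def fetch_and_store():
    # Step 1: request the data from the REST API.
    response = requests.get(FEED_URL, timeout=30)
    response.raise_for_status()
    reader = csv.reader(io.StringIO(response.text))
    next(reader)  # Step 2: drop the header row.
    os.makedirs("data", exist_ok=True)
    # Step 3: append the rows to a temporary file on the host machine.
    with open(OUTPUT_FILE, "a", newline="") as out:
        writer = csv.writer(out)
        for row in reader:
            # Step 2 (continued): split the ISO timestamp into date and time.
            date, _, time = row[0].partition("T")
            writer.writerow([date, time.rstrip("Z")] + row[1:])

if __name__ == "__main__":
    fetch_and_store()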
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
The initial step is to download the repository on your Hadoop machine. To do so, run the following command in a terminal:
git clone https://github.com/UoW-CPC/rabbda-earthquakes-realtime.git
Having downloaded the repository, you can now run the application. First, move to the working directory by executing the command:
cd rabbda-earthquakes-realtime
Now execute the command:
ls
There you can see a folder and three files:
- earthquakes, a folder which contains the Python scripts used to perform steps 1-3 mentioned in the introduction.
- requirements.txt, a file listing the packages required by the Python scripts.
- flume-earthquakes-realtime.conf, the configuration file used by the Flume service to perform step 4 mentioned in the introduction.
- README.md, project description file.
At this point, install the requirements by running the command:
pip install -r requirements.txt
Having installed the requirements, you can now run the Python application.
Move to the earthquakes folder:
cd earthquakes
and execute the earthquakes script:
python earthquakes.py
By default, the script makes a request every 10 minutes. Alternatively, you can pass a parameter to change this value. Example:
python earthquakes.py 2
Now we have a request every 2 minutes.
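Inside the script, this interval is probably handled along the lines of the following sketch (the argument is in minutes and defaults to 10; fetch_and_store is the hypothetical helper from the sketch above):

import sys
import time

interval_minutes = int(sys.argv[1]) if len(sys.argv) > 1 else 10
while True:
    fetch_and_store()  # hypothetical helper performing steps 1-3
    time.sleep(interval_minutes * 60)  # wait for the configured number of minutes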
To see the results, open a new terminal and move to the repository directory. There you will see a new directory, data. If you move into this folder, there is a file called earthquakes.csv.
To see its content run the following command:
cat earthquakes.csv
Alternatively, you can monitor file changes with the command:
tail -F earthquakes.csv
At this point, we have temporarily stored the data on the local machine.
The next step is to upload those data to HDFS. To do so, we use the Flume service. Open a new terminal and move once again to the rabbda-earthquakes-realtime directory.
There we have to edit the flume-earthquakes-realtime.conf file. Specifically, you need to edit the eq.sources.r1.command and eq.sinks.k1.hdfs.path properties to match your local environment.
Example:
eq.sources.r1.command = tail -F /home/user/rabbda-earthquakes-realtime/data/earthquakes.csv
eq.sinks.k1.hdfs.path = hdfs://NameNode.Domain.com:8020/user/UserName/flume/realtime
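For reference, the remaining properties in the file typically define an exec source, a memory channel, and an HDFS sink; a plausible sketch of the full configuration (standard Flume 1.x property names, example values) is:

eq.sources = r1
eq.channels = c1
eq.sinks = k1

eq.sources.r1.type = exec
eq.sources.r1.command = tail -F /home/user/rabbda-earthquakes-realtime/data/earthquakes.csv
eq.sources.r1.channels = c1

eq.channels.c1.type = memory
eq.channels.c1.capacity = 1000

eq.sinks.k1.type = hdfs
eq.sinks.k1.hdfs.path = hdfs://NameNode.Domain.com:8020/user/UserName/flume/realtime
eq.sinks.k1.hdfs.fileType = DataStream
eq.sinks.k1.hdfs.writeFormat = Text
eq.sinks.k1.channel = c1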
Now it is time to start the Flume agent and upload the data to HDFS. Execute the command:
flume-ng agent --name eq --conf-file flume-earthquakes-realtime.conf
Having done this, the Flume agent starts monitoring the earthquakes.csv file for changes and uploads the data to HDFS.
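If you want to watch the agent's log output in the console while it runs, a commonly used variant of the command above is (assuming flume-ng is on your PATH):
flume-ng agent --name eq --conf-file flume-earthquakes-realtime.conf -Dflume.root.logger=INFO,console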
Finally, go to the Ambari Files View, open the path specified previously, and watch the data sink into HDFS in real time.
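Alternatively, you can list the files written by Flume from the command line, using the HDFS path you configured earlier (example path from above):
hdfs dfs -ls /user/UserName/flume/realtime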