This application is part of a series of solutions that aim to demonstrate how Big Data technologies can be used to create complex, real-life Big Data applications.
Specifically, with this application, we present how to acquire real-time data from REST APIs and store them in Hadoop HDFS.
The data source for this demo is earthquake data provided by USGS (U.S. Geological Survey, science for a changing world).
USGS provides a REST API which we will use to request earthquake data. Sample request in CSV format: earthquakes
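The exact endpoint used by the scripts is defined in the code; purely as an illustration, one publicly documented USGS endpoint that returns recent earthquakes in CSV format is:
https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv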
The steps to store these data in HDFS are the following (a minimal Python sketch of steps 1-3 is given after the list):
- Request the data from the Rest API.
- Pre-process the data to remove headers and format the earthquake date and time.
- Save the data temporarily on the host machine.
- Upload the data to HDFS.
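The following is a minimal Python sketch of steps 1-3, assuming the USGS all-hour CSV feed shown above and the data/earthquakes.csv output path that appears later in this guide; the actual scripts in the earthquakes folder may differ:

import csv
import io
import os
import requests

FEED_URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.csv"
OUTPUT_FILE = "data/earthquakes.csv"

def fetch_and_store():
    # Step 1: request the data from the REST API.
    response = requests.get(FEED_URL, timeout=30)
    response.raise_for_status()
    reader = csv.reader(io.StringIO(response.text))
    next(reader)  # Step 2: drop the header row.
    os.makedirs("data", exist_ok=True)
    # Step 3: append the rows to a temporary file on the host machine.
    with open(OUTPUT_FILE, "a", newline="") as out:
        writer = csv.writer(out)
        for row in reader:
            # Step 2 (continued): split the ISO timestamp into date and time.
            date, _, time = row[0].partition("T")
            writer.writerow([date, time.rstrip("Z")] + row[1:])

if __name__ == "__main__":
    fetch_and_store()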
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
The initial step is to download the repository on your Hadoop machine. To do so, run the following command in a terminal:
git clone https://github.com/UoW-CPC/rabbda-earthquakes-realtime.git
Having downloaded the repository, you can now run the application. First, move to the working directory by executing the command:
cd rabbda-earthquakes-realtime
Now execute the command:
ls
There you can see a folder and three files:
- earthquakes, a folder which contains the Python scripts used to perform steps 1-3 mentioned in the introduction.
- requirements.txt, a file listing the packages required by the Python scripts.
- flume-earthquakes-realtime.conf, the configuration file used by the Flume service to perform step 4 mentioned in the introduction.
- README.md, project description file.
At this point, install the requirements by running the command:
pip install -r requirements.txt
Having installed the requirements, you can now run the Python application.
Move to the earthquakes folder:
cd earthquakes
and execute the earthquakes script:
python earthquakes.py
By default, the script makes a request every 10 minutes. Alternatively, you can pass a parameter to change this value. Example:
python earthquakes.py 2
Now we have a request every 2 minutes.
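Inside the script, this interval is probably handled along the lines of the following sketch (the argument is in minutes and defaults to 10; fetch_and_store is the hypothetical helper from the sketch above):

import sys
import time

interval_minutes = int(sys.argv[1]) if len(sys.argv) > 1 else 10
while True:
    fetch_and_store()  # hypothetical helper performing steps 1-3
    time.sleep(interval_minutes * 60)  # wait for the configured number of minutes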
To see the results, open a new terminal and move to the repository directory. There you will see a new directory, data. If you move into this folder, there is a file called earthquakes.csv.
To see its content run the following command:
cat earthquakes.csv
Alternatively, you can monitor file changes with the command:
tail -F earthquakes.csv
At this point, we have temporarily stored the data on the local machine.
The next step is to upload those data to HDFS. To do so, we use the Flume service. Open a new terminal and move once again to the rabbda-earthquakes-realtime directory.
There we have to edit the flume-earthquakes-realtime.conf file. Specifically, you need to edit the eq.sources.r1.command and eq.sinks.k1.hdfs.path properties to match your local environment.
Example:
eq.sources.r1.command = tail -F /home/user/rabbda-earthquakes-realtime/data/earthquakes.csv
eq.sinks.k1.hdfs.path = hdfs://NameNode.Domain.com:8020/user/UserName/flume/realtime
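For reference, the remaining properties in the file typically define an exec source, a memory channel, and an HDFS sink; a plausible sketch of the full configuration (standard Flume 1.x property names, example values) is:

eq.sources = r1
eq.channels = c1
eq.sinks = k1

eq.sources.r1.type = exec
eq.sources.r1.command = tail -F /home/user/rabbda-earthquakes-realtime/data/earthquakes.csv
eq.sources.r1.channels = c1

eq.channels.c1.type = memory
eq.channels.c1.capacity = 1000

eq.sinks.k1.type = hdfs
eq.sinks.k1.hdfs.path = hdfs://NameNode.Domain.com:8020/user/UserName/flume/realtime
eq.sinks.k1.hdfs.fileType = DataStream
eq.sinks.k1.hdfs.writeFormat = Text
eq.sinks.k1.channel = c1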
Now it is time to start the Flume agent and upload the data to HDFS. Execute the command:
flume-ng agent --name eq --conf-file flume-earthquakes-realtime.conf
Having done this, the Flume agent starts monitoring the earthquakes.csv file for changes and uploads the data to HDFS.
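If you want to watch the agent's log output in the console while it runs, a commonly used variant of the command above is (assuming flume-ng is on your PATH):
flume-ng agent --name eq --conf-file flume-earthquakes-realtime.conf -Dflume.root.logger=INFO,console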
Finally, go to the Ambari Files View, open the path specified previously, and watch the data sink into HDFS in real time.
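Alternatively, you can list the files written by Flume from the command line, using the HDFS path you configured earlier (example path from above):
hdfs dfs -ls /user/UserName/flume/realtime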