# Dataset Reader: A Sample Application for Reading a Dataset from HDFS

This project contains a sample application that reads a dataset from HDFS and presents it to the user in graphical form.

Imagine a flow like this:

  1. A dataset is uploaded through the data catalog into the platform. The file is stored on HDFS.
  2. A data scientist does some analysis on it using ATK. The result is also stored on HDFS.
  3. An application developer uploads the dataset-reader application into the platform and binds it with the file.
  4. Dataset-reader presents the dataset in a friendly form, as a set of charts.

## Preparing data

You can either use an already prepared dataset, ready to be visualised by the application (see: Using pre-built dataset), or go through the whole sample flow and prepare your own dataset using the TAP Analytics Toolkit (see: Preparing dataset manually).

### Using pre-built dataset

  1. Go to the Data catalog page.
  2. Select the Submit transfer tab.
  3. Fill in the dataset title and choose local file upload.
  4. Select the file to upload (a sample dataset can be found here: data/nf-data-application.csv).
  5. Alternatively, you can select upload using a link and specify a link to the raw file on GitHub (e.g. nf-data-application.csv).
  6. Submit the transfer.
  7. When the transfer finishes, a new dataset will be visible in the Data catalog.
  8. To acquire the link to the file on HDFS, go to the recently created dataset and copy the value of the targetUri property.
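
The copied targetUri is a full HDFS path. Its exact shape depends on your platform deployment; a purely illustrative sketch with placeholder segments (these are assumptions, not values from this repository) might look like:

```
hdfs://<name-node>/org/<org-guid>/brokers/userspace/<instance-guid>/<file-id>
```

Whatever the exact form, pass the entire URI unchanged as the FILE environment variable later in the deployment steps.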

### Preparing dataset manually

To prepare the dataset on your own, follow the steps described in [Workshop Module 1](workshop/Intel Workshop Module 1 Final.pdf).

## Deploying application to TAP

### Manual deployment

#### Compilation and running

  1. Clone this repository:

```
git clone https://github.com/trustedanalytics/dataset-reader-sample.git
```

  2. Compile it using Maven:

```
mvn compile
```

  3. (Optional) Run it locally, passing the path to the file:

```
FILE=<path_to_the_file> mvn spring-boot:run -Dspring.profiles.active=local
```

#### Pushing to the platform

  1. Make the Java package:

```
mvn package
```

  2. Log in and target the proper organization and space:

```
cf api <platform API address>
cf login
cf target -o <organization name> -s <space name>
```
  3. (Optional) Change the application name and host name in manifest.yml if necessary:

```
name: <your application name>
host: <application host name>
```

ℹ️ For example, if you set host to "dataset-reader" and your platform URL is "example.com", the application will be hosted under 'dataset-reader.example.com' domain.
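
For reference, a complete manifest.yml for this application might look like the following sketch. The memory size and artifact path are assumptions, not values taken from this repository; adjust them to your build output and platform quota:

```
applications:
- name: dataset-reader              # your application name
  host: dataset-reader              # subdomain under the platform URL
  memory: 1G                        # assumed; adjust to your quota
  path: target/dataset-reader.jar   # assumed artifact produced by mvn package
```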

  4. Push dataset-reader to the platform:

```
cf push
```

The application will fail to start at this point because it does not yet know which file to serve or how to connect to HDFS.

  5. Create an HDFS service instance called hdfs-instance. You can do that from the command line or via the browser:
     - Using the CF CLI:

```
cf create-service hdfs shared hdfs-instance
```

     - Using the WebUI:
       1. Go to the Marketplace.
       2. Select the HDFS service offering.
       3. Choose the Shared plan.
       4. Type the name of the instance: hdfs-instance (note: the instance must be called hdfs-instance).
       5. Click Create new instance.
  6. Bind the hdfs-instance to the application:
     - Using the CF CLI:

```
cf bind-service dataset-reader hdfs-instance
```

     - Using the WebUI:
       1. Go to the Applications list.
       2. Go to the details of the dataset-reader application.
       3. Switch to the Bindings tab.
       4. Click the Bind button next to hdfs-instance (you can use the filtering functionality to search for the service).
  7. Create an instance of the kerberos service, named kerberos-service, in the same way as above, and bind it as well.
  8. Pass the path to the file on HDFS (acquired in the Preparing data step) as an environment variable called "FILE":

```
cf set-env <application name> FILE <path to file on HDFS>
```

  9. Restart the application to reload the environment variables:

```
cf restart <application name>
```
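
For the kerberos service created and bound above, the CLI commands mirror the hdfs-instance ones. The plan name shared is an assumption carried over from the HDFS instructions; check `cf marketplace` for the plans actually offered on your platform:

```
cf create-service kerberos shared kerberos-service
cf bind-service dataset-reader kerberos-service
```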

### Automated deployment

  1. Clone this repository: `git clone https://github.com/trustedanalytics/dataset-reader-sample.git`
  2. Switch to the deploy directory: `cd deploy`
  3. Install tox: `sudo -E pip install --upgrade tox`
  4. Run: `tox`
  5. Activate the virtualenv with the installed dependencies: `. .tox/py27/bin/activate`
  6. Run the deployment script: `python deploy.py`. The script will prompt for parameters on input; alternatively, provide parameters when running the script (`python deploy.py -h` lists the parameters with their descriptions).
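
The steps above can be condensed into a single shell session; this is just the listed commands in sequence, assuming a Unix-like environment with pip available:

```
git clone https://github.com/trustedanalytics/dataset-reader-sample.git
cd dataset-reader-sample/deploy
sudo -E pip install --upgrade tox
tox                          # builds the .tox/py27 virtualenv with dependencies
. .tox/py27/bin/activate
python deploy.py             # or: python deploy.py -h to list parameters
```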
