# Dataset Reader: A Sample Application for Reading a Dataset from HDFS

This project contains a sample application that reads a dataset from HDFS and presents it to the user in graphical form.

Imagine a flow like this:

  1. A dataset is uploaded through the data catalog into the platform. The file is stored on HDFS.
  2. A data scientist does some analysis on it using ATK. The result is also stored on HDFS.
  3. An application developer uploads the dataset-reader application into the platform and binds it with the file.
  4. Dataset-reader presents the dataset in a friendly form, as a set of charts.

## Preparing data

You can either use an already prepared dataset, ready to be visualised by the application (see: Using pre-built dataset), or go through the whole sample flow and prepare your own dataset using the TAP Analytics Toolkit (see: Preparing dataset manually).

### Using pre-built dataset

  1. Go to the Data catalog page.
  2. Select the Submit transfer tab.
  3. Fill in the dataset title and choose local file upload.
  4. Select the file to upload (a sample dataset can be found here: data/nf-data-application.csv).
  5. Alternatively, you can select upload using a link and specify a link to the raw file on GitHub (e.g. nf-data-application.csv).
  6. Submit the transfer.
  7. When the transfer finishes, a new dataset will be visible in the Data catalog.
  8. To acquire the link to the file on HDFS, go to the recently created dataset and copy the value of the targetUri property.
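
The copied targetUri is a full HDFS path. Its exact shape depends on your platform deployment; a purely illustrative sketch with placeholder segments (these are assumptions, not values from this repository) might look like:

```
hdfs://<name-node>/org/<org-guid>/brokers/userspace/<instance-guid>/<file-id>
```

Whatever the exact form, pass the entire URI unchanged as the FILE environment variable later in the deployment steps.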

### Preparing dataset manually

To prepare the dataset on your own, follow the steps described in [Workshop Module 1](workshop/Intel Workshop Module 1 Final.pdf).

## Deploying application to TAP

### Manual deployment

#### Compilation and running

  1. Clone this repository:

```
git clone https://github.com/trustedanalytics/dataset-reader-sample.git
```

  2. Compile it using Maven:

```
mvn compile
```

  3. (Optional) Run it locally, passing the path to the file:

```
FILE=<path_to_the_file> mvn spring-boot:run -Dspring.profiles.active=local
```

#### Pushing to the platform

  1. Make the Java package:

```
mvn package
```

  2. Log in and target the proper organization and space:

```
cf api <platform API address>
cf login
cf target -o <organization name> -s <space name>
```
  3. (Optional) Change the application name and host name in manifest.yml if necessary:

```
name: <your application name>
host: <application host name>
```

ℹ️ For example, if you set host to "dataset-reader" and your platform URL is "example.com", the application will be hosted under 'dataset-reader.example.com' domain.
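
For reference, a complete manifest.yml for this application might look like the following sketch. The memory size and artifact path are assumptions, not values taken from this repository; adjust them to your build output and platform quota:

```
applications:
- name: dataset-reader              # your application name
  host: dataset-reader              # subdomain under the platform URL
  memory: 1G                        # assumed; adjust to your quota
  path: target/dataset-reader.jar   # assumed artifact produced by mvn package
```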

  4. Push dataset-reader to the platform:

```
cf push
```

The application will fail to start at this point because it does not yet know which file to serve or how to connect to HDFS.

  5. Create an HDFS service instance called hdfs-instance. You can do that from the command line or via the browser:
     - Using the CF CLI:

```
cf create-service hdfs shared hdfs-instance
```

     - Using the WebUI:
       1. Go to the Marketplace.
       2. Select the HDFS service offering.
       3. Choose the Shared plan.
       4. Type the name of the instance: hdfs-instance (note: the instance must be called hdfs-instance).
       5. Click Create new instance.
  6. Bind the hdfs-instance to the application:
     - Using the CF CLI:

```
cf bind-service dataset-reader hdfs-instance
```

     - Using the WebUI:
       1. Go to the Applications list.
       2. Go to the details of the dataset-reader application.
       3. Switch to the Bindings tab.
       4. Click the Bind button next to hdfs-instance (you can use the filtering functionality to search for the service).
  7. Create an instance of the kerberos service, named kerberos-service, in the same way as above, and bind it as well.
  8. Pass the path to the file on HDFS (acquired in the Preparing data step) as an environment variable called "FILE":

```
cf set-env <application name> FILE <path to file on HDFS>
```

  9. Restart the application to reload the environment variables:

```
cf restart <application name>
```
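
For the kerberos service created and bound above, the CLI commands mirror the hdfs-instance ones. The plan name shared is an assumption carried over from the HDFS instructions; check `cf marketplace` for the plans actually offered on your platform:

```
cf create-service kerberos shared kerberos-service
cf bind-service dataset-reader kerberos-service
```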

### Automated deployment

  1. Clone this repository: `git clone https://github.com/trustedanalytics/dataset-reader-sample.git`
  2. Switch to the deploy directory: `cd deploy`
  3. Install tox: `sudo -E pip install --upgrade tox`
  4. Run: `tox`
  5. Activate the virtualenv with the installed dependencies: `. .tox/py27/bin/activate`
  6. Run the deployment script: `python deploy.py`. The script will prompt for parameters on input; alternatively, provide parameters when running the script (`python deploy.py -h` lists the parameters with their descriptions).
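
The steps above can be condensed into a single shell session; this is just the listed commands in sequence, assuming a Unix-like environment with pip available:

```
git clone https://github.com/trustedanalytics/dataset-reader-sample.git
cd dataset-reader-sample/deploy
sudo -E pip install --upgrade tox
tox                          # builds the .tox/py27 virtualenv with dependencies
. .tox/py27/bin/activate
python deploy.py             # or: python deploy.py -h to list parameters
```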
