Vista

Materialization Trade-offs for Feature Transfer from Deep CNNs for Multimodal Data Analytics (technical report).

Prerequisites

  1. A Spark cluster set up in Spark standalone mode with HDFS (tested with Java 1.8).
  2. TensorFlow installed (tested with TensorFlow 1.3.0).
  3. Scala build tool (sbt) installed (tested with sbt version 0.13.9). On Ubuntu 16.04, the following snippet installs sbt.
    echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
    sudo apt-get update
    sudo apt-get install sbt

Project Structure

  • /code: Contains the Python and Scala code for the Vista optimizer and helper functions.
  • /data: Contains data pre-processing/generation scripts for the Foods and Amazon datasets.
  • /exps: Contains example scripts showing how to use the Vista optimizer, along with scripts implementing the baseline approaches discussed in the paper.

How to use

  1. Clone this repository on the Spark master node.
    git clone https://github.com/ADALabUCSD/vista.git
  2. Run the download_cnn_weights.sh script to download the pre-trained weights for the ConvNets.
    $ ./download_cnn_weights.sh
  3. Go to the /data/{amazon or foods} directory and run the download_data.sh and generate_data.py scripts.
    $ ./download_data.sh
    $ python generate_data.py
  4. Ingest the generated structured data file (amazon.csv or foods.csv) and the images into HDFS. Alternatively, any other input data can be used. The structured file should conform to the {ID, X_str, y} format without a header row, and the images directory should contain resized RGB images (227*227) named after the ID (e.g. ID.jpg). A hypothetical example of this layout is shown after this list.
    $ hadoop fs -put ./foods.csv    /foods.csv
    $ hadoop fs -put ./images       /images
  5. Go to the /code/scala directory and build the Scala project to create a jar containing the helper functions. The generated jar can be found at /code/scala/target/scala-2.11/vista-udfs_2.11-1.0.jar.
    $ sbt package
  6. Go to the /exps directory and copy optimizer.py to a new file. Edit the file to suit your requirements. The first step is to create an instance of the Vista class, providing all the inputs and configuration values. The optimizer then makes decisions and picks values for the logical plan, the physical plan operators, and the Spark configuration. Alternatively, the user can override the decisions picked by the optimizer.
    /** Instantiation Parameters
     * name          : Name given to the Spark job
     * mem_sys       : System memory of a Spark worker
     * n_nodes       : Number of nodes in the Spark cluster
     * cpu_sys       : Number of CPUs available in the Spark cluster
     * model         : ConvNet model name. Possible values -> {'alexnet', 'vgg16', 'resnet50'}
     * n_layers      : Number of layers from the topmost layer of the ConvNet to be explored
     * start_layer   : Starting layer of the ConvNet. Use 0 when starting with raw images
     * struct_input  : Input path to the structured input
     * images_input  : Input path to the images
     * n             : Number of total records
     * dS            : number of structured features
     * model_name    : Name of the downstream (PySpark MLlib) ML model to run in the Vista optimizer
     * extra_config  : Extra configuration settings for hyperparameter tuning with the downstream model
     * tuning_method : Method (TrainValidationSplit / CrossValidator) to use for hyperparameter tuning.
     * seed          : Random seed used for all data splits and algorithm training, for reproducible results
     * test_size     : Fraction of the dataset held out in the train-test and train-validation splits
    **/
    vista = Vista("vista-example", 32, 8, 8, 'alexnet', 4, 0, 'hdfs://../foods.csv',
                      'hdfs://.../images', 20129, 130, model_name='LogisticRegression', extra_config={}, 
                      tuning_method=None, seed=2019, test_size=0.2)
    
    //possible values for model_name -> {'LogisticRegression', 'LinearSVC', 'DecisionTreeClassifier', 'GBTClassifier', 'RandomForestClassifier', 'OneVsRest'}
    //possible values for tuning_method -> {'TrainValidationSplit', 'CrossValidator'}
    
    // extra_config takes a dictionary mapping the chosen model's attribute names to lists of values to explore as key-value pairs for k-fold cross-validation. It can also take `numFolds` for the value of k. A hedged example is shown after this list.
    // extra_config is applicable to all currently supported downstream models except 'OneVsRest'.

    //Optional: overriding the decisions picked by the optimizer
    vista.override_inference_type('bulk')               //possible values -> {'bulk', 'staged'}
    vista.override_operator_placement('before-join')    //possible values -> {'before-join', 'after-join'}
    vista.override_join('s')                            //possible values -> {'b', 's'}
    vista.override_persistence_format('deser')          //possible values -> {'ser', 'deser'}
    
    //Starting the ConvNet feature transfer workload
    print(vista.run())
  7. To submit the Spark job, use the following command. We recommend using at least 4 GB of Spark driver memory. Replace vista.py with the path to your Python script.
    $ spark-submit --master <spark-master-url> --driver-memory 8g --packages databricks:tensorframes:0.2.9-s_2.11 --jars ../code/scala/target/scala-2.11/vista-udfs_2.11-1.0.jar vista.py
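
The snippet below is a minimal, hypothetical sketch of how a structured-input file in the {ID, X_str, y} layout referenced in step 4 could be written; the file name, the feature values, and the assumption that the structured features appear as individual numeric columns are illustrative only and are not taken from the Vista documentation.

    # Hypothetical example only: write a headerless CSV in the {ID, X_str, y} layout,
    # with the ID first, the structured feature values next, and the label y last.
    import csv

    rows = [
        ["1001", 0.12, 3.4, 0.0, 1],   # ID, structured features..., label
        ["1002", 0.98, 1.1, 1.0, 0],
    ]

    with open("foods.csv", "w") as f:
        csv.writer(f).writerows(rows)  # no header row, as Vista expects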
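
As mentioned in step 6, extra_config can drive hyperparameter tuning for the downstream model. The following is a hedged sketch, not taken from the Vista examples: the grid keys regParam and maxIter are standard PySpark MLlib LogisticRegression parameter names, while the grid values, numFolds, and input paths are illustrative assumptions.

    # Hedged sketch: k-fold cross-validation over a small LogisticRegression grid.
    # Parameter values and numFolds below are illustrative, not recommended settings.
    vista = Vista("vista-tuning-example", 32, 8, 8, 'alexnet', 4, 0, 'hdfs://../foods.csv',
                  'hdfs://.../images', 20129, 130, model_name='LogisticRegression',
                  extra_config={'regParam': [0.01, 0.1], 'maxIter': [50, 100], 'numFolds': 3},
                  tuning_method='CrossValidator', seed=2019, test_size=0.2)
    print(vista.run())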

Limitations

  • When transferring features from Conv layers, Vista applies max pooling by default. The pooling filter widths and strides are selected such that every Conv volume is reduced to 2*2 feature maps with the same depth. This behavior is currently not configurable; ideally, a user should be able to specify different feature transformations on the Conv features, such as max or average pooling. A hedged sketch of this pooling-size calculation is given below.
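
The following is a minimal sketch, not Vista's actual implementation, of how such a pooling filter and stride could be chosen so that an H*W Conv volume reduces to 2*2 while keeping the same depth; the function name and the use of TensorFlow's tf.nn.max_pool are assumptions for illustration.

    # Hypothetical illustration: non-overlapping max pooling that reduces a
    # [batch, H, W, D] Conv volume to [batch, 2, 2, D].
    import tensorflow as tf

    def pool_conv_volume_to_2x2(features):
        # Assumes the spatial dimensions H and W are statically known and even.
        h, w = int(features.shape[1]), int(features.shape[2])
        # Filter size = stride = H/2 (and W/2) yields a 2x2 spatial grid,
        # leaving the depth D unchanged.
        ksize = [1, h // 2, w // 2, 1]
        strides = [1, h // 2, w // 2, 1]
        return tf.nn.max_pool(features, ksize=ksize, strides=strides, padding='VALID')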
