PBS Scheduler Optimization

This repository implements data preprocessing and machine learning models to optimize the PBS scheduler.

NCAR Workload: Cheyenne, April 2019 (M2, M3, M4)
Benchmark workloads from http://www.cs.huji.ac.il/labs/parallel/workload/logs.html:
Cornell Theory Center IBM SP2 (CTC)
Swedish Royal Institute of Technology IBM SP2 (KTH)
San Diego Supercomputer Center Blue Horizon (SDSC)

Requirements

Python 3.6.2
Numpy
Pandas
Pytorch 1.0.1
Scikit-learn 0.21.1
Tensorflow 1.13.1

Data Preprocessing

python3 pbs_parser_seaprate_v3.py accounting outloc

where accounting: directory with accounting logs (previous historical PBS data)
outloc: output directory for the combined accounting logs (in CSV format)

To extract additional features from the accounting logs, run:

python3 process_data.py inloc outloc

where inloc : directory with combined accounting logs (from pbs parser step)
outloc : output directory with added features
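The added features include the calendar fields used later for plotting (time_day, day_week, week_month). A minimal pandas sketch of how such fields can be derived from a job's end timestamp; the column names here are illustrative, not necessarily the exact ones used by process_data.py:

```python
import pandas as pd

# Hypothetical sketch: derive calendar features from a job's end timestamp.
df = pd.DataFrame({"end": pd.to_datetime(["2019-04-01 08:15:00",
                                          "2019-04-12 23:40:00"])})
df["time_day"] = df["end"].dt.hour                   # hour of day, 0-23
df["day_week"] = df["end"].dt.dayofweek              # Monday=0 ... Sunday=6
df["week_month"] = (df["end"].dt.day - 1) // 7 + 1   # week index within the month
```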

To filter suspicious jobs from the data, run:

python3 filter_data.py inloc outloc

where inloc: directory with combined accounting logs (from the previous process_data step)
outloc: output directory with filtered data

To split the data into different weeks, run:

python3 weekly_split.py pre_data_loc post_data_loc

where pre_data_loc : directory containing pre-processed CSV data
post_data_loc : directory containing post-processed CSV data

Data Analysis & Plotting

The scripts in this section produce charts for visualizing PBS accounting data. The analysis computes each job's misprediction (user-predicted running time minus actual running time) and divides jobs into six bins (0-15 minutes, 15 min-1 h, 1 h-3 h, 3 h-7 h, >=7 h, and underprediction).
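The six-bin breakdown can be sketched as follows, assuming mispredictions are given in seconds (negative values mean the user underpredicted the runtime):

```python
import pandas as pd

# Bin each job's misprediction (seconds) into the six bins described above.
def bin_misprediction(mispred_seconds):
    bins = [-float("inf"), 0, 15 * 60, 3600, 3 * 3600, 7 * 3600, float("inf")]
    labels = ["underprediction", "0-15min", "15min-1h", "1h-3h", "3h-7h", ">=7h"]
    return pd.cut(pd.Series(mispred_seconds), bins=bins, labels=labels)
```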

Plot by single feature

This script produces a breakdown of users' mispredictions in terms of different user-selected features. Run:

python3 create_feature_plots.py data_path feature

where data_path: directory containing CSV data
feature: the feature to break down by (e.g. queue, resources_used.walltime)

A successful run produces two types of plots:
1. Pie plot: one per period, dissected by the feature (e.g. Monday through Friday when the feature is day_week)
2. Stacked column chart: all periods, dissected by the feature

Plot by two features

This script extends the previous one with a two-level feature filter: feature2 acts as the outer filter and feature1 as the inner filter.

python3 create_multi-index_feature_plots.py feature1 feature2

where feature1 and feature2 stand for the two different features (e.g. queue, resources_used.walltime)

A successful run produces two types of plots:
1. Pie plot: contribution of each feature2 value with more than 2% of the job volume
2. Stacked column chart: one per feature2 value, broken down by feature1

(See example in img directory)

Current list of supported plotting features:
time_day
week_month
day_week
user
account
queue

Two-dimensional plot (scatterplot)

To generate a two-dimensional (scatter) plot of two features, run:

python3 overall_plot.py --multi_dim_plot=True --multi_dim_x_feature='mispred_ratio' --multi_dim_y_feature='num_jobs' --num_top=10 --groupby_val='account' --data_path='../apr_2019_full/'

Argument parameters:
--multi_dim_plot whether to produce the multi-dimensional plot, default: False
--multi_dim_x_feature feature on the x-axis, default: mispred_ratio
--multi_dim_y_feature feature on the y-axis, default: num_jobs
--num_top keep the top N points by the product of the two specified dimensions, default: 0 (no filter)
--groupby_val data point representation (e.g. user or account), default: user
--data_path directory containing the parsed accounting CSVs of interest

Overall plot distribution

To generate the overall distribution plot, run:

python3 overall_plot.py --data_path='../apr_2019_full/' --groupby_val='user' --overall_distr_plot=True --overall_feature='user_mispred'

Argument parameters:
--data_path directory containing the parsed accounting CSVs of interest
--groupby_val data point representation (e.g. user or account), default: user
--overall_distr_plot whether to plot the overall distribution or not
--overall_feature feature to plot by

Current list of supported plotting features:
Resource_List.walltime
resources_used.walltime
resources_used.cput
user_mispred (user-predicted running time minus actual running time)
mispred_ratio (misprediction divided by the user-predicted running time)
mispred_ratio_runtime (misprediction divided by the actual running time)

Current list of supported groupby_val features:
user
account
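The three misprediction features above can be computed directly from the PBS accounting fields; a minimal pandas sketch, assuming walltimes have already been converted to seconds:

```python
import pandas as pd

# Derive the misprediction features from the PBS accounting walltime fields.
df = pd.DataFrame({
    "Resource_List.walltime": [3600.0, 1800.0],   # user-requested walltime (s)
    "resources_used.walltime": [600.0, 2400.0],   # actual runtime (s)
})
df["user_mispred"] = df["Resource_List.walltime"] - df["resources_used.walltime"]
df["mispred_ratio"] = df["user_mispred"] / df["Resource_List.walltime"]
df["mispred_ratio_runtime"] = df["user_mispred"] / df["resources_used.walltime"]
```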

Training State-of-the-art Model (Neural Network)

To train a feed-forward network (keyword: ff), a bi-directional long short-term memory network (Bi-LSTM) (keyword: rnn), a convolutional neural network (keyword: cnn), or a residual neural network (keyword: resnet), follow the example below (shown for the feed-forward network):

python3 train.py --batch_size=64 --num_epochs=100 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_ff_model/' --test_path='../testing_small/' --model_type='ff' --dropout=0.8 --device='cuda:0' --old=True

Training State-of-the-art Model (Random Forest, XGBoost)

Trains the default models proposed in "Machine Learning Predictions for Underestimation of Job Runtime on HPC System" (Guo, Nomura, Barton, Zhang, and Matsuoka, 2018):

python3 rf_xgboost.py train_path test_path rf_report xgb_report old

where train_path, test_path: directories containing the training and testing CSV data
rf_report, xgb_report: directories for the summary results of RF and XGBoost respectively
old: whether the data comes from a benchmark workload (True: benchmark workload, False: NCAR workload)

Unsupervised Domain Adaptation by Backpropagation

Implementation of the domain adaptation model proposed in "Unsupervised Domain Adaptation by Backpropagation" (Ganin and Lempitsky, 2015):

python3 train_transfer.py --batch_size=32 --num_epochs=200 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_dann_model/' --test_path='../testing_small/' --model_type='dann' --dropout=0.8 --device='cuda:0'
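The core of this model is the gradient reversal layer: an identity in the forward pass, while the gradient flowing back from the domain classifier to the feature extractor is scaled by -lambda, so the features are trained to confuse the domain classifier. A toy NumPy sketch of the two passes (the actual model in train_transfer.py implements this inside the autograd framework):

```python
import numpy as np

# Gradient reversal layer (Ganin & Lempitsky, 2015), shown as explicit passes.
def grl_forward(features):
    return features  # identity in the forward pass

def grl_backward(grad_from_domain_head, lam=1.0):
    # Reverse (and scale) the gradient before it reaches the feature extractor.
    return -lam * grad_from_domain_head
```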

Deep Adaptation Network (DAN)

Implementation of the proposed domain adaptation model from "Learning Transferable Features with Deep Adaptation Networks" (Long, Cao, Wang, Jordan, 2015)

python3 train_transfer.py --batch_size=32 --num_epochs=200 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_dan_model/' --test_path='../testing_small/' --model_type='dan' --dropout=0.8 --device='cuda:0'
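DAN aligns source and target feature distributions with a maximum mean discrepancy (MMD) penalty. A NumPy sketch of a linear-kernel MMD between two feature batches (the paper uses a multi-kernel Gaussian MMD; this simplification is for illustration only):

```python
import numpy as np

# Linear-kernel MMD: squared distance between batch feature means.
def linear_mmd(source, target):
    delta = source.mean(axis=0) - target.mean(axis=0)
    return float(delta @ delta)
```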

Domain Adaptation with Correlation Alignment (DCORAL)

Implementation based on the model proposed in "Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation" (Chen et al., 2018):

python3 train_transfer.py --batch_size=32 --num_epochs=200 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_dan_model/' --test_path='../testing_small/' --model_type='dan' --dropout=0.8 --device='cuda:0' --old=True
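The correlation alignment (CORAL) term at the heart of this model penalizes the distance between the source and target feature covariance matrices. A NumPy sketch of the loss, with the 1/(4*d^2) scaling from the CORAL literature:

```python
import numpy as np

# CORAL loss: squared Frobenius distance between feature covariances.
def coral_loss(source, target):
    d = source.shape[1]
    cs = np.cov(source, rowvar=False)  # source feature covariance (d x d)
    ct = np.cov(target, rowvar=False)  # target feature covariance (d x d)
    return float(np.sum((cs - ct) ** 2) / (4 * d * d))
```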

Comparison Result

NCAR Workload

| Methods  | RF    | XGBoost | FC    | Bi-LSTM | CNN   | DCORAL |
|----------|-------|---------|-------|---------|-------|--------|
| M2-->M3  | 0.467 | 0.054   | 0.389 | 0.357   | 0.412 | 0.875  |
| M3-->M2  | 0.396 | 0.295   | 0.591 | 0.610   | 0.211 | 0.365  |
| M2-->M4  | 0.258 | 0.041   | 0.105 | 0.129   | 0.295 | 0.398  |
| M4-->M2  | 0.372 | 0.326   | 0.498 | 0.456   | 0.349 | 0.383  |
| M3-->M4  | 0.849 | 0.377   | 0.772 | 0.882   | 0.129 | 0.698  |
| M4-->M3  | 0.747 | 0.277   | 0.835 | 0.873   | 0.505 | 0.735  |
| Average  | 0.515 | 0.228   | 0.532 | 0.551   | 0.317 | 0.576  |

Benchmark Workload

| Methods    | RF    | XGBoost | FC    | Bi-LSTM | CNN   | DCORAL |
|------------|-------|---------|-------|---------|-------|--------|
| CTC-->KTH  | 0.254 | 0.089   | 0.191 | 0.245   | 0.193 | 0.385  |
| KTH-->CTC  | 0.462 | 0.118   | 0.528 | 0.303   | 0.276 | 0.426  |
| CTC-->SDSC | 0.292 | 0.058   | 0.092 | 0.161   | 0.143 | 0.241  |
| SDSC-->CTC | 0.256 | 0.611   | 0.204 | 0.210   | 0.285 | 0.395  |
| KTH-->SDSC | 0.563 | 0.273   | 0.695 | 0.886   | 0.287 | 0.384  |
| SDSC-->KTH | 0.614 | 0.159   | 0.749 | 0.799   | 0.212 | 0.511  |
| Average    | 0.407 | 0.218   | 0.410 | 0.434   | 0.233 | 0.390  |

Acknowledgement

https://github.com/chenchao666/JDDA-Master (Tensorflow)
https://github.com/yunjey/pytorch-tutorial
