
Accurate and Interpretable Learning of Linux Kernel Configuration Sizes

This is the companion repository of submission #40 at SPLC'22.

Further details are available in the related publication; see the PDF file in the main directory of this repository.

In a nutshell

Our research work proposes to train interpretable performance models that scale to software systems with thousands of options. We evaluate our approach on the Linux kernel. By analyzing these models, it becomes possible to know the effect of each configuration option, and thus to decide which ones should be (de)activated in order to minimize a performance property. Thanks to this work, we can answer the question: "Which options should I enable to minimize the footprint of a Linux kernel?"

Artifacts

To evaluate these artifacts, two aspects should be tested:

  1. The gathering of a dataset of measurements on the Linux kernel. The goal is to ensure that a-] this dataset is available and b-] one can easily reproduce such a dataset thanks to our infrastructure.
  2. The training of performance models on this dataset of measurements.

Prerequisites

Install Docker. You can check that Docker is working by checking its version (use sudo docker --version) or its status (use sudo systemctl status docker).

1. Dataset

a-] Availability

Our dataset of Linux kernel measurements can be downloaded from Zenodo, where we store it online.

Due to the large size of the dataset (1.8 GB), no preview is available.

Each row of this dataset is composed of the configuration options (1 = activated, 0 = deactivated) used to compile the Linux kernel and the resulting size, or footprint, of the kernel.
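As an illustration, here is a minimal sketch of how to load and inspect the dataset with pandas, assuming it has been downloaded as a CSV file (the file name below is a placeholder):

import pandas as pd

# Placeholder file name; use the actual file downloaded from Zenodo.
df = pd.read_csv("linux_size_dataset.csv")

# Each row holds the option values (0/1) plus the measured kernel size.
print(df.shape)   # (number of configurations, number of options + size column)
print(df.head())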

b-] Reproducibility

We also maintain an infrastructure, namely TuxML, allowing users to contribute to this initiative.

In this part, we invite you to add a new entry to the existing dataset.

You will compile a Linux kernel with randomly chosen configuration options, compute the size of the resulting kernel, and send the result to our database.

Please make sure Python and Docker are installed before executing the following commands:

  1. Download the code:

git clone https://github.com/TuxML/tuxml.git

  2. Enter the folder:

cd tuxml

  3. Run the script that launches the compilation of the kernel:

python3 kernel_generator.py

It might take a few seconds to a few minutes to complete.

If everything is working, you will see output similar to the screenshot below:

[screenshot: screen1 -- expected output of kernel_generator.py]

Thank you for contributing to TuxML!

If you are interested, additional information is available at: https://github.com/TuxML/tuxml/wiki/User_documentation

2. Models

Finally, you will have to train a model on this dataset.

In our paper, we tested and compared multiple learning techniques.

We provide a Docker container that allows you to launch and test these techniques.

First, pull the container image with the following command:

sudo docker pull anonymicse2021/splc22

Then, run the container in interactive mode:

sudo docker run -ti anonymicse2021/splc22

Once in the container, run the python script train.py:

python3 train.py

Here are some additional arguments to customize your run:

  • --verbose y shows detailed logs; activated by default.
  • --training_size indicates the proportion of the dataset used for training (between 0 and 1). Default: 0.1.
  • --ml_technique selects the machine learning technique used to compute the result: --ml_technique lr for linear regression, --ml_technique dt for decision tree, --ml_technique rf for random forest, --ml_technique gb for gradient boosting. Default: rf.
  • --feature_selection y applies a feature selection process before training the model; --feature_selection n disables it. Default: y.
  • --metric selects the metric computed on the test set. Default: --metric MAPE. Alternative: --metric MAE.

For instance, the following command will launch a gradient boosting tree, without feature selection, using 90% of the configurations for training:

python3 train.py --ml_technique gb --feature_selection n --training_size 0.9
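For reference, here is a minimal, hypothetical sketch of what such a training pipeline can look like with scikit-learn; the actual train.py in the container may differ, and the dataset path and the size column name are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical mapping of the --ml_technique values to scikit-learn regressors.
MODELS = {
    "lr": LinearRegression(),
    "dt": DecisionTreeRegressor(),
    "rf": RandomForestRegressor(),
    "gb": GradientBoostingRegressor(),
}

df = pd.read_csv("linux_size_dataset.csv")      # placeholder dataset path
X, y = df.drop(columns=["size"]), df["size"]    # "size" is an assumed column name

# --training_size 0.9 corresponds to train_size=0.9 here.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9)

model = MODELS["gb"]                            # --ml_technique gb
model.fit(X_train, y_train)
print("MAPE:", mean_absolute_percentage_error(y_test, model.predict(X_test)))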

If everything worked, you should observe output similar to the screenshot below:

[screenshot: screen2 -- expected output of train.py]

You can now exit the container and remove the image (to free disk space on your machine) by running the following command:

sudo docker image rm anonymicse2021/splc22

Thank you for testing our artifact!

Others

How to use analyse_kconfig_help_msg.py

First, install Kconfiglib: pip3 install kconfiglib

Then patch the kernel Makefile:

  1. Clone Kconfiglib: git clone https://github.com/ulfalizer/Kconfiglib.git
  2. Download a Linux kernel, in our case: https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.13.3.tar.xz
  3. In the kernel top directory (cd linux-4.13.3), run patch -p1 < ../Kconfiglib/makefile.patch (this modifies the Makefile of the Linux kernel to support commands such as scriptconfig; see below).

Finally, still in the kernel directory linux-4.13.3, you can run the script:

make ARCH=x86 scriptconfig SCRIPT=../analyse_kconfig_help_msg.py
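For context, scriptconfig passes the path of the top-level Kconfig file to the script as its first argument. Here is a minimal sketch of a Kconfiglib script in the same spirit (the actual analyse_kconfig_help_msg.py may differ):

import sys
import kconfiglib

# scriptconfig invokes the script with the top-level Kconfig file as first argument.
kconf = kconfiglib.Kconfig(sys.argv[1])

# Iterate over all configuration symbols and print the first line of each help message.
for name, sym in kconf.syms.items():
    for node in sym.nodes:
        if node.help:
            print(name, "->", node.help.splitlines()[0])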

Alternative to Zenodo

import tuxml
df = tuxml.load_dataset()

An example is given in size-analysis-fast.ipynb. Note: the dataset is loaded from ../tuxml-size-analysis-datasets/all_size_withyes.pkl, so be careful about relative paths and the locations of your git repositories.
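If you prefer to bypass the tuxml helper, the pickled dataset can also be loaded directly with pandas (adjust the relative path to your own checkout layout):

import pandas as pd

# Load the pickled dataset used by size-analysis-fast.ipynb.
df = pd.read_pickle("../tuxml-size-analysis-datasets/all_size_withyes.pkl")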

Another Docker image

docker build -f docker/Dockerfile -t sklearntux . (it can take a while), or simply docker pull macher/sklearntux

docker run -it --rm macher/sklearntux python3 size-analysis-fast.py should work

Notes:

  • an all_size_withyes.pkl file is pre-copied into the image (it is a pickled version of the dataset) -- it can be a CSV file as well
  • plotting facilities are installed (matplotlib, seaborn, etc.), which partly explains the increased size of the Docker image
