This repo contains experiments in autotuning simple convolution implementations using OpenTuner. It is a work in progress (or abandoned, if you are reading this in the far future), but it can already demonstrate speedups over PyTorch on more than 40 of the 50 slowest convolutions found in TorchBench for CPU inference. The generated code is simple C++ with no dependency on a larger runtime system, which makes it easy to understand and usable for driving speedups in other systems.
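For intuition, the kind of kernel being generated and tuned is a direct convolution loop nest. Here is a minimal pure-Python sketch of that loop nest (an illustration only, not the repo's generated C++; the function name and layout are hypothetical):

```python
# Naive direct convolution over nested lists. The repo emits the C++
# equivalent of this loop nest and autotunes its structure (blocking,
# unrolling, etc.); this sketch just shows the computation itself.
def conv2d(image, weight):
    """image: [C_in][H][W], weight: [C_out][C_in][KH][KW]
    -> output: [C_out][H-KH+1][W-KW+1] (no padding, stride 1)."""
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    c_out, kh, kw = len(weight), len(weight[0][0]), len(weight[0][0][0])
    out = [[[0.0] * (w - kw + 1) for _ in range(h - kh + 1)]
           for _ in range(c_out)]
    for co in range(c_out):
        for ci in range(c_in):
            for y in range(h - kh + 1):
                for x in range(w - kw + 1):
                    for dy in range(kh):
                        for dx in range(kw):
                            out[co][y][x] += (image[ci][y + dy][x + dx]
                                              * weight[co][ci][dy][dx])
    return out

# A 3x3 all-ones kernel over a 1-channel 4x4 image of ones sums 9 elements.
result = conv2d([[[1.0] * 4 for _ in range(4)]],
                [[[[1.0] * 3 for _ in range(3)]]])
print(result)  # [[[9.0, 9.0], [9.0, 9.0]]]
```

The six nested loops have many legal orderings and tilings with very different performance on real hardware, which is the search space an autotuner explores.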
Results from my local Intel Core i7-8086K CPU are here.
Install miniconda if not already installed. Python 3.7+ is required.
- Create a conda environment:

```
conda create --name convtuner
conda activate convtuner
```

- Install PyTorch nightly (or replace this step with PyTorch built from source):

```
conda install pytorch cpuonly -c pytorch-nightly
```

- Install opentuner/sympy/pandas:

```
pip install opentuner sympy pandas
```
Usage can be found with `--help`:
```
usage: main.py [-h] [--verbose] [--times TIMES] [--repeat REPEAT] [--case CASE] [--autotune] [--dummy] [--limit LIMIT]
               [--test-limit TEST_LIMIT] [--testcases-filename TESTCASES_FILENAME]

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v
  --times TIMES
  --repeat REPEAT
  --case CASE
  --autotune
  --dummy
  --limit LIMIT, -l LIMIT
  --test-limit TEST_LIMIT
  --testcases-filename TESTCASES_FILENAME
```
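The help text above maps onto a standard `argparse` setup. The following is a rough reconstruction from the usage string alone, not the repo's actual code; the defaults and the comments on each flag are assumptions:

```python
import argparse

# Reconstructed from the --help output; defaults and semantics are guesses.
parser = argparse.ArgumentParser(prog="main.py")
parser.add_argument("--verbose", "-v", action="store_true")
parser.add_argument("--times", type=int, default=3)     # timing iterations (assumed)
parser.add_argument("--repeat", type=int, default=10)   # measurement repeats (assumed)
parser.add_argument("--case", type=int)                 # run a single testcase by index
parser.add_argument("--autotune", action="store_true")  # regenerate configs via OpenTuner
parser.add_argument("--dummy", action="store_true")
parser.add_argument("--limit", "-l", type=int)          # how many of the slowest shapes to run
parser.add_argument("--test-limit", type=int)
parser.add_argument("--testcases-filename", default="testcases.csv")

args = parser.parse_args(["--limit=10", "--autotune"])
print(args.limit, args.autotune)  # 10 True
```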
Some examples:

```
./main.py --limit=10
```
- runs the slowest 10 shapes from `testcases.csv` using pre-tuned configs found in the `configs` folder

```
./main.py --limit=10 --autotune
```
- same as above, but regenerates the configs using autotuning

```
./main.py --case=3 --autotune
```
- runs just the third testcase

```
./main.py --case=3 --verbose
```
- runs just the third testcase, and prints out the generated source code
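A `--times`/`--repeat` pair is the usual shape of a benchmarking loop: run the kernel several times per measurement, take several measurements, and keep the minimum to suppress noise. This sketch uses the stdlib `timeit` as a stand-in for the repo's measurement code; the flag semantics shown are assumptions, not taken from `main.py`:

```python
import timeit

def benchmark(fn, times=3, repeat=10):
    """Best-of-`repeat` average over `times` calls of fn, in seconds.
    (Assumed interpretation of --times/--repeat, for illustration only.)"""
    return min(timeit.repeat(fn, number=times, repeat=repeat)) / times

# Stand-ins for the baseline (PyTorch) and the tuned generated kernel.
baseline = lambda: sum(x * x for x in range(1000))
candidate = lambda: sum(x * x for x in range(1000))

speedup = benchmark(baseline) / benchmark(candidate)
print(f"speedup: {speedup:.2f}x")
```

Taking the minimum rather than the mean is a common choice for CPU microbenchmarks, since interference from other processes only ever adds time.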