CUDA Operations for Liquid argon Detector is a contrived acronym
An experimental reimplementation of some Wire-Cell Toolkit algorithms using PyTorch.
Some expensive WCT algorithms are amenable to running on GPU. This package prototypes that offloading.
$ git clone https://github.com/brettviren/cold.git
$ python3 -m venv venv
$ source venv/bin/activate
$ cd cold
$ python setup.py develop
$ cold --help
It’s a backwards universe. GPU is slower than CPU, C++ is slower than Python (some caveats may apply).
CPU:
times: conv1:0.506732 conv2:2.654418 ls:12.345227 mg:4.246980 gauss:22.894458 sum:12.477100
CUDA:
times: conv1:0.002414 conv2:0.004040 ls:42.054190 mg:5.745139 gauss:61.036936 sum:42.312326
times: conv1:0.002346 conv2:0.004523 ls:43.305027 mg:6.116210 gauss:62.809590 patch:6.090205 sum:55.193184
Try a different strategy: splat the Gaussian depos onto a monolithic (10*nwires, nticks) 2D array.
CPU:
broadcast: 0.2544 gauss: 2.5069 init: 0.0208 ls: 0.9028 sum: 1.5572
CUDA:
broadcast: 0.3371 gauss: 6.5323 init: 0.0005 ls: 3.4365 sum: 5.5697
Except for allocating the initial array (“init”), the CPU wins. The GPU job takes 813 MB of GPU memory.
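For concreteness, here is a minimal PyTorch sketch of how such a broadcast splat might look. This is not the cold implementation, and the mapping of steps to the timing keys above is my reading of them.

import torch

def splat(params, nrows, nticks, device="cpu"):
    # params: (N,5) columns = amplitude, mean_r, mean_t, sigma_r, sigma_t
    a, mr, mt, sr, st = (params[:, i].view(-1, 1, 1).to(device)
                         for i in range(5))
    out = torch.zeros(nrows, nticks, device=device)          # "init"
    r = torch.linspace(0, nrows - 1, nrows, device=device)   # "ls"
    t = torch.linspace(0, nticks - 1, nticks, device=device)
    r = r.view(1, -1, 1)                                     # "broadcast"
    t = t.view(1, 1, -1)
    g = a * torch.exp(-0.5 * (((r - mr) / sr) ** 2 +
                              ((t - mt) / st) ** 2))         # "gauss"
    return out + g.sum(dim=0)                                # "sum"

params = torch.rand(50, 5) + 0.1           # keep sigmas away from zero
params[:, 1] *= 256; params[:, 2] *= 1024  # spread the means over the grid
print(splat(params, 256, 1024).shape)

Note the (N, nrows, nticks) intermediate that the broadcasting materializes; at full detector size that is likely where the 813 MB goes.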
Making things more monolithic, the CPU still wins:
CPU:
gauss: 3.0514 init: 0.0225 mult: 0.1654 patch: 0.4848 sum: 0.9196
CUDA:
gauss: 9.3494 init: 0.0005 mult: 0.5165 patch: 1.7573 sum: 4.5056
A simple test of calling exp() on arrays is in test_arrexp.{py,cxx}. GPU and CPU are exercised via PyCUDA and Numpy on 25M floats (kHz is kilo-points per second, time[us] is the duration of the operation, and the rss[M] measure is the system memory usage):
$ python test/test_arrexp.py
rss[M]: 161.517568
allocate[kHz]: 9964.058, time[us]: 2509.018
rss[M]: 361.791488
gpu speed[kHz]: 126436.674, time[us]: 197.727
rss[M]: 367.398912
cpu speed[kHz]: 68196.694, time[us]: 366.587
rss[M]: 467.828736
The allocation step includes random number generation.
A similar calculation of 25M floats in C++:
$ g++ -std=c++17 -O3 -o test_arrexp test/test_arrexp.cxx
$ ./test_arrexp
cpp speed[kHz]: 51962.5, time[us]: 481.116
The GPU is faster only if the array is large enough. Here are the same two calculations using 1M floats:
$ python test/test_arrexp.py
rss[M]: 161.611776
allocate[kHz]: 12996.285, time[us]: 76.945
rss[M]: 178.01216
gpu speed[kHz]: 6490.761, time[us]: 154.065
rss[M]: 181.071872
cpu speed[kHz]: 54354.301, time[us]: 18.398
rss[M]: 193.445888
$ ./test_arrexp
cpp speed[kHz]: 50519.1, time[us]: 19.7945
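For reference, here is a rough PyTorch analogue of this benchmark (an approximation only: the actual test uses PyCUDA and Numpy). The torch.cuda.synchronize() calls matter because CUDA kernels launch asynchronously; without them the GPU numbers are fiction.

import time
import torch

def bench(x, repeat=20):
    # synchronize before and after so GPU timings are honest
    if x.is_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repeat):
        torch.exp(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / repeat

for n in (10**6, 25 * 10**6):
    c = torch.rand(n)
    line = f"n={n:>9d} cpu={bench(c):.6f}s"
    if torch.cuda.is_available():
        line += f" cuda={bench(c.cuda()):.6f}s"
    print(line)

Sweeping n locates the crossover size below which kernel launch overhead makes the GPU lose.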
One benefit of PyTorch is the ability to mix and match what runs on the GPU and what runs on the CPU.
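A generic sketch of that mixing (not cold code): tensors move between devices with .to(), so cheap bookkeeping can stay on the CPU while the heavy array math runs wherever a device handle points.

import torch

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

depos = torch.rand(1000, 5)         # small per-depo table, stays on CPU
sel = depos[depos[:, 0] > 0.5]      # selection logic is cheap on CPU

amp = sel[:, 0].to(dev)             # ship only what the heavy step needs
grid = torch.zeros(1280, 6272, device=dev)
grid += amp.sum()                   # stand-in for the heavy array math
print(grid.sum().item())            # reductions come back to the host

Below, the same cold test job runs twice on real data, first all-CPU (-d cpu) and then hybrid (-d cuda):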
$ cold test --work-shape 1170 6200 -d cpu data/protodune-wires-larsoft-v1.txt data/pdsp.npz data/0-truthDepo.json
stest: 0.000 (+0.000247): warm up device "cpu", reset time
work shape: (1280, 6272)
stest: 1.253 (+1.253088): load 'geometry'
stest: 1.263 (+0.009805): load response
stest: 1.814 (+0.551084): make nodes
stest: 1.838 (+0.023723): load points
stest: 1.839 (+0.001555): drifted and pitched
stest: 1.841 (+0.002216): binned
gauss: 2.8250 init: 0.0250 mult: 0.1646 patch: 0.4480 sum: 0.8301
stest: 6.135 (+4.293266): splatted
stest: 7.863 (+1.728735): done

$ cold test --work-shape 1170 6200 -d cuda data/protodune-wires-larsoft-v1.txt data/pdsp.npz data/0-truthDepo.json
stest: 4.667 (+4.666625): warm up device "cuda", reset time
work shape: (1280, 6272)
stest: 1.379 (+1.378521): load 'geometry'
stest: 1.388 (+0.009875): load response
stest: 1.435 (+0.046152): make nodes
stest: 1.457 (+0.022485): load points
stest: 1.459 (+0.001473): drifted and pitched
stest: 1.461 (+0.002437): binned
gauss: 3.5291 init: 0.0213 mult: 0.2134 patch: 0.5509 sum: 0.9996
stest: 6.775 (+5.314530): splatted
stest: 6.932 (+0.156778): done
Notes:
- the second run (-d cuda) is the hybrid
- the time keeper resets after the warm up in each run (the “reset time” message)
- the one-time FFT of the response function is done as part of “make nodes”, which is why the hybrid sees some early speed increase
If we discount the one-time start up (counting from “load points”) then the CPU run takes 6.049s and the hybrid takes 5.497s, barely a 10% improvement….
The problem with the Gaussian raster is the shape of the data. The 2D Gaussians are represented by an (N,5) array (amplitude, 2 means, 2 sigmas), and each gets expanded to a differently shaped 2D raster patch (eg, limited by 3 sigma). Each of these raster patches must be added to a large, common output 2D array of shape (nimper*nwires, nticks), and that sum is not thread safe. An algorithm following that description is not very GPU’able.
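A sketch of that structure (hypothetical names, not the cold API) makes the difficulty concrete: the patch shapes are data dependent, and the final += into the shared array is a scatter-add.

import torch

def raster(params, out, nsigma=3):
    # params: (N,5) = amplitude, mean_r, mean_t, sigma_r, sigma_t
    for a, mr, mt, sr, st in params.tolist():
        # per-Gaussian patch bounds, truncated at nsigma and clipped
        r0 = max(int(mr - nsigma * sr), 0)
        r1 = min(int(mr + nsigma * sr) + 1, out.shape[0])
        t0 = max(int(mt - nsigma * st), 0)
        t1 = min(int(mt + nsigma * st) + 1, out.shape[1])
        r = torch.arange(r0, r1, dtype=out.dtype).view(-1, 1)
        t = torch.arange(t0, t1, dtype=out.dtype).view(1, -1)
        patch = a * torch.exp(-0.5 * (((r - mr) / sr) ** 2 +
                                      ((t - mt) / st) ** 2))
        # the troublesome step: a data-dependent scatter-add into the
        # shared output; naive parallel versions race on overlapping patches
        out[r0:r1, t0:t1] += patch

out = torch.zeros(1280, 6272)
params = torch.tensor([[1.0, 100.0, 200.0, 3.0, 5.0],
                       [2.0, 110.0, 220.0, 4.0, 6.0]])
raster(params, out)
print(out.sum())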
Turning the problem around and calculating the output array one pixel at a time, where each pixel is a loop over all (N,5) Gaussians, would be very parallelizable. There are two difficulties: RAM usage, and requiring more detailed CUDA programming than I’ve done so far.
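A minimal sketch of that turned-around formulation (again not cold code): broadcasting every pixel against all N Gaussians materializes an (N, nrows, nticks) intermediate, which is the RAM difficulty; chunking the rows bounds it at the cost of a Python loop.

import torch

def raster_by_pixel(params, nrows, nticks, chunk=16, device="cpu"):
    a, mr, mt, sr, st = (params[:, i].view(-1, 1, 1).to(device)
                         for i in range(5))
    t = torch.arange(nticks, dtype=torch.float32,
                     device=device).view(1, 1, -1)
    out = torch.empty(nrows, nticks, device=device)
    for r0 in range(0, nrows, chunk):
        r = torch.arange(r0, min(r0 + chunk, nrows), dtype=torch.float32,
                         device=device).view(1, -1, 1)
        g = a * torch.exp(-0.5 * (((r - mr) / sr) ** 2 +
                                  ((t - mt) / st) ** 2))
        out[r0:r0 + chunk] = g.sum(dim=0)  # each pixel sums all N Gaussians
    return out

params = torch.rand(100, 5) + 0.1   # keep sigmas away from zero
print(raster_by_pixel(params, 1280, 6272).shape)

No two threads ever write the same output pixel here, so there is no race; a hand-written CUDA kernel could do the same per-pixel reduction without materializing the intermediate at all.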