CUDA Operations for Liquid argon Detector is a contrived acronym
An experimental reimplementation of some Wire-Cell Toolkit algorithms using PyTorch.
Some expensive WCT algorithms are amenable to running on GPU. This package prototypes that offloading.
$ git clone https://github.com/brettviren/cold.git
$ python3 -m venv venv
$ source venv/bin/activate
$ cd cold
$ python setup.py develop
$ cold --help
It’s a backwards universe. GPU is slower than CPU, C++ is slower than Python (some caveats may apply).
CPU:
times: conv1:0.506732 conv2:2.654418 ls:12.345227 mg:4.246980 gauss:22.894458 sum:12.477100
CUDA:
times: conv1:0.002414 conv2:0.004040 ls:42.054190 mg:5.745139 gauss:61.036936 sum:42.312326
times: conv1:0.002346 conv2:0.004523 ls:43.305027 mg:6.116210 gauss:62.809590 patch:6.090205 sum:55.193184
Try a different strategy: splat the Gaussian depos onto a monolithic (10*nwires, nticks) 2D array.
CPU:
broadcast: 0.2544 gauss: 2.5069 init: 0.0208 ls: 0.9028 sum: 1.5572
CUDA:
broadcast: 0.3371 gauss: 6.5323 init: 0.0005 ls: 3.4365 sum: 5.5697
Except for allocating the initial array (“init”), the CPU wins. The GPU job takes 813 MB of GPU memory.
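For concreteness, here is a minimal PyTorch sketch of how such a broadcast splat might look. This is not the cold implementation, and the mapping of steps to the timing keys above is my reading of them.

import torch

def splat(params, nrows, nticks, device="cpu"):
    # params: (N,5) columns = amplitude, mean_r, mean_t, sigma_r, sigma_t
    a, mr, mt, sr, st = (params[:, i].view(-1, 1, 1).to(device)
                         for i in range(5))
    out = torch.zeros(nrows, nticks, device=device)          # "init"
    r = torch.linspace(0, nrows - 1, nrows, device=device)   # "ls"
    t = torch.linspace(0, nticks - 1, nticks, device=device)
    r = r.view(1, -1, 1)                                     # "broadcast"
    t = t.view(1, 1, -1)
    g = a * torch.exp(-0.5 * (((r - mr) / sr) ** 2 +
                              ((t - mt) / st) ** 2))         # "gauss"
    return out + g.sum(dim=0)                                # "sum"

params = torch.rand(50, 5) + 0.1           # keep sigmas away from zero
params[:, 1] *= 256; params[:, 2] *= 1024  # spread the means over the grid
print(splat(params, 256, 1024).shape)

Note the (N, nrows, nticks) intermediate that the broadcasting materializes; at full detector size that is likely where the 813 MB goes.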
Making things more monolithic, the CPU still wins:
CPU:
gauss: 3.0514 init: 0.0225 mult: 0.1654 patch: 0.4848 sum: 0.9196
CUDA:
gauss: 9.3494 init: 0.0005 mult: 0.5165 patch: 1.7573 sum: 4.5056
A simple test of calling exp() on arrays is in test_arrexp.{py,cxx}. GPU and CPU are exercised via PyCUDA and Numpy on 25M floats (kHz is kilo-points per second, time[us] is the duration of the operation, and the rss[M] measure is the system memory usage):
$ python test/test_arrexp.py
rss[M]: 161.517568
allocate[kHz]: 9964.058, time[us]: 2509.018
rss[M]: 361.791488
gpu speed[kHz]: 126436.674, time[us]: 197.727
rss[M]: 367.398912
cpu speed[kHz]: 68196.694, time[us]: 366.587
rss[M]: 467.828736
The allocation step includes random number generation.
A similar calculation of 25M floats in C++:
$ g++ -std=c++17 -O3 -o test_arrexp test/test_arrexp.cxx
$ ./test_arrexp
cpp speed[kHz]: 51962.5, time[us]: 481.116
The GPU is faster only if the array is large enough. Here are the same two calculations using 1M floats:
$ python test/test_arrexp.py
rss[M]: 161.611776
allocate[kHz]: 12996.285, time[us]: 76.945
rss[M]: 178.01216
gpu speed[kHz]: 6490.761, time[us]: 154.065
rss[M]: 181.071872
cpu speed[kHz]: 54354.301, time[us]: 18.398
rss[M]: 193.445888
$ ./test_arrexp
cpp speed[kHz]: 50519.1, time[us]: 19.7945
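For reference, here is a rough PyTorch analogue of this benchmark (an approximation only: the actual test uses PyCUDA and Numpy). The torch.cuda.synchronize() calls matter because CUDA kernels launch asynchronously; without them the GPU numbers are fiction.

import time
import torch

def bench(x, repeat=20):
    # synchronize before and after so GPU timings are honest
    if x.is_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repeat):
        torch.exp(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / repeat

for n in (10**6, 25 * 10**6):
    c = torch.rand(n)
    line = f"n={n:>9d} cpu={bench(c):.6f}s"
    if torch.cuda.is_available():
        line += f" cuda={bench(c.cuda()):.6f}s"
    print(line)

Sweeping n locates the crossover size below which kernel launch overhead makes the GPU lose.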
One benefit of PyTorch is the ability to mix and match what runs on the GPU and what runs on the CPU.
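A generic sketch of that mixing (not cold code): tensors move between devices with .to(), so cheap bookkeeping can stay on the CPU while the heavy array math runs wherever a device handle points.

import torch

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

depos = torch.rand(1000, 5)         # small per-depo table, stays on CPU
sel = depos[depos[:, 0] > 0.5]      # selection logic is cheap on CPU

amp = sel[:, 0].to(dev)             # ship only what the heavy step needs
grid = torch.zeros(1280, 6272, device=dev)
grid += amp.sum()                   # stand-in for the heavy array math
print(grid.sum().item())            # reductions come back to the host

Below, the same cold test job runs twice on real data, first all-CPU (-d cpu) and then hybrid (-d cuda):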
$ cold test --work-shape 1170 6200 -d cpu data/protodune-wires-larsoft-v1.txt data/pdsp.npz data/0-truthDepo.json
stest: 0.000 (+0.000247): warm up device "cpu", reset time
work shape: (1280, 6272)
stest: 1.253 (+1.253088): load 'geometry'
stest: 1.263 (+0.009805): load response
stest: 1.814 (+0.551084): make nodes
stest: 1.838 (+0.023723): load points
stest: 1.839 (+0.001555): drifted and pitched
stest: 1.841 (+0.002216): binned
gauss: 2.8250 init: 0.0250 mult: 0.1646 patch: 0.4480 sum: 0.8301
stest: 6.135 (+4.293266): splatted
stest: 7.863 (+1.728735): done

$ cold test --work-shape 1170 6200 -d cuda data/protodune-wires-larsoft-v1.txt data/pdsp.npz data/0-truthDepo.json
stest: 4.667 (+4.666625): warm up device "cuda", reset time
work shape: (1280, 6272)
stest: 1.379 (+1.378521): load 'geometry'
stest: 1.388 (+0.009875): load response
stest: 1.435 (+0.046152): make nodes
stest: 1.457 (+0.022485): load points
stest: 1.459 (+0.001473): drifted and pitched
stest: 1.461 (+0.002437): binned
gauss: 3.5291 init: 0.0213 mult: 0.2134 patch: 0.5509 sum: 0.9996
stest: 6.775 (+5.314530): splatted
stest: 6.932 (+0.156778): done
Notes:
- the second run (-d cuda) is the hybrid
- the time keeper resets after the warm up in each run (the “reset time” message)
- the one-time FFT of the response function is done as part of “make nodes”, which is why the hybrid sees some early speed increase
If we discount the one-time start up (counting from “load points”) then the CPU run takes 6.049s and the hybrid takes 5.497s, barely a 10% improvement….
The problem with the Gaussian raster is the shape of the data. The 2D Gaussians are represented by an (N,5) array (amplitude, 2 means, 2 sigmas), and each gets expanded to a differently shaped 2D raster patch (eg, limited by 3 sigma). Each of these raster patches must be added to a large, common output 2D array of shape (nimper*nwires, nticks), and that sum is not thread safe. An algorithm following that description is not very GPU’able.
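A sketch of that structure (hypothetical names, not the cold API) makes the difficulty concrete: the patch shapes are data dependent, and the final += into the shared array is a scatter-add.

import torch

def raster(params, out, nsigma=3):
    # params: (N,5) = amplitude, mean_r, mean_t, sigma_r, sigma_t
    for a, mr, mt, sr, st in params.tolist():
        # per-Gaussian patch bounds, truncated at nsigma and clipped
        r0 = max(int(mr - nsigma * sr), 0)
        r1 = min(int(mr + nsigma * sr) + 1, out.shape[0])
        t0 = max(int(mt - nsigma * st), 0)
        t1 = min(int(mt + nsigma * st) + 1, out.shape[1])
        r = torch.arange(r0, r1, dtype=out.dtype).view(-1, 1)
        t = torch.arange(t0, t1, dtype=out.dtype).view(1, -1)
        patch = a * torch.exp(-0.5 * (((r - mr) / sr) ** 2 +
                                      ((t - mt) / st) ** 2))
        # the troublesome step: a data-dependent scatter-add into the
        # shared output; naive parallel versions race on overlapping patches
        out[r0:r1, t0:t1] += patch

out = torch.zeros(1280, 6272)
params = torch.tensor([[1.0, 100.0, 200.0, 3.0, 5.0],
                       [2.0, 110.0, 220.0, 4.0, 6.0]])
raster(params, out)
print(out.sum())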
Turning the problem around and calculating the output array one pixel at a time, where each pixel is a loop over all (N,5) Gaussians, would be very parallelizable. There are two difficulties: RAM usage, and requiring more detailed CUDA programming than I’ve done so far.
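A minimal sketch of that turned-around formulation (again not cold code): broadcasting every pixel against all N Gaussians materializes an (N, nrows, nticks) intermediate, which is the RAM difficulty; chunking the rows bounds it at the cost of a Python loop.

import torch

def raster_by_pixel(params, nrows, nticks, chunk=16, device="cpu"):
    a, mr, mt, sr, st = (params[:, i].view(-1, 1, 1).to(device)
                         for i in range(5))
    t = torch.arange(nticks, dtype=torch.float32,
                     device=device).view(1, 1, -1)
    out = torch.empty(nrows, nticks, device=device)
    for r0 in range(0, nrows, chunk):
        r = torch.arange(r0, min(r0 + chunk, nrows), dtype=torch.float32,
                         device=device).view(1, -1, 1)
        g = a * torch.exp(-0.5 * (((r - mr) / sr) ** 2 +
                                  ((t - mt) / st) ** 2))
        out[r0:r0 + chunk] = g.sum(dim=0)  # each pixel sums all N Gaussians
    return out

params = torch.rand(100, 5) + 0.1   # keep sigmas away from zero
print(raster_by_pixel(params, 1280, 6272).shape)

No two threads ever write the same output pixel here, so there is no race; a hand-written CUDA kernel could do the same per-pixel reduction without materializing the intermediate at all.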