COLD

CUDA Operations for Liquid argon Detector is a contrived acronym

What

An experimental reimplementation of some Wire-Cell Toolkit algorithms using PyTorch.

Why

Some expensive WCT algorithms are amenable to running on GPU. This package prototypes that offloading.

How

$ git clone https://github.com/brettviren/cold.git
$ python3 -m venv venv
$ source venv/bin/activate

$ cd cold
$ python setup.py develop
$ cold --help

Performance

It’s a backwards universe: the GPU is slower than the CPU, and C++ is slower than Python (some caveats may apply).

GPU/CPU

CPU:

times: conv1:0.506732 conv2:2.654418 ls:12.345227 mg:4.246980 gauss:22.894458 sum:12.477100

CUDA:

times: conv1:0.002414 conv2:0.004040 ls:42.054190 mg:5.745139 gauss:61.036936 sum:42.312326
times: conv1:0.002346 conv2:0.004523 ls:43.305027 mg:6.116210 gauss:62.809590 patch:6.090205 sum:55.193184

Try a different strategy of splatting the Gaussian depos onto a monolithic 10*nwires*nticks 2D array.
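The splatting strategy can be sketched roughly like this, a simplified illustration only; the sizes and the depo array layout (amplitude, two means, two sigmas) are assumptions, not the package's actual API:

```python
import torch

# Hypothetical sizes for illustration.
nwires, nticks, ndepos = 100, 200, 50

# Each depo row: (amplitude, mean_wire, mean_tick, sigma_wire, sigma_tick)
depos = torch.rand(ndepos, 5)
depos[:, 1] *= nwires    # wire means spread over the wire axis
depos[:, 2] *= nticks    # tick means spread over the tick axis
depos[:, 3:] += 1.0      # keep sigmas away from zero

# Coordinate grids shaped to broadcast against the depo axis.
w = torch.arange(nwires, dtype=torch.float32).view(1, nwires, 1)
t = torch.arange(nticks, dtype=torch.float32).view(1, 1, nticks)

a  = depos[:, 0].view(-1, 1, 1)
mw = depos[:, 1].view(-1, 1, 1)
mt = depos[:, 2].view(-1, 1, 1)
sw = depos[:, 3].view(-1, 1, 1)
st = depos[:, 4].view(-1, 1, 1)

# Splat every Gaussian onto the full monolithic array, then sum the
# depo axis to get the common (nwires, nticks) output.
gauss = a * torch.exp(-0.5 * (((w - mw) / sw) ** 2 + ((t - mt) / st) ** 2))
out = gauss.sum(dim=0)
print(out.shape)  # torch.Size([100, 200])
```

The broadcast produces an (ndepos, nwires, nticks) intermediate, which is what makes this strategy monolithic and memory hungry.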

CPU:

broadcast: 0.2544
    gauss: 2.5069
     init: 0.0208
       ls: 0.9028
      sum: 1.5572

CUDA:

broadcast: 0.3371
    gauss: 6.5323
     init: 0.0005
       ls: 3.4365
      sum: 5.5697

Except for allocating the initial array (“init”), the CPU wins. The GPU job takes 813 MB of GPU memory.

Making things more monolithic, the CPU still wins.

CPU:

gauss: 3.0514
 init: 0.0225
 mult: 0.1654
patch: 0.4848
  sum: 0.9196

CUDA:

gauss: 9.3494
 init: 0.0005
 mult: 0.5165
patch: 1.7573
  sum: 4.5056

GPU/CPU and Python/C++

A simple test of calling exp() on arrays with test_arrexp.{py,cxx}.

GPU and CPU in PyCUDA and NumPy on 25M floats (kHz is kilo-points per second, time[us] is the operation time in microseconds, and rss[M] is the system memory usage in MB):

$ python test/test_arrexp.py
rss[M]: 161.517568
allocate[kHz]: 9964.058, time[us]: 2509.018
rss[M]: 361.791488
gpu speed[kHz]: 126436.674, time[us]: 197.727
rss[M]: 367.398912
cpu speed[kHz]: 68196.694, time[us]: 366.587
rss[M]: 467.828736

The allocation step includes random number generation.
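The CPU side of such a measurement can be sketched with NumPy; this is a simplified stand-in for test/test_arrexp.py, which also exercises PyCUDA for the GPU numbers:

```python
import time
import numpy as np

n = 25_000_000  # 25M floats, as in the test above

# Allocation, including the random number generation noted above.
t0 = time.perf_counter()
arr = np.random.random(n).astype(np.float32)
t1 = time.perf_counter()
print(f"allocate[kHz]: {n / (t1 - t0) / 1000:.3f}, "
      f"time[us]: {(t1 - t0) * 1e6:.3f}")

# The operation under test: elementwise exp() on the whole array.
t0 = time.perf_counter()
out = np.exp(arr)
t1 = time.perf_counter()
print(f"cpu speed[kHz]: {n / (t1 - t0) / 1000:.3f}, "
      f"time[us]: {(t1 - t0) * 1e6:.3f}")
```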

A similar calculation of 25M floats in C++:

$ g++ -std=c++17 -O3 -o test_arrexp test/test_arrexp.cxx
$ ./test_arrexp 
cpp speed[kHz]: 51962.5, time[us]: 481.116

The GPU is faster only if the array is large enough. Here are the same two calculations using 1M floats.

$ python test/test_arrexp.py
rss[M]: 161.611776
allocate[kHz]: 12996.285, time[us]: 76.945
rss[M]: 178.01216
gpu speed[kHz]: 6490.761, time[us]: 154.065
rss[M]: 181.071872
cpu speed[kHz]: 54354.301, time[us]: 18.398
rss[M]: 193.445888
$ ./test_arrexp 
cpp speed[kHz]: 50519.1, time[us]: 19.7945

GPU/CPU hybrid

One benefit of PyTorch is the ability to mix and match which work runs on the GPU and which on the CPU.
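In PyTorch this mixing is mostly a matter of which device each tensor lives on; a minimal sketch (not the package's actual code; the FFT round trip merely stands in for the heavy work):

```python
import torch

# Pick the GPU when present, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Cheap bookkeeping can stay on the CPU...
depos = torch.rand(1000, 5)

# ...while the large work array lives on the chosen device.
work = torch.rand(1280, 6272, device=device)

# A forward/inverse FFT pair as a stand-in for the expensive step.
spec = torch.fft.rfft2(work)
result = torch.fft.irfft2(spec, s=work.shape)

# Bring the answer back to the CPU for output.
result = result.cpu()
print(result.shape, result.device)
```

Moving a tensor between devices is a single .to(device) or .cpu() call, so the split between CPU and GPU work can be adjusted without restructuring the algorithm.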

$ cold test --work-shape 1170 6200 -d cpu data/protodune-wires-larsoft-v1.txt data/pdsp.npz data/0-truthDepo.json 
stest:      0.000 (+0.000247): warm up device "cpu", reset time
work shape:  (1280, 6272)
stest:      1.253 (+1.253088): load 'geometry'
stest:      1.263 (+0.009805): load response
stest:      1.814 (+0.551084): make nodes
stest:      1.838 (+0.023723): load points
stest:      1.839 (+0.001555): drifted and pitched
stest:      1.841 (+0.002216): binned
               gauss: 2.8250
                init: 0.0250
                mult: 0.1646
               patch: 0.4480
                 sum: 0.8301
stest:      6.135 (+4.293266): splatted
stest:      7.863 (+1.728735): done

$ cold test --work-shape 1170 6200 -d cuda data/protodune-wires-larsoft-v1.txt data/pdsp.npz data/0-truthDepo.json 
stest:      4.667 (+4.666625): warm up device "cuda", reset time
work shape:  (1280, 6272)
stest:      1.379 (+1.378521): load 'geometry'
stest:      1.388 (+0.009875): load response
stest:      1.435 (+0.046152): make nodes
stest:      1.457 (+0.022485): load points
stest:      1.459 (+0.001473): drifted and pitched
stest:      1.461 (+0.002437): binned
               gauss: 3.5291
                init: 0.0213
                mult: 0.2134
               patch: 0.5509
                 sum: 0.9996
stest:      6.775 (+5.314530): splatted
stest:      6.932 (+0.156778): done

Notes:

  • the second run is the hybrid (CUDA) one
  • the time keeper resets after the “warm up” step
  • the one-time FFT of the response function is done as part of “make nodes”, which is why the hybrid sees some early speed increase

If we discount the one-time start up (counting from “load points”) then the CPU takes 6.049 s and the hybrid takes 5.497 s: barely a 10% improvement.

New approach

The problem with the Gaussian raster is the shape of the data. The 2D Gaussians are represented by an (N,5) array (amplitude, 2 means, 2 sigmas). Each gets expanded to a differently shaped 2D raster patch (e.g. limited to 3 sigma), and each of these patches must be added to a large, common output 2D array of shape (nimper*nwires, nticks); that sum is not thread safe. An algorithm following this description is not very GPU-able.
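In outline, the patch-based algorithm looks like this (a schematic with made-up sizes and values; the serial += into the shared output array is the step that is not thread safe):

```python
import torch

nwires, nticks = 100, 200
out = torch.zeros(nwires, nticks)

# (N,5) Gaussians: amplitude, wire mean, tick mean, wire sigma, tick sigma
gs = torch.tensor([[1.0, 30.0, 50.0, 2.0, 3.0],
                   [0.5, 31.0, 55.0, 4.0, 2.0]])

for a, mw, mt, sw, st in gs:
    # Patch bounds limited to +/- 3 sigma, so every patch has its own shape.
    w0, w1 = int(mw - 3 * sw), int(mw + 3 * sw) + 1
    t0, t1 = int(mt - 3 * st), int(mt + 3 * st) + 1
    w = torch.arange(w0, w1, dtype=torch.float32)
    t = torch.arange(t0, t1, dtype=torch.float32)
    patch = a * torch.exp(-0.5 * ((w - mw) / sw) ** 2).unsqueeze(1) \
              * torch.exp(-0.5 * ((t - mt) / st) ** 2).unsqueeze(0)
    # Overlapping patches accumulate into the same slice of the common
    # output array: a serial read-modify-write that threads would race on.
    out[w0:w1, t0:t1] += patch
```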

Turning the problem around and calculating the output array one pixel at a time, where each pixel is a loop over all (N,5) Gaussians, would be very parallelizable. Two difficulties remain: RAM usage and the need for more detailed CUDA programming than I’ve done so far.
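The pixel-centric formulation is a pure reduction over the Gaussians for every pixel, which broadcasting expresses directly; a sketch with made-up sizes (the (nwires, nticks, N) intermediate it materializes is exactly the RAM difficulty mentioned above):

```python
import torch

nwires, nticks = 100, 200
gs = torch.rand(500, 5)            # (N,5) Gaussians, made-up values
gs[:, 1] *= nwires                 # wire means
gs[:, 2] *= nticks                 # tick means
gs[:, 3:] += 1.0                   # keep sigmas away from zero

# Pixel coordinates first, Gaussian index last.
w = torch.arange(nwires, dtype=torch.float32).view(nwires, 1, 1)
t = torch.arange(nticks, dtype=torch.float32).view(1, nticks, 1)
a, mw, mt, sw, st = (gs[:, i] for i in range(5))

# Every pixel independently sums over all N Gaussians: embarrassingly
# parallel, but the (nwires, nticks, N) intermediate costs RAM.
out = (a * torch.exp(-0.5 * (((w - mw) / sw) ** 2
                             + ((t - mt) / st) ** 2))).sum(dim=-1)
print(out.shape)   # (nwires, nticks)
```

A hand-written CUDA kernel could run the same per-pixel loop without materializing the intermediate, at the cost of the more detailed CUDA programming noted above.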
