Binned spectrum analysis using a parameterized representation of the spectrum.
Many physics analyses require measuring the likelihood of a set of parameters given an observed binned distribution. These analyses typically rely on the same underlying mechanics, even though they are expressed differently for different types of analysis (e.g. template fit vs. unfolding).
The likelihood computation can be very expensive, and typically needs to be run a very large number of times, for example during minimization.
Given these two points, this package aims to provide a convenient python interface to the common mechanics relied on by nearly all binned analyses, while running the heavy computation in efficient compiled code. The gradients of the log likelihood are computed analytically, reducing the time required to perform a minimization by a factor proportional to the number of parameters.
The package also provides a higher level interface to the base mechanics, re-expressing the base functionality in the paradigms of template and unfolding analyses.
The spectrum can be thought of as the sum of the rows of a matrix: each column is a bin of the spectrum, and each row is a "source" contributing to the spectrum. The idea is to split the spectrum into its constituent sources. By scaling the contribution of each source (row) by some factor, the spectrum can be described under any coherent variation of its constituent sources.
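The row-scaling mechanics above can be sketched in a few lines of numpy. The source names and numbers here are made up for illustration; only the matrix layout (rows = sources, columns = bins) follows the text.

```python
import numpy as np

# Hypothetical 3-source, 4-bin spectrum: each row is a source, each column a bin.
sources = np.array([
    [100., 80., 60., 40.],   # signal
    [ 10., 12.,  8.,  5.],   # template: events added/removed per bin
    [  2.,  3.,  3.,  2.],   # systematic variation
])

# One scale factor per source (row); the nominal spectrum would use all ones.
factors = np.array([1.0, 0.5, -1.0])

# The expected spectrum is the factor-weighted sum of the rows.
spectrum = factors @ sources
```

Scaling a row's factor coherently moves that source's contribution in every bin at once, which is exactly the "coherent variation" described above.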
In a template analysis, there is a source for the signal distribution, and a source for each template, indicating how many events the template adds/removes to/from the signal. In an unfolding analysis, there is a source for each "truth bin", each contributing events to the spectrum separately.
Systematic variations to a source are treated in the same way as a template: a source is introduced for the systematic, indicating how many events it adds/removes to/from the spectrum.
The row factors are arbitrary expressions of parameters. Parameters can be shared across many row factors; for example, a luminosity parameter can appear in the factor for all rows subject to luminosity scaling.
A single "statistical parameter" is introduced for each column, allowing to indepdently change the net number of events in each spectrum bin to account for statistical fluctuations in the sources. In theory, each bin of the matrix should get its own statistical parameter such that statistical fluctuations in a given source can be correctly correlated to the scaling factor for that source. However, this serves as a decent approximation.
Building the expected spectrum given a set of parameters is only half the battle. Typically, it is then necessary to evaluate how well those parameters correspond to the observed spectrum. The spectrum object can therefore also evaluate the log likelihood that its parameters agree with a given data spectrum.
This is computed by approximating the Poisson probability that the observed number of events in a bin arose from the expected number (computed from the parameters). A regularization penalty is then applied to the parameters, and the resulting value can be interpreted as proportional to the likelihood that the parameters are the true ones, given the observed data (as per Bayes' theorem).
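The two pieces of the computation can be sketched as follows. This is a generic illustration, assuming a Gaussian penalty for the regularization term; the package's actual penalty may take a different form.

```python
import numpy as np

def log_likelihood(expected, observed):
    """Poisson log-probability summed over bins, dropping the constant
    log(n!) term, which does not depend on the parameters."""
    return np.sum(observed * np.log(expected) - expected)

def regularization(params, priors, sigmas):
    """Gaussian penalty pulling parameters toward their prior values."""
    return -0.5 * np.sum(((params - priors) / sigmas) ** 2)

observed = np.array([105., 82., 63.])
# The Poisson term is maximal when the expectation matches the observation.
ll_match = log_likelihood(observed, observed)
ll_off = log_likelihood(observed * 1.1, observed)
```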
This computation also keeps track of the gradient of the log likelihood with respect to each parameter. The C++ object responsible for carrying out this computation implements the ROOT::Math::IGradientMultiDim interface, so it can be used directly with the ROOT minimization framework.
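For intuition, the analytic gradient has a simple closed form in the linear case sketched earlier: if the expectation is mu = p @ S, then d(-logL)/dp_k = sum_i (1 - n_i/mu_i) S[k, i]. A numpy sketch (with made-up numbers), checked against a finite-difference gradient:

```python
import numpy as np

sources = np.array([
    [100., 80., 60.],
    [ 10., 12.,  8.],
])
observed = np.array([108., 95., 66.])

def nll_and_grad(p):
    """Negative Poisson log likelihood and its analytic gradient,
    for row factors p (constant log(n!) term dropped)."""
    mu = p @ sources                        # expected spectrum
    nll = np.sum(mu - observed * np.log(mu))
    # d(nll)/dp_k = sum_i (1 - n_i / mu_i) * S[k, i]
    grad = sources @ (1.0 - observed / mu)
    return nll, grad

p = np.array([1.0, 0.8])
nll, grad = nll_and_grad(p)

# Verify against a forward finite difference.
eps = 1e-6
fd = np.array([
    (nll_and_grad(p + eps * np.eye(2)[k])[0] - nll) / eps
    for k in range(2)
])
```

Having the gradient in closed form is what lets a single likelihood evaluation serve the minimizer, instead of one extra evaluation per parameter for numerical differentiation.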
The C++ code generated by this package is compiled using ROOT's ACLIC module. The underlying data is stored in a binary file, and loaded dynamically into memory when a spectrum object is built. However, the propagation of parameters and their gradients is hard-coded.
Advantages:
- ACLIC takes care of compiling the code on any system.
- PyROOT takes care of interfacing the compiled code to python.
- Very large amounts of data can be accommodated efficiently, as the data is dynamically allocated.
- The full breadth of compiler optimizations can be applied to the parameter and gradient computations.
Drawbacks:
- Each spectrum must be compiled, introducing a slight overhead (typically negligible so long as the compiled spectrum is indeed complex enough to warrant compilation).
- Compilation generates files in a working directory.
- Each spectrum built in the same session must have a unique name.
numpy
: scientific computing for python.

ROOT
: data analysis framework for compiling C++ code which is dynamically linked into python.

The examples depend on the following python modules:

pymcmc.py
: Markov Chain Monte Carlo sampler. See https://github.com/gmcgoldr/pymcmc.

npinterval.py
: numpy computation of confidence intervals and mode. See https://github.com/gmcgoldr/npinterval.
Following is a brief explanation of the package functionality as it is organized in its files:

- parspec.py
  - ParSpec class: computes the expected binned spectrum given a set of parameters, and the likelihood of those parameters given an observed distribution. The parameters control the relative normalization of the various sources contributing to the spectrum, and the statistical fluctuations in each bin. Built by SpecBuilder.
  - Source class: stores information for a source contributing to the spectrum. Used during construction.
  - SpecBuilder class: accumulates sources and regularization information from which to build a parameterized spectrum. Writes C++ code for the spectrum, compiles it, then builds and returns a python ParSpec object.
- parspec.cxx: C++ code which performs the CPU-intensive spectrum and likelihood computations. The code is a template: capital names surrounded by __ characters are replaced by the parspec.SpecBuilder.build method to generate the compilable code for a specific spectrum.
- templates.py: interface which exposes parspec's functionality in a manner more convenient for carrying out a template analysis.
- examples.py: a series of heavily commented examples showing how to use the higher-level templates interface.