gmcgoldr/parspec

ParSpec

Binned spectrum analysis using a parameterized representation of the spectrum.

Rationale

Many physics analyses require measuring the likelihood of a set of parameters, given an observed binned distribution. They typically rely on the same underlying mechanics, even though those mechanics are expressed differently for different types of analysis (e.g. template fit vs. unfolding).

The likelihood computation can be very expensive, and typically needs to be run a very large number of times, such as during minimization.

Given these two points, this package aims to provide a convenient Python interface to the common mechanics relied on by nearly all binned analyses, while running the heavy computation in efficient compiled code. The gradients of the log likelihood are computed analytically. Relative to estimating the gradients numerically, which requires additional likelihood evaluations for each parameter, this reduces the time required to perform a minimization by a factor proportional to the number of parameters.

The package also provides a higher level interface to the base mechanics, re-expressing the base functionality in the paradigms of template and unfolding analyses.

Parameterization

The spectrum can be thought of as the column-wise sum of a matrix: each column is a bin of the spectrum, and each row is a "source" contributing to the spectrum. The idea is to split the spectrum into its constituent sources. By scaling the contribution of each source (row) by some factor, the spectrum can be described under any coherent variation of its constituent sources.

In a template analysis, there is a source for the signal distribution, and a source for each template, indicating how many events the template adds to or removes from the signal. In an unfolding analysis, there is a source for each "truth bin", each contributing events to the spectrum separately.

Systematic variations to a source are treated in the same way as a template: a source is introduced for the systematic, indicating how many events it adds to or removes from the spectrum.
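The matrix picture above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the package's API: the `sources` values and the `build_spectrum` helper are hypothetical.

```python
# Hypothetical 3-source, 4-bin matrix: rows are sources, columns are bins.
sources = [
    [10.0, 20.0, 15.0, 5.0],  # nominal signal distribution
    [1.0, 2.0, -1.0, 0.5],    # events a template adds to / removes from the signal
    [0.5, -0.5, 0.2, 0.0],    # events a systematic adds to / removes from the spectrum
]


def build_spectrum(sources, factors):
    """Scale each row (source) by its factor, then sum each column (bin)."""
    n_bins = len(sources[0])
    return [
        sum(factor * row[j] for factor, row in zip(factors, sources))
        for j in range(n_bins)
    ]


# Only the signal contributes: the spectrum is the signal row itself.
nominal = build_spectrum(sources, [1.0, 0.0, 0.0])
# Coherent variation: two units of the template, minus one unit of the systematic.
varied = build_spectrum(sources, [1.0, 2.0, -1.0])
```

Setting a factor to zero switches a source off, so the nominal spectrum is recovered when only the signal row is scaled by one.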

The row factors are arbitrary expressions of parameters. Parameters can be shared across many row factors; for example, a luminosity parameter can be added to the factor for all rows subject to luminosity scaling.
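As a rough sketch of what sharing a parameter across row factors means, the factors can be written as small functions of a common parameter set. The parameter names (`lumi`, `mu`, `syst`) and the expressions are assumptions made for illustration; they are not taken from the package.

```python
# Hypothetical parameter values.
params = {"lumi": 1.02, "mu": 0.8, "syst": -0.5}

# One expression per row of the source matrix. The luminosity parameter
# appears in every factor, so varying it rescales all rows coherently.
row_factors = [
    lambda p: p["lumi"] * p["mu"],    # signal row
    lambda p: p["lumi"],              # background row
    lambda p: p["lumi"] * p["syst"],  # systematic row
]

factors = [expression(params) for expression in row_factors]
```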

A single "statistical parameter" is introduced for each column, allowing the net number of events in each spectrum bin to vary independently to account for statistical fluctuations in the sources. In theory, each element of the matrix should get its own statistical parameter, such that statistical fluctuations in a given source can be correctly correlated to the scaling factor for that source. However, a single parameter per column serves as a decent approximation.
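The text does not say how the statistical parameter enters the bin content. One possible form, assumed here purely for illustration, is an additive shift measured in units of the column's combined statistical uncertainty. The uncertainties and helper names below are hypothetical.

```python
import math

# Hypothetical per-element statistical uncertainties (2 sources, 3 bins).
source_errors = [
    [1.0, 2.0, 1.5],
    [0.3, 0.4, 0.2],
]


def column_errors(source_errors):
    """Combine the uncertainties of all sources in each column in quadrature."""
    n_bins = len(source_errors[0])
    return [
        math.sqrt(sum(row[j] ** 2 for row in source_errors))
        for j in range(n_bins)
    ]


def apply_stat_params(spectrum, stat_params, errors):
    """Assumed form: shift each bin by its parameter times its uncertainty."""
    return [mu + s * err for mu, s, err in zip(spectrum, stat_params, errors)]


errors = column_errors(source_errors)
shifted = apply_stat_params([10.0, 20.0, 15.0], [0.0, 1.0, -1.0], errors)
```

With one parameter per column, the fluctuation is applied to the summed bin rather than to each source separately, which is the approximation described above.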

Log likelihood

Building the expected spectrum given a set of parameters is only half the battle. Typically, it is then necessary to evaluate how well those parameters correspond to the observed spectrum. The spectrum object can also evaluate the log likelihood that its parameters agree with a given data spectrum.

This is computed by approximating the Poisson probability that the observed number of events in a bin arose from the expected one (computed using a set of parameters). Then, a regularization penalty is applied to the parameters, and the resulting likelihood can be interpreted as being proportional to the likelihood that the parameters are the true ones, given the observed data (as per Bayes' theorem).

This computation also keeps track of the gradient of the log likelihood with respect to each parameter. The C++ object which is responsible for carrying out this computation implements the ROOT::Math::IGradientMultiDim interface. Thus, it can be used directly with the ROOT minimization framework.
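A minimal sketch of such a log likelihood follows. It assumes a plain Poisson term with the parameter-independent `log(n!)` constant dropped, and a Gaussian penalty on each regularized parameter. The exact approximation and the form of the penalty used by the package are not stated here, so both are assumptions, as are the function and argument names.

```python
import math


def log_likelihood(expected, observed, params, centers, widths):
    """Poisson log likelihood (up to a constant) plus Gaussian penalties."""
    ll = 0.0
    # Poisson term per bin: n * log(mu) - mu, dropping the constant -log(n!).
    for mu, n in zip(expected, observed):
        ll += n * math.log(mu) - mu
    # Assumed regularization: Gaussian prior on each parameter.
    for value, center, width in zip(params, centers, widths):
        ll -= 0.5 * ((value - center) / width) ** 2
    return ll


observed = [10, 20]
# Expectation matching the data, parameter at its prior center.
best = log_likelihood([10.0, 20.0], observed, [0.0], [0.0], [1.0])
# Same expectation, but the parameter is pulled one width from its center.
pulled = log_likelihood([10.0, 20.0], observed, [1.0], [0.0], [1.0])
# Parameter at its center, but the expectation disagrees with the data.
worse = log_likelihood([12.0, 18.0], observed, [0.0], [0.0], [1.0])
```

Here `best` is the largest of the three values: pulling the parameter one width from its center costs exactly 0.5, and an expectation that disagrees with the data lowers the Poisson term.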
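To illustrate why analytic gradients help, the sketch below uses a simplified model in which the parameters are the row factors themselves and the likelihood is the Poisson term alone. Under those assumptions the derivative with respect to factor `i` is the sum over bins `j` of `(n_j / mu_j - 1) * S_ij`. All names and values are hypothetical, and this is not the package's C++ implementation.

```python
import math

SOURCES = [[10.0, 20.0, 15.0], [1.0, 2.0, -1.0]]  # rows: sources, columns: bins
OBSERVED = [12, 23, 13]


def expected(factors):
    """Expected bin contents: column sums of the scaled source rows."""
    return [
        sum(f * row[j] for f, row in zip(factors, SOURCES))
        for j in range(len(OBSERVED))
    ]


def log_likelihood(factors):
    """Poisson log likelihood, up to a parameter-independent constant."""
    return sum(n * math.log(mu) - mu for mu, n in zip(expected(factors), OBSERVED))


def gradient(factors):
    """Analytic gradient, reusing a single evaluation of the expectation."""
    mus = expected(factors)
    return [
        sum((n / mu - 1.0) * row[j] for j, (mu, n) in enumerate(zip(mus, OBSERVED)))
        for row in SOURCES
    ]


def numeric_gradient(factors, eps=1e-5):
    """Central differences: two likelihood evaluations per parameter."""
    grad = []
    for i in range(len(factors)):
        up, down = list(factors), list(factors)
        up[i] += eps
        down[i] -= eps
        grad.append((log_likelihood(up) - log_likelihood(down)) / (2 * eps))
    return grad


analytic = gradient([1.0, 0.5])
numeric = numeric_gradient([1.0, 0.5])
```

The numerical estimate needs two extra likelihood evaluations per parameter, while the analytic form reuses one evaluation of the expected spectrum, which is where the saving proportional to the number of parameters comes from.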

Note on the compiled code

The C++ code generated by this package is compiled using ROOT's ACLiC module. The underlying data is stored in a binary file and loaded dynamically into memory when a spectrum object is built. However, the propagation of parameters and their gradients is hard-coded.

Advantages:

  • ACLiC takes care of compiling the code on any system.
  • PyROOT takes care of interfacing the compiled code to Python.
  • Very large amounts of data can be accommodated efficiently as it is dynamically allocated.
  • The full breadth of compiler optimizations can be applied to the parameter and gradient computations.

Drawbacks:

  • Each spectrum must be compiled, introducing a slight overhead (typically negligible so long as the compiled spectrum is indeed complex enough to warrant compilation).
  • Compilation generates files in a working directory.
  • Each spectrum built in the same session must have a unique name.

Dependencies

  • numpy: scientific computing for Python.
  • ROOT: data analysis framework for compiling C++ code which is dynamically linked into Python. See .

The examples depend on the following python modules:

Package information

Following is a brief explanation of the package functionality as it is organized in its files:

  • parspec.py

    • ParSpec class: computes the expected binned spectrum given a set of parameters, and the likelihood of those parameters given an observed distribution. The parameters control the relative normalization of the various sources contributing to the spectrum, and the statistical fluctuations in each bin. Built by SpecBuilder.
    • Source class: stores information for a source contributing to the spectrum. Used during construction.
    • SpecBuilder class: accumulates sources and regularization information from which to build a parameterized spectrum. Writes C++ code for the spectrum, compiles it, then builds and returns a Python ParSpec object.
  • parspec.cxx: C++ code which performs the CPU-intensive spectrum and likelihood computations. The code is a template: capitalized names surrounded by __ characters are replaced by the parspec.SpecBuilder.build method to generate the compilable code for a specific spectrum.

  • templates.py: interface which exposes parspec's functionality in a manner more convenient for carrying out a template analysis.

  • examples.py: a series of heavily commented examples showing how to use the higher level templates interface.
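The placeholder substitution described for parspec.cxx can be sketched as follows. The placeholder names (`__NCOLS__`, `__NAME__`), the template text and the `fill_template` helper are hypothetical; the actual placeholders and the logic in `SpecBuilder.build` may differ.

```python
import re

# Hypothetical fragment of a C++ template with __CAPITAL__ placeholders.
TEMPLATE = (
    "const unsigned ncols = __NCOLS__;\n"
    'const char* name = "__NAME__";\n'
)


def fill_template(template, values):
    """Replace each __NAME__ placeholder with its value from `values`."""
    def replace(match):
        # An unknown placeholder raises KeyError rather than passing silently.
        return str(values[match.group(1)])

    return re.sub(r"__([A-Z]+)__", replace, template)


code = fill_template(TEMPLATE, {"NCOLS": 4, "NAME": "myspec"})
```

Raising on an unknown placeholder means an incomplete template fails in Python before any compilation is attempted.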
