
DL1 Data Handler


A package of utilities for writing, reading, and applying image processing to Cherenkov Telescope Array (CTA) DL1 data (calibrated images) in a standardized format. Created primarily for testing machine learning image analysis techniques on IACT data.

Currently supports data in the CTA pyhessio sim_telarray format, with the possibility of supporting other IACT data formats in the future. Built using ctapipe and PyTables.

Previously named image-extractor (v0.1.0 - v0.6.0). Currently under development, intended for internal use only.

Data Format

DL1DataWriter implements a standardized format for storing simulated CTA DL1 event data in PyTables files. CTAMLDataDumper is the class that implements the conversion from ctapipe containers to the CTA ML data format. See the wiki page for a full description of this data format and an FAQ.
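
Once a file has been written, its contents can be inspected directly with PyTables. A minimal sketch, assuming an output file named output.h5 (the file name is illustrative; see the wiki for the actual table layout):

import tables

# Open the HDF5 file read-only and list every table it contains.
with tables.open_file('output.h5', mode='r') as f:
    for node in f.walk_nodes('/', classname='Table'):
        # Print each table's path and its column names.
        print(node._v_pathname, node.colnames)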

Installation

The following installation method (for Linux) is recommended:

Installing with pip/setuptools from source

You can install dl1-data-handler using pip after cloning the repository:

git clone https://github.com/cta-observatory/dl1-data-handler.git
cd dl1-data-handler

To install into a virtualenv environment using pip:

virtualenv /path/to/ENV
source /path/to/ENV/bin/activate
pip install .

If you require a ctapipe version other than the default one (v0.8.0), use setup_light.py instead, which will not install ctapipe alongside dl1-data-handler, allowing you to install the ctapipe version of your choice (compatibility is not guaranteed for all ctapipe versions, though):

python setup_light.py install

Installing as a conda package

To install it as a conda package, first install Anaconda by following the instructions here: https://www.anaconda.com/distribution/.

Then, create and enter a new Python 3.7 environment with:

conda create -n [ENVIRONMENT_NAME] python=3.7
source activate [ENVIRONMENT_NAME]

From the environment, add the necessary channels for all dependencies:

conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels cta-observatory

Install the package:

conda install -c ctlearn-project dl1_data_handler

This should automatically install all dependencies (NOTE: this may take some time, as by default MKL is included as a dependency of NumPy and it is very large).

If you only need to import functionality from dl1-data-handler into your own Python scripts, then you are all set. However, if you wish to use any of the scripts in dl1-data-handler/scripts (like write_data.py), you should also clone the repository locally and check out the corresponding tag (e.g. for version v0.8.2):

git clone https://github.com/cta-observatory/dl1-data-handler.git
cd dl1-data-handler
git checkout v0.8.2

dl1-data-handler should already be installed in your environment by conda, so no further installation steps (i.e. with setuptools or pip) are necessary, and you should be able to run scripts/write_data.py directly.
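
To verify the installation, check that the package is importable from the active environment (a minimal sketch; the printed path simply shows where conda placed the package):

# Confirm the package can be imported from the active environment.
import dl1_data_handler
print(dl1_data_handler.__file__)  # path where the package was installed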

Dependencies

The main dependencies are:

  • PyTables >= 3.4.4
  • NumPy >= 1.15.0
  • ctapipe == 0.8.0

Also see setup.py.

Usage

DL1DataWriter

From the Command Line:

To process data files into the desired format, run:

scripts/write_data.py [runlist] [--config_file CONFIG_FILE_PATH] [--output_dir OUTPUT_DIR] [--debug]

Example:

scripts/write_data.py runlist.yml --config_file example_config.yml --debug
  • runlist - A YAML file containing groups of input files to load data from and output files to write to. See the example runlist (and the sketch after this list) for the format.
  • config_file - The path to a YAML configuration file specifying all of the settings for data loading and writing. See example config file and documentation for details on each setting. If none is provided, default settings are used for everything.
  • output_dir - Path to directory to write all output files to. If not provided, defaults to the current directory.
  • debug - Optional flag to print additional debug information from the logger.
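
A minimal runlist sketch, written from Python with PyYAML (the inputs/target keys mirror the run list structure used in the Python API example below; treating them as the YAML schema is an assumption, so check the example runlist in the repository):

import yaml

# Hypothetical runlist: each run maps a group of input files to one output file.
runlist = [
    {'inputs': ['gamma_20deg_0deg_run1___test.simtel.gz',
                'gamma_20deg_0deg_run2___test.simtel.gz'],
     'target': 'gamma_20deg_0deg_runs1-2___test.h5'},
]

# Write the runlist in the YAML format expected by write_data.py.
with open('runlist.yml', 'w') as f:
    yaml.safe_dump(runlist, f, default_flow_style=False)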

In a Python script:

If the package was installed with pip as described above, you can import and use it in a Python script. For example:

from dl1_data_handler import dl1_data_writer

# Event source class (e.g. a ctapipe EventSource subclass) and its settings
event_source_class = MyEventSourceClass
event_source_settings = {'setting1': 'value1'}

# Data dumper class (e.g. CTAMLDataDumper) and its settings
data_dumper_class = MyDataDumperClass
data_dumper_settings = {'setting2': 'value2'}

def my_cut_function(event):
    # custom cut logic here; return True to keep the event
    return True

data_writer = dl1_data_writer.DL1DataWriter(
    event_source_class=event_source_class,
    event_source_settings=event_source_settings,
    data_dumper_class=data_dumper_class,
    data_dumper_settings=data_dumper_settings,
    preselection_cut_function=my_cut_function,
    output_file_size=10737418240,
    events_per_file=500)

run_list = [
 {'inputs': ['file1.simtel.gz', 'file2.simtel.gz'],
  'target': 'output.h5'}
]

data_writer.process_data(run_list)

Generating a run list

If processing data from simtel.gz files, as long as their filenames have the format [particle_type]_[ze]deg_[az]deg_run[run_number]___[production info].simtel.gz or [particle_type]_[ze]deg_[az]deg_run[run_number]___[production info]_cone[cone_num].simtel.gz, the script scripts/generate_runlist.py can be used to automatically generate a runlist in the correct format.

It can be called as:

scripts/generate_runlist.py [file_dir] [--num_inputs_per_run NUM_INPUTS_PER_RUN] [--output_file OUTPUT_FILE]
  • file_dir - Path to a directory containing simtel.gz files with the filename format specified above.
  • num_inputs_per_run - Number of input files with the same particle type, zenith angle, azimuth angle, and production info to group together into each run (defaults to 10).
  • output_file - Path/filename of the output runlist file. Defaults to ./runlist.yml.

It will automatically sort the simtel files in the file_dir directory into groups with matching particle_type, zenith, azimuth, and production parameters. Within each of these groups, it will group together input files in sequential order into runs of size NUM_INPUTS_PER_RUN. The output filename for each run will be automatically generated as [particle_type]_[ze]deg_[az]deg_runs[run_number_range]___[production info].h5. The output YAML file will be written to output_file.
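
For reference, the filename convention described above can be captured with a regular expression. A minimal sketch (the pattern below illustrates the stated convention; it is not necessarily the exact pattern generate_runlist.py uses):

import re

# Matches e.g. 'gamma_20deg_0deg_run104___prod3.simtel.gz' (optional '_cone10')
FILENAME_RE = re.compile(
    r'(?P<particle_type>.+)_(?P<ze>\d+)deg_(?P<az>\d+)deg'
    r'_run(?P<run_number>\d+)___(?P<prod_info>.+?)'
    r'(?:_cone(?P<cone_num>\d+))?\.simtel\.gz$')

m = FILENAME_RE.match('gamma_20deg_0deg_run104___prod3.simtel.gz')
print(m.groupdict())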

ImageMapper

The ImageMapper class transforms the hexagonal input pixels into a 2D Cartesian output image. Basic usage is demonstrated in the ImageMapper tutorial. It requires ctapipe-extra in addition to dl1-data-handler. See this publication for a detailed description: arXiv:1912.09898
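
For orientation, a minimal usage sketch is given below. The constructor arguments and method name (camera_types, mapping_method, map_image) are assumptions based on typical usage of the class; consult the ImageMapper tutorial for the actual API:

import numpy as np
from dl1_data_handler.image_mapper import ImageMapper

# Hypothetical setup: map 1D pixel vectors from an LSTCam camera to 2D images.
mapper = ImageMapper(camera_types=['LSTCam'],
                     mapping_method={'LSTCam': 'bilinear_interpolation'})

# LSTCam has 1855 pixels; one channel of dummy charge values.
pixel_vector = np.random.rand(1855, 1)
image = mapper.map_image(pixel_vector, 'LSTCam')
print(image.shape)  # 2D Cartesian image (height, width, channels)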

Other scripts

All other scripts located in the scripts/deprecated directory are not currently updated to be compatible with dl1-data-handler >= 0.7.0 and should not be used.

Examples/Tips

  • ViTables is very helpful for viewing and debugging PyTables-style HDF5 files. Installation/download instructions can be found in the link below. NOTE: It is STRONGLY recommended that ViTables be installed in a separate Anaconda environment with Python 2.7 to avoid issues with conflicting PyQt5 versions. See this issue thread for details: ContinuumIO/anaconda-issues#1554

Known Issues/Troubleshooting

  • As of v0.7.2 there appears to be an issue when processing files containing SCT data. A fix is planned for a future version.
  • ViTables PyQt5 dependency conflict (pip vs. conda): relevant issue thread

Links

  • Cherenkov Telescope Array (CTA) - Homepage of the CTA collaboration
  • CTLearn and GammaLearn - Repository of code for studies on applying deep learning to IACT analysis tasks. Maintained by groups at Columbia University, Universidad Complutense de Madrid, Barnard College (CTLearn) and LAPP (GammaLearn).
  • ctapipe - Official documentation for the ctapipe analysis package (in development)
  • ViTables - Homepage for the ViTables application for PyTables HDF5 file visualization
