awkward-1.0

Development of Awkward 1.0, to replace scikit-hep/awkward-array in 2020.

The original motivations document from July 2019, now a little out-of-date.
My StrangeLoop talk on September 14, 2019.
My PyHEP talk on October 17, 2019.
My CHEP talk on November 7, 2019.
My CHEP 2019 proceedings (to be published in EPJ Web of Conferences).
Demo for Coffea developers on December 20, 2019.
Demo for Numba developers on January 22, 2020.

Motivation for a new Awkward Array

Awkward Array has proven to be a useful way to analyze variable-length and tree-like data in Python, by extending NumPy's idioms from rectilinear arrays to arrays of complex data structures. For over a year, physicists have been using Awkward Array both in and out of uproot; it is already one of the most popular Python packages in particle physics.

However, its pure-NumPy implementation is hard to extend (finding for-loop-free implementations of operations on nested data is difficult) and hard to maintain (most bugs are NumPy corner cases). Also, the feedback users have given me through GitHub, StackOverflow, and in-person tutorials has pointed out some design mistakes. A backward-incompatible release will allow us to fix those mistakes while providing freedom to make deep changes in the implementation.

The Awkward 1.0 project is a major investment, a six-month sprint from late August 2019 to late February 2020. The time spent on a clean, robust Awkward Array is justified by the widespread adoption of Awkward 0.x: its usefulness to the community has been demonstrated.

Main goals of Awkward 1.0

Full access to create and manipulate Awkward Arrays in C++ with no Python dependencies. This is so that C++ libraries can produce and share data with Python front-ends.
Easy installation with pip install and conda install for most users (Mac, Windows, and most Linux).
Imperative (for-loop-style) access to Awkward Arrays in Numba, a just-in-time compiler for Python. This is so that physicists can write critical loops in straightforward Python without a performance penalty.
A single awkward.Array class that hides the details of how columnar data is built, with a suite of operations that apply to all internal types.
Conformance to NumPy, where Awkward and NumPy overlap.
Better control over "behavioral mix-ins," such as LorentzVector (i.e. adding methods like pt() to arrays of records with px and py fields). In Awkward 0.x, this was achieved with multiple inheritance, but that was brittle.
Support for set operations and database-style joins, which can be put to use in a declarative analysis language, but which require database-style accounting of an index (like a Pandas index).
Better interoperability with Pandas, NumExpr, and Dask, while maintaining support for ROOT, Arrow, and Parquet.
Ability to add GPU implementations of array operations in the future.
Better error messages and extensive documentation.
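
As an illustration of the behavioral mix-in goal above, here is a minimal sketch of one way to attach methods such as pt() to records by name rather than by inheritance. Everything in it (BEHAVIORS, behavior, this toy Array class) is hypothetical and invented for the example; it is not the Awkward 1.0 interface.

```python
# Hypothetical sketch: these names are NOT Awkward 1.0's actual API.
import functools
import math

# Registry: record name -> {method name -> function taking the array}.
BEHAVIORS = {}

def behavior(record_name):
    """Decorator that files a function under a record name."""
    def decorator(function):
        BEHAVIORS.setdefault(record_name, {})[function.__name__] = function
        return function
    return decorator

class Array:
    """Toy columnar records: one Python list per field, plus a record name."""
    def __init__(self, fields, record_name=None):
        self.fields = fields
        self.record_name = record_name

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails: look the method up
        # in the registry instead of relying on a class hierarchy.
        methods = BEHAVIORS.get(self.__dict__.get("record_name"), {})
        if name in methods:
            return functools.partial(methods[name], self)
        raise AttributeError(name)

@behavior("LorentzVector")
def pt(array):
    # Transverse momentum, computed column-wise from the px and py fields.
    return [math.hypot(px, py)
            for px, py in zip(array.fields["px"], array.fields["py"])]

vectors = Array({"px": [3.0, 0.0], "py": [4.0, 2.0]}, record_name="LorentzVector")
print(vectors.pt())  # [5.0, 2.0]
```

Because the data carries only a name, the same array can be handled by code that knows nothing about LorentzVector, and an array without a registered name simply has no pt() method.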

Architecture of Awkward 1.0

To achieve these goals, Awkward 1.0 is separated into four layers:

The user-facing Python layer with a single awkward.Array class, whose data is described by a datashape type.
The columnar representation (i.e. nested ListArray, RecordArray, etc.), accessible but hidden behind awkward.Array; these nodes are all C++ classes presented to Python through pybind11.
Two object models for the columnar representation, one in C++11 (with only header-only dependencies) and the other as Numba extensions. This is the only layer in which array-allocation occurs.
A suite of operations on arrays, computing new values but not allocating memory. The first implementation of this suite is in C++ with a pure-C interface; the second may be CUDA (or other GPU language). With one exception (FillableArray), iterations over arrays only occur at this level, so performance optimizations can focus on this layer.
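
The division of labor between the layers can be sketched in pure Python. This is only a toy model under stated assumptions: the names ListNode, kernel_list_sums, and this Array class are hypothetical, and the real layers are written in C++ (and Numba), not Python.

```python
# Hypothetical sketch of the layering; not Awkward 1.0's real classes or kernels.

def kernel_list_sums(offsets, content, out):
    """Operations layer: loops over the data and fills `out`; allocates nothing."""
    for i in range(len(offsets) - 1):
        total = 0.0
        for j in range(offsets[i], offsets[i + 1]):
            total += content[j]
        out[i] = total

class ListNode:
    """Object-model layer: list i is content[offsets[i]:offsets[i + 1]]."""
    def __init__(self, offsets, content):
        self.offsets = offsets
        self.content = content

    def __len__(self):
        return len(self.offsets) - 1

    def sums(self):
        out = [0.0] * len(self)  # allocation happens here, not in the kernel
        kernel_list_sums(self.offsets, self.content, out)
        return out

class Array:
    """User-facing layer: a single class that hides the columnar layout."""
    def __init__(self, layout):
        self.layout = layout  # accessible, but not needed for everyday use

    @classmethod
    def from_lists(cls, lists):
        # Building from Python lists is a convenience for this sketch only.
        offsets, content = [0], []
        for item in lists:
            content.extend(item)
            offsets.append(len(content))
        return cls(ListNode(offsets, content))

    def sums(self):
        return self.layout.sums()

array = Array.from_lists([[1.0, 2.0, 3.0], [], [4.0, 5.0]])
print(array.layout.offsets)  # [0, 3, 3, 5]
print(array.sums())          # [6.0, 0.0, 9.0]
```

Keeping the loops in functions like kernel_list_sums, which take and fill plain buffers, is what would let a second implementation (for example, on a GPU) replace the first without touching the layers above it.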

The Awkward transition

Since Awkward 1.0 is not backward-compatible, existing users of Awkward 0.x will need to update their scripts or only use the new version on new scripts. Awkward 1.0 is already available to early adopters as awkward1 in pip (pip install awkward1 and import awkward1 in Python). When uproot is ready to use the new Awkward Array,

it will be released as uproot 4.0,
awkward1 will be renamed awkward, and
the old Awkward 0.x will be renamed awkward0.

The original Awkward 0.x will be available in perpetuity as awkward0, but only minor bugs will be fixed, and that only for the duration of 2020. This repository will replace scikit-hep/awkward-array on GitHub.
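
During the transition, a script may want the new library regardless of which name it is currently installed under. The following is a minimal sketch of one way to do that; the helper names import_first and is_version_1_or_later are hypothetical, and the version check assumes the package exposes a __version__ string whose first component is the major version.

```python
import importlib

def import_first(names, accept=lambda module: True):
    """Return the first importable module in `names` that `accept` approves."""
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        if accept(module):
            return module
    raise ImportError("none of these modules is usable: " + ", ".join(names))

def is_version_1_or_later(module):
    """Assumption: the module has a __version__ string like '1.0.0'."""
    version = getattr(module, "__version__", "0")
    try:
        return int(version.split(".")[0]) >= 1
    except ValueError:
        return False

# Before the rename the new library is `awkward1`; afterward it is `awkward`.
# The version check keeps an installed Awkward 0.x from being picked up.
try:
    ak = import_first(["awkward1", "awkward"], accept=is_version_1_or_later)
except ImportError:
    ak = None
    print("Awkward 1.0 is not installed under either name")
```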

Compiling from source

Awkward 1.0 is available to early adopters as awkward1 in pip (pip install awkward1 and import awkward1 in Python), but developers will need to compile from source. For that, you will need

CMake/CTest,
a C++11-compliant compiler,

and optionally

Python 2.7, 3.5, 3.6, 3.7, or 3.8 (CPython, not an alternative like PyPy),
NumPy 1.13.1 or later,
pytest 3.9 or later (to run tests),
Numba 0.46 or later (to run all the tests).

To get the code from GitHub, be sure to use --recursive to get Awkward's git-module dependencies (pybind11 and RapidJSON):

git clone --recursive https://github.com/scikit-hep/awkward-1.0.git

To compile without Python (unusual case):

mkdir build
cd build
cmake ..
make all
make CTEST_OUTPUT_ON_FAILURE=1 test    # optional: run C++ tests
cd ..

To compile with Python (the usual case):

python setup.py build
pytest -vv tests                       # optional: run Python tests

In lieu of "make clean" for Python builds, I use the following to remove compiled code from the source tree:

rm -rf **/*~ **/__pycache__ build dist *.egg-info awkward1/*.so **/*.pyc

See Azure Pipelines buildtest-awkward (CI) and deploy-awkward (CD).

Roadmap

The six-month sprint:

September 2019: Set up CI/CD; define jagged array types in C++; pervasive infrastructure like database-style indexing.
October 2019: NumPy-compliant slicing; the Numba implementation. Feature parity will be maintained in Numba continuously.
November 2019: Fillable arrays to create columnar data; high-level type objects; all list and record types.
December 2019: The awkward.Array user interface; behavioral mix-ins, including the string type.
January 2020: NEP 13 and NEP 18; the rest of the array nodes: option and union types, indirection.
February 2020: The array operations (flattening, padding, concatenating, combinatorics, etc.) and the array types needed for uproot and Arrow/Parquet (chunked, virtual, masked, etc.).

Updating dependent libraries:

March 2020: Update vector (from hepvector and uproot-methods). This work will be done with Henry Schreiner.
April 2020: Update uproot to 4.0 using Awkward 1.0.

Most users will see Awkward 1.0 for the first time when uproot 4.0 is released.

Progress is currently on track.

Checklist of features for the six-month sprint

Completed items are ☑check-marked. See closed PRs for more details. All remaining items have been assigned an issue and a milestone.

Soon after the six-month sprint

Update hepvector to be Derived classes, replacing the TLorentzVectorArray in uproot-methods.
Update uproot (on a branch) to use Awkward 1.0.
Start the awkward → awkward0, awkward1 → awkward transition.
