This module is a fork of pupynere, a lightweight netCDF reading and writing
module. It has been modified for use in the PCIC Data Portal (PDP).
In the context of the PDP, it is used to write and stream the headers and
metadata of netCDF files as users download them.
The PDP ecosystem uses a different package
to read the data from the master file for streaming. PCIC's modifications for
streaming netCDF files include:
- calculate the filesize before the file is created so it can be sent as a content-length header
- generator pattern so data can be sent in small chunks as it is ready
- allow creation of virtual files in memory, not the filesystem
- NcOrderedDict class for synchronizing variable order across multiple generators
- Memory for variable data is not allocated until needed
This package hews closely to the NetCDF file format specification,
which should be considered supplementary documentation. Most functions and attributes on
the netcdf_file and netcdf_variable classes are named for the part of the NetCDF
file spec they read or write.
This class represents a netCDF file.
It may be initialized with a filename, a file pointer, or neither. If initialized with a filename or file pointer, it reads and writes data on the filesystem; otherwise it is a virtual file held in memory. The PDP uses the virtual option.
Arbitrary attributes may be set on a netcdf_file object and will be treated as global netCDF attributes when the file is streamed or written to disk.
- filesize() - projected or actual size of the file. Can be calculated before the file is fully populated to send to a downloader as the content-length header, based on the variable, dimension, and record specifications. No actual data needed.
- close() - close the file pointer, if there is one.
- __setattr__ - handles setting arbitrary attributes with dot notation
- createDimension() and createVariable() - add dimensions or variables to the file
- flush() - flush to disk. Only relevant for files on the filesystem
- recvars() and nonrecvars() - get an ordered dictionary of the record or nonrecord variables in this file
- __generate__() - get this file as generator chunks
- _write() - write the netCDF file to disk. Only relevant for files on the filesystem
- _header(), _data(), _dim_array(), _gatt_array(), _var_array(), _att_array(), _num_recs(), _var_metadata() - get the various parts of a netCDF file, encoded per the netCDF specification. Helper functions to _write() and __generate__(). Reading the netCDF spec is helpful for understanding the use of each of these functions; their names match up.
- _read() - read a netCDF file from the filesystem
- _read_values(), _read_dim_array(), read_gatt_array(), read_var_array(), _read_att_array(), _read_num_recs(), _read_var_values() - read part of a netCDF file from the filesystem, following the netCDF spec. Helper functions to _read(). Reading the netCDF spec is helpful for understanding the use of each of these functions; their names match up. Not used by the PDP.
- _calc_begins() - calculates the starting address of each variable. Essential for streaming data if you want to send the header before you have all the variable data.
- set_numrecs() - set the number of records in the datafile without actually assigning the data. Used for streaming data to send the filesize and complete headers before all data is available.
- _pack_begin(), _pack_int(), _pack_int_64(), _pack_string() - return an integer or string in the correct format for a netCDF file. Helpers to the file-writing or file-generator functions. NetCDF is big-endian.
- _unpack_int(), _unpack_int64(), unpack_string() - read the file and translate an integer or string. Helpers to the _read_* functions
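The packing helpers listed above can be pictured with the standard struct module. This is a minimal sketch rather than the module's actual code: pack_int, pack_string, and unpack_int are hypothetical stand-ins, and they assume the classic netCDF encoding, in which integers are big-endian 4-byte values and strings are stored as a length followed by the bytes, null-padded to a multiple of four.

```python
import struct


def pack_int(value):
    """Encode an integer as a big-endian 4-byte signed int."""
    return struct.pack('>i', value)


def pack_string(text):
    """Encode a string as its length, then its bytes padded to 4 bytes."""
    raw = text.encode('ascii')
    padding = -len(raw) % 4  # 0 to 3 null bytes
    return pack_int(len(raw)) + raw + b'\x00' * padding


def unpack_int(buf, offset=0):
    """Decode a big-endian 4-byte signed int from buf at offset."""
    return struct.unpack_from('>i', buf, offset)[0]
```

Because every field ends on a 4-byte boundary, the length of an encoded name can be worked out from the name alone, which is part of what makes the header size predictable.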
PDP Usage: To create headers for streaming, the PDP (via pydap.responses.netcdf) uses
this class to create a new, virtual netCDF file with the dimensions and variables requested
by the user, along with global and variable metadata attributes read from the source
netCDF file.
Then, based on the user request, the number of records is set and the variable begin addresses are calculated.
At that point, netcdf_file has all the information needed to generate the complete
netCDF file header, which is streamed to the downloader.
Next, a pydap handler package streams the data from the source file; data is never
added to the virtual netCDF file created by this class. An NcOrderedDict is used
to synchronize the order of the variables in the header generated here and the data
generated by pydap.handlers.hd5.
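The reason the header and content-length can be sent before any data exists is that both depend only on sizes. The sketch below illustrates that arithmetic under stated assumptions: plan_layout and padded are hypothetical names, header_len is taken as given (in practice it would be the length of the encoded header), and the layout follows the classic netCDF rules of nonrecord variables first, then interleaved records.

```python
def padded(nbytes):
    """Round a byte count up to the next multiple of four."""
    return nbytes + (-nbytes % 4)


def plan_layout(header_len, nonrec_sizes, rec_sizes, numrecs):
    """Return (begins, filesize) without touching any variable data.

    nonrec_sizes: {name: total bytes} for fixed-size variables
    rec_sizes:    {name: bytes per record} for record variables
    """
    begins = {}
    offset = header_len
    for name, nbytes in nonrec_sizes.items():
        begins[name] = offset
        offset += padded(nbytes)
    # With exactly one record variable, its records are not padded.
    pad = padded if len(rec_sizes) != 1 else (lambda n: n)
    recsize = 0
    for name, nbytes in rec_sizes.items():
        begins[name] = offset + recsize  # position inside the first record
        recsize += pad(nbytes)
    return begins, offset + numrecs * recsize


begins, filesize = plan_layout(
    header_len=200,  # assumed value for illustration
    nonrec_sizes={'lat': 3 * 4, 'lon': 5 * 4},
    rec_sizes={'time': 4, 'tasmax': 3 * 5 * 4},
    numrecs=10,
)
```

Only the shapes, item sizes, and record count enter the calculation, so the result is available as soon as the user's request has been parsed.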
This class represents a single netCDF variable within a file.
Arbitrary attributes may be set on a netcdf_variable object, and will be treated
as variable attributes when the variable is streamed or written to disk.
Data may be written to a netcdf_variable using square bracket notation as if it were
a numpy array.
A variable is either a record variable or a nonrecord variable; the two types are formatted differently when written to disk or streamed. Nonrecord variables have all dimensions of known size and are written to disk or streamed in one contiguous block. Record variables have one unlimited dimension (the record dimension) that is expected to grow over time. All record variables are written at the end of the file, interleaved so that if more records are added, the file can simply be appended to without rewriting earlier portions. Many things are handled differently depending on whether a variable is a record or nonrecord variable. See the "More information on NetCDF" section or the netCDF specification for more information.
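The interleaving of record variables can be illustrated with plain byte strings. In this sketch the interleave function and the sample variables A and B are hypothetical; each variable is represented as a list of already-encoded records.

```python
def interleave(record_vars):
    """Concatenate record data record by record: A[0], B[0], A[1], B[1], ..."""
    numrecs = len(next(iter(record_vars.values())))
    chunks = []
    for rec in range(numrecs):
        for records in record_vars.values():
            chunks.append(records[rec])
    return b''.join(chunks)


record_vars = {
    'A': [b'A0A0', b'A1A1'],
    'B': [b'B0B0', b'B1B1'],
}
layout = interleave(record_vars)

# Adding a third record only appends bytes; the existing layout is untouched.
record_vars['A'].append(b'A2A2')
record_vars['B'].append(b'B2B2')
extended = interleave(record_vars)
```

Had the variables been stored one after another instead, appending a record to A would have required moving all of B.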
- isrec() - returns true if this is a record variable
- shape() - the current numerical dimensions of the variable
- getValue() and assignValue() - a netCDF variable may have no associated dimensions and represent a scalar value. These functions access and set such a variable.
- __getitem__() and __setitem__() - numpy-style assignment and data access
- __setattr__ - handles setting attributes with dot notation
- typecode() - the type of the variable's data
- itemsize() - the number of bytes a single datum occupies, e.g. 4 for an integer
- size() - the total number of values in this variable (or potential values: it can be calculated from the dimensions and number of records alone, without setting any data, which is what happens when streaming)
- _data_allocated() and _set_data() - Because the primary use of this package is just to generate file headers, memory to hold variable data is not automatically allocated. These functions check whether memory has been allocated and allocate it. Memory is not allocated until the first time data is written to the variable.
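The deferred allocation described above follows a common lazy-initialization pattern. The class below is a hypothetical, one-dimensional illustration, not the netcdf_variable implementation: a plain Python list stands in for the numpy array, and the method names are chosen for this example only.

```python
class LazyVariable:
    """Illustrative variable that allocates storage on first write."""

    def __init__(self, length, itemsize=4):
        self.length = length
        self.itemsize = itemsize
        self._data = None  # nothing allocated yet

    def data_allocated(self):
        return self._data is not None

    def nbytes(self):
        # Known from the declared length alone, so no data is needed.
        return self.length * self.itemsize

    def __setitem__(self, index, value):
        if self._data is None:
            self._data = [0] * self.length  # allocate on first write
        self._data[index] = value

    def __getitem__(self, index):
        if self._data is None:
            raise ValueError('no data has been written to this variable')
        return self._data[index]


var = LazyVariable(10)
size_before_write = var.nbytes()
allocated_before_write = var.data_allocated()
var[:] = range(10)
```

A header-only workflow never triggers the allocation, so describing a very large variable costs almost no memory.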
An ordered dictionary of the variables inside a netCDF file. Used by pydap to
coordinate the order in which pupynere lists variables in the header with the order
in which blocks of data are streamed by pydap.handlers.hd5. Can hold either nonrecord,
record, or all variables. In the case of all variables, nonrecord variables come first,
ordered by ascending size, followed by the record variables.
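The ordering rule can be sketched with a standard OrderedDict. This is an illustration only: order_variables is a hypothetical function, each variable is reduced to a (size in bytes, is_record) pair, and keeping record variables in their input order is an assumption, since the text above does not say how they are ordered among themselves.

```python
from collections import OrderedDict


def order_variables(variables):
    """variables: {name: (nbytes, is_record)} -> OrderedDict in stream order."""
    nonrec = sorted(
        (item for item in variables.items() if not item[1][1]),
        key=lambda item: item[1][0],  # ascending size
    )
    # Assumption: record variables keep the order they were given in.
    rec = [item for item in variables.items() if item[1][1]]
    return OrderedDict(nonrec + rec)


variables = {
    'tasmax': (600, True),
    'lon': (20, False),
    'lat': (12, False),
    'time': (4, True),
}
ordered_names = list(order_variables(variables))
```

Sharing one such ordering between the header writer and the data streamer is what keeps the begin addresses in the header consistent with the bytes that follow.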
Not currently used
Not currently used
Not currently used
Not currently used
Generator that yields the headers of a netCDF file, then yields from additional generator(s) representing the data
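A generator of this shape might look like the following sketch. The names stream_file and fake_variable_data are hypothetical, and the data generators here produce placeholder bytes rather than real variable data.

```python
def stream_file(header, *data_generators):
    """Yield the header bytes first, then every chunk from each generator."""
    yield header
    for gen in data_generators:
        yield from gen


def fake_variable_data(label, nchunks):
    """Placeholder data source that yields small labelled chunks."""
    for i in range(nchunks):
        yield ('%s%d' % (label, i)).encode('ascii')


chunks = list(stream_file(
    b'CDF\x01',
    fake_variable_data('a', 2),
    fake_variable_data('b', 1),
))
```

A web framework can iterate over such a generator and send each chunk as it is produced, so the full file never needs to exist in memory.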
Not currently used
Map from netCDF types to numpy types and vice versa, respectively.
Tests can be run with pytest.
virtualenv venv
source venv/bin/activate
pip install .
pip install pytest
pytest
Unlike the streaming workflow described above, this package can also be used to create netCDF files on the filesystem, which may be useful for testing.
To create a NetCDF file
>>> from pupynere import netcdf_file  # assumes the package is importable as pupynere
>>> f = netcdf_file('simple.nc', 'w')
>>> f.history = 'Created for a test'
>>> f.createDimension('time', 10)
>>> time = f.createVariable('time', 'i', ('time',))
>>> time[:] = range(10)
>>> time.units = 'days since 2008-01-01'
>>> f.close()
To read the NetCDF file we just created
>>> f = netcdf_file('simple.nc', 'r')
>>> print(f.history)
Created for a test
>>> time = f.variables['time']
>>> print(time.units)
days since 2008-01-01
>>> print(time.shape)
(10,)
>>> print(time[-1])
9
>>> f.close()
This module implements the Scientific.IO.NetCDF API to read and create NetCDF files. The same API is also used in the PyNIO and pynetcdf modules, allowing these modules to be used interchangeably when working with NetCDF files. The major advantage of scipy.io.netcdf over other modules is that it doesn't require the code to be linked to the NetCDF libraries as the other modules do.
The code is based on the NetCDF file format specification. A NetCDF file is a self-describing binary format, with a header followed by data. The header contains metadata describing dimensions, variables and the position of the data in the file, so access can be done in an efficient manner without loading unnecessary data into memory. We use the mmap module to create Numpy arrays mapped to the data on disk, for the same purpose.
The structure of a NetCDF file is as follows
C D F <VERSION BYTE> <NUMBER OF RECORDS>
<DIMENSIONS> <GLOBAL ATTRIBUTES> <VARIABLES METADATA>
<NON-RECORD DATA> <RECORD DATA>
Record data refers to data where the first axis can be expanded at will. All record variables share the same dimension on the first axis, and they are stored at the end of the file record by record, i.e.
A[0], B[0], ..., A[1], B[1], ..., etc,
so that new data can be appended to the file without changing its original structure. Non-record data are padded to a 4-byte boundary. Record data are also padded, unless there is exactly one record variable in the file, in which case the padding is dropped. All data is stored in big-endian byte order.
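The first eight bytes of the structure above (the magic string, the version byte, and the number of records) can be built and parsed with struct. This is a minimal sketch for the classic format: pack_preamble and parse_preamble are hypothetical helper names, not functions of this module.

```python
import struct


def pack_preamble(numrecs, version=1):
    """Build 'C D F <VERSION BYTE> <NUMBER OF RECORDS>' as 8 bytes."""
    return b'CDF' + bytes([version]) + struct.pack('>i', numrecs)


def parse_preamble(buf):
    """Return (version, numrecs) from the first 8 bytes of a file."""
    if buf[:3] != b'CDF':
        raise ValueError('not a classic netCDF file')
    version = buf[3]
    (numrecs,) = struct.unpack('>i', buf[4:8])  # big-endian 4-byte int
    return version, numrecs


preamble = pack_preamble(10)
```

The '>' in the format string selects big-endian byte order, matching the byte order used throughout the file.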
The Scientific.IO.NetCDF API allows attributes to be added directly to instances of netcdf_file and netcdf_variable. To differentiate between user-set attributes and instance attributes, user-set attributes are automatically stored in the _attributes attribute by overloading __setattr__. This is the reason why the code sometimes uses obj.__dict__['key'] = value, instead of simply obj.key = value; otherwise the key would be inserted into userspace attributes.
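The attribute mechanism can be sketched in a few lines. AttributeHolder is a hypothetical class written for illustration; it shows the general technique of overloading __setattr__ and is not a copy of the netcdf_file or netcdf_variable code.

```python
class AttributeHolder:
    """Illustrative object that records user-set attributes separately."""

    def __init__(self, name):
        # Write internals straight into __dict__ so they bypass __setattr__
        # and are not mistaken for user attributes.
        self.__dict__['name'] = name
        self.__dict__['_attributes'] = {}

    def __setattr__(self, key, value):
        # Anything set with dot notation is recorded as a user attribute.
        self._attributes[key] = value
        self.__dict__[key] = value


var = AttributeHolder('time')
var.units = 'days since 2008-01-01'
```

After this runs, var._attributes holds only units, while name remains an ordinary instance attribute that would not be written out as netCDF metadata.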