This document is a very quick overview of both HDF5 and h5py. More comprehensive documentation is available at:
The HDF Group is the final authority on HDF5. They also have an introductory tutorial which provides a good overview.
HDF5 is a filesystem for your data.
More accurately, it’s a widely used scientific file format for archiving and sharing large amounts of numerical data. HDF5 files contain datasets, which are homogeneous, regular arrays of data (like NumPy arrays), and groups, which are containers that store datasets and other groups.
In this sense, the structure of an HDF5 file is analogous to a POSIX filesystem. In fact, this is exactly the syntax used by HDF5 itself to locate resources:
/ (Root group)
/MyGroup (Subgroup)
/MyGroup/DS1 (Dataset stored in subgroup)
/MyGroup/Subgroup/DS2 (and so on)
HDF5 also has a well-developed type system, supporting integers and floats of all the normal sizes and byte orders, as well as more advanced constructs like compound and array types. The library handles all type conversion internally; you can read and write data without having to worry about things like endianness or precision.
h5py is a simple Python interface to HDF5. You can interact with files, groups and datasets using traditional Python and NumPy metaphors. For example, groups behave like dictionaries, and datasets have shape and dtype attributes and can be sliced and indexed just like real NumPy arrays. Datatypes are specified using standard NumPy dtype objects.
You don’t need to know anything about the HDF5 library to use h5py, apart from the basic metaphors of files, groups and datasets. The library handles all data conversion transparently, and translates operations like slicing into the appropriate efficient HDF5 routines.
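To make the "transparent conversion" claim concrete, here is a minimal sketch. The file name and dataset name are made up for illustration, and a temporary directory is used so nothing is left behind. It stores big-endian doubles on disk, writes them from a little-endian integer array, and reads them back into a single-precision buffer without any explicit conversion step.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "conversion_demo.hdf5")

    with h5py.File(path, "w") as f:
        # On disk: big-endian 64-bit floats.
        dset = f.create_dataset("values", (4,), dtype=">f8")
        # In memory: little-endian 32-bit integers; converted on write.
        dset[...] = np.array([1, 2, 3, 4], dtype="<i4")
        stored_dtype = dset.dtype

        # Read into a caller-supplied single-precision array; converted on read.
        out = np.empty((4,), dtype="<f4")
        dset.read_direct(out)

print(stored_dtype)
print(out)
```

The values survive both trips even though three different in-memory and on-disk types are involved.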
One additional benefit of h5py is that the files it reads and writes are “plain-vanilla” HDF5 files. No Python-specific metadata or features are used. You can read files created by most HDF5 applications, and write files that any HDF5-aware application can understand.
First, install h5py by following the installation instructions.
Since examples are better than long-winded explanations, here’s how to:
- Make a new file
- Create an integer dataset, with shape (100,100)
- Initialize the dataset to the value 42
>>> import h5py
>>> f = h5py.File('myfile.hdf5', 'a')
>>> dset = f.create_dataset("MyDataset", (100, 100), 'i')
>>> dset[...] = 42
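The File constructor accepts modes similar to Python file modes, including “r”, “w”, and “a”. Older h5py releases defaulted to “a”, but h5py 3.0 and later default to “r”, so it is safest to pass the mode explicitly: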
>>> f = h5py.File('file1.hdf5', 'w') # overwrite any existing file
>>> f = h5py.File('file2.hdf5', 'r') # open read-only
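As a rough illustration of how the three modes differ, the sketch below runs them in sequence against one temporary file. The file and dataset names are invented for the example. It assumes that opening a missing file with “r” raises an OSError (or a subclass of it), which is what current h5py releases do.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "modes_demo.hdf5")

    # "r" needs an existing file.
    try:
        h5py.File(path, "r")
        missing_file_failed = False
    except OSError:
        missing_file_failed = True

    # "w" creates the file (or truncates an existing one).
    with h5py.File(path, "w") as f:
        f.create_dataset("first", (3,), dtype="i4")

    # "a" opens read/write and keeps what is already there.
    with h5py.File(path, "a") as f:
        f.create_dataset("second", (3,), dtype="i4")

    # "r" opens read-only.
    with h5py.File(path, "r") as f:
        names_after_append = sorted(f.keys())

    # "w" again: the previous contents are gone.
    with h5py.File(path, "w") as f:
        names_after_overwrite = sorted(f.keys())

print(missing_file_failed, names_after_append, names_after_overwrite)
```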
The Dataset object dset here represents a new 2-d HDF5 dataset. Some features will be familiar to NumPy users:
>>> dset.shape
(100, 100)
>>> dset.dtype
dtype('int32')
You can even automatically create a dataset from an existing array:
>>> import numpy as np
>>> arr = np.ones((2,3), '=i4')
>>> dset = f.create_dataset('AnotherDataset', data=arr)
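A quick way to convince yourself that nothing is lost is to read the dataset back and compare it with the source array. The following is a small self-contained sketch with a made-up file name. The shape and dtype are taken from the array, so neither has to be passed explicitly.

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.arange(6, dtype="=i4").reshape(2, 3)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "from_array.hdf5")

    with h5py.File(path, "w") as f:
        dset = f.create_dataset("AnotherDataset", data=arr)
        shape_on_disk = dset.shape
        dtype_on_disk = dset.dtype
        # dset[...] reads the whole dataset into a new NumPy array.
        round_trip = dset[...]

print(shape_on_disk, dtype_on_disk)
print(np.array_equal(round_trip, arr))
```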
HDF5 datasets support many other features, like chunking and transparent compression. See the section “Datasets” for more info.
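For a taste of what that looks like, here is a hedged sketch of a chunked, gzip-compressed dataset. The chunk shape and compression level are arbitrary picks for illustration, not recommendations, and the file name is made up. Reading and writing work exactly as for an ordinary dataset; compression happens behind the scenes.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "compressed_demo.hdf5")

    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "compressed",
            (200, 200),
            dtype="f4",
            chunks=(50, 50),        # stored on disk in 50x50 tiles
            compression="gzip",     # each tile is gzip-compressed
            compression_opts=4,     # gzip level (0-9)
        )
        dset[0:50, 0:50] = 1.0      # fills exactly one chunk

        chunk_shape = dset.chunks
        compression = dset.compression
        compression_level = dset.compression_opts
        corner = dset[0:2, 0:2]
        untouched = float(dset[150, 150])  # never written, so the fill value

print(chunk_shape, compression, compression_level)
print(corner, untouched)
```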
You can store and retrieve data using NumPy-like slicing syntax. The following slice mechanisms are supported:
- Integers/slices (array[2:11:3], etc)
- Ellipsis indexing (array[2,...,4:7])
- Simple broadcasting (array[2] is equivalent to array[2,...])
- Index lists (array[ 2, [0,1,4,6] ])
along with some emulated advanced indexing features (see Sparse selection):
- Boolean array indexing (array[ array[...] > 0.5 ])
- Discrete coordinate selection (see the selections module)
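The mechanisms listed above can be tried side by side. The sketch below uses a small 3-d dataset (its shape and the file name are arbitrary choices for the example) and compares each selection against the same indexing expression applied to the in-memory NumPy array.

```python
import os
import tempfile

import h5py
import numpy as np

arr = np.arange(3 * 4 * 8, dtype="f8").reshape(3, 4, 8)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "slicing_demo.hdf5")

    with h5py.File(path, "w") as f:
        dset = f.create_dataset("grid", data=arr)

        strided = dset[0:3:2]                  # integers/slices
        with_ellipsis = dset[2, ..., 4:7]      # ellipsis indexing
        short_form = dset[2]                   # same as dset[2, ...]
        long_form = dset[2, ...]
        picked = dset[2, 1, [0, 1, 4, 6]]      # index list (increasing order)

        mask = arr > 90                        # boolean array, same shape as dset
        masked = dset[mask]

        # Slicing works for writing as well.
        dset[0, 0, :] = -1.0
        first_row = dset[0, 0, :]

print(strided.shape, with_ellipsis.shape, picked, masked)
```

Each read returns an ordinary NumPy array, so the results can be passed straight to other NumPy code.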
You don’t need to do anything special to “close” datasets. However, as with Python files, you should close the file before exiting:
>>> f.close()
H5py tries to close all objects on exit (or when they are no longer referenced), but it’s good practice to close your files anyway.
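One convenient way to make sure the file is always closed is to open it in a with block, since File objects work as context managers. A minimal sketch, with a made-up file name; it relies on File objects evaluating to False once they are closed:

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "closing_demo.hdf5")

    with h5py.File(path, "w") as f:
        dset = f.create_dataset("MyDataset", (100, 100), "i")
        dset[...] = 42
        open_inside = bool(f)    # True while the file is open

    # Leaving the block closed the file, even if an exception had occurred.
    closed_after = not f

print(open_inside, closed_after)
```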
When creating the dataset above, we gave it a name:
>>> dset.name
'/MyDataset'
This bears a suspicious resemblance to a POSIX filesystem path; in this case, we say that MyDataset resides in the root group (/) of the file. You can create other groups as well:
>>> subgroup = f.create_group("SubGroup")
>>> subgroup.name
'/SubGroup'
They can in turn contain new datasets or additional groups:
>>> dset2 = subgroup.create_dataset('MyOtherDataset', (4,5), '=f8')
>>> dset2.name
'/SubGroup/MyOtherDataset'
You can access the contents of groups using dictionary-style syntax, using POSIX-style paths:
>>> dset2 = subgroup['MyOtherDataset']
>>> dset2 = f['/SubGroup/MyOtherDataset'] # equivalent
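Paths also work when creating objects. In the sketch below (the "Deeper" and "Leaf" names and the file name are invented for illustration), a dataset is created through a multi-part path, h5py creates the missing intermediate group along the way, and visit() is used to list everything in the file.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "paths_demo.hdf5")

    with h5py.File(path, "w") as f:
        subgroup = f.create_group("SubGroup")
        subgroup.create_dataset("MyOtherDataset", (4, 5), "=f8")

        # "Deeper" does not exist yet; it is created automatically.
        f.create_dataset("SubGroup/Deeper/Leaf", (2,), "i4")

        # Step-by-step lookup and full-path lookup reach the same object.
        leaf_name = f["SubGroup"]["Deeper"]["Leaf"].name
        parent_name = f["/SubGroup/Deeper/Leaf"].parent.name

        # visit() calls the function once per object, with its path
        # relative to the starting group.
        visited = []
        f.visit(visited.append)

print(leaf_name, parent_name)
print(sorted(visited))
```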
Groups (including File objects; "f" in this example) support other dictionary-like operations:
>>> list(f)
['MyDataset', 'SubGroup']
>>> 'MyDataset' in f
True
>>> 'SubGroup/MyOtherDataset' in f
True
>>> del f['MyDataset']
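Here is the same idea as a standalone script, with a few more of the dictionary-style methods thrown in. This is only a sketch; the file name and the "NoSuchObject" key are made up for the example.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "dict_demo.hdf5")

    with h5py.File(path, "w") as f:
        f.create_dataset("MyDataset", (10,), "i4")
        f.create_group("SubGroup")

        member_names = sorted(f.keys())
        member_count = len(f)
        # items() yields (name, object) pairs, like a dict.
        kinds = {name: type(obj).__name__ for name, obj in f.items()}
        # get() returns None instead of raising for a missing name.
        missing = f.get("NoSuchObject")

        del f["MyDataset"]
        remaining = list(f.keys())

print(member_names, member_count, kinds, missing, remaining)
```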
As a safety feature, you can’t create an object with a pre-existing name; you have to manually delete the existing object first:
>>> grp = f.create_group("NewGroup")
>>> grp = f.create_group("NewGroup")
ValueError: Name already exists (Symbol table: Object already exists)
>>> del f['NewGroup']
>>> grp = f.create_group("NewGroup")
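If you would rather reuse an existing group than delete it, require_group() returns the group when it already exists and creates it otherwise. The sketch below shows both approaches; the file name and the "payload" dataset are invented for illustration.

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "names_demo.hdf5")

    with h5py.File(path, "w") as f:
        grp = f.create_group("NewGroup")
        grp.create_dataset("payload", (3,), "i4")

        # A second create_group with the same name is refused.
        try:
            f.create_group("NewGroup")
            duplicate_rejected = False
        except ValueError:
            duplicate_rejected = True

        # Option 1: reuse the existing group, contents intact.
        reused = f.require_group("NewGroup")
        reused_members = list(reused.keys())

        # Option 2: delete it explicitly, then create a fresh one.
        del f["NewGroup"]
        fresh = f.create_group("NewGroup")
        fresh_members = list(fresh.keys())

print(duplicate_rejected, reused_members, fresh_members)
```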
HDF5 lets you associate small bits of data with both groups and datasets. This can be used for metadata like descriptive titles or timestamps.
A dictionary-like object which exposes this behavior is attached to every Group and Dataset object as the attribute attrs. You can store any scalar or array value you like:
>>> dset.attrs
<Attributes of HDF5 object "MyDataset" (0)>
>>> dset.attrs["Name"] = "My Dataset"
>>> dset.attrs["Frob Index"] = 4
>>> dset.attrs["Order Array"] = np.arange(10)
>>> for name, value in dset.attrs.items():
...     print(name + ":", value)
...
Name: My Dataset
Frob Index: 4
Order Array: [0 1 2 3 4 5 6 7 8 9]
Attribute proxy objects support the same dictionary-like API as groups, but unlike group members, you can directly overwrite existing attributes:
>>> dset.attrs["Name"] = "New Name"
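Put together as a runnable script, the attribute workflow looks roughly like this. It is a sketch using a temporary file; the file name and the "created_by" attribute are made up to show that File objects carry attributes too.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this sketch.
    path = os.path.join(tmpdir, "attrs_demo.hdf5")

    with h5py.File(path, "w") as f:
        dset = f.create_dataset("MyDataset", (100, 100), "i")

        dset.attrs["Name"] = "My Dataset"
        dset.attrs["Frob Index"] = 4
        dset.attrs["Order Array"] = np.arange(10)

        # Unlike group members, attributes can be overwritten directly.
        dset.attrs["Name"] = "New Name"

        # The root group (the File object) has attrs as well.
        f.attrs["created_by"] = "quick-start sketch"

        attr_names = sorted(dset.attrs.keys())
        name = dset.attrs["Name"]
        frob = int(dset.attrs["Frob Index"])
        order = dset.attrs["Order Array"]

        # Attributes can be deleted like dictionary entries.
        del dset.attrs["Frob Index"]
        has_frob = "Frob Index" in dset.attrs

print(attr_names, name, frob, order, has_frob)
```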
In addition to this basic behavior, HDF5 has a lot of other goodies. Some of these features are:
Full documentation on files, groups, datasets and attributes is available in the section “Reference Documentation”.