H5PY – A Python Package to Store Big Data Efficiently

By: onestop_data

This tutorial shows how to use h5py, a Python package for storing big data efficiently. It focuses mainly on creating and reading HDF5 files.

1. What is H5PY?

h5py is a package that interfaces Python with the HDF5 binary data format, letting you store large amounts of numerical data and manipulate it with NumPy.

2. Importance of H5PY

h5py lets you store and manipulate large amounts of numerical data. Imagine that you need to store large amounts of data with quick access; plain text files will not work well. Scientists, for example, run cosmological simulations that generate huge quantities of data. To analyze them, the exact dataset the scientists want must be accessible quickly and painlessly. h5py works well in such cases.

The underlying HDF5 format is a powerful, fast binary format with no upper limit on file size. It supports parallel I/O and includes many low-level optimizations that make queries faster and reduce memory requirements.

Multi-terabyte datasets can be sliced as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want. h5py uses familiar NumPy and Python metaphors, such as NumPy array syntax and dictionaries. For example, you can iterate over the datasets in a file, or check dataset attributes such as .shape or .dtype.
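The dictionary-style and NumPy-style access described above can be sketched as follows. This is a minimal illustration, not part of the original tutorial: the file name simulation.h5, the group run_001, the dataset density, and the units attribute are all hypothetical names chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name, placed in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "simulation.h5")

with h5py.File(path, "w") as f:
    # Groups act like folders and can be used to categorize datasets
    run = f.create_group("run_001")
    dset = run.create_dataset("density", data=np.arange(1000.0).reshape(100, 10))
    # Attributes are small pieces of metadata used to tag a dataset
    dset.attrs["units"] = "g/cm^3"

with h5py.File(path, "r") as f:
    density = f["run_001/density"]
    # Iterating over a group yields the names it contains, like a dictionary
    names = list(f["run_001"])
    shape, dtype = density.shape, density.dtype
    units = density.attrs["units"]
    # Slicing reads only the requested region into a NumPy array
    first_rows = density[:2, :3]

print(names, shape, dtype, units)
print(first_rows)
```

Only the sliced region is read from disk, which is what makes this pattern practical for datasets larger than memory.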

While h5py is an easy-to-use high-level interface, it rests on an object-oriented Cython wrapping of the HDF5 C API, so almost anything you can do from C in HDF5 can also be done from h5py. In addition, the files it creates are in a widely used standard binary format, so they can be exchanged with people who use other tools such as MATLAB and IDL. Installing HDF5 directly can be a pain, whereas h5py is simple to install with your favorite package manager.

3. Installation of H5PY

A pre-built installation is the recommended way to install h5py. You can get one through a Python distribution, h5py wheels, or an OS-specific package manager.

# installing it with conda
$ conda install h5py

# installing it with pip
$ pip install h5py
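After installing, a quick way to confirm that the package imports correctly is to print its version along with the version of the HDF5 library it was built against. This is a small sanity check, not a required step:

```python
import h5py

# Version of the h5py package itself
print(h5py.__version__)
# Version of the underlying HDF5 library that h5py was built against
print(h5py.version.hdf5_version)
```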

4. Write HDF5 files

Next, we show how to write HDF5 files using Python. First, we import h5py and numpy:

import numpy as np
import h5py

Then we create two numpy arrays of random numbers: the first with dimensions 100 x 100 and the second with dimensions 200 x 200.

dataset_1 = np.random.random(size = (100,100))
dataset_2 = np.random.random(size = (200,200))

As the datasets are numpy arrays, we can confirm the dataset dimensions:

print(dataset_1.shape, dataset_2.shape)
>>> (100, 100) (200, 200)

Finally, now that the datasets have been created, we can use the h5py library to store the data in the HDF5 format.

with h5py.File('my_data.h5', 'w') as hf_object:
    hf_object.create_dataset('dataset_1', data=dataset_1)
    hf_object.create_dataset('dataset_2', data=dataset_2)

If you want to compress the data in the HDF5 file, add the parameter compression="gzip" to create_dataset.
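As a rough sketch of what the compression option does, the example below writes the same array twice, once uncompressed and once with gzip, and compares the file sizes. The file names, the temporary directory, and the all-zeros array (chosen because it compresses very well) are assumptions made for this illustration; random data like dataset_1 would shrink far less.

```python
import os
import tempfile

import h5py
import numpy as np

# An array of zeros is highly compressible (illustrative choice)
data = np.zeros((200, 200))

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")  # hypothetical file names
gzip_path = os.path.join(tmp_dir, "gzip.h5")

with h5py.File(plain_path, "w") as hf_object:
    hf_object.create_dataset("dataset", data=data)

with h5py.File(gzip_path, "w") as hf_object:
    # compression_opts sets the gzip level (0-9); 4 is a moderate setting
    hf_object.create_dataset(
        "dataset", data=data, compression="gzip", compression_opts=4
    )

plain_size = os.path.getsize(plain_path)
gzip_size = os.path.getsize(gzip_path)
print(plain_size, gzip_size)

# Compression is lossless and transparent: reading needs no extra arguments
with h5py.File(gzip_path, "r") as hf_object:
    restored = hf_object["dataset"][:]
print(np.array_equal(restored, data))
```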

5. Reading HDF5 files

Last but not least, now that we have written some data to the HDF5 file, we want to read it. This can be done as follows:

Read HDF5 file

hf_object = h5py.File('my_data.h5', 'r')

Print the dataset names in the HDF5 file

print(hf_object.keys())
>>> <KeysViewHDF5 ['dataset_1', 'dataset_2']>

Access a dataset

my_dataset_1 = hf_object.get('dataset_1')
print(my_dataset_1)
>>> <HDF5 dataset "dataset_1": shape (100, 100), type "<f8">

Don’t forget to close the file object when done.

hf_object.close()
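The object returned by get is a handle to the dataset on disk rather than a NumPy array, and it stops being usable once the file is closed. One way to handle this, sketched below, is to open the file in a with block (so it is closed automatically) and slice the dataset to copy its values into memory. To keep the example self-contained it first writes a small file; the temporary directory is an assumption of this sketch.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small file first so the example runs on its own
path = os.path.join(tempfile.mkdtemp(), "my_data.h5")
original = np.random.random(size=(100, 100))
with h5py.File(path, "w") as hf_object:
    hf_object.create_dataset("dataset_1", data=original)

# The with block closes the file automatically, even if an error occurs
with h5py.File(path, "r") as hf_object:
    # [:] copies the whole dataset into an in-memory NumPy array
    full_array = hf_object["dataset_1"][:]
    # A partial slice reads only the requested region
    corner = hf_object["dataset_1"][:5, :5]

# The arrays remain usable after the file has been closed
print(type(full_array), full_array.shape)
print(corner.shape)
```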
