Tutorials

This section has been moved to ipython notebook tutorials.

Tutorial on carray objects

This section has been moved to ipython notebook tutorial_carray.

Tutorial on ctable objects

This section has been moved to ipython notebook tutorial_ctable.

Writing bcolz extensions

Did you like bcolz but you couldn’t find exactly the functionality you were looking for? You can write an extension and implement complex operations on top of bcolz containers.

Before you start writing your own extension, let’s see some examples of real projects made on top of bcolz:

  • Bquery: a query and aggregation framework that, among other things, provides group-by functionality for bcolz containers. See https://github.com/visualfabriq/bquery

  • Bdot: provides big dot products (by making your RAM bigger on the inside). Supports matrix . vector and matrix . matrix for most common numpy numeric data types. See https://github.com/tailwind/bdot

Though not an extension itself, Dask is worth mentioning. Dask plays nicely with bcolz and provides multi-core execution on larger-than-memory datasets using blocked algorithms and task scheduling. See https://github.com/dask/dask.

In addition, bcolz interacts well with itertools, Pytoolz and Cytoolz, which might already offer the performance and functionality you are after.
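As a rough illustration of the itertools route, the sketch below sums an iterable block by block, so that only one block of values is held in memory at a time. The helper name blocked_sum and the block length are hypothetical, and a plain range stands in for a carray so that the snippet runs without bcolz installed; since a carray can be iterated like any other Python iterable, it could be passed in the same way.

```python
from itertools import islice


def blocked_sum(iterable, blen=256):
    """Sum an iterable in blocks of at most `blen` items (hypothetical helper)."""
    it = iter(iterable)
    total = 0
    while True:
        # Pull the next block lazily; an empty block means the input is exhausted.
        block = list(islice(it, blen))
        if not block:
            break
        total += sum(block)
    return total


# A plain range stands in for a carray here (assumption for illustration).
result = blocked_sum(range(1000))
```

Whether this is fast enough depends on the workload; when it is not, a compiled extension such as the one described in the next section is the usual next step.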

In the next section we will go through all the steps needed to write your own extension on top of bcolz.

How to use bcolz as part of the infrastructure

Go to the root directory of bcolz. Inside docs/my_package/ you will find a small extension example.

Before you can run this example you will need to install the following packages. Run pip install cython, pip install numpy and pip install bcolz to install them. If you prefer the Conda package manager, run conda install cython numpy bcolz instead and you should be ready to go. See requirements.txt:

cython>=0.20
numpy>=1.7.0
bcolz>=0.8.0

Once you have those packages installed, change your working directory to docs/my_package/ (the package example) and run python setup.py build_ext --inplace from the terminal. If everything ran smoothly, you should see a binary file my_extension/example_ext.so next to the .pyx file.

If you have any problems compiling these extensions, please make sure you have a recent version of bcolz as old versions (pre 0.8) don’t contain the necessary .pxd file which provides a Cython interface to the carray Cython module.

The setup.py file is where you tell the compiler the name of your package, the location of external libraries (in case you want to use them), compiler directives and so on. See the bcolz setup.py as a possible reference for a more complete example. As your project grows in complexity, you might be interested in passing other options to your Extension object, e.g. include_dirs, a list of directories to search for C/C++ header files your code might depend on.

See my_package/setup.py:

from setuptools import setup, Extension
from Cython.Distutils import build_ext
from numpy.distutils.misc_util import get_numpy_include_dirs


# Sources
sources = ["my_extension/example_ext.pyx"]

setup(
    name="my_package",
    description='My description',
    license='MY_LICENSE', 
    ext_modules=[
        Extension(
            "my_extension.example_ext",
            sources=sources,
        ),
    ],
    cmdclass={"build_ext": build_ext},
    packages=['my_extension'],
)
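The include_dirs option mentioned above could be added along these lines. This is only a sketch: the directory vendor/include is a hypothetical location for extra C/C++ headers and is not part of the example package.

```python
from setuptools import Extension

# "vendor/include" is a hypothetical directory holding extra C/C++ headers.
# NumPy's own header directory can be obtained with numpy.get_include().
example_ext = Extension(
    "my_extension.example_ext",
    sources=["my_extension/example_ext.pyx"],
    include_dirs=["vendor/include"],
)

# The object would then be passed to setup(ext_modules=[example_ext], ...).
```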

The .pyx file is where the Cython code implementing the extension lives. In the example below, the function returns the sum of all integers inside the carray.

See my_package/my_extension/example_ext.pyx

Keep in mind that carrays are great for sequential access, but random access will very likely trigger the decompression of a different chunk for each randomly accessed value.

For more information about Cython visit http://docs.cython.org/index.html

import cython
import bcolz as bz
from bcolz.carray_ext cimport carray
from numpy cimport ndarray, npy_int64

@cython.overflowcheck(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef my_function(carray ca):
    """
        Function for example purposes
        
        >>> import bcolz as bz
        >>> import my_extension.example_ext as my_mod
        >>> c = bz.carray([i for i in range(1000)], dtype='i8')
        >>> my_mod.my_function(c)
        499500

    """

    cdef:
        ndarray ca_segment
        Py_ssize_t len_ca_segment
        Py_ssize_t i  # typed loop index
        npy_int64 sum=0

    for ca_segment in bz.iterblocks(ca):
        len_ca_segment = len(ca_segment)
        for i in range(len_ca_segment):
            sum = sum + ca_segment[i]
        
    return sum

Let’s test our extension:

>>> import bcolz
>>> import my_extension.example_ext as my_mod
>>> c = bcolz.carray([i for i in range(1000)], dtype='i8')
>>> my_mod.my_function(c)
499500
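The earlier note about sequential versus random access can be made concrete with a toy model. The sketch below is not how bcolz is implemented: ToyChunkedArray is a hypothetical stand-in that compresses fixed-size chunks with zlib and keeps only the most recently used chunk decompressed, which is enough to count how often each access pattern forces a decompression.

```python
import random
import zlib
from array import array


class ToyChunkedArray:
    """Toy chunked, compressed container (hypothetical; not bcolz internals)."""

    def __init__(self, values, chunklen):
        self.chunklen = chunklen
        # Compress each fixed-size chunk of 64-bit integers separately.
        self.chunks = [
            zlib.compress(array("q", values[i:i + chunklen]).tobytes())
            for i in range(0, len(values), chunklen)
        ]
        self.decompressions = 0
        self._cached_index = None
        self._cached_chunk = None

    def __getitem__(self, i):
        index, offset = divmod(i, self.chunklen)
        if index != self._cached_index:
            # Cache miss: decompress the chunk that holds element i.
            buf = array("q")
            buf.frombytes(zlib.decompress(self.chunks[index]))
            self._cached_chunk = buf
            self._cached_index = index
            self.decompressions += 1
        return self._cached_chunk[offset]


def count_decompressions(indices, n=10_000, chunklen=1_000):
    """Read `indices` from a fresh toy array; return (sum, decompression count)."""
    toy = ToyChunkedArray(list(range(n)), chunklen)
    total = sum(toy[i] for i in indices)
    return total, toy.decompressions


sequential = list(range(10_000))
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)  # same indices, random order

seq_total, seq_count = count_decompressions(sequential)
rnd_total, rnd_count = count_decompressions(shuffled)
print(seq_count, rnd_count)
```

Both runs read the same values and produce the same sum, but the sequential run decompresses each chunk exactly once while the shuffled run decompresses a chunk for most accesses. This is why the example extension walks the carray block by block instead of indexing individual elements.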