Fast creation of 2d+ memory views in Cython - python

We are developing a project where we create a lot of typed memory views as numpy arrays, but we often have to deal with small arrays, so creating the memory views as np.ndarray has a big impact because it involves a lot of interaction with Python. I found that there are some "tricks" to speed up the initialization step using CPython arrays and the clone function for 1d numeric arrays (found in What is the recommended way of allocating memory for a typed memory view?), but we are stuck on generalizing this to numeric 2d arrays. Is that possible? If not, what would be a faster way to create 2d arrays than using np.ndarray? We tried Cython arrays, but they were even slower to initialize than numpy ones. Is manually managing the memory the only solution? If so, can you give me some tips for doing that in the 2d case? I found the documentation a bit limited in this respect, since my knowledge of C is very little (but I'm trying to improve on that!). Thanks for the help in advance.
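For reference, here is a minimal sketch of the manual-memory route in the 2d case, assuming a C-contiguous buffer of doubles (the function name and the fill loop are illustrative, not from the post). Cython can cast a raw pointer to a typed memoryview, so no np.ndarray is constructed at all:

    # Hedged sketch: a 2D typed memoryview over malloc'd memory.
    from libc.stdlib cimport malloc, free

    def sum_of_squares(Py_ssize_t rows, Py_ssize_t cols):
        cdef double* data = <double*> malloc(rows * cols * sizeof(double))
        cdef double[:, ::1] view
        cdef Py_ssize_t i, j
        cdef double total = 0.0
        if data == NULL:
            raise MemoryError()
        view = <double[:rows, :cols]> data   # pointer -> 2D memoryview cast
        try:
            for i in range(rows):
                for j in range(cols):
                    view[i, j] = 1.0         # placeholder fill
                    total += view[i, j] * view[i, j]
        finally:
            free(data)                       # the view must not outlive this
        return total

The cast itself is cheap; the price is that you own the memory, so every allocation needs a matching free (hence the try/finally).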

Related

How to stack a collection of 2D arrays of unequal shape in Cython

I want to create a Cython function that will compile a rather large image pyramid, then pass that image pyramid to another Cython function for additional processing. At issue is the fact that the 2D arrays in the image pyramid do not have the same shape. The number of pyramid layers and the dimensions of each layer are known ahead of time (i.e., I don't necessarily need a container that can be dynamically allocated and would prefer to avoid such a container for speed reasons). In pure Python, I would probably load the various pyramid layers into a pre-initialized list and pass the list as an argument to the function that needs the pyramid. Since I'm now working in Cython where the goal is improved processing speed, I'm trying to avoid any container that might ultimately slow down processing. Memory views seem to be a good option for speed; however, it seems inefficient in terms of memory to force an image pyramid with a large base layer size into a memory view. I've read the answers to the following stack overflow question:
Python list equivalent in C++?
As a guess, at least one of the suggestions at the cited link might address my question -- the problem is which suggestion? I'm relatively new to Cython and have an even more limited knowledge of C or C++, so hoping someone can point me in the right direction to help limit the number of possibilities I need to research.
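One option consistent with the pre-initialized-list idea above, sketched here as a guess rather than a definitive answer: keep a plain Python list with one typed memoryview per layer. The list lookup is a Python-level operation, but it happens only once per layer, while the per-pixel loops run at C speed (the function name is ours):

    def process_pyramid(list pyramid):
        cdef double[:, ::1] layer
        cdef Py_ssize_t i, j
        cdef double total = 0.0
        for obj in pyramid:
            layer = obj                      # buffer acquired once per layer
            for i in range(layer.shape[0]):
                for j in range(layer.shape[1]):
                    total += layer[i, j]     # C-speed inner loops
        return total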

Sorting performance comparison between numpy array, python list, and Fortran

I have been using Fortran for my computational physics related work for a long time, and recently started learning and playing around with Python. I am aware that, being an interpreted language, Python is generally slower than Fortran for primarily CPU-intensive computational work. But then I thought using numpy would significantly improve the performance for a simple task like sorting.
So my test case was sorting an array/a list of size 10,000 containing random floats using bubble sort (just a test case with many array operations, so no need to comment on the performance of the algorithm itself). My timing results are as follows (all functions use identical algorithm):
Python3 (using numpy array, but my own function instead of numpy.sort): 33.115s
Python3 (using list): 9.927s
Fortran (gfortran) : 0.291s
Python3 (using numpy.sort): 0.269s (not a fair comparison, since it uses a different algorithm)
I was surprised that operating with a numpy array is ~3 times slower than with a python list, and ~100 times slower than Fortran. So at this point my questions are:
Why is operating with a numpy array significantly slower than with a python list for this test case?
In case an algorithm that I need is not already implemented in scipy/numpy, and I need to write my own function within the Python framework with best performance in mind, which data type should I operate with: numpy array or list?
If my applications are performance oriented, and I want to write functions with performance equivalent to built-in numpy functions (e.g. np.sort), what tools/frameworks should I learn/use?
You seem to be misunderstanding what NumPy does to speed up computations.
The speedup you get with NumPy does not come from NumPy storing your data in some clever way, nor from automatically compiling your Python code to C.
Instead, NumPy implements many useful algorithms in C or Fortran, numpy.sort() being one of them. These functions understand np.ndarrays as input and loop over the data in a C/Fortran loop.
If you want to write fast NumPy code there are really three ways to do that:
Break down your code into NumPy operations (multiplications, dot-product, sort, broadcasting etc.)
Write the algorithm you want to implement in C/Fortran and also write bindings to Python that accept np.ndarrays (internally they're a contiguous array of the type you've chosen).
Use Numba to speed up your function by Just-In-Time compiling it to machine code (with some limitations); a sketch follows below.
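As a hedged illustration of the third option (the bubble sort is just the question's test case; numba.njit is Numba's real decorator):

    import numpy as np
    from numba import njit

    @njit
    def bubble_sort(a):
        # Plain-Python loops, JIT-compiled to machine code on first call.
        n = a.shape[0]
        for i in range(n - 1):
            for j in range(n - 1 - i):
                if a[j] > a[j + 1]:
                    a[j], a[j + 1] = a[j + 1], a[j]

    x = np.random.rand(10_000)
    bubble_sort(x)   # first call compiles; later calls run at C-like speed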
Why is operating with a numpy array significantly slower than with a python list for this test case?
NumPy arrays are containers for numerical data. They hold metadata (the type and shape of the array) and a block of memory for the data itself. As a result, any Python-level operation on individual elements of a NumPy array involves some overhead. Python lists are better optimized for "plain Python" code: reading or writing a list element is faster than it is for a NumPy array. The benefit of NumPy arrays comes from whole-array (vectorized) operations and from compiled extensions: C/C++/Fortran, Cython or Numba can access the contents of NumPy arrays without that overhead.
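A quick way to see both effects, as a rough illustration (the snippet and numbers are ours, not the answerer's):

    import time
    import numpy as np

    a = np.random.rand(100_000)
    lst = a.tolist()

    t0 = time.perf_counter()
    s_list = sum(lst[i] for i in range(len(lst)))  # list: cheap element reads
    t1 = time.perf_counter()
    s_arr = sum(a[i] for i in range(len(a)))       # array: each a[i] boxes a new float
    t2 = time.perf_counter()
    s_vec = a.sum()                                # whole-array op: one C loop
    t3 = time.perf_counter()

    print(f"list {t1-t0:.4f}s  array {t2-t1:.4f}s  a.sum() {t3-t2:.4f}s")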
In case an algorithm that I need is not already implemented in scipy/numpy, and I need to write my own function within the Python framework with best performance in mind, which data type should I operate with: numpy array or list?
For numerical code, NumPy arrays are better: you can access their content "à la C" or "à la Fortran" in compiled extensions.
If my applications are performance oriented, and I want to write functions with performance equivalent to built-in numpy functions (e.g. np.sort), what tools/frameworks should I learn/use?
There are many. You can write in C using the NumPy C-API (complicated; I wouldn't advise it, but it is good to know that it exists). Cython is a mature "Python-like" and "C-like" language that enables easy and progressive performance improvements. Numba is a drop-in just-in-time compiler: provided that you restrict your code to direct numerical operations on NumPy arrays, Numba will convert the code on the fly to a compiled equivalent.
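For completeness, a hedged sketch of the Cython route applied to the question's test case (the directives and names are ours):

    # cython: boundscheck=False, wraparound=False
    # Static types plus a typed memoryview turn the inner loops into C loops.

    def bubble_sort(double[::1] a):
        cdef Py_ssize_t n = a.shape[0]
        cdef Py_ssize_t i, j
        cdef double tmp
        for i in range(n - 1):
            for j in range(n - 1 - i):
                if a[j] > a[j + 1]:
                    tmp = a[j]
                    a[j] = a[j + 1]
                    a[j + 1] = tmp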

Numpy vectorisation of python object array

Just a short question that I can't find the answer to before I head off for the day.
When I do something like this:
v1 = float_list_python = ... # <some list of floats>
v2 = float_array_NumPy = ... # <some numpy.ndarray of floats>
# I guess they don't have to be floats -
# but some object that also has a native
# object in C, so that numpy can just use
# that
If I want to multiply these vectors by a scalar, my understanding has always been that the python list is a list of object references, so looping through the list to do the multiplication must fetch the locations of all the floats and then fetch the floats themselves in order to do it, which is one of the reasons it's slow.
If I do the same thing in NumPy, then, well, I'm not sure what happens. There are a number of things I imagine could happen:
It splits the multiplication up across the cores.
It vectorises the multiplications (as well?)
The documentation I've found suggests that many of the primitives in numpy take advantage of the first option whenever they can (I don't have a computer on hand at the moment to test it on). And my intuition tells me that number 2 should happen whenever it's possible.
So my question is: if I create a NumPy array of python objects, will it still at least perform operations on the list in parallel? I know that if you create an array of objects that have native C types, then it will actually create a contiguous array in memory of the actual values, and that if you create a numpy array of python objects it will create an array of references, but I don't see why this would rule out parallel operations on said list, and cannot find anywhere that explicitly states that.
EDIT: I feel there's a bit of confusion over what I'm asking. I understand what vectorisation is, and that it is a compiler optimisation rather than something you necessarily program in (though aligning the data so that it's contiguous in memory is important). On the subject of vectorisation, all I wanted to know was whether or not numpy uses it. If I do something like np_array1 * np_array2, does the underlying library call use vectorisation (presuming that dtype is a compatible type)?
For the splitting up over the cores, all I mean there is: if I again do something like np_array1 * np_array2, but this time with dtype=object, would it divide that work up among the cores?
numpy is fast because it performs numeric operations like this in fast compiled C code. In contrast, the list operation runs at the interpreted Python level (streamlined as much as possible with Python bytecodes etc.).
A numpy array of numeric type stores those numbers in a data buffer. At least in the simple cases this is just a block of bytes that C code can step through efficiently. The array also has shape and strides information that allows multidimensional access.
When you multiply the array by a scalar, it, in effect, calls a C function titled something like 'multiply_array_by_scalar', which does the multiplication in fast compiled code. So this kind of numpy operation is fast (compared to Python list code) regardless of the number of cores or other multi-processing/threading enhancements.
Arrays of objects do not have any special speed advantage (compared to lists), at least not at this time.
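A small experiment that makes this concrete (our snippet; timings will vary by machine):

    import time
    import numpy as np

    a = np.arange(1_000_000, dtype=np.float64)
    o = a.astype(object)     # same values, stored as Python object references
    lst = a.tolist()

    for label, op in [("float64 array", lambda: a * 2),
                      ("object array", lambda: o * 2),
                      ("list comprehension", lambda: [x * 2 for x in lst])]:
        t0 = time.perf_counter()
        op()
        print(f"{label}: {time.perf_counter() - t0:.4f}s")

The float64 case runs one compiled loop over a contiguous buffer; the object-array and list cases both dispatch a Python multiplication per element.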
Look at my answer to a question about creating an array of arrays, https://stackoverflow.com/a/28284526/901925
I had to use iteration to initialize the values.
Have you done any time experiments? For example, construct an array, say (1000,2). Use tolist() to create an equivalent list of lists. And make a similar array of objects, with each object being a (2,) array or list (how much work did that take?). Now do something simple like len(x) for each of those sublists.
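One way to set up that experiment (our sketch of the suggestion above):

    import numpy as np

    arr = np.random.rand(1000, 2)
    lol = arr.tolist()                    # equivalent list of [float, float] lists

    obj = np.empty(1000, dtype=object)    # object arrays must be filled per element
    for i in range(1000):
        obj[i] = arr[i].copy()

    # Something simple to time across the containers:
    lens_list = [len(x) for x in lol]
    lens_obj = [len(x) for x in obj]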
@hpaulj provided a good answer to your question. In general, from reading your question it occurred to me that you may not actually understand what "vectorization" does under the hood. This writeup is a pretty decent explanation of vectorization and how it enables faster computations: http://quantess.net/2013/09/30/vectorization-magic-for-your-computations/
With regard to point 1, distributing computations across multiple cores: this is not always the case with Numpy. However, there are libraries like numexpr that enable multithreaded, highly efficient Numpy array computations with support for several basic logical and arithmetic operators. Numexpr can turbocharge critical computations when used in conjunction with Numpy, as it avoids replicating large arrays in memory for vectorization routines (as is the case for Numpy) and can use all the cores on your system to perform computations.
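A minimal numexpr sketch (numexpr.evaluate is its real entry point; the expression itself is ours):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # Evaluated in chunks across all cores, without the large temporaries
    # the equivalent NumPy expression 2*a + 3*b would allocate.
    c = ne.evaluate("2*a + 3*b")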

Many Numpy Arrays vs Few Python Classes

I'm working on a project that heavily relies on reasonably small 2D Numpy arrays to represent our data (where small is around 100 x 4, at worst). The problem with this representation is that it is extremely cumbersome and unclear given the context of the problem, terrible to maintain, and very bug-prone. For example, my current task, in the context of these arrays, requires adding and removing multiple rows, and allocating a large number of them. I decided that an object-oriented, graph-based approach would be far clearer to use on my end and more generic for future development. After sitting down and designing most of the functionality on paper, there will probably be 5 or 6 larger classes with internal dictionaries (representing adjacency lists for graphs) that remove the need for Numpy arrays and are much clearer to interpret.
It is well known that Numpy arrays blow Python data structures out of the water in terms of indexing speed and most other functionality. However, I also know that Python dictionaries are highly optimized and are fairly efficient in their own right. My dilemma is whether or not reallocating many Numpy arrays and copying values back and forth is more or less efficient than a few persisting classes that can be edited easily without reallocation, but will have slower indexing time than a Numpy array.
It's worth noting that I'm not using much of Numpy functionality other than np.where(), which can be worked around if necessary. Also the algorithm scales not in the size of arrays but in the number of arrays; a harder problem will have more arrays, but not necessarily larger ones.
Thanks!

Can I statically type a h5file array in Cython?

I develop a library that uses Cython at a low level to solve flow problems across 2D arrays. If these arrays are numpy arrays I can statically type them, thus avoiding the Python interpreter overhead of random access into those arrays. To handle arrays so big they don't fit in memory, I plan to use h5file CArrays from pytables in place of numpy, but I can't figure out if it's possible to statically type a CArray.
Is it possible to statically type h5file CArrays in Cython to avoid Python interpreter overhead when randomly accessing those arrays?
If you use the h5py package, you can use numpy.asarray() on the datasets it gives you, then you have a more familiar NumPy array that you already know how to deal with.
Please note that h5py had a bug related to this until a couple of years ago, which caused disastrously slow performance when doing asarray(); this was fixed, so don't use a very old version if you're going to try this.
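A hedged sketch of that suggestion (the file and dataset names are hypothetical):

    import numpy as np
    import h5py

    with h5py.File("flows.h5", "r") as f:   # hypothetical file
        dset = f["terrain"]                 # hypothetical dataset name
        arr = np.asarray(dset)              # materialise as an ordinary ndarray
    # arr can now be passed to a Cython function typed e.g. double[:, ::1]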
