I have a class and its method. The method is called many times during execution and uses a numpy array as a temporary buffer; I don't need to keep the buffer's values between calls. Should I make the array a member attribute to avoid the overhead of repeated memory allocation? I know that local variables are usually preferred, but is Python smart enough to allocate the memory for the array only once?
class MyClass:
    def __init__(self, n):
        self.temp = numpy.zeros(n)

    def method(self):
        # do some stuff using self.temp
        ...
Or
class MyClass:
    def __init__(self, n):
        self.n = n

    def method(self):
        temp = numpy.zeros(self.n)
        # do some stuff using temp
        ...
Update: replaced empty with zeros
Numpy arrays are fast once created; creating one, however, is pretty expensive, much more so than creating a Python list.
In a case such as yours, where you create a new array again and again (in a for loop?), I would ALWAYS pre-allocate the array structure and then reuse it.
I can't comment on whether Python is smart enough to optimize this, but I would guess it's not :)
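A minimal sketch of the reuse pattern (the class, `method`, and its computation are hypothetical placeholders, just to show the shape of the idea):

```python
import numpy as np

class MyClass:
    """Sketch: preallocate the scratch buffer once, reuse it on every call."""
    def __init__(self, n):
        self.temp = np.zeros(n)  # allocated exactly once

    def method(self, x):
        # overwrite the buffer in place; no per-call allocation
        np.multiply(x, 2.0, out=self.temp)
        return self.temp.sum()
```

The key point is that `method` only ever writes into memory that already exists, so the allocation cost is paid once in `__init__`.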
How big is your array and how frequent are calls to this method?
Yes, you need to preallocate large arrays. But whether this will be efficient depends on how you use these arrays afterwards.
This will cause several new allocations for intermediate results of computation:
self.temp = a * b + c
This will not (if self.x is preallocated):
numpy.multiply(a, b, out=self.x)
numpy.add(c, self.x, out=self.temp)
But for these cases (when you work with large arrays in non-trivial formulae) it is better to use numexpr, or einsum for matrix calculations.
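A runnable version of the preallocated `out=` pattern (array names and sizes here are arbitrary examples):

```python
import numpy as np

n = 1000
rng = np.random.default_rng(0)
a, b, c = rng.random(n), rng.random(n), rng.random(n)

# scratch buffers, allocated once up front
x = np.empty(n)
temp = np.empty(n)

np.multiply(a, b, out=x)   # x = a * b, written into existing memory
np.add(c, x, out=temp)     # temp = a * b + c, no intermediate array allocated
```

Unlike `temp = a * b + c`, no hidden temporary array is created for the product.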
I was working on increasing the performance of existing Python applications lately. What I found is that arranging data in arrays of basic data types (as a struct of arrays) instead of an array of structs/classes can increase performance. If data is saved in contiguous memory it also makes outsourcing heavy calculations to the GPU easier. My goal is to provide our users with a way to make use of SOA without having to know about numpy, numba, simd, etc.
Intel provides a template library and containers that can generate simd friendly data layouts from a struct/class.
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/libraries/introduction-to-the-simd-data-layout-templates.html
This already gives an idea how something similar could be done in Python.
So, instead of reading objects from memory (that can be distributed somewhere in memory),
class A:
    def __init__(self, x):
        self.x = x

a_s = [A(1) for _ in range(10)]
for a in a_s:
    a.x = 2
I would like to have the data accessible as numpy array AND as instance of class A. So, that data can be accessed something like this:
sdlt_container = generate_SDLT(A(), 10)
a = sdlt_container[2]        # returns an instance of class A
a.x = 2                      # writes through a view into the numpy array, setting x[2] = 2
sdlt_container.x[0:5] = 3    # change x in several "instances of A" at once
Accessing the data as an instance of class A might involve creating a new instance of A, but the variables in this object should "point to" the correct index in the numpy array. I understand that optimizations like those the Intel compiler performs on a for loop are not possible in Python (interpreted vs. compiled).
Thanks for any ideas!
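One way to sketch such a container in plain numpy. All names here (SoAContainer, ElementView) are hypothetical, and a real SDLT-style library would need far more machinery; this only shows that per-field contiguous arrays plus a small proxy object can support both access styles:

```python
import numpy as np

class ElementView:
    """Proxy whose attributes read and write one index of the parent's arrays."""
    def __init__(self, container, index):
        # bypass our own __setattr__ while wiring up the proxy
        object.__setattr__(self, '_container', container)
        object.__setattr__(self, '_index', index)

    def __getattr__(self, name):
        return self._container.arrays[name][self._index]

    def __setattr__(self, name, value):
        # writes go straight into the contiguous array
        self._container.arrays[name][self._index] = value

class SoAContainer:
    """Struct-of-arrays storage: one contiguous numpy array per field."""
    def __init__(self, fields, n):
        self.arrays = {name: np.zeros(n, dtype=dt) for name, dt in fields.items()}

    def __getattr__(self, name):
        # whole-column access, e.g. container.x[0:5] = 3
        try:
            return self.arrays[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, i):
        return ElementView(self, i)
```

Element access creates a tiny proxy object, but all numeric data stays in the contiguous arrays, so bulk operations and GPU transfer work on plain numpy buffers.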
I'm confused about how numpy methods are applied to nd-arrays. For example:
import numpy as np
a = np.array([[1,2,2],[5,2,3]])
b = a.transpose()
a.sort()
Here the transpose() method does not change a but returns a transposed view of it, while the sort() method sorts a in place and returns None. Does anybody have an idea why this is, and what the purpose of this differing behaviour is?
Because the numpy authors decided that some methods operate in place and some don't. Why? I don't know if anyone but them can answer that question.
'in-place' operations have the potential to be faster, especially when dealing with large arrays, as there is no need to re-allocate and copy the entire array, see answers to this question
BTW, most if not all array methods have a module-level function counterpart that returns a new array. For example, arr.sort() has the counterpart numpy.sort(arr), which accepts an array and returns a new, sorted one (much like the relationship between the built-in sorted() and list.sort()).
In a Python class (OOP) methods which operate in place (modify self or its attributes) are acceptable, and if anything, more common than ones that return a new object. That's also true for built in classes like dict or list.
For example, in numpy we often recommend the list-append approach to building a new array:
In [296]: alist = []
In [297]: for i in range(3):
...: alist.append(i)
...:
In [298]: alist
Out[298]: [0, 1, 2]
This is common enough that we can readily write it as a list comprehension:
In [299]: [i for i in range(3)]
Out[299]: [0, 1, 2]
alist.sort operates in-place, sorted(alist) returns a new list.
In numpy, methods that return a new array are much more common. In fact, sort is about the only in-place method I can think of offhand; that, and direct modification of shape: arr.shape = (...).
A number of basic numpy operations return a view. That shares data memory with the source, but the array object wrapper is new. In fact even indexing an element returns a new object.
So while you ultimately need to check the documentation, it's usually safe to assume a numpy function or method returns a new object, as opposed to operating in-place.
More often users are confused by the numpy functions that have the same name as a method. In most of those cases the function makes sure the argument(s) is an array, and then delegates the action to its method. Also keep in mind that in Python operators are translated into method calls - + to __add__, [index] to __getitem__() etc. += is a kind of in-place operation.
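The three behaviours described above (view, in-place, new array) can be seen directly on the example from the question:

```python
import numpy as np

a = np.array([[1, 2, 2], [5, 2, 3]])

b = a.transpose()   # a new array object, but a *view*: memory is shared with a
assert b.base is a

ret = a.sort()      # in-place: sorts each row of a, returns None
assert ret is None

s = np.sort(np.array([3, 1, 2]))   # function version: returns a new sorted array
```

Because b is a view, sorting a also changes what b shows, even though b was created before the sort.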
I'm currently writing code that can be heavily parallelized using GPUs. My code structure essentially looks like this:
Create two arrays, let's call them A and B of length N. (CPU)
Perform NxN calculations that eventually return a scalar. These calculations only depend on A and B and can therefore be parallelized. (GPU)
Gather all these scalars in a list and take the smallest one. (CPU)
Modify A and B with this scalar (CPU)
Go back to step 2 and repeat until a certain condition is met.
Most examples are very illustrative but they all seem to work like this: Execute the major part of the code on the CPU and only perform intermediate matrix multiplications etc. on the GPU. In particular the host usually knows all the variables the kernel is going to use.
For me it's exactly the other way around: I want to perform the major part of the code on the GPU and only a very small number of steps on the CPU itself. My host knows literally nothing about what's going on inside the individual threads; it only manages the list of scalars as well as my arrays A and B.
My questions are therefore:
How do I properly define variables inside a kernel? In particular, how do I define and initialize arrays/lists?
How do I write a device function that returns an array? (see MatrixMultiVector below; it doesn't work)
Why can I not use numpy and other libraries inside CUDA Kernels? What alternatives do I have?
An example of what I currently have looks like this:
from __future__ import division
import numpy as np
from numbapro import *

# Device Functions
#----------------------------------------------------------------------

# Works and can be called correctly from TestKernelScalar
@cuda.jit('float32(float32, float32)', device=True)
def myfuncScalar(a, b):
    return a + b

# Works and can be called correctly from TestKernelArray
@cuda.jit('float32[:](float32[:])', device=True)
def myfuncArray(A):
    for k in xrange(4):
        A[k] += 2*k
    return A

# Takes matrix A and vector v, multiplies them and returns a vector of shape v.
# Does not even compile:
#   "Failed at nopython (nopython frontend), Only accept returning of array
#    passed into the function as argument"
# But v is passed to the function as an argument...
@cuda.jit('float32[:](float32[:,:], float32[:])', device=True)
def MatrixMultiVector(A, v):
    tmp = cuda.local.array(shape=4, dtype=float32)  # is that thing even empty? It could technically be anything, right?
    for i in xrange(A[0].size):
        for j in xrange(A[1].size):
            tmp[i] += A[i][j]*v[j]
    v = tmp
    return v

# Kernels
#----------------------------------------------------------------------

# TestKernel Scalar - works
@cuda.jit(void(float32[:,:]))
def TestKernelScalar(InputArray):
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i, j] = myfuncScalar(5, 7)

# TestKernel Array
@cuda.jit(void(float32[:,:]))
def TestKernelArray(InputArray):
    # Defining arrays this way seems super tedious, there has to be a better way.
    M = cuda.local.array(shape=4, dtype=float32)
    M[0] = 1; M[1] = 0; M[2] = 0; M[3] = 0
    tmp = myfuncArray(M)
    # tmp = MatrixMultiVector(A, M)  -> we still have to define a 4x4 matrix for that
    i = cuda.grid(1)
    for j in xrange(InputArray[1].size):
        InputArray[i, j] += tmp[j]

#----------------------------------------------------------------------
# Main
#----------------------------------------------------------------------
N = 4
C = np.zeros((N, N), dtype=np.float32)
TestKernelArray[1, N](C)
print(C)
The short answer is you can't define dynamic lists or arrays in CUDA Python. You can have statically defined local or shared memory arrays (see cuda.local.array() and cuda.shared.array in the documentation), but those have thread or block scope and can't be reused after their associated thread or block is retired. But that is about all that is supported. You can pass externally defined arrays to kernels, but their attributes are read-only.
As per your myfuncArray, you can return an externally defined array. But you can't return a dynamically defined array, because dynamically allocated arrays (or any objects, for that matter) are not supported in kernels.
You can read the CUDA Python specification for yourself, but the really short answer is that CUDA Python is a superset of Numba's No Python Mode, and while there are elementary scalar functions available, there is no Python object model support. That excludes much Python functionality, including objects and numpy.
I have two large ctypes arrays which I would like to compare, without additional memory. Direct comparison doesn't work:
>>> a = ctypes.create_string_buffer(b'1'*0x100000)
>>> b = ctypes.create_string_buffer(b'1'*0x100000)
>>> a == b
False
Using either the value or raw attribute creates a copy of the array in memory.
Using memoryview to wrap both buffers slows things down by a lot.
For Windows, a possible solution is to use msvcrt's memcmp directly, but is there a more pythonic or cross-platform way to do this?
Specific C libraries can be found in a platform-independent way using ctypes.util.find_library. The functions that the library exposes can then be used as desired.
Thus arrays can be compared by doing the following:
libc_name = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_name)
libc.memcmp.argtypes = (ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t)
len(a) == len(b) and libc.memcmp(a, b, len(a)) == 0
Be warned: these functions are very unforgiving if called incorrectly. Setting the function's argtypes makes ctypes check the parameters before calling into the library.
A purely pythonic way to compare the arrays without a large amount of additional memory is the following. It uses a generator to compare one pair of elements at a time, rather than copying the entire arrays elsewhere and comparing them.
len(a) == len(b) and all(x == y for x, y in zip(a,b))
The downside is that many objects, each with a small memory footprint, will be created, which comes at its own computational expense (CPU rather than memory).
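Putting the memcmp approach together as a complete sketch (`buffers_equal` is a hypothetical helper name; on POSIX systems, passing None to CDLL loads the current process, which also exposes libc symbols in case find_library comes up empty):

```python
import ctypes
import ctypes.util

a = ctypes.create_string_buffer(b'1' * 0x1000)
b = ctypes.create_string_buffer(b'1' * 0x1000)

# find_library may return None on some systems; CDLL(None) then falls back
# to the current process, which links libc on POSIX platforms
libc = ctypes.CDLL(ctypes.util.find_library("c"))

libc.memcmp.argtypes = (ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t)
libc.memcmp.restype = ctypes.c_int

def buffers_equal(x, y):
    # length check first: memcmp assumes both buffers have at least len(x) bytes
    return len(x) == len(y) and libc.memcmp(x, y, len(x)) == 0
```

The ctypes arrays are passed directly as void pointers, so no copy of either buffer is made.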
[Python 3]
I like ndarray but I find it annoying to use.
Here's one problem I face. I want to write class Array that will inherit much of the functionality of ndarray, but has only one way to be instantiated: as a zero-filled array of a certain size. I was hoping to write:
class Array(numpy.ndarray):
    def __init__(self, size):
        # What to do here?
        ...
I'd like to call super().__init__ with some parameters to create a zero-filled array, but it won't work since ndarray uses a global function numpy.zeros (rather than a constructor) to create a zero-filled array.
Questions:
Why does ndarray use global (module) functions instead of constructors in many cases? It is a big annoyance if I'm trying to reuse them in an object-oriented setting.
What's the best way to define class Array that I need? Should I just manually populate ndarray with zeroes, or is there any way to reuse the zeros function?
Why does ndarray use global (module) functions instead of constructors in many cases?
To be compatible/similar to Matlab, where functions like zeros or ones originally came from.
Global factory functions are quick to write and easy to understand. What should the semantics of a constructor be, e.g. how would you express a simple zeros or empty or ones with one single constructor? In fact, such factory functions are quite common, also in other programming languages.
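To illustrate the factory-function point: one way to get several construction modes without one overloaded constructor is named classmethod factories. This Array is only a sketch of that idea, not how numpy itself is built:

```python
import numpy as np

class Array(np.ndarray):
    # hypothetical sketch: one named factory per construction mode,
    # instead of a single constructor trying to express all of them
    @classmethod
    def zeros(cls, size):
        return np.zeros(size).view(cls)

    @classmethod
    def ones(cls, size):
        return np.ones(size).view(cls)
```

Each factory's name says what it builds, which is exactly the readability argument for numpy.zeros / numpy.ones / numpy.empty at module level.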
What's the best way to define class Array that I need?
import numpy

class Array(numpy.ndarray):
    def __new__(cls, size):
        result = numpy.ndarray.__new__(Array, size)
        result.fill(0)
        return result

arr = Array(5)

def test(a):
    print type(a), a

test(arr)
test(arr[2:4])
test(arr.view(int))
arr[2:4] = 5.5
test(arr)
test(arr[2:4])
test(arr.view(int))
Note that this is Python 2, but it would require only small modifications to work with Python 3.
If you don't like the ndarray interface, then don't inherit it. You can define your own interface and delegate the rest to ndarray and numpy.
import functools
import numpy as np

class Array(object):
    def __init__(self, size):
        self._array = np.zeros(size)

    def __getattr__(self, attr):
        try:
            return getattr(self._array, attr)
        except AttributeError:
            # extend the interface to all numpy module functions
            f = getattr(np, attr, None)
            if hasattr(f, '__call__'):
                return functools.partial(f, self._array)
            else:
                raise AttributeError(attr)

    def allzero(self):
        return np.allclose(self._array, 0)

a = Array(10)
# ndarray has no 'sometrue()' method; np.sometrue is the same as
# the 'any()' method it does have
assert a.sometrue() == a.any() == False
assert a.allzero()
try:
    a.non_existent
except AttributeError:
    pass
else:
    assert 0
Inheritance from ndarray is a little bit tricky. ndarray does not even have an __init__ method, so it can't be called from a subclass, but there are reasons for that. Please see the numpy documentation on subclassing.
By the way, could you be more specific about your particular needs? It's still quite easy to cook up a class (utilizing ndarray) for your own needs, but a subclass of ndarray that passes through all the numpy machinery is quite a different issue.
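A minimal subclass along the lines of the numpy subclassing documentation, using __new__ plus __array_finalize__ (the class name ZeroArray is just for this sketch):

```python
import numpy as np

class ZeroArray(np.ndarray):
    """Always starts zero-filled; follows the numpy subclassing recipe."""
    def __new__(cls, size):
        # create by view-casting a plain zero-filled ndarray to this class
        return np.zeros(size).view(cls)

    def __array_finalize__(self, obj):
        # called on explicit construction, view casting, and slicing;
        # this sketch has no extra attributes to set up
        pass
```

Because slicing and view casting go through __array_finalize__ rather than __new__, slices of a ZeroArray remain ZeroArray instances.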
It seems that I can't comment my own post, odd
@Philipp: It will be called by Python, but not by numpy. There are three ways to instantiate ndarray, and the guidelines for handling all the cases are given in that doc.