Consider the following code:
import numpy as np
a = np.zeros(50)
a[10:20:2] = 1
b = c = a[10:40:4]
print b.flags # You'll see that b and c are not C_CONTIGUOUS or F_CONTIGUOUS
My question:
Is there a way (with only a reference to b) to make both b and c contiguous?
It is completely fine if np.may_share_memory(b,a) returns False after this operation.
Things which are close but don't quite work out: np.ascontiguousarray / np.asfortranarray, as they will return a new array rather than modifying b in place.
My use case is that I have very large 3D fields stored in a subclass of a numpy.ndarray. In order to save memory, I would like to chop those fields down to the portion of the domain that I am actually interested in processing:
a = a[ix1:ix2,iy1:iy2,iz1:iz2]
Slicing for the subclass is somewhat more restricted than slicing of ndarray objects, but this should work and it will "do the right thing" -- the various custom meta-data attached on the subclass will be transformed/preserved as expected. Unfortunately, since this returns a view, numpy won't free the big array afterward so I don't actually save any memory here.
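For concreteness, a small sketch of that behaviour (the sizes here are made up for illustration):

import numpy as np

big = np.zeros((400, 400, 400))   # ~0.5 GB of float64
view = big[10:20, 10:20, 10:20]   # a tiny slice, but still just a view

print(view.base is big)           # True: the view keeps the base array alive
del big                           # the name is gone, but...
print(view.base.nbytes)           # 512000000 -- the full buffer is still held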
To be completely clear, I'm looking to accomplish 2 things:
1. Preserve the metadata on my class instance. Slicing will work, but I'm not sure about other forms of copying.
2. Make it so that the original array is free to be garbage collected.
According to Alex Martelli:
"The only really reliable way to ensure that a large
but temporary use of memory DOES return all resources to the system
when it's done, is to have that use happen in a subprocess, which
does the memory-hungry work then terminates."
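A minimal sketch of that subprocess idea (the crop worker, its bounds argument, and the field construction are hypothetical placeholders; the real field would have to be loadable or reconstructable inside the child process):

from multiprocessing import Pool
import numpy as np

def crop(bounds):
    # build or load the big field inside the child, return only the part we want
    big = np.zeros((400, 400, 400))          # stands in for the real field
    ix1, ix2, iy1, iy2, iz1, iz2 = bounds
    return big[ix1:ix2, iy1:iy2, iz1:iz2].copy()

if __name__ == '__main__':
    with Pool(1) as pool:
        small = pool.apply(crop, ((10, 20, 10, 20, 10, 20),))
    # the worker process has exited, so its memory is back with the OS
    print(small.shape)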
However, the following appears to free at least some of the memory:
Warning: my way of measuring free memory is Linux-specific:
import time
import numpy as np
def free_memory():
    """
    Return free memory available, including buffer and cached memory
    """
    total = 0
    with open('/proc/meminfo', 'r') as f:
        for line in f:
            line = line.strip()
            if any(line.startswith(field) for field in ('MemFree', 'Buffers', 'Cached')):
                field, amount, unit = line.split()
                amount = int(amount)
                if unit != 'kB':
                    raise ValueError(
                        'Unknown unit {u!r} in /proc/meminfo'.format(u=unit))
                total += amount
    return total
def gen_change_in_memory():
    """
    https://stackoverflow.com/a/14446011/190597 (unutbu)
    """
    f = free_memory()
    diff = 0
    while True:
        yield diff
        f2 = free_memory()
        diff = f - f2
        f = f2
change_in_memory = gen_change_in_memory().next
Before allocating the large array:
print(change_in_memory())
# 0
a = np.zeros(500000)
a[10:20:2] = 1
b = c = a[10:40:4]
After allocating the large array:
print(change_in_memory())
# 3844 # KiB
a[:len(b)] = b
b = a[:len(b)]
a.resize(len(b), refcheck=0)
time.sleep(1)
Free memory increases after resizing:
print(change_in_memory())
# -3708 # KiB
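The same trick can be wrapped in a small helper. This is only a sketch (shrink_to_slice is a made-up name; it assumes a is 1-D, owns its data, and that skipping the reference check is acceptable). Because everything happens in place on the original object, a subclass instance and its attributes would survive:

import numpy as np

def shrink_to_slice(a, sl):
    tmp = a[sl].copy()            # small temporary; avoids overlapping writes
    n = len(tmp)
    a[:n] = tmp                   # move the wanted data to the front of the buffer
    a.resize(n, refcheck=False)   # hand the tail of the buffer back
    return a

a = np.zeros(500000)
a[10:20:2] = 1
a = shrink_to_slice(a, slice(10, 40, 4))
print(a.shape)                    # (8,)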
You can do this in cython:
In [1]:
%load_ext cythonmagic
In [2]:
%%cython
cimport numpy as np
np.import_array()
def to_c_contiguous(np.ndarray a):
    cdef np.ndarray new
    cdef int dim, i
    new = a.copy()
    dim = np.PyArray_NDIM(new)
    for i in range(dim):
        np.PyArray_STRIDES(a)[i] = np.PyArray_STRIDES(new)[i]
    a.data = new.data
    np.PyArray_UpdateFlags(a, np.NPY_C_CONTIGUOUS)
    np.set_array_base(a, new)
In [8]:
import sys
import numpy as np
a = np.random.rand(10, 10, 10)
b = c = a[::2, 1::3, 2::4]
d = a[::2, 1::3, 2::4]
print sys.getrefcount(a)
to_c_contiguous(b)
print sys.getrefcount(a)
print np.all(b==d)
The output is:
4
3
True
to_c_contiguous(a) will create a C-contiguous copy of a and make it the base of a.
After the call to to_c_contiguous(b), the refcount of a is decreased, and when the refcount of a reaches 0, it will be freed.
I would claim the correct way to accomplish the two things you listed is to np.copy the slices you create.
Of course, for that to work correctly, you would need to define an appropriate __array_finalize__. You weren't very clear about why you decided to avoid it in the first place, but my feeling is that you should define it. (How did you solve the bx**2 problem without using __array_finalize__?)
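For reference, here is a minimal sketch of such a subclass, following the standard numpy subclassing recipe (Field and meta are made-up names for illustration):

import numpy as np

class Field(np.ndarray):
    def __new__(cls, input_array, meta=None):
        obj = np.asarray(input_array).view(cls)
        obj.meta = meta
        return obj

    def __array_finalize__(self, obj):
        # called for view casting and new-from-template (slices, .copy(), ufuncs),
        # so the metadata survives both slicing and copying
        if obj is None:
            return
        self.meta = getattr(obj, 'meta', None)

a = Field(np.zeros((100, 100, 100)), meta={'units': 'K'})
a = a[10:20, 10:20, 10:20].copy()        # contiguous copy of the slice; metadata preserved
print(a.flags['C_CONTIGUOUS'], a.meta)   # True {'units': 'K'}
# rebinding `a` dropped the last reference to the big buffer, so it can be freed

Since the copy is contiguous and rebinding a drops the last reference to the original array, both goals from the question are met, at the cost of one temporary copy.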
Related
As part of a signal processing task, I am doing some computation per frequency step.
I have a frequencies list of length 513.
I have a 3D numpy array A of shape (81,81,513), where 513 is the number of frequency points. I then have an 81x81 matrix per frequency.
I want to apply some modification to each of those matrices, to end up with a modified version of A I'll name B here, which will also be of shape (81,81,513).
For that, I start pre-allocating B with :
B = np.zeros_like(A)
I then loop over my frequencies and call a dothing function like:
for index, frequency in enumerate(frequencies):
    B[:,:,index] = dothing(A[:,:,index])
The problem is that dothing takes a lot of time, and run sequentially over 513 frequency steps it seems endless.
So I wanted to parallelize it. But even after reading a lot of docs and watching a lot of videos, I get lost in all the libraries and potential solutions.
Computations at all individual frequencies can be done independently. But in the end I need to assign everything back to B in the right order.
Any idea on how to do that?
Thanks in advance
Antoine
Here I would use a shared array using shared_memory, as there's no need to protect write access if no two loop iterations ever use the same memory address. I eliminated the second array to shorten the example (only construct a single shared array), and I re-ordered the array shape to better preserve memory-aligned access.
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory
import numpy as np
import numpy.typing as npt
from typing import Any
from time import sleep

def dothing(arr: np.ndarray, t_func: Any) -> np.ndarray:
    sleep(.05) #simulate hard work
    return arr * 2

def dodothing(args: tuple[int, Any]):
    global arr
    index = args[0]
    t_func = args[1]
    arr[index] = dothing(arr[index], t_func) #write result back to self to avoid need for 2 shared arrays

def init(shm: SharedMemory, shape: tuple[int, ...], dtype: npt.DTypeLike):
    global arr
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)

if __name__ == "__main__":
    _A = np.ones((513,81,81), np.float64) #source data
    t_funcs = ["some transfer function"] * _A.shape[0] #added example of passing some data + an index
    nbytes = _A.size * _A.itemsize
    dtype = _A.dtype
    shape = _A.shape
    shm = SharedMemory(create=True, size=nbytes)
    A = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    A[:] = _A[:] #copy contents into shared A
    with Pool(initializer=init, initargs=(shm, shape, dtype)) as pool:
        pool.map(dodothing, enumerate(t_funcs)) #enumerate returns tuple[int,Any] each loop
    print(A.sum()/_A.sum()) #prove we multiplied all elements by 2
    shm.close()
    shm.unlink()
multiprocessing.Pool is a bit funny sometimes in what can be a valid argument to a target function, so I tend to share things like Lock, Queue, shared_memory etc. via the pool's initialization function, which accepts arguments just like Process does.
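For example, the same initializer pattern works for a Lock (a throwaway sketch, separate from the frequency example above):

from multiprocessing import Pool, Lock

def init(l):
    # stash the lock in a global so worker tasks can see it;
    # passing it as a pool.map argument would not work
    global lock
    lock = l

def work(i):
    with lock:                     # serialise access to the shared resource (stdout here)
        print("handling item", i)

if __name__ == "__main__":
    with Pool(4, initializer=init, initargs=(Lock(),)) as pool:
        pool.map(work, range(8))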
The code below is a test for calculating distance between points in a periodic system.
import itertools
import time
import numpy as np
import numba
from numba import njit
@njit(cache=True)
def get_dr(i=np.array([]), j=np.array([]), cellsize=np.array([])):
    k = np.zeros(3, dtype=np.float64)
    for idx, _ in enumerate(cellsize):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.round((j[idx]-i[idx])/cellsize[idx])
    return np.linalg.norm(k)

@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64)"],
                   "(m),(m),(m)->()", nopython=True, cache=True)
def get_dr_vec(i, j, cellsize, dr):
    dr = 0.0
    k = np.zeros(3, dtype=np.float64)
    for idx, _ in enumerate(cellsize):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.round((j[idx]-i[idx])/cellsize[idx])
    dr = np.sqrt(np.square(k[0]) + np.square(k[1]) + np.square(k[2]))

N, dim = 50, 3  # 50 particles in 3D
vec = np.random.random((N, dim))
cellsize = np.array([26.4, 26.4, 70.0])
rList = []; rList2 = []
start = time.perf_counter()
for (pI, pJ) in itertools.product(vec, vec):
    rList.append(get_dr(pI, pJ, cellsize))
end = time.perf_counter()
print("Time {:.3g}s".format(end-start))
newvec1 = []; newvec2 = []
start = time.perf_counter()
for (pI, pJ) in itertools.product(vec, vec):
    newvec1.append(pI)
    newvec2.append(pJ)
cellsizeVec = np.full(shape=np.shape(newvec1), fill_value=cellsize, dtype=float)
rList2 = get_dr_vec(np.array(newvec1), np.array(newvec2), cellsizeVec)
end = time.perf_counter()
print("Time {:.3g}s".format(end-start))
print(rList2)
exit()
Compared to get_dr(), which gives the correct results, get_dr_vec() returns garbage and NaN values. Inside the function, dr is computed correctly, but the returned array (which has the correct dimensions) contains garbage. Can someone suggest any ideas on how to resolve this issue?
You made a small mistake in the guvectorize call: guvectorize does not want you to rebind the output variable; the output array must be filled in instead. The code below should work:
@numba.guvectorize(["void(float64[:],float64[:],float64[:],float64[:])"],
                   "(m),(m),(m)->()", nopython=True, cache=True)
def get_dr_vec(i, j, cellsize, dr):
    k = np.zeros(3, dtype=np.float64)
    for idx, _ in enumerate(cellsize):
        k[idx] = (j[idx]-i[idx]) - cellsize[idx]*np.round((j[idx]-i[idx])/cellsize[idx])
    # The mistake was on this line. You had "dr =", but it should be "dr[0] ="
    dr[0] = np.sqrt(np.square(k[0]) + np.square(k[1]) + np.square(k[2]))
The reason that dr = does not work is that guvectorize already allocates the dr array before you call the function. dr = messes things up because it binds the name to a new array in a new place in memory, so when numba looks at the original place in memory, where it expects to find the results, it finds nothing. dr[0] = does work, because that way we fill in the values at the original place in memory, where numba expects them to be.
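The same distinction can be seen in plain NumPy, with no numba involved (a minimal illustration):

import numpy as np

def rebind(out):
    out = np.array([1.0, 2.0, 3.0])   # rebinds the local name only; caller's array untouched

def fill(out):
    out[:] = [1.0, 2.0, 3.0]          # writes into the caller's memory

a = np.zeros(3)
rebind(a)
print(a)    # [0. 0. 0.]
fill(a)
print(a)    # [1. 2. 3.]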
If it is still not 100% clear, I recommend that you look through the numba documentation on this topic.
Those "garbage values" you were seeing were the contents of an output array that was never filled, similar to what you would see if you called print(np.empty(10)).
I'm starting with numba and my first goal is to try and accelerate a not so complicated function with a nested loop.
Given the following class:
class TestA:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def get_mult(self):
        return self.a * self.b
and a numpy ndarray that contains TestA objects, with shape (N,) where N is usually ~3 million.
Now given the following function:
def test_no_jit(custom_class_obj_container):
    container_length = len(custom_class_obj_container)
    sum = 0
    for i in range(container_length):
        for j in range(i + 1, container_length):
            obj_i = custom_class_obj_container[i]
            obj_j = custom_class_obj_container[j]
            sum += (obj_i.get_mult() + obj_j.get_mult())
    return sum
I've tried to play around with numba to get it to work with the function above; however, I cannot seem to get it to work with the nopython=True flag, and if it's set to False, the runtime is higher than for the no-jit function.
Here is my latest try in trying to jit the function (also using nb.prange):
import numba as nb

@nb.jit(nopython=False, parallel=True)
def test_jit(custom_class_obj_container):
    container_length = len(custom_class_obj_container)
    sum = 0
    for i in nb.prange(container_length):
        for j in nb.prange(i + 1, container_length):
            obj_i = custom_class_obj_container[i]
            obj_j = custom_class_obj_container[j]
            sum += (obj_i.get_mult() + obj_j.get_mult())
    return sum
I've searched around but cannot find a tutorial on how to declare a custom class in the signature, nor how I would go about accelerating a function of that sort and getting it to run on a GPU, ideally with the CUDA libraries, which are installed and ready to use (previously used with TensorFlow). Any info regarding that would be highly appreciated.
The numba docs give an example of creating a custom type, even for nopython mode: https://numba.pydata.org/numba-doc/latest/extending/interval-example.html
In your case though, unless this is a really slimmed down version of what you actually want to do, it seems like the easiest approach would be to re-use existing types. Additionally, the construction of a 3M length object array is going to be slow, and produce fragmented memory (as the objects are not being stored in contiguous blocks).
An example of how record arrays might be used to solve the problem:
import numpy as np
import numba

x_dt = np.dtype([('a', np.float64),
                 ('b', np.float64)])
n = 30000
buf = np.arange(n*2).reshape((n, 2)).astype(np.float64)
vec3 = np.recarray(n, dtype=x_dt, buf=buf)

@numba.njit
def mult(a):
    return a.a * a.b

@numba.jit(nopython=True, parallel=True)
def sum_of_prod(vector):
    sum = 0
    vector_len = len(vector)
    for i in numba.prange(vector_len):
        for j in numba.prange(i + 1, vector_len):
            sum += mult(vector[i]) + mult(vector[j])
    return sum

sum_of_prod(vec3)
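If the data currently sits in an object array (or list) of TestA instances, one possible bridge, sketched here with a made-up size, is to extract the plain numbers once into the same record dtype:

objs = [TestA(float(i), float(i) + 1.0) for i in range(1000)]
vec4 = np.array([(o.a, o.b) for o in objs], dtype=x_dt).view(np.recarray)
print(sum_of_prod(vec4))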
FWIW, I'm no numba expert. I found this question when searching for how to implement a custom type in numba for non-numerical stuff. In your case, because this is highly numerical, I think a custom type is probably overkill.
I am creating a sparse matrix file, by extracting the features from an input file. The input file contains in each row, one film id, and then followed by some feature IDs and that features score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; in each pair that follows, the value to the left of the colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix:
import sys
import os
import os.path
import time
import json                # needed for the json.dump call below
import numpy as np
import tables as tb        # needed for tb.open_file / tb.Filters / tb.Float32Atom
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong in this function that makes it so slow?
The culprits are I/O plus conversions (from str, to str, even twice to str for the same variable, etc.), splits, and explicit loops. By the way, there is a csv module in the standard library which could be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). Also, I see you convert element[0] to int/str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (arrays?). You can also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything), and to eliminate as many conversions as you can. Also, if the input file is yours, you can format it with fixed-length fields; that lets you avoid splitting/parsing entirely (only string indexing).
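For what it's worth, a rough sketch of that "parse once, build the matrix once" direction (the feature filtering and the film-ID lookup tables from the question are left out; the file name is the one used above):

from scipy.sparse import coo_matrix

rows, cols, vals, film_ids = [], [], [], []
with open('input_file.txt') as f:
    for row, line in enumerate(f):
        tokens = line.split()
        film_ids.append(tokens[0])
        for tok in tokens[1:]:
            feat, score = tok.split(':')
            rows.append(row)
            cols.append(int(feat))
            vals.append(float(score))

# one call builds the whole sparse matrix, instead of writing a dense
# HDF5 array element by element
matrix = coo_matrix((vals, (rows, cols)),
                    shape=(len(film_ids), max(cols) + 1)).tocsr()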
Like many others, my situation is that I have a class which collects a large amount of data, and provides a method to return the data as a numpy array. (Additional data can continue to flow in, even after returning an array). Since creating the array is an expensive operation, I want to create it only when necessary, and to do it as efficiently as possible (specifically, to append data in-place when possible).
For that, I've been doing some reading about the ndarray.resize() method, and the refcheck argument. I understand that refcheck should be set to False only when "you are sure that you have not shared the memory for this array with another Python object".
The thing is I'm not sure. Sometimes I have, sometimes I haven't. I'm fine with it raising an error if refcheck fails (I can catch it and then create a new copy), but I want it to fail only when there are "real" external references, ignoring the ones I know to be safe.
Here's a simplified illustration:
import numpy as np
def array_append(arr, values, refcheck = True):
    added_len = len(values)
    if added_len == 0:
        return arr
    old_len = len(arr)
    new_len = old_len + added_len
    arr.resize(new_len, refcheck = refcheck)
    arr[old_len:] = values
    return arr

class DataCollector(object):

    def __init__(self):
        self._new_data = []
        self._arr = np.array([])

    def add_data(self, data):
        self._new_data.append(data)

    def get_data_as_array(self):
        self._flush()
        return self._arr

    def _flush(self):
        if not self._new_data:
            return
        # self._arr = self._append1()
        # self._arr = self._append2()
        self._arr = self._append3()
        self._new_data = []

    def _append1(self):
        # always raises an error, because there are at least 2 refs:
        # self._arr and local variable 'arr' in array_append()
        return array_append(self._arr, self._new_data, refcheck = True)

    def _append2(self):
        # Does not raise an error, but unsafe in case there are other
        # references to self._arr
        return array_append(self._arr, self._new_data, refcheck = False)

    def _append3(self):
        # "inline" version: works if there are no other references
        # to self._arr, but raises an error if there are.
        added_len = len(self._new_data)
        old_len = len(self._arr)
        self._arr.resize(old_len + added_len, refcheck = True)
        self._arr[old_len:] = self._new_data
        return self._arr
dc = DataCollector()
dc.add_data(0)
dc.add_data(1)
print dc.get_data_as_array()
dc.add_data(2)
print dc.get_data_as_array()
x = dc.get_data_as_array() # create an external reference
print x.shape
for i in xrange(5000):
dc.add_data(999)
print dc.get_data_as_array()
print x.shape
Questions:
1. Is there a better (fast) way of doing what I'm trying to do (creating a numpy array incrementally)?
2. Is there a way of telling the resize() method: "perform refcheck, but ignore that one reference (or n references) which I know to be safe"? (That would solve the problem that _append1() always fails.)
The resize method has two main problems. The first is that you return a reference to self._arr when the user calls get_data_as_array. Now the resize will do one of two things, depending on your implementation. It'll either modify the array you've given your user (i.e. the user will take a.shape and the shape will unpredictably change), or it'll corrupt that array, leaving it pointing to bad memory. You could solve that issue by always having get_data_as_array return self._arr.copy(), but that brings me to the second issue: resize is actually not very efficient. I believe that in general, resize has to allocate new memory and do a copy every time it is called to grow an array. Plus, now you need to copy the array every time you want to return it to your user.
Another approach would be to design your own dynamic array, that would look something like:
class DynamicArray(object):

    _data = np.empty(1)
    data = _data[:0]
    len = 0
    scale_factor = 2

    def append(self, values):
        old_data = len(self.data)
        total_data = len(values) + old_data
        total_storage = len(self._data)
        if total_storage < total_data:
            while total_storage < total_data:
                # int() because np.ceil returns a float and np.empty needs an integer size
                total_storage = int(np.ceil(total_storage * self.scale_factor))
            self._data = np.empty(total_storage)
            self._data[:old_data] = self.data
        self._data[old_data:total_data] = values
        self.data = self._data[:total_data]
This should be very fast because you only need to grow the array log(N) times and you use at most 2*N-1 storage where N is the max size of the array. Other than growing the array, you're just making views of _data which doesn't involve any copying and should be constant time.
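A quick usage sketch with the class as written above:

arr = DynamicArray()
for chunk in (np.arange(3.0), np.arange(5.0), np.arange(10.0)):
    arr.append(chunk)
print(arr.data.size, arr._data.size)   # 18 elements visible, 32 allocated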
Hope this is useful.
I will use array.array() to do the data collection:
import array
a = array.array("d")
for i in xrange(100):
    a.append(i*2)
Every time you want to do some calculation with the collected data, convert it to a numpy.ndarray with numpy.frombuffer:
b = np.frombuffer(a, dtype=float)
print np.mean(b)
b will share data memory with a, so the conversion is very fast.
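A quick check of that claim, continuing from the snippet above:

a[0] = 123.0     # write through the array.array side...
print(b[0])      # ...and the ndarray view sees it immediately: 123.0
print(b.size)    # 100, one element per append above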