I would like to use dask to do the following operation; let's say I have a numpy array:
In: x = np.arange(5)
Out: array([0, 1, 2, 3, 4])
Then I want a function to map np.arange to all the elements of my array.
I have already defined a function for that purpose:
def list_range(array, no_cell):
    return np.add.outer(array, np.arange(no_cell)).T

# e.g.
In: list_range(x, 3)
Out: array([[0, 1, 2, 3, 4],
            [1, 2, 3, 4, 5],
            [2, 3, 4, 5, 6]])
Now I want to reproduce this in parallel using map_blocks on a dask array but I always get an error. Here is my attempt based on the dask documentation of map_blocks:
import numpy as np
import dask.array as da

constant = 4
d = da.arange(5, chunks=(2,))
f = da.core.map_blocks(list_range, d, constant, chunks=(2,))
f.compute()
I get
ValueError: could not broadcast input array from shape (4,2) into shape (4)
Have you checked out Dask's ufunc methods? For your problem, you can try,
da.add.outer(d, np.arange(constant)).T.compute()
When using map_blocks, you have to make sure that you specify the new dimensions whenever your operation changes the chunk dimensions. In your problem, the chunk shape is no longer (2,) but (2, 4). This new dimension should be specified using the new_axis parameter. Also, I found that map_blocks does not vstack the blocks afterwards, and I couldn't get the transpose to work within the mapped function. Try this to make map_blocks work:
def list_range(array, no_cell):
    return np.add.outer(array, np.arange(no_cell))

constant = 4
d = da.arange(5, chunks=(2,))
f = da.core.map_blocks(list_range, d, constant, chunks=(2, constant), new_axis=[1])
f.T.compute()
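If the blocks are stitched together as expected, f.T.compute() should reproduce the serial result, i.e. for constant = 4 something like:
array([[0, 1, 2, 3, 4],
       [1, 2, 3, 4, 5],
       [2, 3, 4, 5, 6],
       [3, 4, 5, 6, 7]])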
I have a numpy array of shape [batch_size, timesteps_per_sample, width, height], where width and height refer to a 2D grid. The values in this array can be interpreted as an elevation at a certain location that changes over time.
I want to know the elevation over time for various paths within this array. Therefore I have a second array of shape [batch_size, paths_per_batch_sample, timesteps_per_path, coordinates] (coordinates = 2, for x and y in the 2D plane).
The resulting array should be of shape [batch_size, paths_per_batch_sample, timesteps_per_path], containing the elevation over time for each sample within the batch.
The following two examples work. The first one is very slow and just serves to illustrate what I am trying to do. I think the second one does what I want, but I have no idea why it works nor whether it may break under certain circumstances.
Code for the problem setup:
import numpy as np
batch_size=32
paths_per_batch_sample=10
timesteps_per_path=4
width=64
height=64
elevation = np.arange(0, batch_size*timesteps_per_path*width*height, 1)
elevation = elevation.reshape(batch_size, timesteps_per_path, width, height)
paths = np.random.randint(0, high=width-1, size=(batch_size, paths_per_batch_sample, timesteps_per_path, 2))
range_batch = range(batch_size)
range_paths = range(paths_per_batch_sample)
range_timesteps = range(timesteps_per_path)
The following code works but is very slow:
elevation_per_time = np.zeros((batch_size, paths_per_batch_sample, timesteps_per_path))
for s in range_batch:
    for k in range_paths:
        for t in range_timesteps:
            x_co, y_co = paths[s, k, t, :].astype(int)
            elevation_per_time[s, k, t] = elevation[s, t, x_co, y_co]
The following code also works (and is fast), but I can't understand why or how it does:
elevation_per_time_fast = elevation[
    :,
    range_timesteps,
    paths[:, :, range_timesteps, 0].astype(int),
    paths[:, :, range_timesteps, 1].astype(int),
][range_batch, range_batch, :, :]
Proof that the results are equal:
check = (elevation_per_time == elevation_per_time_fast)
print(np.all(check))
Can somebody explain how I can index an nd-array with multiple other arrays?
In particular, I don't understand how numpy knows that range_timesteps has to run in step (as the index for axes 1, 2 and 3).
Thanks in advance!
Let's take a quick look at slicing a numpy array first:
a = np.arange(0, 9, 1).reshape([3, 3])
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
Numpy has two ways of slicing an array: full sections with start:stop, and by index from a list [index1, index2, ...]. The output will still be an array with the shape of your slice:
a[0:2, :]
array([[0, 1, 2],
       [3, 4, 5]])
a[:, [0, 2]]
array([[0, 2],
       [3, 5],
       [6, 8]])
The second point is that, since each slice returns an array with the same number of dimensions, you can chain any number of slices, as long as you don't try to directly access an index outside of the array:
a[:][:][:][:][:][:][:][[0,2]][:,[0,2]]
array([[0, 2],
       [6, 8]])
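To address the "run in step" part of the original question: when several integer arrays are used as indices, numpy broadcasts them against each other and picks one element per position of the broadcast shape, so the index arrays advance in lockstep. A minimal sketch using the same a as above:
rows = np.array([0, 1, 2])
cols = np.array([2, 0, 1])
a[rows, cols]
# array([2, 3, 7])  -> the pairs (0, 2), (1, 0), (2, 1) are read together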
I have a dataframe containing a column with around 500,000 rows, each holding an array vector. What I'm trying to do is unload the content of this column into a 2-dimensional array, but I don't know the fastest way to do it.
This is the format of the array I'm trying to obtain ([1, 2], [3, 4] and [5, 6] are arrays contained in my dataframe):
array([[1, 2],
       [3, 4],
       [5, 6]])
I tried to_numpy, as_matrix and .values, but they give me a 1D array, which is not what I'm looking for:
array([array([1, 2]),
       array([3, 4]),
       array([5, 6])])
The only methods which gave me the result I want are np.asarray() and np.array(), but they take too much time in my case.
What I want is the same array I obtain using the numpy methods (vectors 1, 2 and 8), but faster if possible, because it takes too much time when we have a lot of data.
Thank you for your help!
Edit: Here is my function. It takes as parameter a dataframe containing two columns: id and vectors, where vectors is a Series of array objects.
id vectors
1 array([1,2,3], dtype='float32')
2 array([3,4,5], dtype='float32')
3 array([6,7,8], dtype='float32')
[11530 rows x 2 columns]
What I want to do with this function is unload the content of column id into a list, which is fast and easy, and the content of column vectors into an array. So I want a 2-dimensional array of the vector arrays.
def filter_df(df, request):
    start = time.time()
    filtered_df = df
    ids = filtered_df['id'].tolist()
    filtered_df_vectors = filtered_df['vectors'].tolist()
    vectors9 = np.array(filtered_df['vectors'].tolist())
    vectors1 = np.asarray(filtered_df_vectors)
    vectors2 = np.array([f for f in filtered_df_vectors], dtype=np.float32)
    vectors3 = filtered_df['vectors'].as_matrix()
    vectors4 = filtered_df['vectors'].to_numpy()
    vectors5 = filtered_df['vectors'].values
    vectors6 = filtered_df.iloc[:, -1].values
    vectors8 = np.array(filtered_df['vectors'].values.tolist())
    vectors9 = np.array(filtered_df['vectors'].tolist())
    filter_duration = time.time() - start
    logger.info(f"duration: {filter_duration}s")
    return ids, vectors2, filter_duration
I can't copy-paste the exact output returned for the resulting arrays, because it would be unreadable here, so I will just show the two types of array I obtain with the various methods I tested.
For vectors 1, 2, 8 and 9, where I use numpy methods, I obtain the format I'm looking for, but it takes too much time (around 0.7 seconds, which is too slow for my case). Here is what I obtain:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]], dtype=float32)
ndim : 2
dtype('float32')
shape : (11530, 300)
size : 3459000
For vectors 3, 4, 5 and 6, where I use non-numpy methods like pandas' to_numpy or as_matrix, the conversion is fast (~0.05 s), but for the same input it returns an array of this form:
array([array([1, 2, 3], dtype=float32),
       array([4, 5, 6], dtype=float32),
       array([7, 8, 9], dtype=float32)], dtype=object)
ndim : 1
dtype('O')
shape : (11530,)
size : 11530
I don't understand why it doesn't give me the same array as the numpy methods give me.
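As a rough sketch of what is going on (using a small made-up Series, not your real data): pandas stores each row of the vectors column as a separate ndarray object, so to_numpy(), .values and as_matrix can only hand back a 1-D object array of references without copying anything, which is why they are fast. Building the 2-D float32 array requires copying every row once, which is what np.array / np.asarray do. Stacking the object array yourself performs the same copy and may or may not be faster, so it is worth timing on the real data:
import numpy as np
import pandas as pd

# small stand-in for the real 11530 x 300 vectors column
s = pd.Series([np.array([1, 2, 3], dtype=np.float32),
               np.array([4, 5, 6], dtype=np.float32)])

obj = s.to_numpy()     # 1-D object array of ndarray references, no copy
dense = np.stack(obj)  # 2-D float32 array, one copy per row

print(obj.dtype, obj.shape)      # object (2,)
print(dense.dtype, dense.shape)  # float32 (2, 3)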
I am using a Python wrapper to call functions of a C++ DLL library. A ctype is returned by the DLL, which I convert to a numpy array:
score = np.ctypeslib.as_array(score,1)
however, the array has no shape?
score
>>> array(-0.019486344729027664)
score.shape
>>> ()
score[0]
>>> IndexError: too many indices for array
How can I extract a double from the score array?
Thank you.
You can access the data inside a 0-dimensional array via indexing [()].
For example, score[()] will retrieve the underlying data in your array.
The idiom is in fact consistent:
# x, y, z are 0-dim, 1-dim, 2-dim respectively
x = np.array(1)
y = np.array([1, 2, 3])
z = np.array([[1, 2, 3], [4, 5, 6]])
# use 0-dim, 1-dim, 2-dim tuple indexers respectively
res_x = x[()] # 1
res_y = y[(1,)] # 2
res_z = z[(1, 2)] # 6
Tuples seem unnatural because you don't need to use them explicitly for the 1d and 2d cases, i.e. y[1] and z[1, 2] suffice. That option isn't available for the 0-dim case, so use the zero-length tuple.
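Applied to the score array from the question, score[()] returns a numpy scalar; if you want a plain Python float, the standard .item() method does the same extraction:
value = score[()]     # numpy scalar
value = score.item()  # plain Python float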
I want to convert a 1-dimensional array into a 2-dimensional array by specifying the number of columns in the 2D array. Something that would work like this:
> import numpy as np
> A = np.array([1,2,3,4,5,6])
> B = vec2matrix(A,ncol=2)
> B
array([[1, 2],
[3, 4],
[5, 6]])
Does numpy have a function that works like my made-up function "vec2matrix"? (I understand that you can index a 1D array like a 2D array, but that isn't an option in the code I have - I need to make this conversion.)
You want to reshape the array.
B = np.reshape(A, (-1, 2))
where -1 infers the size of the new dimension from the size of the input array.
You have two options:
If you no longer want the original shape, the easiest option is just to assign a new shape to the array:
a.shape = (a.size//ncols, ncols)
You can replace a.size//ncols with -1 to compute the proper shape automatically. Make sure that a.shape[0]*a.shape[1] == a.size, else you'll run into some problems.
You can get a new array with the np.reshape function, which works mostly like the version presented above:
new = np.reshape(a, (-1, ncols))
When it's possible, new will be just a view of the initial array a, meaning that the data are shared. In some cases, though, the new array will be a copy instead. Note that np.reshape also accepts an optional keyword order that lets you switch from row-major C order to column-major Fortran order. np.reshape is the function version of the a.reshape method.
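A quick sketch of the view behaviour and the order keyword mentioned above (a here is a small illustrative array):
a = np.arange(6)
new = np.reshape(a, (-1, 2))
print(np.shares_memory(a, new))  # True -> new is a view of a
a[0] = 99
print(new[0, 0])                 # 99, the change is visible through new
print(np.reshape(np.arange(6), (3, 2), order='F'))
# [[0 3]
#  [1 4]
#  [2 5]]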
If you can't respect the requirement a.shape[0]*a.shape[1] == a.size, you're stuck with having to create a new array. You can use the np.resize function and mix it with np.reshape, for example:
>>> a = np.arange(9)
>>> np.resize(a, 10).reshape(5, 2)
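For reference, np.resize pads by repeating the data, so the example above should give:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 0]])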
Try something like:
B = np.reshape(A,(-1,ncols))
You'll need to make sure that you can divide the number of elements in your array by ncols though. You can also play with the order in which the numbers are pulled into B using the order keyword.
If your sole purpose is to convert a 1D array X to a 2D array, just do:
X = np.reshape(X,(1, X.size))
Convert a 1-dimensional array into a 2-dimensional array by adding a new axis:
a = np.array([10, 20, 30, 40, 50, 60])
b = a[:, np.newaxis]  # converts it to two dimensions
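Note that this gives a column vector rather than the n-by-ncols layout asked about in the question; a quick check of the shape:
print(b.shape)  # (6, 1)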
There is a simple way as well: we can use the reshape function differently:
A_reshape = A.reshape(No_of_rows, No_of_columns)
You can use flatten() from the numpy package.
import numpy as np
a = np.array([[1, 2],
              [3, 4],
              [5, 6]])
a_flat = a.flatten()
print(f"original array: {a} \nflattened array = {a_flat}")
Output:
original array: [[1 2]
[3 4]
[5 6]]
flattened array = [1 2 3 4 5 6]
some_array.shape = (1,)+some_array.shape
or get a new one
another_array = numpy.reshape(some_array, (1,)+some_array.shape)
This will increase the number of dimensions by one, which is equivalent to adding a bracket at the outermost level.
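A small illustration of the shape trick with a made-up array:
some_array = np.array([1, 2, 3])            # shape (3,)
some_array.shape = (1,) + some_array.shape
print(some_array.shape)                     # (1, 3)
print(some_array)                           # [[1 2 3]]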
Change 1D array into 2D array without using Numpy.
l = [i for i in range(1, 21)]
part = 3
new = []
start, end = 0, part
while end <= len(l):
    temp = []
    for i in range(start, end):
        temp.append(l[i])
    new.append(temp)
    start += part
    end += part
print("new values: ", new)

# for uneven cases
temp = []
while start < len(l):
    temp.append(l[start])
    start += 1
new.append(temp)
print("new values for uneven cases: ", new)
import numpy as np
array = np.arange(8)
print("Original array : \n", array)
array = np.arange(8).reshape(2, 4)
print("New array : \n", array)