Related
I have a 2-D numpy array X with shape (100, 4). I want to find the sum of each row of that
array and store it inside a new numpy array x_new with shape (100,0). What I've done so far
doesn't work. Any suggestions ?. Below is my approach.
x_new = np.empty([100,0])
for i in range(len(X)):
array = np.append(x_new, sum(X[i]))
Using the sum method on a 2d array:
In [8]: x = np.arange(12).reshape(3,4)
In [9]: x
Out[9]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [10]: x.sum(axis=1)
Out[10]: array([ 6, 22, 38])
In [12]: x.sum(axis=1, keepdims=True)
Out[12]:
array([[ 6],
[22],
[38]])
In [13]: _.shape
Out[13]: (3, 1)
reference: https://numpy.org/doc/stable/reference/generated/numpy.sum.html
I have 2 square array with shape = (25, 25) and I want to check if an entire row is filled with zeros and if the corresponding column is filled with zeros. If this is the case I want to remove those columns and rows from the array.
For example:
array = np.array([[1, 0, 1, 1],
[0, 0, 0, 0],
[1, 0, 1, 1],
[1, 0, 1, 1]])
I want it manipulated to
array=np.array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
I hope you can understand what I am aiming at. In this example row and column two have been removed as they are zero rows/columns.
I could do that by iterating through all of those arrays, as I have 10 million of those arrays I would like to have a pythonic/efficient way to solve this issue.
The second array is a tensorflow array manipulating that should be no problem if I know the index of the rows columns I want removed.
Edit:
I have now found following solution, but it is using for-looping:
def removepadding(y_true, y_pred):
shape = np.shape(y_true)
y_true_cleaned=[]
for i in range(shape[0]):
x = y_true[i]
for n in range(shape[1] - 1, -1, -1):
if sum(x[n, :]) == 0 and sum(x[:, n]) == 0:
x = np.delete(np.delete(x, n, 0), n, 1)
y_true_cleaned.append(x)
return y_true_cleaned
You can do it in one line:
array[array.any(axis = 1)][:, array.any(axis = 0)]
#array([[1, 1, 1],
# [1, 1, 1],
# [1, 1, 1]])
if there is negative values in the arr, np.sum may fail.
for 2d array:
import numpy as np
a = np.array([[1,0,2,3,0,4],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]])
row = np.all(a==0, axis=1)
col = np.all(a==0, axis=0)
a[~row][:,~col]
output
array([[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]])
for 3d array:
a = np.ones((3,3,3))
a[1,:,1] = 0
a[1,1,:] = 0
a[:,1,1] = 0
z = np.all(a==0, axis=2)
y = np.all(a==0, axis=1)
x = np.all(a==0, axis=0)
Z = ~np.array([z]*a.shape[2])
Y = ~np.array([y]*a.shape[1])
X = ~np.array([x]*a.shape[0])
ZZ, YY, XX = (Z*Y*X).nonzero()
a[ZZ, YY, XX]
You can use np.count_nonzero to get the indices in one step per dimension:
nnz_row = np.count_nonzero(array, axis=1)
nnz_col = np.count_nonzero(array, axis=0)
Now you make a mask of where both are zero:
mask = (nnz_row == 0) & (nnz_col == 9)
You can turn the mask into indices and pass it to np.delete:
ind = np.flatnonzero(mask)
array = np.delete(np.delete(array, ind, axis=0), ind, axis=1)
Alternatively, you can compute the positive mask:
pmask = nnz_row.astype(bool) | nnz_col.astype(bool)
This mask can select directly, analogously to what delete did with the negative mask:
array = array[pmask, :][:, pmask]
Edit: Thanks to #mad physicist, we can use np.flatnonzero. Here's the 2d case:
import numpy as np
a=np.array([[1,0,2,3,0,4],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]])
cols_to_keep = np.flatnonzero(a.sum(axis=0))
rows_to_keep = np.flatnonzero(a.sum(axis=1))
a = a[:, cols_to_keep]
a = a[rows_to_keep, :]
a
>>>
array([[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]])
Here's the 3d case:
import numpy as np
a=np.array([
[[1,0,2,3,0,4],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]],
[[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[0,0,0,0,0,0]],
[[5,0,5,5,0,5],
[0,0,0,0,0,0],
[0,0,0,0,0,0],
[2,0,3,4,0,5],
[3,0,4,5,0,6],
[4,0,5,6,0,7],
[5,0,6,7,0,8]],
])
ix_keep_axis_0 = np.flatnonzero(a.sum(axis=(1, 2)))
ix_keep_axis_1 = np.flatnonzero(a.sum(axis=(0, 2)))
ix_keep_axis_2 = np.flatnonzero(a.sum(axis=(0, 1)))
a = a[ix_keep_axis_0, :, :]
a = a[:, ix_keep_axis_1, :]
a = a[:, :, ix_keep_axis_2]
a
>>>
array([[[1, 2, 3, 4],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]],
[[5, 5, 5, 5],
[2, 3, 4, 5],
[3, 4, 5, 6],
[4, 5, 6, 7],
[5, 6, 7, 8]]])
problem is very simple: I have two 2d np.array and I want to get a third array that only contains the rows that are not in common with the latter twos.
for example:
X = np.array([[0,1],[1,2],[4,5],[5,6],[8,9],[9,10]])
Y = np.array([[5,6],[9,10]])
Z = function(X,Y)
Z = array([[0, 1],
[1, 2],
[4, 5],
[8, 9]])
I tried np.delete(X,Y,axis=0) but it doesn't work...
Z = np.vstack(row for row in X if row not in Y)
The numpy_indexed package (disclaimer: I am its author) extends the standard numpy array set operations to multi-dimensional use cases such as these, with good efficiency:
import numpy_indexed as npi
Z = npi.difference(X, Y)
Here's a views based approach -
# Based on http://stackoverflow.com/a/41417343/3293881 by #Eric
def setdiff2d(a, b):
# check that casting to void will create equal size elements
assert a.shape[1:] == b.shape[1:]
assert a.dtype == b.dtype
# compute dtypes
void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
orig_dt = np.dtype((a.dtype, a.shape[1:]))
# convert to 1d void arrays
a = np.ascontiguousarray(a)
b = np.ascontiguousarray(b)
a_void = a.reshape(a.shape[0], -1).view(void_dt)
b_void = b.reshape(b.shape[0], -1).view(void_dt)
# Get indices in a that are also in b
return np.setdiff1d(a_void, b_void).view(orig_dt)
Sample run -
In [81]: X
Out[81]:
array([[ 0, 1],
[ 1, 2],
[ 4, 5],
[ 5, 6],
[ 8, 9],
[ 9, 10]])
In [82]: Y
Out[82]:
array([[ 5, 6],
[ 9, 10]])
In [83]: setdiff2d(X,Y)
Out[83]:
array([[0, 1],
[1, 2],
[4, 5],
[8, 9]])
Z = np.unique([tuple(row) for row in X + Y])
I have an array of values that I want to replace with from an array of choices based on which choice is linearly closest.
The catch is the size of the choices is defined at runtime.
import numpy as np
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
If choices was static in size, I would simply use np.where
d = np.where(np.abs(a - choices[0]) > np.abs(a - choices[1]),
np.where(np.abs(a - choices[0]) > np.abs(a - choices[2]), choices[0], choices[2]),
np.where(np.abs(a - choices[1]) > np.abs(a - choices[2]), choices[1], choices[2]))
To get the output:
>>d
>>[[1, 1, 1], [5, 5, 5], [10, 10, 10]]
Is there a way to do this more dynamically while still preserving the vectorization.
Subtract choices from a, find the index of the minimum of the result, substitute.
a = np.array([[0, 0, 0], [4, 4, 4], [9, 9, 9]])
choices = np.array([1, 5, 10])
b = a[:,:,None] - choices
np.absolute(b,b)
i = np.argmin(b, axis = -1)
a = choices[i]
print a
>>>
[[ 1 1 1]
[ 5 5 5]
[10 10 10]]
a = np.array([[0, 3, 0], [4, 8, 4], [9, 1, 9]])
choices = np.array([1, 5, 10])
b = a[:,:,None] - choices
np.absolute(b,b)
i = np.argmin(b, axis = -1)
a = choices[i]
print a
>>>
[[ 1 1 1]
[ 5 10 5]
[10 1 10]]
>>>
The extra dimension was added to a so that each element of choices would be subtracted from each element of a. choices was broadcast against a in the third dimension, This link has a decent graphic. b.shape is (3,3,3). EricsBroadcastingDoc is a pretty good explanation and has a graphic 3-d example at the end.
For the second example:
>>> print b
[[[ 1 5 10]
[ 2 2 7]
[ 1 5 10]]
[[ 3 1 6]
[ 7 3 2]
[ 3 1 6]]
[[ 8 4 1]
[ 0 4 9]
[ 8 4 1]]]
>>> print i
[[0 0 0]
[1 2 1]
[2 0 2]]
>>>
The final assignment uses an Index Array or Integer Array Indexing.
In the second example, notice that there was a tie for element a[0,1] , either one or five could have been substituted.
To explain wwii's excellent answer in a little more detail:
The idea is to create a new dimension which does the job of comparing each element of a to each element in choices using numpy broadcasting. This is easily done for an arbitrary number of dimensions in a using the ellipsis syntax:
>>> b = np.abs(a[..., np.newaxis] - choices)
array([[[ 1, 5, 10],
[ 1, 5, 10],
[ 1, 5, 10]],
[[ 3, 1, 6],
[ 3, 1, 6],
[ 3, 1, 6]],
[[ 8, 4, 1],
[ 8, 4, 1],
[ 8, 4, 1]]])
Taking argmin along the axis you just created (the last axis, with label -1) gives you the desired index in choices that you want to substitute:
>>> np.argmin(b, axis=-1)
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
Which finally allows you to choose those elements from choices:
>>> d = choices[np.argmin(b, axis=-1)]
>>> d
array([[ 1, 1, 1],
[ 5, 5, 5],
[10, 10, 10]])
For a non-symmetric shape:
Let's say a had shape (2, 5):
>>> a = np.arange(10).reshape((2, 5))
>>> a
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
Then you'd get:
>>> b = np.abs(a[..., np.newaxis] - choices)
>>> b
array([[[ 1, 5, 10],
[ 0, 4, 9],
[ 1, 3, 8],
[ 2, 2, 7],
[ 3, 1, 6]],
[[ 4, 0, 5],
[ 5, 1, 4],
[ 6, 2, 3],
[ 7, 3, 2],
[ 8, 4, 1]]])
This is hard to read, but what it's saying is, b has shape:
>>> b.shape
(2, 5, 3)
The first two dimensions came from the shape of a, which is also (2, 5). The last dimension is the one you just created. To get a better idea:
>>> b[:, :, 0] # = abs(a - 1)
array([[1, 0, 1, 2, 3],
[4, 5, 6, 7, 8]])
>>> b[:, :, 1] # = abs(a - 5)
array([[5, 4, 3, 2, 1],
[0, 1, 2, 3, 4]])
>>> b[:, :, 2] # = abs(a - 10)
array([[10, 9, 8, 7, 6],
[ 5, 4, 3, 2, 1]])
Note how b[:, :, i] is the absolute difference between a and choices[i], for each i = 1, 2, 3.
Hope that helps explain this a little more clearly.
I love broadcasting and would have gone that way myself too. But, with large arrays, I would like to suggest another approach with np.searchsorted that keeps it memory efficient and thus achieves performance benefits, like so -
def searchsorted_app(a, choices):
lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
cl = np.take(choices,lidx) # Or choices[lidx]
cr = np.take(choices,ridx) # Or choices[ridx]
mask = np.abs(a - cl) > np.abs(a - cr)
cl[mask] = cr[mask]
return cl
Please note that if the elements in choices are not sorted, we need to add in the additional argument sorter with np.searchsorted.
Runtime test -
In [160]: # Setup inputs
...: a = np.random.rand(100,100)
...: choices = np.sort(np.random.rand(100))
...:
In [161]: def broadcasting_app(a, choices): # #wwii's solution
...: return choices[np.argmin(np.abs(a[:,:,None] - choices),-1)]
...:
In [162]: np.allclose(broadcasting_app(a,choices),searchsorted_app(a,choices))
Out[162]: True
In [163]: %timeit broadcasting_app(a, choices)
100 loops, best of 3: 9.3 ms per loop
In [164]: %timeit searchsorted_app(a, choices)
1000 loops, best of 3: 1.78 ms per loop
Related post : Find elements of array one nearest to elements of array two
I have scripts with multi-dimensional arrays and instead of for-loops I would like to use a vectorized implementation for my problems (which sometimes contain column operations).
Let's consider a simple example with matrix arr:
> arr = np.arange(12).reshape(3, 4)
> arr
> ([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
> arr.shape
> (3, 4)
So we have a matrix arr with 3 rows and 4 columns.
The simplest case in my scripts is adding something to the values in the array. E.g. I'm doing this for single or multiple rows:
> someVector = np.array([1, 2, 3, 4])
> arr[0] += someVector
> arr
> array([[ 1, 3, 5, 7], <--- successfully added someVector
[ 4, 5, 6, 7], to one row
[ 8, 9, 10, 11]])
> arr[0:2] += someVector
> arr
> array([[ 2, 5, 8, 11], <--- added someVector to two
[ 5, 7, 9, 11], <--- rows at once
[ 8, 9, 10, 11]])
This works well. However, sometimes I need to manipulate one or several columns. One column at a time works:
> arr[:, 0] += [1, 2, 3]
> array([[ 3, 5, 8, 11],
[ 7, 7, 9, 11],
[11, 9, 10, 11]])
^
|___ added the values [1, 2, 3] successfully to
this column
But I am struggling to think out why this does not work for multiple columns at once:
> arr[:, 0:2] += [1, 2, 3]
> ValueError
> Traceback (most recent call last)
> <ipython-input-16-5feef53e53af> in <module>()
> ----> 1 arr[:, 0:2] += [1, 2, 3]
> ValueError: operands could not be broadcast
> together with shapes (3,2) (3,) (3,2)
Isn't this the very same way it works with rows? What am I doing wrong here?
To add a 1D array to multiple columns you need to broadcast the values to a 2D array. Since broadcasting adds new axes on the left (of the shape) by default, broadcasting a row vector to multiple rows happens automatically:
arr[0:2] += someVector
someVector has shape (N,) and gets automatically broadcasted to shape (1, N). If arr[0:2] has shape (2, N), then the sum is performed element-wise as though both arr[0:2] and someVector were arrays of the same shape, (2, N).
But to broadcast a column vector to multiple columns requires hinting NumPy that you want broadcasting to occur with the axis on the right. In fact, you have to add the new axis on the right explicitly by using someVector[:, np.newaxis] or equivalently someVector[:, None]:
In [41]: arr = np.arange(12).reshape(3, 4)
In [42]: arr[:, 0:2] += np.array([1, 2, 3])[:, None]
In [43]: arr
Out[43]:
array([[ 1, 2, 2, 3],
[ 6, 7, 6, 7],
[11, 12, 10, 11]])
someVector (e.g. np.array([1, 2, 3])) has shape (N,) and someVector[:, None] has shape (N, 1) so now broadcasting happens on the right. If arr[:, 0:2] has shape (N, 2), then the sum is performed element-wise as though both arr[:, 0:2] and someVector[:, None] were arrays of the same shape, (N, 2).
Very clear explanation of #unutbu.
As a complement, transposition (.T) can often simplify the task, by working in the first dimension :
In [273]: arr = np.arange(12).reshape(3, 4)
In [274]: arr.T[0:2] += [1, 2, 3]
In [275]: arr
Out[275]:
array([[ 1, 2, 2, 3],
[ 6, 7, 6, 7],
[11, 12, 10, 11]])