Separate Data Based on Value - python

Let's say I have the following data set:
[[0 1994]
[0 1965]
[0 1943]
[1 1994]
[1 1965]
[2 1948]]
I want to achieve the following output by separating the data into individual arrays based on the value in the first column, using NumPy or pandas methods instead of loops.
[
[[0 1994]
[0 1965]
[0 1943]]
[[1 1994]
[1 1965]]
[[2 1948]]
]

Find the split indices from the differences between consecutive items of the first column, and then split the array at those indices:
In [22]: inds = np.where(np.diff(a[:,0]) != 0)[0] + 1
In [23]: np.split(a, inds)
Out[23]:
[array([[ 0, 1994],
[ 0, 1965],
[ 0, 1943]]), array([[ 1, 1994],
[ 1, 1965]]), array([[ 2, 1948]])]
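Note that np.diff only detects changes between neighbouring rows, so the approach relies on equal keys being adjacent. A minimal sketch, assuming the rows may arrive unsorted (the stable argsort step and the shuffled sample data are additions for illustration, not part of the answer above):

```python
import numpy as np

# Same rows as in the question, but deliberately shuffled.
a = np.array([[1, 1994], [0, 1994], [2, 1948],
              [0, 1965], [1, 1965], [0, 1943]])

# Stable sort by the first column so equal keys become adjacent
# while keeping their original relative order.
order = np.argsort(a[:, 0], kind="stable")
a_sorted = a[order]

# Positions where the key changes mark the start of each new group.
inds = np.where(np.diff(a_sorted[:, 0]) != 0)[0] + 1
groups = np.split(a_sorted, inds)

for g in groups:
    print(g.tolist())
```

If the data is already sorted by the first column, the sort is a no-op and the result matches the session above.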

The pandas way is to pass the data to a DataFrame and then use a groupby operation:
df = pd.DataFrame(data)
df.groupby(0).apply(lambda x: x.values).values
Output:
[array([[ 0, 1994],
[ 0, 1965],
[ 0, 1943]]), array([[ 1, 1994],
[ 1, 1965]]), array([[ 2, 1948]])]
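The same grouping can also be written without apply, which may be easier to read. This is only a sketch: the variable names are illustrative, and it assumes the data is a two-column integer array as in the question.

```python
import numpy as np
import pandas as pd

data = np.array([[0, 1994], [0, 1965], [0, 1943],
                 [1, 1994], [1, 1965], [2, 1948]])
df = pd.DataFrame(data)

# Iterating over a groupby yields (key, sub-frame) pairs. By default the
# groups come out sorted by key, with row order preserved inside each group.
groups = [g.to_numpy() for _, g in df.groupby(0)]

for g in groups:
    print(g.tolist())
```

Unlike the np.diff approach, groupby does not need equal keys to be adjacent in the input.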

Related

Collating row entries in 2D array

I have a 2d numpy array consisting of 1s and 0s.
I want to join the 1s and 0s of each row into a single string.
arr =
[[0 1 0]
[ 0 0 0]
[ 1 1 1]
[ 0 1 1]]
Desired output (each element is dtype str, to make sure leading zeros are not omitted):
['010', '000', '111', '011']
How can I manipulate the 2d array to get this output? Is it possible in numpy or regex packages, by using their functions? Can a for loop be avoided to do this array transformation?
Using strings:
import numpy as np
arr = np.array([[0, 1, 0], [ 0, 0, 0], [ 1, 1, 1], [ 0, 1, 1]])
binaries = []
for row in arr:
    # Convert each integer to a string and join the row into one string
    strings = [str(integer) for integer in row]
    a_string = "".join(strings)
    binaries.append(a_string)
>>> binaries
['010', '000', '111', '011']
The question is quite unclear. Assuming integers in and out, you could use:
a = np.array([[0, 1, 0],
[0, 0, 0],
[1, 1, 1],
[0, 1, 1]])
out = (a[:,::-1]*(10**np.arange(a.shape[1]))).sum(1)
But you won't have leading zeros…
output:
array([ 10, 0, 111, 11])
Assuming you really want to convert from binary, you should probably use np.packbits:
out = np.packbits(np.pad(a, ((0,0), (8-a.shape[1],0))), axis=1, bitorder='big')
output:
array([[2],
[0],
[7],
[3]], dtype=uint8)
or as flat version:
out = (np.packbits(np.pad(a, ((0,0), (8-a.shape[1],0))), axis=1, bitorder='big')
.ravel()
)
# array([2, 0, 7, 3], dtype=uint8)
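If both the integer values and the zero-padded strings are wanted, the two ideas can be combined. The following is a rough sketch rather than a definitive answer: the power-of-two weighting is an alternative to np.packbits (not what the answer above uses), and the final formatting step still iterates over the rows in Python.

```python
import numpy as np

a = np.array([[0, 1, 0],
              [0, 0, 0],
              [1, 1, 1],
              [0, 1, 1]])
n_bits = a.shape[1]

# Weight each column by its power of two, most significant bit first.
weights = 1 << np.arange(n_bits - 1, -1, -1)   # [4, 2, 1]
values = a @ weights                           # [2, 0, 7, 3]

# Format back to fixed-width binary strings so leading zeros survive.
strings = [format(int(v), f"0{n_bits}b") for v in values]
print(values, strings)
```

The weighting works for any row width that fits the integer dtype, whereas the packbits version above assumes at most 8 columns.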

Numpy broadcast array to smaller array with exact position for every row

Consider example matrix array:
[[0 1 2 1 0]
[1 1 2 1 0]
[0 1 0 0 0]
[1 2 1 0 0]
[1 2 2 3 2]]
What I need to do:
find maxima in every row
select smaller surrounding of the maxima from every row (3 values in this case)
paste the surrounding of the maxima into new array (narrower)
For the example above, the result is:
[[ 1. 2. 1.]
[ 1. 2. 1.]
[ 0. 1. 0.]
[ 1. 2. 1.]
[ 2. 3. 2.]]
My current working code:
import numpy as np
A = np.array([
[0, 1, 2, 1, 0],
[1, 1, 2, 1, 0],
[0, 1, 0, 0, 0],
[1, 2, 1, 0, 0],
[1, 2, 2, 3, 2],
])
b = A.argmax(axis=1)
C = np.zeros((len(A), 3))
for idx, loc, row in zip(range(len(A)), b, A):
    print(idx, loc, row)
    # copy the 3-wide window centred on the row maximum
    C[idx] = row[loc-1:loc+2]
print(C)
My question:
How to get rid of the for loop and replace it with some cheaper numpy operation?
Note:
This algorithm is for straightening broken "lines" in video stream frames with thousands of rows.
Approach #1
We can build a vectorized solution by setting up sliding windows and then indexing into them with indices offset from b to get the desired output. scikit-image's view_as_windows, which is based on np.lib.stride_tricks.as_strided, gives us the sliding windows.
The implementation would be -
from skimage.util.shape import view_as_windows
L = 3 # window length
w = view_as_windows(A,(1,L))[...,0,:]
Cout = w[np.arange(len(b)),b-L//2]
Being a view-based method, this has the advantage of being memory-efficient and hence good on performance too.
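If scikit-image is not available, NumPy's own sliding_window_view (NumPy 1.20 or later) can play the same role. A minimal sketch, assuming that a window which would run off the edge of a row should be shifted back inside it; that clipping rule is an assumption, not something the question specifies:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

A = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 2, 2, 3, 2],
])
L = 3  # window length
b = A.argmax(axis=1)

# windows[i, j] is a view of A[i, j:j+L]; shape (rows, cols - L + 1, L).
windows = sliding_window_view(A, L, axis=1)

# Start each window L//2 to the left of the maximum. Clipping keeps the
# window inside the row when the maximum sits at an edge (assumed behaviour).
start = np.clip(b - L // 2, 0, A.shape[1] - L)
Cout = windows[np.arange(len(A)), start]
print(Cout)
```

For the example matrix no clipping is triggered, so the result equals the loop-based output.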
Approach #2
Alternatively, a one-liner by creating all those indices with outer-addition would be -
A[np.arange(len(b))[:,None],b[:,None] + np.arange(-(L//2),L//2+1)]
This works by making an array with all the desired indices. Using it directly as A[:, index] pairs every row of A with every row of index, which gives a 3D array of shape (5, 5, 3); the subsequent indexing picks the matching pairs along the diagonal. Probably not optimal, but definitely another way of doing it!
import numpy as np
A = np.array([
[0, 1, 2, 1, 0],
[1, 1, 2, 1, 0],
[0, 1, 0, 0, 0],
[1, 2, 1, 0, 0],
[1, 2, 2, 3, 2],
])
b = A.argmax(axis = 1).reshape(-1, 1)
index = b + np.arange(-1,2,1).reshape(1, -1)
A[:,index][np.arange(b.size),np.arange(b.size)]
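The intermediate 3D array can be avoided with np.take_along_axis, which pairs row i of the index array with row i of A directly. This is a sketch under one assumption: indices that fall outside a row are clipped to the nearest valid column, so edge values would be repeated rather than raising an error.

```python
import numpy as np

A = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 2, 2, 3, 2],
])
L = 3  # window length
b = A.argmax(axis=1)

# One row of column indices per row of A: b-1, b, b+1 for L = 3.
index = b[:, None] + np.arange(-(L // 2), L // 2 + 1)

# Keep indices inside the row (assumed edge handling).
index = np.clip(index, 0, A.shape[1] - 1)

# Row i of `index` selects from row i of A; the result has shape (rows, L).
C = np.take_along_axis(A, index, axis=1)
print(C)
```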

How does this NumPy advanced indexing code work?

I am learning the NumPy framework. I don't understand this piece of code:
import numpy as np
a =np.array([[0,1,2],[3,4,5],[6,7,8],[9,10,11]])
print(a)
row = np.array([[0,0],[3,3]])
col = np.array([[0,2],[0,2]])
b = a[row,col]
print("This is b array:",b)
The array b contains the corner values of a; that is, b equals [[0,2],[9,11]].
When indexing is done using an array or "array-like", to access/modify the elements of an array, then it's called advanced indexing.
In [37]: a
Out[37]:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
In [38]: row
Out[38]:
array([[0, 0],
[3, 3]])
In [39]: col
Out[39]:
array([[0, 2],
[0, 2]])
In [40]: a[row, col]
Out[40]:
array([[ 0, 2],
[ 9, 11]])
That's what you got. Below is an explanation:
Indices of a[row, col]:

   row  col        row  col
    |    |          |    |
    V    V          V    V
 a[ 0,   0 ]     a[ 0,   2 ]
 a[ 3,   0 ]     a[ 3,   2 ]

The first index of each pair comes from the row-idx array, the second from the column-idx array.
You're indexing a using two equally shaped 2d-arrays, hence your output array will also have the same shape as col and row. To better understand how array indexing works you can check the docs, which give the general rule for indexing the N axes of an array x with index arrays ind_1, ..., ind_N that share (or broadcast to) a common M-dimensional shape:
result[i_1, ..., i_M] == x[ind_1[i_1, ..., i_M], ind_2[i_1, ..., i_M],
..., ind_N[i_1, ..., i_M]]
With 1d index arrays (M = 1) the result is 1d. With 2d index arrays, as here (N = 2, M = 2), the result is 2d and result[i, j] == a[row[i, j], col[i, j]].
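The rule can be checked by writing it out as an explicit loop. This is only an illustration of the formula for the arrays in the question; the name result is chosen to mirror the formula and is not part of the original code.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
row = np.array([[0, 0], [3, 3]])
col = np.array([[0, 2], [0, 2]])

# Build the result element by element, exactly as the rule states:
# result[i, j] == a[row[i, j], col[i, j]]
result = np.empty(row.shape, dtype=a.dtype)
for i, j in np.ndindex(row.shape):
    result[i, j] = a[row[i, j], col[i, j]]

print(result)
print(np.array_equal(result, a[row, col]))
```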
So going back to your example you are essentially selecting from the rows of a based on row, and from those rows you are selecting some columns col. You might find it more intuitive to translate the row and column indices into (x,y) coordinates:
(0,0), (0,2)
(3,0), (3,2)
Which, by accordingly selecting from a, results in the output array:
print(a[row,col])
[[ 0  2]
 [ 9 11]]
You can understand it better by trying more examples.
If you have one dimensional index:
In [58]: np.arange(10)[np.array([1,3,4,6])]
Out[58]: array([1, 3, 4, 6])
In case of two dimensional index:
In [57]: np.arange(10)[np.array([[1,3],[4,6]])]
Out[57]:
array([[1, 3],
[4, 6]])
If you use 3 dimensional index:
In [59]: np.arange(10)[np.array([[[1],[3]],[[4],[6]]])]
Out[59]:
array([[[1],
[3]],
[[4],
[6]]])
As you can see, any nesting in the index array is reproduced in the output.
Proceeding by steps:
import numpy as np
a = np.array([[0,1,2],[3,4,5],[6,7,8],[9,10,11]])
print(a)
gives 2d array a:
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
Then:
row = np.array([[0,0],[3,3]])
assigns to 2d array row values [0,0] and [3,3]:
array([[0, 0],
[3, 3]])
Then:
col = np.array([[0,2],[0,2]])
assigns to 2d array col values [0,2] and [0,2]:
array([[0, 2],
[0, 2]])
Finally:
b = a[row,col]
assigns to b values given by a[0,0], a[0,2] for the first row, a[3,0], a[3,2] for the second row, that is:
array([[ 0, 2],
[ 9, 11]])
Where does b[0,0] <-- a[0,0] come from? It comes from the combination of row[0,0] which is 0 and col[0,0] which is 0.
What about b[0,1] <-- a[0,2]? It comes from the combination of row[0,1] which is 0 and col[0,1] which is 2.
And so forth.
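Because index arrays are broadcast against each other, the full 2d row and col arrays do not have to be written out by hand. A short sketch of two equivalent spellings, offered as alternatives rather than as the asker's code: a column of row indices combined with a flat array of column indices, and np.ix_, which builds such a pair for you.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
row = np.array([[0, 0], [3, 3]])
col = np.array([[0, 2], [0, 2]])
b = a[row, col]

# A (2, 1) row index and a (2,) column index broadcast to shape (2, 2),
# which reproduces the explicit row and col arrays above.
b_broadcast = a[np.array([[0], [3]]), np.array([0, 2])]

# np.ix_ constructs the same pair of broadcastable index arrays.
b_ix = a[np.ix_([0, 3], [0, 2])]

print(b_broadcast)
print(b_ix)
```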

Indexing the max elements in a multidimensional tensor in PyTorch

I'm trying to index the maximum elements along the last dimension in a multidimensional tensor. For example, say I have a tensor
A = torch.randn((5, 2, 3))
_, idx = torch.max(A, dim=2)
Here idx stores the maximum indices, which may look something like
>>> A
tensor([[[ 1.0503, 0.4448, 1.8663],
[ 0.8627, 0.0685, 1.4241]],
[[ 1.2924, 0.2456, 0.1764],
[ 1.3777, 0.9401, 1.4637]],
[[ 0.5235, 0.4550, 0.2476],
[ 0.7823, 0.3004, 0.7792]],
[[ 1.9384, 0.3291, 0.7914],
[ 0.5211, 0.1320, 0.6330]],
[[ 0.3292, 0.9086, 0.0078],
[ 1.3612, 0.0610, 0.4023]]])
>>> idx
tensor([[ 2, 2],
[ 0, 2],
[ 0, 0],
[ 0, 2],
[ 1, 0]])
I want to be able to access these indices and assign to another tensor based on them. Meaning I want to be able to do
B = A.new_zeros(A.size())
B[idx] = A[idx]
where B is 0 everywhere except where A is maximum along the last dimension. That is B should store
>>> B
tensor([[[ 0, 0, 1.8663],
[ 0, 0, 1.4241]],
[[ 1.2924, 0, 0],
[ 0, 0, 1.4637]],
[[ 0.5235, 0, 0],
[ 0.7823, 0, 0]],
[[ 1.9384, 0, 0],
[ 0, 0, 0.6330]],
[[ 0, 0.9086, 0],
[ 1.3612, 0, 0]]])
This is proving to be much more difficult than I expected, as the idx does not index the array A properly. Thus far I have been unable to find a vectorized solution to use idx to index A.
Is there a good vectorized way to do this?
You can use torch.meshgrid to create an index tuple:
>>> index_tuple = torch.meshgrid([torch.arange(x) for x in A.size()[:-1]]) + (idx,)
>>> B = torch.zeros_like(A)
>>> B[index_tuple] = A[index_tuple]
Note that you can also mimic meshgrid via (for the specific case of 3D):
>>> index_tuple = (
... torch.arange(A.size(0))[:, None],
... torch.arange(A.size(1))[None, :],
... idx
... )
A bit more explanation:
The indices will look something like this:
In [173]: idx
Out[173]:
tensor([[2, 1],
[2, 0],
[2, 1],
[2, 2],
[2, 2]])
From this, we want to go to three indices (since our tensor is 3D, we need three numbers to retrieve each element). Basically we want to build a grid in the first two dimensions, as shown below. (And that's why we use meshgrid).
In [174]: A[0, 0, 2], A[0, 1, 1]
Out[174]: (tensor(0.6288), tensor(-0.3070))
In [175]: A[1, 0, 2], A[1, 1, 0]
Out[175]: (tensor(1.7085), tensor(0.7818))
In [176]: A[2, 0, 2], A[2, 1, 1]
Out[176]: (tensor(0.4823), tensor(1.1199))
In [177]: A[3, 0, 2], A[3, 1, 2]
Out[177]: (tensor(1.6903), tensor(1.0800))
In [178]: A[4, 0, 2], A[4, 1, 2]
Out[178]: (tensor(0.9138), tensor(0.1779))
In the above 5 lines, the first two numbers in the indices are basically the grid that we build using meshgrid and the third number is coming from idx.
i.e. the first two numbers form a grid.
(0, 0) (0, 1)
(1, 0) (1, 1)
(2, 0) (2, 1)
(3, 0) (3, 1)
(4, 0) (4, 1)
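Putting the grid and idx together gives a complete, checkable example. A minimal sketch, assuming a 3D tensor and using a fixed seed only so the run is reproducible; the names i and j are illustrative.

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 2, 3)
idx = A.argmax(dim=2)                   # shape (5, 2)

# Index grids for the first two dimensions; they broadcast to (5, 2),
# so (i, j, idx) addresses one element per (row, column) pair.
i = torch.arange(A.size(0))[:, None]    # shape (5, 1)
j = torch.arange(A.size(1))[None, :]    # shape (1, 2)

B = torch.zeros_like(A)
B[i, j, idx] = A[i, j, idx]

# Each slice along the last dimension of B now holds only the maximum.
print(torch.equal(B.sum(dim=2), A.max(dim=2).values))
```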
An ugly hackaround is to create a binary mask out of idx and use it to index the arrays. The basic code looks like this:
import torch
torch.manual_seed(0)
A = torch.randn((5, 2, 3))
_, idx = torch.max(A, dim=2)
mask = torch.arange(A.size(2)).reshape(1, 1, -1) == idx.unsqueeze(2)
B = torch.zeros_like(A)
B[mask] = A[mask]
print(A)
print(B)
The trick is that torch.arange(A.size(2)) enumerates the possible values in idx and mask is nonzero in places where they equal the idx. Remarks:
If you really discard the first output of torch.max, you can use torch.argmax instead.
I assume that this is a minimal example of some wider problem, but be aware that you are currently reinventing torch.nn.functional.max_pool3d with kernel of size (1, 1, 3).
Also, be aware that in-place modification of tensors with masked assignment can cause issues with autograd, so you may want to use torch.where instead.
I would expect somebody to come up with a cleaner solution (avoiding the intermediate allocation of the mask array), likely making use of torch.index_select, but I can't get it to work right now.
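The torch.where variant mentioned in the remarks could look roughly like this. It is a sketch, not the answerer's code: building the mask with torch.nn.functional.one_hot is a choice made here for brevity, and it still allocates a mask, so it does not address the allocation concern.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
A = torch.randn(5, 2, 3)
idx = A.argmax(dim=-1)                  # shape (5, 2)

# one_hot turns each index into a 0/1 vector of length A.size(-1),
# giving a mask with the same shape as A.
mask = F.one_hot(idx, num_classes=A.size(-1)).bool()

# Out-of-place selection: keep A where the mask is set, 0 elsewhere.
B = torch.where(mask, A, torch.zeros_like(A))
print(B)
```

Since nothing is modified in place, B is a regular differentiable function of A.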
You could use torch.scatter here:
>>> import torch
>>> a = torch.randn(4,2,3)
>>> a
tensor([[[ 0.1583, 0.1102, -0.8188],
[ 0.6328, -1.9169, -0.5596]],
[[ 0.5335, 0.4069, 0.8403],
[-1.2537, 0.9868, -0.4947]],
[[-1.2830, 0.4386, -0.0107],
[ 1.3384, 0.5651, 0.2877]],
[[-0.0334, -1.0619, -0.1144],
[ 0.1954, -0.7371, 1.7001]]])
>>> # keep both the max values and their indices along dim 1
>>> vals, ind = torch.max(a, 1, keepdim=True)
>>> ind
tensor([[[1, 0, 1]],
[[0, 1, 0]],
[[1, 1, 1]],
[[1, 1, 1]]])
>>> # scatter the max values (not a itself) to the positions given by ind
>>> torch.zeros_like(a).scatter(1, ind, vals)
tensor([[[ 0.0000, 0.1102, 0.0000],
[ 0.6328, 0.0000, -0.5596]],
[[ 0.5335, 0.0000, 0.8403],
[ 0.0000, 0.9868, 0.0000]],
[[ 0.0000, 0.0000, 0.0000],
[ 1.3384, 0.5651, 0.2877]],
[[ 0.0000, 0.0000, 0.0000],
[ 0.1954, -0.7371, 1.7001]]])
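The question asks for the maximum along the last dimension, whereas the session above reduces over dim 1. A minimal sketch of the same scatter idea along dim=-1, passing the max values as the source; the seed and tensor shape are taken from the question's setup only for illustration.

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 2, 3)

# keepdim=True keeps a trailing axis of size 1, so both outputs have
# shape (5, 2, 1) and can be used directly as scatter index and source.
vals, idx = A.max(dim=-1, keepdim=True)

# Write each maximum back to its own position; everything else stays 0.
B = torch.zeros_like(A).scatter(-1, idx, vals)
print(B)
```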

Selecting rows from two NumPy ndarrays and inserting 0 for the missing match

I have two NumPy ndarrays named 'a' and 'b'. I want to select only certain rows from array 'b' based on a comparison with 'a', and insert 0 for the rows where no match is found. I did the first part, e.g.:
a = np.array([[1,5,9],
[2,6,10],
[5,14,10]])
b = np.array([[ 1,0,9],
[2,6,10],
[4,6,10]])
output
[[ 1 0 9]
[ 2 6 10]]
expected output
[[ 1 0 9]
[ 2 6 10]
[ 0 0 0]]
Code:
import numpy as np
wanted= a[:,[0]]
y=b[np.logical_or.reduce([b[:,0] == x for x in wanted])]
print(y)
In the above example, the value 5 in the first column of row 3 of 'a' has no match in array 'b'. When comparing 'a' with 'b', if no match is found I want to insert 0s in that row so that the two arrays have the same dimensions.
If you would like every row of b whose first element does not appear in a[:, 0] to be zeroed, you can do the following (np.isin supersedes the older np.in1d):
>>> b[~np.isin(b[:, 0], a[:, 0]), :] = 0
>>> b
array([[ 1, 0, 9],
[ 2, 6, 10],
[ 0, 0, 0]])
If you would like any element of b[:, 0] that is not in the corresponding row of a to be zero:
>>> b[~np.any(b[:, 0][:,None]==a, axis=1), :] = 0
>>> b
array([[ 1, 0, 9],
[ 2, 6, 10],
[ 0, 0, 0]])
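Both snippets above zero rows of b in place, in b's own order. If the output should instead follow the rows of a (one output row per row of a, zeros where a's key has no partner in b), something like the following could work. It is a sketch under the assumption that the first column of b holds unique keys; with duplicates, the first matching row is used.

```python
import numpy as np

a = np.array([[1, 5, 9], [2, 6, 10], [5, 14, 10]])
b = np.array([[1, 0, 9], [2, 6, 10], [4, 6, 10]])

# match[i, j] is True when a[i, 0] equals b[j, 0].
match = a[:, [0]] == b[:, 0]

has_match = match.any(axis=1)        # does row i of a have a partner in b?
first_match = match.argmax(axis=1)   # index of that partner (0 if none)

# Take the matching row of b, or zeros where there is no match.
out = np.where(has_match[:, None], b[first_match], 0)
print(out)
```

For the sample data this gives the expected output from the question and leaves b itself unchanged.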
