Remove rows from numpy array based on presence/absence in other arrays - python

I have 3 different numpy arrays, but they all start with two columns which contain the day of year and the time. For example:
dyn = [[ 83 12 7.10555687e-01 ..., 6.99242766e-01 6.868761e-01]
[ 83 13 8.28091972e-01 ..., 8.33734118e-01 8.47266838e-01]
[ 83 14 8.79437354e-01 ..., 8.73598144e-01 8.57156213e-01]
[ 161 23 3.28109488e-01 ..., 2.83043689e-01 2.59775391e-01]
[ 162 0 2.23502046e-01 ..., 1.96972086e-01 1.65565263e-01]
[ 162 1 2.51653976e-01 ..., 2.17209188e-01 1.42133495e-1]]
us = [[ 133 18 3.00483815e+02 ..., 1.94277561e+00 2.8168959e+00]
[ 133 19 2.98832620e+02 ..., 2.42506475e+00 2.99730800e+00]
[ 133 20 2.96706105e+02 ..., 3.16851622e+00 4.41187088e+00]
[ 161 23 2.88336560e+02 ..., 3.44864070e-01 3.85055635e-01]
[ 162 0 2.87593240e+02 ..., 2.93002410e-01 2.67112490e-01]
[ 162 2 2.86992180e+02 ..., 7.08996730e-02 2.6403210e-01]]
I need to remove any rows whose specific date and time aren't present in all 3 arrays. In other words, I want to be left with 3 arrays where the first 2 columns are identical in each of the 3 arrays.
So the resulting smaller arrays would be:
dyn= [[ 161 23 3.28109488e-01 ..., 2.83043689e-01 2.59775391e-01]
[ 162 0 2.23502046e-01 ..., 1.96972086e-01 1.65565263e-01]]
us= [[ 161 23 2.88336560e+02 ..., 3.44864070e-01 3.85055635e-01]
[ 162 0 2.87593240e+02 ..., 2.93002410e-01 2.67112490e-01]]
(But then also limited by what's in the third array)
I've tried using sort/zip, but I'm not sure it can be applied to a 2D array like this:
X= dyn
Y = us
xsorted=[x for (y,x) in sorted(zip(Y[:,1],X[:,1]), key=lambda pair: pair[0])]
I also tried a loop, but that only works when the same times/days are in the same position within each array, which isn't helpful:
for i in range(100):
    dyn_small = dyn[dyn[:, 0] == us[i, 0]]

Assuming A, B and C as the input arrays, here's a vectorized approach making heavy use of broadcasting -
# Get masks comparing all rows of A with B and then B with C
M1 = (A[:,None,:2] == B[:,:2])
M2 = (B[:,None,:2] == C[:,:2])
# Get a joint 3D mask of those two masks and get the indices of matches.
# These indices (I,J,K) of the 3D mask basically tell us the row numbers
# corresponding to each of the input arrays that are present in all of them.
# Thus, in (I,J,K), I would be the matching row number in A, J in B & K in C.
I,J,K = np.where((M1[:,:,None,:] & M2).all(3))
# Finally, select rows of A, B and C with I, J and K respectively
A_new = A[I]
B_new = B[J]
C_new = C[K]
Sample run -
1) Inputs :
In [116]: A
Out[116]:
array([[ 83, 12, 443],
[ 83, 13, 565],
[ 83, 14, 342],
[161, 23, 431],
[162, 0, 113],
[162, 1, 313]])
In [117]: B
Out[117]:
array([[161, 23, 999],
[ 5, 1, 13],
[ 83, 12, 15],
[162, 0, 12],
[ 4, 3, 11]])
In [118]: C
Out[118]:
array([[ 11, 23, 143],
[162, 0, 113],
[161, 23, 545]])
2) Run solution code to get matching row IDs and thus extract the rows :
In [119]: M1 = (A[:,None,:2] == B[:,:2])
...: M2 = (B[:,None,:2] == C[:,:2])
...:
In [120]: I,J,K = np.where((M1[:,:,None,:] & M2).all(3))
In [121]: A[I]
Out[121]:
array([[161, 23, 431],
[162, 0, 113]])
In [122]: B[J]
Out[122]:
array([[161, 23, 999],
[162, 0, 12]])
In [123]: C[K]
Out[123]:
array([[161, 23, 545],
[162, 0, 113]])
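As a side note, the joint 3D mask above scales with the product of the row counts of A, B and C, which can get memory-heavy for long arrays. Here's a minimal sketch of a sorting-based alternative (my addition, assuming the second column is an hour in the range 0-23 so that each (day, time) pair can be packed into one integer):
import numpy as np
# Pack (day, time) into a single integer key.
# Assumption: the time column is an hour in 0..23.
def key(x):
    return x[:, 0].astype(np.int64) * 24 + x[:, 1].astype(np.int64)
common = np.intersect1d(np.intersect1d(key(A), key(B)), key(C))
A_new = A[np.isin(key(A), common)]
B_new = B[np.isin(key(B), common)]
C_new = C[np.isin(key(C), common)]
This sorts the keys instead of comparing all row pairs, so it stays close to O(N log N).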

The numpy_indexed package (disclaimer: I am its author) contains functionality to solve such problems in an elegant and efficient/vectorized manner:
import numpy as np
import numpy_indexed as npi
dyn = np.array(dyn)
us = np.array(us)
dyn_index = npi.as_index(dyn[:, :2])
us_index = npi.as_index(us[:, :2])
common = npi.intersection(dyn_index, us_index)
print(common)
print(dyn[npi.contains(common, dyn_index)])
print(us[npi.contains(common, us_index)])
Note that the performance is O(N log N) in the worst case, and linear insofar as the arguments to as_index are already in sorted order. By contrast, the currently accepted answer is quadratic in input size.

Related

Sorting 2 single dimensional arrays into a 1 dimensional array

I am trying to write code that chooses elements one by one from a and b. I want to make a 2-dimensional array where the first index is either 0 or 1: 0 representing a and 1 representing b. The second index is just the value from a or b, so an entry looks like [[0 7] [1 13]]. I also want the function to alternate: if it starts with a, it goes a, b, a, b, a, ...; the other way around, b, a, b, a, b, .... Which array comes first is decided by comparing first elements: since the first element of b is 0 and the first element of a is 7, and 0 < 7, the output starts with b, giving [[1 0]], and then takes the next element of a, which is 7, giving [[1 0], [0 7]]. It keeps doing this until it reaches the end of both a and b. How can I get the expected output below?
import numpy as np
a = np.array([ 7, 9, 12, 15, 17, 22])
b = np.array([ 0, 13, 17, 18])
Expected Output:
[[ 1 0]
[ 0 7]
[ 1 13]
[ 0 15]
[ 1 17]
[ 0 17]
[ 1 18]
[ 0 22]]
You can combine the two arrays and sort the values while preserving the origin of each value (using 2N and 2N+1 offsetting).
Then filter out the consecutive odd/even values to only retain values with an alternating origin indicator (1 or 0).
Finally, build the resulting array of [origin, value] pairs by reversing the 2N and 2N+1 tagging.
import numpy as np
a = np.array([ 7, 9, 12, 15, 17, 22])
b = np.array([ 0, 13, 17, 18])
p = 1 if a[0] > b[0] else 0 # determine first entry
c = np.sort(np.concatenate((a*2+p,b*2+1-p))) # combine/sort tagged values
c = np.concatenate((c[:1],c[1:][c[:-1]%2 != c[1:]%2])) # filter out same-array repeats
c = np.concatenate(((c[:,None]+p)%2,c[:,None]//2),axis=1) # build result
print(c)
[[ 1 0]
[ 0 7]
[ 1 13]
[ 0 15]
[ 1 17]
[ 0 17]
[ 1 18]
[ 0 22]]
This isn't a Numpy solution, but may work if you are okay processing these as lists. You can make iterators out of the lists, then alternate between them using itertools.dropwhile to proceed through the elements until you get the next in line. It might look something like:
from itertools import dropwhile
def pairs(a, b):
    index = 0 if a[0] <= b[0] else 1
    iters = [iter(a), iter(b)]
    while True:
        try:
            current = next(iters[index])
            yield [index, current]
            index = int(not index)
        except StopIteration:
            break
        iters[index] = dropwhile(lambda n: n < current, iters[index])
list(pairs(a, b))
Which results in:
[[1, 0], [0, 7], [1, 13], [0, 15], [1, 17], [0, 17], [1, 18], [0, 22]]
You can tag each value with the array it came from, sort the tagged values, and then filter group-wise so that consecutive entries alternate between the two arrays (with a split and flip to cover the second possibility for equal values):
c = np.hstack([np.vstack([a, np.zeros(len(a))]), np.vstack([b, np.ones(len(b))])]).T
c = c[c[:, 0].argsort()]
# Group-wise split and flip of the array - 2nd possibility
d = np.vstack(np.apply_along_axis(np.flip, 0, np.split(c, np.unique(c[:,0], return_index = True)[1])))[::-1]
res1 = np.vstack([d[0], d[1:][d[:,1][:-1]!=d[:,1][1:]]])
res2 = np.vstack([c[0], c[1:][c[:,1][:-1]!=c[:,1][1:]]])
if res1.shape[0] > res2.shape[0]:
    print(res1)
else:
    print(res2)
Out:
[[ 0. 1.]
[ 7. 0.]
[13. 1.]
[15. 0.]
[17. 1.]
[17. 0.]
[18. 1.]
[22. 0.]]

Element-wise numpy matrix multiplication

I have two numpy arrays A and B, both with the dimension [2,2,n], where n is a very large number. I want to matrix multiply A and B in the first two dimensions to get C, i.e. C=AB, where C has the dimension [2,2,n].
The simplest way to accomplish this is by using a for loop, i.e.
for i in range(n):
    C[:,:,i] = np.matmul(A[:,:,i], B[:,:,i])
However, this is inefficient since n is very large. What's the most efficient way to do this with numpy?
You can do the following:
new_array = np.einsum('ijk,jlk->ilk', A, B)
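For comparison (an addition, not part of the original answer), np.matmul broadcasts over leading axes, so the same batched product can be written by moving n to the front and back:
import numpy as np
n = 1000
A = np.random.rand(2, 2, n)
B = np.random.rand(2, 2, n)
# Move n to the leading axis, batch-multiply, then move it back.
C = np.matmul(A.transpose(2, 0, 1), B.transpose(2, 0, 1)).transpose(1, 2, 0)
assert np.allclose(C, np.einsum('ijk,jlk->ilk', A, B))
If you control the data layout, storing the arrays as (n, 2, 2) in the first place avoids the transposes entirely.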
What you want is the default array multiplication in NumPy
In [22]: a = np.arange(8).reshape((2,2,2))+1 ; a[:,:,0], a[:,:,1]
Out[22]:
(array([[1, 3],
[5, 7]]),
array([[2, 4],
[6, 8]]))
In [23]: aa = a*a ; aa[:,:,0], aa[:,:,1]
Out[23]:
(array([[ 1, 9],
[25, 49]]),
array([[ 4, 16],
[36, 64]]))
Notice that I emphasized array because Numpy's arrays look like matrices but are indeed Numpy's ndarrays.
Post Scriptum
I guess that what you really want are arrays (not matrices) with shape (n,2,2), so that you can address individual 2×2 matrices using a single index, e.g.,
In [27]: n = 3
...: a = np.arange(n*2*2)+1 ; a_22n, a_n22 = a.reshape((2,2,n)), a.reshape((n,2,2))
...: print(a_22n[0])
...: print(a_n22[0])
[[1 2 3]
[4 5 6]]
[[1 2]
[3 4]]
Post Post Scriptum
Re semantically correct:
In [13]: import numpy as np
...: n = 3
...: a = np.arange(2*2*n).reshape((2,2,n))+1
...: p = lambda t,a,n:print(t,*(a[:,:,i]for i in range(n)),sep=',\n')
...: p('Original array', a, n)
...: p('Using `einsum("ijk,jlk->ilk", ...)`', np.einsum('ijk,jlk->ilk', a, a), n)
...: p('Using standard multiplication', a*a, n)
Original array,
[[ 1 4]
[ 7 10]],
[[ 2 5]
[ 8 11]],
[[ 3 6]
[ 9 12]]
Using `einsum("ijk,jlk->ilk", ...)`,
[[ 29 44]
[ 77 128]],
[[ 44 65]
[104 161]],
[[ 63 90]
[135 198]]
Using standard multiplication,
[[ 1 16]
[ 49 100]],
[[ 4 25]
[ 64 121]],
[[ 9 36]
[ 81 144]]

numpy broadcasting to each column of the matrix separately

I have two matrices:
a = np.array([[6],[3],[4]])
b = np.array([1,10])
when I do:
c = a * b
c looks like this:
[ 6, 60]
[ 3, 30]
[ 4, 40]
which is good.
Now, let's say I add a column to a (for the sake of the example it's an identical column, but it doesn't have to be):
a = np.array([[6,6],[3,3],[4,4]])
b stays the same.
The result I want is 2 identical copies of c (since the columns are identical), stacked along a new axis:
new_c.shape == (3, 2, 2)
so that if you do new_c[:,:,0] or new_c[:,:,1] you get the original c.
I tried adding new axes to both a and b using np.expand_dims but it did not help.
One way is using numpy.einsum:
>>> import numpy as np
>>> a = np.array([[6],[3],[4]])
>>> b = np.array([1,10])
>>> print(a * b)
[[ 6 60]
[ 3 30]
[ 4 40]]
>>> print(np.einsum('ij, j -> ij', a, b))
[[ 6 60]
[ 3 30]
[ 4 40]]
>>> a = np.array([[6,6],[3,3],[4,4]])
>>> print(np.einsum('ij, k -> ikj', a, b)[:, :, 0])
>>> print(np.einsum('ij, k -> ikj', a, b)[:, :, 1])
[[ 6 60]
[ 3 30]
[ 4 40]]
[[ 6 60]
[ 3 30]
[ 4 40]]
For more on using numpy.einsum, I recommend:
Understanding NumPy's einsum
You have multiple options here, one of which is using numpy.einsum as explained in the other answer. Another possibility is using the array reshape method:
result = a.T.reshape((a.shape[1], a.shape[0], 1)) * b
result = result.reshape((-1, 2))
result
array([[ 6, 60],
[ 3, 30],
[ 4, 40],
[ 6, 60],
[ 3, 30],
[ 4, 40]])
Yet what is more intuitive to me is to stack arrays by means of np.vstack, with each column of a multiplied by b, as follows:
result = np.vstack([c[:, None] * b for c in a.T])
result
array([[ 6, 60],
[ 3, 30],
[ 4, 40],
[ 6, 60],
[ 3, 30],
[ 4, 40]])
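For completeness (an addition, not from the answers above), and since the question mentions np.expand_dims: the requested (3, 2, 2) result can be obtained directly by inserting axes so the shapes broadcast:
import numpy as np
a = np.array([[6, 6], [3, 3], [4, 4]])
b = np.array([1, 10])
# a[:, None, :] has shape (3, 1, 2) and b[None, :, None] has shape (1, 2, 1),
# so the product has shape (3, 2, 2) with new_c[i, k, j] == a[i, j] * b[k].
new_c = a[:, None, :] * b[None, :, None]
print(new_c[:, :, 0])   # the original c
print(new_c[:, :, 1])   # identical here, since the columns of a match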

identifying sub-arrays in numpy

I have two two-dimensional arrays a and b (the number of columns of a is <= the number of columns of b). I would like to find an efficient way of matching a row in array a to a contiguous part of a row in array b.
a = np.array([[ 25, 28],
[ 84, 97],
[105, 24],
[ 28, 900]])
b = np.array([[ 25, 28, 84, 97],
[ 22, 25, 28, 900],
[ 11, 12, 105, 24]])
The output should be np.array([[0,0], [0,1], [1,0], [2,2], [3,1]]). Row 0 in array a matches Row 0 in array b (first two positions). Row 1 in array a matches row 0 in array b (third and fourth positions).
We can leverage scikit-image's view_as_windows (based on np.lib.stride_tricks.as_strided) for efficient patch extraction, and then compare those patches against each row of a, all of it in a vectorized manner. Then, get the matching indices with np.argwhere -
# a and b from posted question
In [325]: from skimage.util.shape import view_as_windows
In [428]: w = view_as_windows(b,(1,a.shape[1]))
In [429]: np.argwhere((w == a).all(-1).any(-2))[:,::-1]
Out[429]:
array([[0, 0],
[1, 0],
[0, 1],
[3, 1],
[2, 2]])
Alternatively, we could get the indices in the order of rows in a by pushing forward the first axis of a while performing broadcasted comparisons -
In [444]: np.argwhere((w[:,:,0] == a[:,None,None,:]).all(-1).any(-1))
Out[444]:
array([[0, 0],
[0, 1],
[1, 0],
[2, 2],
[3, 1]])
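As an aside (not part of the original answer), on NumPy >= 1.20 the same windows can be built without scikit-image via np.lib.stride_tricks.sliding_window_view:
import numpy as np
# a and b as in the posted question; windows of shape (1, a.shape[1]) over b
w = np.lib.stride_tricks.sliding_window_view(b, (1, a.shape[1]))
out = np.argwhere((w == a).all(-1).any(-2))[:, ::-1]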
Another way I can think of is to loop over each row in a and perform a 2D correlation between b, treated as a 2D signal, and that row of a.
We would look for results that are equal to the sum of squares of the values in the row. If we subtract this sum of squares from our correlation result, matches show up as zeros. Any rows that give you a 0 result mean the subarray was found in that row. If you are using floating-point numbers, you may want to compare against some small threshold that is just above 0.
If you can use SciPy, the scipy.signal.correlate2d method is what I had in mind.
import numpy as np
from scipy.signal import correlate2d
a = np.array([[ 25, 28],
[ 84, 97],
[105, 24]])
b = np.array([[ 25, 28, 84, 97],
[ 22, 25, 28, 900],
[ 11, 12, 105, 24]])
EPS = 1e-8
result = []
for (i, row) in enumerate(a):
    out = correlate2d(b, row[None,:], mode='valid') - np.square(row).sum()
    locs = np.where(np.abs(out) <= EPS)[0]
    unique_rows = np.unique(locs)
    for res in unique_rows:
        result.append((i, res))
We get:
In [32]: result
Out[32]: [(0, 0), (0, 1), (1, 0), (2, 2)]
The time complexity of this could be better, especially since we're looping over each row of a to find any subarrays in b.

How to print values of a 3 dimensional array that are less than a specific value?

so I have a dataset named data_low that looks like this:
(array([ 0, 0, 0, ..., 30, 30, 30]), array([ 2, 2, 5, ..., 199, 199, 199]), array([113, 114, 64, ..., 93, 94, 96]))
And this is its shape: (84243,3).
I can get a unique value for precipitation from the dataset like this:
In [63]: print(data_low[0, 2, 113])
Out [63]: 1.74
What I am trying to do is print all the values in my dataset that are less than 3.86667. I'm pretty new to Python, and don't really know which loop to use in order to do this. Help is much appreciated. Thanks.
EDIT: Here is the program I currently have. For some context, I used ncecat to combine 31 datasets, so that is why I have three 1D arrays: the first array is the day, and the 2nd and 3rd represent longitude and latitude.
data_path = r"C:\Users\matth\Downloads\TRMM_3B42RT\3B42RT_Daily.201001.7.nc4"
f = Dataset(data_path)
latbounds = [ -38 , -20 ]
lonbounds = [ 115 , 145 ] # degrees east ?
lats = f.variables['lat'][:]
lons = f.variables['lon'][:]
# latitude lower and upper index
latli = np.argmin( np.abs( lats - latbounds[0] ) )
latui = np.argmin( np.abs( lats - latbounds[1] ) )
# longitude lower and upper index
lonli = np.argmin( np.abs( lons - lonbounds[0] ) )
lonui = np.argmin( np.abs( lons - lonbounds[1] ) )
precip_subset = f.variables['precipitation'][ : , lonli:lonui , latli:latui ]
print(precip_subset.shape)
print(precip_subset.size)
print(np.mean(precip_subset))
data_low = np.nonzero((precip_subset > 0) & (precip_subset < 3.86667))
print(data_low)
x = list(zip(*data_low))[:]
xx = np.array(x)
print(xx.shape)
print(xx.size)
for i in range(0,84243,1):
    print(data_low[i, i, i])
OUT:
In [136]: %run "C:\Users\matth\precip_anomalies.py"
(31, 120, 72)
267840
1.51398
(array([ 0, 0, 0, ..., 30, 30, 30]), array([ 7, 7, 7, ..., 119, 119, 119]), array([ 9, 10, 11, ..., 23, 53, 54]))
(13982, 3)
41946
[ 0 0 0 ..., 30 30 30]
TypeError                                 Traceback (most recent call last)
C:\Users\matth\precip_anomalies.py in <module>()
     53
     54 for i in range(0,84243,1):
---> 55     print(data_low[i, i, i])
TypeError: tuple indices must be integers, not tuple
Given that data_low is a numpy array (based on your question it is not; it is a tuple of three index arrays returned by np.nonzero), you can use masking:
data_low[data_low < 3.86667]
This will return a 1D numpy array that contains all the values that are less than 3.86667.
If you want these as a vanilla Python list, you can use:
list(data_low[data_low < 3.86667])
But if you want to do further processing (in numpy), you had better keep the numpy array anyway.
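Applied to the code in the question, that means masking precip_subset itself rather than indexing the tuple returned by np.nonzero. A minimal sketch, reusing precip_subset from the question's script:
import numpy as np
mask = (precip_subset > 0) & (precip_subset < 3.86667)
values_low = precip_subset[mask]             # 1D array of the qualifying values
days, lon_idx, lat_idx = np.nonzero(mask)    # their (day, lon, lat) indices
for d, x, y, v in zip(days, lon_idx, lat_idx, values_low):
    print(d, x, y, v)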
