Numpy find indices of matching columns - python

I have a large 2xn array A and a smaller 2xm array B. Every column of B can be found in A. I want to find the indices into A by matching the columns of B. For example,
import numpy
A = numpy.array([
[101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
[102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]
])
B = numpy.array([
[101, 103, 109],
[102, 224, 212]
])
The answer I'm looking for is [0, 6, 14]. I'd be interested to know if there is an efficient way to do this rather than looping. Thanks!

There is hardly a perfect answer for your problem: numpy is not very well suited to this type of problem, although it can be done. For subarray searches, if your dtype is not floating point, viewing each column as a single void item is probably your best bet. You would start with something like:
import numpy as np

AA = np.ascontiguousarray(A.T)  # make each column of A a contiguous row
BB = np.ascontiguousarray(B.T)
# One void item per row, i.e. per original column:
dt = np.dtype((np.void, AA.dtype.itemsize * AA.shape[1]))
AA = AA.view(dt).ravel()
BB = BB.view(dt).ravel()
And now it is just about searching for the items in a 1D array in another 1D array, which is pretty straightforward, assuming there are no repeated columns in the original A array.
If either of your arrays is really small, as in your example, it is going to be hard to beat something like:
indices = np.argmax(AA == BB[:, None], axis=1)
But for larger datasets, it is going to be hard to beat a sorting approach:
sorter = np.argsort(AA)
sorted_indices = np.searchsorted(AA, BB, sorter=sorter)
indices = sorter[sorted_indices]
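Putting the pieces together, here is a minimal end-to-end sketch (my wrapper, not part of the answer above), assuming every column of B occurs exactly once in A, the dtype is not floating point, and numpy's byte-wise void comparisons behave as described:
import numpy as np

def match_columns(A, B):
    # View each column as a single void item so whole columns compare at once.
    AA = np.ascontiguousarray(A.T)
    BB = np.ascontiguousarray(B.T)
    dt = np.dtype((np.void, AA.dtype.itemsize * AA.shape[1]))
    AA = AA.view(dt).ravel()
    BB = BB.view(dt).ravel()
    # Indirectly sort A's columns, then binary-search for B's columns.
    sorter = np.argsort(AA)
    return sorter[np.searchsorted(AA, BB, sorter=sorter)]

A = np.array([[101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
              [102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]])
B = np.array([[101, 103, 109],
              [102, 224, 212]])
print(match_columns(A, B))  # expected: [ 0  6 14]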

Here's a way, given the arrays are pre-sorted:
import numpy
A = numpy.array([
[101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
[102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]
])
B = numpy.array([
[101, 103, 109],
[102, 224, 212]
])
def search2D(A, B):
    to_find_and_bounds = zip(
        B[1],
        numpy.searchsorted(A[0], B[0], side="left"),
        numpy.searchsorted(A[0], B[0], side="right")
    )
    for to_find, left, right in to_find_and_bounds:
        offset = numpy.searchsorted(A[1, left:right], to_find)
        yield offset + left

list(search2D(A, B))
#>>> [0, 6, 14]
This is O(len B · log len A).
For unsorted arrays, you can perform an indirect sort:
sorter = numpy.lexsort(A[::-1])
sorted_copy = A.T[sorter].T
sorter[list(search2D(sorted_copy, B))]
#>>> array([ 0,  6, 14])
(Here A happens to be sorted already, so the sorter is the identity permutation and the indices come out unchanged.)
If you need multiple results from one index, try
for to_find, left, right in to_find_and_bounds:
    offset_left = numpy.searchsorted(A[1, left:right], to_find, side="left")
    offset_right = numpy.searchsorted(A[1, left:right], to_find, side="right")
    yield from range(offset_left + left, offset_right + left)
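Wrapped up as a complete generator (a sketch along the lines of the fragment above; search2D_all is my hypothetical name):
def search2D_all(A, B):
    # Column ranges of A whose first row equals each value in B[0].
    lefts = numpy.searchsorted(A[0], B[0], side="left")
    rights = numpy.searchsorted(A[0], B[0], side="right")
    for to_find, left, right in zip(B[1], lefts, rights):
        # Within each range, matching second-row values form a contiguous run.
        offset_left = numpy.searchsorted(A[1, left:right], to_find, side="left")
        offset_right = numpy.searchsorted(A[1, left:right], to_find, side="right")
        yield from range(offset_left + left, offset_right + left)

list(search2D_all(A, B))
#>>> [0, 6, 14]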

You could use a string-based comparison such as this one, using np.char.array (here a and b are the question's A and B). Note that concatenating the digits without a separator can in principle collide across different columns, though it works for this data:
ca = np.char.array(a)[0,:] + np.char.array(a)[1,:]
cb = np.char.array(b)[0,:] + np.char.array(b)[1,:]
np.where(np.in1d(ca, cb))[0]
#array([ 0, 6, 14], dtype=int64)
EDIT:
You can also manipulate the array dtype to transform the a array into one with shape=(18,), where each element contains the data of the two elements of the corresponding column. The same idea applied to b gives shape=(3,). Then np.where(np.in1d()) gives the indices:
nrows = a.shape[0]
ta = np.ascontiguousarray(a.T).view(np.dtype((np.void, a.itemsize*nrows))).flatten()
tb = np.ascontiguousarray(b.T).view(np.dtype((np.void, b.itemsize*nrows))).flatten()
np.where(np.in1d(ta, tb))[0]
#array([ 0, 6, 14], dtype=int64)
The idea is similar to the string-based approach.
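To see what the dtype manipulation does, here is a tiny self-contained check (my example data, not the question's):
import numpy as np

a = np.array([[1, 1, 2, 3],
              [2, 3, 5, 7]])
b = np.array([[1, 3],
              [3, 7]])

nrows = a.shape[0]
ta = np.ascontiguousarray(a.T).view(np.dtype((np.void, a.itemsize*nrows))).flatten()
tb = np.ascontiguousarray(b.T).view(np.dtype((np.void, b.itemsize*nrows))).flatten()
print(ta.shape, tb.shape)            # (4,) (2,): one void element per column
print(np.where(np.in1d(ta, tb))[0])  # [1 3]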

NumPy has all you need. I assume the arrays are not sorted; if they are, you can improve on the brute-force code below:
import numpy as np
a = np.array([[101, 101, 101, 102, 102, 103, 103, 104, 105, 106, 107, 108, 108, 109, 109, 110, 110, 211],
[102, 103, 105, 104, 106, 109, 224, 109, 110, 110, 108, 109, 110, 211, 212, 211, 212, 213]])
b = np.array([[101, 103, 109],
[102, 224, 212]])
idxs = []
for i in range(np.shape(b)[1]):
    for j in range(np.shape(a)[1]):
        if np.array_equal(b[:,i], a[:,j]):
            idxs.append(j)
print(idxs)
# [0, 6, 14]
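For reference, the same brute-force comparison can be vectorised with broadcasting instead of two explicit loops (my sketch, not part of the answer above):
# (2, m, 1) == (2, 1, n) broadcasts to (2, m, n); all() over axis 0
# leaves an (m, n) table of column-vs-column matches.
matches = (b[:, :, None] == a[:, None, :]).all(axis=0)
idxs = np.argmax(matches, axis=1)  # first matching column of a for each column of b
print(idxs)  # [ 0  6 14]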

Related

Make an array like numpy.array() without numpy

I have an image processing task and we're prohibited from using NumPy, so we need to code it from scratch. I've done the log image transformation, but now I'm stuck on creating an array without numpy.
So here's my current output:
new_log =
[[236, 232, 226, ..., 198, 204]]
I need to convert this to an array so I can write the image like this (with NumPy):
new_log =
array([[236, 232, 226, ..., 208, 209, 212],
[202, 197, 187, ..., 198, 200, 203],
[192, 188, 180, ..., 205, 206, 207],
...,
[233, 226, 227, ..., 172, 189, 199],
[235, 233, 228, ..., 175, 182, 192],
[235, 232, 228, ..., 195, 198, 204]], dtype=uint8)
cv.imwrite('log_transformed.jpg', new_log)
# new_log must be shaped like the second output
You can make a straightforward function to take your list and reshape it in a similar way to NumPy's np.reshape(). But it's not going to be fast, and it doesn't know anything about data types (NumPy's dtype), so my advice is to challenge whoever it is that doesn't like NumPy. Especially if you're using OpenCV, which itself depends on NumPy!
Here's an example of what you could do in pure Python:
def reshape(l, shape):
    """Reshape a list.

    Example
    -------
    >>> l = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> reshape(l, shape=(3, -1))
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    """
    nrows, ncols = shape
    if ncols == -1:
        ncols = len(l) // nrows
    if nrows == -1:
        nrows = len(l) // ncols
    array = []
    for r in range(nrows):
        row = []
        for c in range(ncols):
            row.append(l[ncols*r + c])
        array.append(row)
    return array
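For the image case you would then pull out the single inner list and reshape it to the image's dimensions (height below is a hypothetical placeholder for your image's actual row count):
height = 512                               # hypothetical: your image's row count
flat = new_log[0]                          # new_log is [[...]] with one flat inner list
image = reshape(flat, shape=(height, -1))  # rows of length len(flat) // height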

python add multiple arrays together

I have multiple 5x5 arrays which are contained within one large array - the overarching shape is: 5 x 5 x 29. I want to sum every 5 x 5 array to produce one single array, instead of 29 single arrays.
I know that you can do something along the lines of:
new_data = data1[:,:,0] + data1[:,:,1] + ... + data1[:,:,28]
However, this gets very cumbersome for large arrays. Is there an easier way to do this?
Assuming you are using NumPy, you should be able to do this with:
In [13]: data1 = np.arange(100).reshape(5, 5, 4) # For example
In [14]: data1[:,:,0] + data1[:,:,1] + data1[:,:,2] + data1[:,:,3] # Bad way
Out[14]:
array([[ 6, 22, 38, 54, 70],
[ 86, 102, 118, 134, 150],
[166, 182, 198, 214, 230],
[246, 262, 278, 294, 310],
[326, 342, 358, 374, 390]])
In [15]: data1.sum(axis=2) # Good way
Out[15]:
array([[ 6, 22, 38, 54, 70],
[ 86, 102, 118, 134, 150],
[166, 182, 198, 214, 230],
[246, 262, 278, 294, 310],
[326, 342, 358, 374, 390]])
If you are saying you have a list of arrays, then use a for loop (assuming new_data starts as a zero array of the right shape):
for i in range(29):
    new_data += data1[:,:,i]
If you have a tensor or some other ND array, review numpy's ND array docs.
You can use a for loop. Like this:
import numpy as np
new_data = np.zeros((5, 5))
for i in range(29):
    new_data += data1[:,:,i]
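Either way, the loop result should agree with the vectorised reduction from the first answer:
assert np.allclose(new_data, data1.sum(axis=2))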

Python - cut only the descending part of the dataset

I have a timeseries with various downcasts. My question is: how do I slice a pandas dataframe (or, to keep it simple here, a numpy array) to get the data and the indexes of the descending parts of the timeseries?
import matplotlib.pyplot as plt
import numpy as np
b = np.asarray([ 1.3068586 , 1.59882279, 2.11291473, 2.64699527,
3.23948166, 3.81979878, 4.37630243, 4.97740025,
5.59247254, 6.18671493, 6.77414586, 7.43078595,
8.02243495, 8.59612224, 9.22302662, 9.83263379,
10.43125902, 11.0956864 , 11.61107838, 12.09616684,
12.63973254, 12.49437955, 11.6433792 , 10.61083269,
9.50534291, 8.47418827, 7.40571742, 6.56611512,
5.66963658, 4.89748187, 4.10543794, 3.44828054,
2.76866318, 2.24306623, 1.68034463, 1.26568186,
1.44548443, 2.01225076, 2.60715524, 3.21968562,
3.8622007 , 4.57035958, 5.14021305, 5.77879484,
6.42776897, 7.09397923, 7.71722028, 8.30860725,
8.96652218, 9.66157193, 10.23469208, 10.79889453,
10.5788411 , 9.38270646, 7.82070643, 6.74893389,
5.68200335, 4.73429009, 3.78358222, 3.05924946,
2.30428171, 1.78052369, 1.27897065, 1.16840532,
1.59452726, 2.13085096, 2.70989933, 3.3396291 ,
3.97318058, 4.62429262, 5.23997774, 5.91232803,
6.5906609 , 7.21099657, 7.82936331, 8.49636247,
9.15634983, 9.76450244, 10.39680729, 11.04659976,
11.69287237, 12.35692643, 12.99957563, 13.66228386,
14.31806385, 14.91871927, 15.57212978, 16.22288287,
16.84697357, 17.50502002, 18.15907842, 18.83068151,
19.50945548, 20.18020639, 20.84441358, 21.52792846,
22.17933087, 22.84614545, 23.51212887, 24.18308399,
24.8552263 , 25.51709528, 26.18724379, 26.84531493,
27.50690265, 28.16610365, 28.83394822, 29.49621179,
30.15118676, 30.8019521 , 31.46714114, 32.1213546 ,
32.79366952, 33.45233007, 34.12158193, 34.77502197,
35.4532211 , 36.11018053, 36.76540453, 37.41746323])
plt.plot(-b)
plt.show()
You can just set the points whose diff is non-negative to NaN and then plot:
import pandas as pd

bb = pd.Series(-b)
bb[bb.diff().ge(0)] = np.nan
bb.plot()
To get the indexes of descending values, use:
bb.index[bb.diff().lt(0)]
Int64Index([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 65, 66, 67, 68,
69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81,
82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94,
95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119],
dtype='int64')
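If you also want the descending stretches as contiguous slices rather than one flat index list, a possible sketch is to split those indices at the gaps:
desc = bb.index[bb.diff().lt(0)].to_numpy()
# Split wherever consecutive indices are not adjacent.
runs = np.split(desc, np.where(np.diff(desc) > 1)[0] + 1)
segments = [bb.iloc[r] for r in runs]  # one Series per descending stretch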
Create a second dataframe where everything is shifted by one index, then subtract the two term by term; keeping only the rows with a negative diff gives what you want:
import pandas as pd

df = pd.DataFrame(b)
df = pd.concat([df.shift(1), df], axis=1)
df.columns = ['t-1', 't']
df = df.drop(df.index[0])
df['diff'] = df['t'] - df['t-1']
res = df[df['diff'] < 0]
There is also an easy numpy-only solution (the question is tagged pandas but the code uses only numpy) using np.where. You want the points where the graph is descending which means the data is ascending.
# the indices where the data is ascending.
ix, = np.where(np.diff(b) > 0)
# the values
c = b[ix]
Note that this will give you the first value in each ascending pair of consecutive values, while the pandas-based solution gives the second one. To get the same indices just add 1 to ix.
s = pd.Series(b)
assert np.all(s[s.diff() > 0].index == ix + 1)
assert np.all(s[s.diff() > 0] == b[ix + 1])

How can I change the dimension of an array?

How do I change this array into a 5*2 matrix?
This is my array:
[[ ([[315, 327, 333, 334, 339]], [[146, 143, 145, 145, 146]])]]
I'm really not sure what you meant by that, but if you want to transpose it (2*5 -> 5*2) you can try this:
arrays = [[315, 327, 333, 334, 339], [146, 143, 145, 145, 146]]
newArrays = [[] for _ in range(len(arrays[0]))]  # initialise the transposed rows first
for arr in arrays:
    for i, item in enumerate(arr):
        newArrays[i].append(item)
print(newArrays)
# [[315, 146], [327, 143], [333, 145], [334, 145], [339, 146]]
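In plain Python the same transposition is usually spelled with zip (zip yields tuples, hence the list() call):
arrays = [[315, 327, 333, 334, 339], [146, 143, 145, 145, 146]]
newArrays = [list(pair) for pair in zip(*arrays)]
print(newArrays)
# [[315, 146], [327, 143], [333, 145], [334, 145], [339, 146]]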
NumPy provides the reshape method to reshape an array into an array of any dimensions with the same number of elements: you can reshape an array of any shape into another shape as long as the product of the original dimensions equals the product of the new ones.
import numpy as np

a = [[([[315, 327, 333, 334, 339]], [[146, 143, 145, 145, 146]])]]
b = np.array(a).reshape((5, 2))
list_b = b.tolist()
print(list_b)
# [[315, 327], [333, 334], [339, 146], [143, 145], [145, 146]]
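Note that reshape((5, 2)) just re-chops the flattened values, which is why this output differs from the previous answer's. If you want the column pairs instead, reshape to (2, 5) and transpose (a sketch):
b2 = np.array(a).reshape((2, 5)).T
print(b2.tolist())
# [[315, 146], [327, 143], [333, 145], [334, 145], [339, 146]]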

(igraph Python) g.neighbors(vertex, mode="out") returns unexplainable list

I have a graph g and I'm working with igraph. When I run
g.neighbors(vertex, mode="out") for several vertices, it returns lists like these, where the numbers are the node ids:
[16, 78, 110, 114, 179, 227, 350, 366, 426]
[78]
[213]
[300]
[163, 371]
[]
The problem is that it should give me:
[17, 79, 112, 116, 181, 229, 352, 368, 428]
[79]
[215]
[302]
[165, 373]
[]
I checked the real ids of the neighbors with "g.shortest_paths_dijkstra()" and I looked inside the original .gml file.
I really don't understand why "neighbors(vertex, mode="out")" is adding +1 or +2 to the neighbors ids...
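No answer was recorded here, but a likely explanation (my assumption, not from the thread): igraph always numbers vertices 0..n-1 internally, and when it reads a .gml file it keeps the file's original id values as a vertex attribute (typically "id"). g.neighbors() returns the internal indices, so you have to map them back yourself, along these lines:
# Hypothetical sketch: map igraph's internal vertex indices back to the
# ids stored in the .gml file (the attribute name "id" is an assumption).
original_ids = [g.vs[n]["id"] for n in g.neighbors(vertex, mode="out")]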
