I have two sorted numpy arrays similar to these:
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
Elements never repeat within the same array. I want a Pythonic way of producing the list of index pairs at which the same element exists in both arrays.
For instance, 1 exists in x and y at index 0. Element 2 in x doesn't exist in y, so I don't care about that item. However, 8 does exist in both arrays: at index 2 in x but index 1 in y. Similarly, 15 exists in both, at index 4 in x but index 2 in y. So the outcome of my function would in this case be the list [[0, 0], [2, 1], [4, 2]].
So far what I'm doing is:
def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]
        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])
    return indexes
But the problem is that arrays x and y are very large (millions of items), so it takes quite a while. Is there a better, more Pythonic way of doing this?
Without Python loops
Code
def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect1d to find the common elements of the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of the common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Zip the two 1d arrays into a 2d array of index pairs
    return np.dstack((loc1, loc2))[0]
Usage
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)
# result: array([[0, 0],
#                [2, 1],
#                [4, 2]], dtype=int64)
Timing Posted Solutions
Results show that the darrylg code has the fastest run time.
Code Adjustment
Each posted solution is wrapped as a function.
Slight mod so that each solution outputs a numpy array.
Curves are named after their posters.
Code
import numpy as np
import perfplot

def create_arr(n):
    ' Creates pair of 1d numpy arrays with half the elements equal '
    max_val = 100000  # One more than largest value in output arrays
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size=n//2, replace=False)  # locations to make changes
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2,)))  # assign new random values at change locations
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    return (arr1, arr2)

def get_indexes_lllrnr101(x, y):
    ' lllrnr101 answer '
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)

def get_indexes_joostblack(x, y):
    ' joostblack answer '
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        try:
            if y[idy] == val:
                indexes.append([idx, idy])
        except IndexError:
            continue  # ignore index errors
    return np.array(indexes)

def get_indexes_mustafa(x, y):
    ' mustafa answer '
    indices_in_x = np.flatnonzero(np.isin(x, y))  # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))  # array([0, 1, 2])
    return np.array(list(zip(indices_in_x, indices_in_y)))

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect1d to find the common elements of the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of the common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Zip the two 1d arrays into a 2d array of index pairs
    return np.dstack((loc1, loc2))[0]

def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])

perfplot.show(
    setup=create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None,  # np.allclose; set to None to disable "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
What you are doing is O(n log n), which is decent enough.
If you want, you can do it in O(n) by iterating over both arrays with two pointers: since they are sorted, advance the pointer for the array with the smaller element.
See below:
x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]

def get_indexes(x, y):
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return ans

print(get_indexes(x, y))
which gives me:
[[0, 0], [2, 1], [4, 2]]
This function will search for all the occurrences of x[i] in the y array; if duplicates are not allowed in y, it will find x[i] at most once.
def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]
You can use numpy.searchsorted:
def get_indexes(x, y):
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        # Guard against idy == len(y), which happens when val is larger than every element of y
        if idy < len(y) and y[idy] == val:
            indexes.append([idx, idy])
    return indexes
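Since x is also sorted, the same searchsorted idea can be fully vectorized with a single call. A minimal sketch (my addition, not part of the original answer), assuming both arrays are sorted with unique elements:

import numpy as np

def get_indexes_vectorized(x, y):
    # One searchsorted call for all of x at once
    idy = np.searchsorted(y, x)
    # Clip so idy == len(y) (value larger than all of y) stays a valid index for the comparison
    mask = y[np.clip(idy, 0, len(y) - 1)] == x
    return np.column_stack((np.flatnonzero(mask), idy[mask]))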
One solution is to first look from x's side to see which of its values are in y (via np.isin and np.flatnonzero), and then repeat the procedure from the other side; but instead of passing x entirely, we pass only the already-found intersecting elements, to save time:
indices_in_x = np.flatnonzero(np.isin(x, y)) # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x])) # array([0, 1, 2])
Now you can zip them to get the result:
result = list(zip(indices_in_x, indices_in_y)) # [(0, 0), (2, 1), (4, 2)]
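A quick sanity check on the example arrays; note this approach relies on both arrays being sorted, so the matched indices line up in the same value order:

import numpy as np

x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
indices_in_x = np.flatnonzero(np.isin(x, y))
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))
assert list(zip(indices_in_x, indices_in_y)) == [(0, 0), (2, 1), (4, 2)]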
I'm trying to implement the Floyd-Warshall algorithm in Python 3 to create a matrix with the shortest distance between each pair of points.
This is supposed to be a simple implementation: I make a matrix and fill it with the distance between each point.
However, I'm getting the wrong result, and I don't know what the problem with my implementation is.
# number of vertices (N), number of connections (M)
N, M = 4, 4
# my matrix [A, B, C], where A and B indicate a connection
# from A to B with a distance C
A = [[0, 1, 2], [0, 2, 4], [1, 3, 1], [2, 3, 5]]
# matrix allocation
inf = float("inf")
dist = [[inf for x in range(N)] for y in range(M)]
# set distances from/to the same vertex to 0
for vertex in range(N):
    dist[vertex][vertex] = 0
# set the distances from each vertex to the others;
# they are bidirectional
for vertex in A:
    dist[vertex[0]][vertex[1]] = vertex[2]
    dist[vertex[1]][vertex[0]] = vertex[2]
# Floyd-Warshall algorithm
for k in range(N):
    for i in range(N):
        for j in range(N):
            if dist[i][j] > dist[i][k] + dist[k][j]:
                dist[1][j] = dist[i][k] + dist[k][j]
print(dist)
Expected Matrix on the first index (dist[0]):
[0, 2, 4, 3]
Actual result:
[0, 2, 4, inf]
For some reason I keep getting inf instead of 3 at dist[0][3].
What am I missing?
It's a little tricky to spot, but a simple change-by-change trace of your program spots the problem:
if dist[i][j] > dist[i][k] + dist[k][j]:
    dist[1][j] = dist[i][k] + dist[k][j]
         ^ This should be i, not 1
You're changing the distance from node 1 to the target node, rather than from the source node. With that fixed, your resulting distance matrix is
[0, 2, 4, 3]
[2, 0, 6, 1]
[4, 6, 0, 5]
[3, 1, 5, 0]
See this lovely debug blog for help.
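For reference, here is the relaxation loop from the question with that one index corrected, as the answer suggests (dist and N as set up in the question):

# Floyd-Warshall relaxation with dist[1][j] corrected to dist[i][j]
for k in range(N):
    for i in range(N):
        for j in range(N):
            if dist[i][j] > dist[i][k] + dist[k][j]:
                dist[i][j] = dist[i][k] + dist[k][j]
print(dist[0])  # [0, 2, 4, 3]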
I'm attempting to convert a double summation formula into code, but can't figure out the correct matrix/vector representation of it.
The first summation runs over i = 1 to n, and the second over j > i, up to n.
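Reconstructed from that description and the loop below (my restatement, writing sigma for vols), the quantity appears to be:

\sum_{i=1}^{n} \sum_{j=i+1}^{n} w_i \, w_j \, \sigma_i \, \sigma_j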
I'm guessing there is a much more efficient & pythonic way of writing this?
I resorted to nested for loops to just get it working but, as expected, it runs very slowly with a large dataset:
def wapc_denom(weights, vols):
    x = []
    y = []
    for i, wi in enumerate(weights):
        for j, wj in enumerate(weights):
            if j > i:
                x.append(wi * wj * vols[i] * vols[j])
        y.append(np.sum(x))
    return np.sum(y)
Edit:
Using guidance from smci's answer I think I have a potential solution:
def wapc_denom2(weights, vols):
    wv = weights * vols  # elementwise product, wv[i] = w[i] * vols[i]
    return np.sum(np.tril(np.outer(wv, wv), k=-1))  # strict lower triangle, each pair once
Assuming you want to count every term only once (for that, you have to move the x = [] into the outer loop), one cheap way of computing the sum would be:
Create mock data
weights = np.random.random(10)
vols = np.random.random(10)
Do the calculation
wv = weights * vols
result = (wv.sum()**2 - wv @ wv) / 2
Check that it's the same
def wapc_denom(weights, vols):
    y = []
    for i, wi in enumerate(weights):
        x = []
        for j, wj in enumerate(weights):
            if j > i:
                x.append(wi * wj * vols[i] * vols[j])
        y.append(np.sum(x))
    return np.sum(y)

assert np.allclose(result, wapc_denom(weights, vols))
Why does it work?
What we are doing is computing the sum of the full outer-product matrix, subtracting the diagonal, and dividing by two. This is cheap because the sum of an outer product is just the product of the summed factors.
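In symbols (my restatement of the argument), with a_i = w_i * vols_i:

\sum_{i<j} a_i a_j = \frac{\left(\sum_i a_i\right)^2 - \sum_i a_i^2}{2}

which is exactly (wv.sum()**2 - wv @ wv) / 2.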
wi * wj * vols[i] * vols[j] is a telltale: vols is just another vector, so first you want to compute the vector wv = w * vols.
Then (wj * vols[j]) * (wi * vols[i]) = wv^T * wv is your (matrix outer product) expression: a column vector times a row vector. But you actually only want the sum, so there's no need to construct a vector with y.append(np.sum(x)) when you're only going to np.sum(y) it anyway.
Also, the if j > i part means you only want the sum of the lower triangular part, excluding the diagonal.
EDIT: the result is fully determined by wv alone; we never needed the matrix to get the sum, nor the diagonal. @PaulPanzer found the most compact expression.
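A minimal sketch of the recipe described above (my code, not smci's), assuming weights and vols are 1-d numpy arrays:

import numpy as np

weights = np.random.random(10)      # stand-in data, matching the mock data above
vols = np.random.random(10)
wv = weights * vols                 # wv[i] = w[i] * vols[i]
outer = np.outer(wv, wv)            # outer[i, j] = wv[i] * wv[j]
denom = np.tril(outer, k=-1).sum()  # strictly lower triangle: each i > j pair once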
You can use triangular matrices in numpy; check np.triu and np.meshgrid. Do:
np.product(np.triu(np.meshgrid(weights,weights), 1) * np.triu(np.meshgrid(vols,vols), 1),0).sum(1).cumsum().sum()
Example:
w = np.arange(4) + 1
v = np.array([1, 3, 2, 2])

print(np.triu(np.meshgrid(w, w), k=1))
>> array([[[0, 2, 3, 4],
           [0, 0, 3, 4],
           [0, 0, 0, 4],
           [0, 0, 0, 0]],

          [[0, 1, 1, 1],
           [0, 0, 2, 2],
           [0, 0, 0, 3],
           [0, 0, 0, 0]]])

# example of product + triu + meshgrid (your x values):
print(np.product(np.triu(np.meshgrid(w, w), 1) * np.triu(np.meshgrid(v, v), 1), 0))
>> array([[ 0,  6,  6,  8],
          [ 0,  0, 36, 48],
          [ 0,  0,  0, 48],
          [ 0,  0,  0,  0]])

print(np.product(np.triu(np.meshgrid(w, w), 1) * np.triu(np.meshgrid(v, v), 1), 0).sum(1).cumsum().sum())
>> 428

print(wapc_denom(w, v))
>> 428
I am adding each row of a matrix to the rows below it, then computing the min of each row in the new matrix.
My current Python code, with a test case, is:
# Compute distances to all other nodes using landmarks
distToLM = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
m = len(distToLM)
count = 1
dist = np.zeros((m, m))
for i in range(m):
    findMin = distToLM[i, :] + distToLM.take(range(count, m), axis=0)
    dist[i, count:] = np.min(findMin, axis=1)
    count = count + 1
Note: I am slicing the matrix each time as I only require the upper triangular values of the matrix
So the first iteration would add [1,2,3] to [4,5,6] and [7,8,9] to make a matrix:
[5,7,9]
[8,10,12]
From here I want the min of each row, so 5 and 8.
Next iteration I would take [4,5,6] and add it to all rows beneath it, i.e. [7,8,9], and take the min of each row.
This code is rather slow, around 3 seconds for a 4000x4000 matrix.
I've also tried a Cython version; there was not much of a speed increase, likely due to the heavy dependence on calling numpy functions versus executing the main code in C:
import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.int
ctypedef np.int_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def findDist(np.ndarray[DTYPE_t, ndim=2] distToLM):
    cdef int m = distToLM.shape[0]
    count = 1
    cdef np.ndarray[DTYPE_t, ndim=2] dist = np.zeros((m, m), dtype=DTYPE)
    cdef np.ndarray[DTYPE_t, ndim=2] findMin
    for i in range(m):
        findMin = distToLM[i, :] + distToLM.take(range(count, m), axis=0)
        dist[i, count:] = np.min(findMin, axis=1)
        count = count + 1
    return dist
I assume if there was some way to vectorize this it would be much faster.
I am open to any suggestions.
Changing it a bit helps me visualize the action better (I don't use take much):
distToLM = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
m = distToLM.shape[0]
dist = np.zeros((m, m), distToLM.dtype)
for i in range(m):
    findMin = distToLM[i, :] + distToLM[i+1:, :]
    dist[i, i+1:] = np.min(findMin, axis=1)
In fact the double iteration is even clearer:
distToLM = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
m = distToLM.shape[0]
dist = np.zeros((m, m), distToLM.dtype)
for i in range(m):
    for j in range(i+1, m):
        dist[i, j] = np.min(distToLM[i, :] + distToLM[j, :])
That reveals a symmetry in the 2 dimensions that is obscured in your code. It's not faster, but will be easier to implement with Cython memoryviews.
That symmetry also shows that I can perform an 'outer' sum on these rows:
In [512]: np.min(distToLM[:, None, :] + distToLM[None, :, :], axis=-1)
Out[512]:
array([[ 2,  5,  8],
       [ 5,  8, 11],
       [ 8, 11, 14]])
The upper triangle is the desired dist.
In [518]: np.triu(_, k=1)
Out[518]:
array([[ 0,  5,  8],
       [ 0,  0, 11],
       [ 0,  0,  0]])
This calculates more values than the iterative approach, but can be faster. Unfortunately for your big problem, the intermediate size (4000,4000,4000) array may be too big for memory.
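One possible workaround for that memory blow-up (my sketch, not from the answer): do the same broadcasted computation in row blocks, so the intermediate is (block, m, n) instead of (m, m, n):

import numpy as np

def min_pair_sums_blocked(distToLM, block=128):
    # block=128 is an arbitrary choice; tune it to your memory budget
    m = distToLM.shape[0]
    dist = np.zeros((m, m), distToLM.dtype)
    for start in range(0, m, block):
        stop = min(start + block, m)
        # (block, m, n) intermediate instead of (m, m, n)
        sums = distToLM[start:stop, None, :] + distToLM[None, :, :]
        dist[start:stop] = np.min(sums, axis=-1)
    return np.triu(dist, k=1)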
I could pick the triu indices beforehand with:
In [530]: I,J=np.triu_indices(3,1)
In [531]: I,J
Out[531]: (array([0, 0, 1], dtype=int32), array([1, 2, 2], dtype=int32))
In [532]: np.min(distToLM[I,:]+distToLM[J,:],axis=1)
Out[532]: array([ 5, 8, 11])
I don't have a feel for how that will perform with large arrays.
This reminds me that scipy.spatial has what it calls squareform and compact representations of pairwise distances.
https://docs.scipy.org/doc/scipy/reference/spatial.distance.html
Maybe there's some useful stuff there.
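For what it's worth, a sketch of how that scipy.spatial machinery could be used here: pdist accepts a callable metric, so the pairwise min-of-sums can be computed in condensed form and expanded with squareform. The callable is evaluated in a Python loop, so this is about convenience, not speed:

import numpy as np
from scipy.spatial.distance import pdist, squareform

distToLM = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
compact = pdist(distToLM, lambda u, v: np.min(u + v))  # pairwise min-of-sums, condensed form
dist = np.triu(squareform(compact), k=1)               # expand and keep the upper triangle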
I am working with matrices of (x, y, z) dimensions and would like to index numerous values from a matrix simultaneously.
i.e. if A[0,0,0] = 5 and A[1,1,1] = 10, I want A[[1,1,1], [5,5,5]] = [5, 10].
However, indexing like this seems to return huge chunks of the matrix.
Does anyone know how I can accomplish this? I have a large array of indices (n, x, y, z) that I need to use to index from A.
Thanks
You are trying to use 1 as the first index three times, and 5 as the index into the second dimension (again three times). This will give you the slice A[1, 5, :] repeated three times.
A = np.random.rand(6, 6, 6)
B = A[[1, 1, 1], [5, 5, 5]]
# [[ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839],
#  [ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839],
#  [ 0.17135991, 0.80554887, 0.38614418, 0.55439258, 0.66504806, 0.33300839]]
B.shape
# (3, 6)
Instead, you will want to specify [1,5] for each axis of your matrix.
A[[1,5], [1,5], [1,5]] = [5, 10]
Advanced indexing works like this:
A[I, J, K][n] == A[I[n], J[n], K[n]]
with A, I, J, and K all arrays. That's not the full, general rule, but it's what the rules simplify down to for what you need.
For example, if you want output[0] == A[0, 0, 0] and output[1] == A[1, 1, 1], then your I, J, and K arrays should look like np.array([0, 1]). Lists also work:
A[[0, 1], [0, 1], [0, 1]]
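If the indices live in a single (n, 3) array, as the question suggests, splitting its columns gives the I, J, K arrays that advanced indexing expects. A small sketch with made-up data:

import numpy as np

A = np.arange(27).reshape(3, 3, 3)
idx = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])  # hypothetical (n, 3) array of coordinates
values = A[idx[:, 0], idx[:, 1], idx[:, 2]]        # one lookup per row of idx
print(values)  # [ 0 13 26]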