I am trying to get good at numpy and want to know if I can use values in existing arrays as indices for a function that returns values for another array. I can do this:
import numpy as np

def somefun(i):
    return i + 1

x = np.array([2, 4, 5])
k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x[:])]
But how do I deal with vectors in matrices, in case x were a 2D array where I just want to use one column at a time as the index argument for a function, such as x[:, i], without using for loops?
such as would be the case in:
x = np.array([[2, 4, 5], [7, 8, 9]])

def somefun(i):
    return i + 1

k_labs = np.arange(100)
k_labs2 = k_labs[somefun(x[:, i])]  # what should replace the per-column index i, without a for loop?
EDIT ITERATION 2
To get the gist of what I am trying to accomplish, see the code below. In the function pred, I wanted to rewrite the parts I've commented out in a numpy fashion that might work better. I have some problems, though, with the two lines I put in instead, since I get an error about wrong broadcast dimensions in the function distance, at the line where I try to assign the normalized vectors to a variable.
class kNN:

    def __init__(self, X_train: np.array, label_train, val=None):
        self.X = X_train  # X[:-1, :]
        self.labels = label_train  # X[-1, :]
        # self.k = k
        self.kNN_4all = None  # np.zeros(self.X.shape[1])

    def distance(self, x1):
        x1 = np.tile(x1, (self.X.shape[1], 1))  # creates a matrix with copies of the x1 vector for easy matrix subtraction
        dists = np.linalg.norm(x1 - self.X.T, axis=1)  # transposes X to take the norm along each row
        return dists

    def k_nearest(self, x_vec, k):
        k_nearest = self.distance(x_vec)
        k_nearest = np.argsort(k_nearest)[:k]
        kNN_labs = np.zeros(k_nearest.shape)
        kNN_labs[:] = self.labels[k_nearest[:]]
        unique, vote = np.unique(kNN_labs, return_counts=True)
        return unique[np.argmax(vote)]

    def pred(self, X_test, k):
        self.kNN_4all = np.zeros(X_test.shape[1])
        self.kNN_4all = self.k_nearest(X_test[:, :], k)
        # for i in range(X_test.shape[1]):
        #     NewLabel = self.k_nearest(X_test[:, i], k)  # defines x_vec in matrix X
        #     self.kNN_4all[i] = NewLabel
        # return self.kNN_4all

    def prec(self, labels_val):
        elem_equal = (self.kNN_4all == labels_val).astype(int).flatten()
        prec = np.sum(elem_equal) / elem_equal.shape
        return 1 - prec[0]
X_train = X[:, :100]  # X and labs are assumed to be defined earlier in the script
labs_train = labs[:100]

pilot = kNN(X_train, labs_train)
pilot.pred(X[:, 100:200], 10)
pilot.prec(labs[100:200])
I get the following error:
ValueError: operands could not be broadcast together with shapes (78400,100) (100,784)
As we can see from the code, k_nearest(self, x_vec, k) takes one 1D subarray, so passing a full matrix X causes the broadcasting error, since the functions within k_nearest rely on being passed only a 1D subarray.

I don't know if it is really possible to avoid for loops here: can numpy step through 1D subarrays as arguments to a function, such that each call of the function is assigned to a different cell of another array, in this case self.kNN_4all?
x = np.array([[2, 4, 5], [7, 8, 9], [33, 50, 71]])
x = x + 1
k_labs = np.arange(100)
ttt = k_labs[x]
print(ttt)
ttt is an array that takes values from 'k_labs' at the indices given by 'x'. It is accessed, for example:
print(ttt[1])#[ 8 9 10]
If you want to look up the values for a single row of indices (for example x[2]) alone, then the code is as follows:
x = np.array([[2, 4, 5], [7, 8, 9], [33, 50, 71]])
x = x + 1
k_labs = np.arange(100)
print(k_labs[x[2]])
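Returning to the kNN code in the question: the broadcasting error comes from tiling a whole test matrix where distance expects a single vector. One way to drop the per-sample loop for the distance computation is to broadcast test against train and build the full distance matrix in one go. A minimal sketch, assuming samples are stored as columns as in the question (the function name and the final voting loop are illustrative, not from the original code):

import numpy as np

def pred_vectorized(X_train, labels_train, X_test, k):
    # X_train: (n_features, n_train), X_test: (n_features, n_test)
    # Broadcasting gives diffs of shape (n_features, n_test, n_train)
    diffs = X_test[:, :, None] - X_train[:, None, :]
    dists = np.linalg.norm(diffs, axis=0)        # (n_test, n_train)
    knn_idx = np.argsort(dists, axis=1)[:, :k]   # k nearest train indices per test sample
    knn_labs = labels_train[knn_idx]             # fancy indexing, (n_test, k)
    # The majority vote still iterates, but only once per test sample,
    # and each np.unique call only sees k labels
    preds = np.empty(knn_labs.shape[0])
    for i, row in enumerate(knn_labs):
        u, c = np.unique(row, return_counts=True)
        preds[i] = u[np.argmax(c)]
    return preds

Note the intermediate diffs array holds n_features * n_test * n_train elements, so for very large sets it may pay to compute the distance matrix in chunks of test columns.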
I have a matrix in the following form:
import numpy as np
matrix = np.array([[-2,2,6,7,8],[-3,7,1,0,-2]])
I want to find the location of the column with the highest possible value in the first row, conditional on non-negative numbers in the second row; e.g. in my case I want the algorithm to find the 4th column.
solution = np.array([7,0])
column_location = 3
I tried using numpy functions like np.min(), np.max(), np.take(), but I lose the location information when subsampling the matrix.
Simply:
nn = np.where(matrix[1] >= 0)[0]
ix = nn[matrix[0, nn].argmax()]
On your data:
>>> ix
3
Here's a sketch:
pos_inds = np.where(matrix[1, :] >= 0)[0]  # indices where the 2nd row is non-negative
max_ind = matrix[0, pos_inds].argmax()     # argmax within the non-negative subset only
orig_max_ind = pos_inds[max_ind]           # max index into the original array
print(orig_max_ind) # 3
print(matrix[:, orig_max_ind]) # [7, 0]
Here I use masks to handle this in numpy, and also consider that if all the numbers in the second row are negative, there will be no solution:
import numpy as np
import numpy.ma as ma
from copy import deepcopy
min_int = -2147483648
matrix = np.array([[-2, 2, 6, 7, 8], [-3, 7, 1, 0, -2]])
# we keep the original matrix untouched
matrix_copy = deepcopy(matrix)
masked_array = ma.masked_less(matrix[1], 0)
matrix_copy[0][masked_array.mask] = min_int
column_location = np.argmax(matrix_copy[0])
if matrix_copy[0][column_location] == min_int:
    print("No solution")
else:
    solution = np.array([matrix[0][column_location], matrix[1][column_location]])
    print(solution)          # np.array([7, 0])
    print(column_location)   # 3
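One small note on the hard-coded min_int: it assumes 32-bit integers. A sketch of a dtype-aware alternative, assuming matrix has an integer dtype:

min_int = np.iinfo(matrix.dtype).min  # smallest value representable by the matrix's own dtype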
Note that this question is not about multiple conditions within a single np.where(); see this thread for that.
I have a numpy array arr0 with some numbers (without a particular structure):
arr0 = \
np.array([[0,3,0],
[1,3,2],
[1,2,0]])
and a list of all the entries in this array:
entries = [0,1,2,3]
I also have another array, arr1:
arr1 = \
np.array([[4,5,6],
[6,2,4],
[3,7,9]])
I would like to perform some function on multiple subsets of elements of arr1. A subset consists of the numbers that sit at the same positions as the arr0 entries with a certain value. Let this function be finding the max value. Performing the function on each subset via a list comprehension:
res = [np.where(arr0==index,arr1,0).max() for index in entries]
res is [9, 6, 7, 5]
As expected: 0 appears in arr0 at the top left, top right, and bottom right corners, and the biggest of the corresponding top left, top right, and bottom right entries of arr1 (i.e. 4, 6, 9) is 9. The rest follow similar logic.
How can I achieve this without iteration?
My actual arrays are much bigger than these examples.
With broadcasting
res = np.where(arr0[...,None] == entries, arr1[...,None], 0).max(axis=(0, 1))
The result of np.where(...) is a (3, 3, 4) array, where slicing [...,0] would give you the same 3x3 array you get by manually doing the np.where with just entries[0], etc. Then taking the max of each 3x3 subarray leaves you with the desired result.
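A small check of this reasoning on the question's arrays (the stacked name is just for illustration):

stacked = np.where(arr0[..., None] == entries, arr1[..., None], 0)
print(stacked.shape)  # (3, 3, 4): one masked copy of arr1 per entry
print(np.array_equal(stacked[..., 0], np.where(arr0 == entries[0], arr1, 0)))  # True
print(stacked.max(axis=(0, 1)))  # [9 6 7 5]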
Timings
Apparently this method doesn't scale well for bigger arrays. The other answer using np.unique is more efficient, because it reduces the maximum operation down to a few unique values, regardless of how big the original arrays are.
import timeit
import matplotlib.pyplot as plt
import numpy as np
def loops():
    return [np.where(arr0 == index, arr1, 0).max() for index in entries]

def broadcast():
    return np.where(arr0[..., None] == entries, arr1[..., None], 0).max(axis=(0, 1))

def numpy_1d():
    arr0_1D = arr0.ravel()
    arr1_1D = arr1.ravel()
    arg_idx = np.argsort(arr0_1D)
    u, idx = np.unique(arr0_1D[arg_idx], return_index=True)
    return np.maximum.reduceat(arr1_1D[arg_idx], idx)

sizes = (3, 10, 25, 50, 100, 250, 500, 1000)
lengths = (4, 10, 25, 50, 100)
methods = (loops, broadcast, numpy_1d)

fig, ax = plt.subplots(len(lengths), sharex=True)
for i, M in enumerate(lengths):
    entries = np.arange(M)
    times = [[] for _ in range(len(methods))]
    for N in sizes:
        arr0 = np.random.randint(1000, size=(N, N))
        arr1 = np.random.randint(1000, size=(N, N))
        for j, method in enumerate(methods):
            times[j].append(np.mean(timeit.repeat(method, number=1, repeat=10)))
    for t in times:
        ax[i].plot(sizes, t)
    ax[i].legend(['loops', 'broadcasting', 'numpy_1d'])
    ax[i].set_title(f'Entries size {M}')

plt.xticks(sizes)
fig.text(0.5, 0.04, 'Array size (NxN)', ha='center')
fig.text(0.04, 0.5, 'Time (s)', va='center', rotation='vertical')
plt.show()
It's more convenient to work in the 1D case. You need to sort arr0, then find the starting index of every group of equal values, and use np.maximum.reduceat:
arr0_1D = np.array([[0,3,0],[1,3,2],[1,2,0]]).ravel()
arr1_1D = np.array([[4,5,6],[6,2,4],[3,7,9]]).ravel()
arg_idx = np.argsort(arr0_1D)
>>> arr0_1D[arg_idx]
array([0, 0, 0, 1, 1, 2, 2, 3, 3])
u, idx = np.unique(arr0_1D[arg_idx], return_index=True)
>>> idx
array([0, 3, 5, 7], dtype=int64)
>>> np.maximum.reduceat(arr1_1D[arg_idx], idx)
array([9, 6, 7, 5], dtype=int32)
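Note the reduceat output is ordered by u, the sorted unique values of arr0. If the result must line up with an arbitrary entries list, one way (assuming every entry actually occurs in arr0) is to map through u with searchsorted:

maxima = np.maximum.reduceat(arr1_1D[arg_idx], idx)  # one max per unique value, in the order of u
res = maxima[np.searchsorted(u, entries)]            # reorder to match `entries`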
I have two sorted numpy arrays similar to these:
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
Elements never repeat in the same array. I want to figure out a pythonic way of producing a list of the index pairs at which the same element exists in both arrays.

For instance, 1 exists in x and y at index 0. Element 2 in x doesn't exist in y, so I don't care about that item. However, 8 does exist in both arrays: at index 2 in x but at index 1 in y. Similarly, 15 exists in both, at index 4 in x but at index 2 in y. So the outcome of my function would in this case return [[0, 0], [2, 1], [4, 2]].
So far what I'm doing is:
def get_indexes(x, y):
    indexes = []
    for i in range(len(x)):
        # Find index where item x[i] is in y:
        j = np.where(x[i] == y)[0]
        # If it exists, save it:
        if len(j) != 0:
            indexes.append([i, j[0]])
    return indexes
But the problem is that arrays x and y are very large (millions of items), so it takes quite a while. Is there a better pythonic way of doing this?
Without Python loops
Code
def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect1d to find the common elements of the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of the common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Zip the two 1D numpy arrays into a 2D array
    return np.dstack((loc1, loc2))[0]
Usage
x = np.array([1, 2, 8, 11, 15])
y = np.array([1, 8, 15, 17, 20, 21])
result = get_indexes_darrylg(x, y)
# result: array([[0, 0],
#                [2, 1],
#                [4, 2]], dtype=int64)
Timing Posted Solutions
Results show that the darrylg code has the fastest run time.
Code Adjustment
Each posted solution as a function.
Slight mod so that each solution outputs a numpy array.
Curve named after poster
Code
import numpy as np
import perfplot

def create_arr(n):
    ' Creates a pair of 1d numpy arrays with half the elements equal '
    max_val = 100000  # One more than the largest value in the output arrays
    arr1 = np.random.randint(0, max_val, (n,))
    arr2 = arr1.copy()
    # Change half the elements in arr2
    all_indexes = np.arange(0, n, dtype=int)
    indexes = np.random.choice(all_indexes, size=n//2, replace=False)  # locations to make changes
    np.put(arr2, indexes, np.random.randint(0, max_val, (n//2,)))  # assign new random values at the change locations
    arr1 = np.sort(arr1)
    arr2 = np.sort(arr2)
    return (arr1, arr2)

def get_indexes_lllrnr101(x, y):
    ' lllrnr101 answer '
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return np.array(ans)

def get_indexes_joostblack(x, y):
    ' joostblack answer '
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        try:
            if y[idy] == val:
                indexes.append([idx, idy])
        except IndexError:
            continue  # ignore index errors
    return np.array(indexes)

def get_indexes_mustafa(x, y):
    ' mustafa answer '
    indices_in_x = np.flatnonzero(np.isin(x, y))  # array([0, 2, 4])
    indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x]))  # array([0, 1, 2])
    return np.array(list(zip(indices_in_x, indices_in_y)))

def get_indexes_darrylg(x, y):
    ' darrylg answer '
    # Use intersect1d to find the common elements of the two arrays
    overlap = np.intersect1d(x, y)
    # Indexes of the common elements in each array
    loc1 = np.searchsorted(x, overlap)
    loc2 = np.searchsorted(y, overlap)
    # Zip the two 1D numpy arrays into a 2D array
    return np.dstack((loc1, loc2))[0]

def get_indexes_akopcz(x, y):
    ' akopcz answer '
    return np.array([
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ])

perfplot.show(
    setup=create_arr,  # tuple of two 1D random arrays
    kernels=[
        lambda a: get_indexes_lllrnr101(*a),
        lambda a: get_indexes_joostblack(*a),
        lambda a: get_indexes_mustafa(*a),
        lambda a: get_indexes_darrylg(*a),
        lambda a: get_indexes_akopcz(*a),
    ],
    labels=["lllrnr101", "joostblack", "mustafa", "darrylg", "akopcz"],
    n_range=[2 ** k for k in range(5, 21)],
    xlabel="Array Length",
    # More optional arguments with their default values:
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    equality_check=None,  # np.allclose; set to None to disable the "correctness" assertion
    # show_progress=True,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)
What you are doing is O(n log n), which is decent enough.

If you want, you can do it in O(n) by iterating over both arrays with two pointers; since the arrays are sorted, advance the pointer for the array with the smaller element.
See below:
x = [1, 2, 8, 11, 15]
y = [1, 8, 15, 17, 20, 21]

def get_indexes(x, y):
    ans = []
    i = 0
    j = 0
    while (i < len(x) and j < len(y)):
        if x[i] == y[j]:
            ans.append([i, j])
            i += 1
            j += 1
        elif (x[i] < y[j]):
            i += 1
        else:
            j += 1
    return ans

print(get_indexes(x, y))
which gives me:
[[0, 0], [2, 1], [4, 2]]
Although this function searches for all the occurrences of x[i] in the y array, if duplicates are not allowed in y it will find x[i] at most once.
def get_indexes(x, y):
    return [
        [i, j]
        for i, nr in enumerate(x)
        for j in np.where(nr == y)[0]
    ]
You can use numpy.searchsorted:
def get_indexes(x, y):
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        if y[idy] == val:
            indexes.append([idx, idy])
    return indexes
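One caveat: np.searchsorted returns len(y) when val is larger than every element of y, so y[idy] can raise an IndexError (the benchmark version above guards this with try/except). A bounds check is a cheap alternative; the name here is just for illustration:

def get_indexes_safe(x, y):
    indexes = []
    for idx, val in enumerate(x):
        idy = np.searchsorted(y, val)
        # searchsorted returns len(y) for values beyond the end of y,
        # so guard before indexing
        if idy < len(y) and y[idy] == val:
            indexes.append([idx, idy])
    return indexes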
One solution is to first look from x's side to see which values are included in y, getting their indices through np.isin and np.flatnonzero, and then to use the same procedure from the other side; but instead of passing x entirely, we pass only the (already found) intersecting elements, to save time:
indices_in_x = np.flatnonzero(np.isin(x, y)) # array([0, 2, 4])
indices_in_y = np.flatnonzero(np.isin(y, x[indices_in_x])) # array([0, 1, 2])
Now you can zip them to get the result:
result = list(zip(indices_in_x, indices_in_y)) # [(0, 0), (2, 1), (4, 2)]
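Note that pairing the two index arrays with zip is only valid here because both inputs are sorted and free of duplicates, so the common elements appear in the same relative order on both sides.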
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row, x in zip(A, r)]))
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing tricks?
Sure you can do it using advanced indexing; whether it is the fastest way probably depends on your array size (if your rows are large it may not be):
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Always use a negative shift, so that column_indices are valid.
# (could also use a modulo operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]
result = A[rows, column_indices]
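Applied to the arrays from the question (note that r is modified in place by the r[r < 0] += A.shape[1] line), this is what a run looks like:

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])

rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:, np.newaxis]

print(A[rows, column_indices])
# [[0 0 4]
#  [1 2 3]
#  [0 5 0]]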
numpy.lib.stride_tricks.as_strided stricks (abbrev pun intended) again!
Speaking of fancy indexing tricks, there's the infamous np.lib.stride_tricks.as_strided. The idea/trick is to take a sliced portion starting from the first column up to the second-to-last one and concatenate it at the end. This ensures that we can stride in the forward direction as needed to leverage np.lib.stride_tricks.as_strided, and thus avoid the need of actually rolling back. That's the whole idea!

Now, in terms of actual implementation we would use scikit-image's view_as_windows to elegantly use np.lib.stride_tricks.as_strided under the hood. Thus, the final implementation would be -
from skimage.util.shape import view_as_windows as viewW

def strided_indexing_roll(a, r):
    # Concatenate with a sliced copy to cover all rolls
    a_ext = np.concatenate((a, a[:, :-1]), axis=1)

    # Get sliding windows; use advanced indexing to select the appropriate ones
    n = a.shape[1]
    return viewW(a_ext, (1, n))[np.arange(len(r)), (n - r) % n, 0]
Here's a sample run -
In [327]: A = np.array([[4, 0, 0],
...: [1, 2, 3],
...: [0, 0, 5]])
In [328]: r = np.array([2, 0, -1])
In [329]: strided_indexing_roll(A, r)
Out[329]:
array([[0, 0, 4],
[1, 2, 3],
[0, 5, 0]])
Benchmarking
# @seberg's solution
def advindexing_roll(A, r):
    rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
    r[r < 0] += A.shape[1]
    column_indices = column_indices - r[:, np.newaxis]
    return A[rows, column_indices]
Let's do some benchmarking on an array with a large number of rows and columns -
In [324]: np.random.seed(0)
...: a = np.random.rand(10000,1000)
...: r = np.random.randint(-1000,1000,(10000))
# @seberg's solution
In [325]: %timeit advindexing_roll(a, r)
10 loops, best of 3: 71.3 ms per loop
# Solution from this post
In [326]: %timeit strided_indexing_roll(a, r)
10 loops, best of 3: 44 ms per loop
In case you want a more general solution (dealing with any shape and any axis), I modified @seberg's solution:
def indep_roll(arr, shifts, axis=1):
    """Apply an independent roll to each 1D slice along a single axis.

    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray
        How far to shift each slice; for a 2D array with axis=1 this has
        shape `(arr.shape[0],)`. Note: modified in place below.
    axis : int
        Axis along which elements are shifted.
    """
    arr = np.swapaxes(arr, axis, -1)
    all_idcs = np.ogrid[[slice(0, n) for n in arr.shape]]

    # Convert to a positive shift
    shifts[shifts < 0] += arr.shape[-1]

    all_idcs[-1] = all_idcs[-1] - shifts[:, np.newaxis]

    result = arr[tuple(all_idcs)]
    arr = np.swapaxes(result, -1, axis)
    return arr
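A quick sanity check on the question's 2D example; pass a copy of the shifts, since shifts[shifts < 0] += arr.shape[-1] mutates the argument in place:

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])

print(indep_roll(A, r.copy(), axis=1))
# [[0 0 4]
#  [1 2 3]
#  [0 5 0]]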
I implement a pure numpy.lib.stride_tricks.as_strided solution as follows:
from numpy.lib.stride_tricks import as_strided

def custom_roll(arr, r_tup):
    m = np.asarray(r_tup)
    arr_roll = arr[:, [*range(arr.shape[1]), *range(arr.shape[1]-1)]].copy()  # need `copy`
    strd_0, strd_1 = arr_roll.strides
    n = arr.shape[1]
    result = as_strided(arr_roll, (*arr.shape, n), (strd_0, strd_1, strd_1))
    return result[np.arange(arr.shape[0]), (n-m) % n]

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])

out = custom_roll(A, r)
# out:
# array([[0, 0, 4],
#        [1, 2, 3],
#        [0, 5, 0]])
By using a fast Fourier transform we can apply a transformation in the frequency domain and then use the inverse fast Fourier transform to obtain the row shift.

So this is a pure numpy solution that takes only one line:
import numpy as np
from numpy.fft import fft, ifft

# The row-shift function using the fast Fourier transform:
# rshift(A, r), where A is a 2D array and r is the row-shift vector
def rshift(A, r):
    return np.real(ifft(fft(A, axis=1) * np.exp(2*1j*np.pi/A.shape[1] * r[:, None] * np.r_[0:A.shape[1]][None, :]), axis=1).round())
This will apply a left shift, but we can simply negate the exponent of the exponential to turn the function into a right-shift function:
ifft(fft(...)*np.exp(-2*1j...)
It can be used like that:
# Example:
A = np.array([[1, 2, 3, 4],
              [1, 2, 3, 4],
              [1, 2, 3, 4]])
r = np.array([1,-1,3])
print(rshift(A,r))
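Two caveats worth noting: the trailing .round() assumes integer-valued input, so for float data drop the rounding and expect tiny floating-point errors; and the FFT approach costs O(n log n) per row, where the indexing-based solutions above are linear per row.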
Building on Divakar's excellent answer, you can apply this logic to a 3D array easily (which is the problem that brought me here in the first place). Here's an example: basically, flatten your data, roll it, and reshape it afterwards:
from skimage.util.shape import view_as_windows as viewW

def applyroll_30(cube, threshold=25, offset=500):
    flattened_cube = cube.copy().reshape(cube.shape[0]*cube.shape[1], cube.shape[2])

    roll_matrix = calc_roll_matrix_flattened(flattened_cube, threshold, offset)

    # strided_indexing_roll reshapes back to the original cube shape
    rolled_cube = strided_indexing_roll(flattened_cube, roll_matrix, cube_shape=cube.shape)
    return rolled_cube

def calc_roll_matrix_flattened(cube_flattened, threshold, offset):
    """ Calculates the number of positions along the time axis we need to shift
        elements in order to trigger the data.
        We return a 1D numpy array with one shift per (X, Y) pixel, i.e. X*Y elements.
    """
    # argmax(...) finds the position in the cube (3d) where we are above threshold
    roll_matrix = np.argmax(cube_flattened > threshold, axis=1) + offset
    # ensure we don't have an index out of bounds
    roll_matrix[roll_matrix > cube_flattened.shape[1]] = cube_flattened.shape[1]
    return roll_matrix

def strided_indexing_roll(cube_flattened, roll_matrix_flattened, cube_shape):
    # negate the shifts, otherwise we shift in the wrong direction for my application
    roll_matrix_flattened = -1 * roll_matrix_flattened
    # Concatenate with a sliced copy to cover all rolls
    a_ext = np.concatenate((cube_flattened, cube_flattened[:, :-1]), axis=1)
    # Get sliding windows; use advanced indexing to select the appropriate ones
    n = cube_flattened.shape[1]
    result = viewW(a_ext, (1, n))[np.arange(len(roll_matrix_flattened)), (n - roll_matrix_flattened) % n, 0]
    result = result.reshape(cube_shape)
    return result
Divakar's answer doesn't do justice to how much more efficient this is on a large cube of data. I've timed it on 400x400x2000 data formatted as int8. An equivalent for loop takes ~5.5 s, Seberg's answer ~3.0 s, and strided_indexing_roll ~0.5 s.