I have a piece of code that is running, but is currently a bottleneck, and I was wondering whether there is a smarter way to do it.
I have a 1D array of integers between 0 and 20, with a length of 20-1000 (it varies), and I'm trying to compare it to a set of 1D arrays that are stored as the rows of a 2D array.
I wish to find any row in the 2D array that completely matches the 1D array.
My current approach to do this is the following:
res = np.mean(one_d_array == two_d_array, axis=1) == 1
The problem with this approach is that it compares all elements in all rows, even when a row already fails to match on the first or second element, etc., which is of course very inefficient.
I could remedy this by looping through the rows and comparing each row individually; then I could stop a comparison as soon as one element is false. However, then I would be stuck with a slow for loop, which is also not ideal.
So I'm wondering is there some other clever way to get the best of both of these approaches?
numpy has a few useful built-in functions for checking matrix/vector equality; this approach is about twice as fast:
import numpy as np
import time
x = np.random.random((1, 1000))
y = np.random.random((10000, 1000))
y[53] = x
t = time.time()
x_in_y = np.equal(x, y).all(axis=1)  # equal(x, y) gives a rows x cols boolean matrix; all(axis=1) is True for each row of y that fully matches x
idx = np.where(x_in_y)  # returns the indices where x_in_y is True (here 53)
print(time.time() - t) # 0.019975900650024414
t = time.time()
res = np.mean(x == y, axis=1) == 1
print(time.time() - t) # 0.03999614715576172
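If you want the early-exit flavour of the loop without actually writing one, a hedged sketch (my own addition, not part of the timing above) is to pre-filter candidate rows on the first element, so the full comparison only runs on the survivors:

candidates = np.flatnonzero(y[:, 0] == x[0, 0])         # rows that match on the first element
matches = candidates[(y[candidates] == x).all(axis=1)]  # full row comparison on candidates only

This helps most when first-element matches are rare, which is likely with values spread over 0-20.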
I am having problems with while loops in python right now:
while (j < len(firstx)):
    trainingset[j][0] = firstx[j]
    trainingset[j][1] = firsty[j]
    trainingset[j][2] = 1  # top
    print(trainingset)
    j += 1

print("j is " + str(j))

i = 0
while (i < len(firsty)):
    trainingset[j+i][0] = twox[i]
    trainingset[j+i][1] = twoy[i]
    trainingset[j+i][2] = 0  # bottom
    i += 1
Here trainingset = [[0,0,0]]*points*2, where points is a number, and firstx, firsty, twox, and twoy are all numpy arrays.
I want the training set to have 2*points entries, going from [firstx[0], firsty[0], 1] all the way to [twox[points-1], twoy[points-1], 0].
After some debugging, I am realizing that on each iteration **every single value** in the training set is being changed, so that when j = 0 all the training set values are replaced with firstx[0], firsty[0], and 1.
What am I doing wrong?
In this case, I would recommend a for loop instead of a while loop; for loops are a good fit when you want an index that increments and you know in advance what its last value should be, which is true here.
I've had to make some assumptions about the shape of your arrays based on your code. I'm assuming that:
firstx, firsty, twox, and twoy are 1D NumPy arrays with either shape (length,) or (length, 1).
trainingset is a 2D NumPy array with at least len(firstx) + len(firsty) rows and at least 3 columns.
j starts at 0 before the while loop begins.
Given these assumptions, here's some code that gives you the output you want:
len_firstx = len(firstx)

# Replace each row with index j with the desired values
for j in range(len(firstx)):
    trainingset[j][0:3] = [firstx[j], firsty[j], 1]

# Because i starts at 0, len_firstx needs to be added to the trainingset row index
for i in range(len(firsty)):
    trainingset[i + len_firstx][0:3] = [twox[i], twoy[i], 0]
Let me know if you have any questions.
EDIT: Alright, it looks like the above misbehaves when trainingset is still a plain Python list rather than a NumPy array. trainingset = [[0,0,0]]*points*2 creates 2*points references to the same inner list, so an assignment like trainingset[j][0:3] = ... changes every row at once, which is exactly the symptom you describe. If trainingset is a NumPy array, you can also change trainingset[j][0:3] and trainingset[i + len_firstx][0:3] to trainingset[j, 0:3] and trainingset[i + len_firstx, 0:3], which only works on arrays and avoids the aliasing entirely.
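Here is a tiny demonstration of that pitfall, as a standalone sketch:

rows = [[0, 0, 0]] * 3   # three references to the SAME inner list
rows[0][0] = 99
print(rows)              # [[99, 0, 0], [99, 0, 0], [99, 0, 0]] -- every "row" changed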
EDIT 2: There's also a more Pythonic way to do what you want, without loops. It standardizes the shapes of the four arrays assumed to be 1D (firstx, twox, etc.; if you could let me know exactly what shape these arrays have, that would be super helpful and I could simplify the code!) and then fills the appropriate rows and columns of trainingset with the corresponding values.
# Function to reshape the 1D arrays.
# Works for shapes (length,), (length, 1), and (1, length).
def reshape_1d_array(arr):
    shape = arr.shape
    if len(shape) == 1:
        return arr[:, None]
    elif shape[0] >= shape[1]:
        return arr
    else:
        return arr.T
# Reshape the 1D arrays
firstx = reshape_1d_array(firstx)
firsty = reshape_1d_array(firsty)
twox = reshape_1d_array(twox)
twoy = reshape_1d_array(twoy)
len_firstx = len(firstx)
# The following 2 lines do what the loops did, but in 1 step.
arr1 = np.concatenate((firstx, firsty[0:len_firstx], np.array([[1]*len_firstx]).T), axis=1)
arr2 = np.concatenate((twox, twoy[0:len_firstx], np.array([[0]*len_firstx]).T), axis=1)
# Now put arr1 and arr2 where they're supposed to go in trainingset.
trainingset[:len_firstx, 0:3] = arr1
trainingset[len_firstx:len_firstx + len(firsty), 0:3] = arr2
This gives the same result as the two for loops I wrote before, but is faster if firstx has more than ~50 elements.
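As a side note, if all four inputs really are plain 1D arrays of equal length, np.column_stack sidesteps the reshaping entirely; this is a hedged alternative sketch under that assumption, not a drop-in replacement for the code above:

import numpy as np

firstx, firsty = np.array([1, 2, 3]), np.array([4, 5, 6])
twox, twoy = np.array([7, 8, 9]), np.array([10, 11, 12])
n = len(firstx)

arr1 = np.column_stack((firstx, firsty, np.ones(n, dtype=int)))   # rows labelled 1
arr2 = np.column_stack((twox, twoy, np.zeros(n, dtype=int)))      # rows labelled 0
trainingset = np.vstack((arr1, arr2))                             # shape (2*n, 3)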
I have a matrix X, and the labels for each row vector of that matrix in a NumPy array Y. If the value of Y is +1, I want to multiply that row of X by a vector W; if the value of Y is -1, I want to multiply it by another vector Z.
I tried the following loop:
for i in X:
    if Y[i] == 1:
        np.sum(np.multiply(i, np.log(w)) + np.multiply((1-i), np.log(1-w)))
    elif Y[i] == -1:
        np.sum(np.multiply(i, np.log(z)) + np.multiply((1-i), np.log(1-z)))
This throws: IndexError: arrays used as indices must be of integer (or boolean) type
i is meant to be the index of X, but I'm not sure how to align the index of X with the value at that index of Y.
How can I do this?
Look into np.where; it is exactly what you need:
res = np.where(Y == +1, np.dot(X, W), np.dot(X, Z))
This assumes that Y can only take the values +1 and -1. If that's not the case, you can adapt the line above, but then we need to know what result you expect when Y takes a different value.
Also, try to avoid explicit for loops when using numpy; the resulting code can be orders of magnitude faster.
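For concreteness, here is a minimal runnable sketch; the shapes are my assumptions (X is (n, d), W and Z are (d,)):

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([+1, -1, +1])
W = np.array([1.0, 0.0])
Z = np.array([0.0, 1.0])

# np.where picks from X.dot(W) where Y == +1 and from X.dot(Z) elsewhere
res = np.where(Y == +1, np.dot(X, W), np.dot(X, Z))
print(res)  # [1. 4. 5.]

Note that np.where evaluates both branches for every row; that is usually fine, and still far faster than a Python loop.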
While it is better to use a no-iteration approach, I think you need a refresher on basic Python iteration.
Make an array:
In [57]: x = np.array(['one','two','three'])
Your style of iteration:
In [58]: for i in x: print(i)
one
two
three
to iterate on indices use range (or arange):
In [59]: for i in range(x.shape[0]): print(i,x[i])
0 one
1 two
2 three
enumerate is a handy way of getting both indices and values:
In [60]: for i,v in enumerate(x): print(i,v)
0 one
1 two
2 three
Your previous question had the same problem (and correction):
How to iterate over a numpy matrix?
Suppose I have a numpy array with 2 rows and 10 columns. I want to select the columns with even values in the first row. The outcome I want can be obtained as follows:
a = list(range(10))
b = list(reversed(range(10)))
c = np.concatenate([a, b]).reshape(2, 10).T
c[c[:, 0] % 2 == 0].T
However, this method transposes twice, and I don't suppose it's very Pythonic. Is there a cleaner way to do the same job?
Numpy allows you to select along each dimension separately. You pass in a tuple of indices whose length is the number of dimensions.
Say your array is
a = np.random.randint(10, size=(2, 10))
The even elements in the first row are given by the mask
m = (a[0, :] % 2 == 0)
You can use a[0] to get the first row instead of a[0, :] because missing indices are synonymous with the slice : (take everything).
Now you can apply the mask to just the second dimension:
result = a[:, m]
You can also convert the mask to indices first. There are subtle differences between the two approaches, which you won't see in this simple case. The biggest difference is usually that integer indices are a little faster, especially when applied more than once:
i = np.flatnonzero(m)
result = a[:, i]
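Putting the two variants side by side, a minimal runnable sketch with an assumed random 2x10 array:

import numpy as np

a = np.random.randint(10, size=(2, 10))
m = (a[0] % 2 == 0)       # boolean mask over columns, built from the first row
result = a[:, m]          # mask applied to the second dimension only
i = np.flatnonzero(m)     # the same selection expressed as integer indices
assert np.array_equal(result, a[:, i])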
For example, let's consider this toy code:

import numpy as np
import numpy.random as rnd

a = rnd.randint(0, 10, (10, 10))
k = (1, 2)
b = a[:, k]
for col in np.arange(np.size(b, 1)):
    b[:, col] = b[:, col] + col*100
This code works when k selects more than one column. However, when k is a single scalar index, the sub-matrix extracted from a collapses into a 1D vector, and the operation in the for loop throws an error.
Of course, I could fix this by checking the dimension of b and reshaping:
if np.ndim(b) == 1:
    b = np.reshape(b, (np.size(b), 1))
in order to obtain a column vector, but this is expensive.
So, the question is: what is the best way to handle this situation?
This seems like something that would arise quite often, and I wonder what the best strategy is for dealing with it.
If you index with a list or tuple, the 2d shape is preserved:
In [638]: a=np.random.randint(0,10,(10,10))
In [639]: a[:,(1,2)].shape
Out[639]: (10, 2)
In [640]: a[:,(1,)].shape
Out[640]: (10, 1)
And I think the iteration over b can be simplified to:
a[:,k] += np.arange(len(k))*100
This sort of calculation will also be easier if k is always a list or tuple, and never a scalar (a scalar does not have a len).
np.column_stack ensures its inputs are 2d (expanding them if they are not) with:

if arr.ndim < 2:
    arr = array(arr, copy=False, subok=True, ndmin=2).T
np.atleast_2d does

elif len(ary.shape) == 1:
    result = ary[newaxis, :]
which of course could be changed in this case to

if b.ndim == 1:
    b = b[:, None]
Anyway, I think it is better to ensure that k is a tuple than to adjust the shape of b afterwards. But keep both options in your toolbox.
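One way to ensure k is always a tuple is a tiny normalizing helper; this is a sketch, and the helper name is mine, not numpy's:

import numpy as np

def as_index_tuple(k):
    # Wrap a scalar index so that a[:, k] always keeps its 2d shape
    return (k,) if np.isscalar(k) else tuple(k)

a = np.random.randint(0, 10, (10, 10))
b = a[:, as_index_tuple(1)]   # shape (10, 1) rather than (10,)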
I have two arrays with the same shape in their first two dimensions, and I'm looking to record the minimum value in each row of the first array. However, I would also like to record the elements at the corresponding positions in the third dimension of the second array. I can do it like this:
A = np.random.random((5000, 100))
B = np.random.random((5000, 100, 3))
A_mins = np.ndarray((5000, 4))

for i, row in enumerate(A):
    current_min = min(row)
    A_mins[i, 0] = current_min
    A_mins[i, 1:] = B[i, row == current_min]
I'm new to programming (so correct me if I'm wrong), but I understand that with NumPy, calculations on whole arrays are faster than iterating over them. With this in mind, is there a faster way of doing this? I can't see how to get rid of the row == current_min part, even though the location of the minimum must already have been 'known' to the computer when it calculated min().
Any tips/suggestions appreciated! Thanks.
Something along the lines of what #lib talked about:
index = np.argmin(A, axis=1)
A_mins[:,0] = A[np.arange(len(A)), index]
A_mins[:,1:] = B[np.arange(len(A)), index]
It is much faster than using a for loop.
For getting the minimum values of each row, use amin instead of min + comparison; for getting their positions, use argmin (as in the answer above). Both functions, like many others in numpy, take an axis argument that lets you take the minimum of each row or each column.
See http://docs.scipy.org/doc/numpy/reference/generated/numpy.amin.html
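A minimal sketch of how amin and argmin fit together here, with the array shape taken from the question:

import numpy as np

A = np.random.random((5000, 100))
row_mins = np.amin(A, axis=1)    # minimum value of each row, shape (5000,)
row_idx = np.argmin(A, axis=1)   # column index of each row's minimum, shape (5000,)
assert np.array_equal(row_mins, A[np.arange(len(A)), row_idx])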