I find myself running into the following situation in numpy multiple times over the past couple of months, and I cannot imagine there is no proper solution for it.
I have a 2d array, let's say
x = np.array([
    [1, 2, 3],
    [2, -5, .333],
    [1, 4, 2],
    [2, -5, 4]])
Now I would like to sort / take the maximum / argsort / argmax / etc. of this array in such a way that it compares the first column first. If the first column is equal, it compares the second column, and then the third. So for our example this means:
# max like python: max(x.tolist())
np.tuple_like_amax(x) = np.array([2, -5, 4])
# argmax doesn't have a python equivalent, but something like: [i for i, e in enumerate(x.tolist()) if e == max(x.tolist())][0]
np.tuple_like_argmax(x) = 3
# sorting like python: sorted(x.tolist())
np.tuple_like_sort(x) = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 2.0], [2.0, -5.0, 0.333], [2.0, -5.0, 4.0]])
# argsort doesn't have python equivalent, but something like: sorted(range(len(x)), key=lambda i: x[i].tolist())
np.tuple_like_argsort(x) = np.array([0, 2, 1, 3])
This is exactly how python compares tuples (so just calling max(x.tolist()) does the trick here for max). It feels like a waste of time and memory, however, to first convert the array to a python list, and in addition I would like to use things like argmax, sort and all the other great numpy functions.
So just to be clear, I'm not interested in python code that mimics an argmax, but in something that achieves this without converting the array to a python list.
Found so far:
np.sort seems to work on structured arrays when order= is given. It feels like overkill to me, though, to create a structured array and then use this method. Also, argmax doesn't seem to support this, meaning that one would have to use argsort, which has a much higher complexity.
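For reference, a rough sketch of what that structured-array route looks like (assuming the default auto-generated field names 'f0', 'f1', ...):

import numpy as np

x = np.array([
    [1, 2, 3],
    [2, -5, .333],
    [1, 4, 2],
    [2, -5, 4]])

# One record per row, one auto-named field per column.
s = np.rec.fromarrays(x.T)                       # fields f0, f1, f2
order = np.argsort(s, order=('f0', 'f1', 'f2'))  # tuple-like argsort -> array([0, 2, 1, 3])
x_sorted = x[order]                              # tuple-like sort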
Here I will focus only on finding the lexicographic argmax (the others: max, argmin, and min can be found trivially from argmax). In addition, unlike np.argmax(), we will return all rows that are at rank 0 (if there are duplicate rows), i.e. all the indices where the row is the lexicographic maximum.
The idea is that, for the "tuple-like order" desired here, the function is really:
find all indices where the first column has the maximum;
break ties with the places where the second column is max, under the condition that the first column is max;
etc., as long as there are ties to break (and more columns).
def ixmax(x, k=0, idx=None):
    col = x[idx, k] if idx is not None else x[:, k]
    z = np.where(col == col.max())[0]
    return z if idx is None else idx[z]

def lexargmax(x):
    idx = None
    for k in range(x.shape[1]):
        idx = ixmax(x, k, idx)
        if len(idx) < 2:
            break
    return idx
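(As a small aside, not part of the original code: for numeric data the remaining operations follow directly from lexargmax, since negating the values reverses the lexicographic order.)

idx = lexargmax(x)
row_max = x[idx[0]]        # lexicographic max row

idx_min = lexargmax(-x)    # lexicographic argmin via negation
row_min = x[idx_min[0]]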
At first, I was worried that the explicit looping in Python would kill it. But it turns out that it is quite fast. In the case where there are no ties (more likely with independent float values, for instance), it returns immediately after a single np.where(x[:, 0] == x[:, 0].max()). Only in the case of ties do we need to look at the (much smaller) subset of rows that were tied. In unfavorable conditions (many repeated values in all columns), it is still ~100x or more faster than the partition method, and O(log n) faster than lexsort(), of course.
Test 1: correctness
for i in range(1000):
    x = np.random.randint(0, 10, size=(1000, 8))
    found = lexargmax(x)
    assert lexargmax_by_sort(x) in found and np.unique(x[found], axis=0).shape[0] == 1
(where lexargmax_by_sort(x) is np.lexsort(x[:, ::-1].T)[-1])
Test 2: speed
x = np.random.randint(0, 10, size=(100_000, 100))
a = %timeit -o lexargmax(x)
# 776 µs ± 313 ns per loop
b = %timeit -o lexargmax_by_sort(x)
# 507 ms ± 2.65 ms per loop
# b.average / a.average: 652
c = %timeit -o lexargmax_by_partition(x)
# 141 ms ± 2.38 ms
# c.average / a.average: 182
(where lexargmax_by_partition is based on @MadPhysicist's very elegant idea:
def lexargmax_by_partition(x):
    view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
    return np.argpartition(view, -1)[-1]
)
After some more testing on various sizes, we get the following time measurements and performance ratios:
In the LHS plot, lexargmax is the group shown with 'o-' and lexargmax_by_partition is the upper group of lines.
In the RHS plot, we just show the speed ratio.
Interestingly, lexargmax_by_partition execution time seems fairly independent of m, the number of columns, whereas our lexargmax depends on it a little. I believe this reflects the fact that, in this setting (purposeful collisions of max in each column), the more columns we have, the "deeper" we need to go when breaking ties.
Previous (wrong) answer
To find the argmax of the row by lexicographic order, I was thinking you could do:
def lexmax(x):
    r = (2.0 ** np.arange(x.shape[1]))[::-1]
    return np.argmax(((x == x.max(axis=0)) * r).sum(axis=1))
Explanation:
x == x.max(axis=0) (as an int) is 1 for each element that is equal to the column's max. In your example, it is (astype(int)):
[[0 0 0]
[1 0 0]
[0 1 0]
[1 0 1]]
then we multiply by a column weight that is more than the sum of 1's on the right. Powers of two achieve that. We do it in float to address cases with more than 64 columns.
But this is fatally flawed: The positions of max in the second column should be considered only in the subset where the first column had the max value (to break the tie).
Other approaches, including affine transformations of all columns so that we can sum them and find the max, don't work either: if the max in column 0 is, say, 1.0, and there is a second place at 0.999, then we would have to know that difference of 0.001 ahead of time and make sure no combination of values from the columns to the right can sum up to overtake that difference. So that's a dead end.
To sort a list by the contents of a row, you can use np.lexsort. The only catch is that it sorts by the last element of the selected axis first:
index = np.lexsort(x.T[::-1])
OR
index = np.lexsort(x[:, ::-1].T)
This is "argsort". You can make it into "sort" by doing
x[index]
"min" and "max" can be done trivially by using the index:
xmin = x[index[0]]
xmax = x[index[-1]]
Alternatively, you can use a technique I suggested in one of my questions: Sorting array of objects by row using custom dtype. The idea is to make each row into a structure that has a field for each element:
view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
You can sort the array in-place by operating on the view:
>>> view.sort()
>>> x
array([[ 1.   ,  2.   ,  3.   ],
       [ 1.   ,  4.   ,  2.   ],
       [ 2.   , -5.   ,  0.333],
       [ 2.   , -5.   ,  4.   ]])
That's because the ndarray constructor points to x as the original buffer.
You cannot get argmin, argmax, min and max to work on the result. However, you can still get the min and max in O(n) time using my favorite function in all of numpy: np.partition:
view.partition([0, -1])
xmin = x[0]
xmax = x[-1]
You can use argpartition on the array as well to get the indices of the desired elements:
index = view.argpartition([0, -1])[[0, -1]]
xmin = x[index[0]]
xmax = x[index[-1]]
Notice that both sort and partition have an order argument that you can use to rearrange the comparison of the columns.
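A small sketch of that (assuming the auto-generated field names 'f0', 'f1', 'f2' of the view above, and a hypothetical column ordering):

# Compare the last column first, then the first two.
view.sort(order=['f2', 'f0', 'f1'])
# or, for min/max only:
view.partition([0, -1], order=['f2', 'f0', 'f1'])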
Related
I have a matrix of 10,000 by 10,000 filled with 1s and 0s. What I want to do is go through each column and find the rows that contain the value 1.
Then I want to store the result in a new matrix with 2 columns: column 1 = the column index, and column 2 = an array of row indices that contain 1. There are some columns that do not have any 1s at all, in which case it would be an empty array.
I tried a for loop, but it is computationally inefficient.
I tried with a smaller matrix
# sample matrix
n = 4
mat = [[randint(0, 1) for _ in range(n)] for _ in range(n)]
arr = np.random.randint(0, size=(4, 2))
for col in range(n):
    arr[n][1] = n
    arr[n][2] = np.where(col == 1)
but this runs quite slowly for a 10,000 by 10,000 matrix. I am wondering if this is right and if there was a better way?
Getting indices where a[i][j] == 1
You can get the data that you're looking for (locations of ones within a matrix of zeroes and ones) efficiently using numpy.argwhere() or numpy.nonzero(), however you will not be able to get them in the format specified in your original question using NumPy ndarrays alone.
You could get the data in your specified format using a combination of ndarrays and standard Python lists; however, since efficiency is paramount given the size of the data you are working with, I would think it best to focus on getting the data rather than getting it into the format of an ndarray of irregular Python lists.
You can always reformat the results (indices of 1 within your matrix) following computation if the format you have mentioned is a hard requirement, and this way your code will benefit from optimisations provided by NumPy during the heavy computation - reducing the execution time of your procedure overall.
Example using np.argwhere()
import numpy as np
a = np.random.randint(0, 2, size=(4,4))
b = np.argwhere(a == 1)
print(f'a\n{a}')
print(f'b\n{b}')
Output
a
[[1 1 1 1]
[0 0 0 0]
[1 0 1 0]
[1 1 1 1]]
b
[[0 0]
[0 1]
[0 2]
[0 3]
[2 0]
[2 2]
[3 0]
[3 1]
[3 2]
[3 3]]
As you can see, np.argwhere(a == 1) returns an ndarray whose values are ndarrays containing the indices of locations in a whose values (x) meet the condition x == 1.
I gave the above method a try with a = np.random.randint(0, 2, size=(10000, 10000)) on my laptop (nothing fancy) a few times and it finished in around 3-5 seconds each time.
Getting row indices where all values != 1
If you want to store all row indices of a containing no values == 1, the most straightforward way (assuming you are using my example code above) would probably be to use numpy.setdiff1d() to return an array of row indices that are not present within b, i.e. the set difference between an array containing all row indices of a and the 1d array b[:, 0], which holds the row indices of all values in a that are == 1.
Assuming the same a and b as the above example.
c = np.setdiff1d(np.arange(a.shape[0]), b[:, 0])
print(c)
Output
[1]
In the above example c = [1] as 1 is the only row index in a that doesn't contain any values == 1.
It is worth noting that if a is defined as np.random.randint(0, 2, size=(10000, 10000)), the probability of c being anything but a zero-length (i.e. empty) array is vanishingly small, because for a row to contain no values == 1, np.random would have to return 0 ten thousand times in a row to fill that row with 0s.
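As a rough worked calculation (an aside, not part of the original answer): the chance that one particular row contains no 1s is 0.5^10000 ≈ 10^-3010, so the expected number of all-zero rows among the 10,000 rows is about 10000 * 10^-3010, which is effectively zero.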
Why use multiple NumPy arrays?
I know that it may seem strange to use b and c to store results pertaining to locations where a == 1 and a != 1 respectively. Why not just use an irregular list as outlined in your original question?
The answer in short is efficiency. By using NumPy arrays you will be able to vectorise computations on your data and largely avoid costly Python loops, the benefits of which will be magnified considerably as reflected in time spent on execution given the size of the data you are working with.
You can always store your data in a different format that is more human friendly and map it back to NumPy as required, however the above examples will likely increase efficiency substantially at execution time when compared to the example in your original question.
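For instance, a minimal sketch (assuming the a and b from the example above) of mapping the argwhere output back into the per-column format from the original question:

# b[:, 0] holds the row index and b[:, 1] the column index of every 1 in a.
rows_with_one = [b[b[:, 1] == col, 0] for col in range(a.shape[1])]
# rows_with_one[col] is an array of the row indices containing 1 in that
# column (an empty array for columns without any 1s).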
I have a very large 1D python array x of somewhat repeating numbers and, along with it, some data d of the same size.
x = np.array([48531, 62312, 23345, 62312, 1567, ..., 23345, 23345])
d = np.array([0 , 1 , 2 , 3 , 4 , ..., 99998, 99999])
in my context "very large" refers to 10k...100k entries. Some of them are repeating so the number of unique entries is about 5k...15k.
I would like to group them into bins. This should be done by creating two objects. One is a matrix buffer b of data items taken from d. The other object is a vector v of unique x values that each of the buffer columns refers to. Here's an example:
v = [48531, 62312, 23345, 1567, ...]
b = [[0 , 1 , 2 , 4 , ...]
[X , 3 , ....., ...., ...]
[ ...., ....., ....., ...., ...]
[X , X , 99998, X , ...]
[X , X , 99999, X , ...] ]
Since the numbers of occurrences of each unique number in x vary some of the values in the buffer b are invalid (indicated by the capital X, i.e. "don't care").
It's very easy to derive v in numpy:
v, n = np.unique(x, return_counts=True) # yay, just 5ms
and we even get n which is the number of valid entries within each column in b. Moreover, (np.max(n), v.shape[0]) returns the shape of the matrix b that needs to be allocated.
But how to efficiently generate b?
A for-loop could help
b = np.zeros((np.max(n), v.shape[0]))
for i in range(v.shape[0]):
    idx = np.flatnonzero(x == v[i])
    b[0:n[i], i] = d[idx]
This loop iterates over all columns of b and extracts the indices idx by identifying all the locations where x == v[i].
However I don't like the solution because of the rather slow for loop (taking about 50x longer than the unique command). I'd rather have the operation vectorized.
So one vectorized approach would be to create a matrix of indices where x == v and then run the nonzero() command on it along the columns. However, this matrix would require memory in the range of 150k x 15k, so about 8GB on a 32 bit system.
To me it sounds rather silly that the np.unique operation can efficiently return even the inverse indices, so that x = v[inv_indices], but that there is no way to get the v-to-x assignment lists for each bin in v. This should come almost for free while the function is scanning through x. Implementation-wise, the only challenge is the unknown size of the resulting index matrix.
Another way of phrasing this problem assuming that the np.unique-command is the method-to-use for binning:
given the three arrays x, v, inv_indices where v are the unique elements in x and x = v[inv_indices] is there an efficient way of generating the index vectors v_to_x[i] such that all(v[i] == x[v_to_x[i]]) for all bins i?
I shouldn't have to spend more time than for the np.unique-command itself. And I'm happy to provide an upper bound for the number of items in each bin (say e.g. 50).
Based on the suggestion from @user202729 I wrote this code:
from itertools import groupby

x_sorted_args = np.argsort(x)
x_sorted = x[x_sorted_args]
i = 0
v = -np.ones(T)        # T = number of unique values in x
b = np.zeros((K, T))   # K = maximum bin size, i.e. np.max(n)
for k, g in groupby(enumerate(x_sorted), lambda tup: tup[1]):
    groups = np.array(list(g))[:, 0]
    size = groups.shape[0]
    v[i] = k
    b[0:size, i] = d[x_sorted_args[groups]]
    i += 1
It runs in about ~100ms, which is a considerable speedup w.r.t. the original code posted above.
It first enumerates the values in x adding the corresponding index information. Then the enumeration is grouped by the actual x value which in fact is the second value of the tuple generated by enumerate().
The for loop iterates over all the groups turning those iterators of tuples g into the groups matrix of size (size x 2) and then throws away the second column, i.e. the x values keeping only the indices. This leads to groups being just a 1D array.
groupby() only works on sorted arrays.
Good work. I'm just wondering if we can do even better? Still a lot of unreasonable data copying seems to happen. Creating a list of tuples and then turning this into a 2D matrix just to throw away half of it still feels a bit suboptimal.
I received the answer I was looking for by rephrasing the question, see here: python: vectorized cumulative counting
by "cumulative counting" the inv_indices returned by np.unique() we receive the array indices of the sparse matrix so that
c = cumcount(inv_indices)
b[inv_indices, c] = d
cumulative counting as proposed in the thread linked above is very efficient. Run times lower than 20ms are very realistic.
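For completeness, here is a minimal sketch of such a cumcount (one common vectorized implementation along the lines of the linked thread; an assumption, not necessarily the exact code used there):

import numpy as np

def cumcount(a):
    # For each element, count how many times its value has occurred before it.
    # Assumes a is a 1D array of non-negative integers (like inv_indices).
    counts = np.bincount(a)
    starts = np.cumsum(counts) - counts   # start offset of each value group
    order = np.argsort(a, kind='stable')
    c = np.empty(len(a), dtype=int)
    c[order] = np.arange(len(a)) - np.repeat(starts, counts)
    return c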
I have three numpy arrays called RowIndex, ColIndex and Entry. Essentially, they hold a subset of entries from a matrix: the row indexes, column indexes, and values of those entries, respectively. I also have two numpy 2D arrays (matrices), U and M. Let alpha and beta be two given constants. I need to iterate through the subset of entries of the matrix, which I can do by iterating through RowIndex, ColIndex and Entry. Say,
i=RowIndex[0], j=ColIndex[0], value = Entry[0]
then I need to update i'th row and j'th column of U and M respectively according to some equation. Then, I make
i=RowIndex[1], j=ColIndex[1], value = Entry[1]
and so on. The detail is below.
for iter in np.arange(len(RowIndex)):
    i = RowIndex[iter]
    j = ColIndex[iter]
    value = Entry[iter]
    e = value - np.dot(U[i,:], M[:,j])
    OldUi = U[i,:]
    OldMj = M[:,j]
    U[i,:] = OldUi + beta * (e*OldMj - alpha*OldUi)
    M[:,j] = OldMj + beta * (e*OldUi - alpha*OldMj)
The problem is that the code is extremely slow. Is there any portion of code where I can speed this up?
PS: For the curious ones, this is a variant of the prize-winning solution to the famous Netflix million-dollar prize problem. RowIndex corresponds to users, ColIndex corresponds to movies, and Entry to their ratings. Most of the ratings are missing. Known ratings are stacked up in RowIndex, ColIndex and Entry. Now you try to find matrices U and M such that the rating of the i'th user for the j'th movie is given by np.dot(U[i,:], M[:,j]). Based on the available ratings, you try to find the matrices U and M (or their rows and columns) using an update equation as shown in the above code.
If I didn't understand wrong, your code can be vectorized as follows:
import numpy as np
U, M = # two 2D matrices
rows_idx = # list of indexes
cols_idx = # list of indexes
values = # np.array() of values
e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()
Uo = U.copy()
Mo = M.copy()
U[rows_idx, :] += beta * ((e * Mo[:, cols_idx]).T - alpha * Uo[rows_idx, :])
M[:, cols_idx] += beta * ((e * Uo[rows_idx, :].T) - alpha * Mo[:, cols_idx])
Here,
e = values - np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal()
computes your
e = value - np.dot(U[i,:],M[:,j])
Note that the result you want resides in the diagonal of the dot product between matrices.
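As an aside (not part of the original answer), the full dot product does not have to be materialized just to take its diagonal; np.einsum can compute the diagonal entries directly:

# Equivalent to np.dot(U[rows_idx, :], M[:, cols_idx]).diagonal(), but without
# building the full len(rows_idx) x len(cols_idx) product matrix.
e = values - np.einsum('ij,ji->i', U[rows_idx, :], M[:, cols_idx])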
This won't handle sequential updates (there is no available vectorization for that), but it will allow you to perform a batch of independent updates in a vectorized and faster way.
As stated above, the code I proposed to you can't handle sequential updates, because by definition, a sequential updating scheme can't be vectorized. Anything of the form
A(t) = A(t-1) +/* something
where t defines time, can't be updated in parallel.
So, what I proposed, is a vectorized update for independent updates.
Imagine you have M and U, each of size 10x10, and you have the following row and column indexes:
rows_idx = [1, 1, 3, 4, 5, 0]
cols_idx = [7, 1, 7, 5, 6, 5]
You can identify from there two independent sets (considering that indexes are ordered):
rows_idx = [1, 4, 5], [1, 3, 0]
cols_idx = [7, 5, 6], [1, 7, 5]
Note that independent sets are made by indexes in both rows and columns that are unique. With that definition, you can reduce the number of loops you need from 6 (in this case) to 2:
for i in range(len(rows_idx)):
    ridx = rows_idx[i]
    cidx = cols_idx[i]
    # Use the vectorized scheme proposed above the edit
    e = values - np.dot(U[ridx, :], M[:, cidx]).diagonal()
    Uo = U.copy()
    Mo = M.copy()
    U[ridx, :] += beta * ((e * Mo[:, cidx]).T - alpha * Uo[ridx, :])
    M[:, cidx] += beta * ((e * Uo[ridx, :].T) - alpha * Mo[:, cidx])
So, either if you have a way of manually (or easily) extracting the independent updates, or you calculate the list by a using search algorithm, the above code would vectorize the independent updates.
For clarification just in case, in the above example:
rows_idx = [1, 1, 3, 4, 5, 0]
cols_idx = [7, 1, 7, 5, 6, 5]
The 2nd row index can't be parallelized because 1 has appeared before, and the 3rd and last column indexes can't be parallelized for the same reason (with 7 and 5). So, as both rows and columns need to be unique, we end up with 2 sets of tuples:
rows_idx = [1, 4, 5], [1, 3, 0]
cols_idx = [7, 5, 6], [1, 7, 5]
From here, the way to go would depend on your data. The problem of finding independent sets could be very expensive, specially if most of them are dependent on some previous updates.
If you have a way from your data (say that you have your data recorded on time) to extract independent sets, then the batch update will help you. In the other hand, if you have your data all together (which is common), it will depend on one factor:
If you can assure that the length N of the independent sets is much larger than the number M of independent sets (which more or less means that you end up with only a few independent sets, M = {2, 3, 4}, for N = 100000 row/col indexes, i.e. N >> M), then it might be worth looking for independent sets.
In other words, if you are going to update 30 authors and 30 movies in 10000 different combinations, then your data will be likely to be dependent in previous updates, however, if you are going to update 100000 authors and 100000 movies in 30 combinations, then your data is likely to be independent.
Some code to find independent sets, if you don't have a way of extracting them from other information, would be something like this:
independent_sets = []  # each entry: (list of rows, list of cols), pairwise aligned
for row, col in zip(rows_idx, cols_idx):
    for rows, cols in independent_sets:
        if row not in rows and col not in cols:
            rows.append(row)
            cols.append(col)
            break
    else:
        # no existing set can take this update: start a new one
        independent_sets.append(([row], [col]))
As you can see, in order to find independent sets you already need to iterate over the whole list of row/column indexes. The code above is not the most efficient one, and I'm pretty sure there are specific algorithms for this. But the cost of finding independent sets might be higher than doing all your sequential updates if your updates are likely to depend on previous ones.
To finish: after the whole post, it entirely depends on your data.
If you can extract independent sets beforehand, based on how you obtain the rows/columns you want to update, then you can easily update them in a vectorized way.
If you can ensure that most of your updates will be independent (say, 990 out of 10000 will be), it might be worth trying to find that independent set. One way to approximate the set is by using np.unique:
# Just get the index of the unique rows and columns
_, idx_rows = np.unique(rows_idx, return_index=True)
_, idx_cols = np.unique(cols_idx, return_index=True)
# Get the index where both rows and columns are unique
idx = np.intersect1d(idx_rows, idx_cols)
Now idx contains the positions of rows_idx and cols_idx that are unique, hopefully this can reduce your computational cost a lot. You can use my batch update to update fast those rows and columns corresponding to those indexes. You can then use your initial approach to update the hopefully few entries that are repeated iterating over the non-unique indexes.
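A rough sketch of that combination (my own assembly of the pieces above, with hypothetical names; it assumes rows_idx, cols_idx and values are numpy arrays and idx comes from the np.intersect1d step above):

unique_mask = np.zeros(len(rows_idx), dtype=bool)
unique_mask[idx] = True

# 1) Batch-update all positions whose row AND column appear only once.
r, c, v = rows_idx[idx], cols_idx[idx], values[idx]
e = v - np.dot(U[r, :], M[:, c]).diagonal()
Uo, Mo = U.copy(), M.copy()
U[r, :] += beta * ((e * Mo[:, c]).T - alpha * Uo[r, :])
M[:, c] += beta * ((e * Uo[r, :].T) - alpha * Mo[:, c])

# 2) Fall back to the sequential loop for the remaining (repeated) entries.
for i in np.flatnonzero(~unique_mask):
    ri, ci, vi = rows_idx[i], cols_idx[i], values[i]
    e = vi - np.dot(U[ri, :], M[:, ci])
    old_u, old_m = U[ri, :].copy(), M[:, ci].copy()
    U[ri, :] = old_u + beta * (e * old_m - alpha * old_u)
    M[:, ci] = old_m + beta * (e * old_u - alpha * old_m)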
If you have multiple updates for same actors or movies, then... keep your sequential update scheme, as finding independent sets will be harder than iterative update.
I have a numpy array of values like this:
a = np.array((1, 3, 4, 5, 10))
In this case the array has length 5. Now I want to know the difference between the lowest and highest value in the array, but only within a certain continuous part of the array, for example with length 3.
So in this case it would be the difference between 4 and 10, so 6. It would also be nice to have the index of the starting point of the continuous part (in the above example that would be 2). So something like this:
def f(a, length_of_part):
    ...
    return (max_difference, starting_index)
I know I could iterate over sliced parts of the array, but for my actual purpose I have ~150k arrays of length 1500, so that would take too long.
What would be an easy and quick way of doing this?
Thanks in advance!
This is a bit tricky to get done in a vectorised way in NumPy. One option is to use numpy.lib.stride_tricks.as_strided, which requires care because it allows access to arbitrary memory. Here's an example for a window size of k = 3:
>>> k = 3
>>> shape = (len(a) - k + 1, k)
>>> b = numpy.lib.stride_tricks.as_strided(
...     a, shape=shape, strides=(a.itemsize, a.itemsize))
>>> moving_ptp = b.ptp(axis=1)
>>> start_index = moving_ptp.argmax()
>>> moving_ptp[start_index]
6
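As an aside (assuming NumPy >= 1.20), the same windows can be built more safely with sliding_window_view, which avoids handling strides by hand:

import numpy as np

a = np.array((1, 3, 4, 5, 10))
k = 3
b = np.lib.stride_tricks.sliding_window_view(a, k)  # shape: (len(a) - k + 1, k)
moving_ptp = b.ptp(axis=1)
start_index = moving_ptp.argmax()         # 2
max_difference = moving_ptp[start_index]  # 6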
I have two floating arrays and want to find data points which match within a certain range.
This is what I got so far:
import numpy as np
for vx in range(len(arr1)):
    match = (np.abs(arr2 - arr1[vx])).argmin()
    if abs(arr1[vx] - arr2[match]) < 0.375:
        point = arr2[match]
The problem is that arr1 contains 150000 elements and arr2 around 110000 elements. This takes an awful amount of time. Do you have suggestions to speed things up?
In addition to not being vectorized, your current search is O(n * m), where n is the size of arr2 and m is the size of arr1. In these kinds of searches it helps to sort arr1 or arr2 so you can use a binary search. Sorting ends up being the slowest step, but it's still faster if m is large, because the O(n log n) sort is faster than the O(n * m) search.
Here is how you can do the search in a vectorized way using the sorted array:
def find_closest(A, target):
    # A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    idx -= target - left < right - target
    return A[idx]

arr2.sort()
closest = find_closest(arr2, arr1)
closest = np.where(abs(closest - arr1) < .375, closest, np.nan)
The whole idea of using numpy is to avoid computation with loops.
Specifying criteria to extract a new array that satisfies the criteria can be implemented easily with array computation. Here's an example extracting the values from array a whose elements have an absolute difference of less than 0.75 from the corresponding element in array b:
a = np.array([1, 0, 0.5, 1.2])
b = np.array([1.2, 1.1, 1.3, 1.4])
c = a[abs(a - b) < 0.75]
Which gives us
array([ 1. , 1.2])