I have a very large 1D Python array x of somewhat repeating numbers, and along with it some data d of the same size.
x = np.array([48531, 62312, 23345, 62312, 1567, ..., 23345, 23345])
d = np.array([0 , 1 , 2 , 3 , 4 , ..., 99998, 99999])
In my context, "very large" means 10k...100k entries. Since some of them repeat, the number of unique entries is only about 5k...15k.
I would like to group them into bins. This should be done by creating two objects. One is a matrix buffer b of data items taken from d. The other is a vector v of unique x values that each of the buffer columns refers to. Here's an example:
v = [48531, 62312, 23345, 1567, ...]
b = [[0,   1,   2,     4,   ...],
     [X,   3,   ...,   ..., ...],
     [..., ..., ...,   ..., ...],
     [X,   X,   99998, X,   ...],
     [X,   X,   99999, X,   ...]]
Since the numbers of occurrences of each unique number in x vary some of the values in the buffer b are invalid (indicated by the capital X, i.e. "don't care").
It's very easy to derive v in numpy:
v, n = np.unique(x, return_counts=True) # yay, just 5ms
and we even get n, which is the number of valid entries within each column of b. Moreover, (np.max(n), v.shape[0]) gives the shape of the matrix b that needs to be allocated.
But how to efficiently generate b?
A for-loop could help
b = np.zeros((np.max(n), v.shape[0]))
for i in range(v.shape[0]):
idx = np.flatnonzero(x == v[i])
b[0:n[i], i] = d[idx]
This loop iterates over all columns of b and extracts the indices idx by identifying all the locations where x == v[i].
However, I don't like this solution because of the rather slow for loop (it takes about 50x longer than the unique command). I'd rather have the operation vectorized.
One vectorized approach would be to create a boolean matrix marking where x == v and then run the nonzero() command on it along the columns. However, this matrix would require memory in the range of 150k x 15k, i.e. about 8 GB on a 32-bit system.
To me it seems rather silly that np.unique can efficiently return the inverse indices, so that x = v[inv_indices], but that there is no way to get the v-to-x assignment lists for each bin in v. This should come almost for free while the function is scanning through x. Implementation-wise, the only challenge would be the unknown size of the resulting index matrix.
Another way of phrasing this problem assuming that the np.unique-command is the method-to-use for binning:
Given the three arrays x, v, inv_indices, where v holds the unique elements of x and x = v[inv_indices], is there an efficient way of generating the index vectors v_to_x[i] such that all(v[i] == x[v_to_x[i]]) for all bins i?
This shouldn't take more time than the np.unique command itself. And I'm happy to provide an upper bound for the number of items in each bin (say, 50).
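For illustration, one possible way to build such lists is a single stable argsort of x split at the cumulative counts n returned by np.unique; this is only a sketch, assuming the v and n from above:

order = np.argsort(x, kind='stable')          # positions of x, grouped by value in the order of v
v_to_x = np.split(order, np.cumsum(n)[:-1])   # one index array per bin
# sanity check: all(np.all(v[i] == x[v_to_x[i]]) for i in range(len(v)))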
Based on the suggestion from #user202729 I wrote this code:
from itertools import groupby

x_sorted_args = np.argsort(x)
x_sorted = x[x_sorted_args]
i = 0
v = -np.ones(T)        # T: assumed pre-defined upper bound on the number of unique values
b = np.zeros((K, T))   # K: assumed pre-defined upper bound on the bin size
for k, g in groupby(enumerate(x_sorted), lambda tup: tup[1]):
    groups = np.array(list(g))[:, 0]
    size = groups.shape[0]
    v[i] = k
    b[0:size, i] = d[x_sorted_args[groups]]
    i += 1
It runs in about 100 ms, which is a considerable speedup w.r.t. the original code posted above.
It first enumerates the values in x, adding the corresponding index information. Then the enumeration is grouped by the actual x value, which is the second element of the tuple generated by enumerate().
The for loop iterates over all the groups, turning each iterator of tuples g into a (size x 2) matrix, and then throws away the second column, i.e. the x values, keeping only the indices. This leaves groups as a 1D array.
Note that groupby() only groups consecutive equal elements, which is why the array must be sorted first.
Good work. I'm just wondering if we can do even better. A lot of unnecessary data copying still seems to happen: creating a list of tuples and then turning it into a 2D matrix just to throw away half of it feels a bit suboptimal.
I received the answer I was looking for by rephrasing the question, see here: python: vectorized cumulative counting
By "cumulative counting" the inv_indices returned by np.unique(), we obtain the array indices of the sparse matrix, so that
c = cumcount(inv_indices)
b[inv_indices, c] = d   # note: here b has shape (len(v), np.max(n)), i.e. the transpose of the buffer defined above
Cumulative counting as proposed in the linked thread is very efficient; run times below 20 ms are very realistic.
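For reference, a minimal sketch of this approach (the exact cumcount implementation in the linked thread may differ), written for the column layout defined at the top, i.e. b of shape (np.max(n), len(v)):

def cumcount(a):
    # for each position, count how many earlier positions hold the same value
    order = np.argsort(a, kind='stable')
    a_sorted = a[order]
    group_start = np.r_[True, a_sorted[1:] != a_sorted[:-1]]
    start_idx = np.flatnonzero(group_start)
    group_len = np.diff(np.r_[start_idx, len(a)])
    within_group = np.arange(len(a)) - np.repeat(start_idx, group_len)
    out = np.empty(len(a), dtype=np.intp)
    out[order] = within_group
    return out

v, inv_indices, n = np.unique(x, return_inverse=True, return_counts=True)
c = cumcount(inv_indices)
b = np.zeros((np.max(n), v.shape[0]))
b[c, inv_indices] = d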
This question was closed as a duplicate of: How to get the cartesian product of multiple lists.
I need to make calculations using two lists, each with 36 elements. The calculation must use one value from each list, over all combinations. Example:
listx = [x1 , x2 , x3 , ... , x36]
listy = [y1 , y2 , y3 , ... , y36]
F(x,y) = ((y-x)*(a/b))+x
x and y in F(x,y) must take all combinations of values from listx and listy. The result should be a 36 x 36 matrix.
This is what I've tried so far:
listx = np.arange(-0.05, 0.301, 0.01)
listy = np.arange(-0.05, 0.301, 0.01)
for x in listx:
    for y in listy:
        F = ((y-x)*(a/b))+x
        print(F)
I think the issue is that you are having trouble conceptualizing the grid that these solutions are supposed to be stored in. This calculation is a good introduction to certain optimizations, and there are a few ways to do it. I'll show you the three I threw together.
First, you could do it with lists and loops, which is very inefficient (numpy is just to show the shape):
import numpy as np

x, y = [], []
length = 35
for i in range(length+1):
    x.append(i/length)  # Normalizing over the range of the grid
    y.append(i/length)  # to compare to later example

def func(x, y, a, b):
    return ((y-x)*(a/b))+x

a = b = 1  # Set a value for a and b

row = []
for i in x:
    column = []
    for j in y:
        column.append(func(i, j, a, b))
    row.append(column)

print(row)
print(np.shape(row))
This will output the solution, assuming a and b are known, as a 36x36 matrix. To build the matrix, we create an outer list that I called row, and a smaller list called column that is recreated on each iteration of the outer loop. The innermost loop appends values to the column list, while each completed column list is appended to the top-level row list. The result has a matrix-like appearance even though it is just a list of lists.
A more efficient way to do this is to use numpy. First, we can keep the loops if you wish and do the calculation with numpy arrays:
import numpy as np

x = y = np.linspace(0, 1, 36)
result = np.zeros((len(x), len(y)))
F = lambda x, y, a, b: ((y-x)*(a/b))+x
a = b = 1
for idx, i in enumerate(x):
    for jdx, j in enumerate(y):
        result[idx, jdx] = F(i, j, a, b)  # plug in the value at the (idx, jdx) grid point

print(result)
print(result.shape)
Here we create the grid values using linspace; I just chose values from 0 to 1 in 36 steps. After this, I create the array we will store the solutions in, with dimensions given by the lengths of the x and y arrays. Finally, the function is created as a lambda, which serves the same purpose as the def previously, just in one line. The loops are kept for now; they iterate over the values i, j and their indices idx, jdx. The results are written into the allocated storage at each index with result[idx, jdx] = F(i,j,a,b).
We can do better, because numpy exists to help remove loops in calculations. Instead, we can utilize the meshgrid function to create a matrix and evaluate the function with it, as so:
import numpy as np
x = y = np.linspace(0,1,36)
X, Y = np.meshgrid(x,y)
F = lambda x,y,a,b: ((y-x)*(a/b))+x
a=b=1
result = F(X,Y,a,b) # Plug in grid directly
print(result.T)
print(result.shape)
Here we pass the numpy arrays to meshgrid, which gives us 36x36 arrays holding the x and y values at each grid point. Then we define the lambda function as before and pass the new X and Y to it. No explicit loops or per-element assignments are needed; the result comes out directly.
It is good to practice using numpy for any calculation you want to do, because they can usually be done without loops.
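As a side note, the same 36x36 result can also be computed without materializing the meshgrid arrays at all, by letting broadcasting expand the 1D arrays; a sketch along the same lines:

import numpy as np

x = y = np.linspace(0, 1, 36)
F = lambda x, y, a, b: ((y - x) * (a / b)) + x
a = b = 1
# x[:, None] has shape (36, 1) and y[None, :] has shape (1, 36); broadcasting
# expands them to (36, 36), so result[i, j] = F(x[i], y[j], a, b).
result = F(x[:, None], y[None, :], a, b)
print(result.shape)   # (36, 36)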
I have 2 large arrays, A and B, and I want to find where the vectors of B occur in A. I have to find 10,000 1 x 800 vectors among 40,000 vectors of the same size.
Example
A = [[1,2],[2,3],[4,5]]
B = [[2,3],[4,5]]
Desired Output:
[1,2]
I can find a single vector using np.argwhere((A == B[0]).all(-1)) but I am not sure how to shape the arrays to find the indices of each vector. I can use a for loop but that is too slow. For example
np.asarray([np.argwhere((A == B[i]).all(-1)) for i in range(np.shape(B)[0])])
Setup
import numpy as np
rows_a = 40000
rows_b = 10000
size = 800
a = np.arange(rows_a * size).reshape((rows_a, size))
np.random.shuffle(a)
b = np.arange(rows_b * size).reshape((rows_b, size))
Solution
d = {tuple(v): i for i, v in enumerate(a)}
idx = [d[tuple(row)] for row in b]
Let's say that a has m rows and b has n rows.
d creates a mapping from the rows of a to their index. tuple(v) is necessary because v is not hashable when it is a list or an ndarray. Building d has O(m) time complexity because you iterate over the rows once.
idx iterates over the rows in b and checks the dictionary to fetch the respective index in a. A dictionary lookup has O(1) time complexity and the loop O(n). All in all, you're looking at O(m+n), which is linear.
What you are doing instead is for each row in b, you check every row in a to find its index. This has O(m*n) complexity, which is quadratic.
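As a quick check, the dictionary approach reproduces the desired output from the small example at the top of the question:

import numpy as np

A = np.array([[1, 2], [2, 3], [4, 5]])
B = np.array([[2, 3], [4, 5]])
d = {tuple(v): i for i, v in enumerate(A)}
idx = [d[tuple(row)] for row in B]
print(idx)   # [1, 2]  -- the rows of B sit at indices 1 and 2 of A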
I have found myself running into the following situation in numpy multiple times over the past couple of months, and I cannot imagine there is not a proper solution for it.
I have a 2d array, let's say
x = np.array([
    [1, 2, 3],
    [2, -5, .333],
    [1, 4, 2],
    [2, -5, 4]])
Now I would like to sort / take the maximum / argsort / argmax / etc. of this array in such a way that it first compares the first column; if the first column is equal, it compares the second column, and then the third. For our example this means:
# max like python: max(x.tolist())
np.tuple_like_amax(x) = np.array([2, -5, 4])
# argmax doesn't have a python equivalent, but something like: [i for i, e in enumerate(x.tolist()) if e == max(x.tolist())][0]
np.tuple_like_argmax = 3
# sorting like python: sorted(x.tolist())
np.tuple_like_sort(x) = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 2.0], [2.0, -5.0, 0.333], [2.0, -5.0, 4.0]])
# argsort doesn't have python equivalent, but something like: sorted(range(len(x)), key=lambda i: x[i].tolist())
np.tuple_like_argsort(x) = np.array([0, 2, 1, 3])
This is exactly the way Python compares tuples (so actually just calling max(x.tolist()) does the trick here for max). It does feel, however, like a waste of time and memory to first convert the array to a Python list, and in addition I would like to use things like argmax, sort, and all the other great numpy functions.
So just to be clear, I'm not interested in Python code that mimics an argmax, but in something that achieves this without converting the array to a Python list.
Found so far:
np.sort seems to work on structured arrays when order= is given. It does feel to me that creating a structured array and then using this method is overkill. Also, argmax doesn't seem to support this, meaning that one would have to use argsort, which has a much higher complexity.
Here I will focus only on finding the lexicographic argmax (the others: max, argmin, and min can be found trivially from argmax). In addition, unlike np.argmax(), we will return all rows that are at rank 0 (if there are duplicate rows), i.e. all the indices where the row is the lexicographic maximum.
The idea is that, for the "tuple-like order" desired here, the function really needs to:
find all indices where the first column has its maximum;
break ties using the places where the second column is max, under the condition that the first column is max;
and so on, as long as there are ties to break (and there are more columns).
def ixmax(x, k=0, idx=None):
    col = x[idx, k] if idx is not None else x[:, k]
    z = np.where(col == col.max())[0]
    return z if idx is None else idx[z]

def lexargmax(x):
    idx = None
    for k in range(x.shape[1]):
        idx = ixmax(x, k, idx)
        if len(idx) < 2:
            break
    return idx
At first, I was worried that the explicit looping in Python would kill it. But it turns out that it is quite fast. In the case where there are no ties (more likely with independent float values, for instance), it returns immediately after a single np.where(x[:, 0] == x[:, 0].max()). Only in the case of ties do we need to look at the (much smaller) subset of rows that were tied. In unfavorable conditions (many repeated values in all columns), it is still ~100x faster or more than the partition method, and a factor of O(log n) faster than lexsort(), of course.
Test 1: correctness
for i in range(1000):
    x = np.random.randint(0, 10, size=(1000, 8))
    found = lexargmax(x)
    assert lexargmax_by_sort(x) in found and np.unique(x[found], axis=0).shape[0] == 1
(where lexargmax_by_sort is np.lexsort(x[:, ::-1].T)[-1])
Test 2: speed
x = np.random.randint(0, 10, size=(100_000, 100))
a = %timeit -o lexargmax(x)
# 776 µs ± 313 ns per loop
b = %timeit -o lexargmax_by_sort(x)
# 507 ms ± 2.65 ms per loop
# b.average / a.average: 652
c = %timeit -o lexargmax_by_partition(x)
# 141 ms ± 2.38 ms
# c.average / a.average: 182
(where lexargmax_by_partition is based on #MadPhysicist's very elegant idea:
def lexargmax_by_partition(x):
    view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
    return np.argpartition(view, -1)[-1]
)
After some more testing on various sizes, we get the following time measurements and performance ratios:
In the LHS plot, lexargmax is the group shown with 'o-' and lexargmax_by_partition is the upper group of lines.
In the RHS plot, we just show the speed ratio.
Interestingly, lexargmax_by_partition execution time seems fairly independent of m, the number of columns, whereas our lexargmax depends a little bit on it. I believe that is reflecting the fact that, in this setting (purposeful collisions of max in each column), the more columns we have, the "deeper" we need to go when breaking ties.
Previous (wrong) answer
To find the argmax of the row by lexicographic order, I was thinking you could do:
def lexmax(x):
    r = (2.0 ** np.arange(x.shape[1]))[::-1]
    return np.argmax(((x == x.max(axis=0)) * r).sum(axis=1))
Explanation:
x == x.max(axis=0) (as an int) is 1 for each element that is equal to the column's max. In your example, it is (astype(int)):
[[0 0 0]
 [1 0 0]
 [0 1 0]
 [1 0 1]]
Then we multiply by a column weight that is more than the sum of the 1's to its right. Powers of two achieve that. We do it in float to handle cases with more than 64 columns.
But this is fatally flawed: The positions of max in the second column should be considered only in the subset where the first column had the max value (to break the tie).
Other approaches, such as affine transformations of all columns so that we can sum them and find the max, don't work either: if the max in column 0 is, say, 1.0, and the second place is at 0.999, then we would have to know that 0.001 difference ahead of time and make sure no combination of values from the columns to the right can sum up to overtake it. So that's a dead end.
To sort an array by the contents of its rows, you can use np.lexsort. The only catch is that it sorts by the last element of the selected axis first:
index = np.lexsort(x.T[::-1])
OR
index = np.lexsort(x[:, ::-1].T)
This is "argsort". You can make it into "sort" by doing
x[index]
"min" and "max" can be done trivially by using the index:
xmin = x[index[0]]
xmax = x[index[-1]]
Alternatively, you can use a technique I suggested in one of my questions: Sorting array of objects by row using custom dtype. The idea is to make each row into a structure that has a field for each element:
view = np.ndarray(x.shape[0], dtype=[('', x.dtype)] * x.shape[1], buffer=x)
You can sort the array in-place by operating on the view:
>>> view.sort()
>>> x
array([[ 1.   ,  2.   ,  3.   ],
       [ 1.   ,  4.   ,  2.   ],
       [ 2.   , -5.   ,  0.333],
       [ 2.   , -5.   ,  4.   ]])
That's because the ndarray constructor points to x as the original buffer.
You can not get argmin, argmax, min and max to work on the result. However, you can still get the min and max in O(n) time using my favorite function in all of numpy: np.partition:
view.partition([0, -1])
xmin = x[0]
xmax = x[-1]
You can use argpartition on the array as well to get the indices of the desired elements:
index = view.argpartition([0, -1])[[0, -1]]
xmin = x[index[0]]
xmax = x[index[-1]]
Notice that both sort and partition have an order argument that you can use to rearrange the comparison of the columns.
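For example, a small sketch (relying on the fact that the unnamed fields in the view above are auto-named 'f0', 'f1', 'f2' by numpy) that compares the last column first:

view.sort(order=('f2', 'f0', 'f1'))   # lexicographic sort: third column, then first, then second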
#create a simple list in python
#list comprehension
x = [i for i in range(100)]
print (x)
#using loops
squares = []
for x in range(10):
    squares.append(x**2)
print (squares)
multiples = k*[z for z in x] for k in squares
So in the last line of code I am trying to multiply both lists. The problem is that the lists are not of the same size, and the k*[z for z in x] part is also incorrect.
For problems with iteration, I suggest checking out Loop Like A Native by Ned Batchelder and Looping Like a Pro by David Baumgold.
Option 1
If you want to multiply them as far as the shortest list goes, zip is your friend:
multiples = [a * b for a, b in zip (x, squares)]
Option 2
If you want a matrix with the product, then you can do it like this
result = [
    [a * b for a in x]
    for b in squares
]
I don't quite understand what the desired output would be. As the function stands now, you would have a list of lists, where the first element has 100 elements, the second one 400, the third 900, and so on.
One thing that's strange: The expression [z for z in x] defines a list that is identical to x. So, you might just write k*x
If you want to multiply the elements of both lists, you would have to write [[k*z for z in x] for k in squares]. This would lead to a list of 10 lists of 100 elements (or a 10x100-matrix) containing the products of your lists.
If you want to have one list of length 100 in the end that holds some kind of products, you will have to think about how to proceed with the shorter list.
EDIT: Or if you want to multiply them as far as possible until you reach the end of the shorter list, FRANCOIS CYRIL's solution is an elegant way to do so.
You can loop on each array to multiply element by element at the same position in a result array:
i = 0
arrayRes = []
while i < min(len(array1), len(array2)):
    arrayRes.append(array1[i] * array2[i])
    i += 1
Or do you prefer to multiply them the matrix way?
x = 0
arrayRes = []
while x < len(array1):
    arrayRes.append([])
    y = 0  # reset the inner index for every row
    while y < len(array2):
        arrayRes[x].append(array1[x] * array2[y])
        y += 1
    x += 1
Definition: Array A(a1,a2,...,an) is >= B(b1,b2,...,bn) if they are of equal size and a_i >= b_i for every i from 1 to n.
For example:
[1,2,3] >= [1,2,0]
[1,2,0] not comparable with [1,0,2]
[1,0,2] >= [1,0,0]
I have a list consisting of a large number of such arrays (approx. 10000, but it can be bigger). The arrays' elements are positive integers. I need to remove every array from this list that is >= at least one of the other arrays. In other words: if there exists a B such that A >= B, then remove A.
Here is my current O(n^2) approach, which is extremely slow. I simply compare every array with all the other arrays and remove it if it's bigger. Are there any ways to speed it up?
import numpy as np
import time
import random
def filter_minimal(lst):
    n = len(lst)
    to_delete = set()
    for i in xrange(n-1):
        if i in to_delete:
            continue
        for j in xrange(i+1, n):
            if j in to_delete:
                continue
            if all(lst[i] >= lst[j]):
                to_delete.add(i)
                break
            elif all(lst[i] <= lst[j]):
                to_delete.add(j)
    return [lst[i] for i in xrange(len(lst)) if i not in to_delete]

def test(number_of_arrays, size):
    x = map(np.array, [[random.randrange(0, 10) for _ in xrange(size)] for i in xrange(number_of_arrays)])
    return filter_minimal(x)

a = time.time()
result = test(400, 10)
print time.time() - a
print len(result)
P.S. I've noticed that using numpy.all instead of the builtin Python all slows the program down dramatically. What could be the reason?
Might not be exactly what you are asking for, but this should get you started.
import numpy as np

def compare(x, y):
    # Reshape x to a higher-dimensional array
    compare_array = x.reshape(-1, 1, x.shape[-1])
    # You can now compare every x with every y element-wise simultaneously
    mask = (y >= compare_array)
    # Create a mask that first ensures that all elements of y are greater than
    # or equal to x, and then checks that this is the case at least once.
    mask = np.any(np.all(mask, axis=-1), axis=-1)
    # Apply this mask to x
    return x[mask]

def test(number_of_arrays, size, maxval):
    # Create arrays of shape (number_of_arrays, size) with maximum value maxval.
    x = np.random.randint(maxval, size=(number_of_arrays, size))
    y = np.random.randint(maxval, size=(number_of_arrays, size))
    return compare(x, y)

print test(50, 10, 20)
First of all we need to carefully check the objective. Is it true that we delete any array that is > ANY of the other arrays, even the deleted ones? For example, if A > B and C > A and B=C, then do we need to delete only A or both A and C? If we only need to delete INCOMPATIBLE arrays, then it is a much harder problem. This is a very difficult problem because different partitions of the set of arrays may be compatible, so you have the problem of finding the largest valid partition.
Assuming the easy problem, a better way to define the problem is that you want to KEEP all arrays which have at least one element < the corresponding element in ALL the other arrays. (In the hard problem, it is the corresponding element in the other KEPT arrays. We will not consider this.)
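As an aside, for moderate n that keep-criterion can be checked directly with a brute-force broadcast; this is only a sketch, not the staged algorithm described below, its memory grows as n^2 * m, and it drops all copies of exact duplicates rather than keeping one:

import numpy as np

def filter_minimal_broadcast(arrs):
    # arrs: array of shape (n_arrays, n_elements), one array per row
    a = np.asarray(arrs)
    # ge[i, j] is True when row i >= row j element-wise
    ge = np.all(a[:, None, :] >= a[None, :, :], axis=-1)
    np.fill_diagonal(ge, False)      # every row is trivially >= itself; ignore that
    keep = ~ge.any(axis=1)           # keep row i only if it is not >= any other row
    return a[keep]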
Stage 1
To solve this problem, arrange the arrays in columns and then sort each row, while maintaining the key of each array and the mapping of each array-row to its position (the POSITION lists). For example, you might end up with a Stage 1 result like this:
row 1: B C D A E
row 2: C A E B D
row 3: E D B C A
Meaning that for the first element (row 1) array B has a value >= C, C >= D, etc.
Now, sort and iterate the last column of this matrix ({E D A} in the example). For each item, check if the element is less than the previous element in its row. For example, in row 1, you would check if E < A. If this is true you return immediately and keep the result. For example, if E_row1 < A_row1 then you can keep array E. Only if the values in the row are equal do you need to do a stage 2 test (see below).
In the example shown you would keep E, D, A (as long as they passed the test above).
Stage 2
This leaves B and C. Sort the POSITION list for each. For example, this will tell you that the row with B's minimum position is row 2. Now do a direct comparison between B and every array below it in the minimum row, here row 2. Here there is only one such array, D. Do a direct comparison between B and D. This shows that B < D in row 3, therefore B is compatible with D. If the item is compatible with every array below its minimum position, keep it. We keep B.
Now we do the same thing for C. In C's case we need only do one direct comparison, with A. C dominates A so we do not keep C.
Note that in addition to testing items that did not appear in the last column we need to test items that had equality in Stage 1. For example, imagine D=A=E in row 1. In this case we would have to do direct comparisons for every equality involving the array in the last column. So, in this case we direct compare E to A and E to D. This shows that E dominates D, so E is not kept.
The final result is we keep A, B, and D. C and E are discarded.
The overall performance of this algorithm is O(n^2 log n) for Stage 1, plus somewhere between O(n) (lower bound) and O(n log n) (upper bound) for Stage 2. So the maximum running time is O(n^2 log n + n log n) and the minimum running time is O(n^2 log n + n). Note that the running time of your algorithm is cubic, O(n^3), since you compare every pair of arrays (n^2 pairs) and each comparison involves n element comparisons, i.e. n*n*n.
In general, this will be much faster than the brute force approach. Most of the time will be spent sorting the original matrix, a more or less unavoidable task. Note that you could potentially improve my algorithm by using priority queues instead of sorting, but the resulting algorithm would be much more complicated.