Optimization of a nested for loop in Python

x_tsvd is a matrix with 4.6 million rows.
svd_tfidf is a matrix with 1862 rows.
Both matrices have the same number of columns (260).
I want to calculate the cosine similarity between each of the 4.6 M rows of x_tsvd and each of the 1862 rows of svd_tfidf.
Is there any way I can optimize it so that it takes less time?
import numpy as np
from numpy.linalg import norm

best_match = []
keys = np.array(df_5M['file'])
values = np.array(df['file'])
for i in range(len(x_tsvd)):
    array_ = []
    for j in range(len(svd_tfidf)):
        cosine_similarity_ = np.dot(x_tsvd[i], svd_tfidf[j]) / (norm(x_tsvd[i]) * norm(svd_tfidf[j]))
        array_.append(cosine_similarity_)
    index = np.array(array_).argsort()
    best_match.append({keys[i]: values[index][::-1][0:5]})
Update:
import numpy as np
from numpy.linalg import norm

best_match = []
#b = copy.copy(svd_tfidf)
keys = np.array(df_5M['file'])
values = np.array(df['file'])
for i in range(len(x_tsvd)):
    a = x_tsvd[i]
    b = svd_tfidf
    a_dot_b = np.sum(np.multiply(a, b), axis=1)
    norm_a = norm(a)
    norm_b = norm(b, axis=1)
    cosine_similarity_ = a_dot_b / (norm_a * norm_b)
    index = np.argsort(cosine_similarity_)
    best_match.append({keys[i]: values[index][::-1][0:6]})

There are several issues in your code. First of all, norm(x_tsvd[i]) is recomputed len(svd_tfidf)=1862 times, while the expression can be moved into the parent loop. Furthermore, norm(svd_tfidf[j]) is recomputed len(x_tsvd)=4.6e6 times, while it can be precomputed for all j values only once. Moreover, calling np.dot(x_tsvd[i], svd_tfidf[j]) in two nested loops is not efficient. You can use one big matrix multiplication instead: x_tsvd @ svd_tfidf.T. However, since this matrix is huge (~64 GiB), it is reasonable to split x_tsvd into chunks of size 512–4096. Additionally, you can precompute the inverse of the norms, because multiplying by the inverse value is generally significantly faster than dividing.

np.argsort(tmp_matrix[i])[::-1][0:5] is not efficient either, and argpartition can be used instead so that only the 5 best items are computed (as I pointed out in a comment on the previous answer, which advised you to use argsort). Note that a partition does not behave the same way as a sort when it comes to equal items (i.e. it is not stable). There is no stable partitioning implementation available in NumPy yet.
In the end the optimized implementation should look like:
chunk_size = 1024  # e.g. somewhere between 512 and 4096
best_match = []
inv_norm_j = 1.0 / norm_by_line(svd_tfidf)  # Horizontal vector

for chunk_start in range(0, len(x_tsvd), chunk_size):
    chunk_end = min(chunk_start + chunk_size, len(x_tsvd))
    x_tsvd_block = x_tsvd[chunk_start:chunk_end]
    inv_norm_i = 1.0 / norm_by_line(x_tsvd_block)[:, None]  # Vertical vector
    tmp_matrix = (x_tsvd_block @ svd_tfidf.T) * inv_norm_i * inv_norm_j
    best_match_values = values[np.sort(np.argpartition(tmp_matrix, len(svd_tfidf) - 5)[:, -5:])[:, ::-1]]
    # Pure-Python part that can hardly be optimized
    for i in range(chunk_start, chunk_end):
        best_match.append({keys[i]: best_match_values[i - chunk_start]})
Here norm_by_line can be computed in a vectorized way (with NumPy or SciPy, for example). Note that this is an untested draft and not code that you should trust completely and copy-paste blindly ;) .
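For completeness, a minimal vectorized norm_by_line could be written as follows (a sketch; the answer above only names the helper):

import numpy as np

def norm_by_line(m):
    # L2 norm of each row; np.linalg.norm(m, axis=1) or scipy.linalg.norm
    # would work just as well.
    return np.sqrt(np.einsum('ij,ij->i', m, m))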
Regarding the recent update (which is code computing a different result), most optimizations are identical, but there is a big improvement you can make on np.sum(np.multiply(a,b),axis=1). Indeed, you can use np.einsum('j,ij->i', a, b) instead (or 'ij,ij->i' if a were a 2-D block of rows) so as not to compute the large, expensive temporary matrix. It is 3 times faster on my machine.
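A tiny demonstration of that substitution (the shapes are illustrative stand-ins for the question's data):

import numpy as np

rng = np.random.default_rng(0)
a = rng.random(260)            # one row of x_tsvd
b = rng.random((1862, 260))    # svd_tfidf

# Builds a full (1862, 260) temporary before reducing:
slow = np.sum(np.multiply(a, b), axis=1)

# Reduces on the fly, no large temporary:
fast = np.einsum('j,ij->i', a, b)

assert np.allclose(slow, fast)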

Related

Calculate every 4-element-product in a vector with python

I have a (500000, 30) NumPy array, which can be viewed as a list of 500,000 vectors of size 30. For each vector I want to choose any 4 elements, calculate their product, and store all the 4-element products. Finally I need to calculate the mean of the 500,000 results.
I have tried it with np.einsum but it runs really slow. How can I improve the efficiency?
# array.shape = (500000,30)
expect = np.sum(np.einsum('ni,nj,nk,nr->ijkr',array,array,array,array),axis=0)/500000
The sum of all possible products of 4 entries in a vector v is equal to the 4th power of the sum of entries of v.
Note that this assumes that the same vector entry can appear more than once in a product, and the sum will include products that differ only by the order of their entries (so e.g. v[1] * v[2] * v[3] * v[4] and v[4] * v[3] * v[2] * v[1] will count as different products). Since your code performs the same computations, I assume that this is what you want. In any case, the value of
np.sum(np.einsum('ni,nj,nk,nr->ijkr', array, array, array, array))
is the same as that of
(array.sum(axis=1)**4).sum()
but the latter will be computed much faster.
In your code you are taking the sum along the 0-axis of the 30x30x30x30 array produced by np.einsum, but I am not sure why.
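As a quick sanity check of the identity above, here is a small illustrative example (not from the original answer):

import numpy as np

rng = np.random.default_rng(0)
array = rng.random((100, 30))   # small stand-in for the (500000, 30) array

lhs = np.sum(np.einsum('ni,nj,nk,nr->ijkr', array, array, array, array))
rhs = (array.sum(axis=1) ** 4).sum()

assert np.isclose(lhs, rhs)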
You can compute the solution much more efficiently by factorizing the dot products in the last dimension. Moreover, you can tell Numpy to optimize the einsum (at the expense of a higher latency which is not a problem here). Here is the resulting code:
expect = np.einsum('n,nj,nk,nr->jkr',np.sum(array, axis=1),array,array,array,optimize=True)/500000
This is 63 times faster on my machine. If you want to optimize this further, then you can perform the computation in parallel using multiple threads. Indeed, the default Numpy implementation is sequential. You can use Numba to do that. I expect the computation to be about 360 times faster on my 6-core machine (since the computation is compute-bound).
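The answer stops at the suggestion to use Numba; a rough parallel sketch of the factorized computation might look like the following (untested here; expect_parallel is a hypothetical name, and the loop is parallelized over the first output axis so no two threads write to the same cell):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def expect_parallel(arr):
    n, d = arr.shape
    # Row sums, computed explicitly to stay on Numba's safe path
    s = np.zeros(n)
    for i in range(n):
        acc = 0.0
        for j in range(d):
            acc += arr[i, j]
        s[i] = acc

    out = np.zeros((d, d, d))
    for j in prange(d):
        for i in range(n):
            w = s[i] * arr[i, j]
            for k in range(d):
                wk = w * arr[i, k]
                for r in range(d):
                    out[j, k, r] += wk * arr[i, r]
    return out / n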

Time complexity of torch.topk

When n and m in n.topk(m) exceed 20 million and 200,000 respectively, the sorting becomes very slow (over 3 hours). I want to know the time complexity of torch.topk and what could be done to speed up the sorting.
topv, topi = outline.topk(beam_size) # beam_size = 200,000, outline: 1 × 20,000,000
I don't know how pytorch implements topk for CPU tensors. However, since you are working on CPU, you can use existing partial sorting implementations for numpy arrays.
For example, using bottleneck.argpartition:
import bottleneck

with torch.no_grad():
    topi = bottleneck.argpartition(outline.numpy(), kth=beam_size)
topv = outline[topi]  # allow gradients to propagate through indexing
Note that the efficient implementation of bottleneck.argpartition does not sort the array, so you are guaranteed that the top values are indeed larger than all other elements of the array, but the values in topv are not sorted, and it might be the case that topv[i] < topv[i+1] for some values of i.
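If you would rather avoid the extra dependency, NumPy's own argpartition can do the same partial selection. A sketch, using a flattened 1-D stand-in for the 1 × 20,000,000 tensor (the shapes and names are illustrative):

import numpy as np
import torch

beam_size = 200_000                      # value from the question
outline = torch.rand(20_000_000)         # stand-in for the real scores, flattened

with torch.no_grad():
    scores = outline.numpy()
    # Indices of the beam_size largest values, in no particular order
    topi = np.argpartition(scores, len(scores) - beam_size)[-beam_size:]

topi = torch.from_numpy(topi)
topv = outline[topi]                     # indexing outside no_grad keeps gradients flowing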

Is MATLAB's bsxfun the best? Python's numpy.einsum?

I have a very large multiply and sum operation that I need to implement as efficiently as possible. The best method I've found so far is bsxfun in MATLAB, where I formulate the problem as:
L = 10000;
x = rand(4,1,L+1);
A_k = rand(4,4,L);
tic
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(sum(bsxfun(@times, A_k(:,:,2:k), x(:,1,k+1-i)), 2), 3);
end
toc
Note that L will be larger in practice. Is there a faster method? It's strange that I need to first add the singleton dimension to x and then sum over it, but I can't get it to work otherwise.
It's still much faster than any other method I've tried, but not enough for our application. I've heard rumors that the Python function numpy.einsum may be more efficient, but I wanted to ask here first before I consider porting my code.
I'm using MATLAB R2017b.
I believe both of your summations can be removed, but I only removed the easier one for the time being. The summation over the second dimension is trivial, since it only affects the A_k array:
B_k = sum(A_k,2);
for k = 2:L
    i = 2:k;
    x(:,1,k+1) = x(:,1,k+1) + sum(bsxfun(@times, B_k(:,1,2:k), x(:,1,k+1-i)), 3);
end
With this single change the runtime is reduced from ~8 seconds to ~2.5 seconds on my laptop.
The second summation could also be removed, by transforming times+sum into a matrix-vector product. It needs some singleton fiddling to get the dimensions right, but if you define an auxiliary array that is B_k with the second dimension reversed, you can generate the remaining sum as ~x*C_k with this auxiliary array C_k, give or take a few calls to reshape.
So after a closer look I realized that my original assessment was overly optimistic: you have multiplications in both dimensions in your remaining term, so it's not a simple matrix product. Anyway, we can rewrite that term to be the diagonal of a matrix product. This implies that we're computing a bunch of unnecessary matrix elements, but this still seems to be slightly faster than the bsxfun approach, and we can get rid of your pesky singleton dimension too:
L = 10000;
x = rand(4,L+1);
A_k = rand(4,4,L);
B_k = squeeze(sum(A_k,2)).';
tic
for k = 2:L
    ii = 1:k-1;
    x(:,k+1) = x(:,k+1) + diag(x(:,ii)*B_k(k+1-ii,:));
end
toc
This runs in ~2.2 seconds on my laptop, somewhat faster than the ~2.5 seconds obtained previously.
Since you're using a new version of MATLAB, you might try broadcasting / implicit expansion instead of bsxfun:
x(:,1,k+1) = x(:,1,k+1)+sum(sum(A_k(:,:,2:k).*x(:,1,k-1:-1:1),3),2);
I also changed the order of summation and removed the i variable for further improvement. On my machine, and with Matlab R2017b, this was about 25% faster for L = 10000.
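For anyone curious about the NumPy side mentioned in the question, a rough 0-based port of the B_k-based loop might look like this (an untested sketch with random stand-in data; it mirrors the element-wise structure of the recurrence rather than a true matrix product):

import numpy as np

L = 10000
rng = np.random.default_rng(0)
x = rng.random((4, L + 1))
A_k = rng.random((4, 4, L))

# Pre-sum A_k over its column dimension once, as in the B_k trick above
B = A_k.sum(axis=1)                        # B[r, m] = sum_c A_k[r, c, m]

for k in range(2, L + 1):                  # k keeps its MATLAB value
    i = np.arange(2, k + 1)                # MATLAB i = 2:k
    # MATLAB x(:,k+1) += sum_i B(:,i) .* x(:,k+1-i), shifted to 0-based indexing
    x[:, k] += np.einsum('rj,rj->r', B[:, i - 1], x[:, k - i])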

Fast way to obtain a random index from an array of weights in python

I regularly find myself in the position of needing a random index into an array or a list, where the probabilities of the indices are not uniformly distributed but follow certain positive weights. What's a fast way to obtain them? I know I can pass weights to numpy.random.choice as the optional argument p, but the function seems quite slow, and building an arange to pass to it is not ideal either. The sum of the weights can be an arbitrary positive number and is not guaranteed to be 1, which rules out the approach of generating a random number in (0,1] and then subtracting weight entries until the result is 0 or less.
While there are answers on how to implement similar things (mostly not about obtaining the array index, but the corresponding element) in a simple manner, such as Weighted choice short and simple, I'm looking for a fast solution, because the appropriate function is executed very often. My weights change frequently, so the overhead of building something like an alias mask (a detailed introduction can be found on http://www.keithschwarz.com/darts-dice-coins/) should be considered part of the calculation time.
Cumulative summing and bisect
In any generic case, it seems advisable to calculate the cumulative sum of the weights, and use bisect from the bisect module to find a random point in the resulting sorted array if speed is a concern:
import bisect
import numpy

def weighted_choice(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
A more detailed analysis is given below.
Note: If the array is not flat, numpy.unravel_index can be used to transform a flat index into a shaped index, as seen in https://stackoverflow.com/a/19760118/1274613
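For example, with a hypothetical 2D weight array (a small sketch, not part of the original answer):

import bisect
import numpy

def weighted_choice(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])

# Pick a flat index from a 2D weight array, then recover (row, col)
w = numpy.array([[0.5, 1.0], [2.0, 0.5]])
flat = weighted_choice(w.ravel())
row, col = numpy.unravel_index(flat, w.shape)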
Experimental Analysis
There are four more or less obvious solutions using numpy builtin functions. Comparing all of them using timeit gives the following result:
import timeit

weighted_choice_functions = [
"""import numpy
wc = lambda weights: numpy.random.choice(
    range(len(weights)),
    p=weights/weights.sum())
""",
"""import numpy
# Adapted from https://stackoverflow.com/a/19760118/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return cs.searchsorted(numpy.random.random() * cs[-1], 'right')
""",
"""import numpy, bisect
# Using bisect mentioned in https://stackoverflow.com/a/13052108/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
""",
"""import numpy
wc = lambda weights: numpy.random.multinomial(
    1,
    weights/weights.sum()).argmax()
"""]

for setup in weighted_choice_functions:
    for ps in ["numpy.ones(40)",
               "numpy.arange(10)",
               "numpy.arange(200)",
               "numpy.arange(199,-1,-1)",
               "numpy.arange(4000)"]:
        print(timeit.timeit("wc(%s)" % ps, setup=setup))
    print()
The resulting output is
178.45797914802097
161.72161589498864
223.53492237901082
224.80936180002755
1901.6298267539823
15.197789980040397
19.985687876993325
20.795070077001583
20.919113760988694
41.6509403079981
14.240949985047337
17.335801470966544
19.433710905024782
19.52205040602712
35.60536142199999
26.6195822560112
20.501282756973524
31.271995796996634
27.20013752405066
243.09768892999273
This means that numpy.random.choice is surprisingly very slow, and even the dedicated numpy searchsorted method is slower than the type-naive bisect variant. (These results were obtained using Python 3.3.5 with numpy 1.8.1, so things may be different for other versions.) The function based on numpy.random.multinomial is less efficient for large weights than the methods based on cumulative summing. Presumably the fact that argmax has to iterate over the whole array and run comparisons each step plays a significant role, as can be seen as well from the four second difference between an increasing and a decreasing weight list.

Large matrix multiplication in Python - what is the best option?

I have two boolean sparse square matrices of about 80,000 x 80,000, generated from 12 MB of data (and I am likely to have matrices that are orders of magnitude larger when I use GBs of data).
I want to multiply them (which produces a triangular matrix - however I don't get this since I don't limit the dot product to yield a triangular matrix).
I am wondering what the best way of multiplying them is (memory-wise and speed-wise) - I am going to do the computation on an m2.4xlarge AWS instance which has >60 GB of RAM. I would prefer to keep the calculation in RAM for speed reasons.
I appreciate that SciPy has sparse matrices, and so does h5py, but I have no experience with either.
What's the best option to go for?
Thanks in advance
UPDATE: sparsity of the boolean matrices is <0.6%
If your matrices are relatively empty it might be worthwhile encoding them as a data structure of the non-False values. Say a list of tuples describing the location of the non-False values. Or a dictionary with the tuples as the keys.
If you use e.g. a list of tuples you could use a list comprehension to find the items in the second list that can be multiplied with an element from the first list.
a = [(0,0), (3,7), (5,2)]  # et cetera
b = ...  # idem
for r, c in a:
    res = [(r, k) for j, k in b if j == c]
You're asking how to multiply matrices fast and easy.
SOLUTION 1: This is a solved problem: use numpy. All these operations are easy in numpy, and since they are implemented in C, are rather blazingly fast.
http://www.numpy.org/
http://www.scipy.org
also see:
Very large matrices using Python and NumPy
http://docs.scipy.org/doc/scipy/reference/sparse.html
SciPy and Numpy have sparse matrices and matrix multiplication. It doesn't use much memory since (at least if I wrote it in C) it probably uses linked lists, and thus will only use the memory required for the sum of the datapoints, plus some overhead. And, it will almost certainly be blazingly fast compared to pure python solution.
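As a sketch of the SciPy route (the sizes here are shrunk so it runs quickly, and the density mirrors the <0.6% sparsity from the question; int8 storage is just a convenient stand-in for booleans):

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 1000
a = sparse.csr_matrix((rng.random((n, n)) < 0.006).astype(np.int8))
b = sparse.csr_matrix((rng.random((n, n)) < 0.006).astype(np.int8))

c = (a @ b) > 0      # boolean product: True where some (i,k) and (k,j) pair matches
print(c.nnz, "nonzero entries")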
SOLUTION 2
Another answer here suggests storing values as tuples of (x, y), presuming the value is False unless it exists, then it's True. An alternative to this is a numeric matrix with (x, y, value) tuples.
REGARDLESS: Multiplying these would be nasty time-wise: find element one, decide which other array element to multiply it by, then search the entire dataset for that specific tuple and, if it exists, multiply and insert the result into the result matrix.
SOLUTION 3 ( PREFERRED vs. Solution 2, IMHO )
I would prefer this because it's simpler / faster.
Represent your sparse matrix with a set of dictionaries. Matrix one is a dict with the element at (x, y) and value v being (with x1,y1, x2,y2, etc.):
matrixDictOne = { 'x1:y1' : v1, 'x2:y2': v2, ... }
matrixDictTwo = { 'x1:y1' : v1, 'x2:y2': v2, ... }
Since a Python dict lookup is O(1) on average, it's fast. This does not require searching the entire second matrix's data for element presence before multiplication. So, it's fast. It's easy to write the multiply (see the sketch below) and easy to understand the representations.
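A minimal sketch of what that multiply could look like, using (row, col) tuple keys instead of the 'x:y' string keys above (the helper name is illustrative):

def sparse_bool_matmul(a, b):
    """a, b: dicts mapping (row, col) -> True for the nonzero entries."""
    # Index b by row so each lookup is O(1) on average
    b_by_row = {}
    for (r, c) in b:
        b_by_row.setdefault(r, []).append(c)

    result = {}
    for (i, k) in a:
        for j in b_by_row.get(k, ()):
            result[(i, j)] = True   # boolean product: any matching pair sets the entry
    return result

# Example: a 2x2 identity times a matrix with entries at (0,1) and (1,0)
a = {(0, 0): True, (1, 1): True}
b = {(0, 1): True, (1, 0): True}
print(sparse_bool_matmul(a, b))     # {(0, 1): True, (1, 0): True}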
SOLUTION 4 (if you are a glutton for punishment)
Code this solution by using a memory-mapped file of the required size. Initialize a file with null values of the required size. Compute the offsets yourself and write to the appropriate locations in the file as you do the multiplication. Linux has a VMM which will page in and out for you with little overhead or work on your part. This is a solution for very, very large matrices that are NOT SPARSE and thus won't fit in memory.
Note this addresses the objection raised in the comments that the matrices won't fit in memory. However, the OP did say sparse, which implies very few actual datapoints spread out in giant arrays, and NumPy / SciPy handle this natively and thus nicely (lots of people at Fermilab use NumPy / SciPy regularly, so I'm confident the sparse matrix code is well tested).
