Numpy matmul, treat each row in matrix as individual row vectors - python

I have a code below:
import numpy as np
wtsarray # shape(5000000,21)
covmat # shape(21,21)
portvol = np.zeros(shape=(wtsarray.shape[0],))
for i in range(0, wtsarray.shape[0]):
    portvol[i] = np.sqrt(np.dot(wtsarray[i].T, np.dot(covmat, wtsarray[i]))) * np.sqrt(mtx)
Nothing wrong with the above code, except that there's 5 million rows of row vector, and the for loop can be a little slow, I was wondering if you guys know of a way to vectorise it, so far I have tried with little success.
Or if there is any way to treat each individual row in a numpy matrix as a row vector and perform the above operation?
Thanks, if there are any suggestions on rephrasing my questions, please let me know as well.

portvol = np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1)) * np.sqrt(mtx)
should give you what you want. It replaces the first np.dot with elementwise multiplication followed by summation and it replaces the second np.dot(covmat, wtsarray[i]) with matrix multiplication, wtsarray @ covmat.T.
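For reference, here is a minimal runnable sketch of the same idea dropped into the question's variable names (small random stand-ins for the data; mtx is not defined in the question, so it is assumed here to be a scalar scaling factor):
import numpy as np

rng = np.random.default_rng(0)
wtsarray = rng.random((1000, 21))     # stand-in for the real (5000000, 21) weight rows
covmat = rng.random((21, 21))
covmat = covmat @ covmat.T            # make the stand-in symmetric, like a covariance matrix
mtx = 252                             # assumed scalar factor; not given in the question

# one quadratic form per row, no Python loop
portvol = np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1)) * np.sqrt(mtx)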

For smaller sample arrays:
In [24]: wtsarray = np.arange(15).reshape((5,3)); covmat=np.arange(9).reshape((3,3))
In [25]: portvol = np.zeros((5))
In [26]: for i in range(0, wtsarray.shape[0]):
...:     portvol[i] = np.sqrt(np.dot(wtsarray[i], np.dot(covmat, wtsarray[i])))
...:
In [27]: portvol
Out[27]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
@ogdenkev's solution:
In [28]: np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1))
Out[28]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
In [30]: timeit np.sqrt(np.sum(wtsarray * (wtsarray @ covmat.T), axis=1))
20.4 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Same thing using einsum:
In [29]: np.sqrt(np.einsum('ij,jk,ik->i',wtsarray,covmat,wtsarray))
Out[29]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
In [31]: timeit np.sqrt(np.einsum('ij,jk,ik->i',wtsarray,covmat,wtsarray))
12.9 µs ± 24.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
A matmul version is in the works
In [35]: np.sqrt(np.squeeze(wtsarray[:,None,:]@covmat@wtsarray[:,:,None]))
Out[35]: array([ 7.74596669, 25.92296279, 43.95452195, 61.96773354, 79.97499609])
In [36]: timeit np.sqrt(np.squeeze(wtsarray[:,None,:]@covmat@wtsarray[:,:,None]))
13.5 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Related

Faster Indexing of 3D NumPy Array From np.argpartition Index

I have a large 3D NumPy array:
x = np.random.rand(1_000_000_000).reshape(500, 1000, 2000)
And for each of the 500 2D arrays, I need to keep only the largest 800 elements within each column of each 2D array. To avoid costly sorting, I decided to use np.argpartition:
k = 800
idx = np.argpartition(x, -k, axis=1)[:, -k:]
result = x[np.arange(x.shape[0])[:, None, None], idx, np.arange(x.shape[2])]
While np.argpartition is reasonably fast, using idx to index back into x is really slow. Is there a faster (and memory efficient) way to perform this indexing?
Note that the results do not need to be in ascending sorted order. They just need to be the top 800.
Cutting the size by 10 to fit in my memory, here are times for the various steps:
Creation:
In [65]: timeit x = np.random.rand(1_000_000_00).reshape(500, 1000, 200)
1.89 s ± 82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [66]: x = np.random.rand(1_000_000_00).reshape(500, 1000, 200)
In [67]: k=800
the argpartition step:
In [68]: idx = np.argpartition(x, -k, axis=1)[:, -k:]
In [69]: timeit idx = np.argpartition(x, -k, axis=1)[:, -k:]
2.52 s ± 292 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the indexing:
In [70]: result = x[np.arange(x.shape[0])[:, None, None], idx, np.arange(x.shape[2])]
In [71]: timeit result = x[np.arange(x.shape[0])[:, None, None], idx, np.arange(x.shape[2])]
The slowest run took 4.11 times longer than the fastest. This could mean that an intermediate result is being cached.
2.6 s ± 1.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
All three steps take about the same time. I don't see anything unusual about the last indexing step; the array is about 0.8 GB.
A simple copy, without indexing, takes nearly 1 second:
In [75]: timeit x.copy()
980 ms ± 231 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and a full copy with advanced indexing:
In [77]: timeit x[np.arange(x.shape[0])[:, None, None], np.arange(x.shape[1])[:,
...: None], np.arange(x.shape[2])]
1.47 s ± 37.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
trying the idx again:
In [78]: timeit result = x[np.arange(x.shape[0])[:, None, None], idx, np.arange(x.shape[2])]
1.71 s ± 42.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Keep in mind that when the operations start using nearly all the memory, and/or start requiring swapping and special memory requests to the OS, timings can really go to pot.
edit
You don't need the two-step process. Just use partition:
out = np.partition(x, -k, axis=1)[:, -k:]
This is the same as result, and takes the same time as the idx step.
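A quick sanity check on a small array (a sketch, not from the original answer) that np.partition keeps the same top-k values per column as the two-step argpartition indexing:
import numpy as np

x = np.random.rand(5, 100, 20)    # small stand-in for the (500, 1000, 2000) array
k = 30

# two-step version: argpartition, then advanced indexing
idx = np.argpartition(x, -k, axis=1)[:, -k:]
result = x[np.arange(x.shape[0])[:, None, None], idx, np.arange(x.shape[2])]

# one-step version
out = np.partition(x, -k, axis=1)[:, -k:]

# neither result is ordered along axis 1, so sort before comparing
assert np.allclose(np.sort(result, axis=1), np.sort(out, axis=1))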

Why is SymPy's integrate function much slower when doing a definite integral than an approximation?

Consider
f = lambda x : 1/x
and I want to get its definite integral between 2 and 7.
The first method is using a linspace and evaluating a Riemann Sum over 10^4 terms.
l = list(np.linspace(2,7,10**4))
s = 0
for i in l:
    s += f(i)*(l[1]-l[0])
The second method is using SymPy's integrate function and evaluating it.
x = sp.symbols('x')
t = sp.integrate(f(x),(x,2,7)).evalf()
The output gives us :
Riemann Sum : 1.2529237036377492
--- 13.025045394897461 milliseconds ---
SymPy : 1.25276296849537
--- 71.07734680175781 milliseconds ---
Delta : 0.0128304512843464 %
My question is: why is SymPy around 4 to 5 times slower than a Riemann sum for a delta < 0.1%, and is there any way to improve either of the two methods?
sympy is a symbolic/algebraic package, manipulating complex "symbol/expression" objects.
In an isympy session:
In [7]: f = lambda x : 1/x
In [8]: integrate(f(x),(x,2,7)).evalf()
Out[8]: 1.25276296849537
In [9]: integrate(f(x),(x,2,7))
Out[9]: -log(2) + log(7)
In [10]: timeit integrate(f(x),(x,2,7))
10.6 ms ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: timeit integrate(f(x),(x,2,7)).evalf()
10.8 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The bulk of the time is spent in the symbolic part, with the final numeric evaluation being relatively fast.
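You can see that directly by timing the two stages separately (a sketch, continuing the same isympy session where f and x are defined):
expr = integrate(f(x), (x, 2, 7))    # the symbolic work; returns -log(2) + log(7)
timeit expr.evalf()                  # numerically evaluating the cached expression is the cheap part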
Your iterative numeric solution:
In [45]: f = lambda x : 1/x
In [46]: %%timeit
...: s = 0
...: for i in l:
...:     s += f(i)*(l[1]-l[0])
...:
5.91 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But using numpy we can do that a lot faster:
In [47]: (f(np.array(l))*(l[1]-l[0])).sum()
Out[47]: 1.2529237036377558
In [48]: timeit (f(np.array(l))*(l[1]-l[0])).sum()
631 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and even better if the input is already an array (your linspace without the list(...) wrapper):
In [49]: %%timeit larr=np.array(l)
...: (f(larr)*(l[1]-l[0])).sum()
61.2 µs ± 735 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
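As an aside (not part of the original answer), the same samples can also be fed to np.trapz, which applies the trapezoidal rule in one vectorized call and is more accurate than the left Riemann sum:
larr = np.array(l)
np.trapz(f(larr), larr)    # trapezoidal rule over the same 10**4 samples; close to -log(2) + log(7)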
scipy has a bunch of integration functions, most of which use compiled libraries like QUADPACK. A basic one is quad:
In [50]: from scipy.integrate import quad
In [52]: quad(f,2,7)
Out[52]: (1.2527629684953678, 3.2979213205748694e-12)
In [53]: timeit quad(f,2,7)
7.22 µs ± 57.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
According to the full_output display, quad only has to call f() 21 times, rather than the 10**4 times your iteration does.
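To see that count yourself, pass full_output=1 and read the 'neval' entry of the info dict that quad returns (a sketch):
y, abserr, info = quad(f, 2, 7, full_output=1)
info['neval']    # number of calls to f, 21 for this integrand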

How to speed up multiply and sum operations in numpy [duplicate]

This question already has an answer here:
Dot product between 2D and 3D numpy arrays
(1 answer)
Closed 2 years ago.
I need to solve a Finite Element Method problem and have to calculate the following C from A and B with a large M (M>1M). For example,
import numpy as np
M=4000000
A=np.random.rand(4, M, 3)
B=np.random.rand(M,3)
C = (A * B).sum(axis = -1) # need to be optimized
Could anyone come up with code that is faster than (A * B).sum(axis = -1)? You can reshape or re-arrange the axes of A, B, and C freely.
You can use np.einsum for a slightly more efficient approach, both in performance and memory usage:
M=40000
A=np.random.rand(4, M, 3)
B=np.random.rand(M,3)
out = (A * B).sum(axis = -1) # need to be optimized
%timeit (A * B).sum(axis = -1) # need to be optimized
# 5.23 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.einsum('ijk,jk->ij', A, B)
# 1.31 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.allclose(out, np.einsum('ijk,jk->ij', A, B))
# True
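For reference, a matmul formulation along the lines of the linked duplicate ("Dot product between 2D and 3D numpy arrays") gives the same result; this is a sketch, not part of the original answer:
# view each length-3 row as a (1,3) @ (3,1) product, then drop the trailing singleton axes
out_mm = np.squeeze(A[:, :, None, :] @ B[None, :, :, None], axis=(-2, -1))
np.allclose(out, out_mm)
# True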
To speed up numpy multiplication in general, one possible approach is using ctypes. However, as far as I know, this approach will probably give you only limited performance improvements (if any).
You could use NumExpr like this for a 3x speedup:
import numpy as np
import numexpr as ne
M=40000
A=np.random.rand(4, M, 3)
B=np.random.rand(M,3)
%timeit out = (A * B).sum(axis = -1)
2.12 ms ± 57.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit me = ne.evaluate('sum(A*B,2)')
662 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
out = (A * B).sum(axis = -1)
me = ne.evaluate('sum(A*B,2)')
np.allclose(out,me)
Out[29]: True

numpy - einsum vs naive implementation runtime performance

I have a two dimensional array Y of size (N,M), say for instance:
N, M = 200, 100
Y = np.random.normal(0,1,(N,M))
For each of the N rows, I want to compute the dot product of the (M,1) vector with its transpose, which gives an (M,M) matrix. One way to do it inefficiently is:
Y = Y[:,:,np.newaxis]
[Y[i,:,:] @ Y[i,:,:].T for i in range(N)]
which is quite slow: timeit on the second line returns
11.7 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I thought a much better way to do it is the use the einsum numpy function (https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html):
np.einsum('ijk,imk->ijm', Y, Y, optimize=True)
(which means: for each row i, create a (j,m) matrix whose elements result from the dot product over the last dimension k)
The two methods return exactly the same result, but the runtime of this new version is disappointing (only a bit more than twice the speed):
3.82 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
One would expect much more improvement from using the vectorized einsum function, since the first method is very inefficient... Do you have an explanation for this? Does there exist a better way to do this calculation?
In [60]: N, M = 200, 100
...: Y = np.random.normal(0,1,(N,M))
In [61]: Y1 = Y[:,:,None]
Your iteration, 200 steps to produce (100,100) arrays:
In [62]: timeit [Y1[i,:,:]@Y1[i,:,:].T for i in range(N)]
18.5 ms ± 784 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
einsum is only modestly faster:
In [64]: timeit np.einsum('ijk,imk->ijm', Y1,Y1)
14.5 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but you could apply the @ in full 'batch' mode with:
In [65]: timeit Y[:,:,None]@Y[:,None,:]
7.63 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
But as Divakar notes, the sum axis is size 1, so you could use plain broadcasted multiply. This is an outer product, not a matrix one.
In [66]: timeit Y[:,:,None]*Y[:,None,:]
8.2 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
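A quick consistency check (a sketch) that the batched @ and the broadcasted * match the loop:
loop_res = np.stack([Y1[i,:,:] @ Y1[i,:,:].T for i in range(N)])
np.allclose(loop_res, Y[:,:,None] @ Y[:,None,:])    # True
np.allclose(loop_res, Y[:,:,None] * Y[:,None,:])    # True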
'vectorizing' gives big gains when doing many iterations on a simple operation. For fewer operations on a more complex operation, the gain isn't as great.
This is an old post, yet it covers the subject in much detail: efficient outer product.
In particular, if you are willing to add a numba dependency, that may be your fastest option.
Updating part of the numba code from the original post and adding the multi outer product:
import numpy as np
from numba import jit
from numba.typed import List
@jit(nopython=True)
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    result = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i]*b[j]
    return result

@jit(nopython=True)
def multi_outer_numba(Y):
    all_result = List()
    for k in range(Y.shape[0]):
        y = Y[k]
        n = y.shape[0]
        tmp_res = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                tmp_res[i, j] = y[i]*y[j]
        all_result.append(tmp_res)
    return all_result
r = [outer_numba(Y[i],Y[i]) for i in range(N)]
r = multi_outer_numba(Y)
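To verify the numba versions against einsum (a sketch; Y is the (N, M) array from the question):
ref = np.einsum('ij,im->ijm', Y, Y)    # batch of outer products, shape (N, M, M)
np.allclose(ref, np.stack(list(r)))    # multi_outer_numba returns a typed List of (M, M) arrays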

Performance of numpy.insert dependent on array size - workaround?

Using the following code, I get the impression that inserting into a numpy array depends on the array size.
Are there any numpy-based workarounds for this performance limit (or non-numpy-based ones)?
if True:
    import numpy as np
    import datetime
    import timeit
    myArray = np.empty((0, 2), dtype='object')
    myString = "myArray = np.insert(myArray, myArray.shape[0], [[ds, runner]], axis=0)"
    runner = 1
    ds = datetime.datetime.utcfromtimestamp(runner)
    %timeit myString
19.3 ns ± 0.715 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    for runner in range(30_000):
        ds = datetime.datetime.utcfromtimestamp(runner)
        myArray = np.insert(myArray, myArray.shape[0], [[ds, runner]], axis=0)
    print("len(myArray):", len(myArray))
    %timeit myString
len(myArray): 30000
38.1 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
This has to do with the way numpy works. For each insert operation, it copies the whole array and stores it in a new place. I would recommend using list append and then converting to a numpy array at the end. This may be a duplicate of this question.
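A minimal sketch of that list-based approach applied to the question's (datetime, counter) rows (an assumed drop-in, not code from the original answer):
import datetime
import numpy as np

rows = []
for runner in range(30_000):
    ds = datetime.datetime.utcfromtimestamp(runner)
    rows.append([ds, runner])                 # O(1) amortized append to a Python list
myArray = np.array(rows, dtype='object')      # single conversion at the end, shape (30000, 2)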
Your approach:
In [18]: arr = np.array([])
In [19]: for i in range(1000):
...:     arr = np.insert(arr, arr.shape[0],[1,2,3])
...:
In [20]: arr.shape
Out[20]: (3000,)
In [21]: %%timeit
...: arr = np.array([])
...: for i in range(1000):
...:     arr = np.insert(arr, arr.shape[0],[1,2,3])
...:
31.9 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Compare that with concatenate:
In [22]: %%timeit
...: arr = np.array([])
...: for i in range(1000):
...:     arr = np.concatenate((arr, [1,2,3]))
...:
5.49 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and with a list extend:
In [23]: %%timeit
...: alist = []
...: for i in range(1000):
...:     alist.extend([1,2,3])
...: arr = np.array(alist)
384 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We discourage the use of concatenate (or np.append) because it is slow, and can be hard to initialize. List append, or extend, is faster. Your use of insert is even worse than concatenate.
concatenate makes a whole new array each time. insert does so as well, but because it's designed to put the new values anywhere in the original, it is much more complicated, and hence slower. Look at its code if you don't believe me.
Lists are designed for growth; new items are added via a simple object (pointer) insertion into a buffer that has growth room. That is, the growth occurs in-place.
Insertion into a preallocated array is also pretty fast:
In [27]: %%timeit
...: arr = np.zeros((1000,3),int)
...: for i in range(1000):
...:     arr[i,:] = [1,2,3]
...: arr = arr.ravel()
1.69 ms ± 9.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
