I am working on a project in Python. For various reasons, I have to call MATLAB for some of the calculations.
ubuntu 14.04 64bit
python 2.7.6
numpy 1.11.1
matlab 2016a linux-64bit
import matlab
import matlab.engine
import numpy as np
import time
data = np.random.rand(1000, 100, 100)
print ('pass begin')
st = time.time()
data_matlab = matlab.double(data.tolist())
print ('pass numpy to matlab finished in {:.2f} sec'.format(time.time() - st))
Passing a float64 NumPy array of shape (1000, 100, 100) to a MATLAB array takes 63.49 seconds, which is unacceptable. Is there an efficient way to pass a large NumPy array to a MATLAB array from Python?
pass begin
pass numpy to matlab finished in 63.49 sec
Starting with MATLAB R2022a, this operation is at least an order of magnitude faster than in previous releases. When I run the code sample above on a Windows 10 machine, the operation takes consistently less than 2 seconds now as opposed to the more than 63 seconds reported in the original question. See release notes under "Performance"/"MATLAB Engine API for Python: Improved performance with large multidimensional arrays in Python".
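For older releases, one workaround that avoids the slow `data.tolist()` conversion is to hand the data over through a raw binary file. The sketch below is an assumption on my part, not something from the answer above: the helper name `dump_for_matlab` is hypothetical, and the MATLAB side is shown only as a comment and was not run here. The array is written in column-major order because that is how MATLAB lays out arrays in memory.

```python
import os
import tempfile
import numpy as np

def dump_for_matlab(array, path):
    """Write `array` as raw little-endian float64 in column-major order."""
    flat = np.asarray(array, dtype="<f8").ravel(order="F")
    flat.tofile(path)
    return array.shape

data = np.random.rand(20, 10, 10)  # small stand-in for the 1000x100x100 array
path = os.path.join(tempfile.mkdtemp(), "data.bin")
shape = dump_for_matlab(data, path)

# Hypothetical MATLAB side (not executed here):
#   fid = fopen(path, 'r', 'ieee-le');
#   A = reshape(fread(fid, inf, 'double'), [20 10 10]);
#   fclose(fid);

# Round-trip check in NumPy, mimicking MATLAB's column-major reshape.
restored = np.fromfile(path, dtype="<f8").reshape(shape, order="F")
```

Whether this beats `matlab.double` on a given setup depends on disk speed and array size, so it is worth timing both.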
I have a loop in which I'm calculating several pseudoinverses of rather large, non-sparse matrices (e.g. 20000x800).
As my code spends most of its time in pinv, I was trying to find a way to speed up the computation. I'm already using multiprocessing (joblib/loky) to run with several processes, but that of course also increases overhead. Using jit did not help much.
Is there a faster way / better implementation to compute pseudoinverse using any function? Precision isn't key.
My current benchmark
import time
import numba
import numpy as np
from numpy.linalg import pinv as np_pinv
from scipy.linalg import pinv as scipy_pinv
from scipy.linalg import pinv2 as scipy_pinv2
@numba.njit
def np_jit_pinv(A):
    return np_pinv(A)
matrix = np.random.rand(20000, 800)
for pinv in [np_pinv, scipy_pinv, scipy_pinv2, np_jit_pinv]:
    start = time.time()
    pinv(matrix)
    print(f'{pinv.__module__ +"."+pinv.__name__} took {time.time()-start:.3f}')
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
EDIT:
JAX seems to be 30% faster! Impressive! Thanks for letting me know, @yuri-brigance. On Windows it works well under WSL.
numpy.linalg.pinv took 2.774
scipy.linalg.basic.pinv took 1.906
scipy.linalg.basic.pinv2 took 1.682
__main__.np_jit_pinv took 2.446
jax._src.numpy.linalg.pinv took 0.995
Try with JAX:
import jax.numpy as jnp
jnp.linalg.pinv(A)
Seems to be slightly faster than regular numpy.linalg.pinv. On my machine your benchmark looks like this:
jax._src.numpy.linalg.pinv took 3.127
numpy.linalg.pinv took 4.284
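Since the question says precision isn't key, another option worth trying is to skip the SVD entirely. For a tall matrix with full column rank, the pseudoinverse equals (AᵀA)⁻¹Aᵀ, which only needs a solve against a small n x n matrix. This is a minimal sketch under that full-rank assumption; the function name `pinv_normal_equations` is made up for illustration, and the approach squares the condition number, so it loses accuracy on ill-conditioned inputs.

```python
import numpy as np

def pinv_normal_equations(A):
    """Pseudoinverse of a tall, full-column-rank matrix via (A^T A)^-1 A^T."""
    gram = A.T @ A                      # small (n x n) Gram matrix
    return np.linalg.solve(gram, A.T)   # solves gram @ P = A^T for P

rng = np.random.default_rng(0)
A = rng.random((2000, 80))              # scaled-down stand-in for 20000x800
P = pinv_normal_equations(A)

# Compare against the SVD-based reference on this well-conditioned example.
max_err = np.abs(P - np.linalg.pinv(A)).max()
print(f"max abs difference to np.linalg.pinv: {max_err:.2e}")
```

If the matrices can be rank-deficient, this shortcut does not apply and the SVD-based routines remain the safe choice.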
I was randomly comparing the computation times of an explicit for-loop with vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. For-loop took about 646ms while the np.exp() function computed the same result in less than 20ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using an explicit for-loop and thus the computation time is justified. What is happening "behind the scenes" in the second case?
How can one implement such computations (second case) without using numpy (in plain Python)?
What is happening is that NumPy is calling compiled, highly optimized numerical routines (BLAS, for instance, for linear algebra) which are very good at vector arithmetic.
I imagine you could specifically call the exact libraries used by NumPy, however, NumPy would likely know best which to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Pure Python code can't directly use the SIMD (single instruction, multiple data) instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a vector of data at once at the hardware level.
NumPy on the other hand has functions built in C, and C is a language capable of running SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
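To make the comparison concrete, here is a small sketch of the two approaches side by side. The plain-Python version (a list comprehension over `math.exp`) is about as fast as the interpreter allows without compiled code; the NumPy version makes one call and runs the loop in C. The names `exp_loop` and `exp_vectorized` are illustrative, and actual timings will vary by machine.

```python
import math
import timeit
import numpy as np

values = np.random.randn(100_000)
as_list = values.tolist()

def exp_loop(xs):
    """One interpreted call to math.exp per element."""
    return [math.exp(x) for x in xs]

def exp_vectorized(arr):
    """A single call; the element loop runs inside NumPy's compiled code."""
    return np.exp(arr)

loop_result = exp_loop(as_list)
vec_result = exp_vectorized(values)

t_loop = timeit.timeit(lambda: exp_loop(as_list), number=5)
t_vec = timeit.timeit(lambda: exp_vectorized(values), number=5)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

Both functions produce the same values; only where the loop executes differs.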
I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh Anaconda installation I created an MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article, it seems that MKL has internal thresholds to decide whether to use threading or not. One of these thresholds seems to be the vector size used in the calculations (kind of obvious). This threshold is set to a vector size of 8192 (at least on my machine). When vectors exceed this size, I can observe my Python scripts using 4 threads (I have 2 cores with hyper-threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Besides the threading part, MKL "Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems I'm usually working on do not exceed the vector size threshold, I'm not interested in the performance increase obtained by threading, but rather in the optimized math functions of MKL. Unfortunately, it seems those are only used when the vector size is above the threshold.
I've written a sample code to measure the performance of the sine operation on vectors with different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)
def func(x):
    return np.sin(x)
def measure(x):
    t1 = timer()
    for i in range(0, Nop):
        func(x)
    t2 = timer()
    diff = (t2 - t1)*1000.0
    print("vec size: %i:" % len(x), end="")
    print("\t time needed: %f ms" % diff)
x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure that the increase in performance is not due to threading (I also checked the CPU usage; it is indeed only using one thread).
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the 8193-vector runs roughly 20x faster than the 8192-vector. What is even more confusing is that the second run of the 8192-vector is about 4x faster than the first, after doing the calculation on the bigger vector.
Now my questions:
Am I doing anything obviously wrong, which I am not aware of, which leads to these results?
Can anyone reproduce these results, or is it just my installation/my machine behaving like this?
Is the increase in performance really due to the optimized implementation of sine?
Is it possible to enforce always using the optimized version of sine independent of the vector size?
PS:
I actually tried the following in the simulation I'm running for my master's thesis, which involves a lot of sine and cosine function calls:
I just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
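Because the order of the measurements clearly matters here, a timing harness that warms up first and takes the best of several repeats can help separate one-time setup effects from steady-state speed. The sketch below uses only NumPy and `timeit`; it does not touch the `mkl` module, and the helper name `time_sin` and the repeat counts are arbitrary choices for illustration.

```python
import timeit
import numpy as np

def time_sin(size, repeats=5, calls=200):
    """Best-of-`repeats` time in seconds for `calls` evaluations of np.sin."""
    x = np.random.rand(size)
    np.sin(x)  # warm-up call so one-time setup cost is not measured
    timings = timeit.repeat(lambda: np.sin(x), repeat=repeats, number=calls)
    return min(timings)

results = {size: time_sin(size) for size in (8192, 8193)}
for size, seconds in results.items():
    print(f"vec size {size}: {seconds * 1000:.3f} ms")
```

If the gap between the two sizes persists with this harness, it is less likely to be an artifact of measurement order.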
I've been trying to optimize my computations, and for most operations that I've tried, tensorflow is much faster. I'm trying to do a fairly simple operation: transform a matrix (multiply each value by 1/2 and then add 1/2 to that value).
With the help of @mrry, I was able to do these operations in tensorflow. However, to my surprise, the numpy method was significantly faster?!
tensorflow seems like an extremely useful tool for data scientists and I think this could help clarify its use and advantages.
Am I not using tensorflow data structures and operations in the most efficient way? I'm not sure how non-tensorflow methods would be faster. I'm using a Mid-2012 Macbook Air 4GB RAM
trans1 is the tensorflow version while trans2 is numpy. DF_var is a pandas dataframe object
import pandas as pd
import tensorflow as tf
import numpy as np
def trans1(DF_var):
    #Total user time is 31.8532807827 seconds
    #Create placeholder
    T_feed = tf.placeholder(tf.float32,DF_var.shape)
    #Matrix transformation
    T_signed = tf.add(
        tf.constant(0.5,dtype=tf.float32),
        tf.mul(T_feed,tf.constant(0.5,dtype=tf.float32))
    )
    #Get rid of the top triangle
    T_ones = tf.constant(np.tril(np.ones(DF_var.shape)),dtype=tf.float32)
    T_tril = tf.mul(T_signed,T_ones)
    #Start Graph Session
    sess = tf.Session()
    DF_signed = pd.DataFrame(
        sess.run(T_tril,feed_dict={T_feed: DF_var.as_matrix()}),
        columns = DF_var.columns, index = DF_var.index
    )
    #Close Graph Session
    sess.close()
    return(DF_signed)
def trans2(DF_var):
    #Total user time is 1.71233415604 seconds
    M_computed = np.tril(np.ones(DF_var.shape))*(0.5 + 0.5*DF_var.as_matrix())
    DF_signed = pd.DataFrame(M_computed,columns=DF_var.columns, index=DF_var.index)
    return(DF_signed)
My timing method was:
import time
start_time = time.time()
#operation
print str(time.time() - start_time)
Your results are consistent with benchmarks published by someone else.
In his benchmark he compared NumPy, Theano and Tensorflow on an Intel core i5-4460 CPU with 16GiB RAM and a Nvidia GTX 970 with 4 GiB RAM using Theano 0.8.2, Tensorflow 0.11.0, CUDA 8.0 on Linux Mint 18
He reported results for addition and for a few other functions such as matrix multiplication, and summarized them as follows:
It is clear that the main strengths of Theano and TensorFlow are very fast dot products and matrix exponents. The dot product is approximately 8 and 7 times faster respectively with Theano/Tensorflow compared to NumPy for the largest matrices. Strangely, matrix addition is slow with the GPU libraries and NumPy is the fastest in these tests.
The minimum and mean of matrices are slow in Theano and quick in Tensorflow. It is not clear why Theano is as slow (worse than NumPy) for these operations.
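For reference, the transformation from the question is a cheap element-wise operation, which is the kind of workload where NumPy did well in the benchmark above. It can be written as a single NumPy expression: applying `np.tril` to the result is equivalent to multiplying by a lower-triangular mask of ones. This is only a sketch on a plain array; the function name `transform` is illustrative and the pandas wrapping from the question is left out.

```python
import numpy as np

def transform(matrix):
    """Scale to 0.5 + 0.5*x, then zero everything above the diagonal."""
    return np.tril(0.5 + 0.5 * np.asarray(matrix, dtype=float))

M = np.array([[1.0, -1.0],
              [0.0,  1.0]])
out = transform(M)

# Same computation written with an explicit mask, as in the question.
masked = np.tril(np.ones(M.shape)) * (0.5 + 0.5 * M)
print(out)
```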
This question is about the precision of computation using NumPy vs. Octave/MATLAB (the MATLAB code below has only been tested with Octave, however). I am aware of a similar question on Stack Overflow, namely this, but that seems somewhat far from what I'm asking below.
Setup
Everything is running on Ubuntu 14.04.
Python version 3.4.0.
NumPy version 1.8.1 compiled against OpenBLAS.
Octave version 3.8.1 compiled against OpenBLAS.
Sample Code
Sample Python code.
import numpy as np
from scipy import linalg as la
def build_laplacian(n):
    lap=np.zeros([n,n])
    for j in range(n-1):
        lap[j+1][j]=1
        lap[j][j+1]=1
    lap[n-1][n-2]=1
    lap[n-2][n-1]=1
    return lap
def evolve(s, lap):
    wave=la.expm(-1j*s*lap).dot([1]+[0]*(lap.shape[0]-1))
    for i in range(len(wave)):
        wave[i]=np.linalg.norm(wave[i])**2
    return wave
We now run the following.
np.min(evolve(2, build_laplacian(500)))
which gives something on the order of e-34.
We can produce similar code in Octave/MATLAB:
function lap=build_laplacian(n)
    lap=zeros(n,n);
    for i=1:(n-1)
        lap(i+1,i)=1;
        lap(i,i+1)=1;
    end
    lap(n,n-1)=1;
    lap(n-1,n)=1;
end
function result=evolve(s, lap)
    d=zeros(length(lap(:,1)),1); d(1)=1;
    result=expm(-1i*s*lap)*d;
    for i=1:length(result)
        result(i)=norm(result(i))^2;
    end
end
We then run
min(evolve(2, build_laplacian(500)))
and get 0. In fact, evolve(2, build_laplacian(500)))(60) gives something around e-100 or less (as expected).
The Question
Does anyone know what would be responsible for such a large discrepancy between NumPy and Octave? (Again, I haven't tested the code with MATLAB, but I'd expect to see similar results.)
Of course, one can also compute the matrix exponential by first diagonalizing the matrix. I have done this and have gotten similar or worse results (with NumPy).
EDITS
My scipy version is 0.14.0. I am aware that Octave/MATLAB use the Pade approximation scheme, and am familiar with this algorithm. I am not sure what scipy does, but we can try the following.
Diagonalize the matrix with numpy's eig or eigh (in our case the latter works fine since the matrix is Hermitian). As a result we get two matrices: a diagonal matrix D, whose diagonal holds the eigenvalues of the original matrix, and a matrix U, whose columns are the corresponding eigenvectors, so that the original matrix is given by U.dot(D).dot(U.conj().T).
Exponentiate D (this is now easy since D is diagonal).
Now, if M is the original n-by-n matrix and d is the original vector d=[1]+[0]*(n-1), we get scipy.linalg.expm(-1j*s*M).dot(d)=U.dot(E).dot(U.conj().T.dot(d)), where E is the diagonal matrix with entries numpy.exp(-1j*s*D[k][k]).
Unfortunately, this produces the same result as before. Thus this probably has something to do either with the way numpy.linalg.eig and numpy.linalg.eigh work, or with the way numpy does arithmetic internally.
So the question is: how do we increase numpy's precision? Indeed, as mentioned above, Octave seems to do a much finer job in this case.
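The diagonalization steps described above can be sketched in NumPy as follows. This is my reading of those steps rather than the code that was actually run: `evolve_eigh` is a hypothetical name, the Laplacian is built with `np.diag` instead of a loop, and the squared magnitudes are taken with `np.abs` in one vectorized call. Note that amplitudes computed this way carry absolute errors around machine epsilon (about 1e-16), so squared magnitudes much below roughly 1e-32 are probably rounding noise rather than meaningful values.

```python
import numpy as np

def build_laplacian(n):
    """Ones on the first off-diagonals, zeros elsewhere."""
    off = np.ones(n - 1)
    return np.diag(off, 1) + np.diag(off, -1)

def evolve_eigh(s, lap):
    """|exp(-1j*s*lap) e_1|^2 computed from lap = U diag(w) U^T."""
    w, U = np.linalg.eigh(lap)      # w: eigenvalues, columns of U: eigenvectors
    d = np.zeros(lap.shape[0])
    d[0] = 1.0
    coeffs = U.T @ d                # expand d in the eigenbasis (U is real here)
    wave = U @ (np.exp(-1j * s * w) * coeffs)
    return np.abs(wave) ** 2

prob = evolve_eigh(2, build_laplacian(50))
print(prob.sum())                   # should be close to 1 for a unitary evolution
```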
The following code
import numpy as np
from scipy import linalg as la
import scipy
print(np.__version__)
print(scipy.__version__)
def build_laplacian(n):
    lap=np.zeros([n,n])
    for j in range(n-1):
        lap[j+1][j]=1
        lap[j][j+1]=1
    lap[n-1][n-2]=1
    lap[n-2][n-1]=1
    return lap
def evolve(s, lap):
    wave=la.expm(-1j*s*lap).dot([1]+[0]*(lap.shape[0]-1))
    for i in range(len(wave)):
        wave[i]=la.norm(wave[i])**2
    return wave
r = evolve(2, build_laplacian(500))
print(np.min(abs(r)))
print(r[59])
prints
1.8.1
0.14.0
0
(2.77560227344e-101+0j)
for me, with OpenBLAS 0.2.8-6ubuntu1.
So it appears your problem is not immediately reproduced. Your code examples above are not runnable as-is (typos).
As mentioned in the scipy.linalg.expm documentation, the algorithm is from Al-Mohy and Higham (2009), which is different from the simpler scale-and-square-Pade in Octave.
As a consequence, the results I get from Octave are also slightly different, although they are eps-close in matrix norms (1, 2, inf). MATLAB uses the Pade approach from Higham (2005), which seems to give the same results as SciPy above.
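A check of the "eps-close in matrix norms" kind can be sketched like this. The helper `relative_norm_differences` is a hypothetical name, and the two small matrices are stand-ins for the results of two expm implementations, not actual SciPy or Octave output.

```python
import numpy as np

def relative_norm_differences(A, B):
    """Relative difference ||A - B|| / ||A|| in the 1-, 2-, and inf-norms."""
    return {
        order: np.linalg.norm(A - B, order) / np.linalg.norm(A, order)
        for order in (1, 2, np.inf)
    }

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = A + 1e-15          # stand-in for a second implementation's result
diffs = relative_norm_differences(A, B)
eps = np.finfo(float).eps

for order, value in diffs.items():
    print(f"norm {order}: relative difference {value:.2e} (eps = {eps:.2e})")
```

A small relative difference in these norms says the matrices agree overall, but it does not guarantee that individual tiny entries, such as the 1e-100 values discussed in the question, agree to any relative accuracy.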