I was randomly comparing the computation times of an explicit for-loop with vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. For-loop took about 646ms while the np.exp() function computed the same result in less than 20ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one is using
an explicit for-loop and thus the computation time is justified.
What is happening "behind the scenes" in the second case?
How can one implement such computations (second case) without using numpy (in plain Python)?
What is happening is that NumPy is calling high quality numerical libraries (BLAS for instance) which are very good at vector arithmetic.
I imagine you could specifically call the exact libraries used by NumPy, however, NumPy would likely know best which to use.
NumPy is a Python wrapper over libraries and code written in C. This is a large part of the efficiency of NumPy. C code compiles directly to instructions which are executed by your processor or GPU. On the other hand, Python code must be interpreted as it executes. Despite the ever increasing speed we can get from interpreted languages with advances like Just In Time Compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (Single instruction, multiple data) assembly instructions that most modern CPU's and GPU's have. These SIMD instruction allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy on the other hand has functions built in C, and C is a language capable of running SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
I am trying to calculate sparse matrix calculations using scipy for an algorithm that require intensive dependent computations(PageRank) on very large RDF datasets. I want to use multiple cores for the scipy calculation within the following code
F = sparse.coo_matrix((y['data'],(y['row'],y['col'])),shape=y['shape'])
W = sparse.coo_matrix((y['data'],(y['row'],y['col'])),shape=y['shape'])
P = sparse.bmat([[None, W], [F, None]])
previous = np.ones(n)/n
ones = np.ones(n)/n
while error > epsilon:
tmp = np.array(previous)
previous = damping*P.T.dot(previous) + (1-damping)*ones
error = np.linalg.norm(tmp - previous)
if(printerror):
print(error)
I have searched every possible answer I could find and I tried integrating the mkl(anaconda build) within the code but the performance on multiple cores does not seem to scale up. I have come to an understanding that the scipy call csr.h does not make use of BLAS call, I am wondering whether I need to make changes and replace the call to csr_matvec in from scipy/sparsetools with an appropriate Sparse BLAS call since MKL has those and then link scipy to mkl. Am I understanding something wrong or missing something. I would really appreciate some help in the matter. One similar question is here Thanks!!
I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh anaconda installation i created a MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article,
it seems that the MKL has internal thresholds to decide whether to use threading or not. It seems one of these thresholds is given by the vector size used in the calculations (kind of obvious). This threshold is set to a vector size of 8192 (at least for my machine). When vectors exceed this size, i can observe my python scripts using 4 threads (i have 2 cores with hyper threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Beside the threading part, MKL "Features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems i'm usually working on do not exceed the vector size threshold, i'm not interested in the performance increase which is obtained by threading, but more in the optimized math functions of MKL. Unfortunately it seems like those are only used, when the vector size is above the threshold.
I've written a sample code to measure the performance of the sine operation on vectors with different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)
def func(x):
return np.sin(x)
def measure(x):
t1 = timer()
for i in range(0, Nop):
func(x)
t2 = timer()
diff = (t2 - t1)*1000.0
print("vec size: %i:" % len(x), end="")
print("\t time needed: %f ms" % diff)
x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure, that the increase in performance is not due to threading (i also checked the CPU usage, it is indeed only using one thread)
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the 8193-vector runs roughly 20x faster than the 8192-vector. What is even more confusing is the fact, that the second run of the 8193-vector is 4x faster then before, after doing the calculation on the bigger vector.
Now my questions:
Am i doing anything obviously wrong, which i am not aware of, which
leads to these results?
Can anyone reproduce these results or is it just my installation/my
machine behaving like this
Is the increase in performance really due to the optimized
implementation of sine?
Is it possible to enforce always using the optimized version of sine
independent of the vector size?
PS:
I actually tried the following in the simulation i'm running for my master thesis, which involve a lot of sine and cosine function calls:
Just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
I have 20,000*515 numpy matrix,representing biological datas. I need to find the correlation coefficient of the biological data,meaning as a result I would have a 20,000*20,000 matrix with correlation value. Then I populate a numpy array with 1's and 0's if each correlation coefficient is greater than a threshold value.
I used numpy.corrcoef to find the correlation coefficient and it works well in my machine.
Then I wanted to put in the cluster(having 10 computers and node varying from 2 to 8). when I tried putting it in the cluster, each node generating (40)random numbers and getting those 40 random columns from the biological data resulting in 20,000*40 matrix, I ran into memory issue saying.
mpirun noticed that process rank # with PID # on node name exited on signal 9 (Killed).
Then I tried rewriting the program like getting each rows finding the correlation coefficient and if the value is greater than the threshold then store 1 in the matrix or else 0 rather than creating a correlation matrix.
But it takes 1.30 hrs to run this program.I need to run it 100 times.
Can anyone please suggest a better way of doing this, like how to solve the memory issue by allocating jobs once each core has finished it's job. I am new to MPI. Below is my code.
If you need any more information please let me know.
Thank you
import numpy
from mpi4py import MPI
import time
Size=MPI.COMM_WORLD.Get_size();
Rank=MPI.COMM_WORLD.Get_rank();
Name=MPI.Get_processor_name();
RandomNumbers={};
rndm_indx=numpy.random.choice(range(515),40,replace=False)
rndm_indx=numpy.sort(rndm_indx)
Data=numpy.genfromtxt('MyData.csv',usecols=rndm_indx);
RandomNumbers[Rank]=rndm_indx;
CORR_CR=numpy.zeros((Data.shape[0],Data.shape[0]));
start=time.time();
for i in range(0,Data.shape[0]):
Data[i]=Data[i]-np.mean(Data[i]);
alpha1=1./np.linalg.norm(Data[i]);
for j in range(i,Data.shape[0]):
if(i==j):
CORR_CR[i][j]=1;
else:
Data[j]=Data[j]-np.mean(Data[j]);
alpha2=1./np.linalg.norm(Data[j]);
corr=np.inner(Data[i],Data[j])*(alpha1*alpha2);
corr=int(np.absolute(corrcoef)>=0.9)
CORR_CR[i][j]=CORR_CR[j][i]=corr
end=time.time();
CORR_CR=CORR_CR-numpy.eye(CORR_CR.shape[0]);
elapsed=(end-start)
print('Total Time',elapsed)
The execution time of the program you posted is about 96s on my computer. Let's optimize a couple of things before exploring parallel computations.
Let's store the norms of the vectors to avoid computing it each time the norm is needed. Getting alpha1=1./numpy.linalg.norm(Data[i]); out of the second loop is a good starting point. Since the vectors do not change during computations, their norms can be computed in advance:
alpha=numpy.zeros(Data.shape[0])
for i in range(0,Data.shape[0]):
Data[i]=Data[i]-numpy.mean(Data[i])
alpha[i]=1./numpy.linalg.norm(Data[i])
for i in range(0,Data.shape[0]):
for j in range(i,Data.shape[0]):
if(i==j):
CORR_CR[i][j]=1;
else:
corr=numpy.inner(Data[i],Data[j])*(alpha[i]*alpha[j]);
corr=int(numpy.absolute(corr)>=0.9)
CORR_CR[i][j]=CORR_CR[j][i]=corr
The computation time is already down to 17s.
Assuming that the vectors are not highly correlated, most of the correlation coefficients will be rounded to zero. Hence, the matrix is likely to be sparce (full of zeros). Let's use the scipy.sparse.coo_matrix sparce matrix format, which is very easy to populate: the non-null items and their coordinates i,j are to be stored in lists.
data=[]
ii=[]
jj=[]
...
if(corr!=0):
data.append(corr)
ii.append(i)
jj.append(j)
data.append(corr)
ii.append(j)
jj.append(i)
...
CORR_CR=scipy.sparse.coo_matrix((data,(ii,jj)), shape=(Data.shape[0],Data.shape[0]))
The computation time is down to 13s (negligeable improvement ?) and the memory footprint is greatly reduced. It would be a major improvement if larger datasets were to be considered.
for loops in python are pretty unefficient. See for loop in python is 10x slower than matlab for instance. But there are plenty of ways around, such as using vectorized operations or optimized iterators such as those provided by numpy.nditer. One of the reasons for for loops being unefficient is that python is a interpreted language: not compilation occurs in the process. Hence, to overcome this problem, the most tricky part of the code can be compiled by using a tool like Cython.
The critical part of the code are written in Cython, in a dedicated file correlator.pyx.
This file is turned into a correlator.c file by Cython
This file is compiled by your favorite c compiler gcc to build a shared library correlator.so
The optimized function can be used in your program after import correlator.
The content of correlator.pyx, starting from Numpy vs Cython speed , looks like:
import numpy
cimport numpy
cimport scipy.linalg.cython_blas
ctypedef numpy.float64_t DTYPE_t
cimport cython
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
def process(numpy.ndarray[DTYPE_t, ndim=2] array,numpy.ndarray[DTYPE_t, ndim=1] alpha,int imin,int imax):
cdef unsigned int rows = array.shape[0]
cdef int cols = array.shape[1]
cdef unsigned int row, row2
cdef int one=1
ii=[]
jj=[]
data=[]
for row in range(imin,imax):
for row2 in range(row,rows):
if row==row2:
data.append(0)
ii.append(row)
jj.append(row2)
else:
corr=scipy.linalg.cython_blas.ddot(&cols,&array[row,0],&one,&array[row2,0],&one)*alpha[row]*alpha[row2]
corr=int(numpy.absolute(corr)>=0.9)
if(corr!=0):
data.append(corr)
ii.append(row)
jj.append(row2)
data.append(corr)
ii.append(row2)
jj.append(row)
return ii,jj,data
where scipy.linalg.cython_blas.ddot() will perform the inner product.
To cythonize and compile the .pyx file, the following makefile can be used (I hope you are using Linux...)
all: correlator correlatorb
correlator: correlator.pyx
cython -a correlator.pyx
correlatorb: correlator.c
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o correlator.so correlator.c
The new correlation function is called in the main python file by:
import correlator
ii,jj,data=correlator.process(Data,alpha,0,Data.shape[0])
By using a compiled loop, the time is down to 5.4s! It's already ten times faster. Moreover, this computations are performed on a single process!
Let's introduce parallel computations.
Please notice that two arguments are added to the function process : imin and imax. These arguments signals the range of rows of CORR_CR that are computed. It is performed so as to anticipate the use of parallel computation. Indeed, a straightforward way to parallelize the program is to split the outer for loop (index i) to the different processes.
Each processes will perform the outer for loop for a particular range of the index i which is computed so as to balance the workload between the processes.
The program has to complete the following operations:
Process 0 ("root process") reads the vectors Data in the file.
The Data is broadcast to all processes by using the MPI function bcast().
The range of indexes i to be considered by each process is computed.
The correlation coefficient are computed by each process for the corresponding indexes. The non-null terms of the matrix are stored in lists data,ii,jj on each process.
These lists are gathered on the root process by calling the MPI function gather(). It produces three lists of Size lists which are concatenated to get 3 lists required to create the sparce adjacency matrix.
Here goes the code:
import numpy
from mpi4py import MPI
import time
import scipy.sparse
import warnings
warnings.simplefilter('ignore',scipy.sparse.SparseEfficiencyWarning)
Size=MPI.COMM_WORLD.Get_size();
Rank=MPI.COMM_WORLD.Get_rank();
Name=MPI.Get_processor_name();
#a dummy set of data is created here.
#Samples such that (i-j)%10==0 are highly correlated.
RandomNumbers={};
rndm_indx=numpy.random.choice(range(515),40,replace=False)
rndm_indx=numpy.sort(rndm_indx)
Data=numpy.ascontiguousarray(numpy.zeros((2000,515),dtype=numpy.float64))
if Rank==0:
#Data=numpy.genfromtxt('MyData.csv',usecols=rndm_indx);
Data=numpy.ascontiguousarray(numpy.random.rand(2000,515))
lin=numpy.linspace(0.,1.,515)
for i in range(Data.shape[0]):
Data[i]+=numpy.sin((1+i%10)*10*lin)*100
start=time.time();
#braodcasting the matrix
Data=MPI.COMM_WORLD.bcast(Data, root=0)
RandomNumbers[Rank]=rndm_indx;
print Data.shape[0]
#an array to store the inverse of the norm of each sample
alpha=numpy.zeros(Data.shape[0],dtype=numpy.float64)
for i in range(0,Data.shape[0]):
Data[i]=Data[i]-numpy.mean(Data[i])
if numpy.linalg.norm(Data[i])==0:
print "process "+str(Rank)+" is facing a big problem"
else:
alpha[i]=1./numpy.linalg.norm(Data[i])
#balancing the computational load between the processes.
#Each process must perform about Data.shape[0]*Data.shape[0]/(2*Size) correlation.
#each process cares for a set of rows.
#Of course, the last rank must care about more rows than the first one.
ilimits=numpy.zeros(Size+1,numpy.int32)
if Rank==0:
nbtaskperprocess=Data.shape[0]*Data.shape[0]/(2*Size)
icurr=0
for i in range(Size):
nbjob=0
while(nbjob<nbtaskperprocess and icurr<=Data.shape[0]):
nbjob+=(Data.shape[0]-icurr)
icurr+=1
ilimits[i+1]=icurr
ilimits[Size]=Data.shape[0]
ilimits=MPI.COMM_WORLD.bcast(ilimits, root=0)
#the "local" job has been cythonized in file main2.pyx
import correlator
ii,jj,data=correlator.process(Data,alpha,ilimits[Rank],ilimits[Rank+1])
#gathering the matrix inputs from every processes on the root process.
data = MPI.COMM_WORLD.gather(data, root=0)
ii = MPI.COMM_WORLD.gather(ii, root=0)
jj = MPI.COMM_WORLD.gather(jj, root=0)
if Rank==0:
#concatenate the lists
data2=sum(data,[])
ii2=sum(ii,[])
jj2=sum(jj,[])
#create the adjency matrix
CORR_CR=scipy.sparse.coo_matrix((data2,(ii2,jj2)), shape=(Data.shape[0],Data.shape[0]))
print CORR_CR
end=time.time();
elapsed=(end-start)
print('Total Time',elapsed)
By running mpirun -np 4 main.py, the computation time is 3.4s. It's not 4 time faster... This is likely due to the fact that the bottleneck of the computation is the computations of scalar products, which requires a large memory bandwidth. And my personnal computer is really limited regarding the memory bandwidth...
Last comment: there are plenty of possibilities for improvements.
- The vectors in Data are copied to every processes... This affects the memory required by the program. Dispatching the computation in a different way and trading memory against communications could overcome this problem...
- Each process still computes the norms of all the vectors...It could be improved by having each process computing the norms of some vector and using the MPI function allreduce() on alpha...
What to do with this adjacency matrix ?
I think you already know the answer to this question, but you can provide this adjacency matrix to sparse.csgraph routines such as connected_components() or laplacian(). Indeed, you are not very far from spectral clustering!
For instance, if the clusters are obvious, using connected_components() is sufficient:
if Rank==0:
# coo to csr format
S=CORR_CR.tocsr()
print S
n_components, labels=scipy.sparse.csgraph.connected_components(S, directed=False, connection='weak', return_labels=True)
print "number of different famillies "+str(n_components)
numpy.set_printoptions(threshold=numpy.nan)
print labels
I've been trying to optimize my computations; and for most operations that I've tried, tensorflow is much faster. I'm trying to do a fairly simple operation...Transform a matrix (multiply each value by 1/2 and then add 1/2 to that value).
With the help of #mrry , I was able to do these operations in tensorflow. However to my surprise, the numpy method was significantly faster?!
tensorflow seems like an extremely useful tool for data scientists and I think this could help clarify it's use and advantages.
Am I not using tensorflow data structures and operations in the most efficient way? I'm not sure how non-tensorflow methods would be faster. I'm using a Mid-2012 Macbook Air 4GB RAM
trans1 is the tensorflow version while trans2 is numpy. DF_var is a pandas dataframe object
import pandas as pd
import tensorflow as tf
import numpy as np
def trans1(DF_var):
#Total user time is 31.8532807827 seconds
#Create placeholder
T_feed = tf.placeholder(tf.float32,DF_var.shape)
#Matrix transformation
T_signed = tf.add(
tf.constant(0.5,dtype=tf.float32),
tf.mul(T_feed,tf.constant(0.5,dtype=tf.float32))
)
#Get rid of of top triangle
T_ones = tf.constant(np.tril(np.ones(DF_var.shape)),dtype=tf.float32)
T_tril = tf.mul(T_signed,T_ones)
#Start Graph Session
sess = tf.Session()
DF_signed = pd.DataFrame(
sess.run(T_tril,feed_dict={T_feed: DF_var.as_matrix()}),
columns = DF_var.columns, index = DF_var.index
)
#Close Graph Session
sess.close()
return(DF_signed)
def trans2(DF_var):
#Total user time is 1.71233415604 seconds
M_computed = np.tril(np.ones(DF_var.shape))*(0.5 + 0.5*DF_var.as_matrix())
DF_signed = pd.DataFrame(M_computed,columns=DF_var.columns, index=DF_var.index)
return(DF_signed)
My timing method was:
import time
start_time = time.time()
#operation
print str(time.time() - start_time)
Your results are compatible with the benchmarks from another guy.
In his benchmark he compared NumPy, Theano and Tensorflow on
an Intel core i5-4460 CPU with 16GiB RAM and a Nvidia GTX 970 with 4
GiB RAM using Theano 0.8.2, Tensorflow 0.11.0, CUDA 8.0 on Linux Mint
18
His results for addition shows that:
He also tested a few other functions such as matrix multiplication:
The results are:
It is clear that the main strengths of Theano and TensorFlow are very
fast dot products and matrix exponents. The dot product is
approximately 8 and 7 times faster respectively with Theano/Tensorflow
compared to NumPy for the largest matrices. Strangely, matrix addition
is slow with the GPU libraries and NumPy is the fastest in these
tests.
The minimum and mean of matrices are slow in Theano and quick in
Tensorflow. It is not clear why Theano is as slow (worse than NumPy)
for these operations.