What could be wrong with this matrix multiplication benchmark (Matlab vs Numpy)?

What could be wrong with this matrix multiplication benchmark (Matlab vs Numpy)? - python

Here's the code I wrote for comparing performance of numpy vs Matlab. It just measures the average time taken for matrix multiplication (1701x576 matrix M1 * 576x576 matrix M2).
Matlab version : (M1 is (1701x576) while M2 is (576x576) matrix)
function r = benchmark(M1,M2)
total_time=0;
for i=1:4
for j=1:1500
tic;
a=M1*M2;
tim=toc;
total_time =total_time+tim;
end
end
avg_time = total_time/4
r=avg_time
end
Python version :
def benchmark():
iters = range(1500)
for i in range(4):
for j in iters:
tic = time.time()
a=M1.dot(M2);
toc = time.time() - tic
t_time=t_time+toc;
return t_time/4
Matlab version takes almost ~18.2s , while Python takes ~19.3s . Ive repeated this test multiple times , and Matlab was always performing better than Python (even if smaller difference) in all cases . My understanding is Numpy uses efficient and compiled code for vector operations , and is supposed to be faster than Matlab.
Then , why could Matlab perform faster than Numpy ? The test was done on a 32 core machine .
Where did I go wrong ? or is this expected for Numpy to be slower than Matlab.
Are there ways to improve the performance for Python ?
Edit : Updated the matlab code to fix the loop index/return value error . The error was the result of me trying to edit the names in the snipper to make it presentable just before posting(a bad idea everytime :) ).

[edited to remove the mention of loops; that was my mistake]
Couple things--
First, the multicore nature of the machine doesn't really matter unless you're explicitly using those extra cores (or linking NumPy against a BLAS library that uses multiple cores -- thanks #ali_m). If you're not, it'll run about as fast on a 32-core machine as it will on a 4-core machine (assuming the clock speeds of the cores themselves are roughly equal).
Second, using purely off-the-shelf Matlab vs off-the-shelf NumPy, Matlab generally beats out NumPy. This is a very general statement, though; YMMV. Also, speaking of Matlab, there does indeed appear to be a bug in the looping indices.
Third, this may not be the best benchmark for performance; there may be some unseen caching issues taking place under the hood that aren't obvious. A better one would be to randomly generate the matrices on-the-fly in each iteration and multiply them, but even this could be problematic depending on the random number generator.

There are two major issues I can see with the test.
The first is that you are using global variable lookup in Python while you are using local variable lookup in MATLAB. Global variable lookup in Python is relatively slow. Making sure the variables are local like they are in MATLAB will affect the performance.
The second is that you are re-doing the same calculation over and over. MATLAB has a JIT for loops and numpy has a cache for calculations, both of which can reduce the time for repeated calculations.
So to make the comparisons more equal and reliable, you should create new, random matrices each time through the loop. This will prevent caching and JIT from messing up your results, and will make sure the variables are all local.

There is a bug in your Matlab code. It appears that you are using the same loop control variable in nested loops.
The outer loop actually only runs once.
Edit: The outer loop actually runs the correct number of times. The two loop control variables seem to be independent.

Related

Python slow on for-loops and hundreds of attribute lookups. Use Numba?

i am working on a simple showcase SPH (smoothed particle hydrodynamics, not relevant here though) implementation in python. The code works, but the execution is kinda sluggish. I often have to compare individual particles with a certain amount of neighbours. In an earlier implementation i kept all particle positions and all distances-to-each-existing-particle in large numpy arrays -> to a certain point this was pretty fast. But visually not pleasing and n**2. Now i want it clean and simple with classes + kdTree to speed up the neighbour search.
this all happens in my global Simulation-Class. Additionally there's a class called "particle" that contains all individual informations. i create hundreds of instances before and loop through them.
def calculate_density(self):
#Using scipys advanced nearest neighbour seach magic
tree = scipy.spatial.KDTree(self.particle_positions)
#here we go... loop through all existing particles. set attributes..
for particle in self.my_particles:
#get the indexes for the nearest neighbours
particle.index_neighbours = tree.query_ball_point(particle.position,self.h,p=2)
#now loop through the list of neighbours and perform some additional math
particle.density = 0
for neighbour in particle.index_neighbours:
r = np.linalg.norm(particle.position - self.my_particles[neighbour].position)
particle.density += particle.mass * (315/(64*math.pi*self.h**9)) *(self.h**2-r**2)**3
i timed 0.2717630863189697s for only 216 particles.
Now i wonder: what to do to speed it up?
Most tools online like "Numba" show how they speed up math-heavy individual functions. I dont know which to choose. On a sidenode, i cannot even get Numba to work in this case. I get a looong error message. And i hoped it is as simple as slapping "#jit" in front of it.
I know its the loops with the attribute calls that crush my performance anyway - not the math or the neighbour search. Sadly iam a novice to programming and i liked the clean approach i got to work here :( any thoughts?

These kind of loop-intensive calculations are slow in Python. In these cases, the first thing you want to do is to see if you can vectorize these operations and get rid of the loops. Then actual calculations will be done in C or Fortran libraries and you will get a lot of speed up. If you can do it usually this is the way to go, since it is much easier to maintain your code.
Some operations, however, are just inherently loop-intensive. In these cases using Cython will help you a lot - you can usually expect 60X+ speed up when you cythonize your loop. I also had similar experiences with numba - when my function becomes complicated, it failed to make it faster, so usually I just use Cython.
Coding in Cython is not too bad - much easier than actually code in C because you can access numpy arrays easily via memoryviews. Another advantage is that it is pretty easy to parallelize the loop with openMP, which can gives you additional 4X+ speedups (of course, depending on the number of cores you have in your machine), so your code can be hundreds times faster.
One issue is that to get the optimal speed, you have to remove all the python calls inside your loop, which means you cannot call numpy/scipy functions. So you have to convert tree.query_ball_point and np.linalg.norm part to Cython for optimal speed.

Efficient Matrix-Vector Multiplication: Multithreading directly in Python vs. using ctypes to bind a multithreaded C function

I have a simple problem: multiply a matrix by a vector. However, the implementation of the multiplication is complicated because the matrix is 18 gb (3000^2 by 500).
Some info:
The matrix is stored in HDF5 format. It's Matlab output. It's dense so no sparsity savings there.
I have to do this matrix multiplication roughly 2000 times over the course of my algorithm (MCMC Bayesian Inversion)
My program is a combination of Python and C, where the Python code handles most of the MCMC procedure: keeping track of the random walk, generating perturbations, checking MH Criteria, saving accepted proposals, monitoring the burnout, etc. The C code is simply compiled into a separate executable and called when I need to solve the forward (acoustic wave) problem. All communication between the Python and C is done via the file system. All this is to say I don't already have ctype stuff going on.
The C program is already parallelized using MPI, but I don't think that's an appropriate solution for this MV multiplication problem.
Our program is run mainly on linux, but occasionally on OSX and Windows. Cross-platform capabilities without too much headache is a must.
Right now I have a single-thread implementation where the python code reads in the matrix a few thousand lines at a time and performs the multiplication. However, this is a significant bottleneck for my program since it takes so darn long. I'd like to multithread it to speed it up a bit.
I'm trying to get an idea of whether it would be faster (computation-time-wise, not implementation time) for python to handle the multithreading and to continue to use numpy operations to do the multiplication, or to code an MV multiplication function with multithreading in C and bind it with ctypes.
I will likely do both and time them since shaving time off of an extremely long running program is important. I was wondering if anyone had encountered this situation before, though, and had any insight (or perhaps other suggestions?)
As a side question, I can only find algorithmic improvements for nxn matrices for m-v multiplication. Does anyone know of one that can be used on an mxn matrix?

Hardware
As Sven Marnach wrote in the comments, your problem is most likely I/O bound since disk access is orders of magnitude slower than RAM access.
So the fastest way is probably to have a machine with enough memory to keep the whole matrix multiplication and the result in RAM. It would save lots of time if you read the matrix only once.
Replacing the harddisk with an SSD would also help, because that can read and write a lot faster.
Software
Barring that, for speeding up reads from disk, you could use the mmap module. This should help, especially once the OS figures out you're reading pieces of the same file over and over and starts to keep it in the cache.
Since the calculation can be done by row, you might benefit from using numpy in combination with a multiprocessing.Pool for that calculation. But only really if a single process cannot use all available disk read bandwith.

MATLAB twice as fast as Numpy

I am an engineering grad student currently making the transition from MATLAB to Python for the purposes of numerical simulation. I was under the impression that for basic array manipulation, Numpy would be as fast as MATLAB. However, it appears for two different programs I write that MATLAB is a little under twice as fast as Numpy. The test code I am using for Numpy (Python 3.3) is:
import numpy as np
import time
a = np.random.rand(5000,5000,3)
tic = time.time()
a[:,:,0] = a[:,:,1]
a[:,:,2] = a[:,:,0]
a[:,:,1] = a[:,:,2]
toc = time.time() - tic
print(toc)
Whereas for MATLAB 2012a I am using:
a = rand(5000,5000,3);
tic;
a(:,:,1) = a(:,:,2);
a(:,:,3) = a(:,:,1);
a(:,:,2) = a(:,:,3);
toc
The algorithm I am using is the one used on a NASA website comparing Numpy and MATLAB. The website shows that Numpy surpasses MATLAB in terms of speed for this algorithm. Yet my results show a 0.49 s simulation time for Numpy and a 0.29 s simulation time for MATLAB. I also have run a Gauss-Seidel solver on both Numpy and Matlab and I get similar results (16.5 s vs. 9.5 s)
I am brand new to Python and am not extremely literate in terms of programming. I am using the WinPython 64 bit Python distribution but have also tried Pythonxy to no avail.
One thing I have read which should improve performance is building Numpy using MKL. Unfortunately I have no idea how to do this on Windows. Do I even need to do this?
Any suggestions?

That comparison ends up being apples to oranges due to caching, because it is more efficient to transfer or do some work on contiguous chunks of memory. This particular benchmark is memory bound, since in fact no computation is done, and thus the percentage of cache hits is key to achieve good performance.
Matlab lays the data in column-major order (Fortran order), so a(:,:,k) is a contiguous chunk of memory, which is fast to copy.
Numpy defaults to row-major order (C order), so in a[:,:,k] there are big jumps between elements and that slows down the memory transfer. Actually, the data layout can be chosen. In my laptop, creating the array with a = np.asfortranarray(np.random.rand(5000,5000,3)) leds to a 5x speed up (1 s vs 0.19 s).
This result should be very similar both for numpy-MKL and plain numpy because MKL is a fast LAPACK implementation and here you're not calling any function that uses it (MKL definitely helps when solving linear systems, computing dot products...).
I don't really know what's going on on the Gauss Seidel solver, but some time ago I wrote an answer to a question titled Numpy running at half the speed of MATLAB that talks a little bit about MKL, FFT and Matlab's JIT.

You are attempting to recreate the NASA experiment, however you have changed many of the variables. For example:
Your hardware and operating system is different (www.nccs.nasa.gov/dali_front.html)
Your Python version is different (2.5.3 vs 3.3)
Your MATLAB version is different (2008 vs 2012)
Assuming the NASA results are correct, the difference in results is due to one or more of these changed variables. I recommend you:
Retest with the SciPy prebuilt binaries.
Research if any improvements were made to MATLAB relative to this type of calculation.
Also, you may find this link useful.

Will multiprocessing be a good solution for this operation?

while True:
Number = len(SomeList)
OtherList = array([None]*Number)
for i in xrange(Number):
OtherList[i] = (Numpy Array Calculation only using i_th element of arrays, Array_1, Array_2, and Array_3.)
'Number' number of elements in OtherList and other arrays can be calculated seperately.
However, as the program is time-dependent, we cannot proceed further job until every 'Number' number of elements are processed.
Will multiprocessing be a good solution for this operation?
I should to speed up this process maximally.
If it is better, please suggest the code please.

It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginners guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing what are specific calculations or sizes of the arrays here're generic guidelines in no particular order:
measure performance of your code. Find hot-spots. Your code might load input data longer than all calculations. Set your goal, define what trade-offs are acceptable
check with automated tests that you get expected results
check whether you could use optimized libraries to solve your problem
make sure algorithm has adequate time complexity. O(n) algorithm in pure Python can be faster than O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution.
rewrite places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup worth it to keep C extensions.
minimize allocation and data copying. Make it cache friendly.
explore whether multiple threads might be useful in your case e.g., cython.parallel.prange(). Release GIL.
Compare with multiprocessing approach. The link above contains an example how to compute different slices of an array in parallel.
Iterate

Since you have a while True clause there I will assume you will run a lot if iterations so the potential gains will eventually outweigh the slowdown from the spawning of the multiprocessing pool. I will also assume you have more than one logical core on your machine for obvious reasons. Then the question becomes if the cost of serializing the inputs and de-serializing the result is offset by the gains.
Best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass on any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it on as the args when calling Process(). This way you reduce the amount of data that needs to be picked and passed on via IPC (which is what multiprocessing does)
You use a work queue and add to it tasks as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.

python for-loop slower each iteration

I am trying to optimize some python code (to speed up some matrix operations), my code is something similar to this one (my real dataset is also similar to 'gps'),
import numpy as np
gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')
for i in xrange(len(gps)):
for j in xrange(0,i+1):
ips[i,j]= f.innerProd(gps[i],gps[j])
ips[j,i]= ips[i,j]
print "Inner product matrix: %3.0f %% done (%d of %d)"% \
(((i+1)**2.)/(len(gps)**2.)*100, i, len(gps))
def innerProd(mat1,mat2):
return float(np.sum(np.dot(np.dot(mat1,mat2),mat1)))
What I would like to understand is , why is it that the program begins running fast during the first iterations and then slows down as it iterates further? I know the question might be a bit naive but I really want to have a clearer idea of what is happening before I attempt anything else. I already implemented my function in Fortran (leaving within the Fortran realm any for loops) and used f2py to create a dynamic lib to call the function from python, this would be the new code in python..
import numpy as np
import myfortranInnProd as fip
gps = [np.random.rand(50,50) for i in xrange(1000)]
ips = np.zeros( (len(gps),len(gps)), dtype='float32')
ips = fip.innerProd(gps)
unfortunately I only found out (surprisingly) that my fortran-python version runs 1.5 ~ 2 times slower than the first version (it is important to mention that I used MATMUL() on the Fortran implementation). I have been googling around for a while and I believe that this "slow down" has something to do with the memory bandwidth, memory allocation or caching, given the large datasets, but I am not very sure about what is really happening behind and how could I improve the performance. I have run the code on both a small intel atom , 2GB ram and a 4 core intel xeon, with 8GB (of course with a correspondingly scaled dataset) and the "slow down" behavior is the same.
I just need to understand why is it that this 'slow down' happens? would it do any good if i implement the function in C ? or try to implement it to run on a GPU ? Any other ideas how to improve it? Thanks in advance

At the risk of stating the obvious, the number of executions of the inner loop will grow each time you complete an execution of the outer loop. When i is 0, the inner loop will only be executed once, but when i is 100, it will be executed 101 times. Could this explain your observations, or do you mean that each execution of the inner loop itself is getting slower over time?

The number of executions of the inner for loop depends on the value of i, the index of the outer for loop. Since you're displaying your debug each time the inner loop finishes, it gets displayed less and less often as i grows. (Note that the percentage increases regularly, however.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.