I'm trying to speed up the following code:
from math import log
from random import random
def logtest1(N):
    tr = 0
    for i in range(1, N):
        T = 40 + 10*random()
        tr += -log(random())/T
    return tr
I'm fairly new to Python (coming from MATLAB), and this same code runs about 5x slower in Python than in MATLAB (and Julia), which got my attention.
I tried numba and parakeet wrappers, and numpy functions instead of the pure-Python ones, but didn't get any improvement at all.
I'd appreciate any help.
Thanks.
edit: the whole thing is a Monte Carlo simulation, so N is very large... 10e6 for testing purposes
You should really be looking into numpy. And Scipy, while you're at it. Numpy is a speed-optimized package for N-dimensional array numerics, and Scipy is a collection of scientific computing stuff built upon numpy.
If you write the function using numpy arrays, it looks like this:
import numpy as np

def logtest2(N):
    T = 40. + 10. * np.random.rand(N)
    return np.sum(-np.log(np.random.rand(N)) / T)
It's also a lot faster. Testing with N = 1000000 gave me a runtime of 500ms for your version and 75ms for this one.
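If you want to check the numbers on your own machine, here is a minimal sketch with timeit (it assumes both functions are defined as above; the exact results will vary by hardware):

import timeit

N = 1000000
t1 = timeit.timeit(lambda: logtest1(N), number=1)  # pure-Python loop
t2 = timeit.timeit(lambda: logtest2(N), number=1)  # numpy version
print("logtest1: %.0f ms" % (t1 * 1000))
print("logtest2: %.0f ms" % (t2 * 1000))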
Right off the bat I'd say use xrange if you are using Python 2.7.x.
So in 2.7 it would be:
def logtest1(N):
    tr = 0
    for i in xrange(N):
        a = random()  # generate the random number once and reuse it
        T = 40 + 10*a
        tr += -log(a)/T
    return tr
Here is a summary on why xrange is better: Should you always favor xrange() over range()?
I enjoy using a lot of functional programming features when playing with Python lists. When I switch to NumPy for big datasets, I would expect it to be significantly more efficient than native Python list operations over ndarray.tolist(), since the data is stored differently.
So when I try to apply map, reduce, filter and similar FP things to a NumPy array, I first search NumPy's docs for some "optimized" equivalents, and what I find is numpy.ufunc.reduce, which seems to be the right thing. However, out of curiosity, I did a simple test of both approaches:
Use Numpy reduce
import numpy as np
a = np.array(range(100000000))
adf = lambda res, a: res + a
u_adf = np.frompyfunc(adf, 2, 1)
print(u_adf.reduce(a, initial=0))
Use ndarray.tolist() and then use Python native reduce
import numpy as np
from functools import reduce
a = np.array(range(100000000))
adf = lambda res, a: res + a
print(reduce(adf, a.tolist(), 0))
Here comes the most unexpected thing:
> python 1.py
4999999950000000
python 1.py 28.00s user 5.71s system 102% cpu 32.925 total
> python 2.py
4999999950000000
python 2.py 26.38s user 6.38s system 103% cpu 31.792 total
The so-called "stupid" approach is actually the more efficient way?
How can that be? Can anyone please explain this for me? And hopefully give some advice on using functional programming features on NumPy arrays.
Appreciate ^_^
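One thing worth knowing here (not stated in the question): np.frompyfunc does not compile the lambda; it still calls the Python function once per element from C, so it carries roughly the same per-element interpreter overhead as functools.reduce. The compiled path is NumPy's built-in ufuncs. A minimal sketch of that, assuming a plain sum is what the reduction is meant to compute:

import numpy as np

a = np.arange(100000000, dtype=np.int64)  # same values, built without a Python list
print(np.add.reduce(a))                   # built-in ufunc reduce: the loop runs in C
print(a.sum())                            # equivalent, and the more idiomatic spelling

On this kind of reduction the built-in ufunc is typically orders of magnitude faster than either frompyfunc or functools.reduce.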
I was randomly comparing the computation times of an explicit for-loop with a vectorized implementation in numpy. I ran exactly 1 million iterations and found some astounding differences. The for-loop took about 646 ms, while np.exp() computed the same result in less than 20 ms.
import time
import math
import numpy as np
iter = 1000000
x = np.zeros((iter,1))
v = np.random.randn(iter,1)
before = time.time()
for i in range(iter):
    x[i] = math.exp(v[i])
after = time.time()
print(x)
print("Non vectorized= " + str((after-before)*1000) + "ms")
before = time.time()
x = np.exp(v)
after = time.time()
print(x)
print("Vectorized= " + str((after-before)*1000) + "ms")
The result I got:
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Non vectorized= 646.1577415466309ms
[[0.9256753 ]
[1.2529006 ]
[3.47384978]
...
[1.14945181]
[0.80263805]
[1.1938528 ]]
Vectorized= 19.547224044799805ms
My questions are:
What exactly is happening in the second case? The first one uses an explicit for-loop, so its computation time is easy to account for; what is happening "behind the scenes" in the second case?
How can one implement such computations (the second case) without using numpy, in plain Python?
What is happening is that NumPy hands the work off to highly optimized compiled numerical code (and, for linear algebra, to libraries such as BLAS), which is very good at vector arithmetic.
I imagine you could call the exact libraries NumPy uses directly; however, NumPy likely knows best which to use.
NumPy is a Python wrapper over libraries and code written in C, and this is a large part of NumPy's efficiency. C code compiles directly to instructions that are executed by your processor. On the other hand, Python code must be interpreted as it executes. Despite the ever-increasing speed we can get from interpreted languages with advances like just-in-time compilers, for some tasks they will never be able to approach the speed of compiled languages.
It comes down to the fact that Python does not have direct access to the hardware level.
Python can't use the SIMD (single instruction, multiple data) instructions that most modern CPUs and GPUs have. These SIMD instructions allow a single operation to execute on a vector of data all at once (within a single clock cycle) at the hardware level.
NumPy, on the other hand, has functions built in C, and C is a language that can make use of SIMD instructions. Therefore NumPy can take advantage of the vectorization hardware in your processor.
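On the third question (doing this in plain Python): you can make the code more compact with a list comprehension, but it is still an interpreted loop, so it lands in the same performance range as the explicit for-loop rather than NumPy's. A small sketch (exp_plain is just a name chosen for the example):

import math

def exp_plain(values):
    # A list comprehension: tidier than preallocating and indexing,
    # but each math.exp call still goes through the interpreter.
    return [math.exp(v) for v in values]

print(exp_plain([0.0, 0.5, 1.0]))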
I recently began self-learning Python, and have been using this language for an online course in algorithms. For some reason, much of the code I created for this course is very slow (relative to the C/C++/MATLAB code I have written in the past), and I'm starting to worry that I am not using Python properly.
Here is a simple python and matlab code to compare their speed.
MATLAB
for i = 1:100000000
    a = 1 + 1;
end
Python
for i in list(range(0, 100000000)):
    a = 1 + 1
The MATLAB code takes about 0.3 seconds, and the Python code takes about 7 seconds. Is this normal? My Python code for much more complex problems is very slow. For example, as a HW assignment, I'm running depth first search on a graph with about 900000 nodes, and this is taking forever. Thank you.
Performance is not an explicit design goal of Python:
Don't fret too much about performance--plan to optimize later when needed.
That's one of the reasons why Python integrates with a lot of high-performance computing backends, such as numpy, OpenBLAS and even CUDA, just to name a few.
The best way forward if you want to increase performance is to let high-performance libraries do the heavy lifting for you. Optimizing loops within Python (for example by using xrange instead of range in Python 2.7) won't get you very dramatic results.
Here is a bit of code that compares different approaches:
Your original list(range())
The suggested use of xrange()
Leaving the i out
Using numpy to do the addition using numpy arrays (vector addition)
Using CUDA to do vector addition on the GPU
Code:
import timeit
import matplotlib.pyplot as mplplt
iter = 100
testcode = [
"for i in list(range(1000000)): a = 1+1",
"for i in xrange(1000000): a = 1+1",
"for _ in xrange(1000000): a = 1+1",
"import numpy; one = numpy.ones(1000000); a = one+one",
"import pycuda.gpuarray as gpuarray; import pycuda.driver as cuda; import pycuda.autoinit; import numpy;" \
"one_gpu = gpuarray.GPUArray((1000000),numpy.int16); one_gpu.fill(1); a = (one_gpu+one_gpu).get()"
]
labels = ["list(range())", "i in xrange()", "_ in xrange()", "numpy", "numpy and CUDA"]
timings = [timeit.timeit(t, number=iter) for t in testcode]
print labels, timings
label_idx = range(len(labels))
mplplt.bar(label_idx, timings)
mplplt.xticks(label_idx, labels)
mplplt.ylabel('Execution time (sec)')
mplplt.title('Timing of integer addition in python 2.7\n(smaller value is better performance)')
mplplt.show()
Results (graph), run on Python 2.7.13 on OS X:
The reason that Numpy performs faster than the CUDA solution here is that the overhead of setting up CUDA and moving data to the GPU is not paid back for such a small workload. For larger, floating-point calculations, CUDA does even better than Numpy.
Note that the Numpy solution performs more than 80 times faster than your original solution. If your timings are correct, this would even be faster than Matlab...
A final note on DFS (Depth-First Search): here is an interesting article on DFS in Python.
Try using xrange instead of range.
The difference between them is that xrange generates the values lazily as you use them, whereas range builds the entire list up front at runtime.
Unfortunately, Python's amazing flexibility and ease come at the cost of speed. Also, for such large iteration counts, I suggest using the itertools module, whose iterators are implemented in C.
xrange is a good solution; however, if you want to iterate over dictionaries and such, it's better to use itertools, since it can work with any kind of iterable, not just integer ranges.
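As one concrete illustration of the itertools suggestion: itertools.repeat is a common way to say "loop N times" without creating a new loop-index object on every iteration (this is what the timeit module itself does internally), and it is usually a bit faster than range/xrange for pure counting loops. A small sketch:

import itertools
import timeit

def loop_range(n):
    for i in range(n):
        a = 1 + 1

def loop_repeat(n):
    # itertools.repeat(None, n) yields the same object n times,
    # so no integer objects are created for the loop variable.
    for _ in itertools.repeat(None, n):
        a = 1 + 1

print(timeit.timeit(lambda: loop_range(10**6), number=10))
print(timeit.timeit(lambda: loop_repeat(10**6), number=10))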
I'm writing an iterative algorithm, where the most time consuming part is the execution of a function oneiter() which looks like the following:
def oneiter(M, h):
    res = []
    for i in range(M.shape[2]):
        res.append(f(M[:,:,i], h))
    return res
where M is a large n by n by p numpy array, and f is a function that does some kind of regression using M[:,:,i] and h. In all the iterations, M stays the same and h can be different.
To make it faster, I tried to parallelize the for loop in oneiter using joblib:
from joblib import Parallel, delayed

res = Parallel(n_jobs=4)(delayed(f)(M[:,:,i], h) for i in range(M.shape[2]))
This turned out to be even slower. I am very new to writing parallelized Python code. Could somebody tell me why this happened, and what are some better ways to do it?
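A guess at why (an assumption, not something established above): with joblib's default process-based backend, every delayed call pickles its arguments, so a full n-by-n slice of M is serialized and copied to a worker for each task, and that overhead can easily dwarf the regression itself. A minimal sketch of one thing to try, with toy stand-ins for M, h and f so the snippet runs on its own:

import numpy as np
from joblib import Parallel, delayed

# Toy stand-ins for the question's M, h and f (for illustration only).
n, p = 200, 64
M = np.random.randn(n, n, p)
h = 0.5

def f(A, h):
    # placeholder for the real regression
    return np.linalg.norm(A) * h

# Thread-based backend: the workers share M in memory instead of each task
# receiving a pickled copy. This only pays off if f spends its time inside
# NumPy/SciPy routines that release the GIL.
res = Parallel(n_jobs=4, prefer="threads")(
    delayed(f)(M[:, :, i], h) for i in range(M.shape[2])
)
print(len(res))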
Python, NumPy and R all use the same algorithm (Mersenne Twister) for generating random number sequences. Thus, theoretically speaking, setting the same seed should result in the same random number sequence in all three. This is not the case. I think the three implementations use different parameters, causing this behavior.
R
>set.seed(1)
>runif(5)
[1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
Python
In [3]: random.seed(1)
In [4]: [random.random() for x in range(5)]
Out[4]:
[0.13436424411240122,
0.8474337369372327,
0.763774618976614,
0.2550690257394217,
0.49543508709194095]
NumPy
In [23]: import numpy as np
In [24]: np.random.seed(1)
In [25]: np.random.rand(5)
Out[25]:
array([ 4.17022005e-01, 7.20324493e-01, 1.14374817e-04,
3.02332573e-01, 1.46755891e-01])
Is there some way in which the NumPy and Python implementations could produce the same random number sequence? Of course, as some comments and answers point out, one could use rpy. What I am specifically looking for is to fine-tune the parameters in the respective calls in Python and NumPy to get the same sequence.
Context: The concern comes from an EDX course offering in which R is used. In one of the forums, it was asked if Python could be used and the staff replied that some assignments would require setting specific seeds and submitting answers.
Related:
Comparing Matlab and Numpy code that uses random number generation: from this it seems that the underlying NumPy and Matlab implementations are similar.
python vs octave random generator: This question does come fairly close to the intended answer. Some sort of wrapper around the default state generator is required.
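For the Python-vs-NumPy half of the question specifically (this does not help with R), one approach I believe works is to copy the Mersenne Twister state from Python's random module into NumPy's legacy RandomState; both turn two 32-bit draws into a double the same way, so the outputs should then line up. A sketch, to be verified on your own versions:

import random
import numpy as np

random.seed(1)
version, internal_state, gauss_next = random.getstate()

# internal_state holds 625 ints: 624 MT19937 state words plus the position index.
key, pos = internal_state[:-1], internal_state[-1]
np.random.set_state(('MT19937', np.array(key, dtype=np.uint32), pos))

# Both generators now start from the same state, so the two lines below
# should print the same three numbers.
print([random.random() for _ in range(3)])
print(np.random.random(3))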
Use rpy2 to call R from Python. Here is a demo; the returned vector data refers to the same memory as x in R, so modifying it changes x:
import numpy as np
import rpy2.robjects as robjects

data = robjects.r("""
set.seed(1)
x <- runif(5)
""")
print(np.array(data))

data[1] = 1.0
print(robjects.r["x"])
I realize this is an old question, but I've stumbled upon the same problem recently, and created a solution which can be useful to others.
I've written a random number generator in C, and linked it to both R and Python. This way, the random numbers are guaranteed to be the same in both languages since they are generated using the same C code.
The program is called SyncRNG and can be found here: https://github.com/GjjvdBurg/SyncRNG.