Python code optimization (20x slower than C) - python

I've written this very badly optimized C code that does a simple math calculation:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#define MAX(a, b) (((a) > (b)) ? (a) : (b))
unsigned long long int p(int);
float fullCheck(int);
int main(int argc, char **argv){
int i, g, maxNumber;
unsigned long long int diff = 1000;
if(argc < 2){
fprintf(stderr, "Usage: %s maxNumber\n", argv[0]);
return 0;
}
maxNumber = atoi(argv[1]);
for(i = 1; i < maxNumber; i++){
for(g = 1; g < maxNumber; g++){
if(i == g)
continue;
if(p(MAX(i,g)) - p(MIN(i,g)) < diff && fullCheck(p(MAX(i,g)) - p(MIN(i,g))) && fullCheck(p(i) + p(g))){
diff = p(MAX(i,g)) - p(MIN(i,g));
printf("We have a couple %llu %llu with diff %llu\n", p(i), p(g), diff);
}
}
}
return 0;
}
float fullCheck(int number){
float check = (-1 + sqrt(1 + 24 * number))/-6;
float check2 = (-1 - sqrt(1 + 24 * number))/-6;
if(check/1.00 == (int)check)
return check;
if(check2/1.00 == (int)check2)
return check2;
return 0;
}
unsigned long long int p(int n){
return n * (3 * n - 1 ) / 2;
}
And then I tried (just for fun) to port it to Python to see how it would react. My first version was an almost 1:1 conversion that ran terribly slowly (120+ secs in Python vs <1 sec in C).
I've done a bit of optimization, and this is what I obtained:
#!/usr/bin/env python
from cmath import sqrt
import cProfile
from pstats import Stats
def quickCheck(n):
    partial_c = (sqrt(1 + 24 * (n)))/-6
    c = 1/6 + partial_c
    if int(c.real) == c.real:
        return True
    c = c - 2*partial_c
    if int(c.real) == c.real:
        return True
    return False
def main():
    maxNumber = 5000
    diff = 1000
    for i in range(1, maxNumber):
        p_i = i * (3 * i - 1 ) / 2
        for g in range(i, maxNumber):
            if i == g:
                continue
            p_g = g * (3 * g - 1 ) / 2
            if p_i > p_g:
                ma = p_i
                mi = p_g
            else:
                ma = p_g
                mi = p_i
            if ma - mi < diff and quickCheck(ma - mi):
                if quickCheck(ma + mi):
                    print ('New couple ', ma, mi)
                    diff = ma - mi
cProfile.run('main()','script_perf')
perf = Stats('script_perf').sort_stats('time', 'calls').print_stats(10)
This runs in about 16 secs, which is better but still almost 20 times slower than C.
Now, I know C is better than Python for this kind of calculation, but what I would like to know is whether there is something I've missed (Python-wise, like a horribly slow function or such) that could have made this function faster.
Please note that I'm using Python 3.1.1, if this makes a difference.

Since quickCheck is being called close to 25,000,000 times, you might want to use memoization to cache the answers.
You can do memoization in C as well as Python. Things will be much faster in C, also.
You're computing 1/6 in each iteration of quickCheck. I'm not sure if this will be optimized out by Python, but if you can avoid recomputing constant values, you'll find things are faster. C compilers do this for you.
Doing things like if condition: return True; else: return False is silly -- and time consuming. Simply do return condition.
In Python 3.x, /2 must create floating-point values. You appear to need integers for this. You should be using //2 division. It will be closer to the C version in terms of what it does, but I don't think it's significantly faster.
Finally, Python is generally interpreted. The interpreter will always be significantly slower than C.
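To illustrate the last few points (integer arithmetic with //, and returning the condition directly), here is a minimal sketch of an integer-only pentagonal test. It is not the asker's quickCheck: the names is_pentagonal and pentagonal are made up for this example, math.isqrt needs Python 3.8+ (the question targets 3.1), and the test is stricter than quickCheck because it only accepts the root that gives a positive n.

```python
from math import isqrt  # Python 3.8+; an assumption, the question targets 3.1


def pentagonal(n):
    """n-th pentagonal number, computed with integer division only."""
    return n * (3 * n - 1) // 2


def is_pentagonal(x):
    """Return True if x == n*(3*n - 1)//2 for some integer n >= 1.

    Solving n*(3n - 1)/2 = x gives n = (1 + sqrt(1 + 24x)) / 6, so x is
    pentagonal exactly when 1 + 24x is a perfect square whose root r
    makes (1 + r) divisible by 6.
    """
    if x < 1:
        return False
    disc = 1 + 24 * x
    root = isqrt(disc)
    # Return the condition itself instead of if/else with True/False.
    return root * root == disc and (1 + root) % 6 == 0


print([x for x in range(1, 40) if is_pentagonal(x)])
```

Because everything stays in integers, there is no floating-point equality comparison to worry about.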

I made it go from ~7 seconds to ~3 seconds on my machine:
Precomputed i * (3 * i - 1 ) / 2 for each value; in your version it was recomputed many times
Cached calls to quickCheck
Removed if i == g by adding +1 to the range
Removed if p_i > p_g since p_i is always smaller than p_g
Also put the quickCheck-function inside main, to make all variables local (which have faster lookup than global).
I'm sure there are more micro-optimizations available.
from cmath import sqrt
def main():
    maxNumber = 5000
    diff = 1000
    p = {}
    quickCache = {}
    for i in range(maxNumber):
        p[i] = i * (3 * i - 1 ) / 2
    def quickCheck(n):
        if n in quickCache: return quickCache[n]
        partial_c = (sqrt(1 + 24 * (n)))/-6
        c = 1/6 + partial_c
        if int(c.real) == c.real:
            quickCache[n] = True
            return True
        c = c - 2*partial_c
        if int(c.real) == c.real:
            quickCache[n] = True
            return True
        quickCache[n] = False
        return False
    for i in range(1, maxNumber):
        mi = p[i]
        for g in range(i+1, maxNumber):
            ma = p[g]
            if ma - mi < diff and quickCheck(ma - mi) and quickCheck(ma + mi):
                print('New couple ', ma, mi)
                diff = ma - mi

Because the function p() is monotonically increasing, you can avoid comparing the values, as g > i implies p(g) > p(i). Also, the inner loop can be broken out of early, because p(g) - p(i) >= diff implies p(g+1) - p(i) >= diff.
Also for correctness, I changed the equality comparison in quickCheck to compare difference against an epsilon because exact comparison with floating point is pretty fragile.
On my machine this reduced the runtime to 7.8ms using Python 2.6. Using PyPy with JIT reduced this to 0.77ms.
This shows that before turning to micro-optimization it pays to look for algorithmic optimizations. Micro-optimizations make spotting algorithmic changes much harder for relatively tiny gains.
from math import sqrt
EPS = 0.00000001
def quickCheck(n):
    partial_c = sqrt(1 + 24*n) / -6
    c = 1.0/6 + partial_c  # 1.0/6 so the division is not truncated to 0 on Python 2
    if abs(int(c) - c) < EPS:
        return True
    c = 1.0/6 - partial_c
    if abs(int(c) - c) < EPS:
        return True
    return False
def p(i):
    return i * (3 * i - 1 ) / 2
def main(maxNumber):
    diff = 1000
    for i in range(1, maxNumber):
        for g in range(i+1, maxNumber):
            if p(g) - p(i) >= diff:
                break
            if quickCheck(p(g) - p(i)) and quickCheck(p(g) + p(i)):
                print('New couple ', p(g), p(i), p(g) - p(i))
                diff = p(g) - p(i)

There are some python compilers that might actually do a good bit for you. Have a look at Psyco.
Another way of dealing with math intensive programs is to rewrite the majority of the work into a math kernel, such as NumPy, so that heavily optimized code is doing the work, and your python code only guides the calculation. To get the most out of this strategy, avoid doing calculations in loops, and instead let the math kernel do all of that.
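As a rough sketch of what "let the math kernel do the loop" could look like for the pentagonal-pair search above: the outer loop over i stays in Python, while the inner loop over g becomes NumPy array arithmetic. The function names are invented for this example, and unlike the question's code it collects every pair rather than tracking a running diff, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
import numpy as np


def is_pentagonal(x):
    """Element-wise test on an integer array.

    x is pentagonal iff 24*x + 1 is a perfect square r*r with r % 6 == 5.
    """
    disc = 24 * x + 1
    root = np.rint(np.sqrt(disc)).astype(np.int64)
    return (root * root == disc) & (root % 6 == 5)


def pentagonal_pairs(max_number):
    """Return all (p_i, p_g) with i < g < max_number whose sum and
    difference are both pentagonal."""
    k = np.arange(1, max_number, dtype=np.int64)
    p = k * (3 * k - 1) // 2
    pairs = []
    for i in range(len(p) - 1):   # outer loop stays in Python
        rest = p[i + 1:]          # inner loop over g becomes array arithmetic
        hits = is_pentagonal(rest - p[i]) & is_pentagonal(rest + p[i])
        pairs.extend((int(p[i]), int(q)) for q in rest[hits])
    return pairs


print(pentagonal_pairs(2200))
```

The square-root test relies on float64 being exact enough for the values involved, which holds comfortably for max_number in the low thousands; for much larger inputs an integer square root would be safer.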

The other respondents have already mentioned several optimizations that will help. However, ultimately, you're not going to be able to match the performance of C in Python. Python is a nice tool, but since it's interpreted, it isn't really suited for heavy number crunching or other apps where performance is key.
Also, even in your C version, your inner loop could use quite a bit of help. Updated version:
for(i = 1; i < maxNumber; i++){
for(g = 1; g < maxNumber; g++){
if(i == g)
continue;
max=i;
min=g;
if (max<min) {
// xor swap - could use swap(p_max,p_min) instead.
max=max^min;
min=max^min;
max=max^min;
}
p_max=P(max);
p_min=P(min);
p_i=P(i);
p_g=P(g);
if(p_max - p_min < diff && fullCheck(p_max-p_min) && fullCheck(p_i + p_g)){
diff = p_max - p_min;
printf("We have a couple %llu %llu with diff %llu\n", p_i, p_g, diff);
}
}
}
///////////////////////////
float fullCheck(int number){
float den=sqrt(1+24*number)/6.0;
float check = 1/6.0 - den;
float check2 = 1/6.0 + den;
if(check == (int)check)
return check;
if(check2 == (int)check2)
return check2;
return 0.0;
}
Division, function calls, etc are costly. Also, calculating them once and storing in vars such as I've done can make things a lot more readable.
You might consider declaring P() as inline or rewrite as a preprocessor macro. Depending on how good your optimizer is, you might want to perform some of the arithmetic yourself and simplify its implementation.
Your implementation of fullCheck() would return what appear to be invalid results, since 1/6==0, where 1/6.0 would return 0.166... as you would expect.
This is a very brief take on what you can do to your C code to improve performance. This will, no doubt, widen the gap between C and Python performance.

20x difference between Python and C for a number crunching task seems quite good to me.
Check the usual performance differences for some CPU intensive tasks (keep in mind that the scale is logarithmic).
But look on the bright side, what's 1 minute of CPU time compared with the brain and typing time you saved writing Python instead of C? :-)

Related

How to speed up this numpy.arange loop?

In a python program, the following function is called about 20,000 times from another function that is called about 1000 times from yet another function that executes 30 times. Thus the total number of times this particular function is called is about 600,000,000. In python it takes more than two hours (perhaps much longer; I aborted the program without waiting for it to finish), while essentially the same task coded in Java takes less than 5 minutes. If I change the 20,000 above to 400 (keeping everything else in the rest of the program untouched), the total time drops to about 4 minutes (this means this particular function is the culprit). What can I do to speed up the Python version, or is it just not possible? No lists are manipulated inside this function (there are lists elsewhere in the whole program, but in those places I tried to use numpy arrays as far as possible). I understand that replacing python lists with numpy arrays speeds things up, but there are cases in my program (not in this particular function) where I must build a list iteratively, using append; and those must-have lists are lists of objects (not floats or ints), so numpy would be of little help even if I converted those lists of objects to numpy arrays.
def compute_something(arr):
    '''
    arr is received as a numpy array of ints and floats (I think python upcasts them to all floats,
    doesn’t it?).
    Inside this function, elements of arr are accessed using indexing (arr[0], arr[1], etc.), because
    each element of the array has its own unique use. It’s not that I need the array as a whole (as in
    arr**2 or sum(arr)).
    The arr elements are used in several simple arithmetic operations involving nothing costlier than
    +, -, *, /, and numpy.log(). There is no other loop inside this function; there are a few if’s though.
    Inside this function, use is made of constants imported from other modules (I doubt the
    importing, as in AnotherModule.x is expensive).
    '''
    for x in numpy.arange(float1, float2, float3):
        do stuff
    return a, b, c # Return a tuple of three floats
Edit:
Thanks for all the comments. Here’s the inside of the function (I made the variable names short for convenience). The ndarray array arr has only 3 elements in it. Can you please suggest any improvement?
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    max = 0.0
    for c in np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f):
        i = c / arr[2]
        m1 = Mod.A * np.log( (i / (arr[1] *Mod.d)) + (Mod.d/Mod.e))
        m2 = -Mod.B * np.log(1.0 - (i/ (arr[1] *Mod.d)) - (Mod.d /
                             Mod.e))
        V = arr[0] * (Mod.E - Mod.r * i / arr[1] - Mod.r * Mod.d -
                      m1 - m2)
        p = c * V /1000.0
        if p > max:
            max = p
            vmp = V
    pen = Mod.COEFF1 * (Mod.COEFF2 - max) if max < Mod.CONST else 0.0
    wo = (Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
          Mod.COEFF6 * arr[2])
    w = wo + pen
    return vmp, max, w
Python supports profiling of code (module cProfile). There is also the option of using the line_profiler tool to find the most expensive part of the code.
So you do not need to guess which part of the code is most expensive.
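For reference, a minimal cProfile sketch might look like the following. The function slow_sum is a made-up stand-in for the real hot spot, and profile_call is just a small helper written for this example; only the standard-library cProfile, pstats and io modules are used.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical hot spot, only here to give the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(slow_sum, 100000)
print(report)
```

The report lists call counts and time per function, which is usually enough to see whether the time goes into the loop body or into something it calls.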
In the code you present, the problem is the for loop, which causes many conversions between object types. If you use numpy, you can vectorize your calculation.
I tried to rewrite your code to vectorize the operations. You do not say what the Mod object is, but I hope it will work.
def compute_something(arr):
    a = Mod.b * arr[1] * arr[2] + Mod.c
    # start calculation on vectors instead of for loop
    c_arr = np.arange(a, arr[1] * arr[2] * (Mod.d - Mod.e), Mod.f)
    i_arr = c_arr/arr[2]
    m1_arr = Mod.A * np.log( (i_arr / (arr[1] *Mod.d)) + (Mod.d/Mod.e))
    m2_arr = -Mod.B * np.log(1.0 - (i_arr/ (arr[1] *Mod.d)) - (Mod.d /
                             Mod.e))
    V_arr = arr[0] * (Mod.E - Mod.r * i_arr / arr[1] - Mod.r * Mod.d -
                      m1_arr - m2_arr)
    p = c_arr * V_arr / 1000.0
    max_val = p.max() # change name to avoid conflict with builtin function
    max_ind = np.nonzero(p == max_val)[0][0]
    vmp = V_arr[max_ind]
    pen = Mod.COEFF1 * (Mod.COEFF2 - max_val) if max_val < Mod.CONST else 0.0
    wo = (Mod.COEFF3 * arr[1] * arr[0] + Mod.COEFF4 * abs(Mod.R5 - vmp) +
          Mod.COEFF6 * arr[2])
    w = wo + pen
    return vmp, max_val, w
I would suggest using range, as it is approximately 2 times faster:
import numpy as np
def python():
    for i in range(100000):
        pass
def numpy():
    for i in np.arange(100000):
        pass
from timeit import timeit
print(timeit(python, number=1000))
print(timeit(numpy, number=1000))
Output:
5.59282787179696
10.027646953771665

accelerated FFT to be invoked from Python Numba CUDA kernel

I need to calculate the Fourier transform of a 256 element float64 signal. The requirement is as such that I need to invoke these FFTs from inside a cuda.jitted section and it must be completed within 25usec. Alas cuda.jit-compiled functions do not allow to invoke external libraries => I wrote my own. Alas my single-core code is still way too slow (~250usec on a Quadro P4000). Is there a better way?
I created a single core FFT-function that gives correct results, but is alas 10x too slow. I don't understand how to make good use of multiple cores.
---fft.py
from numba import cuda, boolean, void, int32, float32, float64, complex128
import math, sys, cmath
def _transform_radix2(vector, inverse, out):
    n = len(vector)
    levels = int32(math.log(float32(n))/math.log(float32(2)))
    assert 2**levels==n # error: Length is not a power of 2
    #uncomment either Numba.Cuda or Numpy memory allocation, (intelligent conditional compilation??)
    exptable = cuda.local.array(1024, dtype=complex128)
    #exptable = np.zeros(1024, np.complex128)
    assert (n // 2) <= len(exptable) # error: FFT length > MAXFFTSIZE
    coef = complex128((2j if inverse else -2j) * math.pi / n)
    for i in range(n // 2):
        exptable[i] = cmath.exp(i * coef)
    for i in range(n):
        x = i
        y = 0
        for j in range(levels):
            y = (y << 1) | (x & 1)
            x >>= 1
        out[i] = vector[y]
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        for i in range(0, n, size):
            k = 0
            for j in range(i, i + halfsize):
                temp = out[j + halfsize] * exptable[k]
                out[j + halfsize] = out[j] - temp
                out[j] += temp
                k += tablestep
        size *= 2
    scale=float64(n if inverse else 1)
    for i in range(n):
        out[i]=out[i]/scale # the inverse requires a scaling
# now create the Numba.cuda version to be called by a GPU
gtransform_radix2 = cuda.jit(device=True)(_transform_radix2)
---test.py
from numba import cuda, void, float64, complex128, boolean
import cupy as cp
import numpy as np
import timeit
import fft
@cuda.jit(void(float64[:],boolean, complex128[:]))
def fftbench(y, inverse, FT):
    Y = cuda.local.array(256, dtype=complex128)
    for i in range(len(y)):
        Y[i]=complex128(y[i])
    fft.gtransform_radix2(Y, False, FT)
str='\nbest [%2d/%2d] iterations, min:[%9.3f], max:[%9.3f], mean:[%9.3f], std:[%9.3f] usec'
a=[127.734375 ,130.87890625 ,132.1953125 ,129.62109375 ,118.6015625
,110.2890625 ,106.55078125 ,104.8203125 ,106.1875 ,109.328125
,113.5 ,118.6640625 ,125.71875 ,127.625 ,120.890625
,114.04296875 ,112.0078125 ,112.71484375 ,110.18359375 ,104.8828125
,104.47265625 ,106.65625 ,109.53515625 ,110.73828125 ,111.2421875
,112.28125 ,112.38671875 ,112.7734375 ,112.7421875 ,113.1328125
,113.24609375 ,113.15625 ,113.66015625 ,114.19921875 ,114.5
,114.5546875 ,115.09765625 ,115.2890625 ,115.7265625 ,115.41796875
,115.73828125 ,116. ,116.55078125 ,116.5625 ,116.33984375
,116.63671875 ,117.015625 ,117.25 ,117.41015625 ,117.6640625
,117.859375 ,117.91015625 ,118.38671875 ,118.51171875 ,118.69921875
,118.80859375 ,118.67578125 ,118.78125 ,118.49609375 ,119.0078125
,119.09375 ,119.15234375 ,119.33984375 ,119.31640625 ,119.6640625
,119.890625 ,119.80078125 ,119.69140625 ,119.65625 ,119.83984375
,119.9609375 ,120.15625 ,120.2734375 ,120.47265625 ,120.671875
,120.796875 ,120.4609375 ,121.1171875 ,121.35546875 ,120.94921875
,120.984375 ,121.35546875 ,120.87109375 ,120.8359375 ,121.2265625
,121.2109375 ,120.859375 ,121.17578125 ,121.60546875 ,121.84375
,121.5859375 ,121.6796875 ,121.671875 ,121.78125 ,121.796875
,121.8828125 ,121.9921875 ,121.8984375 ,122.1640625 ,121.9375
,122. ,122.3515625 ,122.359375 ,122.1875 ,122.01171875
,121.91015625 ,122.11328125 ,122.1171875 ,122.6484375 ,122.81640625
,122.33984375 ,122.265625 ,122.78125 ,122.44921875 ,122.34765625
,122.59765625 ,122.63671875 ,122.6796875 ,122.6171875 ,122.34375
,122.359375 ,122.7109375 ,122.83984375 ,122.546875 ,122.25390625
,122.06640625 ,122.578125 ,122.7109375 ,122.83203125 ,122.5390625
,122.2421875 ,122.06640625 ,122.265625 ,122.13671875 ,121.8046875
,121.87890625 ,121.88671875 ,122.2265625 ,121.63671875 ,121.14453125
,120.84375 ,120.390625 ,119.875 ,119.34765625 ,119.0390625
,118.4609375 ,117.828125 ,117.1953125 ,116.9921875 ,116.046875
,115.16015625 ,114.359375 ,113.1875 ,110.390625 ,108.41796875
,111.90234375 ,117.296875 ,127.0234375 ,147.58984375 ,158.625
,129.8515625 ,120.96484375 ,124.90234375 ,130.17578125 ,136.47265625
,143.9296875 ,150.24609375 ,141. ,117.71484375 ,109.80859375
,115.24609375 ,118.44140625 ,120.640625 ,120.9921875 ,111.828125
,101.6953125 ,111.21484375 ,114.91015625 ,115.2265625 ,118.21875
,125.3359375 ,139.44140625 ,139.76953125 ,135.84765625 ,137.3671875
,141.67578125 ,139.53125 ,136.44921875 ,135.08203125 ,135.7890625
,137.58203125 ,138.7265625 ,154.33203125 ,172.01171875 ,152.24609375
,129.8046875 ,125.59375 ,125.234375 ,127.32421875 ,132.8984375
,147.98828125 ,152.328125 ,153.7734375 ,155.09765625 ,156.66796875
,159.0546875 ,151.83203125 ,138.91796875 ,138.0546875 ,140.671875
,143.48046875 ,143.99609375 ,146.875 ,146.7578125 ,141.15234375
,141.5 ,140.76953125 ,140.8828125 ,145.5625 ,150.78125
,148.89453125 ,150.02734375 ,150.70703125 ,152.24609375 ,148.47265625
,131.95703125 ,125.40625 ,123.265625 ,123.57421875 ,129.859375
,135.6484375 ,144.51171875 ,155.05078125 ,158.4453125 ,140.8125
,100.08984375 ,104.29296875 ,128.55078125 ,139.9921875 ,143.38671875
,143.69921875 ,137.734375 ,124.48046875 ,116.73828125 ,114.84765625
,113.85546875 ,117.45703125 ,122.859375 ,125.8515625 ,133.22265625
,139.484375 ,135.75 ,122.69921875 ,115.7734375 ,116.9375
,127.57421875]
y1 =cp.zeros(len(a), cp.complex128)
FT1=cp.zeros(len(a), cp.complex128)
for i in range(len(a)):
    y1[i]=a[i] #convert to complex to feed the FFT
r=1000
series=sorted(timeit.repeat("fftbench(y1, False, FT1)", number=1, repeat=r, globals=globals()))
series=series[0:r-5]
print(str % (len(series), r, 1e6*np.min(series), 1e6*np.max(series), 1e6*np.mean(series), 1e6*np.std(series)));
a faster implementation t<<25usec
The drawback of your algorithm is that, even on the GPU, it runs on a single core.
To understand how to design algorithms for Nvidia GPGPUs, I recommend looking at:
the CUDA C Programming Guide, and the numba documentation to apply the ideas in Python.
Moreover, to understand what is wrong with your code, I recommend using the Nvidia profiler.
The following parts of the answer explain how to apply the basics to your example.
Run multiple threads
To improve performance, you first need to launch multiple threads that can run in parallel. CUDA handles threads as follows:
Threads are grouped into blocks of n threads (n <= 1024).
Threads within the same block can be synchronized and have access to a (fast) common memory space called "shared memory".
You can run multiple blocks in parallel in a "grid", but you lose the synchronization mechanism between blocks.
The syntax to run multiple threads is the following:
fftbench[griddim, blockdim](y1, False, FT1)
To simplify, I will use only one block of size 256:
fftbench[1, 256](y1, False, FT1)
Memory
To improve GPU performance it's important to look at where the data is stored. There are three main memory spaces:
global memory: it's the "RAM" of your GPU; it's slow and has a high latency. This is where all your arrays are placed when you send them to the GPU.
shared memory: it's a small, fast-access memory; all the threads of a block have access to the same shared memory.
local memory: physically it's the same as global memory, but each thread accesses its own local memory.
Typically, if you use the same data multiple times, you should try to store it in shared memory to avoid the latency of global memory.
In your code, you can store exptable in shared memory:
exptable = cuda.shared.array(1024, dtype=complex128)
and if n is not too big, you may want to use a shared working array instead of using out:
working = cuda.shared.array(256, dtype=complex128)
Assign tasks to each thread
Of course, if you don't change your function, all threads will do the same job and it will just slow down your program.
In this example we will assign each thread to one cell of the array. To do so, we have to get the unique id of the thread within its block:
idx = cuda.threadIdx.x
Now we will be able to speed up the for loops; let's handle them one by one:
exptable = cuda.shared.array(1024, dtype=complex128)
...
for i in range(n // 2):
    exptable[i] = cmath.exp(i * coef)
Here is the goal: we want the first n/2 threads to fill this array, then all the threads will be able to use it.
So in this case just replace the for loop by a condition on the thread index:
if idx < n // 2:
    exptable[idx] = cmath.exp(idx * coef)
For the last two loops it's easier; each thread deals with one cell of the array:
for i in range(n):
    x = i
    y = 0
    for j in range(levels):
        y = (y << 1) | (x & 1)
        x >>= 1
    out[i] = vector[y]
becomes
x = idx
y = 0
for j in range(levels):
    y = (y << 1) | (x & 1)
    x >>= 1
working[idx] = vector[y]
and
for i in range(n):
    out[i]=out[i]/scale # the inverse requires a scaling
becomes
out[idx]=working[idx]/scale # the inverse requires a scaling
I use the shared array working but you can replace it by out if you want to use global memory.
Now, let's look at the while loop. We said that we want each thread to deal with only one cell of the array, so we can try to parallelize the two for loops inside.
...
    for i in range(0, n, size):
        k = 0
        for j in range(i, i + halfsize):
            temp = out[j + halfsize] * exptable[k]
            out[j + halfsize] = out[j] - temp
            out[j] += temp
            k += tablestep
...
To simplify, I will only use half of the threads: we take the first 128 threads and determine j as follows:
...
    if idx < 128:
        j = (idx%halfsize) + size*(idx//halfsize)
...
k is:
        k = tablestep*(idx%halfsize)
so we get the loop:
size = 2
while size <= n:
    halfsize = size // 2
    tablestep = n // size
    if idx < 128:
        j = (idx%halfsize) + size*(idx//halfsize)
        k = tablestep*(idx%halfsize)
        temp = working[j + halfsize] * exptable[k]
        working[j + halfsize] = working[j] - temp
        working[j] += temp
    size *= 2
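If you want to convince yourself that this index mapping is right without a GPU, one option is to emulate it on the CPU with NumPy. In the sketch below (the name fft_stagewise is made up for this example), the array idx stands in for the thread ids 0 .. n/2 - 1, and each pass of the while loop handles all "threads" at once, which plays the role of a barrier between stages. It only covers the forward transform and is not a replacement for the CUDA kernel.

```python
import numpy as np


def bit_reverse(x, levels):
    """Reverse the lowest `levels` bits of x (same loop as in the kernel)."""
    y = 0
    for _ in range(levels):
        y = (y << 1) | (x & 1)
        x >>= 1
    return y


def fft_stagewise(vector):
    """Forward radix-2 FFT using the per-thread index mapping from the answer."""
    n = len(vector)
    levels = n.bit_length() - 1
    assert 1 << levels == n, "length must be a power of 2"
    idx = np.arange(n // 2)                      # stand-in for cuda.threadIdx.x
    exptable = np.exp(-2j * np.pi * idx / n)
    working = np.array([vector[bit_reverse(i, levels)] for i in range(n)],
                       dtype=np.complex128)
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        j = (idx % halfsize) + size * (idx // halfsize)
        k = tablestep * (idx % halfsize)
        temp = working[j + halfsize] * exptable[k]
        working[j + halfsize] = working[j] - temp
        working[j] = working[j] + temp
        size *= 2                                # next stage: implicit barrier
    return working


signal = np.cos(np.arange(256) * 0.1) + 0.5 * np.arange(256)
print(np.allclose(fft_stagewise(signal), np.fft.fft(signal)))
```

Within one stage the j and j + halfsize index sets never overlap, so each butterfly touches its own two cells; the only cross-thread dependency is between stages.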
Synchronization
Last but not least, we need to synchronize all these threads. In fact, the program will not work without synchronization. On the GPU, threads may not run at the same time, so you can get issues when data is produced by one thread and used by another, for example:
exptable[0] is used by thread_2 before thread_0 stores its value
working[j + halfsize] is modified by another thread before you store it in temp
To prevent this we can use the function:
cuda.syncthreads()
All the threads in the same block will reach this line before any of them executes the rest of the code.
In this example, you need to synchronize at two points: after the working initialization and after each iteration of the while loop.
Your code then looks like:
def _transform_radix2(vector, inverse, out):
    n = len(vector)
    levels = int32(math.log(float32(n))/math.log(float32(2)))
    assert 2**levels==n # error: Length is not a power of 2
    exptable = cuda.shared.array(1024, dtype=complex128)
    working = cuda.shared.array(256, dtype=complex128)
    assert (n // 2) <= len(exptable) # error: FFT length > MAXFFTSIZE
    idx = cuda.threadIdx.x # unique id of this thread within the block
    coef = complex128((2j if inverse else -2j) * math.pi / n)
    if idx < n // 2:
        exptable[idx] = cmath.exp(idx * coef)
    x = idx
    y = 0
    for j in range(levels):
        y = (y << 1) | (x & 1)
        x >>= 1
    working[idx] = vector[y]
    cuda.syncthreads()
    size = 2
    while size <= n:
        halfsize = size // 2
        tablestep = n // size
        if idx < 128:
            j = (idx%halfsize) + size*(idx//halfsize)
            k = tablestep*(idx%halfsize)
            temp = working[j + halfsize] * exptable[k]
            working[j + halfsize] = working[j] - temp
            working[j] += temp
        size *= 2
        cuda.syncthreads()
    scale=float64(n if inverse else 1)
    out[idx]=working[idx]/scale # the inverse requires a scaling
I feel like your question is a good way to introduce some basics of GPGPU computing, and I tried to answer it in a didactic way. The final code is far from perfect and can be optimized a lot; I highly recommend reading the CUDA C Programming Guide if you want to learn more about GPU optimizations.

How to make python of exponential growth to code equivalent to c++?

In this code I am computing a numerical approximation of the solution of an ODE u'(tk)=u(tk)=uk and storing all the uk and tk values as shown below.
Code:
from numpy import linspace, zeros
def compute_u(u0,T,n):
    t = linspace(0,T,n+1)
    t[0] = 0
    u=zeros(n+1)
    u[0]= u0
    dt = T/float(n)
    for k in range(0, n, 1):
        u[k+1] = (1+dt)*u[k]
        t[k+1] = t[k] + dt
    return u, t
I am now trying to implement this code into c++ and I am facing a few rocks along the way. I am relatively new in C++ and I was wondering if anyone in this forum could point me to the right direction since python has functions that c++ does not such as linspace or zeros. Any input will be helpful.
Here you have linspace:
std::vector< float > linspace(float a, float b, uint32_t n)
{
    std::vector< float > result(n);
    float step = (b - a) / (float) (n - 1);
    for (uint32_t i = 0; i <= n - 2; i++) {
        result[i] = a + (float) i * step;
    }
    result.back() = b;
    return result;
}
Try writing zeros yourself.
Or a better solution: use Eigen, it has both functions.

how can i use dynamic programming to optimize this code

Daulat Ram is an affluent business man. After demonetization, IT raid was held at his accommodation in which all his money was seized. He is very eager to gain his money back, he started investing in certain ventures and earned out of them. On the first day, his income was Rs. X, followed by Rs. Y on the second day. Daulat Ram observed his growth as a function and wanted to calculate his income on the Nth day.
The function he found out was F_N = F_(N-1) + F_(N-2) + F_(N-1) × F_(N-2)
Given his income on day 0 and day 1, calculate his income on the Nth day (yeah Its that simple).
INPUT:
The first line of input consists of a single integer T denoting number of test cases.
Each of the next T lines consists of three integers F0, F1 and N respectively.
OUTPUT:
For each test case, print a single integer F_N; as the output can be large, calculate the answer modulo 10^9+7.
CONSTRAINTS:
1 ≤ T ≤ 10^5
0 ≤ F0, F1, N ≤ 10^9
def function(x1):
    if x1==2: return fnc__1+fnc__0*fnc__1+fnc__0
    elif x1==1: return fnc__1
    elif x1==0: return fnc__0
    return function(x1-1)+function(x1-2)*function(x1-1)+function(x1-2)
for i in range(int(input())): #input() is the no of test cases
    rwINput = input().split()
    fnc__0 =int(rwINput[0])
    fnc__1 = int(rwINput[1])
    print(function(int(rwINput[2])))
A simple way to optimize is to cache the results of your function. Python provides a mechanism for just that with its lru_cache. All you need to do is decorate your function with this:
from functools import lru_cache
@lru_cache()
def function(n, F0=1, F1=2):
    if n == 0:
        return F0
    elif n == 1:
        return F1
    else:
        f1 = function(n-1, F0, F1)
        f2 = function(n-2, F0, F1)
        return f1+f2 + f1*f2
You can tweak lru_cache a bit for your needs (for example through its maxsize argument). Note that it keeps strong references to the cached arguments and results until they are evicted.
test cases:
for i in range(7):
    print('{}: {:7d}'.format(i, function(i)))
prints:
0:       1
1:       2
2:       5
3:      17
4:     107
5:    1943
6:  209951
To get your answer modulo an integer (the modulus is not clear in your question) you can do this:
MOD = 10**9 + 7 # ???
@lru_cache()
def function(n, F0=1, F1=2):
    if n == 0:
        return F0
    elif n == 1:
        return F1
    else:
        f1 = function(n-1, F0, F1)
        f2 = function(n-2, F0, F1)
        return (f1+f2 + f1*f2) % MOD
You could just iterate: assign f1 to f0 and the result of the function to f1. Repeat this n times and the desired result is in f0:
MOD = 10**9 + 7
for _ in range(int(input())):
    f0, f1, n = (int(x) for x in input().split())
    for _ in range(n):
        f0, f1 = f1, (f0 + f1 + f0 * f1) % MOD
    print(f0)
Input:
8
1 2 0
1 2 1
1 2 2
1 2 3
1 2 4
1 2 5
1 2 6
10 13 100
Output:
1
2
5
17
107
1943
209951
276644752
Someone gave me this answer and it worked, but I don't know how. Complexity: O(log n).
#include <stdio.h>
#include <stdlib.h>
#define mod 1000000007
long long int power(long long int,long long int);
void mult(long long int[2][2],long long int[2][2]);
int main()
{
int test;
scanf("%d",&test);
while(test--)
{
int n;
int pp,p;
scanf("%d%d%d",&pp,&p,&n);
long long int A[2][2] = {{1,1},{1,0}};
n = n-1;
long long int B[2][2] = {{1,0},{0,1}};
while(n>0)
{
if(n%2==1)
mult(B,A);
n = n/2;
mult(A,A);
}
long long int result = ((power(pp+1,B[0][1])*power(p+1,B[0][0]))%mod - 1 + mod)%mod;
printf("%lld\n",result);
}
}
long long int power(long long int a,long long int b)
{
long long int result = 1;
while(b>0)
{
if(b%2==1)
result = (result*a)%mod;
a = (a*a)%mod;
b = b/2;
}
return result;
}
void mult(long long int A[2][2],long long int B[2][2])
{
long long int C[2][2];
C[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0];
C[0][1] = A[0][0]*B[0][1] + A[0][1]*B[1][1];
C[1][0] = A[1][0]*B[0][0] + A[1][1]*B[1][0];
C[1][1] = A[1][0]*B[0][1] + A[1][1]*B[1][1];
A[0][0] = C[0][0]%(mod-1);
A[0][1] = C[0][1]%(mod-1);
A[1][0] = C[1][0]%(mod-1);
A[1][1] = C[1][1]%(mod-1);
}
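One way to read the C code above: adding 1 to both sides of the recurrence gives F_N + 1 = (F_(N-1) + 1) × (F_(N-2) + 1), so G_N = F_N + 1 satisfies G_N = G_(N-1) × G_(N-2). Unrolling that gives G_N = G_1^fib(N) × G_0^fib(N-1), where fib is the Fibonacci sequence (fib(0) = 0, fib(1) = 1). The matrix power computes the two Fibonacci exponents, reduced modulo 10^9+6 by Fermat's little theorem, and power() does the modular exponentiation. The Python sketch below follows the same idea but is not a translation: it uses fast doubling instead of the 2x2 matrix, handles N = 0 explicitly, and the names fib_pair, income and income_loop are made up here. It assumes 0 ≤ F0, F1 ≤ 10^9 as in the constraints, so F0 + 1 and F1 + 1 are never multiples of the modulus.

```python
MOD = 10**9 + 7


def fib_pair(n, m):
    """Return (fib(n) % m, fib(n + 1) % m) by fast doubling."""
    if n == 0:
        return 0, 1 % m
    a, b = fib_pair(n // 2, m)          # a = fib(k), b = fib(k + 1), k = n // 2
    c = a * (2 * b - a) % m             # fib(2k)
    d = (a * a + b * b) % m             # fib(2k + 1)
    return (d, (c + d) % m) if n % 2 else (c, d)


def income(f0, f1, n):
    """F_n mod MOD in O(log n), via G_n = F_n + 1 = G_1^fib(n) * G_0^fib(n-1)."""
    if n == 0:
        return f0 % MOD
    # Exponents may be reduced modulo MOD - 1 (Fermat), bases are nonzero mod MOD.
    fib_prev, fib_n = fib_pair(n - 1, MOD - 1)
    g = pow(f0 + 1, fib_prev, MOD) * pow(f1 + 1, fib_n, MOD) % MOD
    return (g - 1) % MOD


def income_loop(f0, f1, n):
    """O(n) reference: iterate the recurrence directly, for cross-checking."""
    for _ in range(n):
        f0, f1 = f1, (f0 + f1 + f0 * f1) % MOD
    return f0 % MOD


print([income(1, 2, n) for n in range(7)])
print(income(10, 13, 100) == income_loop(10, 13, 100))
```

With N up to 10^9 and up to 10^5 test cases, the O(N) loop is too slow, which is why the logarithmic version matters here.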
I know this post is old, but I want to point out that an important issue has been overlooked: the function quickly reaches huge values, and only the result modulo 10^9+7 is required. The modulo of a sum or of a product can be computed from the sum or product of the modulos. So the only way to get a correct answer for a big N is to store the modulos, instead of the Fn!
Here is my view on how dynamic programming should be used. Dynamic programming is just about caching the results in order to avoid recomputing all sub-branches of the recursion tree. Storing the successive Fn is everything that's needed. If the algorithm only needs to be used once, you even don't have to store the whole array here: compute f0 and f1, and keep the last two computed values (with the modulo), to find the result with a simple loop. If the algorithm is run multiple times, and the result has still not been computed, you just need to retrieve the last two computed values (a variable to store the index of the last computed value would be useful) in order to restart from there.

Can I parallelize this small Python script with a global dict?

I have this problem that takes about 2.8 seconds in my MBA with Python3. Since at its core we have a caching dictionary, I figure that it doesn't matter which call hits the cache first, and so maybe I can get some gains from threading. I can't quite figure it out, though. This is a bit higher level than the questions I normally ask, but can someone walk me through the parallelization process for this problem?
import time
import threading
even = lambda n: n%2==0
next_collatz = lambda n: n//2 if even(n) else 3*n+1
cache = {1: 1}
def collatz_chain_length(n):
    if n not in cache: cache[n] = 1 + collatz_chain_length(next_collatz(n))
    return cache[n]
if __name__ == '__main__':
    valid = range(1, 1000000)
    for n in valid:
        # t = threading.Thread(target=collatz_chain_length, args=[n] )
        # t.start()
        collatz_chain_length(n)
    print( max(valid, key=cache.get) )
Or, if it is a bad candidate, why?
You won't get a good boost out of threading in Python if your workload is CPU intensive. That's because only one thread will actually be using the processor at a time due to the GIL (Global Interpreter Lock).
However, if your workload was I/O bound (waiting for responses from a network request for example), threads would give you a bit of a boost, because if your thread is blocked waiting for a network response, another thread can do useful work.
As HDN mentioned, using multiprocessing will help - this uses multiple Python interpreters to get the work done.
The way I would approach this is to divide the number of iterations by the number of processes you plan to create. For example if you create 4 processes, give each process a 1000000/4 slice of the work.
At the end you will need to aggregate the results of each process and apply your max() to get the result.
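The split-and-aggregate idea could look roughly like the sketch below. The helper names (`chain_length`, `longest_in_range`, `split_range`, `parallel_longest`) are invented for this example, and each worker keeps its own private cache, so values shared between slices are recomputed. Treat it as a starting point rather than a tuned solution.

```python
from multiprocessing import Pool


def chain_length(n, cache):
    """Collatz chain length of n (counting n and the final 1), memoised in cache."""
    path = []
    while n not in cache:
        path.append(n)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    length = cache[n]
    for m in reversed(path):       # fill in the cache on the way back
        length += 1
        cache[m] = length
    return length


def longest_in_range(bounds):
    """Return (n, length) for the longest chain starting in [lo, hi)."""
    lo, hi = bounds
    cache = {1: 1}                 # private to this worker process
    best = max(range(lo, hi), key=lambda n: chain_length(n, cache))
    return best, cache[best]


def split_range(start, stop, parts):
    """Split [start, stop) into up to `parts` contiguous, nearly equal slices."""
    step, extra = divmod(stop - start, parts)
    slices, lo = [], start
    for i in range(parts):
        hi = lo + step + (1 if i < extra else 0)
        if hi > lo:                # skip empty slices
            slices.append((lo, hi))
        lo = hi
    return slices


def parallel_longest(start, stop, workers=4):
    """Solve each slice in its own process, then take the overall maximum."""
    with Pool(workers) as pool:
        results = pool.map(longest_in_range, split_range(start, stop, workers))
    return max(results, key=lambda r: r[1])
```

Call it as `parallel_longest(1, 1000000)` from under an `if __name__ == '__main__':` guard, which multiprocessing needs on platforms that spawn fresh interpreters.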
Threading won't give you much in terms of performance gains because it won't get around the Global Interpreter Lock, which will only run one thread at any given moment. It might actually even slow you down because of the context switching.
If you want to leverage parallelization for performance purposes in Python, you're going to have to use multiprocessing to actually leverage more than one core at a time.
I managed to speed up your code about 16.5x on a single core; read on for details.
As said before, multi-threading gives no improvement in pure Python because of the Global Interpreter Lock.
Regarding multi-processing, there are two options: 1) implement a shared dictionary and read/write it directly from different processes; 2) cut the range of values into parts, solve each sub-range in a separate process, then take the maximum over all the processes' answers.
The first option will be very slow: in your code, reading and writing the dictionary is the main time-consuming operation, and a dictionary shared between processes will slow that down 5 times more, cancelling any gain from multiple cores.
The second option will give some improvement, but not a great one, because different processes will recompute the same values many times. It gives a considerable improvement only if you have very many cores or use many separate machines in a cluster.
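To see why sub-ranges recompute the same values, here is a small sketch that records every number visited by Collatz chains starting in two adjacent sub-ranges and intersects the two sets. The function name `visited_values` and the range bounds are arbitrary choices for illustration.

```python
def visited_values(lo, hi):
    """Set of all numbers touched by Collatz chains starting in [lo, hi)."""
    seen = {1}
    for n in range(lo, hi):
        while n not in seen:       # stop as soon as we join a known chain
            seen.add(n)
            n = n // 2 if n % 2 == 0 else 3 * n + 1
    return seen


low = visited_values(1, 5000)
high = visited_values(5000, 10000)
shared = low & high                # values both workers would have to compute

if __name__ == "__main__":
    print("lower half visits ", len(low), "values")
    print("upper half visits ", len(high), "values")
    print("computed twice:   ", len(shared), "values")
```

Every value in `shared` is one that two independent processes, each with its own cache, would both have to work out from scratch.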
I decided to improve your task in another way (option 3): use Numba and apply other optimizations. My solution is then also suitable for option 2 (parallelization over sub-ranges).
Numba is a just-in-time compiler and optimizer: it translates a subset of Python code to optimized machine code through LLVM. Numba can often give a 10x-100x speedup.
To run the code you just need to install Numba with pip install numba (at the time of writing Numba supports Python versions <= 3.8, and support for 3.9 should arrive soon).
All the improvements together gave a 16.5x speedup on a single core (e.g. if your algorithm took 64 seconds for some range, my code takes about 4 seconds).
I had to rewrite your code. The algorithm and idea are the same as yours, but I made the algorithm non-recursive (because Numba doesn't deal well with recursion) and used a list instead of a dictionary for values that are not too large.
My single-core Numba-based version may sometimes use too much memory. That is only because of the cs parameter, which controls the threshold below which a list is used instead of a dictionary. Currently cs is set to stop * 10 (search for it in the code); if you don't have much memory, set it to e.g. stop * 2 (but not less than stop * 1). I have 16 GB of memory and the program runs correctly even for an upper limit of 64000000.
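For a rough feel of what cs costs, the sketch below estimates the size of the list cache alone. It assumes 8 bytes per entry, which holds for the C++ `std::vector<int64_t>`. For the Numba list this is only an assumption, and the dictionary part and any temporary copies are ignored. The helper name `cache_lo_bytes` is made up for this example.

```python
def cache_lo_bytes(stop, factor=10, bytes_per_entry=8):
    """Estimated size in bytes of a list cache holding stop * factor int64 entries."""
    return stop * factor * bytes_per_entry


if __name__ == "__main__":
    stop = 64_000_000
    for factor in (10, 2, 1):
        gb = cache_lo_bytes(stop, factor) / 1e9
        print(f"cs = stop * {factor}: about {gb:.2f} GB for cache_lo")
```

Under that assumption, stop = 64000000 with cs = stop * 10 needs about 5.12 GB for the list, which is consistent with the run fitting on a 16 GB machine.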
Besides the Numba code I also implemented a C++ solution. It turned out to be about as fast as Numba, which means Numba did a good job! The C++ code is located after the Python code.
I measured the timings of your algorithm (solve_py()) and mine (solve_nm()) and compared them. The timings are listed after the code.
Just for reference, I also made a multi-core version based on my Numba solution, but it gave no improvement over the single-core version; there was even a slowdown. That happened because the multi-core version computed the same values many times. Maybe a multi-machine version would give a noticeable improvement, but probably not a multi-core one.
The Try-it-online links below only allow small ranges to be run, because of the limited memory on those free online servers!
Try it online!
import time, numba

def solve_py(start, stop):
    # Original recursive, dictionary-cached algorithm from the question.
    even = lambda n: n%2==0
    next_collatz = lambda n: n//2 if even(n) else 3*n+1
    cache = {1: 1}
    def collatz_chain_length(n):
        if n not in cache: cache[n] = 1 + collatz_chain_length(next_collatz(n))
        return cache[n]
    for n in range(start, stop):
        collatz_chain_length(n)
    r = max(range(start, stop), key = cache.get)
    return r, cache[r]

@numba.njit(cache = True, locals = {'n': numba.int64, 'l': numba.int64, 'zero': numba.int64})
def solve_nm(start, stop):
    # cs is the threshold: values below it are cached in a list, the rest in a dict.
    zero, l, cs = 0, 0, stop * 10
    ns = [zero] * 10000          # explicit stack of chain values not cached yet
    cache_lo = [zero] * cs       # list cache for n < cs (0 means "unknown")
    cache_lo[1] = 1
    cache_hi = {zero: zero}      # dict cache for n >= cs
    for n in range(start, stop):
        if cache_lo[n] != 0:
            continue
        nsc = 0
        # Walk the chain until a cached value is found.
        while True:
            if n < cs:
                cg = cache_lo[n]
            else:
                cg = cache_hi.get(n, zero)
            if cg != 0:
                l = 1 + cg
                break
            ns[nsc] = n
            nsc += 1
            n = (n >> 1) if (n & 1) == 0 else 3 * n + 1
        # Unwind the stack, storing the length of every value visited.
        for i in range(nsc - 1, -1, -1):
            if ns[i] < cs:
                cache_lo[ns[i]] = l
            else:
                cache_hi[ns[i]] = l
            l += 1
    maxn, maxl = 0, 0
    for k in range(start, stop):
        v = cache_lo[k]
        if v > maxl:
            maxn, maxl = k, v
    return maxn, maxl

if __name__ == '__main__':
    solve_nm(1, 100000) # heat-up, precompile numba
    for stop in [1000000, 2000000, 4000000, 8000000, 16000000, 32000000, 64000000]:
        tr, resr = None, None
        for is_nm in [False, True]:
            # The pure-Python version is skipped for the largest limits.
            if stop > 16000000 and not is_nm:
                continue
            tb = time.time()
            res = (solve_nm if is_nm else solve_py)(1, stop)
            te = time.time()
            print(('py', 'nm')[is_nm], 'limit', stop, 'time', round(te - tb, 2), 'secs', end = '')
            if not is_nm:
                resr, tr = res, te - tb
                print(', n', res[0], 'len', res[1])
            else:
                if tr is not None:
                    print(', boost', round(tr / (te - tb), 2))
                    assert resr == res, (resr, res)
                else:
                    print(', n', res[0], 'len', res[1])
Output:
py limit 1000000 time 3.34 secs, n 837799 len 525
nm limit 1000000 time 0.19 secs, boost 17.27
py limit 2000000 time 6.72 secs, n 1723519 len 557
nm limit 2000000 time 0.4 secs, boost 16.76
py limit 4000000 time 13.47 secs, n 3732423 len 597
nm limit 4000000 time 0.83 secs, boost 16.29
py limit 8000000 time 27.32 secs, n 6649279 len 665
nm limit 8000000 time 1.68 secs, boost 16.27
py limit 16000000 time 55.42 secs, n 15733191 len 705
nm limit 16000000 time 3.48 secs, boost 15.93
nm limit 32000000 time 7.38 secs, n 31466382 len 706
nm limit 64000000 time 16.83 secs, n 63728127 len 950
The C++ version of the same algorithm as the Numba code is below:
Try it online!
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <tuple>
#include <iostream>
#include <stdexcept>
#include <chrono>

typedef int64_t i64;

static std::tuple<i64, i64> Solve(i64 start, i64 stop) {
    i64 cs = stop * 10, n = 0, l = 0, nsc = 0;
    std::vector<i64> cache_lo(cs), ns(10000);
    cache_lo[1] = 1;
    std::unordered_map<i64, i64> cache_hi;
    for (i64 i = start; i < stop; ++i) {
        if (cache_lo[i] != 0)
            continue;
        n = i;
        nsc = 0;
        while (true) {
            i64 cg = 0;
            if (n < cs)
                cg = cache_lo[n];
            else {
                auto it = cache_hi.find(n);
                if (it != cache_hi.end())
                    cg = it->second;
            }
            if (cg != 0) {
                l = 1 + cg;
                break;
            }
            ns.at(nsc) = n;
            ++nsc;
            n = (n & 1) ? 3 * n + 1 : (n >> 1);
        }
        for (i64 i = nsc - 1; i >= 0; --i) {
            i64 n = ns[i];
            if (n < cs)
                cache_lo[n] = l;
            else
                cache_hi[n] = l;
            ++l;
        }
    }
    i64 maxn = 0, maxl = 0;
    for (size_t i = start; i < stop; ++i)
        if (cache_lo[i] > maxl) {
            maxn = i;
            maxl = cache_lo[i];
        }
    return std::make_tuple(maxn, maxl);
}

int main() {
    try {
        for (auto stop: std::vector<i64>({1000000, 2000000, 4000000, 8000000, 16000000, 32000000, 64000000})) {
            auto tb = std::chrono::system_clock::now();
            auto r = Solve(1, stop);
            auto te = std::chrono::system_clock::now();
            std::cout << "cpp limit " << stop
                << " time " << double(std::chrono::duration_cast<std::chrono::milliseconds>(te - tb).count()) / 1000.0 << " secs"
                << ", n " << std::get<0>(r) << " len " << std::get<1>(r) << std::endl;
        }
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}
Output:
cpp limit 1000000 time 0.17 secs, n 837799 len 525
cpp limit 2000000 time 0.357 secs, n 1723519 len 557
cpp limit 4000000 time 0.757 secs, n 3732423 len 597
cpp limit 8000000 time 1.571 secs, n 6649279 len 665
cpp limit 16000000 time 3.275 secs, n 15733191 len 705
cpp limit 32000000 time 7.112 secs, n 31466382 len 706
cpp limit 64000000 time 17.165 secs, n 63728127 len 950
