How do I write C++ code equivalent to this Python exponential-growth code?

In this code I am computing a numerical approximation of the solution of the ODE u'(t) = u(t), where uk approximates u(tk), and storing all the uk and tk values as shown below.
Code:
from numpy import linspace, zeros
def compute_u(u0,T,n):
    t = linspace(0,T,n+1)
    t[0] = 0
    u=zeros(n+1)
    u[0]= u0
    dt = T/float(n)
    for k in range(0, n, 1):
        u[k+1] = (1+dt)*u[k]
        t[k+1] = t[k] + dt
    return u, t
I am now trying to implement this code in C++ and I am running into a few obstacles along the way. I am relatively new to C++, and I was wondering if anyone on this forum could point me in the right direction, since Python has functions that C++ does not, such as linspace or zeros. Any input will be helpful.

Here you have linspace:
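#include <cstdint>
#include <vector>
// Returns n evenly spaced values from a to b inclusive, like numpy.linspace.
std::vector< float > linspace(float a, float b, uint32_t n)
{
    std::vector< float > result(n);
    // Guard small n: with unsigned n, n - 1 and n - 2 would wrap around.
    if (n == 0) { return result; }
    if (n == 1) { result[0] = a; return result; }
    float step = (b - a) / (float) (n - 1);
    for (uint32_t i = 0; i + 1 < n; i++) {
        result[i] = a + (float) i * step;
    }
    result.back() = b;  // set the endpoint exactly instead of accumulating rounding error
    return result;
}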
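Try writing zeros yourself (hint: std::vector< float >(n) already sets every element to zero).
Or a better solution: use Eigen, which has both (for example Eigen::VectorXf::LinSpaced and Eigen::VectorXf::Zero).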
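Before porting, it can help to rewrite the Python version without NumPy, so that every line has an obvious C++ counterpart (a list maps onto a std::vector). The following is only a sketch under that assumption; the pure-Python linspace and zeros helpers are illustrative stand-ins, not NumPy's implementations, and n is assumed to be at least 1.

```python
def linspace(a, b, n):
    """n evenly spaced points from a to b inclusive (assumes n >= 2)."""
    step = (b - a) / (n - 1)
    pts = [a + i * step for i in range(n - 1)]
    pts.append(b)  # set the endpoint exactly
    return pts


def zeros(n):
    """A list of n floating-point zeros."""
    return [0.0] * n


def compute_u(u0, T, n):
    """Forward Euler for u'(t) = u(t) on [0, T] with n steps."""
    t = linspace(0.0, T, n + 1)
    u = zeros(n + 1)
    u[0] = u0
    dt = T / n
    for k in range(n):
        u[k + 1] = (1 + dt) * u[k]
    return u, t


if __name__ == "__main__":
    u, t = compute_u(1.0, 1.0, 4)
    for tk, uk in zip(t, u):
        print(tk, uk)
```

In C++ the two helpers become functions returning a vector, and compute_u can return both vectors through a std::pair or a small struct.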

Related

Why does the same algorithm result in different outputs in C++ & Python?

I am running a small code in which there are periodic boundary conditions i.e.,for point 0 the left point is the last point and for the last point zeroth point is the right point. When I run the same code in Python and C++, the answer I am getting is very different.
Python Code
import numpy as np
c= [0.467894,0.5134679,0.5123,0.476489,0.499764,0.564578]
n= len(c)
Δx= 1.0
A= 1.0
M = 1.0
K=1.0
Δt= 0.05
def delf(comp):
    ans = 2*A*comp*(1-comp)*(1-2*comp)
    return ans
def lap(e,v,w):
    laplace = (w -2*v+ e) / (Δx**2)
    return laplace
for t in range(1000000):
    μ = np.zeros(n)
    for i in range(n):
        ans1= delf(c[i])
        east= i+1
        west= i-1
        if i==0:
            west= i-1+n
        if i== (n-1):
            east= i+1-n
        ans2= lap(c[east], c[i], c[west])
        μ[i] = ans1 - K* ans2
    dc_dt= np.zeros(n)
    for j in range(n):
        east= j+1
        west= j-1
        if j==0:
            west= j-1+n
        if j== (n-1):
            east= j+1-n
        ans= lap(μ[east], μ[j], μ[west])
        dc_dt[j] = M* ans
    for p in range(n):
        c[p] = c[p] + Δt * dc_dt[p]
print(c)
The output in Python 3.7.6 version is
[0.5057488166795907, 0.5057488166581386, 0.5057488166452102,
0.5057488166537337, 0.5057488166751858, 0.5057488166881142]
C++ Code
#include <iostream>
using namespace std ;
const float dx =1.0;
const float dt =0.05;
const float A= 1.0;
const float M=1.0;
const float K=1.0;
const int n = 6 ;
float delf(float comp){
float answer = 0.0;
answer= 2 * A* comp * (1-comp) * (1-2*comp);
return answer; }
float lap(float e, float v, float w){
float laplacian= 0.0 ;
laplacian = (e - 2*v +w) / (dx *dx);
return laplacian; }
int main(){
float c[n]= {0.467894,0.5134679,0.5123,0.476489,0.499764,0.564578};
for(int t=0; t<50;++t){
float mu[n];
for(int k=0; k<n; ++k){
int east, west =0 ;
float ans1,ans2 = 0;
ans1= delf(c[k]);
if (k ==0){ west = k-1+n; }
else{ west = k-1; }
if (k== (n-1)) { east = k+1-n; }
else{ east= k+1; }
ans2= lap(c[east], c[k], c[west]);
mu[k] = ans1 - K*ans2;
}
float dc_dt[n];
for(int p=0; p<n; ++p){
float ans3= 0;
int east, west =0 ;
if (p ==0){ west = p-1+n; }
else{ west = p-1;}
if (p== (n-1)) { east = p+1-n; }
else{ east= p+1; }
ans3= lap(mu[east], mu[p], mu[west]);
dc_dt[p] = M* ans3;
}
for(int q=0; q<n; ++q){
c[q] = c[q] + dt* dc_dt[q];
}
}
for(int i=0; i<n; ++i){
cout<< c[i]<< " ";
}
return 0;
}
Output in C++ is
0.506677 0.504968 0.50404 0.50482 0.506528 0.507457
When I iterate for a small number of steps, say t<1000, there is no significant difference in the outputs, but I am supposed to do this calculation for a large number of iterations (on the order of 10^7), and there the difference in output is very large.

I took your code, added the missing closing bracket of the large "for" loop and also changed the length from "50" to "1000000" as in the python version.
Then I replaced all "float" with "double" and the resulting output is:
0.505749 0.505749 0.505749 0.505749 0.505749 0.505749
So implementing the same code in Python and in C++ does give the same result, but the types matter. A Python 3 "float" is usually implemented as a C "double" (see https://docs.python.org/3/library/stdtypes.html), so the C++ version with "float" was computing in single precision while Python was computing in double precision. Integers, by contrast, are implemented in a very different way in Python 3 than in C++ or almost any other language.
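To see the effect without compiling anything, single precision can be emulated from Python by round-tripping values through the struct module. This is only a sketch: the helper names are made up for illustration, and it repeatedly adds the time step 0.05 rather than running the full simulation, just to show how rounding after every operation accumulates.

```python
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single):
    """Add step to a running total n times, optionally rounding to float32."""
    total = 0.0
    if single:
        step = to_f32(step)
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)  # a C++ float is rounded after every operation
    return total


n = 100_000
exact = 5000.0  # 0.05 * 100000 in exact arithmetic
err64 = abs(accumulate(0.05, n, single=False) - exact)
err32 = abs(accumulate(0.05, n, single=True) - exact)

if __name__ == "__main__":
    print("0.05 stored as float32:", to_f32(0.05))
    print("double error:", err64)
    print("float  error:", err32)
```

The single-precision error is many orders of magnitude larger, which is the same mechanism that makes the "float" and "double" runs drift apart over millions of time steps.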
Fun fact
The simplest Python program that you will have major trouble reproducing in C++ or most other languages is something like
myInt=10000000000000000000000000000000345365753523466666
myInt = myInt*13 + 1
print (myInt)
since Python can work with arbitrarily large integers (until your entire computer memory is filled). The corresponding C++ program
#include <iostream>
int main(){
long int myInt = 10000000000000000000000000000000345365753523466666;
myInt = myInt*13 + 1;
std::cout << myInt << std::endl;
return 0;
}
will simply report "warning: integer constant is too large for its type", and the value will overflow and print a wrong result.

Cython Memoryview Seg Fault

I am running into a segmentation fault when trying to use Cython's memoryview. This is my code:
def fock_build_init_with_inputs(tei_ints):
    # set the number of orbitals
    norb = tei_ints.shape[0]
    # get memory view of TEIs
    cdef double[:,:,:,::1] tei_memview = tei_ints
    # get index pairs
    prep_ipss_serial(norb, &tei_memview[0,0,0,0])
void prep_ipss_serial(const int n, const double * const tei) {
int p, q, r, s, np;
double maxval;
const double thresh = 1.0e-9;
// first we count the number of index pairs with above-threshold integrals
np = 0;
for (q = 0; q < n; q++)
for (p = q; p < n; p++) {
maxval = 0.0;
for (s = 0; s < n; s++)
for (r = s; r < n; r++) {
maxval = fmax( maxval, fabs( tei[ r + n*s + n*n*p + n*n*n*q ] ) );
}
if ( maxval > thresh )
np++;
}
ipss_np = np;
When I call the first function with numpy.zeros([n,n,n,n]) as input, I run into a segmentation fault once n exceeds a certain number (212). Does anyone know what is causing this problem and how to resolve it?
Thanks,
Luning

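This looks like a 32-bit integer overflow: the index r + n*s + n*n*p + n*n*n*q is computed in int, and at this size n^4 is right at the edge of what a 32-bit signed integer can hold (213*213*213*213 = 2,058,346,161 against a maximum of 2,147,483,647; the limit is crossed at n = 216). You should use 64-bit integers as your indexes (long on most 64-bit Unix platforms, or more explicitly int64_t).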
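A quick way to check where the index expression leaves the 32-bit range is to evaluate it with Python's unbounded integers. This is a sketch that assumes a 32-bit signed int on the C side; wrap_int32 and max_flat_index are illustrative helpers, and the wraparound shown is what typically happens in practice (signed overflow is formally undefined behaviour in C).

```python
INT32_MAX = 2**31 - 1


def wrap_int32(x):
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (x + 2**31) % 2**32 - 2**31


def max_flat_index(n):
    """Largest value of r + n*s + n*n*p + n*n*n*q when r, s, p, q are all n - 1."""
    m = n - 1
    return m + n * m + n * n * m + n * n * n * m


if __name__ == "__main__":
    for n in (212, 213, 215, 216):
        idx = max_flat_index(n)
        print(n, idx, "overflows" if idx > INT32_MAX else "fits", wrap_int32(idx))
```

Once the index wraps it becomes negative, so the read lands far outside the array, which is a typical way to get a segmentation fault. An n x n x n x n array of doubles also needs 8*n^4 bytes (over 16 GB at n = 213), so available memory is worth checking too.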
Why are you converting your memoryview to a pointer though? You won't have gained much speed, and you'll have lost any information about the shape (for example, you are assuming that all the dimensions are the same), whereas you could let Cython handle the multi-dimensional indexing for you. It would be much better to make the tei argument a memoryview rather than a pointer.

Efficient way to read a set of 3 channel images from Python into a two dimensional array to be used in C

I am working on a project involving object detection through deep learning, with the underlying detection code written in C. Due to the requirements of the project, this code has a Python wrapper around it, which interfaces with the required C functions through ctypes. Images are read from Python, and then transferred into C to be processed as a batch.
In its current state, the code is very unoptimized: the images (640x360x3 each) are read using cv2.imread then stacked into a numpy array. For example, for a batch size of 16, the dimensions of this array are (16,360,640,3). Once this is done, a pointer to this array is passed through ctypes into C where the array is parsed, pixel values are normalized and rearranged into a 2D array. The dimensions of the 2D array are 16x691200 (16x(640*360*3)), arranged as follows.
row [0]: Image 0: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
row [1]: Image 1: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
.
.
row [15]: Image 15: (B)r0(B)r1(B)r2.... (G)r0(G)r1(G)r2.... (R)r0(R)r1(R)r2....
The C code for doing this currently looks like this, where the pixel values are accessed through strides and arranged sequentially per image. nb is the total number of images in the batch (usually 16); h, w, c are 360,640 and 3 respectively.
matrix ndarray_to_matrix(unsigned char* src, long* shape, long* strides)
{
int nb = shape[0];
int h = shape[1];
int w = shape[2];
int c = shape[3];
matrix X = make_matrix(nb, h*w*c);
int step_b = strides[0];
int step_h = strides[1];
int step_w = strides[2];
int step_c = strides[3];
int b, i, j, k;
int index1, index2 = 0;
for(b = 0; b < nb ; ++b) {
for(i = 0; i < h; ++i) {
for(k= 0; k < c; ++k) {
for(j = 0; j < w; ++j) {
index1 = k*w*h + i*w + j;
index2 = step_b*b + step_h*i + step_w*j + step_c*k;
X.vals[b][index1] = src[index2]/255.;
}
}
}
}
return X;
}
And the corresponding Python code that calls this function: (array is the original numpy array)
for i in range(start, end):
    imgName = imgDir + '/' + allImageName[i]
    img = cv2.imread(imgName, 1)
    batchImageData[i-start,:,:] = img[:,:]
data = batchImageData.ctypes.data_as(POINTER(c_ubyte))
resmatrix = self.ndarray_to_matrix(data, batchImageData.ctypes.shape, batchImageData.ctypes.strides)
As of now, this ctypes implementation takes about 35 ms for a batch of 16 images. I'm working on a very FPS critical image processing pipeline, so is there a more efficient way of doing these operations? Specifically:
Can I read the image directly as a 'strided' one dimensional array in Python from disk, thus avoiding the iterative access and copying?
I have looked into numpy operations such as:
np.ascontiguousarray(img.transpose(2,0,1).flat, dtype=float)/255. which should achieve something similar, but this is actually taking more time possibly because of it being called in Python.
Would Cython help anywhere during the read operation?
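For reference, the rearrangement that the C loop performs for one image (interleaved B, G, R values per pixel turned into three consecutive planes, scaled to [0, 1]) can be sketched in pure Python. This assumes a contiguous row-major input, whereas the C code reads the strides passed in from NumPy; the function name is illustrative only, and the loop is meant to document the index arithmetic, not to be fast.

```python
def hwc_to_chw_normalized(src, h, w, c):
    """Reorder a flat (row, column, channel) byte sequence into channel planes.

    src is assumed to hold h*w*c values in contiguous row-major order.
    """
    out = [0.0] * (h * w * c)
    for i in range(h):
        for j in range(w):
            for k in range(c):
                # same destination index as index1 = k*w*h + i*w + j in the C code
                out[k * w * h + i * w + j] = src[(i * w + j) * c + k] / 255.0
    return out


if __name__ == "__main__":
    # One row, two pixels, three channels: (10, 20, 30) and (40, 50, 60)
    print(hwc_to_chw_normalized([10, 20, 30, 40, 50, 60], 1, 2, 3))
```

On a NumPy array the same reordering is what img.transpose(2,0,1) followed by a contiguous copy produces.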

Regarding the ascontiguousarray method, I'm assuming that it's pretty slow because Python has to copy the data in memory to return a C-contiguous array.
EDIT 1:
I saw this answer, apparently openCV's imread function should already return a contiguous array.
I am not very familiar with ctypes, but I happen to use the pybind11 library and can only recommend it. It supports Python's buffer protocol, which lets you interact with Python data with almost no overhead.
I've answered a question explaining how to pass a numpy array from Python to C/C++, do something dummy to it in C++ and return a dynamically created array back to Python.
EDIT 2 : I've added a simple example that receives a NumPy array, sends it to C and prints it from C. You can find it here. Hope it helps!
EDIT 3 :
To answer your last comment, yes you can definitely do that.
You could modify your code to (1) instantiate a 2D numpy array in C++, (2) pass its reference to the data to your C function that will modify it instead of declaring a Matrix and (3) return that instance to Python by reference.
Your function would become:
void ndarray_to_matrix(unsigned char* src, double * x, long* shape, long* strides)
{
int nb = shape[0];
int h = shape[1];
int w = shape[2];
int c = shape[3];
int step_b = strides[0];
int step_h = strides[1];
int step_w = strides[2];
int step_c = strides[3];
int b, i, j, k;
int index1, index2 = 0;
for(b = 0; b < nb ; ++b) {
for(i = 0; i < h; ++i) {
for(k= 0; k < c; ++k) {
for(j = 0; j < w; ++j) {
index1 = k*w*h + i*w + j;
index2 = step_b*b + step_h*i + step_w*j + step_c*k;
x[b*h*w*c + index1] = src[index2]/255.; /* write into the flat output buffer; the matrix X no longer exists here */
}
}
}
}
}
And you'd add, in your C++ wrapper code
// Instantiate the output array, assuming we know b, h, c,w
py::array_t<double> x = py::array_t<double>(b*h*c*w);
py::buffer_info bufx = x.request();
double*ptrx = (double *) bufx.ptr;
// Call to your C function with ptrx as input
ndarray_to_matrix(src, ptrx, shape, strides);
// now reshape x
x.reshape({b, h*c*w});
Do not forget to modify the prototype of the C++ wrapper function to return a numpy array like:
py::array_t<double> read_matrix(...){}...
This should work, I didn't test it though :)

Is there an equivalent to a nested recursive function in C?

First of all, I know that nested functions are not supported by the C standard.
However, it's often very useful, in other languages, to define an auxiliary recursive function that will make use of data provided by the outer function.
Here is an example, computing the number of solutions of the N-queens problem, in Python. It's easy to write the same in Lisp, Ada or Fortran for instance, which all allow some kind of nested function.
def queens(n):
    a = list(range(n))
    u = [True]*(2*n - 1)
    v = [True]*(2*n - 1)
    m = 0
    def sub(i):
        nonlocal m
        if i == n:
            m += 1
        else:
            for j in range(i, n):
                p = i + a[j]
                q = i + n - 1 - a[j]
                if u[p] and v[q]:
                    u[p] = v[q] = False
                    a[i], a[j] = a[j], a[i]
                    sub(i + 1)
                    u[p] = v[q] = True
                    a[i], a[j] = a[j], a[i]
    sub(0)
    return m
Now my question: is there a way to do something like this in C? I would think of two solutions: using globals or passing data as parameters, but they both look rather unsatisfying.
There is also a way to write this as an iterative program, but it's clumsy: actually, I first wrote the iterative solution in Fortran 77 for Rosetta Code and then wanted to sort out this mess. Fortran 77 does not have recursive functions.
For those who wonder, the function manages the NxN board as a permutation of [0, 1 ... N-1], so that queens are alone on lines and columns. The function is looking for all permutations that are also solutions of the problem, starting to check the first column (actually nothing to check), then the second, and recursively calling itself only when the first i columns are in a valid configuration.

Of course. You need to simulate the special environment in use by your nested function, as static variables on the module level. Declare them above your nested function.
To not mess things up, you put this whole thing into a separate module.

Editor's Note: This answer was moved from the content of a question edit, it is written by the Original Poster.
Thanks all for the advice. Here is a solution using a structure passed as an argument. This is roughly equivalent to what gfortran and gnat do internally to deal with nested functions. The argument i could also be passed in the structure, by the way.
The inner function is declared static so as to help compiler optimizations. If it's not recursive, the code can then be integrated to the outer function (tested with GCC on a simple example), since the compiler knows the function will not be called from the "outside".
#include <stdio.h>
#include <stdlib.h>
struct queens_data {
int n, m, *a, *u, *v;
};
static void queens_sub(int i, struct queens_data *e) {
if(i == e->n) {
e->m++;
} else {
int p, q, j;
for(j = i; j < e->n; j++) {
p = i + e->a[j];
q = i + e->n - 1 - e->a[j];
if(e->u[p] && e->v[q]) {
int k;
e->u[p] = e->v[q] = 0;
k = e->a[i];
e->a[i] = e->a[j];
e->a[j] = k;
queens_sub(i + 1, e);
e->u[p] = e->v[q] = 1;
k = e->a[i];
e->a[i] = e->a[j];
e->a[j] = k;
}
}
}
}
int queens(int n) {
int i;
struct queens_data s;
s.n = n;
s.m = 0;
s.a = malloc((5*n - 2)*sizeof(int));
s.u = s.a + n;
s.v = s.u + 2*n - 1;
for(i = 0; i < n; i++) {
s.a[i] = i;
}
for(i = 0; i < 2*n - 1; i++) {
s.u[i] = s.v[i] = 1;
}
queens_sub(0, &s);
free(s.a);
return s.m;
}
int main() {
int n;
for(n = 1; n <= 16; n++) {
printf("%d %d\n", n, queens(n));
}
return 0;
}

Python code optimization (20x slower than C)

I've written this very badly optimized C code that does a simple math calculation:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
#define MAX(a, b) (((a) > (b)) ? (a) : (b))
unsigned long long int p(int);
float fullCheck(int);
int main(int argc, char **argv){
int i, g, maxNumber;
unsigned long long int diff = 1000;
if(argc < 2){
fprintf(stderr, "Usage: %s maxNumber\n", argv[0]);
return 0;
}
maxNumber = atoi(argv[1]);
for(i = 1; i < maxNumber; i++){
for(g = 1; g < maxNumber; g++){
if(i == g)
continue;
if(p(MAX(i,g)) - p(MIN(i,g)) < diff && fullCheck(p(MAX(i,g)) - p(MIN(i,g))) && fullCheck(p(i) + p(g))){
diff = p(MAX(i,g)) - p(MIN(i,g));
printf("We have a couple %llu %llu with diff %llu\n", p(i), p(g), diff);
}
}
}
return 0;
}
float fullCheck(int number){
float check = (-1 + sqrt(1 + 24 * number))/-6;
float check2 = (-1 - sqrt(1 + 24 * number))/-6;
if(check/1.00 == (int)check)
return check;
if(check2/1.00 == (int)check2)
return check2;
return 0;
}
unsigned long long int p(int n){
return n * (3 * n - 1 ) / 2;
}
And then I tried (just for fun) to port it to Python to see how it would react. My first version was almost a 1:1 conversion that ran terribly slowly (120+ secs in Python vs <1 sec in C).
I've done a bit of optimization, and this is what I obtained:
#!/usr/bin/env python
from cmath import sqrt
import cProfile
from pstats import Stats
def quickCheck(n):
    partial_c = (sqrt(1 + 24 * (n)))/-6
    c = 1/6 + partial_c
    if int(c.real) == c.real:
        return True
    c = c - 2*partial_c
    if int(c.real) == c.real:
        return True
    return False
def main():
    maxNumber = 5000
    diff = 1000
    for i in range(1, maxNumber):
        p_i = i * (3 * i - 1 ) / 2
        for g in range(i, maxNumber):
            if i == g:
                continue
            p_g = g * (3 * g - 1 ) / 2
            if p_i > p_g:
                ma = p_i
                mi = p_g
            else:
                ma = p_g
                mi = p_i
            if ma - mi < diff and quickCheck(ma - mi):
                if quickCheck(ma + mi):
                    print ('New couple ', ma, mi)
                    diff = ma - mi
cProfile.run('main()','script_perf')
perf = Stats('script_perf').sort_stats('time', 'calls').print_stats(10)
This runs in about 16secs which is better but also almost 20 times slower than C.
Now, I know C is better than Python for this kind of calculation, but what I would like to know is whether there is something I've missed (Python-wise, like a horribly slow function or such) that could have made this function faster.
Please note that I'm using Python 3.1.1, if this makes a difference

Since quickCheck is being called close to 25,000,000 times, you might want to use memoization to cache the answers.
You can do memoization in C as well as Python. Things will be much faster in C, also.
You're computing 1/6 in each iteration of quickCheck. I'm not sure if this will be optimized out by Python, but if you can avoid recomputing constant values, you'll find things are faster. C compilers do this for you.
Doing things like if condition: return True; else: return False is silly -- and time consuming. Simply do return condition.
In Python 3.x, /2 must create floating-point values. You appear to need integers for this. You should be using //2 division. It will be closer to the C version in terms of what it does, but I don't think it's significantly faster.
Finally, Python is generally interpreted. The interpreter will always be significantly slower than C.
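The point about /2 versus //2 is easy to check directly. The sketch below (function names are illustrative) shows that true division turns the pentagonal numbers into floats, which stops being exact once the values no longer fit in a double's 53-bit mantissa, while floor division keeps them as exact integers.

```python
def pentagonal_float(n):
    """n-th pentagonal number using true division: always a float in Python 3."""
    return n * (3 * n - 1) / 2


def pentagonal_int(n):
    """n-th pentagonal number using floor division: stays an exact int."""
    return n * (3 * n - 1) // 2


if __name__ == "__main__":
    print(type(pentagonal_float(5)), type(pentagonal_int(5)))
    big = 10**9 + 1
    # For large n the float result can no longer represent the exact value.
    print(pentagonal_int(big), int(pentagonal_float(big)))
```

For the range used in the question (n below 5000) both versions give the same values; the difference is the type, and with it the arithmetic used in every later comparison.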

I made it go from ~7 seconds to ~3 seconds on my machine:
Precomputed i * (3 * i - 1 ) / 2 for each value, in yours it was computed twice quite a lot
Cached calls to quickCheck
Removed if i == g by adding +1 to the range
Removed if p_i > p_g since p_i is always smaller than p_g
Also put the quickCheck-function inside main, to make all variables local (which have faster lookup than global).
I'm sure there are more micro-optimizations available.
from cmath import sqrt  # same sqrt as in the question; the code below reads c.real
def main():
    maxNumber = 5000
    diff = 1000
    p = {}
    quickCache = {}
    for i in range(maxNumber):
        p[i] = i * (3 * i - 1 ) / 2
    def quickCheck(n):
        if n in quickCache: return quickCache[n]
        partial_c = (sqrt(1 + 24 * (n)))/-6
        c = 1/6 + partial_c
        if int(c.real) == c.real:
            quickCache[n] = True
            return True
        c = c - 2*partial_c
        if int(c.real) == c.real:
            quickCache[n] = True
            return True
        quickCache[n] = False
        return False
    for i in range(1, maxNumber):
        mi = p[i]
        for g in range(i+1, maxNumber):
            ma = p[g]
            if ma - mi < diff and quickCheck(ma - mi) and quickCheck(ma + mi):
                print('New couple ', ma, mi)
                diff = ma - mi

Because the function p() is monotonically increasing, you can avoid comparing the values, as g > i implies p(g) > p(i). Also, the inner loop can be broken out of early because p(g) - p(i) >= diff implies p(g+1) - p(i) >= diff.
Also for correctness, I changed the equality comparison in quickCheck to compare difference against an epsilon because exact comparison with floating point is pretty fragile.
On my machine this reduced the runtime to 7.8ms using Python 2.6. Using PyPy with JIT reduced this to 0.77ms.
This shows that before turning to micro-optimization it pays to look for algorithmic optimizations. Micro-optimizations make spotting algorithmic changes much harder for relatively tiny gains.
from math import sqrt  # added: the snippet uses sqrt without importing it
EPS = 0.00000001
def quickCheck(n):
    partial_c = sqrt(1 + 24*n) / -6
    c = 1/6 + partial_c
    # round() instead of int(): int() truncates, so a value like 2.9999999 would fail the test
    if abs(round(c) - c) < EPS:
        return True
    c = 1/6 - partial_c
    if abs(round(c) - c) < EPS:
        return True
    return False
def p(i):
    return i * (3 * i - 1 ) / 2
def main(maxNumber):
    diff = 1000
    for i in range(1, maxNumber):
        for g in range(i+1, maxNumber):
            if p(g) - p(i) >= diff:
                break
            if quickCheck(p(g) - p(i)) and quickCheck(p(g) + p(i)):
                print('New couple ', p(g), p(i), p(g) - p(i))
                diff = p(g) - p(i)
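If floating point is the concern, the check can also be done entirely in integers. This is a different technique from the epsilon comparison above, not the answer's method: since 24*P(k) + 1 = (6k - 1)^2, a non-negative integer n is pentagonal exactly when 24n + 1 is a perfect square s^2 with (s + 1) divisible by 6. The sketch assumes Python 3.8 or later for math.isqrt and integer inputs (so p should use // 2); note that it is stricter than quickCheck, which also accepts values such as 2 that come from the negative root.

```python
from math import isqrt


def is_pentagonal(n):
    """Exact test: is n equal to k*(3*k - 1)//2 for some integer k >= 1?"""
    if n < 1:
        return False
    s = isqrt(24 * n + 1)
    return s * s == 24 * n + 1 and (s + 1) % 6 == 0


def p(i):
    """i-th pentagonal number, kept as an exact int."""
    return i * (3 * i - 1) // 2


if __name__ == "__main__":
    print([n for n in range(1, 40) if is_pentagonal(n)])
```

Because everything stays in integer arithmetic, no epsilon is needed and the result does not depend on how sqrt rounds.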

There are some python compilers that might actually do a good bit for you. Have a look at Psyco.
Another way of dealing with math intensive programs is to rewrite the majority of the work into a math kernel, such as NumPy, so that heavily optimized code is doing the work, and your python code only guides the calculation. To get the most out of this strategy, avoid doing calculations in loops, and instead let the math kernel do all of that.

The other respondents have already mentioned several optimizations that will help. However, ultimately, you're not going to be able to match the performance of C in Python. Python is a nice tool, but since it's interpreted, it isn't really suited for heavy number crunching or other apps where performance is key.
Also, even in your C version, your inner loop could use quite a bit of help. Updated version:
for(i = 1; i < maxNumber; i++){
for(g = 1; g < maxNumber; g++){
if(i == g)
continue;
max=i;
min=g;
if (max<min) {
// xor swap - could use swap(p_max,p_min) instead.
max=max^min;
min=max^min;
max=max^min;
}
p_max=P(max);
p_min=P(min);
p_i=P(i);
p_g=P(g);
if(p_max - p_min < diff && fullCheck(p_max-p_min) && fullCheck(p_i + p_g)){
diff = p_max - p_min;
printf("We have a couple %llu %llu with diff %llu\n", p_i, p_g, diff);
}
}
}
///////////////////////////
float fullCheck(int number){
float den=sqrt(1+24*number)/6.0;
float check = 1/6.0 - den;
float check2 = 1/6.0 + den;
if(check == (int)check)
return check;
if(check2 == (int)check2)
return check2;
return 0.0;
}
Division, function calls, etc are costly. Also, calculating them once and storing in vars such as I've done can make things a lot more readable.
You might consider declaring P() as inline or rewrite as a preprocessor macro. Depending on how good your optimizer is, you might want to perform some of the arithmetic yourself and simplify its implementation.
Your implementation of fullCheck() would return what appear to be invalid results, since 1/6==0, where 1/6.0 would return 0.166... as you would expect.
This is a very brief take on what you can do to your C code to improve performance. This will, no doubt, widen the gap between C and Python performance.

20x difference between Python and C for a number crunching task seems quite good to me.
Check the usual performance differences for some CPU intensive tasks (keep in mind that the scale is logarithmic).
But look on the bright side, what's 1 minute of CPU time compared with the brain and typing time you saved writing Python instead of C? :-)
