C++ vs python numpy complex arrays performance

C++ vs python numpy complex arrays performance - python

Can anyone tell me why these two programs have a huge difference in run time? I am simply multiplying two large complex arrays and comparing the time in python (numpy) and c++. I am using the -O3 flag with g++ to compile this C++ code. I find that the huge difference comes only when I use complex floats in C++, its more than 20 times faster in numpy.
python code:
import numpy as np
import time
if __name__ == "__main__":
# check the data type is the same
a = np.zeros((1), dtype=np.complex128)
a[0] = np.complex(3.4e38,3.5e38)
print(a)
b = np.zeros((1), dtype=np.complex64)
b[0] = np.complex(3.4e38,3.5e38)
print(b) # imaginary part is infinity
length = 5000;
A = np.ones((length), dtype=np.complex64) * np.complex(1,1)
B = np.ones((length), dtype=np.complex64) * np.complex(1,0)
num_iterations = 1000000
time1 = time.time()
for _ in range(num_iterations):
A *= B
time2 = time.time()
duration = ((time2 - time1)*1e6)/num_iterations
print(duration)
C++ code:
#include <iostream>
#include <complex>
#include <chrono>
using namespace std::chrono;
using namespace std;
int main()
{
// check the data type is the same
complex<double> a = complex<double>(3.4e38, 3.5e38);
cout << a << endl;
complex<float> b = complex<float>(3.4e38, 3.5e38);
cout << b << endl; // imaginary part is infinity
const int length = 5000;
static complex<float> A[length];
static complex<float> B[length];
for(int i=0; i < length; i++) {
A[i] = complex<float>(1,1);
B[i] = complex<float>(1,0);
}
int num_iterations = 1000000;
auto time1 = high_resolution_clock::now();
for(int k=0; k < num_iterations; k++)
for(int i=0; i < length; i++)
A[i] *= B[i];
auto time2 = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(time2 - time1);
cout << "average time:" << duration.count() / num_iterations << endl;
}

The C++ compiler is doing some extra checking gymnastics for you in order to properly handle NaNs and other such "standard" behavior.
If you add the -ffast-math optimization flag, you'll get more sane speed, but less "standard" behavior. e.g. complex<float>(inf,0)*complex<float>(inf,0) won't be evaluated as complex<float>(inf,0). Do you really care?
numpy is doing what makes sense, not hindered by a narrow reading of the C++ standard.
e.g. until very recent g++ versions, the latter of the following functions is much faster unless -ffast-math is used.
complex<float> mul1( complex<float> a,complex<float> b)
{
return a*b;
}
complex<float> mul2( complex<float> a,complex<float> b)
{
float * fa = reinterpret_cast<float*>(&a);
const float * fb = reinterpret_cast<float*>(&b);
float cr = fa[0]*fb[0] - fa[1]*fb[1];
float ci = fa[0]*fb[1] + fa[1]*fb[0];
return complex<float>(cr,ci);
}
You can experiment with this on https://godbolt.org/z/kXPgCh for the assembly output and how the former function defaults to calling __mulsc3
P.S. Ready for another wave of anger at what the C++ standard says about std::complex<T>? Can you guess how std::norm must be implemented by default? Play along. Follow the link and spend ten seconds thinking about it.
Spoiler: it probably is using a sqrt then squaring it.

Related

Why c++ code implementation is not performing better than the python implementation?

I am doing benchmarking for finding nearest neighbour for the datapoints. My c++ implementation and python implementation are taking almost same execution time. Shouldn't be c++ works better than the raw python implementation.
C++ Execution Time : 8.506 seconds
Python Execution Time : 8.7202 seconds
C++ Code:
#include <iostream>
#include <random>
#include <map>
#include <cmath>
#include <numeric>
#include <algorithm>
#include <chrono>
#include <vector> // std::iota
using namespace std;
using namespace std::chrono;
double edist(double* arr1, double* arr2, uint n) {
double sum = 0.0;
for (int i=0; i<n; i++) {
sum += pow(arr1[i] - arr2[i], 2);
}
return sqrt(sum); }
template <typename T> vector<size_t> argsort(const vector<T> &v) {
// initialize original index locations
vector<size_t> idx(v.size()); iota(idx.begin(), idx.end(), 0);
// sort indexes based on comparing values in v
sort(idx.begin(), idx.end(),
[&v](size_t i1, size_t i2) {return v[i1] < v[i2];});
return std::vector<size_t>(idx.begin() + 1, idx.end()); }
int main() {
uint N, M;
// cin >> N >> M;
N = 1000;
M = 800;
double **arr = new double*[N];
std::random_device rd; // obtain a random number from hardware
std::mt19937 eng(rd()); // seed the generator
std::uniform_real_distribution<> distr(10.0, 60.0);
for (int i = 0; i < N; i++) {
arr[i] = new double[M];
for(int j=0; j < M; j++) {
arr[i][j] = distr(eng);
}
}
auto start = high_resolution_clock::now();
map<int, vector<size_t> > dist;
for (int i=0; i<N; i++) {
vector<double> distances;
for(int j=0; j<N; j++) {
distances.push_back(edist(arr[i], arr[j], N));
}
dist[i] = argsort(distances);
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop-start);
int dur = duration.count();
cout<<"Time taken by code: "<<dur<<" microseconds"<<endl;
cout<<" In seconds: "<<dur/pow(10,6);
return 0; }
Python Code:
import time
import numpy as np
def comp_inner_raw(i, x):
res = np.zeros(x.shape[0], dtype=np.float64)
for j in range(x.shape[0]):
res[j] = np.sqrt(np.sum((i-x[j])**2))
return res
def nearest_ngbr_raw(x): # x = [[1,2,3],[4,5,6],[7,8,9]]
#print("My array: ",x)
dist = {}
for idx,i in enumerate(x):
#lst = []
lst = comp_inner_raw(i,x)
s = np.argsort(lst)#[1:]
sorted_array = np.array(x)[s][1:]
dist[idx] = s[1:]
return dist
arr = np.random.rand(1000, 800)
start = time.time()
table = nearest_ngbr_raw(arr)
print("Time taken to execute the code using raw python is {}".format(time.time()-start))
Compile Command:
g++ -std=c++11 knn.cpp -o knn
C++ compiler(g++) version for ubuntu 18.04.1: 7.4.0
Coded in c++11
Numpy version : 1.16.2
Edit
Tried with compiler optimization, now it is taking around 1 second.
Can this c++ code be optimized further from coding or any other perspective?

Can this c++ code be optimized further from coding or any other perspective?
I can see at least three optimisations. The first two are easy and should definitely be done but in my testing they end up not impacting the runtime measurably. The third one requires rethinking the code minimally.
edist caculates a costly square root, but you are only using the distance for pairwise comparison. Since the square root function is monotonically increasing, it has no impact on the comparison result. Similarly, pow(x, 2) can be replaced with x * x and this is sometimes faster:
double edist(std::vector<double> const& arr1, std::vector<double> const& arr2, uint n) {
double sum = 0.0;
for (unsigned int i = 0; i < n; i++) {
auto const diff = arr1[i] - arr2[i];
sum += diff * diff;
}
return sum;
}
argsort performs a copy because it returns the indices excluding the first element. If you instead include the first element (change the return statement to return idx;), you avoid a potentially costly copy.
Your matrix is represented as a nested array (and you’re for some reason using raw pointers instead of a nested std::vector). It’s generally more efficient to represent matrices as contiguous N*M arrays: std::vector<double> arr(N * M);. This is also how numpy represents matrices internally. This requires changing the code to calculate the indices.

Performance Discrepancy B/n C++ and Python for Project Euler

I'm experiencing a slightly bizarre performance discrepancy between two equatable programs and I cannot reason about the difference for any real reason.
I'm solving Project Euler Problem 46. Both code solutions (one in Python and one in Cpp) get the right answer. However, the python solution seems to be more performant, which is contradictory to what I was expecting.
Do not worry about the actual algorithm being optimal - all I care about is that they are two equatable programs. I'm sure there is a more optimal algorithm.
Python Solution
import math
import time
UPPER_LIMIT = 1000000
HIT_COUNT = 0
def sieveOfErato(number):
sieve = [True] * number
for i in xrange(2, int(math.ceil(math.sqrt(number)))):
if sieve[i]:
for j in xrange(i**2, number, i):
sieve[j] = False
primes = [i for i, val in enumerate(sieve) if i > 1 and val == True]
return set(primes)
def isSquare(number):
ans = math.sqrt(number).is_integer()
return ans
def isAppropriateGolbachNumber(number, possiblePrimes):
global HIT_COUNT
for possiblePrime in possiblePrimes:
if possiblePrime < number:
HIT_COUNT += 1
difference = number - possiblePrime
if isSquare(difference / 2):
return True
return False
if __name__ == '__main__':
start = time.time()
primes = sieveOfErato(UPPER_LIMIT)
answer = -1
for odd in xrange(3, UPPER_LIMIT, 2):
if odd not in primes:
if not isAppropriateGolbachNumber(odd, primes):
answer = odd
break
print('Hit Count: {}'.format(HIT_COUNT))
print('Loop Elapsed Time: {}'.format(time.time() - start))
print('Answer: {}'.format(answer))
C++ Solution
#include <iostream>
#include <unordered_set>
#include <vector>
#include <math.h>
#include <cstdio>
#include <ctime>
int UPPER_LIMIT = 1000000;
std::unordered_set<int> sieveOfErato(int number)
{
std::unordered_set<int> primes;
bool sieve[number+1];
memset(sieve, true, sizeof(sieve));
for(int i = 2; i * i <= number; i++)
{
if (sieve[i] == true)
{
for (int j = i*i; j < number; j+=i)
{
sieve[j] = false;
}
}
}
for(int i = 2; i < number; i++)
{
if (sieve[i] == true)
{
primes.insert(i);
}
}
return primes;
}
bool isPerfectSquare(const int& number)
{
int root(round(sqrt(number)));
return number == root * root;
}
int hitCount = 0;
bool isAppropriateGoldbachNumber(const int& number, const std::unordered_set<int>& primes)
{
int difference;
for (const auto& prime : primes)
{
if (prime < number)
{
hitCount++;
difference = (number - prime)/2;
if (isPerfectSquare(difference))
{
return true;
}
}
}
return false;
}
int main(int argc, char** argv)
{
std::clock_t start;
double duration;
start = std::clock();
std::unordered_set<int> primes = sieveOfErato(UPPER_LIMIT);
int answer = -1;
for(int odd = 3; odd < UPPER_LIMIT; odd+=2)
{
if (primes.find(odd) == primes.end())
{
if (!isAppropriateGoldbachNumber(odd, primes))
{
answer = odd;
break;
}
}
}
duration = (std::clock() - start) / (double) CLOCKS_PER_SEC;
std::cout << "Hit Count: " << hitCount << std::endl;
std::cout << std::fixed << "Loop Elapsed Time: " << duration << std::endl;
std::cout << "Answer: " << answer << std::endl;
}
I'm compiling my cpp code by g++ -std=c++14 file.cpp and then executing with just ./a.out.
On a couple of test runs just using the time command from the command line, I get:
Python
Hit Count: 128854
Loop Elapsed Time: 0.393740177155
Answer: 5777
real 0m0.525s
user 0m0.416s
sys 0m0.049s
C++
Hit Count: 90622
Loop Elapsed Time: 0.993970
Answer: 5777
real 0m1.027s
user 0m0.999s
sys 0m0.013s
Why would there be more hits in the python version and it still be returning more quickly? I would think that more hits, means more iterations, means slower (and it's in python). I'm guessing that there's just a performance blunder in my cpp code, but I haven't found it yet. Any ideas?

I concur with Kunal Puri's answer that a better algorithm and data-structure can improve performance, but it does not answer the core question: Why does the same algorithm, that uses the same data-structure, runs faster with python.
It all boils down to the difference between std::unordered_set and python's set. Note that the same C++ code with std::set runs faster than python's alternative, and if optimization is enabled (with -O2) then C++ code with std::set runs more than 10 times faster than python.
There are several works showing that, and why, std::unordered_set is broken performance-wise. For example you can watch C++Now 2018: You Can Do Better than std::unordered_map: New Improvements to Hash Table Performance. It seems that python does not suffer from these design flaws in its set.
One of the things that make std::unordered_set so poor is the big amount of indirections it mandates to simply reach an element. For example, during iteration, the iterator points to a bucket before the current bucket. Another thing to consider is the poorer cache locality. The set of python seems to prefer to retain the original order of elements, but the GCC's std::unordered_set tends to create a random order. This is the cause of the difference in HIT_COUNT between C++ and python. Once the code starts to use std::set then the HIT_COUNT becomes the same for C++ and python. Retaining the original order during iteration tends to improves the cache locality of nodes in a new process, since they are iterated in the same order as they are allocated (and two adjacent allocations, of a new process, have higher chance to be allocated in consecutive memory addresses).

Apart from compiler optimization as suggested by DYZ, I have some more observations regarding optimization.
1) Use std::vector instead of std::unordered_set.
In your code, you are doing this:
std::unordered_set<int> sieveOfErato(int number)
{
std::unordered_set<int> primes;
bool sieve[number+1];
memset(sieve, true, sizeof(sieve));
for(int i = 2; i * i <= number; i++)
{
if (sieve[i] == true)
{
for (int j = i*i; j < number; j+=i)
{
sieve[j] = false;
}
}
}
for(int i = 2; i < number; i++)
{
if (sieve[i] == true)
{
primes.insert(i);
}
}
return primes;
}
I don't see any reason of using std::unordered_set here. Instead, you could do this:
std::vector<int> sieveOfErato(int number)
{
bool sieve[number+1];
memset(sieve, true, sizeof(sieve));
int numPrimes = 0;
for(int i = 2; i * i <= number; i++)
{
if (sieve[i] == true)
{
for (int j = i*i; j < number; j+=i)
{
sieve[j] = false;
}
numPrimes++;
}
}
std::vector<int> primes(numPrimes);
int j = 0;
for(int i = 2; i < number; i++)
{
if (sieve[i] == true)
{
primes[j++] = i;
}
}
return primes;
}
As far as find() is concerned, you may do this:
int j = 0;
for(int odd = 3; odd < UPPER_LIMIT; odd+=2)
{
while (j < primes.size() && primes[j] < odd) {
j++;
}
if (primes[j] != odd)
{
if (!isAppropriateGoldbachNumber(odd, primes))
{
answer = odd;
break;
}
}
}
2) Pre Compute perfect squares in a std::vector before hand instead of calling sqrt always.

Why is my Python NumPy code faster than C++?

Why is this Python NumPy code,
import numpy as np
import time
k_max = 40000
N = 10000
data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)
t1 = time.time()
for k in xrange(1,k_max+1):
cos_k = np.cos(k*data[0,:])
sin_k = np.sin(k*data[0,:])
coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()
print('Time:')
print(t2-t1)
faster than the following C++ code?
#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>
using namespace std;
// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;
int main()
{
time_t start, stop;
double diff;
// table with data
double data1[ N ];
double data2[ N ];
// table of results
double coefs1[ k_max ];
double coefs2[ k_max ];
// main loop
time( & start );
for( unsigned int j = 1; j<N; j++ )
{
for( unsigned int i = 0; i<k_max; i++ )
{
coefs1[ i ] += data2[ j-1 ]*(cos((i+1)*data1[ j-1 ]) - cos((i+1)*data1[ j ]));
coefs2[ i ] += data2[ j-1 ]*(sin((i+1)*data1[ j-1 ]) - sin((i+1)*data1[ j ]));
}
}
// end of main loop
time( & stop );
// speed result
diff = difftime( stop, start );
cout << "Time: " << diff << " seconds";
return 0;
}
The first one shows: "Time: 8 seconds"
while the second: "Time: 11 seconds"
I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?
Version 2 of the code
I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.
Version 3 of the code
I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).

I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).
The answer is: your C++ code is not slower than your Python code when properly compiled.
I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.
I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main, to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled data variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles (dtype=np.float64), because there are doubles in C++ example. My pure C version of your code (it's similar):
#include <math.h>
#include <stdio.h>
#include <time.h>
const int k_max = 100000;
const int N = 10000;
int main(void)
{
clock_t t_start, t_end;
double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
int z;
for( z = 0; z < N; z++ )
{
data1[z] = 0.1;
data2[z] = 0.4;
}
int i, j;
t_start = clock();
for( i = 0; i < k_max; i++ )
{
for( j = 0; j < N-1; j++ )
{
coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
}
}
t_end = clock();
seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("Time: %f s\n", seconds);
return coefs1[0];
}
For k_max = 100000, N = 10000 results where following:
Python 70.284362 s
C++ 69.133199 s
C 61.638186 s
Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.
For k_max = 1000000, N = 1000 we have:
Python 115.42766 s
C++ 70.781380 s
For k_max = 1000000, N = 100:
Python 52.86826 s
C++ 7.050597 s
So the difference increases with fraction k_max/N, but python is not faster even for N much bigger than k_max, e. g. k_max = 100, N = 100000:
Python 0.651587 s
C++ 0.568518 s
Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.
Python script:
import numpy as np
from time import time
N = 10000
x_len = 100000
def main():
x = np.ones(x_len, dtype=np.float64) * 1.2345
start = time()
for i in xrange(N):
y1 = np.cos(x, dtype=np.float64)
end = time()
print('cos: {} s'.format(end-start))
start = time()
for i in xrange(N):
y2 = x * 7.9463
end = time()
print('multi: {} s'.format(end-start))
start = time()
for i in xrange(N):
res = np.sum(x, dtype=np.float64)
end = time()
print('sum: {} s'.format(end-start))
return y1, y2, res
if __name__ == '__main__':
main()
# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s
C script:
#include <math.h>
#include <stdio.h>
#include <time.h>
const int N = 10000;
const int x_len = 100000;
int main()
{
clock_t t_start, t_end;
double x[x_len], y1[x_len], y2[x_len], res, time;
int i, j;
for( i = 0; i < x_len; i++ )
{
x[i] = 1.2345;
}
t_start = clock();
for( j = 0; j < N; j++ )
{
for( i = 0; i < x_len; i++ )
{
y1[i] = cos(x[i]);
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("cos: %f s\n", time);
t_start = clock();
for( j = 0; j < N; j++ )
{
for( i = 0; i < x_len; i++ )
{
y2[i] = x[i] * 7.9463;
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("multi: %f s\n", time);
t_start = clock();
for( j = 0; j < N; j++ )
{
res = 0.0;
for( i = 0; i < x_len; i++ )
{
res += x[i];
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("sum: %f s\n", time);
return y1[0], y2[0], res;
}
// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s
Python results:
cos: 22.7199969292 s
multi: 0.841291189194 s
sum: 1.15971088409 s
C results:
cos: 20.910590 s
multi: 0.633281 s
sum: 1.153001 s
As you can see NumPy is incredibly fast, but always a bit slower than pure C.

I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...
Numpy is using complex Linear Algebra libraries !
Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.
I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.
C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).
Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from #Jfs, and we observe: "There is no difference between C++ and numpy on my machine."
Some more reference:
Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).
I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:
#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>
const unsigned int k_max = 40000;
const unsigned int N = 10000;
template <class T>
class matrix2 {
std::vector<T> data;
size_t cols;
size_t rows;
public:
matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
T &operator()(size_t y, size_t x) {
assert(x <= cols);
assert(y <= rows);
return data[y*cols + x];
}
T operator()(size_t y, size_t x) const {
assert(x <= cols);
assert(y <= rows);
return data[y*cols + x];
}
};
int main() {
matrix2<double> data(N, 2);
matrix2<double> coeffs(k_max, 2);
using namespace std::chrono;
auto start = high_resolution_clock::now();
for (int k = 0; k < k_max; k++) {
for (int j = 0; j < N - 1; j++) {
coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
}
}
auto end = high_resolution_clock::now();
std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}
This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.
The next obvious step would be to use multiple cores. To do that in C++, we can add this line:
#pragma omp parallel for
...before the outer for loop:
#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
for (int j = 0; j < N - 1; j++) {
coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
}
}
With -openmp added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).

I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs, hence should switch your for-loops. If this is the case, you should have the following:
#include <iostream>
#include <cmath>
#include <time.h>
const int k_max = 40000;
const int N = 10000;
double cos_k, sin_k;
int main(int argc, char const *argv[])
{
time_t start, stop;
double data[2][N];
double coefs[k_max][2];
time(&start);
for(int i=0; i<k_max; ++i)
{
for(int j=0; j<N; ++j)
{
coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
}
}
// End of main loop
time(&stop);
// Speed result
double diff = difftime(stop, start);
std::cout << "Time: " << diff << " seconds" << std::endl;
return 0;
}
Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3, while Python code runs at 7.816 seconds.

The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is often slower than C++ since C++ can do more optimizations. They'll only be around each other with Python running somewhere between the same time as C++ to about twice C++ when doing the majority of your computation in large computations that Python pushes off to compiled binaries to calculate. Most anything beyond large matrix multiplication, addition, scalar on matrix multiplication, etc. will perform much worse in Python. For example, look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. That makes sense too if you do anything other than simple math operations - anything with linked lists, binary search trees, procedural code, etc. The interpreted nature of Python combined with it storing metadata for each object (even int, double, float, etc.) significantly bogs things down in a way that no Python programmer can fix.

Why is Python sort faster than that of C++

Sorting a list of ints in python 3 seems to be faster than sorting an array of ints in C++. Below is the code for 1 python program and 2 C++ programs that I used for the test. Any reason why the C++ programs are slower? It doesn't make sense to me.
----- Program 1 - python 3.4 -----
from time import time
x = 10000
y = 1000
start = time()
for _ in range(y):
a = list(range(x))
a.reverse()
a.sort()
print(round(time() - start, 2), 'seconds')
----- Program 2 - c++ using sort from algorithm ------
using namespace std;
#include <iostream>
#include <algorithm>
int main(){
int x = 10000;
int y = 1000;
int b[10000];
cout << "start" << endl;
for (int j = 0; j < y; j++){
for (int i = 0; i < x; i++){
b[i] = x - i;
} // still slower than python with this clause taken out
sort(b, b + x); // regular sort
}
cout << "done";
system("pause");
}
----- Program 3 - c++ using hand written merge sort ------
using namespace std;
#include <iostream>
void merge(int * arr, int *temp, int first_start, int second_start, int second_finish){
int a1 = first_start, b1 = second_start, r = 0;
while (a1 < second_start && b1 < second_finish){
if (arr[a1] < arr[b1]){
temp[r] = arr[a1];
a1++; r++;
}
else {
temp[r] = arr[b1];
b1++; r++;
}
}
if (a1 < second_start){
while (a1 < second_start){
temp[r] = arr[a1];
a1++; r++;
}
}
else {
while (b1 < second_finish){
temp[r] = arr[b1];
b1++; r++;
}
}
for (int i = first_start; i < second_finish; i++){
arr[i] = temp[i - first_start];
}
}
void merge_sort(int *a, int a_len, int *temp){
int c = 1, start = 0;
while (c < a_len){
while (start + c * 2 < a_len){
merge(a, temp, start, start + c, start + c * 2);
start += c * 2;
}
if (start + c <= a_len){
merge(a, temp, start, start + c, a_len);
}
c *= 2; start = 0;
}
}
int main(){
int x = 10000; // size of array to be sorted
int y = 1000; // number of times to sort it
int b[10000], temp[10000];
cout << "start" << endl;
for (int j = 0; j < y; j++){
for (int i = 0; i < x; i++){
b[i] = x - i; // reverse sorted array (even with this assignment taken out still runs slower than python)
}
merge_sort(b, x, temp);
}
cout << "done";
system("pause");
}

The core reason is no doubt timsort -- http://en.wikipedia.org/wiki/Timsort -- first conceived by Tim Peters for Python though now also in some Java VMs (for non-primitives only).
It's a truly amazing algorithm and you can find a C++ implementation at https://github.com/swenson/sort for example.
Lesson to retain: the proper architecture and algorithms can let you run circles around supposedly-faster languages if the latter are using less-perfect A & As!-) So, if you have really big problems to solve, deal with determining perfect architecture and algorithms first -- the language and optimizations within it are inevitably lower-priority issues.

OpenMP, Python, C Extension, Memory Access and the evil GIL

so I am currently trying to do something like A**b for some 2d ndarray and a double b in parallel for Python. I would like to do it with a C extension using OpenMP (yes I know, there is Cython etc. but at some point I always ran into trouble with those 'high-level' approaches...).
So here is the gaussian.c Code for my gaussian.so:
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
(A is a square double Matrix, out has the same shape and n is the number of rows/columns) So the point is to update some symmetric distance matrix - ind2 is the transposed index of ind1.
I compile it using gcc -shared -fopenmp -o gaussian.so -lm gaussian.c. I access the function directly via ctypes in Python:
test = c_gaussian.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'), # array of sample
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'), # array of sampl
ctypes.c_int # number of samples
]
The function 'test' is working smoothly as long as I comment the #pragma line - otherwise it ends with error number 139.
A = np.random.rand(1000, 1000) + 2.0
out = np.empty((1000, 1000))
test(A, out, 1000)
When I change the inner loop to just print ind1 and ind2 it runs smoothly in parallel. It also works, when I just access the ind1 location and leave ind2 alone (even in parallel)! Where do I screw up the memory access? How can I fix this?
thank you!
Update: Well I guess this is running into the GIL, but I am not yet sure...
Update: Okay, I am pretty sure now, that it is evil GIL killing me here, so I altered the example:
I now have gil.c:
#include <Python.h>
#define _USE_MATH_DEFINES
#include <math.h>
void scale(const double *A, double *out, int n) {
int i, j, ind1, ind2;
double power, denom;
power = 10.0 / M_PI;
denom = sqrt(M_PI);
Py_BEGIN_ALLOW_THREADS
#pragma omp parallel for
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Py_END_ALLOW_THREADS
}
which is compiled using gcc -shared -fopenmp -o gil.so -lm gil.c -I /usr/include/python2.7 -L /usr/lib/python2.7/ -lpython2.7 and the corresponding Python file:
import ctypes
import numpy as np
from numpy.ctypeslib import ndpointer
import pylab as pl
path = '../src/gil.so'
c_gil = ctypes.cdll.LoadLibrary(path)
test = c_gil.scale
test.restype = None
test.argtypes = [ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ndpointer(ctypes.c_double,
ndim=2,
flags='C_CONTIGUOUS'),
ctypes.c_int
]
n = 100
A = np.random.rand(n, n) + 2.0
out = np.empty((n,n))
test(A, out, n)
This gives me
Fatal Python error: PyEval_SaveThread: NULL tstate
Process finished with exit code 134
Now somehow it seems to not be able to save the current thread - but the API doc does not go into detail here, I was hoping that I could ignore Python when writing my C function, but this seems to be quite messy :( any ideas? I found this very helpful: GIL

Your problem is much simpler than you think and does not involve GIL in any way. You are running in an out-of-bound access to out[] when you access it via ind2 since j easily becomes larger than n. The reason is simply that you have not applied any data sharing clause to your parallel region and all variables except i remain shared (as per default in OpenMP) and therefore subject to data races - in that case multiple simultaneous increments being done by the different threads. Having too large j is less of a problem with ind1, but not with ind2 since there the too large value is multiplied by n and thus becomes far too large.
Simply make j, ind1 and ind2 private as they should be:
#pragma omp parallel for private(j,ind1,ind2)
for (i = 0; i < n; i++) {
for (j = i; j < n; j++) {
ind1 = i*n + j;
ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}
Even better, declare them inside the scope where they are being used. That automatically makes them private:
#pragma omp parallel for
for (i = 0; i < n; i++) {
int j;
for (j = i; j < n; j++) {
int ind1 = i*n + j;
int ind2 = j*n + i;
out[ind1] = pow(A[ind1], power) / denom;
out[ind2] = out[ind1];
}
}

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.