Performance Discrepancy B/n C++ and Python for Project Euler

Performance Discrepancy B/n C++ and Python for Project Euler - python

I'm experiencing a slightly bizarre performance discrepancy between two equatable programs and I cannot reason about the difference for any real reason.
I'm solving Project Euler Problem 46. Both code solutions (one in Python and one in Cpp) get the right answer. However, the python solution seems to be more performant, which is contradictory to what I was expecting.
Do not worry about the actual algorithm being optimal - all I care about is that they are two equatable programs. I'm sure there is a more optimal algorithm.
Python Solution
import math
import time
UPPER_LIMIT = 1000000
HIT_COUNT = 0
def sieveOfErato(number):
sieve = [True] * number
for i in xrange(2, int(math.ceil(math.sqrt(number)))):
if sieve[i]:
for j in xrange(i**2, number, i):
sieve[j] = False
primes = [i for i, val in enumerate(sieve) if i > 1 and val == True]
return set(primes)
def isSquare(number):
ans = math.sqrt(number).is_integer()
return ans
def isAppropriateGolbachNumber(number, possiblePrimes):
global HIT_COUNT
for possiblePrime in possiblePrimes:
if possiblePrime < number:
HIT_COUNT += 1
difference = number - possiblePrime
if isSquare(difference / 2):
return True
return False
if __name__ == '__main__':
start = time.time()
primes = sieveOfErato(UPPER_LIMIT)
answer = -1
for odd in xrange(3, UPPER_LIMIT, 2):
if odd not in primes:
if not isAppropriateGolbachNumber(odd, primes):
answer = odd
break
print('Hit Count: {}'.format(HIT_COUNT))
print('Loop Elapsed Time: {}'.format(time.time() - start))
print('Answer: {}'.format(answer))
C++ Solution
#include <iostream>
#include <unordered_set>
#include <vector>
#include <math.h>
#include <cstdio>
#include <ctime>
int UPPER_LIMIT = 1000000;
std::unordered_set<int> sieveOfErato(int number)
{
std::unordered_set<int> primes;
bool sieve[number+1];
memset(sieve, true, sizeof(sieve));
for(int i = 2; i * i <= number; i++)
{
if (sieve[i] == true)
{
for (int j = i*i; j < number; j+=i)
{
sieve[j] = false;
}
}
}
for(int i = 2; i < number; i++)
{
if (sieve[i] == true)
{
primes.insert(i);
}
}
return primes;
}
bool isPerfectSquare(const int& number)
{
int root(round(sqrt(number)));
return number == root * root;
}
int hitCount = 0;
bool isAppropriateGoldbachNumber(const int& number, const std::unordered_set<int>& primes)
{
int difference;
for (const auto& prime : primes)
{
if (prime < number)
{
hitCount++;
difference = (number - prime)/2;
if (isPerfectSquare(difference))
{
return true;
}
}
}
return false;
}
int main(int argc, char** argv)
{
std::clock_t start;
double duration;
start = std::clock();
std::unordered_set<int> primes = sieveOfErato(UPPER_LIMIT);
int answer = -1;
for(int odd = 3; odd < UPPER_LIMIT; odd+=2)
{
if (primes.find(odd) == primes.end())
{
if (!isAppropriateGoldbachNumber(odd, primes))
{
answer = odd;
break;
}
}
}
duration = (std::clock() - start) / (double) CLOCKS_PER_SEC;
std::cout << "Hit Count: " << hitCount << std::endl;
std::cout << std::fixed << "Loop Elapsed Time: " << duration << std::endl;
std::cout << "Answer: " << answer << std::endl;
}
I'm compiling my cpp code by g++ -std=c++14 file.cpp and then executing with just ./a.out.
On a couple of test runs just using the time command from the command line, I get:
Python
Hit Count: 128854
Loop Elapsed Time: 0.393740177155
Answer: 5777
real 0m0.525s
user 0m0.416s
sys 0m0.049s
C++
Hit Count: 90622
Loop Elapsed Time: 0.993970
Answer: 5777
real 0m1.027s
user 0m0.999s
sys 0m0.013s
Why would there be more hits in the python version and it still be returning more quickly? I would think that more hits, means more iterations, means slower (and it's in python). I'm guessing that there's just a performance blunder in my cpp code, but I haven't found it yet. Any ideas?

I concur with Kunal Puri's answer that a better algorithm and data-structure can improve performance, but it does not answer the core question: Why does the same algorithm, that uses the same data-structure, runs faster with python.
It all boils down to the difference between std::unordered_set and python's set. Note that the same C++ code with std::set runs faster than python's alternative, and if optimization is enabled (with -O2) then C++ code with std::set runs more than 10 times faster than python.
There are several works showing that, and why, std::unordered_set is broken performance-wise. For example you can watch C++Now 2018: You Can Do Better than std::unordered_map: New Improvements to Hash Table Performance. It seems that python does not suffer from these design flaws in its set.
One of the things that make std::unordered_set so poor is the big amount of indirections it mandates to simply reach an element. For example, during iteration, the iterator points to a bucket before the current bucket. Another thing to consider is the poorer cache locality. The set of python seems to prefer to retain the original order of elements, but the GCC's std::unordered_set tends to create a random order. This is the cause of the difference in HIT_COUNT between C++ and python. Once the code starts to use std::set then the HIT_COUNT becomes the same for C++ and python. Retaining the original order during iteration tends to improves the cache locality of nodes in a new process, since they are iterated in the same order as they are allocated (and two adjacent allocations, of a new process, have higher chance to be allocated in consecutive memory addresses).

Apart from compiler optimization as suggested by DYZ, I have some more observations regarding optimization.
1) Use std::vector instead of std::unordered_set.
In your code, you are doing this:
std::unordered_set<int> sieveOfErato(int number)
{
std::unordered_set<int> primes;
bool sieve[number+1];
memset(sieve, true, sizeof(sieve));
for(int i = 2; i * i <= number; i++)
{
if (sieve[i] == true)
{
for (int j = i*i; j < number; j+=i)
{
sieve[j] = false;
}
}
}
for(int i = 2; i < number; i++)
{
if (sieve[i] == true)
{
primes.insert(i);
}
}
return primes;
}
I don't see any reason of using std::unordered_set here. Instead, you could do this:
std::vector<int> sieveOfErato(int number)
{
bool sieve[number+1];
memset(sieve, true, sizeof(sieve));
int numPrimes = 0;
for(int i = 2; i * i <= number; i++)
{
if (sieve[i] == true)
{
for (int j = i*i; j < number; j+=i)
{
sieve[j] = false;
}
numPrimes++;
}
}
std::vector<int> primes(numPrimes);
int j = 0;
for(int i = 2; i < number; i++)
{
if (sieve[i] == true)
{
primes[j++] = i;
}
}
return primes;
}
As far as find() is concerned, you may do this:
int j = 0;
for(int odd = 3; odd < UPPER_LIMIT; odd+=2)
{
while (j < primes.size() && primes[j] < odd) {
j++;
}
if (primes[j] != odd)
{
if (!isAppropriateGoldbachNumber(odd, primes))
{
answer = odd;
break;
}
}
}
2) Pre Compute perfect squares in a std::vector before hand instead of calling sqrt always.

Related

C++ Memory Leak in Swig Python Module

Background
I have created a python module that wraps a c++ program using SWIG. It works just fine, but it has a pretty serious memory leak issue that I think is a result of poorly handled pointers to large map objects. I have very little experience with c++, and I have questions as to whether delete[] can be used on an object created with new in a different function or method.
The program was written in 2007, so excuse the lack of useful c++11 tricks.
The swig extension basically just wraps a single c++ class (Matrix) and a few functions.
Matrix.h
#ifndef __MATRIX__
#define __MATRIX__
#include <string>
#include <vector>
#include <map>
#include <cmath>
#include <fstream>
#include <cstdlib>
#include <stdio.h>
#include <unistd.h>
#include "FileException.h"
#include "ParseException.h"
#define ROUND_TO_INT(n) ((long long)floor(n))
#define MIN(a,b) ((a)<(b)?(a):(b))
#define MAX(a,b) ((a)>(b)?(a):(b))
using namespace std;
class Matrix {
private:
/**
* Split a string following delimiters
*/
void tokenize(const string& str, vector<string>& tokens, const string& delimiters) {
// Skip delimiters at beginning.
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find first "non-delimiter".
string::size_type pos = str.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos)
{
// Found a token, add it to the vector.
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters. Note the "not_of"
lastPos = str.find_first_not_of(delimiters, pos);
// Find next "non-delimiter"
pos = str.find_first_of(delimiters, lastPos);
}
}
public:
// used for efficiency tests
long long totalMapSize;
long long totalOp;
double ** mat; // the matrix as it is stored in the matrix file
int length;
double granularity; // the real granularity used, greater than 1
long long ** matInt; // the discrete matrix with offset
double errorMax;
long long *offsets; // offset of each column
long long offset; // sum of offsets
long long *minScoreColumn; // min discrete score at each column
long long *maxScoreColumn; // max discrete score at each column
long long *sum;
long long minScore; // min total discrete score (normally 0)
long long maxScore; // max total discrete score
long long scoreRange; // score range = max - min + 1
long long *bestScore;
long long *worstScore;
double background[4];
Matrix() {
granularity = 1.0;
offset = 0;
background[0] = background[1] = background[2] = background[3] = 0.25;
}
Matrix(double pA, double pC, double pG, double pT) {
granularity = 1.0;
offset = 0;
background[0] = pA;
background[1] = pC;
background[2] = pG;
background[3] = pT;
}
~Matrix() {
for (int k = 0; k < 4; k++ ) {
delete[] matInt[k];
}
delete[] matInt;
delete[] mat;
delete[] offsets;
delete[] minScoreColumn;
delete[] maxScoreColumn;
delete[] sum;
delete[] bestScore;
delete[] worstScore;
}
void toLogOddRatio () {
for (int p = 0; p < length; p++) {
double sum = mat[0][p] + mat[1][p] + mat[2][p] + mat[3][p];
for (int k = 0; k < 4; k++) {
mat[k][p] = log((mat[k][p] + 0.25) /(sum + 1)) - log (background[k]);
}
}
}
void toLog2OddRatio () {
for (int p = 0; p < length; p++) {
double sum = mat[0][p] + mat[1][p] + mat[2][p] + mat[3][p];
for (int k = 0; k < 4; k++) {
mat[k][p] = log2((mat[k][p] + 0.25) /(sum + 1)) - log2 (background[k]);
}
}
}
/**
* Transforms the initial matrix into an integer and offseted matrix.
*/
void computesIntegerMatrix (double granularity, bool sortColumns = true);
// computes the complete score distribution between score min and max
void showDistrib (long long min, long long max) {
map<long long, double> *nbocc = calcDistribWithMapMinMax(min,max);
map<long long, double>::iterator iter;
// computes p values and stores them in nbocc[length]
double sum = 0;
map<long long, double>::reverse_iterator riter = nbocc[length-1].rbegin();
while (riter != nbocc[length-1].rend()) {
sum += riter->second;
nbocc[length][riter->first] = sum;
riter++;
}
iter = nbocc[length].begin();
while (iter != nbocc[length].end() && iter->first <= max) {
//cout << (((iter->first)-offset)/granularity) << " " << (iter->second) << " " << nbocc[length-1][iter->first] << endl;
iter ++;
}
}
/**
* Computes the pvalue associated with the threshold score requestedScore.
*/
void lookForPvalue (long long requestedScore, long long min, long long max, double *pmin, double *pmax);
/**
* Computes the score associated with the pvalue requestedPvalue.
*/
long long lookForScore (long long min, long long max, double requestedPvalue, double *rpv, double *rppv);
/**
* Computes the distribution of scores between score min and max as the DP algrithm proceeds
* but instead of using a table we use a map to avoid computations for scores that cannot be reached
*/
map<long long, double> *calcDistribWithMapMinMax (long long min, long long max);
void readMatrix (string matrix) {
vector<string> str;
tokenize(matrix, str, " \t|");
this->length = 0;
this->length = str.size() / 4;
mat = new double*[4];
int idx = 0;
for (int j = 0; j < 4; j++) {
this->mat[j] = new double[this->length];
for (int i = 0; i < this->length; i++) {
mat[j][i] = atof(str.at(idx).data());
idx++;
}
}
str.clear();
}
}; /* Matrix */
#endif
Matrix.cpp
#include "Matrix.h"
#define MEMORYCOUNT
void Matrix::computesIntegerMatrix (double granularity, bool sortColumns) {
double minS = 0, maxS = 0;
double scoreRange;
// computes precision
for (int i = 0; i < length; i++) {
double min = mat[0][i];
double max = min;
for (int k = 1; k < 4; k++ ) {
min = ((min < mat[k][i])?min:(mat[k][i]));
max = ((max > mat[k][i])?max:(mat[k][i]));
}
minS += min;
maxS += max;
}
// score range
scoreRange = maxS - minS + 1;
if (granularity > 1.0) {
this->granularity = granularity / scoreRange;
} else if (granularity < 1.0) {
this->granularity = 1.0 / granularity;
} else {
this->granularity = 1.0;
}
matInt = new long long *[length];
for (int k = 0; k < 4; k++ ) {
matInt[k] = new long long[length];
for (int p = 0 ; p < length; p++) {
matInt[k][p] = ROUND_TO_INT((double)(mat[k][p]*this->granularity));
}
}
this->errorMax = 0.0;
for (int i = 1; i < length; i++) {
double maxE = mat[0][i] * this->granularity - (matInt[0][i]);
for (int k = 1; k < 4; k++) {
maxE = ((maxE < mat[k][i] * this->granularity - matInt[k][i])?(mat[k][i] * this->granularity - (matInt[k][i])):(maxE));
}
this->errorMax += maxE;
}
if (sortColumns) {
// sort the columns : the first column is the one with the greatest value
long long min = 0;
for (int i = 0; i < length; i++) {
for (int k = 0; k < 4; k++) {
min = MIN(min,matInt[k][i]);
}
}
min --;
long long *maxs = new long long [length];
for (int i = 0; i < length; i++) {
maxs[i] = matInt[0][i];
for (int k = 1; k < 4; k++) {
if (maxs[i] < matInt[k][i]) {
maxs[i] = matInt[k][i];
}
}
}
long long **mattemp = new long long *[4];
for (int k = 0; k < 4; k++) {
mattemp[k] = new long long [length];
}
for (int i = 0; i < length; i++) {
long long max = maxs[0];
int p = 0;
for (int j = 1; j < length; j++) {
if (max < maxs[j]) {
max = maxs[j];
p = j;
}
}
maxs[p] = min;
for (int k = 0; k < 4; k++) {
mattemp[k][i] = matInt[k][p];
}
}
for (int k = 0; k < 4; k++) {
for (int i = 0; i < length; i++) {
matInt[k][i] = mattemp[k][i];
}
}
for (int k = 0; k < 4; k++) {
delete[] mattemp[k];
}
delete[] mattemp;
delete[] maxs;
}
// computes offsets
this->offset = 0;
offsets = new long long [length];
for (int i = 0; i < length; i++) {
long long min = matInt[0][i];
for (int k = 1; k < 4; k++ ) {
min = ((min < matInt[k][i])?min:(matInt[k][i]));
}
offsets[i] = -min;
for (int k = 0; k < 4; k++ ) {
matInt[k][i] += offsets[i];
}
this->offset += offsets[i];
}
// look for the minimum score of the matrix for each column
minScoreColumn = new long long [length];
maxScoreColumn = new long long [length];
sum = new long long [length];
minScore = 0;
maxScore = 0;
for (int i = 0; i < length; i++) {
minScoreColumn[i] = matInt[0][i];
maxScoreColumn[i] = matInt[0][i];
sum[i] = 0;
for (int k = 1; k < 4; k++ ) {
sum[i] = sum[i] + matInt[k][i];
if (minScoreColumn[i] > matInt[k][i]) {
minScoreColumn[i] = matInt[k][i];
}
if (maxScoreColumn[i] < matInt[k][i]) {
maxScoreColumn[i] = matInt[k][i];
}
}
minScore = minScore + minScoreColumn[i];
maxScore = maxScore + maxScoreColumn[i];
//cout << "minScoreColumn[" << i << "] = " << minScoreColumn[i] << endl;
//cout << "maxScoreColumn[" << i << "] = " << maxScoreColumn[i] << endl;
}
this->scoreRange = maxScore - minScore + 1;
bestScore = new long long[length];
worstScore = new long long[length];
bestScore[length-1] = maxScore;
worstScore[length-1] = minScore;
for (int i = length - 2; i >= 0; i--) {
bestScore[i] = bestScore[i+1] - maxScoreColumn[i+1];
worstScore[i] = worstScore[i+1] - minScoreColumn[i+1];
}
}
/**
* Computes the pvalue associated with the threshold score requestedScore.
*/
void Matrix::lookForPvalue (long long requestedScore, long long min, long long max, double *pmin, double *pmax) {
map<long long, double> *nbocc = calcDistribWithMapMinMax(min,max);
map<long long, double>::iterator iter;
// computes p values and stores them in nbocc[length]
double sum = nbocc[length][max+1];
long long s = max + 1;
map<long long, double>::reverse_iterator riter = nbocc[length-1].rbegin();
while (riter != nbocc[length-1].rend()) {
sum += riter->second;
if (riter->first >= requestedScore) s = riter->first;
nbocc[length][riter->first] = sum;
riter++;
}
//cout << " s found : " << s << endl;
iter = nbocc[length].find(s);
while (iter != nbocc[length].begin() && iter->first >= s - errorMax) {
iter--;
}
//cout << " s - E found : " << iter->first << endl;
#ifdef MEMORYCOUNT
// for tests, store the number of memory bloc necessary
for (int pos = 0; pos <= length; pos++) {
totalMapSize += nbocc[pos].size();
}
#endif
*pmax = nbocc[length][s];
*pmin = iter->second;
}
/**
* Computes the score associated with the pvalue requestedPvalue.
*/
long long Matrix::lookForScore (long long min, long long max, double requestedPvalue, double *rpv, double *rppv) {
map<long long, double> *nbocc = calcDistribWithMapMinMax(min,max);
map<long long, double>::iterator iter;
// computes p values and stores them in nbocc[length]
double sum = 0.0;
map<long long, double>::reverse_iterator riter = nbocc[length-1].rbegin();
long long alpha = riter->first+1;
long long alpha_E = alpha;
nbocc[length][alpha] = 0.0;
while (riter != nbocc[length-1].rend()) {
sum += riter->second;
nbocc[length][riter->first] = sum;
if (sum >= requestedPvalue) {
break;
}
riter++;
}
if (sum > requestedPvalue) {
alpha_E = riter->first;
riter--;
alpha = riter->first;
} else {
if (riter == nbocc[length-1].rend()) { // path following the remark of the mail
riter--;
alpha = alpha_E = riter->first;
} else {
alpha = riter->first;
riter++;
sum += riter->second;
alpha_E = riter->first;
}
nbocc[length][alpha_E] = sum;
//cout << "Pv(S) " << riter->first << " " << sum << endl;
}
#ifdef MEMORYCOUNT
// for tests, store the number of memory bloc necessary
for (int pos = 0; pos <= length; pos++) {
totalMapSize += nbocc[pos].size();
}
#endif
if (alpha - alpha_E > errorMax) alpha_E = alpha;
*rpv = nbocc[length][alpha];
*rppv = nbocc[length][alpha_E];
delete[] nbocc;
return alpha;
}
// computes the distribution of scores between score min and max as the DP algrithm proceeds
// but instead of using a table we use a map to avoid computations for scores that cannot be reached
map<long long, double> *Matrix::calcDistribWithMapMinMax (long long min, long long max) {
// maps for each step of the computation
// nbocc[length] stores the pvalue
// nbocc[pos] for pos < length stores the qvalue
map<long long, double> *nbocc = new map<long long, double> [length+1];
map<long long, double>::iterator iter;
long long *maxs = new long long[length+1]; // # pos i maximum score reachable with the suffix matrix from i to length-1
maxs[length] = 0;
for (int i = length-1; i >= 0; i--) {
maxs[i] = maxs[i+1] + maxScoreColumn[i];
}
// initializes the map at position 0
for (int k = 0; k < 4; k++) {
if (matInt[k][0]+maxs[1] >= min) {
nbocc[0][matInt[k][0]] += background[k];
}
}
// computes q values for scores greater or equal than min
nbocc[length-1][max+1] = 0.0;
for (int pos = 1; pos < length; pos++) {
iter = nbocc[pos-1].begin();
while (iter != nbocc[pos-1].end()) {
for (int k = 0; k < 4; k++) {
long long sc = iter->first + matInt[k][pos];
if (sc+maxs[pos+1] >= min) {
// the score min can be reached
if (sc > max) {
// the score will be greater than max for all suffixes
nbocc[length-1][max+1] += nbocc[pos-1][iter->first] * background[k]; //pow(4,length-pos-1) ;
totalOp++;
} else {
nbocc[pos][sc] += nbocc[pos-1][iter->first] * background[k];
totalOp++;
}
}
}
iter++;
}
//cerr << " map size for " << pos << " " << nbocc[pos].size() << endl;
}
delete[] maxs;
return nbocc;
}
pytfmpval.i
%module pytfmpval
%{
#include "../src/Matrix.h"
#define SWIG_FILE_WITH_INIT
%}
%include "cpointer.i"
%include "std_string.i"
%include "std_vector.i"
%include "typemaps.i"
%include "../src/Matrix.h"
%pointer_class(double, doublep)
%pointer_class(int, intp)
%nodefaultdtor Matrix;
The c++ functions are called in a python module.
I worry that nbocc in Matrix.cpp is not being properly dereferenced or deleted. Is this use valid?
I have tried using gc.collect() and I am using the multiprocessing module as recommended in this question to call these functions from my python program. I've also tried deleting the Matrix object from within python to no avail.
I'm out of characters, but will provide any additional needed info in the comments as well as I can.
UPDATE: I've removed all of the python code, as it wasn't the issue and made for an absurdly long post. As I stated in the comments below, this was ultimately solved by taking the suggestion of many users and creating a minimal example that exhibited the issue in pure C++. I then used valgrind to identify the problematic pointers created with new and made sure that they were properly dereferenced. This fixed almost all memory leaks. One remains, but it leaks only a few hundred bytes over thousands of iterations and would require refactoring the entire Matrix class, which simply isn't worth the time for what it is. Bad practice, I know. To any other newbie in C++ out there, seriously try to avoid dynamic memory allocation or utilize std::unique_ptr or std::shared_ptr.
Thanks again to everyone who provided input and suggestions.

It’s hard to follow what’s happening, but I’m pretty sure your matrices are not being cleaned up correctly.
In readMatrix, you have a loop over j which contains the line this->mat[j] = new double[this->length];. This allocates memory, which mat[j] points to. This memory needs to be freed at some point, by calling delete[] mat[j] (or some other loop variable). However, in the destructor, you just call delete[] mat, which leaks all of the arrays inside it.
Some general suggestions on cleaning this up:
If you know the bounds of an array, such as that matInt will always have a length of 4, you should declare it with that fixed length (long long* matInt[4] will make an array of four pointers to long long, each of which could be a pointer to an array); this will mean you don’t need to either new or delete it.
If you have a double pointer like double ** mat, and you allocate both the first and second layers of pointers with new[], you need to deallocate the inner layer with delete[] (and you need to do it before you delete[] the outer layer).
If you still have trouble, more of your code will be clear if you remove the methods which don’t seem relevant to the problem. For example, toLogOddRatio doesn’t allocate or deallocate memory at all; it almost certainly isn’t contributing to the problem and you can remove it from the code you post here (once you’ve removed the parts which you think don’t contribute, test again to make sure the problem’s still there; if not then you know that it was one of those parts somehow causing the leak).

Two questions are in play here: managing memory in C++, and then nudging the C++ side from the Python side to clean up. I'm guessing SWIG is generating a wrapper for the Matrix destructor and calling the destructor at some useful time. (I might convince myself of that by having the dtor make some noise.) That should handle the second question.
So let's focus on the C++ side. Passing around a bare map * is a well-known invitation to mischief. Here are two alternatives.
Alternative one: make the map a member of Matrix. Then it gets cleaned up automatically by ~Matrix(). This is the easiest thing. If the lifetime of the map does not exceed the lifetime of the Matrix, then this route will work.
Alternative two: if the map needs to persist after the Matrix object, then instead of passing around map *, use a shared pointer, std::shared_ptr<map>. The shared pointer reference counts the pointee (i.e. the dynamically allocated Matrix). When the ref count goes to zero, it deletes the underlying object.
They both build on the rule to allocate resources (memory in this case) in constructors and deallocate in destructors. This is called RAII (Resource Allocation Is Initialization). Another application of RAII in your code would be to use std::vector<long long> offsets instead of long long *offsets etc. Then you just resize the vectors as needed. When the Matrix is destroyed, the vectors are deleted with no intervention on your part. For the matrix, you could use a vector of vectors, and so on.

to answer your question, yes you can use delete on diffrent function or method. and you should, any memory you allocate in c/c++ you need to free (delete in c++ lingo)
python isn't aware of this memory, it's not a python object, so gc.collect() won't help.
you should add a c function that would take a Matrix struct and free/delete the memory use on that struct. and call it from python, swig in not handling memory allocation (only for the objects swig creates)
I would recommended looking into newer packages other then swig, like cython or cffi (or even NumPy matrix handling, I've heard he's good at)

Why is my Python NumPy code faster than C++?

Why is this Python NumPy code,
import numpy as np
import time
k_max = 40000
N = 10000
data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)
t1 = time.time()
for k in xrange(1,k_max+1):
cos_k = np.cos(k*data[0,:])
sin_k = np.sin(k*data[0,:])
coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()
print('Time:')
print(t2-t1)
faster than the following C++ code?
#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>
using namespace std;
// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;
int main()
{
time_t start, stop;
double diff;
// table with data
double data1[ N ];
double data2[ N ];
// table of results
double coefs1[ k_max ];
double coefs2[ k_max ];
// main loop
time( & start );
for( unsigned int j = 1; j<N; j++ )
{
for( unsigned int i = 0; i<k_max; i++ )
{
coefs1[ i ] += data2[ j-1 ]*(cos((i+1)*data1[ j-1 ]) - cos((i+1)*data1[ j ]));
coefs2[ i ] += data2[ j-1 ]*(sin((i+1)*data1[ j-1 ]) - sin((i+1)*data1[ j ]));
}
}
// end of main loop
time( & stop );
// speed result
diff = difftime( stop, start );
cout << "Time: " << diff << " seconds";
return 0;
}
The first one shows: "Time: 8 seconds"
while the second: "Time: 11 seconds"
I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?
Version 2 of the code
I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.
Version 3 of the code
I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).

I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).
The answer is: your C++ code is not slower than your Python code when properly compiled.
I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.
I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main, to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled data variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles (dtype=np.float64), because there are doubles in C++ example. My pure C version of your code (it's similar):
#include <math.h>
#include <stdio.h>
#include <time.h>
const int k_max = 100000;
const int N = 10000;
int main(void)
{
clock_t t_start, t_end;
double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
int z;
for( z = 0; z < N; z++ )
{
data1[z] = 0.1;
data2[z] = 0.4;
}
int i, j;
t_start = clock();
for( i = 0; i < k_max; i++ )
{
for( j = 0; j < N-1; j++ )
{
coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
}
}
t_end = clock();
seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("Time: %f s\n", seconds);
return coefs1[0];
}
For k_max = 100000, N = 10000 results where following:
Python 70.284362 s
C++ 69.133199 s
C 61.638186 s
Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.
For k_max = 1000000, N = 1000 we have:
Python 115.42766 s
C++ 70.781380 s
For k_max = 1000000, N = 100:
Python 52.86826 s
C++ 7.050597 s
So the difference increases with fraction k_max/N, but python is not faster even for N much bigger than k_max, e. g. k_max = 100, N = 100000:
Python 0.651587 s
C++ 0.568518 s
Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.
Python script:
import numpy as np
from time import time
N = 10000
x_len = 100000
def main():
x = np.ones(x_len, dtype=np.float64) * 1.2345
start = time()
for i in xrange(N):
y1 = np.cos(x, dtype=np.float64)
end = time()
print('cos: {} s'.format(end-start))
start = time()
for i in xrange(N):
y2 = x * 7.9463
end = time()
print('multi: {} s'.format(end-start))
start = time()
for i in xrange(N):
res = np.sum(x, dtype=np.float64)
end = time()
print('sum: {} s'.format(end-start))
return y1, y2, res
if __name__ == '__main__':
main()
# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s
C script:
#include <math.h>
#include <stdio.h>
#include <time.h>
const int N = 10000;
const int x_len = 100000;
int main()
{
clock_t t_start, t_end;
double x[x_len], y1[x_len], y2[x_len], res, time;
int i, j;
for( i = 0; i < x_len; i++ )
{
x[i] = 1.2345;
}
t_start = clock();
for( j = 0; j < N; j++ )
{
for( i = 0; i < x_len; i++ )
{
y1[i] = cos(x[i]);
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("cos: %f s\n", time);
t_start = clock();
for( j = 0; j < N; j++ )
{
for( i = 0; i < x_len; i++ )
{
y2[i] = x[i] * 7.9463;
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("multi: %f s\n", time);
t_start = clock();
for( j = 0; j < N; j++ )
{
res = 0.0;
for( i = 0; i < x_len; i++ )
{
res += x[i];
}
}
t_end = clock();
time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
printf("sum: %f s\n", time);
return y1[0], y2[0], res;
}
// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s
Python results:
cos: 22.7199969292 s
multi: 0.841291189194 s
sum: 1.15971088409 s
C results:
cos: 20.910590 s
multi: 0.633281 s
sum: 1.153001 s
As you can see NumPy is incredibly fast, but always a bit slower than pure C.

I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...
Numpy is using complex Linear Algebra libraries !
Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.
I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.
C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).
Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from #Jfs, and we observe: "There is no difference between C++ and numpy on my machine."
Some more reference:
Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).
I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:
#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>
const unsigned int k_max = 40000;
const unsigned int N = 10000;
template <class T>
class matrix2 {
std::vector<T> data;
size_t cols;
size_t rows;
public:
matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
T &operator()(size_t y, size_t x) {
assert(x <= cols);
assert(y <= rows);
return data[y*cols + x];
}
T operator()(size_t y, size_t x) const {
assert(x <= cols);
assert(y <= rows);
return data[y*cols + x];
}
};
int main() {
matrix2<double> data(N, 2);
matrix2<double> coeffs(k_max, 2);
using namespace std::chrono;
auto start = high_resolution_clock::now();
for (int k = 0; k < k_max; k++) {
for (int j = 0; j < N - 1; j++) {
coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
}
}
auto end = high_resolution_clock::now();
std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}
This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.
The next obvious step would be to use multiple cores. To do that in C++, we can add this line:
#pragma omp parallel for
...before the outer for loop:
#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
for (int j = 0; j < N - 1; j++) {
coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
}
}
With -openmp added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).

I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs, hence should switch your for-loops. If this is the case, you should have the following:
#include <iostream>
#include <cmath>
#include <time.h>
const int k_max = 40000;
const int N = 10000;
double cos_k, sin_k;
int main(int argc, char const *argv[])
{
time_t start, stop;
double data[2][N];
double coefs[k_max][2];
time(&start);
for(int i=0; i<k_max; ++i)
{
for(int j=0; j<N; ++j)
{
coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
}
}
// End of main loop
time(&stop);
// Speed result
double diff = difftime(stop, start);
std::cout << "Time: " << diff << " seconds" << std::endl;
return 0;
}
Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3, while Python code runs at 7.816 seconds.

The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is often slower than C++ since C++ can do more optimizations. They'll only be around each other with Python running somewhere between the same time as C++ to about twice C++ when doing the majority of your computation in large computations that Python pushes off to compiled binaries to calculate. Most anything beyond large matrix multiplication, addition, scalar on matrix multiplication, etc. will perform much worse in Python. For example, look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. That makes sense too if you do anything other than simple math operations - anything with linked lists, binary search trees, procedural code, etc. The interpreted nature of Python combined with it storing metadata for each object (even int, double, float, etc.) significantly bogs things down in a way that no Python programmer can fix.

Transform python yield into c++

I have a piece of python code I need to use in c++. The algorithm is a recursion that uses yield.
Here is the python function:
def getSubSequences(self, s, minLength=1):
if len(s) >= minLength:
for i in range(minLength, len(s) + 1):
for p in self.getSubSequences(s[i:], 1 if i > 1 else 2):
yield [s[:i]] + p
elif not s:
yield []
and here is my attempt so far
vector< vector<string> > getSubSequences(string number, unsigned int minLength=1) {
if (number.length() >= minLength) {
for (unsigned int i=minLength; i<=number.length()+1; i++) {
string sub = "";
if (i <= number.length())
sub = number.substr(i);
vector< vector<string> > res = getSubSequences(sub, (i > 1 ? 1 : 2));
vector< vector<string> > container;
vector<string> tmp;
tmp.push_back(number.substr(0, i));
container.push_back(tmp);
for (unsigned int j=0; j<res.size(); j++) {
container.push_back(res.at(j));
return container;
}
}
} else if (number.length() == 0)
return vector< vector<string> >();
}
Unfortunately I get a segmentation fault when executing it. Is this even the right attempt or is there an easier way to do this? The data structures are not fixed I just need the same result as I get in the python code!

The loops in your above code snippets are not equivalent.
The Python code has
for i in range(minLength, len(s) + 1):
The C++ code has
for (unsigned int i=minLength; i<=number.length()+1; i++) {
So the Python loop terminates one iteration sooner than the C++ one.
The question has really nothing to do with yield. I think you should print stuff out from implementations, in these cases, and study them. In this case, it would have shown that the two algorithms diverge.

unusual speed difference between python and c++ programs

I wrote same program in C++ and Python. In Python it takes unusual amount of time(Actually I did't get answer in it). Can anybody explain why is that?
C++ code:
#include<iostream>
using namespace std;
int main(){
int n = 1000000;
int *solutions = new int[n];
for (int i = 1; i <= n; i++){
solutions[i] = 0;
}
for (int v = 1; v <= n; v++){
for (int u = 1; u*v <= n; u++){
if ((3 * v>u) & (((u + v) % 4) == 0) & (((3 * v - u) % 4) == 0)){
solutions[u*v]++;
}
}
}
int count = 0;
for (int i = 1; i < n; i++){
if ((solutions[i])==10)
count += 1;
}
cout << count;
}
Python code:
n=1000000
l=[0 for x in range(n+1)]
for u in range(1,n+1):
v=1
while u*v<n+1:
if (((u+v)%4)==0) and (((3*v-u)%4)==0) and (3*v>u):
l[u*v]+=1
v+=1
l.count(10)

You can try optimizing this loop, for example make it a single block with no ifs in it, or otherwise use a module in C.
C++ compiler does optimalizations Python runtime can't, so with pure interpreter you will never get performance being anything close.
And 1M interactions is a lot, I woulnt start with any interpreter in that range, you'd be better doing it in a browser and JavaScript.

Permutation with backtraking from C to Python

I have to do a program that gives all permutations of n numbers {1,2,3..n} using backtracking. I managed to do it in C, and it works very well, here is the code:
int st[25], n=4;
int valid(int k)
{
int i;
for (i = 1; i <= k - 1; i++)
if (st[k] == st[i])
return 0;
return 1;
}
void bktr(int k)
{
int i;
if (k == n + 1)
{
for (i = 1; i <= n; i++)
printf("%d ", st[i]);
printf("\n");
}
else
for (i = 1; i <= n; i++)
{
st[k] = i;
if (valid(k))
bktr(k + 1);
}
}
int main()
{
bktr(1);
return 0;
}
Now I have to write it in Python. Here is what I did:
st=[]
n=4
def bktr(k):
if k==n+1:
for i in range(1,n):
print (st[i])
else:
for i in range(1,n):
st[k]=i
if valid(k):
bktr(k+1)
def valid(k):
for i in range(1,k-1):
if st[k]==st[i]:
return 0
return 1
bktr(1)
I get this error:
list assignment index out of range
at st[k]==st[i].

Python has a "permutations" functions in the itertools module:
import itertools
itertools.permutations([1,2,3])
If you need to write the code yourself (for example if this is homework), here is the issue:
Python lists do not have a predetermined size, so you can't just set e.g. the 10th element to 3. You can only change existing elements or add to the end.
Python lists (and C arrays) also start at 0. This means you have to access the first element with st[0], not st[1].
When you start your program, st has a length of 0; this means you can not assign to st[1], as it is not the end.
If this is confusing, I recommend you use the st.append(element) method instead, which always adds to the end.
If the code is done and works, I recommend you head over to code review stack exchange because there are a lot more things that could be improved.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.