I am currently learning C++, and being quite proficient in Python, I decided to try porting some of my Python code to C++. Specifically, I tried porting this generator I wrote that yields the Fibonacci sequence up to a given stop value.
def yieldFib(stop):
    a = 0
    b = 1
    for i in range(2):
        yield i
    for i in range(stop-2):
        fib = a+b
        a = b
        b = fib
        yield fib

fib = list(yieldFib(100))
print(fib)
to this
#include <iostream>
using namespace std;

int* fib(int stopp){
    int a = 0;
    int b = 1;
    int fibN;
    int fibb[10000000];
    fibb[0] = 0;
    fibb[1] = 1;
    for(int i=2; i<stopp-2; i++){
        fibN = a+b;
        a = b;
        b = fibN;
        fibb[i] = fibN;
    }
    return fibb;
}

int main(){
    int stop;
    cin >> stop;
    int* fibbb = fib(stop);
    cout << fibbb;
}
I admit the C++ is very crude, but this is just to aid my learning. For some reason the code just crashes and quits after it takes user input. I suspect it has something to do with the way I try to use the array, but I'm not quite sure what. Any help will be appreciated.
An integer array of size 10000000 is generally too large to be allocated on the stack, which causes the crash. Instead, use a std::vector<int>. In addition to that:
The variable b is unused.
fibN is not initialized. Its value will be indeterminate.
Returning a pointer to stack memory will not work, as that pointer is no longer valid once the function has returned.
The code would print the value of an integer pointer instead of the values of the array. Instead, iterate over the array and print the values one by one.
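Putting those points together, here is a minimal sketch of what the fixed program could look like. It is only an illustration of the advice above, not the only correct way to write it:

#include <iostream>
#include <vector>

std::vector<int> fib(int stop){
    std::vector<int> v;
    if (stop > 0) v.push_back(0);
    if (stop > 1) v.push_back(1);
    for (int i = 2; i < stop; i++){
        v.push_back(v[i - 1] + v[i - 2]);   // build the sequence inside the vector
    }
    return v;   // returning a vector by value is fine, unlike a pointer to a local array
}

int main(){
    int stop;
    std::cin >> stop;
    for (int x : fib(stop)){
        std::cout << x << ' ';   // print the values one by one, not the pointer
    }
    std::cout << '\n';
}

(Note that int overflows well before the 100th Fibonacci number; a wider type would be needed for large stop values, but the same is true of the original code.)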
On a side note: It seems that you are trying to learn C++ by trial-and-error, while it is best learned from the ground up using a book or course.
In the provided code, I see several mistakes.
First: you're creating an int array of length 10000000 on the stack (you are not allocating memory dynamically), which is about 40 MB! That exceeds the typical stack size (1 MB, as I remember). Just allocate it with the new operator. If you don't want to work with this kind of raw array (or don't want to calculate its precise length up front), you can use std::vector, which can grow itself in memory.
int* fibb = new int[precise_length];
//or
std::vector<int> fibb = std::vector<int>(); //and fill it by calling fibb.push_back()
Second: the cout usage. You are trying to print the array pointer, not its contents. Print every member of the array separately.
#include<bits/stdc++.h>
using namespace std;

vector<int> fib( const int& n ){
    vector<int> v = {0, 1};
    for(int i = 2; i <= n; i++){
        v.push_back( v[i - 1] + v[i - 2] );
    }
    return v;
}

int main(){
    int n;
    cin >> n;
    vector<int> _fib = fib( n );
    for( auto x : _fib ){
        cout << x << ' ';
    }
    return 0;
}
I am trying to rewrite a Fibonacci algorithm from Python to C, but am having some problems. Below is the algorithm in Python, which gives me the correct answer:
def fib(n):
    a, b = 1, 1
    for i in range(n):
        a, b = a+b, a
        print(a, b)
    return b

print(fib(8))
When I wrote it in C, however, I get powers of 2 instead. I am not sure how, or whether it is possible, to correct it.
#include <stdio.h>

int fib(int n){
    int a = 1;
    int b = 1;
    for (int i=0; i<n; i++){
        a = a+b;
        b = a;
    }
    return b;
}

int main(void){
    int n;
    printf("Enter the # of the fibonacci number in sequence:");
    scanf("%d", &n);
    int r = fib(n);
    printf("%d\n", r);
}
Just add a 3rd variable for the extra handling:
#include <stdio.h>

int fib(int n) {
    int a, b, c;
    a = 0;
    b = 1;
    c = 0;
    for (int i = 0; i < n; i++) {
        a = b+c;
        b = c;
        c = a;
    }
    return a;
}

int main() {
    int n;
    n = fib(10); // 55
    printf("%d\n", n);
}
Zero is a Fibonacci number too!
Try:
int fib(int n){
    int a = -1;
    int b = 1;
    for (int i = 0; i < n; i++) {
        int t = a + b;
        a = b;
        b = t;
    }
    return b;
}
Also, none of the answers so far (including this one) account for signed integer overflow, which is undefined behavior in C. For a trivial program like this it's okay to ignore it, though. When n is greater than 47 it overflows on my PC.
It can be done with just two variables via the simple change b = a-b. However, an optimizing compiler will typically use a third register as a temporary variable instead of actually computing b = a - b.
#include <stdio.h>

int fib(int n){
    int a = 1;
    int b = 0;
    for (int i = 0; i < n; i++){
        a = a+b;
        b = a-b;
    }
    return b;
}

int main(void){
    int n;
    printf("Enter the # of the fibonacci number in sequence:");
    scanf("%d", &n);
    int r = fib(n);
    printf("%d\n", r);
    return 0;
}
In Python, when a multiple assignment statement is executed, all expressions on the right-hand side are evaluated using the current values of the variables, and then these values are bound to the variables given in the left-hand side. For example, the statement
a, b = a+b, a
evaluates both the expressions a+b and a, using the current values of a and b, and then binds these values to a and b, respectively. If the current values are a=1, b=1, then the expressions have values 1+1=2 and 1, respectively, and so we get a=2, b=1.
However, your C code executes the two statements a=a+b and b=a in sequence, rather than in parallel (like Python does). First the expression a+b is evaluated (if a=1, b=1, then a+b=2), and its value is bound to the variable a (so a is 2), and finally the expression a in the second statement is evaluated (it has value 2), and its value is bound to b (so b=2). If you want to use the original value of a (a=1) in the statement b=a, then you must store that original value in a temporary variable before overwriting a by a+b, as follows:
temp=a;
a=a+b;
b=temp;
In general, you can figure out yourself what is wrong with the program by learning to do a hand simulation of the program, where you write down (on paper) the current values of each variable in the program. For example, you can write the variables a and b in separate lines, with their initial values 1 and 1, respectively, next to them. Each time you execute an assignment statement, evaluate the expression on the right-hand side using the current values, and overwrite the left-hand-side variable with this new value. So, when you encounter a=a+b, you would evaluate a+b to be 2 (using the current values), and then cross out the current value of 1 for a and replace it by 2. In this manner, you can obtain the output using pencil and paper, and figure out what is wrong with the program.
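As a short illustration of such a hand simulation (this trace is mine, not part of the original answer): start with a=1, b=1 in the C version above. First iteration: a=a+b makes a=2, then b=a makes b=2. Second iteration: a=4, b=4. Third iteration: a=8, b=8. Both variables double on every pass, which is exactly why the C program prints powers of 2 instead of Fibonacci numbers.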
In Python I have the following simple code:
import random
from timeit import timeit
import numba as nb

N = 10000
mu = 0.0001
iterations = 10000
l = 10

@nb.njit()
def func1(N, l, mu, iterations):
    intList = [0]*N
    for x in range(iterations):
        for number in range(N):
            for position in range(l):
                if random.random() < mu:
                    intList[number] = intList[number] ^ (1 << position)

func1(N, l, mu, iterations)
count = 1
print(timeit(lambda: func1(N, l, mu, iterations), number=count))
>>> 5.677
I'm not used to C++, but I wanted to see how quick it would be compared to the Python version. Since my Python code is quite simple, I thought I could give it a try. My C++ code, which should be equivalent to the Python code, is:
#include <iostream>
#include <random>
#include <vector>

using namespace std;

int func1(int iterations, int l, int N, float mu)
{
    std::random_device rd;  // Will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd()); // Standard mersenne_twister_engine seeded with rd()
    std::uniform_real_distribution<> dis(0.0, 1.0);

    std::vector<int> intList(N);

    //for (int i = 0; i < N; i++)
    //    cout << intList[i];
    //cout << "\n";

    int x, number, position;
    for (x = 0; x < iterations; x++) {
        for (number = 0; number < N; number++) {
            for (position = 0; position < l; position++) {
                if (dis(gen) < mu) {
                    intList[number] = intList[number] ^ (1 << position);
                }
            }
        }
    }

    //for (int i = 0; i < N; i++)
    //    cout << intList[i];

    return 0;
}

int main()
{
    int N, l, iterations;
    float mu;

    N = 10000;
    mu = 0.0001;
    iterations = 10000;
    l = 10;

    func1(iterations, l, N, mu);
    cout << "\nFertig";
}
But this code takes up to 5-10 times longer. I'm really surprised by that. What is the explanation for that?
Numba internally translates random.random calls into its own inlined internal Mersenne Twister implementation. So effectively the entire func1 gets compiled down to efficient code by LLVM. It might as well have been written in C++ like the other implementation.
And so it's no surprise to me that when I compile your C++ implementation with optimizations turned on, I cannot reproduce your issue. Both implementations are essentially the same. On my machine the Python code runs in ~6.1 seconds and the C++ code in ~6.9 seconds.
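(For reference, optimizations are typically turned on with a flag such as -O2 or -O3, e.g. g++ -O3 -o func1 func1.cpp; the file name here is just a placeholder. Without such a flag the compiler emits unoptimized code, which can easily account for a 5-10x slowdown.)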
If you wish to go faster, however, note that to efficiently implement genetic mutation with a low mutation probability (which it appears you are doing), you're better off first drawing the number of mutations from a binomial distribution with probability μ, and then selecting that many indices without replacement from your genome length. Alternatively, the method I describe here.
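A rough C++ sketch of that binomial idea, using the same parameters as above (this is my own illustration, not the answer's code, and the structure is assumed to match the genome-as-bitmask layout of the question):

#include <iostream>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const int N = 10000, l = 10, iterations = 10000;
    const double mu = 0.0001;

    std::mt19937 gen(std::random_device{}());
    std::binomial_distribution<int> num_mutations(l, mu);  // how many of the l bits flip in one genome

    std::vector<int> intList(N);
    std::vector<int> positions(l);
    for (int i = 0; i < l; ++i) positions[i] = i;

    for (int x = 0; x < iterations; ++x) {
        for (int number = 0; number < N; ++number) {
            int k = num_mutations(gen);          // with mu = 0.0001 this is almost always 0,
            if (k == 0) continue;                // so most genomes are skipped after a single draw
            // pick k distinct bit positions via a partial Fisher-Yates shuffle
            for (int j = 0; j < k; ++j) {
                std::uniform_int_distribution<int> rest(j, l - 1);
                std::swap(positions[j], positions[rest(gen)]);
                intList[number] ^= (1 << positions[j]);
            }
        }
    }
    std::cout << "done\n";
}

The point is that one binomial draw per genome replaces l uniform draws, so the inner loop over positions almost never runs when mu is small.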
Update: The C++ programs (as shown below) were compiled with no additional flags, i.e. g++ program.cpp. However, raising the optimisation level does not change the fact that brute force runs faster than the memoization technique (0.1 seconds vs. 1 second on my machine).
Context
I am trying to calculate the number (< 1 million) with the longest Collatz sequence. I wrote a brute-force algorithm and compared it with a suggested optimised program (which basically uses memoization).
My question is: What could possibly be the reason that the brute force executes faster than the supposedly optimised (memoization) version in C++ ?
Below are the comparisons from my machine (a MacBook Air); the times are given in comments at the beginning of each program.
C++ (brute force)
/**
 * runs in 1 second
 */
#include <iostream>
#include <vector>

unsigned long long nextSequence(unsigned long long n)
{
    if (n % 2 == 0)
        return n / 2;
    else
    {
        return 3 * n + 1;
    }
}

int main()
{
    int max_counter = 0;
    unsigned long long result;
    for (size_t i = 1; i < 1000000; i++)
    {
        int counter = 1;
        unsigned long long n = i;
        while (n != 1)
        {
            n = nextSequence(n);
            counter++;
        }
        if (counter > max_counter)
        {
            max_counter = counter;
            result = i;
        }
    }
    std::cout << result << " has " << max_counter << " sequences." << std::endl;
    return 0;
}
C++ (memoization)
/**
 * runs in 2-3 seconds
 */
#include <cstdint>
#include <iostream>
#include <unordered_map>

int countSequence(uint64_t n, std::unordered_map<uint64_t, uint64_t> &cache)
{
    if (cache.count(n) == 1)
        return cache[n];
    if (n % 2 == 0)
        cache[n] = 1 + countSequence(n / 2, cache);
    else
        cache[n] = 2 + countSequence((3 * n + 1) / 2, cache);
    return cache[n];
}

int main()
{
    uint64_t max_counter = 0;
    uint64_t result;
    std::unordered_map<uint64_t, uint64_t> cache;
    cache[1] = 1;
    for (uint64_t i = 500000; i < 1000000; i++)
    {
        if (countSequence(i, cache) > max_counter)
        {
            max_counter = countSequence(i, cache);
            result = i;
        }
    }
    std::cout << result << std::endl;
    return 0;
}
In Python the memoization technique really runs faster.
Python (memoization)
# runs in 1.5 seconds
def countChain(n):
    if n in values:
        return values[n]
    if n % 2 == 0:
        values[n] = 1 + countChain(n / 2)
    else:
        values[n] = 2 + countChain((3 * n + 1) / 2)
    return values[n]

values = {1: 1}
longest_chain = 0
answer = -1
for number in range(500000, 1000000):
    if countChain(number) > longest_chain:
        longest_chain = countChain(number)
        answer = number
print(answer)
Python (brute force)
# runs in 30 seconds
def countChain(n):
    if n == 1:
        return 1
    if n % 2 == 0:
        return 1 + countChain(n / 2)
    return 2 + countChain((3 * n + 1) / 2)

longest_chain = 0
answer = -1
for number in range(1, 1000000):
    temp = countChain(number)
    if temp > longest_chain:
        longest_chain = temp
        answer = number
print(answer)
I understand that your question is about the difference between the two C++ variants and not between the compiled C++ and the interpreted Python. Answering it decisively would require compiling the code with optimizations turned on and profiling its execution, as well as clarity about whether the compiler target is 64-bit or 32-bit.
But given the order-of-magnitude difference between the two versions of the C++ code, a quick inspection already shows that your memoization consumes more resources than it gains you.
One important performance bottleneck here is the memory management of the unordered map. An unordered_map works with buckets of items. The map adjusts the number of buckets when necessary, but this requires memory allocation (and potentially moving chunks of memory, depending on how the buckets are implemented).
Now, if you add the following statement just after the initialisation of the cache, and just before displaying the result, you'll see that there is a huge change in the number of buckets allocated:
std::cout << "Bucket count: "<<cache.bucket_count()<<"/"<<cache.max_bucket_count()<<std::endl;
To avoid the overhead associated to this, you could preallocate the number of buckets at construction:
std::unordered_map<uint64_t, uint64_t> cache(3000000);
Doing this on ideone for a small and informal test cut the running time by almost 50%.
But nonetheless: storing and finding objects in an unordered_map requires calculating hash codes, which involves a lot of arithmetic operations. So I guess that these operations are simply heavier than just doing the brute-force calculations.
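If you want to keep the memoization but avoid the hashing and bucket management altogether, one option (my suggestion, not part of the answer above) is a flat std::vector cache for values below the search limit, computing larger intermediate values without caching. A rough sketch:

#include <cstdint>
#include <iostream>
#include <vector>

// Cache chain lengths only for n < LIMIT; larger intermediate values are recomputed.
constexpr uint64_t LIMIT = 1000000;

uint64_t countSequence(uint64_t n, std::vector<uint64_t> &cache)
{
    if (n < LIMIT && cache[n] != 0)
        return cache[n];
    uint64_t steps;
    if (n % 2 == 0)
        steps = 1 + countSequence(n / 2, cache);
    else
        steps = 2 + countSequence((3 * n + 1) / 2, cache);
    if (n < LIMIT)
        cache[n] = steps;
    return steps;
}

int main()
{
    std::vector<uint64_t> cache(LIMIT, 0);
    cache[1] = 1;

    uint64_t max_counter = 0, result = 1;
    for (uint64_t i = 2; i < LIMIT; i++)
    {
        uint64_t c = countSequence(i, cache);
        if (c > max_counter)
        {
            max_counter = c;
            result = i;
        }
    }
    std::cout << result << " has " << max_counter << " sequences." << std::endl;
    return 0;
}

Array indexing replaces hashing, so each lookup is a single memory access, though the cache-miss considerations described in the next answer still apply.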
Main memory access is vastly slower than computation, so much so that when it's time to care, you should treat anything over a very few (CPU-model-dependent) megabytes as retrieved from an I/O or network device.
Even fetching from L1 is expensive compared to integer ops.
Long, long ago, this wasn't true. Computation and memory access were at least in the same ballpark for many decades, because there simply wasn't enough room in the transistor budget to make fast caches large enough to pay.
So people counted CPU operations and just assumed memory could more or less keep up.
Nowadays, it just … can't. The penalty for a CPU cache miss is hundreds of integer ops, and your million-16-byte-entry hash map is pretty much guaranteed to blow not just the CPU's memory caches but also the TLB, which takes the delay penalty from painful to devastating.
I am tasked with calculating Hamming distances between 1D binary arrays in two groups - a group of 3000 arrays and a group of 10000 arrays, where every array is 100 items (bits) long. So that's 3000x10000 Hamming-distance calculations on 100-bit objects, and all of it must be done in at most a dozen minutes.
Here's the best of what I came up with:
# X - 3000 by 100 bool np.array
# Y - 10000 by 100 bool np.array
hd = []
i = 1
for x in X:
    print("object nr " + str(i) + "/" + str(len(X)))
    arr = np.array([x] * len(Y))
    C = Y ^ arr  # just xor this array by all the arrays in the other group simultaneously
    hd.append([sum(c) for c in C])  # add up all the bits to get the hamming distance
    i += 1
return np.array(hd)
And it's still going to take 1-1.5 hours for it to finish. How do I go about making this faster?
You should be able to dramatically improve the summing speed by using numpy to perform it, rather than using a list comprehension and the built-in sum function (that takes no advantage of numpy vectorized operations).
Just replace:
hd.append([sum(c) for c in C])
with:
# Explicitly use uint16 to reduce memory cost; if array sizes might increase
# you can use uint32 to leave some wiggle room
hd.append(C.sum(1, dtype=np.uint16))
which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying it should operate on axis 1). For example:
>>> arr = np.array([[True,False,True], [False,False,True], [True, True,True]], dtype=np.bool)
>>> arr.sum(1, np.uint16)
array([ 2, 1, 3], dtype=uint16)
Since it performs all the work at the C layer in a single operation without type conversions (instead of your original approach that requires a Python level loop that operates on each row, then an implicit loop that, while at the C layer, must still implicitly convert each numpy value one by one from np.bool to Python level ints just to sum them), this should run substantially faster for the array scales you're describing.
Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate can do that more quickly and easily. Simply replace:
i = 1
for x in X:
    ... rest of loop ...
    i += 1
with:
for i, x in enumerate(X, 1):
    ... rest of loop ...
and you'll get the same behavior, but slightly faster, more concise and cleaner in general.
IIUC, you can use np.logical_xor and list comprehension:
result = np.array([[np.logical_xor(X[a], Y[b].T).sum() for b in range(len(Y))]
                   for a in range(len(X))])
The whole operation runs in 7 seconds on my machine.
0:00:07.226645
Just in case you are not limited to using Python, this is a solution in C++ using bitset:
#include <iostream>
#include <bitset>
#include <vector>
#include <random>
#include <chrono>

using real = double;

std::mt19937_64 rng;
std::uniform_real_distribution<real> bitset_dist(0, 1);
real prob(0.75);

std::bitset<100> rand_bitset()
{
    std::bitset<100> bitset;
    for (size_t idx = 0; idx < bitset.size(); ++idx)
    {
        bitset[idx] = (bitset_dist(rng) < prob) ? true : false;
    }
    return std::move(bitset);
}

int main()
{
    rng.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

    size_t v1_size(3000);
    size_t v2_size(10000);

    std::vector<size_t> hd;
    std::vector<std::bitset<100>> vec1;
    std::vector<std::bitset<100>> vec2;

    vec1.reserve(v1_size);
    vec2.reserve(v2_size);
    hd.reserve(v1_size * v2_size); /// Edited from hd.reserve(v1_size);

    for (size_t i = 0; i < v1_size; ++i)
    {
        vec1.emplace_back(rand_bitset());
    }
    for (size_t i = 0; i < v2_size; ++i)
    {
        vec2.emplace_back(rand_bitset());
    }

    std::cout << "vec1 size: " << vec1.size() << '\n'
              << "vec2 size: " << vec2.size() << '\n';

    auto start(std::chrono::high_resolution_clock::now());
    for (const auto& b1 : vec1)
    {
        for (const auto& b2 : vec2)
        {
            /// Count only the bits that are set and store them
            hd.emplace_back((b1 ^ b2).count());
        }
    }
    auto time(std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count());

    std::cout << vec1.size() << " x " << vec2.size()
              << " xor operations on 100 bits took " << time << " ms\n";

    return 0;
}
On my machine, the whole operation (3000 x 10000) takes about 300 ms.
You could put this into a function, compile it into a library and call it from Python. Another option is to store the distances to a file and then read them in Python.
EDIT: I had the wrong size for the hd vector. Reserving the proper amount of memory reduces the operation to about 190 ms because reallocations are avoided.
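Regarding the suggestion above about compiling this into a library callable from Python, here is a minimal sketch of what a C-callable entry point could look like (the function name, argument layout, and packing scheme are my own assumptions):

#include <bitset>
#include <cstddef>

// Each row of x and y is one binary array packed into nbytes bytes
// (e.g. produced on the Python side with numpy.packbits).
// Distances are written row-major into out, which must hold nx * ny values.
extern "C" void hamming_all_pairs(const unsigned char* x, std::size_t nx,
                                  const unsigned char* y, std::size_t ny,
                                  std::size_t nbytes, unsigned int* out)
{
    for (std::size_t i = 0; i < nx; ++i)
    {
        for (std::size_t j = 0; j < ny; ++j)
        {
            unsigned int d = 0;
            for (std::size_t k = 0; k < nbytes; ++k)
            {
                unsigned char v = x[i * nbytes + k] ^ y[j * nbytes + k];
                d += static_cast<unsigned int>(std::bitset<8>(v).count()); // popcount of one byte
            }
            out[i * ny + j] = d;
        }
    }
}

Compiled as a shared library (for example g++ -O3 -shared -fPIC -o libhamming.so hamming.cpp, where the file name is a placeholder), it could be loaded from Python with ctypes; the exact build and loading steps depend on your platform.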
I have an array in Python, which is declared as follows:
u[time][space]
My code requires me to do the following in a Python for loop:
for n in range(0,10):
    u[n][:] = 1
This colon indicates the whole range of my [space].
Now, how would I go about using the colon (:) to indicate the whole range when writing a for loop in C++?
Thanks
for(auto& inner : u)
    std::fill(inner.begin(), inner.end(), 1);
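For context, here is a complete, self-contained version of that one-liner, assuming u is a std::vector<std::vector<int>> (the sizes are made up for illustration):

#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    const int time_steps = 10, space = 5;    // illustrative sizes
    std::vector<std::vector<int>> u(time_steps, std::vector<int>(space, 0));

    for (auto& inner : u)                    // each inner vector plays the role of u[n][:]
        std::fill(inner.begin(), inner.end(), 1);

    std::cout << u[3][2] << '\n';            // prints 1
}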
Use two loops as suggested.
Something like:
int n, s;
for(n=0; n<10; n++){
    for(s=0; s<space; s++){
        u[n][s] = 1;
    }
}
should work.
C++ does not have an equivalent to
for n in range(0,10):
    u[n][:] = 1
You are going to have to write the loop that u[n][:] = 1 represents. Fortunately, with range-based for loops it is pretty trivial. That would look like:
int foo[3][3];
for (auto& row : foo)
    for (auto& col : row)
        col = 1;
I don't remember a quick way to do it in C++. I'm afraid you'd have to loop over each dimension.
This would look like this (assuming that space is a known integer):
for (int i = 0; i < 10; i++)
    for (int j = 0; j < space; j++)
        u[i][j] = 1;
You're going to have to use two loops. Modern C++ has some tricks up its sleeve to make initializing these loops simple, but here is a basic C example that will work anywhere:
#define TIME_MAX 100
#define SPACE_MAX 350

int u[TIME_MAX][SPACE_MAX];
int i, j;

for (i = 0; i < TIME_MAX; i++)
    for (j = 0; j < SPACE_MAX; j++)
        u[i][j] = 1;