Fibonacci sequence: C vs Python

This is code I wrote in C for the Fibonacci sequence:
#include <stdio.h>
#include <stdlib.h>

int fib(int n)
{
    int a = 0, b = 1, c, i;
    if (n == 0)
        return a;
    for (i = 2; i <= n; i++) {
        c = a + b;
        a = b;
        b = c;
    }
    return b;
}

int main()
{
    printf("%d", fib(1000));
    return 0;
}
And this is the direct translation in Python:
def fib(n):
    a = 0
    b = 1
    if n == 0:
        return a
    for _ in range(n - 1):
        c = a + b
        a = b
        b = c
    return b

print(fib(1000))
The C program outputs:
1556111435
Where Python (correctly) outputs:
43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875
I realize the problem with C is the variable type (since fib(50) works just fine in C), but I have two major questions:
How should I correct the C program so that I can calculate fib of any number? In other words, rather than just switching to double (which has its own limitations), how can I calculate arbitrarily large Fibonacci numbers in C?
How does Python handle this? Apparently it has no limit on the size of its integers.

C does not offer any dynamically sized integer types directly. The biggest you can go within the language itself is long long. However, there is nothing stopping you from writing your own big-integer functions that allocate memory and handle carry as needed.
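To give an idea of what such hand-rolled big-integer code involves, here is a minimal sketch of the carry-propagating addition that fib needs, written in Python for brevity; a C version would do the same thing over a malloc'ed array of limbs. The function name, the little-endian limb order and the 10**9 limb base are illustrative choices, not taken from any particular library:

    def big_add(a, b, base=10**9):
        # a and b are little-endian lists of "limbs" (digits in base 10**9);
        # this is the schoolbook addition a hand-rolled C big-integer library would implement.
        result, carry = [], 0
        for i in range(max(len(a), len(b))):
            s = carry
            if i < len(a):
                s += a[i]
            if i < len(b):
                s += b[i]
            result.append(s % base)   # keep the low limb
            carry = s // base         # propagate the carry
        if carry:
            result.append(carry)
        return result

    # fib(1000) then needs nothing but repeated additions:
    a, b = [0], [1]
    for _ in range(999):
        a, b = b, big_add(a, b)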
Or you can just use someone else's big integer lib, for instance BigInt.
(Looking at BigInt's source code will also answer the question how Python does this.)
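For a quick look at the Python side: CPython integers are arbitrary precision, and an int object simply occupies more memory as its value grows, which you can observe directly:

    import sys

    for n in (10, 10**20, 10**200):
        # both the bit length and the object's memory footprint grow with the value
        print(n.bit_length(), sys.getsizeof(n))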
Edit: I just had a closer look at BigInt myself. Beware that it unconditionally uses the regular pen-and-paper method for multiplication, which is fast for "small" numbers but performs worse than the Karatsuba method for "large" ones. Note, however, that the border between "small" and "large" in this context is probably so high that in most practical cases the pen-and-paper method is enough (see the linked Wikipedia article). It's also worth noting that you can combine both algorithms for multiplication, by writing them recursively and having Karatsuba's method fall back to pen-and-paper when the number of bits drops below a given threshold.
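A rough sketch of that combination, written in Python for brevity (Python's built-in * stands in for the pen-and-paper base case, and the 512-bit threshold is an arbitrary placeholder, not a tuned value):

    def karatsuba(x, y, threshold_bits=512):
        # Base case: below the threshold, fall back to the "pen & paper" method.
        if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
            return x * y
        m = max(x.bit_length(), y.bit_length()) // 2
        hi_x, lo_x = x >> m, x & ((1 << m) - 1)   # split both operands around bit m
        hi_y, lo_y = y >> m, y & ((1 << m) - 1)
        z2 = karatsuba(hi_x, hi_y, threshold_bits)
        z0 = karatsuba(lo_x, lo_y, threshold_bits)
        # Karatsuba's trick: one recursive multiplication for the middle term instead of two
        z1 = karatsuba(hi_x + lo_x, hi_y + lo_y, threshold_bits) - z2 - z0
        return (z2 << (2 * m)) + (z1 << m) + z0

A real big-integer library would implement the same recursion over its limb arrays rather than over language-level integers.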

Related

Porting python code to C++ / (Printing out arrays in c++)

I am currently learning C++ and, being quite proficient in Python, I decided to try porting some of my Python code to C++. Specifically, I tried porting this generator I wrote, which gives the Fibonacci sequence up to a certain given stop value.
def yieldFib(stop):
    a = 0
    b = 1
    for i in range(2):
        yield i
    for i in range(stop-2):
        fib = a+b
        a = b
        b = fib
        yield fib

fib = list(yieldFib(100))
print(fib)
to this
#include <iostream>
using namespace std;

int* fib(int stopp){
    int a = 0;
    int b = 1;
    int fibN;
    int fibb[10000000];
    fibb[0] = 0;
    fibb[1] = 1;
    for(int i=2; i<stopp-2; i++){
        fibN = a+b;
        a = b;
        b = fibN;
        fibb[i] = fibN;
    }
    return fibb;
}

int main(){
    int stop;
    cin >> stop;
    int* fibbb = fib(stop);
    cout << fibbb;
}
I admit the C++ is very crude, but this is just to aid my learning. For some reason the code just crashes and quits after it takes user input. I suspect it has something to do with the way I try to use the array, but I'm not quite sure what. Any help will be appreciated.
An integer array of size 10000000 is generally too large to be allocated on the stack, which causes the crash. Instead, use a std::vector<int>. In addition to that:
The variable b is unused.
fibN is not initialized. Its value will be indeterminate.
Returning a pointer to stack memory will not work, as that pointer is no longer valid once the function has returned.
The code would print the value of an integer pointer instead of the values of the array. Instead, iterate over the array and print the values one by one.
On a side note: It seems that you are trying to learn C++ by trial-and-error, while it is best learned from the ground up using a book or course.
In the provided code, I see many mistakes.
First: you're creating an int array of length 10000000 on the stack (you're not heap-allocating it), which is 40 MB! That exceeds the stack size (1 MB, if I remember correctly). Just allocate it with the new operator. If you don't want to work with this kind of array (or don't want to calculate its precise length), you can use std::vector, which can grow in memory as needed.
int* fibb = new int[precise_length];
//or
std::vector<int> fibb = std::vector<int>(); //and fill it by calling fibb.push_back()
Second: the cout usage. You're trying to print the array POINTER, not its contents. Print every member of the array separately.
#include <bits/stdc++.h>
using namespace std;

vector<int> fib( const int& n ){
    vector<int> v = {0, 1};
    for(int i = 2; i <= n; i++){
        v.push_back( v[i - 1] + v[i - 2] );
    }
    return v;
}

int main(){
    int n;
    cin >> n;
    vector<int> _fib = fib( n );
    for( auto x : _fib ){
        cout << x << ' ';
    }
    return 0;
}

Possible explanation for faster execution of two programs in C++ (with Python comparison)?

Update: The C++ programs (as shown below) were compiled with no additional flags, i.e. g++ program.cpp. However, raising the optimisation level does not change the fact that brute force runs faster than the memoization technique (0.1 seconds vs 1 second on my machine).
Context
I am trying to calculate the number (< 1 million) with the longest Collatz sequence. I wrote a brute-force algorithm and compared it with a suggested optimised program (which basically uses memoization).
My question is: what could possibly be the reason that the brute force executes faster than the supposedly optimised (memoization) version in C++?
Below are the comparisons I have on my machine (a MacBook Air); the times are given in comments at the beginning of each program.
C++ (brute force)
/**
* runs in 1 second
*/
#include <iostream>
#include <vector>
unsigned long long nextSequence(unsigned long long n)
{
if (n % 2 == 0)
return n / 2;
else
{
return 3 * n + 1;
}
}
int main()
{
int max_counter = 0;
unsigned long long result;
for (size_t i = 1; i < 1000000; i++)
{
int counter = 1;
unsigned long long n = i;
while (n != 1)
{
n = nextSequence(n);
counter++;
}
if (counter > max_counter)
{
max_counter = counter;
result = i;
}
}
std::cout << result << " has " << max_counter << " sequences." << std::endl;
return 0;
}
C++ (memoization)
/**
* runs in 2-3 seconds
*/
#include <iostream>
#include <unordered_map>
int countSequence(uint64_t n, std::unordered_map<uint64_t, uint64_t> &cache)
{
if (cache.count(n) == 1)
return cache[n];
if (n % 2 == 0)
cache[n] = 1 + countSequence(n / 2, cache);
else
cache[n] = 2 + countSequence((3 * n + 1) / 2, cache);
return cache[n];
}
int main()
{
uint64_t max_counter = 0;
uint64_t result;
std::unordered_map<uint64_t, uint64_t> cache;
cache[1] = 1;
for (uint64_t i = 500000; i < 1000000; i++)
{
if (countSequence(i, cache) > max_counter)
{
max_counter = countSequence(i, cache);
result = i;
}
}
std::cout << result << std::endl;
return 0;
}
In Python the memoization technique really runs faster.
Python (memoization)
# runs in 1.5 seconds
def countChain(n):
    if n in values:
        return values[n]
    if n % 2 == 0:
        values[n] = 1 + countChain(n / 2)
    else:
        values[n] = 2 + countChain((3 * n + 1) / 2)
    return values[n]

values = {1: 1}
longest_chain = 0
answer = -1
for number in range(500000, 1000000):
    if countChain(number) > longest_chain:
        longest_chain = countChain(number)
        answer = number
print(answer)
Python (brute force)
# runs in 30 seconds
def countChain(n):
    if n == 1:
        return 1
    if n % 2 == 0:
        return 1 + countChain(n / 2)
    return 2 + countChain((3 * n + 1) / 2)

longest_chain = 0
answer = -1
for number in range(1, 1000000):
    temp = countChain(number)
    if temp > longest_chain:
        longest_chain = temp
        answer = number
print(answer)
I understand that your question is about the difference between the two C++ variants and not between the compiled C++ and the interpreted Python. Answering it decisively would require compiling the code with optimizations turned on and profiling its execution, plus clarity about whether the compiler target is 64 or 32 bits.
But given the order-of-magnitude difference between the two versions of the C++ code, a quick inspection already shows that your memoization consumes more resources than it gains you.
One important performance bottleneck here is the memory management of the unordered map. An unordered_map works with buckets of items. The map adjusts the number of buckets when necessary, but this requires memory allocation (and potentially moving chunks of memory, depending on how the buckets are implemented).
Now, if you add the following statement just after the initialisation of the cache, and just before displaying the result, you'll see that there is a huge change in the number of buckets allocated:
std::cout << "Bucket count: "<<cache.bucket_count()<<"/"<<cache.max_bucket_count()<<std::endl;
To avoid the overhead associated with this, you could preallocate the number of buckets at construction:
std::unordered_map<uint64_t, uint64_t> cache(3000000);
Doing this on ideone for a small and informal test cut the runtime by almost 50%.
But nonetheless, storing and finding objects in an unordered_map requires calculating hash codes, which involves a lot of arithmetic operations. So I guess that these operations are simply heavier than doing the brute-force calculations.
Main memory access is vastly slower than computation, so much so that, when it's time to care, you should treat anything over a few (CPU-model-dependent) megabytes as if it were retrieved from an I/O or network device.
Even fetching from L1 is expensive compared to integer ops.
Long, long ago, this wasn't true. Computation and memory access were at least in the same ballpark for many decades, because there simply wasn't enough room in the transistor budget to make fast caches large enough to pay.
So people counted CPU operations and just assumed memory could more or less keep up.
Nowadays, it just … can't. The penalty for a CPU cache miss is hundreds of integer ops, and your million-16-byte-entry hash map is pretty much guaranteed to blow not just the CPU's memory caches but also the TLB, which takes the delay penalty from painful to devastating.

How to use 64-bit unsigned integer math in Python, respecting C overflow?

I'm trying to implement the djb2 hash in Python.
Here it is in C:
#include <stdint.h>
#include <stddef.h>

/* djb2 hash http://www.cse.yorku.ca/~oz/hash.html */
uint64_t djb2(size_t len, char const str[len]) {
    uint64_t hash = 5381;
    uint8_t c;
    for(size_t i = 0; i < len; i++) {
        c = str[i];
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }
    return hash;
}
And here's my attempt in Python:
from ctypes import c_uint64, c_byte, cast, POINTER

def djb2(string: str) -> c_uint64:
    hash = c_uint64(5381)
    raw_bytes = cast(string, POINTER(c_byte * len(string)))[0]
    for i in range(0, len(raw_bytes)):
        hash = c_uint64((((((hash.value << 5) & 0xffffffffffffffff) + hash.value) & 0xffffffffffffffff) + raw_bytes[i]) & 0xffffffffffffffff)  # hash * 33 + c
    return hash
However, I'm getting different results between the two, which I suspect is because of different overflow behavior, or otherwise mathematical differences.
The reason for the masking in the python version was to attempt to force an overflow (based on this answer).
You can implement the algorithm being run by the C code very easily in pure Python, without needing any ctypes stuff. Just do it all with regular Python integers, and take a modulus at the end (the high bits won't affect the lower ones for the operations you're doing):
def djb2(string: bytes) -> int:  # note, use a bytestring for this, not a Unicode string!
    h = 5381
    for c in string:  # iterating over the bytestring directly gives integer values
        h = h * 33 + c  # use the computation from the C comments, but consider ^ instead of +
    return h % 2**64  # note you may actually want % 2**32, as this hash is often 32-bit
As I commented in the code, since this is an operation defined on bytestrings, you should use a bytes instance as the argument. Note that there are a bunch of different implementations of this algorithm. Some use ^ (bitwise xor) instead of + in the step where you update the hash value, and it's often defined to use an unsigned long, which is usually 32 bits, instead of the explicitly 64-bit integer the C version in your question uses.
When calculating the DJB2 hash in Python, you have to keep the value from growing into arbitrarily large ("long") integers. For this purpose, you do hash &= 0xFFFFFFFFFFFFFFFF after each iteration.
Here is a proper one-liner implementation of DJB2 in Python:
import functools, itertools
djb2 = lambda x: functools.reduce(lambda x,c: (x*33 + c) & ((1<<64)-1), itertools.chain([5381], x))
Notes:
because Python is interpreted, writing (x << 5) + x instead of x*33 is not more efficient
((1<<64)-1) is just shorthand for 0xFFFFFFFFFFFFFFFF
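For readers who prefer an explicit loop over the reduce one-liner, the same idea (mask the value after every step so it never grows past 64 bits) can be written as the following sketch; the function name djb2_64 is just an illustrative choice:

    def djb2_64(data: bytes) -> int:
        h = 5381
        for c in data:
            h = (h * 33 + c) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits after each step
        return h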

Comparing Python, Numpy, Numba and C++ for matrix multiplication

In a program I am working on, I need to multiply two matrices repeatedly. Because of the size of one of the matrices, this operation takes some time and I wanted to see which method would be the most efficient. The matrices have dimensions (m x n)*(n x p) where m = n = 3 and 10^5 < p < 10^6.
With the exception of Numpy, which I assume works with an optimized algorithm, every test consists of a simple implementation of the matrix multiplication:
Below are my various implementations:
Python
def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]

    C = np.zeros((m, p))
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C
Numpy
def dot_np(A, B):
    C = np.dot(A, B)
    return C
Numba
The code is the same as the Python one, but it is compiled just in time before being used:
dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython = True)(dot_py)
So far, each method call has been timed using the timeit module 10 times. The best result is kept. The matrices are created using np.random.rand(n,m).
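A sketch of that timing setup (the sizes and the use of timeit.repeat here are illustrative, not the exact benchmark script):

    import timeit
    import numpy as np

    m = n = 3
    p = 100000
    A = np.random.rand(m, n)
    B = np.random.rand(n, p)

    # repeat=10, number=1: run the call ten separate times and keep the best result
    best = min(timeit.repeat(lambda: dot_np(A, B), repeat=10, number=1))
    print(best)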
C++
mat2 dot(const mat2& m1, const mat2& m2)
{
    int m = m1.rows_;
    int n = m1.cols_;
    int p = m2.cols_;
    mat2 m3(m, p);
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < p; col++) {
            for (int k = 0; k < n; k++) {
                m3.data_[p*row + col] += m1.data_[n*row + k] * m2.data_[p*k + col];
            }
        }
    }
    return m3;
}
Here, mat2 is a custom class that I defined and dot(const mat2& m1, const mat2& m2) is a friend function to this class. It is timed using QPF and QPC from Windows.h and the program is compiled using MinGW with the g++ command. Again, the best time obtained from 10 executions is kept.
Results
As expected, the simple Python code is slower but it still beats Numpy for very small matrices. Numba turns out to be about 30% faster than Numpy for the largest cases.
I am surprised by the C++ results, where the multiplication takes almost an order of magnitude more time than with Numba. In fact, I expected these to take a similar amount of time.
This leads to my main question: is this normal and, if not, why is C++ slower than Numba? I just started learning C++, so I might be doing something wrong. If so, what would be my mistake, or what could I do to improve the efficiency of my code (other than choosing a better algorithm)?
EDIT 1
Here is the header of the mat2 class.
#ifndef MAT2_H
#define MAT2_H
#include <iostream>

class mat2
{
private:
    int rows_, cols_;
    float* data_;

public:
    mat2() {}                                  // (default) constructor
    mat2(int rows, int cols, float value = 0); // constructor
    mat2(const mat2& other);                   // copy constructor
    ~mat2();                                   // destructor

    // Operators
    mat2& operator=(mat2 other);               // assignment operator
    float operator()(int row, int col) const;
    float& operator()(int row, int col);
    mat2 operator*(const mat2& other);

    // Operations
    friend mat2 dot(const mat2& m1, const mat2& m2);

    // Other
    friend void swap(mat2& first, mat2& second);
    friend std::ostream& operator<<(std::ostream& os, const mat2& M);
};

#endif
Edit 2
As many suggested, using the optimization flag was the missing element to match Numba. Below are the new curves compared to the previous ones. The curve tagged v2 was obtained by switching the two inner loops and shows another 30% to 50% improvement.
Definitely use -O3 for optimization. This turns vectorization on, which should significantly speed up your code.
Numba is supposed to do that already.
What I would recommend
If you want maximum efficiency, you should use a dedicated linear algebra library, the classics being the BLAS/LAPACK libraries. There are a number of implementations, e.g. Intel MKL. What you write is NOT going to outperform hyper-optimized libraries.
Matrix matrix multiply is going to be the dgemm routine: d stands for double, ge for general, and mm for matrix matrix multiply. If your problem has additional structure, a more specific function may be called for additional speedup.
Note that Numpy dot ALREADY calls dgemm! You're probably not going to do better.
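If you want to check that claim yourself, and assuming a reasonably recent SciPy is installed, you can call the BLAS dgemm wrapper from scipy.linalg.blas directly and compare it with np.dot; a small sketch:

    import numpy as np
    from scipy.linalg.blas import dgemm

    A = np.random.rand(3, 3)
    B = np.random.rand(3, 100000)

    C1 = np.dot(A, B)                # what Numpy does for you
    C2 = dgemm(alpha=1.0, a=A, b=B)  # calling the BLAS routine explicitly
    print(np.allclose(C1, C2))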
Why your C++ is slow
Your classic, intuitive algorithm for matrix-matrix multiplication turns out to be slow compared to what's possible. Writing code that takes advantage of how processors cache data, etc., yields important performance gains. The point is, tons of smart people have devoted their lives to making matrix-matrix multiply extremely fast, and you should use their work and not reinvent the wheel.
In your current implementation, the compiler is most likely unable to auto-vectorize the innermost loop because its size is 3. Also, m2 is accessed in a "jumpy" way. Swapping the loops so that iterating over p happens in the innermost loop will make it run faster (col will no longer cause "jumpy" data access), and the compiler should be able to do a better job (auto-vectorize).
for (int row = 0; row < m; row++) {
    for (int k = 0; k < n; k++) {
        for (int col = 0; col < p; col++) {
            m3.data_[p*row + col] += m1.data_[n*row + k] * m2.data_[p*k + col];
        }
    }
}
On my machine, the original C++ implementation for p=10^6 elements, built with g++ dot.cpp -std=c++11 -O3 -o dot, takes 12 ms, and the above implementation with swapped loops takes 7 ms.
You can still optimize these loops by improving the memory access; your function could look like this (assuming the matrices are 1000x1000):
CS = 10
NCHUNKS = 100

def dot_chunked(A, B):
    C = np.zeros((1000, 1000))
    for i in range(NCHUNKS):
        for j in range(NCHUNKS):
            for k in range(NCHUNKS):
                for ii in range(i*CS, (i+1)*CS):
                    for jj in range(j*CS, (j+1)*CS):
                        for kk in range(k*CS, (k+1)*CS):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C
Explanation: the loops i and ii together obviously perform the same job as i did before, and the same holds for j and k, but this time regions of A and B of size CSxCS can be kept in the cache (I guess) and used more than once.
You can play around with CS and NCHUNKS. For me, CS=10 and NCHUNKS=100 worked well. When using numba.jit, it accelerates the code from 7 s to 850 ms (note that I use 1000x1000 here, while the graphs above were run with 3x3x10^5, so it's a bit of a different scenario).

Why is matrix multiplication faster with numpy than with ctypes in Python?

I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:
Pure python implementation: no surprises here.
Numpy implementation using numpy.dot(a, b)
Interfacing with C using ctypes module in Python.
This is the C code that is transformed into a shared library:
#include <stdio.h>
#include <stdlib.h>

void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;
    /*float* c = malloc(nay * sizeof(float));*/
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            int sub = 0;
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
    return;
}
And the Python code that calls it:
def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1
    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), 2)

    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1
    return res
I would have bet that the version using C would have been faster ... and I'd have lost! My benchmark seemed to show that I either did it incorrectly, or that numpy is stupidly fast.
I'd like to understand why the numpy version is faster than the ctypes version, I'm not even talking about the pure Python implementation since it is kind of obvious.
NumPy uses a highly-optimized, carefully-tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for generic matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).
In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
I'm not too familiar with Numpy, but the source is on Github. Part of the dot product code is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I assume is translated into specific C implementations for each datatype. For example:
/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
#name#_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    #out# tmp = (#out#)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (#out#)(*((#type# *)ip1)) *
               (#out#)(*((#type# *)ip2));
    }
    *((#type# *)op) = (#type#) tmp;
}
/**end repeat**/
This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.
One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
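As a quick illustration of those strides, numpy exposes the precomputed byte steps on every array; for a small C-ordered float32 array they are just multiples of the 4-byte item size:

    import numpy as np

    A = np.zeros((3, 5), dtype=np.float32)
    print(A.itemsize)  # 4 bytes per float32 element
    print(A.strides)   # byte step per axis, e.g. (20, 4) for this C-ordered array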
Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and try to iterate over each contiguous part first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.
Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:
for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
        SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}
This unrolling factor is likely to have been picked after profiling several candidates. But one theoretical advantage of it is that more arithmetical operations are done between each branch point, and the compiler and CPU have more choice about how to optimally schedule them to get as much instruction pipelining as possible.
The language used to implement a certain functionality is a bad measure of performance by itself. Often, using a more suitable algorithm is the deciding factor.
In your case, you're using the naive approach to matrix multiplication as taught in school, which is in O(n^3). However, you can do much better for certain kinds of matrices, e.g. square matrices, sparse matrices and so on.
Have a look at the Coppersmith–Winograd algorithm (square matrix multiplication in O(n^2.3737)) for a good starting point on fast matrix multiplication. Also see the section "References", which lists some pointers to even faster methods.
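To make "even faster methods" a bit more concrete, here is a rough sketch of Strassen's algorithm (O(n^2.807), the simplest of the sub-cubic methods), using numpy only for the array bookkeeping; it assumes square matrices whose size is a power of two, and the fallback threshold of 64 is an arbitrary choice:

    import numpy as np

    def strassen(A, B, threshold=64):
        # Recursive Strassen multiply; falls back to the ordinary product for small blocks.
        n = A.shape[0]
        if n <= threshold:
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Seven block products instead of eight
        M1 = strassen(A11 + A22, B11 + B22, threshold)
        M2 = strassen(A21 + A22, B11, threshold)
        M3 = strassen(A11, B12 - B22, threshold)
        M4 = strassen(A22, B21 - B11, threshold)
        M5 = strassen(A11 + A12, B22, threshold)
        M6 = strassen(A21 - A11, B11 + B12, threshold)
        M7 = strassen(A12 - A22, B21 + B22, threshold)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

In practice such schemes only pay off for fairly large matrices, because the extra additions and memory traffic dominate for small ones.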
For a more earthy example of astonishing performance gains, try to write a fast strlen() and compare it to the glibc implementation. If you don't manage to beat it, read glibc's strlen() source, it has fairly good comments.
Numpy is also highly optimized code. There is an essay about parts of it in the book Beautiful Code.
ctypes has to go through a dynamic translation from C to Python and back, which adds some overhead. In Numpy, most matrix operations are done completely internally.
The people who wrote NumPy obviously know what they're doing.
There are many ways to optimize matrix multiplication. For example, the order in which you traverse the matrix affects the memory access patterns, which affect performance.
Good use of SSE is another way to optimize, which NumPy probably employs.
There may be more ways, which the developers of NumPy know and I don't.
BTW, did you compile your C code with optimization?
You can try the following optimization for C. It does work in parallel, and I suppose NumPy does something along the same lines.
NOTE: Only works for even sizes. With extra work, you can remove this limitation and keep the performance improvement.
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j += 2) {
        int sub1 = 0, sub2 = 0;
        for (k = 0; k < n; k++) {
            sub1 = sub1 + a[i * n + k] * b[k * n + j];
            sub2 = sub2 + a[i * n + k] * b[k * n + j + 1];
        }
        c[i * n + j] = sub1;
        c[i * n + j + 1] = sub2;
    }
}
The most common reason given for Fortran's speed advantage in numerical code, afaik, is that the language makes it easier to detect aliasing - the compiler can tell that the matrices being multiplied don't share the same memory, which can help improve caching (no need to be sure results are written back immediately into "shared" memory). This is why C99 introduced restrict.
However, in this case, I wonder if the numpy code is also managing to use some special instructions that the C code is not (as the difference seems particularly large).
