In a program I am working on, I need to multiply two matrices repeatedly. Because of the size of one of the matrices, this operation takes some time and I wanted to see which method would be the most efficient. The matrices have dimensions (m x n)*(n x p) where m = n = 3 and 10^5 < p < 10^6.
With the exception of Numpy, which I assume works with an optimized algorithm, every test consists of a simple implementation of the matrix multiplication:
Below are my various implementations:
Python
def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C
Numpy
def dot_np(A, B):
    C = np.dot(A, B)
    return C
Numba
The code is the same as the Python one, but it is compiled just in time before being used:
dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython = True)(dot_py)
So far, each method call has been timed using the timeit module 10 times. The best result is kept. The matrices are created using np.random.rand(n,m).
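For reference, here is a minimal sketch of how such a timing loop could be set up with timeit (the sizes and the use of timeit.repeat here are illustrative assumptions, reusing dot_np from above; the actual benchmark script is not shown in the post):

import timeit
import numpy as np

m = n = 3
p = 10**5
A = np.random.rand(m, n)
B = np.random.rand(n, p)

# run the multiplication 10 times and keep the best result, as described above
best = min(timeit.repeat(lambda: dot_np(A, B), repeat=10, number=1))
print("best of 10 runs: %.6f s" % best)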
C++
mat2 dot(const mat2& m1, const mat2& m2)
{
    int m = m1.rows_;
    int n = m1.cols_;
    int p = m2.cols_;
    mat2 m3(m, p);
    for (int row = 0; row < m; row++) {
        for (int col = 0; col < p; col++) {
            for (int k = 0; k < n; k++) {
                m3.data_[p*row + col] += m1.data_[n*row + k]*m2.data_[p*k + col];
            }
        }
    }
    return m3;
}
Here, mat2 is a custom class that I defined and dot(const mat2& m1, const mat2& m2) is a friend function to this class. It is timed using QPF and QPC from Windows.h and the program is compiled using MinGW with the g++ command. Again, the best time obtained from 10 executions is kept.
Results
As expected, the simple Python code is slower but it still beats Numpy for very small matrices. Numba turns out to be about 30% faster than Numpy for the largest cases.
I am surprised by the C++ results, where the multiplication takes almost an order of magnitude more time than with Numba. In fact, I expected these to take a similar amount of time.
This leads to my main question: is this normal, and if not, why is C++ slower than Numba? I just started learning C++ so I might be doing something wrong. If so, what would be my mistake, or what could I do to improve the efficiency of my code (other than choosing a better algorithm)?
EDIT 1
Here is the header of the mat2 class.
#ifndef MAT2_H
#define MAT2_H

#include <iostream>

class mat2
{
private:
    int rows_, cols_;
    float* data_;

public:
    mat2() {}                                   // (default) constructor
    mat2(int rows, int cols, float value = 0);  // constructor
    mat2(const mat2& other);                    // copy constructor
    ~mat2();                                    // destructor

    // Operators
    mat2& operator=(mat2 other);                // assignment operator
    float operator()(int row, int col) const;
    float& operator()(int row, int col);
    mat2 operator*(const mat2& other);

    // Operations
    friend mat2 dot(const mat2& m1, const mat2& m2);

    // Other
    friend void swap(mat2& first, mat2& second);
    friend std::ostream& operator<<(std::ostream& os, const mat2& M);
};

#endif
Edit 2
As many suggested, using the optimization flag was the missing element to match Numba. Below are the new curves compared to the previous ones. The curve tagged v2 was obtained by switching the two inner loops and shows another 30% to 50% improvement.
Definitely use -O3 for optimization. This turns vectorizations on, which should significantly speed your code up.
Numba is supposed to do that already.
What I would recommend
If you want maximum efficiency, you should use a dedicated linear algebra library, the classic being BLAS/LAPACK. There are a number of implementations, e.g. Intel MKL. What you write by hand is NOT going to outperform these hyper-optimized libraries.
Matrix matrix multiply is going to be the dgemm routine: d stands for double, ge for general, and mm for matrix matrix multiply. If your problem has additional structure, a more specific function may be called for additional speedup.
Note that Numpy dot ALREADY calls dgemm! You're probably not going to do better.
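If you want to see this for yourself, here is a small sketch (assuming SciPy is available) that calls the BLAS dgemm routine directly through scipy.linalg.blas and checks it against np.dot:

import numpy as np
from scipy.linalg.blas import dgemm

A = np.random.rand(3, 3)
B = np.random.rand(3, 10**5)

# dgemm computes alpha * A @ B; alpha = 1.0 gives the plain matrix product
C_blas = dgemm(1.0, A, B)
C_np = np.dot(A, B)

print(np.allclose(C_blas, C_np))  # should print True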
Why your C++ is slow
Your classic, intuitive algorithm for matrix-matrix multiplication turns out to be slow compared to what's possible. Writing code that takes advantage of how processors cache data, etc., yields important performance gains. The point is, tons of smart people have devoted their lives to making matrix-matrix multiply extremely fast, and you should use their work and not reinvent the wheel.
In your current implementation, the compiler is most likely unable to auto-vectorize the innermost loop because its size is 3. Also, m2 is accessed in a "jumpy" way. Swapping the loops so that iterating over p happens in the innermost loop will make it faster (col will no longer cause "jumpy" data access) and the compiler should be able to do a better job (auto-vectorize).
for (int row = 0; row < m; row++) {
    for (int k = 0; k < n; k++) {
        for (int col = 0; col < p; col++) {
            m3.data_[p*row + col] += m1.data_[n*row + k] * m2.data_[p*k + col];
        }
    }
}
On my machine, the original C++ implementation for p = 10^6 elements, built with g++ dot.cpp -std=c++11 -O3 -o dot, takes 12 ms, and the above implementation with swapped loops takes 7 ms.
You can still optimize these loops by improving the memory access; your function could look like this (assuming the matrices are 1000x1000):
CS = 10
NCHUNKS = 100

def dot_chunked(A, B):
    C = np.zeros((1000, 1000))
    for i in range(NCHUNKS):
        for j in range(NCHUNKS):
            for k in range(NCHUNKS):
                for ii in range(i*CS, (i+1)*CS):
                    for jj in range(j*CS, (j+1)*CS):
                        for kk in range(k*CS, (k+1)*CS):
                            C[ii, jj] += A[ii, kk]*B[kk, jj]
    return C
Explanation: the loops i and ii together obviously cover the same range as i did before, and the same holds for j and k, but this time regions of A and B of size CS x CS can be kept in the cache (I guess) and used more than once.
You can play around with CS and NCHUNKS. For me, CS = 10 and NCHUNKS = 100 worked well. When using numba.jit, it accelerates the code from 7 s to 850 ms (note that I use 1000x1000 matrices here, while the graphs above were run with 3x3 times 10^5, so it's a somewhat different scenario).
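If you want to reproduce the Numba speedup on the chunked version, a minimal sketch (assuming Numba is installed) is to JIT-compile the same function:

import numpy as np
import numba as nb

# nb.njit is the nopython form of numba.jit; dot_chunked, CS and NCHUNKS are defined above
dot_chunked_nb = nb.njit(dot_chunked)

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
C = dot_chunked_nb(A, B)  # the first call includes compilation time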
Related
I was fooling around with what the best way is to calculate the mean of a list in Python. Although I thought that NumPy was optimized, my results show that you shouldn't use NumPy for this. I was wondering why, and how Python achieves this performance.
So basically I am trying to figure out how come native Python is faster than NumPy.
My code for testing:
import random
import numpy as np
import timeit
def average_native(l):
    return sum(l)/len(l)

def average_np(l):
    return np.mean(l)

def test_time(func, arg):
    starttime = timeit.default_timer()
    for _ in range(500):
        func(arg)
    return (timeit.default_timer() - starttime) / 500

for i in range(1, 7):
    numbers = []
    for _ in range(10**i):
        numbers.append(random.randint(0, 100))
    print("for " + str(10**i) + " numbers:")
    print(test_time(average_native, numbers))
    print(test_time(average_np, numbers))
The results:
for 10 numbers:
2.489999999999992e-07
8.465800000000023e-06
for 100 numbers:
8.554000000000061e-07
1.3220000000000009e-05
for 1000 numbers:
7.2817999999999495e-06
6.22666e-05
for 10000 numbers:
6.750499999999993e-05
0.0005553966000000001
for 100000 numbers:
0.0006954238
0.005352444999999999
for 1000000 numbers:
0.007034196399999999
0.0568878216
BTW, I was running the same code in C++ and was surprised to see that the Python code is faster. Test code:
#include <iostream>
#include <cstdlib>
#include <vector>
#include <chrono>

float calculate_average(std::vector<int> vec_of_num)
{
    double sum = 0;
    uint64_t cnt = 0;
    for (auto& elem : vec_of_num)
    {
        cnt++;
        sum = sum + elem;
    }
    return sum / cnt;
}

int main()
{
    // This program will create same sequence of
    // random numbers on every program run
    std::vector<int> vec;
    for (int i = 0; i < 1000000; i++)
        vec.push_back(rand());

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 500; i++)
        calculate_average(vec);
    auto end = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> float_ms = end - start;
    std::cout << "calculate_average() elapsed time is " << float_ms.count()/500 << " ms )" << std::endl;
    return 0;
}
results:
calculate_average() elapsed time is 11.2082 ms )
Am I missing something?
Edit: I was running the C++ code on an online compiler (probably without any optimization). Also, it isn't the same hardware, and who knows what is going on on that server. After compiling and running the code on my own device, the C++ code is much faster.
Edit 2: So I changed the code to pass a NumPy array to the NumPy function, and we do see that for smaller arrays/lists native Python is better; however, after around 1000 values NumPy is performing better. I don't really understand why. What optimizations does NumPy have that produce these results?
new results:
for 10 numbers:
2.4540000000000674e-07
6.722200000000012e-06
for 100 numbers:
8.497999999999562e-07
6.583400000000017e-06
for 1000 numbers:
6.990799999999964e-06
7.916000000000034e-06
for 10000 numbers:
6.61604e-05
1.5475799999999985e-05
for 100000 numbers:
0.0006671193999999999
8.412259999999994e-05
for 1000000 numbers:
0.0068192092
0.0008199298000000005
Maybe I need to restart this question :)
C++ is much much slower than it needs to be.
First, for the C++ code, you're copying the vector, which is probably what's taking most of the time. You want to write:
float calculate_average(const std::vector<int>& vec_of_num)
instead of
float calculate_average(std::vector<int> vec_of_num)
in order to avoid making the copy.
Second, make sure you've compiled with optimizations on.
For the numpy version, you're doing an extra conversion, which slows you down.
From the docs:
a: array_like
Array containing numbers whose mean is desired. If a is not an array, a conversion is attempted.
So whatever is passed to numpy.mean is first converted into a numpy.array, then the mean is computed. Making the Numpy array is probably taking a good portion of your time here.
I'd suggest doing two more benchmarks and seeing how they compare with what you already have:
(1) C++ version without the copying, as I describe above, and make sure optimizations are on.
(2) Numpy version where you pass in a numpy array instead of a Python list.
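For (2), a minimal sketch of what that benchmark could look like, reusing the test_time helper and the numbers list from the question (names as defined there):

numbers_np = np.array(numbers)  # convert once, outside the timed region

print(test_time(average_np, numbers_np))  # np.mean now gets an ndarray, no per-call conversion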
The function numpy.mean() is doing a lot more than what sum() and len() are doing; that is why it is so "slow".
The kind of functionality included in np.mean() is essentially what makes it a ufunc, and especially the support for n-dimensional arrays.
However, the largest contributor to the speed difference between the naïve implementation and np.mean() is actually converting the list to a NumPy array.
Consider the following ways to compute the average:
this is essentially what you think is super-fast:
def mean_naive(seq):
    return sum(seq) / len(seq)
this is a numeric-safe implementation that is present in the standard Python library
import statistics

def mean_st(seq):
    return statistics.mean(seq)
this uses the NumPy mean() function:
import numpy as np

def mean_np(seq):
    return np.mean(seq)
this is the same as the naïve approach but a conversion to a NumPy array is performed to factor out the NumPy array conversion cost:
import numpy as np

def mean_naive_conv(seq):
    np.array(seq)  # the result of the conversion is not used!
    return sum(seq) / len(seq)
this is a Numba-accelerated version of the naïve approach acting on NumPy arrays. Numba essentially converts the Python code to optimized machine code via just-in-time compilation with LLVM. If sum() / len() were faster than C, then mean_naive_conv() should outperform this one.
import numpy as np
import numba as nb

@nb.njit
def mean_naive_nb(seq):
    sum_ = 0
    for x in seq:
        sum_ += x
    return sum_ / len(seq)

def mean_naive_np_nb(seq):
    seq = np.array(seq)
    return mean_naive_nb(seq)
However, when we benchmark these with the following code:
import random

funcs = (
    mean_naive, mean_st, mean_np, mean_naive_conv, mean_naive_np_nb, only_conv)

timings = {}
for k in range(1, 20):
    n = 2 ** k
    seq = tuple(random.random() for _ in range(n))
    print(f"n = {n}, k = {k}")
    timings[n] = []
    base = funcs[0](seq)
    for func in funcs:
        res = func(seq)  # this ensures that JIT-ted code is compiled before benchmarking
        is_good = np.allclose(base, res)
        timed = %timeit -r 4 -n 8 -q -o func(seq)
        timing = timed.best * 1e6
        timings[n].append(timing)
        print(f"{func.__name__:>24} {is_good!s:>5} {timing:10.3f} µs")
to be plotted with:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', ylim=[0, 40000])
fig = plt.gcf()
fig.patch.set_facecolor('white')
and with:
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', ylim=[0, 600], xlim=[0, 9000])
fig = plt.gcf()
fig.patch.set_facecolor('white')
(for some zooming on the smaller input sizes)
we can observe:
The statistics-based approach is by far the slowest
The naïve approach is by far the fastest
When comparing all the methods that do include a type conversion from Python list to NumPy array:
np.mean() is the fastest for larger input sizes, likely because it is compiled with specific optimizations (I'd speculate making optimal use of SIMD instructions); for smaller inputs, the running time is dominated by the overhead for supporting all ufunc functionalities
the Numba-accelerated version is the fastest for medium input sizes; for very small inputs, the running time is lengthened by the small, roughly constant, overhead of calling a Numba function
for very small inputs, sum() / len() eventually becomes the fastest
This indicates that sum() / len() is essentially slower than optimized C++ code acting on arrays.
You are copying the array for each call to average which takes a lot of extra time.
#include <numeric>
#include <iostream>
#include <vector>
#include <chrono>
#include <random>

//!! pass vector by reference to avoid copies!!!!
double calculate_average(const std::vector<int>& vec_of_num)
{
    return static_cast<double>(std::accumulate(vec_of_num.begin(), vec_of_num.end(), 0)) / static_cast<double>(vec_of_num.size());
}

int main()
{
    std::mt19937 generator(1); // static std::mt19937 generator(std::random_device{}());
    std::uniform_int_distribution<int> distribution{ 0, 1000 };

    // This program will create same sequence of
    // random numbers on every program run
    std::vector<int> values(1000000);
    for (auto& value : values)
    {
        value = distribution(generator);
    }

    auto start = std::chrono::high_resolution_clock::now();
    double sum{ 0.0 };
    for (int i = 0; i < 500; i++)
    {
        // force compiler to use average so it can't be optimized away
        sum += calculate_average(values);
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> float_ms = end - start;

    // force compiler to use sum so it can't be optimized away
    std::cout << "sum = " << sum << "\n";
    std::cout << "calculate_average() elapsed time is " << float_ms.count() / 500 << " ms )" << std::endl;
    return 0;
}
This is code I wrote in C for the Fibonacci sequence:
#include <stdio.h>
#include <stdlib.h>

int fib(int n)
{
    int a = 0, b = 1, c, i;
    if (n == 0)
        return a;
    for (i = 2; i <= n; i++) {
        c = a + b;
        a = b;
        b = c;
    }
    return b;
}

int main()
{
    printf("%d", fib(1000));
    return 0;
}
And this is the direct translation in Python:
def fib(n):
    a = 0
    b = 1
    if n == 0:
        return a
    for _ in range(n-1):
        c = a + b
        a = b
        b = c
    return b

print(fib(1000))
The C program outputs:
1556111435
Where Python (correctly) outputs:
43466557686937456435688527675040625802564660517371780402481729089536555417949051890403879840079255169295922593080322634775209689623239873322471161642996440906533187938298969649928516003704476137795166849228875
I realize the problem with C is with the variable type (since the fib(50) works just fine in C), but I have two major questions:
How should I correct the C program in a way that I can calculate fib of any number? In other words, rather than just using double (which has its own limitation), how can I calculate any fib in C?
How does Python handle this? Because apparently, it has no limitation in the size of integers.
C does not offer any dynamically sized integer types directly. The biggest you can go within the language itself is long long. However there is nothing stopping you from writing your own big-integer functions that allocate memory and handle carry as needed.
Or you can just use someone else's big integer lib, for instance BigInt.
(Looking at BigInt's source code will also answer the question how Python does this.)
Edit: I just had a bit of a closer look at BigInt myself. Beware that it uses the regular pen&paper method unconditionally for multiplication, which is fast for "small" numbers, but for "large" numbers has worse performance than the Karatsuba method. However please note that the border between "small" and "large" in this context is probably so high, that in most practical cases the pen&paper method is enough (see the linked Wiki article). It's also worth noting that you can combine both algorithms for multiplication, by writing them recursively and having Karatsuba's method fall back to pen&paper if the number of bits is below a given threshold.
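To make the "allocate memory and handle carry" idea concrete, here is a minimal sketch of limb-based addition. It is written in Python purely for illustration; a C implementation would use a malloc'd array of machine words instead of a list. Conceptually this is also how Python's own int type works: on typical 64-bit builds, CPython stores large integers as arrays of 30-bit digits and propagates carries between them.

BASE = 2**30  # each "limb" holds 30 bits, similar to CPython's internal digits

def big_add(a, b):
    """Add two big integers stored as little-endian lists of limbs."""
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else 0
        y = b[i] if i < len(b) else 0
        s = x + y + carry
        result.append(s % BASE)   # keep the low 30 bits in this limb
        carry = s // BASE         # carry the rest into the next limb
    if carry:
        result.append(carry)
    return result

# 1 + (BASE - 1) == BASE, i.e. limbs [0, 1] in little-endian order
print(big_add([1], [BASE - 1]))  # [0, 1]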
I am tasked with calculating Hamming distances between 1D binary arrays in two groups - a group of 3000 arrays and a group of 10000 arrays - where every array is 100 items (bits) long. So that's 3000x10000 HD calculations on 100-bit-long objects, and all of that must be done in at most a dozen minutes.
Here's the best of what I came up with:
# X - 3000 by 100 bool np.array
# Y - 10000 by 100 bool np.array
hd = []
i = 1
for x in X:
    print("object nr " + str(i) + "/" + str(len(X)))
    arr = np.array([x] * len(Y))
    C = Y ^ arr  # just xor this array by all the arrays in the other group simultaneously
    hd.append([sum(c) for c in C])  # add up all the bits to get the hamming distance
    i += 1
return np.array(hd)
And it's still going to take 1-1.5 hours for it to finish. How do I go about making this faster?
You should be able to dramatically improve the summing speed by using numpy to perform it, rather than using a list comprehension and the built-in sum function (that takes no advantage of numpy vectorized operations).
Just replace:
hd.append([sum(c) for c in C])
with:
# Explicitly use uint16 to reduce memory cost; if array sizes might increase
# you can use uint32 to leave some wiggle room
hd.append(C.sum(1, dtype=np.uint16))
which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying it should operate on axis 1). For example:
>>> arr = np.array([[True,False,True], [False,False,True], [True, True,True]], dtype=np.bool)
>>> arr.sum(1, np.uint16)
array([ 2, 1, 3], dtype=uint16)
Since it performs all the work at the C layer in a single operation without type conversions (instead of your original approach that requires a Python level loop that operates on each row, then an implicit loop that, while at the C layer, must still implicitly convert each numpy value one by one from np.bool to Python level ints just to sum them), this should run substantially faster for the array scales you're describing.
Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate can do that more quickly and easily. Simply replace:
i = 1
for x in X:
    ... rest of loop ...
    i += 1
with:
for i, x in enumerate(X, 1):
    ... rest of loop ...
and you'll get the same behavior, but slightly faster, more concise and cleaner in general.
IIUC, you can use np.logical_xor and list comprehension:
result = np.array([[np.logical_xor(X[a], Y[b].T).sum() for b in range(len(Y))]
for a in range(len(X))])
The whole operation runs in 7 seconds on my machine.
0:00:07.226645
Just in case you are not limited to using Python, this is a solution in C++ using bitset:
#include <iostream>
#include <bitset>
#include <vector>
#include <random>
#include <chrono>

using real = double;

std::mt19937_64 rng;
std::uniform_real_distribution<real> bitset_dist(0, 1);
real prob(0.75);

std::bitset<100> rand_bitset()
{
    std::bitset<100> bitset;
    for (size_t idx = 0; idx < bitset.size(); ++idx)
    {
        bitset[idx] = (bitset_dist(rng) < prob) ? true : false;
    }
    return std::move(bitset);
}

int main()
{
    rng.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

    size_t v1_size(3000);
    size_t v2_size(10000);

    std::vector<size_t> hd;
    std::vector<std::bitset<100>> vec1;
    std::vector<std::bitset<100>> vec2;
    vec1.reserve(v1_size);
    vec2.reserve(v2_size);
    hd.reserve(v1_size * v2_size); /// Edited from hd.reserve(v1_size);

    for (size_t i = 0; i < v1_size; ++i)
    {
        vec1.emplace_back(rand_bitset());
    }
    for (size_t i = 0; i < v2_size; ++i)
    {
        vec2.emplace_back(rand_bitset());
    }

    std::cout << "vec1 size: " << vec1.size() << '\n'
              << "vec2 size: " << vec2.size() << '\n';

    auto start(std::chrono::high_resolution_clock::now());
    for (const auto& b1 : vec1)
    {
        for (const auto& b2 : vec2)
        {
            /// Count only the bits that are set and store them
            hd.emplace_back((b1 ^ b2).count());
        }
    }
    auto time(std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start).count());

    std::cout << vec1.size() << " x " << vec2.size()
              << " xor operations on 100 bits took " << time << " ms\n";
    return 0;
}
On my machine, the whole operation (3000 x 10000) takes about 300 ms.
You could put this into a function, compile it into a library and call it from Python. Another option is to store the distances to a file and then read them in Python.
EDIT: I had the wrong size for the hd vector. Reserving the proper amount of memory reduces the operation to about 190 ms because relocations are avoided.
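If you do go the shared-library route, the Python side could be wired up with ctypes roughly as follows. This is only a sketch: the exported function name hamming_all and its signature are hypothetical, and the C++ code above would need a small extern "C" wrapper that takes raw pointers and writes the counts into a preallocated output buffer.

import ctypes
import numpy as np

# hypothetical wrapper exported from the compiled library:
#   extern "C" void hamming_all(const uint8_t* x, const uint8_t* y,
#                               uint16_t* out, int nx, int ny, int nbits);
lib = ctypes.CDLL("./libhamming.so")

X = np.random.randint(0, 2, (3000, 100), dtype=np.uint8)
Y = np.random.randint(0, 2, (10000, 100), dtype=np.uint8)
out = np.empty((3000, 10000), dtype=np.uint16)

lib.hamming_all(
    X.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    Y.ctypes.data_as(ctypes.POINTER(ctypes.c_uint8)),
    out.ctypes.data_as(ctypes.POINTER(ctypes.c_uint16)),
    3000, 10000, 100,
)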
I'm currently porting a C++ program to Python using Numpy arrays. I'm looking for a way to implement, if possible, the following loops in a more Pythonic way:
for (int j = start_y; j < end_y; j++)
{
    for (int i = start_x; i < end_x; i++)
    {
        plasmaFreq[i][j] = plasmaFreq_0*(tanh((i - 50)/10) - tanh((i - (nx - 50))/10))/2.0;
    }
}
Above, plasmaFreq_0 is a constant passed into the surrounding function, as is nx. Obviously it's easy to vectorize the loop bounds to operate on a particular region of a numpy array, but this leaves me with the issue of how to map the above index-dependent function across the array.
You'll need an array i,
i = np.arange(start_x, end_x)[:, None]  # column vector, so it broadcasts across the j (column) axis

plasmaFreq[start_x:end_x, start_y:end_y] = plasmaFreq_0*(np.tanh((i - 50)/10) - np.tanh((i - (nx - 50))/10))/2.0
I think that broadcasting should take it from there.
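As a quick sanity check (the sizes below are made up for illustration), the vectorized assignment can be compared against a direct transcription of the loops:

import numpy as np

nx, ny = 200, 150
plasmaFreq_0 = 1.0
start_x, end_x, start_y, end_y = 10, 190, 5, 145

# loop version, transcribed from the C++
ref = np.zeros((nx, ny))
for ii in range(start_x, end_x):
    for jj in range(start_y, end_y):
        ref[ii, jj] = plasmaFreq_0*(np.tanh((ii - 50)/10) - np.tanh((ii - (nx - 50))/10))/2.0

# vectorized version from above
vec = np.zeros((nx, ny))
i = np.arange(start_x, end_x)[:, None]  # column vector, broadcasts across columns
vec[start_x:end_x, start_y:end_y] = plasmaFreq_0*(np.tanh((i - 50)/10) - np.tanh((i - (nx - 50))/10))/2.0

print(np.allclose(ref, vec))  # should print True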
Note that your original code is quite inefficient[1]... First, you're calculating the right hand side for each j, but it doesn't depend on j, so you only really need to calculate it once. Second, your inner loop is over the slow index so you won't be effectively using your cache. I would probably write it as:
for (int i = start_x; i < end_x; i++)
{
    rhs = plasmaFreq_0*(tanh((i - 50)/10) - tanh((i - (nx - 50))/10))/2.0;
    for (int j = start_y; j < end_y; j++)
    {
        plasmaFreq[i][j] = rhs;
    }
}
[1] How inefficient depends on how well the compiler does at figuring out the loops. Someday maybe some compilers will generate the same code from yours and mine.
I was trying to figure out the fastest way to do matrix multiplication and tried 3 different ways:
Pure python implementation: no surprises here.
Numpy implementation using numpy.dot(a, b)
Interfacing with C using ctypes module in Python.
This is the C code that is transformed into a shared library:
#include <stdio.h>
#include <stdlib.h>

void matmult(float* a, float* b, float* c, int n) {
    int i = 0;
    int j = 0;
    int k = 0;
    /*float* c = malloc(nay * sizeof(float));*/
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            int sub = 0;
            for (k = 0; k < n; k++) {
                sub = sub + a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sub;
        }
    }
    return;
}
And the Python code that calls it:
def C_mat_mult(a, b):
    libmatmult = ctypes.CDLL("./matmult.so")

    dima = len(a) * len(a)
    dimb = len(b) * len(b)

    array_a = ctypes.c_float * dima
    array_b = ctypes.c_float * dimb
    array_c = ctypes.c_float * dima

    suma = array_a()
    sumb = array_b()
    sumc = array_c()

    inda = 0
    for i in range(0, len(a)):
        for j in range(0, len(a[i])):
            suma[inda] = a[i][j]
            inda = inda + 1

    indb = 0
    for i in range(0, len(b)):
        for j in range(0, len(b[i])):
            sumb[indb] = b[i][j]
            indb = indb + 1

    libmatmult.matmult(ctypes.byref(suma), ctypes.byref(sumb), ctypes.byref(sumc), 2)

    res = numpy.zeros([len(a), len(a)])
    indc = 0
    for i in range(0, len(sumc)):
        res[indc][i % len(a)] = sumc[i]
        if i % len(a) == len(a) - 1:
            indc = indc + 1

    return res
I would have bet that the version using C would have been faster... and I'd have lost! Below is my benchmark, which seems to show that I either did it incorrectly or that numpy is stupidly fast:
I'd like to understand why the numpy version is faster than the ctypes version, I'm not even talking about the pure Python implementation since it is kind of obvious.
NumPy uses a highly-optimized, carefully-tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for generic matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).
In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
I'm not too familiar with Numpy, but the source is on GitHub. Part of the dot product is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming gets translated into specific C implementations for each datatype. For example:
/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
#name#_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    #out# tmp = (#out#)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (#out#)(*((#type# *)ip1)) *
               (#out#)(*((#type# *)ip2));
    }
    *((#type# *)op) = (#type#) tmp;
}
/**end repeat**/
This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.
One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and try to iterate over each contiguous part first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happened to be stored in different major order). But it can at least do that for the result elements.
Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.
Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:
for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
            SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}
This unrolling factor is likely to have been picked after profiling several options. But one theoretical advantage of it is that more arithmetical operations are done between each branch point, so the compiler and CPU have more choice about how to schedule them optimally and get as much instruction pipelining as possible.
The language used to implement a certain functionality is a bad measure of performance by itself. Often, using a more suitable algorithm is the deciding factor.
In your case, you're using the naive approach to matrix multiplication as taught in school, which is O(n^3). However, you can do much better for certain kinds of matrices, e.g. square matrices, sparse matrices and so on.
Have a look at the Coppersmith–Winograd algorithm (square matrix multiplication in O(n^2.3737)) for a good starting point on fast matrix multiplication. Also see the section "References", which lists some pointers to even faster methods.
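As a taste of what those faster methods look like, here is a minimal sketch of Strassen's algorithm (one of the faster approaches alluded to above) in Python/NumPy. It assumes square matrices, falls back to the ordinary product for small or odd-sized blocks, and is meant to illustrate the idea rather than to compete with a tuned BLAS:

import numpy as np

def strassen(A, B, leaf=64):
    """Strassen matrix multiply for square matrices; recurses on even sizes only."""
    n = A.shape[0]
    if n <= leaf or n % 2:
        return A @ B  # fall back to the standard product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 recursive products instead of 8
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
print(np.allclose(strassen(A, B), A @ B))  # should print True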
For a more earthy example of astonishing performance gains, try to write a fast strlen() and compare it to the glibc implementation. If you don't manage to beat it, read glibc's strlen() source, it has fairly good comments.
Numpy is also highly optimized code. There is an essay about parts of it in the book Beautiful Code.
ctypes has to go through a dynamic translation from C to Python and back, which adds some overhead. In Numpy most matrix operations are done completely internal to it.
The people who wrote NumPy obviously know what they're doing.
There are many ways to optimize matrix multiplication. For example, the order in which you traverse the matrix affects the memory access patterns, which affect performance.
Good use of SSE is another way to optimize, which NumPy probably employs.
There may be more ways, which the developers of NumPy know and I don't.
BTW, did you compile your C code with optimization?
You can try the following optimization for C. It does work in parallel, and I suppose NumPy does something along the same lines.
NOTE: Only works for even sizes. With extra work, you can remove this limitation and keep the performance improvement.
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j += 2) {
        int sub1 = 0, sub2 = 0;
        for (k = 0; k < n; k++) {
            sub1 = sub1 + a[i * n + k] * b[k * n + j];
            sub2 = sub2 + a[i * n + k] * b[k * n + j + 1];
        }
        c[i * n + j] = sub1;
        c[i * n + j + 1] = sub2;
    }
}
The most common reason given for Fortran's speed advantage in numerical code, afaik, is that the language makes it easier to detect aliasing - the compiler can tell that the matrices being multiplied don't share the same memory, which can help improve caching (no need to be sure results are written back immediately into "shared" memory). This is why C99 introduced restrict.
However, in this case, I wonder if the numpy code is also managing to use some special instructions that the C code is not (as the difference seems particularly large).