Let's suppose I have a kernel to compute the element-wise sum of two arrays. Rather than passing a, b, and c as three parameters, I make them structure members as follows:
typedef struct
{
__global uint *a;
__global uint *b;
__global uint *c;
} SumParameters;
__kernel void compute_sum(__global SumParameters *params)
{
uint id = get_global_id(0);
params->c[id] = params->a[id] + params->b[id];
return;
}
The PyOpenCL documentation covers structures [1], and others have addressed this question too [2] [3] [4]. But none of the OpenCL struct examples I've been able to find have pointers as members.
Specifically, I'm worried about whether host/device address spaces match, and whether host/device pointer sizes match. Does anyone know the answer?
[1] http://documen.tician.de/pyopencl/howto.html#how-to-use-struct-types-with-pyopencl
[2] Struct Alignment with PyOpenCL
[3] http://enja.org/2011/03/30/adventures-in-opencl-part-3-constant-memory-structs/
[4] http://acooke.org/cute/Somesimple0.html
No, there is no guarantee that the address spaces match. For the basic types (float, int, …) you have alignment requirements (section 6.1.5 of the standard), and you have to use the cl_type names of the OpenCL implementation (when programming in C; I'd say PyOpenCL does the job under the hood).
For pointers it's even simpler because of this mismatch. The very beginning of section 6.9 of the standard v1.2 (section 6.8 in version 1.1) states:
Arguments to kernel functions declared in a program that are pointers
must be declared with the __global, __constant or __local qualifier.
And in the point p.:
Arguments to kernel functions that are declared to be a struct or
union do not allow OpenCL objects to be passed as elements of the
struct or union.
Note also the point d.:
Variable length arrays and structures with flexible (or unsized)
arrays are not supported.
So there is no way to make your kernel run as described in your question, and that's why you haven't been able to find any examples of OpenCL structs with pointers as members.
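The host side of the layout problem can be made concrete with a small Python sketch. It assumes nothing about any particular device: it only prints how large a struct of three pointers is on the host. A device whose pointer width differs (OpenCL reports it as CL_DEVICE_ADDRESS_BITS) would lay the same struct out differently. The HostSumParameters name is made up for this illustration.

```python
import ctypes

# Hypothetical host-side mirror of the SumParameters struct from the question.
class HostSumParameters(ctypes.Structure):
    _fields_ = [("a", ctypes.c_void_p),
                ("b", ctypes.c_void_p),
                ("c", ctypes.c_void_p)]

# Pointer width and struct size as seen by the host only.
host_pointer_bits = ctypes.sizeof(ctypes.c_void_p) * 8
host_struct_size = ctypes.sizeof(HostSumParameters)

print("host pointer width:", host_pointer_bits, "bits")
print("host struct size:", host_struct_size, "bytes")
```

Even when the sizes happen to agree, the values stored in such a struct would be host addresses, which mean nothing on the device.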
I can still propose a workaround that takes advantage of the fact that the kernel is compiled just in time. It still requires that you pack your data properly, that you pay attention to alignment, and that the size doesn't change during the execution of the program. I honestly would go for a kernel taking 3 buffers as arguments, but anyhow, here it is.
The idea is to use the preprocessor option -D, as in the following Python example:
Kernel:
typedef struct {
uint a[SIZE];
uint b[SIZE];
uint c[SIZE];
} SumParameters;
kernel void foo(global SumParameters *params){
int idx = get_global_id(0);
params->c[idx] = params->a[idx] + params->b[idx];
}
Host code:
import numpy as np
import pyopencl as cl
def bar():
    mf = cl.mem_flags
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prog_f = open('kernels.cl', 'r')
    # a = (1, 2, 3), b = (4, 5, 6)
    ary = np.array([(1, 2, 3), (4, 5, 6), (0, 0, 0)], dtype='uint32, uint32, uint32')
    cl_ary = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=ary)
    # The size should be computed here, but it is hardcoded for the example
    size = 3
    # The important part follows: the -D option
    prog = cl.Program(ctx, prog_f.read()).build(options="-D SIZE={0}".format(size))
    prog.foo(queue, (size,), None, cl_ary)
    result = np.zeros_like(ary)
    cl.enqueue_copy(queue, result, cl_ary).wait()
    print(result)
And the result:
[(1L, 2L, 3L) (4L, 5L, 6L) (5L, 7L, 9L)]
I don't know the answer to my own question, but there are 3 workarounds I can come up with off the top of my head. I consider Workaround 3 the best option.
Workaround 1: We only have 3 parameters here, so we could just make a, b, and c kernel parameters. But I've read there's a limit on the number of parameters you can pass to a kernel, and I personally like to refactor any function that takes more than 3-4 arguments to use structs (or, in Python, tuples or keyword arguments). So this solution makes the code harder to read, and doesn't scale.
Workaround 2: Dump everything in a single giant array. Then the kernel would look like this:
typedef struct
{
uint ai;
uint bi;
uint ci;
} SumParameters;
__kernel void compute_sum(__global SumParameters *params, __global uint *data)
{
uint id = get_global_id(0);
data[params->ci + id] = data[params->ai + id] + data[params->bi + id];
return;
}
In other words, instead of using pointers, use offsets into a single array. This looks an awful lot like the beginnings of implementing my own memory model, and it feels like it's reinventing a wheel that exists somewhere in PyOpenCL, or OpenCL, or both.
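To make the offset idea concrete, here is a host-only Python sketch that emulates what the kernel above would do. It is not OpenCL code: the loop variable gid merely stands in for get_global_id(0), and the packing order (a, then b, then c) is an assumption chosen for the illustration.

```python
from array import array

a = [1, 2, 3]
b = [4, 5, 6]
n = len(a)

# One flat buffer holding a, b, and room for c, back to back.
data = array('I', a + b + [0] * n)

# Offsets play the role of the ai, bi, ci struct members.
ai, bi, ci = 0, n, 2 * n

# Each iteration stands in for one work-item of the kernel.
for gid in range(n):
    data[ci + gid] = data[ai + gid] + data[bi + gid]

result = list(data[ci:ci + n])
print(result)  # [5, 7, 9]
```

The struct then contains only plain integers, so its layout no longer depends on pointer sizes.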
Workaround 3: Make setter kernels. Like this:
__kernel void set_a(__global SumParameters *params, __global uint *a)
{
params->a = a;
return;
}
and ditto for set_b, set_c. Then execute these kernels with worksize 1 to set up the data structure. You still need to know how big a block to allocate for params, but if it's too big, nothing bad will happen (except a little wasted memory), so I'd say just assume the pointers are 64 bits.
This workaround's performance is probably awful (I imagine a kernel call has enormous overhead), but fortunately that shouldn't matter too much for my application (my kernel is going to run for seconds at a time, it's not a graphics thing that has to run at 30-60 fps, so I imagine that the time taken by extra kernel calls to set parameters will end up being a tiny fraction of my workload, no matter how high the per-kernel-call overhead is).
I tried to compare the performance of both Python and Ctypes version of the sum function. I found that Python is faster than ctypes version.
Sum.c file:
int our_function(int num_numbers, int *numbers) {
int i;
int sum = 0;
for (i = 0; i < num_numbers; i++) {
sum += numbers[i];
}
return sum;
}
int our_function2(int num1, int num2) {
return num1 + num2;
}
I compiled it to a shared library:
gcc -shared -o Sum.so sum.c
Then I imported both the shared library and ctypes to use the C function. Below Sum.py:
import ctypes
_sum = ctypes.CDLL('./Sum.so')
_sum.our_function.argtypes = (ctypes.c_int, ctypes.POINTER(ctypes.c_int))
def our_function_c(numbers):
global _sum
num_numbers = len(numbers)
array_type = ctypes.c_int * num_numbers
result = _sum.our_function(ctypes.c_int(num_numbers), array_type(*numbers))
return int(result)
def our_function_py(numbers):
sum = 0
for i in numbers:
sum += i
return sum
import time
start = time.time()
print(our_function_c([1, 2, 3]))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py([1, 2, 3]))
end1 = time.time()
print("time taken py", end1-start1)
Output:
6
time taken C 0.0010006427764892578
6
time taken py 0.0
For a larger list, like list(range(int(1e5))):
start = time.time()
print(our_function_c(list(range(int(1e5)))))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(list(range(int(1e5)))))
end1 = time.time()
print("time taken py", end1-start1)
Output:
704982704
time taken C 0.011005163192749023
4999950000
time taken py 0.00500178337097168
Question: I tried to use more numbers, but Python still beats ctypes in terms of performance. So my question is, is there a rule of thumb when I should move to ctypes over Python (in terms of the order of magnitude of code)? Also, what is the cost to convert Python to Ctypes, please?
Why
Well, yes, in such a case it is not really worth it.
Before calling the C function, you spend a lot of time converting the numbers into c_int,
which is no less expensive than an addition.
Usually we use ctypes either when the data are generated on the C side, or when we generate them from Python but then use them for more than one simple operation.
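The conversion cost can be observed without any compiled library. The sketch below only times the list-to-C-array conversion that the question's wrapper performs on every call, next to a plain Python summation loop. Timings vary by machine, so none are quoted here.

```python
import ctypes
import time

numbers = list(range(100000))

# Cost of converting the Python list into a C int array
# (the question's our_function_c wrapper does this on every call).
t0 = time.perf_counter()
c_array = (ctypes.c_int * len(numbers))(*numbers)
t_convert = time.perf_counter() - t0

# Cost of simply summing in pure Python, for comparison.
t0 = time.perf_counter()
total = 0
for x in numbers:
    total += x
t_sum = time.perf_counter() - t0

print("conversion: %.6f s, pure-Python sum: %.6f s" % (t_convert, t_sum))
```

If the two numbers come out in the same ballpark, the C call cannot win, no matter how fast the C loop itself is.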
Same with pandas
This is, for example, what happens with numpy or pandas, two well-known examples of libraries written in C (or compiled, anyway) that allow a huge time gain (on the order of 1000×), as long as the data don't go back and forth between C space and Python space.
Numpy is faster than a list for many operations, for example, as long as you don't count a data conversion for each atomic operation.
Pandas often works with data read from CSV by pandas itself. The data stay in pandas space.
import time
import pandas as pd
lst=list(range(1000000))
start1=time.time()
s1=0
for x in lst:
s1+=x
end1=time.time()
start2=time.time()
df=pd.DataFrame({'x':lst})
middle2=time.time()
s2=df.x.sum()
end2=time.time()
print("python", s1, "t=", end1-start1)
print("pandas", s2, "t=", end2-start2, end2-middle2)
python 499999500000 t= 0.13175106048583984
pandas 499999500000 t= 0.35060644149780273 0.0020313262939453125
As you see, pandas is also way slower than Python by this standard,
but way faster if you don't count data creation.
Faster without data conversion
Try to run your code this way
import time
lst=list(range(1000))*1000
c_lst = (ctypes.c_int * len(lst))(*lst)
c_num = ctypes.c_int(len(lst))
start = time.time()
print(int(_sum.our_function(c_num, c_lst)))
end = time.time()
print("time taken C", end-start)
start1 = time.time()
print(our_function_py(lst))
end1 = time.time()
print("time taken py", end1-start1)
And the C code is way faster.
So, as with pandas, it isn't worth it if all you really need is to do one summation and then forget the data.
No such problem with c-extension
Note that with a Python C extension, which allows C functions to handle Python types, you don't have this problem (yet it is often less efficient because, well, Python lists are not just int * like C loves; but at least you don't need a conversion to C done from Python).
That is why you may sometimes see libraries for which, even counting conversion, using the external library is faster.
import numpy as np
np.array(lst).sum()
for example, is slightly faster. But only barely, when we are used to numpy being 1000× faster, because numpy.array has to read its data out of the Python list.
But that is not pure ctypes (by ctypes, I mean "using C functions from the C world handling C data, not caring about Python at all"). Plus, I am not even sure that this is the only reason. Numpy might be cheating by using several threads and vectorization, which neither Python nor your C code does.
Example that needs no big data conversion
So, let's add another example, and add this to your code
int sumFirst(int n){
int s=0;
for(int i=0; i<n; i++){
s+=i;
}
return s;
}
And try it with
import ctypes
_sum = ctypes.CDLL('./ctypeBench.so')
_sum.sumFirst.argtypes = (ctypes.c_int,)
def c_sumFirst(n):
return _sum.sumFirst(ctypes.c_int(n))
import time
lst=list(range(10000))
start1=time.time()
s1=0
for x in lst:
s1+=x
end1=time.time()
start2=time.time()
s2=c_sumFirst(10000)
end2=time.time()
print(f"python {s1=}, Δt={end1-start1}")
print(f"c {s2=}, Δt={end2-start2}")
Result is
python s1=49995000, Δt=0.0012884140014648438
c s2=49995000, Δt=4.267692565917969e-05
And note that I was fair to Python: I did not count data generation in its time (I explicitly materialized the range as a list, which doesn't change much).
So the conclusion is: you can't expect a ctypes function to gain time for a single operation per datum, such as +, when you need one conversion per datum to use it.
Either you need to use a C extension and write ad hoc code that handles a Python list (and even there, you won't gain much if you have just one addition to do per value).
Or you need to keep the data on the C side, creating them from C and leaving them there (as you do with pandas or numpy: you use dataframes or ndarrays as much as possible with pandas and numpy functions or operators, not pulling everything into Python with full indexing or .iloc).
Or you need to have really more than one addition to do per datum.
Addendum: c-extension
Just to add another argument in favor of "the problem is conversion", but also to make explicit what to do if you really need to do one simple operation on a list and don't want to convert every element beforehand, you can try this:
modmy.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#define PY3K
static PyObject *mysum(PyObject *self, PyObject *args){
PyObject *list;
PyArg_ParseTuple(args, "O", &list);
PyObject *it = PyObject_GetIter(list);
long int s=0;
for(;;){
PyObject *v = PyIter_Next(it);
if(!v) break;
long int iv=PyLong_AsLong(v);
s+=iv;
}
return PyLong_FromLong(s);
}
static PyMethodDef MyMethods[] = {
{"mysum", mysum, METH_VARARGS, "Sum list"},
{NULL, NULL, 0, NULL} /* Sentinel */
};
static struct PyModuleDef modmy = {
PyModuleDef_HEAD_INIT,
"modmy",
NULL,
-1,
MyMethods
};
PyMODINIT_FUNC
PyInit_modmy()
{
return PyModule_Create(&modmy);
}
Compile with
gcc -fPIC `python3-config --cflags` -c modmy.c
gcc -shared -fPIC `python3-config --ldflags` -o modmy.so modmy.o
Then
import time
import modmy
lst=list(range(10000000))
start1=time.time()
s1=0
for x in lst:
s1+=x
end1=time.time()
start2=time.time()
s2=modmy.mysum(lst)
end2=time.time()
print("python res=%d t=%5.2f"%(s1, end1-start1))
print("c res=%d t=%5.2f"%(s2, end2-start2))
This time there is no need for conversion (or, to be more accurate, there is still a need for conversion, but it is done by the C code, since it is not just any C code but code written ad hoc to extend Python).
(And after all, the Python interpreter, under the hood, also needs to unpack the elements.)
Note that my code checks nothing. It assumes that you are really calling mysum with a single argument that is a list of integers. God knows what happens if you don't. Well, not just God. Just try:
>>> import modmy
>>> modmy.mysum(12)
Segmentation fault (core dumped)
$
Python crashes (not just the Python code: it is not a Python error, the Python process dies).
But the result is worth it
python res=49999995000000 t= 1.22
c res=49999995000000 t= 0.11
So, you see, this time C wins, because both play by the same rules (they do the same thing; C just does it faster).
So you need to know what you are doing. But this does what you expected: a very simple operation on a list of integers that runs faster in C than in Python.
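A side note on the question's larger run: the C version printed 704982704 where Python printed 4999950000. That is consistent with a 32-bit int wrapping around, assuming int is 32 bits on the asker's platform (the usual case). The sketch below reproduces the wrapped value with ctypes alone, using c_int32 as a stand-in for the C int accumulator.

```python
import ctypes

true_sum = sum(range(100000))              # Python integers never overflow
wrapped = ctypes.c_int32(true_sum).value   # keep only the low 32 bits, signed

print(true_sum)  # 4999950000
print(wrapped)   # 704982704
```

So beyond speed, the ctypes version with int silently returns a wrong result for that input. The extension above accumulates in a long int, which is 64 bits wide on typical 64-bit Linux builds and is enough for these sums.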
I'm trying to transfer some code I've previously written in python into C++, and I'm currently testing xtensor to see if it can be faster than numpy for doing what I need it to.
One of my functions takes a square matrix d and a scalar alpha, and performs the elementwise operation alpha/(alpha+d). Background: this function is used to test which value of alpha is 'best', so it is in a loop where d is always the same, but alpha varies.
All of the following time scales are an average of 100 instances of running the function.
In numpy, it takes around 0.27 seconds to do this, and the code is as follows:
def kfun(d,alpha):
k = alpha /(d+alpha)
return k
but xtensor takes about 0.36 seconds, and the code looks like this:
xt::xtensor<double,2> xk(xt::xtensor<double,2> d, double alpha){
return alpha/(alpha+d);
}
I've also attempted the following version using std::vector, but this is something I do not want to use in the long run, even though it only took 0.22 seconds.
std::vector<std::vector<double>> kloops(std::vector<std::vector<double>> d, double alpha, int d_size){
for (int i = 0; i<d_size; i++){
for (int j = 0; j<d_size; j++){
d[i][j] = alpha/(alpha + d[i][j]);
}
}
return d;
}
I've noticed that the operator/ in xtensor uses "lazy broadcasting", is there maybe a way to make it immediate?
EDIT:
In Python, the function is called as follows, and timed using the "time" package
t0 = time.time()
for i in range(100):
kk = k(dsquared,alpha_squared)
print(time.time()-t0)
In C++ I call the function as follows, timing it with std::chrono:
//d is saved as a 1D npy file, an artefact from old code
auto sd2 = xt::load_npy<double>("/path/to/d.npy");
shape = {7084, 7084};
xt::xtensor<double, 2> xd2(shape);
for (int i = 0; i<7084;i++){
for (int j=0; j<7084;j++){
xd2(i,j) = (sd2(i*7084+j));
}
}
auto start = std::chrono::steady_clock::now();
for (int i = 0;i<10;i++){
matrix<double> kk = kfun(xd2,4000*4000,7084);
}
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "k takes: " << elapsed_seconds.count() << "\n";
If you wish to run this code, I'd suggest using xd2 as a symmetric 7084x7084 random matrix with zeros on the diagonal.
The output of the function, a matrix called k, then goes on to be used in other functions, but I still need d to be unchanged as it will be reused later.
END EDIT
To run my C++ code I use the following line in the terminal:
cd "/path/to/src/" && g++ -mavx2 -ffast-math -DXTENSOR_USE_XSIMD -O3 ccode.cpp -o ccode -I/path/to/xtensorinclude && "/path/to/src/"ccode
Thanks in advance!
A problem with the C++ implementation may be that it creates one or possibly even two temporary copies that could be avoided. The first copy comes from not passing the argument by reference (or perfect forwarding). Without looking at the rest of the code, it's hard to judge whether this has an impact on performance. The compiler may move d into the method if it's guaranteed not to be used after the call to xk(), but it is more likely to copy the data into d.
To pass by reference, the method could be changed to
xt::xtensor<double,2> xk(const xt::xtensor<double,2>& d, double alpha){
return alpha/(alpha+d);
}
To use perfect forwarding (and also enable other xtensor containers like xt::xarray or xt::xtensor_fixed), the method could be changed to
template<typename T>
xt::xtensor<double,2> xk(T&& d, double alpha){
return alpha/(alpha+d);
}
Furthermore, it's possible that you can save yourself from reserving memory for the return value. Again, it's hard to judge without seeing the rest of the code. But if the method is used inside a loop, and the return value always has the same shape, then it can be beneficial to create the return value outside of the loop and return by reference. To do this, the method could be changed to:
template<typename T, typename U>
void xk(T& r, U&& d, double alpha){
r = alpha/(alpha+d);
}
If it is guaranteed that d and r do not point to the same memory, you can further wrap r in xt::noalias() to avoid a temporary copy before assigning the result. The same is true for the return value of the function in case you do not return by reference.
Good luck and happy coding!
I am using ctypes to try and speed up my code.
My problem is similar to the one in this tutorial : https://cvstuff.wordpress.com/2014/11/27/wraping-c-code-with-python-ctypes-memory-and-pointers/
As pointed out in the tutorial I should free the memory after using the C function. Here is my C code
//C functions
double* getStuff(double *R_list, int items){
double results[items];
double* results_p;
for(int i = 0; i < items; i++){
res = calculation ; \\do some calculation
results[i] = res; }
results_p = results;
printf("C allocated address %p \n", results_p);
return results_p; }
void free_mem(double *a){
printf("freeing address: %p\n", a);
free(a); }
Which I compile with gcc -shared -Wl,-lgsl,-soname, simps -o libsimps.so -fPIC simps.c
And python:
//Python
from ctypes import *
import numpy as np
mydll = CDLL("libsimps.so")
mydll.getStuff.restype = POINTER(c_double)
mydll.getStuff.argtypes = [POINTER(c_double),c_int]
mydll.free_mem.restype = None
mydll.free_mem.argtypes = [POINTER(c_double)]
R = np.logspace(np.log10(0.011),1, 100, dtype = float) #input
tracers = c_int(len(R))
R_c = R.ctypes.data_as(POINTER(c_double))
for_list = mydll.getStuff(R_c,tracers)
print 'Python allocated', hex(for_list)
for_list_py = np.array(np.fromiter(for_list, dtype=np.float64, count=len(R)))
mydll.free_mem(for_list)
Up to the last line the code does what I want it to and the for_list_py values are correct. However, when I try to free the memory, I get a Segmentation fault and on closer inspection the address associated with for_list --> hex(for_list) is different to the one allocated to results_p within C part of the code.
As pointed out in this question, Python ctypes: how to free memory? Getting invalid pointer error , for_list will return the same address if mydll.getStuff.restype is set to c_void_p. But then I struggle to put the actual values I want into for_list_py. This is what I've tried:
cast(for_list, POINTER(c_double) )
for_list_py = np.array(np.fromiter(for_list, dtype=np.float64, count=len(R)))
mydll.free_mem(for_list)
where the cast operation seems to change for_list into an integer. I'm fairly new to C and very confused. Do I need to free that chunk of memory? If so, how do I do that whilst also keeping the output in a numpy array? Thanks!
Edit: It appears that the address allocated in C and the one I'm trying to free are the same, though I still receive a Segmentation fault.
C allocated address 0x7ffe559a3960
freeing address: 0x7ffe559a3960
Segmentation fault
If I do print for_list I get <__main__.LP_c_double object at 0x7fe2fc93ab00>
Conclusion
Just to let everyone know, I've struggled with ctypes for a bit.
I've ended up opting for SWIG instead of ctypes. I've found that the code runs faster on the whole (compared to the version presented here). I found this documentation on dealing with memory deallocation in SWIG very useful: https://scipy-cookbook.readthedocs.io/items/SWIG_Memory_Deallocation.html. It also helps that SWIG gives you a very easy way of dealing with numpy n-dimensional arrays.
After the getStuff function exits, the memory occupied by the results array is no longer available: results is a local array, so it was never allocated with malloc, and when you try to free it, the program crashes.
Try this instead:
double* getStuff(double *R_list, int items)
{
    double* results_p = malloc(sizeof(*results_p) * (items + 1));
    if (results_p == NULL)
    {
        // handle error
        return NULL;
    }
    for(int i = 0; i < items; i++)
    {
        res = calculation ; // do some calculation
        results_p[i] = res;
    }
    printf("C allocated address %p \n", (void *)results_p);
    return results_p;
}
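On the Python side, the question's cast attempt discarded the result of cast(), which returns a new pointer object and never changes its argument. The following sketch shows the pattern with a stand-in buffer so that it runs without libsimps.so. The c_buffer array is an assumption standing in for the memory getStuff would return.

```python
import ctypes

# Stand-in for memory returned by C: a ctypes array owned by Python,
# so this sketch runs without the compiled library.
c_buffer = (ctypes.c_double * 3)(1.0, 2.0, 3.0)

# With restype = c_void_p, ctypes hands back a plain integer address.
address = ctypes.addressof(c_buffer)

# cast() returns a new pointer object; it does not modify its argument,
# so the result has to be assigned.
typed_ptr = ctypes.cast(address, ctypes.POINTER(ctypes.c_double))

# Copy the values into Python-owned storage before the C memory is freed.
values = [typed_ptr[i] for i in range(3)]
print(values)  # [1.0, 2.0, 3.0]
```

With the real library, the copy would be followed by a call to free_mem on the same address, and only for memory that came from malloc.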
in my C OpenCL code I use clSetKernelArg to create 'variable size' __local memory for use in my kernels, which is not available in OpenCL per se. See my example:
clSetKernelArg(clKernel, ArgCounter++, sizeof(cl_mem), (void *)&d_B);
...
clSetKernelArg(clKernel, ArgCounter++, sizeof(float)*block_size*block_size, NULL);
...
kernel="
matrixMul(__global float* C,
...
__local float* A_temp,
...
)"
{...
My question is now, how to do the same in pyopencl?
I looked through the examples that come with pyopencl, but the only thing I could find was an approach using templates, which, as I understand it, seems like overkill. See the example:
kernel = """
__kernel void matrixMul(__global float* C,...){
...
__local float A_temp[ %(mem_size) ];
...
}
What do you recommend?
It is similar to C. You pass it a fixed size array as a local. Here is an example from Enja's radix sort. Notice the last argument is a local memory array.
def naive_scan(self, num):
nhist = num/2/self.cta_size*16
global_size = (nhist,)
local_size = (nhist,)
extra_space = nhist / 16 #NUM_BANKS defined as 16 in RadixSort.cpp
shared_mem_size = self.uintsz * (nhist + extra_space)
scan_args = ( self.mCountersSum,
self.mCounters,
np.uint32(nhist),
cl.LocalMemory(2*shared_mem_size)
)
self.radix_prg.scanNaive(self.queue, global_size, local_size, *(scan_args)).wait()
I am not familiar with Python and its OpenCL implementation, but local memory can also be created within the kernel with a fixed size (similar to what you did):
__kernel void matrixMul(...) {
__local float A_templ[1024];
}
Instead of 1024 a defined preprocessor symbol can be used and can be set during compilation to change the size:
#define SIZE 1024
__kernel void matrixMul(...) {
__local float A_templ[SIZE];
}
SIZE can be defined within the same source, as a compiler parameter for clBuildProgram, or in an additional source passed to clCreateProgramWithSource.
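Seen from Python, both variants come down to string handling before the program is built. The sketch below does not call any OpenCL API: it only shows how the source text or the build option string could be prepared. The kernel body, block_size, and variable names are placeholders chosen for the illustration.

```python
# Hypothetical kernel template; the body is reduced to the local array.
kernel_template = """
__kernel void matrixMul(__global float* C)
{
    __local float A_temp[%(mem_size)d];
}
"""

block_size = 16
mem_size = block_size * block_size

# Option 1: substitute the size into the source text before compiling.
kernel_source = kernel_template % {"mem_size": mem_size}

# Option 2: keep SIZE symbolic in the source and pass it as a build option.
build_options = "-D SIZE=%d" % mem_size

print(kernel_source)
print(build_options)
```

Either string would then be handed to the build step. Note that the template placeholder needs a conversion character, as in %(mem_size)d.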
EDIT: Found something with Google ;-): http://linksceem.eu/joomla/files/PRACE_Winter_School/LinkSCEMM_pyOpenCL.pdf
I have a data file of almost 9 million lines (soon to be more than 500 million lines) and I'm looking for the fastest way to read it in. The five aligned columns are padded and separated by spaces, so I know where on each line to look for the two fields that I want.
My Python routine takes 45 secs:
import sys,time
start = time.time()
filename = 'test.txt' # space-delimited, aligned columns
trans=[]
numax=0
for line in open(filename,'r'):
    nu=float(line[-23:-11]); S=float(line[-10:-1])
    if nu>numax: numax=nu
    trans.append((nu,S))
end=time.time()
print len(trans),'transitions read in %.1f secs' % (end-start)
print 'numax =',numax
whereas the routine I've come up with in C is a more pleasing 4 secs:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define BPL 47
#define FILENAME "test.txt"
#define NTRANS 8858226
int main(void) {
size_t num;
unsigned long i;
char buf[BPL];
char* sp;
double *nu, *S;
double numax;
FILE *fp;
time_t start,end;
nu = (double *)malloc(NTRANS * sizeof(double));
S = (double *)malloc(NTRANS * sizeof(double));
start = time(NULL);
if ((fp=fopen(FILENAME,"rb"))!=NULL) {
i=0;
numax=0.;
do {
if (i==NTRANS) {break;}
num = fread(buf, 1, BPL, fp);
buf[BPL-1]='\0';
sp = &buf[BPL-10]; S[i] = atof(sp);
buf[BPL-11]='\0';
sp = &buf[BPL-23]; nu[i] = atof(sp);
if (nu[i]>numax) {numax=nu[i];}
++i;
} while (num == BPL);
fclose(fp);
end = time(NULL);
fprintf(stdout, "%d lines read; numax = %12.6f\n", (int)i, numax);
fprintf(stdout, "that took %.1f secs\n", difftime(end,start));
} else {
fprintf(stderr, "Error opening file %s\n", FILENAME);
free(nu); free(S);
return EXIT_FAILURE;
}
free(nu); free(S);
return EXIT_SUCCESS;
}
Solutions in Fortran, C++ and Java take intermediate amounts of time (27 secs, 20 secs, 8 secs).
My question is: have I made any outrageous blunders in the above (particularly the C-code)? And is there any way to speed up the Python routine? I quickly realised that storing my data in an array of tuples was better than instantiating a class for each entry.
Some points:
Your C routine is cheating; it is being tipped off with the filesize, and is pre-allocating ...
Python: consider using array.array('d') ... one each for S and nu. Then try pre-allocation.
Python: write your routine as a function and call it -- accessing function-local variables is rather faster than accessing module-global variables.
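The two Python points can be sketched together as follows. The sample file is generated on the fly, and its layout (24 padding characters, a 12-character nu field, a space, a 9-character S field, a newline) is an assumption made to match the slices used in the question. The read_transitions name is made up for the illustration.

```python
import array
import os
import tempfile

def read_transitions(path):
    """Parse fixed-width lines into two typed arrays, using only local variables."""
    nu = array.array('d')
    S = array.array('d')
    numax = 0.0
    with open(path, 'r') as f:
        for line in f:
            value = float(line[-23:-11])
            nu.append(value)
            S.append(float(line[-10:-1]))
            if value > numax:
                numax = value
    return nu, S, numax

# Build a tiny sample file with the assumed 47-character layout.
rows = [(1234.5, 1.5e-3), (99.25, 2.0e-5), (2500.125, 3.25e-7)]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    for nu_val, s_val in rows:
        tmp.write("%-24s%12.6f %9.3e\n" % ("x", nu_val, s_val))
    path = tmp.name

nu, S, numax = read_transitions(path)
os.remove(path)
print(len(nu), 'transitions read; numax =', numax)
```

Whether this beats the list of tuples on the real file would have to be measured. The typed arrays mainly save memory, and the function wrapper speeds up variable access.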
An approach that could probably be applied to the C, C++ and Python versions would be to memory-map the file. The most significant benefit is that it can reduce the amount of double-handling of data as it is copied from one buffer to another. In many cases there are also benefits due to the reduction in the number of system calls for I/O.
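For the Python version, a memory-mapped read could look like the sketch below. It uses the standard mmap module and the same 47-byte record length as the C code. The sample file and its field layout are assumptions made for the illustration, and no speed-up is claimed without measuring on the real data.

```python
import mmap
import os
import tempfile

BPL = 47  # bytes per line, including the newline (same value as in the C code)

# Build a tiny sample file with the assumed fixed-width layout.
rows = [(1234.5, 1.5e-3), (99.25, 2.0e-5), (2500.125, 3.25e-7)]
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for nu_val, s_val in rows:
        tmp.write(("%-24s%12.6f %9.3e\n" % ("x", nu_val, s_val)).encode("ascii"))
    path = tmp.name

nu, S = [], []
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        # Step through the mapping one fixed-size record at a time.
        for base in range(0, len(mm) - BPL + 1, BPL):
            line_end = base + BPL
            nu.append(float(mm[line_end - 23:line_end - 11].decode("ascii")))
            S.append(float(mm[line_end - 10:line_end - 1].decode("ascii")))
    finally:
        mm.close()
os.remove(path)

numax = max(nu)
print(len(nu), "lines read; numax =", numax)
```

Because every record has the same length, the field positions are computed directly from the record index and no line splitting is needed.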
In the C implementation, you could try swapping the fopen()/fread()/fclose() library functions for the lower-level system calls open()/read()/close(). A speedup may come from the fact that fread() does a lot of buffering, whereas read() does not.
Additionally, calling read() less often with bigger chunks will reduce the number of system calls, and therefore you'll have less switching between userspace and kernelspace. What the kernel does when you issue a read() system call (it doesn't matter if it was invoked from the fread() library function) is read the data from the disk and then copy it to userspace. The copying part becomes expensive if you issue the system call very often in your code. By reading in larger chunks you'll end up with fewer context switches and less copying.
Keep in mind though that read() isn't guaranteed to return a block of the exact number of bytes you wanted. This is why in a reliable and proper implementation you always have to check the return value of the read().
You have the 1 and the BPL arguments the wrong way around in fread() (the way you have it, it could read a partial line, which you don't test for). You should also be testing the return value of fread() before you try and use the returned data.
You might be able to speed the C version up a bit by reading more than a line at a time:
#define LINES_PER_READ 1000
char buf[LINES_PER_READ][BPL];
/* ... */
while (i < NTRANS && (num = fread(buf, BPL, LINES_PER_READ, fp)) > 0) {
int line;
for (line = 0; i < NTRANS && line < num; line++)
{
buf[line][BPL-1]='\0';
sp = &buf[line][BPL-10]; S[i] = atof(sp);
buf[line][BPL-11]='\0';
sp = &buf[line][BPL-23]; nu[i] = atof(sp);
if (nu[i]>numax) {numax=nu[i];}
++i;
}
}
On systems supporting posix_fadvise(), you should also do this upfront, after opening the file:
posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
Another possible speed-up, given the number of times you need to do it, is to use pointers to S and nu instead of indexing into arrays, e.g.,
double *pS = S, *pnu = nu;
...
*pS++ = atof(sp);
*pnu = atof(sp);
...
Also, since you are always converting from char to double at the same locations in buf, pre-compute the addresses outside of your loop instead of computing them each time in the loop.