Ahoy. I was tasked with improving the performance of bit.ly's data_hacks sample.py, as a practice exercise.
I have Cythonized part of the code and included a PCG random generator, which has so far cut the runtime by about 20 seconds (down from 72s). I also optimized the print output by using a basic C function instead of Python's write().
This has all worked well, but aside from these fix-ups, I'd like to optimize the loop itself.
The basic function, as seen in bit.ly's sample.py:
def run(sample_rate):
    input_stream = sys.stdin
    for line in input_stream:
        if random.randint(1,100) <= sample_rate:
            sys.stdout.write(line)
My implementation:
cdef int take_sample(float sample_rate):
    cdef unsigned int floor = 1
    cdef unsigned int top = 100
    if pcg32_random() % 100 <= sample_rate:
        return 1
    else:
        return 0
def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            if take_sample(sample_rate):
                out(line)
What I would like to improve on now is specifically skipping the next line (and preferably doing so repeatedly) if my take_sample() doesn't return True.
My current implementation is this:
def run(float sample_rate, file):
cdef char* line
with open(file, 'rb') as f:
for line in f:
out(line)
while not take_sample(sample_rate):
next(f)
This appears to do nothing to improve performance, leading me to suspect I've merely replaced a continue call after an if condition at the top of the loop with my next(f).
So the question is this:
Is there a more efficient way to loop over a file (in Cython)?
I'd like to omit lines entirely, meaning they should only truly be accessed if I call my out(). Is this already the case in Python's for loop?
Is line a pointer (or something comparable) to the line in the file, or does the loop actually load it?
I realize that I could improve on it by writing it entirely in C, but I'd like to know how far I can push this while staying with Python/Cython.
Update:
I've tested a C variant of my code, using the same test case, and it clocks in at under 2s (surprising no one). So while it's true that the random generator and file I/O are two major bottlenecks generally speaking, it should be pointed out that Python's file handling is in itself already darn slow.
So, is there a way to make use of C's file reading, other than implementing the loop itself in Cython? The overhead is still slowing the Python code down significantly, which makes me wonder if I've simply hit the sound barrier of performance when it comes to file handling in Cython?
If the file is small, you can read it in whole with .readlines() at once (possibly reducing I/O traffic) and iterate over the resulting list of lines.
If the sample rate is small enough, you may consider sampling from a geometric distribution, which can be more efficient: instead of one random test per line, you draw how many lines to skip before the next sampled one, as in the sketch below.
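A minimal pure-Python sketch of that geometric-skip idea (run_geometric is a hypothetical name; sample_rate is taken as a percentage, matching the question's code):

import sys
from math import log
from random import random

def run_geometric(sample_rate, path):
    p = sample_rate / 100.0
    if p <= 0.0:                      # nothing to sample
        return
    if p >= 1.0:                      # degenerate case: emit every line
        with open(path) as f:
            sys.stdout.writelines(f)
        return
    with open(path) as f:
        lines = f.readlines()         # one big read, as suggested above
    i = -1
    n = len(lines)
    while True:
        u = 1.0 - random()            # uniform on (0, 1], keeps log() finite
        # geometric gap: lines skipped before the next sampled one
        i += int(log(u) / log(1.0 - p)) + 1
        if i >= n:
            break
        sys.stdout.write(lines[i])

One draw of the gap replaces many per-line random tests, and the skipped lines are never touched.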
I do not know Cython, but I would also consider:
- simplifying take_sample() by removing the unnecessary variables and returning the boolean result of the test instead of an integer (see the sketch below),
- changing the signature of take_sample() to take_sample(int) to avoid an int-to-float conversion on every test.
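Together, those two points reduce take_sample() to a single expression. In plain Python it would look like this (in Cython you would additionally give it a typed signature, something like cdef bint take_sample(int sample_rate); pcg32_random() is the question's own generator):

def take_sample(sample_rate):
    # return the comparison itself instead of branching to return 1 or 0
    return pcg32_random() % 100 <= sample_rate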
[EDIT]
According to a comment by @hpaulj, it may be better to use .read().split('\n') instead of the .readlines() suggested by me.
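That is (a sketch; with the file opened in binary mode, as in the question, the delimiter must be the byte string b'\n' on Python 3):

with open(file, 'rb') as f:
    lines = f.read().split(b'\n')   # a single read() call, then split in memory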
Related
I'm trying to translate the following line of C code into ctypes. Here's a snippet from the C program I'm trying to translate:
pIfRow = (MIB_IF_ROW2 *) malloc(sizeof(MIB_IF_ROW2));
SecureZeroMemory((PVOID)pIfRow, sizeof(MIB_IF_ROW2));
(Note that MIB_IF_ROW2 is a struct, defined in Netioapi.h)
Anyway, I can translate the first line fine in ctypes, assuming MIB_IF_ROW2 has already been defined as a ctypes struct:
from ctypes import *
# Translate first line of C Code
buff = create_string_buffer(sizeof(MIB_IF_ROW2))
p_if_row = cast(buff, POINTER(MIB_IF_ROW2))
# Second Line... ?
But when I get to the second line, I get stuck. I can't find anything in the docs or online with a ctypes equivalent for the function. What is the best way to go about this?
SecureZeroMemory will just fill the memory you pass it with zeroes. You should get the exact same result with ZeroMemory/memset or a plain loop in Python. The thing that makes it "secure" is that it is not supposed to be optimized away by the compiler (which matters when programming at a lower level like C/C++).
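In ctypes this is a one-liner via ctypes.memset, which wraps the C memset (a sketch continuing the question's snippet, so p_if_row and MIB_IF_ROW2 come from there):

import ctypes

# Zero the freshly allocated struct. For just-allocated memory this is
# functionally equivalent to SecureZeroMemory; the "not optimized away"
# guarantee only matters when erasing secrets.
ctypes.memset(p_if_row, 0, ctypes.sizeof(MIB_IF_ROW2))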
Using it on memory you just malloc'ed is not its intended purpose (not harmful though), it is supposed to be used like this:
char password[100];
AskUserForPassword(password);
DoSomething(password);
SecureZeroMemory(password, sizeof(password)); // Make sure password is no longer visible in memory in case the application is paged out or creates a memory dump in a crash
Trying to figure out how to use Cython to bypass the GIL and load files in parallel for I/O-bound tasks. For now I have the following Cython code trying to load files n0.npy, n1.npy, ..., n99.npy:
import numpy as np
from cython.parallel import prange

def foo_parallel():
    cdef int i
    for i in prange(100, nogil=True, num_threads=8):
        with gil:
            np.load('n' + str(i) + '.npy')
    return []

def foo_serial():
    cdef int i
    for i in range(100):
        np.load('n' + str(i) + '.npy')
    return []
I'm not noticing a significant speedup; does anyone have any experience with this?
Edit: I'm getting around 900ms in parallel vs 1.3s serially. I would expect more speedup given 8 threads.
As the comment states, you can't call into NumPy while holding the GIL and expect the loop to actually run in parallel; the with gil block serializes it again. You need C- or C++-level file operations to do this. See this post for a potential solution: http://www.code-corner.de/?p=183
I.e., adapt the file_io.pyx from that post to your problem; I'd post it here but can't figure out how to on my cell phone. Add nogil to the end of the cdef statement there and call the function from a cpdef foo_parallel function within your prange loop. Use read_file, not the slow variant, and change it to cdef. Please post benchmarks after doing so, as I'm curious; I have no computer on vacation.
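If you only need the loads overlapped and don't insist on the C-level route above, a plain-Python thread pool is worth benchmarking against it, since CPython releases the GIL during blocking disk reads (to be clear, this is an alternative technique, not the one described above):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_one(i):
    # np.load spends much of its time in raw file reads, during which
    # CPython drops the GIL, so threads can genuinely overlap the I/O
    return np.load('n' + str(i) + '.npy')

def foo_threads():
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(load_one, range(100)))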
I have a big block of Cython code that parses Touchstone files, and I want it to work with both Python 2 and Python 3. I'm using very C-style parsing techniques for what I thought would be maximum efficiency, including manually malloc-ing and free-ing char* buffers instead of using bytes so that I can avoid the GIL. When compiled using
python 3.5.2 0 anaconda
cython 0.24.1 py35_0 anaconda
I see speeds that I'm happy with: a moderate boost on small files (~20% faster) and a huge boost on large files (~2.5x faster). When compiled against
python 2.7.12 0 anaconda
cython 0.24.1 py27_0 anaconda
it runs about 125x slower (~17ms in Python 3 vs ~2.2s in Python 2). It's the exact same code compiled in different environments using a pretty simple setuptools script. I'm not currently using NumPy from Cython for any of the parsing or data storage.
import cython
cimport cython
from cython cimport array
import array
from libc.stdlib cimport strtod, malloc, free
from libc.string cimport memcpy

ctypedef long long int64_t  # Really VS2008? Couldn't include this by default?

# Bunch of definitions and utility functions omitted

@cython.boundscheck(False)
cpdef Touchstone parse_touchstone(bytes file_contents, int num_ports):
    cdef:
        char c
        char* buffer = <char*> file_contents
        int64_t length_of_buffer = len(file_contents)
        int64_t i = 0
        # These are some cpdef enums
        FreqUnits freq_units
        Domain domain
        Format fmt
        double z0
        bint option_line_found = 0
        array.array data = array.array('d')
        array.array row = array.array('d', [0 for _ in range(row_size)])

    while i < length_of_buffer:
        c = buffer[i]  # cdef char c
        if is_whitespace(c):
            i += 1
            continue
        if is_comment_char(c):
            # Returns the last index of the comment
            i = parse_comment(buffer, length_of_buffer)
            continue
        if not option_line_found and is_option_leader_char(c):
            # Returns the last index of the option line
            # assigns values of all references passed in
            i = parse_option_line(
                buffer, length_of_buffer, i,
                &domain, &fmt, &z0, &freq_units)
            if i < 0:
                # Lots of boring code along the lines of
                # if i == some_int:
                #     raise Exception("message")
                # I did this so that only my top-level parse has to interact
                # with the interpreter; all the lower-level functions have nogil
                pass
            option_line_found = 1
        if option_line_found:
            if is_digit(c):
                # Parse a float
                row[row_idx] = strtod(buffer + i, &end_of_value)
                # Jump the cursor to the end of that float
                i = end_of_value - p - 1
                row_idx += 1
                if row_idx == row_size:
                    # append this row onto the main data array
                    data.extend(row)
                    row_idx = 0
        i += 1

    return Touchstone(num_ports, domain, fmt, z0, freq_units, data)
I've ruled out a few things, such as type casts. I also tested code that simply loops over the entire file doing nothing; either Cython optimized that away or it's just really fast, because it causes parse_touchstone to not even show up in a cProfile/pstats report.

I determined that it's not just the comment, whitespace, and option-line parsing (not shown is the significantly more complicated keyword-value parsing) after I threw a print statement into the last if row_idx == row_size block to print out a status, and discovered that it's taking about 0.5-1 second (guesstimate) to parse a row with 512 floating-point numbers on it. That really should not take so long, especially when using strtod to do the parsing. I also checked parsing just 2 rows' worth of values and then jumping out of the while loop, and it told me that parsing the comments, whitespace, and option line took up about 800ms (1/3 of the overall time), and that was for 6 lines of text totaling less than 150 bytes.
Am I just missing something here? Is there a small trick that would cause Cython code to run two orders of magnitude slower in Python 2 than in Python 3?
(Note: I haven't shown the full code here because I'm not sure if I'm allowed to for legal reasons and because it's about 450 lines total)
The problem is with strtod, which is not optimized in VS2008. Apparently it internally calculates the length of the input string each time it's called, and if you call it with a long string this will slow down your code considerably. To circumvent this you have to write a wrapper around strtod so that it only ever sees a small buffer at a time, or write your own strtod function.
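You can sanity-check that hypothesis from plain Python with ctypes (a rough sketch; the gap only shows up on an affected CRT such as VS2008's, and the libc lookup may need adjusting per platform):

import time
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c') or 'msvcrt')
libc.strtod.restype = ctypes.c_double
libc.strtod.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char_p)]

big = b'1.5 ' * 1000000            # one long buffer full of floats
end = ctypes.c_char_p()

start = time.time()
libc.strtod(big, ctypes.byref(end))          # strtod sees the whole buffer
print('full buffer:', time.time() - start)

start = time.time()
libc.strtod(big[:32], ctypes.byref(end))     # workaround: hand it a small copy
print('small slice:', time.time() - start)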
A while ago, I made a Python script which looked similar to this:
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        w.write(line)
Which, of course, worked pretty slowly on a 100MB file.
However, I changed the program to do this
ls = []
with open("somefile.txt", "r") as f, open("otherfile.txt", "a") as w:
    for line in f:
        ls.append(line)
        if len(ls) == 100000:
            w.writelines(ls)
            del ls[:]
And the file copied much faster. My question is: why does the second method work faster, even though the program copies the same number of lines (albeit collecting them and writing them out in batches)?
I may have found a reason why write is slower than writelines. Looking through the CPython source (3.4.3), I found the code for the write function (irrelevant parts removed):
Modules/_io/fileio.c
static PyObject *
fileio_write(fileio *self, PyObject *args)
{
    Py_buffer pbuf;
    Py_ssize_t n, len;
    int err;
    ...
    n = write(self->fd, pbuf.buf, len);
    ...
    PyBuffer_Release(&pbuf);

    if (n < 0) {
        if (err == EAGAIN)
            Py_RETURN_NONE;
        errno = err;
        PyErr_SetFromErrno(PyExc_IOError);
        return NULL;
    }

    return PyLong_FromSsize_t(n);
}
If you notice, this function actually returns a value: the size of the string that has been written. Producing that value takes another function call (PyLong_FromSsize_t).
I tested this out to see if it actually had a return value, and it did.
with open('test.txt', 'w+') as f:
    x = f.write("hello")
    print(x)
>>> 5
The following is the code for the writelines function implementation in CPython (irrelevant parts removed):
Modules/_io/iobase.c
static PyObject *
iobase_writelines(PyObject *self, PyObject *args)
{
    PyObject *lines, *iter, *res;
    ...
    while (1) {
        PyObject *line = PyIter_Next(iter);
        ...
        res = NULL;
        do {
            res = PyObject_CallMethodObjArgs(self, _PyIO_str_write, line, NULL);
        } while (res == NULL && _PyIO_trap_eintr());
        Py_DECREF(line);
        if (res == NULL) {
            Py_DECREF(iter);
            return NULL;
        }
        Py_DECREF(res);
    }
    Py_DECREF(iter);
    Py_RETURN_NONE;
}
If you notice, there is no return value! It simply has Py_RETURN_NONE instead of another function call to calculate the size of the written value.
So, I went ahead and tested that there really wasn't a return value.
with open('test.txt', 'w+') as f:
    x = f.writelines(["hello", "hello"])
    print(x)
>>> None
The extra time that write takes seems to be due to the extra function call made in the implementation to produce the return value. By using writelines, you skip that step, and the file I/O is the only bottleneck.
Edit: write documentation
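If you want to check this on your own machine, a minimal timing harness along these lines works (file names here are arbitrary; absolute numbers will vary with OS, disk, and Python version):

import timeit

setup = "lines = ['x' * 80 + '\\n'] * 100000"

t_write = timeit.timeit(
    "with open('bench_write.txt', 'w') as f:\n"
    "    for line in lines:\n"
    "        f.write(line)",
    setup=setup, number=10)

t_writelines = timeit.timeit(
    "with open('bench_writelines.txt', 'w') as f:\n"
    "    f.writelines(lines)",
    setup=setup, number=10)

print('write():      %.3fs' % t_write)
print('writelines(): %.3fs' % t_writelines)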
I do not agree with the other answer here.
It is simply a coincidence. It highly depends on your environment:
What OS?
What HDD/CPU?
What HDD file system format?
How busy is your CPU/HDD?
What Python version?
Both pieces of code do exactly the same thing, with only tiny differences in performance.
For me personally, .writelines() takes longer to execute than your first example using .write(). Tested with a 110MB text file.
I am deliberately not posting my machine specs.
Test .write(): ------copying took 0.934000015259 seconds (dashes for readability)
Test .writelines(): copying took 0.936999797821 seconds
Also tested with small files and files as large as 1.5GB, with the same results (.writelines() always being slightly slower, up to a 0.5s difference for the 1.5GB file).
That's because in the first version you call the write method for every line in each iteration, which makes the program take much longer to run. In the second version, although you use more memory, it performs better because you call writelines() only once every 100000 lines.
Let's look at the source; this is one implementation of the writelines function:
def writelines(self, list_of_data):
    """Write a list (or any iterable) of data bytes to the transport.

    The default implementation concatenates the arguments and
    calls write() on the result.
    """
    if not _PY34:
        # In Python 3.3, bytes.join() doesn't handle memoryview.
        list_of_data = (
            bytes(data) if isinstance(data, memoryview) else data
            for data in list_of_data)
    self.write(b''.join(list_of_data))
As you can see, it joins all the list items and calls the write function once.
Joining the data here takes time, but it's less than the time needed to call the write function for each line. But since you use Python 3.4, writelines writes the lines one at a time rather than joining them, so it would be much faster than write in this case:
cStringIO.writelines() now accepts any iterable argument and writes
the lines one at a time rather than joining them and writing once.
Made a parallel change to StringIO.writelines(). Saves memory and
makes suitable for use with generator expressions.
I think I may have implemented this incorrectly because the results do not make sense. I have a Go program that counts to 1000000000:
package main

import (
    "fmt"
)

func main() {
    for i := 0; i < 1000000000; i++ {
    }
    fmt.Println("Done")
}
It finishes in less than a second. On the other hand I have a Python script:
x = 0
while x < 1000000000:
    x += 1
print 'Done'
It finishes in a few minutes.
Why is the Go version so much faster? Are they both counting up to 1000000000 or am I missing something?
One billion is not a very big number. Any reasonably modern machine should be able to do this in a few seconds at most, if it's able to do the work with native types. I verified this by writing an equivalent C program, reading the assembly to make sure that it actually was doing addition, and timing it (it completes in about 1.8 seconds on my machine).
Python, however, doesn't have a concept of natively typed variables (or meaningful type annotations at all), so it has to do hundreds of times as much work in this case. In short, the answer to your headline question is "yes". Go really can be that much faster than Python, even without any kind of compiler trickery like optimizing away a side-effect-free loop.
PyPy actually does an impressive job of speeding up this loop:
import time

def main():
    x = 0
    while x < 1000000000:
        x += 1

if __name__ == "__main__":
    s = time.time()
    main()
    print time.time() - s
$ python count.py
44.221405983
$ pypy count.py
1.03511095047
~97% speedup!
Clarification for the 3 people who didn't "get it": the Python language itself isn't slow. The CPython implementation is a relatively straightforward way of running the code. PyPy is another implementation of the language that does many tricky things (especially the JIT) that can make enormous differences. Directly answering the question in the title: Go isn't "that much" faster than Python; Go is that much faster than CPython.
Having said that, the code samples aren't really doing the same thing. Python needs to instantiate 1000000000 of its int objects. Go is just incrementing one memory location.
This scenario highly favors decent natively compiled, statically typed languages, which are capable of emitting a very trivial loop of, say, 4-6 CPU opcodes with a simple termination check. This loop has effectively zero branch-prediction misses and can be loosely thought of as performing an increment every CPU cycle (this isn't entirely true, but...).
Python implementations have to do significantly more work, primarily due to dynamic typing. Python must make several different calls (internal and external) just to add two ints together. In Python it must call __add__ (it is effectively i = i.__add__(1), but this syntax will only work in Python 3.x), which in turn has to check the type of the value passed (to make sure it is an int), then add the integer values (extracting them from both objects), wrap the new integer value up in a new object, and finally re-assign the new object to the local variable. That's significantly more work than a single opcode to increment, and it doesn't even address the loop itself; by comparison, the Go/native version is likely only incrementing a register by side effect.
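You can see that per-iteration overhead directly with CPython's dis module (a quick illustration of the point above, not one of the original benchmarks):

import dis

def count(n):
    x = 0
    while x < n:
        x += 1
    return x

# Each pass through the loop executes several bytecodes (loads, a compare,
# an in-place add, a store, and a jump), and the add alone goes through the
# full dynamic dispatch described above.
dis.dis(count)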
Java will fare much better in a trivial benchmark like this and will likely be fairly close to Go; the JIT and the static typing of the counter variable ensure this (the JVM has a special integer-add instruction). Once again, Python has no such advantage. Now, there are some implementations, like PyPy/RPython, which run a static-typing phase and should fare much better than CPython here.
You've got two things at work here. The first is that Go is compiled to machine code and runs directly on the CPU, while Python is compiled to bytecode run against a (particularly slow) VM.
The second, and more significant, thing impacting performance is that the semantics of the two programs are actually significantly different. The Go version makes a "box" called i that holds a number and increments it by 1 on each pass through the loop. The Python version actually has to create a new "box" (int object) on each cycle (and, eventually, has to throw them away). We can demonstrate this by modifying your programs slightly:
package main

import (
    "fmt"
)

func main() {
    for i := 0; i < 10; i++ {
        fmt.Printf("%d %p\n", i, &i)
    }
}
...and:
x = 0
while x < 10:
    x += 1
    print x, id(x)
This is because Go, due to its C roots, takes a variable name to refer to a place, whereas Python takes variable names to refer to things. Since an integer is considered a unique, immutable entity in Python, we must constantly make new ones. Python should be slower than Go, but you've picked a worst-case scenario: in the Benchmarks Game, we see Go being, on average, about 25x faster (100x in the worst case).
You've probably read that, if your Python programs are too slow, you can speed them up by moving things into C. Fortunately, in this case, somebody's already done this for you. If you rewrite your empty loop to use xrange() like so:
for x in xrange(1000000000):
    pass
print "Done."
...you'll see it run about twice as fast. If you find loop counters to actually be a major bottleneck in your program, it might be time to investigate a new way of solving the problem.
@troq
I'm a little late to the party, but I'd say the answer is yes and no. As @gnibbler pointed out, CPython is slower in the simple implementation, but PyPy JIT-compiles the code for much faster execution when you need it.
If you're doing numeric processing with CPython, most people do it with NumPy, resulting in fast operations on arrays and matrices. Recently I've been doing a lot with Numba, which allows you to add a simple wrapper to your code. For this one I just added @njit to a function incALot() which runs your code above.
On my machine CPython takes 61 seconds, but with the Numba wrapper it takes 7.2 microseconds, which is similar to C and maybe faster than Go. That's an 8-million-times speedup.
So, in Python, if things with numbers seem a bit slow, there are tools to address it - and you still get Python's programmer productivity and the REPL.
import time
from numba import njit

def incALot(y):
    x = 0
    while x < y:
        x += 1

@njit('i8(i8)')
def nbIncALot(y):
    x = 0
    while x < y:
        x += 1
    return x

size = 1000000000

start = time.time()
incALot(size)
t1 = time.time() - start

start = time.time()
x = nbIncALot(size)
t2 = time.time() - start

print('CPython3 takes %.3fs, Numba takes %.9fs' % (t1, t2))
print('Speedup is: %.1f' % (t1 / t2))
print('Just Checking:', x)
CPython3 takes 58.958s, Numba takes 0.000007153s
Speedup is: 8242982.2
Just Checking: 1000000000
The problem is that Python is interpreted and Go isn't, so there's no real way to benchmark the two on equal footing. Interpreted languages usually have a VM component, and that's where the problem lies: any test you run is executing within the interpreter, not at native speed. Go is slightly slower than C, mostly because it uses garbage collection instead of manual memory management. That said, Go compared to Python is fast because it's a compiled language; the only thing lacking in Go is bug testing. I stand corrected if I'm wrong.
It is possible that the compiler realized that you didn't use the "i" variable after the loop, so it optimized the final code by removing the loop.
Even if you used it afterwards, the compiler is probably smart enough to substitute the loop with
i = 1000000000;
Hope this helps =)
I'm not familiar with Go, but I'd guess that the Go version ignores the loop since its body does nothing. On the other hand, in the Python version you are incrementing x in the body of the loop, so it's probably actually executing the loop.