The background: I'm building a trie to represent a dictionary, using a minimal construction algorithm. The input list is 4.3M utf-8 strings, sorted lexicographically. The resulting graph is acyclic and has a maximum depth of 638 nodes. The first line of my script sets the recursion limit to 1100 via sys.setrecursionlimit().
The problem: I'd like to be able to serialize my trie to disk, so I can load it into memory without having to rebuild from scratch (roughly 22 minutes). I have tried both pickle.dump() and cPickle.dump(), with both the text and binary protocols. Each time, I get a stack-trace that looks like the following:
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 663, in _batch_setitems
save(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
save(stuff)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 648, in save_dict
self.memoize(obj)
RuntimeError: maximum recursion depth exceeded
My data structures are relatively simple: trie contains a reference to a start state, and defines some methods. dfa_state contains a boolean field, a string field, and a dictionary mapping from label to state.
I'm not very familiar with the inner workings of pickle - does my max recursion depth need to be greater/equal n times the depth of the trie for some n? Or could this be caused by something else I'm unaware of?
Update: Setting the recursion depth to 3000 didn't help, so this avenue doesn't look promising.
Update 2: You guys were right; I was being short-sighted in assuming that pickle would use a small nesting depth due to default recursion limitations. 10,000 did the trick.
From the docs:
Trying to pickle a highly recursive data structure may exceed the maximum recursion depth, a RuntimeError will be raised in this case. You can carefully raise this limit with sys.setrecursionlimit().
Although your trie implementation may be simple, it uses recursion and can lead to issues when converting to a persistent data structure.
My recommendation would be continue raising the recursion limit to see if there is an upper bound for the data you are working with and the trie implementation you are using.
Other than that, you can try changing your tree implementation to be "less recursive", if possible, or write an additional implementation that has data persistence built in (use pickles and shelves in your implementation). Hope that helps.
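One possible reading of "less recursive", as a rough sketch rather than the asker's actual code: flatten the graph into a list of per-node records with an explicit stack, so that pickle only ever sees a shallow structure. `DfaState`, `flatten` and `unflatten` are hypothetical names, the string field from the question is left out for brevity, and the code is written for Python 3.

```python
import pickle


class DfaState(object):
    """Hypothetical stand-in for the question's dfa_state class."""

    def __init__(self):
        self.final = False
        self.edges = {}  # label -> DfaState


def flatten(root):
    """Return a list of (final, {label: child_index}) records, built without recursion."""
    index = {id(root): 0}
    order = [root]   # keeps every node alive, so id() values stay unique
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.edges.values():
            if id(child) not in index:  # shared suffix nodes are stored once
                index[id(child)] = len(order)
                order.append(child)
                stack.append(child)
    return [(n.final, {label: index[id(c)] for label, c in n.edges.items()})
            for n in order]


def unflatten(records):
    """Rebuild the node graph from the records produced by flatten()."""
    nodes = [DfaState() for _ in records]
    for node, (final, edges) in zip(nodes, records):
        node.final = final
        node.edges = {label: nodes[i] for label, i in edges.items()}
    return nodes[0]
```

Usage would be along the lines of `pickle.dump(flatten(root), fh, pickle.HIGHEST_PROTOCOL)` and later `root = unflatten(pickle.load(fh))`. The nesting depth pickle has to walk is then independent of the depth of the trie.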
Pickle does need to recursively walk your trie. If pickle uses just 5 levels of function calls per level of the trie, your trie of depth 638 will need the limit set to more than 3000 (5 * 638 = 3190).
Try a much bigger number. The recursion limit is really just there to protect users from having to wait too long if the recursion falls into an infinite hole.
Pickle handles cycles fine, so it wouldn't matter even if your trie had a cycle in it.
Stack size must also be increased with resource.setrlimit to prevent segfault
If you use just sys.setrecursionlimit, you can still segfault if you reach the maximum stack size allowed by the Linux kernel.
This value can be increased with resource.setrlimit as mentioned at: Setting stacksize in a python script
import pickle
import resource
import sys
print resource.getrlimit(resource.RLIMIT_STACK)
print sys.getrecursionlimit()
max_rec = 0x100000
# May segfault without this line. 0x100 is a guess at the size of each stack frame.
resource.setrlimit(resource.RLIMIT_STACK, [0x100 * max_rec, resource.RLIM_INFINITY])
sys.setrecursionlimit(max_rec)
a = []
# 0x10 is to account for subfunctions called inside `pickle`.
for i in xrange(max_rec / 0x10):
    a = [a]
print pickle.dumps(a, -1)
See also: What is the maximum recursion depth in Python, and how to increase it?
The default maximum stack size for me is 8 MB.
Tested on Ubuntu 16.10, Python 2.7.12.
Double-check that your structure is indeed acyclic.
You could try bumping up the limit even further. There's a hard maximum that's platform dependent, but trying 50000 would be reasonable.
Also try pickling a trivially small version of your trie. If pickle dies even though it's only storing a couple three-letter words, then you know there's some fundamental problem with your trie and not pickle. But if it only happens when you try storing 10k words, then it might be the fault of a platform limitation in pickle.
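A minimal sketch of that sanity check, assuming a node class shaped like the one described in the question (a boolean, a string, and a dict of children). `DfaState`, `insert` and `contains` are hypothetical names, not the asker's code, and the syntax is Python 3.

```python
import pickle


class DfaState(object):
    """Hypothetical stand-in for the question's dfa_state class."""

    def __init__(self, label=""):
        self.final = False
        self.label = label
        self.edges = {}  # label -> DfaState


def insert(root, word):
    """Add a word to the (non-minimized) trie rooted at root."""
    node = root
    for ch in word:
        if ch not in node.edges:
            node.edges[ch] = DfaState(ch)
        node = node.edges[ch]
    node.final = True


def contains(root, word):
    """Return True if word was inserted into the trie."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


root = DfaState()
for word in ("cat", "car", "dog"):
    insert(root, word)

# Round-trip through pickle; a failure here would point at the trie, not its size.
restored = pickle.loads(pickle.dumps(root, pickle.HIGHEST_PROTOCOL))
```

If the round trip works for a handful of short words, the recursion error is more likely a matter of depth than of the data structure itself.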
My needs were somewhat immediate so I solved this problem by saving my dictionary in .txt format. The only thing is that when you load your file again you have to transform it back into a dictionary.
import ast
# Saving the dictionary
with open('filename.txt', 'w') as file_handle:
    file_handle.write(str(dictionary))
# Importing the .txt file
with open('filename.txt', 'r') as file_handle:
    text = file_handle.read()
# From .txt file to dictionary (literal_eval only accepts Python literals, unlike eval)
dictionary = ast.literal_eval(text)
If this does not work you may try exporting the dictionary using json format.
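A minimal sketch of the JSON route, assuming the dictionary holds only JSON-compatible values (strings, numbers, booleans, lists, nested dicts). The sample data and the temporary path are made up for illustration; the syntax is Python 3.

```python
import json
import os
import tempfile

# Hypothetical sample data; JSON object keys must be strings.
dictionary = {"cat": 1, "dog": [2, 3], "nested": {"a": True}}

path = os.path.join(tempfile.mkdtemp(), "filename.json")

# Saving the dictionary
with open(path, "w") as file_handle:
    json.dump(dictionary, file_handle)

# Loading it back
with open(path) as file_handle:
    restored = json.load(file_handle)
```

Note that non-string keys are converted to strings on the way out, so a dictionary keyed by integers or tuples will not round-trip unchanged.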
For me, removing all uses of importlib.reload solved the issue.
I did not even need to increase the limit with setrecursionlimit.
If you want to know how I found it, continue reading.
Before I found the solution I found that I can actually save the model if I moved it to the CPU first, but then got an error during evaluation (XXX is the class name and it matters not):
PicklingError: Can't pickle <class 'XXX'>: it's not the same object as XXX
Then I found this answer:
https://stackoverflow.com/a/1964942/4295037
But after removing all uses of importlib.reload I was able to save the model without moving it to CPU device first.
Ahoi. I was tasked to improve the performance of Bit.ly's Data_Hacks' sample.py, as a practice exercise.
I have cythonized part of the code and included a PCG random generator, which has thus far improved performance by about 20 seconds (down from 72s), as well as optimizing print output (by using a basic C function instead of Python's write()).
This has all worked well, but aside from these fix-ups, I'd like to optimize the loop itself.
The basic function, as seen in bit.ly's sample.py:
def run(sample_rate):
    input_stream = sys.stdin
    for line in input_stream:
        if random.randint(1,100) <= sample_rate:
            sys.stdout.write(line)
My implementation:
cdef int take_sample(float sample_rate):
    cdef unsigned int floor = 1
    cdef unsigned int top = 100
    if pcg32_random() % 100 <= sample_rate:
        return 1
    else:
        return 0

def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            if take_sample(sample_rate):
                out(line)
What I would like to improve on now, is specifically skipping the next line (and preferably do so repeatedly) if my take_sample() doesn't return True.
My current implementation is this:
def run(float sample_rate, file):
    cdef char* line
    with open(file, 'rb') as f:
        for line in f:
            out(line)
            while not take_sample(sample_rate):
                next(f)
This appears to do nothing to improve performance, leading me to suspect I've merely replaced a continue after an if condition at the top of the loop with my next(f).
So the question is this:
Is there a more efficient way to loop over a file (in Cython)?
I'd like to omit lines entirely, meaning they should only be truly accessed if I call my out() - is this already the case in python's for loop?
Is line a pointer (or comparable to such) to the line of the file? Or does the loop actually load this?
I realize that I could improve on it by writing it in C entirely, but I'd like to know how far I can push this staying with python/cython.
Update:
I've tested a C variant of my code - using the same test case - and it clocks in at under 2s (surprising no one). So, while it is true that the random generator and file I/O are two major bottlenecks generally speaking, it should be pointed out that python's file handling is in itself already darn slow.
So, is there a way to make use of C's file reading, other than implementing the loop itself into cython? The overhead is still slowing the python code down significantly, which makes me wonder if I'm simply at the sonic wall of performance, when it comes to file handling using Cython?
If the file is small, you may read it whole with .readlines() at once (possibly reducing IO traffic) and iterate the sequence of lines.
If the sample rate is small enough, you may consider sampling from geometric distribution which may be more efficient.
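A rough sketch of the geometric idea in plain Python (not Cython), under the assumption that the sample rate is expressed as a probability `p` between 0 and 1. Instead of one random draw per line, it draws how many lines to skip before the next hit. `lines_to_skip` and `sample_lines` are hypothetical names.

```python
import math
import random


def lines_to_skip(p, rng):
    """Draw the number of lines to skip before the next sampled line.

    This is a geometric draw: P(skip >= k) == (1 - p) ** k.
    """
    if p >= 1.0:
        return 0
    u = 1.0 - rng.random()  # uniform in (0.0, 1.0], so log(u) is defined
    return int(math.log(u) / math.log1p(-p))


def sample_lines(lines, p, rng=None):
    """Yield each item of lines with probability p, using one draw per hit."""
    rng = rng or random.Random()
    if p <= 0.0:
        return
    skip = lines_to_skip(p, rng)
    for line in lines:
        if skip > 0:
            skip -= 1
            continue
        yield line
        skip = lines_to_skip(p, rng)
```

With a file object this would be used as `for line in sample_lines(f, 0.05): out(line)`. The loop still reads every line from the file, so this only saves random-number calls, not I/O.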
I do not know cython, but I would consider also:
simplifying take_sample() by removal of unnecessary variables and returning boolean result of the test instead of integer,
change signature of take_sample() to take_sample(int) to avoid int-to-float conversion every test.
[EDIT]
According to a comment by @hpaulj, it may be better to use .read().split('\n') instead of the .readlines() I suggested.
Is this a bug? It demonstrates what happens when you use libtiff to extract an image from an open tiff file handle. It works in python 2.x and does not work in python 3.2.3
import os
# any file will work here, since it's not actually loading the tiff
# assuming it's big enough for the seek
filename = "/home/kostrom/git/wiredfool-pillow/Tests/images/multipage.tiff"
def test():
    fp1 = open(filename, "rb")
    buf1 = fp1.read(8)
    fp1.seek(28)
    fp1.read(2)
    for x in range(16):
        fp1.read(12)
    fp1.read(4)
    fd = os.dup(fp1.fileno())
    os.lseek(fd, 28, os.SEEK_SET)
    os.close(fd)
    # this magically fixes it: fp1.tell()
    fp1.seek(284)
    expect_284 = fp1.tell()
    print ("expected 284, actual %d" % expect_284)

test()
The output which I feel is in error is:
expected 284, actual -504
Uncommenting the fp1.tell() does some ... side effect ... which stabilizes the py3 handle, and I don't know why. I'd also appreciate if someone can test other versions of python3.
No, this is not a bug. The Python 3 io library, which provides you with the file object from an open() call, gives you a buffered file object. For binary files, you are given a (subclass of) io.BufferedIOBase.
The Python 2 file object is far more primitive, although you can use the io library there too.
By seeking at the OS level you are bypassing the buffer and are mucking up the internal state. Generally speaking, as the doctor said to the patient complaining that pinching his skin hurts: don't do that.
If you have a pressing need to do this anyway, at the very least use the underlying raw file object (a subclass of the io.RawIOBase class) via the io.BufferedIOBase.raw attribute:
fp1 = open(filename, "rb").raw
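A small sketch of the same idea using `buffering=0`, which in binary mode hands back the raw `io.FileIO` object directly. The scratch file and its size are assumptions made so the snippet is self-contained; it is not the TIFF from the question.

```python
import os
import tempfile

# Scratch file standing in for the TIFF (hypothetical content).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 1024)
    path = tmp.name

fp1 = open(path, "rb", buffering=0)  # raw io.FileIO, no Python-level buffer
fp1.read(8)

fd = os.dup(fp1.fileno())
os.lseek(fd, 28, os.SEEK_SET)        # moves the offset shared with fp1
os.close(fd)

pos_after_lseek = fp1.tell()         # the raw object asks the OS for the position
fp1.seek(284)
pos_after_seek = fp1.tell()

fp1.close()
os.remove(path)
```

Because the raw object keeps no cached position of its own, the OS-level seek is visible to `tell()` instead of silently desynchronizing it. Unbuffered reads cost a system call each, so this is a trade-off rather than a free fix.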
os.dup creates a duplicate file descriptor that refers to the same open file description. Therefore, os.lseek(fd, 28, SEEK_SET) changes the seek position of the file underlying fp1.
Python's file objects cache the file position to avoid repeated system calls. The side effect of this is that changing the file position without using the file object methods will desynchronize the cached position and the real position, leading to nonsense like you've observed.
Worse yet, because the files are internally buffered by Python, seeking outside the file methods could actually cause the returned file data to be incorrect, leading to corruption or other nasty stuff.
The documentation in bufferedio.c notes that tell can be used to reinitialize the cached value:
* The absolute position of the raw stream is cached, if possible, in the
`abs_pos` member. It must be updated every time an operation is done
on the raw stream. If not sure, it can be reinitialized by calling
_buffered_raw_tell(), which queries the raw stream (_buffered_raw_seek()
also does it). To read it, use RAW_TELL().
I'm getting this exception when trying to open shelve-persisted files over a certain size, which is actually pretty small (< 1MB), but I'm not sure what the exact number is. Now, I know pickle is sort of the bastard child of Python and shelve isn't thought of as a particularly robust solution, but it happens to solve my problem wonderfully (in theory) and I haven't been able to find a reason for this exception.
Traceback (most recent call last):
File "test_shelve.py", line 27, in <module>
print len(f.keys())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 101, in keys
return self.dict.keys()
SystemError: Negative size passed to PyString_FromStringAndSize
I can reproduce it consistently, but I haven't found much on google. Here's a script that will reproduce.
import shelve
import random
import string
import pprint
f = shelve.open('test')
# f = {}
def rand_list(list_size=20, str_size=40):
    return [''.join([random.choice(string.ascii_uppercase + string.digits) for j in range(str_size)]) for i in range(list_size)]

def recursive_dict(depth=3):
    if depth==0:
        return rand_list()
    else:
        d = {}
        for k in rand_list():
            d[k] = recursive_dict(depth-1)
        return d

for k,v in recursive_dict(2).iteritems():
    f[k] = v
f.close()
f = shelve.open('test')
print len(f.keys())
Regarding the error itself:
The idea circulating on the web is that the data size exceeded the largest integer possible on that machine (the largest signed 32-bit integer is 2 147 483 647) and was therefore interpreted by Python as a negative size.
Your code runs under 2.7.3, so this may be a bug that has since been fixed.
The code "works" if I change the depth from 2 to 1, or if I run under python 3 (after fixing the print statements and using items() instead of iteritems()). However, the list of keys is clearly not the set of keys found while iterating over the return value of recursive_dict().
The following restriction from the shelve documentation may apply (emphases mine):
The choice of which database package will be used (such as dbm, gdbm or bsddb) depends on which interface is available. Therefore it is not safe to open the database directly using dbm. The database is also (unfortunately) subject to the limitations of dbm, if it is used — this means that (the pickled representation of) the objects stored in the database should be fairly small, and in rare cases key collisions may cause the database to refuse updates.
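Since the limitations depend on which dbm backend shelve picked, it may help to check which one is in use. A minimal sketch for Python 3, where `dbm.whichdb()` reports the backend for a given database name (in Python 2 the equivalent lives in the separate `whichdb` module). The temporary path is an assumption for illustration.

```python
import dbm
import os
import shelve
import tempfile

# Create a small shelf in a scratch directory (hypothetical location).
path = os.path.join(tempfile.mkdtemp(), "test")
shelf = shelve.open(path)
shelf["key"] = ["value"]
shelf.close()

# Name of the module that can open the database, e.g. "dbm.gnu",
# "dbm.ndbm" or "dbm.dumb"; the result is platform dependent.
backend = dbm.whichdb(path)
print(backend)

# Reopen to confirm the stored keys survived the round trip.
shelf = shelve.open(path)
stored_keys = sorted(shelf.keys())
shelf.close()
```

If the reported backend is one of the older dbm variants, its per-item size limits are a plausible suspect for the error above, though this sketch does not prove that.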
I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.
It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:
with open(filename) as data:
    accounts = dict()
    for line in data:
        username = line.split()[1]
        IP = line.split()[0]
        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()
print "The accounts have been deleted from memory"
time.sleep(5)
print "End of script"
The last lines are there so that I could monitor memory usage.
The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in terminal.
What I don't understand is why does Python need so much memory after I've cleared the dictionary. Is the file still stored in memory ? If so, how can I get rid of it ? Is it a problem with my OS not seeing freed memory ?
EDIT : I've tried to force a gc.collect() after clearing the dictionary, to no avail.
EDIT2 : I'm running Python 2.7.3 on Ubuntu 12.04.LTS
EDIT3 : I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that later on, Python does not seem to reuse that memory (it just asks for more memory to the OS).
this really does make no sense to me either, and I wanted to figure out how/why this happens. ( i thought that's how this should work too! ) i replicated it on my machine - though with a smaller file.
i saw two discrete problems here
why is Python reading the file into memory ( with lazy line reading, it shouldn't - right ? )
why isn't Python freeing up memory to the system
I'm not knowledgeable at all about Python internals, so I just did a lot of web searching. All of this could be completely off the mark. ( I barely develop anymore , have been on the biz side of tech for the past few years )
Lazy line reading...
I looked around and found this post -
http://www.peterbe.com/plog/blogitem-040312-1
it's from a much earlier version of python, but this line resonated with me:
readlines() reads in the whole file at once and splits it by line.
then i saw this , also old, effbot post:
http://effbot.org/zone/readline-performance.htm
the key takeaway was this:
For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.
and this:
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better
looking at pythons docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:
This method returns the same thing as iter(f)
Deprecated since version 2.3: Use for line in file instead.
it made me think that perhaps some slurping is going on.
so if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...
Read until EOF using readline() and return a list containing the lines thus read.
and it sort of seems like that's what's happening here.
readline , however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]
Read one entire line from the file
so i tried switching this to readline, and the process never grew above 40MB ( it was growing to 200MB, the size of the log file , before )
accounts = dict()
data= open(filename)
for line in data.readline():
    info = line.split("LOG:")
    if len(info) == 2 :
        ( a , b ) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
my guess is that we're not really lazy-reading the file with the for x in data construct -- although all the docs and stackoverflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines consumed approximately the same amount of memory as for line in data
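One thing worth checking in the snippet above: `data.readline()` returns a single string, so `for line in data.readline()` loops over the characters of the first line rather than over the lines of the file. That alone may account for the low memory figure. A small sketch of the difference, using an in-memory stream with made-up contents; `iter(stream.readline, "")` is the standard two-argument form of `iter` for calling `readline()` until it returns an empty string.

```python
import io

# Made-up log contents for illustration.
text = "alice LOG: 1\nbob LOG: 2\n"

# Looping over readline() walks the characters of the first line only.
stream = io.StringIO(text)
pieces = [piece for piece in stream.readline()]

# Calling readline() repeatedly until it returns "" walks the whole file lazily.
stream = io.StringIO(text)
lines = [line for line in iter(stream.readline, "")]

print(pieces[:5])
print(lines)
```

The first loop never gets past the first newline, so it says little about how much memory line-by-line iteration needs.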
freeing memory
in terms of freeing up memory, I'm not familiar much with Python's internals, but I recall back from when I worked with mod_perl... if I opened up a file that was 500MB, that apache child grew to that size. if I freed up the memory, it would only be free within that child -- garbage collected memory was never returned to the OS until the process exited.
so i poked around on that idea , and found a few links that suggest this might be happening:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
that was sort of old, and I found a bunch of random (accepted) patches afterwards into python that suggested the behavior was changed and that you could now return memory to the os ( as of 2005 when most of those patches were submitted and apparently approved ).
then i found this posting http://objectmix.com/python/17293-python-memory-handling.html -- and note the comment #4
"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.
So with 2.4 under linux (as you tested) you will indeed not always get the used memory back, with respect to lots of small objects being collected.
The difference therefore (I think) you see between doing an f.read() and an f.readlines() is that the former reads in the whole file as one large string object (i.e. not a small object), while the latter returns a list of lines where each line is a python object.
if the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? perhaps it's not a problem of having a single 3GB object, but instead having millions of 30k objects.
Which version of Python are you trying this with?
I did a test on Python 2.7/Win7, and it worked as expected, the memory was released.
Here I generate sample data like yours:
import random
fn = random.randint
with open('ips.txt', 'w') as f:
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0, 9000000),
        ))
And then your script. I replaced dict by defaultdict because throwing exceptions makes the code slower:
import time
from collections import defaultdict
def read_file(filename):
    with open(filename) as data:
        accounts = defaultdict(set)
        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)
    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()
    print "The accounts have been deleted from memory"
    time.sleep(5)
    print "End of script"

if __name__ == '__main__':
    read_file('ips.txt')
As you can see, memory reached 1.4G and was then released, leaving 36MB:
Using your original script I got the same results, but a bit slower:
There is a difference between when Python releases memory for reuse by Python and when it releases memory back to the OS. Python has internal pools for some kinds of objects; it will reuse these itself but doesn't give them back to the OS.
The gc module may be useful, particularly the collect function. I have never used it myself, but from the documentation, it looks like it may be useful. I would try running gc.collect() before you run accounts.clear().
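For context, a small sketch of what `gc.collect()` does in CPython: it looks for objects that are only kept alive by reference cycles. Objects without cycles, such as the strings and sets in a plain dict, are freed by reference counting as soon as the dict is cleared, so a forced collection may make little difference there. The `Node` class is a made-up example, and the syntax is Python 3.

```python
import gc
import weakref


class Node(object):
    """Made-up class used only to build a reference cycle."""


def make_cycle():
    """Create two objects that refer to each other and return a weak reference to one."""
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


ref = make_cycle()      # the cycle is now unreachable, but refcounts are non-zero
found = gc.collect()    # force a full collection; returns the number of unreachable objects
cycle_is_gone = ref() is None
print(found, cycle_is_gone)
```

Whether the freed memory then goes back to the operating system is a separate question, decided by the allocator rather than by the garbage collector.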
I'm developing a small game with pyglet. One centerpiece is, of course, drawing coloured rectangles. I initially did this by creating images in memory and blit()ing them, which worked fine. After noticing how ugly, roundabout and inefficient (yes, I profiled - ColorRect.draw() took significant time and became 10x more efficient through this change) this is, I've started creating vertex lists instead, via pyglet.graphics.Batch (I copied most of the code verbatim from one of the examples). Since then, I experience a weird exception in some low-level OpenGL code that I failed to find a cause for or reproduce reliably.
There is no apparent relation to gameplay events -- as in, nothing exceptional happens just before, or I constantly miss it. As the error occurs somewhere deep in the event loop, I cannot easily track down which position update causes it. Honestly, I'm stumped. Thus I'll braindump what I have found out and hope for some kind psychic.
I've tried it out on Windows 7 32 bit (I may get around to trying it on Ubuntu 11.10 soon) with Python 3.2.2, with pyglet revision 043180b64260 (pulled from Google Code and built from source; the 1.1.4 release is harder to install as it doesn't run 2to3 automatically, though it appears to be equally py3k-ready). I'll probably update to the latest mercurial version next, but it's only a few commits and the changes seem entirely unrelated.
The full traceback (censored some paths out of principle, but note it's in its own virtualenv):
Traceback (most recent call last):
File "<my main file>", line 152, in <module>
main()
File "<my main file>", line 148, in main
run()
File "<my main file>", line 125, in run
pyglet.app.run()
File "<virtualenv>\Lib\site-packages\pyglet\app\__init__.py", line 123, in run
event_loop.run()
File "<virtualenv>\Lib\site-packages\pyglet\app\base.py", line 135, in run
self._run_estimated()
File "<virtualenv>\Lib\site-packages\pyglet\app\base.py", line 164, in _run_estimated
timeout = self.idle()
File "<virtualenv>\Lib\site-packages\pyglet\app\base.py", line 278, in idle
window.switch_to()
File "<virtualenv>\Lib\site-packages\pyglet\window\win32\__init__.py", line 305, in switch_to
self.context.set_current()
File "<virtualenv>\Lib\site-packages\pyglet\gl\win32.py", line 213, in set_current
super(Win32Context, self).set_current()
File "<virtualenv>\Lib\site-packages\pyglet\gl\base.py", line 320, in set_current
buffers = (gl.GLuint * len(buffers))(*buffers)
IndexError: invalid index
Running with post-mortem (actively stepping through code until it happens used to be infeasible as the FPS went from 60 down to 7) pdb shows:
buffers is a list of ints; I have no idea what these represent or where they come from, but they are pulled from a list called self.object_space._doomed_textures (where self is a window object). The associated comment says this block of code releases textures scheduled for deletion. I don't think I explicitly use textures anywhere, but who knows what pyglet does under the hood. I assume these integers are the IDs or something of the textures to be destroyed.
gl.GLuint is an alias for ctypes.c_ulong; thus (gl.GLuint * len(buffers))(*buffers) creates a ulong array of the same length and contents.
I can evaluate the very same expression at the pdb prompt without errors or data corruption.
Independent experiments (outside the virtualenv and without importing pyglet) with ctypes show that IndexError is raised if too many arguments are given to the array constructor. This makes no sense; both experimentation and logic suggest the length and argument count must always match.
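The ctypes behaviour described in the last two points can be reproduced in isolation. A minimal sketch with made-up integer values, not pyglet's actual texture IDs; it only illustrates the array constructor and does not explain the failure inside pyglet.

```python
import ctypes

buffers = [3, 7, 11]  # made-up values standing in for the texture IDs

# Multiplying a ctypes type by an int builds an array *type* with that many slots.
array_type = ctypes.c_ulong * len(buffers)
array = array_type(*buffers)
values = list(array)

# Passing more initializers than slots is what raises IndexError.
try:
    array_type(*(buffers + [13]))
    overflow_error = None
except IndexError as exc:
    overflow_error = exc

print(values, repr(overflow_error))
```

So `gl.GLuint * len(buffers)` is a type-level operation rather than arithmetic on sizes, and the error only appears when the argument count exceeds the length used to build the type.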
Are there other cases where this exception may occur? May this be a bug of pyglet, or am I misusing the library and missed the associated warning?
Would the code which creates and maintains the vertex lists be of any use in debugging this? There's probably something wrong with it. I've already stared at it, but since I have little experience with pyglet.graphics, this was of limited use. Just leave a comment if you'd like to see the ColorRect code.
Any other ideas what might cause this?
It is a bit hard to provide a really relevant answer since no code is provided, but here is what I can see from the error output:
buffers = (gl.GLuint * len(buffers))(*buffers)
So if I understand correctly, you are multiplying the size of a GLuint (4 bytes) by the actual length of your buffers (if initialized). Maybe that's why your index is invalid, because it is too high?
Usually that would be OK, since a buffer is in bytes, but you said that it is a list of ints?
Hope it helps