Is this a Python 3 file bug?

Is this a bug? The code below demonstrates what happens when you use libtiff to extract an image from an open TIFF file handle. It works in Python 2.x and does not work in Python 3.2.3.
import os

# any file will work here, since it's not actually loading the tiff
# assuming it's big enough for the seek
filename = "/home/kostrom/git/wiredfool-pillow/Tests/images/multipage.tiff"

def test():
    fp1 = open(filename, "rb")
    buf1 = fp1.read(8)
    fp1.seek(28)
    fp1.read(2)

    for x in range(16):
        fp1.read(12)
    fp1.read(4)

    fd = os.dup(fp1.fileno())
    os.lseek(fd, 28, os.SEEK_SET)
    os.close(fd)

    # this magically fixes it: fp1.tell()
    fp1.seek(284)
    expect_284 = fp1.tell()
    print("expected 284, actual %d" % expect_284)

test()
The output which I feel is in error is:
expected 284, actual -504
Uncommenting the fp1.tell() has some side effect that stabilizes the Python 3 file handle, and I don't know why. I'd also appreciate it if someone could test other versions of Python 3.

No, this is not a bug. The Python 3 io library, which provides you with the file object from an open() call, gives you a buffered file object. For binary files, you are given a (subclass of) io.BufferedIOBase.
The Python 2 file object is far more primitive, although you can use the io library there too.
By seeking at the OS level you are bypassing the buffer and are mucking up the internal state. Generally speaking, as the doctor said to the patient complaining that pinching his skin hurts: don't do that.
If you have a pressing need to do this anyway, at the very least use the underlying raw file object (a subclass of the io.RawIOBase class) via the io.BufferedIOBase.raw attribute:
fp1 = open(filename, "rb").raw
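A quick way to see the difference, reusing the question's filename (a sketch; passing buffering=0 to open gives you the raw FileIO directly, and FileIO does no Python-side buffering, so positions always come straight from the OS):

import os

fp1 = open(filename, "rb", buffering=0)   # unbuffered: a raw FileIO object
fp1.read(8)

fd = os.dup(fp1.fileno())
os.lseek(fd, 28, os.SEEK_SET)             # moves the shared OS-level offset
os.close(fd)

fp1.seek(284)
print(fp1.tell())                         # 284: there is no cached position to go stale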

os.dup creates a duplicate file descriptor that refers to the same open file description. Therefore, os.lseek(fd, 28, SEEK_SET) changes the seek position of the file underlying fp1.
Python's file objects cache the file position to avoid repeated system calls. The side effect of this is that changing the file position without using the file object methods will desynchronize the cached position and the real position, leading to nonsense like you've observed.
Worse yet, because the files are internally buffered by Python, seeking outside the file methods could actually cause the returned file data to be incorrect, leading to corruption or other nasty stuff.
The documentation in bufferedio.c notes that tell can be used to reinitialize the cached value:
    The absolute position of the raw stream is cached, if possible, in the
    `abs_pos` member. It must be updated every time an operation is done
    on the raw stream. If not sure, it can be reinitialized by calling
    _buffered_raw_tell(), which queries the raw stream (_buffered_raw_seek()
    also does it). To read it, use RAW_TELL().
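Applied to the question's snippet, that means a single extra call on the buffered object after the out-of-band os.lseek refreshes the cached position (a sketch; it matches the "this magically fixes it" comment in the question):

fd = os.dup(fp1.fileno())
os.lseek(fd, 28, os.SEEK_SET)   # moves the OS-level offset behind the buffer's back
os.close(fd)

fp1.tell()                      # queries the raw stream and refreshes the cached position
fp1.seek(284)
print(fp1.tell())               # now reports 284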


Does deleting a dictionary close the file descriptors inside the dict?

Clarified
There are two questions indeed. Updated to make this clearer.
I have:
t = {
    'fd': open("filename", 'r')
}
I understand that del t['fd'] removes the key and closes the file. Is that correct?
Does del t call del on contained objects (fd in this case)?
The two parts of your question have completely different answers.
Deleting a variable to close a file is not reliable; sometimes it works, sometimes it doesn't, or it works but in a surprising way. Occasionally it may even lose data, and it will definitely fail to report file errors in a useful way.
The correct ways to close a file are (a) using a with statement, or (b) calling the .close() method; see the sketch after the caveats below.
Deleting an object indeed deletes all contained objects, with a couple of caveats:
if (some of) those objects are also in another variable, they will continue to exist until that other variable is also deleted;
if those objects refer to each other, they may continue to exist for some time afterwards and get deleted later; and
for immutable objects (strings, integers), Python may decide to keep them around as an optimisation, but this mostly won't be visible to us and will in any case differ between versions.
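A small sketch of both reliable approaches and of the first caveat (using the question's placeholder "filename"):

# Reliable: the file is closed when the block exits, even if an error is raised.
with open("filename", "r") as fd:
    data = fd.read()

# Explicit close also works; try/finally keeps it reliable when errors occur.
fd = open("filename", "r")
try:
    data = fd.read()
finally:
    fd.close()

# First caveat: deleting the dict does not close the file while another
# reference to the same file object still exists.
f = open("filename", "r")
t = {"fd": f}
del t            # the dict is gone...
print(f.closed)  # ...but the file object is still referenced by f: prints False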
The behavior of what you want to do is nondeterministic. Depending on the Python implementation and its internals (CPython, PyPy, ...), this may or may not work:
Working example:
t = {
    'fd': open('data.txt', 'r')
}

def hook_close_fd():
    print('del dictionary, close the file')

t['fd'].close = hook_close_fd

del t
Output:
del dictionary, close the file
In this case, the close function is called on delete
Non working example:
t = {
    'fd': open('data.txt', 'r')
}

def hook_close_fd():
    print('del dictionary, close the file')

t['fd'].close = hook_close_fd

3 / 0
Output of 1st run:
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-211-2d81419599a9> in <module>
10 t['fd'].close = hook_close_fd
11
---> 12 3 / 0
ZeroDivisionError: division by zero
Output of 2nd run:
del dictionary, close the file
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-212-2d81419599a9> in <module>
10 t['fd'].close = hook_close_fd
11
---> 12 3 / 0
ZeroDivisionError: division by zero
As you can see, when an exception is raised, you can't be sure your file descriptor will be closed properly (especially if you don't catch the exception yourself).
No. You must close the files you open (see note).
Or, although it may not fit your project, open it in a context manager, which is safer:
with open("filename", 'r') as fd:
...
If you are opening an unknown number of files simultaneously, you may need to write your own context manager, or more simply use contextlib.ExitStack.
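A minimal sketch of the ExitStack variant (the file paths are placeholders):

from contextlib import ExitStack

paths = ["a.txt", "b.txt", "c.txt"]

with ExitStack() as stack:
    # Every file entered here is closed when the with block exits, even on error.
    files = [stack.enter_context(open(p, 'r')) for p in paths]
    for fd in files:
        print(fd.readline())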
Note:
From the Python 3.10 documentation, The Python Tutorial, 7.2. Reading and Writing Files (https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files):
Warning: Calling f.write() without using the with keyword or calling f.close() might result in the arguments of f.write() not being completely written to the disk, even if the program exits successfully.
Your question is less trivial than you think. But deep down it all comes down to understanding what a pointer is and how the garbage collector works.
Indeed, by doing del t['fd'] you delete that entry in the dictionary; what you actually remove is the reference that points to that entry. If you have a dictionary as follows:
t = { 'a':3, 'b':4 }
Then when you do del t you delete the reference to the dictionary t. Since there is then no further reference to the dictionary's contents, the garbage collector deletes them as well, freeing up all the dictionary's memory.
However, even though you can delete a file descriptor this way, it is not desirable to do so: the file descriptor (fd, a pointer to a file and much more) is how the programmer interacts with the file, and if it is dropped abruptly the file can be left in an inconsistent, possibly corrupted state.
Therefore, it is a good idea to call the close function before you stop working with a file. This function takes care of all those things for you.
Yes, deleting the variable t closes the file (using psutil to show open files per process):
import psutil

def find_open_files():
    for proc in psutil.process_iter():
        if proc.name() == 'python3.9':
            print(proc.open_files())

filename = '/tmp/test.txt'
t = {'fd': open(filename, 'r')}

find_open_files()
del t
find_open_files()
Out:
[popenfile(path='/private/tmp/test.txt', fd=3)]
[]

multiprocessing manager fails pickling [duplicate]

The background: I'm building a trie to represent a dictionary, using a minimal construction algorithm. The input list is 4.3M utf-8 strings, sorted lexicographically. The resulting graph is acyclic and has a maximum depth of 638 nodes. The first line of my script sets the recursion limit to 1100 via sys.setrecursionlimit().
The problem: I'd like to be able to serialize my trie to disk, so I can load it into memory without having to rebuild from scratch (roughly 22 minutes). I have tried both pickle.dump() and cPickle.dump(), with both the text and binary protocols. Each time, I get a stack-trace that looks like the following:
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 663, in _batch_setitems
save(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 725, in save_inst
save(stuff)
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/pickle.py", line 648, in save_dict
self.memoize(obj)
RuntimeError: maximum recursion depth exceeded
My data structures are relatively simple: trie contains a reference to a start state, and defines some methods. dfa_state contains a boolean field, a string field, and a dictionary mapping from label to state.
I'm not very familiar with the inner workings of pickle: does my max recursion depth need to be greater than or equal to n times the depth of the trie, for some n? Or could this be caused by something else I'm unaware of?
Update: Setting the recursion depth to 3000 didn't help, so this avenue doesn't look promising.
Update 2: You guys were right; I was being short-sighted in assuming that pickle would use a small nesting depth due to default recursion limitations. 10,000 did the trick.
From the docs:
Trying to pickle a highly recursive data structure may exceed the maximum recursion depth, a RuntimeError will be raised in this case. You can carefully raise this limit with sys.setrecursionlimit().
Although your trie implementation may be simple, pickling it uses recursion, which can lead to issues when converting it to a persistent data structure.
My recommendation would be continue raising the recursion limit to see if there is an upper bound for the data you are working with and the trie implementation you are using.
Other than that, you can try changing your trie implementation to be "less recursive", if possible, or write an additional implementation that has data persistence built in (use pickles and shelves in your implementation). Hope that helps.
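A minimal sketch of the first suggestion, using the 10,000 limit the question's second update reports as working (the trie variable and the output filename are placeholders):

import sys
import cPickle as pickle

f = open('trie.pickle', 'wb')
old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(10000)   # enough for the 638-deep trie from the question
try:
    pickle.dump(trie, f, pickle.HIGHEST_PROTOCOL)
finally:
    sys.setrecursionlimit(old_limit)
    f.close()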
Pickle does need to recursively walk your trie. If Pickle uses just 5 levels of function calls to do the work, your trie of depth 638 will need the limit set to more than 3000.
Try a much bigger number; the recursion limit is really just there to protect users from having to wait too long if the recursion falls into an infinite hole.
Pickle handles cycles OK, so it wouldn't matter even if your trie had a cycle in there.
Stack size must also be increased with resource.setrlimit to prevent segfault
If you use just sys.setrecursionlimit, you can still segfault if you reach the maximum stack size allowed by the Linux kernel.
This value can be increased with resource.setrlimit as mentioned at: Setting stacksize in a python script
import pickle
import resource
import sys

print resource.getrlimit(resource.RLIMIT_STACK)
print sys.getrecursionlimit()

max_rec = 0x100000

# May segfault without this line. 0x100 is a guess at the size of each stack frame.
resource.setrlimit(resource.RLIMIT_STACK, [0x100 * max_rec, resource.RLIM_INFINITY])
sys.setrecursionlimit(max_rec)

a = []
# 0x10 is to account for subfunctions called inside `pickle`.
for i in xrange(max_rec / 0x10):
    a = [a]
print pickle.dumps(a, -1)
See also: What is the maximum recursion depth in Python, and how to increase it?
The default maximum value for me is 8Mb.
Tested on Ubuntu 16.10, Python 2.7.12.
Double-check that your structure is indeed acyclic.
You could try bumping up the limit even further. There's a hard maximum that's platform dependent, but trying 50000 would be reasonable.
Also try pickling a trivially small version of your trie. If pickle dies even though it's only storing a couple of three-letter words, then you know there's some fundamental problem with your trie and not pickle. But if it only happens when you try storing 10k words, then it might be the fault of a platform limitation in pickle.
My needs were somewhat immediate so I solved this problem by saving my dictionary in .txt format. The only thing is that when you load your file again you have to transform it back into a dictionary.
import json

# Saving the dictionary
with open('filename.txt', 'w') as file_handle:
    file_handle.write(str(dictionary))

# Importing the .txt file
with open('filename.txt', 'r') as file_handle:
    f = '"' + file_handle.read() + '"'

# From .txt file to dictionary
dictionary = eval(json.loads(f))
If this does not work, you may try exporting the dictionary in JSON format instead.
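A sketch of that JSON variant, assuming every key and value in the dictionary is JSON-serializable (strings, numbers, lists, plain dicts):

import json

# Saving: write the dictionary as JSON text.
with open('filename.json', 'w') as file_handle:
    json.dump(dictionary, file_handle)

# Loading: returns a plain dict, no eval needed.
with open('filename.json', 'r') as file_handle:
    dictionary = json.load(file_handle)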
For me, removing all uses of importlib.reload solved the issue.
I did not even need to increase the limit with setrecursionlimit.
If you want to know how I found it, continue reading.
Before I found the solution, I found that I could actually save the model if I moved it to the CPU first, but then I got an error during evaluation (XXX is the class name; it doesn't matter here):
PicklingError: Can't pickle <class 'XXX'>: it's not the same object as XXX
Then I found this answer:
https://stackoverflow.com/a/1964942/4295037
But after removing all uses of importlib.reload I was able to save the model without moving it to CPU device first.

Why is concurrent.futures holding onto memory when returning np.memmap?

The problem
My application is extracting a list of zip files in memory and writing the data to a temporary file. I then memory map the data in the temp file for use in another function. When I do this in a single process, it works fine, reading the data doesn't affect memory, max RAM is around 40MB. However when I do this using concurrent.futures the RAM goes up to 500MB.
I have looked at this example and I understand I could be submitting the jobs in a nicer way to save memory during processing. But I don't think my issue is related, as I am not running out of memory during processing. The issue I don't understand is why it is holding onto the memory even after the memory maps are returned. Nor do I understand what is in the memory, since doing this in a single process does not load the data in memory.
Can anyone explain what is actually in the memory and why this is different between single and parallel processing?
PS I used memory_profiler for measuring the memory usage
Code
Main code:
def main():
    datadir = './testdata'
    files = os.listdir('./testdata')
    files = [os.path.join(datadir, f) for f in files]

    datalist = download_files(files, multiprocess=False)
    print(len(datalist))
    time.sleep(15)

    del datalist  # See here that memory is freed up
    time.sleep(15)
Other functions:
def download_files(filelist, multiprocess=False):
    datalist = []
    if multiprocess:
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            returned_future = [executor.submit(extract_file, f) for f in filelist]
            for future in returned_future:
                datalist.append(future.result())
    else:
        for f in filelist:
            datalist.append(extract_file(f))
    return datalist


def extract_file(input_zip):
    buffer = next(iter(extract_zip(input_zip).values()))
    with tempfile.NamedTemporaryFile() as temp_logfile:
        temp_logfile.write(buffer)
        del buffer
        data = memmap(temp_logfile, dtype='float32', shape=(2000000, 4), mode='r')
    return data


def extract_zip(input_zip):
    with ZipFile(input_zip, 'r') as input_zip:
        return {name: input_zip.read(name) for name in input_zip.namelist()}
Helper code for data
I can't share my actual data, but here's some simple code to create files that demonstrate the issue:
for i in range(1, 16):
    outdir = './testdata'
    outfile = 'file_{}.dat'.format(i)
    fp = np.memmap(os.path.join(outdir, outfile), dtype='float32', mode='w+', shape=(2000000, 4))
    fp[:] = np.random.rand(*fp.shape)
    del fp
    with ZipFile(outdir + '/' + outfile[:-4] + '.zip', mode='w', compression=ZIP_DEFLATED) as z:
        z.write(outdir + '/' + outfile, outfile)
The problem is that you're trying to pass an np.memmap between processes, and that doesn't work.
The simplest solution is to instead pass the filename, and have the child process memmap the same file.
When you pass an argument to a child process or pool method via multiprocessing, or return a value from one (including doing so indirectly via a ProcessPoolExecutor), it works by calling pickle.dumps on the value, passing the pickle across processes (the details vary, but it doesn't matter whether it's a Pipe or a Queue or something else), and then unpickling the result on the other side.
A memmap is basically just an mmap object with an ndarray allocated in the mmapped memory.
And Python doesn't know how to pickle an mmap object. (If you try, you will either get a PicklingError or a BrokenProcessPool error, depending on your Python version.)
A np.memmap can be pickled, because it's just a subclass of np.ndarray—but pickling and unpickling it actually copies the data and gives you a plain in-memory array. (If you look at data._mmap, it's None.) It would probably be nicer if it gave you an error instead of silently copying all of your data (the pickle-replacement library dill does exactly that: TypeError: can't pickle mmap.mmap objects), but it doesn't.
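A tiny check along the lines of that observation (the 'data.bin' path is just a scratch file for illustration):

import pickle
import numpy as np

fp = np.memmap('data.bin', dtype='float32', mode='w+', shape=(10,))
fp[:] = 1.0
copy = pickle.loads(pickle.dumps(fp))
print(copy[:3])    # the values are all there...
print(copy._mmap)  # ...but _mmap is None: this is an ordinary in-memory copy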
It's not impossible to pass the underlying file descriptor between processes—the details are different on every platform, but all of the major platforms have a way to do that. And you could then use the passed fd to build an mmap on the receiving side, then build a memmap out of that. And you could probably even wrap this up in a subclass of np.memmap. But I suspect if that weren't somewhat difficult, someone would have already done it, and in fact it would probably already be part of dill, if not numpy itself.
Another alternative is to explicitly use the shared memory features of multiprocessing, and allocate the array in shared memory instead of a mmap.
But the simplest solution is, as I said at the top, to just pass the filename instead of the object, and let each side memmap the same file. This does, unfortunately, mean you can't just use a delete-on-close NamedTemporaryFile (although the way you were using it was already non-portable and wouldn't have worked on Windows the same way it does on Unix), but changing that is still probably less work than the other alternatives.
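A minimal sketch of that filename-based approach, adapted from the question's extract_file (the delete=False temp file and the load_memmap helper are assumptions; the parent becomes responsible for removing each temp file once it's done with it):

import tempfile
import numpy as np
from zipfile import ZipFile

def extract_zip(input_zip):
    with ZipFile(input_zip, 'r') as z:
        return {name: z.read(name) for name in z.namelist()}

def extract_file(input_zip):
    # Runs in the child: write the decompressed bytes to a temp file that is
    # NOT deleted on close, and return only its path (a short string pickles cheaply).
    buffer = next(iter(extract_zip(input_zip).values()))
    with tempfile.NamedTemporaryFile(delete=False) as temp_logfile:
        temp_logfile.write(buffer)
    return temp_logfile.name

def load_memmap(path):
    # Runs in the parent: memory-map the file by name; no array data crosses the pipe.
    return np.memmap(path, dtype='float32', shape=(2000000, 4), mode='r')

In download_files, the parent would then call load_memmap(future.result()) instead of appending the result directly.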

How to free memory after opening a file in Python

I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.
It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:
with open(filename) as data:
    accounts = dict()
    for line in data:
        username = line.split()[1]
        IP = line.split()[0]
        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()
print "The accounts have been deleted from memory"
time.sleep(5)
print "End of script"
The last lines are there so that I could monitor memory usage.
The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in terminal.
What I don't understand is why Python needs so much memory after I've cleared the dictionary. Is the file still stored in memory? If so, how can I get rid of it? Is it a problem with my OS not seeing the freed memory?
EDIT : I've tried to force a gc.collect() after clearing the dictionary, to no avail.
EDIT2 : I'm running Python 2.7.3 on Ubuntu 12.04 LTS
EDIT3 : I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that later on, Python does not seem to reuse that memory (it just asks for more memory to the OS).
This really does make no sense to me either, and I wanted to figure out how/why this happens. (I thought that's how this should work too!) I replicated it on my machine, though with a smaller file.
I saw two discrete problems here:
why is Python reading the file into memory (with lazy line reading, it shouldn't, right?)
why isn't Python freeing up memory to the system
I'm not knowledgeable at all about the Python internals, so I just did a lot of web searching. All of this could be completely off the mark. (I barely develop anymore, having been on the biz side of tech for the past few years.)
Lazy line reading...
I looked around and found this post -
http://www.peterbe.com/plog/blogitem-040312-1
It's from a much earlier version of Python, but this line resonated with me:
readlines() reads in the whole file at once and splits it by line.
Then I saw this (also old) effbot post:
http://effbot.org/zone/readline-performance.htm
The key takeaway was this:
For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.
And this:
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better
Looking at Python's docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:
This method returns the same thing as iter(f)
Deprecated since version 2.3: Use for line in file instead.
It made me think that perhaps some slurping is going on.
So if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...
Read until EOF using readline() and return a list containing the lines thus read.
And it sort of seems like that's what's happening here.
readline, however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]:
Read one entire line from the file
So I tried switching this to readline, and the process never grew above 40MB (it was growing to 200MB, the size of the log file, before):
accounts = dict()
data = open(filename)
for line in data.readline():
    info = line.split("LOG:")
    if len(info) == 2:
        (a, b) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
My guess is that we're not really lazy-reading the file with the for x in data construct, although all the docs and Stack Overflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines consumed approximately the same amount of memory as for line in data.
Freeing memory
In terms of freeing up memory, I'm not very familiar with Python's internals, but I recall from when I worked with mod_perl: if I opened a file that was 500MB, that Apache child grew to that size. If I freed up the memory, it would only be free within that child; garbage-collected memory was never returned to the OS until the process exited.
So I poked around on that idea and found a few links that suggest this might be happening:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
That was sort of old, and I found a bunch of (accepted) patches into Python afterwards suggesting the behavior was changed and that you could now return memory to the OS (as of 2005, when most of those patches were submitted and apparently approved).
Then I found this posting http://objectmix.com/python/17293-python-memory-handling.html -- note comment #4:
"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.
So with 2.4 under linux (as you tested) you will indeed not always get
the used memory back, with respect to lots of small objects being
collected.
The difference therefore (I think) you see between doing an f.read() and
an f.readlines() is that the former reads in the whole file as one large
string object (i.e. not a small object), while the latter returns a list
of lines where each line is a python object.
If the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? Perhaps it's not a problem of having a single 3GB object, but instead having millions of 30k objects.
Which version of Python are you trying this with?
I did a test on Python 2.7/Win7, and it worked as expected, the memory was released.
Here I generate sample data like yours:
import random

fn = random.randint
with open('ips.txt', 'w') as f:
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0, 255),
            fn(0, 255),
            fn(0, 255),
            fn(0, 255),
            fn(0, 9000000),
        ))
And then your script. I replaced dict by defaultdict because throwing exceptions makes the code slower:
import time
from collections import defaultdict

def read_file(filename):
    with open(filename) as data:
        accounts = defaultdict(set)
        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)

    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()
    print "The accounts have been deleted from memory"
    time.sleep(5)
    print "End of script"

if __name__ == '__main__':
    read_file('ips.txt')
As you can see, memory reached 1.4 GB and was then released, leaving 36 MB.
Using your original script I got the same results, but a bit slower.
There is a difference between when Python releases memory for reuse by Python and when it releases memory back to the OS. Python has internal pools for some kinds of objects and will reuse them itself, but it doesn't give them back to the OS.
The gc module may be useful, particularly the collect function. I have never used it myself, but from the documentation, it looks like it may be useful. I would try running gc.collect() before you run accounts.clear().
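A minimal sketch of wiring that into the question's script (the question's edit already tried it right after accounts.clear(); gc.collect() returns the number of unreachable objects it found):

import gc

accounts.clear()
unreachable = gc.collect()   # force a full collection pass
print "gc.collect() found %d unreachable objects" % unreachable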

How to recover a broken python "cPickle" dump?

I am using rss2email for converting a number of RSS feeds into mail for easier consumption. That is, I was using it until it broke in a horrible way today: on every run, it only gives me this backtrace:
Traceback (most recent call last):
File "/usr/share/rss2email/rss2email.py", line 740, in <module>
elif action == "list": list()
File "/usr/share/rss2email/rss2email.py", line 681, in list
feeds, feedfileObject = load(lock=0)
File "/usr/share/rss2email/rss2email.py", line 422, in load
feeds = pickle.load(feedfileObject)
TypeError: ("'str' object is not callable", 'sxOYAAuyzSx0WqN3BVPjE+6pgPU', ((2009, 3, 19, 1, 19, 31, 3, 78, 0), {}))
The only helpful fact that I have been able to construct from this backtrace is that the file ~/.rss2email/feeds.dat in which rss2email keeps all its configuration and runtime state is somehow broken. Apparently, rss2email reads its state and dumps it back using cPickle on every run.
I have even found the line containing that 'sxOYAAuyzSx0WqN3BVPjE+6pgPU' string mentioned above in the giant (>12MB) feeds.dat file. To my untrained eye, the dump does not appear to be truncated or otherwise damaged.
What approaches could I try in order to reconstruct the file?
The Python version is 2.5.4 on a Debian/unstable system.
EDIT
Peter Gibson and J.F. Sebastian have suggested directly loading from the
pickle file and I had tried that before. Apparently, a Feed class
that is defined in rss2email.py is needed, so here's my script:
#!/usr/bin/python
import sys
# import pickle
import cPickle as pickle
sys.path.insert(0,"/usr/share/rss2email")
from rss2email import Feed
feedfile = open("feeds.dat", 'rb')
feeds = pickle.load(feedfile)
The "plain" pickle variant produces the following traceback:
Traceback (most recent call last):
File "./r2e-rescue.py", line 8, in <module>
feeds = pickle.load(feedfile)
File "/usr/lib/python2.5/pickle.py", line 1370, in load
return Unpickler(file).load()
File "/usr/lib/python2.5/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.5/pickle.py", line 1133, in load_reduce
value = func(*args)
TypeError: 'str' object is not callable
The cPickle variant produces essentially the same thing as calling
r2e itself:
Traceback (most recent call last):
File "./r2e-rescue.py", line 10, in <module>
feeds = pickle.load(feedfile)
TypeError: ("'str' object is not callable", 'sxOYAAuyzSx0WqN3BVPjE+6pgPU', ((2009, 3, 19, 1, 19, 31, 3, 78, 0), {}))
EDIT 2
Following J.F. Sebastian's suggestion, I put some "printf debugging" into Feed.__setstate__ in my test script; these are the last few lines before Python bails out.
u'http:/com/news.ars/post/20080924-everyone-declares-victory-in-smutfree-wireless-broadband-test.html': u'http:/com/news.ars/post/20080924-everyone-declares-victory-in-smutfree-wireless-broadband-test.html'},
'to': None,
'url': 'http://arstechnica.com/'}
Traceback (most recent call last):
File "./r2e-rescue.py", line 23, in ?
feeds = pickle.load(feedfile)
TypeError: ("'str' object is not callable", 'sxOYAAuyzSx0WqN3BVPjE+6pgPU', ((2009, 3, 19, 1, 19, 31, 3, 78, 0), {}))
The same thing happens on a Debian/etch box using python 2.4.4-2.
How I solved my problem
A Perl port of pickle.py
Following J.F. Sebastian's comment about how simple the pickle
format is, I went out to port parts of pickle.py to Perl. A couple
of quick regular expressions would have been a faster way to access my
data, but I felt that the hack value and an opportunity to learn more
about Python would be worth it. Plus, I still feel much more
comfortable using (and debugging code in) Perl than Python.
Most of the porting effort (simple types, tuples, lists, dictionaries)
went very smoothly. Perl's and Python's different notions of
classes and objects have been the only issue so far where a bit more
than simple translation of idioms was needed. The result is a module
called Pickle::Parse which after a bit of polishing will be
published on CPAN.
A module called Python::Serialise::Pickle existed on CPAN, but I
found its parsing capabilities lacking: It spews debugging output all
over the place and doesn't seem to support classes/objects.
Parsing, transforming data, detecting actual errors in the stream
Based upon Pickle::Parse, I tried to parse the feeds.dat file.
After a few iteration of fixing trivial bugs in my parsing code, I got
an error message that was strikingly similar to pickle.py's original
object not callable error message:
Can't use string ("sxOYAAuyzSx0WqN3BVPjE+6pgPU") as a subroutine
ref while "strict refs" in use at lib/Pickle/Parse.pm line 489,
<STDIN> line 187102.
Ha! Now we're at a point where it's quite likely that the actual data
stream is broken. Plus, we get an idea where it is broken.
It turned out that the first line of the following sequence was wrong:
g7724
((I2009
I3
I19
I1
I19
I31
I3
I78
I0
t(dtRp62457
Position 7724 in the "memo" pointed to that string
"sxOYAAuyzSx0WqN3BVPjE+6pgPU". From similar records earlier in the
stream, it was clear that a time.struct_time object was needed
instead. All later records shared this wrong pointer. With a simple
search/replace operation, it was trivial to fix this.
I find it ironic that I found the source of the error by accident
through Perl's feature that tells the user its position in the input
data stream when it dies.
Conclusion
I will move away from rss2email as soon as I find time to
automatically transform its pickled configuration/state mess to
another tool's format.
pickle.py needs more meaningful error messages that tell the user
about the position in the data stream (not the position in its own
code) where things go wrong.
Porting parts of pickle.py to Perl was fun and, in the end, rewarding.
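For what it's worth, the standard library's pickletools module gives a position-annotated view of a pickle stream from Python itself: its dis() function prints every opcode together with its byte offset (including the memo GET/PUT indices), which makes it easier to inspect the region around a failure. A sketch, not what was used above:

import pickletools

f = open('feeds.dat', 'rb')
pickletools.dis(f)   # prints one "offset: opcode argument" line per opcode
f.close()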
Have you tried manually loading the feeds.dat file using both cPickle and pickle? If the output differs it might hint at the error.
Something like (from your home directory):
import cPickle, pickle

f = open('.rss2email/feeds.dat', 'r')
obj1 = cPickle.load(f)
f.seek(0)  # rewind so the pure-Python pickle also reads from the start
obj2 = pickle.load(f)
(you might need to open in binary mode 'rb' if rss2email doesn't pickle in ascii).
Pete
Edit: The fact that cPickle and pickle give the same error suggests that the feeds.dat file is the problem. Probably a change in the Feed class between versions of rss2email as suggested in the Ubuntu bug J.F. Sebastian links to.
Sounds like the internals of cPickle are getting tangled up. This thread (http://bytes.com/groups/python/565085-cpickle-problems) looks like it might have a clue.
'sxOYAAuyzSx0WqN3BVPjE+6pgPU' is most probably unrelated to the pickle's problem
Post an error traceback for the following, to determine which class defines the attribute that can't be called (the one that leads to the TypeError):
python -c "import pickle; pickle.load(open('feeds.dat'))"
EDIT:
Add the following to your code and run (redirect stderr to file then use 'tail -2' on it to print last 2 lines):
from pprint import pprint

def setstate(self, dict_):
    pprint(dict_, stream=sys.stderr, depth=None)
    self.__dict__.update(dict_)

Feed.__setstate__ = setstate
If the above doesn't yield an interesting output then use general troubleshooting tactics:
Confirm that 'feeds.dat' is the problem:
backup ~/.rss2email directory
install rss2email into virtualenv/pip sandbox (or use zc.buildout) to isolate the environment (make sure you are using feedparser.py from the trunk).
add a couple of feeds, then add feeds until 'feeds.dat' is larger than the current one. Run some tests.
try old 'feeds.dat'
try new 'feeds.dat' on existing rss2email installation
See r2e bails out with TypeError bug on Ubuntu.
