I have:
t = {
    'fd': open("filename", 'r')
}
I understand that del t['fd'] removes the key and closes the file. Is that correct?
Does del t trigger deletion of the contained objects (the file object, in this case)?
The two parts of your question have completely different answers.
Deleting a variable to close a file is not reliable: sometimes it works, sometimes it doesn't, or it works but in a surprising way. Occasionally it may lose data, and it will definitely fail to report file errors in a useful way.
The correct ways to close a file are (a) using a with statement, or (b) using the .close() method.
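A minimal sketch of both ways (using a hypothetical scratch file in place of a real filename):

```python
import os
import tempfile

# hypothetical scratch file standing in for "filename"
handle, path = tempfile.mkstemp()
with os.fdopen(handle, "w") as f:
    f.write("hello")

# (a) the with statement closes the file even if an exception occurs
with open(path, "r") as fd:
    data = fd.read()
assert fd.closed

# (b) explicit .close(), guarded so the file is closed even on error
fd = open(path, "r")
try:
    data = fd.read()
finally:
    fd.close()
```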
Deleting an object indeed deletes all contained objects, with a couple of caveats:
if (some of) those objects are also in another variable, they will continue to exist until that other variable is also deleted;
if those objects refer to each other, they may continue to exist for some time afterwards and get deleted later; and
for immutable objects (strings, integers), Python may decide to keep them around as an optimisation, but this mostly won't be visible to us and will in any case differ between versions.
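The first caveat can be seen directly by giving the same file object a second name (a sketch with a hypothetical scratch file):

```python
import os
import tempfile

# hypothetical scratch file
handle, path = tempfile.mkstemp()
os.close(handle)

f = open(path)   # one reference to the file object
g = f            # a second reference to the same object
del f            # the file stays open: g still refers to it
assert not g.closed

g.close()        # only now is the file actually closed
```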
The behavior of what you want to do is nondeterministic. Depending on the implementation and internal functioning of Python (CPython, PyPy, ...), it may or may not work:
Working example:
t = {
    'fd': open('data.txt', 'r')
}

def hook_close_fd():
    print('del dictionary, close the file')

t['fd'].close = hook_close_fd
del t
Output:
del dictionary, close the file
In this case, the close function is called on delete.
Non-working example:
t = {
    'fd': open('data.txt', 'r')
}

def hook_close_fd():
    print('del dictionary, close the file')

t['fd'].close = hook_close_fd
3 / 0
Output of 1st run:
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-211-2d81419599a9> in <module>
10 t['fd'].close = hook_close_fd
11
---> 12 3 / 0
ZeroDivisionError: division by zero
Output of 2nd run:
del dictionary, close the file
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-212-2d81419599a9> in <module>
10 t['fd'].close = hook_close_fd
11
---> 12 3 / 0
ZeroDivisionError: division by zero
As you can see, when an exception is raised, you can't be sure your file descriptor will be closed properly (especially if you don't catch the exception yourself).
No. You must close the files you open (see note).
Or, more safely (though it may not fit your project), open them in a context manager.
with open("filename", 'r') as fd:
    ...
If you are opening an unknown number of files simultaneously, you may need to write your own context manager, or more simply use contextlib.ExitStack.
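A sketch with contextlib.ExitStack, assuming a list of hypothetical scratch files standing in for the real inputs:

```python
import os
import tempfile
from contextlib import ExitStack

# hypothetical scratch files standing in for an unknown number of inputs
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, 'file_{}.txt'.format(i)) for i in range(3)]
for p in paths:
    with open(p, 'w') as f:
        f.write('data')

with ExitStack() as stack:
    # enter_context registers each file for closing when the stack exits
    files = [stack.enter_context(open(p)) for p in paths]
    contents = [f.read() for f in files]
# every file is closed here, even if an exception was raised inside the block
assert all(f.closed for f in files)
```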
Note:
From the 3.10.0 documentation, The Python Tutorial, 7.2. Reading and Writing Files (https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files):
Warning: Calling f.write() without using the with keyword or calling f.close() might result in the arguments of f.write() not being completely written to the disk, even if the program exits successfully.
Your question is less trivial than you think. But deep down it all comes down to understanding what a pointer is and how the garbage collector works.
Indeed, by doing del t['fd'] you delete that entry in the dictionary; more precisely, you delete the pointer that points to that entry's value. If you have a dictionary as follows:
t = { 'a':3, 'b':4 }
Then when you do del t you delete the pointer to the dictionary t. As there is therefore no further reference to the dictionary keys, the garbage collector deletes these as well, thus freeing up all the dictionary memory.
However, deleting a file descriptor abruptly is not desirable. The file descriptor (fd, a handle to the file and much more) is the way the programmer interacts with a file, and if it is dropped without being closed, the file can be left in an inconsistent state.
Therefore, it is a good idea to call the close function before you stop working with a file. This function takes care of all those things for you.
Yes, deleting the variable t closes the file (using psutil to show open files per process):
import psutil

def find_open_files():
    for proc in psutil.process_iter():
        if proc.name() == 'python3.9':
            print(proc.open_files())

filename = '/tmp/test.txt'
t = {'fd': open(filename, 'r')}
find_open_files()
del t
find_open_files()
Out:
[popenfile(path='/private/tmp/test.txt', fd=3)]
[]
Related
The problem
My application is extracting a list of zip files in memory and writing the data to a temporary file. I then memory-map the data in the temp file for use in another function. When I do this in a single process, it works fine: reading the data doesn't affect memory, and max RAM is around 40 MB. However, when I do this using concurrent.futures, the RAM goes up to 500 MB.
I have looked at this example and I understand I could be submitting the jobs in a nicer way to save memory during processing. But I don't think my issue is related, as I am not running out of memory during processing. The issue I don't understand is why it is holding onto the memory even after the memory maps are returned. Nor do I understand what is in the memory, since doing this in a single process does not load the data in memory.
Can anyone explain what is actually in the memory and why this is different between single and parallel processing?
PS I used memory_profiler for measuring the memory usage
Code
Main code:
def main():
    datadir = './testdata'
    files = os.listdir('./testdata')
    files = [os.path.join(datadir, f) for f in files]
    datalist = download_files(files, multiprocess=False)
    print(len(datalist))
    time.sleep(15)
    del datalist  # See here that memory is freed up
    time.sleep(15)
Other functions:
def download_files(filelist, multiprocess=False):
    datalist = []
    if multiprocess:
        with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
            returned_future = [executor.submit(extract_file, f) for f in filelist]
        for future in returned_future:
            datalist.append(future.result())
    else:
        for f in filelist:
            datalist.append(extract_file(f))
    return datalist

def extract_file(input_zip):
    buffer = next(iter(extract_zip(input_zip).values()))
    with tempfile.NamedTemporaryFile() as temp_logfile:
        temp_logfile.write(buffer)
        del buffer
        data = memmap(temp_logfile, dtype='float32', shape=(2000000, 4), mode='r')
    return data

def extract_zip(input_zip):
    with ZipFile(input_zip, 'r') as input_zip:
        return {name: input_zip.read(name) for name in input_zip.namelist()}
Helper code for data
I can't share my actual data, but here's some simple code to create files that demonstrate the issue:
for i in range(1, 16):
    outdir = './testdata'
    outfile = 'file_{}.dat'.format(i)
    fp = np.memmap(os.path.join(outdir, outfile), dtype='float32', mode='w+', shape=(2000000, 4))
    fp[:] = np.random.rand(*fp.shape)
    del fp
    with ZipFile(outdir + '/' + outfile[:-4] + '.zip', mode='w', compression=ZIP_DEFLATED) as z:
        z.write(outdir + '/' + outfile, outfile)
The problem is that you're trying to pass an np.memmap between processes, and that doesn't work.
The simplest solution is to instead pass the filename, and have the child process memmap the same file.
When you pass an argument to a child process or pool method via multiprocessing, or return a value from one (including doing so indirectly via a ProcessPoolExecutor), it works by calling pickle.dumps on the value, passing the pickle across processes (the details vary, but it doesn't matter whether it's a Pipe or a Queue or something else), and then unpickling the result on the other side.
A memmap is basically just an mmap object with an ndarray allocated in the mmapped memory.
And Python doesn't know how to pickle an mmap object. (If you try, you will either get a PicklingError or a BrokenProcessPool error, depending on your Python version.)
A np.memmap can be pickled, because it's just a subclass of np.ndarray—but pickling and unpickling it actually copies the data and gives you a plain in-memory array. (If you look at data._mmap, it's None.) It would probably be nicer if it gave you an error instead of silently copying all of your data (the pickle-replacement library dill does exactly that: TypeError: can't pickle mmap.mmap objects), but it doesn't.
It's not impossible to pass the underlying file descriptor between processes—the details are different on every platform, but all of the major platforms have a way to do that. And you could then use the passed fd to build an mmap on the receiving side, then build a memmap out of that. And you could probably even wrap this up in a subclass of np.memmap. But I suspect if that weren't somewhat difficult, someone would have already done it, and in fact it would probably already be part of dill, if not numpy itself.
Another alternative is to explicitly use the shared memory features of multiprocessing, and allocate the array in shared memory instead of a mmap.
But the simplest solution is, as I said at the top, to just pass the filename instead of the object, and let each side memmap the same file. This does, unfortunately, mean you can't just use a delete-on-close NamedTemporaryFile (although the way you were using it was already non-portable and wouldn't have worked on Windows the same way it does on Unix), but changing that is still probably less work than the other alternatives.
Is this a bug? The following demonstrates what happens when you use libtiff to extract an image from an open tiff file handle. It works in Python 2.x and does not work in Python 3.2.3:
import os

# any file will work here, since it's not actually loading the tiff
# assuming it's big enough for the seek
filename = "/home/kostrom/git/wiredfool-pillow/Tests/images/multipage.tiff"

def test():
    fp1 = open(filename, "rb")
    buf1 = fp1.read(8)
    fp1.seek(28)
    fp1.read(2)

    for x in range(16):
        fp1.read(12)
    fp1.read(4)

    fd = os.dup(fp1.fileno())
    os.lseek(fd, 28, os.SEEK_SET)
    os.close(fd)

    # this magically fixes it: fp1.tell()

    fp1.seek(284)
    expect_284 = fp1.tell()
    print("expected 284, actual %d" % expect_284)

test()
The output which I feel is in error is:
expected 284, actual -504
Uncommenting the fp1.tell() does some ... side effect ... which stabilizes the py3 handle, and I don't know why. I'd also appreciate if someone can test other versions of python3.
No, this is not a bug. The Python 3 io library, which provides you with the file object from an open() call, gives you a buffered file object. For binary files, you are given a (subclass of) io.BufferedIOBase.
The Python 2 file object is far more primitive, although you can use the io library there too.
By seeking at the OS level you are bypassing the buffer and are mucking up the internal state. Generally speaking, as the doctor said to the patient complaining that pinching his skin hurts: don't do that.
If you have a pressing need to do this anyway, at the very least use the underlying raw file object (a subclass of io.RawIOBase) via the buffered object's raw attribute:
fp1 = open(filename, "rb").raw
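A sketch of the difference (using a hypothetical scratch file, and buffering=0 to obtain the raw FileIO directly): the raw object has no Python-side cached position, so it queries the OS and stays in sync with OS-level seeks:

```python
import os
import tempfile

# hypothetical scratch file
handle, path = tempfile.mkstemp()
with os.fdopen(handle, "wb") as f:
    f.write(b"x" * 100)

raw = open(path, "rb", buffering=0)  # a FileIO with no Python-side buffer
raw.read(8)                          # OS-level offset is now 8

fd = os.dup(raw.fileno())
os.lseek(fd, 28, os.SEEK_SET)        # moves the shared OS-level offset
os.close(fd)

position = raw.tell()                # FileIO asks the OS directly
print(position)                      # 28
```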
os.dup creates a duplicate file descriptor that refers to the same open file description. Therefore, os.lseek(fd, 28, os.SEEK_SET) changes the seek position of the file underlying fp1.
Python's file objects cache the file position to avoid repeated system calls. The side effect of this is that changing the file position without using the file object methods will desynchronize the cached position and the real position, leading to nonsense like you've observed.
Worse yet, because the files are internally buffered by Python, seeking outside the file methods could actually cause the returned file data to be incorrect, leading to corruption or other nasty stuff.
The documentation in bufferedio.c notes that tell can be used to reinitialize the cached value:
* The absolute position of the raw stream is cached, if possible, in the
`abs_pos` member. It must be updated every time an operation is done
on the raw stream. If not sure, it can be reinitialized by calling
_buffered_raw_tell(), which queries the raw stream (_buffered_raw_seek()
also does it). To read it, use RAW_TELL().
I'm getting this exception when trying to open shelve-persisted files over a certain size, which is actually pretty small (< 1 MB), but I'm not sure where exactly the threshold is. Now, I know pickle is sort of the bastard child of Python and shelve isn't thought of as a particularly robust solution, but it happens to solve my problem wonderfully (in theory), and I haven't been able to find a reason for this exception.
Traceback (most recent call last):
File "test_shelve.py", line 27, in <module>
print len(f.keys())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 101, in keys
return self.dict.keys()
SystemError: Negative size passed to PyString_FromStringAndSize
I can reproduce it consistently, but I haven't found much on Google. Here's a script that will reproduce it.
import shelve
import random
import string
import pprint

f = shelve.open('test')
# f = {}

def rand_list(list_size=20, str_size=40):
    return [''.join([random.choice(string.ascii_uppercase + string.digits) for j in range(str_size)]) for i in range(list_size)]

def recursive_dict(depth=3):
    if depth == 0:
        return rand_list()
    else:
        d = {}
        for k in rand_list():
            d[k] = recursive_dict(depth-1)
        return d

for k, v in recursive_dict(2).iteritems():
    f[k] = v

f.close()
f = shelve.open('test')
print len(f.keys())
Regarding the error itself:
The idea circulating on the web is that the data size exceeded the largest integer possible on that machine (the largest 32-bit signed integer is 2 147 483 647) and was interpreted as a negative size by Python.
Your code is running with 2.7.3, so this may be a since-fixed bug.
The code "works" if I change the depth from 2 to 1, or if I run under Python 3 (after fixing the print statements and using items() instead of iteritems()). However, the list of keys is clearly not the set of keys found while iterating over the return value of recursive_dict().
The following restriction from the shelve documentation may apply (emphases mine):
The choice of which database package will be used (such as dbm, gdbm or bsddb) depends on which interface is available. Therefore it is not safe to open the database directly using dbm. The database is also (unfortunately) subject to the limitations of dbm, if it is used — this means that (the pickled representation of) the objects stored in the database should be fairly small, and in rare cases key collisions may cause the database to refuse updates.
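Since the limitation depends on which dbm backend shelve ended up with, it can help to check. On Python 3 this is dbm.whichdb (the question's Python 2 had whichdb.whichdb); a sketch with a hypothetical shelf path:

```python
import dbm
import os
import shelve
import tempfile

# hypothetical shelf file in a temp directory
path = os.path.join(tempfile.mkdtemp(), 'test')
with shelve.open(path) as db:
    db['key'] = {'some': 'nested', 'value': [1, 2, 3]}

# report which dbm implementation actually backs the shelf
backend = dbm.whichdb(path)
print(backend)  # e.g. 'dbm.gnu', 'dbm.ndbm' or 'dbm.dumb'
```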
I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.
It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:
with open(filename) as data:
    accounts = dict()
    for line in data:
        username = line.split()[1]
        IP = line.split()[0]
        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()
print "The accounts have been deleted from memory"
time.sleep(5)
print "End of script"
The last lines are there so that I could monitor memory usage.
The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.
I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in the terminal.
What I don't understand is why Python needs so much memory after I've cleared the dictionary. Is the file still stored in memory? If so, how can I get rid of it? Is it a problem with my OS not seeing the freed memory?
EDIT: I've tried forcing a gc.collect() after clearing the dictionary, to no avail.
EDIT2: I'm running Python 2.7.3 on Ubuntu 12.04 LTS.
EDIT3: I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that, later on, Python does not seem to reuse that memory (it just asks the OS for more memory).
This really makes no sense to me either, and I wanted to figure out how/why this happens. (I thought that's how this should work too!) I replicated it on my machine, though with a smaller file.
I saw two discrete problems here:
why is Python reading the file into memory (with lazy line reading, it shouldn't, right?)
why isn't Python freeing up memory to the system
I'm not knowledgeable at all about Python internals, so I just did a lot of web searching. All of this could be completely off the mark. (I barely develop anymore; I've been on the biz side of tech for the past few years.)
Lazy line reading...
I looked around and found this post:
http://www.peterbe.com/plog/blogitem-040312-1
It's from a much earlier version of Python, but this line resonated with me:
readlines() reads in the whole file at once and splits it by line.
Then I saw this, also old, effbot post:
http://effbot.org/zone/readline-performance.htm
the key takeaway was this:
For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.
and this:
In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better
Looking at Python's docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:
This method returns the same thing as iter(f)
Deprecated since version 2.3: Use for line in file instead.
It made me think that perhaps some slurping is going on.
So if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...
Read until EOF using readline() and return a list containing the lines thus read.
...it sort of seems like that's what's happening here.
readline, however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]:
Read one entire line from the file
So I tried switching this to readline, and the process never grew above 40 MB (it was growing to 200 MB, the size of the log file, before):
accounts = dict()
data = open(filename)

# call readline() repeatedly until EOF (an empty string); note that a bare
# `for line in data.readline():` would iterate over the characters of a
# single line, not over the lines of the file
for line in iter(data.readline, ''):
    info = line.split("LOG:")
    if len(info) == 2:
        (a, b) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
My guess is that we're not really lazy-reading the file with the for x in data construct, although all the docs and Stack Overflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines() consumed approximately the same amount of memory as for line in data.
Freeing memory
In terms of freeing up memory, I'm not familiar with Python's internals, but I recall from when I worked with mod_perl: if I opened a file that was 500 MB, that Apache child grew to that size. If I freed the memory, it would only be free within that child; garbage-collected memory was never returned to the OS until the process exited.
So I poked around on that idea and found a few links suggesting this might be happening:
http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm
If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.
That was sort of old, and I found a bunch of (accepted) patches into Python afterwards suggesting the behavior had changed and that you could now return memory to the OS (as of 2005, when most of those patches were submitted and apparently approved).
Then I found this posting http://objectmix.com/python/17293-python-memory-handling.html and note comment #4:
"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.
So with 2.4 under linux (as you tested) you will indeed not always get
the used memory back, with respect to lots of small objects being
collected.
The difference therefore (I think) you see between doing an f.read() and
an f.readlines() is that the former reads in the whole file as one large
string object (i.e. not a small object), while the latter returns a list
of lines where each line is a python object.
If the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? Perhaps it's not a problem of having a single 3 GB object, but instead of having millions of 30 KB objects.
Which version of Python are you trying this on?
I did a test on Python 2.7/Win7, and it worked as expected, the memory was released.
Here I generate sample data like yours:
import random

fn = random.randint
with open('ips.txt', 'w') as f:
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0, 255),
            fn(0, 255),
            fn(0, 255),
            fn(0, 255),
            fn(0, 9000000),
        ))
And then your script. I replaced dict by defaultdict because throwing exceptions makes the code slower:
import time
from collections import defaultdict

def read_file(filename):
    with open(filename) as data:
        accounts = defaultdict(set)
        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)

    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()
    print "The accounts have been deleted from memory"
    time.sleep(5)
    print "End of script"

if __name__ == '__main__':
    read_file('ips.txt')
As you can see, memory reached 1.4 GB and was then released, leaving 36 MB.
Using your original script I got the same results, but a bit slower.
There is a difference between when Python releases memory for reuse by Python itself and when it releases memory back to the OS. Python keeps internal pools for some kinds of objects; it will reuse those itself but doesn't give them back to the OS.
The gc module may be useful, particularly the collect function. I have never used it myself, but from the documentation, it looks like it may be useful. I would try running gc.collect() before you run accounts.clear().
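For reference, gc.collect() returns the number of unreachable objects it found. A reference cycle is a minimal sketch of what the collector reclaims that plain reference counting cannot:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

# build a reference cycle that reference counting alone cannot free
a, b = Node(), Node()
a.ref, b.ref = b, a
del a, b

collected = gc.collect()  # force a full collection of all generations
print(collected)          # counts the cycle's objects (instances and their dicts)
```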
So I'm working on a script that will automatically download certain files from IRC XDCC bots when run. These requests are asynchronous, and there can be a varying number of them depending on a config file, so I wanted to keep the file handles in a hash table (dictionary) so they could easily be referenced based on who the file sender was and the file they are sending (read during a triggered event). Python is complaining, saying SyntaxError: can't assign to function call, so I'm guessing it won't work quite how I want.
Any easier way to do this? Am I barking up the wrong tree here?
Thanks! -Russell
The problem is that the left side of an assignment statement must be an lvalue, that is, something the compiler knows has a memory address, like a variable. It is the same in other programming languages. The return value of a function is an rvalue, or a pure value.
These are other illegal assignments:
f() = 1
2 = 1
None = 0
[1,2] = []
Note that the following are syntactically correct because the compiler knows how to compute an address for the memory location to be assigned:
f().a = None
[1,2][0] = 0
Create an empty hash:
files = {}
Add items to the hash:
files["gin"] = open('ginpachi.txt','w')
files["ahq"] = open('ahq[DaBomb].txt','w')
Reference them like you would a normal file handle:
files["gin"].close()
...
Unfortunately, there wasn't any information on this on the web (specifically with hashes and file handles).
Case closed
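Putting it together, a sketch (writing the hypothetical filenames into a temp directory): cleanup is a single loop over the dict's values:

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()  # hypothetical download directory
files = {
    "gin": open(os.path.join(tmpdir, "ginpachi.txt"), "w"),
    "ahq": open(os.path.join(tmpdir, "ahq[DaBomb].txt"), "w"),
}

files["gin"].write("some data\n")

# close every handle in one pass when the transfers finish
for handle in files.values():
    handle.close()
```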