I have to copy and do some simple processing on a file. I cannot read the whole file into memory because it is too big. I came up with a piece of code that looks like this:
buffer = inFile.read(buffer_size)
while len(buffer) > 0:
    outFile.write(buffer)
    simpleCalculations(buffer)
    buffer = inFile.read(buffer_size)
The simpleCalculations procedure is irrelevant in this context, but I am worried about the repeated memory allocations for buffer. On some hardware configurations memory usage gets very high and that apparently kills the machine. I would like to reuse buffer. Is this possible in Python 2.6?
I don't think there's any easy way around this. The file.read() method just returns a new string each time you call it. On the other hand, you don't really need to worry about running out of memory -- once you assign buffer to the newly-read string, the previously-read string no longer has any references to it, so its memory gets freed automatically (see here for more details).
Python being a strictly reference-counted environment, your buffer will be deallocated as soon as you no longer have any references to it.
If you're worried about physical RAM but have spare address space, you could mmap your file rather than reading it in a bit at a time.
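For illustration, here is a minimal sketch of that mmap approach; the file names and chunk size are placeholders, and simpleCalculations is the question's own processing step:

import mmap

# A sketch, not a drop-in replacement: maps the input file into the address
# space and processes it in slices, so pages are faulted in on demand.
inFile = open('input.dat', 'rb')      # placeholder path
outFile = open('output.dat', 'wb')    # placeholder path
mm = mmap.mmap(inFile.fileno(), 0, access=mmap.ACCESS_READ)
try:
    chunk_size = 1024 * 1024
    offset = 0
    while offset < len(mm):
        chunk = mm[offset:offset + chunk_size]
        outFile.write(chunk)
        simpleCalculations(chunk)     # the question's placeholder function
        offset += chunk_size
finally:
    mm.close()
    outFile.close()
    inFile.close()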
I have huge array objects that are pickled with the Python pickler.
I am trying to unpickle them and read out the data in a for loop.
Every time I am done reading and assessing, I delete all the references to those objects.
After deletion, I even call gc.collect() along with time.sleep() to see if the heap memory shrinks.
The heap memory doesn't shrink, which suggests that the data is still referenced somewhere within the pickle loading. After 15 data files (I have 250+ files to process, 1.6 GB each) I hit a MemoryError.
I have seen many other questions here, pointing out a memory leak issue which was supposedly solved.
I don't understand what is exactly happening in my case.
Python's memory management does not return memory to the OS while the process is running.
Running the for loop via a subprocess that calls the script helped me solve the issue.
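A minimal sketch of that workaround, assuming each pickle file is handled by a separate worker script; process_one.py and the file-name pattern are hypothetical names for this example:

import subprocess
import sys

# Hypothetical list of pickle files; each one is processed in a fresh process,
# so all of that process's memory is returned to the OS when it exits.
files = ['data_%03d.pkl' % i for i in range(250)]

for path in files:
    # 'process_one.py' is a hypothetical worker script that unpickles and
    # processes a single file passed on the command line.
    subprocess.check_call([sys.executable, 'process_one.py', path])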
Thanks for the feedback.
I was looking into the tempfile options in Python when I ran into SpooledTemporaryFile. The description says:
This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
I would like to understand exactly what that means. I have looked around but found no answer. If I get it right:
Is the written data 'buffered' in RAM until it reaches a certain threshold and then saved in disk? What is the advantage of this method over the standard approach? Is it faster? Because in the end it will have to be saved in disk anyway...
Anyway, if anyone can provide me with some insights I'd be grateful.
The data is buffered in memory before writing. The question becomes, will anything be written?
Not necessarily. TemporaryFile, by default, deletes the file after it is closed. If a spooled file never gets big enough to write to disk (and the user never tries to invoke its fileno method), the buffer will simply be flushed when it is closed, with no actual disk I/O taking place.
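A small sketch illustrating that behaviour; the max_size values are arbitrary:

import tempfile

# Data smaller than max_size stays in an in-memory buffer; no disk I/O happens,
# and closing the file simply discards the buffer.
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as f:
    f.write(b'small payload')   # held in RAM only
    f.seek(0)
    data = f.read()

# Writing more than max_size bytes (or calling f.fileno()) rolls the buffer
# over to a real temporary file on disk, after which it behaves like
# TemporaryFile(): the file is deleted when closed.
with tempfile.SpooledTemporaryFile(max_size=10) as f:
    f.write(b'this payload is longer than ten bytes')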
I think that I'm having a memory leak when loading a .yml file with the PyYAML library.
I've run the following:
import yaml
d = yaml.load(open(filename, 'r'))
The memory used by the process (obtained with top or htop) has grown from 60 KB to 160 MB, while the file is smaller than 1 MB.
Then I've run the following command:
sys.getsizeof(d)
And it has returned a value lower than 400K.
I've also tried to use the garbage collector with gc.collect(), but nothing has happened.
As you can see, it seems that there's a memory leak, but I don't know what is producing it, nor do I know how to free this amount of memory.
Any idea?
Your approach doesn't show a memory leak, it just shows that PyYAML uses a lot of memory while processing a moderately sized YAML file.
If you were to do:
import yaml
X = 10
for x in range(X):
    d = yaml.safe_load(open(filename, 'r'))
and the memory used by the program changed depending on what you set X to, then there would be reason to assume there is a memory leak.
In the tests that I ran this was not the case. It is just that the default Loader and SafeLoader take about 330x the file size in memory (based on an arbitrary 1 MB simple, i.e. tag-free, YAML file) and the CLoader about 145x that file size.
Loading the YAML data multiple times doesn't increase that, so load() gives back the memory it uses, which means there is no memory leak.
That is not to say that it isn't an enormous amount of overhead.
(I am using safe_load() because PyYAML's documentation indicates that load() is not safe on uncontrolled input files.)
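A rough sketch of that kind of leak check, assuming a Linux-like system where resource.getrusage reports peak RSS in kilobytes; the filename is a placeholder:

import resource
import yaml

filename = 'data.yml'   # placeholder path

def peak_rss_kb():
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for x in range(10):
    with open(filename) as fp:
        d = yaml.safe_load(fp)
    print(x, peak_rss_kb())
# If peak RSS kept climbing with each iteration, that would point to a leak;
# if it plateaus after the first load, the memory is being reused.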
I'm trying to fetch all auto-response emails from a specific address in Python using imaplib. Everything worked fine for weeks, but now each time I run my program all my RAM is consumed (several GB!) and the script ends up being killed by the OOM killer.
Here is the code I'm currently using:
M = imaplib.IMAP4_SSL('server')
M.LOGIN('user', 'pass')
M.SELECT()
date = (datetime.date.today() - datetime.timedelta(1)).strftime("%d-%b-%Y")
result, data = M.uid('search', None, '(SENTON %s HEADER FROM "auto#site.com" NOT SUBJECT "RE:")' % date)
...
I'm sure that fewer than 100 emails of a few kilobytes each should be returned. What could be the matter here? Or is there a way to limit the number of emails returned?
Thx!
There's no way to know for sure what the cause is, without being able to reproduce the problem (and certainly not without seeing the complete program which triggers the problem, and knowing the version of all dependencies you're using).
However, here's my best guess. Several versions of Python include a very memory-wasteful implementation of imaplib. The problem is particularly evident on Windows, but not limited to that platform.
The core of the problem is the way strings are allocated when read from a socket, and the way imaplib reads strings from sockets.
When reading from a socket, Python first allocates a buffer large enough to handle as many bytes as the application asks for. This may be something reasonable sounding, perhaps 16 kB. Then data is read into that buffer and the buffer is resized down to fit the number of bytes actually read.
The efficiency of this operation depends on the quality of the platform's re-allocation implementation. Resizing a buffer may end up moving it to a more suitable location, where the smaller size avoids wasting much memory. Or it may just mark the tail of the region as unallocated and re-usable (and it may even manage to re-use it in practice). Or it might end up wasting that technically unallocated memory.
Imagine the cumulative effects of that memory being wasted if you have to read a few dozen kB of data, and the data arrives from the network a few dozen bytes at a time. Worse, imagine if the data is really trickling, and you only get a few bytes at a time. Or if you're reading a very "large" response of several hundred kB.
The amount of memory wasted - effectively allocated by the process, but not usable in any meaningful way - can be huge. 100 kB of data, read 5 bytes at a time, requires 20480 buffers. If each buffer starts off as 16 kB and is unsuccessfully shrunk, causing them to remain at 16 kB, then you've allocated at least 320 MB of memory to hold that 100 kB of data.
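For illustration, the arithmetic behind those figures as a quick back-of-the-envelope check:

# Back-of-the-envelope check of the figures above.
data_size = 100 * 1024       # 100 kB of response data
chunk = 5                    # bytes delivered per read
buffer_size = 16 * 1024      # each read starts with a 16 kB buffer
buffers = data_size // chunk         # 20480 buffers
wasted = buffers * buffer_size       # 335544320 bytes, i.e. 320 MB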
Some versions of imaplib exacerbated this problem by introducing multiple layers of buffering and copying. A very old version (hopefully not one you're actually using) even read 1 byte at a time (which would result in 1.6 GB of memory usage in the above scenario).
Of course, this problem usually doesn't show up on Linux, where the re-allocator is not so bad. And at various points in previous Python releases (previous to the most recent 2.x release), the bug was "fixed", so I wouldn't expect to see it show up these days. And this doesn't explain why your program ran fine for a while before failing this way.
But it is my best guess.
What is the most efficient way to delete an arbitrary chunk of a file, given the start and end offsets? I'd prefer to use Python, but I can fall back to C if I have to.
Say the file is this
..............xxxxxxxx----------------
I want to remove a chunk of it:
..............[xxxxxxxx]----------------
After the operation it should become:
..............----------------
Reading the whole thing into memory and manipulating it in memory is not a feasible option.
The best performance will almost invariably be obtained by writing a new version of the file and then atomically replacing the old version, because filesystems are strongly optimized for such sequential access, and so is the underlying hardware (with the possible exception of some of the newest SSDs, but, even then, it's an iffy proposition). In addition, this avoids destroying data in the case of a system crash at any time -- you're left with either the old version of the file intact, or the new one in its place. Since every system could always crash at any time (and by Murphy's Law, it will choose the most unfortunate moment;-), integrity of data is generally considered very important (often data is more valuable than the system on which it's kept -- hence, "mirroring" RAID solutions to ensure against disk crashes losing precious data;-).
If you accept this sane approach, the general idea is: open the old file for reading and the new one for writing (creation); copy N1 bytes over from the old file to the new one; then skip N2 bytes of the old file; then copy the rest over; close both files; atomically rename new to old. (Windows apparently has no "atomic rename" system call usable from Python -- to keep integrity in that case, instead of the atomic rename, you'd do three steps: rename the old file to a backup name, rename the new file to the old name, delete the backup-named file -- in case of a system crash during the second of these three very fast operations, one rename is all it will take to restore data integrity.)
N1 and N2, of course, are the two parameters saying where the deleted piece starts, and how long it is. For the part about opening the files, with open('old.dat', 'rb') as oldf: and with open('NEWold.dat', 'wb') as newf: statements, nested into each other, are clearly best (the rest of the code until the rename step must be nested in both of them of course).
For the "copy the rest over" step, shutil.copyfileobj is best (be sure to specify a buffer length that's comfortably going to fit in your available RAM, but a large one will tend to give better performance). The "skip" step is clearly just a seek on the oldf open-for-reading file object. For copying exactly N1 bytes from oldf to newf, there is no direct support in Python's standard library, so you have to write your own, e.g:
def copyN1(oldf, newf, N1, buflen=1024*1024):
    # Copy exactly N1 bytes (or until EOF) from oldf to newf.
    while N1 > 0:
        data = oldf.read(min(N1, buflen))
        if not data:          # unexpected end of file
            break
        newf.write(data)
        N1 -= len(data)       # subtract what was actually read, not buflen
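Putting the pieces together, here is a sketch of the whole procedure under the assumptions above; delete_chunk and the '.new' suffix are names made up for this example, and on Windows os.rename will not overwrite an existing target, so the three-step dance described earlier would be needed there:

import os
import shutil

def delete_chunk(path, start, length, buflen=1024*1024):
    # Write a new file without the [start, start+length) byte range,
    # then atomically replace the original (POSIX os.rename semantics).
    tmp_path = path + '.new'                        # placeholder temporary name
    with open(path, 'rb') as oldf:
        with open(tmp_path, 'wb') as newf:
            copyN1(oldf, newf, start)               # copy the first N1 = start bytes
            oldf.seek(length, 1)                    # skip N2 = length bytes
            shutil.copyfileobj(oldf, newf, buflen)  # copy the rest
    os.rename(tmp_path, path)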
I'd suggest memory mapping. Though it does manipulate the file through memory, it is more efficient than plainly reading the whole file into memory.
Well, you have to manipulate the file contents in memory one way or another, as there's no system call for such an operation in either *nix or Windows (at least none that I'm aware of).
Try mmapping the file. This won't necessarily read it all into memory at once.
If you really want to do it by hand, choose some chunk size and do back-and-forth reads and writes. But the seeks are going to kill you...
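As a rough sketch of what the mmap route could look like for this problem, assuming an in-place shift of the tail followed by a truncate is acceptable; delete_chunk_inplace is a made-up name, and start/end are the question's offsets:

import mmap
import os

def delete_chunk_inplace(path, start, end):
    # Shift the tail of the file left over the [start, end) range, then shrink
    # the file; pages are moved by the OS rather than read all at once.
    size = os.path.getsize(path)
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)          # map the whole file read/write
        try:
            mm.move(start, end, size - end)    # arguments: dest, src, count
        finally:
            mm.close()
        f.truncate(size - (end - start))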