How to solve Python RAM leak when running large script

How to solve Python RAM leak when running large script - python

I have a massive Python script I inherited. It runs continuously on a long list of files, opens them, does some processing, creates plots, writes some variables to a new text file, then loops back over the same files (or waits for new files to be added to the list).
My memory usage steadily goes up to the point where my RAM is full within an hour or so. The code is designed to run 24/7/365 and apparently used to work just fine. I see the RAM usage steadily going up in task manager. When I interrupt the code, the RAM stays used until I restart the Python kernel.
I have used sys.getsizeof() to check all my variables and none are unusually large/increasing with time. This is odd - where is the RAM going then? The text files I am writing to? I have checked and as far as I can tell every file creation ends with a f.close() statement, closing the file. Similar for my plots that I create (I think).
What else would be steadily eating away at my RAM? Any tips or solutions?
What I'd like to do is some sort of "close all open files/figures" command at some point in my code. I am aware of the del command but then I'd have to list hundreds of variables at multiple points in my code to routinely delete them (plus, as I pointed out, I already checked getsizeof and none of the variables are large. Largest was 9433 bytes).
Thanks for your help!

Related

Python Process Terminated due to "Low Swap" When Writing To stdout for Data Science

I'm new to python so I apologize for any misconceptions.
I have a python file that needs to read/write to stdin/stdout many many times (hundreds of thousands) for a large data science project. I know this is not ideal, but I don't have a choice in this case.
After about an hour of running (close to halfway completed), the process gets terminated on my mac due to "Low Swap" which I believe refers to lack of memory. Apart from the read/write, I'm hardly doing any computing and am really just trying to get this to run successfully before going any farther.
My Question: Does writing to stdin/stdout a few hundred thousand times use up that much memory? The file basically needs to loop through some large lists (15k ints) and do it a few thousand times. I've got 500 gigs of hard drive space and 12 gigs of ram and am still getting the errors. I even spun up an EC2 instance on AWS and STILL had memory errors. Is it possible that I have some sort of memory leak in the script even though I'm not doing hardly anything? Is there anyway that I reduce the memory usage to run this successfully?
Appreciate any help.

the process gets terminated on my mac due to "Low Swap" which I believe refers to lack of memory
SWAP space is part of your Main Memory - RAM.
When a user reads a file it puts in it Main Memory (caches, and RAM). When its done it removes it.
However, when a user writes to a file, changes need to be recorded. One problem. What if you are writing to a different file every millisecond. The RAM and L caches reach capacity, so the least recently used (LRU) files are put into SWAP space. And since SWAP is still part of Main Memory (not the hard drive), it is possible to overflow it and lose information, which can cause a crash.
Is it possible that I have some sort of memory leak in the script even though I'm not doing hardly anything?
Possibly
Is there anyway that I reduce the memory usage to run this successfully?
One way is to think of how you are managing the file(s). Reads will not hurt SWAP because the file can just be scrapped, without the need to save. You might want to explicitly save the file (closing and opening the file should work) after a certain amount of information has been processed or a certain amount of time has gone by. Thus, removing the file from SWAP space.

Run a python source code on RAM

I have written a code which does some processing , I want to reduce the execution time of the program and I think it can be done if I run it on my RAM which is 1GB.
So will running my program form RAM make any difference to my execution time and if yes how it can be done.

Believe it or not, when you use a modernish computer system, most of your computation is done from RAM. (Well, technically, it's "done" from processor registers, but those are filled from RAM so let's brush that aside for the purposes of this answer)
This is thanks to the magic we call caches and buffers. A disk "cache" in RAM is filled by the operating system whenever something is read from permanent storage. Any further reads of that same data (until and unless it is "evicted" from the cache) only read memory instead of the permanent storage medium.
A "buffer" works similarly for write output, with data first being written to RAM and then eventually flushed out to the underlying medium.
So, in the course of normal operation, any runs of your program after the first (unless you've done a lot of work in between), will already be from RAM. Ditto the program's input file: if it's been read recently, it's already cached in memory! So you're unlikely to be able to speed things up by putting it in memory yourself.
Now, if you want to force things for some reason, you can create a "ramdisk", which is a filesystem backed by RAM. In Linux the easy way to do this is to mount "tmpfs" or put files in the /dev/shm directory. Files on a tmpfs filesystem go away when the computer loses power and are entirely stored in RAM, but otherwise behave like normal disk-backed files. From the way your question is phrased, I don't think this is what you want. I think your real answer is "whatever performance problems you think you have, this is not the cause, sorry".

Should I delete temporary files created by my script?

It's a common question not specifically about some language or platform. Who is responsible for a file created in systems $TEMP folder?
If it's my duty, why should I care where to put this file? I can place it anywhere with same result.
If it's OS responsibility, can I forgot about this file right after use?
Thanks and sorry for my basic English.

As a general rule, you should remove the temporary files that you create.
Recall that the $TEMP directory is a shared resource that other programs can use. Failure to remove the temporary files will have an impact on the other programs that use $TEMP.
What kind of impacts? That will depend upon the other programs. If those other programs create a lot of temporary files, then their execution will be slower as it will take longer to create a new temporary file as the directory will have to be scanned on each temporary file creation to ensure that the file name is unique.
Consider the following (based on real events) ...
In years past, my group at work had to use the Intel C Compiler. We found that over time, it appeared to be slowing down. That is, the time it took to run our sanity tests using it took longer and longer. This also applied to building/compiling a single C file. We tracked the problem down.
ICC was opening, stat'ing and reading every file under $TEMP. For what purpose, I know not. Although the argument can be made that the problem lay with the ICC, the existence of the files under $TEMP was slowing it and our development team down. Deleting those temporary files resulted in the sanity checks running in less than a half hour instead of over two--a significant time saver.
Hope this helps.

There is no standard and no common rules. In most OSs, the files in the temporary folder will pile up. Some systems try to prevent this by deleting files in there automatically after some time but that sometimes causes grief, for example with long running processes or crash backups.
The reason for $TEMP to exist is that many programs (especially in early times when RAM was scarce) needed a place to store temporary data since "super computers" in the 1970s had only a few KB of RAM (yes, N*1024 bytes where N is << 100 - you couldn't even fit the image of your mouse cursor into that). Around 1980, 64KB was a lot.
The solution was a folder where anyone could write. Security wasn't an issue at the time, memory was.
Over time, OSs started to get better systems to create temporary files and to clean them up but backwards compatibility prevented a clean, "work for all" solution.
So even though you know where the data ends up, you are responsible to clean up the files after yourself. To make error analysis easier, I tend to write my code in such a way that files are only deleted when everything is fine - that way, I can look at intermediate results to figure out what is wrong. But logging is often a better and safer solution.
Related: Memory prices 1957-2014 12KB of Ram did cost US $4'680,- in 1973.

Python process consuming increasing amounts of system memory, but heapy shows roughly constant usage

I'm trying to identify a memory leak in a Python program I'm working on. I'm current'y running Python 2.7.4 on Mac OS 64bit. I installed heapy to hunt down the problem.
The program involves creating, storing, and reading large database using the shelve module. I am not using the writeback option, which I know can create memory problems.
Heapy usage shows during the program execution, the memory is roughly constant. Yet, my activity monitor shows rapidly increasing memory. Within 15 minutes, the process has consumed all my system memory (16gb), and I start seeing page outs. Any idea why heapy isn't tracking this properly?

Take a look at this fine article. You are, most likely, not seeing memory leaks but memory fragmentation. The best workaround I have found is to identify what the output of your large working set operation actually is, load the large dataset in a new process, calculate the output, and then return that output to the original process.
This answer has some great insight and an example, as well. I don't see anything in your question that seems like it would preclude the use of PyPy.

various memory errors, python

I am using Python, but recently I am running a lot into the memory errors.
One is related to saving the plots in .png format. As soon as I try to save them in .pdf format I don't have this problem anymore. How can I still use .png for multiple files?
Secondly I am reading quite big data files, and after a while, I run out of memory. I try closing them each time but perhaps there is still something opened left. Is there a way to close all the the opened files in Python without having handlers to them?
And finally, Python should release all the unused variables, but I think it's not doing so. If I run just one function I have no problem, but if I run two unrelated functions in the row (after finishing the first and before going to the second, in my understanding, all the variables should be released), during the second one, I run yet again into the memory error problem. Therefore I believe, the variables are not released after the first run. How can I force Python to release all of them (I don't want to use del, because there are loads of variables and I don't want to specify every single one of them).
Thanks for your help!

Looking at code would probably bring more clearance.
You can also try doing
import gc
f() #function that eats lots of memory while executing
gc.collect()
This will call the garbage collector and you will be sure that all abandoned objects are deleted. If that doesn't solve the problem, take a look at objgraph library http://mg.pov.lt/objgraph/objgraph.html in order to detect who leaks the memory or to find the places where you've forgotten to remove reference to a memory consuming object.

Secondly I am reading quite big data files, and after a while, I run out of memory. I try closing them each time but perhaps there is still something opened left. Is there a way to close all the the opened files in Python without having handlers to them?
If you use with open(myfile1) as f1: ..., you don't need to worry about closing files or about accidentally leaving files opened.
See here for a good explanation.
As for the other questions, I agree with alex_jordan that it would help if you showed some of your code.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.