Programmatically testing memory usage in Google Colab (Python 3.x) - python

I am trying to write a program that at a certain level of memory left will write a list to a file to free up memory in Google Colab. I can't find a way to programmatically test for the amount of memory left using python. To be clear I'm not looking for a way to save to a file (I already know that), I'm looking for a way to test the amount of memory left. The code I'm looking for would work something like this:
memory_left = memory_function()  # or some_object.method()
if memory_left <= memory_threshold:
    save_file()
Another solution would be to use the memory profiler package to measure the size of an object and save the file when it reaches a certain size, but I don't think that will work, because memory usage in my environment keeps growing. There might not be enough memory later on for a 100 MB file when only 10 MB are left.
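A minimal sketch of the kind of check described above, assuming the notebook runs on Linux (as Colab VMs do), where the kernel reports its estimate of available memory in /proc/meminfo. The names parse_mem_available, memory_left, maybe_save and the 500 MiB threshold are hypothetical, chosen only for illustration.

```python
def parse_mem_available(meminfo_text):
    """Return MemAvailable in bytes from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Lines look like "MemAvailable:   10485760 kB".
            return int(line.split()[1]) * 1024
    return None


def memory_left(path="/proc/meminfo"):
    """Read the kernel's estimate of memory available for new allocations."""
    with open(path) as f:
        return parse_mem_available(f.read())


MEMORY_THRESHOLD = 500 * 1024**2  # hypothetical threshold: 500 MiB


def maybe_save(save_file, threshold=MEMORY_THRESHOLD):
    """Call save_file() once available memory is at or below the threshold."""
    left = memory_left()
    if left is not None and left <= threshold:
        save_file()
        return True
    return False
```

If installing a package is acceptable, psutil.virtual_memory().available reports a comparable figure and also works outside Linux.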

Related

How to solve Python RAM leak when running large script

I have a massive Python script I inherited. It runs continuously on a long list of files, opens them, does some processing, creates plots, writes some variables to a new text file, then loops back over the same files (or waits for new files to be added to the list).
My memory usage steadily goes up to the point where my RAM is full within an hour or so. The code is designed to run 24/7/365 and apparently used to work just fine. I see the RAM usage steadily going up in task manager. When I interrupt the code, the RAM stays used until I restart the Python kernel.
I have used sys.getsizeof() to check all my variables and none are unusually large or increasing with time. This is odd: where is the RAM going, then? The text files I am writing to? I have checked and, as far as I can tell, every file creation ends with an f.close() statement, closing the file. Similarly for the plots that I create (I think).
What else would be steadily eating away at my RAM? Any tips or solutions?
What I'd like to do is some sort of "close all open files/figures" command at some point in my code. I am aware of the del command but then I'd have to list hundreds of variables at multiple points in my code to routinely delete them (plus, as I pointed out, I already checked getsizeof and none of the variables are large. Largest was 9433 bytes).
Thanks for your help!
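One thing worth knowing here: sys.getsizeof() reports only the shallow size of an object, so a list or dict can look tiny while the objects it refers to are large. A possible way to see where memory actually grows is the standard-library tracemalloc module, which compares allocation snapshots by source line. This is only a sketch; report_growth and its parameters are hypothetical names, and tracemalloc sees only memory allocated through Python, not memory held by C libraries.

```python
import tracemalloc


def report_growth(work, top=3):
    """Run work() and return its result plus the lines that allocated the most."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = work()  # keep the result alive so its allocations are still counted
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # StatisticDiff entries, largest absolute size change first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:top]
```

In a long-running loop, one could call this around a single iteration and print the returned entries. If the plots come from matplotlib (an assumption; the question does not say), matplotlib.pyplot.close("all") closes every open figure in one call.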

Run a python source code on RAM

I have written some code that does some processing. I want to reduce the program's execution time, and I think this can be done by running it from my RAM, which is 1 GB.
Will running my program from RAM make any difference to the execution time, and if so, how can it be done?
Believe it or not, when you use a modernish computer system, most of your computation is done from RAM. (Well, technically, it's "done" from processor registers, but those are filled from RAM, so let's brush that aside for the purposes of this answer.)
This is thanks to the magic we call caches and buffers. A disk "cache" in RAM is filled by the operating system whenever something is read from permanent storage. Any further reads of that same data (until and unless it is "evicted" from the cache) only read memory instead of the permanent storage medium.
A "buffer" works similarly for write output, with data first being written to RAM and then eventually flushed out to the underlying medium.
So, in the course of normal operation, any runs of your program after the first (unless you've done a lot of work in between), will already be from RAM. Ditto the program's input file: if it's been read recently, it's already cached in memory! So you're unlikely to be able to speed things up by putting it in memory yourself.
Now, if you want to force things for some reason, you can create a "ramdisk", which is a filesystem backed by RAM. In Linux the easy way to do this is to mount "tmpfs" or put files in the /dev/shm directory. Files on a tmpfs filesystem go away when the computer loses power and are entirely stored in RAM, but otherwise behave like normal disk-backed files. From the way your question is phrased, I don't think this is what you want. I think your real answer is "whatever performance problems you think you have, this is not the cause, sorry".
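For completeness, here is a rough sketch of using /dev/shm as scratch space from Python, falling back to the ordinary temp directory when no writable tmpfs is present. The helper names pick_ram_dir and write_scratch are made up for this example.

```python
import os
import tempfile


def pick_ram_dir(candidate="/dev/shm"):
    """Return the tmpfs directory if it is usable, else the normal temp dir."""
    if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        return candidate
    return tempfile.gettempdir()


def write_scratch(data, directory=None):
    """Write bytes to a uniquely named scratch file and return its path."""
    directory = directory or pick_ram_dir()
    fd, path = tempfile.mkstemp(prefix="scratch_", dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path
```

The returned path can be opened like any other file; remember to remove it afterwards, since files on tmpfs occupy RAM until they are deleted.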

Python process consuming increasing amounts of system memory, but heapy shows roughly constant usage

I'm trying to identify a memory leak in a Python program I'm working on. I'm currently running Python 2.7.4 on 64-bit Mac OS. I installed heapy to hunt down the problem.
The program involves creating, storing, and reading a large database using the shelve module. I am not using the writeback option, which I know can create memory problems.
Heapy shows that memory usage stays roughly constant during program execution. Yet my activity monitor shows rapidly increasing memory. Within 15 minutes, the process has consumed all my system memory (16 GB), and I start seeing page outs. Any idea why heapy isn't tracking this properly?
Take a look at this fine article. You are, most likely, not seeing memory leaks but memory fragmentation. The best workaround I have found is to identify what the output of your large working set operation actually is, load the large dataset in a new process, calculate the output, and then return that output to the original process.
This answer has some great insight and an example, as well. I don't see anything in your question that seems like it would preclude the use of PyPy.
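The "do the heavy step in a new process" workaround could look roughly like the sketch below. It starts a fresh interpreter with subprocess, lets the child build the large working set, and passes back only a small JSON summary; when the child exits, all of its memory goes back to the operating system. CHILD_CODE and summarize_in_child are hypothetical, and the list of squares merely stands in for the real dataset. The multiprocessing module is another common way to do the same thing.

```python
import json
import subprocess
import sys

# Hypothetical child program: builds a large working set, prints a small summary.
CHILD_CODE = """
import json, sys
n = int(sys.argv[1])
working_set = [i * i for i in range(n)]  # large, short-lived data
print(json.dumps({"count": len(working_set), "total": sum(working_set)}))
"""


def summarize_in_child(n):
    """Run the heavy step in a fresh interpreter and return only its small result."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(n)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return json.loads(proc.stdout)
```

The parent process never holds the working set itself, so fragmentation caused by it cannot accumulate in the long-running process.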

Python - HardDrive access when opening files

If you open a file for reading, read from it, close it, and then repeat the process (in a loop) does python continually access the hard-drive? Because it doesn't seem to from my experimentation, and I'd like to understand why.
An (extremely) simple example:
import time

while True:
    # Re-open and re-read the whole file on every pass.
    with open('/var/log/messages', 'r') as file:
        stuff = file.read()
    time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.
The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.
Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.
Python does not cache the file; the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43 MB are being used by the OS to buffer data recently written to or read from the hard disk. You can flush the page cache by writing 1 or 3 to /proc/sys/vm/drop_caches (as root); this is a one-off drop, not a switch that turns caching off.

Data persistence for python when a lot of lookups but few writes?

I am working on a project that basically monitors a set of remote directories (FTP, networked paths, and another); if a file is considered new and meets the criteria, we download it and process it. However, I am stuck on the best way to keep track of the files we have already downloaded. I don't want to download any duplicate files, so I need to keep track of what is already downloaded.
Originally I was storing it as a tree:
server->directory->file_name
When the service shuts down it writes the tree to a file, and reads it back when it starts up. However, once there are around 20,000 or so files in the tree, stuff starts to slow down a lot.
Is there a better way to do this?
EDIT
The lookup times start to slow down a lot; my basic implementation is a dict of dicts. Storing the data on disk is fine; the problem is more or less just the lookup time. I know I can optimize the tree and partition it. However, that seems excessive for such a small project, and I was hoping Python would have something like that built in.
I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multiple-level data structure. The tuples will hash into the set and give you an O(1) lookup on average.
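A minimal sketch of that set-of-tuples approach, assuming the whole set fits comfortably in memory. The function names load_seen, save_seen and is_new are invented for the example.

```python
import pickle


def load_seen(path):
    """Load the set of (server, directory, file_name) tuples, or start empty."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return set()


def save_seen(seen, path):
    """Persist the set, for example when the service shuts down."""
    with open(path, "wb") as f:
        pickle.dump(seen, f, protocol=pickle.HIGHEST_PROTOCOL)


def is_new(seen, server, directory, file_name):
    """Record the file and return True only the first time it is seen."""
    key = (server, directory, file_name)
    if key in seen:
        return False
    seen.add(key)
    return True
```

At startup the service would call load_seen once, use is_new for each candidate file, and call save_seen on shutdown.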
You mention "stuff starts to slow down a lot," but you don't say whether it's the reading and writing time or the lookup time that is slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?
One way to get back some memory is to intern() the server names (sys.intern() in Python 3). This way, each server name will be stored only once in memory.
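As a small illustration of interning, assuming Python 3 (where the function lives in the sys module): every key built by the hypothetical make_key helper below shares one string object per distinct server name.

```python
import sys


def make_key(server, full_path):
    """Build a lookup key whose server name is shared across all keys."""
    return (sys.intern(server), full_path)
```

Server names read from a config file or parsed from URLs are otherwise separate string objects even when they are equal, which is where the saving comes from.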
An interesting alternative is to use a Bloom filter. This will let you use far less memory, but it can give false positives: it will occasionally report a file as already downloaded when it was not, so that file would be skipped. This might be a reasonable trade-off, depending on whether you can tolerate missing the occasional file.
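To make the idea concrete, here is a toy Bloom filter built only on hashlib. It is a sketch, not a tuned implementation: the class name, the default sizes, and the double-hashing scheme are all choices made for this example. A Bloom filter never reports an added item as absent, but it can report an absent item as present.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash positions per item."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit slices of one SHA-256 digest
        # (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        """True if all of the item's bits are set (may be a false positive)."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

With the defaults, the filter occupies 128 KiB regardless of how many names are added; the false-positive rate rises as more items go in.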
