Freeing up buffer space after use in Python?

So I'm using Google Cloud Datalab and I use the %%storage read command to read a large file (2,000,000 rows) into the text variable, and then I have to process it into a pandas DataFrame using BytesIO, e.g. df_new = pd.read_csv(BytesIO(text)).
Now I don't need the text variable or its contents around (all further processing is done on df_new). How can I delete it (text) and free up the memory? I sure don't need two copies of a 2-million-record dataset hanging around...

Use del followed by forced garbage collection.
import gc
# Remove text variable
del text
# Force a garbage collection pass - this is not strictly necessary, but may be useful.
gc.collect()
Note that you may not see the process size decrease and memory return to the OS; this depends on the memory allocator used (which in turn depends on the OS, the core libraries in use, and the Python compilation options).
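For the scenario in the question, a minimal sketch of the whole flow might look like this (assuming text already holds the raw CSV bytes loaded by %%storage read; the commented-out psutil check is purely illustrative):
from io import BytesIO
import gc

import pandas as pd
# import os, psutil  # optional, only for inspecting process memory

df_new = pd.read_csv(BytesIO(text))  # parse the raw bytes into a DataFrame

del text      # drop the last reference to the raw buffer
gc.collect()  # ask the collector to reclaim it now rather than later

# Optional sanity check (resident set size in bytes); the number may not drop
# even after collection, because the allocator can keep freed pages for reuse.
# print(psutil.Process(os.getpid()).memory_info().rss)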

Related

Python lazy load and unload large files on access

I have a large list of files (10-1000) whose contents I want to lazily load into memory on access. I have enough memory to load the contents of each individual file, but not enough to load the contents of all the files into memory simultaneously.
(To be more specific, these are pickle files around 1-8GB each, containing all sorts of data that I want to use in a Jupyter Notebook. But the solution should ideally be general to any type of file, and not be limited to Jupyter Notebooks)
In the end, I want to be able to write code similar to this
a = file_contents[0]['property1'] # transparently loads file0 into memory
b = file_contents[0]['property2'] # keeps file0 in memory, no disk access
c = file_contents[5]['property1'] # unloads file0, then loads file5 into memory
d = file_contents[0]['property3'] # unloads file5, then loads file0 back into memory
where the interface transparently loads and unloads files as necessary to get the requested data structures, while keeping memory usage reasonable. Ideally the interface should also behave like normal array or dictionary access.
I can of course write my own class, but I'm interested in whether there is already a more robust implementation or a better idiom for this kind of behavior.
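No existing library is named in the thread, but a minimal sketch of the kind of wrapper the question describes might look like this (the class name, the pickle file paths, and the keep-only-one-file cache policy are assumptions for illustration):
import pickle

class LazyFileContents(object):
    """Index-based access to a list of pickle files, keeping at most one
    file's contents in memory at a time."""

    def __init__(self, paths):
        self._paths = paths
        self._cached_index = None
        self._cached_data = None

    def __getitem__(self, index):
        if index != self._cached_index:
            self._cached_data = None  # unload the previously cached file first
            with open(self._paths[index], 'rb') as f:
                self._cached_data = pickle.load(f)
            self._cached_index = index
        return self._cached_data

# Usage, mirroring the example above:
# file_contents = LazyFileContents(['file0.pkl', 'file1.pkl', 'file5.pkl'])
# a = file_contents[0]['property1']  # loads file0
# c = file_contents[2]['property1']  # drops file0, loads the next file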

Python: How to link a filename to a memory file

What I want to realize has the following feature:
A Python program (or process, thread, ...) creates a memory file that can be read or written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on disk with a filename, and this interface is linked to the memory file. Read and write operations on the interface are possible.
Why not use IO?
The memory file will be an input of another program (not Python). So a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of a tempfile differs between operating systems (right?), and in some occasional cases, such as OS-level interruptions, data may remain on disk. So having the program hold the data seems more secure (at least to an extent).
Anyway I just want a try to see if tempfile can be avoided.
You could consider using a named pipe (created with mkfifo). Another option is to create an actual file which both programs open; once both have opened it, you can unlink it so that it's no longer accessible on disk.
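A minimal sketch of both suggestions on a POSIX system (the paths and payload are placeholders; note that opening a FIFO for writing blocks until the other program opens it for reading):
import os

# Named-pipe variant: the consumer opens the same path and reads a stream.
fifo_path = '/tmp/shared_fifo'       # placeholder path
os.mkfifo(fifo_path)
with open(fifo_path, 'wb') as fifo:  # blocks until a reader connects
    fifo.write(b'data produced in memory')
os.unlink(fifo_path)                 # remove the name when done

# Unlink variant: create a real file, wait until both programs have opened
# it, then unlink it so the name disappears; the data lives only as long as
# the open file handles do.
path = '/tmp/shared_file'            # placeholder path
f = open(path, 'w+b')
# ... hand `path` to the other program and wait until it has opened the file ...
os.unlink(path)
f.write(b'data visible only to processes that already hold a handle')
f.close()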

Memory leak with PyYAML

I think that I'm having a memory leak when loading a .yml file with the PyYAML library.
I've followed these steps:
import yaml
d = yaml.load(open(filename, 'r'))
The memory used by the process (measured with top or htop) has grown from 60K to 160M, while the size of the file is less than 1M.
Then, I've run the following command:
sys.getsizeof(d)
And it has returned a value lower than 400K.
I've also tried to use the garbage collector with gc.collect(), but nothing has happened.
As you can see, it seems that there's a memory leak, but I don't know what is producing it, nor do I know how to free this amount of memory.
Any idea?
Your approach doesn't show a memory leak; it just shows that PyYAML uses a lot of memory while processing a moderately sized YAML file.
If you were to do:
import yaml
X = 10
for x in range(X):
    d = yaml.safe_load(open(filename, 'r'))
and the memory used by the program changed depending on what you set X to, then there would be reason to assume there is a memory leak.
In the tests that I ran this is not the case. It is just that the default Loader and SafeLoader take about 330x the file size in memory (based on an arbitrary 1 MB simple, i.e. no tags, YAML file) and the CLoader about 145x that file size.
Loading the YAML data multiple times doesn't increase that, so load() gives back the memory it uses, which means there is no memory leak.
That is not to say that it isn't an enormous amount of overhead, though.
(I am using safe_load() since PyYAML's documentation indicates that load() is not safe on uncontrolled input files.)
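A minimal way to run that check yourself, assuming a Linux box where ru_maxrss is reported in kilobytes (the filename is a placeholder):
import resource

import yaml

filename = 'data.yml'  # placeholder path

for x in range(10):
    with open(filename) as fp:
        d = yaml.safe_load(fp)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # If the peak keeps climbing on every iteration, suspect a leak;
    # if it plateaus after the first load, the memory is being reused.
    print(x, peak_kb)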

Optimization of memory usage while copying buffers in python

I have to copy a file and do some simple processing on it. I cannot read the whole file into memory because it is too big. I came up with a piece of code which looks like this:
buffer = inFile.read(buffer_size)
while len(buffer) > 0:
    outFile.write(buffer)
    simpleCalculations(buffer)
    buffer = inFile.read(buffer_size)
The simpleCalculations procedure is irrelevant in this context, but I am worried about the repeated memory allocations for buffer. On some hardware configurations memory usage gets very high, and that apparently kills the machine. I would like to reuse buffer. Is this possible in Python 2.6?
I don't think there's any easy way around this. The file.read() method just returns a new string each time you call it. On the other hand, you don't really need to worry about running out of memory -- once you assign buffer to the newly-read string, the previously-read string no longer has any references to it, so its memory gets freed automatically (see here for more details).
Python being a strictly reference-counted environment, your buffer will be deallocated as soon as you no longer have any references to it.
If you're worried about physical RAM but have spare address space, you could mmap your file rather than reading it in a bit at a time.
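A minimal sketch of the mmap suggestion for the Python 2.6 setting of the question (the filenames and window size are placeholders, and simpleCalculations stands in for the question's processing):
import mmap

buffer_size = 1024 * 1024   # window size; the value is an assumption

def simpleCalculations(chunk):
    pass                    # stand-in for the question's processing

inFile = open('input.bin', 'rb')     # placeholder filenames
outFile = open('output.bin', 'wb')
try:
    mapped = mmap.mmap(inFile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        for start in xrange(0, len(mapped), buffer_size):
            # Each slice is still a new string, but it is small and
            # short-lived; the mapping lets the OS page the file data
            # in and out instead of the process holding it all.
            chunk = mapped[start:start + buffer_size]
            outFile.write(chunk)
            simpleCalculations(chunk)
    finally:
        mapped.close()
finally:
    inFile.close()
    outFile.close()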

Purging numpy.memmap

Given a numpy.memmap object created with mode='r' (i.e. read-only), is there a way to force it to purge all loaded pages out of physical RAM, without deleting the object itself?
In other words, I'd like the reference to the memmap instance to remain valid, but all physical memory that's being used to cache the on-disk data to be uncommitted. Any views onto the memmap array must also remain valid.
I am hoping to use this as a diagnostic tool, to help separate "real" memory requirements of a script from "transient" requirements induced by the use of memmap.
I'm using Python 2.7 on RedHat.
If you run "pmap SCRIPT-PID", the "real" memory shows as "[ anon ]" blocks, and all memory-mapped files show up with the file name in the last column.
Purging the pages is possible at the C level, if you manage to get hold of a pointer to the beginning of the mapping and call madvise(ptr, length, MADV_DONTNEED) on it, but it's going to be kludgy.
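A hedged sketch of that C-level route via ctypes on Linux (the MADV_DONTNEED value and the helper name are assumptions; for a read-only, file-backed mapping the dropped pages are simply re-read from disk on the next access):
import ctypes
import ctypes.util
import mmap

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
MADV_DONTNEED = 4   # Linux value; verify it for your platform

def purge_memmap(arr):
    """Advise the kernel to drop the cached pages backing a read-only
    numpy.memmap (or a view on one). The array and its views stay valid."""
    addr = arr.ctypes.data
    length = arr.nbytes
    # madvise wants a page-aligned address, so round down and extend the length
    page = mmap.PAGESIZE
    aligned = addr - (addr % page)
    length += addr - aligned
    if libc.madvise(ctypes.c_void_p(aligned), ctypes.c_size_t(length),
                    ctypes.c_int(MADV_DONTNEED)) != 0:
        raise OSError(ctypes.get_errno(), 'madvise failed')

# Usage sketch:
# mm = numpy.memmap('big.dat', dtype='uint8', mode='r')
# purge_memmap(mm)   # mm remains usable; pages are paged back in on access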
