Memory usage increases when reloading the same file - python

Each time I load an npy file, the memory usage increases. The following mini-example illustrates the problem:
import numpy as np
X = np.random.randn(10000,10000)
np.save('tmp.npy',X)
Now, if the following line is executed several times, the memory usage increases each time:
y=np.load('tmp.npy')
I found a very similar problem with an npz file here, yet the solution there relies on a function that is not applicable to npy files.
Any idea?

The premise is flawed: memory usage does temporarily increase when loading the file, and may increase again the second time, and perhaps even the third, but eventually the garbage collector will run and the memory will be freed.
If you don't want to wait an indeterminate amount of time for the memory to be reclaimed, you can explicitly force the garbage collector to run whenever you need to reclaim memory:
import gc
gc.collect()
You can also explicitly delete the array after loading it, if you no longer need the data:
del y
But if you do neither of these things, and simply load the same data over and over forever, memory usage will not grow forever--at some point the garbage collector will run and the memory usage will shrink. This happens automatically, and you usually do not need to worry about it.
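As a minimal sketch of that behaviour (the file name and loop count are just placeholders), the following loop keeps memory roughly flat: rebinding y drops the reference to the previously loaded array, and the explicit del/gc.collect() calls are only needed if you want the memory back immediately:
import gc
import numpy as np

for _ in range(100):
    y = np.load('tmp.npy')  # rebinding y drops the reference to the previously loaded array
    # ... use y here ...
    del y          # optional: drop the last reference explicitly
    gc.collect()   # optional: force a collection instead of waiting for one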

Related

Python: MemoryError (script runs sometimes)

I have a script which sometimes runs successfully, providing the desired output, but when rerun moments later it provides the following error:
numpy.core._exceptions.MemoryError: Unable to allocate 70.8 MiB for an array with shape (4643100, 2) and data type float64
I realise this question has been answered several times (like here), but so far none of the solutions have worked for me. I was wondering if anyone has any idea how it's possible that sometimes the script runs fine and then moments later it provides an error?
I have lowered my computer's RAM usage, increased the virtual memory, and rebooted my laptop, none of which seemed to help (Windows 10, 8.0 GB RAM, Python 3.9.2 32-bit).
PS: Unfortunately it is not possible to share the script or create a dummy example.
Python is a garbage-collected language, and garbage collection is non-deterministic. This means that peak memory usage may be different each time a program is run: the first time you run the program, its peak memory usage stays below the available memory, but the next time, its peak usage may be enough to consume all of it. That reasoning also assumes the available memory on the host system is constant, which is not the case. The fluctuation in available memory, i.e. the memory not in use by the other running processes, is another reason the program may raise a MemoryError one time but terminate without error another time.
Sidenote: Increase virtual memory as a last resort. It isn't memory, it's disk that is used like memory, and it is much slower than memory.
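To see how much memory is actually free at the moment the allocation happens, a small check like the one below can help (just a sketch; it assumes the psutil package, which the question itself does not mention):
import psutil
import numpy as np

# Memory currently not in use by other processes; this number fluctuates between runs.
available_mib = psutil.virtual_memory().available / (1024 * 1024)
print("Available memory: %.1f MiB" % available_mib)

# The allocation from the traceback needs roughly 70.8 MiB.
arr = np.empty((4643100, 2), dtype=np.float64)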

Memory leak on pickle inside a for loop forcing a memory error

I have huge array objects that are pickled with the Python pickler.
I am trying to unpickle them and read out the data in a for loop.
Every time I am done reading and assessing, I delete all the references to those objects.
After deletion, I even call gc.collect() along with time.sleep() to see if the heap memory reduces.
The heap memory doesn't shrink, which points to the fact that the data is still referenced somewhere within the pickle loading. After 15 data files (I have 250+ files to process, 1.6 GB each) I hit the memory error.
I have seen many other questions here, pointing out a memory leak issue which was supposedly solved.
I don't understand what is exactly happening in my case.
Python's memory management generally does not return freed memory to the OS while the process is still running.
Running the for loop in a subprocess that calls the script solved the issue for me.
Thanks for the feedback.
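A minimal sketch of that idea, using multiprocessing instead of a shell-level subprocess and a hypothetical process_one_file() helper: because each file is handled in a short-lived worker process, all of its heap memory is returned to the OS when the worker exits.
import multiprocessing
import pickle

def process_one_file(path):
    # Unpickle and process a single data file inside a short-lived worker.
    with open(path, 'rb') as f:
        data = pickle.load(f)
    # ... read out and assess the data here ...

if __name__ == '__main__':
    paths = ['data_%03d.pkl' % i for i in range(250)]  # hypothetical file names
    for path in paths:
        p = multiprocessing.Process(target=process_one_file, args=(path,))
        p.start()
        p.join()  # the worker's memory is released to the OS here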

Profile memory. Find memory leak in loop

This question has already been asked a few times and I have already tried some of the suggested methods. Unfortunately, I still can't find out why my Python process uses so much memory.
My setup: Python 3.5.2, Windows 10, and a lot of third-party packages.
The actual memory usage of the process is 300 MB (way too much, and sometimes it even explodes to 32 GB):
process = psutil.Process(os.getpid())
memory_real = process.memory_info().rss/(1024*1024) #--> 300 Mb
What I tried so far:
memory line profiler (didn't help me)
tracemalloc.start(50) and then
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    log_and_print(stat)
gives just a few MB as a result
gc.collect()
import objgraph
objgraph.show_most_common_types()
returns:
function 51791
dict 32939
tuple 28825
list 13823
set 10748
weakref 10551
cell 7870
getset_descriptor 6276
type 6088
OrderedDict 5083
(when the process used 200 MB, the numbers above were even higher)
pympler: the process exits with some error code
So I'm really struggling to find out where the memory of the process is allocated. Am I doing something wrong, or is there some easy way to find out what is going on?
PS:
I was able to solve this problem by luck: it was a badly coded while loop in which a list kept being extended without a proper break condition.
Anyway, is there a way to find such memory leaks? What I often see is that some memory profiling package has to be called explicitly; in this case I wouldn't have had a chance to make a memory dump or check the memory in the main thread, since the loop is never left.
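One way to inspect memory even though the main loop never returns (a sketch, not something from the original thread) is a daemon watchdog thread that periodically dumps the largest allocation sites with tracemalloc; a list that keeps growing shows up in these dumps, attributed to the line that extends it:
import threading
import time
import tracemalloc

def dump_top_allocations(interval=60):
    # Runs forever in a daemon thread and prints the biggest allocation sites.
    while True:
        time.sleep(interval)
        snapshot = tracemalloc.take_snapshot()
        print("[ Top 10 ]")
        for stat in snapshot.statistics('lineno')[:10]:
            print(stat)

tracemalloc.start(25)
threading.Thread(target=dump_top_allocations, daemon=True).start()

# ... the rest of the program, including the loop that never exits ...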

Memory leak with PyYAML

I think I'm having a memory leak when loading a .yml file with the PyYAML library.
I've followed these steps:
import yaml
d = yaml.load(open(filename, 'r'))
The memory used by the process (obtained with top or htop) has grown from 60 KB to 160 MB, while the size of the file is less than 1 MB.
Then, I've run the following command:
sys.getsizeof(d)
It has returned a value lower than 400 KB.
I've also tried to use the garbage collector with gc.collect(), but nothing has happened.
As you can see, it seems that there's a memory leak, but I don't know what is producing it, nor do I know how to free this memory.
Any idea?
Your approach doesn't show a memory leak, it just shows that PyYAML uses a lot of memory while processing a moderately sized YAML file.
If you were to do the following:
import yaml
X = 10
for x in range(X):
    d = yaml.safe_load(open(filename, 'r'))
and the memory used by the program changed depending on what you set X to, then there would be reason to assume there is a memory leak.
In tests that I ran this is not the case. It is just that the default Loader and SafeLoader take about 330x the file size in memory (based on an arbitrary 1 MB simple, i.e. tag-free, YAML file) and the CLoader about 145x the file size.
Loading the YAML data multiple times doesn't increase that, so load() gives back the memory it uses, which means there is no memory leak.
That is not to say that this isn't an enormous amount of overhead.
(I am using safe_load() as PyYAML's documentation indicates that load() is not safe on uncontrolled input files.)
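A rough way to repeat that test (a sketch assuming the psutil package and a placeholder file name) is to watch the process RSS across repeated loads; with no leak, the figure levels off instead of climbing on every iteration:
import os
import psutil
import yaml

proc = psutil.Process(os.getpid())
filename = 'data.yml'  # placeholder input file

for i in range(10):
    with open(filename) as f:
        d = yaml.safe_load(f)
    rss_mib = proc.memory_info().rss / (1024 * 1024)
    print("iteration %d: RSS = %.1f MiB" % (i, rss_mib))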

Optimization of memory usage while copying buffers in python

I have to copy a file and do some simple processing on it. I cannot read the whole file into memory because it is too big. I came up with a piece of code which looks like this:
buffer = inFile.read(buffer_size)
while len(buffer) > 0:
    outFile.write(buffer)
    simpleCalculations(buffer)
    buffer = inFile.read(buffer_size)
The simpleCalculations procedure is irrelevant in this context, but I am worried about the repeated memory allocations for buffer. On some hardware configurations memory usage gets very high and that apparently kills the machine. I would like to reuse the buffer. Is this possible in Python 2.6?
I don't think there's any easy way around this. The file.read() method just returns a new string each time you call it. On the other hand, you don't really need to worry about running out of memory -- once you assign buffer to the newly-read string, the previously-read string no longer has any references to it, so its memory gets freed automatically (see here for more details).
Python being a strictly reference-counted environment, your buffer will be deallocated as soon as you no longer have any references to it.
If you're worried about physical RAM but have spare address space, you could mmap your file rather than reading it in a bit at a time.
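A sketch of the mmap suggestion (placeholder file names; buffer_size and simpleCalculations are the question's own) could look like this; the OS pages the file into the address space on demand, so only the slice currently being processed is materialised as a new string at any one time:
import mmap

buffer_size = 1024 * 1024  # same chunking idea as in the question

inFile = open('input.bin', 'rb')    # placeholder input file
outFile = open('output.bin', 'wb')  # placeholder output file
try:
    mapped = mmap.mmap(inFile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        offset = 0
        while offset < len(mapped):
            chunk = mapped[offset:offset + buffer_size]  # a short-lived copy of one slice
            outFile.write(chunk)
            simpleCalculations(chunk)  # the question's own processing step
            offset += buffer_size
    finally:
        mapped.close()
finally:
    inFile.close()
    outFile.close()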
