Is freeing handled differently for small/large numpy arrays?

Is freeing handled differently for small/large numpy arrays? - python

I am trying to debug a memory problem with my large Python application. Most of the memory is in numpy arrays managed by Python classes, so Heapy etc. are useless, since they do not account for the memory in the numpy arrays. So I tried to manually track the memory usage using the MacOSX (10.7.5) Activity Monitor (or top if you will). I noticed the following weird behavior. On a normal python interpreter shell (2.7.3):
import numpy as np # 1.7.1
# Activity Monitor: 12.8 MB
a = np.zeros((1000, 1000, 17)) # a "large" array
# 142.5 MB
del a
# 12.8 MB (so far so good, the array got freed)
a = np.zeros((1000, 1000, 16)) # a "small" array
# 134.9 MB
del a
# 134.9 MB (the system didn't get back the memory)
import gc
gc.collect()
# 134.9 MB
No matter what I do, the memory footprint of the Python session will never go below 134.9 MB again. So my question is:
Why are the resources of arrays larger than 1000x1000x17x8 bytes (found empirically on my system) properly given back to the system, while the memory of smaller arrays appears to be stuck with the Python interpreter forever?
This does appear to ratchet up, since in my real-world applications, I end up with over 2 GB of memory I can never get back from the Python interpreter. Is this intended behavior that Python reserves more and more memory depending on usage history? If yes, then Activity Monitor is just as useless as Heapy for my case. Is there anything out there that is not useless?

Reading from Numpy's policy for releasing memory it seems like numpy does not have any special handling of memory allocation/deallocation. It simply calls free() when the reference count goes to zero. In fact it's pretty easy to replicate the issue with any built-in python object. The problem lies at the OS level.
Nathaniel Smith has written an explanation of what is happening in one of his replies in the linked thread:
In general, processes can request memory from the OS, but they cannot
give it back. At the C level, if you call free(), then what actually
happens is that the memory management library in your process makes a
note for itself that that memory is not used, and may return it from a
future malloc(), but from the OS's point of view it is still
"allocated". (And python uses another similar system on top for
malloc()/free(), but this doesn't really change anything.) So the OS
memory usage you see is generally a "high water mark", the maximum
amount of memory that your process ever needed.
The exception is that for large single allocations (e.g. if you create
a multi-megabyte array), a different mechanism is used. Such large
memory allocations can be released back to the OS. So it might
specifically be the non-numpy parts of your program that are producing
the issues you see.
So, it seems like there is no general solution to the problem .Allocating many small objects will lead to a "high memory usage" as profiled by the tools, even thou it will be reused when needed, while allocating big objects wont show big memory usage after deallocation because memory is reclaimed by the OS.
You can verify this allocating built-in python objects:
In [1]: a = [[0] * 100 for _ in range(1000000)]
In [2]: del a
After this code I can see that memory is not reclaimed, while doing:
In [1]: a = [[0] * 10000 for _ in range(10000)]
In [2]: del a
the memory is reclaimed.
To avoid memory problems you should either allocate big arrays and work with them(maybe use views to "simulate" small arrays?), or try to avoid having many small arrays at the same time. If you have some loop that creates small objects you might explicitly deallocate objects not needed at every iteration instead of doing this only at the end.
I believe Python Memory Management gives good insights on how memory is managed in python. Note that, on top of the "OS problem", python adds another layer to manage memory arenas, which can contribute to high memory usage with small objects.

Related

Understanding memory growth of a Python process (VmRSS vs gc.get_objects)

Without going into algorithmic details, lets just say that my code sequentially processes a list of inputs:
inputs = [2,5,6,7,8,10,12,13,14,15,16,17,18,19,20,21]
for i in inputs:
process_input(i)
For simplicity, lets consider process_input to be a state-less black-box.
I know that this site is full of questions about finding memory leaks in Python code, but this is not what this question is about. Instead, I'm trying to understand the memory consumption of my code over time and whether it might suffer from leaking memory.
In particular, I'm trying to understand a discrepancy of two distinct indicators of memory usage:
The number of allocated objects (reported by gc.get_objects) and
the actually used amount of physical memory (read from VmRSS on a Linux system).
To study these two indicators, I expanded the original code from above as follows:
import time, gc
def get_current_memory_usage():
with open('/proc/self/status') as f:
memusage = f.read().split('VmRSS:')[1].split('\n')[0][:-3]
return int(memusage.strip()) / (1024 ** 2)
inputs = [2,5,6,7,8,10,12,13,14,15,16,17,18,19,20,21]
gc.collect()
last_object_count = len(gc.get_objects())
for i in inputs:
print(f'\nProcessing input {i}...')
process_input(i)
gc.collect()
time.sleep(1)
memory_usage = get_current_memory_usage()
object_count = len(gc.get_objects())
print(f'Memory usage: {memory_usage:.2f} GiB')
print(f'Object count: {object_count - last_object_count:+}')
last_object_count = object_count
Note that process_input is state-less, i.e. the order of the inputs does not matter. Thus, we would expect both indicators to be about the same before running process_input and afterwards, right? Indeed, this is what I observe for the number of allocated objects. However, the consumption of memory grows steadily:
Now my core question: Do these observations indicate a memory leak? To my understanding, memory leaking in Python would be indicated by a growth of allocated objects, which we do not observe here. On the other hand, why does the memory consumption grow steadily?
For further investigation, I also ran a second test. For this test, I repeatedly invoked process_input(i) using a fixed input i (five times each) and recorded the memory consumption in between of the iterations:
For i=12, the memory consumption remained constant at 10.91 GiB.
For i=14, the memory consumption remained constant at 7.00 GiB.
I think, these observations make the presence of a memory leak even more unlikely, right? But then, what could be a possible explanation for why the memory consumption is not falling in between of the iterations, given that process_input is state-less?
The system has 32 GiB RAM in total and is running Ubuntu 20.04. Python version is 3.6.10. The process_input function uses several third-party libraries.

In general RSS is not a particularly good indicator because it is "resident" set size and even a rather piggish process, in terms of committed memory, can have a modest RSS as memory can be swapped out. You can look at /proc/self/smaps and add up the size of the writable regions to get a much better benchmark.
On the other hand, if there is actually growth, and you want to understand why, you need to look at the actual dynamically allocated memory. What I'd suggest for this is using https://github.com/vmware/chap
To do this, just make that 1 second sleep a bit longer, put a print just before the call to sleep, and use gcore from another session to gather a live core during a few of those sleeps.
So lets say you have cores gathered from when the input was 14 and when it was 21. Look at each of the cores using chap, for example, with the following commands:
count used
That will give you a good view of allocations that have been requested but not released. If the numbers are much larger for the later core, you probably have some kind of growth issue. If those numbers do differ by quite a lot, use
summarize used
If you have growth, it is possible that there is a leak (as opposed to some container simply expanding). To check this, you can try commands like
count leaked
show leaked
From there you should probably look at the documentation, depending on what you find.
OTOH if used allocations are not the issue, maybe try the following, to see memory for allocations that have been released but are part of larger regions of memory that cannot be given back to the operating system because parts of those regions are still in use:
count free
summarize free
If neither "used" allocations or "free" allocations are the issue, you might try:
summarize writable
That is a very high level view of all writable memory. For example, you can see things like stack usage...

Jupyter Notebook Memory Management

I am currently working on a jupyter notebook in kaggle. After performing the desired transformations on my numpy array, I pickled it so that it can be stored on disk. The reason I did that is so that I can free up the memory being consumed by the large array.
The memory consumed after pickling the array was about 8.7 gb.
I decided to run this code snippet provided by #jan-glx here , to find out what variables were consuming my memory:
import sys
def sizeof_fmt(num, suffix='B'):
''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
if abs(num) < 1024.0:
return "%3.1f %s%s" % (num, unit, suffix)
num /= 1024.0
return "%.1f %s%s" % (num, 'Yi', suffix)
for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
key= lambda x: -x[1])[:10]:
print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
After performing this step I noticed that the size of my array was 3.3 gb, and the size of all the other variables summed together was about 0.1 gb.
I decided to delete the array and see if that would fix the problem, by performing the following:
del my_array
gc.collect()
After doing this, the memory consumption decreased from 8.7 gb to 5.4 gb. Which in theory makes sense, but still didn't explain what the rest of the memory was being consumed by.
I decided to continue anyways and reset all my variables to see whether this would free up the memory or not with:
%reset
As expected it freed up the memory of the variables that were printed out in the function above, and I was still left with 5.3 gb of memory in use.
One thing to note is that I noticed a memory spike when pickling the file itself, so a summary of the process would be something like this:
performed operations on array -> memory consumption increased from about 1.9 gb to 5.6 gb
pickled file -> memory consumption increased from 5.6 gb to about 8.7 gb
Memory spikes suddenly while file is being pickled to 15.2 gb then drops back to 8.7 gb.
deleted array -> memory consumption decreased from 8.7 gb to 5.4 gb
performed reset -> memory consumption decreased from 5.4 gb to 5.3 gb
Please note that the above is loosely based of monitoring the memory on kaggle and may be inaccurate.
I have also checked this question but it was not helpful for my case.
Would this be considered a memory leak? If so, what do I do in this case?
EDIT 1:
After some further digging, I noticed that there are others facing this problem. This problem stems from the pickling process, and that pickling creates a copy in memory but, for some reason, does not release it. Is there a way to release the memory after the pickling process is complete.
EDIT 2:
When deleting the pickled file from disk, using:
!rm my_array
It ended up freeing the disk space and freeing up space on memory as well. I don't know whether the above tidbit would be of use or not, but I decided to include it anyways as every bit of info might help.

There is one basic drawback that you should be aware of: The CPython interpreter actually can actually barely free memory and return it to the OS. For most workloads, you can assume that memory is not freed during the lifetime of the interpreter's process. However, the interpreter can re-use the memory internally. So looking at the memory consumption of the CPython process from the operating system's perspective really does not help at all. A rather common work-around is to run memory intensive jobs in a sub-process / worker process (via multiprocessing for instance) and "only" return the result to the main process. Once the worker dies, the memory is actually freed.
Second, using sys.getsizeof on ndarrays can be impressively misleading. Use the ndarray.nbytes property instead and be aware that this may also be misleading when dealing with views.
Besides, I am not entirely sure why you "pickle" numpy arrays. There are better tools for this job. Just to name two: h5py (a classic, based on HDF5) and zarr. Both libraries allow you to work with ndarray-like objects directly on disk (and compression) - essentially eliminating the pickling step. Besides, zarr also allows you to create compressed ndarray-compatible data structures in memory. Must ufuncs from numpy, scipy & friends will happily accept them as input parameters.

memory management in python using memory_profiler

While reading an article on memeory management in python came across few doubts:
import copy
import memory_profiler
#profile
def function():
x = list(range(1000000)) # allocate a big list
y = copy.deepcopy(x)
del x
return y
if __name__ == "__main__":
function()
$:python -m memory_profiler memory_profiler_demo.py
Filename: memory_profiler_demo.py
Line # Mem usage Increment Line Contents
================================================
4 30.074 MiB 30.074 MiB #profile
5 def function():
6 61.441 MiB 31.367 MiB x = list(range(1000000)) # allocate a big list
7 111.664 MiB 50.223 MiB y = copy.deepcopy(x)#doubt 1
8 103.707 MiB -7.957 MiB del x #doubt 2
9 103.707 MiB 0.000 MiB return
so i have the doubts on line 7 why it took more size to copy the list and second doubt on line 8 why it only frees 7 MiB.

First, let's start with why line 8 only frees 7MiB.
Once you allocate a bunch of memory, Python and your OS and/or malloc library both guess that you're likely to allocate a bunch of memory again. On modern platforms, it's a lot faster to reuse that memory in-process than to release it and reallocate it from scratch, while it costs very little to keep extra unused pages of memory in your process's space, so it's usually the right tradeoff. (But of course usually != always, and the blog you linked seems to be in large part about how to work out that you're building an application where it's not the right tradeoff and what to do about it.)
A default build of CPython on Linux virtually never releases any memory. On other POSIX (including Mac) it almost never releases any memory. On Windows, it does release memory more often—but there are still constraints. Basically, if a single allocation from Windows has any piece in use (or even in the middle of a freelist chain), that allocation can't be returned to Windows. So, if you're fragmenting memory (which you usually are), that memory can't be freed. The blog post you linked to explains this to some extent, and there are much better resources than an SO answer to explain further.
If you really do need to allocate a lot of memory for a short time, release it, and never use it again, without holding onto all those pages, there's a common Unix idiom for that—you fork, then do the short-term allocation in the child and exit after passing back the small results in some way. (In Python, that usually means using multiprocessing.Process instead of os.fork directly.)
Now, why does your deepcopy take more memory than the initial construction?
I tested your code on my Mac laptop with python.org builds of 2.7, 3.5, and 3.6. What I found was that the list construction takes around 38MiB (similar to what you're seeing), while the copy takes 42MiB on 2.7, 31MiB on 3.5, and 7MB on 3.6.
Slightly oversimplified, here's the 2.7 behavior: The functions in copy just call the type's constructor on an iterable of the elements (for copy) or recursive copies of them (for deepcopy). For list, this means creating a list with a small starting capacity and then expanding it as it appends. That means you're not just creating a 1M-length array, you're also creating and throwing away arrays of 500K, 250K, etc. all the way down. The sum of all those lengths is equivalent to a 2M-length array. Of course you don't really need the sum of all of them—only the most recent array and the new one are ever live at the same time—but there's no guarantee the old arrays will be freed in a useful way that lets them get reused. (That might explain why I'm seeing about 1.5x the original construction while you're seeing about 2x, but I'd need a lot more investigation to bet anything on that part…)
In 3.5, I believe the biggest difference is that a number of improvements over the 5 years since 2.7 mean that most of those expansions now get done by realloc, even if there is free memory in the pool that could be used instead. That changes a tradeoff that favored 32-bit over 64-bit on modern platforms into one that works the other way round—in 64-bit linux/Mac/Windows: there are often going to be free pages that can be tossed onto the end of an existing large alloc without remapping its address, so most of those reallocs mean no waste.
In 3.6, the huge change is probably #26167. Oversimplifying again, the list type knows how to copy itself by allocating all in one go, and the copy methods now take advantage of that for list and a few other builtin types. Sometimes there's no reallocation at all, and even when there is, it's usually with the special-purpose LIST_APPEND code (which can be used when you can assume nobody outside the current function has access to the list yet) instead of the general-purpose list.append code (which can't).

Memory error with np array when making document term matrix in python 2.7

I am using matrix = np.array(docTermMatrix) to make DTM. But sometimes it will run into memory error problems at this line. How can I prevent this from happening?

I assume you are using 32bit python. 32bit python limits your program ram memory to 2 gb (all 32bit programs have this as a hard limit), some of this is taken up by python overhead, more of this is taken up by your program. normal python objects do not need contiguous memory and will map disparate regions of memory
numpy.arrays require contiguous memory allocation, this is much harder to allocate. aditionally np.array(a) + 1 creates a 2nd array and must allocate again a huge contiguous block (in fact most operations).
some possible solutions that come to mind
use 64 bit python ... this will give you orders of magnitude more ram to work with ... you will be unlikely to encounter a memory error with this unless you have a really really really big array (so much so that numpy is probably not the right solution)
use multiprocessing to create a new process with a new 2gb limit that just does the numpy processing stuff
use a different solution than numpy( ie a database)

Memory management in numpy arrays,python

I get a memory error when processing very large(>50Gb) file (problem: RAM memory gets full).
My solution is: I would like to read only 500 kilo bytes of data once and process( and delete it from memory and go for next 500 kb). Is there any other better solution? or If this solution seems better , how to do it with numpy array?
It is just 1/4th the code(just for an idea)
import h5py
import numpy as np
import sys
import time
import os
hdf5_file_name = r"test.h5"
dataset_name = 'IMG_Data_2'
file = h5py.File(hdf5_file_name,'r+')
dataset = file[dataset_name]
data = dataset.value
dec_array = data.flatten()
........
I get memory error at this point itsef as it trys to put in all the data to memory.

Quick answer
Numpuy.memmap allows presenting a large file on disk as a numpy array. Don't know if it allows mapping files larger than RAM+swap though. Worth a shot.
[Presentation about out-of-memory work with Python] (http://hilpisch.com/TPQ_Out_of_Memory_Analytics.html)
Longer answer
A key question is how much RAM you have (<10GB, >10GB) and what kind of processing you're doing (need to look at each element in the dataset once or need to look at the whole dataset at once).
If it's <10GB and need to look once, then your approach seems like the most decent one. It's a standard way to deal with datasets which are larger than main memory. What I'd do is increase the size of a chunk from 500kb to something closer to the amount of memory you have - perhaps half of physical RAM, but anyway, something in the GB range, but not large enough to cause swapping to disk and interfere with your algorithm. A nice optimisation would be to hold two chunks in memory at one time. One is being processes, while the other is being loaded in parallel from disk. This works because loading stuff from disk is relatively expensive, but it doesn't require much CPU work - the CPU is basically waiting for data to load. It's harder to do in Python, because of the GIL, but numpy and friends should not be affected by that, since they release the GIL during math operations. The threading package might be useful here.
If you have low RAM AND need to look at the whole dataset at once (perhaps when computing some quadratic-time ML algorithm, or even doing random accesses in the dataset), things get more complicated, and you probably won't be able to use the previous approach. Either upgrade your algorithm to a linear one, or you'll need to implement some logic to make the algorithms in numpy etc work with data on disk directly rather than have it in RAM.
If you have >10GB of RAM, you might let the operating system do the hard work for you and increase swap size enough to capture all the dataset. This way everything is loaded into virtual memory, but only a subset is loaded into physical memory, and the operating system handles the transitions between them, so everything looks like one giant block of RAM. How to increase it is OS specific though.

The memmap object can be used anywhere an ndarray is accepted. Given a memmap fp, isinstance(fp, numpy.ndarray) returns True.
Memory-mapped files cannot be larger than 2GB on 32-bit systems.
When a memmap causes a file to be created or extended beyond its current size in the filesystem, the contents of the new part are unspecified. On systems with POSIX filesystem semantics, the extended part will be filled with zero bytes.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.