Being new to programming, I was playing around with Python with this script:
import itertools
Lists=list(itertools.permutations(['a','b','c','d','e','f','g','h','i','j','k']))
print(len(Lists))
On 32-bit Python it causes a memory error. However, when trying it on 64-bit Python and watching Task Manager, Python uses 4 GB of memory (I have 8 GB of RAM), then my computer freezes and I have to restart it.
Is this normal behavior? How can I fix this, or limit how much memory Python has access to?
Also, if I converted something like this to a .exe file (I used this script as a test for something else), would it freeze other computers?
The function itertools.permutations() returns an iterator that lazily produces the permutations of the given sequence, one at a time and in lexicographic order. Your code then stores every one of these permutations explicitly in a list.
Your sequence contains 11 letters. For your input, there are 11! = 39 916 800 permutations. Python is not particularly memory-efficient; for each of the 40 million permutations, these values need to be stored:
A pointer in the outer list to the permutation tuple. (8 bytes on 64-bit Python)
The permutation tuple itself, containing 11 pointers to the strings. (11 × 8 bytes)
Thus at least 96 bytes are used per permutation. Adding the tuple's object header, padding and miscellaneous waste, we can estimate that each permutation uses about 150 bytes of memory. Multiply this by 40 million, and we get roughly 6 gigabytes.
This high memory usage explains why your program dies on 32-bit Python: a 32-bit process cannot address more than 4 GB of RAM, and in practice is usually limited to about 2 GB. It also explains the freeze on 64-bit Python: once the process has consumed most of your physical RAM, the machine starts thrashing in the page/swap file (if one is enabled), which makes everything unresponsive.
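If all you actually need is the number of permutations, you never have to materialize them at all; a minimal sketch (the count is simply 11!):

import itertools
import math

letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']

# Iterate lazily: only one permutation tuple exists in memory at a time.
count = sum(1 for _ in itertools.permutations(letters))
print(count)                          # 39916800

# Or skip the iteration entirely, since the count is just a factorial.
print(math.factorial(len(letters)))   # 39916800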
One way to limit how much memory Python can use is through mechanisms provided by the operating system, such as ulimit. Another is the resource module, which lets a process set limits on itself.
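For example, on Unix-like systems the resource module can cap the process's address space so that the script fails with a MemoryError instead of dragging the whole machine into swap; a minimal sketch (the 2 GiB cap is just an illustrative value):

import resource

# Cap this process's total address space at 2 GiB (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (2 * 1024 ** 3, hard))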
Related
I have a python script, whose memory usage (observed in top) is growing indefinitely when the script is running.
My script is not supposed to store anything. It just receives some request (REST API), processes the request, and returns some result. The expected behaviour is that the memory usage stays constant.
I'm looking for a suitable memory profiler tool that would allow me to identify where the growing object is located in the code.
I had a look at FIL-profiler, but it seems to focus on "peak memory usage", which I believe is not my case.
thanks
EDIT1: I wanted to give pyrasite a try, but I could not make it work. Since the last update of this project was in May 2012, it may no longer be compatible with current Python (I'm using 3.8).
EDIT2: I gave Pympler a try, following this post. I added the following code to my script:
from pympler import muppy, summary
all_objects = muppy.get_objects()
sum = summary.summarize(all_objects)
summary.print_(sum)
This outputs:
types | # objects | total size
========= | =========== | ============
dict | 7628 | 3.15 MB
str | 24185 | 2.71 MB
type | 1649 | 1.39 MB
tuple | 13722 | 1.35 MB
code | 7701 | 1.06 MB
set | 398 | 453.06 KB
....
This does not show any suspiciously large object (while the script is running, its memory usage grows to the GB scale).
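(Pympler can also report only the objects created or destroyed between two points in time, which can be more telling than a single snapshot; a minimal sketch of that variant, in case it is useful:)

from pympler import tracker

tr = tracker.SummaryTracker()
# ... handle a batch of requests here ...
tr.print_diff()   # prints only the objects that appeared/disappeared since the last call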
EDIT3:
I tried tracemalloc: I put a tracemalloc.start() at the beginning of the script, and then, at the end of the script but before stopping (so that memory usage is still very high according to top), I do a
snapshot = tracemalloc.take_snapshot()
display_top(snapshot)
This displays the lines of code with the highest memory usage, but still, the totals are nowhere near what top reports.
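(For reference, display_top is the helper from the tracemalloc documentation; a minimal sketch of it, so the snippet above is self-contained:)

import tracemalloc

def display_top(snapshot, key_type='lineno', limit=10):
    # Group allocations by source line and print the biggest ones.
    top_stats = snapshot.statistics(key_type)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        print("#%d: %s:%d: %.1f KiB" % (index, frame.filename, frame.lineno, stat.size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))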
I tried also a gc.collect(), but this has no effect.
I could not get useful results from any of the proposed memory profilers. It turned out that the problem lay in a possible bug in the ujson library, which is written in C.
I guess that this is why all those Python memory profilers could not help here.
I think I have to close the issue. A remaining question would be: is there any Python tool that allows tracking a memory problem that happens in a C module?
Assuming you are using the "VIRT" column from top, you cannot extrapolate from that number to an assumption that the number of your python objects is growing, or at least growing enough to explain the total size of the virtual address space.
For example, did you know that python uses pthreads underneath its threading code? That is significant because pthreads takes, as its default stack size, the value of "ulimit -s" times 1 KiB. So the default stack size for any new thread in python is often 8 MB, or even 40 MB on some variants of Linux, unless you explicitly change the "ulimit -s" value in the parent of your process. Many server applications are multi-threaded, so even having 2 additional threads would result in more virtual memory than you are showing from the Pympler output. You have to know how many threads you have in your process and what the default stack size is to understand how threads might contribute to the total VIRT value.
Also, underneath its own allocation mechanism python uses a mixture of mmap and malloc. In the case of malloc, if the program is multi-threaded, libc malloc on Linux will use multiple arenas, which reserve 64 MB ranges (called heaps) at a time, but only make the starts of these heaps readable and writable and leave the tails of these ranges inaccessible until the memory is needed. Those inaccessible tails are often large but don't really matter at all in terms of the committed memory for the process, because inaccessible pages don't require any physical memory or swap space. Nonetheless, top counts the entire range in "VIRT", including both the accessible start at the beginning of the range and the inaccessible tail at the end of the range.
For example, consider this rather tiny python program, in which the main thread starts 16 additional threads, none of which uses much memory in python allocations:
import threading

def spin(seed):
    # Each thread just keeps churning on a tiny 64-element list.
    l = [i * seed for i in range(64)]
    while True:
        l = [i * i % 97 for i in l]

for i in range(16):
    t = threading.Thread(target=spin, args=[i])
    t.start()
We would not expect that program to result in a particularly large process, but here is what we see under top, looking at just that one process; it shows well over 1 GB of VIRT:
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.0%us, 1.9%sy, 3.3%ni, 93.3%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 264401648k total, 250450412k used, 13951236k free, 1326116k buffers
Swap: 69205496k total, 17321340k used, 51884156k free, 104421676k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9144 tim 20 0 1787m 105m 99m S 101.8 0.0 13:40.49 python
We can understand the high VIRT value for such a program (and your program, for that matter) by taking a live core of the running program (using gcore, for example), opening the resulting core in chap, an open-source tool available at https://github.com/vmware/chap, and running commands as shown:
chap> summarize writable
17 ranges take 0x2801c000 bytes for use: stack
14 ranges take 0x380000 bytes for use: python arena
16 ranges take 0x210000 bytes for use: libc malloc heap
1 ranges take 0x1a5000 bytes for use: libc malloc main arena pages
11 ranges take 0xa9000 bytes for use: used by module
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
2 ranges take 0x7000 bytes for use: unknown
62 writable ranges use 0x28832000 (679,682,048) bytes.
From the above we can see that the biggest single use of memory is 17 stacks. If we were to use "describe writable" we would see that 16 of those stacks take 40MB each, and this is all because we have neglected to explicitly tune the stack size to something more reasonable.
We can do something similar with inaccessible (not readable, writable or executable) regions and see that 16 "libc malloc heap tail reservation" ranges take almost 1 GB of VIRT. This, as it turns out, doesn't really matter for the committed memory of the process, but it does make a rather scary contribution to the VIRT number.
chap> summarize inaccessible
16 ranges take 0x3fdf0000 bytes for use: libc malloc heap tail reservation
11 ranges take 0x15f9000 bytes for use: module alignment gap
16 ranges take 0x10000 bytes for use: stack overflow guard
43 inaccessible ranges use 0x413f9000 (1,094,684,672) bytes.
The reason that there are 16 such ranges is that each of the spinning threads was repeatedly making allocations, and this caused libc malloc, running behind the scenes because it gets used by the python allocator, to carve out 16 arenas.
You could do a similar command for read only memory ("summarize readonly") or you could get more detail for any of the commands by changing "summarize" to "describe".
I can't say exactly what you will find when you run this on your own server, but it seems quite likely for a REST server to be multi-threaded, so I would guess that this will be a non-trivial contribution to the number top is showing you.
This still doesn't answer why the number is continuing to climb, but if you look at those numbers you can see where to look next. For example, in the above summary, allocations that are handled by python without resorting to mmap don't take much memory, at only 3.5 MB:
14 ranges take 0x380000 bytes for use: python arena
In your case the number might be larger, so you might try any of the following:
describe used
describe free
summarize used
summarize free
Note that the above commands will also cover native allocations (for example ones done by a shared library) as well as python allocations.
Following similar steps on your own program should give you a much better understanding of those "top" numbers you are seeing.
I have a Python script which uses an open-source PyTorch model, and this code has a memory leak. I am running it with memory_profiler (mprof run --include-children python my_script.py) and get the following image:
I am trying to search for the reason of the leak by the system python module tracemalloc:
import tracemalloc

tracemalloc.start(25)
while True:
    ...
    snap = tracemalloc.take_snapshot()
    # Keep only traces from domain 0 (allocations made through the Python allocator).
    domain_filter = tracemalloc.DomainFilter(True, 0)
    snap = snap.filter_traces([domain_filter])
    stats = snap.statistics('lineno', True)
    for stat in stats[:10]:
        print(stat)
Looking only at the tracemalloc output, I am not able to identify the problem. I assume that the problem is in a C extension, but I would like to make sure this is true.
I tried to change the domain with DomainFilter, but I only get output for domain 0.
Also, I don't understand the meaning of the parameter that tracemalloc.start(frameno) takes; frameno is supposed to be the number of most recent frames to store, but nothing happens when I change it.
What can I do next to find the problematic place in the code which causes the memory leak?
Looking forward to your answer.
Given that your guess is that the problem is in the C extension, but that you want to make sure this is true, I would suggest checking it with a tool that is less python-specific, such as https://github.com/vmware/chap, at least if you are able to run your program on Linux.
What you will need to do is run your script (uninstrumented) and at some point gather a live core (for example, using "gcore pid-of-your-running-program").
Once you have that core, open that core in chap ("chap your-core-file-path") and try the following command from the chap prompt:
summarize writable
The output will be something like this, but your numbers will likely vary considerably:
chap> summarize writable
5 ranges take 0x2021000 bytes for use: stack
6 ranges take 0x180000 bytes for use: python arena
1 ranges take 0xe1000 bytes for use: libc malloc main arena pages
4 ranges take 0x84000 bytes for use: libc malloc heap
8 ranges take 0x80000 bytes for use: used by module
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
4 ranges take 0x30000 bytes for use: unknown
29 writable ranges use 0x23e7000 (37,646,336) bytes.
The lines in the summary are given in decreasing order of byte usage, so you can follow that order. So looking at the top one first we see that the use is "stack":
5 ranges take 0x2021000 bytes for use: stack
This particular core was for a very simple python program that starts 4 extra threads and has all 5 threads sleep. The reason large stack allocations can happen rather easily with a multi-threaded python program is that python uses pthreads to create additional threads, and pthreads uses the ulimit value for stack size as the default. If your program has a similarly large value, you can change the stack size in one of several ways, including running "ulimit -s <size-in-KiB>" in the parent process to change the default stack size. To see what values actually make sense, you can use the following command from the chap prompt:
chap> describe stacks
Thread 1 uses stack block [0x7fffe22bc000, 7fffe22dd000)
current sp: 0x7fffe22daa00
Peak stack usage was 0x7798 bytes out of 0x21000 total.
Thread 2 uses stack block [0x7f51ec07c000, 7f51ec87c000)
current sp: 0x7f51ec87a750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 3 uses stack block [0x7f51e7800000, 7f51e8000000)
current sp: 0x7f51e7ffe750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 4 uses stack block [0x7f51e6fff000, 7f51e77ff000)
current sp: 0x7f51e77fd750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 5 uses stack block [0x7f51e67fe000, 7f51e6ffe000)
current sp: 0x7f51e6ffc750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
5 stacks use 0x2021000 (33,689,600) bytes.
So what you see above is that 4 of the stacks are 8MiB in size but could easily be well under 64KiB.
Your program may not have any issues with stack size, but if it does, you can fix them as described above.
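(Besides ulimit, the stack size for threads that Python itself creates can also be set from within the program; a minimal sketch, where the 512 KiB figure is just an illustrative choice and worker is a stand-in for your real thread function:)

import threading
import time

def worker():
    time.sleep(60)   # stand-in for whatever the real threads do

# Must be called before the threads are created; affects threads started afterwards.
threading.stack_size(512 * 1024)   # 512 KiB per new thread instead of the ulimit default

t = threading.Thread(target=worker)
t.start()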
Continuing with checking for causes of growth, look at the next line from the summary:
6 ranges take 0x180000 bytes for use: python arena
So python arenas use the next most memory. These are used strictly for python-specific allocations. So if this value is large in your case it disproves your theory about C allocations being the culprit, but there is more you can do later to figure out how those python allocations are being used.
Looking at the remaining lines of the summary, we see a few with "libc" as part of the "use" description:
1 ranges take 0xe1000 bytes for use: libc malloc main arena pages
4 ranges take 0x84000 bytes for use: libc malloc heap
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
Note that libc is responsible for all that memory, but you can't know that the memory is used for non-python code because for allocations beyond a certain size threshold (well under 4K) python grabs memory via malloc rather than grabbing memory from one of the python arenas.
So let's assume that you have resolved any issues you might have had with stack usage and that you mainly have "python arena" or "libc malloc" related usages. The next thing you want to understand is whether that memory is mostly "used" (meaning allocated but never freed) or "free" (meaning freed but not given back to the operating system). You can check that as shown:
chap> count used
15731 allocations use 0x239388 (2,331,528) bytes.
chap> count free
1563 allocations use 0xb84c8 (754,888) bytes.
So in the above case, used allocations dominate and what one should do is to try to understand those used allocations. The case where free allocations dominate is much more complex and is discussed a bit in the user guide but would take too much time to cover here.
So let's assume for now that used allocations are the main cause of growth in your case. We can then try to find out why we have so many used allocations.
The first thing we might want to know is whether any allocations were actually "leaked" in the sense that they are no longer reachable. This excludes the case where the growth is due to container-based growth.
One does this as follows:
chap> summarize leaked
0 allocations use 0x0 (0) bytes.
So for this particular core, as is pretty common for python cores, nothing was leaked. Your number may be non-zero. If it is non-zero but still much lower than the totals associated with memory used for "python" or "libc" reported above, you might just make a note of the leaks but continue to look for the real cause of growth. The user guide has some information about investigating leaks but it is a bit sparse. If the leak count is actually big enough to explain your growth issue, you should investigate that next but if not, read on.
Now that you are assuming container-based growth the following commands are useful:
chap> redirect on
chap> summarize used
Wrote results to scratch/core.python_5_threads.summarize_used
chap> summarize used /sortby bytes
Wrote results to scratch/core.python_5_threads.summarize_used::sortby:bytes
The above created two text files, one which has a summary ordered in terms of object counts and another which has a summary in terms of the total bytes used directly by those objects.
At present chap has only very limited support for python: it finds python objects, in addition to any allocated by libc malloc, but for python objects the summary only breaks things out into a few pattern-based categories (for example, %SimplePythonObject matches things like "int" and "str" that don't hold other python objects, while %ContainerPythonObject matches things like tuple, list and dict that do hold references to other python objects). With that said, it should be pretty easy to tell from the summary whether the growth in used allocations is primarily due to objects allocated by python or objects allocated by native code.
So in this case, given that you specifically are trying to find out whether the growth is due to native code or not, look in the summary for counts like the following, all of which are python-related:
Pattern %SimplePythonObject has 7798 instances taking 0x9e9e8(649,704) bytes.
Pattern %ContainerPythonObject has 7244 instances taking 0xc51a8(807,336) bytes.
Pattern %PyDictKeysObject has 213 instances taking 0xb6730(747,312) bytes.
So in the core I have been using for an example, definitely python allocations dominate.
You will also see a line for the following, which is for allocations that chap does not yet recognize. You can't make assumptions about whether these are python-related or not.
Unrecognized allocations have 474 instances taking 0x1e9b8(125,368) bytes.
This will hopefully answer your basic question of what you can do next. At least at that point you will understand whether the growth is likely due to C code or python code and depending on what you find, the chap user guide should help you go further from there.
While reading an article on memory management in Python, I came across a few doubts:
import copy
import memory_profiler

@profile
def function():
    x = list(range(1000000))  # allocate a big list
    y = copy.deepcopy(x)
    del x
    return y

if __name__ == "__main__":
    function()
$ python -m memory_profiler memory_profiler_demo.py
Filename: memory_profiler_demo.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   30.074 MiB   30.074 MiB   @profile
     5                             def function():
     6   61.441 MiB   31.367 MiB       x = list(range(1000000))  # allocate a big list
     7  111.664 MiB   50.223 MiB       y = copy.deepcopy(x)      # doubt 1
     8  103.707 MiB   -7.957 MiB       del x                     # doubt 2
     9  103.707 MiB    0.000 MiB       return y
So my first doubt is about line 7: why does copying the list take more memory than building it in the first place? And my second doubt is about line 8: why does del x free only about 7 MiB?
First, let's start with why line 8 only frees 7MiB.
Once you allocate a bunch of memory, Python and your OS and/or malloc library both guess that you're likely to allocate a bunch of memory again. On modern platforms, it's a lot faster to reuse that memory in-process than to release it and reallocate it from scratch, while it costs very little to keep extra unused pages of memory in your process's space, so it's usually the right tradeoff. (But of course usually != always, and the blog you linked seems to be in large part about how to work out that you're building an application where it's not the right tradeoff and what to do about it.)
A default build of CPython on Linux virtually never releases any memory. On other POSIX (including Mac) it almost never releases any memory. On Windows, it does release memory more often—but there are still constraints. Basically, if a single allocation from Windows has any piece in use (or even in the middle of a freelist chain), that allocation can't be returned to Windows. So, if you're fragmenting memory (which you usually are), that memory can't be freed. The blog post you linked to explains this to some extent, and there are much better resources than an SO answer to explain further.
If you really do need to allocate a lot of memory for a short time, release it, and never use it again, without holding onto all those pages, there's a common Unix idiom for that—you fork, then do the short-term allocation in the child and exit after passing back the small results in some way. (In Python, that usually means using multiprocessing.Process instead of os.fork directly.)
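(A minimal sketch of that idiom with multiprocessing; the work done in the child here is just a stand-in for whatever your short-lived, memory-hungry computation really is:)

from multiprocessing import Process, Queue

def crunch(queue):
    big = list(range(10000000))   # short-lived, memory-hungry work
    queue.put(sum(big))           # pass back only the small result

if __name__ == "__main__":
    q = Queue()
    p = Process(target=crunch, args=(q,))
    p.start()
    result = q.get()
    p.join()
    # All pages used by the child are returned to the OS when the child exits.
    print(result)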
Now, why does your deepcopy take more memory than the initial construction?
I tested your code on my Mac laptop with python.org builds of 2.7, 3.5, and 3.6. What I found was that the list construction takes around 38MiB (similar to what you're seeing), while the copy takes 42MiB on 2.7, 31MiB on 3.5, and 7MB on 3.6.
Slightly oversimplified, here's the 2.7 behavior: The functions in copy just call the type's constructor on an iterable of the elements (for copy) or recursive copies of them (for deepcopy). For list, this means creating a list with a small starting capacity and then expanding it as it appends. That means you're not just creating a 1M-length array, you're also creating and throwing away arrays of 500K, 250K, etc. all the way down. The sum of all those lengths is equivalent to a 2M-length array. Of course you don't really need the sum of all of them—only the most recent array and the new one are ever live at the same time—but there's no guarantee the old arrays will be freed in a useful way that lets them get reused. (That might explain why I'm seeing about 1.5x the original construction while you're seeing about 2x, but I'd need a lot more investigation to bet anything on that part…)
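(You can watch this over-allocation happen with sys.getsizeof, which reports the list's currently allocated capacity; a minimal sketch that prints a line every time the backing array is reallocated while appending:)

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(1000000):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # The list just grew its backing array and copied the old pointers across.
        print("len=%d, backing allocation is now %d bytes" % (len(lst), size))
        last_size = size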
In 3.5, I believe the biggest difference is that a number of improvements over the 5 years since 2.7 mean that most of those expansions now get done by realloc, even if there is free memory in the pool that could be used instead. That changes a tradeoff that favored 32-bit over 64-bit on modern platforms into one that works the other way round—in 64-bit linux/Mac/Windows: there are often going to be free pages that can be tossed onto the end of an existing large alloc without remapping its address, so most of those reallocs mean no waste.
In 3.6, the huge change is probably #26167. Oversimplifying again, the list type knows how to copy itself by allocating all in one go, and the copy methods now take advantage of that for list and a few other builtin types. Sometimes there's no reallocation at all, and even when there is, it's usually with the special-purpose LIST_APPEND code (which can be used when you can assume nobody outside the current function has access to the list yet) instead of the general-purpose list.append code (which can't).
I've come to work with a rather big simulation code which needs to store up to 189383040 floating point numbers. I know, this is large, but there isn't much that could be done to overcome this, like only looking at a portion of them or processing them one-by-one.
I've written a short script, which reproduces the error so I could quickly test it in different environments:
noSnapshots = 1830
noObjects = 14784
objectsDict = {}
for obj in range(0, noObjects):
    objectsDict[obj] = [[], [], []]
    for snapshot in range(0, noSnapshots):
        objectsDict[obj][0].append([1.232143454, 1.232143454, 1.232143454])
        objectsDict[obj][1].append([1.232143454, 1.232143454, 1.232143454])
        objectsDict[obj][2].append(1.232143454)
It represents the structure of the actual code where some parameters (2 lists of length 3 each and 1 float) have to be stored for each of the 14784 objects at 1830 distinct locations. Obviously the numbers would be different each time for a different object, but in my code I just went for some randomly-typed number.
The thing, which I find not very surprising, is that it fails on Windows 7 Enterprise and Home Premium with a MemoryError. Even if I run the code on a machine with 16 GB of RAM it still fails, even though there's still plenty of memory left on the machine. So the first question would be: Why does it happen so? I'd like to think that the more RAM I've got the more things I can store in the memory.
I ran the same code on an Ubuntu 12.04 machine of my colleague (again with 16 GB of RAM) and it finished no-problem. So another thing which I'd like to know is: Is there anything I could do to make Windows happy with this code? I.e. give my Python process more memory on heap and stack?
Finally: Does anyone have any suggestions as to how to store plenty of data in memory in a manner similar to the one in the example code?
EDIT
After the answer I changed the code to:
import numpy

noSnapshots = 1830
noObjects = int(14784 * 1.3)
objectsDict = {}
for obj in range(0, noObjects):
    objectsDict[obj] = [[], [], []]
    objectsDict[obj][0].append(numpy.random.rand(noSnapshots, 3))
    objectsDict[obj][1].append(numpy.random.rand(noSnapshots, 3))
    objectsDict[obj][2].append(numpy.random.rand(noSnapshots, 1))
and it works despite the larger amount of data, which has to be stored.
In Python, every float is an object on the heap, with its own reference count, etc. For storing this many floats, you really ought to use a dense representation of lists of floats, such as numpy's ndarray.
Also, because you are reusing the same float objects, you are not estimating the memory use correctly. You have lists of references to the same single float object. In a real case (where the floats are different) your memory use would be much higher. You really ought to use ndarray.
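(A minimal sketch of the difference, using numpy as suggested; the exact sizes vary by platform and Python version:)

import sys
import numpy as np

n = 1000000

as_list = [0.0] * n                        # one 8-byte pointer per element, plus the float objects
as_array = np.zeros(n, dtype=np.float64)   # one flat buffer of 8-byte doubles

print(sys.getsizeof(as_list))   # roughly 8 MB just for the pointer array
print(as_array.nbytes)          # exactly 8000000 bytes of payload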
Today I wrote a program using an array/list with 64000000 entries. However, when writing sigma=[1]*64000000 using Python it runs fine, but later on, as the program computes, my Ubuntu freezes - without any reaction to inputs, not even mouse movement. I tried twice and the results are the same.
When implemented in C++, long long sigma[64000000] holds up fine and runs very fast.
Is there any reason my program would freeze in the middle of running, rather than crashing at the beginning?
EDIT: To reply to Chris below, my code did not freeze until a couple of loops later.
Thank you all!
For those who are interested in seeing the code, this is the program, a brute-force attempt at Project Euler 211:
from math import sqrt

def e211():
    ans = 0
    sigma = [1] * 64000000   # sigma[n] accumulates the sum of the squares of the divisors of n
    for i in range(2, 64000000):
        j = i
        if (j % 1000 == 0) or (j < 100):
            print(j)         # progress output
        q = i * i
        while j < 64000000:
            sigma[j] += q
            j += i
    for i in range(1, 64000000):
        j = int(sqrt(sigma[i]))
        if j * j == sigma[i]:
            ans += i
    return ans

if __name__ == '__main__':
    print(e211())
Python lists are lists of objects. A number in Python is an object in itself, and takes up somewhat more storage than the 64 bits needed to represent a long long in C++. In particular, Python transparently handles numbers larger than 32 bits, which end up taking a lot more space than a simple integer.
You may find the standard Python array module useful. It provides Python-compatible access to uniform arrays of integers of a specified size. (However, note that a 64-bit integer typecode, 'q', is only available in Python 3.3 and later.)
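(A minimal sketch of what that looks like for this program; 'q' is the signed 64-bit typecode, and on older Pythons you would have to fall back to 'l', which may be only 32 bits:)

from array import array

# One machine-level integer per slot instead of one full Python object per slot.
sigma = array('q', [1]) * 64000000   # 64 million signed 64-bit ints, roughly 512 MB
sigma[10] += 4                       # indexing and arithmetic work like a list
print(sigma[10], len(sigma))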
range(1,64000000):
In Python 2, this line creates a full list of 64000000 elements, so now you have two lists of that size in memory; use xrange there instead. (In Python 3, range is already lazy, so this is not an issue.)