I am trying to run a Python (2.7) script with PyPy but I have encountered the following error:
TypeError: sys.getsizeof() is not implemented on PyPy.
A memory profiler using this function is most likely to give results
inconsistent with reality on PyPy. It would be possible to have
sys.getsizeof() return a number (with enough work), but that may or
may not represent how much memory the object uses. It doesn't even
make really sense to ask how much *one* object uses, in isolation
with the rest of the system. For example, instances have maps,
which are often shared across many instances; in this case the maps
would probably be ignored by an implementation of sys.getsizeof(),
but their overhead is important in some cases if they are many
instances with unique maps. Conversely, equal strings may share
their internal string data even if they are different objects---or
empty containers may share parts of their internals as long as they
are empty. Even stranger, some lists create objects as you read
them; if you try to estimate the size in memory of range(10**6) as
the sum of all items' size, that operation will by itself create one
million integer objects that never existed in the first place.
Now, I really need to check the size of one nested dict during the execution of the program. Is there any alternative to sys.getsizeof() I can use in PyPy? If not, how would I check the size of a nested object in PyPy?
Alternatively you can gauge the memory usage of your process using
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
As your program executes, getrusage reports the peak resident set size of the process (in kilobytes on Linux, in bytes on macOS). Using this information you can estimate the size of your data structures, and if you begin to use, say, 50% of your machine's total memory, you can do something to handle it.
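For example, here is a rough sketch of that idea. Since ru_maxrss is a peak value, the difference below only reflects the dict's size if building it actually raises the peak; the nested dict shown is just a stand-in for yours.
import resource

def peak_rss_kb():
    # Peak resident set size of this process; kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
nested = {i: {'value': i, 'squares': [j * j for j in range(100)]}
          for i in range(10000)}       # stand-in for your nested dict
after = peak_rss_kb()
print('Nested dict grew the process by roughly %d kB' % (after - before))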
Related
I'm working on an application that, in part, needs to save the latest n values from a stream of data, overwriting the oldest value, so that the saved values can be processed all at once at intervals. I'm familiar with ring buffers in C, which are used in memory-constrained scenarios.
Yet, in Python, what is the advantage of such an implementation over simply having a queue object and performing a pop() and a push() at each insertion of data? Are there memory vulnerabilities with this approach?
In any language, a ring buffer implementation offers a significant performance advantage: no data is moved around when adding or removing elements, and the memory where the elements are stored is contiguous.
In Python you can use collections.deque with the maxlen argument to get the same behavior and performance. That said, even a plain list is going to be reasonably fast.
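For illustration, a minimal sketch of the deque approach (the maxlen of 5 is arbitrary):
from collections import deque

ring = deque(maxlen=5)      # keeps only the 5 most recent values
for value in range(8):
    ring.append(value)      # once full, the oldest value is dropped automatically

print(list(ring))           # -> [3, 4, 5, 6, 7]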
What do you mean by memory vulnerabilities?
Without going into algorithmic details, let's just say that my code sequentially processes a list of inputs:
inputs = [2,5,6,7,8,10,12,13,14,15,16,17,18,19,20,21]
for i in inputs:
    process_input(i)
For simplicity, let's consider process_input to be a stateless black box.
I know that this site is full of questions about finding memory leaks in Python code, but this is not what this question is about. Instead, I'm trying to understand the memory consumption of my code over time and whether it might suffer from leaking memory.
In particular, I'm trying to understand a discrepancy of two distinct indicators of memory usage:
The number of allocated objects (reported by gc.get_objects) and
the amount of physical memory actually in use (read from VmRSS on a Linux system).
To study these two indicators, I expanded the original code from above as follows:
import time, gc
def get_current_memory_usage():
    # VmRSS is reported in kB; convert to GiB.
    with open('/proc/self/status') as f:
        memusage = f.read().split('VmRSS:')[1].split('\n')[0][:-3]
    return int(memusage.strip()) / (1024 ** 2)
inputs = [2,5,6,7,8,10,12,13,14,15,16,17,18,19,20,21]
gc.collect()
last_object_count = len(gc.get_objects())
for i in inputs:
    print(f'\nProcessing input {i}...')
    process_input(i)
    gc.collect()
    time.sleep(1)
    memory_usage = get_current_memory_usage()
    object_count = len(gc.get_objects())
    print(f'Memory usage: {memory_usage:.2f} GiB')
    print(f'Object count: {object_count - last_object_count:+}')
    last_object_count = object_count
Note that process_input is stateless, i.e. the order of the inputs does not matter. Thus, we would expect both indicators to be about the same before running process_input and afterwards, right? Indeed, this is what I observe for the number of allocated objects. However, the memory consumption grows steadily.
Now my core question: Do these observations indicate a memory leak? To my understanding, memory leaking in Python would be indicated by a growth of allocated objects, which we do not observe here. On the other hand, why does the memory consumption grow steadily?
For further investigation, I also ran a second test. For this test, I repeatedly invoked process_input(i) with a fixed input i (five times each) and recorded the memory consumption between iterations:
For i=12, the memory consumption remained constant at 10.91 GiB.
For i=14, the memory consumption remained constant at 7.00 GiB.
I think these observations make the presence of a memory leak even more unlikely, right? But then, what could explain why the memory consumption does not fall between iterations, given that process_input is stateless?
The system has 32 GiB RAM in total and is running Ubuntu 20.04. Python version is 3.6.10. The process_input function uses several third-party libraries.
In general, RSS is not a particularly good indicator because it is the "resident" set size: even a process that is rather piggish in terms of committed memory can have a modest RSS, since memory can be swapped out. You can look at /proc/self/smaps and add up the size of the writable regions to get a much better benchmark.
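As a rough sketch of that suggestion (Linux only; the helper name is my own), you could parse /proc/self/smaps and sum the Size of the writable mappings:
def writable_kb():
    # Sum the Size of all writable mappings in /proc/self/smaps (values are in kB).
    total_kb = 0
    writable = False
    with open('/proc/self/smaps') as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0].endswith(':'):
                # Detail line for the current mapping, e.g. "Size: 8192 kB".
                if parts[0] == 'Size:' and writable:
                    total_kb += int(parts[1])
            else:
                # Mapping header, e.g. "7f51e6fff000-7f51e77ff000 rw-p ...".
                writable = len(parts) > 1 and 'w' in parts[1]
    return total_kb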
On the other hand, if there is actual growth and you want to understand why, you need to look at the actual dynamically allocated memory. What I'd suggest for this is using https://github.com/vmware/chap.
To do this, just make that 1-second sleep a bit longer, put a print just before the call to sleep, and use gcore from another session to gather a live core during a few of those sleeps.
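For example, the measurement loop from the question could be adjusted roughly like this (the 30-second pause is an arbitrary choice, just long enough to run gcore by hand):
import os, time

inputs = [14, 21]                       # hypothetical inputs you want cores for
for i in inputs:
    # ... run process_input(i), gc.collect() etc. as in the question ...
    print(f'pid={os.getpid()}: finished input {i}; run "gcore {os.getpid()}" now')
    time.sleep(30)                      # time to capture a live core from another shell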
So let's say you have cores gathered from when the input was 14 and when it was 21. Look at each of the cores using chap, for example, with the following command:
count used
That will give you a good view of allocations that have been requested but not released. If the numbers are much larger for the later core, you probably have some kind of growth issue. If those numbers do differ by quite a lot, use
summarize used
If you have growth, it is possible that there is a leak (as opposed to some container simply expanding). To check this, you can try commands like
count leaked
show leaked
From there you should probably look at the documentation, depending on what you find.
On the other hand, if used allocations are not the issue, you might try the following, to see memory for allocations that have been released but are part of larger regions of memory that cannot be given back to the operating system because parts of those regions are still in use:
count free
summarize free
If neither "used" allocations nor "free" allocations are the issue, you might try:
summarize writable
That is a very high-level view of all writable memory. For example, you can see things like stack usage...
I have a Python script which uses an open-source PyTorch model, and this code has a memory leak. I am running it with memory_profiler (mprof run --include-children python my_script.py) and get the following image:
I am trying to track down the cause of the leak with the standard-library module tracemalloc:
tracemalloc.start(25)
while True:
    ...
    snap = tracemalloc.take_snapshot()
    domain_filter = tracemalloc.DomainFilter(True, 0)
    snap = snap.filter_traces([domain_filter])
    stats = snap.statistics('lineno', True)
    for stat in stats[:10]:
        print(stat)
Looking only at the tracemalloc output, I am not able to identify the problem. I assume that the problem is in a C extension, but I would like to make sure that is true.
I tried to change the domain with DomainFilter, but I only get output in domain 0.
Also, I don't understand the meaning of the parameter of tracemalloc.start(frameno); frameno is the number of most recent frames to store, but nothing happens when I change it.
What can I do next to find the problematic place in the code which causes the memory leak?
Looking forward to your answer.
Given that you guess the problem is in a C extension but want to make sure this is true, I would suggest doing so using a tool that is less Python-specific, like https://github.com/vmware/chap, at least if you are able to run your program on Linux.
What you will need to do is run your script (uninstrumented) and at some point gather a live core (for example, using "gcore pid-of-your-running-program").
Once you have that core, open that core in chap ("chap your-core-file-path") and try the following command from the chap prompt:
summarize writable
The output will be something like this, but your numbers will likely vary considerably:
chap> summarize writable
5 ranges take 0x2021000 bytes for use: stack
6 ranges take 0x180000 bytes for use: python arena
1 ranges take 0xe1000 bytes for use: libc malloc main arena pages
4 ranges take 0x84000 bytes for use: libc malloc heap
8 ranges take 0x80000 bytes for use: used by module
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
4 ranges take 0x30000 bytes for use: unknown
29 writable ranges use 0x23e7000 (37,646,336) bytes.
The lines in the summary are given in decreasing order of byte usage, so you can follow that order. Looking at the top one first, we see that the use is "stack":
5 ranges take 0x2021000 bytes for use: stack
This particular core was for a very simple python program that starts 4 extra threads and has all 5 threads sleep. The reason large stack allocations can happen rather easily with a multi-threaded python program is that python uses pthreads to create additional threads and pthreads uses the ulimit value for stack size as a default. If your program has a similarly large value, you can change the stack size in one of several ways, including running "ulimit -s" in the parent process to change the default stack size. To see what values actually make sense you can use the following command from the chap prompt:
chap> describe stacks
Thread 1 uses stack block [0x7fffe22bc000, 7fffe22dd000)
current sp: 0x7fffe22daa00
Peak stack usage was 0x7798 bytes out of 0x21000 total.
Thread 2 uses stack block [0x7f51ec07c000, 7f51ec87c000)
current sp: 0x7f51ec87a750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 3 uses stack block [0x7f51e7800000, 7f51e8000000)
current sp: 0x7f51e7ffe750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 4 uses stack block [0x7f51e6fff000, 7f51e77ff000)
current sp: 0x7f51e77fd750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
Thread 5 uses stack block [0x7f51e67fe000, 7f51e6ffe000)
current sp: 0x7f51e6ffc750
Peak stack usage was 0x2178 bytes out of 0x800000 total.
5 stacks use 0x2021000 (33,689,600) bytes.
So what you see above is that 4 of the stacks are 8MiB in size but could easily be well under 64KiB.
Your program may not have any issues with stack size, but if it does, you can fix them as described above.
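If you would rather cap the stack size from inside the program than via ulimit, one option (my suggestion, not something prescribed above) is threading.stack_size, which applies to threads created after the call:
import threading

# Limit the stack of subsequently created threads to 256 KiB instead of the
# 8 MiB ulimit default seen above; the minimum allowed value is platform-dependent.
threading.stack_size(256 * 1024)

worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()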
Continuing with checking for causes of growth, look at the next line from the summary:
6 ranges take 0x180000 bytes for use: python arena
So python arenas use the next most memory. These are used strictly for python-specific allocations. So if this value is large in your case it disproves your theory about C allocations being the culprit, but there is more you can do later to figure out how those python allocations are being used.
Looking at the remaining lines of the summary, we see a few with "libc" as part of the "use" description:
1 ranges take 0xe1000 bytes for use: libc malloc main arena pages
4 ranges take 0x84000 bytes for use: libc malloc heap
1 ranges take 0x31000 bytes for use: libc malloc mmapped allocation
Note that libc is responsible for all that memory, but you can't know that the memory is used for non-python code because for allocations beyond a certain size threshold (well under 4K) python grabs memory via malloc rather than grabbing memory from one of the python arenas.
So let's assume that you have resolved any issues you might have had with stack usage and you have mainly "python arena" or "libc malloc" related usages. The next thing you want to understand is whether that memory is mostly "used" (meaning allocated but never freed) or "free" (meaning freed but not given back to the operating system). You can do that as shown:
chap> count used
15731 allocations use 0x239388 (2,331,528) bytes.
chap> count free
1563 allocations use 0xb84c8 (754,888) bytes.
So in the above case, used allocations dominate and what one should do is to try to understand those used allocations. The case where free allocations dominate is much more complex and is discussed a bit in the user guide but would take too much time to cover here.
So let's assume for now that used allocations are the main cause of growth in your case. We can find out why we have so many used allocations.
The first thing we might want to know is whether any allocations were actually "leaked" in the sense that they are no longer reachable. This excludes the case where the growth is due to container-based growth.
One does this as follows:
chap> summarize leaked
0 allocations use 0x0 (0) bytes.
So for this particular core, as is pretty common for python cores, nothing was leaked. Your number may be non-zero. If it is non-zero but still much lower than the totals associated with memory used for "python" or "libc" reported above, you might just make a note of the leaks but continue to look for the real cause of growth. The user guide has some information about investigating leaks but it is a bit sparse. If the leak count is actually big enough to explain your growth issue, you should investigate that next but if not, read on.
Now that you are assuming container-based growth the following commands are useful:
chap> redirect on
chap> summarize used
Wrote results to scratch/core.python_5_threads.summarize_used
chap> summarize used /sortby bytes
Wrote results to scratch/core.python_5_threads.summarize_used::sortby:bytes
The above created two text files, one which has a summary ordered in terms of object counts and another which has a summary in terms of the total bytes used directly by those objects.
At present chap has only very limited support for Python: it finds Python objects, in addition to anything allocated by libc malloc, but for Python objects the summary only breaks things out into a few categories based on patterns (for example, %SimplePythonObject matches things like "int" and "str" that don't hold other Python objects, while %ContainerPythonObject matches things like tuple, list and dict that do hold references to other Python objects). With that said, it should be pretty easy to tell from the summary whether the growth in used allocations is primarily due to objects allocated by Python or objects allocated by native code.
So in this case, given that you specifically are trying to find out whether the growth is due to native code or not, look in the summary for counts like the following, all of which are python-related:
Pattern %SimplePythonObject has 7798 instances taking 0x9e9e8(649,704) bytes.
Pattern %ContainerPythonObject has 7244 instances taking 0xc51a8(807,336) bytes.
Pattern %PyDictKeysObject has 213 instances taking 0xb6730(747,312) bytes.
So in the core I have been using for an example, definitely python allocations dominate.
You will also see a line for the following, which is for allocations that chap does not yet recognize. You can't make assumptions about whether these are python-related or not.
Unrecognized allocations have 474 instances taking 0x1e9b8(125,368) bytes.
This will hopefully answer your basic question of what you can do next. At least at that point you will understand whether the growth is likely due to C code or python code and depending on what you find, the chap user guide should help you go further from there.
Does Python store similar objects at memory locations near each other?
I ask because the ids of similar objects, say lists and tuples, are nearer to each other than to the id of an object of type str.
No, except of course by coincidence. While this is highly implementation- and environment-specific, and there are actually memory management schemes that dedicate page-sized memory regions to objects of the same type, no Python implementation I'm aware of exhibits the behavior you describe. With the possible exception of small numbers, which are sometimes cached under the hood and will likely be located next to each other.
What you're seeing may be because string literals are created at import time (part of the constants in the byte code) and interned, while lists and tuples (that don't contain literals) are created while running code. If a bunch of memory is allocated in between (especially if it isn't freed), the state of the heap may be sufficiently different that quite different addresses are handed out when you're checking.
I am working on a long-running Python program (one part of it is a Flask API, the other a real-time data fetcher).
Both of my long-running processes iterate quite often (the API one might even do so hundreds of times a second) over large data sets (second-by-second observations of certain economic series, for example 1-5 MB worth of data or even more). They also interpolate, compare, and do calculations between series, etc.
What techniques, for the sake of keeping my processes alive, can I practice when iterating / passing as parameters / processing these large data sets? For instance, should I use the gc module and collect manually?
UPDATE
I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?
Any advice would be appreciated.
Thanks!
Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.
There are two concepts you need to consider as you're building the models that process your data.
What is the smallest element of your data you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.
How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (and which can be freed once worked with). This is similar in principle to buffering and is the approach many iterators take (e.g., xrange, open's file object, etc.).
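A minimal sketch of that running-average idea, assuming a hypothetical file numbers.txt with one number per line:
def running_average(path):
    # Read one line at a time; the file object never loads the whole file into memory.
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:
            total += float(line)
            count += 1
    return total / count if count else 0.0

print(running_average('numbers.txt'))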
In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.
Looking towards gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy which aims to lighten the burden of heavy statistical analysis.
It's hard to say without a real look at your data and algorithms, but the following approaches seem to be universal:
Make sure you have no memory leaks; otherwise they will kill your program sooner or later. Use objgraph for this - it's a great tool! Read its docs - they contain good examples of the types of memory leaks you can face in a Python program.
Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create temporary substrings; use indexes and stay read-only as long as possible (see the sketch after this list). It can make your code more complex and less "pythonic", but this is the cost of optimization.
Use gc carefully - it can make your process unresponsive for a while and at the same time add no value. Read the docs. Briefly: you should use gc directly only when there is a real reason to do so, such as the Python interpreter being unable to free memory after allocating a big temporary list of integers.
Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now, so you're ready to do it when your data becomes bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.
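A small sketch of the "indexes instead of temporary substrings" advice from the second point, using memoryview to slice a byte buffer without copying (the payload here is just a stand-in):
data = b'timestamp,value,flag\n' * 1000   # stand-in for a large payload

view = memoryview(data)
header_end = data.index(b'\n')
header = view[:header_end]                # zero-copy slice of the buffer
print(bytes(header))                      # materialize a copy only when actually needed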