I've read about Raymond Hettinger's new method of implementing compact dicts. This explains why dicts in Python 3.6 use less memory than dicts in Python 2.7-3.5. However there seems to be a difference between the memory used in Python 2.7 and 3.3-3.5 dicts. Test code:
import sys

n = 100  # n was not shown in the original; any n from 86 to 170 reproduces the sizes below
d = {i: i for i in range(n)}
print(sys.getsizeof(d))
Python 2.7: 12568
Python 3.5: 6240
Python 3.6: 4704
As mentioned I understand the savings between 3.5 and 3.6 but am curious about the cause of the savings between 2.7 and 3.5.
Turns out this is a red herring. The rules for growing a dict changed between CPython 2.7-3.2 and CPython 3.3, and again in CPython 3.4 (though that change only applies when deletions occur). We can see this using the following code to determine when the dict expands:
import sys

size_old = 0
for n in range(512):
    d = {i: i for i in range(n)}
    size = sys.getsizeof(d)
    if size != size_old:
        print(n, size_old, size)
    size_old = size
Python 2.7:
(0, 0, 280)
(6, 280, 1048)
(22, 1048, 3352)
(86, 3352, 12568)
Python 3.5
0 0 288
6 288 480
12 480 864
22 864 1632
44 1632 3168
86 3168 6240
Python 3.6:
0 0 240
6 240 368
11 368 648
22 648 1184
43 1184 2280
86 2280 4704
Keeping in mind that dicts resize when they get to be 2/3 full, we can see that the CPython 2.7 dict implementation quadruples in size when it expands, while the CPython 3.5/3.6 dict implementations only double in size.
This is explained in a comment in the dict source code:
/* GROWTH_RATE. Growth rate upon hitting maximum load.
* Currently set to used*2 + capacity/2.
* This means that dicts double in size when growing without deletions,
* but have more head room when the number of deletions is on a par with the
* number of insertions.
* Raising this to used*4 doubles memory consumption depending on the size of
* the dictionary, but results in half the number of resizes, less effort to
* resize.
* GROWTH_RATE was set to used*4 up to version 3.2.
* GROWTH_RATE was set to used*2 in version 3.3.0
*/
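The growth rule quoted above can be checked against the observed resize points with a short simulation. This is a sketch of CPython 3.6's behaviour only; the initial table size of 8, the usable fraction of 2/3, and the power-of-two rounding are my reading of the source, not something stated in the comment:

```python
# Simulate CPython 3.6's dict resize schedule (a sketch, see caveats above):
# a table of `size` slots allows (2*size)//3 entries before the next insert
# forces a resize to the next power of two >= GROWTH_RATE = used*2 + size/2.
size, used = 8, 0
usable = (2 * size) // 3          # free entries before a resize is forced
resize_points = []
for n in range(1, 100):
    if usable == 0:               # this insert triggers a resize
        growth = used * 2 + size // 2
        size = 8
        while size < growth:      # round up to the next power of two
            size *= 2
        usable = (2 * size) // 3 - used
        resize_points.append(n)
    used += 1
    usable -= 1
print(resize_points)  # [6, 11, 22, 43, 86], matching the observed 3.6 jumps
```

The simulated resize points line up exactly with the sizes printed by the probe loop above for Python 3.6.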
Related
I was trying to replicate the memory usage test here.
Essentially, the post claims that given the following code snippet:
import copy
import memory_profiler

@profile
def function():
    x = list(range(1000000))  # allocate a big list
    y = copy.deepcopy(x)
    del x
    return y

if __name__ == "__main__":
    function()
Invoking
python -m memory_profiler memory-profile-me.py
prints, on a 64-bit computer
Filename: memory-profile-me.py
Line # Mem usage Increment Line Contents
================================================
4 @profile
5 9.11 MB 0.00 MB def function():
6 40.05 MB 30.94 MB x = list(range(1000000)) # allocate a big list
7 89.73 MB 49.68 MB y = copy.deepcopy(x)
8 82.10 MB -7.63 MB del x
9 82.10 MB 0.00 MB return y
I copied and pasted the same code but my profiler yields
Line # Mem usage Increment Line Contents
================================================
3 44.711 MiB 44.711 MiB @profile
4 def function():
5 83.309 MiB 38.598 MiB x = list(range(1000000)) # allocate a big list
6 90.793 MiB 7.484 MiB y = copy.deepcopy(x)
7 90.793 MiB 0.000 MiB del x
8 90.793 MiB 0.000 MiB return y
This post could be outdated: either the profiler package or Python could have changed. In any case, my questions are, in Python 3.6.x:
(1) Should copy.deepcopy(x) (as defined in the code above) consume a nontrivial amount of memory?
(2) Why couldn't I replicate?
(3) If I repeat x = list(range(1000000)) after del x, would the memory increase by the same amount as I first assigned x = list(range(1000000)) (as in line 5 of my code)?
copy.deepcopy() recursively copies mutable objects only; immutable objects such as integers or strings are not copied. Since the list being copied consists of immutable integers, the y copy ends up sharing references to the same integer objects:
>>> import copy
>>> x = list(range(1000000))
>>> y = copy.deepcopy(x)
>>> x[-1] is y[-1]
True
>>> all(xv is yv for xv, yv in zip(x, y))
True
So the copy only needs to create a new list object with 1 million references, an object that takes a little over 8 MiB of memory on my Python 3.6 build on Mac OS X 10.13 (a 64-bit OS):
>>> import sys
>>> sys.getsizeof(y)
8697464
>>> sys.getsizeof(y) / 2 ** 20  # MiB
8.294548034667969
An empty list object takes 64 bytes, each reference takes 8 bytes:
>>> sys.getsizeof([])
64
>>> sys.getsizeof([None])
72
Python list objects over-allocate space to grow. Converting a range() object to a list leaves a little more room for additional growth than deepcopy's append loop does, so x is slightly larger still, with room for about another 125k references before it has to resize again:
>>> sys.getsizeof(x)
9000112
>>> sys.getsizeof(x) / 2 ** 20
8.583175659179688
>>> ((sys.getsizeof(x) - 64) // 8) - 10**6
125006
while the copy only has additional space left for about 87k:
>>> ((sys.getsizeof(y) - 64) // 8) - 10**6
87175
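The over-allocation behaviour is easy to watch directly. This small probe (the exact jump points are a CPython implementation detail and vary between versions) records every point at which a growing list's allocated size changes:

```python
import sys

# Watch a list's allocated size jump as it grows by append;
# each jump is the over-allocation strategy reserving extra slots.
xs = []
last = sys.getsizeof(xs)
jumps = []
for i in range(1, 100):
    xs.append(i)
    size = sys.getsizeof(xs)
    if size != last:              # capacity grew
        jumps.append((i, size))
        last = size
print(jumps)
```

The size stays flat between jumps because appends are filling slots that were already reserved.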
On Python 3.6 I can't replicate the article's claims either, in part because Python has seen a lot of memory management improvements, and in part because the article is wrong on several points.
The behaviour of copy.deepcopy() regarding lists and integers has never changed in the module's long history (see the first revision, added in 1995), and the interpretation of the memory figures is wrong, even on Python 2.7.
Specifically, I can reproduce the results using Python 2.7. This is what I see on my machine:
$ python -V
Python 2.7.15
$ python -m memory_profiler memtest.py
Filename: memtest.py
Line # Mem usage Increment Line Contents
================================================
4 28.406 MiB 28.406 MiB @profile
5 def function():
6 67.121 MiB 38.715 MiB x = list(range(1000000)) # allocate a big list
7 159.918 MiB 92.797 MiB y = copy.deepcopy(x)
8 159.918 MiB 0.000 MiB del x
9 159.918 MiB 0.000 MiB return y
What is happening is that Python's memory management system is allocating a new chunk of memory for additional expansion. It's not that the new y list object takes nearly 93MiB of memory, that's just the additional memory the OS has allocated to the Python process when that process requested some more memory for the object heap. The list object itself is a lot smaller.
The Python 3 tracemalloc module is a lot more accurate about what actually happens:
python3 -m memory_profiler --backend tracemalloc memtest.py
Filename: memtest.py
Line # Mem usage Increment Line Contents
================================================
4 0.001 MiB 0.001 MiB @profile
5 def function():
6 35.280 MiB 35.279 MiB x = list(range(1000000)) # allocate a big list
7 35.281 MiB 0.001 MiB y = copy.deepcopy(x)
8 26.698 MiB -8.583 MiB del x
9 26.698 MiB 0.000 MiB return y
The Python 3.x memory manager and list implementation are smarter than the ones in 2.7; evidently the new list object was able to fit into existing, already-available memory pre-allocated when creating x.
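You can make the same measurement with the tracemalloc module directly, without memory_profiler. This sketch only traces allocations made after tracemalloc.start(), so the integers created for x are excluded and the deepcopy cost stands alone:

```python
import copy
import tracemalloc

x = list(range(1000000))
tracemalloc.start()               # only allocations from here on are traced
before, _ = tracemalloc.get_traced_memory()
y = copy.deepcopy(x)
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()
mib = (after - before) / 2 ** 20
print("deepcopy allocated %.1f MiB" % mib)  # roughly the new list object alone
```

On a 64-bit build this reports around 8.3 MiB, i.e. essentially just the new list of references, confirming that no integer objects were copied.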
We can test Python 2.7's behaviour with a manually built Python 2.7.12 tracemalloc binary and a small patch to memory_profiler.py. Now we get more reassuring results on Python 2.7 as well:
Filename: memtest.py
Line # Mem usage Increment Line Contents
================================================
4 0.099 MiB 0.099 MiB @profile
5 def function():
6 31.734 MiB 31.635 MiB x = list(range(1000000)) # allocate a big list
7 31.726 MiB -0.008 MiB y = copy.deepcopy(x)
8 23.143 MiB -8.583 MiB del x
9 23.141 MiB -0.002 MiB return y
I note that the author was confused as well:
copy.deepcopy copies both lists, which allocates again ~50 MB (I am not sure where the additional overhead of 50 MB - 31 MB = 19 MB comes from)
(Bold emphasis mine).
The error here is to assume that all changes in the Python process size can be attributed directly to specific objects. The reality is far more complex: the memory manager can add (and remove!) memory 'arenas', blocks of memory reserved for the heap, as needed, and will do so in larger blocks if that makes sense. The process is complex because it depends on interactions between Python's manager and the OS malloc implementation details. The author found an older article on Python's memory model that they misunderstood to be current; the author of that article has already tried to point this out. As of Python 2.5, the claim that Python doesn't free memory is no longer true.
What's troubling is that the same misunderstandings then lead the author to recommend against using pickle, but in reality the module, even on Python 2, never adds more than a little bookkeeping memory to track recursive structures. See this gist for my testing methodology; using cPickle on Python 2.7 adds a one-time 46 MiB increase (doubling the create_file() call results in no further memory increase). In Python 3, the memory changes are gone altogether.
I'll open a dialog with the Theano team about the post; the article is wrong and confusing, and Python 2.7 will soon be entirely obsolete anyway, so they really should focus on Python 3's memory model. (*)
When you create a new list from range(), not a copy, you'll see a similar increase in memory as for creating x the first time, because you'd create a new set of integer objects in addition to the new list object. Aside from a specific set of small integers, Python doesn't cache and re-use integer values for range() operations.
(*) Addendum: I opened issue #6619 with the Theano project. The project agreed with my assessment and removed the page from their documentation, although they haven't yet updated the published version.
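The integer-reuse point in the answer to question (3) can be checked directly. Small-int caching is a CPython implementation detail, so the second result is not guaranteed by the language, but it holds on CPython:

```python
x = list(range(1000000))
y = list(range(1000000))
print(x[5] is y[5])        # True: CPython caches small ints (-5..256)
print(x[1000] is y[1000])  # False: a fresh int object is created per list
```

So rebuilding the list after del x really does re-create ~1M integer objects, which is why the memory grows by roughly the same amount as the first assignment.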
I am trying to get an understanding of using memory_profiler on my python app.
Referring to the Python memory_profiler guide, I copied the following code snippet:

from memory_profiler import profile

@profile
def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 7)
    del b
    return a
The expected result, according to the link, is:
Line # Mem usage Increment Line Contents
==============================================
3 @profile
4 5.97 MB 0.00 MB def my_func():
5 13.61 MB 7.64 MB a = [1] * (10 ** 6)
6 166.20 MB 152.59 MB b = [2] * (2 * 10 ** 7)
7 13.61 MB -152.59 MB del b
8 13.61 MB 0.00 MB return a
But when I ran it on my VM running Ubuntu 16.04, I got the following results instead:
Line # Mem usage Increment Line Contents
================================================
3 35.4 MiB 35.4 MiB @profile
4 def my_func():
5 43.0 MiB 7.7 MiB a = [1] * (10 ** 6)
6 195.7 MiB 152.6 MiB b = [2] * (2 * 10 ** 7)
7 43.1 MiB -152.5 MiB del b
8 43.1 MiB 0.0 MiB return a
There seems to be a huge overhead of around 30 MiB between the expected output and my run. I am trying to understand where this comes from and whether I am doing anything incorrectly. Should I be worried about it?
Please advise if anyone has any idea. Thanks.
EDIT:
O/S: Ubuntu 16.04.4 (Xenial) running inside a VM
Python : Python 3.6.4 :: Anaconda, Inc.
The memory taken by a list or an integer depends heavily on the Python version/build.
For instance, in Python 3 all integers are long integers, whereas in Python 2, long is used only when the integer doesn't fit in a CPU register / C int.
On my machine, Python 2:
>>> sys.getsizeof(2)
24
Python 3.6.2:
>>> sys.getsizeof(2)
28
When comparing your ratio to the 28/24 ratio, it's pretty close:
>>> 195.7/166.2
1.177496991576414
>>> 28/24
1.1666666666666667
(this is probably not the only difference, but that's the most obvious I can think of)
So no, as long as the results are proportional you shouldn't worry. But if you do have memory issues with Python integers (in Python 3, that is), you could use alternatives such as numpy arrays or other native integer types.
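The difference between boxed Python ints and native integer storage can be shown with the standard-library array module (used here as a numpy stand-in; the 28-byte figure is what 64-bit CPython 3 reports for a small int):

```python
import sys
from array import array

refs = [2] * (10 ** 6)                # 1M references (8 bytes each) to one cached int
packed = array('q', [2] * (10 ** 6))  # 1M raw 64-bit values, 8 bytes each
print(sys.getsizeof(2))               # 28: per-object overhead of a small int object
print(packed.itemsize)                # 8: a native int64 needs only its value
```

Each distinct boxed int costs ~28 bytes of object header plus an 8-byte reference, while native storage pays only 8 bytes per value.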
I have a huge binary file (several GB) with the following data format:
4 subsequent bytes form one composite datapoint (32 bits), which consists of:
b0-b3: 4 flag bits
b4-b17: 14-bit signed integer
b18-b31: 14-bit signed integer
I need to access both signed integers and the flag bits separately and append them to a list or some smarter data structure (not yet decided). At the moment I'm using the following code to read it in:
from collections import namedtuple

DataPackage = namedtuple('DataPackage',
                         ['ie', 'if1', 'if2', 'if3', 'quad2', 'quad1'])

def _unpack_integer(bits):
    value = int(bits, 2)
    if bits[0] == '1':
        value -= (1 << len(bits))
    return value

def unpack(data):
    bits = ''.join(['{0:08b}'.format(b) for b in bytearray(data)])
    flags = [bits[i] == '1' for i in range(4)]
    quad2 = _unpack_integer(bits[4:18])
    quad1 = _unpack_integer(bits[18:])
    return DataPackage(flags[0], flags[1], flags[2], flags[3], quad2, quad1)

def read_file(filename, datapoints=None):
    data = []
    i = 0
    with open(filename, 'rb') as fh:
        value = fh.read(4)
        while value:
            dp = unpack(value)
            data.append(dp)
            value = fh.read(4)
            i += 1
            if i % 10000 == 0:
                print('Read: %d kB' % (float(i) * 4.0 / 1000.0))
            if datapoints and i == datapoints:
                break
    return data

if __name__ == '__main__':
    data = read_file('test.dat')
This code works, but it's too slow for my purposes (2 s for 100k datapoints of 4 bytes each). I need at least a factor of 10 in speed.
The profiler says that the code spends most of its time in string formatting (to get the bits) and in _unpack_integer().
Unfortunately I am not sure how to proceed. I'm thinking about either using Cython or directly writing some C code to do the reading. I also tried PyPy and it gave me a huge performance gain, but unfortunately the code needs to stay compatible with a bigger project that doesn't work with PyPy.
I would recommend trying ctypes if you already have a C/C++ library that recognizes the data structure. The benefit is that the data structures are still available to your Python code while the 'loading' would be fast. If you already have a C library to load the data, you can call into that library to do the heavy lifting and just map the data into your Python structures. I'm sorry I won't be able to try out and provide proper code for your example (perhaps someone else can), but here are a couple of tips to get you started.
My take on how one might create bit vectors in python:
https://stackoverflow.com/a/40364970/262108
The approach I mentioned above which I applied to a similar problem that you described. Here I use ctypes to create a ctypes data-structure (thus enabling me to use the object as any other python object), while also being able to pass it along to a C library:
https://gist.github.com/lonetwin/2bfdd41da41dae326afb
Thanks to the hint by Jean-François Fabre, I found a suitable solution using bitmasks, which gives me a speedup of a factor of 6 compared to the code in the question. It now has a throughput of around 300k datapoints/s.
I also dropped the admittedly nice named tuple and replaced it with a plain tuple, because I found it was also a bottleneck.
The code now looks like this:
import struct

masks = [2**(31 - i) for i in range(4)]

def unpack3(data):
    data = struct.unpack('>I', data)[0]
    quad2 = (data & 0xfffc000) >> 14
    quad1 = data & 0x3fff
    if (quad2 & (1 << (14 - 1))) != 0:
        quad2 = quad2 - (1 << 14)
    if (quad1 & (1 << (14 - 1))) != 0:
        quad1 = quad1 - (1 << 14)
    flag0 = data & masks[0]
    flag1 = data & masks[1]
    flag2 = data & masks[2]
    flag3 = data & masks[3]
    return flag0, flag1, flag2, flag3, quad2, quad1
The line profiler says:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
58 @profile
59 def unpack3(data):
60 1000000 3805727 3.8 12.3 data = struct.unpack('>I', data)[0]
61 1000000 2670576 2.7 8.7 quad2 = (data & 0xfffc000) >> 14
62 1000000 2257150 2.3 7.3 quad1 = data & 0x3fff
63 1000000 2634679 2.6 8.5 if (quad2 & (1 << (14 - 1))) != 0:
64 976874 2234091 2.3 7.2 quad2 = quad2 - (1 << 14)
65 1000000 2660488 2.7 8.6 if (quad1 & (1 << (14 - 1))) != 0:
66 510978 1218965 2.4 3.9 quad1 = quad1 - (1 << 14)
67 1000000 3099397 3.1 10.0 flag0 = data & masks[0]
68 1000000 2583991 2.6 8.4 flag1 = data & masks[1]
69 1000000 2486619 2.5 8.1 flag2 = data & masks[2]
70 1000000 2473058 2.5 8.0 flag3 = data & masks[3]
71 1000000 2742228 2.7 8.9 return flag0, flag1, flag2, flag3, quad2, quad1
So there is no single clear bottleneck anymore. It's probably about as fast as it gets in pure Python. Or does anyone have an idea for further speedup?
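One further pure-Python option worth trying (a sketch, untested against the real data files, and the speedup is not guaranteed): read the whole file into memory once and decode it with struct.iter_unpack, which avoids a Python-level fh.read(4) call per record and folds the four flag tests into a single 4-bit field:

```python
import struct

def unpack_all(buf):
    # Decode every 4-byte big-endian record in buf in one pass.
    out = []
    append = out.append
    for (word,) in struct.iter_unpack('>I', buf):
        flags = word >> 28                 # b0-b3 as a single 4-bit field
        quad2 = (word >> 14) & 0x3fff
        quad1 = word & 0x3fff
        if quad2 >= 0x2000:                # sign-extend the 14-bit values
            quad2 -= 0x4000
        if quad1 >= 0x2000:
            quad1 -= 0x4000
        append((flags, quad2, quad1))
    return out

# Example: one record with flags 0b1010, quad2 = -1, quad1 = 1
word = (0b1010 << 28) | (0x3fff << 14) | 1
print(unpack_all(struct.pack('>I', word)))  # [(10, -1, 1)]
```

If the individual flag bits are needed, they can be recovered cheaply from the packed 4-bit field with shifts and masks at the point of use.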
NB: This is my first foray into memory profiling with Python, so perhaps I'm asking the wrong question here. Advice on improving the question is appreciated.
I'm working on some code where I need to store a few million small strings in a set. This, according to top, uses ~3x the amount of memory reported by heapy. I'm not clear what all this extra memory is used for, or how to figure out whether I can reduce the footprint, and if so, how.
memtest.py:
from guppy import hpy
import gc
hp = hpy()
# do setup here - open files & init the class that holds the data
print 'gc', gc.collect()
hp.setrelheap()
raw_input('relheap set - enter to continue') # top shows 14MB resident for python
# load data from files into the class
print 'gc', gc.collect()
h = hp.heap()
print h
raw_input('enter to quit') # top shows 743MB resident for python
The output is:
$ python memtest.py
gc 5
relheap set - enter to continue
gc 2
Partition of a set of 3197065 objects. Total size = 263570944 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 3197061 100 263570168 100 263570168 100 str
1 1 0 448 0 263570616 100 types.FrameType
2 1 0 280 0 263570896 100 dict (no owner)
3 1 0 24 0 263570920 100 float
4 1 0 24 0 263570944 100 int
So in summary, heapy shows 264MB while top shows 743MB. What's using the extra 500MB?
Update:
I'm running 64 bit python on Ubuntu 12.04 in VirtualBox in Windows 7.
I installed guppy as per the answer here:
sudo pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy
Can somebody help me find out how much time and how much memory a piece of Python code takes?
Use this for calculating time:

import time

time_start = time.clock()
# run your code
time_elapsed = (time.clock() - time_start)

(Note: time.clock() was deprecated in Python 3.3 and removed in Python 3.8; on modern Python use time.perf_counter() instead.)
As referenced by the Python documentation:
time.clock()
On Unix, return the current processor time as a floating
point number expressed in seconds. The precision, and in fact the very
definition of the meaning of “processor time”, depends on that of the
C function of the same name, but in any case, this is the function to
use for benchmarking Python or timing algorithms.
On Windows, this function returns wall-clock seconds elapsed since the
first call to this function, as a floating point number, based on the
Win32 function QueryPerformanceCounter(). The resolution is typically
better than one microsecond.
Reference: http://docs.python.org/library/time.html
Use this for calculating memory:
import resource
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
Reference: http://docs.python.org/library/resource.html
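A unit caveat worth knowing when using ru_maxrss: the value is reported in kibibytes on Linux but in bytes on macOS, so a portable sketch needs a platform check (the divisor logic here is my addition, not part of the resource docs):

```python
import resource
import sys

usage = resource.getrusage(resource.RUSAGE_SELF)
# ru_maxrss is in kibibytes on Linux but in bytes on macOS
divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
print("peak RSS: %.1f MiB" % (usage.ru_maxrss / divisor))
```

Note that ru_maxrss is a high-water mark for the whole process, not the current usage, so it never decreases during a run.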
Based on Daniel Li's answer, for cut & paste convenience and Python 3.x compatibility:

import time
import resource

time_start = time.perf_counter()
# insert code here ...
time_elapsed = (time.perf_counter() - time_start)

# ru_maxrss is in bytes on macOS, so dividing twice yields MiB there;
# on Linux it is in kibibytes, so divide by 1024 only once
memMb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0 / 1024.0
print("%5.1f secs %5.1f MByte" % (time_elapsed, memMb))
Example:
2.3 secs 140.8 MByte
There is a really good library called jackedCodeTimerPy for timing your code. You should then use the resource package that Daniel Li suggested.
jackedCodeTimerPy gives really good reports like
label min max mean total run count
------- ----------- ----------- ----------- ----------- -----------
imports 0.00283813 0.00283813 0.00283813 0.00283813 1
loop 5.96046e-06 1.50204e-05 6.71864e-06 0.000335932 50
I like how it gives you statistics on it and the number of times the timer is run.
It's simple to use. If I want to measure the time code takes in a for loop, I just do the following:
from jackedCodeTimerPY import JackedTiming

JTimer = JackedTiming()
for i in range(50):
    JTimer.start('loop')  # 'loop' is the name of the timer
    doSomethingHere = 'This is really useful!'
    JTimer.stop('loop')
print(JTimer.report())  # prints the timing report
You can also have multiple timers running at the same time.
JTimer.start('first timer')
JTimer.start('second timer')
do_something = 'amazing'
JTimer.stop('first timer')
do_something = 'else'
JTimer.stop('second timer')
print(JTimer.report()) # prints the timing report
There are more usage examples in the repo. Hope this helps.
https://github.com/BebeSparkelSparkel/jackedCodeTimerPY
Use a memory profiler like guppy
>>> from guppy import hpy; h=hpy()
>>> h.heap()
Partition of a set of 48477 objects. Total size = 3265516 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 25773 53 1612820 49 1612820 49 str
1 11699 24 483960 15 2096780 64 tuple
2 174 0 241584 7 2338364 72 dict of module
3 3478 7 222592 7 2560956 78 types.CodeType
4 3296 7 184576 6 2745532 84 function
5 401 1 175112 5 2920644 89 dict of class
6 108 0 81888 3 3002532 92 dict (no owner)
7 114 0 79632 2 3082164 94 dict of type
8 117 0 51336 2 3133500 96 type
9 667 1 24012 1 3157512 97 __builtin__.wrapper_descriptor
<76 more rows. Type e.g. '_.more' to view.>
>>> h.iso(1,[],{})
Partition of a set of 3 objects. Total size = 176 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 33 136 77 136 77 dict (no owner)
1 1 33 28 16 164 93 list
2 1 33 12 7 176 100 int
>>> x=[]
>>> h.iso(x).sp
0: h.Root.i0_modules['__main__'].__dict__['x']