Python Memory Size

If I have a list of numbers and I implement the same thing using a dictionary, will they both occupy the same memory space?
E.g.
list = [[0,1],[1,0]]
dict = {'0,0': 0, '0,1': 1, '1,0': 1, '1,1': 0}
Will the memory size of the list and the dict be the same? Which will take up more space?

If you're using Python 2.6 or higher, you can use sys.getsizeof() to run a quick test:
>>> import sys
>>> aList = [[0,1],[1,0]]
>>> aDict = {'0,0' : 0 , '0,1' : 1, '1,0' : 1, '1,1' : 0 }
>>> sys.getsizeof(aList)
44
>>> sys.getsizeof(aDict)
140
With the example you provided, we see that aDict takes up more space in memory.

Chances are very good that the dictionary will be bigger.
A dictionary and a list are fairly different data structures. When it boils down to memory and processor instructions, a list is fairly straightforward: all values in it are contiguous, and when you want to access item n, you go to the beginning of the list, move n items forward, and return it. This is easy because list items are contiguous and have integer keys.
On the other hand, the constraints on dictionaries are quite different. You can't just go to the beginning of the dictionary, move key items forward, and return the value, because the key might not be numeric at all. Besides, keys don't need to be contiguous.
In the case of a dictionary, you need a structure that finds the values associated with keys very easily, even though there might be no relationship between those keys. Therefore, it can't use the same kind of algorithms a list uses. And typically, the data structures required by dictionaries are bigger than the data structures required for lists.
The two might coincidentally end up the same size, though that would be rather surprising. In any case, the data representation will be different.

Getting the size of objects in Python is tricky. Ostensibly, this can be done with sys.getsizeof, but it returns incomplete data.
An empty list has a size of 32 bytes on my system.
>>> sys.getsizeof([])
32
A list with one element has a size of 36 bytes. This does not seem to vary according to the element.
>>> sys.getsizeof([[1, 2]])
36
>>> sys.getsizeof([1])
36
So you would need to know the size of the inner list as well.
>>> sys.getsizeof([1, 2])
40
So your memory usage for a list (assuming the same as my system) should be 32 bytes plus 44 bytes for every internal list. This is because Python stores the overhead involved in keeping a list, which costs 32 bytes. Every additional entry is represented as a pointer to that object and costs 4 bytes. So the pointer costs 4 bytes on top of whatever you are storing. In the case of two-element lists, this is 40 bytes.
For a dict, it's 136 bytes for an empty dict:
>>> sys.getsizeof({})
136
From there, it will quadruple its size as adding members causes it to run out of space and risk frequent hash collisions. Again, you have to include the size of the object being stored and the keys as well.
>>> sys.getsizeof({1: 2})
136
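To see that growth pattern on your own machine, here is a small hedged sketch (the exact byte counts and resize points vary by Python version and platform):
import sys

d = {}
last = sys.getsizeof(d)
print("empty dict:", last, "bytes")
for i in range(50):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:                      # the dict just resized its hash table
        print(len(d), "entries ->", size, "bytes")
        last = size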

In short, they won't be the same. The list should perform better, unless you can have sparsely populated lists that could be implemented in a dict.
As mentioned by several others, getsizeof doesn't sum the contained objects.
Here is a recipe that does some work for you on standard python (3.0) types.
Compute Memory footprint of an object and its contents
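In the snippet below, total_size comes from that recipe. As a rough idea of what it does, here is a simplified sketch (not the recipe itself): recursively sum sys.getsizeof over containers, counting each object only once.
import sys

def total_size(obj, seen=None):
    """Roughly approximate the deep size of obj in bytes."""
    if seen is None:
        seen = set()
    if id(obj) in seen:                      # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size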
Using this recipe on Python 3.1, here are some results:
aList = [[x,x] for x in range(1000)]
aListMod10 = [[x%10,x%10] for x in range(1000)]
aTuple = [(x,x) for x in range(1000)]
aDictString = dict(("%s,%s" % (x,x),x) for x in range(1000))
aDictTuple = dict(((x,x),x) for x in range(1000))
print("0", total_size(0))
print("10", total_size(10))
print("100", total_size(100))
print("1000", total_size(1000))
print("[0,1]", total_size([0,1]))
print("(0,1)", total_size((0,1)))
print("aList", total_size(aList))
print("aTuple", total_size(aTuple))
print("aListMod10", total_size(aListMod10))
print("aDictString", total_size(aDictString))
print("aDictTuple", total_size(aDictTuple))
print("[0]'s", total_size([0 for x in range(1000)]))
print("[x%10]'s", total_size([x%10 for x in range(1000)]))
print("[x%100]'s", total_size([x%100 for x in range(1000)]))
print("[x]'s", total_size([x for x in range(1000)]))
Output:
0 12
10 14
100 14
1000 14
[0,1] 70
(0,1) 62
aList 62514
aTuple 54514
aListMod10 48654
aDictString 82274
aDictTuple 74714
[0]'s 4528
[x%10]'s 4654
[x%100]'s 5914
[x]'s 18514
It seems to follow logically that the most memory-efficient option would be to use two lists:
list_x = [0, 1, ...]
list_y = [1, 0, ...]
This may only be worth it if you are running tight on memory and your list is expected to be large. I would guess that the usage pattern would be creating (x,y) tuples all over the place anyway, so it may be that you should really just do:
tuples = [(0, 1), (1, 0), ...]
All things being equal, choose what allows you to write the most readable code.

Related

memory usage: numpy-arrays vs python-lists

Numpy is known for optimized arrays and various advantages over Python lists.
But when I check the memory usage, the Python list takes less space than the numpy array.
The code I used is below.
Can anyone explain why?
import sys
import numpy as np

Z = np.zeros((10, 10), dtype=int)
A = [[0] * 10] * 10
print(A, '\n', f'{sys.getsizeof(A)} bytes')
print(Z, '\n', f'{Z.size * Z.itemsize} bytes')
You're not measuring correctly; the native Python list only contains 10 references. You need to add in the collective size of the sub-lists as well:
>>> sys.getsizeof(A) + sum(map(sys.getsizeof, A))
1496
And it might get worse: each element inside the sub-lists could also be a reference (to an int). It's difficult to check whether the Python implementation is optimizing this away and storing the actual numbers inside the list.
You're also under-representing the size of the numpy array, because it includes a header:
>>> Z.size * Z.itemsize
800
>>> sys.getsizeof(Z)
912
In either case it's not an exact science and will depend on your platform and Python implementation.
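For a rough like-for-like comparison, here is a hedged sketch (exact numbers depend on the interpreter and platform) that sums the list containers on one side and uses getsizeof's header-inclusive figure on the other:
import sys
import numpy as np

Z = np.zeros((10, 10), dtype=int)
A = [[0] * 10 for _ in range(10)]      # independent rows, unlike [[0] * 10] * 10

list_total = sys.getsizeof(A) + sum(sys.getsizeof(row) for row in A)   # containers only, not the ints
print("list containers:", list_total, "bytes")
print("numpy array:    ", sys.getsizeof(Z), "bytes (data plus header)")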
According to the spec (https://docs.python.org/3/library/sys.html#sys.getsizeof), "only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
Also "getsizeof calls object’s " sizeof method
So you are given the size of just the container (a list object).
Please check https://code.activestate.com/recipes/577504/ for a complete size computation, which returns 296bytes for your example since only two unique objects are used. A list [0 0 0 0 0 0 0 0 0 0] and int 0.
If you initialize the list with different values, overall size will increase and will become bigger than np.array, which reserves 4bytes for numpy.int32 type elements, plus the size of its own internal data structure.
Find detailed info with examples here: https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html

Avoid generating duplicate values from random

I want to generate random numbers and store them in a list, as in the following:
alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]
My concern is the following: I want to generate total = 40 million values in the range (0, 2 ** mypower - 1). If mypower = 64, then alist will be about 20 GB in size (40M*64*8), which is far too large for my laptop's memory. My idea is to generate the values iteratively in chunks, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is: if I do that in a loop, is it guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already produced in a previous iteration? Something like this:
for i in range(num_of_chunks):
    alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
    # save to file
Well, since efficiency/speed doesn't matter, I think this will work:
s = set()
while len(s) < total:
    s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)
Since sets can only contain unique elements, I think this will work well enough.
To guarantee unique values, you should avoid using random. Instead, you should use encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed-unique, random-seeming outputs back, provided you use a secure cipher. Good encryption is designed to give random-seeming output.
Keep track of the key (essential) and of how far you have got. For your first batch, encrypt the integers 0..5,000,000. For the second batch, encrypt 5,000,001..10,000,000, and so on.
You want 64-bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output of each encryption will be 64 bits. ECB mode does have a weakness, but it only applies to identical inputs. Since you are supplying unique inputs, the weakness is not relevant for your particular application.
If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.
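A minimal sketch of that scheme, assuming the pycryptodome package provides the DES primitive (the key and batch bounds here are purely illustrative):
import struct
from Crypto.Cipher import DES

key = b'8bytekey'                     # keep this key to reproduce the same sequence
cipher = DES.new(key, DES.MODE_ECB)

def unique_random_64bit(start, stop):
    """Encrypt the counters start..stop-1; distinct inputs give distinct 64-bit outputs."""
    for i in range(start, stop):
        block = cipher.encrypt(struct.pack('>Q', i))   # 8-byte plaintext block
        yield struct.unpack('>Q', block)[0]

first_values = list(unique_random_64bit(0, 10))        # first ten values of the first batch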
One way to generate random values that don't repeat is to first create a list of contiguous values:
l = list(range(1000))
then shuffle it:
import random
random.shuffle(l)
You could do that several times and save the results to a file, but the range you can cover will be limited: you'll never see the whole picture because of your limited memory (it's like trying to sort a big list without having the memory for it).
As someone noted, covering a wide span of random numbers needs a lot of memory, so this is simple but not very efficient.
Another hack I just thought of: do the same as above, but generate the range using a step. Then, in a second pass, add a random offset to the values. Even if the offset values repeat, it's guaranteed never to generate the same number twice:
import random
step = 10
l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
With the required max value and number of iterations, that gives:
import random
number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations
l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))
This runs in 1-2 minutes on my laptop and gives 40,000,000 distinct values (evenly scattered across the range).
Usually, random number generators are not really random at all; in fact, this is quite helpful in some situations. If you want the values to differ in the second iteration, give it a different seed value.
random.seed()
The same seed will generate the same list, so if you want the next iteration to be the same, use the same seed. If you want it to be different, use a different seed.
This may need a lot of CPU and physical memory!
I suggest classifying your data.
For example, you could save:
All numbers starting with 10 that are 5 characters long (example: 10365) to 10-5.txt
All numbers starting with 11 that are 6 characters long (example: 114567) to 11-6.txt
Then, to check a new number:
For example, my number is 9256547.
It starts with 92 and is 7 characters long.
So I search 92-7.txt for this number, and if it isn't a duplicate, I add it to 92-7.txt.
Finally, you can join all the files together.
Sorry if I make mistakes; English isn't my first language.
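For what it's worth, a rough sketch of this bucketing scheme (the file naming and one-number-per-line storage format are illustrative assumptions):
def bucket_name(n):
    s = str(n)
    return "%s-%d.txt" % (s[:2], len(s))     # e.g. 9256547 -> "92-7.txt"

def add_if_new(n):
    """Append n to its bucket file unless it is already there; return True if added."""
    name = bucket_name(n)
    try:
        with open(name) as f:
            seen = set(f.read().split())
    except FileNotFoundError:
        seen = set()
    if str(n) in seen:
        return False
    with open(name, "a") as f:
        f.write(str(n) + "\n")
    return True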

How to create an empty 2-D list (list of lists) of a given size in megabytes

How do I create an empty 2-D list (a list of lists) of a given size in megabytes? The main purpose is that it can store data measured in MB, but while iterating it should not process the empty sub-lists.
It sounds like you want a data structure like this: [[], [], []], ultimately. Let's call that final data structure master. There are two ways to interpret your question, then:
1) the "size" of just master itself (and not the objects to which it refers) should be n megabytes.
2) the size of master plus the size of every object contained within should be n megabytes.
These are two slightly different problems. If you just wanted to add empty lists to master until the size of master reached n MB, then you could do something like this.
from sys import getsizeof

def init_empty_array(mb):
    byte = mb * (10 ** 6)
    master = []
    size = getsizeof(master)
    while size < byte:
        master.append([])
        size = getsizeof(master)
    return master
sys.getsizeof returns the size of an object in bytes, which is why we convert the passed mb (megabytes) parameter to bytes. Then we create an empty "master" list and grab its size (which is 64 bytes in Python). After that, we just append empty lists to master until we reach the desired size. You could really just initialize size as 64 and then add 8 for each additional list, instead of calling sys.getsizeof.
Notice, however, that the ultimate size of master likely will not correspond with the size you would expect from appending k lists to master. That's because sys.getsizeof only measures the size of master itself, and not the lists contained therein. Consider this:
>>> from sys import getsizeof
>>> getsizeof([])
64
>>> getsizeof([[]])
72
Well, look at that. The size, in bytes, of [] (an empty list) is 64 bytes. If we nest an empty list inside of a formerly-empty list, call the outer list arr, then the size of arr becomes 72. Each additional object (str, list, set, etc.) added to arr adds 8 bytes to the size of arr. But there's still technically an empty list inside of [[]], right? It's up to you whether you want to account for the size (64 bytes) of each empty list inside your master list.
To get both the size of master and the objects contained within, you can try something like this:
def init_empty_array(mb):
    byte = mb * (10 ** 6)
    master = []
    size = 64
    while size < byte:
        master.append([])
        size += 64 + 8
    return master
We add 64 bytes for the additional list, and 8 for the additional size added to master from that list.
Please note that, depending on how many megabytes you pass (e.g. 2.53 versus 2.0), your final master list may or may not be exactly the size of mb that you pass (that's because we can't append a fraction of a list; it's either 64 bytes or nothing).
Depending on whether mb is a strict upper or lower limit, you can adjust where master's size falls (i.e. add some modular arithmetic in there to find the maximum or minimum number of lists for a given number of megabytes).
And, suppose you wanted to fill a list with whatever until its size reached n MB. You could use the recursive memory footprint function here to measure the size of everything in that list (whether the objects within are str, set, list, dict, or tuple, or any combination of those classes).
As you can see, there's lots of room above for your own interpretation and tweaking, but this should give you a good start.
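For example, a quick sanity check of the second version (a hedged illustration; the function's byte accounting is an approximation, so the real footprint will differ slightly):
master = init_empty_array(1)     # target roughly 1 MB by the function's own accounting
print(len(master), "empty sub-lists, estimated", 64 + 72 * len(master), "bytes")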

Python: Effective storage of data in memory

I have the following dictionary structure (10,000 keys, with each value being a list of lists):
my_dic = {0: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ... (10,000th value) [3, .15, 2, 1, 2.5]],
          1: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ... (10,000th value) [3, .15, 2, 1, 2.5]],
          ...
          (10,000th key): [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ... (10,000th value) [3, .15, 2, 1, 2.5]]}
(Note: Data is dummy, so I have just repeated it across the keys)
The logical data types that I want in the smaller elementry list are
inner_list = [int, float, small_int, boolean( 0 or 1), float]
sys.getsizeof(inner_list) shows its size to be 56 bytes. Adding 12 bytes for the int key makes it 68 bytes. Now, since I have 10^8 such lists (10,000 * 10,000), their storage in memory is becoming a big problem. I want the data in memory (no DB as of now). What would be the most optimized way of storing it? I am inclined to think that it must have something to do with numpy, but I am not sure what the best method would be or how to implement it. Any suggestions?
2) Also, since I am storing these dictionaries in memory, I would like to clear the memory occupied by them as soon as I am done using them. Is there a way of doing this in Python?
One idea is to break up the dictionary structure into simpler structures, but it may affect how efficiently you can process it.
1 Create a separate array for the keys
keys = array('i', [key1, key2, ..., key10000])
Depending on the possible values of the keys, you can further specify the particular int type for the array. Also, the keys should be kept ordered, so you can perform binary search on the key table. This way you also save the space taken by the hash table used in the Python dictionary implementation. The downside is that key lookup now takes O(log n) time instead of O(1).
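A small sketch of that lookup (the keys here are illustrative):
from array import array
from bisect import bisect_left

keys = array('i', sorted([3, 17, 42, 99]))   # compact, sorted key table

def key_index(k):
    """Return the position of k in keys, or -1 if absent (O(log n))."""
    i = bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        return i            # row index into the value storage
    return -1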
2 Store the inner_list elements in a 10000x10000 matrix or in 100,000,000-element lists
As each position i from 0 to 9999 corresponds to a specific key obtainable from the keys array, each list of lists can be put into the i'th row of the matrix, with each inner_list's elements in the columns of that row.
Another option is to put them in one long list and index it using the key position i, such that
idx = i*10000 + j
where i is the index of the key in the keys array and j is the index of the particular inner_list instance.
Additionally, you can keep five separate arrays, one per inner_list element, which somewhat breaks the locality of the data in memory:
int_array = array('i', [value1, ..., value100000000])
float1_array = array('f', [value1, ..., value100000000])
small_int_array = array('h', [value1, ..., value100000000])
bool_array = array('b', [value1, ..., value100000000])  # array has no '?' typecode; a signed char ('b') works
float2_array = array('f', [value1, ..., value100000000])
The boolean array can be optimized further by packing the booleans into bits.
An alternative is to pack the inner_list elements into a binary string using the struct module and store those in a single list instead of five separate arrays.
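A small sketch of that packing idea (the '=ifh?f' format is an assumption matching [int, float, small_int, boolean, float]; note the floats are stored as 32-bit):
import struct

record = struct.Struct('=ifh?f')     # 4 + 4 + 2 + 1 + 4 = 15 bytes per inner_list

def pack_inner(inner):
    return record.pack(*inner)

def unpack_inner(blob):
    return list(record.unpack(blob))

blob = pack_inner([1, .65, 3, 0, 5.5])
print(len(blob), unpack_inner(blob))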
3 Releasing memory
As soon as variables go out of scope, they are ready to be garbage collected, so the memory can be claimed back. To make this happen sooner, for example inside a function or a loop, you may simply rebind the list to a dummy value to bring the object's reference count down to zero.
variable = None
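A trivial illustration (the gc.collect() call is optional and usually unnecessary):
import gc

big = [[0] * 5 for _ in range(10 ** 6)]
# ... use big ...
big = None        # or: del big -- drops the last reference so the list can be reclaimed
gc.collect()      # optionally ask the collector to run now rather than later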
Note
However, these ideas may not be good enough for your particular case. There are other possibilities too, such as loading only part of the data into memory at a time. It depends on how you plan to process it.
Generally, Python takes its own share of memory for its internal handling of pointers and structures. Therefore, yet another alternative is to implement that particular data structure and its handling in a language like Fortran, C or C++, which can be tuned more easily for your particular needs.

Memory efficient int-int dict in Python

I need a memory efficient int-int dict in Python that would support the following operations in O(log n) time:
d[k] = v # replace if present
v = d[k] # None or a negative number if not present
I need to hold ~250M pairs, so it really has to be tight.
Do you happen to know a suitable implementation (Python 2.7)?
EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!
To rephrase. Here's a trivial int-int dictionary with 1M pairs:
>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
... d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
...
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 25165960 51 25165960 51 dict (no owner)
1 1999521 100 23994252 49 49160212 100 int
On average, a pair of integers uses 49 bytes.
Here's an array of 2M integers:
>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
... a.append(random.randint(0, sys.maxint))
...
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 7 8000028 100 8000028 100 array.array
On average, a pair of integers uses 8 bytes.
I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of int-int dictionary that uses considerably less than 49 bytes/pair?
You could use the IIBTree from Zope.
I don't know if this is a one-shot solution or part of an ongoing project, but if it's the former, is throwing more RAM at it cheaper than the developer time needed to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at 15 GB, which would fit easily enough into most desktop boxes.
I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with the library to tell you exactly where to look.
http://docs.scipy.org/doc/numpy/reference/
You might also find some useful ideas in this thread:
Memory Efficient Alternatives to Python Dictionaries
8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).
I suggest you go with your array method, but the best approach will depend on the nature of the keys, I expect.
How about a Judy array if you're mapping from ints? It is kind of a sparse array... and uses about 1/4 of the dictionary implementation's space.
Judy:
$ cat j.py ; time python j.py
import judy, random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = judy.JudyIntObjectMap()
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
print h.heap()
Partition of a set of 4000004 objects. Total size = 96000624 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 4000001 100 96000024 100 96000024 100 int
1 1 0 448 0 96000472 100 types.FrameType
2 1 0 88 0 96000560 100 __builtin__.weakref
3 1 0 64 0 96000624 100 __builtin__.PyJudyIntObjectMap
real 1m9.231s
user 1m8.248s
sys 0m0.381s
Dictionary:
$ cat d.py ; time python d.py
import random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = {}
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
print h.heap()
Partition of a set of 8000003 objects. Total size = 393327344 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 201326872 51 201326872 51 dict (no owner)
1 8000001 100 192000024 49 393326896 100 int
2 1 0 448 0 393327344 100 types.FrameType
real 1m8.129s
user 1m6.947s
sys 0m0.559s
~1/4th the space:
$ echo 96000624 / 393327344 | bc -l
.24407309958089260125
(I'm using 64bit python, btw, so my base numbers may be inflated due to 64bit pointers)
Looking at your data above, that's not 49 bytes per int, it's 25. The other 24 bytes per entry are the int objects themselves. So you need something that is significantly smaller than 25 bytes per entry, unless you are also going to reimplement the int objects, which is possible for the key hashes at least. Or implement it in C, where you can skip the objects completely (this is what Zope's IIBTree does, mentioned above).
To be honest, the Python dictionary is highly tuned in various ways. It will not be easy to beat, but good luck.
I have implemented my own int-int dictionary, available here (BSD license). In short, I use array.array('i') to store the key-value pairs sorted by key. In fact, instead of one large array, I keep a dictionary of smaller arrays (a key-value pair is stored in the key/65536'th array) in order to speed up shifting during insertion and binary search during retrieval. Each array stores the keys and values in the following way:
key0 value0 key1 value1 key2 value2 ...
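A rough sketch of that layout (an illustration of the idea, not the linked implementation; typecode 'i' limits keys and values to the C int range, 'q' would widen it):
from array import array

buckets = {}      # bucket id (key // 65536) -> interleaved key/value array

def _slot(a, key):
    """Binary search over the key slots (even indices); returns the insertion slot."""
    lo, hi = 0, len(a) // 2
    while lo < hi:
        mid = (lo + hi) // 2
        if a[2 * mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return 2 * lo

def put(key, value):
    a = buckets.setdefault(key // 65536, array('i'))
    i = _slot(a, key)
    if i < len(a) and a[i] == key:
        a[i + 1] = value                       # replace if present
    else:
        a[i:i] = array('i', [key, value])      # insert, keeping keys sorted

def get(key, default=-1):
    a = buckets.get(key // 65536)
    if a is not None:
        i = _slot(a, key)
        if i < len(a) and a[i] == key:
            return a[i + 1]
    return default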
Actually, it is not only an int-int dictionary, but a general object-int dictionary, with objects reduced to their hashes. Thus, the hash-int dictionary can be used as a cache for some persistently stored dictionary.
There are three possible strategies of handling "key collisions", that is, attempts to assign a different value to the same key. The default strategy allows it. The "deleting" removes the key and marks it as colliding, so that any further attempts to assign a value to it will have no effect. The "shouting" strategy throws an exception during any overwrite attempt and on any further access to any colliding key.
Please see my answer to a related question for a differently worded description of my approach.
