Memory efficient int-int dict in Python

I need a memory efficient int-int dict in Python that would support the following operations in O(log n) time:
d[k] = v # replace if present
v = d[k] # None or a negative number if not present
I need to hold ~250M pairs, so it really has to be tight.
Do you happen to know a suitable implementation (Python 2.7)?
EDIT Removed impossible requirement and other nonsense. Thanks, Craig and Kylotan!
To rephrase. Here's a trivial int-int dictionary with 1M pairs:
>>> import random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> d = {}
>>> for _ in xrange(1000000):
...     d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
...
>>> h.heap()
Partition of a set of 1999530 objects. Total size = 49161112 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 25165960 51 25165960 51 dict (no owner)
1 1999521 100 23994252 49 49160212 100 int
On average, a pair of integers uses 49 bytes.
Here's an array of 2M integers:
>>> import array, random, sys
>>> from guppy import hpy
>>> h = hpy()
>>> h.setrelheap()
>>> a = array.array('i')
>>> for _ in xrange(2000000):
...     a.append(random.randint(0, sys.maxint))
...
>>> h.heap()
Partition of a set of 14 objects. Total size = 8001108 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 7 8000028 100 8000028 100 array.array
On average, a pair of integers uses 8 bytes.
I accept that 8 bytes/pair in a dictionary is rather hard to achieve in general. Rephrased question: is there a memory-efficient implementation of int-int dictionary that uses considerably less than 49 bytes/pair?

You could use the IIBTree from Zope.
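For example, a minimal sketch, assuming the BTrees package that ships with Zope (also installable on its own) is available. An IIBTree stores both keys and values as C ints, so the per-pair overhead is far below that of a dict of Python int objects, and access is O(log n), which matches the requirement in the question:

import BTrees
from BTrees.IIBTree import IIBTree

d = IIBTree()
d[42] = 7             # replace if present
print(d.get(42, -1))  # -> 7
print(d.get(99, -1))  # -> -1, i.e. a negative number if not present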

I don't know if this is a one-shot solution or part of an ongoing project, but if it's the former, is throwing more RAM at it cheaper than the developer time needed to optimize the memory usage? Even at 64 bytes per pair, you're still only looking at 15GB, which would fit easily enough into most desktop boxes.
I think the correct answer probably lies within the SciPy/NumPy libraries, but I'm not familiar enough with them to tell you exactly where to look.
http://docs.scipy.org/doc/numpy/reference/
You might also find some useful ideas in this thread:
Memory Efficient Alternatives to Python Dictionaries
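For what it's worth, here is one way NumPy could be applied to this problem (my sketch, not something taken from the linked docs): keep the keys and values in two parallel int32 arrays sorted by key and use binary search for lookups. That gives roughly 8 bytes per pair and O(log n) access, at the cost of expensive insertion, so it fits best when the data can be built in bulk and sorted once:

import numpy as np

keys = np.array([3, 10, 42, 97], dtype=np.int32)      # must be kept sorted
vals = np.array([30, 100, 420, 970], dtype=np.int32)  # parallel to keys

def lookup(k, default=-1):
    i = np.searchsorted(keys, k)       # binary search, O(log n)
    if i < len(keys) and keys[i] == k:
        return int(vals[i])
    return default

print(lookup(42))   # -> 420
print(lookup(5))    # -> -1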

8 bytes per key/value pair would be pretty difficult under any implementation, Python or otherwise. If you don't have a guarantee that the keys are contiguous then either you'd waste a lot of space between the keys by using an array representation (as well as needing some sort of dead value to indicate a null key), or you'd need to maintain a separate index to key/value pairs which by definition would exceed your 8 bytes per pair (even if only by a small amount).
I suggest you go with your array method, but the best approach will depend on the nature of the keys, I expect.
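To make the trade-off concrete, here is a minimal sketch of the direct-addressed array idea mentioned above (my example, only sensible when the keys are dense; values must be non-negative so that -1 can serve as the "dead" value):

import array

MAX_KEY = 1000                     # assumed key range for the sketch
ABSENT = -1                        # dead value marking "no entry"

table = array.array('i', [ABSENT]) * (MAX_KEY + 1)

table[42] = 7                      # d[k] = v
v = table[42]                      # -> 7
missing = table[43]                # -> -1, i.e. negative when not present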

How about a Judy array if you're mapping from ints? It is a kind of sparse array... and uses about 1/4 of the dict implementation's space.
Judy:
$ cat j.py ; time python j.py
import judy, random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = judy.JudyIntObjectMap()
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
print h.heap()
Partition of a set of 4000004 objects. Total size = 96000624 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 4000001 100 96000024 100 96000024 100 int
1 1 0 448 0 96000472 100 types.FrameType
2 1 0 88 0 96000560 100 __builtin__.weakref
3 1 0 64 0 96000624 100 __builtin__.PyJudyIntObjectMap
real 1m9.231s
user 1m8.248s
sys 0m0.381s
Dictionary:
$ cat d.py ; time python d.py
import random, sys
from guppy import hpy
random.seed(0)
h = hpy()
h.setrelheap()
d = {}
for _ in xrange(4000000):
    d[random.randint(0, sys.maxint)] = random.randint(0, sys.maxint)
print h.heap()
Partition of a set of 8000003 objects. Total size = 393327344 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 1 0 201326872 51 201326872 51 dict (no owner)
1 8000001 100 192000024 49 393326896 100 int
2 1 0 448 0 393327344 100 types.FrameType
real 1m8.129s
user 1m6.947s
sys 0m0.559s
~1/4th the space:
$ echo 96000624 / 393327344 | bc -l
.24407309958089260125
(I'm using 64bit python, btw, so my base numbers may be inflated due to 64bit pointers)

Looking at your data above, the dict itself accounts for about 25 of those 49 bytes per pair; the other 24 bytes per entry are the int objects themselves. So you need something significantly smaller than 25 bytes per entry, unless you are also going to reimplement the int objects (which is possible for the key hashes, at least), or implement it in C, where you can skip the objects completely (this is what Zope's IIBTree, mentioned above, does).
To be honest the Python dictionary is highly tuned in various ways. It will not be easy to beat it, but good luck.

I have implemented my own int-int dictionary, available here (BSD license). In short, I use array.array('i') to store key-value pairs sorted by keys. In fact, instead of one large array, I keep a dictionary of smaller arrays (a key-value pair is stored in the key/65536th array) in order to speed up shifting during insertion and binary search during retrieval. Each array stores the keys and values in the following way:
key0 value0 key1 value1 key2 value2 ...
Actually, it is not only an int-int dictionary, but a general object-int dictionary with objects reduced to their hashes. Thus, the hash-int dictionary can be used as a cache of some persistently stored dictionary.
There are three possible strategies of handling "key collisions", that is, attempts to assign a different value to the same key. The default strategy allows it. The "deleting" removes the key and marks it as colliding, so that any further attempts to assign a value to it will have no effect. The "shouting" strategy throws an exception during any overwrite attempt and on any further access to any colliding key.
Please see my answer to a related question for a differently worded description of my approach.
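For illustration, here is a rough, simplified sketch of that layout (not the actual library linked above, and without the collision strategies): key-value pairs interleaved in array.array('i') buckets selected by key // 65536, kept sorted by key, with a binary search for lookups.

import array

class IntIntDict(object):
    """Simplified sketch: keys and values must fit array typecode 'i'."""

    def __init__(self):
        self._buckets = {}   # key // 65536 -> array('i') of [k0, v0, k1, v1, ...]

    def _find(self, arr, key):
        # Binary search over the even (key) slots. Returns the pair index if
        # found, otherwise -(insertion_point + 1).
        lo, hi = 0, len(arr) // 2
        while lo < hi:
            mid = (lo + hi) // 2
            if arr[2 * mid] < key:
                lo = mid + 1
            else:
                hi = mid
        if lo < len(arr) // 2 and arr[2 * lo] == key:
            return lo
        return -(lo + 1)

    def __setitem__(self, key, value):
        arr = self._buckets.setdefault(key // 65536, array.array('i'))
        i = self._find(arr, key)
        if i >= 0:
            arr[2 * i + 1] = value                        # replace if present
        else:
            pos = -(i + 1)
            arr[2 * pos:2 * pos] = array.array('i', [key, value])

    def get(self, key, default=-1):
        arr = self._buckets.get(key // 65536)
        if arr is None:
            return default
        i = self._find(arr, key)
        return arr[2 * i + 1] if i >= 0 else default

d = IntIntDict()
d[123456789] = 7
assert d.get(123456789) == 7 and d.get(5) == -1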

Related

memory usage: numpy-arrays vs python-lists

NumPy is known for optimized arrays and various advantages over Python lists.
But when I check the memory usage, the Python list takes less space than the NumPy array.
The code I used is below.
Can anyone explain why?
import sys
import numpy as np

Z = np.zeros((10,10), dtype=int)
A = [[0] * 10] * 10
print(A,'\n',f'{sys.getsizeof(A)} bytes')
print(Z,'\n',f'{Z.size * Z.itemsize} bytes')
You're not measuring correctly; the native Python list only contains 10 references. You need to add in the collective size of the sub-lists as well:
>>> sys.getsizeof(A) + sum(map(sys.getsizeof, A))
1496
And it might get worse: each element inside the sub-lists could also be a reference (to an int). It's difficult to check whether the Python implementation is optimizing this away and storing the actual numbers inside the list.
You're also under-representing the size of the numpy array, because it includes a header:
>>> Z.size * Z.itemsize
800
>>> sys.getsizeof(Z)
912
In either case it's not an exact science and will depend on your platform and Python implementation.
According to the spec (https://docs.python.org/3/library/sys.html#sys.getsizeof), "only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."
Also "getsizeof calls object’s " sizeof method
So you are given the size of just the container (a list object).
Please check https://code.activestate.com/recipes/577504/ for a complete size computation, which returns 296bytes for your example since only two unique objects are used. A list [0 0 0 0 0 0 0 0 0 0] and int 0.
If you initialize the list with different values, overall size will increase and will become bigger than np.array, which reserves 4bytes for numpy.int32 type elements, plus the size of its own internal data structure.
Find detailed info with examples here: https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
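As a rough illustration of what such a "complete" computation does (the linked recipe is more thorough; this simplified version only walks lists, tuples, sets and dicts, and counts shared objects once, which is what makes [[0] * 10] * 10 so small):

import sys

def total_size(obj, seen=None):
    seen = set() if seen is None else seen
    if id(obj) in seen:                      # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(x, seen) for x in obj)
    return size

print(total_size([[0] * 10] * 10))           # the inner list is counted once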

Is there anything faster than dict()?

I need a faster way to store and access around 3GB of k:v pairs, where k is a string or an integer and v is an np.array() that can have different shapes.
Is there any object that is faster than the standard python dict in storing and accessing such a table? For example, a pandas.DataFrame?
As far as I have understood, the Python dict is a quite fast implementation of a hash table. Is there anything better than that for my specific case?
No, there is nothing faster than a dictionary for this task, and that's because the complexity of its indexing (getting and setting items) and even of membership checking is O(1) on average. (Check the complexity of the rest of its operations in the Python wiki: https://wiki.python.org/moin/TimeComplexity )
Once you have saved your items in a dictionary, you can access them in constant time, which means it's unlikely that your performance problem has anything to do with dictionary indexing. That being said, you still might be able to make this process slightly faster by making some changes in your objects and their types that allow some optimizations in the under-the-hood operations.
For example, if your strings (keys) are not very large, you can intern the lookup key and your dictionary's keys. Interning is caching objects in memory -- in Python, a table of "interned" strings -- rather than creating each one as a separate object.
Python provides an intern() function for this: sys.intern() in Python 3 (a builtin in Python 2).
Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup...
also ...
If the keys in a dictionary are interned and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of comparing the string values themselves which in consequence reduces the access time to the object.
Here is an example:
In [49]: d = {'mystr{}'.format(i): i for i in range(30)}
In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop
In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}
In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop
No, I don't think there is anything faster than dict. The time complexity of its index checking is O(1).
-------------------------------------------------------
Operation | Average Case | Amortized Worst Case |
-------------------------------------------------------
Copy[2] | O(n) | O(n) |
Get Item | O(1) | O(n) |
Set Item[1] | O(1) | O(n) |
Delete Item | O(1) | O(n) |
Iteration[2] | O(n) | O(n) |
-------------------------------------------------------
PS https://wiki.python.org/moin/TimeComplexity
A numpy.array[] and simple dict = {} comparison:
import numpy
from timeit import default_timer as timer
my_array = numpy.ones([400,400])
def read_out_array_values():
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_array[i,j]

start = timer()
read_out_array_values()
end = timer()
print("Time for array calculations:" + str(end - start))

my_dict = {}
for i in range(400):
    for j in range(400):
        my_dict[i,j] = 1

def read_out_dict_values():
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_dict[i,j]

start = timer()
read_out_dict_values()
end = timer()
print("Time for dict calculations:" + str(end - start))
Prints:
Time for dict calculations:0.046898419999999996
Time for array calculations:0.07558204099999999
============= RESTART: C:/Users/user/Desktop/dict-vs-numpyarray.py =============
Time for array calculations:0.07849989000000002
Time for dict calculations:0.047769446000000104
One would think that array indexing is faster than hash lookup.
So if we could store this data in a numpy array, and assume the keys are not strings but numbers, would that be faster than a Python dictionary?
Unfortunately not, because NumPy is optimized for vector operations, not for individual look up of values.
Pandas fares even worse.
See the experiment here: https://nbviewer.jupyter.org/github/annotation/text-fabric/blob/master/test/pandas/pandas.ipynb
The other candidate could be the Python array, in the array module. But that is not usable for variable-size values.
And in order to make this work, you probably need to wrap it into some pure python code, which will set back all time performance gains that the array offers.
So, even if the requirements of the OP are relaxed, there still does not seem to be a faster option than dictionaries.
You could consider storing them in a data structure like a trie, given that your keys are strings. Even storing and retrieving from a trie costs O(N), where N is the maximum key length. The same applies to the hash calculation that computes a hash for the key before it is used to find and store the entry in the hash table; we often forget to count that hashing time.
A trie may therefore give roughly equal performance, and possibly a little better if the hash function is relatively expensive, say something like
HASH[i] = (HASH[i-1] + key[i-1]*256^i % BUCKET_SIZE) % BUCKET_SIZE
or similar, where the 256^i factor is used to reduce collisions.
You can try storing them in a trie and see how it performs.
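If you want to experiment with that, here is a minimal dict-of-dicts trie for string keys (my sketch, not a tuned library; in practice a pure-Python trie like this is usually slower and larger than the built-in dict, so measure before committing to it):

_END = object()          # sentinel key marking "a value is stored here"

class Trie(object):
    def __init__(self):
        self.root = {}

    def __setitem__(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[_END] = value

    def get(self, key, default=None):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return default
        return node.get(_END, default)

t = Trie()
t['spam'] = 1
print(t.get('spam'))     # -> 1
print(t.get('eggs'))     # -> None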

Generate big random sequence of unique numbers [duplicate]

This question already has answers here:
Unique (non-repeating) random numbers in O(1)?
(22 answers)
Closed 9 years ago.
I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, and the ids should be unique and the order of records should be random (or pseudo-random).
I tried this:
# coding: utf-8
import random
COUNT = 100000000
random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()
But it's eating all of my memory.
Is there a way to generate a big shuffled sequence of consecutive (not necessarily but it would be nice, otherwise unique) integer numbers? Using a generator and not keeping all the sequence in RAM?
If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).
As DSM pointed out, this can be done with the standard modules in an efficient way:
>>> import array
>>> a = array.array('I', xrange(10**8)) # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random
>>> random.shuffle(a)
It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:
>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32') # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)
(this is only useful if your program already uses NumPy, as the standard module approach is about as efficient).
Both methods take about the same amount of time on my machine (maybe 1 minute for the shuffling), but the 0.5 GB they use is not too big for current computers.
PS: There are too many elements for the shuffling to be really random because there are way too many permutations possible, compared to the period of the random generators used. In other words, there are fewer Python shuffles than the number of possible shuffles!
Maybe something like (won't be consecutive, but will be unique):
from uuid import uuid4
def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative: yield uuid4().int
unique_num = unique_nums()
next(unique_num)
next(unique_num) # etc...
You can fetch random ints easily by reading (on Linux) /dev/urandom, or by using os.urandom() and struct.unpack():
Return a string of n random bytes suitable for cryptographic use.
This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.
>>> import os, struct
>>> for i in range(4): print( hex( struct.unpack('<L', os.urandom(4))[0]))
...
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6
The random module, on the other hand, comes with this warning:
However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.
If you really need unique records, you should go with this or with the answer provided by EOL.
But assuming a truly random source that may repeat values, you have a 1/N chance (where N = 2 ** (sizeof(int) * 8) = 2 ** 32) of hitting any given item at the first guess, and there are (2**32) ** length possible outputs.
On the other hand, when the results must be unique, you have at most:
product for i = 0 to length-1 of (2**32 - i)
= N! / (N - length)!
= (2**32)! / (2**32 - length)!
where ! is the factorial, not logical negation. So you just decrease the randomness of the result.
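A tiny worked example of that count, with a small N in place of 2**32: drawing length = 3 items from N = 6 possibilities gives 6**3 = 216 sequences if repeats are allowed, but only 6 * 5 * 4 = 120 if every item must be unique:

from math import factorial

N, length = 6, 3
with_repeats = N ** length                            # 216
unique_only = factorial(N) // factorial(N - length)   # 120
print(with_repeats - unique_only)                     # 96 sequences lost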
This one will keep your memory OK but will probably kill your disk :)
It generates a file with the sequence of numbers from 0 to 100000000, then randomly picks positions in it and writes them to another file. The numbers have to be re-organized in the first file to "delete" the numbers that have been chosen already.
import random

COUNT = 100000000

# Feed the file
with open('file1','w') as f:
    i = 0
    while i <= COUNT:
        f.write("{0:08d}".format(i))
        i += 1

with open('file1','r+') as f1:
    i = COUNT
    with open('file2','w') as f2:
        while i >= 0:
            f1.seek(i*8)
            # Read the last val
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read random pos
            f1.seek(random_pos*8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos*8)
            f1.write(last_val)
            i -= 1

print "Done"

Counting collisions in a Python dictionary

my first time posting here, so hope I've asked my question in the right sort of way,
After adding an element to a Python dictionary, is it possible to get Python to tell you if adding that element caused a collision? (And how many locations the collision resolution strategy probed before finding a place to put the element?)
My problem is: I am using dictionaries as part of a larger project, and after extensive profiling, I have discovered that the slowest part of the code is dealing with a sparse distance matrix implemented using dictionaries.
The keys I'm using are IDs of Python objects, which are unique integers, so I know they all hash to different values. But putting them in a dictionary could still cause collisions in principle. I don't believe that dictionary collisions are the thing that's slowing my program down, but I want to eliminate them from my enquiries.
So, for example, given the following dictionary:
d = {}
for i in xrange(15000):
    d[random.randint(15000000, 18000000)] = 0
can you get Python to tell you how many collisions happened when creating it?
My actual code is tangled up with the application, but the above code makes a dictionary that looks very similar to the ones I am using.
To repeat: I don't think that collisions are what is slowing down my code, I just want to eliminate the possibility by showing that my dictionaries don't have many collisions.
Thanks for your help.
Edit: Some code to implement @Winston Ewert's solution:
n = 1500

global collision_count
collision_count = 0

class Foo():
    def __eq__(self, other):
        global collision_count
        collision_count += 1
        return id(self) == id(other)

    def __hash__(self):
        # return id(self)  # @John Machin: yes, I know!
        return 1

objects = [Foo() for i in xrange(n)]

d = {}
for o in objects:
    d[o] = 1

print collision_count
Note that when you define __eq__ on a class, Python gives you a TypeError: unhashable instance if you don't also define a __hash__ function.
It doesn't run quite as I expected. If you have the __hash__ function return 1, then you get loads of collisions, as expected (1125560 collisions for n=1500 on my system). But with return id(self), there are 0 collisions.
Anyone know why this is saying 0 collisions?
Edit:
I might have figured this out.
Is it because __eq__ is only called if the __hash__ values of two objects are the same, not their "crunched version" (as @John Machin put it)?
Short answer:
You can't simulate using object ids as dict keys by using random integers as dict keys. They have different hash functions.
Collisions do happen. "Having unique thingies means no collisions" is wrong for several values of "thingy".
You shouldn't be worrying about collisions.
Long answer:
Some explanations, derived from reading the source code:
A dict is implemented as a table of 2 ** i entries, where i is an integer.
dicts are no more than 2/3 full. Consequently for 15000 keys, i must be 15 and 2 ** i is 32768.
When o is an arbitrary instance of a class that doesn't define __hash__(), it is NOT true that hash(o) == id(o). As the address is likely to have zeroes in the low-order 3 or 4 bits, the hash is constructed by rotating the address right by 4 bits; see the source file Objects/object.c, function _Py_HashPointer
It would be a problem if there were lots of zeroes in the low-order bits, because to access a table of size 2 ** i (e.g. 32768), the hash value (often much larger than that) must be crunched to fit, and this is done very simply and quickly by taking the low order i (e.g. 15) bits of the hash value.
Consequently collisions are inevitable.
However this is not cause for panic. The remaining bits of the hash value are factored into the calculation of where the next probe will be. The likelihood of a 3rd etc probe being needed should be rather small, especially as the dict is never more than 2/3 full. The cost of multiple probes is mitigated by the cheap cost of calculating the slot for the first and subsequent probes.
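To make that concrete, here is a rough Python rendering of the probe recurrence used by CPython 2.x's dict (see Objects/dictobject.c): the first slot is hash & mask, and each subsequent slot mixes in more high-order bits via a shrinking perturb value. This is only an illustration of the idea, not the real lookup code:

PERTURB_SHIFT = 5

def probe_slots(h, mask, count=5):
    # Yields the first `count` table slots examined for hash value h
    # in a table of size mask + 1 (a power of two).
    i = h & mask
    perturb = h
    for _ in range(count):
        yield i & mask
        i = i * 5 + perturb + 1
        perturb >>= PERTURB_SHIFT

print(list(probe_slots(0x12345678, 0x7fff)))   # e.g. for i = 15, mask = 0x7fff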
The interactive session further below is a simple experiment illustrating most of the above discussion. It presumes random accesses of the dict after it has reached its maximum size. With Python 2.7.1, it shows about 2000 collisions for 15000 objects (13.3%).
In any case the bottom line is that you should really divert your attention elsewhere. Collisions are not your problem unless you have achieved some extremely abnormal way of getting memory for your objects. You should look at how you are using the dicts, e.g. use k in d or try/except, not d.has_key(k). Consider one dict accessed as d[(x, y)] instead of two levels accessed as d[x][y]. If you need help with that, ask a separate question.
Update after testing on Python 2.6:
Rotating the address was not introduced until Python 2.7; see this bug report for comprehensive discussion and benchmarks. The basic conclusions are IMHO still valid, and can be augmented by "Update if you can".
>>> n = 15000
>>> i = 0
>>> while 2 ** i / 1.5 < n:
... i += 1
...
>>> print i, 2 ** i, int(2 ** i / 1.5)
15 32768 21845
>>> probe_mask = 2 ** i - 1
>>> print hex(probe_mask)
0x7fff
>>> class Foo(object):
... pass
...
>>> olist = [Foo() for j in xrange(n)]
>>> hashes = [hash(o) for o in olist]
>>> print len(set(hashes))
15000
>>> probes = [h & probe_mask for h in hashes]
>>> print len(set(probes))
12997
>>>
This idea doesn't actually work, see discussion in the question.
A quick look at the C implementation of python shows that the code for resolving collisions does not calculate or store the number of collisions.
However, it will invoke PyObject_RichCompareBool on the keys to check if they match. This means that __eq__ on the key will be invoked for every collision.
So:
Replace your keys with objects that define __eq__ and increment a counter when it is called. This will be slower because of the overhead involved in jumping into python for the compare. However, it should give you an idea of how many collisions are happening.
Make sure you use different objects as the key, otherwise python will take a shortcut because an object is always equal to itself. Also, make sure the objects hash to the same value as the original keys.

Python Memory Size

If I have a list of numbers and if I implement the same thing using dictionary, will they both occupy the same memory space?
Eg.
list = [[0,1],[1,0]]
dict = {'0,0': 0, '0,1': 1, '1,0': 1, '1,1': 0}
will the memory size of list and dict be the same? which will take up more space?
If you're using Python 2.6 or higher, you can use sys.getsizeof() to do a test
>>> import sys
>>> aList = [[0,1],[1,0]]
>>> aDict = {'0,0' : 0 , '0,1' : 1, '1,0' : 1, '1,1' : 0 }
>>> sys.getsizeof(aList)
44
>>> sys.getsizeof(aDict)
140
With the example you provided, we see that aDict takes up more space in memory.
Chances are very good that the dictionary will be bigger.
A dictionary and a list are fairly different data structures. When it boils down to memory and processor instructions, a list is fairly straightforward: all values in it are contiguous, and when you want to access item n, you go to the beginning of the list, move n items forward, and return it. This is easy because list items are contiguous and have integer keys.
On the other hand, the constraints for dictionaries are fairly different. You can't just go to the beginning of the dictionary, move key items forward and return it, because the key might not be numeric. Besides, keys don't need to be contiguous.
In the case of a dictionary, you need a structure to find the values associated to keys very easily, even though there might be no relationship between them. Therefore, it can't use the same kind of algorithms a list use. And typically, the data structures required by dictionaries are bigger than data structures required for lists.
Coincidentally, both may have the same size, even though it would be kind of surprising. Though, the data representation will be different, no matter what.
Getting the size of objects in python is tricky. Ostensibly, this can be done with sys.getsizeof but it returns incomplete data.
An empty list has size usage of 32 bytes on my system.
>>> sys.getsizeof([])
32
A list with one element has size of 36 bytes. This does not seem to vary according to the element.
>>> sys.getsizeof([[1, 2]])
36
>>> sys.getsizeof([1])
36
So you would need to know the size of the inner list as well.
>>> sys.getsizeof([1, 2])
40
So your memory usage for a list (assuming the same as my system) should be 32 bytes plus 44 bytes for every internal list. This is because Python is storing the overhead involved with keeping a list, which costs 32 bytes. Every additional entry is represented as a pointer to that object and costs 4 bytes. So the pointer costs 4 bytes on top of whatever you are storing. In the case of two-element lists, this is 40 bytes.
For a dict, it's 136 for an empty dict
>>> sys.getsizeof({})
136
From there, it will quadruple its size as adding members causes it to run out of space and risk frequent hash collisions. Again, you have to include the size of the object being stored and the keys as well.
>>> sys.getsizeof({1: 2})
136
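To see those resize jumps in practice, a quick check (the exact sizes and thresholds vary with the Python version and platform):

import sys

d = {}
last = sys.getsizeof(d)
print(last)                    # size of the empty dict
for i in range(1, 50):
    d[i] = i
    size = sys.getsizeof(d)
    if size != last:           # print only when the dict has just resized
        print(size)
        last = size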
In short, they won't be the same. The list should perform better, unless you can have sparsely populated lists that could be implemented in a dict.
As mentioned by several others, getsizeof doesn't sum the contained objects.
Here is a recipe that does some work for you on standard python (3.0) types.
Compute Memory footprint of an object and its contents
Using this recipe on python 3.1 here are some results:
aList = [[x,x] for x in range(1000)]
aListMod10 = [[x%10,x%10] for x in range(1000)]
aTuple = [(x,x) for x in range(1000)]
aDictString = dict(("%s,%s" % (x,x),x) for x in range(1000))
aDictTuple = dict(((x,x),x) for x in range(1000))
print("0", total_size(0))
print("10", total_size(10))
print("100", total_size(100))
print("1000", total_size(1000))
print("[0,1]", total_size([0,1]))
print("(0,1)", total_size((0,1)))
print("aList", total_size(aList))
print("aTuple", total_size(aTuple))
print("aListMod10", total_size(aListMod10))
print("aDictString", total_size(aDictString))
print("aDictTuple", total_size(aDictTuple))
print("[0]'s", total_size([0 for x in range(1000)]))
print("[x%10]'s", total_size([x%10 for x in range(1000)]))
print("[x%100]'s", total_size([x%100 for x in range(1000)]))
print("[x]'s", total_size([x for x in range(1000)]))
Output:
0 12
10 14
100 14
1000 14
[0,1] 70
(0,1) 62
aList 62514
aTuple 54514
aListMod10 48654
aDictString 82274
aDictTuple 74714
[0]'s 4528
[x%10]'s 4654
[x%100]'s 5914
[x]'s 18514
It seems to follow logically that the most memory-efficient option would be to use two lists:
list_x = [0, 1, ...]
list_y = [1, 0, ...]
This may only be worth it if you are running tight on memory and your list is expected to be large. I would guess that the usage pattern would be creating (x,y) tuples all over the place anyway, so it may be that you should really just do:
tuples = [(0, 1), (1, 0), ...]
All things being equal, choose what allows you to write the most readable code.
