I have the following dictionary structure (10,000 keys, with each value being a list of lists):
my_dic = {0: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ...(10,000th value) [3, .15, 2, 1, 2.5]],
          1: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ...(10,000th value) [3, .15, 2, 1, 2.5]],
          ...
          10,000th key: [[1, .65, 3, 0, 5.5], [4, .55, 3, 0, 5.5], ...(10,000th value) [3, .15, 2, 1, 2.5]]}
(Note: Data is dummy, so I have just repeated it across the keys)
The logical data types that I want in the small elementary lists are
inner_list = [int, float, small_int, boolean( 0 or 1), float]
sys.getsizeof(inner_list) shows its size to be 56 bytes. Adding 12 bytes for the int key makes it 68 bytes. Since I have 10^8 such lists (10,000 * 10,000), their storage in memory is becoming a big problem. I want the data in memory (no DB as of now). What would be the most memory-efficient way of storing it? I am inclined to think it must have something to do with numpy, but I am not sure what the best method would be or how to implement it. Any suggestions?
2) Also, since I am storing these dictionaries in memory, I would like to clear the memory occupied by them as soon as I am done using them. Is there a way of doing this in python?
One idea is to break up the dictionary structure into simpler structures, but it may affect how efficiently you can process it.
1 Create a separate array for the keys
from array import array
keys = array('i', [key1, key2, ..., key10000])
Depending on the possible values of the keys, you can further specify the particular int type for the array. The keys should also be kept sorted, so you can perform a binary search on the key table; a small sketch of such a lookup follows. This way you also save the space taken by the hash table used in Python's dictionary implementation. The downside is that key lookup now takes O(log n) time instead of O(1).
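For instance, a minimal sketch of such a sorted-key lookup using the standard-library bisect module (the key values here are made up):
from array import array
from bisect import bisect_left

keys = array('i', [3, 17, 42, 9001])   # hypothetical key table, kept sorted

def key_index(k):
    """Return the position of key k in the sorted key table, or raise KeyError."""
    i = bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        return i
    raise KeyError(k)

row = key_index(42)   # -> 2; use this as the row index into the value storage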
2 Store inner_list elements in a 10000x10000 matrix or in lists of length 100,000,000
As each position i from 0 to 9999 corresponds to a specific key that can be obtained from the keys array, each list of lists can be put into the i-th row of the matrix, and each inner_list element into the columns of that row (a NumPy sketch of this layout follows below).
Another option is to put them in one long list and compute the index from the key position i such that
idx = i*10000 + j
where i is the index of key in keys array and j is the index of particular inner_list instance.
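Since the question mentions NumPy, one way to realise the matrix layout is a structured array whose fields match the logical types of inner_list. A rough sketch (the field names are invented, and note that it allocates the full ~1.5 GB buffer up front):
import numpy as np

n_keys, n_lists = 10000, 10000

# One record per inner_list: int, float, small int, boolean, float.
record = np.dtype([('a', np.int32), ('b', np.float32),
                   ('c', np.int16), ('flag', np.bool_), ('d', np.float32)])

data = np.zeros((n_keys, n_lists), dtype=record)   # 15 bytes per record, ~1.5 GB in total

# Row i corresponds to the i-th key, column j to the j-th inner_list.
data[0, 0] = (1, .65, 3, False, 5.5)
print(data.dtype.itemsize, data.nbytes)            # 15 1500000000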
Additionally, you can keep a total of five separate arrays, one per inner_list field, which somewhat breaks the locality of the data in memory:
int_array = array('i', [value1, ..., value100000000])
float1_array = array('f', [value1, ..., value100000000])
small_int_array = array('h', [value1, ..., value100000000])
bool_array = array('?', [value1, ..., value100000000])
float2_array = array('f', [value1, ..., value100000000])
Boolean array can be further optimized by packing them into bits.
An alternative is to pack the inner_list elements into a binary string using the struct module and store them in a single list instead of five separate arrays.
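A minimal sketch of that struct-based packing, with a format string matching the logical types from the question (the '=' prefix avoids padding between fields):
import struct

record = struct.Struct('=ifh?f')              # int, float, short, bool, float; no padding

packed = record.pack(1, .65, 3, False, 5.5)   # a bytes object, record.size (15) bytes long
values = record.unpack(packed)                # back to a tuple of Python values

# One packed record per inner_list, stored in a single flat list:
rows = [packed, record.pack(4, .55, 3, False, 5.5)]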
3 Releasing memory
As soon as variables go out of scope, the objects they refer to are ready to be garbage collected, so the memory can be claimed back. To make this happen sooner, for example inside a function or a loop, you can rebind the name to a dummy value (or del it) so that the list's reference count drops to zero.
variable = None
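For example (a small sketch; CPython reclaims a reference-count-zero object immediately, so the explicit gc.collect() is only needed to break reference cycles):
import gc

big = [[0] * 5 for _ in range(10000)]
# ... use big ...
big = None        # or: del big -- the list's reference count drops to zero
gc.collect()      # only needed to break reference cycles; usually optional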
Note
However, these ideas may not be good enough for your particular use case. There are other possibilities too, such as loading only part of the data into memory at a time. It depends on how you plan to process it.
Generally, Python takes its own share of memory for the internal handling of pointers and structures. Therefore, yet another alternative is to implement that particular data structure and its handling in a language like Fortran, C or C++, which can be tuned more easily for your particular needs.
Related
I've heard that NumPy arrays are more efficient than Python's built-in lists and that they take less space in memory. As I understand it, NumPy stores the values next to each other in memory, while the Python list implementation stores 8-byte pointers to the values. However, when I try to test this in a Jupyter notebook, it turns out that both objects have the same size.
import numpy as np
from sys import getsizeof
array = np.array([_ for _ in range(4)])
getsizeof(array), array
Returns (128, array([0, 1, 2, 3]))
Same as:
l = list([_ for _ in range(4)])
getsizeof(l), l
Gives (128, [0, 1, 2, 3])
Can you provide any clear example on how can I show that in jupyter notebook?
getsizeof is not a good measure of memory use, especially with lists. As you note, the list has a buffer of pointers to objects elsewhere in memory. getsizeof reports the size of that buffer, but tells us nothing about the objects themselves.
With
In [66]: list(range(4))
Out[66]: [0, 1, 2, 3]
the list has its basic object storage, plus the buffer with 4 pointers (plus some growth room). The numbers are stored elsewhere. In this case the numbers are small, and already created and cached by the interpreter, so their storage doesn't add anything. But larger numbers (and floats) are created with each use and take up space. Also, a list can contain anything, such as pointers to other lists, or strings, or dicts, or whatever.
In [67]: arr = np.array([i for i in range(4)]) # via list
In [68]: arr
Out[68]: array([0, 1, 2, 3])
In [69]: np.array(range(4)) # more direct
Out[69]: array([0, 1, 2, 3])
In [70]: np.arange(4)
Out[70]: array([0, 1, 2, 3]) # faster
arr too has a basic object storage with attributes like shape and dtype. It too has a databuffer, but for a numeric dtype like this, that buffer has actual numeric values (8 byte integers), not pointers to Python integer objects.
In [71]: arr.nbytes
Out[71]: 32
That data buffer only takes 32 bytes - 4*8.
For this small example it's not surprising that getsizeof returns the same thing. The basic object storage is more significant than where the 4 values are stored. It's when working with 1000's of values, and multidimensional arrays that memory use is significantly different.
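A quick sketch of that difference at a larger size (the exact numbers are CPython/NumPy implementation details and will vary):
import sys
import numpy as np

n = 100000
lst = list(range(n))
arr = np.arange(n)

list_total = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
print(list_total)            # pointer buffer plus one int object per element
print(sys.getsizeof(arr))    # array header; includes the data buffer when the array owns it
print(arr.nbytes)            # just the raw data: n * 8 bytes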
But more important is the calculation speed. With an array you can do things like arr + 1 or arr.sum(). These operate in compiled code and are quite fast. Similar list operations have to iterate, at slow Python speed, through the pointers, fetching values, etc. But doing that same sort of element-wise iteration on arrays is even slower.
As a general rule, if you start with lists, and do list operations such as append and list comprehensions, it's best to stick with them.
But if you can create the arrays once, or from other arrays, and then use numpy methods, you'll get 10x speed improvements. Arrays are indeed faster, but only if you use them in the right way. They aren't a simple drop in substitute for lists.
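A rough timing sketch of that point (absolute numbers depend on the machine, but the ordering is typical):
import timeit
import numpy as np

n = 100000
lst = list(range(n))
arr = np.arange(n)

print(timeit.timeit(lambda: [x + 1 for x in lst], number=100))   # list comprehension
print(timeit.timeit(lambda: arr + 1, number=100))                # whole-array operation, compiled
print(timeit.timeit(lambda: [x + 1 for x in arr], number=100))   # element-wise loop over an array: slowest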
A NumPy array keeps general information in the array object header (shape, data type, etc.), and all the values are stored in a contiguous block of memory. A list, on the other hand, allocates a new memory block for every object and stores pointers to them, so when you iterate over it you are iterating over pointers rather than directly over memory. That is why it is not handy when you are working with large data. Here is an example:
import sys
import numpy as np

random_values_numpy = np.arange(1000)
random_values = list(range(1000))

# NumPy: bytes per element, then the total size of the data buffer
print(random_values_numpy.itemsize)
print(random_values_numpy.size * random_values_numpy.itemsize)

# Python list: size of the pointer buffer, then buffer plus the int objects it points to
print(sys.getsizeof(random_values))
print(sys.getsizeof(random_values) + sum(sys.getsizeof(v) for v in random_values))
I am parsing a fairly large dataset from json into a "traditional" data frame (rows as observations, columns as variables). The json object contains a list of characteristics of each observation. I want to transform this into a zero-one vector which indicates whether the observation in question has that characteristic.
What I have is the "master list" (a list of all possible characteristics) and a list of the observations (as json dicts). Let the number of all characteristics be K. The output for each observation should be a zero-one list of length K, marking whether each characteristic applies to that observation.
My current approach is a "brute-force" iteration:
characteristics #master list of all possibilities
output_dataset = []
for observation in data:
    chars = observation["characteristics"]
    vector = [int(c in chars) for c in characteristics]
    output_dataset.append(vector)
However, this is rather computationally expensive when the number of characteristics gets into the thousands and the number of observations into the tens of thousands.
Is there a more efficient way of doing this (generally, or specifically in Python/Numpy/Pandas)?
Update:
For clarity and as an example, here's what the different variables should look like. (Imagine the observations being mobile devices.)
Master list: ["android", "ios", "windows", "phone", "tablet", "dual-sim", "fingerprint", "nfc", "usb-c", "lg", "samsung", "huawei", "htc", "motorola", "apple", "google", "nokia"...]
One observation: ["android", "phone", "fingerprint", "nfc", "lg"...]
Desired output vector: [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,...]
I'm not entirely sure I understand this, but I think you want to use a dict. A dict is an efficient way of performing a lookup on an arbitrary key value. Looking up a key in a list of length N is O(N); looking up a key in a dict goes through a hash table and is, on average, O(1).
So, first convert characteristics into a dict which maps a characteristic to an index number, that index number being its position in characteristics:
chardict = {}
for i, v in enumerate(characteristics):
    chardict[v] = i
and now you have an efficient way to map a characteristic to a 1 in the correct position in your vector
vector = [0] * len(characteristics)  # initialize all zero, correct length
for k in observation["characteristics"]:
    if k in chardict:
        vector[chardict[k]] = 1
It's possibly more efficient to use the get method of a dict, and this way you can also check for bad input data, now or later. Less clear, though.
L = len(characteristics)
vector = [0] * (L + 1)  # last bin catches junk input
for k in observation["characteristics"]:
    vector[chardict.get(k, L)] = 1
if vector[L]:
    pass  # there was an input not known to characteristics ...
del vector[L]  # get rid of the last element if no longer wanted
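Putting the pieces together, here is a rough end-to-end sketch of the dict-based approach on dummy data shaped like the question's example:
# Dummy data shaped like the question's example.
characteristics = ["android", "ios", "phone", "nfc", "lg"]
data = [{"characteristics": ["android", "phone", "nfc", "lg"]},
        {"characteristics": ["ios", "phone"]}]

chardict = {v: i for i, v in enumerate(characteristics)}   # built once

output_dataset = []
for observation in data:
    vector = [0] * len(characteristics)
    for k in observation["characteristics"]:
        if k in chardict:
            vector[chardict[k]] = 1
    output_dataset.append(vector)

print(output_dataset)   # [[1, 0, 1, 1, 1], [0, 1, 1, 0, 0]]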
So I'm writing code that uses exact diagonalization to study the Lieb-Liniger model. The first step is building a numpy array containing lists that describe particle occupations. The array would look something like
array([[2, 0, 0],
[1, 1, 0],
[1, 0, 1],
[0, 2, 0],
[0, 1, 1],
[0, 0, 2]])
for the case of 2 particles in 3 modes. My question is: is it possible to get the index of a particular list in this array, similar to how you would get an index in a regular list with the index function? For instance, with a list of lists A, I was able to use A.index(some_list_in_A) to get the index of that list, but I have tried using numpy.where(HS=[2,0,0]) to get the index of [2,0,0] (and so on), to no avail. For large numbers of particles and modes, I'm looking for an efficient way to obtain these indices, and I figured numpy arrays would be quite efficient, but I have hit this block and have not found a solution to it. Any suggestions?
You can use np.where() doing:
pattern = [2,0,0]
index = np.where(np.all(a==pattern, axis=1))[0]
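For example, applied to the array from the question (using the same variable name a as in the snippet above):
import numpy as np

a = np.array([[2, 0, 0], [1, 1, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2]])
pattern = [2, 0, 0]
index = np.where(np.all(a == pattern, axis=1))[0]
print(index)    # [0]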
Here are several ways of doing this lookup:
In [36]: A=np.array([[2,0,0],[1,1,0],[1,0,1],[0,2,0],[0,1,1],[0,0,2]])
In [37]: pattern = [0,2,0]
In [38]: np.where(np.all(pattern==A,1)) # Saullo's where
Out[38]: (array([3]),)
In [39]: A.tolist().index(pattern) # your list find
Out[39]: 3
In [40]: D={tuple(a):i for i,a in enumerate(A.tolist())} # dictionary
In [41]: D[tuple(pattern)]
Out[41]: 3
I am using tuples as the dictionary keys; a tuple is essentially an immutable list, which makes it hashable and usable as a key.
For this small size, the dictionary approach is fastest, especially if the dictionary can be built once and used repeatedly. Even if constructed on the fly it is faster than the np.where. But you should test it with more realistic sizes.
Python dictionaries are tuned for speed, since they are fundamental to the language's operation.
The pieces in the np.where are all fast, using compiled code. But still, it has to compare all the elements of A with the pattern. There's a lot more work than the dictionary hash lookup.
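One way to run that test yourself, as a rough sketch (absolute timings will vary, but the ordering should hold at this size):
import timeit
import numpy as np

A = np.array([[2, 0, 0], [1, 1, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2]])
pattern = [0, 2, 0]
D = {tuple(a): i for i, a in enumerate(A.tolist())}   # built once, reused

print(timeit.timeit(lambda: np.where(np.all(pattern == A, 1)), number=10000))
print(timeit.timeit(lambda: A.tolist().index(pattern), number=10000))
print(timeit.timeit(lambda: D[tuple(pattern)], number=10000))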
I have a non-uniform array 'A'.
A = [1,3,2,4,..., 12002, 13242, ...]
I want to explore how many elements from the array 'A' have values above certain threshold values.
For example, there are 1000 elements that have values larger than 1200, so I want to plot the number of elements with values larger than 1200. There are also another 1500 elements with values larger than 110 (this includes the 1000 elements whose values are larger than 1200).
This is a rather large data set, so I would not like to omit any kind of information.
Then, I want to plot the number of elements N above a value A versus log(A), i.e. 'log N(>A)' vs. 'log(A)'.
I thought of binning the data, but I was rather unsuccessful.
I haven't done that much statistics in python, so I was wondering if there is a good way to plot this data?
Thanks in advance.
Let me take another crack at what we have:
A = [1, 3, 2, 4, ..., 12002, 13242, ...]
# A list of 12,000 zeros; num_above[k] will become the number of elements of A above k.
num_above = [0] * 12000
# Notice how we can re-write this for-loop!
for i in A:
    num_above = [val + 1 if key < i else val for key, val in enumerate(num_above)]
I believe this is what you want. The final list num_above will be such that num_above[5] equals the number of elements in A that are above 5.
Explanation:
That last line is where all the magic happens. For each element i of A, it adds one to every element of num_above whose index is less than i.
The enumerate(num_above) call creates an iterator of (index, value) tuples over all the elements; for example, enumerate(A) would yield (0,1) -> (1,3) -> (2,2) -> (3,4) -> ...
Also, the [expression for item in iterable] construct is known as a list comprehension, and is a really powerful tool in Python.
Improvements: I see you already modified your question to include these changes, but I think they were important.
I removed the numpy dependency. When possible, removing dependencies reduces the complexity of projects, especially larger projects.
I also removed the separate list of threshold values. It can be replaced with something that is basically range(12000), which is exactly what the indices of num_above give you.
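If NumPy and matplotlib turn out to be acceptable after all, an alternative sketch for the counting-and-plotting part is to sort A once; N(>a) is then just the number of sorted values to the right of a, which can go straight onto log-log axes (A below is a stand-in for the real data):
import numpy as np
import matplotlib.pyplot as plt

A = np.array([1, 3, 2, 4, 12002, 13242])        # stand-in for the real data
A_sorted = np.sort(A)

thresholds = A_sorted                            # evaluate N(>a) at every data value
counts = len(A_sorted) - np.searchsorted(A_sorted, thresholds, side='right')

mask = counts > 0                                # a log axis cannot show zero counts
plt.loglog(thresholds[mask], counts[mask], drawstyle='steps-post')
plt.xlabel('A')
plt.ylabel('N(> A)')
plt.show()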
If I have a list of numbers and I implement the same thing using a dictionary, will they both occupy the same memory space?
Eg.
list = [[0,1],[1,0]]
dict = {'0,0': 0, '0,1': 1, '1,0': 1, '1,1': 0}
Will the memory size of the list and the dict be the same? Which will take up more space?
If you're using Python 2.6 or higher, you can use sys.getsizeof() to do a test
>>> import sys
>>> aList = [[0,1],[1,0]]
>>> aDict = {'0,0' : 0 , '0,1' : 1, '1,0' : 1, '1,1' : 0 }
>>> sys.getsizeof(aList)
44
>>> sys.getsizeof(aDict)
140
With the example you provided, we see that aDict takes up more space in memory.
Chances are very good that the dictionary will be bigger.
A dictionary and a list are fairly different data structures. When it boils down to memory and processor instructions, a list is fairly straightforward: its items are contiguous, so when you want item n you take the start of the list, jump directly to slot n, and return it. This is easy because list items are contiguous and have integer keys.
Constraints for dictionaries, on the other hand, are fairly different. You can't just go to the beginning of the dictionary and step forward by the key, because the key might not be numeric at all; besides, keys don't need to be contiguous.
In the case of a dictionary, you need a structure that finds the value associated with a key very quickly, even though there might be no relationship between the keys. Therefore it can't use the same kind of algorithm a list uses, and typically the data structures required by dictionaries are bigger than those required by lists.
Coincidentally, both may end up the same size, even though that would be somewhat surprising. In any case, the data representation will be different, no matter what.
Getting the size of objects in python is tricky. Ostensibly, this can be done with sys.getsizeof but it returns incomplete data.
An empty list takes up 32 bytes on my system.
>>> sys.getsizeof([])
32
A list with one element has a size of 36 bytes. This does not seem to vary according to the element.
>>> sys.getsizeof([[1, 2]])
36
>>> sys.getsizeof([1])
36
So you would need to know the size of the inner list as well.
>>> sys.getsizeof([1, 2])
40
So your memory usage for a list (assuming the same as my system) should be 32 bytes plus 44 bytes for every internal list. This is because Python stores the overhead involved in keeping the outer list, which costs 32 bytes. Every additional entry is represented as a pointer to its object and costs 4 bytes, so the pointer costs 4 bytes on top of whatever you are storing. In the case of a two-element inner list this is 40 bytes, giving 44 bytes per entry including the pointer.
For a dict, it's 136 for an empty dict
>>> sys.getsizeof({})
136
From there, it will quadruple its size as adding members causes it to run out of space and risk frequent hash collisions. Again, you have to include the size of the object being stored and the keys as well.
>>> sys.getsizeof({1: 2})
136
In short, they won't be the same. The list should perform better, unless you can have sparsely populated lists that could be implemented in a dict.
As mentioned by several others, getsizeof doesn't sum the contained objects.
Here is a recipe that does some work for you on standard python (3.0) types.
Compute Memory footprint of an object and its contents
Using this recipe on python 3.1 here are some results:
aList = [[x,x] for x in range(1000)]
aListMod10 = [[x%10,x%10] for x in range(1000)]
aTuple = [(x,x) for x in range(1000)]
aDictString = dict(("%s,%s" % (x,x),x) for x in range(1000))
aDictTuple = dict(((x,x),x) for x in range(1000))
print("0", total_size(0))
print("10", total_size(10))
print("100", total_size(100))
print("1000", total_size(1000))
print("[0,1]", total_size([0,1]))
print("(0,1)", total_size((0,1)))
print("aList", total_size(aList))
print("aTuple", total_size(aTuple))
print("aListMod10", total_size(aListMod10))
print("aDictString", total_size(aDictString))
print("aDictTuple", total_size(aDictTuple))
print("[0]'s", total_size([0 for x in range(1000)]))
print("[x%10]'s", total_size([x%10 for x in range(1000)]))
print("[x%100]'s", total_size([x%100 for x in range(1000)]))
print("[x]'s", total_size([x for x in range(1000)]))
Output:
0 12
10 14
100 14
1000 14
[0,1] 70
(0,1) 62
aList 62514
aTuple 54514
aListMod10 48654
aDictString 82274
aDictTuple 74714
[0]'s 4528
[x%10]'s 4654
[x%100]'s 5914
[x]'s 18514
It seems to follow logically that the most memory-efficient option would be to use two lists:
list_x = [0, 1, ...]
list_y = [1, 0, ...]
This may only be worth it if you are running tight on memory and your list is expected to be large. I would guess that the usage pattern would be creating (x,y) tuples all over the place anyway, so it may be that you should really just do:
tuples = [(0, 1), (1, 0), ...]
All things being equal, choose what allows you to write the most readable code.
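A crude way to check that guess on a particular machine; only the container overhead is compared, since the small int objects are shared in either layout:
import sys

n = 1000
list_x = [i % 2 for i in range(n)]
list_y = [(i + 1) % 2 for i in range(n)]
tuples = list(zip(list_x, list_y))        # [(0, 1), (1, 0), ...], one tuple object per pair

two_lists = sys.getsizeof(list_x) + sys.getsizeof(list_y)
tuple_list = sys.getsizeof(tuples) + sum(sys.getsizeof(t) for t in tuples)
print(two_lists, tuple_list)              # the tuple version pays one ~56-byte tuple per pair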