How much data can a dictionary store? - python

How many data sets can a dictionary store? Is there a limit? If so, what defines those limits?
I am just beginning to use Python and would like to understand dictionaries more.

Dictionaries can store a virtually unlimited number of data sets, limited primarily by the amount of memory available on the computer being used.
The maximum number of items that a dictionary can hold is determined by the amount of memory allocated to the Python interpreter, and the size of each item stored in the dictionary.
A dictionary is implemented as a hash table, which typically uses more memory than other data structures such as lists and arrays. The amount of memory used by a dictionary depends on the number of items stored in the dictionary, as well as the size and complexity of those items.
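As a rough illustration of the memory claim above, the sketch below compares the container overhead of a list and a dict holding the same 100,000 integers. It assumes CPython, where sys.getsizeof reports the size of the container itself and not of the objects it references; the helper name container_sizes is made up for this example.

```python
import sys


def container_sizes(n):
    """Return (list_bytes, dict_bytes) for containers holding n integers.

    sys.getsizeof measures only the container, not the referenced objects.
    """
    as_list = list(range(n))
    as_dict = dict.fromkeys(range(n), 1)
    return sys.getsizeof(as_list), sys.getsizeof(as_dict)


list_bytes, dict_bytes = container_sizes(100_000)
print(f"list: {list_bytes} bytes, dict: {dict_bytes} bytes")
```

The exact numbers depend on the Python version and platform, so treat them as indicative only.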
One way to store more data in a dictionary is to use a "lazy evaluation" approach, where the values for certain keys are not computed and stored in the dictionary until they are actually needed. This can be done by using a function as the value for a key, rather than a pre-computed value.
For example, let's say you have a large dataset that you want to store in a dictionary, but you don't want to use up all of your memory by loading the entire dataset into memory at once. Instead, you could create a dictionary with keys corresponding to different subsets of the dataset, and values that are functions that generate the corresponding subsets when called.
def load_data_subset(start, end):
    # Placeholder: replace with code that loads rows start..end of the dataset
    subset = list(range(start, end))
    return subset
data = {
    'subset_1': lambda: load_data_subset(0, 10000),
    'subset_2': lambda: load_data_subset(10000, 20000),
    # etc.
}
# To access a subset of the data, call the corresponding function
subset_1 = data['subset_1']()
In this way, the data is only loaded into memory when it is actually needed, and memory usage is limited to the size of the subsets in use at any given time. This approach lets you work with large amounts of data without using up all of your memory at once.
Note that this approach may have performance drawbacks compared to pre-loading all the data into memory, but it's a good solution when the data can't fit into memory.

There is no hard limit to the size of a dictionary, except the amount of RAM available on your system. For a more practical answer, see this post.

Related

How can I preserve search efficiency, and reduce RAM usage of a dictionary container?

I'm keeping a record of what I'm calling profiles.
Each profile is a tuple of dictionaries: (Dict,Dict), and we can associate to it a unique id (which may include characters like _ or -).
I need to keep them in RAM for a certain time, because I'll need to search and update some of them, and only at the end, when I no longer need the whole set of profiles, will I flush them to persistent storage.
Currently, I'm using a dictionary/hash table to keep all of them (number of elements around 100K, but could be more), since I'll do many searches, using the id:profile as the key:value pair.
The data looks similar to this:
{
    "Aw23_1adGF": ({
        "data":
            {"contacts": [11, 22], "address": "usa"},
        "meta_data":
            {"created_by": "John"}
    }, {"key_3": "yes"}),
    "AK23_1adGF": ({
        "data":
            {"contacts": [33, 44], "address": "mexico"},
        "meta_data":
            {"created_by": "Juan"}
    }, {"key_3": "no"}),
    # ...
}
Once this data structure is built, I don't need to add or delete any more elements. I only build it once, then search it many times. On certain occasions I'll need to update some element in the dictionaries that compose a profile. However, building this data object contributes to the peak RAM usage that I'm trying to diminish.
The problem is that the dictionary seems to use too much RAM.
What are my other options for data structures that could keep some of the search efficiency while having a small RAM footprint?
I thought of an ordered list, since the id seems to me to be orderable (maybe except for characters like _ or -).
What data structures are there that could help me in this predicament?
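The ordered-list idea from the question could be sketched as follows: keep the ids in one sorted list and the profiles in a parallel list, then look ids up with binary search. The class name SortedProfiles is hypothetical. Python compares strings by code point, so ids containing _ or - sort without special handling. This only removes the outer hash table; the nested dictionaries inside each profile are unchanged, so the actual saving may be small.

```python
from bisect import bisect_left


class SortedProfiles:
    """Read-mostly id -> profile lookup backed by two parallel sorted lists."""

    def __init__(self, items):
        # items: iterable of (profile_id, profile) pairs; sort by id only,
        # because the profiles themselves (tuples of dicts) are not orderable.
        pairs = sorted(items, key=lambda pair: pair[0])
        self._ids = [profile_id for profile_id, _ in pairs]
        self._profiles = [profile for _, profile in pairs]

    def get(self, profile_id):
        # Binary search: O(log n) instead of the dict's O(1) average lookup.
        i = bisect_left(self._ids, profile_id)
        if i < len(self._ids) and self._ids[i] == profile_id:
            return self._profiles[i]
        raise KeyError(profile_id)


profiles = SortedProfiles([
    ("Aw23_1adGF", ({"data": {"address": "usa"}}, {"key_3": "yes"})),
    ("AK23_1adGF", ({"data": {"address": "mexico"}}, {"key_3": "no"})),
])
# Profiles are returned by reference, so in-place updates still work.
profiles.get("AK23_1adGF")[1]["key_3"] = "maybe"
```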

How to replace large python dictionary with a memory efficient data structure?

I use a Python dictionary to store key-value pairs, and the dictionary gets too large (>100GB) and hits a memory limit.
What is a more memory-efficient data structure for storing key-value pairs in Python?
For example, we can use generators to replace lists.
You can use sqlitedict, which provides a key-value interface to an SQLite database. As for memory usage, SQLite doesn't need your dataset to fit in RAM. By default it caches up to cache_size pages, which is barely 2 MB.
Maybe that could help: https://github.com/dagnelies/pysos
It keeps only the index in memory and keeps the data on disk.
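To show the general idea behind these disk-backed stores, here is a minimal sketch of a dict-like wrapper built on the standard-library sqlite3 module. It is not the sqlitedict or pysos API; the class name SqliteKV and the table name kv are invented for this example, and it only handles string keys and string values.

```python
import sqlite3


class SqliteKV:
    """Tiny dict-like key-value store kept in an SQLite database."""

    def __init__(self, path):
        # path is a file name, or ":memory:" for a throwaway in-memory database.
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __setitem__(self, key, value):
        # "with conn" wraps the statement in a transaction and commits it.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)",
                (key, value),
            )

    def __getitem__(self, key):
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __contains__(self, key):
        row = self._conn.execute(
            "SELECT 1 FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return row is not None

    def close(self):
        self._conn.close()


store = SqliteKV(":memory:")
store["alpha"] = "1"
store["alpha"] = "2"  # overwrites the earlier value
```

Committing after every write keeps the sketch simple but is slow for bulk loads; batching many inserts into one transaction is the usual remedy.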

Dictionary, set or frozenset?

I have a large collection of data, about 10 million entries, and part of my program requires very many membership checks...
if a in data:
    return True
return False
Right now I have data as dictionary entries with all their values equal to '1'.
I also have a program that uses an algorithm to figure out the same information, but for now it is slower than the dictionary method. However, I expect the size of data to continue growing...
For my current dictionary solution, would storing data as a frozenset or a set (or something else?) be faster?
And for the future, to find out when I need to switch to my program: does anyone know how the speed of checking membership correlates with the growing size of a hash-based container?
Is a dictionary with 1 billion entries still fast?
In principle
If you expect the data to keep growing, you can't use a frozenset.
A set would be smaller than a dictionary storage-wise for testing whether an element exists in it. It would be similar in speed to a dictionary lookup, since dictionary keys and set items are both hashed for storage and are always unique. If you don't need data associated with the username, use a set.
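A small sketch of the set suggestion: the keys of the existing dictionary can be copied into a set with set(data), and membership checks then read exactly as before. The helper membership_time is a hypothetical name; it uses timeit so you can check for yourself how lookup time behaves as the container grows. No timings are quoted here because they depend on the machine.

```python
import timeit


def membership_time(n, number=100_000):
    """Time `number` membership checks against a set of n integers."""
    members = set(range(n))
    probe = n - 1
    return timeit.timeit(lambda: probe in members, number=number)


# Current layout from the question: every value is 1.
data = {key: 1 for key in range(1000)}
# Same keys, no values; the membership test itself is unchanged.
members = set(data)

for size in (1_000, 100_000):
    print(size, membership_time(size, number=10_000))
```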
Practically speaking...
When you are dealing with that many entries move the data to a database. You will eventually run out of memory trying to store and read all of that into memory. With a database, you can issue a specific query to check membership. Seriously. Put that data in a database.
For this amount of data RyPeck is right - a DB will do the job much better.
One more point:
Something seems odd to me in what you've written:
If you use a dictionary to store the objects of the memberships, why is the value of each key-value pair in the dictionary '1'? Shouldn't the key-value pair of the dictionary be "id of a"-"a", where 'a' is the object?
There are several bytes of overhead per entry in a hash table (whether dictionary or set doesn't make much of a difference), so for billions of entries you will run into swapping unless you have 32+ GB of memory for the application. I would start looking for a fast DB.
For a frozenset you also need to have all the data in memory in some acceptable form at creation time, which probably doubles the amount of memory needed.

python dict set min_size

I'm parsing hundreds of millions of JSON records and storing the relevant components from each in a dict. The problem is that because of the number of records I'm processing, python is being forced to increase the size of the dict's underlying hash table several times. This results in a LOT of data having to be rehashed. The sheer amount of rehashing itself seems to cost a lot of time. Therefore, I wonder if there's a way to set a minimum size on the dict's underlying hash table so that the number of resizing operations is minimized.
I have read this on optimizing python's dict, from an answer on this question, but cannot find how to change the initial size of a dict's hash table. If anyone can help me out with this, I'd be very grateful.
Thank you
If you do this:
a = dict.fromkeys(range(n))
it will force the dictionary size to accommodate n items. It is quite quick after that, but it takes 3s to do so.
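To see the resizing described in the question, the sketch below inserts keys one at a time and counts how often the size reported by sys.getsizeof changes. This assumes CPython, where that size reflects the dict's currently allocated table; count_resizes is a name made up for this example.

```python
import sys


def count_resizes(n):
    """Insert n integer keys and count changes in the dict's reported size."""
    d = {}
    last_size = sys.getsizeof(d)
    resizes = 0
    for i in range(n):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last_size:
            # The underlying table was reallocated.
            resizes += 1
            last_size = size
    return resizes


for n in (1_000, 100_000):
    print(n, count_resizes(n))
```

Because the table grows geometrically, the number of resizes grows only slowly with n, which is worth measuring before concluding that rehashing dominates the run time.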

merging dictionaries in python

Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurence_count) (everything is an integer), which I am storing in multiple Python dictionaries (tuple->int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried , I ran into performance issues. I am currently trying to store the values in a DB (mysql), processing multiple dictionaries at a time, since mysql provides row level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries)
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys, and a caveat for mutable values, but as your values are ints that need not concern you).
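The shelf-based aggregation described above might look roughly like this. It is a sketch that follows this answer's reading of the question (word alone is the key, converted with str); the function name add_counts and the shelf path are placeholders.

```python
import shelve


def add_counts(shelf_path, counts):
    """Add one in-memory {(word, corpus): count} dict to the on-disk totals."""
    with shelve.open(shelf_path) as shelf:
        for (word, corpus), count in counts.items():
            key = str(word)  # shelf keys must be strings
            # Reassigning the value writes it back; ints are immutable,
            # so the mutable-value caveat does not apply.
            shelf[key] = shelf.get(key, 0) + count
```

You would call add_counts once per unpickled dictionary, so that only one of them is in memory at a time.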
Something like this, if I understand your question correctly
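from collections import defaultdict
import pickle
# filenames is assumed to be a list of paths to the pickled dictionaries
result = defaultdict(int)
for fn in filenames:
    # Pickle files must be opened in binary mode
    with open(fn, 'rb') as f:
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        word, corpus = k
        result[k] += count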
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
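The newid scheme can be sketched with plain lists as below. It assumes that corpora are numbered 0..C-1 and that, within each corpus, word ids run from 0 up to the number of words in that corpus; both are assumptions about the data, and the function names are invented for illustration.

```python
from itertools import accumulate


def build_start_newid(words_per_corpus):
    """start_newid[c] is the total number of words in all corpora before c."""
    running_totals = list(accumulate(words_per_corpus))
    return [0] + running_totals[:-1]


def newid(word, corpus, start_newid):
    """Map a (word, corpus) pair of ints to a single integer index."""
    return word + start_newid[corpus]


words_per_corpus = [3, 2, 4]  # example sizes, not real data
start_newid = build_start_newid(words_per_corpus)

# A flat list of totals replaces the tuple-keyed dict.
totals = [0] * sum(words_per_corpus)
for (word, corpus), count in {(1, 2): 7, (0, 0): 5, (1, 2): 1}.items():
    totals[newid(word, corpus, start_newid)] += count
```

A numpy integer array could stand in for totals with the same indexing, at a fraction of the memory of a list of Python ints.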
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
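One way to sketch the multi-pass idea: on each pass, load the pickled dictionaries one at a time and keep only the keys that belong to the current slice. Here the slice is chosen by word % num_passes, not by "the first 10%" of pairs, which is a substitution made for simplicity; the function name and the output file pattern are hypothetical.

```python
import pickle
from collections import defaultdict


def aggregate_in_passes(filenames, num_passes, out_pattern="totals_{}.pkl"):
    """Aggregate counts in num_passes passes; return the output file paths."""
    outputs = []
    for p in range(num_passes):
        partial = defaultdict(int)
        for fn in filenames:
            with open(fn, "rb") as f:
                data = pickle.load(f)
            for (word, corpus), count in data.items():
                # Only keys in this pass's slice are kept in memory.
                if word % num_passes == p:
                    partial[(word, corpus)] += count
            del data  # release the loaded dict before reading the next file
        out_path = out_pattern.format(p)
        with open(out_path, "wb") as f:
            pickle.dump(dict(partial), f)
        outputs.append(out_path)
    return outputs
```

Each input file is read num_passes times, so this trades extra disk reads for a result that never exceeds roughly 1/num_passes of the full key set.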
