I'm parsing hundreds of millions of JSON records and storing the relevant components from each in a dict. Because of the number of records I'm processing, Python is forced to grow the dict's underlying hash table several times, and each resize means a LOT of data has to be rehashed. The rehashing alone seems to cost a lot of time. Is there a way to set a minimum size on the dict's underlying hash table so that the number of resizing operations is minimized?
I have read this on optimizing Python's dict, from an answer on this question, but cannot find how to change the initial size of a dict's hash table. If anyone can help me out with this, I'd be very grateful.
Thank you
If you do this:
a = dict.fromkeys(range(n))
it will force the dictionary size to accommodate n items. Inserts are quite quick after that, but the fromkeys call itself takes 3s.
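To see the resizing the question describes, here is a minimal sketch. It assumes CPython, where `sys.getsizeof` on a dict changes whenever the underlying table is reallocated; the helper name `count_resizes` is made up for illustration.

```python
import sys

def count_resizes(n):
    """Insert n integer keys one at a time and count how often the dict's
    reported size changes (a proxy for hash table resizes in CPython)."""
    d = {}
    last_size = sys.getsizeof(d)
    resizes = 0
    for i in range(n):
        d[i] = None
        size = sys.getsizeof(d)
        if size != last_size:
            resizes += 1
            last_size = size
    return resizes

# Growing key by key triggers a number of table reallocations.
incremental_resizes = count_resizes(100_000)

# Building the dict in a single call, as suggested above.
prefilled = dict.fromkeys(range(100_000))

print(incremental_resizes, len(prefilled))
```

Timing both variants on your own data (for example with `timeit`) is the only reliable way to tell whether the resizes are really the bottleneck.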
How many data sets can a dictionary store? Is there a limit? If so, what defines that limit?
I am just beginning to use Python and would like to understand dictionaries more.
Dictionaries can store a virtually unlimited number of items, limited primarily by the amount of memory available on the computer being used.
The maximum number of items that a dictionary can hold is determined by the amount of memory allocated to the Python interpreter, and the size of each item stored in the dictionary.
A dictionary is implemented as a hash table, which typically uses more memory than other data structures such as lists and arrays. The amount of memory used by a dictionary depends on the number of items stored in the dictionary, as well as the size and complexity of those items.
One way to store more data in a dictionary is to use a "lazy evaluation" approach, where the values for certain keys are not computed and stored in the dictionary until they are actually needed. This can be done by using a function as the value for a key, rather than a pre-computed value.
For example, let's say you have a large dataset that you want to store in a dictionary, but you don't want to use up all of your memory by loading the entire dataset into memory at once. Instead, you could create a dictionary with keys corresponding to different subsets of the dataset, and values that are functions that generate the corresponding subsets when called.
def load_data_subset(start, end):
    # code to load and return a subset of the dataset;
    # a plain range stands in for the real loader here
    subset = list(range(start, end))
    return subset

data = {
    'subset_1': lambda: load_data_subset(0, 10000),
    'subset_2': lambda: load_data_subset(10000, 20000),
    # etc.
}

# To access a subset of the data, you would call the corresponding function
subset_1 = data['subset_1']()
In this way, the data is only loaded into memory when it is actually needed, and the memory usage is limited to the size of the subsets being used at any given time. This approach allows you to work with large amounts of data without using up all of your memory at once.
Note that this approach may be slower than pre-loading all the data into memory, since a subset is reloaded every time its function is called, but it is a good solution when the data can't fit into memory.
There is no hard limit to the size of a dictionary, except the amount of RAM available on your system. For a more practical answer, see this post.
There are around 3 million arrays - or Python lists/tuples (it does not really matter). Each array consists of the following elements:
['string1', 'string2', 'string3', ...] # totally, 10000 elements
These arrays should be stored in some kind of key-value storage. Let's assume for now that it's a Python dict, for simplicity.
So, 3 million keys, each key mapping to a 10000-element array.
Lists/tuples or any other custom thing - it doesn't really matter. What matters is that the arrays should consist of strings - utf8 or unicode strings, from 5 to about 50 chars each. There are about 3 million possible strings as well. It is possible to replace them with integers if it's really needed, but for more efficient further operations, I would prefer to have strings.
Though it's hard to give you a full description of the data (it's complicated and odd), it's something similar to synonyms - let's assume we have 3 million words as the dict keys, and 10k synonyms for each word as the elements of its list.
Like that (not real synonyms but it will give you the idea):
{
'computer': ['pc', 'mac', 'laptop', ...], # (10k totally)
'house': ['building', 'hut', 'inn', ...], # (another 10k)
...
}
Elements - 'synonyms' - can be sorted if needed.
Later, after the arrays are populated, there's a loop: we go through the keys and check if some variable is in a key's value. For example, a user inputs the words 'computer' and 'laptop' - and we must quickly reply whether the word 'laptop' is a synonym of the word 'computer'. The issue here is that we have to check it millions of times, probably 20 million or so. Just imagine we have a lot of users entering some random words - 'computer' and 'car', 'phone' and 'building', etc. They may 'match', or they may not 'match'.
So, in short - what I need is to:
store these data structures memory-efficiently,
be able to quickly check if some item is in array.
I should be able to keep memory usage below 30GB. Also I should be able to perform all the iterations in less than 10 hours on a Xeon CPU.
It's ok to have around 0.1% of false answers - both positive and negative - though it would be better to reduce them or not have them at all.
What is the best approach here? Algorithms, links to code, anything is really appreciated. Also - a friend of mine suggested using bloom filters or marisa tries here - is he right? I haven't worked with either of them.
I would map each unique string to a numeric ID, then associate with each key a bloom filter of around 20 bits per element for your <0.1% error rate. 20 bits * 10000 elements * 3 million keys is 75GB, so if you are space limited, store a smaller, less accurate filter in memory and the more accurate filter on disk, which is consulted only if the first filter says there might be a match.
There are alternatives, but they will only reduce the size from 1.44·n·log2(1/ε) to n·log2(1/ε) bits per key. In your case ε=0.001 and n=10000, so the theoretical limit is a data structure of 99658 bits per key, or about 10 bits per element, which would be 298,974,000,000 bits or about 37 GB.
So 30GB is below the theoretical limit for a data structure with the performance and number of entries that you require, but in the same ballpark.
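The arithmetic above can be checked with a short sketch. The function names are made up for illustration; the constants come from the question (10000 elements per key, 3 million keys, ε = 0.001).

```python
import math

ELEMENTS_PER_KEY = 10_000
KEYS = 3_000_000
EPS = 0.001  # target error rate

def bloom_bits_per_element(eps):
    # Space-optimal Bloom filter: about 1.44 * log2(1/eps) bits per element
    return 1.44 * math.log2(1 / eps)

def lower_bound_bits_per_element(eps):
    # Information-theoretic lower bound: log2(1/eps) bits per element
    return math.log2(1 / eps)

def total_gb(bits_per_element):
    # Total size over all keys, in decimal gigabytes
    total_bits = bits_per_element * ELEMENTS_PER_KEY * KEYS
    return total_bits / 8 / 1e9

print(total_gb(20))                                  # the 20-bit filter suggested above
print(total_gb(bloom_bits_per_element(EPS)))         # optimal Bloom filter
print(total_gb(lower_bound_bits_per_element(EPS)))   # theoretical limit
```

Under these numbers, even the theoretical limit sits above the 30GB budget, which is why the answer suggests a two-level (memory plus disk) filter.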
Why do you want to maintain your own in-memory data structure? Why not use a regular database for this purpose? If that is too slow, why not use an in-memory database? One solution is to use in-memory sqlite3. Check this SO link, for example: Fast relational Database for simple use with Python
You create the in-memory database by passing ':memory:' to the connect method.
import sqlite3
conn = sqlite3.connect(':memory:')
What will your schema be? I can think of a wide schema, with a string as an id key (e.g. 'computer', 'house' in your example) and about 10000 additional columns: 'field1' to 'field10000', one for each element of your array. Once you construct the schema, iteratively inserting your data in the database will be simple: one SQL statement per row of your data. And from your description, the insert part is one-time-only. There are no further modifications to the database.
The biggest question is retrieval (more crucially, speed of retrieval). Retrieving the entire array for a single key like computer is again a simple SQL statement. The scalability and speed are something I don't have an idea about, and this is something you will have to experiment with. There is still hope that an in-memory database will speed up the retrieval part. Yet, I believe that this is the cheapest and fastest solution you can implement and test (much cheaper than a multiple-node cluster).
Why am I suggesting this solution? Because the setup that you have in mind is extremely similar to that of a fast-growing database-backed internet startup. All good startups handle a similar number of requests per day and use some sort of database with caching. (Caching would be the next thing to look at for your problem if a simple database doesn't scale to millions of requests. Again, it is much easier and cheaper than buying RAM/nodes.)
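As a rough sketch of the in-memory sqlite3 idea, the example below uses a narrow two-column table (one row per word/synonym pair) instead of the wide schema described above. This is a deliberate swap: SQLite's default build limits a table to 2000 columns, so 10000 columns would not fit without recompiling. The table name, column names, and `is_synonym` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# One row per (word, synonym) pair; the primary key doubles as the lookup index.
conn.execute(
    'CREATE TABLE synonyms ('
    ' word TEXT NOT NULL,'
    ' synonym TEXT NOT NULL,'
    ' PRIMARY KEY (word, synonym))'
)

rows = [
    ('computer', 'pc'), ('computer', 'mac'), ('computer', 'laptop'),
    ('house', 'building'), ('house', 'hut'), ('house', 'inn'),
]
conn.executemany('INSERT INTO synonyms (word, synonym) VALUES (?, ?)', rows)
conn.commit()

def is_synonym(conn, word, candidate):
    """Return True if candidate is stored as a synonym of word."""
    cur = conn.execute(
        'SELECT 1 FROM synonyms WHERE word = ? AND synonym = ? LIMIT 1',
        (word, candidate),
    )
    return cur.fetchone() is not None

print(is_synonym(conn, 'computer', 'laptop'))
print(is_synonym(conn, 'computer', 'hut'))
```

Whether this meets the memory and time budget at 3 million keys times 10000 synonyms is something you would have to measure.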
I have a large collection of data, about 10 million entries, and part of my program requires a great many membership checks...
if a in data:
    return True
return False
Right now I have data as dictionary entries with all their values equal to '1'.
I also have a program that uses an algorithm to figure out the same information, but for now it is slower than the dictionary method. However, I expect the size of data to continue growing...
For my current dictionary solution, would making data a frozenset or a set (or something else?) be faster?
And for the future, to find out when I need to switch to my program, does anyone know how the speed of membership checks scales with the size of a hash-based container?
Is a dictionary with 1 billion entries still fast?
In Principle
If you expect the data to keep growing, you can't use a frozenset.
A set would be smaller than a dictionary storage-wise for testing whether an element exists in it. It would be similar in speed to a dictionary lookup, since the keys of a dictionary and the items of a set are both hashed for storage and always unique. If you don't need data associated with each entry, use a set.
Practically speaking...
When you are dealing with that many entries move the data to a database. You will eventually run out of memory trying to store and read all of that into memory. With a database, you can issue a specific query to check membership. Seriously. Put that data in a database.
For this amount of data RyPeck is right - a DB will do the job much better.
One more point:
Something seems odd to me in what you've written:
If you use a dictionary to store the objects of the memberships, why is the value of each key-value pair '1'? Shouldn't each key-value pair be "id of a"-"a", where 'a' is the object?
There is an overhead of several bytes per entry in a hash table (whether a dictionary or a set doesn't make much of a difference), so for billions of entries you will run into swapping unless you have 32+ GB of memory for the application. I would start looking for a fast DB.
For a frozenset you also need to have all the data in memory in some acceptable form at creation time, which probably doubles the amount of memory needed.
I have this somewhat big data structure that stores pairs of data. The individual data is tiny and readily hashable, and there are something like a few hundred thousand data points in there.
At first, this was a simple dict that was accessed only by keys. Later on, however, I discovered that I also needed to access it by value, that is, get the key for a certain value. Since this was done somewhat less often (~1/10) than access by key, I naïvely implemented it by simply iterating over all the dict's items(). This proved a bit sluggish at a few hundred thousand calls per second: it is about 500 times slower.
So my next idea was to just save the reverse dict, too. This seems to be a rather inelegant solution, however, so I turn to you guys for help.
Do you know any data structure in Python that stores data pairs that can be accessed by either data point of the pair?
You could try bidict.
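For comparison, the "reverse dict" idea from the question can be wrapped in a small class so the two dicts cannot drift apart. This is a minimal standard-library sketch, not the bidict API; the class name `TwoWayDict` and the method `key_for` are made up for illustration, and it assumes both keys and values are hashable and the mapping is one-to-one.

```python
class TwoWayDict:
    """Hypothetical helper: keeps a forward and a reverse dict in sync."""

    def __init__(self):
        self._forward = {}   # key -> value
        self._reverse = {}   # value -> key

    def __setitem__(self, key, value):
        # Drop stale pairs first so the mapping stays one-to-one.
        if key in self._forward:
            del self._reverse[self._forward[key]]
        if value in self._reverse:
            del self._forward[self._reverse[value]]
        self._forward[key] = value
        self._reverse[value] = key

    def __getitem__(self, key):
        return self._forward[key]

    def key_for(self, value):
        # O(1) on average, instead of scanning items()
        return self._reverse[value]

    def __len__(self):
        return len(self._forward)


pairs = TwoWayDict()
pairs['a'] = 1
pairs['b'] = 2
print(pairs['a'], pairs.key_for(2))
```

The cost is roughly double the memory of a single dict, in exchange for constant-time lookups in both directions.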
Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurence_count) (everything is an integer), which I am storing in multiple Python dictionaries (tuple->int). These values are spread across multiple files on the disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried , I ran into performance issues. I am currently trying to store the values in a DB (mysql), processing multiple dictionaries at a time, since mysql provides row level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries)
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply use str on your tuples to obtain equivalent string keys; plus, I read your Q as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4GB, a struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys [and a caveat for mutable values, but as your values are ints that need not concern you]).
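A minimal sketch of the shelve approach, assuming each pickle file holds a dict mapping (word, corpus) integer tuples to counts and that totals are wanted per word. The function name `aggregate_into_shelf` and the demo files are made up for illustration.

```python
import os
import pickle
import shelve
import tempfile

def aggregate_into_shelf(filenames, shelf_path):
    """Sum counts per word across all pickled dicts into an on-disk shelf."""
    with shelve.open(shelf_path) as totals:
        for fn in filenames:
            # Only one pickled dict is held in memory at a time.
            with open(fn, 'rb') as f:
                data_dict = pickle.load(f)
            for (word, corpus), count in data_dict.items():
                key = str(word)  # shelf keys must be strings
                totals[key] = totals.get(key, 0) + count

# Demo with two tiny pickled dicts in a temporary directory.
tmpdir = tempfile.mkdtemp()
demo_dicts = [{(1, 0): 3, (2, 0): 1}, {(1, 1): 4}]
filenames = []
for i, d in enumerate(demo_dicts):
    path = os.path.join(tmpdir, 'counts_%d.pkl' % i)
    with open(path, 'wb') as f:
        pickle.dump(d, f)
    filenames.append(path)

shelf_path = os.path.join(tmpdir, 'totals')
aggregate_into_shelf(filenames, shelf_path)

with shelve.open(shelf_path) as totals:
    print(totals['1'], totals['2'])
```

Each update is a disk write, so expect this to be slower than an in-memory dict; the benefit is that the totals never need to fit in RAM.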
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:
    # pickle files must be opened in binary mode
    with open(fn, 'rb') as f:
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        word, corpus = k
        result[k] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying to set up!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
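The newid scheme can be sketched as follows. It assumes word ids are integers numbered from 0 within each corpus and that the number of words per corpus is known up front; `build_offsets` and the sample sizes are made up for illustration, and a plain list stands in for the numpy array.

```python
from itertools import accumulate

def build_offsets(words_per_corpus):
    """start_newid[c] = number of words in all corpora before corpus c."""
    return [0] + list(accumulate(words_per_corpus))[:-1]

def newid(word, corpus, start_newid):
    # Flatten the (word, corpus) tuple into a single integer index
    return word + start_newid[corpus]

words_per_corpus = [3, 5, 2]            # hypothetical corpus sizes
start_newid = build_offsets(words_per_corpus)

# A flat list of counters replaces the tuple-keyed dict.
counts = [0] * sum(words_per_corpus)
counts[newid(1, 2, start_newid)] += 4   # word 1 of corpus 2
counts[newid(0, 0, start_newid)] += 1   # word 0 of corpus 0

print(start_newid, counts)
```

Indexing a list or array by integer avoids hashing tuples and storing a tuple object per key, which is where the memory and speed gain would come from.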
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict and deal with any of the pairs that are in the first 10%, then repeat for each of the other dicts. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
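The multi-pass idea can be sketched like this. One substitution to note: keys are assigned to a pass by `hash(key) % num_passes` rather than by "the first 10%" of a sorted key range, which avoids needing a global ordering. The names `aggregate_in_passes`, `dict_loaders`, and `write_result` are made up for illustration; in practice each loader would unpickle one file and `write_result` would write to disk.

```python
from collections import defaultdict

def aggregate_in_passes(dict_loaders, num_passes, write_result):
    """Aggregate counts over many big dicts, one slice of the keys per pass."""
    for p in range(num_passes):
        partial = defaultdict(int)
        for load in dict_loaders:
            data_dict = load()  # only one big dict is in memory at a time
            for key, count in data_dict.items():
                # Handle only the keys that belong to this pass
                if hash(key) % num_passes == p:
                    partial[key] += count
            del data_dict
        write_result(p, dict(partial))

# Demo: small in-memory dicts stand in for the pickled files.
sources = [
    {(1, 0): 3, (2, 0): 1, (3, 1): 5},
    {(1, 0): 4, (3, 1): 2, (4, 1): 7},
]
loaders = [lambda d=d: d for d in sources]

results = {}
aggregate_in_passes(loaders, 3, lambda p, part: results.update(part))
print(results)
```

Each source is read once per pass, so the trade-off is I/O time (num_passes full reads) against peak memory (one source dict plus roughly 1/num_passes of the result).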