Python's underlying hash data structure for dictionaries

Python's underlying hash data structure for dictionaries - python

I am build a very large dictionary and I am performing many checks to see if a key is in the structure and then adding if it unique or incrementing a counter if it is identical.
Python uses a hash data structure to store dictionaries (not to be confused with a cryptographic hash function). Lookups are O(1), but if the hash table is full the it has to be rehashed which is very expensive.
My Question is, would I be better off using a AVL Binary Search Tree or is a hash table good enough?

The only way to be sure would be to implement both and check, but my informed guess is that the dictionary will be faster, because a binary search tree has cost O(log(n)) for lookup and insertion, and I think that except under the most pessimal of situations (such as massive hash collisions) the hash table's O(1) lookup will outweigh the occasional resize.
If you take a look at the Python dictionary implementation, you'll see that:
a dictionary starts out with 8 entries (PyDict_MINSIZE);
a dictionary with 50,000 or fewer entries quadruples in size when it grows;
a dictionary with more than 50,000 entries doubles in size when it grows;
key hashes are cached in the dictionary, so they are not recomputed when the dictionary is resized.
(The "NOTES ON OPTIMIZING DICTIONARIES" are worth reading too.)
So if your dictionary has 1,000,000 entries, I believe that it will be resized eleven times (8 → 32 → 128 → 512 → 2048 → 8192 → 32768 → 131072 → 262144 → 524288 → 1048576 → 2097152) at a cost of 2,009,768 extra insertions during the resizes. This seems likely to be much less than the cost of all the rebalancing involved in 1,000,000 insertions into an AVL tree.

What's the ratio of items vs unique items?
What's the expected number of unique items?
If a hash bucket fills, then extending should just be a matter of some memory reallocation, not rehashing.
Testing a counting dict should be very quick and easy to do.
Note also the counter class available since python 2.7
http://docs.python.org/library/collections.html#counter-objects
http://svn.python.org/view?view=rev&revision=68559

Python dictionaries are highly optimized. Python makes various special-case optimizations that the Python developers cater for in the CPython dictionary implementation.
In CPython, all PyDictObject's are optimized for dictionaries containing only string keys.
Python's dictionary makes an effort to never be more than 2/3rds full.
The book "Beautiful Code" discusses this all.
The eighteenth chapter is Python's Dictionary Implementation: Being All Things to All People by Adrew Kuchling
It is much better to use it than try to achieve the hand crafted custom implementation which will have to replicate all these optimizations to be any where near the main CPython implementation of dictionary look ups.

You would have to implement your own data structures in C to stand a reasonable chance of beating the built-in structures.
Also you can avoid some of the overhead by using get, avoiding find existing elements twice.
Or collections.Counter if you are using python 2.7+.
def increment(map, key):
map[key] = map.get(key,0)+1

Using a dict is O(1). As the dict grows, reallocation is sometimes required, but that is amortized O(1)
If your other algorithm is O(log n), the simple dict will always beat it as the dataset grows larger.
If you use any type of tree, I would expect a O(log n) component in there somewhere.
Not only is a hash table good enough, it is better

Related

Why use hashes instead of test true equality?

I've recently been looking into Python's dictionaries (I believe they're called associate arrays in other languages) and was confused by a couple of the restrictions on its keys.
First, dict keys must be immutable. When I looked up the logic behind it the answer was that dictionaries work like hash tables to look up the values for keys, and thus immutable keys (if they're hashable at all) may change their hash value, causing problems when retrieving the value.
I understood why that was the case just fine, but I was still somewhat confused by what the point of using a hash table would be. If you simply didn't hash the keys and tested for true equality (assuming indentically constructed objects compare equal), you could replicate most of the functionality of the dictionary without that limitation just by using two lists.
So, I suppose that's my real question - what's the rationale behind using hashes to look up values instead of equality?
If I had to guess, it's likely simply because comparing integers is incredibly fast and optimized, whereas comparing instances of other classes may not be.

You seem to be missing the whole point of a hash table, which is fast (O(1))1 retrieval, and which cannot be implemented without hashing, i.e. transformation of the key into some kind of well-distributed integer that is used as an index into a table. Notice that equality is still needed on retrieval to be able to handle hash collisions2, but you resort to it only when you already narrowed down the set of elements to those having a certain hash.
With just equality you could replicate similar functionality with parallel arrays or something similar, but that would make retrieval O(n)3; if you ask for strict weak ordering, instead, you can implement RB trees, that allow for O(log n) retrieval. But O(1) requires hashing.
Have a look at Wikipedia for some more insight on hash tables.
Notes
It can become O(n) in pathological scenarios (where all the elements get put in the same bucket), but that's not supposed to happen with a good hashing function.
Since different elements may have the same hash, after getting to the bucket corresponding to the hash you must check if you are actually retrieving the element associated with the given key.
Or O(log n) if you keep your arrays sorted, but that complicates insertion, which becomes on average O(n) due to shifting elements around; but then again, if you have ordering you probably want an RB tree or a heap.

By using hastables you achieve O(1) retrieval data, while comparing against each independent vale for equality will take O(n) (in a sequential search) or O(log(n)) in a binary search.
Also note that O(1) is amortized time, because if there are several values that hash to the same key, then a sequential search is needed among these values.

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?

list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.

Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.

I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as a hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.

While I have not measured anything performance related in python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this gives information only about the asymptotic behavior of something. That means if your n is very high O(1) will always be faster - theoretically. In practice however n often needs to be much bigger than your usual data set will be.
So sets are not faster than lists per se, but only if you have to handle a lot of elements.

Python uses hashtables, which have O(1) lookup.

Basically, Depends on the operation you are doing …
*For adding an element - then a set doesn’t need to move any data, and all it needs to do is calculate a hash value and add it to a table. For a list insertion then potentially there is data to be moved.
*For deleting an element - all a set needs to do is remove the hash entry from the hash table, for a list it potentially needs to move data around (on average 1/2 of the data.
*For a search (i.e. an in operator) - a set just needs to calculate the hash value of the data item, find that hash value in the hash table, and if it is there - then bingo. For a list, the search has to look up each item in turn - on average 1/2 of all of the terms in the list. Even for many 1000s of items a set will be far quicker to search.

Actually sets are not faster than lists in every scenario. Generally the lists are faster than sets. But in the case of searching for an element in a collection, sets are faster because sets have been implemented using hash tables. So basically Python does not have to search the full set, which means that the time complexity in average is O(1). Lists use dynamic arrays and Python needs to check the full array to search. So it takes O(n).
So finally we can see that sets are better in some case and lists are better in some cases. Its up to us to select the appropriate data structure according to our task.

A list must be searched one by one, where a set or dictionary has an index for faster searching.

Memory-efficient string-to-string map in Python (or C)

I need a memory-efficient data structure for for storing about a million key--value pairs, where keys are strings of about 80 bytes, and values are strings of about 200 bytes, the total key and value size being about 280MB. I also need efficient lookup of value by key, preferably a hash-map. The memory overhead should be as little as possible, e.g. for 280MB of useful data, the data structure shouldn't use more than 300MB of virtual memory (including malloc() overhead and everything else). The usage pattern is the following: we start with an empty data structure, and we populate it gradually, never changing keys, and never changing the length of values. As a plus, the data structure may support changing the length of values, at the expense of a 100% value overhead (meaning that for x value bytes, x bytes might be wasted in temporarily in unused buffer space).
I need a pure Python module, or a built-in Python module, or a C implementation preferably with (C)Python bindings. I'd prefer if it was possible to serialize the whole data structure to disk, and to read it back very quickly.
Just to prove that such a small overhead is possible, I've created a simple design with open addressing, the hash table of 1.25 million elements containing 4-byte pointers to 1MB data blocks, the data blocks containing the key and value lengths as base-128 varints. This design has an important limitation: it doesn't allow removing or changing pairs without wasting their memory area. According to my calculations with 1 million key--value pairs of 280 bytes each, the overhead is less than 3.6% (10 080 000 bytes). The limits above are more generous, they allow 20 000 000 bytes of overhead.
I've just found http://www.pytables.org/ , which provides fast access and memory-efficient packing of data. I have to examine it more closely to check if it suits my needs.

Ok, the dirt-simple approach.
Use a python dictionary for the data structure. I filled a python dictionary with 1 million random key-value pairs where the key was 80 characters and the value 200 characters. It took 360,844 Kb on my computer, which is outside of your specification of no more than 300 MB, but I offer it up as a solution anyway because it's still pretty memory efficient.
This also fails your requirement of having a C API. I'm not sure why you need C, but as the question is tagged Python and lacks a C tag, I'll offer the pure Python to see if it just might fit the bill.
Regarding persistence. Use the cPickle module. It's very fast and, again, dirt-simple. To save your dictionary:
cPickle.dump(mydict, "myfile.pkl")
To reload your dictionary:
mydict = cPickle.load("myfile.pkl")
A second dirt-simple idea is to use the shelve module, which is basically disk-based python dictionary. Memory overhead is very low (it's all on disk). But it's also much slower.

Martijn mentioned this in a comment (not sure why people comment with answers), but I agree: use SQLite. You should give it a try and see if it will meet your needs.

If you don't plan to have a large amounts of deletes, then this isn't that hard. Deletes lead to fragmentation.
You also need to commit to a fixed length key. You mentioned 80 bytes. Are your keys allowed to duplicate? If not, it's even easier.
So, here is what you do.
You create an array of:
struct {
char value[80];
char *data;
} key;
And you keep this array sorted.
If you keys can duplicate, then you need:
struct link {
char *data;
link *next;
}
struct {
char value[80];
link *data;
} key;
(My C is rusty, but this is the gist of it) The latter has each key pointing to a linked list of values.
Then a lookup is a simple binary search. The "pain" is in maintaining this array and inserting/deleting keys. It's not as painful as it sounds, but it saves a LOT of memory, especially on 64bit systems.
What you want to reduce is the number of pointers. Pointers are expensive when you have lots of structures filled with pointers. On a 64bit system, a pointer is 8 bytes. So for a single pointer, there goes 8MB of your memory budget.
So, the expense is in building the array, copying and compacting memory (if you "know" you will have a million rows and can commit to that, then malloc(1000000 * sizeof(key)) right away, it'll save you some copying during expansion).
But don't be afraid of, once it's up and running, performance is quite good. Modern cpus are actually pretty good at copying 100M blocks of memory around.
Just as an aside, I just did something much like this in Java. On a 64bit JVM, a Map with 25M entries is 2G of RAM. My solution (using similar techniques to this) has at around 600M). Java uses more pointers than C, but the premise is the same.

Have you tried using a straightforward dict? Most of your data is in strings, so the overhead might fit within your requirements.

You can use the sha1 of the key instead of the key itself. If the keys are unique, then the sha1 hash of the keys is likely, too. It provides a memory savings to try to squeak in under your limit.
from random import choice
from string import letters
from hashlib import sha1
def keygen(length):
return "".join(choice(letters) for _ in xrange(length))
def gentestdata(n=1000*1000):
# return dict((sha1(keygen(80)).digest(), keygen(200)) for _ in xrange(n))
d = {}
for _ in xrange(n):
key = sha1(keygen(80)).digest()
assert key not in d
value = keygen(200)
d[key] = value
return d
if __name__ == '__main__':
d = gentestdata()
On my ubuntu box, this tops out at 304 MB of memory:
2010-10-26 14:26:02 hbrown#hbrown-ubuntu-wks:~$ ps aux | grep python
[...]
hbrown 12082 78.2 7.5 307420 303128 pts/1 S+ 14:20 4:47 python
Close enough? It's python, not C.
Later: also, if your data is somewhat redundant, you can gzip the values. It's a time versus space trade-off.

Using SQLite is a good idea. A quick implementation can tell if you are fast enough with little effort.
If you determine you have to roll your own, I'd recommend the following:
How well can you predict the number of pairs, or an upper limit for that?
How well can you predict the total data size, or an upper limit for that?
Arena allocator for strings and nodes. (Usually, you'd work on a list of arenas, so you don't have to predict the total size).
Alignment depends on your algorithms, in principle you could pack it byte-tight, and the only overhead is your overallocation, which only minimally affects your working set.
However, if you have to run any cmp/copy etc. operations on these strings, remember that with the following guarantees, you can squeeze a little or a lot from these string operations:
all elements are CPU word aligned
all pad bytes are (e.g.) 0
you can safely read "beyond" a string end as long as you don't cross a CPU border
Hash table for the index. A dictionary would work, too, but that makes sense only if potential degradation / rehashing would be a serious issue. I don't know any "stock" hashtable implementation for C, but there should be one, right? right? Just replace allocations with calls to the arena allocator.
Memory Locality
If you can guarantee that lookup will never request a string that is not in the map, you should store the keys in a separate arena, as they are needed only on hash collisions. That can improve memory locality significantly. (In that case, if you ever have a "final" table, you could even copy the colliding keys to a new arena, and throw away all the others. The benefits of that are probably marginal, though.)
Separation can help or hurt, depending on your access patterns. If you typically use the value once after each lookup, having them pair-wise in the same arena is great. If you e.g. look up a few keys, then use their values repeatedly, separate arenas make sense.
If you have to support "funny characters" / Unicode, normalize your strings before storing them.

You could use struct module to pack binary data and unpack it when needed.
You can implement a memory efficient storage using this approach. I guess access would be a pain.
http://docs.python.org/library/struct.html

Apache Portable Runtime (aka APR) has a c-based hash table. You can see documentation at http://apr.apache.org/docs/apr/0.9/group_apr_hash.html
With apr_hash_t all you store is void*. So it gives you full control over values. SO if you want you can store pointer to a 100 byte block instead of actual length of the string.

Judy should be memory-efficient: http://judy.sourceforge.net/
(Benchmarks: http://www.nothings.org/computer/judy/, see "Data Structure Size").
See also: http://www.dalkescientific.com/Python/PyJudy.html
Also,
For keys of a fixed size there is http://panthema.net/2007/stx-btree/ in C++ (I'm sure that with a custom C wrappers it can be used from CPython).
If the dataset allows it, you can store the variable-length keys in the value and use a hash or a prefix of the variable-length key as the fixed-length key.
The same logic applies to http://google-opensource.blogspot.ru/2013/01/c-containers-that-save-memory-and-time.html and http://code.google.com/p/sparsehash/ - istead of using a heavy std::string as a key, use an 32-bit or 64-bit integer key, making it somehow from the real variable-length key.

Since I couldn't find any existing solutions which will pack the memory tightly, I've decided to implement it in C for myself. See my design with open addressing in the question.

Best data-structure to use for two ended sorted list

I need a collection data-structure that can do the following:
Be sorted
Allow me to quickly pop values off the front and back of the list O(log n)
Remain sorted after I insert a new value
Allow a user-specified comparison function, as I will be storing tuples and want to sort on a particular value
Thread-safety is not required
Optionally allow efficient haskey() lookups (I'm happy to maintain a separate hash-table for this though)
My thoughts at this stage are that I need a priority queue and a hash table, although I don't know if I can quickly pop values off both ends of a priority queue. Another possibility is simply maintaining an OrderedDictionary and doing an insertion sort it every-time I add more data to it.
Because I'm interested in performance for a moderate number of items (I would estimate less than 200,000), I am unsure about what asymptotic performance I require for these operations. n will not grow infinitely, so a low constant performance k in k * O(n) may be as important O(n). That said, I would prefer that both the insert and pop operations take O(log n) time.
Furthermore, are there any particular implementations in Python? I would really like to avoid writing this code myself.

You might get good performance for these kinds of operations using blist or a database (such as the sqlite which is in the stdlib).

I suggest some sort of balanced binary tree such as a red-black tree.
A search on PyPi throws up a couple of implementations. Searching on google will give you more.
bintrees on PyPi looks very complete and has both Python and C/Cython implementations. I have not used it though, so caveat emptor.
A red-black tree is kept sorted and most operations (insert, delete, find) are O(log2(N)), so finding an element in a tree of 200,000 entries will take on average 17-18 comparisons.

Sounds like a skip list will fulfill all your requirements. It's basically a dynamically-sized sorted linked list, with O(log n) insertions and removals.
I don't really know Python, but this link seems to be relevant:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/

I presume you need it sorted because you access element by rank in the sorted order?
You can use any implementation of any balanced binary tree, with the additional information at each node which tells you the numbers of descendants of that node (usually called the Order Statistic Binary Tree).
With this structure, given the rank of an element (even min/max), you can access/delete it in O(log n) time.
This makes all operations (access/insert/delete by rank, pop front/back, insert/delete/search by value) O(logn) time, while allowing custom sort methods.
Also, apparently python has an AVL tree (one of the first balanced tree structures) implementation which supports order statistics: http://www.python.org/ftp/python/contrib-09-Dec-1999/DataStructures/avl.README
So you won't need a custom implementation.

Except for the hashing, what you're looking for is a double-ended priority queue, aka a priority deque.
If your need for sorting doesn't extend beyond managing the min and max of your data, another structure for you to look at might be an interval heap, which has the advantage of O(1) lookup of both min and max if you need to peek at values (though deleteMin and deleteMax are still O(log(N)) ). Unfortunately, I'm not aware of any implementations in Python, so I think you'd have to roll your own.
Here's an addendum to an algorithms textbook that describes interval heaps if you're interested:
http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf

If you can really allow O(log n) for pop, dequeue, and insert, then a simple balanced search tree like red-black tree is definitely sufficient.
You can optimize this of course by maintaining a direct pointer to the smallest and largest element in the tree, and then updating it when you (1) insert elements into the tree or (2) pop or dequeue, which of course invalidate the resp. pointer. But because the tree is balanced, there's some shuffling going out anyway, and you can update the corr. pointer at the same time.
There is also something called min-max heap (see the Wikipedia entry for Binary Heap), which implements exactly a "double-ended priority queue", i.e. a queue where you can pop both from front end and the rear end. However there you can't access the whole list of objects in order, whereas a search tree can be iterated efficiently through in O(n) time.
The benefit of a min-max heap however is that the current min and max objects can be read in O(1) time, a search tree requires O(log(n)) just to read the min or max object unless you have the cached pointers as I mentioned above.

If this were Java I'd use a TreeSet with the NavigableSet interface.
This is implemented as a Red-Black-Tree.

Memory efficiency: One large dictionary or a dictionary of smaller dictionaries?

I'm writing an application in Python (2.6) that requires me to use a dictionary as a data store.
I am curious as to whether or not it is more memory efficient to have one large dictionary, or to break that down into many (much) smaller dictionaries, then have an "index" dictionary that contains a reference to all the smaller dictionaries.
I know there is a lot of overhead in general with lists and dictionaries. I read somewhere that python internally allocates enough space that the dictionary/list # of items to the power of 2.
I'm new enough to python that I'm not sure if there are other unexpected internal complexities/suprises like that, that is not apparent to the average user that I should take into consideration.
One of the difficulties is knowing how the power of 2 system counts "items"? Is each key:pair counted as 1 item? That's seems important to know because if you have a 100 item monolithic dictionary then space 100^2 items would be allocated. If you have 100 single item dictionaries (1 key:pair) then each dictionary would only be allocation 1^2 (aka no extra allocation)?
Any clearly laid out information would be very helpful!

Three suggestions:
Use one dictionary.
It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing.
Optimize later.
If you are really worried about performance, then abstract the problem make a class to wrap whatever lookup mechanism you end up using and write your code to use this class. You can change the implementation later if you find you need some other data structure for greater performance.
Read up on hash tables.
Dictionaries are hash tables, and if you are worried about their time or space overhead, you should read up on how they're implemented. This is basic computer science. The short of it is that hash tables are:
average case O(1) lookup time
O(n) space (Expect about 2n, depending on various parameters)
I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables:
O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size.
O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces. Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not.
Here are some more resources that might help:
The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables.
The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why "the amount of space used by the hash table is proportional to the number of associations in the table". This might interest you.
Here are some things you might consider if you find you actually need to optimize your dictionary implementation:
Here is the C source code for Python's dictionaries, in case you want ALL the details. There's copious documentation in here:
dictobject.h
dictobject.c
Here is a python implementation of that, in case you don't like reading C.
(Thanks to Ben Peterson)
The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up. Note there's a tradeoff between your load factor and how frequently you need to rehash. Rehashes can be costly.

If you're using Python, you really shouldn't be worrying about this sort of thing in the first place. Just build your data structure the way it best suits your needs, not the computer's.
This smacks of premature optimization, not performance improvement. Profile your code if something is actually bottlenecking, but until then, just let Python do what it does and focus on the actual programming task, and not the underlying mechanics.

"Simple" is generally better than "clever", especially if you have no tested reason to go beyond "simple". And anyway "Memory efficient" is an ambiguous term, and there are tradeoffs, when you consider persisting, serializing, cacheing, swapping, and a whole bunch of other stuff that someone else has already thought through so that in most cases you don't need to.
Think "Simplest way to handle it properly" optimize much later.

Premature optimization bla bla, don't do it bla bla.
I think you're mistaken about the power of two extra allocation does. I think its just a multiplier of two. x*2, not x^2.
I've seen this question a few times on various python mailing lists.
With regards to memory, here's a paraphrased version of one such discussion (the post in question wanted to store hundreds of millions integers):
A set() is more space efficient than a dict(), if you just want to test for membership
gmpy has a bitvector type class for storing dense sets of integers
Dicts are kept between 50% and 30% empty, and an entry is about ~12 bytes (though the true amount will vary by platform a bit).
So, the fewer objects you have, the less memory you're going to be using, and the fewer lookups you're going to do (since you'll have to lookup in the index, then a second lookup in the actual value).
Like others, said, profile to see your bottlenecks. Keeping an membership set() and value dict() might be faster, but you'll be using more memory.
I'd also suggest reposting this to a python specific list, such as comp.lang.python, which is full of much more knowledgeable people than myself who would give you all sorts of useful information.

If your dictionary is so big that it does not fit into memory, you might want to have a look at ZODB, a very mature object database for Python.
The 'root' of the db has the same interface as a dictionary, and you don't need to load the whole data structure into memory at once e.g. you can iterate over only a portion of the structure by providing start and end keys.
It also provides transactions and versioning.

Honestly, you won't be able to tell the difference either way, in terms of either performance or memory usage. Unless you're dealing with tens of millions of items or more, the performance or memory impact is just noise.
From the way you worded your second sentence, it sounds like the one big dictionary is your first inclination, and matches more closely with the problem you're trying to solve. If that's true, go with that. What you'll find about Python is that the solutions that everyone considers 'right' nearly always turn out to be those that are as clear and simple as possible.

Often times, dictionaries of dictionaries are useful for other than performance reasons. ie, they allow you to store context information about the data without having extra fields on the objects themselves, and make querying subsets of the data faster.
In terms of memory usage, it would stand to reason that one large dictionary will use less ram than multiple smaller ones. Remember, if you're nesting dictionaries, each additional layer of nesting will roughly double the number of dictionaries you need to allocate.
In terms of query speed, multiple dicts will take longer due to the increased number of lookups required.
So I think the only way to answer this question is for you to profile your own code. However, my suggestion is to use the method that makes your code the cleanest and easiest to maintain. Of all the features of Python, dictionaries are probably the most heavily tweaked for optimal performance.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.