storing values between iterations (cache-like mechanism) in pyCUDA

storing values between iterations (cache-like mechanism) in pyCUDA - python

Good morning all,
I am kind of newbie with cuda/pyCuda, so probably this will have an easy solution employing some mechanism that I don't know....
I am employing pycuda to operate over pairs of values: I subtract the smallest from the biggest and then perform some time-consuming operations. As it must be repeated many times, it is well suited for GPUs.
However, most of the times the result of the substraction is the same. Then, performing the time-consuming operations make no sense. what I do in the non-GPU version of my code is something like:
myFunction(A,B):
index=A-B
try:
value = myDictionary[index]
except:
value = expensiveOperation(index)
myDictionary[index] = value
return value
As accessing the dictionary is much faster than expensiveOperation, and the value is found most of the times, I obtain a significant time gain.
When porting this to GPUs, I can call to myFunction(A,B) with a high degree of parallelism, which is great. However, I don't know how could I employ this dictionary mechanism -or a similar one- to avoid redundant operations.
any thoughts on this?
Thanks for your help
edit: I would like to know, is it possible to store the dictionary on the GPU, or should I copy it every time? If it's on the GPU, can it be accessed/edited by several cores at the same time? How should I implement it?

You could try this:
myFunction(A,B):
index=A-B
if index in myDictionary.keys():
value = myDictionary[index]
else:
value = expensiveOperation(index)
myDictionary[index] = value
return value

It seems your question is about implementing some sort of memoise facility inside GPU code. I don't think this is worth pursuing. In the GPU arithmetic operations are almost free, but memory access is very expensive (and random memory access even more so). Performing a dictionary/hash table look-up in GPU memory to retrieve an arithmetic result from a cache is almost guaranteed to be slower that the cost of just calculating the result. It sounds counter-intuitive, but that is the reality of GPU computing.
In an interpreted language like Python, which is relatively slow, using a fast native memoisation mechanism makes a lot of sense, and memoising the results of a complete kernel function call also could yield useful performance benefits for expensive kernels. But memoisation inside CUDA doesn't seem all that useful.

Related

Does every simple mathematical operation use the same amount of power (as in, battery power)?

Recently I have been revising some of my old python codes, which are essentially loops of algebra, in order to have them execute faster, generally by eliminating some un-necessary operations. Often, changing the value of an entry in a list from 0 (as a python float, which I believe is a double by default) to the same value, which is obviously not necessary. Or, checking if a float is equal to something, when it MUST be that thing, because a preceeding "if" would not have triggered if it wasn't, or some other extraneous operation. This got me wondering about what will preserve my battery more, as I do a some of my coding on the bus where I can't plug my laptop in.
For example, which of the following two operations would be expected to use less battery power?
if b != 0: #b was assigned previously, and I know it is zero already
b = 0
or:
b = 0
The first one checks if b is zero, and it is, so it doesn't do the next part. The second one just assigns b to zero without bothering to check. I believe the first one is more time-efficient, as you don't have to change anything in memory. Is that correct, and if so, would it also be more power-efficient? Does "more time efficient" always imply "more power efficient"?

I suggest watching this talk by Chandler Carruth: "Efficiency with Algorithms, Performance with Data Structures"
He addresses the idea of "Power efficient instructions" at 4m 49s in the video. I agree with him, thinking about how much watt particular code consumes is useless. As he put it
Q: "How to save battery life?"
A: "Finish ruining the program".
Also, in Python you do not have low level control to be even thinking about low level problems like this. Use appropriate data structures and algorithms, and pray that Python interpreter will give you well optimized byte-code.

Does every simple mathematical operation use the same amount of power (as in, battery power)?
No. It's not the same to compute a two number addition than a fourier transform of a 20 megapixel photo.
I believe the first one is more time-efficient, as you don't have to change anything in memory. Is that correct, and if so, would it also be more power-efficient? Does "more time efficient" always imply "more power efficient"?
Yes. You are right on your intuitions but these are very trivial examples. And if you dig deeper you will get into uncharted territory of weird optimization that's quite difficult to grasp (e.g., see this question: Times two faster than bit shift?)

In general the more your code utilizes system resources the greater power those resources would use. However it is more useful to optimize your code based on time or size constraints instead of thinking about high level code in terms of power draw.
One way of doing this is Big O notation. In essence, Big O notation is a way of comparing the size and or runtime complexity of an algorithm. https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/
A computer at its lowest level is large quantity of transistors which require power to change and maintain their state.
It would be extremely difficult to anticipate how much power any one line of python code would draw.

I once had questions like this. Still do sometimes. Here's the answer I wish someone told me earlier.
Summary
You are correct that generally, if your computer does less work, it'll use less power.
But we have to be really careful in figuring out which logical operations involve more work and which ones involve less work - in this case:
Reading vs writing memory is usually the same amount of work.
if and any other conditional execution also costs work.
Python's "simple operations" are not "simple operations" for the CPU.
But the idea you had is probably correct for some cases you had in mind.
If you're concerned about power consumption, measure where power is being used.
For some perspective: You're asking about which Python code costs you one more drop of water, but really in Python every operation costs a bucket and your whole Python program is using a river and your computer as a whole is using an ocean.
Direct Answers
Don't apply these answers to Python yet. Read the rest of the answer first, because there's so much indirection between Python and the CPU that you'll mislead yourself about how they're connected if you don't take that into account.
I believe the first one is more time-efficient, as you don't have to change anything in memory.
As a general rule, reading memory is just as slow as writing to memory, or even slower depending on exactly what your computer is doing. For further reading you'll want to look into CPU memory cache levels, memory access times, and how out-of-order execution and data dependencies factor into modern CPU architectures.
As a general rule, the if statement in a language is itself an operation which can have a non-negligible cost. For further reading you should look into how CPU pipelining relates to branch prediction and branch penalties. Also look into how if statements are implemented in typical CPU instruction sets.
Does "more time efficient" always imply "more power efficient"?
As a general rule, more work efficient (doing less work - less machine instructions, for example) implies more power efficient, because on modern hardware (this wasn't always this way) your hardware will use less power when it's not doing anything.
You should be careful about the idea of "more time efficient" though, because modern hardware doesn't always execute the same amount of work in the same amount of time: for further reading you'll want to look into CPU frequency scaling, ARM's big.LITTLE architectures, and discussions about the "Race to Idle" concept as a starting point.
"One Simple Operation" - CPU vs. Python
Your question is about Python, so it's important to realize that Python's x != 0, if, and x = 0 do not map directly to simple operations in the CPU.
For further reading, especially if you're familiar with C, I would recommend taking a long look at how Python is implemented. There are many implementations - the main one is CPython, which is a C program that reads and interprets Python source, converts it into Python "bytecode" and then when running interprets that bytecode one by one.
As a baseline, if you're using Python, any one "simple" operation is actually a lot of CPU operations, as each step in the Python interpreter is multiple CPU operations, but which ones cost more might be surprising.
Let's break down the three used in our example (I'm primarily describing this from the perspective of the main Python implementation written in C, called "CPython", which I am the most closely familiar with, but in general this explanation is roughly applicable to all of them, though some will be able to optimize out certain steps):
x != 0
It looks like a simple operation, and if this was C and x was an int it would be just one machine instruction - but Python allows for operator overloading, so first Python has to:
look up x (at least one memory read, but may involve one or more hashmap lookups in Python's internals, which is many machine operations),
check the type of x (more memory reading),
based on the type look up a function pointer that implements the not-equality operation (one or arbitrarily many memory reads and one or more arbitrarily many conditional branches, with data dependencies between them),
only then it can finally call that function with references to Python objects holding the values of x and 0 (which is also not "free" - look up function calling ABI for more on that).
All that and more has to be done by the CPU even if x is a Python int or float mapping closely to the CPU's native numerical data types.
x = 0
An assignment is actually far cheaper in Python (though still not trivial): it only has to get as far as step 1 above, because once it knows "where" x is, it can just overwrite that pointer with the pointer to the Python object representing 0.
if
Abstractly speaking, the Python if statement has to be able to handle "truthy" and "falsey" values, which in the most naive implementation would involves running through more CPU instructions to evaluate what result of the condition is according to Python's semantics of what's true and what's false.
Sidenote About Optimizations
Different Python implementations go to different lengths to get Python operations closer to as few CPU operations in possible. For example, an optimizing JIT (Just In Time) compiler might notice that, inside some loop on an array, all elements of the array are native integers and actually reduce the if x != 0 and x = 0 parts into their respective minimal machine instructions, but that only happens in very specific circumstances when the optimizing logic can prove that it can safely bypass a lot of the behavior it would normally need to do.
The biggest thing here is this: a high-level language like Python is so removed from the hardware that "simple" operations are often complex "under the covers".
What You Asked vs. What I Think You Wanted To Ask
Correct me if I'm wrong, but I suspect the use-case you actually had in mind was this:
if x != 0:
# some code
x = 0
vs. this:
if x != 0:
# some code
x = 0
In that case, the first option is superior to the second, because you are already paying the cost of if x != 0 anyway.
Last Point of Emphasis
The hardest breakthrough for me was moving away from trying to reason about individual instructions in my head, and instead switching into looking at how things work and measuring real systems.
Looking at how things work will teach you how to optimize, but measuring will show you where to optimize.
This question is great for exploring the former, but for your stated motivation of reducing power consumption on your laptop, you would benefit more from the latter.

Running parallel iterations

I am trying to run a sort of simulations where there are fixed parameters i need to iterate on and find out the combinations which has the least cost.I am using python multiprocessing for this purpose but the time consumed is too high.Is there something wrong with my implementation?Or is there better solution.Thanks in advance
import multiprocessing
class Iters(object):
#parameters for iterations
iters['cwm']={'min':100,'max':130,'step':5}
iters['fx']={'min':1.45,'max':1.45,'step':0.01}
iters['lvt']={'min':106,'max':110,'step':1}
iters['lvw']={'min':9.2,'max':10,'step':0.1}
iters['lvk']={'min':3.3,'max':4.3,'step':0.1}
iters['hvw']={'min':1,'max':2,'step':0.1}
iters['lvh']={'min':6,'max':7,'step':1}
def run_mp(self):
mps=[]
m=multiprocessing.Manager()
q=m.list()
cmain=self.iters['cwm']['min']
while(cmain<=self.iters['cwm']['max']):
t2=multiprocessing.Process(target=mp_main,args=(cmain,iters,q))
mps.append(t2)
t2.start()
cmain=cmain+self.iters['cwm']['step']
for mp in mps:
mp.join()
r1=sorted(q,key=lambda x:x['costing'])
returning=[r1[0],r1[1],r1[2],r1[3],r1[4],r1[5],r1[6],r1[7],r1[8],r1[9],r1[10],r1[11],r1[12],r1[13],r1[14],r1[15],r1[16],r1[17],r1[18],r1[19]]
self.counter=len(q)
return returning
def mp_main(cmain,iters,q):
fmain=iters['fx']['min']
while(fmain<=iters['fx']['max']):
lvtmain=iters['lvt']['min']
while (lvtmain<=iters['lvt']['max']):
lvwmain=iters['lvw']['min']
while (lvwmain<=iters['lvw']['max']):
lvkmain=iters['lvk']['min']
while (lvkmain<=iters['lvk']['max']):
hvwmain=iters['hvw']['min']
while (hvwmain<=iters['hvw']['max']):
lvhmain=iters['lvh']['min']
while (lvhmain<=iters['lvh']['max']):
test={'cmain':cmain,'fmain':fmain,'lvtmain':lvtmain,'lvwmain':lvwmain,'lvkmain':lvkmain,'hvwmain':hvwmain,'lvhmain':lvhmain}
y=calculations(test,q)
lvhmain=lvhmain+iters['lvh']['step']
hvwmain=hvwmain+iters['hvw']['step']
lvkmain=lvkmain+iters['lvk']['step']
lvwmain=lvwmain+iters['lvw']['step']
lvtmain=lvtmain+iters['lvt']['step']
fmain=fmain+iters['fx']['step']
def calculations(test,que):
#perform huge number of calculations here
output={}
output['data']=test
output['costing']='foo'
que.append(output)
x=Iters()
x.run_thread()

From a theoretical standpoint:
You're iterating every possible combination of 6 different variables. Unless your search space is very small, or you wanted just a very rough solution, there's no way you'll get any meaningful results within reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straighforward "shape" (it's injective), you can use a greedy algorithm such as hill climbing, or gradient descent. If it's more complex, you can try shotgun hill climbing.
There are a lot more complex algorithms, but these are the basic, and will probably help you a lot in this situation.
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function 19,200. If this is what you want, it seems very feasible. In fact, if I comment the y=calculations(test,q), this returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advise is to not use it until you already have your code executing reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible, compared to the gains you get by the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange

Performance of multiple iterations

Wondering about the performance impact of doing one iteration vs many iterations. I work in Python -- I'm not sure if that affects the answer or not.
Consider trying to perform a series of data transformations to every item in a list.
def one_pass(my_list):
for i in xrange(0, len(my_list)):
my_list[i] = first_transformation(my_list[i])
my_list[i] = second_transformation(my_list[i])
my_list[i] = third_transformation(my_list[i])
return my_list
def multi_pass(my_list):
range_end = len(my_list)
for i in xrange(0, range_end):
my_list[i] = first_transformation(my_list[i])
for i in xrange(0, range_end):
my_list[i] = second_transformation(my_list[i])
for i in xrange(0, range_end):
my_list[i] = third_transformation(my_list[i])
return my_list
Now, apart from issues with readability, strictly in performance terms, is there a real advantage to one_pass over multi_pass? Assuming most of the work happens in the transformation functions themselves, wouldn't each iteration in multi_pass only take roughly 1/3 as long?

The difference will be how often the values and code you're reading are in the CPU's cache.
If the elements of my_list are large, but fit into the CPU cache, the first version may be beneficial. On the other hand, if the (byte)code of the transformations is large, caching the operations may be better than caching the data.
Both versions are probably slower than the way more readable:
def simple(my_list):
return [third_transformation(second_transformation(first_transformation(e)))
for e in my_list]
Timing it yields:
one_pass: 0.839533090591
multi_pass: 0.840938806534
simple: 0.569097995758

Assuming you're considering a program that can easily be one loop with multiple operations, or multiple loops doing one operation each, then it never changes the computational complexity (e.g. an O(n) algorithm is still O(n) either way).
One advantage of the single-pass approach are that you save on the "book-keeping" of the looping. Whether the iteration mechanism is incrementing and comparing counters, or retrieving "next" pointers and checking for null, or whatever, you do it less when you do everything in one pass. Assuming that your operations do any significant amount of work at all (and that your looping mechanism is simple and straightforward, not looping over an expensive generator or something), then this "book-keeping" work will be dwarfed by the actual work of your operations, which makes this definitely a micro-optimisation that you shouldn't be doing unless you know your program is too slow and you've exhausted all more significant available optimisations.
Another advantage can be that applying all your operations to each element of the iteration before you move on to the next one tends to benefit better from the CPU cache, since each item could still be in the cache in subsequent operations on the same item, whereas using multiple passes makes that almost impossible (unless your entire collection fits in the cache). Python has so much indirection via dictionaries going on though that it's probably not hard for each individual operation to overflow the cache by reading hash buckets scattered all over the memory space. So this is still a micro-optimisation, but this analysis gives it more of a chance (though still no certainty) of making a significant difference.
One advantage of multi-pass can be that if you need to keep state between loop iterations, the single-pass approach forces you to keep the state of all operations around. This can hurt the CPU cache (maybe the state of each operation individually fits in the cache for an entire pass, but not the state of all the operations put together). In the extreme case this effect could theoretically make the difference between the program fitting in memory and not (I have encountered this once in a program that was chewing through very large quantities of data). But in the extreme cases you know that you need to split things up, and the non-extreme cases are again micro-optimisations that are not worth making in advance.
So performance generally favours single-pass by an insignificant amount, but can in some cases favour either single-pass or multi-pass by a significant amount. The conclusion you can draw from this is the same as the general advice applying to all programming: start by writing code in whatever way is most clear and maintainable and still adequately addresses your program. Only once you've got a mostly finished program and if it turns out to be "not fast enough", then measure the performance effects of the various parts of your code to find out where it's worth spending your time.
Time spent worrying about whether to write single-pass or multi-pass algorithms for performance reasons will almost always turn out to have been wasted. So unless you have unlimited development time available to you, you will get the "best" results from your total development effort (where best includes performance) by not worrying about this up-front, and addressing it on an as-needed basis.

Personally, I would no doubt prefer the one_pass option. It definitely performs better. You may be right that the difference wouldn't be huge. Python has optimized the xrange iterator really well, but you are still doing 3 times more iterations than needed.

You may get decreased cached misses in either version compared to the other. It depends on what those transform functions actually do.
If those functions have a lot of code and operate on different sets of data (besides the input and output), multipass may be better. Otherwise the single pass is likely to be better because the current list element will likely remain cached and the loop operations are only done once instead of three times.
This is a case were comparing actual run times would be very useful.

Will multiprocessing be a good solution for this operation?

while True:
Number = len(SomeList)
OtherList = array([None]*Number)
for i in xrange(Number):
OtherList[i] = (Numpy Array Calculation only using i_th element of arrays, Array_1, Array_2, and Array_3.)
'Number' number of elements in OtherList and other arrays can be calculated seperately.
However, as the program is time-dependent, we cannot proceed further job until every 'Number' number of elements are processed.
Will multiprocessing be a good solution for this operation?
I should to speed up this process maximally.
If it is better, please suggest the code please.

It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginners guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing what are specific calculations or sizes of the arrays here're generic guidelines in no particular order:
measure performance of your code. Find hot-spots. Your code might load input data longer than all calculations. Set your goal, define what trade-offs are acceptable
check with automated tests that you get expected results
check whether you could use optimized libraries to solve your problem
make sure algorithm has adequate time complexity. O(n) algorithm in pure Python can be faster than O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution.
rewrite places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup worth it to keep C extensions.
minimize allocation and data copying. Make it cache friendly.
explore whether multiple threads might be useful in your case e.g., cython.parallel.prange(). Release GIL.
Compare with multiprocessing approach. The link above contains an example how to compute different slices of an array in parallel.
Iterate

Since you have a while True clause there I will assume you will run a lot if iterations so the potential gains will eventually outweigh the slowdown from the spawning of the multiprocessing pool. I will also assume you have more than one logical core on your machine for obvious reasons. Then the question becomes if the cost of serializing the inputs and de-serializing the result is offset by the gains.
Best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass on any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it on as the args when calling Process(). This way you reduce the amount of data that needs to be picked and passed on via IPC (which is what multiprocessing does)
You use a work queue and add to it tasks as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.

Memory efficiency: One large dictionary or a dictionary of smaller dictionaries?

I'm writing an application in Python (2.6) that requires me to use a dictionary as a data store.
I am curious as to whether or not it is more memory efficient to have one large dictionary, or to break that down into many (much) smaller dictionaries, then have an "index" dictionary that contains a reference to all the smaller dictionaries.
I know there is a lot of overhead in general with lists and dictionaries. I read somewhere that python internally allocates enough space that the dictionary/list # of items to the power of 2.
I'm new enough to python that I'm not sure if there are other unexpected internal complexities/suprises like that, that is not apparent to the average user that I should take into consideration.
One of the difficulties is knowing how the power of 2 system counts "items"? Is each key:pair counted as 1 item? That's seems important to know because if you have a 100 item monolithic dictionary then space 100^2 items would be allocated. If you have 100 single item dictionaries (1 key:pair) then each dictionary would only be allocation 1^2 (aka no extra allocation)?
Any clearly laid out information would be very helpful!

Three suggestions:
Use one dictionary.
It's easier, it's more straightforward, and someone else has already optimized this problem for you. Until you've actually measured your code and traced a performance problem to this part of it, you have no reason not to do the simple, straightforward thing.
Optimize later.
If you are really worried about performance, then abstract the problem make a class to wrap whatever lookup mechanism you end up using and write your code to use this class. You can change the implementation later if you find you need some other data structure for greater performance.
Read up on hash tables.
Dictionaries are hash tables, and if you are worried about their time or space overhead, you should read up on how they're implemented. This is basic computer science. The short of it is that hash tables are:
average case O(1) lookup time
O(n) space (Expect about 2n, depending on various parameters)
I do not know where you read that they were O(n^2) space, but if they were, then they would not be in widespread, practical use as they are in most languages today. There are two advantages to these nice properties of hash tables:
O(1) lookup time implies that you will not pay a cost in lookup time for having a larger dictionary, as lookup time doesn't depend on size.
O(n) space implies that you don't gain much of anything from breaking your dictionary up into smaller pieces. Space scales linearly with number of elements, so lots of small dictionaries will not take up significantly less space than one large one or vice versa. This would not be true if they were O(n^2) space, but lucky for you, they're not.
Here are some more resources that might help:
The Wikipedia article on Hash Tables gives a great listing of the various lookup and allocation schemes used in hashtables.
The GNU Scheme documentation has a nice discussion of how much space you can expect hashtables to take up, including a formal discussion of why "the amount of space used by the hash table is proportional to the number of associations in the table". This might interest you.
Here are some things you might consider if you find you actually need to optimize your dictionary implementation:
Here is the C source code for Python's dictionaries, in case you want ALL the details. There's copious documentation in here:
dictobject.h
dictobject.c
Here is a python implementation of that, in case you don't like reading C.
(Thanks to Ben Peterson)
The Java Hashtable class docs talk a bit about how load factors work, and how they affect the space your hash takes up. Note there's a tradeoff between your load factor and how frequently you need to rehash. Rehashes can be costly.

If you're using Python, you really shouldn't be worrying about this sort of thing in the first place. Just build your data structure the way it best suits your needs, not the computer's.
This smacks of premature optimization, not performance improvement. Profile your code if something is actually bottlenecking, but until then, just let Python do what it does and focus on the actual programming task, and not the underlying mechanics.

"Simple" is generally better than "clever", especially if you have no tested reason to go beyond "simple". And anyway "Memory efficient" is an ambiguous term, and there are tradeoffs, when you consider persisting, serializing, cacheing, swapping, and a whole bunch of other stuff that someone else has already thought through so that in most cases you don't need to.
Think "Simplest way to handle it properly" optimize much later.

Premature optimization bla bla, don't do it bla bla.
I think you're mistaken about the power of two extra allocation does. I think its just a multiplier of two. x*2, not x^2.
I've seen this question a few times on various python mailing lists.
With regards to memory, here's a paraphrased version of one such discussion (the post in question wanted to store hundreds of millions integers):
A set() is more space efficient than a dict(), if you just want to test for membership
gmpy has a bitvector type class for storing dense sets of integers
Dicts are kept between 50% and 30% empty, and an entry is about ~12 bytes (though the true amount will vary by platform a bit).
So, the fewer objects you have, the less memory you're going to be using, and the fewer lookups you're going to do (since you'll have to lookup in the index, then a second lookup in the actual value).
Like others, said, profile to see your bottlenecks. Keeping an membership set() and value dict() might be faster, but you'll be using more memory.
I'd also suggest reposting this to a python specific list, such as comp.lang.python, which is full of much more knowledgeable people than myself who would give you all sorts of useful information.

If your dictionary is so big that it does not fit into memory, you might want to have a look at ZODB, a very mature object database for Python.
The 'root' of the db has the same interface as a dictionary, and you don't need to load the whole data structure into memory at once e.g. you can iterate over only a portion of the structure by providing start and end keys.
It also provides transactions and versioning.

Honestly, you won't be able to tell the difference either way, in terms of either performance or memory usage. Unless you're dealing with tens of millions of items or more, the performance or memory impact is just noise.
From the way you worded your second sentence, it sounds like the one big dictionary is your first inclination, and matches more closely with the problem you're trying to solve. If that's true, go with that. What you'll find about Python is that the solutions that everyone considers 'right' nearly always turn out to be those that are as clear and simple as possible.

Often times, dictionaries of dictionaries are useful for other than performance reasons. ie, they allow you to store context information about the data without having extra fields on the objects themselves, and make querying subsets of the data faster.
In terms of memory usage, it would stand to reason that one large dictionary will use less ram than multiple smaller ones. Remember, if you're nesting dictionaries, each additional layer of nesting will roughly double the number of dictionaries you need to allocate.
In terms of query speed, multiple dicts will take longer due to the increased number of lookups required.
So I think the only way to answer this question is for you to profile your own code. However, my suggestion is to use the method that makes your code the cleanest and easiest to maintain. Of all the features of Python, dictionaries are probably the most heavily tweaked for optimal performance.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.