I am brand-new to Python, having decided to make the jump from Matlab. I have tried to find the answer to my question for days but without success!
The problem: I have a bunch of objects with certain attributes. Note that I am not talking about objects and attributes in the programming sense of the word - I am talking about literal astronomical objects for which I have different types of numerical data and physical attributes.
In a loop in my script, I go through each source/object in my catalogue, do some calculations, and stick the results in a huge dictionary. The form of the script is like this:
for i in range(len(ObjectCatalogue)):
    calculate quantity1 for source i
    calculate quantity2 for source i
    determine attribute1 for source i
    sourceDataDict[i].update({'spectrum': quantity1})
    sourceDataDict[i].update({'peakflux': quantity2})
    sourceDataDict[i].update({'morphology': attribute1})
So once I have gone through a hundred sources or so, I can, say, access the spectrum for object no. 20 with spectrumSource20 = sourceDataDict[20]['spectrum'] etc.
What I want to do is select all objects in the dictionary based on the value of a keyword, say 'morphology'. Suppose 'morphology' can take on the values 'simple' or 'complex'. Is there any way I can do this without resorting to a loop? I.e., could I create a new dictionary that contains all the sources that take the 'complex' value for the 'morphology' keyword?
Hard to explain, but using logical indexing that I am used to from Matlab, it would look something like
complexSourceDataDict = sourceDataDict[*]['morphology'=='complex']
(where * indicates all objects in the dictionary)
Anyway - any help would be greatly appreciated!
Without a loop, no. With a list comprehension, yes:
complex = [src for src in sourceDataDict.itervalues() if src.get('morphology') == 'complex']
If sourceDataDict happens to really be a list, you can drop the itervalues:
complex = [src for src in sourceDataDict if src.get('morphology') == 'complex']
If you think about it, evaluating a * would imply a loop operation under the hood anyways (assuming it were valid syntax). So your trick is to do the most efficient looping you can with the data structure you are using.
The only way to get more efficient would be to index all of the data objects' "morphology" values ahead of time and keep that index up to date.
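Since the question asked for a new dictionary rather than a list, here is a minimal sketch of the same idea as a dict comprehension. It is written for Python 3 (so items() instead of itervalues()), and the sample data and the helper name select_by are made up for illustration, not taken from the question.

```python
# Hypothetical sample data standing in for the asker's catalogue.
source_data = {
    0: {"spectrum": 1, "peakflux": 10, "morphology": "simple"},
    1: {"spectrum": 2, "peakflux": 11, "morphology": "complex"},
    2: {"spectrum": 3, "peakflux": 12, "morphology": "simple"},
    3: {"spectrum": 4, "peakflux": 13, "morphology": "complex"},
}


def select_by(records, key, wanted):
    """Return a new dict holding only the records whose `key` equals `wanted`."""
    return {
        idx: rec
        for idx, rec in records.items()
        if rec.get(key) == wanted  # .get() tolerates records without the key
    }


complex_sources = select_by(source_data, "morphology", "complex")
```

The new dictionary keeps the original source numbers as keys and refers to the same inner dicts, so nothing is copied.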
There's not a direct way to index nested dictionaries out of order, like your desired syntax wants to do. However, there are a few ways to do it in Python, with varying interfaces and performance characteristics.
The best performing solution would probably be to create an additional dictionary which indexes by whatever characteristic you care about. For instance, to find the entries whose 'morphology' value is 'complex', you'd do something like this:
from collections import defaultdict
# set up morphology dict (you could do this as part of generating the morphology)
morph_dict = defaultdict(list)
for data in sourceDataDict.values():
    morph_dict[data["morphology"]].append(data)
# later, you can access a list of the values with any particular morphology
complex_morph = morph_dict["complex"]
While this is high-performance, it may be annoying to need to set up the reverse indexes for everything ahead of time. An alternative might be to use a list comprehension or generator expression to iterate over your dictionary and find the appropriate values:
complex = (d for d in sourceDataDict.values() if d["morphology"] == "complex")
for c in complex:
    do_whatever(c)
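If you go the reverse-index route, the index has to be updated whenever a record is added or reclassified. A rough sketch of one way to do that follows; the SourceCatalogue class and its method names are hypothetical, and it assumes every record carries the indexed key.

```python
from collections import defaultdict


class SourceCatalogue:
    """Hypothetical container: records plus a reverse index on one attribute."""

    def __init__(self, indexed_key="morphology"):
        self.indexed_key = indexed_key
        self.records = {}               # source id -> record dict
        self.index = defaultdict(set)   # attribute value -> set of source ids

    def add(self, source_id, record):
        # Drop any stale index entry before overwriting an existing record.
        old = self.records.get(source_id)
        if old is not None:
            self.index[old[self.indexed_key]].discard(source_id)
        self.records[source_id] = record
        self.index[record[self.indexed_key]].add(source_id)

    def with_value(self, value):
        # Sorted only so that the result order is stable.
        return [self.records[i] for i in sorted(self.index.get(value, ()))]


cat = SourceCatalogue()
cat.add(0, {"peakflux": 10, "morphology": "simple"})
cat.add(1, {"peakflux": 11, "morphology": "complex"})
cat.add(0, {"peakflux": 10, "morphology": "complex"})  # reclassify source 0
```

Lookups by morphology then cost a dictionary access plus the size of the result, at the price of a little bookkeeping on every insert.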
I believe you are dealing with a structure similar to the following
sourceDataDict = [
{'spectrum':1,
'peakflux':10,
'morphology':'simple'
},
{'spectrum':2,
'peakflux':11,
'morphology':'comlex'
},
{'spectrum':3,
'peakflux':12,
'morphology':'simple'
},
{'spectrum':4,
'peakflux':13,
'morphology':'complex'
}
]
You can do something similar using a list comprehension:
>>> [e for e in sourceDataDict if e.get('morphology',None) == 'complex']
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
Using itertools.ifilter, you can achieve a similar result
>>> list(itertools.ifilter(lambda e:e.get('morphology',None) == 'complex', sourceDataDict))
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
Please note, the use of get instead of indexing is to ensure that the code won't fail even when the key 'morphology' does not exist. If the key is guaranteed to exist, you can rewrite the above as
>>> [e for e in sourceDataDict if e['morphology'] == 'complex']
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
>>> list(itertools.ifilter(lambda e:e['morphology'] == 'complex', sourceDataDict))
[{'morphology': 'complex', 'peakflux': 13, 'spectrum': 4}]
When working with a huge amount of data, you may want to store it somewhere, so consider some sort of database (an RDBMS may be a solution), perhaps with an ORM, though the latter is a matter of taste.
As for raw Python, there is no built-in solution apart from functional routines like filter. Either way you face iteration at some step (implicitly or not).
The simplest way is to keep an additional dict keyed by attribute value:
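objectsBy = {}
objectsBy['morphology'] = {'complex': set(), 'simple': set()}
for item in sources:
    ...
    objMorphology = compute_morphology(item)
    # sets have no +=; add() inserts the item (which must be hashable, e.g. a source id)
    objectsBy['morphology'][objMorphology].add(item)
    ...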
EDIT: as @BrenBarn pointed out, the original didn't make sense.
Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.
# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]
# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)
# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}
# iv. something else entirely
In the spirit of the Zen of Python what's the "obvious way to do it"?
Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.
Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:
filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()
Or in Python 3, assuming you want a list rather than a dictionary view:
filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())
These are both a bit clunky. For readability, consider breaking it into two lines:
dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
The alternative is an ordered tuple of tuples:
filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:
>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
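If you also care about keeping the dicts in their original order (first occurrence wins), the frozenset key can be combined with an explicit loop. This is only a sketch for Python 3; dedupe_dicts is a name I made up, and it assumes all values are hashable, which holds for the str values that csv.DictReader produces.

```python
def dedupe_dicts(dicts):
    """Drop duplicate dicts, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for d in dicts:
        key = frozenset(d.items())  # order-independent and hashable
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


pile_o_dicts = [
    {"name": "a", "value": "1"},
    {"value": "1", "name": "a"},  # same content, different insertion order
    {"name": "b", "value": "2"},
]
filtered = dedupe_dicts(pile_o_dicts)
```

Unlike the repr approach, the second dict is recognised as a duplicate even though its keys were inserted in a different order.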
The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:
groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)
This will group identical dictionaries together.
FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)
If the dicts all have the same keys, you can use a namedtuple
>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)
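A slightly fuller sketch of the namedtuple idea, with made-up data. It assumes every dict has exactly the same keys and that the keys are valid Python identifiers; the name Row is arbitrary.

```python
from collections import namedtuple

pile_o_dicts = [
    {"name": "a", "value": "1"},
    {"value": "1", "name": "a"},  # duplicate with keys in another order
    {"name": "b", "value": "2"},
]

# Field names come from the keys of the first dict; sorting just fixes their order.
Row = namedtuple("Row", sorted(pile_o_dicts[0]))

# Keyword expansion matches fields by name, so key order in each dict is irrelevant.
unique_rows = {Row(**d) for d in pile_o_dicts}

# Convert back to plain dicts if needed (sorted to get a stable order).
unique_dicts = [row._asdict() for row in sorted(unique_rows)]
```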
Let's say I have a function f() that takes a list and returns a mutation of that list. If I want to apply that function to five member variables in my class instance (i), I can do this:
for x in [i.a, i.b, i.c, i.d, i.e]:
    x[:] = f(x)
1) Is there a more elegant way? I don't want f() to modify the passed list.
2) If my variables hold a simple integer (which won't work with the slice notation), is there also a way? (f() would also take & return an integer in this case)
Another solution, though it's probably not elegant:
for x in ['a', 'b', 'c', 'd', 'e']:
    setattr(i, x, f(getattr(i, x)))
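To make that concrete, here is a small self-contained sketch. The Holder class and the double function are stand-ins I invented for the asker's class and f(); the point is that the getattr/setattr loop works the same way for list and integer attributes.

```python
class Holder:
    """Hypothetical stand-in for the asker's class instance `i`."""

    def __init__(self):
        self.a = [1, 2]
        self.b = [3]
        self.count = 5


def double(value):
    # Returns a new object; works for lists (repetition) and ints (multiplication).
    return value * 2


i = Holder()
original_a = i.a  # keep a second reference to the old list

for name in ("a", "b", "count"):
    setattr(i, name, double(getattr(i, name)))
```

Note that this rebinds each attribute to a new object, so other references to the old lists (original_a here) do not see the change, unlike the slice-assignment version.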
Python doesn't have pass by reference. The best you can do is write a function which constructs a new list and assign the result of the function to the original list. Example:
def double_list(x):
    return [item * 2 for item in x]

nums = [1, 2, 3, 4]
nums = double_list(nums) # now [2, 4, 6, 8]
Or better yet:
nums = map(lambda x: x * 2, nums)
Super simple example, but you get the idea. If you want to change a list from a function you'll have to return the altered list and assign that to the original.
You might be able to hack up a solution, but it's best just to do it the normal way.
EDIT
It occurs to me that I don't actually know what you're trying to do, specifically. Perhaps if you were to clarify your specific task we could come up with a solution that Python will permit?
Ultimately, what you want to do is incompatible with the way that Python is structured. You have the most elegant way to do it already in the case that your variables are lists but this is not possible with numbers.
This is because variables do not exist in Python. References do. So i.x is not a list, it is a reference to a list. Likewise, if it references a number. So if i.x references y, then i.x = z doesn't actually change the value y, it changes the location in memory that i.x points to.
Most of the time, variables are viewed as boxes that hold a value. The name is on the box. In python, values are fundamental and "variables" are just tags that get hung on a particular value. It's very nice once you get used to it.
In the case of a list, you can use slice assignment, as you are already doing. This will allow all references to the list to see the changes because you are changing the list object itself. In the case of a number, there is no way to do that because numbers are immutable objects in Python. This makes sense. Five is five and there's not much that you can do to change it. If you know or can determine the name of the attribute, then you can use setattr to modify it but this will not change other references that might already exist.
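A short illustration of the difference between mutating an object and rebinding a name (plain Python 3, variable names chosen just for this example):

```python
nums = [1, 2, 3]
alias = nums                       # second reference to the same list object

nums[:] = [n * 2 for n in nums]    # slice assignment mutates the object in place
seen_through_alias = list(alias)   # the alias sees [2, 4, 6]

nums = [n + 1 for n in nums]       # plain assignment only rebinds the name `nums`
alias_after_rebind = list(alias)   # the alias still refers to the old list

count = 5
other = count
count += 1                         # ints are immutable, so this rebinds `count`
```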
As Rafe Kettler says, if you can be more specific about what you actually want to do, then we can come up with a simple elegant way to do it.
I have a situation in Python(cough, homework) where I need to multiply EACH ELEMENT in a given list of objects a specified number of times and return the output of the elements. The problem is that the sample inputs given are of different types. For example, one case may input a list of strings whose elements I need to multiply while the others may be ints. So my return type needs to vary. I would like to do this without having to test what every type of object is. Is there a way to do this? I know in C# i could just use "var" but I don't know if such a thing exists in Python?
I realize that variables don't have to be declared, but in this case I can't see any way around it. Here's the function I made:
def multiplyItemsByFour(argsList):
    output = ????
    for arg in argsList:
        output += arg * 4
    return output
See how I need to add to the output variable. If I just try to take away the output assignment on the first line, I get an error that the variable was not defined. But if I assign it a 0 or a "" for an empty string, an exception could be thrown since you can't add 3 to a string or "a" to an integer, etc...
Here are some sample inputs and outputs:
Input: ('a','b') Output: 'aaaabbbb'
Input: (2,3,4) Output: 36
Thanks!
def fivetimes(anylist):
    return anylist * 5
As you see, if you're given a list argument, there's no need for any assignment whatsoever in order to "multiply it a given number of times and return the output". You talk about a given list; how is it given to you, if not (the most natural way) as an argument to your function? Not that it matters much -- if it's a global variable, a property of the object that's your argument, and so forth, this still doesn't necessitate any assignment.
If you were "homeworkically" forbidden from using the * operator of lists, and just required to implement it yourself, this would require assignment, but no declaration:
def multiply_the_hard_way(inputlist, multiplier):
    outputlist = []
    for i in range(multiplier):
        outputlist.extend(inputlist)
    return outputlist
You can simply make the empty list "magically appear": there's no need to "declare" it as being anything whatsoever, it's an empty list and the Python compiler knows it as well as you or any reader of your code does. Binding it to the name outputlist doesn't require you to perform any special ritual either, just the binding (aka assignment) itself: names don't have types, only objects have types... that's Python!-)
Edit: OP now says output must not be a list, but rather int, float, or maybe string, and he is given no indication of what. I've asked for clarification -- multiplying a list ALWAYS returns a list, so clearly he must mean something different from what he originally said, that he had to multiply a list. Meanwhile, here's another attempt at mind-reading. Perhaps he must return a list where EACH ITEM of the input list is multiplied by the same factor (whether that item is an int, float, string, list, ...). Well then:
def multiply_each_item(somelist, multiplier):
    return [item * multiplier for item in somelist]
Look ma, no hands^H^H^H^H^H assignment. (This is known as a "list comprehension", btw).
Or maybe (unlikely, but my mind-reading hat may be suffering interference from my tinfoil hat, will need to go to the mad hatter's shop to have them tuned) he needs to (say) multiply each list item as if they were the same type as the first item, but return them as their original type, so that for example
>>> mystic(['zap', 1, 23, 'goo'], 2)
['zapzap', 11, 2323, 'googoo']
>>> mystic([23, '12', 15, 2.5], 2)
[46, '24', 30, 4.0]
Even this highly-mystical spec COULD be accommodated...:
>>> def mystic(alist, mul):
...     multyp = type(alist[0])
...     return [type(x)(mul*multyp(x)) for x in alist]
...
...though I very much doubt it's the spec actually encoded in the mysterious runes of that homework assignment. Just about ANY precise spec can be either implemented or proven to be likely impossible as stated (by requiring you to solve the Halting Problem or demanding that P==NP, say;-). That may take some work ("prove the 4-color theorem", for example;-)... but still less than it takes to magically divine what the actual spec IS, from a collection of mutually contradictory observations, no examples, etc. Though in our daily work as software developer (ah for the good old times when all we had to face was homework!-) we DO meet a lot of such cases of course (and have to solve them to earn our daily bread;-).
EditEdit: finally seeing a precise spec I point out I already implemented that one, anyway, here it goes again:
def multiplyItemsByFour(argsList):
    return [item * 4 for item in argsList]
EditEditEdit: finally/finally seeing a MORE precise spec, with (luxury!-) examples:
Input: ('a','b') Output: 'aaaabbbb' Input: (2,3,4) Output: 36
So then what's wanted is the summation (and you can't use sum as it wouldn't work on strings) of the items in the input list, each multiplied by four. My preferred solution:
def theFinalAndTrulyRealProblemAsPosed(argsList):
    items = iter(argsList)
    output = next(items, []) * 4
    for item in items:
        output += item * 4
    return output
If you're forbidden from using some of these constructs, such as the built-ins next and iter, there are many other possibilities (slightly inferior ones) such as:
def theFinalAndTrulyRealProblemAsPosed(argsList):
    if not argsList: return None
    output = argsList[0] * 4
    for item in argsList[1:]:
        output += item * 4
    return output
For an empty argsList, the first version returns [], the second one returns None -- not sure what you're supposed to do in that corner case anyway.
Very easy in Python. You need to get the type of the data in your list: use the type() function on the first item, type(argsList[0]). Then to initialize output (where you now have ????) you need the 'zero' or null value for that type. Just as int() or float() or str() returns the zero or null value for its type, so too will type(argsList[0])() return the zero or null value for whatever type you have in your list.
So, here is your function with one minor modification:
def multiplyItemsByFour(argsList):
    output = type(argsList[0])()
    for arg in argsList:
        output += arg * 4
    return output
Works with:
argsList = [1, 2, 3, 4] or [1.0, 2.0, 3.0, 4.0] or "abcdef", etc.
Are you sure this is for Python beginners? To me, the cleanest way to do this is with reduce() and lambda, both of which are not typical beginner tools, and sometimes discouraged even for experienced Python programmers:
def multiplyItemsByFour(argsList):
    if not argsList:
        return None
    newItems = [item * 4 for item in argsList]
    return reduce(lambda x, y: x + y, newItems)
Like Alex Martelli, I've thrown in a quick test for an empty list at the beginning which returns None. Note that if you are using Python 3, you must import reduce() from functools (from functools import reduce) before you can use it.
Essentially, the reduce(lambda...) solution is very similar to the other suggestions to set up an accumulator using the first input item, and then processing the rest of the input items; but is simply more concise.
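For reference, a Python 3 rendering of the same idea might look like the sketch below. It swaps the lambda for operator.add and feeds reduce a generator expression; the snake_case function name is mine, and returning None for empty input simply mirrors the choice made above.

```python
from functools import reduce
import operator


def multiply_items_by_four(args):
    """Add up (or concatenate) every item after multiplying it by four."""
    if not args:
        return None  # same empty-input convention as the version above
    # With no initial value, reduce starts from the first generated item,
    # so the result has whatever type the items themselves produce.
    return reduce(operator.add, (item * 4 for item in args))


text_result = multiply_items_by_four(("a", "b"))
number_result = multiply_items_by_four((2, 3, 4))
```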
My guess is that the purpose of your homework is to expose you to "duck typing". The basic idea is that you don't worry about the types too much, you just worry about whether the behaviors work correctly. A classic example:
def add_two(a, b):
    return a + b
print add_two(1, 2) # prints 3
print add_two("foo", "bar") # prints "foobar"
print add_two([0, 1, 2], [3, 4, 5]) # prints [0, 1, 2, 3, 4, 5]
Notice that when you def a function in Python, you don't declare a return type anywhere. It is perfectly okay for the same function to return different types based on its arguments. It's considered a virtue, even; consider that in Python we only need one definition of add_two() and we can add integers, add floats, concatenate strings, and join lists with it. Statically typed languages would require multiple implementations, unless they had an escape such as variant, but Python is dynamically typed. (Python is strongly typed, but dynamically typed. Some will tell you Python is weakly typed, but it isn't. In a weakly typed language such as JavaScript, the expression 1 + "1" silently converts the number to a string and gives you "11"; in Python this expression just raises a TypeError exception.)
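A quick way to see both halves of that claim in Python 3 (print is a function there, so the results are collected in variables instead; the variable names are only for this sketch):

```python
def add_two(a, b):
    return a + b


# Duck typing: one definition serves ints, strings and lists.
int_sum = add_two(1, 2)
str_sum = add_two("foo", "bar")
list_sum = add_two([0, 1, 2], [3, 4, 5])

# Strong typing: mixing an int and a str is an error, not a silent conversion.
try:
    add_two(1, "1")
    mixed_outcome = "no error"
except TypeError:
    mixed_outcome = "TypeError"
```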
It is considered very poor style to try to test the arguments to figure out their types, and then do things based on the types. If you need to make your code robust, you can always use a try block:
def safe_add_two(a, b):
    try:
        return a + b
    except TypeError:
        return None
See also the Wikipedia page on duck typing.
Python is dynamically typed, you don't need to declare the type of a variable, because a variable doesn't have a type, only values do. (Any variable can store any value, a value never changes its type during its lifetime.)
def do_something(x):
    return x * 5
This will work for any x you pass to it, the actual result depending on what type the value in x has. If x contains a number it will just do regular multiplication, if it contains a string the string will be repeated five times in a row, for lists and such it will repeat the list five times, and so on. For custom types (classes) it depends on whether the class has an operation defined for the multiplication operator.
You don't need to declare variable types in python; a variable has the type of whatever's assigned to it.
EDIT:
To solve the re-stated problem, try this:
def multiplyItemsByFour(argsList):
    output = argsList.pop(0) * 4
    for arg in argsList:
        output += arg * 4
    return output
(This is probably not the most pythonic way of doing this, but it should at least start off your output variable as the right type, assuming the whole list is of the same type)
You gave these sample inputs and outputs:
Input: ('a','b') Output: 'aaaabbbb' Input: (2,3,4) Output: 36
I don't want to write the solution to your homework for you, but I do want to steer you in the correct direction. But I'm still not sure I understand what your problem is, because the problem as I understand it seems a bit difficult for an intro to Python class.
The most straightforward way to solve this requires that the arguments be passed in a list. Then, you can look at the first item in the list, and work from that. Here is a function that requires the caller to pass in a list of two items:
def handle_list_of_len_2(lst):
return lst[0] * 4 + lst[1] * 4
Now, how can we make this extend past two items? Well, in your sample code you weren't sure what to assign to your variable output. How about assigning lst[0]? Then it always has the correct type. Then you could loop over all the other elements in lst and accumulate to your output variable using += as you wrote. If you don't know how to loop over a list of items but skip the first thing in the list, Google search for "python list slice".
Now, how can we make this not require the user to pack up everything into a list, but just call the function? What we really want is some way to accept whatever arguments the user wants to pass to the function, and make a list out of them. Perhaps there is special syntax for declaring a function where you tell Python you just want the arguments bundled up into a list. You might check a good tutorial and see what it says about how to define a function.
Now that we have covered (very generally) how to accumulate an answer using +=, let's consider other ways to accumulate an answer. If you know how to use a list comprehension, you could use one of those to return a new list based on the argument list, with the multiply performed on each argument; you could then somehow reduce the list down to a single item and return it. Python 2.3 and newer have a built-in function called sum() and you might want to read up on that. [EDIT: Oh drat, sum() only works on numbers. See note added at end.]
I hope this helps. If you are still very confused, I suggest you contact your teacher and ask for clarification. Good luck.
P.S. Python 2.x has a built-in function called reduce() and it is possible to implement sum() using reduce(). However, the creator of Python thinks it is better to just use sum() and in fact he removed reduce() from Python 3.0 (well, he moved it into a module called functools).
P.P.S. If you get the list comprehension working, here's one more thing to think about. If you use a list comprehension and then pass the result to sum(), you build a list to be used once and then discarded. Wouldn't it be neat if we could get the result, but instead of building the whole list and then discarding it we could just have the sum() function consume the list items as fast as they are generated? You might want to read this: Generator Expressions vs. List Comprehension
EDIT: Oh drat, I assumed that Python's sum() builtin would use duck typing. Actually it is documented to work on numbers, only. I'm disappointed! I'll have to search and see if there were any discussions about that, and see why they did it the way they did; they probably had good reasons. Meanwhile, you might as well use your += solution. Sorry about that.
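To illustrate the points about sum(), here is a small Python 3 sketch with made-up inputs. It shows the list-comprehension and generator-expression forms side by side, and what happens when sum() is handed strings.

```python
numbers = (2, 3, 4)

# Builds a temporary list, then sums it.
total_from_list = sum([n * 4 for n in numbers])

# Generator expression: items are consumed as they are produced, no list is built.
total_from_gen = sum(n * 4 for n in numbers)

# sum() explicitly refuses to concatenate strings, even with a str start value.
try:
    sum(["aaaa", "bbbb"], "")
    string_outcome = "ok"
except TypeError:
    string_outcome = "TypeError"

# For strings alone, str.join does the job (but then ints no longer work).
joined = "".join(s * 4 for s in ("a", "b"))
```

So sum() covers the numeric sample and join() covers the string sample, but neither handles both, which is why the += loop remains the type-agnostic option.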
EDIT: Okay, reading through other answers, I now notice two ways suggested for peeling off the first element in the list.
For simplicity, because you seem like a Python beginner, I suggested simply using output = lst[0] and then using list slicing to skip past the first item in the list. However, Wooble in his answer suggested using output = lst.pop(0) which is a very clean solution: it gets the zeroth thing on the list, and then you can just loop over the list and you automatically skip the zeroth thing. However, this "mutates" the list! It's better if a function like this does not have "side effects" such as modifying the list passed to it. (Unless the list is a special list made just for that function call, such as a *args list.) Another way would be to use the "list slice" trick to make a copy of the list that has the first item removed. Alex Martelli provided an example of how to make an "iterator" using a Python feature called iter(), and then using iterator to get the "next" thing. Since the iterator hasn't been used yet, the next thing is the zeroth thing in the list. That's not really a beginner solution but it is the most elegant way to do this in Python; you could pass a really huge list to the function, and Alex Martelli's solution will neither mutate the list nor waste memory by making a copy of the list.
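The side-effect difference is easy to demonstrate. The two functions below are sketches of the pop(0) and iter() approaches under hypothetical names; both assume a non-empty list.

```python
def four_x_pop(items):
    output = items.pop(0) * 4   # removes the first element from the caller's list
    for item in items:
        output += item * 4
    return output


def four_x_iter(items):
    it = iter(items)
    output = next(it) * 4       # reads the first element without touching the list
    for item in it:             # continues from the second element
        output += item * 4
    return output


data_a = [2, 3, 4]
data_b = [2, 3, 4]
result_a = four_x_pop(data_a)
result_b = four_x_iter(data_b)
```

Both calls compute the same value, but afterwards data_a has lost its first element while data_b is untouched.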
No need to test the objects, just multiply away!
'this is a string' * 6
14 * 6
[1,2,3] * 6
all just work
Try this:
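def times_four(x):
    return x * 4

def timesfourlist(items):
    nextstep = map(times_four, items)
    return sum(nextstep)
map performs the function passed in on each element of the list (returning a new list in Python 2) and then sum does the += on the results. Note that sum only works on numbers, so this handles the integer inputs but not the string ones.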
If you just want to fill in the blank in your code, you could try setting output = argsList[0].__class__() to give it the zero-equivalent value of that class.
>>> def multiplyItemsByFour(argsList):
...     output = argsList[0].__class__()
...     for arg in argsList:
...         output += arg * 4
...     return output
...
>>> multiplyItemsByFour('ab')
'aaaabbbb'
>>> multiplyItemsByFour((2,3,4))
36
>>> multiplyItemsByFour((2.0,3.3))
21.199999999999999
This will crash if the list is empty, but you can check for that case at the beginning of the function and return whatever you feel appropriate.
Thanks to Alex Martelli, you have the best possible solution:
def theFinalAndTrulyRealProblemAsPosed(argsList):
    items = iter(argsList)
    output = next(items, []) * 4
    for item in items:
        output += item * 4
    return output
This is beautiful and elegant. First we create an iterator with iter(), then we use next() to get the first object in the list. Then we accumulate as we iterate through the rest of the list, and we are done. We never need to know the type of the objects in argsList, and indeed they can be of different types as long as all the types can have operator + applied with them. This is duck typing.
For a moment there last night I was confused and thought that you wanted a function that, instead of taking an explicit list, just took one or more arguments.
def four_x_args(*args):
    return theFinalAndTrulyRealProblemAsPosed(args)
The *args argument to the function tells Python to gather up all arguments to this function and make a tuple out of them; then the tuple is bound to the name args. You can easily make a list out of it, and then you could use the .pop(0) method to get the first item from the list. This costs the memory and time to build the list, which is why the iter() solution is so elegant.
def four_x_args(*args):
    argsList = list(args) # convert from tuple to list
    output = argsList.pop(0) * 4
    for arg in argsList:
        output += arg * 4
    return output
This is just Wooble's solution, rewritten to use *args.
Examples of calling it:
print four_x_args(1) # prints 4
print four_x_args(1, 2) # prints 12
print four_x_args('a') # prints 'aaaa'
print four_x_args('ab', 'c') # prints 'ababababcccc'
Finally, I'm going to be malicious and complain about the solution you accepted. That solution depends on the object's base class having a sensible null or zero, but not all classes have this. int() returns 0, and str() returns '' (null string), so they work. But how about this:
class NaturalNumber(int):
    """
    Exactly like an int, but only values >= 1 are possible.
    """
    def __new__(cls, initial_value=1):
        try:
            n = int(initial_value)
            if n < 1:
                raise ValueError
        except ValueError:
            raise ValueError("NaturalNumber() initial value must be an int() >= 1")
        return super(NaturalNumber, cls).__new__(cls, n)
argList = [NaturalNumber(n) for n in xrange(1, 4)]
print theFinalAndTrulyRealProblemAsPosed(argList) # prints correct answer: 24
print NaturalNumber() # prints 1
print type(argList[0])() # prints 1, same as previous line
print multiplyItemsByFour(argList) # prints 25!
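The same experiment in Python 3 syntax, condensed into a self-contained sketch. The two function names are mine and stand for the "start from type(x)()" and "start from the first item" approaches respectively.

```python
class NaturalNumber(int):
    """Like an int, but only values >= 1 are allowed."""

    def __new__(cls, initial_value=1):
        n = int(initial_value)
        if n < 1:
            raise ValueError("NaturalNumber() initial value must be >= 1")
        return super().__new__(cls, n)


def sum_from_default(items):
    # Starts from type(items[0])(), which is 1 for NaturalNumber, not 0.
    output = type(items[0])()
    for item in items:
        output += item * 4
    return output


def sum_from_first_item(items):
    # Starts from the first item itself, so no "zero" value is needed.
    it = iter(items)
    output = next(it) * 4
    for item in it:
        output += item * 4
    return output


naturals = [NaturalNumber(n) for n in range(1, 4)]
wrong = sum_from_default(naturals)
right = sum_from_first_item(naturals)
```

The default-constructor version is off by exactly the value that NaturalNumber() returns, because that value is not an additive identity.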
Good luck in your studies, and I hope you enjoy Python as much as I do.
I have a problem which requires a reversible 1:1 mapping of keys to values.
That means sometimes I want to find the value given a key, but at other times I want to find the key given the value. Both keys and values are guaranteed unique.
x = D[y]
y == D.inverse[x]
The obvious solution is to simply invert the dictionary every time I want a reverse-lookup: Inverting a dictionary is very easy, there's a recipe here but for a large dictionary it can be very slow.
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
So is there a better structure I can use?
My application requires that this should be very fast and use as little as possible memory.
The structure must be mutable, and it's strongly desirable that mutating the object should not cause it to be slower (e.g. to force a complete re-index)
We can guarantee that either the key or the value (or both) will be an integer
It's likely that the structure will be needed to store thousands or possibly millions of items.
Keys & values are guaranteed to be unique, i.e. len(set(x)) == len(x) for x in [D.keys(), D.values()]
"The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict."
Not really. Have you measured that? Since both dictionaries would use references to the same objects as keys and values, the extra memory spent would be just the second dictionary structure itself. That's a lot less than twice the total, and it does not depend on how large the stored objects are.
What I mean is that the actual data wouldn't be copied. So you'd spend little extra memory.
Example:
a = "some really really big text spending a lot of memory"
number_to_text = {1: a}
text_to_number = {a: 1}
Only a single copy of the "really big" string exists, so you end up spending just a little more memory. That's generally affordable.
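You can check the sharing directly with an identity test. A tiny Python 3 sketch (the string is built at run time here just to make it large; the variable names are arbitrary):

```python
big = "some really really big text spending a lot of memory " * 1000

number_to_text = {1: big}
text_to_number = {big: 1}

# Pull the key object back out of the second dict and compare identities.
stored_key = next(iter(text_to_number))
same_object = stored_key is number_to_text[1]
```

Both dictionaries hold a reference to the one string object; neither stores its own copy of the text.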
I can't imagine a solution where you'd have the key lookup speed when looking by value, if you don't spend at least enough memory to store a reverse lookup hash table (which is exactly what's being done in your "unite two dicts" solution).
class TwoWay:
    def __init__(self):
        self.d = {}
    def add(self, k, v):
        self.d[k] = v
        self.d[v] = k
    def remove(self, k):
        self.d.pop(self.d.pop(k))
    def get(self, k):
        return self.d[k]
"The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely use up twice as much memory as a single dict."
Not really, since they would just be holding two references to the same data. In my mind, this is not a bad solution.
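For what it's worth, a bare-bones version of such a class could look like the sketch below (Python 3). The BiDict name and its interface are invented for this example; it keeps the two dicts in step on every write and exposes the reverse direction as a read-only view so that D.inverse[x] works as in the question.

```python
from types import MappingProxyType


class BiDict:
    """Hypothetical 1:1 mapping built from two plain dicts."""

    def __init__(self):
        self._forward = {}   # key -> value
        self._inverse = {}   # value -> key

    def __setitem__(self, key, value):
        # Remove any existing pair that would break the 1:1 property.
        if key in self._forward:
            del self._inverse[self._forward[key]]
        if value in self._inverse:
            del self._forward[self._inverse[value]]
        self._forward[key] = value
        self._inverse[value] = key

    def __getitem__(self, key):
        return self._forward[key]

    def __delitem__(self, key):
        del self._inverse[self._forward.pop(key)]

    def __len__(self):
        return len(self._forward)

    @property
    def inverse(self):
        # Read-only view, so callers cannot desynchronise the two dicts.
        return MappingProxyType(self._inverse)


d = BiDict()
d["one"] = 1
d["two"] = 2
d["uno"] = 1   # replaces the pair 'one' <-> 1
```

Every operation is a constant number of dict operations, so mutation never forces a re-index.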
Have you considered an in-memory database lookup? I am not sure how it will compare in speed, but lookups in relational databases can be very fast.
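As a rough idea of what that could look like with the standard library's sqlite3 module: the table name, column names, and helper functions below are all made up for the sketch, and I have not benchmarked it against plain dicts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# PRIMARY KEY and UNIQUE each get an index, so lookups in both directions are indexed.
conn.execute("CREATE TABLE mapping (k INTEGER PRIMARY KEY, v TEXT NOT NULL UNIQUE)")
conn.executemany(
    "INSERT INTO mapping (k, v) VALUES (?, ?)",
    [(1, "one"), (2, "two"), (3, "three")],
)


def value_for(key):
    row = conn.execute("SELECT v FROM mapping WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None


def key_for(value):
    row = conn.execute("SELECT k FROM mapping WHERE v = ?", (value,)).fetchone()
    return row[0] if row else None
```

The uniqueness constraints also enforce the 1:1 guarantee for you: inserting a duplicate key or value raises sqlite3.IntegrityError.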
Here is my own solution to this problem: http://github.com/spenthil/pymathmap/blob/master/pymathmap.py
The goal is to make it as transparent to the user as possible. The only significant attribute it introduces is partner.
OneToOneDict subclasses from dict - I know that isn't generally recommended, but I think I have the common use cases covered. The backend is pretty simple: it (dict1) keeps a weakref to a 'partner' OneToOneDict (dict2) which is its inverse. When dict1 is modified, dict2 is updated accordingly, and vice versa.
From the docstring:
>>> dict1 = OneToOneDict()
>>> dict2 = OneToOneDict()
>>> dict1.partner = dict2
>>> assert(dict1 is dict2.partner)
>>> assert(dict2 is dict1.partner)
>>> dict1['one'] = '1'
>>> dict2['2'] = '1'
>>> dict1['one'] = 'wow'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1['one'] = '1'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1.update({'three': '3', 'four': '4'})
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict3 = OneToOneDict({'4':'four'})
>>> assert(dict3.partner is None)
>>> assert(dict3 == {'4':'four'})
>>> dict1.partner = dict3
>>> assert(dict1.partner is not dict2)
>>> assert(dict2.partner is None)
>>> assert(dict1.partner is dict3)
>>> assert(dict3.partner is dict1)
>>> dict1.setdefault('five', '5')
>>> dict1['five']
'5'
>>> dict1.setdefault('five', '0')
>>> dict1['five']
'5'
When I get some free time, I intend to make a version that doesn't store things twice. No clue when that'll be though :)
Assuming that you have a key with which you look up a more complex mutable object, just make the key a property of that object. It does seem you might be better off thinking about the data model a bit.
"We can guarantee that either the key or the value (or both) will be an integer"
That's weirdly written -- "key or the value (or both)" doesn't feel right. Either they're all integers, or they're not all integers.
It sounds like they're all integers.
Or, it sounds like you're thinking of replacing the target object with an integer value so you only have one copy referenced by an integer. This is a false economy. Just keep the target object. All Python objects are -- in effect -- references. Very little actual copying gets done.
Let's pretend that you simply have two integers and can do a lookup on either one of the pair. One way to do this is to use heap queues or the bisect module to maintain ordered lists of integer key-value tuples.
See http://docs.python.org/library/heapq.html#module-heapq
See http://docs.python.org/library/bisect.html#module-bisect
You keep one list of (key, value) tuples or, if your underlying object is more complex, (key, object) tuples.
You keep another list of (value, key) tuples or, if your underlying object is more complex, (otherkey, object) tuples.
An "insert" becomes two inserts, one to each list.
A key lookup is in one list; a value lookup is in the other. Do the lookups using bisect(list, item). Note that bisect assumes a fully sorted list, which a heapq-ordered list is not, so if you look up this way, insert with bisect.insort rather than heapq.heappush.
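A minimal sketch of the two-list idea, assuming both members of each pair are integers and using only the bisect module (bisect needs a fully sorted list). The names by_key, by_value, insert and lookup are made up for this illustration.

```python
import bisect

by_key = []    # sorted list of (key, value) tuples
by_value = []  # sorted list of (value, key) tuples


def insert(key, value):
    # One logical insert becomes two sorted inserts.
    bisect.insort(by_key, (key, value))
    bisect.insort(by_value, (value, key))


def lookup(pairs, item):
    # The 1-tuple (item,) sorts just before any (item, x) tuple,
    # so bisect_left lands on the matching pair if there is one.
    i = bisect.bisect_left(pairs, (item,))
    if i < len(pairs) and pairs[i][0] == item:
        return pairs[i][1]
    raise KeyError(item)


insert(10, 700)
insert(3, 900)
insert(7, 800)

print(lookup(by_key, 7))      # 800
print(lookup(by_value, 900))  # 3
```

Lookups are O(log n), but each insert shifts list elements and so costs O(n), whereas a dict offers average constant-time insert and lookup.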
It so happens that I find myself asking this question all the time (yesterday in particular). I agree with the approach of making two dictionaries. Do some benchmarking to see how much memory it's taking. I've never needed to make it mutable, but here's how I abstract it, if it's of any use:
class BiDict(list):
    def __init__(self, *pairs):
        # Start from an empty list; the loop below appends each pair once.
        super(BiDict, self).__init__()
        self._first_access = {}
        self._second_access = {}
        for pair in pairs:
            self._first_access[pair[0]] = pair[1]
            self._second_access[pair[1]] = pair[0]
            self.append(pair)
    def _get_by_first(self, key):
        return self._first_access[key]
    def _get_by_second(self, key):
        return self._second_access[key]
    # You'll have to do some overrides to make it mutable
    # Methods such as append, __add__, __del__, __iadd__
    # to name a few will have to maintain ._*_access
class Constants(BiDict):
    # An implementation expecting an integer and a string
    get_by_name = BiDict._get_by_second
    get_by_number = BiDict._get_by_first
t = Constants(
    (1, 'foo'),
    (5, 'bar'),
    (8, 'baz'),
)
>>> print(t.get_by_number(5))
bar
>>> print(t.get_by_name('baz'))
8
>>> print(t)
[(1, 'foo'), (5, 'bar'), (8, 'baz')]
How about using sqlite? Just create a :memory: database with a two-column table. You can even add indexes, then query by either one. Wrap it in a class if it's something you're going to use a lot.
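Something along these lines, using the standard-library sqlite3 module. The table name pairs, the column names k and v, and the helper functions are invented for this sketch, and it assumes integer keys and string values. PRIMARY KEY and UNIQUE each give SQLite an index, so both directions are indexed lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pairs (k INTEGER PRIMARY KEY, v TEXT UNIQUE NOT NULL)"
)
conn.executemany(
    "INSERT INTO pairs (k, v) VALUES (?, ?)",
    [(1, "foo"), (5, "bar"), (8, "baz")],
)


def value_for(key):
    # Lookup by key (first column).
    row = conn.execute("SELECT v FROM pairs WHERE k = ?", (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    return row[0]


def key_for(value):
    # Lookup by value (second column).
    row = conn.execute("SELECT k FROM pairs WHERE v = ?", (value,)).fetchone()
    if row is None:
        raise KeyError(value)
    return row[0]


print(value_for(5))    # bar
print(key_for("baz"))  # 8
```

The uniqueness constraints also mean that inserting a duplicate key or value raises sqlite3.IntegrityError instead of silently overwriting an entry.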