EDIT: as #BrenBarn pointed out, the original didn't make sense.
Given a list of dicts (courtesy of csv.DictReader--they all have str keys and values) it'd be nice to remove duplicates by stuffing them all in a set, but this can't be done directly since dict isn't hashable. Some existing questions touch on how to fake __hash__() for sets/dicts but don't address which way should be preferred.
# i. concise but ugly round trip
filtered = [eval(x) for x in {repr(d) for d in pile_o_dicts}]
# ii. wordy but avoids round trip
filtered = []
keys = set()
for d in pile_o_dicts:
    key = str(d)
    if key not in keys:
        keys.add(key)
        filtered.append(d)
# iii. introducing another class for this seems Java-like?
filtered = {hashable_dict(x) for x in pile_o_dicts}
# iv. something else entirely
In the spirit of the Zen of Python what's the "obvious way to do it"?
Based on your example code, I take your question to be something slightly different from what you literally say. You don't actually want to override __hash__() -- you just want to filter out duplicates in linear time, right? So you need to ensure the following for each dictionary: 1) every key-value pair is represented, and 2) they are represented in a stable order. You could use a sorted tuple of key-value pairs, but instead, I would suggest using frozenset. frozensets are hashable, and they avoid the overhead of sorting, which should improve performance (as this answer seems to confirm). The downside is that they take up more memory than tuples, so there is a space/time tradeoff here.
Also, your code uses sets to do the filtering, but that doesn't make a lot of sense. There's no need for that ugly eval step if you use a dictionary:
filtered = {frozenset(d.iteritems()):d for d in pile_o_dicts}.values()
Or in Python 3, assuming you want a list rather than a dictionary view:
filtered = list({frozenset(d.items()):d for d in pile_o_dicts}.values())
These are both a bit clunky. For readability, consider breaking it into two lines:
dict_o_dicts = {frozenset(d.iteritems()):d for d in pile_o_dicts}
filtered = dict_o_dicts.values()
The alternative is an ordered tuple of tuples:
filtered = {tuple(sorted(d.iteritems())):d for d in pile_o_dicts}.values()
And a final note: don't use repr for this. Dictionaries that evaluate as equal can have different representations:
>>> d1 = {str(i):str(i) for i in range(300)}
>>> d2 = {str(i):str(i) for i in range(299, -1, -1)}
>>> d1 == d2
True
>>> repr(d1) == repr(d2)
False
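Putting the frozenset approach together, here's a runnable Python 3 sketch (pile_o_dicts is a stand-in for the rows from csv.DictReader):

```python
pile_o_dicts = [
    {"name": "alice", "age": "30"},
    {"name": "bob", "age": "25"},
    {"age": "30", "name": "alice"},  # same pairs as the first, different order
]

# frozenset of items is hashable and order-independent, so equal
# dicts collapse onto the same key in the outer dict comprehension
filtered = list({frozenset(d.items()): d for d in pile_o_dicts}.values())
assert len(filtered) == 2
```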
The artfully named pile_o_dicts can be converted to a canonical form by sorting their items lists:
groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))
    groups.setdefault(k, []).append(d)
This will group identical dictionaries together.
FWIW, the technique of using sorted(d.items()) is currently used in the standard library for functools.lru_cache() in order to recognize function calls that have the same keyword arguments. IOW, this technique is tried and true :-)
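A quick runnable sketch of the grouping technique, with made-up sample data:

```python
pile_o_dicts = [
    {"a": "1", "b": "2"},
    {"b": "2", "a": "1"},  # equal to the first dict
    {"a": "9"},
]

groups = {}
for d in pile_o_dicts:
    k = tuple(sorted(d.items()))  # canonical, hashable form of the dict
    groups.setdefault(k, []).append(d)

# equal dicts land in the same bucket
assert len(groups) == 2
assert len(groups[(("a", "1"), ("b", "2"))]) == 2
```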
If the dicts all have the same keys, you can use a namedtuple
>>> from collections import namedtuple
>>> nt = namedtuple('nt', pile_o_dicts[0])
>>> set(nt(**d) for d in pile_o_dicts)
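As a runnable sketch with sample data (keep in mind the field names come from the first dict's keys, so nt(**d) only works when every dict has exactly the same keys):

```python
from collections import namedtuple

pile_o_dicts = [
    {"name": "alice", "age": "30"},
    {"name": "bob", "age": "25"},
    {"name": "alice", "age": "30"},  # duplicate row
]

# iterating a dict yields its keys, which become the tuple's field names
nt = namedtuple('nt', pile_o_dicts[0])
unique = {nt(**d) for d in pile_o_dicts}
assert len(unique) == 2
```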
Related
I'm trying to trim an ordered dict to the last x items.
I have the following code, which works but doesn't seem very pythonic.
Is there a better way of doing this?
import collections
d = collections.OrderedDict()
# SNIP: POPULATE DICT HERE!
d = collections.OrderedDict(d.items()[-3:])
This works a bit faster:
for k in range(len(d) - x): d.popitem(last=False)
Not really sure how pythonic it is though.
Benefits include not having to create a new OrderedDict object, and not having to look at keys or items.
If you wish to trim the dictionary in place, then you can pop the offending items:
for k in d.keys()[:-3]:
    d.pop(k)
(On python 3, you'll need to convert .keys() to a list).
If you're wishing to create a new OrderedDict, then it's not clear quite what is "unpythonic" about your current approach.
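For reference, a runnable Python 3 sketch of the in-place approach; in Python 3, d.keys() is a view rather than a list, so popitem(last=False) is the cleaner way to drop items from the front:

```python
from collections import OrderedDict

d = OrderedDict((str(i), i) for i in range(10))

# drop items from the front until only the last 3 remain
while len(d) > 3:
    d.popitem(last=False)

assert list(d.keys()) == ['7', '8', '9']
```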
I have a very large file I'm parsing and getting the key value from the line. I want only the first key and value, for only one value. That is, I'm removing the duplicate values
So it would look like:
{
A:1
B:2
C:3
D:2
E:2
F:3
G:1
}
and it would output:
{E:2,F:3,G:1}
It's a bit confusing because I don't really care what the key is. So E in the above could be replaced with B or D, F could be replaced with C, and G could be replaced with A.
Here is the best way I have found to do it but it is extremely slow as the file gets larger.
mapp = {}
value_holder = []
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.append(mydict[i])
Must look through value_holder every time :( Is there a faster way to do this?
Yes, a trivial change makes it much faster:
value_holder = set()
(Well, you also have to change the append to add. But still pretty simple.)
Using a set instead of a list means each lookup is O(1) instead of O(N), so the whole operation is O(N) instead of O(N^2). In other words, if you have 10,000 lines, you're doing 10,000 hash lookups instead of 50,000,000 comparisons.
One caveat with this solution—and all of the others posted—is that it requires the values to be hashable. If they're not hashable, but they are comparable, you can still get O(NlogN) instead of O(N^2) by using a sorted set (e.g., from the blist library). If they're neither hashable nor sortable… well, you'll probably want to find some way to generate something hashable (or sortable) to use as a "first check", and then only walk the "first check" matches for actual matches, which will get you to O(NM), where M is the average number of hash collisions.
You might want to look at how unique_everseen is implemented in the itertools recipes in the standard library documentation.
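A simplified version of that recipe looks roughly like this (the real one in the itertools docs is more heavily optimized):

```python
def unique_everseen(iterable, key=None):
    # yield elements whose key hasn't been seen yet, preserving order
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

pairs = [('A', 1), ('B', 2), ('C', 3), ('D', 2), ('E', 2), ('F', 3), ('G', 1)]
firsts = list(unique_everseen(pairs, key=lambda kv: kv[1]))
assert firsts == [('A', 1), ('B', 2), ('C', 3)]
```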
Note that dictionaries don't actually have an order, so there's no way to pick the "first" duplicate; you'll just get one arbitrarily. In which case, there's another way to do this:
inverted = {v:k for k, v in d.iteritems()}
reverted = {v:k for k, v in inverted.iteritems()}
(This is effectively a form of the decorate-process-undecorate idiom without any processing.)
But instead of building up the dict and then filtering it, you can make things better (simpler, and faster, and more memory-efficient, and order-preserving) by filtering as you read. Basically, keep the set alongside the dict as you go along. For example, instead of this:
mydict = {}
for line in f:
    k, v = line.split(None, 1)
    mydict[k] = v

mapp = {}
value_holder = set()
for i in mydict:
    if mydict[i] not in value_holder:
        mapp[i] = mydict[i]
        value_holder.add(mydict[i])
Just do this:
mapp = {}
value_holder = set()
for line in f:
    k, v = line.split(None, 1)
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
In fact, you may want to consider writing a one_to_one_dict that wraps this up (or search PyPI modules and ActiveState recipes to see if someone has already written it for you), so then you can just write:
mapp = one_to_one_dict()
for line in f:
    k, v = line.split(None, 1)
    mapp[k] = v
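There's no such class in the standard library, so OneToOneDict below is purely a hypothetical sketch of that wrapper: a dict subclass that silently ignores an assignment when the value has already been used. (A production version would also need to handle deletions and overwrites of existing keys.)

```python
class OneToOneDict(dict):
    # hypothetical sketch: keep only the first key seen for each value
    def __init__(self):
        super().__init__()
        self._values_seen = set()

    def __setitem__(self, k, v):
        if v not in self._values_seen:
            self._values_seen.add(v)
            super().__setitem__(k, v)

mapp = OneToOneDict()
for k, v in [('A', 1), ('B', 2), ('D', 2), ('G', 1)]:
    mapp[k] = v
assert mapp == {'A': 1, 'B': 2}
```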
I'm not completely clear on exactly what you're doing, but set is a great way to remove duplicates. For example:
>>> k = [1,3,4,4,5,4,3,2,2,3,3,4,5]
>>> set(k)
set([1, 2, 3, 4, 5])
>>> list(set(k))
[1, 2, 3, 4, 5]
Though it depends a bit on the structure of the input that you're loading, there might be a way to simply use set so that you don't have to iterate through the entire object every time to see if there are any matching keys; instead, run it through set once.
The first way to speed this up, as others have mentioned, is using a set to record seen values, as checking for membership on a set is much faster.
We can also make this a lot shorter with a dict comprehension:
seen = set()
new_mapp = {k: v for k, v in mapp.items() if v not in seen and not seen.add(v)}
The if case requires a little explanation: we only keep key/value pairs whose value we haven't seen before, but we use "and not" a little hackishly to add each unseen value to the set as a side effect. Since set.add() returns None, "not seen.add(v)" is always true and doesn't affect the outcome.
As always, in 2.x, use dict.iteritems() over dict.items().
Using a set instead of a list would speed you up considerably ...
You said you are reading from a very large file and want to keep only the first occurrence of a key. I originally assumed this meant you care about the order in which the key/value pairs occurs in the very large file. This code will do that and will be fast.
values_seen = set()
mapp = {}
with open("large_file.txt") as f:
    for line in f:
        key, value = line.split()
        if value not in values_seen:
            values_seen.add(value)
            mapp[key] = value
You were using a list to keep track of what values your code had seen. Searching through a list is very slow: it gets slower the larger the list gets. A set is much faster because lookups are close to constant time (they don't get much slower, if at all, as the set gets larger). (A dict also works the way a set works.)
Part of your problem is that dicts do not preserve any sort of logical ordering when they are iterated through. They use hash tables to index items (see this great article). So there's no real concept of "first occurrence of value" in this sort of data structure. The right way to do this would probably be a list of key-value pairs. e.g.:
kv_pairs = [(k1,v1),(k2,v2),...]
or, because the file is so large, it would be better to use the excellent file iteration python provides to retrieve the k/v pairs:
def kv_iter(f):
    # f being the file descriptor
    for line in f:
        yield ...  # (whatever logic you use to get k, v values from a line)
value_holder is a great candidate for a set. You are really just testing membership in value_holder, and because values are unique, they can be tested more efficiently using a similar hashing method. So it would end up a bit like this:
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)
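Putting the pieces together as a runnable Python 3 sketch (io.StringIO stands in for the large file, and the whitespace-separated "key value" line format is an assumption):

```python
import io

def kv_iter(f):
    # assumed format: whitespace-separated "key value" per line
    for line in f:
        k, v = line.split(None, 1)
        yield k, v.strip()

f = io.StringIO("A 1\nB 2\nC 3\nD 2\nE 2\nF 3\nG 1\n")
mapp = {}
value_holder = set()
for k, v in kv_iter(f):
    if v not in value_holder:
        mapp[k] = v
        value_holder.add(v)

assert mapp == {'A': '1', 'B': '2', 'C': '3'}
```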
I have a dictionary:
a = {"w1": "wer", "w2": "qaz", "w3": "edc"}
When I try to print its values, they are printed from right to left:
>>> for item in a.values():
...     print item,
edc qaz wer
I want them to be printed from left to right:
wer qaz edc
How can I do it?
You can't. Dictionaries don't have any order you can use, so there's no concept of "left to right" with regards to dictionary literals. Decide on a sorting, and stick with it.
You can use collections.OrderedDict (Python 2.7 or newer; there's an ActiveState recipe somewhere that provides this functionality for Python 2.4 or newer, I think) to store your items. Of course, you'll need to insert the items into the dictionary in the proper order (the {} syntax will no longer work, nor will passing key=value to the constructor, because, as others have mentioned, those rely on regular dictionaries, which have no concept of order).
Assuming you want them in alphabetical order of the keys, you can do something like this:
a = {"w1": "wer", "w2": "qaz", "w3": "edc"} # your dictionary
keylist = a.keys() # list of keys, in this case ["w3", "w2", "w1"]
keylist.sort()       # sort alphabetically in place,
                     # changing keylist to ["w1", "w2", "w3"]
for key in keylist:
    print a[key]     # access dictionary in order of sorted keys
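In Python 3, keys() returns a view rather than a list, so the usual idiom is to sort the keys directly; a quick sketch:

```python
a = {"w1": "wer", "w2": "qaz", "w3": "edc"}

values = [a[key] for key in sorted(a)]  # sort the keys, then look up
print(' '.join(values))  # wer qaz edc
```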
As #IgnacioVazquez-Abrams mentioned, there is no such thing as order in dictionaries, but you can achieve a similar effect by using the ordered dict odict from http://pypi.python.org/pypi/odict
also check out PEP372 for more discussion and odict patches.
Dictionaries use hash values to associate values. The only way to sort a dictionary would look something like:
d = {}  # don't shadow the built-in name dict
x = [k for k in d]
# sort x here
y = []
for k in x:
    y.append(d[k])
I haven't done any real work in python in a while, so I may be a little rusty. Please correct me if I am mistaken.
How come that I can easily do a for-loop in Python to loop through all the elements of a dictionary in the order I appended the elements but there's no obvious way to access a specific element? When I say element I'm talking about a key+value pair.
I've looked through what some basic tutorials on Python says on dictionaries but not a single one answers my question, I can't find my answer in docs.python.org either.
According to:
accessing specific elements from python dictionary (Senderies comment) a dict is supposed to be unordered but then why does the for-loop print them in the order I put them in?
You access a specific element in a dictionary by key. That's what a dictionary is. If that behavior isn't what you want, use something other than a dictionary.
a dict is supposed to be unordered but then why does the for-loop print them in the order I put them in?
Coincidence: basically, you happened to put them in in the order that Python prefers. This isn't too hard to do, especially with integers (ints are their own hashes and will tend to come out from a dict in ascending numeric order, though this is an implementation detail of CPython and may not be true in other Python implementations), and especially if you put them in in numerical order to begin with.
"Unordered" really means that you don't control the order, and it may change due to various implementation-specific criteria, so you should not rely on it. Of course when you iterate over a dictionary elements come out in some order.
If you need to be able to access dictionary elements by numeric index, there are lots of ways to do that. collections.OrderedDict is the standard way; the keys are always returned in the order you added them, so you can always do foo[foo.keys()[i]] to access the ith element. There are other schemes you could use as well.
Python dicts are accessed by hashing the key. So if you have any sort of a sizable dict and things are coming out in the order you put them in, then it's time to start betting on the lottery!
my_dict = {}
my_dict['a'] = 1
my_dict['b'] = 2
my_dict['c'] = 3
my_dict['d'] = 4
for k, v in my_dict.items():
    print k, v
yields:
a 1
c 3
b 2
d 4
d = {}
d['first'] = 1
d['second'] = 2
d['third'] = 3
print d
# prints {'second': 2, 'third': 3, 'first': 1}
# Hmm, just like the docs say, order of insertion isn't preserved.
print d['third']
# prints 3
# And there you have it: access to a specific element
If you want to iterate through the items in insertion order, you should be using OrderedDict. A regular dict is not guaranteed to do the same, so you're asking for trouble later if you rely on it to do so.
If you want to access a particular item, you should access it by its key using the [] operator or the get() method. That's the primary function of a dict, after all.
The result of hashing varies according to the values being hashed:
sometimes the order seems to be kept: see the following example with d_one
generally, the order is not kept: see the following example with d_two
Believing that the order is always kept just means you've been deceived by particular cases in which the order is apparently preserved
d_one = {}
for i, x in enumerate((122, 'xo', 'roto', 885)):
    print x
    d_one[x] = i
print
for k in d_one:
    print k

print '\n=======================\n'

d_two = {}
for i, x in enumerate((122, 'xo', 'roto', 'taratata', 885)):
    print x
    d_two[x] = i
print
for k in d_two:
    print k
result
122
xo
roto
885
122
xo
roto
885
=======================
122
xo
roto
taratata
885
122
taratata
xo
roto
885
By the way, what you call "elements of a dictionary" are commonly called "items of the dictionary" (hence the methods items() and iteritems() of a dictionary).
I have a problem which requires a reversable 1:1 mapping of keys to values.
That means sometimes I want to find the value given a key, but at other times I want to find the key given the value. Both keys and values are guaranteed unique.
x = D[y]
y == D.inverse[x]
The obvious solution is to simply invert the dictionary every time I want a reverse lookup. Inverting a dictionary is very easy (there's a recipe here), but for a large dictionary it can be very slow.
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
So is there a better structure I can use?
My application requires that this be very fast and use as little memory as possible.
The structure must be mutable, and it's strongly desirable that mutating the object should not cause it to be slower (e.g. to force a complete re-index)
We can guarantee that either the key or the value (or both) will be an integer
It's likely that the structure will be needed to store thousands or possibly millions of items.
Keys & values are guaranteed to be unique, i.e. len(set(x)) == len(x) for x in [D.keys(), D.values()]
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely be fast but would use up twice as much memory as a single dict.
Not really. Have you measured that? Since both dictionaries would use references to the same objects as keys and values, the memory spent would be just the dictionary structure. That's a lot less than twice and is a fixed amount regardless of your data size.
What I mean is that the actual data wouldn't be copied. So you'd spend little extra memory.
Example:
a = "some really really big text spending a lot of memory"
number_to_text = {1: a}
text_to_number = {a: 1}
Only a single copy of the "really big" string exists, so you end up spending just a little more memory. That's generally affordable.
I can't imagine a solution where you'd have the key lookup speed when looking by value, if you don't spend at least enough memory to store a reverse lookup hash table (which is exactly what's being done in your "unite two dicts" solution).
class TwoWay:
    def __init__(self):
        self.d = {}

    def add(self, k, v):
        self.d[k] = v
        self.d[v] = k

    def remove(self, k):
        self.d.pop(self.d.pop(k))

    def get(self, k):
        return self.d[k]
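For reference, here's a runnable Python 3 copy of that class with a quick check of the two-way behavior (note that keys and values share one namespace in the single dict, so a key colliding with some value would clobber an entry):

```python
class TwoWay:
    def __init__(self):
        self.d = {}

    def add(self, k, v):
        # store both directions in the same dict
        self.d[k] = v
        self.d[v] = k

    def remove(self, k):
        # pop one direction, then use its result to pop the other
        self.d.pop(self.d.pop(k))

    def get(self, k):
        return self.d[k]

tw = TwoWay()
tw.add(1, 'foo')
assert tw.get(1) == 'foo'
assert tw.get('foo') == 1
tw.remove('foo')  # removes both directions
assert tw.d == {}
```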
The other alternative is to make a new class which unites two dictionaries, one for each kind of lookup. That would most likely use up twice as much memory as a single dict.
Not really, since they would just be holding two references to the same data. In my mind, this is not a bad solution.
Have you considered an in-memory database lookup? I am not sure how it will compare in speed, but lookups in relational databases can be very fast.
Here is my own solution to this problem: http://github.com/spenthil/pymathmap/blob/master/pymathmap.py
The goal is to make it as transparent to the user as possible. The only significant attribute introduced is partner.
OneToOneDict subclasses from dict - I know that isn't generally recommended, but I think I have the common use cases covered. The backend is pretty simple, it (dict1) keeps a weakref to a 'partner' OneToOneDict (dict2) which is its inverse. When dict1 is modified dict2 is updated accordingly as well and vice versa.
From the docstring:
>>> dict1 = OneToOneDict()
>>> dict2 = OneToOneDict()
>>> dict1.partner = dict2
>>> assert(dict1 is dict2.partner)
>>> assert(dict2 is dict1.partner)
>>> dict1['one'] = '1'
>>> dict2['2'] = '1'
>>> dict1['one'] = 'wow'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1['one'] = '1'
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict1.update({'three': '3', 'four': '4'})
>>> assert(dict1 == dict((v,k) for k,v in dict2.items()))
>>> dict3 = OneToOneDict({'4':'four'})
>>> assert(dict3.partner is None)
>>> assert(dict3 == {'4':'four'})
>>> dict1.partner = dict3
>>> assert(dict1.partner is not dict2)
>>> assert(dict2.partner is None)
>>> assert(dict1.partner is dict3)
>>> assert(dict3.partner is dict1)
>>> dict1.setdefault('five', '5')
>>> dict1['five']
'5'
>>> dict1.setdefault('five', '0')
>>> dict1['five']
'5'
When I get some free time, I intend to make a version that doesn't store things twice. No clue when that'll be though :)
Assuming that you have a key with which you look up a more complex mutable object, just make the key a property of that object. It does seem you might be better off thinking about the data model a bit.
"We can guarantee that either the key or the value (or both) will be an integer"
That's weirdly written -- "key or the value (or both)" doesn't feel right. Either they're all integers, or they're not all integers.
It sounds like they're all integers.
Or, it sounds like you're thinking of replacing the target object with an integer value so you only have one copy referenced by an integer. This is a false economy. Just keep the target object. All Python objects are -- in effect -- references. Very little actual copying gets done.
Let's pretend that you simply have two integers and can do a lookup on either one of the pair. One way to do this is to use heap queues or the bisect module to maintain ordered lists of integer key-value tuples.
See http://docs.python.org/library/heapq.html#module-heapq
See http://docs.python.org/library/bisect.html#module-bisect
You have one heapq of (key, value) tuples (or, if your underlying object is more complex, (key, object) tuples).
You have another heapq of (value, key) tuples (or, if your underlying object is more complex, (otherkey, object) tuples).
An "insert" becomes two inserts, one to each heapq-structured list.
A key lookup is in one queue; a value lookup is in the other queue. Do the lookups using bisect(list,item).
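A minimal sketch of that scheme using the bisect module (plain sorted lists rather than heapqs, since sorted-order lookups are the point here; names are made up):

```python
import bisect

by_key = []    # sorted (key, value) tuples
by_value = []  # sorted (value, key) tuples

def insert(k, v):
    # an "insert" becomes two inserts, one per sorted list
    bisect.insort(by_key, (k, v))
    bisect.insort(by_value, (v, k))

def lookup_by_key(k):
    i = bisect.bisect_left(by_key, (k,))  # (k,) sorts before any (k, v)
    if i < len(by_key) and by_key[i][0] == k:
        return by_key[i][1]
    raise KeyError(k)

def lookup_by_value(v):
    i = bisect.bisect_left(by_value, (v,))
    if i < len(by_value) and by_value[i][0] == v:
        return by_value[i][1]
    raise KeyError(v)

insert(1, 'foo')
insert(5, 'bar')
assert lookup_by_key(5) == 'bar'
assert lookup_by_value('foo') == 1
```

Both lookups are O(log N); the trade-off against a pair of dicts is slower O(N) inserts in exchange for keeping everything in sorted order.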
It so happens that I find myself asking this question all the time (yesterday in particular). I agree with the approach of making two dictionaries. Do some benchmarking to see how much memory it's taking. I've never needed to make it mutable, but here's how I abstract it, if it's of any use:
class BiDict(list):
    def __init__(self, *pairs):
        super(BiDict, self).__init__()
        self._first_access = {}
        self._second_access = {}
        for pair in pairs:
            self._first_access[pair[0]] = pair[1]
            self._second_access[pair[1]] = pair[0]
            self.append(pair)

    def _get_by_first(self, key):
        return self._first_access[key]

    def _get_by_second(self, key):
        return self._second_access[key]

    # You'll have to do some overrides to make it mutable:
    # methods such as append, __add__, __del__, __iadd__,
    # to name a few, will have to maintain ._*_access

class Constants(BiDict):
    # An implementation expecting an integer and a string
    get_by_name = BiDict._get_by_second
    get_by_number = BiDict._get_by_first

t = Constants(
    (1, 'foo'),
    (5, 'bar'),
    (8, 'baz'),
)
>>> print t.get_by_number(5)
bar
>>> print t.get_by_name('baz')
8
>>> print t
[(1, 'foo'), (5, 'bar'), (8, 'baz')]
How about using sqlite? Just create a :memory: database with a two-column table. You can even add indexes, then query by either one. Wrap it in a class if it's something you're going to use a lot.
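A quick sketch of that idea; the UNIQUE constraints give each column its own index, so lookups in both directions are indexed (table and column names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# UNIQUE on each column creates an implicit index for that column
db.execute("CREATE TABLE pairs (key INTEGER UNIQUE, value TEXT UNIQUE)")
db.executemany("INSERT INTO pairs VALUES (?, ?)", [(1, 'foo'), (5, 'bar')])

# look up in either direction
(value,) = db.execute("SELECT value FROM pairs WHERE key = ?", (5,)).fetchone()
(key,) = db.execute("SELECT key FROM pairs WHERE value = ?", ('foo',)).fetchone()
assert (value, key) == ('bar', 1)
```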