(Using Python 3.4.3)
Here's what I want to do: I have a dictionary where the keys are strings and the values are the number of times that string occurs in a file. I need to output which string(s) occur with the greatest frequency, along with their frequencies (if there's a tie for the most frequent, output all of the most frequent).
I had tried to use OrderedDict. I can create it fine, but I struggle to get it to output specifically the most frequently occurring. I can keep trying, but I'm not sure an OrderedDict is really what I should be using, since I'll never need the actual OrderedDict once I've determined and output the most-frequent strings and their frequency. A fellow student recommended an ordered list, but I don't see how I'd preserve the link between the keys and values as I currently have them.
Is OrderedDict the best tool to do what I'm looking for, or is there something else? If it is, is there a way to filter/slice(or equivalent) the OrderedDict?
You can simply use sorted with a suitable key function. In this case you can use operator.itemgetter(1), which sorts your items by their values.
from operator import itemgetter
print(sorted(my_dict.items(), key=itemgetter(1), reverse=True))
This can be solved in two steps. First sort your dictionary entries by their frequency so that the highest frequency is first.
Secondly use Python's groupby function to take matching entries from the list. As you are only interested in the highest, you stop after one iteration. For example:
from itertools import groupby
from operator import itemgetter
my_dict = {"a" : 8, "d" : 3, "c" : 8, "b" : 2, "e" : 2}
for k, g in groupby(sorted(my_dict.items(), key=itemgetter(1), reverse=True), key=itemgetter(1)):
    print(list(g))
    break
This would display:
[('a', 8), ('c', 8)]
As a and c are equal top.
If you remove the break statement, you would get the full list:
[('a', 8), ('c', 8)]
[('d', 3)]
[('b', 2), ('e', 2)]
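If only the top group is needed, a full sort is not strictly required. The following is a minimal sketch, assuming the counts already live in a dict as in the question; the helper name most_frequent is made up for illustration. It finds the maximum count once and then keeps every entry that matches it.

```python
def most_frequent(counts):
    """Return all (key, count) pairs that share the highest count."""
    if not counts:
        return []
    top = max(counts.values())
    # sorted() only makes the output order predictable; it is not needed for correctness
    return sorted((k, v) for k, v in counts.items() if v == top)


my_dict = {"a": 8, "d": 3, "c": 8, "b": 2, "e": 2}
print(most_frequent(my_dict))  # [('a', 8), ('c', 8)]
```

This makes two linear passes over the dictionary instead of sorting it, which may matter for large inputs.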
Define:
diction = {"Book":(1, 2), "Armchair":(2, 2), "Lamp":(1, 3)}
If one wants to sort this dictionary by the second element of each value (item[1][1]) in descending order and by key in ascending order, what is the appropriate way?
Desired output:
diction = {"Lamp":(1, 3), "Armchair":(2, 2), "Book":(1, 2)}
After receiving the correct answer by Sinan: note that I did not ask about sorting either ascendingly or descendingly, for which you sent me the relevant links! I asked about doing both at the same time, which is not solved by the trick of minus given by Sinan.
The main idea is copied from here.
from collections import OrderedDict
diction = {"Book":(1, 2), "Armchair":(2, 2), "Lamp":(1, 3)}
diction_list = list(diction.items())
diction = OrderedDict(sorted(diction_list, key=lambda x: (-x[1][1], x[0])))
print(diction)
OrderedDict([('Lamp', (1, 3)), ('Armchair', (2, 2)), ('Book', (1, 2))])
Here is where the magic is happening:
sorted(diction_list, key=lambda x: (-x[1][1], x[0]))
sorted sorts stuff for you. It is very good at it. If you use the key parameter with sorted, you can give it a function that is applied to each item to produce the value it is sorted by.
In this case, we are giving it a lambda that returns a tuple: the negated second value of the item's tuple, -x[1][1], and the key, x[0]. The negative sign makes that component sort in reverse. Of course, this would not work with non-numeric values.
The OrderedDict is not strictly necessary. Regular dicts keep insertion order as a CPython implementation detail in 3.6 and as a language guarantee from 3.7 on. An OrderedDict always keeps insertion order, so it is used here so that our dict stays ordered independent of the Python version.
For a better way of doing this see the Sorting HOW TO in Python Documentation.
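When negation is not available (for example, when the descending component is a string), the Sorting HOW TO describes sorting in several passes, relying on the fact that Python's sort is stable. A minimal sketch of that technique on the same data, sorting by the secondary key first and by the primary key last:

```python
diction = {"Book": (1, 2), "Armchair": (2, 2), "Lamp": (1, 3)}

# Pass 1: secondary criterion, keys ascending
items = sorted(diction.items(), key=lambda kv: kv[0])
# Pass 2: primary criterion, second tuple element descending.
# The sort is stable (also with reverse=True), so ties keep the order from pass 1.
items.sort(key=lambda kv: kv[1][1], reverse=True)

print(items)
# [('Lamp', (1, 3)), ('Armchair', (2, 2)), ('Book', (1, 2))]
```

The result can be passed to OrderedDict (or dict on Python 3.7+) exactly as in the one-pass version.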
I noticed that the two snippets below give different results. One is a sorted list, while the other is a sorted dictionary. I can't figure out why adding .items() makes this difference:
aa={'a':1,'d':2,'c':3,'b':4}
bb=sorted(aa,key=lambda x:x[0])
print(bb)
#['a', 'b', 'c', 'd']
aa={'a':1,'d':2,'c':3,'b':4}
bb=sorted(aa.items(),key=lambda x:x[0])
print(bb)
# [('a', 1), ('b', 4), ('c', 3), ('d', 2)]
The first version implicitly sorts the keys in the dictionary, and is equivalent to sorting aa.keys(). The second version sorts the items, that is: a list of tuples of the form (key, value).
When you iterate over a dictionary you get its keys, not (key, value) pairs. The sorted function accepts any iterable, which is why you're seeing a difference.
You can verify this by printing while iterating over the dict:
aa={'a':1,'d':2,'c':3,'b':4}
for key in aa:
    print(key)
for key in aa.keys():
    print(key)
Both of the for loops above print the same keys.
In the second example, the items() method applied to a dictionary returns an iterable collection of tuples (dictionary_key, dictionary_value). Then that collection is sorted.
In the first example, the dictionary is implicitly treated as an iterable collection of its keys. (And note: only the very first character of each key is used for comparison while sorting, which is probably NOT what you want.)
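To make the difference concrete, here is a small sketch with made-up multi-character keys (the dict words is purely illustrative). It assumes Python 3.7+, where dicts iterate in insertion order, so ties under the first-character key keep that order.

```python
words = {"banana": 1, "avocado": 2, "apple": 3}

# Iterating a dict yields keys; comparing only x[0] leaves "avocado" and
# "apple" tied, so the stable sort keeps their insertion order.
by_first_char = sorted(words, key=lambda x: x[0])

# Without a key function the whole key string is compared.
by_full_key = sorted(words)

# items() yields (key, value) tuples, which lets you sort by value instead.
by_value = sorted(words.items(), key=lambda kv: kv[1])

print(by_first_char)  # ['avocado', 'apple', 'banana']
print(by_full_key)    # ['apple', 'avocado', 'banana']
print(by_value)       # [('banana', 1), ('avocado', 2), ('apple', 3)]
```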
I am trying to write code that implements a greedy algorithm, and for that I need to make sure that my calculations use the highest value possible. Potential values are stored in a dictionary, and my goal is to use the largest value first and then move on to lower values. However, since dictionary values are not ordered, in a for loop I get them in an arbitrary sequence. For example, the output of the code below would start from 25.
How can I make sure that my code is using a dictionary yet following the sequence of (500,100,25,10,5)?
a={"f":500,"o":100,"q":25,"d":10,"n":5}
for i in a:
    print(a[i])
Two ideas spring to mind:
Use collections.OrderedDict, a dictionary subclass which remembers the order in which items are added. As long as you add the pairs in descending value order, looping over this dict will return them in the right order.
If you can't be sure the items will be added to the dict in the right order, you could construct them by sorting:
Get the values of the dictionary with values()
Sort by (ascending) value: this is sorted(), and Python will default to sorting in ascending order
Get them by descending value instead: this is reverse=True
Here's an example:
for value in sorted(a.values(), reverse=True):
    print(value)
Dictionaries yield their keys when you iterate them normally, but you can use the items() view to get tuples of the key and value. That'll be un-ordered, but you can then use sorted() on the "one-th" element of the tuples (the value) with reverse set to True:
import operator
a={"f":500,"o":100,"q":25,"d":10,"n":5}
for k, v in sorted(a.items(), key=operator.itemgetter(1), reverse=True):
    print(v)
I'm guessing that you do actually need the keys, but if not, you can just use values() instead of items(): sorted(a.values(), reverse=True)
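Since the question mentions a greedy algorithm, here is a hedged sketch of how the sorted items might be consumed. The function name greedy_change and the "make change for an amount" task are assumptions for illustration only; the question does not say what the greedy step actually computes.

```python
from operator import itemgetter


def greedy_change(amount, denominations):
    """Use the largest values first; return ({name: count}, remainder)."""
    used = {}
    # Visit (name, value) pairs from the largest value to the smallest
    for name, value in sorted(denominations.items(), key=itemgetter(1), reverse=True):
        count, amount = divmod(amount, value)
        if count:
            used[name] = count
    return used, amount


a = {"f": 500, "o": 100, "q": 25, "d": 10, "n": 5}
print(greedy_change(640, a))
# ({'f': 1, 'o': 1, 'q': 1, 'd': 1, 'n': 1}, 0)
```

Keeping the keys around is what makes items() preferable to values() here: the result can report which entries were used.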
You can use this
>>> a={"f":500,"o":100,"q":25,"d":10,"n":5}
>>> for value in sorted(a.values(), reverse=True):
...     print(value)
...
500
100
25
10
5
>>>
a={"f":500,"o":100,"q":25,"d":10,"n":5}
k = sorted(a, key=a.__getitem__, reverse=True)
v = sorted(a.values(), reverse=True)
sorted_a = list(zip(k, v))  # list() is needed on Python 3, where zip is lazy
print(sorted_a)
Output:
[('f', 500), ('o', 100), ('q', 25), ('d', 10), ('n', 5)]
I have a list comprehension that looks like this:
cart = [ ((p,pp),(q,qq)) for ((p,pp),(q,qq))\
in itertools.product(C.items(), repeat=2)\
if p[1:] == q[:-1] ]
C is a dict with keys that are tuples of arbitrary integers . All the tuples have the same length. Worst case is that all the combinations should be included in the new list. This can happen quite frequently.
As an example, I have a dictionary like this:
C = { (0,1):'b',
(2,0):'c',
(0,0):'d' }
And I want the result to be:
cart = [ (((2, 0), 'c'), ((0, 1), 'b')),
         (((2, 0), 'c'), ((0, 0), 'd')),
         (((0, 0), 'd'), ((0, 1), 'b')),
         (((0, 0), 'd'), ((0, 0), 'd')) ]
So, by overlap I am referring to, for instance, that the tuples (1,2,3,4) and (2,3,4,5) have the overlapping section (2,3,4). The overlapping sections must be on the "edges" of the tuples. I only want overlaps that have length one shorter than the tuple length. Thus (1,2,3,4) does not overlap with (3,4,5,6). Also note that when removing the first or last element of a tuple we might end up with non-distinct tuples, all of which must be compared to all the other elements. This last point was not emphasized in my first example.
The better part of my code's execution time is spent in this list comprehension. I always need all elements of cart, so there appears to be no speedup from using a generator instead.
My question is: Is there a faster way of doing this?
A thought I had was that I could try to create two new dictionaries like this:
aa = defaultdict(list)
bb = defaultdict(list)
[aa[p[1:]].append(p) for p in C.keys()]
[bb[p[:-1]].append(p) for p in C.keys()]
And somehow merge all combinations of elements of the list in aa[i] with the list in bb[i] for all i, but I can not seem to wrap my head around this idea either.
Update
Both the solutions added by tobias_k and shx2 have better complexity than my original code (as far as I can tell). My code is O(n^2) whereas the two other solutions are O(n). For my problem size and composition, however, all three solutions seem to run in more or less the same time. I suppose this has to do with a combination of overhead associated with function calls, as well as the nature of the data I am working with. In particular, the number of different keys, as well as the actual composition of the keys, seems to have a large impact. The latter I know because the code runs much slower for completely random keys. I have accepted tobias_k's answer because his code is the easiest to follow. However, I would still greatly welcome other suggestions on how to perform this task.
You were actually on the right track, using the dictionaries to store all the prefixes to the keys. However, keep in mind that (as far as I understand the question) two keys can also overlap if the overlap is less than len-1, e.g. the keys (1,2,3,4) and (3,4,5,6) would overlap, too. Thus we have to create a map holding all the prefixes of the keys. (If I am mistaken about this, just drop the two inner for loops.) Once we have this map, we can iterate over all the keys a second time, and check whether for any of their suffixes there are matching keys in the prefixes map. (Update: Since keys can overlap w.r.t. more than one prefix/suffix, we store the overlapping pairs in a set.)
from collections import defaultdict

def get_overlap(keys):
    # create map: prefix -> set(keys with that prefix)
    prefixes = defaultdict(set)
    for key in keys:
        for prefix in [key[:i] for i in range(len(key))]:
            prefixes[prefix].add(key)
    # get keys with matching prefixes for all suffixes
    overlap = set()
    for key in keys:
        for suffix in [key[i:] for i in range(len(key))]:
            overlap.update([(key, other) for other in prefixes[suffix]
                            if other != key])
    return overlap
(Note that, for simplicity, I only care about the keys in the dictionary, not the values. Extending this to return the values, too, or doing this as a postprocessing step, should be trivial.)
Overall running time should be only 2*n*k, n being the number of keys and k the length of the keys. Space complexity (the size of the prefixes map) should be between n*k and n^2*k, if there are very many keys with the same prefixes.
Note: The above answer is for the more general case that the overlapping region can have any length. For the simpler case that you consider only overlaps one shorter than the original tuple, the following should suffice and yield the results described in your examples:
def get_overlap_simple(keys):
    prefixes = defaultdict(list)
    for key in keys:
        prefixes[key[:-1]].append(key)
    return [(key, other) for key in keys for other in prefixes[key[1:]]]
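As a sanity check, the prefix-index idea can be compared against the brute-force comprehension from the question. This is a self-contained sketch on the question's small example; the function names overlap_items and brute_force are made up for illustration, and the values are carried along so that the result has the same shape as cart.

```python
from collections import defaultdict
from itertools import product


def overlap_items(C):
    """Pair items whose keys overlap in all but one element, via a prefix index."""
    by_prefix = defaultdict(list)
    for key, value in C.items():
        by_prefix[key[:-1]].append((key, value))
    # For each key, look up the items whose prefix equals this key's suffix
    return [((k, v), other)
            for k, v in C.items()
            for other in by_prefix.get(k[1:], [])]


def brute_force(C):
    """The quadratic version from the question, kept only for comparison."""
    return [((p, pp), (q, qq))
            for (p, pp), (q, qq) in product(C.items(), repeat=2)
            if p[1:] == q[:-1]]


C = {(0, 1): 'b', (2, 0): 'c', (0, 0): 'd'}
print(sorted(overlap_items(C)) == sorted(brute_force(C)))  # True
```

The index version touches each key twice plus once per emitted pair, so when nearly all pairs overlap (the worst case mentioned in the question) it cannot be much faster than the brute-force version.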
Your idea of preprocessing the data into a dict was a good one. Here goes:
C = { (0,1): 'b', (2,0): 'c', (0,0): 'd' }

def my_groupby(seq, key):
    """
    >>> my_groupby(range(10), lambda x: 'mod=%d' % (x % 3))
    {'mod=0': [0, 3, 6, 9], 'mod=1': [1, 4, 7], 'mod=2': [2, 5, 8]}
    """
    groups = dict()
    for x in seq:
        y = key(x)
        groups.setdefault(y, []).append(x)
    return groups

def get_overlapping_items(C):
    # index the items by key prefix (the key without its last element)
    prefixes = my_groupby(C.items(), key=lambda kv: kv[0][:-1])
    for k1, v1 in C.items():
        prefix = k1[1:]
        for k2, v2 in prefixes.get(prefix, []):
            yield (k1, v1), (k2, v2)

for x in get_overlapping_items(C):
    print(x)
(((2, 0), 'c'), ((0, 1), 'b'))
(((2, 0), 'c'), ((0, 0), 'd'))
(((0, 0), 'd'), ((0, 1), 'b'))
(((0, 0), 'd'), ((0, 0), 'd'))
And by the way, instead of:
itertools.product(*[C.items()]*2)
do:
itertools.product(C.items(), repeat=2)
I am new to Python, and I am familiar with implementations of Multimaps in other languages. Does Python have such a data structure built-in, or available in a commonly-used library?
To illustrate what I mean by "multimap":
a = multidict()
a[1] = 'a'
a[1] = 'b'
a[2] = 'c'
print(a[1]) # prints: ['a', 'b']
print(a[2]) # prints: ['c']
Such a thing is not present in the standard library. You can use a defaultdict though:
>>> from collections import defaultdict
>>> md = defaultdict(list)
>>> md[1].append('a')
>>> md[1].append('b')
>>> md[2].append('c')
>>> md[1]
['a', 'b']
>>> md[2]
['c']
(Instead of list you may want to use set, in which case you'd call .add instead of .append.)
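A minimal sketch of the set variant, for the case where duplicate values under one key should collapse:

```python
from collections import defaultdict

md = defaultdict(set)
md[1].add('a')
md[1].add('a')   # adding the same value again has no effect
md[1].add('b')
md[2].add('c')

print(sorted(md[1]))  # ['a', 'b']
print(sorted(md[2]))  # ['c']
```

Sets do not remember insertion order, so sort the values (as above) when a predictable order matters.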
As an aside: look at these two lines you wrote:
a[1] = 'a'
a[1] = 'b'
This seems to indicate that you want the expression a[1] to be equal to two distinct values. This is not possible with dictionaries because their keys are unique and each of them is associated with a single value. What you can do, however, is extract all values inside the list associated with a given key, one by one. You can use iter followed by successive calls to next for that. Or you can just use two loops:
>>> for k, v in md.items():
...     for w in v:
...         print("md[%d] = '%s'" % (k, w))
...
md[1] = 'a'
md[1] = 'b'
md[2] = 'c'
Just for future visitors: there is now a Python implementation of a multimap. It's available via PyPI.
Stephan202 has the right answer, use defaultdict. But if you want something with the interface of C++ STL multimap and much worse performance, you can do this:
multimap = []
multimap.append( (3,'a') )
multimap.append( (2,'x') )
multimap.append( (3,'b') )
multimap.sort()
Now when you iterate through multimap, you'll get pairs like you would in a std::multimap. Unfortunately, that means your loop code will start to look as ugly as C++.
def multimap_iter(multimap, minkey, maxkey=None):
    maxkey = minkey if (maxkey is None) else maxkey
    for k, v in multimap:
        if k < minkey: continue
        if k > maxkey: break
        yield k, v

# this will print 'a','b'
for k, v in multimap_iter(multimap, 3, 3):
    print(v)
In summary, defaultdict is really cool and leverages the power of python and you should use it.
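If the sorted-list representation is used anyway, the linear scan can be avoided with the standard bisect module. A minimal sketch, assuming the list is kept sorted and holds (key, value) tuples; the helper name values_for is made up for illustration:

```python
from bisect import bisect_left


def values_for(pairs, key):
    """Collect the values stored under key in a sorted list of (key, value) tuples."""
    # (key,) sorts before every (key, value) tuple, so this finds the first match
    i = bisect_left(pairs, (key,))
    found = []
    while i < len(pairs) and pairs[i][0] == key:
        found.append(pairs[i][1])
        i += 1
    return found


multimap = sorted([(3, 'a'), (2, 'x'), (3, 'b')])
print(values_for(multimap, 3))  # ['a', 'b']
print(values_for(multimap, 4))  # []
```

The lookup is logarithmic in the list size plus the number of matches, but every insertion still has to keep the list sorted.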
You can take a list of tuples and then sort it as if it were a multimap.
listAsMultimap=[]
Let's append some elements (tuples):
listAsMultimap.append((1,'a'))
listAsMultimap.append((2,'c'))
listAsMultimap.append((3,'d'))
listAsMultimap.append((2,'b'))
listAsMultimap.append((5,'e'))
listAsMultimap.append((4,'d'))
Now sort it.
listAsMultimap=sorted(listAsMultimap)
After printing it you will get:
[(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd'), (4, 'd'), (5, 'e')]
That means it is working as a Multimap!
Please note that like multimap here values are also sorted in ascending order if the keys are the same (for key=2, 'b' comes before 'c' although we didn't append them in this order.)
If you want to get them in descending order just change the sorted() function like this:
listAsMultimap=sorted(listAsMultimap,reverse=True)
And after you will get output like this:
[(5, 'e'), (4, 'd'), (3, 'd'), (2, 'c'), (2, 'b'), (1, 'a')]
Similarly here values are in descending order if the keys are the same.
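If entries with equal keys should keep the order in which they were appended, rather than being ordered by value, one option is to sort on the key alone. This is a small sketch relying on the stability of Python's sort; the variable names are illustrative.

```python
from operator import itemgetter

pairs = [(1, 'a'), (2, 'c'), (3, 'd'), (2, 'b'), (5, 'e'), (4, 'd')]

# Only the key (element 0) is compared, so (2, 'c') stays ahead of (2, 'b')
by_key_only = sorted(pairs, key=itemgetter(0))

print(by_key_only)
# [(1, 'a'), (2, 'c'), (2, 'b'), (3, 'd'), (4, 'd'), (5, 'e')]
```

This also works when the values themselves cannot be compared with each other.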
The standard way to write this in Python is with a dict whose elements are each a list or set. As stephan202 says, you can somewhat automate this with a defaultdict, but you don't have to.
In other words I would translate your code to
a = dict()
a[1] = ['a', 'b']
a[2] = ['c']
print(a[1]) # prints: ['a', 'b']
print(a[2]) # prints: ['c']
Or subclass dict:
class Multimap(dict):
    def __setitem__(self, key, value):
        if key not in self:
            dict.__setitem__(self, key, [value])  # call super method to avoid recursion
        else:
            self[key].append(value)
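A short usage sketch of the subclass idea, written out in full so it runs on its own. One caveat worth knowing: in CPython, the dict constructor and update() do not go through an overridden __setitem__, so only item assignment gets the list-wrapping behaviour.

```python
class Multimap(dict):
    """dict subclass where assigning to a key appends instead of overwriting."""

    def __setitem__(self, key, value):
        if key not in self:
            dict.__setitem__(self, key, [value])  # first value: start a new list
        else:
            self[key].append(value)               # later values: extend the list


a = Multimap()
a[1] = 'a'
a[1] = 'b'
a[2] = 'c'
print(a[1])  # ['a', 'b']
print(a[2])  # ['c']

# The constructor bypasses __setitem__, so this value is stored unwrapped
b = Multimap({1: 'a'})
print(b[1])  # a
```

Because assignment no longer overwrites, replacing a key's values requires deleting the key first, which can surprise code that expects ordinary dict semantics.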
There is currently no multimap in the Python standard library.
WebOb has a MultiDict class used to represent HTML form values, and it is used by a few Python Web frameworks, so the implementation is battle tested.
Werkzeug also has a MultiDict class, and for the same reason.