I have a list of tuples that can be understood as key-value pairs, where a key can appear several times, possibly with different values, for example
[(2,8),(5,10),(2,5),(3,4),(5,50)]
I now want to get a list of tuples with the highest value for each key, i.e.
[(2,8),(3,4),(5,50)]
The order of the keys is irrelevant.
How do I do that in an efficient way?
Sort them, then convert to a dictionary and take the items back out of it:
l = [(2,8),(5,10),(2,5),(3,4),(5,50)]
list(dict(sorted(l)).items())  # Python 3; in Python 2 the list() call is not needed
[(2, 8), (3, 4), (5, 50)]
The idea is that, when the sorted pairs are converted to a dictionary, each key gets updated in ascending order of value, so the lower values for each key are discarded; you then just read the items back out as tuples.
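As a quick illustration of the overwrite behaviour (using a trimmed-down version of the same data): when a dict is built from a sequence of pairs, later pairs replace earlier ones with the same key, so sorting first guarantees the largest value is the one that survives:
>>> sorted([(2,8),(5,10),(2,5)])
[(2, 5), (2, 8), (5, 10)]
>>> dict([(2, 5), (2, 8), (5, 10)])  # (2, 8) overwrites (2, 5)
{2: 8, 5: 10}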
At its core, this problem is essentially about grouping the tuples based on their first element and then keeping only the maximum of each group.
Grouping can be done easily with a defaultdict. A detailed explanation of grouping with defaultdicts can be found in my answer here. In your case, we group the tuples by their first element and then use the max function to find the tuple with the largest number.
import collections
tuples = [(2,8),(5,10),(2,5),(3,4),(5,50)]
groupdict = collections.defaultdict(list)
for tup in tuples:
    group = tup[0]
    groupdict[group].append(tup)
result = [max(group) for group in groupdict.values()]
# result: [(2, 8), (5, 50), (3, 4)]
In your particular case, we can optimize the code a little bit by storing only the maximum 2nd element in the dict, rather than storing a list of all tuples and finding the maximum at the end:
tuples = [(2,8),(5,10),(2,5),(3,4),(5,50)]
groupdict = {}
for tup in tuples:
    group, value = tup
    if group in groupdict:
        groupdict[group] = max(groupdict[group], value)
    else:
        groupdict[group] = value
result = [(group, value) for group, value in groupdict.items()]
This keeps the memory footprint to a minimum, but only works for tuples with exactly 2 elements.
This has a number of advantages over Netwave's solution:
It's more readable. Anyone who sees a defaultdict being instantiated knows that it'll be used to group data, and the use of the max function makes it easy to understand which tuples are kept. Netwave's one-liner is clever, but clever solutions are rarely easy to read.
Since the data doesn't have to be sorted, this runs in linear O(n) time instead of O(n log n).
A few questions about the code below, which checks whether a list is sorted:
Why is a lambda used as the key here? Does that mean the key of a list can always be derived this way?
In the enumerate loop, why do we compare key(el) < key(lst[i]) and not key(el) < key(el-1) or lst[i+1] < lst[i]?
def is_sorted(lst, key=lambda x:x):
    for i, el in enumerate(lst[1:]):
        if key(el) < key(lst[i]): # i is the index of the previous element
            return False
    return True
hh=[1,2,3,4,6]
val = is_sorted(hh)
print(val)
(NB: the code above was taken from this SO answer)
This code scans a list to see if it is sorted low to high. The first problem is deciding what "low" and "high" mean for arbitrary types. It's easy for integers, but what about user-defined types? So, the author lets you pass in a function that converts each item to something whose comparison works the way you want.
For instance, let's say you want to check tuples, but based on the 3rd item, which you know to be an integer: that would be key=lambda x: x[2]. But the author provides a default key=lambda x: x, which just returns the object it is given, for items that are already their own sort key.
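For example, with the is_sorted function above, you could check whether a list of tuples is ordered by its third item (a small illustrative call on made-up data):
>>> data = [(9, 'a', 1), (0, 'b', 2), (5, 'c', 3)]
>>> is_sorted(data, key=lambda x: x[2])  # compare by the 3rd item
True
>>> is_sorted(data)                      # compare whole tuples
False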
The second part is easy. If any item is less than the item just before it, then we have found a place where the list is not low to high. The reason it works is literally in the comment: i is the index of the element directly preceding el. We know this because we enumerated over the second and following elements of the list (enumerate(lst[1:])).
enumerate yields both index and current element:
for i, el in enumerate(lst):
    print(i, el)
would print:
0 1
1 2
2 3
3 4
4 6
By slicing the list (removing the first element), the code introduces a shift between the index and the current element, and it only needs to access the list by index once (using indices on a list while also iterating over it in full is generally not considered Pythonic).
It's cleaner and arguably more Pythonic to zip the list with a shifted version of itself and pass the comparisons to all; no indices are involved and the code reads more clearly:
import itertools
def is_sorted(lst, key=lambda x:x):
    return all(key(prev) <= key(current) for prev, current in zip(lst, itertools.islice(lst, 1, None, None)))
Since the slicing is done by islice, no extra list is created (otherwise it is the same as lst[1:]).
The key function (the identity function by default) converts each value into the value used for comparison. For integers the identity is fine, unless we want to reverse the comparison, in which case we would pass lambda x: -x.
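For instance, with the zip/islice version above, a descending list counts as sorted once the key negates each value (a small illustration):
>>> is_sorted([6, 4, 3, 2, 1], key=lambda x: -x)
True
>>> is_sorted([6, 4, 3, 2, 1])
False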
The point is not that the lambda "derives" the key of a list. Rather, it's a function that allows you to choose the key. That is, given a list of objects of type X, what attribute would you use to compare them with? The default is the identity function - ie use the plain value of each element. But you could choose anything here.
You could indeed write this function by comparing lst[i+1] < lst[i]. You couldn't however write it by comparing key(el) < key(el-1), because el is the value of the element itself, not the index.
This is a function that tests whether a list has been sorted, as by the builtin sorted function for example. That function takes a keyword argument key which is applied to every single element of the list to compute its comparison value:
>>> sorted([(0,3),(1,2),(2,1),(3,0)])
[(0, 3), (1, 2), (2, 1), (3, 0)]
>>> sorted([(0,3),(1,2),(2,1),(3,0)],key=lambda x:x[1])
[(3, 0), (2, 1), (1, 2), (0, 3)]
The key keyword in your function exists to mimic the behavior of sorted:
>>> is_sorted([(0,3),(1,2),(2,1),(3,0)])
True
>>> is_sorted([(0,3),(1,2),(2,1),(3,0)],key=lambda x:x[1])
False
The default lambda is just there to mimic a default behavior where nothing is changed.
I have two 2-dimensional lists. Each list item contains a list with a string ID and an integer. I want to subtract the integers from each other where the string ID matches.
List 1:
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
List 2:
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
I want to end up with
difference = [['ID_001',500],['ID_002',1000],['ID_003',2000]]
Notice that the elements aren't necessarily in the same order in both lists. Both lists will be the same length and there is an integer corresponding to each ID in both lists.
I would also like this to be done efficiently as both lists will have thousands of records.
from collections import defaultdict
diffs = defaultdict(int)
list1 = [['ID_001',1000],['ID_002',2000],['ID_003',3000]]
list2 = [['ID_001',500],['ID_003',1000],['ID_002',1000]]
for pair in list1:
    diffs[pair[0]] = pair[1]
for pair in list2:
    diffs[pair[0]] -= pair[1]
differences = [[k,abs(v)] for k,v in diffs.items()]
print(differences)
I was curious so I ran a few timeits comparing my answer to Jim's. They seem to run in about the same time. You can cut the runtime of mine in half if you're willing to accept the output as a dictionary, however.
His is, of course, more Pythonic, if that's important to you.
You could achieve this by using a list comprehension:
diff = [(i[0], abs(i[1] - j[1])) for i,j in zip(sorted(list1), sorted(list2))]
This first sorts both lists with sorted so that their entries line up in the same order (not with list.sort(), which sorts in place), and then zip pairs the corresponding entries from the sorted lists, e.g. ['ID_001', 1000] with ['ID_001', 500].
Finally:
(i[0], abs(i[1] - j[1]))
returns i[0], which is the ID for each entry, while abs(i[1] - j[1]) computes their absolute difference. These are added as a tuple to the final result list (note the parentheses surrounding them).
In general, sorted might slow you down if you have a large amount of data, though as far as I'm aware that depends on how disordered the data is.
Other than that, zip creates an iterator, so memory-wise it doesn't affect you. Speed-wise, list comprehensions tend to be quite efficient and are your best option in most cases.
I have:
([(5,2),(7,2)],[(5,1),(7,3),(11,1)])
I need to add together the second elements of the tuples that share the same first element.
Output: [(5,3),(7,5),(11,1)]
This is a great use-case for collections.Counter...
from collections import Counter
tup = ([(5,2),(7,2)], [(5,1),(7,3),(11,1)])
counts = sum((Counter(dict(sublist)) for sublist in tup), Counter())
result = list(counts.items())
print(result)
One downside here is that you'll lose the order of the inputs. They appear to be sorted by the key, so you could just sort the items:
result = sorted(counts.items())
A Counter is a dictionary whose purpose is to hold the "counts" of bins. Counters are cleverly designed so that you can simply add them together, which adds the counts bin-wise; if a bin isn't present in one of the Counters, its value is taken to be 0. That explains why we can use sum on a bunch of Counters to get a dictionary with the values you want. Unfortunately for this solution, a Counter can't be instantiated from an iterable that yields 2-item sequences the way normal mappings can:
Counter([(1, 2), (3, 4)])
would create a Counter with keys (1, 2) and (3, 4) -- Both values would be 1. It does work as expected if you create it with a mapping however:
Counter(dict([(1, 2), (3, 4)]))
creates a Counter with keys 1 and 3 (and values 2 and 4).
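To see the bin-wise addition in isolation, here is the same data added as two explicit Counters (just an illustration of what the sum does one step at a time):
>>> from collections import Counter
>>> Counter(dict([(5, 2), (7, 2)])) + Counter(dict([(5, 1), (7, 3), (11, 1)]))
Counter({7: 5, 5: 3, 11: 1})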
Try this code (brute force, maybe):
dt = {}
tp = ([(5,2),(7,2)],[(5,1),(7,3),(11,1)])
for ls in tp:
    for t in ls:
        dt[t[0]] = dt[t[0]] + t[1] if t[0] in dt else t[1]
print dt.items()
The approach taken here is to loop through the lists of tuples and store each tuple's data in a dictionary, where the tuple's 1st element t[0] is the key and its 2nd element t[1] is the value.
During iteration, every time a key already present in the dictionary turns up as a tuple's 1st element, that tuple's 2nd element is added to the stored value. In the end, we have a dictionary dt with all the key, value pairs as required. Converting this dictionary to a list of tuples with dt.items() gives our output.
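The same accumulation can also be written with dict.get, which returns 0 for keys that haven't been seen yet; this is just an equivalent sketch of the loop above, not a change in behaviour:
dt = {}
tp = ([(5,2),(7,2)],[(5,1),(7,3),(11,1)])
for ls in tp:
    for key, value in ls:
        dt[key] = dt.get(key, 0) + value  # 0 is used when the key is new
print(list(dt.items()))  # [(5, 3), (7, 5), (11, 1)]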
I am trying to write code that implements a greedy algorithm, and for that I need to make sure my calculations use the highest value available. The candidate values are stored in a dictionary, and my goal is to use the largest value first and then move on to the lower values. However, since dictionary values are not ordered, the for loop gives me the values in no particular sequence; for example, the output of the code below starts from 25.
How can I make sure that my code is using a dictionary yet following the sequence of (500,100,25,10,5)?
a={"f":500,"o":100,"q":25,"d":10,"n":5}
for i in a:
    print a[i]
Two ideas spring to mind:
Use collections.OrderedDict, a dictionary subclass which remembers the order in which items are added. As long as you add the pairs in descending value order, looping over this dict will return them in the right order (see the sketch after the example below).
If you can't be sure the items will be added to the dict in the right order, you could construct them by sorting:
Get the values of the dictionary with values()
Sort by (ascending) value: this is sorted(), and Python will default to sorting in ascending order
Get them by descending value instead: this is reverse=True
Here's an example:
for value in sorted(a.values(), reverse=True):
    print value
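And here is a minimal sketch of the first idea, assuming you insert the pairs in descending value order yourself:
from collections import OrderedDict

a = OrderedDict([("f", 500), ("o", 100), ("q", 25), ("d", 10), ("n", 5)])
for value in a.values():
    print(value)  # prints 500, 100, 25, 10, 5 -- insertion order is preserved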
Dictionaries yield their keys when you iterate over them normally, but you can use the items() view to get tuples of key and value. That'll be unordered, but you can then use sorted() with a key that picks out the second element of each tuple (the value) and with reverse set to True:
a={"f":500,"o":100,"q":25,"d":10,"n":5}
for k, v in sorted(a.items(), key=operator.itemgetter(1), reverse=True):
print(v)
I'm guessing that you do actually need the keys, but if not, you can just use values() instead of items(): sorted(a.values(), reverse=True)
You can use this
>>> a={"f":500,"o":100,"q":25,"d":10,"n":5}
>>> for value in sorted(a.itervalues(),reverse=True):
...     print value
...
500
100
25
10
5
>>>
a={"f":500,"o":100,"q":25,"d":10,"n":5}
k = sorted(a, key=a.__getitem__, reverse=True)
v = sorted(a.values(), reverse=True)
sorted_a = list(zip(k, v))
print(sorted_a)
Output:
[('f', 500), ('o', 100), ('q', 25), ('d', 10), ('n', 5)]
If I have the list:
list1 = [(12, "AB", "CD"), (13, "EF", "GH"), (14, "IJ", "KL")]
I want to get the index of the group that has the value 13 in it:
if 13 in list1[0]:
    idx = list1.index(13)
    item = list1[idx]
    print str(item)
[13, EF, GH]
When I try this, I keep getting "Index not in list", even though it is passing the if statement because it is finding the value 13 within the list.
You can use next and enumerate:
>>> list1 = [(12, "AB", "CD"), (13, "EF", "GH"), (14, "IJ", "KL")]
>>> next(i for i,x in enumerate(list1) if 13 in x)
1
With a simple for-loop:
for i, item in enumerate(list1):
    if 13 in item:
        print i
        break
...
1
Update:
If the first item in each tuple is unique and you're doing this multiple times, then create a dict first. Dicts provide O(1) lookup while lists are O(N).
>>> list1 = [(12, "AB", "CD"), (13, "EF", "GH"), (14, "IJ", "KL")]
>>> dic = {x[0]:x[1:] for x in list1}
Accessing items:
>>> dic[12]
('AB', 'CD')
>>> dic[14]
('IJ', 'KL')
# checking key existence
>>> if 17 in dic:
...     pass  # the key exists in dic, so do something with it
...
Given the added criterion from the comment, "I really don't care where they are in the list", the task becomes much easier and far more obvious:
def get_ids(id, tuple_list):
"""returns members from tuple_list whose first element is id"""
return [x for x in tuple_list if x[0] == id]
This isn't as expensive as one might expect if you recall that tuples are immutable objects. When the interpreter builds the new list, it only contains the internal ids (references) of the tuples of interest. This is in keeping with the original question asking for a list of indices. List comprehensions as used here are an efficient way of constructing new lists, since much of the work is done internally by the interpreter. In short, many intuitions about performance from C-like languages don't apply well to Python.
As Ashwini noted, if the id numbers in the tuples are unique, and you are making multiple queries, then a dictionary might be a more suitable structure. Even if the id numbers aren't unique, you could use a dictionary of lists of tuples, but it is best to do the clearest thing first and not guess at the performance in advance.
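For instance, if the ids can repeat but you still want dictionary lookups, one sketch (assuming the same list1 as above) is a dict mapping each id to the list of matching tuples:
from collections import defaultdict

by_id = defaultdict(list)
for tup in list1:
    by_id[tup[0]].append(tup)  # group every tuple under its first element

by_id[13]  # [(13, 'EF', 'GH')] -- all tuples whose first element is 13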
As with the dictionary example, because an empty list is "falsey" in Python, you can use the same sort of conditional:
hits = get_ids(13, list1)
if hits:
    pass  # we got at least one tuple back
else:
    pass  # no 13s to be had