I have:
([(5,2),(7,2)],[(5,1),(7,3),(11,1)])
I need to add the second elements having the same first element.
output:[(5,3),(7,5),(11,1)]
This is a great use-case for collections.Counter...
from collections import Counter
tup = ([(5,2),(7,2)], [(5,1),(7,3),(11,1)])
counts = sum((Counter(dict(sublist)) for sublist in tup), Counter())
result = list(counts.items())
print(result)
One downside here is that you'll lose the order of the inputs. They appear to be sorted by the key, so you could just sort the items:
result = sorted(counts.items())
A Counter is a dictionary whose purpose is to hold the "counts" of bins. Counters are designed so that you can simply add them together, which adds the counts "bin-wise" (if a bin isn't present in both Counters, the missing bin's value is assumed to be 0). That explains why we can use sum on a bunch of Counters to get a dictionary that has the values you want. Unfortunately for this solution, a Counter can't be instantiated from an iterable that yields 2-item sequences the way normal mappings can. For example,
Counter([(1, 2), (3, 4)])
would create a Counter with keys (1, 2) and (3, 4) -- Both values would be 1. It does work as expected if you create it with a mapping however:
Counter(dict([(1, 2), (3, 4)]))
creates a Counter with keys 1 and 3 (and values 2 and 4).
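To make the bin-wise addition concrete, here is a minimal sketch using the inputs from the question. It also shows one caveat worth knowing: the + operator on Counters keeps only positive totals, so this approach assumes the summed values stay above zero.

```python
from collections import Counter

a = Counter({5: 2, 7: 2})
b = Counter({5: 1, 7: 3, 11: 1})

# Addition works bin-wise; 11 is missing from `a`, so it counts as 0 there.
total = a + b
print(sorted(total.items()))  # [(5, 3), (7, 5), (11, 1)]

# Caveat: `+` discards any bin whose total is zero or negative.
cancelled = Counter({5: 2}) + Counter({5: -2})
print(cancelled)  # Counter()
```

If the data can contain negative values, a plain dictionary loop is probably the safer choice.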
Try this code (brute force, maybe):
dt = {}
tp = ([(5,2),(7,2)],[(5,1),(7,3),(11,1)])
for ls in tp:
    for t in ls:
        # add to the running total for this key, or start it if the key is new
        dt[t[0]] = dt[t[0]] + t[1] if t[0] in dt else t[1]
print(list(dt.items()))
The approach taken here is to loop through the list of tuples and store the tuple's data as a dictionary, wherein the 1st element in the tuple t[0] is the key and the 2nd element t[1] is the value.
Upon iteration, every time the same key is found in the tuple's 1st element, add the value with the tuple's 2nd element. In the end, we will have a dictionary dt with all the key, value pairs as required. Convert this dictionary to list of tuples dt.items() and we have our output.
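The same loop can be sketched with collections.defaultdict, which removes the explicit "is the key already there" check. The helper name sum_by_key is hypothetical and only used for illustration.

```python
from collections import defaultdict


def sum_by_key(groups):
    """Sum the second elements of (key, value) pairs that share a key.

    `sum_by_key` is a hypothetical helper name, not part of the answer above.
    """
    totals = defaultdict(int)  # missing keys start at 0
    for pairs in groups:
        for key, value in pairs:
            totals[key] += value
    return list(totals.items())


tp = ([(5, 2), (7, 2)], [(5, 1), (7, 3), (11, 1)])
print(sum_by_key(tp))  # [(5, 3), (7, 5), (11, 1)]
```

On Python 3.7+ the result follows the order in which keys were first seen, since dictionaries preserve insertion order.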
I have a 2D array where each element is a pair of two tags, like ["NOUN", "VERB"], and I want to count the number of times each of these unique pairs occurs in a large dataset.
So far I have tried using defaultdict(int) and Counter() to easily just add the element if previously not found, or if found increase the value by 1.
dTransition = Counter()
# dTransition = defaultdict(int)
# <s> is a start of sentence tag
pairs = [[('<s>', 'NOUN')], [('CCONJ', 'NOUN')], [('NOUN', 'SCONJ')], [('SCONJ', 'NOUN')]]
for pair in pairs:
    dTransition[pairs] += 1
This does not work as it does not accept two arguments. So I'm wondering if there is an easy way to check whether a key that is a 2D array already exists in the dictionary, and if so increase its value by 1.
You need to flatten your list, given that unlike lists, tuples are hashable. A simple option is using itertools.chain and then building a Counter with the list of tuples:
from collections import Counter
from itertools import chain
Counter(chain(*pairs))
Output
Counter({('<s>', 'NOUN'): 1, ('CCONJ', 'NOUN'): 1,
('NOUN', 'SCONJ'): 1, ('SCONJ', 'NOUN'): 1})
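As a small variation on the above, chain.from_iterable avoids unpacking a large list into function arguments. The sketch below uses a shortened sample of the question's data with one pair repeated, so the counting is visible.

```python
from collections import Counter
from itertools import chain

# Sample data (an assumption for illustration): one pair appears twice.
pairs = [[('<s>', 'NOUN')], [('CCONJ', 'NOUN')], [('SCONJ', 'NOUN')], [('SCONJ', 'NOUN')]]

# Flatten the one-element inner lists lazily, then count the hashable tuples.
transitions = Counter(chain.from_iterable(pairs))

print(transitions[('SCONJ', 'NOUN')])  # 2
print(transitions[('NOUN', 'VERB')])   # 0, unseen keys count as zero
```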
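You can also do this with NumPy's built-in np.unique function.
import numpy as np
# convert the nested list to an array with one (tag, tag) pair per row
pairs = np.array(pairs).reshape(-1, 2)
# np.unique(..., axis=0) returns the unique rows;
# return_counts=True also returns how often each row occurs
unique_pairs, counts = np.unique(pairs, axis=0, return_counts=True)
# len() returns the number of distinct pairs
count = len(unique_pairs)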
Your solution with defaultdict was correct, but you have to use the two values as a tuple for the dictionary key. In your example, that tuple is always the first element of each inner list:
import collections
dTransition = collections.defaultdict(int)
# <s> is a start of sentence tag
pairs = [[('<s>', 'NOUN')], [('CCONJ', 'NOUN')], [('NOUN', 'SCONJ')], [('SCONJ', 'NOUN')],[('SCONJ', 'NOUN')]]
for pair in pairs:
    dTransition[pair[0]] += 1
Then it works.
I have a list of tuples that can be understood as key-value pairs, where a key can appear several times, possibly with different values, for example
[(2,8),(5,10),(2,5),(3,4),(5,50)]
I now want to get a list of tuples with the highest value for each key, i.e.
[(2,8),(3,4),(5,50)]
The order of the keys is irrelevant.
How do I do that in an efficient way?
Sort them and then cast to a dictionary and take the items again from it:
l = [(2,8),(5,10),(2,5),(3,4),(5,50)]
list(dict(sorted(l)).items()) #python3, if python2 list cast is not needed
[(2, 8), (3, 4), (5, 50)]
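The idea is that sorting puts the pairs for each key in ascending order of value. When the sorted list is converted to a dictionary, later pairs overwrite earlier ones with the same key, so only the highest value for each key survives. Then you just have to take the items back out as tuples.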
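Splitting the one-liner into its two steps shows where the overwriting happens. This is only an illustration of the answer above, using the list from the question.

```python
pairs = [(2, 8), (5, 10), (2, 5), (3, 4), (5, 50)]

# Step 1: tuples sort by key first, then by value within each key.
ordered = sorted(pairs)
print(ordered)  # [(2, 5), (2, 8), (3, 4), (5, 10), (5, 50)]

# Step 2: dict() keeps the last value seen for each key, i.e. the largest.
best = dict(ordered)
print(best)  # {2: 8, 3: 4, 5: 50}
```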
At its core, this problem is essentially about grouping the tuples based on their first element and then keeping only the maximum of each group.
Grouping can be done easily with a defaultdict. A detailed explanation of grouping with defaultdicts can be found in my answer here. In your case, we group the tuples by their first element and then use the max function to find the tuple with the largest number.
import collections
tuples = [(2,8),(5,10),(2,5),(3,4),(5,50)]
groupdict = collections.defaultdict(list)
for tup in tuples:
    group = tup[0]
    groupdict[group].append(tup)
result = [max(group) for group in groupdict.values()]
# result: [(2, 8), (5, 50), (3, 4)]
In your particular case, we can optimize the code a little bit by storing only the maximum 2nd element in the dict, rather than storing a list of all tuples and finding the maximum at the end:
tuples = [(2,8),(5,10),(2,5),(3,4),(5,50)]
groupdict = {}
for tup in tuples:
    group, value = tup
    if group in groupdict:
        groupdict[group] = max(groupdict[group], value)
    else:
        groupdict[group] = value
result = [(group, value) for group, value in groupdict.items()]
This keeps the memory footprint to a minimum, but only works for tuples with exactly 2 elements.
This has a number of advantages over Netwave's solution:
It's more readable. Anyone who sees a defaultdict being instantiated knows that it'll be used to group data, and the use of the max function makes it easy to understand which tuples are kept. Netwave's one-liner is clever, but clever solutions are rarely easy to read.
Since the data doesn't have to be sorted, this runs in linear O(n) time instead of O(n log n).
A few questions on the code below, which checks whether a list is sorted:
Why did we use a lambda as the key here? Does it always mean the key of a list can be derived this way?
In the enumerate loop, why did we compare key(el) < key(lst[i]) and not key(el) < key(el-1) or lst[i+1] < lst[i]?
def is_sorted(lst, key=lambda x:x):
    for i, el in enumerate(lst[1:]):
        if key(el) < key(lst[i]): # i is the index of the previous element
            return False
    return True
hh=[1,2,3,4,6]
val = is_sorted(hh)
print(val)
(NB: the code above was taken from this SO answer)
This code scans a list to see if it is sorted low to high. The first problem is to decide what "low" and "high" mean for arbitrary types. It's easy for integers, but what about user-defined types? So, the author lets you pass in a function that converts an item to something whose comparison works the way you want.
For instance, let's say you want to sort tuples based on the 3rd item, which you know to be an integer: the key would be key=lambda x: x[2]. But the author provides a default, key=lambda x:x, which just returns the object it's given, for items that are already their own sort key.
The second part is easy. If any item is less than the item just before it, then we have found a place where the list is not low to high. The reason it works is literally in the comment: i is the index of the element directly preceding el. We know this because we enumerated over the second and following elements of the list (enumerate(lst[1:])).
enumerate yields both index and current element:
for i, el in enumerate(lst):
    print(i,el)
would print:
0 1
1 2
2 3
3 4
4 6
By slicing off the first element, the code introduces a shift of one between the index and the current element, so lst[i] is the element just before el. This way only one index lookup is needed per iteration (indexing into a list while iterating over all of it is generally not considered Pythonic).
It is more Pythonic still to zip the list with a sliced version of itself and pass the comparison to all. No indices are involved and the code is clearer:
import itertools
def is_sorted(lst, key=lambda x:x):
    # every element must be >= the element just before it
    return all(key(prev) <= key(current) for prev, current in zip(lst, itertools.islice(lst, 1, None)))
The slicing being done by islice, no extra list is generated (otherwise it's the same as lst[1:])
The key function (here: identity function by default) is the function which converts from the value to the comparable value. For integers, identity is okay, unless we want to reverse comparison, in which case we would pass lambda x:-x
The point is not that the lambda "derives" the key of a list. Rather, it's a function that allows you to choose the key. That is, given a list of objects of type X, what attribute would you use to compare them with? The default is the identity function - ie use the plain value of each element. But you could choose anything here.
You could indeed write this function by comparing lst[i+1] < lst[i]. You couldn't however write it by comparing key(el) < key(el-1), because el is the value of the element itself, not the index.
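As a rough sketch of the index-based alternative mentioned above (the function name is_sorted_by_index is hypothetical), comparing lst[i + 1] with lst[i] looks like this. The last call also shows how a key of lambda x: -x reverses the comparison for numbers.

```python
def is_sorted_by_index(lst, key=lambda x: x):
    """Index-based variant: compare each element with the one before it.

    `is_sorted_by_index` is a hypothetical name used for illustration.
    """
    for i in range(len(lst) - 1):
        if key(lst[i + 1]) < key(lst[i]):
            return False
    return True


print(is_sorted_by_index([1, 2, 3, 4, 6]))               # True
print(is_sorted_by_index([3, 2, 1]))                     # False
print(is_sorted_by_index([3, 2, 1], key=lambda x: -x))   # True (descending order)
```

Note that key(el - 1) would subtract 1 from the element's value, which is why it cannot stand in for "the previous element".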
This function tests whether a list is sorted, and it is modelled on the built-in sorted function. sorted takes a keyword argument key, which is applied to every element of the list to compute the value used for comparison:
>>> sorted([(0,3),(1,2),(2,1),(3,0)])
[(0, 3), (1, 2), (2, 1), (3, 0)]
>>> sorted([(0,3),(1,2),(2,1),(3,0)],key=lambda x:x[1])
[(3, 0), (2, 1), (1, 2), (0, 3)]
The key keyword in your function is to be able to mimic the behavior of sorted:
>>> is_sorted([(0,3),(1,2),(2,1),(3,0)])
True
>>> is_sorted([(0,3),(1,2),(2,1),(3,0)],key=lambda x:x[1])
False
The default lambda is just there to mimic a default behavior where nothing is changed.
Let's assume that there is a dictionary like this one:
lst = {(1,1):2, (1,2):5, (1,3):10, (1,4):14, (1,6):22}
I want a simple (and the most efficient) function that returns the dictionary key whose value is the maximum.
For example:
key_for_max_value_in_dict(lst) = (1,6)
because the tuple (1,6) has the largest value (22).
I came up with this code which might be the most efficient one:
max(lst, key=lambda x: lst[x])
Use a comprehension for that like:
Code:
max((v, k) for k, v in lst.items())[1]
How does it work?
Iterate over the items() in the dict, and emit them as tuples of (value, key) with the value first in the tuple. max() can then find the largest value, because tuples sort by each element in the tuple, with first element matching first element. Then take the second element ([1]) of the max tuple since it is the key value for the max value in the dict.
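One practical difference between the (value, key) approach and max with a key function shows up when two keys share the maximum value. The sketch below uses a made-up dictionary named scores to illustrate it, assuming Python 3.7+ where dictionaries keep insertion order.

```python
# Hypothetical data: "a" and "b" tie for the largest value.
scores = {"a": 3, "b": 3, "c": 1}

# Tuples compare element by element, so a tie on the value is broken
# by comparing the keys: (3, "b") is greater than (3, "a").
by_tuple = max((v, k) for k, v in scores.items())[1]

# max() with a key function returns the first maximal item it encounters.
by_key_func = max(scores, key=scores.get)

print(by_tuple)     # b
print(by_key_func)  # a
```

The tuple approach therefore also needs the keys themselves to be comparable whenever values tie.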
Test Code:
lst = {(1,1):2, (1,2):5, (1,3):10, (1,4):14, (1,6):22}
print(max((v, k) for k, v in lst.items())[1])
Results:
(1, 6)
Assuming you're using a regular unsorted dictionary, you'll need to walk through the entire thing once. Keep track of the largest value seen so far and reset the list of keys whenever you see a larger one. If a value equals the current largest, append its key to the list.
largest_key = []
# start below any possible value so that negative values are handled too
largest_value = float("-inf")
for key, value in lst.items():
    if value > largest_value:
        largest_value = value
        largest_key = [key]
    elif value == largest_value:
        largest_key.append(key)
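If collecting every key that ties for the maximum is the goal, a two-pass version is a possible alternative sketch. It walks the dictionary twice but needs no starting value; the helper name keys_with_max_value is hypothetical.

```python
def keys_with_max_value(d):
    """Return all keys whose value equals the largest value in `d`.

    `keys_with_max_value` is a hypothetical helper name for illustration.
    """
    if not d:
        return []  # max() would raise ValueError on an empty dictionary
    largest = max(d.values())
    return [key for key, value in d.items() if value == largest]


lst = {(1, 1): 2, (1, 2): 5, (1, 3): 10, (1, 4): 14, (1, 6): 22}
print(keys_with_max_value(lst))                 # [(1, 6)]
print(keys_with_max_value({"x": -1, "y": -1}))  # ['x', 'y']
```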
I have a dictionary in the form:
{"a": (1, 0.1), "b": (2, 0.2), ...}
Each tuple corresponds to (score, standard deviation).
How can I take the average of just the first integer in each tuple?
I've tried this:
for word in d:
    (score, std) = d[word]
    d[word]=float(score),float(std)
    if word in string:
        number = len(string)
        v = sum(score)
        return (v) / number
I get this error:
v = sum(score)
TypeError: 'int' object is not iterable
It's easy to do using list comprehensions. First, you can get all the dictionary values from d.values(). To make a list of just the first item in each value you make a list like [v[0] for v in d.values()]. Then, just take the sum of those elements, and divide by the number of items in the dictionary:
sum([v[0] for v in d.values()]) / float(len(d))
As Pedro rightly points out, this actually creates the list, and then does the sum. If you have a huge dictionary, this might take up a bunch of memory and be inefficient, so you would want a generator expression instead of a list comprehension. In this case, that just means getting rid of one pair of brackets:
sum(v[0] for v in d.values()) / float(len(d))
The two methods are compared in another question.
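On Python 3, the standard library's statistics.mean is another option. The sketch below uses a two-entry sample dictionary (an assumption, since the question's full data is not shown) and unpacks each tuple so the intent is explicit.

```python
from statistics import mean

# Sample data in the (score, standard deviation) form from the question.
d = {"a": (1, 0.1), "b": (2, 0.2)}

# Unpack each value and average only the scores.
average_score = mean(score for score, _std in d.values())
print(average_score)  # 1.5
```

Note that mean raises statistics.StatisticsError when given no data, so an empty dictionary needs separate handling.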