Groupby complex function - python

I have an iterator containing strings:
it = (_ for _ in ['aaxbb', 'aayybb', 'aaaaaaabb', 'ccabcavabb', 'yyaaadbb', 'yyaabb', 'a'])
I want to group these strings if they have the same first two and last two characters. The end result of the groupby in the above example should be:
[['aaxbb', 'aayybb', 'aaaaaaabb'],
['ccabcavabb'],
['yyaaadbb', 'yyaabb'],
['a']]
Can this complex groupby be achieved using itertools.groupby?

Not complex at all, just return a tuple of the first and last two characters:
lambda v: (v[:2], v[-2:])
or, if you want to use operator.itemgetter():
from operator import itemgetter
itemgetter(slice(2), slice(-2, None))
Demo:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> sample = ['aaxbb', 'aayybb', 'aaaaaaabb', 'ccabcavabb', 'yyaaadbb', 'yyaabb', 'a']
>>> for key, group in groupby(sample, lambda v: (v[:2], v[-2:])):
...     print(list(group))
...
['aaxbb', 'aayybb', 'aaaaaaabb']
['ccabcavabb']
['yyaaadbb', 'yyaabb']
['a']
>>> for key, group in groupby(sample, itemgetter(slice(2), slice(-2, None))):
...     print(list(group))
...
['aaxbb', 'aayybb', 'aaaaaaabb']
['ccabcavabb']
['yyaaadbb', 'yyaabb']
['a']

groupby only groups consecutive items, so the input generally has to be sorted with the same key function before it is grouped. In this specific example all the items that belong to a group already appear consecutively, so sorting is not needed here, but that cannot be relied on in general.
See the note in the Python documentation on this:
https://docs.python.org/2/library/itertools.html
"The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order."
from itertools import groupby
sample = ['aaxbb', 'aayybb', 'aaaaaaabb', 'ccabcavabb', 'yyaaadbb', 'yyaabb', 'a', 'aaxxbb']
key = lambda x: x[:2] + x[-2:]  # sort and group with the same key function
print([list(group) for _, group in groupby(sorted(sample, key=key), key)])
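To make the effect of input order concrete, here is a minimal sketch comparing groupby on unsorted and on sorted input. The shortened sample list and the helper name `ends_key` are illustrative assumptions, not part of the question:

```python
from itertools import groupby

def ends_key(s):
    # Key: the first two and the last two characters of the string.
    return (s[:2], s[-2:])

# 'aaxxbb' is deliberately placed away from the other 'aa...bb' strings.
sample = ['aaxbb', 'aayybb', 'ccabcavabb', 'aaxxbb']

# Without sorting, the two separate runs of ('aa', 'bb') become two groups.
unsorted_groups = [list(g) for _, g in groupby(sample, ends_key)]

# Sorting with the same key function makes equal keys adjacent first.
sorted_groups = [list(g) for _, g in groupby(sorted(sample, key=ends_key), ends_key)]

print(unsorted_groups)  # [['aaxbb', 'aayybb'], ['ccabcavabb'], ['aaxxbb']]
print(sorted_groups)    # [['aaxbb', 'aayybb', 'aaxxbb'], ['ccabcavabb']]
```

Because sorted() is stable, strings that share a key keep their original relative order inside each group.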

Related

Group dictionary values together based on a key

I am looking to group values in an input set together with the first element in a tuple acting as the key. The second elements need to be grouped together to a list based on the common key. Output needs to be a list with tuples.
# Input set
values = {(304008, 2020.0), (304008, 2017.0), (250128, 2020.0), (93646, 2020.0), (93646, 2017.0)}
# Current workflow
keys = {i[0] for i in values}
id_dict = dict()
for k in keys:
    id_dict[k] = [int(i[1]) for i in values if i[0] == k]
lst2 = list(id_dict.items())
# Expected output
# [(250128, [2020]), (304008, [2017, 2020]), (93646, [2020, 2017])]
I have the expected output, but the whole process is too slow. I am looking to make it faster. I was looking at groupby functions, but I can't seem to make them work.
You can use itertools.groupby to accomplish this. Sort the values first (groupby only merges adjacent items), group by the first element of each tuple, then make a list of the second elements in each group.
>>> from itertools import groupby
>>> [(k, [i[1] for i in g]) for k, g in groupby(sorted(values), key=lambda i: i[0])]
[(93646, [2017.0, 2020.0]), (250128, [2020.0]), (304008, [2017.0, 2020.0])]
You can use setdefault to build a dict keyed on the first item of each tuple, populating it in a single pass over the set.
Then use the list constructor to get the required list. See below:
>>> values = {(304008, 2020.0), (304008, 2017.0), (250128, 2020.0), (93646, 2020.0), (93646, 2017.0)}
>>> info = {}
>>> for elements in values:
...     info.setdefault(elements[0], []).append(elements[1])
...
>>> list(info.items())
[(304008, [2017.0, 2020.0]), (93646, [2017.0, 2020.0]), (250128, [2020.0])]
>>>
This does not use groupby but avoids your second loop.
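The same single-pass idea can also be sketched with collections.defaultdict, converting the years to int as in the question's expected output. The names `years_by_id`, `item_id` and `year` are illustrative, and the final sort is only there to make the output reproducible, since sets have no defined order:

```python
from collections import defaultdict

values = {(304008, 2020.0), (304008, 2017.0), (250128, 2020.0),
          (93646, 2020.0), (93646, 2017.0)}

# One pass over the input instead of one scan of the whole set per key.
years_by_id = defaultdict(list)
for item_id, year in values:
    years_by_id[item_id].append(int(year))

# Sort the ids and each list of years so the result is deterministic.
result = sorted((item_id, sorted(years)) for item_id, years in years_by_id.items())
print(result)  # [(93646, [2017, 2020]), (250128, [2020]), (304008, [2017, 2020])]
```

The question's workflow rescans every value for every key, so its cost grows with keys times values; the loop above touches each tuple once.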

sorting a list by names in python

I have a list of filenames. I need to group them based on the ending names after underscore ( _ ). My list looks something like this:
[
'1_result1.txt',
'2_result2.txt',
'3_result2.txt',
'4_result3.txt',
'5_result4.txt',
'6_result1.txt',
'7_result2.txt',
'8_result3.txt',
]
My end result should be:
List1 = ['1_result1.txt', '6_result1.txt']
List2 = ['2_result2.txt', '3_result2.txt', '7_result2.txt']
List3 = ['4_result3.txt', '8_result3.txt']
List4 = ['5_result4.txt']
This will come down to making a dictionary of lists, then iterating the input and adding each item to its proper list:
output = {}
for item in inlist:
    output.setdefault(item.split("_")[1], []).append(item)
print(list(output.values()))
We use setdefault to make sure there's a list for the entry, then add our current filename to the list. output.values() will return just the lists, not the entire dictionary, which appears to be what you want.
Using defaultdict from the collections module:
from collections import defaultdict
output = defaultdict(list)
for file in data:
    # group each filename under the part after the underscore
    output[file.split("_")[1]].append(file)
print(list(output.values()))
Using groupby from the itertools module:
from itertools import groupby
data.sort(key=lambda x: x.split('_')[1])
for key, group in groupby(data, lambda x: x.split('_')[1]):
    print(list(group))
Starting with Python 2.4, both list.sort() and sorted() added a key parameter to specify a function to be called on each list element prior to making comparisons.
The value of the key parameter should be a function that takes a single argument and returns a key to use for sorting purposes. This technique is fast because the key function is called exactly once for each input record.
So if l is the name of your list then you could use something like :
l.sort(key=lambda s: s.split('_')[1])
More information about key functions is in the Python Sorting HOWTO.
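Putting the key function and groupby together, a self-contained sketch for the filenames in the question might look like this. The helper name `suffix` and the variable names are assumptions made for illustration:

```python
from itertools import groupby

def suffix(name):
    # Part after the underscore, e.g. 'result1.txt' for '1_result1.txt'.
    return name.split('_')[1]

files = ['1_result1.txt', '2_result2.txt', '3_result2.txt', '4_result3.txt',
         '5_result4.txt', '6_result1.txt', '7_result2.txt', '8_result3.txt']

# sorted() returns a new list and leaves `files` untouched; the sort is
# stable, so files keep their original order within each group.
grouped = [list(g) for _, g in groupby(sorted(files, key=suffix), key=suffix)]

for group in grouped:
    print(group)
```

Using sorted() rather than list.sort() avoids modifying the caller's list, which may or may not matter in your case.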

How to sort a dictionary by two elements, reversing only one

Consider the following dictionary:
data = {'A':{'total':3},
'B':{'total':5},
'C':{'total':0},
'D':{'total':0},
}
The desired order for above is B, A, C, D. Order by descending total, then by ascending key.
When I call sorted(data, key=lambda x: (data[x]['total'], x), reverse=True)
I get B,A,D,C because reverse is called on both keys.
Is there an efficient way to solve this?
Sort on the negative total; that puts the totals in descending order without having to use reverse=True. Ties are then broken on the key in forward order:
sorted(data, key=lambda x: (-data[x]['total'], x))
Demo:
>>> data = {'A':{'total':3},
... 'B':{'total':5},
... 'C':{'total':0},
... 'D':{'total':0},
... }
>>> sorted(data, key=lambda x: (-data[x]['total'], x))
['B', 'A', 'C', 'D']
This trick only works for numeric components in a sort key; if you have multiple keys that require a sort direction change that are not numeric, you'd have to do a multi-pass sort (sort multiple times, from last key to first):
# when you can't take advantage of numerical values to reverse on
# you need to sort repeatedly from last key to first.
# Here, sort forward by dict key, then in reverse by total
bykey = sorted(data)
final = sorted(bykey, key=lambda x: data[x]['total'], reverse=True)
This works because the Python sort algorithm is stable; two elements keep their relative positions if the current sort key result is equal for those two elements.
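As an illustration of the multi-pass approach with keys that cannot be negated, here is a small sketch using made-up (department, name) records, sorted by department descending and then by name ascending. The data is hypothetical:

```python
# Hypothetical records: (department, name). Neither field is numeric,
# so the negation trick is not available.
records = [('sales', 'Bob'), ('ops', 'Alice'), ('sales', 'Alice'), ('ops', 'Carol')]

# Pass 1: sort by the secondary key (name, ascending).
by_name = sorted(records, key=lambda r: r[1])

# Pass 2: sort by the primary key (department, descending). The sort is
# stable, so names stay in ascending order within each department.
result = sorted(by_name, key=lambda r: r[0], reverse=True)

print(result)
# [('sales', 'Alice'), ('sales', 'Bob'), ('ops', 'Alice'), ('ops', 'Carol')]
```

Note that reverse=True still preserves the original relative order of elements with equal keys, which is what makes the second pass safe.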

Combine Python List Elements Based On Another List

I have 2 lists:
phon = ["A","R","K","H"]
idx = [1,2,3,3]
idx corresponds to how phon should be grouped. In this case, phon_grouped should be ["A","R","KH"] because both "K" and "H" correspond to group 3.
I'm assuming some sort of zip or map function is required, but I'm not sure how to implement it. I have something like:
a = []
for i in enumerate(phon):
a[idx[i-1].append(phon[i])
but this does not actually work/compile
Use zip() and itertools.groupby() to group the output after zipping:
from itertools import groupby
from operator import itemgetter
result = [''.join([c for i, c in group])
for key, group in groupby(zip(idx, phon), itemgetter(0))]
itertools.groupby() requires that your input is already sorted on the key (your idx values here).
zip() pairs up the indices from idx with characters from phon.
itertools.groupby() groups the resulting tuples on the first value, the index. Equal index values put the tuples into the same group.
The list comprehension then picks the characters from each group again and joins them into strings.
Demo:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> phon = ["A","R","K","H"]
>>> idx = [1,2,3,3]
>>> [''.join([c for i, c in group]) for key, group in groupby(zip(idx, phon), itemgetter(0))]
['A', 'R', 'KH']
If you don't want to import anything extra:
phon = ["A","R","K","H"]
idx = [1,2,3,3]
a = [[] for i in range(idx[-1])]  # one empty list per group; idx is sorted, so idx[-1] is the max
for data, place in enumerate(idx):  # data: position in phon, place: 1-based group number
    a[place-1].append(phon[data])
# a is now [['A'], ['R'], ['K', 'H']]
Mainly the trick is to just pre-initialize your list. You know the final list will have as many entries as the largest number in idx, which should be the last number, as you said idx is sorted.
Not sure if you wanted the end result to be an appended list, or concatenated characters, i.e. "KH" vs ['K', 'H']
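If idx might not be sorted or contiguous, a dictionary-based variant avoids both the sorting requirement of groupby and the pre-sized list. This is only a sketch; the names `groups`, `as_lists` and `as_strings` are illustrative, and both output shapes are shown since it is unclear which one is wanted:

```python
from collections import OrderedDict

phon = ["A", "R", "K", "H"]
idx = [1, 2, 3, 3]

# OrderedDict keeps the groups in the order their index is first seen.
groups = OrderedDict()
for place, char in zip(idx, phon):
    groups.setdefault(place, []).append(char)

as_lists = list(groups.values())                      # [['A'], ['R'], ['K', 'H']]
as_strings = [''.join(chars) for chars in as_lists]   # ['A', 'R', 'KH']

print(as_lists)
print(as_strings)
```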

groupby iterator not adding to list in dictionary comprehension

I have a db query that returns a list. I then do a dictionary comprehension like so:
results = {product: [g for g in group] for product, group in groupby(db_results, lambda x: x.product_id)}
The problem is that the value of the dictionary is only returning 1 value. I assume this is due to the fact that the group is an iterator.
The following returns each item of the group, so I know that they are there:
groups = groupby(db_results, lambda x: x.product_id)
for k, g in groups:
    if k == 1001:
        print(list(g))
I am trying to get all the values of g in the above in a list whose key is the key of dictionary.
I've tried many variations like:
blah = dict((k,list(v)) for k,v in groupby(db_results, key=lambda x: x.product_id))
but I can't get it right.
If you insist on using groupby, then you need to make sure that the input is sorted by the same key that you group on. However, I would suggest that you use defaultdict instead:
from collections import defaultdict
blah = defaultdict(list)
for item in db_results:
    blah[item.product_id].append(item)
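One plausible explanation for the missing values, assuming the query results are not ordered by product_id, is that the same key shows up in several separate runs and each later run overwrites the earlier entry in the dict. The sketch below uses a hypothetical `Row` namedtuple as a stand-in for the real database rows:

```python
from collections import defaultdict, namedtuple
from itertools import groupby

# Hypothetical stand-in for the rows returned by the db query.
Row = namedtuple('Row', ['product_id', 'name'])
db_results = [Row(1001, 'a'), Row(1002, 'b'), Row(1001, 'c')]

def product_key(row):
    return row.product_id

# Unsorted input: 1001 appears in two runs, and the second run
# replaces the first one in the resulting dict.
lossy = {k: list(g) for k, g in groupby(db_results, product_key)}

# Sorting by the same key first keeps all rows of a product together.
complete = {k: list(g)
            for k, g in groupby(sorted(db_results, key=product_key), product_key)}

# defaultdict needs no sorting at all.
by_product = defaultdict(list)
for row in db_results:
    by_product[row.product_id].append(row)

print([r.name for r in lossy[1001]])       # ['c']
print([r.name for r in complete[1001]])    # ['a', 'c']
print([r.name for r in by_product[1001]])  # ['a', 'c']
```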
