GroupBy All possible permutations - python

Example dataset columns: ["A","B","C","D","num1","num2"]. So I have 6 columns - first 4 for grouping and last 2 are numeric and means will be calculated based on groupBy statements.
I want to groupBy all possible combinations of the 4 grouping columns.
I wish to avoid explicitly typing all possible groupBy's such as groupBy["A","B","C","D"] then groupBy["A","B","D","C"] etc.
I'm new to Python - in python how can I automate group by in a loop so that it does a groupBy calc for all possible combinations - in this case 4*3*2*1 = 24 combinations?
Ta.
Thanks for your help so far. Any idea why the 'a =' part isn't working?
import itertools
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(100, 5)), columns=list('ABCDE'))
group_by_vars = list(df.columns)[0:4]
perms = [perm for perm in itertools.permutations(group_by_vars)]
print list(itertools.combinations(group_by_vars,2))
a = [x for x in itertools.combinations(group_by_vars,group_by_n+1) for group_by_n in range(len(group_by_vars))]
a doesn't error I just get an empty object. Why???
Something like [comb for comb in itertools.combinations(group_by_vars,2)] is easy enough but how to get a = [x for x in itertools.combinations(group_by_vars,group_by_n+1) for group_by_n in range(len(group_by_vars))]??

When you group by ['A', 'B', 'C', 'D'] and calculate the mean, you'll get one particular group (a0, b0, c0, d0) with a mean of m0.
When you permute the columns and group by ['A', 'B', 'D', 'C'], you'll get one particular group (a0, b0, d0, c0) with a mean of m0.
In fact, those m0 values are the same. All the groups are the same. You would be repeating exactly the same calculations for every permutation; you just get 4! ways of ordering the tuples. Why do that?
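To make that point concrete, here is a minimal pure-Python sketch. It assumes a hypothetical toy dataset with only two grouping columns, and `group_means` is an illustrative helper, not a pandas function. It groups by every ordering of the keys and checks that the means only differ in how the key tuple is ordered.

```python
from collections import defaultdict
from itertools import permutations
from statistics import mean

# Hypothetical toy rows: two grouping columns and one numeric column.
rows = [
    {"A": "x", "B": "p", "num1": 1.0},
    {"A": "x", "B": "p", "num1": 3.0},
    {"A": "x", "B": "q", "num1": 5.0},
    {"A": "y", "B": "p", "num1": 7.0},
]

def group_means(rows, keys, value):
    """Mean of `value` for each distinct combination of `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {key: mean(vals) for key, vals in groups.items()}

# One result per ordering of the grouping columns.
results = {order: group_means(rows, order, "num1")
           for order in permutations(["A", "B"])}

# Ignore the order of labels inside each key (safe here because the two
# columns use different labels) and compare the (group, mean) pairs.
normalized = [
    {(frozenset(key), m) for key, m in means.items()}
    for means in results.values()
]
same = all(n == normalized[0] for n in normalized)
print(same)  # True
```

If different subsets of the grouping columns are what is actually needed, combinations of the column names are probably a better fit than permutations.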

from itertools import permutations
perms = [perm for perm in permutations(['A','B','C','D'])]
perms will then be a list of all the possible 24 permutations
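As for the 'a =' follow-up in the question: the for clauses of a comprehension run left to right, like nested for statements. In the posted version, group_by_n is used by the first clause before the second clause defines it, which is the likely cause of the unexpected result. A sketch with the clauses swapped (the name `combo` is only illustrative):

```python
from itertools import combinations

group_by_vars = ["A", "B", "C", "D"]

# The clause that defines group_by_n comes first; the clause that
# uses it comes after, exactly as in nested for statements.
a = [
    combo
    for group_by_n in range(len(group_by_vars))
    for combo in combinations(group_by_vars, group_by_n + 1)
]

print(len(a))  # 15 = 4 + 6 + 4 + 1
print(a[:5])
```

Each tuple in a can then be turned into a list and passed to groupby.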

Related

How can I create words with a certain length consisting of a and b?

I want to create a list of all words with a certain length (for example 1 to 3) that consist of two letters. So my output would be:
a, b, aa, ab, ba, bb,....
but I am struggling to implement it recursively in python. What’s the right way to do this?
I combined both itertools and recursion in the following code:
from itertools import product,chain
ab = ['a', 'b']
def rec_prod(x):
    if x==1:
        return ab
    elif x==2:
        return list(product(ab, ab))
    else:
        return [tuple(chain((i[0],), i[1])) for i in product(ab, rec_prod(x-1))]
prod_range = lambda y: list(chain.from_iterable(rec_prod(j) for j in range(1, y+1)))
The first function recursively calculates all "words" of length x; the second one returns all words from length 1 to length y. It's a bit messy and not very efficient, but if you study the way I used recursion and the itertools functions I used (product and chain), I'm sure you will learn something useful from it.
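For comparison, a plain recursive sketch that uses no itertools and returns strings rather than tuples. The function names `words_of_length` and `words_up_to` are made up for this example.

```python
def words_of_length(alphabet, n):
    """All words of exactly n letters over alphabet, built recursively."""
    if n == 0:
        return [""]  # one word of length 0: the empty word
    shorter = words_of_length(alphabet, n - 1)
    # Put each letter in front of every shorter word.
    return [letter + rest for letter in alphabet for rest in shorter]

def words_up_to(alphabet, max_len):
    """All words of length 1..max_len, shortest first."""
    return [w for n in range(1, max_len + 1)
            for w in words_of_length(alphabet, n)]

print(words_up_to("ab", 2))  # ['a', 'b', 'aa', 'ab', 'ba', 'bb']
```

The number of words grows as 2 + 4 + 8 + ... for a two-letter alphabet, so this is only practical for small lengths.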
I believe you can use itertools for this, and it is a more pythonic way of approaching the problem.
letters = ['a','b','c']
import itertools
for perm in itertools.permutations(letters):
    letters.append(''.join(perm))
this will give you:
letters = ['a', 'b', 'c', 'abc', 'acb', 'bac', 'bca', 'cab', 'cba']
This might also help you: Python recursion permutations

partial intersection - multiple groups

I am not sure how to approach my problem, thus I haven't been able to see if it already exists (apologies in advance)
Group Item
A 1
A 2
A 3
B 1
B 3
C 1
D 2
D 3
I want to know all combinations of groups that share more than X items (2 in this example). And I want to know which items they share.
RESULT:
A-B: 2 (item 1 and item 3)
A-D: 2 (item 2 and item 3)
The list of groups and items is really long and the maximum number of item matches across groups is probably not more than 3-5.
NB More than 2 groups can have shared items - e.g. A-B-E: 3
So it's not sufficient to only compare two groups at a time. I need to compare all combination of groups.
My thoughts
First round: one pile of all groups - are at least two values shared amongst all?
Second round: All-1 group (all combinations)
Third round: All-2 groups (all combinations)
Until I reach the comparison between only two groups (all combinations).
However this seems super heavy performance-wise!! And I have no idea of how to do this.
What are your thoughts?
Thanks!
Unless you have additional information to restrict the search, I would just process all subsets (having size >= 2) of the set of unique groups.
For each subset, I would search the items belonging to all members of the set:
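from itertools import chain, combinations
# df is the Group/Item DataFrame from the question
a = df['Group'].unique()
for cols in chain(*(combinations(a, i) for i in range(2, len(a) + 1))):
    vals = df['Item'].unique()
    for col in cols:
        # keep only the items that this group also has
        vals = df.loc[(df.Group==col)&(df.Item.isin(vals)), 'Item'].unique()
    if len(vals) > 0: print(cols, vals)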
it gives:
('A', 'B') [1 3]
('A', 'C') [1]
('A', 'D') [2 3]
('B', 'C') [1]
('B', 'D') [3]
('A', 'B', 'C') [1]
('A', 'B', 'D') [3]
This is how I would approach the problem, it may not be the most efficient way to deal with it, but it has the merit to be clear.
List for each group, all items possessed by the group.
Then for each pair of group, list all shared items (for instance, for all items of group A, check if it is an item of group B).
Check if the number of shared items is higher than your threshold X.
It's not an off-the-shelf function, but it should be rather easy (or at least a good exercise) to implement.
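A minimal sketch of those steps, using plain sets and the example data from the question. It only covers pairs of groups, as described above, and treats the threshold as "at least X" because the expected result lists pairs with exactly 2 shared items. The names `items_by_group` and `shared` are illustrative.

```python
from collections import defaultdict
from itertools import combinations

# (group, item) rows from the question's example
pairs = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 3),
         ("C", 1), ("D", 2), ("D", 3)]

# Step 1: for each group, the set of items it has.
items_by_group = defaultdict(set)
for group, item in pairs:
    items_by_group[group].add(item)

# Steps 2 and 3: intersect every pair of groups and apply the threshold.
X = 2
shared = {}
for g1, g2 in combinations(sorted(items_by_group), 2):
    common = items_by_group[g1] & items_by_group[g2]
    if len(common) >= X:
        shared[(g1, g2)] = sorted(common)

print(shared)  # {('A', 'B'): [1, 3], ('A', 'D'): [2, 3]}
```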
Have fun !
Here is a new solution that works with all combinations.
Steps:
build a "grouped" series that lists, for each item, all the groups that item is in
from each row of grouped, get all possible combinations of groups that have some item in common
for each combination, count the items in "grouped" shared by all of its groups; if there are 2 or more, add the combination to the dictionary
Note: it only loops through group combinations that have at least one common item, so if you have lots of groups this already filters out a large part of the possible combinations.
import numpy as np
import pandas as pd
from itertools import combinations
d = {
    "Group": "A,A,A,B,B,C,D,D".split(","),
    "Item": [1,2,3,1,3,1,2,3]
}
df = pd.DataFrame(d)
# for each item, the list of groups that contain it
grouped = df.groupby("Item")["Group"].apply(list)
# every combination of 2 or more groups that share at least one item;
# range goes up to len(item) + 1 so the full set of groups is included,
# and a set is used so each combination is tested only once
all_combinations_with_common = {
    comb
    for item in grouped
    for i in range(2, len(item) + 1)
    for comb in combinations(sorted(item), i)
}
commons = {}
REPEAT_COUNT = 2
for comb in sorted(all_combinations_with_common):
    # True for each item that belongs to every group in comb
    items = grouped.apply(lambda x: np.all(np.isin(comb, x)))
    if sum(items)>=REPEAT_COUNT:
        commons["-".join(comb)] = grouped[items].index.values
print(commons)
output
{'A-B': array([1, 3]), 'A-D': array([2, 3])}

Why is my program returning the same permutations multiple times?

I have a simple code which I am using to try and get all combinations of 4 nucleotide bases, but only as sets of 3 because 3 nucleotides make up a codon. I basically need to generate all possible permutations that can be made by the 4 bases a, c, t and g, and put them in chunks of three, however at first the program seemed to work, and then as I looked at the result I realized it was repeating the permutations 5 times.
I only need all permutations once and I am not sure how to change my code to get this result. I am also very new to Python so I would really appreciate simple talk and jargon in any answers, thank you!
This is the code:
import itertools
bases = ['a','c','t','g']
for L in range(0, len(bases)+1):
    for subset in itertools.permutations(bases, 3):
        print(subset)
And the result I get looks right, but I just don't want it repeated 5 times:
('a', 'c', 't')
('a', 'c', 'g')
('a', 't', 'c').....
You're telling it to do it five times with that first for L in range(0, len(bases)+1): line.
Without that it works fine.
import itertools
bases = ['a','c','t','g']
for subset in itertools.permutations(bases, 3):
    print(subset)
The line for L in range(0, len(bases)+1) loops 5 times, which gives you repeated permutations.
You can just remove that line:
import itertools
bases = ['a','c','t','g']
for subset in itertools.permutations(bases, 3):
    print(subset)
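One thing worth checking, as a hedged aside: permutations never reuses a base, so triples such as 'aaa' are left out and you get 4*3*2 = 24 results. If repeated bases should be allowed (an assumption about the goal, since the question does not say), itertools.product gives all 4*4*4 = 64 triples:

```python
from itertools import permutations, product

bases = ["a", "c", "t", "g"]

# Each base used at most once per triple.
no_repeats = ["".join(p) for p in permutations(bases, 3)]

# Bases may repeat within a triple.
with_repeats = ["".join(p) for p in product(bases, repeat=3)]

print(len(no_repeats))    # 24
print(len(with_repeats))  # 64
```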

Powerset with frozenset in Python

I'm sitting here for almost 5 hours trying to solve the problem and now I'm hoping for your help.
Here is my Python Code:
def powerset3(a):
    if (len(a) == 0):
        return frozenset({})
    else:
        s=a.pop()
        b=frozenset({})
        b|=frozenset({})
        b|=frozenset({s})
        for subset in powerset3(a):
            b|=frozenset({str(subset)})
            b|=frozenset({s+subset})
        return b
If I run the program with:
print(powerset3(set(['a', 'b'])))
I get following solution
frozenset({'a', 'b', 'ab'})
But I want to have
{frozenset(), frozenset({'a'}), frozenset({'b'}), frozenset({'b', 'a'})}
I don't want to use libraries and it should be recursive!
Thanks for your help
Here's a slightly more readable implementation using itertools, if you don't want to use a lib for the combinations, you can replace the combinations code with its implementation e.g. from https://docs.python.org/2/library/itertools.html#itertools.combinations
import itertools
def powerset(l):
    # start with the empty tuple, which becomes the empty subset
    result = [()]
    for i in range(len(l)):
        result += itertools.combinations(l, i+1)
    return frozenset([frozenset(x) for x in result])
Testing on IPython, with different lengths
In [82]: powerset(['a', 'b'])
Out[82]:
frozenset({frozenset(),
frozenset({'b'}),
frozenset({'a'}),
frozenset({'a', 'b'})})
In [83]: powerset(['x', 'y', 'z'])
Out[83]:
frozenset({frozenset(),
frozenset({'x'}),
frozenset({'x', 'z'}),
frozenset({'y'}),
frozenset({'x', 'y'}),
frozenset({'z'}),
frozenset({'y', 'z'}),
frozenset({'x', 'y', 'z'})})
In [84]: powerset([])
Out[84]: frozenset({frozenset()})
You sort of have the right idea. If a is non-empty, then the powerset of a can be formed by taking some element s from a; let's call what's left over rest. Then build up the powerset of a from the powerset of rest by adding to it, for each subset in powerset3(rest), both subset itself and subset | frozenset({s}).
That last bit, doing subset | frozenset({s}) instead of string concatenation, is half of what's missing from your solution. The other problem is the base case. The powerset of the empty set is not the empty set; it is the set whose single element is the empty set.
One more issue with your solution is that you're trying to use frozenset, which is immutable, in mutable ways (e.g. pop(), b |= something, etc.)
Here's a working solution:
from functools import partial, reduce
def helper(x, accum, subset):
    return accum | frozenset({subset}) | frozenset({frozenset({x}) | subset})
def powerset(xs):
    if len(xs) == 0:
        return frozenset({frozenset({})})
    else:
        # this loop is the only way to access elements in frozenset, notice
        # it always returns out of the first iteration
        for x in xs:
            return reduce(partial(helper, x), powerset(xs - frozenset({x})), frozenset({}))
a = frozenset({'a', 'b'})
print(powerset(a))
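A shorter variant of the same recursion, as a sketch with no imports at all. Here next(iter(xs)) is used to pick an arbitrary element instead of the one-pass for loop; that choice is mine, not part of the answer above.

```python
def powerset(xs):
    """Recursive powerset; returns a frozenset of frozensets."""
    xs = frozenset(xs)
    if not xs:
        # Base case: the only subset of the empty set is the empty set.
        return frozenset({frozenset()})
    x = next(iter(xs))           # pick any element
    rest = powerset(xs - {x})    # all subsets that do not contain x
    # Every subset either omits x (rest) or is such a subset plus x.
    return rest | frozenset(subset | {x} for subset in rest)

print(powerset({'a', 'b'}))
```

The printed order of the subsets may vary between runs, because sets are unordered.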

Combine Python List Elements Based On Another List

I have 2 lists:
phon = ["A","R","K","H"]
idx = [1,2,3,3]
idx corresponds to how phon should be grouped. In this case, phon_grouped should be ["A","R","KH"] because both "K" and "H" correspond to group 3.
I'm assuming some sort of zip or map function is required, but I'm not sure how to implement it. I have something like:
a = []
for i in enumerate(phon):
    a[idx[i-1].append(phon[i])
but this does not actually work/compile
Use zip() and itertools.groupby() to group the output after zipping:
from itertools import groupby
from operator import itemgetter
result = [''.join([c for i, c in group])
for key, group in groupby(zip(idx, phon), itemgetter(0))]
itertools.groupby() requires that your input is already sorted on the key (your idx values here).
zip() pairs up the indices from idx with characters from phon
itertools.groupby() groups the resulting tuples on the first value, the index. Equal index values put the tuples into the same group.
The list comprehension then picks the characters from the group again and joins them into strings.
Demo:
>>> from itertools import groupby
>>> from operator import itemgetter
>>> phon = ["A","R","K","H"]
>>> idx = [1,2,3,3]
>>> [''.join([c for i, c in group]) for key, group in groupby(zip(idx, phon), itemgetter(0))]
['A', 'R', 'KH']
If you don't want to use an extra class:
phon = ["A","R","K","H"]
idx = [1,2,3,3]
a = [[] for i in range(idx[-1])] # Create list of lists of length(max(idx))
for data,place in enumerate(idx):
    a[place-1].append(phon[data])
[['A'], ['R'], ['K', 'H']]
Mainly the trick is to just pre-initialize your list. You know the final list will be of the max number found in idx, which should be the last number as you said idx is sorted.
Not sure if you wanted the end result to be an appended list, or concatenated characters, i.e. "KH" vs ['K', 'H']
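Both approaches above rely on idx being sorted. If that cannot be guaranteed, a dictionary keyed on the index is one possible alternative. This is a sketch; `group_by_index` is a made-up name, and it returns joined strings like "KH".

```python
def group_by_index(phon, idx):
    """Join the characters of phon that share the same idx value."""
    groups = {}
    for key, char in zip(idx, phon):
        # Collect characters per index, in their original order.
        groups.setdefault(key, []).append(char)
    # Emit the groups in ascending index order.
    return ["".join(groups[key]) for key in sorted(groups)]

print(group_by_index(["A", "R", "K", "H"], [1, 2, 3, 3]))  # ['A', 'R', 'KH']
```

To get lists such as ['K', 'H'] instead, return groups[key] without the join.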
