partial intersection - multiple groups - python

I am not sure how to approach my problem, so I haven't been able to check whether it has already been asked (apologies in advance).
Group Item
A 1
A 2
A 3
B 1
B 3
C 1
D 2
D 3
I want to know all combinations of groups that share at least X items (X = 2 in this example). And I want to know which items they share.
RESULT:
A-B: 2 (item 1 and item 3)
A-D: 2 (item 2 and item 3)
The list of groups and items is really long and the maximum number of item matches across groups is probably not more than 3-5.
NB More than 2 groups can have shared items - e.g. A-B-E: 3
So it's not sufficient to only compare two groups at a time. I need to compare all combinations of groups.
My thoughts
First round: one pile of all groups - are at least two values shared amongst all?
Second round: All-1 group (all combinations)
Third round: All-2 groups (all combinations)
Until I reach the comparison between only two groups (all combinations).
However, this seems super heavy performance-wise! And I have no idea how to do this.
What are your thoughts?
Thanks!
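For reference, the enumeration described under "My thoughts" can be written with itertools like this (a sketch over the example groups; each subset still has to be checked for two or more shared items, which is what the answers below do):
from itertools import combinations

groups = ['A', 'B', 'C', 'D']  # the example groups from the table above

# All groups first, then every subset one group smaller, down to pairs.
for size in range(len(groups), 1, -1):
    for subset in combinations(groups, size):
        print(subset)  # each subset would then be checked for shared items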

Unless you have additional information to restrict the search, I would just process all subsets (having size >= 2) of the set of unique groups.
For each subset, I would search the items belonging to all members of the set:
from itertools import chain, combinations

a = df['Group'].unique()
# Every subset of groups, from pairs up to all groups at once.
for cols in chain(*(combinations(a, i) for i in range(2, len(a) + 1))):
    # Start from all items and keep only those present in every group of the subset.
    vals = df['Item'].unique()
    for col in cols:
        vals = df.loc[(df.Group == col) & (df.Item.isin(vals)), 'Item'].unique()
    if len(vals) > 0:
        print(cols, vals)
it gives:
('A', 'B') [1 3]
('A', 'C') [1]
('A', 'D') [2 3]
('B', 'C') [1]
('B', 'D') [3]
('A', 'B', 'C') [1]
('A', 'B', 'D') [3]

This is how I would approach the problem. It may not be the most efficient way to deal with it, but it has the merit of being clear.
For each group, list all the items that belong to it.
Then, for each pair of groups, list all shared items (for instance, for each item of group A, check whether it is also an item of group B).
Check whether the number of shared items is higher than your threshold X.
It's not an off-the-shelf function, but it should be rather easy (or at least a good exercise) to implement.
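A minimal sketch of that pairwise idea (not part of the original answer; the group-to-items mapping is typed in from the question's table):
from itertools import combinations

# Group -> set of items, taken from the question's example table.
groups = {
    "A": {1, 2, 3},
    "B": {1, 3},
    "C": {1},
    "D": {2, 3},
}

X = 2  # report pairs sharing at least X items

for g1, g2 in combinations(groups, 2):
    shared = groups[g1] & groups[g2]  # set intersection = items in both groups
    if len(shared) >= X:
        print(f"{g1}-{g2}: {len(shared)} (items {sorted(shared)})")
Going beyond pairs means intersecting the item sets of larger group subsets, which is what the other answers handle.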
Have fun!

Here is a new solution that will work with all combinations.
Steps:
get a dataframe "grouped" that lists, for each item, all the groups the item is in
from each row of "grouped", get all possible combinations of groups that share that item
for each combination, count from "grouped" how many items are common to all of its groups; if there are 2 or more, add the combination to a dictionary
Note: it only loops through group combinations that have common items, so if you have lots of groups it already filters out a huge part of the possible combinations that have no common items.
import numpy as np
import pandas as pd
from itertools import chain, combinations

d = {
    "Group": "A,A,A,B,B,C,D,D".split(","),
    "Item": [1, 2, 3, 1, 3, 1, 2, 3]
}
df = pd.DataFrame(d)

# For each item, the list of groups it belongs to.
grouped = df.groupby("Item").apply(lambda x: list(x.Group))

# From each item's group list, every combination of 2 or more of its groups.
all_combinations_with_common = set(chain.from_iterable(
    combinations(sorted(item), i)
    for item in grouped if len(item) >= 2
    for i in range(2, len(item) + 1)))

commons = {}
REPEAT_COUNT = 2
for comb in sorted(all_combinations_with_common):
    # Items whose group list contains every group of this combination.
    items = grouped.apply(lambda x: np.all(np.isin(comb, x)))
    if sum(items) >= REPEAT_COUNT:
        commons["-".join(comb)] = grouped[items].index.values
print(commons)
output
{'A-B': array([1, 3]), 'A-D': array([2, 3])}

Related

How to create a randomly matched list of pairs for an odd number of variables without replacement?

I am trying to create random pairs without replacement of fruits which are all in one list.
The problem is that this list may contain an even or odd number of fruits. If the number of fruits is even, then I want to create pairs of two by dividing the number of fruits by 2. E.g., if I have 4 fruits, I can create a total of 2 pairs of 2 randomly matched fruits.
However, when it comes to an odd number of fruits, it is more complicated. For example, I have 5 fruits and it should create 1 pair of 2 and 1 pair of 3 randomly matched fruits. So, in the odd case there will be pairs of two and 1 pair of three fruits. The requirement is that when creating the pair of 3, it should not take any of the fruits which were used in the even pair(s). I am not sure how to exclude those when creating the odd pair.
This is my code:
import numpy as np

x = ['banana','apple','pear','cherry','blueberry']
fruit_count = len(x)
if fruit_count % 2 == 0:
    print('even')
    pairs = np.random.choice(x, size=(int(fruit_count/2), 2), replace=False)
    print(pairs)
else:
    print('odd')
    pairs = np.random.choice(x, size=(int((fruit_count/2)-1.5), 2), replace=False)
    pairs_odd = np.random.choice(x, size=(int(fruit_count/2)-1, 3), replace=False)
    print(pairs_odd)
    print(pairs)
The output shows the problem: the odd group of three takes values that are already used in the even pair. The desired values of the odd group should be: ['pear','cherry','blueberry'].
How do I fix that?
OUTPUT
odd
['cherry' 'banana' 'apple']
['banana' 'apple']
Draw one less pair if the list length is odd. This will leave three elements instead of one element - which are then, by definition, the leftover size-three "pair".
pairs=np.random.choice(x, size=(fruit_count // 2 - 1, 2), replace=False)
You can quickly remove the fruits already in size-two pairs by using set operations.
pair_size_three = list(set(x) - set.union(*(set(pair) for pair in pairs)))
Edit: Just realized there is a faster alternative by doing the above in reverse. First pick three elements from the collection and remove them, then group the remaining elements into pairs. This saves a lot of set operations if the list is long.
three_elements = np.random.choice(x, size=3, replace=False)  # flat array of 3 fruits so it can go straight into set()
remaining_elements = list(set(x) - set(three_elements))
pairs = np.random.choice(remaining_elements, size=(len(remaining_elements) // 2, 2), replace=False)
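Putting the reverse approach together (a sketch using the fruit list from the question; the seed is only there to make the example reproducible):
import numpy as np

np.random.seed(0)  # only for a reproducible example

x = ['banana', 'apple', 'pear', 'cherry', 'blueberry']

if len(x) % 2 == 0:
    groups = list(np.random.choice(x, size=(len(x) // 2, 2), replace=False))
else:
    # Pick the group of three first, then pair up whatever is left over.
    three = list(np.random.choice(x, size=3, replace=False))
    remaining = [fruit for fruit in x if fruit not in three]
    groups = [three] + list(np.random.choice(remaining, size=(len(remaining) // 2, 2), replace=False))

print(groups)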
You can use pairs_odd and create a list of the items found in x but not found in pairs_odd.
print([item for item in x if item not in pairs_odd[0]])
Result:
odd
['pear' 'cherry' 'banana']
['apple', 'blueberry']

GroupBy All possible permutations

Example dataset columns: ["A","B","C","D","num1","num2"]. So I have 6 columns - the first 4 are for grouping and the last 2 are numeric; means will be calculated for them based on groupBy statements.
I want to groupBy all possible combinations of the 4 grouping columns.
I wish to avoid explicitly typing all possible groupBy's such as groupBy["A","B","C","D"] then groupBy["A","B","D","C"] etc.
I'm new to Python - in python how can I automate group by in a loop so that it does a groupBy calc for all possible combinations - in this case 4*3*2*1 = 24 combinations?
Ta.
Thanks for your help so far. Any idea why the 'a =' part isn't working?
import itertools
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(100, 5)), columns=list('ABCDE'))
group_by_vars = list(df.columns)[0:4]
perms = [perm for perm in itertools.permutations(group_by_vars)]
print list(itertools.combinations(group_by_vars,2))
a = [x for x in itertools.combinations(group_by_vars,group_by_n+1) for group_by_n in range(len(group_by_vars))]
a doesn't error I just get an empty object. Why???
Something like [comb for comb in itertools.combinations(group_by_vars,2)] is easy enough but how to get a = [x for x in itertools.combinations(group_by_vars,group_by_n+1) for group_by_n in range(len(group_by_vars))]??
When you group by ['A', 'B', 'C', 'D'] and calculate the mean, you'll get one particular group (a0, b0, c0, d0) with a mean of m0.
When you permute the columns and group by ['A', 'B', 'D', 'C'], you'll get one particular group (a0, b0, d0, c0) with a mean of m0.
In fact, those m0 are the same. All the groups are the same. You would be duplicating the exact same calculations for every permutation... you just have 4! ways of ordering the tuples. Why?
from itertools import permutations
perms = [perm for perm in permutations(['A','B','C','D'])]
perms will then be a list of all 24 possible permutations.
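If the goal is a mean per subset of the grouping columns (rather than per ordering), itertools.combinations over the subset sizes does it; a sketch with made-up data in place of the question's dataset:
import itertools
import numpy as np
import pandas as pd

# Made-up data shaped like the question's example: 4 grouping columns + 2 numeric ones.
df = pd.DataFrame(np.random.randint(0, 3, size=(100, 4)), columns=list('ABCD'))
df['num1'] = np.random.rand(100)
df['num2'] = np.random.rand(100)

group_by_vars = ['A', 'B', 'C', 'D']
results = {}
for n in range(1, len(group_by_vars) + 1):
    for cols in itertools.combinations(group_by_vars, n):
        # The order inside the tuple doesn't matter for the means, only membership does.
        results[cols] = df.groupby(list(cols))[['num1', 'num2']].mean()

print(len(results))  # 15 subsets: C(4,1) + C(4,2) + C(4,3) + C(4,4)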

Max, len, split Python

I'm trying to find all combinations of A,B repeated 3 times.
Once I've done this I would like to count how many A's there are in a row, by splitting the string and returning the length of the max value. However this is going crazy on me. I must have misunderstood len(max(tmp.split("B"))).
Can anyone explain what this really does? (len returns the length of the string, and max returns the highest value of that string, based on my split?) I expect it to return the number of A's in a row: "A,B,A" should return 1 even though there are two A's.
Suggestions and clarifications would be sincerely welcome
import itertools

list = list(itertools.product(["A", "B"], repeat=3))
count = 0
for i in list:
    count += 1
    tmp = str(i)
    var = len(max(tmp.split("B")))
    print(count, i, var)
You can use itertools.groupby to find groups of identical elements in an iterable. groupby generates a sequence of (key, group) tuples, where key is the value of the elements in the group, and group is an iterator over that group (which shares the underlying iterable with groupby). To get the length of a group we need to convert it to a list.
from itertools import product, groupby

for t in product("AB", repeat=3):
    a = max([len(list(g)) for k, g in groupby(t) if k == "A"] or [0])
    print(t, a)
output
('A', 'A', 'A') 3
('A', 'A', 'B') 2
('A', 'B', 'A') 1
('A', 'B', 'B') 1
('B', 'A', 'A') 2
('B', 'A', 'B') 1
('B', 'B', 'A') 1
('B', 'B', 'B') 0
We need to append or [0] to the list comprehension to cover the situation where no "A"s are found, otherwise max complains that we're trying to find the maximum of an empty sequence.
Update
Padraic Cunningham reminded me that the Python 3 version of max accepts a default arg to handle the situation when you pass it an empty iterable. He also shows another way to calculate the length of an iterable that is a bit nicer since it avoids capturing the iterable into a list, so it's a bit faster and consumes less RAM, which can be handy when working with large iterables. So we can rewrite the above code as
from itertools import product, groupby

for t in product("AB", repeat=3):
    a = max((sum(1 for _ in g) for k, g in groupby(t) if k == "A"), default=0)
    print(t, a)

Check if there are 2 or 3 elements with same value in a list/tuple/etc.

I have a 5-element list and I want to know if there are 2 or 3 equal elements (or two equal and three equal). This "check" would be part of an if condition. Let's say I'm too lazy or stupid to write:
if (a==b and c==d and c==e) or .......... or .........
I know it might be written something like this, but I'm not sure exactly how:
if (a==b and (c==b and (c==e or ....
How do I do it? I also know that you can write something similar to this:
if (x,y for x in [5element list] for y in [5element list] x==y, x not y:
If you just want to check for multiple occurrences and the objects are of a hashable type, the following solution could work for you.
You could create a list of your objects:
>>> l = [a, b, c, d, e]
Then, you could create a set from the same list and compare the length of the list with the length of the set. If the set has fewer elements than the list, you know you must have multiple occurrences.
>>> if len(set(l)) < len(l):
...
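For example (values made up for illustration):
values = [1, 2, 2, 3, 4]
if len(set(values)) < len(values):
    print("at least two elements are equal")  # prints, because 2 appears twice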
Use count. You just want [i for i in myList if myList.count(i) > 1]. This list contains the repeated elements; if it's non-empty, you have repeated elements.
Edit: SQL != python, removed 'where', also this'll get slow for bigger lists, but for 5 elements it'll be fine.
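A quick demonstration with made-up data:
my_list = ['a', 'b', 'a', 'c', 'a']
repeated = [i for i in my_list if my_list.count(i) > 1]
print(repeated)        # ['a', 'a', 'a']
print(bool(repeated))  # True, so there are repeated elements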
You can use collections.Counter, which counts the occurrences of every element in your list.
Once you have the counts, just check that your wanted values (2 or 3) are present among them.
from collections import Counter

my_data = ['a', 'b', 'a', 'c', 'd']
c = Counter(my_data)
counts = set(c.values())
print(2 in counts or 3 in counts)

nested for loops in python with lists

Folks - I have two lists
list1=['a','b']
list2=['y','z']
I would like to send the variables to a function like below:
associate_address(list1[0],list2[0])
associate_address(list1[1],list2[1])
my script:
for l in list1:
    for i in list2:
        conn.associate_address(i, l)
I receive the below output:
conn.associate_address(a,y)
conn.associate_address(a,z)
I would like it to look like this:
conn.associate_address(a,y)
conn.associate_address(b,z)
Use the zip function, like this:
list1 = ['a','b']
list2 = ['y','z']
for i, j in zip(list1, list2):
    print(i, j)
Output:
('a', 'y')
('b', 'z')
Why do you suppose this is?
>>> for x in [1,2]:
...     for y in ['a','b']:
...         print x,y
...
1 a
1 b
2 a
2 b
Nested loops will be performed for each iteration in their parent loop. Think about truth tables:
p q
0 0
0 1
1 0
1 1
Or combinations:
Choose an element from a set of two elements.
2 C 1 = 2
Choose one element from each set, where each set contains two elements.
(2 C 1) * (2 C 1) = 4
Let's say you have a list of 10 elements. Iterating over it with a for loop will take 10 iterations. If you have another list of 5 elements, iterating over it with a for loop will take 5 iterations. Now, if you nest these two loops, you will have to perform 50 iterations to cover every possible combination of the elements of each list.
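The same count can be checked directly with itertools.product, which generates exactly the pairs those nested loops visit:
from itertools import product

pairs = list(product(range(10), range(5)))
print(len(pairs))  # 50 -- one pair per iteration of the nested loops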
You have many options to solve this.
# use tuples to describe your pairs
lst = [('a','y'), ('b','z')]
for pair in lst:
    conn.associate_address(pair[0], pair[1])

# use a dictionary to create a key-value relationship
dct = {'a':'y', 'b':'z'}
for key in dct:
    conn.associate_address(key, dct[key])

# use zip to combine pairwise elements in your lists
lst1, lst2 = ['a', 'b'], ['y', 'z']
for p, q in zip(lst1, lst2):
    conn.associate_address(p, q)

# use an index instead, and sub-index your lists
lst1, lst2 = ['a', 'b'], ['y', 'z']
for i in range(len(lst1)):
    conn.associate_address(lst1[i], lst2[i])
I would recommend using a dict instead of 2 lists since you clearly want them associated.
Dicts are explained here
Once you have your dicts set up you will be able to say
>>> mylist['a']
'y'
>>> mylist['b']
'z'
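For instance, the dict can be built directly from the question's two lists with dict(zip(...)) (a sketch; conn is the connection object from the question):
list1 = ['a', 'b']
list2 = ['y', 'z']
mapping = dict(zip(list1, list2))   # {'a': 'y', 'b': 'z'}
for src, dst in mapping.items():
    conn.associate_address(src, dst)  # conn comes from the question's code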
