I am given a (pandas) dataframe describing which people belong to which clubs. What I want to find is the number of members that any two clubs have in common.
Example Input:
Person Club
1 A
1 B
1 C
2 A
2 C
3 A
3 B
4 C
In other words, A = {1,2,3}, B = {1,3}, and C = {1,2,4}.
Desired output:
Club 1 Club 2 Num_Overlaps
A B 2
A C 2
B C 1
I can of course write python code that calculates those numbers, but I guess there must be a more dataframe-ish way using groupby or so to accomplish the same.
First, I grouped the dataframe on the club to get a set of each person in the club.
grouped = df.groupby("Club").agg({"Person": set}).reset_index()
Club Person
0 A {1, 2, 3}
1 B {1, 3}
2 C {1, 2, 4}
Then, I created a Cartesian product of this dataframe. I didn't have pandas 1.2.0, so I couldn't use the cross join available in df.merge(). Instead, I used the idea from this answer: pandas two dataframe cross join
grouped["key"] = 0
product = grouped.merge(grouped, on="key", how="outer").drop(columns="key")
Club_x Person_x Club_y Person_y
0 A {1, 2, 3} A {1, 2, 3}
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
3 B {1, 3} A {1, 2, 3}
4 B {1, 3} B {1, 3}
5 B {1, 3} C {1, 2, 4}
6 C {1, 2, 4} A {1, 2, 3}
7 C {1, 2, 4} B {1, 3}
8 C {1, 2, 4} C {1, 2, 4}
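As an aside: on pandas 1.2.0 or newer the dummy-key workaround is unnecessary, since merge supports how="cross" directly. A minimal sketch with a grouped frame like the one above:

```python
import pandas as pd

grouped = pd.DataFrame({
    "Club": ["A", "B", "C"],
    "Person": [{1, 2, 3}, {1, 3}, {1, 2, 4}],
})

# pandas >= 1.2: no dummy key needed, merge supports a true cross join
product = grouped.merge(grouped, how="cross")
print(product.shape)  # (9, 4): every club paired with every club
```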
I then filtered the product to keep only pairs where Club_x < Club_y, which removes self-pairs and duplicate (mirrored) pairs.
filtered = product[product["Club_x"] < product["Club_y"]]
Club_x Person_x Club_y Person_y
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
5 B {1, 3} C {1, 2, 4}
Finally, I added the column with the overlap size and renamed columns as necessary.
result = filtered.assign(Num_Overlaps=filtered.apply(lambda row: len(row["Person_x"].intersection(row["Person_y"])), axis=1))
result = result.rename(columns={"Club_x": "Club 1", "Club_y": "Club 2"}).drop(["Person_x", "Person_y"], axis=1)
Club 1 Club 2 Num_Overlaps
1 A B 2
2 A C 2
5 B C 1
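As a side note (not part of the approach above), the same counts can be computed without building sets at all: a person-by-club 0/1 indicator matrix from pd.crosstab, multiplied by its own transpose, gives the pairwise overlap counts directly. A sketch on the same input:

```python
import pandas as pd

df = pd.DataFrame({"Person": [1, 1, 1, 2, 2, 3, 3, 4],
                   "Club": list("ABCACABC")})

# 0/1 matrix: rows = persons, columns = clubs
membership = pd.crosstab(df["Person"], df["Club"])

# overlap.loc[i, j] = number of persons belonging to both club i and club j
overlap = membership.T @ membership

print(overlap.loc["A", "B"])  # 2
print(overlap.loc["B", "C"])  # 1
```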
You can indeed do this with groupby and some set manipulation. I would also use itertools.combinations to get the list of club pairs.
import pandas as pd
from itertools import combinations
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 3, 3, 4],
'Club': list('ABCACABC')})
members = df.groupby('Club').agg(set)
clubs = sorted(set(df.Club))
overlap = pd.DataFrame(list(combinations(clubs, 2)),
                       columns=['Club 1', 'Club 2'])

def n_overlap(row):
    club1, club2 = row
    members1 = members.loc[club1, 'Person']
    members2 = members.loc[club2, 'Person']
    return len(members1.intersection(members2))
overlap['Num_Overlaps'] = overlap.apply(n_overlap, axis=1)
overlap
Club 1 Club 2 Num_Overlaps
0 A B 2
1 A C 2
2 B C 1
Note there is one difference from your desired output, but that is probably as it should be, as noted by @rchome in the comment above.
Related
I am trying to group the words_count column by both essay_set and domain1_score, adding up the Counters in words_count. Counter addition works as mentioned here:
>>> c = Counter(a=3, b=1)
>>> d = Counter(a=1, b=2)
>>> c + d # add two counters together: c[x] + d[x]
Counter({'a': 4, 'b': 3})
I grouped them using this command:
words_freq_by_set = words_freq_by_set.groupby(by=["essay_set", "domain1_score"])
but I do not know how to apply the Counter addition function (which is simply +) to the words_count column.
Here is my dataframe:
GroupBy.sum works with Counter objects. However, I should mention that the summation is pairwise, so this may not be very fast on large groups. Let's try:
words_freq_by_set.groupby(by=["essay_set", "domain1_score"])['words_count'].sum()
df = pd.DataFrame({
'a': [1, 1, 2],
'b': [Counter([1, 2]), Counter([1, 3]), Counter([2, 3])]
})
df
a b
0 1 {1: 1, 2: 1}
1 1 {1: 1, 3: 1}
2 2 {2: 1, 3: 1}
df.groupby(by=['a'])['b'].sum()
a
1 {1: 2, 2: 1, 3: 1}
2 {2: 1, 3: 1}
Name: b, dtype: object
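This works because the sum reduces each group with +, and Counter defines + as per-key addition of counts. A standalone check of the same reduction:

```python
from collections import Counter

counters = [Counter([1, 2]), Counter([1, 3]), Counter([2, 3])]

# sum() folds the list with +, the same operation the groupby reduction uses
total = sum(counters, Counter())
print(total)  # Counter({1: 2, 2: 2, 3: 2})
```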
2 3 4 loc_id
0 b b c 1
1 b b c 6
2 b a b 8
3 b b c 10
4 b a b 11
Can someone help me convert the above dataframe to the following dictionary in Python: the first key is the column name, and inside is a dictionary mapping that column's values to the matching loc_id values.
{2:{'b':[1,6,8,10,11]},3:{'b':[1,6,10],'a':[8,11]},4:{'c':[1,6,10],'b':[8,11]}}
Use DataFrame.melt with GroupBy.agg and list to get a MultiIndex Series, then build the nested dictionary from it:
s = df.melt('loc_id').groupby(['variable','value'])['loc_id'].agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{'2': {'b': [1, 6, 8, 10, 11]},
 '3': {'a': [8, 11], 'b': [1, 6, 10]},
 '4': {'b': [8, 11], 'c': [1, 6, 10]}}
Or create dictionary of Series and aggregate index to list:
d = {k: v.groupby(v).agg(lambda x: list(x.index)).to_dict()
for k, v in df.set_index('loc_id').to_dict('series').items()}
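Putting the first approach together end to end (the column labels are assumed here to be the strings '2', '3', '4', matching the keys in the printed result above):

```python
import pandas as pd

# sample frame from the question; column labels assumed to be strings
df = pd.DataFrame({"2": ["b", "b", "b", "b", "b"],
                   "3": ["b", "b", "a", "b", "a"],
                   "4": ["c", "c", "b", "c", "b"],
                   "loc_id": [1, 6, 8, 10, 11]})

# long format, then collect loc_ids per (column, value) pair
s = df.melt("loc_id").groupby(["variable", "value"])["loc_id"].agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
```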
I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B' and then change the 2 to 4:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.
You can use append to add copies of the rows where B == 2, which you can extract with boolean indexing, reassigning B to 4 using assign. If order matters, you can then sort by C (to reproduce your desired frame):
>>> df.append(df[df.B.eq(2)].assign(B=4)).sort_values('C')
B C Date
0 1 A 1
1 2 B 2
1 4 B 2
2 3 C 3
3 2 D 4
3 4 D 4
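Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same idea still works with pd.concat, a sketch:

```python
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})

# duplicate the rows where B == 2, with B reassigned to 4 in the copies
result = pd.concat([df, df[df.B.eq(2)].assign(B=4)]).sort_values('C')
print(result)
```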
So I have a DataFrame that looks like the following
a b c
0 AB 10 {a: 2, b: 1}
1 AB 1 {a: 3, b: 2}
2 AC 2 {a: 4, b: 3}
...
400 BC 4 {a: 1, b: 4}
Given another key-value pair like {c: 2}, what's the syntax to add it to every value in column c?
a b c
0 AB 10 {a: 2, b: 1, c: 2}
1 AB 1 {a: 3, b: 2, c: 2}
2 AC 2 {a: 4, b: 3, c: 2}
...
400 BC 4 {a: 1, b: 4, c: 2}
I've tried df['C'] +=, df['C'].append(), and df.C.append, but none of them seem to work.
Here is a generalized way for updating dictionaries in a column with another dictionary, which can be used for multiple keys.
Test dataframe:
>>> x = pd.Series([{'a':2,'b':1}])
>>> df = pd.DataFrame(x, columns=['c'])
>>> df
c
0 {'b': 1, 'a': 2}
And just apply a lambda function:
>>> update_dict = {'c': 2}
>>> df['c'].apply(lambda x: {**x, **update_dict})
0 {'b': 1, 'a': 2, 'c': 2}
Name: c, dtype: object
Note: this uses the Python3 update dictionary syntax mentioned in an answer to How to merge two Python dictionaries in a single expression?. For Python2, you can use the merge_two_dicts function in the top answer. You can use the function definition from that answer and then write:
df['c'].apply(lambda x: merge_two_dicts(x, update_dict))
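On Python 3.9 or newer, you can also use the dict union operator |, which is equivalent to the {**x, **update_dict} unpacking above:

```python
import pandas as pd

df = pd.DataFrame({'c': [{'a': 2, 'b': 1}]})
update_dict = {'c': 2}

# x | update_dict builds a new dict; keys from update_dict win on conflict
merged = df['c'].apply(lambda x: x | update_dict)
print(merged[0])  # {'a': 2, 'b': 1, 'c': 2}
```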
I'm a novice in Python. Now I'm learning difflib, and I want to know why
for x in difflib.Differ().compare([1,2,3],[0,2,1]):
    print(x)
result:
+ 0
+ 2
1
- 2
- 3
why not:
+ 0
2
1
Difflib respects the ordering of its arguments. It essentially shows the edits that would transform one sequence into the other.
When you don't care about order, a set difference may be what you want:
>>> {1, 2, 3} - {0, 2, 1}
{3}
>>> {0, 2, 1} - {1, 2, 3}
{0}
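And if you want the elements that appear in exactly one of the two inputs, regardless of direction, the symmetric difference combines both one-sided differences:

```python
a, b = {1, 2, 3}, {0, 2, 1}

# elements in a or b, but not in both
print(a ^ b)                      # {0, 3}
print(a.symmetric_difference(b))  # same thing
```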