Appending to dictionary stored in dataframe on value - python

So I have a DataFrame that looks like the following
a b c
0 AB 10 {a: 2, b: 1}
1 AB 1 {a: 3, b: 2}
2 AC 2 {a: 4, b: 3}
...
400 BC 4 {a: 1, b: 4}
Given another key-value pair like {c: 2}, what's the syntax to add it to every dictionary in column c?
a b c
0 AB 10 {a: 2, b: 1, c: 2}
1 AB 1 {a: 3, b: 2, c: 2}
2 AC 2 {a: 4, b: 3, c: 2}
...
400 BC 4 {a: 1, b: 4, c: 2}
I've tried df['C'] +=, df['C'].append(), and df.C.append, but none of them seem to work.

Here is a generalized way for updating dictionaries in a column with another dictionary, which can be used for multiple keys.
Test dataframe:
>>> x = pd.Series([{'a':2,'b':1}])
>>> df = pd.DataFrame(x, columns=['c'])
>>> df
c
0 {'b': 1, 'a': 2}
And just apply a lambda function:
>>> update_dict = {'c': 2}
>>> df['c'].apply(lambda x: {**x, **update_dict})
0 {'b': 1, 'a': 2, 'c': 2}
Name: c, dtype: object
Note: this uses the dictionary unpacking syntax ({**x, **y}, available from Python 3.5) mentioned in an answer to How to merge two Python dictionaries in a single expression?. For Python 2, you can use the merge_two_dicts function from the top answer there. Take the function definition from that answer and then write:
df['c'].apply(lambda x: merge_two_dicts(x, update_dict))
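Note that apply returns a new Series and leaves the DataFrame untouched, so the result has to be assigned back to the column to keep it. A minimal sketch, assuming a small frame shaped like the one in the question (the sample rows are illustrative):

```python
import pandas as pd

# Illustrative data modelled on the question's DataFrame.
df = pd.DataFrame({
    "a": ["AB", "AB", "AC"],
    "b": [10, 1, 2],
    "c": [{"a": 2, "b": 1}, {"a": 3, "b": 2}, {"a": 4, "b": 3}],
})
update_dict = {"c": 2}

# Build a merged copy of each dictionary and store the new Series
# back in the column; the original dictionaries are not mutated.
df["c"] = df["c"].apply(lambda d: {**d, **update_dict})

print(df["c"].tolist())
```

Because each row gets a fresh dictionary, later changes to one row's dictionary do not leak into the others.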

Related

pandas: Find overlap of clubs

I am given a (pandas) dataframe telling me about membership relations of people and clubs. What I want to find is the number of members that any two clubs have in common.
Example Input:
Person Club
1 A
1 B
1 C
2 A
2 C
3 A
3 B
4 C
In other words, A = {1,2,3}, B = {1,3}, and C = {1,2,4}.
Desired output:
Club 1 Club 2 Num_Overlaps
A B 2
A C 2
B C 1
I can of course write python code that calculates those numbers, but I guess there must be a more dataframe-ish way using groupby or so to accomplish the same.
First, I grouped the dataframe on the club to get a set of each person in the club.
grouped = df.groupby("Club").agg({"Person": set}).reset_index()
Club Person
0 A {1, 2, 3}
1 B {1, 3}
2 C {1, 2, 4}
Then, I created a Cartesian product of this dataframe. I didn't have pandas 1.2.0, so I couldn't use the cross join available in df.merge(). Instead, I used the idea from this answer: pandas two dataframe cross join
grouped["key"] = 0
product = grouped.merge(grouped, on="key", how="outer").drop(columns="key")
Club_x Person_x Club_y Person_y
0 A {1, 2, 3} A {1, 2, 3}
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
3 B {1, 3} A {1, 2, 3}
4 B {1, 3} B {1, 3}
5 B {1, 3} C {1, 2, 4}
6 C {1, 2, 4} A {1, 2, 3}
7 C {1, 2, 4} B {1, 3}
8 C {1, 2, 4} C {1, 2, 4}
I then kept only the pairs where Club_x < Club_y, which removes both the self-pairs and the mirrored duplicates.
filtered = product[product["Club_x"] < product["Club_y"]]
Club_x Person_x Club_y Person_y
1 A {1, 2, 3} B {1, 3}
2 A {1, 2, 3} C {1, 2, 4}
5 B {1, 3} C {1, 2, 4}
Finally, I added the column with the overlap size and renamed columns as necessary.
result = filtered.assign(Num_Overlaps=filtered.apply(lambda row: len(row["Person_x"].intersection(row["Person_y"])), axis=1))
result = result.rename(columns={"Club_x": "Club 1", "Club_y": "Club 2"}).drop(["Person_x", "Person_y"], axis=1)
Club 1 Club 2 Num_Overlaps
1 A B 2
2 A C 2
5 B C 1
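For readers on pandas 1.2.0 or later, the same steps can be sketched with the built-in cross join instead of the temporary key column. This is only a sketch under that version assumption; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Person": [1, 1, 1, 2, 2, 3, 3, 4],
                   "Club": list("ABCACABC")})

# One row per club, holding the set of its members.
grouped = df.groupby("Club")["Person"].agg(set).reset_index()

# how="cross" (pandas >= 1.2) builds the Cartesian product directly;
# overlapping column names get the usual _x / _y suffixes.
product = grouped.merge(grouped, how="cross")

# Keep each unordered pair once and drop the self-pairs.
pairs = product[product["Club_x"] < product["Club_y"]]

# Size of the intersection of the two member sets, per pair.
overlaps = {
    (row.Club_x, row.Club_y): len(row.Person_x & row.Person_y)
    for row in pairs.itertuples(index=False)
}
print(overlaps)
```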
You can indeed do this with groupby and some set manipulation. I would also use itertools.combinations, to get the list of club pairs.
import pandas as pd
from itertools import combinations
df = pd.DataFrame({'Person': [1, 1, 1, 2, 2, 3, 3, 4],
                   'Club': list('ABCACABC')})
members = df.groupby('Club').agg(set)
clubs = sorted(list(set(df.Club)))
overlap = pd.DataFrame(list(combinations(clubs, 2)),
                       columns=['Club 1', 'Club 2'])
def n_overlap(row):
    club1, club2 = row
    members1 = members.loc[club1, 'Person']
    members2 = members.loc[club2, 'Person']
    return len(members1.intersection(members2))
overlap['Num_Overlaps'] = overlap.apply(n_overlap, axis=1)
overlap
Club 1 Club 2 Num_Overlaps
0 A B 2
1 A C 2
2 B C 1
Note that there is one difference from your desired output, but that is probably as it should be, as noted by @rchome in the comment above.

Convert Dictionary into Dataframe

I am trying to convert a dictionary into a dataframe.
import pandas as pd
dict = {'A': [1,2,3], 'B': [1,2,3,4]}
pd.DataFrame.from_dict(dict, orient = 'index').T
Expect:
A B
0 [1,2,3] [1,2,3,4]
But got instead:
A B
-----------
0 1 a
1 2 b
2 3 c
3 None d
Try putting the dictionary inside a list ([]):
import pandas as pd
dct = {"A": [1, 2, 3], "B": [1, 2, 3, 4]}
df = pd.DataFrame([dct])
print(df)
Prints:
A B
0 [1, 2, 3] [1, 2, 3, 4]
Note: Don't use built-in names such as dict for variable names, because doing so shadows the built-in.
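To illustrate why the list wrapper matters, here is a small sketch contrasting the two constructor calls. The flag name direct_ok is only there for illustration:

```python
import pandas as pd

dct = {"A": [1, 2, 3], "B": [1, 2, 3, 4]}

# Passing the dict directly treats each list as a column of values,
# which fails here because the lists have different lengths.
try:
    pd.DataFrame(dct)
    direct_ok = True
except ValueError:
    direct_ok = False

# Wrapping the dict in a list makes it a single record (one row),
# so each list is stored whole in one cell.
df = pd.DataFrame([dct])

print(direct_ok)
print(df.shape)
print(df.loc[0, "B"])
```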

Summing up collections.Counter objects using `groupby` in pandas

I am trying to group the words_count column by both essay_set and domain1_score and add up the Counters in words_count, using Counter addition as shown here:
>>> c = Counter(a=3, b=1)
>>> d = Counter(a=1, b=2)
>>> c + d # add two counters together: c[x] + d[x]
Counter({'a': 4, 'b': 3})
I grouped them using this command:
words_freq_by_set = words_freq_by_set.groupby(by=["essay_set", "domain1_score"])
but I do not know how to apply the Counter addition, which is simply +, to the words_count column.
Here is my dataframe:
GroupBy.sum works with Counter objects. However, I should mention that the addition is done pairwise, so this may not be very fast. Let's try:
words_freq_by_set.groupby(by=["essay_set", "domain1_score"])['words_count'].sum()
df = pd.DataFrame({
'a': [1, 1, 2],
'b': [Counter([1, 2]), Counter([1, 3]), Counter([2, 3])]
})
df
a b
0 1 {1: 1, 2: 1}
1 1 {1: 1, 3: 1}
2 2 {2: 1, 3: 1}
df.groupby(by=['a'])['b'].sum()
a
1 {1: 2, 2: 1, 3: 1}
2 {2: 1, 3: 1}
Name: b, dtype: object
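The pairwise addition can also be written out explicitly, which makes it clear what is being summed in each group. This sketch is an alternative to GroupBy.sum, not a description of its internals. It iterates over the groups in plain Python and starts each sum from an empty Counter:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [Counter([1, 2]), Counter([1, 3]), Counter([2, 3])],
})

# Iterating a SeriesGroupBy yields (group key, Series) pairs.
# sum(..., Counter()) adds the Counters one pair at a time.
totals = {key: sum(group, Counter()) for key, group in df.groupby("a")["b"]}

print(totals[1])
print(totals[2])
```

Keep in mind that Counter addition drops entries whose resulting count is zero or negative, which does not matter for word counts.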

How to Convert a dataframe into nested dictionary in the following format

2 3 4 loc_id
0 b b c 1
1 b b c 6
2 b a b 8
3 b b c 10
4 b a b 11
Can someone help me convert the above dataframe into the following dictionary in Python? The column names are the outer keys. Each one maps to an inner dictionary whose keys are the values found in that column and whose values are the lists of matching loc_id values.
{2:{'b':[1,6,8,10,11]},3:{'b':[1,6,10],'a':[8,11]},4:{'c':[1,6,10],'b':[8,11]}}
Use DataFrame.melt with GroupBy.agg and list to build a Series with a MultiIndex, then create the nested dictionary:
s = df.melt('loc_id').groupby(['variable','value'])['loc_id'].agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print (d)
{'2': {'b': [1, 1, 6, 8, 10, 11]},
'3': {'a': [8, 11], 'b': [1, 1, 6, 10]},
'4': {'b': [8, 11], 'c': [1, 1, 6, 10]}}
Or create a dictionary of Series and aggregate the index into lists:
d = {k: v.groupby(v).agg(lambda x: list(x.index)).to_dict()
     for k, v in df.set_index('loc_id').to_dict('series').items()}
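As a self-contained check, the melt approach can be run on the five rows shown in the question. The sketch assumes the column names are the strings '2', '3', and '4', which is an assumption made here to match the string keys in the printed output above:

```python
import pandas as pd

# The five rows from the question; column names assumed to be strings.
df = pd.DataFrame({
    "2": list("bbbbb"),
    "3": list("bbaba"),
    "4": list("ccbcb"),
    "loc_id": [1, 6, 8, 10, 11],
})

# Long format: one row per (loc_id, column name, cell value).
long_form = df.melt("loc_id")

# Collect the loc_id values for every (column name, cell value) pair.
s = long_form.groupby(["variable", "value"])["loc_id"].agg(list)

# Split the first index level off into the outer dictionary keys.
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
```

If the real column names are integers, the outer keys will be integers as well.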

Sorting a Dictionary in Python by two keys (frequency and lexicographically)

I have a dictionary in Python like this:
{'c': 3, 'b': 3, 'aa': 2, 'a': 2}
and I want to print it like this:
b
c
a
aa
I need to sort the dictionary first by value, in descending order, and if there are any ties, sort those keys lexicographically.
I have searched and can't find any solutions. Here is what I have already tried:
import operator
temp = {'c' : 3, 'b': 3, 'aa' : 2, 'a' : 2}
results = []
for key, value in temp.items():
    results.append([key, value])
results.sort(key = operator.itemgetter(1,0), reverse = True)
for result in results:
    print(result)
This doesn't work, though; it results in this:
c
b
aa
a
The output should be:
b
c
a
aa
I appreciate any help!
(Note: using Python 3)
>>> d = {'c': 3, 'b': 3, 'aa': 2, 'a': 2}
>>> sorted(d, key=lambda key: (-d[key], key))
['b', 'c', 'a', 'aa']
The minus sign negates each value, so the sort is descending by value; ties are then broken by the key in ascending (lexicographic) order.
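The attempt in the question fails because reverse=True reverses both parts of the sort key, so the tie-break on the key is also descending. A short sketch of two ways around that follows. The second relies on Python's sort being stable and also works when the values cannot be negated:

```python
d = {'c': 3, 'b': 3, 'aa': 2, 'a': 2}

# Option 1: negate the value so a single ascending sort gives
# descending values with ascending keys on ties.
by_negation = sorted(d, key=lambda key: (-d[key], key))

# Option 2: two passes. Sort by key first, then by value descending;
# the stable sort keeps the key order among equal values.
by_two_passes = sorted(sorted(d), key=d.get, reverse=True)

for key in by_negation:
    print(key)
```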
