I need to aggregate two columns of my dataframe, count the values of the second column, and then keep only the row with the highest value in the "count" column. Let me show:
df =
col1|col2
---------
A | AX
A | AX
A | AY
A | AY
A | AY
B | BX
B | BX
B | BX
B | BY
B | BY
C | CX
C | CX
C | CX
C | CX
C | CX
------------
df1 = df.groupby(['col1', 'col2']).agg({'col2': 'count'})
df1.columns = ['count']
df1= df1.reset_index()
out:
col1 col2 count
A AX 2
A AY 3
B BX 3
B BY 2
C CX 5
So far so good, but now I need to get only the row of each 'col1' group that has the maximum 'count' value, while keeping the value in 'col2'.
expected output in the end:
col1 col2 count
A AY 3
B BX 3
C CX 5
I have no idea how to do that. My attempts so far using the max() aggregation always leave 'col2' out.
From your original DataFrame you can use .value_counts, which returns counts in descending order within each group; thanks to that ordering, drop_duplicates then keeps the most frequent row of each group.
df1 = (df.groupby('col1')['col2'].value_counts()
.rename('counts').reset_index()
.drop_duplicates('col1'))
col1 col2 counts
0 A AY 3
2 B BX 3
4 C CX 5
Probably not ideal, but this works (applied to df1 while it still has the (col1, col2) MultiIndex, i.e. before the reset_index call):
df1.loc[df1.groupby(level=0).idxmax()['count']]
col1 col2 count
A AY 3
B BX 3
C CX 5
This works because idxmax inside the groupby returns the index label of each group's maximum, and loc then pulls up exactly those rows.
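For illustration, a minimal sketch of that intermediate step (again assuming df1 still carries the MultiIndex produced by the groupby/agg above):
idx = df1.groupby(level=0)['count'].idxmax()
# idx holds one (col1, col2) label per group, roughly:
# col1
# A    (A, AY)
# B    (B, BX)
# C    (C, CX)
df1.loc[idx]  # pulls up exactly those rows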
I guess you need this: df['qty'] = 1 and then df.groupby(['col1', 'col2']).sum().reset_index()
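That only gives the per-pair counts, though; to keep just the top row per col1 you would still need something like a sort plus drop_duplicates. A hedged sketch building on that idea (not part of the original answer):
# assumes df is the question's original frame
df['qty'] = 1
counts = df.groupby(['col1', 'col2'], as_index=False)['qty'].sum()
top = (counts.sort_values('qty', ascending=False)
             .drop_duplicates('col1')          # keeps the largest qty per col1
             .sort_values('col1')
             .rename(columns={'qty': 'count'}))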
Option 1: Include Ties
In case you have ties and want to show them.
Ties occur when, for instance, both (B, BX) and (B, BY) appear 3 times.
# Prepare packages
import pandas as pd
# Create dummy data
df = pd.DataFrame({
'col1': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
'col2': ['AX', 'AX', 'AY', 'AY', 'AY', 'BX', 'BX', 'BX', 'BY', 'BY', 'BY', 'CX', 'CX', 'CX', 'CX', 'CX'],
})
# Get max value by group, keeping ties
df_count = df.groupby('col1', as_index=False)['col2'].value_counts()  # recent pandas names the count column 'count'
m = df_count.groupby('col1')['count'].transform('max') == df_count['count']
df1 = df_count[m]
col1 col2 count
0 A AY 3
2 B BX 3
3 B BY 3
4 C CX 5
Option 2: Short Code Ignoring Ties
df1 = (df
.groupby('col1')['col2']
.value_counts()
.groupby(level=0)
.head(1)
# .to_frame('count').reset_index() # Uncomment to get exact output requested
)
Suppose there is the following dataframe:
import pandas as pd
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'], 'Value': [1, 2, 3, 4, 5, 6]})
I would like to subtract the values of group A from those of groups B and C and put the difference in a new column. That is, I would like to do something like this:
df[df['Group'] == 'B']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
df[df['Group'] == 'C']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
and place the result in a new column. Is there a way of doing it without a for loop?
Assuming you want to subtract the first A from the first B/C, the second A from the second B/C, etc., the easiest might be to reshape:
df2 = (df
       .assign(cnt=df.groupby('Group').cumcount())
       .pivot(index='cnt', columns='Group', values='Value')
)
# Group A B C
# cnt
# 0 1 3 5
# 1 2 4 6
df['new_col'] = df2.sub(df2['A'], axis=0).melt()['value']
variant:
df['new_col'] = (df
.assign(cnt=df.groupby('Group').cumcount())
.groupby('cnt', group_keys=False)
.apply(lambda d: d['Value'].sub(d.loc[d['Group'].eq('A'), 'Value'].iloc[0]))
)
output:
Group Value new_col
0 A 1 0
1 A 2 0
2 B 3 2
3 B 4 2
4 C 5 4
5 C 6 4
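Another hedged option under the same assumption (rows pair up by their position within each group): look the A value up by position with map. A sketch, not part of the answer above:
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'], 'Value': [1, 2, 3, 4, 5, 6]})

cnt = df.groupby('Group').cumcount()                                             # position within each group
a_vals = df.loc[df['Group'] == 'A', 'Value'].set_axis(cnt[df['Group'] == 'A'])   # A value per position
df['new_col'] = df['Value'] - cnt.map(a_vals)                                    # same result as above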
I have a pandas data frame like this:
from itertools import *
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
The unique nodes are:
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
For each Relationship the source (Src) and destination (Dst) can be generated:
df1 = pd.DataFrame(
data=list(combinations(uniq_nodes, 2)),
columns=['Src', 'Dst'])
df1
Src Dst
0 a b
1 a c
2 a d
3 b c
4 b d
5 c d
I need a new dataframe newdf based on the shared elements in col2 of df_rel. The Relationship column comes from col2. Thus the desired dataframe with the edge list will be:
newdf
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 c d XY
Is there a fast way to achieve this? The original dataframe has 30,000 rows.
I took this approach. It works, but it is still not very fast for a large dataframe.
from itertools import *
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)),
    columns=['Src', 'Dst'])
filter1 = df_rel['col1'].isin(df1['Src'])
src_df = df_rel[filter1].copy()   # copy so the rename below doesn't warn
src_df.rename(columns={'col1': 'Src'}, inplace=True)
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2].copy()
dst_df.rename(columns={'col1': 'Dst'}, inplace=True)
new_df = pd.merge(src_df, dst_df, on="col2", how="inner")
print ("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print ("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df.rename(columns={'col2':'Relationship'}, inplace=True)
print(new_df.shape)
print (new_df)
Src Relationship Dst
0 a XX b
1 a XX d
3 b XX d
5 c XY d
6 a YY b
You need to loop through the df1 rows and find the rows from df_rel that match the df1['Src'] and df1['Dst'] values. Once you have the col2 values for Src and Dst, compare them, and if they match, create a row in newdf. Try this and check whether it performs well on large datasets:
Data setup (same as yours):
from itertools import combinations
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
uniq_nodes = df_rel['col1'].unique()
df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)), columns=['Src', 'Dst'])
Code:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list instead
rows = []
for i, row in df1.iterrows():
    src = df_rel.loc[df_rel['col1'] == row['Src'], 'col2'].to_list()
    dst = df_rel.loc[df_rel['col1'] == row['Dst'], 'col2'].to_list()
    for x in src:
        if x in dst:
            rows.append({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x})
newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])
print(newdf)
Result:
Src Dst Relationship
0 a b XX
1 a b YY
2 a d XX
3 b d XX
4 c d XY
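If the loop turns out to be too slow on 30,000 rows, a vectorized self-merge on col2 is a hedged alternative (not the answer above); on this data it yields the same five pairs:
# assumes df_rel from the question; pairs every two col1 values that share a col2 value
pairs = df_rel.merge(df_rel, on='col2', suffixes=('_src', '_dst'))
newdf = (pairs.loc[pairs['col1_src'] < pairs['col1_dst']]   # keep each pair once, no self loops
              .rename(columns={'col1_src': 'Src', 'col1_dst': 'Dst', 'col2': 'Relationship'})
              .loc[:, ['Src', 'Dst', 'Relationship']]
              .drop_duplicates()
              .reset_index(drop=True))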
I have a DataFrame like this:
import numpy as np
import pandas as pd

data = {'col1': ['A', 'B', 'B', 'A', 'B', 'C', 'B', 'B', 'B',
                 'A', 'C', 'A', 'B', 'C'],
        'col2': [np.nan, 'comment1', 'comment2', np.nan, 'comment3', np.nan,
                 'comment4', 'comment5', 'comment6',
                 np.nan, np.nan, np.nan, 'comment7', np.nan]}
frame = pd.DataFrame(data)
frame
frame
col1 col2
A NaN
B comment1
B comment2
A NaN
B comment3
C NaN
B comment4
B comment5
B comment6
A NaN
C NaN
A NaN
B comment7
C NaN
Each row with col1 == 'B' has a comment which will be a string. I need to aggregate the comments and fill the preceding row (where col1 != 'B') with the resulting aggregated string.
Any given row where col1 != 'B' could have none, one, or many corresponding comment rows (col1 == 'B'), which seems to be the crux of the problem; I can't just use fillna(method='bfill') or similar.
I have looked into iterrows(), groupby(), while loops, and tried to build my own function, but I don't think I fully understand how they work.
Finished product should look like this:
col1 col2
A comment1 + comment2
B comment1
B comment2
A comment3
B comment3
C comment4 + comment5 + comment6
B comment4
B comment5
B comment6
A NaN
C NaN
A comment7
B comment7
C NaN
Eventually I will be dropping all rows where col1 == 'B', but for now I'd like to keep them for verification.
Here's one way using GroupBy with a custom grouper to concatenate the strings where col1 is B:
where_a = frame.col1.ne('B')
g = where_a.cumsum()
com = frame[frame.col1.eq('B')].groupby(g).col2.agg(lambda x: x.str.cat(sep=' + '))
# non-B rows that are immediately followed by at least one comment (B) row
ixs = frame.col2.isna() & frame.col2.shift(-1).notna()
frame.loc[ixs, 'col2'] = com.values
print(frame)
col1 col2
0 A comment1 + comment2
1 B comment1
2 B comment2
3 A comment3
4 B comment3
5 C comment4 + comment5 + comment6
6 B comment4
7 B comment5
8 B comment6
9 A NaN
10 C NaN
11 A comment7
12 B comment7
13 C NaN
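If, as mentioned in the question, you eventually drop the rows where col1 == 'B' after verification, that is a one-liner on the filled frame (a hedged follow-up):
result = frame.loc[frame['col1'].ne('B')].reset_index(drop=True)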
from functools import reduce

df['col_group'] = -1
col_group = 0
for i in df.index:
    if df.loc[i, 'col1'] != 'B':
        col_group += 1
    df.loc[i, 'col_group'] = col_group

comments = df[df['col1'] == 'B']
transactions = df[df['col1'] != 'B']
agg_comments = comments.groupby('col_group')['col2'].apply(lambda x: reduce(lambda i, j: i + "&$#" + j, x)).reset_index()
df = pd.merge(transactions, agg_comments, on='col_group', how='outer')
I'm summing the values in a pivot table using pandas.
import numpy as np
import pandas as pd

dfr = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                    'B': [1, 2, 2, 3, 1, 2, 2, 2],
                    'C': [1, 1, 1, 2, 1, 1, 2, 2],
                    'Val': [1, 1, 1, 1, 1, 1, 1, 1]})
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum)
dfr
Output:
A B C |Val
------------|---
1 1 1 |1
2 1 |2
3 2 |1
2 1 1 |1
2 1 |1
2 |2
I need the output to show only the largest Val within each A group, like this:
A B C |Val
------------|---
1 2 1 |2
2 2 2 |2
I've googled a bit around and tried using nlargest() in different ways without being able to produce the result I want. Anyone got any ideas?
I think you need groupby + nlargest by level A:
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum)
dfr = dfr.groupby(level='A')['Val'].nlargest(1).reset_index(level=0, drop=True).reset_index()
print (dfr)
A B C Val
0 1 2 1 2
1 2 2 2 2
because if you use pivot_table, the other levels are lost:
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum).reset_index()
dfr = dfr.pivot_table(values='Val', index='A', aggfunc=lambda x: x.nlargest(1))
print (dfr)
Val
A
1 2
2 2
And if you use all levels, it returns nlargest within every (A, B, C) group, which is not what you want:
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=np.sum).reset_index()
dfr = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc=lambda x: x.nlargest(1))
print (dfr)
Val
A B C
1 1 1 1
2 1 2
3 2 1
2 1 1 1
2 1 1
2 2
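As a hedged aside, the same largest-per-A result can also be had without nlargest, by sorting on Val and keeping the last row of each A group. A sketch, not the answer's code:
import pandas as pd

dfr = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                    'B': [1, 2, 2, 3, 1, 2, 2, 2],
                    'C': [1, 1, 1, 2, 1, 1, 2, 2],
                    'Val': [1, 1, 1, 1, 1, 1, 1, 1]})

piv = dfr.pivot_table(values='Val', index=['A', 'B', 'C'], aggfunc='sum')
out = (piv.sort_values('Val')
          .groupby(level='A').tail(1)   # last row of each A group = largest Val
          .sort_index()
          .reset_index())
#    A  B  C  Val
# 0  1  2  1    2
# 1  2  2  2    2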
I have two dataframe: df1 and df2.
df1 is the following:
name exist
a 1
b 1
c 1
d 1
e 1
df2 (which has just one column, name) is the following:
name
e
f
g
a
h
I want to merge these two dataframes without repeating names. I mean: if a name in df2 already exists in df1, it should be shown only once; if a name in df2 does not exist in df1, its exist value should be set to 0 or NaN. For example, since a and e appear in both df1 and df2, they should each be shown only once. I want to end up with the following df:
a 1
b 1
c 1
d 1
e 1
f 0
g 0
h 0
I used the concat function to do it; my code is the following:
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
'exist': ['1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'name': ['e', 'f', 'g', 'h', 'a']})
df = pd.concat([df1, df2])
print(df)
but the result is wrong (the names a and e are shown twice):
exist name
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
0 NaN e
1 NaN f
2 NaN g
3 NaN h
4 NaN a
Please lend a hand, thanks in advance!
As indicated by your title, you can use merge instead of concat and set the how parameter to 'outer', since you want to keep all records from both df1 and df2, which is exactly what an outer join does:
import pandas as pd
pd.merge(df1, df2, on='name', how='outer').fillna(0)
# exist name
# 0 1 a
# 1 1 b
# 2 1 c
# 3 1 d
# 4 1 e
# 5 0 f
# 6 0 g
# 7 0 h
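One hedged follow-up: in the example, df1's exist column holds strings ('1'), so after filling you may want a proper numeric dtype:
out = pd.merge(df1, df2, on='name', how='outer')
out['exist'] = out['exist'].fillna(0).astype(int)   # '1' -> 1, missing names -> 0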