My pandas DataFrame looks like the following, and I want to apply groupby and then calculate the mean of some columns and the first value of others:
index col1 col2 col3 col4 col5 col6
0     a    c    1    2    f    5
1     a    c    1    2    f    7
2     a    d    1    2    g    9
3     b    d    6    2    g    4
4     b    e    1    2    g    8
5     b    e    1    2    g    2
I tried something like this:
df.groupby(['col1','col5']).agg({['col6','col3']:'mean', ['col4','col2']:'first'})
expecting this output:
col1 col5 col6 col3 col4 col2
a    f    6    1    2    c
a    g    9    1    2    d
b    g    4    3    2    e
but it seems lists are not an option here. In my real dataset I have hundreds of columns of different natures, so I can't pass them individually. Any thoughts on passing them as lists?
If you have lists of columns per aggregation, you can do:
l_mean = ['col6','col3']
l_first = ['col4','col2']
df.groupby(['col1','col5']).agg({**{col: 'mean' for col in l_mean},
                                 **{col: 'first' for col in l_first}})
The notation **{} unpacks a dictionary, so {**{}, **{}} merges two (or more than two) dictionaries into one; it is like a union of dictionaries. And {col: 'mean' for col in l_mean} is a dictionary comprehension that creates a dictionary with each column of the list as a key and 'mean' as its value.
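For instance, the merged dictionary built from the two lists above looks like this (a minimal sketch):
l_mean = ['col6', 'col3']
l_first = ['col4', 'col2']

# each comprehension maps column -> aggregation name; ** merges them
agg_map = {**{col: 'mean' for col in l_mean},
           **{col: 'first' for col in l_first}}
print(agg_map)
# {'col6': 'mean', 'col3': 'mean', 'col4': 'first', 'col2': 'first'}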
Or using concat:
gr = df.groupby(['col1','col5'])
pd.concat([gr[l_mean].mean(),
           gr[l_first].first()],
          axis=1)
and call reset_index afterwards to get the expected output.
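For example, chained together (a sketch reusing gr, l_mean, and l_first from above):
out = pd.concat([gr[l_mean].mean(), gr[l_first].first()], axis=1).reset_index()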
Or with named aggregation (pandas 0.25+):
(
    df.groupby(['col1','col5'])
      .agg(col6=('col6', 'mean'),
           col3=('col3', 'mean'),
           col4=('col4', 'first'),
           col2=('col2', 'first'))
)
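If there are too many columns to spell out, the same named-aggregation call can be built from the lists with dictionary unpacking (my sketch, combining the two answers above):
l_mean = ['col6', 'col3']
l_first = ['col4', 'col2']

# keyword arguments of the form new_name=(column, aggfunc)
df.groupby(['col1', 'col5']).agg(
    **{col: (col, 'mean') for col in l_mean},
    **{col: (col, 'first') for col in l_first},
)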
This is an extension of @Ben.T's solution, just wrapping it in a function and passing it via the pipe method:
# set list1, list2
def fil(grp, list1, list2):
    A = grp.mean().filter(list1)
    B = grp.first().filter(list2)
    C = A.join(B)
    return C

grp1 = ['col6','col3']
grp2 = ['col4','col2']
m = df.groupby(['col1','col5']).pipe(fil, grp1, grp2)
m
I have a df
Name Symbol Dummy
A (BO),(BO),(AD),(TR) 2
B (TV),(TV),(TV) 2
C (HY) 2
D (UI) 2
I need df as
Name Symbol Dummy
A (BO),(AD),(TR) 2
B (TV) 2
C (HY) 2
D (UI) 2
I tried drop_duplicates, but it is not working as expected.
Split the strings on the delimiter ',', then dedupe using dict.fromkeys (which also preserves the order of the strings), and finally join back with ',':
df['Symbol'] = df['Symbol'].str.split(',').map(dict.fromkeys).str.join(',')
Name Symbol Dummy
0 A (BO),(AD),(TR) 2
1 B (TV) 2
2 C (HY) 2
3 D (UI) 2
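To see why dict.fromkeys works here, a minimal illustration:
x = ['(BO)', '(BO)', '(AD)', '(TR)']
# dicts preserve insertion order (Python 3.7+), so duplicates collapse
# while the first-occurrence order survives
list(dict.fromkeys(x))
# ['(BO)', '(AD)', '(TR)']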
Another method
#original DF
index col1                 col2
0     (BO),(BO),(AD),(TR)  2
df.col1 = df.col1.str.split(',').apply(lambda x: sorted(set(x), key=x.index)).str.join(',')
df
#output
index col1            col2
0     (BO),(AD),(TR)  2
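The key=x.index trick orders the unique values by their first position in the original list, for example:
x = ['(BO)', '(BO)', '(AD)', '(TR)']
sorted(set(x), key=x.index)
# ['(BO)', '(AD)', '(TR)']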
If the order of the values is not important, you can simply do:
df.col1 = df.col1.str.split(',').apply(lambda x: set(x)).str.join(',')
df
#output
index col1            col2
0     (AD),(BO),(TR)  2
I want to select rows by position (as with iloc) within each value of a column.
df1 = pd.DataFrame({'col1': ['1','1','1','2','2','2','2','2','3','3','3'],
                    'col2': ['A','B','C','D','E','F','G','H','I','J','K']})
I want to select the row at position 2 within each value of col1, as a dataframe, and the result should be:
col1 col2
1 C
2 F
3 K
Thank you so much
Use GroupBy.nth:
df2 = df1.groupby('col1', as_index=False).nth(2)
Alternative with GroupBy.cumcount:
df2 = df1[df1.groupby('col1').cumcount().eq(2)]
print(df2)
col1 col2
2 1 C
5 2 F
10 3 K
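To see what cumcount compares against, it numbers the rows within each group starting at 0 (a quick illustration):
print(df1.groupby('col1').cumcount().tolist())
# [0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2]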
Or aggregate with a lambda (note that col1 becomes the index):
df1.groupby('col1').agg(lambda ss: ss.iloc[2])
col2
col1
1 C
2 F
3 K
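Note that this lambda raises an IndexError when a group has fewer than three rows; a guarded variant (my addition, not from the answer):
# return None (shown as NaN) for groups shorter than 3 rows
df1.groupby('col1').agg(lambda ss: ss.iloc[2] if len(ss) > 2 else None)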
Let's assume I have a list similar to the one below:
l = ['A','B','C','D','E','F','G','H','I','L','M','N']
I want to create a dataframe with 4 columns, where every 4 objects in the list form a row. The outcome should be a dataframe of the following form:
Col1 Col2 Col3 Col4
A B C D
E F G H
I L M N
Can anyone help me do it?
Thanks!
Convert the values to a NumPy array and then use reshape:
import numpy as np
import pandas as pd

l = ['A','B','C','D','E','F','G','H','I','L','M','N']
df = pd.DataFrame(np.array(l).reshape(-1, 4)).add_prefix('col')
print(df)
col0 col1 col2 col3
0 A B C D
1 E F G H
2 I L M N
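To match the exact headers from the question, you can also pass the column names explicitly (a small variation on the same idea):
df = pd.DataFrame(np.array(l).reshape(-1, 4),
                  columns=['Col1', 'Col2', 'Col3', 'Col4'])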
I have a file with 19 columns of mixed dtypes. One of the columns contains elements separated by spaces. For example:
Col1 Col2
adress1 x
adress2 a b
adress3 x c
adress4 a x d
What I want to do is go over Col2, find out how many times each element occurs, and put the result in a new column along with its corresponding value in Col1.
Note: the above columns were already read into a DataFrame.
I have this, which somewhat gives me the results, but not what I ultimately want:
new_df = pd.DataFrame(old_df.Col2.str.split(' ').tolist(), index=old_df.Col1).stack()
How do I put the results in a new column (replacing Col2) and also keep the remaining columns?
Something like:
Col1 Col2 Col3
adress1 x something
adress2 a something1
adress2 b something1
adress3 x NaN
adress3 c NaN
And how do I calculate the occurrences of the items in Col2?
We can split first, then explode:
s=df.assign(Col2=df.Col2.str.split()).explode('Col2')
s=s.groupby(['Col1','Col2']).size().to_frame('count').reset_index()
Out[48]:
Col1 Col2 count
0 adress1 x 1
1 adress2 a 1
2 adress2 b 1
3 adress3 c 1
4 adress3 x 1
5 adress4 a 1
6 adress4 d 1
7 adress4 x 1
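If you also need to keep the remaining columns (such as Col3 in the question), one sketch building on the same explode step uses transform instead of the groupby/size reduction:
s = df.assign(Col2=df.Col2.str.split()).explode('Col2')
# attach the per-(Col1, Col2) occurrence count without collapsing rows
s['count'] = s.groupby(['Col1', 'Col2'])['Col2'].transform('size')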
I have two dataframes, similar in structure to those shown below, that I want to sort; the rows are jumbled when looking at only the first 3 columns. How do I sort the dataframes such that the row indices match?
Also, there may not be matching rows, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort both dataframes by the columns that should determine order, then reset the index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a join to produce a blank entry where every column from df2 is NaN at index 3.
If you have already sorted both dataframes, you can merge on the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order; this creates a new dataframe with the joined values, sorted the same way df1 is sorted:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left dataframe:
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN
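If you need blank entries in both directions, an outer merge does the same thing while also keeping rows that exist only in df2 (a sketch on the same cols):
df3 = df1.merge(df2, 'outer', on=cols, suffixes=('_x', ''))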