I have this data frame as an example:
Col1  Col2  Col3  Col4
   1     2     3   2.2
I would like to add a new column called 'Gmean' that calculates the geometric mean of the first 3 columns in each row.
How can I get this done?
Thanks!
One way would be with SciPy's geometric mean function -
from scipy.stats.mstats import gmean
df['Gmean'] = gmean(df.iloc[:,:3],axis=1)
Another way with the formula of geometric mean itself -
import numpy as np
df['Gmean'] = np.power(df.iloc[:,:3].prod(axis=1), 1.0/3)
If there are exactly 3 columns, just use df instead of df.iloc[:,:3]. Also, if you are looking for performance, you might want to work with the underlying array data with df.values or df.iloc[:,:3].values.
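As a minimal sketch of the array-based route (the one-row sample frame below mirrors the example above):

```python
import numpy as np
import pandas as pd

# One-row sample mirroring the question's data.
df = pd.DataFrame({'Col1': [1], 'Col2': [2], 'Col3': [3], 'Col4': [2.2]})

# Work on the underlying ndarray to avoid per-column pandas overhead.
arr = df.iloc[:, :3].values
df['Gmean'] = np.power(arr.prod(axis=1), 1.0 / 3)
print(df)
```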
df.assign(Gmean=df.iloc[:, :3].prod(1) ** (1. / 3))
Col1 Col2 Col3 Col4 Gmean
0 1 2 3 2.2 1.817121
I have a dataframe like this
df1 = pd.DataFrame([['a',1,100],['b',2,300],['c',3,400]],columns = ['col1','col2','col3'])
Required output
summary_df = df1.groupby('col1').agg({'col2':'sum','col3':'sum'}).reset_index() #line1
summary_df['col3'] = round(summary_df['col3']/1000,2)
Can we do the division and rounding inside line1 itself? I have more columns to process like that, so adding a separate line for each column is not a good idea.
You can also pass a lambda function as the aggregate and perform column-specific operations for a particular column.
>>> df1.groupby('col1').agg({'col2':'sum','col3':lambda x:round(x.sum()/1000,2)})
col2 col3
col1
a 1 0.1
b 2 0.3
c 3 0.4
If you need to apply the same function more than once, it's better to create a normal function and use it for multiple columns, instead of repeating the lambda:
def func(x):
    return round(x.sum()/1000, 2)
df1.groupby('col1').agg({'col2':'sum','col3':func})
col2 col3
col1
a 1 0.1
b 2 0.3
c 3 0.4
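Building on the named-function idea, a dict comprehension keeps the aggregation to one line even with many columns. A small sketch (the `cols` list is hypothetical; extend it with your real column names):

```python
import pandas as pd

def scale_sum(x):
    # Sum the group, scale to thousands, round to 2 decimals.
    return round(x.sum() / 1000, 2)

df1 = pd.DataFrame([['a', 1, 100], ['b', 2, 300], ['c', 3, 400]],
                   columns=['col1', 'col2', 'col3'])

cols = ['col3']  # hypothetical: list every column that needs the same treatment
agg_map = {'col2': 'sum', **{c: scale_sum for c in cols}}
summary_df = df1.groupby('col1').agg(agg_map).reset_index()
print(summary_df)
```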
Yes, you can do that using assign:
summary_df = (df1.groupby('col1')
.agg({'col2':'sum','col3':'sum'})
.reset_index()
.assign(col3=lambda x: round(x['col3']/1000,2))) #line1
I would like to locate all places where the value in Col2 changes (for example, from A to C) and then modify the corresponding value in Col1 (the row where the change happens, so for A -> C that is the value in the same row as C) by adding half the difference between the current and previous value; in this example that gives 1 + (1.5 - 1)/2 = 1.25.
The output table is the result of replacing all such occurrences in the whole table.
How can I achieve that?
Col1  Col2
1     A
1.5   C
2.0   A
2.5   A
3.0   D
3.5   D
OUTPUT:
Col1  Col2
1     A
1.25  C
1.75  A
2.5   A
2.75  D
3.5   D
Use np.where and a Series holding the values of your formula:
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
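Put together as a runnable sketch against the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
                   'Col2': ['A', 'C', 'A', 'A', 'D', 'D']})

# Midpoint between the previous and current Col1 value.
solution = df.Col1.shift() + (df.Col1 - df.Col1.shift()) / 2

# Where Col2 differs from the row above, take the midpoint;
# the first row has no predecessor, so fall back to the original value.
changed = ~df.Col2.eq(df.Col2.shift())
df['Col1'] = np.where(changed, solution.fillna(df.Col1), df.Col1)
print(df)
```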
I cannot work out how to add a new row at the end. The last row needs to apply sum() to specific columns and a division to two other columns, while a filter restricts which rows are included in the sums.
df:
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.79
1 Cat2 2 -81.91 -15.30 -16.00 10.06
2 Cat3 3 -57.70 -18.62 0.00 0.00
I would like the output to be like so:
3 Total -123.60 -119.02 -26.91 100*(-119.02/-26.91)
col3,col4,col5 would have sum(), and col6 would be the above formula.
If [CategID]==2, then don't include in the TOTAL
I was able to get it almost as I wanted by using .query(), like so:
# tg is a list of CategID values to keep
df.loc['Total'] = df.query("CategID in @tg").sum()
But with the above I cannot make 'col6' equal to 100*(col4.sum() / col5.sum()), because every column just gets sum().
Then I tried with a Series, but I don't understand how to apply the .where() filter:
s = pd.Series([df['col3'].sum(),
               df['col4'].sum(),
               df['col5'].sum(),
               100*(df['col4'].sum()/df['col5'].sum())],
              index=['col3', 'col4', 'col5', 'col6'])
df.loc['Total'] = s.where('tag1' in tg)
Using the above Series works until I add .where(), which gives the error:
ValueError: Array conditional must be same shape as self
So, can I accomplish this with the first method, using .query(), and just somehow modify one of the columns in TOTAL?
Otherwise, what am I doing wrong with .where() in the second method?
Thanks
IIUC, you can try:
s = df.mask(df['CategID'].eq(2)).drop("CategID", axis=1).sum()
s.loc['col6'] = 100*(s['col4'] / s['col5'])
df.loc[len(df)] = s
df = df.fillna({'Categ':'Total',"CategID":''})
print(df)
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.790000
1 Cat2 2 -81.91 -15.30 -16.00 10.060000
2 Cat3 3 -57.70 -18.62 0.00 0.000000
3 Total -123.60 -119.02 -26.91 442.289112
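As for the ValueError in the second attempt: Series.where expects an element-wise boolean mask the same shape as the Series (or a callable), while `'tag1' in tg` evaluates to a single Python bool. A scalar condition belongs in a plain `if`; only array-like conditions go to .where. A small sketch (`tg` is a hypothetical filter list):

```python
import pandas as pd

tg = ['tag1']  # hypothetical filter list from the question
s = pd.Series([-123.60, -119.02, -26.91, 442.29],
              index=['col3', 'col4', 'col5', 'col6'])

# Scalar condition: gate the whole assignment with a plain `if`.
if 'tag1' in tg:
    total = s.copy()

# Element-wise condition: .where needs a boolean array aligned with s.
masked = s.where(s < 0)  # keeps negative entries, NaN elsewhere
print(masked)
```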
I am using python with pandas imported to manipulate some data from a csv file I have. Just playing around to try and learn something new.
I have the following data frame:
I would like to group the data by Col1 so that I get the following result, which is a groupby on Col1 with Col3 and Col4 multiplied together.
I have been watching some youtube videos and reading some similar questions on stack overflow but I am having trouble. So far I have the following which involves creating a new Col to hold the result of Col3 x Col4:
df['Col5'] = df.Col3 * df.Col4
gf = df.groupby(['col1', 'Col5'])
You can use a solution without creating a new column: multiply the two columns, group the result by df['Col1'], and aggregate with sum. It is syntactic sugar:
gf = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print (gf)
Col1 Col2
0 12345 38.64
1 23456 2635.10
2 45678 419.88
Another solution: create an index from Col1 with set_index, multiply the columns with prod, and finally sum within the index with level=0:
gf = df.set_index('Col1')[['Col3','Col4']].prod(axis=1).sum(level=0).reset_index(name='Col2')
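Note that Series.sum(level=0) has since been deprecated (and removed in pandas 2.0); the same pipeline with an explicit groupby on the index level, using hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample values; the question's real frame comes from a CSV.
df = pd.DataFrame({'Col1': [12345, 12345, 23456],
                   'Col3': [2.0, 4.0, 5.0],
                   'Col4': [3.0, 1.0, 2.0]})

gf = (df.set_index('Col1')[['Col3', 'Col4']]
        .prod(axis=1)
        .groupby(level=0)   # replaces the removed .sum(level=0)
        .sum()
        .reset_index(name='Col2'))
print(gf)
```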
Almost, but you are grouping by too many columns in the end. Try:
gf = df.groupby('Col1')['Col5'].sum()
Or to get it as a dataframe, rather than Col1 as an index (I'm judging that this is what you want from your image), include as_index=False in your groupby:
gf = df.groupby('Col1', as_index=False)['Col5'].sum()
I have a table or df (if pandas has a better way) in which one of the columns contains mixed character-and-number strings. I need to count them and append a unique mixed string to each row. What would be the best way to do this: a Python loop, or does pandas have syntax for it? Example data:
col0 col1 col2
ENSG0001 E001 ENSG001:E001
ENSG0001 E002 ENSG001:E002
.
.
ENSG001 E028 ENSG001:E028
ENSG002 E001 ENSG002:E001
.
ENSG002 E012 ENSG002:E012
Edit:
I need to count the elements in col0 and, instead of a plain number, use an E001-style value as the counter, concatenating it with col0 (in the style of col2).
Concatenate the column with a Series created by cumcount, converted to string with astype and zero-padded with zfill:
df['col3'] = (df['col0'] + ':E' +
              df.groupby('col0').cumcount().add(1).astype(str).str.zfill(3))
print (df)
col0 col1 col2 col3
0 ENSG0001 E001 ENSG001:E001 ENSG0001:E001
1 ENSG0001 E002 ENSG001:E002 ENSG0001:E002
2 ENSG001 E028 ENSG001:E028 ENSG001:E001
3 ENSG002 E001 ENSG002:E001 ENSG002:E001
4 ENSG002 E012 ENSG002:E012 ENSG002:E002
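A self-contained sketch of the same cumcount pipeline on a small sample:

```python
import pandas as pd

df = pd.DataFrame({'col0': ['ENSG0001', 'ENSG0001', 'ENSG002'],
                   'col1': ['E001', 'E002', 'E001']})

# Number rows within each col0 group starting at 1,
# render as zero-padded 3-digit strings, then concatenate.
counter = df.groupby('col0').cumcount().add(1).astype(str).str.zfill(3)
df['col3'] = df['col0'] + ':E' + counter
print(df)
```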