I have this data frame as an example:
Col1  Col2  Col3  Col4
   1     2     3   2.2
I would like to add a new column called 'Gmean' that calculates the geometric mean of the first 3 columns in each row.
How can I get this done?
Thanks!
One way would be with SciPy's geometric mean function -
from scipy.stats.mstats import gmean
df['Gmean'] = gmean(df.iloc[:,:3],axis=1)
Another way with the formula of geometric mean itself -
import numpy as np
df['Gmean'] = np.power(df.iloc[:,:3].prod(axis=1), 1.0/3)
If there are exactly 3 columns, just use df instead of df.iloc[:,:3]. Also, if you are looking for performance, you might want to work with the underlying array data with df.values or df.iloc[:,:3].values.
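As a minimal sketch of the array-based route (the one-row sample frame below mirrors the example above):

```python
import numpy as np
import pandas as pd

# One-row sample mirroring the question's data.
df = pd.DataFrame({'Col1': [1], 'Col2': [2], 'Col3': [3], 'Col4': [2.2]})

# Work on the underlying ndarray to avoid per-column pandas overhead.
arr = df.iloc[:, :3].values
df['Gmean'] = np.power(arr.prod(axis=1), 1.0 / 3)
print(df)
```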
df.assign(Gmean=df.iloc[:, :3].prod(1) ** (1. / 3))
Col1 Col2 Col3 Col4 Gmean
0 1 2 3 2.2 1.817121
I have a dataframe like this
df1 = pd.DataFrame([['a',1,100],['b',2,300],['c',3,400]],columns = ['col1','col2','col3'])
Required output
summary_df = df1.groupby('col1').agg({'col2':'sum','col3':'sum'}).reset_index() #line1
summary_df['col3'] = round(summary_df['col3']/1000,2)
Can we do the division and rounding inside line1 itself? I have more columns to process like that, so adding a separate line for each column is not a good idea.
You can also pass a lambda function as the aggregate and perform column-specific operations for a particular column.
>>> df1.groupby('col1').agg({'col2':'sum','col3':lambda x:round(x.sum()/1000,2)})
col2 col3
col1
a 1 0.1
b 2 0.3
c 3 0.4
If you need to apply the same function more than once, it's better to create a normal function and use it for multiple columns, instead of repeating the lambda:
def func(x):
    return round(x.sum()/1000, 2)
df1.groupby('col1').agg({'col2':'sum','col3':func})
col2 col3
col1
a 1 0.1
b 2 0.3
c 3 0.4
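Building on the named-function idea, a dict comprehension keeps the aggregation to one line even with many columns. A small sketch (the `cols` list is hypothetical; extend it with your real column names):

```python
import pandas as pd

def scale_sum(x):
    # Sum the group, scale to thousands, round to 2 decimals.
    return round(x.sum() / 1000, 2)

df1 = pd.DataFrame([['a', 1, 100], ['b', 2, 300], ['c', 3, 400]],
                   columns=['col1', 'col2', 'col3'])

cols = ['col3']  # hypothetical: list every column that needs the same treatment
agg_map = {'col2': 'sum', **{c: scale_sum for c in cols}}
summary_df = df1.groupby('col1').agg(agg_map).reset_index()
print(summary_df)
```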
Yes, you can do that using assign:
summary_df = (df1.groupby('col1')
.agg({'col2':'sum','col3':'sum'})
.reset_index()
.assign(col3=lambda x: round(x['col3']/1000,2))) #line1
I would like to locate all places where the value in Col2 changes (for example, from A to C) and then modify the corresponding value in Col1 (the row where the change happens, so for A -> C that is the value in the same row as C) by adding half the difference between the current and previous value; in this example that gives 1 + (1.5 - 1)/2 = 1.25.
The output table is the result of replacing all such occurrences in the whole table.
How can I achieve that?
Col1  Col2
1     A
1.5   C
2.0   A
2.5   A
3.0   D
3.5   D
OUTPUT:
Col1  Col2
1     A
1.25  C
1.75  A
2.5   A
2.75  D
3.5   D
Use np.where and a Series holding the values of your formula:
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
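Put together as a runnable sketch against the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
                   'Col2': ['A', 'C', 'A', 'A', 'D', 'D']})

# Midpoint between the previous and current Col1 value.
solution = df.Col1.shift() + (df.Col1 - df.Col1.shift()) / 2

# Where Col2 differs from the row above, take the midpoint;
# the first row has no predecessor, so fall back to the original value.
changed = ~df.Col2.eq(df.Col2.shift())
df['Col1'] = np.where(changed, solution.fillna(df.Col1), df.Col1)
print(df)
```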
I cannot work out how to add a new row at the end. The last row needs to apply sum() to specific columns and a division to two other columns, while a filter restricts which rows are included in the sums.
df:
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.79
1 Cat2 2 -81.91 -15.30 -16.00 10.06
2 Cat3 3 -57.70 -18.62 0.00 0.00
I would like the output to be like so:
3 Total -123.60 -119.02 -26.91 100*(-119.02/-26.91)
col3,col4,col5 would have sum(), and col6 would be the above formula.
If [CategID]==2, then don't include in the TOTAL
I was able to get it almost as I wanted by using .query(), like so:
# tg is a list of CategID values to keep
df.loc['Total'] = df.query("CategID in @tg").sum()
But with the above I cannot make 'col6' equal to 100*(col4.sum() / col5.sum()), because every column just gets sum().
Then I tried with a Series, but I don't understand how to apply the .where() filter:
s = pd.Series([df['col3'].sum(),
               df['col4'].sum(),
               df['col5'].sum(),
               100*(df['col4'].sum()/df['col5'].sum())],
              index=['col3', 'col4', 'col5', 'col6'])
df.loc['Total'] = s.where('tag1' in tg)
Using the above Series works until I add .where(), which gives the error:
ValueError: Array conditional must be same shape as self
So, can I accomplish this with the first method, using .query(), and just somehow modify one of the columns in TOTAL?
Otherwise, what am I doing wrong with .where() in the second method?
Thanks
IIUC, you can try:
s = df.mask(df['CategID'].eq(2)).drop("CategID", axis=1).sum()
s.loc['col6'] = 100*(s['col4'] / s['col5'])
df.loc[len(df)] = s
df = df.fillna({'Categ':'Total',"CategID":''})
print(df)
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.790000
1 Cat2 2 -81.91 -15.30 -16.00 10.060000
2 Cat3 3 -57.70 -18.62 0.00 0.000000
3 Total -123.60 -119.02 -26.91 442.289112
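As for the ValueError in the second attempt: Series.where expects an element-wise boolean mask the same shape as the Series (or a callable), while `'tag1' in tg` evaluates to a single Python bool. A scalar condition belongs in a plain `if`; only array-like conditions go to .where. A small sketch (`tg` is a hypothetical filter list):

```python
import pandas as pd

tg = ['tag1']  # hypothetical filter list from the question
s = pd.Series([-123.60, -119.02, -26.91, 442.29],
              index=['col3', 'col4', 'col5', 'col6'])

# Scalar condition: gate the whole assignment with a plain `if`.
if 'tag1' in tg:
    total = s.copy()

# Element-wise condition: .where needs a boolean array aligned with s.
masked = s.where(s < 0)  # keeps negative entries, NaN elsewhere
print(masked)
```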
I am using python with pandas imported to manipulate some data from a csv file I have. Just playing around to try and learn something new.
I have the following data frame:
I would like to group the data by Col1 so that I get the following result, which is a groupby on Col1 with Col3 and Col4 multiplied together.
I have been watching some youtube videos and reading some similar questions on stack overflow but I am having trouble. So far I have the following which involves creating a new Col to hold the result of Col3 x Col4:
df['Col5'] = df.Col3 * df.Col4
gf = df.groupby(['col1', 'Col5'])
You can use a solution without creating a new column: multiply the two columns, group the result by df['Col1'], and aggregate with sum. It is syntactic sugar:
gf = (df.Col3 * df.Col4).groupby(df['Col1']).sum().reset_index(name='Col2')
print (gf)
Col1 Col2
0 12345 38.64
1 23456 2635.10
2 45678 419.88
Another solution: create an index from Col1 with set_index, multiply the columns with prod, and finally sum within the index with level=0:
gf = df.set_index('Col1')[['Col3','Col4']].prod(axis=1).sum(level=0).reset_index(name='Col2')
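Note that Series.sum(level=0) has since been deprecated (and removed in pandas 2.0); the same pipeline with an explicit groupby on the index level, using hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample values; the question's real frame comes from a CSV.
df = pd.DataFrame({'Col1': [12345, 12345, 23456],
                   'Col3': [2.0, 4.0, 5.0],
                   'Col4': [3.0, 1.0, 2.0]})

gf = (df.set_index('Col1')[['Col3', 'Col4']]
        .prod(axis=1)
        .groupby(level=0)   # replaces the removed .sum(level=0)
        .sum()
        .reset_index(name='Col2'))
print(gf)
```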
Almost, but you are grouping by too many columns in the end. Try:
gf = df.groupby('Col1')['Col5'].sum()
Or to get it as a dataframe, rather than Col1 as an index (I'm judging that this is what you want from your image), include as_index=False in your groupby:
gf = df.groupby('Col1', as_index=False)['Col5'].sum()
I have a table or df (if pandas has a better way) in which one of the columns contains mixed character-and-number strings. I need to count them and append a unique mixed string to each row. What would be the best way to do this: a Python loop, or does pandas have syntax for it? Example data:
col0 col1 col2
ENSG0001 E001 ENSG001:E001
ENSG0001 E002 ENSG001:E002
.
.
ENSG001 E028 ENSG001:E028
ENSG002 E001 ENSG002:E001
.
ENSG002 E012 ENSG002:E012
Edit:
I need to count the elements in col0 and, instead of a plain number, use an E001-style value as the counter, concatenating it with col0 (in the style of col2).
Concatenate the column with a Series created by cumcount, converted to string with astype and zero-padded with zfill:
df['col3'] = (df['col0'] + ':E' +
              df.groupby('col0').cumcount().add(1).astype(str).str.zfill(3))
print (df)
col0 col1 col2 col3
0 ENSG0001 E001 ENSG001:E001 ENSG0001:E001
1 ENSG0001 E002 ENSG001:E002 ENSG0001:E002
2 ENSG001 E028 ENSG001:E028 ENSG001:E001
3 ENSG002 E001 ENSG002:E001 ENSG002:E001
4 ENSG002 E012 ENSG002:E012 ENSG002:E002
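A self-contained sketch of the same cumcount pipeline on a small sample:

```python
import pandas as pd

df = pd.DataFrame({'col0': ['ENSG0001', 'ENSG0001', 'ENSG002'],
                   'col1': ['E001', 'E002', 'E001']})

# Number rows within each col0 group starting at 1,
# render as zero-padded 3-digit strings, then concatenate.
counter = df.groupby('col0').cumcount().add(1).astype(str).str.zfill(3)
df['col3'] = df['col0'] + ':E' + counter
print(df)
```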