Get max value from row of a dataframe in Python [duplicate]

This question already has answers here:
Find the max of two or more columns with pandas
(3 answers)
How to select max and min value in a row for selected columns
(2 answers)
Closed 5 years ago.
This is my dataframe df
a b c
1.2 2 0.1
2.1 1.1 3.2
0.2 1.9 8.8
3.3 7.8 0.12
I'm trying to get the max value from each row of a dataframe. I'm expecting output like this:
max_value
2
3.2
8.8
7.8
This is what I have tried:
df[len(df.columns)].argmax()
I'm not getting the proper output; any help would be much appreciated. Thanks.

Use max with axis=1:
df = df.max(axis=1)
print (df)
0 2.0
1 3.2
2 8.8
3 7.8
dtype: float64
And if you need a new column:
df['max_value'] = df.max(axis=1)
print (df)
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8

You could use NumPy, taking the row-wise max of the underlying array:
df.assign(max_value=df.values.max(1))
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
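If you also need to know which column each row's max came from, idxmax(axis=1) pairs naturally with max(axis=1). A small sketch on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.2, 2.1, 0.2, 3.3],
                   "b": [2.0, 1.1, 1.9, 7.8],
                   "c": [0.1, 3.2, 8.8, 0.12]})

df["max_value"] = df[["a", "b", "c"]].max(axis=1)   # row-wise max
df["max_col"] = df[["a", "b", "c"]].idxmax(axis=1)  # column holding that max
print(df)
```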

Related

How to append multiple columns to the first 3 columns and repeat the index values using pandas?

I have a data set in which the columns are in multiples of 3 (excluding index column[0]).
I am new to python.
Here there are 9 columns excluding the index. I want to append the 4th column to the 1st, the 5th to the 2nd, and the 6th to the 3rd; then the 7th to the 1st, the 8th to the 2nd, the 9th to the 3rd, and so on for a large data set. My data set's column count will always be a multiple of 3 (excluding the index column).
I also want the index values to repeat in the same order: in this case 6, 9, 4, 3 repeated 3 times.
import pandas as pd
import io
data =io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data,index_col=[0],header = None)
Expected Output:
df
6,5.6,4.6,8.2
9,2.3,7.8,1
4,4.8,9.1,0
3,9.4,10.6,7.5
6,2.5,9.4,7.6
9,4.8,6.7,8.4
4,7.1,5.6,3.6
3,1.5,4.3,14.3
6,9.3,4.1,1.9
9,45.2,8.9,1.5
4,63.7,7.6,4
3,36.1,6.3,0
The idea is to reshape with stack, sort on the second level of the resulting MultiIndex, and first create an ordered CategoricalIndex so the original index order is preserved:
import numpy as np

a = np.arange(len(df.columns))
df.index = pd.CategoricalIndex(df.index, ordered=True, categories=df.index.unique())
df.columns = [a // 3, a % 3]
df = df.stack(0).sort_index(level=1).reset_index(level=1, drop=True)
print (df)
0 1 2
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
Split the data frame horizontally and concatenate the components vertically:
df.columns = [1, 2, 3] * (len(df.columns) // 3)
rslt = pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
1 2 3
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
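For completeness, the same reshape can be done directly in NumPy, assuming (as the question states) the column count is an exact multiple of 3: view the values as (rows, blocks, 3), move the block axis to the front, and flatten back, tiling the index to match.

```python
import io
import numpy as np
import pandas as pd

data = io.StringIO("""\
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data, index_col=[0], header=None)

n_blocks = len(df.columns) // 3
arr = df.to_numpy().reshape(len(df), n_blocks, 3)   # (rows, blocks, 3)
out = pd.DataFrame(arr.transpose(1, 0, 2).reshape(-1, 3),  # block-major rows
                   index=np.tile(df.index, n_blocks))
print(out)
```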

Combine two columns with if condition in pandas

I have two columns whose data overlaps for some entries (and is nearly equal when it does).
df = pd.DataFrame(
{'x':[2.1,3.1,5.4,1.9,np.nan,4.3,np.nan,np.nan,np.nan],
'y':[np.nan,np.nan,5.3,1.9,3.2,4.2,9.1,7.8,4.1]
}
)
I want the result to be a column 'xy' which contains the average of x and y when they both have values and x or y when only one of them has a value like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
df['xy'] = df[['x','y']].mean(axis=1)
Note that mean skips NaN by default (skipna=True), so this averages when both values are present and passes through the single value otherwise.
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
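If you want the "if condition" spelled out explicitly (for example, so you could later weight the two sources differently), a sketch of the same logic with np.where:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]})

# Average when both are present; otherwise take whichever value exists
both = df['x'].notna() & df['y'].notna()
df['xy'] = np.where(both, (df['x'] + df['y']) / 2, df['x'].fillna(df['y']))
print(df)
```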

How to create new column based on top and bottom parts of single dataframe in PANDAS?

I have merged two dataframes having the same column names. Is there an easy way to get another column with the mean of these two appended dataframes?
Maybe code explains it better.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[10,20,30,40]})
df2 = pd.DataFrame({'a':[1.2,2.2,3.2,4.2],'b':[10.2,20.2,30.2,40.2]})
df = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
print(df)
df['a_mean'] = ???
a b
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
0 1.2 10.2
1 2.2 20.2
2 3.2 30.2
3 4.2 40.2
How to create a new column a_mean with values
[1.1, 2.1, 3.1, 4.1, 1.1, 2.1, 3.1, 4.1] effectively ?
Using melt():
df=df.assign(a_mean=df1.add(df2).div(2).melt().value)
Or taking only df, you can do:
df=df.assign(a_mean=df.groupby(df.index)['a'].mean())
a b a_mean
0 1.0 10.0 1.1
1 2.0 20.0 2.1
2 3.0 30.0 3.1
3 4.0 40.0 4.1
0 1.2 10.2 1.1
1 2.2 20.2 2.1
2 3.2 30.2 3.1
3 4.2 40.2 4.1
Try this:
df['a_mean'] = np.tile( (df1.a.to_numpy() + df2.a.to_numpy())/2, 2)
As per the comments, there is already a great answer by Anky, but to extend this method you can do this:
df['a_mean2'] = np.tile( (df.iloc[0: len(df)//2].a.to_numpy() + df.iloc[len(df)//2:].a.to_numpy())/2, 2)
Update:
df['a_mean3'] = np.tile(df.a.to_numpy().reshape(2,-1).mean(0), 2)
Output
print(df)
a b a_mean2 a_mean a_mean3
0 1.0 10.0 1.1 1.1 1.1
1 2.0 20.0 2.1 2.1 2.1
2 3.0 30.0 3.1 3.1 3.1
3 4.0 40.0 4.1 4.1 4.1
0 1.2 10.2 1.1 1.1 1.1
1 2.2 20.2 2.1 2.1 2.1
2 3.2 30.2 3.1 3.1 3.1
3 4.2 40.2 4.1 4.1 4.1
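A variant of the groupby idea that handles every column at once is transform, which broadcasts each group's mean back onto all rows of the group. A sketch on the same data (using pd.concat, since DataFrame.append was removed in pandas 2.0):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
df2 = pd.DataFrame({'a': [1.2, 2.2, 3.2, 4.2], 'b': [10.2, 20.2, 30.2, 40.2]})
df = pd.concat([df1, df2])

# transform('mean') keeps the original shape, so the per-index means
# line up row by row for every column at once
means = df.groupby(level=0).transform('mean')
df['a_mean'] = means['a']
print(df)
```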

Pandas dataframe threshold -- Keep number fixed if exceed

I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
1 2 3 4 5 6 7
John 1.3 2.8 3.0 4.4 2.6 3.1 4.8
Terry 1.1 2.3 4.1 5.5 3.7 2.1 3.8
Henry 0.3 1.0 2.0 3.0 2.7 1.1 2.8
How do I set a score ceiling such that once a score exceeds 2.5, all scores from that day onwards are fixed at that first breaching value, no matter what the later scores are?
The output should be:
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
I tried the following, which didn't work. I first build a boolean mask of all numbers > 2.5, then mask using its cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
You can find the first value above the threshold per row using where with bfill, selecting the first column with iloc:
m = (df > 2.5).cumsum(axis=1) > 0
s = df.where(m).bfill(axis=1).iloc[:, 0]
print (s)
John 2.8
Terry 4.1
Henry 3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)
Or shift the mask and forward-fill the resulting NaNs with the last valid values:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1, fill_value=False)).ffill(axis=1)
print (df)
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
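Putting the second approach together end to end on the sample data (using shift's fill_value so the mask stays boolean):

```python
import pandas as pd

df = pd.DataFrame({1: [1.3, 1.1, 0.3], 2: [2.8, 2.3, 1.0], 3: [3.0, 4.1, 2.0],
                   4: [4.4, 5.5, 3.0], 5: [2.6, 3.7, 2.7], 6: [3.1, 2.1, 1.1],
                   7: [4.8, 3.8, 2.8]}, index=['John', 'Terry', 'Henry'])

# m is True from the first day a score exceeds 2.5; shifting it one day
# right blanks everything *after* the breach, and ffill carries the
# breaching score forward
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1, fill_value=False)).ffill(axis=1)
print(df)
```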

Creating summed summary rows in pandas dataframe with specific criteria

Lets say I have the following pandas dataframe, and I am trying to post process the results to generate my (now blank) summary rows:
code entry_type value1 value2 value3 value4
1 A Holding 1.1 1.2 1.3 1.4
2 A Holding 2.1 2.2 2.3 2.4
3 B Holding 3.1 3.2 3.3 3.4
4 C Holding 4.1 4.2 4.3 4.4
5 C Holding 5.1 5.2 5.3 5.4
6 A Summary nan nan nan nan
7 C Summary nan nan nan nan
8 B Summary nan nan nan nan
Essentially, I would like value1-value4 in the summary lines to be the sum of the holdings for each code:
code entry_type value1 value2 value3 value4
1 A Holding 1.1 1.2 1.3 1.4
2 A Holding 2.1 2.2 2.3 2.4
3 B Holding 3.1 3.2 3.3 3.4
4 C Holding 4.1 4.2 4.3 4.4
5 C Holding 5.1 5.2 5.3 5.4
6 A Summary 3.2 3.4 3.6 3.8
7 C Summary 9.2 9.4 9.6 9.8
8 B Summary 3.1 3.2 3.3 3.4
I have tried a few groupby lines of code and came up with the following:
sums = df[df['entry_type']=="Holding"].groupby('code')[['value1', 'value2', 'value3', 'value4']].sum()
Which yields:
value1 value2 value3 value4
code
A 3.2 3.4 3.6 3.8
B 3.1 3.2 3.3 3.4
C 9.2 9.4 9.6 9.8
However, I am not sure how to apply this back to the original DataFrame, particularly because the code order is not necessarily the same as in the original. Any thoughts on how to apply this, or a better approach? (Note: the summary rows already contain data in other columns, so I can't simply generate new rows inline.)
It seems concat can help:
df1 = (df[df['entry_type']=="Holding"]
         .groupby('code')[['value1', 'value2', 'value3', 'value4']].sum())
#print (df1)
#if need filter `df` for only rows with Holding use boolean indexing
print (pd.concat([df[df['entry_type']=="Holding"].set_index('code'), df1])
.fillna({'entry_type':'Summary'})
.reset_index())
code entry_type value1 value2 value3 value4
0 A Holding 1.1 1.2 1.3 1.4
1 A Holding 2.1 2.2 2.3 2.4
2 B Holding 3.1 3.2 3.3 3.4
3 C Holding 4.1 4.2 4.3 4.4
4 C Holding 5.1 5.2 5.3 5.4
5 A Summary 3.2 3.4 3.6 3.8
6 B Summary 3.1 3.2 3.3 3.4
7 C Summary 9.2 9.4 9.6 9.8
Another possible solution uses combine_first to replace the NaN values from df1, aligning on the index values of df:
print (df.set_index('code')
.combine_first(df1)
.sort_values(['entry_type'])
.reset_index())
code entry_type value1 value2 value3 value4
0 A Holding 1.1 1.2 1.3 1.4
1 A Holding 2.1 2.2 2.3 2.4
2 B Holding 3.1 3.2 3.3 3.4
3 C Holding 4.1 4.2 4.3 4.4
4 C Holding 5.1 5.2 5.3 5.4
5 A Summary 3.2 3.4 3.6 3.8
6 B Summary 3.1 3.2 3.3 3.4
7 C Summary 9.2 9.4 9.6 9.8
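If you'd rather keep the original row order and the extra data in the summary rows untouched, you can also write the sums back in place with .loc, looking them up by each summary row's code. A sketch with hypothetical two-value columns standing in for value1-value4:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'code': ['A', 'A', 'B', 'C', 'C', 'A', 'C', 'B'],
    'entry_type': ['Holding'] * 5 + ['Summary'] * 3,
    'value1': [1.1, 2.1, 3.1, 4.1, 5.1, np.nan, np.nan, np.nan],
    'value2': [1.2, 2.2, 3.2, 4.2, 5.2, np.nan, np.nan, np.nan],
})

# Sum the Holding rows per code, then fill each Summary row by its code,
# leaving every other column and the row order untouched
sums = df[df['entry_type'] == 'Holding'].groupby('code')[['value1', 'value2']].sum()
is_sum = df['entry_type'] == 'Summary'
df.loc[is_sum, ['value1', 'value2']] = sums.loc[df.loc[is_sum, 'code']].to_numpy()
print(df)
```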
