Creating summed summary rows in pandas dataframe with specific criteria - python

Let's say I have the following pandas DataFrame, and I am trying to post-process the results to generate my (currently blank) summary rows:
code entry_type value1 value2 value3 value4
1 A Holding 1.1 1.2 1.3 1.4
2 A Holding 2.1 2.2 2.3 2.4
3 B Holding 3.1 3.2 3.3 3.4
4 C Holding 4.1 4.2 4.3 4.4
5 C Holding 5.1 5.2 5.3 5.4
6 A Summary nan nan nan nan
7 C Summary nan nan nan nan
8 B Summary nan nan nan nan
Essentially, I would like value1-value4 in the summary lines to be the sum of the holdings for each code:
code entry_type value1 value2 value3 value4
1 A Holding 1.1 1.2 1.3 1.4
2 A Holding 2.1 2.2 2.3 2.4
3 B Holding 3.1 3.2 3.3 3.4
4 C Holding 4.1 4.2 4.3 4.4
5 C Holding 5.1 5.2 5.3 5.4
6 A Summary 3.2 3.4 3.6 3.8
7 C Summary 9.2 9.4 9.6 9.8
8 B Summary 3.1 3.2 3.3 3.4
I have tried a few groupby lines of code, and came up with the following:
sums = df[df['entry_type']=="Holding"].groupby('code')[['value1', 'value2', 'value3', 'value4']].sum()
Which yields:
value1 value2 value3 value4
code
A 3.2 3.4 3.6 3.8
B 3.1 3.2 3.3 3.4
C 9.2 9.4 9.6 9.8
However, I am not sure how I would apply this back to the original DataFrame, specifically because the order of the codes is not necessarily the same as in the original DataFrame. Any thoughts on how to apply this, or a better approach? (Note: the summary rows already contain data in several other columns, so I can't simply generate new rows from scratch.)

It seems concat can help:
df1 = (df[df['entry_type']=="Holding"]
        .groupby('code')[['value1', 'value2', 'value3', 'value4']].sum())
#print (df1)
#if you only need the Holding rows of `df`, filter with boolean indexing
print (pd.concat([df[df['entry_type']=="Holding"].set_index('code'), df1])
         .fillna({'entry_type':'Summary'})
         .reset_index())
code entry_type value1 value2 value3 value4
0 A Holding 1.1 1.2 1.3 1.4
1 A Holding 2.1 2.2 2.3 2.4
2 B Holding 3.1 3.2 3.3 3.4
3 C Holding 4.1 4.2 4.3 4.4
4 C Holding 5.1 5.2 5.3 5.4
5 A Summary 3.2 3.4 3.6 3.8
6 B Summary 3.1 3.2 3.3 3.4
7 C Summary 9.2 9.4 9.6 9.8
Another possible solution uses combine_first to replace the NaN values with df1, aligning on the index values of df:
print (df.set_index('code')
         .combine_first(df1)
         .sort_values(['entry_type'])
         .reset_index())
code entry_type value1 value2 value3 value4
0 A Holding 1.1 1.2 1.3 1.4
1 A Holding 2.1 2.2 2.3 2.4
2 B Holding 3.1 3.2 3.3 3.4
3 C Holding 4.1 4.2 4.3 4.4
4 C Holding 5.1 5.2 5.3 5.4
5 A Summary 3.2 3.4 3.6 3.8
6 B Summary 3.1 3.2 3.3 3.4
7 C Summary 9.2 9.4 9.6 9.8
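Since the question mentions that the Summary rows already hold data in other columns, here is a minimal sketch of a third approach that leaves those rows in place and overwrites only the value columns, matching on code (column names assumed to be exactly those from the question):
value_cols = ['value1', 'value2', 'value3', 'value4']

# per-code sums of the Holding rows
sums = df[df['entry_type'] == 'Holding'].groupby('code')[value_cols].sum()

# overwrite only the value columns of the existing Summary rows,
# looking up each row's code in the per-code sums
is_summary = df['entry_type'] == 'Summary'
df.loc[is_summary, value_cols] = sums.reindex(df.loc[is_summary, 'code']).to_numpy()
This keeps the original row order and any extra columns of the Summary rows untouched.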

Related

Combine two columns with if condition in pandas

I have two columns whose data overlap for some entries (and are nearly identical when they do).
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'x': [2.1, 3.1, 5.4, 1.9, np.nan, 4.3, np.nan, np.nan, np.nan],
     'y': [np.nan, np.nan, 5.3, 1.9, 3.2, 4.2, 9.1, 7.8, 4.1]}
)
I want the result to be a column 'xy' which contains the average of x and y when they both have values and x or y when only one of them has a value like this:
df['xy']=[2.1,3.1,5.35,1.9,3.2,4.25,9.1,7.8,4.1]
Here you go:
Solution
df['xy'] = df[['x','y']].mean(axis=1)
This works because mean skips NaN values by default (skipna=True), so each row gets the average when both values are present and the single non-missing value otherwise.
Output
print(df.to_string())
x y xy
0 2.1 NaN 2.10
1 3.1 NaN 3.10
2 5.4 5.3 5.35
3 1.9 1.9 1.90
4 NaN 3.2 3.20
5 4.3 4.2 4.25
6 NaN 9.1 9.10
7 NaN 7.8 7.80
8 NaN 4.1 4.10
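For readers who prefer the condition spelled out explicitly, here is a sketch of the same logic with numpy.where; it is equivalent to the mean(axis=1) answer above:
both = df['x'].notna() & df['y'].notna()
# average where both are present, otherwise take whichever value exists
df['xy'] = np.where(both, (df['x'] + df['y']) / 2, df['x'].fillna(df['y']))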

Merging two dataframes with one common column name [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two .csv files, "train_id.csv" and "train_ub.csv", which I want to load as pandas dataframes. Their dimensions are different, but they have only one column in common. Let's say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
One may see that they have the first column in common, but each dataframe is missing some of the IDs. Is there a way in pandas to merge them column-wise in order to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Notice that this is an oversimplified example; the real dataframes have the shapes id (144233, 41) and ub (590540, 394).
You could accomplish this using an outer join. Here is the code for it:
train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
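Note that an outer merge keeps the left frame's key order and appends the unmatched right-hand IDs at the end, so if you want the rows ordered by ID as in the expected output, sort afterwards (a small addition to the answer above):
train_merged = (train_id.merge(train_ub, on=["ID"], how="outer")
                        .sort_values("ID")
                        .reset_index(drop=True))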

How to create new column based on top and bottom parts of single dataframe in PANDAS?

I have merged two dataframes having the same column names. Is there an easy way to get another column containing the mean of these two appended dataframes?
Maybe code explains it better.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
df2 = pd.DataFrame({'a': [1.2, 2.2, 3.2, 4.2], 'b': [10.2, 20.2, 30.2, 40.2]})
df = df1.append(df2)  # DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df2]) there
print(df)
df['a_mean'] = ???
a b
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
0 1.2 10.2
1 2.2 20.2
2 3.2 30.2
3 4.2 40.2
How do I create a new column a_mean with the values
[1.1, 2.1, 3.1, 4.1, 1.1, 2.1, 3.1, 4.1] efficiently?
Using melt() (this works because assign aligns the resulting Series on df's duplicated 0-3 index, so only the first four melted values, the means of column 'a', are picked up):
df = df.assign(a_mean=df1.add(df2).div(2).melt().value)
Or, taking only df, you can do:
df = df.assign(a_mean=df.groupby(df.index)['a'].mean())
a b a_mean
0 1.0 10.0 1.1
1 2.0 20.0 2.1
2 3.0 30.0 3.1
3 4.0 40.0 4.1
0 1.2 10.2 1.1
1 2.2 20.2 2.1
2 3.2 30.2 3.1
3 4.2 40.2 4.1
Try this:
df['a_mean'] = np.tile( (df1.a.to_numpy() + df2.a.to_numpy())/2, 2)
As per the comments, there is already a great answer by Anky, but to extend this method you can do this:
df['a_mean2'] = np.tile( (df.iloc[0: len(df)//2].a.to_numpy() + df.iloc[len(df)//2:].a.to_numpy())/2, 2)
Update:
df['a_mean3'] = np.tile(df.a.to_numpy().reshape(2,-1).mean(0), 2)
Output
print(df)
a b a_mean2 a_mean a_mean3
0 1.0 10.0 1.1 1.1 1.1
1 2.0 20.0 2.1 2.1 2.1
2 3.0 30.0 3.1 3.1 3.1
3 4.0 40.0 4.1 4.1 4.1
0 1.2 10.2 1.1 1.1 1.1
1 2.2 20.2 2.1 2.1 2.1
2 3.2 30.2 3.1 3.1 3.1
3 4.2 40.2 4.1 4.1 4.1
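As a side note, the groupby solution generalizes to every column at once via transform; a sketch of that extension (the a_mean/b_mean column names are only illustrative):
means = df.groupby(level=0).transform('mean')  # per-position mean of the top and bottom halves
df['a_mean'] = means['a']
df['b_mean'] = means['b']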

Pandas dataframe threshold -- Keep number fixed if exceed

I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
1 2 3 4 5 6 7
John 1.3 2.8 3.0 4.4 2.6 3.1 4.8
Terry 1.1 2.3 4.1 5.5 3.7 2.1 3.8
Henry 0.3 1.0 2.0 3.0 2.7 1.1 2.8
How do I set a score ceiling such that, once a score exceeds 2.5, all scores from that day onwards are fixed at that first exceeding value, no matter what the later scores are?
The output should be:
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
I tried the following, which didn't work. I first build a boolean mask of all numbers > 2.5, then apply a mask based on its cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
You can find the first value above the threshold in each row with where plus bfill, selecting the first column with iloc:
m = (df > 2.5).cumsum(axis=1) > 0
s = df.where(m).bfill(axis=1).iloc[:, 0]
print (s)
John 2.8
Terry 4.1
Henry 3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)
Or shift the mask and forward-fill the resulting NaNs with the last valid value:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1).fillna(False)).ffill(axis=1)
print (df)
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
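For completeness, a self-contained sketch that rebuilds the sample frame from the question and applies the first approach end to end (the integer column labels for the days are an assumption made for illustration):
import pandas as pd

df = pd.DataFrame(
    [[1.3, 2.8, 3.0, 4.4, 2.6, 3.1, 4.8],
     [1.1, 2.3, 4.1, 5.5, 3.7, 2.1, 3.8],
     [0.3, 1.0, 2.0, 3.0, 2.7, 1.1, 2.8]],
    index=['John', 'Terry', 'Henry'],
    columns=range(1, 8),  # day numbers assumed to be integer labels
)

m = (df > 2.5).cumsum(axis=1) > 0             # True from the first score > 2.5 onwards
first = df.where(m).bfill(axis=1).iloc[:, 0]  # first exceeding score per person
print(df.mask(m, first, axis=0))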

Get max value from row of a dataframe in python [duplicate]

This question already has answers here:
Find the max of two or more columns with pandas
(3 answers)
How to select max and min value in a row for selected columns
(2 answers)
Closed 5 years ago.
This is my dataframe df
a b c
1.2 2 0.1
2.1 1.1 3.2
0.2 1.9 8.8
3.3 7.8 0.12
I'm trying to get the max value from each row of the dataframe; I'm expecting output like this:
max_value
2
3.2
8.8
7.8
This is what I have tried
df[len(df.columns)].argmax()
I'm not getting the proper output; any help would be much appreciated. Thanks.
Use max with axis=1:
df = df.max(axis=1)
print (df)
0 2.0
1 3.2
2 8.8
3 7.8
dtype: float64
And if you need a new column:
df['max_value'] = df.max(axis=1)
print (df)
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
You could use NumPy:
df.assign(max_value=df.values.max(1))
a b c max_value
0 1.2 2.0 0.10 2.0
1 2.1 1.1 3.20 3.2
2 0.2 1.9 8.80 8.8
3 3.3 7.8 0.12 7.8
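As a side note, since the attempt in the question used argmax: if you also want to know which column holds each row's maximum, idxmax along axis=1 returns the column label per row (the max_column name below is only illustrative):
df['max_column'] = df[['a', 'b', 'c']].idxmax(axis=1)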
