Pandas groupby and compute ratio of values with NA in multiple columns - python

I have a dataframe like as below
id,status,amount,qty
1,pass,123,4500
1,pass,156,3210
1,fail,687,2137
1,fail,456,1236
2,pass,216,324
2,pass,678,241
2,nan,637,213
2,pass,213,543
df = pd.read_clipboard(sep=',')
I would like to do the below
a) Groupby id and compute the pass percentage for each id
b) Groupby id and compute the average amount for each id
So, I tried the below
df['amt_avg'] = df.groupby('id')['amount'].mean()
df['pass_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
df['fail_pct'] = df.groupby('status').apply(lambda x: x['status']/ x['status'].count())
but this doesn't work.
I am having trouble getting the pass percentage.
In my real data I have a lot of columns like status, for which I have to find the % distribution of a specific value (e.g. pass).
I expect my output to be like as below
id,pass_pct,fail_pct,amt_avg
1,50,50,2770.75
2,75,0,330.25

Use crosstab, filling the missing values with the string 'nan' so that column can be dropped afterwards, then add the new column amt_avg with DataFrame.join. (Note that the expected amt_avg values match the mean of the qty column, not amount, so qty is what is averaged here.)
s = df.groupby('id')['qty'].mean()

df = (pd.crosstab(df['id'], df['status'].fillna('nan'), normalize=0)
        .drop('nan', axis=1)
        .mul(100)
        .join(s.rename('amt_avg')))
print(df)
    fail  pass  amt_avg
id
1   50.0  50.0  2770.75
2    0.0  75.0   330.25
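Since the question mentions many status-like columns, here is a minimal sketch of how the same idea generalizes without one crosstab per column; status1 and status2 are hypothetical column names, not from the question. Comparing each column to 'pass' gives booleans, and the group mean of a boolean column is the fraction of 'pass' rows (NaN compares unequal, so it counts against the percentage, matching the expected output above):
# Hypothetical column names, using the original df from the question.
status_cols = ['status1', 'status2']

# eq('pass') yields booleans; the per-group mean of booleans is the
# fraction of 'pass' rows in each group.
pass_pct = df[status_cols].eq('pass').groupby(df['id']).mean().mul(100)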

Related

Having Issues with pandas groupby.mean() not ignoring NaN as expected

I'm currently trying to get the mean() of a group in my dataframe (tdf), but I have a mix of NaN values and filled values in my dataset. An example is shown below:
Test #  a  b
1       1  1
1       2  NaN
1       3  2
2       4  3
My code needs to take this dataset, and make a new dataset containing the mean, std, and 95% interval of the set.
i = 0
num_timeframes = 2  # writing this in for example's sake
new_df = pd.DataFrame(columns=tdf.columns)
while i < num_timeframes:
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).mean()
    new_df = pd.concat([new_df, results])
    results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df, results])
    results = 2 * tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
    new_df = pd.concat([new_df, results])
    new_df['Test #'] = new_df['Test #'].fillna(i)  # fill out test number values
    i += 1
For simplicity, I will show the desired output for the first pass of the while loop, only calculating the mean; the problem impacts every row, however. The expected output for the mean of Test # 1 is shown below:
Test #  a  b
1       2  1.5
However, columns which contain any NaN rows compute the entire mean as NaN, resulting in the output shown below:
Test #  a  b
1       2  NaN
I have tried passing skipna=True, but got an error stating that mean doesn't have a skipna argument. I'm really at a loss here, because it was my understanding that df.mean() ignores NaN rows by default. I have limited experience with Python, so any help is greatly appreciated.
Use the following signature:
DataFrame.mean(axis=None, skipna=True)
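For what it's worth, a minimal sketch reconstructing the question's table (dtypes assumed numeric) suggests that groupby().mean() already skips NaN by default, producing the expected a=2.0, b=1.5 for Test # 1:
import numpy as np
import pandas as pd

# Reconstruction of the table from the question (assumed numeric dtypes).
tdf = pd.DataFrame({"Test #": [1, 1, 1, 2],
                    "a": [1, 2, 3, 4],
                    "b": [1, np.nan, 2, 3]})

# With numeric columns, NaN is skipped by default: Test # 1 -> a 2.0, b 1.5.
print(tdf.groupby("Test #").mean())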
I eventually solved this by removing the groupby call entirely (looking through it, I realized I had no reason to call groupby here other than that groupby kept my columns in the correct orientation). Figured I'd post my fix in case anyone ever comes across this.
for i in range(num_timeframes):
    results = tdf.loc[tdf["Test #"] == i].mean()
    results = pd.concat([results, tdf.loc[tdf["Test #"] == i].std()], axis=1)
    results = pd.concat([results, 2 * tdf.loc[tdf["Test #"] == i].std()], axis=1)
    results = results.transpose()
    results["Test #"] = i
    new_df = pd.concat([new_df, results])
    new_df.loc[new_df.shape[0]] = [None] * len(new_df.columns)
All I had to do was transpose my results, because df.mean() returns a Series (the columns become the index), which is likely why I had tried using groupby in the first place.
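For reference, a minimal sketch of that reshape using the question's tdf: mean() collapses the frame to a Series, and .to_frame().T restores the one-row layout.
# mean() returns a Series indexed by the column names;
# .to_frame().T turns it back into a one-row DataFrame.
row = tdf.loc[tdf["Test #"] == 1].mean().to_frame().T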

Access pandas dataframe column with two header levels - pandas

I created a dataframe using groupby and pd.cut to calculate the mean, std, and number of elements inside a bin. This is the agg() command I used:
df_bin = df.groupby(pd.cut(df.In_X, ranges, include_lowest=True)).agg(['mean', 'std', 'size'])
df_bin looks like this:
                      X                     Y
                   mean  std  size      mean  std  size
In_X
(10.424, 10.43]  10.425  NaN     1  0.003786  NaN     1
(10.43, 10.435]    10.4  NaN     0       NaN  NaN     0
I want to create an array with the values of the mean for the first header, X. If I didn't have the two header levels, I would use something like:
mean=np.array(df_bin['mean'])
But how to do that with the two headers?
This documentation would serve you well: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
To answer your question, if you just want a particular column:
mean = np.array(df_bin['X', 'mean'])
But if you wanted to slice to the second level:
mean = np.array(df_bin.loc[:, (slice(None), 'mean')])
Or:
mean = np.array(df_bin.loc[:, pd.IndexSlice[:, 'mean']])
We can do
df_bin.stack(level=0)['mean'].values
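Another option, if I recall the API correctly, is DataFrame.xs, which selects one level of the column MultiIndex directly:
# Keep the 'mean' sub-column for every top-level column (X and Y) ...
means = df_bin.xs('mean', axis=1, level=1)

# ... or just the X mean, as an array.
mean = df_bin.xs('mean', axis=1, level=1)['X'].to_numpy()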

Compare two rows in a data frame after groupby and perform operations

I have two different csv files; I have merged them into a single data frame and grouped by the 'class_name' column. The groupby works as intended, but I don't know how to perform the operation by comparing the groups against one another. From r1.csv to r2.csv the class algebra has gone down by 5 students, so I want -5; calculus has increased by 5, so it has to be +5; this has to be added as a new column in a separate data frame. Same with the date arithmetic.
This is what I tried so far
import pandas as pd
report_1_df=pd.read_csv('r1.csv')
report_2_df=pd.read_csv('r2.csv')
for group, elements in pd.concat([report_1_df, report_2_df], axis=0, sort=False).groupby('class_name'):
    print(elements)
I can see that my groupby works. I tried .sum() and .diff(), but neither does what I want. What can I do here? Thanks.
r1.csv
class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
r2.csv
class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
Needed
class_name,student_count,student_count_change,start_time,start_time_delay,end_time,end_time_delay
algebra,10,-5,"2019,Dec,09",1,"2019,Dec,10",1
calculus,15,5,"2019,Dec,09",1,"2019,Dec,10",1
statistics,12,-12,"2019,Dec,08",0,"2019,Dec,09",0
trigonometry,12,12,"2019,Dec,09",0,"2019,Dec,10",0
Not sure if there's a more direct way, but you can start by appending the missing classes to both your dfs:
# Series.append is deprecated in recent pandas, so collect the class names with pd.concat.
classes = pd.concat([df1["class_name"], df2["class_name"]]).unique()

def fill_data(df):
    # Add a zero-count row for each class missing from this report,
    # copying the date columns from the first row.
    for i in np.setdiff1d(classes, df["class_name"].values):
        df.loc[df.shape[0]] = [i, 0, *df.iloc[0, 2:].values]
    return df

df1 = fill_data(df1)
df2 = fill_data(df2)
With the missing classes filled in, you can now use groupby to assign a new column for the difference, and finally drop_duplicates:
df = pd.concat([df1, df2], axis=0).reset_index(drop=True)
df["diff"] = df.groupby("class_name")["student_count"].diff().fillna(df["student_count"])
print(df.drop_duplicates("class_name", keep="last"))
     class_name  student_count   start_time     end_time  diff
4      calculus             15  2019,Dec,09  2019,Dec,10   5.0
5       algebra             10  2019,Dec,09  2019,Dec,10  -5.0
6  trigonometry             12  2019,Dec,09  2019,Dec,10  12.0
7    statistics              0  2019,Dec,09  2019,Dec,10 -12.0
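For the student count change alone, a minimal alternative sketch (assuming class_name is unique within each report) is to align the two reports and subtract with fill_value=0, so a class missing from one side counts as 0, as in the desired output:
s1 = report_1_df.set_index("class_name")["student_count"]
s2 = report_2_df.set_index("class_name")["student_count"]

# Classes present in only one report are treated as 0 on the missing side.
student_count_change = s2.sub(s1, fill_value=0)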

Column in DataFrame in Pandas with value 0

I am trying to create 2 new columns in a DataFrame in pandas (Python). The first column, aa, which shows the average temperature, is correct; nevertheless, the second column, bb, which should show the temperature in each City minus the average temperature in all cities, displays the value 0.
Where is the problem? Did I use lambda correctly? Could you give me the solution? Thank you very much!
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.groupby(['City'])["Temperature"].transform(lambda x: x - np.mean(x))
display(file.head(10))
EDIT: Updated according to gereleth's comment. You can simplify it even more!
file['bb'] = file.Temperature - file.aa
Since we've already calculated the mean value in the aa column, we can simply reuse it to calculate the difference between the Temperature and aa columns of each row, using the pandas apply method as below:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.apply(lambda row: row['Temperature'] - row['aa'], axis=1)
display(file.sample(10))
If you are looking to subtract the average temperature of all cities, you can take the mean of the aa column instead:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
avg_all_cities = file['aa'].mean()
file["bb"] = file.apply(lambda row: row['Temperature'] - avg_all_cities, axis=1)
display(file.sample(10))
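Both variants can also be written fully vectorised, without apply; a minimal sketch (bb_city and bb_global are hypothetical names for the two interpretations):
# Temperature minus the city's own average (same as the transform above):
file["bb_city"] = file["Temperature"] - file.groupby("City")["Temperature"].transform("mean")

# Temperature minus the average over all rows:
file["bb_global"] = file["Temperature"] - file["Temperature"].mean()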

merge pandas pivot tables

I have a dataframe like this:
Application|Category|Feature|Scenario|Result|Exec_Time
A1|C1|F1|scenario1|PASS|2.3
A1|C1|F1|scenario2|FAIL|20.3
A2|C1|F3|scenario3|PASS|12.3
......
The outcome I am looking for is a pivot with the count of results by Feature and also the sum of exec times, like this:
Application|Category|Feature|Count of PASS|Count of FAIL|SumExec_Time
A1|C1|F1|200|12|45.62
A1|C1|F2|90|0|15.11
A1|C2|F3|97|2|33.11
I created individual dataframes for the pivots of result counts and the sum of execution time by feature, but I am not able to merge those dataframes to get my final expected outcome.
dfr = pd.pivot_table(df, index=["Application", "Category", "Feature"],
                     values=["Final_Result"], aggfunc=[len])
dft = pd.pivot_table(df, index=["Application", "Category", "Feature"],
                     values=["Exec_time_mins"], aggfunc=[np.sum])
You don't need to merge results here; you can create this with a single pivot_table or groupby/apply. I don't have your data, but does this get you what you want?
pivot = pd.pivot_table(df, index=["Application", "Category", "Feature"],
                       values=["Final_Result", "Exec_time_mins"],
                       aggfunc=[len, np.sum])
# Count total records, number of FAILs, and total time.
df2 = df.groupby(by=['Application', 'Category', 'Feature']).agg(
    {'Result': [len, lambda x: len(x[x == 'FAIL'])], 'Exec_Time': sum})

# Rename columns.
df2.columns = ['Count of PASS', 'Count of FAIL', 'SumExec_Time']

# The len column counted all rows, so subtract the FAILs to get the PASSes.
df2['Count of PASS'] -= df2['Count of FAIL']

# Reset index.
df2.reset_index(inplace=True)
df2
Out[1197]:
  Application Category Feature  Count of PASS  Count of FAIL  SumExec_Time
0          A1       C1      F1              1              1          22.6
1          A2       C1      F3              1              0          12.3
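If the PASS/FAIL counts are wanted as separate columns exactly as in the desired output, a hedged alternative sketch is pd.crosstab for the counts, joined to the summed times (column names follow the question's Result/Exec_Time schema):
# Count each Result value per (Application, Category, Feature) group;
# the resulting columns are the raw Result values, e.g. PASS and FAIL.
counts = pd.crosstab([df["Application"], df["Category"], df["Feature"]],
                     df["Result"])

# Total execution time per group, joined on the same index.
times = df.groupby(["Application", "Category", "Feature"])["Exec_Time"].sum()
out = counts.join(times.rename("SumExec_Time")).reset_index()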
