Im currently trying to get the mean() of a group in my dataframe (tdf), but I have a mix of some NaN values and filled values in my dataset. Example shown below
Test #
a
b
1
1
1
1
2
NaN
1
3
2
2
4
3
My code needs to take this dataset, and make a new dataset containing the mean, std, and 95% interval of the set.
i = 0
num_timeframes = 2 #writing this in for example sake
new_df = pd.DataFrame(columns = tdf.columns)
while i < num_timeframes:
results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).mean()
new_df = pd.concat([new_df,results])
results = tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
new_df = pd.concat([new_df,results])
results = 2*tdf.loc[tdf["Test #"] == i].groupby(["Test #"]).std()
new_df = pd.concat([new_df,results])
new_df['Test #'] = new_df['Test #'].fillna(i) #fill out test number values
i+=1
For simplicity, i will show the desired output on the first pass of the while loop, only calculating the mean. The problem impacts every row however. The expected output for the mean of Test # 1 is shown below:
Test #
a
b
1
2
1.5
However, columns which contain any NaN rows are calculating the entire mean as NaN resulting in the output shown below
Test #
a
b
1
2
NaN
I have tried passing skipna=True, but got an error stating that mean doesn't have a skipna argument. Im really at a loss here because it was my understanding that df.mean() ignores NaN rows by default. I have limited experience with python so any help is greatly appreciated.
Use following
DataFrame.mean( axis=None, skipna=True)
I eventually solved this by removing the groupby function entirely (I was looking through it and realized I had no reason to call groupby here other than benefit from groupby keeping my columns in the correct orientation). Figured I'd post my fix in case anyone ever comes across this.
for i in range(num_timeframes):
results = tdf.loc[tdf["Test #"] == i].mean()
results = pd.concat([results, tdf.loc[tdf["Test #"] == i].std()], axis = 1)
results = pd.concat([results, 2*tdf.loc[tdf["Test #"] == i].std()], axis = 1)
results = results.transpose()
results["Test #"] = i
new_df = pd.concat([new_df,results])
new_df.loc[new_df.shape[0]] = [None]*len(new_df.columns)
All i had to do was transpose my results because df.mean() flips the dataframe for some reason which is likely why I had tried using groupby in the first place.
I created a dataframe using groupby and pd.cut to calculate the mean, std and number of elements inside a bin. I used the agg()and this is the command I used:
df_bin=df.groupby(pd.cut(df.In_X, ranges,include_lowest=True)).agg(['mean', 'std','size'])
df_bin looks like this:
X Y
mean std size mean std size
In_X
(10.424, 10.43] 10.425 NaN 1 0.003786 NaN 1
(10.43, 10.435] 10.4 NaN 0 NaN NaN 0
I want to create an array with the values of the mean for the first header X. If I didn't have the two header level, I would use something like:
mean=np.array(df_bin['mean'])
But how to do that with the two headers?
This documentation would serve you well: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
To answer your question, if you just want a particular column:
mean = np.array(df_bin['X', 'mean'])
But if you wanted to slice to the second level:
mean = np.array(df_bin.loc[:, (slice(None), 'mean')])
Or:
mean = np.array(df_bin.loc[:, pd.IndexSlice[:, 'mean']])
We can do
df_bin.stack(level=0)['mean'].values
I have two different csv files, I have merged them into a single data frame and grouped according to the 'class_name' column. The group by works as intended but I dont know how to perform the operation by comparing the groups against one other. From r1.csv the class algebra has gone down by 5 students, so I want -5, calculus has increased by 5 so it has to +5, this has to be added as a new column in a separate data frame. Same with date arithmetics.
This is what I tried so far
import pandas as pd
report_1_df=pd.read_csv('r1.csv')
report_2_df=pd.read_csv('r2.csv')
for group,elements in pd.concat([report_1_df, report_2_df], axis=0, sort=False).groupby('class_name'):
print(elements)
I can see that my group by works, I tried .sum() .diff() but none seem to do what I want, what can I do here. Thanks.
r1.csv
class_name,student_count,start_time,end_time
algebra,15,"2019,Dec,08","2019,Dec,09"
calculus,10,"2019,Dec,08","2019,Dec,09"
statistics,12,"2019,Dec,08","2019,Dec,09"
r2.csv
class_name,student_count,start_time,end_time
calculus,15,"2019,Dec,09","2019,Dec,10"
algebra,10,"2019,Dec,09","2019,Dec,10"
trigonometry,12,"2019,Dec,09","2019,Dec,10"
Needed
class_name,student_count,student_count_change,start_time,start_time_delay,end_time,end_time_delay
algebra,10,-5,"2019,Dec,09",1,"2019,Dec,10",1
calculus,15,5,"2019,Dec,09",1,"2019,Dec,10",1
statistics,12,-12,"2019,Dec,08",0,"2019,Dec,09",0
trigonometry,12,12,"2019,Dec,09",0,"2019,Dec,10",0
Not sure if there's a more direct way, but you can start by appending missing data on both your dfs:
classes = (df1["class_name"].append(df2["class_name"])).unique()
def fill_data(df):
for i in np.setdiff1d(classes, df["class_name"].values):
df.loc[df.shape[0]] = [i, 0, *df.iloc[0,2:].values]
return df
df1 = fill_data(df1)
df2 = fill_data(df2)
With the missing classes filled, now you can use groupby to assign a new column for the difference and lastly drop_duplicates:
df = pd.concat([df1,df2],axis=0).reset_index(drop=True)
df["diff"] = df.groupby("class_name")["student_count"].diff().fillna(df["student_count"])
print (df.drop_duplicates("class_name",keep="last"))
class_name student_count start_time end_time diff
4 calculus 15 2019,Dec,09 2019,Dec,10 5.0
5 algebra 10 2019,Dec,09 2019,Dec,10 -5.0
6 trigonometry 12 2019,Dec,09 2019,Dec,10 12.0
7 statistics 0 2019,Dec,09 2019,Dec,10 -12.0
I try to create 2 new columns in DataFrame in Pandas Python and the first column aa which shows average temperaturÄ™ is correct, nevertheless, the second column bb which should present temperature in City minus average temperature in all cities displays value 0??
Where is the problem? Did I correctly use lambda? Could you give me the solution? Thank you very much!
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.groupby(['City'])["Temperature"].transform(lambda x: x - np.mean(x))
display(file.head(10))
EDIT: Updated according to gereleth's comment. You can simplify it even more!
file['bb'] = file.Temperature - file.aa
Since we've already calculated the mean value in the aa column we can simply reuse this column to calculate the difference of the Temperature and aa column of each row by using pandas apply method like below:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
file["bb"] = file.apply(lambda row: row['Temperature'] - row['aa'], axis=1)
display(file.sample(10))
If you are looking to subtract the average of all cities temperature you can use mean on the column aa instead:
file["aa"] = file.groupby(['City'])["Temperature"].transform(np.mean)
display(file.sample(10))
avg_all_cities = file['aa'].mean()
file["bb"] = file.apply(lambda row: row['Temperature'] - avg_all_cities, axis=1)
display(file.sample(10))
I have a dataframe like this:
Application|Category|Feature|Scenario|Result|Exec_Time
A1|C1|F1|scenario1|PASS|2.3
A1|C1|F1|scenario2|FAIL|20.3
A2|C1|F3|scenario3|PASS|12.3
......
The outcome i am looking for will be a pivot with count of results by Feature and also the sum of exec times. Like this
Application|Category|Feature|Count of PASS|Count of FAIL|SumExec_Time
A1|C1|F1|200|12|45.62
A1|C1|F2|90|0|15.11
A1|C2|F3|97|2|33.11*
I got individual dataframes to get the pivots of result counts and the sum of execution time by feature but I am not able to merge those dataframes to get my final expected outcome.
dfr = pd.pivot_table(df,index=["Application","Category","Feature"],
values=["Final_Result"],aggfunc=[len])
dft = pd.pivot_table(df,index=["Application","Category","Feature"],
values=["Exec_time_mins"],aggfunc=[np.sum])
You don't need to merge results here, you can create this with a single pivot_table or groupby/apply. I don't have your data but does this get you what you want?
pivot = pd.pivot_table(df, index=["Application","Category","Feature"],
values = ["Final_Result", "Exec_time_mins"],
aggfunc = [len, np.sum])
#Count total records, number of FAILs and total time.
df2 = df.groupby(by=['Application','Category','Feature']).agg({'Result':[len,lambda x: len(x[x=='FAIL'])],'Exec_Time':sum})
#rename columns
df2.columns=['Count of PASS','Count of FAIL','SumExec_Time']
#calculate number of pass
df2['Count of PASS']-=df2['Count of FAIL']
#reset index
df2.reset_index(inplace=True)
df2
Out[1197]:
Application Category Feature Count of PASS Count of FAIL SumExec_Time
0 A1 C1 F1 1 1 22.6
1 A2 C1 F3 1 0 12.3