Losing the label column while applying an aggregate function to a DataFrame - python

I've been trying to apply an aggregate function to a DataFrame that consists of numbers and strings. While aggregating, I noticed that the string data goes missing. I want to keep the string column (label) because I need it to label the aggregation results. Here is what I've coded:
def function(df):
    l_dfrange = []
    step = 10
    gr = df.groupby(['label'], as_index=False)
    l_grouped = list(gr)
    for i in range(len(l_grouped)):
        df_range = pd.DataFrame(l_grouped[i][1])
        df_range["ID"] = np.arange(len(df_range)) // step
        df_range = df_range.groupby("ID").agg([np.mean, np.std])
        l_dfrange.append(df_range)
    return l_dfrange, df_range
Initial dataframe :
gyro_x gyro_y gyro_z label
1 0.05 0.05 0.6 jump
2 0.03 0.03 0.6 jump
3 0.02 0.04 0.6 jump
4 0.08 0.09 0.6 stand
5 0.03 0.03 0.6 stand
6 0.02 0.04 0.6 stand
7 0.05 0.05 0.6 jump
8 0.03 0.03 0.6 jump
9 0.02 0.04 0.6 jump
Result that I want:
Note that for this example I limit each group to only 3 rows, and the rows are sorted by label and ID to identify the groups.
gyro_x gyro_y gyro_z label ID
1 0.05 0.05 0.6 jump 1
2 0.03 0.03 0.6 jump 1
3 0.02 0.04 0.6 jump 1
7 0.05 0.05 0.6 jump 2
8 0.03 0.03 0.6 jump 2
9 0.02 0.04 0.6 jump 2
4 0.08 0.09 0.6 stand 3
5 0.03 0.03 0.6 stand 3
6 0.02 0.04 0.6 stand 3
The end result that I want:
ID mean_gyro_x std_gyro_x mean_gyro_y std_gyro_y label
1 0.05 0.05 0.6 0.6 jump
2 0.05 0.05 0.6 0.6 jump
3 0.03 0.03 0.6 0.6 stand
I combine the first 3 rows in the example to get the aggregate result while also keeping the label (as the rows have been grouped by their label beforehand). Is there any way I could keep the label? Also, can I convert the result back to a DataFrame? When I turn l_dfrange into a DataFrame, it always comes back with the feature names (columns) but no data.

I created tmp.csv as follows:
gyro_x,gyro_y,gyro_z,label
0.05,0.05,0.6,jump
0.03,0.03,0.6,jump
0.02,0.04,0.6,jump
0.08,0.09,0.6,stand
0.03,0.03,0.6,stand
0.02,0.04,0.6,stand
Pandas handles this concisely, as you can see below; the label survives because it is the groupby key:
import numpy as np
import pandas as pd
df = pd.read_csv('tmp.csv')
print(df)
df = df.groupby('label').agg({'gyro_x': ['mean', 'std'], 'gyro_y': ['mean', 'std']}).reset_index()
df.columns = ['label', 'mean_gyro_x', 'std_gyro_x', 'mean_gyro_y', 'std_gyro_y']
print(df)
Output:
gyro_x gyro_y gyro_z label
0 0.05 0.05 0.6 jump
1 0.03 0.03 0.6 jump
2 0.02 0.04 0.6 jump
3 0.08 0.09 0.6 stand
4 0.03 0.03 0.6 stand
5 0.02 0.04 0.6 stand
label mean_gyro_x std_gyro_x mean_gyro_y std_gyro_y
0 jump 0.033333 0.015275 0.040000 0.010000
1 stand 0.043333 0.032146 0.053333 0.032146
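If the question's original windowed aggregation (blocks of step rows within each label) is still needed, the label can be kept the same way: group on the label together with a block counter. A minimal sketch, assuming the same tmp.csv as above (step is set to 3 here to match the 3-row example; the question's code used 10):
import numpy as np
import pandas as pd

df = pd.read_csv('tmp.csv')
step = 3

# Number the rows within each label, then bucket them into blocks of `step`.
df['ID'] = df.groupby('label').cumcount() // step

# Aggregate per (label, ID); the label survives because it is a grouping key.
out = df.groupby(['label', 'ID'], as_index=False).agg(
    mean_gyro_x=('gyro_x', 'mean'), std_gyro_x=('gyro_x', 'std'),
    mean_gyro_y=('gyro_y', 'mean'), std_gyro_y=('gyro_y', 'std'))
print(out)  # a flat DataFrame, so it converts and exports cleanly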

Related

Get cumulative sum until condition is met in another column, then reset - python

I have 3 columns. I want to compute the cumulative return while there is no trade; once there is a trade, the starting point of the cumulative return should reset.
Return  Price       Trade
0.00    400         0
0.08    432.00      0
0.04    419.28      -30
0.02    427.6656    0
0.06    513.325536  60
0.10    564.65809   0
I am trying to compute the cumulative return row by row using iterrows(), but with no luck. Would somebody know how to get this output?
The Trade values can be split into groups using the condition ~df.Trade.eq(0) (Trade not equal to 0) as the split point; the cumulative sums are then computed within each group:
df['Ret_cumsum'] = df.groupby((~df.Trade.eq(0)).cumsum())['Return'].cumsum()
Return Price Trade Ret_cumsum
0 0.00 400.000000 0 0.00
1 0.08 432.000000 0 0.08
2 0.04 419.280000 -30 0.04
3 0.02 427.665600 0 0.06
4 0.06 513.325536 60 0.06
5 0.10 564.658090 0 0.16
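For reference, a self-contained version of the snippet above (the frame is rebuilt here from the question's values):
import pandas as pd

df = pd.DataFrame({
    'Return': [0.00, 0.08, 0.04, 0.02, 0.06, 0.10],
    'Price': [400.0, 432.0, 419.28, 427.6656, 513.325536, 564.65809],
    'Trade': [0, 0, -30, 0, 60, 0]})

# Each nonzero Trade increments the cumulative sum of the boolean mask,
# so every trade row starts a fresh group for the running total.
df['Ret_cumsum'] = df.groupby((~df.Trade.eq(0)).cumsum())['Return'].cumsum()
print(df)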

How to calculate Mean Absolute Error (MAE) and Mean Signed Error (MSE) using pandas/numpy/python math library?

I have a dataset like the one below. It contains different colored thermometers and, given a 'True' or reference temperature, how differently they measure according to two measurement methods, 'Method 1' and 'Method 2'.
I am having trouble calculating two important parameters that I need: the Mean Absolute Error (MAE) and the Mean Signed Error (MSE). I want to use the non-NaN values for each method and print the result.
I was able to get to a point where I can return a two-column series of index and sum, but the problem is that I need to divide by the number of method values summed, which changes depending on how many NaNs there are in a row. And I do NOT want to skip an entire row just because there is a NaN in it.
number  date      Thermometer  True Temperature  Method 1  Method 2
0       1/1/2021  red          0.2               0.2       0.5
1       1/1/2021  red          0.6               0.6       0.3
2       1/1/2021  red          0.4               0.6       0.23
3       1/1/2021  green        0.2               0.4       NaN
4       1/1/2021  green        1                 1         0.23
5       1/1/2021  yellow       0.4               0.4       0.32
6       1/1/2021  yellow       0.1               NaN       0.4
7       1/1/2021  yellow       1.3               0.5       0.54
8       1/1/2021  yellow       1.5               0.5       0.43
9       1/1/2021  yellow       1.5               0.5       0.43
10      1/1/2021  blue         0.4               0.3       NaN
11      1/1/2021  blue         0.8               0.2       0.11
My Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('default')

data = pd.read_csv('data.txt', index_col=0)

data["M1_ABS_Error"] = abs(data["True_Temperature"] - data["Method_1"])
data["M2_ABS_Error"] = abs(data["True_Temperature"] - data["Method_2"])
MAE_Series = data[['Thermometer', 'M1_ABS_Error', 'M2_ABS_Error']]
MAE_Series.sum(axis=1, skipna=True)
But the output currently looks like the following, which doesn't specify which color thermometer each value belongs to; I would like it printed in a way that makes the association easy. Also, as mentioned, this does not yet account for dividing by the number of values/methods in each row to handle NaNs:
0 4.94
1 3.03
2 11.88
3 3.28
4 8.14
5 7.80
6 2.76
7 2.71
I would appreciate your help on this. Thanks!
Edit
I think I understand now; let me know if this is what you want.
MAE:
df['MAE'] = df[['M1_ABS_Error','M2_ABS_Error']].mean(axis = 1)
df
produces
date Thermometer True_Temperature Method_1 Method_2 M1_ABS_Error M2_ABS_Error MAE
-- -------- ------------- ------------------ ---------- ---------- -------------- -------------- -----
0 1/1/2021 red 0.2 0.2 0.5 0 0.3 0.15
1 1/1/2021 red 0.6 0.6 0.3 0 0.3 0.15
2 1/1/2021 red 0.4 0.6 0.23 0.2 0.17 0.185
3 1/1/2021 green 0.2 0.4 nan 0.2 nan 0.2
4 1/1/2021 green 1 1 0.23 0 0.77 0.385
5 1/1/2021 yellow 0.4 0.4 0.32 0 0.08 0.04
6 1/1/2021 yellow 0.1 nan 0.4 nan 0.3 0.3
7 1/1/2021 yellow 1.3 0.5 0.54 0.8 0.76 0.78
8 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035
9 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035
10 1/1/2021 blue 0.4 0.3 nan 0.1 nan 0.1
11 1/1/2021 blue 0.8 0.2 0.11 0.6 0.69 0.645
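Note that mean(axis=1) divides by the count of non-NaN values in each row, which is exactly the per-row NaN handling the question asks for. A quick check on a hypothetical one-row frame:
import numpy as np
import pandas as pd

row = pd.DataFrame({'M1_ABS_Error': [0.2], 'M2_ABS_Error': [np.nan]})
print(row.mean(axis=1))  # 0.2, not 0.1: the NaN is left out of the denominator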
and for MSE (Signed error)
df["MSE"]= df[['Method_1','Method_2']].mean(axis = 1)- df['True_Temperature']
produces
date Thermometer True_Temperature Method_1 Method_2 M1_ABS_Error M2_ABS_Error MAE MSE
-- -------- ------------- ------------------ ---------- ---------- -------------- -------------- ----- ------
0 1/1/2021 red 0.2 0.2 0.5 0 0.3 0.15 0.15
1 1/1/2021 red 0.6 0.6 0.3 0 0.3 0.15 -0.15
2 1/1/2021 red 0.4 0.6 0.23 0.2 0.17 0.185 0.015
3 1/1/2021 green 0.2 0.4 nan 0.2 nan 0.2 0.2
4 1/1/2021 green 1 1 0.23 0 0.77 0.385 -0.385
5 1/1/2021 yellow 0.4 0.4 0.32 0 0.08 0.04 -0.04
6 1/1/2021 yellow 0.1 nan 0.4 nan 0.3 0.3 0.3
7 1/1/2021 yellow 1.3 0.5 0.54 0.8 0.76 0.78 -0.78
8 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035 -1.035
9 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035 -1.035
10 1/1/2021 blue 0.4 0.3 nan 0.1 nan 0.1 -0.1
11 1/1/2021 blue 0.8 0.2 0.11 0.6 0.69 0.645 -0.645
Original answer
It is not entirely clear what you want, but guessing somewhat: is this what you are after? Group by color and apply mean to the ABS-error columns within each group:
data.groupby('Thermometer', sort = False)[['M1_ABS_Error','M2_ABS_Error']].mean()
you get this
M1_ABS_Error M2_ABS_Error
Thermometer
red 0.066667 0.256667
green 0.100000 0.770000
yellow 0.700000 0.656000
blue 0.350000 0.690000
Here, for example, the top-left number 0.066667 is the average of the M1_ABS_Error column for those thermometers that are red; similarly for the others. NaNs are skipped within each color/column.
To get the MSE (which normally means Mean Squared Error, so I assume this is what you are after) you can do:
import numpy as np
data["M1_Sqr_Error"]= (data["True_Temperature"]-data["Method_1"])**2
data["M2_Sqr_Error"]= (data["True_Temperature"]-data["Method_2"])**2
data.groupby('Thermometer', sort = False)[['M1_Error','M2_Error']].apply(lambda v: np.sqrt(np.mean(v)))
to get
M1_Sqr_Error M2_Sqr_Error
Thermometer
red 0.115470 0.263881
green 0.141421 0.770000
yellow 0.812404 0.769909
blue 0.430116 0.690000
I would rather use ready-made, tested, and correctly defined functions from libraries (here scikit-learn). Note: I give the answer here for MAE == mean absolute error and MSE == mean squared error (more usually one uses the root mean squared error, RMSE), and NOT for the mean signed error, which is very seldom used.
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
data = {"predicted": [11.3, 22.2, 51.4], "true": [10.1, 25.2, 60.3]}
df = pd.DataFrame(data)
mae = mean_absolute_error(data["predicted"], data["true"])
mse = mean_squared_error(data["predicted"], data["true"], squared=True)
print(f"mae:{round(mae,2)} mse:{round(mse,2)}")

sum column based on level selected in column header

I have a pd.DataFrame and it looks like this. Note that the column names represent levels.
df
PC 0 1 2 3
0 PC_1 0.74 0.25 0.1 0.0
1 PC_1 0.72 0.26 0.1 0.1
2 PC_2 0.80 0.18 0.2 0.0
3 PC_3 0.79 0.19 0.1 0.1
I want to create another 4 columns next to the existing ones and shift the values based on the assigned condition.
For example, if level = 1, df should look like this:
df
PC 0 1 2 3 0_1 1_1 2_1 3_1
0 PC_1 0.74 0.25 0.1 0.0 0.0 (0.74+0.25) 0.1 0.0
1 PC_1 0.72 0.26 0.1 0.1 0.0 (0.72+0.26) 0.1 0.1
2 PC_2 0.80 0.18 0.2 0.0 0.0 (0.80+0.18) 0.2 0.0
3 PC_3 0.79 0.19 0.1 0.1 0.0 (0.79+0.19) 0.1 0.1
If level=3,
df
PC 0 1 2 3 0_3 1_3 2_3 3_3
0 PC_1 0.74 0.25 0.1 0.0 0.0 0.0 0.0 sum(0.74+0.25+0.1+0.0)
1 PC_1 0.72 0.26 0.1 0.1 0.0 0.0 0.0 sum(0.72+0.26+0.1+0.1)
2 PC_2 0.80 0.18 0.2 0.0 0.0 0.0 0.0 sum(0.80+0.18+0.20+0.0)
3 PC_3 0.79 0.19 0.1 0.1 0.0 0.0 0.0 sum(0.79+0.19+0.1+0.1)
I don't know how to solve the problem and am looking for help.
Thank you in advance.
Set 'PC' to the index to make things easier. We zero everything before your column, cumsum up to the column, and keep everything as is after your column.
df = df.set_index('PC')
def add_sum(df, level):
    i = df.columns.get_loc(level)
    df_add = (pd.concat([pd.DataFrame(0, index=df.index, columns=df.columns[:i]),
                         df.cumsum(1).iloc[:, i],
                         df.iloc[:, i+1:]],
                        axis=1)
                .add_suffix(f'_{level}'))
    return pd.concat([df, df_add], axis=1)
add_sum(df, '1')  # pass 1 instead if the column labels are ints
0 1 2 3 0_1 1_1 2_1 3_1
PC
PC_1 0.74 0.25 0.1 0.0 0 0.99 0.1 0.0
PC_1 0.72 0.26 0.1 0.1 0 0.98 0.1 0.1
PC_2 0.80 0.18 0.2 0.0 0 0.98 0.2 0.0
PC_3 0.79 0.19 0.1 0.1 0 0.98 0.1 0.1
add_sum(df, '3')
0 1 2 3 0_3 1_3 2_3 3_3
PC
PC_1 0.74 0.25 0.1 0.0 0 0 0 1.09
PC_1 0.72 0.26 0.1 0.1 0 0 0 1.18
PC_2 0.80 0.18 0.2 0.0 0 0 0 1.18
PC_3 0.79 0.19 0.1 0.1 0 0 0 1.18
As you wrote "based on level selected in column header" in the title, I understand that:
there is no "external" level variable,
the level (how many columns to sum) results just from the source column name.
So the task is actually to "concatenate" both of your expected results (you presented only how to compute columns 1_1 and 3_3) and compute the other new columns the same way.
The solution is surprisingly concise.
Run the following one-liner:
df = df.join(df.iloc[:, 1:].cumsum(axis=1)
               .rename(lambda name: str(name) + '_1', axis=1))
Details:
df.iloc[:, 1:] - take all rows, starting from column 1 (column numbers start from 0).
cumsum(axis=1) - compute the cumulative sum, horizontally.
rename(..., axis=1) - rename the columns.
lambda name: str(name) + '_1' - lambda function to compute the new column name.
The result so far: the new columns.
df = df.join(...) - join with the original DataFrame and save the result back under df.

Python Pandas Shift Dataframe Column Down Into Rows (reset index on column?)

How would you drop/reset the column axis to shift the data down, so that the column headers become something like [0, 1, 2, 3, 4, 5], and then set the column headers to the df[5] values? I reset the index on the rows axis all the time, but I have never needed to do it to columns.
df = pd.DataFrame({'very_low': ['High', 'Low', 'Middle', 'Low'],
                   '0.2': [0.1, 0.05, 0.15, 0.08],
                   '0.1': [0.08, 0.06, 0.1, 0.08],
                   '0.4': [0.9, 0.33, 0.3, 0.24],
                   '0': [0.08, 0.06, 0.1, 0.08],
                   '0.3': [0.24, 0.25, 0.65, 0.98]})
0 0.1 0.2 0.3 0.4 very_low
0 0.08 0.08 0.10 0.24 0.90 High
1 0.06 0.06 0.05 0.25 0.33 Low
2 0.10 0.10 0.15 0.65 0.30 Middle
3 0.08 0.08 0.08 0.98 0.24 Low
If I understood you correctly, something like this?
df2 = pd.concat([pd.DataFrame(df.columns).T, pd.DataFrame(df.values)],
                ignore_index=True).iloc[:, :-1]
df2.columns = [df.columns[-1]] + df.iloc[:, -1].tolist()
>>> df2
very_low High Low Middle Low
0 0 0.1 0.2 0.3 0.4
1 0.08 0.08 0.1 0.24 0.9
2 0.06 0.06 0.05 0.25 0.33
3 0.1 0.1 0.15 0.65 0.3
4 0.08 0.08 0.08 0.98 0.24
I think this is what you want:
tdf = df.T
tdf.columns = tdf.iloc[5]
tdf.drop(tdf.tail(1).index,inplace=True)
>>> tdf
very_low High Low Middle Low
0 0.08 0.06 0.1 0.08
0.1 0.08 0.06 0.1 0.08
0.2 0.1 0.05 0.15 0.08
0.3 0.24 0.25 0.65 0.98
0.4 0.9 0.33 0.3 0.24
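A shorter route to the same layout, assuming the goal is simply to make very_low the column axis: set it as the index first and then transpose. A sketch:
tdf = df.set_index('very_low').T.sort_index()  # sort to match the row order shown above
print(tdf)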

Combine Rows in Pandas

I have a pandas DataFrame like this
100 200 300
283.1 0.01 0.02 0.40
284.1 0.02 0.03 0.42
285.1 0.05 0.01 0.8
286.1 0.06 0.02 0.9
I need to combine a certain number of consecutive rows and calculate the average value for each column, with a new index equal to the average of the indices I used, in order to obtain something like this:
100 200 300
283.6 0.015 0.025 0.41
285.6 0.055 0.015 0.85
Is there a way to do this with pandas?
Yes -- you could do this in pandas. Here's one way to do it.
Let's say our initial dataframe df is like
index 100 200 300
0 283.1 0.01 0.02 0.40
1 284.1 0.02 0.03 0.42
2 285.1 0.05 0.01 0.80
3 286.1 0.06 0.02 0.90
Now, calculate the length of the dataframe:
N = len(df.index)
N
4
We create a grp column to be used for aggregation,
where for a 2-row aggregation you use [x]*2 and for n consecutive rows you use [x]*n:
import itertools
df['grp'] = list(itertools.chain.from_iterable([x]*2 for x in range(0, N//2)))
df
index 100 200 300 grp
0 283.1 0.01 0.02 0.40 0
1 284.1 0.02 0.03 0.42 0
2 285.1 0.05 0.01 0.80 1
3 286.1 0.06 0.02 0.90 1
Now, get the means by grouping on the grp column:
df.groupby('grp').mean()
index 100 200 300
grp
0 283.6 0.015 0.025 0.41
1 285.6 0.055 0.015 0.85
A simple way:
>>> print(df)
index 100 200 300
0 283.1 0.01 0.02 0.40
1 284.1 0.02 0.03 0.42
2 285.1 0.05 0.01 0.80
3 286.1 0.06 0.02 0.90
break the DataFrame up into the portions that you want and find the mean of the relevant columns:
>>> pieces = [df[:2].mean(), df[2:].mean()]
then put the pieces back together using concat:
>>> avgdf = pd.concat(pieces, axis=1).transpose()
index 100 200 300
0 283.6 0.015 0.025 0.41
1 285.6 0.055 0.015 0.85
Alternatively, you can recombine the data with a list comprehension [i for i in pieces] or a generator expression:
>>> z = (i for i in pieces)
and use this to create your new DataFrame:
>>> avgdf = pd.DataFrame(z)
Finally, to set the index:
>>> avgdf.set_index('index', inplace=True)
>>> print(avgdf)
100 200 300
index
283.6 0.015 0.025 0.41
285.6 0.055 0.015 0.85
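On current pandas the helper column can be skipped entirely by grouping on an integer-division key. A sketch, rebuilding the sample frame and resetting its index so the index values are averaged too:
import numpy as np
import pandas as pd

df = pd.DataFrame({100: [0.01, 0.02, 0.05, 0.06],
                   200: [0.02, 0.03, 0.01, 0.02],
                   300: [0.40, 0.42, 0.80, 0.90]},
                  index=[283.1, 284.1, 285.1, 286.1])

n = 2  # rows to combine per block
# Rows 0..n-1 share key 0, rows n..2n-1 share key 1, and so on.
avgdf = df.reset_index().groupby(np.arange(len(df)) // n).mean().set_index('index')
print(avgdf)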
