Computing MSE per row in pandas dataframe - python

I have the below dataframe with many columns - 2016_x, 2016_y, 2017_x, etc., where x represents my actual values and y represents the forecast values.
How would I compute the mean squared error (MSE) row-wise, to see it for the different fruits?
Here is the code -
import pandas as pd

s = {'Fruits': ['Apple', 'Mango'],
     '2016_x': [2, 3], '2017_x': [4, 5], '2018_x': [12, 13],
     '2016_y': [3, 4], '2017_y': [3, 4], '2018_y': [12, 13]}
p = pd.DataFrame(data=s)
This is what the dataframe looks like -
The desired output should show the MSE of Apple and Mango, i.e. row by row. The MSE should take the difference of the x and y values of each year.
Basically, I need the total MSE for Apple and Mango respectively.
I know MSE can be calculated as -
MSE = np.mean((p['x'] - p['y'])**2, axis=1)
But how would I calculate it for this type of dataframe?

Set the index to Fruits and transform the columns into a MultiIndex of (x/y, year):
p = p.set_index('Fruits')
p.columns = p.columns.str.split('_', expand=True)
p = p.swaplevel(axis=1)
#            x              y
#         2016 2017 2018 2016 2017 2018
# Fruits
# Apple      2    4   12    3    3   12
# Mango      3    5   13    4    4   13
Then the MSE arithmetic can be vectorized:
mse = p['x'].sub(p['y']).pow(2).mean(axis=1)
# Fruits
# Apple 0.666667
# Mango 0.666667
# dtype: float64
Note that chaining sub and pow is just a cleaner way of applying - and ** on columns:
mse = ((p['x'] - p['y']) ** 2).mean(axis=1)
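If you would rather not build a MultiIndex at all, here is a minimal alternative sketch of the same row-wise MSE using suffix filtering (it rebuilds the sample frame from the question; the _x/_y suffix regexes are an assumption about the column naming):

import pandas as pd

s = {'Fruits': ['Apple', 'Mango'],
     '2016_x': [2, 3], '2017_x': [4, 5], '2018_x': [12, 13],
     '2016_y': [3, 4], '2017_y': [3, 4], '2018_y': [12, 13]}
p = pd.DataFrame(data=s).set_index('Fruits')

x = p.filter(regex='_x$')  # actual values
y = p.filter(regex='_y$')  # forecast values
y.columns = y.columns.str.replace('_y', '_x')  # align names so subtraction pairs up by year

mse = x.sub(y).pow(2).mean(axis=1)  # Apple 0.666667, Mango 0.666667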

Related

Centre of mass row-wise in a dataframe and multiply each column by different mass

I'm trying to calculate the centre of mass of 20 objects, where each object has its own different mass.
These objects are represented in a dataframe cm_x, and their associated masses in a list. Below I show an example of just 3 of those 20 objects, for the sake of saving space. Each object has an x, y, z coordinate, but I'll just show the x, and then I can apply the same technique to the rest. Below is the head of the dataframe.
   bar_head_x  bar_hip_centre_x  bar_left_ankle_x
0   -203.3502         -195.4573          -293.262
1   -203.4280         -195.4720          -293.251
2   -203.4954         -195.4675          -293.248
3   -203.5022         -195.9193          -293.219
4   -203.5014         -195.9092          -293.328
m_head = 0.081
m_hipc = 0.139
m_lank = 0.0465
m = [m_head,m_hipc,m_lank]
I saw in another similar question that someone suggested this method; however, it doesn't incorporate the masses, and that is where I'm having an issue:
def series_sum(pd_series):
    return np.sum(np.dot(pd_series.values, np.asarray(range(1, len(pd_series)+1))) / np.sum(pd_series))

cm_x.apply(series_sum, axis=1)
Basically I want, for each row, to have an associated centre of mass, using the formula for centre of mass, which is sum(x_i * m_i) / sum(m_i).
The desired result would be a new column in the dataframe like so:
cm_x
0 -214.92
1 ...
2 ...
3 ...
4 ...
Any help?
If I understand correctly, you can compute the desired column like this:
>>> df.mul(m).sum(axis=1)/sum(m)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
Use DataFrame.dot and divide by the sum of the list m:
s = df.dot(m).div(sum(m))
print(s)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
dtype: float64
If you need a DataFrame, add Series.to_frame:
df1 = df.dot(m).div(sum(m)).to_frame('cm_x')
print(df1)
cm_x
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
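As a quick sanity check, here is a self-contained sketch (using only the five head rows and the masses from the question) showing that the element-wise and dot-product formulations agree:

import pandas as pd

df = pd.DataFrame({
    'bar_head_x':       [-203.3502, -203.4280, -203.4954, -203.5022, -203.5014],
    'bar_hip_centre_x': [-195.4573, -195.4720, -195.4675, -195.9193, -195.9092],
    'bar_left_ankle_x': [-293.262, -293.251, -293.248, -293.219, -293.328],
})
m = [0.081, 0.139, 0.0465]

a = df.mul(m).sum(axis=1) / sum(m)  # element-wise multiply, then row-wise sum
b = df.dot(m).div(sum(m))           # matrix-vector product, same weighted average
assert (a - b).abs().max() < 1e-9   # both give -214.921628 for row 0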

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate one fruit, say 'pears', and subtract each of its average sales figures from its current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding pear_sales of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
For the second set of operations, I need to add 5 rows equal to 'new_purchases' to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since there's another temporary column created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting of Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
          Description  Value  some_op
0    apple_yearly_avg     57        0
1         apple_sales    100       43
2   apple_monthly_avg     80       23
3        apple_st_dev     12      -45
4        apple_st_dev     12     6100
5        apple_st_dev     12     4900
6        apple_st_dev     12     3700
7        apple_st_dev     12     2500
8        apple_st_dev     12     1300
9   pears_monthly_avg     33       -2
10        pears_sales     40        5
11   pears_yearly_avg     35        0
12       pears_st_dev      8      -27
13       pears_st_dev      8     1640
14       pears_st_dev      8     1320
15       pears_st_dev      8     1000
16       pears_st_dev      8      680
17       pears_st_dev      8      360
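A version note, not part of the original answer: on pandas 2.2 and later, groupby.apply warns when the applied function receives the grouping column. Since reformat never uses the Fruit column, you can (assuming a recent pandas) exclude it up front, which also makes the drop step unnecessary:

# pandas >= 2.2: keep the grouping column out of the frames passed to reformat
df = df.groupby('Fruit').apply(reformat, include_groups=False).reset_index(drop=True)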
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.
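As for the SettingWithCopyWarning mentioned in Question 1 (the answer above doesn't address it): the usual remedy, offered here as a general pandas tip rather than part of the original answer, is to take an explicit copy of the filtered slice before assigning new columns:

df_pear = df[df['Description'].str.contains('pear')].copy()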

Python Function to Compute a Beta Matrix

I'm looking for an efficient function to automatically produce betas for every possible multiple regression model, given a dependent variable and a set of predictors as a DataFrame in Python.
For example, given this set of data:
https://i.stack.imgur.com/YuPuv.jpg
The dependent variable is 'Cases per Capita' and the columns following are the predictor variables.
In a simpler example:
Student Grade Hours Slept Hours Studied ...
--------- -------- ------------- --------------- -----
A 90 9 1 ...
B 85 7 2 ...
C 100 4 5 ...
... ... ... ... ...
where the beta matrix output would look as such:
Regression Hours Slept Hours Studied
------------ ------------- ---------------
1 # N/A
2 N/A #
3 # #
The table size would be 2^n - 1 rows, where n is the number of predictor variables, so in the case with 5 predictors and 1 dependent variable there would be 31 regressions, each with a different possible combination of beta calculations.
The process is described in greater detail here, and an actual solution written in R is posted here.
I am not aware of any package that already does this. But you can create all those combinations (2^n - 1, where n is the number of columns in X, the independent variables), fit a linear regression model for each combination, and then get the coefficients/betas for each model.
Here is how I would do it; hope this helps:
from sklearn import datasets, linear_model
import numpy as np
from itertools import combinations

# test dataset (note: load_boston was removed in scikit-learn 1.2;
# on newer versions, substitute another regression dataset)
X, y = datasets.load_boston(return_X_y=True)
X = X[:, :3]  # original X has 13 columns; only taking n=3 instead of 13 columns

# create all 2^n - 1 (here 7, because n=3) combinations of columns,
# where n is the number of features/independent variables
all_combs = []
for i in range(X.shape[1]):
    all_combs.extend(combinations(range(X.shape[1]), i + 1))

# print the 2^n - 1 combinations
print('2^n-1 combinations are:')
print(all_combs)

# create the betas/coefficients matrix, with 2^n - 1 rows and as many columns as X
betas = np.zeros([len(all_combs), X.shape[1]]) + np.nan

# fit a model for each combination of columns and put its coefficients into the betas matrix
lr = linear_model.LinearRegression()
for regression_no, comb in enumerate(all_combs):
    lr.fit(X[:, comb], y)
    betas[regression_no, comb] = lr.coef_

# print the coefficients of each model
print('Regression No'.center(15) + " ".join(['column {}'.format(i).center(10) for i in range(X.shape[1])]))
print('_' * 50)
for index, beta in enumerate(betas):
    print('{}'.format(index + 1).center(15), " ".join(['{:.4f}'.format(beta[i]).center(10) for i in range(X.shape[1])]))
results in
2^n-1 combinations are:
[(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
Regression No column 0 column 1 column 2
__________________________________________________
1 -0.4152 nan nan
2 nan 0.1421 nan
3 nan nan -0.6485
4 -0.3521 0.1161 nan
5 -0.2455 nan -0.5234
6 nan 0.0564 -0.5462
7 -0.2486 0.0585 -0.4156
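If a labeled table is more convenient than the hand-formatted printout, a small follow-up sketch (the column labels are my own choice) wraps the betas matrix in a DataFrame:

import pandas as pd

# reuses betas, all_combs and X from the code above
betas_df = pd.DataFrame(betas,
                        columns=['column {}'.format(i) for i in range(X.shape[1])],
                        index=range(1, len(all_combs) + 1))  # regression numbers 1..2^n-1
print(betas_df.round(4))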

How to convert DataFrame with 1-Level to 3-Level Index Hierarchy

I have a flat DataFrame like this:
And I would like to convert it into a DataFrame like this:
For every test (T) and every version (Version), I would like to sum up the counts of answers mapped onto a given Likert scale (I cut it down to 3 entries for demonstration purposes) as percentages.
The whole set of Likert-scale values for every combination of T and Version should sum up to 100 percent.
likert = {
    'Agree': 1,
    'Undecided': 2,
    'Disagree': 3,
}
How is this possible?
Thanks for your help!
Probably not the most elegant solution, but I think this achieves your goal. Suppose your dataframe is named df (I randomly sampled between the scales, so my df isn't exactly what you described):
res = df.melt(id_vars=['T', 'Version'], value_vars=['Q1', 'Q2'], value_name='Scale')
This transforms your dataframe to long format:
# T Version variable Scale
# 0 1 A Q1 Undecided
# 1 1 A Q1 Disagree
# 2 1 A Q1 Undecided
# 3 1 A Q1 Agree
Then you want to calculate the size of every combination of your variables, which can be accomplished in the following way:
res = res.groupby(['T', 'Version', 'Scale', 'variable']).size()
Which yields:
# T Version Scale variable
# 1 A Agree Q1 2
# Q2 1
# Disagree Q2 3
# Undecided Q1 2
# B Agree Q1 1
Then, to move Q1 and Q2 to the columns, you unstack the last index level like so:
res = res.unstack(level=-1).fillna(0)
# variable Q1 Q2
# T Version Scale
# 1 A Agree 2.0 1.0
# Disagree 0.0 3.0
# Undecided 2.0 0.0
Finally, to compute the percent for each combination of the first two index levels:
res = res.groupby(level=[0, 1]).apply(lambda x: 100. * x / x.sum())
Which gives the desired result:
# variable Q1 Q2
# T Version Scale
# 1 A Agree 50.000000 25.000000
# Disagree 0.000000 75.000000
# Undecided 50.000000 0.000000
# B Agree 33.333333 0.000000
# Disagree 66.666667 66.666667
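For reference, here is a compact end-to-end sketch chaining the steps above on a small hand-built frame (the sample answers are my own, so the percentages will differ from the ones shown):

import pandas as pd

df = pd.DataFrame({
    'T':       [1, 1, 1, 1, 1, 1, 1, 1],
    'Version': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Q1':      ['Undecided', 'Disagree', 'Undecided', 'Agree',
                'Agree', 'Disagree', 'Disagree', 'Agree'],
    'Q2':      ['Agree', 'Disagree', 'Disagree', 'Disagree',
                'Undecided', 'Agree', 'Agree', 'Undecided'],
})

res = (df.melt(id_vars=['T', 'Version'], value_vars=['Q1', 'Q2'], value_name='Scale')
         .groupby(['T', 'Version', 'Scale', 'variable']).size()
         .unstack(level=-1).fillna(0)
         # group_keys=False keeps the (T, Version, Scale) index flat on pandas 2.x
         .groupby(level=[0, 1], group_keys=False).apply(lambda x: 100. * x / x.sum()))
print(res)  # each (T, Version) block sums to 100 percent per question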

Python Dataframe: Calculating R^2 and RMSE Using Groupby on One Column

I have the following Python dataframe:
Type Actual Predicted
A 4 3
A 10 18
A 13 11
B 3 10
B 4 2
B 8 33
C 20 17
C 40 33
C 87 80
C 32 30
I have the code to calculate R^2 and RMSE, but I don't know how to calculate them by distinct "Type".
For now, my methodology is breaking the larger table into three smaller tables consisting of only A, B, and C values, then calculating R^2 and RMSE off each smaller table, and appending the results back together.
But the above method is inefficient, and I believe there should be an easier way.
Below is the format I want the results to produce when things are grouped:
Type R^2 RMSE
A value value
B value value
C value value
Here is a groupby method:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error

def r2_rmse(g):
    r2 = r2_score(g['Actual'], g['Predicted'])
    rmse = np.sqrt(mean_squared_error(g['Actual'], g['Predicted']))
    return pd.Series(dict(r2=r2, rmse=rmse))

your_df.groupby('Type').apply(r2_rmse).reset_index()
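For a quick test, a sketch that rebuilds the table from the question and runs the function defined above (the r2/rmse values are whatever the two metrics return; they are not reproduced here):

import pandas as pd

df = pd.DataFrame({
    'Type':      list('AAABBBCCCC'),
    'Actual':    [4, 10, 13, 3, 4, 8, 20, 40, 87, 32],
    'Predicted': [3, 18, 11, 10, 2, 33, 17, 33, 80, 30],
})

print(df.groupby('Type').apply(r2_rmse).reset_index())  # one row per Type, with r2 and rmse columns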
