Python Dataframe: Calculating R^2 and RMSE Using Groupby on One Column

I have the following Python dataframe:
Type Actual Predicted
A 4 3
A 10 18
A 13 11
B 3 10
B 4 2
B 8 33
C 20 17
C 40 33
C 87 80
C 32 30
I have the code to calculate R^2 and RMSE, but I don't know how to calculate them for each distinct "Type".
For now, my approach is to break the larger table into three smaller tables containing only the A, B, and C rows, calculate R^2 and RMSE for each smaller table, and then append the results back together.
This is inefficient, and I believe there should be an easier way.
Below is the format I want the results to produce when things are grouped:
Type R^2 RMSE
A value value
B value value
C value value

Here is a groupby method:
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error
def r2_rmse(g):
    r2 = r2_score(g['Actual'], g['Predicted'])
    rmse = np.sqrt(mean_squared_error(g['Actual'], g['Predicted']))
    return pd.Series(dict(r2=r2, rmse=rmse))
your_df.groupby('Type').apply(r2_rmse).reset_index()
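If you prefer not to depend on sklearn, the same two metrics can be computed with plain NumPy inside the grouped function. This is a minimal sketch assuming the column names above, with R^2 written out as 1 - SS_res/SS_tot (the helper name r2_rmse_np is mine):
import numpy as np
import pandas as pd
def r2_rmse_np(g):
    resid = g['Actual'] - g['Predicted']
    ss_res = (resid ** 2).sum()
    ss_tot = ((g['Actual'] - g['Actual'].mean()) ** 2).sum()
    # RMSE is the square root of the mean squared residual
    return pd.Series({'r2': 1 - ss_res / ss_tot, 'rmse': np.sqrt((resid ** 2).mean())})
your_df.groupby('Type').apply(r2_rmse_np).reset_index()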

Related

Create Crosstab Using Percentile Counts In Python

I'm doing work involving stock market research and I want to create a crosstab to run a chi-squared test on. I have stock market price change data as a dataframe, and I want to build the crosstab from counts by percentile of two of the columns. Ideally it'd look something like this:
        0.25   0.5   0.75   1.0
0.25    12     45    13     12
0.5     2      27    9      15
0.75    14     11    89     23
1.0     10     52    11     7
Here, for example, the (0.75, 0.5) entry is the count of data points that lie between the 0.5 and 0.75 percentiles for the first variable and between the 0.25 and 0.5 percentiles for the second variable. Obviously those numbers probably aren't actually possible, but you get the point.
All I can think of so far is brute force: compute the percentiles for each variable individually, count the points in each combination, and add them to a table manually. Is there any shorter way of doing this?
Preparing a sample dataset
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(100,2), columns=['A', 'B'])
The percentile bins can be computed with qcut; the 4 is the number of quantile bins to split each variable into.
df['A_binned'] = pd.qcut(df['A'], 4)
df['B_binned'] = pd.qcut(df['B'], 4)
Count the number of records in each pair of bins:
dff = df.groupby(by=['A_binned', 'B_binned']).count().reset_index()
Finally you can pivot the dataframe
dff.pivot_table(index='A_binned', columns = 'B_binned', values='A')
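A shorter route, assuming the binned columns from the qcut step above, is pandas' built-in pd.crosstab, which tabulates the pair counts directly:
# counts of (A_binned, B_binned) combinations
pd.crosstab(df['A_binned'], df['B_binned'])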

Computing MSE per row in pandas dataframe

I have the dataframe below with many columns: 2016_x, 2016_y, 2017_x, etc.,
where x represents my actual values and y represents the forecast values.
How would I compute the mean squared error (MSE) row-wise to see it for the different fruits?
Here is the code:
import pandas as pd
s={'Fruits':['Apple','Mango'],'2016_x':[2,3],'2017_x':[4,5],'2018_x':[12,13],'2016_y':[3,4],'2017_y':[3,4],'2018_y':[12,13]}
p=pd.DataFrame(data=s)
This is what the dataframe looks like.
The desired output should show the MSE for Apple and Mango, i.e. row by row.
The MSE should use the difference between the x and y values for each year.
Basically, I need the total MSE for Apple and Mango respectively.
I know MSE can be calculated as:
MSE = np.mean((p['x'] - p['y'])**2, axis=1)
But how would I calculate for this type of data frame?
Set the index to Fruits and transform the columns into a MultiIndex of (x/y, year):
p = p.set_index('Fruits')
p.columns = p.columns.str.split('_', expand=True)
p = p.swaplevel(axis=1)
# x y
# 2016 2017 2018 2016 2017 2018
# Fruits
# Apple 2 4 12 3 3 12
# Mango 3 5 13 4 4 13
Then the MSE arithmetic can be vectorized:
mse = p['x'].sub(p['y']).pow(2).mean(axis=1)
# Fruits
# Apple 0.666667
# Mango 0.666667
# dtype: float64
Note that chaining sub and pow is just a cleaner way of applying - and ** on columns:
mse = ((p['x'] - p['y']) ** 2).mean(axis=1)
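An alternative that skips the reshape entirely is to select the _x and _y column groups by suffix and align them on the year. A minimal sketch, assuming every year appears with both an _x and a _y column, starting again from the frame defined in the question:
# q is the original frame from the question, before the set_index/column-split steps above
q = pd.DataFrame(data=s).set_index('Fruits')
x = q.filter(regex='_x$').rename(columns=lambda c: c.split('_')[0])
y = q.filter(regex='_y$').rename(columns=lambda c: c.split('_')[0])
mse = ((x - y) ** 2).mean(axis=1)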

Statistical significance test after matching

I have a dataframe consisting of four columns and around 20000 rows, like this.
import pandas as pd
import numpy as np
d = {'x': [1,1,0,1,0,0,1],'BPM':[70,55,45,np.nan,35,25,np.nan],'AGE': [50, 47,21, 50,24,47,16], 'WEIGHT': [50,100,50,np.nan,np.nan,100,27]}
df = pd.DataFrame(data=d)
x BPM AGE WEIGHT
1 70 50 50
1 55 47 100
0 45 21 50
1 nan 50 nan
0 35 24 nan
0 25 47 100
1 nan 16 27
Is there any significant difference in "BPM" between class '1' and class '0' after matching AGE and WEIGHT?
There are two classes, 0 and 1, and the number of samples is not equal between them. I understand that I first have to match on AGE and WEIGHT and then apply a t-test. I am new to this field, so I do not understand how to proceed.
You could calculate the t-scores by hand.
mean_bpm_df = df.groupby(['AGE','WEIGHT','x']).mean().unstack(level=-1)
mean_bpm_df.columns = ['mean_bpm_0','mean_bpm_1']
std_count_df = df.drop(columns='x').groupby(['AGE','WEIGHT']).agg(['std','count'])
std_count_df.columns = ['std_bpm','count_bpm']
t_df = (mean_bpm_df.mean_bpm_0 - mean_bpm_df.mean_bpm_1) / (std_count_df.std_bpm / np.sqrt(std_count_df.count_bpm))
Now, if you also want the p-values, those can be calculated by hand too. Assume a 2-sided t-test (you can modify this if needed).
from scipy.stats import t
p_df = pd.DataFrame(index=t_df.index, data=2*(1 - t.cdf(abs(t_df), std_count_df.count_bpm-1)))
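If you would rather not compute the statistics by hand, here is a sketch of the same per-group comparison using scipy's built-in Welch t-test; it assumes pandas and numpy are imported as in the question, returns NaN for groups where either class has fewer than two non-missing BPM values, and the helper name ttest_by_group is mine:
from scipy.stats import ttest_ind
def ttest_by_group(g):
    a = g.loc[g['x'] == 0, 'BPM'].dropna()
    b = g.loc[g['x'] == 1, 'BPM'].dropna()
    if len(a) < 2 or len(b) < 2:
        return pd.Series({'t': np.nan, 'p': np.nan})
    t_stat, p_val = ttest_ind(a, b, equal_var=False)  # Welch's two-sample t-test
    return pd.Series({'t': t_stat, 'p': p_val})
results = df.groupby(['AGE', 'WEIGHT']).apply(ttest_by_group)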

Python Function to Compute a Beta Matrix

I'm looking for an efficient function to automatically produce betas for every possible multiple regression model given a dependent variable and set of predictors as a DataFrame in python.
For example, given this set of data:
https://i.stack.imgur.com/YuPuv.jpg
The dependent variable is 'Cases per Capita' and the columns following are the predictor variables.
In a simpler example:
Student Grade Hours Slept Hours Studied ...
--------- -------- ------------- --------------- -----
A 90 9 1 ...
B 85 7 2 ...
C 100 4 5 ...
... ... ... ... ...
where the beta matrix output would look as such:
Regression Hours Slept Hours Studied
------------ ------------- ---------------
1 # N/A
2 N/A #
3 # #
The table size would be [2^n - 1] where n is the number of variables, so in the case with 5 predictors and 1 dependent, there would be 31 regressions, each with a different possible combination of beta calculations.
The process is described in greater detail here and an actual solution that is written in R is posted here.
I am not aware of any package that already does this, but you can create all those combinations (2^n-1, where n is the number of columns in X, the independent variables), fit a linear regression model for each combination, and then get the coefficients/betas for each model.
Here is how I would do it; hope this helps:
from sklearn import datasets, linear_model
import numpy as np
from itertools import combinations
# test dataset
X, y = datasets.load_boston(return_X_y=True)
X = X[:, :3]  # Original X has 13 columns, only taking n=3 instead of 13 columns
# create all 2^n-1 (here 7, because n=3) combinations of columns,
# where n is the number of features/independent variables
all_combs = []
for i in range(X.shape[1]):
    all_combs.extend(combinations(range(X.shape[1]), i + 1))
# print 2^n-1 combinations
print('2^n-1 combinations are:')
print(all_combs)
## Create a betas/coefficients matrix of NaNs with 2^n-1 rows and one column per column of X
betas = np.zeros([len(all_combs), X.shape[1]]) + np.nan
## Fit a model for each combination of columns and add the coefficients into the betas matrix
lr = linear_model.LinearRegression()
for regression_no, comb in enumerate(all_combs):
    lr.fit(X[:, comb], y)
    betas[regression_no, comb] = lr.coef_
## Print coefficients of each model
print('Regression No'.center(15) + " ".join(['column {}'.format(i).center(10) for i in range(X.shape[1])]))
print('_' * 50)
for index, beta in enumerate(betas):
    print('{}'.format(index + 1).center(15), " ".join(['{:.4f}'.format(beta[i]).center(10) for i in range(X.shape[1])]))
results in
2^n-1 combinations are:
[(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
 Regression No   column 0   column 1   column 2
__________________________________________________
       1         -0.4152      nan        nan
       2           nan       0.1421      nan
       3           nan        nan      -0.6485
       4         -0.3521     0.1161      nan
       5         -0.2455      nan      -0.5234
       6           nan       0.0564    -0.5462
       7         -0.2486     0.0585    -0.4156
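Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet above will not run on current releases as written. Any regression dataset with a few predictors can be substituted; for example, a synthetic one (the parameters here are arbitrary):
from sklearn.datasets import make_regression
# stand-in for load_boston: 100 samples, 3 predictor columns
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)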

Using data from Python's pandas dataframes to sample from normal distributions

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas dataframe of the same shape, with values sampled from the normal distribution using the corresponding means and standard deviations?
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me, but I'm having difficulty putting the data back into a dataframe of the same shape as the mean and standard deviation dfs. Does anybody have any suggestions on how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0, 10))
print(np.random.normal(1, 11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
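As a side note, on newer NumPy the recommended interface is the Generator API, which broadcasts the array-valued means and standard deviations the same way; a minimal sketch (the seed is arbitrary and will not reproduce the legacy stream above):
rng = np.random.default_rng(1)
pd.DataFrame(rng.normal(means, sts))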
I will use a dictionary to construct this dataframe. Suppose the indices and columns are the same for means and stds:
means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))
samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values (set_value and .ix have been removed from newer pandas, so .at and .loc are used here):
import itertools
samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)
