Multiple Linear Regression - Determining Coefficients for 3 independent variables - python

I am struggling to find the coefficients b1, b2 and b3. My model has 3 independent variables x1, x2 and x3 and one dependent variable y.
x1,x2,x3,y
89,4,3.84,7
66,1,3.19,5.4
78,3,3.78,6.6
111,6,3.89,7.4
44,1,3.57,4.8
77,3,3.57,6.4
80,3,3.03,7
66,2,3.51,5.6
109,5,3.54,7.3
76,3,3.25,6.4
I want to use the matrix method to find the coefficients b1, b2 and b3. According to the tutorial I am following, b1 is 0.0141, b2 is 0.383 and b3 is -0.607.
I am not sure how to arrive at those values; when I tried to invert the matrix containing the x1, x2, x3 values I got the error below.
raise LinAlgError('Last 2 dimensions of the array must be square')
numpy.linalg.linalg.LinAlgError: Last 2 dimensions of the array must be square
Could someone please help me set up this matrix calculation so that I get the desired values?

In matrix form, the regression coefficients are given by
b = (xᵀx)⁻¹ xᵀ y
where x is your data matrix of predictors and y is a vector of outcome values.
In Python (NumPy), that looks something like this:
import numpy as np
# normal equation: b = (xᵀx)⁻¹ xᵀ y
b = np.dot(x.T, x)       # xᵀx
b = np.linalg.inv(b)     # (xᵀx)⁻¹
b = np.dot(b, x.T)       # (xᵀx)⁻¹ xᵀ
b = np.dot(b, y)         # (xᵀx)⁻¹ xᵀ y
Using that on your data you get the following coefficients:
0.0589514, -0.25211869, 0.70097577
Those values don't match your expected output, and it's because the tutorial you're following must also be modelling an intercept. To do that we add a column of ones to the data matrix so it looks like this:
x.insert(loc=0, column='x0', value=np.ones(10))
x0 x1 x2 x3
0 1.0 89 4 3.84
1 1.0 66 1 3.19
2 1.0 78 3 3.78
3 1.0 111 6 3.89
4 1.0 44 1 3.57
5 1.0 77 3 3.57
6 1.0 80 3 3.03
7 1.0 66 2 3.51
8 1.0 109 5 3.54
9 1.0 76 3 3.25
Now we get the expected regression coefficients (plus an additional value for the intercept):
6.21137766, 0.01412189, 0.38315024, -0.60655271
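For completeness, here is a minimal end-to-end sketch using the data from the question (arrays typed in directly rather than read from a file):
import numpy as np
# data from the question
x1 = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]
x2 = [4, 1, 3, 6, 1, 3, 3, 2, 5, 3]
x3 = [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25]
y = np.array([7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4])
# design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(len(y)), x1, x2, x3])
# normal equation: b = (X'X)^-1 X'y
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(b)  # roughly [6.211, 0.0141, 0.383, -0.607]
np.linalg.lstsq(X, y, rcond=None)[0] gives the same coefficients and is the more numerically stable option when xᵀx is close to singular.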


Predict existing data using scikit-learn

My dataset looks like this:
age address freetime goout Dalc Walc G1 G2 G3 AverageG
17 U 1 1 3 5 7 7 7 7
15 X 3 2 6 3 5 4 2 3.6666
20 T 1 5 4 1 3 2 1 2
What I'm trying to do in Python is to predict the value AverageG, which is the average of G1, G2 and G3.
I know that AverageG can be calculated by simply averaging G1, G2 and G3, but in my case it has to be predicted using the scikit-learn library.
For this toy example you can use linear regression.
I will give the general idea, then you can translate it for your specific dataframe:
from sklearn.linear_model import LinearRegression
import numpy as np
# train on random integer features whose target is the row-wise mean
X = np.random.randint(0, 10, (1000, 3))
y = X.mean(axis=1)
model = LinearRegression()
model.fit(X, y)
# predict the average of a new triple
new_data = np.array([1, 2, 3]).reshape(1, -1)
model.predict(new_data)
and the model correctly predicts:
array([2.])
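Translated to the question's dataframe, a sketch could look like this (assuming it is already loaded as df with columns G1, G2, G3 and AverageG):
X = df[["G1", "G2", "G3"]].values
y = df["AverageG"].values
model = LinearRegression().fit(X, y)
df["PredictedAverageG"] = model.predict(X)   # column name chosen here for illustration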

Pandas: Calculate Distance and Angle between X, Y grouped

Well, I have the following columns:
Id PlayId X Y
0 0 2.3 3.4
1 0 5.4 3.2
2 1 3.2 5.1
3 1 4.2 1.7
If I have two rows grouped by one PlayId, I want to add two columns of Distance and Angle:
Id PlayId X Y Distance_0 Distance_1 Angle_0 Angle_1
0 0 2.3 3.4 0.0 ? 0.0 ?
1 0 5.4 3.2 ? 0.0 ? 0.0
2 1 3.2 5.1
3 1 4.2 1.7
Each Distance column describes the Euclidean distance between the i-th and j-th elements in a group:
dist(x0, x1, y0, y1) = sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
The angle between the i-th and j-th elements is calculated in a similar way.
So, how can I perform this efficiently, without processing elements one-by-one?
You can compute the pairwise distances by using the pdist function from SciPy:
df = pd.DataFrame({'X': [5, 6, 7], 'Y': [3, 4, 5]})
# df
# X Y
# 0 5 3
# 1 6 4
# 2 7 5
from scipy.spatial.distance import pdist, squareform
cols = [f'Distance_{i}' for i in range(len(df))]
pd.DataFrame(squareform(pdist(df.values)), columns=cols)
which produces the following DataFrame:
Distance_0 Distance_1 Distance_2
0 0.000000 1.414214 2.828427
1 1.414214 0.000000 1.414214
2 2.828427 1.414214 0.000000
This works because pdist takes an array of size m x n, where m is the number of observations (rows) and n is the dimension of those observations (in this case two: X and Y).
You could subsequently concat the original DataFrame with the newly created one if needed (using pd.concat).
For the angle, you could use pdist as well, using metric='cosine' to compute the cosine distance. See this post for more information.
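To get this per PlayId group from the original question, a sketch along these lines could work (assuming the data is loaded as df with columns Id, PlayId, X and Y, and that the groups have the same number of rows so the Distance_i columns line up):
import pandas as pd
from scipy.spatial.distance import pdist, squareform
def group_distances(g):
    # pairwise distances within one PlayId group
    d = squareform(pdist(g[['X', 'Y']].values))
    cols = [f'Distance_{i}' for i in range(len(g))]
    return pd.DataFrame(d, columns=cols, index=g.index)
dist_cols = df.groupby('PlayId', group_keys=False).apply(group_distances)
df = pd.concat([df, dist_cols], axis=1)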

Can I use lambda, map, apply, or applymap to fill a dataframe?

This is a simplified version of my data. I have a dataframe of coordinates, and an empty dataframe which should be filled with the distance of each pair using the function provided.
What is the quickest method to fill this dataframe? As much as possible, I want to stay away from nested for loops (slow!). Can I use apply or applymap?
You may modify the function or other parts accordingly. Thanks.
import pandas as pd

def get_distance(point1, point2):
    """Gets the coordinates of two points as two lists, and outputs their distance"""
    return (((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2 + (point1[2] - point2[2]) ** 2) ** 0.5)

# Dataframe of coordinates.
df = pd.DataFrame({"No.": [25, 36, 70, 95, 112, 101, 121, 201], "x": [1,2,3,4,2,3,4,5], "y": [2,3,4,5,3,4,5,6], "z": [3,4,5,6,4,5,6,7]})
df.set_index("No.", inplace=True)

# Dataframe to be filled with each pair distance.
df_dist = pd.DataFrame({'target': [112, 101, 121, 201]}, columns=["target", 25, 36, 70, 95])
df_dist.set_index("target", inplace=True)
AFAIK there is no clear speed benefit of lambda over a for loop, and it's very hard to write a double lambda; that is usually reserved for straightforward row operations.
However, with some engineering, we can reduce the code to a few simple and self-explanatory lines:
import numpy as np

get = lambda i: df.loc[i, :].values
dist = lambda i, j: np.sqrt(sum((get(i) - get(j)) ** 2))

# Fills your df_dist
for i in df_dist.columns:
    for j in df_dist.index:
        df_dist.loc[j, i] = dist(i, j)
The resulting df_dist:
25 36 70 95
target
112 1.732051 0.000000 1.732051 3.464102
101 3.464102 1.732051 0.000000 1.732051
121 5.196152 3.464102 1.732051 0.000000
201 6.928203 5.196152 3.464102 1.732051
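If you would rather skip the explicit loops, an alternative sketch (cdist is not part of the original answer) fills df_dist in one call, using the same df and df_dist defined in the question:
from scipy.spatial.distance import cdist
targets = list(df_dist.index)    # 112, 101, 121, 201
sources = list(df_dist.columns)  # 25, 36, 70, 95
df_dist.loc[targets, sources] = cdist(df.loc[targets].values, df.loc[sources].values)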
If you don't want to use for loops, you can compute the distances between all the possible pairs in the following way.
You first need to do the Cartesian product of df with itself to get all the possible pairs of points.
i, j = np.where(1 - np.eye(len(df)))
df = df.iloc[i].reset_index(drop=True).join(
    df.iloc[j].reset_index(drop=True), rsuffix='_2')
Here i and j are the row and column indices of the off-diagonal entries (upper and lower triangles) of a square matrix of size len(df), i.e. every ordered pair of distinct points. Once you have done this you just need to apply your distance function:
df['distance'] = get_distance([df['x'],df['y'],df['z']], [df['x_2'],df['y_2'],df['z_2']])
df.head()
No. x y z No._2 x_2 y_2 z_2 distance
0 25 1 2 3 36 2 3 4 1.732051
1 25 1 2 3 70 3 4 5 3.464102
2 25 1 2 3 95 4 5 6 5.196152
3 25 1 2 3 112 2 3 4 1.732051
4 25 1 2 3 101 3 4 5 3.464102
If you wanted to compute only the pairs needed by df_dist, you can modify the mask 1 - np.eye(len(df)) accordingly, for example as in the sketch below.
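A rough sketch of that modification (the helper names pos, rows, cols and mask are just for illustration), applied to the question's original df, which is indexed by "No.", in place of the 1 - np.eye(len(df)) line:
import numpy as np
pos = {no: k for k, no in enumerate(df.index)}   # row position of each "No." label
rows = [pos[t] for t in df_dist.index]           # 112, 101, 121, 201
cols = [pos[c] for c in df_dist.columns]         # 25, 36, 70, 95
mask = np.zeros((len(df), len(df)), dtype=bool)
mask[np.ix_(rows, cols)] = True                  # keep only the target/source pairs
i, j = np.where(mask)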

Subtracting group-wise mean from a matrix or data frame in python (the "within" transformation for panel data)

In datasets where units are observed multiple times, many statistical methods (particularly in econometrics) apply a transformation to the data in which the group-wise mean of each variable is subtracted off, creating a dataset of unit-level (non-standardized) anomalies from a unit level mean.
I want to do this in Python.
In R, it is handled quite cleanly by the demeanlist function in the lfe package. Here's an example dataset, with a grouping variable fac:
> df <- data.frame(fac = factor(c(rep("a", 5), rep("b", 6), rep("c", 4))),
+ x1 = rnorm(15),
+ x2 = rbinom(15, 10, .5))
> df
fac x1 x2
1 a -0.77738784 6
2 a 0.25487383 4
3 a 0.05457782 4
4 a 0.21403962 7
5 a 0.08518492 4
6 b -0.88929876 4
7 b -0.45661751 5
8 b 1.05712683 3
9 b -0.24521251 5
10 b -0.32859966 7
11 b -0.44601716 3
12 c -0.33795597 4
13 c -1.09185690 7
14 c -0.02502279 6
15 c -1.36800818 5
And the transformation:
> library(lfe)
> demeanlist(df[,c("x1", "x2")], list(df$fac))
x1 x2
1 -0.74364551 1.0
2 0.28861615 -1.0
3 0.08832015 -1.0
4 0.24778195 2.0
5 0.11892725 -1.0
6 -0.67119563 -0.5
7 -0.23851438 0.5
8 1.27522996 -1.5
9 -0.02710938 0.5
10 -0.11049653 2.5
11 -0.22791403 -1.5
12 0.36775499 -1.5
13 -0.38614594 1.5
14 0.68068817 0.5
15 -0.66229722 -0.5
In other words, the following numbers are subtracted from groups a, b, and c:
> library(doBy)
> summaryBy(x1+x2~fac, data = df)
fac x1.mean x2.mean
1 a -0.03374233 5.0
2 b -0.21810313 4.5
3 c -0.70571096 5.5
I'm sure I could figure out a function to do this, but I'll be calling it thousands of times on very large datasets, and would like to know if something fast and optimized has already been built, or is obvious to construct.
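A minimal pandas sketch of this within transformation (assuming the example data is available in Python as a DataFrame df with columns fac, x1 and x2):
import pandas as pd
# subtract each group's mean from its members; transform keeps the original
# row order and length, so the result aligns with df
demeaned = df[['x1', 'x2']] - df.groupby('fac')[['x1', 'x2']].transform('mean')
groupby(...).transform('mean') is vectorized, so it should scale to large datasets, though it is worth benchmarking if it has to be called thousands of times.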

Dynamic number of X values (independent variables) in glm function in R isn't giving the right output

R glm for logistic regression. I was trying to dynamically pass values into the formula, following another Stack Overflow post.
The function is called from Python using rpy2, and the outputs below are what I got when I printed summary(glm.out).
I ran the test for 2 different scenarios.
Scenario 1: the input x, y values were taken directly from the Python part of the code, converted to the right format, and passed to the logistic_regression function in R. The input values from Python are printed below (2nd code block). glm is run on those values using as.formula, which gave me one output (4th block of code).
Scenario 2: the input x, y values are simply created in R as given in the code (in this case x = k1, k2 and y = m), and the glm function is run in the traditional way. That gave me a different output (6th block of code).
The inputs are numerically correct, but the format is different: a data frame in the first scenario and vectors in the second.
Or maybe my glm call is wrong.
R code.
logistic_regression = function(y, x, colnames){
  print("Y value is ")
  print(y)
  print("X value is ")
  print(x)
  m <- c(1,1,1,0,0,0)
  k1 <- c(4,3,5,1,2,3)
  k2 <- c(6,7,8,5,6,3)
  glm.out = glm(as.formula(paste("y~", paste(colnames, collapse="+"))), family=binomial(logit), data=x)
  # glm.out = glm(m~k1+k2, family=binomial(logit), data=x)
  return(summary(glm.out))
}
INPUT PRINTED
[1] "Y value is "
[1] 1 1 1 0 0 0
[1] "X value is "
X0 X1
0 4 6
1 3 7
2 5 8
3 1 5
4 2 6
5 3 3
When I ran the code
glm.out = glm(as.formula(paste("y~", paste(colnames, collapse="+"))), family=binomial(logit), data=x)
OUTPUT
Call:
glm(formula = as.formula(paste("y~", paste(colnames, collapse = "+"))),
family = binomial(logit), data = x)
Deviance Residuals:
[1] 0 0 0 0 0 0
Coefficients: (3 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.457e+01 1.310e+05 0 1
X02 6.872e-14 1.853e+05 0 1
X03 3.566e-14 1.853e+05 0 1
X04 4.913e+01 1.853e+05 0 1
X05 4.913e+01 1.853e+05 0 1
X15 NA NA NA NA
X16 NA NA NA NA
X17 4.913e+01 1.853e+05 0 1
X18 NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178e+00 on 5 degrees of freedom
Residual deviance: 2.5720e-10 on 0 degrees of freedom
AIC: 12
Number of Fisher Scoring iterations: 23
But when I ran
glm.out = glm(m~k1+k2, family=binomial(logit), data=x)
The output was completely different (looked more correct)
Call:
glm(formula = m ~ k1 + k2, family = binomial(logit), data = x)
Deviance Residuals:
0 1 2 3 4 5
1.532e-06 1.390e-05 2.110e-08 -2.110e-08 -1.344e-05 -2.110e-08
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -199.05 1221734.18 0 1
k1 25.30 281753.45 0 1
k2 20.89 288426.19 0 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8.3178e+00 on 5 degrees of freedom
Residual deviance: 3.7636e-10 on 3 degrees of freedom
AIC: 6
Number of Fisher Scoring iterations: 24
In glm, the formula argument is a symbolic description of the model to be fitted and the data argument is an optional data frame containing the variables in the model.
In your logistic_regression function call of glm(), the model variables indicated in formula y~k1+k2 are not contained within data=x (a data frame with two columns named X0 and X1), and thus, are taken from the environment from which glm is called (your logistic_regression function). The 3 hardcoded vectors (m, k1, k2) in that environment are not associated with the inputs (i.e., the x=k1,k2 and y=m step done in your second scenario is not occurring within your function).
To call glm() using your logistic_regression() input, you could create a data frame consisting of the model variables to use as a single input and edit your function accordingly. For example, you could use:
x <- data.frame(y=c(1, 1, 1, 0, 0, 0), k1=c(4,3,5,1,2,3), k2=c(6,7,8,5,6,3))
logistic_regression <- function(x){
  glm.out <- glm(as.formula(paste("y~", paste(colnames(x[,-1]), collapse="+"))), family=binomial(logit), data=x)
  return(summary(glm.out))
}
logistic_regression(x)
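Since the function is ultimately called from Python through rpy2, a rough sketch of wiring the revised single-data-frame version up from the Python side (rpy2 3.x conversion API assumed) could look like this:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
# define the revised R function shown above
ro.r('''
logistic_regression <- function(x){
  glm.out <- glm(as.formula(paste("y~", paste(colnames(x[,-1]), collapse="+"))),
                 family=binomial(logit), data=x)
  return(summary(glm.out))
}
''')
# y plus the predictors in a single pandas DataFrame
df = pd.DataFrame({'y': [1, 1, 1, 0, 0, 0],
                   'k1': [4, 3, 5, 1, 2, 3],
                   'k2': [6, 7, 8, 5, 6, 3]})
# convert to an R data.frame and call the R function
with localconverter(ro.default_converter + pandas2ri.converter):
    r_df = ro.conversion.py2rpy(df)
print(ro.r['logistic_regression'](r_df))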
