Calculate the values of dependent variable using multivariate linear regression with numpy - python

I am trying to implement multivariate linear regression using numpy. There are several questions in this forum about that, but none seems to answer my question. I have the following independent variables (X1, X2, X3, X4) and dependent variable Y. I want to predict the values of Y'.
X1 X2 X3 X4 Y Y'
1 0 1 0 1 ? // ? -> referring this value as y'1
0 0 1 1 0 ? // ? -> referring this value as y'2
0 1 0 1 0 ? // ? -> referring this value as y'3
0 0 0 1 1 ? // ? -> referring this value as y'4
1 0 1 1 0 ? // ? -> referring this value as y'5
So, I am using numpy as:
>>> X1 = np.array([1,0,0,0,1])
>>> X2 = np.array([0,0,1,0,0])
>>> X3 = np.array([1,1,0,0,1])
>>> X4 = np.array([0,1,1,1,1])
>>> Y = np.array([1,0,0,1,0])
>>> x = np.array([X1,X2,X3,X4], np.int32)
>>> n = np.max(x.shape)
>>> X = np.vstack([np.ones(n), x]).T
>>> print(np.linalg.lstsq(X, Y)[0])
[ 2.00000000e+00 -2.22044605e-16 -1.00000000e+00 -1.00000000e+00 -1.00000000e+00]
So, I have the equation y = a + b1.x1 + b2.x2 + b3.x3 + b4.x4. From the above, I have the values of a, b1, b2, b3, b4.
So, how do I calculate the values of Y' (y'1, y'2, y'3, y'4, y'5) from the above coefficient values?

The point of OLS is to fit parameters based on data you have and use that to predict a new Y. Try ...
>>> import numpy as np
>>> X = np.array([[1,0,1,0], [0,0,1,1], [0,1,0,1], [0,0,0,1], [1,0,1,1]])
>>> Y = np.array([1,0,0,1,0]).reshape((5,1))
>>> b = np.linalg.inv((X.T).dot(X)).dot(X.T).dot(Y)
>>> b
out [1]: array([[0.666], [-0.333], [-0.333], [0.333]])
Then use this to predict a new Y given 4 new X's. Also, if your Y's are binary (all zeros and ones), you should look at using Logistic Regression.
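To make the prediction step concrete, here is a minimal sketch that reuses the X and Y from the question. It assumes the same no-intercept model as the answer above. np.linalg.lstsq is swapped in for the explicit inverse only because it is usually the more numerically stable way to solve the same least-squares problem; x_new is a hypothetical new observation, not data from the question.

```python
import numpy as np

# Design matrix: one row per observation, columns X1..X4 (no intercept column)
X = np.array([[1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 1]], dtype=float)
Y = np.array([1, 0, 0, 1, 0], dtype=float)

# Least-squares coefficients b1..b4
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Fitted values y'1..y'5: each one is the dot product of a row of X with b
Y_pred = X @ b
print(Y_pred)

# Prediction for a hypothetical new observation (X1=1, X2=1, X3=0, X4=0)
x_new = np.array([1, 1, 0, 0], dtype=float)
y_new = x_new @ b
print(y_new)
```

To include an intercept a, as in the question, prepend a column of ones, for example with np.column_stack([np.ones(len(X)), X]). With five coefficients and only five observations, that model reproduces Y exactly, which is why the coefficients printed in the question fit the data perfectly.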

Related

How to get linear value from Dataframe

I have a simple question.
I want to get a linearly interpolated value from a dataframe (like the n-D Lookup Table block in MATLAB Simulink).
import numpy as np
import scipy as sp
import pandas as pd
X = np.array([100,200,300,400,500])
Y = np.array([1,2,3])
W = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
df = pd.DataFrame(W,index=Y,columns=X)
# 100 200 300 400 500
#1 1 1 1 1 1
#2 2 2 2 2 2
#3 3 3 3 3 3
#want function.
#Ex
#input x = 150 y= 1
#result = 1
#input x = 100 y = 1.5
#result = 1.5
Could somebody let me know if there is a function or a library for this, how I can do it, or whether there is a better way?
I just want to build a filter or lookup function from some array data, using NumPy, Pandas or SciPy if possible.
You are looking for scipy.interpolate.RectBivariateSpline:
#lib
import numpy as np
from scipy import interpolate
#input
X = np.array([100,200,300,400,500])
Y = np.array([1,2,3])
W = np.array([[1,1,1,1,1],[2,2,2,2,2],[3,3,3,3,3]])
#solution
spline = interpolate.RectBivariateSpline(X, Y, W.T, kx=1, ky=1)
#example
spline(150, 1) #returns 1
spline(100, 1.5) #returns 1.5
Let me know if it solves the issue.
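If many lookups are needed, the spline can be wrapped in a small function. The sketch below assumes the same X, Y and W as above; lookup is a hypothetical helper name, and grid=False asks the spline to evaluate (x, y) pairs pointwise instead of on the full grid.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

X = np.array([100, 200, 300, 400, 500], dtype=float)
Y = np.array([1, 2, 3], dtype=float)
W = np.array([[1, 1, 1, 1, 1],
              [2, 2, 2, 2, 2],
              [3, 3, 3, 3, 3]], dtype=float)

# kx=1, ky=1 selects linear interpolation along both axes
spline = RectBivariateSpline(X, Y, W.T, kx=1, ky=1)

def lookup(x, y):
    """Hypothetical helper: linear lookup at a single (x, y) point."""
    return float(spline(x, y, grid=False))

print(lookup(150, 1))
print(lookup(100, 1.5))

# Several (x, y) pairs at once, evaluated pointwise
xs = np.array([150, 100, 250], dtype=float)
ys = np.array([1, 1.5, 2.5], dtype=float)
values = spline(xs, ys, grid=False)
print(values)
```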

Get non empty values of dataframe as a single column

I have a sparse dataframe and would like to get all non-empty values as a single column. See the image that I made up to illustrate the problem. I somehow managed to solve it using the Python code below. However, I feel there might be a better, simpler or more efficient way to solve it.
import pandas as pd
list1 = ["x1","x2","?","?","?","?"]
list2 = ["?","?","y1","y2","?","?"]
list3 = ["?","?","?","?","z1","z2"]
df_sparse = pd.DataFrame({"A":list1,"B":list2,"C":list3})
values_vect = []
for col in df_sparse.columns:
    values = [i for i in list(df_sparse[col]) if i != "?"]
    values_vect.extend(values)
df_sparse["D"] = pd.DataFrame(values_vect,columns=["D"])
display(df_sparse)
import numpy as np
df_sparse["D"] = df_sparse.replace("?", np.nan).ffill(axis="columns").iloc[:, -1]
This works in three steps: replace the "?"s with NaNs; forward fill the values along the columns so that non-NaN values slide to the rightmost position; then take the rightmost column, which is where the values end up. The result:
>>> df_sparse
A B C D
0 x1 ? ? x1
1 x2 ? ? x2
2 ? y1 ? y1
3 ? y2 ? y2
4 ? ? z1 z1
5 ? ? z2 z2
Using masking, stack and groupby.last:
df_sparse['D'] = (df_sparse
    .where(df_sparse.ne('?'))
    .stack()
    .groupby(level=0).last()
)
print(df_sparse)
Output:
A B C D
0 x1 ? ? x1
1 x2 ? ? x2
2 ? y1 ? y1
3 ? y2 ? y2
4 ? ? z1 z1
5 ? ? z2 z2
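Both answers above produce a column D aligned row by row with the original frame. If the goal is instead a standalone column of the non-empty values, as the loop in the question builds, a minimal NumPy sketch could look like this. It assumes "?" is the only placeholder and that values should be collected column by column.

```python
import pandas as pd

df_sparse = pd.DataFrame({
    "A": ["x1", "x2", "?", "?", "?", "?"],
    "B": ["?", "?", "y1", "y2", "?", "?"],
    "C": ["?", "?", "?", "?", "z1", "z2"],
})

# Flatten column by column (Fortran order), then drop the placeholders
flat = df_sparse.to_numpy(dtype=object).ravel(order="F")
values = pd.Series(flat[flat != "?"], name="D")
print(values.tolist())
```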

How to create a new dataframe and add new variables in Python?

I created two random variables (x and y) with certain properties. Now, I want to create a dataframe from scratch out of these two variables. Unfortunately, what I type seems to be wrong. How can I do this correctly?
# creating variable x with Bernoulli distribution
import pandas as pd
from scipy.stats import bernoulli, norm
x = bernoulli.rvs(size=100,p=0.6)
# form a column vector (n, 1)
x = x.reshape(-1, 1)
print(x)
# creating variable y with normal distribution
y = norm.rvs(size=100,loc=0,scale=1)
# form a column vector (n, 1)
y = y.reshape(-1, 1)
print(y)
# creating a dataframe from scratch and assigning x and y to it
df = pd.DataFrame()
df.assign(y = y, x = x)
df
There are a lot of ways to go about this.
According to the documentation, pd.DataFrame accepts an ndarray (structured or homogeneous), Iterable, dict, or DataFrame. Your issue is that x and y are 2D numpy arrays
>>> x.shape
(100, 1)
where it expects either one 1d array per column or a single 2d array.
One way would be to stack the array into one before calling the DataFrame constructor
>>> pd.DataFrame(np.hstack([x,y]))
0 1
0 0.0 0.764109
1 1.0 0.204747
2 1.0 -0.706516
3 1.0 -1.359307
4 1.0 0.789217
.. ... ...
95 1.0 0.227911
96 0.0 -0.238646
97 0.0 -1.468681
98 0.0 1.202132
99 0.0 0.348248
The alternatives mostly revolve around calling np.ndarray.flatten(), e.g. to construct a dict
>>> pd.DataFrame({'x': x.flatten(), 'y': y.flatten()})
x y
0 0 0.764109
1 1 0.204747
2 1 -0.706516
3 1 -1.359307
4 1 0.789217
.. .. ...
95 1 0.227911
96 0 -0.238646
97 0 -1.468681
98 0 1.202132
99 0 0.348248
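A self-contained sketch of the whole flow, under two assumptions: the reshape step is skipped, since rvs already returns 1D arrays, and a fixed random_state is used only to make the run reproducible. It also shows a second pitfall in the question's code: df.assign returns a new DataFrame instead of modifying df in place, so its result has to be kept.

```python
import pandas as pd
from scipy.stats import bernoulli, norm

# 1D samples; random_state is fixed only for reproducibility
x = bernoulli.rvs(p=0.6, size=100, random_state=0)
y = norm.rvs(loc=0, scale=1, size=100, random_state=0)

# One 1D array per column
df = pd.DataFrame({"x": x})

# assign does not modify df in place, so bind the returned frame
df = df.assign(y=y)
print(df.head())
```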

Multi-column interpolation in Python

I want to use scipy or pandas to interpolate on a table like this one:
df = pd.DataFrame({'x':[1,1,1,2,2,2],'y':[1,2,3,1,2,3],'z':[10,20,30,40,50,60] })
df =
x y z
0 1 1 10
1 1 2 20
2 1 3 30
3 2 1 40
4 2 2 50
5 2 3 60
I want to be able to interpolate for a x value of 1.5 and a y value of 2.5 and obtain a 40.
The process would be:
Starting from the first interpolation parameter (x), find the values that surround the target value. In this case the target is 1.5 and the surrounding values are 1 and 2.
Interpolate in y for a target of 2.5 considering x=1. In this case between rows 1 and 2, obtaining a 25
Interpolate in y for a target of 2.5 considering x=2. In this case between rows 4 and 5, obtaining a 55
Interpolate the values from the previous steps to the target x value. In this case I have 25 for x=1 and 55 for x=2. The interpolated value for 1.5 is 40
The order in which interpolation is to be performed is fixed and the data will be correctly sorted.
I've found this question but I'm wondering if there is a standard solution already available in those libraries.
You can use scipy.interpolate.interp2d:
import scipy.interpolate
f = scipy.interpolate.interp2d(df.x, df.y, df.z)
f([1.5], [2.5])
[40.]
The first line creates an interpolation function z = f(x, y) using three arrays for x, y, and z. The second line uses this function to interpolate for z given values for x and y. The default is linear interpolation.
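Note that interp2d has been removed from recent SciPy releases. A possible replacement is sketched below with scipy.interpolate.RegularGridInterpolator, which is a swapped-in technique rather than the method of the answer above. It assumes the table forms a complete rectangular grid, so that it can be pivoted into a 2D array of z values indexed by x and y.

```python
import pandas as pd
from scipy.interpolate import RegularGridInterpolator

df = pd.DataFrame({'x': [1, 1, 1, 2, 2, 2],
                   'y': [1, 2, 3, 1, 2, 3],
                   'z': [10, 20, 30, 40, 50, 60]})

# Rows are x values, columns are y values, cells are z
grid = df.pivot(index='x', columns='y', values='z')

interp = RegularGridInterpolator(
    (grid.index.to_numpy(dtype=float), grid.columns.to_numpy(dtype=float)),
    grid.to_numpy(dtype=float),
    method='linear',
)

# Each query point is an (x, y) pair
result = interp([[1.5, 2.5]])
print(result)
```

On a rectangular grid, linear interpolation in y followed by x gives the same value as the step-by-step procedure described in the question.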
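Define your own interpolate function. Note that it averages the z values of the surrounding grid points, so it agrees with linear interpolation only when the target lies at the midpoint of a cell, as it does here:
def interpolate(x, y, df):
    cond = df.x.between(int(x), int(x) + 1) & df.y.between(int(y), int(y) + 1)
    return df.loc[cond].z.mean()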
interpolate(1.5,2.5,df)
40.0

Equivalent of adding a value in a new row/column to numpy that works like R's data.frame

In R I can do:
> y = c(2,3)
> x = c(4,5)
> z = data.frame(x,y)
> z[3,3]<-6
> z
x y V3
1 4 2 NA
2 5 3 NA
3 NA NA 6
R automatically fills the empty cells with NA.
If I use numpy.insert from numpy, numpy throws by default an error:
import numpy
y = [2,3]
x = [4,5]
z = numpy.array([y, x])
z = numpy.insert(z, 3, 6, 3)
IndexError: axis 3 is out of bounds for an array of dimension 2
Is there a way to insert values in a way that works similar to R in numpy?
numpy is more of a replacement for R's matrices, and not so much for its data frames. You should consider using python's pandas library for this. For example:
In [1]: import pandas
In [2]: y = pandas.Series([2,3])
In [3]: x = pandas.Series([4,5])
In [4]: z = pandas.DataFrame([x,y])
In [5]: z
Out[5]:
0 1
0 4 5
1 2 3
In [19]: z.loc[3,3] = 6
In [20]: z
Out[20]:
0 1 3
0 4 5 NaN
1 2 3 NaN
3 NaN NaN 6
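In numpy you need to initialize an array with the appropriate size:
z = numpy.empty((3, 3))  # the shape must be passed as a tuple
z.fill(numpy.nan)
z[:2, 0] = x
z[:2, 1] = y
z[2, 2] = 6  # indices are 0-based, so the third row and column have index 2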
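To get closer to R's behaviour of growing the table on assignment, the same idea can be wrapped in a function. This is only a sketch: set_padded is a hypothetical helper, not part of numpy, and it assumes a 2D numeric array that may be converted to float so that NaN can mark the empty cells.

```python
import numpy as np

def set_padded(arr, row, col, value):
    """Return a float copy of arr, grown with NaN if needed, with arr[row, col] = value."""
    arr = np.asarray(arr, dtype=float)
    n_rows = max(arr.shape[0], row + 1)
    n_cols = max(arr.shape[1], col + 1)
    out = np.full((n_rows, n_cols), np.nan)
    out[:arr.shape[0], :arr.shape[1]] = arr  # copy the existing values
    out[row, col] = value
    return out

# Columns x and y, as in the R data.frame
z = np.column_stack([[4, 5], [2, 3]])
z = set_padded(z, 2, 2, 6)  # 0-based equivalent of z[3,3] <- 6 in R
print(z)
```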
Looking at the raised error, it is possible to understand why it occurred:
you are trying to insert values along an axis that does not exist in z.
You can fix it by doing
import numpy as np
y = [2,3]
x = [4,5]
array = np.array([y, x])
z = np.insert(array, 1, [3,6], axis=1)
The interface is quite different from R's. If you are using IPython,
you can easily access the documentation for any numpy function, in this case
np.insert, by doing:
help(np.insert)
which gives you the function signature, explains each parameter used to call it and provides
some examples.
Alternatively, you could do
import numpy as np
x = [4,5]
y = [2,3]
array = np.array([y,x])
z = np.array([3,6])  # an array, so it can be indexed with np.newaxis below
new_array = np.vstack([array.T, z]).T # or, as below
# new_array = np.hstack([array, z[:, np.newaxis]])
Also, take a look at the Pandas module. It provides
an interface similar to what you asked, implemented with numpy.
With pandas you could do something like:
import pandas as pd
data = {'x':[4,5], 'y':[2,3]}
dataframe = pd.DataFrame(data)
dataframe['z'] = [3,6]
which gives the nice output:
x y z
0 4 2 3
1 5 3 6
If you want a more R-like experience within python, I can highly recommend pandas, which is a higher-level numpy based library, which performs operations of this kind.
