Multiple Linear Regression using Python - python

There are a few existing topics on this, but they rely on deprecated packages (pandas etc.). Suppose I want to predict a variable w from the variables x, y and z using multiple linear regression. There are quite a few solutions that produce the coefficients, but I'm not sure how to use them. So, in pseudocode:
import numpy as np
from scipy import stats
w = np.array((1,2,3,4,5,6,7,8,9,10)) # Time series I'm trying to predict
x = np.array((1,3,6,1,4,6,8,9,2,2)) # The three variables to predict w
y = np.array((2,7,6,1,5,6,3,9,5,7))
z = np.array((1,3,4,7,4,8,5,1,8,2))
def model(w,x,y,z):
    # do something!
    return guess # where guess is some 10 element array formed
                 # using multiple linear regression of x,y,z
guess = model(w,x,y,z)
r = stats.pearsonr(w,guess) # To see how good guess is
Hopefully this makes sense as I'm new to MLR. There is probably a package in scipy that does all this so any help welcome!

You can use the normal equation method.
Let your equation be of the form: ax + by + cz + d = w
Then
import numpy as np
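# Design matrix: one column each for x, y and z, plus a column of ones for the intercept d
x = np.asarray([[1,3,6,1,4,6,8,9,2,2],
                [2,7,6,1,5,6,3,9,5,7],
                [1,3,4,7,4,8,5,1,8,2],
                [1,1,1,1,1,1,1,1,1,1]]).T
# Target values (w in the question)
y = np.asarray([1,2,3,4,5,6,7,8,9,10])
# Normal equation: coefficients = pinv(x^T x) x^T y
a,b,c,d = np.linalg.pinv((x.T).dot(x)).dot(x.T.dot(y))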
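To turn the coefficients into the guess array the question asks for, multiply the design matrix by the coefficient vector. A minimal sketch, assuming the question's w, x, y and z; it uses np.linalg.lstsq, which solves the same least-squares problem as the normal equation, and np.corrcoef in place of stats.pearsonr (both are substitutions made here, not part of the answer above):

```python
import numpy as np

w = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)  # target
x = np.array([1, 3, 6, 1, 4, 6, 8, 9, 2, 2], dtype=float)   # predictors
y = np.array([2, 7, 6, 1, 5, 6, 3, 9, 5, 7], dtype=float)
z = np.array([1, 3, 4, 7, 4, 8, 5, 1, 8, 2], dtype=float)


def model(w, x, y, z):
    """Fit w ~ a*x + b*y + c*z + d and return (fitted values, coefficients)."""
    # One column per predictor plus a column of ones for the intercept d
    A = np.column_stack([x, y, z, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, w, rcond=None)
    # The guess is the design matrix times the coefficients
    return A @ coeffs, coeffs


guess, coeffs = model(w, x, y, z)
r = np.corrcoef(w, guess)[0, 1]  # Pearson correlation between w and guess
print(coeffs, r)
```

Each element of guess is a*x[i] + b*y[i] + c*z[i] + d, so the same coefficients can be applied to new x, y, z values in the same way.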

I think I've figured it out now. If anyone could confirm that this produces the correct results, that'd be great!
import numpy as np
from scipy import stats
# What I'm trying to predict
y = [-6,-5,-10,-5,-8,-3,-6,-8,-8]
# Array that stores two predictors in columns
x = np.array([[-4.95,-4.55],[-10.96,-1.08],[-6.52,-0.81],[-7.01,-4.46],[-11.54,-5.87],[-4.52,-11.64],[-3.36,-7.45],[-2.36,-7.33],[-7.65,-10.03]])
# Fit linear least squares and get regression coefficients
# (rcond=None uses the machine-precision cutoff and avoids the FutureWarning in NumPy 1.x)
beta_hat = np.linalg.lstsq(x,y,rcond=None)[0]
print(beta_hat)
# To store my best guess
estimate = np.zeros((9))
for i in range(0,9):
    # y = x1b1 + x2b2
    estimate[i] = beta_hat[0]*x[i,0]+beta_hat[1]*x[i,1]
# Correlation between best guess and real values
print(stats.pearsonr(estimate,y))
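One way to check this: the loop computes the matrix product x @ beta_hat, and because x has no column of ones, the fitted model has no intercept. The sketch below illustrates both points under those assumptions; the names x_icpt and beta_icpt are introduced here, and np.corrcoef stands in for stats.pearsonr.

```python
import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8], dtype=float)
x = np.array([[-4.95, -4.55], [-10.96, -1.08], [-6.52, -0.81],
              [-7.01, -4.46], [-11.54, -5.87], [-4.52, -11.64],
              [-3.36, -7.45], [-2.36, -7.33], [-7.65, -10.03]])

# Fit without an intercept, as in the post
beta_hat = np.linalg.lstsq(x, y, rcond=None)[0]

# The explicit loop from the post
loop_estimate = np.zeros(len(y))
for i in range(len(y)):
    loop_estimate[i] = beta_hat[0] * x[i, 0] + beta_hat[1] * x[i, 1]

# The same computation as a single matrix product
estimate = x @ beta_hat

# Variant with an intercept: append a column of ones to the predictors
x_icpt = np.column_stack([x, np.ones(len(y))])
beta_icpt = np.linalg.lstsq(x_icpt, y, rcond=None)[0]
estimate_icpt = x_icpt @ beta_icpt

# Sum of squared errors for both fits
sse_no_icpt = np.sum((y - estimate) ** 2)
sse_icpt = np.sum((y - estimate_icpt) ** 2)
print(sse_no_icpt, sse_icpt, np.corrcoef(estimate_icpt, y)[0, 1])
```

Adding the intercept column can only lower (or leave unchanged) the sum of squared errors, since the no-intercept model is a special case of it.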

Related

Function minimization with error in Python

I have a function of the form: (y1,y2,y3)=x*(a1,a2,a3)+(b1,b2,b3), where x,y1,y2,y3 are measured values and a1,a2,a3,b1,b2,b3 are parameters I want to fit for. I also have some measurement errors associated with x,y1,y2,y3. I would like to fit this function for a1,a2,a3,b1,b2,b3, and obtain an error on the values of each of these parameters, while taking into account the errors on x,y1,y2,y3. How can I do this? I looked into scipy and lmfit, but I didn't really find something that allows me to both pass the errors on the measured points and return the errors on the fitted parameters. Here is some code I have for the data I need to fit:
import numpy as np
x = np.array([1,2,3,4,5])
err_x = np.array([0.1,0.1,0.2,0.2,0.1])
y = []
for i in range(len(x)):
    y = y + [x[i]*np.array([3,4,5])+np.array([-2,3,1])]
y = np.array(y)
err_y = np.array([[0.2,0.2,0.1],[0.2,0.2,0.1],[0.2,0.2,0.1],[0.2,0.2,0.1],[0.2,0.2,0.1]])

Matrix operations using parameters modified through moving horizon estimation

I've recently started trying out moving horizon estimation with GEKKO. My specified manipulated variables are used in a heat balance equation within my model, and I am having some issues with the matrix operations in the model.
Example code:
from gekko import GEKKO
import numpy as np
#creating a sample array of input values
nt = 51
u_meas = np.zeros(nt)
u_meas[3:10] = 1.0
u_meas[10:20] = 2.0
u_meas[20:40] = 0.5
u_meas[40:] = 3.0
p = GEKKO(remote=False)
p.time = np.linspace(0,10,nt)
n = 1 #process model order
#designating u as my input, and that I'm going to be using these measurements to estimate my parameters with MHE
p.u = p.MV(value=u_meas)
p.u.FSTATUS=1
#parameters I'm looking to modulate
p.K = p.FV(value=1, lb = 1, ub = 3) #gain
p.tau = p.FV(value=5, lb = 1, ub = 10) #time constant
p.x = [p.Intermediate(p.u)]
#constants within the model that do not change
X_O2 = 0.5
X_SiO2 = 0.25
X_N2 = 0.1
m_feed = 100
#creating an array with my feed separated into components. This creates a 1D array with the individual feed streams of my components.
mdot_F_i = np.tile(m_feed,3)*np.array([X_O2, X_SiO2, X_N2])
#at this point, I want to add my MV values to the end of my component feed array for later heat and mass balance equations. Normally, in my previous model without MHE, I would put
mdot_c_i = np.concatenate(mdot_F_i, x, (other MV variables after))
However, now that u is a specified MV in GEKKO, and not a set value, I get an error at the mdot_c_i line that says that the array at index 0 has 1 dimension, and the array at index 1 has 2 dimensions.
I'm guessing that I have to specify mdot_c_i as an intermediate variable within Gekko. I've tried a couple different variations, alternately specifying mdot_c_i as an intermediate and trying to use only the values of the MV; however, I keep getting that error.
Has anyone experienced a similar issue?
Thank you!
You can resolve this by using np.append() instead of np.concatenate(). Try something like:
mdot_c_i = np.append(mdot_F_i, p.u)
Here is a minimal, complete example if you'd like to try it.
import numpy as np
from gekko import GEKKO
m = GEKKO(remote=False)
x = m.Array(m.Var,3,lb=-10,ub=10)
y = m.Var(5,lb=-5,ub=5)
z = np.append(x,y)
m.Minimize(np.dot([1,1,-1,1],z))
m.solve(disp=False)
print([zi.value[0] for zi in z])
# solution: [-10.0, -10.0, 10.0, -5.0]
Gekko variables need to be stored as objects, not as numerical values. The error may be because the np.concatenate() function is trying to access the length of the Gekko manipulated variable data p.u.value to concatenate those values instead of concatenating p.u as an object.
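The difference between the two NumPy calls can be seen without Gekko. With the default axis=None, np.append flattens both inputs before joining them, while np.concatenate requires every input to have the same number of dimensions. A small sketch with plain arrays; extra is a hypothetical 2-D stand-in, not a Gekko object:

```python
import numpy as np

feed = np.array([50.0, 25.0, 10.0])  # 1-D array, like mdot_F_i
extra = np.array([[1.0, 2.0]])       # 2-D stand-in for the second input

# np.concatenate refuses to join arrays with different numbers of dimensions
try:
    np.concatenate((feed, extra))
    concat_failed = False
except ValueError as err:
    concat_failed = True
    print("concatenate failed:", err)

# np.append (axis=None) flattens both inputs first, so the join succeeds
combined = np.append(feed, extra)
print(combined.shape)
```

Whether this is exactly what happens inside Gekko depends on how NumPy converts the MV object to an array, so treat it as an illustration of the NumPy behaviour only.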

The shape variable in pymc3.DensityDist does not work properly

I am trying to define a multivariate custom distribution through pymc3.DensityDist(); however, I keep getting the following error that dimensions do not match:
"LinAlgError: 0-dimensional array given. Array must be two-dimensional"
I have already seen https://github.com/pymc-devs/pymc3/issues/535 but I could not find the answer to my question. Just for clarity, here is my simple example
import numpy as np
import pymc3 as pm
def pdf(x):
    y = 0
    print(x)
    sigma = np.identity(2)
    isigma = sigma
    mu = np.array([[1,2],[3,4]])
    for i in range(2):
        x0 = x- mu[i,:]
        xsinv = np.linalg.multi_dot([x0,isigma,x0])
        y = y + np.exp(-0.5*xsinv)
    return y
logp = lambda x: np.log(pdf(x))
with pm.Model() as model:
    pm.DensityDist('x',logp, shape=2)
    step = pm.Metropolis(tune=False, S=np.identity(2))
    trace = pm.sample(100000, step=step, chain=1, tune=0,progressbar=False)
result = trace['x']
In this simple code I want to define an unnormalized pdf, which is the sum of two unnormalized normal densities, and draw samples from it with the Metropolis algorithm.
Thanks,
Try replacing NumPy with Theano in the following lines (with import theano.tensor as tt):
xsinv = tt.dot(tt.dot(x0, isigma), x0)
y = y + tt.exp(-0.5 * xsinv)
As a side note, try using NUTS instead of Metropolis by letting PyMC3 choose the sampling method for you; just do
trace = pm.sample(1000)
For future reference you can also ask questions here

Single Component Metropolis-Hastings

So, let's say I have the following 2-dimensional target distribution that I would like to sample from (a mixture of bivariate normal distributions) -
import numba
import numpy as np
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
%matplotlib inline
def targ_dist(x):
    target = (stats.multivariate_normal.pdf(x,[0,0],[[1,0],[0,1]])+stats.multivariate_normal.pdf(x,[-6,-6],[[1,0.9],[0.9,1]])+stats.multivariate_normal.pdf(x,[4,4],[[1,-0.9],[-0.9,1]]))/3
    return target
and the following proposal distribution (a bivariate random walk) -
def T(x,y,sigma):
    return stats.multivariate_normal.pdf(y,x,[[sigma**2,0],[0,sigma**2]])
The following is the Metropolis Hastings code for updating the "entire" state in every iteration -
#Initialising
n_iter = 30000
# tuning parameter, i.e. standard deviation of the proposal distribution
sigma = 2
# initial state
X = stats.uniform.rvs(loc=-5, scale=10, size=2, random_state=None)
# count number of acceptances
accept = 0
# store the samples
MHsamples = np.zeros((n_iter,2))
# MH sampler
for t in range(n_iter):
    # proposals
    Y = X+stats.norm.rvs(0,sigma,2)
    # accept or reject
    u = stats.uniform.rvs(loc=0, scale=1, size=1)
    # acceptance probability
    r = (targ_dist(Y)*T(Y,X,sigma))/(targ_dist(X)*T(X,Y,sigma))
    if u < r:
        X = Y
        accept += 1
    MHsamples[t] = X
However, I would like to update "per component" (i.e. component-wise updating) in every iteration. Is there a simple way of doing this?
Thank you for your help!
From the tone of your question, I assume you are looking for performance improvements.
Monte Carlo algorithms are quite compute-intensive. You will get better results if you implement the algorithm at a lower level than an interpreted language like Python, e.g. by writing a C extension.
There are also implementations available for Python (PyStan, PyMC3).
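On the component-wise updating itself: a common approach is to loop over the coordinates within each iteration, propose a change to one coordinate at a time, and accept or reject it before moving on to the next. A minimal sketch, assuming the same mixture target and a Gaussian random-walk proposal. The names mvn_pdf and componentwise_mh are introduced here for illustration (mvn_pdf is a NumPy-only stand-in for stats.multivariate_normal.pdf), and because the proposal is symmetric, the T terms cancel in the acceptance ratio:

```python
import numpy as np


def mvn_pdf(x, mean, cov):
    """Density of a bivariate normal at the point x (illustrative helper)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))


def targ_dist(x):
    """Equal-weight mixture of the three bivariate normals from the question."""
    return (mvn_pdf(x, [0, 0], [[1, 0], [0, 1]])
            + mvn_pdf(x, [-6, -6], [[1, 0.9], [0.9, 1]])
            + mvn_pdf(x, [4, 4], [[1, -0.9], [-0.9, 1]])) / 3


def componentwise_mh(n_iter, sigma, x0, rng):
    """Metropolis sampler that updates one coordinate at a time."""
    x = np.array(x0, dtype=float)
    p_x = targ_dist(x)  # density at the current state
    samples = np.zeros((n_iter, x.size))
    accepted = 0
    for t in range(n_iter):
        for j in range(x.size):
            # Propose a move in coordinate j only; the others stay fixed
            y = x.copy()
            y[j] += rng.normal(0.0, sigma)
            p_y = targ_dist(y)
            # Symmetric proposal, so the ratio of target densities suffices
            if rng.uniform() < p_y / p_x:
                x, p_x = y, p_y
                accepted += 1
        samples[t] = x
    return samples, accepted / (n_iter * x.size)


rng = np.random.default_rng(0)
samples, acc_rate = componentwise_mh(2000, 2.0, [0.0, 0.0], rng)
print(samples.shape, acc_rate)
```

The coordinates are visited in a fixed order here; choosing the coordinate at random in each step is another common variant.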

Issues with 2D-Interpolation in Scipy

In my application, the data is sampled on a distorted grid, and I would like to resample it to a nondistorted grid. In order to test this, I wrote this program with exemplary distortions and a simple function as data:
from __future__ import division
import numpy as np
import scipy.interpolate as intp
import pylab as plt
# Defining some variables:
quadratic = -3/128
linear = 1/16
pn = np.poly1d([quadratic, linear,0])
pixels_x = 50
pixels_y = 30
frame = np.zeros((pixels_x,pixels_y))
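# Only the first pixels_y entries are used below; max(..., 0) keeps np.linspace from
# being called with a negative sample count, which current NumPy rejects
x_width= np.concatenate((np.linspace(8,7.8,57) , np.linspace(7.8,8,max(pixels_y-57,0))))
def data(x,y):
    z = y*(np.exp(-(x-5)**2/3) + np.exp(-(x)**2/5) + np.exp(-(x+5)**2))
    return(z)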
# Generating grid coordinates
yt = np.arange(380,380+pixels_y*4,4)
xt = np.linspace(-7.8,7.8,pixels_x)
X, Y = np.meshgrid(xt,yt)
Y=Y.T
X=X.T
Y_m = np.zeros((pixels_x,pixels_y))
X_m = np.zeros((pixels_x,pixels_y))
# generating distorted grid coordinates:
for i in range(pixels_y):
    Y_m[:,i] = Y[:,i] - pn(xt)
    X_m[:,i] = np.linspace(-x_width[i],x_width[i],pixels_x)
# Sample data:
for i in range(pixels_y):
    for j in range(pixels_x):
        frame[j,i] = data(X_m[j,i],Y_m[j,i])
Y_m = Y_m.flatten()
X_m = X_m.flatten()
frame = frame.flatten()
##
Y = Y.flatten()
X = X.flatten()
ipf = intp.interp2d(X_m,Y_m,frame)
interpolated_frame = ipf(xt,yt)
At this point, I have two questions:
The code works, but I get the following warning:
Warning: No more knots can be added because the number of B-spline coefficients
already exceeds the number of data points m. Probably causes: either
s or m too small. (fp>s)
kx,ky=1,1 nx,ny=54,31 m=1500 fp=0.000006 s=0.000000
Also, some interpolation artifacts appear, and I assume that they are related to the warning - Do you guys know what I am doing wrong?
For my actual applications, the frames need to be around 500*100, but when doing this, I get a MemoryError - Is there something I can do to help that, apart from splitting the frame into several parts?
Thanks!
This problem is most likely related to the use of bisplrep and bisplev within interp2d. The docs mention that interp2d uses a smoothing factor of s=0.0 and that bisplrep and bisplev should be used directly if more control over s is needed. The related docs mention that s should be found in the range (m-sqrt(2*m), m+sqrt(2*m)), where m is the number of points used to construct the splines. I had a similar problem and solved it by using bisplrep and bisplev directly, where s is an optional argument.
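To make the suggested range concrete for the case in the question (m=1500 data points, as reported in the warning), the bounds can be computed directly. A small sketch; smoothing_range is a helper name introduced here, and whether a value in this range suits the data also depends on the weights passed to bisplrep:

```python
import math


def smoothing_range(m):
    """Return the interval (m - sqrt(2*m), m + sqrt(2*m)) for m data points."""
    half_width = math.sqrt(2 * m)
    return m - half_width, m + half_width


lo, hi = smoothing_range(1500)
print(round(lo, 2), round(hi, 2))  # 1445.23 1554.77
```

By comparison, interp2d's s=0.0 asks for an exact interpolating spline, which is what pushes the knot count up to the limit mentioned in the warning.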
For 2D interpolation, griddata is solid, local, and fast.
Take a look at problem-with-2d-interpolation-in-scipy-non-rectangular-grid on SO.
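A minimal sketch of resampling scattered points onto a regular grid with scipy.interpolate.griddata. The random sample locations and the planar test function are assumptions chosen so the result is easy to verify; they are not the data from the question:

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Irregularly placed sample locations and the values measured there
pts = rng.uniform(-1.0, 1.0, size=(400, 2))
vals = pts[:, 0] + 2.0 * pts[:, 1]  # a plane, which linear interpolation reproduces

# Regular target grid, kept inside the area covered by the samples
gx, gy = np.meshgrid(np.linspace(-0.5, 0.5, 11), np.linspace(-0.5, 0.5, 7))

# Piecewise-linear interpolation on a triangulation of the sample points
resampled = griddata(pts, vals, (gx, gy), method="linear")
print(resampled.shape)
```

With the flattened arrays from the question, the equivalent call would be along the lines of griddata((X_m, Y_m), frame, (X, Y), method="linear"). Target points outside the convex hull of the samples come back as NaN unless fill_value is set.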
Unless you really need spline-based interpolation, you might want to look at the interp method in basemap, mpl_toolkits.basemap.interp:
http://matplotlib.sourceforge.net/basemap/doc/html/api/basemap_api.html
