I want to do OLS fitting on this model using Python

a is a binary {0, 1} variable, and X has dimension 3 (the first column is the all-ones vector, so there are 2 predictors).
If I write the expression differently, it becomes:
y = X*b0 + a*X*(b1 - b0) + e
  = b00 + b01*X1 + b02*X2 + (b10 - b00)*a + (b11 - b01)*a*X1 + (b12 - b02)*a*X2 + e
What I am interested in is the interaction between a and X, so I want to know all the beta values.
How can I code this in Python?
I made newX = (1, X1, X2, a, aX1, aX2) and, using this,
model = ols(formula='Y ~ X1 + X2 + a + aX1 + aX2', data=data).fit()
but I think this would become unwieldy as the input dimension grows.
I searched and found the 'weights' option in R,
lm(y ~ x1 + x2, weights = (I = a))
Should I look for something similar in Python and use that?
Which way is right? If there is another way, please let me know.
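One way to avoid typing every interaction column by hand is patsy's formula syntax in statsmodels, where '*' expands to main effects plus interactions. A minimal sketch, assuming your DataFrame data has columns Y, X1, X2 and a:

from statsmodels.formula.api import ols

# 'a * (X1 + X2)' expands to a + X1 + X2 + a:X1 + a:X2 (plus the
# intercept), i.e. exactly the six-term model written out above,
# and it scales to more predictors without listing each term.
model = ols(formula='Y ~ a * (X1 + X2)', data=data).fit()
print(model.params)  # all betas, including the interaction terms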

What exactly is coef_ from sklearn LinearRegression, and how do I interpret a formula from it?

When I use LinearRegression in sklearn, I would do
import numpy as np
from sklearn.linear_model import LinearRegression

m = 100
X = 6*np.random.rand(m, 1) - 3
y = 0.5*X**2 + X + 2 + np.random.randn(m, 1)
lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_pred_1 = lin_reg.predict(X)
y_pred_1 = [row[0] for row in y_pred_1]  # flatten the (m, 1) predictions to a list
and when I plot (X, y) and (X, y_pred_1) it seems to be correct.
I wanted to create a formula for the line of best fit:
y = (lin_reg.coef_)x + lin_reg.intercept_
I manually inserted values into the formula built from coef_ and intercept_ and compared the result to the value from lin_reg.predict(value); they are the same, so lin_reg.predict does in fact use the formula above.
My problem is: how do I create a formula for simple polynomial regression?
I would do
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly_2 = poly_features.fit_transform(X)
poly_reg_2 = LinearRegression()
poly_reg_2.fit(X_poly_2, y)
then poly_reg_2.coef_ gives me array([[0.93189329, 0.43283304]]) and poly_reg_2.intercept_ = array([2.20637695]).
Since it is "simple" polynomial regression, it should look something like
y = x^2 + x + b, where both terms use the same variable x.
From poly_reg_2.coef_, which coefficient belongs to x^2 and which does not?
Thanks to https://www.youtube.com/watch?v=Hwj_9wMXDVo I gained some insight and found out how to interpret the formula for polynomial regression.
So poly_reg_2.coef_ = array([[0.93189329, 0.43283304]]).
Simple linear regression looks like
y = b + m1*x
2nd-degree polynomial regression looks like
y = b + m1*x + m2*(x^2)
and 3rd-degree:
y = b + m1*x + m2*(x^2) + m3*(x^3)
and so on. So in my case the two coefficients are just m1 and m2, in that order.
My formula finally becomes:
y = b + 0.93189329*x + 0.43283304*(x^2)
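As a quick check, a minimal sketch reusing the fitted objects from above: evaluating the formula by hand should reproduce predict().

import numpy as np

# Evaluate y = b + m1*x + m2*x^2 manually for one value.
x0 = 1.5
manual = (poly_reg_2.intercept_[0]
          + poly_reg_2.coef_[0, 0]*x0
          + poly_reg_2.coef_[0, 1]*x0**2)
auto = poly_reg_2.predict(poly_features.transform([[x0]]))[0, 0]
print(np.isclose(manual, auto))  # True if the interpretation is right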

Modelling bivariate data

I am having trouble modelling some data in python. I want to improve the very crude method that I am currently using:
import numpy as np

x = np.linspace(-1, 1, 130)
y = np.linspace(0, 0.5, 113)
data = np.random.rand(130, 113)
X, Y = np.meshgrid(x,y)
X1 = X.flatten()
Y1 = Y.flatten()
Z1 = data.flatten()
With this data I am trying to fit a model:
A2 = np.array([X1**3, Y1**3, (X1**2)*Y1, (Y1**2)*X1,
               X1**2, Y1**2, X1*Y1, X1, Y1, X1*0 + 1]).T
c2, r2, rank2, s2 = np.linalg.lstsq(A2, Z1)
tst_z2 = (c2[0]*(X**3) + c2[1]*(Y**3) + c2[2]*((X**2)*Y)
          + c2[3]*((Y**2)*X) + c2[4]*(X**2) + c2[5]*(Y**2)
          + c2[6]*(X*Y) + c2[7]*X + c2[8]*Y + c2[9])
The model I am using is a standard bivariate cubic polynomial where I have typed out all of the terms needed. If I want to use quartic or higher order polynomials and move to data that has three or more independent variables then this method is going to quickly become unusable.
Is there a way to streamline/generalise this method without having to write out all of the terms so that I can move to higher order polynomials?
For a bit more context on the problem, I am trying to produce a model that fits this data:
The red-orange-ish surface is my attempt at using a bivariate cubic polynomial to fit the data. If someone knows of a better way to do this I would very much appreciate the help.
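One way to generalise this, sketched here on the assumption that scikit-learn is acceptable in your setup: PolynomialFeatures enumerates every monomial up to a given degree for any number of variables, so no term has to be typed out by hand.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Stack the flattened coordinates into an (n_samples, n_vars) array;
# more independent variables just means more columns here.
coords = np.column_stack([X1, Y1])

# degree=3 reproduces the ten cubic terms above; raise it for quartic etc.
poly = PolynomialFeatures(degree=3, include_bias=True)
A = poly.fit_transform(coords)

c, res, rank, sv = np.linalg.lstsq(A, Z1, rcond=None)
tst_z = (A @ c).reshape(X.shape)  # fitted surface back on the grid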

Plotting horizontal hyperbola/circle using fsolve, numpy, and matplotlib

I was recently trying to plot a nonlinear decision boundary, and the function ended up being a partially horizontal hyperbola, where there were multiple y-values for a given x. Although I got it to work, I know there has to be a more pythonic or numpythonic way of plotting this line.
Background: The problem was a perceptron classifier on a set of inputs that were not linearly separable. To make them separable, the inputs were mapped through a general hyperbola function, raising the dimensionality to 5, so that a hyperplane could separate them. The equation for the decision boundary to be plotted is
d(x, y) = w0 + w1*x^2 + w2*y^2 + w3*x*y + w4*x + w5*y
Through the course of the perceptron's gradient descent, the values w0-w5 are found, and the boundary is the set of (x, y) where d(x, y) = 0.
Current implementation: I got it to work, but I think it is hacky. I first have to create an array of a given size so that I can append values, and I have to delete the initialized value the first time I append a found value. I then sweep through the space on my graph and find a y-value, first by guessing high, then by guessing low, to find both possible y-values. I put these found values at the front and back of D in order to plot them using matplotlib.
import numpy as np
from scipy.optimize import fsolve

D = np.array([[0], [0]])
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
a_iter, b_iter = 0, 0  # initial guesses for the numeric solver
for xx in range(int(x_min), int(x_max)):
    # solve for the top and bottom branches of the hyperbola
    yya = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, max(a_iter, 7))[0]
    yyb = fsolve(lambda yy: W[:,0] + W[:,1]*xx**2 + W[:,2]*yy**2 + W[:,3]*xx*yy + W[:,4]*xx + W[:,5]*yy, b_iter)[0]
    a_iter = yya
    b_iter = yyb
    # collect these points in a single matrix for plotting
    dda = np.array([[xx], [yya]])
    ddb = np.array([[xx], [yyb]])
    D = np.concatenate((dda, D), axis=1)
    if xx == int(x_min):  # delete the initial [0; 0] column
        D = dda
    D = np.concatenate((D, ddb), axis=1)
I know there has to be a better way to do this. Any insight is appreciated.
Edit: Apologies, I realize that without an image this is really difficult to understand. The main issues, finding multiple roots and populating a numpy array, are fairly generic. I don't have enough rep to post images, but the link is below:
nonlinear classifier
If you want to plot an implicit equation's curve, you can use pyplot.contour(); here is an example:
import numpy as np
import pylab as pl

np.random.seed(1)
w = np.random.randn(6)

def f(x, y, w):
    return w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y

X, Y = np.mgrid[-2:2:100j, -2:2:100j]
pl.contour(X, Y, f(X, Y, w), levels=[0])
There are parameterized options too; here is a trig one, with branches centered at 0 and pi:
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(-np.pi/3, np.pi/3, 200)  # branch centered at 0
y = 1/np.cos(t)  # sec(t)
x = np.tan(t)
plt.plot(x, y)   # first branch (default blue)

t = np.linspace(np.pi - np.pi/3, np.pi + np.pi/3, 200)  # branch centered at pi
y = 1/np.cos(t)
x = np.tan(t)
plt.plot(x, y)   # second branch (default orange)
sympy ought to be able to recover the full denormalized, rotated, offset hyperbola parameterization from the bivariate polynomial coefficients w
(or you could continue the hackery with a fit)
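For instance, a minimal sketch along those lines (the weights here are random placeholders for the trained ones): since the boundary is quadratic in y, sympy can solve for the two y-branches in closed form, avoiding fsolve entirely.

import numpy as np
import sympy as sp

x, y = sp.symbols('x y')
w = np.random.randn(6)  # placeholder weights; substitute the trained values

# d(x, y) is quadratic in y, so solve() returns both branches exactly.
expr = w[0] + w[1]*x**2 + w[2]*y**2 + w[3]*x*y + w[4]*x + w[5]*y
branches = sp.solve(expr, y)

# Turn each branch into a fast numeric function for plotting.
fns = [sp.lambdify(x, b, 'numpy') for b in branches]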

Getting input dimensions in pymc3 correct

Say I have 10 coins from the same mint and I flip each of them 50 times; now I want to estimate the bias of the mint as well as the individual bias of each coin.
The way I want to do this is like this:
import pymc3 as pm
from scipy.stats import bernoulli

# Generate a list of 10 arrays with 50 flips in each
test = [bernoulli.rvs(0.5, size=50) for x in range(10)]

with pm.Model() as test_model:
    k = pm.Gamma('k', 0.01, 0.01) + 2
    w = pm.Beta('w', 1, 1)
    thetas = pm.Beta('thetas', w * (k - 2) + 1, (1 - w) * (k - 2) + 1, shape=len(test))
    y = pm.Bernoulli('y', thetas, observed=test)
But this doesn't work, because pymc now seems to expect 50 coins with 10 flips each. I can hack around the issue in this instance, but I'm a beginner at both Python and PyMC(3), so I want to learn why it behaves like this and what a proper model of this situation should look like.
If you are new to Python, you may not be familiar with the concept of broadcasting, which is used when working with NumPy arrays and is also useful when defining PyMC3 models. Broadcasting lets us operate arithmetically on arrays of different sizes under certain circumstances.
For your particular example, the problem is that under the broadcasting rules the shape of the data vector and the shape of the thetas vector are not compatible. The easiest solution is to transpose the data vector (make rows columns and columns rows).
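As a small illustration of the rule, with shapes matching this example: broadcasting aligns trailing axes, so a (50, 10) array combines with a length-10 vector by pairing each column (coin) with its own theta.

import numpy as np

flips = np.random.randint(0, 2, size=(50, 10))  # 50 flips x 10 coins
thetas = np.linspace(0.1, 0.9, 10)              # one theta per coin

# (50, 10) * (10,) -> thetas is stretched along the first axis
print((flips * thetas).shape)  # (50, 10): shapes are compatible
# A (10, 50) array against (10,) would raise a ValueError instead.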
Notice also that with SciPy you can create your mock data without a list comprehension; you just need to pass the proper shape:
test = bernoulli.rvs(0.5, size=(50, 10))

with pm.Model() as test_model:
    k = pm.Gamma('k', 0.01, 0.01) + 2
    w = pm.Beta('w', 1, 1)
    thetas = pm.Beta('thetas', w * (k - 2) + 1, (1 - w) * (k - 2) + 1, shape=test.shape[1])
    y = pm.Bernoulli('y', thetas, observed=test)
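To check that the transposed model runs end to end, a minimal sampling call (the draw counts are just illustrative):

with test_model:
    trace = pm.sample(1000, tune=1000)

print(trace['thetas'].mean(axis=0))  # posterior mean bias of each coin
print(trace['w'].mean())             # posterior mean bias of the mint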

numpy.poly1d, root-finding optimization, shifting a polynomial on the x-axis

It is usually an easy task to build an n-th order polynomial
and find its roots with numpy:
import numpy
f = numpy.poly1d([1, 2, 3])
print(numpy.roots(f))
# array([-1.+1.41421356j, -1.-1.41421356j])
However, suppose you want a polynomial of the form:
f(x) = a*(x-x0)**0 + b*(x-x0)**1 + ... + n*(x-x0)**n
Is there a simple way to construct a numpy.poly1d type function and find the roots? I've tried scipy.fsolve, but it is very unstable, as it depends highly on the choice of starting values in my particular case.
Thanks in advance
Best Regards
rrrak
EDIT: Changed "polygon"(wrong) to "polynomial"(correct)
First of all, surely you mean polynomial, not polygon?
In terms of providing an answer, are you using the same value of "x0" in all the terms? If so, let y = x - x0, solve for y and get x using x = y + x0.
You could even wrap it in a lambda function if you want. Say you want to represent
f(x) = 1 + 3*(x-1) + (x-1)**2
Then,
>>> g = numpy.poly1d([1, 3, 1])
>>> f = lambda x: g(x - 1)
>>> f(0.0)
-1.0
The roots of f are given by:
roots_f = numpy.roots(g) + 1
In case the x0 differ from term to term, such as:
f(x) = 3*(x-0)**0 + 2*(x-2)**1 + 3*(x-1)**2 + 2*(x-2)**3
you can use polynomial operations to compute the fully expanded polynomial:
import numpy as np
import operator
from functools import reduce  # reduce is no longer a builtin in Python 3

ks = [3, 2, 3, 2]
offsets = [0, 2, 1, 2]
p = reduce(operator.add,
           [np.poly1d([1, -x0])**i * c for i, (c, x0) in enumerate(zip(ks, offsets))])
print(p)
The result is:
2 x**3 - 9 x**2 + 20 x - 14
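A possible modern variant (a sketch using the newer numpy.polynomial API rather than poly1d): Polynomial objects can be summed and exponentiated directly, so the shifted terms expand without reduce.

import numpy as np
from numpy.polynomial import Polynomial

ks = [3, 2, 3, 2]
offsets = [0, 2, 1, 2]

# Polynomial([-x0, 1]) represents (x - x0); coefficients are in ascending order.
p = sum(c * Polynomial([-x0, 1])**i
        for i, (c, x0) in enumerate(zip(ks, offsets)))

print(p)          # expanded polynomial, ascending powers: -14 + 20x - 9x**2 + 2x**3
print(p.roots())  # roots of the expanded polynomial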
