PolynomialFeatures sklearn - python

Here is my code:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X_arr = []
Y_arr = []
with open('input.txt') as fp:
    for line in fp:
        b = line.split("|")
        x, y = b
        X_arr.append(int(x))
        Y_arr.append(int(y))

X = np.array([X_arr]).T
print(X)
y = np.array(Y_arr)
print(y)

model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearRegression(fit_intercept=False))
model.fit(X, y)

X_predict = np.array([[3]])
print(model.predict(X_predict))
I have a question about this line:
model = make_pipeline(PolynomialFeatures(degree=2),
How can I choose this value (2 or 3 or 4, etc.)? Is there a method to set this value dynamically?
For example, i have this file of test:
1 1
2 4
4 16
5 75
for the first three lines the model is
y = a*x^2 + b*x + c (with b = c = 0)
for the last line, the model is
y = a*x^3 + b*x + c (with b = c = 0)

This is by no means a fool-proof way to approach your problem, but I think I understand what you want, perhaps:
import math

epsilon = 1e-2

# Do your error checking on the size of the arrays
...

# Warning: this only works for positive x; the logarithm is not defined for
# negative numbers. If you really want to, take abs(X_arr[0]) and check that
# the degree is even.
deg = math.log(Y_arr[0], X_arr[0])
assert deg % 1 < epsilon

for x, y in zip(X_arr[1:], Y_arr[1:]):
    if x == y == 1:
        continue  # every x^n fits (1, 1), and it would cause a divide by zero
    assert abs(math.log(y, x) - deg) < epsilon

...
PolynomialFeatures(degree=int(deg))
This checks to see if the degree is an integer value, and that all other data points fit the same polynomial.
This is purely a heuristic. If you have a bunch of data points of (1,1), there's no way you can decide what the actual degree is. Without any assumptions of the data, you cannot determine the degree of the polynomial x^n.
This is just an example of how you'd implement such a heuristic, and please don't use this in production.
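If you would rather not assume the data follow a pure x^n law, another option (not something built into PolynomialFeatures, just a common pattern) is to treat the degree as a hyperparameter and select it by cross-validation. A minimal sketch, reusing the X and y arrays built in the question; note that with only a handful of points the cross-validation scores will be very noisy:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# X and y built exactly as in the question
pipe = make_pipeline(PolynomialFeatures(), LinearRegression(fit_intercept=False))
search = GridSearchCV(pipe,
                      param_grid={'polynomialfeatures__degree': [1, 2, 3, 4]},
                      cv=3)  # 3-fold cross-validation; adjust to your data size
search.fit(X, y)
print(search.best_params_)  # e.g. {'polynomialfeatures__degree': 2}
best_model = search.best_estimator_
print(best_model.predict(np.array([[3]])))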

Related

Python statsmodels – ValueError: how to create variable in range 0 to 1?

Code:
import numpy as np
import pandas as pd
import statsmodels.api as sm
sacramento = pd.read_csv("sacramento.csv")
X = sacramento[["beds", "sqft", "price"]]
Y = sacramento["baths"]
X = sm.add_constant(X)
model = sm.Logit(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
print(model.params.round(2))
print(model.pvalues.round(2))
print('The smallest p-value is for sqft')
The problem I have is with this instruction: "You will need to create a new variable from baths, and it should make it such that those observations of 1 bath correspond to a value of 0, and those with more than 1 bath correspond to a 1."
I really do not know how to do that, and it is what causes the ValueError: endog must be in the unit interval.
Link to the csv file: https://drive.google.com/file/d/1A3LQ2vZ9IUkv_2HkqP8c2sCQGAvdII-r/view?usp=sharing
Can you try this?
sacramento["baths"] = sacramento["baths"].apply(lambda x: 0 if x== 1 else 1)

How can I get a value from a polynomial defined with np.poly1d?

I made a model from a series of data. My model is represented by the red line which has the following formula:
p4 = np.poly1d(np.polyfit(x, y, 4))  # 0.04253 x^4 - 3.593 x^3 + 89.6 x^2 - 470.3 x + 666.4
How can I retrieve a value from my model (from the red polynomial line)?
I tried with this code but results are not coherent:
y=np.arange(len(x))
X=scale.fit_transform(y.values)
X=np.array(X)
X.reshape(-1,1)
est = sm.OLS(y, X).fit()
scaled = scale.transform(50)
predicted = est.predict(scaled[0])
With x=50 I retrieve 1 as the prediction, which is obviously not consistent with the model.
Could you help me?
You can get the value by using the polynomial function returned by np.poly1d.
See the example shown in the documentation:
import numpy as np
p = np.poly1d([1, 2, 3])
print(np.poly1d(p))
# Evaluate the polynomial at x = 0.5:
y = p(0.5)
print(y)
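Applied to your own fit, the same idea is just calling the poly1d object. A sketch, assuming x and y are the arrays you fitted the red line on:
import numpy as np

p4 = np.poly1d(np.polyfit(x, y, 4))  # your degree-4 fit
print(p4(50))                        # value of the red line at x = 50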

Finding Kneighbors in sklearn using KDtree with multiple target variables with multiple search criteria

Let's say this is my simple KD-tree algorithm that I am implementing:
import numpy as np
from sklearn.neighbors import KDTree

def Test():
    features = np.random.random((10, 2))
    X = np.array(features[0:2])
    print(X)
    tree = KDTree(features, leaf_size=40)
    indic = tree.query_radius(X, r=0.1)
    counter = 0
    for i in indic:
        a = features[i]
        np.savetxt('file{}.txt'.format(counter), a, fmt='%s')
        counter += 1
        yield i

tree = Test()
[X for X in tree]
Here I am saving a text file with the neighbour elements of each target position, and this works quite fine.
Are there any tricks so that I can use different search criteria for each target point without creating a separate tree query again and again?
For example, let's say I want to use one variable X = np.array(features[0]) with r = 0.1
and another variable Y = np.array(features[1]) with r = 0.5.
Right now, I can only think of it like this:
indic1 = tree.query_radius(X, r= 0.1)
indic2 = tree.query_radius(Y, r= 0.5)
Is there a way that I can combine these two into one tree query?
Yes, there is a way of doing it with just one query_radius call. From the documentation:
r can be a single value, or an array of values of shape x.shape[:-1]
if different radii are desired for each point.
So you can do it like this:
import numpy as np
from sklearn.neighbors import KDTree
np.random.seed(42)
features = np.random.random((10, 2))
X = np.array(features[0:2])
tree = KDTree(features, leaf_size=40)
indices = tree.query_radius(X, r=np.array([0.1, 0.5]))
for cursor, ix in enumerate(indices):
    np.savetxt('file{}.txt'.format(cursor), features[ix], fmt='%s')
The output was file0.txt and file1.txt, file0.txt has 1 point (lower radius) and file1.txt has 5 points (higher radius).

How to use scipy.minimize with multiple parameters in error function?

I have two sets of frequency data, one from experiment and one from a theoretical formula, and I want to fit them using scipy's minimize function.
Here's my code snippet, where g is the coupling I want to find out and ind is the inductance used for plotting on the x-axis.
import numpy as np
from scipy.optimize import minimize

def eigenfreq1_func(ind, w_q, w_r, g):
    return (w_q + w_r) + np.sqrt((w_q + w_r)**2.0 - 4*(w_q + w_r - g**2.0))/2

def eigenfreq2_func(ind, w_q, w_r, g):
    return (w_q + w_r) - np.sqrt((w_q + w_r)**2.0 - 4*(w_q + w_r - g**2))/2.0

def err_func(y1, y1_fit, y2, y2_fit):
    return np.sqrt((y1 - y1_fit)**2 + (y2 - y2_fit)**2)

g_init = 80e6
res1 = eigenfreq1_func(ind, qubit_freq, readout_freq, g_init)
print(res1)
res2 = eigenfreq2_func(ind, qubit_freq, readout_freq, g_init)
print(res2)
fit = minimize(err_func, args=[qubit_freq, res1, readout_freq, res2])
But it's showing the following error :
"TypeError: minimize() takes at least 2 arguments (2 given)"
First, the indentation in your original example is messed up; it will not run as posted.
Second, here is a baby example of minimizing a chi2 with the function scipy.optimize.minimize (note you can minimize whatever you want: likelihood, |chi|**2, anything else):
import numpy as np
import scipy.optimize as opt

def functionyouwanttofit(x, y, z, t, u):
    # baby test here, but put whatever model you want
    return np.array([x+y+z+t+u, x+y+z+t-u, x+y+z-t-u, x+y-z-t-u])

def calc_chi2(parameters):
    x, y, z, t, u = parameters
    data = np.array([100, 250, 300, 500])
    chi2 = sum((data - functionyouwanttofit(x, y, z, t, u))**2)
    return chi2

# baby example for initial, min & max values
x_init, x_min, x_max = 0, -1, 10
y_init, y_min, y_max = 1, -2, 9
z_init, z_min, z_max = 2, 0, 1000
t_init, t_min, t_max = 10, 1, 100
u_init, u_min, u_max = 10, 1, 100

parameters = [x_init, y_init, z_init, t_init, u_init]
bounds = [[x_min, x_max], [y_min, y_max], [z_min, z_max], [t_min, t_max], [u_min, u_max]]
result = opt.minimize(calc_chi2, parameters, bounds=bounds)
In your example you do not give any initial values for the parameters, and the indentation is broken: minimize needs at least the function to minimize and a starting point x0.
Third, note that the optimization routines provided by scipy are not always suited to your needs; you may prefer a fitting front-end such as lmfit.
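Applied to your own snippet, a minimal sketch could look like the following. It assumes ind, qubit_freq, readout_freq and the two eigenfreq functions are defined as in your code, and the names y1_data / y2_data stand for your measured eigenfrequencies (those names are mine). The key point is that minimize wants a function of a parameter vector plus an initial guess x0:
import numpy as np
from scipy.optimize import minimize

def err_func(params, ind, w_q, w_r, y1_data, y2_data):
    g = params[0]
    y1_fit = eigenfreq1_func(ind, w_q, w_r, g)
    y2_fit = eigenfreq2_func(ind, w_q, w_r, g)
    # sum of squared residuals over both branches
    return np.sum((y1_data - y1_fit)**2 + (y2_data - y2_fit)**2)

g_init = 80e6
fit = minimize(err_func, x0=[g_init],
               args=(ind, qubit_freq, readout_freq, y1_data, y2_data))
print(fit.x[0])  # best-fit coupling g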

Ridge Regression: Scikit-learn vs. direct calculation does not match for alpha > 0

In Ridge Regression, we are solving Ax = b with L2 regularization. The direct calculation is given by:
x = (A^T A + alpha * I)^-1 A^T b
I have looked at the scikit-learn code and they do implement the same calculation. But, I can't seem to get the same results for alpha > 0
The minimal code to reproduce this.
import numpy as np
A = np.asmatrix(np.c_[np.ones((10,1)),np.random.rand(10,3)])
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
>>> [[ 0.37371021 0.19558433 0.06065241 0.17030177]]
from sklearn.linear_model import Ridge
model = Ridge(alpha = alpha).fit(A[:,1:],b)
print(np.c_[model.intercept_, model.coef_])
>>> [[ 0.61241566 0.02727579 -0.06363385 0.05303027]]
Any suggestions on what I can do to resolve this discrepancy?
This modification seems to yield the same result for the direct calculation and the scikit-learn version:
import numpy as np
A = np.asmatrix(np.random.rand(10,3))
b = np.asmatrix(np.random.rand(10,1))
I = np.identity(A.shape[1])
alpha = 1
x = np.linalg.inv(A.T*A + alpha * I)*A.T*b
print(x.T)
from sklearn.linear_model import Ridge
model = Ridge(alpha=alpha, tol=0.1, fit_intercept=False).fit(A, b)
print(model.coef_)
print(model.intercept_)
It seems the main reason for the difference is the class Ridge has the parameter fit_intercept=True (by inheritance from class _BaseRidge) (source)
This is applying a data centering procedure before passing the matrices to the _solve_cholesky function.
Here's the line in ridge.py that does it
X, y, X_mean, y_mean, X_std = self._center_data(
    X, y, self.fit_intercept, self.normalize, self.copy_X,
    sample_weight=sample_weight)
Also, it seems you were trying to implicitly account for the intercept by adding the column of 1's. As you see, this is not necessary if you specify fit_intercept=False
Appendix: Does the Ridge class actually implement the direct formula?
It depends on the choice of the solver parameter.
Effectively, if you do not specify the solver parameter in Ridge, it takes by default solver='auto' (which internally resorts to solver='cholesky'). This should be equivalent to the direct computation.
Rigorously, _solve_cholesky uses numpy.linalg.solve instead of numpy.linalg.inv. But it can be easily checked that
np.linalg.solve(A.T*A + alpha * I, A.T*b)
yields the same as
np.linalg.inv(A.T*A + alpha * I)*A.T*b
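If you do want to keep fit_intercept=True, you can also reproduce scikit-learn's result by centering the data yourself, since that is essentially what the Ridge class does before solving. A minimal sketch (dropping the column of ones):
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
A = rng.rand(10, 3)
b = rng.rand(10)
alpha = 1.0
I = np.identity(A.shape[1])

# Center features and target, as Ridge(fit_intercept=True) does internally
A_mean, b_mean = A.mean(axis=0), b.mean()
Ac, bc = A - A_mean, b - b_mean

# Direct ridge solution on the centered data, then recover the intercept
coef = np.linalg.solve(Ac.T @ Ac + alpha * I, Ac.T @ bc)
intercept = b_mean - A_mean @ coef

model = Ridge(alpha=alpha).fit(A, b)
print(coef, intercept)
print(model.coef_, model.intercept_)  # should match up to numerical precision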
