I have a data frame named df:
import pandas as pd
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
x is only for plotting purposes.
I'm trying to predict the y value based on the p values. I am using SVR from sklearn:
from sklearn.svm import SVR
nlm = SVR(kernel='poly').fit(df[['p']], df['y'])
df['nml'] = nlm.predict(df[['p']])
I have already tried all the kernels, but the fit is still not accurate enough.
p x y nml
0 15 0 666.666667 524.669572
1 14 1 714.285714 713.042459
2 13 2 769.230769 876.338765
3 12 3 833.333333 1016.349674
Do you know which sklearn model or other library I should use to get a better fit?
You missed a fundamental step: normalizing the data.
Fix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR

df = pd.DataFrame({'p': [15-x for x in range(14)]
                  , 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
# Normalize the data: (x - mean(x)) / std(x)
s_p = np.std(df['p'])
m_p = np.mean(df['p'])
s_y = np.std(df['y'])
m_y = np.mean(df['y'])
df['p_'] = (df['p'] - m_p)/s_p
df['y_'] = (df['y'] - m_y)/s_y
# Fit and make prediction
nlm = SVR(kernel='rbf').fit(df[['p_']], df['y_'])
df['nml'] = nlm.predict(df[['p_']])
# Plot
plt.plot(df['p_'], df['y_'], 'r')
plt.plot(df['p_'], df['nml'], 'g')
plt.show()
# Rescale back and plot
plt.plot(df['p_']*s_p+m_p, df['y_']*s_y+m_y, 'r')
plt.plot(df['p_']*s_p+m_p, df['nml']*s_y+m_y, 'g')
plt.show()
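The manual normalization above can equivalently be done with sklearn's StandardScaler, which stores the mean and standard deviation for you. A minimal sketch, reusing the question's data frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

df = pd.DataFrame({'p': [15 - x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])

# StandardScaler applies (x - mean(x)) / std(x) per column
p_scaler = StandardScaler().fit(df[['p']])
y_scaler = StandardScaler().fit(df[['y']])

p_n = p_scaler.transform(df[['p']])
y_n = y_scaler.transform(df[['y']]).ravel()

nlm = SVR(kernel='rbf').fit(p_n, y_n)

# Predictions come back in normalized units; invert to the original scale
y_hat = y_scaler.inverse_transform(nlm.predict(p_n).reshape(-1, 1)).ravel()
```

This way the scaling parameters live on the scaler objects instead of in loose variables, and `inverse_transform` replaces the manual rescale-back.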
As @mujjiga pointed out, scaling is an important part of the process.
I would like to draw your attention to two other key points:
model selection, which determines your ability to solve a class of problems;
the new scikit-learn API, which helps you standardize solution development.
Let's start with your dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15-x})
df['y'] = 1e4/df['p']
Then we import some sklearn API objects of interest:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
First we create a scaler function for target values:
ysc = StandardScaler()
Notice that we can use different scalers, or build a custom transformation.
# Scaler robust against outliers:
ysc = RobustScaler()
# Logarithmic Transformation:
ysc = FunctionTransformer(func=np.log, inverse_func=np.exp, check_inverse=True)
We scale target using the scaler of our choice:
ysc.fit(df[['y']])
df['yn'] = ysc.transform(df[['y']]).ravel()
We then build a pipeline with a feature standardizer and the selected model (parameters adjusted to improve the fit), and fit it to the dataset:
reg = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, epsilon=1e-3))
reg.fit(df[['p']], df['yn'])
At this point we can predict values and transform them back to the original scale:
df['ynhat'] = reg.predict(df[['p']])
df['yhat'] = ysc.inverse_transform(df[['ynhat']]).ravel()
We check the fit score:
reg.score(df[['p']], df['yn']) # 0.9999646718755011
We can also compute absolute and relative error for each point:
df['yaerr'] = df['yhat'] - df['y']
df['yrerr'] = df['yaerr']/df['y']
Final result is:
x p y yn ynhat yhat yaerr yrerr
0 0 15 666.666667 -0.834823 -0.833633 668.077018 1.410352 0.002116
1 1 14 714.285714 -0.794636 -0.795247 713.562403 -0.723312 -0.001013
2 2 13 769.230769 -0.748267 -0.749627 767.619013 -1.611756 -0.002095
3 3 12 833.333333 -0.694169 -0.693498 834.128425 0.795091 0.000954
4 4 11 909.090909 -0.630235 -0.629048 910.497550 1.406641 0.001547
5 5 10 1000.000000 -0.553514 -0.555029 998.204445 -1.795555 -0.001796
6 6 9 1111.111111 -0.459744 -0.460002 1110.805275 -0.305836 -0.000275
7 7 8 1250.000000 -0.342532 -0.341099 1251.697707 1.697707 0.001358
8 8 7 1428.571429 -0.191830 -0.193295 1426.835676 -1.735753 -0.001215
9 9 6 1666.666667 0.009105 0.010458 1668.269984 1.603317 0.000962
10 10 5 2000.000000 0.290414 0.291060 2000.764717 0.764717 0.000382
11 11 4 2500.000000 0.712379 0.690511 2474.088446 -25.911554 -0.010365
12 12 3 3333.333333 1.415652 1.416874 3334.780642 1.447309 0.000434
13 13 2 5000.000000 2.822199 2.821420 4999.076799 -0.923201 -0.000185
Graphically it leads to:
fig, axe = plt.subplots()
axe.plot(df['p'], df['y'], label='$y(p)$')
axe.plot(df['p'], df['yhat'], 'o', label=r'$\hat{y}(p)$')
axe.set_title(r"SVR Fit for $y(x) = \frac{k}{x-a}$")
axe.set_xlabel('$p = x-a$')
axe.set_ylabel(r'$y, \hat{y}$')
axe.legend()
axe.grid()
Linearization
In the example above we could not use the poly kernel; we had to use the rbf kernel instead. This is because a rational function is poorly approximated by a polynomial. If we know the functional form, we are better off transforming the data before fitting, here with the substitution z = y/(y+1) applied to the target. The problem then boils down to a plain linear regression. The example below shows that it works:
Scaler and transformation can be composed into a pipeline as well. We define a pipeline that linearize and scale the problem:
# Rational Fraction Substitution with consecutive Standardization
ysc = make_pipeline(
FunctionTransformer(func=lambda x: x/(x+1),
inverse_func=lambda x: x/(1-x),
check_inverse=True),
StandardScaler()
)
Then we can regress the data using classical OLS (note that the target must first be re-scaled with this new ysc pipeline):
df['yn'] = ysc.fit_transform(df[['y']]).ravel()
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])
Which provides correct result:
reg.score(df[['p']], df['yn']) # 0.9999998722172933
This second solution takes advantage of a known linearization and thus removes the need to tune the model's hyperparameters.
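Putting the linearization together end to end, here is a self-contained sketch (same dataset as above; the target is transformed with the rational substitution, standardized, then fitted with plain OLS):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15 - x})
df['y'] = 1e4 / df['p']

# Linearize the rational target via z = y/(y+1), then standardize
ysc = make_pipeline(
    FunctionTransformer(func=lambda v: v / (v + 1),
                        inverse_func=lambda v: v / (1 - v),
                        check_inverse=True),
    StandardScaler(),
)
df['yn'] = ysc.fit_transform(df[['y']]).ravel()

# After linearization, ordinary least squares is enough
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])
print(reg.score(df[['p']], df['yn']))
```

Because the transformed target is (almost exactly) linear in p here, the score is essentially 1 with no hyperparameter tuning at all.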
Related
I am getting a flat regression even with a 10th-degree regressor. But if I change the date values to numeric, then the regression works! Does anybody know why?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from scipy.optimize import curve_fit
## RESHAPE DATA ##
X = transformed_data.ds.values.reshape(-1, 1)
y = transformed_data.y
# X = data.fecha.dt.day.values.reshape(-1, 1)
## PLOT ##
fig, ax = plt.subplots(figsize=(15, 8))
ax.plot(X, y, 'o', label="data")
for i in range(1, 10):
    polyreg = make_pipeline(PolynomialFeatures(i), LinearRegression())
    polyreg.fit(X, y)
    mse = round(np.mean((y - polyreg.predict(X))**2))
    mae = round(np.mean(abs(y - polyreg.predict(X))))
    ax.plot(X, polyreg.predict(X),
            label='Degree: ' + str(i) + ' MSE: ' + f'{mse:,}' + ' MAE: ' + f'{mae:,}')
ax.legend()
Datetime Data
ds y
0 2019-01-10 3658.0
1 2019-01-11 2952.0
2 2019-01-12 2855.0
3 2019-01-13 3904.0
Flat regressions
Numeric Data
ds y
0 10 3658.0
1 11 2952.0
2 12 2855.0
3 13 3904.0
Curved regressions
Linear regression works by associating a fitted coefficient with each numerical feature; at prediction time the feature values are multiplied by those coefficients to produce the output.
But in your case one of the variables is a date, and as explained above, the regression model doesn't know what to do with it. As you noticed, you need to convert it to numerical data first.
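One common conversion is to turn the datetime column into a numeric offset, e.g. days since the first observation. A sketch using the question's small ds/y sample:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'ds': pd.to_datetime(['2019-01-10', '2019-01-11',
                          '2019-01-12', '2019-01-13']),
    'y': [3658.0, 2952.0, 2855.0, 3904.0],
})

# Days elapsed since the first date: a plain numeric feature the model can use
df['ds_num'] = (df['ds'] - df['ds'].min()).dt.days

X = df[['ds_num']].values
reg = LinearRegression().fit(X, df['y'])
pred = reg.predict([[4]])  # extrapolate one day past the sample
```

Any monotone numeric encoding (ordinal day, Unix timestamp, day-of-month as in the question) works the same way; what matters is that the regressor receives numbers, not datetime objects.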
My code is as follows:
Transform scale
from pandas import Series
from sklearn.preprocessing import MinMaxScaler
X = dataset  # (100, 18)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print(scaled_series.head())
Invert transform
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print(inverted_series.head())
The problem is that scaled_series and inverted_series give the same result. How should I correct the code?
I guess the problem is specific to your dataset. For instance, when I use an example dataset, scaled_series and inverted_series give two different outputs:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 0.698347
1 0.706612
2 0.706612
3 0.657025
4 0.797521
dtype: float32
Both scaled_series and inverted_series gave different outputs, but the values are close to each other. If you scale your data before using MinMaxScaler:
from sklearn.preprocessing import scale
X = scale(X)
Result:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 -0.188240
1 -0.123413
2 -0.123413
3 -0.512372
4 0.589678
dtype: float32
Now, the outputs are not close to each other, they are completely different.
Code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.preprocessing import MinMaxScaler, scale
from pandas import Series
X, _ = fetch_olivetti_faces(return_X_y=True)
X = scale(X)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print("\nScaled Series output:")
print(scaled_series.head())
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print("\nInverted Series output:")
print(inverted_series.head())
You have to consider the range of your dataset X. Consider the formula for the MinMax scaler:
X_scaled = (X - X.min) / (X.max - X.min)
Should the range of X already be [0, 1], the transform makes no difference: you subtract 0 and divide by 1, returning the same value. Min-max normalization only changes values that are not already on a 0-1 scale.
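A quick illustration of this point with toy data: a column already spanning [0, 1] passes through MinMaxScaler unchanged, while a column on another scale is visibly rescaled, and inverse_transform always recovers the originals:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0, 10.0],
              [0.5, 30.0],
              [1.0, 50.0]])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)
scaled = scaler.transform(X)

# Column 0 already spans [0, 1]: (x - 0) / (1 - 0) leaves it unchanged
# Column 1 is rescaled from [10, 50] to [0, 1]
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, X))  # True
```

So if scaled_series and inverted_series look identical, the most likely explanation is that the input column was already on a [0, 1] scale.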
I have a code snippet in R using the simple lm() function, with one dependent and one independent variable:
Data_df <- data.frame(
  X = c(149876.9876, 157853.421, 147822.3803, 147904.6639, 152625.6781, 147229.8083, 181202.081, 164499.6566, 171461.6586, 164309.3919),
  Y = c(26212109.07, 28376408.76, 30559566.77, 26765176.65, 28206749.66, 27560521.33, 32713878.83, 31263763.7, 30812063.54, 30225631.6)
)
lmfit <- lm(formula = Y ~ X, data = Data_df)
lmpred <- predict(lmfit, newdata = Data_df, se.fit = TRUE, interval = "prediction")
print(lmpred)  # prints out fit, se.fit, df, residual.scale
The output of the above code has four components:
1. fit
2. se.fit
3. df
4. residual.scale
Please help me find a way to calculate se.fit and residual.scale in Python.
I'm using statsmodels' ols to build the linear regression model. Below is the Python code I'm using:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
ols_result = smf.ols(formula='Y ~ X', data=DATA_X_Y_OLS).fit()
ols_result.predict(data_x_values)
R output
$fit
fit lwr upr
1 27594475 23262089 31926862
2 28768803 24486082 33051524
3 27291987 22943619 31640354
4 27304101 22956398 31651804
5 27999150 23686118 32312183
6 27204745 22851531 31557960
7 32206302 27951767 36460836
8 29747293 25490577 34004009
9 30772271 26527501 35017042
10 29719281 25462018 33976544
$se.fit
1 2 3 4 5 6 7 8 9 10
578003.4 483363.7 605520.6 604399.0 542961.1 613642.7 420890.0 426036.9 397072.7 427318.3
$df
[1] 24
$residual.scale
[1] 2017981
To find the fit, se.fit, df and residual.scale outputs of R's lm() function, below is the Python code to calculate the four values mentioned above:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
ols_result = smf.ols(formula='Y ~ X', data=DATA).fit()
fit = ols_result.predict(X_new)  # predicted values, i.e. fit from lm()
covariance_matrix = ols_result.cov_params()
x = DATA['X'].values
xO = pd.DataFrame({"Constant": np.ones(len(x))}).join(pd.DataFrame(x)).values
x1 = np.dot(xO, covariance_matrix)
se_fit = np.sqrt(np.sum(x1 * xO, axis=1))  # standard error of the fitted values, i.e. se.fit in lm()
df = ols_result.df_resid  # degrees of freedom, i.e. df in lm()
residual_scale = np.sqrt(ols_result.mse_resid)  # residual standard deviation, i.e. residual.scale in lm()
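As a cross-check, newer statsmodels versions expose these quantities directly through get_prediction, which avoids the manual covariance algebra. A sketch on synthetic data (the column names X/Y mirror the question; predicted_mean, se_mean, df_resid and mse_resid are attributes of the statsmodels results objects):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
data = pd.DataFrame({'X': np.linspace(1, 10, 30)})
data['Y'] = 3 * data['X'] + rng.normal(scale=0.5, size=30)

ols_result = smf.ols('Y ~ X', data=data).fit()
pred = ols_result.get_prediction(data)

fit = pred.predicted_mean                        # fit
se_fit = pred.se_mean                            # se.fit
df_resid = ols_result.df_resid                   # df
residual_scale = np.sqrt(ols_result.mse_resid)   # residual.scale
```

get_prediction also provides conf_int(), so the lwr/upr columns of R's prediction interval can be reproduced from the same object.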
Here is my code:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
X_arr = []
Y_arr = []
with open('input.txt') as fp:
    for line in fp:
        b = line.split("|")
        x, y = b
        X_arr.append(int(x))
        Y_arr.append(int(y))
X=np.array([X_arr]).T
print(X)
y=np.array(Y_arr)
print(y)
model = make_pipeline(PolynomialFeatures(degree=2),
LinearRegression(fit_intercept = False))
model.fit(X,y)
X_predict = np.array([[3]])
print(model.predict(X_predict))
I have a question about this line:
model = make_pipeline(PolynomialFeatures(degree=2),
How can I choose this value (2, 3, 4, etc.)? Is there a method to set this value dynamically?
For example, i have this file of test:
1 1
2 4
4 16
5 75
for the first three lines the model is
y = a*x*x + b*x + c (with b = c = 0)
and for the last line the model is:
y = a*x*x*x + b*x + c (with b = c = 0)
This is by no means a fool-proof way to approach your problem, but I think I understand what you want, perhaps:
import math
epsilon = 1e-2
# Do your error checking on size of array
...
# Warning: this only works for positive x; the logarithm is not defined for negatives.
# If you really want to, take abs(X_arr[0]) and check that the degree is even.
deg = math.log(Y_arr[0], X_arr[0])
assert deg % 1 < epsilon
for x, y in zip(X_arr[1:], Y_arr[1:]):
    if x == y == 1:
        continue  # every x**n fits (1, 1) and it would cause a divide by zero
    assert abs(math.log(y, x) - deg) < epsilon
...
PolynomialFeatures(degree=int(deg))
This checks that the degree is an integer value and that all other data points fit the same polynomial.
This is purely a heuristic: if all your data points are (1, 1), there is no way to decide what the actual degree is. Without any assumptions about the data, you cannot determine the degree of the polynomial x^n.
This is just an example of how you'd implement such a heuristic; please don't use it in production.
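A more robust alternative to the log heuristic is to treat the degree as a hyperparameter and pick it by cross-validated score with GridSearchCV. A sketch on synthetic cubic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 21, dtype=float).reshape(-1, 1)
y = (2 * X ** 3).ravel()  # true relationship is cubic

pipe = make_pipeline(PolynomialFeatures(),
                     LinearRegression(fit_intercept=False))
# Try degrees 1..5 and keep the one with the best cross-validated R^2
search = GridSearchCV(pipe,
                      {'polynomialfeatures__degree': range(1, 6)},
                      cv=5)
search.fit(X, y)
best_degree = search.best_params_['polynomialfeatures__degree']
```

Unlike the heuristic above, this does not assume the data is an exact monomial; it simply selects the degree that generalizes best on held-out folds.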
I have a question about Scikit-Learn's PCA transform method. The code is found here - scroll down to find the transform() method.
They show the procedure in this simple example - the procedure is to first fit and then transform:
pca.fit(X) #step 1: fit()
X_transformed = fast_dot(X, self.components_.T) #step 2: transform()
I am trying to do this manually as follows:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.utils.extmath import fast_dot
iris = load_iris()
X = iris.data
y = iris.target
pca = PCA(n_components=3)
pca.fit(X)
Xm = X.mean(axis=1)
print(pca.transform(X)[:5, :])  # Method 1 - expected
X = X - Xm[None].T  # or: X = X - Xm[:, np.newaxis]
print(fast_dot(X, pca.components_.T)[:5, :])  # Method 2 - manual
Expected:
[[-2.68420713 -0.32660731 0.02151184]
[-2.71539062 0.16955685 0.20352143]
[-2.88981954 0.13734561 -0.02470924]
[-2.7464372 0.31112432 -0.03767198]
[-2.72859298 -0.33392456 -0.0962297 ]]
Manual
[[-0.98444292 -2.74509617 2.28864171]
[-0.75404746 -2.44769323 2.35917528]
[-0.89110797 -2.50829893 2.11501947]
[-0.74772562 -2.33452022 2.10205674]
[-1.02882877 -2.75241342 2.17090017]]
As you can see, the two results are different. Is there a step missing somewhere in the transform() method?
I'm not a great expert on PCA, but by looking at the sklearn source code I found your problem: you take the mean along the wrong axis.
Here's the solution:
Xm = X.mean(axis=0)  # axis 0 instead of 1
print(pca.transform(X)[:5, :])  # Method 1 - expected
X = X - Xm  # no need for a transpose now
print(fast_dot(X, pca.components_.T)[:5, :])  # Method 2 - manual
Results:
[[-2.68420713 0.32660731 -0.02151184]
[-2.71539062 -0.16955685 -0.20352143]
[-2.88981954 -0.13734561 0.02470924]
[-2.7464372 -0.31112432 0.03767198]
[-2.72859298 0.33392456 0.0962297 ]]
[[-2.68420713 0.32660731 -0.02151184]
[-2.71539062 -0.16955685 -0.20352143]
[-2.88981954 -0.13734561 0.02470924]
[-2.7464372 -0.31112432 0.03767198]
[-2.72859298 0.33392456 0.0962297 ]]
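As a side note, fast_dot was an internal helper that has since been removed from sklearn; the same manual transform can be written with pca.mean_ (the column means stored at fit time) and the @ matrix-multiplication operator. A sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=3).fit(X)

# transform() centers with the fitted column means, then projects
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))  # True
```

Using pca.mean_ instead of recomputing the mean also guards against the axis mix-up in the question, since the scaler stores exactly what fit() subtracted.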