Scikit Learn transform method - manual calculation? - python

I have a question about Scikit-Learn's PCA transform method. The code is found here - scroll down to find the transform() method.
They show the procedure in this simple example: first fit, then transform:
pca.fit(X) #step 1: fit()
X_transformed = fast_dot(X, self.components_.T) #step 2: transform()
I am trying to do this manually as follows:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.utils.extmath import fast_dot
iris = load_iris()
X = iris.data
y = iris.target
pca = PCA(n_components=3)
pca.fit(X)
Xm = X.mean(axis=1)
print pca.transform(X)[:5,:] #Method 1 - expected
X = X - Xm[None].T # or can use X = X - Xm[:, np.newaxis]
print fast_dot(X,pca.components_.T)[:5,:] #Method 2 - manual
Expected:
[[-2.68420713 -0.32660731 0.02151184]
[-2.71539062 0.16955685 0.20352143]
[-2.88981954 0.13734561 -0.02470924]
[-2.7464372 0.31112432 -0.03767198]
[-2.72859298 -0.33392456 -0.0962297 ]]
Manual:
[[-0.98444292 -2.74509617 2.28864171]
[-0.75404746 -2.44769323 2.35917528]
[-0.89110797 -2.50829893 2.11501947]
[-0.74772562 -2.33452022 2.10205674]
[-1.02882877 -2.75241342 2.17090017]]
As you can see, the two results are different. Is there a step missing somewhere in the transform() method?

I'm not a great expert on PCA, but by looking at the sklearn source code I found your problem - you take the mean along the wrong axis.
Here's the solution:
Xm = X.mean(axis=0) # Axis 0 instead of 1
print pca.transform(X)[:5,:] #Method 1 - expected
X = X - Xm # No need for transpose now
print fast_dot(X,pca.components_.T)[:5,:] #Method 2 - manual
Results:
[[-2.68420713 0.32660731 -0.02151184]
[-2.71539062 -0.16955685 -0.20352143]
[-2.88981954 -0.13734561 0.02470924]
[-2.7464372 -0.31112432 0.03767198]
[-2.72859298 0.33392456 0.0962297 ]]
[[-2.68420713 0.32660731 -0.02151184]
[-2.71539062 -0.16955685 -0.20352143]
[-2.88981954 -0.13734561 0.02470924]
[-2.7464372 -0.31112432 0.03767198]
[-2.72859298 0.33392456 0.0962297 ]]
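For completeness, here is a self-contained recap of the manual computation (my own sketch, not part of the original answer): center X with the column means learned by fit() (pca.mean_, equivalent to X.mean(axis=0)) and project onto pca.components_. Note that fast_dot has since been removed from sklearn.utils.extmath; a plain matrix product does the same job.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=3)
pca.fit(X)

# transform() subtracts the mean learned during fit, then projects
# onto the principal axes stored in components_.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))  # True (with whiten=False, the default)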

Related

Why isn't this Linear Regression line a straight line?

I have points with x and y coordinates that I want to fit a straight line to with Linear Regression, but I get a jagged-looking line.
I am attempting to use LinearRegression from sklearn.
To create the points, I run a for loop that randomly creates one hundred points into an array that is 100 x 2 in shape. I slice the left side of it for the xs and the right side of it for the ys.
I expect to get a straight line when I plot m.predict.
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)
plt.scatter(X[:,0], X[:,1], s=10)
plt.show()
m = LinearRegression()
m.fit(X[:,0].reshape(1, -1), X[:,1].reshape(1, -1))
plt.plot(m.predict(X[:,0].reshape(1, -1))[0])
I am not good with numpy, but I think the problem is the use of the reshape() function to convert X[:,0] and X[:,1] from 1D to 2D: reshape(1, -1) produces a 2D array with a single row, so sklearn sees one sample with 100 features instead of 100 samples with one feature each, which results in an undesired regressor.
I was able to recreate this model using pandas and plot the desired result. The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
import pandas as pd
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)
y_train = pd.DataFrame(X[:,1],columns=['y'])
X_train = pd.DataFrame(X[:,0],columns=['X'])
# plt.scatter(X_train, y_train, s=10)
# plt.show()
m = LinearRegression()
m.fit(X_train, y_train)
plt.scatter(X_train,y_train)
plt.plot(X_train,m.predict(X_train),color='red')
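For completeness, a minimal sketch of the direct numpy fix (my own, not part of the answer above): reshape the feature into a column vector with reshape(-1, 1) so sklearn sees 100 samples with one feature, instead of reshape(1, -1), which gives one sample with 100 features. The vectorized point generation below is just a compact stand-in for the original loop.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
adder = 0.6 * np.arange(100)
x = rng.random(100) * 20 + adder - 0.4
y = rng.random(100) * 15 + adder

m = LinearRegression()
m.fit(x.reshape(-1, 1), y)   # 100 samples, 1 feature

plt.scatter(x, y, s=10)
plt.plot(x, m.predict(x.reshape(-1, 1)), color='red')
plt.show()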

How to predict y=1/x values in Python? [closed]

I have a data frame named df:
import pandas as pd
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
x is only for plotting purposes.
I'm trying to predict the y value based on the p values. I am using SVR from sklearn:
from sklearn.svm import SVR
nlm = SVR(kernel='poly').fit(df[['p']], df['y'])
df['nml'] = nlm.predict(df[['p']])
I have already tried all of the kernels, but it still doesn't fit well enough.
p x y nml
0 15 0 666.666667 524.669572
1 14 1 714.285714 713.042459
2 13 2 769.230769 876.338765
3 12 3 833.333333 1016.349674
Do you know which sklearn model or other library I should use to get a better fit?
You missed the fundamental step: normalize the data.
Fix:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR

df = pd.DataFrame({'p': [15-x for x in range(14)],
                   'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])

# Normalize the data: (x - mean(x)) / std(x)
s_p = np.std(df['p'])
m_p = np.mean(df['p'])
s_y = np.std(df['y'])
m_y = np.mean(df['y'])
df['p_'] = (df['p'] - m_p) / s_p
df['y_'] = (df['y'] - m_y) / s_y

# Fit and make prediction
nlm = SVR(kernel='rbf').fit(df[['p_']], df['y_'])
df['nml'] = nlm.predict(df[['p_']])

# Plot in the normalized space
plt.plot(df['p_'], df['y_'], 'r')
plt.plot(df['p_'], df['nml'], 'g')
plt.show()

# Rescale back to the original units and plot
plt.plot(df['p_']*s_p + m_p, df['y_']*s_y + m_y, 'r')
plt.plot(df['p_']*s_p + m_p, df['nml']*s_y + m_y, 'g')
plt.show()
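As a side note (my own sketch, not part of the answer above), scikit-learn can handle the target scaling and un-scaling for you via TransformedTargetRegressor, which scales y before fitting and inverse-transforms the predictions back to the original units:
import pandas as pd
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import TransformedTargetRegressor

df = pd.DataFrame({'p': [15 - x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])

model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf')),  # scales the feature p
    transformer=StandardScaler()                                   # scales the target y
)
model.fit(df[['p']], df['y'])
df['nml'] = model.predict(df[['p']])  # predictions already on the original scale of y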
As #mujjiga pointed out, scaling is an important part of the process.
I would like to draw your attention to two other key points:
model selection, which determines your ability to solve a class of problems;
the newer scikit-learn API, which helps you standardize solution development.
Let's start with your dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15-x})
df['y'] = 1e4/df['p']
Then we import some sklearn API objects of interest:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
First we create a scaler for the target values:
ysc = StandardScaler()
Notice that we can use different scalers, or build a custom transformation.
# Scaler robust against outliers:
ysc = RobustScaler()
# Logarithmic Transformation:
ysc = FunctionTransformer(func=np.log, inverse_func=np.exp, check_inverse=True)
We scale the target using the scaler of our choice:
ysc.fit(df[['y']])
df['yn'] = ysc.transform(df[['y']])
We also build a pipeline with a feature standardizer and the selected model (parameters adjusted to improve the fit), and fit it to your dataset:
reg = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, epsilon=1e-3))
reg.fit(df[['p']], df['yn'])
At this point we can predict values and transform them back to the original scale:
df['ynhat'] = reg.predict(df[['p']])
df['yhat'] = ysc.inverse_transform(df[['ynhat']])
We check the fit score:
reg.score(df[['p']], df['yn']) # 0.9999646718755011
We can also compute absolute and relative error for each point:
df['yaerr'] = df['yhat'] - df['y']
df['yrerr'] = df['yaerr']/df['y']
Final result is:
x p y yn ynhat yhat yaerr yrerr
0 0 15 666.666667 -0.834823 -0.833633 668.077018 1.410352 0.002116
1 1 14 714.285714 -0.794636 -0.795247 713.562403 -0.723312 -0.001013
2 2 13 769.230769 -0.748267 -0.749627 767.619013 -1.611756 -0.002095
3 3 12 833.333333 -0.694169 -0.693498 834.128425 0.795091 0.000954
4 4 11 909.090909 -0.630235 -0.629048 910.497550 1.406641 0.001547
5 5 10 1000.000000 -0.553514 -0.555029 998.204445 -1.795555 -0.001796
6 6 9 1111.111111 -0.459744 -0.460002 1110.805275 -0.305836 -0.000275
7 7 8 1250.000000 -0.342532 -0.341099 1251.697707 1.697707 0.001358
8 8 7 1428.571429 -0.191830 -0.193295 1426.835676 -1.735753 -0.001215
9 9 6 1666.666667 0.009105 0.010458 1668.269984 1.603317 0.000962
10 10 5 2000.000000 0.290414 0.291060 2000.764717 0.764717 0.000382
11 11 4 2500.000000 0.712379 0.690511 2474.088446 -25.911554 -0.010365
12 12 3 3333.333333 1.415652 1.416874 3334.780642 1.447309 0.000434
13 13 2 5000.000000 2.822199 2.821420 4999.076799 -0.923201 -0.000185
Graphically it leads to:
fig, axe = plt.subplots()
axe.plot(df['p'], df['y'], label='$y(p)$')
axe.plot(df['p'], df['yhat'], 'o', label='$\hat{y}(p)$')
axe.set_title(r"SVR Fit for $y(x) = \frac{k}{x-a}$")
axe.set_xlabel('$p = x-a$')
axe.set_ylabel('$y, \hat{y}$')
axe.legend()
axe.grid()
Linearization
In the example above we could not use the poly kernel; we had to use the rbf kernel instead. This is because if we aim to fit a rational function with a polynomial, it is better to transform the data before fitting, using a substitution such as p = x/(x-b) in the first place. In that case it merely boils down to performing a linear regression. The example below shows that it works:
Scaler and transformation can be composed into a pipeline as well. We define a pipeline that linearizes and scales the problem:
# Rational Fraction Substitution followed by Standardization
ysc = make_pipeline(
    FunctionTransformer(func=lambda x: x/(x+1),
                        inverse_func=lambda x: x/(1-x),
                        check_inverse=True),
    StandardScaler()
)
Then we re-scale the target with this new pipeline and regress the data using classical OLS:
ysc.fit(df[['y']])
df['yn'] = ysc.transform(df[['y']])
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])
Which provides a correct result:
reg.score(df[['p']], df['yn']) # 0.9999998722172933
This second solution takes advantage of a known linearization and thus removes the need to parametrize the model.
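Putting the linearized variant together end to end (my own consolidation of the steps above, including the inverse transform back to the original scale of y):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer

x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15 - x})
df['y'] = 1e4/df['p']

# Target-side pipeline: rational-fraction substitution, then standardization
ysc = make_pipeline(
    FunctionTransformer(func=lambda v: v/(v+1),
                        inverse_func=lambda v: v/(1-v),
                        check_inverse=True),
    StandardScaler()
)
df['yn'] = ysc.fit_transform(df[['y']]).ravel()

# Feature-side pipeline: standardize p, then ordinary least squares
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])
print(reg.score(df[['p']], df['yn']))

# Back to the original scale of y
df['yhat'] = ysc.inverse_transform(reg.predict(df[['p']]).reshape(-1, 1)).ravel()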

PolynomialFeatures sklearn

Here is my code:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
X_arr = []
Y_arr = []
with open('input.txt') as fp:
    for line in fp:
        b = line.split("|")
        x, y = b
        X_arr.append(int(x))
        Y_arr.append(int(y))
X=np.array([X_arr]).T
print(X)
y=np.array(Y_arr)
print(y)
model = make_pipeline(PolynomialFeatures(degree=2),
LinearRegression(fit_intercept = False))
model.fit(X,y)
X_predict = np.array([[3]])
print(model.predict(X_predict))
Please, I have a question about this line:
model = make_pipeline(PolynomialFeatures(degree=2),
How can I choose this value (2, 3, 4, etc.)? Is there a method to set this value dynamically?
For example, I have this test file:
1 1
2 4
4 16
5 75
for the first three lines the model is
y = a*x*x + b*x + c (b=c=0)
for the last line, the model is:
y = a*x*x*x + b*x + c (b=c=0)
This is by no means a fool-proof way to approach your problem, but I think I understand what you want, perhaps:
import math

epsilon = 1e-2

# Do your error checking on size of array
...

# Warning: This only works for positive x; the logarithm is not defined for negative bases.
# If you really want to, take abs(X_arr[0]) and check that the degree is even.
deg = math.log(Y_arr[0], X_arr[0])
assert deg % 1 < epsilon
for x, y in zip(X_arr[1:], Y_arr[1:]):
    if x == y == 1: continue  # Every x^n fits (1, 1) and it would cause a divide by zero
    assert abs(math.log(y, x) - deg) < epsilon
...
PolynomialFeatures(degree=int(deg))
This checks to see if the degree is an integer value, and that all other data points fit the same polynomial.
This is purely a heuristic. If you have a bunch of data points of (1,1), there's no way you can decide what the actual degree is. Without any assumptions of the data, you cannot determine the degree of the polynomial x^n.
This is just an example of how you'd implement such a heuristic, and please don't use this in production.
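If you prefer a data-driven way to pick the degree instead of this heuristic, a common approach (my own sketch, not part of the answer above) is to cross-validate over a range of candidate degrees and keep the one that generalizes best:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Toy data following y = x**2, analogous to the question's first three points
X = np.array([[1], [2], [3], [4], [5], [6]])
y = (X ** 2).ravel()

pipe = make_pipeline(PolynomialFeatures(), LinearRegression(fit_intercept=False))
# "polynomialfeatures" is the step name make_pipeline generates automatically
search = GridSearchCV(pipe,
                      param_grid={'polynomialfeatures__degree': [1, 2, 3, 4]},
                      cv=3)
search.fit(X, y)
print(search.best_params_)  # should pick a low degree that explains the data (2 here)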

Canonical Discriminant Function in Python sklearn

I am learning about Linear Discriminant Analysis and am using the scikit-learn module. I am confused by the "coef_" attribute from the LinearDiscriminantAnalysis class. As far as I understand, these are the discriminant function coefficients (sklearn calls them weight vectors). Since there should be (n_classes-1) discriminant functions, I would expect the coef_ attribute to be an array with shape (n_components, n_features), but instead it prints an (n_classes, n_features) array. Below is an example of this using the Iris dataset example from sklearn. Since there are 3 classes and 2 components, I would expect print(lda.coef_) to give me a 2x4 array instead of a 3x4 array...
Maybe I'm misinterpreting what the weight vectors are, perhaps they are the coefficients for the classification function?
And how do I get the coefficients for each variable in each discriminant/canonical function?
Code here:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
target_names = iris.target_names
lda = LinearDiscriminantAnalysis(n_components=2,store_covariance=True)
X_r = lda.fit(X, y).transform(X)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']  # example palette for the three classes
for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], alpha=.8, color=color,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('Function 1 (%.2f%%)' %(lda.explained_variance_ratio_[0]*100))
plt.ylabel('Function 2 (%.2f%%)' %(lda.explained_variance_ratio_[1]*100))
plt.title('LDA of IRIS dataset')
print(lda.coef_)
#output -> [[ 6.24621637 12.24610757 -16.83743427 -21.13723331]
# [ -1.51666857 -4.36791652 4.64982565 3.18640594]
# [ -4.72954779 -7.87819105 12.18760862 17.95082737]]
You can calculate the coefficients with the following code (note that it expects X to be a pandas DataFrame, since it uses X.columns):
import numpy as np
import pandas as pd

def LDA_coefficients(X, lda):
    nb_col = X.shape[1]
    matrix = np.zeros((nb_col+1, nb_col), dtype=int)
    Z = pd.DataFrame(data=matrix, columns=X.columns)
    # Unit rows: row j activates feature j only; the last row stays all zeros
    for j in range(0, nb_col):
        Z.iloc[j, j] = 1
    LD = lda.transform(Z)
    nb_funct = LD.shape[1]
    results = pd.DataFrame()
    index = ['const']
    for j in range(0, LD.shape[0]-1):
        index = np.append(index, 'C'+str(j+1))
    # The constant is the transform of the zero row; each coefficient is the
    # difference between the transform of a unit row and that constant
    for i in range(0, LD.shape[1]):
        coef = [LD[-1][i]]
        for j in range(0, LD.shape[0]-1):
            coef = np.append(coef, LD[j][i]-LD[-1][i])
        result = pd.Series(coef)
        result.index = index
        column_name = 'LD' + str(i+1)
        results[column_name] = result
    return results
Before calling this function you need to complete the linear discriminant analysis:
lda = LinearDiscriminantAnalysis()
lda.fit(X,y)
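If what you are after are the canonical discriminant coefficients themselves, note (my own sketch, assuming the default solver='svd') that LinearDiscriminantAnalysis also exposes them directly as scalings_, with one column of feature weights per discriminant function; transform() is then just centering followed by a projection onto those columns:
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
X, y = iris.data, iris.target

lda = LinearDiscriminantAnalysis(n_components=2)
X_r = lda.fit(X, y).transform(X)

print(lda.scalings_.shape)  # (4, 2): one column per discriminant/canonical function

# With solver='svd' (the default), transform() centers on lda.xbar_ and
# projects onto scalings_, truncated to n_components columns.
manual = (X - lda.xbar_) @ lda.scalings_
print(np.allclose(X_r, manual[:, :2]))  # expected True under the assumption above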

Linear Regression Returns Different Results Than Synthetic Parameters

I am trying this code:
from sklearn import linear_model
import numpy as np
x1 = np.arange(0,10,0.1)
x2 = x1*10
y = 2*x1 + 3*x2
X = np.vstack((x1, x2)).transpose()
reg_model = linear_model.LinearRegression()
reg_model.fit(X,y)
print reg_model.coef_
# should be [2,3]
print reg_model.predict([5,6])
# should be 2*5 + 3*6 = 28
print reg_model.intercept_
# perfectly at the expected value of 0
print reg_model.score(X,y)
# seems to be rather confident to be right
The results are
[ 0.31683168 3.16831683]
20.5940594059
0.0
1.0
and therefore not what I expected - they are not the same as the parameters used to synthesize the data. Why is this so?
Your problem is with the uniqueness of solutions: because both dimensions carry the same information (applying a linear transform to one dimension does not make the data unique in the eyes of this model), there are infinitely many coefficient combinations that fit your data equally well. If you apply a non-linear transformation to your second dimension, you will see the desired output.
from sklearn import linear_model
import numpy as np
x1 = np.arange(0,10,0.1)
x2 = x1**2
X = np.vstack((x1, x2)).transpose()
y = 2*x1 + 3*x2
reg_model = linear_model.LinearRegression()
reg_model.fit(X,y)
print reg_model.coef_
# should be [2,3]
print reg_model.predict([[5,6]])
# should be 2*5 + 3*6 = 28
print reg_model.intercept_
# perfectly at the expected value of 0
print reg_model.score(X,y)
Outputs are
[ 2. 3.]
[ 28.]
-2.84217094304e-14
1.0
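To make the non-uniqueness concrete (my own check, not part of the original answer): since x2 = 10*x1, the target is y = 2*x1 + 3*x2 = 32*x1, so any coefficient pair (a, b) with a + 10*b = 32 reproduces the training data exactly. The coefficients sklearn returned satisfy that identity, which also explains the prediction for [5, 6]:
import numpy as np

coef = np.array([0.31683168, 3.16831683])  # coefficients reported in the question
print(coef[0] + 10 * coef[1])              # ~32.0, i.e. a + 10*b = 32
print(coef[0] * 5 + coef[1] * 6)           # ~20.594, the prediction seen in the question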
