Matlab
I have some PLSR regression in Matlab that I need to translate to Python. The Matlab code is as follows:
% PLSR with 15 latent variables
[~,~,~,~,~,~,MSEcv] = plsregress(X,y,15,"cv",5,"MCReps",100);
rRMSE = 100*sqrt(MSEcv(2,:))/(max(y)-min(y));
% Find optimal number of latent variables
minl = 1; maxl = 15;
nlv = find(rRMSE(minl+1:maxl+1)==min(rRMSE(minl+1:maxl+1)))-1+minl;
nlv = nlv(1)
% Fit with optimal number of latent variables
[XL,yl,XS,YS,beta,PCTVAR,MSE,stats] = plsregress(X,y,nlv);
(PLS regression with a maximum of 15 latent variables, cross-validation with 1/5 of the values and 100 Monte-Carlo repetitions.)
Python
I tried to implement the code in Python as follows:
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
# Maximum number of latent variables
mlv = 15
# Cross-validation
cv = 5
# Monte-Carlo repetitions
mcr = 100
# 1...mlv to fit models with various number of latent variables
try_latent_vars = np.arange(1, mlv + 1)
##----------------------------------------------------------------------------|
# Define function to fit the PLS model
def optimise_pls_cv(X_vals, y_vals, n_comp, crossval, mcreps):
'''Fit PLS regression model using cross-validation.'''
# Define PLS object
pls = PLSRegression(n_components = n_comp)
# Cross-validation fit
cv_split = ShuffleSplit(n_splits = mcreps, test_size = 1/crossval,
random_state = 0)
cvs = cross_validate(pls, X_vals, y_vals, cv = cv_split,
scoring = ["r2", "neg_mean_squared_error"])
mean_r2_error = np.mean(cvs["test_r2"])
test_mse = -np.mean(cvs["test_neg_mean_squared_error"])
return pls, mean_r2_error, test_mse
##----------------------------------------------------------------------------|
# Fit PLS model
# Empty lists to store R^2 and mean squared error values
r2s = []
mses = []
for n_comp in try_latent_vars:
model, r2, mse = optimise_pls_cv(X_vals = X.T,
y_vals = y,
n_comp = n_comp,
crossval = cv,
mcreps = mcr)
r2s.append(r2)
mses.append(mse)
index_max_r2s = np.argmax(r2s)
lv = try_latent_vars[index_max_r2s]
##----------------------------------------------------------------------------|
## Fit model with optimized number of components
model, r2, mse = optimise_pls_cv(X_vals = X.T,
y_vals = y,
n_comp = lv,
crossval = cv,
mcreps = mcr)
metrics = {"R2" : r2,
"MSE" : mse}
metrics_str = "R2: %0.4f, MSE: %0.4f" % (r2, mse)
In both code snippets, X and y are a set of spectral information and some variable to predict from the spectra, respectively.
The problem
Unfortunately, I get vastly different results with those two versions (the R^2 values for the Matlab version are all 0.5 or higher, while the Python version ends up with R^2 values close to zero or negative). Why is that, and how can I solve this issue?
Example data
Here is some example data as .txt files in Dropbox. Read as
import pandas as pd
X = pd.read_table("/path/to/saved/files/X.txt", header = None)
y = pd.read_table("/path/to/saved/files/y.txt", header = None)
I tried to convert your code to Python and this is what I got...
import numpy as np
from sklearn.cross_decomposition import PLSRegression
# PLSR with 15 latent variables and 5-fold cross-validation, 100 Monte Carlo repetitions
pls = PLSRegression(n_components=15)
pls.fit(X, y)
MSEcv = pls.score(X, y)
# Compute relative root mean squared error (rRMSE) for each number of latent variables
rRMSE = 100 * np.sqrt(MSEcv[1:]) / (max(y) - min(y))
# Find optimal number of latent variables
minl = 1
maxl = 15
nlv = np.argmin(rRMSE) + 1
# Fit with optimal number of latent variables
pls = PLSRegression(n_components=nlv)
pls.fit(X, y)
XL, yl, XS, YS, beta = pls.x_loadings_, pls.y_loadings_, pls.x_scores_, pls.y_scores_, pls.coef_
PCTVAR = pls.score(X, y)
MSE = np.mean((y - pls.predict(X)) ** 2)
stats = {'MSE': MSE}
print(f"Optimal number of latent variables: {nlv}")
In the Python code, I used scikit-learn's PLSRegression class to perform PLSR. I first fit the model with 15 latent variables and 5-fold cross-validation, and compute the mean squared error (MSE) for each fold using the score method. Then I compute the relative root mean squared error (rRMSE) for each number of latent variables, and find the optimal number of latent variables with the lowest rRMSE. Finally, I fit the model with the optimal number of latent variables, and compute the PLS components, PCTVAR, MSE, and statistics.
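For reference, here is a minimal sketch (not tested against the asker's data) of how the MATLAB-style per-component CV curve could be reproduced with scikit-learn, assuming X is arranged as (n_samples, n_features), i.e. rows are observations as plsregress expects, and y is a 1-D array; the component count with the lowest relative RMSE then plays the role of nlv in the MATLAB script:
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Assumption: X has shape (n_samples, n_features) and y is 1-D
max_lv = 15
cv_split = ShuffleSplit(n_splits=100, test_size=1/5, random_state=0)
rmse_per_lv = []
for n_comp in range(1, max_lv + 1):
    pls = PLSRegression(n_components=n_comp)
    scores = cross_val_score(pls, X, y, cv=cv_split,
                             scoring="neg_mean_squared_error")
    rmse_per_lv.append(np.sqrt(-scores.mean()))
# Relative RMSE in percent, analogous to the rRMSE line in the MATLAB code
rrmse = 100 * np.array(rmse_per_lv) / (y.max() - y.min())
nlv = int(np.argmin(rrmse)) + 1
print("Selected number of latent variables:", nlv)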
Related
I am using the Extreme Learning Machine classifier for hand gesture recognition, but I still only get 20% accuracy. Can anyone help me implement an iterative training loop to improve the accuracy? I am a beginner, and here is the code I am using: I split the dataset that I prepared into train and test parts after normalization, train it using the train function by calculating the Moore-Penrose inverse, and then predict the class of each gesture using the prediction function.
# -*- coding: utf-8 -*-
"""
Created on Sat Jul 4 17:52:25 2020
#author: lenovo
"""
# -*- coding: utf-8 -*-
__author__ = 'Sarra'
import numpy as np
class ELM(object):
def __init__(self, inputSize, outputSize, hiddenSize):
"""
Initialize weight and bias between input layer and hidden layer
Parameters:
inputSize: int
The number of input layer dimensions or features in the training data
outputSize: int
The number of output layer dimensions
hiddenSize: int
The number of hidden layer dimensions
"""
self.inputSize = inputSize
self.outputSize = outputSize
self.hiddenSize = hiddenSize
# Initialize random weight with range [-0.5, 0.5]
self.weight = np.matrix(np.random.uniform(-0.5, 0.5, (self.hiddenSize, self.inputSize)))
# Initialize random bias with range [0, 1]
self.bias = np.matrix(np.random.uniform(0, 1, (1, self.hiddenSize)))
self.H = 0
self.beta = 0
def sigmoid(self, x):
"""
Sigmoid activation function
Parameters:
x: array-like or matrix
The value that the activation output will look for
Returns:
The results of activation using sigmoid function
"""
return 1 / (1 + np.exp(-1 * x))
def predict(self, X):
"""
Predict the results of the training process using test data
Parameters:
X: array-like or matrix
Test data that will be used to determine output using ELM
Returns:
Predicted results or outputs from test data
"""
X = np.matrix(X)
y = self.sigmoid((X * self.weight.T) + self.bias) * self.beta
return y
def train(self, X, y):
"""
Extreme Learning Machine training process
Parameters:
X: array-like or matrix
Training data that contains the value of each feature
y: array-like or matrix
Training data that contains the value of the target (class)
Returns:
The results of the training process
"""
X = np.matrix(X)
y = np.matrix(y)
# Calculate hidden layer output matrix (Hinit)
self.H = (X * self.weight.T) + self.bias
# Sigmoid activation function
self.H = self.sigmoid(self.H)
# Calculate the Moore-Penrose pseudoinverse matrix
H_moore_penrose = np.linalg.pinv(self.H.T * self.H) * self.H.T
# Calculate the output weight matrix beta
self.beta = H_moore_penrose * y
return self.H * self.beta
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# read the dataset
database = pd.read_csv(r"C:\\Users\\lenovo\\tensorflow\\tensorflow1\\Numpy-ELM\\hand_gestures_database.csv")
#separate data from labels
data = database.iloc[:, 1:].values.astype('float64')
#normalize data
#n_data = preprocessing.minmax_scale(data, feature_range=(0, 1), axis=0, copy=True)
scaler = MinMaxScaler()
scaler.fit(data)
n_data = scaler.transform(data)
#identify the labels
label = database.iloc[:, 0]
#encoding labels to transform each label to a value between 0 to number of labels-1
def prepare_targets(n):
le =preprocessing.LabelEncoder()
le.fit(n)
label_enc = le.transform(n)
return label_enc
label_enc = prepare_targets(label)
CLASSES = 10
#transform the value of each label to a binary vector
target = np.zeros([label_enc.shape[0], CLASSES])
for i in range(label_enc.shape[0]):
target[i][label_enc[i]] = 1
target.view(type=np.matrix)
print("target",target)
# Create instance of ELM object with 10 hidden neurons
maxx=0
for u in range(10):
elm = ELM(45, 10, 10)
# Train test split 80:20
X_train, X_test, y_train, y_test = train_test_split(n_data, target, test_size=0.34, random_state=1)
elm.train(X_train,y_train)
y_pred = elm.predict(X_test)
# Train data
correct = 0
total = y_pred.shape[0]
for i in range(total):
predicted = np.argmax(y_pred[i])
test = np.argmax(y_test[i])
correct = correct + (1 if predicted == test else 0)
print('Accuracy: {:f}'.format(correct/total))
if(correct/total>maxx):
maxx=correct/total
print(maxx)
###confusion matrix
import seaborn as sns
y_pred=np.argmax(y_pred, axis=1)
y_true=(np.argmax(y_test, axis=1))
target_names=["G1","G2","G3","G4","G5","G6","G7","G8","G9","G10"]
cm=confusion_matrix(y_true, y_pred)
#cmn = cm.astype('float')/cm.sum(axis=1)[:, np.newaxis]*100
fig, ax = plt.subplots(figsize=(15,8))
sns.heatmap(cm/np.sum(cm), annot=True, fmt='.2f',xticklabels=target_names, yticklabels=target_names, cmap='Blues')
#sns.heatmap(cmn, annot=True, fmt='.2%', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.ylim(-0.5, len(target_names) + 0.5)
plt.show(block=False)
def perf_measure(y_actual, y_pred):
TP = 0
FP = 0
TN = 0
FN = 0
for i in range(len(y_pred)):
if y_actual[i]==y_pred[i]==1:
TP += 1
if y_pred[i]==1 and y_actual[i]!=y_pred[i]:
FP += 1
if y_actual[i]==y_pred[i]==0:
TN += 1
if y_pred[i]==0 and y_actual[i]!=y_pred[i]:
FN += 1
return(TP, FP, TN, FN)
TP, FP, TN, FN=perf_measure(y_true, y_pred)
print("precision",TP/(TP+FP))
print("sensivity",TP/(TP+FN))
print("specifity",TN/(TN+FP))
print("accuracy",(TP+TN)/(TN+FP+FN+TP))
To your question about whether you can implement an iterative training loop for an ELM:
No, you cannot. An ELM consists of one random layer followed by an output layer. Because the first layer is fixed, this is essentially a linear model, and we can find the optimal output weights using the pseudo-inverse, as you pointed out.
However, since you already find the optimal solution for this model in one step, there is no direct way to iteratively improve this result.
I would, however, not advise using an extreme learning machine.
Besides the controversy about their origin, they are very limited in the functions they can learn.
There are other well-established approaches for gesture classification that are likely more useful to pursue.
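To make the one-shot nature of the fit concrete, here is a minimal self-contained sketch of the closed-form ELM solution (random hidden layer, output weights via the Moore-Penrose pseudo-inverse); the toy data and sizes below are made up for illustration only:
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 100 samples, 45 features, 10 one-hot encoded classes
X = rng.normal(size=(100, 45))
T = np.eye(10)[rng.integers(0, 10, size=100)]

# Random, fixed hidden layer: these weights and biases are never trained
W = rng.uniform(-0.5, 0.5, size=(10, 45))   # hidden_size x input_size
b = rng.uniform(0, 1, size=(1, 10))
H = 1 / (1 + np.exp(-(X @ W.T + b)))        # sigmoid hidden activations

# Output weights in a single step via the pseudo-inverse: this is already
# the least-squares optimum, so there is nothing left to iterate on.
beta = np.linalg.pinv(H) @ T
predictions = np.argmax(H @ beta, axis=1)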
I am trying to learn the manifold of a particular class (say digit 1 for the MNIST dataset) using the latent space density estimation. I am using a Gaussian model (Mixture Density Network, MDN) to estimate the densities of the latent space (obtained through an autoencoder). I am using a neural network to approximate mean and variance for the MDN.
I am facing a problem: all of my estimated densities, and consequently the negative log-likelihood loss, come out constant. This can be true when learning the manifold of digit 1 (although there should be some variance), but the loss should not stay at that same constant value when testing with the other digits.
Below you can see the loss plot for the learned class (normal class) and the new class (anomaly class).
Below is the code for the MDN and the MDN loss estimation:
import sys
import torch
import warnings
import numpy as np
import torch.nn as nn
import seaborn as sns
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torch.autograd import Variable
from sklearn.model_selection import train_test_split
from scipy import ndimage
#from hessian import hessian
warnings.filterwarnings('ignore')
COEFS = 10
IN_DIM =32
OUT_DIM = IN_DIM
class MDN(nn.Module):
def __init__(self, input_dim=IN_DIM, out_dim=OUT_DIM, layer_size=10, coefs=COEFS, test = False, sd =0.2):
super(MDN, self).__init__()
self.fc = nn.Linear(input_dim, layer_size)
self.fc2 = nn.Linear(layer_size, layer_size)
self.pi = nn.Linear(layer_size, coefs)
self.mu = nn.Linear(layer_size, out_dim*coefs) # mean
# self.mu = nn.Linear(layer_size, coefs) # mean
self.sigma_sq = nn.Linear(layer_size, coefs*out_dim) # isotropic independent variance
self.out_dim = out_dim
self.coefs = coefs
self.test = test
self.sd = sd
def forward(self, x):
if self.test:
y = x
else:
## Adding Noise ##
y = add_noise(x, noise_type="gaussian", sd =self.sd)
# y =x
## Learning distribution parameters ##
for i in range(x.size(0)):
x = F.leaky_relu(self.fc(y[i]))
x = F.leaky_relu(self.fc2(x))
pi = F.softmax(self.pi(x), dim=-1)
sigma_sq = F.softplus(self.sigma_sq(x).view(self.coefs,-1)) #logvar
mu = self.mu(x).view(self.coefs,-1) #mean
return pi, mu, sigma_sq
'''
functions to compute the log-likelihood of data according to a gaussian mixture model.
All the computations are done with log because exp will lead to numerical underflow errors.
'''
def log_gaussian(x, mean, logvar):
'''
Computes the Gaussian log-likelihoods
Parameters:
x: [samples,features] data samples
mean: [features] Gaussian mean (in a features-dimensional space)
logvar: [features] the logarithm of variances [no linear dependence hypothesis: we assume one variance per dimension]
Returns:
[samples] log-likelihood of each sample
'''
a = (x - mean) ** 2 # works on multiple samples thanks to tensor broadcasting
log_p = (logvar + a / logvar.exp()).sum(-1)
log_p = -0.5 * ( np.log(2 * np.pi) + log_p )
# log_p = -0.5 * log_p #forgetting constants
return log_p
def log_gmm(x, means, logvars, weights, total=True):
'''
Computes the Gaussian Mixture Model log-likelihoods
Parameters:
x: [samples,features] data samples
means: [K,features] means for K Gaussians
logvars: [K,features] logarithm of variances for K Gaussians [no linear dependence hypothesis: we assume one variance per dimension]
weights: [K] the weights of each Gaussian
total: whether to sum the probabilities of each Gaussian or not (see returning value)
Returns:
[samples] if total=True. Log-likelihood of each sample
[K,samples] if total=False. Log-likelihood of each sample for each model
'''
res = log_gaussian(
x[None,:,:], # (1,examples,features)
means[:,None,:], # (K,1,features)
logvars[:,None,:] # (K,1,features)
)
# res is (K,examples)
res = weights[:,None] * res
if total:
return res.sum(0)
else:
return res
# for Mishra
def mdn_loss_function(x, means, logvars, weights, test = False):
if test:
res = -log_gmm(x, means, logvars, weights)
else:
res = torch.mean(-log_gmm(x, means, logvars, weights))
return res
def loss_fn(y, pi, mu, sigma_sq, model, sd =0.2, test = False):
loss1 = mdn_loss_function(y, mu,sigma_sq, pi, test = test)
if test:
return loss1
else:
# hes = hessian(loss1, model.parameters(), out=h)
hes = torch.tensor(ndimage.filters.laplace(y.detach().cpu().numpy())).cuda()
# print(f"hessian : {hes}")
loss = loss1 + 0.5*(torch.tensor(sd)**2)*torch.sum(hes)
return loss
##### Adding Noise ############
def add_noise(latent, noise_type="gaussian", sd =0.2):
"""Here we add noise to the latent features concatenated from the 4 autoencoders.
Arguments:
'gaussian' (string): Gaussian-distributed additive noise.
'speckle' (string) : Multiplicative noise using out = image + n*image, where n is uniform noise with specified mean & variance.
'sd' (integer) : standard deviation used for generating noise
Input :
latent : numpy array or cuda tensor.
Output:
Array: Noise added input, can be np array or cuda tensor.
"""
assert sd >= 0.0
if noise_type=="gaussian":
mean=0.
n = torch.distributions.Normal(torch.tensor([mean]), torch.tensor([sd]))
noise = n.sample(latent.size()).squeeze(2).cuda()
latent = latent + noise
return latent
if noise_type=="speckle":
noise = torch.randn(latent.size()).cuda()
latent = latent + latent*noise
return latent
I managed to successfully implement multilinear regression using only NumPy for the Iris dataset. I wanted to do the same for the Boston housing dataset, but my model won't learn and I have no idea why.
import numpy as np
import pandas as pd
# read data and split into test and training sets
data = pd.read_csv('train.csv')
data = (data - data.mean()) / data.std() # normalize data
split_data = np.random.rand(len(data)) < 0.8
train_data = data[split_data].round(5)
test_data = data[~split_data]
# create matrices
input_features_train = train_data.drop(['ID', 'medv'], 1).values
output_feature_train = train_data.medv.values.reshape(-1, 1)
ones = np.ones([input_features_train.shape[0], 1])
input_features_train = np.concatenate((ones, input_features_train), 1)
weight = np.zeros([1, 14])
def computeCost(X, y, theta):
summed = np.power(((X @ theta.T) - y), 2)
return np.sum(summed) / (2 * len(X))
def gradientDescent(X, y, theta, iters, alpha):
costs = np.zeros(iters)
for i in range(iters):
theta = theta - (alpha / len(X)) * np.sum(X * (X @ theta.T - y), 0)
costs[i] = computeCost(X, y, theta)
return theta, costs
learning_rate = 0.01
iterations = 100000
weights, cost = gradientDescent(input_features_train, output_feature_train, weight, iterations, learning_rate)
print("Weights: ", weights)
finalCost = computeCost(input_features_train, output_feature_train, weights)
# test
input_features_test = test_data.drop(['ID', 'medv'], 1).values
output_feature_test = test_data.medv.values.reshape(-1, 1)
ones = np.ones([input_features_test.shape[0], 1])
input_features_test = np.concatenate((ones, input_features_test), 1)
def test_data(input_features, output_feature, weights):
predictions = np.round(np.dot(input_features, weights.T))
for i in range(len(output_feature)):
predicted = predictions[i]
success = predictions[i] == output_feature[i]
print('For features: ', input_features[i], ' housing price should be ', output_feature[i])
print("Predicted: %f" % predicted)
print("Is success? ", success)
print()
test_data(input_features_test, output_feature_test, weights)
predictions = np.round(np.dot(input_features_test, weights.T))
accuracy = (sum(predictions == output_feature_test) / float(len(output_feature_test)) * 100)[0]
print("Accuracy of the model is ", accuracy, "% after ", iterations, "iterations")
Example output goes as follows:
Weights: [[ 0.01465871 -0.11583742 0.17729105 0.01249782 0.09822299 -0.31249182
0.25208063 -0.00937766 -0.48751822 0.46772537 -0.27637035 -0.1590125
0.12926108 -0.48910136]]
For features: [ 1. -0.44852959 -0.47141352 0.09095532 -0.25240023 0.13793157
0.46506236 0.03105118 -0.62153314 -0.98758424 -0.79769195 1.18594974
0.37563165 -0.40259248] housing price should be [-0.04019949]
Predicted: 0.000000
Is success? [False]
I tried even 10000000 iterations, and it still fails all tests and has 0% accuracy. On the Iris dataset I managed to get 100% with this model, so I don't understand why it won't work here.
I suspect it might be something with the data normalization, as without it I get a RuntimeWarning: overflow encountered in power on the line
summed = np.power(((X @ theta.T) - y), 2)
which I also don't know why is happening.
Could you please point me in the right direction? Thanks!
I really suggest you use scikit-learn for this. You can use SGDRegressor or CatBoostRegressor, which have built-in support for approaches like this.
The main motivation behind this suggestion is that implementing gradient descent manually may lead to logical errors that go undetected.
Try solving it with scikit-learn; that might help.
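To illustrate that suggestion, here is a rough sketch with scikit-learn's SGDRegressor (assuming the same train.csv layout as in the question, with an ID column and medv as the target; the hyperparameters are only examples):
import pandas as pd
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('train.csv')                 # same file as in the question
X = data.drop(columns=['ID', 'medv']).values
y = data['medv'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
# Scaling plus SGD-based linear regression instead of hand-rolled gradient descent
model = make_pipeline(StandardScaler(),
                      SGDRegressor(max_iter=10000, tol=1e-6, random_state=0))
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))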
Context: I am trying to create a generic function to optimize the cost of any regression problem using polynomial regression (of any specified degree).
I am trying to fit my model to the load_boston dataset (with the house price as the label and 13 features).
I used multiple polynomial degrees and multiple learning rates and epoch counts (with gradient descent), and the MSE comes out very high even on the training dataset (I am using 100% of the data to train the model and checking the cost on the same data, but the MSE is still very high).
import tensorflow as tf
from sklearn.datasets import load_boston
def polynomial(x, coeffs):
y = 0
for i in range(len(coeffs)):
y += coeffs[i]*x**i
return y
def initial_parameters(dimensions, data_type, degree): # list number of dims/features and degree
thetas = [tf.Variable(0, dtype=data_type)] # the constant theta/bias
for i in range(degree):
thetas.append(tf.Variable( tf.zeros([dimensions, 1], dtype=data_type)))
return thetas
def regression_error(x, y, thetas):
hx = thetas[0] # constant thetas - no need to have 1 for each variable (e.g x^0*th + y^0*th...)
for i in range(1, len(thetas)):
hx = tf.add(hx, tf.matmul( tf.pow(x, i), thetas[i]))
return tf.reduce_mean(tf.squared_difference(hx, y))
def polynomial_regression(x, y, data_type, degree, learning_rate, epoch): #features=dimensions=variables
thetas = initial_parameters(x.shape[1], data_type, degree)
cost = regression_error(x, y, thetas)
init = tf.initialize_all_variables()
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
with tf.Session() as sess:
sess.run(init)
for epoch in range(epoch):
sess.run(optimizer)
return cost.eval()
x, y = load_boston(True) # yes just use the entire dataset
for deg in range(1, 2):
for lr in range(-8, -5):
error = polynomial_regression(x, y, tf.float64, deg, 10**lr, 100 )
print (deg, lr, error)
It outputs 97.3 even though most of the labels are around 30 (degree = 1, learning rate = 10^-6).
What is wrong with the code?
The problem is that the different features are on different orders of magnitude and hence are not compatible with a learning rate that is the same for all features. Moreover, when using a non-zero variable initialization, one has to make sure that these initial values are compatible with the feature values as well.
In [1]: from sklearn.datasets import load_boston
In [2]: x, y = load_boston(True)
In [3]: x.std(axis=0)
Out[3]:
array([8.58828355e+00, 2.32993957e+01, 6.85357058e+00, 2.53742935e-01,
1.15763115e-01, 7.01922514e-01, 2.81210326e+01, 2.10362836e+00,
8.69865112e+00, 1.68370495e+02, 2.16280519e+00, 9.12046075e+01,
7.13400164e+00])
In [4]: x.mean(axis=0)
Out[4]:
array([3.59376071e+00, 1.13636364e+01, 1.11367787e+01, 6.91699605e-02,
5.54695059e-01, 6.28463439e+00, 6.85749012e+01, 3.79504269e+00,
9.54940711e+00, 4.08237154e+02, 1.84555336e+01, 3.56674032e+02,
1.26530632e+01])
A common approach is to normalize the input data (e.g. to have zero mean and unit variance) and to choose the initial weights randomly (e.g. normal distribution, std.dev. = 1). sklearn.preprocessing offers various functionality for these cases.
PolynomialFeatures can be used to generate the polynomial features automatically.
StandardScaler scales the data to zero mean and unit variance.
pipeline.Pipeline can be used for convenience to combine these preprocessing steps.
The polynomial_regression function then reduces to:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipeline = Pipeline([
('poly', PolynomialFeatures(degree)),
('scaler', StandardScaler())
])
x = pipeline.fit_transform(x)
thetas = tf.Variable(tf.random_normal([x.shape[1], 1], dtype=data_type))
cost = tf.reduce_mean(tf.squared_difference(tf.matmul(x, thetas), y))
# Perform variable initialization and optimizer instantiation here.
# Run optimization over epochs.
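For completeness, here is a hedged sketch of how those remaining steps might look with the same TF1-style API used in the question (the degree, learning rate, and epoch count are arbitrary, and y is reshaped to a column vector so the cost broadcasts correctly):
import tensorflow as tf
from sklearn.datasets import load_boston
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

x, y = load_boston(True)
y = y.reshape(-1, 1)        # column vector so squared_difference broadcasts as intended
degree, data_type = 2, tf.float64

pipeline = Pipeline([('poly', PolynomialFeatures(degree)),
                     ('scaler', StandardScaler())])
x = pipeline.fit_transform(x)

thetas = tf.Variable(tf.random_normal([x.shape[1], 1], dtype=data_type))
cost = tf.reduce_mean(tf.squared_difference(tf.matmul(x, thetas), y))
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(1000):
        sess.run(optimizer)
    print("final training MSE:", sess.run(cost))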
I am trying to understand linear regression... here is a script that I tried to understand:
'''
A linear regression learning algorithm example using TensorFlow library.
Author: Aymeric Damien
Project: https://github.com/aymericdamien/TensorFlow-Examples/
'''
from __future__ import print_function
import tensorflow as tf
from numpy import *
import numpy
import matplotlib.pyplot as plt
rng = numpy.random
# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
# Training Data
train_X = numpy.asarray([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
7.042,10.791,5.313,7.997,5.654,9.27,3.1])
train_Y = numpy.asarray([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
2.827,3.465,1.65,2.904,2.42,2.94,1.3])
train_X=numpy.asarray(train_X)
train_Y=numpy.asarray(train_Y)
n_samples = train_X.shape[0]
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
# Launch the graph
with tf.Session() as sess:
sess.run(init)
# Fit all training data
for epoch in range(training_epochs):
for (x, y) in zip(train_X, train_Y):
sess.run(optimizer, feed_dict={X: x, Y: y})
# Display logs per epoch step
if (epoch+1) % display_step == 0:
c = sess.run(cost, feed_dict={X: train_X, Y:train_Y})
print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), \
"W=", sess.run(W), "b=", sess.run(b))
print("Optimization Finished!")
training_cost = sess.run(cost, feed_dict={X: train_X, Y: train_Y})
print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')
# Graphic display
plt.plot(train_X, train_Y, 'ro', label='Original data')
plt.plot(train_X, sess.run(W) * train_X + sess.run(b), label='Fitted line')
plt.legend()
plt.show()
My question is: what does this part represent?
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
And why are there random float numbers?
Also, could you show me some math formulas representing the cost, pred, and optimizer variables?
Let's try to put together some intuition and sources, along with the TF approach.
General intuition:
Regression as presented here is a supervised learning problem. In it, as defined in Russel&Norvig's Artificial Intelligence, the task is:
given a training set (X, y) of m input-output pairs (x1, y1), (x2, y2), ... , (xm, ym), where each output was generated by an unknown function y = f(x), discover a function h that approximates the true function f
To that end, the hypothesis function h somehow combines each x with the to-be-learned parameters, so that its output is as close to the corresponding y as possible, across the whole dataset. The hope is that the resulting function will be close to f.
But how to learn these parameters? In order to be able to learn, the model has to be able to evaluate itself. Here the cost (also called loss, energy, merit...) function comes into play: it is a metric function that compares the output of h with the corresponding y and penalizes big differences.
Now it should be clear what exactly the "learning" process is here: alter the parameters in order to achieve a lower value for the cost function.
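As a tiny, made-up NumPy illustration of this idea (a mean-squared-error cost and a single gradient step on one scalar parameter):
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])      # generated by the unknown f(x) = 2x
w = 0.0                            # the to-be-learned parameter

pred = w * x                       # hypothesis h(x, w)
cost = np.mean((pred - y) ** 2)    # penalizes big differences between h(x) and y

grad = np.mean(2 * (pred - y) * x) # derivative of the cost w.r.t. w
w = w - 0.1 * grad                 # "learning": move w to lower the cost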
Linear Regression:
The example that you are posting performs a parametric linear regression, optimized with gradient descent based on the mean squared error as cost function. Which means:
Parametric: The set of parameters is fixed. They are held in the exact same memory placeholders throughout the learning process.
Linear: The output of h is merely a linear (actually, affine) combination between the input x and your parameters. So if x and w are real-valued vectors of the same dimensionality, and b is a real number, it holds that h(x,w, b)= w.transposed()*x+b. Page 107 of the Deep Learning Book brings more quality insights and intuitions into that.
Cost function: Now this is the interesting part. The average squared error is a convex function. This means it has a single, global optimum, and furthermore, it can be directly found with the set of normal equations (also explained in the DLB). In the case of your example, the stochastic (and/or minibatch) gradient descent method is used: this is the preferred method when optimizing non-convex cost functions (which is the case in more advanced models like neural networks) or when your dataset has a huge dimensionality (also explained in the DLB).
Gradient descent: tf deals with this for you, so it is enough to say that GD minimizes the cost function by following its derivative "downwards", in small steps, until reaching a saddle point. If you totally need to know, the exact technique applied by TF is called automatic differentiation, kind of a compromise between the numeric and symbolic approaches. For convex functions like yours this point will be the global optimum, and (if your learning rate is not too big) it will always converge to it, so it doesn't matter which values you initialize your Variables with. The random initialization is necessary in more complex architectures like neural networks. There is some extra code regarding the management of the minibatches, but I won't get into that because it is not the main focus of your question.
The TensorFlow approach:
Deep Learning frameworks are nowadays about nesting lots of functions by building computational graphs (you may want to take a look at the presentation on DL frameworks that I did some weeks ago). For constructing and running the graph, TensorFlow follows a declarative style, which means that the graph has to be completely defined and compiled first, before it is deployed and executed. It is highly recommended to read this short wiki article, if you haven't yet. In this context, the setup is split in two parts:
Firstly, you define your computational Graph, where you put your dataset and parameters in memory placeholders, define the hypothesis and cost functions building on them, and tell tf which optimization technique to apply.
Then you run the computation in a Session and the library will be able to (re)load the data placeholders and perform the optimization.
The code:
The code of the example follows this approach closely:
Define the test data X and labels Y, and prepare a placeholder in the Graph for them (which is fed in the feed_dict part).
Define the 'W' and 'b' Variables for the parameters. They have to be Variables (not placeholders) because they will be updated during the Session.
Define pred (our hypothesis) and cost as explained before.
From this, the rest of the code should be clearer. Regarding the optimizer, as I said, tf already knows how to deal with this but you may want to look into gradient descent for more details (again, the DLB is a pretty good reference for that)
Cheers!
Andres
CODE EXAMPLES: GRADIENT DESCENT VS. NORMAL EQUATIONS
These small snippets generate simple multi-dimensional datasets and test both approaches. Notice that the normal equations approach doesn't require looping and brings better results. For small dimensionality (DIMENSIONS < 30k) it is probably the preferred approach:
from __future__ import absolute_import, division, print_function
import numpy as np
import tensorflow as tf
####################################################################################################
### GLOBALS
####################################################################################################
DIMENSIONS = 5
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
####################################################################################################
### GRADIENT DESCENT APPROACH
####################################################################################################
# dataset globals
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
ALPHA = 1e-8 # learning rate
LAMBDA = 0.5 # L2 regularization factor
TRAINING_STEPS = 1000
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)] # synthesize data
# ds = normalize_data(ds)
ds = [(x, [f(x)+noise()]) for x in ds] # add labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
x_test = tf.placeholder(tf.float32, shape=(_test_size, DIMENSIONS))
y_test = tf.placeholder(tf.float32, shape=(_test_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS)])
theta_0 = tf.Variable([[0.0]]) # don't forget the bias term!
# forward propagation
train_prediction = tf.matmul(x_train, theta)+theta_0
test_prediction = tf.matmul(x_test, theta) +theta_0
# cost function and optimizer
train_cost = (tf.nn.l2_loss(train_prediction - y_train)+LAMBDA*tf.nn.l2_loss(theta))/float(_train_size)
optimizer = tf.train.GradientDescentOptimizer(ALPHA).minimize(train_cost)
# test results
test_cost = (tf.nn.l2_loss(test_prediction - y_test)+LAMBDA*tf.nn.l2_loss(theta))/float(_test_size)
# run the computation
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized"); print(theta.eval())
for step in range(TRAINING_STEPS):
_, train_c, test_c = s.run([optimizer, train_cost, test_cost],
feed_dict={x_train: train_data, y_train: train_labels,
x_test: test_data, y_test: test_labels })
if (step%100==0):
# it should return bias close to zero and parameters all close to 1 (see definition of f)
print("\nAfter", step, "iterations:")
#print(" Bias =", theta_0.eval(), ", Weights = ", theta.eval())
print(" train cost =", train_c); print(" test cost =", test_c)
PARAMETERS_GRADDESC = tf.concat(0, [theta_0, theta]).eval()
print("Solution for parameters:\n", PARAMETERS_GRADDESC)
####################################################################################################
### NORMAL EQUATIONS APPROACH
####################################################################################################
# dataset globals
DIMENSIONS = 5
DS_SIZE = 5000
TRAIN_RATIO = 0.6 # 60% of the dataset is used for training
_train_size = int(DS_SIZE*TRAIN_RATIO)
_test_size = DS_SIZE - _train_size
f = lambda(x): sum(x) # the "true" function: f = 0 + 1*x1 + 1*x2 + 1*x3 ...
noise = lambda: np.random.normal(0,10) # some noise
# training globals
LAMBDA = 1e6 # L2 regularization factor
# generate the dataset, the labels and split into train/test
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
np.random.shuffle(ds)
train_data, train_labels = zip(*ds[0:_train_size])
test_data, test_labels = zip(*ds[_train_size:])
# define the computational graph
graph = tf.Graph()
with graph.as_default():
# declare graph inputs
x_train = tf.placeholder(tf.float32, shape=(_train_size, DIMENSIONS+1))
y_train = tf.placeholder(tf.float32, shape=(_train_size, 1))
theta = tf.Variable([[0.0] for _ in range(DIMENSIONS+1)]) # implicit bias!
# optimum
optimum = tf.matrix_solve_ls(x_train, y_train, LAMBDA, fast=True)
# run the computation: no loop needed!
with tf.Session(graph=graph) as s:
tf.initialize_all_variables().run()
print("initialized")
opt = s.run(optimum, feed_dict={x_train:train_data, y_train:train_labels})
PARAMETERS_NORMEQ = opt
print("Solution for parameters:\n",PARAMETERS_NORMEQ)
####################################################################################################
### PREDICTION AND ERROR RATE
####################################################################################################
# generate test dataset
ds = [[np.random.rand()*1000 for d in range(DIMENSIONS)] for _ in range(DS_SIZE)]
ds = [([1]+x, [f(x)+noise()]) for x in ds] # add x[0]=1 dimension and labels
test_data, test_labels = zip(*ds)
# define hypothesis
h_gd = lambda(x): PARAMETERS_GRADDESC.T.dot(x)
h_ne = lambda(x): PARAMETERS_NORMEQ.T.dot(x)
# define cost
mse = lambda pred, lab: ((pred-np.array(lab))**2).sum()/DS_SIZE
# make predictions!
predictions_gd = np.array([h_gd(x) for x in test_data])
predictions_ne = np.array([h_ne(x) for x in test_data])
# calculate and print total error
cost_gd = mse(predictions_gd, test_labels)
cost_ne = mse(predictions_ne, test_labels)
print("total cost with gradient descent:", cost_gd)
print("total cost with normal equations:", cost_ne)
Variables allow us to add trainable parameters to a graph. They are constructed with a type and initial value:
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
A variable of type tf.Variable is a parameter that we will learn using TensorFlow. Assume you use gradient descent to minimize the loss function; you need to initialize these parameters first. The rng.randn() call is used to generate a random starting value for this purpose.
I think the Getting Started With TensorFlow guide is a good starting point for you.
I'll first define the variables:
W is a weight vector in R^d (same dimensionality as X)
b is a scalar value (bias)
Y is also a scalar value i.e. the value at X
pred = W (dot) X + b   # dot here refers to the dot product
# cost equals the average squared error
cost = ((pred - Y)^2) / (2*num_samples)
# finally the optimizer
# the optimizer computes the gradient with respect to each variable and applies the update
W -= learning_rate * (pred - Y)/num_samples * X
b -= learning_rate * (pred - Y)/num_samples
Why are W and b set to random values? Well, they are updated based on gradients of the error calculated from the cost, so W and b could have been initialized to anything. It isn't performing linear regression via the least-squares method, although both approaches will converge to the same solution.
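A minimal NumPy sketch of those update rules, using the same kind of 1-D data as train_X/train_Y in the question (the learning rate and epoch count are arbitrary):
import numpy as np

train_X = np.array([3.3, 4.4, 5.5, 6.71, 6.93, 4.168, 9.779, 6.182])
train_Y = np.array([1.7, 2.76, 2.09, 3.19, 1.694, 1.573, 3.366, 2.596])

W, b = np.random.randn(), np.random.randn()  # random initialization, as in the script
lr = 0.01
for epoch in range(1000):
    pred = W * train_X + b
    dW = np.mean((pred - train_Y) * train_X)  # gradient of the squared-error cost w.r.t. W
    db = np.mean(pred - train_Y)              # gradient w.r.t. b
    W -= lr * dW
    b -= lr * db
print("W =", W, "b =", b)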
Look here for more information: Getting Started