In a neural network, the number of training samples is 5000, and before the data is given for training it is normalized using the formula
y' = (y - mean(y)) / stdev(y)
Now I want to de-normalize the data after getting the predicted output. For prediction, a test set of 2000 samples is used. To de-normalize, the following formula is used:
y = y' * stdev(y) + mean(y)
This approach is taken from the following thread:
How to denormalise (de-standardise) neural net predictions after normalising input data
Could anyone explain how the same mean and standard deviation used to normalize the training data (5000 x 2100) can be used to de-normalize the predicted data, when the test data (2000 x 2100) used for prediction has a different sample count?
The de-normalization equation is simple algebra: it is the same equation as normalization, but solved for y instead of y'. Its purpose is to reverse the normalization process, recovering the "shape" of the original data; that's why you have to use the original stdev and mean.
Normalization shifts the data to center on 0 (by subtracting the mean), and then rescales the distribution to a standard deviation of 1.0 (by dividing by the stdev). To return to the original shape, you have to un-shift and un-scale by the same amounts, using the original distribution's statistics.
Note that we expect the predicted data to have a mean of 0 and a stdev around 1.0 (with some variation due to sampling). Your worry is not silly: we do have a different sample count for the stdev, but it is the training statistics that defined the scaling, so they are the ones that must undo it.
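A minimal sketch of this in NumPy (the arrays below are placeholders, not the original data): the mean and stdev are computed once from the 5000 training targets and reused to de-normalize whatever the network predicts for the 2000 test samples.

import numpy as np

# Statistics computed ONCE, from the training targets only
train_y = np.random.randn(5000) * 7.0 + 3.0   # placeholder for the real 5000 training targets
mu, sigma = train_y.mean(), train_y.std()

# Normalize the training targets before fitting the network
train_y_norm = (train_y - mu) / sigma

# Later: the network's normalized predictions on the 2000 test samples
pred_norm = np.random.randn(2000)             # placeholder for model.predict(test_X)

# De-normalize with the SAME training statistics
pred = pred_norm * sigma + mu

The sample counts never have to match: mu and sigma are fixed numbers once training is done, and they are simply applied element-wise to however many predictions you have.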
Related
I'd like to use a neural network to predict a scalar value which is the sum of a function of the input values and a random value (I'm assuming gaussian distribution) whose variance also depends on the input values. Now I'd like to have a neural network that has two outputs - the first output should approximate the deterministic part - the function, and the second output should approximate the variance of the random part, depending on the input values. What loss function do I need to train such a network?
(It would be nice if there was an example with Python for TensorFlow, but I'm also interested in general answers. I'm also not quite clear how I could write something like that in Python code - none of the examples I found so far show how to address individual outputs from the loss function.)
You can use dropout for that. With a dropout layer you can make several different predictions, each based on a different random choice of which nodes are dropped out. Then you can simply look at the spread of the outcomes and interpret it as a measure of uncertainty.
For details, read:
Gal, Yarin, and Zoubin Ghahramani. "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." International Conference on Machine Learning, 2016.
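A hedged sketch of that idea in Keras (the architecture, layer sizes, and data here are placeholders): keep dropout active at prediction time by calling the model with training=True, run several stochastic forward passes, and read the spread of the outputs as the uncertainty estimate.

import numpy as np
import tensorflow as tf

# Toy model; the real architecture comes from your own problem
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1),
])

x = np.random.randn(32, 10).astype("float32")

# training=True keeps dropout active, so each pass samples a different sub-network
samples = np.stack([model(x, training=True).numpy() for _ in range(100)])

mean_prediction = samples.mean(axis=0)   # point estimate
uncertainty = samples.std(axis=0)        # spread across dropout masks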
Since I found nothing simple to implement, I wrote something myself that models this explicitly: a custom loss function that lets the network predict both a mean and a variance. It seems to work, but I'm not quite sure how well it works out in practice, and I'd appreciate feedback. This is my loss function:
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.python.ops import math_ops

def meanAndVariance(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Loss function that has the last axis of y_pred approximate the mean
    and variance of each value in the last axis of y_true."""
    y_pred = tf.convert_to_tensor(y_pred)
    y_true = math_ops.cast(y_true, y_pred.dtype)
    # Even indices of the last axis hold the predicted means,
    # odd indices hold the predicted variances
    mean = y_pred[..., 0::2]
    variance = y_pred[..., 1::2]
    # Squared error of the mean, plus squared error of the variance
    # against the observed squared deviation
    res = K.square(mean - y_true) + K.square(variance - K.square(mean - y_true))
    return K.mean(res, axis=-1)
The output dimension is twice the label dimension - a mean and a variance for each value in the label. The loss function consists of two parts: a mean squared error term that has the mean output approximate the label value, and a term that has the variance output approximate the squared deviation of the label from the predicted mean.
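For context, a hedged usage sketch (sizes and data are placeholders, and I have not benchmarked this): the final layer has twice as many units as the label, and the loss above is passed to compile like any other custom loss.

import numpy as np
import tensorflow as tf

label_dim = 3                                 # illustrative label dimension

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(2 * label_dim),     # interleaved (mean, variance) outputs
])
model.compile(optimizer="adam", loss=meanAndVariance)

x = np.random.randn(256, 8).astype("float32")
y = np.random.randn(256, label_dim).astype("float32")
model.fit(x, y, epochs=1, verbose=0)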
When using dropout to estimate the uncertainty (or any other stochastic regularization method), make sure to also check out our recent work on a sampling-free approximation of Monte-Carlo dropout.
https://arxiv.org/pdf/1908.00598.pdf
We essentially follow your idea: treat the activations as random variables and then propagate mean and variance to the output layer using error propagation. Consequently, we obtain two outputs - the mean and the variance.
Let's say we have a system of ODEs that describes how X affects Y:
dX/dt = -k * X
dY/dt = Kin * (1 - (Vmax * X)/(Km + X)) - kout * Y
I am trying to use a neural network that takes X(0), Y(0), and t as inputs and outputs Y(t). I made a feed-forward network in TensorFlow and trained it on data generated from the above equations, using Y trajectories generated with initial X values of 5 and 10. The initial value of Y I left constant at its steady-state value (the value where dY/dt = 0 and X = 0). For testing, I tried initial X values in between and outside of the two values I trained with. For all testing, Y(0) was left the same.
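For concreteness, here is a hedged sketch of that data-generation step (the parameter values are placeholders, not the ones I actually used): each training pair maps the network input (X(0), Y(0), t) to the target Y(t), with trajectories produced by scipy's ODE solver.

import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameter values only
k, Kin, Vmax, Km, kout = 0.5, 1.0, 0.9, 1.0, 0.3

def rhs(t, state):
    X, Y = state
    dXdt = -k * X
    dYdt = Kin * (1 - (Vmax * X) / (Km + X)) - kout * Y
    return [dXdt, dYdt]

Y0 = Kin / kout                      # steady-state value of Y when X = 0
t_eval = np.linspace(0, 10, 50)      # the trained time interval t = [0, 10]

samples = []
for X0 in (5.0, 10.0):               # the two initial X values used for training
    sol = solve_ivp(rhs, (0, 10), [X0, Y0], t_eval=t_eval)
    for t, Yt in zip(sol.t, sol.y[1]):
        samples.append(((X0, Y0, t), Yt))   # network input -> target Y(t)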
The testing results for in-between values are very good, and the results for the outside values are pretty good as well. However, this is only over the time period that the network was trained over, say t=[0,10]. Once I try to predict a value past this time period, the predictions start to drift off.
Is there a better method of implementing the network so that I can predict the values past the training interval? Ideally, I'd like to be able to predict the return of Y to steady-state, once X has reached 0. I've been reading about using RNNs, however I need it to be trained on sparse data, where the time points aren't evenly spaced. The network I used above was able to do this, at least for the trained interval. Also, most examples of RNNs that I've seen (that aren't for language processing) rely on predicting future time points based on the previous time points, instead of in the way that I am trying to use it.
An idea I have would be to use my original network to predict the values over the trained time range (and a lot of them, to create a rich data set), and then feed that into an RNN to predict the values past that range. Would this be a feasible idea, or are there other methods that I could try that would work better?
I am trying to implement a solution to Ridge regression in Python using Stochastic gradient descent as the solver. My code for SGD is as follows:
import numpy as np
import pandas as pd
from random import shuffle

def fit(self, X, Y):
    # Convert to data frame in case X is a numpy matrix
    X = pd.DataFrame(X)
    # Prepend a column of 1s to the data for the intercept
    X.insert(0, 'intercept', np.array([1.0] * X.shape[0]))
    # Find dimensions of the training data
    m, d = X.shape
    # Initialize weights to random values
    beta = self.initializeRandomWeights(d)
    beta_prev = None
    epochs = 0
    prev_error = None
    while beta_prev is None or epochs < self.nb_epochs:
        print("## Epoch: " + str(epochs))
        indices = list(range(m))  # shuffle() needs a list, not a range object
        shuffle(indices)
        for i in indices:  # Pick a training example from a randomly shuffled set
            beta_prev = beta
            xi = X.iloc[i]
            # Error of the ith training example: sum(beta * x) - y
            errori = sum(beta * xi) - Y[i]
            # Ridge gradient for this example: data term plus L2 penalty
            gradient_vector = xi * errori + self.l * beta_prev
            beta = beta_prev - self.alpha * gradient_vector
        epochs += 1
The data I'm testing this on is not normalized, and my implementation always ends up with all the weights being infinity, even though I initialize the weight vector to low values. Only when I set the learning rate alpha to a very small value (~1e-8) does the algorithm end up with valid values in the weight vector.
My understanding is that normalizing/scaling input features only helps reduce convergence time. But the algorithm should not fail to converge as a whole if the features are not normalized. Is my understanding correct?
You can check from scikit-learn's Stochastic Gradient Descent documentation that one of the disadvantages of the algorithm is that it is sensitive to feature scaling. In general, gradient based optimization algorithms converge faster on normalized data.
Also, normalization is advantageous for regression methods.
The updates to the coefficients during each step will depend on the ranges of each feature. Also, the regularization term will be affected heavily by large feature values.
SGD may converge without data normalization, but whether it does depends on the data at hand. Therefore, your assumption is not correct.
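A hedged sketch of that recommendation with scikit-learn (synthetic data, illustrative hyper-parameters): standardize the features before fitting SGD with an L2 (ridge) penalty.

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.uniform(0, 1000, size=(500, 5))       # deliberately large, unscaled features
y = X @ rng.randn(5) + rng.randn(500)

# Scaling first keeps the gradient updates (and the L2 penalty) well behaved
model = make_pipeline(StandardScaler(),
                      SGDRegressor(penalty="l2", alpha=0.01, max_iter=1000))
model.fit(X, y)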
Your assumption is not correct.
It's hard to answer this because there are so many different methods/environments, but I will try to mention some points.
Normalization
When some method is not scale-invariant (I think every linear regression is not), you really should normalize your data.
I take it that you are just ignoring this for debugging/analysis purposes.
Normalizing your data is not only relevant for convergence time; the results will differ too (think about the effect within the loss function: large values can contribute much more to the loss than small ones)!
Convergence
There is probably much to tell about convergence of many methods on normalized/non-normalized data, but your case is special:
SGD's convergence theory only guarantees convergence to some local minimum (= the global minimum in your convex optimization problem) for certain choices of hyper-parameters (learning rate and learning schedule/decay).
Even optimizing normalized data can fail with SGD when those parameters are bad!
This is one of the most important downsides of SGD: its dependence on hyper-parameters.
As SGD is based on gradients and step sizes, non-normalized data can have a huge effect on whether this convergence is achieved!
For SGD to converge in linear regression, the step size should be smaller than 2/s, where s is the largest singular value of the matrix (see the "Convergence and stability in the mean" section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter); in the case of ridge regression it should be less than 2*(1+p/s^2)/s, where p is the ridge penalty.
Normalizing the rows of the matrix (or the gradients) changes the loss function to give each sample an equal weight, and it changes the singular values of the matrix such that you can choose a step size near 1 (see the NLMS section in https://en.m.wikipedia.org/wiki/Least_mean_squares_filter). Depending on your data, this might require smaller step sizes or allow for larger ones; it all depends on whether the normalization increases or decreases the largest singular value of the matrix.
Note that when deciding whether or not to normalize the rows, you shouldn't think only about the convergence rate (which is determined by the ratio between the largest and smallest singular values) or stability in the mean, but also about how it changes the loss function and whether that fits your needs. Sometimes it makes sense to normalize, but sometimes (for example when you want to give different importance to different samples, or when you think that a larger signal energy means better SNR) it doesn't.
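To make the dependence on the singular values concrete, here is a small illustration (purely synthetic data; it only shows that normalization moves the largest singular value, not which step-size bound applies to your exact setup):

import numpy as np

rng = np.random.RandomState(0)
scales = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 100.0])   # one feature on a much larger scale
A = rng.randn(200, 10) * scales

s_raw = np.linalg.svd(A, compute_uv=False)[0]            # largest singular value, raw data

# Normalize each row (sample) to unit norm, as in NLMS
A_rows = A / np.linalg.norm(A, axis=1, keepdims=True)
s_rows = np.linalg.svd(A_rows, compute_uv=False)[0]      # largest singular value after row normalization

print(s_raw, s_rows)   # the admissible step-size range changes with these values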
I am trying to use SK learn to perform linear regression on time series labeled data.
My data format is data=(timestamp,value,label)
The labels that are assigned to my data are either 0 or 1.
I tried to follow this example from SKLearn website
My questions:
1- Where are the labels of the training data in the example? Are they in diabetes_y_train?
2- What are the return values of the method predict()? In my code, it returns an array of n_samples predicted values in the range [0,1]. However, I expected it to return binary values of either 0 or 1 (no intermediate values).
1 - diabetes_y_train are the labels (targets) for the training set.
2 - You are using a regression function, so it is right to get continuous values. If you want binary output you are not solving a regression problem but a classification one; you can then set a threshold to discretise the predictions, or use one of the classifiers offered by sklearn.
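A hedged sketch of both options (synthetic data, illustrative names): thresholding the regression output versus using a classifier that predicts 0/1 directly.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0).astype(int)   # binary labels 0/1

# Option 1: regression, then threshold the continuous predictions
reg = LinearRegression().fit(X, y)
y_pred_binary = (reg.predict(X) >= 0.5).astype(int)

# Option 2: a classifier, which predicts 0/1 directly
clf = LogisticRegression().fit(X, y)
y_pred_class = clf.predict(X)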
1 - Yes
2 - predict() calculates a floating point number, because the example is trying to predict a floating point value and not a binary value. So there is no yes/no answer but a predicted value, and to estimate the error, the differences are squared and averaged in np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2)
I'm attempting kaggle.com's digit recognizer competition using Python and scikit-learn.
After removing labels from the training data, I add each row in CSV into a list like this:
for row in csv:
    train_data.append(np.array(np.int64(row)))
I do the same for the test data.
I pre-process this data with PCA in order to perform dimension reduction (and feature extraction?):
import numpy as np
from sklearn import decomposition

def preprocess(train_data, test_data, pca_components=100):
    # convert to matrix
    train_data = np.mat(train_data)
    # fit PCA on the training data only, then reduce both train and test data
    pca = decomposition.PCA(n_components=pca_components).fit(train_data)
    X_train = pca.transform(train_data)
    X_test = pca.transform(test_data)
    return (X_train, X_test)
I then create a kNN classifier and fit it with the X_train data and make predictions using the X_test data.
Using this method I can get around 97% accuracy.
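For reference, a hedged sketch of that pipeline using the preprocess function above (the data here is a synthetic stand-in for the Kaggle CSVs, and n_neighbors is an illustrative choice):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
train_data = rng.rand(1000, 784)                 # stand-in for 28x28 digit images
train_labels = rng.randint(0, 10, size=1000)     # stand-in for the removed labels
test_data = rng.rand(200, 784)

X_train, X_test = preprocess(train_data, test_data, pca_components=100)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, train_labels)
predictions = knn.predict(X_test)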
My question is about the dimensionality of the data before and after PCA is performed:
What are the dimensions of train_data and X_train?
How does the number of components influence the dimensionality of the output? Are they the same thing?
TL;DR: Yes, the number of the desired PCA components is the dimensionality of the output data (after the transformation).
The PCA algorithm finds the eigenvectors of the data's covariance matrix. What are eigenvectors? Nobody knows, and nobody cares (just kidding!). What's important is that the first eigenvector is a vector parallel to the direction along which the data has the largest variance (intuitively: spread). The second one denotes the second-best direction in terms of the maximum spread, and so on. Another important fact is that these vectors are orthogonal to each other, so they form a basis.
The pca_components parameter tells the algorithm how many of these best basis vectors you are interested in. So, if you pass 100, it means you want the 100 basis vectors that describe (a statistician would say: explain) most of the variance of your data.
The transform function transforms (srsly? ;)) the data from the original basis to the basis formed by the chosen PCA components (in this example - the first best 100 vectors). You can visualize this as a cloud of points being rotated and having some of its dimensions ignored. As correctly pointed out by Jaime in the comments, this is equivalent to projecting the data onto the new basis.
For the 3D case, if you wanted a basis formed of the first 2 eigenvectors, the 3D point cloud would first be rotated so that most of the variance is parallel to the coordinate axes. Then the axis along which the variance is smallest is discarded, leaving you with 2D data.
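A small sketch of the shapes involved (sizes are illustrative): transform leaves the number of samples unchanged and only reduces the number of columns to n_components.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(5000, 784)            # e.g. 5000 digit images of 28*28 = 784 pixels
pca = PCA(n_components=100).fit(X)

X_reduced = pca.transform(X)
print(X.shape)          # (5000, 784)  -> original dimensionality
print(X_reduced.shape)  # (5000, 100)  -> one column per retained component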