My training data are saved in 3 files; each file is too large to fit into memory. Each training example is two-dimensional (2805 rows and 222 columns, where the 222nd column is the label) and the values are numerical. I would like to normalize the data before feeding it into the model for training. Below is my code for the input pipeline; the data has not been normalized before creating the dataset. Is there a function in TensorFlow that can do the normalization for my case?
dataset = tf.data.TextLineDataset([file1, file2, file3])
# combine 2805 lines into a single example
dataset = dataset.batch(2805)
def parse_example(line_batch):
    record_defaults = [[1.0] for col in range(0, 221)]
    record_defaults.append([1])
    content = tf.decode_csv(line_batch, record_defaults=record_defaults, field_delim='\t')
    features = tf.stack(content[0:221])
    features = tf.transpose(features)
    label = content[-1][-1]
    label = tf.one_hot(indices=tf.cast(label, tf.int32), depth=2)
    return features, label
dataset = dataset.map(parse_example)
dataset = dataset.shuffle(1000)
# batch multiple examples
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
data_batch, label_batch = iterator.get_next()
There are different ways of "normalizing data". Depending on which one you have in mind, it may or may not be easy to implement in your case.
1. Fixed normalization
If you know the fixed range(s) of your values (e.g. feature #1 has values in [-5, 5], feature #2 has values in [0, 100], etc.), you could easily pre-process your feature tensor in parse_example(), e.g.:
def normalize_fixed(x, current_range, normed_range):
    # convert the Python range lists to tensors so they can be sliced
    current_range = tf.cast(current_range, x.dtype)
    normed_range = tf.cast(normed_range, x.dtype)
    # per-feature min/max, shaped (1, num_features) so they broadcast over the rows of x
    current_min, current_max = tf.expand_dims(current_range[:, 0], 0), tf.expand_dims(current_range[:, 1], 0)
    normed_min, normed_max = tf.expand_dims(normed_range[:, 0], 0), tf.expand_dims(normed_range[:, 1], 0)
    x_normed = (x - current_min) / (current_max - current_min)
    x_normed = x_normed * (normed_max - normed_min) + normed_min
    return x_normed
def parse_example(line_batch,
                  fixed_range=[[-5, 5], [0, 100], ...],
                  normed_range=[[0, 1]]):
    # ...
    features = tf.transpose(features)
    features = normalize_fixed(features, fixed_range, normed_range)
    # ...
2. Per-sample normalization
If your features are supposed to have approximately the same range of values, per-sample normalization could also be considered, i.e. applying normalization using each sample's own moments (mean, variance):
def normalize_with_moments(x, axes=[0, 1], epsilon=1e-8):
    mean, variance = tf.nn.moments(x, axes=axes)
    x_normed = (x - mean) / tf.sqrt(variance + epsilon)  # epsilon to avoid dividing by zero
    return x_normed
def parse_example(line_batch):
    # ...
    features = tf.transpose(features)
    features = normalize_with_moments(features)
    # ...
3. Batch normalization
You could apply the same procedure over a complete batch instead of per-sample, which may make the process more stable:
data_batch = normalize_with_moments(data_batch, axes=[1, 2])
Similarly, you could use tf.nn.batch_normalization.
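For instance, a minimal sketch (assuming data_batch is laid out as [batch, rows, features] as in the pipeline above, and normalizing per feature over the whole batch):

mean, variance = tf.nn.moments(data_batch, axes=[0, 1], keep_dims=True)
data_batch = tf.nn.batch_normalization(data_batch, mean, variance,
                                       offset=None, scale=None,
                                       variance_epsilon=1e-8)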
4. Dataset normalization
Normalizing using the mean/variance computed over the whole dataset would be the trickiest, since as you mentioned it is a large, split one.
tf.data.Dataset isn't really meant for such global computation. A solution would be to use whatever tools you have to pre-compute the dataset moments, then use this information for your TF pre-processing.
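For instance, a rough sketch of that idea (streaming the tab-separated files in chunks with pandas is an assumption, as is the chunk size; the 221 feature columns plus a label column mirror the question, and a single global mean/std is computed here, though per-column statistics would work the same way):

import numpy as np
import pandas as pd

# First pass over the files: accumulate running sums to get dataset-wide moments.
count, total, total_sq = 0, 0.0, 0.0
for filename in [file1, file2, file3]:
    for chunk in pd.read_csv(filename, sep='\t', header=None, chunksize=10000):
        values = chunk.iloc[:, :221].to_numpy(dtype=np.float64)
        count += values.size
        total += values.sum()
        total_sq += np.square(values).sum()

dataset_mean = total / count
dataset_std = np.sqrt(total_sq / count - dataset_mean ** 2)

# Then bake these constants into the parsing function of the tf.data pipeline:
def parse_example(line_batch):
    # ... decode as before ...
    features = (features - dataset_mean) / dataset_std
    return features, label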
As mentioned by @MiniQuark, TensorFlow has a Transform library you could use to preprocess your data. Have a look at the Get Started guide, or for instance at the tft.scale_to_z_score() method for sample normalization.
Expanding on benjaminplanche's answer for "#4 Dataset normalization", there is actually a pretty easy way to accomplish this.
TensorFlow's Keras provides a preprocessing Normalization layer. Since this is a layer, it is intended to be used within the model. However, you don't have to (more on that later).
The model usage is simple:
input = tf.keras.Input(shape=dataset.element_spec.shape)
norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(dataset) # you can use dataset.take(N) if N samples is enough for it to figure out the mean & variance.
layer1 = norm(input)
...
The advantage of using it in the model is that the normalization mean & variance are saved as part of the model weights. So when you load the saved model, it'll use the same values it was trained with.
As mentioned earlier, if you don't want to use keras models, you don't have to use the layer as part of one. If you'd rather use it in your dataset pipeline, you can do that too.
norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(dataset)
dataset = dataset.map(lambda t: norm(t))
The disadvantage is that you need to save and restore those weights manually now (norm.get_weights() and norm.set_weights()). Numpy has convenient save() and load() functions you can use here.
np.save("norm_weights.npy", norm.get_weights())
norm.set_weights(np.load("norm_weights.npy", allow_pickle=True))
After defining inputs, execute the following line of code:
import tensorflow as tf
inputs = tf.keras.layers.LayerNormalization(
    axis=-1,
    center=True,
    scale=True,
    trainable=True,
    name='input_normalized',
)(inputs)
I inferred this from the tensorflow API (which has been updated since the answers above).
Related
I used an LSTM model to predict the future open price of a stock. The data was preprocessed and the model was built and trained without any errors, and I used StandardScaler to scale down the values in the DataFrame. But while retrieving the predictions from the model, when I used the scaler's inverse_transform() method it gave the following error.
ValueError: non-broadcastable output operand with shape (59,1) doesn't match the broadcast shape (59,4)
The complete code is a Jupyter notebook too big to show directly, so I have uploaded it to a git repository.
This is because the model is predicting output with shape (59, 1), but your scaler was fit on a (251, 4) data frame. Either create a new scaler fitted on data with the shape of your y values, or change your model's dense layer to output 4 values instead of 1.
The scaler only accepts the shape it was fit on when you call scaler.inverse_transform.
Old Code - Shape (n,1)
trainY.append(df_for_training_scaled[i + n_future - 1:i + n_future, 0])
Updated Code - Shape (n,4) - use all 4 outputs
trainY.append(df_for_training_scaled[i + n_future - 1:i + n_future,:])
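If you'd rather keep the single-output model, a minimal sketch of the first option is below (a second scaler fitted only on the target column; the column name 'Open' and the variable testX are assumptions about your notebook):

from sklearn.preprocessing import StandardScaler

# Scaler dedicated to the target column, so it expects shape (n, 1)
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(df_for_training[['Open']])

# Predictions of shape (59, 1) can then be inverted directly
predictions = model.predict(testX)   # testX stands in for your test windows
predictions = y_scaler.inverse_transform(predictions)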
Normally you'd be re-scaling independent variables (features) as differences in scale can affect model calculations, but the dependent variable that you're trying to predict is normally left untouched. There's usually no reason to re-scale the dependent variable and scaling it makes it extremely difficult to interpret results.
The first line of documentation of StandardScaler class even specifies as much:
Standardize features by removing the mean and scaling to unit variance
You can optionally also scale labels, but once again this is not normally required.
So what I'd do in your place (assuming your original dataframe contains 3 independent variables and 1 target variable) is this:
from sklearn.preprocessing import StandardScaler

X = some_df.iloc[:, :3].values
y = some_df.iloc[:, 3].values

scaler = StandardScaler()
X = scaler.fit_transform(X)
# And then everything goes as usual
Now, when you go to predict values, you simply need to transform your input with the scaler in the same way it was done before.
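For instance, a minimal sketch (new_data is a hypothetical array with the same three feature columns as X):

# Re-use the scaler that was fitted on the training features
new_data_scaled = scaler.transform(new_data)  # transform, not fit_transform
predictions = model.predict(new_data_scaled)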
The better way, though, would be to add a Normalization layer to your model as a pre-processing step. This way you just feed raw data into your estimator and it handles all the nitty-gritty for you. And, similarly, you won't need to normalize data when generating predictions; the model will do everything for you. You could add something like:
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import Model

# this is your default batch_size
BATCH_SIZE = 128
# Here's your raw (non-normalized) X data
X = some_df.iloc[:, :3].values

norm = Normalization()
norm.adapt(X)

preprocess = Sequential([
    Input(shape=(3,)),  # only the per-sample shape goes here; the batch dimension is implicit
    norm
])

# Now finally, when you build your actual model you add the
# pre-processing step in the beginning
inp = Input(shape=(3,))
x = preprocess(inp)
x = Dense(64)(x)
x = Dense(128)(x)
x = Dense(1)(x)
model = Model(inputs=inp, outputs=x)
Here the pre-process step is a part of the model itself so once you do that you can just feed it raw data without any additional transformations.
This is what it will do:
# Skipping the imports as they are the same as above + numpy
X = np.array([[1, 2, 3], [10, 20, 40], [100, 200, 400]])

norm = Normalization()
norm.adapt(X)

preprocess = Sequential([
    Input(shape=(3,)),  # 3 features per sample
    norm
])

x_new = preprocess(X)
print(x_new)
Out: tf.Tensor(
[[-0.80538726 -0.80538726 -0.807901 ]
[-0.60404044 -0.60404044 -0.6012719 ]
[ 1.4094278 1.4094278 1.4091729 ]], shape=(3, 3), dtype=float32)
I'm trying to solve a very simple problem (supposedly simple, but it's giving me nightmares).
My data is this:
0.64900194, 2.32144675, 4.36117903, 6.8795263 , 8.70335759,
10.52469321, 12.50494439, 14.92118469, 16.31657096, 18.69954666,
20.653336 , 22.08447934, 24.29878371, 26.01567801, 28.3626067 ,
30.75065028, 32.81166691, 34.52029737, 36.90956918, 38.55743122
and the corresponding target for the above sequence of data is 40.24253
As you can see, it's a simple LSTM sequence prediction problem: the input is the past 20 values of a multiples-of-2 sequence, and the target is the next number in the sequence plus some random uniform number (to add a little noise).
Sample input and target sizes are: (batch_size, 20, 1) and (batch_size, )
This is the code I'm using for prediction:
def univariate_data(dataset, start_index, end_index, history_size, target_size):
    data = []
    labels = []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(dataset) - target_size
    for i in range(start_index, end_index):
        indices = range(i-history_size, i)
        # Reshape data from (history_size,) to (history_size, 1)
        data.append(np.reshape(dataset[indices], (history_size, 1)))
        labels.append(dataset[i+target_size])
    return np.array(data), np.array(labels)
uni_data = np.array([(i*2)+random.random() for i in range(0,400000)])
TRAIN_SPLIT = 300000
uni_train_mean = uni_data[:TRAIN_SPLIT].mean()
uni_train_std = uni_data[:TRAIN_SPLIT].std()
uni_data = (uni_data-uni_train_mean)/uni_train_std
univariate_past_history = 20
univariate_future_target = 0
x_train_uni, y_train_uni = univariate_data(uni_data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(uni_data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
print ('Single window of past history')
print (x_train_uni.shape)
print ('\n Target temperature to predict')
print (y_train_uni.shape)
BATCH_SIZE = 256
BUFFER_SIZE = 10000
train_univariate = tf.data.Dataset.from_tensor_slices((x_train_uni, y_train_uni))
train_univariate = train_univariate.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()
val_univariate = tf.data.Dataset.from_tensor_slices((x_val_uni, y_val_uni))
val_univariate = val_univariate.batch(BATCH_SIZE).repeat()
simple_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape=x_train_uni.shape[-2:]),
    tf.keras.layers.Dense(1)
])
simple_lstm_model.compile(optimizer='adam', loss='mae')
for x, y in val_univariate.take(1):
    print(simple_lstm_model.predict(x).shape)
EVALUATION_INTERVAL = 200
EPOCHS = 10
simple_lstm_model.fit(train_univariate, epochs=EPOCHS,
                      steps_per_epoch=EVALUATION_INTERVAL,
                      validation_data=val_univariate, validation_steps=50)
The prediction for any given sequence is way off the actual value; any suggestions would help.
Some previous searches suggested normalizing and standardizing; I've tried both. I also tried varying the number of LSTM layers, and tried SimpleRNN and GRU. I tried different activation functions ('tanh', 'relu'), and using the past 10, 30 and 50 values instead of the past 20. None of them helped. I believe I'm making a very simple mistake; any guidance would help a lot. Thanks and stay safe!!
So I finally figured out the solution.
The problem with the above approach is that the mean and std of my train and test data were very different. In other words, I was training the model with data in the range (0, 400000) and my test set was in the range (400000, 500000). The mean and standard deviation obtained from the training data were vastly different from those of the test data; also, the standard deviation in the above case is around 173,250 (for the training data). It's very difficult for any model to predict accurately when trained with data having such a high standard deviation.
The solution is, instead of feeding the data directly into the model, to feed the difference of consecutive elements. For example, instead of feeding the data p = [0, 2, 4, 6, 8, 10, 12], feed the data q = [2, 2, 2, 2, 2, 2], where q[i] = p[i] - p[i-1]. Now if we feed the model the data q, of course the model will predict 2, as it has only ever seen inputs of 2, and we can simply add that prediction to the last actual value to obtain the result.
So the basic problem with the model is the high standard deviation of the training data and the unseen values at test time, and the solution is to feed in the differences of consecutive values.
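A minimal sketch of this differencing idea in numpy (the values are made up; the reconstruction step assumes you keep the last observed value of the original series):

import numpy as np

p = np.array([0.65, 2.32, 4.36, 6.88, 8.70])  # original, trending series
q = np.diff(p)                                # differences: q[i] = p[i+1] - p[i]

# Train the LSTM on windows of q instead of p.
# At prediction time, add the predicted difference back onto the last known value:
predicted_diff = 2.0                          # whatever the model outputs
next_value = p[-1] + predicted_diff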
But another question is how we would do this if we wanted to predict the next element of 2**x, i.e. powers of 2. In that case the model may again learn the trend given data of type q, but it still won't be very accurate, because at some point the differences themselves will again have a very high mean and std.
Lastly, I read somewhere that an LSTM isn't meant for extrapolating to data from a part of the space it hasn't been exposed to; there are other models for extrapolating data, but the LSTM isn't one of them.
I was surprised that the deep learning algorithms I had implemented did not work, so I decided to create a very simple example to better understand how a CNN functions. Here is my attempt at constructing a small CNN for a very simple task, which produces unexpected results.
I have implemented a simple CNN with only one layer of one filter. I have created a dataset of 5000 samples, the inputs x being 256x256 simulated images, and the outputs y being the corresponding blurred images (y = signal.convolve2d(x, gaussian_kernel, boundary='fill', mode='same')).
Thus, I would like my CNN to learn the convolutional filter which transforms the original image into its blurred version. In other words, I would like my CNN to recover the Gaussian filter I used to create the blurred images. Note: as I want to 'imitate' the convolution process as it is described in the mathematical framework, I am using a Gaussian filter which has the same size as my images: 256x256.
It seems to me quite an easy task, and nonetheless the CNN is unable to provide the results I would expect. Please find below the code of my training function and the results.
# Parameters
size_image = 256
normalization = 1
sigma = 7
n_train = 4900
ind_samples_training = np.linspace(1, n_train, n_train).astype(int)
nb_epochs = 5
minibatch_size = 5
learning_rate = np.logspace(-3, -5, nb_epochs)

tf.reset_default_graph()
tf.set_random_seed(1)
seed = 3
n_train = len(ind_samples_training)
costs = []

# Create Placeholders of the correct shape
X = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name='X')
Y_blur_true = tf.placeholder(tf.float64, shape=(None, size_image, size_image, 1), name='Y_true')
learning_rate_placeholder = tf.placeholder(tf.float32, shape=[])

# parameters to learn -- should be an approximation of the gaussian filter
filter_to_learn = tf.get_variable('filter_to_learn',
                                  shape=[size_image, size_image, 1, 1],
                                  dtype=tf.float64,
                                  initializer=tf.contrib.layers.xavier_initializer(seed=0),
                                  trainable=True)

# Forward propagation: Build the forward propagation in the tensorflow graph
Y_blur_hat = tf.nn.conv2d(X, filter_to_learn, strides=[1, 1, 1, 1], padding='SAME')

# Cost function: Add cost function to tensorflow graph
cost = tf.losses.mean_squared_error(Y_blur_true, Y_blur_hat, weights=1.0)

# Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
opt_adam = tf.train.AdamOptimizer(learning_rate=learning_rate_placeholder)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    optimizer = opt_adam.minimize(cost)

# Initialize all the variables globally
init = tf.global_variables_initializer()
lr = learning_rate[0]

# Start the session to compute the tensorflow graph
with tf.Session() as sess:
    # Run the initialization
    sess.run(init)
    # Do the training loop
    for epoch in range(nb_epochs):
        minibatch_cost = 0.
        seed = seed + 1
        permutation = list(np.random.permutation(n_train))
        shuffled_ind_samples = np.array(ind_samples_training)[permutation]
        # Learning rate update
        if learning_rate.shape[0] > 1:
            lr = learning_rate[epoch]
        nb_minibatches = int(np.ceil(n_train / minibatch_size))
        for num_minibatch in range(nb_minibatches):
            # Minibatch indices
            ind_minibatch = shuffled_ind_samples[num_minibatch*minibatch_size:(num_minibatch+1)*minibatch_size]
            # Loading of the original image (X) and the blurred image (Y)
            minibatch_X, minibatch_Y = load_dataset_blur(ind_minibatch, size_image, normalization, sigma)
            _, temp_cost, filter_learnt = sess.run([optimizer, cost, filter_to_learn],
                                                   feed_dict={X: minibatch_X, Y_blur_true: minibatch_Y,
                                                              learning_rate_placeholder: lr})
I ran the training for 5 epochs over 4900 samples, with a batch size of 5. The Gaussian kernel has a variance of 7^2 = 49.
I have tried initializing the filter to be learnt both with the Xavier initializer provided by TensorFlow and with the true values of the Gaussian kernel we actually want to learn. In both cases, the learnt filter ends up too different from the true Gaussian one, as can be seen in the two images available at https://github.com/megalinier/Helsinki-project.
Examining the photos, it seems the network is learning OK, as the predicted image is not far off the true label; for better results you could tweak some hyperparameters, but that is not the issue here.
I think what you are missing is the fact that different kernels can give quite similar results, since it is a convolution.
Think about it: you are multiplying some matrix with another and then summing all the results to create a new pixel. Now, if the true label sum is 10, it could be the result of 2.5 + 2.5 + 2.5 + 2.5 or of -10 + 10 + 10 + 0.
What I am trying to say is that your network could be learning just fine, but you will get different values in the conv kernel than in the filter you used to blur the images.
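A tiny demonstration of that point (made-up numbers, not from the question's data): on a flat image, any two kernels with the same element sum produce exactly the same blurred interior, so the loss cannot tell them apart there.

import numpy as np
from scipy import signal

image = np.full((8, 8), 3.0)                          # perfectly flat image

k1 = np.full((3, 3), 10.0 / 9)                        # elements sum to 10
k2 = np.zeros((3, 3))
k2[0, 0], k2[1, 1] = -10.0, 20.0                      # also sums to 10

out1 = signal.convolve2d(image, k1, mode='same', boundary='fill')
out2 = signal.convolve2d(image, k2, mode='same', boundary='fill')

# The interiors match exactly (borders differ only because of zero padding)
print(np.allclose(out1[1:-1, 1:-1], out2[1:-1, 1:-1]))  # True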
I think this would better serve as a comment as it's somewhat speculative, but it's too long...
Hard to say what exactly is wrong, but there could be multiple culprits here. For one, squared error provides a weak signal when the target and prediction are already quite similar, and while the Xavier-initialized filter looks quite bad, the predicted (filtered) image isn't too far off the target. You could experiment with other metrics such as absolute error (i.e. 1-norm instead of 2-norm).
Second, adding regularization should help, i.e. add a weight penalty to the loss function to encourage the filter values to become small where they are not needed. As it is, what I suppose happens is this: the random values in the filter average out to about 0, leading to a similar "filtering" effect as if they were actually all 0, so the learning algorithm doesn't have much incentive to pull them to 0. By adding a weight penalty, you provide this incentive.
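A minimal sketch of these first two suggestions in the question's TF1-style graph (the penalty coefficient 1e-3 is an arbitrary value you would need to tune):

# 1-norm reconstruction error instead of squared error
data_cost = tf.reduce_mean(tf.abs(Y_blur_true - Y_blur_hat))

# L2 weight penalty pushing unneeded filter values towards zero
weight_penalty = 1e-3 * tf.reduce_sum(tf.square(filter_to_learn))

cost = data_cost + weight_penalty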
Third, it could just be Adam messing up. It is known to produce "strange" non-optimal solutions in some very simple (e.g. convex) problems. Maybe try default gradient descent with learning rate decay (and possibly momentum).
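If you want to try that, here is a sketch of momentum gradient descent with exponential learning-rate decay in the same TF1 style (the decay settings are placeholders, not recommendations):

global_step = tf.Variable(0, trainable=False)
decayed_lr = tf.train.exponential_decay(1e-3, global_step,
                                        decay_steps=1000, decay_rate=0.9)
opt = tf.train.MomentumOptimizer(decayed_lr, momentum=0.9)
optimizer = opt.minimize(cost, global_step=global_step)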
I am trying to build a custom loss function in Keras. Unfortunately, I have little knowledge of TensorFlow. Is there a way I can convert the incoming tensors into a numpy array so I can compute my loss function?
Here is my function:
def getBalance(x_true, x_pred):
    x_true = np.round(x_true)
    x_pred = np.round(x_pred)

    NumberOfBars = len(x_true)
    NumberOfHours = NumberOfBars/60

    TradeIndex = np.where(x_pred[:,1] == 0)[0]

    ## remove predictions that are not tradable
    x_true = np.delete(x_true[:,0], TradeIndex)
    x_pred = np.delete(x_pred[:,0], TradeIndex)

    CM = confusion_matrix(x_true, x_pred)

    correctPredictions = CM[0,0]+CM[1,1]
    wrongPredictions = CM[1,0]+CM[0,1]
    TotalTrades = correctPredictions+wrongPredictions
    Accuracy = (correctPredictions/TotalTrades)*100

    return Accuracy
If it's not possible to use numpy arrays, what is the best way to compute that function with TensorFlow? Any direction would be greatly appreciated, thank you!
Edit 1:
Here are some details of my model. I am using an LSTM network with heavy dropout. The inputs are multi-variable and multi-time-step.
The outputs are a 2D array of binary digits with shape (20000, 2).
model = Sequential()
model.add(Dropout(0.4, input_shape=(train_input_data_NN.shape[1], train_input_data_NN.shape[2])))
model.add(LSTM(30, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(2))
model.compile(loss='getBalance', optimizer='adam')
history = model.fit(train_input_data_NN, outputs_NN, epochs=50, batch_size=64, verbose=1, validation_data=(test_input_data_NN, outputs_NN_test))
EDIT 1: Here is an untested substitution:
(I took the liberty of normalizing the variable names.)
def get_balance(x_true, x_pred):
    x_true = K.tf.round(x_true)
    x_pred = K.tf.round(x_pred)

    # didn't see the need for these
    # NumberOfBars = len(x_true)
    # NumberOfHours = NumberOfBars/60

    trade_index = K.tf.not_equal(x_pred[:, 1], 0)

    ## remove predictions that are not tradable
    x_true_tradeable = K.tf.boolean_mask(x_true[:, 0], trade_index)
    x_pred_tradeable = K.tf.boolean_mask(x_pred[:, 0], trade_index)

    cm = K.tf.confusion_matrix(x_true_tradeable, x_pred_tradeable)

    correct_predictions = cm[0, 0] + cm[1, 1]
    wrong_predictions = cm[1, 0] + cm[0, 1]
    total_trades = correct_predictions + wrong_predictions
    accuracy = (correct_predictions / total_trades) * 100

    return accuracy
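One usage note (not part of the question's original compile call): pass the loss function object itself rather than a string, since Keras only resolves strings to its built-in losses:

model.compile(loss=get_balance, optimizer='adam')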
Original Answer
Welcome to SO. As you might know, we need to compute the gradient of the loss function. We can't compute the gradient correctly on numpy arrays (they're just constants).
What is done (in TensorFlow/Theano, which are the backends one uses with Keras) is automatic differentiation on tensors (e.g. tf.placeholder()). This is not the entire story, but what you should know at this point is that tf/theano gives us gradients by default on operators like tf.max and tf.sum.
What that means for you is that all the operations on the tensors (y_true and y_pred) should be rewritten to use tf/theano operators.
I'll comment with what I think would be rewritten and you can substitute accordingly and test.
See tf.round used as K.tf.round where K is the reference to the keras backend imported as
import keras.backend as K
x_true = np.round(x_true)
x_pred = np.round(x_pred)
Grab the shape of the tensor x_true with K.shape. Computing the ratio over a constant could remain as it is here:
NumberOfBars = len(x_true)
NumberOfHours = NumberOfBars/60
See tf.where used as K.tf.where
TradeIndex = np.where( x_pred[:,1] == 0 )[0]
You could mask the tensor with a condition instead of deleting; see masking.
##remove predictions that are not tradable
x_true = np.delete(x_true[:,0], TradeIndex)
x_pred = np.delete(x_pred[:,0], TradeIndex)
See tf.confusion_matrix
CM = confusion_matrix(x_true, x_pred)
The computations that follow are computations over constants and so remain essentially the same (conditioned on whatever changes have to be made given the new API).
Hopefully I can update this answer with a valid substitution that runs, but I hope this sets you on the right path.
A suggestion on coding style: I see you use three different variable-naming conventions in your code; choose one and stick with it.
I'm trying to build a model from scratch that can classify MNIST images (handwritten digits). The model needs to output a list of probabilities representing how likely it is that the input image is a certain number.
This is the code I have so far:
from sklearn.datasets import load_digits
import numpy as np
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
digits = load_digits()
features = digits.data
targets = digits.target
train_count = int(0.8 * len(features))
train_x = features[: train_count]
train_y = targets[: train_count]
test_x = features[train_count:]
test_y = targets[train_count:]
bias = np.random.rand()
weights = np.random.rand(len(features[0]))
rate = 0.02
for i in range(1000):
    for i, sample in enumerate(train_x):
        prod = np.dot(sample, weights) - bias
        soft = softmax(prod)
        predicted = np.argmax(soft) + 1
        error = predicted - train_y[i]
        weights -= error * rate * sample
        bias -= rate * error
        # print(error)
I'm trying to build the model so that it uses stochastic gradient descent, but I'm a little confused as to what to pass to the softmax function. I understand it's supposed to expect a vector of numbers, but what I'm used to (when building a small NN) is that the model should produce one number, which is passed to an activation function, which in turn produces the prediction. Here, I feel like I'm missing a step and I don't know what it is.
In the simplest implementation, your last layer (just before softmax) should indeed output a 10-dim vector, which will be squeezed to [0, 1] by the softmax. This means that weights should be a matrix of shape [features, 10] and bias should be a [10] vector.
In addition to this, you should one-hot encode your train_y labels, i.e. convert each item to [0, 0, ..., 1, ..., 0] vector. The shape of train_y is thus [size, 10].
Take a look at the logistic regression example; it's in TensorFlow, but the model is likely to be similar to yours: they use 784 features (all pixels), one-hot encoding for labels and a single hidden layer. They also use mini-batches to speed up learning.
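For reference, here is a minimal numpy sketch of the shapes described above (hypothetical, not a drop-in fix for the question's exact code): plain softmax regression with a weight matrix, a bias vector and one-hot targets, updated per sample.

import numpy as np

n_features, n_classes, rate = 64, 10, 0.02       # load_digits images are 8x8 = 64 features
weights = np.random.rand(n_features, n_classes) * 0.01
bias = np.zeros(n_classes)

def softmax(z):
    z = z - z.max()                              # subtract max for numerical stability
    return np.exp(z) / np.sum(np.exp(z))

for i, sample in enumerate(train_x):
    logits = sample @ weights + bias             # shape (10,)
    probs = softmax(logits)                      # shape (10,), sums to 1
    target = np.zeros(n_classes)
    target[train_y[i]] = 1.0                     # one-hot encoded label
    error = probs - target                       # gradient of cross-entropy w.r.t. logits
    weights -= rate * np.outer(sample, error)    # shape (64, 10)
    bias -= rate * error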