Multinomial logistic softmax regression with SGD

Multinomial logistic softmax regression with SGD - python

I'm trying to build a model from scratch that can classify MNIST images (handwritten digits). The model needs to output a list of probabilities representing how likely it is that the input image is a certain number.
This is the code I have so far:
from sklearn.datasets import load_digits
import numpy as np
def softmax(x):
return np.exp(x) / np.sum(np.exp(x), axis=0)
digits = load_digits()
features = digits.data
targets = digits.target
train_count = int(0.8 * len(features))
train_x = features[: train_count]
train_y = targets[: train_count]
test_x = features[train_count:]
test_y = targets[train_count:]
bias = np.random.rand()
weights = np.random.rand(len(features[0]))
rate = 0.02
for i in range(1000):
for i, sample in enumerate(train_x):
prod = np.dot(sample, weights) - bias
soft = softmax(prod)
predicted = np.argmax(soft) + 1
error = predicted - train_y[i]
weights -= error * rate * sample
bias -= rate * error
# print(error)
I'm trying to build the model so that it uses stochastic gradient descent but I'm a little confused as to what to pass to the softmax function. I understand it's supposed to expect a vector of numbers, but what I'm used to (when building a small NN) is that the model should produce one number, which is passed to an activation function, which in turn produces the prediction. Here, I feel like I'm missing a step and I don't know what it is.

In the simplest implementation, your last layer (just before softmax) should indeed output a 10-dim vector, which will be squeezed to [0, 1] by the softmax. This means that weights should be a matrix of shape [features, 10] and bias should be a [10] vector.
In addition to this, you should one-hot encode your train_y labels, i.e. convert each item to [0, 0, ..., 1, ..., 0] vector. The shape of train_y is thus [size, 10].
Take a look at logistic regression example - it's in tensorflow, but the model is likely to be similar to yours: they use 768 features (all pixels), one-hot encoding for labels and a single hidden layer. They also use mini-batches to speed-up learning.

Related

Error in reverse scaling outputs predicted by a LSTM RNN

I used the LSTM model to predict the future open price of a stock. Here the data was preprocessed and the model was built and trained without any errors, and I used Standard Scaler to scale down the values in the DataFrame. But while retrieving the predictions from the model, when I used the scaler.reverse() method it gave the following error.
ValueError: non-broadcastable output operand with shape (59,1) doesn't match the broadcast shape (59,4)
The complete code is a too big jupyter notebook to directly show, so I have uploaded it in a git repository

This is because the model is predicting output with shape (59, 1). But your Scaler was fit on (251, 4) data frame. Either create a new scaler on the data frame of the shape of y values or change your model dense layer output to 4 dimensions instead of 1.
The data shape on which scaler is fit, it will take that shape only during scaler.inverse_transform.
Old Code - Shape (n,1)
trainY.append(df_for_training_scaled[i + n_future - 1:i + n_future, 0])
Updated Code - Shape (n,4) - use all 4 outputs
trainY.append(df_for_training_scaled[i + n_future - 1:i + n_future,:])

Normally you'd be re-scaling independent variables (features) as differences in scale can affect model calculations, but the dependent variable that you're trying to predict is normally left untouched. There's usually no reason to re-scale the dependent variable and scaling it makes it extremely difficult to interpret results.
The first line of documentation of StandardScaler class even specifies as much:
Standardize features by removing the mean and scaling to unit variance
You can optionally also scale labels, but once again this is not normally required.
So what I'd do in your place is (assuming your original dataframe contains 3 independent variables and 1 target variable) is this:
X = some_df.iloc[:, :3].values
y = some_df.iloc[3].values
scaler = StandardScaler()
X = scaler.fit_transform(X)
# And then goes everything as usual
Now, when you go to predict values you simply need to transform your input with the scaler in the same way it's been done before.
The better way, though, would be to add to your model a Normalization layer as a pre-processing step. This way you just feed raw data into your estimator and it handles all the nitty-gritty for you. And, similarly, you won't need to normalize data when generating predictions, the model will do everything for you. You could add something like:
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import Model
# this is your default batch_size
BATCH_SIZE = 128
# Here's your raw (non-normalized) X data
X = some_df.iloc[:, :3].values
norm = Normalization()
norm.adapt(X)
preprocess = Sequential([
Input(shape=(BATCH_SIZE, 3)),
norm
])
# Now finally, when you build your actual model you add
# pre-processing step in the beginning
inp = preprocess()
x = Dense(64)(input)
x = Dense(128)(x)
x = Dense(1)(x)
model = Model(inputs=inp, outputs=x)
Here the pre-process step is a part of the model itself so once you do that you can just feed it raw data without any additional transformations.
This is what it will do:
# Skipping the imports as they are the same as above + numpy
X = np.array([[1, 2, 3], [10, 20, 40], [100, 200, 400]])
norm = Normalization()
norm.adapt(X)
preprocess = Sequential([
Input(shape=(3, 3)),
norm
])
x_new = preprocess(X)
print(x_new)
Out: tf.Tensor(
[[-0.80538726 -0.80538726 -0.807901 ]
[-0.60404044 -0.60404044 -0.6012719 ]
[ 1.4094278 1.4094278 1.4091729 ]], shape=(3, 3), dtype=float32)

LSTM doesn't learn to add random numbers

I was trying to do a pretty simple thing, train an LSTM that picks a sequence of random numbers and outputs the sum of them. But after some hours without converging I decided to ask here which of my premises doesn't work.
The idea is simple:
I generate a training set of sequences of some sequence length of random numbers and label them with the sum of them (numbers are drawn from a normal distribution)
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input
Intuitively the LSTM should learn to set the weight of the input gate to 1 (bias 0) the weights of the forget gate to 0 (bias 1) and the weight to the output gate to 1 (bias 0) and learn to add these numbers, but it doesn't. I pasting the code I use, I tried with different learning rates, optimizers, batching, observed the gradients and the outputs and don't find the exact reason why is failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
def generate_sequences(n_samples, seq_len):
total_shape = n_samples*seq_len
random_values = np.random.randn(total_shape)
random_values = random_values.reshape(n_samples, -1)
targets = np.sum(random_values, axis=1)
return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
lr=0.1
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
my_inp, target = iterator.get_next()
with tf.GradientTape(persistent=True) as tape:
tape.watch(my_inp)
my_out = output(tf.reshape(my_inp, shape=(batch_size,seq_len,1)))
loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)),1)/batch_size
grads = tape.gradient(loss, output.trainable_variables)
optimizer.apply_gradients(zip(grads, output.trainable_variables),
global_step=tf.train.get_or_create_global_step())
I also has a conjecture that this a theoretical problem (If we sum different random values drawn form a normal distribution then the output is not in the [-1, 1] range and the LSTM due to the tanh activations can't learn it. But changing them doesn't improved the performance (changed to linear in the example code).
EDIT:
Set activations to linear, I realised that the tanh() squashes the values.
SOLVED:
The problem was actually the tanh() of the gates and recurrent states which was squashing my outputs and not allowing them to grow by adding up the summands. Putting all activations to linear works pretty fine.

For a classification model in tensorflow, is there a way to impose an asymmetric cost function during the training?

I am trying to build a Neural Network in tensorflow where the cost of a Type I error (false-positive) is more costly than a Type II error (false-negative). Is there a way to impose this during the training process (i.e. inputting a cost matrix)? This is possible with simple models like Logistic Regression in scikit learn by specifying the class_weight parameter.
cw = {0: 3,1:1}
clf = LogisticRegression(class_weight = cw )
In this case, incorrectly predicting a 0 is 3x more costly than incorrectly predicting a 1. However, this cannot be performed with a Neural Network, so I want to see if it is possible in tensorflow.
Thanks

You could use tf.nn.weighted_cross_entropy_with_logits and it's pos_weight argument.
This argument weights positive class, as described by documentation (in TF2.0 at least):
A value pos_weights > 1 decreases the false negative count, hence increasing the recall.
Conversely setting pos_weights < 1 decreases the false positive count and increases the precision.
In your case, you could create custom loss function like this:
import tensorflow as tf
# Output logits from your network, not the values after sigmoid activation
class WeightedBinaryCrossEntropy:
def __init__(self, positive_weight: float):
self.positive_weight = positive_weight
def __call__(self, targets, logits, sample_weight=None):
return tf.nn.weighted_cross_entropy_with_logits(
targets, logits, pos_weight=self.positive_weight
)
And create a custom neural network with it, for example using tf.keras (samples are weighted as they were in your question:
import numpy as np
model = tf.keras.models.Sequential(
[
tf.keras.layers.Dense(32, input_shape=(10,)),
tf.keras.layers.Activation("relu"),
tf.keras.layers.Dense(10),
tf.keras.layers.Activation("relu"),
# Output one logit for binary classification
tf.keras.layers.Dense(1),
]
)
# Example random data
data = np.random.random((32, 10))
targets = np.random.randint(2, size=32)
# 3 times as costly to make type I error
model.compile(optimizer="rmsprop", loss=WeightedBinaryCrossEntropy(positive_weight=3))
model.fit(data, targets, batch_size=32)

You can use a logarithmic scale. For a 0 incorrectly predicted as 1, y - ŷ = -1, log goes to 1.71. For a 1 predicted as 0, y - ŷ = 1 log equals 0.63. For y == ŷ log equals 0. Almost the three times more costly, for a 0 incorrectly predicted as 1.
import numpy as np
from math import exp
loss=abs(1-exp(-np.log(exp(y-ŷ))))
#abs(1-exp(-np.log(exp(0))))
#Out[53]: 0.0
#abs(1-exp(-np.log(exp(-1))))
#Out[54]: 1.718281828459045
#abs(1-exp(-np.log(exp(1))))
#Out[55]: 0.6321205588285577
Then you will have a convex optimization. Implementing:
import keras.backend as K
def custom_loss(y_true,y_pred):
return K.mean(abs(1-exp(-np.log(exp(y_true-y_pred)))))
Then:
model.compile(loss=custom_loss, optimizer=sgd,metrics = ['accuracy'])

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm in this paper using Tensorflow. This paper uses Matconvnet+Matlab to implement it, and I am curious if Tensorflow has the equivalent functions to achieve the same thing. The paper said:
The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10−1 to 10−4 which was reduced in log scale at each epoch.
This paper uses wavelet transform (WT) and residual learning method (where the residual image = WT(HR) - WT(HR'), and the HR' are used for training). Xavier method suggests to initialize the variables normal distribution with
stddev=sqrt(2/(filter_size*filter_size*num_filters)
Q1. How should I initialize the variables? Is the code below correct?
weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)
This paper does not explain how to construct the loss function in details . I am unable to find the equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to Stochastic Gradient Descent with momentum.
Q2: Is it possible to set the learning rate in log scale?
Q3: How to create the loss function described above?
I followed this website to write the code below. Assume model() function returns the network mentioned in this paper and lamda=0.0001,
inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])
# get the model output and weights for each conv
pred, weights = model()
# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)
for weight in weights:
regularizers += tf.nn.l2_loss(weight)
loss = tf.reduce_mean(loss + 0.0001 * regularizers)
learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)
NOTE: As I am a deep learning/Tensorflow beginner, I copy-paste code here and there so please feel free to correct it if you can ;)

Q1. How should I initialize the variables? Is the code below correct?
Use tf.get_variable or switch to slim (it does the initialization automatically for you). example
Q2: Is it possible to set the learning rate in log scale?
You can but do you need it? This is not the first thing that you need to solve in this network. Please check #3
However, just for reference, use following notation.
learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, decay_steps=10000, decay_rate=0.98, staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss)
Q3: How to create the loss function described above?
At first, you have not written "pred" to "image" conversion to this message(Based on the paper you need to apply subtraction and IDWT to obtain final image).
There is one problem here, logits have to be calculated based on your label data. i.e. if you will use marked data as "Y : Label", you need to write
pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))
This will give you the output of Y : Label to be used
If your dataset's labeled images are denoised ones, in this case you need to follow this one:
pred = model()
pred = tf.matmul(image, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))
Logits are the output of your network. You will use this one as result to calculate the rest. Instead of matmul, you can add a conv2d layer in here without a batch normalization and an activation function and set output feature count as 4. Example:
pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits) # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(logits - labels))
This loss function will give you basic training capabilities. However, this is L1 distance and it may suffer from some issues (check). Think following situation
Let's say you have following array as output [10, 10, 10, 0, 0] and you try to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfit.
For same case, think following output [6, 6, 6, 6, 6]. It still has loss of 20 (4 + 4 + 4 + 4 + 4). However, whenever you apply threshold of 5, you can achieve 5/5 success. Hence, this is the case that we want.
If you use L2 loss, for the first case, you will have 10^2 + 10^2 = 200 as loss output. For the second case, you will get 4^2 * 5 = 80.
Hence, optimizer will try to run away from #1 as quick as possible to achieve global success rather than perfect success of some outputs and complete failure of the others. You can apply loss function like this for that.
tf.reduce_mean(tf.nn.l2_loss(logits - image))
Alternatively, you can check for cross entropy loss function. (it does apply softmax internally, do not apply softmax twice)
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, image))

Q1. How should I initialize the variables? Is the code below correct?
That's correct (although missing an opening parentheses). You could also look into tf.get_variable if the variables are going to be reused.
Q2: Is it possible to set the learning rate in log scale?
Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, and set boundaries at each epoch.
EDIT: Look at the other answer, use the staircase=True argument!
Q3: How to create the loss function described above?
Your loss function looks correct.

Other answers are very detailed and helpful. Here is a code example that uses placeholder to decay learning rate at log scale. HTH.
import tensorflow as tf
import numpy as np
# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D,1)
y = np.dot(x, w)
print y.shape
#modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size,1])
W = tf.get_variable("w", shape=[D,1], initializer=tni)
B = tf.zeros([1])
lr = tf.placeholder(tf.float32)
pred = tf.add(tf.matmul(X,W), B)
print pred.shape
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)
train_op = opt.minimize(mse)
learning_rate = 0.0001
do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
for i in range (100000):
if i > 0 and i % N == 0:
# epoch done, decrease learning rate by 2
learning_rate /= 2
print "Epoch completed. LR =", learning_rate
idx = i/batch_size + i%batch_size
f = {X:x[idx:idx+batch_size,:], Y:y[idx:idx+batch_size,:], lr: learning_rate}
_, err = sess.run([train_op, mse], feed_dict = f)
acc_err += err
if i%5000 == 0:
print "Average error = {}".format(acc_err/5000)
acc_err = 0.0

model.get_weights() returning array of NaNs after training due to NaN masking

I'm trying to train an LSTM to classify sequences of various lengths. I want to get the weights of this model, so I can use them in stateful version of the model. Before training, the weights are normal. Also, the training seems to run successfully, with a gradually decreasing error. However, when I change the mask value from -10 to np.Nan, mod.get_weights() starts returning arrays of NaNs and the validation error drops suddenly to a value close to zero. Why is this occurring?
from keras import models
from keras.layers import Dense, Masking, LSTM
from keras.optimizers import RMSprop
from keras.losses import categorical_crossentropy
from keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt
def gen_noise(noise_len, mag):
return np.random.uniform(size=noise_len) * mag
def gen_sin(t_val, freq):
return 2 * np.sin(2 * np.pi * t_val * freq)
def train_rnn(x_train, y_train, max_len, mask, number_of_categories):
epochs = 3
batch_size = 100
# three hidden layers of 256 each
vec_dims = 1
hidden_units = 256
in_shape = (max_len, vec_dims)
model = models.Sequential()
model.add(Masking(mask, name="in_layer", input_shape=in_shape,))
model.add(LSTM(hidden_units, return_sequences=False))
model.add(Dense(number_of_categories, input_shape=(number_of_categories,),
activation='softmax', name='output'))
model.compile(loss=categorical_crossentropy, optimizer=RMSprop())
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
validation_split=0.05)
return model
def gen_sig_cls_pair(freqs, t_stops, num_examples, noise_magnitude, mask, dt=0.01):
x = []
y = []
num_cat = len(freqs)
max_t = int(np.max(t_stops) / dt)
for f_i, f in enumerate(freqs):
for t_stop in t_stops:
t_range = np.arange(0, t_stop, dt)
t_len = t_range.size
for _ in range(num_examples):
sig = gen_sin(f, t_range) + gen_noise(t_len, noise_magnitude)
x.append(sig)
one_hot = np.zeros(num_cat, dtype=np.bool)
one_hot[f_i] = 1
y.append(one_hot)
pad_kwargs = dict(padding='post', maxlen=max_t, value=mask, dtype=np.float32)
return pad_sequences(x, **pad_kwargs), np.array(y)
if __name__ == '__main__':
noise_mag = 0.01
mask_val = -10
frequencies = (5, 7, 10)
signal_lengths = (0.8, 0.9, 1)
dt_val = 0.01
x_in, y_in = gen_sig_cls_pair(frequencies, signal_lengths, 50, noise_mag, mask_val)
mod = train_rnn(x_in[:, :, None], y_in, int(np.max(signal_lengths) / dt_val), mask_val, len(frequencies))
This persists even if I change the network architecture to return_sequences=True and wrap the Dense layer with TimeDistributed, nor does removing the LSTM layer.

I had the same problem. In your case I can see it was probably something different but someone might have the same problem and come here from Google. So in my case I was passing sample_weight parameter to fit() method and when the sample weights contained some zeros in it, get_weights() was returning an array with NaNs. When I omitted the samples where sample_weight=0 (they were useless anyway if sample_weight=0), it started to work.

The weights are indeed changing. The unchanging weights are from the edge of the image, and they may have not changed because the edge isn't helpful for classifying digits.
to check select a specific layer and see the result:
print(model.layers[70].get_weights()[1])
70 : is the number of the last layer in my case.

get_weights() method of keras.engine.training.Model instance should retrieve the weights of the model.
This should be a flat list of Numpy arrays, or in other words this should be the list of all weight tensors in the model.
mw = model.get_weights()
print(mw)
If you got the NaN(s) this has a specific meaning. You are dealing simple with vanishing gradients problem. (In some cases even with Exploding gradients).
I would first try to alter the model to reduce the chances for the vanishing gradients. Try reducing the hidden_units first, and normalize your activations.
Even though LSTM are there to solve the problem of vanishing/exploding gradients problem you need to set the right activations from (-1, 1) interval.
Note this interval is where float points are most precise.
Working with np.nan under the masking layer is not a predictable operation since you cannot do comparison with np.nan.
Try print(np.nan==np.nan) and it will return False. This is an old problem with the IEEE 754 standard.
Or it may actually be this is a bug in Tensorflow, based on the IEEE 754 standard weakness.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multinomial logistic softmax regression with SGD - python

Related

Error in reverse scaling outputs predicted by a LSTM RNN

LSTM doesn't learn to add random numbers

For a classification model in tensorflow, is there a way to impose an asymmetric cost function during the training?

Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

model.get_weights() returning array of NaNs after training due to NaN masking

Categories

Resources