Neural network: estimating sine wave frequency - python

With an objective of learning Keras LSTM and RNNs, I thought to create a simple problem to work on: given a sine wave, can we predict its frequency?
I wouldn't expect a simple neural network to be able to predict the frequency, given that the notion of time is important here. However, even with LSTMs, I am unable to learn the frequency; I'm able to learn a trivial zero as the estimated frequency (even for train samples).
Here's the code to create the train set.
import numpy as np
import matplotlib.pyplot as plt
def create_sine(frequency):
return np.sin(frequency*np.linspace(0, 2*np.pi, 2000))
train_x = np.array([create_sine(x) for x in range(1, 300)])
train_y = list(range(1, 300))
Now, here's a simple neural network for this example.
from keras.models import Model
from keras.layers import Dense, Input, LSTM
input_series = Input(shape=(2000,),name='Input')
dense_1 = Dense(100)(input_series)
pred = Dense(1, activation='relu')(dense_1)
model = Model(input_series, pred)
model.compile('adam','mean_absolute_error')
model.fit(train_x[:100], train_y[:100], epochs=100)
As expected, this NN doesn't learn anything useful. Next, I tried a simple LSTM example.
input_series = Input(shape=(2000,1),name='Input')
lstm = LSTM(100)(input_series)
pred = Dense(1, activation='relu')(lstm)
model = Model(input_series, pred)
model.compile('adam','mean_absolute_error')
model.fit(train_x[:100].reshape(100, 2000, 1), train_y[:100], epochs=100)
However, this LSTM based model also doesn't learn anything useful.

Why doesn't it learn?
You think it's a simple problem to train an RNN on, but actually your setup isn't easy for the network at all:
As already mentioned, there's lack of important samples. You throw so much data into it (300 * 2000 points), but the actual target (frequency) is seen only once by the network. Even if the network does learn something, there's high chance it will overfit.
Inconsistent data. Remember that RNNs are good at capturing similar patterns in the series data. For instance, in NLP all sentences in the corpus are governed by the same language rules and more sentences help RNN to understand these rules better, i.e., more data helps.
In your case, the series with different frequencies aren't very much alike: compare the sine with frequency=1 and frequency=100. This kind of diversity in the data makes it harder to learn, not easier. It doesn't mean that the frequency is impossible for an RNN to learn, it simply means that you shouldn't be surprised that a trivial RNN like yours has hard time.
Data scale. Changing the frequency from 1 to 300, changes the scale of both x and y by two orders of magnitude, which may be problematic for any neural network.
Solution
Since your goal is rather educational, I solved the second and third items simply by limiting the target frequency to 10, so that scaling and distribution diversity isn't much of an issue (you are welcome to try different values here: you should see that increasing this one parameter to, say, 50 makes the task much more complex).
The first item is solved by giving the RNN 10 examples of each frequency, instead of just one. I've also added one more hidden layer to increase network flexibility, plus a simple regularizer (Dropout layer).
The complete code:
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout, LSTM
max_freq = 10
time_steps = 100
def create_sine(frequency, offset):
return np.sin(frequency * np.linspace(offset, 2 * np.pi + offset, time_steps))
train_y = list(range(1, max_freq)) * 10
train_x = np.array([create_sine(freq, np.random.uniform(0,1)) for freq in train_y])
train_y = np.array(train_y)
input_series = Input(shape=(time_steps, 1), name='Input')
lstm = LSTM(units=100)(input_series)
hidden = Dense(units=100, activation='relu')(lstm)
dropout = Dropout(rate=0.1)(hidden)
output = Dense(units=1, activation='relu')(dropout)
model = Model(input_series, output)
model.compile('adam', 'mean_squared_error')
model.fit(train_x.reshape(-1, time_steps, 1), train_y, epochs=200)
# Trying the network on the same data
test_x = train_x.reshape(-1, time_steps, 1)
test_y = train_y
predicted = model.predict(test_x).reshape([-1])
print()
print((predicted - train_y)[:12])
print(np.mean(np.abs(predicted - train_y)))
The output:
max_freq=10
[-0.05612183 -0.01982236 -0.03744316 -0.02568841 -0.11959982 -0.0770483
0.04643679 0.12057972 -0.00625324 -0.00724655 -0.16919005 -0.04512954]
0.0503574344847
max_freq=20 (everything else is the same)
[ 0.51365542 0.09269333 -0.009691 0.0619092 0.09852839 0.04378462
0.01430321 -0.01953268 0.00722599 0.02558327 -0.04520988 -0.0614748 ]
0.146024380232
max_freq=30 (everything else is the same)
[-0.28205156 -0.28922796 -0.00569081 -0.21314907 0.1068716 0.23497915
0.23975039 0.25955486 0.26333141 0.24235058 0.08320332 -0.03686047]
0.406703719805
Note that results are random and actually increasing the max_freq increases the changes of divergence. But even when it converges, the performance doesn't improve despite having more data, instead gets worse and pretty fast.

sample data item very low, one for each freq,
add small noise and use more data,
normalize output data -1 to 1 range
then try again

As you said, you want to predict the frequency. You also want to use LSTM. First we generate enough data to train, then we build the network. I'm sorry my example is not with keras, I'm using tflearn.
import numpy as np
import tflearn
from random import shuffle
# parameters
n_input=100
n_train=2000
n_test = 500
# generate data
xs=[]
ys=[]
frequencies = np.linspace(1,50,n_train+n_test)
shuffle(frequencies)
t=np.linspace(0,2*np.pi,n_input)
for freq in frequencies:
xs.append(np.sin(t*freq))
ys.append(freq)
xs_train=np.array(xs[:n_train]).reshape(n_train,n_input,1)
xs_test=np.array(xs[n_train:]).reshape(n_test,n_input,1)
ys_train = np.array(ys[:n_train]).reshape(-1,1)
ys_test = np.array(ys[n_train:]).reshape(-1,1)
# LSTM network prediction
net = tflearn.input_data(shape=[None, n_input, 1])
net = tflearn.lstm(net, 10)
net = tflearn.fully_connected(net, 100, activation="relu")
net = tflearn.fully_connected(net, 1)
net = tflearn.regression(net, optimizer='adam', loss='mean_square')
model = tflearn.DNN(net)
model.fit(xs_train, ys_train, n_epoch=100)
print(np.hstack((model.predict(xs_test),ys_test))[:10])
# [[ 13.08494568 12.76470588]
# [ 22.23135376 21.98039216]
# [ 39.0812912 37.58823529]
# [ 15.77548409 15.66666667]
# [ 26.57996941 25.58823529]
# [ 26.57759476 25.11764706]
# [ 16.42217445 15.8627451 ]
# [ 32.55020905 30.80392157]
# [ 44.16622925 43.01960784]
# [ 26.18071365 25.45098039]]
If you have the data in that order, you don't actually need LSTM, you can easily replace the LSTM part with a Deep Neural Network:
# Deep network instead of LSTM
net = tflearn.input_data(shape=[None, n_input])
net = tflearn.fully_connected(net, 100)
net = tflearn.fully_connected(net, 100)
net = tflearn.fully_connected(net, 1)
net = tflearn.regression(net, optimizer='adam',loss='mean_square')
model = tflearn.DNN(net)
model.fit(xs_train, ys_train)
print(np.hstack((model.predict(xs_test),ys_test))[:10])
Both codes are going to give you as result the predicted value of the frequency. I also created a gist with the program.

Related

Predicting Fibonacci Using LSTM RNN

New to neural nets so please correct my syntax.
I'm trying to create a LSTM RNN that will predict the Fibonacci sequence. When I ran the code below, the loss remains incredibly high (around 35339663592701874176).
Why does the shape of the input have to be (batch_size, timesteps, input_dim)? In my example I have 100 data entries so that'd be my batch_size, and the Fibonacci sequence takes in 2 inputs so that'd be input_dim but what would timesteps be in this case? 1?
Shouldn't the the units of the LSTM be 1? If I'm understanding correctly, the "units" are just the amount of hidden state nodes that are in the LSTM. So in theory, each of the 2 inputs would have a "1" coefficient weight towards that hidden state after training.
Would an RNN be a suitable model for this problem? When I've looked online, most people like to use the Fibonacci sequence as an example to explain how RNN's work.
Thanks for the help!
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Create Training Data
xs = [[[1, 1]]]
ys = []
i = 0
while i < 100:
ys.append([xs[i][0][0]+xs[i][0][1]])
xs.append([[xs[i][0][1], ys[len(ys)-1][0]]])
i = i + 1
del xs[len(xs)-1]
xs = np.array(xs, dtype=float)
ys = np.array(ys, dtype=float)
# Create Model
model = keras.Sequential()
model.add(layers.LSTM(1, input_shape=(1, 2)))
model.add(layers.Dense(1))
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=[ 'accuracy' ])
# Train
model.fit(xs, ys, epochs=100000)
You can't feed a NN data where some of the values are 10^21 times as large as some of the others and expect it to work, it just doesn't happen.
You're not doing anything here that actually calls for LSTM (or any RNN), you're not actually using the time dimension, and you're basically just trying to learn addition. Maybe you meant to do something different (like input digits as a sequence, or have the output run for multiple timesteps and give you several values of the sequence), but that's not what you're doing, and it's unclear what you want.
The number of units is your memory/procesing capacity. Each unit of an RNN is able to receive values from all of the units in the previous timestep. One unit alone can't do anything interesting, especially with no layer before it to preprocess the data.

Training a LSTM auto-encoder gets NaN / super high MSE loss

I'm trying to train a LSTM ae.
It's like a seq2seq model, you throw a signal in to get a reconstructed signal sequence. And the I'm using a sequence which should be quite easy. The loss function and metric is MSE. The first hundred epochs went well. However after some epochs I got MSE which is super high and it goes to NaN sometimes. I don't know what causes this.
Can you inspect the code and give me a hint?
The sequence gets normalization before, so it's in a [0,1] range, how can it produce such a high MSE error?
This is the input sequence I get from training set:
sequence1 = x_train[0][:128]
looks like this:
I get the data from a public signal dataset(128*1)
This is the code: (I modify it from keras blog)
# lstm autoencoder recreate sequence
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.utils import plot_model
from keras import regularizers
# define input sequence. sequence1 is only a one dimensional list
# reshape sequence1 input into [samples, timesteps, features]
n_in = len(sequence1)
sequence = sequence1.reshape((1, n_in, 1))
# define model
model = Sequential()
model.add(LSTM(1024, activation='relu', input_shape=(n_in,1)))
model.add(RepeatVector(n_in))
model.add(LSTM(1024, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
for epo in [50,100,1000,2000]:
model.fit(sequence, sequence, epochs=epo)
The first few epochs went all well. all the losses are about 0.003X or so. Then it became big suddenly, to some very big number, the goes to NaN all the way up.
You might have a problem with exploding gradient values when doing the backpropagation.
Try using the clipnorm and clipvalue parameters to control gradient clipping: https://keras.io/optimizers/
Alternatively, what is the learning rate you are using? I would also try to reduce the learning rate by 10,100,1000 to check if you observe the same behavior.
'relu' is the main culprit - see here. Possible solutions:
Initialize weights to smaller values, e.g. keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01)
Clip weights (at initialization, or via kernel_constraint, recurrent_constraint, ...)
Increase weight decay
Use a warmup learning rate scheme (start low, gradually increase)
Use 'selu' activation, which is more stable, is ReLU-like, and works better than ReLU on some tasks
Since your training went stable for many epochs, 3 sounds the most promising, as it seems that eventually your weights norm gets too large and gradients explode. Generally, I suggest keeping the weight norms around 1 for 'relu'; you can monitor the l2 norms using the function below. I also recommend See RNN for inspecting layer activations & gradients.
def inspect_weights_l2(model, names='lstm', axis=-1):
def _get_l2(w, axis=-1):
axis = axis if axis != -1 else len(w.shape) - 1
reduction_axes = tuple([ax for ax in range(len(w.shape)) if ax != axis])
return np.sqrt(np.sum(np.square(w), axis=reduction_axes))
def _print_layer_l2(layer, idx, axis=-1):
W = layer.get_weights()
l2_all = []
txt = "{} "
for w in W:
txt += "{:.4f}, {:.4f} -- "
l2 = _get_l2(w, axis)
l2_all.extend([l2.max(), l2.mean()])
txt = txt.rstrip(" -- ")
print(txt.format(idx, *l2_all))
names = [names] if isinstance(names, str) else names
for idx, layer in enumerate(model.layers):
if any([name in layer.name.lower() for name in names]):
_print_layer_l2(layer, idx, axis=axis)

RAM-memory usage while training a RNN with LSTM units

I'm following a tutorial on Recurrent neural networks, and I am training a RNN to learn how to predict the next letter from the alphabet, given a sequence of letters. The problem is, my RAM-usage is slowly going up every epoch I train the network for. I can not finish training this network because I have "only" 8192MB of RAM-memory, and it is exhausted after +- 100 epochs. Why is this? I think it has something to do with the way LSTM's work, since they do keep some information in memory, but it would be nice if someone could explain me some more details.
The code I'm using is relatively simple, and completely self-contained (You can copy/paste and run it, no need for a external dataset since the dataset is just the alphabet). Therefore I included it in full, so the problem is easily reproducible.
The tensorflow version I am using is 1.14.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras_preprocessing.sequence import pad_sequences
np.random.seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
start = np.random.randint(len(alphabet)-2)
end = np.random.randint(start, min(start+max_len,len(alphabet)-1))
sequence_in = alphabet[start:end+1]
sequence_out = alphabet[end + 1]
dataX.append([char_to_int[char] for char in sequence_in])
dataY.append(char_to_int[sequence_out])
print(sequence_in, "->" , sequence_out)
#Pad sequences with 0's, reshape X, then normalize data
X = pad_sequences(dataX, maxlen=max_len, dtype= "float32" )
X = np.reshape(X, (X.shape[0], max_len, 1))
X = X / float(len(alphabet))
print(X.shape)
#OHE the output variable.
y = np_utils.to_categorical(dataY)
#Create & fit the model
batch_size=1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1)))
model.add(Dense(y.shape[1], activation= "softmax" ))
model.compile(loss= "categorical_crossentropy" , optimizer= "adam" , metrics=[ "accuracy" ])
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)
The problem is that your sequences are rather long (1000 consecutive inputs). As LSTM-units do maintain some kind of state over epochs and you are trying to train it for 500 epochs (Which is a lot), especially when you're training on a CPU, your RAM will get flooded over time. I suggest you try to train on GPU, which has dedicated memory of its own. Also check out this issue: https://github.com/Element-Research/rnn/issues/5

Tensorflow NN not giving any reasonable output

I want to train a network on the isolet dataset, consisting of 6238 samples with 300 features each.
This is my code so far:
import tensorflow as tf
import sklearn.preprocessing as prep
import numpy as np
import matplotlib.pyplot as plt
def main():
X, C, Xtst, Ctst = load_isolet()
#normalize
#X = (X - np.mean(X, axis = 1)[:, np.newaxis]) / np.std(X, axis = 1)[:, np.newaxis]
#Xtst = (Xtst - np.mean(Xtst, axis = 1)[:, np.newaxis]) / np.std(Xtst, axis = 1)[:, np.newaxis]
scaler = prep.MinMaxScaler(feature_range=(0,1))
scaledX = scaler.fit_transform(X)
scaledXtst = scaler.transform(Xtst)
# Build the tf.keras.Sequential model by stacking layers. Choose an optimizer and loss function for training:
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(X.shape[1], activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(26, activation='softmax')
])
ES_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=1e-2, patience=10, verbose=1)
initial_learning_rate = 0.01
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(initial_learning_rate,decay_steps=100000,decay_rate=0.9999,staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(scaledX, C, epochs=100, callbacks=[ES_callback], batch_size = 32)
plt.figure(1)
plt.plot(range(len(history.history['loss'])), history.history['loss']);
plt.plot(range(len(history.history['accuracy'])), history.history['accuracy']);
plt.show()
Up to now, I have pretty much turned every knob I know:
different number of layers
different sizes of layers
different activation functions
different learning rates
different optimizers (we should test with 'adam' and 'stochastic gradient decent'
different batch sizes
different data preparations (the features range from -1 to 1 values. I tried normalizing along the feature axes, batch normalizing (z_i = (x_i - mean) / std(x_i)) and as seen in the code above scaling the values from 0 to 1 (since I guess 'relu' activation won't work well with negative input values)
Pretty much everything I tried gives weird outputs with extremely high loss values (depending on the learning rate) and very low accuracies during learning. The loss is increasing over epochs pretty much all of the time, but seems to be independent from the accuracy values.
For the code, I followed tutorials I got provided, however something is very off, since I should find the best hyper parameters, but I'm not able to find any good whatsoever.
I'd be very glad to get some points, where got the code wrong or need to preprocess the data differently.
Edit: Using loss='categorical_crossentropy'was given, so at least this one should be correct.
first of all:
Your convergence problems may be due to "incorrect" loss function. tf.keras supports a variety of losses that depend on the shape of your input labels.
Try different possibilities like
tf.keras.losses.SparseCategoricalCrossentropy if your labels are one-hot vectors.
tf.keras.losses.CategoricalCrossentropy if your lables are 1,2,3...
or tf.keras.losses.BinaryCrossentropy if your labels are just 0,1.
Honestly, this part of tf.keras is a bit tricky and some settings like that might need tuning.
Second of all - this part:
scaler = prep.MinMaxScaler(feature_range=(0,1))
scaledX = scaler.fit_transform(X)
scaledXtst = scaler.fit_transform(Xtst)
assuming Xtst is your test set you want to scale it based on your training set. So the correct scaling would be just
scaledXtst = scaler.transform(Xtst)
Hope this helps!

LSTM doesn't learn to add random numbers

I was trying to do a pretty simple thing, train an LSTM that picks a sequence of random numbers and outputs the sum of them. But after some hours without converging I decided to ask here which of my premises doesn't work.
The idea is simple:
I generate a training set of sequences of some sequence length of random numbers and label them with the sum of them (numbers are drawn from a normal distribution)
I use an LSTM with an RMSE loss for predicting the output, the sum of these numbers, given the sequence input
Intuitively the LSTM should learn to set the weight of the input gate to 1 (bias 0) the weights of the forget gate to 0 (bias 1) and the weight to the output gate to 1 (bias 0) and learn to add these numbers, but it doesn't. I pasting the code I use, I tried with different learning rates, optimizers, batching, observed the gradients and the outputs and don't find the exact reason why is failing.
Code for generating sequences:
import tensorflow as tf
import numpy as np
tf.enable_eager_execution()
def generate_sequences(n_samples, seq_len):
total_shape = n_samples*seq_len
random_values = np.random.randn(total_shape)
random_values = random_values.reshape(n_samples, -1)
targets = np.sum(random_values, axis=1)
return random_values, targets
Code for training:
n_samples = 100000
seq_len = 2
lr=0.1
epochs = n_samples
batch_size = 1
input_shape = 1
data, targets = generate_sequences(n_samples, seq_len)
train_data = tf.data.Dataset.from_tensor_slices((data, targets))
output = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(1, dtype='float64', recurrent_activation=None, activation=None), input_shape=(batch_size, seq_len, input_shape))
iterator = train_data.batch(batch_size).make_one_shot_iterator()
optimizer = tf.train.AdamOptimizer(lr)
for i in range(epochs):
my_inp, target = iterator.get_next()
with tf.GradientTape(persistent=True) as tape:
tape.watch(my_inp)
my_out = output(tf.reshape(my_inp, shape=(batch_size,seq_len,1)))
loss = tf.sqrt(tf.reduce_sum(tf.square(target - my_out)),1)/batch_size
grads = tape.gradient(loss, output.trainable_variables)
optimizer.apply_gradients(zip(grads, output.trainable_variables),
global_step=tf.train.get_or_create_global_step())
I also has a conjecture that this a theoretical problem (If we sum different random values drawn form a normal distribution then the output is not in the [-1, 1] range and the LSTM due to the tanh activations can't learn it. But changing them doesn't improved the performance (changed to linear in the example code).
EDIT:
Set activations to linear, I realised that the tanh() squashes the values.
SOLVED:
The problem was actually the tanh() of the gates and recurrent states which was squashing my outputs and not allowing them to grow by adding up the summands. Putting all activations to linear works pretty fine.

Categories

Resources