What are the expected values in the input in PyTorch?

I am new to PyTorch and following the tutorials. I want to implement a regressor for a nonlinear function with 4 real inputs and 2 real outputs. I cannot find anywhere what the expected range for the inputs and outputs is. Should they be between -1 and 1? Between 0 and 1? Can they be anything?
More details
I have written the following simple parameterized model for experimenting:
class Net(torch.nn.Module):
    def __init__(self, n_inputs: int, n_outputs: int, n_hidden_layers: int, n_nodes_per_hidden_layer: int):
        n_inputs = int(n_inputs)
        n_outputs = int(n_outputs)
        n_hidden_layers = int(n_hidden_layers)
        n_nodes_per_hidden_layer = int(n_nodes_per_hidden_layer)
        if any([i <= 0 for i in [n_inputs, n_outputs, n_hidden_layers, n_nodes_per_hidden_layer]]):
            raise ValueError(f'All n_inputs, n_outputs, n_hidden_layers and n_nodes_per_hidden_layer must be greater than 0.')
        super().__init__()
        self.input_layer = torch.nn.Linear(n_inputs, n_nodes_per_hidden_layer)
        self.hidden_layers = [torch.nn.Linear(n_nodes_per_hidden_layer, n_nodes_per_hidden_layer) for i in range(n_hidden_layers)]
        self.output_layer = torch.nn.Linear(n_nodes_per_hidden_layer, n_outputs)

    def forward(self, x):
        x *= .1
        activation_function = torch.nn.functional.relu
        x = activation_function(self.input_layer(x))
        for idx, layer in enumerate(self.hidden_layers):
            x = activation_function(layer(x))
        x = self.output_layer(x)
        return x
and I instantiate it in this way:
dnn = Net(
    n_inputs = 4,  # Defined by the number of observables (charge of each channel now).
    n_outputs = 2,  # x and y.
    n_hidden_layers = 3,  # Choose yourself.
    n_nodes_per_hidden_layer = 66,  # Choose yourself.
)
My input x is data that is distributed in a weird way between 0 and 1, and my outputs are 2 values in the range 1e-2 ± 100e-6. I have tried x -= .5 and different scalings too, but cannot make it work. I don't get any error; it just does not seem to learn what it is supposed to learn.
I know that this model should work because I have used it with similar data, distributed in a similar way but with inputs in the range 0-100e-12, using x *= 1e9, and it performed reasonably well. I don't know why, however.

Data transformation
Should they be between -1 and 1? Between 0 and 1? Can they be anything?
They can be any real-valued numbers, but in general we standardize the input values using the mean and standard deviation (so the result has zero mean and unit variance), like this (for the two-dimensional data you have, assuming samples are the zeroth dimension and features are the first dimension):
import torch
samples, features = 128, 4
data = torch.randn(samples, features)
std, mean = torch.std_mean(data, dim=0, keepdim=True)
normalized = (data - mean) / std
In general, neural networks work best with normalized inputs in similar ranges.
This transformation can (and should, as it makes things easier for the final nn.Linear layer) also be applied to your regression target, since it is reversible.
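For example, a minimal sketch of standardizing a regression target and undoing the transformation on predictions (the tensor names and shapes here are illustrative, not from the original post):
import torch

targets = torch.rand(128, 2) * 200e-6 + (1e-2 - 100e-6)  # toy targets around 1e-2 +- 100e-6
t_std, t_mean = torch.std_mean(targets, dim=0, keepdim=True)

targets_normalized = (targets - t_mean) / t_std  # train against these instead of the raw targets

# ... later, map the network's (normalized) predictions back to the original scale:
predictions_normalized = targets_normalized.clone()  # stand-in for the model output
predictions = predictions_normalized * t_std + t_mean  # inverse of the standardization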
Other things
Use torch.nn.MSELoss for regression
Use a standard optimizer (like Adam) with the default learning rate (you can fine-tune it later)
Make sure your pipeline works correctly end to end (a minimal sketch is shown below)
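A minimal sketch of such a pipeline, assuming standardized inputs and targets as above (the data tensors below are random placeholders; Net is the class from the question):
import torch

dnn = Net(n_inputs=4, n_outputs=2, n_hidden_layers=3, n_nodes_per_hidden_layer=66)
optimizer = torch.optim.Adam(dnn.parameters())  # default learning rate 1e-3
criterion = torch.nn.MSELoss()

inputs = torch.randn(1024, 4)   # placeholder for your standardized inputs
targets = torch.randn(1024, 2)  # placeholder for your standardized targets

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(dnn(inputs), targets)
    loss.backward()
    optimizer.step()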

Related

Pytorch model output is not correct (torch.float32 and torch.float64)

I have created a DNN model with PyTorch (input_dim=6, output_dim=150). Normally, if I generate a random X_in=torch.randn(6000, 6), it will return a model_out.shape=(6000, 150), and if I calculate the rank of model_out, it should be 150 (since my model's weights and biases are also randomly initialised).
However, you can see this is NOT TRUE with the following code:
import torch
import torch.nn as nn

torch.manual_seed(923)  # for reproducible result

class MyDNN(nn.Module):
    def __init__(self):
        super(MyDNN, self).__init__()
        # layer 0:
        self.linear_0 = nn.Linear(6, 150)
        self.activ_0 = nn.Tanh()
        # layer 1:
        self.linear_1 = nn.Linear(150, 150)
        self.activ_1 = nn.Tanh()
        # layer 2:
        self.linear_2 = nn.Linear(150, 150)
        self.activ_2 = nn.Tanh()
        # layer 3:
        self.linear_3 = nn.Linear(150, 150)
        self.activ_3 = nn.Tanh()

    def forward(self, x):
        out = self.activ_0(self.linear_0(x))    # output: layer 0
        out = self.activ_1(self.linear_1(out))  # output: layer 1
        out = self.activ_2(self.linear_2(out))  # output: layer 2
        out = self.activ_3(self.linear_3(out))  # output: layer 3
        return out

model = MyDNN()
X_in = torch.randn(6000, 6, dtype=torch.float32)
with torch.no_grad():
    model_out = model(X_in)
print(f'model_out rank = {torch.linalg.matrix_rank(model_out)}')
model_out rank = 115. Apparently this is a WRONG output; there is no way that the output has so many linearly dependent columns when all the inputs, weights and biases are randomly initialised!
This problem can be solved by changing the X_in dtype as well as the model dtype to float64 with the following code:
model_64 = MyDNN()
model_64.double()
X_in_64 = torch.randn(6000, 6, dtype=torch.float64)
with torch.no_grad():
    model_64_out = model_64(X_in_64)
print(f'model_64_out rank = {torch.linalg.matrix_rank(model_64_out)}')
model_64_out rank = 150
Here is my question:
Why does this happen? Is this really a problem of data size? I mean, float32 already has good precision. Actually, when I use my own training data, even with mini_batch_size = 10 -> output.shape = (10, 150), my rank(output) is less than 10.
Although this problem can be solved by using double precision, it slows down the whole training process a lot (and the Mac M1 Pro GPU only supports the float32 type). Is there any other solution?
You have to realize that we are dealing with a numerical problem here: the rank of a matrix is a discrete value derived from, e.g., a singular value decomposition in the case of torch.linalg.matrix_rank. In that case we need to consider a threshold on the singular values: below what modulus tol do we consider a singular value to be exactly zero?
Remember that we are dealing with floating point values, where every operation comes with truncation and rounding errors. In short, there is no sense in trying to compute an exact rank.
So instead you might reconsider what kind of tolerance you use; you could e.g. use torch.linalg.matrix_rank(..., tol=1e-6). The smaller the tolerance, the higher the expected rank.
But no matter what floating point precision you use, I'd argue you will never be able to find a meaningful "exact" number for the rank; it will always be a trade-off! Therefore I'd reconsider whether you really need to compute the rank in the first place, or whether there is some other criterion that is better suited for numerical considerations.
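As a rough illustration of how the tolerance changes the reported rank (a sketch only; depending on your PyTorch version the tolerance keyword is tol or atol/rtol):
import torch

torch.manual_seed(0)
# A 6000x150 float32 matrix that has rank 6 in exact arithmetic:
low_rank = torch.randn(6000, 6) @ torch.randn(6, 150)

print(torch.linalg.matrix_rank(low_rank))             # default (relative) tolerance
print(torch.linalg.matrix_rank(low_rank, atol=1e-6))  # very tight threshold: numerical noise may be counted as nonzero
print(torch.linalg.matrix_rank(low_rank, atol=1e-1))  # looser threshold: closer to the "mathematical" rank of 6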

Is it possible to mask Neural Network output variables depending on input variables

I have a weird use case for a neural network and want to understand if there is a way to accomplish what I'm trying to do.
I am trying to train a neural network that takes in 3 input variables and outputs 96 continuous variables. The output should ideally produce a continuous curve; however, the expected y values have a lot of missing data points (>50%) distributed randomly, which affects how the model trains. I know which data points are missing and am trying to find a way to ignore these outputs during backpropagation.
For example:
Input = [1,2,3]
Expected Output = [1,2,3,NAN,5,6,7,NAN,...] # NAN is set to 0 for training
Currently this is the method I am trying (tensorflow.keras)
in1 = layers.Input(3)
in2 = layers.Input(96) # Array of Bools, =1 if expected output variable is a number, =0 if nan
hidden1 = layers.Dense(37,activation='relu',use_bias=True)(in1)
hidden2 = layers.Dense(37,activation='relu',use_bias=True)(hidden1)
hidden3 = layers.Dense(37,activation='relu',use_bias=True)(hidden2)
hidden3_in2 = layers.concatenate([hidden3,in2])
out = layers.Dense(96)(hidden3_in2)
model = Model(inputs=[in1,in2], outputs=[out])
The expected output of this should be 0 where in2 == 0, and a number greater than 0 everywhere else. When using the model to predict data, I plug an array of 1's into in2, indicating that no expected values should equal 0, so a continuous curve should be output. However, many output variables still come out as 0, which is not ideal.
Essentially my question is: is there a good way to mask specific outputs during backprop and/or loss calculation using an array?
Thanks in advance!
All you have to do is write your custom loss function, where you literally mask out the loss values.
Something along the lines of
def my_loss_fn(y_true, y_pred):
    # Build a 0/1 mask from the NaN positions and replace the NaNs before computing the error,
    # so the masked positions contribute exactly zero to the loss.
    mask = tf.cast(tf.math.logical_not(tf.math.is_nan(y_true)), y_pred.dtype)
    y_true_clean = tf.where(tf.math.is_nan(y_true), tf.zeros_like(y_true), y_true)
    squared_difference = tf.square(y_true_clean - y_pred)
    return tf.reduce_mean(squared_difference * mask, axis=-1)  # Note the `axis=-1`

model.compile(optimizer='adam', loss=my_loss_fn)
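A quick sanity check of the masking behaviour on a toy batch (values made up; this uses the my_loss_fn defined above):
import numpy as np
import tensorflow as tf

y_true = tf.constant([[1.0, np.nan, 3.0]])
y_pred = tf.constant([[1.0, 99.0, 0.0]])
# The NaN position is ignored, so only the last element contributes: mean of [0, 0, 9] = 3
print(my_loss_fn(y_true, y_pred).numpy())  # -> 3.0
Note that the mean is still taken over all positions, masked or not; if you want the loss averaged only over the valid points, divide the masked sum by tf.reduce_sum(mask, axis=-1) instead.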

LSTM Autoencoder problems

TLDR:
Autoencoder underfits timeseries reconstruction and just predicts average value.
Question Set-up:
Here is a summary of my attempt at a sequence-to-sequence autoencoder. This image was taken from this paper: https://arxiv.org/pdf/1607.00148.pdf
Encoder: Standard LSTM layer. Input sequence is encoded in the final hidden state.
Decoder: LSTM Cell (I think!). Reconstruct the sequence one element at a time, starting with the last element x[N].
Decoder algorithm is as follows for a sequence of length N:
Get Decoder initial hidden state hs[N]: Just use encoder final hidden state.
Reconstruct last element in the sequence: x[N]= w.dot(hs[N]) + b.
Same pattern for other elements: x[i]= w.dot(hs[i]) + b
use x[i] and hs[i] as inputs to LSTMCell to get x[i-1] and hs[i-1]
Minimum Working Example:
Here is my implementation, starting with the encoder:
class SeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super(SeqEncoderLSTM, self).__init__()
        self.lstm = nn.LSTM(
            n_features,
            latent_size,
            batch_first=True)

    def forward(self, x):
        _, hs = self.lstm(x)
        return hs
Decoder class:
class SeqDecoderLSTM(nn.Module):
    def __init__(self, emb_size, n_features):
        super(SeqDecoderLSTM, self).__init__()
        self.cell = nn.LSTMCell(n_features, emb_size)
        self.dense = nn.Linear(emb_size, n_features)

    def forward(self, hs_0, seq_len):
        x = torch.tensor([])
        # Final hidden and cell state from encoder
        hs_i, cs_i = hs_0
        # reconstruct first element with encoder output
        x_i = self.dense(hs_i)
        x = torch.cat([x, x_i])
        # reconstruct remaining elements
        for i in range(1, seq_len):
            hs_i, cs_i = self.cell(x_i, (hs_i, cs_i))
            x_i = self.dense(hs_i)
            x = torch.cat([x, x_i])
        return x
Bringing the two together:
class LSTMEncoderDecoder(nn.Module):
    def __init__(self, n_features, emb_size):
        super(LSTMEncoderDecoder, self).__init__()
        self.n_features = n_features
        self.hidden_size = emb_size
        self.encoder = SeqEncoderLSTM(n_features, emb_size)
        self.decoder = SeqDecoderLSTM(emb_size, n_features)

    def forward(self, x):
        seq_len = x.shape[1]
        hs = self.encoder(x)
        hs = tuple([h.squeeze(0) for h in hs])
        out = self.decoder(hs, seq_len)
        return out.unsqueeze(0)
And here's my training function:
def train_encoder(model, epochs, trainload, testload=None, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-6, reverse=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)

    train_loss = []
    valid_loss = []

    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            x_hat = model(x)
            if reverse:
                x = torch.flip(x, [1])
            loss = criterion(x_hat, x)
            loss.backward()
            opt.step()
            running_tl += loss.item()

        if testload is not None:
            model.eval()
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
            valid_loss.append(running_vl / len(testload))
            model.train()

        train_loss.append(running_tl / len(trainload))
    return train_loss, valid_loss
Data:
Large dataset of events scraped from the news (ICEWS). Various categories exist that describe each event. I initially one-hot encoded these variables, expanding the data to 274 dimensions. However, in order to debug the model, I've cut it down to a single sequence that is 14 timesteps long and only contains 5 variables. Here is the sequence I'm trying to overfit:
tensor([[0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
[0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
[0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
[0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
[0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
[0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
[0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
[0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
[0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
[0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
[0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
[0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
[0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
[0.5279, 0.0629, 0.6886, 0.1514, 0.0971]], dtype=torch.float64)
And here is the custom Dataset class:
class TimeseriesDataSet(Dataset):
    def __init__(self, data, window, n_features, overlap=0):
        super().__init__()
        if isinstance(data, (np.ndarray)):
            data = torch.tensor(data)
        elif isinstance(data, (pd.Series, pd.DataFrame)):
            data = torch.tensor(data.copy().to_numpy())
        else:
            raise TypeError(f"Data should be ndarray, series or dataframe. Found {type(data)}.")
        self.n_features = n_features
        self.seqs = torch.split(data, window)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        try:
            return self.seqs[idx].view(-1, self.n_features)
        except TypeError:
            raise TypeError("Dataset only accepts integer index/slices, not lists/arrays.")
Problem:
The model only learns the average, no matter how complex I make the model or how long I train it.
Predicted/Reconstruction:
Actual:
My research:
This problem is identical to the one discussed in this question: LSTM autoencoder always returns the average of the input sequence
The problem in that case ended up being that the objective function was averaging the target timeseries before calculating loss. This was due to some broadcasting errors because the author didn't have the right sized inputs to the objective function.
In my case, I do not see this being the issue. I have checked and double checked that all of my dimensions/sizes line up. I am at a loss.
Other Things I've Tried
I've tried this with varied sequence lengths from 7 timesteps to 100 time steps.
I've tried with varied number of variables in the time series. I've tried with univariate all the way to all 274 variables that the data contains.
I've tried with various reduction parameters on the nn.MSELoss module. The paper calls for sum, but I've tried both sum and mean. No difference.
The paper calls for reconstructing the sequence in reverse order (see graphic above). I have tried this method using flipud on the original input (after the forward pass but before calculating the loss). This makes no difference.
I tried making the model more complex by adding an extra LSTM layer in the encoder.
I've tried playing with the latent space. I've tried from 50% of the input number of features to 150%.
I've tried overfitting a single sequence (provided in the Data section above).
Question:
What is causing my model to predict the average and how do I fix it?
Okay, after some debugging I think I know the reasons.
TLDR
You try to predict the next timestep's value instead of the difference between the current timestep and the previous one
Your hidden_features number is too small, making the model unable to fit even a single sample
Analysis
Code used
Let's start with the code (model is the same):
import seaborn as sns
import matplotlib.pyplot as plt


def get_data(subtract: bool = False):
    # (1, 14, 5)
    input_tensor = torch.tensor(
        [
            [0.5122, 0.0360, 0.7027, 0.0721, 0.1892],
            [0.5177, 0.0833, 0.6574, 0.1204, 0.1389],
            [0.4643, 0.0364, 0.6242, 0.1576, 0.1818],
            [0.4375, 0.0133, 0.5733, 0.1867, 0.2267],
            [0.4838, 0.0625, 0.6042, 0.1771, 0.1562],
            [0.4804, 0.0175, 0.6798, 0.1053, 0.1974],
            [0.5030, 0.0445, 0.6712, 0.1438, 0.1404],
            [0.4987, 0.0490, 0.6699, 0.1536, 0.1275],
            [0.4898, 0.0388, 0.6704, 0.1330, 0.1579],
            [0.4711, 0.0390, 0.5877, 0.1532, 0.2201],
            [0.4627, 0.0484, 0.5269, 0.1882, 0.2366],
            [0.5043, 0.0807, 0.6646, 0.1429, 0.1118],
            [0.4852, 0.0606, 0.6364, 0.1515, 0.1515],
            [0.5279, 0.0629, 0.6886, 0.1514, 0.0971],
        ]
    ).unsqueeze(0)
    if subtract:
        initial_values = input_tensor[:, 0, :]
        input_tensor -= torch.roll(input_tensor, 1, 1)
        input_tensor[:, 0, :] = initial_values
    return input_tensor


if __name__ == "__main__":
    torch.manual_seed(0)
    HIDDEN_SIZE = 10
    SUBTRACT = False

    input_tensor = get_data(SUBTRACT)
    model = LSTMEncoderDecoder(input_tensor.shape[-1], HIDDEN_SIZE)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = torch.nn.MSELoss()

    for i in range(1000):
        outputs = model(input_tensor)
        loss = criterion(outputs, input_tensor)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f"{i}: {loss}")
        if loss < 1e-4:
            break

    # Plotting
    sns.lineplot(data=outputs.detach().numpy().squeeze())
    sns.lineplot(data=input_tensor.detach().numpy().squeeze())
    plt.show()
What it does:
get_data either works on the data you provided (if subtract=False) or, if subtract=True, it subtracts the value of the previous timestep from the current timestep
The rest of the code optimizes the model until a loss of 1e-4 is reached (so we can compare how the model's capacity, and increasing it, helps, and what happens when we use the difference of timesteps instead of the timesteps themselves)
We will only vary HIDDEN_SIZE and SUBTRACT parameters!
NO SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=False
In this case we get a straight line. The model is unable to fit and grasp the phenomena presented in the data (hence the flat lines you mentioned).
1000 iterations limit reached
SUBTRACT, SMALL MODEL
HIDDEN_SIZE=5
SUBTRACT=True
Targets are now far from flat lines, but the model is unable to fit due to too little capacity.
1000 iterations limit reached
NO SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=False
It got a lot better and our target was hit after 942 steps. No more flat lines; model capacity seems quite fine (for this single example!).
SUBTRACT, LARGER MODEL
HIDDEN_SIZE=100
SUBTRACT=True
Although the graph does not look that pretty, we got to desired loss after only 215 iterations.
Finally
Usually, use the difference of timesteps instead of raw timesteps (or some other transformation; see here for more info about that). Otherwise the neural network will simply try to... copy the output from the previous step (as that's the easiest thing to do). Some minimum will be found this way, and getting out of it will require more capacity.
When you use the difference between timesteps, there is no way to "extrapolate" the trend from the previous timestep; the neural network has to learn how the function actually varies
Use a larger model (for the whole dataset you should try something like 300, I think), but you can simply tune that one.
Don't use flipud. Use bidirectional LSTMs; that way you can get info from the forward and backward passes of the LSTM (not to be confused with backprop!). This should also boost your score (a sketch of a bidirectional encoder follows this list)
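A minimal sketch of what a bidirectional encoder could look like for the model above (not part of the original answer; the two directions' states are concatenated here, so the decoder would need emb_size = 2 * latent_size and the squeeze in LSTMEncoderDecoder.forward would no longer be needed):
import torch
import torch.nn as nn

class BiSeqEncoderLSTM(nn.Module):
    def __init__(self, n_features, latent_size):
        super().__init__()
        self.lstm = nn.LSTM(n_features, latent_size, batch_first=True, bidirectional=True)

    def forward(self, x):
        _, (h, c) = self.lstm(x)
        # h and c have shape (2, batch, latent_size): one slice per direction.
        # Concatenate the directions so the decoder sees a single state of size 2 * latent_size.
        h = torch.cat([h[0], h[1]], dim=-1)
        c = torch.cat([c[0], c[1]], dim=-1)
        return (h, c)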
Questions
Okay, question 1: You are saying that for variable x in the time series, I should train the model to learn x[i] - x[i-1] rather than the value of x[i]? Am I interpreting that correctly?
Yes, exactly. The difference removes the urge of the neural network to base its predictions too much on the past timestep (by simply taking the last value and maybe changing it a little)
Question 2: You said my calculations for a zero bottleneck were incorrect. But, for example, let's say I'm using a simple dense network as an autoencoder. Getting the right bottleneck indeed depends on the data. But if you make the bottleneck the same size as the input, you get the identity function.
Yes, assuming that there is no non-linearity involved, which makes the thing harder (see here for a similar case). In the case of LSTMs there are non-linearities; that's one point.
Another one is that we are accumulating timesteps into a single encoder state. So essentially we would have to accumulate the identities of all timesteps into a single hidden and cell state, which is highly unlikely.
One last point: depending on the length of the sequence, LSTMs are prone to forgetting some of the least relevant information (that's what they were designed to do, not only to remember everything), which makes it even more unlikely.
Is num_features * num_timesteps not a bottleneck of the same size as the input, and therefore shouldn't it facilitate the model learning the identity?
It is, but it assumes you have num_timesteps for each data point, which is rarely the case (it might be here). The identity, and why it is hard for the network to learn with non-linearities, was addressed above.
One last point about identity functions: if they were actually easy to learn, ResNet architectures would be unlikely to succeed. Without the skip connection, the network could converge to the identity and make "small fixes" to the output, but that is not what happens in practice.
I'm curious about the statement "always use difference of timesteps instead of timesteps". It seems to have some normalizing effect by bringing all the features closer together, but I don't understand why this is key. Having a larger model seemed to be the solution, and the subtraction is just helping.
The key here was, indeed, increasing model capacity. The subtraction trick really depends on the data. Let's imagine an extreme situation:
We have 100 timesteps, single feature
Initial timestep value is 10000
Other timestep values vary by 1 at most
What would the neural network do (what is easiest here)? It would probably discard this change of 1 or less as noise and just predict 10000 for all of them (especially if some regularization is in place), as being off by 1 in 10000 is not much.
What if we subtract? Then the per-timestep error the network has to care about lives in a [0, 1] range instead of a range on the order of 10000, hence it is more severe to be wrong.
And yes, come to think of it, it is connected to normalization in some sense.
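As a small illustration of the differencing transform and why it loses no information (torch.diff and torch.cumsum are used here for brevity; the get_data function above does the same thing with torch.roll):
import torch

series = torch.tensor([10000.0, 10001.0, 10000.5, 10001.5, 10000.0])

diffs = torch.diff(series, prepend=torch.zeros(1))  # keep the first value, then step-to-step changes
print(diffs)                                        # values: 10000.0, 1.0, -0.5, 1.0, -1.5

restored = torch.cumsum(diffs, dim=0)               # cumulative sum undoes the differencing exactly
print(torch.allclose(restored, series))             # True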

TensorFlow tf.math.tanh: properly scale network output without requiring large batches

I am trying to implement a network presented in this paper.
The relevant excerpt has a descriptive image and is accompanied by an explanation.
The input is a feature vector of 353 floats and the label is a float in (-1500, 1500), scaled to [-1, 1].
The output should also be scaled to [-1, 1]. I used tf.math.tanh() to do this.
However, the only outputs I get are -1 and 1, and nothing in between.
The reason is that when printing the output of the second-to-last layer I get an array of arrays, e.g.:
[[-5670.9859206034189]
[-3783.2489875296314]
[6674.3844754595357]
[-1985.6217861227797]
[5615.7066561151887]]
This, as far as I know, causes tf.math.tanh to execute on each individual value in the array, resulting in either 1 or -1 depending on whether the input is negative or positive.
Since all labels are between -1500 and 1500 inclusive and normalized to -1 and 1, I could opt to add -1500 and 1500 to every value and feed it through the tanh function. This would result in a properly scaled value, even if it goes out of bounds, because it could at most be 1 or -1.
However, this approach is probably slower than manually scaling the value without tanh, by just dividing the value by 1500 and capping it at 1 and -1.
Another option would be to put all the values into a single array and run that through the tanh function. But this feels intuitively wrong. Imagine the array [200, 300, 400, 500]: tanh would scale the 500 to 1, while in reality 1500 should equate to 1, thus giving a wrong label. This means that tanh would depend a lot on batch size, e.g. 1000 samples would probably give better results than 100 samples. Inference would have the same issues and require that I always use a large batch size.
What is a proper solution for this problem?
This is part of my network code, I omitted some layers for brevity.
class FullFullyConnectedOutputLayer(tf.keras.layers.Layer):
    def __init__(self):
        super(FullFullyConnectedOutputLayer, self).__init__()

    def build(self, input_shape):
        stddev = 2 / np.sqrt(input_shape[-1] + 1)
        self.w = tf.Variable(tf.random.truncated_normal((input_shape[-1], 1), dtype='float64'), trainable=True)
        b_init = tf.zeros_initializer()
        self.b = tf.Variable(initial_value=b_init(shape=(1), dtype='float64'), trainable=True)

    def call(self, input):
        return tf.matmul(input, self.w) + self.b


class FullNetwork(tf.keras.Model):
    def __init__(self, ):
        super(FullNetwork, self).__init__(name='')
        self.inputLayer = FullFeatureLayer()
        self.hiddenLayer1 = FullFeatureLayer()
        self.hiddenLayer2 = FullFullyConnectedFeatureLayer()
        self.outputLayer = FullFullyConnectedOutputLayer()

    def call(self, input):
        x = self.inputLayer(input)
        x = self.hiddenLayer1(x)
        x = self.hiddenLayer2(x)
        x = self.outputLayer(x)
        return tf.math.tanh(x)


tf.keras.backend.set_floatx('float64')
fullNetwork = FullNetwork()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
fullNetwork.compile(optimizer, loss=tf.keras.losses.MeanAbsoluteError(), metrics=["accuracy"])
# epochs is 1 for debugging, batch_size is yet to be determined but probably 1000
fullNetwork.fit(feature_array[:10], score_array[:10], epochs=1, batch_size=5)
In each layer, the previous layer's output is matrix-multiplied by the weights, so you can get large values even if the weights themselves are "good".
When tanh gets inputs with magnitudes of even >10, it easily saturates to 1 (or -1) for every large value, because it is based on exponentials; it only behaves usefully for small values (roughly the range -2 to 2).
So if each weighted sum stays below 1, and each input node is also less than 1, you cannot reach a sum greater than 1, and that is what I suggest aiming for.
A few things you can do:
Use softmax as the activation function.
Scale the input values by their maximum (scale the input range to the output range).
You can check the sum of the range, and also divide by the sum you checked.
Best, and what I suggest, is to apply softmax to the input (as a first input transformation) together with Xavier initialization scaled to -2..2, so you ensure the pre-activation stays in -2..2, which is the useful range of tanh.
tanh is roughly linear in this range, so the following layers will also keep a small range (keeping the sum at "1" at most).
For every layer except the last, use tanh.
For the last layer, use softmax (that's the best approach in my view).
The derivative of softmax is a little complicated, so it's worth reading up on it:
https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative
It is important that you don't manipulate the output after computing it; that breaks backpropagation.
Good luck.

Where to start on creating a method that saves desired changes in a Tensor with PyTorch?

I have two tensors from which I am calculating the Spearman's rank correlation, and I would like PyTorch to automatically adjust the values in these tensors in a way that increases my Spearman's rank correlation number as much as possible.
I have explored autograd, but nothing I've found has explained it simply enough.
Initialized tensors:
a=Var(torch.randn(20,1),requires_grad=True)
psfm_s=Var(torch.randn(12,20),requires_grad=True)
How can I have a loop of constant adjustments to the values in these two tensors to get the highest Spearman's rank correlation from the 2 lists I make from these 2 tensors, while having PyTorch do the work? I just need a guide of where to go. Thank you!
I'm not familiar with Spearman's rank correlation, but if I understand your question, you're asking how to use PyTorch to solve problems other than deep networks?
If that's the case, then I'll provide a simple least squares example which I believe should be informative for your effort.
Consider a set of 200 measurements of 10 dimensional vectors x and y. Say we want to find a linear transform from x to y.
The least squares approach dictates that we can accomplish this by finding the matrix M and vector b which minimize ‖y - (Mx + b)‖².
The following example code generates some example data and then uses pytorch to perform this minimization. I believe the comments are sufficient to help you understand what is occurring here.
import torch
from torch.nn.parameter import Parameter
from torch import optim

# define some fake data
M_true = torch.randn(10, 10)
b_true = torch.randn(10, 1)
x = torch.randn(200, 10, 1)
noise = torch.matmul(M_true, 0.05 * torch.randn(200, 10, 1))
y = torch.matmul(M_true, x) + b_true + noise

# begin optimization
# define the parameters we want to optimize (using random starting values in this case)
M = Parameter(torch.randn(10, 10))
b = Parameter(torch.randn(10, 1))

# define the optimizer and provide the parameters we want to optimize
optimizer = optim.SGD((M, b), lr=0.1)

for i in range(500):
    # compute loss that we want to minimize
    y_hat = torch.matmul(M, x) + b
    loss = torch.mean((y - y_hat)**2)

    # zero the gradients of the parameters referenced by the optimizer (M and b)
    optimizer.zero_grad()

    # compute new gradients
    loss.backward()

    # update parameters M and b
    optimizer.step()

    if (i + 1) % 100 == 0:
        # scale learning rate by factor of 0.9 every 100 steps
        optimizer.param_groups[0]['lr'] *= 0.9
        print('step', i + 1, 'mse:', loss.item())

# final parameter values (data contains a torch.tensor)
print('Resulting parameters:')
print(M.data)
print(b.data)
print('Compare to the "real" values')
print(M_true)
print(b_true)
Of course this problem has a simple closed-form solution, but this numerical approach is just to demonstrate how to use PyTorch's autograd to solve problems not necessarily related to neural networks. I also chose to explicitly define the matrix M and vector b here rather than using the equivalent nn.Linear layer, since I think that would just confuse things.
In your case you want to maximize something, so make sure to negate your objective function before calling backward.
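As a tiny sketch of that last point, maximizing a differentiable objective by minimizing its negative (the objective here is just a stand-in, since a differentiable Spearman correlation is not shown in this answer):
import torch
from torch import optim

a = torch.randn(20, 1, requires_grad=True)
target = torch.linspace(-1, 1, 20).reshape(20, 1)

optimizer = optim.Adam([a], lr=0.01)
for step in range(200):
    objective = torch.cosine_similarity(a, target, dim=0).mean()  # quantity we want to MAXIMIZE
    loss = -objective                                             # negate it, then minimize as usual
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()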
