Different sigmoid equations and their implementation - Python

While reviewing the sigmoid function that is used in neural nets, we found this equation at https://en.wikipedia.org/wiki/Softmax_function#Softmax_Normalization:
sigmoid(x_i) = 1 / (1 + exp(-(x_i - mean(x)) / std(x)))
which is different from the standard sigmoid equation:
sigmoid(x) = 1 / (1 + exp(-x))
The first equation somehow involves the mean and standard deviation (I hope I didn't read the symbols wrongly), whereas the second equation lumps the "subtract the mean and divide by the standard deviation" part into a constant, since it is the same for all terms within a vector/matrix/tensor.
So when implementing the equations, I get different results.
With the 2nd equation (standard sigmoid function):
import numpy as np

def sigmoid(x):
    return 1. / (1 + np.exp(-x))
I get this output:
>>> x = np.array([1, 2, 3])
>>> print(sigmoid(x))
[ 0.73105858  0.88079708  0.95257413]
I would have expected the first function to behave similarly, but the gap between the first and second element widens by quite a bit (though the ranking of the elements remains the same):
def get_statistics(x):
    n = float(len(x))
    m = x.sum() / n
    s2 = sum((x - m)**2) / (n - 1.)  # sample variance
    s = s2**0.5                      # sample standard deviation
    return m, s2, s

# unpack in the same order the function returns
m, s2, s = get_statistics(x)
sigmoid_x1 = 1 / (1 + np.exp(-(x[0] - m) / s))
sigmoid_x2 = 1 / (1 + np.exp(-(x[1] - m) / s))
sigmoid_x3 = 1 / (1 + np.exp(-(x[2] - m) / s))
sigmoid_x1, sigmoid_x2, sigmoid_x3
[out]:
(0.2689414213699951, 0.5, 0.7310585786300049)
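For reference, equation 1 can also be vectorized; this reproduces the three values above in one shot (a sketch, assuming numpy is imported as np and using the sample standard deviation, as in get_statistics):

def normalized_sigmoid(x):
    m = x.mean()
    s = x.std(ddof=1)  # ddof=1 gives the sample standard deviation, matching get_statistics
    return 1. / (1. + np.exp(-(x - m) / s))

>>> print(normalized_sigmoid(np.array([1, 2, 3])))
[ 0.26894142  0.5         0.73105858]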
Possibly it has to do with the fact that the first equation contains some sort of softmax normalization, but if it were the generic softmax then the elements would need to sum to one:
def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()
>>> x = np.array([1, 2, 3])
>>> print(softmax(x))
[ 0.09003057  0.24472847  0.66524096]
But the output from the first equation doesn't sum to one, and it isn't the same as the standard sigmoid output either. So the questions are:
Have I implemented the function for equation 1 wrongly?
Is equation 1 on the Wikipedia page wrong? Or is it referring to something else and not really the sigmoid/logistic function?
Why is there a difference between the first and second equations?

You have implemented the equations correctly. Your problem is that you are mixing up the definitions of softmax and sigmoid functions.
A softmax function is a way to normalize your data by making outliers "less interesting". Additionally, it "squashes" your input vector so that its entries sum to 1.
For your example:
> np.sum([ 0.09003057, 0.24472847, 0.66524096])
> 1.0
It is simply a generalization of the logistic function with the additional "constraint" that every element of the vector lies in the interval (0, 1) and the whole vector sums to 1.0.
The sigmoid function is a special case of the logistic function. It is just a real-valued, differentiable function with an S shape. It is interesting for neural networks because it is rather easy to compute, is non-linear, and is bounded below and above, so your activation cannot diverge but saturates if the input gets "too large".
However, a sigmoid function does not ensure that the resulting vector sums to 1.0.
In neural networks, sigmoid functions are frequently used as activation functions for single neurons, while a sigmoid/softmax normalization function is rather used at the output layer, to ensure that the whole layer adds up to 1. You just mixed up the sigmoid function (for single neurons) and the sigmoid/softmax normalization function (for a whole layer).
EDIT: To clarify this for you, I will give you an easy example with outliers; it demonstrates the behaviour of the two different functions.
Let's implement a sigmoid function:
import numpy as np

def s(x):
    return 1.0 / (1.0 + np.exp(-x))
And the normalized version (in little steps, making it easier to read):
def sn(x):
    numerator = x - np.mean(x)
    denominator = np.std(x)
    fraction = numerator / denominator
    return 1.0 / (1.0 + np.exp(-fraction))
Now we define some measurements of something with huge outliers:
measure = np.array([0.01, 0.2, 0.5, 0.6, 0.7, 1.0, 2.5, 5.0, 50.0, 5000.0])
Now we take a look at the results that s (sigmoid) and sn (normalized sigmoid) give:
> s(measure)
> array([ 0.50249998, 0.549834 , 0.62245933, 0.64565631, 0.66818777,
0.73105858, 0.92414182, 0.99330715, 1. , 1. ])
> sn(measure)
> array([ 0.41634425, 0.41637507, 0.41642373, 0.41643996, 0.41645618,
0.41650485, 0.41674821, 0.41715391, 0.42447515, 0.9525677 ])
As you can see, s only translates the values "one-by-one" via a logistic function, so the outliers are fully saturated at 0.99, 1.0 and 1.0, while the distances between the other values vary.
When we look at sn, we see that the function actually normalized our values: everything is now nearly identical, except for 0.95, which was the 5000.0.
What is this good for or how to interpret this?
Think of an output layer in a neural network: an activation of 5000.0 for one class on the output layer (compared to our other small values) means that the network is really sure that this is the "right" class for your given input. If you had used s there, you would end up with 0.99, 1.0 and 1.0 and would not be able to distinguish which class is the correct guess for your input.

In this case you have to differentiate between three things: a sigmoid function, a sigmoid function with softmax normalization, and a softmax function.
A sigmoid function is a real-valued function which is simply given by the equation f(x) = 1 / (1 + exp(-x)). For many years it was used in the machine learning domain because it squashes a real input into the (0, 1) interval, which can be interpreted as, e.g., a probability value. Right now many experts advise against using it because of its saturation and non-zero-mean problems. You can read about this (and about how to deal with those problems), e.g., here: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf.
A sigmoid with softmax normalization is used to deal with two important problems which may occur when using a plain sigmoid. The first is dealing with outliers (it squashes your x to be centred around 0 with sd = 1, which normalizes your data), and the second (which is, in my opinion, more crucial) is making different variables equally important in further analysis. To understand this phenomenon, imagine that you have two variables, age and income, where age varies from 20 to 70 and income varies from 2000 to 60000. Without normalizing the data, both variables would be squashed to almost one by the sigmoid transformation; moreover, due to its greater mean absolute value, the income variable would be significantly more important for your analysis without any rational justification.
I think that standardization is even more crucial for understanding softmax normalization than dealing with outliers is. To understand why, imagine a variable which is equal to 0 in 99% of cases and 1 otherwise. In this case mean ~ 0.01 and sd ~ 0.1, so softmax normalization maps the 1s to roughly (1 - 0.01) / 0.1 ≈ 10, making them even more extreme outliers.
Something completely different is a softmax function. A softmax function is a mathematical transformation from R^k to R^k which squashes a real-valued vector into a positive-valued vector of the same size which sums to 1. It's given by the equation softmax(v) = exp(v) / sum(exp(v)). It's something completely different from softmax normalization, and it's usually used for multiclass classification.
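As a side note on implementing the softmax function in practice: exp overflows for large inputs, so it is common to subtract the maximum first, which leaves the result mathematically unchanged. A sketch (assuming numpy is imported as np):

def stable_softmax(x):
    # exp(x - c) / sum(exp(x - c)) == exp(x) / sum(exp(x)) for any constant c,
    # so shifting by the max avoids overflow without changing the result.
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()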

Related

Trouble implementing "concurrent" softmax function from paper (PyTorch)

I am trying to implement the so-called 'concurrent' softmax function given in the paper "Large-Scale Object Detection in the Wild from Imbalanced Multi-Labels". The variant of the concurrent softmax implemented below (see the NOTE) is:
concurrent_softmax(z_i) = exp(z_i) / (sum_j (1 - y_j) * exp(z_j) + exp(z_i))
NOTE: I have left the (1-rij) term out for the time being because I don't think it applies to my problem given that my training dataset has a different type of labeling compared to the paper.
To keep it simple for myself I am starting off by implementing it in a very inefficient, but easy to follow, way using for loops. However, the output I get seems wrong to me. Below is the code I am using:
import torch

# here is a one-hot encoded vector for the multi-label classification
# the image thus has 2 correct labels out of a possible 3 classes
y = [0, 1, 1]
# these are some made up logits that might come from the network.
vec = torch.tensor([0.2, 0.9, 0.7])

def concurrent_softmax(vec, y):
    for i in range(len(vec)):
        zi = torch.exp(vec[i])
        sum_over_j = 0
        for j in range(len(y)):
            sum_over_j += (1 - y[j]) * torch.exp(vec[j])
        out = zi / (sum_over_j + zi)
        yield out

for result in concurrent_softmax(vec, y):
    print(result)
From this implementation I have realized that, no matter what value I give to the first logit in 'vec' I will always get an output of 0.5 (because it essentially always calculates zi / (zi+zi)). This seems like a major problem, because I would expect the value of the logits to have some influence on the resulting concurrent-softmax value. Is there a problem in my implementation then, or is this behaviour of the function correct and there is something theoretically that I am not understanding?
This is the expected behaviour: since y[j] = 1 for every other j, all the (1-y[j]) terms vanish except the one for i itself, so the denominator reduces to exp(z_i) + exp(z_i).
Note you can simplify the summation with a dot product:
y = torch.tensor(y)

def concurrent_softmax(z, y):
    sum_over_j = torch.dot((torch.ones(len(y)) - y), torch.exp(z))
    for zi in z:
        numerator = torch.exp(zi)
        denominator = sum_over_j + numerator
        yield numerator / denominator
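A quick sanity check of this vectorized version with the inputs from the question (a sketch; by construction the first output is still 0.5, as explained above):

vec = torch.tensor([0.2, 0.9, 0.7])
y = torch.tensor([0., 1., 1.])
print(list(concurrent_softmax(vec, y)))
# [tensor(0.5000), tensor(0.6682), tensor(0.6225)]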

Checking Neural Network Gradient with Finite Difference Methods Doesn't Work

After a full week of print statements, dimensional analysis, refactoring, and talking through the code out loud, I can say I'm completely stuck.
The gradients my cost function produces are too far from those produced by finite differences.
I have confirmed my cost function produces correct costs for both regularized and unregularized inputs. Here's the cost function:
import numpy as np
import pandas as pd

# assumes sigmoid() and sigmoidGradient() are defined elsewhere

def nnCost(nn_params, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels):
    # reshape parameter/weight vectors to suit network size
    Theta1 = np.reshape(nn_params[:hidden_layer_size * (input_layer_size + 1)], (hidden_layer_size, (input_layer_size + 1)))
    Theta2 = np.reshape(nn_params[(hidden_layer_size * (input_layer_size + 1)):], (num_labels, (hidden_layer_size + 1)))
    if lambda_ is None:
        lambda_ = 0
    # grab number of observations
    m = X.shape[0]
    # init variables we must return
    cost = 0
    Theta1_grad = np.zeros(Theta1.shape)
    Theta2_grad = np.zeros(Theta2.shape)
    # one-hot encode the vector y
    y_mtx = pd.get_dummies(y.ravel()).to_numpy()
    ones = np.ones((m, 1))
    X = np.hstack((ones, X))
    # layer 1
    a1 = X
    z2 = Theta1 @ a1.T
    # layer 2
    ones_l2 = np.ones((y.shape[0], 1))
    a2 = np.hstack((ones_l2, sigmoid(z2.T)))
    z3 = Theta2 @ a2.T
    # layer 3
    a3 = sigmoid(z3)
    reg_term = (lambda_/(2*m)) * (np.sum(np.sum(np.multiply(Theta1, Theta1))) + np.sum(np.sum(np.multiply(Theta2, Theta2))) - np.subtract((Theta1[:,0].T @ Theta1[:,0]), (Theta2[:,0].T @ Theta2[:,0])))
    cost = (1/m) * np.sum((-np.log(a3).T * (y_mtx) - np.log(1-a3).T * (1-y_mtx))) + reg_term
    # BACKPROPAGATION
    # δ3 equals the difference between a3 and the y_matrix
    d3 = a3 - y_mtx.T
    # δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units) multiplied element-wise by the g′() of z2 (computed back in Step 2).
    d2 = Theta2[:,1:].T @ d3 * sigmoidGradient(z2)
    # Δ1 equals the product of δ2 and a1.
    Delta1 = d2 @ a1
    Delta1 /= m
    # Δ2 equals the product of δ3 and a2.
    Delta2 = d3 @ a2
    Delta2 /= m
    reg_term1 = (lambda_/m) * np.append(np.zeros((Theta1.shape[0], 1)), Theta1[:,1:], axis=1)
    reg_term2 = (lambda_/m) * np.append(np.zeros((Theta2.shape[0], 1)), Theta2[:,1:], axis=1)
    Theta1_grad = Delta1 + reg_term1
    Theta2_grad = Delta2 + reg_term2
    grad = np.append(Theta1_grad.ravel(), Theta2_grad.ravel())
    return cost, grad
Here's the code to check the gradients. I have been over every line and there is nothing whatsoever that I can think of to change here. It seems to be in working order.
def checkNNGradients(lambda_):
    """
    Creates a small neural network to check the backpropagation gradients.
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Input: Regularization parameter, lambda, as int or float.
    Output: Analytical gradients produced by backprop code and the numerical gradients (computed
            using computeNumericalGradient). These two gradient computations should result in
            very similar values.
    """
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    # generate 'random' test data
    Theta1 = debugInitializeWeights(hidden_layer_size, input_layer_size)
    Theta2 = debugInitializeWeights(num_labels, hidden_layer_size)
    # reusing debugInitializeWeights to generate X
    X = debugInitializeWeights(m, input_layer_size - 1)
    y = np.ones(m) + np.remainder(np.arange(m), num_labels)
    # unroll parameters
    nn_params = np.append(Theta1.ravel(), Theta2.ravel())
    costFunc = lambda p: nnCost(p, X, y, lambda_, input_layer_size, hidden_layer_size, num_labels)
    cost, grad = costFunc(nn_params)
    numgrad = computeNumericalGradient(costFunc, nn_params)
    # examine the two gradient computations; the two columns should be very similar.
    print('The columns below should be very similar.\n')
    # Credit: http://stackoverflow.com/a/27663954/583834
    print('{:<25}{}'.format('Numerical Gradient', 'Analytical Gradient'))
    for numerical, analytical in zip(numgrad, grad):
        print('{:<25}{}'.format(numerical, analytical))
    # If you have a correct implementation, and assuming you used EPSILON = 0.0001
    # in computeNumericalGradient, then diff below should be less than 1e-9
    diff = np.linalg.norm(numgrad - grad) / np.linalg.norm(numgrad + grad)
    print(diff)
    print("\n")
    print('If your backpropagation implementation is correct, then \n'
          'the relative difference will be small (less than 1e-9). \n'
          '\nRelative Difference: {:.10f}'.format(diff))
The check function generates its own data using a debugInitializeWeights function (so there's the reproducible example; just run that and it will call the other functions), and then calls the function that calculates the gradient using finite differences. Both are below.
def debugInitializeWeights(fan_out, fan_in):
    """
    Initializes the weights of a layer with fan_in
    incoming connections and fan_out outgoing connections using a fixed
    strategy.
    Input: fan_out, number of outgoing connections for a layer as int; fan_in, number
           of incoming connections for the same layer as int.
    Output: Weight matrix, W, of shape (fan_out, 1 + fan_in); the first column of W handles the "bias" terms.
    """
    W = np.zeros((fan_out, 1 + fan_in))
    # Initialize W using "sin"; this ensures that the values in W are of similar scale,
    # which is useful for debugging
    W = np.sin(range(1, np.size(W) + 1)) / 10
    return W.reshape(fan_out, fan_in + 1)
def computeNumericalGradient(J, nn_params):
    """
    Computes the gradient using "finite differences"
    and provides a numerical estimate of the gradient (i.e.,
    gradient of the function J around theta).
    Credit: Based on the MATLAB code provided by Dr. Andrew Ng, Stanford Univ.
    Inputs: Cost function, J, such as the nnCost function; parameter vector, theta.
    Output: Gradient vector using finite differences. Per Dr. Ng,
            'Sets numgrad(i) to (a numerical approximation of) the partial derivative of
            J with respect to the i-th input argument, evaluated at theta. (i.e., numgrad(i) should
            be the (approximately) the partial derivative of J with respect
            to theta(i).)'
    """
    numgrad = np.zeros(nn_params.shape)
    perturb = np.zeros(nn_params.shape)
    e = .0001
    for i in range(np.size(nn_params)):
        # Set perturbation (i.e., noise) vector
        perturb[i] = e
        # run cost fxn w/ noise added to and subtracted from parameters theta in nn_params
        cost1, grad1 = J(nn_params - perturb)
        cost2, grad2 = J(nn_params + perturb)
        # record the central difference in cost function outputs; this is the numerical gradient
        numgrad[i] = (cost2 - cost1) / (2 * e)
        perturb[i] = 0
    return numgrad
The code is not for class. That MOOC was in MATLAB and it's over. This is for me. Other solutions exist on the web; looking at them has proved fruitless. Everyone has a different (inscrutable) approach. So, I'm in serious need of assistance or a miracle.
Edit/Update: Fortran ordering when raveling vectors influences the outcome, but I have not been able to get the gradients to move together changing that option.
One thought: I think your perturbation is a little large, being 1e-4. For a one-sided difference with double-precision floats, the rule of thumb is the square root of machine precision, about 1e-8; for the central difference you are using, the optimum is nearer the cube root, around 1e-5 to 1e-6 (or are you working with single precision?!).
That being said, finite differences can be very bad approximations to true derivatives. Specifically, floating-point computations in numpy are not exact, as you seem to have found out. The rounding noise in evaluations can cancel out many significant digits under some circumstances. What values are you seeing, and what are you expecting?
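The effect of the step size is easy to see on a function whose derivative is known exactly. A sketch with a central difference on sin at x = 1:

import numpy as np

x0, exact = 1.0, np.cos(1.0)
for e in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]:
    approx = (np.sin(x0 + e) - np.sin(x0 - e)) / (2 * e)
    print(e, abs(approx - exact))
# The error first shrinks as e does (truncation error ~ e**2), then grows
# again once floating-point cancellation in the numerator dominates.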
All of the following figured into the solution to my problem. For those attempting to translate MATLAB code to Python, whether from Andrew Ng's Coursera Machine Learning course or not, these are things everyone should know.
MATLAB stores arrays in FORTRAN (column-major) order; NumPy defaults to C (row-major) order. This affects how vectors are populated and, thus, your results. You should always use FORTRAN order if you want your answers to match what you did in MATLAB. See the docs.
Getting your vectors in FORTRAN order can be as easy as passing order='F' as an argument to .reshape(), .ravel(), or .flatten(). You can achieve the same thing with .ravel() by transposing the array first and then applying .ravel(), as in X.T.ravel() (see the sketch after these points).
Speaking of .ravel(): the .ravel() and .flatten() functions do not do the same thing and may have different use cases. For example, .flatten() is preferred by SciPy optimization methods. So, if your equivalent of fminunc isn't working, it's likely because you forgot to .flatten() your response vector y. See this StackOverflow Q&A and the docs on .ravel(), which link to .flatten().
If you're translating your code from a MATLAB live script into a Jupyter notebook or Google Colab, you must police your namespace. On one occasion, I found that the variable I thought was being passed was not actually the variable being passed. Why? Jupyter and Colab notebooks keep a lot of global state that one would never write ordinarily.
There is a better function for evaluating the differences between numerical and analytical gradients: the relative error np.abs(numerical - analytical) / (np.abs(numerical) + np.abs(analytical)). Read about it in the CS231 notes. Also, consider the accepted answer above.
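A minimal sketch of the ordering point above, showing that X.T.ravel() matches ravel(order='F'):

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A.ravel())            # C (row-major) order:    [1 2 3 4 5 6]
print(A.ravel(order='F'))   # Fortran (column-major): [1 4 2 5 3 6]
print(A.T.ravel())          # same as Fortran order:  [1 4 2 5 3 6]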

Implementing a trainable generalized Bump function layer in Keras/Tensorflow

I'm trying to code the following variant of the bump function, applied component-wise:
f_σ(x) = exp(-σ / (σ - x^2)) for x in (-σ, σ), and 0 otherwise,
where σ is trainable; but it's not working (errors reported below).
My attempt:
Here's what I've coded up so far (if it helps). Suppose I have two functions (for example):
def f_True(x):
    # Compute Bump Function
    bump_value = 1 - tf.math.pow(x, 2)
    bump_value = -tf.math.pow(bump_value, -1)
    bump_value = tf.math.exp(bump_value)
    return bump_value

def f_False(x):
    # Compute Bump Function
    x_out = 0 * x
    return x_out
class trainable_bump_layer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(trainable_bump_layer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        self.threshold_level = self.add_weight(name='threshlevel',
                                               shape=[1],
                                               initializer='GlorotUniform',
                                               trainable=True)

    def call(self, input):
        # Determine Thresholding Logic
        The_Logic = tf.math.less(input, self.threshold_level)
        # Apply Logic
        output_step_3 = tf.cond(The_Logic,
                                lambda: f_True(input),
                                lambda: f_False(input))
        return output_step_3
Error Report:
Train on 100 samples
Epoch 1/10
WARNING:tensorflow:Gradients do not exist for variables ['reconfiguration_unit_steps_3_3/threshlevel:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['reconfiguration_unit_steps_3_3/threshlevel:0'] when minimizing the loss.
32/100 [========>.....................] - ETA: 3s
...
tensorflow:Gradients do not exist for variables
Moreover, it does not seem to be applied component-wise (besides the non-trainable problem). What could be the problem?
Unfortunately, no operation to check whether x is within (-σ, σ) will be differentiable and therefore σ cannot be learnt via any gradient descent method. Specifically, it is not possible to compute the gradients with respect to self.threshold_level because tf.math.less is not differentiable with respect to the condition.
Regarding the element-wise conditional, you can instead use tf.where to select elements from f_True(input) or f_False(input) according to the component-wise boolean values of the condition. For example:
output_step_3 = tf.where(The_Logic, f_True(input), f_False(input))
NOTE: I answered based on the provided code, where self.threshold_level is not used in f_True nor f_False. If self.threshold_level is used in those functions as in the provided formula, the function will, of course, be differentiable with respect to self.threshold_level.
Updated 19/04/2020: Thank you @today for the clarification.
I suggest you try a normal distribution instead of a bump.
In my tests here, this bump function is not behaving well (I can't find a bug, but don't rule one out; my graph shows two very sharp bumps, which is not good for networks).
With a normal distribution, you would get a regular and differentiable bump whose height, width and center you can control.
So, you may try this function:
y = a * exp ( - b * (x - c)²)
Try it in some graph and see how it behaves.
For this:
from tensorflow.keras import backend as K

class trainable_bump_layer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(trainable_bump_layer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        # suggested shape (has a different kernel for each input feature/channel)
        shape = tuple(1 for _ in input_shape[:-1]) + input_shape[-1:]
        # for your desired shape of only 1:
        shape = tuple(1 for _ in input_shape)  # all ones

        # height
        self.kernel_a = self.add_weight(name='kernel_a',
                                        shape=shape,
                                        initializer='ones',
                                        trainable=True)
        # inverse width
        self.kernel_b = self.add_weight(name='kernel_b',
                                        shape=shape,
                                        initializer='ones',
                                        trainable=True)
        # center
        self.kernel_c = self.add_weight(name='kernel_c',
                                        shape=shape,
                                        initializer='zeros',
                                        trainable=True)

    def call(self, input):
        exp_arg = -self.kernel_b * K.square(input - self.kernel_c)
        return self.kernel_a * K.exp(exp_arg)
I am a bit surprised that no one has mentioned the main (and only) reason for the given warning! As it seems, the code is supposed to implement the generalized variant of the bump function; however, just take a look at the functions implemented again:
def f_True(x):
    # Compute Bump Function
    bump_value = 1 - tf.math.pow(x, 2)
    bump_value = -tf.math.pow(bump_value, -1)
    bump_value = tf.math.exp(bump_value)
    return bump_value

def f_False(x):
    # Compute Bump Function
    x_out = 0 * x
    return x_out
The error is evident: there is no usage of the trainable weight of the layer in these functions! So there is no surprise that you get the message saying that no gradient exists for it: you are not using the weight at all, so there is no gradient to update it! Rather, this is exactly the original bump function (i.e. with no trainable weight).
But you might say: "at least, I used the trainable weight in the condition of tf.cond, so there must be some gradients?!" However, it's not like that; let me clear up the confusion:
First of all, as you have noticed as well, we are interested in element-wise conditioning. So instead of tf.cond you need to use tf.where.
The other misconception is the claim that, since tf.less is used as the condition and it is not differentiable, i.e. it has no gradient with respect to its inputs (which is true: there is no defined gradient for a function with boolean output w.r.t. its real-valued inputs!), this results in the given warning!
That's simply wrong! The derivative here would be taken of the output of the layer w.r.t. the trainable weight, and the selection condition is NOT present in the output. Rather, it's just a boolean tensor which determines the output branch to be selected. That's it! The derivative of the condition is not taken and will never be needed. So that's not the reason for the given warning; the reason is only and only what I mentioned above: there is no contribution of the trainable weight to the output of the layer. (Note: if the point about the condition is a bit surprising to you, then think about a simple example: the ReLU function, which is defined as relu(x) = 0 if x < 0 else x. If the derivative of the condition, i.e. x < 0, were considered/needed, which does not exist, then we would not be able to use ReLU in our models and train them using gradient-based optimization methods at all!)
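To make this concrete, here is a minimal sketch (assuming TF 2.x eager execution): when sigma appears only in the condition, there is no gradient path to it; as soon as it appears in a selected branch, the gradient exists. The boolean condition itself never needs a derivative in either case.

import tensorflow as tf

x = tf.constant([0.3, 0.8])
sigma = tf.Variable(0.5)
with tf.GradientTape(persistent=True) as tape:
    cond = tf.math.less(x, sigma)
    y1 = tf.where(cond, x, 0.0)          # sigma appears only in the condition
    y2 = tf.where(cond, x / sigma, 0.0)  # sigma appears in the selected values
print(tape.gradient(y1, sigma))  # None: no path from sigma to the output
print(tape.gradient(y2, sigma))  # a real gradient (-x/sigma**2, summed over selected entries)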
(Note: starting from here, I would refer to and denote the threshold value as sigma, like in the equation).
All right! We found the reason behind the error in implementation. Could we fix this? Of course! Here is the updated working implementation:
import tensorflow as tf
from tensorflow.keras.initializers import RandomUniform
from tensorflow.keras.constraints import NonNeg

class BumpLayer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(BumpLayer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        self.sigma = self.add_weight(
            name='sigma',
            shape=[1],
            initializer=RandomUniform(minval=0.0, maxval=0.1),
            trainable=True,
            constraint=tf.keras.constraints.NonNeg()
        )
        super().build(input_shape)

    def bump_function(self, x):
        return tf.math.exp(-self.sigma / (self.sigma - tf.math.pow(x, 2)))

    def call(self, inputs):
        greater = tf.math.greater(inputs, -self.sigma)
        less = tf.math.less(inputs, self.sigma)
        condition = tf.logical_and(greater, less)
        output = tf.where(
            condition,
            self.bump_function(inputs),
            0.0
        )
        return output
A few points regarding this implementation:
We have replaced tf.cond with tf.where in order to do element-wise conditioning.
Further, as you can see, unlike your implementation, which only checked one side of the inequality, we are using tf.math.less, tf.math.greater and also tf.logical_and to find out whether the input values have magnitudes less than sigma (alternatively, we could do this using just tf.math.abs and tf.math.less; no difference!). And let us repeat it: using boolean-output functions in this way does not cause any problems and has nothing to do with derivatives/gradients.
We are also using a non-negativity constraint on the sigma value learned by the layer. Why? Because sigma values less than zero do not make sense (i.e. the range (-sigma, sigma) is ill-defined when sigma is negative).
And considering the previous point, we take care to initialize the sigma value properly (i.e. to a small non-negative value).
And also, please don't do things like 0.0 * inputs! It's redundant (and a bit weird), it is equivalent to 0.0, and both have a gradient of 0.0 (w.r.t. inputs). Multiplying zero by a tensor does not add anything or solve any existing issue, at least not in this case!
Now, let's test it to see how it works. We write some helper functions to generate training data based on a fixed sigma value, and also to create a model which contains a single BumpLayer with input shape of (1,). Let's see if it could learn the sigma value which is used for generating training data:
import numpy as np

def generate_data(sigma, min_x=-1, max_x=1, shape=(100000, 1)):
    assert sigma >= 0, 'Sigma should be non-negative!'
    x = np.random.uniform(min_x, max_x, size=shape)
    xp2 = np.power(x, 2)
    condition = np.logical_and(x < sigma, x > -sigma)
    y = np.where(condition, np.exp(-sigma / (sigma - xp2)), 0.0)
    dy = np.where(condition, xp2 * y / np.power((sigma - xp2), 2), 0)
    return x, y, dy

def make_model(input_shape=(1,)):
    model = tf.keras.Sequential()
    model.add(BumpLayer(input_shape=input_shape))
    model.compile(loss='mse', optimizer='adam')
    return model
# Generate training data using a fixed sigma value.
sigma = 0.5
x, y, _ = generate_data(sigma=sigma, min_x=-0.1, max_x=0.1)
model = make_model()
# Store initial value of sigma, so that it could be compared after training.
sigma_before = model.layers[0].get_weights()[0][0]
model.fit(x, y, epochs=5)
print('Sigma before training:', sigma_before)
print('Sigma after training:', model.layers[0].get_weights()[0][0])
print('Sigma used for generating data:', sigma)
# Sigma before training: 0.08271004
# Sigma after training: 0.5000002
# Sigma used for generating data: 0.5
Yes, it could learn the value of sigma used for generating the data! But is it guaranteed to work for all different values of training data and initializations of sigma? The answer is: NO! Actually, it is possible to run the code above and get nan as the value of sigma after training, or inf as the loss value! So what's the problem? Why might these nan or inf values be produced? Let's discuss it below...
Dealing with numerical stability
One of the important things to consider when building a machine learning model and training it with gradient-based optimization methods is the numerical stability of the operations and calculations in the model. When extremely large or small values are generated by an operation or its gradient, they will almost certainly disrupt the training process (that is, for example, one of the reasons behind normalizing image pixel values in CNNs).
So, let's take a look at this generalized bump function (and let's set the thresholding aside for now). It's obvious that this function has singularities (i.e. points where either the function or its gradient is not defined) at x^2 = sigma (i.e. when x = sqrt(sigma) or x = -sqrt(sigma)). The animated diagram below shows the bump function (the solid red line), its derivative w.r.t. sigma (the dotted green line) and the x = sigma and x = -sigma lines (two vertical dashed blue lines), as sigma starts from zero and increases to 5:
As you can see, around the regions of singularity the function is not well-behaved for any value of sigma, in the sense that both the function and its derivative take extremely large values there. So, given an input value in those regions for a particular value of sigma, exploding output and gradient values would be generated; hence the issue of inf loss values.
Even further, there is a problematic behavior of tf.where which causes the issue of nan values for the sigma variable in the layer: surprisingly, if the value produced in the inactive branch of tf.where is extremely large or inf, which with the bump function results in extremely large or inf gradient values, then the gradient of tf.where would be nan, despite the fact that the inf is in the inactive branch and is not even selected (see this GitHub issue, which discusses exactly this)!
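Here is a minimal reproduction of that tf.where behaviour (a sketch): the selected branch is just the identity, but the inactive sqrt branch has an infinite derivative at zero, and the 0 * inf product in the backward pass yields nan:

import tensorflow as tf

x = tf.constant(0.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.where(x > 0.0, tf.sqrt(x), x)  # forward value is fine (0.0)
print(tape.gradient(y, x))  # tf.Tensor(nan, ...) instead of the expected 1.0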
So is there any workaround for this behavior of tf.where? Yes, there is actually a trick to resolve this issue, which is explained in this answer: basically, we can use an additional tf.where to prevent the function from being applied to these regions. In other words, instead of applying self.bump_function to any input value, we filter out those values which are NOT in the range (-self.sigma, self.sigma) (i.e. the actual range in which the function should be applied) and instead feed the function zero (which always produces safe values; the bump function at zero equals exp(-1)):
output = tf.where(
    condition,
    self.bump_function(tf.where(condition, inputs, 0.0)),
    0.0
)
Applying this fix entirely resolves the issue of nan values for sigma. Let's evaluate it on training data generated with different sigma values and see how it performs:
true_learned_sigma = []
for s in np.arange(0.1, 10.0, 0.1):
    model = make_model()
    x, y, dy = generate_data(sigma=s, shape=(100000, 1))
    model.fit(x, y, epochs=3 if s < 1 else (5 if s < 5 else 10), verbose=False)
    sigma = model.layers[0].get_weights()[0][0]
    true_learned_sigma.append([s, sigma])
    print(s, sigma)

# Check if the learned values of sigma
# are actually close to the true values of sigma, for all the experiments.
res = np.array(true_learned_sigma)
print(np.allclose(res[:,0], res[:,1], atol=1e-2))
# True
It learned all the sigma values correctly! That's nice; the workaround worked! Although there is one caveat: this is guaranteed to work properly and learn any sigma value only if the input values to this layer are greater than -1 and less than 1 (i.e. the default case of our generate_data function); otherwise, there is still the issue of inf loss values, which might happen if the input values have a magnitude greater than 1 (see points #1 and #2 below).
Here is some food for thought for the curious and interested mind:
It was just mentioned that if the input values to this layer are greater than 1 or less than -1, then it may cause problems. Can you argue why this is the case? (Hint: use the animated diagram above and consider cases where sigma > 1 and the input value is between sqrt(sigma) and sigma (or between -sigma and -sqrt(sigma)).)
Can you provide a fix for the issue in point #1, i.e. such that the layer could work for all input values? (Hint: like the workaround for tf.where, think about how you can further filter-out the unsafe values which the bump function could be applied on and produce exploding output/gradient.)
However, if you are not interested in fixing this issue and would like to use this layer in a model as it is now, then how would you guarantee that the input values to this layer are always between -1 and 1? (Hint: as one solution, there is a commonly-used activation function which produces values exactly in this range and could potentially be used as the activation function of the layer before this one.)
If you take a look at the last code snippet, you will see that we have used epochs=3 if s < 1 else (5 if s < 5 else 10). Why is that? Why large values of sigma need more epochs to be learned? (Hint: again, use the animated diagram and consider the derivative of function for input values between -1 and 1 as sigma value increases. What are their magnitude?)
Do we also need to check the generated training data for any nan, inf or extremely large values of y and filter them out? (Hint: yes, if sigma > 1 and range of values, i.e. min_x and max_x, fall outside of (-1, 1); otherwise, no that's not necessary! Why is that? Left as an exercise!)

Differentiable round function in Tensorflow?

So the output of my network is a list of probabilities, which I then round using tf.round() to be either 0 or 1; this is crucial for this project.
I then found out that tf.round isn't differentiable, so I'm kinda lost there.. :/
Something along the lines of x - sin(2pi x)/(2pi)?
I'm sure there's a way to squish the slope to be a bit steeper.
You can use the fact that tf.maximum() and tf.minimum() are differentiable, and that the inputs are probabilities from 0 to 1:
# round numbers less than 0.5 to zero;
# by making them negative and taking the maximum with 0
differentiable_round = tf.maximum(x-0.499,0)
# scale the remaining numbers (0 to 0.5) to greater than 1
# the other half (zeros) is not affected by multiplication
differentiable_round = differentiable_round * 10000
# take the minimum with 1
differentiable_round = tf.minimum(differentiable_round, 1)
Example:
[0.1, 0.5, 0.7]
[-0.0989, 0.001, 0.20099] # x - 0.499
[0, 0.001, 0.20099] # max(x-0.499, 0)
[0, 10, 2009.9] # max(x-0.499, 0) * 10000
[0, 1.0, 1.0] # min(max(x-0.499, 0) * 10000, 1)
This works for me:
x_rounded_NOT_differentiable = tf.round(x)
x_rounded_differentiable = x - tf.stop_gradient(x - x_rounded_NOT_differentiable)
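This is the straight-through estimator trick: in the forward pass the expression x - (x - round(x)) collapses to round(x), while tf.stop_gradient hides the correction term from the backward pass, so gradients flow as if the whole thing were the identity. A quick check (a sketch, assuming TF 2.x):

import tensorflow as tf

x = tf.Variable([0.1, 0.7])
with tf.GradientTape() as tape:
    rounded = x - tf.stop_gradient(x - tf.round(x))
    loss = tf.reduce_sum(rounded)
print(rounded.numpy())         # [0. 1.]  -- forward pass is a hard round
print(tape.gradient(loss, x))  # [1. 1.]  -- backward pass is the identity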
Rounding is a fundamentally nondifferentiable function, so you're out of luck there. The normal procedure in this kind of situation is to find a way to either use the probabilities, say by using them to calculate an expected value, or to take the maximum probability that is output and choose that one as the network's prediction. If you aren't using the output to calculate your loss function, though, you can go ahead and just apply it to the result, and it doesn't matter whether it's differentiable. Now, if you want an informative loss function for the purpose of training the network, maybe you should consider whether keeping the output in the format of probabilities might actually be to your advantage (it will likely make your training process smoother); that way you can just convert the probabilities to actual estimates outside of the network, after training.
Building on a previous answer, a way to get an arbitrarily good approximation is to approximate round() using a finite Fourier approximation and use as many terms as you need. Fundamentally, you can think of round(x) as adding a reverse (i.e. descending) sawtooth wave to x. So, using the Fourier expansion of the sawtooth wave, we get
round(x) ≈ x + sum_{n=1..N} (-1)^n * sin(2*pi*n*x) / (n*pi)
With N = 5, we get a pretty nice approximation.
Kind of an old question, but I just solved this problem for TensorFlow 2.0. I am using the following round function in my audio auto-encoder project. I basically want to create a discrete representation of sound which is compressed in time. I use the round function to clamp the output of the encoder to integer values. It has been working well for me so far.

import tensorflow as tf

@tf.custom_gradient
def round_with_gradients(x):
    def grad(dy):
        return dy
    return tf.round(x), grad
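Usage is the same straight-through idea as above, with the gradient declared explicitly (a sketch):

x = tf.Variable([0.2, 1.6])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(round_with_gradients(x))
print(tape.gradient(y, x))  # [1. 1.], as if rounding were the identity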
In the range 0 to 1, translating and scaling a sigmoid can be a solution:
slope = 1000
center = 0.5
e = tf.exp(slope*(x-center))
round_diff = e/(e+1)
In tensorflow 2.10, there is a function called soft_round which achieves exactly this.
Fortunately, for those who are using lower versions, the source code is really simple, so I just copy-pasted those lines, and it works like a charm:
import tensorflow as tf

def soft_round(x, alpha, eps=1e-3):
    """Differentiable approximation to `round`.

    Larger alphas correspond to closer approximations of the round function.
    If alpha is close to zero, this function reduces to the identity.

    This is described in Sec. 4.1. of the paper
    "Universally Quantized Neural Compression",
    Eirikur Agustsson & Lucas Theis,
    https://arxiv.org/abs/2006.09952

    Args:
      x: `tf.Tensor`. Inputs to the rounding function.
      alpha: Float or `tf.Tensor`. Controls smoothness of the approximation.
      eps: Float. Threshold below which `soft_round` will return identity.

    Returns:
      `tf.Tensor`
    """
    # This guards the gradient of tf.where below against NaNs, while maintaining
    # correctness, as for alpha < eps the result is ignored.
    alpha_bounded = tf.maximum(alpha, eps)
    m = tf.floor(x) + .5
    r = x - m
    z = tf.tanh(alpha_bounded / 2.) * 2.
    y = m + tf.tanh(alpha_bounded * r) / z
    # For very low alphas, soft_round behaves like identity
    return tf.where(alpha < eps, x, y, name="soft_round")
alpha sets how soft the function is. Greater values lead to better approximations of the round function, but then it becomes harder to fit, since gradients vanish:
import numpy as np
import matplotlib.pyplot as plt

x = tf.convert_to_tensor(np.arange(-2, 2, .1).astype(np.float32))
for alpha in [3., 7., 15.]:
    y = soft_round(x, alpha)
    plt.plot(x.numpy(), y.numpy(), label=f'alpha={alpha}')
plt.legend()
plt.title('Soft round function for different alphas')
plt.grid()
In my case, I tried different values for alpha, and 3. looks like a good choice.

Fitting curve: why small numbers are better?

I spent some time on a problem these days. I have a set of data:
y = f(t), where y is a very small concentration (~10^-7) and t is in seconds. t varies from 0 to around 12000.
The measurements follow an established model:
y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
And I need to find Vs, Vi, and k. So I used curve_fit, which returns the best fitting parameters, and I plotted the curve.
And then I used a similar model:
y = (Vs * t/3600 - ((Vs - Vi) * (1 - np.exp(-k * t/3600)) / k)) * 10**7
By doing that, t is a number of hours, and y is a number between 0 and about 10. The parameters returned are of course different. But when I plot each curve, here is what I get:
http://i.imgur.com/XLa4LtL.png
The green fit is the first model, the blue one with the "normalized" model. And the red dots are the experimental values.
The fitted curves are different. I think that's not expected, and I don't understand why. Are the calculations more accurate if the numbers are "reasonable"?
The docstring for optimize.curve_fit says,
p0 : None, scalar, or M-length sequence
Initial guess for the parameters. If None, then the initial
values will all be 1 (if the number of parameters for the function
can be determined using introspection, otherwise a ValueError
is raised).
Thus, to begin with, the initial guess for the parameters is by default 1.
Moreover, curve-fitting algorithms have to sample the function for various values of the parameters. The "various values" are initially chosen with an initial step size on the order of 1. The algorithm will work better if your data varies somewhat smoothly with changes in the parameter values on the order of 1.
If the function varies wildly with parameter changes on the order of 1, then the algorithm may tend to miss the optimum parameter values.
Note that even if the algorithm uses an adaptive step size when it tweaks the parameter values, if the initial tweak is so far off the mark as to produce a big residual, and if tweaking in some other direction happens to produce a smaller residual, then the algorithm may wander off in the wrong direction and miss the local minimum. It may find some other (undesired) local minimum, or simply fail to converge. So using an algorithm with an adaptive step size won't necessarily save you.
The moral of the story is that scaling your data can improve the algorithm's chances of finding the desired minimum.
Numerical algorithms in general all tend to work better when applied to data whose magnitude is on the order of 1. This bias enters into the algorithm in numerous ways. For instance, optimize.curve_fit relies on optimize.leastsq, and the call signature for optimize.leastsq is:
def leastsq(func, x0, args=(), Dfun=None, full_output=0,
            col_deriv=0, ftol=1.49012e-8, xtol=1.49012e-8,
            gtol=0.0, maxfev=0, epsfcn=None, factor=100, diag=None):
Thus, by default, the tolerances ftol and xtol are on the order of 1e-8. If finding the optimum parameter values requires much smaller tolerances, then these hard-coded default numbers will cause optimize.curve_fit to miss the optimal parameter values.
To make this more concrete, suppose you were trying to minimize f(x) = 1e-100 * x**2. The factor of 1e-100 squashes the y-values so much that a wide range of x-values (the parameter values mentioned above) fit within the tolerance of 1e-8. So, with less-than-ideal scaling, leastsq will not do a good job of finding the minimum.
Another reason to use floats on the order of 1 is that there are many more (IEEE 754) floats in the interval [-1, 1] than there are far away from 1. For example,
import struct
def floats_between(x, y):
    """
    http://stackoverflow.com/a/3587987/190597 (jsbueno)
    """
    a = struct.pack("<dd", x, y)
    b = struct.unpack("<qq", a)
    return b[1] - b[0]
In [26]: floats_between(0,1) / float(floats_between(1e6,1e7))
Out[26]: 311.4397707054894
This shows there are over 300 times as many floats representing numbers between 0 and 1 as there are in the interval [1e6, 1e7].
Thus, all else being equal, you'll typically get a more accurate answer if working with small numbers than very large numbers.
I would imagine it has more to do with the initial parameter estimates you are passing to curve_fit. If you are not passing any, I believe they all default to 1. Normalizing your data makes those initial estimates closer to the truth. If you don't want to use normalized data, just pass the initial estimates yourself and give them reasonable values.
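In practice that just means supplying p0. A sketch with the model from the question (the guess values here are only illustrative placeholders, not fitted numbers; t and y are the measured arrays):

import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    return Vs * t - (Vs - Vi) * (1 - np.exp(-k * t)) / k

# p0 replaces the default initial guess of (1, 1, 1).
popt, pcov = curve_fit(model, t, y, p0=[1e-11, 1e-10, 1e-3])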
Others have already mentioned that you probably need a good starting guess for your fit. In cases like this, I usually try to find some quick and dirty tricks to get at least a ballpark estimate of the parameters. In your case, for large t the exponential decays pretty quickly to zero, so for large t you have
y == Vs * t - (Vs - Vi) / k
Doing a first-order linear fit like
[slope1, offset1] = polyfit(t[t > 2000], y[t > 2000], 1)
you will get slope1 == Vs and offset1 == (Vi - Vs) / k.
Subtracting this straight line from all the points you have, you get the exponential
residual == y - slope1 * t - offset1 == (Vs - Vi) * exp(-t * k) / k
Taking the log of both sides, you get
log(residual) == log((Vs - Vi) / k) - t * k
So doing a second fit
[slope2, offset2] = polyfit(t, log(y - slope1 * t - offset1), 1)
will give you slope2 == -k and offset2 == log((Vs - Vi) / k), which should be solvable for Vi since you already know Vs and k. You might have to limit the second fit to small values of t, otherwise you might be taking the log of negative numbers. Collect all the parameters you obtained with these fits and use them as the starting points for your curve_fit.
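A sketch of that bootstrap with numpy (assuming t and y are the measured arrays, the exponential has died out for t > 2000, and Vs > Vi so the residual is positive):

import numpy as np

# Linear tail: y ~ Vs*t + (Vi - Vs)/k for large t.
slope1, offset1 = np.polyfit(t[t > 2000], y[t > 2000], 1)
Vs = slope1

# The residual is the exponential part: (Vs - Vi) * exp(-k*t) / k.
mask = t < 2000  # keep the region where the residual is clearly positive
residual = y[mask] - slope1 * t[mask] - offset1
slope2, offset2 = np.polyfit(t[mask], np.log(residual), 1)
k = -slope2
Vi = Vs - k * np.exp(offset2)

p0 = [Vs, Vi, k]  # starting point for curve_fit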
Finally, you might want to look into doing some sort of weighted fit. The information about the exponential part of your curve is contained in just the first few points, so maybe you should give those a higher weight. Doing this in a statistically correct way is not trivial.
