I want to compute entropy for two matrices (inputs). I want the output entropy still using having the matrix shape.
For example::
import numpy as np
def entropy(x, y):
probs = np.mean((x, y), axis=0)
p = probs.astype(np.float32)
return (-p * np.log2(p))
inp1 = np.random.random([5,4])
inp2 = np.random.random([5,4])
inp1_flatt = inp1.reshape([-1])
inp2_flatt = inp2.reshape([-1])
combine_out = entropy(inp1_flatt, inp2_flatt).reshape([5,4])
In the entropy() function, I think I have a problem with computing the posterior probabilities (probs).
How can I compute the posterior probabilities in the correct way ?
EDIT::
These two inputs suppose to be an regression outputs of Neural Network. I want to save their shape. since the input of the entropy (output of the neural network) has shape [5,4], I want the entropy output shape is [5,4]. I want to do something like combine sources using entropy (joint entropy method for continuous values)
Related
I am trying to calculate jacobian matrix from my neural network trained for autoregression.There are 9 input variables to the model and it predicts 3 variables as output.
Input shape=(1,9)
Output shape=(1,3)
And I can calculate the normal jacobian matrix of shape (3,9) from the current code of tensorflow.
I have the representation for the matrix that I can currently calculate attached here
Jacobian Matrix.
My issue is this jacobian calculation is very slow and I don't want to calculate all the jacobians and only those jacobians at the place marked in the above image.
I have the code snippets relevant to this issue.This jacobian for is used for extended kalman filter.
Can some one help me figure out how can I do this in tensorflow.
Code for jacobian calculation
def jacobian_tensorflow(self,verbose=False):
jacobian_matrix = []
it = tqdm(range(self.output_size)) if verbose else range(self.output_size)
for o in it:
grad_func = tf.gradients(self.nn_model.output[:, o], self.nn_model.input)
gradients = sess.run(grad_func, feed_dict={self.nn_model.input: self.pred_x.reshape((1, self.pred_x.size))})
jacobian_matrix.append(gradients[0][0,:])
return np.array(jacobian_matrix)
Code of my neural networks
input_window = Input(shape=(deg_order * 3,))
x = Dense(90, activation='tanh')(input_window)
x = Dense(60, activation='tanh')(x)
x = Dense(30, activation='tanh')(x)
x = Dense(15, activation='tanh')(x)
output = Dense(3, activation='tanh')(x)
autoencoder_model = Model(input_window, output)
autoencoder_model.compile(optimizer='adam', loss=tf.keras.metrics.mean_squared_error)
autoencoder_model.fit(x_train, y_train, epochs=epochs,
shuffle=True,
validation_data=(x_validate, y_validate))
To explain clearly I have added this image of what jacobian i want to calculate.
In image o/p means the output variable and i/p is the variable on the input side. The number are just for position of the variables on the input and output side. The numbers in layers are the neurons in that hidden layer
Jacobian I want to calculate from the neural network
Im fitting some data for a classification task using Gaussian Process Classifiers in sklearn. I know that for the Gaussian Process Regressor one can pass return_std in
y_test, std = gp.predict(x_test, return_std=True)
to output the standard deviation of the test sample (like in this question)
However, I couldn't find such a parameter for the GP Classifier.
Is there such thing as outputting the predictive mean and stdv of test data from a GP Classifiers? And is there a way to output the posterior mean and covariance of the fitted model?
There is not standard deviation for categorical data, hence there is no the parameter return_std in the Classifier.
However, if you want to quantify the uncertainty of the classifier predictions, you could use the .predict_proba(X)method. Once you get the probabilites of each posible class you could compute the entropy of the predicted probabilities.
You could get the variance associated with the logit function by going to the predict_proba function definition in _gpc.py and returning the 'var_f_star' value. I have modified the predict_proba and created a function to return the logit variance below:
def predict_var(self, X):
"""Return probability estimates for the test vector X.
Parameters
----------
X : array-like of shape (n_samples, n_features) or list of object
Query points where the GP is evaluated for classification.
Returns
-------
C : array-like of shape (n_samples, n_classes)
Returns the probability of the samples for each class in
the model. The columns correspond to the classes in sorted
order, as they appear in the attribute ``classes_``.
"""
check_is_fitted(self)
# Based on Algorithm 3.2 of GPML
K_star = self.kernel_(self.X_train_, X) # K_star =k(x_star)
f_star = K_star.T.dot(self.y_train_ - self.pi_) # Line 4
v = solve(self.L_, self.W_sr_[:, np.newaxis] * K_star) # Line 5
# Line 6 (compute np.diag(v.T.dot(v)) via einsum)
var_f_star = self.kernel_.diag(X) - np.einsum("ij,ij->j", v, v)
I'm currently trying to implement a siamese-net in Keras where I have to implement the following loss function:
loss(p ∥ q) = Is · KL(p ∥ q) + Ids · HL(p ∥ q)
detailed description of loss function from paper
Where KL is the Kullback-Leibler divergence and HL is the Hinge-loss.
During training, I label same-speaker pairs as 1, different speakers as 0.
The goal is to use the trained net to extract embeddings from spectrograms.
A spectrogram is a 2-dimensional numpy-array 40x128 (time x frequency)
The problem is I never get over 0.5 accuracy, and when clustering speaker-embeddings the results show there seems to be no correlation between embeddings and speakers
I implemented the kb-divergence as distance measure, and adjusted the hinge-loss accordingly:
def kullback_leibler_divergence(vects):
x, y = vects
x = ks.backend.clip(x, ks.backend.epsilon(), 1)
y = ks.backend.clip(y, ks.backend.epsilon(), 1)
return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
def kullback_leibler_shape(shapes):
shape1, shape2 = shapes
return shape1[0], 1
def kb_hinge_loss(y_true, y_pred):
"""
y_true: binary label, 1 = same speaker
y_pred: output of siamese net i.e. kullback-leibler distribution
"""
MARGIN = 1.
hinge = ks.backend.mean(ks.backend.maximum(MARGIN - y_pred, 0.), axis=-1)
return y_true * y_pred + (1 - y_true) * hinge
A single spectrogram would be fed into a branch of the base network, the siamese-net consists of two such branches, so two spectrograms are fed simultaneously, and joined in the distance-layer. The output of the base network is 1 x 128. The distance layer computes the kullback-leibler divergence and its output is fed into the kb_hinge_loss. The architecture of the base-network is as follows:
def create_lstm(units: int, gpu: bool, name: str, is_sequence: bool = True):
if gpu:
return ks.layers.CuDNNLSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
else:
return ks.layers.LSTM(units, return_sequences=is_sequence, input_shape=INPUT_DIMS, name=name)
def build_model(mode: str = 'train') -> ks.Model:
topology = TRAIN_CONF['topology']
is_gpu = tf.test.is_gpu_available(cuda_only=True)
model = ks.Sequential(name='base_network')
model.add(
ks.layers.Bidirectional(create_lstm(topology['blstm1_units'], is_gpu, name='blstm_1'), input_shape=INPUT_DIMS))
model.add(ks.layers.Dropout(topology['dropout1']))
model.add(ks.layers.Bidirectional(create_lstm(topology['blstm2_units'], is_gpu, is_sequence=False, name='blstm_2')))
if mode == 'extraction':
return model
num_units = topology['dense1_units']
model.add(ks.layers.Dense(num_units, name='dense_1'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
model.add(ks.layers.Dropout(topology['dropout2']))
num_units = topology['dense2_units']
model.add(ks.layers.Dense(num_units, name='dense_2'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
num_units = topology['dense3_units']
model.add(ks.layers.Dense(num_units, name='dense_3'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
num_units = topology['dense4_units']
model.add(ks.layers.Dense(num_units, name='dense_4'))
model.add(ks.layers.advanced_activations.PReLU(init='zero', weights=None))
return model
I then build a siamese net as follows:
base_network = build_model()
input_a = ks.Input(shape=INPUT_DIMS, name='input_a')
input_b = ks.Input(shape=INPUT_DIMS, name='input_b')
processed_a = base_network(input_a)
processed_b = base_network(input_b)
distance = ks.layers.Lambda(kullback_leibler_divergence,
output_shape=kullback_leibler_shape,
name='distance')([processed_a, processed_b])
model = ks.Model(inputs=[input_a, input_b], outputs=distance)
adam = build_optimizer()
model.compile(loss=kb_hinge_loss, optimizer=adam, metrics=['accuracy'])
Lastly, I build a net with the same architecture with only one input, and try to extract embeddings, and then build the mean over them, where an embedding should serve as a representation for a speaker, to be used during clustering:
utterance_embedding = np.mean(embedding_extractor.predict_on_batch(spectrogram), axis=0)
We train the net on the voxceleb speaker set.
The full code can be seen here: GitHub repo
I'm trying to figure out if I have made any wrong assumptions and how to improve my accuracy.
Issue with accuracy
Notice that in your model:
y_true = labels
y_pred = kullback-leibler divergence
These two cannot be compared, see this example:
For correct results, when y_true == 1 (same
speaker), Kullback-Leibler is y_pred == 0 (no divergence).
So it's totally expected that metrics will not work properly.
Then, either you create a custom metric, or you count only on the loss for evaluations.
This custom metric should need a few adjustments in order to be feasible, as explained below.
Possible issues with the loss
Clipping
This might be a problem
First, notice that you're using clip in the values for the Kullback-Leibler. This may be bad because clips lose the gradients in the clipped regions. And since your activation is a PRelu, you have values lower than zero and bigger than 1. Then there are certainly zero gradient cases here and there, with the risk of having a frozen model.
So, you might not want to clip these values. And to avoid having negative values with the PRelu, you can try to use a 'softplus' activation, which is kind of a soft relu without negative values. You might also "sum" an epsilon to avoid trouble, but there is no problem in leaving values bigger than one:
#considering you used 'softplus' instead of 'PRelu' in speakers
def kullback_leibler_divergence(speakers):
x, y = speakers
x = x + ks.backend.epsilon()
y = y + ks.backend.epsilon()
return ks.backend.sum(x * ks.backend.log(x / y), axis=-1)
Assimetry in Kullback-Leibler
This IS a problem
Notice also that Kullback-Leibler is not a symetric function, and also doesn't have its minimum at zero!! The perfect match is zero, but bad matches can have lower values, and this is bad for a loss function because it will drive you to divergence.
See this picture showing KB's graph
Your paper states that you should sum two losses: (p||q) and (q||p).
This eliminates the assimetry and also the negative values.
So:
distance1 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance1')([processed_a, processed_b])
distance2 = ks.layers.Lambda(kullback_leibler_divergence,
name='distance2')([processed_b, processed_a])
distance = ks.layers.Add(name='dist_add')([distance1,distance2])
Very low margin and clipped hinge
This might be a problem
Finally, see that the hinge loss also clips values below zero!
Since Kullback-Leibler is not limited to 1, samples with high divergency may not be controled by this loss. Not sure if this really an issue, but you might want to either:
increase the margin
inside the Kullback-Leibler, use mean instead of sum
use a softplus in hinge instead of a max, to avoid losing gradients.
See:
MARGIN = someValue
hinge = ks.backend.mean(ks.backend.softplus(MARGIN - y_pred), axis=-1)
Now we can think of a custom accuracy
This is not very easy, since we don't have clear limits on KB that tells us "correct/not correct"
You might try one at random, but you'd need to tune this threshold parameter until you find a good thing that represents reality. You may for instance use your validation data to find the threshold that brings the best accuracy.
def customMetric(y_true_targets, y_pred_KBL):
isMatch = ks.backend.less(y_pred_KBL, threshold)
isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
isMatch = ks.backend.equal(y_true_targets, isMatch)
isMatch = ks.backend.cast(isMatch, ks.backend.floatx())
return ks.backend.mean(isMatch)
There are standard ways of predicting proportions such as logistic regression (without thresholding) and beta regression. There have already been discussions about this:
http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
http://scikit-learn-general.narkive.com/lLVQGzyl/beta-regression
I cannot tell if there exists a work-around within the sklearn framework.
There exists a workaround, but it is not intrinsically within the sklearn framework.
If you have a proportional target variable (value range 0-1) you run into two basic difficulties with scikit-learn:
Classifiers (such as logistic regression) deal with class labels as target variables only. As a workaround you could simply threshold your probabilities to 0/1 and interpret them as class labels, but you would lose a lot of information.
Regression models (such as linear regression) do not restrict the target variable. You can train them on proportional data, but there is no guarantee that the output on unseen data will be restricted to the 0/1 range. However, in this situation, there is a powerful work-around (below).
There are different ways to mathematically formulate logistic regression. One of them is the generalized linear model, which basically defines the logistic regression as a normal linear regression on logit-transformed probabilities. Normally, this approach requires sophisticated mathematical optimization because the probabilities are unknown and need to be estimated along with the regression coefficients.
In your case, however, the probabilities are known. This means you can simply transform them with y = log(p / (1 - p)). Now they cover the full range from -oo to oo and can serve as the target variable for a LinearRegression model [*]. Of course, the model output then needs to be transformed again to result in probabilities p = 1 / (exp(-y) + 1).
import numpy as np
from sklearn.linear_model import LinearRegression
class LogitRegression(LinearRegression):
def fit(self, x, p):
p = np.asarray(p)
y = np.log(p / (1 - p))
return super().fit(x, y)
def predict(self, x):
y = super().predict(x)
return 1 / (np.exp(-y) + 1)
if __name__ == '__main__':
# generate example data
np.random.seed(42)
n = 100
x = np.random.randn(n).reshape(-1, 1)
noise = 0.1 * np.random.randn(n).reshape(-1, 1)
p = np.tanh(x + noise) / 2 + 0.5
model = LogitRegression()
model.fit(x, p)
print(model.predict([[-10], [0.0], [1]]))
# [[ 2.06115362e-09]
# [ 5.00000000e-01]
# [ 8.80797078e-01]]
There are also numerous other alternatives. Some non-linear regression models can work naturally in the 0-1 range. For example Random Forest Regressors will never exceed the target variables' range they were trained with. Simply put probabilities in and you will get probabilities out. Neural networks with appropriate output activation functions (tanh, I guess) will also work well with probabilities, but if you want to use those there are more specialized libraries than sklearn.
[*] You could in fact plug in any linear regression model which can make the method more powerful, but then it no longer is exactly equivalent to logistic regression.
I am having a hard time with calculating cross entropy in tensorflow. In particular, I am using the function:
tf.nn.softmax_cross_entropy_with_logits()
Using what is seemingly simple code, I can only get it to return a zero
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
a = tf.placeholder(tf.float32, shape =[None, 1])
b = tf.placeholder(tf.float32, shape = [None, 1])
sess.run(tf.global_variables_initializer())
c = tf.nn.softmax_cross_entropy_with_logits(
logits=b, labels=a
).eval(feed_dict={b:np.array([[0.45]]), a:np.array([[0.2]])})
print c
returns
0
My understanding of cross entropy is as follows:
H(p,q) = p(x)*log(q(x))
Where p(x) is the true probability of event x and q(x) is the predicted probability of event x.
There if input any two numbers for p(x) and q(x) are used such that
0<p(x)<1 AND 0<q(x)<1
there should be a nonzero cross entropy. I am expecting that I am using tensorflow incorrectly. Thanks in advance for any help.
In addition to Don's answer (+1), this answer written by mrry may interest you, as it gives the formula to calculate the cross entropy in TensorFlow:
An alternative way to write:
xent = tf.nn.softmax_cross_entropy_with_logits(logits, labels)
...would be:
softmax = tf.nn.softmax(logits)
xent = -tf.reduce_sum(labels * tf.log(softmax), 1)
However, this alternative would be (i) less numerically stable (since
the softmax may compute much larger values) and (ii) less efficient
(since some redundant computation would happen in the backprop). For
real uses, we recommend that you use
tf.nn.softmax_cross_entropy_with_logits().
Like they say, you can't spell "softmax_cross_entropy_with_logits" without "softmax". Softmax of [0.45] is [1], and log(1) is 0.
Measures the probability error in discrete classification tasks in which the
classes are mutually exclusive (each entry is in exactly one class). For
example, each CIFAR-10 image is labeled with one and only one label: an image
can be a dog or a truck, but not both.
NOTE: While the classes are mutually exclusive, their probabilities
need not be. All that is required is that each row of labels is
a valid probability distribution. If they are not, the computation of the
gradient will be incorrect.
If using exclusive labels (wherein one and only
one class is true at a time), see sparse_softmax_cross_entropy_with_logits.
WARNING: This op expects unscaled logits, since it performs a softmax
on logits internally for efficiency. Do not call this op with the
output of softmax, as it will produce incorrect results.
logits and labels must have the same shape [batch_size, num_classes]
and the same dtype (either float16, float32, or float64).
Here is an implementation in Tensorflow 2.0 in case somebody else (me probably) needs it in the future.
#tf.function
def cross_entropy(x, y, epsilon = 1e-9):
return -2 * tf.reduce_mean(y * tf.math.log(x + epsilon), -1) / tf.math.log(2.)
x = tf.constant([
[1.0,0],
[0.5,0.5],
[.75,.25]
]
,dtype=tf.float32)
with tf.GradientTape() as tape:
tape.watch(x)
y = entropy(x, x)
tf.print(y)
tf.print(tape.gradient(y, x))
Output
[-0 1 0.811278105]
[[-1.44269502 29.8973541]
[-0.442695022 -0.442695022]
[-1.02765751 0.557305]]