Clear Implementation of Softmax and Its Derivative - python

I'm currently writing my first multilayer neural net with python 3.7 and numpy, and I'm having trouble implementing softmax (I intend to use my network for classification, so having a working implementation of softmax is pretty crucial). I copied this code off of a different thread:
def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)
I think I have a basic understanding of the intended function of the softmax function; that is, to take a vector and turn its elements into probabilities so that they sum to 1. Please correct my understanding if I'm wrong. I don't quite understand how this code accomplishes that function, but I found similar code on multiple other threads, so I believe it to be correct. Please confirm.
Unfortunately, in none of these threads could I find a clear implementation of the derivative of the softmax function. I understand it to be more complicated than that of most activation functions, and to require more parameters than just x, but I have no idea how to implement it myself. I'm looking for an explanation of what those other parameters are, as well as for an implementation (or mathematical expression) of the derivative of the softmax function.

Answer for how this code accomplishes that function:
Here, we make use of a concept known as broadcasting.
When you call np.exp(x) on a vector x, you actually perform an operation similar to what can be accomplished by the following code:
exps = []
for i in x:
    exps.append(np.exp(i))
return exps
The above code is the longer version of what broadcasting does automatically here.
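For example, here is a quick check (a minimal numpy sketch) that the vectorized call matches the element-wise loop:
import numpy as np

x = np.array([1.0, 2.0, 3.0])
# np.exp applied to the whole vector at once...
vectorized = np.exp(x)
# ...matches exp applied element by element.
looped = np.array([np.exp(v) for v in x])
assert np.allclose(vectorized, looped)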
As for the implementation of the derivative, that's a bit more complicated, as you say.
An untested implementation for computing the vector of derivatives with respect to every parameter:
def softmax_derivative(X):
    # input : a vector X
    # output : a vector of the derivatives of softmax(X) wrt every element in X
    derivs = []
    # denominator of softmax, shared by every term
    denom = np.sum(np.exp(X), axis=0)
    for i, x in enumerate(X):
        # quotient rule: d softmax(X)_i / d X_i = exp(x) * (denom - exp(x)) / denom**2
        comm = np.exp(x) / (denom ** 2)
        factor = 0
        # add up exp of every element except the current one
        # (compare indices, not values, so equal entries are not skipped)
        for j, other in enumerate(X):
            if j == i:
                continue
            factor += np.exp(other)
        derivs.append(comm * factor)
    return derivs
You can also use broadcasting in the above function, but I think it's clearer in this manner.
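For completeness: the loop above only gives the diagonal terms d softmax(X)_i / d X_i. The full derivative of softmax is a Jacobian matrix, J[i][j] = s_i * (delta_ij - s_j), where s = softmax(x). A minimal vectorized sketch (the max subtraction is a standard numerical-stability trick, an addition not present in the code above):
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    # J[i, j] = d softmax(x)[i] / d x[j] = s[i] * (delta_ij - s[j])
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)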

Related

PyTorch autograd with complex Fourier transforms gives wrong results

I am trying to implement a real-valued cost function that evaluates a complex input in frequency space with pytorch & autograd, since I am interested in the gradients of the cost function w.r.t. the input. When I compare the autograd results with the derivative that I computed by hand (with Wirtinger calculus), I get a different result. I'm not sure where I made the mistake, whether it is in my implementation or in my own derivation of the gradient.
The cost function and its derivative by hand look like this:
[Image: formula of the cost function]
My implementation is here
def f_derivative_by_hand(f):
    f = torch.tensor(f, dtype=torch.complex128)
    ftilde = torch.fft.fft(f)
    absf = torch.abs(ftilde)
    f2 = absf**2
    C = torch.trapz(f2).numpy()
    grads = 2 * torch.fft.ifft(ftilde).numpy()
    return C, grads
def f_derivative_autograd(f):
    f = torch.tensor(f, dtype=torch.complex128, requires_grad=True)
    ftilde = torch.fft.fft(f)
    f2 = torch.abs(ftilde)**2
    C = torch.trapz(f2)
    C.backward()
    grads = f.grad
    return C.detach().numpy(), grads.detach().numpy()
When I use some data and evaluate it with both functions, the gradient from the automatic-differentiation implementation is tilted in comparison (note that I normalized the plotted arrays):
[Image: autograd vs. derivative-by-hand comparison]
I suspect there could also be something wrong with the automatic differentiation of the FFT, though, since if I remove the Fourier transform from the cost function and integrate the function in real space, both implementations match exactly except at the edges (again normalized):
[Image: no-FFT autograd vs. derivative-by-hand comparison]
It would be fantastic if someone could help me figure out what is wrong!
After some more investigation, I found the solution to the problem of the tilted derivatives. Apparently, the trapezoidal integration rule assumes boundary conditions that produce artifacts at the boundaries, as discussed in this pytorch forum post.
In my original problem, the observed tilt results from the integration of the Fourier-transformed signal, which is asymmetric. The boundary artifacts introduce spatial frequencies which tilt the derivative in real space.
For me, the simplest solution is just to use a sum and weight by the frequency differential. Then everything works out.
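A minimal sketch of that fix, assuming a uniform frequency spacing df (an assumption; replace df with your actual grid spacing):
import torch

def f_derivative_autograd_sum(f, df=1.0):
    f = torch.tensor(f, dtype=torch.complex128, requires_grad=True)
    ftilde = torch.fft.fft(f)
    f2 = torch.abs(ftilde) ** 2
    # Weighted sum instead of torch.trapz avoids the boundary artifacts.
    C = (f2 * df).sum()
    C.backward()
    return C.detach().numpy(), f.grad.detach().numpy()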

Why are embeddings necessary in Pytorch?

I (think) I understand the basic principle behind embeddings: they're a shortcut to quickly perform the operation (one-hot encoded vector) * matrix without actually performing that operation (by utilizing the fact that this operation is equivalent to indexing into the matrix), while still maintaining the same gradient as if that operation were actually performed.
I know in the following example:
e = Embedding(3, 2)
n = e(torch.LongTensor([0, 2]))
n will be a tensor of shape (2, 2).
But, we could also do:
p = nn.Parameter(torch.zeros([3, 2]).normal_(0, 0.01))
p[torch.LongTensor([0, 2])]
and get the same result without an embedding.
This in and of itself wouldn't be confusing, since in the first example n has a grad_fn called EmbeddingBackward, whereas in the second example the result of indexing p has a grad_fn called IndexBackward, which is what we expect, since we know embeddings simulate a different derivative.
The confusing part is that in chapter 8 of the fastbook they use embeddings to compute movie recommendations, but then they do it without embeddings in basically the same manner and the model still works. I would expect the version without embeddings to fail, because the derivative would be incorrect.
Version with embeddings:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
        return sigmoid_range(res, *self.y_range)
Version without:
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:, 0]]
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
Does anyone know what is going on?
Besides the points provided in the comments:
I (think) I understand the basic principle behind embeddings: they're
a shortcut to quickly perform the operation (one hot encoded vector) *
matrix without actually performing that operation
This one is only partially correct. Indexing into a matrix is a fully differentiable operation; it just takes part of the data and propagates the gradient along this path only. Multiplication here would be wasteful and unnecessary.
and get the same result without an embedding.
This one is true
called EmbeddingBackward whereas in the 2nd example p has a grad_fn called IndexBackward, which is what we expect since we know embeddings simulate a different derivative. (emphasis mine)
This one isn't (or often is not). Embedding also chooses part of the data, just like you did; the grad_fn is different due to "extra functionalities", but in principle it is the same (choosing some vector(s) from the matrix). Think of it like IndexBackward on steroids.
The confusing part is in chapter 8 of the fastbook they use embeddings
to compute movie recommendations. But then, they do it without
embeddings in basically the same manner & the model still works. I
would expect the version without embeddings to fail because the
derivative would be incorrect. (emphasis mine)
In this exact case, both approaches are equivalent (give or take different results from random initialization); a quick check appears after the list below.
Why nn.Embedding at all?
Easier to understand your intent when reading the code
Common utilities and functionality added if needed (as pointed out in the comments)
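A quick way to convince yourself that both routes produce the same gradients (a minimal sketch; the weights are copied so the two versions start from identical values):
import torch
import torch.nn as nn

idx = torch.LongTensor([0, 2])

e = nn.Embedding(3, 2)
p = nn.Parameter(e.weight.detach().clone())

# Same forward result, same backward result.
e(idx).sum().backward()
p[idx].sum().backward()

print(torch.equal(e.weight.grad, p.grad))  # True: identical gradients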

Implementing a trainable generalized Bump function layer in Keras/Tensorflow

I'm trying to code the following variant of the Bump function, applied component-wise:
f_σ(x) = exp(-σ / (σ - x²)) for |x| < σ, and 0 otherwise,
where σ is trainable; but it's not working (errors reported below).
My attempt:
Here's what I've coded up so far (if it helps). Suppose I have two functions (for example):
def f_True(x):
    # Compute Bump Function
    bump_value = 1 - tf.math.pow(x, 2)
    bump_value = -tf.math.pow(bump_value, -1)
    bump_value = tf.math.exp(bump_value)
    return bump_value

def f_False(x):
    # Compute Bump Function
    x_out = 0 * x
    return x_out
class trainable_bump_layer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(trainable_bump_layer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        self.threshold_level = self.add_weight(name='threshlevel',
                                               shape=[1],
                                               initializer='GlorotUniform',
                                               trainable=True)

    def call(self, input):
        # Determine Thresholding Logic
        The_Logic = tf.math.less(input, self.threshold_level)
        # Apply Logic
        output_step_3 = tf.cond(The_Logic,
                                lambda: f_True(input),
                                lambda: f_False(input))
        return output_step_3
Error Report:
Train on 100 samples
Epoch 1/10
WARNING:tensorflow:Gradients do not exist for variables ['reconfiguration_unit_steps_3_3/threshlevel:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['reconfiguration_unit_steps_3_3/threshlevel:0'] when minimizing the loss.
32/100 [========>.....................] - ETA: 3s
...
tensorflow:Gradients do not exist for variables
Moreover, it does not seem to be applied component-wise (besides the non-trainable problem). What could be the problem?
Unfortunately, no operation to check whether x is within (-σ, σ) will be differentiable and therefore σ cannot be learnt via any gradient descent method. Specifically, it is not possible to compute the gradients with respect to self.threshold_level because tf.math.less is not differentiable with respect to the condition.
Regarding the element-wise conditional, you can instead use tf.where to select elements from f_True(input) or f_False(input) according to the component-wise boolean values of the condition. For example:
output_step_3 = tf.where(The_Logic, f_True(input), f_False(input))
NOTE: I answered based on the provided code, where self.threshold_level is not used in f_True nor f_False. If self.threshold_level is used in those functions as in the provided formula, the function will, of course, be differentiable with respect to self.threshold_level.
Updated 19/04/2020: Thank you @today for the clarification.
I suggest you try a normal distribution instead of a bump.
In my tests here, this bump function is not behaving well (I can't find a bug, but don't rule one out): my graph shows two very sharp bumps, which is not good for networks.
With a normal distribution, you would get a regular and differentiable bump whose height, width and center you can control.
So, you may try this function:
y = a * exp(-b * (x - c)²)
Try it on a graph and see how it behaves.
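For instance, a quick plotting sketch (matplotlib assumed) to see the shape before wiring it into a layer:
import numpy as np
import matplotlib.pyplot as plt

a, b, c = 1.0, 10.0, 0.0            # height, inverse width, center
x = np.linspace(-2.0, 2.0, 400)
y = a * np.exp(-b * (x - c) ** 2)   # smooth, everywhere-differentiable bump

plt.plot(x, y)
plt.show()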
For this:
class trainable_bump_layer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(trainable_bump_layer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        # suggested shape (has a different kernel for each input feature/channel)
        shape = tuple(1 for _ in input_shape[:-1]) + input_shape[-1:]
        # for your desired shape of only 1:
        shape = tuple(1 for _ in input_shape)  # all ones

        # height
        self.kernel_a = self.add_weight(name='kernel_a',
                                        shape=shape,
                                        initializer='ones',
                                        trainable=True)
        # inverse width
        self.kernel_b = self.add_weight(name='kernel_b',
                                        shape=shape,
                                        initializer='ones',
                                        trainable=True)
        # center
        self.kernel_c = self.add_weight(name='kernel_c',
                                        shape=shape,
                                        initializer='zeros',
                                        trainable=True)

    def call(self, input):
        exp_arg = -self.kernel_b * K.square(input - self.kernel_c)
        return self.kernel_a * K.exp(exp_arg)
I am a bit surprised that no one has mentioned the main (and only) reason for the given warning! As it seems, that code is supposed to implement the generalized variant of the Bump function; however, just take a look at the functions implemented again:
def f_True(x):
    # Compute Bump Function
    bump_value = 1 - tf.math.pow(x, 2)
    bump_value = -tf.math.pow(bump_value, -1)
    bump_value = tf.math.exp(bump_value)
    return bump_value

def f_False(x):
    # Compute Bump Function
    x_out = 0 * x
    return x_out
The error is evident: there is no usage of the trainable weight of the layer in these functions! So it is no surprise that you get the message saying that no gradient exists for it: you are not using the weight at all, so there is no gradient to update it! Rather, this is exactly the original Bump function (i.e. with no trainable weight).
But you might say: "at least, I used the trainable weight in the condition of tf.cond, so there must be some gradients?!" It's not like that, however, so let me clear up the confusion:
First of all, as you have noticed as well, we are interested in element-wise conditioning. So instead of tf.cond you need to use tf.where.
The other misconception is to claim that, since tf.less is used as the condition and it is not differentiable, i.e. it has no gradient with respect to its inputs (which is true: there is no defined gradient for a function with boolean output w.r.t. its real-valued inputs!), this results in the given warning!
That's simply wrong! The derivative here would be taken of the output of the layer w.r.t. the trainable weight, and the selection condition is NOT present in the output. Rather, it's just a boolean tensor which determines the output branch to be selected. That's it! The derivative of the condition is not taken and will never be needed. So that's not the reason for the given warning; the reason is only and only what I mentioned above: there is no contribution of the trainable weight in the output of the layer. (Note: if the point about the condition is a bit surprising to you, then think about a simple example: the ReLU function, which is defined as relu(x) = 0 if x < 0 else x. If the derivative of the condition, i.e. x < 0, were considered/needed, which does not exist, then we would not be able to use ReLU in our models and train them using gradient-based optimization methods at all!)
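To see this concretely, a tiny sketch: a ReLU built from tf.where, where gradients flow only through the selected branch values and the boolean condition never needs a gradient:
import tensorflow as tf

x = tf.Variable([-1.0, 2.0])
with tf.GradientTape() as tape:
    # The condition x < 0 is boolean; gradients flow only through the
    # selected branch values, never through the condition itself.
    y = tf.reduce_sum(tf.where(x < 0.0, tf.zeros_like(x), x))
print(tape.gradient(y, x).numpy())  # [0. 1.]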
(Note: starting from here, I would refer to and denote the threshold value as sigma, like in the equation).
All right! We found the reason behind the error in implementation. Could we fix this? Of course! Here is the updated working implementation:
import tensorflow as tf
from tensorflow.keras.initializers import RandomUniform
from tensorflow.keras.constraints import NonNeg

class BumpLayer(tf.keras.layers.Layer):
    def __init__(self, *args, **kwargs):
        super(BumpLayer, self).__init__(*args, **kwargs)

    def build(self, input_shape):
        self.sigma = self.add_weight(
            name='sigma',
            shape=[1],
            initializer=RandomUniform(minval=0.0, maxval=0.1),
            trainable=True,
            constraint=NonNeg()
        )
        super().build(input_shape)

    def bump_function(self, x):
        return tf.math.exp(-self.sigma / (self.sigma - tf.math.pow(x, 2)))

    def call(self, inputs):
        greater = tf.math.greater(inputs, -self.sigma)
        less = tf.math.less(inputs, self.sigma)
        condition = tf.logical_and(greater, less)
        output = tf.where(
            condition,
            self.bump_function(inputs),
            0.0
        )
        return output
A few points regarding this implementation:
We have replaced tf.cond with tf.where in order to do element-wise conditioning.
Further, as you can see, unlike your implementation, which only checked one side of the inequality, we are using tf.math.less, tf.math.greater and also tf.logical_and to find out whether the input values have magnitudes less than sigma (alternatively, we could do this using just tf.math.abs and tf.math.less; no difference, see the one-line sketch after this list). And let us repeat it: using boolean-output functions in this way does not cause any problems and has nothing to do with derivatives/gradients.
We are also using a non-negativity constraint on the sigma value learned by the layer. Why? Because sigma values less than zero do not make sense (i.e. the range (-sigma, sigma) is ill-defined when sigma is negative).
And considering the previous point, we take care to initialize the sigma value properly (i.e. to a small non-negative value).
And also, please don't do things like 0.0 * inputs! It's redundant (and a bit weird), it is equivalent to 0.0, and both have a gradient of 0.0 (w.r.t. inputs). Multiplying a tensor by zero does not add anything or solve any existing issue, at least not in this case!
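The tf.math.abs alternative mentioned in the second point above would be a one-line change inside call (a sketch):
condition = tf.math.less(tf.math.abs(inputs), self.sigma)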
Now, let's test it to see how it works. We write some helper functions to generate training data based on a fixed sigma value, and also to create a model which contains a single BumpLayer with input shape of (1,). Let's see if it could learn the sigma value which is used for generating training data:
import numpy as np

def generate_data(sigma, min_x=-1, max_x=1, shape=(100000, 1)):
    assert sigma >= 0, 'Sigma should be non-negative!'
    x = np.random.uniform(min_x, max_x, size=shape)
    xp2 = np.power(x, 2)
    condition = np.logical_and(x < sigma, x > -sigma)
    y = np.where(condition, np.exp(-sigma / (sigma - xp2)), 0.0)
    dy = np.where(condition, xp2 * y / np.power((sigma - xp2), 2), 0)
    return x, y, dy

def make_model(input_shape=(1,)):
    model = tf.keras.Sequential()
    model.add(BumpLayer(input_shape=input_shape))
    model.compile(loss='mse', optimizer='adam')
    return model

# Generate training data using a fixed sigma value.
sigma = 0.5
x, y, _ = generate_data(sigma=sigma, min_x=-0.1, max_x=0.1)

model = make_model()

# Store initial value of sigma, so that it could be compared after training.
sigma_before = model.layers[0].get_weights()[0][0]

model.fit(x, y, epochs=5)

print('Sigma before training:', sigma_before)
print('Sigma after training:', model.layers[0].get_weights()[0][0])
print('Sigma used for generating data:', sigma)

# Sigma before training: 0.08271004
# Sigma after training: 0.5000002
# Sigma used for generating data: 0.5
Yes, it could learn the value of sigma used for generating the data! But is it guaranteed to actually work for all different values of training data and initializations of sigma? The answer is: NO! Actually, it is possible that you run the code above and get nan as the value of sigma after training, or inf as the loss value! So what's the problem? Why might these nan or inf values be produced? Let's discuss it below...
Dealing with numerical stability
One of the important things to consider when building a machine learning model and using gradient-based optimization methods to train it is the numerical stability of the operations and calculations in the model. When extremely large or small values are generated by an operation or its gradient, it will almost certainly disrupt the training process (for example, that's one of the reasons behind normalizing image pixel values in CNNs).
So, let's take a look at this generalized bump function (and let's discard the thresholding for now). It's obvious that this function has singularities (i.e. points where either the function or its gradient is not defined) at x² = sigma (i.e. when x = sqrt(sigma) or x = -sqrt(sigma)). The animated diagram below shows the bump function (the solid red line), its derivative w.r.t. sigma (the dotted green line) and the x = sigma and x = -sigma lines (two vertical dashed blue lines), as sigma starts from zero and is increased to 5:
[Animated diagram: bump function, its derivative w.r.t. sigma, and the x = ±sigma lines]
As you can see, around the region of the singularities the function is not well-behaved for any value of sigma, in the sense that both the function and its derivative take extremely large values in those regions. So given an input value in those regions for a particular value of sigma, exploding output and gradient values would be generated, hence the issue of an inf loss value.
Even further, there is a problematic behavior of tf.where which causes the issue of nan values for the sigma variable in the layer: surprisingly, if the value produced in the inactive branch of tf.where is extremely large or inf (which, with the bump function, results in extremely large or inf gradient values), then the gradient of tf.where would be nan, despite the fact that the inf is in the inactive branch and is not even selected (see this GitHub issue, which discusses exactly this)!
So is there any workaround for this behavior of tf.where? Yes, actually there is a trick to resolve this issue, which is explained in this answer: basically, we can use an additional tf.where to prevent the function from being applied in these regions. In other words, instead of applying self.bump_function to every input value, we filter out those values which are NOT in the range (-self.sigma, self.sigma) (i.e. the actual range in which the function should be applied) and instead feed the function zero (which always produces safe values, i.e. equal to exp(-1)):
output = tf.where(
    condition,
    self.bump_function(tf.where(condition, inputs, 0.0)),
    0.0
)
Applying this fix entirely resolves the issue of nan values for sigma. Let's evaluate it on training data generated with different sigma values and see how it performs:
true_learned_sigma = []
for s in np.arange(0.1, 10.0, 0.1):
    model = make_model()
    x, y, dy = generate_data(sigma=s, shape=(100000, 1))
    model.fit(x, y, epochs=3 if s < 1 else (5 if s < 5 else 10), verbose=False)
    sigma = model.layers[0].get_weights()[0][0]
    true_learned_sigma.append([s, sigma])
    print(s, sigma)

# Check if the learned values of sigma
# are actually close to the true values of sigma, for all the experiments.
res = np.array(true_learned_sigma)
print(np.allclose(res[:, 0], res[:, 1], atol=1e-2))
# True
It could learn all the sigma values correctly! That's nice; the workaround worked! There is one caveat, though: this is guaranteed to work properly and learn any sigma value only if the input values to this layer are greater than -1 and less than 1 (i.e. the default case of our generate_data function); otherwise, there is still the issue of an inf loss value, which might occur if the input values have a magnitude greater than 1 (see points #1 and #2 below).
Here is some food for thought for the curious and interested mind:
It was just mentioned that if the input values to this layer are greater than 1 or less than -1, then it may cause problems. Can you argue why this is the case? (Hint: use the animated diagram above and consider cases where sigma > 1 and the input value is between sqrt(sigma) and sigma, or between -sigma and -sqrt(sigma).)
Can you provide a fix for the issue in point #1, i.e. such that the layer works for all input values? (Hint: like the workaround for tf.where, think about how you can further filter out the unsafe values to which the bump function would be applied, producing exploding outputs/gradients.)
However, if you are not interested in fixing this issue and would like to use this layer in a model as it is now, then how would you guarantee that the input values to this layer are always between -1 and 1? (Hint: as one solution, there is a commonly-used activation function which produces values exactly in this range and could potentially be used as the activation function of the layer before this one.)
If you take a look at the last code snippet, you will see that we have used epochs=3 if s < 1 else (5 if s < 5 else 10). Why is that? Why do large values of sigma need more epochs to be learned? (Hint: again, use the animated diagram and consider the derivative of the function for input values between -1 and 1 as the sigma value increases. What is its magnitude?)
Do we also need to check the generated training data for any nan, inf or extremely large values of y and filter them out? (Hint: yes, if sigma > 1 and the range of values, i.e. min_x and max_x, falls outside of (-1, 1); otherwise, no, that's not necessary! Why is that? Left as an exercise!)

Where in tensorflow gradients is the sum over the elements of y made?

I am trying to make a hack of tf.gradients in tensorflow that would give, for a tensor y of rank (M,N) and a tensor x of rank (Q,P), a gradient tensor of rank (M,N,Q,P), as one would naturally expect.
As pointed out in multiple questions on this site*, what one gets instead is a rank (Q,P) tensor which is the grad of the sum of the elements of y. Now what I can't figure out, looking into the tensorflow code, is where that sum over the elements of y is made. Is it at the beginning or at the end? Could someone help me pinpoint the lines of code where that is done?
*
Tensorflow gradients: without automatic implicit sum
TensorFlow: Compute Hessian matrix (and higher order derivatives)
Unaggregated gradients / gradients per example in tensorflow
Separate gradients in tf.gradients
I've answered it here, but I'm guessing that's not very useful, because you can't use this knowledge to differentiate with respect to non-scalar y. The scalar-y assumption is central to the design of the reverse AD algorithm, and there's not a single place you can modify to support non-scalar ys. Since this confusion keeps coming up, let me go into a bit more detail as to why it's non-trivial:
First of all, how reverse AD works: suppose we have a function f that is a composition of component functions f_i. Each component function takes a vector of length n and produces a vector of length n.
Its derivative can be expressed as a sequence of matrix multiplications: for f(x) = f_k(f_{k-1}(... f_1(x) ...)), the chain rule gives df/dx = J_k * J_{k-1} * ... * J_1, where J_i is the Jacobian of component function f_i. When differentiating, function composition becomes matrix multiplication of the corresponding component-function Jacobians.
Note that this involves matrix-matrix products, which prove to be too expensive for neural networks. IE, AlexNet contains 8k activations in its convnet->fc transition layer; doing matrix multiplies where each matrix is 8k x 8k would take too long. The trick that makes it efficient is to assume that the last function in the chain produces a scalar. Then its Jacobian is a vector, and the whole thing can be rewritten in terms of vector-matrix multiplies instead of matrix-matrix multiplies.
This product can be computed efficiently by doing the multiplication left to right, so everything you do is a vector times n x n matrix multiply instead of an n x n matrix-matrix multiply.
You can make it even more efficient by never forming those n x n derivative matrices in the first place, and instead associating each component function with an op that does the vector x Jacobian-matrix product implicitly. That's what TensorFlow's tf.RegisterGradient does.
[Illustration: the "grad" op associated with a component function]
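As a stand-in for that illustration, here is a minimal sketch of an op paired with its implicit vector-Jacobian product, written with TF2's tf.custom_gradient rather than the lower-level tf.RegisterGradient:
import tensorflow as tf

@tf.custom_gradient
def square(x):
    y = x * x
    def grad(upstream):
        # Vector-Jacobian product: upstream * diag(2x), computed
        # element-wise without ever materializing the n x n Jacobian.
        return upstream * 2.0 * x
    return y, grad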
Now, this is done for vector-valued functions; what if your functions are matrix-valued? This is a typical situation we deal with in neural networks. IE, in a layer that does a matrix multiply, the matrix that you multiply by is an unknown, and it is matrix-valued. In that case, the last derivative has rank 2, and the remaining derivatives have rank 3.
Now, to apply the chain rule you'd have to deal with extra notation, because now the "x" in the chain rule means matrix multiplication generalized to tensors of rank 3.
However, note that we never have to do the multiplication explicitly, since we are using a grad operator. So now, in practice, this operator takes values of rank 2 and produces values of rank 2.
So in all of this, there's an assumption that the final target is scalar, which allows fully connected layers to be differentiated by passing matrices around.
If you want to extend this to support non-scalar ys, you would need to modify the reverse AD algorithm to propagate more info. IE, for fully connected feed-forward nets you would propagate rank-3 tensors around instead of matrices.
With the jacobian function in TensorFlow 2, this is a straightforward task:
with tf.GradientTape() as tape1:
    with tf.GradientTape() as tape2:
        y = layer(x)
        loss = tf.reduce_mean(y ** 2)
    first_order_gradient = tape2.gradient(loss, layer.trainable_weights)
hessian = tape1.jacobian(first_order_gradient, layer.trainable_weights)
https://www.tensorflow.org/guide/advanced_autodiff#hessian
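Relatedly, for the original question's (M,N,Q,P) gradient, tape.jacobian returns exactly that shape, with no implicit sum over the elements of y (a minimal sketch):
import tensorflow as tf

x = tf.Variable(tf.random.normal((4, 3)))          # shape (Q, P)
with tf.GradientTape() as tape:
    y = tf.sin(tf.matmul(x, x, transpose_b=True))  # shape (M, N) = (4, 4)

J = tape.jacobian(y, x)
print(J.shape)  # (4, 4, 4, 3) == (M, N, Q, P)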

Derivative of a sum in Theano

So I want to calculate the gradient and Hessian of the following sum. Afaik Theano should be able to do that, but I can't figure out how.
X is a matrix of size M x N; y is an M-sized vector; beta is an N-sized vector.
One way to compute the sum is using the scan() function, which I did like this:
res, ups = theano.scan(lambda v, w: v * np.log(1 / (1 + np.exp(-1 * w.dot(beta))))
                       + (1 - v) * np.log(1 / (1 + np.exp(w.dot(beta)))),
                       sequences=[y, X])
t7 = theano.function(inputs=[X, y, beta], outputs=res)
and that works fine as far as I can tell. However, I can't use this as an input to the grad() function with respect to beta.
So what I would like to know is whether there is a way to either use the scan function as the input of the grad function, or a different way to compute the sum.
(I first tried in sympy, but sympy can't lambdify IndexedBase objects, so I can compute the grad but can't use it as a function. Maybe that helps?)
The sum adds up a function of the dot product of a row of X with beta, while the binary vector y decides which of two functions will be used:
log(1/(1+exp(-X_i*beta)))
Hope that helps?
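One route is to take the gradient of the summed scan output directly, i.e. theano.grad(res.sum(), beta). Alternatively, the sum can be written with matrix operations, which sidesteps scan entirely; a minimal sketch, assuming Theano's tensor API:
import theano
import theano.tensor as T

X = T.dmatrix('X')
y = T.dvector('y')
beta = T.dvector('beta')

# Vectorized log-likelihood: p_i = sigmoid(X_i . beta), and
# y*log(p) + (1-y)*log(1-p) matches the two branches above,
# since sigmoid(-z) = 1 - sigmoid(z).
p = T.nnet.sigmoid(T.dot(X, beta))
ll = T.sum(y * T.log(p) + (1 - y) * T.log(1 - p))  # scalar

grad = T.grad(ll, beta)                   # gradient w.r.t. beta
hess = theano.gradient.hessian(ll, beta)  # Hessian w.r.t. beta

f_grad = theano.function([X, y, beta], grad)
f_hess = theano.function([X, y, beta], hess)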
