I am exploring PyTorch, and I do not understand the output of the following example:
# Initialize x, y and z to values 4, -3 and 5
x = torch.tensor(4., requires_grad = True)
y = torch.tensor(-3., requires_grad = True)
z = torch.tensor(5., requires_grad = True)
# Set q to sum of x and y, set f to product of q with z
q = x + y
f = q * z
# Compute the derivatives
f.backward()
# Print the gradients
print("Gradient of x is: " + str(x.grad))
print("Gradient of y is: " + str(y.grad))
print("Gradient of z is: " + str(z.grad))
Output
Gradient of x is: tensor(5.)
Gradient of y is: tensor(5.)
Gradient of z is: tensor(1.)
I have little doubt that my confusion originates with a minor misunderstanding. Can someone explain in a stepwise manner?
When you do f.backward(), what you get in x.grad is df/dx.
In your case
f = (x + y) * z
So, simply (with elementary calculus):
df/dx = z, df/dy = z, df/dz = x + y
If you plug in your values for x, y and z, that explains the outputs: df/dx = 5, df/dy = 5 and df/dz = 4 + (-3) = 1.
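You can verify these partials quickly, for example with sympy (a small check added purely for illustration):
import sympy as sp

x, y, z = sp.symbols('x y z')
f = (x + y) * z
grads = [sp.diff(f, var) for var in (x, y, z)]
print(grads)  # [z, z, x + y]
print([g.subs({x: 4, y: -3, z: 5}) for g in grads])  # [5, 5, 1]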
But this isn't really the "backpropagation" algorithm. It is just partial derivatives (which is all you asked about in the question).
Edit:
If you want to know about the backpropagation machinery behind it, please see @Ivan's answer.
I can provide some insights on the PyTorch aspect of backpropagation.
When manipulating tensors that require gradient computation (requires_grad=True), PyTorch keeps track of operations for backpropagation and constructs a computation graph ad hoc.
Let's look at your example:
q = x + y
f = q * z
Its corresponding computation graph can be represented as:
x -------\
-> x + y = q ------\
y -------/ -> q * z = f
/
z --------------------------/
Where x, y, and z are called leaf tensors. The backward propagation consists of computing the gradients of x, y, and z, which correspond to dL/dx, dL/dy, and dL/dz respectively, where L is a scalar value based on the graph output f. Each operation performed needs to have a backward function implemented (which is the case for all mathematically differentiable PyTorch builtins). For each operation, this function is effectively used to compute the gradient of the output w.r.t. the input(s).
The backward pass would look like this:
dL/dx <------\
x -----\ \
\ dq/dx
\ \ <--- dL/dq-----\
-> x + y = q ----\ \
/ / \ df/dq
/ dq/dy \ \ <--- dL/df ---
y -----/ / -> q * z = f
dL/dy <------/ / /
/ df/dz
z -------------------------/ /
dL/dz <--------------------------/
The "d(outputs)/d(inputs)" terms for the first operator are: dq/dx = 1, and dq/dy = 1. For the second operator they are df/dq = z, and df/dz = q.
Backpropagation comes down to applying the chain rule: dL/dx = dL/dq * dq/dx = dL/df * df/dq * dq/dx. Intuitively, this decomposes dL/dx in the opposite order from the one backpropagation actually follows, which is to navigate from the output back to the inputs.
Without shape considerations, we start from dL/df = 1. In reality dL/df has the shape of f (see my other answer linked below). This results in dL/dx = 1 * z * 1 = z. Similarly for y and z, we have dL/dy = z and dL/dz = q = x + y. Which are the results you observed.
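If you want to inspect the intermediate dL/dq term from the diagram above, you can ask PyTorch to keep it with retain_grad(). A minimal sketch of the original example:
import torch

x = torch.tensor(4., requires_grad=True)
y = torch.tensor(-3., requires_grad=True)
z = torch.tensor(5., requires_grad=True)

q = x + y
q.retain_grad()  # q is not a leaf tensor, so its gradient is discarded unless retained
f = q * z
f.backward()     # seeds dL/df = 1 and traverses the graph backwards

print(q.grad)    # dL/dq = dL/df * df/dq = 1 * z -> tensor(5.)
print(x.grad)    # dL/dx = dL/dq * dq/dx = 5 * 1 -> tensor(5.)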
Some answers I gave to related topics:
Understand PyTorch's graph generation
Meaning of grad_outputs in PyTorch's torch.autograd.grad
Backward function of the normalize operator
Difference between autograd.grad and autograd.backward
Understanding Jacobian tensors in PyTorch
You just have to understand what the operations are and which partial derivatives you should use to arrive at each result. For example:
x = torch.tensor(1., requires_grad = True)
q = x*x
q.backward()
print("Gradient of x is: " + str(x.grad))
will give you 2, because the derivative of x*x is 2*x.
If we take your example for x, we have:
q = x + y
f = q * z
which can be modified as:
f = (x+y)*z = x*z+y*z
If we take the partial derivative of f with respect to x, we end up with just z.
To arrive at this result you treat all the other variables as constants and apply the derivative rules you already know.
But keep in mind that the process PyTorch executes to get these results is neither symbolic nor numeric differentiation; it is automatic differentiation, a computational method to obtain gradients efficiently.
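To make "automatic differentiation" concrete, here is a toy forward-mode AD sketch using dual numbers. This is only an illustration (PyTorch actually uses reverse-mode AD): every value carries its derivative along, so derivatives propagate through operations mechanically, without building symbolic expressions.
class Dual:
    # pairs a value with its derivative w.r.t. one chosen input
    def __init__(self, val, grad):
        self.val = val
        self.grad = grad
    def __add__(self, other):
        # sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.grad + other.grad)
    def __mul__(self, other):
        # product rule: (u * v)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

# d/dx of (x + y) * z at x=4, y=-3, z=5: seed x with grad 1, the others with 0
x = Dual(4.0, 1.0)
y = Dual(-3.0, 0.0)
z = Dual(5.0, 0.0)
f = (x + y) * z
print(f.val, f.grad)  # 5.0 5.0 -> df/dx = z = 5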
Take a closer look at:
https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/slides/lec10.pdf
I want to calculate the divergence of a given vector with sympy. Is there any function in Python for this? I looked through the functions of einsteinpy, but I still haven't found any that help.
Basically I want to calculate \nabla_\mu (n v^\mu)=0 from a given vector v; n being a constant number.
\nabla_\mu (nv^\mu)=0 represents a divergence where \mu will take the derivative with respect to x, y or z of the vector element corresponding to the component. For example:
\nabla_\mu (n v^\mu) = \partial_x (u^x) + \partial_y(u^y) + \partial_z(u^z)
u can be something like (2x,4y,6z)
I appreciate any help.
As shown by @mikuszefski, you can use the module sympy.vector, which provides an implementation of the divergence in a space.
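A minimal sketch of that approach, assuming the example field (2x, 4y, 6z) from the question:
from sympy.vector import CoordSys3D, divergence

R = CoordSys3D('R')
field = 2*R.x*R.i + 4*R.y*R.j + 6*R.z*R.k  # the vector field (2x, 4y, 6z)
print(divergence(field))  # -> 12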
Another way to do what you want is to use the function derive_by_array to get a tensor and perform an Einstein contraction.
import sympy as sp
x, y, z = sp.symbols("x y z") # dim = 3
# Now the functions that you want:
u, v, w = 2*x, 4*y, 6*z
# In a more general way, you can do:
u = sp.Function("u")(x, y, z)
v = sp.Function("v")(x, y, z)
w = sp.Function("w")(x, y, z)
U = sp.Array([u, v, w]) # U is a vector of dim = 3 (or sympy.Array)
X = sp.Array([x, y, z]) # X is a vector of dim = 3 (or sympy.Array)
dUdX = sp.derive_by_array(U, X) # dUdX is a tensor of dim = 3 and order = 2
# First way:
divU = sp.trace(sp.Matrix(sp.derive_by_array(U, X))) # Limited
# Second way:
divU = sp.tensorcontraction(sp.derive_by_array(U, X), (0, 1)) # More general
This solution also works when dim = 2, for example, but you must have len(X) == len(U).
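As a quick sanity check with the concrete field from the question (where the divergence should be 2 + 4 + 6 = 12), continuing from the code above:
U_num = sp.Array([2*x, 4*y, 6*z])
print(sp.tensorcontraction(sp.derive_by_array(U_num, X), (0, 1)))  # -> 12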
I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the XOR data, so let's presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0 or 1.
I worked out the equations for backpropagation and found that, despite the network being so simple, my code does not converge to the XOR data being correct.
Let W be the 3x3 matrix of weights connecting the input layer to the hidden layer, and w be the 1x3 matrix that connects the hidden layer to the output layer. Here are some helper functions for my method:
def feed_forward_predict(x, W, w):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    z = np.array(list(map(sigmoid, np.matmul(W, x))))
    L = sigmoid(np.matmul(w, z))
    return [L, z, x]
this just takes in a value and makes a prediction using the formula sig(w*sig(W*x)). We also have
def calculate_objective(data, labels, W, w):
    obj = 0
    for point, label in zip(data, labels):
        L, z, x = feed_forward_predict(point, W, w)
        obj += (label - L)**2
    return obj
which calculates the mean squared error for a set of given data points. Both of these functions should work, since I checked them by hand. Now the problem comes in with the backpropagation algorithm:
def back_prop(traindata, trainlabels):
    sigmoid = lambda x: 1/(1+np.exp(-x))
    sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
    W = np.random.rand(3, len(traindata[0]))
    w = np.random.rand(1, 3)
    obj = calculate_objective(traindata, trainlabels, W, w)
    print(obj)
    epochs = 10_000
    eta = .01
    prevobj = np.inf
    i = 0
    while i < epochs:
        prevobj = obj
        dellw = np.zeros((1, 3))
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
        w -= eta * dellw
        for point, label in zip(traindata, trainlabels):
            y, z, x = feed_forward_predict(point, W, w)
            temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
            # Note that s, u, v represent the hidden node weights. My professor required it this way
            dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
            dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
            dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
            dellW = np.array([dells, dellu, dellv])
        W -= eta*dellW
        obj = calculate_objective(traindata, trainlabels, W, w)
        i = i + 1
        print("i=", i, " Objective=", obj)
    return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and derivatives I took, does not converge to anything. In fact the error consistently bounces: it falls, then rises, then falls back to the same spot, then rises again. I believe the problem lies with the W matrix gradient, but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed to np.random.seed(0) just so that I could be consistent with the matrices I'm dealing with.
It appears you are attempting to set up a manual version of stochastic gradient descent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are taking with so many loops and inconsistencies. In general, looping over elements defeats the purpose of using np.array(). Likewise, you should know that np.matmul() is the @ operator, while * is element-wise multiplication. It is unclear how you are using the derivative: you have it explicitly stated at the start for the activation function and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data; those should be globals, derived all at once from your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) solves for a fixed set of weights given a random variable, stochastic gradient descent does the opposite: we fix the random variable to our data and optimize the weights. Hence, your functions should only have the weights as "free variables"; everything else is fixed. It is important to follow what is fixed and what is free to update, and your code does not reflect that distinction.
SGD algorithm outline:
1. Random params.
2. Update params by moving them a small step in the direction of steepest descent.
3. Run step (2) for a specified amount of time.
4. Print your params.
Example of SGD code (performing SGD to find the line of best fit for some data):
import numpy as np
#Data
X = np.random.random((100,)) #Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,)) #Linear model + Noise
#Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p : p[0]*X+p[1]
dF = lambda p : np.array([X,np.ones(X.shape)])
MSE = lambda p : (1/Y.shape[0])*((Y-F(p))**2).sum(0)
dMSE = lambda p : (1/Y.shape[0])*(-2*(Y-F(p))*dF(p)).sum(1)
#SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0,0.0])
for i in range(epochs):
    params -= lr*dMSE(params)
print(params)
Hopefully, written this way it is super clear exactly where the subtraction of the gradient occurs and exactly how it is calculated. Note also, in case it wasn't clear, that the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module, making SGD a clearly useless way to optimize just two variables:
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
I think I figured it out: in my code I was not summing the hidden node weight derivatives, but instead was reassigning them at every loop iteration. The correct version is as follows:
# initialize the accumulators to zero before the loop (mirroring dellw above)
dells = np.zeros(len(traindata[0]))
dellu = np.zeros(len(traindata[0]))
dellv = np.zeros(len(traindata[0]))
for point, label in zip(traindata, trainlabels):
    y, z, x = feed_forward_predict(point, W, w)
    temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
    # Note that s, u, v represent the hidden node weights. My professor required it this way
    dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
    dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
    dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
I have the following equation: x/0.2 * (0.2+1) + y/0.1 * (0.1+1) = 26.34
The initial values of X and Y are set as 4.085 and 0.17 respectively.
I need to find the values of X and Y which satisfy the equation and have the lowest combined deviation from the initially set values. In other words, the sum of |4.085 - x| and |0.17 - y| is minimized.
With the Excel Solver "Value Of" function this is easy to find: we insert x and y as the variables to be changed so that the formula result reaches 26.
Here is my Python code (I am trying to use sympy for this):
from sympy import symbols, solve, Eq

x, y = symbols('x y')
eqn = solve([Eq(x/0.2*(0.2+1) + y/0.1*(0.1+1), 26)], x, y)
print(eqn)
However, I am getting a strange result: {x: 4.33333333333333 - 1.83333333333333*y}
Can anyone help me solve this equation?
The answer you are obtaining is not strange; it is just the answer to what you asked. You have one equation in two variables x and y, so the solution is in general not unique (often there are infinitely many). Now, you can either add an extra condition (an inequality, for example) or change the numeric domain in which solutions are sought (as in Diophantine equations). You can do either in Sympy. In the following example I find the solution for x in the Real domain, using solveset:
from sympy import symbols, Eq, solveset
x,y = symbols('x y')
eqn = solveset(Eq(1.2 * x / 0.2 + 1.1 * y / 0.1, 26), x, Reals)
print(eqn)
Output:
Intersection(FiniteSet(4.33333333333333 - 1.83333333333333*y), Reals)
As you can see the solution on x is a finite set, that is the intersection between a straight line on y and the Reals. Any particular solution can be found by direct evaluation of y.
This is equivalent to saying x = 4.33333333333333 - 1.83333333333333*y. If you evaluate this expression at the guess value y = 0.17, you obtain x = 4.0216 (close to your guess value x = 4.085).
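To reproduce that evaluation directly (continuing from the snippet above):
x_expr = 4.33333333333333 - 1.83333333333333*y
print(x_expr.subs(y, 0.17))  # -> 4.02166666666667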
Edit:
After analyzing the new information added to your question, I think I have finally understood it: your problem is a constrained optimization. Now, I don't use Excel frequently, but it would be my bet that under the hood this optimization is carried out there using Lagrange multipliers. In your particular case, the target function represents the deviation of the solution (x, y) from the point (4.085, 0.17). For convenience, I have chosen this function to be the Euclidean distance between them (absolute values as you suggested can be problematic due to discontinuity of the derivatives). The constraint function is simply the equation you provided. To solve this problem with Sympy, one could use something like this:
import sympy as sp
# Define symbols and functions
x, y, lamb = sp.symbols('x, y, lamb', real=True)
func = sp.sqrt((x - 4.085) ** 2 + (y - 0.17) ** 2) # Target function
const = 1.2 * x / 0.2 + 1.1 * y / 0.1 - 26 # Constraint function
# Define Lagrangian
lagrang = func - lamb * const
# Compute gradient of Lagrangian
grad_lagrang = [sp.diff(lagrang, var) for var in [x, y, lamb]]
# Solve the resulting system of equations
spoints = sp.solve(grad_lagrang, [x, y, lamb], dict=True)
# Print stationary points
print(spoints)
Output:
[{x: 4.07047770700637, lamb: -0.0798086884467563, y: 0.143375796178345}]
Since in our case only one stationary point was found, this is the optimal solution (although this is only a necessary condition). The value of the lamb multiplier can be ditched, so x, y = 4.070, 0.1434. Hope this helps.
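As a quick check, you can substitute the stationary point back into the constraint and target functions (continuing from the code above):
sol = spoints[0]
print(const.subs(sol))  # ~0, so the constraint is satisfied
print(func.subs(sol))   # the minimized deviation from (4.085, 0.17)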
I want to calculate the derivative of points; a few internet posts suggested using the np.diff function. I tried np.diff against manually calculated results (I chose a random polynomial equation and differentiated it) to see if I would end up with the same results. I used the equation Y = (X^3) + (X^2) + 7, and the results I ended up with were different. Any ideas why? Is there any other method to calculate the derivative?
In the problem I am trying to solve, I have received the data points of a fitted spline function (not the original data the spline was fitted to, but the points of the already fitted spline). The x-values are at equal intervals. I only have the points and no equation; what I need is to calculate the first, second and third derivatives, i.e. dy/dx, d2y/dx2, d3y/dx3. Any ideas on how to do this? Thanks in advance.
import numpy as np

xval = [1, 2, 3, 4, 5]
yval = []
yval_dashList = []

# selected a polynomial equation
def calc_Y(X):
    Y = (X**3) + (X**2) + 7
    return Y

# calculate y values using the equation
for i in xval:
    yval.append(calc_Y(i))
# output: yval = [9, 19, 43, 87, 157]

# manually differentiated the equation, or use the sympy library (sym.diff(x**3 + x**2 + 7))
def calc_diffY(X):
    yval_dash = 3*(X**2) + 2**X
    return yval_dash

# store differentiated y-values in a list
for i in xval:
    yval_dashList.append(calc_diffY(i))
# output: yval_dashList = [5, 16, 35, 64, 107]

# use the numpy diff method on the y values (yval)
numpyDiff = np.diff(yval)
# output: [10, 24, 44, 70]
The values from the numpy diff method, [10, 24, 44, 70], are different from yval_dashList = [5, 16, 35, 64, 107].
The idea behind what you are trying to do is correct, but there are a couple of points to make it work as intended:
There is a typo in calc_diffY(X), the derivative of X**2 is 2*X, not 2**X:
def calc_diffY(X):
    yval_dash = 3*(X**2) + 2*X
    return yval_dash
With this fix alone you still don't obtain matching results:
yval_dash = [5, 16, 33, 56, 85]
numpyDiff = [10. 24. 44. 70.]
To calculate the numerical derivative you should compute a "difference quotient", which is an approximation of the derivative:
numpyDiff = np.diff(yval)/np.diff(xval)
The approximation gets better and better as the points become more densely spaced. The spacing between your points on the x axis is 1, so the numerical estimate is quite coarse compared to the analytical derivative. If you reduce the spacing of your x points to 0.1, the two match much more closely.
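For instance, a quick sketch of the same comparison with denser sampling:
import numpy as np

xval = np.arange(1, 5, 0.1)
yval = xval**3 + xval**2 + 7
numpyDiff = np.diff(yval) / np.diff(xval)  # difference quotient
analytical = 3*xval**2 + 2*xval
# the maximum gap to the analytical derivative shrinks with the spacing
print(np.abs(numpyDiff - analytical[:-1]).max())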
Just to add something to this: see the Wikipedia article on numerical differentiation for an image showing the effect of reducing the distance between the points at which the derivative is numerically calculated.
I like @lgsp's answer. I will add that you can directly estimate the derivative without having to worry about how much space there is between the values. This just uses the symmetric formula for calculating finite differences, described at this Wikipedia page.
Take note, though, of the way delta is specified. I found that when it is too small, higher-order estimates fail. There's probably not a 100% generic value that will always work well!
Also, I simplified your code by taking advantage of numpy broadcasting over arrays to eliminate for loops.
import numpy as np

# select a polynomial equation
def f(x):
    y = x**3 + x**2 + 7
    return y

# manually differentiate the equation
def f_prime(x):
    return 3*x**2 + 2*x

# numerically estimate the first three derivatives
def d1(f, x, delta=1e-10):
    return (f(x + delta) - f(x - delta)) / (2 * delta)

def d2(f, x, delta=1e-5):
    return (d1(f, x + delta, delta) - d1(f, x - delta, delta)) / (2 * delta)

def d3(f, x, delta=1e-2):
    return (d2(f, x + delta, delta) - d2(f, x - delta, delta)) / (2 * delta)

# demo output
# note that the functions operate in parallel on numpy arrays -- no for loops!
xval = np.array([1, 2, 3, 4, 5])
print('y = ', f(xval))
print('y\' = ', f_prime(xval))
print('d1 = ', d1(f, xval))
print('d2 = ', d2(f, xval))
print('d3 = ', d3(f, xval))
And the outputs:
y = [ 9 19 43 87 157]
y' = [ 5 16 33 56 85]
d1 = [ 5.00000041 16.00000132 33.00002049 56.00000463 84.99995374]
d2 = [ 8.0000051 14.00000116 20.00000165 25.99996662 32.00000265]
d3 = [6. 6. 6. 6. 5.99999999]
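To see why the delta values differ per order, push a too-small delta into the second-order estimator; floating-point cancellation then dominates the numerator (reusing d2, f and xval from above):
# with delta=1e-10 the differences drown in rounding error,
# so the output is nowhere near the true second derivative 6x + 2
print(d2(f, xval, delta=1e-10))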
Complex numbers in theano are not fully implemented yet, as discussed for example in this Google Groups post and this question.
However, given that some support does seem to exist, I am trying to understand what can actually be done at the moment.
Consider for example this very simple code:
import theano
import theano.tensor as T

x = T.dscalar('x')
y = 2j * x
gy = T.grad(y, x)
f = theano.function([x], gy)
f(1.234)
that is, the derivative with respect to the (real) scalar x of a * x, where a is some complex number.
The above code does not produce a result, complaining that
Casting from complex to real is ambiguous: consider real(), imag(), angle() or abs()
How can this code be made to work?
Here is the simple implementation I managed to get to work:
x = T.dscalar('x')
y_R = T.real(2j) * x
y_I = T.imag(2j) * x
gy_R = T.grad(y_R, x)
gy_I = T.grad(y_I, x)
gy = gy_R + 1j * gy_I
f = theano.function([x], gy)
f(1.234)
# array(2j)
Basically: separate the complex constant into its real and imaginary parts, compute the two gradients separately, and only at the end combine them into the complex result.
The problem with this method is that it doesn't work if we try a more complex example, like computing the gradient with respect to x of expm(1j * x * H) for some matrix H:
x = T.dscalar('x')
expH = T.slinalg.expm(1j * x * H)
expH_flat = T.flatten(expH)
expH_flat_R = T.real(T.flatten(expH))
expH_flat_I = T.imag(T.flatten(expH))
def fn(i, mat, x):
    return T.grad(mat[i], x)
J_R, updates = theano.scan(fn, sequences=T.arange(expH_flat_R.shape[0]), non_sequences=[expH_flat_R, x])
J_I, updates = theano.scan(fn, sequences=T.arange(expH_flat_I.shape[0]), non_sequences=[expH_flat_I, x])
expH_J_R = J_R.reshape(expH.shape)
expH_J_I = J_I.reshape(expH.shape)
expH_J = expH_J_R + 1j * expH_J_I
f = theano.function([x], expH_J)
f(2)
which returns
TypeError: Elemwise{real,no_inplace}.grad illegally returned an integer-valued variable. (Input index 0, dtype complex128)
If this cannot be achieved at all with theano (as, for example, this question seems to suggest), is it for some particular reason, like fundamental difficulties of some sort?
It seems that this is simply not possible natively (and likely won't be in the future, considering the development of Theano stopped in 2017).
One way around it is to remap every complex matrix into a larger real matrix and perform all operations on that bigger matrix (some care is needed to redefine the various operations).
A possible mapping is A -> [[A_R, -A_I], [A_I, A_R]].
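A quick numpy sanity check of why that mapping works (plain numpy for illustration, not Theano code): the embedding turns complex arithmetic into real arithmetic while preserving matrix products.
import numpy as np

def embed(A):
    # A -> [[A_R, -A_I], [A_I, A_R]]
    A_R, A_I = A.real, A.imag
    return np.block([[A_R, -A_I], [A_I, A_R]])

A = np.array([[1+2j]])
B = np.array([[3-1j]])
# the embedding of a product equals the product of the embeddings
print(np.allclose(embed(A @ B), embed(A) @ embed(B)))  # True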