I am trying to build a linear regression model and find optimal values using fmin_cg optimizer.
I have two functions for this job. First linear_reg_cost which is cost function and second linear_reg_grad which is gradient of cost function. This functions both have same argument.
def hypothesis(x,theta):
return np.dot(x,theta)
Cost function:
def linear_reg_cost(x_flatten, y, theta_flatten, lambda_, num_of_features,num_of_samples):
x = x_flatten.reshape(num_of_samples, num_of_features)
theta = theta_flatten.reshape(n,1)
loss = hypothesis(x,theta)-y
regularizer = lambda_*np.sum(theta[1:,:]**2)/(2*m)
j = np.sum(loss ** 2)/(2*m)
return j
Gradient function:
def linear_reg_grad(x_flatten, y, theta_flatten, lambda_, num_of_features,num_of_samples):
x = x_flatten.reshape(num_of_samples, num_of_features)
m,n = x.shape
theta = theta_flatten.reshape(n,1)
new_theta = np.zeros(shape=(theta.shape))
loss = hypothesis(x,theta)-y
gradient = np.dot(x.T,loss)
new_theta[0:,:] = gradient/m
new_theta[1:,:] = gradient[1:,:]/m + lambda_*(theta[1:,]/m)
return new_theta
and fmin_cg:
theta = np.ones(n)
from scipy.optimize import fmin_cg
new_theta = fmin_cg(f=linear_reg_cost, x0=theta, fprime=linear_reg_grad,args=(x.flatten(), y, lambda_, m,n))
Note: I flatten x as input and retrieve in the cost and gradient function as matrix.
the output error:
<ipython-input-98-b29c1b8f6e58> in linear_reg_grad(x_flatten, y, theta_flatten, lambda_, num_of_features, num_of_samples)
1 def linear_reg_grad(x_flatten, y, theta_flatten, lambda_,num_of_features, num_of_samples):
----> 2 x = x_flatten.reshape(num_of_samples, num_of_features)
3 m,n = x.shape
4 theta = theta_flatten.reshape(n,1)
5 new_theta = np.zeros(shape=(theta.shape))
ValueError: cannot reshape array of size 2 into shape (2,12)
Note: x.shape = (12,2), y.shape = (12,1) ,theta.shape = (2,). So num_of_features =2 and num_of_samples=12. But error shows that my input x is parsing instead of theta. Why this happening even when I explicitly assigned args in fmin_cg? And how I should solve this problem?
Thanks for any advice
All of your implementations are correct but you have a little mistake.
Be inform to pass arguments in order for both of your functions.
Your problem is the order of num_of_feature and num_of_samples. You can replace their position with each other in linear_reg_grad or linear_reg_cost. Of course you should change this order in scipy.optimize.fmin_cg, args argument.
Second important thing is, x as first argument in fmin_cg is the variable you want to update each time and find the optimal one. So in your solution, x in fmin_cg must be theta not your x which is your input.
Related
I need to write a simple neural network that consists of 1 output node, one hidden layer of 3 nodes, and 1 input layer (variable size). For now I am just trying to train on the xor data so lets presume that there are 3 input nodes (one node represents the bias and is always 1). The data is labeled 0,1.
I did out the equations for backpropogation and found that despite being so simple, my code does not converge to the xor data being correct.
Let W be the 3x3 matrix of weights connecting the input and hidden layer, and w be the 1x3 matrix that connects the hidden to output layer. Here are some helper functions for my method
def feed_forward_predict(x, W, w):
sigmoid = lambda x: 1/(1+np.exp(-x))
z = np.array(list(map(sigmoid, np.matmul(W, x))))
L = sigmoid(np.matmul(w, z))
return [L, z, x]
this just takes in a value and makes a prediction using the formula sig(w*sig(W*x)). We also have
def calculate_objective(data, labels, W, w):
obj = 0
for point, label in zip(data, labels):
L, z, x = feed_forward_predict(point, W, w)
obj += (label - L)**2
return obj
which calculates the Mean Squared Error for a bunch of given data points. Both of these functions should work as I checked them by hand. Now the problem comes in for the back propogation algorithm
def back_prop(traindata, trainlabels):
sigmoid = lambda x: 1/(1+np.exp(-x))
sigmoid_prime = lambda x: np.exp(-x)/((1+np.exp(-x))**2)
W = np.random.rand(3, len(traindata[0]))
w = np.random.rand(1, 3)
obj = calculate_objective(traindata, trainlabels, W, w)
print(obj)
epochs = 10_000
eta = .01
prevobj = np.inf
i=0
while(i < epochs):
prevobj = obj
dellw = np.zeros((1,3))
for point, label in zip(traindata, trainlabels):
y, z, x = feed_forward_predict(point, W, w)
dellw += 2*(y - label) * sigmoid_prime(np.dot(w, z)) * z
w -= eta * dellw
for point, label in zip(traindata, trainlabels):
y, z, x = feed_forward_predict(point, W, w)
temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
# Note that s,u,v represent the hidden node weights. My professor required it this way
dells = temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
dellu = temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
dellv = temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
dellW = np.array([dells, dellu, dellv])
W -= eta*dellW
obj = calculate_objective(traindata, trainlabels, W, w)
i = i + 1
print("i=", i, " Objective=",obj)
return [W, w]
However this code, despite seemingly being correct in terms of the matrix multiplications and derivatives I took, does not converge to anything. In fact the error consistantly bounces: it will fall, then rise, then fall back to the same spot, then rise again. I believe that the problem lies with the W matrix gradient but I do not know what exactly it is.
If you'd like to see for yourself what is happening, the input data I used is
0: 0 0 1
0: 1 1 1
1: 1 0 1
1: 0 1 1
where the first number represents the label. I also set the random seed to np.random.seed(0) just so that I could be consistant with my matrices I'm dealing with.
It appears you are attempting to setup a manual version of stochastic gradient decent with a fixed learning rate (a classic NN problem).
Some notes on your code. It is very difficult to follow all the steps you are doing with so much loops and inconsistencies. In general, it defeats the purpose of using np.array() if you are using loops. Likewise you should know that np.matmul() is * and np.dot() is #. It is unclear how you are using the derivative. You have it explicitly stated at the start for the activation function and then partially derived in the middle of your loop for the MSE. Ugh.
Some other pointers. Explicitly state all your functions and your data, those should be globals. Those should also be derived all at once based on your fixed data as np.array(). In particular, note that while traditional statistics (like finding the line of best fit) means that we are solving for a fixed set of weights given a random variable; in stochastic gradient decent, we are doing the opposite. We are instead fixing the random variable to our data and optimizing our weights. Hence, your functions should only have your weights as "free variables", everything else is fixed. It is important to follow what is being fixed and what is free to update. Your code does not reflect that you know what is being update and what is fixed.
SGD algorithm outline:
Random params.
Update params by moving params a small percentage in the direction of lowest decent.
Run step (2) for a specified amount of time.
Print your params.
Example of SGD code (here is an example of performing SGD to find the line of best fit for some data).
import numpy as np
#Data
X = np.random.random((100,)) #Random points
Y = (2.3*X + 8) + 0.1*np.random.random((100,)) #Linear model + Noise
#Functions (only free variable is the params) (we want the F of best fit under MSE)
F = lambda p : p[0]*X+p[1]
dF = lambda p : np.array([X,np.ones(X.shape)])
MSE = lambda p : (1/Y.shape[0])*((Y-F(p))**2).sum(0)
dMSE = lambda p : (1/Y.shape[0])*(-2*(Y-F(p))*dF(p)).sum(1)
#SGD loop
lr = 0.05
epochs = 1000
params = np.array([0.0,0.0])
for i in range(epochs):
params -= lr*dMSE(params)
print(params)
Hopefully, written this way it is super clear exactly where the subtraction of the gradient is occurring and exactly how it is calculated. Note also, in case it wasn't clear, the derivative in both dF and dMSE is with respect to the params. Obviously this is a toy problem that can be solved explicitly with the scipy module. Hence, SGD is a clearly useless way to optimize two variables.
from scipy.stats import linregress
params = linregress(X,Y)
print(params)
I think I figured it out, in my code I was not summing the hidden node weight derivatives and instead was assigning at every loop iteration. The correct version would be as follow
for point, label in zip(traindata, trainlabels):
y, z, x = feed_forward_predict(point, W, w)
temp = 2 * (y - label) * sigmoid_prime(np.dot(w, z))
# Note that s,u,v represent the hidden node weights. My professor required it this way
dells += temp * w[0][0] * sigmoid_prime(np.matmul(W[0,:], x)) * x
dellu += temp * w[0][1] * sigmoid_prime(np.matmul(W[1,:], x)) * x
dellv += temp * w[0][2] * sigmoid_prime(np.matmul(W[2,:], x)) * x
I am trying to implement a basic way of the stochastic gradient desecent with multi linear regression and the L2 Norm as loss function.
The result can be seen in this picture:
Its pretty far of the ideal regression line, but I dont really understand why thats the case. I double checked all array dimensions and they all seem to fit.
Below is my source code. If anyone can see my error or give me a hint I would appreciate that.
def SGD(x,y,learning_rate):
theta = np.array([[0],[0]])
for i in range(N):
xi = x[i].reshape(1,-1)
y_pre = xi#theta
theta = theta + learning_rate*(y[i]-y_pre[0][0])*xi.T
print(theta)
return theta
N = 100
x = np.array(np.linspace(-2,2,N))
y = 4*x + 5 + np.random.uniform(-1,1,N)
X = np.array([x**0,x**1]).T
plt.scatter(x,y,s=6)
th = SGD(X,y,0.1)
y_reg = np.matmul(X,th)
print(y_reg)
print(x)
plt.plot(x,y_reg)
plt.show()
Edit: Another solution was to shuffle the measurements with x = np.random.permutation(x)
to illustrate my comment,
def SGD(x,y,n,learning_rate):
theta = np.array([[0],[0]])
# currently it does exactly one iteration. do more
for _ in range(n):
for i in range(len(x)):
xi = x[i].reshape(1,-1)
y_pre = xi#theta
theta = theta + learning_rate*(y[i]-y_pre[0][0])*xi.T
print(theta)
return theta
SGD(X,y,10,0.01) yields the correct result
While implement logistic regression with only numpy library, I wrote the following code for cost function:
#sigmoid function
def sigmoid(z):
sigma = 1/(1+np.exp(-z))
return sigma
#cost function
def cost(X,y,theta):
m = y.shape[0]
z = X#theta
h = sigmoid(z)
J = np.sum((y*np.log(h))+((1-y)*np.log(1-h)))
J = -J/m
return J
Theta is a (3,1) array and X is the training data of shape (m,3). First column of X is ones.
For theta = [0,0,0], cost function outputs 0.693 which is the correct cost, but for theta = [1,-1,1], it outputs:
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:5: RuntimeWarning: divide by zero encountered in log
"""
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:5: RuntimeWarning: invalid value encountered in multiply
"""
nan
My code for gradient descent is:
#gradientdesc function
#alpha is the learning rate, iter is the number of iterations
def gradientDesc(X,y,theta,alpha,iter):
m = y.shape[0]
#d represents the derivative term
d = np.zeros((3,1))
for iter in range(iter):
h = sigmoid(X#theta) - y
temp = h.T.dot(X)
d = temp.T
d/=m
theta = theta - alpha*d
return theta
But this does not give the correct value of theta. What should I do?
Are the values in X large? This might lead to the sigmoid returning values close to zero that lead to the warnings you are seeing. Have a look at this thread:
Divide-by-zero-in-log
Your gradient descent won't work properly unless you solve this issue of values exploding. I would also consider adding regularization in your cost function.
J += C * np.sum(theta**2)
I have been trying to use fmin_cg to minimize cost function for Logistic Regression.
xopt = fmin_cg(costFn, fprime=grad, x0= initial_theta,
args = (X, y, m), maxiter = 400, disp = True, full_output = True )
This is how I call my fmin_cg
Here is my CostFn:
def costFn(theta, X, y, m):
h = sigmoid(X.dot(theta))
J = 0
J = 1 / m * np.sum((-(y * np.log(h))) - ((1-y) * np.log(1-h)))
return J.flatten()
Here is my grad:
def grad(theta, X, y, m):
h = sigmoid(X.dot(theta))
J = 1 / m * np.sum((-(y * np.log(h))) - ((1-y) * np.log(1-h)))
gg = 1 / m * (X.T.dot(h-y))
return gg.flatten()
It seems to be throwing this error:
/Users/sugethakch/miniconda2/lib/python2.7/site-packages/scipy/optimize/linesearch.pyc in phi(s)
85 def phi(s):
86 fc[0] += 1
---> 87 return f(xk + s*pk, *args)
88
89 def derphi(s):
ValueError: operands could not be broadcast together with shapes (3,) (300,)
I know it's something to do with my dimensions. But I can't seem to figure it out.
I am noob, so I might be making an obvious mistake.
I have read this link:
fmin_cg: Desired error not necessarily achieved due to precision loss
But, it somehow doesn't seem to work for me.
Any help?
Updated size for X,y,m,theta
(100, 3) ----> X
(100, 1) -----> y
100 ----> m
(3, 1) ----> theta
This is how I initialize X,y,m:
data = pd.read_csv('ex2data1.txt', sep=",", header=None)
data.columns = ['x1', 'x2', 'y']
x1 = data.iloc[:, 0].values[:, None]
x2 = data.iloc[:, 1].values[:, None]
y = data.iloc[:, 2].values[:, None]
# join x1 and x2 to make one array of X
X = np.concatenate((x1, x2), axis=1)
m, n = X.shape
ex2data1.txt:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
.....
If it helps, I am trying to re-code one of the homework assignments for the Coursera's ML course by Andrew Ng in python
Finally, I figured out what the problem in my initial program was.
My 'y' was (100, 1) and the fmin_cg expects (100, ). Once I flattened my 'y' it no longer threw the initial error. But, the optimization wasn't working still.
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 0.693147
Iterations: 0
Function evaluations: 43
Gradient evaluations: 41
This was the same as what I achieved without optimization.
I figured out the way to optimize this was to use the 'Nelder-Mead' method. I followed this answer: scipy is not optimizing and returns "Desired error not necessarily achieved due to precision loss"
Result = op.minimize(fun = costFn,
x0 = initial_theta,
args = (X, y, m),
method = 'Nelder-Mead',
options={'disp': True})#,
#jac = grad)
This method doesn't need a 'jacobian'.
I got the results I was looking for,
Optimization terminated successfully.
Current function value: 0.203498
Iterations: 157
Function evaluations: 287
Well, since I don't know exactly how your initializing m, X, y, and theta I had to make some assumptions. Hopefully my answer is relevant:
import numpy as np
from scipy.optimize import fmin_cg
from scipy.special import expit
def costFn(theta, X, y, m):
# expit is the same as sigmoid, but faster
h = expit(X.dot(theta))
# instead of 1/m, I take the mean
J = np.mean((-(y * np.log(h))) - ((1-y) * np.log(1-h)))
return J #should be a scalar
def grad(theta, X, y, m):
h = expit(X.dot(theta))
J = np.mean((-(y * np.log(h))) - ((1-y) * np.log(1-h)))
gg = (X.T.dot(h-y))
return gg.flatten()
# initialize matrices
X = np.random.randn(100,3)
y = np.random.randn(100,) #this apparently needs to be a 1-d vector
m = np.ones((3,)) # not using m, used np.mean for a weighted sum (see ali_m's comment)
theta = np.ones((3,1))
xopt = fmin_cg(costFn, fprime=grad, x0=theta, args=(X, y, m), maxiter=400, disp=True, full_output=True )
While the code runs, I don't know enough about your problem to know if this is what you're looking for. But hopefully this can help you understand the problem better. One way to check your answer is to call fmin_cg with fprime=None and see how the answers compare.
I am trying to port some of my code from MatLab into Python and am running into problems with scipy.optimize.fmin_cg function - this is the code I have at the moment:
My cost function:
def nn_costfunction2(nn_params,*args):
Theta1, Theta2 = reshapeTheta(nn_params)
input_layer_size, hidden_layer_size, num_labels, X, y, lam = args[0], args[1], args[2], args[3], args[4], args[5]
m = X.shape[0] #Length of vector
X = np.hstack((np.ones([m,1]),X)) #Add in the bias unit
layer1 = sigmoid(Theta1.dot(np.transpose(X))) #Calculate first layer
layer1 = np.vstack((np.ones([1,layer1.shape[1]]),layer1)) #Add in bias unit
layer2 = sigmoid(Theta2.dot(layer1))
y_matrix = np.zeros([y.shape[0],layer2.shape[0]]) #Create a matrix where vector position of one corresponds to label
for i in range(y.shape[0]):
y_matrix[i,y[i]-1] = 1
#Cost function
J = (1/m)*np.sum(np.sum(-y_matrix.T.conj()*np.log(layer2),axis=0)-np.sum((1-y_matrix.T.conj())*np.log(1-layer2),axis=0))
#Add in regularization
J = J+(lam/(2*m))*np.sum(np.sum(Theta1[:,1:].conj()*Theta1[:,1:])+np.sum(Theta2[:,1:].conj()*Theta2[:,1:]))
#Backpropagation with vectorization and regularization
delta_3 = layer2 - y_matrix.T
r2 = delta_3.T.dot(Theta2[:,1:])
z_2 = Theta1.dot(X.T)
delta_2 = r2*sigmoidGradient(z_2).T
t1 = (lam/m)*Theta1[:,1:]
t1 = np.hstack((np.zeros([t1.shape[0],1]),t1))
t2 = (lam/m)*Theta2[:,1:]
t2 = np.hstack((np.zeros([t2.shape[0],1]),t2))
Theta1_grad = (1/m)*(delta_2.T.dot(X))+t1
Theta2_grad = (1/m)*(delta_3.dot(layer1.T))+t2
nn_params = np.hstack([Theta1_grad.flatten(),Theta2_grad.flatten()]) #Unroll parameters
return nn_params
My call of the function:
args = (input_layer_size, hidden_layer_size, num_labels, X, y, lam)
fmin_cg(nn_costfunction2,nn_params, args=args,maxiter=50)
Gives the following error:
File "C:\WinPython3\python-3.3.2.amd64\lib\site-packages\scipy\optimize\optimize.py", line 588, in approx_fprime
grad[k] = (f(*((xk+d,)+args)) - f0) / d[k]
ValueError: setting an array element with a sequence.
I tried various permutations in passing arguments to fmin_cg but this is the farthest I got. Running the cost function on its own does not throw any errors in this form.
The input variable in cost function should be an 1D array. So your Theta1 and Theta2 in J have to be derived from nn_params. And you need to return J as well.
Try to add epsilon argument in function call:
fmin_cg(nn_costfunction2,nn_params, args=args,epsilon,maxiter=50)
I see this issue is due to the fact you let nnCostFunction2 return cost and grad.
But the scipy.optimize.fmin_cg function will only take single cost output of nnCostFunction2.
So retain single J or cost output from nnCostFunction2 function.
this is my function which is working:
scipy.optimize.fmin_cg(nnCostFunction, initial_rand_theta, backpropagate, \
args=(hidden_s, input_s, num_labels, X, y, lamb), maxiter=1000, \
disp=True, full_output=True)