gradient descent implementation python - python

I am trying to implement my own gradient descent function in python but my MSE loss function is suspiciously high. How does my implementation look? I based my function on the formula below.
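import numpy as np
def gradient_descent(w, X):
    # One gradient-descent step for a linear model with MSE loss.
    # Y (the target column) is read from the enclosing scope.
    L = 0.001  # learning rate
    total = 0  # gradient accumulator, initialised once before the loop
    for i in range(len(X)):
        row_vec = X.iloc[i].to_numpy()
        y = Y.iloc[i]
        y_hat = np.dot(w, row_vec)
        inner = y_hat - y
        total += inner * row_vec
    return w - L * ( ( 1 / len(X) ) * total)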
The function above represents one iteration of gradient descent. The w parameter is a weights vector that I initialize to np.array([[1,1,1,...]]) and X is a DataFrame where each column represents a feature with an added column of all 1s for bias.
def optimize(w, X):
    loss = 999999
    iter = 0
    loss_arr = []
    while True:
        vec = gradient_descent(w, X) # new weights vector
        tmp_loss = loss_function(vec, X) # test new weights
        if tmp_loss < loss:
            loss = tmp_loss
            loss_arr.append(loss) # record the accepted loss
            w = vec
            iter += 1
        else:
            break # loss did not improve, so stop
    return (loss_arr, w, iter)
In the function above, I call the gradient_descent function and check if my loss function is better than the previous one. If it's not then I stop and I have my final weights.
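For comparison, the same update can be sketched in vectorized NumPy. This is only a sketch under assumptions: the names mse_loss and gradient_step are hypothetical, the targets y are passed in explicitly instead of being read from a global, and X is a plain array that already contains the bias column of ones. A learning rate of 0.001 can need a very large number of iterations on unscaled features, so a loss that still looks high may simply mean the loop has not converged yet.

```python
import numpy as np

def mse_loss(w, X, y):
    # Mean squared error of the linear model X @ w.
    residual = X @ w - y
    return float(np.mean(residual ** 2))

def gradient_step(w, X, y, lr=0.001):
    # Same update as the per-row loop: w - lr * (1/n) * sum((y_hat - y) * x).
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def optimize(w, X, y, lr=0.001, max_iter=10_000):
    # Keep stepping while the loss improves; stop at the first non-improvement.
    loss = mse_loss(w, X, y)
    losses = [loss]
    for _ in range(max_iter):
        candidate = gradient_step(w, X, y, lr)
        new_loss = mse_loss(candidate, X, y)
        if new_loss >= loss:
            break
        w, loss = candidate, new_loss
        losses.append(loss)
    return losses, w, len(losses) - 1

# Tiny synthetic check: y = 1 + 2 * x, with a bias column of ones.
X_demo = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
losses_demo, w_demo, n_iter_demo = optimize(np.ones(2), X_demo, y_demo, lr=0.1)
```

Passing y as an argument avoids the hidden dependency on a global Y, and recording every accepted loss makes it easy to plot the curve and see whether training has stalled or is just slow.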


TypeError: 'numpy.float64' object is not iterable for ML Model

How do I avoid the error on the marked line? It keeps telling me that the float64 type is not iterable. Essentially, I'm trying to calculate the cost gradient for SGD in the first function and then run SGD in the second function.
def calculate_cost_gradient(W, X_batch, Y_batch):
    # if only one example is passed (eg. in case of SGD)
    if type(Y_batch) == np.float64:
        Y_batch = np.array([Y_batch])
        X_batch = np.array([X_batch]) # gives multidimensional array
    distance = 1 - (Y_batch * np.dot(X_batch, W))
    dw = np.zeros(len(W))
    ######## Error is here ########
    for ind, d in enumerate(distance):
        if max(0, d) == 0:
            di = W
        else:
            di = W - (regularization_strength * Y_batch[ind] * X_batch[ind])
        dw += di
    dw = dw/len(Y_batch) # average
    return dw
def sgd(features, outputs):
    max_epochs = 5000
    weights = np.zeros(features.shape[1])
    nth = 0
    prev_cost = float("inf")
    cost_threshold = 0.01 # in percent
    # stochastic gradient descent
    for epoch in range(1, max_epochs):
        # shuffle to prevent repeating update cycles
        X, Y = shuffle(features, outputs)
        for ind, x in enumerate(X):
            ascent = calculate_cost_gradient(weights, x, Y[ind])
            weights = weights - (learning_rate * ascent)
        # convergence check on 2^nth epoch
        if epoch == 2 ** nth or epoch == max_epochs - 1:
            cost = compute_cost(weights, features, outputs)
            print("Epoch is:{} and Cost is: {}".format(epoch, cost))
            # stoppage criterion
            if abs(prev_cost - cost) < cost_threshold * prev_cost:
                return weights
            prev_cost = cost
            nth += 1
    return weights
It's very likely that distance is a numpy integer and the code should be:
for ind, d in enumerate(range(distance)):
I also faced the same issue while implementing SVM from a blog post:
The solution is just to remove the if type(Y_batch) == np.float64 check, so that Y_batch and X_batch are always wrapped in arrays.
This is because Y_batch is of type int64, so the check fails, nothing is converted into a NumPy array, and distance ends up as a single float64 value. (enumerate only works on iterables, i.e. collections of items.)
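A type-independent variant of that fix can be sketched with np.atleast_1d and np.atleast_2d, which wrap scalars and 1-D rows but leave real batches unchanged. This is a sketch, not the blog post's code: regularization_strength is turned into a parameter here, and its default value is a hypothetical placeholder.

```python
import numpy as np

def calculate_cost_gradient(W, X_batch, Y_batch, regularization_strength=1.0):
    # regularization_strength default is a placeholder, not a recommended value.
    # Promote a single example to a batch of one, whatever the label dtype is.
    Y_batch = np.atleast_1d(Y_batch)
    X_batch = np.atleast_2d(X_batch)
    distance = 1 - (Y_batch * np.dot(X_batch, W))
    dw = np.zeros(len(W))
    for ind, d in enumerate(distance):
        if max(0, d) == 0:
            di = W  # margin satisfied: only the regularisation term remains
        else:
            di = W - (regularization_strength * Y_batch[ind] * X_batch[ind])
        dw += di
    return dw / len(Y_batch)  # average over the batch

# Works for an int64 label, a float64 label, and a full batch alike.
W_demo = np.array([0.5, -0.5])
x_demo = np.array([1.0, 2.0])
grad_int = calculate_cost_gradient(W_demo, x_demo, np.int64(1))
grad_float = calculate_cost_gradient(W_demo, x_demo, np.float64(-1.0))
```

Because the promotion no longer depends on the exact label type, the loop over distance always sees a 1-D array.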

My predicted value only ever decreases for gradient descent

I am currently writing an implementation of gradient descent and I'm running into an issue where my predicted value (y_hat) only ever decreases. It never increases, even though it should increase in cases where the training label is 1 and not 0. My train function is below:
def sigma(self, a):
    ans = 1/(1+np.exp(-a))
    return ans
def get_loss(self, y_i, y_hat):
    loss = -(y_i * np.log(y_hat) + (1 - y_i) * np.log(1 - y_hat))
    return loss
def train(self, X, y, step_size, num_iterations):
    b_0 = 0
    rows = X.shape[0]
    columns = X.shape[1]
    weights = np.zeros(columns)
    losses = []
    for iteration in range(num_iterations):
        #Step 1: calculate y hat for row
        summation = 0
        summation_k = np.zeros(columns)
        total_loss = 0
        for i in range(rows):
            row_total = np.sum(np.multiply(X[i], weights))
            y_hat = self.sigma(b_0 + row_total)
            y_i = y[i]
            # print('y_i: ', y_i)
            # print('y_hat: ', y_hat)
            # print()
            total_loss += self.get_loss(y_i, y_hat)
            diff = y_i - y_hat
            summation += diff
            # summation_k_i = summation_k_i + X[i] * diff
            summation_k = np.add(summation_k, np.multiply(diff, X[i]))
        # Compute change for each weight based on errors, then update the weights
        # Update b_0
        b_0 = b_0 + step_size * ((1/rows) * (-summation))
        # Update b_k
        # for j in range(columns):
        #     weights[j] = weights[j] + step_size * ((1/rows) * (-summation_k[j]))
        weights = np.add(weights, np.multiply(summation_k, (-step_size/rows)))
        # Keeping track of average loss for each iteration.
        losses.append(total_loss / rows)
    self.weights = np.insert(weights, 0, b_0)
    return np.array(losses)
When I run this the y_hat values always decrease for every row and in every iteration. I can't find the bug that's causing this.
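One likely cause is a sign error. With diff = y_i - y_hat, the gradient of the mean cross-entropy loss with respect to the weights is -(1/rows) * sum(diff * x). Descending that gradient therefore means adding step_size * (1/rows) * sum(diff * x). The posted code adds the negated sum instead, which moves the parameters uphill on the loss. The sketch below shows the corrected sign in vectorized form. The function name train_logistic and the toy data are assumptions for illustration only.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, y, step_size, num_iterations):
    rows, columns = X.shape
    b_0 = 0.0
    weights = np.zeros(columns)
    for _ in range(num_iterations):
        y_hat = sigma(b_0 + X @ weights)
        diff = y - y_hat  # same sign convention as in the question
        # Descent step: the loss gradient is -mean(diff * x), so we ADD diff terms.
        b_0 += step_size * diff.mean()
        weights += step_size * (X.T @ diff) / rows
    return b_0, weights

# Toy data: label 1 for larger x, label 0 for smaller x.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
b_one, w_one = train_logistic(X_demo, y_demo, 0.5, 1)
b_fit, w_fit = train_logistic(X_demo, y_demo, 0.5, 2000)
```

After a single step the weight is already positive, so y_hat rises for rows labelled 1 and can fall for rows labelled 0, which is the behaviour the question expected.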

TypeError: 'NoneType' object is not callable (Pytorch)

I am trying to complete an implementation of a neural network class that uses pytorch.
But the update step raises an error related to NoneType.
I am using the PyTorch package with Python 3.7.3 in a Jupyter Notebook.
The problem is in the step where I have to take the weight update step and then zero the gradient values.
import torch
class NNet(torch.nn.Module):
    def __init__(self, n_inputs, n_hiddens_per_layer, n_outputs, act_func='tanh'):
        super().__init__() # call parent class (torch.nn.Module) constructor
        # Set self.n_hiddens_per_layer to [] if argument is 0, [], or [0]
        if n_hiddens_per_layer == 0 or n_hiddens_per_layer == [] or n_hiddens_per_layer == [0]:
            self.n_hiddens_per_layer = []
        else:
            self.n_hiddens_per_layer = n_hiddens_per_layer
        self.hidden_layers = torch.nn.ModuleList() # necessary for model.to('cuda')
        for nh in self.n_hiddens_per_layer:
            self.hidden_layers.append( torch.nn.Sequential(
                torch.nn.Linear(n_inputs, nh),
                torch.nn.Tanh() if act_func == 'tanh' else torch.nn.ReLU()))
            n_inputs = nh
        self.output_layer = torch.nn.Linear(n_inputs, n_outputs)
        self.Xmeans = None
        self.Xstds = None
        self.Tmeans = None
        self.Tstds = None
        self.error_trace = []
    def forward(self, X):
        Y = X
        for hidden_layer in self.hidden_layers:
            Y = hidden_layer(Y)
        Y = self.output_layer(Y)
        return Y
    def train(self, X, T, n_epochs, learning_rate, verbose=True):
        # Set data matrices to torch.tensors if not already.
        if not isinstance(X, torch.Tensor):
            X = torch.from_numpy(X).float()
        if not isinstance(T, torch.Tensor):
            T = torch.from_numpy(T).float()
        W = torch.zeros((2, 1), requires_grad=True)
        # Calculate standardization parameters if not already calculated
        if self.Xmeans is None:
            self.Xmeans = X.mean(0)
            self.Xstds = X.std(0)
            self.Xstds[self.Xstds == 0] = 1
            self.Tmeans = T.mean(0)
            self.Tstds = T.std(0)
            self.Tstds[self.Tstds == 0] = 1
        # Standardize inputs and targets
        X = (X - self.Xmeans) / self.Xstds
        T = (T - self.Tmeans) / self.Tstds
        # Set optimizer to Adam and loss functions to MSELoss
        optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)
        mse_func = torch.nn.MSELoss()
        # For each epoch:
        #   Do forward pass to calculate output Y.
        #   Calculate mean squared error loss, mse.
        #   Calculate gradient of mse with respect to all weights by calling mse.backward().
        #   Take weight update step, then zero the gradient values.
        #   Unstandardize the mse error and save in self.error_trace
        #   Print epoch+1 and unstandardized error if verbose is True and
        #   (epoch+1 is n_epochs or epoch+1 % (n_epochs // 10) == 0)
        for epoch in range(n_epochs):
            # Do forward pass to calculate output Y.
            Y = self.forward(X)
            print("Y = \n",Y)
            # Calculate mean squared error loss, mse.
            mse = ((T - Y)**2).mean()
            #mse = torch.mean((T - Y[-1]) ** 2)
            print("Y shape = \n",Y.shape)
            print("Tshape = \n",T.shape)
            print("MSE = \n",mse)
            # Calculate gradient of mse with respect to all weights by calling mse.backward().
            # Take weight update step, then zero the gradient values.
            #print("W.grad = ",W.grad())
            with torch.no_grad():
                W = learning_rate*W.grad()
                W -= learning_rate * W.grad()
            # Unstandardize the mse error and save in self.error_trace
            self.error_trace = mse * self.Tstds
            #. . .
    def use(self, X):
        # Set input matrix to torch.tensors if not already.
        if not isinstance(X, torch.Tensor):
            X = torch.from_numpy(X).float()
        # Standardize X with the statistics stored during training
        X = (X - self.Xmeans) / self.Xstds
        # Do forward pass and unstandardize resulting output. Assign to variable Y.
        Y = self.forward(X) * self.Tstds + self.Tmeans
        # Return output Y after detaching from computation graph and converting to numpy
        return Y.detach().numpy()
<ipython-input-20-6e1e577f866d> in train(self, X, T, n_epochs, learning_rate, verbose)
     86 # Take weight update step, then zero the gradient values.
     87 with torch.no_grad():
---> 88 W = learning_rate*W.grad()
     89 print("w",W.requires_grad)
     90 W -= learning_rate * W.grad()
TypeError: 'NoneType' object is not callable
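The traceback is consistent with two things. First, .grad is an attribute of a tensor, not a method, and it stays None until backward() has populated it, so W.grad() tries to call None. Second, the standalone W is not one of the model's parameters, so no gradient would ever flow into it. Since the code already creates an Adam optimizer, the update step can be left to it. The following is a minimal sketch with a hypothetical stand-in model and random data, not the full NNet class.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in model and data, only to illustrate the update step.
model = torch.nn.Linear(2, 1)
X = torch.randn(20, 2)
T = X @ torch.tensor([[2.0], [-1.0]]) + 0.5

optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
mse_func = torch.nn.MSELoss()

# .grad is an attribute; before any backward() call it is simply None.
W = torch.zeros((2, 1), requires_grad=True)
grad_before_backward = W.grad

error_trace = []
for epoch in range(200):
    Y = model(X)            # forward pass
    mse = mse_func(Y, T)    # mean squared error
    optimizer.zero_grad()   # clear gradients left over from the previous step
    mse.backward()          # fill .grad on every model parameter
    optimizer.step()        # weight update
    error_trace.append(mse.item())
```

Inside the class, the same three calls (zero_grad, backward, step) would replace the with torch.no_grad() block, and self.error_trace.append(...) would keep one value per epoch instead of overwriting the list.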

Vectorize for loop in python

I have the following loop, where I calculate a softmax transform for batches of different sizes:
import numpy as np
from numba import prange  # assumed import: the question mentions numba; plain range also works
def softmax(Z,arr):
    """
    :param Z: numpy array of any shape (output from hidden layer)
    :param arr: numpy array of any shape (start, end)
    :return A: output of multinum_logit(Z,arr), same shape as Z
    :return cache: returns Z as well, useful during back propagation
    """
    A = np.zeros(Z.shape)
    for i in prange(len(arr)):
        shiftx = Z[:,arr[i,1]:arr[i,2]+1] - np.max(Z[:,int(arr[i,1]):int(arr[i,2])+1])
        A[:,arr[i,1]:arr[i,2]+1] = np.exp(shiftx)/np.exp(shiftx).sum()
    cache = Z
    return A,cache
Since this for loop is not vectorized, it is the bottleneck in my code. What is a possible way to make it faster? I have tried numba's @jit, which makes it a little faster but not enough. I was wondering if there is another way to speed it up or to vectorize/parallelize it.
Sample input data for the function
Z = np.random.random([1,10000])
arr = np.zeros([100,3])
arr[:,0] = 1
temp = int(Z.shape[1]/arr.shape[0])
for i in range(arr.shape[0]):
    arr[i,1] = i*temp
    arr[i,2] = (i+1)*temp-1
arr = arr.astype(int)
I forgot to stress that my number of classes varies. For example, batch 1 may have 10 classes and batch 2 may have 15. Therefore I pass an array arr that keeps track of which rows belong to batch 1 and so on. These batches are different from the batches in a traditional neural network framework.
In the above example, arr keeps track of the starting and ending index of the rows. So the denominator in the softmax function is the sum over only those observations whose index lies between the starting and ending index.
Here's a vectorized softmax function. It's the implementation of an assignment from Stanford's cs231n course on conv nets.
The function takes in optimizable parameters, input data, targets, and a regularizer. (You can ignore the regularizer as that references another class exclusive to some cs231n assignments).
It returns a loss and gradients of the parameters.
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    scores = X.dot(W)
    shift_scores = scores - np.amax(scores,axis=1).reshape(-1,1)
    softmax = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis=1).reshape(-1,1)
    loss = -np.sum(np.log(softmax[range(num_train), list(y)]))
    loss /= num_train
    loss += 0.5* reg * np.sum(W * W)
    dSoftmax = softmax.copy()
    dSoftmax[range(num_train), list(y)] += -1
    dW = (X.T).dot(dSoftmax)
    dW = dW/num_train + reg * W
    return loss, dW
For comparison's sake, here is a naive (non-vectorized) implementation of the same method.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    for i in range(num_train):  # range replaces Python 2's xrange
        scores = X[i].dot(W)
        shift_scores = scores - max(scores)
        loss_i = -shift_scores[y[i]] + np.log(sum(np.exp(shift_scores)))
        loss += loss_i
        for j in range(num_classes):
            softmax = np.exp(shift_scores[j])/sum(np.exp(shift_scores))
            if j==y[i]:
                dW[:,j] += (-1 + softmax) * X[i]
            else:
                dW[:,j] += softmax *X[i]
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW = dW / num_train + reg * W  # average first, then add the regularization gradient
    return loss, dW
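The functions above normalize each row over a fixed set of classes, whereas the question needs a softmax over variable-length column segments. One way to vectorize that case is with np.maximum.reduceat and np.add.reduceat. The sketch below makes several assumptions: the segments in arr are sorted, contiguous, non-overlapping, and together cover every column of Z. The names softmax_loop and softmax_segments are hypothetical. Like the loop in the question, it normalizes over the whole block, including all rows.

```python
import numpy as np

def softmax_loop(Z, arr):
    # Reference implementation: the loop from the question, written with range.
    A = np.zeros(Z.shape)
    for i in range(len(arr)):
        s, e = arr[i, 1], arr[i, 2] + 1
        shiftx = Z[:, s:e] - np.max(Z[:, s:e])
        A[:, s:e] = np.exp(shiftx) / np.exp(shiftx).sum()
    return A

def softmax_segments(Z, arr):
    # Assumes segments tile the columns of Z contiguously, in order.
    starts = arr[:, 1]
    lengths = arr[:, 2] - arr[:, 1] + 1
    # Per-segment maximum (over all rows), broadcast back to every column.
    seg_max = np.maximum.reduceat(Z.max(axis=0), starts)
    expz = np.exp(Z - np.repeat(seg_max, lengths))
    # Per-segment sum of exponentials (over all rows), broadcast back.
    seg_sum = np.add.reduceat(expz.sum(axis=0), starts)
    return expz / np.repeat(seg_sum, lengths)

rng = np.random.default_rng(0)
Z_demo = rng.random((2, 10))
arr_demo = np.array([[1, 0, 2], [1, 3, 3], [1, 4, 9]])  # segments of unequal length
A_fast = softmax_segments(Z_demo, arr_demo)
A_ref = softmax_loop(Z_demo, arr_demo)
```

If the segments do not cover all columns, or arrive unsorted, np.repeat would no longer line up with the columns, so the inputs would need to be sorted or padded first.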

How to use mini-batch instead of SGD

Here is a quick implementation of a one-layer neural network in python:
import numpy as np
# simulate data
X = np.random.random((200, 3)) # 200 3d vectors
# first col is set to 1
X[:, 0] = 1
def simu_out(x):
    return np.sum(np.power(x, 2))
y = np.apply_along_axis(simu_out, 1, X)
# code 1 if above average
y = (y > np.mean(y)).astype("float64")*2 - 1
# split into training and testing sets
Xtr = X[:100]
Xte = X[100:]
ytr = y[:100]
yte = y[100:]
# 1 layer network. Final layer has one node
# initial weights
w = np.random.random(3)
def epoch():
    err_sum = 0
    global w
    for i in range(len(ytr)):
        learn_rate = .1
        s_l1 = Xtr[i].dot(w) # signal at layer 1, pre-activation
        x_l1 = np.tanh(s_l1) # output at layer 1, activation
        err = x_l1 - ytr[i]
        err_sum += err
        # chain rule: d(err^2)/d(s_l1) = 2 * err * (1 - tanh(s_l1)^2)
        delta_l1 = 2 * err * (1 - x_l1**2)
        dw = Xtr[i] * delta_l1
        w -= learn_rate * dw
    print("Mean error: %f" % (err_sum / len(ytr)))
for i in range(1000):
    epoch()
def predict(X):
    global w
    return np.sign(np.tanh(np.dot(X, w)))
# > 80% accuracy!!
np.mean(predict(Xte) == yte)
It uses stochastic gradient descent for optimization. How do I apply mini-batch gradient descent here?
The difference from "classical" SGD to a mini-batch gradient descent is that you use multiple samples (a so-called mini-batch) to calculate the update for w. This has the advantage, that the steps you take in direction of the solution are less noisy, as you follow a smoothed gradient.
To do that, you need an inner loop to calculate the update dw, where you iterate over the mini batch. For example (quick-n-dirty code):
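batch_size = 10  # assumed value; the snippet uses batch_size without defining it
def epoch():
    err_sum = 0
    learn_rate = 0.1
    global w
    for i in range(int(np.ceil(len(ytr) / batch_size))):
        # take the i-th consecutive, non-overlapping slice of the training set
        batch = Xtr[i*batch_size:(i+1)*batch_size]
        target = ytr[i*batch_size:(i+1)*batch_size]
        dw = np.zeros_like(w)
        for j in range(len(batch)):  # the last batch may be shorter
            s_l1 = batch[j].dot(w)
            x_l1 = np.tanh(s_l1)
            err = x_l1 - target[j]
            err_sum += err
            delta_l1 = 2 * err * (1 - x_l1**2)
            dw += batch[j] * delta_l1
        w -= learn_rate * (dw / len(batch))
    print("Mean error: %f" % (err_sum / len(ytr)))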
This gave an accuracy of 87 percent in a test.
Now, one more thing: you always go through the training set from start to end. You should definitely shuffle the data in each iteration. Always going through in the same order can really affect your performance, especially if you e.g. first have all samples of class A, and then all of class B. This can also make your training go in cycles. So just go through the set in a random order, e.g. with
order = np.random.permutation(len(ytr))
and replace all occurrences of i by order[i] in the epoch() function.
And a more general remark: Global variables are often considered bad design, as you don't have any control over which snippet modifies your variables. Rather pass w as a parameter. The same goes for the learning rate and the batch size.
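Putting these remarks together, one possible shape for the epoch function is sketched below: it shuffles once per epoch, computes each mini-batch update with matrix operations, and takes w, the learning rate, and the batch size as parameters instead of globals. The parameter names and default values are assumptions for illustration, and the data is generated the same way as in the question.

```python
import numpy as np

def epoch(w, X, y, learn_rate=0.1, batch_size=10, rng=None):
    # One pass over the data in random order; returns the updated weights.
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(y))
    w = w.copy()  # leave the caller's array untouched
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        out = np.tanh(X[idx] @ w)           # activations for the whole batch
        err = out - y[idx]
        delta = 2 * err * (1 - out ** 2)    # chain rule through tanh
        w -= learn_rate * (X[idx].T @ delta) / len(idx)  # averaged gradient
    return w

# Data simulated as in the question, with a fixed seed for repeatability.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
X[:, 0] = 1
y = np.sum(X ** 2, axis=1)
y = (y > y.mean()).astype("float64") * 2 - 1
Xtr, ytr = X[:100], y[:100]

w0 = np.ones(3)
w = w0
for _ in range(200):
    w = epoch(w, Xtr, ytr, rng=rng)
```

Returning the new weights rather than mutating a global makes it straightforward to compare different batch sizes or learning rates on the same starting point.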

