Need help in debugging Shallow Neural network using numpy - python

I'm working through a hands-on exercise and have built a model in Python with NumPy, trained on the breast cancer dataset from the sklearn library. The model runs without errors and reports train and test accuracies of 92.48826291079813% and 90.9090909090909% respectively. However, I can't complete the exercise, probably because my result differs from the expected one. I don't know where the problem is: I don't know the expected answer, and I don't see any error.
Could someone help me with this? The code is given below.
# Import numpy as np and pandas as pd
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
**Define method initialiseNetwork() to initialise weights with zeros of shape (num_features, 1) and bias b to zero
parameters: num_features (number of input features)
returns: dictionary of weight vector and bias**
def initialiseNetwork(num_features):
    W = np.zeros((num_features, 1))
    b = 0
    parameters = {"W": W, "b": b}
    return parameters
**Define function sigmoid for the input z.
parameters: z
returns: $1/(1+e^{-z})$**
def sigmoid(z):
    a = 1/(1 + np.exp(-z))
    return a
**Define method forwardPropagation() which implements forward propagation defined as Z = (W.T dot_product X) + b, A = sigmoid(Z)
parameters: X, parameters
returns: A**
def forwardPropagation(X, parameters):
    W = parameters["W"]
    b = parameters["b"]
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)
    return A
**Define function cost() which calculates the cost given by −(sum(Y\*log(A)+(1−Y)\*log(1−A)))/num_samples, where \* is the elementwise product
parameters: A, Y, num_samples (number of samples)
returns: cost**
def cost(A, Y, num_samples):
    cost = -1/num_samples * np.sum(Y*np.log(A) + (1-Y)*(np.log(1-A)))
    #cost = Y*np.log(A) + (1-Y)*(np.log(1-A))
    return cost
**Define method backPropagation() to get the derivatives of weights and bias
parameters: X, Y, A, num_samples
returns: dW, db**
def backPropagration(X, Y, A, num_samples):
    dZ = A - Y
    dW = (np.dot(X, dZ.T))/num_samples  # (X dot_product dZ.T)/num_samples
    db = np.sum(dZ)/num_samples  # sum(dZ)/num_samples
    return dW, db
**Define function updateParameters() to update the current parameters with their derivatives
w = w - learning_rate \* dw
b = b - learning_rate \* db
parameters: parameters, dW, db, learning_rate
returns: dictionary of updated parameters**
def updateParameters(parameters, dW, db, learning_rate):
    W = parameters["W"] - (learning_rate * dW)
    b = parameters["b"] - (learning_rate * db)
    return {"W": W, "b": b}
**Define the model for forward propagation
parameters: X, Y, num_iter (number of iterations), learning_rate
returns: parameters (dictionary of updated weights and bias)**
def model(X, Y, num_iter, learning_rate):
    num_features = X.shape[0]
    num_samples = X.shape[1]
    parameters = initialiseNetwork(num_features)  # call initialiseNetwork()
    for i in range(num_iter):
        #A = forwardPropagation(X, Y, parameters) # calculate final output A from forwardPropagation()
        A = forwardPropagation(X, parameters)
        if(i%100 == 0):
            print("cost after {} iteration: {}".format(i, cost(A, Y, num_samples)))
        dW, db = backPropagration(X, Y, A, num_samples)  # calculate derivatives from backpropagation
        parameters = updateParameters(parameters, dW, db, learning_rate)  # update parameters
    return parameters
**Run the below cell to define the function to predict the output. It takes updated parameters and input data as function parameters and returns the predicted output**
def predict(X, parameters):
    W = parameters["W"]
    b = parameters["b"]
    b = b.reshape(b.shape[0],1)
    Z = np.dot(W.T,X) + b
    Y = np.array([1 if y > 0.5 else 0 for y in sigmoid(Z[0])]).reshape(1,len(Z[0]))
    return Y
**The code in the below cell loads the breast cancer data set from sklearn.
The input variable (X_cancer) describes the dimensions of the tumor cells and the target variable (y_cancer) classifies the tumor as malignant (0) or benign (1)**
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
**Split the data into train and test sets using train_test_split(). Set the random state to 25. Refer to the code snippet in topic 4**
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer,
                                                    random_state = 25)
**Since the dimensions of the tumor are not uniform, you need to normalize the data before feeding it to the network.
The below function is used to normalize the input data.**
def normalize(data):
    col_max = np.max(data, axis = 0)
    col_min = np.min(data, axis = 0)
    return np.divide(data - col_min, col_max - col_min)
**Normalize X_train and X_test and assign them to X_train_n and X_test_n respectively**
X_train_n = normalize(X_train)
X_test_n = normalize(X_test)
**Transpose X_train_n and X_test_n so that rows represent features and columns represent samples.
Reshape y_train and y_test into row vectors whose length equals the number of samples. Use np.reshape()**
X_trainT = X_train_n.T
#print(X_trainT.shape)
X_testT = X_test_n.T
#print(X_testT.shape)
y_trainT = y_train.reshape(1,X_trainT.shape[1])
y_testT = y_test.reshape(1,X_testT.shape[1])
**Train the network using X_trainT, y_trainT with 4000 iterations and a learning rate of 0.75**
parameters = model(X_trainT, y_trainT, 4000, 0.75)  # call the model() function with the parameters mentioned in the above cell
**Predict the output of the train and test data from X_trainT and X_testT using the predict() method. Use the parameters returned from the trained model**
yPredTrain = predict(X_trainT, parameters)  # pass weights and bias from parameters dictionary and X_trainT as input to the function
yPredTest = predict(X_testT, parameters)  # pass the same parameters but X_testT as input data
**Run the below cell to print the accuracy of the model on train and test data.**
accuracy_train = 100 - np.mean(np.abs(yPredTrain - y_trainT)) * 100
accuracy_test = 100 - np.mean(np.abs(yPredTest - y_testT)) * 100
print("train accuracy: {} %".format(accuracy_train))
print("test accuracy: {} %".format(accuracy_test))
My Output:
train accuracy: 92.48826291079813 %
test accuracy: 90.9090909090909 %

I figured out where the problem was. The third line of the predict function reshaped the bias, which was not necessary at all:
def predict(X, parameters):
    W = parameters["W"]
    b = parameters["b"]
    b = b.reshape(b.shape[0],1)  # unnecessary reshape; remove this line
    Z = np.dot(W.T,X) + b
    Y = np.array([1 if y > 0.5 else 0 for y in sigmoid(Z[0])]).reshape(1,len(Z[0]))
    return Y
The third line of the back-propagation function also needed to be corrected to np.sum(dZ)/num_samples:
def backPropagration(X, Y, A, num_samples):
    dZ = A - Y
    dW = (np.dot(X,dZ.T))/num_samples
    db = sum(dZ)/num_samples  # incorrect; should be np.sum(dZ)/num_samples
    return dW, db
After I corrected both functions, the model gave a train accuracy of 98.59154929577464% and a test accuracy of 93.00699300699301%.
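The difference between the built-in sum and np.sum on a row vector can be seen in a small sketch. The dZ values below are made up for illustration; only the shapes matter. The built-in sum iterates over the first axis, so on a (1, m) array it returns an array of shape (m,), while np.sum reduces over all elements and returns a scalar. This is probably also why a reshape of b seemed necessary in predict.

```python
import numpy as np

# Hypothetical residuals for 4 samples, stored as a row vector of shape (1, 4).
dZ = np.array([[0.2, -0.1, 0.4, -0.3]])
num_samples = dZ.shape[1]

db_builtin = sum(dZ) / num_samples    # built-in sum adds up the rows: shape (4,)
db_numpy = np.sum(dZ) / num_samples   # np.sum reduces over every element: a scalar

print(db_builtin.shape)    # (4,)
print(np.ndim(db_numpy))   # 0
```

With the built-in sum, b silently becomes a vector with one entry per sample, so each sample gets its own "bias" during training.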

Related

Gradient descent for linear regression with numpy

I want to implement gradient descent with numpy for linear regression but I have some error in this code:
import numpy as np
# Code Example
rng = np.random.RandomState(10)
X = 10*rng.rand(1000, 5) # feature matrix
y = 0.9 + np.dot(X, [2.2, 4, -4, 1, 2]) # target vector
# GD implementation for linear regression
def GD(X, y, eta=0.1, n_iter=20):
    theta = np.zeros((X.shape[0], X.shape[1]))
    for i in range(n_iter):
        grad = 2 * np.mean((np.dot(theta.T, X) - y) * X)
        theta = theta - eta * grad
    return theta
# SGD implementation for linear regression
def SGD(X, y, eta=0.1, n_iter=20):
    theta = np.zeros(1, X.shape[1])
    for i in range(n_iter):
        for j in range(X.shape[0]):
            grad = 2 * np.mean((np.dot(theta.T, X[j,:]) - y[j]) * X[j,:])
            theta = theta - eta * grad
    return theta
# MSE loss for linear regression with numpy
def MSE(X, y, theta):
    return np.mean((X.dot(theta.T) - y)**2)
# linear regression with GD and MSE with numpy
theta_gd = GD(X, y)
theta_sgd = SGD(X, y)
print('MSE with GD: ', MSE(X, y, theta_gd))
print('MSE with SGD: ', MSE(X, y, theta_sgd))
The error is
grad = 2 * np.mean((np.dot(theta.T, X) - y) * X)
ValueError: operands could not be broadcast together with shapes (5,5) (1000,)
and I can't solve it.
Minor changes to your code that resolve the dimensionality issues in the matrix multiplications make it run successfully. In particular, note that a linear regression on a design matrix X of dimension N x k has a parameter vector theta of size k.
In addition, I'd suggest some changes to SGD() that make it a proper stochastic gradient descent: evaluate the gradient over random subsets of the data, obtained by randomly shuffling the index set of the training data with np.random.shuffle() and looping through it. The batch_size determines the size of each subset, after which the parameter estimate is updated. The seed argument ensures reproducibility.
# GD implementation for linear regression
def GD(X, y, eta=0.001, n_iter=100):
    theta = np.zeros(X.shape[1])
    for i in range(n_iter):
        for j in range(X.shape[0]):
            grad = (2 * np.mean(X[j,:] @ theta - y[j]) * X[j,:])  # changed line
            theta -= eta * grad
    return theta
# SGD implementation for linear regression
def SGD(X, y, eta=0.001, n_iter=1000, batch_size=25, seed=7678):
    theta = np.zeros(X.shape[1])
    indexSet = list(range(len(X)))
    np.random.seed(seed)
    for i in range(n_iter):
        np.random.shuffle(indexSet)  # random shuffle of index set
        for j in range(round(len(X) / batch_size)+1):
            X_sub = X[indexSet[j*batch_size:(j+1)*batch_size],:]
            y_sub = y[indexSet[j*batch_size:(j+1)*batch_size]]
            if(len(X_sub) > 0):
                grad = (2 * np.mean(X_sub @ theta - y_sub) * X_sub)  # changed line
                theta -= eta * np.mean(grad, axis=0)
    return theta
Running the code, I get
print('MSE with GD : ', MSE(X, y, theta_gd))
print('MSE with SGD: ', MSE(X, y, theta_sgd))
> MSE with GD : 0.07602
MSE with SGD: 0.05762
Each observation has 5 features, and X contains 1000 observations:
X = rng.rand(1000, 5) * 10 # X.shape == (1000, 5)
Create y which is perfectly linearly correlated with X (with no distortions):
real_weights = np.array([2.2, 4, -4, 1, 2]).reshape(-1, 1)
real_bias = 0.9
y = X @ real_weights + real_bias # y.shape == (1000, 1)
G.D. implementation for linear regression:
Note:
w (weights) is your theta variable.
I have also added the calculation of b (bias).
def GD(X, y, eta=0.1, n_iter=20):
    # Initialize weights and a bias (all zeros):
    w = np.zeros((X.shape[1], 1))  # w.shape == (5, 1)
    b = 0
    # Gradient descent
    for i in range(n_iter):
        errors = X @ w + b - y  # errors.shape == (1000, 1)
        dw = 2 * np.mean(errors * X, axis=0).reshape(5, 1)
        db = 2 * np.mean(errors)
        w -= eta * dw
        b -= eta * db
    return w, b
Testing:
w, b = GD(X, y, eta=0.003, n_iter=5000)
print(w, b)
[[ 2.20464905]
[ 4.00510139]
[-3.99569374]
[ 1.00444026]
[ 2.00407476]] 0.7805448262466914
Notes:
Your SGD function also contains some errors.
I'm using the @ operator because I prefer it to np.dot.
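For 2-D arrays like the ones in this answer, the @ operator, np.dot and np.matmul compute the same matrix product. A minimal sketch, reusing the data shapes from the question (the variable names y_at, y_dot and y_matmul are only for illustration):

```python
import numpy as np

rng = np.random.RandomState(10)
X = 10 * rng.rand(1000, 5)                        # feature matrix, shape (1000, 5)
w = np.array([2.2, 4, -4, 1, 2]).reshape(-1, 1)   # weights, shape (5, 1)

y_at = X @ w                  # matrix product via the @ operator
y_dot = np.dot(X, w)          # same product via np.dot
y_matmul = np.matmul(X, w)    # same product via np.matmul

print(y_at.shape)                     # (1000, 1)
print(np.allclose(y_at, y_dot))       # True
print(np.allclose(y_at, y_matmul))    # True
```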

Linear Regression loss value increasing after each iteration of gradient descent

I am trying to implement multivariate linear regression (gradient descent with an MSE cost function), but the loss value keeps increasing exponentially with every iteration of gradient descent and I'm unable to figure out why.
import numpy as np
from sklearn.datasets import load_boston
class LinearRegression:
    def __init__(self):
        self.X = None  # The feature vectors [shape = (m, n)]
        self.y = None  # The regression outputs [shape = (m, 1)]
        self.W = None  # The parameter vector `W` [shape = (n, 1)]
        self.bias = None  # The bias value `b`
        self.lr = None  # Learning Rate `alpha`
        self.m = None
        self.n = None
        self.epochs = None
    def fit(self, X: np.ndarray, y: np.ndarray, epochs: int = 100, lr: float = 0.001):
        self.X = X  # shape (m, n)
        self.m, self.n = X.shape
        assert y.size == self.m and y.shape[0] == self.m
        self.y = np.reshape(y, (-1, 1))  # shape (m, ) or (m, 1)
        assert self.y.shape == (self.m, 1)
        self.W = np.random.random((self.n, 1)) * 1e-3  # shape (n, 1)
        self.bias = 0.0
        self.epochs = epochs
        self.lr = lr
        self.minimize()
    def minimize(self, verbose: bool = True):
        for num_epoch in range(self.epochs):
            predictions = np.dot(self.X, self.W)
            assert predictions.shape == (self.m, 1)
            grad_w = (1/self.m) * np.sum((predictions-self.y) * self.X, axis=0)[:, np.newaxis]
            self.W = self.W - self.lr * grad_w
            assert self.W.shape == grad_w.shape
            loss = (1 / 2 * self.m) * np.sum(np.square(predictions - self.y))
            if verbose:
                print(f'Epoch : {num_epoch+1}/{self.epochs} \t Loss : {loss.item()}')
linear_regression = LinearRegression()
x_train, y_train = load_boston(return_X_y=True)
linear_regression.fit(x_train, y_train, 10)
I'm using the boston housing dataset from sklearn.
PS. I'd like to know what's causing this issue and how to fix it and whether or not my implementation is correct.
Thanks
The error is in the gradient. A divergence like that is not something you should see from an iterative shrinkage-thresholding algorithm (ISTA) solver.
For your gradient computation: X has shape (m, n) and W has shape (n, 1), so (predictions - y) has shape (m, 1). You then multiply it elementwise by X, i.e. (m, 1) by (m, n)? I'm not sure what numpy is computing there, but it is not what you want to compute:
grad_w = (1/self.m) * np.sum((predictions-self.y) * self.X, axis=0)[:, np.newaxis]
Here the code should be a bit different: multiply an (n, m) matrix by an (m, 1) vector to get an (n, 1) result, the same shape as W, so that the derivation is correct:
grad_w = (1/self.m) * np.dot(self.X.T, predictions - self.y)
I am also not sure why you use the dot product (which is a good idea) for the prediction but not for the gradient.
You also do not need so many reshapes:
import numpy as np
from sklearn.datasets import load_boston
A,b = load_boston(return_X_y=True)
n_samples = A.shape[0]
n_features = A.shape[1]
def grad_linreg(x):
    """Least-squares gradient"""
    grad = (1. / n_samples) * np.dot(A.T, np.dot(A, x) - b)
    return grad
def loss_linreg(x):
    """Least-squares loss"""
    f = (1. / (2. * n_samples)) * sum((b - np.dot(A, x)) ** 2)
    return f
And then you check that your gradient is good:
from scipy.optimize import check_grad
from numpy.random import randn
check_grad(loss_linreg,grad_linreg,randn(n_features))
check_grad(loss_linreg,grad_linreg,randn(n_features))
check_grad(loss_linreg,grad_linreg,randn(n_features))
check_grad(loss_linreg,grad_linreg,randn(n_features))
You can then build the Model on that.
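If scipy or the Boston data is not at hand, the same kind of gradient check can be sketched with NumPy alone. The example below is only an illustration: it uses synthetic data and a hypothetical helper numeric_grad (a central finite difference), and it reuses the function names grad_linreg and loss_linreg from above on that synthetic data.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 50, 4           # hypothetical problem size
A = rng.randn(n_samples, n_features)    # synthetic design matrix
b = rng.randn(n_samples)                # synthetic targets

def loss_linreg(x):
    """Least-squares loss."""
    return np.sum((b - A @ x) ** 2) / (2.0 * n_samples)

def grad_linreg(x):
    """Analytic least-squares gradient."""
    return A.T @ (A @ x - b) / n_samples

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        g[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

x0 = rng.randn(n_features)
max_abs_diff = np.max(np.abs(grad_linreg(x0) - numeric_grad(loss_linreg, x0)))
print(max_abs_diff)  # should be tiny if the analytic gradient is right
```

A large difference here would point to a mistake in the analytic gradient (or in the loss), which is the same signal check_grad gives.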
If you want to test that with ISTA/FISTA and Logistic/Linear Regression and LASSO/RIDGE, here is a jupyter notebook with the theory and a working example

Neural Network (operands could not be broadcast together with shapes (1,713) (713,18) )

I am currently taking the Deep Learning specialization by Deeplearning.ai on Coursera and am on the first assignment, which requires implementing a neural network with a logistic regression mindset. The assignment implements the neural network as a logistic regression function for UNSTRUCTURED DATA (IMAGES). I have successfully completed the assignment, getting all the expected outputs. However, I am now trying to use the same neural network on STRUCTURED DATA and run into a broadcast error. Part of the code is below:
The dataset code
path_train = r'C:\Users\Ahmed Ismail Khalid\Desktop\Research Paper\Research Paper Feature Sets\Balanced Feature Sets\Balanced Train combined scores.csv'
path_test = r'C:\Users\Ahmed Ismail Khalid\Desktop\Research Paper\Research Paper Feature Sets\Balanced Feature Sets\Balanced Test combined scores.csv'
df_train = pd.read_csv(path_train)
#df_train = df_train.to_numpy()
df_test = pd.read_csv(path_test)
#df_test = df_test.to_numpy()
x_train = df_train.iloc[:,1:19]
x_train = x_train.to_numpy()
x_train = x_train.T
y_train = df_train.iloc[:,19]
y_train = y_train.to_numpy()
y_train = y_train.reshape(y_train.shape[0],1)
y_train = y_train.T
x_test = df_test.iloc[:,1:19]
x_test = x_test.to_numpy()
x_test = x_test.T
y_test = df_test.iloc[:,19]
y_test = y_test.to_numpy()
y_test = y_test.reshape(y_test.shape[0],1)
y_test = y_test.T
print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
print ("train_set_x shape: " + str(x_train.shape))
print ("train_set_y shape: " + str(y_train.shape))
print ("test_set_x shape: " + str(x_test.shape))
print ("test_set_y shape: " + str(y_test.shape))
Output of Dataset Code
Number of training examples: df_train = 713
Number of testing examples: df_test = 237
x_train shape: (18, 713)
y_train shape: (1, 713)
x_test shape: (18, 237)
y_test shape: (1, 237)
The propagate function code
def propagate(w,b,X,Y) :
    m = X.shape[1]
    A = sigmoid((w.T * X) + b)
    cost = (- 1 / m) * np.sum(np.dot(Y,np.log(A)) + np.dot((1 - Y), np.log(1 - A)))
    dw = (1 / m) * np.dot((X,(A - Y)).T)
    db = (1 / m) * np.sum(A - Y)
    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    grads = {"dw": dw,
             "db": db}
    return grads, cost
The optimize and model functions
def optimize(w,b,X,Y,num_iterations,learning_rate,print_cost) :
    costs = []
    for i in range(num_iterations) :
        # Cost and gradient calculation
        grads, cost = propagate(w,b,X,Y)
        # Retrieve derivatives from gradients
        dw = grads['dw']
        db = grads['db']
        # Update w and b
        w = w - learning_rate * dw
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(cost)
        # Print the cost every 100 training iterations
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, costs
def model(X_train, Y_train, X_test, Y_test, num_iterations = 2000, learning_rate = 0.5, print_cost = False) :
    # initialize parameters with zero
    w, b = initialize_with_zeros(X_train.shape[0])
    # Gradient descent (≈ 1 line of code)
    parameters, grads, costs = optimize(w,b,X_train,Y_train,num_iterations,learning_rate,print_cost)
    # Retrieve parameters w and b from dictionary "parameters"
    w = parameters["w"]
    b = parameters["b"]
    # Predict train/test set examples (≈ 2 lines of code)
    Y_prediction_train = predict(w,b,X_train)
    Y_prediction_test = predict(w,b,X_test)
    # Print train/test Errors
    print("train accuracy: {} %".format(100 - np.mean(abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(abs(Y_prediction_test - Y_test)) * 100))
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test,
         "Y_prediction_train" : Y_prediction_train,
         "w" : w,
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    return d
Model Function output
Cost after iteration 0: 0.693147
train accuracy: -0.1402524544179613 %
test accuracy: 0.4219409282700326 %
When I run the code, I get ValueError: operands could not be broadcast together with shapes (1,713) (713,18) at A = sigmoid((w.T * X) + b). I am pretty new to neural networks and usage of numpy, so I can't figure out the problem. Any and all help would be really appreciated. The entire .ipynb file containing the entire code can be downloaded from here
Thanks
The * operator is elementwise multiplication, and your arrays have incompatible shapes. You want matrix multiplication, which you can do with np.matmul() or with the @ operator:
A = sigmoid(w.T @ X + b)
A lot of ML, especially neural nets, is about keeping the shapes of things straight. Check the shapes of your w, X, and Y — they should be: (features, 1), (features, m), (1, m) respectively, where features is 18 for you, and m is 713.
You should also then be able to make sure that the shape of A matches Y.
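The shape bookkeeping above can be sketched as follows, assuming w, X and Y have exactly the shapes listed (18 features, 713 samples). The data is random and the flag elementwise_ok is just a name used for illustration.

```python
import numpy as np

features, m = 18, 713                 # sizes quoted in the question
rng = np.random.RandomState(0)
w = np.zeros((features, 1))           # weights, shape (18, 1)
b = 0.0
X = rng.rand(features, m)             # inputs, shape (18, 713)
Y = rng.randint(0, 2, size=(1, m))    # labels, shape (1, 713)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

A = sigmoid(w.T @ X + b)              # (1, 18) @ (18, 713) -> (1, 713)
assert A.shape == Y.shape

# Elementwise * needs broadcastable shapes; (1, 18) and (18, 713) are not.
try:
    w.T * X
    elementwise_ok = True
except ValueError:
    elementwise_ok = False

print(A.shape, elementwise_ok)        # (1, 713) False
```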

numpy squeeze side effects

I've trained a simple machine learning model, a polynomial regression. The pseudocode of the prediction function is as follows:
def f(x):
    """
    x is a np.ndarray of shape (m, )
    """
    # X is stacked from x ** 0, x ** 1, x ** 2, ..., x ** (n - 1) by rows
    # X has shape (m, n)
    # m is the number of training examples
    X = generate(x)
    Y = np.dot(X, W)
    return Y
W holds the trained parameters. Here the shape of Y is (m, 1), but if I return Y.squeeze(), which has shape (m,), I get a very different standard deviation on the test set, say 70 for the former and 8 for the latter.
I use random initialisation, but I've trained and tested many times, and the std of the squeezed version is much smaller. So I wonder why.
The complete code is below, so you can test it yourself. My questions are in line 90 and line 91
# python: 3.5.2
# encoding: utf-8
# numpy: 1.14.1
import numpy as np
import matplotlib.pyplot as plt
def load_data(filename):
    xys = []
    with open(filename, 'r') as f:
        for line in f:
            xys.append(map(float, line.strip().split()))
        xs, ys = zip(*xys)
        return np.asarray(xs), np.asarray(ys)
def evaluate(ys, ys_pred):
    std = np.sqrt(np.mean(np.abs(ys - ys_pred) ** 2))
    return std
def linear_regression(x_train, y_train, n=2, learning_rate=0.0005, epochs=1000, l2=0, Print=False):
    """
    This target function is: y = b + w1 * x^1 + w2 * x^2 + ...
    also y = b + np.dot(w.T, x)
    :param x_train: np.ndarray
    :param y_train: np.ndarray
    :return: a trained model (as a function), trained by x_train and y_train
    """
    # get the number of train e.g.
    m = x_train.shape[0]
    # set and initialize parameters here
    # intercept
    b = np.float64(-10)
    # weights
    w = np.float64(np.random.randn(n, 1))
    # convert the x_train matrix to a design matrix
    X = np.zeros((n, m), dtype=np.float64)
    for i in range(n):
        X[i, :] = x_train ** (i + 1)
    X = np.float64(X)
    Y = np.float64(np.reshape(y_train, newshape=(1, m)))
    # if plot of the training process is needed
    costs = []
    # train on the dataset
    for epoch in range(epochs):
        # compute the gradient of cost on w
        Z = b + np.dot(w.T, X)
        dZ = Z - Y
        dw = 1./m * np.dot(X, dZ.T)
        db = 1./m * np.squeeze(np.sum(dZ))
        # update the parameters, for w, I also set "weight decay"
        w -= learning_rate * dw + l2 * w
        b -= learning_rate * db
        cost = np.squeeze(0.5/m * np.dot(dZ, dZ.T))
        costs.append(cost)
        if Print == True and epoch % 25 == 0:
            print("Cost after " + str(epoch) + " iterations " + ": " + str(cost))
    # plot the costs
    if Print == True:
        plt.plot(costs)
        plt.show()
    def pred(x):
        assert type(x) is np.ndarray
        m = x.shape[0]
        # convert the x_train matrix to a design matrix
        X = np.zeros((n, m))
        for i in range(n):
            X[i, :] = x ** (i + 1)
        # to predict
        Y = b + np.dot(w.T, X)
        return Y.T
        # return Y.squeeze()
    return pred
if __name__ == '__main__':
    train_file = 'train.txt'
    test_file = 'test.txt'
    # load data
    x_train, y_train = load_data(train_file)
    x_test, y_test = load_data(test_file)
    print(x_train.shape)
    print(x_test.shape)
    # use a trained linear-regression model
    f = linear_regression(x_train, y_train, n=2, epochs=10000, Print=False, learning_rate=1e-8, l2=5e-2)
    # compute the predictions
    y_test_pred = f(x_test)
    # use the test set to evaluate the model
    std = evaluate(y_test, y_test_pred)
    print('the standard deviation:{:.1f}'.format(std))
    # show the result
    plt.plot(x_train, y_train, 'ro', markersize=3)
    plt.plot(x_test, y_test, 'k')
    plt.plot(x_test, y_test_pred)
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Linear Regression')
    plt.legend(['train', 'test', 'pred'])
    plt.show()
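One plausible explanation, not confirmed anywhere in this thread, is NumPy broadcasting inside evaluate(): y_test has shape (m,), so subtracting predictions of shape (m, 1) produces an (m, m) matrix of all pairwise differences rather than m matching differences. A minimal sketch with made-up data (the names y_pred_col and y_pred_flat are only for illustration):

```python
import numpy as np

def evaluate(ys, ys_pred):
    # Same formula as in the question: root mean squared difference.
    return np.sqrt(np.mean(np.abs(ys - ys_pred) ** 2))

rng = np.random.RandomState(0)
m = 100                                  # hypothetical number of test samples
y_test = rng.randn(m)                    # shape (m,), like np.asarray(ys)
y_pred_col = y_test[:, None] + 0.1       # shape (m, 1), like `return Y.T`
y_pred_flat = y_pred_col.squeeze()       # shape (m,), like `return Y.squeeze()`

diff_col = y_test - y_pred_col           # broadcasts to (m, m): every pair
diff_flat = y_test - y_pred_flat         # stays (m,): matching pairs only

print(diff_col.shape, diff_flat.shape)   # (100, 100) (100,)
print(evaluate(y_test, y_pred_flat))     # about 0.1, the true offset
print(evaluate(y_test, y_pred_col))      # much larger: mixes unrelated samples
```

If this is what happens in the question's code, the squeezed version is the one that measures the intended error.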

Modify neural net to classify single example

This is my custom extension of one of Andrew Ng's neural networks from the deep learning course where, instead of producing 0 or 1 for binary classification, I'm attempting to classify multiple examples.
Both the inputs and outputs are one-hot encoded.
With not much training I get an accuracy of 'train accuracy: 67.51658067499625 %'.
How can I classify a single training example instead of classifying all training examples?
I think there is a bug in my implementation: the training examples (train_set_x) and the output values (train_set_y) both need to have the same dimensions, or an error related to the dimensionality of the matrices is raised.
For example using :
train_set_x = np.array([
[1,1,1,1],[0,1,1,1],[0,0,1,1]
])
train_set_y = np.array([
[1,1,1],[1,1,0],[1,1,1]
])
returns error :
ValueError Traceback (most recent call last)
<ipython-input-11-0d356e8d66f3> in <module>()
27 print(A)
28
---> 29 np.multiply(train_set_y,A)
30
31 def initialize_with_zeros(numberOfTrainingExamples):
ValueError: operands could not be broadcast together with shapes (3,3) (1,4)
network code :
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from scipy import ndimage
import pandas as pd
%matplotlib inline
train_set_x = np.array([
[1,1,1,1],[0,1,1,1],[0,0,1,1]
])
train_set_y = np.array([
[1,1,1,0],[1,1,0,0],[1,1,1,1]
])
numberOfFeatures = 4
numberOfTrainingExamples = 3
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s
w = np.zeros((numberOfTrainingExamples , 1))
b = 0
A = sigmoid(np.dot(w.T , train_set_x))
print(A)
np.multiply(train_set_y,A)
def initialize_with_zeros(numberOfTrainingExamples):
    w = np.zeros((numberOfTrainingExamples , 1))
    b = 0
    return w, b
def propagate(w, b, X, Y):
    m = X.shape[1]
    A = sigmoid(np.dot(w.T , X) + b)
    cost = -(1/m)*np.sum(np.multiply(Y,np.log(A)) + np.multiply((1-Y),np.log(1-A)), axis=1)
    dw = ( 1 / m ) * np.dot( X, ( A - Y ).T )  # consumes ( A - Y )
    db = ( 1 / m ) * np.sum( A - Y )  # consumes ( A - Y ) again
    # cost = np.squeeze(cost)
    grads = {"dw": dw,
             "db": db}
    return grads, cost
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = True):
    costs = []
    for i in range(num_iterations):
        grads, cost = propagate(w, b, X, Y)
        dw = grads["dw"]
        db = grads["db"]
        w = w - (learning_rate * dw)
        b = b - (learning_rate * db)
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 10000 == 0:
            print(cost)
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, costs
def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False):
    w, b = initialize_with_zeros(numberOfTrainingExamples)
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = True)
    w = parameters["w"]
    b = parameters["b"]
    Y_prediction_train = sigmoid(np.dot(w.T , X_train) + b)
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.0001, print_cost = True)
Update: A bug exists in this implementation in that the training example pairs (train_set_x, train_set_y) must have the same dimensions. Can someone point me in the direction of how the linear algebra should be modified?
Update 2:
I modified @Paul Panzer's answer so that the learning rate is 0.001 and the train_set_x, train_set_y pairs are unique:
train_set_x = np.array([
[1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1]
])
train_set_y = np.array([
[1,0,0],[0,0,1],[0,1,0],[1,0,1]
])
grads = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True)
# To classify a single training example:
print(sigmoid(dw @ [0,0,1,1,0] + db))
This update produces following output :
-2.09657359028
-3.94918577439
[[ 0.74043089 0.32851512 0.14776077 0.77970162]
[ 0.04810012 0.08033521 0.72846174 0.1063849 ]
[ 0.25956911 0.67148488 0.22029838 0.85223923]]
[[1 0 0 1]
[0 0 1 0]
[0 1 0 1]]
train accuracy: 79.84462279013312 %
[[ 0.51309252 0.48853845 0.50945862]
[ 0.5110232 0.48646923 0.50738869]
[ 0.51354109 0.48898712 0.50990734]]
Should print(sigmoid(dw @ [0,0,1,1,0] + db)) produce a vector that, once rounded, matches the corresponding train_set_y value, [0,1,0]?
Modifying it to produce a column vector (wrapping [0,0,1,1,0] in a numpy array and taking the transpose):
print(sigmoid(dw @ np.array([[0,0,1,1,0]]).T + db))
returns :
array([[ 0.51309252],
[ 0.48646923],
[ 0.50990734]])
Again, rounding these values to the nearest whole number produces the vector [1,0,1] when [0,1,0] is expected.
Are these the wrong operations for producing a prediction for a single training example?
Your difficulties come from mismatched dimensions, so let's walk through the problem and try and get them straight.
Your network has a number of inputs, the features; let's call their number N_in (numberOfFeatures in your code). It also has a number of outputs, which correspond to different classes; let's call their number N_out. Inputs and outputs are connected by the weights w.
Now here is the problem. Connections are all-to-all, so we need a weight for each of the N_out x N_in pairs of outputs and inputs. Therefore in your code the shape of w must be changed to (N_out, N_in). You probably also want an offset b for each output, so b should be a vector of size (N_out,) or rather (N_out, 1) so it plays well with the 2d terms.
I've fixed that in the modified code below and I tried to make it very explicit. I've also thrown a mock data creator into the bargain.
Regarding the one-hot encoded categorical output: I'm not an expert on neural networks, but I think most people understand it to mean that the classes are mutually exclusive, so each sample in your mock output should have a single one and the rest zeros.
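The shape bookkeeping described above can be sketched as follows. The sizes, the random inputs and the names labels and N_samples are assumptions made for illustration; they are not taken from the question's data.

```python
import numpy as np

N_in, N_out, N_samples = 5, 3, 4          # hypothetical sizes
rng = np.random.RandomState(0)

X = rng.randint(0, 2, size=(N_in, N_samples)).astype(float)  # one column per sample
w = np.zeros((N_out, N_in))               # one weight per (output, input) pair
b = np.zeros((N_out, 1))                  # one offset per output, kept 2-D

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

A = sigmoid(w @ X + b)                    # (N_out, N_in) @ (N_in, N_samples)
print(A.shape)                            # (3, 4)

# Mock one-hot targets: exactly one 1 per sample (column).
labels = rng.randint(0, N_out, size=N_samples)
Y = np.zeros((N_out, N_samples), dtype=int)
Y[labels, np.arange(N_samples)] = 1
print(Y.sum(axis=0))                      # [1 1 1 1]
```

With these shapes, A and Y line up elementwise, so the cost and its gradient need no further reshaping.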
Side note:
At one point a competing answer advised you to get rid of the 1-... terms in the cost function. While that looks like an interesting idea to me, my gut feeling is that it won't work just like that, because the cost will then fail to penalise false positives (see below for a detailed explanation). (Edit: now confirmed using a gradient-free minimizer; use activation="hybrid" in the code below. The solver simply maximizes all outputs which are active in at least one training example.) To make it work you'd have to add some kind of regularization. One method that appears to work is using the softmax instead of the sigmoid. The softmax is to one-hot what the sigmoid is to binary. It makes sure the output is "fuzzy one-hot".
Therefore my recommendation is:
If you want to stick with sigmoid and not explicitly enforce one-hot predictions, keep the 1-... term.
If you want to use the shorter cost function, enforce one-hot predictions, for example by using softmax instead of sigmoid.
I've added an activation="sigmoid"|"softmax"|"hybrid" parameter to the code that switches between models. I've also made the scipy general purpose minimizer available, which may be useful when the gradient of the cost is not at hand.
Recap on how the cost function works:
The cost is a sum over all classes and all training samples of the term
-y log (y') - (1-y) log (1-y')
where y is the expected response, i.e. the one given by the "y" training sample for the input (the "x" training sample). y' is the prediction, the response the network with its current weights and biases generates. Now, because the expected response is either 0 or 1 the cost for a single category and a single training sample can be written
-log (y') if y = 1
-log(1-y') if y = 0
because in the first case (1-y) is zero, so the second term vanishes, and in the second case y is zero, so the first term vanishes.
One can now convince oneself that the cost is high if
the expected response y is 1 and the network prediction y' is close to zero
the expected response y is 0 and the network prediction y' is close to one
In other words the cost does its job in punishing wrong predictions. Now, if we drop the second term (1-y) log (1-y') half of this mechanism is gone. If the expected response is 1, a low prediction will still incur a cost, but if the expected response is 0, the cost will be zero, regardless of the prediction, in particular, a high prediction (or false positive) will go unpunished.
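A small numeric sketch of this argument, for a single class and a single training sample. The function names full_cost and truncated_cost and the prediction values 0.99 and 0.01 are made up for illustration.

```python
import numpy as np

def full_cost(y, y_pred):
    # Both terms: penalises misses and false positives.
    return -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)

def truncated_cost(y, y_pred):
    # First term only: the (1 - y) part has been dropped.
    return -y * np.log(y_pred)

# Expected response 0, network predicts 0.99 (a false positive).
fp_full = full_cost(0, 0.99)
fp_trunc = truncated_cost(0, 0.99)
# Expected response 1, network predicts 0.01 (a miss).
miss_full = full_cost(1, 0.01)
miss_trunc = truncated_cost(1, 0.01)

print(fp_full, fp_trunc)      # large cost vs. zero cost
print(miss_full, miss_trunc)  # the same large cost in both versions
```

The false positive costs nothing under the truncated cost, while the miss is penalised equally by both.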
Now, because the total cost is a sum over all training samples, there are three possibilities.
all training samples prescribe that the class be zero:
then the cost will be completely independent of the predictions for this class and no learning can take place
some training samples put the class at zero, some at one:
then because "false negatives" or "misses" are still punished but false positives aren't the net will find the easiest way to minimize the cost which is to indiscriminately increase the prediction of the class for all samples
all training samples prescribe that the class be one:
essentially the same as in the second scenario will happen, only here it's no problem, because that is the correct behavior
And finally, why does it work if we use softmax instead of sigmoid? False positives will still be invisible. Now it is easy to see that the sum over all classes of the softmax is one. So I can only increase the prediction for one class if at least one other class is reduced to compensate. In particular, there can be no false positives without a false negative, and the false negative the cost will detect.
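The compensation effect can be checked numerically. The sketch below uses a column-wise softmax (one column per sample) and made-up scores; z, z2, p and p2 are illustrative names.

```python
import numpy as np

def softmax(z):
    # Subtract the column-wise maximum for numerical stability.
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

z = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.5, 3.0]])           # hypothetical scores: 3 classes, 2 samples
p = softmax(z)
print(p.sum(axis=0))                 # each column sums to 1

# Raising the score of class 0 in sample 0 lowers the other classes there.
z2 = z.copy()
z2[0, 0] += 2.0
p2 = softmax(z2)
print(p2[0, 0] > p[0, 0], np.all(p2[1:, 0] < p[1:, 0]))  # True True
```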
On how to get a binary prediction:
For binary expected responses rounding is indeed the appropriate procedure. For one-hot I'd rather find the largest value, set that to one and all others to zero. I've added a convenience function, predict, implementing that.
import numpy as np
from scipy import optimize as opt
from collections import namedtuple

# First, a few structures to keep ourselves organized
Problem_Size = namedtuple('Problem_Size', 'Out In Samples')
Data = namedtuple('Data', 'Out In')
Network = namedtuple('Network', 'w b activation cost gradient most_likely')

def get_dims(Out, In, transpose=False):
    """extract dimensions and ensure everything is 2d
    return Data, Dims"""
    # gracefully accept lists etc.
    Out, In = np.asanyarray(Out), np.asanyarray(In)
    if transpose:
        Out, In = Out.T, In.T
    # if it's a single sample make sure it's n x 1
    Out = Out[:, None] if len(Out.shape) == 1 else Out
    In = In[:, None] if len(In.shape) == 1 else In
    Dims = Problem_Size(Out.shape[0], *In.shape)
    if Dims.Samples != Out.shape[1]:
        raise ValueError("number of samples must be the same for Out and In")
    return Data(Out, In), Dims
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s

def sig_cost(Net, data):
    A = process(data.In, Net)
    # full cross-entropy: both the y*log(A) and the (1-y)*log(1-A) term
    return -(data.Out * np.log(A) + (1-data.Out) * np.log(1-A)).sum(axis=0).mean()

def sig_grad (Net, Dims, data):
    A = process(data.In, Net)
    return dict(dw = (A - data.Out) @ data.In.T / Dims.Samples,
                db = (A - data.Out).mean(axis=1, keepdims=True))
def sig_ml(z):
    return np.round(z).astype(int)

def sof_ml(z):
    hot = np.argmax(z, axis=0)
    z = np.zeros(z.shape, dtype=int)
    z[hot, np.arange(len(hot))] = 1
    return z

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)
    z = np.exp(z)
    return z / z.sum(axis=0, keepdims=True)

def sof_cost(Net, data):
    A = process(data.In, Net)
    logA = np.log(A)
    return -(data.Out * logA).sum(axis=0).mean()

sof_grad = sig_grad
def get_net(Dims, activation='softmax'):
    activation, cost, gradient, ml = {
        'sigmoid': (sigmoid, sig_cost, sig_grad, sig_ml),
        'softmax': (softmax, sof_cost, sof_grad, sof_ml),
        'hybrid': (sigmoid, sof_cost, None, sig_ml)}[activation]
    return Network(w=np.zeros((Dims.Out, Dims.In)),
                   b=np.zeros((Dims.Out, 1)),
                   activation=activation, cost=cost, gradient=gradient,
                   most_likely=ml)

def process(In, Net):
    return Net.activation(Net.w @ In + Net.b)

def propagate(data, Dims, Net):
    return Net.gradient(Net, Dims, data), Net.cost(Net, data)
def optimize_no_grad(Net, Dims, data):
    def f(x):
        Net.w[...] = x[:Net.w.size].reshape(Net.w.shape)
        Net.b[...] = x[Net.w.size:].reshape(Net.b.shape)
        return Net.cost(Net, data)
    x = np.r_[Net.w.ravel(), Net.b.ravel()]
    res = opt.minimize(f, x, options=dict(maxiter=10000)).x
    Net.w[...] = res[:Net.w.size].reshape(Net.w.shape)
    Net.b[...] = res[Net.w.size:].reshape(Net.b.shape)

def optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True):
    w, b = Net.w, Net.b
    costs = []
    for i in range(num_iterations):
        grads, cost = propagate(data, Dims, Net)
        dw = grads["dw"]
        db = grads["db"]
        # in-place updates, so Net.w and Net.b change as well
        w -= learning_rate * dw
        b -= learning_rate * db
        if i % 100 == 0:
            costs.append(cost)
        if print_cost and i % 10000 == 0:
            print(cost)
    return grads, costs
def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False, activation='sigmoid'):
    data, Dims = get_dims(Y_train, X_train, transpose=True)
    Net = get_net(Dims, activation)
    if Net.gradient is None:
        optimize_no_grad(Net, Dims, data)
    else:
        # pass the caller's print_cost through instead of hard-coding True
        grads, costs = optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = print_cost)
    Y_prediction_train = process(data.In, Net)
    print(Y_prediction_train)
    print(data.Out)
    print(Y_prediction_train.sum(axis=0))
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - data.Out)) * 100))
    return Net
def predict(In, Net, probability=False):
    In = np.asanyarray(In)
    is1d = In.ndim == 1
    if is1d:
        In = In.reshape(-1, 1)
    Out = process(In, Net)
    if not probability:
        Out = Net.most_likely(Out)
    if is1d:
        Out = Out.reshape(-1)
    return Out

def create_data(Dims):
    Out = np.zeros((Dims.Out, Dims.Samples), dtype=int)
    Out[np.random.randint(0, Dims.Out, (Dims.Samples,)), np.arange(Dims.Samples)] = 1
    In = np.random.randint(0, 2, (Dims.In, Dims.Samples))
    return Data(Out, In)
train_set_x = np.array([
    [1,1,1,1,1],[0,1,1,1,1],[0,0,1,1,0],[0,0,1,0,1]
])
train_set_y = np.array([
    [1,0,0],[1,0,0],[0,0,1],[0,0,1]
])

Net1 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
Net2 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='softmax')
Net3 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='hybrid')

Dims = Problem_Size(8, 100, 50)
data = create_data(Dims)
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='softmax')
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='sigmoid')
Both fixing the bug and extending the implementation to classify between more classes come down to some dimensionality analysis.
I am assuming that by "classifying multiple examples" you mean multiple classes and not multiple samples, since we need multiple samples to train even for 2 classes.
Let N = number of samples, D = number of features, K = number of categories (K=2 is a special case that can be reduced to one dimension, i.e. K=1, with y=0 signifying one class and y=1 the other). The data should have the following dimensions:
X: N * D #input
y: N * K #output
W: D * K #weights, also dW has same dimensions
b: 1 * K #bias, also db has same dimensions
#A should have same dimensions as y
The order of the dimensions can be switched around, as long as the dot products are done correctly.
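As a sketch of that bookkeeping, the snippet below builds arrays with the dimensions listed above and checks that the shapes line up. The concrete values of N, D and K, the random data and the use of an element-wise sigmoid are my own assumptions for illustration.

```python
import numpy as np

N, D, K = 6, 4, 3                 # deliberately all different
rng = np.random.default_rng(0)

X = rng.random((N, D))                        # input,  N x D
y = np.eye(K)[rng.integers(0, K, size=N)]     # output, N x K (one-hot rows)
W = np.zeros((D, K))                          # weights, D x K
b = np.zeros((1, K))                          # bias,    1 x K

Z = X @ W + b                     # N x K, b is broadcast over the rows
A = 1 / (1 + np.exp(-Z))          # same shape as y
dW = X.T @ (A - y) / N            # D x K, same shape as W
db = (A - y).mean(axis=0, keepdims=True)      # 1 x K, same shape as b

print(A.shape, dW.shape, db.shape)
```

With N, D and K all different, any transposed operand or wrongly sized weight matrix fails immediately with a shape error instead of silently broadcasting.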
First, dealing with your bug: you are initializing W as N * K instead of D * K, i.e. in the binary case:
w = np.zeros((numberOfTrainingExamples , 1))
#instead of
w = np.zeros((numberOfFeatures , 1))
This means that W only gets the correct dimensions when the number of training examples (coincidentally) equals the number of features.
This will mess with your dot products as well:
np.dot(X, w) # or np.dot(w.T,X.T) if you define y as [K * N] dimensions
#instead of
np.dot(w.T , X)
and
np.dot( X.T, ( A - Y ) ) #np.dot( X.T, ( A - Y ).T ) if y:[K * N]
#instead of
np.dot( X, ( A - Y ).T )
Also make sure that the cost function returns a single number (i.e. not an array).
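Putting the binary-case corrections together, one forward and backward pass could look like the sketch below. It assumes X is stored as N x D and Y as N x 1; the sizes and random data are placeholders of mine, not taken from the question.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

N, D = 5, 3                                   # samples, features (illustrative)
rng = np.random.default_rng(1)
X = rng.random((N, D))                        # N x D
Y = rng.integers(0, 2, size=(N, 1)).astype(float)   # N x 1

w = np.zeros((D, 1))                          # D x 1, not N x 1
b = 0.0

A = sigmoid(np.dot(X, w) + b)                 # N x 1, same shape as Y
# mean over samples collapses the cost to a single number
cost = float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
dw = np.dot(X.T, A - Y) / N                   # D x 1, same shape as w
db = float(np.mean(A - Y))

print(cost, dw.shape, db)
```

With zero weights every prediction is 0.5, so the cost of this first pass is log(2) whatever the labels are, which is a handy sanity check.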
Secondly, going on to K>2, you need to make some changes. b is no longer a single number but a vector (1D array), and y and W go from being 1D arrays to 2D arrays. To avoid confusion and hard-to-find bugs, it is a good idea to set K, N and D to different values.
