How to calculate logistic regression accuracy - python

I am a complete beginner in machine learning and Python, and I have been tasked with coding logistic regression from scratch to understand what happens under the hood. So far I have written the hypothesis function, the cost function and gradient descent, and then the logistic regression itself. However, when I print the accuracy I get a low value (0.69) that doesn't change when I increase the number of iterations or change the learning rate. Is there a problem with my accuracy code below? Any help pointing me in the right direction would be appreciated.
X = data[['radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)
X = data["diagnosis"].map(lambda x: float(x))
def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))
def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i]*theta[i]
    return Sigmoid(z)
def Cost_Function(X,Y,theta,m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta,xi)
        error = Y[i] * math.log(hi if hi >0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if hi >0 else 1)
        elif Y[i] == 0:
            error = (1-Y[i]) * math.log(1-hi if 1-hi >0 else 1)
        sumOfErrors += error
    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J )
    return J
def Cost_Function_Derivative(X,Y,theta,j,m,alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta,X[i])
        error = (hi - Y[i])*xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J
def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X,Y,theta,j,m,alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta
def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X,theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i],theta))
        answer = Y[i]
        if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length)*100
    print ('LR Accuracy %: ', my_accuracy)
def Logistic_Regression(X,Y,alpha,theta,num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X,Y,theta,m)
            print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)
initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
alpha = 0.0001
iterations = 1000
Logistic_Regression(X,Y,alpha,initial_theta,iterations)
This is using data from the wisconsin breast cancer dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data) where I am weighing in 30 features - although changing the features to ones which are known to correlate also doesn't change my accuracy.

The scikit-learn library makes this easier. This worked for me (log is the fitted model):
from sklearn.metrics import accuracy_score
y_pred = log.predict(x_test)
score = accuracy_score(y_test, y_pred)

Accuracy is one of the most intuitive performance measures: it is simply the ratio of correctly predicted observations to the total number of observations. Higher accuracy generally means the model is performing better.
Accuracy = (TP + TN) / (TP + FP + FN + TN)
TP = True positives
TN = True negatives
FN = False negatives
FP = False positives
Accuracy is a reasonable measure only when false positives and false negatives have similar costs. Otherwise, a better metric is often the F1-score, which is given by
F1-score = 2 * (Recall * Precision) / (Recall + Precision), where
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Read more here
https://en.wikipedia.org/wiki/Precision_and_recall
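To make the formulas above concrete, here is a minimal sketch in plain Python that counts TP, TN, FP and FN and derives the four metrics from them. The function names and the toy labels are made up for illustration, and it assumes binary labels where 1 is the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this toy data all four metrics come out to 0.75; on imbalanced data they can differ a lot, which is when looking beyond accuracy pays off.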
The beauty of machine learning in Python is that important modules like scikit-learn are open source, so you can always look at the actual code.
The link below points to the scikit-learn metrics source code, which will give you an idea of how scikit-learn calculates the accuracy score when you do
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/metrics

I'm not sure how you arrived at a value of 0.0001 for alpha, but I think it's too low. Running your code on the cancer data shows that the cost is decreasing with each iteration, it's just going glacially.
When I raise alpha to 0.5, the cost still decreases, but at a more reasonable rate. After 1000 iterations it reports:
cost: 0.23668000993020666
And after fixing the Accuracy function I'm getting 92% on the test segment of the data.
You have NumPy installed, as shown by X = np.array(X). You should really consider using it for your operations. It will be orders of magnitude faster for jobs like this. Here is a vectorized version that gives results instantly rather than making you wait:
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv("cancerdata.csv")
X = df.values[:,2:-1].astype('float64')
X = (X - np.mean(X, axis =0)) / np.std(X, axis = 0)
## Add a bias column to the data
X = np.hstack([np.ones((X.shape[0], 1)),X])
X = MinMaxScaler().fit_transform(X)
Y = df["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.25)
def Sigmoid(z):
    return 1/(1 + np.exp(-z))
def Hypothesis(theta, x):
    return Sigmoid(x @ theta)
def Cost_Function(X,Y,theta,m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1-_y) * np.log(1-hi))
    return J
def Cost_Function_Derivative(X,Y,theta,m,alpha):
    hi = Hypothesis(theta,X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * X.T @ (hi - _y)
    return J
def Gradient_Descent(X,Y,theta,m,alpha):
    new_theta = theta - Cost_Function_Derivative(X,Y,theta,m,alpha)
    return new_theta
def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length)*100
    print ('LR Accuracy %: ', my_accuracy)
def Logistic_Regression(X,Y,alpha,theta,num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X,Y,theta,m,alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)
            print ('cost: ', Cost_Function(X,Y,theta,m))
    Accuracy(theta)
ep = .012
initial_theta = np.random.rand(X_train.shape[1],1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train,Y_train,alpha,initial_theta,iterations)
I think I might have a different version of scikit-learn, because I had to change the MinMaxScaler line to make it work. The result is that I can run 10K iterations in the blink of an eye, and applying the model to the test set gives about 97% accuracy.
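For the loop-based code in the question, a minimal sketch of one possible fix to the accuracy calculation follows. It assumes the problems are the swapped arguments in Hypothesis(X[i],theta), the two-argument len(...) call, and scoring on X and Y rather than on the held-out X_test and Y_test. The lowercase function names and the toy data are hypothetical, not the asker's.

```python
import math


def sigmoid(z):
    # Stable for large negative z: never evaluates exp() of a large positive number.
    if z < 0:
        ez = math.exp(z)
        return ez / (1 + ez)
    return 1 / (1 + math.exp(-z))


def hypothesis(theta, x):
    # theta comes first, then one sample x, matching the question's definition.
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def accuracy(theta, X_eval, Y_eval):
    """Percentage of samples whose thresholded prediction matches the label."""
    correct = 0
    for x, y in zip(X_eval, Y_eval):
        prediction = 1 if hypothesis(theta, x) >= 0.5 else 0
        if prediction == y:
            correct += 1
    return 100.0 * correct / len(Y_eval)


# Hypothetical toy data: first column is a bias term, second a single feature.
theta_demo = [-1.0, 2.0]
X_demo = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.2], [1.0, 0.9]]
Y_demo = [0, 1, 1, 1]
print('LR Accuracy %:', accuracy(theta_demo, X_demo, Y_demo))
```

Passing the evaluation set in as arguments, instead of reading globals, makes it harder to score on the wrong split by accident.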

This also works, using vectorization to calculate the accuracy.
But accuracy is not a recommended metric, as the answer above noted: if the data is not well balanced, you should use the F1-score instead of accuracy.
clf = sklearn.linear_model.LogisticRegressionCV();
clf.fit(X.T, Y.T);
LR_predictions = clf.predict(X.T)
print ('Accuracy of logistic regression: %d ' % float((np.dot(Y,LR_predictions) + np.dot(1-Y,1-LR_predictions))/float(Y.size)*100) +
'% ' + "(percentage of correctly labelled datapoints)")
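The dot-product expression above counts the agreements between labels and predictions: np.dot(Y, pred) counts the samples where both are 1, and np.dot(1-Y, 1-pred) counts the samples where both are 0. A small sketch on made-up 0/1 arrays shows that this gives the same number as comparing the arrays element-wise:

```python
import numpy as np

# Hypothetical labels and predictions, both encoded as 0/1.
Y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Agreements on class 1 plus agreements on class 0, as a percentage.
acc_dot = (np.dot(Y, pred) + np.dot(1 - Y, 1 - pred)) / Y.size * 100

# Equivalent: fraction of positions where the two arrays match.
acc_mean = np.mean(Y == pred) * 100

print(acc_dot, acc_mean)
```

The element-wise form also works when the labels are not encoded as 0 and 1, which the dot-product form does not.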

Related

Why is my neural network only getting to a certain accuracy?

Sorry for not being very specific in the question, but I've been trying to create a neural network all on my own for a couple months now, and could use some help. This is a basic one made to recognize numbers from the MNIST data set, and it's mostly based on code from here and here. When I run it now, after some experimenting with the amount of iterations and the learning rate, I can get it to ~30% accuracy, which is a lot better than my failed experiments from before but definitely hardly as good as it can be (even if I do 40,000 iterations, it seems to end up almost always guessing 1 for some reason). Here's the code, it's got a couple quirks and could be optimized a lot but I just wanted to be completely able to see what's happening and fully understand it.
#Importing some random libraries idk
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from mnist import MNIST
#Sigmoid is used as the activation function
def sigmoid(x):
    return 1/(1 + np.exp(-x))
#Derivative of the sigmoid function
def dsigmoid(x):
    return sigmoid(x)*(1.0 - sigmoid(x))
def convToBinary(ogArray, newArray):
    for i in range(len(ogArray)):
        newArray[i][ogArray[i]] = 1
print("Very beginning")
#size is 784-16-16-10
# 0-1 - 2-3 < layer names
# 0 1 2 < interlayer names
#this means that something like outputs[2] is actually the activition for layer 3
class NeuralNetwork(object):
    def __init__ (self):
        #weights
        #dimensions are opposite usual notation, (in, out) or (layer size, next layer size)
        self.weights = ([np.random.randn(784, 16), np.random.randn(16, 16), np.random.randn(16, 10)])
        self.wDerivs = ([np.zeros((784, 16)), np.zeros((16,16)), np.zeros((16, 10))])
        #biases
        self.biases = ([np.ones((16, 1)), np.ones((16, 1)), np.ones((10, 1))])
        self.bDerivs = ([np.zeros((16, 1)), np.zeros((16, 1)), np.zeros((10, 1))])
        #outputs
        self.zoutputs = ([np.ones((16, 1)), np.ones((16, 1)), np.ones((10, 1))])
        self.outputs = ([np.ones((16, 1)), np.ones((16, 1)), np.ones((10, 1))])
        self.aDerivs = ([np.ones((16, 1)), np.ones((16, 1)), np.ones((10, 1))])
        self.cost = np.ones((10, 1))
    def forwardPropagate(self, input):
        last = input
        for i in range(3):
            self.zoutputs[i] = np.add(np.dot(np.transpose(self.weights[i]), last), self.biases[i])
            self.outputs[i] = sigmoid(self.zoutputs[i])
            last = self.outputs[i]
    def backPropagate(self, input, y, lr):
        #deltas, dC/da
        self.aDerivs[2] = (self.outputs[2] - y) * dsigmoid(self.zoutputs[2])
        self.aDerivs[1] = np.dot(self.weights[2], self.aDerivs[2]) * dsigmoid(self.zoutputs[1])
        self.aDerivs[0] = np.dot(self.weights[1], self.aDerivs[1]) * dsigmoid(self.zoutputs[0])
        #biases, dC/db
        self.bDerivs[2] = self.aDerivs[2]
        self.bDerivs[1] = self.aDerivs[1]
        self.bDerivs[0] = self.aDerivs[0]
        #weights, dC/dw
        self.wDerivs[2] = np.dot(self.outputs[1], np.transpose(self.aDerivs[2]))
        self.wDerivs[1] = np.dot(self.outputs[0], np.transpose(self.aDerivs[1]))
        self.wDerivs[0] = np.dot(input, np.transpose(self.aDerivs[0]))
        #doing the adjusting
        for i in range(len(self.biases)):
            self.biases[i] = self.biases[i] - (self.bDerivs[i] * lr)
        for i in range(len(self.weights)):
            self.weights[i] = self.weights[i] - (self.wDerivs[i] * lr)
    def findCost(self, y):
        for i in range(10):
            self.cost[i] = -(y[i]*np.log(self.outputs[2][i]) + (1-y[i])*np.log(1 - self.outputs[2][i]))
        aM = 0
        for i in self.cost:
            aM += i
        #find average cost for each one, idk if I'm using the right cost function here but it gets the job done
        return aM / 10
    def findAnswer(self):
        #finding highest value in output layer to guess the number
        bestNum = 0
        bestVal = 0
        for i in range(10):
            if (self.outputs[2][i] > bestVal):
                bestNum = i
                bestVal = self.outputs[2][i]
        return bestNum
    def doTheThing(self, X, oldY, Y, iter, lr):
        #this function does all of the other functions
        n_c = 0
        for i in range(iter):
            c = 0
            x = X[i].reshape(784, 1)
            oldy = oldY[i]
            y = Y[i].reshape(10, 1)
            self.forwardPropagate(x)
            c = self.findCost(y)
            self.backPropagate(x, y, lr)
            print("It is " + str(oldy))
            print("It predicted " + str(self.findAnswer()))
            if (oldy == self.findAnswer()):
                if (i > (iter * 4) / 5):
                    n_c += 1
            print("Iteration: " + str(i))
            print("Cost: " + str(c))
            #I didn't really separate training and testing, so I just found percent right based on last 1/5
            if (((i - (iter*0.8) + 1)) != 0):
                print("Right: " + str(n_c / ((i - (iter*0.8) + 1))))
#import
mndata = MNIST('Number_Samples')
iTest, lTest = mndata.load_training()
newITest = np.array(iTest)
newLTest = np.zeros((len(lTest), 10))
#putting the expected result in a form that can be compared to the output layer
convToBinary(lTest, newLTest)
nn = NeuralNetwork()
nn.doTheThing(newITest, lTest, newLTest, 40000, 0.1)
I've tried debugging but I've had no luck, there's probably some major flaw that I'm just not seeing. I would greatly appreciate it if someone with a lot more experience than me were to at least point in the right direction, because right now I have no clue what I'm doing wrong.

Polynomial Regression without scikitlearn

I tried doing a polynomial regression. However, for any value of n other than 3, the error increases significantly and the x vs y_hat plot actually starts going downwards.
The logs have been taken to get rid of the outliers.
import random
import numpy as np
import matplotlib.pyplot as plt
import math
x = np.array([math.log10(1), math.log10(9), math.log10(22), math.log10(24), math.log10(25), math.log10(26), math.log10(27), math.log10(28), math.log10(29), math.log10(30), math.log10(31), math.log10(32), math.log10(33), math.log10(34), math.log10(35)])
y = np.array([math.log10(8), math.log10(9), math.log10(51), math.log10(115), math.log10(164), math.log10(209),math.log10(278), math.log10(321), math.log10(382),math.log10(456), math.log10(596), math.log10(798),math.log10(1140), math.log10(1174), math.log10(1543)])
c = random.random()
plt.scatter(x, y)
n = 3
m=[]
x_real = []
alpha = 0.0001
y_hat = []
for i in range(1, n+1):
    x_real.append(x**i)
    m.append(random.random())
x_real = np.array(x_real)
m = np.array(m)
x_real = np.transpose(x_real)
y_hat = np.matmul(x_real, m)+c
error = 0.5*(np.sum((y-y_hat)**2))
print(error)
sum = np.sum(y_hat-y)
for epochs in range(101):
    for items in range(n):
        m[items] = m[items] - (alpha*(sum*x[items]))
    c = c - (alpha*sum)
    y_hat = (np.matmul(x_real, m))+c
    error = 0.5*(np.sum((y-y_hat)**2))
    print(error)
plt.plot(x, y_hat)
You need to update the value of sum in each epoch:
prev = 0
for epochs in range(101):
    sum = np.sum(y_hat-y)
    for items in range(n):
        m[items] = m[items] - (alpha*(sum*x[items]))
    c = c - (alpha*sum)
    y_hat = (np.matmul(x_real, m))+c
    error = 0.5*(np.sum((y-y_hat)**2))
    if error == prev:
        break
    prev = error  # remember this epoch's error for the next comparison
    print(error)
plt.plot(x, y_hat)
Just a small error, I assume!
You can also break out of the epoch loop once successive errors are too close or, as in your case, when they are equal for successive epochs.
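As a sketch of the "stop when successive errors are too close" idea, the loop below uses a tolerance instead of an exact equality test. Note that it swaps in the standard squared-error gradient, X^T (y_hat - y), with one term per coefficient, which is not the update used in the snippets above. The function name, the tolerance and the step size are assumptions chosen for this data set.

```python
import numpy as np


def fit_polynomial_gd(x, y, degree=3, alpha=0.001, max_epochs=20000, tol=1e-10):
    """Batch gradient descent for y ~ c + m[0]*x + m[1]*x**2 + ... (a sketch)."""
    X = np.column_stack([x**i for i in range(1, degree + 1)])
    m = np.zeros(degree)
    c = 0.0
    prev_error = np.inf
    errors = []
    for _ in range(max_epochs):
        residual = X @ m + c - y              # recomputed every epoch
        error = 0.5 * np.sum(residual**2)
        errors.append(error)
        if abs(prev_error - error) < tol:     # errors are "too close": stop early
            break
        prev_error = error
        m -= alpha * (X.T @ residual)         # gradient w.r.t. each coefficient
        c -= alpha * np.sum(residual)         # gradient w.r.t. the intercept
    return m, c, errors


x = np.log10(np.array([1, 9, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], dtype=float))
y = np.log10(np.array([8, 9, 51, 115, 164, 209, 278, 321, 382, 456, 596, 798, 1140, 1174, 1543], dtype=float))
m, c, errors = fit_polynomial_gd(x, y)
print(errors[0], errors[-1], len(errors))
```

With a step size this small the error should shrink every epoch; a larger alpha can make it grow, so the step size needs retuning if the degree or the data change.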

Something wrong with Sigmoid curve for Logistic Regression

I'm trying to use logistic regression on the popularity of hit songs on Spotify from 2010-2019 based on their durations and durability, with the data read from a .csv file. Since the popularity value of each song is numerical, I have converted it to a binary value, 0 or 1: if the popularity of a hit song is 70 or less, I replace it with 0, and if it is more than 70, I replace it with 1. The rest of my code is a fairly standard implementation using a sigmoid function, yet for some reason the end result is a straight line instead of a sigmoid curve.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('top10s [SubtitleTools.com] (2).csv')
BPM = df.bpm
BPM = np.array(BPM)
Energy = df.nrgy
Energy = np.array(Energy)
Dance = df.dnce
Dance = np.array(Dance)
dB = df.dB
dB = np.array(dB)
Live = df.live
Live = np.array(Live)
Valence = df.val
Valence = np.array(Valence)
Acous = df.acous
Acous = np.array(Acous)
Speech = df.spch
Speech = np.array(Speech)
df.loc[df['popu'] <= 70, 'popu'] = 0
df.loc[df['popu'] > 70, 'popu'] = 1
def Logistic_Regression(X, y, iterations, alpha):
    ones = np.ones((X.shape[0], ))
    X = np.vstack((ones, X))
    X = X.T
    b = np.zeros(X.shape[1])
    for i in range(iterations):
        z = np.dot(X, b)
        p_hat = sigmoid(z)
        gradient = np.dot(X.T, (y - p_hat))
        b = b + alpha * gradient
        if (i % 1000 == 0):
            print('LL, i ', log_likelihood(X, y, b), i)
    return b
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def log_likelihood(X, y, b):
    z = np.dot(X, b)
    LL = np.sum(y*z - np.log(1 + np.exp(z)))
    return LL
def LR1():
    Dur = df.dur
    Dur = np.array(Dur)
    Pop = df.popu
    Pop = [int(i) for i in Pop]; Pop = np.array(Pop)
    plt.figure(figsize=(10,8))
    colormap = np.array(['r', 'b'])
    plt.scatter(Dur, Pop, c = colormap[Pop], alpha = .4)
    b = Logistic_Regression(Dur, Pop, iterations = 8000, alpha = 0.00005)
    print('Done')
    p_hat = sigmoid(np.dot(Dur, b[1]) + b[0])
    idxDur = np.argsort(Dur)
    plt.plot(Dur[idxDur], p_hat[idxDur])
    plt.show()
LR1()
df
Your logreg params aren't coming out correctly, so something is wrong in your gradient descent.
If I do
from sklearn.linear_model import LogisticRegression
df = pd.DataFrame({'popu':[0,1,0,1,1,0,0,1,0,0],'dur':[217,283,200,295,221,176,206,260,217,213]})
# Dur and Pop as in the question's LR1(), taken from the small frame above
Dur = df.dur.values
Pop = df.popu.values
logreg = LogisticRegression()
logreg.fit(Dur.reshape([10,1]),Pop.reshape([10,1]))
print(logreg.coef_)
print(logreg.intercept_)
I get [0.86473507, -189.79655798]
whereas your params (b) come out as [0.012136874150412973 -0.2430389407767768] for this data.
Plot of your vs scikit logregs here
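One thing that may be worth trying, offered as an assumption and not a confirmed diagnosis: with raw durations in the hundreds and a very small learning rate, the coefficients can stay so small that the sigmoid is evaluated only over a narrow, nearly linear range. The sketch below standardizes the duration before running gradient ascent on the log-likelihood, using the ten sample rows from this answer. The function names, the learning rate and the iteration count are hypothetical choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(X, y, b):
    z = X @ b
    # logaddexp(0, z) == log(1 + exp(z)), computed without overflow
    return np.sum(y * z - np.logaddexp(0.0, z))


def fit_logistic(x, y, iterations=2000, alpha=0.5):
    """Gradient ascent on the log-likelihood with a standardized feature."""
    mu, sd = x.mean(), x.std()
    X = np.column_stack([np.ones_like(x), (x - mu) / sd])
    b = np.zeros(2)
    history = []
    for _ in range(iterations):
        p_hat = sigmoid(X @ b)
        b = b + alpha * (X.T @ (y - p_hat)) / len(y)   # averaged gradient
        history.append(log_likelihood(X, y, b))
    return b, mu, sd, history


dur = np.array([217, 283, 200, 295, 221, 176, 206, 260, 217, 213], dtype=float)
pop = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 0], dtype=float)
b, mu, sd, history = fit_logistic(dur, pop)

# Evaluate the fitted curve on a grid, applying the same scaling as in training.
grid = np.linspace(dur.min(), dur.max(), 50)
p_grid = sigmoid(b[0] + b[1] * (grid - mu) / sd)
print(b, p_grid[0], p_grid[-1])
```

These ten rows happen to be perfectly separable by duration, so without regularization the coefficients keep growing as the iterations increase. Do not expect them to match scikit-learn's numbers, since LogisticRegression applies L2 regularization by default.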

Implementing simple probabilistic model with negative log likelihood loss

First, a quick disclaimer: I originally posted this question on Reddit, in Deep Learning and Learning Machine Learning, but I thought I might also request your expertise here. Without further ado:
I am currently challenging myself with this year's Deep Unsupervised Learning course at Berkeley University, and although I have only just started the warmup exercise of week 1, I am already having 'technical' difficulties.
The exercise in question is "1. Warmup" in the following document: Week 1 Exercises. (My apologies, as I am not familiar enough with Reddit formatting to seamlessly include images.)
In my understanding, we have a variable x which can take values from 1..100, each with a specific probability of being sampled (defined in the sample_data() function).
The task is therefore to fit a vector of parameters theta which is passed to a softmax function, and is supposed to give the likelihood of a specific element x_i being sampled. Namely, theta_1 should be the parameter which "bumps up" the soft-max value corresponding to the variable x = 1, and so on.
Using Tensorflow, I think I was able to create such a model, but when it comes to training, I believe I am missing a crucial point, as the program cannot compute gradients with respect to the theta parameters.
I would like to know whether I am misunderstanding the task, and whether there is a better method to achieve the result of the exercise.
Here is the code; the failing part starts at the # Computing gradients comment.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
if __name__ == "__main__":
    # Sampling function of the x variable provided in the exercise
    def sample_data():
        count = 10000
        rand = np.random.RandomState(0)
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))
    full_data = sample_data()
    train_ds = full_data[:int(.8*len( full_data))]
    val_ds = full_data[int(.8*len( full_data)):]
    # Declaring parameters theta
    w_init = tf.zeros_initializer()
    params = tf.Variable(
        initial_value=w_init(shape=(1, 100),
        dtype='float32'), trainable=True, name='params')
    softmax = tf.squeeze( tf.nn.softmax( params, axis=1))
    #Should materialize the loss of the model
    def get_neg_log_likelihood( inputs):
        return - tf.math.log( softmax)
    neg_log_likelihoods = get_neg_log_likelihood( softmax)
    dist = tfp.distributions.Categorical( probs=softmax, dtype=tf.int32)
    optimizer = tf.keras.optimizers.Adam()
    for epoch in range( 100):
        minibatch_size = 200
        n_minibatches = len( train_ds) // minibatch_size
        # Running over minibatches of the data
        for minibatch in range( n_minibatches):
            # Minibatching
            start_index = (minibatch*minibatch_size)
            end_index = (minibatch_size*minibatch + minibatch_size)
            x = train_ds[start_index:end_index]
            with tf.GradientTape() as tape:
                tape.watch( params)
                loss = tf.reduce_mean( - dist.log_prob( x))
            # Computing gradients
            grads = tape.gradient( loss, params)
            print( grads) # Result: None
            # input()
            optimizer.apply_gradients( zip( grads, params))
Thank you in advance for your time.
PS: I mainly have a background in Deep Reinforcement Learning, therefore I can understand the various models used there ( policy, value functions ...), but I am trying to refine my grasp over the internals of the models themselves, namely in generative probabilistic models (GAN, VAE) and other unsupervised learning models in general ( RealNVP, Norm Flows, ...)
Pretty sure nobody is gonna see this, but I thought I might as well bring some closure to this.
First of all, I calculated the gradients by directly deriving their expression from the negative log likelihood of the soft-max value, dropping the Tensorflow framework in the process.
Although the results are a little below my expectations, the program was able to fit the model to a distribution somewhat similar to the empirical distribution of the sampled data. I guess this is due to the fact that a single 1-dimensional theta parameter vector is not enough to fully model the real data distribution, as well as to the finite amount of sampled data.
An updated version of the code:
import numpy as np
from matplotlib import pyplot as plt
np.random.seed( 42)
def softmax(X, theta = 1.0, axis = None):
    # Shameful copy paste from SO
    y = np.atleast_2d(X)
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
    y = y * float(theta)
    y = y - np.expand_dims(np.max(y, axis = axis), axis)
    y = np.exp(y)
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
    p = y / ax_sum
    if len(X.shape) == 1: p = p.flatten()
    return p
if __name__ == "__main__":
    def sample_data():
        count = 10000
        rand = np.random.RandomState(0)
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))
    full_data = sample_data()
    train_ds = full_data[:int(.8*len( full_data))]
    val_ds = full_data[int(.8*len( full_data)):]
    # Declaring parameters
    params = np.zeros(100)
    # Use for loss computation
    def get_neg_log_likelihood( softmax):
        return - np.log( softmax)
    def get_loss( params, x):
        return np.mean( [get_neg_log_likelihood( softmax( params))[i-1] for i in x])
    lr = .0005
    for epoch in range( 1000):
        # Shuffling training data
        np.random.shuffle( train_ds)
        minibatch_size = 100
        n_minibatches = len( train_ds) // minibatch_size
        # Running over minibatches of the data
        for minibatch in range( n_minibatches):
            smax = softmax( params)
            # Jacobian of neg log likelihood
            jacobian = [[ smax[j] - 1 if i == j else
                          smax[j] for j in range(100)] for i in range(100)]
            # Minibatching
            start_index = (minibatch*minibatch_size)
            end_index = (minibatch_size*minibatch + minibatch_size)
            x = train_ds[start_index:end_index]
            # Compute the gradient matrix for each sample data and sum over it
            # (labels in x are 1-based, jacobian rows are 0-based, as in get_loss)
            grad_matrix = np.vstack( [jacobian[i - 1] for i in x])
            grads = np.sum( grad_matrix, axis=0)
            params -= lr * grads
        print( "Epoch %d -- Train loss: %.4f , Val loss: %.4f" %(epoch, get_loss( params, train_ds), get_loss( params, val_ds)))
        # Plotting each ~100 epochs
        if epoch % 100 == 0:
            counters = { i+1: 0 for i in range(100)}
            for x in full_data:
                counters[x]+= 1
            histogram = np.array( [ counters[i+1] / len( full_data) for i in range( 100)])
            fsmax = softmax( params)
            fig, ax = plt.subplots()
            ax.set_title('Dist. Comp. after %d epochs of training (from scratch)' % epoch)
            x = np.arange( 1,101)
            width = 0.35
            rects1 = ax.bar(x - width/2, fsmax, width, label='Model')
            rects2 = ax.bar(x + width/2, histogram, width, label='Empirical')
            ax.set_ylabel('Likelihood')
            ax.set_xlabel("Variable x's values")
            ax.legend()
            def autolabel(rects):
                for rect in rects:
                    height = rect.get_height()
            autolabel(rects1)
            autolabel(rects2)
            fig.tight_layout()
            plt.savefig( 'plots/results_after_%d_epochs.png' % epoch)
Picture of the final model distribution included for completeness. Modeled vs Empirical Distribution
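For reference, the gradient of the mean negative log likelihood of a softmax model has a compact closed form: softmax(theta) minus the empirical class frequencies of the batch. The sketch below checks that formula against finite differences on a small made-up problem. The helper names are hypothetical, and labels are assumed to be 1-based, as np.digitize returns them in the exercise.

```python
import numpy as np


def softmax(theta):
    shifted = theta - np.max(theta)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()


def mean_nll(theta, x):
    # x holds 1-based class labels, so class k lives at index k - 1.
    return -np.mean(np.log(softmax(theta))[x - 1])


def mean_nll_grad(theta, x):
    # d/dtheta of the mean NLL: softmax(theta) - empirical frequencies.
    freq = np.bincount(x - 1, minlength=theta.size) / len(x)
    return softmax(theta) - freq


rng = np.random.RandomState(0)
n_classes = 5
theta = rng.randn(n_classes)
x = rng.randint(1, n_classes + 1, size=200)   # hypothetical 1-based labels

analytic = mean_nll_grad(theta, x)

# Central finite differences, one coordinate at a time.
eps = 1e-6
basis = np.eye(n_classes)
numeric = np.array([
    (mean_nll(theta + eps * basis[k], x) - mean_nll(theta - eps * basis[k], x)) / (2 * eps)
    for k in range(n_classes)
])
print(np.max(np.abs(analytic - numeric)))
```

Because the gradient depends on the batch only through its label counts, the per-sample Jacobian rows do not need to be stacked; a single np.bincount per minibatch is enough.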

Why doesn't my custom made linear regression model match sklearn?

I'm attempting to create a simple linear model with Python using no libraries (other than numpy). Here's what I have
import numpy as np
import pandas
np.random.seed(1)
alpha = 0.1
def h(x, w):
    return np.dot(w.T, x)
def cost(X, W, Y):
    totalCost = 0
    for i in range(47):
        diff = h(X[i], W) - Y[i]
        squared = diff * diff
        totalCost += squared
    return totalCost / 2
housing_data = np.loadtxt('Housing.csv', delimiter=',')
x1 = housing_data[:,0]
x2 = housing_data[:,1]
y = housing_data[:,2]
avgX1 = np.mean(x1)
stdX1 = np.std(x1)
normX1 = (x1 - avgX1) / stdX1
print('avgX1', avgX1)
print('stdX1', stdX1)
avgX2 = np.mean(x2)
stdX2 = np.std(x2)
normX2 = (x2 - avgX2) / stdX2
print('avgX2', avgX2)
print('stdX2', stdX2)
normalizedX = np.ones((47, 3))
normalizedX[:,1] = normX1
normalizedX[:,2] = normX2
np.savetxt('normalizedX.csv', normalizedX)
weights = np.ones((3,))
for boom in range(100):
    currentCost = cost(normalizedX, weights, y)
    if boom % 1 == 0:
        print(boom, 'iteration', weights[0], weights[1], weights[2])
        print('Cost', currentCost)
    for i in range(47):
        errorDiff = h(normalizedX[i], weights) - y[i]
        weights[0] = weights[0] - alpha * (errorDiff) * normalizedX[i][0]
        weights[1] = weights[1] - alpha * (errorDiff) * normalizedX[i][1]
        weights[2] = weights[2] - alpha * (errorDiff) * normalizedX[i][2]
print(weights)
predictedX = [1, (2100 - avgX1) / stdX1, (3 - avgX2) / stdX2]
firstPrediction = np.array(predictedX)
print('firstPrediction', firstPrediction)
firstPrediction = h(firstPrediction, weights)
print(firstPrediction)
First, it converges VERY quickly. After only 14 iterations. Second, it gives me a different result than a linear regression with sklearn. For reference, my sklearn code is:
import numpy
import matplotlib.pyplot as plot
import pandas
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
dataset = pandas.read_csv('Housing.csv', header=None)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 2].values
linearRegressor = LinearRegression()
xnorm = sklearn.preprocessing.scale(x)
scaleCoef = sklearn.preprocessing.StandardScaler().fit(x)
mean = scaleCoef.mean_
std = numpy.sqrt(scaleCoef.var_)
print('stf')
print(std)
stuff = linearRegressor.fit(xnorm, y)
predictedX = [[(2100 - mean[0]) / std[0], (3 - mean[1]) / std[1]]]
yPrediction = linearRegressor.predict(predictedX)
print('predictedX', predictedX)
print('predict', yPrediction)
print(stuff.coef_, stuff.intercept_)
My custom model predicts 337,000 for the value of y and sklearn predicts 355,000. My data is 47 rows that look like
2104,3,3.999e+05
1600,3,3.299e+05
2400,3,3.69e+05
1416,2,2.32e+05
3000,4,5.399e+05
1985,4,2.999e+05
1534,3,3.149e+05
Complete data available at https://github.com/shamoons/linear-logistic-regression/blob/master/Housing.csv
I assume either (a) my regression with gradient descent is somehow wrong or (b) I'm not using sklearn properly.
Any other reasons why the 2 wouldn't predict the same output for a given input?
I think you are missing the 1/m term (where m is the size of y) in the gradient descent. After including the 1/m term, I get a predicted value similar to that of your sklearn code.
See below:
....
weights = np.ones((3,))
m = y.size
for boom in range(100):
    currentCost = cost(normalizedX, weights, y)
    if boom % 1 == 0:
        print(boom, 'iteration', weights[0], weights[1], weights[2])
        print('Cost', currentCost)
    for i in range(47):
        errorDiff = h(normalizedX[i], weights) - y[i]
        weights[0] = weights[0] - alpha *(1/m)* (errorDiff) * normalizedX[i][0]
        weights[1] = weights[1] - alpha *(1/m)* (errorDiff) * normalizedX[i][1]
        weights[2] = weights[2] - alpha *(1/m)* (errorDiff) * normalizedX[i][2]
...
This gives a firstPrediction of 355242.
This agrees well with the LinearRegression model, even though that model does not use gradient descent.
I also tried SGDRegressor (which uses stochastic gradient descent) in sklearn, and it too seems to get a value close to both the LinearRegression model and your model. See the code below:
import numpy
import matplotlib.pyplot as plot
import pandas
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor
dataset = pandas.read_csv('Housing.csv', header=None)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 2].values
sgdRegressor = SGDRegressor(penalty='none', learning_rate='constant', eta0=0.1, max_iter=1000, tol = 1E-6)
xnorm = sklearn.preprocessing.scale(x)
scaleCoef = sklearn.preprocessing.StandardScaler().fit(x)
mean = scaleCoef.mean_
std = numpy.sqrt(scaleCoef.var_)
print('stf')
print(std)
yPrediction = []
predictedX = [[(2100 - mean[0]) / std[0], (3 - mean[1]) / std[1]]]
print('predictedX', predictedX)
for trials in range(10):
    stuff = sgdRegressor.fit(xnorm, y)
    yPrediction.extend(sgdRegressor.predict(predictedX))
# numpy is imported under its full name in this snippet, so use numpy.mean
print('predict', numpy.mean(yPrediction))
results in
predict 355533.10119985335
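To see why the 1/m factor matters, here is a small sketch of batch gradient descent, which averages the gradient over all rows before each update. That is a different scheme from the per-row updates in the snippet above. It uses synthetic data and not Housing.csv, so the generated sizes, bedroom counts and prices are entirely made up. The result is compared against the closed-form least-squares solution from np.linalg.lstsq.

```python
import numpy as np


def batch_gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Minimize the mean squared error with full-batch updates."""
    m = y.size
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        gradient = X.T @ (X @ w - y) / m   # the 1/m keeps the step size independent of m
        w -= alpha * gradient
    return w


# Synthetic stand-in for the housing data (47 rows: size, bedrooms, price).
rng = np.random.RandomState(1)
raw = rng.rand(47, 2) * np.array([3000.0, 4.0]) + np.array([800.0, 1.0])
y = 90000.0 + 140.0 * raw[:, 0] - 9000.0 * raw[:, 1] + rng.randn(47) * 1000.0

# Standardize the features and prepend a column of ones for the intercept.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X = np.column_stack([np.ones(47), scaled])

w_gd = batch_gradient_descent(X, y)
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_gd)
print(w_exact)
```

With standardized features and the averaged gradient, alpha = 0.1 converges comfortably here. Dropping the 1/m multiplies the effective step by 47, which is usually enough to overshoot.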
