Implementing a simple probabilistic model with negative log likelihood loss - Python

First, a quick disclaimer: I posted this question on Reddit first, in the Deep Learning and Learn Machine Learning subreddits, but I thought I might also request your expertise here. Without further ado:
I am currently challenging myself with this year's Deep Unsupervised Learning course from UC Berkeley, and although I have only just started the warm-up exercise of week 1, I am already having 'technical' difficulties.
The exercise in question is "1. Warmup" in the following document: Week 1 Exercises. (My apologies, as I am not familiar enough with Reddit formatting to seamlessly include images.)
In my understanding, we have a variable x which can take values in 1..100, each with a specific probability of being sampled (defined in the sample_data() function).
The task is therefore to fit a vector of parameters theta which, passed through a softmax function, is supposed to give the likelihood of a specific element x_i being sampled. Namely, theta_1 should be the parameter which "bumps up" the softmax value corresponding to the variable x = 1, and so on.
Using TensorFlow, I think I was able to create such a model, but when it comes to training, I believe I am missing a crucial point, as the program cannot compute gradients with respect to the theta parameters.
I would like to know whether I am misunderstanding the task, and if there is a better method to achieve the result of the exercise.
Here is the code; the failing part starts at the # Computing gradients comment.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

if __name__ == "__main__":
    # Sampling function of the x variable provided in the exercise
    def sample_data():
        count = 10000
        rand = np.random.RandomState(0)
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))

    full_data = sample_data()
    train_ds = full_data[:int(.8 * len(full_data))]
    val_ds = full_data[int(.8 * len(full_data)):]

    # Declaring parameters theta
    w_init = tf.zeros_initializer()
    params = tf.Variable(
        initial_value=w_init(shape=(1, 100), dtype='float32'),
        trainable=True, name='params')

    softmax = tf.squeeze(tf.nn.softmax(params, axis=1))

    # Should materialize the loss of the model
    def get_neg_log_likelihood(inputs):
        return -tf.math.log(softmax)

    neg_log_likelihoods = get_neg_log_likelihood(softmax)

    dist = tfp.distributions.Categorical(probs=softmax, dtype=tf.int32)

    optimizer = tf.keras.optimizers.Adam()

    for epoch in range(100):
        minibatch_size = 200
        n_minibatches = len(train_ds) // minibatch_size

        # Running over minibatches of the data
        for minibatch in range(n_minibatches):
            # Minibatching
            start_index = minibatch * minibatch_size
            end_index = minibatch * minibatch_size + minibatch_size
            x = train_ds[start_index:end_index]

            with tf.GradientTape() as tape:
                tape.watch(params)
                loss = tf.reduce_mean(-dist.log_prob(x))

            # Computing gradients
            grads = tape.gradient(loss, params)
            print(grads)  # Result: None
            # input()
            optimizer.apply_gradients(zip(grads, params))
Thank you in advance for your time.
PS: I mainly have a background in Deep Reinforcement Learning, so I understand the various models used there (policy, value functions, ...), but I am trying to refine my grasp of the internals of the models themselves, namely generative probabilistic models (GAN, VAE) and other unsupervised learning models in general (RealNVP, normalizing flows, ...).

Pretty sure nobody is gonna see this, but I thought I might as well bring some closure to this.
First of all, I calculated the gradients by directly deriving their expression from the negative log likelihood of the softmax value, dropping the TensorFlow framework in the process.
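Concretely, for a categorical model parameterized by a softmax, writing $p = \mathrm{softmax}(\theta)$, the gradient of the negative log likelihood of an observed bucket $i$ is

$$\frac{\partial}{\partial \theta_j}\bigl(-\log p_i\bigr) = p_j - \mathbf{1}[i = j],$$

which is what the jacobian list comprehension in the code below implements.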
Although the results are a little below my expectations, the program was able to fit the model to a distribution somewhat similar to the empirical distribution of the sampled data. I guess this is due to the fact that a single theta parameter vector is not enough to fully model the real data distribution, as well as to the finite amount of sampled data.
An updated version of the code:
import numpy as np
from matplotlib import pyplot as plt

np.random.seed(42)

def softmax(X, theta=1.0, axis=None):
    # Shameful copy-paste from SO
    y = np.atleast_2d(X)
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
    y = y * float(theta)
    y = y - np.expand_dims(np.max(y, axis=axis), axis)
    y = np.exp(y)
    ax_sum = np.expand_dims(np.sum(y, axis=axis), axis)
    p = y / ax_sum
    if len(X.shape) == 1:
        p = p.flatten()
    return p

if __name__ == "__main__":
    def sample_data():
        count = 10000
        rand = np.random.RandomState(0)
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))

    full_data = sample_data()
    train_ds = full_data[:int(.8 * len(full_data))]
    val_ds = full_data[int(.8 * len(full_data)):]

    # Declaring parameters
    params = np.zeros(100)

    # Used for loss computation
    def get_neg_log_likelihood(softmax):
        return -np.log(softmax)

    def get_loss(params, x):
        return np.mean([get_neg_log_likelihood(softmax(params))[i - 1] for i in x])

    lr = .0005

    for epoch in range(1000):
        # Shuffling training data
        np.random.shuffle(train_ds)

        minibatch_size = 100
        n_minibatches = len(train_ds) // minibatch_size

        # Running over minibatches of the data
        for minibatch in range(n_minibatches):
            smax = softmax(params)
            # Jacobian of the negative log likelihood: row i is the gradient of -log(softmax(params))[i]
            jacobian = [[smax[j] - 1 if i == j else smax[j]
                         for j in range(100)] for i in range(100)]

            # Minibatching
            start_index = minibatch * minibatch_size
            end_index = minibatch * minibatch_size + minibatch_size
            x = train_ds[start_index:end_index]

            # Gradient row for each observed bucket (buckets run 1..100, jacobian rows 0..99,
            # hence the -1, consistent with get_loss), then summed over the minibatch
            grad_matrix = np.vstack([jacobian[i - 1] for i in x])
            grads = np.sum(grad_matrix, axis=0)

            params -= lr * grads

        print("Epoch %d -- Train loss: %.4f , Val loss: %.4f" %
              (epoch, get_loss(params, train_ds), get_loss(params, val_ds)))

        # Plotting every ~100 epochs
        if epoch % 100 == 0:
            counters = {i + 1: 0 for i in range(100)}
            for x in full_data:
                counters[x] += 1
            histogram = np.array([counters[i + 1] / len(full_data) for i in range(100)])

            fsmax = softmax(params)

            fig, ax = plt.subplots()
            ax.set_title('Dist. Comp. after %d epochs of training (from scratch)' % epoch)
            x = np.arange(1, 101)
            width = 0.35
            rects1 = ax.bar(x - width / 2, fsmax, width, label='Model')
            rects2 = ax.bar(x + width / 2, histogram, width, label='Empirical')
            ax.set_ylabel('Likelihood')
            ax.set_xlabel("Variable x's values")
            ax.legend()

            def autolabel(rects):
                # Stub left over from the matplotlib bar-chart example;
                # it only reads the bar heights and does not annotate anything
                for rect in rects:
                    height = rect.get_height()

            autolabel(rects1)
            autolabel(rects2)

            fig.tight_layout()
            plt.savefig('plots/results_after_%d_epochs.png' % epoch)
Picture of the final model distribution included for completeness: Modeled vs Empirical Distribution.
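In hindsight, the TensorFlow version was most likely returning None gradients because softmax and dist were built once outside the GradientTape, so the tape never recorded any operation involving params. A minimal sketch of how it might be fixed (my assumption, not tested against the course solutions), reusing sample_data/train_ds from above and parameterizing the Categorical directly by its logits inside the tape:

import tensorflow as tf
import tensorflow_probability as tfp

params = tf.Variable(tf.zeros(100), name='params')
optimizer = tf.keras.optimizers.Adam()
minibatch_size = 200
n_minibatches = len(train_ds) // minibatch_size

for epoch in range(100):
    for minibatch in range(n_minibatches):
        batch = train_ds[minibatch * minibatch_size:(minibatch + 1) * minibatch_size]
        # np.digitize returns buckets 1..100, while Categorical expects classes 0..99
        x = tf.constant(batch - 1, dtype=tf.int32)
        with tf.GradientTape() as tape:
            # Built inside the tape, so gradients can flow back to params
            dist = tfp.distributions.Categorical(logits=params)
            loss = tf.reduce_mean(-dist.log_prob(x))
        grads = tape.gradient(loss, [params])
        optimizer.apply_gradients(zip(grads, [params]))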

Related

Pytorch: multiplication between parameters is inplace for LBFGS optimizer?

I am trying to solve a kind of inverse problem by backpropagation with PyTorch. I am trying to recover the parameters (r, theta) that generate a vector field U(r, theta).
As I intended to use the LBFGS optimizer from PyTorch, I realized that the operation
r*theta
is detected as in-place and thus not supported for the backward computation of the gradient, whereas
r+theta is not.
How can I overcome this? I actually need to recover fields that use transformations of the form r*theta.
Here is an example of code that reproduces the error: it runs fine if you change
field = Wrong_U_param(r, theta, positions)
to
field = U_param(r, theta, positions)
in the loop. It also works if you replace the r*theta operation with r.item()*theta (but then it does not optimize over r, since there is no longer a gradient depending on r).
I tried to use torch.mul() to run the product but it also fails.
The error message is the following
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
and the automatic detection points towards this very product.
Thank you for your help !
import numpy as np
import torch
import torch.optim as optim
from geomloss import SamplesLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.autograd.set_detect_anomaly(True)

def model(field):
    return field

def U_param(r, theta, pos):
    result = r + theta + 0. * pos
    return result

def Wrong_U_param(r, theta, pos):
    result = r * theta + 0. * pos
    return result

def learn_U_param(Zobs, ngrad, params, r_guess=0., theta_guess=0., lambd=1.):
    Npts = params[0]
    positions = torch.tensor(np.arange(0, 1, 1 / Npts) + 1 / 2 / Npts).reshape((Npts, 1))
    lab = torch.tensor(np.arange(0, Npts))

    r = torch.tensor(float(r_guess)).to(device)
    r.requires_grad = True
    theta = torch.tensor(float(theta_guess)).to(device)
    theta.requires_grad = True

    r_hist = [r.item()]
    theta_hist = [theta.item()]
    loss_hist = []

    optimizer = optim.LBFGS([r, theta])

    for i in range(ngrad):
        field = Wrong_U_param(r, theta, positions)
        Z = model(field)
        Loss = SamplesLoss(loss="sinkhorn", p=2, blur=.05)
        Wass = Loss(lab, Z, positions, lab, Zobs, positions)

        def closure():
            optimizer.zero_grad()
            Wass.backward(retain_graph=True)
            return Wass

        optimizer.step(closure)
        optimizer.zero_grad()

        r_hist.append(r.item())
        theta_hist.append(theta.item())
        loss_hist.append(Wass.item())

    return r_hist, theta_hist, loss_hist

N = 100
r = 2
theta = 2
params = [N]
positions = torch.tensor(np.arange(0, 1, 1 / N) + 1 / 2 / N).reshape((N, 1))
Zobs = U_param(r, theta, positions)
ngrad = 10

print(learn_U_param(Zobs, ngrad, params, r_guess=0.1, theta_guess=0.1, lambd=1.))
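Not a full answer, but the error most likely comes from computing Wass once outside the closure and then calling backward(retain_graph=True) on that same graph: LBFGS evaluates the closure several times per step and updates r and theta in place between evaluations, and multiplication (unlike addition) needs its saved inputs for the backward pass, which is why r + theta survives and r * theta does not. A minimal sketch of the usual LBFGS closure pattern, rebuilding the loss on every evaluation (this replaces the body of the for i in range(ngrad) loop and reuses the names from the question; it is an assumption on my part, not a tested fix):

loss_fn = SamplesLoss(loss="sinkhorn", p=2, blur=.05)
optimizer = optim.LBFGS([r, theta])

for i in range(ngrad):
    def closure():
        optimizer.zero_grad()
        # Fresh forward pass each time LBFGS evaluates the closure,
        # so the graph always refers to the current r and theta
        field = Wrong_U_param(r, theta, positions)
        Z = model(field)
        loss = loss_fn(lab, Z, positions, lab, Zobs, positions)
        loss.backward()  # no retain_graph needed
        return loss

    optimizer.step(closure)
    r_hist.append(r.item())
    theta_hist.append(theta.item())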

Incremental Bayesian updates with multi-dimensional parameters

I am trying to use PyMC3 for a Bayesian model where I would like to repeatedly train my model on new unseen data. I am thinking I would need to update the priors with the posterior of the previously trained model every time I see new data, similar to what is done here: https://docs.pymc.io/notebooks/updating_priors.html. They use the following function, which builds a KDE from the samples and replaces each of the original parameter definitions in the model with a call to from_posterior.
def from_posterior(param, samples):
    smin, smax = np.min(samples), np.max(samples)
    width = smax - smin
    x = np.linspace(smin, smax, 100)
    y = stats.gaussian_kde(samples)(x)
    # what was never sampled should have a small probability but not 0,
    # so we'll extend the domain and use linear approximation of density on it
    x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
    y = np.concatenate([[0], y, [0]])
    return Interpolated(param, x, y)
And here is my original model.
def create_model(batsmen, bowlers, id1, id2, X):
    testval = [[-5, 0, 1, 2, 3.5, 5] for i in range(0, 9)]
    l = [i for i in range(9)]
    model = pm.Model()
    with model:
        delta_1 = pm.Uniform("delta_1", lower=0, upper=1)
        delta_2 = pm.Uniform("delta_2", lower=0, upper=1)
        inv_sigma_sqr = pm.Gamma("sigma^-2", alpha=1.0, beta=1.0)
        inv_tau_sqr = pm.Gamma("tau^-2", alpha=1.0, beta=1.0)
        mu_1 = pm.Normal("mu_1", mu=0, sigma=1/pm.math.sqrt(inv_tau_sqr), shape=len(batsmen))
        mu_2 = pm.Normal("mu_2", mu=0, sigma=1/pm.math.sqrt(inv_tau_sqr), shape=len(bowlers))
        delta = pm.math.ge(l, 3) * delta_1 + pm.math.ge(l, 6) * delta_2
        eta = [pm.Deterministic("eta_" + str(i), delta[i] + mu_1[id1[i]] - mu_2[id2[i]]) for i in range(9)]
        cutpoints = pm.Normal("cutpoints", mu=0, sigma=1/pm.math.sqrt(inv_sigma_sqr),
                              transform=pm.distributions.transforms.ordered, shape=(9, 6),
                              testval=testval)
        X_ = [pm.OrderedLogistic("X_" + str(i), cutpoints=cutpoints[i], eta=eta[i], observed=X[i] - 1)
              for i in range(9)]
    return model
Here, the problem is that some of my parameters, such as mu_1, are multidimensional. This is why I get the following error:
ValueError: points have dimension 1, dataset has dimension 1500
because of the line y = stats.gaussian_kde(samples)(x).
Can someone please help me make this work for multi-dimensional parameters? I don't properly understand what KDE is and how the code computes it.
Thank you in advance!!
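One thing that might work (I have not run it against this exact model) is to keep the KDE univariate: fit a separate 1-D KDE to each component of the vector-valued parameter and build one Interpolated prior per component, stacking them back into a vector inside the new model. A rough sketch; the helper name from_posterior_vector and the per-component names like "mu_1_0" are made up here:

import numpy as np
import pymc3 as pm
from scipy import stats

def from_posterior_vector(param, samples):
    # samples is assumed to have shape (n_draws, n_components), e.g. trace["mu_1"]
    priors = []
    for k in range(samples.shape[1]):
        s = samples[:, k]
        smin, smax = np.min(s), np.max(s)
        width = smax - smin
        x = np.linspace(smin, smax, 100)
        y = stats.gaussian_kde(s)(x)
        # extend the domain as in the original from_posterior
        x = np.concatenate([[x[0] - 3 * width], x, [x[-1] + 3 * width]])
        y = np.concatenate([[0], y, [0]])
        priors.append(pm.Interpolated("%s_%d" % (param, k), x, y))
    return pm.math.stack(priors)

# Used inside the updated model's context, e.g.
# with pm.Model():
#     mu_1 = from_posterior_vector("mu_1", trace["mu_1"])
#     ...  # mu_1[id1[i]] indexing then works as before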

Weights explode in polynomial regression with gradient descent

I'm just starting out learning machine learning and have been trying to fit a polynomial to data generated with a sine curve. I know how to do this in closed form, but I'm trying to get it to work with gradient descent too.
However, my weights explode to crazy heights, even with a very large penalty term. What am I doing wrong?
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from math import pi

N = 10
D = 5
X = np.linspace(0, 100, N)
Y = np.sin(0.1 * X) * 50
X = X.reshape(N, 1)

Xb = np.array([[1] * N]).T
for i in range(1, D):
    Xb = np.concatenate((Xb, X**i), axis=1)

# Randomly initialize the weights
w = np.random.randn(D) / np.sqrt(D)

# Solving in closed form works
#w = np.linalg.solve((Xb.T.dot(Xb)), Xb.T.dot(Y))
#Yhat = Xb.dot(w)

# Gradient descent
learning_rate = 0.0001
for i in range(500):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    w = w - learning_rate * (Xb.T.dot(delta) + 100 * w)

print('Final w: ', w)
plt.scatter(X, Y)
plt.plot(X, Yhat)
plt.show()
Thanks!
When updating theta, you subtract the learning rate times the derivative with respect to theta, divided by the training set size. You also have to divide your penalty term by the training set size. But the main problem is that your learning rate is too large. For future debugging, it is helpful to print the cost to see whether gradient descent is working and whether the learning rate is too small or just right.
Below is the code for a 2nd-degree polynomial which found the optimal thetas (as you can see, the learning rate is really small). I've also added the cost function.
N = 2
D = 2

# Gradient descent
learning_rate = 0.000000000001
for i in range(200):
    Yhat = Xb.dot(w)
    delta = Yhat - Y
    # Print the cost so you can see whether gradient descent is making progress
    print((1/N) * np.sum(np.dot(delta, np.transpose(delta))))
    w = w - learning_rate * np.dot(delta, Xb) * (1/N)
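Another thing that helps here (and avoids such an astronomically small learning rate) is scaling the polynomial features before running gradient descent, since the x**4 column is of order 1e8 while the bias column is 1. A rough sketch, reusing the setup from the question (the scaling by column L1 norms is my choice, not part of the original answer):

import numpy as np

N = 10
D = 5
X = np.linspace(0, 100, N)
Y = np.sin(0.1 * X) * 50
X = X.reshape(N, 1)

# Polynomial design matrix as in the question
Xb = np.ones((N, 1))
for i in range(1, D):
    Xb = np.concatenate((Xb, X**i), axis=1)

# Scale each column so no single power of x dominates the gradient
col_norms = np.abs(Xb).sum(axis=0)
Xb_scaled = Xb / col_norms

w = np.random.randn(D) / np.sqrt(D)
learning_rate = 0.01
penalty = 1.0

for i in range(20000):
    Yhat = Xb_scaled.dot(w)
    delta = Yhat - Y
    # Gradient of the squared error plus L2 penalty, both divided by N
    w = w - learning_rate * (Xb_scaled.T.dot(delta) + penalty * w) / N
    if i % 5000 == 0:
        print('cost:', (delta**2).mean())

# Map the weights back to the original (unscaled) feature space
print('Final w:', w / col_norms)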

How to calculate logistic regression accuracy

I am a complete beginner in machine learning and coding in Python, and I have been tasked with coding logistic regression from scratch to understand what happens under the hood. So far I have coded the hypothesis function, cost function and gradient descent, and then the logistic regression itself. However, when printing the accuracy I get a low output (0.69) which doesn't change with increasing iterations or changing the learning rate. My question is: is there a problem with my Accuracy code below? Any help pointing me in the right direction would be appreciated.
X = data[['radius_mean', 'texture_mean', 'perimeter_mean',
          'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
          'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
          'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
          'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
          'fractal_dimension_se', 'radius_worst', 'texture_worst',
          'perimeter_worst', 'area_worst', 'smoothness_worst',
          'compactness_worst', 'concavity_worst', 'concave points_worst',
          'symmetry_worst', 'fractal_dimension_worst']]
X = np.array(X)
X = min_max_scaler.fit_transform(X)
Y = data["diagnosis"].map({'M':1,'B':0})
Y = np.array(Y)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
X = data["diagnosis"].map(lambda x: float(x))

def Sigmoid(z):
    if z < 0:
        return 1 - 1/(1 + math.exp(z))
    else:
        return 1/(1 + math.exp(-z))

def Hypothesis(theta, x):
    z = 0
    for i in range(len(theta)):
        z += x[i] * theta[i]
    return Sigmoid(z)

def Cost_Function(X, Y, theta, m):
    sumOfErrors = 0
    for i in range(m):
        xi = X[i]
        hi = Hypothesis(theta, xi)
        error = Y[i] * math.log(hi if hi > 0 else 1)
        if Y[i] == 1:
            error = Y[i] * math.log(hi if hi > 0 else 1)
        elif Y[i] == 0:
            error = (1 - Y[i]) * math.log(1 - hi if 1 - hi > 0 else 1)
        sumOfErrors += error
    constant = -1/m
    J = constant * sumOfErrors
    #print ('cost is: ', J )
    return J

def Cost_Function_Derivative(X, Y, theta, j, m, alpha):
    sumErrors = 0
    for i in range(m):
        xi = X[i]
        xij = xi[j]
        hi = Hypothesis(theta, X[i])
        error = (hi - Y[i]) * xij
        sumErrors += error
    m = len(Y)
    constant = float(alpha)/float(m)
    J = constant * sumErrors
    return J

def Gradient_Descent(X, Y, theta, m, alpha):
    new_theta = []
    constant = alpha/m
    for j in range(len(theta)):
        CFDerivative = Cost_Function_Derivative(X, Y, theta, j, m, alpha)
        new_theta_value = theta[j] - CFDerivative
        new_theta.append(new_theta_value)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test, Hypothesis(X, theta))
    for i in range(length):
        prediction = round(Hypothesis(X[i], theta))
        answer = Y[i]
        if prediction == answer.all():
            correct += 1
    my_accuracy = (correct / length) * 100
    print('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X, Y, alpha, theta, num_iters):
    theta = np.zeros(X.shape[1])
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X, Y, theta, m, alpha)
        theta = new_theta
        if x % 100 == 0:
            Cost_Function(X, Y, theta, m)
            print('theta: ', theta)
            print('cost: ', Cost_Function(X, Y, theta, m))
    Accuracy(theta)

initial_theta = [0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
alpha = 0.0001
iterations = 1000
Logistic_Regression(X, Y, alpha, initial_theta, iterations)
This uses data from the Wisconsin breast cancer dataset (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), where I am using 30 features - although changing the features to ones which are known to correlate also doesn't change my accuracy.
Python gives us the scikit-learn library, which makes our work easier; this worked for me:
from sklearn.metrics import accuracy_score

y_pred = log.predict(x_test)
score = accuracy_score(y_test, y_pred)
Accuracy is one of the most intuitive performance measures: it is simply the ratio of correctly predicted observations to the total number of observations. Higher accuracy means the model is performing better.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP = True positives
TN = True negatives
FP = False positives
FN = False negatives
Accuracy is only appropriate when your false positives and false negatives have similar cost. If they don't, a better metric is the F1-score, which is given by
F1-score = 2 * (Precision * Recall) / (Precision + Recall), where
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Read more here:
https://en.wikipedia.org/wiki/Precision_and_recall
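For reference, a small NumPy sketch of these formulas (assuming y_true and y_pred are 0/1 arrays):

import numpy as np

def binary_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1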
The beauty of machine learning in Python is that important modules like scikit-learn are open source, so you can always look at the actual code. The link below points to the scikit-learn metrics source code, which will give you an idea of how scikit-learn calculates the accuracy score when you do
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/metrics
I'm not sure how you arrived at a value of 0.0001 for alpha, but I think it's too low. Using your code with the cancer data shows that cost is decreasing with each iteration -- it's just going glacially.
When I raise it to 0.5, I still get decreasing costs, but at a more reasonable rate. After 1000 iterations it reports:
cost: 0.23668000993020666
And after fixing the Accuracy function I'm getting 92% on the test segment of the data.
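(The fix to Accuracy essentially amounts to evaluating on the held-out split and using len(X_test) for the length; something along these lines, in the style of the original loop:)

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    for i in range(length):
        # Evaluate on the test split rather than the reassigned X/Y
        prediction = round(Hypothesis(theta, X_test[i]))
        if prediction == Y_test[i]:
            correct += 1
    my_accuracy = (correct / length) * 100
    print('LR Accuracy %: ', my_accuracy)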
You have NumPy installed, as shown by X = np.array(X). You should really consider using it for your operations; it will be orders of magnitude faster for jobs like this. Here is a vectorized version that gives results instantly rather than making you wait:
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("cancerdata.csv")
X = df.values[:, 2:-1].astype('float64')
X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

## Add a bias column to the data
X = np.hstack([np.ones((X.shape[0], 1)), X])
X = MinMaxScaler().fit_transform(X)
Y = df["diagnosis"].map({'M':1, 'B':0})
Y = np.array(Y)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)

def Sigmoid(z):
    return 1/(1 + np.exp(-z))

def Hypothesis(theta, x):
    return Sigmoid(x @ theta)

def Cost_Function(X, Y, theta, m):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = 1/float(m) * np.sum(-_y * np.log(hi) - (1 - _y) * np.log(1 - hi))
    return J

def Cost_Function_Derivative(X, Y, theta, m, alpha):
    hi = Hypothesis(theta, X)
    _y = Y.reshape(-1, 1)
    J = alpha/float(m) * (X.T @ (hi - _y))
    return J

def Gradient_Descent(X, Y, theta, m, alpha):
    new_theta = theta - Cost_Function_Derivative(X, Y, theta, m, alpha)
    return new_theta

def Accuracy(theta):
    correct = 0
    length = len(X_test)
    prediction = (Hypothesis(theta, X_test) > 0.5)
    _y = Y_test.reshape(-1, 1)
    correct = prediction == _y
    my_accuracy = (np.sum(correct) / length) * 100
    print('LR Accuracy %: ', my_accuracy)

def Logistic_Regression(X, Y, alpha, theta, num_iters):
    m = len(Y)
    for x in range(num_iters):
        new_theta = Gradient_Descent(X, Y, theta, m, alpha)
        theta = new_theta
        if x % 100 == 0:
            #print ('theta: ', theta)
            print('cost: ', Cost_Function(X, Y, theta, m))
    Accuracy(theta)

ep = .012
initial_theta = np.random.rand(X_train.shape[1], 1) * 2 * ep - ep
alpha = 0.5
iterations = 2000
Logistic_Regression(X_train, Y_train, alpha, initial_theta, iterations)
I think I might have a different version of scikit-learn, because I had to change the MinMaxScaler line to make it work. The result is that I can run 10K iterations in the blink of an eye, and applying the model to the test set gives about 97% accuracy.
This also works, using vectorization to calculate the accuracy.
But accuracy is not a recommended metric here, as the answer above noted (if the data is not well balanced you should not use accuracy; use the F1-score instead).
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X.T, Y.T)
LR_predictions = clf.predict(X.T)
print('Accuracy of logistic regression: %d ' % float((np.dot(Y, LR_predictions) +
      np.dot(1 - Y, 1 - LR_predictions)) / float(Y.size) * 100) +
      '% ' + "(percentage of correctly labelled datapoints)")

LMS batch gradient descent with NumPy

I'm trying to write some very simple LMS batch gradient descent, but I believe I'm doing something wrong with the gradient. The orders of magnitude of the features, and hence of the gradient components, are very different relative to the initial values of theta, so either theta[2] doesn't move (e.g. if alpha = 1e-8) or theta[1] shoots off (e.g. if alpha = .01).
import numpy as np

y = np.array([[400], [330], [369], [232], [540]])
x = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]])
x = np.concatenate((np.ones((5, 1), dtype=np.int), x), axis=1)
theta = np.array([[0.], [.1], [50.]])

alpha = .01
for i in range(1, 1000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
    print(((h - y)**2).sum(), theta.squeeze().tolist())
The algorithm as written is completely correct, but without feature scaling, convergence will be extremely slow as one feature will govern the gradient calculation.
You can perform the scaling in various ways; for now, let us just scale the features by their L^1 norms because it's simple
import numpy as np
y = np.array([[400], [330], [369], [232], [540]])
x_orig = np.array([[2104,3], [1600,3], [2400,3], [1416,2], [3000,4]])
x_orig = np.concatenate((np.ones((5,1), dtype=np.int), x_orig), axis=1)
x_norm = np.sum(x_orig, axis=0)
x = x_orig / x_norm
That is, the sum of every column in x is 1. If you want to retain your good guess at the correct parameters, those have to be scaled accordingly.
theta = (x_norm*[0., .1, 50.]).reshape(3, 1)
With this, we may proceed as you did in your original post, where again you will have to play around with the learning rate until you find a sweet spot.
alpha = .1
for i in range(1, 100000):
    h = np.dot(x, theta)
    gradient = np.sum((h - y) * x, axis=0, keepdims=True).transpose()
    theta -= alpha * gradient
Let's see what we get now that we've found something that seems to converge. Again, your parameters will have to be scaled to relate to the original unscaled features.
print (((h - y)**2).sum(), theta.squeeze()/x_norm)
# Prints 1444.14443271 [ -7.04344646e+01 6.38435468e-02 1.03435881e+02]
At this point, let's cheat and check our results
theta, error, _, _ = np.linalg.lstsq(x_orig, y)
print(error, theta)
# Prints [ 1444.1444327] [[ -7.04346018e+01]
#                         [  6.38433756e-02]
#                         [  1.03436047e+02]]
A general introductory reference on feature scaling is this Stanford lecture.
