I'm running a word-RNN implementation in TensorFlow.
How do I calculate the perplexity of the RNN?
Below is the training code that prints the training loss and other details for each epoch:
for e in range(model.epoch_pointer.eval(), args.num_epochs):
    sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** e)))
    data_loader.reset_batch_pointer()
    state = sess.run(model.initial_state)
    speed = 0
    if args.init_from is None:
        assign_op = model.batch_pointer.assign(0)
        sess.run(assign_op)
        assign_op = model.epoch_pointer.assign(e)
        sess.run(assign_op)
    if args.init_from is not None:
        data_loader.pointer = model.batch_pointer.eval()
        args.init_from = None
    for b in range(data_loader.pointer, data_loader.num_batches):
        start = time.time()
        x, y = data_loader.next_batch()
        feed = {model.input_data: x, model.targets: y, model.initial_state: state,
                model.batch_time: speed}
        summary, train_loss, state, _, _ = sess.run([merged, model.cost, model.final_state,
                                                     model.train_op, model.inc_batch_pointer_op], feed)
        train_writer.add_summary(summary, e * data_loader.num_batches + b)
        speed = time.time() - start
        if (e * data_loader.num_batches + b) % args.batch_size == 0:
            print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"
                  .format(e * data_loader.num_batches + b,
                          args.num_epochs * data_loader.num_batches,
                          e, train_loss, speed))
        if (e * data_loader.num_batches + b) % args.save_every == 0 \
                or (e == args.num_epochs - 1 and b == data_loader.num_batches - 1):  # save for the last result
            checkpoint_path = os.path.join(args.save_dir, 'model.ckpt')
            saver.save(sess, checkpoint_path, global_step=e * data_loader.num_batches + b)
            print("model saved to {}".format(checkpoint_path))
train_writer.close()
The project you are referencing uses sequence_loss_by_example, which returns the cross-entropy loss. So, to calculate the training perplexity, you just need to exponentiate the loss, as explained here:
train_perplexity = tf.exp(train_loss)
We have to use e rather than 2 as the base, because TensorFlow measures the cross-entropy loss with the natural logarithm (TF documentation). Thank you, @Matthias Arro and @Colin Skow, for the hint.
Detailed Explanation
The cross-entropy of two probability distributions P and Q tells us the minimum average number of bits we need to encode events of P, when we develop a coding scheme based on Q. So, P is the true distribution, which we usually don't know. We want to find a Q as close to P as possible, so that we can develop a nice coding scheme with as few bits per event as possible.
I shouldn't say bits, because we can only use bits as a measure if we use base 2 in the calculation of the cross-entropy. But TensorFlow uses the natural logarithm, so instead let's measure the cross-entropy in nats.
So let's say we have a bad language model that says every token (character / word) in the vocabulary is equally probable to be the next one. For a vocabulary of 1000 tokens, this model will have a cross-entropy of log(1000) = 6.9 nats. When predicting the next token, it has to choose uniformly between 1000 tokens at each step.
A better language model will determine a probability distribution Q that is closer to P. Thus, the cross-entropy is lower - we might get a cross-entropy of 3.9 nats. If we now want to measure the perplexity, we simply exponentiate the cross-entropy:
exp(3.9) = 49.4
So, on the samples, for which we calculated the loss, the good model was as perplex as if it had to choose uniformly and independently among roughly 50 tokens.
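To make the relationship concrete, here is a small NumPy sketch (not taken from the project's code; the probabilities are made up for illustration) that computes the average cross-entropy in nats and the corresponding perplexity:

    import numpy as np

    # Hypothetical probabilities the model assigned to the *true* next token
    # at each of four time steps.
    p_true_token = np.array([0.1, 0.02, 0.3, 0.05])

    # Average cross-entropy in nats (natural log, as TensorFlow uses).
    cross_entropy = -np.mean(np.log(p_true_token))

    # Perplexity is simply the exponential of the cross-entropy.
    perplexity = np.exp(cross_entropy)

    print(cross_entropy, perplexity)  # roughly 2.60 nats -> perplexity of about 13.5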
It depends on whether your loss function gives you the log likelihood of the data in base 2 or base e. This model uses legacy_seq2seq.sequence_loss_by_example, which relies on TensorFlow's cross-entropy ops, and those use logarithms of base e. Therefore, even though we're dealing with a discrete probability distribution (text), we should exponentiate with e, i.e. use tf.exp(train_loss), as Colin Skow suggested.
Related
I've created my own very simple one-layer neural network, specialised in binary classification problems. The input data points are multiplied by the weights and a bias is added; the whole thing is summed (a weighted sum) and fed through an activation function (such as relu or sigmoid), and that is the prediction output. There are no other layers (i.e. no hidden layers) involved.
Just for my own understanding of the mathematical side, I didn't want to use an existing library/package (e.g. Keras, PyTorch, scikit-learn, etc.), but simply wanted to create a neural network using plain Python code. The model is created inside a method (simple_1_layer_classification_NN) that takes the necessary parameters to make a prediction. However, I encountered some problems, so I have listed my questions below along with my code.
P.S. I really apologise for including such a large portion of code, but I didn't know how else to ask the questions without referencing the relevant code.
The questions:
1 - When I passed a training dataset to the network, I found that the final average accuracy differed completely for different numbers of epochs, with absolutely no clear pattern pointing to an optimal number of epochs. I kept the other parameters the same: learning rate = 0.5, activation = sigmoid (since it's a single layer, being both the input and output layer, with no hidden layers involved; I've read that sigmoid is better suited to an output layer than relu), cost function = squared error. Here are the results for different numbers of epochs:
Epochs = 100,000   -> Average Accuracy: 50.10541638874056
Epochs = 500,000   -> Average Accuracy: 50.08965597645948
Epochs = 1,000,000 -> Average Accuracy: 97.56879179064482
Epochs = 7,500,000 -> Average Accuracy: 49.994692515332524
Epochs = 750,000   -> Average Accuracy: 77.0028368954157
Epochs = 100       -> Average Accuracy: 48.96967591507596
Epochs = 500       -> Average Accuracy: 48.20721972881673
Epochs = 10,000    -> Average Accuracy: 71.58066454336122
Epochs = 50,000    -> Average Accuracy: 62.52998222597177
Epochs = 100,000   -> Average Accuracy: 49.813675726563424
Epochs = 1,000,000 -> Average Accuracy: 49.993141329926374
As you can see, there doesn't seem to be any clear pattern. I tried 1 million epochs and got 97.6% accuracy. Then I tried 7.5 million epochs and got 50% accuracy. Half a million epochs also gave 50% accuracy. 100 epochs resulted in 49% accuracy. Then, the really odd one: I tried 1 million epochs again and got 50%.
So I'm sharing my code below, because I don't believe the network is doing any learning; it just seems like random guessing. I applied the concepts of back-propagation and partial derivatives to optimise the weights and bias, so I'm not sure where I'm going wrong in my code.
2 - One of the parameters I included in the parameter list of the simple_1_layer_classification_NN method is the input_dimension parameter. At first I thought it would be needed to work out the number of weights required for the input layer. Then I realised that, as long as the dataset_input_matrix (matrix of features) argument is passed to the method, I can access a random index of the matrix to get a random observation vector from it (input_observation_vector = dataset_input_matrix[ri]), and then loop through that observation to access each feature. The length of the observation vector tells me exactly how many weights are required, because each feature requires one weight as its coefficient. So len(input_observation_vector) gives me the number of weights needed in the input layer, and therefore I don't need to ask the user to pass an input_dimension argument to the method.
So my question is simply: is there any need/reason to include an input_dimension parameter, when this can be worked out simply by evaluating the length of an observation vector from the input matrix? (A minimal sketch of that idea follows this question.)
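For illustration only, a minimal sketch of that idea, reusing the method's own argument name (any row works, since every observation has the same number of features):

    # Sketch: infer the number of weights from the data itself,
    # instead of asking the caller for input_dimension.
    input_observation_vector = dataset_input_matrix[0]   # any observation will do
    input_dimension = len(input_observation_vector)      # == number of weights needed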
3 - When I try to plot the array of cost values, nothing shows up - plt.plot(y_costs). A cost value (produced in every epoch) is appended to the costs array only every 50 epochs, to avoid adding a huge number of cost elements to the array when the number of epochs is really high. This happens at the lines:

    if i % 50 == 0:
        costs.append(cost)

When I did some debugging, I found that the costs array is empty after the method returns. I'm not sure why that is, when it should be appending a cost value every 50th epoch. I've probably overlooked something really silly that I just can't see.
Many thanks in advance, and apologies again for the long piece of code.
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import sys
# import os

class NN_classification:

    def __init__(self):
        self.bias = float()
        self.weights = []
        self.chosen_activation_func = None
        self.chosen_cost_func = None
        self.train_average_accuracy = int()
        self.test_average_accuracy = int()

    # -- Activation functions --:
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def relu(x):
        return np.maximum(0.0, x)

    # -- Derivatives of activation functions --:
    def sigmoid_derivation(x):
        return NN_classification.sigmoid(x) * (1 - NN_classification.sigmoid(x))

    def relu_derivation(x):
        if x <= 0:
            return 0
        else:
            return 1

    # -- Squared-error cost function --:
    def squared_error(pred, target):
        return np.square(pred - target)

    # -- Derivative of squared-error cost function --:
    def squared_error_derivation(pred, target):
        return 2 * (pred - target)

    # --- neural network structure diagram ---
    #   O    output prediction
    #  / \   w1, w2, b
    # O   O  datapoint 1, datapoint 2
    def simple_1_layer_classification_NN(self, dataset_input_matrix, output_data_labels, input_dimension, epochs, activation_func='sigmoid', learning_rate=0.2, cost_func='squared_error'):
        weights = []
        bias = int()
        cost = float()
        costs = []
        dCost_dWeights = []
        chosen_activation_func_derivation = None
        chosen_cost_func = None
        chosen_cost_func_derivation = None
        correct_pred = int()
        incorrect_pred = int()

        # Store the chosen activation function to use later on in the activation calculation section and in the 'predict' method.
        # The same goes for its derivative.
        if activation_func == 'sigmoid':
            self.chosen_activation_func = NN_classification.sigmoid
            chosen_activation_func_derivation = NN_classification.sigmoid_derivation
        elif activation_func == 'relu':
            self.chosen_activation_func = NN_classification.relu
            chosen_activation_func_derivation = NN_classification.relu_derivation
        else:
            print("Exception error - no activation function utilised, in training method", file=sys.stderr)
            return

        # Store the chosen cost function to use later on in the cost calculation section.
        # The same goes for its derivative.
        if cost_func == 'squared_error':
            chosen_cost_func = NN_classification.squared_error
            chosen_cost_func_derivation = NN_classification.squared_error_derivation
        else:
            print("Exception error - no cost function utilised, in training method", file=sys.stderr)
            return

        # Set initial network parameters (weights & bias):
        # Initialise the weights from a uniform distribution and ensure the numbers are small, close to 0.
        # We need to loop through all the weights to set them to a random value initially.
        for i in range(input_dimension):
            # create random numbers for our initial weights (connections) to begin with. 'rand' creates small random numbers.
            w = np.random.rand()
            weights.append(w)

        # create a random number for our initial bias to begin with.
        bias = np.random.rand()

        # We perform the training based on the number of epochs specified
        for i in range(epochs):
            # create random index
            ri = np.random.randint(len(dataset_input_matrix))
            # Pick a random observation vector of independent variables (x) from the dataset matrix
            input_observation_vector = dataset_input_matrix[ri]

            # Reset the weighted sum at the beginning of every epoch to avoid accumulating the previous observations' weighted sums.
            weighted_sum = 0

            # Loop through all the independent variables (x) in the observation
            for i in range(len(input_observation_vector)):
                # Weighted sum: take each independent variable in the observation, multiply it by its weight, and add it to the subtotal
                weighted_sum += input_observation_vector[i] * weights[i]

            # Add bias to the weighted sum
            weighted_sum += bias

            # Activation: pass weighted_sum through the activation function
            activation_func_output = self.chosen_activation_func(weighted_sum)

            # Prediction: because this is a single-layer neural network, the activation output is the prediction
            pred = activation_func_output

            # Cost: the cost function calculates the prediction error margin
            cost = chosen_cost_func(pred, output_data_labels[ri])
            # Also calculate the derivative of the cost function with respect to the prediction
            dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])

            # Derivative of the prediction output with respect to the weighted sum, through the chosen activation function
            dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)

            # The bias is just a number added to the weighted sum, so its derivative is just 1
            dWeightSum_dB = 1

            # The derivative of the weighted sum with respect to each weight is the input data point / independent variable it's multiplied by.
            # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights'
            # to represent the array of derivatives for all the weights involved. I could've used the input vector
            # itself, but for the sake of readability I created a separate variable to represent the derivative of each of the weights.
            dWeightedSum_dWeights = input_observation_vector

            # Chain rule: chain all the derivative functions together.
            # Loop through all the weights to work out the derivative of the cost with respect to each weight:
            for dWeightedSum_dWeight in dWeightedSum_dWeights:
                dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
                dCost_dWeights.append(dCost_dWeight)

            dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB

            # Backpropagation: update the weights and bias according to the derivatives calculated above.
            # In other words, we update the parameters of the neural network towards better parameters and therefore
            # optimise the prediction to be as close to the real output as possible.
            # We loop through each weight and update it with its derivative of the cost with respect to that weight.
            for i in range(len(weights)):
                weights[i] = weights[i] - learning_rate * dCost_dWeights[i]
            bias = bias - learning_rate * dCost_dB

            # Every 50th loop we record a summary of the prediction compared to the actual output,
            # to see if the prediction is as expected. Any prediction above 0.5 should match an actual
            # output of 1; any prediction below 0.5 should match an actual output of 0.
            if i % 50 == 0:
                costs.append(cost)

            # Compare prediction to target
            error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
            accuracy = (1 - error_margin) * 100
            self.train_average_accuracy += accuracy

            # Evaluate whether the guess was correct based on the binary 0/1 outcome. If the error margin is below 0.5 the guess was correct, otherwise incorrect. Exactly 0.5 counts as incorrect, because it's not a good guess for either 0 or 1. We need to set a good standard for the neural net model.
            if (error_margin < 0.5) and (error_margin >= 0):
                correct_pred += 1
            elif (error_margin >= 0.5) and (error_margin <= 1):
                incorrect_pred += 1
            else:
                print("Exception error - 'margin error' for 'predict' method is out of range. Must be between 0 and 1, in training method", file=sys.stderr)
                return

        # Store the final optimised weights in the weights instance variable so they can be used in the predict method.
        self.weights = weights

        # Store the final optimised bias in the bias instance variable so it can be used in the predict method.
        self.bias = bias

        # Calculate the average accuracy from the predictions of all observations in the training dataset
        self.train_average_accuracy /= epochs

        # Print out results
        print('Average Accuracy: {}'.format(self.train_average_accuracy))
        print('Correct predictions: {}, Incorrect Predictions: {}'.format(correct_pred, incorrect_pred))
        print('costs = {}'.format(costs))
        y_costs = np.array(costs)
        plt.plot(y_costs)
        plt.show()
from numpy import array
#define array of dataset
# each observation vector has 3 datapoints or 3 columns: length, width, and outcome label (0, 1 to represent blue flower and red flower respectively).
data = array([[3,   1.5, 1],
              [2,   1,   0],
              [4,   1.5, 1],
              [3,   1,   0],
              [3.5, 0.5, 1],
              [2,   0.5, 0],
              [5.5, 1,   1],
              [1,   1,   0]])
# separate data: split input, output, train and test data.
X_train, y_train, X_test, y_test = data[:6, :-1], data[:6, -1], data[6:, :-1], data[6:, -1]
nn_model = NN_classification()
nn_model.simple_1_layer_classification_NN(X_train, y_train, 2, 1000000, learning_rate=0.5)
Have you tried a smaller learning rate? Your network may be skipping over local minima because it is too high.
Here's an article that goes more in-depth on learning rates: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
The reason that the cost never gets appended is that you are using the same variable, 'i', in nested for loops.
# We perform the training based on the number of epochs specified
for i in range(epochs):
    # create random index
    ri = np.random.randint(len(dataset_input_matrix))
    # Pick a random observation vector of independent variables (x) from the dataset matrix
    input_observation_vector = dataset_input_matrix[ri]

    # Reset the weighted sum at the beginning of every epoch to avoid accumulating the previous observations' weighted sums.
    weighted_sum = 0

    # Loop through all the independent variables (x) in the observation
    for i in range(len(input_observation_vector)):
        # Weighted sum: take each independent variable in the observation, multiply it by its weight, and add it to the subtotal
        weighted_sum += input_observation_vector[i] * weights[i]

    # Add bias to the weighted sum
    weighted_sum += bias

    # Activation: pass weighted_sum through the activation function
    activation_func_output = self.chosen_activation_func(weighted_sum)

    # Prediction: because this is a single-layer neural network, the activation output is the prediction
    pred = activation_func_output

    # Cost: the cost function calculates the prediction error margin
    cost = chosen_cost_func(pred, output_data_labels[ri])
    # Also calculate the derivative of the cost function with respect to the prediction
    dCost_dPred = chosen_cost_func_derivation(pred, output_data_labels[ri])

    # Derivative of the prediction output with respect to the weighted sum, through the chosen activation function
    dPred_dWeightSum = chosen_activation_func_derivation(weighted_sum)

    # The bias is just a number added to the weighted sum, so its derivative is just 1
    dWeightSum_dB = 1

    # The derivative of the weighted sum with respect to each weight is the input data point / independent variable it's multiplied by.
    # Therefore I simply assigned the input data array to another variable I called 'dWeightedSum_dWeights'
    # to represent the array of derivatives for all the weights involved. I could've used the input vector
    # itself, but for the sake of readability I created a separate variable to represent the derivative of each of the weights.
    dWeightedSum_dWeights = input_observation_vector

    # Chain rule: chain all the derivative functions together.
    # Loop through all the weights to work out the derivative of the cost with respect to each weight:
    for dWeightedSum_dWeight in dWeightedSum_dWeights:
        dCost_dWeight = dCost_dPred * dPred_dWeightSum * dWeightedSum_dWeight
        dCost_dWeights.append(dCost_dWeight)

    dCost_dB = dCost_dPred * dPred_dWeightSum * dWeightSum_dB

    # Backpropagation: update the weights and bias according to the derivatives calculated above.
    # In other words, we update the parameters of the neural network towards better parameters and therefore
    # optimise the prediction to be as close to the real output as possible.
    # We loop through each weight and update it with its derivative of the cost with respect to that weight.
    for i in range(len(weights)):
        weights[i] = weights[i] - learning_rate * dCost_dWeights[i]
    bias = bias - learning_rate * dCost_dB

    # Every 50th loop we record a summary of the prediction compared to the actual output,
    # to see if the prediction is as expected. Any prediction above 0.5 should match an actual
    # output of 1; any prediction below 0.5 should match an actual output of 0.

This was causing 'i' to always be 1 when it got to the if statement:

    if i % 50 == 0:
        costs.append(cost)

    # Compare prediction to target
    error_margin = np.sqrt(np.square(pred - output_data_labels[ri]))
    accuracy = (1 - error_margin) * 100
    self.train_average_accuracy += accuracy
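Here is a tiny, self-contained demonstration of the shadowing problem (not taken from your code, just an illustration):

    # The inner loop reuses and overwrites the outer loop's variable.
    for i in range(5):          # outer loop: intended epoch counter
        for i in range(2):      # inner loop: 'i' ends up as 1 after this loop
            pass
        print(i % 50 == 0)      # always False, so nothing would ever be appended

Renaming the inner loop variables (for example to j and k) lets the outer i really count epochs, so the cost gets appended every 50th epoch as intended.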
Edit
So I tried training the model 1000 times with random learning rates between 0 and 1, and the initial learning rate doesn't seem to make any difference: only 0.3% of these runs achieved an accuracy above 60%, and none of them were above 70%.
Then I ran the same test with an adaptive learning rate:
# Modify the learning rate based on the cost
# Placed just before the bias is calculated
learning_rate = 0.999 * learning_rate + 0.1 * cost
This results in about 10-12% of the models having an accuracy above 60%, and about 2.5% of them above 70%.
I am not sure if this is the right place to ask this question; feel free to tell me if I need to remove the post.
I am quite new to PyTorch and am currently working with CycleGAN (the PyTorch implementation) as part of my project, and I understand most of the CycleGAN implementation.
I read the paper named 'CycleGAN with Better Cycles' and I am trying to apply the modifications mentioned in it. One of the modifications is cycle consistency weight decay, which I don't know how to apply.
optimizer_G.zero_grad()
# Identity loss
loss_id_A = criterion_identity(G_BA(real_A), real_A)
loss_id_B = criterion_identity(G_AB(real_B), real_B)
loss_identity = (loss_id_A + loss_id_B) / 2
# GAN loss
fake_B = G_AB(real_A)
loss_GAN_AB = criterion_GAN(D_B(fake_B), valid)
fake_A = G_BA(real_B)
loss_GAN_BA = criterion_GAN(D_A(fake_A), valid)
loss_GAN = (loss_GAN_AB + loss_GAN_BA) / 2
# Cycle consistency loss
recov_A = G_BA(fake_B)
loss_cycle_A = criterion_cycle(recov_A, real_A)
recov_B = G_AB(fake_A)
loss_cycle_B = criterion_cycle(recov_B, real_B)
loss_cycle = (loss_cycle_A + loss_cycle_B) / 2
# Total loss
loss_G = (loss_GAN
          + lambda_cyc * loss_cycle      # lambda_cyc is 10
          + lambda_id * loss_identity)   # lambda_id is 0.5 * lambda_cyc
loss_G.backward()
optimizer_G.step()
My question is how can I gradually decay the weight of cycle consistency loss?
Any help in implementing this modification would be appreciated.
This is from the paper:
Cycle consistency loss helps to stabilize training a lot in early stages but becomes an obstacle towards realistic images in later stages. We propose to gradually decay the weight of the cycle consistency loss λ as training progresses. However, we should still make sure that λ is not decayed to 0 so that generators won't become unconstrained and go completely wild.
Thanks in advance.
Below is a prototype function you can use!
def loss(other_params, decay_params, step):
    # compute the main (non-cycle) losses -> main_loss
    # compute the cycle consistency loss  -> cyclic_loss
    # compute the current lambda for this step
    cur_lambda = compute_lambda(step, decay_params)
    final_loss = main_loss + cur_lambda * cyclic_loss
    return final_loss
A compute_lambda function that linearly decays lambda from 10 to 1e-5 over 50 steps:
def compute_lambda(step, decay_params):
    final_lambda = decay_params["final"]
    initial_lambda = decay_params["initial"]
    total_step = decay_params["total_step"]
    start_step = decay_params["start_step"]
    if start_step <= step < start_step + total_step:
        # linear interpolation between the initial and final value
        return initial_lambda + (step - start_step) * (final_lambda - initial_lambda) / total_step
    elif step < start_step:
        return initial_lambda
    else:
        return final_lambda

# Usage (i is the current training step; decay starts at step 50 and runs for 50 steps):
compute_lambda(i, {"final": 1e-5, "initial": 10, "total_step": 50, "start_step": 50})
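Applied to the training loop in your question, one possible way (a sketch only; batches_done is assumed to be whatever global step counter your loop keeps, and the decay schedule values here are just examples) is to recompute the weights each iteration before forming the total generator loss:

    # Recompute the cycle-consistency weight every iteration.
    decay_params = {"initial": 10.0, "final": 1e-5, "total_step": 10000, "start_step": 5000}
    lambda_cyc = compute_lambda(batches_done, decay_params)   # decays but never reaches 0
    lambda_id = 0.5 * lambda_cyc

    loss_G = loss_GAN + lambda_cyc * loss_cycle + lambda_id * loss_identity
    loss_G.backward()
    optimizer_G.step()

Keeping the final value above zero matches the paper's warning that λ should not decay all the way to 0.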
Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps?
The TensorFlow script includes this section.
with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(
            batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
In the TensorFlow example, if I perform t-SNE on the embeddings and plot them with matplotlib, the plot looks more reasonable to me when the number of steps is higher.
I am using a small corpus of 1,200 emails. One way it looks more reasonable is that numbers are clustered together. I would like to attain the same apparent level of quality using gensim.
Yes, the Word2Vec class constructor has an iter argument:

    iter = number of iterations (epochs) over the corpus. Default is 5.

Also, if you call the Word2Vec.train() method directly, you can pass an epochs argument that has the same meaning.
The number of actual training steps is deduced from epochs, but also depends on other parameters like text size, window size and batch size. If you're just looking to improve the quality of the embedding vectors, increasing iter is the right way.
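For example (assuming a pre-4.0 gensim, where the parameter is still called iter; in gensim 4.x it was renamed to epochs and size to vector_size):

    from gensim.models import Word2Vec

    # 'sentences' stands in for your tokenised corpus: a list of lists of tokens.
    sentences = [["numbers", "one", "two"], ["email", "subject", "line"]]

    # iter=20 makes 20 passes (epochs) over the corpus instead of the default 5.
    model = Word2Vec(sentences, size=100, window=5, min_count=1, iter=20)

    # Or, if you drive training yourself, the same knob is called 'epochs':
    model2 = Word2Vec(size=100, window=5, min_count=1)
    model2.build_vocab(sentences)
    model2.train(sentences, total_examples=model2.corpus_count, epochs=20)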
I implemented linear regression with gradient descent in Python. To see how well it is doing, I compared it with scikit-learn's LinearRegression() class. For some reason, sklearn always outperforms my program by an MSE of about 3 on average (I am using the Boston Housing dataset for testing). I understand that I am currently not doing gradient checking to test for convergence, but I am allowing many iterations and have set the learning rate low enough that it SHOULD converge. Is there any clear bug in my learning algorithm implementation? Here is my code:
import numpy as np
from sklearn.linear_model import LinearRegression

def getWeights(x):
    lenWeights = len(x[1,:]);
    weights = np.random.rand(lenWeights)
    bias = np.random.random();
    return weights,bias

def train(x,y,weights,bias,maxIter):
    converged = False;
    iterations = 1;
    m = len(x);
    alpha = 0.001;
    while not converged:
        for i in range(len(x)):
            # Dot product of weights and training sample
            hypothesis = np.dot(x[i,:], weights) + bias;
            # Calculate gradient
            error = hypothesis - y[i];
            grad = (alpha * 1/m) * ( error * x[i,:] );
            # Update weights and bias
            weights = weights - grad;
            bias = bias - alpha * error;
            iterations = iterations + 1;
            if iterations > maxIter:
                converged = True;
                break
    return weights, bias

def predict(x, weights, bias):
    return np.dot(x,weights) + bias

if __name__ == '__main__':
    data = np.loadtxt('housing.txt');
    x = data[:,:-1];
    y = data[:,-1];
    for i in range(len(x[1,:])):
        x[:,i] = ( (x[:,i] - np.min(x[:,i])) / (np.max(x[:,i]) - np.min(x[:,i])) );
    initialWeights,initialBias = getWeights(x);
    weights,bias = train(x,y,initialWeights,initialBias,55000);
    pred = predict(x, weights,bias);
    MSE = np.mean(abs(pred - y));
    print "This Program MSE: " + str(MSE)
    sklearnModel = LinearRegression();
    sklearnModel = sklearnModel.fit(x,y);
    sklearnModel = sklearnModel.predict(x);
    skMSE = np.mean(abs(sklearnModel - y));
    print "Sklearn MSE: " + str(skMSE)
First, make sure that you are computing the correct objective function value. The linear regression objective should be .5*np.mean((pred-y)**2), rather than np.mean(abs(pred - y)).
You are actually running a stochastic gradient descent (SGD) algorithm (running a gradient iteration on individual examples), which should be distinguished from "gradient descent".
SGD is a good learning method, but a bad optimization method - it can take many iterations to converge to a minimum of the empirical error (http://leon.bottou.org/publications/pdf/nips-2007.pdf).
For SGD to converge, the learning rate must be restricted. Typically, the learning rate is set to the base learning rate divided by the number of iterations, something like alpha/(iterations+1), using the variables in your code.
You also include a multiple of 1/m in your gradient, which is typically not used in SGD updates.
To test your SGD implementation, rather than evaluating the error on the dataset that you trained with, split the dataset into a training set and a test set, and evaluate the error on this test set after training with both methods. The training/test set split will allow you to estimate the performance of your algorithm as a learning algorithm (estimate the expected error) rather than as an optimization algorithm (minimize the empirical error).
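To illustrate those suggestions, here is a sketch of an SGD training function along those lines (the variable names mirror the question's code; the base learning rate is just an example value, not a recommendation):

    import numpy as np

    def sgd_train(x, y, weights, bias, max_iter, base_alpha=0.01):
        # Sketch: decaying step size, and no 1/m factor in the per-example update.
        iterations = 1
        while iterations <= max_iter:
            for i in range(len(x)):
                hypothesis = np.dot(x[i, :], weights) + bias
                error = hypothesis - y[i]
                alpha = base_alpha / (iterations + 1)   # learning rate shrinks over time
                weights = weights - alpha * error * x[i, :]
                bias = bias - alpha * error
                iterations = iterations + 1
                if iterations > max_iter:
                    break
        return weights, bias

    # Report the regression objective on a held-out test set, not the mean absolute error:
    # objective = 0.5 * np.mean((predict(x_test, weights, bias) - y_test) ** 2)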
Try increasing your iteration value. This should allow your algorithm to, hopefully, converge on a value that is closer to the global minimum. Keep in mind you are not using L-BFGS, which can converge much faster than plain gradient descent or even SGD.
Also try using the normal equation as another way to do Linear Regression.
http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/.
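For reference, a minimal normal-equation style solver (a sketch; it folds the bias into the coefficients by prepending a column of ones, and uses lstsq rather than an explicit matrix inverse for numerical stability):

    import numpy as np

    def normal_equation(X, y):
        # Prepend a column of ones so the bias becomes the first coefficient.
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        # Least-squares solution of Xb * w = y.
        w = np.linalg.lstsq(Xb, y, rcond=None)[0]
        bias, weights = w[0], w[1:]
        return weights, bias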
I am trying to classify 2D data into 3 classes with a multi-layer neural network, using simple back-propagation and one-hot encoding. After I changed from incremental learning to batch learning, my output converges to 0 ([0, 0, 0]), mostly when I use more data or a higher learning rate. I don't know whether I have to differentiate something else, or whether I made some bug in the code.
for each epoch:  # pseudocode
    for each input:
        calculate hidden neuron activations (logsig)
        calculate output neuron activations (logsig)

        # error propagation
        for i in range(3):
            error = (desired_out[i] - activations_out[i])
            error_out[i] = error * deriv_logsig(activations_out[i])
        t_weights_out = zip(*weights_out)
        for i in range(hidden_neurons):
            sum_error = sum(e * w for e, w in zip(error_out, t_weights_out[i]))
            error_h[i] = sum_error * deriv_logsig(input_out[i])

        # accumulate deltas
        for i in range(len(weights_out)):
            delta_out[i] = [d + x * coef * error_out[i] for d, x in zip(delta_out[i], input_out)]
        for i in range(len(weights_h)):
            delta_h[i] = [d + x * coef * error_h[i] for d, x in zip(delta_h[i], input)]

    # batch learning after each epoch
    for i in range(len(weights_out)):
        weights_out[i] = [w + delta for w, delta in zip(weights_out[i], delta_out[i])]
    for i in range(len(weights_h)):
        weights_h[i] = [w + delta for w, delta in zip(weights_h[i], delta_h[i])]
I'd try a toy example where I'm sure how the NN should behave, and debug my code against it. If I'm sure my code is a valid NN and I'm still not getting good results, I'd try changing the parameters of the NN. But that can be quite time consuming, so I'd also consider an easier ML technique, e.g. decision trees, which are not a black box like a NN. With decision trees it should be easier and faster to find a solution. The question is whether you can implement the task with something other than a NN...
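For example, a quick baseline with scikit-learn's decision tree (a sketch; X and y here are random stand-ins for your 2D points and their three class labels) takes only a few lines and gives you something to compare the NN against:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical stand-ins for your data: X is (n_samples, 2), y holds labels 0, 1 or 2.
    X = np.random.rand(300, 2)
    y = np.random.randint(0, 3, size=300)

    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))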