I am trying to train a word embedding on a list of repeated sentences where only the subject changes. I expected that, after training, the vectors corresponding to the subjects would be strongly correlated, as one would expect from a word embedding. However, the cosine similarity between the subject vectors is not always higher than the similarity between a subject and an unrelated word.
Man is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy is going to write a very long novel that no one can read.
The code is based on a PyTorch tutorial:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
class EmbedTrainer(nn.Module):
    def __init__(self, d_vocab, d_embed, d_context):
        super(EmbedTrainer, self).__init__()
        self.embed = nn.Embedding(d_vocab, d_embed)
        self.fc_1 = nn.Linear(d_embed * d_context, 128)
        self.fc_2 = nn.Linear(128, d_vocab)

    def forward(self, x):
        x = self.embed(x).view((1, -1))  # flatten after embedding
        x = self.fc_2(F.relu(self.fc_1(x)))
        x = F.log_softmax(x, dim=1)
        return x
text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2
""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
losses = []
epochs = 10
for epoch in range(epochs):
    total_loss = 0
    for input, target in trigrams:
        tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
        target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
        model.zero_grad()
        log_prob = model(tok_ids)
        #if total_loss == 0: print("train ", log_prob, target_id)
        loss = loss_func(log_prob, target_id)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(total_loss)
    losses.append(total_loss)
embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
    embed_map[word] = model.embed.weight[tok_to_ids[word]]
    print(word, embed_map[word])
def angle(a, b):
    # Note: this actually returns the cosine similarity, not the angle itself.
    from numpy.linalg import norm
    a, b = a.detach().numpy(), b.detach().numpy()
    return np.dot(a, b) / norm(a) / norm(b)
print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))
I expected that the generated vectors corresponding to the subjects would be strongly correlated after training, as is expected from a word embedding.
I don't really think you'll achieve that kind of result with only 3 sentences and about 40 training iterations per epoch over 10 epochs (plus most of the data in those iterations is repeated).
Maybe try downloading one of the free datasets out there, or try your own data with a proven model such as a gensim model.
I'll give you the code for training a gensim model, so you can test your dataset on another model and see whether the problem comes from your data or from your model.
I've tested similar gensim models on datasets with millions of sentences and they worked like a charm; for smaller datasets you might want to change the parameters.
from gensim.models import Word2Vec
from multiprocessing import cpu_count
corpus_path = 'eachLineASentence.txt'
vecSize = 300
winSize = 5
numWorkers = cpu_count()-1
epochs = 20
minCount = 5
skipGram = False
modelName = f'mymodel.model'
# Note: size= and iter= are the gensim 3.x parameter names;
# in gensim 4.x they were renamed to vector_size= and epochs=.
model = Word2Vec(corpus_file=corpus_path,
                 size=vecSize,
                 window=winSize,
                 min_count=minCount,
                 workers=numWorkers,
                 iter=epochs,
                 sg=skipGram)
model.save(modelName)
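Once the model is trained, you can sanity-check the embeddings directly. A minimal sketch (this assumes the probe words, e.g. 'man' and 'woman', occur often enough in your corpus to survive the min_count=5 cutoff):
from gensim.models import Word2Vec

model = Word2Vec.load('mymodel.model')
# Cosine similarity between two words (both must be in the vocabulary).
print(model.wv.similarity('man', 'woman'))
# The 5 nearest neighbours of a word in embedding space.
print(model.wv.most_similar('man', topn=5))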
P.S. I don't think it's a good idea to shadow the built-in name input with a variable in your code.
It's most probably the training size. A 128-unit hidden layer (and even a 10-dimensional embedding) is definitely overkill for such a tiny vocabulary. Rule of thumb from the Google Developers blog:
Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25
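Applied to the toy vocabulary in the question (16 unique tokens), the same rule suggests only about 2 dimensions; a quick check:
# Rule-of-thumb embedding size: the 4th root of the vocabulary size.
d_vocab = len(dic)                  # dic from the question's snippet; 16 unique tokens here
d_embed_suggested = round(d_vocab ** 0.25)
print(d_vocab, d_embed_suggested)   # 16 -> 2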
!!!!!!!!!TL;DR at the bottom!!!!!!!!
In an attempt to learn the ins and outs of ML, I have been implementing a neural network optimizer in C++ and wrapped it with SWIG as a Python module. Of course, the first problem I tackled was XOR, via the following snippet of code: 2 input nodes, 2 hidden nodes, 1 output node.
from MikeLearn import NeuralNetwork
from MikeLearn import ClassificationOptimizer
import time
#=======================================================
# Training Set
#=======================================================
X = [[0,1],[1,0],[1,1],[0,0]]
Y = [[1],[1],[0],[0]]
nIn = len(X[0])
nOut = len(Y[0])
#=======================================================
# Model
#=======================================================
verbosity = 0
#Initialize neural network
# NeuralNetwork([nInputs, nHidden1, nHidden2, ..., nOutputs], ['Activation1', 'Activation2', ...])
N = NeuralNetwork([nIn,2,nOut],['sigmoid','sigmoid'])
N.setLoggerVerbosity(verbosity)
#Initialize the classification optimizer
#ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain)
Opt = ClassificationOptimizer(N,X,Y)
Opt.setLoggerVerbosity(verbosity)
start_time = time.time()
#fit data
#fit(nEpoch,LearningRate)
E = Opt.fit(10000,0.1)
print("--- %s seconds ---" % (time.time() - start_time))
#Make a prediction
print(Opt.predict(X))
This snippet of code yields the following output (Correct answer would be [1,1,0,0])
--- 0.10273098945617676 seconds ---
((0.9398755431175232,), (0.9397522211074829,), (0.0612373948097229,), (0.04882470518350601,))
>>>
Looks great!
Now for the issue. The following snippet of code tries to learn from the MNIST dataset, but suffers very obviously from overfitting: 784 inputs (28x28 pixels), 50 hidden, 10 outputs.
from MikeLearn import NeuralNetwork
from MikeLearn import ClassificationOptimizer
import matplotlib.pyplot as plt
import numpy as np
import pickle
import time
#=======================================================
# Data Set
#=======================================================
#load the data dictionary
modeldata = pickle.load( open( "mnist_data.p", "rb" ) )
X = modeldata['X']
Y = modeldata['Y']
#normalize data
X = np.array(X)
X = X/255
X = X.tolist()
#training set
X1 = X[0:49999]
Y1 = Y[0:49999]
#validation set
X2 = X[50000:59999]
Y2 = Y[50000:59999]
#number of inputs/outputs
nIn = len(X[0]) # = 784 (28x28 pixels)
nOut = len(Y[0]) #=10
#=======================================================
# Model
#=======================================================
verbosity = 1
#Initialize neural network
# NeuralNetwork([nInputs, nHidden1, nHidden2, ..., nOutputs], ['Activation1', 'Activation2', ...])
N = NeuralNetwork([nIn,50,nOut],['sigmoid','sigmoid'])
N.setLoggerVerbosity(verbosity)
#Initialize optimizer
#ClassificationOptimizer(NeuralNetwork,Xtrain,Ytrain)
Opt = ClassificationOptimizer(N,X1,Y1)
Opt.setLoggerVerbosity(verbosity)
start_time = time.time()
#fit data
#fit(nEpoch,LearningRate)
E = Opt.fit(10,0.1)
print("--- %s seconds ---" % (time.time() - start_time))
#================================
#Final Accuracy on training set
#================================
XL = Opt.predict(X1)
correct = 0
for i, x in enumerate(XL):
    if XL[i].index(max(XL[i])) == Y1[i].index(max(Y1[i])):
        correct = correct + 1
print("Training set Correct = " + str(correct))
Accuracy = correct / len(XL) * 100
print("Accuracy = " + str(Accuracy) + '%')
#================================
#Final Accuracy on validation set
#================================
XL = Opt.predict(X2)
correct = 0
for i, x in enumerate(XL):
    if XL[i].index(max(XL[i])) == Y2[i].index(max(Y2[i])):
        correct = correct + 1
print("Testing set Correct = " + str(correct))
Accuracy = correct / len(XL) * 100
print("Accuracy = " + str(Accuracy) + '%')
That snippet of code yields the following output which shows the training accuracy and validation accuracy.
-------------------------
Epoch 9
-------------------------
E= 0.00696964
E= 0.350509
E= 3.49568e-05
E= 4.09073e-06
E= 1.38491e-06
E= 0.229873
E= 3.60186e-05
E= 0.000115187
E= 2.29978e-06
E= 2.69165e-06
--- 27.400235176086426 seconds ---
Training set Correct = 48435
Accuracy = 96.87193743874877%
Testing set Correct = 982
Accuracy = 9.820982098209821%
The training set accuracy is great, but the testing set accuracy is no better than a random guess. Any idea what could be causing this?
TL;DR
Solved XOR with a model of 2 inputs, 2 hidden units, 1 output and sigmoid activation functions. Good results.
Tried to solve the MNIST dataset with a model of 784 inputs (28x28 pixels), 50 hidden units, 10 outputs and sigmoid activation functions. Severe overfitting issue: ~97% accuracy on the training set, ~10% accuracy on the validation set.
Any idea what is causing this?
Overfitting is caused by a combination of the data and the model (the network in this case). During training the network was 'lazy' and latched onto aspects of the data that worked well on the training data but do not generalize.
It is difficult, if not impossible, to point out exactly which nodes and weights in the trained network are responsible for the overfitting.
But we can avoid overfitting with several tricks:
Regularisation
Dropout (easier to implement)
Change the network architecture (fewer layers, fewer nodes, more dimension reduction)
https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
To get an idea of regularization, try the playground from tensorflow:
https://playground.tensorflow.org/
A visualisation of dropout
https://yusugomori.com/projects/deep-learning/dropout-relu
Besides trying out regularisation techniques, also experiment with different NN architectures; a small illustrative sketch follows below.
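As an illustration only (this is a Keras sketch, not your MikeLearn API), a minimal 784-50-10 network with dropout and L2 weight regularisation, assuming the flattened 784-pixel inputs and one-hot labels from your setup:
from tensorflow.keras import models, layers, regularizers

model = models.Sequential([
    layers.Input(shape=(784,)),                              # 28x28 flattened pixels
    layers.Dense(50, activation='relu',
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty on the hidden weights
    layers.Dropout(0.5),                                     # randomly drop half the hidden units during training
    layers.Dense(10, activation='softmax'),                  # 10 digit classes
])
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X1, Y1, validation_data=(X2, Y2), epochs=10)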
I am training a keras model to recognise images of cats, dogs and horses.
So far, I have one-hot-encoded my data (since this is a multi-class classification problem), trained my model and called the predictions.
# cv2, numpy (np), matplotlib.pyplot (plt), nrows/ncolumns and the trained model are assumed to be defined earlier.
def read_and_process_images(list_of_images):
    X = []  # images
    y = []  # labels
    for image in list_of_images:
        try:
            X.append(cv2.resize(cv2.imread(image, cv2.IMREAD_COLOR), (nrows, ncolumns),
                                interpolation=cv2.INTER_CUBIC))
            if 'dog' in image:
                y.append(0)
            elif 'cat' in image:
                y.append(1)
            elif 'horse' in image:
                y.append(2)
        except Exception as e:
            print(str(e))
    return X, y
...
X_test, y_test = read_and_process_images(test_imgs)
x = np.array(X_test)
test_datagen = ImageDataGenerator(rescale = 1./255)
i = 0
text_labels = []
plt.figure(figsize = (30,20))
for batch in test_datagen.flow(x, batch_size=1):
    pred = model.predict(batch)
    print(np.argmax(pred))
    if np.argmax(pred) == 0:
        text_labels.append('dog')
    elif np.argmax(pred) == 1:
        text_labels.append('cat')
    else:
        text_labels.append('horse')
    plt.subplot(5 // columns + 1, columns, i + 1)  # integer row count; columns is defined elsewhere
    plt.title('I think this is a ' + text_labels[i])
    imgplot = plt.imshow(batch[0])
    i += 1
    if i % 10 == 0:
        break
plt.show()
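For reference, the one-hot-encoding step mentioned above is not shown in the snippet; a minimal sketch of how it could look, assuming tensorflow.keras is available:
import numpy as np
from tensorflow.keras.utils import to_categorical

# y_test holds integer labels (0 = dog, 1 = cat, 2 = horse) from read_and_process_images
y_test_onehot = to_categorical(np.array(y_test), num_classes=3)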
The model seems to be working very well. I usually get between 7 and 10 correct predictions, depending on the batch size. However, I do not understand how model.predict chooses its batches, and I am therefore unable to compare the actual values to the predicted values. When I try to do the following:
y_pred = model.predict(x, batch_size=1)
matrix = confusion_matrix(y_test, y_pred.argmax(axis=1))
the confusion matrix that I get is completely nonsensical (for example, it tells me that it only got one cat correct, but I can clearly see from some batches that it got many more correct). Could someone explain how the .predict function goes about choosing its batches, and how I can successfully compare the predicted values to the actual test values? Thank you in advance.
I am implementing my own perceptron algorithm in Python, without using NumPy or scikit-learn yet. I wanted to get the basics right before proceeding to the machine-learning-specific modules.
I wrote the code as given below.
used iris data set to classify based on sepal length and petal size.
updating the weights at the end of each training set
learning rate, number of iterations for training provided to the algorithm from client
Issues:
My training algorithm degrades instead of improving over time. Can someone please explain what I am doing incorrectly?
This is my error at the end of each iteration; as you can see, the error is actually increasing.
{
0: 0.01646885885483229,
1: 0.017375368112097056,
2: 0.018105024923841584,
3: 0.01869233173693685,
4: 0.019165059856726563,
5: 0.01954556263697238,
6: 0.019851832477317588,
7: 0.02009835160930562,
8: 0.02029677690109266,
9: 0.020456491062436744
}
import pandas as panda
import matplotlib.pyplot as plot
import random
remote_location = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
class Perceptron(object):
    def __init__(self, epochs, learning_rate, weight_range=None):
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.weight_range = weight_range if weight_range else [-1, 1]
        self.weights = []
        self._x_training_set = None
        self._y_training_set = None
        self.number_of_training_set = 0

    def setup(self):
        self.number_of_training_set = self.setup_training_set()
        self.initialize_weights(len(self._x_training_set[0]) + 1)

    def setup_training_set(self):
        """
        Downloading training set data from UCI ML Repository - Iris DataSet
        """
        data = panda.read_csv(remote_location)
        self._x_training_set = list(data.iloc[0:, [0, 2]].values)
        self._y_training_set = [0 if i.lower() != 'iris-setosa' else 1
                                for i in data.iloc[0:, 4].values]
        return len(self._x_training_set)

    def initialize_weights(self, number_of_weights):
        random_weights = [random.uniform(self.weight_range[0], self.weight_range[1])
                          for i in range(number_of_weights)]
        self.weights.append(-1)  # setting up bias unit
        self.weights.extend(random_weights)

    def draw_initial_plot(self, _x_data, _y_data, _x_label, _y_label):
        plot.xlabel(_x_label)
        plot.ylabel(_y_label)
        plot.scatter(_x_data, _y_data)
        plot.show()

    def learn(self):
        self.setup()
        epoch_data = {}
        error = 0
        for epoch in range(self.epochs):
            for i in range(self.number_of_training_set):
                _x = self._x_training_set[i]
                _desired = self._y_training_set[i]
                _weight = self.weights
                guess = _weight[0]  ## setting up the bias unit
                for j in range(len(_x)):
                    guess += _weight[j+1] * _x[j]
                error = _desired - guess
                ## i am going to reset all the weights
                if error != 0:
                    ## resetting the bias unit
                    self.weights[0] = error * self.learning_rate
                    for j in range(len(_x)):
                        self.weights[j+1] = self.weights[j+1] + error * self.learning_rate * _x[j]
            # saving error at the end of the training set
            epoch_data[epoch] = error
        # print(epoch_data)
        self.draw_initial_plot(list(epoch_data.keys()), list(epoch_data.values()),
                               'Epochs', 'Error')


def runMyCode():
    learning_rate = 0.01
    epochs = 15
    random_generator_start = -1
    random_generator_end = 1
    perceptron = Perceptron(epochs, learning_rate,
                            [random_generator_start, random_generator_end])
    perceptron.learn()

runMyCode()
plot with epoch 6
plot with epoch > 6
That's really weird, since I just ran your code (as is) and got the following plot:
Can you elaborate on where you see this error? Maybe you are reading it in reverse.
I figured out the issue. I was plotting the error function incorrectly.
The error in gradient descent is the squared error, so the code should be:
epoch_data[epoch] = error ** 2
Once I did that, the epoch vs. cost plot always converged towards zero, irrespective of the number of epochs.
I am writing a KNN classifier (taken from here) for character recognition from accelerometer and gyroscope data. But the functions below are not working correctly and the prediction is not happening. Are there any mistakes in the code below? Kindly guide me.
trainingset -> training data with 20 samples (10 = A, 10 = B).
testset -> live reading taken for recognition.
#-- KNN Classifier Functions ----------
# imports used by the functions below
import csv
import math
import operator

def loaddataset():
    global trainingset
    with open('imudata.csv', 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)):
            trainingset.append(dataset[x])

def euclideandistance(instance1, instance2, length):
    distance = 0
    for x in range(length-1):
        instance1[x] = float(instance1[x])
        instance2[x] = float(instance2[x])
    for x in range(length-1):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getneighbours(trainingset, testinstance, k):
    distances = []
    length = len(testinstance) - 1
    for x in range(len(trainingset)):
        dist = euclideandistance(testinstance, trainingset[x], length)
        #print(trainingset[x][-1], dist)
        distances.append((trainingset[x], dist))
    #print(distances)
    distances.sort(key=operator.itemgetter(1))
    #print(distances)
    neighbours = []
    print('k=' + repr(k) + ' length of distances=' + repr(len(distances)))
    for x in range(k):
        neighbours.append(distances[x][0])
    return neighbours

def getresponse(neighbours):
    classvotes = {}
    for x in range(len(neighbours)):
        response = neighbours[x][-1]
        if response in classvotes:
            classvotes[response] += 1
        else:
            classvotes[response] = 1
    sortedvotes = sorted(classvotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedvotes[0][0]

def getaccuracy(testset, predictions):
    correct = 0
    for x in range(len(testset)):
        if testset[x][-1] == predictions[x]:   # compare strings with ==, not 'is'
            correct += 1
    return (correct / float(len(testset))) * 100.0
#------- END of KNN Classifier Functions -------------
My main compare function is
def compare():
    loaddataset()
    testset.append(testdata)
    print 'Train set: ' + repr(len(trainingset))
    print 'Test set: ' + repr(len(testset))
    predictions = []
    k = len(trainingset)
    for x in range(len(testset)):
        neighbours = getneighbours(trainingset, testset[x], k)
        result = getresponse(neighbours)
        predictions.append(result)
        print('>Predicted=' + repr(result) + ', actual=' + repr(testset[x][-1]))
    accuracy = getaccuracy(testset, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
My output is
Train set: 20
Test set: 1
k=20 length of distance=20
>Predicted='A', actual='B'
Accuracy: 0.0%
My sample data packet:
-1.1945864763443935e-16,1.0000000000000031,0.81335962823925234,1.2678119727931405,4.6396523259663871,3,1.0000000000000013,108240.99999999988,328.99999999999966,4.3008487686466931e-16,1.000000000000002,0.73006871826334618,0.88693535629714804,4.3903300136708818,15,1.0000000000000011,108240.99999999977,328.99999999999932,1.990977460573989e-16,1.0000000000000009,0.8120281400849243,1.3556881217171162,4.2839744646260876,9,1.0000000000000004,108240.99999999994,328.99999999999983,-3.4217816017322454e-16,1.0000000000000009,0.7842111273340705,1.0882622268942712,4.4762484049613418,4,1.0000000000000004,108241.00000000038,329.00000000000114,2.6996304550155782e-18,1.000000000000004,0.76504908035654873,1.1890598964371606,4.2138613873737967,7,1.000000000000002,108241.0000000001,329.00000000000028,7.154020705791282e-17,1.0,0.83945423805187047,1.4309844267934049,3.7008217934312198,6,1.0,108240.99999999983,328.99999999999949,-0.66014932688009009,0.48967404184734276,0.083592048161537938,A
I am from a hardware background and don't know much about KNN; that's why I am asking for corrections in my code, if any. I added my dataset here.
I can see from your data that the number of samples is much smaller than the number of features, which may affect prediction accuracy; the number of samples needs to be much higher. You can't expect to predict everything correctly; algorithms have their own accuracies. Try to check the correctness of this code by using another well-known dataset like Iris, or try the built-in KNN classifier from scikit-learn.
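For example, here is a minimal scikit-learn sketch on the built-in Iris dataset (assuming scikit-learn is installed) that you could use to cross-check your own implementation:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a well-known dataset and hold out part of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)  # a small odd k
knn.fit(X_train, y_train)
print('Accuracy:', knn.score(X_test, y_test))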