I am trying to train a word embedding on a list of repeated sentences where only the subject changes. I expected that, after training, the vectors corresponding to the subjects would be strongly correlated, as one expects from a word embedding. However, the cosine similarity between two subject vectors is not always larger than the similarity between a subject and a random word.
Man is going to write a very long novel that no one can read.
Woman is going to write a very long novel that no one can read.
Boy is going to write a very long novel that no one can read.
The code is based on the PyTorch tutorial:
import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
class EmbedTrainer(nn.Module):
def __init__(self, d_vocab, d_embed, d_context):
super(EmbedTrainer, self).__init__()
self.embed = nn.Embedding(d_vocab, d_embed)
self.fc_1 = nn.Linear(d_embed * d_context, 128)
self.fc_2 = nn.Linear(128, d_vocab)
def forward(self, x):
x = self.embed(x).view((1, -1)) # flatten after embedding
x = self.fc_2(F.relu(self.fc_1(x)))
x = F.log_softmax(x, dim=1)
return x
text = " ".join(["{} is going to write a very long novel that no one can read.".format(x) for x in ["Man", "Woman", "Boy"]])
text_split = text.split()
trigrams = [([text_split[i], text_split[i+1]], text_split[i+2]) for i in range(len(text_split)-2)]
dic = list(set(text.split()))
tok_to_ids = {w:i for i, w in enumerate(dic)}
tokens_text = text.split(" ")
d_vocab, d_embed, d_context = len(dic), 10, 2
""" Train """
loss_func = nn.NLLLoss()
model = EmbedTrainer(d_vocab, d_embed, d_context)
print(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
losses = []
epochs = 10
for epoch in range(epochs):
total_loss = 0
for input, target in trigrams:
tok_ids = torch.tensor([tok_to_ids[tok] for tok in input], dtype=torch.long)
target_id = torch.tensor([tok_to_ids[target]], dtype=torch.long)
model.zero_grad()
log_prob = model(tok_ids)
#if total_loss == 0: print("train ", log_prob, target_id)
loss = loss_func(log_prob, target_id)
total_loss += loss.item()
loss.backward()
optimizer.step()
print(total_loss)
losses.append(total_loss)
embed_map = {}
for word in ["Man", "Woman", "Boy", "novel"]:
embed_map[word] = model.embed.weight[tok_to_ids[word]]
print(word, embed_map[word])
def angle(a, b):
    # Note: despite the name, this returns the cosine similarity, not the angle itself.
    from numpy.linalg import norm
    a, b = a.detach().numpy(), b.detach().numpy()
    return np.dot(a, b) / norm(a) / norm(b)
print("man.woman", angle(embed_map["Man"], embed_map["Woman"]))
print("man.novel", angle(embed_map["Man"], embed_map["novel"]))
I expected that the vectors corresponding to the subjects would be strongly correlated after training, as is expected from a word embedding
I don't really think you'll achieve that kind of result with only 3 sentences, roughly 40 training examples per epoch, and 10 epochs (plus most of those examples are repeated).
Maybe try downloading a couple of the free datasets out there, or try your own data with a proven model such as a gensim model.
I'll give you the code for training a gensim model, so you can test your dataset on another model and see whether the problem comes from your data or from your model.
I've tested similar gensim models on datasets with millions of sentences and they worked like a charm; for smaller datasets you might want to change the parameters.
from gensim.models import Word2Vec
from multiprocessing import cpu_count
corpus_path = 'eachLineASentence.txt'
vecSize = 300
winSize = 5
numWorkers = cpu_count()-1
epochs = 20
minCount = 5
skipGram = False
modelName = f'mymodel.model'
model = Word2Vec(corpus_file=corpus_path,
                 size=vecSize,        # renamed to vector_size in gensim 4.x
                 window=winSize,
                 min_count=minCount,
                 workers=numWorkers,
                 iter=epochs,         # renamed to epochs in gensim 4.x
                 sg=skipGram)
model.save(modelName)
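If you want a quick sanity check once the gensim model is trained, it exposes cosine similarity directly. A minimal sketch using the gensim 3.x API assumed by the code above (the words are placeholders for whatever is in your corpus, and note that min_count=5 drops any word appearing fewer than five times):
loaded = Word2Vec.load(modelName)
print(loaded.wv.similarity('Man', 'Woman'))   # cosine similarity between two words
print(loaded.wv.most_similar('Man', topn=5))  # nearest neighbours of 'Man' in the embedding space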
P.S. I don't think it's a good idea to use the built-in name input as a variable in your code.
It's most probably the training size. Training a 128-unit hidden layer on top of the embedding is definitely overkill here. Rule of thumb from the Google Developers blog:
Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25
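For scale, applying that rule of thumb to the toy corpus in the question, which has 16 unique whitespace-separated tokens (counting "read." with its period), suggests only about 2 dimensions:
vocab_size = 16                      # len(dic) for the three toy sentences above
suggested_dims = vocab_size ** 0.25  # 16**0.25 = 2.0
print(suggested_dims)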
I am working with word embeddings, and each phrase has different length.
My dataset contains a list of vectors, one vector of size 300 for each word in the phrase.
Maybe one phrase has 20 words, maybe it has 2, etc.
For instance:
X.iloc[0:10, ]
0 [[-0.51389, -0.55286, -0.28648, -0.18608, -0.0...
1 [[0.33837, -0.13933, -0.096114, 0.40103, 0.041...
2 [[-0.078564, 0.18702, -0.35168, 0.067557, 0.11...
3 [[0.047356, -0.10216, -0.15738, -0.04521, 0.26...
4 [[0.16781, -0.31793, -0.21877, 0.28025, 0.3364...
5 [[-0.4509, 0.077681, -0.058347, 0.2859, -0.369...
6 [[0.018223, -0.012323, 0.035569, 0.24232, -0.1...
7 [[-0.19265, 0.45863, -0.33841, -0.16293, -0.26...
8 [[0.10751, 0.15958, 0.13332, 0.16642, -0.03273...
9 [[0.35259, 0.60833, 0.051335, -0.079285, -0.35...
Name: embedding, dtype: object
len(X.iloc[0])
313
len(X.iloc[1])
2
The targets are just integers from 0 to 5.
How can I pad this sequence using pytorch to feed a neural network of fixed size?
I saw something about collate_fn; however, I think it only works on batches, not on the whole dataset.
You are right that collate_fn can help in this case. It is indeed applied at the batch level, but iterating through all the batches ends up processing the entire dataset.
For padding you can use some of the following functions:
torch.nn.functional.pad
torch.nn.utils.rnn.pad_sequence (see also the sketch at the end of this answer)
Doing it manually in a custom way is also an option.
Here is a toy example with torch.nn.functional.pad and a random dataset:
import torch
import random
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
MAX_PHRASE_LENGTH = 20
class PhraseDataset(Dataset):
def __len__(self):
return 5
def __getitem__(self, item):
phrase = torch.rand((random.randint(2, MAX_PHRASE_LENGTH), 300), dtype=torch.float)
target = torch.tensor(random.randint(0, 5), dtype=torch.float)
return phrase, target
def collate_fn(batch):
targets = []
phrases = []
for (phrase, target) in batch:
targets.append(target)
padding = MAX_PHRASE_LENGTH - phrase.size()[0]
if padding > 0:
phrases.append(F.pad(phrase, (0, 0, padding, 0), "constant", 0))
else:
phrases.append(phrase)
return torch.stack(phrases), torch.tensor(targets, dtype=torch.float)
dataset = PhraseDataset()
data_loader = DataLoader(dataset, batch_size=5, collate_fn=collate_fn)
for phrases, targets in data_loader:
print('Phrases size:', phrases.size())
Output:
Phrases size: torch.Size([5, 20, 300])
That said, padding could also be applied inside the dataset itself, before collate_fn is ever called, in which case no custom collate_fn is needed at all.
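And if you do keep a collate_fn but would rather pad only up to the longest phrase in each batch instead of a fixed MAX_PHRASE_LENGTH, torch.nn.utils.rnn.pad_sequence (mentioned above) keeps it very short. A minimal sketch reusing the dataset from the example:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn_dynamic(batch):
    phrases, targets = zip(*batch)
    # pad_sequence zero-pads every phrase up to the longest phrase in this particular batch
    padded = pad_sequence(list(phrases), batch_first=True, padding_value=0.0)
    return padded, torch.stack(targets)

data_loader = DataLoader(dataset, batch_size=5, collate_fn=collate_fn_dynamic)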
I did a quick experiment to see if I could understand what the hidden state in an LSTM does...
I tried to make an LSTM predict a sequence of [1,0,1,0,1...] based on an input sequence X, with X[0] = 1 and the remainder random noise.
X = [1, randFloat, randFloat, randFloat...]
label = [1, 0, 1, 0...]
In my head, the model would understand:
The inputs X mean nothing, or at least very little (as it's noise) - so it'd discard these values for the most part
Solely the hidden state from the previous sequence/timestep n would be used to predict the next timestep n+1... [1, 0, 1, 0...]
I also set X[0] = 1 so that the first input is 1, in an attempt to guide the net to predict 1 on the first item (which it does).
So, this didn't work. Shouldn't it, in theory? Can someone explain?
It essentially never converges and just hovers on the cusp between guessing 0 and 1.
## Code
import os
import numpy as np
import torch
from torchvision import transforms
from torch import nn
import torch.nn.functional as F  # needed for F.log_softmax in the forward pass below
from sklearn import preprocessing
from util import create_sequences
import torch.optim as optim
Create some fake data
sequence_1 = torch.tensor(np.random.uniform(size=50)).float().detach()
sequence_1[0] = 1
sequence_2 = torch.tensor(np.random.uniform(size=50)).float().detach()
sequence_2[0] = 1
labels_1 = np.zeros(50)
labels_1[::2] = 1
labels_1 = torch.tensor(labels_1, dtype=torch.long)
labels_2 = labels_1.clone()
training_data = [sequence_1, sequence_2]
label_data = [labels_1, labels_2]
Create simple LSTM Model
class LSTM(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(LSTM, self).__init__()
self.lstm = nn.LSTM(input_dim, hidden_dim)
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, seq):
lstm_out, _ = self.lstm(seq.view(len(seq), 1, -1))
out = self.fc(lstm_out.view(len(seq), -1))
out = F.log_softmax(out, dim=1)
return out
We try to overfit on the dataset
INPUT_DIM = 1
HIDDEN_DIM = 6
model = LSTM(INPUT_DIM, HIDDEN_DIM, 2)
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
for epoch in range(500):
for i, seq in enumerate(training_data):
labels = label_data[i]
model.zero_grad()
scores = model(seq)
loss = loss_function(scores, labels)
loss.backward()
print(loss)
optimizer.step()
with torch.no_grad():
seq_d = training_data[0]
tag_scores = model(seq_d)
for score in tag_scores:
print(np.argmax(score))
I would say it's not meant to work.
The model will always try to make sense of and find patterns in the data it is trained on, i.e. sequence_1, and it uses labels_1 to "verify" that it has "found" them. Since the data is random, the model fails to find a pattern.
The pattern the model tries to find is in the data, not in the labels, so it doesn't matter how the labels are arranged: the labels never actually pass through the model. So no, in theory this is not meant to work.
If you trained it on a single example, then perhaps: the model would overfit and give you your ones and zeros while failing miserably on other examples. Otherwise it simply cannot make sense of the random data, no matter how much of it there is.
Hidden State
Solely the hidden state from the previous sequence/timestep n would be used to predict the next timestep n+1... [1, 0, 1, 0...]
Concerning the hidden state, note that it is not a trainable parameter; it is the result of performing operations on the data and the parameters, which means the input data determines the hidden state.
What the hidden state does is hold the information the model has extracted from the previous timesteps and pass it on to the next timestep or out as the output. In the case of an LSTM, some forgetting and updating happens before it is passed on.
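A quick way to see this in code (a small standalone sketch, not tied to the model above): the hidden state comes back as part of the forward output and never appears among the module's trainable parameters.
import torch
from torch import nn

lstm = nn.LSTM(input_size=1, hidden_size=6)
x = torch.randn(50, 1, 1)                             # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)                             # h_n and c_n are computed from x and the weights
print([name for name, _ in lstm.named_parameters()])  # only weight_* and bias_* tensors, no hidden state
print(h_n.shape, c_n.shape)                           # torch.Size([1, 1, 6]) for both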
I am trying to create a Convolutional Neural Network to classify what language a certain "word" is from. There are two files ("english_words.txt" and "spanish_words.txt"), each containing about 60,000 words. I have converted each word into a 29-dimensional vector where each element is a number between 0 and 1. I am training the model for 500 epochs with the "adam" optimizer. However, when I train the model, the loss hovers around 0.7 and the accuracy around 0.5, and no matter how long I train, these metrics do not improve. Here is the code:
import keras
import numpy as np
from keras.layers import Dense
from keras.models import Sequential
import re
train_labels = []
train_data = []
with open("english_words.txt") as words:
full_words = words.read()
full_words = full_words.split("\n")
# all of the labels are just 1.
# we now need to encode them into 29 dimensional vectors.
vector = []
i = 0
for word in full_words:
train_labels.append([1,0])
for letter in word:
vector.append((ord(letter) - 96) * (1.0 / 26.0))
i += 1
if (i < 29):
for x in range(0, 29 - i):
vector.append(0)
train_data.append(vector)
vector = []
i = 0
with open("spanish_words.txt") as words:
full_words = words.read()
full_words = full_words.replace(' ', '')
full_words = full_words.replace('\n', ',')
full_words = full_words.split(",")
vector = []
for word in full_words:
train_labels.append([0,1])
for letter in word:
vector.append((ord(letter) - 96) * (1.0 / 26.0))
i += 1
if (i < 29):
for x in range(0, 29 - i):
vector.append(0)
train_data.append(vector)
vector = []
i = 0
def shuffle_in_unison(a, b):
assert len(a) == len(b)
shuffled_a = np.empty(a.shape, dtype=a.dtype)
shuffled_b = np.empty(b.shape, dtype=b.dtype)
permutation = np.random.permutation(len(a))
for old_index, new_index in enumerate(permutation):
shuffled_a[new_index] = a[old_index]
shuffled_b[new_index] = b[old_index]
return shuffled_a, shuffled_b
train_data = np.asarray(train_data, dtype=np.float32)
train_labels = np.asarray(train_labels, dtype=np.float32)
train_data, train_labels = shuffle_in_unison(train_data, train_labels)
print(train_data.shape, train_labels.shape)
model = Sequential()
model.add(Dense(29, input_shape=(29,)))
model.add(Dense(60))
model.add(Dense(40))
model.add(Dense(25))
model.add(Dense(2))
model.compile(optimizer="adam",
loss="categorical_crossentropy",
metrics=["accuracy"])
model.summary()
model.fit(train_data, train_labels, epochs=500, batch_size=128)
model.save("language_predictor.model")
For some extra info, I am running python 3.x with tensorflow 1.15 and keras 1.15 on windows x64.
I can see several potential problems with your code.
You added several Dense layers one after another, but you really need to also include a non-linear activation function via the activation=... parameter (e.g. activation='relu'; see the sketch after these points). In the absence of any non-linear activation functions, all those fully-connected Dense layers mathematically collapse into one single linear Dense layer that is incapable of learning a non-linear decision boundary.
In general, if you see your loss and accuracy not making any improvement or even getting worse, then the first thing to try is to reduce your learning rate.
You don't necessarily need to implement your own shuffling function. The Keras fit() function can do it for you via the shuffle=True parameter (which is the default).
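As a rough sketch of points 1 and 2 applied to the model from the question (the particular activations and learning rate are only illustrative choices, not the only valid ones):
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

model = Sequential()
model.add(Dense(29, activation="relu", input_shape=(29,)))  # non-linear activation between Dense layers
model.add(Dense(60, activation="relu"))
model.add(Dense(40, activation="relu"))
model.add(Dense(25, activation="relu"))
model.add(Dense(2, activation="softmax"))                   # softmax output pairs with categorical_crossentropy
model.compile(optimizer=Adam(lr=1e-4),  # reduced learning rate; newer Keras spells this learning_rate=
              loss="categorical_crossentropy",
              metrics=["accuracy"])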
In addition to the points mentioned by stackoverflowuser2010:
I find this a very good read and highly suggest checking the mentioned points: 37 Reasons why your Neural Network is not working
Center your input data: Compute a component-wise mean vector and subtract it from every input.
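For the centering step, something along these lines would do (a sketch operating on the train_data array built in the question):
feature_mean = train_data.mean(axis=0)  # per-component mean vector, shape (29,)
train_data = train_data - feature_mean  # every input is now centered around zero
# Remember to subtract the same training mean from any validation or test data later on.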
I've been getting some practice with BERT embeddings. Specifically, I'm using the BERT-Base Uncased model with the PyTorch libraries:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
I noticed that, for pretty much every word I've looked at, element #308 of the 768-long BERT embedding vector is a negative outlier, with values below -2 and perhaps half of the time below -4.
It is really weird. I tried to google something about 'bert embedding 308' but couldn't find anything.
I wonder if there is any explanation for this 'phenomenon'.
Here is my routine to extract the embeddings:
def bert_embedding(text, bert_model, bert_tokenizer, layer_number = 0):
marked_text = "[CLS] " + text + " [SEP]"
#the default is the last layer: 0-1 = -1
layer_number -= 1
tokenized_text = bert_tokenizer.tokenize(marked_text)
    indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)  # use the tokenizer argument, not the global one
    segments_ids = [1] * len(tokenized_text)  # a single sentence is conventionally all segment id 0; all 1s still runs, but differs from that convention
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
with torch.no_grad():
encoded_layers, _ = bert_model(tokens_tensor, segments_tensors)
return encoded_layers[layer_number][0][:]
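For reference, a minimal way to reproduce the observation with the routine above (the example words are arbitrary):
model.eval()  # disable dropout so the extracted embeddings are deterministic
for word in ["cat", "house", "running"]:
    emb = bert_embedding(word, model, tokenizer)         # one 768-long vector per token ([CLS], word pieces, [SEP])
    print(word, [round(float(v[308]), 2) for v in emb])  # inspect component #308 of each token vector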
I'm having trouble with my first neural network. I simply cannot find the source of the error.
Problem
Reading the book "Make Your Own Neural Network" by Tariq Rashid, I tried to implement handwriting recognition using a NN that classifies images and determines which digit from 0 to 9 is written down.
After training the NN, the tests show that each of the digits has a ~99% match, which is obviously wrong.
Suspicions
In the book the author approaches the NN matrices a bit differently than I have. For example, he multiplies the input-to-hidden weights by the input, whereas I do it the other way around and multiply the input by the input-to-hidden weights.
Here is an illustration of the way I do the matrix multiplication while querying the NN (feedforward):
I'm aware that matrix multiplication is not commutative, but I don't see that I have made an error there.
Should I take a different approach, i.e. transpose all the matrices and multiply them in a different order?
Is there a de facto standard for the dimensions of the input and output matrices, i.e. should they be shaped 1×n or n×1?
If this is the wrong approach, then it has certainly manifested itself in the backpropagation with gradient descent used for training.
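For reference, a quick shape check of the two conventions (the numbers are purely illustrative):
import numpy as np

x_row = np.random.rand(1, 784)      # input as a 1×n row vector (my convention)
W = np.random.rand(784, 100)        # input-to-hidden weights, n×m
h_row = np.dot(x_row, W)            # shape (1, 100)

x_col = x_row.T                     # the same input as an n×1 column vector (the book's convention)
h_col = np.dot(W.T, x_col)          # shape (100, 1)
print(np.allclose(h_row.T, h_col))  # True: both orderings give the same numbers, just transposed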
Source code
import numpy as np
import matplotlib.pyplot
from matplotlib.pyplot import imshow
import scipy.special as scipy
from PIL import Image
class NeuralNetwork(object):
def __init__(self):
self.input_neuron_count = 28*28 # One for each pixel, 28*28 = 784 in total.
        self.hidden_neuron_count = 100 # Arbitrary.
self.output_neuron_count = 10 # One for each digit from 0 to 9.
        self.learning_rate = 0.1 # Arbitrary.
# Sampling the weights from a normal probability distribution
# centered around zero and with standard deviation
# that is related to the number of incoming links into a node,
# 1/√(number of incoming links).
generate_random_weight_matrix = lambda input_neuron_count, output_neuron_count: (
np.random.normal(0.0, pow(input_neuron_count, -0.5), (input_neuron_count, output_neuron_count))
)
self.input_x_hidden_weights = generate_random_weight_matrix(self.input_neuron_count, self.hidden_neuron_count)
self.hidden_x_output_weights = generate_random_weight_matrix(self.hidden_neuron_count, self.output_neuron_count)
self.activation_function = lambda value: scipy.expit(value) # Sigmoid function
def train(self, input_array, target_array):
inputs = np.array(input_array, ndmin=2)
targets = np.array(target_array, ndmin=2)
hidden_layer_input = np.dot(inputs, self.input_x_hidden_weights)
hidden_layer_output = self.activation_function(hidden_layer_input)
output_layer_input = np.dot(hidden_layer_output, self.hidden_x_output_weights)
output_layer_output = self.activation_function(output_layer_input)
output_errors = targets - output_layer_output
self.hidden_x_output_weights += self.learning_rate * np.dot(hidden_layer_output.T, (output_errors * output_layer_output * (1 - output_layer_output)))
hidden_errors = np.dot(output_errors, self.hidden_x_output_weights.T)
self.input_x_hidden_weights += self.learning_rate * np.dot(inputs.T, (hidden_errors * hidden_layer_output * (1 - hidden_layer_output)))
def query(self, input_array):
inputs = np.array(input_array, ndmin=2)
hidden_layer_input = np.dot(inputs, self.input_x_hidden_weights)
hidden_layer_output = self.activation_function(hidden_layer_input)
output_layer_input = np.dot(hidden_layer_output, self.hidden_x_output_weights)
output_layer_output = self.activation_function(output_layer_input)
return output_layer_output
Replication (Training and testing)
The original source of the training and testing data is the MNIST Database. I have used the CSV version, which I downloaded from the book author's web page, The MNIST Dataset of Handwritten Digits.
Here is the code I have used for training and testing so far:
def prepare_data(handwritten_digit_array):
return ((handwritten_digit_array / 255.0 * 0.99) + 0.0001).flatten()
def create_target(digit_target):
target = np.zeros(10) + 0.01
target[digit_target] = target[digit_target] + 0.98
return target
# Training
neural_network = NeuralNetwork()
training_data_file = open('mnist_train.csv', 'r')
training_data = training_data_file.readlines()
training_data_file.close()
for data in training_data:
handwritten_digit_raw = data.split(',')
handwritten_digit_array = np.asfarray(handwritten_digit_raw[1:]).reshape((28, 28))
handwritten_digit_target = int(handwritten_digit_raw[0])
neural_network.train(prepare_data(handwritten_digit_array), create_target(handwritten_digit_target))
# Testing
test_data_file = open('mnist_test_10.csv', 'r')
test_data = test_data_file.readlines()
test_data_file.close()
for data in test_data:
handwritten_digit_raw = data.split(',')
handwritten_digit_array = np.asfarray(handwritten_digit_raw[1:]).reshape((28, 28))
handwritten_digit_target = int(handwritten_digit_raw[0])
output = neural_network.query(handwritten_digit_array.flatten())
print('target', handwritten_digit_target)
print('output', output)
This is one of those facepalm moments. The neural network has been working as expected all along. The truth is that I misread the test results: I overlooked that the numbers were written in scientific notation and read them incorrectly.
Measured on the 10,000 test samples from the MNIST Database, this NN has an accuracy of 94.01%.
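A sketch of how such an accuracy figure can be computed over the full test CSV (the file name is assumed to mirror the training file above):
test_data_file = open('mnist_test.csv', 'r')  # assumed name of the full 10,000-record test CSV
test_data = test_data_file.readlines()
test_data_file.close()

correct = 0
for data in test_data:
    handwritten_digit_raw = data.split(',')
    target = int(handwritten_digit_raw[0])
    pixels = np.asfarray(handwritten_digit_raw[1:]).reshape((28, 28))
    output = neural_network.query(prepare_data(pixels))
    if np.argmax(output) == target:
        correct += 1
print('accuracy', correct / len(test_data))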