Gensim equivalent of training steps - python

Does gensim Word2Vec have an option that is the equivalent of "training steps" in the TensorFlow word2vec example here: Word2Vec Basic? If not, what default value does gensim use? Is the gensim parameter iter related to training steps?
The TensorFlow script includes this section.
with tf.Session(graph=graph) as session:
    # We must initialize all variables before we use them.
    init.run()
    print('Initialized')

    average_loss = 0
    for step in xrange(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run())
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val

        if step % 2000 == 0:
            if step > 0:
                average_loss /= 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step ', step, ': ', average_loss)
            average_loss = 0

        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in xrange(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                log_str = 'Nearest to %s:' % valid_word
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = '%s %s,' % (log_str, close_word)
                print(log_str)
    final_embeddings = normalized_embeddings.eval()
In the TensorFlow example, if I perform T-SNE on the embeddings and plot with matplotlib, the plot looks more reasonable to me when the number of steps is high.
I am using a small corpus of 1,200 emails. One way it looks more reasonable is that numbers are clustered together. I would like to attain the same apparent level of quality using gensim.

Yes, the Word2Vec class constructor has an iter argument:
iter = number of iterations (epochs) over the corpus. Default is 5.
Also, if you call the Word2Vec.train() method directly, you can pass an epochs argument with the same meaning.
The number of actual training steps is derived from the number of epochs, but it also depends on other parameters such as corpus size, window size, and batch size. If you're just looking to improve the quality of the embedding vectors, increasing iter is the right way to do it.
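For illustration, here is a minimal sketch using the pre-4.0 gensim API, where the constructor argument is iter (gensim 4.x renamed iter to epochs and size to vector_size); the toy corpus below is made up:

from gensim.models import Word2Vec

# Hypothetical toy corpus: each document is a list of tokens.
sentences = [["hello", "world"], ["numbers", "like", "one", "two", "three"]]

# More passes over a small corpus usually improves the embeddings.
model = Word2Vec(sentences, size=100, window=5, min_count=1, iter=50)

# Equivalent explicit call to train(), where the argument is named epochs:
# model = Word2Vec(size=100, window=5, min_count=1)
# model.build_vocab(sentences)
# model.train(sentences, total_examples=model.corpus_count, epochs=50)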


Final step of PyTorch Gradient Accumulation for small datasets

I am training a BERT model on a relatively small dataset and cannot afford to lose any labelled sample as they must all be used for training. Due to GPU memory constraints, I am using gradient accumulation to train on larger batches (e.g. 32). According to PyTorch documentation, gradient accumulation is implemented as follows:
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
However, if you are using e.g. 110 training samples, with batch size 8 and accumulation step 4 (i.e. effective batch size 32), this method would only train the first 96 samples (i.e. 32 x 3), i.e. wasting 14 samples. In order to avoid this, I'd like to modify the code as follows (notice change to the final if statement):
scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
Is this correct and really that simple, or will this have any side effects? It seems very simple to me, but I've never seen it done before. Any help appreciated!
As Lucas Ramos already mentioned, when using DataLoader where the underlying dataset's size is not divisible by the batch size, the default behavior is to have a smaller last batch:
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
Your plan is basically implementing gradient accumulation combined with drop_last=False - that is having the last batch smaller than all others.
Therefore, in principle there's nothing wrong with training with varying batch sizes.
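For illustration, a minimal sketch (with a made-up toy dataset) of how the default drop_last=False yields a smaller final batch:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 110 samples, batch size 8 -> 13 full batches plus one batch of 6.
ds = TensorDataset(torch.randn(110, 3), torch.randint(0, 2, (110,)))
loader = DataLoader(ds, batch_size=8, drop_last=False)
print([inputs.shape[0] for inputs, _ in loader])  # [8, 8, ..., 8, 6]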
However, there is something you need to fix in your code:
The loss is averaged over the mini-batch, so if you process mini-batches in the usual way you do not need to worry about it. When accumulating gradients, however, you do this averaging explicitly by dividing the loss by iters_to_accumulate:
loss = loss / iters_to_accumulate
For the last, smaller accumulation group you need to change this divisor to reflect the smaller number of accumulated batches!
I propose the revised code below, which breaks the training loop in two: an outer loop over effective mini-batches, and an inner loop that accumulates gradients for one such mini-batch. Note how using an iterator over the DataLoader makes it straightforward to split the loop this way:
scaler = GradScaler()

for epoch in epochs:
    bi = 0  # index of the first DataLoader batch in the current accumulation group
    # outer loop over effective mini-batches
    data_iter = iter(data)
    while bi < len(data):
        # determine the range of DataLoader batches for this group
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the batches of the group - accumulating gradients
        for i in range(bi, nbi):
            input, target = next(data_iter)  # Python 3: use next(...) on the DataLoader iterator
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
                loss = loss / (nbi - bi)  # divide by the actual number of accumulated batches
            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done with the inner loop - gradients were accumulated, so we can take an optimization step.
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi
I was pretty sure I had seen this done before. Check out this code from PyTorch Lightning (the functions _accumulated_batches_reached, _num_training_batches_reached and should_accumulate).

Cannot improve model accuracy

I am building a general-purpose NN that would classify images (Dog/No Dog) and movie reviews (Good/Bad). I have to stick to a very specific architecture and loss function, so changing these two seems out of the question. My architecture is a two-layer network with ReLU followed by a sigmoid and a cross-entropy loss function. With 1000 epochs and a learning rate of around .001 I am getting 100 percent training accuracy and .72 testing accuracy. I am looking for suggestions to improve my testing accuracy. This is the layout of what I have:
def train_net(epochs, batch_size, train_x, train_y, model_size, lr):
    n_x, n_h, n_y = model_size
    model = Net(n_x, n_h, n_y)
    optim = torch.optim.Adam(model.parameters(), lr=0.005)
    loss_function = nn.BCELoss()
    train_losses = []
    accuracy = []
    for epoch in range(epochs):
        count = 0
        model.train()
        train_loss = []
        batch_accuracy = []
        for idx in range(0, train_x.shape[0], batch_size):
            batch_x = torch.from_numpy(train_x[idx : idx + batch_size]).float()
            batch_y = torch.from_numpy(train_y[:, idx : idx + batch_size]).float()
            model_output = model(batch_x)
            batch_accuracy = []
            loss = loss_function(model_output, batch_y)
            train_loss.append(loss.item())
            preds = model_output > 0.5
            nb_correct = (preds == batch_y).sum()
            count += nb_correct.item()
            optim.zero_grad()
            loss.backward()
            # Scheduler made it worse
            # scheduler.step(loss.item())
            optim.step()
        if epoch % 100 == 1:
            train_losses.append(train_loss)
            print("Iteration : {}, Training loss: {} ,Accuracy %: {}".format(epoch, np.mean(train_loss), (count / train_x.shape[0]) * 100))
    plt.plot(np.squeeze(train_losses))
    plt.ylabel('loss')
    plt.xlabel('iterations (per tens)')
    plt.title("Learning rate =" + str(lr))
    plt.show()
    return model
My model parameters:
batch_size = 32
lr = 0.0001
epochs = 1500
n_x = 12288 # num_px * num_px * 3
n_h = 7
n_y = 1
model_size=n_x,n_h,n_y
model = train_net(epochs, batch_size, train_x, train_y, model_size, lr)
and this is the testing phase.
model.eval()  # Setting the model to eval mode, hence making it deterministic.
test_loss = []
count = 0
loss_function = nn.BCELoss()
for idx in range(0, test_x.shape[0], batch_size):
    with torch.no_grad():
        batch_x = torch.from_numpy(test_x[idx : idx + batch_size]).float()
        batch_y = torch.from_numpy(test_y[:, idx : idx + batch_size]).float()
        model_output = model(batch_x)
        preds = model_output > 0.5
        loss = loss_function(model_output, batch_y)
        test_loss.append(loss.item())
        nb_correct = (preds == batch_y).sum()
        count += nb_correct.item()
print("test loss: {}, test accuracy: {}".format(np.mean(test_loss), count / test_x.shape[0]))
Things I have tried:
Messing around with the learning rate, adding momentum, using schedulers, and changing batch sizes. Of course, these were mainly guesses and not based on any valid assumptions.
The issue you're facing is overfitting: with 100% accuracy on the training set, your model is effectively memorizing the training data and then failing to generalize to unseen samples. The good news is that this is a very common challenge.
You need regularization. One method is dropout, whereby on different training epochs a random set of the NN connections are dropped, forcing the network to "learn" alternate pathways and weights, and softening sharp peaks in parameter space. Since you need to keep your architecture and loss function the same, you won't be able to add such an option in (though for completeness, read this article for a description and implementation of dropout in PyTorch).
Given your constraints, you'll want to use something like L2 or L1 weight regularization. This typically shows up in the way of adding an additional term to the cost/loss function, which penalizes large weights. In PyTorch, L2 regularization is implemented via the torch.optim construct, with the option weight_decay. (See documentation: torch.optim, search for 'L2')
For your code, try something like:
def train_net(epochs, batch_size, train_x, train_y, model_size, lr):
    ...
    optim = torch.optim.Adam(model.parameters(), ..., weight_decay=0.01)
    ...
Based on your statement that your training accuracy is 100%, while your testing accuracy is significantly lower at 72%, it seems that you are significantly overfitting your dataset.
In short, this means that your model is training itself too specifically to the training data that you've given it, picking up on quirks that may exist in the training data but which are not inherent to the classification. For example, if the dogs in your training data were all white, the model would eventually learn to associate the color white with dogs, and be hard-pressed to recognize dogs of other colors given to it in the test data set.
There are many avenues to address this issue: a well sourced overview of the subject written in simple terms can be found here.
Without more information on the specific constraints you have around changing the architecture of the neural network, it's tough to say for sure what you will and will not be able to change. However, weight regularization and dropout are often used to great effect (and are described in the article above). You should also be free to apply early stopping and a weight constraint to the model.
I'll leave it to you to find resources on how to implement these specific strategies in PyTorch, but this should provide a good jumping-off point.
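For instance, here is a minimal early-stopping sketch; the train_step and validate callables, the patience value, and the checkpoint filename are placeholders rather than anything from the question's code:

import torch

def fit_with_early_stopping(model, train_step, validate, max_epochs=1500, patience=10):
    # train_step(): one pass over the training data; validate(): loss on held-out data.
    best_val, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = validate()
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep the best weights so far
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # no improvement for `patience` consecutive epochs
                break
    model.load_state_dict(torch.load("best_model.pt"))
    return model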

Average error and standard deviation of error within epoch not correctly updating - PyTorch

I am attempting to use Stochastic Gradient Descent but I am unsure as to why my error/loss is not decreasing. The information I am using from the train dataframe is the index (each sequence) and the binding affinity, and the goal is to predict the binding affinity. Here is what the head of the dataframe looks like:
For the training, I make a one-hot of a sequence and calculate a score with another matrix, and the goal is to get this score to be as close to the binding affinity as possible (for any given peptide). How I calculate the score and my training loop is shown in my code below but I don't think an explanation is necessary to solve why my error fails to decrease.
# Imports assumed by the snippet below (Var is presumed to alias torch.autograd.Variable)
import statistics
import numpy as np
import torch
from torch import nn, optim
from torch.autograd import Variable as Var

# ONE-HOT ENCODING
AA = ['A','R','N','D','C','Q','E','G','H','I','L','K','M','F','P','S','T','W','Y','V']
loc = ['N','2','3','4','5','6','7','8','9','10','11','C']
aa = "ARNDCQEGHILKMFPSTWYV"

def p_one_hot(seq):
    c2i = dict((c, i) for i, c in enumerate(aa))
    int_encoded = [c2i[char] for char in seq]
    onehot_encoded = list()
    for value in int_encoded:
        letter = [0 for _ in range(len(aa))]
        letter[value] = 1
        onehot_encoded.append(letter)
    return torch.Tensor(np.transpose(onehot_encoded))

# INITIALIZE TENSORS
a = Var(torch.randn(20, 1), requires_grad=True)  # initialize similarity matrix - random array of 20 numbers
freq_m = Var(torch.randn(12, 20), requires_grad=True)
freq_m.data = (freq_m.data - freq_m.min().data) / (freq_m.max().data - freq_m.min().data)  # 0 to 1 scaling
optimizer = optim.SGD([torch.nn.Parameter(a), torch.nn.Parameter(freq_m)], lr=1e-6)
loss = nn.MSELoss()

# TRAINING LOOP
epochs = 100
for i in range(epochs):
    # RANDOMLY SAMPLE DATA
    train = all_seq.sample(frac=.03)
    names = train.index.values.tolist()
    affinities = train['binding_affinity']
    print('Epoch: ' + str(i))
    # forward pass
    iteration_loss = []
    for j, seq in enumerate(names):
        sm = torch.mm(a, a.t())  # make similarity matrix square symmetric
        freq_m.data = freq_m.data / freq_m.data.sum(1, keepdim=True)  # each row must sum to 1 (probabilities of each amino acid at each position)
        affin_score = affinities[j]
        new_m = torch.mm(p_one_hot(seq), freq_m)
        tss_m = new_m * sm
        tss_score = tss_m.sum()
        sms = sm
        fms = freq_m
        error = loss(tss_score, torch.FloatTensor(torch.Tensor([affin_score])))
        iteration_loss.append(error.item())
        optimizer.zero_grad()
        error.backward()
        optimizer.step()
    mean = statistics.mean(iteration_loss)
    stdev = statistics.stdev(iteration_loss)
    print('Epoch Average Error: ' + str(mean) + '. Epoch Standard Deviation: ' + str(stdev))
    iteration_loss.clear()
After each epoch, I print out the average of all errors for that epoch as well as the standard deviation. Each epoch runs through about 45,000 sequences. However, after 10 epochs I'm still not seeing any improvement with my error and I'm unsure as to why. Here is the output I am seeing:
Are there any ideas as to what I'm doing wrong? I'm new to PyTorch so any help is appreciated! Thank you!
It turns out that wrapping the tensors in torch.nn.Parameter() when constructing the optimizer hands the optimizer new objects, so the updates never reach a and freq_m; removing the wrapping and passing the tensors directly now shows a decreasing error.
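For reference, a minimal sketch of that fix under the question's setup: create the leaf tensors directly with requires_grad=True and hand those same objects to the optimizer.

import torch
from torch import optim

# Create the leaf tensors that also appear in the forward computation.
a = torch.randn(20, 1, requires_grad=True)
freq_m = torch.randn(12, 20, requires_grad=True)

# Register the very same tensors with the optimizer. Wrapping them in
# torch.nn.Parameter(...) here would hand the optimizer different objects,
# whose .grad never gets populated, so no update would ever be applied.
optimizer = optim.SGD([a, freq_m], lr=1e-6)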

Dataset enumeration (epoch and batchSize) when implementing Ray.Tune PBT hyperparameter optimization

This is my first time trying to use Ray.Tune for hyperparameter optimization. I am confused as to where in the Ray code I should initialize the dataset as well as where to put the for-loops for defining the epoch and enumerating the dataset batches.
Background
In my normal training script, I follow several steps:
1. Parse the model options,
2. Initialize the dataset,
3. Create and initialize the model,
4. For-loop for progressing through the epochs,
5. Nested for-loop for enumerating the dataset batches
The Ray.Tune documentation says that when defining the Trainable class object, I really only need _setup, _train, _save, and _restore. As I understand it, _train() performs a single iteration and increments training_iteration automatically. Given that the dataset_size may not be cleanly divisible by the batchSize, I calculate the total_steps as training progresses. If I understand it correctly, my total_steps will not be equal to training_iteration. This is important because the number of steps is supposed to be used to determine when to evaluate the worker.
I also do not want to instantiate the dataset for each worker individually. Ray should instantiate the dataset once, and then the workers can access the data via shared memory.
Original train.py code
self.opt = TrainOptions().parse()
data_loader = CreateDataLoader(self.opt)
self.dataset = data_loader.load_data()
self.dataset_size = len(data_loader)
total_steps = 0
counter = 0

for epoch in range(self.opt.starting_epoch, self.opt.niter + self.opt.niter_decay + 1):
    for i, data in enumerate(self.dataset):
        total_steps += self.opt.batchSize if i < len(self.dataset) else (self.dataset_size * (epoch + 1)) - total_steps
        counter += 1
        self.model.set_input(data, self.opt)
        self.model.optimizeD()
        if counter % self.opt.critic_iters == 0:
            self.model.optimizeG()
The training_iteration is just a logical unit of training. It would not be a problem to use one epoch per "training_iteration".
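For illustration, a rough sketch using the older underscore-method Trainable API the question refers to; the model factory, checkpoint helpers, metric name, and config keys below are placeholders, not actual Ray or project APIs:

from ray import tune

class MyTrainable(tune.Trainable):
    def _setup(self, config):
        self.opt = config["opt"]                  # parsed options passed in via config (placeholder key)
        data_loader = CreateDataLoader(self.opt)  # same dataset factory as in train.py
        self.dataset = data_loader.load_data()
        self.model = create_model(self.opt)       # placeholder: your model factory

    def _train(self):
        # One epoch per training_iteration: loop over all batches once.
        for i, data in enumerate(self.dataset):
            self.model.set_input(data, self.opt)
            self.model.optimizeD()
            if (i + 1) % self.opt.critic_iters == 0:
                self.model.optimizeG()
        return {"dummy_metric": 0.0}              # placeholder; Tune records training_iteration automatically

    def _save(self, checkpoint_dir):
        return self.model.save(checkpoint_dir)    # placeholder checkpoint hook

    def _restore(self, checkpoint_path):
        self.model.load(checkpoint_path)          # placeholder restore hook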

How to calculate perplexity of RNN in tensorflow

I'm running the word RNN implementation for TensorFlow from Word RNN.
How do I calculate the perplexity of the RNN?
Following is the code in training that shows training loss and other things in each epoch:
for e in range(model.epoch_pointer.eval(), args.num_epochs):
    sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** e)))
    data_loader.reset_batch_pointer()
    state = sess.run(model.initial_state)
    speed = 0
    if args.init_from is None:
        assign_op = model.batch_pointer.assign(0)
        sess.run(assign_op)
        assign_op = model.epoch_pointer.assign(e)
        sess.run(assign_op)
    if args.init_from is not None:
        data_loader.pointer = model.batch_pointer.eval()
        args.init_from = None
    for b in range(data_loader.pointer, data_loader.num_batches):
        start = time.time()
        x, y = data_loader.next_batch()
        feed = {model.input_data: x, model.targets: y, model.initial_state: state,
                model.batch_time: speed}
        summary, train_loss, state, _, _ = sess.run([merged, model.cost, model.final_state,
                                                     model.train_op, model.inc_batch_pointer_op], feed)
        train_writer.add_summary(summary, e * data_loader.num_batches + b)
        speed = time.time() - start
        if (e * data_loader.num_batches + b) % args.batch_size == 0:
            print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"
                  .format(e * data_loader.num_batches + b,
                          args.num_epochs * data_loader.num_batches,
                          e, train_loss, speed))
        if (e * data_loader.num_batches + b) % args.save_every == 0 \
                or (e == args.num_epochs - 1 and b == data_loader.num_batches - 1):  # save for the last result
            checkpoint_path = os.path.join(args.save_dir, 'model.ckpt')
            saver.save(sess, checkpoint_path, global_step=e * data_loader.num_batches + b)
            print("model saved to {}".format(checkpoint_path))
train_writer.close()
The project you are referencing uses sequence_loss_by_example, which returns the cross-entropy loss. So to calculate the training perplexity, you just need to exponentiate the loss, as explained here.
train_perplexity = tf.exp(train_loss)
We have to use e instead of 2 as the base, because TensorFlow measures the cross-entropy loss with the natural logarithm (TF Documentation). Thank you, Matthias Arro and Colin Skow, for the hint.
Detailed Explanation
The cross-entropy of two probability distributions P and Q tells us the minimum average number of bits we need to encode events of P, when we develop a coding scheme based on Q. So, P is the true distribution, which we usually don't know. We want to find a Q as close to P as possible, so that we can develop a nice coding scheme with as few bits per event as possible.
I shouldn't say bits, because we can only use bits as a measure if we use base 2 in the calculation of the cross-entropy. But TensorFlow uses the natural logarithm, so instead let's measure the cross-entropy in nats.
So let's say we have a bad language model that says every token (character / word) in the vocabulary is equally probable to be the next one. For a vocabulary of 1000 tokens, this model will have a cross-entropy of log(1000) = 6.9 nats. When predicting the next token, it has to choose uniformly between 1000 tokens at each step.
A better language model will determine a probability distribution Q that is closer to P. Thus, the cross-entropy is lower - we might get a cross-entropy of 3.9 nats. If we now want to measure the perplexity, we simply exponentiate the cross-entropy:
exp(3.9) = 49.4
So, on the samples for which we calculated the loss, the good model was as perplexed as if it had to choose uniformly and independently among roughly 50 tokens.
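As a quick numeric sanity check of that relationship (plain Python, not from the original post):

import math

vocab_size = 1000
uniform_ce = math.log(vocab_size)  # cross-entropy of the uniform model: ~6.9 nats
print(math.exp(uniform_ce))        # ~1000: as hard as choosing uniformly among 1000 tokens

better_ce = 3.9                    # nats, the better model from the example above
print(math.exp(better_ce))         # ~49.4: effectively choosing among ~50 tokens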
It depends on whether your loss function gives you the log likelihood of the data in base 2 or base e. This model uses legacy_seq2seq.sequence_loss_by_example, which relies on TensorFlow's cross-entropy, which in turn is computed with logarithms of base e. Therefore, even though we're dealing with a discrete probability distribution (text), we should exponentiate with e, i.e. use tf.exp(train_loss), as Colin Skow suggested.
