I have created the following Python program which, as far as I understand CTC, should be a valid CTC-based model, along with training data for it. The best documentation I can find is the CNTK_208_Speech_CTC tutorial, which is what I've based this on. The program is as simple as I could make it: it relies only on numpy and CNTK, and it generates the data itself.
When I run this, I get the following error:
Validating --> ForwardBackward2850 = ForwardBackward (LabelsToGraph2847, StableSigmoid2703) : [5 x labelAxis1], [5 x inputAxis1] -> []
RuntimeError: The Matrix dimension in the ForwardBackwardNode operation does not match.
This seems to be the same issue as the one in this ticket: https://github.com/Microsoft/CNTK/issues/2156
Here is the Python program:
# cntk_ctc_hello_world.py
#
# This is a "hello world" example of using CTC (Connectionist Temporal Classification) with CNTK.
#
# The input is a sequence of vectors of size 17. We use 17 because it's easy to spot that number in
# error messages. The output is a string of codes, each code being one of 4 possible characters from
# our alphabet that we'll refer to here as "ABCD", although they're actually just represented
# by the numbers 0..3, which is typical for classification systems. To make the setup of training data
# trivial, we assign the first four elements of our 17-dimension input vector to the four characters
# of our alphabet, so that the matching is:
# 10000000000000000 A
# 01000000000000000 B
# 00100000000000000 C
# 00010000000000000 D
# In our input sequences, we repeat each code three to five times, followed by three to five codes
# containing random noise. Whether it's repeated 3,4, or 5 times, is random for each code and each
# spacer. When we emit one of our codes, we fill the first 4 values with the code, and the remaining
# 13 values with random noise.
# For example:
# Input: AAA-----CCCC---DDDDD
# Output: ACD
import cntk as C
import numpy as np
import random
import sys
InputDim = 17
NumClasses = 4 # A,B,C,D
MinibatchSize = 100
MinibatchPerEpoch = 50
NumEpochs = 10
MaxOutputSeqLen = 10 # ABCDABCDAB
inputAxis = C.Axis.new_unique_dynamic_axis('inputAxis')
labelAxis = C.Axis.new_unique_dynamic_axis('labelAxis')
inputVar = C.sequence.input_variable((InputDim), sequence_axis=inputAxis, name="input")
labelVar = C.sequence.input_variable((NumClasses+1), sequence_axis=labelAxis, name="labels")
# Construct an LSTM-based model that will perform the classification
with C.default_options(activation=C.sigmoid):
    classifier = C.layers.Sequential([
        C.layers.For(range(3), lambda: C.layers.Recurrence(C.layers.LSTM(128))),
        C.layers.Dense(NumClasses + 1)
    ])(inputVar)
criteria = C.forward_backward(C.labels_to_graph(labelVar), classifier, blankTokenId=NumClasses, delayConstraint=3)
err = C.edit_distance_error(classifier, labelVar, squashInputs=True, tokensToIgnore=[NumClasses])
lr = C.learning_rate_schedule([(3, .01), (1,.001)], C.UnitType.sample)
mm = C.momentum_schedule([(1000, 0.9), (0, 0.99)], MinibatchSize)
learner = C.momentum_sgd(classifier.parameters, lr, mm)
trainer = C.Trainer(classifier, (criteria, err), learner)
# Return a numpy array of 17 elements, for this code
def make_code(code):
    a = np.zeros(NumClasses)                   # 0,0,0,0
    v = np.random.rand(InputDim - NumClasses)  # 13x random
    a = np.concatenate((a, v))
    a[code] = 1
    return a
def make_noise_code():
    return np.random.rand(InputDim)
def make_onehot(code):
    v = np.zeros(NumClasses+1)
    v[code] = 1
    return v
def gen_batch():
    x_batch = []
    y_batch = []
    for mb in range(MinibatchSize):
        yLen = random.randint(1, MaxOutputSeqLen)
        x = []
        y = []
        for i in range(yLen):
            code = random.randint(0, 3)
            y.append(make_onehot(code))
            xLen = random.randint(3, 5)       # Input is 3 to 5 repetitions of the code
            for j in range(xLen):
                x.append(make_code(code))
            spacerLen = random.randint(3, 5)  # Spacer is 3 to 5 repetitions of noise
            for j in range(spacerLen):
                x.append(make_noise_code())
        x_batch.append(np.array(x, dtype='float32'))
        y_batch.append(np.array(y, dtype='float32'))
    return x_batch, y_batch
#######################################################################################
# Dump first X/Y training pair from minibatch
#x, y = gen_batch()
#print("\nx sequence of first sample of minibatch:\n", x[0])
#print("\ny sequence of first sample of minibatch:\n", y[0])
#######################################################################################
progress_printer = C.logging.progress_print.ProgressPrinter(tag='Training', num_epochs=NumEpochs)
for epoch in range(NumEpochs):
    for mb in range(MinibatchPerEpoch):
        x_batch, y_batch = gen_batch()
        trainer.train_minibatch({inputVar: x_batch, labelVar: y_batch})
    progress_printer.epoch_summary(with_metric=True)
For those who are facing this error, there are two points to take note of:
(1) The data provided to the labels sequence tensor passed to labels_to_graph must have the same sequence length as the data coming out of the network output at runtime.
(2) If during model building you change the dynamic sequence axis of the input sequence tensor (e.g. by striding over the sequential axis), then you must call reconcile_dynamic_axes on your labels sequence tensor, with the network_output as the second argument to the function. This tells CNTK that the labels have the same dynamic axis as the network.
Adhering to these two points will allow forward_backward to run.
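As a minimal sketch of point (2) applied to the program above (assuming classifier is the network output; this is illustrative, not a drop-in fix):
# Give the labels the same dynamic axis as the network output
# before building the CTC criterion.
labels_reconciled = C.reconcile_dynamic_axes(labelVar, classifier)
graph = C.labels_to_graph(labels_reconciled)
criteria = C.forward_backward(graph, classifier,
                              blankTokenId=NumClasses, delayConstraint=3)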
Related
I'm trying to implement multiclass logistic regression from scratch. The dataset is MNIST.
I built some functions such as hypothesis, sigmoid, cost function, cost function derivative, and gradient descent. My code is below.
I'm struggling with the following: all images are labeled with the digit they represent, and there are a total of 10 classes.
Inside the gradient descent function, I need to loop through each class, but I do not know how to apply it using the One vs All method.
In other words, what I need to do is:
How to filter each class inside the gradient descent.
After that, how to build a function to predict the test set.
Here is my code.
import numpy as np
import pandas as pd
# Only training data set
# the test data will be load later.
url='https://drive.google.com/file/d/1-MO8oCfq4KU361QeeL4DdafVBhZePUNT/view?usp=sharing'
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url,header = None)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
def hypothesis(X, thetas):
    return sigmoid(X.dot(thetas))  # - 0.0000001
def sigmoid(z):
    return 1/(1+np.exp(-z))
def losscost(X, y, m, thetas):
    h = hypothesis(X, thetas)
    return -(1/m) * ( y.dot(np.log(h)) + (1-y).dot(np.log(1-h)) )
def derivativelosscost(X, y, m, thetas):
    h = hypothesis(X, thetas)
    return (h-y).dot(X)/m
def descendinggradient(X, y, m, epoch, alpha, thetas):
    n = np.size(X, 1)
    J_historico = []
    for i in range(epoch):
        for j in range(0, 10):  # 10 classes
            # How to filter each class in here (inside descendinggradient)?
            # The 2 lines below are wrong.
            #thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
            #J_historico = J_historico + [losscost(X, y, m, thetas)]
            pass  # placeholder so the empty loop body parses
    return [thetas, J_historico]
alpha = 0.01
epoch = 50
thetas = np.zeros((np.size(X, 1), 10))  # initial parameters, one column per class (missing in the original call)
(thetas, J_historico) = descendinggradient(X, y, m, epoch, alpha, thetas)
# After that, how to build a function to predict the test set.
Let me explain this problem step by step:
First, since your code doesn't provide the actual data or a link to it, I've created a random dataset followed by the same commands you used to create X and y:
batch_size = 20
num_classes = 10
rng = np.random.default_rng(seed=42)
df = pd.DataFrame(
    4 * rng.random((batch_size, num_classes + 1)) - 2,  # random values between -2 and 2
    columns=['X0','X1','X2','X3','X4','X5','X6','X7','X8','X9','Y']
)
X = df.values[:, 0:-1]
y = df.values[:, -1]
m = np.size(X, 0)
y = np.array(y).reshape(m, 1)
X = np.c_[ np.ones(m), X ] # Bias
Next, let's take a look at your hypothesis function. If we just run hypothesis and look at the first sample, we get a vector with one entry per class (10 entries). I also needed to provide the initial thetas for this case:
thetas = rng.random((X.shape[1],num_classes))
h = hypothesis(X, thetas)
print(h[0])
>>>[0.89701729 0.90050806 0.98358408 0.81786334 0.96636732 0.97819512
0.89118488 0.87238045 0.70612173 0.30256924]
Basically the function calculates "probabilities"[1] for each class.
At this point we reach the first issue in your code. The sigmoid function returns "probabilities" which are not "connected" to each other. To put those "probabilities" in relation to one another we need another function: softmax. You will find plenty of implementations of this function. In short: it rescales the values so that the sum over all class "probabilities" is 1.
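A minimal sketch of such a softmax (row-wise over the hypothesis output; the max-subtraction is just for numerical stability):
def softmax(z):
    # Shifting by the row max doesn't change the result
    # (softmax is shift-invariant) but avoids overflow in np.exp.
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)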
So for your second question "How to implement a predict after training", we only need to find the argmax value to determine the class:
h = hypothesis(X, thetas)
p = softmax(h) # needs to be implemented
prediction = np.argmax(p, axis=1)
print(prediction)
>>>[2 5 5 8 3 5 2 1 3 5 2 3 8 3 3 9 5 1 1 8]
Now that we know how to predict a class, we also need to know where to set up the training. We want to do this directly after the softmax function. But instead of using the argmax to determine the winning class, we use the cost function and its derivative. The problem in your code: you used the cross-entropy loss for a binary problem. The binary problem also doesn't need the softmax function, because the sigmoid already provides the connection between the two binary classes. Since we are not interested in the value of the multiclass cross-entropy loss itself, only in its derivative, we calculate the derivative directly.
The conversion from binary cross-entropy to multiclass is somewhat unintuitive at first glance; I recommend reading a bit about it before implementing. After that, you basically use your line:
thetas = thetas - alpha * derivativelosscost(X, y, m, thetas)
for updating the thetas.
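Putting it together, here is a sketch of the multiclass update (the one_hot helper and the loop are my own illustration, assuming y holds integer labels 0..9 and softmax as sketched above):
def one_hot(y, num_classes):
    # (m, 1) column of integer labels -> (m, num_classes) indicator matrix
    Y = np.zeros((y.size, num_classes))
    Y[np.arange(y.size), y.astype(int).ravel()] = 1
    return Y

def descendinggradient(X, y, m, epoch, alpha, thetas, num_classes=10):
    Y = one_hot(y, num_classes)         # m x 10 targets
    for i in range(epoch):
        P = softmax(X.dot(thetas))      # m x 10 class "probabilities"
        grad = X.T.dot(P - Y) / m       # derivative of multiclass cross-entropy
        thetas = thetas - alpha * grad  # your update line, all classes at once
    return thetas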
[1] These are not actual probabilities, but that is a completely different topic.
I'm trying to vectorize the following for-loop in Pytorch. I'd be happy with just vectorizing the inner for-loop, but doing the whole batch would also be awesome.
# B: the batch size
# N: the number of training examples
# dim: the dimension of each feature vector
# K: the number of discrete labels. each vector has a single label
# delta: margin for hinge loss
batch_data = torch.tensor(...) # Tensor of shape [B x N x d]
batch_labels = torch.tensor(...) # Tensor of shape [B x N x 1], each element is one of K labels (ints)
batch_losses = [] # Ultimately should be [B x 1]
batch_centroids = [] # Ultimately should be [B x K_i x dim]
for i in range(B):
    data = batch_data[i]                  # [N x dim] examples for this batch element
    labels = batch_labels[i].squeeze(-1)  # [N] labels for this batch element
    centroids = []  # Keep track of the means for each class.
    classes = torch.unique(labels)  # Get the unique labels for the classes.
    # NOTE: The number of classes K for each item in the batch might actually
    # be different. This may complicate batch-level operations.
    total_loss = 0
    # For each class independently. This is the part I want to vectorize.
    for cl in classes:
        # Take the subset of training examples with that label.
        subset = data[torch.where(labels == cl)]
        # Find the centroid of that subset.
        centroid = subset.mean(dim=0)
        centroids.append(centroid)
        # Get the distance between each point in the subset and the centroid.
        dists = subset - centroid
        norm = torch.linalg.norm(dists, dim=1)
        # The loss is the mean of the hinge loss across the subset.
        margin = norm - delta
        hinge = torch.clamp(margin, min=0.0) ** 2
        total_loss += hinge.mean()
    # Keep track of everything. If it's too hard to keep track of centroids, that's also OK.
    loss = total_loss.mean()
    batch_losses.append(loss)
    batch_centroids.append(centroids)
I've been scratching my head on how to deal with the irregularly sized tensors. The number of classes in each batch K_i is different, and the size of each subset is different.
It turns out it actually is possible to vectorize across ragged arrays. I'll use numpy, but the code should be directly translatable to torch. The key technique is to:
1. Sort by ragged array membership
2. Perform an accumulation
3. Find boundary indices, compute adjacent differences
For a single (non-batch) input of an n x d matrix X and an n-length array label, the following returns the k x d centroids and n-length distances to respective centroids:
def inverse_permutation(p):
    # argsort of a permutation yields its inverse
    return np.argsort(p)

def vcentroids(X, label):
    """
    Vectorized version of centroids.
    """
    # order points by cluster label
    ix = np.argsort(label)
    label = label[ix]
    Xz = X[ix]
    # compute pos where pos[i]:pos[i+1] is span of cluster i
    d = np.diff(label, prepend=0)  # nonzero where the label changes
    pos = np.flatnonzero(d)        # indices where labels change
    pos = np.repeat(pos, d[pos])   # repeat for 0-length clusters
    pos = np.append(np.insert(pos, 0, 0), len(X))
    # prepend a zero row so prefix sums line up with pos
    Xz = np.concatenate((np.zeros_like(Xz[0:1]), Xz), axis=0)
    Xsums = np.cumsum(Xz, axis=0)
    Xsums = np.diff(Xsums[pos], axis=0)  # per-cluster sums
    counts = np.diff(pos)                # per-cluster sizes
    c = Xsums / np.maximum(counts, 1)[:, np.newaxis]
    repeated_centroids = np.repeat(c, counts, axis=0)
    aligned_centroids = repeated_centroids[inverse_permutation(ix)]
    dist = np.sum((X - aligned_centroids) ** 2, axis=1)
    return c, dist
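A quick sanity check with made-up data (the arrays here are just for illustration):
X = np.random.rand(12, 3)                               # 12 points in 3 dimensions
label = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 3, 0])  # 4 clusters
c, dist = vcentroids(X, label)
print(c.shape, dist.shape)                              # (4, 3) (12,)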
Batching requires little special handling. For an input B x n x d array batch_X, with B x n batch labels batch_labels, create unique labels for each batch:
batch_k = batch_labels.max(axis=1) + 1
batch_k[1:] = batch_k[:-1]
batch_k[0] = 0
base = np.cumsum(batch_k)
batch_labels += base[:, np.newaxis]  # numpy arrays have no .expand_dims method
So now each batch element has a unique contiguous range of labels: the first batch element has labels in some range [0, k0), the second in [k0, k0 + k1), etc., where k_i is the number of classes in batch element i.
Then just flatten the B x n x d input to (B*n) x d and call the same vectorized method, as sketched below. Your loss can then be derived from the final distances using the same position-array based reduction technique.
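Concretely, the batched call might look like this (a sketch reusing batch_X, batch_labels, and vcentroids from above):
flat_X = batch_X.reshape(-1, batch_X.shape[-1])  # (B*n, d)
flat_labels = batch_labels.reshape(-1)           # (B*n,)
c, dist = vcentroids(flat_X, flat_labels)        # all batch elements at once
dist = dist.reshape(batch_X.shape[0], -1)        # back to (B, n)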
For a detailed explanation of how the vectorization works, see my blog post.
You can vectorize the whole thing if you use a one-hot encoding for your classes and a pairwise distance trick for your norms:
import torch
B = 32
N = 1000
dim = 50
K = 25
batch_data = torch.randn((B, N, dim))
batch_labels = torch.randint(0, K, size=(B, N))
batch_one_hot = torch.nn.functional.one_hot(batch_labels, num_classes=K)
centroids = torch.matmul(
    batch_one_hot.transpose(-1, 1).type(batch_data.dtype),
    batch_data
) / batch_one_hot.sum(1)[..., None]
norms = torch.linalg.norm(batch_data[:, :, None] - centroids[:, None], dim=-1)
# Compute the rest of your loss
# ...
A couple things to watch out for:
You'll get a divide by zero for any batches that have a missing class. You can handle this by first computing the class sums (with matmul) and the counts (summing the one-hot tensor along axis 1) separately. Then mask the sums where the count == 0 and divide the rest by their class counts; see the sketch after this list.
If you have a large number of classes, this will cause memory problems because the one-hot tensor will be too big. In that case, the answer from #VF1 probably makes more sense.
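A minimal sketch of the masking idea from the first point, reusing batch_one_hot and batch_data from above:
one_hot_f = batch_one_hot.type(batch_data.dtype)
class_sums = torch.matmul(one_hot_f.transpose(-1, 1), batch_data)  # [B, K, dim]
counts = one_hot_f.sum(1)[..., None]                               # [B, K, 1]
# Clamp the denominator so missing classes divide by 1 instead of 0,
# and keep a mask so those rows can be ignored downstream.
centroids = class_sums / counts.clamp(min=1)
present = counts.squeeze(-1) > 0                                   # [B, K]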
How do I set up a multi-variate regression problem using Trax?
I get AssertionError: Invalid shape (16, 2); expected (16,). from the code below, coming from the L2Loss object.
The following is my attempt to adapt the sentiment analysis example into a regression problem:
import os
import trax
from trax import layers as tl
from trax.supervised import training
import numpy
import random
#train_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=True)()
#eval_stream = trax.data.TFDS('imdb_reviews', keys=('text', 'label'), train=False)()
def generate_samples():
    # (text, lat/lon)
    data = [
        ("Aberdeen MS", numpy.array((33.824742, -88.554591))),
        ("Aberdeen SD", numpy.array((45.463186, -98.471033))),
        ("Aberdeen WA", numpy.array((46.976432, -123.795781))),
        ("Amite City LA", numpy.array((30.733723, -90.5208))),
        ("Amory MS", numpy.array((33.984789, -88.48001))),
        ("Amouli AS", numpy.array((-14.26556, -170.589772))),
        ("Amsterdam NY", numpy.array((42.953149, -74.19505)))
    ]
    for i in range(1024*8):
        yield random.choice(data)
train_stream = generate_samples()
eval_stream = generate_samples()
model = tl.Serial(
    tl.Embedding(vocab_size=8192, d_feature=256),
    tl.Mean(axis=1),   # Average on axis 1 (length of sentence).
    tl.Dense(2),       # Regress to lat/lon.
    # tl.LogSoftmax()  # Produce log-probabilities.
)
# You can print model structure.
print(model)
print(next(train_stream)) # See one example.
data_pipeline = trax.data.Serial(
    trax.data.Tokenize(vocab_file='en_8k.subword', keys=[0]),
    trax.data.Shuffle(),
    # trax.data.FilterByLength(max_length=2048, length_keys=[0]),
    trax.data.BucketByLength(boundaries=[8, 128],
                             batch_sizes=[256, 64, 4],
                             length_keys=[0]),
    trax.data.AddLossWeights()
)
train_batches_stream = data_pipeline(train_stream)
eval_batches_stream = data_pipeline(eval_stream)
example_batch = next(train_batches_stream)
print(f'shapes = {[x.shape for x in example_batch]}')  # Check the shapes.
# Training task.
train_task = training.TrainTask(
    labeled_data=train_batches_stream,
    # loss_layer=tl.CrossEntropyLoss(),
    loss_layer=tl.L2Loss(),
    optimizer=trax.optimizers.Adam(0.01),
    n_steps_per_checkpoint=500,
)
# Evaluation task.
eval_task = training.EvalTask(
    labeled_data=eval_batches_stream,
    metrics=[tl.L2Loss()],
    n_eval_batches=20  # For less variance in eval numbers.
)
# Training loop saves checkpoints to output_dir.
output_dir = os.path.expanduser('~/output_dir/')
training_loop = training.Loop(model,
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)
# Run 2000 steps (batches).
training_loop.run(2000)
The problem might be in the generate_samples() generator: it yields only 1024*8 (= 8192) samples. If I replace the line
for i in range(1024*8):
by
while True:
so that an infinite number of samples is generated, your example works on my machine.
Since generate_samples() only yields 8192 samples, train_batches_stream only yields 32 batches of 256 samples each, so you can train for at most 32 steps. However, you ask for 2000 steps.
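For reference, the generator with that one-line change (data elided; it's the same list of (text, lat/lon) pairs as in the question):
def generate_samples():
    data = [...]  # the (text, lat/lon) pairs from the question
    while True:   # infinite stream, so the pipeline never runs dry
        yield random.choice(data)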
While attempting to replicate section 3.1 of Incorporating Discrete Translation Lexicons into Neural MT in paddle-paddle,
I tried to create a static matrix that I'll need to load into the seqToseq training pipeline, e.g.:
>>> import numpy as np
>>> x = np.random.rand(3,2)
>>> x
array([[ 0.64077103, 0.03278357],
[ 0.47133411, 0.16309775],
[ 0.63986919, 0.07130613]])
# where there is 3 target words and 2 source words,
# and each cell in the matrix represents some co-occurrence probabilities.
With the seqToseq_net demo, this matrix would need to be multiplied with the attention layer output in gru_decoder_with_attention. The original demo:
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    decoder_mem = memory(name='gru_decoder',
                         size=decoder_size,
                         boot_layer=decoder_boot)
    # This attention context layer would have been
    # a vector of size |src_vocab| x 1
    context = simple_attention(encoded_sequence=enc_vec,
                               encoded_proj=enc_proj,
                               decoder_state=decoder_mem)
    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=context)
        decoder_inputs += full_matrix_projection(input=current_word)
    gru_step = gru_step_layer(name='gru_decoder',
                              input=decoder_inputs,
                              output_mem=decoder_mem,
                              size=decoder_size)
    with mixed_layer(size=target_dict_dim,
                     bias_attr=True,
                     act=SoftmaxActivation()) as out:
        out += full_matrix_projection(input=gru_step)
    return out
The goal is to affect the attention layer by multiplying it with the static matrix:
def gru_decoder_with_attention(enc_vec, enc_proj, current_word):
    decoder_mem = memory(name='gru_decoder',
                         size=decoder_size,
                         boot_layer=decoder_boot)
    # This attention context layer would have been
    # of size |src_vocab| x 1
    context = simple_attention(encoded_sequence=enc_vec,
                               encoded_proj=enc_proj,
                               decoder_state=decoder_mem)
    # This static matrix layer, x, would have been
    # of size |trg_vocab| x |src_vocab|
    static_matrix = some_sort_of_layer(x)
    # This should yield a vector of size
    # |trg_vocab| x 1
    static_matrix_multiply_context = some_sort_of_operation_layer(static_matrix, context)
    with mixed_layer(size=decoder_size * 3) as decoder_inputs:
        decoder_inputs += full_matrix_projection(input=static_matrix_multiply_context)
        decoder_inputs += full_matrix_projection(input=current_word)
I've tried looking through the code in Paddle/python/trainer_config_helpers and walked through all the demo code, and I've also asked on PaddlePaddle's gitter. But I can't find how to load a customized static matrix that doesn't need to be updated during training and have it interact with one of Paddle's layers.
How do I load a matrix to change the attention layer in the seqToseq demo?
What should some_sort_of_layer and some_sort_of_operation_layer be in the above example?
I'm puzzled as to why the code below (the section I labeled "HERE") works, because j+1 should make the list of lists (X_train_folds) go out of range when j reaches the end of the range. Why does this work? Is it because vstack can automatically detect and handle this? I couldn't find any documentation for it though.
num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]
X_train_folds = []
y_train_folds = []
################################################################################
# Split up the training data into folds. After splitting, X_train_folds and #
# y_train_folds should each be lists of length num_folds, where #
# y_train_folds[i] is the label vector for the points in X_train_folds[i]. #
# Hint: Look up the numpy array_split function. #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
# print y_train_folds
# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}
################################################################################
# Perform k-fold cross validation to find the best value of k. For each #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times, #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all #
# values of k in the k_to_accuracies dictionary. #
################################################################################
for k in k_choices:
    k_to_accuracies[k] = []

for k in k_choices:
    print 'evaluating k=%d' % k
    for j in range(num_folds):
        X_train_cv = np.vstack(X_train_folds[0:j] + X_train_folds[j+1:])  #<---- HERE
        X_test_cv = X_train_folds[j]
        #print len(y_train_folds), y_train_folds[0].shape
        y_train_cv = np.hstack(y_train_folds[0:j] + y_train_folds[j+1:])  #<---- HERE
        y_test_cv = y_train_folds[j]
        #print 'Training data shape: ', X_train_cv.shape
        #print 'Training labels shape: ', y_train_cv.shape
        #print 'Test data shape: ', X_test_cv.shape
        #print 'Test labels shape: ', y_test_cv.shape
        classifier.train(X_train_cv, y_train_cv)
        dists_cv = classifier.compute_distances_no_loops(X_test_cv)
        #print 'predicting now'
        y_test_pred = classifier.predict_labels(dists_cv, k)
        num_correct = np.sum(y_test_pred == y_test_cv)
        accuracy = float(num_correct) / num_test
        k_to_accuracies[k].append(accuracy)
################################################################################
# END OF YOUR CODE #
################################################################################
# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print 'k = %d, accuracy = %f' % (k, accuracy)
No, vstack is not causing that; numpy's very powerful indexing is. The internals of numpy are complex, and indexing sometimes returns a copy and other times a view. In either case, though, slicing in particular returns an empty array when the requested range lies entirely outside the array.
See the following example and the corresponding output:
import numpy as np
a = np.array([1, 2, 3])
print(a[10:]) # This will return empty
print(a[10]) # This is an error
The result is:
[]
Traceback (most recent call last):
  File "C:/Users/imactuallyavegetable/temp.py", line 333, in <module>
    print(a[10])
IndexError: index 10 is out of bounds for axis 0 with size 3
First the empty array, then the exception.
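The same mechanism is what saves the cross-validation code: when j is the last fold index, X_train_folds[j+1:] is simply an empty list, so the concatenation passed to vstack contains exactly the remaining folds. A small sketch (the fold contents are made up):
import numpy as np

folds = [np.full((2, 3), i) for i in range(5)]  # stand-ins for X_train_folds
j = 4                                           # the last fold
train = np.vstack(folds[0:j] + folds[j+1:])     # folds[5:] is [], so no error
print(train.shape)                              # (8, 3): folds 0-3 stacked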