Shaping NumPy arrays for a PyTorch GAN - Python

(pre-processing for qiskit QGAN but the use case is somewhat irrelevant)
I'm a bit lost trying to figure out how to preprocess an image dataset before passing it through a GAN. Below is all the relevant code up to my error. This code is derived from https://github.com/Qiskit/qiskit-tutorials/blob/master/legacy_tutorials/aqua/machine_learning/qgans_for_loading_random_distributions.ipynb and has been altered (or so I attempted) to accommodate a different input dataset. (The original generated dummy data of much simpler dimensions.)
# Root directory for dataset
dataroot = "./data/land"
# Number of workers for dataloader
workers = 2
# Batch size during training
batch_size = 128
# Image size
image_size = 64

dataset = dset.ImageFolder(root=dataroot,
                           transform=transforms.Compose([
                               transforms.Resize(image_size),
                               transforms.CenterCrop(image_size),
                               transforms.ToTensor(),
                               transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                           ]))

# Create the dataloader
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                         shuffle=True, num_workers=workers)

real_batch = next(iter(dataloader))
real_batch_arr = [t.numpy() for t in real_batch]

# Set number of qubits per data dimension as list of k qubit values [#q_0, ..., #q_k-1]
num_qubits = [4]
k = len(num_qubits)
num_epochs = 100

# Initialize qGAN
qgan = QGAN(real_batch_arr, bounds=bounds, num_qubits=num_qubits, batch_size=128,
            num_epochs=num_epochs, snapshot_dir=None)
This gives me the following error.
ValueError                                Traceback (most recent call last)
<ipython-input-42-8cba9a74f024> in <module>
      5
      6 # Initialize qGAN
----> 7 qgan = QGAN(real_batch_arr, bounds=bounds, num_qubits=num_qubits, batch_size=128, num_epochs=num_epochs, snapshot_dir=None)
      8 qgan.seed = 1
      9 # Set quantum instance to run the quantum generator

~\Anaconda3\lib\site-packages\qiskit\aqua\algorithms\distribution_learners\qgan.py in __init__(self, data, bounds, num_qubits, batch_size, num_epochs, seed, discriminator, generator, tol_rel_ent, snapshot_dir, quantum_instance)
     99         if data is None:
    100             raise AquaError('Training data not given.')
--> 101         self._data = np.array(data)
    102         if bounds is None:
    103             bounds_min = np.percentile(self._data, 5, axis=0)

ValueError: could not broadcast input array from shape (128,3,64,64) into shape (128)
I understand that at some point the qiskit QGAN class attempts to turn real_batch_arr into a NumPy array (real_batch_arr is a list when passed to QGAN). The error suggests this array is expected to have shape (128); on top of that, based on the original code linked above, QGAN needs to be passed an array, not a list.
My question is how would I be able to transform my list into the array that I need. There also could be something I am simply fundamentally missing. I truly appreciate any advice or comments.

The current implementation of the qGAN algorithm does not support data sets given as tensors. The data must be given either as a flat array or as an array of k-dimensional data points, i.e., the shape of the data should be num_data_samples x dim_data_samples.
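A minimal sketch of that reshaping, assuming real_batch is the [images, labels] pair produced by the DataLoader above (the labels are dropped):

images, labels = real_batch                           # images: (128, 3, 64, 64)
data = images.numpy().reshape(images.shape[0], -1)    # (128, 12288)
qgan = QGAN(data, bounds=bounds, num_qubits=num_qubits, batch_size=128,
            num_epochs=num_epochs, snapshot_dir=None)

Note, though, that num_qubits (and bounds) need one entry per data dimension, so 12288-dimensional image vectors would first have to be reduced to far fewer dimensions before a qGAN of this size could model them.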

How to get a single index from a DataSet in PyTorch?

I want to randomly draw a sample from my test DataSet object to perform a prediction using my trained model.
To achieve this I use this code block which causes the following error:
rng = np.random.default_rng()
ind = rng.integers(0, len(test_ds), (1,))[-1]
I = test_ds[ind]  # Note: I is a list of tensors of equal size
I = [Ik.to(device) for Ik in I]
with torch.no_grad():
    _, y_f_hat, _, y_f = model.forward_F(I)
    y_f_hat = y_f_hat.cpu().numpy().flatten()
    y_f = y_f.cpu().numpy().flatten()
ERROR:

/usr/local/lib/python3.8/dist-packages/torch/nn/modules/flatten.py in forward(self, input)
     44
     45     def forward(self, input: Tensor) -> Tensor:
---> 46         return input.flatten(self.start_dim, self.end_dim)
     47
     48     def extra_repr(self) -> str:

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
There is no problem when using the dataloader:

for I in test_dataloader:
    with torch.no_grad():
        _, y_f_hat, _, y_f = model.forward_F(I)
        y_f_hat = y_f_hat.cpu().numpy().flatten()
        y_f = y_f.cpu().numpy().flatten()
    break
test_ds is the dataset used in test_dataloader.
Notes: running on a Google Colab GPU, Python 3.9.
When you use a DataLoader, it delivers the data as a batch of samples, so the shape of the data coming out of it is (B, ...), where B is the batch size and ... are the remaining dimensions (I do not know what your samples look like; for images, for example, the shape is (B, C, H, W), where C, H, and W are the number of channels, the height, and the width, respectively). This is what PyTorch layers expect: a leading dimension for the batch size.
As a solution, you can call .unsqueeze(0) on each input tensor before feeding it into the model (since I here is a list of tensors, unsqueeze every element):

_, y_f_hat, _, y_f = model.forward_F([Ik.unsqueeze(0) for Ik in I])
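For illustration, here is a minimal sketch (with a hypothetical 1-D sample) of why the missing batch dimension produces exactly this IndexError, and how unsqueeze(0) resolves it:

import torch

x = torch.randn(10)          # a single 1-D sample, no batch dimension
flat = torch.nn.Flatten()    # flattens start_dim=1 onwards by default
# flat(x) raises IndexError: Dimension out of range
#   (expected to be in range of [-1, 0], but got 1)
y = flat(x.unsqueeze(0))     # add the batch dimension first
print(y.shape)               # torch.Size([1, 10])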

Keras won't broadcast-multiply the model output with a mask designed for the entire mini batch

I have a data generator that produces batches of input data (X) and targets (Y), and also a mask (batch_mask) to be applied to the model output (the same mask applies to all the datapoints in a batch; there are different masks for different batches, and the data generator takes care of this).
As a result, the first dimension of batch_mask could have size 1 or batch_size (by repeating the same mask batch_size times along the first dimension). I was expecting Keras to let me use either, and I wanted to simply create masks with size 1 on the first dimension.
However, when I tried this, I got the error:
ValueError: Data cardinality is ambiguous:
x sizes: 128, 1
y sizes: 128
Make sure all arrays contain the same number of samples.
Why won't Keras broadcast along the first dimension? It seems like this should not be complicated.
Here's some minimal example code to observe this behavior
import tensorflow.keras as tfk
import numpy as np

#######################
# 1. model definition #
#######################
# model parameters
nfeatures_in = 6
target_size = 8

# model inputs
input = tfk.layers.Input(nfeatures_in)
input_mask = tfk.layers.Input(target_size)

# model graph
out = tfk.layers.Dense(target_size)(input)
out_masked = tfk.layers.Multiply()((out, input_mask))  # multiply all model outputs in the batch by the same mask
model = tfk.Model(inputs=(input, input_mask), outputs=out_masked)

##########################
# 2. dummy data creation #
##########################
batch_size = 32

# create the mask for the batch
zeros_vector = np.zeros((1, target_size))  # "batch_size" == 1
zeros_vector[0, :6] = 1
batch_mask = zeros_vector

# dummy data creation
X = np.random.randn(batch_size, 6)
Y = np.random.randn(batch_size, target_size) * batch_mask  # the target is masked by design in each batch

############################
# 3. compile model and fit #
############################
model.compile(optimizer="Adam", loss="mse")
model.fit((X, batch_mask), Y, batch_size=batch_size)
I know I could make this work by either:
- repeating the mask so that the first dimension of batch_mask matches the first dimension of X (instead of being 1), or
- using pure TensorFlow (but I feel like broadcasting along the batch dimension should not be a problem for Keras).
How can I make this work with Keras?
Thank you!
You can create an IdentityLayer which receives the batch_mask as an external input parameter and returns it as a tensor.
import tensorflow as tf

class IdentityLayer(tfk.layers.Layer):
    def __init__(self, my_mask, **kwargs):
        super(IdentityLayer, self).__init__()
        self.my_mask = my_mask

    def call(self, _):
        my_mask = tf.convert_to_tensor(self.my_mask, dtype=tf.float32)
        return my_mask

    def get_config(self):
        config = super().get_config()
        config.update({
            "my_mask": self.my_mask,
        })
        return config
The usage of IdentityLayer in a model is straightforward:

# model inputs
input = tfk.layers.Input(nfeatures_in)
input_mask = IdentityLayer(batch_mask)(input)

# model graph
out = tfk.layers.Dense(target_size)(input)
out_masked = tfk.layers.Multiply()((out, input_mask))
model = tfk.Model(inputs=input, outputs=out_masked)

Where batch_mask is a NumPy array created as you reported:

zeros_vector = np.zeros((1, target_size))  # "batch_size" == 1
zeros_vector[0, :6] = 1
batch_mask = zeros_vector
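Since the mask is now baked into the graph, fit() only receives X and Y, whose first dimensions agree, so the cardinality check passes. A minimal usage sketch with the dummy data from the question:

model.compile(optimizer="Adam", loss="mse")
model.fit(X, Y, batch_size=batch_size)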
The solution is to (properly) use a DataGenerator.
See the gist with the working code: https://gist.github.com/iranroman/2aaecf5b5621051df6b1b6b5394e5ef3
Thank you @Marco Cerliani for the discussion that led to figuring out the solution.

ValueError: Error when checking input: expected input_1 to have shape (168, 5) but got array with shape (5808, 5)

I'm trying to implement a hybrid LSTM-DNN forecaster with multiple inputs using the code from Hvass-Labs Time Series tutorial #23. Basically I want to forecast day-ahead prices (just 24 time steps into the future for now) of electricity using sequential and non-sequential data. The model has two sets of inputs: an LSTM for the sequential data and a Dense layer for the non-sequential data, with their outputs concatenated. It looks like this: https://imgur.com/a/x15FfIy
Basically whenever I try to fit the model after one epoch it shows this error:
UPDATE:
ValueError: Error when checking input: expected input_1 to have shape (168, 5) but got array with shape (5808, 5)
The changes I have implemented:
# Chop x_test_scaled into two parts:
x_test1_scaled = x_test_scaled[:, 0:5]   # shape is (5808, 5)
x_test2_scaled = x_test_scaled[:, 5:12]  # shape is (5808, 7)

validation_data = ([np.expand_dims(x_test1_scaled, axis=0),
                    np.expand_dims(x_test2_scaled, axis=0)],
                   np.expand_dims(y_test_scaled, axis=0))
I'm confused because I have indeed assigned the generator to the generator argument of model.fit_generator, and I'm not passing x_test1_scaled, which does have the shape (5808, 5). edit: (not validation_data)
%%time
model.fit_generator(generator=generator,
                    epochs=10,
                    steps_per_epoch=30,
                    validation_data=validation_data,
                    callbacks=callbacks)
If this helps, this is my model:
# first input model
input_1 = Input(shape=(168, 5))
dense_1 = Dense(50)(input_1)

# second input model
input_2 = Input(shape=(168, 7))
lstm_1 = LSTM(units=64, return_sequences=True, input_shape=(None, 7))(input_2)

# merge input models
merge = concatenate([dense_1, lstm_1])
output = Dense(num_y_signals, activation='sigmoid')(merge)
model = Model(inputs=[input_1, input_2], outputs=output)

# summarize layers
print(model.summary())
EDIT: Cleared this problem; it was replaced by the error at the top.
Thus far I've managed everything up to actually fitting the model.
Whenever an epoch finishes, however, it runs into this error:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[[0.4 , 0.44444442, 0. , ..., 0.1734707 ,
0.07272629, 0.07110982],
[0.3904762 , 0.43434343, 0.04347826, ..., 0.1740398 ,
0.07282589, 0.06936309],
...
I have tried the solutions from other Stack Exchange posts with the same error message. They weren't successful, but I was eventually able to isolate the problem array to the validation_data. I just don't know how to "reshape" it into the required 2 arrays.
The batch generator (I have already included the two sets of inputs, x_batch_1 and x_batch_2):
def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """
    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)

        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)

        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)

            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]

        x_batch_1 = x_batch[:, :, 0:5]
        x_batch_2 = x_batch[:, :, 5:12]

        yield ([x_batch_1, x_batch_2], y_batch)

batch_size = 32
sequence_length = 24 * 7
generator = batch_generator(batch_size=batch_size,
                            sequence_length=sequence_length)
Validation set:
validation_data = np.expand_dims(x_test_scaled, axis=0), np.expand_dims(y_test_scaled, axis=0)
And lastly the model fit:
%%time
model.fit_generator(generator=generator,
                    epochs=10,
                    steps_per_epoch=30,
                    validation_data=validation_data,
                    callbacks=callbacks)
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[[0.4 , 0.44444442, 0. , ..., 0.1734707 ,
0.07272629, 0.07110982],
[0.3904762 , 0.43434343, 0.04347826, ..., 0.1740398 ,
0.07282589, 0.06936309],
...
The array in the message is the same one as the validation_data. Another thing is that the error shows up whenever the first epoch finishes, which strengthens the case that the problem is the validation_data.
It's because your model needs 2 sets of inputs, x_batch_1 and x_batch_2, as produced in your batch_generator, while your validation_data contains only one input array, np.expand_dims(x_test_scaled, axis=0).
You need to make validation_data look like the output of batch_generator, probably [np.expand_dims(x_test1_scaled, axis=0), np.expand_dims(x_test2_scaled, axis=0)], np.expand_dims(y_test_scaled, axis=0).
In case you still don't understand, please provide information about x_test1_scaled, like its shape or how you load it.
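Given the follow-up error in the update (input_1 expects (168, 5)), the time dimension also has to match the model's input shape, not just the number of arrays. A hedged sketch, assuming x_test_scaled and y_test_scaled are the 2-D arrays described above, that slices the test series into 168-step windows:

seq_len = 168
n_win = x_test_scaled.shape[0] // seq_len                  # 5808 // 168 = 34
x_win = x_test_scaled[:n_win * seq_len].reshape(n_win, seq_len, -1)
y_win = y_test_scaled[:n_win * seq_len].reshape(n_win, seq_len, -1)
# Split the windows the same way the generator splits x_batch:
validation_data = ([x_win[:, :, 0:5], x_win[:, :, 5:12]], y_win)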

Program hangs on Estimator.evaluate in Tensorflow 1.6

As a learning tool, I am trying to do something simple.
I have two training CSV files:
- One file with 36 columns (3500 records) of 0s and 1s. I am envisioning this file as a flattened 6x6 matrix.
- Another CSV file with 1 column of ground truth, 0 or 1 (3500 records), which indicates whether at least 4 of the 6 elements on the 6x6 matrix's diagonal are 1s.
I also have two test CSV files which are the same structure as the training files except there are 500 records in each.
When I step through the program using the debugger, it appears that the...
estimator.train(
    input_fn=lambda: get_inputs(x_paths=[x_train_file], y_paths=[y_train_file], batch_size=32),
    steps=100)
... runs OK. I see files in the checkpoint directory and see a loss function graph in Tensorboard.
But when the program gets to...
eval_result = estimator.evaluate(
    input_fn=lambda: get_inputs(x_paths=[x_test_file], y_paths=[y_test_file], batch_size=32))
... it just hangs.
I have checked the test files, and I also tried running estimator.evaluate using the training files. It still hangs.
I am using TensorFlow 1.6, Python 3.6
The following is all of the code:
import tensorflow as tf
import os
import numpy as np

x_train_file = os.path.join('D:', 'Diag', '6x6_train.csv')
y_train_file = os.path.join('D:', 'Diag', 'HasDiag_train.csv')
x_test_file = os.path.join('D:', 'Diag', '6x6_test.csv')
y_test_file = os.path.join('D:', 'Diag', 'HasDiag_test.csv')
model_chkpt = os.path.join('D:', 'Diag', "checkpoints")

def get_inputs(
        count=None, shuffle=True, buffer_size=1000, batch_size=32,
        num_parallel_calls=8, x_paths=[x_train_file], y_paths=[y_train_file]):
    """
    Get x, y inputs.

    Args:
        count: number of epochs. None indicates infinite epochs.
        shuffle: whether or not to shuffle the dataset
        buffer_size: used in shuffle
        batch_size: size of batch. See outputs below
        num_parallel_calls: used in map. Note if > 1, intra-batch ordering
            will be shuffled
        x_paths: list of paths to x-value files.
        y_paths: list of paths to y-value files.

    Returns:
        x: (batch_size, 6, 6) tensor
        y: (batch_size, 2) tensor of 1-hot labels
    """

    def x_map(line):
        n_dims = 6
        columns = [str(i1) for i1 in range(n_dims**2)]
        # Decode the line into its fields
        fields = tf.decode_csv(line, record_defaults=[[0]] * (n_dims ** 2))
        # Pack the result into a dictionary
        features = dict(zip(columns, fields))
        return features

    def y_map(line):
        y_row = tf.string_to_number(line, out_type=tf.int32)
        return y_row

    def xy_map(x, y):
        return x_map(x), y_map(y)

    # Note: the training files are hard-coded here even though x_paths/y_paths
    # are accepted, so evaluation also reads the training data.
    x_ds = tf.data.TextLineDataset(x_train_file)
    y_ds = tf.data.TextLineDataset(y_train_file)

    combined = tf.data.Dataset.zip((x_ds, y_ds))
    combined = combined.repeat(count=count)
    if shuffle:
        combined = combined.shuffle(buffer_size)
    combined = combined.map(xy_map, num_parallel_calls=num_parallel_calls)
    combined = combined.batch(batch_size)
    x, y = combined.make_one_shot_iterator().get_next()
    return x, y

columns = [str(i1) for i1 in range(6 ** 2)]
feature_columns = [
    tf.feature_column.numeric_column(name)
    for name in columns]

estimator = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                       hidden_units=[18, 9],
                                       activation_fn=tf.nn.relu,
                                       n_classes=2,
                                       model_dir=model_chkpt)

estimator.train(
    input_fn=lambda: get_inputs(x_paths=[x_train_file], y_paths=[y_train_file], batch_size=32),
    steps=100)

eval_result = estimator.evaluate(
    input_fn=lambda: get_inputs(x_paths=[x_test_file], y_paths=[y_test_file], batch_size=32))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
There are two parameters that are causing this:
tf.data.Dataset.repeat has a count parameter:

count: (Optional.) A tf.int64 scalar tf.Tensor, representing the number of times the dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.

In your case, count is always None, so the dataset is repeated indefinitely.
tf.estimator.Estimator.evaluate has the steps parameter:

steps: Number of steps for which to evaluate model. If None, evaluates until input_fn raises an end-of-input exception.

steps is set for the training, but not for the evaluation; as a result, the estimator runs until input_fn raises an end-of-input exception, which, as described above, never happens.
You should set either of those; I think count=1 is the most reasonable for evaluation.
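A minimal sketch of either fix, using the get_inputs signature from the question:

# Option 1: repeat the evaluation dataset exactly once
eval_result = estimator.evaluate(
    input_fn=lambda: get_inputs(count=1, shuffle=False,
                                x_paths=[x_test_file], y_paths=[y_test_file],
                                batch_size=32))

# Option 2: cap the number of evaluation steps instead
# (500 test records / batch_size 32 -> 16 steps)
# eval_result = estimator.evaluate(input_fn=..., steps=16)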

reshape Error/ValueError: total size of new array must be unchanged

I have code for image classification using a CNN, so there are a training dataset and a testing dataset. When I run the system I get this error:
ValueError                                Traceback (most recent call last)
<ipython-input-44-cb7ec1a13881> in <module>()
      1 optimize(num_iterations=1)
      2
----> 3 print_validation_accuracy()

<ipython-input-43-7f1a17e48e41> in print_validation_accuracy(show_example_errors, show_confusion_matrix)
     21
     22         # Get the images from the test-set between index i and j.
---> 23         images = data.valid.images[i:j, :].reshape(batch_size, img_size_flat)
     24         #images = data.valid.images[i:j, :].reshape(1, 128)
     25

ValueError: total size of new array must be unchanged
and the steps of the code that preceded this error are:
def print_validation_accuracy(show_example_errors=False,
                              show_confusion_matrix=False):
    # Number of images in the test-set.
    num_test = len(data.valid.images)

    # Allocate an array for the predicted classes which
    # will be calculated in batches and filled into this array.
    cls_pred = np.zeros(shape=num_test, dtype=np.int)

    # Now calculate the predicted classes for the batches.
    # We will just iterate through all the batches.
    # There might be a more clever and Pythonic way of doing this.

    # The starting index for the next batch is denoted i.
    i = 0
    while i < num_test:
        # The ending index for the next batch is denoted j.
        j = min(i + batch_size, num_test)

        # Get the images from the test-set between index i and j.
        images = data.valid.images[i:j, :].reshape(batch_size, img_size_flat)

        # Get the associated labels.
        labels = data.valid.labels[i:j, :]

        # Create a feed-dict with these images and labels.
        feed_dict = {x: images,
                     y_true: labels}

        # Calculate the predicted class using TensorFlow.
        cls_pred[i:j] = session.run(y_pred_cls, feed_dict=feed_dict)

        # Set the start-index for the next batch to the
        # end-index of the current batch.
        i = j

    cls_true = np.array(data.valid.cls)
    cls_pred = np.array([classes[x] for x in cls_pred])

    # Create a boolean array whether each image is correctly classified.
    correct = (cls_true == cls_pred)

    # Calculate the number of correctly classified images.
    # When summing a boolean array, False means 0 and True means 1.
    correct_sum = correct.sum()

    # Classification accuracy is the number of correctly classified
    # images divided by the total number of images in the test-set.
    acc = float(correct_sum) / num_test

    # Print the accuracy.
    msg = "Accuracy on Test-Set: {0:.1%} ({1} / {2})"
    print(msg.format(acc, correct_sum, num_test))

    # Plot some examples of mis-classifications, if desired.
    if show_example_errors:
        print("Example errors:")
        plot_example_errors(cls_pred=cls_pred, correct=correct)

    # Plot the confusion matrix, if desired.
    if show_confusion_matrix:
        print("Confusion Matrix:")
        plot_confusion_matrix(cls_pred=cls_pred)
Can anyone help me please?
As the error message shows, there is a mismatch in your reshape in this statement:

images = data.valid.images[i:j, :].reshape(batch_size, img_size_flat)

What is happening is that this equality does not hold:
(j - i) * (column size of data.valid.images) is not equal to batch_size * img_size_flat.
Make the two sides equal and the problem will be solved.
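Concretely, a minimal sketch of one way to make them equal is to reshape to the number of rows actually sliced, so the last (possibly smaller) batch no longer breaks:

# j - i equals batch_size for full batches and is smaller for the final one
images = data.valid.images[i:j, :].reshape(j - i, img_size_flat)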
