tf.data.Dataset.rejection_resample modifies my dataset's element_spec

I am trying to use tf.data.Dataset.rejection_resample to balance my dataset, but I am running into an issue in which the method modifies the element_spec of my dataset, making it incompatible with my models.
The original element spec of my dataset is:
({'input_A': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
'input_B': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
TensorSpec(shape=(None, 1, 1), dtype=tf.int64, name=None))
This is the element spec after batching.
However, if I run rejection_resample (before batching), the element spec at the end becomes:
(TensorSpec(shape=(None,), dtype=tf.int64, name=None),
({'input_A': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None),
'input_B': TensorSpec(shape=(None, 900, 1), dtype=tf.float64, name=None)},
TensorSpec(shape=(None, 1, 1), dtype=tf.int64, name=None)))
So rejection_resample is adding another tf.int64 tensor at the beginning of my data, and I can't figure out what it is for. My problem is that this breaks compatibility between the input data and my model, since the model depends on the original input tuple.
Furthermore, it also causes an inconsistency between the training and validation data. I was expecting to apply rejection_resample only on training data, but if I do that, the training dataset will have the added tensor, while the validation one won't.
So my question is: what is this tensor that was added to the element spec, and is there any way to drop an element from the dataset after building it? Thank you.
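For reference, a minimal sketch of the usual pattern, assuming the leading tf.int64 tensor is the class label that rejection_resample yields alongside each element (which is what the element spec above suggests) and assuming TF >= 2.7, where rejection_resample is a Dataset method; the shapes and class_func below are hypothetical stand-ins for your data:
import tensorflow as tf
# Hypothetical toy dataset with the same ({'input_A', 'input_B'}, label) structure.
x = tf.random.normal((100, 900, 1))
y = tf.random.normal((100, 900, 1))
z = tf.random.uniform((100, 1, 1), 0, 2, dtype=tf.int64)
ds = tf.data.Dataset.from_tensor_slices(({'input_A': x, 'input_B': y}, z))
# Resample on the label, then drop the class tensor that the transformation
# is assumed to prepend to every element.
ds = ds.rejection_resample(
    class_func=lambda features, label: tf.cast(tf.reshape(label, []), tf.int32),
    target_dist=[0.5, 0.5],
)
ds = ds.map(lambda extra_label, data: data)  # keep only (features, label)
ds = ds.batch(32)
print(ds.element_spec)  # back to the ({'input_A', 'input_B'}, label) structure
The validation split would simply skip the rejection_resample and map steps, so both splits end up with the same element spec.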

Let's suppose I have created the same dataset as yours:
x = tf.random.normal((7000, 900, 1))
y = tf.random.normal((7000, 900, 1))
z = tf.random.uniform((7000, 1, 1), 1, 2, dtype=tf.int32)
# Now convert it to a tf.data.Dataset object
dataset = tf.data.Dataset.from_tensor_slices(((x, y), z))
func = lambda x, y: ({'input_A': x[0], 'input_B': x[1]}, y)
dataset = dataset.map(func)
After mapping, my dataset will look exactly like yours:
<MapDataset element_spec=({'input_A': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None), 'input_B': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None)}, TensorSpec(shape=(1, 1), dtype=tf.int32, name=None))>
Now, I have to remove this last tensor:
disjoint_func = lambda x, y: x
dataset = dataset.map(disjoint_func)
Now the extra tensor has been removed:
<MapDataset element_spec={'input_A': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None), 'input_B': TensorSpec(shape=(900, 1), dtype=tf.float32, name=None)}>

I can't tell you where the added tensor comes from, but here is an example of how to remove/drop it from your dataset:
import tensorflow as tf
import numpy as np
# creating a sample dataset that's similar to your 'wrong' output
ds = tf.data.Dataset.from_tensor_slices((np.arange(-10, 0),(tf.constant(np.arange(10)), tf.constant(np.arange(10,20)))))
# remove the new 'wrong' tensor
dds = ds.map(lambda x, y: y)
# check new dataset
for i in dds.take(2):
    print(i)
Keep in mind that this is a workaround and doesn't remove whatever is causing the additional tensor.

Related

How to tf.cast a field within a tensorflow Dataset

I have a tf.data.Dataset that looks like this:
<BatchDataset shapes: ((None, 256, 256, 3), (None,)), types: (tf.float32, tf.int32)>
The 2nd element (1st if zero-indexing) corresponds to the labels. I want to cast the 2nd term (the labels) to tf.uint8.
How can one use tf.cast when dealing with a tf.data.Dataset?
Similar Questions
How to convert tf.int64 to tf.float32? is very similar, but is not for a tf.data.Dataset.
Repro
From Image classification from scratch:
curl -O https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip
unzip kagglecatsanddogs_5340.zip
Then in Python with tensorflow~=2.4:
import tensorflow as tf
ds = tf.keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=32
)
print(ds)
A map function may help:
import numpy as np
import tensorflow as tf

a = tf.data.Dataset.from_tensor_slices(np.empty((2, 5, 3)))
b = tf.data.Dataset.range(5, 8)
c = tf.data.Dataset.zip((a, b))
d = c.batch(1)
d
<BatchDataset shapes: ((None, 5, 3), (None,)), types: (tf.float64, tf.int64)>
# change the dtype of the 2nd element in each batch from int64 to int8
e = d.map(lambda x, y: (x, tf.cast(y, tf.int8)))
e
<MapDataset shapes: ((None, 5, 3), (None,)), types: (tf.float64, tf.int8)>
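Applied to the image_dataset_from_directory repro above, the same pattern would look like this (a sketch, casting the labels to the tf.uint8 the question asks for):
import tensorflow as tf
ds = tf.keras.preprocessing.image_dataset_from_directory(
    "PetImages", batch_size=32
)
# Cast only the label component; the images stay tf.float32.
ds = ds.map(lambda images, labels: (images, tf.cast(labels, tf.uint8)))
print(ds.element_spec)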

Feeding multi-input .tfrecord-file to .fit()

I am trying to train my model using a tfrecord dataset (800 GB).
The simplified data pipeline looks like this:
files = tf.io.matching_files(tfr_dir + '*_' + single_pattern + '_*')
shards = tf.data.Dataset.from_tensor_slices(files)
# Read the tfrecords
dataset = tf.data.TFRecordDataset(filenames=shards, num_parallel_reads=tf.data.experimental.AUTOTUNE)
# Parse the tfrecords
dataset = dataset.map(parse_tfr_element, num_parallel_calls=tf.data.experimental.AUTOTUNE)
# Apply image augmentation and parameter optimization using tf.py_function with defined Tout
dataset = dataset.map(imgaug)
# Filter out erroneous samples
dataset = dataset.filter(lambda f1, f2, f3, f4, f5, state: state == False)
# Batch and prefetch data (not using shuffle atm)
dataset = dataset.batch(self.config.batch_size, num_parallel_calls=self.AUTOTUNE, drop_remainder=True)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
This gives me the following output (batch_size=8):
<RepeatDataset element_spec=
(TensorSpec(shape=(8, 4, 256, 256, 3), dtype=tf.float32, name=None),
TensorSpec(shape=(8, 4, 19, 2), dtype=tf.float32, name=None),
TensorSpec(shape=(8, 4, 19, 3), dtype=tf.float32, name=None),
TensorSpec(shape=(8, 4, 3, 4), dtype=tf.float32, name=None),
TensorSpec(shape=(8, 4, 4, 3), dtype=tf.float32, name=None),
TensorSpec(shape=(8,), dtype=tf.bool, name=None))>
dataset[0], dataset[3] and dataset[4] are the inputs (x), and dataset[1] and dataset[2] are the ground truth (y) (depending on the model).
This works well with a custom training loop that iterates over the batches of the dataset using for step, data in enumerate(dataset) and defines the inputs to the model by simple subscripting, e.g. data[0]. However, I can't get it running using .fit(). I tried different approaches to force .fit() to iterate over the dataset (next(iter(dataset)), .from_generator()) but have had no luck so far.
So how can I get a multi-input dataset into the fit function? At the moment I'm considering not using tfrecords, as so far they have just been hard to use.
Thanks for your help and all the best
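For what it's worth, here is a hedged sketch of the usual way to make .fit() accept such a dataset: restructure each element into an (inputs, targets) pair with a final map so Keras can route the tensors itself. It assumes the six-tensor element layout shown above; the function name, the input keys 'image', 'proj', 'cam' and the target ordering are hypothetical and must match the model's named inputs and outputs.
import tensorflow as tf
# Sketch only: `dataset` is the pipeline built above, whose elements are
# (img, kp2d, kp3d, proj, cam, state); all names here are hypothetical.
def to_fit_format(img, kp2d, kp3d, proj, cam, state):
    inputs = {'image': img, 'proj': proj, 'cam': cam}  # keys must match the model's Input names
    targets = (kp2d, kp3d)                             # order must match the model's outputs
    return inputs, targets
fit_ds = dataset.map(to_fit_format, num_parallel_calls=tf.data.AUTOTUNE)
# model.fit(fit_ds, epochs=...) can then consume the dataset directly.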

Strange padding layer output

I am trying to construct a model that looks like this.
Notice that the output shape of the padding layer is 1 * 48 * 48 * 32, while the input shape to the padding layer is 1 * 48 * 48 * 16. Which type of padding operation does that?
My code:
prelu3 = tf.keras.layers.PReLU(shared_axes = [1, 2])(add2)
deptconv3 = tf.keras.layers.DepthwiseConv2D(3, strides=(2, 2), padding='same')(prelu3)
conv4 = tf.keras.layers.Conv2D(32, 1, strides=(1, 1), padding='same')(deptconv3)
maxpool1 = tf.keras.layers.MaxPool2D()(prelu3)
pad1 = tf.keras.layers.ZeroPadding2D(padding=(1, 1))(maxpool1) # This is the padding layer where problem lies.
This is the part of the code that is trying to replicate that block. However, I get a model that looks like this.
Am I missing something here or am I using the wrong layer?
By default, Keras ZeroPadding2D takes:
Input shape: a 4D tensor with shape (batch_size, rows, cols, channels).
Output shape: (batch_size, padded_rows, padded_cols, channels).
Please have a look at the ZeroPadding2D layer docs in Keras.
In that respect, you are trying to double what is getting treated as the channel dimension here.
Your input looks more like (batch, x, y, z) and you want (batch, x, y, 2*z).
Why do you want zero padding to double your z? I would rather suggest you use a Dense layer, like
tf.keras.layers.Dense(32)(maxpool1)
That would increase the z dimension from 16 to 32.
Edit:
I found something that can help you.
tf.keras.layers.ZeroPadding2D(
    padding=(0, 8), data_format="channels_first"
)(maxpool1)
What this does is treat x as the channel dimension and (y, z) as the (rows, cols) to be padded, so padding=(0, 8) adds 8 zeros on each side of z and turns 16 into 32.
Demo:
import tensorflow as tf
input_shape = (4, 28, 28, 3)
x = tf.keras.layers.Input(shape=input_shape[1:])
y = tf.keras.layers.Conv2D(16, 3, activation='relu', dilation_rate=2, input_shape=input_shape[1:])(x)
x = tf.keras.layers.ZeroPadding2D(
    padding=(0, 8), data_format="channels_first"
)(y)
print(y.shape, x.shape)
(None, 24, 24, 16) (None, 24, 24, 32)
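An equivalent alternative (not from the original answer, just a sketch) is to pad the channel axis explicitly with tf.pad inside a Lambda layer, which makes the 16 -> 32 intent more explicit:
import tensorflow as tf
inp = tf.keras.layers.Input(shape=(48, 48, 16))
# Pad only the last (channel) axis with 8 zeros on each side: 16 -> 32.
padded = tf.keras.layers.Lambda(
    lambda t: tf.pad(t, [[0, 0], [0, 0], [0, 0], [8, 8]])
)(inp)
print(padded.shape)  # (None, 48, 48, 32)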

NeuPy: Input shapes issues

I want to build a neural network using neupy.
Therefore I constructed the following architecture:
network = layers.join(
    layers.Input(10),
    layers.Linear(500),
    layers.Relu(),
    layers.Linear(300),
    layers.Relu(),
    layers.Linear(10),
    layers.Softmax(),
)
My data is shaped as follows:
x_train.shape = (32589,10)
y_train.shape = (32589,1)
When I try to train this network using:
model.train(x_train, y_train)
I get the following error:
ValueError: Input dimension mis-match. (input[0].shape[1] = 10, input[1].shape[1] = 1)
Apply node that caused the error: Elemwise{sub,no_inplace}(SoftmaxWithBias.0, algo:network/var:network-output)
Toposort index: 26
Inputs types: [TensorType(float64, matrix), TensorType(float64, matrix)]
Inputs shapes: [(32589, 10), (32589, 1)]
Inputs strides: [(80, 8), (8, 8)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of 2.0}, Elemwise{sub,no_inplace}.0, Elemwise{mul,no_inplace}.0), Elemwise{Sqr}[(0, 0)](Elemwise{sub,no_inplace}.0)]]
How do I have to change my network to handle this kind of data?
Thank you a lot!
Your architecture has 10 outputs instead of 1. I assume that your y_train is a 0-1 class identifier. If so, then you need to change your structure to this:
network = layers.join(
    layers.Input(10),
    layers.Linear(500),
    layers.Relu(),
    layers.Linear(300),
    layers.Relu(),
    layers.Linear(1),  # Single output
    layers.Sigmoid(),  # Sigmoid works better for 2-class classification
)
You can make it even simpler:
network = layers.join(
    layers.Input(10),
    layers.Relu(500),
    layers.Relu(300),
    layers.Sigmoid(1),
)
The reason why this works is that layers.Linear(10) > layers.Relu() is the same as layers.Relu(10). You can learn more in the official documentation: http://neupy.com/docs/layers/basics.html#mutlilayer-perceptron-mlp

(Lasagne) ValueError: Input dimension mis-match

When I run my code, I get a value error with the following message:
ValueError: Input dimension mis-match. (input[0].shape[1] = 1, input[2].shape[1] = 20)
Apply node that caused the error: Elemwise{Composite{((i0 + i1) - i2)}}[(0, 0)](Dot22.0, InplaceDimShuffle{x,0}.0, InplaceDimShuffle{x,0}.0)
Toposort index: 18
Inputs types: [TensorType(float64, matrix), TensorType(float64, row), TensorType(float64, row)]
Inputs shapes: [(20, 1), (1, 1), (1, 20)]
Inputs strides: [(8, 8), (8, 8), (160, 8)]
Inputs values: ['not shown', array([[ 0.]]), 'not shown']
Outputs clients: [[Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of 2.0}, Elemwise{Composite{((i0 + i1) - i2)}}[(0, 0)].0, Elemwise{mul,no_inplace}.0), Elemwise{Sqr}[(0, 0)](Elemwise{Composite{((i0 + i1) - i2)}}[(0, 0)].0)]]
My training data is a matrix with entries such as:
[ 815.257786 320.447 310.841]
The batches I'm feeding to my training function have a shape of (BATCH_SIZE, 3) and type TensorType(float64, matrix).
My neural net is very simple:
self.inpt = T.dmatrix('inpt')
self.out = T.dvector('out')
self.network_in = nnet.layers.InputLayer(shape=(BATCH_SIZE, 3), input_var=self.inpt)
self.l0 = nnet.layers.DenseLayer(
    self.network_in, num_units=40,
    nonlinearity=nnet.nonlinearities.rectify,
)
self.network = nnet.layers.DenseLayer(
    self.l0, num_units=1,
    nonlinearity=nnet.nonlinearities.linear,
)
My loss function is:
pred = nnet.layers.get_output(self.network)
loss = nnet.objectives.squared_error(pred, self.out)
loss = loss.mean()
I'm a bit confused as to why I'm getting a dimension mismatch. I'm passing in the correct input and label types (as per my symbolic variables), and the shape of my input data corresponds to the 'shape' parameter that I'm giving my InputLayer. I believe it's a problem with how I'm specifying the batch size: when I use a batch size of 1, my network trains without any problem, and the input[2].shape[1] value from the error message is my batch size. I'm quite new to machine learning, and any help would be greatly appreciated!
Turns out the problem was that my labels had the wrong dimensionality.
My data had shapes:
x_train.shape == (batch_size, 3)
y_train.shape == (batch_size,)
And the symbolic inputs to my net were:
self.inpt = T.dmatrix('inpt')
self.out = T.dvector('out')
I was able to solve my problem by reshaping y_train. I then changed the symbolic output variable to a matrix to account for these changes.
y_train = np.reshape(y_train, y_train.shape + (1,))
# y_train.shape == (batch_size, 1)
self.out = T.dmatrix('out')
