Here's my current call to model.fit in Keras
history_callback = model.fit(x_train/255.,
                             validation_train_data,
                             validation_split=validation_split,
                             batch_size=batch_size,
                             callbacks=callbacks)
In this example, x_train is a list of NumPy arrays that contains all of my image data. validation_train_data, however, is a list of NumPy arrays of totally different sizes, each of which is equal in length to the list of image arrays. The targets for each image are spread across it, so that x_train[i] corresponds to the set containing validation_train_data[0][i], validation_train_data[1][i], validation_train_data[2][i], etc. Is there any way I can reformat validation_train_data so that it can properly be used as y_true in a custom Keras loss function?
I managed to solve my problem by writing a generator function that produced a batch of x and y data as lists and yielded them together as a tuple. I then called fit_generator with generator=my_generator and it worked just fine. If you have odd input data, consider writing a generator to take care of it; a rough sketch follows the link below.
This is the tutorial I used to do so:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
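For illustration, here is a rough sketch of that kind of generator under some assumptions: the images all share one shape, each target array in validation_train_data can be stacked within a batch, and the variable names are the ones from the question.
import numpy as np

def my_generator(x_train, validation_train_data, batch_size):
    # Yields (x_batch, y_batch) tuples where y_batch is a list of arrays,
    # one entry per target, so a custom loss can see all of them as y_true.
    n = len(x_train)
    while True:
        for start in range(0, n, batch_size):
            idx = slice(start, start + batch_size)
            x_batch = np.stack(x_train[idx]) / 255.
            y_batch = [np.stack(y[idx]) for y in validation_train_data]
            yield x_batch, y_batch

# history_callback = model.fit_generator(
#     generator=my_generator(x_train, validation_train_data, batch_size),
#     steps_per_epoch=int(np.ceil(len(x_train) / batch_size)),
#     callbacks=callbacks)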
This is my first attempt at branching out from ready-made datasets and models to something pieced together on my own. Using tensorflow, I'm trying to load a dataset of images where each image is assigned a normalized, numeric value so that I can try to build a regression CNN over it.
Unfortunately for me, tf.keras.preprocessing.image_dataset_from_directory expects the dataset to be discretely classified.
Is there a straightforward way to convert the BatchDataset object to a numeric labeling?
For further clarification, if I were to do a dir(my_dataset) or my_dataset.__dict__, I would like to know where the labels are.
I can answer this after poking around a little more at the BatchDataset object returned from tf.keras.preprocessing.image_dataset_from_directory.
Because these objects are very specialized iterators, it's not straightforward to access a single "row."
The following is how I got direct access to where this information is located in a BatchDataset.
training_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "assets/",
    validation_split=0.2,
    subset="training",
    seed=123,
    color_mode="rgb",
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=32)
# pull one batch from the BatchDataSet
one_batch = training_ds.take(1)
# transform the TakeDataset into a python iterator and then pull one batch using next()
training_data, labels = next(iter(one_batch))
# labels is a numpy array of int32 where each integer is the index of the class inside training_ds.class_names
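If the directory names themselves encode the numeric targets (an assumption; the question does not say how the values are stored), one rough way to turn the integer class indices into regression targets is to map them through class_names:
import tensorflow as tf

# Assumes each directory name is the numeric target, e.g. "0.25", "0.50", ...
class_values = tf.constant([float(name) for name in training_ds.class_names])

def to_regression_target(images, labels):
    # Replace the integer class index with the numeric value it stands for.
    return images, tf.gather(class_values, labels)

regression_ds = training_ds.map(to_regression_target)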
The LogisticRegression classifier from Python's sklearn library has a .fit() method which takes x_train (features) and y_train (labels) as arguments to train the classifier.
It seems that x_train.shape = (number_of_samples, number_of_features)
For x_train I should use the extracted xvector.scp file, which I am reading like so:
b = kaldiio.load_scp('xvector.scp')
And I can print the content like so:
for file_id in b:
    xvector = b[file_id]
    print(xvector)
Right now the b variable behaves like a dictionary: you can look up the x-vector for a given file_id. I want to use sklearn's LogisticRegression to classify the x-vectors, and in order to use the .fit() method I need to pass an array as an argument.
My question is how can I make an array that contains only the xvector variables?
PS: there are about 1 million file_ids and each x-vector has length 512, which is too big to hold in an ordinary in-memory array.
It seems you are trying to store the dictionary into a numpy array. If the dictionary is small, you can directly store the values as:
import numpy as np
x = np.array(list(b.values()))
However, this will run into OOM issues if the dictionary is large. In this case, you would need to use np.memmap as explained here: https://ipython-books.github.io/48-processing-large-numpy-arrays-with-memory-mapping/
Essentially, you add rows to the array one at a time and flush periodically so the data is written out instead of piling up in memory. The array is backed by a file on disk, so it avoids OOM issues.
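A rough sketch of that approach for this case, assuming the loader reports the number of utterances via len(b) and that every x-vector really is 512-dimensional:
import numpy as np
import kaldiio

b = kaldiio.load_scp('xvector.scp')
n = len(b)        # number of file_ids
dim = 512         # x-vector dimensionality from the question

# Disk-backed array: ~1M x 512 float32 values never have to sit in RAM at once.
X = np.memmap('xvectors.dat', dtype=np.float32, mode='w+', shape=(n, dim))
keys = []
for i, file_id in enumerate(b):
    X[i] = b[file_id]
    keys.append(file_id)    # keep the order so labels can be aligned later
X.flush()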
I'm trying to implement a model for which the input should be a list of lists:
inputs = [ [np.array([...]), ..., np.array([...])], [np.array([...]), ..., np.array([...])] ]
I cannot convert the inner lists into two NumPy arrays since their shapes don't allow that.
When I pass the inputs to the model I receive the following error:
Please provide as model inputs either a single array or a list of arrays.
How can I feed my inputs to the model?
Thanks
You must have compatible shapes, that's unavoidable.
The only case that accepts a list of lists is when you have a model with more than one input tensor.
The solutions for you are:
Padding the data: add padding so every array has the same shape (see the sketch after this list).
Train separate arrays, one at a time, using train_on_batch instead of fit in a manual training loop. Each of the separate arrays must have a well-defined shape.
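A minimal sketch of the padding option, assuming the inner arrays are one-dimensional (for higher-dimensional arrays the same idea applies per axis):
import numpy as np

def pad_to_max(arrays, pad_value=0.0):
    # Zero-pad a list of 1-D arrays of different lengths to one common length.
    max_len = max(len(a) for a in arrays)
    padded = np.full((len(arrays), max_len), pad_value, dtype=np.float32)
    for i, a in enumerate(arrays):
        padded[i, :len(a)] = a
    return padded

# x = pad_to_max(inner_list)   # a single (n, max_len) array that fit() accepts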
With the new Dataset object, is there a way to divide a dataset into training and test datasets according to a certain ratio, to get a hold-out set? And to do k-fold cross-validation?
In my case I wrote all the data to a single TFRecord file and then imported it with tf.data.TFRecordDataset.
Now, for the hold-out I'd like a way to split this dataset into two datasets according to a ratio. I solved this with data.take() and data.skip(), but to apply the ratio I need the dataset's length, which is not graceful (a length-free variant is sketched after the code below).
def split_dataset(dataset, ratio, n):
    count_train = (n * ratio) // 100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train, test

filenames = ["dataset_breast.tfrecords"]
dataset = tf.data.TFRecordDataset(filenames)
train_dataset, test_dataset = split_dataset(dataset, 80, 3360)
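One idea to avoid hardcoding the length is to count the records first; this is a rough sketch that assumes eager execution (TF 2.x) and costs one extra pass over the file:
def count_records(dataset):
    # One pass over the dataset just to count its elements.
    return int(dataset.reduce(0, lambda count, _: count + 1))

def split_dataset_by_ratio(dataset, ratio):
    n = count_records(dataset)
    count_train = (n * ratio) // 100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train, test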
As for k-fold, the only solution I have found is a scikit-learn workaround applied to the data before importing it with tf.data.TFRecordDataset.
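The only tf.data-side idea I can think of is building folds with Dataset.shard; a rough sketch (it assumes the record order is not correlated with the labels, since shard() assigns every k-th record to a fold):
def kfold_split(dataset, k, fold_index):
    # Validation fold: every k-th record, starting at fold_index.
    val = dataset.shard(num_shards=k, index=fold_index)
    # Training data: the remaining k-1 shards concatenated together.
    train = None
    for i in range(k):
        if i == fold_index:
            continue
        shard = dataset.shard(num_shards=k, index=i)
        train = shard if train is None else train.concatenate(shard)
    return train, val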
I do not know of any built-in feature like what you're describing. There are, of course, ways to achieve the functionality you're after. Here are two:
"Source" Placeholder
This one comes straight from the API docs. Though it is originally intended for TFRecordDataset, I imagine it could be adapted to other types. I'll copy/paste from the link:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.
# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
A little discussion: This works with just one training tfrecord and one validation tfrecord, too. So, you could split your data using sklearn.model_selection.train_test_split before writing the TFRecords. Write one TFRecord dedicated to training data, one dedicated to validation data. With sklearn, you can specify a ratio (or an absolute number).
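A sketch of that pre-split step; examples, labels, and write_tfrecord() here are placeholders for your own data and serialization routine:
from sklearn.model_selection import train_test_split

# Hold out 20% of the examples for validation before serializing anything.
train_x, val_x, train_y, val_y = train_test_split(
    examples, labels, test_size=0.2, random_state=42)

write_tfrecord("train.tfrecords", train_x, train_y)
write_tfrecord("validation.tfrecords", val_x, val_y)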
Two Datasets
Exactly like the name says, forget the filenames = tf.placeholder. Just create two iterators, one for testing and one for training. I usually use TFRecords, but you're free to try another type of dataset. Typically, I put the get_next calls into a tf.cond on a boolean tf.placeholder. If you're especially interested in this method, I could provide a MWE. But, the source placeholder seems to be the preferred method (seeing as it's in the docs...).
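A rough sketch of the two-dataset idea (TF 1.x style; parse_fn stands in for your own record parser):
import tensorflow as tf

is_training = tf.placeholder(tf.bool, shape=[])

train_ds = tf.data.TFRecordDataset(["train.tfrecords"]).map(parse_fn).batch(32).repeat()
val_ds = tf.data.TFRecordDataset(["validation.tfrecords"]).map(parse_fn).batch(32).repeat()

train_iter = train_ds.make_one_shot_iterator()
val_iter = val_ds.make_one_shot_iterator()

# Route the graph to one iterator or the other via the boolean placeholder.
next_batch = tf.cond(is_training,
                     lambda: train_iter.get_next(),
                     lambda: val_iter.get_next())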
With CNTK I have created a network with 2 input neurons and 1 output neuron.
A line in the training file looks like
|features 1.567518 2.609619 |labels 1.000000
Then the network was trained with BrainScript. Now I want to use the network for predicting values. For example: the input data is [1.82, 3.57]. What is the output from the net?
I have tried Python with the following code, but I am new to it and the code does not work. So my question is: how do I pass the input data [1.82, 3.57] to the eval function?
On stackoverflow there are some hints, here and here, but this is too abstract for me.
Thank you.
import cntk as ct
import numpy as np
z = ct.load_model("LR_reg.dnn", ct.device.cpu())
input_data = np.array([1.82, 3.57], dtype=np.float32)
pred = z.eval({ z.arguments[0] : input_data })
print(pred)
Here's the most defensive way of doing it. CNTK can be forgiving if you omit some of this when the network is specified with V2 constructs. Not sure about a network that was created with V1 code.
Basically you need a pair of brackets for each axis. Which axes exist in BrainScript? There's a batch axis, a sequence axis, and then the static axes of your network. You have one-dimensional data, so that means the following should work:
input_data = np.array([[[1.82, 3.57]]], dtype=np.float32)
This specifies a batch of one sequence, of length one, containing one 1-D vector of two elements. You can also try omitting the outermost brackets and see if you get the same result.
Update: based on more information from the comments below, we should not forget that the V1 code also saved the parts of the network that compute things like loss and accuracy. If we provide only the features, CNTK will complain that the labels have not been provided. There are two ways to deal with this issue. One possibility is to provide some fake labels, so that the network can evaluate these auxiliary operations. Another possibility is to identify the prediction node and use only that. If the prediction was called 'p' in V1, this Python code
p = z.find_by_name('p')
should create a CNTK function that only needs the features in order to compute the prediction.
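Putting that together, a rough sketch (with 'p' standing in for whatever the prediction node is actually named in the V1 model):
import cntk as ct
import numpy as np

z = ct.load_model("LR_reg.dnn", ct.device.cpu())
p = z.find_by_name('p')        # the prediction node; 'p' is a placeholder name
pred_fn = ct.combine([p])      # wrap the node so it can be evaluated on its own

# Batch of one sequence of length one, holding a 1-D vector of two elements.
input_data = np.array([[[1.82, 3.57]]], dtype=np.float32)
print(pred_fn.eval({pred_fn.arguments[0]: input_data}))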