Hold out tensorflow 1.4 new dataset API - python

With the new Dataset object, is there a way to divide a dataset into training and test datasets according to a certain ratio, to get a hold-out split? And what about k-fold cross-validation?
In my case I wrote all the data to a single TFRecord file and then imported it with tf.data.TFRecordDataset.
Now, for hold-out, I'd like a way to split this dataset into two datasets by a ratio. I solved this with dataset.take() and dataset.skip(), but to apply the ratio I need the dataset's length, which is not graceful.
import tensorflow as tf

def split_dataset(dataset, ratio, n):
    # ratio is the training percentage (0-100); n is the total number of records
    count_train = (n * ratio) // 100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train, test

filenames = ["dataset_breast.tfrecords"]
dataset = tf.data.TFRecordDataset(filenames)
train_dataset, test_dataset = split_dataset(dataset, 80, 3360)
For k-fold, the only solution I found is a scikit-learn workaround applied to the data before importing it with tf.data.TFRecordDataset.

I do not know of any built-in feature like the one you're describing. There are, of course, ways to achieve the functionality you're after. Here are two:
"Source" Placeholder
This one comes straight from the API docs. Though it is originally intended for TFRecordDataset, I imagine it could be adapted to other types. I'll copy/paste from the link:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.
# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
A little discussion: This works with just one training tfrecord and one validation tfrecord, too. So, you could split your data using sklearn.model_selection.train_test_split before writing the TFRecords. Write one TFRecord dedicated to training data, one dedicated to validation data. With sklearn, you can specify a ratio (or an absolute number).
Two Datasets
Exactly like the name says, forget the filenames = tf.placeholder. Just create two iterators, one for testing and one for training. I usually use TFRecords, but you're free to try another type of dataset. Typically, I put the get_next calls into a tf.cond on a boolean tf.placeholder. If you're especially interested in this method, I could provide a MWE. But, the source placeholder seems to be the preferred method (seeing as it's in the docs...).
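Neither approach above covers k-fold directly. Here is a minimal sketch of the index arithmetic, assuming the record count n is known up front (as in the question's split_dataset). The helper name kfold_ranges is hypothetical and not part of any TensorFlow API.

```python
def kfold_ranges(n, k):
    """Return k (start, stop) index ranges that partition range(n).

    The first n % k folds receive one extra element, so fold sizes
    differ by at most one.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    base, extra = divmod(n, k)
    ranges = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# 3360 records (the count used in the question), 5 folds.
folds = kfold_ranges(3360, 5)
for start, stop in folds:
    print(start, stop)
```

For a given (start, stop), the validation fold could then be taken with dataset.skip(start).take(stop - start), and the training part with dataset.take(start).concatenate(dataset.skip(stop)). This still relies on knowing n, and the records should already be shuffled when the TFRecord file is written.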

Related

Changing a TF Dataset from classified to numeric/regression data

This is my first attempt at branching out from ready-made datasets and models to something pieced together on my own. Using tensorflow, I'm trying to load a dataset of images where each image is assigned a normalized, numeric value so that I can try to build a regression CNN over it.
Unfortunately for me, tf.keras.preprocessing.image_dataset_from_directory expects the dataset to be discretely classified.
Is there a straightforward way to convert the BatchDataset object to a numeric labeling?
For further clarification, if I were to do a dir(my_dataset) or my_dataset.__dict__, I would like to know where the labels are.
I can answer this after poking around a little more at the BatchDataset object returned from tf.keras.preprocessing.image_dataset_from_directory.
Because these objects are very specialized iterators, it's not straightforward to access a single "row."
The following is how I got direct access to where this information is located in a BatchDataset.
import tensorflow as tf

# IMG_SIZE is assumed to be defined elsewhere
training_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "assets/",
    validation_split=0.2,
    subset="training",
    seed=123,
    color_mode="rgb",
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=32)
# pull one batch from the BatchDataset
one_batch = training_ds.take(1)
# transform the TakeDataset into a python iterator and then pull one batch using next()
training_data, labels = next(iter(one_batch))
# labels is an int32 tensor (use labels.numpy() for a NumPy array) where each integer is the index of the class inside training_ds.class_names
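To get from those class indices to numeric targets, one option is a lookup from class index to value. This is a minimal sketch, assuming (hypothetically) that each class directory is named after its numeric value. The directory names, the helper class_index_to_value, and the sample labels below are made up for illustration. With a real batch, the labels would first be converted to plain integers, for example with labels.numpy().tolist().

```python
def class_index_to_value(class_names):
    """Build a lookup list: class index -> float parsed from the directory name."""
    return [float(name) for name in class_names]


# Hypothetical directory names; in practice this would be training_ds.class_names.
class_names = ["0.10", "0.25", "0.80"]
lookup = class_index_to_value(class_names)

# Hypothetical integer labels for one batch, as plain Python ints.
batch_labels = [2, 0, 1, 0]
numeric_labels = [lookup[i] for i in batch_labels]
print(numeric_labels)
```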

Creating a Y_true Dataset in Keras

Here's my current call to model.fit in Keras:
history_callback = model.fit(x_train/255.,
                             validation_train_data,
                             validation_split=validation_split,
                             batch_size=batch_size,
                             callbacks=callbacks)
In this example, x_train is a list of NumPy arrays that contains all of my image data. validation_train_data, however, is a list of NumPy arrays of totally different sizes, equal in length to the list of NumPy arrays that contains my images. The data for each image is arranged in validation_train_data such that x_train[i] corresponds to the set containing validation_train_data[0][i], validation_train_data[1][i], validation_train_data[2][i], etc. Is there any way I can reformat validation_train_data so that it can properly be used as y_true in a custom Keras loss function?
I managed to solve my problem by writing a generator function that produces a batch of x and y data as lists and puts them together as a tuple. I then called fit_generator with generator=my_generator, and it worked just fine. If you have oddly structured input data, consider writing a generator to take care of it.
This is the tutorial I used to do so:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
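For illustration, here is a minimal sketch of such a generator in plain Python. It assumes the layout described in the question, where validation_train_data[j][i] belongs to x_train[i]. The function name batch_generator and the toy data are hypothetical, and real use with Keras would convert each batch to NumPy arrays before yielding it.

```python
def batch_generator(x_data, y_data, batch_size):
    """Yield (x_batch, y_batch) tuples forever, as Keras-style generators do.

    x_data: list of samples.
    y_data: list of target lists, one per output, each aligned with x_data.
    """
    n = len(x_data)
    while True:
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            x_batch = x_data[start:stop]
            # Slice every target list with the same indices as the inputs.
            y_batch = [target[start:stop] for target in y_data]
            yield x_batch, y_batch


# Hypothetical toy data: 5 samples, two targets per sample.
x_train = [[0.0], [0.1], [0.2], [0.3], [0.4]]
validation_train_data = [[1, 2, 3, 4, 5], ["a", "b", "c", "d", "e"]]

gen = batch_generator(x_train, validation_train_data, batch_size=2)
first_x, first_y = next(gen)
```

The generator loops forever because the training loop, not the generator, decides how many batches make up an epoch.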

Imputation on the test set with fancyimpute

The python package Fancyimpute provides several methods for the imputation of missing values in Python. The documentation provides examples such as:
# X is the complete data matrix
# X_incomplete has the same values as X except a subset have been replaced with NaN
# Model each feature with missing values as a function of other features, and
# use that estimate for imputation.
X_filled_ii = IterativeImputer().fit_transform(X_incomplete)
This works fine when applying the imputation method to a dataset X. But what if a training/test split is necessary? Once
X_train_filled = IterativeImputer().fit_transform(X_train_incomplete)
is called, how do I impute the test set and create X_test_filled? The test set needs to be imputed using the information from the training set. I guess that IterativeImputer() should return an object that can then be applied to X_test_incomplete. Is that possible?
Please note that imputing the whole dataset and then splitting it into training and test sets is not correct.
The package looks like it mimics scikit-learn's API. After looking at the source code, it does appear to have a transform method.
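from fancyimpute import IterativeImputer

my_imputer = IterativeImputer()
X_train_filled = my_imputer.fit_transform(X_train_incomplete)
# now transform the test set with the imputer fitted on the training set
X_test_filled = my_imputer.transform(X_test_incomplete)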
The imputer will apply the same imputations that it learned from the training set.
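To make the fit-on-train, transform-on-test pattern concrete without depending on fancyimpute, here is a minimal sketch using a stand-in imputer that fills missing values with column means. MeanImputer is a hypothetical class written for this example, not fancyimpute's IterativeImputer, and the toy matrices are made up.

```python
import math


class MeanImputer:
    """Stand-in imputer (column means), NOT fancyimpute's IterativeImputer."""

    def fit(self, rows):
        # Learn one mean per column from the observed (non-NaN) values.
        # Assumes every column has at least one observed value.
        n_cols = len(rows[0])
        self.means_ = []
        for col in range(n_cols):
            observed = [r[col] for r in rows if not math.isnan(r[col])]
            self.means_.append(sum(observed) / len(observed))
        return self

    def transform(self, rows):
        # Fill NaNs using the means learned in fit(); nothing is re-learned here.
        return [
            [self.means_[c] if math.isnan(v) else v for c, v in enumerate(r)]
            for r in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


nan = float("nan")
X_train_incomplete = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
X_test_incomplete = [[nan, 5.0], [2.0, nan]]

imputer = MeanImputer()
X_train_filled = imputer.fit_transform(X_train_incomplete)
X_test_filled = imputer.transform(X_test_incomplete)
```

The test rows are filled with statistics computed only from the training rows, which is the property the question asks for.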

Runtime switching between datasets in Tensorflow from_generator?

I have a huge dataset (about 50 Gigabytes) and I'm loading it using Python generators like this:
def data_generator(self, images_path):
    with open(self.temp_csv, 'r') as f:
        for image in f.readlines():
            # Something going on...
            yield (X, y)
The important thing is that I'm using a single generator for both training and validation data, and I'm trying to change self.temp_csv at runtime. However, things are not going as expected: after updating self.temp_csv, which is supposed to switch between the train and validation sets, with open is not called again and I end up iterating over the same dataset over and over. Is there any way to use Dataset.from_generator and switch to another dataset at runtime for the validation phase? Here is how I am specifying the generator. Thank you!
def get_data(self):
    with tf.name_scope('data'):
        data_generator = lambda: self.data_generator(images_path=self.data_path)
        my_data = tf.data.Dataset.from_generator(
            generator=data_generator,
            output_types=(tf.float32, tf.float32),
            output_shapes=(tf.TensorShape([None]), tf.TensorShape([None]))
        ).batch(self.batch_size).prefetch(2)
        img, self.label = my_data.make_one_shot_iterator().get_next()
        self.img = tf.reshape(img, [-1, CNN_INPUT_HEIGHT, CNN_INPUT_WIDTH, CNN_INPUT_CHANNELS])
You could use a reinitializable iterator or a feedable iterator to switch between two datasets, as shown in the official docs.
However, if you want to read all the data you have using the generator and then create a train and validation split, it's not that straightforward.
If you have a separate validation file, you can simply create a new validation dataset and use the iterator as shown above.
If that's not the case, methods such as skip() and take() can help you split the data, but the shuffling needed for a good split is something you need to think about.
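When there is a separate validation file, one way to get two independent generators is to bind the CSV path at construction time instead of reading a mutable attribute. This is a minimal sketch in plain Python. The helper make_generator, the file names, and the row layout are hypothetical, and each returned callable is the kind of zero-argument function that could be passed as generator to tf.data.Dataset.from_generator.

```python
import csv
import os
import tempfile


def make_generator(csv_path):
    """Return a zero-argument callable that yields the rows of csv_path."""
    def generator():
        # The path is captured in the closure, so each callable always
        # reopens its own file, no matter what other state changes later.
        with open(csv_path, "r", newline="") as f:
            for row in csv.reader(f):
                yield row
    return generator


# Hypothetical train/validation files written to a temp dir for the demo.
tmp_dir = tempfile.mkdtemp()
train_csv = os.path.join(tmp_dir, "train.csv")
val_csv = os.path.join(tmp_dir, "val.csv")
with open(train_csv, "w", newline="") as f:
    csv.writer(f).writerows([["img1", "0"], ["img2", "1"]])
with open(val_csv, "w", newline="") as f:
    csv.writer(f).writerows([["img3", "1"]])

train_gen = make_generator(train_csv)
val_gen = make_generator(val_csv)
train_rows = list(train_gen())
val_rows = list(val_gen())
```

Each callable can be invoked again to restart iteration from the top of its own file.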

How to get feature vector column length in Spark Pipeline

I have an interesting question.
I am using Pipeline object to run a ML task.
This is what my Pipeline object looks like.
jpsa_mlp.pipeline.getStages()
Out[244]:
[StringIndexer_479d82259c10308d0587,
Tokenizer_4c5ca5ea35544bb835cb,
StopWordsRemover_4641b68e77f00c8fbb91,
CountVectorizer_468c96c6c714b1000eef,
IDF_465eb809477c6c986ef9,
MultilayerPerceptronClassifier_4a67befe93b015d5bd07]
All the estimators and transformers inside this pipeline object have been coded as class methods, with JPSA being the class.
Now I want to add a method for hyperparameter tuning, so I use the following:
self.paramGrid = ParamGridBuilder()\
    .addGrid(self.pipeline.getStages()[5].layers, [len(self.pipeline.getStages()[3].vocab),10,3])\
    .addGrid(self.pipeline.getStages()[5].maxIter, [100,300])\
    .build()
The problem is that, for a neural network classifier, one of the hyperparameters is the hidden layer size. The layers attribute of the MLP classifier requires the sizes of the input, hidden, and output layers. The input and output sizes are fixed (based on the data we have), so I wanted to set the input layer size to the size of my feature vector. However, I don't know the size of my feature vector, because the estimators inside the pipeline that create the feature vectors (CountVectorizer, IDF) have not yet been fit to the data.
The pipeline will fit the data during cross-validation, using Spark's CrossValidator. Only then would I have a CountVectorizerModel from which to get the feature vector size.
If I had a materialized CountVectorizerModel, I could use the length of countvectorizerModel.vocab as the feature vector length and pass that as the input layer value in the layers attribute of the MLP.
So how do I add hyperparameters for the MLP's layers (both the hidden and input layer sizes)?
You can find out that information from your dataframe schema metadata.
Scala code:
val length = datasetAfterPipe.schema(datasetAfterPipe.schema.fieldIndex("columnName"))
  .metadata.getMetadata("ml_attr").getLong("num_attrs")
Since PySpark code was requested:
You can see them by "navigating" the metadata: datasetAfterPipe.schema["features"].metadata["ml_attr"]
Here is sample output (xxx stands for all the features assembled into the features column, and the final entry is the size):
Out:
{'attrs': {'numeric': [{'idx': xxxxxxx }]}, 'num_attrs': 337}
So you index into the metadata:
lenFeatureVect = datasetAfterPipe.schema["features"].metadata["ml_attr"]["num_attrs"]
print('Len feature vector:', lenFeatureVect)
Out:
337
Note: if you have a "scaled features" column, you need to use the pre-scaled "features" column to get the attribute info. This assumes you scale after vectorizing, so that the scaling step in the Pipeline is fed feature vectors; the limitation does not apply if you feed it the original columns.
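Once the feature vector length is known, each candidate for layers is a complete list of layer sizes. This is a minimal sketch of building those candidates, assuming the 337 features from the sample output above and 3 output classes as in the question's grid. The helper name build_layer_grid and the hidden sizes are hypothetical.

```python
def build_layer_grid(num_features, hidden_sizes, num_classes):
    """Return one [input, hidden, output] layer spec per candidate hidden size."""
    return [[num_features, hidden, num_classes] for hidden in hidden_sizes]


# 337 comes from the metadata example; the hidden sizes are made up.
layer_grid = build_layer_grid(337, [5, 10, 20], 3)
for layers in layer_grid:
    print(layers)
```

The resulting list of lists could then be passed as the values argument to addGrid for the layers parameter, so that each grid point is a full layer specification rather than a single number.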
