I have a huge dataset (about 50 Gigabytes) and I'm loading it using Python generators like this:
def data_generator(self, images_path):
    with open(self.temp_csv, 'r') as f:
        for image in f.readlines():
            # Something going on...
            yield (X, y)
The important thing is that I'm using a single generator for both the training and validation data, and I try to change self.temp_csv at runtime to switch between the two sets. However, things are not going as expected: after updating self.temp_csv, with open is never called again, so I end up iterating over the same dataset over and over. Is there any way to use Dataset.from_generator and, at runtime, switch to another dataset for the validation phase? Here is how I am specifying the generator. Thank you!
def get_data(self):
    with tf.name_scope('data'):
        data_generator = lambda: self.data_generator(images_path=self.data_path)
        my_data = tf.data.Dataset.from_generator(
            generator=data_generator,
            output_types=(tf.float32, tf.float32),
            output_shapes=(tf.TensorShape([None]), tf.TensorShape([None]))
        ).batch(self.batch_size).prefetch(2)
        img, self.label = my_data.make_one_shot_iterator().get_next()
        self.img = tf.reshape(img, [-1, CNN_INPUT_HEIGHT, CNN_INPUT_WIDTH, CNN_INPUT_CHANNELS])
You could use a reinitializable iterator or a feedable iterator to switch between two datasets, as shown in the official docs; a rough sketch of the reinitializable option follows.
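Something along these lines (a minimal sketch only, TF 1.x style; train_generator, val_generator, the shapes, and batch_size are hypothetical stand-ins for your own code, not the question's pipeline):
import numpy as np
import tensorflow as tf

batch_size = 4

def train_generator():
    # hypothetical stand-in for reading rows from the training CSV
    for _ in range(10):
        yield np.random.rand(8).astype(np.float32), np.random.rand(1).astype(np.float32)

def val_generator():
    # hypothetical stand-in for reading rows from the validation CSV
    for _ in range(4):
        yield np.random.rand(8).astype(np.float32), np.random.rand(1).astype(np.float32)

output_types = (tf.float32, tf.float32)
output_shapes = (tf.TensorShape([None]), tf.TensorShape([None]))

train_ds = tf.data.Dataset.from_generator(train_generator, output_types, output_shapes).batch(batch_size)
val_ds = tf.data.Dataset.from_generator(val_generator, output_types, output_shapes).batch(batch_size)

# one iterator defined by structure, reinitialized from either dataset
iterator = tf.data.Iterator.from_structure(train_ds.output_types, train_ds.output_shapes)
img, label = iterator.get_next()

train_init_op = iterator.make_initializer(train_ds)
val_init_op = iterator.make_initializer(val_ds)

with tf.Session() as sess:
    sess.run(train_init_op)   # training phase
    print(sess.run([img, label]))
    sess.run(val_init_op)     # validation phase
    print(sess.run([img, label]))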
However, if you want to read all your data through the generator and then create a train/validation split, it's not that straightforward.
If you have a separate validation file, you can simply create a new validation dataset and use the iterator as shown above.
If that's not the case, methods such as skip() and take() can help you split the data, but you will need to think about shuffling to get a good split.
Related
This is my first attempt at branching out from ready-made datasets and models to something pieced together on my own. Using tensorflow, I'm trying to load a dataset of images where each image is assigned a normalized, numeric value so that I can try to build a regression CNN over it.
Unfortunately for me, tf.keras.preprocessing.image_dataset_from_directory expects the dataset to be discretely classified.
Is there a straightforward way to convert the BatchDataset object to a numeric labeling?
For further clarification, if I were to do a dir(my_dataset) or my_dataset.__dict__, I would like to know where the labels are.
I can answer this after poking around a little more at the BatchDataset object returned by tf.keras.preprocessing.image_dataset_from_directory.
Because these objects are very specialized iterators, it's not straightforward to access a single "row."
The following is how I got direct access to where this information is located from a BatchDataSet.
training_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "assets/",
    validation_split=0.2,
    subset="training",
    seed=123,
    color_mode="rgb",
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=32)

# pull one batch from the BatchDataset
one_batch = training_ds.take(1)

# turn the TakeDataset into a Python iterator and then pull one batch using next()
training_data, labels = next(iter(one_batch))

# labels is a numpy array of int32 where each integer is the index of the class inside training_ds.class_names
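Building on that, one possible way to turn those class indices into numeric regression targets, assuming the directory names themselves encode the numeric value (an assumption for illustration, not something stated above):
# assumes each class directory name is literally the numeric target, e.g. "0.25"
name_to_value = tf.constant([float(name) for name in training_ds.class_names])

# swap the int32 class index for the corresponding float value
regression_ds = training_ds.map(lambda images, labels: (images, tf.gather(name_to_value, labels)))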
I'm trying to write Tensorflow 2.0 code which is good enough to share with other people. I have run into a problem with tf.data.Dataset. I have solved it, but I dislike my solutions.
Here is working Python code which generates padded batches from irregular data in two different ways. In one case, I re-use a global variable to supply the shape information. I dislike the global variable, especially because I know that the Dataset knows its own output shapes, and in the future I may have Dataset objects with several different output shapes.
In the other case, I extract the shape information from the Dataset object itself. But I have to jump through hoops to do it.
import numpy as np
import tensorflow as tf

print("""
Create a data set with the desired shape: 1 input per sub-element,
3 targets per sub-element, 8 elements of varying lengths.
""")

def gen():
    lengths = np.tile(np.arange(4, 8), 2)
    np.random.shuffle(lengths)
    for length in lengths:
        inp = np.random.randint(1, 51, length)
        tgt = np.random.random((length, 3))
        yield inp, tgt

output_types = (tf.int64, tf.float64)
output_shapes = ([None], [None, 3])
dataset = tf.data.Dataset.from_generator(gen, output_types, output_shapes)

print("""
Using the global variable, output_shapes, allows the retrieval
of padded batches.
""")

for inp, tgt in dataset.padded_batch(3, output_shapes):
    print(inp)
    print(tgt)
    print()

print("""
Obtaining the shapes supplied to Dataset.from_generator()
is possible, but hard.
""")

default_shapes = tuple([[y.value for y in x.shape.dims] for x in dataset.element_spec])  # Crazy!
for inp, tgt in dataset.padded_batch(3, default_shapes):
    print(inp)
    print(tgt)
I don't quite understand why one might want to pad the data in a batch of unevenly-sized elements to any shapes other than the output shapes which were used to define the Dataset elements in the first place. Does anyone know of a use case?
Also, there is no default value for the padded_shapes argument. I show how to retrieve what I think is the sensible default value for padded_shapes. That one-liner works... but why is it so difficult?
I'm currently trying to subclass Dataset to provide the Dataset default shapes as a Python property. Tensorflow is fighting me, probably because the underlying Dataset is a C++ object while I'm working in Python.
All this trouble makes me wonder whether there is a cleaner approach than what I have tried.
Thanks for your suggestions.
Answering my own question. I asked this same question on Reddit. A Tensorflow contributor replied that TF 2.2 will provide a default value for the padded_shapes argument. I am glad to see that the development team has recognized the same need that I identified.
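For reference, a minimal sketch of what that default looks like once it is available (assuming TF 2.2 or later, and reusing the dataset built above): padded_shapes can simply be omitted, and every unknown dimension is padded to the longest element in each batch.
# TF >= 2.2: no padded_shapes argument needed
for inp, tgt in dataset.padded_batch(3):
    print(inp.shape, tgt.shape)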
Here's my current call to model.fit in Keras
history_callback = model.fit(x_train/255.,
                             validation_train_data,
                             validation_split=validation_split,
                             batch_size=batch_size,
                             callbacks=callbacks)
In this example x_train is a list of numpy arrays that contains all of my image data. validation_train_data, though, is structured as a list of numpy arrays of totally different sizes, equal in length to the list that contains my images. The data for each image is spread across validation_train_data such that x_train[i] corresponds to the set containing validation_train_data[0][i], validation_train_data[1][i], validation_train_data[2][i], etc. Is there any way I can reformat my validation_train_data so that it can properly be used as y_true in a custom Keras loss function?
I managed to solve my problem by writing a generator function which produced a batch of x and y data as lists and put them together as a tuple. I then called fit_generator with generator=my_generator and it worked just fine. If you have oddly structured input data, you should consider writing a generator to take care of it.
This is the tutorial I used to do so:
https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
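A rough sketch of that kind of generator (batch_generator and its arguments are hypothetical names under the assumptions above, not the exact code I used):
import numpy as np

def batch_generator(x, y_list, batch_size):
    # x: list of images; y_list: list of per-output label arrays, each indexed per image
    n = len(x)
    while True:
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x_batch = np.asarray([x[i] for i in idx], dtype=np.float32) / 255.
            y_batch = [np.asarray(y)[idx] for y in y_list]
            yield x_batch, y_batch

# hypothetical usage:
# model.fit_generator(batch_generator(x_train, validation_train_data, 32),
#                     steps_per_epoch=len(x_train) // 32,
#                     callbacks=callbacks)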
With the new Dataset object, is there a way to divide a dataset into training and test datasets according to a certain ratio, to get a hold-out set? And what about k-fold cross-validation?
In my case I wrote all the data into a single TFRecord file and then imported it with tf.data.TFRecordDataset.
Now, for the hold-out, I'd like a way to split this dataset into two datasets according to a ratio. I solved this with data.take() and data.skip(), but for the ratio I need the dataset's length, which is not graceful.
def split_dataset(dataset, ratio, n):
    count_train = (n * ratio) // 100
    train = dataset.take(count_train)
    test = dataset.skip(count_train)
    return train, test

filenames = ["dataset_breast.tfrecords"]
dataset = tf.data.TFRecordDataset(filenames)
train_dataset, test_dataset = split_dataset(dataset, 80, 3360)
As for k-fold, the only solution I have found is a scikit-learn workaround applied to the data before importing it with tf.data.TFRecordDataset, roughly like the sketch below.
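Something like this (a hedged sketch only; the number of examples and the per-fold handling are placeholders):
from sklearn.model_selection import KFold

indices = list(range(3360))  # placeholder: total number of examples
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(indices)):
    # write (or select) separate TFRecord data for train_idx / test_idx here
    pass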
I do not know of any built-in feature like what you're describing. There are, of course, ways to achieve the functionality you're after. Here are two:
"Source" Placeholder
This one comes straight from the API docs. Though it is originally intended for TFRecordDataset, I imagine it could be adapted to other types. I'll copy/paste from the link:
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...) # Parse the record into tensors.
dataset = dataset.repeat() # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()
# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.
# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})
# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
A little discussion: This works with just one training tfrecord and one validation tfrecord, too. So, you could split your data using sklearn.model_selection.train_test_split before writing the TFRecords. Write one TFRecord dedicated to training data, one dedicated to validation data. With sklearn, you can specify a ratio (or an absolute number).
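For illustration, a hedged sketch of that split-then-write step (serialized_examples and the file names are hypothetical, and it assumes the records are already serialized tf.train.Example protos):
from sklearn.model_selection import train_test_split
import tensorflow as tf

# serialized_examples: hypothetical list of already-serialized tf.train.Example protos
train_examples, val_examples = train_test_split(serialized_examples, test_size=0.2, random_state=42)

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for example in train_examples:
        writer.write(example)

with tf.io.TFRecordWriter("validation.tfrecord") as writer:
    for example in val_examples:
        writer.write(example)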
Two Datasets
Exactly like the name says: forget the filenames = tf.placeholder. Just create two iterators, one for training and one for testing. I usually use TFRecords, but you're free to try another type of dataset. Typically, I put the get_next calls into a tf.cond on a boolean tf.placeholder. If you're especially interested in this method, a minimal sketch is given below. But the source placeholder seems to be the preferred method (seeing as it's in the docs...).
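A minimal sketch of that idea (TF 1.x style; the file names and the feature spec inside parse_fn are hypothetical, so adapt them to how your records were actually written):
import tensorflow as tf

def parse_fn(serialized):
    # hypothetical feature spec; adapt it to your records
    parsed = tf.parse_single_example(
        serialized,
        {"image_raw": tf.FixedLenFeature([], tf.string),
         "label": tf.FixedLenFeature([], tf.int64)})
    return parsed["image_raw"], parsed["label"]

train_ds = tf.data.TFRecordDataset(["train.tfrecord"]).map(parse_fn).batch(32)
val_ds = tf.data.TFRecordDataset(["validation.tfrecord"]).map(parse_fn).batch(32)

train_iter = train_ds.make_one_shot_iterator()
val_iter = val_ds.make_one_shot_iterator()

# boolean placeholder selects which iterator feeds the graph
is_training = tf.placeholder(tf.bool, shape=[])
features, labels = tf.cond(is_training,
                           lambda: train_iter.get_next(),
                           lambda: val_iter.get_next())

with tf.Session() as sess:
    train_batch = sess.run([features, labels], feed_dict={is_training: True})
    val_batch = sess.run([features, labels], feed_dict={is_training: False})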
I have created a model for text classification using Python. I use CountVectorizer and it results in a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model to use on new test data. My new test data:
test_data = ['Love', 'python', 'every','time']
The problem is that when I convert the above test-data tokens into a feature vector, it differs in shape, because the model expects a 4063-length vector. I know how to solve it by taking the vocabulary of the CountVectorizer, searching for each token of the test data, and putting it at the corresponding index, but is there an easier way to handle this problem in scikit-learn itself?
You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.
You have two ways to solve this:
1. You can use the same CountVectorizer that you used for your train features, like this:
cv = CountVectorizer()  # with your desired parameters
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer, if you really want to (not advisable, since you would be wasting space and you'd still want to use the same parameters for your CV), and reuse the same features:
cv_train = CountVectorizer()  # with your desired parameters
X_train = cv_train.fit_transform(train_data)
cv_test = CountVectorizer(vocabulary=cv_train.get_feature_names())  # same parameters as cv_train
X_test = cv_test.fit_transform(test_data)
Try to use:
test_features = cv.inverse_transform(test_data)  # call inverse_transform on the fitted CountVectorizer
This should return what you wish for.
I added .toarray() to the whole command in order to see the results as a matrix.
So you should write:
X_test_analyst = pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()  # `pipeline` is the fitted sklearn Pipeline
I'm mega late to this discussion, but I just want to leave something for people who come here from a search engine.
Sorry for my bad English.
;)
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can think of what a CountVectorizer does as building a 2-D array (or an Excel-like table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you make a new CountVectorizer using your new data, the unique words will probably (if not certainly) be different. When you call model.predict on this data, it will report some sort of error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py/.ipynb file, you can use import pickle and the dump function to save your fitted CountVectorizer. You can follow the details in this post.
If you train your model in the same .py/.ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk

pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...

# When you want to use them:
import pickle as pk

vectorizer = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also export classes or other things using pickle, but when you do so, make sure the class is already defined when you load the object. Not sure whether this matters in this case, but I still import PCA and CountVectorizer before calling pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.