How to normalize data when using Keras fit_generator - python

I have a very large data set and am using Keras' fit_generator to train a Keras model (TensorFlow backend). My data needs to be normalized across the entire data set; however, with fit_generator I only have access to relatively small batches, and normalizing within such a small batch is not representative of normalizing across the entire data set. The impact is quite large (I tested it, and model accuracy is significantly degraded).
My question is this: what is the correct practice for normalizing data across the entire data set when using Keras' fit_generator? One last point: my data is a mix of text and numeric data, not images, so I cannot use some of the capabilities of Keras' provided image generator, which may address some of these issues for image data.
I have looked at normalizing the full data set prior to training (the "brute-force" approach, I suppose), but I am wondering if there is a more elegant way of doing this.

The generator does allow you to do on-the-fly processing of data, but pre-processing the data prior to training is the preferred approach:
Pre-processing and saving avoids re-processing the data on every epoch; the generator should really only do small operations that can be applied per batch. One-hot encoding, for example, is a common one, while tokenising sentences and the like can be done offline.
You will probably tweak and fine-tune your model. You don't want the overhead of normalising the data on every run, and you want to ensure that every model trains on the same normalised data.
So, pre-process once offline prior to training and save the result as your training data. When predicting you can process on the fly.
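A minimal sketch of that workflow, assuming the numeric data can be read in chunks (load_numeric_chunks and the file names are hypothetical placeholders): compute the global statistics in one streaming pass, save them, and have the generator apply them per batch.
import numpy as np

# One streaming pass over the full training set to get global statistics.
count, total, total_sq = 0, 0.0, 0.0
for chunk in load_numeric_chunks("train_data.csv"):  # hypothetical chunked loader
    count += chunk.shape[0]
    total += chunk.sum(axis=0)
    total_sq += (chunk ** 2).sum(axis=0)
mean = total / count
std = np.sqrt(total_sq / count - mean ** 2)
np.savez("norm_stats.npz", mean=mean, std=std)

# In the generator, normalize every batch with the *global* statistics.
def normalized_batches(raw_batches, mean, std):
    for x_batch, y_batch in raw_batches:
        yield (x_batch - mean) / (std + 1e-7), y_batch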

You would do this by pre-processing your data into a matrix. One-hot encode your text data:
from keras.preprocessing.text import Tokenizer
# X is a list of text elements
t = Tokenizer()
t.fit_on_texts(X)
X_one_hot = t.texts_to_matrix(X)
and normalize your numeric data, for example with column-wise min-max scaling:
import numpy as np

# scale every column to roughly [0, 1]; the small constant avoids division by zero
X_normalized = (matrix - matrix.min(axis=0)) / (matrix.max(axis=0) - matrix.min(axis=0) + 1e-4)
If you concatenate your two matrices you should have properly preprocessed data. I could imagine, though, that the text will always influence the outcome of your model too much, so it might make sense to train separate models for the text and the numeric data.
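The concatenation itself is a one-liner (a sketch; X_normalized is the numeric matrix produced above):
import numpy as np

# stack the one-hot text matrix and the normalized numeric matrix column-wise
X_combined = np.hstack([X_one_hot, X_normalized])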

Related

How to fit Word2Vec on test data?

I am working on a Sentiment Analysis problem. I am using Gensim's Word2Vec to vectorize my data in the following way:
# PREPROCESSING THE DATA
# SPLITTING THE DATA
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y = train_test_split(x, y, test_size = 0.2, random_state = 69, stratify = y)
train_x2 = train_x['review'].to_list()
test_x2 = test_x['review'].to_list()
# CONVERT TRAIN DATA INTO A NESTED LIST, AS WORD2VEC EXPECTS A LIST OF TOKEN LISTS
import nltk
train_x3 = [nltk.word_tokenize(k) for k in train_x2]
test_x3 = [nltk.word_tokenize(k) for k in test_x2]
# TRAIN THE MODEL ON TRAIN SET
from gensim.models import Word2Vec
model = Word2Vec(train_x3, min_count = 1)
key_index = model.wv.key_to_index
# MAKE A DICT
we_dict = {word:model.wv[word] for word in key_index}
# CONVERT TO DATAFRAME
import pandas as pd
new = pd.DataFrame.from_dict(we_dict)
The new dataframe is the vectorized form of the train data. Now, how do I do the same process for the test data? I can't pass the whole corpus (train + test) to the Word2Vec instance, as it might lead to data leakage. Should I simply pass the test list to another instance of the model, as in:
model = Word2Vec(test_x3, min_count = 1)
I don't think this would be the correct way. Any help is appreciated!
PS: I am not using a pretrained word2vec in an LSTM model. What I am doing is training Word2Vec on the data that I have and then feeding the vectors to an ML algorithm like RF or LGBM. Hence I need to vectorize the test data separately.
Note that because word2vec is an unsupervised algorithm, it can sometimes be defensible to use all available texts to train it. That includes texts with known labels that you're withholding from other supervised-classification steps as test/validation records.
You just make sure the labels themselves aren't in the training data, but still use the bulk unlabeled text for further unsupervised improvement of the raw word-vectors. Those vectors, influenced by all the input text (but none of the known-answer labels), are then used for enhanced feature-modeling of the texts, as input to later supervised, label-aware steps.
(Whether this is OK for your project may depend on what future performance you want your various accuracy/etc. evaluation measures to reasonably estimate. Is it new situations where everything must always be trained from scratch, and where relevant raw text and labels as training data are both scarce? Or situations where the corpus always grows and text is always plentiful even if labels are expensive to acquire, and where any actually deployed classifier will be able to leverage other unlabeled texts before committing to a prediction?)
But note also: word-vectors are only comparison-compatible with each other when trained together, into a shared space (or made compatible via other, less common post-training alignment steps). There's no single right place for any word's vector, just a good relative position with regard to everything else trained in the same session – which used randomization in both initialization and training, so even repeated runs on the same training data can yield end models of approximately equivalent usefulness with wildly different word-coordinates.
So, when withholding your test-set texts from the initial word2vec training, the alternative is never to train a separate word2vec model on just the test texts, but rather to re-use the frozen word2vec model from the training data.
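For example, since the downstream RF/LGBM models need one fixed-length vector per document, a common approach (a sketch, assuming the gensim 4.x API) is to average the vectors of each document's in-vocabulary words, using the frozen model for both splits:
import numpy as np

def average_vector(tokens, w2v_model):
    # mean of the vectors for tokens the model knows; zeros if none are known
    vecs = [w2v_model.wv[t] for t in tokens if t in w2v_model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v_model.vector_size)

# both splits are vectorized with the model fitted on train_x3 only
train_features = np.array([average_vector(doc, model) for doc in train_x3])
test_features = np.array([average_vector(doc, model) for doc in test_x3])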
Separately: min_count=1 is almost always a bad idea for word2vec models, and if you're tempted to use it, you may have far too little data for such a data-hungry algorithm to show its true value. (When using it on the datasets where it really shines, you should more often be raising that threshold above its default – discarding more of the rare words – than lowering it to save every rare, hard-to-model word.)

CountVectorizer test data loss?

I am working on a small project that requires using different classification models on BoW data.
I understand that the train and test data must be different to get the model's true accuracy.
For model.score() to work correctly, I need to give it test data and labels with the same dimensions as the initial data. But the test data has different dimensions, so I do it like this:
vectorizer = CountVectorizer()
traindata_bow = vectorizer.fit_transform(traindata)
testdata_bow = vectorizer.transform(testdata)
Now, the test data has the same dimensions as the initial train data.
Now on to my question:
Test data has its own set of dimensions/"characteristics".
So, by transforming it using the vectorizer fitted on the train data, are we not losing some of the test data's characteristics?
I am asking because my model's accuracy ends up in the 99.9% range, and I worry something is being calculated incorrectly (though my dataset is quite easy).
For example, after the code above:
traindata_bow.shape is (35918, 34319) and
testdata_bow.shape is (8980, 34319)
But if I run:
testdata_bow = vectorizer.fit_transform(testdata)
I get:
testdata_bow.shape is (8980, 20806)
So is there any data loss (or even partial merging with the train data) in the transform stage?
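As a quick illustration of what transform does with words outside the fitted vocabulary – tokens the vectorizer has never seen are simply dropped:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit_transform(["red blue green"])       # vocabulary: blue, green, red
out = vectorizer.transform(["red purple purple"])  # "purple" is not in the vocabulary
print(out.toarray())  # [[0 0 1]] -- the unseen word is silently dropped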

How to retain Scikit-learn OneHotEncoding from model generation to use on new data?

I'm using OneHotEncoding to generate dummies for a classification problem. When used on the training data, I get ~300 dummy columns, which is fine. However, when I input new data (which has fewer rows), the OneHotEncoding only generates ~250 dummies, which isn't surprising considering the smaller dataset, but then I can't use the new data with the model because the features don't align.
Is there a way to retain the OneHotEncoding schema to use on new incoming data?
I think you are using fit_transform on both the training and test datasets, which is not the right approach, because the encoding schema has to be consistent across both datasets for the model to make sense of the features.
The correct way is to do:
fit_transform on the training data
transform on the test data
This way, you will get a consistent number of columns.
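A sketch with scikit-learn's OneHotEncoder (X_train_cat and X_new_cat are hypothetical categorical inputs); handle_unknown="ignore" also protects against categories that appear only in new data, and joblib persists the fitted schema between sessions:
import joblib
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
X_train_enc = encoder.fit_transform(X_train_cat)   # fit once, on training data only
joblib.dump(encoder, "onehot_encoder.joblib")      # save the fitted schema

# later, on new data: load and transform -- never re-fit
encoder = joblib.load("onehot_encoder.joblib")
X_new_enc = encoder.transform(X_new_cat)           # same ~300 columns as training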

How to efficiently join in data from TFRecords in Tensorflow

I need to efficiently join in a small amount of data when training a TensorFlow model on TFRecords. How can I do this lookup using information from the parsed TFRecord?
More details:
I am training a convolutional network on a large dataset using TFRecords. Each TFRecord contains the raw image along with the target label and some metadata about the image. Part of the training is that I need to standardize each image using a mean and std that are specific to a grouping of images. To do this in the past, I have hardcoded the mean and std into the TFRecord. They are then used like so in my parse_example, which is mapped over the Dataset in my input_fn:
def parse_example(..):
    # ...
    parsed = tf.parse_single_example(value, keys_to_features)
    image_raw = tf.decode_raw(parsed['image/raw'], tf.uint16)
    image = tf.reshape(image_raw, image_shape)
    image.set_shape(image_shape)
    # pull the hardcoded pixel mean and std from the parsed TFExample
    mean = parsed['mean']
    std = parsed['std']
    image = (tf.cast(image, tf.float32) - mean) / std
    # ...
    return image, label
While the above works and makes for fast training times, it is limiting in that I often want to change which mean and std I use. Rather than writing the mean and std into the TFRecords, I would prefer to look up the appropriate summary stats at training time. What this means is that when I train, I have a small Python dictionary in which I can look up the appropriate summary stats using information about the image parsed from the TFRecord. The problem I am running into is that I can't seem to use this Python dictionary in my TensorFlow graph. If I try to do the lookup directly, it doesn't work, because I have tensor objects instead of the actual primitives. This makes sense, since the input_fn is doing symbolic manipulation, constructing the computation graph for TensorFlow (right?). How do I get around this?
One thing I have tried is to create a lookup table from a dictionary like so:
def create_channel_hashtable(keys, values, default_val=-1):
    initializer = tf.contrib.lookup.KeyValueTensorInitializer(keys, values)
    return tf.contrib.lookup.HashTable(initializer, default_val)
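For example, it might be wired into parse_example like this (a sketch against the TF1 contrib API above; the group keys, statistics, and the 'image/group' feature are hypothetical, keys_to_features is defined as before, and the tables must be initialized with tf.tables_initializer() in a TF1 session):
import tensorflow as tf

keys = tf.constant(["group_a", "group_b"])
mean_table = create_channel_hashtable(keys, tf.constant([127.5, 80.3]), -1.0)
std_table = create_channel_hashtable(keys, tf.constant([20.1, 18.7]), -1.0)

def parse_example(value):
    parsed = tf.parse_single_example(value, keys_to_features)
    mean = mean_table.lookup(parsed["image/group"])  # string key parsed from the record
    std = std_table.lookup(parsed["image/group"])
    # ... standardize the image with the looked-up statistics ...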
The hashtables can be created and used in the parse_example function to do the lookup. This all "works", but it slows training down prohibitively. It may be worth noting that this training is being done on TPUs. With the original approach of using values from the TFRecords, training is very fast and isn't bottlenecked by IO; however, this changes when the hash lookup is used. What is the suggested way to handle these cases? While repackaging the TFRecords is doable, it seems silly when the data to be looked up is small and the lookup could be made efficient.
This question kind of covers this topic:
How can I merge multiple tfrecords file into one file?
It seems like you would save the TFRecords to files and then use TFRecordDataset to pull them all into one dataset. The code given in the answer to the question I linked above is this:
dataset = tf.data.TFRecordDataset(filenames_to_read,
                                  compression_type=None,  # or 'GZIP'/'ZLIB' if you compress your data
                                  buffer_size=10240,  # any buffer size you want, or 0 for no buffering
                                  num_parallel_reads=os.cpu_count()  # or 0 to read sequentially
                                  )

How to use model.fit_generator in keras

When and how should I use fit_generator?
What is the difference between fit and fit_generator?
If you have prepared your data and labels in all necessary aspects and can simply assign them to arrays x and y, then use model.fit(x, y).
If you need to preprocess and/or augment your data while training, then you can take advantage of the generators that Keras provides.
You could, for example, augment images by applying random transforms (very helpful if you only have little data to train with), pad sequences, tokenize text, let Keras automagically read your data from a folder and assign appropriate classes (flow_from_directory), and much more.
See here for examples and boilerplate code for image preprocessing: https://keras.io/preprocessing/image/
or here for text preprocessing:
https://keras.io/preprocessing/text/
fit_generator will also help you train in a more memory-efficient way, since you load data only when it is needed. The generator function yields (i.e. "delivers") data to your model batch by batch, on demand.
Generators are useful for the on-the-fly augmentation the previous answer mentioned. This is not necessarily restricted to generators, however, because you can fit for one epoch, then augment your data, and fit again.
What does not work with fit, though, is using more data per epoch than fits in memory. If you have a 1 TB dataset and only 8 GB of RAM, you can use a generator to load the data on the fly and hold only a couple of batches in memory. This helps tremendously with scaling to huge datasets.
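A minimal sketch of such a generator, assuming the arrays were saved to .npy files beforehand (the file names and the already-compiled model are placeholders):
import numpy as np

def batch_generator(x_path, y_path, batch_size):
    # memory-map the arrays so only the slices actually touched are read from disk
    x = np.load(x_path, mmap_mode="r")
    y = np.load(y_path, mmap_mode="r")
    n = x.shape[0]
    while True:  # Keras expects the generator to loop forever
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield np.asarray(x[start:end]), np.asarray(y[start:end])

n_train = np.load("x_train.npy", mmap_mode="r").shape[0]
gen = batch_generator("x_train.npy", "y_train.npy", batch_size=32)
model.fit_generator(gen, steps_per_epoch=int(np.ceil(n_train / 32)), epochs=10)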
