I have to classify inputs of shape 32x32 into 3 classes using a TF2 Keras model. My training set has 7000 examples:
>>> X_train.shape # (7000, 32, 32)
>>> Y_train.shape # (7000, 3)
The number of examples for each class varies (e.g. class_0 has ~2500 examples while class_1 has ~800, etc.)
I want to use the tf.data API to create a dataset object that returns batches of training data with the number of examples from each class specified by [n_0, n_1, n_2].
These n_i samples per class should be randomly drawn with replacement from X_train, Y_train.
For example, calling get_batch([100, 150, 125]) should return 100 random samples of X_train from class_0, 150 from class_1, and 125 from class_2.
How can I achieve this with the TF2.0 Data API so that I can use it to train a Keras model?
One possible approach is to proceed as follows:
Load the data from X_train & Y_train into a single tf.data Dataset so that each X stays matched with the correct Y
.shuffle() the dataset, then split it into one dataset per class using filter()
Write a get_batch function that takes the correct number of samples from each per-class dataset, shuffle()s the combined sample, then splits it back into X & Y
Something like this:
# 1: Load the data into a Dataset
import tensorflow as tf

raw_data = tf.data.Dataset.zip(
    (
        tf.data.Dataset.from_tensor_slices(X_train),
        tf.data.Dataset.from_tensor_slices(Y_train)
    )
).shuffle(7000)
# 2: Split for each category
def get_filter_fn(n):
    def filter_fn(x, y):
        # keep examples whose one-hot label has a 1 at index n
        return tf.equal(1.0, y[n])
    return filter_fn
n_0s = raw_data.filter(get_filter_fn(0))
n_1s = raw_data.filter(get_filter_fn(1))
n_2s = raw_data.filter(get_filter_fn(2))
# 3: Sample, shuffle, then split back into X & Y
def get_batch(n_0, n_1, n_2):
    sample = n_0s.take(n_0).concatenate(n_1s.take(n_1)).concatenate(n_2s.take(n_2))
    # cache() pins down one random draw so the two map() views below
    # see the same elements in the same order
    shuffled = sample.shuffle(n_0 + n_1 + n_2).cache()
    return shuffled.map(lambda x, y: x), shuffled.map(lambda x, y: y)
So now we can do:
x_batch, y_batch = get_batch(100, 150, 125)
Note that I've used some potentially wasteful operations here in pursuit of an approach I find intuitive and straightforward (specifically, reading the raw_data dataset three times for the filter operations), so I make no claim that this is the most efficient way to accomplish what you need. For a dataset that fits in memory, like the one you describe, such inefficiencies should be negligible.
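And since Keras can consume a tf.data dataset directly, one way to wire this into training is sketched below (`model` stands in for your compiled Keras model):

x_batch, y_batch = get_batch(100, 150, 125)
train_ds = tf.data.Dataset.zip((x_batch, y_batch)).batch(100 + 150 + 125)
model.fit(train_ds, epochs=1)  # one pass over this stratified batch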
scikit-learn's train_test_split actually has a parameter for that. While it doesn't let you pick the exact number of samples, it selects them proportionally from the classes.
from sklearn.model_selection import train_test_split

X_train_stratified, X_test_stratified, y_train_strat, y_test_strat = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train)
If you want to do cross-validation, you can also use StratifiedShuffleSplit.
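A minimal sketch of that, assuming the one-hot labels are first converted to class indices (variable names are placeholders):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

class_indices = np.argmax(Y_train, axis=1)  # one-hot -> integer class labels
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, val_idx in sss.split(X_train, class_indices):
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    Y_tr, Y_val = Y_train[train_idx], Y_train[val_idx]
    # ... fit and evaluate your model on each split ...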
I hope I understood your question correctly.
I'm training my keras dense models on very large datasets.
For practical reasons, I save it to disk as separate .txt files: I have 1e4 text files, each containing 1e4 examples.
I would like to find a way to fit my keras model on this dataset as a whole. For now, I am only able to use "model.fit" on individual text files, i.e.:
for k in range(10000):
    X = np.loadtxt('/path/X_' + str(k) + '.txt')
    Y = np.loadtxt('/path/Y_' + str(k) + '.txt')
    mod = model.fit(x=X, y=Y, batch_size=batch_size, epochs=epochs)
This is problematic if, for instance, I want to perform several epochs over the whole dataset.
Ideally, I would like to have a dataloader function that could be used in the following way to feed all the sub-datasets as a single one:
mod = model.fit(dataloader('/path/'), batch_size=batch_size, epochs=epochs)
I think I found what I want, but only for datasets composed of images: tf.keras.preprocessing.image.ImageDataGenerator.flow_from_directory
Is there any tf/keras function doing something similar, but for datasets which are not composed of images?
Thanks!
You can create a generator function and then build a TensorFlow Dataset from it with the from_generator method; see below for a dummy example:
import numpy as np
import tensorflow as tf
from tensorflow.data import Dataset

def mygenerator():
    for k in range(1000):
        # cast to the dtypes declared in output_signature below
        x = np.random.normal(size=(1000,)).astype(np.float32)
        y = np.random.randint(low=0, high=5, size=1000).astype(np.int32)
        yield x, y

mydataset = Dataset.from_generator(
    mygenerator,
    output_signature=(
        tf.TensorSpec(shape=(1000,), dtype=tf.float32),
        tf.TensorSpec(shape=(1000,), dtype=tf.int32),
    ),
)
mytraindata = mydataset.batch(batch_size)
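Applied to the file layout from your question, a sketch might look like this (the shapes, dtypes, and num_features are assumptions to adjust to your data):

import numpy as np
import tensorflow as tf

def file_generator(n_files=10000, path='/path/'):
    # load one X/Y file pair at a time, so the full dataset never sits in memory
    for k in range(n_files):
        X = np.loadtxt(path + 'X_' + str(k) + '.txt').astype(np.float32)
        Y = np.loadtxt(path + 'Y_' + str(k) + '.txt').astype(np.float32)
        for x, y in zip(X, Y):
            yield x, y

# num_features stands for the width of one X row -- adjust to your files
dataset = tf.data.Dataset.from_generator(
    file_generator,
    output_signature=(
        tf.TensorSpec(shape=(num_features,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
)
mod = model.fit(dataset.batch(batch_size), epochs=epochs)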
Following this tutorial: https://www.tensorflow.org/versions/r1.3/get_started/mnist/pros
I wanted to solve a classification problem with labeled images by myself. Since I'm not using the MNIST database, I spent days creating my own dataset inside tensorflow. It looks like this:
#variables
batch_size = 50
dimension = 784
stages = 10
#step 1 read Dataset
filenames = tf.constant(filenamesList)
labels = tf.constant(labelsList)
#step 2 create Dataset
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
#step 3: parse every image in the dataset using `map`
def _parse_function(filename, label):
    # convert label to one-hot encoding
    one_hot = tf.one_hot(label, stages)
    # read image file
    image_string = tf.read_file(filename)
    image_decoded = tf.image.decode_image(image_string, channels=3)
    image = tf.cast(image_decoded, tf.float32)
    return image, one_hot
#step 4 final input tensor
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
images, labels = iterator.get_next()
images = tf.reshape(images, [batch_size,dimension]).eval()
labels = tf.reshape(labels, [batch_size,stages]).eval()
for _ in range(10):
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    images, labels = iterator.get_next()
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    train_step.run(feed_dict={x: images, y_: labels})
Somehow using a higher batch_size crashes Python. What I'm trying to do is train my neural network with a new batch on each iteration; that's why I'm also using dataset.shuffle(...). Using dataset.shuffle also crashes my Python.
What I wanted to do (because shuffle breaks) is to batch the whole dataset. By evaluating ('.eval()') I get a numpy array. I will then shuffle the array with numpy.random.shuffle(images) and pick the first few elements to train on.
e.g.
for _ in range(1000):
    images = tf.reshape(images, [batch_size, dimension]).eval()
    labels = tf.reshape(labels, [batch_size, stages]).eval()
    # shuffle
    np.random.shuffle(images)
    np.random.shuffle(labels)
    train_step.run(feed_dict={x: images[0:train_size], y_: labels[0:train_size]})
But then comes the problem that I can't batch my whole dataset. It looks like the data is too big for Python to work with.
How should I solve this differently?
Since I'm not using the MNIST database, there isn't a function like mnist.train.next_batch(100), which would come in handy here.
Notice how you call shuffle and batch inside your for loop? This is wrong. Datasets in TF work in the style of functional programming, so you are actually defining a pipeline for preprocessing the data to feed into your model. In a way, you give a recipe that answers the question "given this raw data, which operations (map, etc.) should I do to get batches that I can feed into my neural network?"
Now you are modifying that pipeline for every batch! What happens is that in the first iteration, the batch shape is, say, [32, 3600]. In the next iteration, elements of this shape are batched again, into [32, 32, 3600], and so on.
There's a great tutorial on the TF website where you can find out more about how Datasets work, but here are a few suggestions for resolving your problem, followed by a sketch that ties them together.
Move the shuffling to right after "Step 2" in your code. Then you are shuffling the whole dataset, so your batches will have a good mixture of examples. Also increase the buffer_size argument; it works differently than you probably assume. Shuffling as early as possible is usually a good idea, because it can be a slow operation on a large dataset: the shuffled part of the dataset has to be read into memory. Here it does not matter much whether you shuffle the filenames and labels or the decoded images and labels, but the latter means more work, since the dataset elements are larger by that point.
Move batching and the iterator creation to the last steps, just before starting your training loop.
Don't use feed_dict with Dataset iterators to input data into your model. Instead, define your model in terms of the outputs of iterator.get_next() and omit the feed_dict argument. See more details in this Q&A: Tensorflow: create minibatch from numpy array > 2 GB
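Putting these together, a hedged sketch of the reordered pipeline, keeping the TF 1.x style of your code (num_iterations is a placeholder):

# build the input pipeline once, before the training loop
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(buffer_size=len(filenamesList))  # shuffle early, over everything
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.repeat()  # keep yielding batches across epochs
iterator = dataset.make_one_shot_iterator()
images, batch_labels = iterator.get_next()

# define the model on the iterator outputs instead of feeding placeholders
x = tf.reshape(images, [-1, dimension])
y_ = batch_labels
# ... build train_step from x and y_ ...
for _ in range(num_iterations):
    train_step.run()  # no feed_dict needed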
I've been running into a lot of problems creating TensorFlow datasets, so I decided to use OpenCV to import images.
import cv2
import numpy as np

imgDataset = []
for i in range(len(files)):
    imgDataset.append(cv2.imread(files[i]))
imgDataset = np.asarray(imgDataset)
The shape of imgDataset is (num_img, height, width, col_channels), and the i-th image is imgDataset[i].
Shuffling the dataset and taking a batch of it can be done like this:
from sklearn.utils import shuffle

X, y = shuffle(X, y)
X_feed = X[:batch_size]  # first batch_size rows after shuffling
y_feed = y[:batch_size]
Then you feed X_feed and y_feed into your model.
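To cover the whole dataset rather than just one batch, a minimal sketch of an epoch loop (assuming `model` is a compiled Keras model and num_epochs is a placeholder) could be:

from sklearn.utils import shuffle

for epoch in range(num_epochs):
    X, y = shuffle(X, y)  # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        X_feed = X[start:start + batch_size]
        y_feed = y[start:start + batch_size]
        model.train_on_batch(X_feed, y_feed)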
I'm trying to implement an NN model with pairwise samples. Details are as follows:
Original data:
X_org with shape of (100, 50) for example, namely 100 samples with 50 features.
Y_org with shape of (100, 1).
Processing these original data for real training:
Select 2 samples from X_org randomly (so we have 100*99/2 such combinations) to form a new 'pairwise' sample; the prediction target, namely the new y label, is the difference of the two corresponding Y_org labels (Y_org_sample1 - Y_org_sample2). Now we have new X_train and Y_train.
I need an NN model (DNN, CNN, LSTM, whatever...) into which I can pass the first sub-sample of one pairwise sample from X_train to get one result, and do the same for the second sub-sample. By taking the difference of the two results, I get the prediction for this pairwise sample. This prediction is then compared with the corresponding Y label from Y_train.
Overall, I need to train a model (update the weights) after feeding it a 'pairwise' sample (two successive sub-samples). The reason I don't choose a 'two-arm' model (e.g. merging two arms with xxx.sub()) is that at test time I will only feed one sub-sample; I will ultimately use the model to predict single sub-samples.
So I will use the data from X_train during the training step, and X_org-like data during the test step. It looks a bit complex.
It looks like TensorFlow would be more feasible for this task; if Keras also works, please kindly share your ideas.
You can first create a model that will take only one X_org-like element:
#create a model the way you like it, it can be Functional API or Sequential, no problem
xOrgModel = createAModelForXOrgData(...)
Now, let's create a second model, this time necessarily with the functional API, that works with both inputs:
from keras.models import Model
from keras.layers import Input, Subtract
input1 = Input(shapeOfInput)
input2 = Input(shapeOfInput)
output1 = xOrgModel(input1)
output2 = xOrgModel(input2)
output = Subtract()([output1,output2])
pairWiseModel = Model([input1,input2],output)
Now you have two models: xOrgModel and pairWiseModel. You can use either of them depending on the task at hand.
The two models share their weights: training one will update the other as well.
Using the pairwise model
First, organize your data in two separate arrays (because our model takes two inputs):
import numpy as np

L = len(X_org)
x1 = []
x2 = []
y = []
for i in range(L):
    for j in range(i + 1, L):
        x1.append(X_org[i])
        x2.append(X_org[j])
        y.append(Y_org[i] - Y_org[j])
x1 = np.array(x1)
x2 = np.array(x2)
y = np.array(y)
Train and predict with a list of inputs:
pairWiseModel.fit([x1,x2],y,...)
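For example (a sketch: the optimizer and loss are assumptions, and X_test_single is a placeholder for your X_org-like test data):

pairWiseModel.compile(optimizer='adam', loss='mse')
pairWiseModel.fit([x1, x2], y, epochs=10, batch_size=32)

# because the weights are shared, the single-arm model is already trained
predictions = xOrgModel.predict(X_test_single)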
I'm trying to learn LSTMs. I have taken web courses, read this book (https://machinelearningmastery.com/lstms-with-python/) and a lot of blogs... but I'm completely stuck. My interest is in multivariate LSTMs and I have read all I can find, but I still can't get it. I don't know if I'm stupid or what...
If this exact question with a good answer already exists, then I'm sorry for double-posting, but I have looked and haven't found it...
Since I really want to know the basics, I created a dummy dataset in Excel where every "y" depends on the sum of the inputs x1 and x2, but also over time. As I understand it, this is a many-to-one scenario.
Pseudo code:
x1(t) = sin(A(t))
x2(t) = cos(A(t))
tmp(t) = x1(t) + x2(t) (dummy variable)
y(t) = tmp(t) + tmp(t-1) + tmp(t-2) (i.e. sum over the last three steps)
(Basically I want to predict y(t) given x1 and x2 over three time steps)
This is then exported to a csv file with columns x1, x2, y
I have tried to code it up below but obviously it won't work.
I read the data and split it into 80/20 train and test sets as X_train, y_train, X_test, y_test with dimensions (217, 2), (217, 1), (54, 2), (54, 1).
What I really haven't gotten a grip on yet is what exactly timesteps and samples are, and how they are used in reshape and input_shape. In many code examples I have looked at, plain numbers are used rather than defined variables, which makes it very difficult to understand what is happening, especially if you want to change something. As an example, in one of the courses I took, the reshaping was coded like this:
X_train = np.reshape(X_train, (1257, 1, 1))
This doesn't provide much info...
Anyway, when I run the code below it says:
ValueError: cannot reshape array of size 434 into shape (217,3,2)
So I know what causes the error, just not what I need to do to fix it. If I set look_back=1 it works, but that's not what I want.
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
# Load data
data_set = pd.read_csv('../Data/LSTM_test.csv', sep=';')
"""
data loaded have three columns:
col 0, col 1: features (x)
col 2: y
"""
# Train/test and variable split
split = 0.8 # 80% train, 20% test
split_idx = int(data_set.shape[0]*split)
# ...train
X_train = data_set.values[0:split_idx,0:2]
y_train = data_set.values[0:split_idx,2]
# ...test
X_test = data_set.values[split_idx:-1,0:2]
y_test = data_set.values[split_idx:-1,2]
# Model setup
look_back = 3 # as that is how y was generated (i.e. sum last three steps)
num_features = 2 # in this case: 2 features x1, x2
output_dim = 1 # want to predict 1 y value
nb_hidden_neurons = 32 # assume something to start with
nb_epoch = 2 # assume something to start with
# Reshaping
nb_samples = len(X_train) # in this case 217 samples in the training set
X_train_reshaped = np.reshape(X_train,(nb_samples, look_back, num_features))
# Create model
model = Sequential()
model.add(LSTM(nb_hidden_neurons, input_shape=(look_back,num_features)))
model.add(Dense(units=output_dim))
model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(X_train_reshaped, y_train, batch_size = 32, epochs = nb_epoch)
print(model.summary())
Can anyone please explain what I have done wrong?
As I said, I have read a lot of blogs, questions, tutorials, etc., but if someone has a particularly good source of info, I'd love to check that one out too.
I also had this question before. On a higher level, in (samples, time steps, features):
samples is the number of data points, i.e. how many rows there are in your data set
time steps is the number of steps the model, or LSTM, is fed per sample
features is the number of columns of each sample
For me, a better example for understanding it comes from NLP. Suppose you have a sentence to process; then the sample count is 1, meaning one sentence to read. The time steps are the number of words in that sentence: you feed in the sentence word by word before the model has read all the words and gets the whole context of the sentence. The features are the dimension of each word, because in word embeddings like word2vec or GloVe, each word is represented by a vector with multiple dimensions.
The input_shape parameter in Keras is only (time_steps, num_features); for more, you can refer to this.
And your problem is that when you reshape data, the product of the new dimensions must equal the number of elements in the original data set, and 434 does not equal 217*3*2 = 1302.
When you implement an LSTM, you should be very clear about what the features are and what element you want the model to read at each time step. There is a very similar case here that will surely help you. For example, if you are trying to predict the value at time t using t-1 and t-2, you can either feed the two values in as one element to predict t, where (time_step, num_features) = (1, 2), or feed each value in over 2 time steps, where (time_step, num_features) = (2, 1).
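As a toy illustration of those two options (a sketch with made-up values):

import numpy as np

previous = np.array([0.5, 0.7])  # the values at t-1 and t-2

# option 1: one time step carrying two features -> LSTM input shape (1, 1, 2)
x_one_step = previous.reshape(1, 1, 2)

# option 2: two time steps carrying one feature each -> LSTM input shape (1, 2, 1)
x_two_steps = previous.reshape(1, 2, 1)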
That's basically how I understand this; I hope it's clear for you now.
You seem to have a decent grasp of what LSTM expects and are just struggling with getting your data into the correct format. You start with an X_train of shape (217, 2) and you want to reshape it such that it's in the shape (nb_samples, look_back, num_features). You have already defined look_back and num_features, and really all the work that's left is generating nb_samples chunks of length look_back from your original X_train. Numpy's reshape isn't really the tool for this; instead, you'll have to write some code.
import numpy as np

nb_samples = X_train.shape[0] - look_back
x_train_reshaped = np.zeros((nb_samples, look_back, num_features))
y_train_reshaped = np.zeros((nb_samples))
for i in range(nb_samples):
    y_position = i + look_back
    x_train_reshaped[i] = X_train[i:y_position]  # a window of look_back rows
    y_train_reshaped[i] = y_train[y_position]    # the target right after the window
model.fit(x_train_reshaped, y_train_reshaped, ...)
The shapes are now:
x_train_reshaped.shape
# (214, 3, 2)
y_train_reshaped.shape
# (214,)
You'll have to do the same thing with X_test and y_test.
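A quick way to sanity-check the alignment:

# the first window should be the first look_back rows of X_train,
# and its target the y value immediately after that window
print(x_train_reshaped[0])                       # == X_train[0:look_back]
print(y_train_reshaped[0], y_train[look_back])   # the two values should match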
This https://github.com/fchollet/keras/issues/2045 helped me.
But in short, the answer to your question:
you want to reshape a list with 434 elements into shape (217, 3, 2), but that's impossible. Let me show you why:
the new shape has 217*3*2 = 1302 elements, but you have 434 elements in the original list. Therefore, the solution is to change the dimensions of the reshape.
I'm attempting to use TFLearn (or TensorFlow directly) to reconstruct a noisy (as in harmonics) signal. My input has 168 columns to be converted into 84 outputs. I want to treat each column pair as a single pixel. I don't have to run in real time, so I want to use multiple rows of the input to generate a single row of output. I guess that I need at least 20 rows of input (10 on either side) to compute a single row of output. How do I reshape my data appropriately? E.g., see the comments:
def learn1(data, answers):
    # data.shape == 5000x168, answers.shape = 5003x84
    network = tflearn.input_data(shape=[None, 20, 84, 2])
    # ... set up 2D convolutional network ...
    model = tflearn.DNN(network)
    X = ...       # data reshaped into overlapping groups of 20 -- what goes here?
    Y = ...       # I don't have any labels. What goes here?
    Y_test = ...  # what goes here?
    model.fit(X, Y, n_epoch=50, shuffle=False,
              validation_set=(answers, Y_test), batch_size=10)
I can generate test data at will. Thanks for any help.