Multi task classification CNN - python

i have a df which is constructed like this, i have 5 columns and i have to perform a multi task or multi output classification
Daytime Environment Filename Weather
day wet 2018-10.png light_fog [Example of one row]
my problem is when i have done the flow from dataframe, i don't know how to use tf.data.dataset in order to build the dataset. Someone suggest me to use TFRecords but i have never used it. How can i do without using it?
train_data_gen = ImageDataGenerator(rescale=1. / 255)
train_gen = train_data_gen.flow_from_dataframe(train_df,
directory=dataset_dir,
x_col="Filename",
y_col=["Daytime", "Weather", "Environment"],
class_mode="multi_output",
target_size=(img_size, img_size),
batch_size=batch_size,
shuffle=True,
seed=SEED)

According to the tutorial (https://www.tensorflow.org/tutorials/load_data/pandas_dataframe#with_tfdata), you should be able to do something similar:
filenames = df.pop('Filename')
feature_names = ["Daytime", "Weather", "Environment"]
features = df[numeric_feature_names]
shuffle_buffer_size = len(df.index) # Or whatever shuffle size you want to use
ds = tf.data.Dataset.from_tensor_slices((filenames, features))
ds = ds.shuffle(buffer_size=shuffle_buffer_size, reshuffle_each_iteration=True)
# Performs the function for each pair of a filename and its corresponding features
ds = ds.map(prepare_image)
dataset = ds.batch(batch_size)
# Prefetch is just for optimization
dataset = ds.prefetch(buffer_size=1)
def prepare_image(filepath, features):
# You can do whatever transformations in here.
file_content = tf.io.read_file(image_file_path)
image = tf.io.decode_png(file_content)
return image, features
Note that this should just give you the idea. I did not test it, so it may throw an error or two.

Related

Tensorflow Data API: repeat()

The following code is a snippet from "Hands on machine learning with scimitar-learn, Keras and tensorflow".
I understand everything in the following code, except the chaining the .repeat(repeat) function in second line.
I know that repeat is repeating the dataset elements (i.e., in this case the file paths) and if the argument is set to None or left empty the repetition will continues forever until the function which using it decides when to stop.
As you can see in the code bellow, the author is setting the repeat() argument to None;
1 - basically I want to know why the author decided to do such?
2 - or is it because the code is trying to simulate a situation which the dataset is not fitting in the memory, if this is the case then in real situation we should avoid repeat(), am I correct?
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
n_read_threads=None, shuffle_buffer_size=10000,
n_parse_threads=5, batch_size=32):
dataset = tf.data.Dataset.list_files(filepaths, seed = 42).repeat(repeat)
dataset = dataset.interleave(
lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
cycle_length = n_readers, num_parallel_calls = n_read_threads)
dataset = dataset.shuffle(shuffle_buffer_size)
dataset = dataset.map(preprocess, num_parallel_calls = n_parse_threads)
dataset = dataset.batch(batch_size)
return dataset.prefetch(1)
train_set = csv_reader_dataset(train_filepaths, repeat = None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential([
keras.layers.InputLayer(input_shape = X_train.shape[-1: ]),
keras.layers.Dense(30, activation = 'relu'),
keras.layers.Dense(1)
])
m_loss = keras.losses.mean_squared_error
m_optimizer = keras.optimizers.SGD(lr = 1e-3)
batch_size = 32
model.compile(loss = m_loss, optimizer = m_optimizer, metrics = ['accuracy'])
model.fit(train_set, steps_per_epoch = len(X_train) // batch_size, epochs = 10, validation_data = valid_set)
For your questions, I think:
tf.data API won't easily lead to out-of-memory easily as it loads data given the file paths or tfrecrods (compressed mode). Hence, repeat() does not thing with memory here; instead, it is used for data-transforming.
I have to use repeat(#) when setting steps_per_epoch to #. Say your batch_num = 32, and steps_per_epoch = 100//32 = 3 -> require 3 * 32 = 96 samples per epoch but your data has 80 samples only. Then, I have to use data.repeat(2) to have totally 160 samples that 80 samples in repeat_1 and the first 16 samples in repeat_2 would be used within 1 epoch. This is to prevent the error Input run out of data.
I had another copy of the same question on book's author git repo.
The issue is clarified; it was due to a bug in Keras 2.0.
Read more on: https://github.com/ageron/handson-ml2/issues/407

How to split images into test and train set using my own data in TensorFlow

I am a little confused here... I just spent the last hour reading about how to split my dataset into test/train in TensorFlow. I was following this tutorial to import my images: https://www.tensorflow.org/tutorials/load_data/images. Apparently one can split into train/test with sklearn: model_selection.train_test_split .
But my question is: when do I split my dataset into train/test. I already have done this with my dataset (see below), now what? How do I split it? Do I have to do it before loading the files as tf.data.Dataset?
# determine names of classes
CLASS_NAMES = np.array([item.name for item in data_dir.glob('*') if item.name != "LICENSE.txt"])
print(CLASS_NAMES)
# count images
image_count = len(list(data_dir.glob('*/*.png')))
print(image_count)
# load the files as a tf.data.Dataset
list_ds = tf.data.Dataset.list_files(str(cwd + '/train/' + '*/*'))
Also, my data structure looks like the following. No test folder, no val folder. I would need to take 20% for test from that train set.
train
|__ class 1
|__ class 2
|__ class 3
You can use tf.keras.preprocessing.image.ImageDataGenerator:
image_generator = tf.keras.preprocessing.image.ImageDataGenerator(validation_split=0.2)
train_data_gen = image_generator.flow_from_directory(directory='train',
subset='training')
val_data_gen = image_generator.flow_from_directory(directory='train',
subset='validation')
Note that you'll probably need to set other data-related parameters for your generator.
UPDATE: You can obtain two slices of your dataset via skip() and take():
val_data = data.take(val_data_size)
train_data = data.skip(val_data_size)
If you have all data in same folder and wanted to split into validation/testing using tf.data then do the following:
list_ds = tf.data.Dataset.list_files(str(cwd + '/train/' + '*/*'))
image_count = len(list(data_dir.glob('*/*.png')))
val_size = int(image_count * 0.2)
train_set = list_ds.skip(val_size)
val_set = list_ds.take(val_size)

Custom Datagenerator

I have a custom file containing the paths to all my images and their labels which I load in a dataframe using:
MyIndex=pd.read_table('./MySet.txt')
MyIndex has two columns of interest ImagePath and ClassName
Next I do some train test split and encoding the output labels as:
images=[]
for index, row in MyIndex.iterrows():
img_path=basePath+row['ImageName']
img = image.load_img(img_path, target_size=(299, 299))
img_path=None
img_data = image.img_to_array(img)
img=None
images.append(img_data)
img_data=None
images[0].shape
Classes=Sample['ClassName']
OutputClasses=Classes.unique().tolist()
labels=Sample['ClassName']
images=np.array(images, dtype="float") / 255.0
(trainX, testX, trainY, testY) = train_test_split(images,labels, test_size=0.10, random_state=42)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size=0.10, random_state=41)
images=None
labels=None
encoder = LabelEncoder()
encoder=encoder.fit(OutputClasses)
encoded_Y = encoder.transform(trainY)
# convert integers to dummy variables (i.e. one hot encoded)
trainY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(valY)
# convert integers to dummy variables (i.e. one hot encoded)
valY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
encoded_Y = encoder.transform(testY)
# convert integers to dummy variables (i.e. one hot encoded)
testY = to_categorical(encoded_Y, num_classes=len(OutputClasses))
datagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25)
datagen.fit(trainX,augment=True)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
batch_size=128
model.fit_generator(datagen.flow(trainX,trainY,batch_size=batch_size), epochs=500,
steps_per_epoch=trainX.shape[0]//batch_size,validation_data=(valX,valY))
The problem I face that the data loaded in one go is too large to fit in current machine memory and so I am unable to work with the complete dataset.
I have tried to work with the datagenerator but do not want to follow he directory conventions it follows and also cannot eradicate the augmentation part.
The question is that is there a way to load batches from the disk ensuring the two stated conditions.
I believe you should have a look at this post
What you are looking for is Keras flow_from_dataframe that let you load the batches from disk by providing the names of your files and their labels in a dataframe and also providing a top directory path that contains all your images.
Making a bit of midifications in your code and borrowing some from the link shared:
MyIndex=pd.read_table('./MySet.txt')
Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()
trainDf=MyIndex[['ImageName','ClassName']]
train, test = train_test_split(trainDf, test_size=0.10, random_state=1)
#creating a data generator to load the files on runtime
traindatagen=ImageDataGenerator(rotation_range=90,horizontal_flip=True,vertical_flip=True,width_shift_range=0.25,height_shift_range=0.25,
validation_split=0.1)
train_generator=traindatagen.flow_from_dataframe(
dataframe=train,
directory=basePath,#the directory containing all your images
x_col='ImageName',
y_col='ClassName',
class_mode='categorical',
target_size=(299, 299),
batch_size=batch_size,
subset='training'
)
#Also a generator for the validation data
val_generator=traindatagen.flow_from_dataframe(
dataframe=train,
directory=basePath,#the directory containing all your images
x_col='ImageName',
y_col='ClassName',
class_mode='categorical',
target_size=(299, 299),
batch_size=batch_size,
subset='validation'
)
STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=val_generator.n//val_generator.batch_size
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit_generator(generator=train_generator, steps_per_epoch=STEP_SIZE_TRAIN,
validation_data=val_generator,
validation_steps=STEP_SIZE_VALID,
epochs=500)
Also note now you do not need the encoding of the labels as you had in your original code and also omit the image loading code.
I have not tried this code itself so try to fix any bugs you may encounter, as the primary focus was to deliver you the basic idea.
In response to your comment:
If you have all files in different directories then one solution would be to have your ImagesName to store the relative path including the intermediate directory in path something like './Dir/File.jpg' and then move all the directories to one folder and use the one as base path and everything else stays the same.
Also looking at your code segment that loaded the files look like you already have file paths stored in ImageName column so the suggested approach should work for you.
images=[]
for index, row in MyIndex.iterrows():
img_path=basePath+row['ImageName']
img = image.load_img(img_path, target_size=(299, 299))
img_path=None
img_data = image.img_to_array(img)
img=None
images.append(img_data)
img_data=None
In case if still some ambiguity exists feel free to ask again.
I think the simplest way to do this would be to just load part of your images per each generator and repeatedly call .fit_generator() with that smaller batch.
This example uses `random.random()` to choose which images to load – you could use something more sophisticated.
The previous version used random.random(), but we can just as well use a start index and page size like in this revised version to loop over the list of images forever.
import itertools
def load_images(start_index, page_size):
images = []
for index in range(page_size):
# Generate index using modulo to loop over the list forever
index = (start_index + index) % len(rows)
row = MyIndex[index]
img_path = basePath + row["ImageName"]
img = image.load_img(img_path, target_size=(299, 299))
img_data = image.img_to_array(img)
images.append(img_data)
return images
def generate_datagen(batch_size, start_index, page_size):
images = load_images(start_index, page_size)
# ... everything else you need to get from images to trainX and trainY, etc. here ...
datagen = ImageDataGenerator(
rotation_range=90,
horizontal_flip=True,
vertical_flip=True,
width_shift_range=0.25,
height_shift_range=0.25,
)
datagen.fit(trainX, augment=True)
return (
trainX,
trainY,
valX,
valY,
datagen.flow(trainX, trainY, batch_size=batch_size),
)
model.compile(
loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)
page_size = (
500
) # load 500 images at a time; change this as suitable for your memory condition
for page in itertools.count(): # Count from zero to forever.
batch_size = 128
trainX, trainY, valX, valY, generator = generate_datagen(
128, page * page_size, page_size
)
model.fit_generator(
generator,
epochs=5,
steps_per_epoch=trainX.shape[0] // batch_size,
validation_data=(valX, valY),
)
# TODO: add a `break` clause with a suitable condition
If you want to load from the disk it is convenient to do with ImageDataGenerator that you used.
There are two ways to do it. By stating the directory of the data with flow_from_directory. Alternatively you can use flow_from_dataframe with Pandas dataframe
If you want to have a list of paths you should not use a custom generator that yields batches of images. Here is a stub:
def load_image_from_path(path):
"Loading and preprocessing"
...
def my_generator():
length = df.shape[0]
for i in range(0, length, batch_size)
batch = df.loc[i:min(i+batch_size, length-1)]
x, y = map(load_image_from_path, batch['ImageName']), batch['ClassName']
yield x, y
Note: in fit_generator there is an additional generator named validation_data for well you guessed it - validation.
One option is to pass the generators the indices to choose from in order to split train and test (assuming the data is shuffled, if not check this out).

Preprocess huge data with a custom data generator function for keras

Actually I'm building a keras model and I have a dataset in the msg format with over 10 million instances with 40 features which are all categorical. For the moment i'm using just a sample of it since reading all the dataset and encoding it is impossbile to fit into the memory. Here a part of the code i'm using:
import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
def model():
model = Sequential()
model.add(Dense(120, input_dim=233, kernel_initializer='uniform', activation='selu'))
model.add(Dense(12, kernel_initializer='uniform', activation='sigmoid'))
model.compile(SGD(lr=0.008),loss='mean_squared_error', metrics=['accuracy'])
return model
def addrDataLoading():
data=pd.read_msgpack('datum.msg')
data=data.dropna(subset=['s_address','d_address'])
data=data.sample(300000) # taking a sample of all the dataset to make the encoding possible
y=data[['s_address','d_address']]
x=data.drop(['s_address','d_address'],1)
encX = be().fit(x, y)
numeric_X= encX.transform(x)
encY=be().fit(y,y)
numeric_Y=encY.transform(y)
scaler=StandardScaler()
X_all=scaler.fit_transform(numeric_X)
x_train=X_all[0:250000,:]
y_train=numeric_Y.iloc[0:250000,:]
x_val=X_all[250000:,:]
y_val=numeric_Y.iloc[250000:,:]
return x_train,y_train,x_val,y_val
x_train,y_train,x_val,y_val=addrDataLoading()
model.fit(x_train, y_train,validation_data=(x_val,y_val),nb_epoch=20, batch_size=200)
So my question is how to use a custom data generator function to read and process all the data I have and not just a sample, and then use fit_generator() function to train my model?
EDIT
This is a sample of the data:
netData
I think that taking different samples from the data results in different encoding dimensions.
For this sample there's 16 different categories: 4 addresses (3 bit), 4 hostnames (3 bit ), 1 subnetmask (1 bit), 5 infrastructures (3 bit ), 1 accesszone(1 bit ), so the binary encoding will give us 11 bit and the new dimension of the data is 11 previously 5. So let's say for another sample in the address column we have 8 different categories this will give 4 bit in binary and we let the same number of categories in the other columns so the overall encoding will result in 12 dimensions. I believe that what's causing the problem.
Slightly slow solution (repeating the same actions)
Edit - fit BinatyEncoder before create generators
Drop NA first and work with clean data further to avoid reassignments of data frame.
data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')
In this solution data_generator can process same data multiple times. If it's not critical, you can use this solution.
Define function which reads the data snd splits index to train and test. It won't consume a lot of memory.
import pandas as pd
from category_encoders import BinaryEncoder as be
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
def model():
#some code defining the model
def train_test_index_split():
# if there's enough memory to add one more column
data = pd.read_msgpack('datum_cleaned.msg')
train_idx, test_idx = train_test_split(data.index)
return data, train_idx, test_idx
data, train_idx, test_idx = train_test_index_split()
Define and initialize data generator, both for train and validation
def data_generator(data, encX, encY, bathc_size, n_steps, index):
# EDIT: As the data was cleaned, you don't need dropna
# data = data.dropna(subset=['s_address','d_address'])
for i in range(n_steps):
batch_idx = np.random.choice(index, batch_size)
sample = data.loc[batch_idx]
y = sample[['s_address', 'd_address']]
x = sample.drop(['s_address', 'd_address'], 1)
numeric_X = encX.transform(x)
numeric_Y = encY.transform(y)
scaler = StandardScaler()
X_all = scaler.fit_transform(numeric_X)
yield X_all, numeric_Y
Edited part now train binary encoders. You should sub-sample your data to create representative training set for encoders. I guess error with the shape of the data was caused by incorrecly trained BinaryEncoder (Error when checking input: expected dense_9_input to have shape (233,) but got array with shape (234,)):
def get_minimal_unique_frame(df):
return (pd.Series([df[column].unique() for column in df], index=df.columns)
.apply(pd.Series) # tranform list on unique values to pd.Series
.T # transope frame: columns is columns again
.fillna(method='ffill')) # fill NaNs with last value
x = get_minimal_unique_frame(data.drop(['s_address', 'd_address'], 1))
y = get_minimal_unique_frame(data[['s_address', 'd_address']])
NB: I never used category_encoders and have incompatible system configuration, so can't install and check it. So, former code can evoke problems. In that case, I guess, you should compare length of x and y data frames and make it the same, and probaly change an index of data frames.
encX = be().fit(x, y)
encY = be().fit(y, y)
batch_size = 200
train_steps = 100000
val_steps = 5000
train_gen = data_generator(data, encX, encY, batch_size, train_steps, train_idx)
test_gen = data_generator(data, encX, encY, batch_size, test_steps, test_idx)
Edit Please provide an exapmple of x_sample, run train_gen and save output, and post x_samples, y_smaples:
x_samples = []
y_samples = []
for i in range(10):
x_sample, y_sample = next(train_gen)
x_samples.append(x_sample)
y_samples.append(y_sample)
Note: data generator won't stop itself. But itt will be stopped after train_steps by fit_generator method.
Fit model with generators:
model.fit_generator(generator=train_gen, steps_per_epoch=train_steps, epochs=1,
validation_data=test_gen, validation_steps=val_steps)
As far as I know, python does not copy pandas dataframes if you won't do it explicitply with copy() or so. Because of it, both generators use the same object. But if you use Jupyter Notebook, data leaks/uncollected carbage may occur, and a memory troubles comes with them.
More efficient solution - scketch
Clean your data
data = pd.read_msgpack('datum.msg')
data.dropna(subset=['s_address','d_address']).to_msgpack('datum_clean.msg')
Create train/test split, preprocess it and store as numpy array, if you have enough disk space.
data, train_idx, test_idx = train_test_index_split()
def data_preprocessor(data, path, index):
# data = data.dropna(subset=['s_address','d_address'])
sample = data.loc[batch_idx]
y = sample[['s_address', 'd_address']]
x = sample.drop(['s_address', 'd_address'], 1)
encX = be().fit(x, y)
numeric_X = encX.transform(x)
encY = be().fit(y, y)
numeric_Y = encY.transform(y)
scaler = StandardScaler()
X_all = scaler.fit_transform(numeric_X)
np.save(path + '_X', X_all)
np.save(path + '_y', numeric_Y)
data_preprocessor(data, 'train', train_idx)
data_preprocessor(data, 'test', test_idx)
Delete unnecessary data:
del data
Load your files and use following generator:
train_X = np.load('train_X.npy')
train_y = np.load('train_y.npy')
test_X = np.load('test_X.npy')
test_y = np.load('test_y.npy')
def data_generator(X, y, batch_size, n_steps):
idxs = np.arange(len(X))
np.random.shuffle(idxs)
ptr = 0
for _ in range(n_steps):
batch_idx = idxs[ptr:ptr+batch_size]
x_sample = X[batch_idx]
y_sample = y[batch_idx]
ptr += batch_size
if ptr > len(X):
ptr = 0
yield x_sapmple, y_sample
Prepare generators:
train_gen = data_generator(train_X, train_y, batch_size, train_steps)
test_gen = data_generator(test_X, test_y, batch_size, test_steps)
And fit the model finaly. Hope one of this solutions will help. At least if python does pass arrays and data frames buy reference, not by value. Stackoverflow answer about it.

How to use the iterator 'make_initializable_iterator' of tensorflow within a 'input_fn'?

I want to train my mode with a tf.estimator.Estimator and load my data by Dataset API.Because my data,for example 'mnist', is a array(tensor),so I try to load it with 'tf.data.Dataset.from_tensor_slices'.But I don't how to initialize 'make_initializable_iterator' within a 'input_fn'.
If I can use 'make_one_shot_iterator' to train successfully, but it load slowly before training. And 《Higher-Level APIs in TensorFlow》is a good example to 'make_initializable_iterator' within a 'input_fn',but it needs to return a 'iterator_initializer_hook' to other function from 'input_fn' . I want to know is there any other better or more elegant way?
def input_fn():
mnist_data = input_data.read_data_sets('mnist_data', one_hot=False)
images = mnist_data.train.images.reshape([-1, 28, 28, 1])
labels = np.asarray(mnist_data.train.labels, dtype=np.int64)
# Build dataset iterator
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.repeat(None) # Infinite iterations
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(100)
iterator = dataset.make_one_shot_iterator()
next_example = iterator.get_next()
# Set runhook to initialize iterator
return next_example
In TensorFlow version 1.5 and later, the tf.estimator.Estimator will automatically create and initialize an initializable iterator when you return a tf.data.Dataset from your input_fn. This enables you to write the following code, without having to worry about initialization or hooks:
def input_fn():
mnist_data = input_data.read_data_sets('mnist_data', one_hot=False)
images = mnist_data.train.images.reshape([-1, 28, 28, 1])
labels = np.asarray(mnist_data.train.labels, dtype=np.int64)
# Build dataset.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.repeat(None) # Infinite iterations
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(100)
return dataset
Inside your code, add this:
self.hooks.append(utils_hooks.DatasetHook(iter))
In the run_loop.py, before the call into your fn, add this
for hook in dataset_hooks:
sess.run(hook.iterator().initializer)
Then, it should be fine.

Categories

Resources