Need a concrete example of fit_generator() - python

I'm building a speech recognition model with an input shape of (56088, 22050, 1). The whole dataset can be loaded into memory from a .npy file (~5GB), but I wanted to find a better way. I came across the Keras fit_generator() method, but most examples are based on MNIST and use ImageDataGenerator(). I realised I had to write a custom generator function, but I wasn't sure how. Following this thread, I adapted its generator function into the code below, but I still have to load the entire dataset into memory, which takes a lot of time. I'm also not sure the program runs at all, because it produced no output during the first 20 minutes I let it run.
Is there any other way out?
import librosa
import glob
import tensorflow as tf
import os
import numpy as np

class_list, X_train, Y_train = [], [], []
filename = "D:\\SpeechRecognitionData\\train\\audio\\"
class_names = os.listdir(filename)
print(class_names)

for classes in class_names:
    if classes == '_background_noise_':
        continue
    else:
        class_list.append(''.join(filename+classes))
print(class_list, "\n", len(class_list))

def create_X(address):
    wave, sr = librosa.load(address)
    wave.reshape(-1, 1)
    yield wave

def getLabel(filename):
    base_name = os.path.basename(filename)
    return base_name

def onehot(Y_train):
    from sklearn import preprocessing
    enc = preprocessing.OneHotEncoder()
    Y_train = Y_train.reshape(-1, 1)
    enc.fit(Y_train)
    Y_train = enc.transform(Y_train).toarray()
    return Y_train

def execute(X_train, Y_train):
    loop = 0
    for i in class_list:
        c = 0
        loop += 1
        for file in glob.glob("".join(i+"\\*.wav")):  # iterating through each .wav audio file in the directory to create training data
            if np.array(list(create_X(file))).shape[0] == 22050:
                c += 1
                Y_train.append(class_names.index(getLabel(i)))
                X_train.append(create_X(file))
                if c % 100 == 0:
                    print("{} files processed in loop {}".format(c, loop))
    while 1:
        for i in range(1558):  # 36*1558 = 56088
            if i % 125 == 0:
                print("i= " + str(i))
            yield np.array(X_train[i*36:(i+1)*36]).reshape(X_train.shape[0], X_train.shape[1], 1), onehot(np.array(Y_train[i*36:(i+1)*36]))
input_shape = (22050,1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv1D(16,activation='relu',input_shape=input_shape,kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(32,activation='relu',kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(16,activation='relu',kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128,activation='relu'))
model.add(tf.keras.layers.Dense(64,activation='relu'))
model.add(tf.keras.layers.Dense(30,activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
generator = execute(X_train,Y_train)
model.fit_generator(generator,steps_per_epoch=56088//36,shuffle=True)
model.save("model.h5")

I figured it out by looking at this example: https://github.com/tjh48/keras_generators/blob/master/keras_generator_example.ipynb
If someone comes across this, they can refer to my notebook:
https://github.com/DarshanDeshpande/Speech-Recognition/blob/master/SpeechRecognitionWithGenerators.ipynb
Thanks!
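For reference, the core idea in that approach is to keep only the file paths and integer labels in memory and load each .wav inside the generator, one batch at a time. The sketch below is an approximation of that idea, not the notebook's exact code; the path/label lists, the batch size of 36 and the 30-class one-hot encoding are assumptions taken from the question.

import numpy as np
import librosa
import tensorflow as tf

def batch_generator(file_paths, labels, batch_size=36, num_classes=30):
    # file_paths and labels are precomputed lists; audio itself is loaded lazily.
    while True:
        for start in range(0, len(file_paths), batch_size):
            batch_paths = file_paths[start:start + batch_size]
            batch_labels = labels[start:start + batch_size]
            waves, kept_labels = [], []
            for path, label in zip(batch_paths, batch_labels):
                wave, sr = librosa.load(path)           # resampled to 22050 Hz by default
                if wave.shape[0] == 22050:              # keep only 1-second clips
                    waves.append(wave.reshape(22050, 1))
                    kept_labels.append(label)
            if not waves:
                continue
            X = np.stack(waves)                         # shape (batch, 22050, 1)
            Y = tf.keras.utils.to_categorical(kept_labels, num_classes)
            yield X, Y

# usage (train_paths/train_labels are the hypothetical precomputed lists):
# model.fit_generator(batch_generator(train_paths, train_labels),
#                     steps_per_epoch=len(train_paths) // 36, epochs=5)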

Related

Sentiment analysis model returns identical output for any input

So I made this sentiment analysis model, and it works just fine in the training/testing script. I built a simple Streamlit interface for my saved model, but it always returns identical scores for any input text. It also returns many scores when it should return a single score for a single input.
here is my code:
import streamlit as st
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import pickle
with open("tokenizer.pkl", "rb") as f:
tokenizer = pickle.load(f)
#st.cache(allow_output_mutation=True)
def load_model():
model = tf.keras.models.load_model('C:/Users/k/Downloads/test/model_final.h5')
return model
if __name__ == '__main__':
model = load_model()
st.title('Analisis Sentimen')
txt = st.text_input('masukkan teks')
if not txt:
st.warning("masukkan teks sebelum lanjut")
st.stop()
else:
text = txt
text = tokenizer.texts_to_sequences(text)
text = pad_sequences(text)
prediction = model.predict(text)
st.title('sentimen: ')
if (prediction > 0.5).any():
st.write(prediction)
st.write('positif')
else:
st.write(prediction)
st.write('negatif')
Here are some screenshots of what happens when I try it in Streamlit.
ML is largely trial and error. I modified the model as shown below and got good results; try changing your model accordingly.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
embedding_size = 50
# max_tokens is the padded sequence length used with pad_sequences during training
model.add(Embedding(input_dim=1500, output_dim=embedding_size, input_length=max_tokens, name='embedding_layer'))
model.add(LSTM(16))
model.add(Dropout(0.8))
model.add(Dense(16))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
optimizer = Adam()
model.compile(loss=tf.keras.losses.BinaryCrossentropy(), optimizer=optimizer, metrics=['accuracy'])
Please find the working code here. Thank you!
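As a usage note for the Streamlit side (my own hedged sketch, not part of the answer above): tokenizer.texts_to_sequences expects a list of strings, so passing a single raw string makes it tokenize character by character, which is one common reason for getting many scores instead of one. Assuming the same tokenizer and max_tokens as in training:

# Hypothetical single-input prediction; `tokenizer`, `model` and `max_tokens`
# are assumed to come from the training script.
seqs = tokenizer.texts_to_sequences([txt])        # wrap the input text in a list
padded = pad_sequences(seqs, maxlen=max_tokens)   # pad to the training length
score = float(model.predict(padded)[0][0])        # a single sigmoid score in [0, 1]
label = 'positif' if score > 0.5 else 'negatif'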

Upload data from DICOM files in Torchvision Model

I'm sorry if the question is too basic, but I am just getting started with PyTorch (and Python).
I was trying to follow step by step the instructions here:
https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
However, I am working with some DICOM files that I keep in two directories (CANCER/NOCANCER). I split them with split-folders so the structure can be used with the ImageFolder dataset (as done in the tutorial).
I am aware that I only need to load the pixel_arrays extracted from the DICOM files, and I wrote some helper functions to:
read all paths of the .dcm files;
read them and extract the pixel_array;
do a little preprocessing.
Here are the outlines of the helper functions:
import os
import pydicom
import cv2
import numpy as np

def createListFiles(dirName):
    print("Fetching all the files in the data directory...")
    lstFilesDCM = []
    for root, dir, fileList in os.walk(dirName):
        for filename in fileList:
            if ".dcm" in filename.lower():
                lstFilesDCM.append(os.path.join(root, filename))
    return lstFilesDCM

def castHeight(list):
    lstHeight = []
    min_height = 0
    for filenameDCM in list:
        readfile = pydicom.read_file(filenameDCM)
        lstHeight.append(readfile.pixel_array.shape[0])
        min_height = np.min(lstHeight)
    return min_height

def castWidth(list):
    lstWidth = []
    min_Width = 0
    for filenameDCM in list:
        readfile = pydicom.read_file(filenameDCM)
        lstWidth.append(readfile.pixel_array.shape[1])
        min_Width = np.min(lstWidth)
    return min_Width

def Preproc1(listDCM):
    new_height, new_width = castHeight(listDCM), castWidth(listDCM)
    ConstPixelDims = (len(listDCM), int(new_height), int(new_width))
    ArrayDCM = np.zeros(ConstPixelDims, dtype=np.float32)
    ## loop through all the DICOM files
    for filenameDCM in listDCM:
        ## read the file
        ds = pydicom.read_file(filenameDCM)
        mx0 = ds.pixel_array
        ## Standardisation
        imgb = mx0.astype('float32')
        imgb_stand = (imgb - imgb.mean(axis=(0, 1), keepdims=True)) / imgb.std(axis=(0, 1), keepdims=True)
        ## Normalisation
        imgb_norm = cv2.normalize(imgb_stand, None, 0, 1, cv2.NORM_MINMAX)
        ## make sure the data is saved as a numpy array
        data = np.array(imgb_norm)
        ## save it into ArrayDCM, resized based on 'ConstPixelDims'
        ArrayDCM[listDCM.index(filenameDCM), :, :] = cv2.resize(data, (int(new_width), int(new_height)), interpolation=cv2.INTER_CUBIC)
    return ArrayDCM
So now, how do I tell the DataLoader to load the data (keeping the directory structure for labelling purposes), but only after doing this extraction and preprocessing on it?
I'm referring to the "Loading data" part of the tutorial, which goes:
# Create training and validation datasets
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x]) for x in ['train', 'val']}
# Create training and validation dataloaders
dataloaders_dict = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batch_size, shuffle=True, num_workers=4) for x in ['train', 'val']}
If it makes any sense, is it possible to do something on the lines of
image_datasets = {x: datasets.ImageFolder(Preproc1(os.path.join(data_dir, x)), data_transforms[x]) for x in ['train', 'val']}
?
Another question: is it worth doing a normalisation step in my preprocessing when the tutorial already suggests using transforms.Normalize?
I'm sorry this sounds so vague; I've been trying to solve this for weeks now, but I can't manage it.
It sounds like you would be better off implementing your own custom Dataset.
And indeed, I think it would be better to defer normalization and the rest of the preprocessing to the transforms applied just before the images are fed to the model; a minimal sketch is below.
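A rough sketch of such a custom Dataset, assuming the split-folders layout described in the question (a train/val root with CANCER and NOCANCER subdirectories); the class discovery, preprocessing and transform handling here are illustrative assumptions, not code from the tutorial:

import os
import numpy as np
import pydicom
import torch
from torch.utils.data import Dataset, DataLoader

class DicomFolderDataset(Dataset):
    """Walks root_dir/<class_name>/*.dcm and returns (tensor, label) pairs."""
    def __init__(self, root_dir, transform=None):
        self.transform = transform
        self.classes = sorted(d for d in os.listdir(root_dir)
                              if os.path.isdir(os.path.join(root_dir, d)))
        self.samples = []
        for label, cls in enumerate(self.classes):
            cls_dir = os.path.join(root_dir, cls)
            for name in os.listdir(cls_dir):
                if name.lower().endswith(".dcm"):
                    self.samples.append((os.path.join(cls_dir, name), label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        pixels = pydicom.dcmread(path).pixel_array.astype(np.float32)
        img = torch.from_numpy(pixels).unsqueeze(0)   # (1, H, W) single-channel tensor
        if self.transform is not None:
            img = self.transform(img)                 # resizing/normalisation happen here
        return img, label

# usage, mirroring the tutorial's dataloaders_dict:
# image_datasets = {x: DicomFolderDataset(os.path.join(data_dir, x), data_transforms[x])
#                   for x in ['train', 'val']}
# dataloaders_dict = {x: DataLoader(image_datasets[x], batch_size=batch_size,
#                                   shuffle=True, num_workers=4)
#                     for x in ['train', 'val']}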

How to use tensorflow_io's IODataset?

I'm trying to write a program that uses malicious pcap files as datasets and predicts whether other pcap files contain malicious packets.
After some digging through the TensorFlow documentation, I found TensorFlow I/O, but I can't figure out how to use the dataset to create a model and predict with it.
Here's my code:
%tensorflow_version 2.x
import tensorflow as tf
import numpy as np
from tensorflow import keras

try:
    import tensorflow_io as tfio
    import tensorflow_datasets as tfds
except:
    !pip install tensorflow-io
    !pip install tensorflow-datasets
    import tensorflow_io as tfio
    import tensorflow_datasets as tfds

# print(tf.__version__)
dataset = tfio.IODataset.from_pcap("dataset.pcap")
print(dataset)  # <PcapIODataset shapes: ((), ()), types: (tf.float64, tf.string)>
(Using Google Colab.)
I've tried looking for answers online, but couldn't find any.
I downloaded two pcap files and concatenated them, then extracted packet_timestamp and packet_data. You will need to preprocess packet_data according to your requirements. If you have labels to add, you can add them to the training dataset (in the model example below I create dummy labels of all zeros and add them as a column); if the labels are in a file, you can zip them with the pcap data. Passing a dataset of (feature, label) pairs is all that's needed for Model.fit and Model.evaluate.
Below is an example of reading packet_data. You could modify it so that, for example, valid packet_data gets the label "valid" and everything else "malicious".
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_io as tfio
import numpy as np

# Create an IODataset from each pcap file
first_file = tfio.IODataset.from_pcap('/content/fuzz-2006-06-26-2594.pcap')
second_file = tfio.IODataset.from_pcap(['/content/fuzz-2006-08-27-19853.pcap'])

# Concatenate the read files
feature = first_file.concatenate(second_file)

# Lists for the pcap contents
packet_timestamp_list = []
packet_data_list = []

# some dummy labels
labels = []

packets_total = 0
for v in feature:
    (packet_timestamp, packet_data) = v
    packet_timestamp_list.append(packet_timestamp.numpy())
    packet_data_list.append(packet_data.numpy())
    labels.append(0)
    if packets_total == 0:
        assert np.isclose(
            packet_timestamp.numpy()[0], 1084443427.311224, rtol=1e-15
        )  # we know this is the correct value in the test pcap file
        assert (
            len(packet_data.numpy()[0]) == 62
        )  # we know this is the correct packet data buffer length in the test pcap file
    packets_total += 1
assert (
    packets_total == 43
)  # we know this is the correct number of packets in the test pcap file
Below is an example of using the data in a model. The model won't work as-is, because I have not handled packet_data, which is of string type; do the preprocessing according to your requirements and then use it in the model.
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_io as tfio
import numpy as np

# Create an IODataset from each pcap file
first_file = tfio.IODataset.from_pcap('/content/fuzz-2006-06-26-2594.pcap')
second_file = tfio.IODataset.from_pcap(['/content/fuzz-2006-08-27-19853.pcap'])

# Concatenate the read files
feature = first_file.concatenate(second_file)

# Lists for the pcap contents
packet_timestamp = []
packet_data = []

# some dummy labels
labels = []

# add 0 as label. You can use your actual labels here
for v in feature:
    (timestamp, data) = v
    packet_timestamp.append(timestamp.numpy())
    packet_data.append(data.numpy())
    labels.append(0)

## Do the preprocessing of packet_data here

# Add labels to the training data
# Preprocess the packet_data to convert string to meaningful value and use here
train_ds = tf.data.Dataset.from_tensor_slices(((packet_timestamp, packet_data), labels))

# Set the batch size
train_ds = train_ds.shuffle(5000).batch(32)

##### PROGRAM WILL RUN SUCCESSFULLY TILL HERE. TO USE IN THE MODEL, DO THE PREPROCESSING OF PACKET DATA AS EXPLAINED #####

# A simple example model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100),
    tf.keras.layers.Dense(10)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds, epochs=2)
Hope this answers your question. Happy Learning.
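For the packet_data preprocessing that the answer leaves to the reader, one possible (hedged) approach is to decode each packet's raw bytes into a fixed-length numeric vector, for example by padding/truncating to a fixed byte count and scaling into [0, 1]; the MAX_BYTES value and helper name below are illustrative assumptions:

import numpy as np
import tensorflow as tf

MAX_BYTES = 128  # assumed fixed feature length; tune to your packets

def packet_bytes_to_vector(raw_bytes, max_bytes=MAX_BYTES):
    # raw_bytes is the bytes object returned by packet_data.numpy()
    arr = np.frombuffer(raw_bytes, dtype=np.uint8).astype(np.float32) / 255.0
    if arr.shape[0] < max_bytes:
        arr = np.pad(arr, (0, max_bytes - arr.shape[0]))  # zero-pad short packets
    return arr[:max_bytes]                                # truncate long packets

# packet_features = np.stack([packet_bytes_to_vector(b) for b in packet_data])
# train_ds = tf.data.Dataset.from_tensor_slices((packet_features, labels))
# train_ds = train_ds.shuffle(5000).batch(32)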

How do I export an Estimator, tf.estimator.DNNClassifier

Hello, I'm still a beginner with TensorFlow; this is my code.
I'm trying to run a text classification DNN, and everything is fine up to this point.
I want to save my model and import it so I can use it to predict new values, but I have no idea how to do it.
To give you a general idea of what I'm trying to do:
I have 2 folders (training & test);
each folder has 4 subfolders (the classification categories).
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import logging

print("Loading all files from directory ...")
# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["tnemitnes"] = []
    print("getting in a loop")
    for file_path in os.listdir(directory):
        with tf.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            print("directory : ", directory)
            print("file path : ", file_path)
            data["sentence"].append(f.read())
            data["tnemitnes"].append(re.match("(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)

print("merging all files in the training set ...")
# Merge all types of email examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join("train/br"))
    neg_df = load_directory_data(os.path.join(directory, "train/mi"))
    dos_df = load_directory_data(os.path.join(directory, "train/Brouillons"))  # dsd
    nos_df = load_directory_data(os.path.join(directory, "train/favoris"))  # dsd
    pos_df["polarity"] = 3
    neg_df["polarity"] = 2
    dos_df["polarity"] = 1
    nos_df["polarity"] = 0
    return pd.concat([pos_df, neg_df, dos_df, nos_df]).sample(frac=1).reset_index(drop=True)

print("Getting the data from files ...")
# Download and process the dataset files.
def download_and_load_datasets():
    train_df = load_dataset(os.path.dirname("train"))
    test_df = load_dataset(os.path.dirname("test"))
    return train_df, test_df

print("configuring all logging output ...")
# Reduce logging output. ERROR
# logging.set_verbosity(tf.logging.INFO)
logging.getLogger().setLevel(logging.INFO)

print("Setting up the data for the training ...")
train_df, test_df = download_and_load_datasets()
train_df.head()

print("Setting up a training input on the whole training set with no limit on training epochs ...")
# Training input on the whole training set with no limit on training epochs.
train_input_fn = tf.estimator.inputs.pandas_input_fn(train_df, train_df["polarity"], num_epochs=None, shuffle=True)

print("Setting up a prediction on the whole training set ...")
# Prediction on the whole training set.
predict_train_input_fn = tf.estimator.inputs.pandas_input_fn(train_df, train_df["polarity"], shuffle=False)

print("Setting up a prediction on the test set ...")
# Prediction on the test set.
predict_test_input_fn = tf.estimator.inputs.pandas_input_fn(test_df, test_df["polarity"], shuffle=False)

print("Removal of punctuation and splitting on spaces from the data ...")
# The module is responsible for preprocessing of sentences (e.g. removal of punctuation and splitting on spaces).
embedded_text_feature_column = hub.text_embedding_column(key="sentence", module_spec="https://tfhub.dev/google/nnlm-en-dim128/1")

print("Setting up the classifier ...")
# Estimator: for classification I used a DNN Classifier
estimator = tf.estimator.DNNClassifier(
    hidden_units=[10, 20],
    feature_columns=[embedded_text_feature_column],
    n_classes=4,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003))

print("Starting the training ...")
# Training for 50 steps means 5000 training examples with the default
# batch size. This is roughly equivalent to 5 epochs since the training dataset
# contains less examples.
estimator.train(input_fn=train_input_fn, steps=20);
print("the training has ended ...")

print("setting up the results ...")
train_eval_result = estimator.evaluate(input_fn=predict_train_input_fn)
test_eval_result = estimator.evaluate(input_fn=predict_test_input_fn)

print("Showing the results ...")
print("Training set accuracy: {accuracy}".format(**train_eval_result))
print("Test set accuracy: {accuracy}".format(**test_eval_result))

# this is where I'm having trouble !!! <====
tf.estimator.export(
    os.path.dirname("Model"),
    serving_input_fn,
    default_output_alternative_key=None,
    assets_extra=None,
    as_text=False,
    checkpoint_path=None,
    graph_rewrite_specs=(GraphRewriteSpec((tag_constants.SERVING,), ()),),
    strip_default_attrs=False
)
Now that I have added the estimator export function, it asks me to provide a serving_input_fn, and to be honest I find it hard to understand how to create one.
If there is an easier way, that would be better.
You can easily get a serving_input_fn with tf.estimator.export.build_parsing_serving_input_receiver_fn (link).
In your case, do something like:
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
    [embedded_text_feature_column])
If you expect to pass tensors directly, there's also build_raw_serving_input_receiver_fn in the same package.
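Once you have the serving_input_fn, you would export the trained estimator with it; a short sketch (the export directory name is arbitrary):

# Write a SavedModel for serving (TF 1.x Estimator API)
export_dir = estimator.export_savedmodel("exported_model", serving_input_fn)
print("SavedModel written to:", export_dir)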
All I had to do was add model_dir=os.getcwd()+'\Model' to the estimator:
model_dir = os.getcwd()+'\Model'
This is the new code; I created a new folder and named it Model.
estimator = tf.estimator.DNNClassifier(
    hidden_units=[10, 20],
    feature_columns=[embedded_text_feature_column],
    n_classes=4,
    optimizer=tf.train.AdagradOptimizer(learning_rate=0.003),
    model_dir=os.getcwd()+'\Model')
You might want to read this first: Tensorflow: how to save/restore a model?
A serving_input_receiver_fn should be defined.
https://www.tensorflow.org/api_docs/python/tf/estimator/export/build_parsing_serving_input_receiver_fn
This document introduces a useful way to build the serving_input_receiver_fn.
Here is an example:
# First prepare feature_spec; it includes the specification of your feature columns.
feature_spec = tf.feature_column.make_parse_example_spec(my_feature_columns)
print(feature_spec)
serving_input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
export_model = classifier.export_savedmodel('./iris/', serving_input_receiver_fn)

Training huge amounts of data with tensorflow

I have about 60 thousand samples of size 200x870; they are all numpy arrays, and I want to build a four-dimensional tensor out of them (with one singleton dimension) and train a CNN on them in TensorFlow. Up to this point, I was using data that I could just load and batch as below:
with tf.Graph().as_default():
    data_train = tf.to_float(getInput.data_train)
    phase, lr = tf.placeholder(tf.bool), tf.placeholder(tf.float32)
    global_step = tf.Variable(0, trainable=False)
    image_train, label_train = tf.train.slice_input_producer([data_train, labels_train], num_epochs=args.num_epochs)
    images_train, batch_labels_train = tf.train.batch([image_train, label_train], batch_size=args.bsize)
Can someone suggest a way around this?
My idea was to split the dataset into subsets and, within one epoch, train on them one after the other, using a queue for the paths of these files:
import scipy.io as sc
import numpy as np
import threading
import time
import tensorflow as tf
from tensorflow.python.client import timeline

def testQueues():
    paths = ['data1', 'data2', 'data3', 'data4', 'data5']
    queue_capacity = 6
    bsize = 10
    num_epochs = 2
    filename_queue = tf.FIFOQueue(
        #min_after_dequeue=0,
        capacity=queue_capacity,
        dtypes=tf.string,
        shapes=[[]]
    )
    filenames_placeholder = tf.placeholder(dtype='string', shape=(None))
    filenames_enqueue_op = filename_queue.enqueue_many(filenames_placeholder)
    data_train, phase = tf.placeholder(tf.float32), tf.placeholder(tf.bool)

    sess = tf.Session()
    sess.run(filenames_enqueue_op, feed_dict={filenames_placeholder: paths})
    for i in range(len(paths)):
        train_set_batch_name = sess.run(filename_queue.dequeue())
        train_set_batch_name = train_set_batch_name.decode('utf-8')
        train_set_batch = np.load(train_set_batch_name + '.npy')
        train_set_batch = tf.cast(train_set_batch, tf.float32)
        init_op = tf.group(tf.initialize_all_variables(), tf.initialize_local_variables())
        sess.run(init_op)
        run_one_epoch(train_set_batch, sess)
        size = sess.run(filename_queue.size())
        print(size)
        print(train_set_batch)

def run_one_epoch(train_set, sess):
    image_train = tf.train.slice_input_producer([train_set], num_epochs=1)
    images_train = tf.train.batch(image_train, batch_size=10)
    x = tf.nn.relu(images_train)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(x)
    except tf.errors.OutOfRangeError:
        pass
    finally:
        # When done, ask the threads to stop.
        coord.request_stop()
        coord.join(threads)

testQueues()
However, I get an error:
FailedPreconditionError: Attempting to use uninitialized value input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs
[[Node: input_producer/input_producer/fraction_of_32_full/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:#input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](input_producer/input_producer/fraction_of_32_full/limit_epochs/epochs)]]
It also seems that I can't feed the dictionary with a tf.Tensor, only with a numpy array, but casting it to a tf.Tensor later is also troublesome.
Have a look at the Dataset API.
"The tf.data API enables you to build complex input pipelines from simple, reusable pieces."
With this approach you build the input pipeline into your graph so that it handles the data for you, pulling in only a limited amount at a time to train your model on; a sketch is shown below.
If the memory issue still persists, you might want to look into using a generator to create your tf.data.Dataset. A next step could be to speed up the process by preparing TFRecords to create your Dataset.
Follow the links to learn more, and feel free to comment if you don't understand something.
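A minimal sketch of that idea, assuming each subset file 'dataN.npy' holds an array of samples shaped (n, 200, 870) as in the question (file names and shapes are illustrative):

import numpy as np
import tensorflow as tf

paths = ['data1.npy', 'data2.npy', 'data3.npy', 'data4.npy', 'data5.npy']

def sample_generator():
    # Load one .npy file at a time and yield individual samples,
    # so at most one subset is held in memory.
    for path in paths:
        subset = np.load(path)              # shape: (n_samples, 200, 870)
        for sample in subset:
            yield sample[..., np.newaxis]   # add the singleton channel dimension

dataset = (tf.data.Dataset
           .from_generator(sample_generator,
                           output_types=tf.float32,
                           output_shapes=(200, 870, 1))
           .shuffle(1000)
           .batch(32)
           .prefetch(1))

# In TF 1.x you would then do:
# iterator = dataset.make_one_shot_iterator()
# images_train = iterator.get_next()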
For data that doesn't fit into memory, the standard solution is to use queues. You can set up ops that read from files directly (CSV files, image files) and feed them into TensorFlow: https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html
