How to use tensorflow_io's IODataset? - python

I'm trying to write a program that uses malicious pcap files as datasets and predicts whether other pcap files contain malicious packets.
After some digging through the TensorFlow documentation, I found TensorFlow I/O, but I can't figure out how to use the dataset to create a model and predict with it.
Here's my code:
%tensorflow_version 2.x
import tensorflow as tf
import numpy as np
from tensorflow import keras
try:
    import tensorflow_io as tfio
    import tensorflow_datasets as tfds
except:
    !pip install tensorflow-io
    !pip install tensorflow-datasets
    import tensorflow_io as tfio
    import tensorflow_datasets as tfds
# print(tf.__version__)
dataset = tfio.IODataset.from_pcap("dataset.pcap")
print(dataset) # <PcapIODataset shapes: ((), ()), types: (tf.float64, tf.string)>
(Using Google Colab)
I've tried looking for answers online, but couldn't find any.

I have downloaded two pcap files and concatenated them, then extracted the packet_timestamp and packet_data. You will need to preprocess the packet_data according to your requirements. If you have labels to add, you can add them to the training dataset (in the model example below, I have created dummy labels of all zeros and added them as a column). If the labels are in a file, you can zip them with the pcap data. Passing a dataset of (feature, label) pairs is all that's needed for Model.fit and Model.evaluate:
Below is an example of packet_data preprocessing. You could adapt it so that, for example, the label is set to valid if packet_data is valid and to malicious otherwise.
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_io as tfio
import numpy as np
# Create an IODataset from a pcap file
first_file = tfio.IODataset.from_pcap('/content/fuzz-2006-06-26-2594.pcap')
second_file = tfio.IODataset.from_pcap(['/content/fuzz-2006-08-27-19853.pcap'])
# Concatenate the Read Files
feature = first_file.concatenate(second_file)
# List for pcap
packet_timestamp_list = []
packet_data_list = []
# some dummy labels
labels = []
packets_total = 0
for v in feature:
    (packet_timestamp, packet_data) = v
    packet_timestamp_list.append(packet_timestamp.numpy())
    packet_data_list.append(packet_data.numpy())
    labels.append(0)
    if packets_total == 0:
        assert np.isclose(
            packet_timestamp.numpy()[0], 1084443427.311224, rtol=1e-15
        )  # we know this is the correct value in the test pcap file
        assert (
            len(packet_data.numpy()[0]) == 62
        )  # we know this is the correct packet data buffer length in the test pcap file
    packets_total += 1
assert (
    packets_total == 43
)  # we know this is the correct number of packets in the test pcap file
Below is an example of using the data in a model. The model won't work as-is because I have not handled packet_data, which is of string type. Do the preprocessing as explained above, according to your requirements, and then use it in the model.
%tensorflow_version 2.x
import tensorflow as tf
import tensorflow_io as tfio
import numpy as np
# Create an IODataset from a pcap file
first_file = tfio.IODataset.from_pcap('/content/fuzz-2006-06-26-2594.pcap')
second_file = tfio.IODataset.from_pcap(['/content/fuzz-2006-08-27-19853.pcap'])
# Concatenate the Read Files
feature = first_file.concatenate(second_file)
# List for pcap
packet_timestamp = []
packet_data = []
# some dummy labels
labels = []
# add 0 as label. You can use your actual labels here
for v in feature:
    (timestamp, data) = v
    packet_timestamp.append(timestamp.numpy())
    packet_data.append(data.numpy())
    labels.append(0)
## Do the preprocessing of packet_data here
# Add labels to the training data
# Preprocess the packet_data to convert string to meaningful value and use here
train_ds = tf.data.Dataset.from_tensor_slices(((packet_timestamp,packet_data), labels))
# Set the batch size
train_ds = train_ds.shuffle(5000).batch(32)
##### PROGRAM WILL RUN SUCCESSFULLY TILL HERE. TO USE IN THE MODEL DO THE PREPROCESSING OF PACKET DATA AS EXPLAINED ###
# Have defined some simple model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100),
    tf.keras.layers.Dense(10)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(train_ds, epochs=2)
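If you need a starting point for that preprocessing, below is a minimal sketch of one possible approach (my own assumption, not the only way): treat the raw packet bytes as a fixed-length numeric vector and drop the timestamp for simplicity. MAX_LEN is an arbitrary choice.
import numpy as np
import tensorflow as tf
MAX_LEN = 256  # arbitrary fixed feature length; tune it to your packets
def bytes_to_features(raw_bytes, max_len=MAX_LEN):
    # Interpret each byte as an integer in [0, 255], scale to [0, 1],
    # then pad or truncate to a fixed length.
    arr = np.frombuffer(raw_bytes, dtype=np.uint8).astype(np.float32) / 255.0
    if len(arr) < max_len:
        arr = np.pad(arr, (0, max_len - len(arr)))
    return arr[:max_len]
# Replace the raw byte strings with numeric vectors before building the dataset
packet_features = np.stack([bytes_to_features(b) for b in packet_data])
train_ds = tf.data.Dataset.from_tensor_slices((packet_features, labels))
train_ds = train_ds.shuffle(5000).batch(32)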
Hope this answers your question. Happy Learning.

Related

Link prediction using HinSAGE/GraphSAGE in StellarGraph returns NaNs

I am trying to run a link prediction using HinSAGE in the stellargraph python package.
I have a network of people and products, with edges from person to person (KNOWS) and from person to product (BOUGHT).
Both people and products have a feature vector attached, though of a different length for each type (the person vector is length 1024, the product vector length 200).
I am trying to create a link prediction algorithm from person to product based on all the information in the network. The reason for using HinSAGE is the option of inductive learning.
I have the code below, and I thought I was doing it similarly to the examples
https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/hinsage-link-prediction.html
https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/graphsage-link-prediction.html
but I keep getting "nan" as my output predictions. Does anyone have a suggestion as to what I can try?
import networkx as nx
import pandas as pd
import numpy as np
from tensorflow.keras import Model, optimizers, losses, metrics
import stellargraph as sg
from stellargraph.data import EdgeSplitter
from stellargraph.mapper import HinSAGELinkGenerator
from stellargraph.layer import HinSAGE, link_classification, link_regression
from sklearn.model_selection import train_test_split
graph.info()
#StellarGraph: Undirected multigraph
# Nodes: 54226, Edges: 259120
#
# Node types:
# products: [45027]
# Features: float32 vector, length 200
# Edge types: products-BOUGHT->person
# person: [9199]
# Features: float32 vector, length 1024
# Edge types: person-KNOWS->person, person-BOUGHT->product
#
# Edge types:
# person-KNOWS->person: [246131]
# Weights: all 1 (default)
# Features: none
# person-BOUGHT->product: [12989]
# Weights: all 1 (default)
# Features: none
import networkx as nx
import pandas as pd
import numpy as np
import os
import random
from tensorflow.keras import Model, optimizers, losses, metrics
import stellargraph as sg
from stellargraph.data import EdgeSplitter
from stellargraph.mapper import HinSAGELinkGenerator
from stellargraph.layer import HinSAGE, link_classification
from stellargraph.data import UniformRandomWalk
from stellargraph.data import UnsupervisedSampler
from sklearn.model_selection import train_test_split
from stellargraph.layer import HinSAGE, link_regression
edge_splitter_test = EdgeSplitter(graph)
graph_test, edges_test, labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global", edge_label="BOUGHT"
)
edge_splitter_train = EdgeSplitter(graph_test, graph)
graph_train, edges_train, labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global", edge_label="BOUGHT"
)
num_samples = [8, 4]
G = graph
batch_size = 20
epochs = 20
generator = HinSAGELinkGenerator(
    G, batch_size, num_samples, head_node_types=["person", "product"]
)
train_gen = generator.flow(edges_train, labels_train, shuffle=True)
test_gen = generator.flow(edges_test, labels_test)
hinsage_layer_sizes = [32, 32]
assert len(hinsage_layer_sizes) == len(num_samples)
hinsage = HinSAGE(
    layer_sizes=hinsage_layer_sizes, generator=generator, bias=True, dropout=0.0
)
# Expose input and output sockets of hinsage:
x_inp, x_out = hinsage.in_out_tensors()
# Final estimator layer
prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="concat"
)(x_out)
model = Model(inputs=x_inp, outputs=prediction)
model.compile(
    optimizer=optimizers.Adam(),
    loss=losses.binary_crossentropy,
    metrics=["acc"],
)
history = model.fit(train_gen, epochs=epochs, validation_data=test_gen, verbose=2)
So I found the problem; it might be useful for others. If any node contains missing data, the model will just produce NaNs. This is especially dangerous if you create your graph by joining pandas DataFrames: I had a typo in one file that was merged in, which led to the problem.
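For anyone hitting the same thing, a quick sanity check on the node-feature DataFrames before building the StellarGraph would have caught this early. A minimal sketch (person_features and product_features are placeholder names for whatever DataFrames you feed into the graph):
import pandas as pd
# Hypothetical stand-ins for the DataFrames that feed StellarGraph;
# check them for missing values before constructing the graph.
for name, df in [("person", person_features), ("product", product_features)]:
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        bad_rows = df[df.isna().any(axis=1)].index.tolist()[:10]
        print(f"{name}: {n_missing} missing values, e.g. in rows {bad_rows}")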

How to put skimage imread_collection through tensorflow

I'm trying to put a collection of images through a neural network, but I can't figure out how to get a large collection of images to go into a tensorflow model, as trying to convert the collection into a numpy array causes a memory error.
I should note that I am very new to tensorflow.
import numpy as np
from skimage.io import imread_collection
from tensorflow import keras
from tensorflow.keras import layers
def gen(arr):return(i.reshape(400*600*3) for i in arr) # Only used in Attempt2.
labelFile=open("lables_text_file.txt","r")
labels=labelFile.read()
labelFile.close()
labels=getTrain(labels) # Converts to a tuple containing the labels in order.
data = imread_collection("path_to_images/*.jpg", conserve_memory=True)
train=data[:-len(data)//4]
trainLabels=labels[:-len(data)//4]
test=data[-len(data)//4:]
testLabels=labels[-len(data)//4:]
#train = train.reshape(-1, 400*600*3) # Attempt1
#test = test.reshape(-1, 400*600*3) # Attempt1
#train = gen(train) # Attempt2
#test = gen(test) # Attempt2
trainLabels = keras.utils.to_categorical(trainLabels, 23)
testLabels = keras.utils.to_categorical(testLabels, 23)
model = keras.Sequential([keras.Input(shape=(400*600*3,)),
                          layers.Dense(600, name='hidden1', activation='relu'),
                          layers.Dense(400, name='hidden2', activation='relu'),
                          layers.Dense(46, name='hidden3', activation='relu'),
                          layers.Dense(23, activation="softmax")])
optimizer = keras.optimizers.Adam(learning_rate=0.0015)
model.compile(loss=keras.losses.CategoricalCrossentropy(), optimizer=optimizer, metrics=[keras.metrics.CategoricalAccuracy()])
model.fit(train,trainLabels,batch_size=128,epochs=8,validation_data=(test,testLabels), shuffle=True)
When I run the code as is, this is the result:
ValueError: Failed to find data adapter that can handle input: <class 'skimage.io.collection.ImageCollection'>, <class 'numpy.ndarray'>
When I try to use Attempt1, this is the result:
AttributeError: 'ImageCollection' object has no attribute 'reshape'
When I try to use Attempt2, this is the result:
ValueError: `y` argument is not supported when using python generator as input.
How can I put the data into `model.fit` such that it will successfully train the neural network?
I think I may have solved the problems.
Working code:
import numpy as np
from skimage.io import imread_collection
from tensorflow import keras
from tensorflow.keras import layers
def gen(arr,labels):return((arr[i].reshape(-1,400*600*3),labels[i].reshape(-1,23)) for i in range(len(arr)))
labelFile=open("lables_text_file.txt","r")
labels=labelFile.read()
labelFile.close()
labels=getTrain(labels) # Converts to a tuple containing the labels in order.
data = imread_collection("path_to_images/*.jpg", conserve_memory=True)
train=data[:-len(data)//4]
trainLabels=labels[:-len(data)//4]
test=data[-len(data)//4:]
testLabels=labels[-len(data)//4:]
#train = train.reshape(-1, 400*600*3) # Attempt1
#test = test.reshape(-1, 400*600*3) # Attempt1
trainLabels = keras.utils.to_categorical(trainLabels, 23)
testLabels = keras.utils.to_categorical(testLabels, 23)
train = gen(train,trainLabels) # Attempt2
test = gen(test,testLabels) # Attempt2
model = keras.Sequential([keras.Input(shape=(400*600*3,)),
                          layers.Dense(600, name='hidden1', activation='relu'),
                          layers.Dense(400, name='hidden2', activation='relu'),
                          layers.Dense(46, name='hidden3', activation='relu'),
                          layers.Dense(23, activation="softmax")])
optimizer = keras.optimizers.Adam(learning_rate=0.0015)
model.compile(loss=keras.losses.CategoricalCrossentropy(), optimizer=optimizer, metrics=[keras.metrics.CategoricalAccuracy()])
model.fit(train,None,batch_size=128,epochs=8,validation_data=(test,testLabels), shuffle=True)
The solution was to pass in a generator that returns two-tuples containing the input and label (instead of passing the labels in directly), but there were other problems that I may include in this answer if I get the time.
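Another option, which I have not tested against this exact data (so treat it as a sketch), is to wrap the collection in a tf.data.Dataset via from_generator; it handles batching and keeps the lazy loading. This assumes train/test are still the raw ImageCollection slices and trainLabels/testLabels the one-hot arrays (i.e. before the gen(...) reassignment), and that output_signature is available (TensorFlow 2.4+):
import tensorflow as tf
def pair_gen(images, onehot_labels):
    # Yield one (flattened image, one-hot label) pair at a time.
    for img, label in zip(images, onehot_labels):
        yield img.reshape(400 * 600 * 3).astype("float32"), label
output_signature = (
    tf.TensorSpec(shape=(400 * 600 * 3,), dtype=tf.float32),
    tf.TensorSpec(shape=(23,), dtype=tf.float32),
)
train_ds = tf.data.Dataset.from_generator(
    lambda: pair_gen(train, trainLabels), output_signature=output_signature
).batch(128)
test_ds = tf.data.Dataset.from_generator(
    lambda: pair_gen(test, testLabels), output_signature=output_signature
).batch(128)
model.fit(train_ds, epochs=8, validation_data=test_ds)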

How to know input/output layer names and sizes for Pytorch model?

I have a PyTorch model.pth that uses Detectron2's COCO Object Detection Baselines pretrained model R50-FPN.
I am trying to convert the .pth model to ONNX.
My code is as follows.
import io
import numpy as np
from torch import nn
import torch.utils.model_zoo as model_zoo
import torch.onnx
from torchvision import models
model = torch.load('output_object_detection/model_final.pth')
x = torch.randn(1, 3, 1080, 1920, requires_grad=True)#0, in_cha, in_h, in_w
torch_out = torch_model(x)
print(model)
torch.onnx.export(torch_model,               # model being run
                  x,                         # model input (or a tuple for multiple inputs)
                  "super_resolution.onnx",   # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=10,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names=['input'],     # the model's input names
                  output_names=['cls_score', 'bbox_pred'],  # the model's output names
                  dynamic_axes={'input': {0: 'batch_size'},    # variable length axes
                                'output': {0: 'batch_size'}})
Is this the correct way to convert the model to ONNX?
If it is the right way, how do I know the input_names and output_names?
I used Netron to see the input and output, but the graph doesn't show the input/output layers.
Try this:
import io
import numpy as np
from torch import nn
import torch.utils.model_zoo as model_zoo
import torch.onnx
from torchvision import models
model = torch.load('model_final.pth')
model.eval()
print('Finished loading model!')
print(model)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use the GPU if one is available
model = model.to(device)
# ------------------------ export -----------------------------
output_onnx = 'super_resolution.onnx'
print("==> Exporting model to ONNX format at '{}'".format(output_onnx))
input_names = ["input0"]
output_names = ["output0","output1"]
inputs = torch.randn(1, 3, 1080, 1920).to(device)
torch_out = torch.onnx._export(model, inputs, output_onnx, export_params=True, verbose=False,
                               input_names=input_names, output_names=output_names)
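Once the export succeeds, you can also read the input/output names back programmatically with the onnx package instead of relying only on Netron (a small sketch):
import onnx
onnx_model = onnx.load("super_resolution.onnx")
onnx.checker.check_model(onnx_model)  # sanity-check the exported graph
# The graph-level inputs/outputs carry the names passed to the exporter
print("inputs :", [inp.name for inp in onnx_model.graph.input])
print("outputs:", [out.name for out in onnx_model.graph.output])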

Need a concrete example of fit_generator()

I'm making a speech recognition model with an input shape of (56088, 22050, 1). The whole dataset can be loaded into memory from a .npy file (~5 GB in size), but I wanted to figure out a better way. I came across the Keras fit_generator() method, but most examples were based on MNIST and used the ImageDataGenerator() function. I realised that I had to make a custom generator function, but I wasn't really sure how. As per this thread, I referenced its generator function to make something like the code below, but I still have to load the entire dataset into memory, which takes a lot of time. I'm also uncertain whether this program would run at all, because it didn't output anything for the first 20 minutes I ran it.
Any other way out?
import librosa
import glob
import tensorflow as tf
import os
import numpy as np
class_list, X_train, Y_train = [],[],[]
filename = "D:\\SpeechRecognitionData\\train\\audio\\"
class_names = os.listdir(filename)
print(class_names)
for classes in class_names:
    if classes == '_background_noise_':
        continue
    else:
        class_list.append(''.join(filename+classes))
print(class_list,"\n",len(class_list))
def create_X(address):
    wave,sr = librosa.load(address)
    wave.reshape(-1,1)
    yield wave
def getLabel(filename):
    base_name = os.path.basename(filename)
    return base_name
def onehot(Y_train):
    from sklearn import preprocessing
    enc = preprocessing.OneHotEncoder()
    Y_train = Y_train.reshape(-1,1)
    enc.fit(Y_train)
    Y_train = enc.transform(Y_train).toarray()
    return Y_train
def execute(X_train, Y_train):
    loop = 0
    for i in class_list:
        c=0
        loop+=1
        for file in glob.glob("".join(i+"\\*.wav")): # iterating through each .wav audio file in the directory to create training data
            if np.array(list(create_X(file))).shape[0] == 22050:
                c+=1
                Y_train.append(class_names.index(getLabel(i)))
                X_train.append(create_X(file))
            if c%100==0:
                print("{} files processed in loop {}".format(c,loop))
    while 1:
        for i in range(1558): # 36*1558 = 56088
            if i%125==0:
                print("i= "+str(i))
            yield np.array(X_train[i*36:(i+1)*36]).reshape(X_train.shape[0],X_train.shape[1],1), onehot(np.array(Y_train[i*36:(i+1)*36]))
input_shape = (22050,1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv1D(16,activation='relu',input_shape=input_shape,kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(32,activation='relu',kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Conv1D(16,activation='relu',kernel_size=(10)))
model.add(tf.keras.layers.MaxPool1D())
model.add(tf.keras.layers.Dropout(0.2))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128,activation='relu'))
model.add(tf.keras.layers.Dense(64,activation='relu'))
model.add(tf.keras.layers.Dense(30,activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
generator = execute(X_train,Y_train)
model.fit_generator(generator,steps_per_epoch=56088//36,shuffle=True)
model.save("model.h5")
So I figured it out by looking at this example here- https://github.com/tjh48/keras_generators/blob/master/keras_generator_example.ipynb
If someone comes across this then they can refer to my notebook
https://github.com/DarshanDeshpande/Speech-Recognition/blob/master/SpeechRecognitionWithGenerators.ipynb
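For anyone who doesn't want to open the notebook, the rough shape of the lazy-loading pattern is below. This is a simplified sketch, not the exact notebook code; it assumes you have already built a file_paths list of .wav paths and a matching list of integer labels:
import numpy as np
import librosa
import tensorflow as tf
class AudioSequence(tf.keras.utils.Sequence):
    """Loads .wav files lazily, one batch at a time."""
    def __init__(self, file_paths, labels, batch_size=36, num_classes=30):
        self.file_paths, self.labels = file_paths, labels
        self.batch_size, self.num_classes = batch_size, num_classes
    def __len__(self):
        return len(self.file_paths) // self.batch_size
    def __getitem__(self, idx):
        paths = self.file_paths[idx * self.batch_size:(idx + 1) * self.batch_size]
        labs = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        # Load audio on demand and force every clip to exactly 22050 samples
        waves = [librosa.load(p, sr=None)[0][:22050] for p in paths]
        x = np.array([np.pad(w, (0, 22050 - len(w))) for w in waves]).reshape(-1, 22050, 1)
        y = tf.keras.utils.to_categorical(labs, self.num_classes)
        return x, y
# model.fit(AudioSequence(file_paths, labels), epochs=1)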
Thanks!

Keras/Tensorflow training on GCP with TPU

I am trying to train a model on GCP with Keras and TensorFlow 1.15.
For now my code is similar to what I would do on Colab, namely:
# TPUs
import tensorflow as tf
print(tf.__version__)
cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver("tpu-name")
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
tpu_strategy = tf.distribute.experimental.TPUStrategy(cluster_resolver)
print("Number of accelerators: ", tpu_strategy.num_replicas_in_sync)
import numpy as np
np.random.seed(123) # for reproducibility
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Convolution2D, MaxPooling2D, Input
from tensorflow.keras import utils
from tensorflow.keras.datasets import mnist, cifar10
from tensorflow.keras.models import Model
# 4. Load data into train and test sets
(X_train, y_train) = load_data(sets="gs://BUCKETS/dogscats/train/",target_size=img_size)
(X_test, y_test) = load_data(sets="gs://BUCKETS/dogscats/valid/",target_size=img_size)
print(X_train.shape, X_test.shape)
# 5. Preprocess input data
#X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
#X_test = X_test.reshape(X_test.shape[0], 28, 28,1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255.0
X_test /= 255.0
print(y_train.shape, y_test.shape)
# 6. Preprocess class labels One hot encoding
Y_train = utils.to_categorical(y_train, 2)
Y_test = utils.to_categorical(y_test, 2)
print(Y_train.shape, Y_test.shape)
with tpu_strategy.scope():
    model = make_model((img_size, img_size, 3))
    # 8. Compile model
    model.compile(loss='categorical_crossentropy',
                  optimizer="sgd",
                  metrics=['accuracy'])
model.summary()
batch_size = 1250 * tpu_strategy.num_replicas_in_sync
# 9. Fit model on training data
model.fit(X_train, Y_train, steps_per_epoch=len(X_train)//batch_size,
          epochs=5, verbose=1)
But my data is in a bucket and my code runs on a VM, so what do I have to do? I tried to load my data using "gs://BUCKETS", but it does not work. What should I do?
EDIT: I have added my code to load the data below; I forgot it, sorry.
def load_data(sets="dogcats/train/", k = 5000, target_size=250):
    # define location of dataset
    folder = sets
    photos, labels = list(), list()
    # determine class
    output = 0.0
    for i, dog in enumerate(listdir(folder + "dogs/")):
        if i >= k:
            break
        # load image
        photo = load_img(folder + "dogs/" +dog, target_size=(target_size, target_size))
        # convert to numpy array
        photo = img_to_array(photo)
        # store
        photos.append(photo)
        labels.append(output)
    output = 1.0
    for i, cat in enumerate(listdir(folder + "cats/") ):
        if i >= k:
            break
        # load image
        photo = load_img(folder + "cats/"+cat, target_size=(target_size, target_size))
        # convert to numpy array
        photo = img_to_array(photo)
        # store
        photos.append(photo)
        labels.append(output)
    # convert to a numpy arrays
    photos = asarray(photos)
    labels = asarray(labels)
    print(photos.shape, labels.shape)
    photos, labels = shuffle(photos, labels, random_state=0)
    return photos, labels
EDIT 2: To complete the answer of @daudnadeem, in case other people are in the same situation.
My goal was to get images from a bucket. The code works well and lets you download a bytes object; to turn it into an image you just need the PIL library:
from PIL import Image
from io import BytesIO
import numpy as np
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket("BUCKETS")
blob = bucket.get_blob('dogscats/train/<you-will-need-to-point-to-a-file-and-not-a-directory>')
data = blob.download_as_string()
img = Image.open(BytesIO(data))
img = np.array(img)
(X_train, y_train) = load_data(sets="gs://BUCKETS/dogscats/train/",target_size=img_size)
(X_test, y_test) = load_data(sets="gs://BUCKETS/dogscats/valid/",target_size=img_size)
This obviously won't work, since essentially all you've done is give sets a string. What you need to do is download the data as a string, and then use that.
First install the package pip install google-cloud-storage or pip3 install google-cloud-storage
pip -> Python
pip3 -> Python3
Have a look at this: you will need a service account to interact with GCP from your code, for authentication purposes.
When you get your service account key as a JSON file, you need to do one of two things:
Set it as an env variable:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
or, my preferred workaround:
gcloud auth activate-service-account \
    <replace-with-email-from-json-file> \
    --key-file=<path/to/your/json/file> --project=<name-of-your-gcp-project>
Now let's look at how you can use the google-cloud-storage library to download your file as a string:
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket("BUCKETS")
blob = bucket.get_blob('/dogscats/train/<you-will-need-to-point-to-a-file-and-not-a-directory>')
data = blob.download_as_string()
Now that you have your data as a string, you can simply pass data into load_data like so: (X_train, y_train) = load_data(sets=data, target_size=img_size)
It sounds complex, but here's a quick pseudo-layout:
Install google-cloud-storage
Go to Google Cloud Platform Console -> IAM & Admin -> Service Accounts
Create a service account with the relevant permissions (google-cloud-storage)
Download the (JSON) key file and remember its location
Activate the service account
Download the file as a string and pass that string to your load_data(data) (see the sketch below for looping over a whole folder prefix)
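And since get_blob needs an object path rather than a directory, here is a rough sketch of looping over everything under a prefix when the training data is a whole folder of images (untested against your exact bucket layout):
from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket("BUCKETS")
# Iterate over every object stored under the "directory" prefix
for blob in bucket.list_blobs(prefix="dogscats/train/"):
    data = blob.download_as_string()
    # decode `data` (e.g. with PIL, as in EDIT 2 above) and append it to your arrays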
Hope that helps!
