Why is a CNN running in Python extremely slow in comparison to Matlab? - python

I have trained a CNN in Matlab 2019b that classifies images into three classes. When this CNN was tested in Matlab it worked fine and only took 10-15 seconds to classify an image. I used the exportONNXNetwork function in Matlab so that I can run my CNN in TensorFlow. This is the code I am using to load the ONNX file in Python:
import onnx
from onnx_tf.backend import prepare
import numpy as np
from PIL import Image
onnx_model = onnx.load('trainednet.onnx')
tf_rep = prepare(onnx_model)
filepath = 'filepath.png'
img = Image.open(filepath).resize((224,224)).convert("RGB")
img = np.array(img).transpose((2,0,1))
img = np.expand_dims(img, 0)
img = img.astype(np.uint8)
probabilities = tf_rep.run(img)
print(probabilities)
When I use this code to classify the same test set, it seems to classify the images correctly, but it is very slow and freezes my computer, with memory usage reaching 95%+ at some points.
I also noticed that while classifying, the command prompt prints this:
2020-04-18 18:26:39.214286: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:530] constant_folding failed: Deadline exceeded: constant_folding exceeded deadline., time = 486776.938ms.
Is there any way I can make this python code classify faster?

Maybe you could try to understand what part of the code takes a long time this way:
import onnx
from onnx_tf.backend import prepare
import numpy as np
from PIL import Image
import datetime
now = datetime.datetime.now()
onnx_model = onnx.load('trainednet.onnx')
tf_rep = prepare(onnx_model)
filepath = 'filepath.png'
later = datetime.datetime.now()
difference = later - now
print("Loading time : %f ms" % (difference.microseconds / 1000))
img = Image.open(filepath).resize((224,224)).convert("RGB")
img = np.array(img).transpose((2,0,1))
img = np.expand_dims(img, 0)
img = img.astype(np.uint8)
now = datetime.datetime.now()
probabilities = tf_rep.run(img)
later = datetime.datetime.now()
difference = later - now
print("Prediction time : %f ms" % (difference.microseconds / 1000))
print(probabilities)
Let me know what the output looks like :)

In this case, it appears that the Grappler optimization suite has encountered some kind of infinite loop or memory leak. I would recommend filing an issue against the GitHub repo.
It's challenging to debug why constant folding is taking so long, but you may get better performance with the ONNX TensorRT backend than with the TensorFlow backend. It achieves better performance on Nvidia GPUs while compiling typical graphs more quickly. Constant folding usually doesn't provide large speedups for well-optimized models.
import onnx
import onnx_tensorrt.backend as backend
import numpy as np
from PIL import Image
model = onnx.load("trainednet.onnx")
engine = backend.prepare(model, device='CUDA:1')
filepath = 'filepath.png'
img = Image.open(filepath).resize((224,224)).convert("RGB")
img = np.array(img).transpose((2,0,1))
img = np.expand_dims(img, 0)
img = img.astype(np.uint8)
output_data = engine.run(img)[0]
print(output_data)

You should consider a few points while working with TensorFlow in Python. A GPU will speed up the whole process considerably; for that, you have to install CUDA support. Apart from this, the editor you run the code from can also matter: from my experience, VSCode behaves better than Spyder.
I hope it helps.
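As a quick check (my addition, not from the answer above), you can ask TensorFlow whether it actually sees a CUDA-capable GPU:
import tensorflow as tf
# An empty list means TensorFlow will run on the CPU only, which is much slower.
print(tf.config.list_physical_devices('GPU'))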

Since the command prompt states that your program takes a long time to perform constant folding, it might be worthwhile to turn this off. Based on this documentation, you could try running:
import numpy as np
import timeit
import traceback
import contextlib
import onnx
from onnx_tf.backend import prepare
from PIL import Image
import tensorflow as tf
@contextlib.contextmanager
def options(options):
    old_opts = tf.config.optimizer.get_experimental_options()
    tf.config.optimizer.set_experimental_options(options)
    try:
        yield
    finally:
        tf.config.optimizer.set_experimental_options(old_opts)

with options({'constant_folding': False}):
    onnx_model = onnx.load('trainednet.onnx')
    tf_rep = prepare(onnx_model)
    filepath = 'filepath.png'
    img = Image.open(filepath).resize((224,224)).convert("RGB")
    img = np.array(img).transpose((2,0,1))
    img = np.expand_dims(img, 0)
    img = img.astype(np.uint8)
    probabilities = tf_rep.run(img)
    print(probabilities)
This disables the constant folding performed during TensorFlow graph optimization. It can cut both ways: you will no longer hit the constant folding deadline, but disabling constant folding can result in significant runtime increases. Anyway, it is worth trying. Good luck!
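As a quick sanity check (my addition, not part of the original answer), you can print the active optimizer options to confirm the setting took effect:
import tensorflow as tf
tf.config.optimizer.set_experimental_options({'constant_folding': False})
# Should show 'constant_folding': False among the currently active options
print(tf.config.optimizer.get_experimental_options())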

Related

In Fast AI Part 2, Lecture 9A of the course, I get the CUDA Out of Memory error in PyTorch

I was trying to follow along with Jeremy Howard's Fast AI course.
I am on Part 2 of the course, Lecture 9A: https://www.youtube.com/watch?v=0_BBRNYInx8&t=0s
The code I am trying to run can be found below (its the code in the second cell):
https://github.com/fastai/diffusion-nbs/blob/master/Stable%20Diffusion%20Deep%20Dive.ipynb
The imports:
from base64 import b64encode
import numpy
import torch
from diffusers import AutoencoderKL, LMSDiscreteScheduler, UNet2DConditionModel
from huggingface_hub import notebook_login
# For video display:
from IPython.display import HTML
from matplotlib import pyplot as plt
from PIL import Image
from torch import autocast
from torchvision import transforms as tfms
from tqdm.auto import tqdm
from transformers import CLIPTextModel, CLIPTokenizer, logging
torch.manual_seed(1)
# if not (Path.home()/'.huggingface'/'token').exists(): notebook_login()
# Suppress some unnecessary warnings when loading the CLIPTextModel
logging.set_verbosity_error()
# Set device
print(torch.cuda.is_available())
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
The actual code:
# Load the autoencoder model which will be used to decode the latents into image space.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
# Load the tokenizer and text encoder to tokenize and encode the text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
# The UNet model for generating the latents.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
# The noise scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
# To the GPU we go!
vae = vae.to(torch_device)
text_encoder = text_encoder.to(torch_device)
unet = unet.to(torch_device);
The problem: CUDA out of memory error
On the last line, unet.to(torch_device),
I encounter the CUDA out of memory error.
I have a 2 GB Nvidia graphics card (I know this memory is too small, so I tried a few things to fix this).
The solution I tried, but it did not work:
In the code below,
I used set_attention_slice
I set vae and text_encoder to "cpu"
## To prevent CUDA out of memory. Change 1024 to any number
slice_size = unet.config.attention_head_dim // 1024
unet.set_attention_slice(slice_size)
# To the GPU we go!
vae = vae.to("cpu") #vae.to(torch_device)
text_encoder = text_encoder.to("cpu")#text_encoder.to(torch_device)
unet = unet.to(torch_device).to_fp16 # to_fp16 is my code comment
This did not work.
Is there anything else I can try?
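For what it's worth, this is only my sketch of how half-precision loading is usually written with diffusers (there is no to_fp16 attribute); it is not a guaranteed fix for a 2 GB card:
import torch
from diffusers import UNet2DConditionModel

torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the UNet weights directly in float16, roughly halving its GPU memory footprint.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16
)
unet = unet.to(torch_device)

# Attention slicing trades speed for memory; a slice size of 1 uses the least memory.
unet.set_attention_slice(1)

# Note: any latents / embeddings passed to this UNet must then also be float16.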

Custom Image segmentation has different results on different model loads Pixellib

I am doing custom image segmentation using PixelLib.
import pixellib
from pixellib.instance import custom_segmentation
segment_image = custom_segmentation()
segment_image.inferConfig(num_classes= 1, class_names= ["BG", "road"])
segment_image.load_model("/content/mask_rcnn_model.143-0.199986.h5")
segment_image.segmentImage("/content/FullTrainingData/Warlick.PNG", show_bboxes=False, output_image_name="a0000sample_out" + str(i) +".PNG")
The segmentation looks good on my machine, but running it on another machine with the same h5 file, the result is far worse.
The result also differs each time I reload the model. Any idea why this could be happening?
Adding these imports:
import pixellib
from pixellib.tune_bg import alter_bg
causes it to work. No idea why, though.

How to convert "tensor" to "numpy" array in tensorflow?

I am trying to convert a tensor to a numpy array in TensorFlow 2.0. Since TF 2.0 has eager execution enabled, this should work by default, and it does work in the normal runtime. But when I execute the code through the tf.data.Dataset API it gives the error
"AttributeError: 'Tensor' object has no attribute 'numpy'"
I have tried ".numpy()" on the tensorflow variable, and for ".eval()" I am unable to get a default session.
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
# tf.executing_eagerly()
import os
import time
import matplotlib.pyplot as plt
from IPython.display import clear_output
from model.utils import get_noise
import cv2
def random_noise(input_image):
    img_out = get_noise(input_image)
    return img_out

def load_denoising(image_file):
    image = tf.io.read_file(image_file)
    image = tf.image.decode_png(image)
    real_image = image
    input_image = random_noise(image.numpy())
    input_image = tf.cast(input_image, tf.float32)
    real_image = tf.cast(real_image, tf.float32)
    return input_image, real_image

def load_image_train(image_file):
    input_image, real_image = load_denoising(image_file)
    return input_image, real_image
This works fine
inp, re = load_denoising('/data/images/train/18.png')
# Check for correct run
plt.figure()
plt.imshow(inp)
print(re.shape," ", inp.shape)
And this produces the mentioned error:
train_dataset = tf.data.Dataset.list_files('/data/images/train/*.png')
train_dataset = train_dataset.map(load_image_train,num_parallel_calls=tf.data.experimental.AUTOTUNE)
Note: random_noise uses cv2 and sklearn functions.
You can't use the .numpy method on a tensor if that tensor is going to be used in a tf.data.Dataset.map call.
Under the hood, the tf.data.Dataset object works by creating a static graph: this means that you can't use .numpy(), because a tf.Tensor object in a static-graph context does not have this attribute.
Therefore, the line input_image = random_noise(image.numpy()) should be input_image = random_noise(image).
But the code is likely to fail again since random_noise calls get_noise from the model.utils package.
If the get_noise function is written using Tensorflow, then everything will work. Otherwise, it won't work.
The solution? Write the code using only the Tensorflow primitives.
For instance, if your function get_noise just creates random noise with the shape of your input image, you can define it like:
def get_noise(image):
    return tf.random.normal(shape=tf.shape(image))
using only the Tensorflow primitives, and it will work.
Hope this overview helps!
P.S.: you might be interested in having a look at the article series "Analyzing tf.function to discover AutoGraph strengths and subtleties" - it covers this aspect (perhaps part 3 is the one related to your scenario): part 1 part 2 part 3
In TF 2.x, use tf.config.run_functions_eagerly(True).
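A minimal sketch of that suggestion (my example, not part of the answer above):
import tensorflow as tf
# Run tf.function-compiled code eagerly so .numpy() is available while debugging.
tf.config.run_functions_eagerly(True)
# On newer TF versions (2.5+), tf.data map functions additionally need debug mode
# before they run eagerly.
tf.data.experimental.enable_debug_mode()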

Neural Compute Stick 2: I've done all processing to use NCS2, but It's too slow

Recently, I was given a Neural Compute Stick 2 by my professor.
After a lot of trial and error, I have configured the environment.
I got all the information from the official Intel site.
sudo python3 mo_tf.py \
    --input_model /home/leehanbeen/PycharmProjects/TypeClassifier/inference_graph_type.pb \
    --input_shape "[1, 64, 128, 3]" --input "input"
I have successfully converted the pb file to the IR (.xml, .bin) files via the Model Optimizer and wanted to run it on the Raspberry Pi.
import tensorflow as tf
import cv2
import numpy as np
BIN_PATH = '/home/pi/Downloads/inference_graph_type.bin'
XML_PATH = '/home/pi/Downloads/inference_graph_type.xml'
IMAGE_PATH = '/home/pi/Downloads/plate(110).jpg_2.jpg' #naming miss.. :(
net = cv2.dnn.readNet(XML_PATH, BIN_PATH)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
frame = cv2.imread(IMAGE_PATH)
frame = cv2.resize(frame, (128, 64))
blob = cv2.dnn.blobFromImage(frame, size=(128, 64), ddepth=cv2.CV_8U)
net.setInput(blob)
out = net.forward()
out = out.reshape(-1)
print(out)
print(np.max(out))
print(np.argmax(out))
This source works very well, but it's too slow. When I give a (128, 64, 3) image as input to the model, the inference time is 4.7 seconds:
[0.0128479 0.2097168 0.76416016 0.00606918 0.00246811 0.00198746 0.00129604 0.00117588]
0.76416016
2
When I gave a smaller (40, 40, 1) image instead, it seemed to take infinitely long.
I followed all the procedures exactly as on the official Intel home page. Why is the inference time so slow? It's just a very simple CNN classification model.
Resolved.
Instead of using the Inference Engine as a backend in OpenCV, I used the Inference Engine directly, and the inference time was shortened from 4.7 seconds to 0.01 seconds.
But there is still a problem: inference on the color (128, 64) image is normal, while the grayscale image still takes a seemingly infinite amount of time.
I have written the relevant source code on my GitHub. It is in Korean, but you can see just the source at the bottom.
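For reference, a minimal sketch of what "using IE directly" usually looks like with the classic openvino.inference_engine Python API (my sketch, not the poster's GitHub code; paths and the MYRIAD device are taken from the question):
import cv2
import numpy as np
from openvino.inference_engine import IECore

XML_PATH = '/home/pi/Downloads/inference_graph_type.xml'
BIN_PATH = '/home/pi/Downloads/inference_graph_type.bin'
IMAGE_PATH = '/home/pi/Downloads/plate(110).jpg_2.jpg'

ie = IECore()
net = ie.read_network(model=XML_PATH, weights=BIN_PATH)
input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))

# Compile the network for the Neural Compute Stick 2
exec_net = ie.load_network(network=net, device_name="MYRIAD")

# The Model Optimizer usually converts the TF NHWC input to NCHW, so lay the blob out as (1, 3, 64, 128)
frame = cv2.resize(cv2.imread(IMAGE_PATH), (128, 64))
blob = frame.transpose((2, 0, 1))[np.newaxis, ...].astype(np.float32)

result = exec_net.infer({input_name: blob})[output_name].reshape(-1)
print(result, np.max(result), np.argmax(result))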

How to optimize the inference script to get a faster prediction of the classifier?

I have written the following prediction code that predicts from a trained classifier model. At the moment the prediction time is around 40 s, which I want to reduce as much as possible.
Can I do any optimization to my inference script, or should I look for changes in the training script?
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.transforms import transforms
import matplotlib.pyplot as plt
import numpy as np
from torch.autograd import Variable
import torch.functional as F
from PIL import Image
import os
import sys
import argparse
import time
import json
parser = argparse.ArgumentParser(description = 'To Predict from a trained model')
parser.add_argument('-i','--image', dest = 'image_name', required = True, help='Path to the image file')
args = parser.parse_args()
def predict_image(image_path):
    print("prediction in progress")
    image = Image.open(image_path)
    transformation = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ])
    image_tensor = transformation(image).float()
    image_tensor = image_tensor.unsqueeze_(0)
    if cuda:
        image_tensor.cuda()
    input = Variable(image_tensor)
    output = model(input)
    index = output.data.numpy().argmax()
    return index

def parameters():
    hyp_param = open('param_predict.txt', 'r')
    param = {}
    for line in hyp_param:
        l = line.strip('\n').split(':')

def class_mapping(index):
    with open("class_mapping.json") as cm:
        data = json.load(cm)
    if index == -1:
        return len(data)
    else:
        return data[str(index)]

def segregate():
    with open("class_mapping.json") as cm:
        data = json.load(cm)
    try:
        os.mkdir(seg_dir)
        print("Directory ", seg_dir, " Created ")
    except OSError:
        print("Directory ", seg_dir, " already created")
    for x in range(0, len(data)):
        dir_path = "./" + seg_dir + "/" + data[str(x)]
        try:
            os.mkdir(dir_path)
            print("Directory ", dir_path, " Created ")
        except OSError:
            print("Directory ", dir_path, " already created")

path_to_model = "./models/" + 'trained.model'
seg_dir = "segregation_folder"
cuda = torch.cuda.is_available()
# map_location is an argument of torch.load (load_state_dict has no such parameter)
if cuda:
    checkpoint = torch.load(path_to_model)
else:
    checkpoint = torch.load(path_to_model, map_location='cpu')
num_class = class_mapping(index=-1)
print(num_class)
model = resnet18(num_classes=num_class)
model.load_state_dict(checkpoint)
model.eval()

if __name__ == "__main__":
    imagepath = "./Predict_Image/" + args.image_name
    since = time.time()
    img = Image.open(imagepath)
    prediction = predict_image(imagepath)
    name = class_mapping(prediction)
    print("Time taken = ", time.time() - since)
    print("Predicted Class: ", name)
The entire project can be found at
https://github.com/amrit-das/custom_image_classifier_pytorch/
Without output from your profiler it's difficult to tell how much of that is due to inefficiencies in your code. That being said, PyTorch has a lot of startup overhead - in other words, it's slow to initialize the library and the model, load weights, and transfer them to the GPU, compared to the inference time on a single image. This makes it pretty poor as a CLI utility for single-image prediction.
If your use case really requires working with single images instead of batch processing, there is not that much potential for optimization. Two options I see (a sketch of the first follows below):
It may be worth it to skip GPU execution altogether and save on GPU allocations and transfers.
You will get better performance writing this code in C++ using LibTorch. This is plenty of development work, though.
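A minimal sketch of the first option (my sketch, not the answerer's code); it keeps everything on the CPU, disables gradient tracking, and swaps the RandomResizedCrop for a deterministic Resize + CenterCrop, which is the usual choice at inference time. The image path and class count are hypothetical placeholders:
import torch
from torchvision import transforms
from torchvision.models import resnet18
from PIL import Image

num_class = 2  # hypothetical; use the value from class_mapping(index=-1)
# Load the checkpoint straight onto the CPU to avoid GPU init and transfers.
checkpoint = torch.load("./models/trained.model", map_location='cpu')
model = resnet18(num_classes=num_class)
model.load_state_dict(checkpoint)
model.eval()

# Deterministic preprocessing for inference.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
image_tensor = transform(Image.open("./Predict_Image/example.jpg")).unsqueeze(0)

# Inference without autograd bookkeeping.
with torch.no_grad():
    index = model(image_tensor).argmax(dim=1).item()
print(index)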
You can also optimize inference itself by using e.g. OpenVINO. OpenVINO is optimized for Intel hardware but it should work with any CPU. It optimizes the inference performance by e.g. graph pruning or fusing some operations together while preserving accuracy.
You can find a full tutorial on how to convert the PyTorch model here.
Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert a PyTorch model directly for now, but it can do it with an ONNX model. This sample code assumes the model is for computer vision.
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package, so be sure you have installed it. It converts the ONNX model to IR, which is the default format for OpenVINO. It also changes the precision to FP16 for better performance (you can use FP32 as well). Run in the command line:
mo --input_model "model.onnx" --input_shape "[1,3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image (a preprocessed NCHW numpy array)
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
