Is it possible to process images using GPU in OpenCV - python

So, I got a Python-related task from my university.
We need to take a set of images (from a folder) and resize them as fast as possible. I managed to do it using cv2.resize, but apparently it can be done a lot faster on the GPU. Unfortunately, I was unable to find the best way to do that with the OpenCV module.
I found this code, but it isn't OpenCV-related:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

# Compile an element-wise function that runs on the GPU
@vectorize(['float32(float32, float32)'], target='cuda')
def pow(a, b):
    return a ** b

def main():
    vec_size = 100000000
    a = b = np.array(np.random.sample(vec_size), dtype=np.float32)
    c = np.zeros(vec_size, dtype=np.float32)
    start = timer()
    c = pow(a, b)
    duration = timer() - start
    print(duration)

if __name__ == '__main__':
    main()
EDIT:
I found something called "UMat". What are the benefits of using it?
I tried to use it in my code this way:
image = cv2.UMat(cv2.resize(image, (0, 0), fx=0.5, fy=0.5)) # Resize image by half

Yes, you can use the GPU module in OpenCV, but unfortunately only in C++. There is no wrapper for Python.
Solutions:
Use the C++ CUDA API
Use a different library for GPU computing in Python, e.g. Pillow SIMD

Related

Why CNN running in python is extremely slow in comparison to Matlab?

I have trained a CNN in Matlab 2019b that classifies images into three classes. When this CNN was tested in Matlab it worked fine and took only 10-15 seconds to classify an image. I used the exportONNXNetwork function in Matlab so that I can implement my CNN in TensorFlow. This is the code I am using to load the ONNX file in Python:
import onnx
from onnx_tf.backend import prepare
import numpy as np
from PIL import Image
onnx_model = onnx.load('trainednet.onnx')
tf_rep = prepare(onnx_model)
filepath = 'filepath.png'
img = Image.open(filepath).resize((224,224)).convert("RGB")
img = np.array(img).transpose((2,0,1))
img = np.expand_dims(img, 0)
img = img.astype(np.uint8)
probabilities = tf_rep.run(img)
print(probabilities)
When I use this code to classify the same test set, it seems to classify the images correctly, but it is very slow and freezes my computer, with memory usage reaching 95% or more at some points.
I also noticed in the command prompt while classifying it prints this:
2020-04-18 18:26:39.214286: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:530] constant_folding failed: Deadline exceeded: constant_folding exceeded deadline., time = 486776.938ms.
Is there any way I can make this python code classify faster?
Maybe you could try to find out which part of the code takes a long time, like this:
import onnx
from onnx_tf.backend import prepare
import numpy as np
from PIL import Image
import datetime

# Time loading the model and converting it to a TensorFlow representation
now = datetime.datetime.now()
onnx_model = onnx.load('trainednet.onnx')
tf_rep = prepare(onnx_model)
filepath = 'filepath.png'
later = datetime.datetime.now()
difference = later - now
# total_seconds() covers the whole interval; .microseconds is only the sub-second part
print("Loading time : %f ms" % (difference.total_seconds() * 1000))

img = Image.open(filepath).resize((224,224)).convert("RGB")
img = np.array(img).transpose((2,0,1))
img = np.expand_dims(img, 0)
img = img.astype(np.uint8)

# Time the prediction itself
now = datetime.datetime.now()
probabilities = tf_rep.run(img)
later = datetime.datetime.now()
difference = later - now
print("Prediction time : %f ms" % (difference.total_seconds() * 1000))
print(probabilities)
Let me know what the output looks like :)
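A reusable variant of this timing approach can be sketched with a small context manager. This is only an illustration: the `timed` helper and its labels are hypothetical names, not part of onnx or onnx-tf, and `time.perf_counter` is used because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of the enclosed block in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if results is not None:
            results[label] = elapsed_ms
        print("%s time : %.3f ms" % (label, elapsed_ms))

# Stand-in workload; in the code above this would wrap prepare(...) or tf_rep.run(...)
timings = {}
with timed("Sleep", timings):
    time.sleep(0.05)
```

Wrapping the loading step and the prediction step in separate `with timed(...)` blocks would give the same two measurements as the code above.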
In this case, it appears that the Grappler optimization suite has hit some kind of infinite loop or memory leak. I would recommend filing an issue against the GitHub repo.
It's hard to debug why constant folding is taking so long, but you may get better performance from the ONNX TensorRT backend than from the TensorFlow backend. On Nvidia GPUs it typically runs faster and also compiles typical graphs more quickly. Constant folding usually doesn't provide large speedups for well-optimized models.
import onnx
import onnx_tensorrt.backend as backend
import numpy as np
from PIL import Image

model = onnx.load("trainednet.onnx")
engine = backend.prepare(model, device='CUDA:1')
filepath = 'filepath.png'
img = Image.open(filepath).resize((224,224)).convert("RGB")
img = np.array(img).transpose((2,0,1))
img = np.expand_dims(img, 0)
img = img.astype(np.uint8)
output_data = engine.run(img)[0]
print(output_data)
You should consider a few points when working with TensorFlow in Python. A GPU is the better choice, as it speeds up the whole process; for that, you have to install CUDA support. Apart from this, the development environment sometimes matters too: in my experience, VSCode is better than Spyder.
I hope it helps.
Since the command prompt states that your program takes a long time to perform constant folding, it might be worthwhile to turn this off. Based on this documentation, you could try running:
import numpy as np
import timeit
import traceback
import contextlib
import onnx
from onnx_tf.backend import prepare
from PIL import Image
import tensorflow as tf

@contextlib.contextmanager
def options(options):
    # Temporarily apply the given optimizer options, restoring the old ones afterwards
    old_opts = tf.config.optimizer.get_experimental_options()
    tf.config.optimizer.set_experimental_options(options)
    try:
        yield
    finally:
        tf.config.optimizer.set_experimental_options(old_opts)

with options({'constant_folding': False}):
    onnx_model = onnx.load('trainednet.onnx')
    tf_rep = prepare(onnx_model)
    filepath = 'filepath.png'
    img = Image.open(filepath).resize((224,224)).convert("RGB")
    img = np.array(img).transpose((2,0,1))
    img = np.expand_dims(img, 0)
    img = img.astype(np.uint8)
    probabilities = tf_rep.run(img)
    print(probabilities)
This disables the constant folding performed in the TensorFlow Graph optimization. This can work both ways: on the one hand it will not reach the constant folding deadline, but on the other hand disabling constant folding can result in significant runtime increases. Anyway it is worth trying, good luck!

How to use Dask to run python code on the GPU?

I have some code that uses Numba's cuda.jit so that it runs on the GPU, and I would like to layer Dask on top of it if possible.
Example Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from numba import cuda, njit
import numpy as np
from dask.distributed import Client, LocalCluster

@cuda.jit()
def addingNumbersCUDA (big_array, big_array2, save_array):
    # One thread per row; each thread loops over the columns of its row
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range (big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]

if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)
    big_array = np.random.random_sample((100, 3000))
    big_array2 = np.random.random_sample((100, 3000))
    save_array = np.zeros(shape=(100, 3000))
    arraysize = 100
    threadsperblock = 64
    # Ceiling division: enough blocks to cover every row
    blockspergrid = (arraysize + (threadsperblock - 1)) // threadsperblock
    d_big_array = cuda.to_device(big_array)
    d_big_array2 = cuda.to_device(big_array2)
    d_save_array = cuda.to_device(save_array)
    addingNumbersCUDA[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)
    save_array = d_save_array.copy_to_host()
If my function addingNumbersCUDA didn't use any CUDA I would just put client.submit in front of my function (along with gather after) and it would work. But, since I'm using CUDA putting submit in front of the function doesn't work. The dask documentation says that you can target the gpu, but it's unclear as to how to actually set it up in practice. How would I set up my function to use dask with the gpu targeted and with cuda.jit if possible?
You may want to look through Dask's documentation on GPUs.
"But, since I'm using CUDA putting submit in front of the function doesn't work."
There is no particular reason why this should be the case. All Dask does is run your function on a different computer. It doesn't change or modify your function in any way.
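If the difficulty is that `addingNumbersCUDA[blockspergrid, threadsperblock](...)` is not a plain function call that can be handed to `client.submit`, one common pattern is to wrap the whole host-side sequence (copy to device, launch, copy back) in an ordinary Python function and submit that wrapper. The sketch below shows only the shape of such a wrapper. To stay runnable without a GPU or a Dask cluster it makes two loud assumptions: `concurrent.futures.ThreadPoolExecutor` stands in for the Dask `Client` (both offer `submit` returning a future with `result()`), and a pure-Python multiply stands in for the CUDA kernel. `multiply_on_device` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor

def multiply_on_device(rows_a, rows_b):
    """Host-side wrapper: everything the worker must do lives in one plain function.

    In the real code this body would call cuda.to_device(...), launch
    addingNumbersCUDA[blockspergrid, threadsperblock](...), and return
    d_save_array.copy_to_host(). Here a CPU loop stands in for the kernel.
    """
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(rows_a, rows_b)]

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[2.0, 2.0, 2.0], [0.5, 0.5, 0.5]]

# ThreadPoolExecutor is only a stand-in for dask.distributed.Client here
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(multiply_on_device, a, b)
    result = future.result()

print(result)
```

With Dask, the same wrapper would be passed as `client.submit(multiply_on_device, big_array, big_array2)`, assuming the worker that runs it has access to a GPU.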

Solving linear system using Python with numba and CUDA

I am trying to solve a linear system using numba with GPU processing using CUDA.
I have installed all the relevant packages and tested them, so it seems that my GPU, CUDA, etc. are set up properly.
My code is:
import numpy as np
import time
from numba import vectorize, cuda

@vectorize(['float64(float64, float64)'], target='cuda')
def solver(A, b):
    return np.linalg.solve(A, b)

def main():
    A = np.random.rand(100, 100).astype(np.float64)
    b = np.random.rand(100, 1).astype(np.float64)
    start = time.time()
    C = solver(A, b)
    vector_add_time = time.time() - start
    print("Took " + str(vector_add_time) + " seconds to solve")

if __name__ == '__main__':
    main()
With the @vectorize... line commented out, the code runs fine. However, when I try to run it with numba and CUDA, I get a long list of errors, of which I think the most relevant one is:
raise TypingError(msg)
numba.errors.TypingError: Failed at nopython (nopython frontend)
np.linalg.solve() only supported for array types
I assume the problem is that numpy.linalg.solve does not accept the data types required by cuda.
Am I correct in assuming this? Are there other data types that will work?
In this example problem, the same data type is passed to the function, so I think the problem lies with numpy.linalg.
"Am I correct in assuming this?"
No
"Are there other data types that will work?"
No
The problem here is that you cannot use numpy.linalg in code which is targeted to run on the numba GPU backend.
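The error message is also consistent with how `vectorize` works: the signature `float64(float64, float64)` declares scalar arguments, so the decorated function is applied element by element and never receives whole arrays. The sketch below illustrates this on the CPU, using `numpy.vectorize` as a stand-in for Numba's `vectorize` (an assumption made so that it runs without a GPU; `record_ndim` is a hypothetical helper), and then solves a small system with plain NumPy.

```python
import numpy as np

seen_ndims = set()

def record_ndim(a, b):
    """Record the dimensionality of what an element-wise function receives."""
    seen_ndims.add((np.ndim(a), np.ndim(b)))
    return a + b

rng = np.random.default_rng(0)
A = rng.random((4, 4)) + 4.0 * np.eye(4)   # diagonal boost keeps A well conditioned
b = rng.random((4, 1))

# Applied element by element with broadcasting: the function only ever sees scalars
np.vectorize(record_ndim)(A, b)
print(seen_ndims)

# np.linalg.solve needs the full matrix, so it is called outside any element-wise wrapper
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))
```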

CUDA-Python: How to launch CUDA kernel in Python (Numba 0.25)?

Could you please help me understand how to write CUDA kernels in Python? AFAIK, numba.vectorize can run on cuda, cpu, or parallel (multiple CPUs), depending on the target. But target='cuda' requires setting up CUDA kernels.
The main issue is that many examples and answers on the Internet relate to the deprecated NumbaPro library, so it's hard to follow such outdated wikis, especially if you're a newbie.
I have:
latest Anaconda (v2)
latest Numba (v0.25)
latest CUDA toolkit (v7)
Here is the error I'm getting:
numba.cuda.cudadrv.driver.CudaAPIError: 1 Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
import numpy as np
import time
from numba import vectorize, cuda

@vectorize(['float32(float32, float32)'], target='cuda')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    start = time.time()
    C = VectorAdd(A, B)
    vector_add_time = time.time() - start
    print "C[:5] = " + str(C[:5])
    print "C[-5:] = " + str(C[-5:])
    print "VectorAdd took %s seconds" % vector_add_time

if __name__ == '__main__':
    main()
The code, as posted, is correct and will run on a Python 2 Numbapro/Accelerate system without error.
Most likely, the particular system used to run the code had limited capacity and was hitting a display driver watchdog or free memory error with 32 million element vectors. Reducing the size of the input data allowed the code to run correctly.
[This answer assembled from comments and added as a community wiki entry to get this question off the unanswered list]
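For scale, 32 million float32 values occupy about 128 MB per array, so the two inputs plus the output need roughly 384 MB of device memory. If the data size cannot simply be reduced, one option is to process it in slices. The sketch below is a pure-Python illustration of that idea; `chunk_bounds` and the chunk size are hypothetical choices, not part of Numba.

```python
def chunk_bounds(n, chunk):
    """Yield (start, stop) index pairs that cover range(n) in pieces of at most `chunk`."""
    for start in range(0, n, chunk):
        yield start, min(start + chunk, n)

N = 32000000
BYTES_PER_FLOAT32 = 4
total_bytes = 3 * N * BYTES_PER_FLOAT32   # A, B and the result C
print(total_bytes / 1e6, "MB")

bounds = list(chunk_bounds(N, 4000000))   # 4 million elements per call (arbitrary)
print(len(bounds), bounds[0], bounds[-1])

# Each slice could then be passed to the ufunc separately, e.g.
#   C[start:stop] = VectorAdd(A[start:stop], B[start:stop])
```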

Importing scipy breaks multiprocessing support in Python

I am running into a bizarre problem that I can't explain. I'm hoping someone out there can help please!
I'm running Python 2.7.3 and Scipy v0.14.0 and am trying to implement some very simple multiprocessing algorithms to speed up my code using the multiprocessing module. I've managed to make a basic example work:
import multiprocessing
import numpy as np
import time
# import scipy.special

def compute_something(t):
    a = 0.
    for i in range(100000):
        a = np.sqrt(t)
    return a

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = range(10)
    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic
    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool :', time.time() - tic
This runs fine, returning
Pool size: 8
Built-in: 1.56904006004
Pool : 0.447728157043
But if I uncomment the line import scipy.special, I get:
Pool size: 8
Built-in: 1.58968091011
Pool : 1.59387993813
and I can see that only one core is doing the work on my system. In fact, importing any module from the scipy package seems to have this effect (I've tried several).
Any ideas? I've never seen a case like this before, where an apparently innocuous import can have such a strange and unexpected effect.
Thanks!
Update (1)
Moving the scipy import line to the function compute_something partially improves the problem:
Pool size: 8
Built-in: 1.66807389259
Pool : 0.596321105957
Update (2)
Thanks to @larsmans for testing on a different system. Problem was not confirmed using Scipy v.0.12.0. Moving this query to the scipy mailing list and will post any answers.
After much digging around and posting an issue on the Scipy GitHub site, I've found a solution.
Before I start, this is documented very well here - I'll just give an overview.
This problem is not related to the version of Scipy, or Numpy that I was using. It originates in the system BLAS libraries that Numpy and Scipy use for various linear algebra routines. You can tell which libraries Numpy is linked to by running
python -c 'import numpy; numpy.show_config()'
If you are using OpenBLAS in Linux, you may find that the CPU affinity is set to 1, meaning that once these algorithms are imported in Python (via Numpy/Scipy), you can access at most one core of the CPU. To test this, within a Python terminal run
import os
os.system('taskset -p %s' %os.getpid())
If the CPU affinity is returned as f, or ff, you can access multiple cores. In my case it would start like that, but upon importing numpy or scipy.any_module, it would switch to 1, hence my problem.
I've found two solutions:
Change CPU affinity
You can manually set the CPU affinity of the master process at the top of the main function so that the code looks like this:
import multiprocessing
import numpy as np
import math
import time
import os

def compute_something(t):
    a = 0.
    for i in range(10000000):
        a = math.sqrt(t)
    return a

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    # Reset the affinity of the master process so it may use all cores
    os.system('taskset -cp 0-%d %s' % (pool_size, os.getpid()))
    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)
    inputs = range(10)
    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic
    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool :', time.time() - tic
Note that selecting a value higher than the number of cores for taskset doesn't seem to matter - it just uses the maximum possible number.
Switch BLAS libraries
Solution documented at the site linked above. Basically: install libatlas and run update-alternatives to point numpy to ATLAS rather than OpenBLAS.
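On Python 3.3+ under Linux, the affinity can also be inspected and reset from within Python rather than by shelling out to taskset. This is a sketch, not the method used above: `os.sched_getaffinity` and `os.sched_setaffinity` are standard-library functions but are only available on some platforms, and `widen_affinity` is a hypothetical helper name.

```python
import os

def widen_affinity():
    """Try to let the current process run on every CPU; return the resulting CPU set.

    Returns None on platforms without the sched_* affinity functions.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    all_cpus = set(range(os.cpu_count() or 1))
    try:
        os.sched_setaffinity(0, all_cpus)   # pid 0 means the calling process
    except OSError:
        pass  # e.g. restricted by a container; keep the current mask
    return os.sched_getaffinity(0)

allowed = widen_affinity()
print("CPUs available to this process:", allowed)
```

Calling such a helper after importing numpy or scipy and before creating the Pool would play the same role as the taskset line in the first solution.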
