Why is my GPU slower than CPU in matrix operations? - python

CPU: i7-9750 @ 2.6 GHz (with 16 GB DDR4 RAM); GPU: Nvidia GeForce GTX 1660 Ti (6 GB); OS: Windows 10 64-bit
I wanted to see how fast the GPU is at basic matrix operations compared with the CPU, and I basically followed this article: https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56. The following is my very simple code:
import numpy as np
import cupy as cp
import time
### Numpy and CPU
s = time.time()
A = np.random.random([10000,10000]); B = np.random.random([10000,10000])
CPU = np.matmul(A,B); CPU *= 5
e = time.time()
print(f'CPU time: {e - s: .2f}')
### CuPy and GPU
s = time.time()
C= cp.random.random([10000,10000]); D = cp.random.random([10000,10000])
GPU = cp.matmul(C,D); GPU *= 5
cp.cuda.Stream.null.synchronize()
# to let the code finish executing on the GPU before calculating the time
e = time.time()
print(f'GPU time: {e - s: .2f}')
Ironically, it shows
CPU time: 11.74
GPU time: 12.56
This really confuses me. How can the GPU be even slower than the CPU on large matrix operations? Note that I have not even applied explicit parallel computing (I am a beginner and I am not sure whether the system enables it for me or not). I did check similar questions such as Why is my CPU doing matrix operations faster than GPU instead?, but there the author uses mxnet, whereas here I am using cupy (cupy is newer and designed for GPU computing).
Can someone help? I would really appreciate it!

Both np.random.random and cp.random.random generate float64 (double precision) values by default, and consumer GeForce GPUs have very limited double-precision throughput, so the GPU spends its time in slow float64 arithmetic. Switch the GPU random number generation to single precision like this:
C= cp.random.random([10000,10000], dtype=cp.float32)
D = cp.random.random([10000,10000], dtype=cp.float32)
I have different hardware (both CPU and GPU) than you, but once this change is made the GPU version is about 12x faster than the CPU version. Generating both arrays of random numbers, the matrix multiplication, and the scalar multiplication together take less than one second in total with CuPy.
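To isolate the matrix multiplication itself, you can also keep the random number generation out of the timed region and add a warm-up call before timing the GPU. A minimal sketch along those lines (not the original code; it assumes the same 10000x10000 matrices, and the warm-up call and explicit float32 cast are additions):
import time
import numpy as np
import cupy as cp

n = 10000
A = np.random.random((n, n)).astype(np.float32)
B = np.random.random((n, n)).astype(np.float32)
C = cp.asarray(A)  # same data, copied to the GPU
D = cp.asarray(B)

s = time.time()
CPU = np.matmul(A, B)
print(f'CPU matmul: {time.time() - s: .2f} s')

GPU = cp.matmul(C, D)                 # warm-up: kernel selection, memory pool allocation
cp.cuda.Stream.null.synchronize()
s = time.time()
GPU = cp.matmul(C, D)
cp.cuda.Stream.null.synchronize()     # wait for the GPU to finish before stopping the clock
print(f'GPU matmul: {time.time() - s: .2f} s')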

Related

Memory usage when running TensorFlow operations

I noticed that TensorFlow allocates a large amount of main memory (700 MB+) when running certain operations in eager mode, even on tiny tensors (e.g. with shapes as small as [5, 5]). For example:
import tensorflow as tf # Total memory usage is ~141MB
fst, snd = tf.random.uniform([5, 5]), tf.random.uniform([5, 5]) # ...grows to ~697MB
tf.matmul(fst, snd) # ...and then to ~1349MB
tf.concat([fst, snd], 0) # Total memory use remains ~1349MB
This is running on TensorFlow 2.6.0-rc1, on an RTX 3070 on Windows 10. I observed the same behaviour when enabling memory growth with tf.config.experimental.set_memory_growth.
Why do the calls to tf.matmul and tf.random.uniform result in almost 700MB being allocated each time, whilst the subsequent tf.concat results in no change in the memory use?
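For reference, enabling memory growth is typically done like this before any GPU work (a minimal sketch; the loop over all detected GPUs is an assumption, not code from the question):
import tensorflow as tf

# Request on-demand GPU memory allocation instead of reserving a large
# block up front; this must run before the first GPU operation executes.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

fst, snd = tf.random.uniform([5, 5]), tf.random.uniform([5, 5])
print(tf.matmul(fst, snd).shape)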

Weird behavior of Intel's MKL and Python (Ubuntu 16.04, Conda 4.3.30)

I am trying to improve the performance of NumPy in Python 3.6 using Intel's MKL. With a fresh Anaconda installation I created an MKL environment using:
conda create -n idp intelpython3_core python=3
As written in this article,
it seems that MKL has internal thresholds to decide whether to use threading or not. One of these thresholds appears to be the vector size used in the calculations (kind of obvious). This threshold is set to a vector size of 8192 (at least on my machine). When vectors exceed this size, I can observe my Python scripts using 4 threads (I have 2 cores with hyper-threading) for calculations like:
import numpy as np
x = np.random.rand(8193)
y = np.sin(x)
So far everything is working as intended.
Besides the threading part, MKL "features highly optimized, threaded, and vectorized math functions that maximize performance on each processor family" (read here). Since the problems I'm usually working on do not exceed the vector-size threshold, I'm not interested in the performance increase obtained by threading, but rather in the optimized math functions of MKL. Unfortunately, it seems those are only used when the vector size is above the threshold.
I've written some sample code to measure the performance of the sine operation on vectors of different sizes:
from timeit import default_timer as timer
import mkl
import numpy as np
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
np.random.seed(0)
Nop = int(1e4)
def func(x):
    return np.sin(x)

def measure(x):
    t1 = timer()
    for i in range(0, Nop):
        func(x)
    t2 = timer()
    diff = (t2 - t1) * 1000.0
    print("vec size: %i:" % len(x), end="")
    print("\t time needed: %f ms" % diff)
x0 = np.random.rand(20000)
measure(np.array(x0[:8192]))
measure(np.array(x0[:8193]))
measure(np.array(x0[:8192]))
These lines:
import mkl
mkl.set_num_threads(1)
print("MKL threads:%i" % mkl.get_max_threads())
are just there to make sure that the increase in performance is not due to threading (I also checked the CPU usage; it is indeed using only one thread).
I get these results:
vec size: 8192: time needed: 8185.900477 ms
vec size: 8193: time needed: 436.843237 ms
vec size: 8192: time needed: 1777.306942 ms
As you can see, the 8193-vector runs roughly 20x faster than the 8192-vector. What is even more confusing is that the second run of the 8192-vector is about 4x faster than before, after doing the calculation on the bigger vector.
Now my questions:
Am I doing anything obviously wrong that I am not aware of, which leads to these results?
Can anyone reproduce these results, or is it just my installation/machine behaving like this?
Is the increase in performance really due to an optimized implementation of sine?
Is it possible to enforce always using the optimized version of sine, independent of the vector size?
PS:
I actually tried the following in the simulation I'm running for my master's thesis, which involves a lot of sine and cosine function calls. I just added this line before anything else is calculated:
np.sin(np.zeros(8193))
And now everything runs 50% faster.
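A minimal sketch of where such a warm-up could sit at the top of a script (the helper name warm_up_mkl is hypothetical, and the 8193 size comes from the measurements above and may differ on other machines):
import numpy as np

def warm_up_mkl(size=8193):
    # One throwaway sine call on a vector just above the observed threshold,
    # intended to nudge MKL onto its optimized code path for later calls.
    np.sin(np.zeros(size))

warm_up_mkl()
# rest of the simulation, with many np.sin / np.cos calls, follows here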

Time comparison for TensorFlow operation and Numpy multiplication

I've been trying to optimize my computations, and for most operations that I've tried, TensorFlow is much faster. I'm trying to do a fairly simple operation: transform a matrix (multiply each value by 1/2 and then add 1/2 to that value).
With the help of @mrry, I was able to do these operations in TensorFlow. However, to my surprise, the NumPy method was significantly faster?!
TensorFlow seems like an extremely useful tool for data scientists, and I think this could help clarify its use and advantages.
Am I not using TensorFlow data structures and operations in the most efficient way? I'm not sure how non-TensorFlow methods would be faster. I'm using a mid-2012 MacBook Air with 4 GB RAM.
trans1 is the TensorFlow version while trans2 is NumPy. DF_var is a pandas DataFrame object:
import pandas as pd
import tensorflow as tf
import numpy as np
def trans1(DF_var):
    # Total user time is 31.8532807827 seconds
    # Create placeholder
    T_feed = tf.placeholder(tf.float32, DF_var.shape)
    # Matrix transformation
    T_signed = tf.add(
        tf.constant(0.5, dtype=tf.float32),
        tf.mul(T_feed, tf.constant(0.5, dtype=tf.float32))
    )
    # Get rid of the top triangle
    T_ones = tf.constant(np.tril(np.ones(DF_var.shape)), dtype=tf.float32)
    T_tril = tf.mul(T_signed, T_ones)
    # Start graph session
    sess = tf.Session()
    DF_signed = pd.DataFrame(
        sess.run(T_tril, feed_dict={T_feed: DF_var.as_matrix()}),
        columns=DF_var.columns, index=DF_var.index
    )
    # Close graph session
    sess.close()
    return DF_signed

def trans2(DF_var):
    # Total user time is 1.71233415604 seconds
    M_computed = np.tril(np.ones(DF_var.shape)) * (0.5 + 0.5 * DF_var.as_matrix())
    DF_signed = pd.DataFrame(M_computed, columns=DF_var.columns, index=DF_var.index)
    return DF_signed
My timing method was:
import time
start_time = time.time()
#operation
print str(time.time() - start_time)
Your results are compatible with the benchmarks from another user. In his benchmark he compared NumPy, Theano and TensorFlow on
an Intel Core i5-4460 CPU with 16 GiB RAM and an Nvidia GTX 970 with 4 GiB RAM, using Theano 0.8.2, TensorFlow 0.11.0 and CUDA 8.0 on Linux Mint 18.
His results for addition, and for a few other functions he tested such as matrix multiplication, are shown as charts in the linked benchmark. His conclusions:
It is clear that the main strengths of Theano and TensorFlow are very fast dot products and matrix exponents. The dot product is approximately 8 and 7 times faster respectively with Theano/Tensorflow compared to NumPy for the largest matrices. Strangely, matrix addition is slow with the GPU libraries and NumPy is the fastest in these tests.
The minimum and mean of matrices are slow in Theano and quick in Tensorflow. It is not clear why Theano is as slow (worse than NumPy) for these operations.
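Note also that trans1 includes graph construction and tf.Session creation inside the measured time. Below is a sketch (using the same old graph-mode API as the question, with an assumed 1000x1000 input) that builds the graph and session once and times only the execution:
import time
import numpy as np
import tensorflow as tf

shape = (1000, 1000)                          # assumed example shape
T_feed = tf.placeholder(tf.float32, shape)
T_signed = 0.5 + 0.5 * T_feed                 # same transform as trans1
T_tril = T_signed * tf.constant(np.tril(np.ones(shape)), dtype=tf.float32)
sess = tf.Session()                           # created once, outside the timed region

data = np.random.rand(*shape).astype(np.float32)
start_time = time.time()
result = sess.run(T_tril, feed_dict={T_feed: data})
print(time.time() - start_time)               # measures only the execution
sess.close()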

Why does the floatX's flag impact whether GPU is used in Theano?

I am testing Theano with GPU using the script provided in the tutorial for that purpose:
# Start gpu_test.py
# From http://deeplearning.net/software/theano/tutorial/using_gpu.html#using-gpu
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
# End gpu_test.py
If I specify floatX=float32, it runs on GPU:
francky@here:/fun$ THEANO_FLAGS='mode=FAST_RUN,device=gpu2,floatX=float32' python gpu_test.py
Using gpu device 2: GeForce GTX TITAN X (CNMeM is disabled)
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(Gp
Looping 1000 times took 1.458473 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
If I do not specify floatX=float32, it runs on CPU:
francky@here:/fun$ THEANO_FLAGS='mode=FAST_RUN,device=gpu2' python gpu_test.py
Using gpu device 2: GeForce GTX TITAN X (CNMeM is disabled)
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 3.086261 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
If I specify floatX=float64, it runs on CPU:
francky@here:/fun$ THEANO_FLAGS='mode=FAST_RUN,device=gpu2,floatX=float64' python gpu_test.py
Using gpu device 2: GeForce GTX TITAN X (CNMeM is disabled)
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 3.148040 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
Why does the floatX flag impact whether GPU is used in Theano?
I use:
Theano 0.7.0 (according to pip freeze),
Python 2.7.6 64 bits (according to import platform; platform.architecture()),
Nvidia-smi 361.28 (according to nvidia-smi),
CUDA 7.5.17 (according to nvcc --version),
GeForce GTX Titan X (according to nvidia-smi),
Ubuntu 14.04.4 LTS x64 (according to lsb_release -a and uname -i).
I read the documentation on floatX but it didn't help. It simply says:
config.floatX String value: either ‘float64’ or ‘float32’
Default: ‘float64’
This sets the default dtype returned by tensor.matrix(), tensor.vector(), and similar functions. It also sets the default theano bit width for arguments passed as Python floating-point numbers.
From http://deeplearning.net/software/theano/tutorial/using_gpu.html#gpuarray-backend I read that it is possible to perform float64 calculations on the GPU, but you have to install libgpuarray from source.
I managed to install it (see this script); I used virtualenv, so you don't even need sudo.
After installation you can use the old backend with the config flag device=gpu and the new backend with device=cuda.
The new backend can perform 64-bit calculations, but for me it behaves differently: some operations stopped working. ABSOLUTELY NO WARRANTY, to the extent permitted by applicable law :)
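For example, one way to select the new backend together with float64 from Python (a sketch; the flags must be set before Theano is first imported, and the exact device name, e.g. cuda0, depends on your machine):
import os

# Must happen before `import theano`, otherwise the flags are ignored.
os.environ['THEANO_FLAGS'] = 'device=cuda0,floatX=float64'

import theano
print(theano.config.device, theano.config.floatX)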
As far as I know, it's because they haven't yet implemented float64 for GPUs.
http://deeplearning.net/software/theano/tutorial/using_gpu.html :
Only computations with float32 data-type can be accelerated. Better support for float64 is expected in upcoming hardware but float64 computations are still relatively slow (Jan 2010).
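As a small illustration of that restriction, here is a sketch based on the gpu_test.py script above (not from the original answer): casting the shared data to float32 explicitly, instead of relying on the floatX flag, should also be enough for the old backend to place the exp on the GPU.
import numpy
from theano import function, shared
import theano.tensor as T

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(10 * 30 * 768), dtype='float32'))  # explicit float32
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())  # with device=gpu this should list GpuElemwise ops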

How can I run theano on GPU

If I run the following code with python 3.5
import numpy as np
import time
import theano
A = np.random.rand(1000,10000).astype(theano.config.floatX)
B = np.random.rand(10000,1000).astype(theano.config.floatX)
np_start = time.time()
AB = A.dot(B)
np_end = time.time()
X,Y = theano.tensor.matrices('XY')
mf = theano.function([X,Y],X.dot(Y))
t_start = time.time()
tAB = mf(A,B)
t_end = time.time()
print ("NP time: %f[s], theano time: %f[s] **(times should be close when run
on CPU!)**" %(np_end-np_start, t_end-t_start))
print ("Result difference: %f" % (np.abs(AB-tAB).max(), ))
I get the output
NP time: 0.161123[s], theano time: 0.167119[s] (times should be close when run on CPU!)
Result difference: 0.000000
The message says that if the times are close, the code is running on my CPU.
How can I run this code on my GPU?
NOTE:
I have a workstation with an Nvidia Quadro K4200.
I have installed the CUDA toolkit.
I have successfully built and run the CUDA vectorAdd sample project in VS2012.
You configure Theano to use a GPU by specifying device=gpu in Theano's config. There are two principal methods for setting the config: (1) in the THEANO_FLAGS environment variable, or (2) via the .theanorc file. Both methods, and all of Theano's configuration flags, are documented.
You will know that Theano is using the GPU if, after calling import theano you see a message that looks something like this
Using gpu device 0: GeForce GT 640 (CNMeM is disabled)
The details may vary for you, but if no message appears at all then Theano is using the CPU only.
Note also that even if you see the GPU message, your particular computation graph may not run on the GPU. To see which parts of your computation are running on the GPU, print its compiled and optimized graph:
f = theano.function(...)
theano.printing.debugprint(f)
Operations that start with the prefix 'Gpu' will run on the GPU. Operations that do not have that prefix in their name will run on the CPU.
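A self-contained sketch of that check (assuming the old device=gpu backend; the exact op names differ slightly under the newer gpuarray backend):
import numpy
import theano
import theano.tensor as T

# Build and compile a tiny graph, then inspect where it will run.
x = T.vector('x')
f = theano.function([x], T.exp(x))

theano.printing.debugprint(f)  # ops prefixed with 'Gpu' will run on the GPU

data = numpy.asarray([0.0, 1.0, 2.0], dtype=theano.config.floatX)
print(f(data))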
If you are on Linux, create a .theanorc file in your home folder and add the following to set up theano to run on GPU.
[global]
device = gpu
floatx = float32
Alternatively, if you want to use the GPU programmatically:
import theano.sandbox.cuda
theano.sandbox.cuda.use("gpu0")
You should see a message like this:
Using gpu device 0: Tesla K80
Useful if the environment you are running in isn't easy to configure.
