I have some code that uses Numba cuda.jit in order for me to run on the gpu, and I would like to layer dask on top of it if possible.
Example Code
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from numba import cuda, njit
import numpy as np
from dask.distributed import Client, LocalCluster

@cuda.jit()
def addingNumbersCUDA(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]

if __name__ == "__main__":
    cluster = LocalCluster()
    client = Client(cluster)

    big_array = np.random.random_sample((100, 3000))
    big_array2 = np.random.random_sample((100, 3000))
    save_array = np.zeros(shape=(100, 3000))

    arraysize = 100
    threadsperblock = 64
    blockspergrid = (arraysize + (threadsperblock - 1)) // threadsperblock

    d_big_array = cuda.to_device(big_array)
    d_big_array2 = cuda.to_device(big_array2)
    d_save_array = cuda.to_device(save_array)

    addingNumbersCUDA[blockspergrid, threadsperblock](d_big_array, d_big_array2, d_save_array)

    save_array = d_save_array.copy_to_host()
If my function addingNumbersCUDA didn't use any CUDA I would just put client.submit in front of my function (along with gather after) and it would work. But since I'm using CUDA, putting submit in front of the function doesn't work. The dask documentation says that you can target the GPU, but it's unclear how to actually set it up in practice. How would I set up my function to use dask with the GPU targeted, and with cuda.jit if possible?
You may want to look through Dask's documentation on GPUs
But, since I'm using CUDA putting submit in front of the function doesn't work.
There is no particular reason why this should be the case. All Dask does is run your function on a different computer. It doesn't change or modify your function in any way.
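For instance, a minimal sketch of that pattern, assuming a single local worker that can see the GPU (the kernel and launch configuration are taken from the question; the run_kernel wrapper and the threaded LocalCluster are illustrative assumptions):

import numpy as np
from numba import cuda
from dask.distributed import Client, LocalCluster

@cuda.jit
def addingNumbersCUDA(big_array, big_array2, save_array):
    i = cuda.grid(1)
    if i < big_array.shape[0]:
        for j in range(big_array.shape[1]):
            save_array[i][j] = big_array[i][j] * big_array2[i][j]

def run_kernel(big_array, big_array2):
    # Everything GPU-related happens inside the function that Dask runs:
    # host-to-device copies, the kernel launch, and the copy back.
    save_array = np.zeros_like(big_array)
    threadsperblock = 64
    blockspergrid = (big_array.shape[0] + threadsperblock - 1) // threadsperblock
    d_a = cuda.to_device(big_array)
    d_b = cuda.to_device(big_array2)
    d_out = cuda.to_device(save_array)
    addingNumbersCUDA[blockspergrid, threadsperblock](d_a, d_b, d_out)
    return d_out.copy_to_host()

if __name__ == "__main__":
    # Threaded workers keep everything in one process, so the single
    # local GPU context is shared without any serialization concerns.
    cluster = LocalCluster(processes=False)
    client = Client(cluster)

    big_array = np.random.random_sample((100, 3000))
    big_array2 = np.random.random_sample((100, 3000))

    future = client.submit(run_kernel, big_array, big_array2)
    save_array = future.result()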
What are some reasons that can cause the VS Code debugger to be slow? Any speculation, or a link to a VS Code limitation, would be great.
I'm currently creating sliding windows with numpy array using numpy.lib.stride_tricks.sliding_window_view
And for some reason in a specific section of the code the debugger just fails to compute large sliding windows.
len(df) > 10*24*60
win_size = 240
sliding_win = np.lib.stride_tricks.sliding_window_view(df.values,(win_size,len(df.columns)))
What is weird is that when I run it without the debugger, it computes really fast. If I run the code below (basically the above extracted to a stub file, with an even larger dataframe and window size), the debugger can run that with no issue as well.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from time import time
...
# some random code to increase memory usage.
...
p = np.zeros((10*365*24*60,7))
t = time()
q = sliding_window_view(p,(24*60,7))
print(time()-t)
end = 1 # break point
from guppy import hpy
h = hpy()
print(h.heap())
When checked using guppy, the original code used about 100 MB and the stub file used more, so it doesn't seem to be a memory issue.
I am trying to solve a linear system using numba with GPU processing using CUDA.
I have installed all the relevant packages and tested it so it seems that my GPU and CUDA etc is set up properly.
My code is:
import numpy as np
import time
from numba import vectorize, cuda

@vectorize(['float64(float64, float64)'], target='cuda')
def solver(A, b):
    return np.linalg.solve(A, b)

def main():
    A = np.random.rand(100, 100).astype(np.float64)
    b = np.random.rand(100, 1).astype(np.float64)

    start = time.time()
    C = solver(A, b)
    vector_add_time = time.time() - start

    print("Took " + str(vector_add_time) + " seconds to solve")

if __name__ == '__main__':
    main()
Commenting out the @vectorize... line, the code runs fine. However, when I try to run it with Numba and CUDA, I get a long list of errors, of which I think the most relevant one is:
raise TypingError(msg)
numba.errors.TypingError: Failed at nopython (nopython frontend)
np.linalg.solve() only supported for array types
I assume the problem is that numpy.linalg.solve does not accept the data types required by cuda.
Am I correct in assuming this? Are there other data types that will work?
In this example problem, the same data type is passed to the function, so I think the problem lies with numpy.linalg.
Am I correct in assuming this?
No
Are there other data types that will work?
No
The problem here is that you cannot use numpy.linalg in code which is targeted to run on the numba GPU backend.
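If the goal is simply to get the solve onto the GPU, one alternative outside Numba is a library that ships GPU linear-algebra routines. A minimal sketch, assuming CuPy is installed (CuPy is not part of the original answer, just one possible route):

import numpy as np
import cupy as cp

A = np.random.rand(100, 100).astype(np.float64)
b = np.random.rand(100, 1).astype(np.float64)

# Copy to the GPU, solve there, then copy the result back to the host.
d_A = cp.asarray(A)
d_b = cp.asarray(b)
d_x = cp.linalg.solve(d_A, d_b)
x = cp.asnumpy(d_x)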
I wrote a very simple program that runs just fine without distribution but hangs on CheckpointSaverHook in distributed mode (everything on my localhost though!). I've seen there's been a few questions about hanging in distributed mode, but none seem to match my question.
Here's the script (made to toy with the new layers API):
import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn import learn_runner
from tensorflow.contrib import layers

DATA_SIZE=10
DIMENSION=5
FEATURES='features'

def generate_input_fn():
    def _input_fn():
        mid = int(DATA_SIZE/2)
        data = np.array([np.ones(DIMENSION) if x < mid else -np.ones(DIMENSION) for x in range(DATA_SIZE)])
        labels = ['0' if x < mid else '1' for x in range(DATA_SIZE)]
        table = tf.contrib.lookup.string_to_index_table_from_tensor(tf.constant(['0', '1']))
        label_tensor = table.lookup(tf.convert_to_tensor(labels, dtype=tf.string))
        return dict(zip([FEATURES], [tf.convert_to_tensor(data, dtype=tf.float32)])), label_tensor
    return _input_fn

def build_estimator(model_dir):
    features = layers.real_valued_column(FEATURES, dimension=DIMENSION)
    return tf.contrib.learn.DNNLinearCombinedClassifier(
        model_dir=model_dir,
        dnn_feature_columns=[features],
        dnn_hidden_units=[20, 20])

def generate_exp_fun():
    def _exp_fun(output_dir):
        return tf.contrib.learn.Experiment(
            build_estimator(output_dir),
            train_input_fn=generate_input_fn(),
            eval_input_fn=generate_input_fn(),
            train_steps=100
        )
    return _exp_fun

if __name__ == '__main__':
    tf.logging.set_verbosity(tf.logging.DEBUG)
    learn_runner.run(generate_exp_fun(), 'job_dir')
To test distributed mode, I simply launch it with the environment variable TF_CONFIG={"cluster": {"ps":["localhost:5040"], "worker":["localhost:5041"]}, "task":{"type":"worker","index":0}, "environment": "local"} (this is for the worker; the same, with type ps, is used to launch the parameter server).
I use tensorflow-1.0.1 (but had the same behavior with 1.0.0) on windows-64, CPU only. I actually never get any error, it just hangs after INFO:tensorflow:Create CheckpointSaverHook. forever... I've tried to attach the Visual Studio C++ debugger to the process, but with little success so far, so I can't print a stack for what's happening in the native part.
P.S.: it's not a problem with DNNLinearCombinedClassifier, because it fails as well with a simple tf.contrib.learn.LinearClassifier. And, as noted in the comments, it's not due to both processes running on localhost, since it fails also when running on separate VMs.
EDIT: I think there's actually an issue with server launching. It looks like the server is not launched when you're in local mode (no matter if distributed or not), cf. tensorflow/contrib/learn/python/learn/experiment.py l.250-258:
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
    config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()
This will prevent the server from being started in local mode for the workers... Does anyone have an idea whether this is a bug or whether there's something I'm missing?
So this has been answered in: https://github.com/tensorflow/tensorflow/issues/8796. Finally, one should use CLOUD for any distributed operation.
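Concretely, that means setting the environment field of TF_CONFIG to "cloud" instead of "local" for each process. A minimal sketch of the worker's configuration (same cluster layout as in the question):

import json
import os

# Same cluster/task layout as in the question, but with environment set
# to "cloud" so that learn_runner actually starts the in-process server.
os.environ['TF_CONFIG'] = json.dumps({
    "cluster": {"ps": ["localhost:5040"], "worker": ["localhost:5041"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
})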
So this sounds a bit convoluted, but I've had a problem with joblib lately where it will create a bunch of processes and then just hang there (i.e., each process takes up memory but uses no CPU time).
Here is the simplest code I've got that will reproduce the problem:
from sklearn import linear_model
import numpy as np
from sklearn import cross_validation as cval
from joblib import Parallel, delayed

def fit_hanging_model(n=10000, nx=10, ny=32, ndelay=10,
                      n_cvs=5, n_jobs=None):
    # Create data
    X = np.random.randn(n, ny*ndelay)
    y = np.random.randn(n, nx)

    # Create model + CV
    model = linear_model.Ridge(alpha=1000.)
    cvest = cval.KFold(n, n_folds=n_cvs, shuffle=True)

    # Fit model
    par = Parallel(n_jobs=n_jobs, verbose=10)
    parfunc = delayed(_fit_model_cvs)
    par(parfunc(X, y, train, test, model)
        for i, (train, test) in enumerate(cvest))

def _fit_model_cvs(X, Y, train, test, model):
    model.fit(X, Y)

n = 10
a = np.random.randn(n, 32)
b = np.random.randn(32, 10)

##########
c = np.dot(a, b)
##########

fit_hanging_model(n_jobs=3)
Here is what happens:
If I run all of the code above, then it spawns off three processes and hangs
If I run all of the code above, but use n_jobs=1, then it works fine
If I run all of the code above a second time, after running it once with n_jobs=1, then it works fine no matter how many jobs I use.
If I run all of the code above EXCEPT for the code between the ######, then it runs fine.
However, if I then run the code between the ######, and try to run fit_hanging_model with n_jobs > 1, then it hangs
This is with joblib = 0.8.0, and sklearn 0.15-git.
Note, this bug is on CentOS on linux. I have not been able to reproduce this bug on another machine, so it may be hard to reproduce.
Does anyone have any idea why this might be going on? It seems like that dot product is doing something strange, but I have no idea what it could be...I'm at the end of my rope...
Figured it out. Apparently this was an issue with Joblib creating multiple python processes, while MKL was simultaneously trying to do threading. See the issue and the solution (which involves setting environment variables) here:
https://github.com/joblib/joblib/issues/138
In my experience, joblib hangs when the function it calls uses another level of parallelization.
Using the solution in the answer by @choldgraf, you need to disable MKL's inner level of parallelization:
import os
os.environ['MKL_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['MKL_DYNAMIC'] = 'FALSE'
This applies to other parallel computing libraries such as OpenMP.
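An alternative to setting environment variables, assuming the threadpoolctl package is available (it is not part of the original answer), is to cap the BLAS/OpenMP thread pools programmatically around the joblib call; how well the cap carries into the worker processes depends on the joblib backend, so treat this as a sketch:

import numpy as np
from joblib import Parallel, delayed
from threadpoolctl import threadpool_limits

def work(i):
    # A small BLAS-heavy task, standing in for the model fitting above.
    a = np.random.randn(200, 200)
    return np.dot(a, a).sum()

# Cap MKL/OpenBLAS/OpenMP at one thread while joblib fans the work out
# across processes, so the two levels of parallelism don't collide.
with threadpool_limits(limits=1):
    results = Parallel(n_jobs=3)(delayed(work)(i) for i in range(6))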
I am running into a bizarre problem that I can't explain. I'm hoping someone out there can help please!
I'm running Python 2.7.3 and Scipy v0.14.0 and am trying to implement some very simple multiprocessor algorithms to speed up my code using the module multiprocessing. I've managed to make a basic example work:
import multiprocessing
import numpy as np
import time
# import scipy.special

def compute_something(t):
    a = 0.
    for i in range(100000):
        a = np.sqrt(t)
    return a

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)

    inputs = range(10)

    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic

    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool    :', time.time() - tic
This runs fine, returning
Pool size: 8
Built-in: 1.56904006004
Pool : 0.447728157043
But if I uncomment the line import scipy.special, I get:
Pool size: 8
Built-in: 1.58968091011
Pool : 1.59387993813
and I can see that only one core is doing the work on my system. In fact, importing any module from the scipy package seems to have this effect (I've tried several).
Any ideas? I've never seen a case like this before, where an apparently innocuous import can have such a strange and unexpected effect.
Thanks!
Update (1)
Moving the scipy import line into the function compute_something partially mitigates the problem:
Pool size: 8
Built-in: 1.66807389259
Pool : 0.596321105957
Update (2)
Thanks to @larsmans for testing on a different system. The problem was not reproduced with Scipy v0.12.0. I'm moving this query to the scipy mailing list and will post any answers.
After much digging around and posting an issue on the Scipy GitHub site, I've found a solution.
Before I start, this is documented very well here - I'll just give an overview.
This problem is not related to the version of Scipy, or Numpy that I was using. It originates in the system BLAS libraries that Numpy and Scipy use for various linear algebra routines. You can tell which libraries Numpy is linked to by running
python -c 'import numpy; numpy.show_config()'
If you are using OpenBLAS in Linux, you may find that the CPU affinity is set to 1, meaning that once these algorithms are imported in Python (via Numpy/Scipy), you can access at most one core of the CPU. To test this, within a Python terminal run
import os
os.system('taskset -p %s' %os.getpid())
If the CPU affinity is returned as f, or ff, you can access multiple cores. In my case it would start like that, but upon importing numpy or scipy.any_module, it would switch to 1, hence my problem.
I've found two solutions:
Change CPU affinity
You can manually set the CPU affinity of the master process at the top of the main function so that the code looks like this:
import multiprocessing
import numpy as np
import math
import time
import os

def compute_something(t):
    a = 0.
    for i in range(10000000):
        a = math.sqrt(t)
    return a

if __name__ == '__main__':
    pool_size = multiprocessing.cpu_count()
    os.system('taskset -cp 0-%d %s' % (pool_size, os.getpid()))

    print "Pool size:", pool_size
    pool = multiprocessing.Pool(processes=pool_size)

    inputs = range(10)

    tic = time.time()
    builtin_outputs = map(compute_something, inputs)
    print 'Built-in:', time.time() - tic

    tic = time.time()
    pool_outputs = pool.map(compute_something, inputs)
    print 'Pool    :', time.time() - tic
Note that selecting a value higher than the number of cores for taskset doesn't seem to matter - it just uses the maximum possible number.
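On Python 3.3+ (Linux only), the same reset can be done without shelling out to taskset, via os.sched_setaffinity; a minimal sketch:

import os
import multiprocessing

# Re-attach the current process to every CPU after the BLAS import
# has narrowed the affinity mask down to a single core.
os.sched_setaffinity(0, range(multiprocessing.cpu_count()))
print(os.sched_getaffinity(0))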
Switch BLAS libraries
Solution documented at the site linked above. Basically: install libatlas and run update-alternatives to point numpy to ATLAS rather than OpenBLAS.