Dask scheduler empty / graph not showing - python

I have a setup as follows:
# etl.py
from dask.distributed import Client
import dask
from tasks import task1, task2, task3
def runall(**kwargs):
print("done")
def etl():
client = Client()
tasks = {}
tasks['task1'] = dask.delayed(task)(*args)
tasks['task2'] = dask.delayed(task)(*args)
tasks['task3'] = dask.delayed(task)(*args)
out = dask.delayed(runall)(**tasks)
out.compute()
This logic was borrowed from luigi and works nicely with if statements to control what tasks to run.
However, some of the tasks load large amounts of data from SQL and cause GIL freeze warnings (At least this is my suspicion as it is hard to diagnose what line exactly causes the issue). Sometimes the graph / monitoring shown on 8787 does not show anything just scheduler empty, I suspect these are caused by the app freezing dask. What is the best way to load large amounts of data from SQL in dask. (MSSQL and oracle). At the moment this is doen with sqlalchemy with tuned settings. Would adding async and await help?
However, some of tasks are a bit slow and I'd like to use stuff like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag. The entire script is run on a single 40 core machine.
Using bag.starmap I get a graph like this:
where the upper straight lines are added/ discovered once the computation reaches that task and compute is called inside it.

There appears to be no rhyme or reason other than the machine was / is very busy and struggling to show the state updates and bokeh plots as desired.

Related

Running two Tensorflow trainings in parallel using joblib and dask

I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers.
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
To that end, I do the following:
I use joblib.delayed to spawn the two processes.
Within each process I run with joblib.parallel_backend('dask'): to execute the fit/training logic. Each training process triggers N dask workers.
The problem is that I don't know if the entire process is thread safe, are there any concurrency elements that I'm missing?
# First, submit the function twice using joblib delay
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)
# Second, submit each training process
def train(sub_task):
global client
if client is None:
print('connecting')
client = Client()
data = some_data_to_train
# Third, process the training itself with N workers
with joblib.parallel_backend('dask'):
X = data[columns]
y = data[label]
niceties = dict(verbose=False)
model = KerasClassifier(build_fn=build_layers,
loss=tf.keras.losses.MeanSquaredError(), **niceties)
model.fit(X, y, epochs=500, verbose = 0)
This is pure speculation, but one potential concurrency issue is due to if client is None: part, where two processes could race to create a Client.
If this is resolved (e.g. by explicitly creating a client in advance), then dask scheduler will rely on time of submission to prioritize task (unless priority is clearly assigned) and also the graph (DAG) structure, there are further details available in docs.
The question, as given, could easily be marked as "unclear" for SO. A couple of notes:
global client : makes the client object available outside of the fucntion. But the function is run from another process, you do not affect the other process when making the client
if client is None : this is a name error, your code doesn't actually run as written
client = Client() : you make a new cluster in each subprocess, each assuming the total resources available, oversubscribing those resources.
dask knows whether any client has been created in the current process, but that doesn't help you here
You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out its parallelism, which is what it's meant for.
--
-EDIT-
to answer the form of the question asked in a comment.
My question is whether using the same client variable in these two parallel processes creates a problem.
No, the two client variables are unrelated to one-another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make it global as this is unnecessary and makes what you are doing less clear.
--
I think I must answer the question as phrased in your comment, which I advise to add to the main question
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
You have the following options:
create a client with a specific known address within your program or beforehand, then connect to it
create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
write a scheduler information file with client.write_scheduler_file and use that
You will connect in the function with
client = Client(address)
or
client = Client(scheduler_file=the_file_you_wrote)

Dask Client change number of workers mid-session

I have a rather large dataset across different files that I read in using dask, followed by a machine learning task for which I want to use dask as parallel backend.
I've noticed that reading in the files runs much faster using a Client with a higher number of workers instead of one worker with many threads. However, their individual share of memory is then too small to handle the ML task. I would therefore like to change the number of my workers to 1, with the maximum possible number of threads assigned to that new unique worker. Is there a way to do that without completely kiling and restarting my client?
I looked into the docs but couldn't find anything of use. Also happy about a hint where to look for this kind of info next time, if not there.
This is an example of what my current code looks like:
from dask.distributed import Client
import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
client = Client(n_workers=4, threads_per_worker=2)
df = dd.read_hdf(path_to_file_dir, '/data')
feats = df['feats'].compute()
labels = df['labels'].compute()
dummy = LogisticRegression()
with parallel_backend('dask'):
dummy.fit(feats, labels) # FAILS bc of too high memory consumption
You can manually create Worker/Nanny classes if you want, or use the SpecCluster class for more fine-grained control. These are typically used by developers though, and may not be as user-friendly.

Dask task to worker appointment

I run a computation graph with long functions (several hours) and big results (several hundred of megabytes). This type of load can be atypical for dask.
I try to run this graph on 4 workers. I see depicted task to worker appointment:
In first row "green" task depends only on a "blue" one, not "violet". Why green task is not moved to other worker?
Is it possible to give some hints to a scheduler to always move a task on a free worker? Which information do you need and possible to obtain which helps to debug more?
Such appointment is non-optimal and graph computation takes more time than necessary.
A little bit information:
Computation graph composing is done using dask.delayed
Computation invocation is done using next code
to_compute = [result_of_dask_delayed_1, ... , result_of_dask_delayed_n]
running = client.persist(to_compute)
results = as_completed(
futures_of(running),
with_results=True,
raise_errors=not bool(cmdline_options.partial_fails),
)
Firstly, we should state stat scheduling tasks to workers is complicated - please read this description. Roughly, when a batch of tasks come in, they will be distributed to workers, accounting for dependencies.
Dask has no idea, before running, how long any given task will take or how big the data result it generates might be. Once a task is done, this information is available for the completed task (but not for ones still waiting to run), but dask needs to make a decision on whether to steal an allotted task to an idle worker, and copy across the data it needs to run. Please read the link to see how this is done.
Finally, you can indeed have more fine-grained control over where things are run, e.g., this section.

Slow len function on dask distributed dataframe

I have been testing how to use dask (cluster with 20 cores) and I am surprised by the speed that I get on calling a len function vs slicing through loc.
import dask.dataframe as dd
from dask.distributed import Client
client = Client('192.168.1.220:8786')
log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log,npartitions=20)
#This is the code than runs slowly
#(2.9 seconds whilst I would expect no more than a few hundred millisencods)
print(len(logd))
#Instead this code is actually running almost 20 times faster than pandas
logd.loc[:'Host'].count().compute()
Any ideas why this could be happening? It isn't important for me that len runs fast, but I feel that by not understanding this behaviour there is something I am not grasping about the library.
All of the green boxes correspond to "from_pandas" whilst in this article by Matthew Rocklin http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes the call graph looks better (len_chunk is called which is significantly faster and the calls don't seem to be locked by and wait for one worker to finish his task before starting the other)
Good question, this gets at a few points about when data is moving up to the cluster and back down to the client (your python session). Lets look at a few stages of your compuation
Load data with Pandas
This is a Pandas dataframe in your python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
Convert to a lazy Dask.dataframe
This breaks up your Pandas dataframe into twenty Pandas dataframes, however these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log,npartitions=20) # still on client
Compute len
Calling len actually causes computation here (normally you would use df.some_aggregation().compute(). So now Dask kicks in. First it moves your data out to the cluster (slow) then it calls len on all of the 20 partitions (fast), it aggregates those (fast) and then moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
Analysis
So the problem here is that our dask.dataframe still had all of its data in the local python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds
with dask.set_options(get=dask.threaded.get): # no cluster, just local threads
print(len(logd)) # stays on client
But presumably you want to know how to scale out to larger datasets, so lets do this the right way.
Load your data on the workers
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the csv file. This way no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this should return almost immediately. We can force Dask to actually do the computation with the persist method
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that this method returns immediately while work happens in the background. If you want to wait until it finishes, call wait.
from dask.distributed import wait
wait(log) # blocks until read is done
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
Regardless, operations on log should now be fast
len(log) # fast
Edit
In response to a question on this blogpost here are the assumptions that we're making about where the file lives.
Generally when you provide a filename to dd.read_csv it assumes that that file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
Simple but sub-optimal
The simple way is just to do what you did originally, but persist your dask.dataframe
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log,npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Complex but optimal
Instead, you might scatter your data out to the cluster explicitly
[future] = client.scatter([log])
This gets into more complex API though, so I'll just point you to docs
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html

Run Multiple BigQuery Jobs via Python API

I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea of running multiple jobs asynchronously is in creating/preparing as many jobs as you need and kick them off using jobs.insert API (important you should either collect all respective jobids or set you own - they just need to be unique). Those API returns immediately, so you can kick them all off "very quickly" in one loop
Meantime, you need to check repeatedly for status of those jobs (in loop) and as soon as job is done you can kick processing of result as needed
You can check for details in Running asynchronous queries
BigQuery jobs are always async by default; this being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, hence making it impossible to use with a single threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery
client: bigquery.Client = bigquery.Client()
def run(name, statement):
return name, client.query(statement).result() # blocks the thread
def run_all(statements: Dict[str, str]):
with ThreadPoolExecutor() as executor:
jobs = []
for name, statement in statements.items():
jobs.append(executor.submit(run, name, statement))
result = dict([job.result() for job in jobs])
return result
P.S.: Some credits are due to #Fredrik Håård for this answer :)

Categories

Resources