I have a rather large dataset across different files that I read in using dask, followed by a machine learning task for which I want to use dask as parallel backend.
I've noticed that reading in the files runs much faster using a Client with a higher number of workers rather than one worker with many threads. However, each worker's share of memory is then too small to handle the ML task. I would therefore like to change the number of workers to 1, with the maximum possible number of threads assigned to that single worker. Is there a way to do that without completely killing and restarting my client?
I looked into the docs but couldn't find anything of use. Also happy about a hint where to look for this kind of info next time, if not there.
This is an example of what my current code looks like:
from dask.distributed import Client
import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
client = Client(n_workers=4, threads_per_worker=2)
df = dd.read_hdf(path_to_file_dir, '/data')
feats = df['feats'].compute()
labels = df['labels'].compute()
dummy = LogisticRegression()
with parallel_backend('dask'):
    dummy.fit(feats, labels)  # FAILS because of too-high memory consumption
You can manually create Worker/Nanny classes if you want, or use the SpecCluster class for more fine-grained control. These are typically used by developers though, and may not be as user-friendly.
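For illustration only, here is a minimal sketch of the SpecCluster route, assuming a purely local setup; the spec layout follows the dask.distributed docs, but the idea of editing worker_spec at runtime should be verified against your version before relying on it:
from dask.distributed import Client, Nanny, Scheduler, SpecCluster

# Start with several small workers for the IO-heavy phase
workers = {f"w{i}": {"cls": Nanny, "options": {"nthreads": 2}} for i in range(4)}
scheduler = {"cls": Scheduler, "options": {"dashboard_address": ":8787"}}
cluster = SpecCluster(scheduler=scheduler, workers=workers)
client = Client(cluster)

# ... read the files with many workers ...

# Later, swap the spec for a single worker that gets all the threads;
# the client and scheduler stay up, only the workers are recreated
cluster.worker_spec.clear()
cluster.worker_spec["big"] = {"cls": Nanny, "options": {"nthreads": 8}}
cluster.scale(1)  # reconcile the running workers with the new spec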
Related
I'm running a Python script in a SageMaker Processing job, specifically SKLearnProcessor. The script runs a for-loop 200 times (each iteration is independent), and each iteration takes 20 minutes.
For example, script.py:
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=1,
    sagemaker_session=Session(),
)
out_path = 's3://' + os.path.join(bucket, prefix,'outpath')
sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"],
)
I want to parallelise this code and make the SageMaker Processing job use its full capacity to run as many concurrent tasks as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os

POOL_SIZE = os.cpu_count()
your_list = [...]

def run_function(i):
    # ...
    return your_result

if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution depends on the quantity and size of the data. If the inputs are completely independent of each other and of considerable size, it may make sense to split the data over several instances. This way, execution will be faster, and there may also be a cost reduction depending on the instances chosen compared with the initial, larger instance.
In your case, it is clearly the instance_count parameter that you need to set; as the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
This should be combined with splitting the ProcessingInput across instances, as sketched below.
P.S.: This approach makes sense to use if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
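To make the split concrete, here is a hedged sketch: ShardedByS3Key distributes the S3 objects under the input prefix across the instances, so each instance only sees its own share under /opt/ml/processing/input. The input prefix name is illustrative; role, bucket, prefix and out_path are assumed to be defined as in the question.
import os
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.session import Session
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=5,                                    # several instances instead of one
    sagemaker_session=Session(),
)

sklearn_processor.run(
    code="script.py",
    inputs=[
        ProcessingInput(
            source='s3://' + os.path.join(bucket, prefix, 'inpath'),  # illustrative input prefix
            destination='/opt/ml/processing/input',
            s3_data_distribution_type='ShardedByS3Key',  # each instance receives its own subset of objects
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"],
)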
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example use case could be processing a number of CSVs. If there are 100 CSVs, we may decide to instantiate 5 instances so as to pass 20 files to each instance, and within each instance decide to parallelise the reading and/or processing of the CSVs and/or rows in the relevant functions.
If you pursue such an approach, you must monitor carefully whether it really improves the system rather than wasting resources.
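As a hedged sketch of what the script side of this combined approach might look like (assuming the sharded input lands under /opt/ml/processing/input as above; all names are illustrative):
import os
from multiprocessing import Pool

INPUT_DIR = '/opt/ml/processing/input'   # this instance only sees its own shard

def run_function(path):
    # read and process a single csv; placeholder for the real logic
    ...

if __name__ == '__main__':
    files = [os.path.join(INPUT_DIR, f) for f in os.listdir(INPUT_DIR)]
    with Pool(os.cpu_count()) as pool:
        pool.map(run_function, files)    # parallelise over this instance's files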
I have the following code that runs two TensorFlow trainings in parallel using Dask workers implemented in Docker containers.
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
To that end, I do the following:
I use joblib.delayed to spawn the two processes.
Within each process I run with joblib.parallel_backend('dask'): to execute the fit/training logic. Each training process triggers N dask workers.
The problem is that I don't know if the entire process is thread-safe. Are there any concurrency elements that I'm missing?
# First, submit the function twice using joblib delay
delayed_funcs = [joblib.delayed(train)(sub_task) for sub_task in [123, 456]]
parallel_pool = joblib.Parallel(n_jobs=2)
parallel_pool(delayed_funcs)

# Second, submit each training process
def train(sub_task):
    global client
    if client is None:
        print('connecting')
        client = Client()

    data = some_data_to_train

    # Third, process the training itself with N workers
    with joblib.parallel_backend('dask'):
        X = data[columns]
        y = data[label]

        niceties = dict(verbose=False)
        model = KerasClassifier(build_fn=build_layers,
                                loss=tf.keras.losses.MeanSquaredError(), **niceties)
        model.fit(X, y, epochs=500, verbose=0)
This is pure speculation, but one potential concurrency issue is the if client is None: check, where two processes could race to create a Client.
If this is resolved (e.g. by explicitly creating a client in advance), then the dask scheduler will prioritize tasks by time of submission (unless a priority is explicitly assigned) and by the structure of the graph (DAG); further details are available in the docs.
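As a minimal illustration (train here is just a placeholder for your own function), creating the client once up front and, if needed, assigning explicit priorities could look like this:
from dask.distributed import Client

client = Client()                                # created once, in advance

# priority is optional; when the scheduler has a choice, higher runs earlier
fut_a = client.submit(train, 123, priority=10)   # train is a placeholder function
fut_b = client.submit(train, 456, priority=0)
results = client.gather([fut_a, fut_b])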
The question, as given, could easily be marked as "unclear" for SO. A couple of notes:
global client : makes the client object available outside of the function. But the function is run in another process, so making the client there does not affect the other process
if client is None : this is a NameError; your code doesn't actually run as written
client = Client() : you make a new cluster in each subprocess, each assuming the total resources available, oversubscribing those resources.
dask knows whether any client has been created in the current process, but that doesn't help you here
You must ask yourself: why are you creating processes for the two fits at all? Why not just let Dask figure out its parallelism, which is what it's meant for.
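A minimal sketch of that simpler pattern, with a single client created once and both fits going through the dask backend (note that I've swapped in a plain scikit-learn LogisticRegression and random data purely for illustration):
import joblib
import numpy as np
from dask.distributed import Client
from sklearn.linear_model import LogisticRegression

client = Client()                    # one cluster for the whole program

def train(sub_task):
    # placeholder data; in practice, load the data for this sub_task
    X = np.random.rand(1000, 10)
    y = np.random.randint(0, 2, size=1000)
    model = LogisticRegression()
    with joblib.parallel_backend('dask'):
        model.fit(X, y)
    return model

models = [train(sub_task) for sub_task in [123, 456]]   # no extra process pool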
--
-EDIT-
To answer the form of the question asked in a comment:
My question is whether using the same client variable in these two parallel processes creates a problem.
No, the two client variables are unrelated to one another. You may see a warning message about not being able to bind to a default port, which you can safely ignore. However, please don't make the client global, as this is unnecessary and makes what you are doing less clear.
--
I think I must answer the question as phrased in your comment, which I advise you to add to the main question:
I need to launch two processes, using the same dask client, where each will train their respective models with N workers.
You have the following options:
create a client with a specific known address within your program or beforehand, then connect to it
create a default client Client() and get its address (e.g., client._scheduler_identity['address']) and connect to that
write a scheduler information file with client.write_scheduler_file and use that
You will connect in the function with
client = Client(address)
or
client = Client(scheduler_file=the_file_you_wrote)
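Putting the first option together with the original joblib structure, a hedged sketch (the body of train is a placeholder) might look like:
import joblib
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()                       # one scheduler with a known address
address = cluster.scheduler_address

def train(sub_task, scheduler_address):
    client = Client(scheduler_address)         # connect; don't create a new cluster
    with joblib.parallel_backend('dask'):
        pass                                   # placeholder for the real fit logic
    client.close()

delayed_funcs = [joblib.delayed(train)(t, address) for t in [123, 456]]
joblib.Parallel(n_jobs=2)(delayed_funcs)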
I have a setup as follows:
# etl.py
from dask.distributed import Client
import dask
from tasks import task1, task2, task3
def runall(**kwargs):
    print("done")

def etl():
    client = Client()
    tasks = {}
    tasks['task1'] = dask.delayed(task1)(*args)
    tasks['task2'] = dask.delayed(task2)(*args)
    tasks['task3'] = dask.delayed(task3)(*args)
    out = dask.delayed(runall)(**tasks)
    out.compute()
This logic was borrowed from luigi and works nicely with if statements to control what tasks to run.
However, some of the tasks load large amounts of data from SQL and cause GIL-freeze warnings (at least that is my suspicion, as it is hard to diagnose exactly which line causes the issue). Sometimes the graph/monitoring shown on port 8787 does not show anything, just an empty scheduler; I suspect this is caused by the app freezing dask. What is the best way to load large amounts of data from SQL (MSSQL and Oracle) in dask? At the moment this is done with SQLAlchemy with tuned settings. Would adding async and await help?
However, some of the tasks are a bit slow, and I'd like to use things like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag? The entire script is run on a single 40-core machine.
Using bag.starmap I get a graph where the upper straight lines are added/discovered once the computation reaches that task and compute is called inside it.
There appears to be no rhyme or reason to this other than that the machine was/is very busy and struggling to show the state updates and bokeh plots as desired.
I have approximately 200 files, named part-0001, part-0002, and so on, in a single directory on a Linux machine. Each has approximately one million rows with the same columns (call them 'a', 'b', and so on). Let the pair 'a','b' be the key for each row (with many duplicates).
At the same time, I have set up a Spark 2.2.0 cluster with a master and two slaves with a total of 42 cores available. The address is spark://XXX.YYY.com:7077.
I then use PySpark to connect to the cluster and compute the counts across the 200 files for each unique pair as follows.
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
sc = SparkContext("spark://XXX.YYY.com:7077")
sqlContext = SQLContext(sc)
data_path = "/location/to/my/data/part-*"
sparkdf = sqlContext.read.csv(path=data_path, header=True)
dfgrouped = sparkdf.groupBy(['a','b'])
counts_by_group = dfgrouped.count()
This works in that I can see Spark progressing through a series of messages and it does indeed return results that look plausible.
Problem: While this calculation is being performed, top does not show any evidence that the slave cores are doing anything; there doesn't appear to be any parallelization. Each slave has a single related Java process that was there before the job (plus processes from other users and background system processes), so it appears that the master is doing all the work. Given that there are 200-odd files, I had expected to see 21 processes running on each slave machine until things wound down (this is what I see when I explicitly invoke parallelize, e.g. count = sc.parallelize(c=range(1, niters + 1), numSlices=ncores).map(f).reduce(add), in a separate implementation).
Questions: How do I ensure that Spark is actually parallelizing the count? I would like each core to grab one or more files, perform the count for the pairs it sees in the file, and then have the individual results reduced into a single DataFrame. Shouldn't I see this in top? Do I need to explicitly invoke parallelization?
(FWIW, I have seen example using partitioning, but my understanding is that this is used to distribute processing on chunks of a single file. My case is that I have many files.)
Thanks in advance.
TL;DR There is probably nothing wrong with your deployment.
I had expected to see 21 processes running
Unless you specifically configured Spark to use a single core per executor JVM, there is no reason for this to happen. Unlike the RDD example you've mentioned in the question, the DataFrame API doesn't use Python workers at all, with the exception of Python UserDefinedFunctions.
At the same time, JVM executors use threading instead of full-fledged system processes (PySpark uses the latter to avoid the GIL). Furthermore, the default spark.executor.cores in standalone mode is equal to the number of available cores on the worker. So, without additional configuration, you should see two executor JVMs, each using 21 data-processing threads.
Overall, you should check the Spark UI; if you see tasks assigned to the executors, all should be fine.
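If you nevertheless want to control core allocation explicitly (for example to reproduce the one-task-per-core picture you expected), the relevant knobs can be set through SparkConf; the values below are only illustrative:
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://XXX.YYY.com:7077")
        .setAppName("pair-counts")               # illustrative application name
        .set("spark.executor.cores", "1")        # cores per executor JVM
        .set("spark.cores.max", "42"))           # total cores for this application
sc = SparkContext(conf=conf)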
I have been testing how to use dask (a cluster with 20 cores), and I am surprised by the speed that I get when calling the len function versus slicing through loc.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client('192.168.1.220:8786')
log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log, npartitions=20)

# This is the code that runs slowly
# (2.9 seconds, whilst I would expect no more than a few hundred milliseconds)
print(len(logd))

# Instead, this code actually runs almost 20 times faster than pandas
logd.loc[:'Host'].count().compute()
Any ideas why this could be happening? It isn't important for me that len runs fast, but I feel that by not understanding this behaviour there is something I am not grasping about the library.
All of the green boxes in my task graph correspond to "from_pandas", whilst in this article by Matthew Rocklin http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes the call graph looks better (len_chunk is called, which is significantly faster, and the calls don't seem to be locked by and wait for one worker to finish its task before starting another).
Good question; this gets at a few points about when data moves up to the cluster and back down to the client (your Python session). Let's look at a few stages of your computation.
Load data with Pandas
This is a Pandas dataframe in your python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
Convert to a lazy Dask.dataframe
This breaks up your Pandas dataframe into twenty Pandas dataframes, however these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log,npartitions=20) # still on client
Compute len
Calling len actually causes computation here (normally you would use df.some_aggregation().compute()). So now Dask kicks in. First it moves your data out to the cluster (slow), then it calls len on all 20 partitions (fast), aggregates those (fast), and then moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
Analysis
So the problem here is that our dask.dataframe still had all of its data in the local python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds:
with dask.set_options(get=dask.threaded.get): # no cluster, just local threads
print(len(logd)) # stays on client
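(A side note for readers on newer Dask releases: the dask.set_options(get=...) spelling above comes from older versions; to the best of my knowledge the modern equivalent is the following.)
import dask

with dask.config.set(scheduler="threads"):  # local threads, no cluster
    print(len(logd))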
But presumably you want to know how to scale out to larger datasets, so let's do this the right way.
Load your data on the workers
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the csv file. This way no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this should return almost immediately. We can force Dask to actually do the computation with the persist method
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that this method returns immediately while work happens in the background. If you want to wait until it finishes, call wait.
from dask.distributed import wait
wait(log) # blocks until read is done
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
Regardless, operations on log should now be fast
len(log) # fast
Edit
In response to a question on this blog post, here are the assumptions we're making about where the file lives.
Generally when you provide a filename to dd.read_csv it assumes that that file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
Simple but sub-optimal
The simple way is just to do what you did originally, but persist your dask.dataframe
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log,npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Complex but optimal
Instead, you might scatter your data out to the cluster explicitly
[future] = client.scatter([log])
This gets into a more complex API though, so I'll just point you to the docs:
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html