Slow len function on dask distributed dataframe - python

I have been testing how to use Dask (a cluster with 20 cores) and I am surprised by the speed I get when calling the len function compared with slicing through loc.
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
client = Client('192.168.1.220:8786')
log = pd.read_csv('800000test', sep='\t')
logd = dd.from_pandas(log, npartitions=20)
# This is the code that runs slowly
# (2.9 seconds, whilst I would expect no more than a few hundred milliseconds)
print(len(logd))
# Instead, this code actually runs almost 20 times faster than pandas
logd.loc[:'Host'].count().compute()
Any ideas why this could be happening? It isn't important for me that len runs fast, but I feel that by not understanding this behaviour I am missing something about the library.
All of the green boxes correspond to "from_pandas", whilst in this article by Matthew Rocklin http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes the call graph looks better (len_chunk is called, which is significantly faster, and the calls don't seem to be locked by waiting for one worker to finish its task before starting the other).

Good question, this gets at a few points about when data is moving up to the cluster and back down to the client (your Python session). Let's look at a few stages of your computation.
Load data with Pandas
This is a Pandas dataframe in your python session, so it's obviously still in your local process.
log = pd.read_csv('800000test', sep='\t') # on client
Convert to a lazy Dask.dataframe
This breaks up your Pandas dataframe into twenty Pandas dataframes; however, these are still on the client. Dask dataframes don't eagerly send data up to the cluster.
logd = dd.from_pandas(log,npartitions=20) # still on client
Compute len
Calling len actually causes computation here (normally you would use df.some_aggregation().compute()). So now Dask kicks in. First it moves your data out to the cluster (slow), then it calls len on all of the 20 partitions (fast), it aggregates those (fast), and then it moves the result down to your client so that it can print.
print(len(logd)) # costly roundtrip client -> cluster -> client
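For intuition, here is a roughly equivalent explicit spelling of that aggregation; this is only a sketch of the structure, not Dask's exact internal graph:
# One small task per partition, a sum to combine them, one result back to the client.
# This approximates what len(logd) does; it is not the exact graph Dask builds.
n_rows = logd.map_partitions(len).sum().compute()
print(n_rows)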
Analysis
So the problem here is that our dask.dataframe still had all of its data in the local python session.
It would have been much faster to use, say, the local threaded scheduler rather than the distributed scheduler. This should compute in milliseconds:
import dask
with dask.set_options(get=dask.threaded.get):  # no cluster, just local threads
    print(len(logd))                           # stays on client
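For reference, newer Dask releases spell the same idea through dask.config.set rather than dask.set_options; a minimal sketch, assuming a recent version:
import dask
with dask.config.set(scheduler='threads'):  # same idea on current Dask versions
    print(len(logd))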
But presumably you want to know how to scale out to larger datasets, so let's do this the right way.
Load your data on the workers
Instead of loading with Pandas on your client/local session, let the Dask workers load bits of the csv file. This way no client-worker communication is necessary.
# log = pd.read_csv('800000test', sep='\t') # on client
log = dd.read_csv('800000test', sep='\t') # on cluster workers
However, unlike pd.read_csv, dd.read_csv is lazy, so this should return almost immediately. We can force Dask to actually do the computation with the persist method
log = client.persist(log) # triggers computation asynchronously
Now the cluster kicks into action and loads your data directly in the workers. This is relatively fast. Note that this method returns immediately while work happens in the background. If you want to wait until it finishes, call wait.
from dask.distributed import wait
wait(log) # blocks until read is done
If you're testing with a small dataset and want to get more partitions, try changing the blocksize.
log = dd.read_csv(..., blocksize=1000000) # 1 MB blocks
Regardless, operations on log should now be fast
len(log) # fast
Edit
In response to a question on this blogpost, here are the assumptions that we're making about where the file lives.
Generally when you provide a filename to dd.read_csv it assumes that the file is visible from all of the workers. This is true if you are using a network file system, or a global store like S3 or HDFS. If you are using a network file system then you will want to either use absolute paths (like /path/to/myfile.*.csv) or else ensure that your workers and client have the same working directory.
If this is not the case, and your data is only on your client machine, then you will have to load and scatter it out.
Simple but sub-optimal
The simple way is just to do what you did originally, but persist your dask.dataframe
log = pd.read_csv('800000test', sep='\t') # on client
logd = dd.from_pandas(log, npartitions=20) # still on client
logd = client.persist(logd) # moves to workers
This is fine, but results in slightly less-than-ideal communication.
Complex but optimal
Instead, you might scatter your data out to the cluster explicitly
[future] = client.scatter([log])
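If you then want a dask.dataframe wrapped around that scattered data, one possibility is dd.from_delayed; treating the fact that it accepts distributed futures as an assumption, a sketch might look like:
import dask.dataframe as dd
logd = dd.from_delayed([future], meta=log)  # one-partition dask.dataframe around the scattered future
logd = logd.repartition(npartitions=20)     # optionally split it up on the workers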
This gets into a more complex API though, so I'll just point you to the docs:
http://distributed.readthedocs.io/en/latest/manage-computation.html
http://distributed.readthedocs.io/en/latest/memory.html
http://dask.pydata.org/en/latest/delayed-collections.html

Related

How to parallelize a for-loop in a SageMaker Processing job

I'm running Python code in a SageMaker Processing job, specifically an SKLearnProcessor. The code runs a for-loop for 200 iterations (each iteration is independent), and each iteration takes 20 minutes.
For example, script.py:
for i in list:
    run_function(i)
I'm kicking off the job from a notebook:
sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=1,
    sagemaker_session=Session(),
)
out_path = 's3://' + os.path.join(bucket, prefix, 'outpath')
sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source='/opt/ml/processing/output',
                         destination=out_path),
    ],
    arguments=["--some-args", "args"]
)
I want to parallelise this code and make the SageMaker Processing job use its full capacity to run as many concurrent jobs as possible.
How can I do that?
There are basically 3 paths you can take, depending on the context.
Parallelising function execution
This solution has nothing to do with SageMaker. It is applicable to any Python script, regardless of the ecosystem, as long as you have the necessary resources to parallelise a task.
Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python
Here is a simple example on how to parallelise:
from multiprocessing import Pool
import os
POOL_SIZE = os.cpu_count()
your_list = [...]
def run_function(i):
    # ...
    return your_result
if __name__ == '__main__':
    with Pool(POOL_SIZE) as pool:
        print(pool.map(run_function, your_list))
Splitting input data into multiple instances
This solution depends on the quantity and size of the data. If the items are completely independent of each other and of considerable size, it may make sense to split the data over several instances. This way, execution will be faster, and there may also be a reduction in costs depending on the instances chosen compared with the initial larger instance.
In your case, it is clearly the instance_count parameter that needs to be set, as the documentation says:
instance_count (int or PipelineVariable) - The number of instances to
run the Processing job with. Defaults to 1.
This should be combined with sharding the input via ProcessingInput, as sketched below.
P.S.: This approach makes sense if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it is multi-instance.
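A minimal sketch of the sharding idea, assuming the input files already sit under an S3 prefix (the bucket and prefix names here are placeholders):
from sagemaker.processing import ProcessingInput
sklearn_processor.run(
    code="script.py",
    inputs=[
        ProcessingInput(
            source="s3://your-bucket/your-prefix/input/",   # placeholder S3 prefix holding many files
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",     # each instance receives a subset of the objects
        ),
    ],
    # outputs and arguments as before, omitted here
)
With instance_count greater than 1 and ShardedByS3Key, each instance then sees only its share of the objects under /opt/ml/processing/input.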
Combined approach
One can undoubtedly combine the two previous approaches, i.e. create a script that parallelises the execution of a function on a list and have several parallel instances.
An example of use could be processing a number of CSVs. If there are 100 CSVs, we may decide to instantiate 5 instances so as to pass 20 files per instance, and within each instance parallelise the reading and/or processing of the CSVs and/or rows in the relevant functions.
To pursue such an approach, one must monitor carefully whether it really brings improvement to the system rather than wasting resources; a sketch of what such a per-instance script could look like follows.
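A hedged sketch of script.py for the combined approach, assuming the input was sharded with ShardedByS3Key as above; the paths and the run_function body are placeholders:
import os
from multiprocessing import Pool
INPUT_DIR = "/opt/ml/processing/input"   # each instance sees only its shard here
def run_function(csv_path):
    # placeholder: read and process one csv
    ...
if __name__ == "__main__":
    files = [os.path.join(INPUT_DIR, f) for f in os.listdir(INPUT_DIR)]
    with Pool(os.cpu_count()) as pool:
        pool.map(run_function, files)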

Dask scheduler empty / graph not showing

I have a setup as follows:
# etl.py
from dask.distributed import Client
import dask
from tasks import task1, task2, task3
def runall(**kwargs):
    print("done")
def etl():
    client = Client()
    tasks = {}
    tasks['task1'] = dask.delayed(task1)(*args)
    tasks['task2'] = dask.delayed(task2)(*args)
    tasks['task3'] = dask.delayed(task3)(*args)
    out = dask.delayed(runall)(**tasks)
    out.compute()
This logic was borrowed from luigi and works nicely with if statements to control what tasks to run.
However, some of the tasks load large amounts of data from SQL and cause GIL freeze warnings (at least this is my suspicion, as it is hard to diagnose which line exactly causes the issue). Sometimes the graph / monitoring dashboard on port 8787 does not show anything, just "scheduler empty"; I suspect these are caused by the app freezing Dask. What is the best way to load large amounts of data from SQL in Dask (MSSQL and Oracle)? At the moment this is done with SQLAlchemy with tuned settings. Would adding async and await help?
However, some of the tasks are a bit slow and I'd like to use things like dask.dataframe or bag internally. The docs advise against calling delayed inside delayed. Does this also hold for dataframe and bag? The entire script is run on a single 40-core machine.
Using bag.starmap I get a graph like this:
where the upper straight lines are added/discovered once the computation reaches that task and compute is called inside it.
There appears to be no rhyme or reason other than that the machine was / is very busy and struggling to show the state updates and bokeh plots as desired.

Dask Client change number of workers mid-session

I have a rather large dataset across different files that I read in using dask, followed by a machine learning task for which I want to use dask as parallel backend.
I've noticed that reading in the files runs much faster using a Client with a higher number of workers instead of one worker with many threads. However, each worker's individual share of memory is then too small to handle the ML task. I would therefore like to change the number of my workers to 1, with the maximum possible number of threads assigned to that new unique worker. Is there a way to do that without completely killing and restarting my client?
I looked into the docs but couldn't find anything of use. I'd also be happy about a hint on where to look for this kind of info next time, if not there.
This is an example of what my current code looks like:
from dask.distributed import Client
import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
client = Client(n_workers=4, threads_per_worker=2)
df = dd.read_hdf(path_to_file_dir, '/data')
feats = df['feats'].compute()
labels = df['labels'].compute()
dummy = LogisticRegression()
with parallel_backend('dask'):
    dummy.fit(feats, labels)  # FAILS bc of too high memory consumption
You can manually create Worker/Nanny classes if you want, or use the SpecCluster class for more fine-grained control. These are typically used by developers though, and may not be as user-friendly.
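As a hedged sketch of the SpecCluster route (the worker name, thread count, and memory limit below are placeholder choices, not recommendations):
from dask.distributed import Client, SpecCluster, Scheduler, Nanny
scheduler = {"cls": Scheduler, "options": {"dashboard_address": ":8787"}}
workers = {
    "big-worker": {"cls": Nanny, "options": {"nthreads": 8, "memory_limit": "16GB"}},  # one wide worker
}
cluster = SpecCluster(scheduler=scheduler, workers=workers)
client = Client(cluster)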

Can I process 100 GB of data using Apache Spark on my local machine?

I have around 100 GB of user data and want to process it using Apache Spark on my laptop. I have installed Hadoop and Spark, and for the test I uploaded a file of around 9 GB to HDFS and accessed & queried it using PySpark.
The test file has 113959238 records/rows in total. When I queried the data for a particular user, i.e.
select * from table where userid=????
it took around 6 minutes to retrieve that user's records, and if I run it on the entire file then it will take a lot of time.
The analysis that I want to make on that data is to extract the records of a user, run some operations on them, then process the data of the second user, and so on for all the users in the file. The data of the queried user will not be much, so it can be loaded in memory and operations can be performed faster. But querying a user's records from that big file takes time and will slow the process.
It is said that Spark is lightning fast, so surely I must be missing something, which is why it is taking that time. One thing that I noted while performing queries was that Spark was not utilizing the full RAM but almost 100% of the CPU.
My machine specs are:
I also queried the data directly of the text file using Spark instead of HDFS file but there wasn't much difference in time.
The Python code that I wrote is:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext, SQLContext
import time
conf = SparkConf()
conf.set("spark.executor.memory", "8g")
conf.set("spark.driver.memory", "8g")
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
sc = sparkSession.sparkContext.getOrCreate(conf)
sqlContext = SQLContext(sc)
#df_load = sparkSession.read.format("csv").option("header","true").load("hdfs://0.0.0.0:19000/test.txt")
df_load = sparkSession.read.format("csv").option("header", "true").load("C:/Data/test_file/test.txt")
table = df_load.registerTempTable('test')
sp_tstart = time.time()
df = sqlContext.sql("select * from test where user_id='12345'")
db = df.rdd.collect()
sp_tend = time.time()
t_time = sp_tend - sp_tstart
df.show()
print(t_time/60)
Given my machine specs, is Spark taking a normal amount of time, or do I need to configure something? Do I need to upgrade the specs, or are they enough for this data?
One of the things to understand with Spark, Hadoop and other Big Data providers is that they aren't aiming to get the maximum possible throughput from an individual machine. They're aiming to let you split the processing efficiently across multiple machines. They sacrifice a certain amount of individual machine throughput to provide horizontal scalability.
While you can run Spark on just a single machine, the main reasons to do so are to learn Spark or to write test code to then deploy to run against a cluster with more data.
As others have noted, if you just want to process data on a single machine, then there are libraries which are going to be more efficient in that scenario. 100GB is not a huge amount to process on a single machine.
From the sound of things, you'd be better off importing that data into a database and adding suitable indexing. One other thing to understand is that a lot of the benefit of Big Data systems lies in supporting analysis and processing of most or all of the data. Traditional database systems like Postgres or SQL Server can work well handling terabytes of data when you're mainly querying for small amounts of data using indices.
If your objective is to analyze 100 GB of data using Python and there is no requirement for Spark, you can also take a look at Dask. https://dask.org/ It should be easier to set up and use with Python.
For example dask dataframe: https://docs.dask.org/en/latest/dataframe.html
>>> import dask.dataframe as dd
>>> df = dd.read_csv('2014-*.csv')
>>> df.head()
x y
0 1 a
1 2 b
2 3 c
3 4 a
4 5 b
5 6 c
>>> df2 = df[df.y == 'a'].x + 1
You don't need Hadoop to process the file locally.
The advantages of Hadoop only apply when you use more than one machine as your file will be chunked and distributed to many processes at once.
Similarly, 100 GB of plaintext isn't really "big data"; it still fits on a single machine, and if stored in a better format like ORC or Parquet it would be significantly smaller.
Also, to get faster times, don't use collect(); it pulls every matching row back to the driver (see the sketch below).
If you simply want to look up data by ID, use a key-value database like Redis or Accumulo, not Hadoop/Spark.
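A hedged sketch of what "avoid collect()" can look like here; the column name user_id and the output path are assumed to match the question's data:
df = sparkSession.read.csv("C:/Data/test_file/test.txt", header=True)
subset = df.where(df.user_id == "12345")   # lazy; nothing is moved to the driver yet
subset.show(20)                            # brings only a handful of rows back for inspection
subset.write.mode("overwrite").parquet("C:/Data/out/user_12345")  # work stays distributed on the executors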
The type of job you have described is a highly CPU-intensive process, which unfortunately is only going to be sped up significantly by running many parallel queries on partitions of the data set. Compound that with not having enough system memory to hold the entire dataset, and you are also limited by significant reads/writes on the hard drive.
This is the type of task where Spark really shines. The reason you aren't experiencing any improvement in performance is that with a single system you're missing the benefit of Spark entirely, which is the ability to split the data set into many partitions and distribute it across many machines that can work on many different user IDs at the same time.
Each worker node in your cluster will have a smaller data set to look at, which means on each node the entire data set it is looking at can easily be stored in memory. Each find and replace function (one per user ID) can be sent to a single CPU core, which means if you have 5 workers with 16 cores each, you can process 80 IDs at a time, from memory, on an optimized partition size.
Google Cloud Dataproc and Azure Databricks are super platforms for doing this. Simply choose the number of workers you need and the CPU/memory of each node, and fire up the cluster. Connect to your data and kick off your PySpark code. It can process this data so quickly that even though you are paying by the minute for the cluster, it will end up being very cheap ($10-$20 perhaps).

Parallelization in counting Spark dataframe groups in pyspark

I have approximately 200 files in a single directory on a Linux machine named part-0001, part-0002, and so on. Each has approximately one million rows with the same columns (call them 'a', 'b', and so on). Let the pair 'a','b' be the key for each row (with many duplicates).
At the same time, I have set up a Spark 2.2.0 cluster with a master and two slaves with a total of 42 cores available. The address is spark://XXX.YYY.com:7077.
I then use PySpark to connect to the cluster and compute the counts across the 200 files for each unique pair as follows.
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
sc = SparkContext("spark://XXX.YYY.com:7077")
sqlContext = SQLContext(sc)
data_path = "/location/to/my/data/part-*"
sparkdf = sqlContext.read.csv(path=data_path, header=True)
dfgrouped = sparkdf.groupBy(['a','b'])
counts_by_group = dfgrouped.count()
This works in that I can see Spark progressing through a series of messages and it does indeed return results that look plausible.
Problem: While this calculation is being performed, top does not show any evidence that the slave cores are doing anything. There doesn't appear to be any parallelization. Each slave has a single related Java process that was there before the job (plus processes from other users and background system processes). So it appears that the master is doing all the work. Given that there are 200-odd files, I had expected to see 21 processes running on each slave machine until things wound down (this is what I see when I explicitly invoke parallelize, as in count = sc.parallelize(c=range(1, niters + 1), numSlices=ncores).map(f).reduce(add), in a separate implementation).
Questions: How do I ensure that Spark is actually parallelizing the count? I would like each core to grab one or more files, perform the count for the pairs it sees in the file, and then have the individual results reduced into a single DataFrame. Shouldn't I see this in top? Do I need to explicitly invoke parallelization?
(FWIW, I have seen example using partitioning, but my understanding is that this is used to distribute processing on chunks of a single file. My case is that I have many files.)
Thanks in advance.
TL;DR There is probably nothing wrong with your deployment.
I had expected to see 21 processes running
Unless you specifically configured Spark to use a single core per executor JVM, there is no reason for this to happen. Unlike the RDD example you've mentioned in the question, the DataFrame API doesn't use Python workers at all, with the exception of Python UserDefinedFunctions.
At the same time, JVM executors use threading instead of full-fledged system processes (PySpark uses the latter to avoid the GIL). Furthermore, the default spark.executor.cores in standalone mode is equal to the number of available cores on the worker. So without additional configuration you should see two executor JVMs, each using 21 data-processing threads.
Overall, you should check the Spark UI; if you see tasks assigned to the executors, all should be fine.
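A few hedged checks you could run from the same PySpark session (these reuse sparkdf and counts_by_group from the question; the output path is a placeholder):
print(sparkdf.rdd.getNumPartitions())   # how many tasks the file scan will produce
counts_by_group.explain()               # expect a partial aggregation per partition, then a shuffle and final aggregation
counts_by_group.write.mode("overwrite").csv("/location/to/output")  # triggers the distributed job; watch the Spark UI while it runs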
