Deployment info: "pyspark --master yarn-client --num-executors 16 --driver-memory 16g --executor-memory 2g "
I am turning a 100,000-line text file (stored in HDFS) into an RDD with corpus = sc.textFile("my_file_name"). When I execute corpus.count() I do get 100000. I realize that all these steps are performed on the master node.
Now, my question is: when I apply an operation like new_corpus = corpus.map(some_function), will the job be automatically distributed by PySpark among all available slaves (16 in my case), or do I have to specify something?
Notes:
I don't think anything actually gets distributed (or at least not across the 16 nodes), because when I do new_corpus.count(), what prints out is [Stage some_number:> (0+2)/2], not [Stage some_number:> (0+16)/16].
I don't think that doing corpus = sc.textFile("my_file_name", 16) is the solution for me, because the function I want to apply works at the line level and therefore should be applied 100,000 times (the goal of parallelization is to speed this up, e.g. each slave taking 100000/16 lines). It should not be applied 16 times to 16 subsets of the original text file.
Your observations are not really correct. Stages are not executors. In Spark there are jobs, stages and tasks: a job is kicked off by the driver, a stage is a collection of tasks that share the same shuffle dependencies, and the tasks are what get assigned to the different worker nodes. In your case shuffling happens only once.
To check whether you really have 16 executors, look at the resource manager. Since you are using YARN, the ResourceManager UI is usually on port 8088; the Spark application UI (which has an Executors tab) is on port 4040 on the driver.
Also, rdd.map() parallelizes according to the number of partitions (which you can set with sc.textFile("my_file_name", numPartitions)), not according to the number of executors.
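To make that concrete, here is a minimal, hedged sketch (reusing the file name and some_function from the question) of checking and raising the partition count so the map runs as 16 parallel tasks:

corpus = sc.textFile("my_file_name")
print(corpus.getNumPartitions())   # how many partitions (and hence tasks) map() will use

# Either ask for more partitions when reading...
corpus = sc.textFile("my_file_name", minPartitions=16)
# ...or repartition an existing RDD before mapping
new_corpus = corpus.repartition(16).map(some_function)
print(new_corpus.count())          # triggers the job; tasks now run per partition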
Here is an overview again:
https://spark.apache.org/docs/1.6.0/cluster-overview.html
First off, I saw yarn-client and a chill ran down my spine.
Is there a reason why you want the node where you submit your job to be running the driver? Why not let Yarn do its thing?
But about your question:
I realize that all these steps are performed on the master node.
No, they are not. You might be misled by the fact that you are running your driver on the node you are connected to (see my spine chill ;) ).
You tell YARN to start up 16 executors for you, and YARN will do so.
It will try to take rack and data locality into account as best it can while doing so. The executors will run in parallel.
YARN is a resource manager; it manages the resources so you don't have to. All you have to specify with Spark is the number of executors you want and the memory YARN has to assign to the executors and the driver.
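If you want to sanity-check from the PySpark shell what the application actually requested, here is a small hedged sketch (assuming a reasonably recent PySpark; these are the standard Spark configuration keys that the flags from the deployment command above map to):

conf = sc.getConf()
print(conf.get("spark.executor.instances", "not set"))   # "16" for --num-executors 16
print(conf.get("spark.executor.memory", "not set"))      # "2g"
print(conf.get("spark.driver.memory", "not set"))         # "16g"
print(sc.defaultParallelism)                               # default number of partitions/tasks Spark will use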
Update: I have added this image to clarify how spark-submit (in cluster mode) works.
I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
The problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e. with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time potentially using multiple threads and processes.
However, my understanding does not leave any room for the term job and is thus certainly wrong. Moreover, most configuration of the cluster (cores, memory, processes) is done on a per-job basis according to the docs.
So my question is: what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how do the cores, memory, processes, and n_workers configurations interact? (I have read the docs; I just don't understand them and could use another explanation.)
Your understanding of tasks and workers is correct. A job is a concept specific to SLURM (and other HPC clusters where users submit jobs). A job consists of the instructions for what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instructions to launch multiple dask-workers (there might be advantages to this due to latency, permissions, resource availability, etc., but that would need to be determined from the particular cluster configuration).
From Dask's perspective what matters is the number of workers, but from dask-jobqueue's perspective the number of jobs also matters. For example, if the number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example, adapted from the docs, will result in 10 dask-workers, each with 24 cores:
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=24,
    processes=1,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores each and 50 GB per worker (note that the memory in the config is the total per job):
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    project="myproj",
    cores=20,
    processes=10,
    memory="500 GB"
)
cluster.scale(jobs=10)  # ask for 10 jobs
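For completeness, a minimal sketch of handing either cluster to a Dask client and submitting work to the resulting pool of workers (the task function here is just a placeholder):

from dask.distributed import Client

client = Client(cluster)   # connect to the SLURMCluster defined above

def task(x):               # stand-in for whatever you pass to client.submit
    return x * 2

future = client.submit(task, 21)
print(future.result())     # 42, computed on one of the dask-workers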
I have a problem in my Storm setup and it looks like there's some discrepancy between the number of executors I set for the topology, and the number of actual bolt processes I see running on one of the servers in that topology.
To set the number of executors per bolt I use the setBolt method of TopologyBuilder. The number of executors according to the UI is correct (a total of 105), and when drilling down to the number of executors per server I see that every server in my topology should hold 7-9 executors. This is all well and good; however, when SSHing into one of the servers and using htop I see one parent process with at least 30 child processes running for that bolt type.
A few notes:
I am using a very old version of Storm (0.9.3) that unfortunately I can't upgrade.
I'm running a Storm instance that runs Python processes (I don't know how relevant that is).
I think I'm missing something about the relation between the number of Storm processes and the number of bolts/executors I'm configuring, or about how to read htop properly. Either way, I would love an explanation.
I found this answer, which says that htop shows threads as processes, but I still don't think that answers my question.
Thank you
We are using the most recent Spark build. Our input is a very large list of tuples (800 million). We run our PySpark program using Docker containers with a master and multiple worker nodes. A driver is used to run the program and connect to the master.
When running the program, at the line sc.parallelize(tuplelist) it either quits with a Java heap error message or quits without any error at all. We do not use a Hadoop HDFS layer and no YARN.
We have so far considered the possible factors as mentioned in these SO postings:
Spark java.lang.OutOfMemoryError : Java Heap space
Spark java.lang.OutOfMemoryError: Java heap space (the list of possible solutions by samthebest also did not help to solve the issue)
At this point we have the following questions:
How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
Do you know of any (common?) mistakes which may lead to the observed behavior?
How do we know how many partitions we should use for the sc.parallelize step? What is a good rule of thumb here?
Answer: There are multiple factors that determine the number of partitions.
1) Having 3-4x as many partitions as your total number of cores is often a good choice (assuming each partition takes more than a few seconds to process); see the sketch after this list.
2) Partitions shouldn't be too small or too large; around 128 MB to 256 MB per partition is usually good enough.
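A hedged sketch of applying that rule of thumb to the sc.parallelize step (the executor and core counts are placeholders for your own cluster):

# Assumed cluster shape -- replace with your own numbers
num_executors = 16
cores_per_executor = 4
total_cores = num_executors * cores_per_executor

# Rule of thumb from above: 3-4x the total number of cores
num_partitions = 3 * total_cores

rdd = sc.parallelize(tuplelist, numSlices=num_partitions)
print(rdd.getNumPartitions())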
Do you know of any (common?) mistakes which may lead to the observed behavior?
Check the executor memory and disk available relative to the size of the data you are processing.
If you can give more details about the job, e.g. the number of cores, executor memory, number of executors, and disk available, it will be easier to pin down the issue.
Background
I have an image analysis pipeline with parallelised steps. The pipeline is in Python and the parallelisation is controlled by dask.distributed. The minimum processing setup has 1 scheduler + 3 workers with 15 processes each. In the first, short step of the analysis I use 1 process per worker but all the RAM of the node; in all other analysis steps all nodes and processes are used.
Issue
The admin will install HTCondor as a scheduler for the cluster.
Thought
In order to have my code running on the new setup I was planning to use the approach shown in the Dask manual for SGE, because the cluster has a shared network file system.
# Job 1
# Start a dask-scheduler somewhere and write connection information to a file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Job 2
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json

# Job 3
# Start a process with the Python code, where the client is started this way:
client = Client(scheduler_file='/path/to/scheduler.json')
Question and advice
If I understood correctly, with this approach I will start the scheduler, the workers and the analysis as independent jobs (different HTCondor submit files). How can I make sure that the order of execution will be correct? Is there a way I can use the same processing approach I have been using before, or would it be more efficient to translate the code to work better with HTCondor?
Thanks for the help!
HTCondor support has been merged into dask-jobqueue (https://github.com/dask/dask-jobqueue/pull/245) and should now be available as HTCondorCluster(cores=1, memory='100MB', disk='100MB').
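Building on that, a minimal hedged sketch of what the HTCondor version of the setup could look like with dask-jobqueue (the resource values are copied from the answer above; the job count is just a placeholder standing in for the original 3 nodes x 15 processes):

from dask_jobqueue import HTCondorCluster
from dask.distributed import Client

# dask-jobqueue submits the worker jobs itself, so the scheduler/worker/analysis
# ordering from the qsub approach no longer has to be managed by hand
cluster = HTCondorCluster(cores=1, memory='100MB', disk='100MB')
cluster.scale(jobs=45)      # e.g. 3 x 15 worker processes; adjust to your cluster

client = Client(cluster)    # the analysis code connects here instead of using a scheduler file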
I'm trying to parallelize a machine learning prediction task via Spark. I've used Spark successfully a number of times before on other tasks and have faced no issues with parallelization before.
In this particular task, my cluster has 4 workers. I'm calling mapPartitions on an RDD with 4 partitions. The map function loads a model from disk (a bootstrap script distributes all that is needed to do this; I've verified it exists on each slave machine) and performs prediction on data points in the RDD partition.
The code runs, but only utilizes one executor. The logs for the other executors say "Shutdown hook called". On different runs of the code, it uses different machines, but only one at a time.
How can I get Spark to use multiple machines at once?
I'm using PySpark on Amazon EMR via Zeppelin notebook. Code snippets are below.
%spark.pyspark

sc.addPyFile("/home/hadoop/MyClassifier.py")
sc.addPyFile("/home/hadoop/ModelLoader.py")

from ModelLoader import ModelLoader
from MyClassifier import MyClassifier

def load_models():
    models_path = '/home/hadoop/models'
    model_loader = ModelLoader(models_path)
    models = model_loader.load_models()
    return models

def process_file(file_contents, models):
    filename = file_contents[0]
    filetext = file_contents[1]
    pred = MyClassifier.predict(filetext, models)
    return (filename, pred)

def process_partition(file_list):
    models = load_models()
    for file_contents in file_list:
        pred = process_file(file_contents, models)
        yield pred

all_contents = sc.wholeTextFiles("s3://some-path", 4)
processed_pages = all_contents.mapPartitions(process_partition)
processedDF = processed_pages.toDF(["filename", "pred"])
processedDF.write.json("s3://some-other-path", mode='overwrite')
There are four tasks as expected, but they all run on the same executor!
I have the cluster running and can provide logs as available in Resource Manager. I just don't know yet where to look.
Two points to mention here (not sure if they will solve your issue though):
wholeTextFiles uses WholeTextFileInputFormat, which extends CombineFileInputFormat; because of CombineFileInputFormat it will try to combine groups of small files into one partition. So if you set the number of partitions to 2, for example, you might get two partitions, but it is not guaranteed; it depends on the size of the files you are reading.
The output of wholeTextFiles is an RDD that contains an entire file in each record (and each record/file cannot be split, so it will end up in a single partition/worker). So if you are reading only one file, you will end up with the full file in one partition even though you set the partitioning to 4 in your example.
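If the goal is simply to spread those few, large records across the 4 workers, one hedged workaround is to repartition after reading (this forces a shuffle, so each record can land on a different executor):

all_contents = sc.wholeTextFiles("s3://some-path").repartition(4)
processed_pages = all_contents.mapPartitions(process_partition)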
The process has as many partitions as you specified, but they run in a serialized way (one after another).
Executors
The process might spin up the default number of executors. This can be seen in the YARN resource manager. In your case all the processing is done by one executor. If the executor has more than one core, it will parallelize the job. On EMR you have to make the changes below in order to give the executor more than one core.
What is specifically happening in your case is that the data is small, so all of it is read into one executor (i.e. one node). Without the following property the executor uses only a single core, hence all the tasks are serialized.
Setting the property
sudo vi /etc/hadoop/conf/capacity-scheduler.xml
Set the following property as shown:
"yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalcul‌​ator"
To make this property take effect you have to restart YARN:
sudo hadoop-yarn-resourcemanager stop
Then start YARN again:
sudo hadoop-yarn-resourcemanager start
When your job is submitted, check the YARN UI and the Spark UI.
In the YARN UI you will see more cores per executor.
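Alternatively, here is a hedged sketch (not specific to EMR) of requesting more cores per executor when the Spark application itself is configured; the app name is a placeholder, and in Zeppelin these properties would normally go into the interpreter settings rather than into notebook code:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("prediction-job")                 # placeholder name
         .config("spark.executor.instances", "4")   # e.g. one executor per worker node
         .config("spark.executor.cores", "4")       # more than one core lets an executor run tasks in parallel
         .getOrCreate())
sc = spark.sparkContext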