I'm investigating the suitability of using spark as the back end for my REST API. A problem with that seems to be Spark's FIFO scheduling approach. This means that if a large task is under execution, no small task can finish until that heavy task has finished. According to https://spark.apache.org/docs/latest/job-scheduling.html a fair scheduler should fix this. However, I cannot notice this changing anything. Am I configuring the scheduler wrong?
scheduler.xml:
<?xml version="1.0"?>
<allocations>
<pool name="test">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>10</minShare>
</pool>
</allocations>
My code:
$ pyspark --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/home/hadoop/scheduler.xml
>>> import threading
>>> sc.setLocalProperty("spark.scheduler.pool", "test")
>>> def heavy_spark_job():
# Do some heavy work
>>>
>>> def smaller_spark_job():
# Do something simple
>>>
>>> threading.Thread(target=heavy_spark_job).start()
>>> smaller_spark_job()
The smaller spark job can only start when the first task of the heavy spark job doesn't need all of the available CPU cores.
You just need to set a different pools for your tasks:
By default, each pool gets an equal share of the cluster (also equal in share to each job in the default pool), but inside each pool, jobs run in FIFO order. For example, if you create one pool per user, this means that each user will get an equal share of the cluster, and that each user’s queries will run in order instead of later queries taking resources from that user’s earlier ones.
https://spark.apache.org/docs/latest/job-scheduling.html#default-behavior-of-pools
Also, in PySpark the child thread couldn't inherit the parent thread's local properties, you have to set the pool inside the thread target functions.
I believe, you have implemented fair scheduling correctly. There is a problem in threading as you are not running smaller_spark_job as a thread. Hence, it is waiting for heavy_spark_job to get complete first. It should be like below.
threading.Thread(target=heavy_spark_job).start()
threading.Thread(target= smaller_spark_job).start()
Also, my post can help you to verify FAIR Scheduling from Spark UI.
Related
I am using a SLURM cluster with Dask and don't quite understand the configuration part. The documentation talks of jobs and workers and even has a section on the difference:
In dask-distributed, a Worker is a Python object and node in a dask Cluster that serves two purposes, 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.
Problem is I still don't get it. I use the word task to refer to a single function one submits using a client, i.e with a client.submit(task, *params) call.
My understanding of how Dask works is that there are n_workers set up and that each task is submitted to a pool of said workers. Any worker works on one task at a given time potentially using multiple threads and processes.
However my understanding does not leave any room for the term job and is thus certainly wrong. Moreover most configurations of the cluster (cores, memory, processes) are done on a per job basis according to the docs.
So my question is what is a job? Can anyone explain in simpler terms its relation to a task and a worker? And how the cores, memory, processes, and n_workers configurations interact? (I have read the docs, just don't understand and could use another explanation)
Your understanding of tasks and workers is correct. Job is a concept specific to SLURM (and other HPC clusters where users submit jobs). Job consists of the instruction of what to execute and what resources are needed, so the typical workflow of a SLURM user is to write a script and then submit it for execution using salloc or sbatch.
One can submit a job with instruction to launch multiple dask-workers (there might be advantages for this due to latency, permissions, resource availability, etc, but this would need to be determined from the particular cluster configuration).
From dask perspective what matters is the number of workers, but from dask-jobqueue the number of jobs also matters. For example, if number of workers per job is 2, then to get 10 workers in total dask-jobqueue will submit 5 jobs to the SLURM scheduler.
This example adapted from docs, will result in 10 dask-worker, each with 24 cores:
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
queue='regular',
project="myproj",
cores=24,
processes=1,
memory="500 GB"
)
cluster.scale(jobs=10) # ask for 10 jobs
If we specify multiple processes, then the total number of workers will be jobs * processes (assuming sufficient cores), so the following will give 100 workers with 2 cores each and 50 GB per worker (note the memory in config is total):
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
queue='regular',
project="myproj",
cores=20,
processes=10,
memory="500 GB"
)
cluster.scale(jobs=10) # ask for 10 jobs
I'm working with a django application hosted on heroku with redistogo addon:nano pack. I'm using rq, to execute tasks in the background - the tasks are initiated by online users. I've a constraint on increasing number of connections, limited resources I'm afraid.
I'm currently having a single worker running over 'n' number of queues. Each queue uses an instance of connection from the connection pool to handle 'n' different types of task. For instance, lets say if 4 users initiate same type of task, I would like to have my main worker create child processes dynamically, to handle it. Is there a way to achieve required multiprocessing and concurrency?
I tried with multiprocessing module, initially without introducing Lock(); but that exposes and overwrites user passed data to the initiating function, with the previous request data. After applying locks, it restricts second user to initiate the requests by returning a server error - 500
github link #1: Looks like the team is working on the PR; not yet released though!
github link #2: This post helps to explain creating more workers at runtime.
This solution however also overrides the data. The new request is again processed with the previous requests data.
Let me know if you need to see some code. I'll try to post a minimal reproducible snippet.
Any thoughts/suggestions/guidelines?
Did you get a chance to try AutoWorker?
Spawn RQ Workers automatically.
from autoworker import AutoWorker
aw = AutoWorker(queue='high', max_procs=6)
aw.work()
It makes use of multiprocessing with StrictRedis from redis module and following imports from rq
from rq.contrib.legacy import cleanup_ghosts
from rq.queue import Queue
from rq.worker import Worker, WorkerStatus
After looking under the hood, I realised Worker class is already implementing multiprocessing.
The work function internally calls execute_job(job, queue) which in turn as quoted in the module
Spawns a work horse to perform the actual work and passes it a job.
The worker will wait for the work horse and make sure it executes within the given timeout bounds,
or will end the work horse with SIGALRM.
The execute_job() funtion makes a call to fork_work_horse(job, queue) implicitly which spawns a work horse to perform the actual work and passes it a job as per the following logic:
def fork_work_horse(self, job, queue):
child_pid = os.fork()
os.environ['RQ_WORKER_ID'] = self.name
os.environ['RQ_JOB_ID'] = job.id
if child_pid == 0:
self.main_work_horse(job, queue)
else:
self._horse_pid = child_pid
self.procline('Forked {0} at {1}'.format(child_pid, time.time()))
The main_work_horse makes an internal call to perform_job(job, queue) which makes a few other calls to actually perform the job.
All the steps about The Worker Lifecycle mentioned over rq's official documentation page are taken care within these calls.
It's not the multiprocessing I was expecting, but I guess they have a way of doing things. However my original post is still not answered with this, also I'm still not sure about concurrency..
The documentation there still needs to be worked upon, since it hardly covers the true essence of this library!
I have been using celery for a while but am looking for an alternative due to the lack of windows support.
The top competitors seem to be dask and dramatiq. What I'm really looking for is for something that can distribute 1000 long running tasks onto 10 machines. Each should pick up the next job when it has completed the task, and give a callback with updates (in celery this can be nicely achieved with #task(bind=True), as the task instance itself can be accessed and I can send the status back to the instance that sent it with an update).
Is there a similar functionality available in dramatiq or dask? Any suggestions would be appreciated.
On the Dask side you're probably looking for the futures interface : https://docs.dask.org/en/latest/futures.html
Futures have a basic status like "finished" or "pending" or "error" that you can check any time. If you want more complex messages then you should look into Dask Queues, PubSub, or other intertask communication mechanisms, also available from that doc page.
I've started a new Python 3 project in which my goal is to download tweets and analyze them. As I'll be downloading tweets from different subjects, I want to have a pool of workers that must download from Twitter status with the given keywords and store them in a database. I name this workers fetchers.
Other kind of worker is the analyzers whose function is to analyze tweets contents and extract information from them, storing the result in a database also. As I'll be analyzing a lot of tweets, would be a good idea to have a pool of this kind of workers too.
I've been thinking in using RabbitMQ and Celery for this but I have some questions:
General question: Is really a good approach to solve this problem?
I need at least one fetcher worker per downloading task and this could be running for a whole year (actually is a 15 minutes cycle that repeats and last for a year). Is it appropriate to define an "infinite" task?
I've been trying Celery and I used delay to launch some example tasks. The think is that I don't want to call ready() method constantly to check if the task is completed. Is it possible to define a callback? I'm not talking about a celery task callback, just a function defined by myself. I've been searching for this and I don't find anything.
I want to have a single RabbitMQ + Celery server with workers in different networks. Is it possible to define remote workers?
Yeah, it looks like a good approach to me.
There is no such thing as infinite task. You might reschedule a task it to run once in a while. Celery has periodic tasks, so you can schedule a task so that it runs at particular times. You don't necessarily need celery for this. You can also use a cron job if you want.
You can call a function once a task is successfully completed.
from celery.signals import task_success
#task_success(sender='task_i_am_waiting_to_complete')
def call_me_when_my_task_is_done():
pass
Yes, you can have remote workes on different networks.
If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
redis_conn = Redis()
use_connection(redis_conn)
q = Queue('normal', connection=redis_conn) # this is terrible, I know - fixing later
w = Worker(q)
job = q.enqueue(getlinksmod.lsGet, theURL,total,domainid)
w.work()
I assumed my best solution was to have 2 workers, one for job A and one for B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out to save my life is how I get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But haven't the foggiest as to how I pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?
Update januari 2015, this pull request is now merged, and the parameter is renamed to depends_on, ie:
second_job = q.enqueue(email_customer, depends_on=first_job)
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
def generate_report():
pass
def email_customer():
pass
first_job = q.enqueue(generate_report)
second_job = q.enqueue(email_customer, after=first_job)
# In the second enqueue call, job is created,
# but only moved into queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
def generate_report():
pass
def email_customer():
pass
def generate_report_and_email():
generate_report()
email_customer() # You can also enqueue this function, if you really want to
# Somewhere else
q.enqueue(generate_report_and_email)
From this page on the rq docs, it looks like each job object has a result attribute, callable by job.result, which you can check. If the job hasn't finished, it'll be None, but if you ensure that your job returns some value (even just "Done"), then you can have your other worker check the result of the first job and then begin working only when job.result has a value, meaning the first worker was completed.
You are probably too deep into your project to switch, but if not, take a look at Twisted. http://twistedmatrix.com/trac/ I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel, as well as organizing certain jobs in order, so Job B doesn't execute until Job A is done.
This is the best tutorial for learning Twisted if you want to attempt. http://krondo.com/?page_id=1327
Combine the things that job A and job B do in one function, and then use e.g. multiprocessing.Pool (it's map_async method) to farm that out over different processes.
I'm not familiar with rq, but multiprocessing is a part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.