I am trying to understand how to collect data for computing resource use, for example, the average number of customers waiting in line. I looked at the documentation at the following link, but it is just too much for me. I am looking for an example of how it is used and how to compute time-based average line length. I appreciate any guidance.
https://simpy.readthedocs.io/en/latest/topical_guides/monitoring.html#monitoring-your-processes
Resources do not have utilization logging. You will need to collect that yourself.
Monkey patching is a way to wrap resource requests with code that collects the stats without changing how resource requests are called. A simpler way is to just make a logger and add a log call wherever you need it; that is how I did it in my example. The downside is that you have to remember to add the logging call everywhere you need it.
SimPy resources have the following attributes you can use for collecting stats: capacity, count (number of users currently holding the resource), users (list of users holding the resource), and queue (list of pending resource requests).
"""
A quick example on how to get average line length
using a custom logger class
programmer: Michael R. Gibbs
"""
import simpy
import numpy as np
import pandas as pd
class LineLogger():
"""
logs the size of a resource line
"""
def __init__(self, env):
self.env = env
# the log
self.samples_df = pd.DataFrame(columns=['time','len'])
    def log(self, time, length):
        """
        log the time and the length of the resource request queue

        time: time the measurement is taken
        length: length of the queue
        """
        # DataFrame.append was removed in pandas 2.x, so build a one-row frame and concat it
        row = pd.DataFrame([{'time': time, 'len': length}])
        self.samples_df = pd.concat([self.samples_df, row], ignore_index=True)
def get_ave_line(self):
"""
finds the time weighted average of the queue length
"""
        # use the next row to figure out how long the queue stayed at each length
        self.samples_df['time_span'] = self.samples_df['time'].shift(-1) - self.samples_df['time']

        # drop the last row because it would have an infinite time span
        trimmed_samples_df = self.samples_df[0:-1]

        ave = np.average(trimmed_samples_df['len'], weights=trimmed_samples_df['time_span'])
return ave
def task(env, res, line_logger):
"""
A simple task that grabs a resouce for a bit of time
"""
with res.request() as req: # Generate a request event
# requester enters queue for resouce
line_logger.log(env.now,len(res.queue))
yield req
# requester got a resource and leaves requeuest queue
# if the resource was available when the request was made, then time in queue will be 0
line_logger.log(env.now,len(res.queue))
# keep resource to build a queue
yield env.timeout(3.5)
def gen_tasks(env, res, line_logger):
"""
generates 5 tasks to seize a resource
building a queue over time
"""
for i in range(5):
env.process(task(env,res,line_logger))
# put some time between requests
yield env.timeout(1)
if __name__ == '__main__':
env = simpy.Environment()
res = simpy.Resource(env, capacity=1)
line_logger = LineLogger(env)
env.process(gen_tasks(env,res,line_logger))
env.run(100)
print("finish sim")
print("average queue length is: ",line_logger.get_ave_line())
print()
print("log data")
print(line_logger.samples_df)
print()
print("done")
Use a process running in parallel with your main process to monitor utilization. Here is boilerplate code for a generator function you can use in the monitoring process.
data = []
def monitor_process(env, resource):
"""
Generator for monitoring process
that shares the environment with the main process
and collects information.
"""
while True:
item = (env.now,
resource.count,
len(resource.queue))
data.append(item)
yield env.timeout(0.25)
This generator function is set up to poll the resource object four times per simulation time unit (because of the 0.25 timeout) and append the result to a list. You can change the polling frequency. Register the generator as a process like so:
env.process(monitor_process(env, target_resource))
When you call env.run(until=120) (for example) to run your main process, this monitoring process will run in parallel and log the resource statistics.
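If you then want to turn the polled samples into summary numbers, a short post-processing step is enough. Here is a minimal sketch (it assumes the data list and target_resource names from above, and uses pandas purely as a convenience):

import pandas as pd

# data holds (time, count, queue length) tuples collected at a fixed
# 0.25 interval, so plain means approximate the time-weighted averages.
df = pd.DataFrame(data, columns=['time', 'in_use', 'queue_len'])

print("average queue length:", df['queue_len'].mean())
print("average number in use:", df['in_use'].mean())
print("utilization:", df['in_use'].mean() / target_resource.capacity)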
I have implemented monkey-patching for comparison with this approach. Monkey-patching decorates some of a resource's methods with logging features. The code is more elegant but also more complex. Moreover, with monkey-patching the resource stats are logged each time an event occurs, i.e. whenever any of the target resource's get, put, request, or release methods is called. The approach shown here logs resource stats at regular time intervals, and the code is simpler.
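For reference, the monkey-patching approach described in the SimPy monitoring guide looks roughly like the sketch below: the resource's put/get/request/release methods are wrapped so that a callback records the stats around every call. Treat it as a sketch; target_resource is whatever resource you want to watch.

from functools import partial, wraps

def patch_resource(resource, pre=None, post=None):
    """Wrap a resource's put/get/request/release methods with callbacks."""
    def get_wrapper(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if pre:
                pre(resource)            # runs before the operation
            ret = func(*args, **kwargs)
            if post:
                post(resource)           # runs after the operation
            return ret
        return wrapper
    for name in ['put', 'get', 'request', 'release']:
        if hasattr(resource, name):
            setattr(resource, name, get_wrapper(getattr(resource, name)))

def monitor(data, resource):
    """Callback: record sim time, number of users, and queue length."""
    data.append((resource._env.now, resource.count, len(resource.queue)))

log = []
patch_resource(target_resource, post=partial(monitor, log))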
Hope this helps.
Cheers!
Related
I am using multiprocessing with multiple workers (subclasses of multiprocessing.Process) and queues (multiprocessing.JoinableQueue), to implement a complex workflow of data manipulation.
One of the workers (JobSender) is submitting jobs to a remote system (a web service), which returns an identifier immediately. Those jobs can take a very long time to be performed.
I therefore have another worker (StatusPoller) in charge of polling that remote system for status of the job. To do so, the JobSender adds the identifier in a queue that the StatusPoller uses as input. If the job is not completed, the StatusPoller puts the identifier back on the same queue. If the job is completed, the StatusPoller retrieves the result information and then adds it to a list (multiprocessing.Manager.list()).
My question: I don't want to hammer the remote system with continuous requests for status, which would happen in my setup. I want to introduce a delay somewhere to ensure that status polling for any given identifier only happens every 20 seconds.
Currently I'm doing this by having a time.sleep(20) just before the StatusPoller puts the identifier back on the queue. But that means that the StatusPoller is now idle for 20 seconds and cannot pick up another polling task from the queue. I will have multiple StatusPollers but I can't have one for each of the jobs (there might be hundreds of those).
class StatusPoller(multiprocessing.Process):
def __init__(self, polling_queue, results_queue, errors_queue):
multiprocessing.Process.__init__(self)
self.polling_queue = polling_queue
self.results_queue = results_queue
def run(self):
while True:
# Pick a task from the queue
next_id = self.polling_queue.get()
# Poison pill => shutdown
if next_id == 'END':
self.polling_queue.task_done()
break
# Process the task
response = remote_system.get_status(next_id)
if response == "IN_PROGRESS":
time.sleep(20)
self.polling_queue.put(next_id)
else:
self.results_queue.put(response)
self.polling_queue.task_done()
Any idea how to implement such a workflow?
When you consider that the multiprocessing.Process and threading.Thread classes can be instantiated with the target keyword, I consider it an antipattern to subclass them, since you lose some flexibility and reuse. In fact, given that StatusPoller is just waiting on a queue and on a reply from the network, I would think that multithreading is more than adequate in your case, especially if, as you say, you have "hundreds of those." I also cannot see in your current code any need for a joinable queue.
So I would suggest using multithreading with regular queue.Queue instances and a scheduler modeled on the sched.scheduler class from the sched module, which can be shared among all StatusPoller instances since the code appears to be thread safe. Here is the general idea:
from threading import Thread
from queue import Queue
import time
# Start of modified sched.scheduler code:
#########################################################
# Heavily modified from sched.scheduler
import time
import heapq
from collections import namedtuple
import threading
from time import monotonic as _time
class Event(namedtuple('Event', 'time, priority, action, argument, kwargs')):
__slots__ = []
def __eq__(s, o): return (s.time, s.priority) == (o.time, o.priority)
def __lt__(s, o): return (s.time, s.priority) < (o.time, o.priority)
def __le__(s, o): return (s.time, s.priority) <= (o.time, o.priority)
def __gt__(s, o): return (s.time, s.priority) > (o.time, o.priority)
_sentinel = object()
class Scheduler():
"""
Code modified from sched.scheduler
"""
delayfunc = time.sleep
def __init__(self, timefunc=_time):
"""Initialize a new instance, passing the time functions"""
self._queue = []
self.timefunc = timefunc
self.got_event = threading.Condition(threading.RLock())
self.thread_started = False
def enterabs(self, time, priority, action, argument=(), kwargs=_sentinel):
"""Enter a new event in the queue at an absolute time.
Returns an ID for the event which can be used to remove it,
if necessary.
"""
if kwargs is _sentinel:
kwargs = {}
event = Event(time, priority, action, argument, kwargs)
with self.got_event:
if not self.thread_started:
self.thread_started = True
threading.Thread(target=self.run, daemon=True).start()
heapq.heappush(self._queue, event)
# Show new Event has been entered:
self.got_event.notify()
return event # The ID
def cancel(self, event):
"""Remove an event from the queue.
This must be presented the ID as returned by enter().
If the event is not in the queue, this raises ValueError.
"""
with self.got_event:
self._queue.remove(event)
heapq.heapify(self._queue)
def enter(self, delay, priority, action, argument=(), kwargs=_sentinel):
"""A variant that specifies the time as a relative time.
This is actually the more commonly used interface.
"""
time = self.timefunc() + delay
return self.enterabs(time, priority, action, argument, kwargs)
def empty(self):
"""Check whether the queue is empty."""
with self.got_event:
return not self._queue
def run(self):
"""Execute events until the queue is empty."""
# localize variable access to minimize overhead
# and to improve thread safety
got_event = self.got_event
q = self._queue
timefunc = self.timefunc
delayfunc = self.delayfunc
pop = heapq.heappop
while True:
try:
while True:
with got_event:
got_event.wait_for(lambda: len(q) != 0)
time, priority, action, argument, kwargs = q[0]
now = timefunc()
if time > now:
# Wait for either the time to elapse or a new
# event to be added:
got_event.wait(timeout=(time - now))
continue
pop(q)
action(*argument, **kwargs)
delayfunc(0) # Let other threads run
except:
pass
    @property
def queue(self):
"""An ordered list of upcoming events.
Events are named tuples with fields for:
time, priority, action, arguments, kwargs
"""
# Use heapq to sort the queue rather than using 'sorted(self._queue)'.
# With heapq, two events scheduled at the same time will show in
# the actual order they would be retrieved.
with self.got_event:
events = self._queue[:]
return list(map(heapq.heappop, [events]*len(events)))
###########################################################
def re_queue(polling_queue, id):
polling_queue.put(id)
class StatusPoller:
scheduler = Scheduler()
def __init__(self, polling_queue, results_queue, errors_queue):
self.polling_queue = polling_queue
self.results_queue = results_queue
def run(self):
while True:
# Pick a task from the queue
next_id = self.polling_queue.get()
# Poison pill => shutdown
if next_id == 'END':
break
# Process the task
response = remote_system.get_status(next_id)
if response == "IN_PROGRESS":
self.scheduler.enter(20, 1, re_queue, argument=(self.polling_queue, next_id))
else:
self.results_queue.put(response)
Explanation
First, why did I say that I saw no reason for a JoinableQueue? The run method is programmed to return when it finds an 'END' input message. But because the method requeues messages back onto the polling_queue whenever it gets an "IN_PROGRESS" response from the remote system, it is possible that when 'END' is received and run terminates, one or more of these requeued messages is still sitting on the queue. So how can another process or thread depend on calling polling_queue.join() without possibly hanging? It cannot.
Instead, if you have N processes or threads (we haven't decided yet which) doing get requests against a single queue instance, it should suffice to put N 'END' shutdown messages on the queue; this will cause the N workers to terminate. The main process, instead of joining the queue, just joins the N processes or threads if it wishes to block until they have actually terminated.
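In concrete terms, the shutdown could look something like the following sketch (it assumes the StatusPoller class above is run by plain threads; the errors queue is unused in the posted code, so None is passed for it):

from queue import Queue
from threading import Thread

N = 8                          # number of pollers; pick whatever suits your load
polling_queue = Queue()
results_queue = Queue()

pollers = [StatusPoller(polling_queue, results_queue, None) for _ in range(N)]
threads = [Thread(target=poller.run) for poller in pollers]
for t in threads:
    t.start()

# ... the JobSender puts job identifiers on polling_queue ...

for _ in range(N):             # one 'END' poison pill per thread
    polling_queue.put('END')
for t in threads:              # block until every poller has actually exited
    t.join()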
The way I would use a JoinableQueue, which I don't think fits your use case, would be if the processes/threads sat in an infinite loop and never terminated, that is, never quit "prematurely" and therefore never left items on the queue. You would make them daemon processes/threads so that they end when the main process terminates, which means you could not force a termination with an 'END' message. So I just don't see how a JoinableQueue works here, but you can point out to me if I have misunderstood something.
Yes, StatusPoller could be the target of a Process instance (or even a subclass of Process as you originally had it, although apart from that being how you currently have it coded, I see no advantage to doing so). But it seems to me that it will be spending most of its time waiting on either a queue get or a network response. In both cases it releases the Global Interpreter Lock, so multithreading should be very performant. Threads also take up far fewer resources if we are indeed talking about creating hundreds of these tasks, especially if you are running under Windows. Also, with processes you would not be able to share the scheduler, which runs in its own thread, across all StatusPoller instances; there would instead be one scheduler running in each process, since each StatusPoller would be running in its own process.
Is there a way to track the progress of a chord, preferably in a tqdm bar?
For example, if we take the documentation example, we would create this file:
# proj/tasks.py

@app.task
def add(x, y):
    return x + y

@app.task
def tsum(numbers):
    return sum(numbers)
and then run this script:
from celery import chord
from proj.tasks import add, tsum
chord(add.s(i, i)
for i in range(100))(tsum.s()).get()
How could we track the progression on the chord?
We cannot use update_state since the chord() object is not a function.
We cannot use collect() since chord()(callback) blocks the script until the results are ready.
Ideally I would envision something like this custom tqdm subclass for Dask, however I've been unable to find a similar solution.
Any help or hint much appreciated!
So I found a way around it.
First, chord()(callback) doesn't actually block the script, only the .get() part does. It just might take a long time to publish all tasks to the broker. Luckily, there's a simple way to track this publishing process through signals. We can create a progress bar before the publishing begins and modify the example handler from the documentation to update it:
from tqdm import tqdm
from celery.signals import after_task_publish
publish_pbar = tqdm(total=100, desc="Publishing tasks")
@after_task_publish.connect(sender='tasks.add')
def task_sent_handler(sender=None, headers=None, body=None, **kwargs):
    publish_pbar.update(1)
c = chord(add.s(i, i)
for i in range(100))(tsum.s())
# The script will resume once all tasks are published so close the pbar
publish_pbar.close()
However, this only works for publishing tasks, since the after_task_publish signal is executed in the process that sent the task. The task_success signal is executed in the worker process, so that trick can only be used in the worker log (to the best of my understanding).
So to track progress once all tasks have been published and the script resumes, I turned to worker stats from app.control.inspect().stats(). This returns a dict with various stats, among which are the completed tasks. Here's my implementation:
tasks_pbar = tqdm(total=100, desc="Executing tasks")
previous_total = 0
current_total = 0
while current_total < 100:
    current_total = 0
    stats = app.control.inspect().stats()   # poll the workers once per pass
    for key in stats:
        current_total += stats[key]['total']['tasks.add']
    if current_total > previous_total:
        tasks_pbar.update(current_total - previous_total)
        previous_total = current_total
results = c.get()
tasks_pbar.close()
Finally, I think it might be necessary to give names to the tasks, both for filtering by the signal handler and for the stats() dict, so do not forget to add this to your tasks:
# proj/tasks.py

@app.task(name='tasks.add')
def add(x, y):
    return x + y
If someone can find a better solution, please do share!
A fairly common case for me is to have a periodic update of a value, say every 30 seconds. This value is available on, for instance, a website.
I want to take this value (using a reader), transform it (with a transformer) and publish the result, say on another website (with a publisher).
Both source and destination can be unavailable from time to time, and I'm only interested in new values and timeouts.
My current method is to use a queue for my values and another queue for my results. The reader, the transformer and the publisher are all separate 'threads' using multiprocessing.
This has the advantage that every step can be allowed to 'hang' for some time and the next step can use a get with a timeout to implement some default action in case there is no valid message in the queue.
The drawback of this method is that I'm left with all previous values and results in my queue once the transformer or publisher stalls. In the worst case the publisher has an unrecoverable error and the entire tool runs out of memory.
A possible resolution to this problem is to limit the queue size to 1, use a non-blocking put and handle a queue full exception by throwing away the current value and re-putting the new. This is quite a lot of code for such a simple action and a clear indication that a queue is not the right tool for the job.
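To make that concrete, the workaround I have in mind looks roughly like this sketch (it assumes a multiprocessing.Queue created with maxsize=1 and shared between the stages):

import queue  # multiprocessing queues raise queue.Full / queue.Empty

def put_latest(q, value):
    """Put value on a maxsize=1 queue, discarding any stale value."""
    while True:
        try:
            q.put_nowait(value)
            return
        except queue.Full:
            try:
                q.get_nowait()   # throw away the old, unread value
            except queue.Empty:
                pass             # the consumer grabbed it first; just retry the put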
I can write my own class to get the behavior I want using multiprocessing primitives, but this is a very common situation for me, so I assume it also is for others and I feel there should be a 'right' solution out there somewhere.
In short is there a standard threadsafe class with the following interface?
class Updatable():
    def put(self, value):
        """Store value, overwriting any existing value."""
    def get(self, timeout=None):
        """Block until a value is available; raise an exception when timeout is set and exceeded."""
        return value
edit: my current implementation using multiprocessing
import multiprocessing
from time import sleep
class Updatable():
def __init__(self):
self.manager = multiprocessing.Manager()
self.ns = self.manager.Namespace()
self.updated = self.manager.Event()
def get(self, timeout=None):
self.updated.wait(timeout)
self.updated.clear()
return self.ns.x
def put(self, item):
self.ns.x = item
self.updated.set()
def consumer(updatable):
print(updatable.get()) # expect 1
sleep(1)
print(updatable.get()) # expect "2"
sleep(1)
print(updatable.get()) # expect {3}, after 2 sec
sleep(1)
print(updatable.get()) # expect [4]
sleep(2)
print(updatable.get()) # expect 6
sleep(1)
def producer():
sleep(.5) # make output more stable, by giving both sides 0.5 sec to process
updatable.put(1)
sleep(1)
updatable.put("2")
sleep(2)
updatable.put({3})
sleep(1)
updatable.put([4])
sleep(1)
updatable.put(5,) # will never be consumed
sleep(1)
updatable.put(6)
if __name__ == '__main__':
updatable = Updatable()
p = multiprocessing.Process(target=consumer, args=(updatable,))
p.start()
producer()
Imagine I have a dask grid with 10 workers & 40 cores total. This is a shared grid, so I don't want to fully saturate it with my work. I have 1000 tasks to do, and I want to submit (and have actively running) a maximum of 20 tasks at a time.
To be concrete,
from time import sleep
from random import random
def inc(x):
from random import random
sleep(random() * 2)
return x + 1
def double(x):
from random import random
sleep(random())
return 2 * x
>>> from distributed import Executor
>>> e = Executor('127.0.0.1:8786')
>>> e
<Executor: scheduler=127.0.0.1:8786 workers=10 threads=40>
If I setup a system of Queues
>>> from queue import Queue
>>> input_q = Queue()
>>> remote_q = e.scatter(input_q)
>>> inc_q = e.map(inc, remote_q)
>>> double_q = e.map(double, inc_q)
This will work, BUT, this will just dump ALL of my tasks to the grid, saturating it. Ideally I could:
e.scatter(input_q, max_submit=20)
It seems that the example from the docs here would allow me to use a maxsize queue. But it looks like, from a user perspective, I would still have to deal with the backpressure myself. Ideally dask would take care of this automatically.
Use maxsize=
You're very close. All of scatter, gather, and map take the same maxsize= keyword argument that Queue takes. So a simple workflow might be as follows:
Example
from time import sleep
def inc(x):
sleep(1)
return x + 1
your_input_data = list(range(1000))
from queue import Queue # Put your data into a queue
q = Queue()
for i in your_input_data:
q.put(i)
from dask.distributed import Executor
e = Executor('127.0.0.1:8786') # Connect to cluster
futures = e.map(inc, q, maxsize=20) # Map inc over data
results = e.gather(futures) # Gather results
L = []
while not q.empty() or not futures.empty() or not results.empty():
L.append(results.get()) # this blocks waiting for all results
All of q, futures, and results are Python Queue objects. The q and results queues don't have a limit, so they'll greedily pull in as much as they can. The futures queue however has a maximum size of 20, so it will only allow 20 futures in flight at any given time. Once the leading future is complete it will immediately be consumed by the gather function and its result will be placed into the results queue. This frees up space in futures and causes another task to be submitted.
Note that this isn't exactly what you wanted. These queues are ordered so futures will only get popped off when they're in the front of the queue. If all of the in-flight futures have finished except for the first they'll still stay in the queue, taking up space. Given this constraint you might want to choose a maxsize= slightly more than your desired 20 items.
Extending this
Here we do a simple map->gather pipeline with no logic in between. You could also put other map computations in here or even pull futures out of the queues and do custom work with them on your own. It's easy to break out of the mold provided above.
The solution posted on github was very useful - https://github.com/dask/distributed/issues/864
Solution:
inputs = iter(inputs)
futures = [c.submit(func, next(inputs)) for i in range(maxsize)]
ac = as_completed(futures)
for finished_future in ac:
# submit new future
try:
new_future = c.submit(func, next(inputs))
ac.add(new_future)
except StopIteration:
pass
result = finished_future.result()
... # do stuff with result
Query:
However, for determining which workers are free (in order to throttle the tasks), I am trying to use the client.has_what() API. It seems the load on the workers is not reflected immediately, unlike what is shown on the status UI page; at times it takes quite a while for has_what to return any data.
Is there another API that can be used to determine the number of free workers, which could then be used to set the throttle range, similar to what the UI uses?
Problem
I've segmented a long-running task into logical subtasks, so I can report the results of each subtask as it completes. However, I'm trying to report the results of a task that will effectively never complete (instead yielding values as it goes), and am struggling to do so with my existing solution.
Background
I'm building a web interface to some Python programs I've written. Users can submit jobs through web forms, then check back to see the job's progress.
Let's say I have two functions, each accessed via separate forms:
med_func: Takes ~1 minute to execute, results are passed off to render(), which produces additional data.
long_func: Returns a generator. Each yield takes on the order of 30 minutes, and should be reported to the user. There are so many yields, we can consider this iterator as infinite (terminating only when revoked).
Code, current implementation
With med_func, I report results as follows:
On form submission, I save an AsyncResult to a Django session:
task_result = med_func.apply_async([form], link=render.s())
request.session["task_result"] = task_result
The Django view for the results page accesses this AsyncResult. When a task has completed, results are saved into an object that is passed as context to a Django template.
def results(request):
""" Serve (possibly incomplete) results of a session's latest run. """
session = request.session
try: # Load most recent task
task_result = session["task_result"]
except KeyError: # Already cleared, or doesn't exist
if "results" not in session:
session["status"] = "No job submitted"
else: # Extract data from Asynchronous Tasks
session["status"] = task_result.status
if task_result.ready():
session["results"] = task_result.get()
render_task = task_result.children[0]
# Decorate with rendering results
session["render_status"] = render_task.status
if render_task.ready():
session["results"].render_output = render_task.get()
del(request.session["task_result"]) # Don't need any more
return render_to_response('results.html', request.session)
This solution only works when the function actually terminates. I can't chain together logical subtasks of long_func, because there are an unknown number of yields (each iteration of long_func's loop may not produce a result).
Question
Is there any sensible way to access yielded objects from an extremely long-running Celery task, so that they can be displayed before the generator is exhausted?
In order for Celery to know what the current state of the task is, it sets some metadata in whatever result backend you have. You can piggy-back on that to store other kinds of metadata.
def yielder():
for i in range(2**100):
yield i
@task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.backend.mark_as_started(
report_progress.request.id,
progress=progress)
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
I wouldn't throw a ton of data in there, but it works well for tracking the progress of a long-running task.
Paul's answer is great. As an alternative to using mark_as_started you can use Task's update_state method. They ultimately do the same thing, but the name "update_state" is a little more appropriate for what you're trying to do. You can optionally define a custom state that indicates your task is in progress (I've named my custom state 'PROGRESS'):
def yielder():
for i in range(2**100):
yield i
@task
def report_progress():
for progress in yielder():
# set current progress on the task
report_progress.update_state(state='PROGRESS', meta={'progress': progress})
def view_function(request):
task_id = request.session['task_id']
task = AsyncResult(task_id)
progress = task.info['progress']
# do something with your current progress
Celery part:
def long_func(*args, **kwargs):
i = 0
while True:
yield i
do_something_here(*args, **kwargs)
i += 1
@task()
def test_yield_task(task_id=None, **kwargs):
the_progress = 0
for the_progress in long_func(**kwargs):
cache.set('celery-task-%s' % task_id, the_progress)
Web client side, starting the task:
r = test_yield_task.apply_async()
request.session['task_id'] = r.task_id
Testing last yielded value:
v = cache.get('celery-task-%s' % session.get('task_id'))
if v:
    do_something()
If you do not want to use the cache, or it is not possible, you can use a DB, a file, or any other place that both the Celery worker and the server side can access. The cache is the simplest solution, but the workers and the server have to use the same cache.
A couple options to consider:
1 -- task groups. If you can enumerate all the sub-tasks at the time of invocation, you can apply the group as a whole -- that returns a GroupResult (TaskSetResult in older Celery versions) you can use to monitor the results of the group as a whole, or of individual tasks in the group -- query this as needed when you need to check status (see the sketch after this list).
2 -- callbacks. If you can't enumerate all sub tasks (or even if you can!) you can define a web hook / callback that's the last step in the task -- called when the rest of the task completes. The hook would be against a URI in your app that ingests the result and makes it available via DB or app-internal API.
Some combination of these could solve your challenge.
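For option 1, a rough sketch of what monitoring a group as a whole could look like (subtask and chunks are hypothetical stand-ins for your real sub-tasks and their inputs; in current Celery the group returns a GroupResult, whose ready() and completed_count() can be polled):

import time
from celery import group
from proj.tasks import subtask   # hypothetical task standing in for your sub-tasks

job = group(subtask.s(chunk) for chunk in chunks).apply_async()

while not job.ready():
    print("{} of {} sub-tasks finished".format(job.completed_count(), len(job.results)))
    time.sleep(10)

results = job.get()   # list with every sub-task's result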
See also this great PyCon preso from one of the Instagram engineers.
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
At video mark 16:00, he discusses how they structure long lists of sub-tasks.