from celery import Celery

app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')

a_num = 0

@app.task
def addone():
    global a_num
    a_num = a_num + 1
    return a_num
This is the code I used to test Celery. I expected the return value to increase every time I call addone(), but it is always 1. Why?
Results
>>> from tasks import addone
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
By default when a worker is started Celery starts it with a concurrency of 4, which means it has 4 processes started to handle task requests. (Plus a process that controls the other processes.) I don't know what algorithm is used to assign task requests to the processes started for a worker but eventually, if you execute addone.delay().get() enough, you'll see the number get greater than 1. What happens is that each process (not each task) gets its own copy of a_num. When I try it here, my fifth execution of addone.delay().get() returns 2.
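To make the per-process behaviour visible, here is a minimal sketch (not from the original post) that returns the worker's PID together with the counter; calling it repeatedly shows the same counter values coming back from different PIDs:
import os
from celery import Celery

app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')

a_num = 0

@app.task
def addone_debug():
    global a_num
    a_num += 1
    # Each prefork worker process keeps its own copy of a_num, so the same
    # counter value can come back from several different PIDs.
    return {'pid': os.getpid(), 'a_num': a_num}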
You could force the number to increment each time by starting your worker with a single process to handle requests. (e.g. celery -A tasks worker -c1) However, if you ever restart your worker, the numbering will be reset to 0. Moreover, I would not design code that works only if the number of processes handling requests is forced to be 1. One day down the road a colleague decides that multiple processes should handle the requests for the tasks and then things break. (Big fat warnings in comments in the code can only do so much.)
At the end of the day, such state should be shared in a cache, like Redis, or a database used as a cache, which would work for the code in your question.
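For the counter in the question, a rough sketch of the cache approach could look like this (the Redis key name 'addone:counter' is just an illustrative choice):
import redis
from celery import Celery

app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')
redis_db = redis.StrictRedis(host='localhost', port=6379, db=0)

@app.task
def addone():
    # INCR is atomic, so concurrent worker processes cannot clobber each other,
    # and the value survives worker restarts.
    return redis_db.incr('addone:counter')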
However, in a comment you wrote:
Let's say I want to use a task to send something. Instead of connecting every time inside the task, I want to share a global connection.
Storing the connection in a cache won't work. I would strongly advocate having each process that Celery starts use its own connection rather than try to share it among processes. The connection does not need to be reopened with each new task request. It is opened once per process, and then each task request served by this process reuses the connection.
In many cases, trying to share the same connection among processes (through sharing virtual memory through a fork, for instance) would flat out not work anyway. Connections often carry state with them (e.g. whether a database connection is in autocommit mode). If two parts of the code expect the connection to be in different states, the code will operate inconsistently.
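One possible sketch of the "one connection per worker process" idea uses Celery's worker_process_init signal; the connect_to_my_service() helper below is hypothetical and stands in for whatever connection you actually open:
from celery import Celery
from celery.signals import worker_process_init

app = Celery('tasks', broker='amqp://guest@localhost//')

connection = None

@worker_process_init.connect
def open_connection(**kwargs):
    global connection
    connection = connect_to_my_service()  # hypothetical: opened once per worker process

@app.task
def send_something(payload):
    # Every task request served by this process reuses the same connection.
    connection.send(payload)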
The tasks run asynchronously in separate worker processes, so a_num starts at 0 in each of them; they are run as separate instances.
If you want to share values across tasks, I suggest a value store or a database of some sort.
Related
I am currently working on an app, which has to process several long running tasks.
I am using Python 3, Flask, Celery, and Redis.
I have a working solution on localhost, but on heroku there are many errors, and every execution of the app triggers a different set of errors each time. I know it can't be random, so I am trying to figure out where to start looking.
I have a feeling something must be wrong with redis, and I am trying to understand what clients are and where they come from, but I am not able to find official documentation or an explanation of this topic.
Question:
If the redis server is started (even on localhost), many clients are connected even though I haven't done anything. On heroku (I am using heroku-redis) I always have 6 clients, on localhost 11 clients.
I have done some research and I am able to display them with:
import os
import redis

if 'DYNO' in os.environ:
    redis_db = redis.StrictRedis(host='HOST', port=15249, password='REDISDBPW')
else:
    redis_db = redis.StrictRedis()

# see what keys are in Redis
all_keys = redis_db.keys()
print(all_keys)

all_clients = redis_db.client_list()
print(all_clients)
I see all these clients but the information there doesn't help me at all. What are they? Why are they there? Where are they coming from?
All the heroku redis add-ons have a client limit, so I need to understand and optimize this. At first I thought the number of clients equals the number of tasks, but that's not it.
In total I have 12 tasks defined, but I am testing now with 2 tasks (both finish in less than 30 sec.).
When I execute the tasks on localhost, the clients increase from 11 to 16. If I execute them once again, they go from 16 to 18, and after that they always stay at 18, no matter how often I execute the tasks.
So what is going on here? I have 2 tasks; why do the clients increase from 11 to 16 and then from 16 to 18? Why are they not closed after the task is finished?
I have been struggling with this issue for a few days now (even though it always works perfectly on localhost), so any help or ideas are welcome. I need to start looking somewhere, so currently I am trying to understand the clients.
EDIT:
I installed flower and tried to monitor the 2 tasks on localhost; everything looks good. It processes two tasks and both succeed in a few seconds. The return value is correct (but it always worked great on localhost).
Still, the issue remains: after I started flower, the number of clients jumped to 30. I still have no clue: what are clients? With the number of clients I generate, I would need a $100 add-on just to process two tasks that take a few seconds to finish. This can't be right; I still think something is wrong with redis, even on localhost.
My redis setup is pretty simple:
if 'DYNO' in os.environ:
    app.config['CELERY_BROKER_URL'] = 'redis://[the full URL from the redis add-on]'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://[the full URL from the redis add-on]'
else:
    app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost'

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'], backend=app.config['CELERY_RESULT_BACKEND'])
Here is an example of a task:
@celery.task(bind=True)
def get_users_deregistrations_task(self, g_start_date, g_end_date):
    start_date = datetime.strptime(g_start_date, '%d-%m-%Y')
    end_date = datetime.strptime(g_end_date, '%d-%m-%Y')

    a1 = db_session.query(func.sum(UsersTransactionsVK.amount)).filter(UsersTransactionsVK.date_added >= start_date, UsersTransactionsVK.date_added <= end_date, UsersTransactionsVK.payed == 'Yes').scalar()
    a2 = db_session.query(func.sum(UsersTransactionsStripe.amount)).filter(UsersTransactionsStripe.date_added >= start_date, UsersTransactionsStripe.date_added <= end_date, UsersTransactionsStripe.payed == 'Yes').scalar()
    a3 = db_session.query(func.sum(UsersTransactions.amount)).filter(UsersTransactions.date_added >= start_date, UsersTransactions.date_added <= end_date, UsersTransactions.on_hold == 'No').scalar()

    if a1 is None:
        a1 = 0
    if a2 is None:
        a2 = 0
    if a3 is None:
        a3 = 0

    amount = a1 + a2 + a3

    return {'some_value': amount}

# Selects user deregistrations between selected dates
@app.route('/get-users-deregistration', methods=["POST"])
@basic_auth.required
@check_verified
def get_users_deregistrations():
    if request.method == "POST":
        # init task
        task = get_users_deregistrations_task.apply_async([session['g_start_date'], session['g_end_date']])
        return json.dumps({}), 202, {'Location': url_for('taskstatus_get_users_deregistrations', task_id=task.id)}

@app.route('/status/<task_id>')
def taskstatus_get_users_deregistrations(task_id):
    task = get_users_deregistrations_task.AsyncResult(task_id)
    if task.state == 'PENDING':
        response = {
            'state': task.state,
            'current': 0,
            'total': 1,
            'status': 'Pending...'
        }
    elif task.state != 'FAILURE':
        response = {
            'state': task.state,
            'current': task.info['current'],
            'total': task.info['total'],
            'status': 'Finished',
            'statistic': task.info['statistic'],
            'final_dataset': task.info
        }
        if 'result' in task.info:
            response['result'] = task.info['result']
    else:
        print('in else')
        # something went wrong in the background job
        response = {
            'state': task.state,
            'current': 1,
            'total': 1,
            'status': str(task.info),  # this is the exception raised
        }
    return json.dumps(response)
EDIT:
Here is my procfile for heroku:
web: gunicorn stats_main:app
worker: celery worker -A stats_main.celery --loglevel=info
EDIT
I am thinking the issue might be the connection pool (on the redis side), which I am not using properly.
I have also found some configurations for celery and added them:
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'], backend=app.config['CELERY_RESULT_BACKEND'], redis_max_connections=20, BROKER_TRANSPORT_OPTIONS = {
'max_connections': 20,
}, broker_pool_limit=None)
I uploaded everything again to heroku with these configurations. I am still testing with only 2 tasks, which are both fast.
I have executed the tasks on heroku 10 times in a row; 7 times they worked. 3 times it looked like they finished too early: the returned result was wrong (the correct result is e.g. 30000 and those 3 times it returned 18000).
The clients quickly jumped to 20, but they never went above 20, so at least the max client error and the lost connection to redis error are resolved.
The big issue now is that the tasks can finish too early. It is very important that the returned results are correct; performance is not important at all.
EDIT
Never mind, nothing is solved, everything seems random.
I added two print() calls to one of the tasks to debug further and uploaded to heroku. After 2 executions I again see connection to redis lost and maximum number of clients reached (even though my redismonitor add-on shows that the clients never went above 20).
EDIT
The high number of clients might be caused by idle clients, which for some reason are never closed (found in a blog post by heroku):
By default, Redis will never close idle connections, which means that
if you don't close your Redis connections explicitly, you will lock
yourself out of your instance.
To ensure this doesn't happen, Heroku Redis sets a default connection
timeout of 300 seconds. This timeout doesn’t apply to
non-publish/subscribe clients, and other blocking operations.
I have now added a function that kills idle clients, called right before every one of my tasks:
def kill_idle_clients():
    if 'DYNO' in os.environ:
        redis_db = redis.StrictRedis(host='HOST', port=15249, password='REDISDBPW')
    else:
        redis_db = redis.StrictRedis()

    all_clients = redis_db.client_list()
    counter = 0
    for client in all_clients:
        if int(client['idle']) >= 15:
            redis_db.client_kill(client['addr'])
            counter += 1

    print('killing idle clients:', counter)
Before a task starts, it closes all clients that have been idle for more than 15 seconds. It works again on localhost (no surprise, it always worked on localhost). I have fewer clients, but on heroku it now worked only 2 times out of 10; the other 8 times the tasks again finished too early. Maybe the idle clients were not really idle, I have no clue.
It is also almost impossible to test, as every execution of the tasks has a different outcome (loses connection to redis, reaches the client limit, finishes too early, works perfectly).
EDIT
It seems the celery settings were being ignored the whole time. I was suspicious about this all along and decided to test it by adding some random arguments and changing values to nonsense. I restarted the celery worker, of course.
I expected to see some errors, but it works as if nothing happened.
Everything works like before with these nonsense configurations:
celery = Celery(app.name, broker=app.config['REDIS_URL'], backend=app.config['REDIS_URL'], redis_max_connections='pups', BROKER_TRANSPORT_OPTIONS = {
'max_connections': 20,
}, broker_pool_limit=None, broker_connection_timeout='pups', pups="pups")
celery.conf.broker_transport_options = {'visibility_timeout': 'pups'}
EDIT
I changed the way I load the configuration for celery (from a separate config file). It seems to work now, but the issues remain the same.
celery_task = Celery(broker=app.config['REDIS_URL'], backend=app.config['REDIS_URL'])
celery_task.config_from_object('celeryconfig')
EDIT
With these configurations I managed to cap the number of clients on localhost at 18 for all tasks (I tried all 12 tasks). However, on heroku it only "somehow" works. There are fewer clients, but the number reached 20 once, even though I thought I could not exceed 18. (I tested on heroku with 4 tasks.)
Testing on heroku with all 12 tasks triggers many different SQL errors. I am now more confused than before. It seems the same task is executed multiple times, but I see only 12 task URLs.
I think that because the SQL errors are, for example:
sqlalchemy.exc.InternalError: (pymysql.err.InternalError) Packet sequence number wrong - got 117 expected 1
or
sqlalchemy.exc.InterfaceError: (pymysql.err.InterfaceError) (0, '')
or
Multiple rows were found for one()
I tested a few times on heroku with 4 tasks, and there were times the task results were returned, but the results were very strange.
This time the tasks did not finish too early but returned inflated values; it looked like task A returned its value twice and the results were summed.
Example: task A must return 10k, but it returned 20k, so the task was executed twice and the results were added together.
Here are my current configurations. I still don't understand the math 100%, but I think it is (for the number of clients):
max_concurrency * CELERYD_MAX_TASKS_PER_CHILD
On localhost I found a new CLI command to inspect worker stats, and I had max_concurrency=3 and CELERYD_MAX_TASKS_PER_CHILD=6.
CLI command:
celery -A stats_main.celery_task inspect stats
My current configurations:
worker start:
celery worker -A stats_main.celery_task --loglevel=info --autoscale=10,3
config:
CELERY_REDIS_MAX_CONNECTIONS=20
BROKER_POOL_LIMIT=None
CELERYD_WORKER_LOST_WAIT=20
CELERYD_MAX_TASKS_PER_CHILD=6
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 18000} # 5 hours
CELERY_RESULT_DB_SHORT_LIVED_SESSIONS = True  # useful against intermittent errors like (OperationalError) (2006, 'MySQL server has gone away')
EDIT
Seeing all these SQL errors, I decided to research in a completely different direction. My new theory is that it could be a MySQL issue.
I adjusted my connection to the MySQL server as described in the answer to this question.
I also found out that pymysql has threadsafety=1. I don't know yet whether this could be an issue, but it seems MySQL has something to do with connections and connection pools.
At the moment I can also say that memory cannot be the issue, because if the packets were too big it shouldn't work on localhost either, which means I left max_allowed_packet at the default value, which is around 4MB.
I have also created 3 dummy tasks that do some simple calculations without connecting to an external MySQL DB. I have now executed them 5 times on heroku and there were no errors; the results were always correct. So I assume the issue is not celery or redis but MySQL, even though I have no clue why it works on localhost. Maybe it is a combination of all 3 that leads to the issues on heroku.
EDIT
I adjusted my JS file. Now every task is called one after another, which means they are not async (I still use celery's apply_async because apply did not work).
So it is a hard workaround. I simply created a variable for each task, e.g. var task_1_rdy = false;
I also created a function that runs every 2 seconds and checks whether one task is ready; if it is ready, it starts the next task. I think it is easy to understand what I did here.
I tested this on heroku and had no errors at all, even with multiple tasks, so the issue may be solved. I need to run more tests, but it looks very promising. Of course, I am not using the async functionality, and running task after task will probably have the worst performance, but hey, it works now. I will benchmark the performance difference and update the question on Monday.
EDIT
I have done a lot of testing today. The time it takes until the tasks complete is the same (sync vs. async); I don't know why, but it is the same.
Working with all 12 tasks on heroku and selecting a huge time range (a huge time range means the tasks take longer, because there is more data to process):
Again, the task results are not precise. The returned values are wrong, only slightly wrong, but wrong and therefore not reliable; e.g. task A must return 20k and on heroku it returned 19500. I don't know how it is possible that data is lost / the task returns too early, but in 2 weeks I will give up and try a completely different system.
It sounds like you have a REST API using Celery workers with redis as the message queue.
Here is the checklist:
1. In your client, did you close the connection after the logic finished?
2. Celery spawns new workers, and those workers may cause trouble; try monitoring celery with flower.
3. Make sure your client finishes the task; try debugging by printing something. Sometimes staging and local environments have network issues that stop you from finishing the celery task.
4. If you are using redis as the celery message queue, try monitoring the number of queues; maybe they auto-scale up?
Now I am 60% sure that it is your task that is taking too long and the server cannot respond within the default web request timeout. The 70% / 30% thing applies when you are on a local machine, where the network is very fast. On a cloud platform latency is the problem, and sometimes it affects your program. Before that, if the celery worker fails, it will automatically create another worker to finish the unfinished job (because of gunicorn and celery), which causes the increase in connections.
So the solutions are:
Option 1: make your task finish faster.
Option 2: return an acknowledgement first, do the calculation in the background, and make another API call to send back the results (see the sketch below).
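A rough client-side sketch of option 2, reusing the /get-users-deregistration and /status/<task_id> endpoints from the question; the polling interval and the use of the requests library are illustrative assumptions:
import time
from urllib.parse import urljoin

import requests

def run_report(base_url, payload):
    # Kick off the task; the server answers immediately with 202 + a Location header.
    resp = requests.post(urljoin(base_url, '/get-users-deregistration'), data=payload)
    status_url = urljoin(base_url, resp.headers['Location'])

    # Poll the status endpoint instead of holding the original request open.
    while True:
        state = requests.get(status_url).json()
        if state['state'] in ('SUCCESS', 'FAILURE'):
            return state
        time.sleep(2)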
I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case, one an out of the box solution by celery, and another one that you implement yourself.
You can do something like this with celery workers. In particular, you only create two worker processes with concurrency=1 (or one with concurrency=2, but then you get two threads rather than different processes); this way, only two jobs can be done asynchronously. Now you need a way to raise exceptions if both slots are occupied, so you use inspect to count the number of active tasks and throw an exception if required. For an implementation, you can check out this SO post.
You might also be interested in rate limits.
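For reference, a small sketch of Celery's task rate_limit option; the '2/h' value and the do_huge_computation() helper are illustrative only, and note that a rate limit caps how often tasks start rather than how many run at once:
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task(rate_limit='2/h')  # at most two executions per hour, per worker
def huge_job():
    do_huge_computation()  # hypothetical long-running work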
You can do it all yourself, using a locking solution of choice. In particular, a nice implementation that makes sure only two processes are running with redis (and redis-py) is as simple as the following. (Considering you know redis, since you know celery)
from redis import StrictRedis

class SystemLimitsReached(Exception):
    pass

redis = StrictRedis('localhost', 6379)
locks = ['compute:lock1', 'compute:lock2']
acquired = False

for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        do_huge_computation()
        lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity !")
This way you make sure only two running processes can exist in the system. A third process will block on the lock.acquire() line for blocking_timeout seconds; if the locking was successful, acquired will be True, otherwise it is False and you tell your user to wait!
I had the same requirement sometime in the past and what I ended up coding was something like the solution above. In particular
This has the least amount of race conditions possible
It's easy to read
Doesn't depend on a sysadmin suddenly doubling the concurrency of workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, by only changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly. You can't do this one with vanilla celery. A sketch of this variant follows below.
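A sketch of the per-user variant, with the lock keys simply namespaced by the user's id (the key names and the helper name are made up for illustration):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)

def acquire_user_slot(user_id, blocking_timeout=5):
    """Return an acquired lock for one of the user's two slots, or None."""
    for key in ('compute:%s:lock1' % user_id, 'compute:%s:lock2' % user_id):
        lock = redis.lock(key, blocking_timeout=blocking_timeout)
        if lock.acquire():
            return lock  # the caller must release() it when the job is done
    return None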
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This will help make sure that you do not have more than 2 active tasks even if you have errors in your code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:
from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery@%s' % <worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])

    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store the number of currently executing tasks in a DB and validate task execution based on it (a sketch follows below).
The second approach could be better if you want to scale your functionality, e.g. introduce premium users, or disallow a single user from executing 2 requests at once.
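A rough sketch of that second approach, using an atomic Redis counter as the shared "currently running" count (a real DB row updated inside a transaction would work the same way); the key name and limit are illustrative, and MyCustomException / my_task are the names from the snippet above:
import redis

r = redis.StrictRedis()
RUNNING_KEY = 'jobs:running'
MAX_RUNNING = 2

def start_job():
    running = r.incr(RUNNING_KEY)      # atomically reserve a slot
    if running > MAX_RUNNING:
        r.decr(RUNNING_KEY)            # give the slot back
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    try:
        my_task.delay()
    except Exception:
        r.decr(RUNNING_KEY)            # do not leak the slot if enqueueing fails
        raise

# The task itself must decrement jobs:running when it finishes,
# e.g. in a finally block or a task_postrun signal handler.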
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set a timeout for tasks waiting in the queue via CELERY_EVENT_QUEUE_TTL, and a queue length limit, as described in "how to limit number of tasks in queue and stop feeding when full?".
Therefore, when two jobs are already running and a task has been waiting in the queue for, say, 10 seconds (or whatever period you like), the task will time out. Or, if the queue is full, newly arriving tasks will be dropped.
Third
You need something extra to deal with notifying "users when jobs are rejected because the server is busy".
Dead Letter Exchanges are what you need: every time a task fails because of the queue length limit or a message timeout, "messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route these messages to another queue; once that queue receives the dead-lettered message, you can send a notification message to the user. A sketch of the queue declaration follows below.
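A sketch of how such a queue could be declared with kombu/Celery queue arguments; the queue names, TTL and length limit are illustrative values only:
from kombu import Exchange, Queue

dead_letter_queue = Queue('dead_jobs', Exchange('dead_jobs'), routing_key='dead_jobs')

job_queue = Queue(
    'jobs', Exchange('jobs'), routing_key='jobs',
    queue_arguments={
        'x-max-length': 2,                       # queue length limit
        'x-message-ttl': 10000,                  # wait at most 10 s (in ms) before timing out
        'x-dead-letter-exchange': 'dead_jobs',   # rejected/expired messages go here
        'x-dead-letter-routing-key': 'dead_jobs',
    },
)

# app.conf.task_queues = (job_queue, dead_letter_queue)
# A small consumer on 'dead_jobs' can then notify the user that the server was busy.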
I have a Flask application that is supposed to display the result of a long running function to the user on a specified route. The result is about to change every hour or so. In order to avoid the user having to wait for the result, I want to have it cached somewhere in the application and re-compute it in specific intervals in the background (e.g. every hour) so that no user request ever has to wait for the long running computation function.
The idea I came up with to solve this is as follows, however, I am not completely sure whether this is really "safe" to do in a production environment with a multi-threaded or even multi-processed webserver such as waitress, eventlet, gunicorn or what not.
To re-compute the result in the background, I use a BackgroundScheduler from the APScheduler library.
The result is then left-appended in a collections.deque object which is registered as a module-wide variable (since there is no better possibility to save application wide globals in a Flask application as far as I know?!). Since the maximum size of the deque is set as 2, old results will pop out on the right side of the deque as new ones come in.
A Flask view now returns deque[0] to the requester which should always be the newest result. I decided for deque over Queue since the latter has no built-in possibility to read the first item without removing it.
Thus, it is guaranteed that no user ever has to wait for the result because the old one only disappears from "cache" in the very moment the new one comes in.
See below for a minimal example of this. When running the script and hitting http://localhost:5000, one can see the caching in action - "Job finished at" should never be later than 10 seconds plus some very short time for re-computing it behind "Current time", still one should never have to wait the time.sleep(5) seconds from the job function until the request returns.
Is this a valid implementation for the given requirement that will also work in a production-ready WSGI server setting or should this be accomplished differently?
from flask import Flask
from apscheduler.schedulers.background import BackgroundScheduler
import time
import datetime
from collections import deque

# a global deque that is filled by APScheduler and read by a Flask view
deque = deque(maxlen=2)

# a function filling the deque that is executed in regular intervals by APScheduler
def some_long_running_job():
    print('complicated long running job started...')
    time.sleep(5)
    job_finished_at = datetime.datetime.now()
    deque.appendleft(job_finished_at)

# a function setting up the scheduler
def start_scheduler():
    scheduler = BackgroundScheduler()
    scheduler.add_job(some_long_running_job,
                      trigger='interval',
                      seconds=10,
                      next_run_time=datetime.datetime.utcnow(),
                      id='1',
                      name='Some Job name'
                      )
    scheduler.start()

# a flask application
app = Flask(__name__)

# a flask route returning an item from the global deque
@app.route('/')
def display_job_result():
    current_time = datetime.datetime.now()
    job_finished_at = deque[0]
    return '''
        Current time is: {0} <br>
        Job finished at: {1}
    '''.format(current_time, job_finished_at)

# start the scheduler and flask server
if __name__ == '__main__':
    start_scheduler()
    app.run()
Thread-safety is not enough if you run multiple processes:
Even though collections.deque is thread-safe:
Deques support thread-safe, memory efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction.
Source: https://docs.python.org/3/library/collections.html#collections.deque
Depending on your configuration, your webserver might run multiple workers in multiple processes, so each of those processes has their own instance of the object.
Even with one worker, thread-safety might not be enough:
You might have selected an asynchronous worker type. The asynchronous worker won't know when it's safe to yield and your code would have to be protected against scenarios like this:
Worker for request 1 reads value a and yields
Worker for request 2 also reads value a, writes a + 1 and yields
Worker for request 1 writes value a + 1, even though it should be a + 1 + 1
Possible solutions:
Use something outside of the Flask app to store the data. This can be a database, in this case preferably an in-memory database like Redis. Or if your worker type is compatible with the multiprocessing module, you can try to use multiprocessing.managers.BaseManager to provide your Python object to all worker processes.
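A minimal sketch of the Redis variant, assuming a local Redis instance: the scheduled job writes its latest result to Redis and the Flask view reads it back, so every worker process sees the same value. The key name and string serialization are illustrative choices.
import datetime
import redis
from flask import Flask
from apscheduler.schedulers.background import BackgroundScheduler

r = redis.StrictRedis()
app = Flask(__name__)

def some_long_running_job():
    job_finished_at = datetime.datetime.now()
    r.set('job:latest_result', job_finished_at.isoformat())

@app.route('/')
def display_job_result():
    latest = r.get('job:latest_result')
    return 'Job finished at: {}'.format(latest.decode() if latest else 'not computed yet')

if __name__ == '__main__':
    scheduler = BackgroundScheduler()
    scheduler.add_job(some_long_running_job, trigger='interval', seconds=10,
                      next_run_time=datetime.datetime.utcnow())
    scheduler.start()
    app.run()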
Are global variables thread safe in flask? How do I share data between requests?
How can I provide shared state to my Flask app with multiple workers without depending on additional software?
Store large data or a service connection per Flask session
I am running processes in parallel but need to create a database for each CPU process to write to. I only want as many databases as CPUs assigned on each server, so the 100 jobs are written to 3 databases that can be merged afterwards.
Is there a worker id number or core id that I can use to identify each worker?
def workerProcess(job):
    if workerDBexist('c:\\temp\\db\\' + workerid):
        # process job into this database
        ...
    else:
        # first time this 'worker/core' is used, make DB then process
        makeDB('c:\\temp\\db\\' + workerid)
        ...

import pp

ppservers = ()
ncpus = 3
job_server = pp.Server(ncpus, ppservers=ppservers)

for work in workItems:  # 100 work items
    job_server.submit(workerProcess, (work,))
As far as I know, pp doesn't have any such feature in its API.
If you used the stdlib modules instead, that would make your life a lot easier—e.g., multiprocessing.Pool takes an initializer argument, which you could use to initialize a database for each process, which would then be available as a variable that each task could use.
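For illustration, a sketch of that stdlib route (not the asker's pp setup): multiprocessing.Pool's initializer opens one sqlite3 database per worker process; sqlite3, the table, and the path are illustrative choices, and the directory is assumed to exist.
import os
import sqlite3
import multiprocessing

db = None  # one connection per worker process

def init_worker(db_dir):
    global db
    path = os.path.join(db_dir, 'database{}.db'.format(os.getpid()))
    db = sqlite3.connect(path)
    db.execute('CREATE TABLE IF NOT EXISTS results (job TEXT)')

def workerProcess(job):
    # each job reuses the process-local connection set up by init_worker
    db.execute('INSERT INTO results VALUES (?)', (str(job),))
    db.commit()

if __name__ == '__main__':
    pool = multiprocessing.Pool(3, initializer=init_worker, initargs=(r'c:\temp\db',))
    pool.map(workerProcess, range(100))
    pool.close()
    pool.join()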
However, there is a relatively easy workaround.
Each process has a unique (at least while it's running) process ID.* In Python, you can access the process ID of the current process with os.getpid(). So, in each task, you can do something like this:
dbname = 'database{}'.format(os.getpid())
Then use dbname to open/create the database. I don't know whether by "database" you mean a dbm file, a sqlite3 file, a database on a MySQL server, or what. You may need to, e.g., create a tempfile.TemporaryDirectory in the parent, pass it to all of the children, and have them os.path.join it to the dbname (so after all the children are done, you can grab everything in os.listdir(the_temp_dir)).
The problem with this is that if pp.Server restarts one of the processes, you'll end up with 4 databases instead of 3. Probably not a huge deal, but your code should deal with that possibility. (IIRC, pp.Server usually doesn't restart the processes unless you pass restart=True, but it may do so if, e.g., one of them crashes.)
But what if (as seems to be the case) you're actually running each task in a brand-new process, rather than using a pool of 3 processes? Well, then you're going to end up with as many databases as there are processes, which probably isn't what you want. Your real problem here is that you're not using a pool of 3 processes, which is what you ought to fix. But are there other ways you could get what you want? Maybe.
For example, let's say you created three locks, one for each database, maybe as lockfiles. Then, each task could do this pseudocode:
for i, lockfile in enumerate(lockfiles):
try:
with lockfile:
do stuff with databases[i]
break
except AlreadyLockedError:
pass
else:
assert False, "oops, couldn't get any of the locks"
If you can actually lock the databases themselves (with an flock, or with some API for the relevant database, etc.) things are even easier: just try to connect to them in turn until one of them succeeds.
As long as your code isn't actually segfaulting or the like,** if you're actually never running more than 3 tasks at a time, there's no way all 3 lockfiles could be locked, so you're guaranteed to get one.
* This isn't quite true, but it's true enough for your purposes. For example, on Windows, each process has a unique HANDLE, and if you ask for its pid one will be generated if it didn't already have one. And on some *nixes, each thread has a unique thread ID, and the process's pid is the thread ID of the first thread. And so on. But as far as your code can tell, each of your processes has a unique pid, which is what matters.
** Even if your code is crashing, you can deal with that, it's just more complicated. For example, use pidfiles instead of empty lockfiles. Get a read lock on the pidfile, then try to upgrade to a write lock. If it fails, read the pid from the file, and check whether any such process exists (e.g., on *nix, if os.kill(pid, 0) raises, there is no such process), and if so forcibly break the lock. Either way, now you've got a write lock, so write your pid to the file.
If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
redis_conn = Redis()
use_connection(redis_conn)
q = Queue('normal', connection=redis_conn) # this is terrible, I know - fixing later
w = Worker(q)
job = q.enqueue(getlinksmod.lsGet, theURL,total,domainid)
w.work()
I assumed my best solution was to have 2 workers, one for job A and one for B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out to save my life is how I get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But haven't the foggiest as to how I pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?
Update January 2015: this pull request has now been merged, and the parameter has been renamed to depends_on, i.e.:
second_job = q.enqueue(email_customer, depends_on=first_job)
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
def generate_report():
    pass

def email_customer():
    pass

first_job = q.enqueue(generate_report)
second_job = q.enqueue(email_customer, after=first_job)
# In the second enqueue call, job is created,
# but only moved into queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
def generate_report():
    pass

def email_customer():
    pass

def generate_report_and_email():
    generate_report()
    email_customer()  # You can also enqueue this function, if you really want to

# Somewhere else
q.enqueue(generate_report_and_email)
From this page in the rq docs, it looks like each job object has a result attribute, accessible as job.result, which you can check. If the job hasn't finished, it will be None, but if you ensure that your job returns some value (even just "Done"), then you can have your other worker check the result of the first job and begin working only when job.result has a value, meaning the first worker has completed. A sketch of this follows below.
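A rough sketch of that polling idea, reusing the queue and the getlinksmod.lsGet call from the question; process_records is a hypothetical stand-in for job B:
import time
from redis import Redis
from rq import Queue

q = Queue('normal', connection=Redis())

job_a = q.enqueue(getlinksmod.lsGet, theURL, total, domainid)

while job_a.result is None:   # stays None until the worker finishes (or fails)
    time.sleep(5)
    job_a.refresh()           # re-fetch the job's state from redis

job_b = q.enqueue(process_records)  # hypothetical "job B"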
You are probably too deep into your project to switch, but if not, take a look at Twisted. http://twistedmatrix.com/trac/ I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel, as well as organizing certain jobs in order, so Job B doesn't execute until Job A is done.
This is the best tutorial for learning Twisted if you want to attempt. http://krondo.com/?page_id=1327
Combine the things that job A and job B do into one function, and then use e.g. multiprocessing.Pool (its map_async method) to farm that out over different processes.
I'm not familiar with rq, but multiprocessing is a part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.
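A short sketch of that suggestion; fetch_and_enrich, grab_data_via_api, add_response_data, list_of_urls and send_happy_email are placeholders for the asker's actual code:
import multiprocessing

def fetch_and_enrich(url):
    records = grab_data_via_api(url)   # "job A" for this url
    return add_response_data(records)  # "job B" for the same records

if __name__ == '__main__':
    pool = multiprocessing.Pool()      # defaults to one process per CPU core
    result = pool.map_async(fetch_and_enrich, list_of_urls)
    result.wait()
    pool.close()
    pool.join()
    send_happy_email()                 # notify the user once everything is done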