Same task executed multiple times - python

I have ETA tasks that get sent to a Redis broker for Celery. It is a single Celery and Redis instance, both on the same machine.
The problem is that tasks are getting executed multiple times. I've seen tasks executed 4 to 11 times.
I set the visibility timeout to 12 hours, given that my ETAs are between 4 and 11 hours (determined at runtime):
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 12 * 60 * 60}
Even with that, tasks still get executed multiple times.
Initially, the task in question was not idempotent, so I tried adding a DB check to make it idempotent.
It looks something like this:
@app.task
def foo(side_effect_action):
    if side_effect_action.executed:
        return ALREADY_EXECUTED
    else:
        do_side_effect()
        side_effect_action.executed = True
        side_effect_action.save()  # hits the db
        return JUST_EXECUTED
It turns out that the celery worker gets to the task again before foo is able to call side_effect_action.save() and persist the state, so every time it checks side_effect_action.executed the flag is still False, and thus the task gets executed multiple times.
Any ideas how can I solve this issue?
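(As an aside, the check-and-set itself can be made atomic at the database level so that two deliveries cannot both see executed == False. A hedged sketch, assuming a Django-style model named SideEffectAction and that the task receives the row's primary key rather than the instance:)

@app.task
def foo(side_effect_action_id):
    # A conditional UPDATE is atomic in the database: it reports how many rows
    # it actually changed, so only one delivery can claim the action.
    claimed = SideEffectAction.objects.filter(
        pk=side_effect_action_id, executed=False
    ).update(executed=True)
    if not claimed:
        return ALREADY_EXECUTED
    do_side_effect()
    return JUST_EXECUTED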

I switched my Celery broker to RabbitMQ to avoid this issue. It is unfortunate, since I now have one more component in my webapp (I still need Redis for something else), but it did solve the multiple-execution bug for ETA tasks.
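For reference, the switch boils down to pointing the broker URL at RabbitMQ while leaving Redis in place for everything else; a minimal sketch, assuming a default local RabbitMQ install (the exact URLs are assumptions):

BROKER_URL = 'amqp://guest:guest@localhost:5672//'   # assumed default RabbitMQ credentials
CELERY_RESULT_BACKEND = 'redis://localhost:6379/1'   # hypothetical: results can stay in Redis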

Related

Celery Django running periodic tasks after the previous one was done. [django-celery-beat]

I want to use the django-celery-beat library to make some changes in my database periodically. I set the task to run every 10 minutes. Everything works fine as long as my task takes less than 10 minutes; if it lasts longer, the next task starts while the first one is still doing calculations, and that causes an error.
My task looks like this:
from celery import shared_task
from .utils.database_blockchain import BlockchainVerify

@shared_task()
def run_function():
    build_block = BlockchainVerify()
    return "Database updated"
Is there a way to avoid starting the same task if the previous one isn't done?
There is definitely a way. It's locking.
There is a whole page in the Celery documentation about this - Ensuring a task is only executed one at a time.
In short: you can use a cache or even the database to hold a lock, and every time the task starts, check whether that lock is still held or has already been released.
Be aware that the task may fail or run longer than expected. Task failure can be handled by adding an expiration to the lock, and the expiration should be long enough to cover the case where the task is still running.
There already is a good thread on SO - link.
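A hedged sketch of that idea, using Django's cache framework (the lock key name and the timeout are assumptions; any cache backend with an atomic add works):

from celery import shared_task
from django.core.cache import cache

from .utils.database_blockchain import BlockchainVerify

LOCK_EXPIRE = 15 * 60  # seconds; should comfortably exceed the task's worst-case runtime

@shared_task()
def run_function():
    # cache.add() is atomic: it only succeeds if the key does not exist yet.
    if not cache.add('run_function_lock', 'locked', LOCK_EXPIRE):
        return "Previous run still in progress, skipping"
    try:
        BlockchainVerify()
        return "Database updated"
    finally:
        # Release the lock so the next scheduled run can start.
        cache.delete('run_function_lock')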

Celery apply_async gets called multiple times

I have created a task
@app.task(bind=True, max_retries=1)
def notify_feedback(self, req_id):
    # some things
I have called this task from my view with a delay of 1 hour, like this:
later = datetime.datetime.utcnow() + datetime.timedelta(hours=1)
notify_feedback.apply_async((req_id,), eta=later)
When I checked the SQS Messages in Flight, it showed 1 message pending.
After one hour, this notify_feedback gets called multiple times. Has anyone encountered this kind of issue with Celery?
Celery 4.1.0 is used.
I faced this issue as well, but with tasks delayed by more than 1 hour.
Setting this in settings.py solved my issue:
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 86400}
The visibility timeout defines the number of seconds to wait for the worker to acknowledge the task before the message is redelivered to another worker.
More details there.

Python redis and celery too many clients, different errors on each execution | Tasks connect to MySQL using pymysql

I am currently working on an app, which has to process several long running tasks.
I am using python 3, flask, celery, redis.
I have a working solution on localhost, but on heroku there are many errors, and every execution of the app triggers a different set of errors every time. I know it can't be random, so I am trying to figure out where to start looking.
I have a feeling something must be wrong with redis, and I am trying to understand what clients are and where they come from, but I am not able to find official documentation or an explanation on this topic.
Question:
If the redis server is started (even on localhost), many clients are connected, even though I haven't done anything. On heroku (I am using heroku-redis) I always have 6 clients, on localhost 11 clients.
I have done some research and I am able to display them with:
if 'DYNO' in os.environ:
    redis_db = redis.StrictRedis(host='HOST', port=15249, password='REDISDBPW')
else:
    redis_db = redis.StrictRedis()

# see what keys are in Redis
all_keys = redis_db.keys()
print(all_keys)

all_clients = redis_db.client_list()
print(all_clients)
I see all these clients but the information there doesn't help me at all. What are they? Why are they there? Where are they coming from?
All the heroku redis add-ons have a client limit, so I need to understand and optimize this. At first I thought the number of clients equaled the number of tasks, but that's not it.
In total I have 12 tasks defined, but I am testing now with 2 tasks (both finish in less than 30 sec.).
When I execute the tasks on localhost, the clients increase from 11 to 16. If I execute them once again, they go from 16 to 18, and after this they always stay at 18, no matter how often I execute the tasks.
So what is going on here? I have 2 tasks; why do the clients increase from 11 to 16 and then from 16 to 18? Why are they not closed after the tasks finish?
I have been struggling with this issue for a few days now (even though it always works perfectly on localhost), so any help or ideas are welcome. I need to start looking somewhere, so currently I am trying to understand the clients.
EDIT:
I installed Flower and tried to monitor the 2 tasks on localhost; everything looks good. It processes two tasks and both succeed in a few seconds. The return value is correct (but it always worked great on localhost).
Still, the issue is that after I started Flower the number of clients jumped to 30. I still have no clue: what are clients? With the number of clients I generate, I would need a $100 add-on just to process two tasks that finish in a few seconds; this can't be right. I still think something is wrong with redis, even on localhost.
My redis setup is pretty simple:
if 'DYNO' in os.environ:
    app.config['CELERY_BROKER_URL'] = 'redis://[the full URL from the redis add-on]'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://[the full URL from the redis add-on]'
else:
    app.config['CELERY_BROKER_URL'] = 'redis://localhost:6379/0'
    app.config['CELERY_RESULT_BACKEND'] = 'redis://localhost'

celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'], backend=app.config['CELERY_RESULT_BACKEND'])
Here is an example of a task:
@celery.task(bind=True)
def get_users_deregistrations_task(self, g_start_date, g_end_date):
    start_date = datetime.strptime(g_start_date, '%d-%m-%Y')
    end_date = datetime.strptime(g_end_date, '%d-%m-%Y')

    a1 = db_session.query(func.sum(UsersTransactionsVK.amount)).filter(UsersTransactionsVK.date_added >= start_date, UsersTransactionsVK.date_added <= end_date, UsersTransactionsVK.payed == 'Yes').scalar()
    a2 = db_session.query(func.sum(UsersTransactionsStripe.amount)).filter(UsersTransactionsStripe.date_added >= start_date, UsersTransactionsStripe.date_added <= end_date, UsersTransactionsStripe.payed == 'Yes').scalar()
    a3 = db_session.query(func.sum(UsersTransactions.amount)).filter(UsersTransactions.date_added >= start_date, UsersTransactions.date_added <= end_date, UsersTransactions.on_hold == 'No').scalar()

    if a1 is None:
        a1 = 0
    if a2 is None:
        a2 = 0
    if a3 is None:
        a3 = 0

    amount = a1 + a2 + a3
    return {'some_value': amount}

# Selects user deregistrations between selected dates
@app.route('/get-users-deregistration', methods=["POST"])
@basic_auth.required
@check_verified
def get_users_deregistrations():
    if request.method == "POST":
        # init task
        task = get_users_deregistrations_task.apply_async([session['g_start_date'], session['g_end_date']])
        return json.dumps({}), 202, {'Location': url_for('taskstatus_get_users_deregistrations', task_id=task.id)}

@app.route('/status/<task_id>')
def taskstatus_get_users_deregistrations(task_id):
    task = get_users_deregistrations_task.AsyncResult(task_id)
    if task.state == 'PENDING':
        response = {
            'state': task.state,
            'current': 0,
            'total': 1,
            'status': 'Pending...'
        }
    elif task.state != 'FAILURE':
        response = {
            'state': task.state,
            'current': task.info['current'],
            'total': task.info['total'],
            'status': 'Finished',
            'statistic': task.info['statistic'],
            'final_dataset': task.info
        }
        if 'result' in task.info:
            response['result'] = task.info['result']
    else:
        print('in else')
        # something went wrong in the background job
        response = {
            'state': task.state,
            'current': 1,
            'total': 1,
            'status': str(task.info),  # this is the exception raised
        }
    return json.dumps(response)
EDIT:
Here is my procfile for heroku:
web: gunicorn stats_main:app
worker: celery worker -A stats_main.celery --loglevel=info
EDIT
I am thinking the issue might be the connection pool (on the redis side), which I am not using properly.
I have also found some configurations for celery and added them:
celery = Celery(app.name, broker=app.config['CELERY_BROKER_URL'], backend=app.config['CELERY_RESULT_BACKEND'],
                redis_max_connections=20,
                BROKER_TRANSPORT_OPTIONS={'max_connections': 20},
                broker_pool_limit=None)
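On the redis-py side, a common way to avoid opening fresh connections on every call is a single module-level ConnectionPool that every client reuses; a hedged sketch using the placeholder credentials from above (the max_connections value is an assumption):

import os
import redis

# One pool per process; every client built from it reuses the pool's connections.
if 'DYNO' in os.environ:
    pool = redis.ConnectionPool(host='HOST', port=15249, password='REDISDBPW',
                                max_connections=10)
else:
    pool = redis.ConnectionPool(max_connections=10)

redis_db = redis.StrictRedis(connection_pool=pool)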
I uploaded everything again to heroku with these configurations. I am still testing with only 2 tasks, which are both fast.
I executed the tasks on heroku 10 times in a row; 7 times they worked. 3 times they appeared to finish too early: the returned result was wrong (the correct result is e.g. 30000 and 3 times it returned 18000).
The clients quickly jumped to 20, but they never went above 20, so at least the max-client error and the lost-connection-to-redis error are resolved.
The big issue now is that the tasks can finish too early. It's very important that the returned results are correct; performance is not important at all.
EDIT
Nevermind, nothing is solved, everything seems random.
I added two print() calls in one of the tasks to debug further and uploaded to heroku. After 2 executions I again see "connection to redis lost" and "maximum number of clients reached" (even though my redis monitor add-on shows that clients never went above 20).
EDIT
The high number of clients might be caused by idle clients, which for some reason are never closed (found in a blog post by Heroku):
By default, Redis will never close idle connections, which means that
if you don't close your Redis connections explicitly, you will lock
yourself out of your instance.
To ensure this doesn't happen, Heroku Redis sets a default connection
timeout of 300 seconds. This timeout doesn’t apply to
non-publish/subscribe clients, and other blocking operations.
I added now a kill function for idle clients right before EVERY one of my tasks:
def kill_idle_clients():
    if 'DYNO' in os.environ:
        redis_db = redis.StrictRedis(host='HOST', port=15249, password='REDISDBPW')
    else:
        redis_db = redis.StrictRedis()

    all_clients = redis_db.client_list()
    counter = 0
    for client in all_clients:
        if int(client['idle']) >= 15:
            redis_db.client_kill(client['addr'])
            counter += 1

    print('killing idle clients:', counter)
Before a task starts, it closes all clients that have been idle for more than 15 seconds. It works again on localhost (no surprise, it always worked on localhost). I have fewer clients, but on heroku it now worked only 2 times out of 10; 8 times the tasks finished too early again. Maybe the idle clients were not really idle, I have no clue.
It's also almost impossible to test, as every execution of the tasks has a different outcome (loses connection to redis, reaches the client limit, finishes too early, works perfectly).
EDIT
It seems the celery settings were being ignored all along. I was suspicious about this the whole time and decided to test it by adding some random arguments and changing values to nonsense. I restarted the celery worker, of course.
I expected to see some errors, but it works as if nothing happened.
Everything works like before with these nonsense configurations:
celery = Celery(app.name, broker=app.config['REDIS_URL'], backend=app.config['REDIS_URL'],
                redis_max_connections='pups',
                BROKER_TRANSPORT_OPTIONS={'max_connections': 20},
                broker_pool_limit=None, broker_connection_timeout='pups', pups="pups")
celery.conf.broker_transport_options = {'visibility_timeout': 'pups'}
EDIT
I changed the way I load configurations for celery (from a separate config file). It seems to work now, but the issues remain the same.
celery_task = Celery(broker=app.config['REDIS_URL'], backend=app.config['REDIS_URL'])
celery_task.config_from_object('celeryconfig')
EDIT
With these configurations I managed to cap the number of clients on localhost at 18 for all tasks (I tried all 12 tasks). However, on heroku it "somehow" works. There are fewer clients, but the number reached 20 once, even though I thought I could not exceed 18. (I tested on heroku with 4 tasks.)
Testing on heroku with all 12 tasks triggers many different SQL errors. I am now more confused than before. It seems the same task is executed multiple times, but I see only 12 task URLs.
I think that because the SQL errors are, for example:
sqlalchemy.exc.InternalError: (pymysql.err.InternalError) Packet sequence number wrong - got 117 expected 1
or
sqlalchemy.exc.InterfaceError: (pymysql.err.InterfaceError) (0, '')
or
Multiple rows were found for one()
I tested a few times on heroku with 4 tasks, and there were times the task results were returned, but the results were very weird.
This time the tasks did not finish too early but returned inflated values; it looked like task A returned its value twice and the results were summed.
Example: task A must return 10k, but it returned 20k, so the task was executed twice and the results were summed.
Here are my current configurations. I still don't understand the math 100%, but I think the number of clients is:
max_concurrency * CELERYD_MAX_TASKS_PER_CHILD
On localhost I found a new CLI command to inspect worker stats, and I had max_concurrency=3 and CELERYD_MAX_TASKS_PER_CHILD=6.
CLI command:
celery -A stats_main.celery_task inspect stats
My current configurations:
worker start:
celery worker -A stats_main.celery_task --loglevel=info --autoscale=10,3
config:
CELERY_REDIS_MAX_CONNECTIONS=20
BROKER_POOL_LIMIT=None
CELERYD_WORKER_LOST_WAIT=20
CELERYD_MAX_TASKS_PER_CHILD=6
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 18000} # 5 hours
CELERY_RESULT_DB_SHORT_LIVED_SESSIONS = True  # useful for intermittent errors like (OperationalError) (2006, 'MySQL server has gone away')
EDIT
Seeing all these SQL errors, I decided to research in a completely different direction. My new theory is that it could be a MySQL issue.
I adjusted my connection to the MySQL server as described in the answer to this question.
I also found out that pymysql has threadsafety=1. I don't know yet whether this could be an issue, but it seems MySQL has something to do with connections and connection pools.
At the moment I can also say that memory cannot be the issue, because if the packets were too big it shouldn't work on localhost either; I left max_allowed_packet at the default value, which is around 4MB.
I have also created 3 dummy tasks, which do some simple calculations without connecting to an external MySQL DB. I have now executed them 5 times on heroku and there were no errors and the results were always correct, so I assume the issue is not celery or redis but MySQL, even though I have no clue why it works on localhost. Maybe it's a combination of all 3 that leads to the issues on heroku.
EDIT
I adjusted my JS file. Now every task is called one after another, which means they are not async (I still use celery's apply_async because apply did not work).
So it's a crude workaround. I simply created a variable for each task, e.g. var task_1_rdy = false;
I also created a function that runs every 2 seconds and checks whether one task is ready; if it is, it starts the next task. I think it's easy to understand what I did here.
I tested this on heroku and had no errors at all, even with multiple tasks, so the issue may be solved. I need to run more tests, but it looks very promising. Of course, I am not using the async functionality, and running task after task will probably have the worst performance, but hey, it works now. I will benchmark the performance difference and update the question on Monday.
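A server-side alternative to this client-side polling would be Celery's chain primitive, which starts each task only after the previous one has finished; a hedged sketch (the second task name is hypothetical):

from celery import chain

# .si() creates an immutable signature, so each task gets its own arguments and
# ignores the previous task's return value; the tasks simply run one after another.
chain(
    get_users_deregistrations_task.si(g_start_date, g_end_date),
    another_report_task.si(g_start_date, g_end_date),  # hypothetical second task
).apply_async()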
EDIT
I have done a lot of testing today. The time it takes until the tasks complete is the same (sync vs. async); I don't know why, but it is the same.
Working with all 12 tasks on heroku and selecting a huge time range (huge time range = tasks take longer, because there is more data to process):
Again the task results are not precise; the returned values are wrong, only slightly, but wrong and therefore not reliable. For example, task A must return 20k and on heroku it returned 19500. I don't know how it is possible that data gets lost or a task returns too early, but after 2 weeks I will give up and try a completely different system.
It sounds like you have a REST API using Celery workers with Redis as the message queue.
Here is the checklist:
1. In your client, did you close the connection after the logic finished?
2. Celery will spawn new workers; the workers may cause trouble, so try monitoring Celery with Flower.
3. Make sure your client finishes the task; try debugging by printing something. Sometimes staging and local environments have network issues which stop you from ending the Celery task.
4. If you are using Redis as the Celery message queue, try monitoring the number of queues; maybe they auto-scale up?
Now I am 60% sure that it is your task that is taking too long and the server cannot respond within the default web request return time. The 70% / 30% thing applies when you are on a local machine, where the network is very fast. On a cloud platform, latency is the problem and it sometimes affects your program. Beyond that, if the Celery worker fails, gunicorn and Celery will automatically create another worker to finish the unfinished job, which causes the increase in connections.
So the solution is:
Option 1: make your task finish faster.
Option 2: return an acknowledgement first, calculate in the background, and make another API call to send back the results.

limited number of user-initiated background processes

I need to allow users to submit requests for very, very large jobs. We are talking 100 gigabytes of memory and 20 hours of computing time. This costs our company a lot of money, so it was stipulated that only 2 jobs could be running at any time, and requests for new jobs when 2 are already running would be rejected (and the user notified that the server is busy).
My current solution uses an Executor from concurrent.futures, and requires setting the Apache server to run only one process, reducing responsiveness (current user count is very low, so it's okay for now).
If possible I would like to use Celery for this, but I did not see in the documentation any way to accomplish this particular setting.
How can I run up to a limited number of jobs in the background in a Django application, and notify users when jobs are rejected because the server is busy?
I have two solutions for this particular case: one is an out-of-the-box solution provided by Celery, and the other is one you implement yourself.
You can do something like this with Celery workers. In particular, you create only two worker processes with concurrency=1 (or one with concurrency=2, but then those are threads, not separate processes); this way, only two jobs can run asynchronously. Now you need a way to raise exceptions if both slots are occupied, so you use inspect to count the number of active tasks and throw exceptions if required. For an implementation, you can check out this SO post.
You might also be interested in rate limits.
You can also do it all yourself, using a locking solution of your choice. In particular, a nice implementation that makes sure only two processes are running, using Redis (and redis-py), is as simple as the following (assuming you know Redis, since you know Celery):
from redis import StrictRedis

redis = StrictRedis('localhost', 6379)
locks = ['compute:lock1', 'compute:lock2']

for key in locks:
    lock = redis.lock(key, blocking_timeout=5)
    acquired = lock.acquire()
    if acquired:
        do_huge_computation()
        lock.release()
        break
    print("Gonna try next possible slot")

if not acquired:
    raise SystemLimitsReached("Already at max capacity !")
This way you make sure only two running processes can exist in the system. A third process will block at the lock.acquire() line for blocking_timeout seconds; if locking succeeds, acquired will be True, otherwise it's False and you'd tell your user to wait!
I had the same requirement some time in the past, and what I ended up coding was something like the solution above. In particular:
It has the least amount of race conditions possible.
It's easy to read.
It doesn't depend on a sysadmin suddenly doubling the concurrency of workers under load and blowing up the whole system.
You can also implement the limit per user, meaning each user can have 2 simultaneously running jobs, by only changing the lock keys from compute:lock1 to compute:userId:lock1 and lock2 accordingly (see the sketch below). You can't do this with vanilla Celery.
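A hedged sketch of that per-user variant (how you obtain user_id is up to your application):

def user_lock_keys(user_id):
    # Two lock slots per user instead of two global slots.
    return ['compute:%s:lock1' % user_id, 'compute:%s:lock2' % user_id]

locks = user_lock_keys(user_id)  # then reuse the acquire loop from the snippet above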
First of all you need to limit concurrency on your worker (docs):
celery -A proj worker --loglevel=INFO --concurrency=2 -n <worker_name>
This will help make sure that you do not have more than 2 active tasks even if you have errors in your code.
Now you have 2 ways to implement task number validation:
You can use inspect to get the number of active and scheduled tasks:
from celery import current_app

def start_job():
    inspect = current_app.control.inspect()
    active_tasks = inspect.active() or {}
    scheduled_tasks = inspect.scheduled() or {}
    worker_key = 'celery@%s' % <worker_name>
    worker_tasks = active_tasks.get(worker_key, []) + scheduled_tasks.get(worker_key, [])
    if len(worker_tasks) >= 2:
        raise MyCustomException('It is impossible to start more than 2 tasks.')
    else:
        my_task.delay()
You can store the number of currently executing tasks in the DB and validate task execution based on it.
The second approach could be better if you want to scale this functionality - e.g. introduce premium users, or not allow one user to execute 2 requests at once (a sketch follows below).
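A hedged sketch of the second approach, assuming a hypothetical JobSlot model whose single row holds a running counter (the task itself has to decrement the counter again when it finishes):

from django.db import transaction

def start_job():
    with transaction.atomic():
        # SELECT ... FOR UPDATE serialises concurrent attempts to claim a slot.
        slot = JobSlot.objects.select_for_update().get(pk=1)
        if slot.running >= 2:
            raise MyCustomException('It is impossible to start more than 2 tasks.')
        slot.running += 1
        slot.save()
    my_task.delay()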
First
You need the first part of SpiXel's solution. According to him, "you only create two worker processes with concurrency=1".
Second
Set a timeout for tasks waiting in the queue, which is done with CELERY_EVENT_QUEUE_TTL, and a queue length limit, as described in "how to limit number of tasks in queue and stop feeding when full?".
Therefore, when the two workers are running jobs and a task has been waiting in the queue for, say, 10 seconds (or whatever period you like), the task will time out. Or, if the queue is already full, newly arriving tasks will be dropped.
Third
You need something extra to deal with notifying "users when jobs are rejected because the server is busy".
Dead letter exchanges are what you need here: every time a task fails because of the queue length limit or the message timeout, "messages will be dropped or dead-lettered from the front of the queue to make room for new messages once the limit is reached."
You can set "x-dead-letter-exchange" to route to another queue; once this queue receives the dead-lettered message, you can send a notification message to the users (see the sketch below).
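A hedged sketch of wiring those RabbitMQ queue arguments up through kombu (queue and exchange names, TTL, and length limit are all assumptions):

from kombu import Exchange, Queue

# Expired or rejected messages from 'compute' are re-routed to 'compute.dead',
# where a consumer can notify the user that the job was not accepted.
CELERY_QUEUES = (
    Queue(
        'compute',
        Exchange('compute', type='direct'),
        routing_key='compute',
        queue_arguments={
            'x-message-ttl': 10 * 1000,            # milliseconds a task may wait in the queue
            'x-max-length': 2,                     # at most 2 waiting tasks
            'x-dead-letter-exchange': 'compute.dead',
            'x-dead-letter-routing-key': 'compute.dead',
        },
    ),
    Queue('compute.dead', Exchange('compute.dead', type='direct'), routing_key='compute.dead'),
)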

celery doesn't work with global variable

from celery import Celery

app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')

a_num = 0

@app.task
def addone():
    global a_num
    a_num = a_num + 1
    return a_num
This is the code I used to test Celery.
I expected the return value to increase every time I call addone().
But it's always 1.
Why?
Results:
$ python
>>> from tasks import addone
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
>>> r = addone.delay()
>>> r.get()
1
By default when a worker is started Celery starts it with a concurrency of 4, which means it has 4 processes started to handle task requests. (Plus a process that controls the other processes.) I don't know what algorithm is used to assign task requests to the processes started for a worker but eventually, if you execute addone.delay().get() enough, you'll see the number get greater than 1. What happens is that each process (not each task) gets its own copy of a_num. When I try it here, my fifth execution of addone.delay().get() returns 2.
You could force the number to increment each time by starting your worker with a single process to handle requests. (e.g. celery -A tasks worker -c1) However, if you ever restart your worker, the numbering will be reset to 0. Moreover, I would not design code that works only if the number of processes handling requests is forced to be 1. One day down the road a colleague decides that multiple processes should handle the requests for the tasks and then things break. (Big fat warnings in comments in the code can only do so much.)
At the end of the day, such state should be shared in a cache, like Redis, or a database used as a cache, which would work for the code in your question.
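A hedged sketch of that, keeping the counter in Redis instead of in a module-level variable (assumes a Redis server on localhost):

import redis
from celery import Celery

app = Celery('tasks', backend='amqp://guest@localhost//', broker='amqp://guest@localhost//')
redis_db = redis.StrictRedis()  # assumed local Redis instance

@app.task
def addone():
    # INCR is atomic in Redis, so every worker process updates the same counter.
    return redis_db.incr('a_num')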
However, in a comment you wrote:
Let's say I want to use a task to send something. Instead of connecting every time inside the task, I want to share a global connection.
Storing the connection in a cache won't work. I would strongly advocate having each process that Celery starts use its own connection rather than try to share it among processes. The connection does not need to be reopened with each new task request. It is opened once per process, and then each task request served by this process reuses the connection.
In many cases, trying to share the same connection among processes (through sharing virtual memory through a fork, for instance) would flat out not work anyway. Connections often carry state with them (e.g. whether a database connection is in autocommit mode). If two parts of the code expect the connection to be in different states, the code will operate inconsistently.
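A hedged sketch of the one-connection-per-worker-process pattern, using Celery's worker_process_init signal (connect_to_backend and its send method are hypothetical placeholders for whatever service the tasks talk to):

from celery.signals import worker_process_init

connection = None  # one per worker process, reused by every task that process serves

@worker_process_init.connect
def open_connection(**kwargs):
    global connection
    connection = connect_to_backend()  # hypothetical: open whatever connection the tasks need

@app.task
def send_something(payload):
    # Reuses the connection opened when this worker process started.
    connection.send(payload)  # hypothetical API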
The tasks run asynchronously, so every time a new task starts, a_num will be set to 0. They are run as separate instances.
If you want to work with shared values, I suggest a value store or a database of some sort.
