UWSGI timer and cron decorators running duplicate jobs - python

I have been trying to make the uwsgi python spooler work properly for quite some time. I have a setup in which I run a django application with two worker processes. I have tried setting a cron spooler (and a timer spooler) to run a task every ten minutes, but no matter what configuration of settings I've tried, it always seems to register the signal multiple times and run the task multiple times.
This is how I run uwsgi:
#!/bin/bash
sudo uwsgi --emperor /etc/uwsgi/vassals --uid http --gid http --enable-threads --pidfile=/tmp/uwsgi.pid --daemonize=/var/log/uwsgi/uwsgi.log
This is my uwsgi vassal config in /etc/uwsgi/vassals/django.ini:
[uwsgi]
chdir = /home/user/django
module = django.wsgi
master = true
processes = 2
socket = /tmp/uwsgi-django.sock
vacuum = true
pidfile = /tmp/uwsgi-django.pid
daemonize = /home/user/django/log.log
env = DJANGO_SETTINGS_MODULE=django.settings
#lazy-apps = false
#lazy = false
spooler = %(chdir)/tasks
#spooler-processes = 1
#import = django-app/spooler.py
#spooler-import = django-app/spooler.py
shared-import = django-app/spooler.py
(I have changed some of the path names for privacy reasons.) The commented-out lines are various attempts at preventing the duplicate signal registration, but every time it seems to register the signal twice, and sometimes even thrice (presumably in both the workers and the single spooler process).
[uwsgi-signal] signum 0 registered (wid: 0 modifier1: 0 target: default, any worker)
[uwsgi-signal] signum 1 registered (wid: 1 modifier1: 0 target: default, any worker)
[uwsgi-signal] signum 1 registered (wid: 2 modifier1: 0 target: default, any worker)
Does anyone know why this is happening, and how to properly prevent it?
This is the spooler.py file:
@cron(-10, -1, -1, -1, -1)
def periodicUpdate(signal):
    print "Running cron job..."
    _getStats()
I also tried:
@timer(600)
def periodicUpdate(signal):
    print "Running cron job..."
    _getStats()
I also tried adding target='spooler' to the timer/cron decorator, but it did not seem to make any difference.

Are you sure you do not have other signals registered in django.wsgi, settings.py or another Django-related file? --shared-import will only load things once (in the master).
By the way, I do not understand what you are trying to accomplish. This is not how the spooler is supposed to work, and even if you want to use it as a signal handler target, you have to specify it when you register signals (with target='spooler' in the decorator).

While this is an old question, I couldn't find the answer elsewhere.
I used this solution with Flask, but it should be similar with Django.
During initialization (in prefork mode) you need to register a signal:
uwsgi.register_signal(26, "spooler", periodicUpdate)
Then the timer should look like this:
@timer(600, target='spooler')
def periodicUpdate(signal):
    print "Running cron job..."
    _getStats()
As for the comments:
The error 'only the master and the workers can register signal handlers' is correct, because you hadn't registered any signal.
The issue where
'whenever I load one of the pages in my django application it re-registers it'
can probably happen because a worker is calling the method (periodicUpdate). That's why the signal must be registered before the workers are spawned.
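To make this concrete, here is a minimal sketch of what spooler.py could look like using the raw uwsgi signal API (my illustration, not the original poster's code; signal number 26 and the 600-second period are arbitrary choices, and _getStats is assumed to be defined elsewhere):
import uwsgi

def periodicUpdate(signum):
    print "Running cron job..."
    _getStats()

# Because this module is loaded via shared-import, this registration runs
# once, in the master, before workers are forked -- so the signal is not
# re-registered by every worker.
uwsgi.register_signal(26, "spooler", periodicUpdate)

# Fire signal 26 every 600 seconds; the "spooler" target above makes the
# handler run in the spooler process instead of a worker.
uwsgi.add_timer(26, 600)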

Related

Celery: different settings for task_acks_late per worker / add custom option to celery

This question is a follow-up of django + celery: disable prefetch for one worker, Is there a bug?
I had a problem with celery (see the question I'm following up on), and in order to resolve it I'd like to have two celery workers with --concurrency 1 each, but with two different settings of task_acks_late.
My current approach works, but is in my opinion not very beautiful. I am doing the following:
in settings.py of my django project:
CELERY_TASK_ACKS_LATE = os.environ.get("LACK", "False") == "True"
This allows me to start the celery workers with the following commands:
LACK=True celery -A miniclry worker --concurrency=1 -n w2 -Q=fast,slow --prefetch-multiplier 1
celery -A miniclry worker --concurrency=1 -n w1 -Q=fast
What would be more intuitive would be if I could do something like:
celery -A miniclry worker --concurrency=1 -n w2 -Q=fast,slow --prefetch-multiplier 1 --late-ack=True
celery -A miniclry worker --concurrency=1 -n w1 -Q=fast --late-ack=False
I found Initializing Different Celery Workers with Different Values but don't understand how to embed this in my django / celery context. In which files would I have to add the code that adds an argument to the parser, and how could I use the custom param to modify task_acks_late in the celery settings?
Update:
Thanks to @Greenev's answer I managed to add custom options to celery. However it seems that changing the config with this mechanism 'arrives too late' and the change is not taken into account.
One possible solution here is to provide acks_late=True as an argument of the shared_task decorator, given your code from the prior question:
@shared_task(acks_late=True)
def task_fast(delay=0.1):
    logger.warning("fast in")
    time.sleep(delay)
    logger.warning("fast out")
UPD: I haven't managed to get task_acks_late set using this approach, but you can add a command-line argument as follows.
You've already linked to a solution. I can't see any Django specifics here; just put the parser.add_argument code where you have defined your app. Given your code from the prior question, you would have something like this:
app = Celery("miniclry", backend="rpc", broker="pyamqp://")
app.config_from_object('django.conf:settings', namespace='CELERY')

def add_worker_arguments(parser):
    parser.add_argument('--late-ack', default=False)
app.user_options['worker'].add(add_worker_arguments)
Then you could access your argument value in a celeryd_init signal handler:
@celeryd_init.connect
def configure_worker(sender=None, conf=None, options=None, **kwargs):
    # get custom argument value from options; note that depending on the
    # Celery version the key may arrive as 'late_ack', since argparse
    # converts dashes in option names to underscores
    conf.task_acks_late = options.get('late-ack')
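With the option registered, the workers can then be started roughly as the question wished (a sketch; note that with a plain parser.add_argument and default=False, --late-ack=True arrives as the string 'True', not a boolean, so you may want to compare it to "True" as in the settings.py approach above):
celery -A miniclry worker --concurrency=1 -n w2 -Q=fast,slow --prefetch-multiplier 1 --late-ack=True
celery -A miniclry worker --concurrency=1 -n w1 -Q=fast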

Celery task reprocessing itself in an infinite loop

I'm running into an odd situation where celery would reprocess a task that's been completed. The overall design looks like this:
Celery Beat: Pulls files periodically; if a file was pulled, it creates a new entry in the DB and delegates processing of that file to another celery task in a 1-worker queue (that way only 1 file gets processed at a time).
Celery Task: Processes the file, once it's done it's done, no retries, no loops.
@app.task(name='periodic_pull_file')
def periodic_pull_file():
    for f in get_files_from_some_dir(...):
        ingested_file = IngestedFile(filename=filename)
        ingested_file.document.save(filename, File(f))
        ingested_file.save()
        process_import(ingested_file.id)
        # deletes the file from the dir source
        os.remove(....somepath)
def process_import(ingested_file_id):
    ingested_file = IngestedFile.objects.get(id=ingested_file_id)
    if 'foo' in ingested_file.filename.lower():
        f = process_foo
    else:
        f = process_real_stuff
    f.apply_async(args=[ingested_file_id], queue='import')
@app.task(name='process_real_stuff')
def process_real_stuff(file_id):
    # dostuff
process_foo and process_real_stuff are just functions that loop over the file once, and once they're done they're done. I can actually keep track of the percentage of progress, and the interesting thing I noticed was that the same file kept getting processed over and over again (note that these are large files and processing is slow; it takes hours to process a file). Now I started wondering if it was just creating duplicate tasks in the queue, so I checked my redis queue when I had 13 pending files to import:
-bash-4.1$ redis-cli -p 6380 llen import
(integer) 13
Sure enough, 13. I then checked the content of each queued task to see if it was just repeated ingested_file_ids, using:
redis-cli -p 6380 lrange import 0 -1
And they're all unique tasks with unique ingested_file_ids. Am I overlooking something? Is there any reason why it would finish a task, then loop over the same task again and again? This only started happening recently, with no code changes. Things used to be pretty snappy and seamless before. I know it's also not a "failed" process that somehow magically retries itself, because it's not moving down in the queue; i.e. it receives the same task in the same order again and again, so it never gets to touch the other 13 files it should have processed.
Note, this is my worker:
python manage.py celery worker -A myapp -l info -c 1 -Q import
Use this to purge the queue:
celery -Q your_queue_name purge
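As a usage note (my assumption about the setup, not part of the answer): purge deletes all waiting messages irrevocably, and in more recent Celery versions the app and queue are passed explicitly, along the lines of:
# "myapp" is a placeholder app module; "import" is the queue from the question
celery -A myapp purge -Q import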

Python3: RuntimeError: can't start new thread [duplicate]

I have a site that runs with the following configuration:
Django + mod-wsgi + apache
While handling a user's request, I send another HTTP request to another service, using Python's httplib library.
But sometimes this service doesn't answer for too long, and the httplib timeout doesn't work. So I create a thread, send the request to the service in that thread, and join it after 20 seconds (20 seconds is the request timeout). This is how it works:
class HttpGetTimeOut(threading.Thread):
    def __init__(self, **kwargs):
        self.config = kwargs
        self.resp_data = None
        self.exception = None
        super(HttpGetTimeOut, self).__init__()

    def run(self):
        h = httplib.HTTPSConnection(self.config['server'])
        h.connect()
        sended_data = self.config['sended_data']
        h.putrequest("POST", self.config['path'])
        h.putheader("Content-Length", str(len(sended_data)))
        h.putheader("Content-Type", 'text/xml; charset="utf-8"')
        if 'base_auth' in self.config:
            base64string = base64.encodestring('%s:%s' % self.config['base_auth'])[:-1]
            h.putheader("Authorization", "Basic %s" % base64string)
        h.endheaders()
        try:
            h.send(sended_data)
            self.resp_data = h.getresponse()
        except httplib.HTTPException, e:
            self.exception = e
        except Exception, e:
            self.exception = e
something like this...
And I use it with this code:
getting = HttpGetTimeOut(**req_config)
getting.start()
getting.join(COOPERATION_TIMEOUT)
if getting.isAlive():  # maybe need some lock here
    getting._Thread__stop()
    raise ValueError('Timeout')
else:
    if getting.resp_data:
        r = getting.resp_data
    else:
        if getting.exception:
            raise ValueError('Request Exception')
        else:
            raise ValueError('Undefined exception')
And all works fine, but sometimes I start catching this exception:
error: can't start new thread
at the line that starts the new thread:
getting.start()
and the final line of the traceback is:
File "/usr/lib/python2.5/threading.py", line 440, in start
_start_new_thread(self.__bootstrap, ())
And the question is: what is happening?
Thanks for any help, and sorry for my poor English. :)
The "can't start new thread" error is almost certainly due to the fact that you already have too many threads running within your python process, and due to a resource limit of some kind the request to create a new thread is refused.
You should probably look at the number of threads you're creating; the maximum number you will be able to create will be determined by your environment, but it should be in the order of hundreds at least.
It would probably be a good idea to re-think your architecture here; seeing as this is running asynchronously anyhow, perhaps you could use a pool of threads to fetch resources from another site instead of always starting up a thread for every request.
Another improvement to consider is your use of Thread.join and Thread.stop; this would probably be better accomplished by providing a timeout value to the constructor of HTTPSConnection.
You are starting more threads than your system can handle. There is a limit to the number of threads that can be active in one process.
Your application is starting threads faster than the threads run to completion. If you need to start many threads, you need to do it in a more controlled manner; I would suggest using a thread pool.
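A minimal sketch of the thread-pool suggestion (my illustration, not from the answers; it uses concurrent.futures, so it assumes Python 3 or the futures backport, and fetch_url is a hypothetical stand-in for the HTTPSConnection logic in run() above):
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def fetch_url(config):
    # hypothetical stand-in for the HTTPSConnection logic shown in run()
    ...

# A fixed-size pool caps the number of live threads, so bursts of requests
# queue up instead of exhausting the per-process thread limit.
pool = ThreadPoolExecutor(max_workers=20)

future = pool.submit(fetch_url, req_config)
try:
    resp_data = future.result(timeout=20)  # same 20-second budget as the question
except TimeoutError:
    raise ValueError('Timeout')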
I was running into a similar situation, but my process needed a lot of threads running to take care of a lot of connections.
I counted the number of threads with the command:
ps -fLu user | wc -l
It displayed 4098.
I switched to the user and looked at the system limits:
sudo -u myuser -s /bin/bash
ulimit -u
Got 4096 as response.
So, I edited /etc/security/limits.d/30-myuser.conf and added the lines:
myuser hard nproc 16384
myuser soft nproc 16384
Restarted the service and now it's running with 7017 threads.
P.S. I have a 32-core server and I'm handling 18k simultaneous connections with this configuration.
I think the best way in your case is to set a socket timeout instead of spawning a thread:
h = httplib.HTTPSConnection(self.config['server'],
                            timeout=self.config['timeout'])
You can also set a global default timeout with the socket.setdefaulttimeout() function.
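For illustration, a sketch of the global default (the 20-second value mirrors the timeout from the question):
import socket

# Applies to every new socket that does not set an explicit timeout.
socket.setdefaulttimeout(20)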
Update: see the answers to the question Is there any way to kill a Thread in Python? (there are several quite informative ones) to understand why: Thread.__stop() doesn't terminate the thread, but rather sets an internal flag so that it's considered already stopped.
I completely rewrote the code from httplib to pycurl.
import pycurl
import StringIO

c = pycurl.Curl()
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, CONNECTION_TIMEOUT)
c.setopt(pycurl.TIMEOUT, COOPERATION_TIMEOUT)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.POST, 1)
c.setopt(pycurl.SSL_VERIFYHOST, 0)
c.setopt(pycurl.SSL_VERIFYPEER, 0)
c.setopt(pycurl.URL, "https://" + server + path)
c.setopt(pycurl.POSTFIELDS, sended_data)
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()
something like that.
I'm testing it now. Thanks to all of you for the help.
If you are trying to set a timeout, why don't you use urllib2?
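A sketch of that suggestion (my illustration; server, path and sended_data are the names from the question's code, and urllib2.urlopen accepts a timeout argument as of Python 2.6):
import urllib2

# The timeout covers the connection attempt and blocking reads.
response = urllib2.urlopen("https://" + server + path,
                           data=sended_data, timeout=20)
resp_body = response.read()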
I'm running a python script on my machine only to copy and convert some files from one format to another, and I want to maximize the number of running threads to finish as quickly as possible.
Note: this is not a good workaround from an architecture perspective if you aren't using it for a quick script on a specific machine.
In my case, I checked the max number of running threads that my machine could run before I got the error; it was 150.
I added this code before starting a new thread; it checks whether the max limit of running threads has been reached, and if so the app waits until some of the running threads finish before starting new ones:
while threading.active_count() > 150:
    time.sleep(5)
mythread.start()
If you are using a ThreadPoolExecutor, the problem may be that your max_workers is higher than the number of threads allowed by your OS.
It seems that the executor keeps the information about the last executed threads in the process table, even if the threads are already done. This means that when your application has been running for a long time, eventually it will register in the process table as many threads as ThreadPoolExecutor.max_workers.
As far as I can tell it's not a python problem. Your system somehow cannot create another thread (I had the same problem and couldn't start htop in another CLI via SSH).
The answer from Fernando Ulisses dos Santos is really good. I just want to add that there are other tools limiting the number of processes and memory usage "from the outside". This is pretty common for virtual servers. A starting point is your vendor's interface, or you might have luck finding some information in files like:
/proc/user_beancounters

How can I detect whether I'm running in a Celery worker?

Is there a way to determine, programmatically, whether the current module is being imported/run in the context of a celery worker?
We've settled on setting an environment variable before running the Celery worker, and checking this environment variable in the code, but I wonder if there's a better way?
Simple,
import sys

IN_CELERY_WORKER_PROCESS = sys.argv and sys.argv[0].endswith('celery') \
    and 'worker' in sys.argv

if IN_CELERY_WORKER_PROCESS:
    print('Im in Celery worker')
http://percentl.com/blog/django-how-can-i-detect-whether-im-running-celery-worker/
As of celery 4.2 you can also do this by setting a flag from a worker_ready signal handler
in celery.py:
from celery.signals import worker_ready

app = Celery(...)
app.running = False

@worker_ready.connect
def set_running(*args, **kwargs):
    app.running = True
Now you can check within your task, using the global app instance, whether or not you are running in a worker. This can be very useful to determine which logger to use, as sketched below.
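For example, a small usage sketch (my illustration; worker_logger and local_logger are hypothetical logger objects):
@app.task
def my_task():
    # app.running was set to True by the worker_ready handler above
    logger = worker_logger if app.running else local_logger
    logger.info('running my_task')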
Depending on what your use-case scenario is exactly, you may be able to detect it by checking whether the request id is set:
@app.task(bind=True)
def foo(self):
    print self.request.id
If you invoke the above as foo.delay() then the task will be sent to a worker and self.request.id will be set to a unique number. If you invoke it as foo(), then it will be executed in your current process and self.request.id will be None.
You can use the current_worker_task property from the Celery application instance class.
With the following task defined:
# whatever_app/tasks.py
celery_app = Celery(app)

@celery_app.task
def test_task():
    if celery_app.current_worker_task:
        return 'running in a celery worker'
    return 'just running'
You can run the following in a Python shell:
In [1]: from whatever_app.tasks import test_task
In [2]: test_task()
Out[2]: 'just running'
In [3]: r = test_task.delay()
In [4]: r.result
Out[4]: u'running in a celery worker'
Note: Obviously for test_task.delay() to succeed, you need to have at least one celery worker running and configured to load tasks from whatever_app.tasks.
Adding an environment variable is a good way to check whether the module is being run by a celery worker. In the task submitter process we can leave the variable unset, to mark that it is not running in the context of a celery worker.
But a better way may be to use celery signals, which can tell you whether the module is running in a worker or in the task submitter. For example, the worker_process_init signal is sent to each child task executor process (in prefork mode), and the handler can be used to set a global variable indicating that it is a worker process, as sketched below.
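A minimal sketch of that signal-based approach (my illustration; IN_CELERY_WORKER is a hypothetical name):
from celery.signals import worker_process_init

IN_CELERY_WORKER = False

@worker_process_init.connect
def mark_worker_process(**kwargs):
    # Runs in each prefork child process right after it is spawned.
    global IN_CELERY_WORKER
    IN_CELERY_WORKER = True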
It is a good practice to start workers with names, so that it becomes easier to manage (stop/kill/restart) them. You can use -n to name a worker.
celery worker -l info -A test -n foo
Now, in your script you can use app.control.inspect to see if that worker is running.
In [22]: import test
In [23]: i = test.app.control.inspect(['foo'])
In [24]: i.app.control.ping()
Out[24]: [{'celery@foo': {'ok': 'pong'}}]
You can read more about this in the celery worker docs.

Celery transfer command line arguments to Task

I am struggling with transferring additional command-line arguments to a celery task. I can set the desired attribute in a bootstep, however the same attribute is empty when accessed directly from the task (I guess it gets overridden).
class Arguments(bootsteps.Step):
    def __init__(self, worker, environment, **options):
        ArgumentTask.args = {'environment': environment}
        # this works
        print ArgumentTask.args
Here is the custom task:
class ArgumentTask(Task):
    abstract = True
    _args = {}

    @property
    def args(self):
        return self._args

    @args.setter
    def args(self, value):
        self._args.update(value)
And the actual task:
@celery.task(base=ArgumentTask, bind=True, name='jobs.send')
def send(self):
    # this prints an empty dictionary
    print self.args
Do I need to use some additional persistence layer, e.g. persistent objects, or am I missing something really obvious?
Similar question
It does not seem to be possible. The reason is that your task could be consumed anywhere, by any consumer of the queue, each consumer possibly having different command-line parameters, and therefore its processing should not depend on a worker's configuration.
If your problem is managing dev/prod environments, this is the way we handled it in our project:
Each environment is jailed in its own venv, with a configuration so that the project is aware of its environment (in our case it's just the db links in the configuration that change). Each environment has its own queues, and its celery workers are launched with this command:
/path/venv/bin/celery worker -A async.myapp --workdir /path -E -n celery-name#server -Ofair
Hope it helped.
If you really want to dig into this, each task can access a .control attribute, which allows you to launch control operations on celery (like some monitoring). But I didn't find anything helpful there.
