Pydoop call not working when inside Celery task - python

I've set up two files for a project using Celery and Pydoop: tasks.py and HDFSStorage.py.
# tasks.py
from celery import Celery
from celery import shared_task
from celery.utils.log import get_task_logger
from HDFSStorage import HDFSStorage
app = Celery('tasks', broker='amqp://guest@localhost//')
logger = get_task_logger(__name__)
fs = HDFSStorage()
print fs.exists("/myfile.txt")
@shared_task
def add(x, y):
    logger.info('Adding {0} + {1}'.format(x, y))
    logger.info('Checking if file exists')
    fs.exists("/myfile.txt")
    logger.info('Done checking if file exists')
    return x + y
# HDFSStorage.py
import pydoop
from pydoop.hdfs import hdfs
class HDFSStorage():
    def __init__(self):
        self.client = hdfs(host="master", port=54310, user="oskar")

    def exists(self, name):
        return self.client.exists(name)
Starting the Celery worker runs the fs.exists() call outside the task and prints True as expected.
$ celery -A tasks worker -l info
True
[2016-06-08 15:54:15,298: WARNING/MainProcess] /usr/local/lib/python2.7/dist-packages/celery/apps/worker.py:161: CDeprecationWarning:
Starting from version 3.2 Celery will refuse to accept pickle by default.
The pickle serializer is a security concern as it may give attackers
the ability to execute any command. It's important to secure
your broker from unauthorized access when using pickle, so we think
that enabling pickle should require a deliberate action and not be
the default choice.
If you depend on pickle then you should set a setting to disable this
warning and to be sure that everything will continue working
when you upgrade to Celery 3.2::
CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
You must only enable the serializers that you will actually use.
warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
-------------- celery@master v3.1.23 (Cipater)
---- **** -----
--- * *** * -- Linux-3.19.0-32-generic-x86_64-with-LinuxMint-17.3-rosa
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: tasks:0x7f510d3162d0
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
[tasks]
. tasks.add
[2016-06-08 15:54:15,371: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2016-06-08 15:54:15,382: INFO/MainProcess] mingle: searching for neighbors
[2016-06-08 15:54:16,395: INFO/MainProcess] mingle: all alone
[2016-06-08 15:54:16,412: WARNING/MainProcess] celery@master ready.
[2016-06-08 15:54:19,736: INFO/MainProcess] Events of group {task} enabled by remote.
However, running the task, which makes the same fs.exists() call, gets stuck for some unknown reason.
$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from tasks import add
True
>>> print add.delay(5,4).get()
[2016-06-08 15:54:32,833: INFO/MainProcess] Received task: tasks.add[a50409a8-f82d-4376-ace2-442a09c9ed4f]
[2016-06-08 15:54:32,834: INFO/Worker-2] tasks.add[a50409a8-f82d-4376-ace2-442a09c9ed4f]: Adding 5 + 3
[2016-06-08 15:54:32,834: INFO/Worker-2] tasks.add[a50409a8-f82d-4376-ace2-442a09c9ed4f]: Checking if file exists
Removing the fs.exists() call in the task makes the task finish correctly.
What am I doing wrong? What makes Celery not work with Pydoop?

The HDFSStorage instance must be created inside the task:
# tasks.py
from celery import Celery
from celery import shared_task
from celery.utils.log import get_task_logger
from HDFSStorage import HDFSStorage
app = Celery('tasks', broker='amqp://guest@localhost//')
logger = get_task_logger(__name__)
@shared_task
def add(x, y):
    fs = HDFSStorage()
    logger.info('Adding {0} + {1}'.format(x, y))
    logger.info('Checking if file exists')
    fs.exists("/myfile.txt")
    logger.info('Done checking if file exists')
    return x + y
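A likely explanation (my reading; the answer above does not spell it out): with the prefork pool the worker forks child processes, and an HDFS connection opened at import time in the parent does not survive the fork, so the child hangs on it. If opening the connection on every call is too expensive, a possible variant is to open it once per worker process with Celery's worker_process_init signal. This is only a sketch, not tested against Pydoop:
# tasks.py (alternative sketch: one HDFS connection per worker process)
from celery import Celery
from celery import shared_task
from celery.signals import worker_process_init
from celery.utils.log import get_task_logger
from HDFSStorage import HDFSStorage

app = Celery('tasks', broker='amqp://guest@localhost//')
logger = get_task_logger(__name__)

fs = None  # filled in after each pool child process has started

@worker_process_init.connect
def init_hdfs(**kwargs):
    # Runs inside every prefork child, so the connection is created after the fork.
    global fs
    fs = HDFSStorage()

@shared_task
def add(x, y):
    logger.info('Adding {0} + {1}'.format(x, y))
    logger.info('Checking if file exists')
    fs.exists("/myfile.txt")
    logger.info('Done checking if file exists')
    return x + y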

Related

Celery - worker only sometimes picks up tasks

I am building a lead generation portal that can be accessed online. Please don't mind the verbosity of the code; I'm doing a lot of debugging right now.
My Celery worker inconsistently picks up tasks assigned to it, and I'm not sure why.
The weird thing is that sometimes it works 100% perfectly: there are never any explicit errors in the terminal.
I am currently running with DEBUG = True and Redis as the broker!
Celery start-worker terminal command and response:
celery -A mysite worker -l info --pool=solo
-------------- celery@DESKTOP-OG8ENRQ v5.0.2 (singularity)
--- ***** -----
-- ******* ---- Windows-10-10.0.19041-SP0 2020-11-09 00:36:13
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: mysite:0x41ba490
- ** ---------- .> transport: redis://localhost:6379//
- ** ---------- .> results: redis://localhost:6379/
- *** --- * --- .> concurrency: 12 (solo)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> celery exchange=celery(direct) key=celery
[tasks]
. mysite.celery.debug_task
. submit
[2020-11-09 00:36:13,899: INFO/MainProcess] Connected to redis://localhost:6379//
[2020-11-09 00:36:14,939: WARNING/MainProcess] c:\users\coole\pycharmprojects\lead_django_retry\venv\lib\site-packages\celery\app\control.py:48: DuplicateNodenameWarning: Received multiple replies from node name: celery@DESKTOP-OG8ENRQ.
Please make sure you give each node a unique nodename using
the celery worker `-n` option.
warnings.warn(DuplicateNodenameWarning(
[2020-11-09 00:36:14,939: INFO/MainProcess] mingle: all alone
[2020-11-09 00:36:14,947: INFO/MainProcess] celery@DESKTOP-OG8ENRQ ready.
views.py
class LeadInputView(FormView):
    template_name = 'lead_main.html'
    form_class = LeadInput

    def form_valid(self, form):
        print("I'm at views")
        form.submit()
        print(form.submit)
        return HttpResponseRedirect('./success/')
tasks.py
@task(name="submit")
def start_task(city, category, email):
    print("I'm at tasks!")
    print(city, category, email)
    """sends an email when feedback form is filled successfully"""
    logger.info("Submitted")
    return start(city, category, email)
forms.py
class LeadInput(forms.Form):
    city = forms.CharField(max_length=50)
    category = forms.CharField(max_length=50)
    email = forms.EmailField()

    def submit(self):
        print("I'm at forms!")
        x = (start_task.delay(self.cleaned_data['city'], self.cleaned_data['category'], self.cleaned_data['email']))
        return x
celery.py
from __future__ import absolute_import
import os
from celery import Celery
from django.conf import settings
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')
app = Celery('mysite')
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
@app.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))
settings.py
BROKER_URL = 'redis://localhost:6379'
CELERY_RESULT_BACKEND = 'redis://localhost:6379'
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'
The runserver terminal will look something like this:
I'm at views
I'm at forms!
<bound method LeadInput.submit of <LeadInput bound=True, valid=True, fields=(city;category;email)>>
But the worker doesn't say that it picked up anything, just "celery@DESKTOP-OG8ENRQ ready." Except when it does work... for some reason? I'm at a loss!
Hello to whoever sees this. It turns out that this is a bug with Celery (or maybe Redis?); apparently many Windows users run into it. https://github.com/celery/celery/issues/3759
Turns out the answer is to use -P solo when starting the worker. I'm not sure why this is the case... but that solved it!
Thank you Naqib for your help! You put me down the right rabbit hole to a solution.
By default, Celery uses the hostname as the worker name. If you want to run multiple workers on the same host, specify a unique name with the -n option.
celery -A mysite worker -l info --pool=solo -n worker2@%h
Your code works fine, but the task is passed to the first worker; see
DuplicateNodenameWarning with no obvious reason #2938
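For example, if you really do want two workers on the same host, give each one its own node name (worker1/worker2 are just illustrative names):
celery -A mysite worker -l info --pool=solo -n worker1@%h
celery -A mysite worker -l info --pool=solo -n worker2@%h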

Celery using 'application/x-python-serialize' instead of `application/json`

I'm using the celery module v3.1.25 with Python 2.7 on Windows 10 to run a Celery worker. The results must be returned encoded in JSON, not pickle.
Problem: when the worker returns the result of a task, the RabbitMQ management console shows the result as content_type: application/x-python-serialize. Why is it still x-python-serialize when we have set task_serializer, result_serializer and accept_content to json?
proj/celery.py
from __future__ import absolute_import, unicode_literals
from celery import Celery
app = Celery('tasks',
             broker='amqp://test:test@192.168.1.26:5672//',  # running in Win10 VM
             backend='amqp://',
             task_serializer='json',
             result_serializer='json',
             accept_content=['application/json'],
             include=['proj.tasks'])
proj/tasks.py
from __future__ import absolute_import, unicode_literals
from .celery import app
@app.task
def myTask():
    ...
    return ...
Worker is started using
celery -A proj worker --loglevel=info
and gives a warning about the pickle serializer
Starting from version 3.2 Celery will refuse to accept pickle by default.
The pickle serializer is a security concern as it may give attackers
the ability to execute any command. It's important to secure
your broker from unauthorized access when using pickle, so we think
that enabling pickle should require a deliberate action and not be
the default choice.
If you depend on pickle then you should set a setting to disable this
warning and to be sure that everything will continue working
when you upgrade to Celery 3.2::
CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
You must only enable the serializers that you will actually use.
warnings.warn(CDeprecationWarning(W_PICKLE_DEPRECATED))
-------------- celery@Y-PC v3.1.25 (Cipater)
---- **** -----
--- * *** * -- Windows-10-10.0.14393
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: tasks:0x40ffeb8
- ** ---------- .> transport: amqp://test:**@192.168.1.26:5672//
- ** ---------- .> results: amqp://
- *** --- * --- .> concurrency: 12 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
Does it help to change your Celery() config parameter to accept_content=['json'] instead of ['application/json']?
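For what it's worth, a minimal sketch of that change on Celery 3.1, where configuration is read from the upper-case CELERY_* settings (the lower-case task_serializer style belongs to Celery 4.x), might look like this; whether it fixes the x-python-serialize results in your setup is an assumption on my part:
from __future__ import absolute_import, unicode_literals
from celery import Celery

app = Celery('tasks',
             broker='amqp://test:test@192.168.1.26:5672//',
             backend='amqp://',
             include=['proj.tasks'])

# Celery 3.1 reads the upper-case CELERY_* names; set them on app.conf.
app.conf.update(
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
    CELERY_ACCEPT_CONTENT=['json'],  # short alias instead of 'application/json'
)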

Simple periodic task in celery doesn't work but no errors

I'm new to Celery. I'm trying to properly configure Celery with my Django project. To test whether Celery works, I've created a periodic task that should print "periodic_task" every 2 seconds. Unfortunately it doesn't work, but there is no error.
1 Installed rabbitmq
2 Project/project/celery.py
from __future__ import absolute_import
import os
from celery import Celery
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'project.settings')
from django.conf import settings # noqa
app = Celery('project')
app.config_from_object('django.conf:settings')
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)
@app.task(bind=True)
def myfunc():
    print 'periodic_task'

@app.task(bind=True)
def debudeg_task(self):
    print('Request: {0!r}'.format(self.request))
3 Project/project/__init__.py
from __future__ import absolute_import
from .celery import app as celery_app
4 Settings.py
INSTALLED_APPS = [
    'djcelery',
    ...]
...
...
CELERYBEAT_SCHEDULE = {
    'schedule-name': {
        'task': 'project.celery.myfunc',  # We are going to create a email_sending_method later in this post.
        'schedule': timedelta(seconds=2),
    },
}
And before python manage.py, I run celery -A project worker -l info
I still can't see any "periodic_task" printed in the console every 2 seconds... Do you know what to do?
EDIT CELERY CONSOLE:
-------------- celery@Milwou_NB v3.1.23 (Cipater)
---- **** -----
--- * *** * -- Windows-8-6.2.9200
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: dolava:0x33d1350
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 4 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
[tasks]
. project.celery.debudeg_task
. project.celery.myfunc
EDIT:
After changing worker to beat, it seems to work. Something is happening every 2 seconds (I changed the interval to 5 seconds), but I can't see the results of the task. (I can put anything into CELERYBEAT_SCHEDULE, even a wrong path, and it doesn't raise any error.)
I changed myfunc code to:
@app.task(bind=True)
def myfunc():
    # notifications.send_message_to_admin('sdaa','dsadasdsa')
    with open('text.txt', 'a') as f:
        f.write('sa')
But I can't see text.txt anywhere.
> celery -A dolava beat -l info
celery beat v3.1.23 (Cipater) is starting.
__ - ... __ - _
Configuration ->
. broker -> amqp://guest:**@localhost:5672//
. loader -> celery.loaders.app.AppLoader
. scheduler -> djcelery.schedulers.DatabaseScheduler
. logfile -> [stderr]@%INFO
. maxinterval -> now (0s)
[2016-10-26 17:46:50,135: INFO/MainProcess] beat: Starting...
[2016-10-26 17:46:50,138: INFO/MainProcess] Writing entries...
[2016-10-26 17:46:51,433: INFO/MainProcess] DatabaseScheduler: Schedule changed.
[2016-10-26 17:46:51,433: INFO/MainProcess] Writing entries...
[2016-10-26 17:46:51,812: INFO/MainProcess] Scheduler: Sending due task schedule-name (dolava_app.tasks.myfunc)
[2016-10-26 17:46:51,864: INFO/MainProcess] Writing entries...
[2016-10-26 17:46:57,138: INFO/MainProcess] Scheduler: Sending due task schedule-name (dolava_app.tasks.myfunc)
[2016-10-26 17:47:02,230: INFO/MainProcess] Scheduler: Sending due task schedule-name (dolava_app.tasks.myfunc)
Try to run
$ celery -A project beat -l info
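Note that beat only schedules and sends the tasks; a worker still has to be running to execute them. Assuming the project layout above, run both, either in two terminals:
$ celery -A project worker -l info
$ celery -A project beat -l info
or, for development, embed beat in the worker process with the -B flag:
$ celery -A project worker -B -l info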

Celery Backend Enabled, Results Say Otherwise

I'll keep it short and to the point:
project directory
proj/__init__.py
/tasks.py
/celery_app.py
celery_app.py
from __future__ import absolute_import
from celery import Celery
app = Celery('proj',
             broker='amqp://',
             backend='amqp://',
             include=['proj.tasks'])

app.conf.update(
    CELERY_TASK_RESULT_EXPIRES=3600,
)

if __name__ == '__main__':
    app.start()
tasks.py
from __future__ import absolute_import
from celery import current_app
from celery.contrib.methods import task_method
class A:
    @current_app.task(filter=task_method)
    def add(self, x, y):
        return x + y
worker log
-------------- celery@mycomp.localdomain v3.1.17 (Cipater)
---- **** -----
--- * *** * -- Linux-2.6.32-504.8.1.el6.x86_64-x86_64-with-centos-6.6-Final
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> app: proj:0x1dc12d0
- ** ---------- .> transport: amqp://guest:**@localhost:5672//
- ** ---------- .> results: amqp://
- *** --- * --- .> concurrency: 24 (prefork)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
[tasks]
. proj.tasks.add
[2015-04-08 17:45:20,788: INFO/MainProcess] Connected to amqp://guest:**@127.0.0.1:5672//
[2015-04-08 17:45:20,801: INFO/MainProcess] mingle: searching for neighbors
[2015-04-08 17:45:21,812: INFO/MainProcess] mingle: all alone
[2015-04-08 17:45:21,828: WARNING/MainProcess] celery@mycomp.localdomain ready.
[2015-04-08 17:50:25,610: INFO/MainProcess] Received task: proj.tasks.add[e0020f67-dbe7-4f6d-9547-a8ace36c2a2c]
[2015-04-08 17:50:25,635: INFO/MainProcess] Task proj.tasks.add[e0020f67-dbe7-4f6d-9547-a8ace36c2a2c] succeeded in 0.023062946042s: 4
python shell
>>> from proj.tasks import A
>>> a = A()
>>> s = a.add.delay(2,2)
>>> s
<AsyncResult: e0020f67-dbe7-4f6d-9547-a8ace36c2a2c>
>>> s.backend
<celery.backends.base.DisabledBackend object at 0x113fdd0>
As you can see, I have a backend enabled. I'm using amqp. However, when I try to get the result, it says I don't have an enabled backend.
By including the line from proj.celery_app import app in tasks.py, the backend started to work.
This seems like a bug, since current_app should contain that backend instance.
I opened an issue on the Celery GitHub. Hopefully this helps anyone who encounters this problem as well.
Link to the github issue
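For completeness, this is what tasks.py looks like with that workaround applied (a sketch of the fix the answer reports, not an officially documented requirement):
# proj/tasks.py
from __future__ import absolute_import
from proj.celery_app import app  # importing the configured app first makes it the current app
from celery import current_app
from celery.contrib.methods import task_method

class A:
    @current_app.task(filter=task_method)
    def add(self, x, y):
        return x + y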

Celery not running chord callback

After looking at a lot of articles about chord callbacks not executing and trying their solutions, I am still unable to get it to work. In fact, the chord_unlock method is also not getting executed for some reason.
celery.py
from __future__ import absolute_import
from celery import Celery
app = Celery('sophie',
             broker='redis://localhost:6379/2',
             backend='redis://localhost:6379/2',
             include=['sophie.lib.chord_test'])

app.conf.update(
    CELERY_ACCEPT_CONTENT=["json"],
    CELERY_TASK_SERIALIZER="json",
    CELERY_TRACK_STARTED=True,
    CELERYD_PREFETCH_MULTIPLIER=1,  # NO PREFETCHING OF TASKS
    BROKER_TRANSPORT_OPTIONS={
        'priority_steps': [0, 1]  # ALLOW ONLY 2 TASK PRIORITIES
    }
)

if __name__ == '__main__':
    app.start()
chord_test.py
from __future__ import absolute_import
from sophie.celery import app
from celery import chord
@app.task(name='sophie.lib.add')
def add(x, y):
    return x + y

@app.task(name='sophie.lib.tsum')
def tsum(numbers):
    return sum(numbers)

if __name__ == '__main__':
    tasks = [add.s(100, 100), add.s(200, 200)]
    chord(tasks, tsum.s()).apply_async()
The output of my worker logfile is as follows
$ celery worker -l info --app=sophie.celery -n worker1.%h
-------------- celery@worker1.vagrant-ubuntu-11 v3.1.6 (Cipater)
---- **** -----
--- * *** * -- Linux-3.0.0-12-server-x86_64-with-Ubuntu-11.10-oneiric
-- * - **** ---
- ** ---------- [config]
- ** ---------- .> broker: redis://localhost:6379/2
- ** ---------- .> app: sophie:0x3554250
- ** ---------- .> concurrency: 1 (prefork)
- *** --- * --- .> events: OFF (enable -E to monitor this worker)
-- ******* ----
--- ***** ----- [queues]
-------------- .> celery exchange=celery(direct) key=celery
[tasks]
. sophie.lib.add
. sophie.lib.tsum
[2013-12-12 19:37:26,499: INFO/MainProcess] Connected to redis://localhost:6379/2
[2013-12-12 19:37:26,506: INFO/MainProcess] mingle: searching for neighbors
[2013-12-12 19:37:27,512: INFO/MainProcess] mingle: all alone
[2013-12-12 19:37:27,527: WARNING/MainProcess] celery@worker1.vagrant-ubuntu-11 ready.
[2013-12-12 19:37:29,723: INFO/MainProcess] Received task: sophie.lib.add[b7d504c1-217f-43a9-b57e-86f0fcbdbe22]
[2013-12-12 19:37:29,734: INFO/MainProcess] Task sophie.lib.add[b7d504c1-217f-43a9-b57e-86f0fcbdbe22] succeeded in 0.00976990400522s: 200
[2013-12-12 19:37:29,735: INFO/MainProcess] Received task: sophie.lib.add[eb01a73e-f6c8-401d-8049-6cdbc5f0bd90]
[2013-12-12 19:37:29,737: INFO/MainProcess] Task sophie.lib.add[eb01a73e-f6c8-401d-8049-6cdbc5f0bd90] succeeded in 0.00144650500442s: 400
There is no chord_unlock being called at all. Some more output to give further context:
$ sudo pip freeze | egrep 'celery|kombu|billiard'
billiard==3.3.0.12
celery==3.1.6
kombu==3.0.7
$ uname -a
Linux vagrant-ubuntu-11 3.0.0-12-server #20-Ubuntu SMP Fri Oct 7 16:36:30 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
$ redis-server --version
Redis server version 2.2.11 (00000000:0)
The chords documentation uses the following syntax for chord:
result = chord(header)(callback)
So you could do:
chord(tasks)(tsum.s())
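For example, once the chord is actually firing, the callback result can be fetched from a Python shell like this (100 + 100 + 200 + 200 = 600):
>>> from sophie.lib.chord_test import add, tsum
>>> from celery import chord
>>> result = chord([add.s(100, 100), add.s(200, 200)])(tsum.s())
>>> result.get()
600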
There is an Important Notes section under the chords documentation that says:
Tasks used within a chord must not ignore their results. In practice this means that you must enable a CELERY_RESULT_BACKEND in order to use chords. Additionally, if CELERY_IGNORE_RESULT is set to True in your configuration, be sure that the individual tasks to be used within the chord are defined with ignore_result=False. This applies to both Task subclasses and decorated tasks.
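In code, that recommendation amounts to something like the following sketch. The question's config already sets backend='redis://localhost:6379/2', so the remaining piece is making sure the header tasks do not ignore their results if CELERY_IGNORE_RESULT is True anywhere in your settings:
@app.task(name='sophie.lib.add', ignore_result=False)
def add(x, y):
    return x + y

@app.task(name='sophie.lib.tsum', ignore_result=False)
def tsum(numbers):
    return sum(numbers)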
Even when I followed these recommendations the chord callback was not executed, but they might help others.
EDIT:
Found what it was in my case! The chord doesn't fully support specifying queue name in all cases:
https://github.com/celery/celery/issues/2085
This happened to me when my task signatures did not match the arguments provided by the flow; accepting *args might help debug such an issue.
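As a debugging aid (my own sketch, not from the linked issue; tsum_debug is a hypothetical name), a callback that accepts *args will at least log whatever the header tasks actually deliver instead of failing silently on a signature mismatch:
@app.task(name='sophie.lib.tsum_debug')
def tsum_debug(*args, **kwargs):
    # Print whatever the chord passed in, then sum it if it looks like a list of numbers.
    print('chord callback got args=%r kwargs=%r' % (args, kwargs))
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        return sum(args[0])
    return None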
