I have the following task which runs a Python script using Popen (I already tried check_output and call, with the same result):
import subprocess

@app.task
def TaskName(filename):
    try:
        proc = subprocess.Popen(['python', filename], stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        taskoutp, taskerror = proc.communicate()
        print('Output: ', taskoutp)
        print('Error: ', taskerror)
    except:
        raise Exception('Task error: ', taskerror)
If I generate an error in the subprocess, it is displayed in the worker cmd window as if it were normal output/print, and the task remains indefinitely in the 'STARTED' state, even when I manually close the window opened by the Popen subprocess.
What can I do so the task not only prints the error but actually stops and changes its status to FAILURE?
If you want to store the results of your task, you could use the result_backend parameter (or CELERY_RESULT_BACKEND, depending on the version of Celery you're using).
This parameter can be used to acknowledge a task only after it is processed:
task_acks_late
Default: Disabled.
Late ack means the task messages will be acknowledged after the task has been executed, not just before (the default behavior).
The complete list of configuration options can be found here: https://docs.celeryproject.org/en/stable/userguide/configuration.html
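For illustration only, here is a minimal sketch of how those two options might be set on a Celery app object; the app name, broker URL, and result backend below are placeholders rather than values from the question:

# Minimal sketch (not from the original post): the broker and result backend
# URLs are placeholders; substitute your own.
from celery import Celery

app = Celery('proj', broker='amqp://localhost')
app.conf.update(
    result_backend='rpc://',   # store task results so states such as FAILURE are visible
    task_acks_late=True,       # acknowledge messages only after the task has been executed
)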
I've spent the last hour and a half trying and failing to debug this test and I am utterly stumped. To simplify the process of testing the Flask server I am building, I have made a relatively simple script which starts the server, then runs pytest, kills the server, writes the outputs to files, and exits with Pytest's exit code. This code was working perfectly until today, and I haven't modified it since (aside from debugging this issue).
Here's the problem: when it gets to a certain point in the tests, it hangs. The weird thing is that this does not happen if I run my tests in any other way.
Debugging my server in VS Code, and running tests in the terminal: works
Running my server using the same code used in the test script and running pytest manually: works
Running pytest using the test script and running the server through the start server script (which uses the same code for running the server as the test script does) in a second terminal: works
Here's the other interesting thing: the tests always hang in the same place, part way through the setup fixture. It sends the clear command, and an echo request to the server (which prints the name of the current test). The database clears successfully, and the server echoes the correct information, but the echo route never exits - my tests never get a response. This echo route behaves perfectly for the 50 or so tests that happen before this point. If I comment out the test that is causing it to fail, it fails on the next test. If I comment out the call to the echo then it hangs on a later test on a completely different request to a different route. When it hangs, the server cannot be killed using a SIGTERM, but instead requires a SIGKILL.
Here is my echo route:
@debug.get('/echo')
def echo() -> IEcho:
    """
    Echo an input. This returns the given value, but also prints it to stdout
    on the server. Useful for debugging tests.

    ## Params:
    * `value` (`str`): value to echo
    """
    try:
        value = request.args['value']
    except KeyError:
        raise http_errors.BadRequest('echo route requires a `value` argument')
    to_print = f'{Fore.MAGENTA}[ECHO]\t\t{value}{Fore.RESET}'
    # Print it to both stdout and stderr to ensure it is seen across all logs
    # Otherwise it could be more difficult to figure out what's up with server
    # output
    print(to_print)
    print(to_print, file=sys.stderr)
    return {'value': value}
And here is my code that sends the requests:
def get(token: JWT | None, url: str, params: dict) -> dict:
    """
    Returns the response to a GET web request

    This also parses the response to help with error checking

    ### Args:
    * `url` (`str`): URL to request to
    * `params` (`dict`): parameters to send

    ### Returns:
    * `dict`: response data
    """
    return handle_response(requests.get(
        url,
        params=params,
        headers=encode_headers(token),
        timeout=3
    ))
def echo(value: str) -> IEcho:
    """
    Echo an input. This returns the given value, but also prints it to stdout
    on the server. Useful for debugging tests.

    ## Params:
    * `value` (`str`): value to echo
    """
    return cast(IEcho, get(None, f"{URL}/echo", {"value": value}))
@pytest.fixture(autouse=True)
def before_each(request: pytest.FixtureRequest):
    """Clear the database between tests"""
    clear()
    echo(f"{request.module.__name__}.{request.function.__name__}")
    print("After echo")  # This never prints
Here is my code for running Pytest in my test script
def pytest():
    pytest = subprocess.Popen(
        [sys.executable, '-u', '-m', 'pytest', '-v', '-s'],
    )
    # Wait for tests to finish
    print("🔨 Running tests...")
    try:
        ret = pytest.wait()
    except KeyboardInterrupt:
        print("❗ Testing cancelled")
        pytest.terminate()
        # write_outputs(pytest, None)
        # write_outputs(pytest, "pytest")
        raise
    # write_outputs(pytest, "pytest")
    if ret == 0:
        print("✅ It works!")
    else:
        print("❌ Tests failed")
    return bool(ret)
And here is my code for running my server in my test script:
def backend(debug=False, live_output=False):
    env = os.environ.copy()
    if debug:
        env.update({"ENSEMBLE_DEBUG": "TRUE"})
        debug_flag = ["--debug"]
    else:
        debug_flag = []
    if live_output is False:
        outputs = subprocess.PIPE
    else:
        outputs = None
    flask = subprocess.Popen(
        [sys.executable, '-u', '-m', 'flask'] + debug_flag + ['run'],
        env=env,
        stderr=outputs,
        stdout=outputs,
    )
    if outputs is not None and (flask.stderr is None or flask.stdout is None):
        print("❗ Can't read flask output", file=sys.stderr)
        flask.kill()
        sys.exit(1)
    # Request until we get a success, but crash if we failed to start in 10
    # seconds
    start_time = time.time()
    started = False
    while time.time() - start_time < 10:
        try:
            requests.get(
                f'http://localhost:{os.getenv("FLASK_RUN_PORT")}/debug/echo',
                params={'value': 'Test script startup...'},
            )
        except requests.ConnectionError:
            continue
        started = True
        break
    if not started:
        print("❗ Server failed to start in time")
        flask.kill()
        if outputs is not None:
            write_outputs(flask, None)
        sys.exit(1)
    else:
        if flask.poll() is not None:
            print("❗ Server crashed during startup")
            if outputs is not None:
                write_outputs(flask, None)
            sys.exit(1)
    print("✅ Server started")
    return flask
So in summary, does anyone have any idea what on earth is happening? It freezes on such a simple route that it makes me very concerned. I think I may have found some crazy bug in Flask or in the requests library or something.
Even if you don't know what's happening with this, it'd be really helpful to have any ideas as to how I can debug this further, as I have absolutely no idea what is going on.
It turns out that my server output was filling up all the buffer space in the pipe, meaning that it would wait for the buffer to empty. The issue is that my test script was waiting for the tests to exit, and the tests could not progress unless the server was active. As such, the code reached a three-way deadlock. I fixed it by redirecting my output through a file (where limited buffer size wasn't a problem).
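As a rough sketch of that fix (not the asker's exact code), the server's output can be sent to files instead of pipes, so the bounded pipe buffer can never fill up and block the child; the file names here are just examples:

# Sketch: redirect the Flask subprocess's output to files rather than
# subprocess.PIPE, so the OS pipe buffer cannot fill and deadlock the child.
import subprocess
import sys

stdout_file = open("flask_stdout.log", "w")   # example file names
stderr_file = open("flask_stderr.log", "w")

flask = subprocess.Popen(
    [sys.executable, '-u', '-m', 'flask', 'run'],
    stdout=stdout_file,
    stderr=stderr_file,
)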
I would like to pause DAGs that are idle and redundant. How do I know which DAGs are unpaused and which are paused?
So I have a list of DAGs that are to be paused using a bash command that executes airflow pause <dag_id>. I would like to know whether the command succeeded by checking the paused state of each DAG. I've checked the Airflow webserver and it seems that all of my paused DAGs are still running.
def pause_idle_dags(dags=["myTutorial"]):
    """
    Pauses dags from the airflow
    :param dags: dags considered to be idle
    :return: Success state
    """
    # TODO
    for dag in dags:
        command = "airflow pause {}".format(dag)
        print(executeBashCommand(command))
def executeBashCommand(command):
    print('========RUN========', command)
    p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        print('========STDOUT========\n', stdout.decode())
        print('========STDERR========\n', stderr.decode())
        raise Exception('There is error while executing bash command: ' + command + '\nwith log:\n' + stderr.decode())
    return stdout, stderr
When you run Airflow commands that instruct it to perform some action, it edits the internal state of the backend database it is connected to. By default Airflow uses SQLite, though you may have set up your own backend. Either way, you can query that database to check.
e.g. airflow=# select * from dag where is_paused;
Here you can obviously also perform updates such as
airflow=# update dag set is_paused='false' where is_paused;
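If you would rather check from Python, a rough sketch along the same lines is shown below; it assumes the default SQLite metadata database at ~/airflow/airflow.db, so adjust the connection for whatever backend you actually use:

# Rough sketch (my own, not from the answer above): read the is_paused flag
# from the `dag` table. Assumes the default SQLite backend at ~/airflow/airflow.db.
import os
import sqlite3

def get_paused_state(dag_ids):
    db_path = os.path.expanduser("~/airflow/airflow.db")
    placeholders = ",".join("?" for _ in dag_ids)
    query = "SELECT dag_id, is_paused FROM dag WHERE dag_id IN ({})".format(placeholders)
    with sqlite3.connect(db_path) as conn:
        return dict(conn.execute(query, dag_ids).fetchall())

print(get_paused_state(["myTutorial"]))  # e.g. {'myTutorial': 1}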
I've just started using the psycopg2 module to access a local PostgreSQL server, and I'd like a way to programmatically check whether the server is already started so my program can handle the error when I try to start the server:
import os
import psycopg2

psqldir = 'C:/Program Files/PostgreSQL/9.2/bin'  # Windows
username = os.environ.get('USERNAME')
dbname = 'mydb'
os.chdir(psqldir)
os.system('pg_ctl start -D C:/mydatadir')  # Error here if server already started
conn = psycopg2.connect('dbname=' + dbname + ' user=' + username)
cur = conn.cursor()
I experimented a little and it seems that this returns a "0" or "3", which would solve my problem, but I didn't find any information in the PostgreSQL/psycopg2 manuals confirming that this is documented behavior:
server_state = os.system('pg_ctl status -D C:/mydatadir')
What's the best way? Thanks!
From the pg_ctl documentation:
-w
Wait for the startup or shutdown to complete. Waiting is the default option for shutdowns, but not startups. When waiting for startup, pg_ctl repeatedly attempts to connect to the server. When waiting for shutdown, pg_ctl waits for the server to remove its PID file. pg_ctl returns an exit code based on the success of the startup or shutdown.
You will also find pg_ctl status useful:
status mode checks whether a server is running in the specified data directory. If it is, the PID and the command line options that were used to invoke it are displayed. If the server is not running, the process returns an exit status of 3.
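Building on the exit codes quoted above, here is a hedged sketch of checking the status from Python; it assumes pg_ctl is on the PATH and reuses the data directory from the question:

# Sketch only: pg_ctl status exits with 0 when the server is running and 3
# when it is not, per the documentation quoted above.
import subprocess

def postgres_is_running(datadir='C:/mydatadir'):
    result = subprocess.run(['pg_ctl', 'status', '-D', datadir],
                            capture_output=True, text=True)
    return result.returncode == 0  # exit status 3 means the server is not running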
Try to use the pgrep command.
proc = subprocess.Popen(["pgrep -u postgres -f -- -D"], stdout=subprocess.PIPE, shell=True)
(out, err) = proc.communicate()
try:
    if int(out) > 0:
        return True
except Exception as e:
    return False
We are using supervisor to deploy a Python web application. On deployment, the web application is installed on the server through buildout, and a script for running supervisor is created using collective.recipe.supervisor. This script is called at the end of the deployment process by a fabric script. The problem is that when the deployment script is finished, a SIGHUP signal is sent to the process, which causes supervisor to restart (as per this line: https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisord.py#L300 ), but for some reason the web app is not restarted after it is terminated. There is no log output after the following:
2012-10-24 15:23:51,510 WARN received SIGHUP indicating restart request
2012-10-24 15:23:51,511 INFO waiting for app-server to die
2012-10-24 15:23:54,650 INFO waiting for app-server to die
2012-10-24 15:23:57,653 INFO waiting for app-server to die
2012-10-24 15:24:00,657 INFO waiting for app-server to die
2012-10-24 15:24:01,658 WARN killing 'app-server' (28981) with SIGKILL
2012-10-24 15:24:01,659 INFO stopped: app-server (terminated by SIGKILL)
So I have two questions. The first one is, does anyone know why supervisor restarts on SIGHUP? I couldn't find any explanation for this, and there are no command line options that would turn this behavior off. The second question is, how can we fix the problem we are facing? We tried starting supervisor with a nohup, but the SIGHUP is still received. The weird thing is that this doesn't happen when I log on to the server, start supervisor by hand, and log out.
Here is the supervisor script generated by buildout:
#!/usr/bin/python2.6
import sys
sys.path[0:0] = [
    '/home/username/.buildout/eggs/supervisor-3.0b1-py2.6.egg',
    '/home/username/.buildout/eggs/meld3-0.6.9-py2.6.egg',
    '/home/username/.buildout/eggs/distribute-0.6.30-py2.6.egg',
]

import sys; sys.argv.extend(["-c","/home/username/app_directory/parts/supervisor/supervisord.conf"])
import supervisor.supervisord

if __name__ == '__main__':
    sys.exit(supervisor.supervisord.main())
And here is the configuration file for supervisor, also generated by buildout:
[supervisord]
childlogdir = /home/username/app_directory/var/log
logfile = /home/username/app_directory/var/log/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /home/username/app_directory/var/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false
[unix_http_server]
file = /home/username/app_directory/supervisor.sock
username = username
password = apasswd
chmod = 0700
[supervisorctl]
serverurl = unix:///home/username/app_directory/supervisor.sock
username = username
password = apasswd
[rpcinterface:supervisor]
supervisor.rpcinterface_factory=supervisor.rpcinterface:make_main_rpcinterface
[program:app-server]
command = /home/username/app_directory/bin/gunicorn --bind 0.0.0.0:5000 app:wsgi
process_name = app-server
directory = /home/username/app_directory/bin
priority = 50
redirect_stderr = false
directory = /home/username/app_directory
We don't want to install a patched version of supervisor before really understanding the problem, so any information would be highly appreciated.
Thanks in advance
Restarting or reloading on SIGHUP is common practice in system programming for Linux. The question is why you are getting SIGHUP after the deployment ends. Since supervisor daemonizes itself correctly (you can start it, log out, and it keeps working), the reload signal may be sent to supervisor by the build bot, indicating that the web app needs to be restarted because the code has changed.
So supervisor initiates the app shutdown in order to start the app with the new code. But the app does not stop within the given timeout, so supervisor decides that the app hung and kills it with SIGKILL.
To solve the problem, you need to teach the app to shut down when supervisor asks it to.
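As an illustration of that point (my sketch, not part of the original answer), a plain Python process can be made to exit cleanly on supervisor's default stop signal, SIGTERM; gunicorn already handles SIGTERM itself, so this only applies to processes that don't:

# Sketch: exit cleanly when supervisor sends SIGTERM, so it never has to
# escalate to SIGKILL after the stop timeout. The cleanup step is a placeholder.
import signal
import sys

def handle_sigterm(signum, frame):
    # Close connections, flush buffers, etc., then exit with status 0.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)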
The supervisord docs clearly state that sending SIGHUP to a supervisord process will "stop all processes, reload the configuration from the first config file it finds, and restart all processes".
ref - http://supervisord.org/running.html#signal-handlers
Perhaps your process is misbehaving; it looks like supervisor made several attempts to nicely shut it down, but then decided it needed a hard kill:
process.py:560
# kill processes which are taking too long to stop with a final
# sigkill. if this doesn't kill it, the process will be stuck
# in the STOPPING state forever.
self.config.options.logger.warn(
    'killing %r (%s) with SIGKILL' % (self.config.name, self.pid))
self.kill(signal.SIGKILL)
Maybe the kill call is failing?
You might have run into this bug: https://github.com/Supervisor/supervisor/issues/121
The workaround would be to downgrade supervisord until that is fixed in a released version.
Ran into exactly the same problem; downgrading to 3.0a10 solved it.
I'm using Celery to manage asynchronous tasks. Occasionally, however, the celery process goes down which causes none of the tasks to get executed. I would like to be able to check the status of celery and make sure everything is working fine, and if I detect any problems display an error message to the user. From the Celery Worker documentation it looks like I might be able to use ping or inspect for this, but ping feels hacky and it's not clear exactly how inspect is meant to be used (if inspect().registered() is empty?).
Any guidance on this would be appreciated. Basically what I'm looking for is a method like so:
def celery_is_alive():
    from celery.task.control import inspect
    return bool(inspect().registered())  # is this right??
EDIT: It doesn't even look like registered() is available on celery 2.3.3 (even though the 2.1 docs list it). Maybe ping is the right answer.
EDIT: Ping also doesn't appear to do what I thought it would do, so still not sure the answer here.
Here's the code I've been using. celery.task.control.Inspect.stats() returns a dict containing lots of details about the currently available workers, None if there are no workers running, or raises an IOError if it can't connect to the message broker. I'm using RabbitMQ - it's possible that other messaging systems might behave slightly differently. This worked in Celery 2.3.x and 2.4.x; I'm not sure how far back it goes.
def get_celery_worker_status():
    ERROR_KEY = "ERROR"
    try:
        from celery.task.control import inspect
        insp = inspect()
        d = insp.stats()
        if not d:
            d = {ERROR_KEY: 'No running Celery workers were found.'}
    except IOError as e:
        from errno import errorcode
        msg = "Error connecting to the backend: " + str(e)
        if len(e.args) > 0 and errorcode.get(e.args[0]) == 'ECONNREFUSED':
            msg += ' Check that the RabbitMQ server is running.'
        d = {ERROR_KEY: msg}
    except ImportError as e:
        d = {ERROR_KEY: str(e)}
    return d
From the documentation of celery 4.2:
from your_celery_app import app

def get_celery_worker_status():
    i = app.control.inspect()
    availability = i.ping()
    stats = i.stats()
    registered_tasks = i.registered()
    active_tasks = i.active()
    scheduled_tasks = i.scheduled()
    result = {
        'availability': availability,
        'stats': stats,
        'registered_tasks': registered_tasks,
        'active_tasks': active_tasks,
        'scheduled_tasks': scheduled_tasks,
    }
    return result
Of course, you could/should improve the code with error handling...
To check the same using the command line, in case Celery is running as a daemon:
Activate the virtualenv and go to the directory where the 'app' is.
Now run: celery -A [app_name] status
It will show whether Celery is up or not, plus the number of nodes online.
Source:
http://michal.karzynski.pl/blog/2014/05/18/setting-up-an-asynchronous-task-queue-for-django-using-celery-redis/
The following worked for me:
import socket
from kombu import Connection
celery_broker_url = "amqp://localhost"
try:
    conn = Connection(celery_broker_url)
    conn.ensure_connection(max_retries=3)
except socket.error:
    raise RuntimeError("Failed to connect to RabbitMQ instance at {}".format(celery_broker_url))
One method to test if any worker is responding is to send out a 'ping' broadcast and return with a successful result on the first response.
from .celery import app  # the celery 'app' created in your project

def is_celery_working():
    result = app.control.broadcast('ping', reply=True, limit=1)
    return bool(result)  # True if at least one result
This broadcasts a 'ping' and will wait up to one second for responses. As soon as the first response comes in, it will return a result. If you want a False result faster, you can add a timeout argument to reduce how long it waits before giving up.
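For example, to the best of my understanding a call like the one below caps the wait at roughly half a second (the value is only an illustration):

# Same check as above, but stop waiting after about 0.5 seconds.
result = app.control.broadcast('ping', reply=True, limit=1, timeout=0.5)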
I found an elegant solution:
from .celery import app
try:
    app.broker_connection().ensure_connection(max_retries=3)
except Exception as ex:
    raise RuntimeError("Failed to connect to celery broker, {}".format(str(ex)))
You can use the ping method to check whether any worker (or a specific worker) is alive or not: https://docs.celeryproject.org/en/latest/_modules/celery/app/control.html#Control.ping
celery_app.control.ping()
You can test in your terminal by running the following command.
celery -A proj_name worker -l INFO
You can review the output every time your Celery worker runs.
The script below worked for me.
# Import the celery app from your project
from application_package import app as celery_app

import logging
logger = logging.getLogger(__name__)

def get_celery_worker_status():
    insp = celery_app.control.inspect()
    nodes = insp.stats()
    if not nodes:
        raise Exception("celery is not running.")
    logger.error("celery workers are: {}".format(nodes))
    return nodes
Run celery status to get the status.
When celery is running,
(venv) ubuntu@server1:~/project-dir$ celery status
-> celery@server1: OK
1 node online.
When no Celery worker is running, the following is displayed in the terminal.
(venv) ubuntu@server1:~/project-dir$ celery status
Error: No nodes replied within time constraint