What could be the reason for a Python script failing with "Exit Code: 0" and "Unhealthy allocations" on Fly.io, and how can I troubleshoot it? I'm just trying to host a simple Python script on Fly.io.
#-----python script-----#
import datetime
import time

def periodic_task():
    while True:
        now = datetime.datetime.now()
        week_number = now.isocalendar()[1]
        print(f"Week number: {week_number}")
        # Sleep before the next check
        time.sleep(5)

if __name__ == '__main__':
    periodic_task()
#-----fly.toml file-----#
# fly.toml file generated for withered-snowflake-645 on 2023-02-05T04:28:07+05:30
app = "withered-snowflake-645"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []

[build]
  builder = "paketobuildpacks/builder:base"

[env]
  PORT = "8080"

[experimental]
  auto_rollback = true

[[services]]
  http_checks = []
  internal_port = 8080
  processes = ["app"]
  protocol = "tcp"
  script_checks = []

  [services.concurrency]
    hard_limit = 25
    soft_limit = 20
    type = "connections"

  [[services.ports]]
    force_https = true
    handlers = ["http"]
    port = 80

  [[services.ports]]
    handlers = ["tls", "http"]
    port = 443

  [[services.tcp_checks]]
    grace_period = "1s"
    interval = "15s"
    restart_limit = 0
    timeout = "2s"
#-----procfile-----#
web: python db_script.py
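For reference, the [[services]] and [[services.tcp_checks]] sections in the fly.toml above expect something listening on internal_port 8080, while the script itself never opens a port. The sketch below (standard library only, and just my illustration of that mismatch rather than a confirmed fix for the failure) shows what a variant that binds PORT would look like:

# Illustrative only: a variant of db_script.py that also listens on PORT,
# so the TCP check defined in fly.toml has something to connect to.
import datetime
import os
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer 200 to any GET so a TCP/HTTP check can pass
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def serve_health_check():
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()

def periodic_task():
    while True:
        week_number = datetime.datetime.now().isocalendar()[1]
        print(f"Week number: {week_number}")
        time.sleep(5)

if __name__ == "__main__":
    # Run the health-check server in the background, keep the task in the foreground
    threading.Thread(target=serve_health_check, daemon=True).start()
    periodic_task()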
The error message in the console was:
==> Creating release
--> release v4 created
--> You can detach the terminal anytime without stopping the deployment
==> Monitoring deployment
Logs: https://fly.io/apps/withered-snowflake-645/monitoring
1 desired, 1 placed, 0 healthy, 1 unhealthy [restarts: 2] [health checks: 1 total]
Failed Instances
Failure #1
Instance
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
4bdf42c6 app 4 fra run failed 1 total 2 21s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2023-02-04T23:39:19Z Received Task received by client
2023-02-04T23:39:19Z Task Setup Building Task Directory
2023-02-04T23:39:24Z Started Task started by client
2023-02-04T23:39:26Z Terminated Exit Code: 0
2023-02-04T23:39:26Z Restarting Task restarting in 1.02248256s
2023-02-04T23:39:31Z Started Task started by client
2023-02-04T23:39:33Z Terminated Exit Code: 0
2023-02-04T23:39:33Z Restarting Task restarting in 1.047249935s
2023-02-04T23:39:39Z Started Task started by client
2023-02-04T23:39:41Z Terminated Exit Code: 0
2023-02-04T23:39:41Z Not Restarting Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2023-02-04T23:39:41Z Alloc Unhealthy Unhealthy because of failed task
2023-02-04T23:39:22Z [info]Unpacking image
2023-02-04T23:39:23Z [info]Preparing kernel init
2023-02-04T23:39:24Z [info]Configuring firecracker
2023-02-04T23:39:24Z [info]Starting virtual machine
2023-02-04T23:39:24Z [info]Starting init (commit: e3cff9e)...
2023-02-04T23:39:24Z [info]Preparing to run: `/cnb/process/web` as 1000
2023-02-04T23:39:24Z [info]2023/02/04 23:39:24 listening on [fdaa:1:2a7f:a7b:b6:4bdf:42c6:2]:22 (DNS: [fdaa::3]:53)
2023-02-04T23:39:25Z [info]Starting clean up.
2023-02-04T23:39:30Z [info]Starting instance
2023-02-04T23:39:30Z [info]Configuring virtual machine
2023-02-04T23:39:30Z [info]Pulling container image
2023-02-04T23:39:31Z [info]Unpacking image
2023-02-04T23:39:31Z [info]Preparing kernel init
2023-02-04T23:39:31Z [info]Configuring firecracker
2023-02-04T23:39:31Z [info]Starting virtual machine
2023-02-04T23:39:31Z [info]Starting init (commit: e3cff9e)...
2023-02-04T23:39:31Z [info]Preparing to run: `/cnb/process/web` as 1000
2023-02-04T23:39:31Z [info]2023/02/04 23:39:31 listening on [fdaa:1:2a7f:a7b:b6:4bdf:42c6:2]:22 (DNS: [fdaa::3]:53)
2023-02-04T23:39:32Z [info]Starting clean up.
2023-02-04T23:39:37Z [info]Starting instance
2023-02-04T23:39:38Z [info]Configuring virtual machine
2023-02-04T23:39:38Z [info]Pulling container image
2023-02-04T23:39:38Z [info]Unpacking image
2023-02-04T23:39:38Z [info]Preparing kernel init
2023-02-04T23:39:38Z [info]Configuring firecracker
2023-02-04T23:39:39Z [info]Starting virtual machine
2023-02-04T23:39:39Z [info]Starting init (commit: e3cff9e)...
2023-02-04T23:39:39Z [info]Preparing to run: `/cnb/process/web` as 1000
2023-02-04T23:39:39Z [info]2023/02/04 23:39:39 listening on [fdaa:1:2a7f:a7b:b6:4bdf:42c6:2]:22 (DNS: [fdaa::3]:53)
2023-02-04T23:39:40Z [info]Starting clean up.
--> v4 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v5
--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
[Screenshot: file directory of the project folder]
I tried the Procfile with worker: python db_script.py; it didn't work.
A requirement.txt file, both with and without the dependencies listed, didn't work.
Increasing the time.sleep() interval didn't work either (I tried 5 seconds and 1 hour).
I want to know the reason for this error and how to solve it.
Can someone please help me troubleshoot this issue and find a solution to get my task running on Fly.io? Any help is greatly appreciated. Thank you!
Open your app's log in one terminal window with the command below, then try fly deploy again in another window. This lets you see longer logs.
fly logs -a appname
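One related thing worth checking while watching the logs: when stdout is not a terminal, Python block-buffers it, so print() output may not show up in fly logs even while the loop is running. A minimal sketch of forcing the output through (an assumption about why the logs look empty, not an explanation of the Exit Code: 0 itself):

# Sketch (assumption): force line-buffered / flushed output so print()
# reaches the log stream immediately instead of sitting in a stdio buffer.
import sys

sys.stdout.reconfigure(line_buffering=True)   # flush after every newline

print("Week number check starting", flush=True)  # or flush per print() call

Setting PYTHONUNBUFFERED = "1" in the [env] section of fly.toml has the same effect.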
Related
Consider the following script
tasks.py:
from celery import Celery
from celery import group

app = Celery()
app.conf.update(
    broker_url='pyamqp://guest@localhost//',
    result_backend='redis://localhost',
)

@app.task
def my_calc(data):
    for i in range(100):
        data[0] = data[0] / 1.04856
        data[1] = data[1] / 1.02496
    return data

def compute(parallel_tasks):
    tasks = []
    for i in range(parallel_tasks):
        tasks.append([i + 1.3, i + 2.65])
    job = group([my_calc.s(task) for task in tasks])
    results = job.apply_async().join(timeout=120)
    #for result in results:
    #    print(result.get(timeout=20))

def start(parallel_tasks, iterations):
    for i in range(iterations):
        print(i)
        compute(parallel_tasks)
The script executes a given number of tasks (parallel_tasks) in a given number of iterations (iterations) using Celery's group function.
The problem is that the more tasks I submit in a single iteration (the greater the parallel_tasks input parameter), the more likely the execution of the batch is to time out for an unknown reason. The workers don't get overloaded; when the timeout happens, the workers are already idle.
Calling start(2,100000) works just fine.
Calling start(20,40) stops around the 10th iteration.
The issue is independent of the broker and backend types. My primary config uses RabbitMQ as the broker and Redis as the backend, but I've also tried them the other way around, as well as RabbitMQ-only and Redis-only configurations.
I start the worker just the standard way: worker -A tasks -l info
Environment:
Miniconda - Python 3.6.6 (see requirements.txt for details below)
Debian 9 running in Virtualbox. VM Config: 4 cores and 8GB RAM
Redis 4.0.11
RabbitMQ 3.6.6 on Erlang 19.2.1
Output of celery -A tasks report:
software -> celery:4.2.1 (windowlicker) kombu:4.2.1 py:3.6.6
billiard:3.5.0.4 py-amqp:2.3.2
platform -> system:Linux arch:64bit, ELF imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:pyamqp results:redis://localhost/
RabbitMQ log contains the following errors:
=ERROR REPORT==== 7-Sep-2018::17:31:42 ===
closing AMQP connection <0.1688.0> (127.0.0.1:52602 -> 127.0.0.1:5672):
missed heartbeats from client, timeout: 60s
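The "missed heartbeats from client, timeout: 60s" line suggests the client connection sits idle inside join() for longer than the negotiated heartbeat allows. One experiment, purely an assumption on my part rather than a verified fix, would be to disable the AMQP heartbeat in the Celery config:

# Sketch only: relax the AMQP heartbeat so a long, idle join() on the
# client side is not dropped by RabbitMQ (assumption, not a verified fix).
from celery import Celery

app = Celery()
app.conf.update(
    broker_url='pyamqp://guest@localhost//',
    result_backend='redis://localhost',
    broker_heartbeat=0,  # 0 disables heartbeats for the pyamqp transport
)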
When I enabled the sshd jail, I see: Starting fail2ban: ERROR NOK: ("Failed to initialize any backend for Jail 'sshd'",)
ERROR NOK: ('sshd',)
In the logs:
ERROR Backend 'systemd' failed to initialize due to No module named systemd
ERROR Failed to initialize any backend for Jail 'sshd'
CentOS 6.7 does not have the systemd module.
CentOS 6.7, python 2.6
Just change the backend in your jail config to auto.
From
[sshd]
enabled = true
filter = sshd
port = ssh
logpath = %(sshd_log)s
backend = %(sshd_backend)s
To
[sshd]
enabled = true
filter = sshd
port = ssh
logpath = %(sshd_log)s
backend = auto
And restart the service: service fail2ban restart
The reason for this error is that after fail2ban is installed, the configuration file /etc/fail2ban/paths-fedora.conf contains several lines that set the backends for some applications to systemd, which is not present in CentOS 6.x.
Just remove all lines like
syslog_backend = systemd
sshd_backend = systemd
dropbear_backend = systemd
proftpd_backend = systemd
pureftpd_backend = systemd
wuftpd_backend = systemd
postfix_backend = systemd
dovecot_backend = systemd
from /etc/fail2ban/paths-fedora.conf (or search for the file, which contains such strings, using grep). In this case you do not need to change backend = %(sshd_backend)s to backend = auto -- everything will work fine without such changes.
I was able to fix this by editing the paths-common.conf file from:
default_backend = %(default/backend)s
to:
default_backend = pynotify (or default_backend = auto)
I'm trying to use dispy to distribute work to ec2 instances.
I've followed:
Using dispy with port forwarding via ssh tunnel
http://dispy.sourceforge.net/dispy.html#cloud
but it's not going anywhere; the client script is hung and the server node doesn't receive anything.
What I have right now is:
From my machine:
ssh -i mypemfile.pem -R 51347:localhost:51347 ubuntu@ec2-52-30-185-175.eu-west-1.compute.amazonaws.com
Then on the remote machine:
sudo dispynode.py --ext_ip_addr ec2-52-30-185-175.eu-west-1.compute.amazonaws.com -d
and I get:
2016-02-03 18:38:39,410 - dispynode - dispynode version 4.6.7
2016-02-03 18:38:39,414 - asyncoro - poller: epoll
2016-02-03 18:38:39,417 - dispynode - serving 1 cpus at 172.31.26.18:51348
2016-02-03 18:38:39,417 - dispynode - tcp server at 172.31.26.18:51348
2016-02-03 18:38:39,422 - asyncoro - waiting for 2 coroutines to terminate
Enter "quit" or "exit" to terminate dispynode,
"stop" to stop service, "start" to restart service,
"cpus" to change CPUs used, anything else to get status:
On the client machine I run:
import dispy

def f(x):
    return 2 * x

cluster = dispy.JobCluster(f, nodes=['ec2-52-30-185-175.eu-west-1.compute.amazonaws.com'], ext_ip_addr='localhost')
job = cluster.submit(2)
print('result: %s' % job())
And nothing happens; it's just stuck.
Thanks.
I have a very simple Celery task that runs a (long running) shell script:
import os
from celery import Celery

os.environ['CELERY_TIMEZONE'] = 'Europe/Rome'
os.environ['TIMEZONE'] = 'Europe/Rome'

app = Celery('tasks', backend='redis', broker='redis://OTHER_SERVER:6379/0')

@app.task(name='ct.execute_script')
def execute_script(command):
    return os.system(command)
I have this task running on server MY_SERVER, and I launch it from OTHER_SERVER, which also runs the Redis database.
The task seems to run successfully (I see the result of executing the script on the filesystem), but then I always start getting the following error:
INTERNAL ERROR: ConnectionError('Error 111 connecting to localhost:6379. Connection refused.',)
What could it be? Why is it trying to contact localhost while I've set the Redis server to be redis://OTHER_SERVER:6379/0 and it works (since the task is launched)? Thanks
When you set the backend argument, Celery will use it as the result backend.
In your code, you tell Celery to use the local Redis server as the result backend.
You see the ConnectionError because Celery can't save the result to a local Redis server.
You can disable the result backend, start a local Redis server, or point the backend at OTHER_SERVER.
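For example, a minimal sketch of that last option, reusing the OTHER_SERVER host from the question (the /1 database number is an arbitrary choice):

import os
from celery import Celery

os.environ['CELERY_TIMEZONE'] = 'Europe/Rome'
os.environ['TIMEZONE'] = 'Europe/Rome'

# Point the result backend at the same Redis host as the broker,
# instead of the bare 'redis' default that resolves to localhost.
app = Celery('tasks',
             backend='redis://OTHER_SERVER:6379/1',
             broker='redis://OTHER_SERVER:6379/0')

# Alternatively, if the return value is never read, skip results entirely:
# @app.task(name='ct.execute_script', ignore_result=True)
@app.task(name='ct.execute_script')
def execute_script(command):
    return os.system(command)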
ref:
http://celery.readthedocs.org/en/latest/getting-started/first-steps-with-celery.html#keeping-results
http://celery.readthedocs.org/en/latest/configuration.html#celery-result-backend
We are using supervisor to deploy a Python web application. On deployment, the web application is installed on the server through buildout, and a script for running supervisor is created using collective.recipe.supervisor. This script is called at the end of the deployment process by a fabric script. The problem is that when the deployment script finishes, a SIGHUP signal is sent to the process, which causes supervisor to restart (as per this line: https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisord.py#L300 ), but for some reason the web app is not restarted after it is terminated. There is no log output after the following:
2012-10-24 15:23:51,510 WARN received SIGHUP indicating restart request
2012-10-24 15:23:51,511 INFO waiting for app-server to die
2012-10-24 15:23:54,650 INFO waiting for app-server to die
2012-10-24 15:23:57,653 INFO waiting for app-server to die
2012-10-24 15:24:00,657 INFO waiting for app-server to die
2012-10-24 15:24:01,658 WARN killing 'app-server' (28981) with SIGKILL
2012-10-24 15:24:01,659 INFO stopped: app-server (terminated by SIGKILL)
So I have two questions. The first one is, does anyone know why supervisor restarts on SIGHUP? I couldn't find any explanation for this, and there are no command line options that would turn this behavior off. The second question is, how can we fix the problem we are facing? We tried starting supervisor with a nohup, but the SIGHUP is still received. The weird thing is that this doesn't happen when I log on to the server, start supervisor by hand, and log out.
Here is the supervisor script generated by buildout:
#!/usr/bin/python2.6
import sys
sys.path[0:0] = [
    '/home/username/.buildout/eggs/supervisor-3.0b1-py2.6.egg',
    '/home/username/.buildout/eggs/meld3-0.6.9-py2.6.egg',
    '/home/username/.buildout/eggs/distribute-0.6.30-py2.6.egg',
]

import sys; sys.argv.extend(["-c", "/home/username/app_directory/parts/supervisor/supervisord.conf"])

import supervisor.supervisord

if __name__ == '__main__':
    sys.exit(supervisor.supervisord.main())
And here is the configuration file for supervisor, also generated by buildout:
[supervisord]
childlogdir = /home/username/app_directory/var/log
logfile = /home/username/app_directory/var/log/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /home/username/app_directory/var/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false
[unix_http_server]
file = /home/username/app_directory/supervisor.sock
username = username
password = apasswd
chmod = 0700
[supervisorctl]
serverurl = unix:///home/username/app_directory/supervisor.sock
username = username
password = apasswd
[rpcinterface:supervisor]
supervisor.rpcinterface_factory=supervisor.rpcinterface:make_main_rpcinterface
[program:app-server]
command = /home/username/app_directory/bin/gunicorn --bind 0.0.0.0:5000 app:wsgi
process_name = app-server
directory = /home/username/app_directory/bin
priority = 50
redirect_stderr = false
directory = /home/username/app_directory
We don't want to install a patched version of supervisor before really understanding the problem, so any information would be highly appreciated.
Thanks in advance
Restarting or reloading on SIGHUP is common practice in system programming for Linux. The question is why you are getting a SIGHUP after the deployment ends. Since supervisor daemonizes itself correctly (you can start it by hand, log out, and it keeps working), the reload signal may be sent to supervisor by the build bot, indicating that the web app needs to be restarted because the code has changed.
So supervisor initiates the app shutdown in order to start the app with the new code. But the app does not stop within the given timeout, so supervisor decides the app has hung and kills it with SIGKILL.
To solve the problem, you need to teach the app to shut down when supervisor asks it to.
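As a generic illustration (the poster's app is actually started through gunicorn, which already handles this), a long-running Python process that cooperates with supervisor's stop request could look like this sketch; SIGTERM is supervisor's default stopsignal:

# Sketch: handle the stop signal supervisor sends (SIGTERM by default,
# configurable via stopsignal) so the process exits before stopwaitsecs
# runs out and SIGKILL is used.
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    while not shutting_down:
        # ... do one unit of work ...
        time.sleep(1)
    # clean up (close connections, flush buffers, ...) and exit promptly
    sys.exit(0)

if __name__ == '__main__':
    main()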
The supervisord docs clearly state that sending SIGHUP to a supervisord process will "stop all processes, reload the configuration from the first config file it finds, and restart all processes".
ref - http://supervisord.org/running.html#signal-handlers
Perhaps your process is misbehaving; it looks like supervisor made several attempts to nicely shut it down, but then decided it needed a hard kill:
process.py:560
# kill processes which are taking too long to stop with a final
# sigkill. if this doesn't kill it, the process will be stuck
# in the STOPPING state forever.
self.config.options.logger.warn(
    'killing %r (%s) with SIGKILL' % (self.config.name, self.pid))
self.kill(signal.SIGKILL)
Maybe the kill call is failing?
You might have run into this bug: https://github.com/Supervisor/supervisor/issues/121
The workaround would be to downgrade supervisord until that is fixed in a released version.
Ran into exactly the same problem; downgrading to 3.0a10 solved it.