I'm trying to use dispy to distribute work to EC2 instances.
I've followed:
Using dispy with port forwarding via ssh tunnel
http://dispy.sourceforge.net/dispy.html#cloud
but it's not going anywhere: the client script hangs and the server node doesn't receive anything.
What I have right now is:
From my machine:
ssh -i mypemfile.pem -R 51347:localhost:51347 ubuntu@ec2-52-30-185-175.eu-west-1.compute.amazonaws.com
Then on the remote machine:
sudo dispynode.py --ext_ip_addr ec2-52-30-185-175.eu-west-1.compute.amazonaws.com -d
and I get:
2016-02-03 18:38:39,410 - dispynode - dispynode version 4.6.7
2016-02-03 18:38:39,414 - asyncoro - poller: epoll
2016-02-03 18:38:39,417 - dispynode - serving 1 cpus at 172.31.26.18:51348
2016-02-03 18:38:39,417 - dispynode - tcp server at 172.31.26.18:51348
2016-02-03 18:38:39,422 - asyncoro - waiting for 2 coroutines to terminate
Enter "quit" or "exit" to terminate dispynode,
"stop" to stop service, "start" to restart service,
"cpus" to change CPUs used, anything else to get status:
On the client machine I run:
import dispy
def f(x):
    return 2*x

cluster = dispy.JobCluster(f, nodes=['ec2-52-30-185-175.eu-west-1.compute.amazonaws.com'], ext_ip_addr='localhost')
job = cluster.submit(2)
print('result: %s' % job())
and nothing happens, it's just stuck.
thanks
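A minimal sketch for narrowing down where it hangs, assuming dispy's documented DispyJob attributes (status, result, exception): poll the job instead of blocking in job(), so a job that never gets scheduled is visible.
import time
import dispy

def f(x):
    return 2*x

cluster = dispy.JobCluster(f, nodes=['ec2-52-30-185-175.eu-west-1.compute.amazonaws.com'], ext_ip_addr='localhost')
job = cluster.submit(2)

# poll the job status for up to 30 seconds instead of blocking in job()
for _ in range(30):
    if job.status in (dispy.DispyJob.Finished, dispy.DispyJob.Terminated):
        break
    time.sleep(1)

print('status: %s, result: %s, exception: %s' % (job.status, job.result, job.exception))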
What could be the reason for a Python script failing with "Exit Code: 0" and "Unhealthy allocations" on Fly.io, and how can I troubleshoot it? I'm just trying to host a simple Python script on fly.io.
#-----python script-----#
import datetime
import time

def periodic_task():
    while True:
        now = datetime.datetime.now()
        week_number = now.isocalendar()[1]
        print(f"Week number: {week_number}")
        # Sleep for a minute
        time.sleep(5)

if __name__ == '__main__':
    periodic_task()
#-----fly.toml file-----#
# fly.toml file generated for withered-snowflake-645 on 2023-02-05T04:28:07+05:30
app = "withered-snowflake-645"
kill_signal = "SIGINT"
kill_timeout = 5
processes = []
[build]
builder = "paketobuildpacks/builder:base"
[env]
PORT = "8080"
[experimental]
auto_rollback = true
[[services]]
http_checks = []
internal_port = 8080
processes = ["app"]
protocol = "tcp"
script_checks = []
[services.concurrency]
hard_limit = 25
soft_limit = 20
type = "connections"
[[services.ports]]
force_https = true
handlers = ["http"]
port = 80
[[services.ports]]
handlers = ["tls", "http"]
port = 443
[[services.tcp_checks]]
grace_period = "1s"
interval = "15s"
restart_limit = 0
timeout = "2s"
#-----procfile-----#
web: python db_script.py
The error message in the console was:
==> Creating release
--> release v4 created
--> You can detach the terminal anytime without stopping the deployment
==> Monitoring deployment
Logs: https://fly.io/apps/withered-snowflake-645/monitoring
1 desired, 1 placed, 0 healthy, 1 unhealthy [restarts: 2] [health checks: 1 total]
Failed Instances
Failure #1
Instance
ID PROCESS VERSION REGION DESIRED STATUS HEALTH CHECKS RESTARTS CREATED
4bdf42c6 app 4 fra run failed 1 total 2 21s ago
Recent Events
TIMESTAMP TYPE MESSAGE
2023-02-04T23:39:19Z Received Task received by client
2023-02-04T23:39:19Z Task Setup Building Task Directory
2023-02-04T23:39:24Z Started Task started by client
2023-02-04T23:39:26Z Terminated Exit Code: 0
2023-02-04T23:39:26Z Restarting Task restarting in 1.02248256s
2023-02-04T23:39:31Z Started Task started by client
2023-02-04T23:39:33Z Terminated Exit Code: 0
2023-02-04T23:39:33Z Restarting Task restarting in 1.047249935s
2023-02-04T23:39:39Z Started Task started by client
2023-02-04T23:39:41Z Terminated Exit Code: 0
2023-02-04T23:39:41Z Not Restarting Exceeded allowed attempts 2 in interval 5m0s and mode is "fail"
2023-02-04T23:39:41Z Alloc Unhealthy Unhealthy because of failed task
2023-02-04T23:39:22Z [info]Unpacking image
2023-02-04T23:39:23Z [info]Preparing kernel init
2023-02-04T23:39:24Z [info]Configuring firecracker
2023-02-04T23:39:24Z [info]Starting virtual machine
2023-02-04T23:39:24Z [info]Starting init (commit: e3cff9e)...
2023-02-04T23:39:24Z [info]Preparing to run: `/cnb/process/web` as 1000
2023-02-04T23:39:24Z [info]2023/02/04 23:39:24 listening on [fdaa:1:2a7f:a7b:b6:4bdf:42c6:2]:22 (DNS: [fdaa::3]:53)
2023-02-04T23:39:25Z [info]Starting clean up.
2023-02-04T23:39:30Z [info]Starting instance
2023-02-04T23:39:30Z [info]Configuring virtual machine
2023-02-04T23:39:30Z [info]Pulling container image
2023-02-04T23:39:31Z [info]Unpacking image
2023-02-04T23:39:31Z [info]Preparing kernel init
2023-02-04T23:39:31Z [info]Configuring firecracker
2023-02-04T23:39:31Z [info]Starting virtual machine
2023-02-04T23:39:31Z [info]Starting init (commit: e3cff9e)...
2023-02-04T23:39:31Z [info]Preparing to run: `/cnb/process/web` as 1000
2023-02-04T23:39:31Z [info]2023/02/04 23:39:31 listening on [fdaa:1:2a7f:a7b:b6:4bdf:42c6:2]:22 (DNS: [fdaa::3]:53)
2023-02-04T23:39:32Z [info]Starting clean up.
2023-02-04T23:39:37Z [info]Starting instance
2023-02-04T23:39:38Z [info]Configuring virtual machine
2023-02-04T23:39:38Z [info]Pulling container image
2023-02-04T23:39:38Z [info]Unpacking image
2023-02-04T23:39:38Z [info]Preparing kernel init
2023-02-04T23:39:38Z [info]Configuring firecracker
2023-02-04T23:39:39Z [info]Starting virtual machine
2023-02-04T23:39:39Z [info]Starting init (commit: e3cff9e)...
2023-02-04T23:39:39Z [info]Preparing to run: `/cnb/process/web` as 1000
2023-02-04T23:39:39Z [info]2023/02/04 23:39:39 listening on [fdaa:1:2a7f:a7b:b6:4bdf:42c6:2]:22 (DNS: [fdaa::3]:53)
2023-02-04T23:39:40Z [info]Starting clean up.
--> v4 failed - Failed due to unhealthy allocations - no stable job version to auto revert to and deploying as v5
--> Troubleshooting guide at https://fly.io/docs/getting-started/troubleshooting/
Error abort
[image: file directory of the folder]
I tried the Procfile with worker: python db_script.py; it didn't work.
A requirements.txt file, with and without the dependencies listed, didn't work either.
Increasing the time in time.sleep() didn't work either (tried 5 sec and 1 hr).
I want to know the reason for this error and how to solve it.
Can someone please help me troubleshoot this issue and find a solution to get my task running on Fly.io? Any help is greatly appreciated. Thank you!
Open your app's log in one terminal window with the command below, then try "fly deploy" again in another window. This lets us see longer logs.
fly logs -a appname
I am using kubernetes-client/python and want to write a method that blocks until a set of Pods is in the Ready (Running) state. I found that kubectl supports a wait --for command for doing the same thing from the command line. Can someone please help me achieve the same functionality using the Kubernetes Python client?
To be precise, I am mostly interested in the equivalent of:
kubectl wait --for condition=Ready pod -l 'app in (kafka,elasticsearch)'
You can use the watch functionality available in the client library.
from kubernetes import client, config, watch
import logging
import time

logger = logging.getLogger(__name__)

def wait_for_pod_running(namespace, label, full_name, timeout=60):
    config.load_kube_config()
    core_v1 = client.CoreV1Api()
    start_time = time.time()
    w = watch.Watch()
    for event in w.stream(func=core_v1.list_namespaced_pod,
                          namespace=namespace,
                          label_selector=label,
                          timeout_seconds=timeout):
        if event["object"].status.phase == "Running":
            w.stop()
            end_time = time.time()
            logger.info("%s started in %0.2f sec", full_name, end_time - start_time)
            return
        # event["type"] is one of ADDED, MODIFIED, DELETED
        if event["type"] == "DELETED":
            # Pod was deleted while we were waiting for it to start.
            logger.debug("%s deleted before it started", full_name)
            w.stop()
            return
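A usage sketch against the selector from the question (the namespace and display name passed in here are assumptions for illustration). Note that phase "Running" is not quite the same as the Ready condition; checking readiness strictly would mean looking for a condition of type "Ready" with status "True" in event["object"].status.conditions.
# hypothetical call: block until a pod matching the selector reports Running
wait_for_pod_running(namespace="default",
                     label="app in (kafka,elasticsearch)",
                     full_name="kafka/elasticsearch pods",
                     timeout=120)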
I have got a successful connection between a Kafka producer and consumer on a Google Cloud Platform cluster established by:
$ cd /usr/lib/kafka
$ bin/kafka-console-producer.sh config/server.properties --broker-list \
PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092 --topic test
and executing in a new shell
$ cd /usr/lib/kafka
$ bin/kafka-console-consumer.sh --bootstrap-server \
PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092 --topic test \
--from-beginning
Now, I want to send messages to the Kafka broker using the following Python script:
from kafka import *
topic = 'test'
producer = KafkaProducer(bootstrap_servers='PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092',
                         api_version=(0, 10))
producer.send(topic, b"Test test test")
However, this results in a KafkaTimeoutError:
"Failed to update metadata after %.1f secs." % (max_wait,))
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
Looking around online told me to consider:
uncommenting listeners=... and advertised.listeners=... in the /usr/lib/kafka/config/server.properties file.
However, listeners=PLAINTEXT://:9092 does not work, and this post suggests setting PLAINTEXT://<external-ip>:9092.
So I started wondering about accessing the Kafka server through an external (static) IP address of the GCP cluster. We then set up a firewall rule to open the port (?) and allow https access to the cluster, but I am unsure whether this is overkill for the problem.
I definitely need some guidance to connect successfully to the Kafka server from the Python script.
You need to set advertised.listeners to the address that your client connects to.
More info: https://rmoff.net/2018/08/02/kafka-listeners-explained/
Thanks Robin! The link you posted was very helpful for finding the working configuration below.
Even though SimpleProducer seems to be a deprecated approach, the following settings finally worked for me:
Python script:
from kafka import *
topic = 'test'
kafka = KafkaClient('[project-name]-w-0.c.[cluster-id].internal:9092')
producer = SimpleProducer(kafka)
message = "Test"
producer.send_messages(topic, message.encode('utf-8'))
and uncomment in the /usr/lib/kafka/config/server.properties file:
listeners=PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092
advertised.listeners=PLAINTEXT://[project-name]-w-0.c.[cluster-id].internal:9092
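For reference, the same thing should also work with the non-deprecated KafkaProducer API (a sketch, assuming kafka-python and that the advertised listener above resolves from the client machine):
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='[project-name]-w-0.c.[cluster-id].internal:9092')
producer.send('test', b'Test test test')
# send() is asynchronous; flush() blocks until the message is actually delivered
producer.flush()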
We are using supervisor to deploy a Python web application. On deployment, the web application is installed on the server through buildout, and a script for running supervisor is created using collective.recipe.supervisor. This script is called at the end of the deployment process by a Fabric script. The problem is that when the deployment script finishes, a SIGHUP signal is sent to the process, which causes supervisor to restart (as per this line: https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisord.py#L300), but for some reason the web app is not restarted after it is terminated. There is no log output after the following:
2012-10-24 15:23:51,510 WARN received SIGHUP indicating restart request
2012-10-24 15:23:51,511 INFO waiting for app-server to die
2012-10-24 15:23:54,650 INFO waiting for app-server to die
2012-10-24 15:23:57,653 INFO waiting for app-server to die
2012-10-24 15:24:00,657 INFO waiting for app-server to die
2012-10-24 15:24:01,658 WARN killing 'app-server' (28981) with SIGKILL
2012-10-24 15:24:01,659 INFO stopped: app-server (terminated by SIGKILL)
So I have two questions. The first one is, does anyone know why supervisor restarts on SIGHUP? I couldn't find any explanation for this, and there are no command line options that would turn this behavior off. The second question is, how can we fix the problem we are facing? We tried starting supervisor with a nohup, but the SIGHUP is still received. The weird thing is that this doesn't happen when I log on to the server, start supervisor by hand, and log out.
Here is the supervisor script generated by buildout:
#!/usr/bin/python2.6
import sys
sys.path[0:0] = [
'/home/username/.buildout/eggs/supervisor-3.0b1-py2.6.egg',
'/home/username/.buildout/eggs/meld3-0.6.9-py2.6.egg',
'/home/username/.buildout/eggs/distribute-0.6.30-py2.6.egg',
]
import sys; sys.argv.extend(["-c","/home/username/app_directory/parts/supervisor/supervisord.conf"])
import supervisor.supervisord
if __name__ == '__main__':
    sys.exit(supervisor.supervisord.main())
And here is the configuration file for supervisor, also generated by buildout:
[supervisord]
childlogdir = /home/username/app_directory/var/log
logfile = /home/username/app_directory/var/log/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /home/username/app_directory/var/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false
[unix_http_server]
file = /home/username/app_directory/supervisor.sock
username = username
password = apasswd
chmod = 0700
[supervisorctl]
serverurl = unix:///home/username/app_directory/supervisor.sock
username = username
password = apasswd
[rpcinterface:supervisor]
supervisor.rpcinterface_factory=supervisor.rpcinterface:make_main_rpcinterface
[program:app-server]
command = /home/username/app_directory/bin/gunicorn --bind 0.0.0.0:5000 app:wsgi
process_name = app-server
directory = /home/username/app_directory/bin
priority = 50
redirect_stderr = false
directory = /home/username/app_directory
We don't want to install a patched version of supervisor before really understanding the problem, so any information would be highly appreciated.
Thanks in advance
Restarting or reloading on SIGHUP is common practice in system programming for Linux. The question is why you are getting SIGHUP after the deployment ends. Since supervisor daemonizes itself correctly (you can start it by hand, log out, and it keeps working), the reload signal may be sent to supervisor by the build bot, indicating that the web app needs to be restarted because the code has changed.
So supervisor initiates an app shutdown in order to start the app with the new code. But the app does not stop within the given timeout, so supervisor decides that the app is hung and kills it with SIGKILL.
To solve the problem, you need to teach the app to shut down when supervisor asks it to; see the sketch below.
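A minimal sketch of what that can look like for a plain long-running Python process (supervisor sends SIGTERM by default when stopping a program; gunicorn already handles it, so this mainly applies if you run your own loop):
import signal
import sys
import time

def handle_stop(signum, frame):
    # exit promptly when supervisor asks us to stop, so it doesn't
    # escalate to SIGKILL after stopwaitsecs
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_stop)

while True:
    # ... do work ...
    time.sleep(1)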
The supervisord docs clearly state that sending SIGHUP to a supervisord process will "stop all processes, reload the configuration from the first config file it finds, and restart all processes".
ref - http://supervisord.org/running.html#signal-handlers
Perhaps your process is misbehaving; it looks like supervisor made several attempts to nicely shut it down, but then decided it needed a hard kill:
process.py:560
# kill processes which are taking too long to stop with a final
# sigkill. if this doesn't kill it, the process will be stuck
# in the STOPPING state forever.
self.config.options.logger.warn(
    'killing %r (%s) with SIGKILL' % (self.config.name, self.pid))
self.kill(signal.SIGKILL)
Maybe the kill call is failing?
You might have run into this bug: https://github.com/Supervisor/supervisor/issues/121
The workaround would be to downgrade supervisord until that is fixed in a released version.
Ran into exactly the same problem; downgrading to 3.0a10 solved it.
I run a RESTful API with Bottle and Python and everything works fine. The API is a daemon running on the system; if I stop the daemon from the command line, the service stops cleanly and closes all its ports and connections. But when I stop the service through the API itself, the port stays in the LISTEN state and later in TIME_WAIT; it never frees the port. I've been reading for two days; the problem seems to be that Bottle holds a socket and does not close the server cleanly, but I can't find the solution.
The code that stops and restarts the API launches a subprocess from Python, like this:
@get('/v1.0/services/<id_service>/restart')
def restart_service(id_service):
    try:
        service = __find_a_specific_service(id_service)
        if(service == None or len(service) < 1):
            logging.warning("RESTful URI: /v1.0/services/<id_service>/restart " + id_service +" , restart a specific service, service does not exists")
            response.status = utils.CODE_404
            return utils.convert_to_json(utils.FAILURE, utils.create_failed_resource(utils.WARNING, utils.SERVICES_API_SERVICE_NOT_EXIST))
        else:
            if id_service != "API":
                api.ServiceApi().restart(id_service)
            else:
                import subprocess
                args='/var/lib/stackops-head/bin/apirestd stop; sleep 5; /var/lib/stackops-head/bin/apirestd start'
                subprocess.Popen(args, shell=True)
            logging.info("RESTful URI: /v1.0/services/<id_service>/restart " + id_service +" , restart a specific service, ready to construct json response...")
            return utils.convert_to_json(utils.SERVICE, None)
    except Exception, e:
        logging.error("Services: Error during the process of restart a specific service. %r", e)
        raise HTTPError(code=utils.CODE_500, output=e.message, exception=e, traceback=None, head
To terminate a bottle process from the outside, send SIGINT.
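A sketch of doing that with the standard library instead of shelling out to a stop script (the pidfile path is an assumption about how the daemon records its PID):
import os
import signal

PIDFILE = '/var/run/apirestd.pid'  # hypothetical; adjust to your setup

with open(PIDFILE) as f:
    pid = int(f.read().strip())

# SIGINT makes the bottle/WSGI server shut down as if Ctrl-C had been pressed,
# so the listening socket is closed by the process itself rather than left behind
os.kill(pid, signal.SIGINT)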
If the app exits or is killed, all of its file descriptors/handles, including sockets, are closed by the OS.
You can also use
sudo netstat -anp --tcp
on Linux to check whether the specified port is still owned by some process. Or use
netstat -a -n -b -p tcp
on Windows to do the same thing.
TIME_WAIT is a normal state managed by the OS rather than the app, to keep a connection/port reserved for a while. Sometimes it is annoying. You can tune how long the OS keeps it, but that is not safe.
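If the goal is just to let a restarted server rebind the port while the old connection sits in TIME_WAIT, a common workaround is SO_REUSEADDR on the listening socket before bind (a plain-socket sketch; a WSGI server or custom Bottle server adapter that exposes its socket would set it the same way):
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# allow rebinding the address even if the previous socket is still in TIME_WAIT
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('0.0.0.0', 8080))
sock.listen(5)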