Is there a good way to automatically restart an instance when it reaches the end of a startup script?
I have a Python script that I want to run continuously on Compute Engine; it checks a Pub/Sub subscription fed by a GAE instance running a cron job. I haven't figured out a good way to catch every possible error, and there are many edge cases that are hard to test (e.g. the instance running out of memory). It would be better if I could just restart the instance every time the script finishes (because it should never finish). The autorestart option won't work because the instance doesn't shut down; it just stops running the script.
A simple shutdown -r now may be enough.
Or if you prefer gcloud:
gcloud compute instances reset $(hostname)
Mind that reset is a real reset, without a proper OS shutdown.
You might also want to check the documentation on resetting or restarting an instance before relying on this.
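For example, a minimal wrapper along these lines could serve as the entry point launched by the VM's startup script (the worker path is a hypothetical placeholder), rebooting whenever the worker exits for any reason:

#!/usr/bin/env python3
# reboot_on_exit.py - run the worker and reboot the VM if it ever stops.
import subprocess

WORKER = ["/usr/bin/python3", "/opt/worker/main.py"]  # hypothetical path

# Blocks until the worker exits or crashes, for whatever reason.
subprocess.call(WORKER)

# If we get here, the worker has stopped, so trigger a clean reboot.
subprocess.call(["shutdown", "-r", "now"])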
Related
I am running machine learning scripts that take a long time to finish. I want to run them on AWS on a faster processor and stop the instance when they finish.
Can boto be used within the running script to stop its own instance? Is there a simpler way?
If your EC2 instance is running Linux, you can simply issue a halt or shutdown command from within the instance to stop it. This lets you shut down your EC2 instance without requiring any IAM permissions.
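For example, at the end of your script (this assumes the instance's shutdown behaviour is set to stop, which is the default for EBS-backed instances):

import subprocess

# Ask the OS to power off; on an EBS-backed instance this stops (not
# terminates) the instance, and no AWS credentials are needed.
subprocess.call(["sudo", "shutdown", "-h", "now"])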
See Creating a Connection for how to create a connection. I've never tried this one before, so use caution. Also make sure the instance is EBS-backed; otherwise the instance will be terminated when you stop it.
import boto.ec2
import boto.utils

conn = boto.ec2.connect_to_region("us-east-1")  # or your region

# Look up this instance's own ID from the metadata service
my_id = boto.utils.get_instance_metadata()['instance-id']

# Stop (not terminate) the instance we are running on
conn.stop_instances(instance_ids=[my_id])
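boto 2 is no longer maintained; a boto3 equivalent might look like the following sketch (it assumes an instance profile granting ec2:StopInstances, and IMDSv1 reachable for brevity):

import urllib.request

import boto3

# Fetch this instance's ID from the EC2 metadata service.
my_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
).read().decode()

ec2 = boto3.client("ec2", region_name="us-east-1")  # or your region
ec2.stop_instances(InstanceIds=[my_id])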
I have a Python program that I am running as a Job on a Kubernetes cluster every 2 hours. I also have a webserver that starts the job whenever user clicks a button on a page.
I need to ensure that at most only one instance of the Job is running on the cluster at any given time.
Given that I am using Kubernetes to run the job and connecting to PostgreSQL from within the job, the solution should somehow leverage these two. I thought about it a bit and came up with the following ideas:
Find a setting in Kubernetes that would set this limit; attempts to start a second instance would then fail. I was unable to find such a setting.
Create a shared lock or mutex. The disadvantage is that if the job crashes, it may not unlock before quitting.
Kubernetes runs etcd; maybe I can use that.
Create a 'lock' table in PostgreSQL; when a new instance connects, it checks whether it is the only one running. Use transactions somehow so that one wins and proceeds while the others quit. I have not yet thought this out fully, but it should work.
Query the Kubernetes API for a label I use on the job and see whether any instances exist. This may not be atomic, so more than one instance may slip through.
What are the usual solutions to this problem given the platform choice I made? What should I do, so that I don't reinvent the wheel and have something reliable?
A completely different approach would be to run a (web) server that executes the job functionality. At a high level, the idea is that the webserver can contact this new job server to execute functionality. In addition, this new job server will have an internal cron to trigger the same functionality every 2 hours.
There could be 2 approaches to implementing this:
You can put the checking mechanism inside the job-server code to ensure that even if 2 API calls happen simultaneously to the job server, only one executes while the other waits (or is turned away). You could use the language platform's locking features to achieve this, or use a message queue. A minimal sketch of this approach follows below.
You can put the checking mechanism outside the jobserver code (in the database) to ensure that only one API call executes. Similar to what you suggested. If you use a postgres transaction, you don't have to worry about your job crashing and the value of the lock remaining set.
The pros/cons of both approaches are straightforward. The major difference in my mind between 1 & 2, is that if you update the job server code, then you might have a situation where 2 job servers might be running at the same time. This would destroy the isolation property you want. Hence, database might work better, or be more idiomatic in the k8s sense (all servers are stateless so all the k8s goodies work; put any shared state in a database that can handle concurrency).
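Here is a minimal single-process sketch of approach 1, using only the Python standard library (do_work is a placeholder for the real job logic; note that a concurrent trigger is rejected here rather than queued):

import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

job_lock = threading.Lock()

def do_work():
    print("running job")  # placeholder for the real job logic

def run_job():
    # Only one caller may execute at a time; a concurrent caller is
    # rejected immediately instead of waiting.
    if not job_lock.acquire(blocking=False):
        return False
    try:
        do_work()
        return True
    finally:
        job_lock.release()

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):  # the webserver hits this endpoint on button click
        self.send_response(200 if run_job() else 409)
        self.end_headers()

def cron_loop():  # the job server's internal "cron"
    while True:
        time.sleep(2 * 60 * 60)  # every 2 hours
        run_job()

threading.Thread(target=cron_loop, daemon=True).start()
HTTPServer(("", 8080), Handler).serve_forever()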
Addressing your ideas, here are my thoughts:
Find a setting in k8s that will limit this: k8s will not start two things with the same name (in the metadata of the spec), but anything else goes for a Job, and k8s will happily start another one.
a) etcd3 supports distributed locking primitives. However, I've never used this and I don't really know what to watch out for.
b) A Postgres lock value should work. Even in case of a job crash, you don't have to worry about the lock remaining set, since session-level locks are released when the connection drops (see the advisory-lock sketch at the end of this answer).
Querying the k8s API server for things that should be atomic is not a good idea, like you said. I've used a system that reacts to k8s events (like an annotation change on an object spec), but I've had bugs where my 'operator' suddenly stops getting k8s events and needs to be restarted; or again, if I push an update to the event-handler server, there might be 2 event handlers running at the same time.
I would recommend sticking with what you are best familiar with. In my case that would be implementing a job-server like k8s deployment that runs as a server and listens to events/API calls.
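If you go the Postgres route, here is a sketch using session-level advisory locks; the connection details are placeholders for your cluster's database, and psycopg2 is assumed:

import sys
import psycopg2

JOB_LOCK_ID = 42  # arbitrary application-wide lock key

# Connection details are placeholders; adjust for your setup.
conn = psycopg2.connect("dbname=jobs user=worker host=postgres")
cur = conn.cursor()

# pg_try_advisory_lock returns true only for the first session that
# grabs this key; every other session gets false immediately.
cur.execute("SELECT pg_try_advisory_lock(%s)", (JOB_LOCK_ID,))
if not cur.fetchone()[0]:
    sys.exit(0)  # another instance holds the lock; quit silently

# ... do the actual work here while holding the lock ...

# The lock is released automatically if the connection drops (e.g. the
# job crashes), but release it explicitly on the happy path:
cur.execute("SELECT pg_advisory_unlock(%s)", (JOB_LOCK_ID,))
conn.close()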
I am using systemd on Raspbian to run a Python script, script.py. The my.service file looks like this:
[Unit]
Description=My Python Script
Requires=other.service
[Service]
Restart=always
ExecStart=/home/script.py
ExecStop=/home/script.py
[Install]
WantedBy=multi-user.target
When the required other.service stops, I want my.service to stop immediately and also terminate the Python process running script.py.
However, when trying this out by stopping other.service and then monitoring the state of my.service using systemctl, it seems to take a good while for my.service to actually enter a 'failed' (stopped) state. Calling ExecStop on the script does not seem to be enough to terminate my.service itself, and script.py with it, in a timely manner.
Just to be extra clear: I want the script to terminate almost immediately, in a way that is analogous to Ctrl + C. Basic Python clean-up is OK, but I don't want systemd to be waiting for a 'graceful' response time-out, or something like that.
Questions:
Is my interpretation of the delay correct, or is it just systemctl that is slow to update its status overview?
What is the recommended way to stop the service and terminate the script? Should I include some sort of SIGINT catching in the Python script? If so, how? Or is there something that can be done in my.service to expedite stopping the service and killing the script?
I think you should look into TimeoutStopSec and its default, DefaultTimeoutStopSec. The linked pages also describe WatchdogSec and other options that you might find useful. DefaultTimeoutStopSec defaults to 90 seconds, which might be the delay you are experiencing..?
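For example (the value here is illustrative), you could shorten the stop timeout in the [Service] section of my.service:

[Service]
TimeoutStopSec=5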
Under the [Unit] section options, you could use Requisite=other.service. This is similar to Requires=; however, if the units listed here are not started already, they will not be started and the transaction will fail immediately.
For triggering script execution again, under the [Unit] section you can use OnFailure=, which takes a space-separated list of one or more units that are activated when this unit enters the "failed" state.
The BindsTo= option configures requirement dependencies, very similar in style to Requires=; in addition to that behaviour, however, it also declares that this unit is stopped when any of the units listed suddenly disappears. Units can suddenly, unexpectedly disappear if a service terminates of its own choice, a device is unplugged, or a mount point is unmounted without involvement of systemd.
I think in your case BindsTo= is the option to use since it causes the current unit to stop when the associated unit terminates.
(From the systemd.unit man page.)
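Putting that together, a sketch of my.service with these changes (values are illustrative; After= makes the stop propagate promptly, and KillSignal=SIGINT makes the stop behave like Ctrl + C):

[Unit]
Description=My Python Script
BindsTo=other.service
After=other.service

[Service]
Restart=always
ExecStart=/home/script.py
# Send SIGINT (like Ctrl + C) instead of the default SIGTERM, and only
# wait briefly for a graceful exit before SIGKILL:
KillSignal=SIGINT
TimeoutStopSec=5

[Install]
WantedBy=multi-user.target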
Summary: I have a Python script which collects tweets using the Twitter API, with a PostgreSQL database in the backend that stores all the streamed tweets. I have custom code which works around the rate-limit issue, and I have kept it running 24/7 for months.
Issue: Sometimes the stream breaks and the script sleeps for the given number of seconds, but that is not helpful. I do not want to have to check it manually.
def on_error(self, status):  # tweepy listener callback
    self.mailMeIfError(['me <me@localhost>'], 'listen.py <root@localhost>',
                       'Error occurred in on_error method', str(status))
    time.sleep(300)
    return True
Assume mailMeIfError is a method which takes care of sending me a mail.
I want a simple cron script which always checks the process and restarts the Python script if it is not running, errors out, or breaks. I have gone through some answers on Stack Overflow where they used the process ID. In my case the process ID still exists, because the script just sleeps when there is an error.
Using the process ID is much easier and safer. Try using a watchdog.
This can all be done in your one script. Cron would need to be configured to start your script periodically, say every minute. The start of your script then just needs to determine if it is the only copy of itself running on the machine. If it spots that another copy is running, it just silently terminates. Else it continues to run.
This behaviour is called the Singleton pattern. There are a number of ways to achieve this; see, for example, Python: single instance of program.
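One common way to do this, sketched below, is an exclusive flock on a lock file: the OS releases the lock when the process dies, so a crashed or killed script never blocks the next cron run (the lock-file path is an arbitrary choice):

import fcntl
import sys

# Try to take an exclusive, non-blocking lock on a well-known file; the
# lock vanishes automatically when this process exits or is killed.
lock_file = open('/tmp/listen.py.lock', 'w')
try:
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(0)  # another copy is already running; terminate silently

# ... the rest of the streaming script runs here ...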
I am working on a Django web application.
A function xyz (which updates a variable) needs to be called every 2 minutes.
I want one HTTP request to start the daemon, which keeps calling xyz (every 2 minutes) until I send another HTTP request to stop it.
Appreciate your ideas.
There are a number of ways to achieve this. Assuming the server resources allow it, I would write a Python script, outside your Django directory (although importing the necessary pieces), that calls function xyz but only runs if /var/run/django-stuff/my-daemon.run exists. Get cron to run this script every two minutes.
Then, for your Django views, the start function creates the above-mentioned file if it doesn't already exist, and the stop function deletes it.
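A minimal sketch of this flag-file scheme (the paths and the xyz import location are illustrative):

# runner.py - invoked by cron every two minutes; it does nothing unless
# the flag file created by the Django 'start' view exists.
import os

FLAG = '/var/run/django-stuff/my-daemon.run'

if os.path.exists(FLAG):
    from myapp.tasks import xyz  # hypothetical location of your function
    xyz()

# views.py - start/stop simply create and remove the flag file.
import os
from django.http import HttpResponse

FLAG = '/var/run/django-stuff/my-daemon.run'

def start(request):
    open(FLAG, 'a').close()  # create the flag if it doesn't exist
    return HttpResponse('started')

def stop(request):
    if os.path.exists(FLAG):
        os.remove(FLAG)
    return HttpResponse('stopped')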
As I say, there are other ways to achieve this. You could have a Python script looping and waiting approximately 2 minutes, etc. In either case, you're up against the fact that two Python scripts running in two different invocations of CPython (no idea whether this applies under mod_wsgi) cannot communicate with each other directly, and IPC between Python scripts is not simple; so you need some sort of formal IPC (semaphores, files, etc.) rather than just shared variables (which won't work).
Probably a little hacky, but you could try this:
Set up a crontab entry that runs a script every two minutes. This script will check for some sort of flag (file existence, contents of a file, etc.) on the disk to decide whether to run a given Python module. The problem with this is that it could take up to 1:59 to run the function the first time after it is started.
I think if you started a daemon in the view function, it would keep the httpd worker process alive, as well as the connection, unless you figure out how to send a connection close without terminating the Django view function. This could be very bad if you want to do this in parallel for different users. Also, to kill the function this way, you would have to somehow know which Python and/or httpd process to kill later so you don't kill all of them.
The real way to do it would be to code an actual daemon in whatever language and just make a system call to "/etc/init.d/daemon_name start" and "... stop" in the Django views. For this, you need to make sure your web server user has permission to run the daemon.
If the easy solutions (a loop in a script, a crontab signalled by a temp file) are too fragile for your intended usage, you could use Twisted's facilities for process handling, scheduling, and networking. Your Django app (using a Twisted client) would simply communicate via TCP (locally) with the Twisted server.