I have a web crawling python script running in terminal for several hours, which is continuously populating my database. It has several nested for loops. For some reasons I need to restart my computer and continue my script from exactly the place where I left. Is it possible to preserve the pointer state and resume the previously running script in terminal?
I am looking for a solution which will work without altering the python script. Modifying the code is a lower priority as that would mean to relaunch the program and reinvest time.
Update:
Thanks for the VM suggestion. I'll take that. For the sake of completion, what generic modifications should be made to script to make it pause and resumable?
Update2:
Porting on VM works fine. I have also modified script to make it failsafe against network failures. Code written below.
You might try suspending your computer or running in a virtual machine which you can subsequently suspend. But as your script is working with network connections chances are your script won't work from the point you left once you bring up the system. Suspending a computer and restoring it or saving a Virtual M/C and restoring it would mean you need to restablish the network connection. This is true for any elements which are external to your system and network is one of them. And there are high chances that if you are using a dynamic network, the next time you boot chances are you would get a new IP and the network state that you were working previously would be void.
If you are planning to modify the script, few things you need to keep it mind.
Add serializing and Deserializing capabilities. Python has the pickle and the faster cPickle method to do it.
Add Restart points. The best way to do this is to save the state at regular interval and when restarting your script, restart from last saved state after establishing all the transients elements like network.
This would not be an easy task so consider investing a considrable amount of time :-)
Note***
On a second thought. There is one alternative from changing your script. You can try using cloud Virtualization Solutions like Amazon EC2.
I ported my script to VM and launched it from there. However there were network connection glitches after resuming from hibernation. Here's how I solved it by tweaking python script:
import logging
import socket
import time
socket.setdefaulttimeout(30) #set timeout in secs
maxretry = 10 #set max retries
sleeptime_between_retry = 1 #waiting time between retries
erroroccured = 0
while True:
try:
domroot = parse(urllib2.urlopen(myurl)).getroot()
except Exception as e:
erroroccured += 1
if erroroccured>maxretry:
logger.info("Maximum retries reached. Quitting this leg.")
break
time.sleep(sleeptime_between_retry)
logging.info("Network error occurred. Retrying %d time..."%(erroroccured))
continue
finally:
#common code to execute after try or except block, if any
pass
break
This modification made my script temper proof to network failures.
As others have commented, unless you are running your script in a virtual machine that can be suspended, you would need to modify your script to track its state.
Since you're populating a database with your data, I suggest to use it as a way to track the progress of the script (get the latest URL parsed, have a list of pending URLs, etc.).
If the script is terminated abruptly, you don't have to worry about saving its state because the database transactions will come to the rescue and only the data that you've committed will be saved.
When the script is retarted, only the data for the URLs that you completely processed will be stored and you it can resume just picking up the next URL according to the database.
If this problem is important enough to warrant this kind of financial investment, you could run the script on a virtual machine. When you need to shut down, suspend the virtual machine, and then shut down the computer. When you want to start again, start the computer, and then wake up your virtual machine.
WinPDB is a python debugger that supports remote debugging. I never used it, and don't know if remote debugging a running process requires a modification to the script (which is very likely, otherwise it'd be a security issue); but if remote debugging without modifying the script is possible then you may be able to dump the current state of the script to a file and figure out later how to load it. I don't think it would work though.
Related
I use a python script to insert data in my database (using pandas and sqlalchemy). The script read from various sources, clean the data and insert it in the database. I plan on running this script once in a while to completely override the existing database with more recent data.
At first I wanted to have a single service and simply add an endpoint that would require higher privileges to run the script. But in the end that looks a bit odd and, more importantly, that python script is using quite a lot of memory (~700M) which makes me wonder how I should configure my deployment.
Increasing the memory limit of my pod for this (once in a while) operation looks like a bad idea to me, but I'm quite new to Kubernetes, so maybe I'm wrong. Thus this question.
So what would be a good (better) solution? Run another service just for that, simply connect to my machine and run the update manually using the python script directly?
To run on demand
https://kubernetes.io/docs/concepts/workloads/controllers/job/.
This generates a Pod that runs till completion (exit) only once - a Job.
To run on schedule
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/.
Every time when hitting a schedule, this generates the new, separate Job.
I am creating a test automation which uses an application without any interfaces. However, The application calls a batch script when it changes modes, and I am therefore am able to catch the mode transitions.
What I want to do is to get the batch script to give an input to my python script (I have a state machine running in python) during runtime. Such that I can monitor the state of the application with python instead of the batch file.
I am using a similar state machine to the one of Karn Saheb:
https://dev.to/karn/building-a-simple-state-machine-in-python
However, instead of changing states statically like:
device.on_event('event')
I want the python script to do something similar to:
while(True):
device.on_event(input()) # where the input is passed from the batch script:
REM state.bat
set CurrentState=%1
"magic code to pass CurrentState to python input()" %CurrentState%
I see that a solution would be to start the python script from the batch file every time it is called with the "event" and then save the current event in another file upon termination of the python script... But I want to avoid such handling and rather evaluate this during runtime.
Thank you in advance!
A reasonably portable way of doing this without ugly polling on temporary files is to use a socket: have the main process listen and have the batch file(s) start a small program that connects to the server and writes a message.
There are security considerations here: you can start by listening only to the loopback interface, with further authentication if the local machine should not be trusted.
If you have more than one of these processes, or if you need to handle the child dying before it issues its next report, you’ll have to use threads or something like select to unify the news from different input channels (e.g., waiting on the child to exit vs. waiting on news from the next batch file).
I am using a postgres database with sql-alchemy and flask. I have a couple of jobs which I have to run through the entire database to updates entries. When I do this on my local machine I get a very different behavior compared to the server.
E.g. there seems to be an upper limit on how many entries I can get from the database?
On my local machine I just query all elements, while on the server I have to query 2000 entries step by step.
If I have too many entries the server gives me the message 'Killed'.
I would like to know
1. Who is killing my jobs (sqlalchemy, postgres)?
2. Since this does seem to behave differently on my local machine there must be a way to control this. Where would that be?
thanks
carl
Just the message "killed" appearing in the terminal window usually means the kernel was running out of memory and killed the process as an emergency measure.
Most libraries which connect to PostgreSQL will read the entire result set into memory, by default. But some libraries have a way to tell it to process the results row by row, so they aren't all read into memory at once. I don't know if flask has this option or not.
Perhaps your local machine has more available RAM than the server does (or fewer demands on the RAM it does have), or perhaps your local machine is configured to read from the database row by row rather than all at once.
Most likely kernel is killing your Python script. Python can have horrible memory usage.
I have a feeling you are trying to do these 2000 entry batches in a loop in one Python process. Python does not release all used memory, so the memory usage grows until it gets killed. (You can watch this with top command.)
You should try adapting your script to process 2000 records in a step and then quit. If you run in multiple times, it should continue where it left off. Or, a better option, try using multiprocessing and run each job in separate worker. Run the jobs serially and let them die, when they finish. This way they will release the memory back to OS when they exit.
While the copy is in progress, can we put a PC into sleep mode for a specific period of time, then wake up and continue copy using python script? Can you please share the code?
Actually this is possible using shell script.
Most machines manufactured after 2000 support real-time clock wakeup. There are many reasons to do so, one of which would be to record a TV program at a certain time. See ACPI Wakeup.
You'll have to explain what you mean by "While the copy is in progress" - there's not much to go on in the question. While OS drivers have suspend/resume functions, I don't know how to tell the python interpreter to save its state in the middle of running a script and then resume after wakeup. It's possible that the OS suspend/hibernate would fully capture the state of a copy operation and resume without a hiccup, but I wouldn't trust it to do so without substantial testing.
If you have "Wake On Lan" enabled you could potentially run a python script on a different PC and trigger the wake up after your specific period of time.
The scripts would probably need to talk to each other, unless you just do it all at times set in advance.
I'd like to prevent multiple instances of the same long-running python command-line script from running at the same time, and I'd like the new instance to be able to send data to the original instance before the new instance commits suicide. How can I do this in a cross-platform way?
Specifically, I'd like to enable the following behavior:
"foo.py" is launched from the command line, and it will stay running for a long time-- days or weeks until the machine is rebooted or the parent process kills it.
every few minutes the same script is launched again, but with different command-line parameters
when launched, the script should see if any other instances are running.
if other instances are running, then instance #2 should send its command-line parameters to instance #1, and then instance #2 should exit.
instance #1, if it receives command-line parameters from another script, should spin up a new thread and (using the command-line parameters sent in the step above) start performing the work that instance #2 was going to perform.
So I'm looking for two things: how can a python program know another instance of itself is running, and then how can one python command-line program communicate with another?
Making this more complicated, the same script needs to run on both Windows and Linux, so ideally the solution would use only the Python standard library and not any OS-specific calls. Although if I need to have a Windows codepath and an *nix codepath (and a big if statement in my code to choose one or the other), that's OK if a "same code" solution isn't possible.
I realize I could probably work out a file-based approach (e.g. instance #1 watches a directory for changes and each instance drops a file into that directory when it wants to do work) but I'm a little concerned about cleaning up those files after a non-graceful machine shutdown. I'd ideally be able to use an in-memory solution. But again I'm flexible, if a persistent-file-based approach is the only way to do it, I'm open to that option.
More details: I'm trying to do this because our servers are using a monitoring tool which supports running python scripts to collect monitoring data (e.g. results of a database query or web service call) which the monitoring tool then indexes for later use. Some of these scripts are very expensive to start up but cheap to run after startup (e.g. making a DB connection vs. running a query). So we've chosen to keep them running in an infinite loop until the parent process kills them.
This works great, but on larger servers 100 instances of the same script may be running, even if they're only gathering data every 20 minutes each. This wreaks havoc with RAM, DB connection limits, etc. We want to switch from 100 processes with 1 thread to one process with 100 threads, each executing the work that, previously, one script was doing.
But changing how the scripts are invoked by the monitoring tool is not possible. We need to keep invocation the same (launch a process with different command-line parameters) but but change the scripts to recognize that another one is active, and have the "new" script send its work instructions (from the command line params) over to the "old" script.
BTW, this is not something I want to do on a one-script basis. Instead, I want to package this behavior into a library which many script authors can leverage-- my goal is to enable script authors to write simple, single-threaded scripts which are unaware of multi-instance issues, and to handle the multi-threading and single-instancing under the covers.
The Alex Martelli approach of setting up a communications channel is the appropriate one. I would use a multiprocessing.connection.Listener to create a listener, in your choice. Documentation at:
http://docs.python.org/library/multiprocessing.html#multiprocessing-listeners-clients
Rather than using AF_INET (sockets) you may elect to use AF_UNIX for Linux and AF_PIPE for Windows. Hopefully a small "if" wouldn't hurt.
Edit: I guess an example wouldn't hurt. It is a basic one, though.
#!/usr/bin/env python
from multiprocessing.connection import Listener, Client
import socket
from array import array
from sys import argv
def myloop(address):
try:
listener = Listener(*address)
conn = listener.accept()
serve(conn)
except socket.error, e:
conn = Client(*address)
conn.send('this is a client')
conn.send('close')
def serve(conn):
while True:
msg = conn.recv()
if msg.upper() == 'CLOSE':
break
print msg
conn.close()
if __name__ == '__main__':
address = ('/tmp/testipc', 'AF_UNIX')
myloop(address)
This works on OS X, so it needs testing with both Linux and (after substituting the right address) Windows. A lot of caveats exists from a security point, the main one being that conn.recv unpickles its data, so you are almost always better of with recv_bytes.
The general approach is to have the script, on startup, set up a communication channel in a way that's guaranteed to be exclusive (other attempts to set up the same channel fail in a predictable way) so that further instances of the script can detect the first one's running and talk to it.
Your requirements for cross-platform functionality strongly point towards using a socket as the communication channel in question: you can designate a "well known port" that's reserved for your script, say 12345, and open a socket on that port listening to localhost only (127.0.0.1). If the attempt to open that socket fails, because the port in question is "taken", then you can connect to that port number instead, and that will let you communicate with the existing script.
If you're not familiar with socket programming, there's a good HOWTO doc here. You can also look at the relevant chapter in Python in a Nutshell (I'm biased about that one, of course;-).
Perhaps try using sockets for communication?
Sounds like your best bet is sticking with a pid file but have it not only contain the process Id - have it also include the port number that the prior instance is listening on. So when starting up check for the pid file and if present see if a process with that Id is running - if so send your data to it and quit otherwise overwrite the pid file with the current process's info.