I have a Linux server used for scientific calculations.
The server provides bsub < [filename] for submitting [filename] to the queues, and the calculation produces files like "A.chk" after about 2 hours.
It also provides bjobs for checking the status of all jobs, including the job ID and run status (RUN, PEND, EXIT), etc.
Now I want to do the following:
Submit A.sh to the server;
Wait for the calculation to terminate; assume it generates B.chk;
Do another calculation with B.chk and an existing script C.py;
Do all of this automatically in one .py script.
I've managed to get the latest job ID and run status in a Python script.
And I've tried the following solution:
import os
import time

os.system("bsub < %s" % "A.sh")
job_ID = get_jobID()
# wait while the job is still queued or running
while get_runstatus(job_ID) in ("RUN", "PEND"):
    time.sleep(30)
if "B.chk" in os.listdir("."):
    os.system("python C.py")
But the while loop on the server occupies too many shared resources, so I'm not willing to do that. I then searched for solutions like process.join() in subprocess, but the job sequence on the server cannot be treated as a subprocess in Python. So I'm asking here for better solutions. Thank you.
While this doesn't answer your question fully, there is a Python API for LSF on GitHub:
https://github.com/IBMSpectrumComputing/lsf-python-api
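As a side note: if your LSF version supports it, bsub's -K option submits a job and blocks until it finishes, which would avoid the polling loop entirely. A rough sketch (file names taken from the question, everything else illustrative):

import os
import subprocess

# Submit A.sh and wait: with -K, bsub blocks until the job completes
# (assuming your LSF installation supports the option).
with open("A.sh") as script:
    subprocess.call(["bsub", "-K"], stdin=script)

# The job has terminated at this point; run the follow-up step if its
# output file is present.
if "B.chk" in os.listdir("."):
    subprocess.call(["python", "C.py"])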
Original problem
I am creating an API using express that queries a sqlite DB and outputs the result as a PDF using html-pdf module.
The problem is that certain queries might take a long time to process, and thus I would like to decouple the actual query call from the node server where express is running; otherwise the API might slow down if several clients are running heavy queries.
My idea to solve this was to decouple the execution of the sqlite query and instead run that on a python script. This script can then be called from the API and thus avoid using node to query the DB.
Current problem
After quickly creating a python script that runs a sqlite query, and calling that from my API using child_process.spawn(), I found out that express seems to get an exit code signal as soon as the python script starts to execute the query.
To confirm this, I created a simple python script that just sleeps in between printing two messages and the problem was isolated.
To reproduce this behavior you can create a python script like this:
print("test 1")
sleep(1)
print("test 2)
Then call it from express like this:
router.get('/async', function(req, res, next) {
  var python = child_process.spawn(
    'python3', ['test.py']  // path to the test script shown above
  );
  var output = "";
  python.stdout.on('data', function(data){
    output += data;
    console.log(output);
  });
  python.on('close', function(code){
    if (code !== 0) {
      return res.status(200).send(code);
    }
    return res.status(200).send(output);
  });
});
If you then run the express server and do a GET /async you will get a "1" as the exit code.
However, if you comment out the sleep(1) line, the server successfully returns
test 1
test 2
as the response.
You can even trigger this using sleep(0).
I have tried flushing stdout before the sleep, piping the result instead of using .on('close'), and using the -u option when calling python (to use unbuffered streams).
None of this has worked, so I'm guessing there's some mechanism baked into express that closes the request as soon as the spawned process sleeps OR finishes (instead of only when finishing).
I also found this answer related to using child_process.fork(), but I'm not sure whether that would behave differently; and this one is very similar to my issue but has no answer.
Main question
So my question is, why does the python script send an exit signal when doing a sleep() (or in the case of my query script when running cursor.execute(query))?
If my supposition is correct that express closes the request when a spawned process sleeps, is this avoidable?
One potential solution I found suggested the use of ZeroRPC, but I don't see how that would make express keep the connection open.
The only other option I can think of is using something like Kue so that my express API will only need to respond with some sort of job ID, and then Kue will actually spawn the python script and wait for its response, so that I can query the result via some other API endpoint.
Is there something I'm missing?
Edit:
AllTheTime's comment is correct regarding the sleep issue. After I added from time import sleep it worked. However, my sqlite script is not working yet.
As it turns out AllTheTime was indeed correct.
The problem was that in my python script I was loading a config.json file, which was loaded correctly when called from the console because the path was relative to the script.
However when calling it from node, the relative path was no longer correct.
After fixing the path it worked exactly as expected.
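In case anyone hits the same thing: resolving the path relative to the script file, rather than to the current working directory, avoids this class of problem. A small sketch (config.json is the file from the post, the rest is illustrative):

import json
import os

# Build the path from the script's own location so it resolves the same way
# no matter which directory node (or anything else) spawns the interpreter from.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(SCRIPT_DIR, "config.json")) as f:
    config = json.load(f)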
I have an email account set up that triggers a python script whenever it receives an email. The script goes through several functions which can take about 30 seconds and writes an entry into a MYSQL database.
Everything runs smoothly until a second email is sent in less than 30 seconds after the first. The second email is processed correctly, but the first email creates a corrupted entry into the database.
I'm looking to hold the email data,
msg=email.message_from_file(sys.stdin)
in a queue if the script has not finished processing the prior email.
I'm using python 2.5.
Can anyone recommend a package/script that would accomplish this?
I find this a simple way to avoid running a cronjob while the previous cronjob is still running.
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
This will raise an IOError that I then handle by having the process kill itself.
See http://docs.python.org/library/fcntl.html#fcntl.lockf for more info.
Anyway, you can easily use the same idea to only allow a single job to run at a time, which really isn't the same as a queue (since any process waiting could potentially acquire the lock), but it achieves what you want.
import fcntl
import time

# Block until we hold the exclusive lock; only one process at a time
# gets past this point.
fd = open('lock_file', 'w')
fcntl.lockf(fd, fcntl.LOCK_EX)
# optionally write pid to another file so you have an indicator
# of the currently running process
print 'Hello'
time.sleep(1)
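For completeness, a minimal sketch of the non-blocking variant mentioned above: if another instance already holds the lock, lockf raises IOError and the new process simply exits.

import fcntl
import sys

fd = open('lock_file', 'w')
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit(0)  # another instance is running; let it do the work
# ... process the e-mail here; the lock is released when the process exits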
You could also just use http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes, which does exactly what you want.
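For reference, the linked section passes objects between a parent and a worker process with multiprocessing.Queue, roughly like the sketch below. Note this only helps if a single long-lived parent hands the messages to the worker; handle_message is just a placeholder.

from multiprocessing import Process, Queue

def handle_message(raw):
    # placeholder for the real ~30 second processing + MySQL insert
    print 'processing %d bytes' % len(raw)

def worker(q):
    while True:
        item = q.get()        # blocks until something is queued
        if item is None:      # sentinel: stop the worker
            break
        handle_message(item)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    q.put("raw email text")   # hand work to the single worker
    q.put(None)
    p.join()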
While Celery is a very fine piece of software, using it in this scenario is akin to driving in a nail with a sledgehammer. At a conceptual level, you are looking for a job queue (which is what celery provides) but the e-mail inbox you are using to trigger the script is also a capable job-queue.
The more direct solution is to have the Python worker script poll the mail server itself (using the built-in poplib, for example), retrieve all new mail every few seconds, and then process any new e-mails one at a time. This will serialize the work your script is doing, thereby preventing two copies from running at once.
For example, you would wrap your existing script in a function like this (from the documentation linked above):
import getpass, poplib
from time import sleep

M = poplib.POP3('localhost')
M.user(getpass.getuser())
M.pass_(getpass.getpass())
while True:
    numMessages = len(M.list()[1])
    for i in range(numMessages):
        email = '\n'.join(M.retr(i+1)[1])
        # This is what your script normally does:
        do_work_for_message(email)
        # (in a real script you would also M.dele(i+1) processed messages
        # and M.quit()/reconnect so the deletions are committed)
    sleep(5)
I would look into http://celeryproject.org/
I'm fairly certain that will meet your needs exactly.
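A sketch of what that could look like with Celery's task API (module name and broker URL are placeholders, and the decorator syntax shown is from current Celery docs; the versions that still support Python 2.5 use a slightly different one):

from celery import Celery

app = Celery('mailtasks', broker='redis://localhost:6379/0')

@app.task
def process_email(raw_message):
    # the existing ~30 second processing and MySQL insert go here
    pass

# In the script triggered by incoming mail:
#     process_email.delay(msg.as_string())
# Run a single worker (celery -A mailtasks worker --concurrency=1) so
# messages are processed one at a time.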
I am using a cluster of computers to do some parallel computation. My home directory is shared across the cluster. On one machine, I have Ruby code that creates bash scripts containing the computation commands and writes them to, say, the ~/q/ directory. The scripts are named *.worker1.sh, *.worker2.sh, etc.
On the other 20 machines, I have 20 Python scripts running (one on each machine) that constantly check the ~/q/ directory and look for jobs that belong to that machine, using Python code like this:
import glob, os
jobs = glob.glob('q/*.worker1.sh')
for job in jobs:
    os.system('sh ' + job + ' &')
For some additional control, the Ruby code creates an empty file like workeri.start (i = 1..20) in the q directory after it writes the bash script there, and the Python code checks for that 'start' file before it runs the code above. In the bash script, if the command finishes successfully, the script creates an empty file like 'workeri.success'; the Python code checks for this file after running the code above to make sure the computation finished successfully. If Python finds that the computation finished successfully, it removes the 'start' file from the q directory, so the Ruby code knows the job finished. After all 20 bash scripts have finished, the Ruby code creates new bash scripts, the Python code reads and executes the new scripts, and so on.
I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines.
Now the question is: I expect the 20 jobs to run somewhat in parallel, so the total time to finish all 20 should not be much longer than the time to finish one. However, it seems that these jobs run sequentially and the total time is much longer than I expected.
I suspect that part of the reason is that multiple processes are reading and writing the same directory at once, but the Linux system or Python locks the directory and only allows one process to operate on it at a time. This makes the jobs execute one at a time.
I am not sure if this is the case. If I split the bash scripts into different directories, and let the Python code on each machine read and write a different directory, will that solve the problem? Or are there other reasons causing it?
Thanks a lot for any suggestions! Let me know if I didn't explain anything clearly.
Some additional info:
my home directory is at /home/my_group/my_home, here is the mount info for it
:/vol/my_group on /home/my_group type nfs (rw,nosuid,nodev,noatime,tcp,timeo=600,retrans=2,rsize=65536,wsize=65536,addr=...)
By "constantly check the q directory" I mean a Python loop like this:
while True:
    if os.path.exists('q/worker1.start'):
        # find the scripts and execute them as mentioned above
        jobs = glob.glob('q/*.worker1.sh')
        for job in jobs:
            os.system('sh ' + job + ' &')
I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines.
While this isn't directly what you asked, you should really, really consider fixing your problem at this level: using some sort of shared message queue is likely to be a lot simpler to manage and debug than relying on the locking semantics of a particular networked filesystem.
The simplest solution to set up and run, in my experience, is Redis on the machine currently running the Ruby script that creates the jobs. It should literally be as simple as downloading the source, compiling it and starting it up. Once the Redis server is up and running, you change your code to append the computation commands to one or more Redis lists. In Ruby you would use the redis-rb library like this:
require "redis"
redis = Redis.new
# Your other code to build up command lists...
redis.lpush 'commands', command1, command2...
If the computations need to be handled by certain machines, use a list per-machine like this:
redis.lpush 'jobs:machine1', command1
# etc.
Then in your Python code, you can use redis-py to connect to the Redis server and pull jobs off the list like so:
from redis import Redis
import os

r = Redis(host="hostname-of-machine-running-redis")
while r.llen('jobs:machine1'):
    job = r.lpop('jobs:machine1')
    os.system('sh ' + job + ' &')
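As a side note (not required for the above), redis-py also has a blocking pop, which would let each worker simply wait for new jobs instead of polling for a start file; a sketch:

from redis import Redis
import os

r = Redis(host="hostname-of-machine-running-redis")
while True:
    # blpop blocks until a job appears on the list (timeout=0 means wait forever)
    key, job = r.blpop('jobs:machine1')
    os.system('sh ' + job + ' &')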
Of course, you could just as easily pull jobs off the queue and execute them in Ruby:
require 'redis'

redis = Redis.new(:host => 'hostname-of-machine-running-redis')
while redis.llen('jobs:machine1') > 0
  job = redis.lpop('jobs:machine1')
  `sh #{job} &`
end
With some more details about the needs of the computation and the environment it's running in, it would be possible to recommend even simpler approaches to managing it.
Try a while loop? If that doesn't work, on the Python side try using a try statement, like so:
try:
    with open("myfile.whatever", "r") as f:
        f.read()
except IOError:
    # do something if it doesn't work, perhaps just pass
    # (this must be in a loop to constantly check for the file)
    pass
else:
    pass  # execute your code if the read succeeded
I'd like to prevent multiple instances of the same long-running python command-line script from running at the same time, and I'd like the new instance to be able to send data to the original instance before the new instance commits suicide. How can I do this in a cross-platform way?
Specifically, I'd like to enable the following behavior:
"foo.py" is launched from the command line, and it will stay running for a long time-- days or weeks until the machine is rebooted or the parent process kills it.
every few minutes the same script is launched again, but with different command-line parameters
when launched, the script should see if any other instances are running.
if other instances are running, then instance #2 should send its command-line parameters to instance #1, and then instance #2 should exit.
instance #1, if it receives command-line parameters from another script, should spin up a new thread and (using the command-line parameters sent in the step above) start performing the work that instance #2 was going to perform.
So I'm looking for two things: how can a python program know another instance of itself is running, and then how can one python command-line program communicate with another?
Making this more complicated, the same script needs to run on both Windows and Linux, so ideally the solution would use only the Python standard library and not any OS-specific calls. Although if I need to have a Windows codepath and an *nix codepath (and a big if statement in my code to choose one or the other), that's OK if a "same code" solution isn't possible.
I realize I could probably work out a file-based approach (e.g. instance #1 watches a directory for changes and each instance drops a file into that directory when it wants to do work) but I'm a little concerned about cleaning up those files after a non-graceful machine shutdown. I'd ideally be able to use an in-memory solution. But again I'm flexible, if a persistent-file-based approach is the only way to do it, I'm open to that option.
More details: I'm trying to do this because our servers are using a monitoring tool which supports running python scripts to collect monitoring data (e.g. results of a database query or web service call) which the monitoring tool then indexes for later use. Some of these scripts are very expensive to start up but cheap to run after startup (e.g. making a DB connection vs. running a query). So we've chosen to keep them running in an infinite loop until the parent process kills them.
This works great, but on larger servers 100 instances of the same script may be running, even if they're only gathering data every 20 minutes each. This wreaks havoc with RAM, DB connection limits, etc. We want to switch from 100 processes with 1 thread to one process with 100 threads, each executing the work that, previously, one script was doing.
But changing how the scripts are invoked by the monitoring tool is not possible. We need to keep invocation the same (launch a process with different command-line parameters) but change the scripts to recognize that another one is active, and have the "new" script send its work instructions (from the command-line params) over to the "old" script.
BTW, this is not something I want to do on a one-script basis. Instead, I want to package this behavior into a library which many script authors can leverage-- my goal is to enable script authors to write simple, single-threaded scripts which are unaware of multi-instance issues, and to handle the multi-threading and single-instancing under the covers.
The Alex Martelli approach of setting up a communications channel is the appropriate one. I would use a multiprocessing.connection.Listener to create a listener of your choice. Documentation at:
http://docs.python.org/library/multiprocessing.html#multiprocessing-listeners-clients
Rather than using AF_INET (sockets) you may elect to use AF_UNIX for Linux and AF_PIPE for Windows. Hopefully a small "if" wouldn't hurt.
Edit: I guess an example wouldn't hurt. It is a basic one, though.
#!/usr/bin/env python
from multiprocessing.connection import Listener, Client
import socket

def myloop(address):
    try:
        # First instance: become the listener.
        listener = Listener(*address)
        conn = listener.accept()
        serve(conn)
    except socket.error, e:
        # Address already in use: another instance is listening,
        # so act as a client and send it our data instead.
        conn = Client(*address)
        conn.send('this is a client')
        conn.send('close')

def serve(conn):
    while True:
        msg = conn.recv()
        if msg.upper() == 'CLOSE':
            break
        print msg
    conn.close()

if __name__ == '__main__':
    address = ('/tmp/testipc', 'AF_UNIX')
    myloop(address)
This works on OS X, so it needs testing with both Linux and (after substituting the right address) Windows. A lot of caveats exist from a security point of view, the main one being that conn.recv unpickles its data, so you are almost always better off with recv_bytes.
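For the Windows/Linux split mentioned above, the "small if" could look roughly like this (the pipe name is just an illustrative placeholder):

import sys

if sys.platform == 'win32':
    # AF_PIPE addresses are named pipes of the form \\.\pipe\<name>
    address = (r'\\.\pipe\testipc', 'AF_PIPE')
else:
    address = ('/tmp/testipc', 'AF_UNIX')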
The general approach is to have the script, on startup, set up a communication channel in a way that's guaranteed to be exclusive (other attempts to set up the same channel fail in a predictable way) so that further instances of the script can detect the first one's running and talk to it.
Your requirements for cross-platform functionality strongly point towards using a socket as the communication channel in question: you can designate a "well known port" that's reserved for your script, say 12345, and open a socket on that port listening to localhost only (127.0.0.1). If the attempt to open that socket fails, because the port in question is "taken", then you can connect to that port number instead, and that will let you communicate with the existing script.
If you're not familiar with socket programming, there's a good HOWTO doc here. You can also look at the relevant chapter in Python in a Nutshell (I'm biased about that one, of course;-).
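A bare-bones sketch of that idea (port 12345 as suggested above; everything else is illustrative, not a full implementation):

import socket
import sys

PORT = 12345  # "well known" port reserved for this script

try:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(('127.0.0.1', PORT))  # fails if another instance owns the port
    server.listen(5)
    # we are the first instance: accept connections and do the real work here
except socket.error:
    # another instance is already listening: hand it our parameters and exit
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(('127.0.0.1', PORT))
    client.sendall(' '.join(sys.argv[1:]))
    client.close()
    sys.exit(0)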
Perhaps try using sockets for communication?
Sounds like your best bet is sticking with a pid file, but have it not only contain the process ID - have it also include the port number that the prior instance is listening on. So when starting up, check for the pid file and, if present, see if a process with that ID is running - if so, send your data to it and quit; otherwise overwrite the pid file with the current process's info.
I have inherited a django+fastcgi application which needs to be modified to perform a lengthy computation (up to half an hour or more). What I want to do is run the computation in the background and return a "your job has been started" -type response. While the process is running, further hits to the url should return "your job is still running" until the job finishes at which point the results of the job should be returned. Any subsequent hit on the url should return the cached result.
I'm an utter novice at django and haven't done any significant web work in a decade so I don't know if there's a built-in way to do what I want. I've tried starting the process via subprocess.Popen(), and that works fine except for the fact it leaves a defunct entry in the process table. I need a clean solution that can remove temporary files and any traces of the process once it has finished.
I've also experimented with fork() and threads and have yet to come up with a viable solution. Is there a canonical solution to what seems to me to be a pretty common use case? FWIW this will only be used on an internal server with very low traffic.
I have to solve a similar problem now. It is not going to be a public site, but similarly, an internal server with low traffic.
Technical constraints:
all input data to the long running process can be supplied on its start
long running process does not require user interaction (except for the initial input to start a process)
the time of the computation is long enough so that the results cannot be served to the client in an immediate HTTP response
some sort of feedback (a sort of progress bar) from the long running process is required.
Hence, we need at least two web “views”: one to initiate the long running process, and the other, to monitor its status/collect the results.
We also need some sort of interprocess communication: send user data from the initiator (the web server, on HTTP request) to the long running process, and then send its results to the receiver (again the web server, driven by HTTP requests). The former is easy, the latter is less obvious. Unlike in normal unix programming, the receiver is not known initially. The receiver may be a different process from the initiator, and it may start when the long running job is still in progress or is already finished. So pipes do not work and we need some permanence of the results of the long running process.
I see two possible solutions:
dispatch launches of the long running processes to the long running job manager (this is probably what the above-mentioned django-queue-service is);
save the results permanently, either in a file or in DB.
I preferred to use temporary files and to remember their location in the session data. I don't think it can be made simpler.
A job script (this is the long running process), myjob.py:
import sys
from time import sleep

i = 0
while i < 1000:
    print 'myjob:', i
    i = i + 1
    sleep(0.1)
    sys.stdout.flush()
django urls.py mapping:
urlpatterns = patterns('',
    (r'^startjob/$', 'mysite.myapp.views.startjob'),
    (r'^showjob/$',  'mysite.myapp.views.showjob'),
    (r'^rmjob/$',    'mysite.myapp.views.rmjob'),
)
django views:
from tempfile import mkstemp
from os import fdopen, unlink, kill
from subprocess import Popen
import signal

from django.http import HttpResponse, HttpResponseRedirect

def startjob(request):
    """Start a new long running process unless already started."""
    if not request.session.has_key('job'):
        # create a temporary file to save the results
        outfd, outname = mkstemp()
        request.session['jobfile'] = outname
        outfile = fdopen(outfd, 'a+')
        proc = Popen("python myjob.py", shell=True, stdout=outfile)
        # remember pid to terminate the job later
        request.session['job'] = proc.pid
    return HttpResponse('A new job has started.')

def showjob(request):
    """Show the last result of the running job."""
    if not request.session.has_key('job'):
        return HttpResponse('Not running a job.'
                            'Start a new one?')
    else:
        filename = request.session['jobfile']
        results = open(filename)
        lines = results.readlines()
        try:
            return HttpResponse(lines[-1] +
                                '<p>Terminate?')
        except:
            return HttpResponse('No results yet.'
                                '<p>Terminate?')

def rmjob(request):
    """Terminate the running job."""
    if request.session.has_key('job'):
        job = request.session['job']
        filename = request.session['jobfile']
        try:
            kill(job, signal.SIGKILL)  # unix only
            unlink(filename)
        except OSError, e:
            pass  # probably the job has finished already
        del request.session['job']
        del request.session['jobfile']
    return HttpResponseRedirect('/startjob/')  # start a new one
Maybe you could look at the problem the other way around.
Maybe you could try DjangoQueueService, and have a "daemon" listening to the queue, seeing if there's something new and processing it.