Creating a processing queue in python

Creating a processing queue in python - python

I have an email account set up that triggers a python script whenever it receives an email. The script goes through several functions which can take about 30 seconds and writes an entry into a MYSQL database.
Everything runs smoothly until a second email is sent in less than 30 seconds after the first. The second email is processed correctly, but the first email creates a corrupted entry into the database.
I'm looking to hold the email data,
msg=email.message_from_file(sys.stdin)
in a queue if the script has not finished processing the prior email.
I'm using python 2.5.
Can anyone recommend a package/script that would accomplish this?

I find this a simple way to avoid running a cronjob while the previous cronjob is still running.
fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
This will raise an IOError that I then handle by having the process kill itself.
See http://docs.python.org/library/fcntl.html#fcntl.lockf for more info.
Anyways you can easily use the same idea to only allow a single job to run at a time, which really isn't the same as a queue (since any process waiting could potentially acquire the lock), but it achieves what you want.
import fcntl
import time
fd = open('lock_file', 'w')
fcntl.lockf(fd, fcntl.LOCK_EX)
# optionally write pid to another file so you have an indicator
# of the currently running process
print 'Hello'
time.sleep(1)
You could also just use http://docs.python.org/dev/library/multiprocessing.html#exchanging-objects-between-processes, which does exactly what you want.

While Celery is a very fine piece of software, using it in this scenario is akin to driving in a nail with a sledgehammer. At a conceptual level, you are looking for a job queue (which is what celery provides) but the e-mail inbox you are using to trigger the script is also a capable job-queue.
The more direct solution is to have the Python worker script poll the mail server itself (using the built in poplib for example) retrieve all new mail every few seconds, then process any new e-mails one at a time. This will serialize the work your script is doing, thereby preventing two copies from running at once.
For example, you would wrap your existing script in a function like this (from the documentation linked above):
import getpass, poplib
from time import sleep
M = poplib.POP3('localhost')
M.user(getpass.getuser())
M.pass_(getpass.getpass())
while True:
numMessages = len(M.list()[1])
for i in range(numMessages):
email = '\n'.join(M.retr(i+1)[1])
# This is what your script normally does:
do_work_for_message(email)
sleep(5)
edit: grammar

I would look into http://celeryproject.org/
I'm fairly certain that will meet your needs exactly.

Related

Restart python script if not running/stopped/error with simple cron job

Summary: I have a python script which collects tweets using Twitter API and i have postgreSQL database in the backend which collects all the streamed tweets. I have custom code which overcomes the ratelimit issue and i made it to run 24/7 for months.
Issue: Sometimes streaming breaks and sleeps for given secs but it is not helpful. I do not want to check it manually.
def on_error(self,status)://tweepy method
self.mailMeIfError(['me <me#localhost'],'listen.py <root#localhost>','Error Occured on_error method',str(error))
time.sleep(300)
return True
Assume mailMeIfError is a method which takes care of sending me a mail.
I want a simple cron script which always checks the process and restart the python script if not running/error/breaks. I have gone through some answers from stackoverflow where they have used Process ID. In my case process ID still exists because this script sleeps if Error.
Thanks in advance.

Using Process ID is much easier and safer. Try using watchdog.

This can all be done in your one script. Cron would need to be configured to start your script periodically, say every minute. The start of your script then just needs to determine if it is the only copy of itself running on the machine. If it spots that another copy is running, it just silently terminates. Else it continues to run.
This behaviour is called a Singleton pattern. There are a number of ways to achieve this for example Python: single instance of program

Query Python3 script to get stats about the script

I have a script that continually runs and accepts data (For those that are familiar and if it helps, it is connected to EMDR - https://eve-market-data-relay.readthedocs.org).
Inside the script I have debugging built in so that I can see how much data is currently in the queue for the threads to process, however this is built to be used with just printing to the console. What I would like to do is be able to either run the same script with an additional option or a totally different script that would return the current queue count without having to enable debug.
Is there a way to do this could someone please point me in the direction of the documentation/libaries that I need to research?

There are many ways to solve this; two that come to mind:
You can write the queue count to a k/v store (like memcache or redis) and then have another script read that for you and do whatever other actions required.
You can create a specific logger for your informational output (like the queue length) and set it to log somewhere else other than the console. For example, you could use it to send you an email or log to an external service, etc. See the logging cookbook for examples.

Timed email reminder in python

I have written up a python script that allows a user to input a message, his email and the time and they would like the email sent. This is all stored in a mysql database.
However, how do I get the script to execute on the said time and date? will it require a cron job? I mean say at 2:15 on april 20th, the script will search the database for all times of 2:15, and send out those emails. But what about for emails at 2:16?
I am using a shared hosting provided, so cant have a continously running script.
Thanks

If you cannot have a continuously running script, something must trigger it, so that would have to rely on your OS internals. In a unix environment a cron job, as you self state, would do the trick.
Set cron to run the script, and make the script wait for a given time and then continue running and sending until the next email is more than this given time away. Then make your script add a new cron job for a new wakeup time.

Looks like this django application was made just for people in your situation...
http://code.google.com/p/django-cron/
Also, your design seems a little flawed. If its running at 2:15 you wouldn't want to send out just emails that should be sent at 2:15, but all ones that should have been sent in the past that have not been sent.
Your database should either:
A. Delete the entries once they send
or
B. Have a column defined on your database table to store whether it was sent or not. Then your logic should make use of that column.

A cronjob every minute or so would do it. If you're considering this, you might like to mind two things:
1 - How many e-mails are expected to be sent per minute? If it takes you 1 second to send an e-mail and you have 100 e-mails per minute, you won't finish your queue.
2 - What will happen if one job starts before the last one finishes? Be careful not to send e-mails twice. You need either to make sure first process ends (risk: you can drop an e-mail eventually), avoid next process to start (risk: first process hangs whole queue) or make them work in parallel (risk: synchronization problems).
If you take daramarak's suggestion - make you script add a new cron job at end - you have the risk of whole system colapsing if one error occurs.

can a python script know that another instance of the same script is running... and then talk to it?

I'd like to prevent multiple instances of the same long-running python command-line script from running at the same time, and I'd like the new instance to be able to send data to the original instance before the new instance commits suicide. How can I do this in a cross-platform way?
Specifically, I'd like to enable the following behavior:
"foo.py" is launched from the command line, and it will stay running for a long time-- days or weeks until the machine is rebooted or the parent process kills it.
every few minutes the same script is launched again, but with different command-line parameters
when launched, the script should see if any other instances are running.
if other instances are running, then instance #2 should send its command-line parameters to instance #1, and then instance #2 should exit.
instance #1, if it receives command-line parameters from another script, should spin up a new thread and (using the command-line parameters sent in the step above) start performing the work that instance #2 was going to perform.
So I'm looking for two things: how can a python program know another instance of itself is running, and then how can one python command-line program communicate with another?
Making this more complicated, the same script needs to run on both Windows and Linux, so ideally the solution would use only the Python standard library and not any OS-specific calls. Although if I need to have a Windows codepath and an *nix codepath (and a big if statement in my code to choose one or the other), that's OK if a "same code" solution isn't possible.
I realize I could probably work out a file-based approach (e.g. instance #1 watches a directory for changes and each instance drops a file into that directory when it wants to do work) but I'm a little concerned about cleaning up those files after a non-graceful machine shutdown. I'd ideally be able to use an in-memory solution. But again I'm flexible, if a persistent-file-based approach is the only way to do it, I'm open to that option.
More details: I'm trying to do this because our servers are using a monitoring tool which supports running python scripts to collect monitoring data (e.g. results of a database query or web service call) which the monitoring tool then indexes for later use. Some of these scripts are very expensive to start up but cheap to run after startup (e.g. making a DB connection vs. running a query). So we've chosen to keep them running in an infinite loop until the parent process kills them.
This works great, but on larger servers 100 instances of the same script may be running, even if they're only gathering data every 20 minutes each. This wreaks havoc with RAM, DB connection limits, etc. We want to switch from 100 processes with 1 thread to one process with 100 threads, each executing the work that, previously, one script was doing.
But changing how the scripts are invoked by the monitoring tool is not possible. We need to keep invocation the same (launch a process with different command-line parameters) but but change the scripts to recognize that another one is active, and have the "new" script send its work instructions (from the command line params) over to the "old" script.
BTW, this is not something I want to do on a one-script basis. Instead, I want to package this behavior into a library which many script authors can leverage-- my goal is to enable script authors to write simple, single-threaded scripts which are unaware of multi-instance issues, and to handle the multi-threading and single-instancing under the covers.

The Alex Martelli approach of setting up a communications channel is the appropriate one. I would use a multiprocessing.connection.Listener to create a listener, in your choice. Documentation at:
http://docs.python.org/library/multiprocessing.html#multiprocessing-listeners-clients
Rather than using AF_INET (sockets) you may elect to use AF_UNIX for Linux and AF_PIPE for Windows. Hopefully a small "if" wouldn't hurt.
Edit: I guess an example wouldn't hurt. It is a basic one, though.
#!/usr/bin/env python
from multiprocessing.connection import Listener, Client
import socket
from array import array
from sys import argv
def myloop(address):
try:
listener = Listener(*address)
conn = listener.accept()
serve(conn)
except socket.error, e:
conn = Client(*address)
conn.send('this is a client')
conn.send('close')
def serve(conn):
while True:
msg = conn.recv()
if msg.upper() == 'CLOSE':
break
print msg
conn.close()
if __name__ == '__main__':
address = ('/tmp/testipc', 'AF_UNIX')
myloop(address)
This works on OS X, so it needs testing with both Linux and (after substituting the right address) Windows. A lot of caveats exists from a security point, the main one being that conn.recv unpickles its data, so you are almost always better of with recv_bytes.

The general approach is to have the script, on startup, set up a communication channel in a way that's guaranteed to be exclusive (other attempts to set up the same channel fail in a predictable way) so that further instances of the script can detect the first one's running and talk to it.
Your requirements for cross-platform functionality strongly point towards using a socket as the communication channel in question: you can designate a "well known port" that's reserved for your script, say 12345, and open a socket on that port listening to localhost only (127.0.0.1). If the attempt to open that socket fails, because the port in question is "taken", then you can connect to that port number instead, and that will let you communicate with the existing script.
If you're not familiar with socket programming, there's a good HOWTO doc here. You can also look at the relevant chapter in Python in a Nutshell (I'm biased about that one, of course;-).

Perhaps try using sockets for communication?

Sounds like your best bet is sticking with a pid file but have it not only contain the process Id - have it also include the port number that the prior instance is listening on. So when starting up check for the pid file and if present see if a process with that Id is running - if so send your data to it and quit otherwise overwrite the pid file with the current process's info.

django,fastcgi: how to manage a long running process?

I have inherited a django+fastcgi application which needs to be modified to perform a lengthy computation (up to half an hour or more). What I want to do is run the computation in the background and return a "your job has been started" -type response. While the process is running, further hits to the url should return "your job is still running" until the job finishes at which point the results of the job should be returned. Any subsequent hit on the url should return the cached result.
I'm an utter novice at django and haven't done any significant web work in a decade so I don't know if there's a built-in way to do what I want. I've tried starting the process via subprocess.Popen(), and that works fine except for the fact it leaves a defunct entry in the process table. I need a clean solution that can remove temporary files and any traces of the process once it has finished.
I've also experimented with fork() and threads and have yet to come up with a viable solution. Is there a canonical solution to what seems to me to be a pretty common use case? FWIW this will only be used on an internal server with very low traffic.

I have to solve a similar problem now. It is not going to be a public site, but similarly, an internal server with low traffic.
Technical constraints:
all input data to the long running process can be supplied on its start
long running process does not require user interaction (except for the initial input to start a process)
the time of the computation is long enough so that the results cannot be served to the client in an immediate HTTP response
some sort of feedback (sort of progress bar) from the long running process is required.
Hence, we need at least two web “views”: one to initiate the long running process, and the other, to monitor its status/collect the results.
We also need some sort of interprocess communication: send user data from the initiator (the web server on http request) to the long running process, and then send its results to the reciever (again web server, driven by http requests). The former is easy, the latter is less obvious. Unlike in normal unix programming, the receiver is not known initially. The receiver may be a different process from the initiator, and it may start when the long running job is still in progress or is already finished. So the pipes do not work and we need some permamence of the results of the long running process.
I see two possible solutions:
dispatch launches of the long running processes to the long running job manager (this is probably what the above-mentioned django-queue-service is);
save the results permanently, either in a file or in DB.
I preferred to use temporary files and to remember their locaiton in the session data. I don't think it can be made more simple.
A job script (this is the long running process), myjob.py:
import sys
from time import sleep
i = 0
while i < 1000:
print 'myjob:', i
i=i+1
sleep(0.1)
sys.stdout.flush()
django urls.py mapping:
urlpatterns = patterns('',
(r'^startjob/$', 'mysite.myapp.views.startjob'),
(r'^showjob/$', 'mysite.myapp.views.showjob'),
(r'^rmjob/$', 'mysite.myapp.views.rmjob'),
)
django views:
from tempfile import mkstemp
from os import fdopen,unlink,kill
from subprocess import Popen
import signal
def startjob(request):
"""Start a new long running process unless already started."""
if not request.session.has_key('job'):
# create a temporary file to save the resuls
outfd,outname=mkstemp()
request.session['jobfile']=outname
outfile=fdopen(outfd,'a+')
proc=Popen("python myjob.py",shell=True,stdout=outfile)
# remember pid to terminate the job later
request.session['job']=proc.pid
return HttpResponse('A new job has started.')
def showjob(request):
"""Show the last result of the running job."""
if not request.session.has_key('job'):
return HttpResponse('Not running a job.'+\
'Start a new one?')
else:
filename=request.session['jobfile']
results=open(filename)
lines=results.readlines()
try:
return HttpResponse(lines[-1]+\
'<p>Terminate?')
except:
return HttpResponse('No results yet.'+\
'<p>Terminate?')
return response
def rmjob(request):
"""Terminate the runining job."""
if request.session.has_key('job'):
job=request.session['job']
filename=request.session['jobfile']
try:
kill(job,signal.SIGKILL) # unix only
unlink(filename)
except OSError, e:
pass # probably the job has finished already
del request.session['job']
del request.session['jobfile']
return HttpResponseRedirect('/startjob/') # start a new one

Maybe you could look at the problem the other way around.
Maybe you could try DjangoQueueService, and have a "daemon" listening to the queue, seeing if there's something new and process it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.