Best practice for Python process control - python

This is my first hack at doing any system-level programming (mostly a LAMPhp, specifically Drupal, web dev up to this point).
Because of availability of a library with a very specific feature, I am using Python for an upcoming project. I need to run, restart as needed, monitor and respond to the output of multiple Python script processes, controlled ideally via a HTTP API from another master program which keeps a database of processes that need to be running, and some metadata about those processes (parameters, pid, etc). I'm planning on building this master program in PHP as I have far more experience in it, hence the want for a nice HTTP API.
Is there some best practice for this type of system? Some initial research lead me to supervisord (which has XML-RPC built in, apparently), but I thought I'd check the wisdom of the masses who've actually been down this road before moving forward with testing.

I can't say I have been down this road, but I am working to go down this road. I would look into the multiprocessing libraries for Python. There are network transparent libraries. A couple of routes you could take with those:
1. Create a process that controls all of the other processes. Make this process a server you can control with your PHP.
2. Determine how to get PHP to communicate to these networked Python processes. They may still need to be launched from a central Python process however.

Related

Create a daemon background service?

I'm trying to create a background service in Python. The service will be called from another Python program. It needs to run as a daemon process because it uses a heavy object (300MB) that has to be loaded previously into the memory. I've had a look at python-daemon and still haven't found out how to do it. In particular, I know how to make a daemon run and periodically do some stuff itself, but I don't know how to make it callable from another program. Could you please give some help?
I had a similar situation when I wanted to access a big binary matrix from a web app.
Of course there are many solutions, but I used Redis, a popular in-memory database/cache system, to store and access my object successfully. It has practical Python bindings (several probably equivalent wrapper libraries).
The main advantage is that when the service goes down, a copy of the data still remains on disk. Also, I noticed that once in place, it could be used for other things in my app (for instance Celery proposes it as backend), and actually, for other services in any other unrelated program.

Examples for writing a daemon or a service in Linux

I have been looking at daemons for Linux such as httpd and have also looked at some code that can be used as a skeleton. I have done a fair amount of research and now I want to practice writing it. However, I'm not sure of what can I use a daemon for. Any good examples/ideas that I can try to execute?
I was thinking of using a daemon along with libnotify on Ubuntu to have pop-up notifications of select tweets.
Is this a bad example for implementing a daemon?
Will you even need a daemon for this?
Can this be implemented as a service rather than a daemon?
First: PEP 3143 tries to enumerate all of the fiddly details you have to get right to write a daemon in Python. And it specifies a library that takes care of those details for you.
The PEP was deferred—at least in part because the community felt it was more a responsibility of POSIX or some Linux standards group or something to first define exactly what is essential to being a daemon, before Python could have its own position on how to implement one. But it's still a great guide. However, the reference implementation of that proposed library still lives on, as python-daemon, which you can install from PyPI.
Meanwhile, the really interesting question for this project isn't so much service vs. daemon, as root vs. user. Do you want a single process that keeps track of all users' twitter accounts, and sends notifications to anyone who's logged in? Just a per-user process? Or maybe both, a single process watching all the tweets, then sending notifications via user processes?
Of course you don't really need a daemon or service for this. For example, it could be a GUI app whose main window is a configuration dialog, which keeps running (maybe with a traybar thingy) even when you close the config dialog, and it would work just as well. The question isn't whether you need a daemon, but whether it's more appropriate. Which really is a design choice.

Fastest, simplest way to handle long-running upstream requests for Django

I'm using Django with Uwsgi. We have 8 processes running, and I have no real indication that our code is particularly thread safe, as it was never designed with threads in mind.
Recently, we added the ability to get live rates from vendors of a service through their various APIs and display them at once for the user. The problem is these requests are old web services technologies, and due to their response times, the time needed before the all rates from vendors are acquired (or it gives up), can be up to 10 seconds.
This presents a problem. We have a pretty decent amount of traffic on our site, and the customers need to look at these rates pretty often. With only 8 processes, it's quite easy to see how the server can get tied up waiting on these upstream requests. Especially when other optimizations need to be made to make the site baseline faster anyway (we're working on that).
We made a separate library (which should be mostly threadsafe, and if not, should be converted to it easily enough) for the rates requesting, and we can separate out its configuration. So I was thinking of making a separate service with its own threads, perhaps in Twisted, and having the browser contact that service for JSON instead of having it run in the main Django server.
Is this solution a good one? Can you think of a better or simpler way to do it? Should I use something other than Twisted, and if so, why?
If you want to use your code in-process with Django, you can simply call out to your Twisted by using Crochet, which can automatically manage the creation, running, and shutdown of the reactor within whatever WSGI implementation you choose (presuming that it behaves like a regular Python process, at least).
Obviously it might be less complex to just run within the Twisted WSGI container :-).
It might also be worth looking at TReq to issue your service client requests; your new "thread safe" library will still have the disadvantage of tying up an entire thread for each blocking client, which is a non-trivial amount of memory and additional concurrency overhead, whereas with Twisted you will only need to worry about a couple of objects.

AWS and Python threading scalability

I have a service running on a local server, written using Python threading library. Think of it as a kind of web crawler. It uses 50 threads. I want deploy it on Amazon Web Services cloud and scale it up, so it uses more threads.
Simply, I have two queues: Qinput with URLs and Qoutput with pages content. The threads pick URLs from Qinput, fetch content of the web page an put it to Qoutput
Question: is it enough that I simply increase the number of threads to, say, 500, 5,000 or 50,000 and AWS + Python will handle it? Should I expect the service to run seamlessly or there are some "standard" design pitfalls that I should be aware of when porting a multithreading service on AWS?
I am aware of Global Interpreter Lock although it should not be an issue here, as the main task of the threads is to call outside the interpreter while crawling / scraping pages
Any single instance has its limit. You will probably be able to spawn quite a lot of threads in your instance, especially if you choose the larger ones. But you will get diminished return on the additional threads, until it will not help you any more to get more performance.
However, if you want your system to scale beyond the limitation of a single instance, it is best to be able to run your system on multiple instances. Then your decisions is only operational and not technical. I think that if you are running in AWS environment, which allows you almost endless operational resources, you should think into it.
You can also check out SQS, which is basically a distributed queue system. It will allow you to synchronize the work of as many instances as you need.

Is there a good way to split a python program into independent modules?

I'm trying to do some machinery automation with python, but I've run into a problem.
I have code that does the actual control, code that logs, code the provides a GUI, and some other modules all being called from a single script.
The issue is that an error in one module halts all the others. So, for instance a bug in the GUI will kill the control systems.
I want to be able to have the modules run independently, so one can crash, be restarted, be patched, etc without halting the others.
The only way I can find to make that work is to store the variables in an SQL database, or files or something.
Is there a way for one python script to sort of ..debug another? so that one script can read or change the variables in the other? I can't find a way to do that that also allows to scripts to be started and stopped independently.
Does anyone have any ideas or advice?
A fairly effective way to do this is to use message passing. Each of your modules are independent, but they can send and receive messages to each other. A very good reference on the many ways to achieve this in Python is the Python wiki page for parallel processing.
A generic strategy
Split your program into pieces where there are servers and clients. You could then use middleware such as 0MQ, Apache ActiveMQ or RabbitMQ to send data between different parts of the system.
In this case, your GUI could send a message to the log parser server telling it to begin work. Once it's done, the log parser will send a broadcast message to anyone interested telling the world the a reference to the results. The GUI could be a subscriber to the channel that the log parser subscribes to. Once it receives the message, it will open up the results file and display whatever the user is interested in.
Serialization and deserialization speed is important also. You want to minimise the overhead for communicating. Google Protocol Buffers and Apache Thrift are effective tools here.
You will also need some form of supervision strategy to prevent a failure in one of the servers from blocking everything. supervisord will restart things for you and is quite easy to configure. Again, it is only one of many options in this space.
Overkill much?
It sounds like you have created a simple utility. The multiprocessing module is an excellent way to have different bits of the program running fairly independently. You still apply the same strategy (message passing, no shared shared state, supervision), but with different tactics.
You want multiply independent processes, and you want them to talk to each other. Hence: read what methods of inter-process communication are available on your OS. I recommend sockets (generic, will work over a n/w and with diff OSs). You can easily invent a simple (maybe http-like) protocol on top of TCP, maybe with json for messages. There is a bunch of classes coming with Python distribution to make it easy (SocketServer.ThreadingMixIn, SocketServer.TCPServer, etc.).

Categories

Resources