Python file-based queue that is process-safe

Is there a process-safe, persistent (disk-based) Python FIFO queue?
Could someone provide a simple example with a script that writes a string to the queue, and another that reads one string at a time from the queue? Note that each of these processes can be launched multiple times from the command line, i.e. multiple writers and multiple readers.
Background: I have several scripts that each produce some output every once in a while. I would like to be aware of the output they create and use other script(s) to process the information they produce. Unfortunately, I cannot use Python's multiprocessing or threading modules, because the scripts can run from different machines on the same file system; each of the scripts is launched from the command line.
All I need is that each of the queue elements is a string, and that the queue is process-safe.
Edit: I found the queuelib and pqueue modules, but they don't provide process safety. I am thinking of replacing queuelib's file handling with atomicfile calls, but I am afraid of spending too long debugging it, since I am short on time, and there are always fine details to take care of when writing software that should be process-safe.
Note1: I am using Python 2.7.
Note2: I am aware of this question; however, my requirements are very modest, and I am looking for a simple solution, while the answers to that question refer to complex libraries with no examples of how to use the queues.
EDIT (2019): For anyone landing here: SQLite provides atomic access, has a Python API in the standard library, and is serverless. I therefore believe it is possible (and easy) to implement the queue in my question with it.
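For the record, a minimal sketch of that idea (the file and table names are illustrative; it assumes the filesystem supports SQLite's locking, which NFS often does not):

```python
import sqlite3

def connect(path="queue.db"):
    # isolation_level=None puts sqlite3 in autocommit mode so we can
    # manage transactions explicitly below.
    conn = sqlite3.connect(path, timeout=30, isolation_level=None)
    conn.execute("CREATE TABLE IF NOT EXISTS q"
                 " (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT)")
    return conn

def put(conn, item):
    # A single INSERT is atomic on its own.
    conn.execute("INSERT INTO q (item) VALUES (?)", (item,))

def get(conn):
    # BEGIN IMMEDIATE takes the write lock up front, so two concurrent
    # readers cannot pop the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, item FROM q ORDER BY id LIMIT 1").fetchone()
        if row is not None:
            conn.execute("DELETE FROM q WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[1] if row is not None else None
```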

Related

Calling Python code from Twisted

First of all, I should say that this might be more of a design question than a question about the code itself.
I have a network with one Server and multiple Clients (written in Twisted, because I need its asynchronous, non-blocking features); each server-client pair just sends and receives messages.
However, at some point I want one client to run a Python file when it receives a certain message. That client should keep listening and talking to the Server, and I should also be able to stop that file if needed, so my first thought was to start a thread for that Python file and forget about it.
In the end it should go like this: the Server sends a message to ClientA; ClientA's dataReceived function interprets the message and decides to run that Python file (which may take an unknown amount of time and may contain blocking calls); when the Python file finishes running, the result should be sent to ClientB.
So, questions are:
Would starting a thread for that Python file in ClientA be a good idea?
As I want to send the result of that Python file to ClientB, can I have another reactor loop inside that Python file?
In any case, I would highly appreciate any kind of advice, as neither Python nor Twisted is my specialty and these ideas may not be the best ones.
Thanks!
At first reading, I thought you were implying Twisted isn't Python. If you are thinking that, keep in mind the following:
Twisted is a Python framework, i.e. it is Python. Specifically, it's about getting the most out of a single process/thread/core by letting the programmer hand-tune the scheduling/order of operations in their own code (which is nearly the opposite of the typical use of threads).
While you can interact with threads in Twisted, it's quite tricky to do so without ruining Twisted's efficiency (for a longer description of threads vs. events, see this SO answer: https://stackoverflow.com/a/23876498/3334178 ).
If you really want to run your new Python code away from your Twisted Python (i.e. get this work running on a different core), then I would look at spawning it off as a process; see Glyph's answer in this SO question: https://stackoverflow.com/a/5720492/3334178 for good libraries to get that done.
Processes give the proper separation to let your Twisted apps run without meaningful slowdown, and you should find that all your starting/stopping/pausing/killing needs are fulfilled.
Answering your questions specifically:
Would starting a thread for that Python file in ClientA be a good idea?
I would say no: it's generally not a good idea, and in your specific case you should look at using processes instead.
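A minimal sketch of the process route, assuming Twisted is installed; long_task.py is a hypothetical stand-in for your blocking script:

```python
import sys

from twisted.internet import reactor
from twisted.internet.utils import getProcessOutput

def on_result(output):
    # Here you would forward the result over the connection to ClientB.
    print("task finished: %r" % (output,))
    reactor.stop()

def on_error(failure):
    print("task failed: %s" % (failure,))
    reactor.stop()

# Run the blocking script in a child process; the reactor stays free
# to keep listening and talking to the Server in the meantime.
d = getProcessOutput(sys.executable, ("long_task.py",))
d.addCallbacks(on_result, on_error)
reactor.run()
```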
Can I have another reactor loop inside that Python file?
Strictly speaking, no, you can't have multiple reactors. But what Twisted can do is concurrently manage hundreds or thousands of separate tasks, all in the same reactor, and that will do what you need: run all your different async tasks in one reactor; that is what Twisted is built for.
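For illustration, a minimal sketch (the task names are made up) of several independent tasks sharing one reactor:

```python
from twisted.internet import reactor, task

def poll(name):
    # Stand-in for one of your async tasks.
    print("%s tick" % name)

# Two independent recurring tasks, multiplexed by the one reactor.
task.LoopingCall(poll, "taskA").start(1.0)
task.LoopingCall(poll, "taskB").start(2.0)
reactor.run()
```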
BTW, I always recommend the following tutorial for Twisted: the krondo Twisted Introduction (http://krondo.com/?page_id=1327). It's long, but if you get through it, this kind of work will become very clear.
All the best!

Multiprocessing or Multithreading for plugin architecture in Python

I'm trying to implement a plugin architecture in Python.
I've started writing it using the threading module, where each plugin is a thread that I invoke with the Thread.start() method (since all plugins subclass BasePlugin, which subclasses Thread). However, I've just come across the multiprocessing module.
I'm currently wondering if I should switch to the multiprocessing module and share data using shared memory / Pipes etc...
I'd like to get other's opinions on this.
The plugin architecture I've been working on works as follows:
An event is received by the Plugin Manager. The Plugin Manager checks for all the plugins that have subscribed to that type of event. It activates them and sends them the event object (since it holds additional information). If one of the plugins is already active, there is no need to spawn it (just send the event object to it).
In addition there are a few resources which belong only to one plugin at any point in time. Each plugin can request the resource (I'm not worrying about any race condition here since there won't be that many plugins active at once).
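To make the discussion concrete, here is a hedged sketch (Python 3 syntax; the class and method names are illustrative, not the asker's actual code) of that dispatch with threads, where each plugin consumes events from its own queue:

```python
import queue
import threading

class BasePlugin(threading.Thread):
    subscriptions = ()  # event types this plugin handles

    def __init__(self):
        super().__init__(daemon=True)
        self.events = queue.Queue()

    def run(self):
        # Each plugin thread blocks on its own queue and handles
        # events as they arrive.
        while True:
            event = self.events.get()
            self.handle(event)

    def handle(self, event):
        raise NotImplementedError

class PluginManager:
    def __init__(self, plugins):
        self.plugins = plugins

    def dispatch(self, event):
        for plugin in self.plugins:
            if type(event) in plugin.subscriptions:
                if not plugin.is_alive():
                    plugin.start()        # activate on first matching event
                plugin.events.put(event)  # already active: just send the event
```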
Threads share memory with the primary process and each other. For example you can have a list that is available to all threads. An item appended to a list can be seen by other threads. But you have to be careful. You have to understand which operations on data structures are thread safe and which are not. What happens to the behaviour of your program when two threads are checking for the existence of a key in a dictionary and then writing to it?
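A small sketch of that exact hazard and a lock-based fix (the names are illustrative):

```python
import threading

shared = {}
lock = threading.Lock()

def unsafe_update(key, value):
    # Two threads can both see the key as missing and both write:
    # the check and the write are not one atomic step.
    if key not in shared:
        shared[key] = value

def safe_update(key, value):
    # Holding the lock makes the check-then-set a single atomic step.
    with lock:
        if key not in shared:
            shared[key] = value
```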
Multiple processes do not share memory. The new process that you start gets a copy of the memory at the point where it was spawned.
Threads use fewer resources but can be hard to reason about. On the other hand, communication between processes is tricky, and you can't just access an arbitrary Python data structure, which it sounds like you want to be able to do.
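For comparison, passing data between processes means sending copies through a channel such as multiprocessing.Queue; a minimal sketch:

```python
from multiprocessing import Process, Queue

def worker(q):
    # Objects put on the queue are pickled and copied, not shared.
    q.put({"event": "done"})

if __name__ == "__main__":
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()
    print(q.get())  # receives a copy of the dict from the child
    p.join()
```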
A badly written plugin, if it was in a thread, could crash your whole program. Whereas if it was in a separate process this wouldn't happen. Maybe that's a consideration?

Controlling scheduling priority of Python threads?

I've written a script that uses two thread pools of ten threads each to pull in data from an API. The thread pool is based on this code from ActiveState. Each thread pool monitors a Redis database via Pub/Sub for new entries. When a new entry is published, Python passes the data to a function that uses Python's subprocess.Popen to execute a PHP shell that does the actual work of calling the API.
This system of launching PHP shells is necessary for functionality with my PHP web app, so launching PHP shells with Python can't be avoided.
This script will only be running on Linux servers.
How do I control the niceness (scheduling priority) of the application's threads?
Edit:
It seems controlling scheduling priority for individual threads in Python isn't possible. Is there a Python solution, or at the very least a UNIX command I can run along with my script, to control the priority?
Edit 2:
Well, I didn't end up finding a Python way to handle it. I'm just running my script with nice now, like this:
nice -n 19 python MyScript.py
I believe that thread priority is not controllable in Python because of how threads are implemented using the global interpreter lock (GIL). Having said that, even if you could give one thread more CPU priority, the interpreter, which hands the GIL from thread to thread, would not take that priority into account. If you were able to increase the niceness of a single thread in your pool (say it is doing a more important job), you would need to use your own implementation of locks to give the higher-priority thread access to the GIL more often.
A Google search returns these articles, which I believe are similar to what you are asking about.
This one explains why it doesn't work:
http://www.velocityreviews.com/forums/t329441-threading-priority.html
This one explains the workaround I was suggesting:
http://bytes.com/topic/python/answers/645966-setting-thread-priorities
The Python threading docs mention explicitly that there is no support for setting thread priorities:
The design of this module is loosely based on Java’s threading model. However, where Java makes locks and condition variables basic behavior of every object, they are separate objects in Python. Python’s Thread class supports a subset of the behavior of Java’s Thread class; currently, there are no priorities, no thread groups, and threads cannot be destroyed, stopped, suspended, resumed, or interrupted. The static methods of Java’s Thread class, when implemented, are mapped to module-level functions.
It doesn't work, but I tried:
getting the parent pid and priority
launching threads using concurrent.futures.ThreadPoolExecutor
using ctypes to get the (Linux) thread id from within the thread (works)
using the tid with os.setpriority(os.PRIO_PROCESS, tid, parent_priority + 1)
calling pool.shutdown() from the parent.
Even with a liberal sprinkling of os.sched_yield(), the child threads never actually run past the setpriority() call.
Reading the man pages, it seems threads don't have permission to change (even their own) scheduling priority; you have to use Linux "capabilities" to give the thread the CAP_SYS_NICE capability. Running the process with root permissions didn't help either; the child threads still don't run.
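For reference, a hedged reconstruction of the attempt described above (Linux, x86-64; the syscall number is platform-specific, and on Python 3.8+ threading.get_native_id() replaces the ctypes call):

```python
import ctypes
import os

SYS_gettid = 186  # x86-64 syscall number; differs on other architectures

def gettid():
    # The Linux thread id of the calling thread, via raw syscall.
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    return libc.syscall(SYS_gettid)

def renice_current_thread(delta):
    tid = gettid()
    prio = os.getpriority(os.PRIO_PROCESS, 0)  # current process priority
    # On Linux, PRIO_PROCESS with a thread id addresses that one thread;
    # raising priority (a negative delta) requires CAP_SYS_NICE.
    os.setpriority(os.PRIO_PROCESS, tid, prio + delta)
```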
I know, a lot of time has passed, but I recently came across this question, and I thought it would be useful to add another option.
Have a look at threading2, which is a drop-in replacement and extension for the default threading module, with support – sort of – for priority and affinity.
I was wondering if this answer at another related question might be useful in this scenario? (link)
As you are already using subprocess.Popen to launch your PHP script, it strikes me that you could use preexec_fn with either a predefined function or a lambda function (as demonstrated in the answer linked above) to set the nice level of each launched PHP process.
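A minimal sketch of that suggestion (the PHP script path is hypothetical):

```python
import os
import subprocess

def lower_priority():
    # Runs in the child between fork() and exec(); an unprivileged
    # process can only raise its niceness (lower its priority).
    os.nice(10)

proc = subprocess.Popen(
    ["php", "worker.php"],   # hypothetical PHP worker script
    preexec_fn=lower_priority,
)
proc.wait()
```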

Task queue processing in Python

Task is:
I have a task queue stored in a database, and it grows. I need to process tasks with a Python script when I have the resources for it. I see two ways:
1. A Python script that runs all the time. But I don't like this option (possible memory leaks).
2. A Python script called by cron that does a small part of the task. But then I need to solve the problem of keeping only one active script in memory (to prevent the number of active scripts from growing). What is the best way to implement this in Python?
Any ideas on how to solve this problem?
You can use a lockfile to prevent multiple copies of the script from running out of cron. See the answers to an earlier question, "Python: module for creating PID-based lockfile". This is really just good practice in general for anything where you need to make sure multiple instances won't be running, so you should look into it even if you do have the script running constantly, which I do suggest.
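A minimal sketch of that pattern (Unix only; the lock path is illustrative), using fcntl rather than a dedicated lockfile module:

```python
import fcntl
import os
import sys

lockfile = open("/tmp/myscript.lock", "w")
try:
    # Non-blocking exclusive lock: fails immediately if another
    # instance already holds it.
    fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
except IOError:
    sys.exit("another instance is already running")
lockfile.write(str(os.getpid()))
lockfile.flush()
# ... do the work; the lock is released when the process exits.
```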
For most things, it shouldn't be too hard to avoid memory leaks, but if you're having a lot of trouble with it (I sometimes do with complex third-party web frameworks, for example), I would suggest instead writing the script with a small, carefully-designed main loop that monitors the database for new jobs, and then uses the multiprocessing module to fork off new processes to complete each task.
When a task is complete, the child process can exit, immediately freeing any memory that isn't properly garbage collected, and the main loop should be simple enough that you can avoid any memory leaks.
This also offers the advantage that you can run multiple tasks in parallel if your system has more than one CPU core, or if your tasks spend a lot of time waiting for I/O.
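A hedged sketch of that design; fetch_jobs() and handle() are hypothetical stand-ins for your database polling and task code:

```python
import time
from multiprocessing import Process

def fetch_jobs():
    return []  # hypothetical: query the DB queue for new tasks

def handle(job):
    pass  # hypothetical: do the work; memory is freed when the child exits

def main_loop():
    # Small, carefully bounded loop: all real work happens in
    # short-lived child processes.
    while True:
        for job in fetch_jobs():
            Process(target=handle, args=(job,)).start()
        time.sleep(5)

if __name__ == "__main__":
    main_loop()
```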
This is a bit of a vague question. One thing you should remember is that it is very difficult to leak memory in Python, because of the automatic garbage collection. Running a Python script from cron to handle the queue isn't very nice, although it would work fine.
I would use method 1; if you need more power you could make a small Python process that monitors the DB queue and starts new processes to handle the tasks.
I'd suggest using Celery, an asynchronous task queuing system which I use myself.
It may seem a bit heavy for your use case, but it makes it easy to expand later by adding more worker resources if/when needed.
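For a sense of scale, a minimal Celery setup looks roughly like this (the broker URL and task body are illustrative):

```python
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def process_entry(entry):
    ...  # the actual work for one queued task

# A producer anywhere on the network then just calls:
#   process_entry.delay("some entry")
# and a worker started with `celery -A tasks worker` picks it up.
```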

Python Job Service Daemon?

What packages should I look at for writing a Python daemon and processing jobs? Also, what do I need to do to make a Python daemon?
I'm pretty happy with beanstalkd, which has client libraries available in various languages:
Daemon:
http://kr.github.com/beanstalkd/
Python client library:
http://code.google.com/p/pybeanstalk/
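For a feel of the model, here is a hedged sketch using the beanstalkc client (a different Python client from the one linked above, shown because its API is compact):

```python
import beanstalkc

conn = beanstalkc.Connection(host="localhost", port=11300)

# Producer side: enqueue a job payload.
conn.put("a job payload")

# Consumer side: reserve() blocks until a job arrives.
job = conn.reserve()
print(job.body)
job.delete()  # acknowledge completion so it isn't redelivered
```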
Your question is a bit ambiguous, but I'm assuming you mean you would like to write a Python daemon that will process jobs that get thrown into a queue. If not, please say as much. :-)
I've heard a lot of great things about redis. The folks at GitHub built resque as a job-processing daemon for Ruby. If you're flexible about language, you could just use that; if you're not, you could emulate it in as much or as little depth as you like, using redis as your queue system. Depending on how pluggable and extensible you need it to be, this could be a really simple thing to implement.
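As a sense of how simple the redis route can be, here is a hedged sketch with redis-py (the queue name and handler are illustrative):

```python
import redis

r = redis.Redis()

def enqueue(payload):
    # Producers push jobs onto a list.
    r.lpush("jobs", payload)

def worker():
    # Consumers block until a job is available.
    while True:
        _key, payload = r.brpop("jobs")
        print("processing %r" % (payload,))  # stand-in for real work
```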
Another option I ran across after some more googling is redqueue. It looks like it might already implement most of a job queue.
If you're using Django, you may wish to consider the Celery project. It's a job queue system based on RabbitMQ, which is yet another queuing server with excellent reviews.
As far as creating a daemon in Python goes, there are a number of options. You can look at this page on ActiveState, which is a good start. Better yet, you can use python-daemon to do it all for you. But if you use one of the above options or beanstalkd, as recommended by mczepiel, you probably won't have to make your process run as a daemon.
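With python-daemon, the daemonization itself is roughly this small (main() is a stand-in for your job-processing loop):

```python
import daemon

def main():
    pass  # poll the queue and process jobs here

if __name__ == "__main__":
    # DaemonContext detaches from the terminal, forks into the
    # background, and closes standard file descriptors.
    with daemon.DaemonContext():
        main()
```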
I have recently (this week) implemented a queue in RabbitMQ with a Python daemon that extracts the information and stores it in a database (using the Django ORM). The daemon has an intermediate buffer, so it waits a little and writes to the database in batches instead of writing every time a small message arrives.
I've made the integration with the queue using this little flopsy module, which is easy to set up. The only problem I had was setting up a timeout for waiting on a message, as the module has no clear way of doing that. After a while playing with the interactive shell and making a few dir() calls, I managed to get to the socket object and set up the timeout.
I also considered Celery, but it seems to be more focused on using RabbitMQ internally to let you launch tasks (periodically or asynchronously) than on using a queue to communicate with other systems. In our case, the queue can be fed both by Python systems and by Ruby ones.
Once I completed the process, I made some adjustments to allow running it as a daemon (mostly storing the standard output in a file to allow easy logging) and then created a bash script that launches a start-stop-daemon command. I more or less followed this scheme.
I discovered python-daemon just about one day too late, so after the work was done it made no sense to revisit it, but it may make more sense for a pure-Python project.
