Requirements
We have several servers (20-50) - Solaris 10 and Linux (SLES) - running a mix of different applications, each writing log events to text files. We need to capture these on a separate monitoring box, where we can do analysis/reporting/alerting.
Current Approach
Currently, we use SSH with a remote "tail -f" to stream the logfiles from the servers onto the monitoring box. However, this is somewhat brittle.
New Approach
I'd like to replace this with RabbitMQ. The servers would publish their log events into this, and each monitoring script/app could then subscribe to the appropriate queue.
Ideally, we'd like the applications themselves to dump events directly into the RabbitMQ queue.
However, assuming that's not an option in the short term (we may not have source for all the apps), we need a way to basically "tail -f" the logfiles from disk. I'm most comfortable in Python, so I was looking at a Pythonic way of doing that - the consensus seems to be to just use a loop with readline() and sleep() to emulate "tail -f".
Questions
Is there an easier way of "tail -f"-ing a whole bunch of text files directly onto a RabbitMQ stream? Something built in, or an existing extension we could leverage? Any other tips/advice here?
If we do write a Python wrapper to capture all the logfiles and publish them - I'd ideally like a single Python script to concurrently handle all the logfiles, rather than manually spinning up a separate instance for each logfile. How should we tackle this? Are there considerations in terms of performance, CPU usage, throughput, concurrency etc.?
We need to subscribe to the queues, and then possibly dump the events back to disk and reconstruct the original logfiles. Any tips/advice on this? And we'd also like a single Python script we could startup to handle reconstructing all of the logfiles - rather than 50 separate instances of the same script - is that easily achievable?
Cheers,
Victor
PS: We did have a look at Facebook's Scribe, as well as Flume, and both seem a little heavyweight for our needs.
You seem to be describing centralized syslog with RabbitMQ as the transport. If you could live with syslog, take a look at syslog-ng. Otherwise, you might save some time by using parts of logstash (http://logstash.net/).
If possible, you could make the applications publish their events asynchronously to RabbitMQ instead of writing them to log files. I have done this in Java. But sometimes it is not possible to make the app log the way you want.
1) You can write a file tailer in Python which publishes to AMQP. I don't know of anything off the shelf that plugs a file in as an input to RabbitMQ. Have a look at http://code.activestate.com/recipes/436477-filetailpy/ and http://www.perlmonks.org/?node_id=735039 for tailing files (a minimal sketch follows below).
2) You can create a Python daemon which tails all the given files, either as separate processes or in round-robin fashion.
3) A similar approach to 2 can help you solve this. You could probably have a single queue per log file.
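Here is a minimal sketch of point 1, assuming the pika client and a local RabbitMQ broker; the log path, queue name and poll interval are placeholders, and per point 2 you would run one such loop per file (or multiplex several) inside a single daemon:

```python
# Tail one log file and publish each new line to a RabbitMQ queue.
import os
import time
import pika

LOGFILE = "/var/log/myapp.log"   # hypothetical path
QUEUE = "logs.myapp"             # hypothetical queue name

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)

with open(LOGFILE) as f:
    f.seek(0, os.SEEK_END)        # start at the end of the file, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(0.5)       # nothing new yet; poll again shortly
            continue
        channel.basic_publish(exchange="", routing_key=QUEUE,
                              body=line.rstrip("\n"))
```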
If you are talking about application logging (as opposed to e.g. access logs such as Apache webserver logs), you can use a handler for stdlib logging which writes to AMQP middleware.
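A rough sketch of such a handler, assuming a recent pika client and a local RabbitMQ broker; the exchange name and routing-key scheme are made up for illustration:

```python
# A logging.Handler that publishes formatted records to a RabbitMQ topic exchange.
import logging
import pika

class AMQPHandler(logging.Handler):
    def __init__(self, host="localhost", exchange="logs"):
        super().__init__()
        self.exchange = exchange
        self.connection = pika.BlockingConnection(pika.ConnectionParameters(host))
        self.channel = self.connection.channel()
        self.channel.exchange_declare(exchange=exchange, exchange_type="topic")

    def emit(self, record):
        try:
            self.channel.basic_publish(
                exchange=self.exchange,
                routing_key=record.name,      # e.g. "myapp.component"
                body=self.format(record),
            )
        except Exception:
            self.handleError(record)

    def close(self):
        self.connection.close()
        super().close()

# Usage: attach it to whichever logger the application already uses.
logging.getLogger("myapp").addHandler(AMQPHandler())
```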
Related
I am planning to set up a small proxy service for a remote sensor that only accepts one connection. I have a temporary solution, and I am now designing a more robust version, which has led me to dig deeper into the Python multiprocessing module.
I have written a couple of systems in Python using a main process which spawns subprocesses via the multiprocessing module, with multiprocessing.Queue for communication between them. This works quite well, and some of these programs/scripts are doing their job in a production environment.
The new case is slightly different since it uses 2+n processes:
One data-collector, which reads data from the sensor (at 100 Hz) and every once in a while receives short ASCII strings as commands
One main-server, which binds to a socket, listens for new connections and spawns...
n child-servers, which handle clients who want the sensor data
While communication from the child-servers to the data-collector seems pretty straightforward using a multiprocessing.Queue, which handles the n:1 direction well enough, I have problems with the other direction. I can't use a queue for that as well, because all child-servers need to get all the data the sensor produces while they are active. At least I haven't found a way to configure a Queue to mimic that behaviour, as get() removes the topmost item from the Queue by design.
I have already looked into shared memory, which massively increases the management overhead, since as far as I understand it I would basically need to implement a streaming buffer myself.
The only safe way I see right now is using a Redis server and message queues, but I am a bit hesitant, since that would need more infrastructure than I'd like.
Is there a pure python internal way?
Maybe you can use MQTT for that? You did not clearly specify, but it sounds like the observer pattern - or do you want the clients to poll each time they need data? It depends on which delays / data rates / jitter etc. you can accept.
After you provided the information:
"The whole setup runs on one machine in one process space. What I would like to have is a way without going through a third-party process."
I would suggest looking into the observer pattern.
More information can be found, for example, at:
https://www.youtube.com/watch?v=_BpmfnqjgzQ&t=1882s
https://refactoring.guru/design-patterns/observer/python/example
https://www.protechtraining.com/blog/post/tutorial-the-observer-pattern-in-python-879
https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Observer.html
Your server should fork for each new connection and register with the observer; it will then be informed about every change.
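A bare-bones in-process sketch of the pattern, roughly in the spirit of the linked examples; DataCollector and the callback names are illustrative only:

```python
# Each subscriber registers a callback and receives every sample, so the
# data-collector can fan out readings to all active child-servers.
class DataCollector:
    def __init__(self):
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def unsubscribe(self, callback):
        self._observers.remove(callback)

    def publish(self, sample):
        # Called at 100 Hz with each new sensor reading.
        for callback in list(self._observers):
            callback(sample)

# Usage: a child-server subscribes and then gets every published sample.
collector = DataCollector()
collector.subscribe(lambda sample: print("child-server got", sample))
collector.publish({"t": 0.01, "value": 42})
```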
I'm trying to figure out the best way to publish and receive data between separate programs. My ideal setup is to have one program constantly receive market data from an external websocket API and to have multiple other programs use this data. Since this is market data from an exchange, the lower the overhead the better.
My first thoughts were to write out a file and have the others read it, but that seems like it would have file-locking issues. Another approach I tried was to use UDP sockets, but it seems like the socket blocks the rest of the program when receiving. I'm pretty new at writing full-fledged programs instead of little scripts, so sorry if this is a dumb question. Any suggestions would be appreciated. Thanks!
You can use SQS. It is easy to use, and the Python documentation for it is great. If you want a free alternative, you can use Kafka.
Try something like a message queue, e.g. beanstalkd (https://github.com/kr/beanstalkd); you essentially control it via clients - one that collects and sends, one that consumes and marks what it has read, and so on.
Beanstalkd is super lightweight and simple compared to other message queues, which are more like multi-app systems than queues as such.
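A tiny producer/consumer sketch, assuming the (Python 2-era) beanstalkc client library and a beanstalkd daemon on its default port; the tube name and payload are made up:

```python
import beanstalkc

# Producer: the program receiving market data from the websocket API.
producer = beanstalkc.Connection(host="localhost", port=11300)
producer.use("market-data")
producer.put('{"symbol": "BTCUSD", "price": 42000.0}')

# Consumer: another program reading the feed.
consumer = beanstalkc.Connection(host="localhost", port=11300)
consumer.watch("market-data")
job = consumer.reserve()       # blocks until a job is available
print(job.body)
job.delete()                   # mark it as read
```

Note that with a work queue each job is reserved by a single consumer; if several programs all need every update, you would want one tube per consumer or a pub/sub-style broker instead.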
I have a script that continually runs and accepts data (For those that are familiar and if it helps, it is connected to EMDR - https://eve-market-data-relay.readthedocs.org).
Inside the script I have debugging built in so that I can see how much data is currently in the queue for the threads to process; however, this is only usable by printing to the console. What I would like is to be able to run either the same script with an additional option, or a totally different script, that returns the current queue count without having to enable debug.
Is there a way to do this? Could someone please point me in the direction of the documentation/libraries that I need to research?
There are many ways to solve this; two that come to mind:
You can write the queue count to a k/v store (like memcache or redis) and then have another script read it and take whatever other actions are required (see the sketch after this list).
You can create a specific logger for your informational output (like the queue length) and set it to log somewhere other than the console. For example, you could use it to send you an email or log to an external service, etc. See the logging cookbook for examples.
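A small sketch of the first option, assuming the redis-py client and a local Redis server; the key name is arbitrary:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# In the EMDR script, wherever the debug print currently happens:
def report_queue_size(q):
    r.set("emdr:queue_count", q.qsize())

# In a separate monitoring script, run whenever you want the current count:
def read_queue_size():
    value = r.get("emdr:queue_count")
    return int(value) if value is not None else 0

print(read_queue_size())
```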
I'm trying to do some machinery automation with python, but I've run into a problem.
I have code that does the actual control, code that logs, code that provides a GUI, and some other modules, all being called from a single script.
The issue is that an error in one module halts all the others. So, for instance, a bug in the GUI will kill the control systems.
I want to be able to have the modules run independently, so one can crash, be restarted, be patched, etc without halting the others.
The only way I can find to make that work is to store the variables in an SQL database, or files or something.
Is there a way for one Python script to sort of... debug another, so that one script can read or change the variables in the other? I can't find a way to do that which also allows the scripts to be started and stopped independently.
Does anyone have any ideas or advice?
A fairly effective way to do this is to use message passing. Each of your modules is independent, but they can send and receive messages from each other. A very good reference on the many ways to achieve this in Python is the Python wiki page for parallel processing.
A generic strategy
Split your program into pieces where there are servers and clients. You could then use middleware such as 0MQ, Apache ActiveMQ or RabbitMQ to send data between different parts of the system.
In this case, your GUI could send a message to the log-parser server telling it to begin work. Once it's done, the log parser will broadcast a message, carrying a reference to the results, to anyone interested. The GUI could be a subscriber to the channel that the log parser publishes to. Once it receives the message, it will open up the results file and display whatever the user is interested in.
Serialization and deserialization speed is also important; you want to minimise the overhead of communicating. Google Protocol Buffers and Apache Thrift are effective tools here.
You will also need some form of supervision strategy to prevent a failure in one of the servers from blocking everything. supervisord will restart things for you and is quite easy to configure. Again, it is only one of many options in this space.
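A condensed pub/sub sketch of this strategy, assuming the pyzmq bindings; the topic name, port and payload are illustrative, and the short sleep is a crude workaround for the PUB/SUB "slow joiner" race:

```python
import json
import time
from multiprocessing import Process
import zmq

ADDR = "tcp://127.0.0.1:5556"

def log_parser():
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind(ADDR)
    time.sleep(1)   # give subscribers time to connect before broadcasting
    pub.send_string("results " + json.dumps({"path": "/tmp/parsed.log"}))

def gui():
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect(ADDR)
    sub.setsockopt_string(zmq.SUBSCRIBE, "results")
    topic, payload = sub.recv_string().split(" ", 1)
    print("GUI saw", topic, json.loads(payload))

if __name__ == "__main__":
    g = Process(target=gui); g.start()
    p = Process(target=log_parser); p.start()
    g.join(); p.join()
```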
Overkill much?
It sounds like you have created a simple utility. The multiprocessing module is an excellent way to have different bits of the program running fairly independently. You still apply the same strategy (message passing, no shared state, supervision), but with different tactics.
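The same message-passing idea with only the standard library; each piece runs in its own process and talks over a queue, and all the names here are illustrative:

```python
from multiprocessing import Process, Queue

def control_loop(inbox):
    # The control system: reacts only to messages, never to GUI state.
    while True:
        msg = inbox.get()          # blocks until something arrives
        if msg == "stop":
            break
        print("control received:", msg)

if __name__ == "__main__":
    to_control = Queue()
    control = Process(target=control_loop, args=(to_control,))
    control.start()

    # "GUI" side: it only ever puts messages on the queue, so a crash here
    # does not corrupt state inside the control process.
    to_control.put("set_speed 10")
    to_control.put("stop")
    control.join()
```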
You want multiple independent processes, and you want them to talk to each other. Hence: read up on what methods of inter-process communication are available on your OS. I recommend sockets (generic, will work over a network and across different OSes). You can easily invent a simple (maybe HTTP-like) protocol on top of TCP, perhaps with JSON for messages. There is a bunch of classes in the Python distribution to make this easy (SocketServer.ThreadingMixIn, SocketServer.TCPServer, etc.).
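A minimal threaded TCP server speaking newline-delimited JSON, written against the Python 3 spelling socketserver (SocketServer in Python 2); the port and the echo logic are placeholders:

```python
import json
import socketserver

class JSONHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:                      # one JSON message per line
            msg = json.loads(line.decode("utf-8"))
            reply = {"echo": msg}                    # placeholder for real logic
            self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))

if __name__ == "__main__":
    with socketserver.ThreadingTCPServer(("127.0.0.1", 9000), JSONHandler) as srv:
        srv.serve_forever()
```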
This is my first hack at doing any system-level programming (mostly a LAMP/PHP - specifically Drupal - web dev up to this point).
Because of the availability of a library with a very specific feature, I am using Python for an upcoming project. I need to run, restart as needed, monitor and respond to the output of multiple Python script processes, ideally controlled via an HTTP API from another master program which keeps a database of processes that need to be running, plus some metadata about those processes (parameters, pid, etc.). I'm planning on building this master program in PHP as I have far more experience with it, hence the desire for a nice HTTP API.
Is there some best practice for this type of system? Some initial research lead me to supervisord (which has XML-RPC built in, apparently), but I thought I'd check the wisdom of the masses who've actually been down this road before moving forward with testing.
I can't say I have been down this road, but I am working towards it. I would look into the multiprocessing libraries for Python; parts of them are network-transparent. A couple of routes you could take with those:
1. Create a process that controls all of the other processes. Make this process a server you can control from your PHP (a rough sketch follows below).
2. Determine how to get PHP to communicate with these networked Python processes. They may still need to be launched from a central Python process, however.
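A rough sketch of route 1: a central Python controller process that manages worker processes and exposes an XML-RPC interface that PHP (or anything speaking HTTP) can call. All the names here (run_worker, the port, the methods) are made up for illustration:

```python
import time
from multiprocessing import Process
from xmlrpc.server import SimpleXMLRPCServer

processes = {}   # name -> Process

def run_worker(name):
    # Placeholder for the real script's work loop.
    while True:
        time.sleep(1)

def start(name):
    p = Process(target=run_worker, args=(name,), daemon=True)
    p.start()
    processes[name] = p
    return p.pid

def stop(name):
    p = processes.pop(name, None)
    if p is not None:
        p.terminate()
        p.join()
    return True

def status():
    return {name: p.is_alive() for name, p in processes.items()}

if __name__ == "__main__":
    server = SimpleXMLRPCServer(("127.0.0.1", 8000), allow_none=True)
    for fn in (start, stop, status):
        server.register_function(fn)
    server.serve_forever()
```

Alternatively, since you already found supervisord, its built-in XML-RPC interface gives you start/stop/status over HTTP without writing the controller yourself.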