How to tell if a Python logging process is already running? - python

I work on Linux.
I have a function which calculates some data and returns results.
This function runs in parallel, in many copies, on many workers on the local network.
I pass the function to a program which manages tasks for the workers.
I need to have logging in that function.
I set up logging_manager.py with a proper configuration in it - a RotatingFileHandler (3 MB for every file, 99 log files max). The rotating file handler keeps the total log size stable - I can't get more than about 300 MB of logs.
But when I set up logging at the beginning of the function, it creates a new root logger for every parallel execution; over time every logger opens a file, the machine exceeds the maximum number of processes/open files it can handle, and it crashes. The huge amount of logging also makes my log files messy, because every logger tries to log to the same file without knowing about the others.
This situation slows me down. I don't like it.
If I set the logger to log to a new file every time, I will quickly run out of disk space on my machine and will have to clean it up by hand or set up a service to do it for me - I don't want that.
I tried to run the logger as a separate process and manage the logs with a Queue, but then again, I have to set it up inside that function, so the result is the same - a new process for every execution, and the loggers don't know about each other.
So my question is:
How do I tell a process that a similar process already exists, and force it to connect to the existing logger process's queue instead of creating a new, separate process with a separate queue?
What I want is something like a centralized Python logging service for the machine, which would handle all my parallel functions and properly manage the log messages in the files.
Is it even possible? Or do I need to find a different solution?
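One way to get that behaviour with only the standard library is to stop letting each worker open files at all: every copy of the function attaches a SocketHandler pointing at one fixed local port, and a single listener daemon (for example one based on the socket server in the logging cookbook) owns the RotatingFileHandler, so the 3 MB / 99 file limit is enforced in exactly one place. A minimal sketch of the worker side, assuming a host, port and helper name that are not part of the original setup:

import logging
import logging.handlers

LOG_HOST = "localhost"                                  # hypothetical machine-wide log service
LOG_PORT = logging.handlers.DEFAULT_TCP_LOGGING_PORT    # 9020 by default

def get_worker_logger(name="worker"):
    # Attach a SocketHandler once per process; repeated calls reuse it.
    logger = logging.getLogger(name)
    if not any(isinstance(h, logging.handlers.SocketHandler) for h in logger.handlers):
        logger.addHandler(logging.handlers.SocketHandler(LOG_HOST, LOG_PORT))
        logger.setLevel(logging.INFO)
    return logger

def my_task(data):
    logger = get_worker_logger()
    logger.info("processing %s", data)
    # ... calculate and return results ...

Because the workers never touch the log files themselves, it no longer matters how many copies run in parallel or whether they know about each other.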

Related

Should I be using threads, multiprocessing, or asyncio for my project?

I am trying to build a temperature control module that can be controlled over a network or with manual controls. The individual parts of my program all work, but I'm having trouble figuring out how to make them all work together. Also, my temperature control module is Python and the client is C#.
As far as physical components go, I have a keypad that sets a temperature and turns the heater on and off, an LCD screen that displays temperature data, and of course a temperature sensor.
For my network stuff I need to:
constantly send temperature data to the client.
send a list of log files to the client.
await prompts from the client to either set the desired temperature or send a log file to the client.
So far all the hardware works fine, and each individual part of the network functions works, but not together. I have not yet tried to use the physical and network components at the same time.
I have been attempting to use threads for this but was wondering if I should be using something else?
EDIT:
Here is the basic logic behind what I want to do:
Hardware:
The keypad takes number inputs until '*', then sets a temp variable.
The temp variable is compared to sensor data and the heater is turned on or off accordingly.
'#' turns off the heater and sets the temp variable to 0.
Sensor data is written to log files while the temp variable is not 0.
Network:
Upon connecting, the client is sent a list of log files.
Temperature sensor data is continuously sent to the client.
A prompt handler listens for prompts.
If the client requests a log file, the temperature data stream is halted, the file is sent, and then the temperature data stream is resumed.
The client can send a command to the prompt handler to set the temp variable and trigger the heater.
The client can send a command to the prompt handler to stop the heater and set the temp variable to 0.
Commands from either the keypad or the client should work at all times.
Multiprocessing is generally for when you want to take advantage of the computational power of multiple processing cores. Multiprocessing limits your options on how to handle shared state between components of your program, as memory is copied initially on process creation, but not shared or updated automatically. Threads execute from the same region of memory, and do not have this restriction, but cannot take advantage of multiple cores for computational performance. Your application does not sound like it would require large amounts of computation, and simply would benefit from concurrency to be able to handle user input, networking, and a small amount of processing at the same time. I would say you need threads not processes. I am not experienced enough with asyncio to give a good comparison of that to threads.
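To make the shared-state point concrete, here is a tiny illustration (the variable names are invented): a thread sees an update made by the main program immediately, because both run in the same memory space, whereas a child process would only ever see its own copy:

import threading
import time

state = {"setpoint": 20}          # shared between the main thread and the worker thread

def controller():
    for _ in range(4):
        print("controller sees setpoint =", state["setpoint"])
        time.sleep(0.5)

t = threading.Thread(target=controller)
t.start()
time.sleep(0.6)
state["setpoint"] = 25            # the running thread sees this change on its next loop
t.join()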
Edit: This looks like a fairly involved project, so don't expect it to go perfectly the first time you hit "run", but definitely very doable and interesting.
Here's how I would structure this project...
I see effectively four separate threads here (plus maybe a few small ancillary daemon threads for minor tasks).
I would have one thread acting as your temperature controller (PID control / whatever) that has sole control of the heater output. (other threads get to make requests to change setpoint / control mode (duty cycle / PID))
I would have one main thread (with a few daemon threads) to handle the data logging: the main thread listens for logging commands (pause, resume, get, etc.), and daemon threads poll the thermometer, rotate log files, and so on.
I am not as familiar with networking, and this will be specific to your client application, but I would probably get started with http.server just for prototyping, or maybe something like websockets and a little bit of asyncio. The main thing is that it would interact with the data logger and temperature controller threads with getters and setters rather than directly modifying values
Finally, for the keypad input, I would likely just make up a quick tkinter application to grab keypresses, because that's what I know. Again, form a request with the tkinter app, but don't modify values directly; use getters and setters when "talking" between threads. It just keeps things better organized and compartmentalized.
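A rough skeleton of that structure, using queue.Queue objects as the "getters and setters" between threads (all of the names below are made up for illustration):

import queue
import threading

temp_requests = queue.Queue()     # keypad / network threads put setpoint requests here
log_commands = queue.Queue()      # pause / resume / get commands for the logger thread

def temperature_controller():
    setpoint = 0
    while True:
        try:
            setpoint = temp_requests.get(timeout=0.5)   # new request from keypad or client
        except queue.Empty:
            pass
        # read the sensor, compare with setpoint, switch the heater on or off here

def data_logger():
    while True:
        cmd = log_commands.get()          # blocks until someone sends a command
        # handle "pause", "resume", "get <file>", rotate log files, etc.

threading.Thread(target=temperature_controller, daemon=True).start()
threading.Thread(target=data_logger, daemon=True).start()

# The keypad and network threads then call temp_requests.put(new_setpoint) or
# log_commands.put("pause") instead of touching shared variables directly.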

Query a Python 3 script to get stats about the script

I have a script that continually runs and accepts data (For those that are familiar and if it helps, it is connected to EMDR - https://eve-market-data-relay.readthedocs.org).
Inside the script I have debugging built in so that I can see how much data is currently in the queue for the threads to process; however, this only prints to the console. What I would like to do is either run the same script with an additional option, or run a totally different script, that would return the current queue count without having to enable debug.
Is there a way to do this? Could someone please point me in the direction of the documentation/libraries that I need to research?
There are many ways to solve this; two that come to mind:
You can write the queue count to a k/v store (like memcache or redis) and then have another script read that for you and perform whatever other actions are required.
You can create a specific logger for your informational output (like the queue length) and set it to log somewhere else other than the console. For example, you could use it to send you an email or log to an external service, etc. See the logging cookbook for examples.
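For the second option, a minimal sketch (the logger name and file name are just illustrations): give the stats output its own named logger with its own handler, so it is always available without turning on the console debug output:

import logging

stats_logger = logging.getLogger("emdr.stats")           # separate from the root/debug logger
stats_logger.setLevel(logging.INFO)
stats_logger.propagate = False                           # keep stats out of the console handler
stats_logger.addHandler(logging.FileHandler("queue_stats.log"))

def report_queue_size(q):
    # call this from the worker loop instead of print()
    stats_logger.info("queue size: %d", q.qsize())

Another script (or a plain tail -f) can then read queue_stats.log at any time; swapping the FileHandler for an SMTPHandler or an HTTP-based handler gives the email / external-service variants.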

How to add contextual information to log lines from multiprocessing workers?

I have a pool of worker processes (using multiprocessing.Pool) and want to log from these to a single log file. I am aware of logging servers, syslog, etc. but they all seem to require some changes to how my app is installed, monitored, logs processed etc. which I would like to avoid.
I am using CPython 2.6 on Linux.
Finally I stumbled into a solution which almost works for me. The basic idea is that you start a log listener process, set up a queue between it and the worker processes, and the workers log into the queue (using QueueHandler), and the listener then formats and serializes the log lines to a file.
This is all working so far according to the solution linked above.
But then I wanted to have the workers log some contextual information, for example a job token, for every log line. In pool.apply_async() method I can pass in the contextual info I want to be logged. Note that I am only interested in the contextual information while the worker is doing the specific job; when it is idle there should not be any contextual information if the worker wants to log something. So basically the log listener has log format specified as something like:
"%(job_token)s %(process)d %(asctime)s %(msg)"
and the workers are supposed to provide job_token as contextual info in the log record (the other format specifiers are standard).
I have looked at custom log filters. With a custom filter I can create a filter when the job starts and apply it to the root logger, but I am using 3rd party modules which create their own loggers (typically at module import time), and my custom filter is not applied to them.
Is there a way to make this work in the above setup? Or is there some alternative way to make this work (remember that I would still prefer a single log file, no separate log servers, job-specific contextual information for worker log lines)?
Filters can be applied to handlers as well as loggers - so you can just apply the filter to your QueueHandler. If this handler is attached to the root logger in your processes, then any logging by third party modules should also be handled by the handler, so you should get the context in those logged events, too.
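A sketch of what that can look like in the worker process (QueueHandler lives in logging.handlers from Python 3.2 onwards; on 2.6 the logutils backport provides it - the filter class and function names below are only illustrative):

import logging
from logutils.queue import QueueHandler   # logging.handlers.QueueHandler on Python 3.2+

class JobTokenFilter(logging.Filter):
    # Stamps every record that passes through the handler with the current job token.
    def __init__(self):
        logging.Filter.__init__(self)
        self.job_token = ""                # empty while the worker is idle

    def filter(self, record):
        record.job_token = self.job_token
        return True

def init_worker_logging(log_queue):
    handler = QueueHandler(log_queue)
    token_filter = JobTokenFilter()
    handler.addFilter(token_filter)        # filter on the handler, not on a logger
    root = logging.getLogger()
    root.addHandler(handler)               # third-party loggers propagate to root, so they get it too
    root.setLevel(logging.DEBUG)
    return token_filter

At the start of each job the worker sets token_filter.job_token to the token passed via pool.apply_async(), and clears it again when the job finishes; the listener's "%(job_token)s %(process)d %(asctime)s %(message)s" format then picks it up for every record, including those from third-party modules.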

How can I capture all of the python log records generated during the execution of a series of Celery tasks?

I want to convert my homegrown task queue system into a Celery-based task queue, but one feature I currently have is causing me some distress.
Right now, my task queue operates very coarsely; I run the job (which generates data and uploads it to another server), collect the logging using a variant on Nose's log capture library, and then I store the logging for the task as a detailed result record in the application database.
I would like to break this down as three tasks:
collect data
upload data
report results (including all logging from the preceding two tasks)
The real kicker here is the logging collection. Right now, using the log capture, I have a series of log records for each log call made during the data generation and upload process. These are required for diagnostic purposes. Given that the tasks are not even guaranteed to run in the same process, it's not clear how I would accomplish this in a Celery task queue.
My ideal solution to this problem would be a trivial and minimally invasive method of capturing all logging during the predecessor tasks (1, 2) and making it available to the reporter task (3).
Am I best off remaining fairly coarse-grained with my task definition and putting all of this work in one task? Or is there a way to pass the existing captured logging around in order to collect it at the end?
I assume you are using the logging module. You can use a separate named logger per task set to do the job. It will inherit all configuration from the parent logger.
In task.py:
import logging
from celery.task import task   # Celery's task decorator; adjust the import to your Celery version

@task
def step1(key, *args, **kwargs):
    # `key` is some unique identifier common for a piece of data in all steps of processing
    logger = logging.getLogger("myapp.tasks.processing.%s" % key)
    # ... collect the data ...
    logger.info("step1 finished")  # log something

@task
def step2(key, *args, **kwargs):
    logger = logging.getLogger("myapp.tasks.processing.%s" % key)
    # ... upload the data ...
    logger.info("step2 finished")  # log something
Here, all records are sent to the same named logger. Now, you can use two approaches to fetch those records:
Configure a file handler whose file name depends on the logger name. After the last step, just read all the info from that file. Make sure output buffering is disabled for this handler or you risk losing records.
Create a custom handler that accumulates records in memory and returns them all when asked. I'd use memcached for storage here; it's simpler than creating your own cross-process storage.
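A minimal sketch of the first approach, with the log file name derived from the same key (the paths and function names are only illustrative):

import logging

def attach_job_file_handler(key):
    # every step for this key logs to the same per-key file
    logger = logging.getLogger("myapp.tasks.processing.%s" % key)
    handler = logging.FileHandler("/tmp/processing-%s.log" % key)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return handler

def collect_job_log(key):
    # the reporting task (step 3) reads back everything the earlier steps wrote
    with open("/tmp/processing-%s.log" % key) as f:
        return f.read()

FileHandler flushes after each record, so as long as the steps for one key do not run concurrently the reporting task will see complete lines.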
It sounds like some kind of 'watcher' would be ideal. If you can watch and consume the logs as a stream, you could slurp the results as they come in. Since the watcher would be running separately, and therefore has no dependencies with respect to what it is watching, I believe this would satisfy your requirements for a non-invasive solution.
Django Sentry is a logging utility for Python (and Django), and has support for Celery.

Django and fcgi - logging question

I have a site running in Django. Frontend is lighttpd and is using fcgi to host django.
I start my fcgi processes as follows:
python2.6 /<snip>/manage.py runfcgi maxrequests=10 host=127.0.0.1 port=8000 pidfile=django.pid
For logging, I have a RotatingFileHandler defined as follows:
file_handler = RotatingFileHandler(filename, maxBytes=10*1024*1024, backupCount=5, encoding='utf-8')
The logging is working. However, it looks like the files are rotating before they even get up to 10 KB, let alone 10 MB. My guess is that each fcgi instance is only handling 10 requests and then re-spawning. Each respawn of fcgi creates a new file. I have confirmed that fcgi is starting up under a new process id every so often (hard to tell exactly how often, but under a minute).
Is there any way to get around this issue? I would like all fcgi instances to log to one file until it reaches the size limit, at which point a log file rotation would take place.
As Alex stated, logging is thread-safe, but the standard handlers cannot be safely used to log from multiple processes into a single file.
ConcurrentLogHandler uses file locking to allow for logging from within multiple processes.
In your shoes I'd switch to a TimedRotatingFileHandler -- I'm surprised that the size-based rotating file handler is giving this problem (as it should be impervious to which processes are producing the log entries), but the timed version (though not controlled on exactly the parameter you prefer) should solve it. Or, write your own, more solid, rotating file handler (you can take a lot from the standard library sources) that ensures varying processes are not a problem (as they should never be).
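For reference, the timed variant would be configured along these lines (the rotation schedule here is only an example):

from logging.handlers import TimedRotatingFileHandler

# rotate once a day at midnight and keep a week of backups
file_handler = TimedRotatingFileHandler(filename, when='midnight',
                                        backupCount=7, encoding='utf-8')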
As you appear to be using the default file opening mode of append ("a") rather than write ("w"), if a process re-spawns it should append to the existing file, then rollover when the size limit is reached. So I am not sure that what you are seeing is caused by re-spawning CGI processes. (This of course assumes that the filename remains the same when the process re-spawns).
Although the logging package is thread-safe, it does not handle concurrent access to the same file from multiple processes - because there is no standard way to do it in the stdlib. My normal advice is to set up a separate daemon process which implements a socket server and logs events received across it to file - the other processes then just implement a SocketHandler to communicate with the logging daemon. Then all events will get serialised to disk properly. The Python documentation contains a working socket server which could serve as a basis for this need.
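For reference, a condensed sketch of such a daemon, based on the socket server example in the logging cookbook (this version is written for Python 3 - use the SocketServer module on 2.x - and the file name, sizes and format are placeholders). Each fcgi process attaches a logging.handlers.SocketHandler, and this single process owns the RotatingFileHandler:

import logging
import logging.handlers
import pickle
import socketserver
import struct

class LogRecordStreamHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # each record arrives as a 4-byte big-endian length followed by a pickled dict
        while True:
            chunk = self.connection.recv(4)
            if len(chunk) < 4:
                break
            slen = struct.unpack(">L", chunk)[0]
            data = self.connection.recv(slen)
            while len(data) < slen:
                data += self.connection.recv(slen - len(data))
            record = logging.makeLogRecord(pickle.loads(data))
            logging.getLogger(record.name).handle(record)

if __name__ == "__main__":
    # the one place that owns the file: rotation happens here and nowhere else
    handler = logging.handlers.RotatingFileHandler(
        "app.log", maxBytes=10 * 1024 * 1024, backupCount=5, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(process)d %(name)s %(message)s"))
    logging.getLogger().addHandler(handler)

    server = socketserver.ThreadingTCPServer(
        ("localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT), LogRecordStreamHandler)
    server.serve_forever()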
