Python multiprocessing logging - multiprocessing.Pool creates duplicate log entries

I am doing an analysis where we process a lot of different files using multiprocessing.Pool in order to speed up the process:

with multiprocessing.Pool(processes=num_cores) as p:
    output_mp = p.map(clean_file, doc_mp_list_xlsx)

where clean_file is the function that reads and cleans a file and doc_mp_list_xlsx is a list of the complete paths.
Logging is done with the logging module, configured so that it tracks which worker process emitted each message:
import logging  # for logging
import multiprocessing  # for getting cpu count
from utils import read_config  # for reading config

config = read_config.config
num_cores = multiprocessing.cpu_count()

# CONFIGURE LOGGING FOR EACH POOL
for process in range(1, num_cores + 1):
    handler = logging.FileHandler(config['logging_path'] +
                                  '\\' +
                                  str(process) + '.log')
    handler.setFormatter(logging.Formatter(
        '%(asctime)s*|*%(levelname)s*|*%(message)s*|*%(module)s*|*%(funcName)s*|*%(lineno)d*|*%(name)s'))
    logger = logging.getLogger("SpawnPoolWorker-" + str(process))
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)


def getLogger(name):
    """
    Function to return a logger for the current process

    Arguments:
        name: ignored argument for backwards compatibility

    Returns:
        logger: logger for current process
    """
    return logging.getLogger(str(multiprocessing.current_process().name))
And this works. However, the log contains duplicate entries:
date and time           | type | message                       | module    | function  | line | worker
2022-12-05 16:42:31,199 | INFO | Beginning to clean file x.pdf | clean_pdf | clean_pdf | 22   | SpawnPoolWorker-3
2022-12-05 16:42:30,400 | INFO | Beginning to clean file x.pdf | clean_pdf | clean_pdf | 22   | SpawnPoolWorker-4
I do not understand why this happens. It also does not produce multiple output files; it just seems to be the same message logged twice (but with a different worker attached to it).
Does anybody have a clue why this happens? Is the logging configuration for multiprocessing incorrect? Or is it the result of something else?
Thanks in advance.

I think the problem comes from a misunderstanding of how logging works with multiprocessing, and in general.
Each process has its own Python, which includes its own logging. In the main process, you configured it. But not in the others: their own logging never got configured. What you did instead was configure the main process's logging many times. Each time you provided a different name to a new handler, so that when a message was received in the main process it was handled many times, writing different names. But a log message emitted in the other processes did not get handled at all.
And you sometimes use logging.getLogger("SpawnPoolWorker-" + str(process)), other times logging.getLogger(str(multiprocessing.current_process().name)), so you may not even be using the same logger objects.
I am not surprised it does not work. Multi-process code is harder than usually expected.
What you need to do is to include the setup of the logging in each process (for example at the start of clean_file) instead, in a multiprocess-compatible way.
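As a rough sketch of that idea (using Pool's initializer/initargs to configure logging once in every worker; the log directory and file naming here are placeholders, not your config-driven paths):

import logging
import multiprocessing


def init_worker_logging(log_dir):
    # runs once in each worker process: give that worker its own file handler
    worker_name = multiprocessing.current_process().name
    handler = logging.FileHandler(log_dir + '/' + worker_name + '.log')
    handler.setFormatter(logging.Formatter(
        '%(asctime)s*|*%(levelname)s*|*%(message)s*|*%(module)s*|*%(funcName)s*|*%(lineno)d*|*%(name)s'))
    logger = logging.getLogger(worker_name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)


def clean_file(path):
    # the worker looks up its own, already configured, logger
    logging.getLogger(multiprocessing.current_process().name).info("Beginning to clean file %s", path)
    return path


if __name__ == "__main__":
    num_cores = multiprocessing.cpu_count()
    with multiprocessing.Pool(processes=num_cores,
                              initializer=init_worker_logging,
                              initargs=(".",)) as p:
        output_mp = p.map(clean_file, ["x.pdf", "y.pdf"])

With this, each worker writes to its own file and every record is handled exactly once, in the process that emitted it.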
The code I used to reproduce the problem (it does not include the code for the solution):
import logging
import multiprocessing

num_cores = multiprocessing.cpu_count()

for process in range(1, num_cores + 1):
    handler = logging.FileHandler('log.log')  # modified the file path
    handler.setFormatter(logging.Formatter(
        '%(asctime)s*|*%(levelname)s*|*%(message)s*|*%(module)s*|*%(funcName)s*|*%(lineno)d*|*%(name)s'))
    logger = logging.getLogger("SpawnPoolWorker-" + str(process))
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)


def getLogger():
    """
    Function to return a logger for the current process

    Arguments:
        name: ignored argument for backwards compatibility

    Returns:
        logger: logger for current process
    """
    return logging.getLogger(str(multiprocessing.current_process().name))


def clean_file(something):  # FAKE
    getLogger().debug(f"calling clean_file with {something=!r}")


doc_mp_list_xlsx = [1, 2, 3, 4, 5]  # FAKE

with multiprocessing.Pool(processes=num_cores) as p:
    output_mp = p.map(clean_file, doc_mp_list_xlsx)

Related

Prevent one of the logging handlers from handling specific messages

I monitor my script with the logging module of the Python standard library, and I send the log records both to the console with a StreamHandler and to a file with a FileHandler.
I would like to have the option to disable a handler for a LogRecord independently of its severity. For example, for a specific LogRecord I would like the option not to send it to the file destination or to the console (by passing a parameter).
I have found that the library has the Filter class for that purpose (described as a finer-grained way to filter records), but I haven't figured out how to do it.
Any ideas how to do this in a consistent way?
Finally, it is quite easy. I used a function as a Handler.filter, as suggested in the comments.
This is a working example:
from pathlib import Path
import logging
from logging import LogRecord


def build_handler_filters(handler: str):
    def handler_filter(record: LogRecord):
        if hasattr(record, 'block'):
            if record.block == handler:
                return False
        return True
    return handler_filter


ch = logging.StreamHandler()
ch.addFilter(build_handler_filters('console'))
fh = logging.FileHandler(Path('/tmp/test.log'))
fh.addFilter(build_handler_filters('file'))

mylogger = logging.getLogger(__name__)
mylogger.setLevel(logging.DEBUG)
mylogger.addHandler(ch)
mylogger.addHandler(fh)
When the logger is called, the message is sent to both the console and the file, i.e.
mylogger.info('msg')
To block, for example, the file handler, the logger should be called with the extra argument, like this:
mylogger.info('msg only to console', extra={'block': 'file'})
Disabling the console is analogous.

Create log file named after filename of caller script

I have a logger.py file which initialises logging.
import logging

logger = logging.getLogger(__name__)


def logger_init():
    import os
    import inspect
    global logger

    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    logger.addHandler(ch)
    fh = logging.FileHandler(os.getcwd() + os.path.basename(__file__) + ".log")
    fh.setLevel(level=logging.DEBUG)
    logger.addHandler(fh)
    return None


logger_init()
I have another script caller.py that calls the logger.
from logger import *
logger.info("test log")
What happens is that a log file called logger.log is created, containing the logged messages.
What I want is for this log file to be named after the caller script's filename. So, in this case, the created log file should be named caller.log instead.
I am using Python 3.7.
It is immensely helpful to consolidate logging to one location. I learned this the hard way. It is easier to debug when events are sorted by time and it is thread-safe to log to the same file. There are solutions for multiprocessing logging.
The log format can, then, contain the module name, function name and even line number from where the log call was made. This is invaluable. You can find a list of attributes you can include automatically in a log message here.
Example format:
format='[%(asctime)s] [%(module)s.%(funcName)s] [%(levelname)s] %(message)s'
Example log message
[2019-04-03 12:29:48,351] [caller.work_func] [INFO] Completed task 1.
You can get the filename of the main script from the first item in sys.argv, but if you want to get the caller module not the main script, check the answers on this question.
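If naming the log file after the main script is all that is needed, a minimal sketch along those lines (the helper name and log location are illustrative, not from the original code) could be:

import logging
import sys
from pathlib import Path


def init_file_logging(log_dir='.'):
    # name the log file after the script that was launched, e.g. caller.py -> caller.log
    script_name = Path(sys.argv[0]).stem or 'default'
    handler = logging.FileHandler(Path(log_dir) / (script_name + '.log'))
    handler.setFormatter(logging.Formatter(
        '[%(asctime)s] [%(module)s.%(funcName)s] [%(levelname)s] %(message)s'))
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    root.addHandler(handler)
    return root

Calling init_file_logging() from caller.py then writes to caller.log, while the log format still records which module and function emitted each message.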

Why does Python logging RotatingFileHandler lose records when used in multiple processes?

Recently I've realized that my application generates fewer log records than I expected. After some experiments I've found that the problem is in RotatingFileHandler and multiprocessing.
import logging
from logging import handlers
from multiprocessing import Pool
import os

log_file_name = 'log.txt'


def make_logger():
    logger = logging.getLogger('my_logger')
    logger.setLevel(logging.INFO)
    current_handler_names = {handler.name for handler in logger.handlers}
    handler_name = 'my_handler'
    if handler_name in current_handler_names:
        return logger
    handler = handlers.RotatingFileHandler(
        log_file_name, maxBytes=10 * 2 ** 10, backupCount=0)
    handler.setLevel(logging.INFO)
    handler.set_name(handler_name)
    logger.addHandler(handler)
    return logger


def f(x):
    logger = make_logger()
    logger.info('hey %s' % x)


if os.path.exists(log_file_name):
    os.unlink(log_file_name)

p = Pool(processes=30)
N = 1000
p.map(f, range(N))

with open(log_file_name, 'r') as f:
    print('expected: %s, real: %s' % (N, f.read().count('hey')))
Output:
$ python main.py
expected: 1000, real: 943
What did I do wrong?
As it is well explained in the logging documentation,
Although logging is thread-safe, and logging to a single file from multiple threads in a single process is supported, logging to a single file from multiple processes is not supported
In a few words, RotatingFileHandler simply closes and deletes the file from one process, then opens a new file. The other processes don't know about the new file descriptor and keep writing to the one that has been closed. Only the process that managed to rotate the file first continues logging properly.
In my answer to the similar question I proposed using the logrotate daemon to rotate the files outside of these processes. It does not close the file descriptor, but just truncates the file. Thus the file remains the same and the other processes can continue logging.
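If everything has to stay inside Python, another common pattern (a sketch only, not part of the original answer) is to let a single process own the file: workers push records onto a queue with QueueHandler, and a QueueListener in the main process forwards them to one RotatingFileHandler.

import logging
import logging.handlers
import multiprocessing


def worker_init(queue):
    # runs in every worker: route all records to the shared queue
    root = logging.getLogger()
    root.setLevel(logging.INFO)
    root.addHandler(logging.handlers.QueueHandler(queue))


def f(x):
    logging.getLogger('my_logger').info('hey %s', x)


if __name__ == '__main__':
    manager = multiprocessing.Manager()
    queue = manager.Queue(-1)
    file_handler = logging.handlers.RotatingFileHandler(
        'log.txt', maxBytes=10 * 2 ** 10, backupCount=3)
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    with multiprocessing.Pool(processes=30, initializer=worker_init,
                              initargs=(queue,)) as p:
        p.map(f, range(1000))
    listener.stop()  # flush any queued records before the program exits

Because only the listener in the main process writes and rotates the file, no records are lost to a stale file descriptor.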

Python: flush logging only at end of script run

Currently I use a custom logging system for logging, which works as follows:
I have a Log class that resembles the following:
class Log:
    def __init__(self):
        self.script = ""
        self.datetime = datetime.datetime.now().replace(second=0, microsecond=0)
        self.mssg = ""
        self.mssg_detail = ""
        self.err = ""
        self.err_detail = ""
I created a function decorator that performs a try/except around the function call, and adds a message either to .mssg or to .err of the Log object accordingly.
def logging(fun):
    # functools.wraps(fun)
    def inner(self, *args):
        try:
            f = fun(self, *args)
            self.logger.mssg += fun.__name__ + " :ok, "
            return f
        except Exception as e:
            self.logger.err += fun.__name__ + ": error: " + str(e.args)
    return inner
So usually a script is a class composed of multiple methods that are run sequentially.
I hence run those methods (decorated as mentioned above), and lastly I upload the Log object into a MySQL db.
This works quite fine. But now I want to modify those items so that they integrate with the "official" logging module of Python.
What I don't like about that module is that it is not possible to "save" the messages onto one log object in order to upload/save the log only at the end of the run. Rather, each logging call will write/send the message to a file etc., which sometimes creates performance issues. I could use handlers.MemoryHandler, but it still doesn't seem to behave like my original system: it is said to collect messages and flush them to another handler periodically, which is not what I want: I want to collect the messages in memory and flush them on request with an explicit function.
Anyone have any suggestions?
Here is my idea. Use a handler to capture the log in a StringIO. Then you can grab the StringIO whenever you want. Since there was perhaps some confusion in the discussion thread - StringIO is a "file-like" interface for strings, there isn't ever an actual file involved.
import logging
import io


def initialize_logging(log_level, log_name='default_logname'):
    logger = logging.getLogger(log_name)
    logger.setLevel(log_level)
    log_stream = io.StringIO()
    if not logger.handlers:
        ch = logging.StreamHandler(log_stream)
        ch.setLevel(log_level)
        ch.setFormatter(logging.Formatter(
            '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        ))
        logger.addHandler(ch)
    logger.propagate = 0
    return logger, log_stream
And then something like:
>>> logger, log_stream = initialize_logging(logging.INFO, "logname")
>>> logger.warning("Hello World!")
And when you want the log information:
>>> log_stream.getvalue()
'2017-05-16 16:35:03,501 - logname - WARNING - Hello World!\n'
At program start (in the main), you can:
instantiate your custom logger => global variable/singleton,
register a function at program end which will flush your logger,
run your decorated functions.
To register such a function you can use atexit.register. See the Exit handlers page in the docs.
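A minimal sketch of that idea, assuming a MemoryHandler buffering in front of a FileHandler and an explicit flush registered with atexit (the file name and logger name are illustrative):

import atexit
import logging
import logging.handlers

# buffer records in memory; they are only written when the handler flushes
# (capacity reached, a record at flushLevel, or close())
target = logging.FileHandler('run.log')
memory_handler = logging.handlers.MemoryHandler(
    capacity=10000, flushLevel=logging.CRITICAL, target=target)

logger = logging.getLogger('my_script')
logger.setLevel(logging.DEBUG)
logger.addHandler(memory_handler)

# flush (and close) the buffered records once, at the very end of the run
atexit.register(memory_handler.close)

logger.info("this stays in memory until the program exits")

Calling memory_handler.flush() yourself gives the explicit, on-request flush the question asks for.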
EDIT
The idea above can be simplified.
To delay the logging, you can use the standard MemoryHandler handler, described in the page logging.handlers — Logging handlers
Take a look at this GitHub project: https://github.com/tantale/python-ini-cfg-demo
And replace the INI file by this:
[formatters]
keys=default
[formatter_default]
format=%(asctime)s:%(levelname)s:%(message)s
class=logging.Formatter
[handlers]
keys=console, alternate
[handler_console]
class=handlers.MemoryHandler
formatter=default
args=(1024, INFO)
target=alternate
[handler_alternate]
class=StreamHandler
formatter=default
args=()
[loggers]
keys=root
[logger_root]
level=DEBUG
formatter=default
handlers=console
To log to a database table, just replace the alternate handler with your own database handler.
There are some blog posts/SO questions about that:
You can look at Logging Exceptions To Your SQLAlchemy Database to create a SQLAlchemyHandler
See Store Django log to database if you are using DJango.
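As a rough illustration of what such a database handler could look like (a sketch only; how the rows are actually persisted depends on your ORM, and save_rows here is a placeholder callable):

import logging


class BufferedDatabaseHandler(logging.Handler):
    """Collect formatted records in memory and hand them to the database in one batch."""

    def __init__(self, save_rows):
        super().__init__()
        self.save_rows = save_rows  # callable that persists a list of dicts (e.g. via SQLAlchemy)
        self.rows = []

    def emit(self, record):
        self.rows.append({
            'created': record.created,
            'level': record.levelname,
            'message': self.format(record),
        })

    def flush_to_db(self):
        if self.rows:
            self.save_rows(self.rows)
            self.rows = []

Attach it to a logger, run the script, then call flush_to_db() at the end (or register it with atexit) to upload everything in one go.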
EDIT2
Note: ORMs generally support "eager loading", for instance with SQLAlchemy.

Python BasicConfig Logging does not change logfile

I wrote a small function to log events to a file. This Python script is imported in the main script. The main script runs as a daemon (actually it is polling a database).
MainScript.py:
import logger
logger.logmessage(module = module, message = "SomeMessage")
logger.py:
def logmessage(message, module, level='INFO'):
    today = str(datetime.date.today())
    logFile = '/path/to/log/myapplog.' + today + '.log'
    logging.basicConfig(format='%(asctime)s - %(levelname)s - ' + module + ' - %(message)s',
                        level=logging.INFO, filename=logFile)
    if level == "INFO":
        logging.info(message)
    elif level == "WARNING":
        logging.warning(message)
    elif level == "CRITICAL":
        logging.critical(message)
My intention: get logfiles like myapplog.2014-01-23.log, myapplog.2014-01-24.log, ...
My problem: the logfile stays the same. It constantly logs to myapplog.2014-01-23.log, and only after a restart of the daemon is the proper log with the correct date created and used.
It sounds like you need to use TimedRotatingFileHandler as documented here.
Also, you shouldn't call basicConfig() more than once (I presume you're calling logmessage more than once). As documented, basicConfig() won't do anything except set up a basic configuration if there is none (so only the first call does anything - subsequent calls find there is a configuration, so don't do anything).
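A minimal sketch of that suggestion, configured once at daemon start (file path and logger name are placeholders):

import logging
from logging.handlers import TimedRotatingFileHandler

# configure once, at daemon start-up
handler = TimedRotatingFileHandler('/path/to/log/myapplog.log',
                                   when='midnight', backupCount=30)
# rotated files are renamed with a date suffix, e.g. myapplog.log.2014-01-23
handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(module)s - %(message)s'))

logger = logging.getLogger('myapp')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("SomeMessage")  # keeps going to the current file, rolling over at midnight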
