I wish to start multiple instances (processes) of a Python program, and I want each of them to write to its own log file.
The processes will be restarted at least once daily.
So I arrived at the following code.
logHandler = TimedRotatingFileHandler(os.path.join(os.path.dirname(sys.argv[0]),'logs/LogFile_'+str(os.getpid())+'.log'),when="midnight", backupCount=7)
Will this code maintain 7 backups for each PID?
Is there a better way to split this so that my disk does not fill up with useless files, given that the PID is likely to be unique for each process over the months?
Is there a better approach to doing this?
What I would ideally like is for only one week's worth of logs to be maintained. Can this be done using TimedRotatingFileHandler without having to write a separate purge/delete script?
Yes, this will maintain 7 backups, or a week's worth of logs, for each unique log path.
Rotating file handlers are the correct way to put a limit on logs.
As I said, rotating file handlers are the correct approach. I suppose you could use a RotatingFileHandler, but that rotates when the log hits a certain size rather than at a particular time, so it doesn't let you specify a week's worth of logs.
I'm a bit confused by how you're keeping the PID for a given process constant, given that the 'processes will be restarted at least once daily'. A stronger way to guarantee that each process has a unique log path is to provide it explicitly as an argument, e.g. python script --log-file="$(pwd)/logs/LogFileProcX.log"
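For illustration, a minimal sketch of that suggestion (the logger name 'worker' and the log format are arbitrary choices, not anything prescribed):

import argparse
import logging
from logging.handlers import TimedRotatingFileHandler

# Each process is started with its own --log-file argument,
# e.g. python script.py --log-file logs/LogFileProcX.log
parser = argparse.ArgumentParser()
parser.add_argument('--log-file', required=True)
args = parser.parse_args()

# Rotate at midnight and keep 7 backups, i.e. one week's worth of logs.
handler = TimedRotatingFileHandler(args.log_file, when="midnight", backupCount=7)
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))

logger = logging.getLogger('worker')  # 'worker' is just an example logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info('process started')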
I need to get several processes to communicate with each other on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are in Python or Swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3; however, I couldn't find an answer in the documentation on whether it is safe to write concurrently across processes. This question is old and not very definitive, and I would ideally like a more authoritative answer.
I thought of using multiprocessing, as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am fine having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long just to resolve conflicting access.
Along the same lines, I thought of having a process that stays up all day long, receives messages from all other processes, and is solely responsible for writing to the file. That also feels a little wasteful; how worried should I be about the resource cost of having a program up all day doing very little? Are memory footprint and CPU usage (as shown in Activity Monitor) the only things to worry about, or could there be other significant costs, e.g. context switching?
I have come across flock on Linux, which seems to be an OS-provided locking mechanism for file access. This seems like a good solution, but it does not seem to exist on macOS?
Any idea for solving this requirement in a robust manner (so that I don't have to debug every other day; I know concurrency can be a pain) is most welcome!
As long as you are in control of the source code of all such processes, you can use flock. It puts an advisory lock on the file, so another writer is blocked only if it also accesses the file the same way. That is fine in your case, as long as only your own processes ever need to write to the shared file.
I've tested flock on Big Sur; it is still implemented and works fine.
You can also do it in any other common manner: create a temporary .lock file in a known location (this is what Git does) and remove it once the current writer is done with the main file; use semaphores; etc.
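To illustrate, here is a minimal sketch of the flock approach in Python; the function name and file path are placeholders, not anything prescribed:

import fcntl

def append_line(path, line):
    # Advisory lock: it only protects against other processes that also take
    # the lock before writing, which is the situation described above.
    with open(path, 'a') as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            f.write(line + '\n')
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

Every writer process would call something like append_line; a process that opens the file without taking the lock is not blocked, which is why all writers must cooperate.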
I've been using WatchedFileHandler as my Python logging file handler so that I can rotate my logs with logrotate (on Ubuntu 14.04), which is what the docs say it's for. My logrotate config file looks like:
/path_to_logs/*.log {
    daily
    rotate 365
    size 10M
    compress
    delaycompress
    missingok
    notifempty
    su root root
}
Everything seemed to be working just fine. I'm using logstash to ship my logs to my elasticsearch cluster, and everything is great. I added a second log file for my debug logs, which gets rotated but is not watched by logstash. I noticed that when that file is rotated, Python just keeps writing to /path_to_debug_logs/*.log.1 and never starts writing to the new file. If I manually tail /path_to_debug_logs/*.log.1, it switches over instantly and starts writing to /path_to_debug_logs/*.log.
This seems REALLY weird to me.
I believe what is happening is that logstash is always tailing my non-debug logs, which somehow triggers the switch-over to the new file after logrotate is called. If logrotate is called twice without a switch-over, the log.1 file gets moved and compressed to log.2.gz, which Python can no longer log to, and logs are lost.
Clearly there are a bunch of hacky solutions to this (such as a cronjob that tails all my logs every now and then), but I feel like I must be doing something wrong.
I'm using WatchedFileHandler and logrotate instead of RotatingFileHandler for a number of reasons, but mainly because it will nicely compress my logs for me after rotation.
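For context, the handler setup described here is presumably something along these lines (the path, logger name, and format are illustrative, not taken from the question):

import logging
from logging.handlers import WatchedFileHandler

# WatchedFileHandler is meant to reopen the file if logrotate moves it away.
handler = WatchedFileHandler('/path_to_logs/app.log')  # illustrative path
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
logger = logging.getLogger('app')
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)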
UPDATE:
I tried the horrible hack of adding a manual tail to the end of my log rotation config:
sharedscripts
postrotate
    /usr/bin/tail -n 1 path_to_logs/*.log.1
endscript
Sure enough, this works most of the time, but it randomly fails sometimes for no clear reason, so it isn't a solution. I've also tried a number of less hacky solutions where I modified the way WatchedFileHandler checks whether the file has changed, but no luck.
I'm fairly sure the root of my problem is that the logs are stored on a network drive, which is somehow confusing the file system.
I'm moving my rotation to Python with RotatingFileHandler, but if anyone knows the proper way to handle this, I'd love to know.
Use the copytruncate option of logrotate. From the docs:
copytruncate
Truncate the original log file in place after creating a copy, instead of moving the old log file and optionally creating a new one. It can be used when some program cannot be told to close its logfile and thus might continue writing (appending) to the previous log file forever. Note that there is a very small time slice between copying the file and truncating it, so some logging data might be lost. When this option is used, the create option will have no effect, as the old log file stays in place.
WatchedFileHandler reopens the log file when a device and/or inode change is detected just before writing to it. Perhaps the file which isn't being watched by logstash doesn't see a change in its device/inode? That would explain why the handler keeps on writing to it.
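Roughly, the check it performs before each write looks like this (a simplified sketch for illustration, not the actual library code):

import os

def file_was_rotated(path, opened_dev, opened_ino):
    # Compare the device/inode the path points to now with the values recorded
    # when the stream was opened; a mismatch means the file was moved/replaced.
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return True
    return (st.st_dev, st.st_ino) != (opened_dev, opened_ino)

If that stat result is stale or cached (which can happen on some network filesystems), the handler would not notice the rotation, which fits the network-drive theory above.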
In my program a user uploads a csv file.
While the file is uploading & being processed by my app, I'd like to show a progress bar.
The problem is that this process isn't entirely under my control (I can't really tell how long it'll take for the file to finish loading & be processed, as this depends on the file content and the size).
What would be the correct approach for doing this? It's not like I have many steps where I could increment the progress bar every time a step happens. It's basically waiting for a file to be loaded; I cannot determine the time for that!
Is this even possible?
Thanks in advance
You don't give much detail, so I'll explain what I think is happening and give some suggestions from my thought process.
You have some kind of app that has some kind of function/process that is a black-box (i.e. you can't see inside it or change it); this black-box uploads a csv file to some server and returns control back to your app when it's done. Since you can't see inside the black-box, you can't determine how much it has uploaded and thus can't create an accurate progress bar.
Named Pipes:
If you're passing only the filename of the csv to the black-box, you might be able to create a named pipe (depending on your situation). Since writes to a named pipe block once the buffer is full, until the receiver reads from it, you can keep track of how much has been written and thus create an accurate progress bar.
So you would create a named pipe, pass the black-box its filename, and then read from the csv and write to the named pipe. How far you've read in is your progress.
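A rough sketch of that idea; upload_csv is a purely hypothetical stand-in for the black-box, which is assumed to accept a filename and to be started before the pipe is written to (otherwise the open blocks forever):

import os

def feed_named_pipe(csv_path, fifo_path, on_progress):
    # Copy the real csv into the pipe, reporting the fraction written so far.
    total = os.path.getsize(csv_path) or 1
    written = 0
    # Opening a FIFO for writing blocks until the reader has opened it.
    with open(csv_path, 'rb') as src, open(fifo_path, 'wb') as pipe:
        while True:
            chunk = src.read(64 * 1024)
            if not chunk:
                break
            pipe.write(chunk)
            written += len(chunk)
            on_progress(written / total)

# Hypothetical usage (upload_csv is the black-box; it must read fifo_path):
#   fifo_path = '/tmp/upload_fifo'
#   os.mkfifo(fifo_path)
#   threading.Thread(target=upload_csv, args=(fifo_path,)).start()
#   feed_named_pipe('data.csv', fifo_path, lambda frac: print(f'{frac:.0%}'))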
More Pythonic:
Since you tagged Python, if you're passing the csv as a file-like object, this activestate recipe could help.
Same kind of idea, just in Python.
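In the spirit of that recipe (this is a sketch, not the recipe itself; the class name and callback are made up), a file-like wrapper can count the bytes the black-box reads:

import os

class ProgressFile:
    # File-like wrapper that reports how much of the underlying file has been read.
    def __init__(self, path, callback):
        self._file = open(path, 'rb')
        self._size = os.path.getsize(path) or 1  # avoid dividing by zero on empty files
        self._read = 0
        self._callback = callback

    def read(self, size=-1):
        data = self._file.read(size)
        self._read += len(data)
        self._callback(self._read / self._size)
        return data

    def __getattr__(self, name):
        # Delegate everything else (seek, close, ...) to the real file object.
        return getattr(self._file, name)

You would then hand ProgressFile('data.csv', callback) to the black-box instead of a plain file object, assuming it accepts a file-like object.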
Conclusion: These are two possible solutions. I'm getting tired, and there may be many more - but I can't help more since you haven't given us much to work with.
To answer your question at an abstract level: you can't make accurate progress bars for black-box functions, after all they could have a sleep(random()) call in them for all you know.
There are implementation-specific ways around this; the two ideas above are examples. The common idea is that you make the black-box take a stream instead, and count the bytes as you pass them through.
Alternatively, you can guess/approximate: a rough calculation of how many bytes are going in, combined with a (previously measured) average speed per byte, would give you some indication of when it will complete. You could even save how long each run took, so that the estimate automatically gets better with each run.
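A minimal sketch of that last idea; the history file name is arbitrary:

import json
import os

HISTORY = 'upload_rates.json'  # arbitrary location for the saved per-run rates

def estimate_seconds(num_bytes):
    # Average bytes-per-second over previous runs gives a rough time estimate.
    if not os.path.exists(HISTORY):
        return None
    with open(HISTORY) as f:
        rates = json.load(f)
    return num_bytes / (sum(rates) / len(rates)) if rates else None

def record_run(num_bytes, seconds):
    rates = []
    if os.path.exists(HISTORY):
        with open(HISTORY) as f:
            rates = json.load(f)
    rates.append(num_bytes / seconds)
    with open(HISTORY, 'w') as f:
        json.dump(rates, f)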
I know the Python logging library allows you to do 'circular' logging over multiple log files. What I'm trying to do is simply have one file, foo.log, that is always <= B bytes in size; if the next append is going to put it over B, then things are deleted off the top. I'd be just as happy to specify the max in terms of events, as well.
So, if this were the file rotation scheme, and item #4 exceeded B, you'd have:
foo.log.1    foo.log.2
---------    ---------
African      Swallow
or
European
I'd like to simply wind up with:
foo.log
-------
or
European
Swallow
EDIT: Based on the comments below, people have legitimately noted that this is a less-than-optimal format. The motivation comes from debugging. I have scripts using psycopg2 to execute queries on a remote server that's stuck in roughly 2002, with no internet connection. Having them log everything they send to the db and then checking that log is the fastest way to see where something went wrong, and if I have to point someone else at it, I don't want to introduce the complication of having them figure out which is the current log file. The current solution is just to write the log and delete it if it gets too big.
As Martijn notes, this type of log would be complicated to manage, and maybe inefficient (though this may or may not concern you).
A simple way to solve some part of the inefficiency is to use fixed record lengths. I.e. make each log entry the same (max) length.
Another way is to make your log database-based and simply create a record (fixed-length or not) for each log entry, letting the database manager handle the adjustments. There are options ranging from simple RAM-based databases to real disk-based ones, all of which you can access from Python (a minimal sketch follows below).
Yet another solution, if you're happy with a memory-based log, is to look into FIFO files.
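As one illustration of the database-based option (using sqlite3; the file name, table layout, and event cap are all arbitrary):

import sqlite3
import time

MAX_EVENTS = 1000  # arbitrary cap, analogous to the byte limit B in the question

con = sqlite3.connect('foo_log.db')
con.execute('CREATE TABLE IF NOT EXISTS log (ts REAL, message TEXT)')

def log_event(message):
    con.execute('INSERT INTO log VALUES (?, ?)', (time.time(), message))
    # Keep only the newest MAX_EVENTS rows, i.e. delete entries off the top.
    con.execute('DELETE FROM log WHERE rowid NOT IN '
                '(SELECT rowid FROM log ORDER BY rowid DESC LIMIT ?)',
                (MAX_EVENTS,))
    con.commit()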
You could log to a file, with a counter/time at the start of each line.
When you get to a certain point, just start updating from the top of the file again.
# Open for reading and writing without truncating, then jump back to the start.
thefile = open('somebinfile', 'r+b')
thefile.seek(0)
Things to consider: when you seek to the top and write to the file, you might only partially overwrite the next line; to account for that, you would need a unique line-ending character/string.
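A rough sketch of that wrap-around idea; the size cap and the marker string used as the unique line ending are made up for illustration:

import os

MAX_BYTES = 1024 * 1024   # the size cap B
MARKER = b'<<END>>\n'     # made-up unique line ending, to spot partially overwritten lines

def open_circular(path):
    # 'r+b' allows overwriting in place; create the file first if it doesn't exist.
    if not os.path.exists(path):
        open(path, 'wb').close()
    return open(path, 'r+b')

def write_event(f, message):
    line = message.encode() + MARKER
    if f.tell() + len(line) > MAX_BYTES:
        f.seek(0)             # wrap back to the top and overwrite the oldest entries
    f.write(line)
    f.flush()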
I'm trying to write a Python script which will monitor an rsync transfer, and provide a (rough) estimate of percentage progress. For my first attempt, I looked at an rsync --progress command and saw that it prints messages such as:
1614 100% 1.54MB/s 0:00:00 (xfer#5, to-check=4/10)
I wrote a parser for such messages and used the to-check part to produce a percentage progress; here, this would be 60% complete.
However, there are two flaws in this:
In large transfers, the "numerator" of the to-check fraction doesn't seem to monotonically decrease, so the percentage completeness can jump backwards.
Such a message is not printed for all files, meaning that the progress can jump forwards.
I've had a look at alternative messages to use, but haven't managed to find anything suitable. Does anyone have any ideas?
Thanks in advance!
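For reference, parsing the to-check field as described above might look like the following sketch (with the same caveats about jumping backwards and forwards that the question already notes; the function name is arbitrary):

import re

def parse_progress(line):
    # Lines look like: "1614 100%  1.54MB/s  0:00:00 (xfer#5, to-check=4/10)"
    match = re.search(r'to-check=(\d+)/(\d+)', line)
    if match is None:
        return None
    remaining, total = int(match.group(1)), int(match.group(2))
    return 100.0 * (total - remaining) / total  # 4 of 10 remaining -> 60% complete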
The current version of rsync (3.1.2 at the time of editing) has an option --info=progress2 which shows the progress of the entire transfer instead of individual files.
From the man page:
There is also a --info=progress2 option that outputs statistics based on the whole transfer, rather than individual files. Use this flag without outputting a filename (e.g. avoid -v or specify --info=name0) if you want to see how the transfer is doing without scrolling the screen with a lot of names. (You don't need to specify the --progress option in order to use --info=progress2.)
So, if possible on your system you could upgrade rsync to a current version which contains that option.
You can disable the incremental recursion with the argument --no-inc-recursive. rsync will do a pre-scan of the entire directory structure, so it knows the total number of files it has to check.
This is actually the old way it recursed. Incremental recursion, the current default, was added for speed.
Note the caveat that even --info=progress2 is not entirely reliable, since this percentage is based on the number of files rsync knows about at the time the progress is displayed. This is not necessarily the total number of files that need to be synced (for instance, if it discovers a large number of large files in a deeply nested directory).
One way to ensure that --info=progress2 doesn't jump back in the progress indication is to force rsync to scan all the directories recursively before starting the sync (instead of its default behavior of doing an incrementally recursive scan), by also providing the --no-inc-recursive option. Note, however, that this option will also increase rsync's memory usage and run-time.
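Putting these pieces together, a rough sketch of driving rsync from Python and reading the whole-transfer percentage; SRC/ and DEST/ are placeholders for the real paths:

import re
import subprocess

cmd = ['rsync', '-a', '--no-inc-recursive', '--info=progress2', 'SRC/', 'DEST/']
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)

buf = b''
while True:
    byte = proc.stdout.read(1)
    if not byte:
        break
    if byte in b'\r\n':          # progress lines are redrawn with carriage returns
        match = re.search(rb'(\d+)%', buf)
        if match:
            print('overall progress:', match.group(1).decode() + '%')
        buf = b''
    else:
        buf += byte
proc.wait()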
For full control over the transfer, you should use a lower-level diff tool and manage the directory listing and data transfer yourself.
Based on librsync, there is either the command-line tool rdiff or the Python module pysync.