Python Logging - Circular Log (last n entries) - python

I know the Python logging library allows you to do 'circular' logging over multiple log files. What I'm trying to do is simply have one file, foo.log, that is always <= B bytes in size; if the next append is going to put it over B, then things are deleted off the top. I'd be just as happy to specify the max in terms of events, as well.
So, if this were the file rotation scheme, and item #4 exceeded B, you'd have:
foo.log.1    foo.log.2
---------    ---------
African      Swallow
or
European
I'd like to simply wind up with:
foo.log
-------
or
European
Swallow
EDIT: Based on the comments below, people have legitimately noted this is a less-than-optimal format. The motivation comes from debugging. I have scripts using psycopg2 to execute queries on a remote server that's stuck in roughly 2002, with no internet connection. Having the script log everything it sends to the db and then checking that log is the fastest way to see where something went wrong, and if I have to point someone else at it, I don't want to introduce the complication of having them figure out which file is the current log. The current solution is just to write the log and delete it if it gets too big.

As Martijn notes, this type of log would be complicated to manage, and maybe inefficient (though this may or may not concern you).
A simple way to remove some of the inefficiency is to use fixed record lengths, i.e. make each log entry the same (maximum) length.
Another way is to make your log database-based and store a record (fixed-length or not) for each log entry, letting the db manager handle the adjustments. The options range from simple RAM-based databases to real disk-based ones, all of which you can access from Python.
Yet another solution, if you're happy with a memory-based log, is to look into FIFO files.
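If you want a single file on disk that only ever holds the last N records, one more option is a custom logging handler that keeps the formatted records in memory and rewrites the file on each emit. A minimal sketch (the class and parameter names are mine, not part of the logging library):
import collections
import logging

class LastNFileHandler(logging.Handler):
    """Keep only the last `capacity` formatted records in a single file."""
    def __init__(self, filename, capacity=100):
        super().__init__()
        self.filename = filename
        self.records = collections.deque(maxlen=capacity)

    def emit(self, record):
        self.records.append(self.format(record))
        # Rewrite the whole file every time: simple and always consistent,
        # but wasteful if capacity is large or logging is very frequent.
        with open(self.filename, 'w') as f:
            f.write('\n'.join(self.records) + '\n')

logger = logging.getLogger('queries')
logger.addHandler(LastNFileHandler('foo.log', capacity=50))
logger.setLevel(logging.DEBUG)
logger.debug('SELECT 1')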

You could log to a file, with a counter/time at the start of each line.
When you get to a certain point, just start writing from the top of the file again.
thefile = open('somebinfile', 'r+b')
thefile.seek(0)
Things to consider: when you seek to the top and write to the file, you might only half-overwrite the next line; to account for that, you would need a unique line-ending character/string.
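For example, combining this with the fixed-record idea above avoids the half-overwritten-line problem entirely, because every write lands exactly on a slot. A rough sketch (the record size, slot count, and produce_log_lines are all made up for illustration):
RECORD_SIZE = 128          # fixed width per entry, in bytes
MAX_RECORDS = 1000         # file never grows past RECORD_SIZE * MAX_RECORDS

def write_entry(f, slot, text):
    # Pad/truncate so every record is exactly RECORD_SIZE bytes.
    data = text.encode('utf-8')[:RECORD_SIZE - 1].ljust(RECORD_SIZE - 1) + b'\n'
    f.seek(slot * RECORD_SIZE)
    f.write(data)
    f.flush()

thefile = open('somebinfile', 'r+b')                 # the file must already exist
for slot, line in enumerate(produce_log_lines()):    # hypothetical line source
    write_entry(thefile, slot % MAX_RECORDS, line)   # wrap back to the top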

Related

Knowing how to time a task when building a progress bar

In my program a user uploads a csv file.
While the file is uploading & being processed by my app, I'd like to show a progress bar.
The problem is that this process isn't entirely under my control (I can't really tell how long it'll take for the file to finish loading & be processed, as this depends on the file content and the size).
What would be the correct approach for doing this? It's not like I have many steps and I could increment the progress bar every time a step happens.... It's basically waiting for a file to be loaded, I cannot determine the time for that!
Is this even possible?
Thanks in advance
You don't give much detail, so I'll explain what I think is happening and give some suggestions from my thought process.
You have some kind of app that has some kind of function/process that is a black box (i.e. you can't see inside it or change it); this black box uploads a csv file to some server and returns control back to your app when it's done. Since you can't see inside the black box, you can't determine how much it has uploaded and thus can't create an accurate progress bar.
Named Pipes:
If you're passing only the filename of the csv to the black box, you might be able to create a named pipe (depending on your situation). Since named pipes block once the buffer is full, until the receiver reads from it, you can keep track of how much has been read and thus create an accurate progress bar.
So you would create a named pipe, pass the black box its filename, and then read from the csv and write to the named pipe. How far you've read in is your progress.
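A rough sketch of that idea on a Unix-like system. Here black_box_upload stands in for whatever function you're actually calling, and the sketch assumes it simply reads the given path sequentially like a regular file:
import os
import threading

def feed_pipe(csv_path, pipe_path, progress):
    total = os.path.getsize(csv_path)
    written = 0
    with open(csv_path, 'rb') as src, open(pipe_path, 'wb') as pipe:
        for chunk in iter(lambda: src.read(64 * 1024), b''):
            pipe.write(chunk)                  # blocks while the reader catches up
            written += len(chunk)
            progress(written / float(total))   # update your progress bar here

pipe_path = '/tmp/upload_fifo'
os.mkfifo(pipe_path)
feeder = threading.Thread(target=feed_pipe, args=('data.csv', pipe_path, print))
feeder.start()
black_box_upload(pipe_path)                    # the hypothetical black-box call
feeder.join()
os.remove(pipe_path)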
More Pythonic:
Since you tagged Python, if you're passing the csv as a file-like object, this activestate recipe could help.
Same kind of idea just for Python.
Conclusion: These are two possible solutions. I'm getting tired, and there may be many more - but I can't help more since you haven't given us much to work with.
To answer your question at an abstract level: you can't make accurate progress bars for black-box functions, after all they could have a sleep(random()) call in them for all you know.
There are ways around this that are implementation-specific; the two ideas above are examples. The idea is that you can make the black box take a stream instead, and count the bytes as you pass them through.
Alternatively, you can guess/approximate: a rough calculation of how many bytes are going in, together with a (previously measured) average speed per byte, would give you some indication of when it should complete. You could even save how long each run took and refine that estimate automatically, getting better each time.
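For instance, a crude estimator along those lines (the average rate is something you would have to measure and persist yourself across runs):
import os
import time

def estimated_progress(csv_path, started_at, avg_bytes_per_second):
    """Guess the completion fraction from file size and a measured average rate."""
    expected_seconds = os.path.getsize(csv_path) / float(avg_bytes_per_second)
    elapsed = time.time() - started_at
    return min(elapsed, expected_seconds) / expected_seconds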

Python TimedRotatingFileHandler - PID in Log File name - Best Approach

I wish to start multiple instances (Processes) of the Python Program and I want each one of them to write to their own log file.
The processes will be restarted at least once daily.
So I arrived at the following code.
logHandler = TimedRotatingFileHandler(
    os.path.join(os.path.dirname(sys.argv[0]),
                 'logs/LogFile_' + str(os.getpid()) + '.log'),
    when="midnight", backupCount=7)
Will this code maintain 7 Backups for each PID?
Is there a better way to split this so that my disk does not fill up with useless files, given that the PID might be unique for the processes over months?
Is there a better approach to doing this?
What I would ideally like is for only 1 week's worth of logs to be maintained. Can this be done using TimedRotatingFileHandler without having to write a separate purge/delete script?
Yes, this will maintain 7 backups, or a week's worth of logs, for each unique log path.
Rotating file handlers are the correct way to put a limit on logs.
As I said, rotating file handlers are the correct approach. I suppose you could use a RotatingFileHandler, but that rotates when the log hits a certain size, rather than at a particular time, so it doesn't let you specify a week's worth of logs.
I'm a bit confused by how you're keeping the pid for a given process constant, given that the 'processes will be restarted at least once daily'. A stronger guarantee that each process has a unique log path is to provide it explicitly as an argument, e.g. python script --log-file="$(pwd)/logs/LogFileProcX.log"
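Sketched out, that explicit-argument approach might look something like this (the --log-file option name and the format string are just examples):
import argparse
import logging
from logging.handlers import TimedRotatingFileHandler

parser = argparse.ArgumentParser()
parser.add_argument('--log-file', required=True,
                    help='log path that stays stable across restarts')
args = parser.parse_args()

handler = TimedRotatingFileHandler(args.log_file, when='midnight', backupCount=7)
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))

logger = logging.getLogger()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info('process started')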

What would happen if there's a power failure while the OS is in the middle of doing file I/O operations?

OK, this question is actually a follow-up question from my previous one: What would happen if I abruptly close my script while it's still doing file I/O operations?
So it's not possible to see an incomplete line written into a file whenever you force your script/program to quit, as the OS will do its job. But what if there's a power failure, and the OS is right in the middle of appending one line such as "This is a test" (or an even bigger string) to a file: do I get an incomplete line appended, nothing appended, or, even worse, previous content lost? I'm really curious to know, and this kind of situation would definitely happen on the server side. Can anybody help me out?
Rule 1. There's no magic. No guarantee. No assurance. Power failure means the circuitry passes through states that are outside their design tolerances. Anything could happen. No guarantees.
what if there's a power failure, and the OS is just in the middle of appending ... into a file, do I get an incomplete line appended
Possibly. There's no magic. The I/O could include two physical blocks. One written, one unwritten.
or nothing appended
Possibly. There's no magic. The I/O buffer may not have been synced to the device.
or even worse, previous content lost?
Possibly. There's no magic. A block write to the device could -- during a power failure -- fatally corrupt bits on the device.
I'm really curious to know, and this kind of situation would definitely happen on the server side.
"Definitely"? Nothing's definite during an uncontrollable event like a power failure. Anything could happen.
There's a really small possibility that the random scrambled bits could be the text of Lincoln's Gettysburg Address and that's what appears on the device.
It is dependent on FileSystem (and its options), hardware (caches/buffers, media, etc.), application behavior and lots of other tidbits.
You can lose data, even data you had safely written before. You can corrupt whole partitions. You can get garbage on files. You can get a line half-written, half-laden with garbage or whatever. Given the right combination of factors, you can pretty much get any result you imagine, files with mixed contents, old bits of deleted files resurfacing, dogs and cats living together... mass hysteria!
With a proper (journaled? versioned?) FS and sane hardware, you do lower the amount of chaos possible.

Any write functions In python that have the same safety as ACID does in databases

The title could have probably been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: a dhcp leases file),
and you're on a POSIX system, you can write your data to a temporary file and rename the temporary file to your target. On POSIX-compliant systems the rename is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. The key concept here is that the temporary file protects your data: if the system reboots, the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
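A minimal sketch of that write-then-rename pattern on a POSIX system (the fsync call is there so the data is actually on disk before the rename makes it visible):
import os
import tempfile

def atomic_write(path, data):
    # Create the temp file in the same directory so the rename stays on
    # one filesystem (rename is only atomic within a single filesystem).
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # push it through the OS cache to disk
        os.rename(tmp_path, path)      # atomically replace the old file
    except BaseException:
        os.unlink(tmp_path)
        raise

atomic_write('dhcp.leases', b'current leases go here\n')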
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
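For instance, a small sketch of grouping several writes into one atomic commit with the standard sqlite3 module (the table name and contents are made up):
import sqlite3

conn = sqlite3.connect('state.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)')

with conn:  # commits on success, rolls back if anything inside raises
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('lease-a', '10.0.0.5'))
    conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('lease-b', '10.0.0.6'))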
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID-compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some non-fully-written data). ZODB, for example, implements this by ending a record with the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
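A sketch of that record idea, with the length and a checksum written at the front of each record rather than at the end (illustrative only, not ZODB's actual on-disk format):
import struct
import zlib

def append_record(f, payload):
    # Length and CRC32 up front, so a torn write at the tail is detectable.
    f.write(struct.pack('<II', len(payload), zlib.crc32(payload)) + payload)
    f.flush()

def read_records(f):
    while True:
        header = f.read(8)
        if len(header) < 8:
            return                      # end of file (or a truncated header)
        length, crc = struct.unpack('<II', header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            return                      # partially written record: drop the tail
        yield payload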
It looks to me like your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()) (see the small sketch at the end of this answer).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you verify the integrity of your data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".
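The first point above, as a tiny sketch (shown for an append, but the same two calls apply to any write that needs to survive a crash):
import os

def durable_append(path, text):
    with open(path, 'a') as f:
        f.write(text)
        f.flush()                  # push Python's buffer out to the OS
        os.fsync(f.fileno())       # ask the OS to push its cache to the disk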

Need some ideas on how to code my log parser

I have a VPS that's hosting multiple virtual hosts. Each host has its own access.log and error.log. Currently there's no log rotation set up, though this may change.
Basically, I want to parse these logs to monitor bandwidth and collect stats.
My idea was to write a parser and save the information to a small sqlite database. The script will run every 5 minutes and use Python's seek and tell methods to open the log files from the last parsed locations. This prevents me from parsing a 10GB log file every 5 minutes when all I need is the new information sitting at the end of it (no log rotation, remember?).
After some thought, I realised that all I'm doing is taking the information from the log files and putting them into a database... Moving the data from one location to another :/
So how else can I do this?
I want to be able to do something like:
python logparse.py --show=bandwidth --between-dates=25,05|30,05 --vhost=test.com
This would open the log file for test.com and show me the bandwidth used for the specified 5 days.
Now, my question is, how do I prevent myself from parsing 10GB worth of data when I only want 5 days worth of data?
If I were to use my idea of saving the log data to a database every 5 minutes, I could just save a unix timestamp of the dates and pull out the data between them. Easy. But I'd prefer to parse the log file directly.
Unless you create different log files for each day, you have no way other than to parse the whole log on request.
I would still use a database to hold the log data, but with your desired time-unit resolution (e.g. hold the bandwidth at a day/hour interval). Another advantage of using a database is that you can make range queries, like the one you give in your example, very easily and quickly. Whenever you have old data that you don't need any more, you can delete it from the database to save space.
Also, you don't need to parse the whole file each time. You could monitor the writes to the file with the help of pyinotify: whenever a line is written, you could update the counters in the database. Or you can store the last position in the file whenever you read from it, and read from that position the next time. Be careful when the file is truncated.
To sum it up:
hold your data in the database at day resolution (e.g. the bandwidth for each day)
use pyinotify to monitor the writes to the log file so that you don't read the whole file over and over again
If you don't want to code your own solution, take a look at Webalizer, AWStats or pick a tool from this list.
EDIT:
WebLog Expert also looks promising. Take a look at one of the reports.
Pulling just the required 5 days of data from a large logfile comes down to finding the right starting offset to seek() the file to before you begin parsing.
You could find that position each time using a binary search through the file: seek() to os.stat(filename).st_size / 2, call readline() once (discarding the result) to skip to the end of the current line, then do two more readline()s. If the first of those lines is before your desired starting time, and the second is after it, then your starting offset is tell() - len(second_line). Otherwise, do the standard binary search algorithm. (I'm ignoring the corner cases where the line you're looking for is the first or last or not in the file at all, but those are easy to add)
Once you have your starting offset, you just keep parsing lines from there until you reach one that's newer than the range you're interested in.
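One way to code that bisection (open the log in binary mode so arbitrary seeks are well defined; parse_timestamp is a placeholder for however you pull the datetime out of a log line):
import os

def find_start_offset(f, start_time, parse_timestamp):
    """Return an offset from which every later line is >= start_time.
    May land at most one line early, which the parsing loop can skip."""
    lo, hi = 0, os.fstat(f.fileno()).st_size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        f.readline()                     # skip the (possibly partial) line we landed in
        line = f.readline()
        if line and parse_timestamp(line) < start_time:
            lo = mid + 1                 # desired range starts after mid
        else:
            hi = mid                     # desired range starts at or before mid
    f.seek(lo)
    if lo:
        f.readline()                     # realign to the start of the next full line
    return f.tell()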
This will be much faster than parsing the whole logfile each time, of course, but if you're going to be doing a lot of these queries, then a database probably is worth the extra complexity. If the size of the database is a concern, you could go for a hybrid approach where the database is an index into the log file. For example, you could store just the byte offset of the start of each day in the database. If you don't want to update the database every 5 minutes, you could have logparse.py update it with new data each time it runs.
After all that, though, as Pierre and the_void have said, do make sure you're not reinventing the wheel -- you're not the first person ever to need bandwidth statistics :-)
Save the last position
When you have finished parsing a log file, save the position in a table of your database that references both the full file path and the position. When you run the parser 5 minutes later, you query the database for the log you are going to parse, retrieve the position and start from there.
Save the first line of data
When you have log rotation, add an additional key to the database that will contain the first line of the log file. So when you start with a file, first read the first line. When you query the database, you then have to check on the first line and not on the file name.
The first line should always be unique, since you have the timestamp. But don't forget that W3C-compliant log files usually write headers at the beginning of the file, so the first line should be the first line of data.
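Taken together, those two ideas might look something like this sketch (the table and column names are invented for the example, and handle() stands in for your actual per-line parsing):
import sqlite3

db = sqlite3.connect('logstate.db')
db.execute('''CREATE TABLE IF NOT EXISTS parse_state
              (path TEXT PRIMARY KEY, first_line TEXT, position INTEGER)''')

def parse_new_lines(path):
    with open(path, 'r') as f:
        first_line = f.readline()
        row = db.execute('SELECT first_line, position FROM parse_state WHERE path = ?',
                         (path,)).fetchone()
        if row and row[0] == first_line:
            f.seek(row[1])                   # same file as last run: resume there
        # otherwise it's a new/rotated file; keep going from just after line one
        while True:
            line = f.readline()
            if not line:
                break
            handle(line)                     # e.g. add up the bytes-sent field
        with db:
            db.execute('INSERT OR REPLACE INTO parse_state VALUES (?, ?, ?)',
                       (path, first_line, f.tell()))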
Save the data you need only
When parsing W3C logs, it's very easy to read the bytes sent. Parsing will be very fast if you keep only that information. Then store it in your database, either by updating an existing row, or by adding a new row with a timestamp that you can aggregate with others later in a query.
Don't reinvent the wheel
Unless what you are doing is very specific, I recommend grabbing an open-source parser from the web. http://awstats.sourceforge.net/
