Need some ideas on how to code my log parser - python

I have a VPS that's hosting multiple virtual hosts. Each host has its own access.log and error.log. Currently there's no log rotation set up, though this may change.
Basically, I want to parse these logs to monitor bandwidth and collect stats.
My idea was to write a parser and save the information to a small sqlite database. The script will run every 5 minutes and use Python's seek and tell methods to open the log files from the last parsed locations. This prevents me from parsing a 10GB log file every 5 minutes when all I need is the new information sitting at the end of it (no log rotation, remember?).
After some thought, I realised that all I'm doing is taking the information from the log files and putting it into a database... moving the data from one location to another :/
So how else can I do this?
I want to be able to do something like:
python logparse.py --show=bandwidth --between-dates=25,05|30,05 --vhost=test.com
This would open the log file for test.com and show me the bandwidth used for the specified 5 days.
Now, my question is, how do I prevent myself from parsing 10GB worth of data when I only want 5 days worth of data?
If I were to use my idea of saving the log data to a database every 5 minutes, I could just save a unix timestamp of the dates and pull out the data between them. Easy. But I'd prefer to parse the log file directly.

Unless you create different log files for each day, you have no option other than to parse the whole log on request.
I would still use a database to hold the log data, but at your desired time-unit resolution (e.g. hold the bandwidth at a day/hour interval). Another advantage of using a database is that you can make range queries, like the one you give in your example, very easily and quickly. Whenever you have old data that you don't need any more, you can delete it from the database to save space.
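A minimal sketch of that day-resolution table using Python's sqlite3 (the table layout and file name here are assumptions, not part of the original answer):

import sqlite3

db = sqlite3.connect('logstats.db')          # file name is just an example
db.execute('''CREATE TABLE IF NOT EXISTS bandwidth (
                  vhost TEXT,
                  day   TEXT,                -- 'YYYY-MM-DD'
                  bytes INTEGER,
                  PRIMARY KEY (vhost, day))''')

def add_bytes(vhost, day, nbytes):
    # Accumulate the bytes of one parsed request into its vhost/day bucket.
    db.execute('INSERT OR IGNORE INTO bandwidth VALUES (?, ?, 0)', (vhost, day))
    db.execute('UPDATE bandwidth SET bytes = bytes + ? WHERE vhost = ? AND day = ?',
               (nbytes, vhost, day))
    db.commit()

def bandwidth_between(vhost, start_day, end_day):
    # The kind of range query the --between-dates example needs.
    row = db.execute('SELECT SUM(bytes) FROM bandwidth '
                     'WHERE vhost = ? AND day BETWEEN ? AND ?',
                     (vhost, start_day, end_day)).fetchone()
    return row[0] or 0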
Also, you don't need to parse the whole file each time. You could monitor writes to the file with the help of pyinotify: whenever a line is written, you update the counters in the database. Or you can store the last position in the file whenever you read from it and read from that position the next time. Be careful when the file is truncated.
To sum it up:
hold your data in the database at day resolution (e.g. the bandwidth for each day)
use pyinotify to monitor the writes to the log file so that you don't read the whole file over and over again
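For the pyinotify suggestion, a rough sketch of watching one vhost's access log might look like this (the log path is an example, and the counter update is left as a stub):

import pyinotify

class LogHandler(pyinotify.ProcessEvent):
    def process_IN_MODIFY(self, event):
        # Read the newly appended lines from event.pathname here and
        # update the per-day counters in the database.
        print('log written:', event.pathname)

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, LogHandler())
wm.add_watch('/var/log/apache2/test.com-access.log', pyinotify.IN_MODIFY)
notifier.loop()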
If you don't want to code your own solution, take a look at Webalizer, AWStats or pick a tool from this list.
EDIT:
WebLog Expert also looks promising. Take a look at one of the reports.

Pulling just the required 5 days of data from a large logfile comes down to finding the right starting offset to seek() the file to before you begin parsing.
You could find that position each time using a binary search through the file: seek() to os.stat(filename).st_size // 2, call readline() once (discarding the result) to skip to the end of the current line, then do two more readline()s. If the first of those lines is before your desired starting time, and the second is after it, then your starting offset is tell() - len(second_line). Otherwise, continue with the standard binary search algorithm. (I'm ignoring the corner cases where the line you're looking for is the first or last line, or not in the file at all, but those are easy to add.)
Once you have your starting offset, you just keep parsing lines from there until you reach one that's newer than the range you're interested in.
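A rough sketch of that binary search, assuming timestamps in the log are non-decreasing and that parse_ts() is a caller-supplied helper (a placeholder name) that extracts a comparable timestamp from one bytes line; the same corner cases are ignored:

import os

def find_start_offset(path, start_ts, parse_ts):
    # parse_ts(line) -> a comparable timestamp pulled out of one log line.
    # Corner cases (target is the very first line, or not in the file at all)
    # are ignored, as in the description above.
    with open(path, 'rb') as f:
        lo, hi = 0, os.stat(path).st_size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                     # skip the partial line we landed in
            line = f.readline()              # first complete line after mid
            if not line or parse_ts(line) >= start_ts:
                hi = mid                     # in-range data starts at or before here
            else:
                lo = mid + 1                 # in-range data starts after here
        return lo

To use it, seek() to the returned offset, discard one partial line, then parse forward until a line's timestamp passes the end of the requested range.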
This will be much faster than parsing the whole logfile each time, of course, but if you're going to be doing a lot of these queries, then a database is probably worth the extra complexity. If the size of the database is a concern, you could go for a hybrid approach where the database is an index into the log file. For example, you could store just the byte offset of the start of each day in the database. If you don't want to update the database every 5 minutes, you could have logparse.py update it with new data each time it runs.
After all that, though, as Pierre and the_void have said, do make sure you're not reinventing the wheel -- you're not the first person ever to need bandwidth statistics :-)

Save the last position
When you have finished parsing a log file, save the position in a table of your database that references both the full file path and the position. When you run the parser 5 minutes later, you query the database for the log you are going to parse, retrieve the position, and start from there.
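A small sketch of that idea with sqlite3 (the table and database names are made up), including a crude guard for the truncation case mentioned earlier:

import os, sqlite3

db = sqlite3.connect('logstats.db')
db.execute('CREATE TABLE IF NOT EXISTS parse_state (path TEXT PRIMARY KEY, pos INTEGER)')

def read_new_lines(path):
    row = db.execute('SELECT pos FROM parse_state WHERE path = ?', (path,)).fetchone()
    pos = row[0] if row else 0
    if pos > os.path.getsize(path):          # file shrank: rotated or truncated, start over
        pos = 0
    with open(path, 'rb') as f:
        f.seek(pos)
        lines = f.readlines()                # only what was appended since the last run
        newpos = f.tell()
    db.execute('INSERT OR REPLACE INTO parse_state (path, pos) VALUES (?, ?)', (path, newpos))
    db.commit()
    return lines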
Save the first line of data
When you have log rotation, add an additional key in the database that contains the first line of the log file. So when you start with a file, read the first line first. When you query the database, you then have to check on the first line and not on the file name.
The first line should always be unique, since you have the timestamp. But don't forget that W3C-compliant log files usually write headers at the beginning of the file, so the first line should be the first line of data.
Save the data you need only
When parsing W3C logs, it's very easy to read the bytes sent. Parsing will be very fast if you keep only that information. Then store it in your database, either by updating an existing row or by adding a new row with a timestamp that you can aggregate with others later in a query.
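For illustration, pulling just the bytes field out of the usual Apache common/combined log format could look like this (the regex is a sketch, not a complete parser):

import re

# host ident user [date] "request" status bytes ...
LINE_RE = re.compile(r'\S+ \S+ \S+ \[([^\]]+)\] "[^"]*" \d{3} (\d+|-)')

def bytes_sent(line):
    m = LINE_RE.match(line)
    if not m:
        return 0                             # malformed line: count nothing
    return 0 if m.group(2) == '-' else int(m.group(2))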
Don't reinvent the wheel
Unless what you are doing is very specific, I recommend you grab an open-source parser from the web: http://awstats.sourceforge.net/

Related

Getting realtime weather data charted using python and google apps script

Apologies if this is too stupid a question:
I run the weather data logging program Cumulus on my PC. It writes each data sample to the realtime.txt file every second (the previous sample is always overwritten, i.e. the file size stays at 1 data row).
I want to publish a pseudo-realtime chart of the last hour of wind data (strength and direction) to the web. My plan is to have a Python program read the realtime.txt data row each second and append it to a buffer which is saved to a file. A Google Apps Script is then time-triggered every minute, reads the buffer file and appends the last minute's worth of data to a circular one-hour data table in Google Sheets. The table's data is used in a line chart, which is then published and can be read by anyone.
So basically I have a short term (1-minute) FIFO which feeds a longer term (1-hour) FIFO using a buffer file for transfer.
Now my problem is how to sync these FIFOs so that every one-second data row is counted, and no rows are counted twice. As the two apps are unsynchronized, this is not trivial, at least not for me.
I'm thinking of giving each data row a unique row number, enabling the Apps Script to figure out where to put the next chunk of data in the longer-term buffer. I'd also expand the short-term FIFO to 2 minutes, which would ensure that there is always at least 1 minute's worth of fresh data available regardless of any jitter in the trigger time of the Apps Script.
But converting these thoughts into code is proving a bit overwhelming for me ;( Has anyone had the same problem, or can anyone otherwise advise, even slightly?
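For what it's worth, a rough sketch of the row-numbering idea described above (the file names are assumptions, and Cumulus's actual field layout isn't handled):

import time

seq = 0
last_row = None
with open('buffer.txt', 'a') as buf:
    while True:
        with open('realtime.txt') as f:
            row = f.read().strip()
        if row and row != last_row:          # only count each sample once
            seq += 1
            buf.write('%d,%s\n' % (seq, row))
            buf.flush()                      # make it visible to the Apps Script side
            last_row = row
        time.sleep(1)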

ACID Transactions at the File System

Background:
I am getting a temperature float from an Arduino via a serial connection. I need to be able to cache this temperature data every 30 seconds for other applications (e.g. web, thermostat controller) to access, without overloading the serial connection.
Currently I cache this data to RAM as a file in /run (I'm trying to follow Linux convention). Then, other applications can poll the file for the temperature as they want it all day long, with I/O now the only bottleneck (using an RPi, so not a lot of enterprise-level need here).
Problem:
I think when an app reads this file, it risks reading corrupt data. If a writer updates the file while a reader is reading it at the same time, can corrupt data be read, causing the thermostat to behave erratically?
Should I just use sqlite3 as an overkill solution, or use file locks (and does that risk something else not working perfectly)?
This is all taking place in multiple python processes. Is Linux able to handle this situation natively or do I need to apply somehow the principles mentioned here?
Calls to write(2) ought to be atomic under Linux.
Which means that as long as you are writing a single buffer, you can be certain that readers won't read an incomplete record. You might want to use os.write to make sure that no buffering/chunking happens that you are not aware of.
if a read is happening and a file is updated, will the read use the new data while in the middle of a file, or does it somehow know how to get data from the old file (how)?
If there is exactly one read(2) and one write(2), you are guaranteed to see a consistent result. If you split your write into two, it might happen that you write the first part, read, and then write the second part, which would be an atomicity violation. In case you need to write multiple buffers, either combine them yourself or use writev(2).
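A sketch of that single-buffer approach, building the whole record first and handing it to one os.write call (the path is an example):

import os

def write_temperature(path, value):
    data = ('%.2f\n' % value).encode('ascii')          # whole record in one buffer
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)                             # a single write(2) call
    finally:
        os.close(fd)

Note that a reader which opens the file exactly between the truncate and the write could still see an empty file; another common pattern is to write to a temporary file in the same directory and os.replace() it over the real one, since the rename is atomic and readers only ever see a complete file.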

Python Logging - Circular Log (last n entries)

I know the Python logging library allows you to do 'circular' logging over multiple log files. What I'm trying to do is simply have one file, foo.log, that is always <= B bytes in size; if the next append is going to put it over B, then things are deleted off the top. I'd be just as happy to specify the max in terms of events, as well.
So, if this were the file rotation scheme, and item #4 exceeded B, you'd have:
foo.log.1    foo.log.2
---------    ---------
African      Swallow
or
European
I'd like to simply wind up with:
foo.log
-------
or
European
Swallow
EDIT: Based on the comments below, people have legitimately noted this is a less-than-optimal format. The motivation comes from debugging. I have scripts using psycopg2 to execute queries on a remote server that's stuck in roughly 2002, with no internet connection. Having it log everything it's sending to the db and then checking that log is the fastest way to see where something went wrong, and if I have to point someone else at it, I don't want to introduce the complication of having them figure out which is the current log file. The current solution is just to write the log and delete it if it gets too big.
As Martijn notes, this type of log would be complicated to manage, and maybe inefficient (though this may or may not concern you).
A simple way to solve part of the inefficiency is to use fixed record lengths, i.e. make each log entry the same (max) length.
Another way is to make your log database-based, and just write a record (variable-length or not) for each log entry. Let the db manager handle the adjustments. Options range from simple RAM-based databases to real disk-based ones, all of which you can access from Python.
Yet another solution, if you're happy with a memory-based log, is to look into FIFO files.
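If an in-memory buffer is acceptable, a small sketch with collections.deque (the class and file names are made up); the whole file is rewritten on each entry, which is fine for a small debug log:

from collections import deque

class CircularLog:
    def __init__(self, path, max_entries):
        self.path = path
        self.entries = deque(maxlen=max_entries)   # old entries drop off the front

    def log(self, message):
        self.entries.append(message)
        with open(self.path, 'w') as f:            # rewrite the single small file
            f.write('\n'.join(self.entries) + '\n')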
You could log to a file, with a counter/time at the start of each line.
When you get to a certain point, just update from the top of the file again.
# Open the existing log for reading and writing without truncating it
thefile = open('somebinfile', 'r+b')
# Jump back to the start of the file to begin overwriting the oldest entries
thefile.seek(0)
Things to consider: when you seek to the top and write to the file, you might only half-overwrite the next line; to account for that, you would need a unique line-ending character/string.

Replacing files in a file structure similar to a file system?

I'm writing a program that extracts files from and adds files to the Xbox 360's STFS files. The STFS structure is a mini file system: it has hashtables, a file table, etc.
Extracting the files is simple enough. I have the starting block of the file and the number of blocks in the file, so I just need to find the block offsets, read the block lengths, and send that out as the file. What happens, though, when I need to replace or remove a file? I've read that on Windows and computers in general, files aren't actually deleted; they're just removed from the file table and are overwritten when something else needs the space. When I'm writing a file, then, how do I find an unused sequence of blocks large enough to hold it? The blocks are 0x1000 bytes in length and fill remaining space with empty bytes, so everything evens out nicely, but I can't think of an efficient way to find an unused range of blocks that will fit the file I want to add.
My current plan is to rewrite everything on removing or adding a file so that I don't have large amounts of unused space that I'm unable to figure out how to overwrite. Is there a good introduction to file systems like NTFS or FAT32 that I could read that won't take days to understand and will contain the necessary information to write a basic file manager?
reference to structure: http://free60.org/STFS
edit: On second thought, I would create a list of ranges for each file in the table, that is, the start offset and end offset based on size. When looking for an open range to insert a file, I would start at 0 and check whether the start or end of each existing range falls inside the range needed by the file to be inserted. If either does, I would move on past that file's end offset and try again. This is better than my initial idea, but still seems inefficient: I would have to make multiple comparisons for every file in the file table.
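A sketch of that search over a sorted list of (start, end) block ranges, first-fit style (the names are placeholders and end is exclusive):

def find_free_range(used_ranges, blocks_needed, total_blocks):
    used = sorted(used_ranges)                    # sort once, then sweep the gaps
    cursor = 0
    for start, end in used:
        if start - cursor >= blocks_needed:       # gap before this file is big enough
            return cursor
        cursor = max(cursor, end)
    if total_blocks - cursor >= blocks_needed:    # room left after the last file
        return cursor
    return None                                   # no contiguous gap: rewrite/compact

Sorting the ranges once and sweeping the gaps keeps it to a single pass per insertion, rather than comparing every candidate position against every file.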

Adding space in an output file without having to read the entire thing first

Question: How do you write data to the beginning of an already existing file without writing over what's already there and without reading the entire file into memory? (i.e. prepend)
Info:
I'm working on a project right now where the program frequently dumps data into a file. This file will very quickly balloon up to 3-4 GB. I'm running this simulation on a computer with only 768 MB of RAM. Pulling all that data into RAM over and over would be a great pain and a huge waste of time. The simulation already takes long enough to run as it is.
The file is structured such that the number of dumps it makes is listed at the beginning as a simple value, like 6. Each time the program makes a new dump I want that to be incremented, so now it's 7. The problem lies with the 10th, 100th, 1000th, and so on, dump: the program will write the 10 just fine, but overwrite the first character of the next line:
"9\n580,2995,2083,028\n..."
"10\n80,2995,2083,028\n..."
Obviously, the difference between 580 and 80 in this case is significant. I can't lose these values, so I need a way to add a little space in there so that I can add this new data without losing my data or having to pull the entire file up and then rewrite it.
Basically what I'm looking for is a kind of prepend function: something to add data to the beginning of a file instead of the end.
Programmed in Python
~n
See the answers to this question:
How do I modify a text file in Python?
Summary: you can't do it without reading the file in (this is due to how the operating system works, rather than a Python limitation)
It's not addressing your original question, but here are some possible workarounds:
Use SQLite (it's bundled with your Python)
Use a fancier database, either RDBMS or NoSQL
Just track the number of dumps in a different text file
The first couple of options are a little more work up front, but provide more flexibility. The last option is the easiest solution to your current problem.
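The last option could look something like this (the counter file name is made up):

def bump_dump_count(counter_path='dump_count.txt'):
    try:
        with open(counter_path) as f:
            count = int(f.read().strip() or 0)
    except FileNotFoundError:
        count = 0
    count += 1
    with open(counter_path, 'w') as f:            # tiny file, cheap to rewrite
        f.write(str(count))
    return count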
You could quite easily create a new file, output the data you wish to prepend to that file, then copy the content of the existing file and append it to the new one, then rename.
This would prevent having to read the whole file if that is the primary issue.
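A sketch of that approach, streaming the old content so it never has to fit in RAM (the temporary file name is an assumption):

import os, shutil

def prepend(path, text):
    tmp = path + '.tmp'
    with open(tmp, 'w') as out, open(path) as src:
        out.write(text)                           # new header/count goes in first
        shutil.copyfileobj(src, out)              # then stream the old content in chunks
    os.replace(tmp, path)                         # swap the new file into place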
