Is it possible for `tail` to emit incomplete lines? - python

I am using tail -F log | python parse.py to monitor and parse a growing log file, but some parsing errors occur that may be caused by reading incomplete lines from the log file.
Is it possible that tail emits incomplete lines?
In the parser, I am reading rows with code like the following:
import csv
import sys
reader = csv.reader(sys.stdin)
for row in reader:
    # process the row here

tail can emit 'unparsable' lines, but only if invalid lines are written to the file in the first place. That sounds circular, so here's an example of how it could happen:
You tail -f on /var/log/syslog
syslog-ng dies in the middle of a write that spans filesystem blocks. Sectors are 512 bytes and your filesystem block size is most likely larger (though probably not much larger than 4096). So suppose syslog has 9K of data buffered up to write out: it gets through the first 4K page, and before it can write the remaining 4K+1K, syslog dies. On ext2, at least, that ends up as a partial write even after fsck; ext3, I'd hope not, but I've been doing embedded for so long I can't say for sure. And crashes aside, who's to say that the data you're writing is always going to be correct? You might get a non-fatal string-formatting error that omits the newline you're expecting.
You'll then have a partial line that isn't terminated with a newline (or even a \0), and the next time syslog starts up and begins appending, it will just append to the end of the file with no notion of 'valid' records. So the first new record will be garbage, but the next one will be fine.
This is easy to demonstrate:
In one window
tail -f SOMEFILE
In another window
echo FOO >>SOMEFILE
echo BAR >>SOMEFILE
printf NO_NEWLINE >>SOMEFILE
echo I_WILL_HAVE_THE_LAST_LINE_PREFIXED_TO_ME_CAUSING_NERD_RAGE >>SOMEFILE
Since Linux's tail uses inotify by default, whatever is reading its output will receive NO_NEWLINE with no trailing newline and then wait; when the next write arrives, I_WILL_HAVE_THE_LAST_LINE_PREFIXED_TO_ME_CAUSING_NERD_RAGE is appended directly onto it, so the reader sees NO_NEWLINE prefixed to what it considers 'the latest line'.
If you want to do this the 'Pythonic' way: on Linux, use inotify; on OSX or BSD, use kqueue. That lets you skip tail as an input pipe entirely and just watch the file yourself (a sketch follows below).
Tail might also do weird things if you rely on 'resync on truncate': if the file gets zeroed and restarted in the middle of a read, you might get a strange amount of data on that read, since tail will close the previously opened file handle in exchange for the new one.
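If you do watch the file yourself, here is a minimal polling sketch (follow_lines is a hypothetical helper; an inotify or kqueue watch would replace the sleep). Buffering until a newline arrives is what protects the parser from the partial-write and NO_NEWLINE problems above:
import time

def follow_lines(path, poll_interval=0.5):
    """Yield only complete, newline-terminated lines appended to path."""
    buf = ''
    with open(path, 'r') as f:
        f.seek(0, 2)                       # start at the end of the file, like tail -f
        while True:
            chunk = f.read()
            if not chunk:
                time.sleep(poll_interval)  # nothing new yet; poll again
                continue
            buf += chunk
            while '\n' in buf:
                line, buf = buf.split('\n', 1)
                yield line                 # a partial trailing line stays in buf

# usage sketch:
# for line in follow_lines('SOMEFILE'):
#     parse(line)                          # parse() stands in for real processing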

Since you changed up the question on me.. new answer! :p
import csv
import sys

reader = csv.reader(sys.stdin)
for row in reader:
    try:
        validate_row_data_somehow(row)
        do_things_with_valid_row(row)
    except Exception:
        print "failed to process row", repr(row)

Related

"_csv.Error: line contains NULL byte" after truncating a csv log file that is being piped to by another process

How do I truncate a csv log file that is being used as std out pipe destination from another process without generating a _csv.Error: line contains NULL byte error?
I have one process running rtlamr > log/readings.txt that is piping radio signal data to readings.txt. I don't think it matters what is piping to the file--any long-running pipe process will do.
I have a file watcher using watchdog (a Python file watcher) on that file, which triggers a function when the file is changed. The function reads the file and updates a database.
Then I try to truncate readings.txt so that it doesn't grow infinitely (or back it up).
file = open(dir_path+'/log/readings.txt', "w")
file.truncate()
file.close()
This corrupts readings.txt and generates the error (the start of the file contains garbage characters).
I tried moving the file instead of truncating it, in the hopes that rtlamr will recreate a fresh file, but that only has the effect of stopping the pipe.
EDIT
I noticed that the charset changes from us-ascii to binary but attempting to truncate the file with file = open(dir_path+'/log/readings.log', "w",encoding="us-ascii") does not do anything.
If you truncate a file¹ while another process has it open in w mode, that process keeps writing at its own offsets, making the file sparse; the low offsets are then read back as NUL (0) bytes.
As per x11 - Concurrent writing to a log file from many processes - Unix & Linux Stack Exchange and Can two Unix processes simultaneous write to different positions in a single file?, each process that has a file open has its own offset into it, and an ftruncate() doesn't change that.
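A quick way to see this effect on a POSIX system (the file name is a placeholder; the \x00 bytes are exactly what trips csv's "line contains NULL byte" error):
w = open('demo.log', 'w')       # the writer keeps its own file offset
w.write('hello\n')
w.flush()

open('demo.log', 'w').close()   # a second open in 'w' mode truncates the file

w.write('world\n')              # the writer continues at offset 6
w.flush()
print(repr(open('demo.log', 'rb').read()))
# expected: 'hello' is gone, replaced by six \x00 bytes, then 'world\n'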
If you want the other process to react to truncation, it needs to have the file open in append mode ('a' / O_APPEND), so that every write goes to the current end of the file.
Your approach also has a fundamental bug: it is not atomic. You may (and eventually will) truncate the file after the producer has added data but before you have read it, so that data is lost.
Consider using dedicated data buffering utilities instead like buffer or pv as per Add a big buffer to a pipe between two commands.
¹ Which is superfluous, because open(mode='w') already does that. Either truncate or reopen; no need to do both.
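If you do stay with truncation, here is a minimal sketch that reads and truncates through one file object, assuming the producer has the log open in append mode. It narrows the race window described above but does not close it, so the buffering utilities remain the better answer:
def drain_log(path):
    """Read everything currently in the log, then truncate it in place."""
    with open(path, 'r+') as f:
        data = f.read()    # consume all buffered readings
        f.seek(0)
        f.truncate()       # reset to zero length; an O_APPEND writer
                           # continues at the new end of file
    return data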

What's behind sys.stdin.readlines()

Question 1:
I have a piece of code like this (Python 2.7):
for line in sys.stdin.readlines():
    print line
When I run this code, type a string in the terminal, and press Enter, nothing happens; 'print line' is not executed.
So I imagine there is a buffer for sys.stdin.readlines(), but I wonder how it works. Can I flush it so that 'print line' executes immediately every time a line is entered?
Question2: What's the difference between these two lines:
for line in sys.stdin:
for line in sys.stdin.readline():
I found their behavior is a little different. If I use Ctrl+D to terminate the input, I have to press Ctrl+D twice in the first case before input really terminates, while in the second case a single Ctrl+D is enough.
CTRL-D sends the EOF (end of file) control character to stdin in an interactive shell. Usually, you feed a file to the stdin of a process via redirection (e.g. myprogram < myfile), but if you are interactively typing characters into stdin of a process, you need to tell it when to stop reading the "file" you are actively creating.
sys.stdin.readlines waits for stdin to complete (via an EOF control character), then conveniently splits the entire stdin contents received before the EOF into a list of lines. When you hit ENTER, you send a \n character, which is rendered for you as a new line but does NOT tell stdin to stop reading.
Regarding the other two lines, I think this might help:
Think of the sys.stdin object as a file. When you send EOF, you save that file, and then you are not allowed to edit it anymore because it leaves your hands and belongs to stdin. You can perform functions on that file, like readlines, which is a convenient way to say "I want a list, and each element is a line in that file". Or you can just read one line from it with readline, in which case the for loop would only be iterating over the characters in that line.
What's going on behind the scenes?
Internally, consuming sys.stdin blocks execution until EOF is received; the contents then behave like a file-like object stored in memory, with a read pointer at the beginning.
When you call just readline, the pointer reads until it hits a \n character, returns to you what it just traversed over, and stays put, waiting for you to move it again. Calling readline again will cause the pointer to move until the next \n, if it exists, else EOF.
readlines really tells the pointer to traverse all the way from its current position (not necessarily the beginning of the file) until it sees EOF; \n no longer stops the traversal, it only marks where the resulting list is split.
Try it out!
Trying it out is the best way to learn.
To see this behavior in action, try making a file with 10 lines, then redirect it to the stdin of a Python script that calls sys.stdin.readline() 3 times and then sys.stdin.readlines(). You'll see 3 lines printed one at a time, then a list containing the remaining 7 elements :)
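A sketch of that experiment (readdemo.py and tenlines.txt are placeholder names; run it as python readdemo.py < tenlines.txt):
import sys

for i in range(3):
    print(repr(sys.stdin.readline()))  # one line each, trailing '\n' included

print(sys.stdin.readlines())           # the remaining seven lines as a list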

Using textfile as stdin in python under windows 7

I'm a win7-user.
I happened to read about redirections (like command1 < infile > outfile) on *nix systems, and then I discovered that something similar can be done in Windows (link). And Python can also do something like this with pipes(?) or stdin/stdout(?).
I do not understand how this happens in Windows, so I have a question.
I use some kind of proprietary windows-program (.exe). This program is able to append data to a file.
For simplicity, let's assume that it is the equivalent of something like
from time import ctime, sleep

while True:
    f = open('textfile.txt', 'a')
    f.write(repr(ctime()) + '\n')
    f.close()
    sleep(100)
The question:
Can I use this file (textfile.txt) as stdin?
I mean that the script (while it runs) should always (not just once) handle all new data, i.e.:
In the "never-ending cycle":
The program (.exe) writes something.
Python script captures the data and processes it.
Could you please write how to do this in python, or maybe in win cmd/.bat or somehow else.
This is an insanely cool thing. I want to learn how to do it! :D
If I am reading your question correctly then you want to pipe output from one command to another.
This is normally done as such:
cmd1 | cmd2
However, you say that your program only writes to files. I would double-check the documentation to see if there isn't a way to get the command to write to stdout instead of a file.
If this is not possible then you can create what is known as a named pipe. It appears as a file on your filesystem, but it is really just a buffer of data that can be written to and read from (the data is a stream and can only be read once), so your program reading it will not finish until the program writing to the pipe stops writing and closes the "file". I don't have experience with named pipes on Windows, so you'll need to ask a new question for that. One downside of pipes is that they have a limited buffer size: if no program is reading from the pipe, then once the buffer is full the writing program can't continue and will just wait indefinitely until something starts reading from the pipe.
An alternative: on Unix there is a program called tail which can continuously monitor a file for changes and output any data as it is appended to the file (with a short delay).
tail --follow=name --retry textfile.txt | mycmd
# wait for data to be appended to the file and pass new data to mycmd
cmd1 >> textfile.txt  # append output to the file
One thing to note about this is that tail won't stop just because the first command has stopped writing to the file. tail will continue to listen for changes on that file forever, until mycmd stops listening to tail, or until tail is killed (or interrupted with SIGINT).
This question has various answers on how to get a version of tail onto a windows machine.
import sys

sys.stdin = open('textfile.txt', 'r')
for line in sys.stdin:
    process(line)  # note: this reads the existing contents once and stops at EOF;
                   # it does not keep following the file as it grows
If the program writes to textfile.txt, you can't change that to redirect to stdin of your Python script unless you recompile the program to do so.
If you were to edit the program, you'd need to make it write to stdout, rather than a file on the filesystem. That way you can use the redirection operators to feed it into your Python script (in your case the | operator).
Assuming you can't do that, you could write a program that polls the text file for changes and consumes only the newly written data, by keeping track of how much it read the last time the file was updated.
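A minimal sketch of that polling approach (the function name and interval are placeholders):
import os
import time

def poll_new_data(path, interval=1.0):
    """Yield chunks of newly appended data, remembering how far we have read."""
    offset = 0
    while True:
        size = os.path.getsize(path)
        if size > offset:
            with open(path, 'r') as f:
                f.seek(offset)      # skip what was already consumed
                data = f.read()
                offset = f.tell()   # remember the position for the next poll
            yield data
        elif size < offset:
            offset = 0              # the file shrank (was truncated); start over
        time.sleep(interval)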
When you use < to redirect the contents of a file to a Python script, that script receives the data on its stdin stream.
Simply read from sys.stdin to get that data:
import sys
for line in sys.stdin:
    # do something with line

What would happen if the same file were being read and appended to at the same time (Python)?

I'm writing a script that uses two separate threads, one doing file reading and the other doing appending; both threads run fairly frequently.
My question is: if one thread happens to read the file while the other is in the middle of appending a string such as "This is a test", what would happen?
I know that if the appended string is smaller than the write buffer, then no matter how frequently you read the file from the other thread, you will never see an incomplete line such as "This i". The OS will either append "This is a test" and then serve the read, or serve the read and then append "This is a test"; the sequence append "This i" -> read from the file -> append "s a test" will never happen.
But if "This is a test" is big enough (a bigger-than-buffer string), the OS can't complete the append in one operation, so the job is divided in two: first append "This i", then append "s a test". In that situation, if I happen to read the file in the middle of the whole appending operation, could I get the sequence append "This i" -> read from the file -> append "s a test", meaning I might read a file that includes an incomplete string?
If you're worried about this, just have your consumer look for a special character (a newline would work) so that it knows where a complete write ends. The producer (the one writing data to the file) can then output partial data, and the consumer (the one reading from the file) will know when it has only got a partial write.
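A minimal sketch of that consumer-side check, assuming the producer ends every record with a newline (handle and the file name are placeholders):
import time

def handle(record):
    pass                              # stand-in for real record processing

partial = ''
with open('shared.log', 'r') as f:
    while True:
        partial += f.read()           # may end in the middle of a record
        while '\n' in partial:
            record, partial = partial.split('\n', 1)
            handle(record)            # only complete records reach here
        time.sleep(0.1)               # avoid spinning when nothing is new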
Is there a reason you're not using a pipe instead of a file? And is there a reason you're using threading? You don't really gain anything except maybe simplicity of coding, but IMO you might as well use separate processes and actually get something out of this model.
Added: Unfortunately, this I/O behavior is not just how Python handles things but how the OS handles things. Everything you've said about worrying about the buffer is true.
http://docs.python.org/library/functions.html#open
I would try to figure out what your buffer size is, though I don't even know how to check that. I'm on OSX anyway.

How to append EOF to file using Perl or Python?

I’m trying to bulk insert data into a SQL Server Express database. When doing bcp from a Windows XP command prompt, I get the following error:
C:\temp>bcp in -T -f -S
Starting copy...
SQLState = S1000, NativeError = 0
Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file
0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 4391
So, there is a problem with EOF. How do I append a correct EOF character to this file using Perl or Python?
EOF is End Of File. What probably occurred is that the file is not complete; the software expects data, but there is none to be had anymore.
These kinds of things happen when:
the export is interrupted (the dump software is quit while dumping)
the copy of the dump file is aborted part-way
the disk fills up during the dump
and other things of that kind.
By the way, though EOF usually just means end of file, there does exist an EOF character. It exists because terminal (command-line) input doesn't really end the way a file does, yet it is sometimes necessary to pass an EOF to such a utility. I don't think it's used in real files, at least not to indicate an end of file: the file system knows perfectly well when a file has ended and doesn't need an in-band marker to find that out.
EDIT: shamelessly copied from a comment provided by John Machin:
It can happen (unintentionally) in real files. All it takes is (1) a data-entry user who types Ctrl-Z by mistake, sees nothing on the screen, types the intended Shift-Z, and keeps going, and (2) validation software (written by, e.g., the company president's nephew) that happily accepts Ctrl-anything in text fields. Then your database has a little bomb in it, just waiting for someone to export a query to a flat file.
Unexpected EOF means that the bcp reader found an EOF when it was expecting more data. This EOF can be:
(1) the actual physical end-of-file (no more bytes to be read). This means that you have mis-formatted data. Check near the end of your file for an incomplete record.
OR
(2) on Windows, where you are, programs reading a file in text mode honour the ancient convention, inherited via MS-DOS from CP/M, of regarding Ctrl-Z (aka ^Z aka '\x1A' aka SUB aka SUBSTITUTE) as an end-of-file marker when reading from ANY file, not just a terminal. This includes Python: the behaviour is determined by the C stdlib. Check for '\x1A' in your data.
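A quick way to observe that text-mode behaviour with Python 2 on Windows (the file name is a placeholder):
# write a Ctrl-Z (\x1a) into the middle of a file, then read it back twice
open('demo.txt', 'wb').write('before\x1aafter')

print repr(open('demo.txt', 'r').read())   # text mode stops at Ctrl-Z: 'before'
print repr(open('demo.txt', 'rb').read())  # binary mode reads everything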
Update responding to comments in a legible fashion:
In Notepad++, you can make it display unusual characters via View / Show Symbol / Show All Characters. To search for one, press Ctrl-F, type \x1a in the Find What box, and select the Extended radio button in the Search panel.
Or, with a little bit of Python, you can get the line number of the first Ctrl-Z:
data = open('bcp.dat', 'rb').read()   # read the whole file as raw bytes
zpos = data.find('\x1a')              # position of the first Ctrl-Z, or -1
if zpos == -1:
    print 'no Ctrl-Z in file'
else:
    print 1 + data[:zpos].count('\r\n')
Where your .dat was created doesn't matter. An unintentional Ctrl-Z can happen anywhere in a file created on any operating system. It is where it is being read as a text file that matters -- Windows? Bang!
This is not a problem with a missing EOF, but with an EOF that is present where bcp does not expect one.
I am not a bcp tool expert, but it looks like there is some problem with the format of your data files.
