How to append EOF to file using Perl or Python?

I’m trying to bulk insert data to SQL server express database. When doing bcp from Windows XP command prompt, I get the following error:
C:\temp>bcp in -T -f -S
Starting copy...
SQLState = S1000, NativeError = 0
Error = [Microsoft][SQL Native Client]Unexpected EOF encountered in BCP data-file
0 rows copied.
Network packet size (bytes): 4096
Clock Time (ms.) Total : 4391
So there is a problem with EOF. How do I append a correct EOF character to this file using Perl or Python?

EOF is End Of File. What probably occurred is that the file is not complete; the software expects data, but there is none to be had anymore.
These kinds of things happen when:
the export is interrupted (quit dump software while dumping)
the copy is aborted while copying the dump file
disk full during dump
these kinds of things.
By the way, though EOF is usually just an end of file, there does exist an EOF character. This is used because terminal (command line) input doesn't really end like a file does, but it sometimes is necessary to pass an EOF to such a utility. I don't think it's used in real files, at least not to indicate an end of file. The file system knows perfectly well when the file has ended, it doesn't need an indicator to find that out.
EDIT shamelessly copied from a comment provided by John Machin
It can happen (unintentionally) in real files. All it needs is (1) a data-entry user who types Ctrl-Z by mistake, sees nothing on the screen, types the intended Shift-Z, and keeps going, and (2) validation software (written by e.g. the company president's nephew) which happily accepts Ctrl-anykey in text fields -- and your database has a little bomb in it, just waiting for someone to produce a query to a flat file.

Unexpected EOF means that the bcp reader found an EOF when it was expecting more data. This EOF can be:
(1) the actual physical end-of-file (no more bytes to be read). This means that you have mis-formatted data. Check near the end of your file for an incomplete record.
OR
(2) on Windows, where you are, programs reading a file in text mode honour the ancient convention inherited via MS-DOS from CP/M of regarding Ctrl-Z (aka ^Z aka '\x1A' aka SUB aka SUBSTITUTE) as an end-of-file marker when reading from ANY file, not just a terminal. This includes Python -- the behaviour is determined by the C stdlib. Check for '\x1A' in your data.
Update responding to comments in a legible fashion:
In Notepad++, you can make it display unusual characters by doing View / Show Symbol / Show All Characters. You can search by doing Ctrl-F, typing \x1a in the Find What box, and selecting the Extended radio button in the Search panel.
Or, with a little bit of Python, you can get the line number of the first Ctrl-Z:
data = open('bcp.dat', 'rb').read()   # read the whole file as bytes
zpos = data.find(b'\x1a')             # offset of the first Ctrl-Z
if zpos == -1:
    print('no Ctrl-Z in file')
else:
    # count the line terminators before it to get a 1-based line number
    print(1 + data[:zpos].count(b'\r\n'))
Where your .dat was created doesn't matter. An unintentional Ctrl-Z can happen anywhere in a file created on any operating system. It is where it is being read as a text file that matters -- Windows? Bang!
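If a stray Ctrl-Z does turn out to be the culprit, a minimal cleanup sketch (assuming every \x1a byte is unwanted rather than legitimate binary data; the file name follows the snippet above):
data = open('bcp.dat', 'rb').read()
# strip the Ctrl-Z bytes so text-mode readers like bcp don't stop early
open('bcp_clean.dat', 'wb').write(data.replace(b'\x1a', b''))
Then point bcp at the cleaned copy.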

This is not a problem with missing EOF, but with EOF that is there and is not expected by bcp.
I am not a bcp tool expert, but it looks like there is some problem with the format of your data files.

Related

Python Compressed file ended before the end-of-stream marker was reached. But file is not Corrupted

I made some simple requests code that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7zip, everything is fine and the file decompresses as normal.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
this one gives me the Error: "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
the second way i tried was with 7zip, because by hand it worked fine
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError 22 Invalid Argument at the "with py7zr..." line.
I really don't understand where the problem is. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma', 'wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means the underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma', 'wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever
    # size is available in a single system call)
    # Using writelines means the iterator is consumed and the data written
    # as it arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
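Putting the corrected pieces together, a minimal end-to-end sketch (the URL and paths are the placeholders from the question; lzma.open is used because the file is a raw .lzma stream rather than a 7z archive):
import lzma
import requests

# download: stream straight to disk, with both resources reliably closed
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma', 'wb') as index_en:
    index_en.writelines(r.iter_content(None))

# decompress: lzma.open auto-detects raw .lzma/.xz streams and yields bytes lines
with lzma.open(r'C:\...\index_en.txt.lzma') as uncompressed:
    for line in uncompressed:
        print(line)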

"_csv.Error: line contains NULL byte" after truncating a csv log file that is being piped to by another process

How do I truncate a csv log file that is being used as std out pipe destination from another process without generating a _csv.Error: line contains NULL byte error?
I have one process running rtlamr > log/readings.txt that is piping radio signal data to readings.txt. I don't think it matters what is piping to the file--any long-running pipe process will do.
I have a file watcher using watchdog (Python file watcher) on that file, which triggers a function when the file is changed. The function reads the file and updates a database.
Then I try to truncate readings.txt so that it doesn't grow infinitely (or back it up).
file = open(dir_path+'/log/readings.txt', "w")
file.truncate()
file.close()
This corrupts readings.txt and generates the error (the start of the file contains garbage characters).
I tried moving the file instead of truncating it, in the hopes that rtlamr will recreate a fresh file, but that only has the effect of stopping the pipe.
EDIT
I noticed that the charset changes from us-ascii to binary but attempting to truncate the file with file = open(dir_path+'/log/readings.log', "w",encoding="us-ascii") does not do anything.
If you truncate a file [1] while another process has it open in w mode, that process will continue to write at its same offsets, making the file sparse. Low offsets will thus be read back as NUL (zero) bytes.
As per x11 - Concurrent writing to a log file from many processes - Unix & Linux Stack Exchange and Can two Unix processes simultaneous write to different positions in a single file?, each process that has a file open has its own offset in it, and a ftruncate() doesn't change that.
If you want the other process to react to truncation, it needs to have the file open in append mode (as shell >> redirection does), so that each write lands at the current end of file rather than at a remembered offset.
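A demonstration sketch (hypothetical file name; POSIX semantics, where O_APPEND moves the offset to end-of-file on every write):
import os

# writer opens with O_APPEND: every write goes to the current end of file
fd = os.open('demo.log', os.O_WRONLY | os.O_CREAT | os.O_APPEND)
os.write(fd, b'first line\n')

os.truncate('demo.log', 0)        # another process truncates the file

os.write(fd, b'second line\n')    # lands at offset 0, not offset 11
os.close(fd)
print(open('demo.log', 'rb').read())   # b'second line\n' -- no NUL gap
Without O_APPEND (e.g. plain > redirection), the second write would land at offset 11, leaving offsets 0-10 as a NUL-filled gap -- exactly the "line contains NULL byte" situation.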
Your approach has fundamental bugs, too. E.g. it's not atomic: you may (= will, eventually) truncate the file after the producer has added data but before you have read it, so that data would be lost.
Consider using dedicated data buffering utilities instead like buffer or pv as per Add a big buffer to a pipe between two commands.
[1] Which is superfluous, because open(mode='w') already does that. Either truncate or reopen; no need to do both.

Creating a remote a file over serial using python

I have a host connected to a Linux target over serial. The target is using the serial port for shell I/O. I need to save a text file on the target with contents from the host.
I thought I could get away with doing:
ser.write("cat > file.txt\n")
ser.write([contents I need to add to the file])
ser.write(chr(4))
ser.write(chr(4))
But the EOF characters I'm sending aren't closing the file. I've tried a few variations of chr(4)... \x04, print, str =, and a few others, but they all fail the same way.
If I simulate this with minicom, and follow up sending the [contents...] with uploading a 2 byte file that holds 0x04 0x04, the file closes as expected.
I haven't tried opening the "EOF" file in Python and sending it yet. I'll do it; I'm about out of options. But I'm new to Python, so I must be doing something wrong.
Any obvious newb-fixing answer to this one?
Thanks.
As a workaround, could you use a heredoc?
ser.write("cat > file.txt << END_OF_FILE\n")
ser.write([contents I need to add to the file])
ser.write("\nEND_OF_FILE\n");
For a more robust solution, you should probably have to look at some file transfer protocol over serial line, like Kermit.

Is it possible for `tail` to emit incomplete lines?

I am using tail -F log | python parse.py to monitor and parse a growing log file, but some parsing errors occur that may be caused by reading incomplete lines from the log file.
Is it possible that tail emits incomplete lines?
In the parser, I am reading rows with code like the following:
import csv
import sys
reader = csv.reader(sys.stdin)
for row in reader:
    # process
It is possible that tail can emit 'unparsable lines' - but only if invalid lines are written to the file. Kind of a circular bit of argument, but here's an example of how it could happen:
You tail -f on /var/log/syslog
syslog-ng dies in the middle of a block-spanning write. (Sectors are 512 bytes; your filesystem block size is most likely larger, although probably not much larger than 4096. So, say syslog has 9k of data buffered up to write out: it gets through one 4k page, and before it can write the next 4k+1k, syslog dies.) On ext2 at least, that'll end up as a partial write even after fsck. ext3? .. heh. I've been doing embedded for so long I can't remember... but I'd HOPE not. And who's to say that the data you're writing is always going to be correct? You might get a non-fatal string formatting error that doesn't include a newline that you're expecting.
You'll then have a partial line that isn't terminated with a newline (or even a \0), and the next time syslog starts up and starts appending, it'll just append to the end of the file with no notion of 'valid' records. So the first new record will be garbage, but the next one will be ok.
This is easy to exercise..
In one window
tail -f SOMEFILE
In another window
echo FOO >>SOMEFILE
echo BAR >>SOMEFILE
printf NO_NEWLINE >>SOMEFILE
echo I_WILL_HAVE_THE_LAST_LINE_PREFIXED_TO_ME_CAUSING_NERD_RAGE >>SOMEFILE
Since Linux's tail uses inotify by default, whatever is reading will get the last line without a newline and wait; when the next newline comes along, the new text ends up appended to NO_NEWLINE, i.e. to what it considers 'the latest line'.
If you want to do this 'the pythonic' way, if you're using Linux - use inotify, if you're using OSX or BSD use 'knotty' and negate the use of 'tail' as an input pipe and just watch the file yourself.
Tail might do weird things if you use the 'resync on truncate' too - i.e. if the file gets zeroed and restarted in the middle of a read, you might get some weird amount of data on the read since 'tail' will close the previously opened file handle in exchange for the new one.
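If you do keep consuming the pipe, a defensive sketch that buffers partial lines until a newline arrives (handle() is a hypothetical per-line callback):
import sys

buf = ''
# read fixed-size chunks; a chunk may well end in the middle of a line
for chunk in iter(lambda: sys.stdin.read(4096), ''):
    buf += chunk
    lines = buf.split('\n')
    buf = lines.pop()         # keep the trailing partial line (maybe '') for later
    for line in lines:
        handle(line)          # hypothetical per-line handler
if buf:
    handle(buf)               # input ended without a trailing newline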
Since you changed up the question on me.. new answer! :p
reader = csv.reader(sys.stdin)
for row in reader:
    try:
        validate_row_data_somehow(row)
        do_things_with_valid_row(row)
    except Exception:
        print("failed to process row", repr(row))

Emacs collaborative buffers open in the wrong mode

I am using Emacs and Rudel to collaborate with a remote programmer. Rudel has a concept of published buffers. When my partner publishes a buffer, I can subscribe to it and we can both edit it simultaneously.
My problem is that when he publishes a Python file with a *.py extension and I subscribe to it, my buffer is not set to python-mode automatically (it is in fundamental mode). How can I get it so that the buffer opens with the correct language mode?
I don't know Rudel well enough to give a 100% solution, but what you want to do is something like this:
(add-hook 'rudel-document-attach-hook 'my-rudel-set-mode-appropriately)

(defun my-rudel-set-mode-appropriately (document buffer)
  "Try to set the mode appropriately."
  (set-buffer buffer)
  (let ((buffer-file-name ...get-name-from-document...))
    (set-auto-mode)))
Only, you need to replace the ...get-name-from-document... portion of the code with something that evaluates to the file name that you want, for example, if the buffer is named myfile.py, then you can change that to (buffer-name). But, if the buffers get odd names, perhaps you need to extract the name from the document object (Rudel internally uses a document object to represent the thing you are sharing). So, if (buffer-name) doesn't work, you can try (rudel-suggested-buffer-name document).
i.e. try the above code but using one of these lines:
(let ((buffer-file-name (buffer-name)))
and
(let ((buffer-file-name (rudel-suggested-buffer-name document)))
set-auto-mode will use the value of buffer-file-name to determine the major mode using the general Emacs mechanisms.
I know absolutely nothing about how Rudel works. However, have you tried explicitly setting the mode in the text file? Try adding something like this to the first line of the file:
# -*- mode: python; fill-column: 75; comment-column: 50; -*-
Putting a line like this first in the file will cause Emacs to ignore the file's extension and open it in the given mode.
