File seek operation is randomly missing bytes while writing to a file - Python

I am in the process of porting code from Python 2 to Python 3. The code below works fine in Python 2, but in Python 3 it occasionally overwrites a couple of bytes at the end of the previous line while writing a new packet to the file. This causes an error rate of about 10% when reading the packets back out of the file (the error rate was around 2% in Python 2).
logfile = open(filepath, 'w+')
# Gets the offset to write to the file (EOF)
offset = self.enddict[fname]
# The output message
outmsg = "%ld\n%d\n%s\n" % (now, msg_len, msg)
#Seeks to the given offset and writes the message out
logfile.seek(offset)
logfile.write(outmsg)
I've tried a couple of solutions to resolve this issue, but haven't found the right one so far:
Add extra new lines to the beginning and end of the output message. This seems to mitigate the issue (reduces the error rate to 2%), but it doesn't seem like a viable solution as we'd need to change various readers that are reading off the file downstream.
outmsg = "\n\n\n\n\n\n\n\n\n\n%ld\n%d\n%s\n\n\n\n\n\n\n\n\n\n" % (now, msg_len, msg)
Use io.SEEK_END. This writes the packets correctly to the file and the error rate drops close to 0%. But it throws off the offsets written to the DB: when we read a chunk from the file using the offsets stored in the DB, we get a corrupted chunk.
logfile.seek(0, io.SEEK_END)
I did some research into using os.lseek and found it to be slower than seek.

The solution below seems to get rid of the missing-bytes issue during the seek operation:
self.logfile.seek(offset)
off_bytes_count = len(self.logfile.read())
if off_bytes_count:
    offset += off_bytes_count
self.logfile.seek(offset)
Here, off_bytes_count is the number of bytes remaining after the seek operation.
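A minimal sketch of an alternative (this is an assumption on my part, not the original code: it presumes the offsets in self.enddict should always point at the current end of the file, that there is a single writer, and that UTF-8 is an acceptable encoding): open the log in binary append mode and record tell() around each write, so the stored offset always matches the bytes actually on disk.
# Sketch only: single writer assumed; offsets are byte positions.
logfile = open(filepath, 'ab')                 # append mode: writes always land at EOF
outmsg = "%ld\n%d\n%s\n" % (now, msg_len, msg)
offset = logfile.tell()                        # byte offset where this packet starts
logfile.write(outmsg.encode('utf-8'))          # write bytes so offsets stay byte-accurate
logfile.flush()
self.enddict[fname] = logfile.tell()           # offset for the next packet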

Related

Python Compressed file ended before the end-of-stream marker was reached. But file is not Corrupted

I made a simple requests script that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I then extract the file manually in the directory with 7zip, everything is fine and the file decompresses normally.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
this one gives me the Error: "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
the second way i tried was with 7zip, because by hand it worked fine
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError 22 Invalid Argument at the "with py7zr..." line.
I really don't understand where the problem is here. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
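For completeness, a minimal sketch of reading the file back once it has been fully written and closed (assuming the download really is a bare .lzma stream with no leading text line), using lzma.open rather than wrapping an already-open text-mode file:
import lzma

# Sketch only: assumes the file is a raw LZMA stream and decodes as UTF-8.
with lzma.open(r'C:\...\index_en.txt.lzma', 'rt', encoding='utf-8') as uncompressed:
    for line in uncompressed:
        print(line, end='')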

How to follow a large file in python without loading it all in the memory?

I tried the solution in this answer, but since the file I'm tailing can grow larger than 50 GB, the server is choking. Any suggestions on how to follow it without storing it all in memory?
You don't need an external app.
import time
fsl = open('/var/syslog')
# Seek to end
fsl.seek(0, 2)
last = fsl.tell()
# Back up 200 bytes
fsl.seek(last - 200)
while True:
    print(fsl.read())
    time.sleep(5)
Now, this will print an extra linefeed every time it tries to get more data, but you can fix that.
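One way to fix that (a sketch, not the original answer's code): only print when read() actually returns new data, and sleep otherwise.
import time

# Sketch: poll the file and emit output only when something new has arrived.
fsl = open('/var/syslog')
fsl.seek(0, 2)                    # start at the current end of file
while True:
    chunk = fsl.read()
    if chunk:
        print(chunk, end='')      # the data carries its own newlines
    else:
        time.sleep(5)             # nothing new yet; wait and poll again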

Reading a file line-by-line with a timeout for lines that are taking too long?

I have a 1.2TB file that I am running some code against, but constantly running into OutOfMemoryError exceptions. I ran the following two pieces of code against the file to see what was wrong:
import sys
with open(sys.argv[1]) as f:
    count = 1
    for line in f:
        if count > 173646280:
            print line
        else:
            print count
        count += 1
And this code:
#!/usr/bin/env perl
use strict;
use warnings;
my $count = 1;
while (<>) {
    print "$count\n";
    $count++;
}
Both of them zoom until they hit line 173,646,264, and then they just completely stop. Let me just give a quick background on the file.
I created a file called groupBy.json. I then processed that file with some Java code to transform the JSON objects and created a file called groupBy_new.json. I put groupBy_new.json on S3, pulled it down on another server and was doing some processing on it when I started getting OOM errors. I figured that maybe the file got corrupted when transferring to S3. I ran the above Python/Perl code on groupBy_new.json on both serverA (the server where it was originally created) and serverB (the server onto which I pulled the file off S3), and both halted at the same line. I then ran the above Python/Perl code on groupBy.json, the original file, and it also halted. I tried to recreate groupBy_new.json with the same code that I had used to originally create it, and ran into an OOM error.
So this is a really odd problem that is perplexing me. In short, I'd like to get rid of this line that is causing me problems. What I'm trying to do is read a file with a timeout on the line being read. If it cannot read the input line in 2 seconds or so, move on to the next line.
What you can do is count the number of lines until the problem line and output it (make sure you flush the output; see https://perl.plover.com/FAQs/Buffering.html). Then write another program that copies that many lines to a different file, then reads the input stream character by character (see http://perldoc.perl.org/functions/read.html) until it hits a "\n", and then copies the rest of the file, either line by line or in chunks. A Python sketch of that second program follows below.
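The sketch below uses placeholder filenames and a placeholder line number (to be filled in from the counting step); the oversized line is skipped in fixed-size chunks so it never has to fit in memory.
BAD_LINE = 173646264      # placeholder: 1-based number of the problem line
CHUNK = 64 * 1024

with open('groupBy_new.json', 'rb') as src, open('groupBy_fixed.json', 'wb') as dst:
    # Copy the lines before the problem line as-is.
    for _ in range(BAD_LINE - 1):
        dst.write(src.readline())
    # Skip the problem line chunk by chunk until its terminating newline.
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break                                   # EOF inside the bad line
        nl = chunk.find(b'\n')
        if nl != -1:
            src.seek(nl + 1 - len(chunk), 1)        # rewind to just past the newline
            break
    # Copy the rest of the file in chunks.
    while True:
        chunk = src.read(CHUNK)
        if not chunk:
            break
        dst.write(chunk)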

opening and writing a large binary file python

I have a homebrew web-based file system that allows users to download their files as zips; however, I found an issue while developing on my local box that is not present on the production system.
In linux this is a non-issue (the local dev box is a windows system).
I have the following code
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    temp_file.write(dec_data)
    data = file.read(settings.READ_SIZE)
# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).
My issue is with the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since Windows cares a metric ton about binary files, if I don't specify 'rb' the file will not read to the end in the data read loop; however, for some reason, since I am also writing to temp_file, it never completely reads to the end of the file...UNLESS I add a + after the b ('rb+').
If I change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+'), everything works as desired and the code successfully scrapes the entire binary file and decrypts it. If I do not add the plus, it fails and cannot read the entire file...
Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    response.write(dec_data)
    data = file.read(settings.READ_SIZE)
# no dumps to be taken when finishing up the decrypt process...
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
Clarification
The cipher error is likely because the file was not read in its entirety. For example, I have a 500MB file I am reading in at 64*1024 bytes at a time. I read until I receive no more bytes. When I don't specify b on Windows, it cycles through the loop twice and returns some crappy data (because Python thinks it is interacting with a text file, not a binary file).
When I specify b, it takes 10-15 seconds to completely read in the file, but it does so successfully, and the code completes normally.
When I am concurrently writing to another file as I read from the source file (as in the first example), if I do not specify rb+ it displays the same behavior as not specifying b at all: it only reads a couple of segments from the file before closing the handle and moving on, I end up with an incomplete file, and the decryption fails.
I'm going to take a guess here:
You have some other program that's continually replacing the files you're trying to read.
On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.
On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.
Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.
In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…
Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*
* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.
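If fixing the other program is on the table, os.replace (Python 3.3+) gives you an atomic swap on both POSIX and Windows, as long as source and destination live on the same volume. A minimal sketch of how the rewriting side could do it (the function name is just for illustration):
import os
import tempfile

def atomic_rewrite(path, data):
    # Write the new contents to a temporary file in the same directory,
    # then atomically swap it into place: readers see either the old file
    # or the new one, never a half-written version.
    dir_name = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)   # atomic rename-over on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise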

Python EOF for multi byte requests of file.read()

The Python docs on file.read() state that "An empty string is returned when EOF is encountered immediately." The documentation further states:
Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
I believe Guido has made his view on not adding f.eof() PERFECTLY CLEAR, so we need to use the Python way!
What is not clear to ME, however, is whether it is a definitive test for EOF if you receive fewer than the requested bytes from a read, but you did receive some.
ie:
with open(filename,'rb') as f:
    while True:
        s = f.read(size)
        l = len(s)
        if l == 0:
            break  # it is clear that this is EOF...
        if l < size:
            break  # ? Is receiving less than the request EOF???
Is it a potential error to break if you have received less than the bytes requested in a call to file.read(size)?
You are not thinking with your snake skin on... Python is not C.
First, a review:
st=f.read() reads to EOF, or if opened as a binary, to the last byte;
st=f.read(n) attempts to read n bytes and in no case more than n bytes;
st=f.readline() reads a line at a time, the line ends with '\n' or EOF;
st=f.readlines() uses readline() to read all the lines in a file and returns a list of the lines.
If a file read method is at EOF, it returns ''. The same type of EOF test is used in the other "file-like" methods like StringIO, socket.makefile, etc. A return of less than n bytes from f.read(n) is most assuredly NOT a dispositive test for EOF! While that code may work 99.99% of the time, it is the times it does not work that would be very frustrating to find. Plus, it is bad Python form. The only use for n in this case is to put an upper limit on the size of the return.
What are some of the reasons the Python file-like methods returns less than n bytes?
EOF is certainly a common reason;
A network socket may timeout on read yet remain open;
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in text mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
The file is in non-blocking mode and another process begins to access the file;
Temporary non-access to the file;
An underlying error condition, potentially temporary, on the file, disc, network, etc.
The program received a signal, but the signal handler ignored it.
I would rewrite your code in this manner:
with open(filename,'rb') as f:
    while True:
        s = f.read(max_size)
        if not s: break
        # process the data in s...
Or, write a generator:
def blocks(infile, bufsize=1024):
    while True:
        try:
            data = infile.read(bufsize)
            if data:
                yield data
            else:
                break
        except IOError as (errno, strerror):
            print "I/O error({0}): {1}".format(errno, strerror)
            break

f = open('somefile','rb')
for block in blocks(f, 2**16):
    # process a block that COULD be up to 65,536 bytes long
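If a consumer really does need exactly n bytes at a time, here is a sketch of a small helper (hypothetical name, Python 3 bytes semantics) that keeps calling read() and treats only an empty return as EOF, which is the point being made above: a single short read is not, by itself, an EOF signal.
def read_exact(infile, n):
    # Accumulate data until we have n bytes or read() returns b'' (true EOF).
    parts = []
    remaining = n
    while remaining > 0:
        chunk = infile.read(remaining)
        if not chunk:              # empty return is the only reliable EOF test
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b''.join(parts)         # shorter than n only if EOF was reached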
Here's what my C compiler's documentation says for the fread() function:
size_t fread(
    void *buffer,
    size_t size,
    size_t count,
    FILE *stream
);
fread returns the number of full items actually read, which may be less than count if an error occurs or if the end of the file is encountered before reaching count.
So it looks like getting less than size means either an error has occurred or EOF has been reached -- so breaking out of the loop would be the correct thing to do.
