Garbage in file after truncate(0) in Python

Assume there is a file test.txt containing the string 'test'.
Now consider the following Python code:
f = open('test.txt', 'r+')
f.read()
f.truncate(0)
f.write('passed')
f.flush()
I would now expect test.txt to contain 'passed'; however, there are additionally some strange symbols!
Update: flush after truncate does not help.

Yes, it's true that truncate() doesn't move the position, but, that said, the fix is as simple as can be:
f.read()
f.seek(0)
f.truncate(0)
f.write('passed')  # now the write starts at position 0
f.close()
This works perfectly ;)

This is because truncate doesn't change the stream position.
When you read() the file, you move the position to the end, so subsequent writes will write to the file from that position. However, when you call flush(), it seems that it not only tries to write the buffer to the file, but also does some error checking and fixes the current file position. When flush() is called after truncate(0), it writes nothing (the buffer is empty), then checks the file size and places the position at the first valid offset (which is 0).
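You can watch this happen with a minimal sketch (whether flush() repositions for you is platform-dependent, as discussed below):
f = open('test.txt', 'r+')
f.read()        # position is now at the end of the file (4)
f.truncate(0)   # the file is now empty...
print(f.tell()) # ...but the position is unchanged (still 4)
f.seek(0)       # reposition explicitly before writing
f.write('passed')
f.close()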
UPDATE
Python's file functions are NOT just wrappers around the C standard library equivalents, but knowing the C functions helps in understanding what is happening more precisely.
From the ftruncate man page:
The value of the seek pointer is not modified by a call to ftruncate().
From the fflush man page:
If stream points to an input stream or an update stream into which the most recent operation was input, that stream is flushed if it is seekable and is not already at end-of-file. Flushing an input stream discards any buffered input and adjusts the file pointer such that the next input operation accesses the byte after the last one read.
This means that if you put the flush before the truncate, it has no effect. I checked, and indeed it was so.
But for putting flush after truncate:
If stream points to an output stream or an update stream in which the most recent operation was not input, fflush() causes any unwritten data for that stream to be written to the file, and the st_ctime and st_mtime fields of the underlying file are marked for update.
The man page doesn't mention the seek pointer when describing output streams whose most recent operation was not input. (Here, our last operation is the truncate.)
UPDATE 2
I found something in the Python source code, at Python-3.2.2\Modules\_io\fileio.c:837:
#ifdef HAVE_FTRUNCATE
static PyObject *
fileio_truncate(fileio *self, PyObject *args)
{
    PyObject *posobj = NULL; /* the new size wanted by the user */
#ifndef MS_WINDOWS
    Py_off_t pos;
#endif
    ...
#ifdef MS_WINDOWS
    /* MS _chsize doesn't work if newsize doesn't fit in 32 bits,
       so don't even try using it. */
    {
        PyObject *oldposobj, *tempposobj;
        HANDLE hFile;

        ////// THIS LINE //////////////////////////////////////////////////////////////
        /* we save the file pointer position */
        oldposobj = portable_lseek(fd, NULL, 1);
        if (oldposobj == NULL) {
            Py_DECREF(posobj);
            return NULL;
        }

        /* we then move to the truncation position */
        ...

        /* Truncate. Note that this may grow the file! */
        ...

        ////// AND THIS LINE //////////////////////////////////////////////////////////
        /* we restore the file pointer position in any case */
        tempposobj = portable_lseek(fd, oldposobj, 0);
        Py_DECREF(oldposobj);
        if (tempposobj == NULL) {
            Py_DECREF(posobj);
            return NULL;
        }
        Py_DECREF(tempposobj);
    }
#else
    ...
#endif /* HAVE_FTRUNCATE */
Look at the two lines I indicated (////// THIS LINE //////). If your platform is Windows, then Python saves the file pointer position and restores it after the truncate.
To my surprise, most of the flush functions inside the Python 3.2.2 sources either did nothing or did not call the C fflush function at all. The truncate code in 3.2.2 was also largely undocumented. However, I did find something interesting in the Python 2.7.2 sources. First, I found this in Python-2.7.2\Objects\fileobject.c:812, in the truncate implementation:
/* Get current file position. If the file happens to be open for
* update and the last operation was an input operation, C doesn't
* define what the later fflush() will do, but we promise truncate()
* won't change the current position (and fflush() *does* change it
* then at least on Windows). The easiest thing is to capture
* current pos now and seek back to it at the end.
*/
So, to summarize it all, I think this is entirely platform-dependent. I checked on the default Python 3.2.2 for Windows x64 and got the same results as you. I don't know what happens on *nixes.

If anyone is in the same boat as me, here is my problem and my solution:
I have a program that is always ON, i.e. it never stops; it keeps polling for data and writes it to a log file.
The problem is that I want to roll the main file over as soon as it reaches the 10 MB mark, so I wrote the program below.
I also found the solution to the problem where truncate was writing null values to the file, causing further problems.
Below is an illustration of how I solved this issue.
import datetime
import os
from random import randint
from shutil import copyfile

f1 = open('client.log', 'w')
nowTime = datetime.datetime.now().time()
f1.write(os.urandom(1024*1024*15)) # adding random data worth 15 MB
f1.flush() # make sure the data is on disk before checking the size
if (int(os.path.getsize('client.log') / 1048576) > 10): # checking if the file size exceeds 10 MB
    print 'File size limit Exceeded, needs trimming'
    dst = 'client_' + str(randint(0, 999999)) + '.log'
    copyfile('client.log', dst) # copying file to another one
    print 'Copied content to ' + str(dst)
    print 'Erasing current file'
    f1.truncate(0) # truncating the data; this works fine but leaves the position at the end
    f1.seek(0) # very important after truncate so that new data begins at offset 0
    print 'File truncated successfully'
    f1.write('This is fresh content') # dummy content
f1.close()
print 'All Job Processed'
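As an aside, if the goal is just size-based rotation, the standard library ships logging.handlers.RotatingFileHandler, which does this automatically. A minimal sketch (the names and limits here are illustrative):
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger('client')
# roll client.log over at ~10 MB, keeping up to 5 old files
# (client.log.1, client.log.2, ...)
handler = RotatingFileHandler('client.log', maxBytes=10*1024*1024, backupCount=5)
logger.addHandler(handler)
logger.warning('This is fresh content')  # written to client.log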

Truncate doesn't change the file position.
Note also that even if the file is opened in read+write you cannot just switch between the two types of operation (a seek operation is required to be able to switch from read to write or vice versa).

I expect the following is the code you meant to write:
open('test.txt').read()
open('test.txt', 'w').write('passed')

It depends. If you want to keep the file open and access it without closing it, then flush will force the pending writes out to the file. If you're closing the file right after the flush, then no, you don't need it, because close will flush for you. That's my understanding from the docs.
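For example, a minimal sketch: the with statement closes the file on exit, and closing implies a flush, so no explicit flush() is needed here:
with open('test.txt', 'w') as f:
    f.write('passed')
# the file is flushed and closed here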


python os.read(fd, n) requires parameter n, why?

I need to read a text file with the os module as such:
t = os.open('te.txt', os.O_RDONLY)
r = os.read(t, 20)
rs = r.decode('utf-8')
print(rs)
What if I don't know the byte size of the file? I could put a very large number instead of 20, since a value seems to be required, but perhaps there is a more Pythonic way.
The second argument isn't supposed to hold the size of the file in bytes; it's only supposed to hold the maximum amount of content you're prepared to read at a time (which should typically be divisible by both your operating system's block size and page size; 64kb is not a bad default).
The "why" of this is because memory has to be allocated in userspace before the kernel can be instructed to write content into that memory. This isn't the kind of detail that Python developers need to think about often, but you're using a low-level interface built for use from C; it accordingly has implementation details leaking out of that underlying layer.
The operating system is free to give you less than the number of bytes you indicate as a maximum (for example, if it gets interrupted, or the filesystem driver isn't written to provide that much data at a time), so no matter what, you need to be prepared to call it repeatedly; only when it returns an empty string (as opposed to throwing an exception or returning a shorter-than-requested string) are you certain to have reached the end of the file.
os.read() isn't a Pythonic interface, and it isn't supposed to be. It's a thin wrapper around the syscall provided by the operating system kernel. If you want a Pythonic interface, don't use os.read(), but instead use Python's native file objects.
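To make that concrete, here is a minimal sketch of the usual loop over os.read(): keep reading bounded chunks until it returns an empty bytes object:
import os

fd = os.open('te.txt', os.O_RDONLY)
chunks = []
while True:
    chunk = os.read(fd, 65536)  # read at most 64 KiB per call
    if not chunk:               # b'' signals end of file
        break
    chunks.append(chunk)
os.close(fd)
print(b''.join(chunks).decode('utf-8'))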
If you wanted to load the whole file and you have to use os, you could use os.stat(filename).st_size or os.path.getsize(filename) to get the size of the file in bytes.
import os

filename = 'te.txt'
t = os.open(filename, os.O_RDONLY)
b = os.stat(filename).st_size
r = os.read(t, b)
os.close(t)
rs = r.decode('utf-8')
print(rs)

Why can't node handle this regex but python can?

I have a large text file that I am extracting URLs from. If I run:
import re
with open('file.in', 'r') as fh:
    for match in re.findall(r'http://matchthis\.com', fh.read()):
        print match
it runs in a second or so of user time and gets the URLs I wanted, but if I run either of these:
var regex = /http:\/\/matchthis\.com/g;
fs.readFile('file.in', 'ascii', function(err, data) {
    while (match = regex.exec(data))
        console.log(match);
});
OR
fs.readFile('file.in', 'ascii', function(err, data) {
    var matches = data.match(/http:\/\/matchthis\.com/g);
    for (var i = 0; i < matches.length; ++i) {
        console.log(matches[i]);
    }
});
I get:
FATAL ERROR: CALL_AND_RETRY_0 Allocation failed - process out of memory
What is happening with the node.js regex engine? Is there any way I can modify things such that they work in node?
EDIT: The error appears to be fs-centric, as this also produces the error:
fs.readFile('file.in', 'ascii', function(err, data) {
});
file.in is around 800MB.
You should process the file line by line using the streaming file interface. Something like this:
var fs = require('fs');
var byline = require('byline'); // third-party module: npm install byline

var input = fs.createReadStream('tmp.txt');
var lines = input.pipe(byline.createStream());

lines.on('readable', function() {
    var line;
    while ((line = lines.read()) !== null) { // drain all lines currently available
        var matches = line.toString('ascii').match(/http:\/\/matchthis\.com/g);
        if (matches) { // match() returns null when a line has no matches
            for (var i = 0; i < matches.length; ++i) {
                console.log(matches[i]);
            }
        }
    }
});
In this example, I'm using the byline module to split the stream into lines so that you won't miss matches by getting partial chunks of lines per .read() call.
To elaborate more, what you were doing is allocating ~800MB of RAM as a Buffer (outside of V8's heap) and then converting that to an ASCII string (and thus transferring it into V8's heap), which will take at least 800MB and likely more depending on V8's internal optimizations. I believe V8 stores strings as UCS2 or UTF16, which means each character will be 2 bytes (given ASCII input) so your string would really be about 1600MB.
Node's max allocated heap space is 1.4GB, so by trying to create such a large string, you cause V8 to throw an exception.
Python does not have this problem because it does not have a maximum heap size and will chew through all of your RAM. As others have pointed out, you should also avoid fh.read() in Python since that will copy all the file data into RAM as a string instead of streaming it line by line with an iterator.
Given that both programs are trying to read the entire ~800MB file into memory, I'd suggest the difference lies in how Node and Python handle large strings. Try doing a line-by-line search and the problem should disappear.
For example, in Python you can do this:
import re
with open('file.in', 'r') as file:
    for line in file:
        for match in re.findall(r'http://matchthis\.com', line):
            print match

Mixing read() and write() on Python files in Windows

It appears that a write() immediately following a read() on a file opened with r+ (or r+b) permissions in Windows doesn't update the file.
Assume there is a file testfile.txt in the current directory with the following contents:
This is a test file.
I execute the following code:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.write("----")
I would expect the code to print This and update the file contents to this:
This----a test file.
This works fine on at least Linux. However, when I run it on Windows then the message is displayed correctly, but the file isn't altered - it's like the write() is being ignored. If I call tell() on the filehandle it shows that the position has been updated (it's 4 before the write() and 8 afterwards), but no change to the file.
However, if I put an explicit fd.seek(4) just before the write() line then everything works as I'd expect.
Does anybody know the reason for this behaviour under Windows?
For reference I'm using Python 2.7.3 on Windows 7 with an NTFS partition.
EDIT
In response to comments, I tried both r+b and rb+ - the official Python docs seem to imply the former is canonical.
I put calls to fd.flush() in various places, and placing one between the read() and the write() like this:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.flush()
fd.write("----")
... yields the following interesting error:
IOError: [Errno 0] Error
EDIT 2
Indirectly, that addition of a flush() helped, because it led me to this post describing a similar problem. If one of the commenters on it is correct, it's a bug in the underlying Windows C library.
Python's file operations should follow the libc conventions, as internally they are implemented using C file I/O functions.
Quoting from the fopen man page or the fopen page on cplusplus.com:
For files open for update (those which include a "+" sign), on which both input and output operations are allowed, the stream should be flushed (fflush) or repositioned (fseek, fsetpos, rewind) between either a writing operation followed by a reading operation or a reading operation which did not reach the end-of-file followed by a writing operation.
So to summarize: if you need to read a file after writing, you must fflush the buffer, and a write operation after a read should be preceded by an fseek, e.g. fd.seek(0, os.SEEK_CUR).
So just change your code snippet to:
import os

with open("test1.txt", "r+b") as fd:
    print fd.read(4)
    fd.seek(0, os.SEEK_CUR)
    fd.write("----")
The behavior is consistent with how a similar C program would behave:
#include <cstdio>

int main()
{
    char buffer[5] = {0};
    FILE *fp = fopen("D:\\Temp\\test1.txt", "rb+");

    fread(buffer, sizeof(char), 4, fp);
    printf("%s\n", buffer);

    /* without fseek, the file would not be updated */
    fseek(fp, 0, SEEK_CUR);
    fwrite("----", sizeof(char), 4, fp);

    fclose(fp);
    return 0;
}
It appears that this is due to the behaviour of the underlying Windows libraries (which I personally regard as being in error) and nothing wrong with Python. On adding a flush() call between reading and writing (which is apparently good practice) I got an IOError with a zero errno, which is the same issue as discussed in this blog post.
From that post I found this Python issue which mentions the problem and says that the seek() call is actually the best workaround, along with a flush() every time you change from reading to writing.
All that taken into account, it seems the best way to write the code above such that it successfully runs on Windows is:
with open("testfile.txt", "r+b") as fd:
print fd.read(4)
fd.flush()
fd.seek(4)
fd.write("----")
Might be something to bear in mind for anybody attempting to write portable code.
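If you switch between reading and writing in many places, it may be worth wrapping that dance in a small helper. A hypothetical sketch (prepare_to_write is my own name, not a standard API):
import os

def prepare_to_write(fd):
    # flush buffered data and reposition in place before switching
    # from reading to writing, per the workaround described above
    fd.flush()
    fd.seek(0, os.SEEK_CUR)

with open("testfile.txt", "r+b") as fd:
    print fd.read(4)
    prepare_to_write(fd)
    fd.write("----")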
Have you tried flushing?
fd.flush()
It is OS-dependent, as write uses the filesystem's caching mechanism.
Is it possible that the implementation misinterprets "r+b"? AFAIK, "rb+" is for reading and writing in binary.

Read pcap header length field with python

I have captured some packets using the pcap library in C. Now I am using a Python program to read that saved packet file, but I have a problem. I have a file which first has a pkthdr (provided by the library) and then the actual packet.
The format of pcap_pkthdr is:
struct pcap_pkthdr {
    struct timeval ts;    /* time stamp */
    bpf_u_int32 caplen;   /* length of portion present */
    bpf_u_int32 len;      /* length this packet (off wire) */
};
Now I want to read the len field, so I have skipped timeval and caplen, and printed the len field using Python in binary form. The binary value I got is:
01001010 00000000 00000000 00000000
Now how do I read it as a u_int32? I don't think it is the correct value (too large); the actual len field value should be 74 bytes (checked in Wireshark). So please tell me what I am doing wrong.
Thanks in advance.
Or have a look at the pylibpcap module, the pypcap module, or the pcapy module, which let you just call pcap APIs with relative ease. That way you don't have to care about the details of pcap files, and your code will, with libpcap 1.1 or later, also be able to read at least some of the pcap-ng files that Wireshark can produce and that it will produce by default in the 1.8 release.
Writing your own code to read pcap files, rather than relying on libpcap/WinPcap to do so, is rarely worth doing. (Wireshark does so, as part of its library that reads a number of capture file formats and supports pcap-ng format in ways that the current pcap API can't, but the library in question also supports pcap-ng....)
Have a look at the struct module, which lets you unpack such binary data with relative ease. Note that on disk the record header is four 32-bit values (ts_sec, ts_usec, caplen, len), so for example:
struct.unpack('IIII', yourbuffer)
This will give you a tuple of the four (I = 32-bit unsigned int) values. If the len value doesn't seem right, the byte order of the file is different from your native one. In that case prefix the format string with either > (big-endian) or < (little-endian):
struct.unpack('>IIII', yourbuffer)
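Incidentally, the bytes the question prints (01001010 00000000 00000000 00000000) are exactly 74 when read as a little-endian 32-bit integer, so the captured data looks correct; only the interpretation was missing. Here is a minimal sketch of pulling the first packet's len out of a classic pcap file with struct ('capture.pcap' is a hypothetical filename):
import struct

with open('capture.pcap', 'rb') as f:
    global_hdr = f.read(24)                      # pcap global header is 24 bytes
    magic = struct.unpack('<I', global_hdr[:4])[0]
    endian = '<' if magic == 0xa1b2c3d4 else '>' # 0xa1b2c3d4 means a little-endian file
    record_hdr = f.read(16)                      # per-packet pcap_pkthdr is 16 bytes on disk
    ts_sec, ts_usec, caplen, length = struct.unpack(endian + 'IIII', record_hdr)
    print(length)                                # should print 74 for the packet above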

Python EOF for multi byte requests of file.read()

The Python docs on file.read() state that "An empty string is returned when EOF is encountered immediately." The documentation further states:
Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
I believe Guido has made his view on not adding f.eof() PERFECTLY CLEAR, so we need to use the Python way!
What is not clear to ME, however, is whether it is a definitive test for EOF if you receive fewer bytes than requested from a read, but did receive some.
i.e.:
with open(filename, 'rb') as f:
    while True:
        s = f.read(size)
        l = len(s)
        if l == 0:
            break # it is clear that this is EOF...
        if l < size:
            break # ? Is receiving less than the request EOF???
Is it a potential error to break if you have received less than the bytes requested in a call to file.read(size)?
You are not thinking with your snake skin on... Python is not C.
First, a review:
st=f.read() reads to EOF, or if opened as a binary, to the last byte;
st=f.read(n) attempts to reads n bytes and in no case more than n bytes;
st=f.readline() reads a line at a time, the line ends with '\n' or EOF;
st=f.readlines() uses readline() to read all the lines in a file and returns a list of the lines.
If a file read method is at EOF, it returns ''. The same type of EOF test is used in the other "file-like" methods, like StringIO, socket.makefile, etc. A return of fewer than n bytes from f.read(n) is most assuredly NOT a dispositive test for EOF! While that code may work 99.99% of the time, it is the times it does not work that would be very frustrating to find. Plus, it is bad Python form. The only use for n in this case is to put an upper limit on the size of the return.
What are some of the reasons the Python file-like methods returns less than n bytes?
EOF is certainly a common reason;
A network socket may timeout on read yet remain open;
Exactly n bytes may cause a break between logical multi-byte characters (such as \r\n in text mode and, I think, a multi-byte character in Unicode) or some underlying data structure not known to you;
The file is in non-blocking mode and another process begins to access the file;
Temporary non-access to the file;
An underlying error condition, potentially temporary, on the file, disc, network, etc.
The program received a signal, but the signal handler ignored it.
I would rewrite your code in this manner:
with open(filename, 'rb') as f:
    while True:
        s = f.read(max_size)
        if not s: break
        # process the data in s...
Or, write a generator:
def blocks(infile, bufsize=1024):
    while True:
        try:
            data = infile.read(bufsize)
            if data:
                yield data
            else:
                break
        except IOError as (errno, strerror):
            print "I/O error({0}): {1}".format(errno, strerror)
            break

f = open('somefile', 'rb')
for block in blocks(f, 2**16):
    # process a block that COULD be up to 65,536 bytes long
    pass
Here's what my C compiler's documentation says for the fread() function:
size_t fread(
    void *buffer,
    size_t size,
    size_t count,
    FILE *stream
);
fread returns the number of full items actually read, which may be less than count if an error occurs or if the end of the file is encountered before reaching count.
So it looks like getting less than size means either an error has occurred or EOF has been reached -- so breaking out of the loop would be the correct thing to do.
