fp = open("a.txt")
#do many things with fp
c = fp.read()
if c is None:
print('fp is at the eof')
Besides the above method, is there any other way to find out whether fp is already at EOF?
fp.read() reads up to the end of the file, so after it's successfully finished you know the file is at EOF; there's no need to check. If it cannot reach EOF it will raise an exception.
When reading a file in chunks rather than all at once with read(), you know you've hit EOF when read returns fewer bytes than you requested. In that case, the next read call will return the empty string (not None). The following loop reads a file in chunks; it will call read at most one extra time.
assert n > 0
while True:
chunk = fp.read(n)
if chunk == '':
break
process(chunk)
Or, shorter:
for chunk in iter(lambda: fp.read(n), ''):
process(chunk)
The "for-else" design is often overlooked. See: Python Docs "Control Flow in Loop":
Example
with open('foobar.file', 'rb') as f:
for line in f:
foo()
else:
# No more lines to be read from file
bar()
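Note that the else clause runs whenever the for loop finishes without hitting a break; the example above contains no break, so bar() always runs once the file is exhausted. Here is a sketch that actually distinguishes the two outcomes (the STOP marker is hypothetical):
with open('foobar.file', 'rb') as f:
    for line in f:
        if line.startswith(b'STOP'):  # hypothetical marker line
            break                     # the else clause is skipped
    else:
        bar()                         # ran out of lines without breaking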
I'd argue that reading from the file is the most reliable way to establish whether it contains more data. It could be a pipe, or another process might be appending data to the file, etc.
If you know that's not an issue, you could use something like:
f.tell() == os.fstat(f.fileno()).st_size
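Wrapped as a reusable helper, that check might look like this (a minimal sketch; the helper name is my own, and it only works for regular, seekable files, not pipes or sockets):
import os

def at_end(f):
    # Compare the current offset with the on-disk size of the file.
    return f.tell() == os.fstat(f.fileno()).st_size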
Since Python returns an empty string on EOF, not "EOF" itself, you can just check for it, as written here:
f1 = open("sample.txt")
while True:
    line = f1.readline()
    print(line)
    if line == "":
        print("file finished")
        break
When doing binary I/O the following method is useful:
while f.read(1):   # probe one byte; an empty result means EOF
    f.seek(-1, 1)  # step back over the probe byte
    # whatever
The advantage is that sometimes you are processing a binary stream and do not know in advance how much you will need to read.
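For example, the probe can be wrapped in a small helper that peeks one byte ahead and restores the position (a sketch of the pattern above; the names are mine):
def more_data(f):
    # Only valid for files opened in binary mode: text mode forbids
    # relative seeks from the current position.
    if f.read(1):
        f.seek(-1, 1)  # un-read the probe byte
        return True
    return False

with open('records.bin', 'rb') as f:  # hypothetical file
    while more_data(f):
        header = f.read(4)  # read as much as the format dictates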
You can compare the value returned by fp.tell() before and after calling the read method. If the values are the same, fp is at EOF.
Furthermore, I don't think your example code actually works. The read method, to my knowledge, never returns None; it returns an empty string on EOF.
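That comparison can be packaged as a helper (a minimal sketch; the function name is my own, and it restores the position afterwards):
def at_eof(fp):
    before = fp.tell()
    fp.read(1)               # try to read one more byte
    after = fp.tell()
    fp.seek(before)          # restore the original position
    return before == after   # nothing was read, so fp was at EOF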
Here is a way to do this with the walrus operator (new in Python 3.8):
n = 1024  # chunk size
f = open("a.txt", "r")
while (c := f.read(n)):
    process(c)
f.close()
Useful Python Docs (3.8):
Walrus operator: https://docs.python.org/3/whatsnew/3.8.html#assignment-expressions
Methods of file objects: https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
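The same loop reads a little more safely with a context manager, which closes the file automatically (a sketch reusing the hypothetical process function and chunk size from above):
n = 1024  # chunk size
with open("a.txt") as f:
    while (c := f.read(n)):
        process(c)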
read returns an empty string when EOF is encountered; see the io module docs.
f = open(file_name)
for line in f:
    print(line)
I really don't understand why Python still doesn't have such a function. I also don't agree with using the following
f.tell() == os.fstat(f.fileno()).st_size
The main reason is that f.tell() is unlikely to work under some special conditions.
The method that works for me is the following. If you have pseudocode like this:
while not EOF(f):
line = f.readline()
" do something with line"
You can replace it with:
lines = iter(f.readlines())
while True:
try:
line = next(lines)
" do something with line"
except StopIteration:
break
This method is simple and you don't need to change most of your code.
If the file is opened in non-blocking mode, a read returning fewer bytes than expected does not mean it's at EOF; I'd say @NPE's answer is the most reliable way:
f.tell() == os.fstat(f.fileno()).st_size
The Python read functions will return an empty string if they reach EOF
f = open(filename, 'r')
f.seek(0, 2)    # go to the file end
eof = f.tell()  # get the end-of-file position
f.seek(0, 0)    # go back to the file beginning
while f.tell() != eof:
    <body>
You can use the file methods seek() and tell() to determine the position of the end of the file. Once that position is found, seek back to the file beginning.
Python doesn't have a built-in EOF detection function, but that functionality is available in two ways: f.read(1) will return b'' (or '' for text files) if there are no more bytes to read. The second way is to use f.tell() to see if the current seek position is at the end. If you want EOF testing not to change the current file position, then you need a bit of extra code.
Below are both implementations.
Using tell() method
import os
def is_eof(f):
cur = f.tell() # save current position
f.seek(0, os.SEEK_END)
end = f.tell() # find the size of file
f.seek(cur, os.SEEK_SET)
return cur == end
Using read() method
def is_eof(f):
    s = f.read(1)
    if s != b'':
        f.seek(-1, os.SEEK_CUR)  # restore position
    return s == b''
How to use this
while not is_eof(my_file):
val = my_file.read(10)
You can use the tell() method after reaching EOF by calling the readlines() method, like this:
fp = open('file_name', 'r')
lines = fp.readlines()
eof = fp.tell()       # store the offset that marks the end of the file
fp.seek(0)            # bring the cursor back to the beginning of the file
if eof != fp.tell():  # the cursor is not at the end, so there is data to read
    do_something()
Reading a file in batches of BATCH_SIZE lines (the last batch can be shorter):
BATCH_SIZE = 1000 # lines
with open('/path/to/a/file') as fin:
eof = False
while not eof:
# We use an iterator to check later if it was fully realized. This
# is a way to know if we reached the EOF.
# NOTE: file.tell() can't be used with iterators.
batch_range = iter(range(BATCH_SIZE))
acc = [line for (_, line) in zip(batch_range, fin)]
# DO SOMETHING WITH "acc"
# If we still have something to iterate, we have read the whole
# file.
if any(batch_range):
eof = True
Get the EOF position of the file:
def get_eof_position(file_handle):
original_position = file_handle.tell()
eof_position = file_handle.seek(0, 2)
file_handle.seek(original_position)
return eof_position
and compare it with the current position: get_eof_position(file_handle) == file_handle.tell().
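A read loop built on that comparison might look like this (a sketch; 'data.bin' is a placeholder name):
with open('data.bin', 'rb') as fh:
    end = get_eof_position(fh)
    while fh.tell() != end:
        chunk = fh.read(4096)  # process each chunk here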
Although I would personally use a with statement to handle opening and closing a file, in the case where you have to read from stdin and need to track an EOF exception, do something like this:
Use a try/except with EOFError as the exception. Note that sys.stdin.readlines() simply stops at EOF rather than raising; it is input() that raises EOFError, so read with input() if you want the exception:
try:
    input_lines = []
    while True:
        input_lines.append(input())  # input() raises EOFError at end of input
except EOFError as e:
    print(e)
I use this function:
import os

# Returns True if end-of-file is reached
def EOF(f):
    current_pos = f.tell()
    file_size = os.fstat(f.fileno()).st_size
    return current_pos >= file_size
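Usage might look like this (a sketch; 'data.bin' is a placeholder):
with open('data.bin', 'rb') as f:
    while not EOF(f):
        chunk = f.read(4096)  # the final chunk may be shorter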
This code works for Python 3 and above; it reports how many lines the file has, ignoring any trailing blank lines:
file = open("filename.txt")
lines = file.readlines()  # reads all lines from the file
# Trim trailing blank lines so they do not count toward the total.
while lines and lines[-1] == "\n":
    lines.pop()
print("Given file has", len(lines), "lines")
file.close()
You can try this code:
import sys
sys.stdin = open('input.txt', 'r') # set std input to 'input.txt'
count_lines = 0
while True:
try:
v = input() # if EOF, it will raise an error
count_lines += 1
except EOFError:
print('EOF', count_lines) # print the number of lines in the file
break
You can use the code snippet below to read line by line, until the end of the file:
obj = open('file_name', 'r')
line = obj.readline()
while line != '':
    # Do Something
    line = obj.readline()
Related
I want to read through a binary file.
Googling "python binary eof" led me here.
Now, the questions:
Why does the container (x in the SO answer) contain not a single (current) byte but a whole bunch of them? What am I doing wrong?
If it should be so and I am doing nothing wrong, HOW do I read a single byte? I mean, is there any way to detect EOF while reading the file with the read(1) method?
To quote the documentation:
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
That means (for a regular file):
f.read(1) will return a bytes object containing either 1 byte, or 0 bytes if EOF was reached.
f.read(2) will return a bytes object containing either 2 bytes, or 1 byte if EOF is reached after the first byte, or 0 bytes if EOF is encountered immediately.
...
If you want to read your file one byte at a time, you will have to read(1) in a loop and test for "emptiness" of the result:
# From the answer by @Daniel
with open(filename, 'rb') as f:
while True:
b = f.read(1)
if not b:
# eof
break
do_something(b)
If you want to read your file by "chunk" of say 50 bytes at a time, you will have to read(50) in a loop:
with open(filename, 'rb') as f:
while True:
b = f.read(50)
if not b:
# eof
break
do_something(b) # <- be prepared to handle a last chunk of length < 50
# if the file length *is not* a multiple of 50
In fact, you may even break one iteration sooner:
with open(filename, 'rb') as f:
while True:
b = f.read(50)
do_something(b) # <- be prepared to handle a last chunk of size 0
# if the file length *is* a multiple of 50
# (incl. 0 byte-length file!)
# and be prepared to handle a last chunk of length < 50
# if the file length *is not* a multiple of 50
if len(b) < 50:
break
Concerning the other part of your question:
Why does the container [..] contain [..] a whole bunch of them [bytes]?
Referring to that code:
for x in file:
i=i+1
print(x)
To quote the docs again:
A file object is its own iterator, [..]. When a file is used as an iterator, typically in a for loop (for example, for line in f: print line.strip()), the next() method is called repeatedly. This method returns the next input line, or raises StopIteration when EOF is hit when the file is open for reading (behavior is undefined when the file is open for writing).
The code above reads a binary file line by line, stopping at each occurrence of the EOL char (\n). Usually that leads to chunks of various lengths, as most binary files contain occurrences of that char randomly distributed.
I wouldn't encourage you to read a binary file that way. Please prefer a solution based on read(size).
"" will signify the end of the file
with open(filename, 'rb') as f:
for ch in iter(lambda: f.read(1),""): # keep calling f.read(1) until end of the data
print ch
Reading byte-by-byte:
with open(filename, 'rb') as f:
while True:
b = f.read(1)
if not b:
# eof
break
do_something(b)
Here is what I did. The call to read returns a falsy value when it encounters the end of the file, and this terminates the loop. Using while ch != "": copied the image but gave me a hung loop, because in binary mode read() returns b"" at EOF, which never compares equal to "".
from sys import argv
donor = argv[1]
recipient = argv[2]
# read from donor and write into recipient
# with statement ends, file gets closed
with open(donor, "rb") as fp_in:
with open(recipient, "wb") as fp_out:
ch = fp_in.read(1)
while ch:
fp_out.write(ch)
ch = fp_in.read(1)
I'm writing a Python script to read a file, and when I arrive at a section of the file, the way to read the lines in that section depends on information given within the section itself. So I found here that I could use something like
fp = open('myfile')
last_pos = fp.tell()
line = fp.readline()
while line != '':
if line == 'SPECIAL':
fp.seek(last_pos)
other_function(fp)
break
last_pos = fp.tell()
line = fp.readline()
Yet, the structure of my current code is something like the following:
import itertools

fh = open(filename)
# get generator function and attach None at the end to stop iteration
items = itertools.chain(((lino,line) for lino, line in enumerate(fh, start=1)), (None,))
item = True
lino, line = next(items)
# handle special section
if line.startswith('SPECIAL'):
start = fh.tell()
for i in range(specialLines):
lino, eline = next(items)
# etc. get the special data I need here
# try to set the pointer to start to reread the special section
fh.seek(start)
# then reread the special section
But this approach gives the following error:
telling position disabled by next() call
Is there a way to prevent this?
Using the file as an iterator (such as calling next() on it or using it in a for loop) uses an internal buffer; the actual file read position is further along the file and using .tell() will not give you the position of the next line to yield.
If you need to seek back and forth, the solution is not to use next() directly on the file object but use file.readline() only. You can still use an iterator for that, use the two-argument version of iter():
fileobj = open(filename)
fh = iter(fileobj.readline, '')
Calling next() on this iterator will invoke fileobj.readline() until that function returns an empty string. In effect, this creates a file iterator that doesn't use the internal buffer.
Demo:
>>> fh = open('example.txt')
>>> fhiter = iter(fh.readline, '')
>>> next(fhiter)
'foo spam eggs\n'
>>> fh.tell()
14
>>> fh.seek(0)
0
>>> next(fhiter)
'foo spam eggs\n'
Note that your enumerate chain can be simplified to:
items = itertools.chain(enumerate(fh, start=1), (None,))
although I am in the dark why you think a (None,) sentinel is needed here; StopIteration will still be raised, albeit one more next() call later.
To read specialLines count lines, use itertools.islice():
for lino, eline in islice(items, specialLines):
# etc. get the special data I need here
You can just loop directly over fh instead of using an infinite loop and next() calls here too:
with open(filename) as fh:
    enumerated = enumerate(iter(fh.readline, ''), start=1)
    for lino, line in enumerated:
        # handle special section
        if line.startswith('SPECIAL'):
            start = fh.tell()
            for lino, eline in islice(enumerated, specialLines):
                # etc. get the special data I need here
                pass
            fh.seek(start)
but do note that your line numbers will still increment even when you seek back!
You probably want to refactor your code to not need to re-read sections of your file, however.
I'm not an expert with Python 3, but it seems like you're reading using a generator that yields lines read from the file. Thus you can only move in one direction.
You'll have to use another approach.
I'm having trouble reading an entire specific line of a text file using Python. I currently have this:
load_profile = open('users/file.txt', "r")
read_it = load_profile.readline(1)
print(read_it)
Of course this will just read one byte of the first line, which is not what I want. I also tried Google but didn't find anything.
What are the conditions of this line? Is it at a certain index? Does it contain a certain string? Does it match a regex?
This code will match a single line from the file based on a string:
load_profile = open('users/file.txt', "r")
read_it = load_profile.read()
myLine = ""
for line in read_it.splitlines():
    if line == "This is the line I am looking for":
        myLine = line
        break
print(myLine)
And this will give you the first line of the file (there are several other ways to do this as well):
load_profile = open('users/file.txt', "r")
read_it = load_profile.read().splitlines()[0]
print(read_it)
Or:
load_profile = open('users/file.txt', "r")
read_it = load_profile.readline()
print(read_it)
Check out Python File Objects Docs
file.readline([size])
Read one entire line from the file. A trailing
newline character is kept in the string (but may be absent when a file
ends with an incomplete line). [6] If the size argument is present and
non-negative, it is a maximum byte count (including the trailing
newline) and an incomplete line may be returned. When size is not 0,
an empty string is returned only when EOF is encountered immediately.
Note Unlike stdio's fgets(), the returned string contains null
characters ('\0') if they occurred in the input.
file.readlines([sizehint])
Read until EOF using readline() and return
a list containing the lines thus read. If the optional sizehint
argument is present, instead of reading up to EOF, whole lines
totalling approximately sizehint bytes (possibly after rounding up to
an internal buffer size) are read. Objects implementing a file-like
interface may choose to ignore sizehint if it cannot be implemented,
or cannot be implemented efficiently.
Edit:
Answer to your comment, Noah:
load_profile = open('users/file.txt', "r")
read_it = load_profile.read()
myLines = []
for line in read_it.splitlines():
# if line.startswith("Start of line..."):
# if line.endswith("...line End."):
# if line.find("SUBSTRING") > -1:
if line == "This is the line I am looking for":
myLines.append(line)
print(myLines)
You can use Python's built-in linecache module:
import linecache
line = linecache.getline(filepath, linenumber)
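Note that linecache.getline() returns an empty string for out-of-range line numbers instead of raising an error. For example (the path and line number are placeholders):
import linecache

line = linecache.getline('users/file.txt', 4)  # '' if the file has fewer than 4 lines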
load_profile.readline(1)
specifically says to cap the read at 1 byte; it doesn't mean 1 line. Try
read_it = load_profile.readline()
def readline_number_x(file, x):
    for index, line in enumerate(file):
        if index + 1 == x:
            return line
    return None

f = open('filename')
x = 3
line_number_x = readline_number_x(f, x)  # this will return the third line
I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found ...
f.seek(-2, 1) # ... jump back, over the read byte plus one more.
return f.read() # Read all data from this point on.
with open(file, "rb") as f:
first = f.readline()
last = readlastline(f)
Jump to the second last byte directly to prevent trailing newline characters from causing empty lines to be returned*.
The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and then the byte to read next.
The whence parameter passed to fseek(offset, whence=0) indicates that fseek should seek to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* As would be expected: the default behavior of most applications, including print and echo, is to append a newline to every line written, and this has no effect on lines missing a trailing newline character.
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200 kB: 1.62 s vs 6.92 s.
100 iterations processing a file of 6k lines totalling 1.3 GB: 8.93 s vs 86.95 s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
first = f.readline() # Read and store the first line.
for last in f: pass # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multichar delimiters, as in readlast(f, "<br>", fixed=False) for data like "X<br>Y".
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END
def readlast(f, sep, fixed=True):
r"""Read the last segment from a file-like object.
:param f: File to read last line from.
:type f: file-like object
:param sep: Segment separator (delimiter).
:type sep: bytes, str
:param fixed: Treat data in ``f`` as a chain of fixed size blocks.
:type fixed: bool
:returns: Last line of file.
:rtype: bytes, str
"""
bs = len(sep)
step = bs if fixed else 1
if not bs:
raise ValueError("Zero-length separator.")
try:
o = f.seek(0, SEEK_END)
o = f.seek(o-bs-step) # - Ignore trailing delimiter 'sep'.
while f.read(bs) != sep: # - Until reaching 'sep': Read sep-sized block
o = f.seek(o-step) # and then seek to the block to read next.
except (OSError,ValueError): # - Beginning of file reached.
f.seek(0)
return f.read()
def test_readlast():
from io import BytesIO, StringIO
# Text mode.
f = StringIO("first\nlast\n")
assert readlast(f, "\n") == "last\n"
# Bytes.
f = BytesIO(b'first|last')
assert readlast(f, b'|') == b'last'
# Bytes, UTF-8.
f = BytesIO("X\nY\n".encode("utf-8"))
assert readlast(f, b'\n').decode() == "Y\n"
# Bytes, UTF-16.
f = BytesIO("X\nY\n".encode("utf-16"))
assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
# Bytes, UTF-32.
f = BytesIO("X\nY\n".encode("utf-32"))
assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
# Multichar delimiter.
f = StringIO("X<br>Y")
assert readlast(f, "<br>", fixed=False) == "Y"
# Make sure you use the correct delimiters.
seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
assert "\n".encode('utf8' ) == seps['utf8']
assert "\n".encode('utf16')[2:] == seps['utf16']
assert "\n".encode('utf32')[4:] == seps['utf32']
# Edge cases.
edges = (
# Text , Match
("" , "" ), # Empty file, empty string.
("X" , "X" ), # No delimiter, full content.
("\n" , "\n"),
("\n\n", "\n"),
# UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
(b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
)
for txt, match in edges:
for enc,sep in seps.items():
assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match
if __name__ == "__main__":
import sys
for path in sys.argv[1:]:
with open(path) as f:
print(f.readline() , end="")
print(readlast(f,"\n"), end="")
See the docs for the io module.
with open(fname, 'rb') as fh:
first = next(fh).decode()
fh.seek(-1024, 2)
last = fh.readlines()[-1].decode()
The magic value here is 1024: it represents the average line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
pass
last = line
You don't need to bother with the binary flag; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run.
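A rough sketch of that sampling idea (all names here are hypothetical, and it assumes every sampled file is larger than the 1 MB shift):
import random

for path in random.sample(all_paths, 24):     # all_paths: list of file paths
    with open(path, 'rb') as fh:
        fh.seek(-1024 * 1024, 2)              # jump 1 MB back from the end
        print(path, len(fh.readlines()[-1]))  # length of the last line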
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
first = next(fh)
offs = -100
while True:
fh.seek(offs, 2)
lines = fh.readlines()
if len(lines)>1:
last = lines[-1]
break
offs *= 2
print(first)
print(last)
No need for an upper bound for line length here.
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
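If shelling out is acceptable, that approach might be sketched like this (assumes a Unix-like system with head and tail on the PATH; 'big.log' is a placeholder):
import subprocess

first = subprocess.check_output(['head', '-1', 'big.log'])
last = subprocess.check_output(['tail', '-n', '1', 'big.log'])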
This is my solution, also compatible with Python 3. It handles border cases too, but it misses UTF-16 support:
import os

def tail(filepath):
"""
#author Marco Sulla (marcosullaroma#gmail.com)
#date May 31, 2016
"""
try:
filepath.is_file
fp = str(filepath)
except AttributeError:
fp = filepath
with open(fp, "rb") as f:
size = os.stat(fp).st_size
start_pos = 0 if size - 1 < 0 else size - 1
if start_pos != 0:
f.seek(start_pos)
char = f.read(1)
if char == b"\n":
start_pos -= 1
f.seek(start_pos)
if start_pos == 0:
f.seek(start_pos)
else:
char = ""
for pos in range(start_pos, -1, -1):
f.seek(pos)
char = f.read(1)
if char == b"\n":
break
return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line. All the lines are stored in a list. Now you can use list slices to get the first and last lines of the file.
a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
with open(filepath, "rb") as f:
first = f.readline() # Read the first line.
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
try:
f.seek(-2, 1) # ...jump back the read byte plus one more.
except IOError:
f.seek(-1, 1)
if f.tell() == 0:
break
last = f.readline() # Read last line.
return last
Nobody mentioned using reversed:
f = open(file, "r")
r = reversed(f.readlines())
last_line_of_file = next(r)
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, seek that amount back from os.SEEK_END, find the second-to-last line ending, and then readline() the last line.
with open(filename, "rb") as f:#Needs to be in binary mode for the seek from the end to work
first = f.readline()
if f.read(1) == '':
return first
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
f.seek(-2, 1) # ...jump back the read byte plus one more.
last = f.readline() # Read last line.
return last
The above answer is a modified version of the above answers which handles the case that there is only one line in the file
I am trying to split up a large XML file into smaller chunks. I write to the output file and then check its size to see if it's passed a threshold, but I don't think the getsize() method is working as expected.
What would be a good way to get the size of a file that is changing in size?
I've done something like this...
import string
import os
f1 = open('VSERVICE.xml', 'r')
f2 = open('split.xml', 'w')
for line in f1:
if str(line) == '</Service>\n':
break
else:
f2.write(line)
size = os.path.getsize('split.xml')
print('size = ' + str(size))
running this prints 0 as the filesize for about 80 iterations and then 4176. Does Python store the output in a buffer before actually outputting it?
File size is different from file position. For example,
os.path.getsize('sample.txt')
It returns exactly the file size in bytes.
But
f = open('sample.txt')
print(f.readline())
f.tell()
Here f.tell() returns the current position of the file handle, i.e. where the next read or write will occur. Since it is aware of the buffering, it is accurate as long as you are simply appending to the output file.
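The difference between the buffered position and the on-disk size is easy to observe (a sketch; 'sample.txt' is a placeholder):
import os

f = open('sample.txt', 'w')
f.write('hello\n')
print(f.tell())                       # 6: tell() accounts for the buffered write
print(os.path.getsize('sample.txt'))  # may still report 0 until the buffer is flushed
f.flush()
print(os.path.getsize('sample.txt'))  # 6 once the data reaches the OS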
Yes, Python is buffering your output. You'd be better off tracking the size yourself, something like this:
size = 0
for line in f1:
if str(line) == '</Service>\n':
break
else:
f2.write(line)
size += len(line)
print('size = ' + str(size))
(That might not be 100% accurate, e.g. on Windows each line will gain a byte because of the \r\n line separator, but it should be good enough for simple chunking.)
Have you tried replacing os.path.getsize with the file object's tell() method, like this:
f2.write(line)
size = f2.tell()
Tracking the size yourself will be fine for your case. A different way would be to flush the file buffers just before you check the size:
f2.write(line)
f2.flush() # <-- buffers are written to disk
size = os.path.getsize('split.xml')
Doing that too often will slow down file I/O, of course.
To find the offset to the end of a file:
file.seek(0, 2)
print(file.tell())
Real world example - read updates to a file and print them as they happen:
file = open('log.txt', 'r')
# find initial end-of-file offset
file.seek(0, 2)
eof = file.tell()
while True:
    # measure the file size again
    file.seek(0, 2)
    neweof = file.tell()
    # if the file is larger...
    if neweof > eof:
        # ...go back to the last position...
        file.seek(eof)
        # ...and print from the last position to the current one
        print(file.read(neweof - eof), end='')
        eof = neweof