I have an input file which looks like this
some data...
some data...
some data...
...
some data...
<binary size="2358" width="32" height="24">
data of size 2358 bytes
</binary>
some data...
some data...
The value 2358 in the binary size attribute can differ from file to file.
I want to extract those 2358 bytes of data (the size varies per file)
and write them to another file.
I wrote the following code to do that, but it raises an error. The problem is that I cannot extract these 2358 bytes of binary data and write them to another file.
c = responseFile.read(1)
ValueError: Mixing iteration and read methods would lose data
The code is:
import re
outputFile = open('output', 'w')
inputFile = open('input.txt', 'r')
fileSize=0
width=0
height=0
for line in inputFile:
    if "<binary size" in line:
        x = re.findall('\w+', line)
        fileSize = int(x[2])
        width = int(x[4])
        height = int(x[6])
        break
print x
# Here the file will point to the start location of 2358 bytes.
for i in range(0,fileSize,1):
    c = inputFile.read(1)
    outputFile.write(c)
outputFile.close()
inputFile.close()
Final Answer to my Question -
#!/usr/local/bin/python
import os
inputFile = open('input', 'r')
outputFile = open('output', 'w')
flag = False
for line in inputFile:
    if line.startswith("<binary size"):
        print 'Start of Data'
        flag = True
    elif line.startswith("</binary>"):
        flag = False
        print 'End of Data'
    elif flag:
        outputFile.write(line) # newline is kept; the last one is trimmed below
inputFile.close()
outputFile.close()
# I have to delete the last extra new line character from the output.
size = os.path.getsize('output')
outputFile = open('output', 'ab')
outputFile.truncate(size-1)
outputFile.close()
How about a different approach? In pseudo-code:
for each line in input file:
    if line starts with binary tag: set output flag to True
    if line starts with binary-termination tag: set output flag to False
    if output flag is True: copy line to the output file
And in real code:
outputFile = open('./output', 'w')
inputFile = open('./input.txt', 'r')
flag = False
for line in inputFile:
    if line.startswith("<binary size"):
        flag = True
    elif line.startswith("</binary>"):
        flag = False
    elif flag:
        outputFile.write(line[:-1]) # remove newline
outputFile.close()
inputFile.close()
Try changing your first loop to something like this:
while True:
    line = inputFile.readline()
    # continue the loop as it was
This gets rid of iteration and only leaves read methods, so the problem should disappear.
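For what it's worth, the same idea can be sketched in Python 3 by opening the file in binary mode and using only readline() and read(), never iteration. This is a minimal sketch rather than the original poster's code: the helper name extract_binary_payload is made up for illustration, and it assumes the payload starts immediately after the newline that ends the <binary ...> tag line.

```python
import io
import re

# Hypothetical helper (not from the thread): works on any binary file-like object.
def extract_binary_payload(f):
    """Return the `size` bytes that follow the <binary size="..."> tag line."""
    while True:
        line = f.readline()          # read methods only, no `for line in f`
        if not line:                 # EOF reached without finding the tag
            return None
        match = re.search(rb'<binary\b[^>]*\bsize="(\d+)"', line)
        if match:
            # The file position is now just past the tag line.
            return f.read(int(match.group(1)))

# In-memory stand-in for the input file; the payload itself contains a newline.
sample = io.BytesIO(
    b'some data...\n<binary size="4" width="2" height="2">\n\x00\x01\n\x02\n</binary>\n'
)
payload = extract_binary_payload(sample)
```

With a real file, the call would be something like extract_binary_payload(open('input.txt', 'rb')), and the result can be written to an output file opened with 'wb'.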
Consider this method:
import re
line = '<binary size="2358" width="32" height="24">'
m = re.search('size="(\d*)"', line)
print m.group(1) # 2358
It differs from your code, so it's not a drop-in replacement; the difference is in how the regular expression is used.
This uses Python's regex group capturing features and is much better than your string splitting method.
For example, consider what would happen if the attributes were reordered:
<binary width="32" size="2358" height="24">
instead of
<binary size="2358" width="32" height="24">
Would your code still work? Mine would. :-)
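To make the point about attribute order concrete, here is a small Python 3 sketch that collects every name="digits" pair into a dictionary, so the position of size in the tag no longer matters. The function name parse_attributes is an assumption made for this example, and it only handles purely numeric attribute values.

```python
import re

# Hypothetical helper: maps each numeric attribute name to its integer value.
def parse_attributes(tag_line):
    """Return a dict of name -> int for every name="digits" pair in the line."""
    pairs = re.findall(r'(\w+)="(\d+)"', tag_line)
    return {name: int(value) for name, value in pairs}

attrs = parse_attributes('<binary width="32" size="2358" height="24">')
file_size = attrs["size"]   # found by name, not by position
```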
Edit: To answer your question:
If you want to read n bytes of data from the beginning of a file, you could do something like
bytes = ifile.read(n)
Note that you may get less than n bytes if the input file is not long enough.
If you don't want to start from the "0th" byte, but some other byte, use seek() first, as in:
ifile.seek(9)
bytes = ifile.read(5)
Which would give you the bytes at offsets 9 through 13, i.e. the 10th through 14th bytes.
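A quick way to check the offsets is to run seek() and read() against an in-memory buffer. The sketch below is Python 3; read_range is a hypothetical helper name, and io.BytesIO stands in for a file opened in binary mode.

```python
import io

# Hypothetical helper: read `count` bytes starting at byte offset `offset`.
def read_range(f, offset, count):
    f.seek(offset)
    return f.read(count)

data = io.BytesIO(b"0123456789ABCDEF")   # 16 bytes, offsets 0..15
chunk = read_range(data, 9, 5)           # offsets 9 through 13
short = read_range(data, 14, 5)          # only 2 bytes are left from offset 14
```

The second call shows the "less than n bytes" case: read() simply returns what is left.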
I have a huge text file that I need to split wherever a value is exactly 'EKYC'. However, when other values with a similar pattern show up, my script fails.
I am new to Python and it is wearing me out.
import sys;
import os;
MASTER_TEXT_FILE=sys.argv[1];
OUTPUT_FILE=sys.argv[2];
L = file(MASTER_TEXT_FILE, "r").read().strip().split("EKYC")
i = 0
for l in L:
    i = i + 1
    f = file(OUTPUT_FILE+"-%d.ekyc" % i , "w")
    print >>f, "EKYC" + l
The script breaks when there is EKYCSMRT, EKYCVDA or EKYCTIGO. How can I add a guard so that the split does not happen at those values?
This is the content of all of the messages
EKYC
WIK 12
EKYC
WIK 12
EKYCTIGO
EKYC
WIK 13
TTL
EKYCVD
EKYC
WIK 14
TTL D
Thanks for the assistance.
If possible, you should avoid reading large files into memory all at once. Instead, stream chunks of them at a time.
The sensible chunks of text files are usually lines. This can be done with .readline(), but simply iterating over the file yields its lines too.
After reading a line (which includes the newline), you can .write() it directly to the current output file.
import sys
master_filename = sys.argv[1]
output_filebase = sys.argv[2]
output = None
output_number = 0
for line in open(master_filename):
    if line.strip() == 'EKYC':
        if output is not None:
            output.close()
            output = None
    else:
        if output is None:
            output_number += 1
            output_filename = '%s-%d.ekyc' % (output_filebase, output_number)
            output = open(output_filename, 'w')
        output.write(line)
if output is not None:
    output.close()
The output file is closed and reset upon encountering 'EKYC' on its own line.
Here, you'll notice that the output file isn't (re)opened until right before there is a line to write to it: this avoids creating an empty output file in case there are no further lines to write to it. You'll have to re-order this slightly if you want the 'EKYC' line to appear in the output file also.
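One possible re-ordering that keeps the 'EKYC' line in each output is sketched below in Python 3. It is only an illustration: split_on_marker is a made-up name, it yields lists of lines instead of writing files so that it is easy to try out, and any lines that appear before the first marker would end up in a chunk of their own.

```python
# Hypothetical generator: groups lines into chunks that each start at a marker line.
def split_on_marker(lines, marker="EKYC"):
    """Yield lists of lines; a new list starts at every line equal to `marker`."""
    chunk = []
    for line in lines:
        # Only an exact match splits, so EKYCTIGO or EKYCVD stay inside a chunk.
        if line.strip() == marker and chunk:
            yield chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield chunk

sample = ["EKYC\n", "WIK 12\n", "EKYCTIGO\n", "EKYC\n", "WIK 13\n"]
chunks = list(split_on_marker(sample))
```

Because the generator accepts any iterable of lines, an open file object can be passed in directly, and each chunk can then be written with writelines() to its own numbered file.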
Based on your sample input file, you need to: split('\nEKYC\n')
#!/usr/bin/env python
import sys
MASTER_TEXT_FILE = sys.argv[1]
OUTPUT_FILE = sys.argv[2]
with open(MASTER_TEXT_FILE) as f:
    fdata = f.read()
i = 0
for subset in fdata.split('\nEKYC\n'):
    i += 1
    with open(OUTPUT_FILE+"-%d.ekyc" % i, 'w') as output:
        output.write(subset)
Other comments:
Python doesn't need ; at the end of a statement.
Your original code wasn't using os.
It's recommended to use with open(<filename>, <mode>) as f: ... since it handles possible errors and closes the file afterward.
I'm having trouble reading an entire specific line of a text file using Python. I currently have this:
load_profile = open('users/file.txt', "r")
read_it = load_profile.readline(1)
print read_it
Of course this will just read one byte of the first line, which is not what I want. I also tried Google but didn't find anything.
What are the conditions of this line? Is it at a certain index? Does it contain a certain string? Does it match a regex?
This code will match a single line from the file based on a string:
load_profile = open('users/file.txt', "r")
read_it = load_profile.read()
myLine = ""
for line in read_it.splitlines():
    if line == "This is the line I am looking for":
        myLine = line
        break
print myLine
And this will give you the first line of the file (there are several other ways to do this as well):
load_profile = open('users/file.txt', "r")
read_it = load_profile.read().splitlines()[0]
print read_it
Or:
load_profile = open('users/file.txt', "r")
read_it = load_profile.readline()
print read_it
Check out Python File Objects Docs
file.readline([size])
Read one entire line from the file. A trailing
newline character is kept in the string (but may be absent when a file
ends with an incomplete line). [6] If the size argument is present and
non-negative, it is a maximum byte count (including the trailing
newline) and an incomplete line may be returned. When size is not 0,
an empty string is returned only when EOF is encountered immediately.
Note Unlike stdio‘s fgets(), the returned string contains null
characters ('\0') if they occurred in the input.
file.readlines([sizehint])
Read until EOF using readline() and return
a list containing the lines thus read. If the optional sizehint
argument is present, instead of reading up to EOF, whole lines
totalling approximately sizehint bytes (possibly after rounding up to
an internal buffer size) are read. Objects implementing a file-like
interface may choose to ignore sizehint if it cannot be implemented,
or cannot be implemented efficiently.
Edit:
Answer to your comment Noah:
load_profile = open('users/file.txt', "r")
read_it = load_profile.read()
myLines = []
for line in read_it.splitlines():
    # if line.startswith("Start of line..."):
    # if line.endswith("...line End."):
    # if line.find("SUBSTRING") > -1:
    if line == "This is the line I am looking for":
        myLines.append(line)
print myLines
You can use Python's inbuilt module linecache
import linecache
line = linecache.getline(filepath,linenumber)
load_profile.readline(1)
specifically says to cap the read at 1 byte; it doesn't mean 1 line. Try
read_it = load_profile.readline()
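The difference is easy to see on an in-memory file. This is a Python 3 sketch in which io.StringIO stands in for the opened text file; the sample content is invented.

```python
import io

f = io.StringIO("first line\nsecond line\n")
capped = f.readline(1)   # size=1: at most one character is returned
rest = f.readline()      # no size: the remainder of the same line, newline included
```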
def readline_number_x(file,x):
    for index,line in enumerate(iter(file)):
        if index+1 == x: return line
    return None
f = open('filename')
x = 3
line_number_x = readline_number_x(f,x) #This will return the third line
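A shorter variant of the same idea, sketched in Python 3 with itertools.islice. The name nth_line is hypothetical; like the function above it reads the file sequentially, so it costs time proportional to the line number but holds only one line in memory.

```python
import io
from itertools import islice

# Hypothetical helper: 1-based line lookup on any iterable of lines.
def nth_line(f, n):
    """Return line number n (1-based), or None if the file has fewer lines."""
    return next(islice(f, n - 1, n), None)

third = nth_line(io.StringIO("a\nb\nc\nd\n"), 3)
missing = nth_line(io.StringIO("a\n"), 3)
```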
I have a text file structure as:
date
downland
user
date data1 date2
201102 foo bar 200 50
201101 foo bar 300 35
So first six lines of file are not needed. filename:dnw.txt
f = open('dwn.txt', 'rb')
How do I "split" this file starting at line 7 to EOF?
with open('dwn.txt') as f:
    for i in xrange(6):
        f.next()  # skip one header line
    for line in f:
        process(line)
Update: use next(f) for python 3.x.
Itertools answer!
from itertools import islice
with open('foo') as f:
    for line in islice(f, 6, None):
        print line
Python 3:
with open("file.txt","r") as f:
    for i in range(6):
        f.readline()
    for line in f:
        pass  # process lines 7-end
with open('test.txt', 'r') as fo:
    for i in xrange(6):
        fo.next()
    for line in fo:
        print "%s" % line.strip()
In fact, to answer the question precisely as it was written,
How do I "split" this file starting at line 7 to EOF?
you can do the following.
In case the file is not big:
with open('dwn.txt','rb+') as f:
    for i in xrange(6):
        print f.readline()
    content = f.read()
    f.seek(0,0)
    f.write(content)
    f.truncate()
In case the file is very big:
with open('dwn.txt','rb+') as ahead, open('dwn.txt','rb+') as back:
    for i in xrange(6):
        print ahead.readline()
    x = 100000
    chunk = ahead.read(x)
    while chunk:
        print repr(chunk)
        back.write(chunk)
        chunk = ahead.read(x)
    back.truncate()
The truncate() call is essential to place the EOF where you asked for it. Without truncate(), the tail of the file, corresponding to the offset of the six removed lines, would remain.
The file must be opened in binary mode to avoid newline problems.
When Python reads '\r\n' in text mode with universal newline support, it translates it to '\n', so the chunks contain only '\n' even if the file contained '\r\n'.
If the file comes from an old Macintosh system, it contains only CR = '\r' newlines before the treatment, but they will be changed to '\n' or '\r\n' (depending on the platform) when rewritten on a non-Macintosh machine.
If the file comes from Linux, it contains only LF = '\n' newlines which, on Windows, will be changed to '\r\n' (I don't know what happens to a Linux file processed on a Macintosh).
The reason is that Windows, in text mode, writes '\r\n' for every '\n' it is asked to write. Consequently, more characters would be written than were read, the offset between the file pointers ahead and back would shrink, and the rewriting would become messy.
HTML sources also contain various newline styles.
That's why it's always preferable to open files in binary mode when they are processed this way.
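The effect of newline translation on byte counts can be illustrated with a small Python 3 sketch. It uses in-memory objects from the io module rather than real files, and the sample bytes are invented for the demonstration.

```python
import io

raw = b"line one\r\nline two\r\n"   # 20 bytes with Windows-style line endings

# Binary mode: the bytes come back exactly as stored.
as_bytes = io.BytesIO(raw).read()

# Text mode with universal newlines (newline=None): each '\r\n' becomes '\n',
# so the text is shorter than the data actually stored.
as_text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None).read()
```

Because the text-mode result has two characters fewer than the stored data, offsets computed from it no longer match positions in the file, which is the kind of drift described above.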
Alternative version
You can use read() directly if you know the character position pos of the line break (e.g. an \n) that separates the header part from the part of interest, i.e. the point at which you want to split your input text:
with open('input.txt', 'r') as txt_in:
    txt_in.seek(pos)
    second_half = txt_in.read()
If you are interested in both halves, you could also look at the following method:
with open('input.txt', 'r') as txt_in:
    all_contents = txt_in.read()
first_half = all_contents[:pos]
second_half = all_contents[pos:]
You can read the entire file into an array/list and then just start at the index appropriate to the line you wish to start reading at.
f = open('dwn.txt', 'rb')
fileAsList = f.readlines()
fileAsList[0] #first line
fileAsList[1] #second line
#!/usr/bin/python
with open('dnw.txt', 'r') as f:
    lines_7_through_end = f.readlines()[6:]
print "Lines 7+:"
i = 7
for line in lines_7_through_end:
    print " Line %s: %s" % (i, line)
    i+=1
Prints:
Lines 7+:
Line 7: 201102 foo bar 200 50
Line 8: 201101 foo bar 300 35
Edit:
To rebuild dwn.txt without the first six lines, do this after the above code:
with open('dnw.txt', 'w') as f:
    for line in lines_7_through_end:
        f.write(line)
I have created a script used to cut an Apache access.log file several times a day.
It's not the original topic of the question, but I think it can be useful if you have to store the file cursor position after reading the first 6 lines.
I needed to set the cursor to the last line parsed during the previous execution.
To this end, I used the file.seek() and file.tell() methods, which allow the cursor position in the file to be restored and stored.
My code:
import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))
# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")
# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")
# Set in from_line
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except Exception as e:
    pass
# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set cursor to the last position used (during last run of script)
        f.seek(from_position)
        for line in f:
            fw.write("%s" % (line))
    # We save the last position of cursor for next usage
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))
Just do f.readline() six times. Ignore the returned value.
Solutions with readlines() are not satisfactory in my opinion because readlines() reads the entire file. The user then has to go over the lines again (in the file or in the produced list) to process what he wants, although this could have been done without reading the interesting lines a first time. Moreover, if the file is big, its whole content is held in memory, whereas a for line in file loop would be lighter.
Repeating readline() can be done like this:
nb = 6
exec( nb * 'f.readline()\n')
It's a short piece of code, and nb is programmatically adjustable.
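For comparison, the same repetition can be written as an ordinary loop, which keeps the count adjustable without building a string for exec. This is a Python 3 sketch; skip_lines is a hypothetical name and io.StringIO stands in for the opened file.

```python
import io

# Hypothetical helper: discard the first `count` lines of an open file.
def skip_lines(f, count):
    for _ in range(count):
        f.readline()      # the returned line is ignored
    return f

f = skip_lines(io.StringIO("h1\nh2\nh3\nh4\nh5\nh6\nrow 1\nrow 2\n"), 6)
remaining = f.readlines()   # everything from line 7 to EOF
```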
I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent a trailing newline character from causing an empty line to be returned*.
The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and the byte to read next.
The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* This matches the default behavior of most applications, including print and echo, which append a newline to every line written; the jump has no effect on files whose last line lacks a trailing newline character.
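The three whence values can be tried out on an in-memory buffer. The following Python 3 sketch uses io.BytesIO in place of a file opened in "rb" mode; the sample bytes are invented.

```python
import io
import os

f = io.BytesIO(b"abc\ndef\n")         # 8 bytes, offsets 0..7

end = f.seek(0, os.SEEK_END)          # seek() returns the new absolute position
pos = f.seek(-2, os.SEEK_END)         # second last byte
byte = f.read(1)                      # reading advances the position by one
back = f.seek(-2, os.SEEK_CUR)        # step back over the read byte plus one more
start = f.seek(0, os.SEEK_SET)        # back to the beginning
```

Note that seeking backwards relative to the current position or the end only works in binary mode; text-mode files reject nonzero relative offsets.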
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify to your need, or do not use it at all as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)    # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep: # - Until reaching 'sep': Read sep-sized block
            o = f.seek(o-step)   #   and then seek to the block to read next.
    except (OSError,ValueError): # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO

    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"

    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'

    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"

    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"

    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"

    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"

    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']

    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ), # Empty file, empty string.
        ("X"   , "X" ), # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()    , end="")
            print(readlast(f,"\n"), end="")
docs for io module
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The variable value here is 1024: it stands for an estimate of the line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
You don't need to bother with the binary flag; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value of the position shift (let's say 1 MB). This will help you estimate the value for the full run.
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines)>1:
            last = lines[-1]
            break
        offs *= 2
    print first
    print last
No need for an upper bound for line length here.
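The doubling idea can also be sketched in Python 3 in a way that clamps the offset to the start of the file, so a file shorter than the offset (or with a single line) does not raise an error. This is an illustration under those assumptions, not the answer's original code: last_line and its block parameter are made-up names, and the file must be opened in binary mode.

```python
import io
import os

# Hypothetical helper: last line of a binary file-like object, read from the end.
def last_line(f, block=100):
    """Return the last line as bytes (b"" for an empty file)."""
    size = f.seek(0, os.SEEK_END)
    while True:
        start = max(0, size - block)          # never seek before the file start
        f.seek(start)
        lines = f.read().splitlines(keepends=True)
        # More than one line in the tail means the last one is complete;
        # start == 0 means the whole file has been read.
        if len(lines) > 1 or start == 0:
            return lines[-1] if lines else b""
        block *= 2                            # tail too short: look further back

tail = last_line(io.BytesIO(b"first\nsecond\nlast\n"), block=4)
```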
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
This is my solution, compatible also with Python3. It does also manage border cases, but it misses utf-16 support:
import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath
    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1
        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""
                for pos in range(start_pos, -1, -1):
                    f.seek(pos)
                    char = f.read(1)
                    if char == b"\n":
                        break
        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.
a=open('file.txt','rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w=open('file.txt', 'r')
print ('first line is : ',w.readline())
for line in w:
    x= line
print ('last line is : ',x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print first_row
    last_row = lines[-1]
    print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()          # Read the first line.
        f.seek(-2, 2)                 # Jump to the second last byte.
        while f.read(1) != b"\n":     # Until EOL is found...
            try:
                f.seek(-2, 1)         # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()           # Read last line.
    return last
Nobody mentioned using reversed:
f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END find the second to last line ending and then readline() the last line.
with open(filename, "rb") as f:  # Needs to be in binary mode for the seek from the end to work
    first = f.readline()
    if f.read(1) == b'':         # Nothing after the first line: the file has a single line.
        return first
    f.seek(-2, 2)                # Jump to the second last byte.
    while f.read(1) != b"\n":    # Until EOL is found...
        f.seek(-2, 1)            # ...jump back the read byte plus one more.
    last = f.readline()          # Read last line.
    return last
The answer above is a modified version of the earlier answers that also handles the case where the file has only one line.
I am trying to split up a large XML file into smaller chunks. I write to the output file and then check its size to see if it has passed a threshold, but I don't think the getsize() method is working as expected.
What would be a good way to get the size of a file that is changing in size?
I've done something like this...
import string
import os
f1 = open('VSERVICE.xml', 'r')
f2 = open('split.xml', 'w')
for line in f1:
    if str(line) == '</Service>\n':
        break
    else:
        f2.write(line)
        size = os.path.getsize('split.xml')
        print('size = ' + str(size))
Running this prints 0 as the file size for about 80 iterations and then 4176. Does Python store the output in a buffer before actually writing it out?
File size is different from file position. For example,
os.path.getsize('sample.txt')
This returns the exact file size in bytes.
But
f = open('sample.txt')
print f.readline()
f.tell()
Here f.tell() returns the current position of the file object, i.e. where the next read or write will take place. Since it is aware of the buffering, it should be accurate as long as you are simply appending to the output file.
Yes, Python is buffering your output. You'd be better off tracking the size yourself, something like this:
size = 0
for line in f1:
    if str(line) == '</Service>\n':
        break
    else:
        f2.write(line)
        size += len(line)
        print('size = ' + str(size))
(That might not be 100% accurate, eg. on Windows each line will gain a byte because of the \r\n line separator, but it should be good enough for simple chunking.)
Have you tried replacing os.path.getsize with the file object's tell() method, like this:
f2.write(line)
size = f2.tell()
Tracking the size yourself will be fine for your case. A different way would be to flush the file buffers just before you check the size:
f2.write(line)
f2.flush() # <-- buffers are written to disk
size = os.path.getsize('split.xml')
Doing that too often will slow down file I/O, of course.
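The relationship between tell(), flush() and os.path.getsize() can be checked with a small Python 3 sketch. The function name sizes_after_write is an assumption for this example, and it writes to a throwaway temporary file rather than split.xml.

```python
import os
import tempfile

# Hypothetical demo: compare the write position with the on-disk size after a flush.
def sizes_after_write(data):
    """Write `data` to a temp file; return (tell(), size on disk after flush())."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            position = f.tell()               # accounts for buffered data
            f.flush()                         # push the buffer to the OS
            on_disk = os.path.getsize(path)   # now matches the write position
    finally:
        os.remove(path)
    return position, on_disk

position, on_disk = sizes_after_write(b"x" * 10)
```

Before the flush() call, getsize() would typically still report a smaller size for a small write like this, which is the behaviour seen in the question.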
To find the offset to the end of a file:
file.seek(0,2)
print file.tell()
Real world example - read updates to a file and print them as they happen:
file = open('log.txt', 'r')
# find the initial end-of-file offset
file.seek(0,2)
eof = file.tell()
while True:
    # get the file size again
    file.seek(0,2)
    neweof = file.tell()
    # if the file is larger...
    if neweof > eof:
        # go back to the last position...
        file.seek(eof)
        # print from the last position to the current one
        print file.read(neweof-eof),
        eof = neweof