Limiting amount read using readline - python

I'm trying to read the first 100 lines of large text files. Simple code for doing this is shown below. The challenge, though, is that I have to guard against the case of corrupt or otherwise screwy files that don't have any line breaks (yes, people somehow figure out ways to generate these). In those cases I'd still like to read in data (because I need to see what's going on in there) but limit it to, say, n bytes.
The only way I can think of to do this is to read the file char by char. Other than being slow (probably not an issue for only 100 lines) I am worried that I'll run into trouble when I encounter a file using non-ASCII encoding.
Is it possible to limit the bytes read using readline()? Or is there a more elegant way to handle this?
line_count = 0
with open(filepath, 'r') as f:
    for line in f:
        line_count += 1
        print('{0}: {1}'.format(line_count, line))
        if line_count == 100:
            break
EDIT:
As @Fredrik correctly pointed out, readline() accepts an arg that limits the number of chars read (I'd thought it was a buffer size param). So, for my purposes, the following works quite well:
max_bytes = 1024*1024
bytes_read = 0
fo = open(filepath, "r")
line = fo.readline(max_bytes)
bytes_read += len(line)
line_count = 0
while line != '':
    line_count += 1
    print('{0}: {1}'.format(line_count, line))
    if (line_count == 100) or (bytes_read >= max_bytes):
        break
    else:
        line = fo.readline(max_bytes - bytes_read)
        bytes_read += len(line)
fo.close()
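For reference, a slightly tidier sketch of the same logic using a with block, so the file is closed even if something goes wrong (same assumptions as above: filepath names a text file):
max_bytes = 1024 * 1024
with open(filepath, "r") as fo:
    bytes_read = 0
    for line_count in range(1, 101):
        line = fo.readline(max_bytes - bytes_read)  # cap characters per call
        if not line:
            break                                   # EOF reached
        bytes_read += len(line)
        print('{0}: {1}'.format(line_count, line))
        if bytes_read >= max_bytes:
            break                                   # byte budget exhausted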

If you have a file:
f = open("a.txt", "r")
f.readline(size)
The size parameter specifies the maximum number of characters to read from the current line (bytes, if the file is opened in binary mode).
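A minimal sketch of that behavior, using an in-memory file so no assumptions about your data are needed:
from io import StringIO

f = StringIO("x" * 50 + "\n")  # one long line with no early line break
print(len(f.readline(10)))     # 10 -- the call stops after 10 characters
print(len(f.readline(100)))    # 41 -- the rest of the line, newline included
Each call picks up where the previous one stopped, so an unbroken line is consumed in bounded pieces.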

This checks for data with no line breaks:
f = open('abc.txt', 'r')
dodgy = False
if '\n' not in f.read(1024):
    print("Dodgy file - no linefeeds in the first KB")
    dodgy = True
f.seek(0)
if not dodgy:  # read the first 100 lines
    for x in range(1, 101):
        try:
            line = next(f)
        except StopIteration:
            break
        print('{0}: {1}'.format(x, line))
else:  # read the first n bytes
    line = f.read(1024)
    print('bytes: ' + line)
f.close()

Fastest way to read and delete N lines in python

First I read the file, something like this (I think this is the best way to read large files: Source):
N = 50
with open("ahref.txt", "r+") as f:
    link_list = [next(f).removesuffix("\n") for x in range(N)]
after that I run my code:
# My code here
After that I want to delete the first N lines (I read about it: Source).
# Source: https://stackoverflow.com/questions/4710067/how-to-delete-a-specific-line-in-a-file/28057753#28057753
with open("target.txt", "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if i != "line you want to remove...":
            f.write(i)
    f.truncate()
This code doesn't work for me, because I only read the first N lines.
I can remove lines:
with open("xml\\ahref.txt", "r+") as f:
N = 5
all_lines = f.readlines()
f.seek(0)
f.truncate()
f.writelines(all_lines[N:])
But there is a problem with that: I have to read all the lines and then write all the lines back, which is not fast. (There are many ways to do this, but they all need to read every line.)
What is the fastest way in terms of performance? The file is huge.
The fastest way is not to read the entire file into memory: use a temporary output file that you can then move over the original file if required.
Try:
import os

N = 50
mode = "r+"
if not os.path.isfile('output'):
    mode = "w+"
with open('input', 'r') as fin, open('output', mode) as fout:
    for index, line in enumerate(fout):
        N += 1  # skip lines already present in the output file
    for index, line in enumerate(fin):
        if index > N:
            fout.write(line)
# I haven't tested this; you may need index > N or index >= N
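To make the temporary-file idea concrete, here is a sketch (filenames are illustrative; the question's ahref.txt is assumed) that streams the input once, drops the first N lines, and atomically replaces the original:
import os
import tempfile

N = 50
src = "ahref.txt"  # illustrative path

with open(src, "r") as fin, tempfile.NamedTemporaryFile(
        "w", dir=os.path.dirname(os.path.abspath(src)),
        delete=False) as fout:
    for index, line in enumerate(fin):
        if index >= N:          # keep everything after the first N lines
            fout.write(line)
    tmp_name = fout.name
os.replace(tmp_name, src)       # atomic on the same filesystem
This never holds more than one line in memory, though it still copies the rest of the file once; there is no way to chop lines off the front of a file in place.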

Unable to remove line breaks in a text file in python

At the risk of losing reputation, I did not know what else to do. My file is not showing any hidden characters and I have tried every .replace and .strip I can think of. My file is UTF-8 encoded and I am using Python 3.6.1.
I have a file with the format:
>header1
AAAAAAAA
TTTTTTTT
CCCCCCCC
GGGGGGGG
>header2
CCCCCC
TTTTTT
GGGGGG
AAAAAA
I am trying to remove the line breaks at the end of each sequence line so that each sequence becomes one continuous string. (This file is actually thousands of lines long.)
My code is redundant in the sense that I typed in everything I could think of to remove line breaks:
fref = open(ref)
for line in fref:
    sequence = 0
    header = 0
    if line.startswith('>'):
        header = ''.join(line.splitlines())
        print(header)
    else:
        sequence = line.strip("\n").strip("\r")
        sequence = line.replace('\n', ' ').replace('\r', '').replace(' ', '').replace('\t', '')
        print(len(sequence))
output is:
>header1
8
8
8
8
>header2
6
6
6
6
But if I manually go in and delete the end-of-line characters, it shows up as one continuous string.
Expected output:
>header1
32
>header2
24
Thanks in advance for any help,
Dennis
There are several approaches to parsing this kind of input. In all cases, I would recommend isolating the open and print side-effects outside of a function that you can unit test to convince yourself of the proper behavior.
You could iterate over each line and handle the case of empty lines and end-of-file separately. Here, I use yield statements to return the values:
def parse(infile):
    for line in infile:
        if line.startswith(">"):
            total = 0
            yield line.strip()
        elif not line.strip():
            yield total
        else:
            total += len(line.strip())
    if line.strip():
        yield total
def test_parse():
    with open("input.txt") as infile:
        assert list(parse(infile)) == [
            ">header1",
            32,
            ">header2",
            24,
        ]
Or, you could handle both empty lines and end-of-file at the same time. Here, I use an output array to which I append headers and totals:
def parse(infile):
    output = []
    while True:
        line = infile.readline()
        if line.startswith(">"):
            total = 0
            header = line.strip()
        elif line and line.strip():
            total += len(line.strip())
        else:
            output.append(header)
            output.append(total)
            if not line:
                break
    return output
def test_parse():
    with open("input.txt") as infile:
        assert parse(infile) == [
            ">header1",
            32,
            ">header2",
            24,
        ]
Or, you could also split the whole input file into empty-line-separated blocks and parse them independently. Here, I use an output stream to which I write the output; in production, you could pass the sys.stdout stream for example:
import re

def parse(infile, outfile):
    content = infile.read()
    for block in re.split(r"\r?\n\r?\n", content):
        header, *lines = re.split(r"\s+", block)
        total = sum(len(line) for line in lines)
        outfile.write("{header}\n{total}\n".format(
            header=header,
            total=total,
        ))

from io import StringIO

def test_parse():
    with open("/tmp/a.txt") as infile:
        outfile = StringIO()
        parse(infile, outfile)
        outfile.seek(0)
        assert outfile.readlines() == [
            ">header1\n",
            "32\n",
            ">header2\n",
            "24\n",
        ]
Note that my tests use open("input.txt") for brevity but I would actually recommend passing a StringIO(...) instance instead to see the input being tested more easily, to avoid hitting the filesystem and to make the tests faster.
From my understanding of your question, you would like something like this:
Note how the sequence is built over multiple iterations of the loop, as you wish to combine multiple lines.
with open(ref) as f:
    sequence = ""  # reset sequence
    header = None
    for line in f:
        if line.startswith('>'):
            if header:
                print(header)         # print last header
                print(len(sequence))  # print last sequence
            sequence = ""             # reset sequence
            header = line[1:]         # store header
        else:
            sequence += line.rstrip() # append line to sequence
    if header:                        # don't forget to flush the final record
        print(header)
        print(len(sequence))

How to make log parsing faster for large text files

I have large (500,000 line) log files that I parse through for specified sections. When found the sections are printed to a Text widget. Even if I cut the readlines down to the last 50,000 lines it takes upwards of a minute or longer to finish.
with open(i, "r") as f:
r = f.readlines()
r = r[-50000:]
start = 0
for line in r:
if 'Start section' in line:
if start == 1:
cpfotxt.insert('end', line + "\n", 'hidden')
start = 1
if 'End section' in line:
start = 0
cpfotxt.insert('end', line + "\n")
if start == 1:
cpfotxt.insert('end', line + "\n")
f.close()
Any way to do this faster?
You should try to read it in chunks.
with open(...) as f:
    for line in f:
        <do something with line>
A clearer approach that can be applied in your case:
def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
Another possibility is to use seek to skip over a lot of lines. However, this requires that you have some idea of how large the last 50K lines might be. Instead of reading through all the early lines, jump close to the end:
with ... as f:
    f.seek(-50000 * 80, 2)  # rough guess of 80 bytes per line; needs binary mode
    # insert your processing here
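If the goal is just the last 50,000 lines without guessing an average line length, collections.deque with a maxlen is a standard-library alternative (a sketch, reusing the question's variable names):
from collections import deque

with open(i, "r") as f:          # 'i' is the file path variable from the question
    r = deque(f, maxlen=50000)   # keeps only the final 50,000 lines
This still reads the whole file once, but it never holds more than 50,000 lines in memory at a time.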

Sorting/Deleting File Lines - Python

I want to get rid of lines in a file that are less than 6 characters long, that is, delete the whole line when its string is shorter than 6 characters. I tried running this code, but it ended up deleting the whole text file. How would I go about this?
Code:
import linecache

i = 1
while i < 5:
    line = linecache.getline('file.txt', i)
    if len(line) < 6:
        str.replace(line, line, '')
    i += 1
Thanks in advance!
You'll want to use the open function instead of linecache:
def deleteShortLines():
    text = 'file.txt'
    f = open(text)
    output = []
    for line in f:
        if len(line) >= 6:
            output.append(line)
    f.close()
    f = open(text, 'w')
    f.writelines(output)
    f.close()
Done with iterators instead of lists to support very long files:
with open('file.txt', 'r') as input_file:
    # iterating over a file object yields its lines one at a time;
    # keep only lines with at least 6 characters
    filtered_lines = (line for line in input_file if len(line) >= 6)
    # write the kept lines to a new file
    with open('output_file.txt', 'w') as output_file:
        output_file.writelines(filtered_lines)

What is the most efficient way to get first and last line of a text file?

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?
Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.
To read both the first and final line of a file you could...
open the file, ...
... read the first line using built-in readline(), ...
... seek (move the cursor) to the end of the file, ...
... step backwards until you encounter EOL (line break) and ...
... read the last line from there.
def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found ...
        f.seek(-2, 1)          # ... jump back, over the read byte plus one more.
    return f.read()            # Read all data from this point on.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)
Jump to the second last byte directly to prevent a trailing newline character from causing an empty line to be returned*.
The current offset is pushed ahead by one every time a byte is read, so the stepping backwards is done two bytes at a time: past the recently read byte and over the byte to read next.
The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...
0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
* This is what you would expect, as the default behavior of most applications, including print and echo, is to append a newline to every line written; it has no effect on lines missing a trailing newline character.
Efficiency
1-2 million lines each and I have to do this for several hundred files.
I timed this method and compared it against the top answer.
10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.
Millions of lines would increase the difference a lot more.
Exact code used for timing:
with open(file, "rb") as f:
first = f.readline() # Read and store the first line.
for last in f: pass # Read all lines, keep final value.
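The timing harness itself isn't shown; a minimal sketch of how such a comparison could be reproduced with timeit (function names, the path, and the iteration count are assumptions, not the setup actually used):
import timeit

def naive(path):
    with open(path, "rb") as f:
        first = f.readline()
        for last in f: pass            # scan every line to reach the last
        return first, last             # assumes at least two lines

def seeking(path):
    with open(path, "rb") as f:
        first = f.readline()
        return first, readlastline(f)  # readlastline as defined above

print(timeit.timeit(lambda: naive("big.log"), number=100))
print(timeit.timeit(lambda: seeking("big.log"), number=100))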
Amendment
A more complex, and harder to read, variation to address comments and issues raised since.
Return empty string when parsing empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, noted by comment.
Also adds support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).
Please note that this variation is really slow for large files because of the non-relative offsets needed in text mode. Modify it to your needs, or do not use it at all, as you're probably better off using f.readlines()[-1] with files opened in text mode.
#!/bin/python3
from os import SEEK_END

def readlast(f, sep, fixed=True):
    r"""Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype: bytes, str
    """
    bs = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o - bs - step)     # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep:      # - Until reaching 'sep': read sep-sized block
            o = f.seek(o - step)      #   and then seek to the block to read next.
    except (OSError, ValueError):     # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO
    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"
    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'
    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"
    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"
    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"
    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"
    # Make sure you use the correct delimiters.
    seps = {'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00'}
    assert "\n".encode('utf8') == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']
    # Edge cases.
    edges = (
        # Text , Match
        ("",     ""),    # Empty file, empty string.
        ("X",    "X"),   # No delimiter, full content.
        ("\n",   "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc, sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline(), end="")
            print(readlast(f, "\n"), end="")
See the docs for the io module.
with open(fname, 'rb') as fh:
    first = next(fh).decode()
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()
The variable value here is 1024: it represents the average line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could just use that value times 2.
Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:
for line in fh:
    pass
last = line
You don't need to bother with the binary flag; you could just use open(fname).
ETA: Since you have many files to work on, you could create a sample of a couple dozen files using random.sample and run this code on them to determine the length of the last line, with an a priori large value for the position shift (say, 1 MB). This will help you estimate the value for the full run.
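A sketch of that sampling idea (all_files is a hypothetical list of paths, and the 1 MB shift assumes every sampled file is at least that large):
import random

for fname in random.sample(all_files, 24):
    with open(fname, 'rb') as fh:
        fh.seek(-1024 * 1024, 2)               # jump 1 MB back from the end
        print(fname, len(fh.readlines()[-1]))  # observed last-line length
The largest observed length, plus some margin, then becomes the position shift for the full run.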
Here's a modified version of SilentGhost's answer that will do what you want.
with open(fname, 'rb') as fh:
    first = next(fh)
    offs = -100
    while True:
        fh.seek(offs, 2)
        lines = fh.readlines()
        if len(lines) > 1:
            last = lines[-1]
            break
        offs *= 2
print(first)
print(last)
No need for an upper bound for line length here.
Can you use unix commands? I think using head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1], but that may take too much memory.
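If shelling out is acceptable, here is a sketch of that approach from Python (it assumes a Unix-like system with head and tail on the PATH):
import subprocess

def first_and_last(fname):
    # Each command reads only what it needs; tail seeks in from the end.
    first = subprocess.run(["head", "-1", fname],
                           capture_output=True, text=True).stdout
    last = subprocess.run(["tail", "-n", "1", fname],
                          capture_output=True, text=True).stdout
    return first, last
Note that capture_output requires Python 3.7+; on older versions, pass stdout=subprocess.PIPE instead.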
This is my solution, compatible also with Python 3. It also manages border cases, but it misses UTF-16 support:
import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """
    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath
    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1
        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)
            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)
        if start_pos == 0:
            f.seek(start_pos)
        else:
            char = ""
            for pos in range(start_pos, -1, -1):
                f.seek(pos)
                char = f.read(1)
                if char == b"\n":
                    break
        return f.readline()
It's inspired by Trasp's answer and AnotherParker's comment.
First open the file in read mode. Then use the readlines() method to read line by line. All the lines are stored in a list. Now you can use list indexing to get the first and last lines of the file.
a = open('file.txt', 'rb')
lines = a.readlines()
if lines:
    first_line = lines[0]
    last_line = lines[-1]
w = open('file.txt', 'r')
print('first line is : ', w.readline())
for line in w:
    x = line
print('last line is : ', x)
w.close()
The for loop runs through the lines and x gets the last line on the final iteration.
with open("myfile.txt") as f:
lines = f.readlines()
first_row = lines[0]
print first_row
last_row = lines[-1]
print last_row
Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.
def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()           # Read the first line.
        f.seek(-2, 2)                  # Jump to the second last byte.
        while f.read(1) != b"\n":      # Until EOL is found...
            try:
                f.seek(-2, 1)          # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()            # Read last line.
    return last
Nobody mentioned using reversed:
f=open(file,"r")
r=reversed(f.readlines())
last_line_of_file = r.next()
Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount from SEEK_END find the second to last line ending and then readline() the last line.
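A sketch of that description (LINE_MAX is an assumed upper bound on line length, not something from the original answer):
import os

LINE_MAX = 4096  # assumed upper bound on line length

def last_line(filename):
    fd = os.open(filename, os.O_RDONLY)
    try:
        os.lseek(fd, -LINE_MAX, os.SEEK_END)   # jump to near the end
        return os.read(fd, LINE_MAX).splitlines()[-1]
    finally:
        os.close(fd)
This raises OSError on files shorter than LINE_MAX, so a fallback to reading the whole file would be needed for those.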
with open(filename, "rb") as f:#Needs to be in binary mode for the seek from the end to work
first = f.readline()
if f.read(1) == '':
return first
f.seek(-2, 2) # Jump to the second last byte.
while f.read(1) != b"\n": # Until EOL is found...
f.seek(-2, 1) # ...jump back the read byte plus one more.
last = f.readline() # Read last line.
return last
This is a modified version of the answers above that also handles the case where there is only one line in the file.
