Parse blocks of text from a text file using Python

I am trying to parse some text files and need to extract blocks of text: specifically, the line that starts with "1:" and the 19 lines after it. The "1:" does not start on the same row in each file, and there is only one instance of "1:" per file. I would prefer to save the block of text and export it to a separate file, preserving the formatting of the text in the original file.
Needless to say I am new to Python. I generally work with R but these files are not really compatible with R and I have about 100 to process. Any information would be appreciated.
The code that I have so far is:
tmp = open(files[0], "r")
lines = tmp.readlines()
tmp.close()

num = 0
a = 0
for line in lines:
    num += 1
    if "1:" in line:
        a = num
        break
a = num is the line number for the block of text I want. I then want to save the next 19 lines to another file, but can't figure out how to do this. Any help would be appreciated.

Here is one option. Read all lines from your file, iterate until you find your line, and return the next 19 lines. You would need to handle the situation where your file doesn't contain 19 additional lines.
fh = open('yourfile.txt', 'r')
all_lines = fh.readlines()
fh.close()

def next_19_lines(all_lines):
    for count, line in enumerate(all_lines):
        if "1:" in line:
            return all_lines[count + 1:count + 20]
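Putting it together for a whole batch of files (a sketch, not tested against your data; extract_block and the warning messages are my own names, files is the list from your question, and the length check covers files with fewer than 19 lines after the match):

def extract_block(in_name, out_name, n_after=19):
    with open(in_name, 'r') as fh:
        all_lines = fh.readlines()
    for count, line in enumerate(all_lines):
        if "1:" in line:
            block = all_lines[count:count + 1 + n_after]
            if len(block) < 1 + n_after:
                print("warning: %s ends fewer than %d lines after the match" % (in_name, n_after))
            with open(out_name, 'w') as out:
                out.writelines(block)  # writelines keeps the original formatting
            return
    print("warning: no '1:' line in %s" % in_name)

for i, name in enumerate(files):
    extract_block(name, "block_%03d.txt" % i)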

Could be done in a one-liner...
open(files[0]).read().split('1:', 1)[1].split('\n')[:20]
or more readable
txt = open(files[0]).read() # read the file into a big string
before, after = txt.split('1:', 1) # split the file on the first "1:"
after_lines = after.split('\n') # create lines from the after text
lines_to_save = after_lines[:20] # the rest of the "1:" line plus the next 19 lines
then join the lines with a newline (and add a newline to the end) before writing it to a new file:
out_text = "1:" # add back the "1:" that split() removed
out_text += "\n".join(lines_to_save) # join the saved lines with newlines between them
out_text += "\n" # add a newline at the end
open("outputfile.txt", "w").write(out_text)
To comply with best practice for reading and writing files, you should also use the with statement to ensure that the file handles are closed as soon as possible. You can create convenience functions for it:
def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()

def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)
then you can replace the first line above with:
txt = read_file(files[0])
and the last line with:
write_file("outputfile.txt", out_text)

I always prefer to read the file into memory first, but sometimes that's not possible. If you want to use iteration then this will work:
def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return  # no '1:' in file
        yield line  # yield the line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line

if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)
It's using the else: clause on a for-loop, which you don't see very often anymore, but it was created for just this purpose (the else clause is executed if the for-loop finishes without hitting break).
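If the for-else construct is unfamiliar, here is a minimal standalone demonstration:

for n in [1, 3, 5]:
    if n % 2 == 0:
        print("found an even number:", n)
        break
else:
    # runs only because the loop finished without a break
    print("no even number found")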

Related

How can I handle multiple lines at once while reading from a file?

The standard Python approach to working with files is to use open() to create a "file object" f, which lets you either load the entire file into memory at once using f.read() or read lines one-by-one using a for loop:
with open('filename') as f:
    # 1) Read all lines at once into memory:
    all_data = f.read()
    # 2) Or (instead) read lines one-by-one:
    for line in f:
        ...  # work with each line
I'm searching through several large files looking for a pattern that might span multiple lines. The most intuitive way to do this is to read line-by-line looking for the beginning of the pattern, and then to load in the next few lines to see where it ends:
with open('large_file') as f:
    # Read lines one-by-one:
    for line in f:
        if line.startswith("beginning"):
            # Load in the next line, i.e.
            nextline = f.getline(line+1) # ??? #
            # or something
The line I've marked with # ??? # is my own pseudocode for what I imagine this should look like.
My question is, does this exist in Python? Is there any method for me to access other lines as needed while keeping the cursor at line and without loading the entire file into memory?
Edit: Inferring from the responses here and other reading, the answer is "No."
Like this:
gather = []
for line in f:
    if gather:
        gather.append(line)
        if "ending" in line:
            process(''.join(gather))
            gather = []
    elif line.startswith("beginning"):
        gather = [line]
Although in many cases it's easier just to load the whole file into a string and search it.
You may want to rstrip the newline before appending the line.
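For the load-everything approach, a regular expression can match the whole block in one pass (a sketch using the same "beginning"/"ending" markers and the process() placeholder from above):

import re

with open('large_file') as f:
    text = f.read()

# re.DOTALL lets '.' match newlines, so one pattern can span several lines
for match in re.finditer(r'beginning.*?ending', text, re.DOTALL):
    process(match.group(0))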
Just store the interesting lines into a list while going line-wise through the file:
with open("file.txt","w") as f:
f.write("""
a
b
------
c
d
e
####
g
f""")
interesting_data = []
inside = False
with open ("file.txt") as f:
for line in f:
line = line.strip()
# start of interesting stuff
if line.startswith("---"):
inside = True
# end of interesting stuff
elif line.startswith("###"):
inside = False
# adding interesting bits
elif inside:
interesting_data.append(line)
print(interesting_data)
to get
['c', 'd', 'e']
I think you're looking for .readline(), which does exactly that. Here is a sketch to proceed to the line where a pattern starts.
with open('large_file') as f:
    line = f.readline()
    while not line.startswith("beginning"):
        line = f.readline()
        # end of file
        if not line:
            print("EOF")
            break
    # do_something with line, get additional lines by
    # calling .readline() again, etc.

Parsing Logs with Regular Expressions in Python

I'm a coding and Python lightweight. :)
I've gotta iterate through some logfiles and pick out the lines that say ERROR. Boom, done, got that. What I've gotta do now is figure out how to grab the following 10 lines containing the details of the error. It's gotta be some combo of an if statement and a for/while loop, I presume. Any help would be appreciated.
import os
import re

# Regex used to match
line_regex = re.compile(r"ERROR")

# Output file, where the matched log lines will be copied to
output_filename = os.path.normpath("NodeOut.log")

# Overwrite the file, ensuring we're starting out with a blank file
# TODO Append this later
with open(output_filename, "w") as out_file:
    out_file.write("")

# Open output file in 'append' mode
with open(output_filename, "a") as out_file:
    # Open input file in 'read' mode
    with open("MXNode1.stdout", "r") as in_file:
        # Loop over each log line
        for line in in_file:
            # If log line matches our regex, print (remove later) and write to file
            if line_regex.search(line):
                # for i in range():
                print(line)
                out_file.write(line)
There is no need for regex to do this; you can just use the in operator ("ERROR" in line).
Also, to clear the content of the file without opening it in w mode, you can simply place the cursor at the beginning of the file and truncate.
import os

output_filename = os.path.normpath("NodeOut.log")
with open(output_filename, 'a') as out_file:
    out_file.seek(0, 0)
    out_file.truncate(0)
    with open("MXNode1.stdout", 'r') as in_file:
        line = in_file.readline()
        while line:
            if "ERROR" in line:
                out_file.write(line)
                for i in range(10):
                    out_file.write(in_file.readline())
            line = in_file.readline()
We use a while loop to read lines one by one with in_file.readline(). The advantage is that you can easily consume the following lines inside the loop, here with a for loop that calls readline() ten more times.
See the doc:
f.readline() reads a single line from the file; a newline character (\n) is left at the end of the string, and is only omitted on the last line of the file if the file doesn’t end in a newline. This makes the return value unambiguous; if f.readline() returns an empty string, the end of the file has been reached, while a blank line is represented by '\n', a string containing only a single newline.
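A quick standalone illustration of that distinction (demo.txt is a hypothetical file created just for the example):

with open("demo.txt", "w") as f:
    f.write("first\n\nlast\n")

with open("demo.txt") as f:
    print(repr(f.readline()))  # 'first\n'
    print(repr(f.readline()))  # '\n'  -> a blank line, not the end
    print(repr(f.readline()))  # 'last\n'
    print(repr(f.readline()))  # ''    -> empty string means end of file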
Assuming you always want to grab the next 10 lines, you could do something similar to:
with open("MXNode1.stdout", "r") as in_file:
# Loop over each log line
lineCount = 11
for line in in_file:
# If log line matches our regex, print remove later, and write > file
if (line_regex.search(line)):
# for i in range():
print(line)
lineCount = 0
if (lineCount < 11):
lineCount += 1
out_file.write(line)
The second if statement will help you always grab the line. The magic number of 11 is so that you grab the next 10 lines after the initial line that the ERROR was found on.
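An alternative that avoids the counter (my own sketch, not from the original post): itertools.islice can pull the next 10 lines straight off the file iterator once a match is found.

import re
from itertools import islice

line_regex = re.compile(r"ERROR")

with open("MXNode1.stdout") as in_file, open("NodeOut.log", "w") as out_file:
    for line in in_file:
        if line_regex.search(line):
            out_file.write(line)
            # islice advances the same iterator, so these 10 lines
            # are consumed here and not scanned for ERROR again
            out_file.writelines(islice(in_file, 10))

Note that this skips scanning the 10 captured lines for further ERRORs, which may or may not be what you want.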

Read file and find if all lines are the same length

Using Python, I need to read a file and determine if all lines are the same length or not. If they are, I move the file into a "good" folder; if they aren't all the same length, I move it into a "bad" folder and write a Word doc that says which line was not the same as the rest. Any help or ways to start?
You should use all():
with open(filename) as read_file:
    length = len(read_file.readline())
    if all(len(line) == length for line in read_file):
        ...  # Move to good folder
    else:
        ...  # Move to bad folder
Since all() is short-circuiting, it will stop reading the file at the first non-match.
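To actually move the file, shutil.move works here (a sketch; the "good" and "bad" folder names are placeholders for your own paths):

import os
import shutil

def sort_file(path):
    with open(path) as read_file:
        length = len(read_file.readline())
        ok = all(len(line) == length for line in read_file)
    dest = "good" if ok else "bad"
    shutil.move(path, os.path.join(dest, os.path.basename(path)))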
First off, you can read the file (here example.txt) and put all lines in a list, content:

with open(filename) as f:
    content = f.readlines()
Next you need to trim the newline character from the end of each line and put the results in another list, result:

result = []
for line in content:
    line = line.strip()
    result.append(line)
Now it's not that hard to get the length of every line, and since you want to find the lines that are bad, you loop through the list:

lengths = []
for line in result:
    lengths.append(len(line))
So the i-th element of result has length lengths[i]. We can find the line length that occurs most often in the list; it is as simple as one line!

most_occurring = max(set(lengths), key=lengths.count)
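Equivalently, collections.Counter from the standard library gives the most common length (an alternative, not part of the original answer):

from collections import Counter

most_occurring = Counter(lengths).most_common(1)[0][0]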
Now we can make another for-loop to check which lengths don't correspond with the most common one, and add those lines to bad_lines:

bad_lines = []
for i in range(len(lengths)):
    if lengths[i] != most_occurring:
        bad_lines.append([i, result[i]])
The next step is to check where the file needs to go, the good folder or the bad folder:

if len(bad_lines) == 0:
    # Good file, move it to the good folder, using the os or shutil module
    os.rename("path/to/current/file.foo", "path/to/good/file.foo")
else:
    # Bad file, one or more lines are bad, so move it to the bad folder
    os.rename("path/to/current/file.foo", "path/to/bad/file.foo")
The last step is writing the bad lines to another file, which is do-able, since we have the bad lines already in a list bad_lines:
with open("bad_lines.txt", "wb") as f:
for bad_line in bad_lines:
f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))
It's not a doc file, but I think this is a nice start. You can take a look at the docx module if you really want to write to a doc file.
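If a real .docx file is required, the third-party python-docx package (pip install python-docx) can write one; a minimal sketch, assuming bad_lines as built above:

from docx import Document

doc = Document()
doc.add_paragraph("Lines that differ from the most common length:")
for index, text in bad_lines:
    doc.add_paragraph("line %d: %s" % (index, text))
doc.save("bad_lines.docx")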
EDIT: Here is an example Python script.

with open("example.txt") as f:
    content = f.readlines()

result = []
lengths = []

# Strip the file of \n
for line in content:
    line = line.strip()
    result.append(line)
    lengths.append(len(line))

most_occurring = max(set(lengths), key=lengths.count)

bad_lines = []
for i in range(len(lengths)):
    if lengths[i] != most_occurring:
        # Append the bad line to bad_lines
        bad_lines.append([i, result[i]])

# Check if it's a good or a bad file
# if len(bad_lines) == 0:
#     good file, move it to the good folder...
# else:
#     bad file, move it to the bad folder...

with open("bad_lines.txt", "w") as f:
    for bad_line in bad_lines:
        f.write("[%3i] %s\n" % (bad_line[0], bad_line[1]))

Ways to read/edit multiple lines in Python

What I'm trying to do is take 4 lines at a time from a file that looks like this:
#blablabla
blablabla #this string needs to match the amount of characters in line 4
!blablabla
blablabla #there is a string here
This goes on a few hundred times.
I read the entire thing line by line, make a change to the fourth line, then want to match the second line's character count to the amount in the fourth line.
I can't figure out how to "backtrack" and change the second line after making changes to the fourth.
with fileC as inputA:
    for line1 in inputA:
        line2 = next(inputA)
        line3 = next(inputA)
        line4 = next(inputA)
is what I'm currently using, because it lets me handle 4 lines at the same time, but there has to be a better way, as this causes all sorts of problems when writing the file back out. What could I use as an alternative?
You could do:

with open(fileC, 'r') as f:
    lines = f.readlines()  # readlines creates a list of the lines

To access line 4 and do something with it, you would access:

lines[3]  # as lines is a list

and for line 2:

lines[1]  # etc.

You could then write your lines back into a file if you wish.
EDIT:
Regarding your comment, perhaps something like this:
def change_lines(fileC):
    with open(fileC, 'r') as f:
        while True:
            lines = []
            for i in range(4):
                try:
                    lines.append(next(f))  # next(f) returns the next line in the file
                except StopIteration:  # this happens if you reach end of file before finding 4 more lines
                    # decide what you want to do here
                    return
            # otherwise this will happen
            lines[1] = lines[3]  # or whatever you want to do here
            # maybe write them to a new file
            # remember you're still within the while loop here
EDIT:
Since your file divides into fours evenly, this works:
def change_lines(fileC):
    with open(fileC, 'r') as f:
        while True:
            lines = []
            for i in range(4):
                try:
                    lines.append(next(f))
                except StopIteration:
                    return
            # do something with lines here,
            # and write to a new file, etc.
Another way to do it:
import sys
from itertools import islice

def read_in_chunks(file_path, n):
    with open(file_path) as fh:
        while True:
            lines = list(islice(fh, n))
            if lines:
                yield lines
            else:
                break

for lines in read_in_chunks(sys.argv[1], 4):
    print(lines)
Also relevant is the grouper() recipe in the itertools module. In that case, you would need to filter out the None values before yielding them to the caller.
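For reference, the grouper() recipe looks roughly like this (adapted from the itertools documentation; fileC.txt is a placeholder name, and the None filtering is the step mentioned above):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks, padding the last one with fillvalue."
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open('fileC.txt') as f:
    for group in grouper(f, 4):
        lines = [line for line in group if line is not None]
        # work with up to 4 lines here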
You could read the file with .readlines, then index whichever line you want to change and write the lines back to a file:

rf = open('/path/to/file')
file_lines = rf.readlines()
rf.close()

file_lines[1] = file_lines[3]  # trim/edit however you'd like

wf = open('/path/to/write', 'w')
wf.writelines(file_lines)
wf.close()

Read a line from a file without getting "\n" appended at the end [duplicate]

This question already has answers here:
How to read a file without newlines?
(12 answers)
My file is "xml.txt" with following contents:
books.xml
news.xml
mix.xml
If I use the readline() function, it leaves "\n" at the end of each file name, which is an error because I want to open the files listed within xml.txt. I wrote this:

fo = open("xml.txt", "r")
for i in range(count.__len__()):  # here count is one of the arrays that I'm using
    file = fo.readline()
    find_root(file)  # find_root is my own function, not shown here
Error encountered on running this code:
IOError: [Errno 2] No such file or directory: 'books.xml\n'
To remove just the newline at the end:
line = line.rstrip('\n')
The reason readline keeps the newline character is so you can distinguish between an empty line (has the newline) and the end of the file (empty string).
From Best method for reading newline delimited files in Python and discarding the newlines?
lines = open(filename).read().splitlines()
You could use the .rstrip() method of string objects to get a version with trailing whitespace (including newlines) removed.
E.g.:
find_root(file.rstrip())
I timed it just out of curiosity. Below are the results for a very large file.
tldr;
File read then split seems to be the fastest approach on a large file.
with open(FILENAME, "r") as file:
    lines = file.read().split("\n")

However, if you need to loop through the lines anyway then you probably want:

with open(FILENAME, "r") as file:
    for line in file:
        line = line.rstrip("\n")
Python 3.4.2
import timeit

FILENAME = "mylargefile.csv"
DELIMITER = "\n"

def splitlines_read():
    """Read the file then split the lines with the splitlines builtin method.

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = file.read().splitlines()
    return lines
# end splitlines_read

def split_read():
    """Read the file then split the lines.

    This method will return empty strings for blank lines (same as the other methods).
    This method may also have an extra additional element as an empty string (compared to
    splitlines_read).

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = file.read().split(DELIMITER)
    return lines
# end split_read

def strip_read():
    """Loop through the file and create a new list of lines, removing any "\n" with rstrip.

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = [line.rstrip(DELIMITER) for line in file]
    return lines
# end strip_read

def strip_readlines():
    """Loop through the file's readlines() and create a new list of lines, removing any "\n"
    with rstrip. ... will probably be slower than strip_read, but might as well test everything.

    Returns:
        lines (list): List of file lines.
    """
    with open(FILENAME, "r") as file:
        lines = [line.rstrip(DELIMITER) for line in file.readlines()]
    return lines
# end strip_readlines

def compare_times():
    run = 100
    splitlines_t = timeit.timeit(splitlines_read, number=run)
    print("Splitlines Read:", splitlines_t)
    split_t = timeit.timeit(split_read, number=run)
    print("Split Read:", split_t)
    strip_t = timeit.timeit(strip_read, number=run)
    print("Strip Read:", strip_t)
    striplines_t = timeit.timeit(strip_readlines, number=run)
    print("Strip Readlines:", striplines_t)
# end compare_times

def compare_values():
    """Compare the values of the file.

    Note: split_read fails, because it has an extra empty string in the list of lines.
    That's the only reason why it fails.
    """
    splr = splitlines_read()
    sprl = split_read()
    strr = strip_read()
    strl = strip_readlines()
    print("splitlines_read")
    print(repr(splr[:10]))
    print("split_read", splr == sprl)
    print(repr(sprl[:10]))
    print("strip_read", splr == strr)
    print(repr(strr[:10]))
    print("strip_readlines", splr == strl)
    print(repr(strl[:10]))
# end compare_values

if __name__ == "__main__":
    compare_values()
    compare_times()
Results:
run = 1000
Splitlines Read: 201.02846901328783
Split Read: 137.51448011841822
Strip Read: 156.18040391519133
Strip Readline: 172.12281272950372
run = 100
Splitlines Read: 19.956802833188124
Split Read: 13.657361738959867
Strip Read: 15.731161020969516
Strip Readlines: 17.434831199281092
run = 100
Splitlines Read: 20.01516321280158
Split Read: 13.786344555543899
Strip Read: 16.02410587620824
Strip Readlines: 17.09326775703279
File read then split seems to be the fastest approach on a large file.
Note: read then split("\n") will have an extra empty string at the end of the list.
Note: read then splitlines() checks for more than just "\n", e.g. "\r\n".
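A small demonstration of both notes:

text = "a\r\nb\nc\n"

print(text.split("\n"))   # ['a\r', 'b', 'c', ''] -> extra '' and a stray '\r'
print(text.splitlines())  # ['a', 'b', 'c']       -> handles '\r\n', no extra ''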
It's better style to use a context manager for the file, and len() instead of calling .__len__()
with open("xml.tx","r") as fo:
for i in range(len(count)): #here count is one of may arrays that i'm using
file = next(fo).rstrip("\n")
find_root(file) # here find_root is my own created function not displayed here
To remove the newline character from the end you could also use something like this:

for line in file:
    print(line[:-1])

Note that slicing off the last character also removes a real character when the final line doesn't end in a newline, so rstrip("\n") is safer.
A use case with @Lars Wirzenius's answer:

with open("list.txt", "r") as myfile:
    for lines in myfile:
        lines = lines.rstrip('\n')  # the trick
        try:
            with open(lines) as myFile:
                print("ok")
        except IOError as e:
            print("file does not exist")
# mode: 'r', 'w', 'a'
f = open("ur_filename", "r")
fn = open("out_filename", "w")  # output file to write the stripped lines to
for t in f:
    if t:
        fn.write(t.rstrip("\n"))

The "if" condition checks whether the line contains anything; if it does, the next line strips the "\n" at the end and writes the result to the output file.
Code tested. ;)
