Most pythonic way to break up highly branched parser - python

I'm working on a parser for a specific type of file that is broken up into sections by some header keyword followed by a bunch of heterogeneous data. Headers are always separated by blank lines. Something along the lines of the following:
Header_A
1 1.02345
2 2.97959
...
Header_B
1 5.1700 10.2500
2 5.0660 10.5000
...
Every header contains very different types of data and, depending on certain keywords within a block, the data must be stored in different locations. The general approach I took is to have a regex that catches all of the keywords that can define a header and then iterate through the lines in the file. Once I find a match, I pop lines until I reach a blank line, storing all of the data from those lines in the appropriate locations.
This is the basic structure of the code where "do stuff with current_line" will involve a bunch of branches depending on what the line contains:
headers = re.compile(r"""
    ((?P<header_a>Header_A)
    |
    (?P<header_b>Header_B))
    """, re.VERBOSE)

i = 0
while i < len(data_lines):
    match = headers.match(data_lines[i])
    if match:
        if match.group('header_a'):
            # pop the header line and the line after it
            data_lines.pop(i)
            data_lines.pop(i)
            # not end of file, not blank line
            while i < len(data_lines) and data_lines[i].strip():
                current_line = data_lines.pop(i)
                # do stuff with current_line
        elif match.group('header_b'):
            data_lines.pop(i)
            data_lines.pop(i)
            while i < len(data_lines) and data_lines[i].strip():
                current_line = data_lines.pop(i)
                # do stuff with current_line
        else:
            i += 1
    else:
        i += 1
Everything works correctly, but it amounts to a highly branched structure that I find illegible and likely hard to follow for anyone unfamiliar with the code. It also makes it difficult to keep lines under 79 characters and, more generally, doesn't feel very pythonic.
One thing I'm working on is separating the branch for each header into separate functions. This will hopefully improve readability quite a bit but...
...is there a cleaner way to perform the outer looping/matching structure? Maybe using itertools?
Also for various reasons this code must be able to run in 2.7.

You could use itertools.groupby to group the lines according to which processing function you wish to perform:
import itertools as IT

def process_a(lines):
    for line in lines:
        line = line.strip()
        if not line: continue
        print('processing A: {}'.format(line))

def process_b(lines):
    for line in lines:
        line = line.strip()
        if not line: continue
        print('processing B: {}'.format(line))

def header_func(line):
    if line.startswith('Header_A'):
        return process_a
    elif line.startswith('Header_B'):
        return process_b
    else:
        return None  # you could omit this, but it might be nice to be explicit

func = None  # no processing function until the first header is seen
with open('data', 'r') as f:
    for key, lines in IT.groupby(f, key=header_func):
        if key is None:
            if func is not None:
                func(lines)
        else:
            func = key
Applied to the data you posted, the above code prints
processing A: 1 1.02345
processing A: 2 2.97959
processing A: ...
processing B: 1 5.1700 10.2500
processing B: 2 5.0660 10.5000
processing B: ...
The one complicated line in the code above is
for key, lines in IT.groupby(f, key=header_func):
Let's try to break it down into its component parts:
In [31]: f = open('data')
In [32]: list(IT.groupby(f, key=header_func))
Out[32]:
[(<function __main__.process_a>, <itertools._grouper at 0xa0efecc>),
(None, <itertools._grouper at 0xa0ef7cc>),
(<function __main__.process_b>, <itertools._grouper at 0xa0eff0c>),
(None, <itertools._grouper at 0xa0ef84c>)]
IT.groupby(f, key=header_func) returns an iterator. The items yielded by the iterator are 2-tuples, such as
(<function __main__.process_a>, <itertools._grouper at 0xa0efecc>)
The first item in the 2-tuple is the value returned by header_func. The second item in the 2-tuple is an iterator. This iterator yields the consecutive lines from f for which header_func(line) returns the same value.
Thus, IT.groupby is grouping the lines in f according to the return value of header_func. When the line in f is a header line -- either Header_A or Header_B -- then header_func returns process_a or process_b, the function we wish to use to process subsequent lines.
When the line in f is a header line, the group of lines returned by IT.groupby (the second item in the 2-tuple) is short and uninteresting -- it is just the header line.
We need to look in the next group for the interesting lines. For these lines, header_func returns None.
So we need to look at two 2-tuples: the first 2-tuple yielded by IT.groupby gives us the function to use, and the second 2-tuple gives the lines to which the header function should be applied.
Once you have both the function and the iterator with the interesting lines, you just call func(lines) and you're done!
Notice that it would be very easy to expand this to process other kinds of headers. You would only need to write another process_* function, and modify header_func to return process_* when the line indicates to do so.
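For instance, here is a minimal sketch of what supporting a hypothetical Header_C could look like; process_c follows the same shape as the functions above:

def process_c(lines):
    for line in lines:
        line = line.strip()
        if not line: continue
        print('processing C: {}'.format(line))

def header_func(line):
    if line.startswith('Header_A'):
        return process_a
    elif line.startswith('Header_B'):
        return process_b
    elif line.startswith('Header_C'):  # the only change: one extra branch
        return process_c
    else:
        return None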
Edit: I removed the use of izip(*[iterator]*2) since
it assumes the first line is a header line. The first line could be blank or a non-header line, which would throw everything off. I replaced it with some if-statements. It's not quite as succinct, but the result is a bit more robust.

How about splitting out the logic for parsing the different headers' data into separate functions, then using a dictionary to map from the given header to the right one:
def parse_data_a(iterator):
    next(iterator)  # throw away the blank line after the header
    for line in iterator:
        if not line.strip():
            break  # bail out if we find a blank line; another header is about to start
        # do stuff with each line here
# define similar functions to parse other blocks of data, e.g. parse_data_b()
# define a mapping from header strings to the functions that parse the following data
parser_for_header = {"Header_A": parse_data_a} # put other parsers in here too!
def parse(lines):
    iterator = iter(lines)
    for line in iterator:
        header = line.strip()
        if header in parser_for_header:
            parser_for_header[header](iterator)
This code uses iteration, rather than indexing, to handle the lines. An advantage of this is that you can run it directly on a file as well as on a list of lines, since files are iterable. It also makes the bounds checking very easy, since a for loop ends automatically when there's nothing left in the iterable, as well as when a break statement is hit.
Depending on what you're doing with the data you're parsing, you may need to have the individual parsers return something, rather than just going off and doing their own thing. In that case, you'll need some logic in the top-level parse function to get the results and assemble it into some useful format. Perhaps a dictionary would make the most sense, with the last line becoming:
results_dict[header] = parser_for_header[header](iterator)
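A sketch of what the top-level function might then look like, assuming each parse_data_* function is changed to return its parsed values:

def parse(lines):
    results_dict = {}
    iterator = iter(lines)
    for line in iterator:
        header = line.strip()
        if header in parser_for_header:
            results_dict[header] = parser_for_header[header](iterator)
    return results_dict

Since files are iterable, this still runs directly on an open file, e.g. with open('data') as f: results = parse(f).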

You can do it with the send function of generators as well :)
data_lines = [
    'Header_A ',
    '',
    '',
    '1 1.02345',
    '2 2.97959',
    '',
]

def process_header_a(line):
    while True:
        line = yield line
        # process line
        print 'A', line

header_processors = {
    'Header_A': process_header_a(None),
}

current_processor = None
for line in data_lines:
    line = line.strip()
    if line in header_processors:
        current_processor = header_processors[line]
        current_processor.send(None)
    elif line:
        current_processor.send(line)

for processor in header_processors.values():
    processor.close()
You can remove all if conditions from the main loop if you replace
current_processor = None
for line in data_lines:
    line = line.strip()
    if line in header_processors:
        current_processor = header_processors[line]
        current_processor.send(None)
    elif line:
        current_processor.send(line)
with
map(next, header_processors.values())  # prime every generator (map is eager in Python 2)

current_processor = header_processors['Header_A']
for line in data_lines:
    line = line.strip()
    current_processor = header_processors.get(line, current_processor)
    line and line not in header_processors and current_processor.send(line)
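The map(next, header_processors.values()) line matters because a fresh generator must be advanced to its first yield before it can accept values through send() (in Python 2, map is eager, so it really does advance every processor; in Python 3 you would have to consume the map object). A minimal sketch of that priming rule, with a hypothetical echo coroutine:

def echo():
    while True:
        line = yield
        print 'got', line

gen = echo()
gen.send(None)     # prime: run up to the first yield (next(gen) also works)
gen.send('hello')  # prints: got hello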

Related

Read text line by line in python

I would like to make a script that reads a text file line by line and, if a line contains a certain parameter, populates an array. The idea is this:
Read line
if Condition 1
    # True
    nested if Condition 2
    ...
else Condition 1 is not true
    read next line
I can't get it to work, though. I'm using readline() to read the text line by line, but the main problem is that I can never get it to read the next line. Can you help me? Below is an extract of my actual code:
col = 13   # columns
rig = 300  # rows
a = [ [ None for x in range(col) ] for y in range(rig) ]
counter = 1

file = open('temp.txt', 'r')
files = file.readline()
for line in files:
    if 'bandEUTRA: 32' in line:
        if 'ca-BandwidthClassDL-EUTRA: a' in line:
            a[counter][5] = 'DLa'
            counter = counter + 1
        else:
            next(files)
    else:
        next(files)
print('\n'.join(map(str, a)))
Fixes for the code you asked about inline, and some other associated cleanup, with comments:
import sys  # needed for sys.stdout.writelines below

col = 13   # columns
rig = 300  # rows
a = [[None] * col for y in range(rig)]  # Innermost repeated list of immutables
                                        # can use multiplication; just don't do it for
                                        # outer list(s), see: https://stackoverflow.com/q/240178/364696
counter = 1

with open('temp.txt') as file:  # Use with statement to get guaranteed file closure; 'r' is the implicit mode and can be omitted
    # Removed: files = file.readline()  # This makes no sense; files would be a single line from the file, but your original code treats it as the lines of the file
    # Replaced: for line in files:  # Since files was a single str, this iterated characters of the file
    for line in file:  # File objects are iterators of their own lines, so you can get the lines one by one this way
        if 'bandEUTRA: 32' in line and 'ca-BandwidthClassDL-EUTRA: a' in line:  # Perform both tests in a single if to minimize arrow pattern
            a[counter][5] = 'DLa'
            counter += 1  # May as well not say "counter" twice and use +=
        # All next() code removed; next() advances an iterator and returns the next value,
        # but files was not an iterator, so it was nonsensical, and the new code uses a for loop that advances it for you, so it was unnecessary.
        # If the goal is to intentionally skip the next line under some conditions, you *could*
        # use next(file, None) to advance the iterator so the for loop will skip it, but
        # it's rare that a line *failing* a test means you don't want to look at the next line,
        # so you probably don't want it

# This works:
print('\n'.join(map(str, a)))
# But it's even simpler to spell it as:
print(*a, sep="\n")
# which lets print do the work of stringifying and inserting the separator, avoiding
# the need to make a potentially huge string in memory; it *might* still do so (no documented
# guarantees), but if you want to avoid that possibility, you could do:
sys.stdout.writelines(map('{}\n'.format, a))
# which technically doesn't guarantee it, but definitely actually operates lazily, or:
for x in a:
    print(x)
# which is 100% guaranteed not to make any huge strings
You can do:
with open("filename.txt", "r") as f:
for line in f:
clean_line = line.rstrip('\r\n')
process_line(clean_line)
Edit:
For your application of populating an array, you could do something like this:

with open("filename.txt", "r") as f:
    contains = ["text" in l for l in f]

This will give you a list whose length is the number of lines in filename.txt; the contents will be False for each line that doesn't contain text, and True for each line that does.
Edit 2: To reflect @ShadowRanger's comments, I've changed my code to iterate over each line of the file without reading the whole thing in at once.
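If all you need is the number of matching lines rather than the full boolean list, sum() can consume the same generator directly, since each True counts as 1 (a sketch, with "text" again standing in for whatever substring you are looking for):

with open("filename.txt", "r") as f:
    num_matches = sum("text" in l for l in f)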

Using Python3 to search a file for a string, add the results on the next lines to an array before stopping at the next string

I am using Python 3 to process a results file. The structure of the file is a combination of string identifiers followed by lists of numeric values, in this format:
ENERGY_BOUNDS
1.964033E+07 1.733253E+07 1.491825E+07 1.384031E+07 1.161834E+07 1.000000E+07 8.187308E+06 6.703200E+06
6.065307E+06 5.488116E+06 4.493290E+06 3.678794E+06 3.011942E+06 2.465970E+06 2.231302E+06 2.018965E+06
EIGENVALUE
1.219034E+00
There are maybe 50 different sets of data with unique identifiers in this file. What I want to do is write a code that will search for a specific identifier (e.g. ENERGY_BOUNDS), then read the values that follow into a list, stopping at the next identifier (in this case EIGENVALUE). I then need to be able to manipulate the list (finding its length, printing its values, etc.).
I am writing this as a function so I can call it multiple times in my code when I want to search for different identifiers. So far what I have is:
def read_data_from_file(file_name, identifier):
    list_of_results = []  # Create list_of_results to put results in for future manipulation
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            if identifier in line:
                # If yes, read the next line
                nextValue = next(line)
                list_of_results.append(nextValue.rstrip())
    return list_of_results
It works fine up until it comes to reading the next line after the identifier, and I am stuck on how to continue reading the results after that line and how to make it stop at the next identifier.
The following is a simple and tested answer.
You are making two mistakes:
line is a string, not an iterator, so calling next(line) causes an error.
You read only one line after the identifier has been found, while you need to keep reading until another identifier appears.
The following is your code after a little modification. It has also been tested on your data:
def read_data_from_file(file_name, identifier):
    with open(file_name, 'r') as read_obj:
        list_of_results = []
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            if identifier in line:
                # If yes, read the next line
                nextValue = next(read_obj)
                while not nextValue.strip().isalpha():  # keep on reading until the next identifier appears
                    list_of_results.extend(nextValue.split())
                    nextValue = next(read_obj)
        print(list_of_results)
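Usage is then one call per identifier; the file name results.txt is hypothetical, and note that this version prints list_of_results rather than returning it:

read_data_from_file('results.txt', 'ENERGY_BOUNDS')
# prints: ['1.964033E+07', '1.733253E+07', ..., '2.018965E+06']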
I would suggest adding a variable that indicates whether you have found a line containing an identifier.
Afterwards, simply add the values into the array until the next identifier has been reached.
def read_data_from_file(file_name, identifier):
    list_of_results = []  # Create list_of_results to put results in for future manipulation
    identifier_found = False
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            if identifier in line:
                identifier_found = True
            elif identifier_found:
                if line.strip().isalpha():  # Next identifier reached, exit loop
                    break
                list_of_results += line.split()  # Add values to result
    return list_of_results
Use booleans, continue, and break!
Try to implement logic as follows:
Set a boolean (I'll use in_range) to False
Look through the lines and see if they match the identifier.
If it does, set the boolean to True and continue
If it does not, continue
If the boolean is False AND the line begins with a space: continue
If the boolean is True AND the line begins with a space: Add the line to the list.
If the boolean is True AND the line doesn't begin with a space: break.
This ends the searching process once a new identifier has been started.
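A sketch of that recipe in code, assuming (as the recipe does) that data lines begin with a space while identifier lines do not:

def read_data_from_file(file_name, identifier):
    list_of_results = []
    in_range = False
    with open(file_name) as f:
        for line in f:
            if identifier in line:
                in_range = True             # matched our identifier
                continue
            if line.startswith(' '):
                if in_range:
                    list_of_results.append(line.rstrip())  # True AND a data line
                continue                    # False AND a data line: skip it
            if in_range:
                break                       # True AND a new identifier: stop searching
    return list_of_results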
The other two answers are already helpful. Here is my method in case you need something else, with comments to explain.
If you don't want to use the end_identifier, you can use .isalpha(), which checks whether the string contains only letters (though note that identifiers containing underscores, like ENERGY_BOUNDS, would fail that check).
def read_data_from_file(file_name, start_identifier, end_identifier):
    list_of_results = []
    with open(file_name, 'r') as read_obj:
        start_identifier_reached = False  # variable to check if we reached the needed identifier
        for line in read_obj:
            if start_identifier in line:
                start_identifier_reached = True  # now we reached the identifier
                continue  # We go back to the start so we don't write the identifier into the list
            if start_identifier_reached:
                if end_identifier in line:  # We reached the end_identifier, so we are done
                    break
                list_of_results.append(line.rstrip())  # Put the values into the list
    return list_of_results

Find first non-None returned value from predicate over a sequence in Python

On the surface, this might seem to be a duplicate of
find first element in a sequence that matches a predicate
but it is not.
I have a predicate function (function of one argument) that does some
processing on the argument and returns a non-None value when the
processing is said to "succeed". I want to use that function
efficiently on a list or even some iterable but I do not want to
iterate over all elements of the list or iterable, but just return the
return value of the predicate function when that value is not None,
and then stop executing the predicate on subsequent elements.
I was hoping there was something in
itertools that
would do this, but they all seem hardwired to return the element of
the original items passed to the predicate, and instead I want the
returned value.
I have a solution shown below, but it is overly heavy code-wise. I want something more elegant that does not require the firstof utility function coded there.
Note: Reading the entire file into a list of lines is actually
necessary here, since I need the full contents in memory for other
processing.
I'm using Python 2 here; I do not want to switch to Python 3 at this
time but will want to avoid using syntax that is deprecated or missing
in Python 3.
import re

def match_timestamp(line):
    timestamp_re = r'\d+-\d+-\d+ \d+:\d+:\d+'
    m = re.search(r'^TIMESTAMP (' + timestamp_re + ')', line)
    if m:
        return m.group(1)
    return None

def firstof(pred, items):
    """Find result from the first call to pred of items.
    Do not continue to evaluate items (short-circuiting)."""
    for item in items:
        tmp = pred(item)
        if tmp:
            return tmp
    return None

log_file = "/tmp/myfile"
with open(log_file, "r") as f:
    lines = f.readlines()

for line in lines:
    print "line", line.rstrip()

timestamp = firstof(match_timestamp, lines)
print "** FOUND TIMESTAMP **", timestamp
Suppose I have /tmp/myfile contain:
some number of lines here
some number of lines here
some number of lines here
TIMESTAMP 2017-05-09 21:24:52
some number of lines here
some number of lines here
some number of lines here
Running the above program on it yields:
line some number of lines here
line some number of lines here
line some number of lines here
line TIMESTAMP 2017-05-09 21:24:52
line some number of lines here
line some number of lines here
line some number of lines here
** FOUND TIMESTAMP ** 2017-05-09 21:24:52
from itertools import imap, ifilter

timestamp = next(line for line in imap(match_timestamp, lines) if line)
# or
timestamp = next(ifilter(None, imap(match_timestamp, lines)))
(I believe that's the way to do it in Python 2, in Python 3 you'd simply use map.)
map the function over your lines so you get a lazy iterator of your transformed values, then lazily get the next truthy value from it using next and a generator expression or ifilter. You can choose whether to let next raise a StopIteration error if no value is found, or give it a second argument for the default return value.
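For example, a sketch of the second option with a default, so that a file with no timestamp yields None instead of raising StopIteration:

timestamp = next(ifilter(None, imap(match_timestamp, lines)), None)
if timestamp is None:
    print "no timestamp found"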
Edited: You can create a generator and use it with next until a timestamp is found.
with open(log_file, "r") as f:
    lines = f.readlines()

for line in lines:
    print "line", line.rstrip()

timestamp = None
generator = (match_timestamp(line) for line in lines)
while timestamp is None:
    timestamp = next(generator)  # note: raises StopIteration if no line ever matches

print "** FOUND TIMESTAMP **", timestamp

Python reading file problems

highest_score = 0
g = open("grades_single.txt", "r")
arrayList = []

for line in highest_score:
    if float(highest_score) > highest_score:
        arrayList.extend(line.split())

g.close()
print(highest_score)
Hello, I wondered if anyone could help me; I'm having problems here. I have to read in a file which contains 3 lines. The first line is no use and nor is the 3rd. The second contains a list of letters, which I have to pull out (for instance all the As, all the Bs, all the Cs, all the way up to G); there are multiple occurrences of each. I have to be able to count how many of each there are through this program. I'm very new to this, so please bear with me if the code is wrong. I just wondered if anyone could point me in the right direction of how to pull out these letters on the second line and count them. I then have to do a mathematical function with these letters, but I hope to work that out for myself.
Sample of the data:
GTSDF60000
ADCBCBBCADEBCCBADGAACDCCBEDCBACCFEABBCBBBCCEAABCBB
*
You do not read the contents of the file. To do so, use the .read() or .readlines() method on your opened file. .readlines() reads each line in a file separately, like so:
g = open("grades_single.txt","r")
filecontent = g.readlines()
since it is good practice to directly close your file after opening it and reading its contents, directly follow with:
g.close()
another option would be:
with open("grades_single.txt","r") as g:
content = g.readlines()
the with-statement closes the file for you (so you don't need to use the .close() method this way).
Since you need the contents of the second line only you can choose that one directly:
content = g.readlines()[1]
.readlines() doesn't strip a line of its newline (which usually is \n), so you still have to do so:
content = g.readlines()[1].strip('\n')
The .count()-method lets you count items in a list or in a string. So you could do:
dct = {}
for item in content:
    dct[item] = content.count(item)
this can be made more efficient by using a dictionary-comprehension:
dct = {item:content.count(item) for item in content}
at last you can get the highest score and print it:
highest_score = max(dct.values())
print(highest_score)
.values() returns the values of a dictionary and max, well, returns the maximum value in a list.
Thus the code that does what you're looking for could be:
with open("grades_single.txt","r") as g:
content = g.readlines()[1].strip('\n')
dct = {item:content.count(item) for item in content}
highest_score = max(dct.values())
print(highest_score)
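As an aside (not part of the answer above), collections.Counter, available since Python 2.7, does the same counting in a single pass instead of calling .count() once per character:

from collections import Counter

with open("grades_single.txt") as g:
    content = g.readlines()[1].strip('\n')

counts = Counter(content)    # maps each letter to its number of occurrences
print(max(counts.values()))  # the highest score, as above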
highest_score = 0
arrayList = []

with open("grades_single.txt") as f:
    arrayList.extend(f.readlines()[1].strip())

print(arrayList)
This will extend arrayList with the characters of the second line of that file; then you can do whatever you want with that list.
import re

# opens the file in read mode (and closes it automatically when done)
with open('my_file.txt', 'r') as opened_file:
    # Temporarily stores all lines of the file here.
    all_lines_list = []
    for line in opened_file.readlines():
        all_lines_list.append(line)

# This is the selected pattern.
# It basically means "match a single character from a to g"
# and ignores upper or lower case
pattern = re.compile(r'[a-g]', re.IGNORECASE)

# Which line I want to choose (assuming you only need one line chosen)
line_num_i_need = 2

# (1 is deducted since the first element in python has index 0)
matches = re.findall(pattern, all_lines_list[line_num_i_need - 1])

print('\nMatches found:')
print(matches)
print('\nTotal matches:')
print(len(matches))
You might want to check regular expressions in case you need some more complex pattern.
To count the occurrences of each letter I used a dictionary instead of a list. With a dictionary, you can access each letter count later on.
d = {}
g = open("grades_single.txt", "r")
for i, line in enumerate(g):
    if i == 1:
        holder = list(line.strip())
g.close()

for letter in holder:
    d[letter] = holder.count(letter)

for key, value in d.iteritems():
    print("{},{}".format(key, value))
Outputs
A,9
C,15
B,15
E,4
D,5
G,1
F,1
One can treat the first line specially (and in this case ignore it) with next inside try: except StopIteration:. In this case, where you only want the second line, follow with another next instead of a for loop.
with open("grades_single.txt") as f:
try:
next(f) # discard 1st line
line = next(f)
except StopIteration:
raise ValueError('file does not even have two lines')
# now use line

Is there a way to go back when reading a file using seek and calls to next()?

I'm writing a Python script to read a file, and when I arrive at one section of the file, the right way to read the lines in that section depends on information that's given within the section itself. So I found here that I could use something like:
fp = open('myfile')
last_pos = fp.tell()
line = fp.readline()
while line != '':
    if line == 'SPECIAL':
        fp.seek(last_pos)
        other_function(fp)
        break
    last_pos = fp.tell()
    line = fp.readline()
Yet, the structure of my current code is something like the following:
fh = open(filename)
# get generator function and attach None at the end to stop iteration
items = itertools.chain(((lino, line) for lino, line in enumerate(fh, start=1)), (None,))
item = True
lino, line = next(items)

# handle special section
if line.startswith('SPECIAL'):
    start = fh.tell()
    for i in range(specialLines):
        lino, eline = next(items)
        # etc. get the special data I need here
    # try to set the pointer to start to reread the special section
    fh.seek(start)
    # then reread the special section
But this approach gives the following error:
telling position disabled by next() call
Is there a way to prevent this?
Using the file as an iterator (such as calling next() on it or using it in a for loop) uses an internal buffer; the actual file read position is further along the file and using .tell() will not give you the position of the next line to yield.
If you need to seek back and forth, the solution is not to use next() directly on the file object but to use file.readline() only. You can still use an iterator for that: use the two-argument version of iter():
fileobj = open(filename)
fh = iter(fileobj.readline, '')
Calling next() on fh will invoke fileobj.readline() until that function returns an empty string. In effect, this creates a file iterator that doesn't use the internal buffer.
Demo:
>>> fh = open('example.txt')
>>> fhiter = iter(fh.readline, '')
>>> next(fhiter)
'foo spam eggs\n'
>>> fh.tell()
14
>>> fh.seek(0)
0
>>> next(fhiter)
'foo spam eggs\n'
Note that your enumerate chain can be simplified to:
items = itertools.chain(enumerate(fh, start=1), (None,))
although I am in the dark why you think a (None,) sentinel is needed here; StopIteration will still be raised, albeit one more next() call later.
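A quick interpreter session shows the sentinel merely coming out as one more ordinary item before StopIteration is raised (sketch with made-up lines):

>>> import itertools
>>> items = itertools.chain(enumerate(['a\n', 'b\n'], start=1), (None,))
>>> next(items)
(1, 'a\n')
>>> next(items)
(2, 'b\n')
>>> next(items)   # the sentinel; the REPL prints nothing for None
>>> next(items)
Traceback (most recent call last):
  ...
StopIteration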
To read specialLines count lines, use itertools.islice():
for lino, eline in islice(items, specialLines):
    # etc. get the special data I need here
You can just loop directly over fh instead of using an infinite loop and next() calls here too:
from itertools import islice

with open(filename) as fh:
    enumerated = enumerate(iter(fh.readline, ''), start=1)
    for lino, line in enumerated:
        # handle special section
        if line.startswith('SPECIAL'):
            start = fh.tell()
            for lino, eline in islice(enumerated, specialLines):
                # etc. get the special data I need here
                pass
            fh.seek(start)
but do note that your line numbers will still increment even when you seek back!
You probably want to refactor your code to not need to re-read sections of your file, however.
I'm not an expert with version 3 of Python, but it seems like you're reading using a generator that yields lines read from the file. Thus you can only move in one direction.
You'll have to use another approach.
