Two-part question: how to parse text and save it as class objects/attributes, and the best way to rewrite text from the classes in a specific format.
I want to parse through a text file, extract sections of text, and create class objects with attributes. There will be several classes (Polygons, space, zone, system, schedule) involved. In the original file, each "object" and its "attributes" are separated by '..'. An example of one is below.
"Office PSZ" = SYSTEM
TYPE = PSZ
HEAT-SOURCE = FURNACE
FAN-SCHEDULE = "HVAC Yr Schedule"
COOLING-EIR = 0.233207
..
I'd like to read this text and store it in class objects. So "Office PSZ" would be an instance of the HVACsystem or SYSTEM class (I haven't decided). 'SYSTEM' would be a class variable. For this instance ("Office PSZ"), self.TYPE would be PSZ, self.HEAT-SOURCE would equal FURNACE, etc.
I want to manipulate these objects based on their attributes. The end result, though, would be to write all the manipulated data back to a text file in the original format. The end result for this instance might be:
"Office PSZ" = SYSTEM
TYPE = PSZ
HEAT-SOURCE = ELECTRIC
FAN-SCHEDULE = "Other Schedule"
COOLING-EIR = 0.200
..
Is there a way to print the attribute name/title (I'm not sure what to call it)? The attribute names (i.e. TYPE, HEAT-SOURCE) come from the original file, and it would be easier not to have to manually anticipate all of the attributes associated with every class.
I suppose I could create an array of all of the values on the left side of "=" and another array for the values on the right and loop through those as I'm writing/formatting a new text file. But I'm not sure if that's a good way to go.
I'm still quite the amateur so I might be overreaching but any suggestions on how I should proceed?
Pyparsing makes it easy to write custom parsers for data like this, and gives back the parsed data in a pyparsing data structure called ParseResults. ParseResults gives you access to your parsed values by position (like a list), by key (like a dict), or, for names that work as Python identifiers, by attribute (like an object).
I've simplified my parsing of your data to pretty much just take every key = value line and build up a structure using the key strings as keys. The '..' lines work great as terminators for each object.
A simple BNF for this might look like:
object ::= attribute+ end
attribute ::= key '=' value
key ::= word composed of letters 'A'..'Z' and '-', starting with 'A'..'Z',
or a quoted string
value ::= value_string | value_number | value_word
value_word ::= a string of non-whitespace characters
value_string ::= a string of any characters in '"' quotes
value_number ::= an integer or float numeric value
end ::= '..'
To implement a pyparsing parser, we work bottom up to define pyparsing sub-expressions.
Then we use Python '+' and '|' operators to assemble lower-level expressions to higher-level
ones:
import pyparsing as pp
END = pp.Suppress("..")
EQ = pp.Suppress('=')
pyparsing includes some predefined expressions for quoted strings and numerics;
the numerics will be automatically converted to ints or floats.
value_number = pp.pyparsing_common.number
value_string = pp.quotedString
value_word = pp.Word(pp.printables)
value = value_string | value_number | value_word
For our attribute key, we will use the two-argument form for Word. The first
argument is a string of allowable leading characters, and the second argument is a
string of allowable body characters. If we just wrote `Word(alphas + '-')`, then
our parser would accept '---' as a legal key.
key = pp.Word(pp.alphas, pp.alphas + '-') | pp.quotedString
An attribute definition is just a key, an '=' sign, and a value:
attribute = key + EQ + value
Lastly we will use some of the more complex features of pyparsing. The simplest form
would just be "pp.OneOrMore(attribute) + END", but this would just give us back a
pile of parsed tokens with no structure. The Group class structures the enclosed expressions
so that their results will be returned as a sub-list. We will catch every attribute as
its own sub-list using Group. Dict will apply some naming to the results, using
the text from each key expression as the key for that group. Finally, the whole collection
of attributes will be Group'ed again, this time representing all the attributes for a
single object:
object_defn = pp.Group(pp.Dict(pp.OneOrMore(pp.Group(attribute)))) + END
To use this expression, we'll define our parser as:
parser = pp.OneOrMore(object_defn)
and parse the sample string using:
objs = parser.parseString(sample)
The objs variable we get back will be a pyparsing ParseResults, which will work like
a list of the grouped object attributes. We can view just the parsed attributes as a list
of lists using asList():
for obj in objs:
    print(obj.asList())
[['"Office PSZ"', 'SYSTEM'], ['TYPE', 'PSZ'], ['HEAT-SOURCE', 'FURNACE'],
['FAN-SCHEDULE', '"HVAC Yr Schedule"'], ['COOLING-EIR', 0.233207]]
If we had not used the Dict class, this would have been all we would get; but since we
did use Dict, we can also see the attributes as a Python dict:
for obj in objs:
    print(obj.asDict())
{'COOLING-EIR': 0.233207, '"Office PSZ"': 'SYSTEM', 'TYPE': 'PSZ',
'FAN-SCHEDULE': '"HVAC Yr Schedule"', 'HEAT-SOURCE': 'FURNACE'}
We can even access named fields by name, if they work as Python identifiers. In your
sample, "TYPE" is the only legal identifier, so you can see how to print it here. There
is also a dump() method that will give the results in list form, followed by an
indented list of defined key pairs. (I've also shown how you can use list and dict
type access directly on the ParseResults object, without having to convert to list
or dict types):
for obj in objs:
    print(obj[0])
    print(obj['FAN-SCHEDULE'])
    print(obj.TYPE)
    print(obj.dump())
['"Office PSZ"', 'SYSTEM']
"HVAC Yr Schedule"
PSZ
[['"Office PSZ"', 'SYSTEM'], ['TYPE', 'PSZ'], ['HEAT-SOURCE', 'FURNACE'],
['FAN-SCHEDULE', '"HVAC Yr Schedule"'], ['COOLING-EIR', 0.233207]]
- "Office PSZ": 'SYSTEM'
- COOLING-EIR: 0.233207
- FAN-SCHEDULE: '"HVAC Yr Schedule"'
- HEAT-SOURCE: 'FURNACE'
- TYPE: 'PSZ'
Here is the full parser code for you to work from:
import pyparsing as pp

# sample data from the question
sample = '''\
"Office PSZ" = SYSTEM
TYPE = PSZ
HEAT-SOURCE = FURNACE
FAN-SCHEDULE = "HVAC Yr Schedule"
COOLING-EIR = 0.233207
..
'''

END = pp.Suppress("..")
EQ = pp.Suppress('=')

value_number = pp.pyparsing_common.number
value_string = pp.quotedString
value_word = pp.Word(pp.printables)
value = value_string | value_number | value_word

key = pp.Word(pp.alphas, pp.alphas + "-") | pp.quotedString

attribute = key + EQ + value

object_defn = pp.Group(pp.Dict(pp.OneOrMore(pp.Group(attribute)))) + END
parser = pp.OneOrMore(object_defn)

objs = parser.parseString(sample)

for obj in objs:
    print(obj.asList())

for obj in objs:
    print(obj.asDict())

for obj in objs:
    print(obj[0])
    print(obj['FAN-SCHEDULE'])
    print(obj.TYPE)
    print(obj.dump())
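Since your end goal is to write the manipulated data back out in the original format, here is a minimal sketch of that last step (it assumes each parsed object keeps its name/type line as the first key = value pair, as in your sample, and that quoted values still carry their quotes, since quotedString leaves them in place):
import sys

def write_objects(objs, f=sys.stdout):
    # every parsed pair writes back out as 'key = value'; the first pair
    # is the object's name/type line ("Office PSZ" = SYSTEM)
    for obj in objs:
        for k, v in obj.asList():
            f.write("{} = {}\n".format(k, v))
        f.write("..\n")

write_objects(objs)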
I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one subpart of the task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
variable is the string's name in the Python code, and represents the variable name within a MODBUS. I want to extract just the digits prior to the .WORD_type[0], which relate to the number of bytes the string is packed into.
Here is my working code, note this is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(".+_ST[0-9]{1,2}\\.WORD_type.+", variable):
    var_type = "string"
    temp = re.split("\\.", variable)
    temp = re.split("_", temp[2])
    temp = temp[-1]
    var_length = int(str.lstrip(temp, "ST")) / 2
You could maybe try using matching groups like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    print(matches[1])
matches[0] has the full match and matches[1] contains the matched group.
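If you want the same var_length computation as in your original snippet, you could build it directly from the captured group (a sketch reusing your divide-by-two rule; var_type and var_length are your own names from the question):
import re

variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    var_type = "string"
    var_length = int(matches.group(1)) / 2  # "12" -> 6.0; use // for an int
    print(var_type, var_length)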
I am trying to parse nested column type definitions such as
1 string
2 struct<col_1:string,col_2:int>
3 row(col_1 string,array(col_2 string),col_3 boolean)
4 array<struct<col_1:string,col_2:int>,col_3:boolean>
5 array<struct<col_1:string,col2:int>>
Using nestedExpr works as expected for cases 1-4, but throws a parse error on case 5. Adding a space between the double closing brackets, like "> >", seems to work, which might be explained by this quote from the author:
By default, nestedExpr will look for space-delimited words of printables
https://sourceforge.net/p/pyparsing/bugs/107/
I'm mostly looking for alternatives to pre- and post-processing the input string:
type_str = type_str.replace(">", "> ")
# parse string here
type_str = type_str.replace("> ", ">")
I've tried using the infix_notation but I haven't been able to figure out how to use it in this situation. I'm probably just using this the wrong way...
Code snippet
array_keyword = pp.Keyword('array')
row_keyword = pp.Keyword('row')
struct_keyword = pp.Keyword('struct')

nest_open = pp.Word('<([')
nest_close = pp.Word('>)]')

col_name = pp.Word(pp.alphanums + '_')
col_type = pp.Forward()
col_type_delimiter = pp.Word(':') | pp.White(' ')
column = col_name('name') + col_type_delimiter + col_type('type')
col_list = pp.delimitedList(pp.Group(column))

struct_type = pp.nestedExpr(
    opener=struct_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
)
row_type = pp.locatedExpr(pp.nestedExpr(
    opener=row_keyword + nest_open, closer=nest_close,
    content=col_list | col_type, ignoreExpr=None
))
array_type = pp.nestedExpr(
    opener=array_keyword + nest_open, closer=nest_close,
    content=col_type, ignoreExpr=None
)

col_type <<= struct_type('children') | array_type('children') | row_type('children') | scalar_type('type')
nestedExpr and infixNotation are not really appropriate for this project. nestedExpr is generally a short-cut expression for stuff you don't really want to go into details parsing, you just want to detect and step over some chunk of text that happens to have some nesting in opening and closing punctuation. infixNotation is intended for parsing expressions with unary and binary operators, usually some kind of arithmetic. You might be able to treat the punctuation in your grammar as operators, but it is a stretch, and definitely doing things the hard way.
For your project, you will really need to define the different elements, and it will be a recursive grammar (since the array and struct types will themselves be defined in terms of other types, which could also be arrays or structs).
I took a stab at a BNF, for a subset of your grammar using scalar types int, float, boolean, and string, and compound types array and struct, with just the '<' and '>' nesting punctuation. An array will take a single type argument, to define the type of the elements in the array. A struct will take one or more struct fields, where each field is an identifier:type pair.
scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
array_type ::= 'array' '<' type_defn '>'
struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_element ::= identifier ':' type_defn
type_defn ::= scalar_type | array_type | struct_type
(If you later want to add a row definition also, think about what the row is supposed to look like, and how its elements would be defined, and then add it to this BNF.)
You look pretty comfortable with the basics of pyparsing, so I'll just start you off with some intro pieces, and then let you fill in the rest.
# define punctuation
LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')
# create a Forward that will be used in other type expressions
type_defn = pp.Forward()
# here is the array type, you can fill in the other types following this model
# and the definitions in the BNF
array_type = pp.Group(ARRAY + LT + type_defn + GT)
...
# then finally define type_defn in terms of the other type expressions
type_defn <<= scalar_type | array_type | struct_type
Once you have that finished, try it out with some tests:
type_defn.runTests("""\
string
struct<col_1:string,col_2:int>
array<struct<col_1:string,col2:int>>
""", fullDump=False)
And you should get something like:
string
['string']
struct<col_1:string,col_2:int>
['struct', [['col_1', 'string'], ['col_2', 'int']]]
array<struct<col_1:string,col2:int>>
['array', ['struct', [['col_1', 'string'], ['col2', 'int']]]]
Once you have that, you can play around with extending it to other types, such as your row type, maybe unions, or arrays that take multiple types (if that was your intention in your posted example). Always start by updating the BNF - then the changes you'll need to make in the code will generally follow.
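For reference, here is one way the remaining pieces might look (just a sketch following the BNF above; the exact nesting of the parsed results depends on where you place Group, so it may differ slightly from the sample output):
import pyparsing as pp

LT, GT, COLON = map(pp.Suppress, "<>:")
ARRAY = pp.Keyword('array')
STRUCT = pp.Keyword('struct')

type_defn = pp.Forward()

# scalar_type ::= 'int' | 'float' | 'string' | 'boolean'
scalar_type = pp.oneOf("int float string boolean")

# array_type ::= 'array' '<' type_defn '>'
array_type = pp.Group(ARRAY + LT + type_defn + GT)

# struct_element ::= identifier ':' type_defn
identifier = pp.Word(pp.alphas + '_', pp.alphanums + '_')
struct_element = pp.Group(identifier + COLON + type_defn)

# struct_type ::= 'struct' '<' struct_element (',' struct_element)... '>'
struct_type = pp.Group(STRUCT + LT + pp.Group(pp.delimitedList(struct_element)) + GT)

# type_defn ::= scalar_type | array_type | struct_type
type_defn <<= scalar_type | array_type | struct_type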
I have a lookup table of Scientific Names for plants. I want to use this lookup table to validate other tables where I have a data entry person entering the data. Sometimes they get the formatting of these scientific names wrong, so I am writing a script to try to flag the errors.
There's a very specific way to format each name. For example 'Sonchus arvensis L.' specifically needs to have the S in Sonchus capitalized as well as the L at the end. I have about 1000 different plants and each one is formatted differently. Here's a few more examples:
Linaria dalmatica (L.) Mill.
Knautia arvensis (L.) Coult.
Alliaria petiolata (M. Bieb.) Cavara & Grande
Berteroa incana (L.) DC.
Aegilops cylindrica Host
As you can see, all of these strings are formatted very differently (i.e. some letters are capitalized, some aren't, there are sometimes brackets, ampersands, periods, etc.).
My question is: is there any way to dynamically read the formatting of each string in the lookup table so that I can compare it to the value the data entry person entered, to make sure it is formatted properly? In the script below, I first test (first elif) whether the value is in the lookup table by capitalizing all values, so the match works regardless of formatting. In the next test (second elif) I can sort of test formatting by comparing against the lookup table value for value. This returns unmatched records based on formatting, but it doesn't specifically tell you why the unmatched record was returned.
What I'd like to do is read in the string values in the lookup table and somehow dynamically read the formatting of each string, so that I can specifically identify the error (i.e. a letter that should be capitalized but wasn't).
So far my code snippet looks like this:
# Determine if the field heading is in a list I built earlier
if "SCIENTIFIC_NAME" in fieldnames:
    # First, test to see if the record is empty
    if not row.SCIENTIFIC_NAME:
        weedPLineErrors.append("SCIENTIFIC_NAME record is empty")
    # Second, test to see if the value is in the lookup table, regardless of formatting.
    elif row.SCIENTIFIC_NAME.upper() not in [x.upper() for x in weedScientificTableList]:
        weedPLineErrors.append("COMMON_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not in the domain table")
    # Third, if the second test is satisfied, we know the value is in the lookup table. We can then
    # test the lookup table again, without capitalizing everything, to see if there is an exact
    # match, to account for formatting.
    elif row.SCIENTIFIC_NAME not in weedScientificTableList:
        weedPLineErrors.append("COMMON_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not formatted properly")
    else:
        pass
I hope my question is clear enough. I looked at string templates, but I don't think they do what I want...at least not dynamically. If anyone can point me in a better direction, I am all eyes...but maybe I am way out to lunch on this one.
Thanks,
Mike
To get around the punctuation problem, you can use regular expressions.
>>> import re
>>> def tokenize(s):
...     return re.split('[^A-Za-z]+', s)  # split on anything that isn't a letter
...
>>> tokens = tokenize('Alliaria petiolata (M. Bieb.) Cavara & Grande')
>>> tokens
['Alliaria', 'petiolata', 'M', 'Bieb', 'Cavara', 'Grande']
To get around the capitalization problem, you can use
>>> tokens = [s.lower() for s in tokens]
From there, you could rewrite the entry in a standardized format, such as
>>> import string
>>> ## I'm not sure exactly what format you're looking for
>>> first, second, third = [string.capitalize(s) for s in tokens[:3]]
>>> "%s %s (%s)" % (first, second, third)
'Alliaria Petiolata (M)'
This probably isn't the exact formatting that you want, but maybe that will get you headed in the right direction.
You can build a dictionary of the names from the lookup table. Assuming that you have the names stored in a list (call it correctList), you can write a function which removes all formatting (and maybe lowers or uppers the case) and stores the result in a dictionary. For example, the following is sample code to build the dictionary:
def removeFormatting(name):
    name = name.replace("(", "").replace(")", "")
    name = name.replace(".", "")
    ...
    return name.lower()
formattingDict = dict([(removeFormatting(i), i) for i in correctList])
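For instance, with just the replacements shown (assuming the elided lines don't touch these particular characters):
>>> removeFormatting('Linaria dalmatica (L.) Mill.')
'linaria dalmatica l mill'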
Now you can compare the strings input by the data entry person. Let's say they are in a list called inputList.
for name in inputList:
    unformattedName = removeFormatting(name)
    lookedUpName = formattingDict.get(unformattedName, "")
    if not lookedUpName:
        print "Spelling mistake:", name
    elif lookedUpName != name:
        print "Formatting error"
        print differences(name, lookedUpName)
The differences function could be stuffed with some rules like brackets, "."s, etc.:
def differences(inputName, lookedUpName):
    mismatches = []
    # Check for brackets
    if "(" in lookedUpName:
        if "(" not in inputName:
            mismatches.append("Bracket missing")
    ...
    # Add more rules
    return mismatches
Does that answer your question a bit?
I'm trying to get a value from a dictionary with string interpolation.
The dictionary has some digits, and a list of digits:
d = { "year" : 2010, \
"scenario" : 2, \
"region" : 5, \
"break_points" : [1,2,3,4,5] }
Is it possible to reference the list in a string interpolation, or do I need to identify unique keys for each?
Here's what I've tried:
str = "Year = %(year)d, \
Scenario = %(scenario)d, \
Region = %(region)d, \
Break One = %(break_points.0)d..." % d
I've also tried %(break_points[0])d, and %(break_points{'0'})d
Is this possible to do, or do I need to give them keys and save them as integers in the dictionary?
This is possible with new-style formatting:
print "{0[break_points][0]:d}".format(d)
or
print "{break_points[0]:d}".format(**d)
str.format documentation
The string on which this method is called can contain literal text or replacement fields delimited by braces {}. Each replacement field contains either the numeric index of a positional argument, or the name of a keyword argument.
Format string syntax
The field_name itself begins with an arg_name that is either a number or a keyword. If it’s a number, it refers to a positional argument, and if it’s a keyword, it refers to a named keyword argument.
...
The arg_name can be followed by any number of index or attribute expressions. An expression of the form '.name' selects the named attribute using getattr(), while an expression of the form '[index]' does an index lookup using __getitem__().
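Putting that together with the dictionary from the question (a quick check; both forms pull break_points[0], which is 1):
d = {"year": 2010, "scenario": 2, "region": 5, "break_points": [1, 2, 3, 4, 5]}
print "{0[break_points][0]:d}".format(d)                          # -> 1
print "Year = {year}, Break One = {break_points[0]}".format(**d)  # -> Year = 2010, Break One = 1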
This code works:
from pyparsing import *
zipRE = "\d{5}(?:[-\s]\d{4})?"
fooRE = "^\!\s+.*"
zipcode = Regex( zipRE )
foo = Regex( fooRE )
query = ( zipcode | foo )
tests = [ "80517", "C6H5OH", "90001-3234", "! sfs" ]
for t in tests:
    try:
        results = query.parseString( t )
        print t, "->", results
    except ParseException, pe:
        print pe
I'm stuck on two issues:
1 - How to use a custom function to parse a token. For instance, if I wanted to use some custom logic instead of a regex to determine if a number is a zipcode.
Instead of:
zipcode = Regex( zipRE )
perhaps:
zipcode = MyFunc()
2 - How do I determine what a string parses TO. "80001" parses to "zipcode" but how do I determine this using pyparsing? I'm not parsing a string for its contents but simply to determine what kind of query it is.
You could use zipcode and foo separately, so that you know which one the string matches.
zipresults = zipcode.parseString( t )
fooresults = foo.parseString( t )
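For example, a minimal sketch of that approach (a hypothetical classify helper; parseString raises ParseException when an expression doesn't match):
from pyparsing import ParseException

def classify(t):
    # try each expression in turn; the first one that parses wins
    try:
        zipcode.parseString(t)
        return "zipcode"
    except ParseException:
        pass
    try:
        foo.parseString(t)
        return "foo"
    except ParseException:
        return "unknown"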
Your second question is easy, so I'll answer that first. Change query to assign results names to the different expressions:
query = ( zipcode("zip") | foo("foo") )
Now you can call getName() on the returned result:
print t,"->", results, results.getName()
Giving:
80517 -> ['80517'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['90001-3234'] zip
! sfs -> ['! sfs'] foo
If you are going to use the result's fooness or zipness to call another function, then you could do this at parse time by attaching a parse action to your foo and zipcode expressions:
# enclose zipcodes in '*'s, foos in '#'s
zipcode.setParseAction(lambda t: '*' + t[0] + '*')
foo.setParseAction(lambda t: '#' + t[0] + '#')
query = ( zipcode("zip") | foo("foo") )
Now gives:
80517 -> ['*80517*'] zip
Expected Re:('\\d{5}(?:[-\\s]\\d{4})?') (at char 0), (line:1, col:1)
90001-3234 -> ['*90001-3234*'] zip
! sfs -> ['#! sfs#'] foo
For your first question, I don't exactly know what kind of function you mean. Pyparsing provides many more parsing classes than just Regex (such as Word, Keyword, Literal, CaselessLiteral), and you compose your parser by combining them with the '+', '|', '^', '~', '&' and '*' operators. For instance, if you wanted to parse for a US social security number, but not use a Regex, you could use:
ssn = Combine(Word(nums, exact=3) + '-' +
              Word(nums, exact=2) + '-' + Word(nums, exact=4))
Word matches contiguous "words" made up of the characters given in its constructor; Combine concatenates the matched tokens into a single token.
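A quick check of that expression (assuming the pyparsing names are imported, as in the question's code):
print ssn.parseString("123-45-6789")
# -> ['123-45-6789']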
If you wanted to parse for a potential list of such numbers, delimited by '/'s, use:
delimitedList(ssn, '/')
or if there were between 1 and 3 such numbers, with no delimiters, use:
ssn * (1,3)
And any expression can have results names or parse actions attached to them, to further enrich the parsed results, or the functionality during parsing. You can even build recursive parsers, such as nested lists of parentheses, arithmetic expressions, etc. using the Forward class.
My intent when I wrote pyparsing was that this composition of parsers from basic building blocks would be the primary form for creating a parser. It was only in a later release that I added Regex as (what I thought was) the ultimate escape valve - if people couldn't build up their parser, they could fall back on the regex format, which has definitely proven its power over time.
Or, as one other poster suggests, you can open up the pyparsing source, and subclass one of the existing classes, or write your own, following their structure. Here is a class that would match for paired characters:
class PairOf(Token):
    """Token for matching words composed of a pair
       of characters in a given set.
    """
    def __init__( self, chars ):
        super(PairOf, self).__init__()
        self.pair_chars = set(chars)

    def parseImpl( self, instring, loc, doActions=True ):
        if (loc < len(instring)-1 and
                instring[loc] in self.pair_chars and
                instring[loc+1] == instring[loc]):
            return loc+2, instring[loc:loc+2]
        else:
            raise ParseException(instring, loc, "Not at a pair of characters")
So that:
punc = r"~!##$%^&*_-+=|\?/"
parser = OneOrMore(Word(alphas) | PairOf(punc))
print parser.parseString("Does ** this match #### %% the parser?")
Gives:
['Does', '**', 'this', 'match', '##', '##', '%%', 'the', 'parser']
(Note the omission of the trailing single '?')
I do not have the pyparsing module, but Regex must be a class, not a function.
What you can do is subclass from it and override methods as required to customize behaviour, then use your subclasses instead.