Parsing text file in python using pyparsing - python

I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
flag = Word(alphas+nums+'_'+'-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print list(enclosed.searchString (str1))
The comma(,) within the quotation is producing undesired results.

Well, I might have oversimplified slightly in my comments - here is a more complete
answer.
If you don't really have to deal with nested data items, then a single-level parenthesized
data group in each section will look like this:
from pyparsing import (Word, alphas, alphanums, nums, Suppress, OneOrMore,
                       Group, quotedString, delimitedList)

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)
# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)
# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(lambda t: '"' +
                             ''.join(map(lambda s: s.strip('"\''), t)) +
                             '"' if len(t) > 1 else t[0])
data_item = ident | integer | quoted_string
# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between
two consecutive quoted strings, so I chose to implement logic like Python's compiler,
in which two quoted strings are treated as just one longer string, that is "AB CD " "EF" is
the same as "AB CD EF". This was done with the definition of quoted_string, and adding
the parse action to quoted_string to concatenate the contents of the 2 or more component
quoted strings.
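The concatenation rule can be tried out without pyparsing at all. The sketch below restates the parse action as a plain function; the name merge_quoted is made up for illustration and is not part of pyparsing.

```python
def merge_quoted(tokens):
    """Join adjacent quoted-string tokens into one double-quoted string."""
    if len(tokens) > 1:
        # strip the surrounding quote characters, then re-wrap the joined text
        return '"' + "".join(tok.strip("\"'") for tok in tokens) + '"'
    # a lone quoted string is passed through unchanged
    return tokens[0]

merged = merge_quoted(['"hi here is the gift"', '"Maximum amount of money, goes"'])
single = merge_quoted(['"solo-100"'])
print(merged)
print(single)

# Python itself treats adjacent literals the same way
python_joined = "AB CD " "EF"
print(python_joined)
```

Note that no space is inserted between the two pieces, which is why "gift" and "Maximum" run together in the merged value.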
Finally, we create a parser for the overall group
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
  ['SOLO1',
   '"solo-100"',
   '"hi here is the giftMaximum amount of money, goes"',
   '430',
   '90']],
 ['jhk',
  ['SOLO2',
   '"solo-101"',
   '"hi here goes the wind.and, they go beyond"',
   '1000',
   '320']]]
If you do have nested parenthetical groups, then your section definition can be
as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although as you have already found, this will retain the separate commas as if they
were significant tokens instead of just data separators.

Related

pyparsing - Parse numbers with thousand separators

So I am making a parser, and I noticed a problem. Indeed, to parse numbers, I have:
from pyparsing import Word, nums
n = Word(nums)
This works well with numbers without thousands separators. For example, n.parseString("1000", parseAll=True) returns (['1000'], {}) and therefore works.
However, it doesn't work when I add the thousand separator. Indeed, n.parseString("1,000", parseAll=True) raises pyparsing.ParseException: Expected end of text, found ',' (at char 1), (line:1, col:2).
How can I parse numbers with thousand separators? I don't just want to ignore commas (for example, n.parseString("1,00", parseAll=True) should return an error as it is not a number).
A pure pyparsing approach would use Combine to wrap a series of pyparsing expressions representing the different fields that you are seeing in the regex:
import pyparsing as pp
int_with_thousands_separators = pp.Combine(pp.Optional("-")
+ pp.Word(pp.nums, max=3)
+ ("," + pp.Word(pp.nums, exact=3))[...])
I've found that building up numeric expressions like this results in much slower parse time, because all those separate parts are parsed independently, with multiple internal function and method calls (which are real performance killers in Python). So you can replace this with an expression using Regex:
# more efficient parsing with a Regex
int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")
You could also use the code as posted by Jan, and pass that compiled regex to the Regex constructor.
To do parse-time conversion to int, add a parse action that strips out the commas.
# add parse action to convert to int, after stripping ','s
int_with_thousands_separators.addParseAction(
lambda t: int(t[0].replace(",", "")))
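For comparison, here is a minimal sketch of the same validate-then-convert step using only the standard library re module with the pattern from the Regex expression above. The helper name to_int is an assumption made for this example.

```python
import re

# same pattern as the pp.Regex expression, compiled with the stdlib
THOUSANDS_INT = re.compile(r"-?\d{1,3}(,\d{3})*")

def to_int(text):
    """Return the int value of text, or raise ValueError if it is malformed."""
    # fullmatch requires the whole string to fit the pattern
    if THOUSANDS_INT.fullmatch(text) is None:
        raise ValueError("not a grouped integer: %r" % text)
    return int(text.replace(",", ""))

print(to_int("1,000"))
print(to_int("-3,000,100"))
```

One consequence of \d{1,3} worth knowing: an ungrouped value such as "1000" does not match the whole string, so if both "1000" and "1,000" must be accepted, the pattern needs an extra alternative for plain digit runs.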
I like using runTests to check out little expressions like this - it's easy to write a series of test strings, and the output shows either the parsed result or an annotated input string with the parse failure location. ("1,00" is included as an intentional error to demonstrate error output by runTests.)
int_with_thousands_separators.runTests("""\
1
# invalid value
1,00
1,000
-3,000,100
""")
If you want to parse real numbers, add pieces to represent the trailing decimal point and following digits.
real_with_thousands_separators = pp.Combine(pp.Optional("-")
+ pp.Word(pp.nums, max=3)
+ ("," + pp.Word(pp.nums, exact=3))[...]
+ "." + pp.Word(pp.nums))
# more efficient parsing with a Regex
real_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*\.\d+")
# add parse action to convert to float, after stripping ','s
real_with_thousands_separators.addParseAction(
lambda t: float(t[0].replace(",", "")))
real_with_thousands_separators.runTests("""\
# invalid values
1
1,00
1,000
-3,000,100
1.
# valid values
1.732
-273.15
""")
Since you are dealing with strings in the first place, you could use a regular expression to check that the input is indeed a number (thousands separators included). If it is, remove the commas and feed the result to the parser:
import re
from pyparsing import Word, nums
n = Word(nums)
def is_number(number):
    rx = re.compile(r'^-?\d+(?:,\d{3})*$')
    if rx.match(number):
        return number.replace(",", "")
    raise ValueError
try:
    number = is_number("10,000,000")
    print(n.parseString(number, parseAll=True))
except ValueError:
    print("Not a number")
With this, e.g. 1,00 will result in Not a number, see a demo for the expression on regex101.com.
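One detail of that pattern may matter: the leading \d+ places no limit on the first group, so a string like "1000,000" is also accepted. The sketch below compares it with a variant that limits the first group to three digits; the names loose and strict are only for illustration.

```python
import re

loose = re.compile(r'^-?\d+(?:,\d{3})*$')        # pattern from the answer
strict = re.compile(r'^-?\d{1,3}(?:,\d{3})*$')   # first group limited to 1-3 digits

samples = ["1,000", "1,00", "1000", "1000,000"]
# map each sample to (accepted by loose, accepted by strict)
results = {s: (bool(loose.match(s)), bool(strict.match(s))) for s in samples}
for sample, verdict in results.items():
    print(sample, verdict)
```

Which variant is right depends on whether ungrouped numbers such as "1000" should remain valid; the strict form rejects them.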
I don't understand well what you mean with "numbers with thousands of separators".
In any case, with pyparsing you should define the pattern of what you want to parse.
In the first example pyparsing works well just because you defined n as just a number, so:
n = Word(nums)
print(n.parseString("1000", parseAll=True))
['1000']
So, if you want to parse "1,000" or "1,00", you should define n as:
n = Word(nums) + ',' + Word(nums)
print(n.parseString("1,000", parseAll=True))
['1', ',', '000']
print(n.parseString("1,00", parseAll=True))
['1', ',', '00']
I also came up with a regex solution, kind of late:
from pyparsing import Word, nums
import re
n = Word(nums)
def parseNumber(x):
    parseable = re.sub('[,][0-9]{3}', lambda y: y.group()[1:], x)
    return n.parseString(parseable, parseAll=True)
print(parseNumber("1,000,123"))

Python str.format with string concatenation and continuation

I'd like to specify a string with both line continuation and concatenation characters. This is really useful if I'm echoing a bunch of related values. Here is a simple example with only two parameters:
temp = "here is\n"\
+"\t{}\n"\
+"\t{}".format("foo","bar")
print(temp)
here's what I get:
here is
    {}
    foo
And here is what I expect:
here is
    foo
    bar
What gives?
You can try something like this :
temp = ("here is\n"
"\t{}\n"
"\t{}".format("foo","bar"))
print(temp)
Or like :
# the \t have been replaced with
# 4 spaces just as an example
temp = '''here is
    {}
    {}'''.format
print(temp('foo', 'bar'))
vs. what you have:
a = "here is\n"
b = "\t{}\n"
c = "\t{}".format("foo","bar")
print( a + b + c)
str.format is called before your strings are concatenated. Think of it like 1 + 2 * 3, where the multiplication is evaluated before the addition.
Just wrap the whole string in parentheses to indicate that you want the strings concatenated before calling str.format:
temp = ("here is\n"
+ "\t{}\n"
+ "\t{}").format("foo","bar")
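The difference between the two groupings can be checked directly. This is a small sketch; the variable names last_only and whole are chosen just for this example.

```python
# .format binds tighter than +, so only the last literal is formatted
last_only = "here is\n" + "\t{}\n" + "\t{}".format("foo", "bar")

# parentheses force the concatenation first, then .format sees both fields
whole = ("here is\n" + "\t{}\n" + "\t{}").format("foo", "bar")

print(repr(last_only))
print(repr(whole))
```

In the first case str.format receives two arguments but only one replacement field, and the unused extra argument is silently ignored, which is why no error points at the problem.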
Python in effect sees this:
Concatenate the result of
"here is\n"
with the result of
"\t{}\n"
with the result of
"\t{}".format("foo","bar")
You have 3 separate string literals, and only the last one has the str.format() method applied.
Note that the Python interpreter is concatenating the strings at runtime.
You should instead use implicit string literal concatenation. Whenever you place two string literals side by side in an expression with no other operators in between, you get a single string:
"This is a single" " long string, even though there are separate literals"
This is stored with the bytecode as a single constant:
>>> compile('"This is a single" " long string, even though there are separate literals"', '', 'single').co_consts
('This is a single long string, even though there are separate literals', None)
>>> compile('"This is two separate" + " strings added together later"', '', 'single').co_consts
('This is two separate', ' strings added together later', None)
From the String literal concatenation documentation:
Multiple adjacent string or bytes literals (delimited by whitespace), possibly using different quoting conventions, are allowed, and their meaning is the same as their concatenation. Thus, "hello" 'world' is equivalent to "helloworld".
When you use implicit string literal concatenation, any .format() call at the end is applied to that whole, single string.
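Another way to see this is to look at the syntax tree rather than the bytecode. The sketch below assumes Python 3.8 or later (where string literals appear as ast.Constant nodes) and uses shortened literals for readability.

```python
import ast

# adjacent literals: the parser merges them before .format is attached
adjacent = ast.parse('"a{}" "b{}".format(1, 2)', mode="eval").body

# explicit +: .format belongs to the right-hand literal only
added = ast.parse('"a{}" + "b{}".format(1, 2)', mode="eval").body

print(type(adjacent).__name__, repr(adjacent.func.value.value))
print(type(added).__name__, type(added.right).__name__)
```

The first tree is a single method call on the merged string, while the second is an addition whose right operand is the method call.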
Next, you don't want to use \ backslash line continuation. Use parentheses instead; it is cleaner:
temp = (
"here is\n"
"\t{}\n"
"\t{}".format("foo","bar"))
This is called implicit line joining.
You might also want to learn about multiline string literals, where you use three quotes at the start and end. Newlines are allowed in such strings and remain part of the value:
temp = """\
here is
\t{}
\t{}""".format("foo","bar")
I used a \ backslash after the opening """ to escape the first newline.
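The effect of that backslash is easy to verify. A small sketch, with variable names picked for the example:

```python
with_escape = """\
here is
\t{}
\t{}""".format("foo", "bar")

without_escape = """
here is
\t{}
\t{}""".format("foo", "bar")

print(repr(with_escape))
print(repr(without_escape))
```

Without the backslash the value starts with an extra newline, so the printed text would begin with a blank line.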
The format function is only being applied to the last string.
temp = "here is\n"\
+"\t{}\n"\
+"\t{}".format("foo","bar")
Is doing this:
temp = "here is\n" + "\t{}\n" + "\t{}".format("foo","bar")
The key is that the .format() function is only happening to the last string:
"\t{}".format("foo","bar")
You can obtain the desired result using parentheses:
temp = ("here is\n"\
+"\t{}\n"\
+"\t{}").format("foo","bar")
print(temp)
#here is
# foo
# bar

Pyparsing: dblQuotedString parsing differently in nestedExpr

I'm working on a grammar to parse search queries (not evaluate them, just break them into component pieces). Right now I'm working with nestedExpr, just to grab the different 'levels' of each term, but I seem to have an issue if the first part of a term is in double quotes.
A simple version of the grammar:
QUOTED = QuotedString(quoteChar = '“', endQuoteChar = '”', unquoteResults = False).setParseAction(remove_curlies)
WWORD = Word(alphas8bit + printables.replace("(", "").replace(")", ""))
WORDS = Combine(OneOrMore(dblQuotedString | QUOTED | WWORD), joinString = ' ', adjacent = False)
TERM = OneOrMore(WORDS)
NESTED = OneOrMore(nestedExpr(content = TERM))
query = '(dog* OR boy girl w/3 ("girls n dolls" OR friends OR "best friend" OR (friends w/10 enemies)))'
Calling NESTED.parseString(query) returns:
[['dog* OR boy girl w/3', ['"girls n dolls"', 'OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
The first dblQuotedString instance is separate from the rest of the term at the same nesting, which doesn't occur to the second dblQuotedString instance, and also doesn't occur if the quoted bit is a QUOTED instance (with curly quotes) rather than dblQuotedString with straight ones.
Is there something special about dblQuotedString that I'm missing?
NOTE: I know that operatorPrecedence can break up search terms like this, but I have some limits on what can be broken apart, so I'm testing if I can use nestedExpr to work within those limits.
nestedExpr takes an optional keyword argument, ignoreExpr: an expression that nestedExpr uses to skip over text that would otherwise be interpreted as nesting openers or closers. The default is pyparsing's quotedString, which is defined as sglQuotedString | dblQuotedString. This is to handle strings like:
(this has a tricky string "string with )" )
Since the default ignoreExpr is quotedString, the ')' in quotes is not misinterpreted as the closing parenthesis.
However, your content argument also matches on dblQuotedString. The leading quoted string is matched internally by nestedExpr by way of skipping over quoted strings that may contain "()"s, then your content is matched, which also matches quoted strings. You can suppress nestedExpr's ignore expression using a NoMatch:
NESTED = OneOrMore(nestedExpr(content = TERM, ignoreExpr=NoMatch()))
which should now give you:
[['dog* OR boy girl w/3',
['"girls n dolls" OR friends OR "best friend" OR', ['friends w/10 enemies']]]]
You'll find more details and examples at https://pythonhosted.org/pyparsing/pyparsing-module.html#nestedExpr
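The default ignoreExpr behavior described above can be observed with the "tricky string" example. This is a minimal sketch that assumes pyparsing is installed and uses nestedExpr with all default arguments, not the TERM grammar from the question.

```python
from pyparsing import nestedExpr

text = '(this has a tricky string "string with )" )'

# default ignoreExpr is quotedString, so the ')' inside quotes is skipped over
parsed = nestedExpr().parseString(text).asList()
print(parsed)
```

The quoted string comes back as its own token, quotes included, which is the same mechanism that split off the leading "girls n dolls" in the question.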

Parsing arithmetic expressions with function calls

I am working with pyparsing and found it to be excellent for developing a simple DSL that allows me to extract data fields out of MongoDB and do simple arithmetic operations on them. I am now trying to extend my tools such that I can apply functions of the form Rank[Person:Height] to the fields and potentially include simple expressions as arguments to the function calls. I am struggling hard with getting the parsing syntax to work. Here is what I have so far:
# Define parser
expr = Forward()
integer = Word(nums).setParseAction(EvalConstant)
real = Combine(Word(nums) + "." + Word(nums)).setParseAction(EvalConstant)
# Handle database field references that are coming out of Mongo,
# accounting for the fact that some fields contain whitespace
dbRef = Combine(Word(alphas) + ":" + Word(printables) + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
dbRef.setParseAction(EvalDBref)
# Handle function calls
functionCall = (Keyword("Rank") | Keyword("ZS") | Keyword("Ntile")) + "[" + expr + "]"
functionCall.setParseAction(EvalFunction)
operand = functionCall | dbRef | (real | integer)
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
# Use parse actions to attach Eval constructors to sub-expressions
expr << operatorPrecedence(operand,
[
(signop, 1, opAssoc.RIGHT, EvalSignOp),
(multop, 2, opAssoc.LEFT, EvalMultOp),
(plusop, 2, opAssoc.LEFT, EvalAddOp),
])
My issue is that when I test a simple expression like Rank[Person:Height] I am getting a parse exception:
ParseException: Expected "]" (at char 19), (line:1, col:20)
If I use a float or arithmetic expression as the argument, like Rank[3 + 1.1], the parsing works OK, and if I simplify the dbRef grammar so it's just Word(alphas) it also works. I cannot for the life of me figure out what's wrong with my full grammar. I have tried rearranging the order of operands as well as simplifying the functionCall grammar, to no avail. Can anyone see what I am doing wrong?
Once I get this working I would want to take a last step and introduce support for variable assignment in expressions ..
EDIT: Upon further testing, if I remove the printables from dbRef grammar, things work ok:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional("_" + Word(alphas)))
HOWEVER, if I add the character "-" to dbRef (which I need for DB fields like "Class:S-N"), the parser fails again. I think the "-" is being consumed by the signop in my operatorPrecedence?
What appears to happen is that the ] character at the end of your test string (Rank[Person:Height]) gets consumed as part of the dbRef token, because the portion of this token past the initial : is declared as Word(printables), and this character set unfortunately includes the square bracket characters.
Then the parser tries to produce a functionCall but is missing the closing ] hence the error message.
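The greedy match can be reproduced in isolation. The sketch below assumes pyparsing is installed and uses a simplified stand-in for the question's dbRef (the optional trailing words are left out); greedy_ref is a name invented for this example.

```python
from pyparsing import Combine, Word, alphas, printables

# simplified stand-in for dbRef: everything after ':' is Word(printables)
greedy_ref = Combine(Word(alphas) + ":" + Word(printables))

tokens = greedy_ref.parseString("Person:Height]").asList()
print(tokens)

# printables covers every non-whitespace printable character, brackets included
bracket_is_printable = "]" in printables
print(bracket_is_printable)
```

Because the closing bracket ends up inside the token, nothing is left for the "]" that functionCall expects next.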
A tentative fix is to use a character set that doesn't include the square brackets, maybe something more explicit like:
dbRef = Combine(Word(alphas) + ":" + Word(alphas, alphas+"-_./") + \
Optional(" " + Word(alphas) + " " + Word(alphas)))
Edit:
Upon closer look, the above is loosely correct, but the token hierarchy is wrong (e.g. the parser attempts to produce a functionCall as one operand of an expr, etc.).
Also, my suggested fix will not work because of the ambiguity with the - sign, which should be understood as a plain character when within a dbRef and as a plusop when within an expr. This type of issue is common with parsers and there are ways to deal with it, though I'm not sure exactly how with pyparsing.
Found solution - the issue was that my grammar for dbRef was consuming some of the characters that were part of the function specification. New grammar that works correctly:
dbRef = Combine(Word(alphas) + OneOrMore(":") + Word(alphanums) + \
Optional(oneOf("_ -") + Word(alphas)))

PyParsing: What does Combine() do?

What is the difference between:
foo = TOKEN1 + TOKEN2
and
foo = Combine(TOKEN1 + TOKEN2)
Thanks.
UPDATE: Based on my experimentation, it seems like Combine() is for terminals, where you're trying to build an expression to match on, whereas plain + is for non-terminals. But I'm not sure.
Combine has 2 effects:
it concatenates all the tokens into a single string
it requires the matching tokens to all be adjacent with no intervening whitespace
If you create an expression like
realnum = Word(nums) + "." + Word(nums)
Then realnum.parseString("3.14") will return a list of 3 tokens: the leading '3', the '.', and the trailing '14'. But if you wrap this in Combine, as in:
realnum = Combine(Word(nums) + "." + Word(nums))
then realnum.parseString("3.14") will return '3.14' (which you could then convert to a float using a parse action). And since Combine suppresses pyparsing's default whitespace skipping between tokens, you won't accidentally find "3.14" in "The answer is 3. 14 is the next answer."
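Both effects can be seen side by side. A minimal sketch, assuming pyparsing is installed; the names loose and tight are only for this example.

```python
from pyparsing import Combine, ParseException, Word, nums

loose = Word(nums) + "." + Word(nums)           # three separate tokens
tight = Combine(Word(nums) + "." + Word(nums))  # one token, no inner whitespace

loose_tokens = loose.parseString("3.14").asList()
tight_tokens = tight.parseString("3.14").asList()
print(loose_tokens)
print(tight_tokens)

# whitespace between the parts is skipped by the plain expression...
loose_spaced = loose.parseString("3. 14").asList()

# ...but rejected once the expression is wrapped in Combine
try:
    tight.parseString("3. 14")
    tight_accepts_space = True
except ParseException:
    tight_accepts_space = False
print(loose_spaced, tight_accepts_space)
```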
