I am trying to collect the regex metacharacters into a list:
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\']
Entering this raises an error:
SyntaxError: EOL while scanning string literal
How can I resolve this?
The problem is the backslash, which is an escape character. The correct representation of a single backslash would be '\\' or "\\".
While all the answers above seem to work, for readability it might be better to write
mc = list("^$[]{}-?*+()|\\")
This makes it much easier to see which characters are being used, reducing visual clutter at very little cost.
It should be:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
You need to escape the final backslash \ with another one, giving \\ as in the list above.
You need to escape the final backslash:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
In your example, the backslash is escaping the last quote, so the string is never closed and the line is not valid Python.
A backslash followed by a quote (\') is an escape sequence, so the backslash itself has to be escaped:
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
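To tie the answers above together, here is a minimal sketch (not taken from any of the answers) that checks the escaped backslash really is a single character and then uses the list to build a character class. The names pattern, text and found are illustrative only. Note that a raw string does not help here: r'\' is also a syntax error, because a raw string cannot end in an odd number of backslashes.

```python
import re

# The list() form from the answer above: one string, split into characters.
mc = list("^$[]{}-?*+()|\\")

# '\\' in source code is ONE backslash character at runtime.
assert len(mc) == 14
assert mc[-1] == "\\" and len(mc[-1]) == 1

# re.escape() backslash-escapes each metacharacter so it matches literally,
# which makes it safe to place inside a character class.
pattern = re.compile("[" + "".join(re.escape(c) for c in mc) + "]")

text = "a+b*(c|d)\\e"          # the runtime string is: a+b*(c|d)\e
found = pattern.findall(text)
print(found)
```

Using re.escape here is a design choice for the sketch: it avoids hand-writing the escapes that caused the original error.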
Here is the text I'm parsing:
x ~ normal(mu, 1)
y ~ normal(mu2, 1)
The parser matches those lines using:
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression).setResultsName('model_definition')
The problem is that when there are two model definitions, they aren't named separately in the ParseResults object:
It looks like the first one gets overridden by the second. The reason I'm naming them is to make executing the lines easier - this way I (hopefully) don't have to figure out what is going on at evaluation time - the parser has already labelled everything. How can I get both model_definitions labelled? It would be nice if model_definition held a list of every model definition found.
Just in case, here is some more of my code:
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression).setResultsName('model_definition')
expression << Or([function_application, number, identifier, list_literal, probability_expression])
statement = Optional(newline) + Or([model_definition, assignment, function_application]) + Optional(newline)
line = OneOrMore('\n').suppress()
comment = Group('#' + SkipTo(newline)).suppress()
program = OneOrMore(Or([line, statement, comment]))
ast = program.parseString(input_string)
return ast
Not documented that I know of, but I found something in pyparsing.py:
I changed .setResultsName('model_definition') to .setResultsName('model_definition*') and they listed correctly!
Edit: it is documented, but it is a flag you pass to setResultsName:
setResultsName( string, listAllMatches=False ) - name to be given to tokens matching the element; if multiple tokens within a repetition group (such as ZeroOrMore or delimitedList) the default is to return only the last matching token - if listAllMatches is set to True, then a list of matching tokens is returned.
Here is enough of your code to get things to work:
from pyparsing import *
# fake in the bare minimum to parse the given test strings
identifier = Word(alphas, alphanums)
integer = Word(nums)
function_call = identifier + '(' + Optional(delimitedList(identifier | integer)) + ')'
expression = function_call
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression)
sample = """
x ~ normal(mu, 1)
y ~ normal(mu2, 1)
"""
The trailing '*' is there in setResultsName for those cases where you use the short form of setResultsName: expr("name*") vs expr.setResultsName("name", listAllMatches=True). If you prefer calling setResultsName, then I would not use the '*' notation, but would pass the listAllMatches argument.
If you are getting names that step on each other, you may need to add a level of Grouping. Here is your solution using listAllMatches=True, by virtue of the trailing '*' notation:
model_definition1 = model_definition('model_definition*')
print(OneOrMore(model_definition1).parseString(sample).dump())
It returns this parse result:
[['x', '~', 'normal', '(', 'mu', '1', ')'], ['y', '~', 'normal', '(', 'mu2', '1', ')']]
- model_definition: [['x', '~', 'normal', '(', 'mu', '1', ')'], ['y', '~', 'normal', '(', 'mu2', '1', ')']]
[0]:
['x', '~', 'normal', '(', 'mu', '1', ')']
- random_variable_name: x
[1]:
['y', '~', 'normal', '(', 'mu2', '1', ')']
Here is a variation that does not use listAllMatches, but adds another level of Group:
model_definition2 = model_definition('model_definition')
print(OneOrMore(Group(model_definition2)).parseString(sample).dump())
gives:
[[['x', '~', 'normal', '(', 'mu', '1', ')']], [['y', '~', 'normal', '(', 'mu2', '1', ')']]]
[0]:
[['x', '~', 'normal', '(', 'mu', '1', ')']]
- model_definition: ['x', '~', 'normal', '(', 'mu', '1', ')']
- random_variable_name: x
[1]:
[['y', '~', 'normal', '(', 'mu2', '1', ')']]
- model_definition: ['y', '~', 'normal', '(', 'mu2', '1', ')']
- random_variable_name: y
In both cases, I see the full content being returned, so I don't quite understand what you mean by "if you return multiple, it fails to split out each child."
I need to split a mathematical expression based on the delimiters. The delimiters are (, ), +, -, *, /, ^ and space. I came up with the following regular expression
"([\\s\\(\\)\\-\\+\\*/\\^])"
which also keeps the delimiters in the resulting list (which is what I want), but it also produces empty-string elements (""), which I don't want. I hardly ever use regular expressions (unfortunately), so I am not sure whether this can be avoided.
Here's an example of the problem:
>>> import re
>>> e = "((12*x^3+4 * 3)*3)"
>>> re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e)
['', '(', '', '(', '12', '*', 'x', '^', '3', '+', '4',
' ', '', ' ', '', ' ', '', '*', '', ' ', '3', ')', '', '*', '3', ')', '']
Is there a way to not produce those empty strings, maybe by modifying my regular expression? Of course I can remove them using for example filter, but the idea would be not to produce them at all.
Edit
I would also like the result not to include the spaces. Help with that would be great too.
You could add \w+, remove the \s and do a findall:
import re
e = "((12*x^3+44 * 3)*3)"
print(re.findall(r"(\w+|[()\-+*/^])", e))
Output:
['(', '(', '12', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Depending on what you want you can change the regex:
e = "((12a*x^3+44 * 3)*3)"
print(re.findall(r"(\d+|[a-z()\-+*/^])", e))
print(re.findall(r"(\w+|[()\-+*/^])", e))
The first treats 12a as two tokens, the second as one:
['(', '(', '12', 'a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
['(', '(', '12a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Just strip/filter them out in a comprehension.
result = [item for item in re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e) if item.strip()]
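The empty strings appear because re.split returns the text between consecutive delimiter matches, so two adjacent delimiters, or a delimiter at the very start or end, leave an empty piece between them. As a small sketch in Python 3 syntax (the names TOKEN, by_findall and by_split are illustrative, not from the answers), the findall approach and the split-then-filter approach give the same tokens for this input:

```python
import re

e = "((12*x^3+44 * 3)*3)"

# Match the tokens themselves instead of splitting on delimiters:
# a run of word characters, or a single operator/parenthesis.
TOKEN = re.compile(r"\w+|[()\-+*/^]")
by_findall = TOKEN.findall(e)

# Split on the delimiters, then drop empty and whitespace-only pieces.
by_split = [t for t in re.split(r"([\s()\-+*/^])", e) if t.strip()]

print(by_findall)
print(by_findall == by_split)
```

findall never produces the empty pieces in the first place, which is what the question asked for; the filter only removes them afterwards.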
I am trying to separate the operators (including parentheses) and the operands in an expression. For example given an expression
expr = "(32+54)*342-(4*(3-9))"
I am trying to get
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')']
Here is the code that I wrote. Is there a better way of doing it in Python?
import string

l = list(expr)
n = ''
expr = []
try:
    for c in l:
        if c in string.digits:
            # accumulate consecutive digits into one number
            n += c
        else:
            if n != '':
                expr.append(n)
                n = ''
            expr.append(c)
finally:
    # flush a trailing number, if any
    if n != '':
        expr.append(n)
We can do this with re.split():
>>> import re
>>> expr = "(32+54)*342-(4*(3-9))"
>>> re.split("([-()+*/])", expr)
['', '(', '32', '+', '54', ')', '', '*', '342', '-', '', '(', '4', '*', '', '(', '3', '-', '9', ')', '', ')', '']
This does insert some empty strings, but these can be stripped out easily, for example with a list comprehension:
>>> [part for part in re.split("([-()+*/])", expr) if part]
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')']
If you are only trying to tokenize the stream, your approach is fine, but somewhat old-fashioned. You can use a regular expression to split the tokens more easily.
However, if you also want to do something with the tokens (such as evaluate them) then I suggest you look at a parsing module that can handle recursion (regular expressions cannot handle recursion), such as pyparsing.
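To illustrate why evaluation needs recursion, here is a hedged sketch of a hand-written recursive-descent evaluator. It is a swapped-in technique using only the standard library, not pyparsing, and the function names tokenize and evaluate are made up for this example. It assumes Python 3, non-negative integer operands, the binary operators + - * / and parentheses; unary minus is not handled.

```python
import re

def tokenize(expr):
    # Split on operators/parentheses, keep them, drop empty pieces.
    return [t for t in re.split(r"([-()+*/])", expr) if t.strip()]

def evaluate(tokens):
    pos = 0  # index of the next unread token

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_expr():      # expr := term (('+' | '-') term)*
        value = parse_term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += parse_term()
            else:
                value -= parse_term()
        return value

    def parse_term():      # term := factor (('*' | '/') factor)*
        value = parse_factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= parse_factor()
            else:
                value /= parse_factor()
        return value

    def parse_factor():    # factor := NUMBER | '(' expr ')'
        tok = advance()
        if tok == "(":
            value = parse_expr()   # the recursive step a regex cannot do
            if advance() != ")":
                raise ValueError("expected ')'")
            return value
        return int(tok)

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected token: %r" % tokens[pos])
    return result

expr = "(32+54)*342-(4*(3-9))"
print(evaluate(tokenize(expr)))  # 29436
```

The nesting of parse_expr inside parse_factor is what handles arbitrarily deep parentheses; a parsing library such as pyparsing packages the same idea declaratively.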
Python: Batteries Included.
>>> import tokenize, StringIO
>>> [x[1] for x in tokenize.generate_tokens(StringIO.StringIO('(32+54)*342-(4*(3-9))').readline)]
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')', '']
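The one-liner above is written for Python 2, where StringIO is a top-level module. A rough Python 3 counterpart, as a sketch: io.StringIO replaces StringIO.StringIO, and filtering by token type drops the empty bookkeeping tokens at the end of the stream. The helper name py_tokens is hypothetical.

```python
import io
import tokenize

def py_tokens(expr):
    """Tokenize expr with Python's own tokenizer, keeping only
    numbers, names and operators (no NEWLINE/ENDMARKER tokens)."""
    readline = io.StringIO(expr).readline
    keep = (tokenize.NUMBER, tokenize.NAME, tokenize.OP)
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type in keep]

print(py_tokens("(32+54)*342-(4*(3-9))"))
```

Because this is Python's tokenizer, it only suits expressions whose tokens are also valid Python tokens.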
>>> if True:
    exp=[]
    expr = "(32+54)*342-(4*(3-9))"
    flag=False
    for i in expr:
        if i.isdigit() and flag:
            exp.append(str(exp.pop(len(exp)-1))+i)
        elif i.isdigit():
            flag=True
            exp.append(i)
        else:
            flag=False
            exp.append(i)
    print(exp)
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')']
I have an expression and I want to split it into tokens in Python 2.6. Here is the example:
[a]+[c]*0.6/[b]-([a]-[f]*0.9)
This should become:
(
'[a]',
'+',
'[c]',
'*',
'0.6',
'/',
'[b]',
'-',
'(',
'[a]',
'-',
'[f]',
'*',
'0.9',
')',
)
I need it as a list. Please give me a hand. Thanks.
>>> import re
>>> expr = '[a]+[c]*0.6/[b]-([a]-[f]*0.9)'
>>> re.findall('(?:\[.*?\])|(?:\d+\.*\d*)|.', expr)
['[a]', '+', '[c]', '*', '0.6', '/', '[b]', '-', '(', '[a]', '-', '[f]', '*', '0.9', ')']
One approach would be to create a list of regular expressions to match each token, something like:
import re
tokens = [r'\[.*?\]', r'\(', r'\)', r'\+', r'\*', r'\-', r'/', r'\d+\.\d+', r'\d+']
regex = re.compile('|'.join(tokens))
Then you could use findall on your expression to return a list of matches:
>>> regex.findall('[a]+[c]*0.6/[b]-([a]-[f]*0.9)')
<<<
['[a]',
'+',
'[c]',
'*',
'0.6',
'/',
'[b]',
'-',
'(',
'[a]',
'-',
'[f]',
'*',
'0.9',
')']
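A variation on the list-of-patterns idea, sketched here under stated assumptions: each pattern gets a name via a named group, so the tokenizer can also report what kind of token it found and reject unexpected characters instead of silently skipping them. The names TOKEN_SPEC, MASTER and tokenize, and the token kinds VAR, NUMBER and OP, are invented for this example.

```python
import re

# Order matters: earlier alternatives win, so ERROR must come last.
TOKEN_SPEC = [
    ("VAR",    r"\[[^\]]*\]"),      # bracketed name such as [a]
    ("NUMBER", r"\d+(?:\.\d+)?"),   # integer or decimal such as 0.6
    ("OP",     r"[-+*/()]"),        # operators and parentheses
    ("SKIP",   r"\s+"),             # whitespace, ignored
    ("ERROR",  r"."),               # anything else
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(expr):
    tokens = []
    for m in MASTER.finditer(expr):
        kind = m.lastgroup          # name of the alternative that matched
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError("unexpected character %r at position %d"
                             % (m.group(), m.start()))
        tokens.append((kind, m.group()))
    return tokens

pairs = tokenize("[a]+[c]*0.6/[b]-([a]-[f]*0.9)")
print([text for kind, text in pairs])
```

Dropping the kind from each pair gives the same flat list as the findall answers; keeping it is useful if the tokens are later evaluated.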