How to split duplicated separator in Python - python

I have a string with the format
exp = '(( 200 + (4 * 3.14)) / ( 2 ** 3 ))'
I would like to separate the string into tokens by using re.split() and include the separators as well. However, I am not able to split ** together and eventually being split by * instead.
This is my code: tokens = re.split(r'([+|-|**?|/|(|)])',exp)
My Output (wrong):
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '*', '*', '3', ')', ')']
I would like to ask is there a way for me to split the separators between * and **? Thank you so much!
Desired Output:
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '**', '3', ')', ')']

Using the [...] notation only allows you to specify individual characters. To get variable sized alternate patterns you need to use the | operator outside of these brackets. This also means that you need to escape the regular expression operators and that you need to place the longer patterns before the shorter ones (i.e. ** before *)
tokens = re.split(r'(\*\*|\*|\+|\-|/|\(|\))',exp)
or even shorter:
tokens = re.split(r'(\*\*|[*+-/()])',exp)

Related

Write metacharacters to a list

I try to encapsulate regex metacharaters to a list
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\']
Enter and get errors
SyntaxError: EOL while scanning string literal
How to resolve the problem?
The problem is the backslash, which is an escape character. The correct representation of a single backslash would be '\\' or "\\".
While all the answers above seem to work, for readability it might be better to write
mc = list("^$[]{}-?*+()|\\")
This makes it much easier to see which characters are being used, reducing visual clutter at very little cost.
It should be:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
You need to escape the final backslash \ with another one, as in the list above \\.
You need to escape the final backslash:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
In your example, the backslash is escaping the last quote, so it's not valid python.
The backslash next to a " ' " is an escape sequence
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']

pyparsing: setResultsName for multiple elements get combined

Here is the text I'm parsing:
x ~ normal(mu, 1)
y ~ normal(mu2, 1)
The parser matches those lines using:
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression).setResultsName('model_definition')
// end of line: .setResultsName('model_definition')
The problem is that when there are two model definitions, they aren't named separately in the ParseResults object:
It looks like the first one gets overridden by the second. The reason I'm naming them is to make executing the lines easier - this way I (hopefully) don't have to figure out what is going on at evaluation time - the parser has already labelled everything. How can I get both model_definitions labelled? It would be nice if model_definition held a list of every model definition found.
Just in case, here is some more of my code:
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression).setResultsName('model_definition')
expression << Or([function_application, number, identifier, list_literal, probability_expression])
statement = Optional(newline) + Or([model_definition, assignment, function_application]) + Optional(newline)
line = OneOrMore('\n').suppress()
comment = Group('#' + SkipTo(newline)).suppress()
program = OneOrMore(Or([line, statement, comment]))
ast = program.parseString(input_string)
return ast
Not documented that I know of, but I found something in pyparsing.py:
I changed .setResultsName('model_definition') to .setResultsName('model_definition*') and they listed correctly!
Edit: it is documented, but it is a flag you pass to setResultsName:
setResultsName( string, listAllMatches=False ) - name to be given to tokens matching the element; if multiple tokens within a repetition group (such as ZeroOrMore or delimitedList) the default is to return only the last matching token - if listAllMatches is set to True, then a list of matching tokens is returned.
Here is enough of your code to get things to work:
from pyparsing import *
# fake in the bare minimum to parse the given test strings
identifier = Word(alphas, alphanums)
integer = Word(nums)
function_call = identifier + '(' + Optional(delimitedList(identifier | integer)) + ')'
expression = function_call
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression)
sample = """
x ~ normal(mu, 1)
y ~ normal(mu2, 1)
"""
The trailing '*' is there in setResultsName for those cases where you use the short form of setResultsName: expr("name*") vs expr.setResultsName("name", listAllMatches=True). If you prefer calling setResultsName, then I would not use the '*' notation, but would pass the listAllMatches argument.
If you are getting names that step on each other, you may need to add a level of Grouping. Here is your solution using listAllMatches=True, by virtue of the trailing '*' notation:
model_definition1 = model_definition('model_definition*')
print OneOrMore(model_definition1).parseString(sample).dump()
It returns this parse result:
[['x', '~', 'normal', '(', 'mu', '1', ')'], ['y', '~', 'normal', '(', 'mu2', '1', ')']]
- model_definition: [['x', '~', 'normal', '(', 'mu', '1', ')'], ['y', '~', 'normal', '(', 'mu2', '1', ')']]
[0]:
['x', '~', 'normal', '(', 'mu', '1', ')']
- random_variable_name: x
[1]:
['y', '~', 'normal', '(', 'mu2', '1', ')']
Here is a variation that does not use listAllMatches, but adds another level of Group:
model_definition2 = model_definition('model_definition')
print OneOrMore(Group(model_definition2)).parseString(sample).dump()
gives:
[[['x', '~', 'normal', '(', 'mu', '1', ')']], [['y', '~', 'normal', '(', 'mu2', '1', ')']]]
[0]:
[['x', '~', 'normal', '(', 'mu', '1', ')']]
- model_definition: ['x', '~', 'normal', '(', 'mu', '1', ')']
- random_variable_name: x
[1]:
[['y', '~', 'normal', '(', 'mu2', '1', ')']]
- model_definition: ['y', '~', 'normal', '(', 'mu2', '1', ')']
- random_variable_name: y
In both cases, I see the full content being returned, so I don't quit understand what you mean by "if you return multiple, it fails to split out each child."

re.split on multiple characters (and maintaining the characters) produces a list containing also empty strings

I need to split a mathematical expression based on the delimiters. The delimiters are (, ), +, -, *, /, ^ and space. I came up with the following regular expression
"([\\s\\(\\)\\-\\+\\*/\\^])"
which also keeps the delimiters in the resulting list (which is what I want), but it also produces empty strings "" elements, which I don't want. I hardly ever use regular expression (unfortunately), so I am not sure if it is possible to avoid this.
Here's an example of the problem:
>>> import re
>>> e = "((12*x^3+4 * 3)*3)"
>>> re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e)
['', '(', '', '(', '12', '*', 'x', '^', '3', '+', '4',
' ', '', ' ', '', ' ', '', '*', '', ' ', '3', ')', '', '*', '3', ')', '']
Is there a way to not produce those empty strings, maybe by modifying my regular expression? Of course I can remove them using for example filter, but the idea would be not to produce them at all.
Edit
I would also need to not include spaces. If you can help also in that matter, it would be great.
You could add \w+, remove the \s and do a findall:
import re
e = "((12*x^3+44 * 3)*3)"
print re.findall("(\w+|[()\-+*/^])", e)
Output:
['(', '(', '12', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Depending on what you want you can change the regex:
e = "((12a*x^3+44 * 3)*3)"
print re.findall("(\d+|[a-z()\-+*/^])", e)
print re.findall("(\w+|[()\-+*/^])", e)
The first considers 12a to be two strings the latter one:
['(', '(', '12', 'a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
['(', '(', '12a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Just strip/filter them out in a comprehension.
result = [item for item in re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e) if item.strip()]

Pythonic way to separate operators and operands in an expression

I am trying to separate the operators (including parentheses) and the operands in an expression. For example given an expression
expr = "(32+54)*342-(4*(3-9))"
I am trying to get
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')']
Here is the code that I wrote. Is there a better way of doing it in python.
l = list(expr)
n = ''
expr = []
try:
for c in l:
if c in string.digits:
n += c
else:
if n != '':
expr.append(n)
n = ''
expr.append(c)
finally:
if n != '':
expr.append(n)
We can do this with re.split():
>>> import re
>>> expr = "(32+54)*342-(4*(3-9))"
>>> re.split("([-()+*/])", expr)
['', '(', '32', '+', '54', ')', '', '*', '342', '-', '', '(', '4', '*', '', '(', '3', '-', '9', ')', '', ')', '']
This does insert some empty strings, but these can probably be handled or stripped out trivially enough. E.g with a list comprehension:
>>> [part for part in re.split("([-()+*/])", expr) if part]
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')']
If you are only trying to tokenize the stream, your approach is fine, but somewhat old-fashioned. You can use a regular expression, to split the tokens more easily.
However, if you also want to do something with the tokens (such as evaluate them) then I suggest you look at a parsing module that can handle recursion (regular expressions cannot handle recursion), such as pyparsing.
Python: Batteries Included.
>>> [x[1] for x in tokenize.generate_tokens(StringIO.StringIO('(32+54)*342-(4*(3-9))').readline)]
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')', '']
>>> if True:
exp=[]
expr = "(32+54)*342-(4*(3-9))"
flag=False
for i in expr:
if i.isdigit() and flag:
exp.append(str(exp.pop(len(exp)-1))+i)
elif i.isdigit():
flag=True
exp.append(i)
else:
flag=False
exp.append(i)
print(exp)
['(', '32', '+', '54', ')', '*', '342', '-', '(', '4', '*', '(', '3', '-', '9', ')', ')']
>>>

Extracting expression

I have a expression and I want to extract it in python 2.6. Here is the example:
[a]+[c]*0.6/[b]-([a]-[f]*0.9)
this going to:
(
'[a]',
'+',
'[c]',
'*',
'0.6',
'/',
'[b]',
'-',
'(',
'[a]',
'-',
'[f]',
'*',
'0.9',
')',
)
I need it a list. Please give me a hand. Thanks.
>>> import re
>>> expr = '[a]+[c]*0.6/[b]-([a]-[f]*0.9)'
>>> re.findall('(?:\[.*?\])|(?:\d+\.*\d*)|.', expr)
['[a]', '+', '[c]', '*', '0.6', '/', '[b]', '-', '(', '[a]', '-', '[f]', '*', '0.9', ')']
One approach would be to create a list of regular expressions to match each token, something like:
import re
tokens = [r'\[.?\]', r'\(', r'\)', r'\+', r'\*', r'\-', r'/', r'\d+?.\d+', r'\d+']
regex = re.compile('|'.join(tokens))
Then you could use findall on your expression to return a list of matches:
>>> regex.findall('[a]+[c]*0.6/[b]-([a]-[f]*0.9)')
<<<
['[a]',
'+',
'[c]',
'*',
'0.6',
'/',
'[b]',
'-',
'(',
'[a]',
'-',
'[f]',
'*',
'0.9',
')']

Categories

Resources