Shortish version
I am using this regex:
(^|[yY]\s{0,}\=|\+|\-)\s{0,}([0-9]{0,}\.?[0-9]{0,})\s{0,}(\*{0,1}[xX]{0,1})\s{0,}(\^{0,1})(-?)([0-9]{0,}\.?[0-9]{0,})(\s{0,}|$)?
To try and extract all the element coefficient and order numbers from equations like this:
y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1
I want the regex to ignore the erroneous 4x^ which is missing its power number (doesn't currently do this) and allow me to get to this final result:
((1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0))
Where the first coordinate is the coefficient and the second is the order for each element. Currently the regex above 'nearly' works if I take groups 1&2 and 5&6 to give me the coefficient and order respectively.
It just falls over on the erroneous 4x^ and feels extremely inelegant, but I am somewhat of a noob at regex and am not sure what improvements to make.
How can I improve this regex, and also fix it so that 4x^ is considered 'wrong' while 4x2 and 4x^2 are both fine?
tl;dr version
I am trying to parse polynomial equations entered by users in order to validate them and then decompose each equation into a series of elements. The equations will be presented as strings.
Here is an example of how the users are asked to format their string:
y = 2.0x^2.5 - 3.1x + 5.2
Where x is the independent variable (not a times symbol) and y is the dependent variable.
In reality the users commonly make any of the following mistakes:
Forgetting to include y =
Adding a * to coefficients such as y = 2.0*x
Using integers instead of floats, e.g. y = 5x
Missing the ^ when setting the order e.g. y = x3
Adding or removing whitespace anywhere
However, for all of these I'd say it's still easily understandable what the user is trying to write. By that I mean it is obvious what the coefficient and order are meant to be for each element.
So what I want to do is write some regex that correctly splits the entered string into separate elements and can get me A (the coefficient) and B (the order) of each element, where an element in general is of the form Ax^B, and A and B can each be any real number.
I devised the following example:
y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1
Which I believe covers all of the potential issues I outlined above, in addition to one other straight-up mistake: in 4x^+2x^2, the element 4x^ is missing its order.
For this example I'd like to get to: ((1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0)) where 4x^ has been ignored.
I am somewhat new to regex but I have made an effort using regex101.com to create the following:
(^|[yY]\s{0,}\=|\+|\-)\s{0,}([0-9]{0,}\.?[0-9]{0,})\s{0,}(\*{0,1}[xX]{0,1})\s{0,}(\^{0,1})(-?)([0-9]{0,}\.?[0-9]{0,})(\s{0,}|$)?
This appears to nearly work, with the following issues:
Does not catch the missing order in the 4x^ example given above - I am not sure how to make the optionality of the order number conditional on the presence of ^ while still working when ^ is absent but the order number is present, as in y = 4x2
Feels extremely verbose / inelegant, but being inexperienced I am struggling to see where improvements can be made
Also please note I am happily ignoring the issue of repeated elements with the same order not being summed, e.g. I am happy to ignore y = x^2 + x^2 not appearing as y = 2x^2.
Thank you for any help.
p.s. The program is to be written in Go, but I am also somewhat of a noob at Go, so I am first prototyping in Python. Not sure if this will make any difference to the regex (I really am that new to regex).
The following regex will mostly do:
(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)
I say mostly because this solution treats the "4x^" case as having order 1: the requirements are already pretty lenient, and trying to ignore such a term makes the RE much more complicated, or even impossible, because it creates an ambiguity which cannot be resolved with a RE.
Please note that absent coefficients/exponents will not be captured as '1.0' as shown in your example result; that will have to be done after applying the regex, treating empty capture groups as '1' (or '0' for the exponent, depending on which groups matched).
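Since you mention prototyping in Python first, here is a minimal sketch of that post-processing step using the same regex. Note that, as discussed above, the stray 4x^ comes out as order 1 rather than being dropped; the function name and the defaults are just illustrative.

import re

PATTERN = re.compile(r'(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)')

def parse_poly(s):
    """Return (coefficient, order) tuples, filling in the implied defaults."""
    terms = []
    for m in PATTERN.finditer(s):
        if m.group('c2') is not None:               # constant term: no x matched
            terms.append((float(m.group('c2').replace(' ', '')), 0.0))
            continue
        raw_c = m.group('c1')
        coeff = float(raw_c.replace(' ', '')) if raw_c else 1.0   # bare 'x' -> 1.0
        raw_e = m.group('e1') or m.group('e2')
        order = float(raw_e.replace(' ', '')) if raw_e else 1.0   # missing power -> 1.0
        terms.append((coeff, order))
    return terms

print(parse_poly('y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1'))
# [(1.0, 1.0), (3.3, -50.0), (15.0, 25.5), (-4.0, 1.0), (2.0, 2.0), (3.0, -2.5), (1.1, 0.0)]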
Here you have the regex on regex101.com so you can check how it works.
And here is a working program in Go which tests a couple of cases:
package main

import (
    "fmt"
    "regexp"
    "strconv"
    "strings"
)

const e = `(?P<c1>[+-]? *\d+(?:\.\d+)?)? *\*? *[xX] *(?:\^ *(?P<e1>-? *\d+(?:\.\d+)?)|(?P<e2>-? *\d+(?:\.\d+)?)?)|(?P<c2>[+-]? *\d+(?:\.\d+)?)`

var cases = []string{
    "y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1",
    "3.3X^-50",
}

// parse returns the first non-empty string in ss as a float64 (with spaces
// stripped), or the default d if none of the captures matched.
func parse(d float64, ss ...string) float64 {
    for _, s := range ss {
        if s != "" {
            c, _ := strconv.ParseFloat(strings.Replace(s, " ", "", -1), 64)
            return c
        }
    }
    return d
}

func main() {
    re := regexp.MustCompile(e)
    for i, c := range cases {
        fmt.Printf("testing case %v: %q\n", i, c)
        ms := re.FindAllStringSubmatch(c, -1)
        if ms == nil {
            fmt.Println("no match")
            continue
        }
        for i, m := range ms {
            fmt.Printf(" match %v: %q\n", i, m[0])
            // coefficient: group c1 (term with x) or c2 (constant term), default 1
            c := parse(1.0, m[1], m[4])
            // exponent: default 1 for an x term, 0 for a constant term
            de := 1.0
            if m[4] != "" {
                de = 0.0
            }
            e := parse(de, m[2], m[3])
            fmt.Printf(" c: %v\n", c)
            fmt.Printf(" e: %v\n", e)
        }
    }
}
Which outputs:
testing case 0: "y=x+3.3X^-50+ 15x25.5 - 4x^+2x^2 +3*x-2.5+1.1"
match 0: "x"
c: 1
e: 1
match 1: "+3.3X^-50"
c: 3.3
e: -50
match 2: "+ 15x25.5"
c: 15
e: 25.5
match 3: "- 4x"
c: -4
e: 1
match 4: "+2x^2"
c: 2
e: 2
match 5: "+3*x-2.5"
c: 3
e: -2.5
match 6: "+1.1"
c: 1.1
e: 0
testing case 1: "3.3X^-50"
match 0: "3.3X^-50"
c: 3.3
e: -50
Here you have the program on golang playground to try.
Related
I'm stuck with my code trying to get all matches within a given range. My data sample is:
comment
0 [intj74, you're, whipping, people, is, a, grea...
1 [home, near, kcil2, meniaga, who, intj47, a, l...
2 [thematic, budget, kasi, smooth, sweep]
3 [budget, 2, intj69, most, people, think, of, e...
I want to get the result as: (where the given range is intj1 to intj75)
comment
0 [intj74]
1 [intj47]
2 [nan]
3 [intj69]
My code is:
df.comment = df.comment.apply(lambda x: [t for t in x if t=='intj74'])
df.ix[df.comment.apply(len) == 0, 'comment'] = [[np.nan]]
I'm not sure how to use a regular expression to match the whole range instead of just t == 'intj74'. Or is there any other way to do this?
Thanks in advance,
Pandas Python Newbie
You could replace [t for t in x if t=='intj74'] with, e.g.,
[t for t in x if re.match('intj[0-9]+$', t)]
or even
[t for t in x if re.match('intj[0-9]+$', t)] or [np.nan]
which would also handle the case where there are no matches (so that one wouldn't need to check for that explicitly using df.ix[df.comment.apply(len) == 0, 'comment'] = [[np.nan]]). The "trick" here is that an empty list evaluates to False, so the or in that case returns its right operand.
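If you also need to honour the intj1 to intj75 bound rather than accept any intjN, here is a sketch along the same lines (assuming df.comment holds lists of tokens as in the question; keep_in_range is just an illustrative name):

import re
import numpy as np

def keep_in_range(tokens, lo=1, hi=75):
    # keep only tokens of the form 'intjN' with lo <= N <= hi
    matched = []
    for t in tokens:
        m = re.match(r'intj([0-9]+)$', t)
        if m and lo <= int(m.group(1)) <= hi:
            matched.append(t)
    return matched or [np.nan]   # same empty-list trick as above

df.comment = df.comment.apply(keep_in_range)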
I am new to pandas as well. You might have initialized your DataFrame differently. Anyway, this is what I have:
import pandas as pd

data = {
    'comment': [
        "intj74, you're, whipping, people, is, a",
        "home, near, kcil2, meniaga, who, intj47, a",
        "thematic, budget, kasi, smooth, sweep",
        "budget, 2, intj69, most, people, think, of"
    ]
}
df = pd.DataFrame(data)

print(df.comment.str.extract(r'(intj\d+)'))
In many of my python projects, I find myself having to go through a file, match lines against regexes, and then perform some computation on the basis of elements from the line extracted by regex.
In pseudo-C code, this is pretty easy:
while (read(line))
{
    if (m=matchregex(regex1,line))
    {
        /* munch on the components extracted in regex1 by accessing m */
    }
    else if (m=matchregex(regex2,line))
    {
        /* munch on the components extracted in regex2 by accessing m */
    }
    else if ...
    ...
    else
    {
        error("Unrecognized line format");
    }
}
However, because python does not allow an assignment in the conditional of an if, this can't be done elegantly. One could first parse against all the regexes and then do the if on the various match objects, but that is neither elegant nor efficient.
What I find myself doing instead is including code like this at the base level of every project:
import re

im=None
img=None

def imps(p,s):
    global im
    global img
    im=re.search(p,s)
    if im:
        img=im.groups()
        return True
    else:
        img=None
        return False
Then I can work like this:
for line in open(file,'r').read().splitlines():
    if imps(regex1,line):
        # munch on contents of img
    elif imps(regex2,line):
        # munch on contents of img
    else:
        error('Unrecognised line: {}'.format(line))
That works, is reasonably compact, and easy to type. But it is hardly beautiful; it uses global variables and is not thread safe (which has not been an issue for me so far).
But I'm sure others have run across this problem before and come up with an equally compact, but more python-y and generally superior solution. What is it?
Depends on the needs of the code.
A common choice I use is something like this:
import re

# note, order is important here. The first one to match will exit the processing
parse_regexps = [
    (re.compile(r"^foo"), handle_foo),
    (re.compile(r"^bar"), handle_bar),
]

for regexp, handler in parse_regexps:
    m = regexp.match(line)
    if m:
        handler(line)  # possibly other data too like m.groups
        break
else:
    error("Unrecognized format....")
This has the advantage of moving the handling code into clear and obvious functions which makes testing and change easy.
You can just use continue:
for line in file:
    m = re.match(re1, line)
    if m:
        do stuff
        continue
    m = re.match(re2, line)
    if m:
        do stuff
        continue
    raise BadLine
Another, less obvious, option is to have a function like this:
def match_any(subject, *regexes):
    for n, regex in enumerate(regexes):
        m = re.match(regex, subject)
        if m:
            return n, m
    return -1, None
and then:
for line in file:
    n, m = match_any(line, re1, re2)
    if n == 0:
        ....
    elif n == 1:
        ....
    else:
        raise BadLine
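A third option, if you are on Python 3.8 or newer: assignment expressions (the walrus operator, :=) make the C-style chain from the question possible directly. A minimal sketch, where re1, re2 and path are placeholder patterns and a placeholder filename:

import re

re1 = r"^foo (\d+)"        # placeholder pattern
re2 = r"^bar (\w+)"        # placeholder pattern
path = "input.txt"         # placeholder filename

for line in open(path):
    if m := re.match(re1, line):
        print("foo:", m.group(1))    # munch on the groups from re1
    elif m := re.match(re2, line):
        print("bar:", m.group(1))    # munch on the groups from re2
    else:
        raise ValueError('Unrecognised line: {}'.format(line))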
I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the following procedure to get the result and it works well. But how can I implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, so solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0

for line in datafile:
    for pkg in core_pkgs:
        ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
        match = ptn.match(line)
        if match is not None:
            cvln, tlln, cvbh, tlbh = match.groups()
            covered_lines += int(cvln)
            total_lines += int(tlln)
            covered_branches += int(cvbh)
            total_branches += int(tlbh)

print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches/total_branches)
Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function are quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)

corePkgs = ["d", "f"]

stats = [
    "d>11/23d>34/89d",
    "e>25/65e>13/25e",
    "f>36/92f>19/76"
  ]

format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"

-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int

-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
  where match line = do
          [name, cl, tl, cb, tb] <- matchRegex format line
          return $ Coverage name (read cl) (read tl) (read cb) (read tb)

-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
  Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)

main = do
  -- First we need to convert the strings to coverage data
  let coverageData = convert stats
      -- Then we want to filter out only the relevant data
      relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
      -- Then we need to summarise it, but we are only interested in the numbers
      Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
  -- So we can finally print them!
  printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
  printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)
Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0

for pkg in core_pkgs:
    ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
    matches = map(ptn.match, datafile)
    statsList = [map(int, match.groups()) for match in matches if match]
    # statsList is a list of [cvln, tlln, cvbh, tlbh]
    stats = np.array(statsList)
    covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=0)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.
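For what it's worth, here is one way to finish that thought as a single functional-style pass in plain Python, reusing the per-package patterns from the question (a sketch only, not benchmarked, and written against the simplified datafile tuple above):

from __future__ import division  # harmless on Python 3
import re

datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')

patterns = [re.compile(r'.*' + pkg + r'.*>(\d+)/(\d+).*>(\d+)/(\d+).*') for pkg in core_pkgs]

# one flat pass: every (line, pattern) pair that matches contributes a row of four ints
rows = [tuple(int(g) for g in m.groups())
        for line in datafile
        for m in (p.match(line) for p in patterns)
        if m]

# element-wise sums over the kept rows
covered_lines, total_lines, covered_branches, total_branches = (sum(col) for col in zip(*rows))

print('Line coverage: {:.2%}'.format(covered_lines / total_lines))
print('Branch coverage: {:.2%}'.format(covered_branches / total_branches))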
This is the corresponding Clojure solution:
(defn extract-data
  "extract 4 integer from a string line according to a package name"
  [pkg line]
  (map read-string
       (rest (first
              (re-seq
               (re-pattern
                (str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
               line)))))

(defn scan-lines-by-pkg
  "scan all string lines and extract all data as integer sequences
  according to package names"
  [pkgs lines]
  (filter seq (for [pkg pkgs
                    line lines]
                (extract-data pkg line))))

(defn sum-data
  "add all data in valid lines together"
  [pkgs lines]
  (apply map + (scan-lines-by-pkg pkgs lines)))

(defn get-percent
  [covered all]
  (str (format "%.2f" (float (/ (* covered 100) all))) "%"))

(defn get-cov
  [pkgs lines]
  {:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
   :branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})

(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])
Edit: I did a first version, which Eike helped me advance quite a bit. I'm now stuck on a more specific problem, which I will describe below. You can have a look at the original question in the history.
I'm using pyparsing to parse a small language used to request specific data from a database. It features numerous keywords, operators and datatypes as well as boolean logic.
I'm trying to improve the error message sent to the user when he makes a syntax error, since the current one is not very useful. I designed a small example, similar to what I'm doing with the aforementioned language but much smaller:
#!/usr/bin/env python

from pyparsing import *

def validate_number(s, loc, tokens):
    if int(tokens[0]) != 0:
        raise ParseFatalException(s, loc, "number must be 0")

def fail(s, loc, tokens):
    raise ParseFatalException(s, loc, "Unknown token %s" % tokens[0])

def fail_value(s, loc, expr, err):
    raise ParseFatalException(s, loc, "Wrong value")

number = Word(nums).setParseAction(validate_number).setFailAction(fail_value)
operator = Literal("=")
error = Word(alphas).setParseAction(fail)

rules = MatchFirst([
    Literal('x') + operator + number,
])

rules = operatorPrecedence(rules | error, [
    (Literal("and"), 2, opAssoc.RIGHT),
])

def try_parse(expression):
    try:
        rules.parseString(expression, parseAll=True)
    except Exception as e:
        msg = str(e)
        print("%s: %s" % (msg, expression))
        print(" " * (len("%s: " % msg) + (e.loc)) + "^^^")
So basically, the only thing which we can do with this language is write series of x = 0, joined together with and and parentheses.
Now, there are cases, when and and parentheses are used, where the error reporting is not very good. Consider the following examples:
>>> try_parse("x = a and x = 0") # This one is actually good!
Wrong value (at char 4), (line:1, col:5): x = a and x = 0
^^^
>>> try_parse("x = 0 and x = a")
Expected end of text (at char 6), (line:1, col:1): x = 0 and x = a
^^^
>>> try_parse("x = 0 and (x = 0 and (x = 0 and (x = a)))")
Expected end of text (at char 6), (line:1, col:1): x = 0 and (x = 0 and (x = 0 and (x = a)))
^^^
>>> try_parse("x = 0 and (x = 0 and (x = 0 and (x = 0)))")
Expected end of text (at char 6), (line:1, col:1): x = 0 and (x = 0 and (x = 0 and (xxxxxxxx = 0)))
^^^
Actually, it seems that if the parser can't parse (and parse here is important) something after an and, it doesn't produce good error messages anymore :(
And I mean parse, since if it can parse 5 but the "validation" fails in the parse action, it still produces a good error message. But if it can't parse something as a valid number (like a) or a valid keyword (like xxxxxxxx), it stops producing the right error messages.
Any idea?
Pyparsing will always have somewhat bad error messages, because it backtracks. The error message is generated in the last rule that the parser tries. The parser can't know where the error really is, it only knows that there is no matching rule.
For good error messages you need a parser that gives up early. These parsers are less flexible than Pyparsing, but most conventional programming languages can be parsed with such parsers. (C++ and Scala IMHO can't.)
To improve error messages in Pyparsing, use the - operator; it works like the + operator, but it does not backtrack. You would use it like this:
assignment = Literal("let") - varname - "=" - expression
Here is a small article on improving error reporting, by Pyparsing's author.
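Applied to the grammar from the question, the idea would look roughly like this (a sketch only, not a drop-in replacement for the full script):

from pyparsing import Literal, Word, nums

number = Word(nums)
operator = Literal("=")

# Using '-' instead of '+' after the leading 'x' tells pyparsing not to
# backtrack past that point, so a bad right-hand side is reported where it
# occurs instead of as "Expected end of text" at the start of the expression.
rule = Literal('x') - operator - number

print(rule.parseString("x = 0"))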
Edit
You could also generate good error messages for the invalid numbers in the parse actions that do the validation. If the number is invalid you raise an exception that is not caught by Pyparsing. This exception can contain a good error message.
Parse actions can have three arguments [1]:
s = the original string being parsed (see note below)
loc = the location of the matching substring
toks = a list of the matched tokens, packaged as a ParseResults object
There are also three useful helper methods for creating good error messages [2]:
lineno(loc, string) - function to give the line number of the location within the string; the first line is line 1, newlines start new rows.
col(loc, string) - function to give the column number of the location within the string; the first column is column 1, newlines reset the column number to 1.
line(loc, string) - function to retrieve the line of text representing lineno(loc, string). Useful when printing out diagnostic messages for exceptions.
Your validating parse action would then be like this:
def validate_odd_number(s, loc, toks):
    value = toks[0]
    value = int(value)
    if value % 2 == 0:
        raise MyFatalParseException(
            "not an odd number. Line {l}, column {c}.".format(l=lineno(loc, s),
                                                              c=col(loc, s)))
[1] http://pythonhosted.org/pyparsing/pyparsing.pyparsing.ParserElement-class.html#setParseAction
[2] HowToUsePyparsing
Edit
Here [3] is an improved version of the question's current (2013-4-10) script. It gets the example errors right, but other errors are indicated at the wrong position. I believe there are bugs in my version of Pyparsing ('1.5.7'), but maybe I just don't understand how Pyparsing works. The issues are:
ParseFatalException seems not to be always fatal. The script works as expected when I use my own exception.
The - operator seems not to work.
[3] http://pastebin.com/7E4kSnkm
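For reference, the "my own exception" approach mentioned above boils down to raising something that is not a pyparsing exception from the validating parse action, so pyparsing cannot swallow it while backtracking. A minimal sketch (ValidationError is just an illustrative name):

from pyparsing import Word, nums, lineno, col

class ValidationError(Exception):
    """Not a pyparsing exception, so it propagates straight out of parseString()."""
    pass

def validate_number(s, loc, tokens):
    if int(tokens[0]) != 0:
        raise ValidationError("number must be 0 at line %d, column %d"
                              % (lineno(loc, s), col(loc, s)))

number = Word(nums).setParseAction(validate_number)

try:
    number.parseString("5")
except ValidationError as err:
    print(err)   # number must be 0 at line 1, column 1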
I'm currently transitioning from Java to Python and have taken on the task of trying to create a calculator that can carry out symbolic operations on infix-notated mathematical expressions (without using custom modules like Sympy). Currently, it's built to accept strings that are space delimited and can only carry out the (, ), +, -, *, and / operators. Unfortunately, I can't figure out the basic algorithm for simplifying symbolic expressions.
For example, given the string '2 * ( ( 9 / 6 ) + 6 * x )', my program should carry out the following steps:
2 * ( 1.5 + 6 * x )
3 + 12 * x
But I can't get the program to ignore the x when distributing the 2. In addition, how can I handle 'x * 6 / x' so it returns '6' after simplification?
EDIT: To clarify, by "symbolic" I meant that it will leave letters like "A" and "f" in the output while carrying out the remaining calculations.
EDIT 2: I (mostly) finished the code. I'm posting it here if anyone stumbles on this post in the future, or if any of you were curious.
def reduceExpr(useArray):
    # Use Python's native eval() to compute if no letters are detected.
    if (not hasLetters(useArray)):
        return [calculate(useArray)] # Different from eval() because it returns string version of result
    # Base case. Returns useArray if the list size is 1 (i.e., it contains one string).
    if (len(useArray) == 1):
        return useArray
    # Base case. Returns the space-joined elements of useArray as a list with one string.
    if (len(useArray) == 3):
        return [' '.join(useArray)]
    # Checks to see if parentheses are present in the expression & sets.
    # Counts number of parentheses & keeps track of first ( found.
    parentheses = 0
    leftIdx = -1
    # This try/except block is essentially an if/else block. Since useArray.index('(') raises a ValueError
    # if it can't find '(' in useArray, the next line is not carried out, and parentheses is not incremented.
    try:
        leftIdx = useArray.index('(')
        parentheses += 1
    except Exception:
        pass
    # If a ValueError was raised, leftIdx = -1 and rightIdx = parentheses = 0.
    rightIdx = leftIdx + 1
    while (parentheses > 0):
        if (useArray[rightIdx] == '('):
            parentheses += 1
        elif (useArray[rightIdx] == ')'):
            parentheses -= 1
        rightIdx += 1
    # Provided parentheses pair isn't empty, runs contents through again; else, removes the parentheses
    if (leftIdx > -1 and rightIdx - leftIdx > 2):
        return reduceExpr(useArray[:leftIdx] + [' '.join(['(',reduceExpr(useArray[leftIdx+1:rightIdx-1])[0],')'])] + useArray[rightIdx:])
    elif (leftIdx > -1):
        return reduceExpr(useArray[:leftIdx] + useArray[rightIdx:])
    # If operator is + or -, hold the first two elements and process the rest of the list first
    if isAddSub(useArray[1]):
        return reduceExpr(useArray[:2] + reduceExpr(useArray[2:]))
    # Else, if operator is * or /, process the first 3 elements first, then the rest of the list
    elif isMultDiv(useArray[1]):
        return reduceExpr(reduceExpr(useArray[:3]) + useArray[3:])
    # Just placed this so the compiler wouldn't complain that the function had no return (since this was called by yet another function).
    return None
You need much more processing before you go into operations on symbols. The form you want to get to is a tree of operations with values in the leaf nodes. First you need to do a lexer run on the string to get elements - although if you always have space-separated elements it might be enough to just split the string. Then you need to parse that array of tokens using some grammar you require.
If you need theoretical information about grammars and parsing text, start here: http://en.wikipedia.org/wiki/Parsing If you need something more practical, go to https://github.com/pyparsing/pyparsing (you don't have to use the pyparsing module itself, but their documentation has a lot of interesting info) or http://www.nltk.org/book
From 2 * ( ( 9 / 6 ) + 6 * x ), you need to get to a tree like this:
          *
      2       +
            /       *
           9 6     6 x
Then you can visit each node and decide if you want to simplify it. Constant operations will be the simplest ones to eliminate - just compute the result and exchange the "/" node with 1.5 because all children are constants.
There are many strategies to continue, but essentially you need to find a way to go through the tree and modify it until there's nothing left to change.
If you want to print the result then, just walk the tree again and produce an expression which describes it.
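To make the constant-folding step concrete, here is a minimal sketch of such a tree and one simplification pass over it (Node, simplify and OPS are made-up names for illustration; only constant folding is shown, not distribution):

import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

class Node(object):
    """An operator node; leaves are plain numbers or variable-name strings."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def simplify(node):
    # Leaves (numbers or symbols) are already as simple as they get.
    if not isinstance(node, Node):
        return node
    left = simplify(node.left)
    right = simplify(node.right)
    # If both children folded down to numbers, replace this node with the value.
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[node.op](left, right)
    return Node(node.op, left, right)

# 2 * ( ( 9 / 6 ) + 6 * x )
tree = Node('*', 2, Node('+', Node('/', 9, 6), Node('*', 6, 'x')))
print(simplify(tree).right.left)   # 1.5 -- the (9 / 6) subtree has been folded

Distribution (turning 2 * (1.5 + 6 * x) into 3 + 12 * x) would be another pass of the same shape, rewriting Node('*', k, Node('+', a, b)) into Node('+', Node('*', k, a), Node('*', k, b)) and then folding constants again.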
If you are parsing expressions in Python, you might consider Python syntax for the expressions and parse them using the ast module (AST = abstract syntax tree).
The advantages of using Python syntax: you don't have to make a separate language for the purpose, the parser is built in, and so is the evaluator. Disadvantages: there's quite a lot of extra complexity in the parse tree that you don't need (you can avoid some of it by using the built-in NodeVisitor and NodeTransformer classes to do your work).
>>> import ast
>>> a = ast.parse('x**2 + x', mode='eval')
>>> ast.dump(a)
"Expression(body=BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(),
right=Num(n=2)), op=Add(), right=Name(id='x', ctx=Load())))"
Here's an example class that walks a Python parse tree and does recursive constant folding (for binary operations), to show you the kind of thing you can do fairly easily.
from ast import *

class FoldConstants(NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.left, Num) and isinstance(node.right, Num):
            expr = copy_location(Expression(node), node)
            value = eval(compile(expr, '<string>', 'eval'))
            return copy_location(Num(value), node)
        else:
            return node
>>> ast.dump(FoldConstants().visit(ast.parse('3**2 - 5 + x', mode='eval')))
"Expression(body=BinOp(left=Num(n=4), op=Add(), right=Name(id='x', ctx=Load())))"