What is the difference between:
foo = TOKEN1 + TOKEN2
and
foo = Combine(TOKEN1 + TOKEN2)
Thanks.
UPDATE: Based on my experimentation, it seems like Combine() is for terminals, where you're trying to build an expression to match on, whereas plain + is for non-terminals. But I'm not sure.
Combine has two effects:
it concatenates all the matched tokens into a single string
it requires the matching tokens to all be adjacent, with no intervening whitespace
If you create an expression like
realnum = Word(nums) + "." + Word(nums)
then realnum.parseString("3.14") will return a list of 3 tokens: the leading '3', the '.', and the trailing '14'. But if you wrap this in Combine, as in:
realnum = Combine(Word(nums) + "." + Word(nums))
then realnum.parseString("3.14") will return '3.14' (which you could then convert to a float using a parse action). And since Combine suppresses pyparsing's default whitespace skipping between tokens, you won't accidentally match "3.14" in "The answer is 3. 14 is the next answer."
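Both effects can be seen in a short runnable sketch:

```python
from pyparsing import Combine, ParseException, Word, nums

realnum = Combine(Word(nums) + "." + Word(nums))

# the matched pieces come back concatenated as a single token
print(realnum.parseString("3.14"))  # -> ['3.14']

# intervening whitespace makes the match fail, since Combine
# suppresses pyparsing's default whitespace skipping
try:
    realnum.parseString("3. 14")
except ParseException:
    print("no match across whitespace")
```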
So I am making a parser, and I noticed a problem. To parse numbers, I have:
from pyparsing import Word, nums
n = Word(nums)
This works for numbers without thousands separators: for example, n.parseString("1000", parseAll=True) returns (['1000'], {}).
However, it fails as soon as I add a thousands separator: n.parseString("1,000", parseAll=True) raises pyparsing.ParseException: Expected end of text, found ',' (at char 1), (line:1, col:2).
How can I parse numbers with thousands separators? I don't just want to ignore the commas (for example, n.parseString("1,00", parseAll=True) should still raise an error, since "1,00" is not a valid number).
A pure pyparsing approach would use Combine to wrap a series of pyparsing expressions representing the different fields that you are seeing in the regex:
import pyparsing as pp
int_with_thousands_separators = pp.Combine(pp.Optional("-")
    + pp.Word(pp.nums, max=3)
    + ("," + pp.Word(pp.nums, exact=3))[...])
I've found that building up numeric expressions like this results in much slower parse time, because all those separate parts are parsed independently, with multiple internal function and method calls (which are real performance killers in Python). So you can replace this with an expression using Regex:
# more efficient parsing with a Regex
int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")
You could also use the code as posted by Jan, and pass that compiled regex to the Regex constructor.
To do parse-time conversion to int, add a parse action that strips out the commas.
# add parse action to convert to int, after stripping ','s
int_with_thousands_separators.addParseAction(
    lambda t: int(t[0].replace(",", "")))
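Putting the Regex form and the parse-time conversion together, a minimal runnable sketch:

```python
import pyparsing as pp

# match an optionally signed integer with ','-grouped thousands
int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")

# strip the commas and convert to int at parse time
int_with_thousands_separators.addParseAction(
    lambda t: int(t[0].replace(",", "")))

print(int_with_thousands_separators.parseString("-3,000,100", parseAll=True)[0])  # -3000100
```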
I like using runTests to check out little expressions like this - it's easy to write a series of test strings, and the output shows either the parsed result or an annotated input string with the parse failure location. ("1,00" is included as an intentional error to demonstrate error output by runTests.)
int_with_thousands_separators.runTests("""\
1
# invalid value
1,00
1,000
-3,000,100
""")
If you want to parse real numbers, add pieces to represent the trailing decimal point and following digits.
real_with_thousands_separators = pp.Combine(pp.Optional("-")
    + pp.Word(pp.nums, max=3)
    + ("," + pp.Word(pp.nums, exact=3))[...]
    + "." + pp.Word(pp.nums))
# more efficient parsing with a Regex
real_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*\.\d+")
# add parse action to convert to float, after stripping ','s
real_with_thousands_separators.addParseAction(
    lambda t: float(t[0].replace(",", "")))
real_with_thousands_separators.runTests("""\
# invalid values
1
1,00
1,000
-3,000,100
1.
# valid values
1.732
-273.15
""")
As you are dealing with strings in the first place, you could very well use a regular expression to ensure that the input is indeed a number (thousands separators included). If it is, remove every comma and feed the result to the parser:
import re
from pyparsing import Word, nums
n = Word(nums)
def is_number(number):
    rx = re.compile(r'^-?\d+(?:,\d{3})*$')
    if rx.match(number):
        return number.replace(",", "")
    raise ValueError

try:
    number = is_number("10,000,000")
    print(n.parseString(number, parseAll=True))
except ValueError:
    print("Not a number")
With this, e.g. "1,00" will result in Not a number; see a demo of the expression on regex101.com.
I don't quite understand what you mean by "numbers with thousands separators".
In any case, with pyparsing you should define the pattern of what you want to parse.
In your first example, pyparsing works just because you defined n as a plain number:
n = Word(nums)
print(n.parseString("1000", parseAll=True))
['1000']
So, if you want to parse "1,000" or "1,00", you should define n as:
n = Word(nums) + ',' + Word(nums)
print(n.parseString("1,000", parseAll=True))
['1', ',', '000']
print(n.parseString("1,00", parseAll=True))
['1', ',', '00']
I also came up with a regex solution, kind of late:
from pyparsing import Word, nums
import re
n = Word(nums)
def parseNumber(x):
    parseable = re.sub('[,][0-9]{3}', lambda y: y.group()[1:], x)
    return n.parseString(parseable, parseAll=True)
print(parseNumber("1,000,123"))
The POS tagger that I use processes the following string
3+2
as shown below.
3/num++/sign+2/num
I'd like to split this result as follows using python.
['3/num', '+/sign', '2/num']
How can I do that?
Use re.split -
>>> import re
>>> re.split(r'(?<!\+)\+', '3/num++/sign+2/num')
['3/num', '+/sign', '2/num']
The regex pattern will split on a + sign as long as no other + precedes it.
(?<! # negative lookbehind
\+ # plus sign
)
\+ # plus sign
Note that lookbehinds in Python's re module must be fixed-width; they do not support variable-length patterns.
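For example, Python's re module rejects a variable-width lookbehind at compile time (the third-party regex module does support them):

```python
import re

# fixed-width lookbehind: compiles fine
re.compile(r'(?<!\+)\+')

# variable-width lookbehind: rejected by re
try:
    re.compile(r'(?<=\++)\+')
except re.error as e:
    print("rejected:", e)
```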
The tricky part, I believe, is the double + sign. You can replace the signs with special placeholder characters and get it done.
This should work,
st = '3/num++/sign+2/num'
st = st.replace('++', '#$')
st = st.replace('+', '#')
st = st.replace('$', '+')
print (st.split('#'))
One issue with this is that your original string cannot contain those special characters, # and $. So you will need to choose them carefully for your use case.
Edit: This answer is naive. The one with regex is better.
That is, as pointed out by COLDSPEED, you should use the regex approach with a lookbehind:
import re
print(re.split(r'(?<!\+)\+', '3/num++/sign+2/num'))
Although the ask was to use regex, here is an example of how to do this with the standard .split():
my_string = '3/num++/sign+2/num'
my_list = []
result = []

# enumerate over the split string
for e in my_string.split('/'):
    if '+' in e:
        if '++' in e:
            # split element on double + and add in + as well
            my_list.append(e.split('++')[0])
            my_list.append('+')
        else:
            # split element on single +
            my_list.extend(e.split('+'))
    else:
        # add element
        my_list.append(e)

# at this point my_list contains
# ['3', 'num', '+', 'sign', '2', 'num']

# enumerate on the list, in steps of 2
for i in range(0, len(my_list), 2):
    # add result
    result.append(my_list[i] + '/' + my_list[i+1])

print('result', result)
# result returns
# ['3/num', '+/sign', '2/num']
Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the '0', and so forth.
Is there a simple way to do this using regex?
Your answer is rather trivial: the .* combination matches 0 or more characters. For your purpose, you would put it between all the characters of your search string, as in S.*x.*2.*0.*1.*6. If this pattern matches, then the string obeys your condition.
For a general string you would insert the .* pattern between the characters, also taking care to escape special characters like literal dots and stars that would otherwise be interpreted by the regex engine.
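A small sketch of that generalization, using re.escape on each character (subsequence_pattern is just an illustrative helper name):

```python
import re

def subsequence_pattern(s):
    # escape each character so '.', '*', '+' etc. are matched literally
    return '.*'.join(re.escape(c) for c in s)

print(subsequence_pattern("Sx2016"))                           # S.*x.*2.*0.*1.*6
print(bool(re.search(subsequence_pattern("a.b"), "axx.yyb")))  # True
```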
This function might fit your need
import re
def check_string(text, pattern):
    return re.match('.*'.join(pattern), text)
'.*'.join(pattern) creates a pattern with all your characters separated by '.*'. For instance:
>>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'
Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
... result = [welcome.index(c) for c in string_to_find]
... except ValueError:
... result = None
...
>>> print(result and result == sorted(result))
True
Actually, for a sequence of chars like Sx2016, the pattern that best serves your purpose is a more specific one:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can perform this kind of check by defining a function like this:
import re
def contains_sequence(text, seq):
    pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
    return re.search(pattern, text)
This approach adds a layer of complexity, but it brings a couple of advantages as well:
It's the fastest, because the regex engine walks the string only once, whereas the dot-star approach runs to the end of the string and backtracks each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 steps
It also works on multiline input.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> list(map(lambda x: x + ': yes' if contains_sequence(x, sequence) else x + ': no', inputs))
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']
I am trying to parse the following text using pyparsing.
acp (SOLO1,
"solo-100",
"hi here is the gift"
"Maximum amount of money, goes",
430, 90)
jhk (SOLO2,
"solo-101",
"hi here goes the wind."
"and, they go beyond",
1000, 320)
I have tried the following code but it doesn't work.
from pyparsing import Word, alphas, nums, Forward, nestedExpr

flag = Word(alphas + nums + '_' + '-')
enclosed = Forward()
nestedBrackets = nestedExpr('(', ')', content=enclosed)
enclosed << (flag | nestedBrackets)
print(list(enclosed.searchString(str1)))
The comma (,) within the quoted strings is producing undesired results.
Well, I might have oversimplified slightly in my comments - here is a more complete answer.
If you don't really have to deal with nested data items, then a single-level parenthesized data group in each section will look like this:
from pyparsing import (Word, alphas, alphanums, nums, Suppress,
                       OneOrMore, quotedString, Group, delimitedList)

LPAR, RPAR = map(Suppress, "()")
ident = Word(alphas, alphanums + "-_")
integer = Word(nums)

# treat consecutive quoted strings as one combined string
quoted_string = OneOrMore(quotedString)

# add parse action to concatenate multiple adjacent quoted strings
quoted_string.setParseAction(lambda t: '"' +
                             ''.join(map(lambda s: s.strip('"\''), t)) +
                             '"' if len(t) > 1 else t[0])

data_item = ident | integer | quoted_string

# section defined with no nesting
section = ident + Group(LPAR + delimitedList(data_item) + RPAR)
I wasn't sure if it was intentional or not when you omitted the comma between two consecutive quoted strings, so I chose to implement logic like Python's compiler, in which two adjacent quoted strings are treated as just one longer string; that is, "AB CD " "EF" is the same as "AB CD EF". This was done with the definition of quoted_string, and by adding the parse action to quoted_string to concatenate the contents of the two or more component quoted strings.
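To see just the quoted-string merging in isolation, a small sketch:

```python
import pyparsing as pp

# adjacent quoted strings are merged, like Python's implicit string concatenation
quoted_string = pp.OneOrMore(pp.quotedString)
quoted_string.setParseAction(lambda t: '"' +
                             ''.join(s.strip('"\'') for s in t) +
                             '"' if len(t) > 1 else t[0])

print(quoted_string.parseString('"AB CD " "EF"'))  # ['"AB CD EF"']
print(quoted_string.parseString('"AB"'))           # ['"AB"']
```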
Finally, we create a parser for the overall group
results = OneOrMore(Group(section)).parseString(source)
results.pprint()
and get from your posted input sample:
[['acp',
['SOLO1',
'"solo-100"',
'"hi here is the giftMaximum amount of money, goes"',
'430',
'90']],
['jhk',
['SOLO2',
'"solo-101"',
'"hi here goes the wind.and, they go beyond"',
'1000',
'320']]]
If you do have nested parenthetical groups, then your section definition can be as simple as this:
# section defined with nesting
section = ident + nestedExpr()
Although, as you have already found, this will retain the separate commas as if they were significant tokens instead of just data separators.
I've searched but didn't quite find something for my case. Basically, I'm trying to split the following line:
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
You can read this as: CU is NOT DIVD or WEXP or DIVD- and so on. What I'd like to do is split this line, if it's over 65 characters, into something more manageable, like this:
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-)
(CU!INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT-)
(CU!RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+)
(CU!RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
They're all less than 65 characters. These can be stored in a list and I can take care of the rest. I've started working on this with regex but I'm having a bit of trouble.
Additionally, it can also have the following conditionals:
!
<
>
=
!=
!<
!>
As of now, I have this:
def FilterParser(iteratorIn, headerIn):
    listOfStrings = []
    for eachItem in iteratorIn:
        if len(str(eachItem.text)) > 65:
            exmlLogger.error('The length of filter ' + eachItem.text + ' exceeds the limit and will be dropped')
            pass
        else:
            listOfStrings.append(rightSpaceFill(headerIn + EXUTIL.intToString(eachItem), 80))
    return ''.join(listOfStrings)
Here is a solution using regex, edited to include the CU! prefix (or any other prefix) to the beginning of each new line:
import re
s = '(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-:INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -:RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+:RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)'
prefix = '(' + re.search(r'\w+(!?[=<>]|!)', s).group(0)
maxlen = 64 - len(prefix) # max line length of 65, prefix and ')' will be added
regex = re.compile(r'(.{1,%d})(?:$|:)' % maxlen)
lines = [prefix + line + ')' for line in regex.findall(s[len(prefix):-1])]
>>> print('\n'.join(lines))
(CU!DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-)
(CU!INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -)
(CU!RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+)
(CU!RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD)
First we need to grab the prefix; we do this using re.search().group(0), which returns the entire match. Each of the final lines should be at most 65 characters; the regex that we will use to get these lines will not include the prefix or the closing parenthesis, which is why maxlen is 64 - len(prefix).
Now that we know the most characters we can match, the first part of the regex, (.{1,<maxlen>}), will match at most that many characters. The portion at the end, (?:$|:), makes sure that we only split the string on colons or at the end of the string. Since there is only one capturing group, regex.findall() will return only that group's match, leaving off the trailing colon. Here is what it looks like for your sample string:
>>> pprint.pprint(regex.findall(s[len(prefix):-1]))
['DIVD:WEXP:DIVD-:DIVD+:RWEXP:RDIVD:RECL:RLOSS:MISCDI:WEXP-',
'INT:RGAIN:DIVOP:RRGAIN:DIVOP-:RDIVOP:RRECL:RBRECL:INT -',
'RRLOSS:INT +:RINT:RDIVD-:RECL-:RWXPOR:WEXPOR:MISCRE:WEXP+',
'RWEXP-:RBWEXP:RECL+:RRECL-:RBDIVD']
The list comprehension constructs the list of final lines by adding the prefix and the trailing ) to each result. The slicing of s strips the prefix and the trailing ) from the original string before regex.findall(). Hope this helps!