How can I match one or more parenthetical expressions appearing at the end of string?
Input:
'hello (i) (m:foo)'
Desired output:
['i', 'm:foo']
Intended for a python script. Paren marks cannot appear inside of each other (no nesting), and the parenthetical expressions may be separated by whitespace.
It's harder than it might seem at first glance, at least so it seems to me.
paren_pattern = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)")
def getParens(s):
return paren_pattern.findall(s)
or even shorter:
getParens = re.compile(r"\(([^()]*)\)(?=(?:\s*\([^()]*\))*\s*$)").findall
explaination:
\( # opening paren
([^()]*) # content, captured into group 1
\) # closing paren
(?= # look ahead for...
(?:\s*\([^()]*\))* # a series of parens, separated by whitespace
\s* # possibly more whitespace after
$ # end of string
) # end of look ahead
You don't need to use regex:
def splitter(input):
return [ s.rstrip(" \t)") for s in input.split("(") ][1:]
print splitter('hello (i) (m:foo)')
Note: this solution only works if your input is already known to be valid. See MizardX's solution that will work on any input.
Related
I'm trying to fix the . at the end to only one in a string. For example,
line = "python...is...fun..."
I have the regex \.*$ in Ruby, which is to be replaced by a single ., as in this demo, which don't seem to work as expected. I've searched for similar posts, and the closest I'd got is this answer in Python, which suggests the following,
>>> text1 = 'python...is...fun...'
>>> new_text = re.sub(r"\.+$", ".", text1)
>>> 'python...is...fun.'
But, it fails if I've no . at the end. So, I've tried like \b\.*$, as seen here, but this fails on the 3rd test which has some ?'s at end.
My question is, why \.*$ not matches all the .'s (despite of being greedy) and how to do the problem correctly?
Expected output:
python...is...fun.
python...is...fun.
python...is...fun??.
You might use an alternation matching either 2 or more dots or assert that what is directly to the left is not one of for example ! ? or a dot itself.
In the replacement use a single dot.
(?:\.{2,}|(?<!\.))$
Explanation
(?: Non capture group for the alternation
\.{2,} Match 2 or more dots
| Or
(?<!\.) Get the position where directly to the left is not a . (which you can extend with other characters as desired)
) Close non capture group
$ End of string (Or use \Z if there can be no newline following)
Regex demo | Python demo
For example
import re
strings = [
"python...is...fun...",
"python...is...fun",
"python...is...fun??"
]
for s in strings:
new_text = re.sub(r"(?:\.{2,}|(?<!\.))$", ".", s)
print(new_text)
Output
python...is...fun.
python...is...fun.
python...is...fun??.
If an empty string should not be replaced by a dot, you can use a positive lookbehind.
(?:\.{2,}|(?<=[^\s.]))$
Regex demo
I'm doing something like "Syntax Analyzer" with Kivy, using re (regular expresions).
I only want to check a valid syntax for basic operations (like +|-|*|/|(|)).
The user tape the string (with keyboard) and I validate it with regex.
But I don't know how to use regex in an if statement. That I want is: If the string that user brings me isn't correct (or doesn't check with regex) print something like "inavlid string" and if is correct print "Valid string".
I've tried with:
if re.match(patron, string) is not None:
print ("\nTrue")
else:
print("False")
but, it doesn't matter what do string has, the app always show True.
Sorry my poor english. Any help would be greatly appreciated!
import re
patron= re.compile(r"""
(
-?\d+[.\d+]?
[+*-/]
-?\d+[.\d+]?
[+|-|*|/]?
)*
""", re.X)
obj1= self.ids['text'].text #TextInput
if re.match(patron, obj1) is not None:
print ("\nValid String")
else:
print("Inavlid string")
if obj1= "53.22+22.11+10*555+62+55.2-66" actually it's correct and app prints "Valid..." but if I put an a like this "a53.22+22.11+10*555+62+55.2-66" it's incorrect and the app must prints invalid.. but instead it still valid.
Your regex always matches because it allows the empty string to match (since the entire regex is enclosed in an optional group.
If you test this live on regex101.com, you can immediately see this and also that it doesn't match the entire string but only parts of it.
I've already corrected two errors in your character classes concerning the use of unnecessary/harmful alternation operators (|) and incorrect placement of the dash, making it into a range operator (-), but it's still incorrect.
I think you want something more like this:
^ # Make sure the match begins at the start of the string
(?: # Start a non-capturing group that matches...
-? # an optional minus sign,
\d+ # one or more digits
(?:\.\d+)? # an optional group that contains a dot and one or more digits.
(?: # Start of a non-capturing group that either matches...
[+*/-] # an operator
| # or
$ # the end of the string.
) # End of inner non-capturing group
)+ # End of outer non-capturing group, required to match at least once.
(?<![+*/-]) # Make sure that the final character isn't an operator.
$ # Make sure that the match ends at the end of the string.
Test it live on regex101.com.
This answers your question about how to use if with regex:
Caveat: the regex formula will not weed out all invalid inputs, e.g., two decimal points (".."), two operators ("++"), and such. So please adjust it to suit your exact needs)
import re
regex = re.compile(r"[\d.+\-*\/]+")
input_list = [
"53.22+22.11+10*555+62+55.2-66", "a53.22+22.11+10*555+62+55.2-66",
"53.22+22.pq11+10*555+62+55.2-66", "53.22+22.11+10*555+62+55.2-66zz",
]
for input_str in input_list:
mmm = regex.match(input_str)
if mmm and input_str == mmm.group():
print('Valid: ', input_str)
else:
print('Invalid: ', input_str)
Above as a function for use with a single string instead of a list:
import re
regex = re.compile(r"[\d.+\-*\/]+")
def check_for_valid_string(in_string=""):
mmm = regex.match(in_string)
if mmm and in_string == mmm.group():
return 'Valid: ', in_string
return 'Invalid: ', in_string
check_for_valid_string('53.22+22.11+10*555+62+55.2-66')
check_for_valid_string('a53.22+22.11+10*555+62+55.2-66')
check_for_valid_string('53.22+22.pq11+10*555+62+55.2-66')
check_for_valid_string('53.22+22.11+10*555+62+55.2-66zz')
Output:
## Valid: 53.22+22.11+10*555+62+55.2-66
## Invalid: a53.22+22.11+10*555+62+55.2-66
## Invalid: 53.22+22.pq11+10*555+62+55.2-66
## Invalid: 53.22+22.11+10*555+62+55.2-66zz
The Setup:
Let's say I have the following regex defined in my script. I want to keep the comments there for future me because I'm quite forgetful.
RE_TEST = re.compile(r"""[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
""",
re.VERBOSE)
print(magic_function(RE_TEST)) # returns: "[0-9][A-Z][a-y]z"
The Question:
Does Python (3.4+) have a way to convert that to the simple string "[0-9][A-Z][a-y]z"?
Possible Solutions:
This question ("strip a verbose python regex") seems to be pretty close to what I'm asking for and it was answered. But that was a few years ago, so I'm wondering if a new (preferably built-in) solution has been found.
In addition to the above, there are work-arounds such as using implicit string concatenation and then using the .pattern attribute:
RE_TEST = re.compile(r"[0-9]" # 1 Number
r"[A-Z]" # 1 Uppercase Letter
r"[a-y]" # 1 lowercase, but not z
r"z", # gotta have z...
re.VERBOSE)
print(RE_TEST.pattern) # returns: "[0-9][A-Z][a-y]z"
or just commenting the pattern separately and not compiling it:
# matches pattern "nXxz"
RE_TEST = "[0-9][A-Z][a-y]z"
print(RE_TEST)
But I'd really like to keep the compiled regex the way it is (1st example). Perhaps I'm pulling the regex string from some file, and that file is already using the verbose form.
Background
I'm asking because I want to suggest an edit to the unittest module.
Right now, if you run assertRegex(string, pattern) using a compiled pattern with comments and that assertion fails, then the printed output is somewhat ugly (the below is a dummy regex):
Traceback (most recent call last):
File "verify_yaml.py", line 113, in test_verify_mask_names
self.assertRegex(mask, RE_MASK)
AssertionError: Regex didn't match: '(X[1-9]X[0-9]{2}) # comment\n |(XXX[0-9]{2}) # comment\n |(XXXX[0-9E]) # comment\n |(XXXX[O1-9]) # c
omment\n |(XXX[0-9][0-9]) # comment\n |(XXXX[
1-9]) # comment\n ' not found in 'string'
I'm going to propse that the assertRegex and assertNotRegex methods clean the regex before printing it by either removing the comments and extra whitespace or by printing it differently.
The following tested script includes a function that does a pretty good job converting an xmode regex string to non-xmode:
pcre_detidy(retext)
# Function pcre_detidy to convert xmode regex string to non-xmode.
# Rev: 20160225_1800
import re
def detidy_cb(m):
if m.group(2): return m.group(2)
if m.group(3): return m.group(3)
return ""
def pcre_detidy(retext):
decomment = re.compile(r"""(?#!py/mx decomment Rev:20160225_1800)
# Discard whitespace, comments and the escapes of escaped spaces and hashes.
( (?: \s+ # Either g1of3 $1: Stuff to discard (3 types). Either ws,
| \#.* # or comments,
| \\(?=[\r\n]|$) # or lone escape at EOL/EOS.
)+ # End one or more from 3 discardables.
) # End $1: Stuff to discard.
| ( [^\[(\s#\\]+ # Or g2of3 $2: Stuff to keep. Either non-[(\s# \\.
| \\[^# Q\r\n] # Or escaped-anything-but: hash, space, Q or EOL.
| \( # Or an open parentheses, optionally
(?:\?\#[^)]*(?:\)|$))? # starting a (?# Comment group).
| \[\^?\]? [^\[\]\\]* # Or Character class. Allow unescaped ] if first char.
(?:\\[^Q][^\[\]\\]*)* # {normal*} Zero or more non-[], non-escaped-Q.
(?: # Begin unrolling loop {((special1|2) normal*)*}.
(?: \[(?::\^?\w+:\])? # Either special1: "[", optional [:POSIX:] char class.
| \\Q [^\\]* # Or special2: \Q..\E literal text. Begin with \Q.
(?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
(?:\\E|$) # \Q..\E literal text ends with \E or EOL.
) [^\[\]\\]* # End special: One of 2 alternatives {(special1|2)}.
(?:\\[^Q][^\[\]\\]*)* # More {normal*} Zero or more non-[], non-escaped-Q.
)* (?:\]|\\?$) # End character class with ']' or EOL (or \\EOL).
| \\Q [^\\]* # Or \Q..\E literal text start delimiter.
(?:\\(?!E)[^\\]*)* # \Q..\E contents - everything up to \E.
(?:\\E|$) # \Q..\E literal text ends with \E or EOL.
) # End $2: Stuff to keep.
| \\([# ]) # Or g3of3 $6: Escaped-[hash|space], discard the escape.
""", re.VERBOSE | re.MULTILINE)
return re.sub(decomment, detidy_cb, retext)
test_text = r"""
[0-9] # 1 Number
[A-Z] # 1 Uppercase Letter
[a-y] # 1 lowercase, but not z
z # gotta have z...
"""
print(pcre_detidy(test_text))
This function detidies regexes written in pcre-8/pcre2-10 xmode syntax.
It preserves whitespace inside [character classes], (?#comment groups) and \Q...\E literal text spans.
RegexTidy
The above decomment regex, is a variant of one I am using in my upcoming, yet to be released: RegexTidy application, which will not only detidy a regex as shown above (which is pretty easy to do), but it will also go the other way and Tidy a regex - i.e. convert it from non-xmode regex to xmode syntax, adding whitespace indentation to nested groups as well as adding comments (which is harder).
p.s. Before giving this answer a downvote on general principle because it uses a regex longer than a couple lines, please add a comment describing one example which is not handled correctly. Cheers!
Looking through the way sre_parse handles this, there really isn't any point where your verbose regex gets "converted" into a regular one and then parsed. Rather, your verbose regex is being fed directly to the parser, where the presence of the VERBOSE flag makes it ignore unescaped whitespace outside character classes, and from unescaped # to end-of-line if it is not inside a character class or a capture group (which is missing from the docs).
The outcome of parsing your verbose regex there is not "[0-9][A-Z][a-y]z". Rather it is:
[(IN, [(RANGE, (48, 57))]), (IN, [(RANGE, (65, 90))]), (IN, [(RANGE, (97, 121))]), (LITERAL, 122)]
In order to do a proper job of converting your verbose regex to "[0-9][A-Z][a-y]z" you could parse it yourself. You could do this with a library like pyparsing. The other answer linked in your question uses regex, which will generally not duplicate the behavior correctly (specifically, spaces inside character classes and # inside capture groups/character classes. And even just dealing with escaping is not as convenient as with a good parser.)
I am using Python 2.7 and I am fairly familiar with using regular expressions and how to use them in Python. I would like to use a regex to replace comma delimiters with a semicolon. The problem is that data wrapped in double qoutes should retain embedded commas. Here is an example:
Before:
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After:
"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
Is there a single regex that can do this?
This is an other way that avoids to test all the string until the end with a lookahead for each occurrence. It's a kind of (more or less) \G feature emulation for re module.
Instead of testing what comes after the comma, this pattern find the item before the comma (and the comma obviously) and is written in a way that makes each whole match consecutive to the precedent.
re.sub(r'(?:(?<=,)|^)(?=("(?:"")*(?:[^"]+(?:"")*)*"|[^",]*))\1,', r'\1;', s)
online demo
details:
(?: # ensures that results are contiguous
(?<=,) # preceded by a comma (so, the one of the last result)
| # OR
^ # at the start of the string
)
(?= # (?=(a+))\1 is a way to emulate an atomic group: (?>a+)
( # capture the precedent item in group 1
"(?:"")*(?:[^"]+(?:"")*)*" # an item between quotes
|
[^",]* # an item without quotes
)
) \1 # back-reference for the capture group 1
,
The advantage of this way is that it reduces the number of steps to obtain a match and provides a near from constant number of steps whatever the item before (see the regex101 debugger). The reason is that all characters are matched/tested only once. So even the pattern is more long, it is more efficient (and the gain grow up in particular with long lines)
The atomic group trick is only here to reduce the number of steps before failing for the last item (that is not followed by a comma).
Note that the pattern deals with items between quotes with escaped quotes (two consecutive quotes) inside: "abcd""efgh""ijkl","123""456""789",foo
# Python 2.7
import re
text = '''
"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
'''.strip()
print "Before: " + text
print "After: " + ";".join(re.findall(r'(?:"[^"]+"|[^,]+)', text))
This produces the following output:
Before: "3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"
After: "3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
You can tinker with this here if you need more customization.
You can use:
>>> s = 'foo bar,"3,14","1,000,000",hippo,"cat,dog,frog",plain text,"2,25"'
>>> print re.sub(r'(?=(([^"]*"){2})*[^"]*$),', ';', s)
foo bar;"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"
RegEx Demo
This will match comma only if it is outside quote by matching even number of quotes after ,.
This regex seems to do the job
,(?=(?:[^"]*"[^"]*")*[^"]*\Z)
Adapted from:
How to match something with regex that is not between two special characters?
And tested with http://pythex.org/
You can split with regex and then join it :
>>> ';'.join([i.strip(',') for i in re.split(r'(,?"[^"]*",?)?',s) if i])
'"3,14";"1,000,000";hippo;"cat,dog,frog";plain text;"2,25"'
Why do these two expressions return the same output?
phillip = '#awesome '
nltk.re_show('\w+|[^\w\s]+', phillip)
vs.
nltk.re_show('\w+|[^\w]+', phillip)
Both return:
{#}{awesome}
Why doesn't the second one return
{#}{awesome}{ }?
It appears this that nltk right-strips whitespace in strings before applying the regex.
See the source code (or you could import inspect and print inspect.get_source(nltk.re_show))
def re_show(regexp, string, left="{", right="}"):
"""docstring here -- I stripped it for brevity"""
print(re.compile(regexp, re.M).sub(left + r"\g<0>" + right, string.rstrip()))
In particular, see the string.rstrip(), which strips all trailing whitespace.
For example, if you make sure that your phillip string does not have a space to the right:
nltk.re_show('\w+|[^\w]+', phillip + '.')
# {#}{awesome}{ .}
Not sure why nltk would do this, it seems like a bug to me...
\w looks to match [A-Za-z0-9_]. And since you are looking for one OR the other (1+ "word" characters OR 1+ non-"word" characters), it matches the first character as a \w character and keeps going until it hits a non-match .
If you do a global match, you will see that there is another match containing the space (the first non-"word" character).