How to ignore comments inside string literals - python

I'm doing a lexer as a part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how could we implement comments inside string literals.
Our string literals start and end with an exclamation mark, e.g. !this is a string literal!
Our comments start and end with three periods, e.g. ...This is a comment...
Removing comments from string literals was relatively straightforward: match the string literal via /!.*!/ and remove the comment via regex. If there are three consecutive periods but no closing periods, throw an error.
However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.
What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?
Examples:
!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !
This is my current code that can't ignore exclamation marks inside comments:
import re

def t_STRING_LITERAL(t):
    r'![^!\\]*(?:\\.[^!\\]*)*!'
    # remove the escape characters from the string
    t.value = re.sub(r'\\!', "!", t.value)
    # remove single line comments
    t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
    return t

Perhaps this might be another option.
Match 0+ times any character except a backslash, dot or exclamation mark using the first negated character class.
Then, when you reach a character that the first character class does not match, use an alternation to match either:
repeat 0+ times matching either a dot that is not directly followed by 2 dots
or match from 3 dots to the next first match of 3 dots
or match only an escaped character
To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, then use the backreference to \1 to match.
For example
(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!
Explanation
(?<!\\)! Match ! not directly preceded by \
[^!\\.]* Match 0+ times any char except ! \ or .
(?: Non capture group
(?:\.(?!\.\.) Match a dot not directly followed by 2 dots
| Or
(?=(\.{3}.*?\.{3}))\1 Assert and capture in group 1 from ... to the nearest ...
| Or
\\. Match an escaped char
) Close group
[^!\\.]* Match 0+ times any char except ! \ or .
)*! Close non capture group and repeat 0+ times, then match !
Regex demo
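As a quick sanity check, the pattern above can be run over the four example literals from the question. This is only a sketch: STRING_LITERAL, COMMENT and literal_value are hypothetical names (not part of any lexer library), and the cleanup order, stripping comments first and unescaping afterwards, is an assumption about the intended semantics.

```python
import re

# The pattern from the answer above, compiled once.
STRING_LITERAL = re.compile(
    r'(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!'
)
# A comment: three dots, anything (lazily), three dots.
COMMENT = re.compile(r'\.{3}.*?\.{3}')


def literal_value(literal):
    """Strip the delimiters, drop comments, then unescape (assumed order)."""
    body = literal[1:-1]
    body = COMMENT.sub('', body)
    return body.replace('\\!', '!')


examples = [
    r'!Normal string!',
    r'!String with escaped \! exclamation mark!',
    r'!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]

for example in examples:
    match = STRING_LITERAL.search(example)
    # Each example should be consumed as a single literal.
    print(match.group(0) == example, repr(literal_value(match.group(0))))
```

Removing comments before unescaping means a \! inside a comment is discarded together with the comment instead of leaking into the value.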

Look at this regex to match string literals: https://regex101.com/r/v2bjWi/2.
(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!
It is delimited by two (?<!\\)! tokens, each meaning an unescaped exclamation mark.
Between them it alternates among escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.) and non-exclamation-mark characters [^!].
Note that this is about as much as you can achieve with a regular expression. Any additional request, and it will not be sufficient any more.
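To see what this pattern captures, here is a small sketch that runs it against the last example from the question. LITERAL is a hypothetical name. Because the comment group sits inside a repeated alternation, Python's re module keeps only the last comment it matched.

```python
import re

# The pattern from this answer, without the trailing sentence period.
LITERAL = re.compile(
    r'(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!'
)

source = r'!String \! with both ... comments can have unescaped exclamation marks!!!... !'
match = LITERAL.search(source)

# The whole line is one literal; the named group holds the comment text.
print(match.group(0) == source)
print(repr(match.group('comment')))
```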

Related

Regex that captures a group with a positive lookahead but doesn't match a pattern

Using regex (Python) I want to capture a group \d-.+? that is immediately followed by another pattern \sLEFT|\sRIGHT|\sUP.
Here is my test set (from http://nflsavant.com/about.php):
(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).
(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).
(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).
(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).
My best attempt is (\d*-.+?)(?=\sLEFT|\sRIGHT|\sUP), which works unless other characters appear between a matching capture group and my positive lookahead. In the second line of my test set this expression captures "69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK." instead of the desired "33-D.COOK".
My inputs are also saved on regex101, here: https://regex101.com/r/tEyuiJ/1
How can I modify (or completely rewrite) my regex to only capture the group immediately followed by my exact positive lookahead with no extra characters between?
To prevent skipping over digits, use \D, which matches any non-digit (the uppercase form is the negation of \d).
\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)
See this demo at regex101
Further added a word boundary and changed the lookahead to a group.
If you want a capture group without any lookarounds:
\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b
Explanation
\b A word boundary to prevent a partial word match
(\d+-\S*) Capture group 1, match 1+ digits - and optional non whitespace characters
\s Match a single whitespace character
(?:LEFT|RIGHT|UP) Match any of the alternatives
\b A word boundary
See the capture group value on regex101.
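Both patterns from this answer can be checked against the four test lines from the question. The sketch below is illustrative only; PLAYS, LAZY_NON_DIGIT and NON_WHITESPACE are hypothetical names.

```python
import re

PLAYS = [
    "(9:03) (SHOTGUN) 30-J.RICHARD LEFT GUARD PUSHED OB AT MIA 9 FOR 18 YARDS (29-BR.JONES; 21-E.ROWE).",
    "(1:06) 69-R.HILL REPORTED IN AS ELIGIBLE. 33-D.COOK LEFT GUARD TO NO 4 FOR -3 YARDS (56-D.DAVIS; 93-D.ONYEMATA).",
    "(3:34) (SHOTGUN) 28-R.FREEMAN LEFT TACKLE TO LAC 37 FOR 6 YARDS (56-K.MURRAY JR.).",
    "(1:19) 22-L.PERINE UP THE MIDDLE TO CLE 43 FOR 2 YARDS (54-O.VERNON; 51-M.WILSON).",
]

# \D+? cannot run across the digits of a later jersey number.
LAZY_NON_DIGIT = re.compile(r'\b(\d+-\D+?)\s(?:LEFT|RIGHT|UP)')
# \S* cannot run across whitespace at all.
NON_WHITESPACE = re.compile(r'\b(\d+-\S*)\s(?:LEFT|RIGHT|UP)\b')

lazy_results = [LAZY_NON_DIGIT.search(play).group(1) for play in PLAYS]
strict_results = [NON_WHITESPACE.search(play).group(1) for play in PLAYS]

print(lazy_results)
print(strict_results)
```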
This is why you should be careful about using . to match anything and everything unless it's absolutely necessary. From the example you provided, it appears that what you actually want to capture contains no spaces, so we could use a negated character class [^\s] or, more precisely, [\w.], in either case with a * quantifier.
Your end result would look like (\d*-[\w.]*)(?=\sLEFT|\sRIGHT|\sUP) with the g and m flags. And of course, when . is within a character class it's treated as a literal character, so it does not need to be escaped.
See it live at regex101.com
Try this:
\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)
\b\d+-[^\r \n]+
\b word boundary to ignore things like foo30-J.RICHARD
\d+ match one or more digit.
- match a literal -.
[^\r \n]+ match one or more characters except \r, \n and a literal space. Excluding \r and \n keeps the match from crossing newlines, which is why \s is not used (it matches \r and \n too)
(?= +(?:LEFT|RIGHT|UP)\b) Using positive lookahead.
+ Ensure there is one or more literal space .
(?:LEFT|RIGHT|UP)\b using a non-capturing group, ensure the preceding space is followed by one of the words LEFT, RIGHT or UP. The \b word boundary ignores things like RIGHTfoo or LEFTbar.
See regex demo
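The effect of the two word boundaries and the space-only lookahead can be seen on a few short inputs. A minimal sketch; RUSHER and the sample strings (other than the first, taken from the question) are made up for illustration.

```python
import re

RUSHER = re.compile(r'\b\d+-[^\r \n]+(?= +(?:LEFT|RIGHT|UP)\b)')

# Normal case: the lookahead sees " UP" followed by a word boundary.
normal = RUSHER.findall("(1:19) 22-L.PERINE UP THE MIDDLE")
# The leading \b rejects a number glued to preceding letters.
glued_prefix = RUSHER.findall("foo30-J.RICHARD LEFT GUARD")
# The trailing \b rejects a direction glued to following letters.
glued_suffix = RUSHER.findall("30-J.RICHARD LEFTbar")
# " +" tolerates several spaces before the direction.
many_spaces = RUSHER.findall("33-D.COOK   RIGHT END")

print(normal, glued_prefix, glued_suffix, many_spaces)
```

Since the pattern has no capture group, findall returns the full matches directly.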

Can I add non-capture groups with a list of optional characters?

I need to match strings which have a-z, \? or \*, for example:
abcd
abc\?d # must have a \ in front of a ?
abc\*d
ab\?c\*d
and exclude strings which don't have \ in front of other punctuations, such as
abc?d
abc*d
ab?c*d
I tried [a-z(?:\\\?)(?:\\\*)]+ (https://regex101.com/r/5yYBDl/1), but it doesn't work, because [] only supports characters i guess.
Any help would be appreciated.
You may use this regex with an alternation and anchors:
^(?:[a-z]|\\[*?])+$
Updated RegEx Demo
RegEx Details:
^: Start
(?:[a-z]|\\[*?])+: Non capturing group to match either [a-z] or \? or \*. Match 1 or more of this non capturing group.
$: End
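A short sketch of how this pattern behaves on the strings from the question; VALID, accepted and rejected are hypothetical names used only for the demonstration.

```python
import re

VALID = re.compile(r'^(?:[a-z]|\\[*?])+$')

accepted = ['abcd', r'abc\?d', r'abc\*d', r'ab\?c\*d']
rejected = ['abc?d', 'abc*d', 'ab?c*d']

for candidate in accepted + rejected:
    # A bare ? or * has no alternative to match it, so the anchors fail.
    print(candidate, bool(VALID.match(candidate)))
```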
Would matching by Unicode code point work? Depending on your application, the regex engine may support Unicode escapes:
^(?:[a-z]|\u005c[\u003f\u002a])+$
https://www.regular-expressions.info/unicode.html
snippet from this site
"Perl, PCRE, Boost, and std::regex do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times"

Convert a TRIM on regex

I have a large string and one of the lines is in the form of
Description: something here....
I want to get everything in the: something here... without any trailing or leading space on it. Currently I'm doing it with a mix of regex and a strip(). How could this be done entirely in regex? Currently:
re.search(r'Description:\s+(.+)', body).group(1).strip()
Other thoughts:
re.search(r'Description:\s+\w(.+)\w', body).group(1) # works
Also, why doesn't putting an anchor work in the above context?
re.search(r'Description:\s+\w(.+)$', body).group(1) # fails
You can use either of
Description:\s+(.*\S)
See the regex demo.
The point is that you need to match up to the last non-whitespace character. .* matches any zero or more characters other than line break chars, as many as possible, so the \S matches the last non-whitespace character in the string.
If you have a multiline string and you need to get to the last non-whitespace character, you may add re.S / re.DOTALL option when passing the pattern above to a regex method, or re-write it as
Description:\s+(\S+(?:\s+\S+)*)
where \S+ matches one or more non-whitespace chars and (?:\s+\S+)* matches zero or more occurrences of one or more whitespaces followed with one or more non-whitespace chars.
See this regex demo.
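The first pattern can be tried on a small multi-line string. This is a sketch with made-up sample data; body here is a hypothetical stand-in for the asker's large string.

```python
import re

# Hypothetical input with leading and trailing spaces around the description.
body = "Title: example\nDescription:   something here....   \nAuthor: someone\n"

match = re.search(r'Description:\s+(.*\S)', body)
description = match.group(1)

# .* stops at the line break, then backtracks until \S can match.
print(repr(description))
```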

How to exclude regex matches containing a constant string

I need help understanding exclusions in regex.
I begin with this in my Jupyter notebook:
import re
file = open('names.txt', encoding='utf-8')
data = file.read()
file.close()
Then I can't get my exclusions to work. The read file has 12 email strings in it, 3 of which contain '.gov'.
I was told this would return only those that are not .gov:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*.[^gov]
''', data, re.X|re.I)
It doesn't. It returns all the emails and excludes any characters in 'gov' following the '#'; e.g.:
abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
I've tried using ?! in various forms I found online to no avail.
For example, I was told the following syntax would exclude the entire match rather than just those characters:
#re.findall(r'''
# ^/(?!**SPECIFIC STRING TO IGNORE**)(**DEFINITION OF STRING TO RETURN**)$
#''', data, re.X|re.I)
Yet the following simply returns an empty list:
#re.findall(r'''
# ^/(?!\b[-+.\w\d]*#[-+.\w\d]*.gov)([-+.\w\d]*#[-+.\w\d].[\w]*[^\t\n])$
#''', data, re.X|re.I)
I tried to use the advice from this question:
Regular expression to match a line that doesn't contain a word
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*./^((?!.gov).)*$/s # based on syntax /^((?!**SUBSTRING**).)*$/s
#^ this slash is where different code starts
''', data, re.X|re.I)
This is supposed to be the inline syntax, and I think by including the slashes I may be making a mistake:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*./(?s)^((?!.gov).)*$/ # based on syntax /(?s)^((?!**SUBSTRING**).)*$/
''', data, re.X|re.I)
And this returns an empty list:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*.(?s)^((?!.gov).)*$ # based on syntax (?s)^((?!**SUBSTRING**).)*$
''', data, re.X|re.I)
Please help me understand how to use ?! or ^ or another exclusion syntax to return a specified string not containing another specified string.
Thanks!!
First, your regex for recognizing an email address does not look close to being correct. For example, it would accept #13a as being valid. See How to check for valid email address? for some simplifications. I will use: [^#]+#[^#]+\.[^#]+ with the recommendation that we also exclude space characters and so, in your particular case:
^([^#\s]+#[^#\s]+\.[^#\s.]+)
I also added a . to the last character class [^#\s.]+ to ensure that this represents the top-level domain. But we do not want the email address to end in .gov. Our regex specifies toward the end for matching the top-level domain:
1. \. Match a period.
2. [^#\s.]+ Match one or more non-white space, non-period characters.
In Step 2 above we should first apply a negative lookahead, i.e. a condition to ensure that the next characters are not gov. But to ensure we are not doing a partial match (if the top-level domain were government, that would be OK), gov must be followed by either white space or the end of the line to be disqualifying. So we have:
^([^#\s]+#[^#\s]+\.(?!gov(?:\s|$))[^#\s.]+)
See Regex Demo
import re
text = """abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
test#test.gov
test.test#test.org.gov.test
"""
print(re.findall(r'^([^#\s]+#[^#\s]+\.(?!gov(?:\s|$))[^#\s.]+)', text, flags=re.M|re.I))
Prints:
['abc123#abc.c', '456#email.edu', 'test.test#test.org.gov.test']
So, in my interpretation of the problem, test.test#test.org.gov.test is OK because gov is not the top-level domain. governmentemail#governmentaddress. is rejected simply because it is not a valid email address.
If you don't want gov in any level of the domain, then use this regex:
^([^#\s]+#(?!(?:\S*\.)?gov(?:\s|\.|$))[^#\s]+\.[^#\s]+)
See Regex Demo
After seeing the # symbol, this ensures that what follows is not an optional period followed by gov followed by either another period, white space character or end of line.
import re
text = """abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
test#test.gov
test.test#test.org.gov.test
"""
print(re.findall(r'^([^#\s]+#(?!(?:\S*\.)?gov(?:\s|\.|$))[^#\s]+\.[^#\s]+)', text, flags=re.M|re.I))
Prints:
['abc123#abc.c', '456#email.edu']
A few notes about the patterns you tried
This part of the pattern [-+.\w\d]*\b# can be shortened to [-+.\w]*\b# as \w also matches \d and note that it will also not match a dot
Using [-+.\w\d]*\b# will prevent a dash from matching before the # but it could match ---a#.a
The character class [-+.\w\d]* is quantified 0+ times, but it can never match zero characters here, as the word boundary \b does not match between whitespace (or the start of a line) and a #
Note that not escaping the dot . will match any character except a newline
This part ^((?!.gov).)*$ is a tempered greedy token that will, from the start of the string, match any char except a newline asserting what is on the right is not any char except a newline followed by gov until the end of the string
One option could be to use the tempered greedy token to assert that after the # there is not .gov present.
[-+.\w]+\b#(?:(?!\.gov)\S)+(?!\S)
Explanation about the separate parts
[-+.\w]+ Match 1+ times any of the listed
\b# Word boundary and match #
(?: Non capturing group
(?! Negative lookahead, assert what is on the right is not
\.gov Match .gov
) Close lookahead
\S Match a non whitespace char
)+ Close non capturing group and repeat 1+ times
(?!\S) Negative lookahead, assert what is on the right is not a non-whitespace char, to prevent partial matches
Regex demo
You could make the pattern a bit broader by matching not an # or whitespace char, then match # and then match non whitespace chars where the string .gov is not present:
[^\s#]+#(?:(?!\.gov)\S)+(?!\S)
Regex demo
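A minimal sketch of the broader pattern on a whitespace-separated list of addresses. NOT_GOV and the last sample address are made up for the demonstration; the other addresses come from the question and the answer above.

```python
import re

NOT_GOV = re.compile(r'[^\s#]+#(?:(?!\.gov)\S)+(?!\S)')

text = ("abc123#abc.c 456#email.edu test#test.gov "
        "test.test#test.org.gov.test government#mail.org")

kept = NOT_GOV.findall(text)
print(kept)
```

The lookahead only inspects what follows the #, so gov in the local part is allowed. It does reject any .gov prefix after the #, so a domain such as b.governor would also be excluded.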

Regex to ignore pattern found in quotes (Python or R)

I am trying to create a regex that allows me to find instances of a string where I have an unspaced /
eg:
some characters/morecharacters
I have come up with the expression below which allows me to find word characters or closing parenthesis before my / and word characters or open parenthesis characters afterwards.
(\w|\))/(\(|\w)
This works great for most situations, however I am coming unstuck when I have a / enclosed in quotes. In this case I'd like it to be ignored. I have seen a few different posts here and here. However, I can't quite get them to work in my situation.
What I'd like is for the first three cases identified below to match and the last case to be ignored, allowing me to extract item 1 and item 3.
some text/more text
(formula)/dividethis
divideme/(byme)
"dont match/me"
It ain't pretty, but this will do what you want:
(?<!")(?:\(|\b)[^"\n]+\/[^"\n]+(?:\)|\b)(?!")
Demo on Regex101
Let's break it down a bit:
(?<!")(?:\(|\b) will match either an open bracket or a word boundary, as long as it's not preceded by a quotation mark. It does this by employing a negative lookbehind.
[^"\n]+ will match one or more characters, as long as they're neither a quotation mark nor a line break (\n).
\/ will match a literal slash character.
Finally, (?:\)|\b)(?!") will match either a closing bracket or a word boundary as long as it's not followed by a quotation mark. It does this by employing a negative lookahead. Note that the (?:\)|\b) will only work 100% correctly in this order - if you reverse them, it'll drop the match on the bracket, because it encounters a word boundary before it gets to the bracket.
This will only match word/word which is not inside quotation marks.
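Run line by line over the four cases from the question, the pattern behaves as follows. This is a sketch only; UNQUOTED_SLASH and results are hypothetical names, and inputs that mix quoted and unquoted text on one line are not covered here.

```python
import re

UNQUOTED_SLASH = re.compile(r'(?<!")(?:\(|\b)[^"\n]+\/[^"\n]+(?:\)|\b)(?!")')

lines = [
    'some text/more text',
    '(formula)/dividethis',
    'divideme/(byme)',
    '"dont match/me"',
]

results = []
for line in lines:
    match = UNQUOTED_SLASH.search(line)
    # Store the matched text, or None when the line is rejected.
    results.append(match.group(0) if match else None)

print(results)
```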
import re
text = """
some text/more text "dont match/me" divideme/(byme)
(formula)/dividethis
divideme/(byme) "dont match/me hel d/b lo a/b" divideme/(byme)
"dont match/me"
"""
groups = re.findall(r'(?:".*?")|(\S+/\S+)', text, flags=re.MULTILINE)
# findall yields '' for the quoted (non-captured) matches; filter them out
print(list(filter(None, groups)))
Output:
['text/more', 'divideme/(byme)', '(formula)/dividethis', 'divideme/(byme)', 'divideme/(byme)']
(?:\".*?\") This will match everything inside quotes but this group won't be captured.
(\S+/\S+) This will match word/word only outside the quotations and this group will be captured.
Demo on Regex101
