How to exclude regex matches containing a constant string

I need help understanding exclusions in regex.
I begin with this in my Jupyter notebook:
import re
file = open('names.txt', encoding='utf-8')
data = file.read()
file.close()
Then I can't get my exclusions to work. The read file has 12 email strings in it, 3 of which contain '.gov'.
I was told this would return only those that are not .gov:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*.[^gov]
''', data, re.X|re.I)
It doesn't. It returns all the emails and excludes any characters in 'gov' following the '#'; e.g.:
abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
I've tried using ?! in various forms I found online to no avail.
For example, I was told the following syntax would exclude the entire match rather than just those characters:
#re.findall(r'''
# ^/(?!**SPECIFIC STRING TO IGNORE**)(**DEFINITION OF STRING TO RETURN**)$
#''', data, re.X|re.I)
Yet the following simply returns an empty list:
#re.findall(r'''
# ^/(?!\b[-+.\w\d]*#[-+.\w\d]*.gov)([-+.\w\d]*#[-+.\w\d].[\w]*[^\t\n])$
#''', data, re.X|re.I)
I tried to use the advice from this question:
Regular expression to match a line that doesn't contain a word
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*./^((?!.gov).)*$/s # based on syntax /^((?!**SUBSTRING**).)*$/s
#^ this slash is where different code starts
''', data, re.X|re.I)
This is supposed to be the inline syntax, and I think by including the slashes I may be making a mistake:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*./(?s)^((?!.gov).)*$/ # based on syntax /(?s)^((?!**SUBSTRING**).)*$/
''', data, re.X|re.I)
And this returns an empty list:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*.(?s)^((?!.gov).)*$ # based on syntax (?s)^((?!**SUBSTRING**).)*$
''', data, re.X|re.I)
Please help me understand how to use ?! or ^ or another exclusion syntax to return a specified string not containing another specified string.
Thanks!!

First, your regex for recognizing an email address does not look close to being correct. For example, it would accept #13a as being valid. See How to check for valid email address? for some simplifications. I will use: [^#]+#[^#]+\.[^#]+ with the recommendation that we also exclude space characters and so, in your particular case:
^([^#\s]+#[^#\s]+\.[^#\s.]+)
I also added a . to the last character class [^#\s.]+ to ensure that this represents the top-level domain. But we do not want the email address to end in .gov. Our regex specifies toward the end for matching the top-level domain:
1. \. Match a period.
2. [^#\s.]+ Match one or more non-white space, non-period characters.
In Step 2 above we should first apply a negative lookahead, i.e. a condition to ensure that the next characters are not gov. But to ensure we are not doing a partial match (if the top-level domain were government, that would be OK), gov must be followed by either white space or the end of the line to be disqualifying. So we have:
^([^#\s]+#[^#\s]+\.(?!gov(?:\s|$))[^#\s.]+)
import re
text = """abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
test#test.gov
test.test#test.org.gov.test
"""
print(re.findall(r'^([^#\s]+#[^#\s]+\.(?!gov(?:\s|$))[^#\s.]+)', text, flags=re.M|re.I))
Prints:
['abc123#abc.c', '456#email.edu', 'test.test#test.org.gov.test']
So, in my interpretation of the problem, test.test#test.org.gov.test is OK because gov is not the top-level domain. governmentemail#governmentaddress. is rejected simply because it is not a valid email address.
If you don't want gov in any level of the domain, then use this regex:
^([^#\s]+#(?!(?:\S*\.)?gov(?:\s|\.|$))[^#\s]+\.[^#\s]+)
After seeing the # symbol, this ensures that what follows is not gov (optionally preceded by any run of non-whitespace characters ending in a period) followed by either another period, a white space character or the end of the line.
import re
text = """abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
test#test.gov
test.test#test.org.gov.test
"""
print(re.findall(r'^([^#\s]+#(?!(?:\S*\.)?gov(?:\s|\.|$))[^#\s]+\.[^#\s]+)', text, flags=re.M|re.I))
Prints:
['abc123#abc.c', '456#email.edu']

A few notes about the patterns you tried
This part of the pattern [-+.\w\d]*\b# can be shortened to [-+.\w]*\b# because \w already matches digits. Note that \w itself does not match a dot, which is why the dot is listed explicitly.
Using [-+.\w\d]*\b# will prevent a dash from matching before the # but it could match ---a#.a
The character class [-+.\w\d]* is repeated 0+ times, but it can never actually match 0 times, as the word boundary \b will not match between a whitespace (or the start of the line) and a #
Note that not escaping the dot . will match any character except a newline
This part ^((?!.gov).)*$ is a tempered greedy token. From the start of the string, it matches any char except a newline, one at a time, each time asserting that what is on the right is not any char (except a newline) followed by gov, until the end of the string
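The notes above can be checked with a short sketch. The sample strings and variable names are made up for illustration; the point is only the difference between a negated character class, a negative lookahead, and an unescaped dot.

```python
import re

domains = "a.edu b.gov c.org"

# [^gov] is ONE character that is not g, o or v, so ".org" is rejected too
# (its first letter is "o") and ".edu" is cut off after a single letter.
char_class_hits = re.findall(r"\.[^gov]", domains)

# (?!gov\b) rejects only the whole word "gov" and consumes nothing itself.
lookahead_hits = re.findall(r"\.(?!gov\b)\w+", domains)

# An unescaped dot matches any character except a newline.
unescaped = re.findall(r"a.c", "abc a.c")
escaped = re.findall(r"a\.c", "abc a.c")

print(char_class_hits)  # ['.e']
print(lookahead_hits)   # ['.edu', '.org']
print(unescaped)        # ['abc', 'a.c']
print(escaped)          # ['a.c']
```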
One option could be to use the tempered greedy token to assert that after the # there is not .gov present.
[-+.\w]+\b#(?:(?!\.gov)\S)+(?!\S)
Explanation about the separate parts
[-+.\w]+ Match 1+ times any of the listed
\b# Word boundary and match #
(?: Non capturing group
(?! Negative lookahead, assert what is on the right is not
\.gov Match .gov
) Close lookahead
\S Match a non whitespace char
)+ Close non capturing group and repeat 1+ times
(?!\S) Negative lookahead, assert what is on the right is not a non whitespace char to prevent partial matches
You could make the pattern a bit broader by matching not an # or whitespace char, then match # and then match non whitespace chars where the string .gov is not present:
[^\s#]+#(?:(?!\.gov)\S)+(?!\S)
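A minimal sketch of how the broader pattern might be applied in Python. The sample addresses are a shortened version of the data used earlier on this page, and the variable names are illustrative. Note that this pattern rejects any address where .gov appears anywhere after the #, not only as the top-level domain.

```python
import re

text = """abc123#abc.c
456#email.edu
test#test.gov
test.test#test.org.gov.test
"""

# No capturing groups, so findall() returns the full matches.
pattern = re.compile(r"[^\s#]+#(?:(?!\.gov)\S)+(?!\S)", re.I)
kept = pattern.findall(text)

print(kept)  # ['abc123#abc.c', '456#email.edu']
```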

Related

how to write regex to accept the string which end with string

I want to write a regex which accepts this:
Accept:
done
done1
done1,done2,done3
Do not accept:
done1,
done1,done2,
I tried to write this regex
([a-zA-Z]+)?(/d)?(,)([a-zA-Z]+)
but it is not working.
What's wrong? How can I fix it?

I would phrase the regex pattern as:
(?<!\S)\w+(?:,\w+)*(?!\S)
Sample script:
import re
inp = "done done1 done1,done2,done3 done1, done1,done2,"
matches = re.findall(r'(?<!\S)\w+(?:,\w+)*(?!\S)', inp)
print(matches) # ['done', 'done1', 'done1,done2,done3']
Here is an explanation of the regex pattern:
(?<!\S) assert that what precedes is either whitespace or the start of the input
\w+ match a word
(?:,\w+)* followed by a comma and another word, zero or more times
(?!\S) assert that what follows the final word is either whitespace or the end of the input

It also depends on how you apply the regex. The regex alone (e.g. when used with re.search()) tells you whether the input contains any substring which matches your regex. In the trivial case, if you are examining one line at a time, add start and end of line anchors around your regex to force it to match the entire line.
Also, of course, notice that the regex to match a single digit is \d, not /d.
Your regex looks like you want both the alphabetics and the numbers to be optional, but the group of alphabetics and numbers to be non-empty; is that correct? One way to do that is to add a lookahead (?=[a-zA-Z\d]) before the phrase which matches both optionally.
import re
tests = """\
done
done1
done1,done2,done3
done1,
done1,done2,
"""
regex = re.compile(r'^(?=[a-zA-Z\d])[a-zA-Z]*\d?(?:,(?=[a-zA-Z\d])[a-zA-Z]*\d?)*$')
for line in tests.splitlines():
    match = regex.search(line)
    if match:
        print(line)
The individual phrases here should be easy to understand. [a-zA-Z]* matches zero or more alphabetics, and \d? matches zero or one digits. We require one of those, followed by zero or more repetitions of a comma followed by a repeat of the first expression.
Perhaps also note that [a-zA-Z\d] is almost the same as \w (the latter also matches an underscore). If you don't care about this inexactness, the expression could be simplified. It would certainly be useful in the lookahead, where the regex after it will not match an underscore anyhow. But I've left in the more complex expression just to make the code easier to follow in relation to the original example.
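If the underscore does not matter, the simplification mentioned above could look like the following sketch. It uses re.fullmatch() in place of explicit ^ and $ anchors, which is a substitution made here for brevity and not part of the original answer; the names simple and accepted are illustrative.

```python
import re

# \w+ replaces the lookahead plus [a-zA-Z]*\d? (it also accepts underscores
# and digits in any position, so it is looser than the original expression).
simple = re.compile(r"\w+(?:,\w+)*")

tests = ["done", "done1", "done1,done2,done3", "done1,", "done1,done2,"]

# fullmatch() succeeds only if the whole string matches the pattern.
accepted = [s for s in tests if simple.fullmatch(s)]

print(accepted)  # ['done', 'done1', 'done1,done2,done3']
```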
Demo: https://ideone.com/4mVGDh

How to ignore comments inside string literals

I'm doing a lexer as a part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how could we implement comments inside string literals.
Our string literals start and end with exclamation mark. e.g. !this is a string literal!
Our comments start and end with three periods. e.g. ...This is a comment...
Removing comments from string literals was relatively straightforward. Just match the string literal via /!.*!/ and remove the comment via regex. If there are three or more consecutive periods but no closing periods, throw an error.
However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.
What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?
Examples:
!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !
This is my current code that can't ignore exclamation marks inside comments:
def t_STRING_LITERAL(t):
    r'![^!\\]*(?:\\.[^!\\]*)*!'
    # remove the escape characters from the string
    t.value = re.sub(r'\\!', "!", t.value)
    # remove single line comments
    t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
    return t

Perhaps this might be another option.
Match 0+ times any character except a backslash, dot or exclamation mark using the first negated character class.
Then when you do match a character that the first character class does not match, use an alternation to match either:
repeat 0+ times matching either a dot that is not directly followed by 2 dots
or match from 3 dots to the next first match of 3 dots
or match only an escaped character
To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, then use the backreference to \1 to match.
For example
(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!
Explanation
(?<!\\)! Match ! not directly preceded by \
[^!\\.]* Match 0+ times any char except ! \ or .
(?: Non capture group
(?:\.(?!\.\.) Match a dot not directly followed by 2 dots
| Or
(?=(\.{3}.*?\.{3}))\1 Assert and capture in group 1 from ... to the nearest ...
| Or
\\. Match an escaped char
) Close group
[^!\\.]* Match 0+ times any char except ! \ or .
)*! Close non capture group and repeat 0+ times, then match !
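As a rough check, the pattern can be run against the four examples from the question. This is only a sketch: STRING_LITERAL, samples and unterminated are names chosen here, and re.fullmatch() is used to test each literal in isolation, which a real lexer rule would not do.

```python
import re

STRING_LITERAL = re.compile(
    r'(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!'
)

samples = [
    r'!Normal string!',
    r'!String with escaped \! exclamation mark!',
    r'!String with a comment ... comment ...!',
    r'!String \! with both ... comments can have unescaped exclamation marks!!!... !',
]

# Every example from the question should be accepted as one complete literal.
matched = [bool(STRING_LITERAL.fullmatch(s)) for s in samples]
print(matched)  # [True, True, True, True]

# A comment that is opened but never closed is not accepted.
unterminated = '!unterminated ... comment!'
print(STRING_LITERAL.fullmatch(unterminated))  # None

# Post-processing sketch: drop comments first, then unescape \! so that
# exclamation marks inside comments are never touched.
value = re.sub(r'\.{3}.*?\.{3}', '', samples[3])
value = re.sub(r'\\!', '!', value)
```

Removing comments before unescaping differs from the order in the question's code; it is chosen here so that the unescape step only ever sees text outside comments.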

Look at this regex to match string literals: https://regex101.com/r/v2bjWi/2.
(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!
It is surrounded by two (?<!\\)! meaning unescaped exclamation mark,
It consists of alternating escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.) and non-exclamation marks [^!].
Note that this is about as much as you can achieve with a regular expression. Any additional request, and it will not be sufficient any more.

How to remove a word if it has more than 2 occurrence of a given character in python?

I am parsing a log file which has lines like:
Pushing the logs into /var/log/my_log.txt
Pushing the logs into /opt/test/log_file.txt
There are multiple occurrences of these lines with auto-generated paths(/.../.../...)
I want to change this into a generic form like:
Pushing the logs into PATH
I tried using regex to select a word with multiple forward slashes and then replace it with the word 'PATH' as follows:
line = re.sub(r'\b([\/A-Z]*\/[A-Z]*){1,}\b',' PATH ',line)
Only the forward slashes are getting replaced but not the entire word.
Very new to this concept. Am I doing something wrong? All help is appreciated. Thanks.

You could use:
import re
line = 'Pushing the logs into /var/log/my_log.txt'
pat = r'(?<!\S)(/\S+){2,}'
line = re.sub(pat, 'PATH', line)
print(line)
This is not answering exactly as stated because it looks for "words" that must start with a / and also contain two or more / (with other non-whitespace characters following each /) -- so it would cover e.g. /tmp/my_log.txt. I think this better covers the sort of strings that you would find -- if they are absolute paths then / will always be the first character, and similarly if they are files rather than directories then the last / will not be at the end (although I haven't bothered to exclude a / at the end provided that there are also at least two before it). If you only want to look for e.g. 3 or more / (not at the end), then change the 2 to a 3, but you will miss /tmp/my_log.txt if you do that.
The first bit of the regexp (?<!\S) is a negative lookbehind assertion meaning "not preceded by a non-whitespace character", i.e. it will match at the start of a "word" or the start of the line. The next bit (/\S+) means a / followed by one or more non-whitespace characters (which could include / -- it doesn't matter so I haven't bothered to exclude these). And the {2,} means that there should be two or more of these.
(I am using "word" here as in the question, to refer to sequence of non-whitespace characters, not necessarily letters.)

Only the forward slashes are matched because the string is lower case, and the pattern matches zero or more times either a forward slash or uppercase char A-Z using [\/A-Z]*
You could make the pattern case insensitive using re.IGNORECASE but it will not match the underscore and the dot in the example data.
The first forward slash does not get matched as you start the pattern with a word boundary \b, but there is no word boundary between the space and the first forward slash.
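The three points above can be verified with a small sketch (variable names are illustrative): a word boundary does not exist between a space and a slash, and [A-Z] only covers the letters even when case is ignored.

```python
import re

line = 'Pushing the logs into /var/log/my_log.txt'

# \b needs a word character on exactly one side; a space followed by "/"
# has a non-word character on both sides, so there is no boundary there.
boundary_hit = re.search(r' \b/', line)

# A whitespace lookbehind does find the first slash.
lookbehind_hit = re.search(r'(?<!\S)/', line)

# [A-Z] does not match lowercase letters unless re.IGNORECASE is set,
# and even then "_" and "." are left out.
upper_only = re.findall(r'[A-Z]+', 'my_log.txt')
ignore_case = re.findall(r'[A-Z]+', 'my_log.txt', re.IGNORECASE)

print(boundary_hit)            # None
print(lookbehind_hit.start())  # index of the first "/"
print(upper_only)              # []
print(ignore_case)             # ['my', 'log', 'txt']
```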
A bit more specific match could be using \w to match a word character and specify the dot for the extension:
(?<!\S)(?:/\w+)+/\w+\.\w+(?!\S)
(?<!\S) Assert a whitespace boundary to the left
(?:/\w+)+ Match 1 or more times a / followed by 1+ word chars
/\w+\.\w+ Match the last / followed by a filename format using the dot and word chars
(?!\S) Assert a whitespace boundary to the right
import re
line = 'Pushing the logs into /var/log/my_log.txt'
line = re.sub(r'(?<!\S)(?:/\w+)+/\w+\.\w+(?!\S)', 'PATH', line)
print(line)
Output
Pushing the logs into PATH
A broader pattern could be matching the forward slash 2 or more times, using a negated character class to match any char except a forward slash or a newline after each one
(?<!\S)(?:/[^/\r\n]+){2,}
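A short sketch of the broader pattern applied to both log lines from the question. The names broad and normalized are illustrative. One caveat worth seeing in action: the negated class [^/\r\n] also matches spaces, so any text that follows the path on the same line is swallowed as well.

```python
import re

lines = [
    'Pushing the logs into /var/log/my_log.txt',
    'Pushing the logs into /opt/test/log_file.txt',
]

broad = re.compile(r'(?<!\S)(?:/[^/\r\n]+){2,}')
normalized = [broad.sub('PATH', line) for line in lines]
print(normalized)
# ['Pushing the logs into PATH', 'Pushing the logs into PATH']

# Caveat: a space is not a slash or a newline, so trailing words are
# treated as part of the last path segment.
greedy = broad.sub('PATH', 'copy /tmp/a.txt now')
print(greedy)  # 'copy PATH'
```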

Python regex to identify two consecutive capitalized words at the beginning of the line

I have this piece of text from which I want to remove both occurrences of each of the names, "Remggrehte Sertrro" and "Perrhhfson Forrtdd". I tried applying this regex: ([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+) but it identifies "Remggrehte Sertrro We", "Perrhhfson Forrtdd If" and also "Mash Mush" which is inside the text.
Basically I want it to only identify first two capitalized words at the beginning of the line without touching the rest. I am no regex expert and I am not sure how to adapt it.
This is the text:
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
Thanks in advance.

You can use this pattern /^([A-Z]+.*? ){2}/m if you are always certain that you are getting only two terms with capitalised first letters and always in the first two terms inline. Example working on regex101.com

You don't need the positive lookahead to match the first 2 capitalized words.
In your pattern, this part (?=\s[A-Z]) can be omitted, as you first assert it and then directly match it.
You could match the first 2 words without a capturing group and assert a whitespace boundary (?!\S) at the right
^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)
Explanation
^ Start of string (or start of each line when re.MULTILINE is used)
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
[^\S\r\n] Match a whitespace char except a newline as \s could also match a newline and you want to match two consecutive capitalized words at the beginning of the line
[A-Z][a-z]+ Match a char A-Z and 1+ lowercase chars a-z
(?!\S) Assert a whitespace boundary at the right
Note that [A-Z][a-z]+ matches only chars a-z. To match word characters you could use \w instead of [a-z] only.
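A minimal sketch of this pattern in Python, using the text from the question. It assumes re.MULTILINE so that ^ anchors at every line; the names names, found and stripped are illustrative.

```python
import re

text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""

names = re.compile(r'^[A-Z][a-z]+[^\S\r\n][A-Z][a-z]+(?!\S)', re.MULTILINE)

# Only the two capitalized words at the start of each line are found;
# "Mash Mush" in the middle of a line is left alone.
found = names.findall(text)
print(found)

# Replacing the matches with an empty string removes the leading names.
stripped = names.sub('', text)
print(stripped)
```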

You can remove the lines which only contain the names using the re.MULTILINE flag and the following regex: r"^[A-Z]\w+\s+[A-Z]\w+[^\S\n]*\n". This regex matches a line only if it consists of the two names (plus optional trailing spaces) without extra text, and it consumes the line's newline so that no blank line is left behind.
Here is a demo:
import re
text = """\
Remggrehte Sertrro
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.
"""
print(re.sub(r"^[A-Z]\w+\s+[A-Z]\w+[^\S\n]*\n", "", text, flags=re.MULTILINE))
You get:
Remggrehte Sertrro We did want a 4-day work week for years.
Perrhhfson Forrtdd If drumph does n't get sufficient testing and PPE gear , the economy Mash Mush will continue to.

unexpected result for python re.sub() with non-capturing character

I cannot understand the following output :
import re
re.sub(r'(?:\s)ff','fast-forward',' ff')
'fast-forward'
According to the documentation :
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.
So why is the whitespace included in the captured occurrence, and then replaced, since I added a non-capturing tag before it?
I would like to have the following output :
' fast-forward'

The non-capturing group still matches and consumes the matched text. Note that consuming means adding the matched text to the match value (the memory buffer allotted for the whole matched substring) and advancing the regex index accordingly. So, (?:\s) puts the whitespace into the match value, and it is replaced together with ff.
You want to use a look-behind to check for a pattern without consuming it:
re.sub(r'(?<=\s)ff','fast-forward',' ff')
An alternative to this approach is using a capturing group around the part of the pattern one needs to keep and a replacement backreference in the replacement pattern:
re.sub(r'(\s)ff',r'\1fast-forward',' ff')
         ^  ^      ^^
Here, (\s) saves the whitespace in Group 1 memory buffer and \1 in the replacement retrieves it and adds to the replacement string result.
See the Python demo:
import re
print('"{}"'.format(re.sub(r'(?<=\s)ff','fast-forward',' ff')))
# => " fast-forward"
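To see what "consuming" means in practice, this sketch inspects the match object for the non-capturing version and then compares the three substitutions side by side. Variable names are illustrative.

```python
import re

source = ' ff'

# The non-capturing group creates no group, but its text is still part of
# the overall match, which is what re.sub() replaces.
m = re.search(r'(?:\s)ff', source)
print(repr(m.group(0)))  # ' ff'
print(m.groups())        # ()

consumed = re.sub(r'(?:\s)ff', 'fast-forward', source)
lookbehind = re.sub(r'(?<=\s)ff', 'fast-forward', source)
restored = re.sub(r'(\s)ff', r'\1fast-forward', source)

print(repr(consumed))    # 'fast-forward'
print(repr(lookbehind))  # ' fast-forward'
print(repr(restored))    # ' fast-forward'
```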

A non-capturing group still matches the pattern it contains. What you wanted to express was a look-behind, which does not match its pattern but simply asserts it is present before your match.
Although, if you are to use a look-behind for whitespace, you might want to consider using a word boundary metacharacter \b instead. It matches the empty string between a \w and a \W character, asserting that your pattern is at the beginning of a word.
import re
re.sub(r'\bff\b', 'fast-forward', ' ff') # ' fast-forward'
Adding a trailing \b will also make sure that you only match 'ff' as a whole word, not at the beginning of a longer word such as 'ffoo'.
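A small sketch contrasting the word-boundary version with the whitespace look-behind on a made-up input. The two are not interchangeable: \b also matches at the start of the string and next to punctuation, while (?<=\s) requires an actual whitespace character before the match.

```python
import re

sample = 'ff ffoo off'

# \bff\b matches only the standalone word, including at the string start.
with_boundary = re.sub(r'\bff\b', 'fast-forward', sample)

# (?<=\s)ff needs whitespace before "ff", so it skips the first word and
# rewrites the start of "ffoo" instead.
with_lookbehind = re.sub(r'(?<=\s)ff', 'fast-forward', sample)

# Punctuation counts as a non-word character, so \b works there too.
in_parens = re.sub(r'\bff\b', 'fast-forward', '(ff)')

print(with_boundary)    # 'fast-forward ffoo off'
print(with_lookbehind)  # 'ff fast-forwardoo off'
print(in_parens)        # '(fast-forward)'
```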
