What is the purpose of .* in a Python lookahead regex?

I am learning about regular expressions, and I found an interesting and helpful page on using them for password input validation here. The question I have is about the .* in the following expression:
"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$"
I understand that .* is a wildcard character representing any amount of text (or no text) but I'm having trouble wrapping my head around its purpose in these lookahead expressions. Why are these necessary in order to make these lookaheads function as needed?

Lookahead means direct lookahead: the assertion is tested against the text immediately after the current position. So if you write:
(?=a)
it means that the first character should be a. Sometimes, for instance with password checking, you do not want that. You want to express that somewhere there should be an a. So:
(?=.*a)
means that the first character can be, for instance, a b, 8 or #, but that eventually there should be an a somewhere.
Your regex thus means:
^ # start a match at the beginning of the string
(?=.*[a-z]) # should contain at least one a-z character
(?=.*[A-Z]) # should contain at least one A-Z character
(?=.*\d) # should contain at least one digit
[a-zA-Z\d]{8,} # consists of 8 or more characters and only A-Za-z0-9
$ # end the match at the end of the string
Without the .*, there could never be a match, since:
"^(?=[a-z])(?=[A-Z])(?=\d)[a-zA-Z\d]{8,}$"
means:
^ # start a match at the beginning of the string
(?=[a-z]) # first character should be an a-z character
(?=[A-Z]) # first character should be an A-Z character
(?=\d) # first character should be a digit
[a-zA-Z\d]{8,} # consists of 8 or more characters and only A-Za-z0-9
$ # end the match at the end of the string
Since no character is a lowercase letter, an uppercase letter and a digit at the same time, this can never be satisfied.
Side notes:
we do not capture in the lookahead, so greediness does not matter;
the dot . by default does not match the newline character;
even if it did, the fact that you have a constraint ^[A-Za-z0-9]{8,}$ means that you would only validate input with no newline.
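
To see the effect of .* in practice, here is a minimal sketch that compiles the pattern from the question next to the variant without .*. The sample passwords and the names WITH_DOT_STAR, WITHOUT_DOT_STAR, samples, results and any_match are made up for illustration.

```python
import re

# Pattern from the question: each lookahead scans ahead from position 0.
WITH_DOT_STAR = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$")
# Same pattern without .*: all three lookaheads test the very first character.
WITHOUT_DOT_STAR = re.compile(r"^(?=[a-z])(?=[A-Z])(?=\d)[a-zA-Z\d]{8,}$")

# Hypothetical sample inputs.
samples = ["Passw0rdX", "password1", "PASSWORD1", "Pw0rd", "Passw0rd!"]

results = {s: bool(WITH_DOT_STAR.search(s)) for s in samples}
any_match = any(WITHOUT_DOT_STAR.search(s) for s in samples)

for s, ok in results.items():
    print(f"{s!r}: {ok}")
print("variant without .* matched anything:", any_match)
```

Only "Passw0rdX" passes the first pattern; the second pattern rejects every sample because its three lookaheads contradict each other.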

Related

How to exclude regex matches containing a constant string

I need help understanding exclusions in regex.
I begin with this in my Jupyter notebook:
import re
with open('names.txt', encoding='utf-8') as file:
    data = file.read()
Then I can't get my exclusions to work. The read file has 12 email strings in it, 3 of which contain '.gov'.
I was told this would return only those that are not .gov:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*.[^gov]
''', data, re.X|re.I)
It doesn't. It returns all the emails and excludes any characters in 'gov' following the '#'; e.g.:
abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
I've tried using ?! in various forms I found online to no avail.
For example, I was told the following syntax would exclude the entire match rather than just those characters:
#re.findall(r'''
# ^/(?!**SPECIFIC STRING TO IGNORE**)(**DEFINITION OF STRING TO RETURN**)$
#''', data, re.X|re.I)
Yet the following simply returns an empty list:
#re.findall(r'''
# ^/(?!\b[-+.\w\d]*#[-+.\w\d]*.gov)([-+.\w\d]*#[-+.\w\d].[\w]*[^\t\n])$
#''', data, re.X|re.I)
I tried to use the advice from this question:
Regular expression to match a line that doesn't contain a word
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*./^((?!.gov).)*$/s # based on syntax /^((?!**SUBSTRING**).)*$/s
#^ this slash is where different code starts
''', data, re.X|re.I)
This is supposed to be the inline syntax, and I think by including the slashes I may be making a mistake:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*./(?s)^((?!.gov).)*$/ # based on syntax /(?s)^((?!**SUBSTRING**).)*$/
''', data, re.X|re.I)
And this returns an empty list:
re.findall(r'''
[-+.\w\d]*\b#[-+\w\d]*.(?s)^((?!.gov).)*$ # based on syntax (?s)^((?!**SUBSTRING**).)*$
''', data, re.X|re.I)
Please help me understand how to use ?! or ^ or another exclusion syntax to return a specified string not containing another specified string.
Thanks!!

First, your regex for recognizing an email address does not look close to being correct. For example, it would accept #13a as being valid. See How to check for valid email address? for some simplifications. I will use: [^#]+#[^#]+\.[^#]+ with the recommendation that we also exclude space characters and so, in your particular case:
^([^#\s]+#[^#\s]+\.[^#\s.]+)
I also added a . to the last character class [^#\s.]+ to ensure that this represents the top-level domain. But we do not want the email address to end in .gov. Our regex specifies toward the end for matching the top-level domain:
1. \. Match a period.
2. [^#\s.]+ Match one or more characters that are not #, white space or a period.
In Step 2 above we should first apply a negative lookahead, i.e. a condition to ensure that the next characters are not gov. But to ensure we are not doing a partial match (if the top-level domain were government, that would be OK), gov must be followed by either white space or the end of the line to be disqualifying. So we have:
^([^#\s]+#[^#\s]+\.(?!gov(?:\s|$))[^#\s.]+)
import re
text = """abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
test#test.gov
test.test#test.org.gov.test
"""
print(re.findall(r'^([^#\s]+#[^#\s]+\.(?!gov(?:\s|$))[^#\s.]+)', text, flags=re.M|re.I))
Prints:
['abc123#abc.c', '456#email.edu', 'test.test#test.org.gov.test']
So, in my interpretation of the problem, test.test#test.org.gov.test is OK because gov is not the top-level domain. governmentemail#governmentaddress. is rejected simply because it is not a valid email address.
If you don't want gov in any level of the domain, then use this regex:
^([^#\s]+#(?!(?:\S*\.)?gov(?:\s|\.|$))[^#\s]+\.[^#\s]+)
After seeing the # symbol, the negative lookahead ensures that what follows is not gov (optionally preceded by any run of non-white space characters ending in a period) followed by either another period, a white space character or the end of the line.
import re
text = """abc123#abc.c # 'o' is in 'gov' so it ends the returned string there
456#email.edu
governmentemail#governmentaddress. #'.gov' omitted
test#test.gov
test.test#test.org.gov.test
"""
print(re.findall(r'^([^#\s]+#(?!(?:\S*\.)?gov(?:\s|\.|$))[^#\s]+\.[^#\s]+)', text, flags=re.M|re.I))
Prints:
['abc123#abc.c', '456#email.edu']

A few notes about the patterns you tried
This part of the pattern [-+.\w\d]*\b# can be shortened to [-+.\w]*\b# as \w also matches \d and note that it will also not match a dot
Using [-+.\w\d]*\b# will prevent a dash from matching before the # but it could match ---a#.a
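The character class [-+.\w\d]* is quantified with * (zero or more times), but it can never actually match zero characters, because the word boundary \b cannot match between whitespace (or the start of the line) and a #; a word character must come directly before the #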
Note that not escaping the dot . will match any character except a newline
This part ^((?!.gov).)*$ is a tempered greedy token: starting at the start of the string, it matches any char except a newline, one at a time, each time first asserting that what is on the right is not any char except a newline followed by gov, and it repeats this until the end of the string
One option could be to use the tempered greedy token to assert that after the # there is not .gov present.
[-+.\w]+\b#(?:(?!\.gov)\S)+(?!\S)
Explanation about the separate parts
[-+.\w]+ Match 1+ times any of the listed
\b# Word boundary and match #
(?: Non capturing group
(?! Negative lookahead, assert what is on the right is not
\.gov Match .gov
) Close lookahead
\S Match a non whitespace char
)+ Close non capturing group and repeat 1+ times
(?!\S) Negative lookahead, assert what is on the right is not a non whitespace char to prevent partial matches
You could make the pattern a bit broader by matching not an # or whitespace char, then match # and then match non whitespace chars where the string .gov is not present:
[^\s#]+#(?:(?!\.gov)\S)+(?!\S)
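
As a rough illustration of the tempered greedy token, the sketch below runs the broader pattern over a made-up string of addresses; the sample text and the names text, pattern and matches are assumptions, not data from the question.

```python
import re

# Hypothetical sample data: whitespace-separated addresses using # as separator.
text = "abc123#abc.com 456#email.edu test#test.gov test.test#test.org.gov.test"

# (?:(?!\.gov)\S)+ consumes one non-whitespace character at a time, but only
# while the text at the current position does not start with ".gov".
# (?!\S) then requires that the match ends at whitespace or end of string.
pattern = re.compile(r"[^\s#]+#(?:(?!\.gov)\S)+(?!\S)")

matches = pattern.findall(text)
print(matches)  # ['abc123#abc.com', '456#email.edu']
```

Because the lookahead only checks for the prefix .gov, this pattern would also reject a domain such as b.government.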

Why doesn't this regex pattern work as intended?

I needed a regex pattern to catch any 16 digit string of numbers (each four number group separated by a hyphen) without any number being repeated more than 3 times, with or without hyphens in between.
So the pattern I wrote is
a=re.compile(r'(?!(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)')
But the example "5133-3367-8912-3456" gets matched even when 3 is repeated 4 times. (What is the problem with the negative lookahead section?)

Lookaheads only do the check at the position they are at, so in your case at the start of the string. If you want a lookahead to check the whole string for whether a certain pattern can or can't be matched, you can add .* in front to make it look deeper into the string.
In your case, you could change it to r'(?!.*(\d)\-?\1\-?\1\-?\1)(^\d{4}\-?\d{4}\-?\d{4}\-?\d{4}$)'.
There is also no need to escape the hyphens, since they appear outside a character class, and I would move the lookahead right after the ^. I don't know how well Python regexes are optimized, but that way the start-of-string anchor is matched first (only 1 valid position) instead of checking the lookahead at every position just to fail the match at ^. This would give r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)'
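
A quick check of both versions, as a sketch: original is the pattern from the question (written with \d{4} at the start, which is presumably what was meant, and without the redundant escapes), fixed is the final pattern above. The variable names and the second test number are made up for illustration.

```python
import re

# Lookahead anchored only at the current position (the question's approach).
original = re.compile(r'(?!(\d)-?\1-?\1-?\1)(^\d{4}-?\d{4}-?\d{4}-?\d{4}$)')
# Lookahead prefixed with .* so it can find a repeated digit anywhere.
fixed = re.compile(r'^(?!.*(\d)-?\1-?\1-?\1)(\d{4}-?\d{4}-?\d{4}-?\d{4}$)')

card = "5133-3367-8912-3456"  # contains 3 four times in a row (33-33)

print(bool(original.search(card)))                # True: only "5..." is checked
print(bool(fixed.search(card)))                   # False: the run of 3s is found
print(bool(fixed.search("5123-4567-8912-3456")))  # True: no digit repeats 4 times
```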

Python regular expressions acting strangely

import re
url = "http://www.domain.com/7464535"
match = re.search(r'\d*',url)
match.group(0)
returns '' <----- empty string
but
url = "http://www.domain.com/7464535"
match = re.search(r'\d+',url)
match.group(0)
returns '7464535'
I thought '+' was supposed to be 1 or more and '*' was 0 or more correct? And RE is supposed to be greedy. So why don't they both return the same thing and more importantly why does the 1st one return nothing?

You are correct about the meanings of + and *. So \d* will match zero or more digits — and that's exactly what it's doing. Starting at the beginning of the string, it matches zero digits, and then it's done. It successfully matched zero or more digits.
* is greedy, but that only means that it will match as many digits as it can at the place where it matches. It won't give up a match to try to find a longer one later in the string.
Edit: A more detailed description of what the regex engine does:
Take the case where our string to search is "http://www.domain.com/7464535" and the pattern is \d+.
In the beginning, the regex engine is pointing at the beginning of our URL and the beginning of the regex pattern. \d+ needs to match one or more digits, so first the regex engine must find at least one digit to have a successful match.
The first place it looks it finds an 'h' character. That's not a digit, so it moves on to the 't', then the next 't', and so on until it finally reaches the '7'. Now we've matched one digit, so the "one or more" requirement is satisfied and we could have a successful match, except + is greedy so it will match as many digits as it can without changing the starting point of the match, the '7'. So it hits the end of the string and matches that whole number '7464535'.
Now consider if our pattern was \d*. The only difference now is that zero digits is a valid match. Since regex matches left-to-right, the first place \d* will match is the very start of the string. So we have a zero-length match at the beginning, but since * is greedy, it will extend the match as long as there are digits. Since the first thing we find is 'h', a non-digit, it just returns the zero-length match.
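
The walk-through above can be confirmed by looking at where each match starts and ends. This is only a small sketch using the URL from the question; the names star and plus are arbitrary.

```python
import re

url = "http://www.domain.com/7464535"

star = re.search(r'\d*', url)
plus = re.search(r'\d+', url)

# \d* succeeds immediately at index 0 with a zero-length match.
print(star.span(), repr(star.group(0)))  # (0, 0) ''
# \d+ has to keep scanning until it reaches the first digit.
print(plus.span(), repr(plus.group(0)))  # (22, 29) '7464535'
```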
How is * even useful, then, if it will just give you a zero-length match? Consider if I was matching a config file like this:
foo: bar
baz: quux
blah:blah
I want to allow any amount of spaces (even zero) after the colon. I would use a regex like (\w+):\s*(\w+) where \s* matches zero or more spaces. Since it occurs after the colon in the pattern, it will match just after the colon in the string and then either match a zero-length string (as in the third line blah:blah because the 'b' after the colon ends the match) or all the spaces there are before the next non-space, because * is greedy.
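
As a sketch of that config-file case, the snippet below applies the regex to three made-up lines with one, several and zero spaces after the colon; the name config and the exact spacing are assumptions for the example.

```python
import re

# Hypothetical config text: different amounts of whitespace after each colon.
config = "foo: bar\nbaz:    quux\nblah:blah\n"

# \s* sits after the colon in the pattern, so it is only tried there and
# matches however many spaces follow, including none.
pairs = re.findall(r'(\w+):\s*(\w+)', config)
print(pairs)  # [('foo', 'bar'), ('baz', 'quux'), ('blah', 'blah')]
```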

Match the same character an exact number of times with regular expressions

I'm trying to use Python's re to find the same letter or number repeated a specific number of times. (.) works just fine for identifying what will be repeated, but I cannot find how to keep it from matching a run of different characters. Here is what I have:
re.search(r'(.){n}', str)
so for example it would match 9999 from 99997 if n = 4, but not if n = 3.
thanks

How about
(?:^|(?<=(.)))(?!\1)(.)\2{n-1}(?!\2)
This will:
(?:^|(?<=(.))): Make sure that:
^: Either we are at the beginning of the string
(?<=(.)): Or we are not at the beginning of the string; then, capture the character before the match and save it into \1
(?!\1)(.): Match any character that is not \1 and save it into \2
\2{n-1}: Match \2 n-1 times
(?!\2): Make sure \2 cannot be matched looking forward
(The n-1 is only symbolic; obviously you want to replace this with the actual value of n-1, not with 8-1 or something).
Important edit: The previous version of the regex ((.)\1{n-1}(?!\1)) does not work because it fails to account for a character matching \1 immediately before the match. The regex above fixes this problem.
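
One possible way to plug a concrete n into the pattern is sketched below. The helper name find_exact_run is hypothetical, and the sketch relies on Python's re treating a backreference to a group that did not participate (here \1 at the start of the string) as a failed match, so that (?!\1) succeeds there.

```python
import re

def find_exact_run(text, n):
    """Return the first run of exactly n identical characters, or None."""
    # Substitute the numeric value of n-1 into the pattern from the answer.
    pattern = r'(?:^|(?<=(.)))(?!\1)(.)\2{%d}(?!\2)' % (n - 1)
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(find_exact_run("99997", 4))   # 9999
print(find_exact_run("99997", 3))   # None
print(find_exact_run("x9999y", 4))  # 9999
```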

Python, regex negative lookbehind behavior

I have a regular expression that should find up to 10 words in a line. That is, it should include the word just preceding the line feed but not the words after the line feed. I am using a negative lookbehind with "\n".
a = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
r = a.search("THe car is parked in the garage\nBut the sun is shining hot.")
When I execute this regex and call the method r.group(), I am getting back the whole sentence but the last word that contains a period. I was expecting only the complete string preceding the new line. That is, "THe car is parked in the garage\n".
What is the mistake that I am making here with the negative look behind...?

I don't know why you would use a negative lookbehind. You are saying that you want a maximum of 10 words before a linefeed. The regex below should work. It uses a positive lookahead to ensure that there is a linefeed after the words. Also, when searching for words use `\b\w+\b` instead of what you were using.
/(\b\w+\b)*(?=.*\\n)/
Python :
result = re.findall(r"(\b\w+\b)*(?=.*\\n)", subject)
Explanation :
# (\b\w+\b)*(?=.*\\n)
#
# Match the regular expression below and capture its match into backreference number 1 «(\b\w+\b)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*\\n)»
# Match any single character that is not a line break character «.*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the character “\” literally «\\»
# Match the character “n” literally «n»
You may also wish to consider the fact that there could be no \n in your string.

If I read you right, you want to read up to 10 words, or the first newline, whichever comes first:
((?:(?<!\n)\w+\b[\s.]*){0,10})
This uses a negative lookbehind, but just before the word match, so it blocks getting any word after a newline.
This will need some tuning for imperfect input, but it's a start.
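
A short sketch of this pattern applied to the sentence from the question; the variable names are arbitrary.

```python
import re

text = "THe car is parked in the garage\nBut the sun is shining hot."

# (?<!\n) is checked before every word, so a word that directly follows
# a newline cannot start a new repetition.
pattern = re.compile(r"((?:(?<!\n)\w+\b[\s.]*){0,10})")

first_line = pattern.search(text).group(1)
print(repr(first_line))  # 'THe car is parked in the garage\n'
```

The trailing newline is part of the result because [\s.]* consumes it after the last word.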

For this task there is the anchor $ to find the end of the string, and together with the modifier re.MULTILINE/re.M it will find the end of the line. So you would end up with something like this:
(\b\w+\b[.\s /]{0,2}){0,10}$
The \b is a word boundary. I included [.\s /]{0,2} to match a dot followed by a whitespace in my example. If you don't want the dots you need to make this part at least optional like this [\s /]? otherwise it will be missing at the last word and then the \s is matching the \n.
Update/Idea 2
OK, maybe I misunderstood your question with my first solution.
If you just want to not match a newline and continue in the second row, then just don't allow it. The problem is that the newline is matched by the \s in your character class. The \s is a class for whitespace and this includes also the newline characters \r and \n
You already have a space in the class, so just replace the \s with \t in case you want to allow tabs, and then you should be fine without a lookbehind. And of course, make the character class optional, otherwise the last word will not be matched either.
((\w)+[\t /]?){0,10}

I think you shouldn't be using a lookbehind at all. If you want to match up to ten words not including a newline, try this:
\S+(?:[ \t]+\S+){0,9}
A word is defined here as one or more non-whitespace characters, which includes periods, apostrophes, and other sentence punctuation as well as letters. If you know the text you're matching is regular prose, there's no point limiting yourself to \w+, which isn't really meant to match natural-language words anyway.
After the first word, it repeatedly matches one or more horizontal whitespace characters (space or TAB) followed by another word, for a maximum of ten words. If it encounters a newline before the tenth word, it simply stops matching at that point. There's no need to mention newlines in the regex at all.
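
To illustrate, here is a minimal sketch that runs this regex on the sentence from the question and on a made-up line with more than ten words; ten_words, text and long_line are names chosen for the example.

```python
import re

ten_words = re.compile(r"\S+(?:[ \t]+\S+){0,9}")

text = "THe car is parked in the garage\nBut the sun is shining hot."
# The newline is not in [ \t], so the match stops at the end of the first line.
print(ten_words.search(text).group())  # THe car is parked in the garage

# Hypothetical longer line: the match is capped at ten words.
long_line = "one two three four five six seven eight nine ten eleven twelve"
print(ten_words.search(long_line).group())
```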
