Python regular expressions acting strangely - python

url = "http://www.domain.com/7464535"
match = re.search(r'\d*',url)
match.group(0)
returns '' <----- empty string
but
url = "http://www.domain.com/7464535"
match = re.search(r'\d+',url)
match.group(0)
returns '7464535'
I thought '+' meant 1 or more and '*' meant 0 or more, correct? And regular expressions are supposed to be greedy. So why don't they both return the same thing and, more importantly, why does the first one return nothing?

You are correct about the meanings of + and *. So \d* will match zero or more digits — and that's exactly what it's doing. Starting at the beginning of the string, it matches zero digits, and then it's done. It successfully matched zero or more digits.
* is greedy, but that only means that it will match as many digits as it can at the place where it matches. It won't give up a match to try to find a longer one later in the string.
Edit: A more detailed description of what the regex engine does:
Take the case where our string to search is "http://www.domain.com/7464535" and the pattern is \d+.
In the beginning, the regex engine is pointing at the beginning of our URL and the beginning of the regex pattern. \d+ needs to match one or more digits, so first the regex engine must find at least one digit to have a successful match.
The first place it looks it finds an 'h' character. That's not a digit, so it moves on to the 't', then the next 't', and so on until it finally reaches the '7'. Now we've matched one digit, so the "one or more" requirement is satisfied and we could have a successful match, except + is greedy so it will match as many digits as it can without changing the starting point of the match, the '7'. So it hits the end of the string and matches that whole number '7464535'.
Now consider if our pattern was \d*. The only difference now is that zero digits is a valid match. Since regex matches left-to-right, the first place \d* will match is the very start of the string. So we have a zero-length match at the beginning, but since * is greedy, it will extend the match as long as there are digits. Since the first thing we find is 'h', a non-digit, it just returns the zero-length match.
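One way to see the difference is to print where each match starts and ends, not just its text. The following is only a small sketch using the URL from the question; the variable names star and plus are made up for illustration.

```python
import re

url = "http://www.domain.com/7464535"

star = re.search(r'\d*', url)  # zero or more digits
plus = re.search(r'\d+', url)  # one or more digits

# \d* succeeds immediately at index 0 with a zero-length match.
print(repr(star.group(0)), star.span())  # '' (0, 0)

# \d+ has to scan forward until it reaches the first digit.
print(repr(plus.group(0)), plus.span())  # '7464535' (22, 29)
```

Both quantifiers are greedy; the only difference is the leftmost position at which each pattern is able to succeed.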
How is * even useful, then, if it will just give you a zero-length match? Consider if I was matching a config file like this:
foo: bar
baz: quux
blah:blah
I want to allow any amount of spaces (even zero) after the colon. I would use a regex like (\w+):\s*(\w+) where \s* matches zero or more spaces. Since it occurs after the colon in the pattern, it will match just after the colon in the string and then either match a zero-length string (as in the third line blah:blah because the 'b' after the colon ends the match) or all the spaces there are before the next non-space, because * is greedy.
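As a rough illustration of the config example, the sketch below runs that pattern over the three sample lines. The list name config_lines and the line-by-line loop are assumptions made for the example, not part of the original answer.

```python
import re

# The three sample lines from above, held in a hypothetical list.
config_lines = ["foo: bar", "baz: quux", "blah:blah"]

pattern = re.compile(r'(\w+):\s*(\w+)')

pairs = []
for line in config_lines:
    m = pattern.search(line)
    if m:
        # \s* consumed one space, or nothing at all for "blah:blah".
        pairs.append(m.groups())

print(pairs)  # [('foo', 'bar'), ('baz', 'quux'), ('blah', 'blah')]
```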

How to not capture a group in regex if it is followed by an another group

If I have a string, e.g. 'hcto,231' or 'hcto.12', I want to be able to capture 'o,231' or 'o.12' and process it as a number ('hct' is random and any other string can replace it).
But I don't want to capture anything if the 'o' character is followed by a decimal number, e.g. 'wordo.23.12' or 'wordo,23,12'.
I've tried using the following regex:
([oO][.,][0-9]+)(?!([.,][0-9]+))
but it always matches.
In the string 'hcto.22.23' it matches 'o.2', but I don't want it to match anything. Is there a way to combine groups so it won't match if the negative lookahead is true?
The match occurs in hcto.22.23 because the lookahead triggers backtracking, and since [0-9]+ can match a single 2 (it does not have to match 22), the match succeeds and returns a smaller, unexpected match: o.2.
It seems the simplest way to fix the current issue is to make the dot or comma pattern in the lookahead optional, and remove unnecessary groups:
[oO][.,]\d+(?![.,]?\d)
See the regex demo.
Details
[oO] - o or O
[.,] - a dot or comma
\d+ - one or more digits
(?![.,]?\d) - not immediately followed by an optional . or , and then a digit (that is, neither by a separator plus a digit nor by a bare digit).
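To compare the two patterns side by side, here is a short sketch that runs both over the strings from the question. The helper first_match is a hypothetical name introduced only for this example.

```python
import re

original = re.compile(r'([oO][.,][0-9]+)(?!([.,][0-9]+))')
fixed = re.compile(r'[oO][.,]\d+(?![.,]?\d)')

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None

for text in ['hcto,231', 'hcto.12', 'hcto.22.23', 'wordo,23,12']:
    print(text, first_match(original, text), first_match(fixed, text))
# hcto,231 o,231 o,231
# hcto.12 o.12 o.12
# hcto.22.23 o.2 None
# wordo,23,12 o,2 None
```

With the bare digit also rejected by the lookahead, backtracking to a shorter run of digits no longer rescues the match.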

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx 101 Explanation
/(?!.*-)(?=.*\d)^.+$/gm
Negative Lookahead (?!.*-)
Assert that the Regex below does not match
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
Assert that the Regex below matches
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
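Since the question mentions compiling the pattern and using re.search on individual words, the following sketch applies the expression that way to the sample words from the question. The list name words is an assumption for the example, and no multiline flag is used because each word is tested on its own.

```python
import re

pattern = re.compile(r'(?!.*-)(?=.*\d)^.+$')

# Sample words taken from the question.
words = ['1DRIVE', 'ID100', 'W1RELESS', 'STACK', 'OVERFLOW', 'Test-11', '24-hours']

matched = [w for w in words if pattern.search(w)]
print(matched)  # ['1DRIVE', 'ID100', 'W1RELESS']
```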
I came up with:
^[^-]*\d[^-]*$
So we need at LEAST one digit (\d).
We need the rest of the string to contain anything BUT a - ([^-]).
We can have an unlimited number of those characters, so [^-]*.
But putting them together like [^-]*\d would go wrong on aaa3-, because the - comes after a valid match. Let's make sure no dashes can sneak in before or after our match: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says: start, any number of chars that aren't dashes, a digit, any number of chars that aren't dashes, end.
Depending on style, we could also do ^([^-]*\d)+[^-]*$: since the order of the digits and other characters doesn't matter, we can have as many of those groups as we want, followed by any trailing non-dash characters.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if "-" not in text and re.search(r"\d", text):
    ...  # text has no hyphen and contains at least one digit
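As a sanity check, the sketch below compares the single regex ^[^-]*\d[^-]*$ with the plain-Python test on a handful of words. The function name is_valid and the samples list are hypothetical, chosen only for this illustration.

```python
import re

pattern = re.compile(r'^[^-]*\d[^-]*$')

def is_valid(text):
    """Plain-Python check: no hyphen and at least one digit."""
    return "-" not in text and re.search(r"\d", text) is not None

samples = ['1DRIVE', 'ID100', 'W1RELESS', 'aaa555D',
           'STACK', 'Test-11', '24-hours', 'aaa3-']

for text in samples:
    # The two columns should agree for every sample.
    print(text, bool(pattern.search(text)), is_valid(text))
```

On these samples both approaches accept the first four words and reject the last four.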

What is the purpose of .* in a Python lookahead regex?

I am learning about regular expressions, and I found an interesting and helpful page on using them for password input validation here. The question I have is about the .* in the following expression:
"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$"
I understand that .* is a wildcard character representing any amount of text (or no text) but I'm having trouble wrapping my head around its purpose in these lookahead expressions. Why are these necessary in order to make these lookaheads function as needed?
Lookahead means direct lookahead. So if you write:
(?=a)
it means that the first character should be a. Sometimes, for instance with password checking, you do not want that. You want to express that somewhere there should be an a. So:
(?=.*a)
means that the first character can for instance be a b, 8 or #. But that eventually there should be an a somewhere.
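A tiny sketch of that difference, using the made-up input 'bca' and hypothetical variable names:

```python
import re

# (?=a) only inspects the character at the current position (here: 'b').
direct = re.match(r'(?=a)', 'bca')
print(direct)  # None

# (?=.*a) lets the lookahead skip over any characters before finding an a.
anywhere = re.match(r'(?=.*a)', 'bca')
print(anywhere.span())  # (0, 0): the lookahead itself consumes nothing
```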
Your regex thus means:
^ # start a match at the beginning of the string
(?=.*[a-z]) # should contain at least one a-z character
(?=.*[A-Z]) # should contain at least one A-Z character
(?=.*\d) # should contain at least one digit
[a-zA-Z\d]{8,} # consists of 8 or more characters and only A-Za-z0-9
$ # end the match at the end of the string
Without the .*, there could never be a match, since:
"^(?=[a-z])(?=[A-Z])(?=\d)[a-zA-Z\d]{8,}$"
means:
^ # start a match at the beginning of the string
(?=[a-z]) # first character should be an a-z character
(?=[A-Z]) # first character should be an A-Z character
(?=\d) # first character should be a digit
[a-zA-Z\d]{8,} # consists of 8 or more characters and only A-Za-z0-9
$ # end the match at the end of the string
Since there is no character that is both an A-Z character and a digit at the same time, this would never be satisfied.
Side notes:
we do not capture in the lookahead, so greediness does not matter;
the dot . by default does not match the newline character;
even if it did, the constraint ^[A-Za-z0-9]{8,}$ means that you would only validate input with no newline.
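To make the comparison concrete, here is a sketch that tries both versions of the expression on a few invented passwords. The sample strings and variable names are assumptions for illustration only.

```python
import re

with_dot_star = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$')
without_dot_star = re.compile(r'^(?=[a-z])(?=[A-Z])(?=\d)[a-zA-Z\d]{8,}$')

# Made-up sample passwords.
samples = ['Passw0rdX', 'password1', 'PASSWORD1', 'Pass1', 'Passw0rd!']

accepted = [s for s in samples if with_dot_star.search(s)]
never = [s for s in samples if without_dot_star.search(s)]

print(accepted)  # ['Passw0rdX']
print(never)     # []
```

Only the first sample has a lowercase letter, an uppercase letter, a digit, at least 8 characters, and nothing outside A-Za-z0-9. The version without .* rejects everything, because its three lookaheads all constrain the same first character.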

Python regex * matches occurrences only in the starting of the string

When I use regex p* on string blackpink it returns the empty string as a match even though p is inside the string.
When I use the same regex p* on string pinkpink then it matches and returns p, indicating it's matching only at the start of the string even though I have not specified anything of the kind.
The peculiar behavior is that, when I use p+ on string pink and blackpink, in both cases it returns p , indicating it does not care if the match is in the beginning or inside a string.
Can anyone explain this?
There are three important things to understand here:
First, p* matches zero or more, while p+ matches one or more.
Second, you will get the first match, no matter if that match is an empty string or not.
Third, quantifiers are greedy by default, so once the engine has found where the first match starts, it will include as many p's as possible.
So, as a result of this,
p* on blackpink matches the zero p at the very beginning of the string, that is ''.
p* on pinkpink matches the first p (not the second).
p+ on blackpink matches the sixth letter, the p, since the empty string is no longer a match because of the +.
p+ on pinkpink matches the first p.
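The four cases above can be checked with a short loop. This sketch assumes re.search is being used; the results dictionary is a hypothetical helper for collecting the output.

```python
import re

results = {}
for pattern in ('p*', 'p+'):
    for text in ('blackpink', 'pinkpink'):
        m = re.search(pattern, text)
        results[(pattern, text)] = (m.group(0), m.span())
        print(pattern, text, repr(m.group(0)), m.span())
# p* blackpink '' (0, 0)
# p* pinkpink 'p' (0, 1)
# p+ blackpink 'p' (5, 6)
# p+ pinkpink 'p' (0, 1)
```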
I think you're using re.match to find your pattern's matches. As you can see from the docs:
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
emphasis mine
Since p* means zero or more p characters, matched greedily, the starting point of the string blackpink is just an empty string, '', which satisfies your pattern. In fact, the pattern p* can successfully match the empty (0-length) string at any position in the text where no p follows.
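Whether re.match or re.search was used is not stated in the question, so the following is only a sketch of how the two differ on blackpink. Note that with re.match, p+ finds nothing at all, because re.match only tries position 0.

```python
import re

# re.match only tries position 0; re.search scans forward through the string.
match_plus = re.match(r'p+', 'blackpink')
search_plus = re.search(r'p+', 'blackpink')
match_star = re.match(r'p*', 'blackpink')

print(match_plus)          # None
print(search_plus.span())  # (5, 6)
print(match_star.span())   # (0, 0)

# Dropping the zero-length matches leaves only the real runs of p.
runs = [m for m in re.findall(r'p*', 'blackpink') if m]
print(runs)  # ['p']
```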

Match the same character an exact number of times with regular expressions

I'm trying to use python re to find a set of the same letter or number repeated a specific number of times. (.) works just fine for identifying what will be repeated, but I cannot find how to keep it from just repeating different characters. here is what I have:
re.search(r'(.){n}', str)
so for example it would match 9999 from 99997 if n = 4, but not if n = 3.
thanks
How about
(?:^|(?<=(.)))(?!\1)(.)\2{n-1}(?!\2)
This will:
(?:^|(?<=(.))): Make sure that:
^: Either we are at the beginning of the string
(?<=(.)): Or we are not at the beginning of the string; then, capture the character before the match and save it into \1
(?!\1)(.): Match any character that is not \1 and save it into \2
\2{n-1}: Match \2 n-1 times
(?!\2): Make sure \2 cannot be matched looking forward
(The n-1 is only symbolic; obviously you want to replace this with the actual value of n-1, not with 8-1 or something).
Important edit: The previous version of the regex ((.)\1{n-1}(?!\1)) does not work because it fails to account for a character matching \1 immediately before the match. The regex above fixes this problem.
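Because n-1 has to be substituted as a literal number, one option is to build the pattern in a small helper. This is a sketch only: the function name exact_run is hypothetical, and it assumes n is at least 1.

```python
import re

def exact_run(n):
    """Compile the regex above for a run of exactly n identical characters."""
    # %d is replaced with the literal value of n - 1 inside the {...} quantifier.
    return re.compile(r'(?:^|(?<=(.)))(?!\1)(.)\2{%d}(?!\2)' % (n - 1))

four = exact_run(4).search('99997')
three = exact_run(3).search('99997')

print(four.group(0))  # 9999
print(three)          # None: the only run of 9s has length 4, not 3
```

Plain string formatting is used here instead of an f-string so that the braces of the quantifier do not need escaping.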
