Match Two Sets of Different Consecutive Numbers Regex Python - python

I am classifying a list of vanity phone numbers based on their patterns using regex.
I would like to capture this pattern 5ABXXXYYY
Sample 534666999
I wrote the below regex that captures XXXYYY.
(\d)\1{2}(\d)\2{2}
I want to add a condition to assert the B is not the same number as X.
Desired output will match the given pattern exactly and replace it with the word silver.
S_2 = 534666999
S_2_pattern = re.sub(r"(\d)\2{2}(\d)\3{2}", "Silver", str(S_2))
print(S_2_pattern)
Silver
Thanks

If you want to match 9 digits where the 3rd digit must differ from the 4th, you can add another capture group for the 3rd digit; every later group number is then incremented by 1.
\b\d\d(\d)(?!\1)(\d)\2\2(\d)\3\3\b
\b A word boundary to prevent a partial word match
\d\d Match 2 digits
(\d)(?!\1) Capture a single digit in group 1, and assert that it is not followed by the same digit
(\d)\2\2 Capture a single digit in group 2 and match the same digit 2 more times
(\d)\3\3 Capture a single digit in group 3 and match the same digit 2 more times
\b A word boundary
If the digit repeated in group 2 (XXX) should also differ from the digit repeated in group 3 (YYY):
\b\d\d(\d)(?!\1)(\d)(?!\d\d\2)\2\2(\d)\3\3\b
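A minimal sketch of how the two patterns above behave with re.sub, using the sample from the question. The helper name classify and the extra test numbers are made up for illustration.

```python
import re

# Pattern from the answer: the 3rd digit (B) must differ from the XXX digit.
B_DIFFERS = re.compile(r"\b\d\d(\d)(?!\1)(\d)\2\2(\d)\3\3\b")
# Stricter variant: the XXX digit must also differ from the YYY digit.
X_DIFFERS = re.compile(r"\b\d\d(\d)(?!\1)(\d)(?!\d\d\2)\2\2(\d)\3\3\b")


def classify(number, pattern=B_DIFFERS):
    """Replace a matching number with 'Silver'; otherwise return it unchanged."""
    return pattern.sub("Silver", str(number))


print(classify(534666999))             # B=4, X=6, Y=9
print(classify(536666999))             # B equals X, so the number is left alone
print(classify(534666666, X_DIFFERS))  # X equals Y, rejected by the stricter variant
```

Note that both patterns accept any two leading digits; if the number must start with 5, the first \d could be replaced with a literal 5.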

Related

Regex to Match Pattern 5ABXYXYXY

I am working on mobile number of 9 digits.
I want to use regex to match numbers with pattern 5ABXYXYXY.
A sample I have is 529434343
What I have tried
I have the below pattern to match it.
r"^\d*(\d)(\d)(?:\1\2){2}\d*$"
However, this pattern matches another pattern I have which is 5XXXXXXAB
a sample for that is 555555532.
What I want
I want to edit my regex so that it matches only the first pattern, 5ABXYXYXY, and ignores 5XXXXXXAB.
You can use
^\d*((\d)(?!\2)\d)\1{2}$
Details:
^ - start of string
\d* - zero or more digits
((\d)(?!\2)\d) - Group 1: a digit (captured into Group 2), then another digit (not the same as the preceding one)
\1{2} - two occurrences of Group 1 value
$ - end of string.
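As a quick check, the question's pattern and the pattern from this answer can be run against both samples. This is only a sketch; the variable names original and fixed are illustrative.

```python
import re

# The asker's pattern: X and Y may be the same digit, and trailing digits are allowed.
original = re.compile(r"^\d*(\d)(\d)(?:\1\2){2}\d*$")
# The answer's pattern: X must differ from Y, and XYXYXY must end the string.
fixed = re.compile(r"^\d*((\d)(?!\2)\d)\1{2}$")

for number in ("529434343", "555555532"):
    print(number, bool(original.search(number)), bool(fixed.search(number)))
```

The original pattern accepts both samples because 555555 also counts as XYXYXY when X and Y are equal; the lookahead (?!\2) is what rules that out.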
To match 5ABXYXYXY where AB should not be same as XY matching 3 times, you may use this regex:
^\d*(\d{2})(?!\1)((\d)(?!\3)\d)\2{2}$
RegEx Breakup:
^: Start
\d*: Match 0 or more digits
(\d{2}): Match 2 digits and capture in group #1
(?!\1): Make sure we don't have same 2 digits at next position
(: Start capture group #2
(\d): Match and capture a digit in capture group #3
(?!\3): Make sure we don't have same digit at next position as in 3rd capture group
\d: Match a digit
): End capture group #2
\2{2}: Match 2 pairs of same value as in capture group #2
$: End
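The breakup above can be checked with a short sketch. The sample 543434343 is an invented number in which AB equals XY; it is not from the question.

```python
import re

strict = re.compile(r"^\d*(\d{2})(?!\1)((\d)(?!\3)\d)\2{2}$")

m = strict.search("529434343")
print(m.group(1), m.group(2))        # the AB pair and the XY pair

# AB (43) equals XY (43), so the (?!\1) lookahead rejects this number.
print(strict.search("543434343"))
```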

How can I write a regex that finds everything but 4 digit numbers like 2000 or 1990 or 1234?

I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970
Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
This regex matches:
a non-digit character or
a digit that is not preceded by another digit and is not followed by exactly 3 more digits (so it does not start a standalone 4-digit number), plus any digits after it
If you want to match all except 4 digits, I would suggest an unrolled version matching either 1-3 or 5 digits asserting not followed by a digit to prevent consecutive matching digits.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
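The following sketch applies the approaches discussed above to the sample string in Python. The variable names are illustrative, and joining the findall results is just one way to rebuild the text.

```python
import re

x = "Film_relase_date:1970_films_by_20th_Century_Fox"

# [^\d{4}] is a single character class: any char that is not a digit, '{', '4' or '}'.
# That is why the 20 in 20th disappears as well.
char_class_result = "".join(re.findall(r"[^\d{4}]", x))
print(char_class_result)

# Unrolled pattern: keeps runs of 1-3 or 5+ digits, skips standalone 4-digit runs.
keep = re.compile(r"\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*")
kept = "".join(keep.findall(x))
print(kept)

# Alternative: target the 4-digit number itself, to extract it or to remove it.
four_digits = re.compile(r"(?<!\d)\d{4}(?!\d)")
years = four_digits.findall(x)
removed = four_digits.sub("", x)
print(years, removed)
```

Targeting the 4-digit run directly is usually the simpler of the two, since the lookarounds state the "exactly four digits" condition in one place.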

Match text with 4 to 5 CAPITAL ALPHABETS along with minimum 1 or maximum 2 digit number

Requirement:
4 to 5 CAPITAL ALPHABETS along with minimum 1 or maximum 2 digit number
I have created a regex that matches a string of capital letters that contains more than 1 digit, but I want to match only text that has 1 or 2 digits.
\b(?=.*\d){1,2}(?=.*[A-Z])[A-Z\d]{4,5}\b
Match Cases:
Allow
8HB8
H8ER
D5KC2
Disallow
8HB88
HEER
D54C2
Edit 1:
I should also be able to match words of that format within a sentence, not only when they stand alone.
Allow:
This is a valid 9CB8 code
This is another valid H1CS code
One option is to assert 4-5 chars [A-Z0-9].
Then match at least 1 digit 0-9 between optional chars [A-Z] and optionally match a second digit.
^(?=[A-Z0-9]{4,5}$)[A-Z]*[0-9][A-Z]*(?:[0-9][A-Z]*)?$
In parts
^ Start of string
(?=[A-Z0-9]{4,5}$) Assert 4-5 chars A-Z0-9
[A-Z]*[0-9][A-Z]* Match a digit between optional chars A-Z
(?: Non capture group
[0-9][A-Z]* Match a digit 0-9 followed by optional chars A-Z
)? Close group and make it optional
$ End of string
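A small sketch that runs this pattern over the allow and disallow cases from the question; the list names are illustrative.

```python
import re

code = re.compile(r"^(?=[A-Z0-9]{4,5}$)[A-Z]*[0-9][A-Z]*(?:[0-9][A-Z]*)?$")

allow = ["8HB8", "H8ER", "D5KC2"]
disallow = ["8HB88", "HEER", "D54C2"]

# Print each candidate with whether the pattern accepts it.
for s in allow + disallow:
    print(s, bool(code.search(s)))
```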
So maybe you could use:
^(?=[A-Z0-9]{4,5}$)(?:\D*\d\D*){1,2}$
I based my answer on the same principle as I did here.
^ - Start of string anchor.
(?=[A-Z0-9]{4,5}$) - A positive lookahead for a minimum of 4 and a maximum of 5 characters in the range of [A-Z0-9] before the end of string anchor, $.
(?:\D*\d\D*) - A non-capture group where we have a combination of: zero or more non-digits followed by a digit and again zero or more non-digits.
{1,2} - Allow the previous non-capture group to occur a minimum of 1 and a maximum of 2 times (to make sure there are only 1 or 2 digits).
$ - End of string anchor.
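Both answers anchor with ^ and $, so they validate a whole string only. For the "within a sentence" case from Edit 1, one possible adaptation is to swap the anchors for word boundaries. This variant is an assumption on my part and does not come from either answer.

```python
import re

# Assumed adaptation: \b replaces ^ and $ so a code can be found inside running text.
# Note that \b treats '_' as a word character, so codes glued to underscores are skipped.
in_text = re.compile(r"\b(?=[A-Z0-9]{4,5}\b)[A-Z]*[0-9][A-Z]*(?:[0-9][A-Z]*)?\b")

print(in_text.findall("This is a valid 9CB8 code"))
print(in_text.findall("This is another valid H1CS code"))
print(in_text.findall("Not valid: 8HB88 HEER D54C2"))
```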

python regex look ahead positive + negative

This regex will get 456. My question is why it CANNOT be 234 from 1-234-56? Does 56 satisfy the (?!\d) pattern, since it is NOT a single digit? Where does (?!\d) start looking?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456")
print(a)
This is the first step toward adding a comma separator, as in 123,456.
results = pattern.finditer('123456')
for result in results:
    print(result.start(), result.end(), result)
My question is why it CANNOT be 234 from 1-234-56?
It is not possible, because (?=(\d{3})+(?!\d)) requires that only 3-digit sequences appear after the 1-3-digit sequence. 56 (the last digit group in your imagined scenario) is a 2-digit group. Since a quantifier is either lazy or greedy, \d{1,3} cannot be made to match one-, two- and three-digit groups at the same time. To get 234 from 123456, you'd need a regex tailored for it: \B\d{3}, or (?<=1)\d{3}, or even \d{3}(?=\d{2}(?!\d)).
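The three tailored alternatives can be tried side by side. This is a minimal sketch; the names s, tailored and results are illustrative.

```python
import re

s = "123456"
tailored = [r"\B\d{3}", r"(?<=1)\d{3}", r"\d{3}(?=\d{2}(?!\d))"]

# Each pattern should pick out the same three digits from the sample.
results = [re.search(rx, s).group() for rx in tailored]
for rx, found in zip(tailored, results):
    print(rx, found)
```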
Does 56 match the (?!\d) pattern? Where is the beginning point that (?!\d) will look for?
No. This is a negative lookahead: it does not match text, it only checks that there is no digit right after the current position in the input string. If there is a digit, the match fails (no result is found or returned).
More clarification on the look-ahead: it is located after (\d{3})+ subpattern, thus the regex engine starts searching for a digit right after the last 3-digit group, and fails a match if the digit is found (as it is a negative lookahead). In plain words, the (?!\d) is a number closing/trailing boundary in this regex.
A more detailed breakdown:
\d{1,3} - 1 to 3 digit sequence, as many as possible (greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive look-ahead ((?=...)) that checks if the 1-3 digit sequence matched before is followed by
(\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
(?!\d) - not followed by a digit.
Lookaheads do not match and do not consume characters, but you can still capture inside them. When a lookahead is executed, the regex index stays at the same character as before. With your regex and input, \d{1,3} matches 123 because a 3-digit sequence (456) follows it. But 456 is captured within the lookahead, and re.findall returns only the captured texts if capturing groups are set.
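The difference between what is matched and what is captured can be seen by comparing findall with a match object. A minimal sketch using the question's pattern; the variable names are illustrative.

```python
import re

pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
text = "The number is: 123456"

# findall returns only the capturing group, which sits inside the lookahead.
captured = pattern.findall(text)
print(captured)

# The match object exposes both the consumed text and the captured text.
m = pattern.search(text)
print(m.group(0), m.group(1), m.span())
```

Here group(0) is the text consumed by \d{1,3}, while group(1) is the last 3-digit sequence seen by the lookahead.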
To just add a comma as the digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
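One way to apply this pattern is with re.sub, re-inserting the matched digit followed by a comma. The helper name add_commas is made up for this sketch, and it does not special-case decimal fractions.

```python
import re

rx = r'\d(?=(?:\d{3})+(?!\d))'


def add_commas(text):
    # \g<0> is the matched digit; a comma is appended after it.
    return re.sub(rx, r'\g<0>,', text)


print(add_commas("The number is: 123456"))
print(add_commas("1234567"))
```

The group inside the lookahead is non-capturing here, so the pattern matches a single digit and nothing else needs to be carried into the replacement.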

Explain the behavior of this re

I have the following:
>>> re.sub('(..)+?/story','\\g<1>','money/story')
'mey'
>>>
Why is capture group 1 the first letter and last two letters of money and not the first two letters?
The first capture group does not contain m at all. What is being matched by (..)+?/story is oney/story.
The (..)+? matches an even number of characters, so the following is matched (spaced out to make it clearer):
m o n e y / s t o r y
  ^-^ ^-^
Then the replacement is the first capture group. Something you might not know is that when you have a repeated capture group (in this case (..)+?), then only the last captured group is kept.
To summarise, oney/story is matched, and replaced with ey, so the result is mey.
Because the string money contains 5 letters (an odd number), the match cannot include the first letter m. (..)+? captures two characters and non-greedily repeats the group one or more times. Because the repetition quantifier + follows the capturing group, the group ends up holding only the last two characters matched by (..)+?. So the captured string is ey, not the first two characters. Replacing all the matched characters (oney/story) with the contents of group 1, ey, gives mey.
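The behaviour described in both answers can be inspected with a match object. The second pattern is a hypothetical variant, not from either answer, that keeps the first matched pair by repeating a non-capturing group instead.

```python
import re

m = re.search(r'(..)+?/story', 'money/story')
print(m.group(0))   # the whole match, which starts at index 1
print(m.group(1))   # only the last repetition of the group is kept

# Hypothetical variant: capture the first pair, repeat a non-capturing pair after it.
first_pair = re.sub(r'(..)(?:..)*?/story', r'\g<1>', 'money/story')
print(first_pair)
```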
