This regex will get 456. My question is why it CANNOT be 234 from 1-234-56 ? Does 56 qualify the (?!\d)) pattern since it is NOT a single digit. Where is the beginning point that (?!\d)) will look for?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456") ; print(a)
It is in the first stage to add the comma separator like 123,456.
a = pattern.findall("The number is: 123456") ; print(a)
results = pattern.finditer('123456')
for result in results:
print ( result.start(), result.end(), result)
My question is why it CANNOT be 234 from 1-234-56?
It is not possible as (?=(\d{3})+(?!\d)) requires 3-digit sequences appear after a 1-3-digit sequence. 56 (the last digit group in your imagined scenario) is a 2-digit group. Since a quantifier can be either lazy or greedy, you cannot match both one, two and three digit groups with \d{1,3}. To get 234 from 123456, you'd need a specifically tailored regex for it: \B\d{3}, or (?<=1)\d{3} or even \d{3}(?=\d{2}(?!\d)).
Does 56 match the (?!\d)) pattern? Where is the beginning point that (?!\d)) will look for?
No, this is a negative lookahead, it does not match, it only checks if there is no digit right after the current position in the input string. If there is a digit, the match is failed (not result found and returned).
More clarification on the look-ahead: it is located after (\d{3})+ subpattern, thus the regex engine starts searching for a digit right after the last 3-digit group, and fails a match if the digit is found (as it is a negative lookahead). In plain words, the (?!\d) is a number closing/trailing boundary in this regex.
A more detailed breakdown:
\d{1,3} - 1 to 3 digit sequence, as many as possible (greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive look-ahead ((?=...)) that checks if the 1-3 digit sequence matched before are followed by
(\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
(?!\d) - not followed by a digit.
Lookaheads do not match, do not consume characters, but you still can capture inside them. When a lookahead is executed, the regex index is at the same character as before. With your regex and input, you match 123 with \d{1,3} as then you have 3-digit sequence (456). But 456 is capured within a lookahead, and re.findall returns only captured texts if capturing groups are set.
To just add comma as digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
See IDEONE demo
Related
I'm working to advance my regex skills in python, and I've come across an interesting problem. Let's say that I'm trying to match valid credit card numbers , and on of the requirments is that it cannon have 4 or more consecutive digits. 1234-5678-9101-1213 is fine, but 1233-3345-6789-1011 is not. I currently have a regex that works for when I don't have dashes, but I want it to work in both cases, or at least in a way i can use the | to have it match on either one. Here is what I have for consecutive digits so far:
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})')
I know I could do some sort of replace '-' with '', but in an effort to make my code more versatile, it would be easier as just a regex. Here is the function for more context:
def isValid(number):
validStart = re.compile(r'^[456]') # Starts with 4, 5, or 6
validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$') # is 16 digits long
validOnlyDigits = re.compile(r'^[0-9-]*$') # only digits or dashes
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})') # no consecutives over 3
validators = [validStart, validLength, validOnlyDigits, validNoConsecutive]
return all([val.search(number) for val in validators])
list(map(print, ['Valid' if isValid(num) else 'Invalid' for num in arr]))
I looked into excluding chars and lookahead/lookbehind methods, but I can't seem to figure it out. Is there some way to perhaps ignore a character for a given regex? Thanks for the help!
You can add the (?!.*(\d)(?:-*\1){3}) negative lookahead after ^ (start of string) to add the restriction.
The ^(?!.*(\d)(?:-*\1){3}) pattern matches
^ - start of string
(?!.*(\d)(?:-*\1){3}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is
.* - any zero or more chars other than line break chars as many as possible
(\d) - Group 1: one digit
(?:-*\1){3} - three occurrences of zero or more - chars followed with the same digit as captured in Group 1 (as \1 is an inline backreference to Group 1 value).
See the regex demo.
If you want to combine this pattern with others, just put the lookahead right after ^ (and in case you have other patterns before with capturing groups, you will need to adjust the \1 backreference). E.g. combining it with your second regex, validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$'), it will look like
validLength = re.compile(r'^(?!.*(\d)(?:-*\1){3})(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$')
I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970
Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
see demo.
This regex matches:
a non-digit character or
a digit char that is nor preceded by another digit and it is not followeb by exactly 3 digits
If you want to match all except 4 digits, I would suggest an unrolled version matching either 1-3 or 5 digits asserting not followed by a digit to prevent consecutive matching digits.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
Regex demo
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
See another regex demo
I have a string like this:
s = "Abc 3456789 cbd 0045678 def 12345663333"
print(re.findall(r"(?<!\d)\d{7}(?!\d)", s))
Ouput is : 3456789 and 0045678
but I only want to get 3456789. How can I do that?
As per title of finding 7 digit numbers that don't start with 0 you may use:
(?<!\d)[1-9]\d{6}(?!\d)
Note [1-9] at start of match before matching next 6 digits to make it total 7 digits.
RegEx Demo
To make it match any number that doesn't start with 0 use:
(?<!\d)[1-9]\d*(?!\d)
This will do it: ^\D+(\d+)\s
At the beginning of the string ^, there are any non digit characters \D+, followed by any number of digits \d+, which are captured (\d+), and need to be followed by a whitespace \s.
See: https://regex101.com/r/ZuGJ7l/1
If you are looking for number not beginning with 0, then use [1-9] for the first digit and \d for the remaining digits.
For example, to find the ones of length 7 (as per the title), this would give you:
re.findall(r'(?<!\d)[1-9]\d{6}(?!\d)', s)
in other words, a non-zero digit followed by 6 digits, the whole thing neither preceded nor followed by a digit (per the negative lookahead and negative lookbehind assertions),
which for your current example string would produce:
['3456789']
If you want ones that are not length 7, you could use:
re.findall(r'(?<!\d)[1-9](?:\d{,5}|\d{7,})(?!\d)', s)
in other words, a non-zero digit followed by either <= 5 or >= 7 digits (i.e. any number other than 6), the whole thing neither preceded nor followed by a digit,
which would give:
['12345663333']
Note in the second case the use of ?: to ensure that the bracketed group is a non-capturing one -- this ensures that re.findall will return everything that is matched, rather than the contents of the parentheses.
Following regex matches both 59-59-59 and 59-59-59-59 and outputs only 59
The intent is to match four and only numbers followed by - with the max number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern match that matches exactly 59-59-59-59
and does not match 59--59-59or 59-59-59-59-59
Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This is phrased to match an initial number starting with 0 through 5, followed by any second digit. Then, this is followed by a dash and a number with the same rules, this quantity three times exactly. Note that re.match anchor at the beginning by default, so we only need an ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
If you intend to actually match the same number four times in a row, then see the answer by #Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.
In your pattern, this part -{1,4} matches 1-4 times a hyphen so 59-- will match.
If all the matches should be the same as 59, you could use a backreference to the first capturing group and repeat that 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re
res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
print(res.group())
If there should not be partial matches, you could use an anchors to assert the ^ start and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$
I have such list (it's only a part);
not match me
norme
16/02574/REMMAJ
20160721
17/00016/FULM
OUT/2017/1071
SMD/2017/0391
17/01090/FULM
2017/30597
17/03940/MAO
18/00076/FULM
CH/17/323
18/00840/OUTMEI
17/00902/EIAM
PL/2017/02671/MINFOT
I need to find general rule to match them all but not this first rows (simple words) or any of \d nor \w if not mixed with each other and slash. Numbers like \d{8} are allowed.
I don't know how to use something like MUST clause applied for each of these 3 groups together - neither can be miss.
These patterns either match not fully or match words. Need as simple regex as possible if possible.
\d{8}|(\w+|/+|\d+)
\d{8}|[\w/\d]+
EDIT
It's funny, but some not provided examples doesn't match for proposed expressions. For example:
7/2018/4127
NWB/18CM032
but I know why and this is outside the scope. However, adding functionality for mixed numbers and letters in one group, like NWB/18CM032 would be great and wouldn't break previous idea I think.
You could match either 1 or more times an uppercase char or 1-8 digits and repeat that zero or more times with a forward slash prepended:
^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$
That will match
^ Start of string
(?: Non capturing group
[a-z0-9]+ Match a char a-z or a digit 1+ times
(?:/[a-z0-9]+)+ Match a / followed by a char or digit 1+ times and repeat 1+ times.
| Or
\d{8} Match 8 digits
) Close group
$ End of string
See it on regex101