Capturing entire repeated string based on a repeated pattern - python

The following regex matches both 59-59-59 and 59-59-59-59, and captures only 59-.
The intent is to match exactly four numbers separated by -, with the maximum number being 59. Numbers less than 10 are represented as 00-09.
print(re.match(r'(\b[0-5][0-9]-{1,4}\b)','59-59-59').groups())
--> output ('59-',)
I need a pattern that matches exactly 59-59-59-59
and does not match 59--59-59 or 59-59-59-59-59

Try using the following pattern, if using re.match:
[0-5][0-9](?:-[0-5][0-9]){3}$
This matches an initial number whose first digit is 0 through 5, followed by any second digit. It is then followed by a dash and another number with the same rules, exactly three times. Note that re.match anchors at the beginning by default, so we only need an ending anchor $.
Code:
print(re.match(r'([0-5][0-9](?:-[0-5][0-9]){3})$', '59-59-59-59').groups())
('59-59-59-59',)
If you intend to actually match the same number four times in a row, then see the answer by @Thefourthbird.
If you want to find such a string in a larger text, then consider using re.search. In that case, use this pattern:
(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)
Note that instead of using word boundaries \b I used lookarounds to enforce the end of the "word" here. This means that the above pattern will not match something like 59-59-59-59-59.
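As a quick check of the two patterns above, here is a minimal sketch. The sample strings and the names STRICT and EMBEDDED are only illustrative and are not part of the original answer.

```python
import re

# Anchored pattern for re.match (which already anchors at the start).
STRICT = re.compile(r'[0-5][0-9](?:-[0-5][0-9]){3}$')
# Lookaround pattern for locating the value inside a larger text.
EMBEDDED = re.compile(r'(?:^|(?<=\s))[0-5][0-9](?:-[0-5][0-9]){3}(?=\s|$)')

# Illustrative inputs: one valid value and several that must be rejected.
samples = ['59-59-59-59', '59-59-59', '59--59-59', '59-59-59-59-59', '60-00-00-00']
strict_results = {s: bool(STRICT.match(s)) for s in samples}
print(strict_results)

# Only the four-group value surrounded by whitespace should be found.
text = 'ok: 00-09-30-59 too long: 59-59-59-59-59'
found = EMBEDDED.findall(text)
print(found)
```

Only the first sample passes the anchored pattern, and only 00-09-30-59 is found in the longer text.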

In your pattern, the part -{1,4} matches a hyphen 1 to 4 times, so 59-- will match.
If all the matches should be the same as 59, you could use a backreference to the first capturing group and repeat that 3 times with a prepended hyphen.
\b([0-5][0-9])(?:-\1){3}\b
Your code might look like:
import re
res = re.match(r'\b([0-5][0-9])(?:-\1){3}\b', '59-59-59-59')
if res:
    print(res.group())
If there should not be partial matches, you could use anchors to assert the start ^ and the end $ of the string:
^([0-5][0-9])(?:-\1){3}$
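To see how the backreference changes the behaviour, here is a small sketch using the anchored pattern. The candidate strings and the name same_number are made up for illustration.

```python
import re

# \1 must repeat exactly what group 1 captured, so all four numbers are equal.
same_number = re.compile(r'^([0-5][0-9])(?:-\1){3}$')

candidates = ['59-59-59-59', '07-07-07-07', '59-58-59-59', '59-59-59']
accepted = [c for c in candidates if same_number.match(c)]
print(accepted)
```

A string such as 59-58-59-59 is rejected here, whereas a pattern that repeats [0-5][0-9] without the backreference would accept it.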

Related

Matching consecutive digits in regex while ignoring dashes in python3 re

I'm working to advance my regex skills in Python, and I've come across an interesting problem. Let's say that I'm trying to match valid credit card numbers, and one of the requirements is that a number cannot have 4 or more consecutive repeated digits. 1234-5678-9101-1213 is fine, but 1233-3345-6789-1011 is not. I currently have a regex that works when there are no dashes, but I want it to work in both cases, or at least in a way that lets me use | to match either one. Here is what I have for consecutive digits so far:
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})')
I know I could do some sort of replace '-' with '', but in an effort to make my code more versatile, it would be easier as just a regex. Here is the function for more context:
def isValid(number):
    validStart = re.compile(r'^[456]') # Starts with 4, 5, or 6
    validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$') # is 16 digits long
    validOnlyDigits = re.compile(r'^[0-9-]*$') # only digits or dashes
    validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})') # no consecutives over 3
    validators = [validStart, validLength, validOnlyDigits, validNoConsecutive]
    return all([val.search(number) for val in validators])
list(map(print, ['Valid' if isValid(num) else 'Invalid' for num in arr]))
I looked into excluding chars and lookahead/lookbehind methods, but I can't seem to figure it out. Is there some way to perhaps ignore a character for a given regex? Thanks for the help!
You can add the (?!.*(\d)(?:-*\1){3}) negative lookahead after ^ (start of string) to add the restriction.
The ^(?!.*(\d)(?:-*\1){3}) pattern matches
^ - start of string
(?!.*(\d)(?:-*\1){3}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is
.* - any zero or more chars other than line break chars as many as possible
(\d) - Group 1: one digit
(?:-*\1){3} - three occurrences of zero or more - chars followed with the same digit as captured in Group 1 (as \1 is an inline backreference to Group 1 value).
If you want to combine this pattern with others, just put the lookahead right after ^ (and in case you have other patterns before with capturing groups, you will need to adjust the \1 backreference). E.g. combining it with your second regex, validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$'), it will look like
validLength = re.compile(r'^(?!.*(\d)(?:-*\1){3})(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$')
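A minimal sketch of the combined pattern in use follows. The list of test numbers and the snake_case name valid_length are assumptions made for the example.

```python
import re

# Lookahead first: reject any digit repeated 4 times, optionally separated
# by dashes. Then require either 16 digits or four dash-separated groups.
valid_length = re.compile(
    r'^(?!.*(\d)(?:-*\1){3})'
    r'(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$'
)

numbers = [
    '1234-5678-9101-1213',  # no digit repeated four times
    '1233-3345-6789-1011',  # 3 3 - 3 3 is four 3s across a dash
    '4444555566667777',     # four 4s without dashes
    '5122-2368-7954-3214',  # only three 2s in a row
]
results = [bool(valid_length.search(n)) for n in numbers]
print(results)
```

Because -* sits between the repeated digits, the run 33-33 counts as four consecutive 3s even though a dash separates them.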

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
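Strictly speaking, backreferences are not part of regular expressions in the formal sense (Chomsky type-3 grammars). Without them you would have to enumerate every possible pair of characters as a separate alternative, so the pattern grows with the square of the alphabet size.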
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could put your grammar as the following.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z, then capture another a-z in group 1
(?:[a-z]\1)* Repeat 0+ times: a-z followed by a backreference to group 1
$ End of string
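The difference between the two interpretations can be checked with a short sketch. The word list and the names every_second and same_pair are illustrative only.

```python
import re

# Every second character equal: positions 2, 4, 6, ... must all match group 1.
every_second = re.compile(r'^[a-z]([a-z])(?:[a-z]\1)*$')
# Stricter reading: the same two-character pair repeated over the whole string.
same_pair = re.compile(r'(..)\1*')

words = ['abababababab', 'ababacababab', 'abcbdb']
every_second_hits = [w for w in words if every_second.match(w)]
same_pair_hits = [w for w in words if same_pair.fullmatch(w)]
print(every_second_hits)
print(same_pair_hits)
```

Here abcbdb passes the first pattern but not the second, because only its even-position characters repeat.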
Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
Here string[1::2] means: start at the 2nd character (index 1) and then take every second character (the 2 part).
This returns:
All the same True
All the same False
This is a slightly more complicated expression; we might start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem correctly.

Match an occurrence starting with two or three digits but not containing a specific pattern somewhere

I have the following lines:
12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2
I need to get a match if the line begins with two or three digits, but not just one. Also, if the field contains the expressions FO, SH, GDP or LDP anywhere, it should not count as a match. That is, from the previous lines, only 153/G6S.3-H;2-3;1-2 should match, because the others either contain FO, SH or GDP, or start with just one digit.
I tried using
^[1-9][1-9]((?!FO|SH|GDP).)*$
I am getting the correct result, but I am not sure the pattern is correct; I am not an expert in regular expressions.
You need to add any other characters that might be between your starting digits and the things you want to exclude:
Simplified regex: ^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
will only match 153/G6S.3-H;2-3;1-2 from your given data.
Explanation:
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
-----------                            2 to 3 digits at start of line
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
           -----------------------     not followed anywhere by FO, SH, GDP or LDP
^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$
                                  ---  match till end of line
A negative lookahead (?!...) is only tested at the exact position where it appears. Because other characters can sit between the leading digits and the expressions you want to exclude, the lookahead starts with .* so that it scans the rest of the line.
See https://regex101.com/r/j4SRoQ/1 for more explanations (uses {2,}).
Full code example:
import re
regex = r"^[1-9]{2,3}(?!.*(?:FO|SH|GDP|LDP)).*$"
test_str = r"""12(3)/FO.2-3;1-2
153/G6S.3-H;2-3;1-2
1/G13S.2-3
22/FO.2-3;1-2
12(3)2S/FO.2-3;1-2
153/SH/G6S.3-H;2-3;1-2
45/3/H/GDP6;2-3;1-2"""
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    print(match.group())
Output:
153/G6S.3-H;2-3;1-2

How to match words in which must be letter, number and slash using regex (Python)?

I have such list (it's only a part);
not match me
norme
16/02574/REMMAJ
20160721
17/00016/FULM
OUT/2017/1071
SMD/2017/0391
17/01090/FULM
2017/30597
17/03940/MAO
18/00076/FULM
CH/17/323
18/00840/OUTMEI
17/00902/EIAM
PL/2017/02671/MINFOT
I need to find a general rule that matches all of them except the first rows (plain words), and that does not match runs of only \d or only \w unless they are mixed with each other and a slash. Numbers like \d{8} are allowed.
I don't know how to express something like a MUST clause applied to all three of these groups together, so that none of them can be missing.
These patterns either do not match fully or also match the plain words. I need as simple a regex as possible.
\d{8}|(\w+|/+|\d+)
\d{8}|[\w/\d]+
EDIT
Funnily enough, some examples I did not provide do not match the proposed expressions. For example:
7/2018/4127
NWB/18CM032
but I know why and this is outside the scope. However, adding functionality for mixed numbers and letters in one group, like NWB/18CM032 would be great and wouldn't break previous idea I think.
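You could match 1 or more letters or digits, followed by one or more groups of a forward slash and 1 or more letters or digits, or alternatively exactly 8 digits. Note that the pattern uses a-z, so it needs the case-insensitive flag (re.IGNORECASE in Python) to match the uppercase sample data: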
^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$
That will match
^ Start of string
(?: Non capturing group
[a-z0-9]+ Match a char a-z or a digit 1+ times
(?:/[a-z0-9]+)+ Match a / followed by a char or digit 1+ times and repeat 1+ times.
| Or
\d{8} Match 8 digits
) Close group
$ End of string
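A minimal sketch of applying this pattern in Python is shown below. It assumes the re.IGNORECASE flag is wanted, since the sample data is uppercase; the rows list is a subset of the question's data plus one of the examples from the edit.

```python
import re

# Slash-separated alphanumeric groups, or a bare 8-digit number.
reference = re.compile(r'^(?:[a-z0-9]+(?:/[a-z0-9]+)+|\d{8})$', re.IGNORECASE)

rows = [
    'not match me',
    'norme',
    '16/02574/REMMAJ',
    '20160721',
    'OUT/2017/1071',
    '2017/30597',
    'CH/17/323',
    'NWB/18CM032',
]
matched = [row for row in rows if reference.match(row)]
print(matched)
```

Because each group is [a-z0-9]+, a mixed group such as 18CM032 is accepted as well.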

python regex look ahead positive + negative

This regex will get 456. My question is why it CANNOT be 234 from 1-234-56? Does 56 satisfy the (?!\d) pattern, since it is NOT a single digit? Where does (?!\d) start looking?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456") ; print(a)
This is the first stage of adding a comma separator, as in 123,456.
a = pattern.findall("The number is: 123456") ; print(a)
results = pattern.finditer('123456')
for result in results:
    print(result.start(), result.end(), result)
My question is why it CANNOT be 234 from 1-234-56?
It is not possible, because (?=(\d{3})+(?!\d)) requires that one or more 3-digit sequences follow the 1-3 digit sequence, right up to the end of the number. 56 (the last digit group in your imagined scenario) is a 2-digit group. Since a quantifier is either lazy or greedy, you cannot match one-, two- and three-digit groups all at once with \d{1,3}. To get 234 from 123456, you'd need a specifically tailored regex for it: \B\d{3}, or (?<=1)\d{3} or even \d{3}(?=\d{2}(?!\d)).
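The three tailored patterns can be tried directly; this is only a quick sketch, and the variable names are illustrative.

```python
import re

s = '123456'
# Each pattern is built specifically to land on the digits 234 in this input.
tailored = [r'\B\d{3}', r'(?<=1)\d{3}', r'\d{3}(?=\d{2}(?!\d))']
results = [re.search(p, s).group() for p in tailored]
print(results)
```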
Does 56 match the (?!\d) pattern? Where is the beginning point that (?!\d) will look for?
No, this is a negative lookahead. It does not match anything; it only checks that there is no digit right after the current position in the input string. If there is a digit, the match fails (no result is found or returned).
More clarification on the look-ahead: it is located after (\d{3})+ subpattern, thus the regex engine starts searching for a digit right after the last 3-digit group, and fails a match if the digit is found (as it is a negative lookahead). In plain words, the (?!\d) is a number closing/trailing boundary in this regex.
A more detailed breakdown:
\d{1,3} - 1 to 3 digit sequence, as many as possible (greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive look-ahead ((?=...)) that checks if the 1-3 digit sequence matched before are followed by
(\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
(?!\d) - not followed by a digit.
Lookaheads do not match and do not consume characters, but you can still capture inside them. When a lookahead is executed, the regex index stays at the same character as before. With your regex and input, \d{1,3} matches 123 because it is followed by a 3-digit sequence (456). But 456 is captured within the lookahead, and re.findall returns only the captured texts if capturing groups are set.
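The difference between what is consumed and what is captured can be made visible with a short sketch using the pattern from the question. The variable names are illustrative.

```python
import re

pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
text = 'The number is: 123456'

# findall returns group 1 (captured inside the lookahead), not the match itself.
captured = pattern.findall(text)
# group(0) of each match object is the text actually consumed by \d{1,3}.
consumed = [m.group(0) for m in pattern.finditer(text)]
spans = [m.span() for m in pattern.finditer(text)]
print(captured, consumed, spans)
```

So the match itself is 123, while findall reports 456 only because of the capturing group.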
To just add comma as digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
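One way to apply this pattern is with re.sub, re-inserting the matched digit followed by a comma. This is a sketch; the helper name add_commas is hypothetical.

```python
import re

rx = r'\d(?=(?:\d{3})+(?!\d))'

def add_commas(text):
    # \g<0> is the matched digit; a comma is appended after it.
    return re.sub(rx, r'\g<0>,', text)

print(add_commas('The number is: 123456'))
print(add_commas('1234567'))
```

Note that this simple pattern also groups digits after a decimal point, so it is best suited to whole numbers.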