Python regex matching only if digit - python

Given the regex and the word below I want to match the part after the - (which can also be a _ or space) only if the part after the delimiter is a digit and nothing comes after it (I basically want to to be a number and number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping) ?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-

I would use
re.findall('^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', str)
Each string starts with the first group and ends with the last group, so the ^ and $ groups can assist in capture. The $ at the end requires all numbers to be captured, but it's optional so the first group can still be captured.

Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.

This should match anything followed by '-', ' ', or '_' with only digits after it.
(.*)[- _](\d+)

Related

Matching consecutive digits in regex while ignoring dashes in python3 re

I'm working to advance my regex skills in python, and I've come across an interesting problem. Let's say that I'm trying to match valid credit card numbers , and on of the requirments is that it cannon have 4 or more consecutive digits. 1234-5678-9101-1213 is fine, but 1233-3345-6789-1011 is not. I currently have a regex that works for when I don't have dashes, but I want it to work in both cases, or at least in a way i can use the | to have it match on either one. Here is what I have for consecutive digits so far:
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})')
I know I could do some sort of replace '-' with '', but in an effort to make my code more versatile, it would be easier as just a regex. Here is the function for more context:
def isValid(number):
validStart = re.compile(r'^[456]') # Starts with 4, 5, or 6
validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$') # is 16 digits long
validOnlyDigits = re.compile(r'^[0-9-]*$') # only digits or dashes
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})') # no consecutives over 3
validators = [validStart, validLength, validOnlyDigits, validNoConsecutive]
return all([val.search(number) for val in validators])
list(map(print, ['Valid' if isValid(num) else 'Invalid' for num in arr]))
I looked into excluding chars and lookahead/lookbehind methods, but I can't seem to figure it out. Is there some way to perhaps ignore a character for a given regex? Thanks for the help!
You can add the (?!.*(\d)(?:-*\1){3}) negative lookahead after ^ (start of string) to add the restriction.
The ^(?!.*(\d)(?:-*\1){3}) pattern matches
^ - start of string
(?!.*(\d)(?:-*\1){3}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is
.* - any zero or more chars other than line break chars as many as possible
(\d) - Group 1: one digit
(?:-*\1){3} - three occurrences of zero or more - chars followed with the same digit as captured in Group 1 (as \1 is an inline backreference to Group 1 value).
See the regex demo.
If you want to combine this pattern with others, just put the lookahead right after ^ (and in case you have other patterns before with capturing groups, you will need to adjust the \1 backreference). E.g. combining it with your second regex, validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$'), it will look like
validLength = re.compile(r'^(?!.*(\d)(?:-*\1){3})(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$')

Select element with special character with regex and Python

From a list of strings ('16','160','1,2','100,11','1','16:','16:00'), I want to keep only the elements that
either have a comma between two digits (e.g. 1,2 or 100,11)
or have two digits (without comma) that are NOT followed by ":" (i.e. followed by nothing: e.g 16, or followed by anything but ":": e.g. 160)
I tried the following code using regex in Python:
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:[\d],[\d]|[\d][\d][^:]*)')
rate = list(filter(pattern_rate.search,string))
print(rate)
Print:
['16', '160', '1,2','100,11' '16:', '16:00']
To be correct, the script should keep the first three items and reject the rest, but my script fails at rejecting the last two items. I guess I'm using the "[^:]" sign incorrectly.
To be correct, the script should keep the first three items and reject
the rest,
You can match either 2 or more digits, or match 2 digits with a comma in between.
As the list contains only numbers, you could use re.match to start the match at the beginning of the string instead of re.search.
(?:\d{2,}|\d,\d)\Z
Explanation
(?: Non capture group
\d{2,} Match 2 or more digits
| Or
\d,\d Match 2 digits with a comma in between
) Close non capture group
\Z End of string
Regex demo | Python demo
import re
string = ['16','160','1,2','100,11','1','16:','16:00']
pattern_rate = re.compile(r'(?:\d{2,}|\d,\d)\Z')
rate = list(filter(pattern_rate.match,string))
print(rate)
Output
['16', '160', '1,2']
I recommend looking a bit deeper into a regex guide.
100 is not a digit and will not match \d. Also having groups [..] with one element inside is not necessary if you don't intend to negate or otherwise transform them.
The first query can be represented by (?:\d+,\d+). It's a non-capturing group, that detects comma-separated numbers of length greater equal to one.
Your second query will show anything matching three consecutive digits following any (*) amount of not colons.
You'll want to use something similar to (?:\d{2,}(?!:)). It's a non-capturing group, matching digits with length greater equal to two, that are not followed by a colon. ?! designates a negative lookahead.
In your python code, you'll want to use pattern_rate.match instead of pattern_rate.find as the latter one will return partial matches while the first one only returns full matches.
pattern_rate = re.compile(r'(?:\d+,\d+)|(?:\d{2,}(?!:))')
rate = list(filter(pattern_rate.match, string))
Not sure you need regex for that:
string = ['16','160','1,2','100,11','1','16:','16:00']
keep = []
for elem in string:
if ("," in elem and len(elem) == 3) or ( ":" not in elem and "," not in elem and len(elem) >= 2):
keep.append(elem)
print (keep)
Output:
['16', '160', '1,2']
Although not that much elegant, tends to be faster than using regex.

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
This is actually impossible to do in a real regular expression with an amount of states polynomial to the alphabet size, because the expression is not a Chomsky level-0 grammar.
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could put your grammar as the following.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z and capture in group 1 matching a-z
)(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
Regex demo
Though not a regex answer, you could do something like this:
def all_same(string):
return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
the string[1::2] says start at the 2nd character (1) and then pull out every second character (the 2 part).
This returns:
All the same True
All the same False
This is a bit complicated expression, maybe we would start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.
Demo

Python Regex Find match group of range of non digits after hyphen and if range is not present ignore rest of pattern

I'm newer to more advanced regex concepts and am starting to look into look behinds and lookaheads but I'm getting confused and need some guidance. I have a scenario in which I may have several different kind of release zips named something like:
v1.1.2-beta.2.zip
v1.1.2.zip
I want to write a one line regex that can find match groups in both types. For example if file type is the first zip, I would want three match groups that look like:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3. 2
or if the second zip one match group:
v1.1.2.zip
Group 1: v1.1.2
This is where things start getting confusing to me as I would assume that the regex would need to assert if the hyphen exists and if does not, only look for the one match group, if not find the other 3.
(v[0-9.]{0,}).([A-Za-z]{0,}).([0-9]).zip
This was the initial regex I wrote witch successfully matches the first type but does not have the conditional. I was thinking about doing something like match group range of non digits after hyphen but can't quite get it to work and don't not know to make it ignore the rest of the pattern and accept just the first group if it doesn't find the hyphen
([\D]{0,}(?=[-]) # Does not work
Can someone point me in the right right direction?
You can use re.findall:
import re
s = ['v1.1.2-beta.2.zip', 'v1.1.2.zip']
final_results = [re.findall('[a-zA-Z]{1}[\d\.]+|(?<=\-)[a-zA-Z]+|\d+(?=\.zip)', i) for i in s]
groupings = ["{}\n{}".format(a, '\n'.join(f'Group {i}: {c}' for i, c in enumerate(b, 1))) for a, b in zip(s, final_results)]
for i in groupings:
print(i)
print('-'*10)
Output:
v1.1.2-beta.2.zip
Group 1: v1.1.2
Group 2: beta
Group 3: 2
----------
v1.1.2.zip
Group 1: v1.1.2.
----------
Note that the result garnered from re.findall is:
[['v1.1.2', 'beta', '2'], ['v1.1.2.']]
Here is how I would approach this using re.search. Note that we don't need lookarounds here; just a fairly complex pattern will do the job.
import re
regex = r"(v\d+(?:\.\d+)*)(?:-(\w+)\.(\d+))?\.zip"
str1 = "v1.1.2-beta.2.zip"
str2 = "v1.1.2.zip"
match = re.search(regex, str1)
print(match.group(1))
print(match.group(2))
print(match.group(3))
print("\n")
match = re.search(regex, str2)
print(match.group(1))
v1.1.2
beta
2
v1.1.2
Demo
If you don't have a ton of experience with regex, providing an explanation of each step probably isn't going to bring you up to speed. I will comment, though, on the use of ?: which appears in some of the parentheses. In that context, ?: tells the regex engine not to capture what is inside. We do this because you only want to capture (up to) three specific things.
We can use the following regex:
(v\d+(?:\.\d+)*)(?:[-]([A-Za-z]+))?((?:\.\d+)*)\.zip
This thus produces three groups: the first one the version, the second is optional: a dash - followed by alphabetical characters, and then an optional sequence of dots followed by numbers, and finally .zip.
If we ignore the \.zip suffix (well I assume this is rather trivial), then there are still three groups:
(v\d+(?:\.\d+)*): a regex group that starts with a v followed by \d+ (one or more digits). Then we have a non-capture group (a group starting with (?:..) that captures \.\d+ a dot followed by a sequence of one or more digits. We repeat such subgroup zero or more times.
(?:[-]([A-Za-z]+))?: a capture group that starts with a hyphen [-] and then one or more [A-Za-z] characters. The capture group is however optional (the ? at the end).
((?:\.\d+)*): a group that again has such \.\d+ non-capture subgroup, so we capture a dot followed by a sequence of digits, and this pattern is repeated zero or more times.
For example:
rgx = re.compile(r'(v\d+(?:\.\d+)*)([-][A-Za-z]+)?((?:\.\d+)*)\.zip')
We then obtain:
>>> rgx.findall('v1.1.2-beta.2.zip')
[('v1.1.2', '-beta', '.2')]
>>> rgx.findall('v1.1.2.zip')
[('v1.1.2', '', '')]

Look Around and re.sub()

I want to know how re.sub() works.
The following example is in a book I am reading.
I want "1234567890" to be "1,234,567,890".
pattern = re.compile(r"\d{1,3}(?=(\d{3})+(?!\d))")
pattern.sub(r"\g<0>,", "1234567890")
"1,234,567,890"
Then, I changed "\g<0>" to "\g<1>" and it did not work.
The result was "890,890,890,890".
Why?
I want to know exactly how the capturing and replacing of re.sub()and look ahead mechanism is working.
You have 890 repeated because it is Group 1 (= \g<1>), and you replace every 3 digits with the last captured Group 1 (which is 890).
One more thing here is (\d{3})+ that also captures groups of 3 digits one by one until the end (because of the (?!\d) condition), and places only the last captured group of characters into Group 1. And you are using it to replace each 3-digit chunks in the input string.
See visualization at regex101.com.

Categories

Resources