Restrict the count of occurrences regex - python

I am trying to match date combinations, and I have the following regular expression:
\b([\d]{1,2}[\/\s-]{0,3}\d{2,4})
I want to match the following combinations:
8/1967 or 8-1967
08/1967 same
8/67 same
08/67 same
I don't want it to match the following:
08/967
That is, I want the part after "/" or "-" to be either 2 digits or 4 digits.
But "\d{2,4}" matches runs of 2, 3, or 4 digits, and I don't know how to restrict it to either 2 or 4. If there is any other problem with this regex, please let me know.

If you are matching months and years, do
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
Explanation:
\b - a word boundary (the position between a word character and a non-word character)
(?:0?[1-9]|1[0-2]) - 1-12 or 01-12 (with an optional leading zero)
" ?" - an optional space on either side of the separator
[/-] - one separator character, either / or -
(?:[12][0-9])?[0-9]{2} - either a 4-digit number that starts with 1 or 2, or a 2-digit number with any digits
\b - ends with a word boundary (the next character is not a word character)
This will match the following strings: 03-1902, 12 / 2014, 6 / 03
but will not match any of 3 / 3009, 13/2009, or 26-30, or 3///60, or 12/34567.
I use [0-9] instead of \d because, in Python 3 str patterns, \d also matches non-ASCII Unicode decimal digits unless the re.ASCII flag is set.
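As a quick check of the two lists above, here is a minimal sketch that runs the pattern through re.findall. The names MONTH_YEAR_RE and find_month_years are made up for this illustration and are not part of the original answer.

```python
import re

# The month/year pattern from above. Its single capturing group spans the
# whole match, so re.findall returns the full matched text.
MONTH_YEAR_RE = re.compile(
    r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
)

def find_month_years(text):
    """Return every month/year substring found in text (hypothetical helper)."""
    return MONTH_YEAR_RE.findall(text)

print(find_month_years('03-1902, 12 / 2014, 6 / 03'))
# => ['03-1902', '12 / 2014', '6 / 03']

# None of the strings listed as non-matching should produce a result.
for bad in ['3 / 3009', '13/2009', '26-30', '3///60', '12/34567']:
    print(repr(bad), find_month_years(bad))  # each prints an empty list
```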
To match a date range (are you possibly doing a cv/resume parser here?), you can do:
date_re = r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
date_span = r'%s(?:[\s-]+)-\s*%s' % (date_re, date_re)
which produces the following regular expression in date_span:
\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b(?:[\s-]+)-\s*\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b
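A short sketch of how date_span might be used, assuming the range is written with spaces around the dash. The sample string is invented for this illustration.

```python
import re

date_re = r'\b((?:0?[1-9]|1[0-2]) ?[/-] ?(?:[12][0-9])?[0-9]{2})\b'
date_span = r'%s(?:[\s-]+)-\s*%s' % (date_re, date_re)

# Sample input invented for this illustration.
m = re.search(date_span, 'Acme Corp, 03-1902 - 12/2014')
span_dates = m.groups() if m else None
print(span_dates)  # => ('03-1902', '12/2014')

# As written, the separator part needs at least one whitespace or dash
# character before the final dash, so a bare single dash is not accepted.
print(re.search(date_span, '03-1902-12/2014'))  # => None
```

If ranges without surrounding spaces also need to match, the separator part would have to be loosened; that is left out here because the original answer does not cover it.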

Change \d{2,4} into \d{2}(\d{2})?
This will get you what you want.
It first matches 2 digits, then optionally one more group of two digits.
That's exactly 2 or 4 digits.
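One caveat worth checking: without a boundary after the year, 08/967 still matches as 08/96. The sketch below is a simplified variant, not the asker's exact regex; it assumes exactly one "/" or "-" as the separator and adds a trailing \b. The name YEAR_2_OR_4 is made up for this example.

```python
import re

# Assumptions for this sketch: exactly one '/' or '-' as separator, and a
# trailing \b so that a match cannot stop in the middle of a digit run.
YEAR_2_OR_4 = re.compile(r'\b\d{1,2}[/-]\d{2}(?:\d{2})?\b')

samples = ['8/1967', '8-1967', '08/1967', '8/67', '08/67', '08/967']
matched = [s for s in samples if YEAR_2_OR_4.search(s)]
print(matched)  # => ['8/1967', '8-1967', '08/1967', '8/67', '08/67']

# Without the trailing boundary, the 3-digit case is partially matched.
partial = re.search(r'\b\d{1,2}[/-]\d{2}(?:\d{2})?', '08/967').group()
print(partial)  # => 08/96
```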

\b((?<!\/)[\d]{1,2}[\/\s-]{0,3}(?!\d{3}\b)\d{2,4})
Try this. See the demo:
https://regex101.com/r/wX9fR1/11
The (?!\d{3}\b) lookahead ensures that a group of exactly 3 digits is not matched.

Related

How to extract a specific type of number from a string using regex?

Consider this string:
text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''
As you can see, there are simple numbers, numbers with spaces in between, numbers with commas, and numbers with both. Thus, 4 500,5 is considered a standalone, separate number. Extracting the numbers with commas and spaces is easy, and I found this pattern:
pattern = re.compile(r'(\d+ )?\d+,\d+')
However, I am struggling to extract just the simple numbers like 12 and 34. I tried using (?!...) and [^...] but these options do not allow me to exclude the unwanted parts of other numbers.
((?:\d+ )?\d+,\d+)|(\d+(?! \d))
I believe this will do what you want (Regexr link: https://regexr.com/695tc)
To capture "simple" numbers, it looks for [one or more digits], which are not followed by [a space and another digit].
I edited so that you can use capture groups appropriately, if desired.
If you only want to match 12 and 34:
(?<!\S)\d+\b(?![^\S\n]*[,\d])
(?<!\S) Assert a whitespace boundary to the left
\d+\b Match 1+ digits and a word boundary
(?! Negative lookahead, assert what is directly to the right is not
[^\S\n]*[,\d] Match optional spaces and either , or a digit
) Close lookahead
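To see the pattern in action, here is a minimal sketch that applies it to the text from the question. The variable names simple_rx and simple_numbers are chosen only for this example.

```python
import re

text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''

# Digits that start after whitespace (or at the start of the string) and are
# not followed, on the same line, by optional spaces plus a comma or a digit.
simple_rx = re.compile(r'(?<!\S)\d+\b(?![^\S\n]*[,\d])')

simple_numbers = simple_rx.findall(text)
print(simple_numbers)  # => ['12', '34']
```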
I'd suggest extracting all numbers first, then filter those with a comma to a list with floats, and those without a comma into a list of integers:
import re
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'
number_list = re.findall(number_rx, text)
print('Float: ', [x for x in number_list if ',' in x])
# => Float: ['4 500,5', '1,63', '568768,74832']
print('Integers: ', [x for x in number_list if ',' not in x])
# => Integers: ['12', '34']
The regex matches:
(?<!\d) - a negative lookbehind that allows no digit immediately to the left of the current location
(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+) - either of the two alternatives:
\d{1,3}(?:[ \xA0]\d{3})* - one, two or three digits, and then zero or more occurrences of a space or non-breaking space followed by three digits
| - or
\d+ - one or more digits
(?:,\d+)? - an optional sequence of , and then one or more digits
(?!\d) - a negative lookahead that allows no digit immediately to the right of the current location.

Matching consecutive digits in regex while ignoring dashes in python3 re

I'm working to advance my regex skills in Python, and I've come across an interesting problem. Let's say that I'm trying to match valid credit card numbers, and one of the requirements is that a number cannot have 4 or more consecutive repeated digits. 1234-5678-9101-1213 is fine, but 1233-3345-6789-1011 is not. I currently have a regex that works when there are no dashes, but I want it to work in both cases, or at least in a way that lets me use | to match either one. Here is what I have for consecutive digits so far:
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})')
I know I could do some sort of replace '-' with '', but in an effort to make my code more versatile, it would be easier as just a regex. Here is the function for more context:
def isValid(number):
    validStart = re.compile(r'^[456]') # Starts with 4, 5, or 6
    validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$') # is 16 digits long
    validOnlyDigits = re.compile(r'^[0-9-]*$') # only digits or dashes
    validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})') # no consecutives over 3
    validators = [validStart, validLength, validOnlyDigits, validNoConsecutive]
    return all([val.search(number) for val in validators])
list(map(print, ['Valid' if isValid(num) else 'Invalid' for num in arr]))
I looked into excluding chars and lookahead/lookbehind methods, but I can't seem to figure it out. Is there some way to perhaps ignore a character for a given regex? Thanks for the help!
You can add the (?!.*(\d)(?:-*\1){3}) negative lookahead after ^ (start of string) to add the restriction.
The ^(?!.*(\d)(?:-*\1){3}) pattern matches
^ - start of string
(?!.*(\d)(?:-*\1){3}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is
.* - any zero or more chars other than line break chars as many as possible
(\d) - Group 1: one digit
(?:-*\1){3} - three occurrences of zero or more - chars followed with the same digit as captured in Group 1 (as \1 is an inline backreference to Group 1 value).
If you want to combine this pattern with others, just put the lookahead right after ^ (and in case you have other patterns before with capturing groups, you will need to adjust the \1 backreference). E.g. combining it with your second regex, validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$'), it will look like
validLength = re.compile(r'^(?!.*(\d)(?:-*\1){3})(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$')
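A small sketch of the combined pattern applied to the two numbers from the question. CARD_RE and has_valid_shape are names invented for this illustration; the check covers only length, dash layout and the repeated-digit rule, not the "starts with 4, 5, or 6" rule.

```python
import re

# Combined pattern from above: the negative lookahead rejects any digit that
# repeats four times in a row, allowing dashes between the repeats.
CARD_RE = re.compile(
    r'^(?!.*(\d)(?:-*\1){3})'
    r'(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$'
)

def has_valid_shape(number):
    """Hypothetical helper: True if length, dashes and repeats are all OK."""
    return CARD_RE.search(number) is not None

print(has_valid_shape('1234-5678-9101-1213'))  # => True
print(has_valid_shape('1233-3345-6789-1011'))  # => False (3, 3, -, 3, 3)
print(has_valid_shape('1234567891011213'))     # => True
```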

How can I write a regex that finds everything but 4 digit numbers like 2000 or 1990 or 1234?

I have a text like this:
Film_relase_date:1970_films_by_20th_Century_Fox
I would like to create a regex that matches all text except 1970, resulting in:
Film_relase_date:_films_by_20th_Century_Fox
I tried with the regex:
[^\d{4}]
But this regex returns:
Film_relase_date:_films_by_th_Century_Fox
And therefore also excludes the 20 which instead I would like to be matched.
How can I improve the regex?
EDIT:
I want to use this regex to do something like:
x = 'Film_relase_date: 1970_films_by_20th_Century_Fox'
REPLACE (x, "Anything that is not a 4-digit number", "Non-Space") = 1970
Remember that {4} is supposed to be added after the character class, not inside.
Anyway, if you want to match "all text except 1970", you can use the following regex:
([^\d]|(?<!\d)\d(?!\d{3}(?!\d))\d*)?
This regex matches:
a non-digit character or
a digit char that is not preceded by another digit and is not followed by exactly 3 more digits, together with any digits that follow it
If you want to match everything except 4-digit numbers, I would suggest an unrolled version that matches either 1-3 or 5 or more digits and asserts that no digit follows, so that part of a longer digit run is never matched.
If you don't want to cross newlines, you could use [^\d\r\n] instead of \D
\D+(?:(?:\d{1,3}|\d{5,})(?!\d)\D*)*
Explanation
\D+ Match 1+ non digits
(?: Non capture group
(?:\d{1,3}|\d{5,}) Match either 1-3 or 5 or more digits
(?!\d)\D* Negative lookahead, assert not a digit directly to the right followed by matching optional non digits
)* Close the non capture group and repeat 0+ times
Note that if you want to match 4 digits only, you could perhaps extract the 4 digits using (?<!\d)\d{4}(?!\d) instead of replacing with an empty string.
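As a rough illustration of that last suggestion, the sketch below uses the same lookaround pattern both to extract the standalone 4-digit number and to delete it from the string. The variable names are chosen for this example only.

```python
import re

x = 'Film_relase_date:1970_films_by_20th_Century_Fox'

# Exactly four digits, with no digit directly before or after them.
four_digits = re.compile(r'(?<!\d)\d{4}(?!\d)')

years = four_digits.findall(x)
without_years = four_digits.sub('', x)
print(years)          # => ['1970']
print(without_years)  # => Film_relase_date:_films_by_20th_Century_Fox
```

The 20 in 20th is left alone because it is not a run of exactly four digits.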

How to regex-match a 7-digit number in a string when the number does not start with 0, in Python?

I have a string like this:
s = "Abc 3456789 cbd 0045678 def 12345663333"
print(re.findall(r"(?<!\d)\d{7}(?!\d)", s))
Output is: 3456789 and 0045678
but I only want to get 3456789. How can I do that?
As per title of finding 7 digit numbers that don't start with 0 you may use:
(?<!\d)[1-9]\d{6}(?!\d)
Note [1-9] at start of match before matching next 6 digits to make it total 7 digits.
To make it match any number that doesn't start with 0 use:
(?<!\d)[1-9]\d*(?!\d)
This will do it: ^\D+(\d+)\s
At the beginning of the string ^, there are any non digit characters \D+, followed by any number of digits \d+, which are captured (\d+), and need to be followed by a whitespace \s.
See: https://regex101.com/r/ZuGJ7l/1
If you are looking for number not beginning with 0, then use [1-9] for the first digit and \d for the remaining digits.
For example, to find the ones of length 7 (as per the title), this would give you:
re.findall(r'(?<!\d)[1-9]\d{6}(?!\d)', s)
in other words, a non-zero digit followed by 6 digits, the whole thing neither preceded nor followed by a digit (per the negative lookahead and negative lookbehind assertions),
which for your current example string would produce:
['3456789']
If you want ones that are not length 7, you could use:
re.findall(r'(?<!\d)[1-9](?:\d{,5}|\d{7,})(?!\d)', s)
in other words, a non-zero digit followed by either <= 5 or >= 7 digits (i.e. any number other than 6), the whole thing neither preceded nor followed by a digit,
which would give:
['12345663333']
Note in the second case the use of ?: to ensure that the bracketed group is a non-capturing one -- this ensures that re.findall will return everything that is matched, rather than the contents of the parentheses.
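The effect of ?: on re.findall can be seen by running both variants side by side. This is a minimal sketch; the variable names are made up for the comparison.

```python
import re

s = "Abc 3456789 cbd 0045678 def 12345663333"

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?<!\d)[1-9](?:\d{,5}|\d{7,})(?!\d)', s)

# Capturing group: findall returns only what the parentheses matched,
# so the leading non-zero digit is lost.
capturing = re.findall(r'(?<!\d)[1-9](\d{,5}|\d{7,})(?!\d)', s)

print(non_capturing)  # => ['12345663333']
print(capturing)      # => ['2345663333']
```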

python regex look ahead positive + negative

This regex returns 456. My question is: why can the match not be 234, splitting the input as 1-234-56? Does 56 satisfy the (?!\d) pattern, since it is not a single digit? At what position does (?!\d) start looking?
import re
pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')
a = pattern.findall("The number is: 123456") ; print(a)
This is the first step toward adding a comma separator, as in 123,456.
a = pattern.findall("The number is: 123456") ; print(a)
results = pattern.finditer('123456')
for result in results:
    print(result.start(), result.end(), result)
My question is why it CANNOT be 234 from 1-234-56?
It is not possible, because (?=(\d{3})+(?!\d)) requires that only complete 3-digit sequences appear after the 1-3-digit sequence, and 56 (the last digit group in your imagined scenario) is a 2-digit group. Since a quantifier is either lazy or greedy, \d{1,3} cannot match one-, two- and three-digit groups all at once. To get 234 from 123456, you'd need a specifically tailored regex for it: \B\d{3}, or (?<=1)\d{3} or even \d{3}(?=\d{2}(?!\d)).
Does 56 match the (?!\d)) pattern? Where is the beginning point that (?!\d)) will look for?
No, this is a negative lookahead: it does not match anything, it only checks that there is no digit right after the current position in the input string. If there is a digit, the match fails (no result is found or returned).
More clarification on the look-ahead: it is located after (\d{3})+ subpattern, thus the regex engine starts searching for a digit right after the last 3-digit group, and fails a match if the digit is found (as it is a negative lookahead). In plain words, the (?!\d) is a number closing/trailing boundary in this regex.
A more detailed breakdown:
\d{1,3} - 1 to 3 digit sequence, as many as possible (greedy quantifier is used)
(?=(\d{3})+(?!\d)) - a positive look-ahead ((?=...)) that checks if the 1-3 digit sequence matched before are followed by
(\d{3})+ - 1 or more (+) sequences of exactly 3 digits...
(?!\d) - not followed by a digit.
Lookaheads do not match or consume characters, but you can still capture inside them. When a lookahead is executed, the regex index stays at the same character as before. With your regex and input, \d{1,3} matches 123 because it is followed by a 3-digit sequence (456). But 456 is captured within the lookahead, and re.findall returns only the captured texts if capturing groups are present.
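The difference between the matched text and the captured text can be checked with finditer. This is a minimal sketch using the pattern from the question on the bare string 123456.

```python
import re

pattern = re.compile(r'\d{1,3}(?=(\d{3})+(?!\d))')

# findall reports only the capturing group, which sits inside the lookahead.
print(pattern.findall('123456'))  # => ['456']

# finditer exposes both: group(0) is the consumed text, group(1) the capture.
for m in pattern.finditer('123456'):
    print(m.group(0), m.group(1), m.span())  # => 123 456 (0, 3)
```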
To just add comma as digit grouping symbol, use
rx = r'\d(?=(?:\d{3})+(?!\d))'
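One possible way to apply this pattern is re.sub with a replacement that re-emits the matched digit followed by a comma. The helper name add_commas is an assumption made for this sketch.

```python
import re

rx = re.compile(r'\d(?=(?:\d{3})+(?!\d))')

def add_commas(text):
    """Hypothetical helper: insert a comma after every digit that is
    followed by one or more complete groups of three digits."""
    return rx.sub(r'\g<0>,', text)

print(add_commas('The number is: 123456'))  # => The number is: 123,456
print(add_commas('1234567'))                # => 1,234,567
print(add_commas('12'))                     # => 12
```

Note that this sketch treats every digit run the same way, so digits after a decimal point would be grouped as well.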
