Match fixed-width numbers with right-padded contiguous whitespace - python

I am trying to validate a format within a larger regular expression and block of fixed-column text. I would like to match a fixed-width pattern, but only if it has only digits on the left, and only whitespace (or none) on the right. The resulting expression will be used within python.
The following lines should match the 17 digit pattern (except the header):
MATCH
*****************
A 20081122122332444 B
A 20081122122332 B
A 200811221223 B
A 2008112212 B
A 20081122 B
But the following should not match
NO MATCH
*****************
A 20081122112233 1 B
A 2008112211223 1 B
A 200811221 C B
A 20081122 . B
This regex matches the valid data easy enough: (?=\d+\s*)[\d\s]{17}
This also seems to pick up the corrupting characters: (?=\d+[\s]?[^\d])[\d\s]{17}
A negative lookbehind will not work, due to the varying position, and I would rather not repeat the pattern to work all the possible variants for the string length.
It would seem there is an elegant way to do this within a regex - capture a contiguous block of digits, followed by a contiguous block of space, for a total of 17 characters.

Part of the problem with your pattern is that you're using [\d\s]{17}. This matches any 17 characters drawn from digits and whitespace in any order, whereas you want the digits to be consecutive and the whitespace (if any) to follow them as a single consecutive run.
To restrict the length of the string, you may use a positive lookahead to validate that the whole string is exactly 17 characters. Then you match one or more digits (the length is already restricted), optionally followed by whitespace characters.
You may use the following pattern:
^(?=.{17}$)\d+\s*?$
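As a quick check of the lookahead pattern above, here is a minimal sketch. It assumes each 17-character column has already been sliced out of its line; the helper name is_valid_field and the sample values are illustrative and not part of the original data.

```python
import re

# The lookahead pins the total length to 17 characters; the body then
# allows one or more digits followed only by whitespace up to the end.
FIELD_RX = re.compile(r'^(?=.{17}$)\d+\s*?$')

def is_valid_field(field):
    """Return True if field is 17 chars long: digits, then padding only."""
    return FIELD_RX.match(field) is not None

# Hypothetical sample fields, right-padded to 17 characters with ljust.
for raw in ["20081122122332444", "20081122", "20081122112233 1", "20081122 ."]:
    field = raw.ljust(17)
    print(repr(field), is_valid_field(field))
```

Because the pattern is anchored with ^ and $, it only validates an isolated field; it cannot be dropped unchanged into the middle of a larger expression.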

You can achieve this by searching for digits followed by whitespace and checking the number of characters in the match:
import re
text = "20081122 . "
m = re.search(r'\d+\s*', text)
# m is None when the text contains no digits at all
if m and m.end() - m.start() == 17:
    print("yes")
else:
    print("no")

You say you are looking for 17-character width columns and so I will only match up to 17 characters, since it is not clear what can follow these 17 characters (might be extra spaces, which appears to be the case):
import re
text = """A 20081122122332444 B
A 20081122122332    B
A 200811221223      B
A 2008112212        B
A 20081122          B"""
l = [m.group(0)[0:17] for m in re.finditer(r'\d+\s*', text) if m.span(0)[1] - m.span(0)[0] >= 17]
print(l)
Prints:
['20081122122332444', '20081122122332   ', '200811221223     ', '2008112212       ', '20081122         ']
If you are using this as part of a larger regex, then perhaps we have to then assume that the 17-character column is followed by a space and we then have:
(?=\d[\d\s]{16})\d+\s*(?=\s)

Related

How to extract a specific type of number from a string using regex?

Consider this string:
text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''
As you can see, there are simple numbers, numbers with spaces in between, numbers with commas, and numbers with both. Thus, 4 500,5 is considered a standalone, separate number. Extracting the numbers with commas and spaces is easy, and I found this pattern:
pattern = re.compile(r'(\d+ )?\d+,\d+')
However, I am struggling to extract just the simple numbers like 12 and 34. I tried using (?!...) and [^...] but these options do not allow me to exclude the unwanted parts of other numbers.
((?:\d+ )?\d+,\d+)|(\d+(?! \d))
I believe this will do what you want (Regexr link: https://regexr.com/695tc)
To capture "simple" numbers, it looks for [one or more digits], which are not followed by [a space and another digit].
I edited so that you can use capture groups appropriately, if desired.
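A minimal sketch of how the two capture groups could be separated with re.finditer, using the sample text from the question. The variable names decimals and simple are illustrative.

```python
import re

text = """
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row"""

# Group 1: numbers containing a comma (optionally with a leading "digits + space" part).
# Group 2: plain runs of digits that are not followed by a space and another digit.
pattern = re.compile(r'((?:\d+ )?\d+,\d+)|(\d+(?! \d))')

decimals = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
simple = [m.group(2) for m in pattern.finditer(text) if m.group(2)]

print(decimals)
print(simple)
```

Only one of the two groups is set for any given match, so filtering on the group value is enough to split the results.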
If you only want to match 12 and 34:
(?<!\S)\d+\b(?![^\S\n]*[,\d])
(?<!\S) Assert a whitespace boundary to the left
\d+\b Match 1+ digits and a word boundary
(?! Negative lookahead, assert what is directly to the right is not
[^\S\n]*[,\d] Match optional spaces and either , or a digit
) Close lookahead
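The breakdown above can be tried out with a short sketch like the following, again on the sample text from the question. The name SIMPLE_RX is illustrative.

```python
import re

text = """
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row"""

# Whitespace boundary on the left, word boundary on the right, and no
# comma or digit after optional same-line spaces.
SIMPLE_RX = re.compile(r'(?<!\S)\d+\b(?![^\S\n]*[,\d])')

matches = SIMPLE_RX.findall(text)
print(matches)
```

The [^\S\n] class means "whitespace except a newline", which keeps the lookahead from peeking into the next line.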
I'd suggest extracting all numbers first, then filter those with a comma to a list with floats, and those without a comma into a list of integers:
import re
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'
number_list = re.findall(number_rx, text)
print('Float: ', [x for x in number_list if ',' in x])
# => Float: ['4 500,5', '1,63', '568768,74832']
print('Integers: ', [x for x in number_list if ',' not in x])
# => Integers: ['12', '34']
The regex matches:
(?<!\d) - a negative lookbehind that allows no digit immediately to the left of the current location
(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+) - either of the two alternatives:
\d{1,3}(?:[ \xA0]\d{3})* - one, two or three digits, and then zero or more occurrences of a space / hard (non-breaking) space followed by three digits
| - or
\d+ - one or more digits
(?:,\d+)? - an optional sequence of , and then one or more digits
(?!\d) - a negative lookahead that allows no digit immediately to the right of the current location.

Matching consecutive digits in regex while ignoring dashes in python3 re

I'm working to advance my regex skills in Python, and I've come across an interesting problem. Let's say that I'm trying to match valid credit card numbers, and one of the requirements is that a number cannot have 4 or more consecutive repeated digits. 1234-5678-9101-1213 is fine, but 1233-3345-6789-1011 is not. I currently have a regex that works when there are no dashes, but I want it to work in both cases, or at least in a way that lets me use | to match either one. Here is what I have for consecutive digits so far:
validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})')
I know I could do some sort of replace '-' with '', but in an effort to make my code more versatile, it would be easier as just a regex. Here is the function for more context:
def isValid(number):
    validStart = re.compile(r'^[456]') # Starts with 4, 5, or 6
    validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$') # is 16 digits long
    validOnlyDigits = re.compile(r'^[0-9-]*$') # only digits or dashes
    validNoConsecutive = re.compile(r'(?!([0-9])\1{4,})') # no consecutives over 3
    validators = [validStart, validLength, validOnlyDigits, validNoConsecutive]
    return all([val.search(number) for val in validators])
list(map(print, ['Valid' if isValid(num) else 'Invalid' for num in arr]))
I looked into excluding chars and lookahead/lookbehind methods, but I can't seem to figure it out. Is there some way to perhaps ignore a character for a given regex? Thanks for the help!
You can add the (?!.*(\d)(?:-*\1){3}) negative lookahead after ^ (start of string) to add the restriction.
The ^(?!.*(\d)(?:-*\1){3}) pattern matches
^ - start of string
(?!.*(\d)(?:-*\1){3}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is
.* - any zero or more chars other than line break chars as many as possible
(\d) - Group 1: one digit
(?:-*\1){3} - three occurrences of zero or more - chars followed with the same digit as captured in Group 1 (as \1 is an inline backreference to Group 1 value).
If you want to combine this pattern with others, just put the lookahead right after ^ (and in case you have other patterns before with capturing groups, you will need to adjust the \1 backreference). E.g. combining it with your second regex, validLength = re.compile(r'^[0-9]{16}$|^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}$'), it will look like
validLength = re.compile(r'^(?!.*(\d)(?:-*\1){3})(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$')
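A minimal sketch exercising the combined pattern above on a few numbers. The function name has_valid_shape and the two undashed sample numbers are made up for illustration; the sketch checks only the length/format and repeated-digit rules, not the leading 4, 5, or 6.

```python
import re

# Reject any digit repeated 4 or more times in a row (dashes ignored), then
# require 16 digits, either plain or in dash-separated groups of four.
CARD_RX = re.compile(
    r'^(?!.*(\d)(?:-*\1){3})'
    r'(?:[0-9]{16}|[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})$'
)

def has_valid_shape(number):
    """Return True if number passes the format and repeated-digit checks."""
    return CARD_RX.search(number) is not None

for number in ["1234-5678-9101-1213", "1233-3345-6789-1011",
               "5123456789012346", "5133336789012346"]:
    print(number, has_valid_shape(number))
```

The second sample fails because the run 33-33 spans a dash, which is exactly the case the -* inside the repeated group is meant to catch.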

Regex expression to exclude any number from any place

I have code that takes a string as input, discards all the letters, and prints only the numbers that do not contain a 9 anywhere.
I decided to do it with a regex but couldn't find a working expression. How does my code need to be modified?
I have also tried [^9], but it doesn't work.
import re
s = input().lstrip().rstrip()
updatedStr = s.replace(' ', '')
nums = re.findall('[0-8][0-8]', updatedStr)
print(nums)
The code should completely discard the number which contains 9 at any place.
for example - if the input is:
"This is 67 and 98"
output: ['67']
input:
"This is the number 678975 or 56783 or 87290 thats it"
output: ['56783'] (as the other two numbers contain 9 at some places)
I think you should try using:
nums=re.findall('[0-8]+',updatedStr)
Instead.
[0-8]+ means "one or more occurrences of a digit from 0 to 8"
I tried : 12313491 a asfasgf 12340 asfasf 123159
And got: ['123134', '1', '12340', '12315']
(Your code returns the array. If you want to join the numbers you should add some code)
It sounds like you want to match all numbers that don't contain a 9.
The pattern should match any run of digits that contains no 9 and is preceded and followed by a non-digit:
pattern = re.compile(r'(?<=[^\d])[0-8]+(?=[^\d])')
pattern.findall(inputString) # Finds all the matches
Here the pattern is doing a couple of things.
(?<=...) is a positive look behind. This means we will only get matches that have a non digit before it.
[0-8]+ will match 1 or more digits except 9
(?=...) is a lookahead. We will only get matches that end in a non digit.
Note:
inputString does not need to be stripped. In fact, this pattern will miss a number at the very beginning or end of the string, because the lookarounds require a non-digit character on both sides. To prevent this, simply pad the string with any non-digit characters:
inputString = ' ' + inputString + ' '
Look at the python re docs for more info
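As an alternative to padding, the positive lookarounds can be swapped for negative ones, which also succeed at the start and end of the string. This is a variation on the answer above rather than the answerer's own pattern, and the function name numbers_without_nine is illustrative.

```python
import re

# Negative lookarounds: no digit may touch the match on either side,
# so a number at the start or end of the string needs no padding.
NO_NINE_RX = re.compile(r'(?<!\d)[0-8]+(?!\d)')

def numbers_without_nine(s):
    """Return all whole numbers in s that contain no digit 9."""
    return NO_NINE_RX.findall(s)

print(numbers_without_nine("This is 67 and 98"))
print(numbers_without_nine("This is the number 678975 or 56783 or 87290 thats it"))
```

A number containing a 9 is rejected as a whole because every 9-free fragment of it has a digit on at least one side.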

using reg exp to check if test string is of a fixed format

I want to make sure using regex that a string is of the format- "999.999-A9-Won" and without any white spaces or tabs or newline characters.
There may be 2 or 3 numbers in the range 0 - 9.
Followed by a period '.'
Again followed by 2 or 3 numbers in the range 0 - 9
Followed by a hyphen, character 'A' and a number between 0 - 9 .
This can be followed by anything.
Example: 87.98-A8-abcdef
The code I have come up until now is:
testString = "87.98-A1-help"
regCompiled = re.compile('^[0-9][0-9][.][0-9][0-9][-A][0-9][-]*');
checkMatch = re.match(regCompiled, testString);
if checkMatch:
    print ("FOUND")
else:
    print("Not Found")
This doesn't seem to work. I'm not sure what I'm missing and also the problem here is I'm not checking for white spaces, tabs and new line characters and also hard-coded the number for integers before and after decimal.
With {m,n} you can specify the number of times a pattern can repeat, and the \d character class matches all digits. The \S character class matches anything that is not whitespace. Using these your regular expression can be simplified to:
re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
Note also the \Z anchor, which makes the \S* expression match all the way to the end of the string. No whitespace (newlines, tabs, etc.) is allowed here. If you combine this with the .match() method, you ensure that all characters in your tested string conform to the pattern, nothing more, nothing less. See search() vs. match() for more information on .match().
A small demonstration:
>>> import re
>>> pattern = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*\Z')
>>> pattern.match('87.98-A1-help')
<_sre.SRE_Match object at 0x1026905e0>
>>> pattern.match('123.45-A6-no whitespace allowed')
>>> pattern.match('123.45-A6-everything_else_is_allowed')
<_sre.SRE_Match object at 0x1026905e0>
Let's look at your regular expression. If you want:
"2 or 3 numbers in the range 0 - 9"
then you can't start your regular expression with ^[0-9][0-9][.] because that will only match strings with exactly two digits at the beginning. A second issue with your regex is at the end: [0-9][-]* - if you wish to match anything at the end of the string then you need to finish your regular expression with .* instead. Edit: see Martijn Pieters's answer regarding the whitespace in the regular expressions.
Here is an updated regular expression:
import re
testString = "87.98-A1-help"
# Raw string so that \. reaches the regex engine unchanged
regCompiled = re.compile(r'^[0-9]{2,3}\.[0-9]{2,3}-A[0-9]-.*')
checkMatch = regCompiled.match(testString)
if checkMatch:
    print("FOUND")
else:
    print("Not Found")
Not everything needs to be enclosed inside [ and ], in particular when you know the character(s) that you wish to match (such as the part -A). Furthermore:
the notation {m,n} means: match at least m times and at most n times, and
to explicitly match a dot, you need to escape it: that's why there is \. in the regular expression above.
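Putting the points from the answers together, a minimal sketch might use re.fullmatch (available since Python 3.4) so that no explicit anchors are needed, with \S* to forbid whitespace in the tail. The function name has_expected_format is illustrative.

```python
import re

# 2-3 digits, a literal dot, 2-3 digits, "-A", one digit, "-",
# then any run of non-whitespace characters.
FORMAT_RX = re.compile(r'\d{2,3}\.\d{2,3}-A\d-\S*')

def has_expected_format(s):
    """Return True if the whole string matches the required format."""
    # fullmatch requires the entire string to match, including any trailing newline.
    return FORMAT_RX.fullmatch(s) is not None

for candidate in ["87.98-A1-help", "999.999-A9-Won", "123.45-A6-no whitespace allowed"]:
    print(candidate, has_expected_format(candidate))
```

Replacing \S* with .* would accept spaces and tabs in the tail, which is the difference between the two answers above.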

Python Regular Expression Match All 5 Digit Numbers but None Larger

I'm attempting to string match 5-digit coupon codes spread throughout a HTML web page. For example, 53232, 21032, 40021 etc... I can handle the simpler case of any string of 5 digits with [0-9]{5}, though this also matches 6, 7, 8... n digit numbers. Can someone please suggest how I would modify this regular expression to match only 5 digit numbers?
>>> import re
>>> s="four digits 1234 five digits 56789 six digits 012345"
>>> re.findall(r"\D(\d{5})\D", s)
['56789']
If they can occur at the very beginning or the very end, it's easier to pad the string than to mess with special cases:
>>> re.findall(r"\D(\d{5})\D", " "+s+" ")
Instead of padding the string to handle the start and end of the string, as in John La Rooy's answer, one can use negative lookahead and lookbehind to handle both cases with a single regular expression:
>>> import re
>>> s = "88888 999999 3333 aaa 12345 hfsjkq 98765"
>>> re.findall(r"(?<!\d)\d{5}(?!\d)", s)
['88888', '12345', '98765']
full string: ^[0-9]{5}$
within a string: [^0-9][0-9]{5}[^0-9]
Note: \D is problematic here because it consumes a non-digit character on each side of the number, so a number at the very start or end of the string, or one that shares a separator with the previous match, is missed. Use \b instead.
\b matches the empty string, but only at the beginning or end of a word, so it consumes nothing.
import re
s = "four digits 1234 five digits 56789 six digits 01234,56789,01234"
re.findall(r"\b\d{5}\b", s)
result : ['56789', '01234', '56789', '01234']
but if one uses
re.findall(r"\D(\d{5})\D", s)
output : ['56789', '01234']
Here \D misses the second 56789 because the comma before it was already consumed by the previous match, and it misses the final 01234 because no character follows it.
More documentation: https://docs.python.org/2/library/re.html
Cheers
A very simple way would be to match all groups of digits, like with r'\d+', and then skip every match that isn't five characters long when you process the results.
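A minimal sketch of that filtering approach, assuming the sample string from an earlier answer. The function name five_digit_numbers is illustrative.

```python
import re

def five_digit_numbers(s):
    """Find every maximal run of digits, then keep only runs of length 5."""
    return [run for run in re.findall(r'\d+', s) if len(run) == 5]

s = "four digits 1234 five digits 56789 six digits 012345"
print(five_digit_numbers(s))
```

Because \d+ is greedy and always takes the whole run, a 6-digit number is seen as one match of length 6 and dropped, with no lookarounds needed.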
You probably want to match a non-digit before and after your string of 5 digits, like [^0-9]([0-9]{5})[^0-9]. Then you can capture the inner group (the actual string you want).
You could try
\D\d{5}\D
or maybe
\b\d{5}\b
I'm not sure how python treats line-endings and whitespace there though.
I believe ^\d{5}$ would not work for you, as you likely want to get numbers that are somewhere within other text.
I use a simpler expression:
re.findall(r"\d{5}", mystring)
It finds every run of 5 digits. However, it also matches the first 5 digits of any longer number, so you have to be sure the string contains no numbers with more than 5 digits.
