Regular expression to convert given number in the required format - python

I am using regular expressions for the first time, so I need help with one slightly complex expression. I have an input list of around 100-150 strings (numbers).
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
Expected output = ['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
## I have tried this -
import re
new_list = []
for i in range(0, len(input)):
    new_list.append(re.sub('\d+-\d+-\d+','0000\\1', input[i]))
    ## problem is with second argument '0000\\1'. I know it's wrong but unable to solve
print(new_list) ## new_list should be the expected output.
As you can see, I need to convert strings of numbers that come in different formats into 15-digit numbers by adding leading zeros.
But there is a catch: some numbers, e.g. '000480087800784', are already 15 digits and should be left unchanged. (That's why I cannot use Python's string formatting (.format) option.) A regex has to be used here, which will modify only the required numbers. I have already tried the following answers but have not been able to solve it.
Using regex to add leading zeroes
using regular expression substitution command to insert leading zeros in front of numbers less than 10 in a string of filenames
Regular expression to match defined length number with leading zeros

Your regex does not work as you used \1 in the replacement, but the regex pattern has no corresponding capturing group. \1 refers to the first capturing group in the pattern.
If you want to try your hand at regex, you may use
re.sub(r'^(\d+)-(\d+)-(\d+)$', lambda x: "{}-{}-{}".format(x.group(1).zfill(5), x.group(2).zfill(5), x.group(3).zfill(5)), input[i])
See the Python demo.
Here, ^(\d+)-(\d+)-(\d+)$ matches a string that starts with 1+ digits, then has -, then 1+ digits, - and again 1+ digits followed by the end of string. There are three capturing groups whose values can be referred to with \1, \2 and \3 backreferences from the replacement pattern. However, since we need to apply .zfill(5) on each captured text, a lambda expression is used as the replacement argument, and the captures are accessed via the match data object group() method.
However, if your strings are already in correct format, you may just split the strings and format as necessary:
for i in range(0, len(input)):
    splits = input[i].split('-')
    if len(splits) == 1:
        new_list.append(input[i])
    else:
        new_list.append("{}-{}-{}".format(splits[0].zfill(5), splits[1].zfill(5), splits[2].zfill(5)))
See another Python demo. Both solutions yield
['00090-00010-07457', '000480087800784', '00001-00713-00926', '00012-00710-08197', '00001-00345-01715', '00009-00023-04532', '000200007100272']
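For reference, the regex approach can be wrapped into a small self-contained sketch. The helper name pad_groups and its width parameter are illustrative assumptions, not part of the original answer; strings without dashes do not match the pattern and are returned unchanged.

```python
import re

def pad_groups(value, width=5):
    """Zero-pad each dash-separated group; strings that do not match are returned as-is."""
    return re.sub(
        r'^(\d+)-(\d+)-(\d+)$',
        # m.groups() holds the three captured digit runs
        lambda m: '-'.join(g.zfill(width) for g in m.groups()),
        value,
    )

numbers = ['90-10-07457', '000480087800784', '001-713-0926']
padded = [pad_groups(n) for n in numbers]
print(padded)
# ['00090-00010-07457', '000480087800784', '00001-00713-00926']
```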

How about analysing the string for numbers and dashes, then adding leading zeros?
input = ['90-10-07457', '000480087800784', '001-713-0926', '12-710-8197', '1-345-1715', '9-23-4532', '000200007100272']
output = []
for inp in input:
    # calculate length of string
    inpLen = len(inp)
    # calculate num of dashes
    inpDashes = inp.count('-')
    # add specific number of leading zeros
    zeros = "0" * (15-(inpLen-inpDashes))
    output.append(zeros + inp)
print(output)
>>> ['00000090-10-07457', '000480087800784', '00000001-713-0926', '00000012-710-8197', '00000001-345-1715', '000000009-23-4532', '000200007100272']

Related

How to add dot separator on different positions of a number in Python?

I am trying to capture a number from a string, which sometimes contains dot separators and sometimes it does not. In any case I need a number with the dot separator.
e.g.:
num = re.findall('\d{3}\.(?:\d{2}\.){4}\d{3}|\d{14}', txt)[0]
will capture both variations:
304.33.44.52.03.002
30433445203002
In case it captured the one without dots, I would need to add the dots following this pattern:
AAA.BB.CC.DD.EE.FFF
How can I add those dots with Python?
Solution without regexp.
You can convert the value to a string, turn that into a list of characters, and insert dots at the required positions.
n = 30433445203002
l = list(str(n))
Add dots in positions you need
l.insert(3, '.')
l.insert(6, '.')
l.insert(9, '.')
l.insert(12, '.')
l.insert(15, '.')
If this is a well-defined pattern, you can generalize the insertions above.
After insertion is done, join them back to the string:
num = "".join(l)
Input:
30433445203002
Output:
304.33.44.52.03.002
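One way to generalize the insertions is to slice the digit string by group sizes instead of inserting at fixed positions. This is only a sketch: the function name add_dots and the group_sizes parameter are assumptions introduced here, with the sizes taken from the AAA.BB.CC.DD.EE.FFF layout in the question.

```python
def add_dots(digits, group_sizes=(3, 2, 2, 2, 2, 3)):
    """Split a digit string into groups of the given sizes and join them with dots."""
    digits = str(digits)
    if len(digits) != sum(group_sizes):
        raise ValueError("expected {} digits, got {}".format(sum(group_sizes), len(digits)))
    parts = []
    start = 0
    for size in group_sizes:
        parts.append(digits[start:start + size])
        start += size
    return ".".join(parts)

print(add_dots(30433445203002))
# 304.33.44.52.03.002
```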
You can capture each "group" of numbers into a capturing group, and refer to it in the replacement string. The dots can be made optional with \.?.
string = "30433445203002"
regex = r"(\d{3})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{2})\.?(\d{3})"
pattern = "\\1.\\2.\\3.\\4.\\5.\\6"
result = re.sub(regex, pattern, string)
For more details, take a look at re.sub.
Output:
304.33.44.52.03.002
Regex Demo
EDIT:
If I have misunderstood you and what you actually want is to get the first 3 numbers, 4th and 5th numbers, 6th and 7th numbers etc, you can use the same regex with search:
re.search(regex, string).group(1) # 304
re.search(regex, string).group(2) # 33

Python zfill off by one in regular expression

I'm working on a script that uses zfill to add leading zeros to numbers matched from a regular expression in Python 3.
Here's my code:
#!/usr/bin/env python
import re
string = "7-8"
pattern = re.compile(r"^(\d+)-(\d+)$")
replacement = "-{}-{}-".format(
    "\\1".zfill(2),
    "\\2".zfill(3)
)
result = re.sub(pattern, replacement, string)
print(result)
The output I expect is for the first number to be padded to two characters in width and the second number to be padded out to three characters. For example:
-07-008-
Instead, I'm getting:
-7-08-
Why is there one less zero than expected?
You're zfilling the back-reference strings themselves, before re.sub ever sees them. "\\1" and "\\2" are already two characters long (a backslash and a digit), so zfill(2) adds nothing to the first and zfill(3) adds just one zero to the second. The replacement string therefore becomes -\1-0\2-, which yields -7-08-.
You can instead pass a function as your replacement to re.sub and do the zfilling in there:
def repl_fn(m):
    return f'-{m.group(1).zfill(2)}-{m.group(2).zfill(3)}-'
result = re.sub(pattern, repl_fn, string)
print(result)
# -07-008-
The zfilling is now done at replacement time, not before, as in your code.
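The difference can be checked directly. The following sketch first shows what zfill does to the literal back-reference strings, then repeats the function-based replacement; the function name pad is an illustrative assumption.

```python
import re

# Before re.sub runs, a back-reference is just a two-character string.
print(len("\\1"))      # 2
print("\\1".zfill(2))  # \1   (already two characters wide, nothing added)
print("\\2".zfill(3))  # 0\2  (a single zero goes in front of the backslash)

def pad(match):
    # Here zfill sees the matched digits, not the back-reference text.
    return "-{}-{}-".format(match.group(1).zfill(2), match.group(2).zfill(3))

result = re.sub(r"^(\d+)-(\d+)$", pad, "7-8")
print(result)          # -07-008-
```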

Avoid special values or space between values using python re

For any phone number which allows () in the area code and any space between area code and the 4th number, I want to create a tuple of the 3 sets of numbers.
For example: (301) 556-9018 or (301)556-9018 would return ('301','556','9018').
I will raise a ValueError exception if the input is in anything other than the original format.
How do I avoid () characters and include either \s or none between the area code and the next values?
This is my foundation so far:
phonenum=re.compile('''([\d)]+)\s([\d]+) - ([\d]+)$''',re.VERBOSE).match('(123) 324244-123').groups()
print(phonenum)
Do I need an if statement to ignore the () for the first tuple element, or is there a regular expression that does that more efficiently?
In addition, the \s between the first 2 tuple elements doesn't work if the input is (301)556-9018.
Any hints on how to approach this?
When specifying a regular expression, you should use raw-string mode:
`r'abc'` instead of `'abc'`
That said, right now you are capturing three sets of numbers in groups. To allow parens, you will need to match parens. (The parens you currently have are for the capturing groups.)
You can match parens by escaping them: \( and \)
You can find various solutions to "what is a regex for XXX" by searching one of the many "regex library" web sites. I was able to find this one via DuckDuckGo: http://www.regexlib.com/Search.aspx?k=phone
To make a part of your pattern optional, you can make the individual pieces optional, or you can provide alternatives with the piece present or absent.
Since the parens have to be present or absent together - that is, you don't want to allow an opening paren but no closing paren - you probably want to provide alternatives:
# number, no parens: 800 555-1212
noparens = r'\d{3}\s+\d{3}-\d{4}'
# number with parens: (800) 555-1212
yesparens = r'\(\d{3}\)\s*\d{3}-\d{4}'
You can match the three pieces by inserting "grouping parens":
noparens_grouped = r'(\d{3})\s+(\d{3})-(\d{4})'
yesparens_grouped = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
Note that the quoted parens go outside of the grouping parens, so that the parens do not become part of the captured group.
You can join the alternatives together with the | operator:
yes_or_no_parens_groups = noparens_grouped + '|' + yesparens_grouped
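Note that joining the two grouped alternatives produces six capturing groups, three of which are None for any given match. A minimal sketch of turning that into the requested tuple follows; the function name parse_phone and the use of re.fullmatch are assumptions added here, not part of the answer above.

```python
import re

NOPARENS_GROUPED = r'(\d{3})\s+(\d{3})-(\d{4})'
YESPARENS_GROUPED = r'\((\d{3})\)\s*(\d{3})-(\d{4})'
PHONE_RE = re.compile(NOPARENS_GROUPED + '|' + YESPARENS_GROUPED)

def parse_phone(text):
    """Return (area, prefix, line) or raise ValueError for any other format."""
    m = PHONE_RE.fullmatch(text)
    if m is None:
        raise ValueError("not a valid phone number: {!r}".format(text))
    # Only the groups of the alternative that matched are filled in.
    return tuple(g for g in m.groups() if g is not None)

print(parse_phone('(301) 556-9018'))  # ('301', '556', '9018')
print(parse_phone('(301)556-9018'))   # ('301', '556', '9018')
```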
In regular expressions you can use special characters to specify some behavior of some part of the expression.
From python re documentation:
'*' =
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
'+' =
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
'?' =
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
So to solve the blank space problem you can use '?' if there will be at most one space, or '*' if there may be several; '+' would require at least one space.
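To see how the three quantifiers treat the optional space, here is a small sketch that applies each of them to the \s between the area code and the rest of the number. The sample strings and variable names are illustrative assumptions.

```python
import re

samples = ['(301)556-9018', '(301) 556-9018', '(301)  556-9018']
patterns = {
    '?': r'\(\d{3}\)\s?\d{3}-\d{4}',  # zero or one space
    '*': r'\(\d{3}\)\s*\d{3}-\d{4}',  # zero or more spaces
    '+': r'\(\d{3}\)\s+\d{3}-\d{4}',  # one or more spaces
}
results = {q: [bool(re.fullmatch(p, s)) for s in samples] for q, p in patterns.items()}
for quantifier, flags in results.items():
    print(quantifier, flags)
# ? [True, True, False]
# * [True, True, True]
# + [False, True, True]
```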
To group the pieces and return them together, put each part of the expression inside parentheses and then call groups() on the match object returned by re.
The result would be:
results = re.search(r'\((\d{3})\)\s?(\d{3})-(\d{4})', '(301) 556-9018')
if results:
    print(results.groups())
else:
    print('Invalid phone number')

Python Regex to look for string

I have a text file with text that looks like below
Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10
}
}
I need to look for Num_row_labels >=10 in my text file. How do I do that using Python 3.2 regex?
Thanks.
Assuming the data is formatted as above and the number has no leading 0's:
Num_row_labels=\d{2,}
A more liberal regex, which allows arbitrary spaces but still assumes no leading 0's:
Num_row_labels\s*=\s*\d{2,}
An even more liberal regex, which allows arbitrary spaces and also allows leading 0's:
Num_row_labels\s*=\s*0*[1-9]\d+
If you need to capture the number, just surround \d{2,} (in the 1st and 2nd regex) or [1-9]\d+ (in the 3rd regex) with parentheses () and refer to it as the 1st capture group.
Use:
match = re.search(r"Num_row_labels=(\d+)", line)
The (\d+) matches at least one decimal digit (0-9) and captures all digits matched as a group (groups are stored in the object returned by re.search and re.match, which I'm assigning to match here). To access the group and compare it against 10, use:
if int(match.group(1)) >= 10:
    print("Num_row_labels is at least 10")
This will allow you to easily change the value of your threshold, unlike the answers that do everything in the regex. Additionally, I believe this is more readable in that it is very obvious that you are comparing a value against 10, rather than matching a nonzero digit in the regex followed by at least one other digit. What the code above does is ask for the 1st group that was matched (match.group(1) returns the string that was matched by \d+), and then, with the call to int(), converts the string to an integer. The integer returned by int() is then compared against 10.
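Put together, the capture-then-compare approach might look like the sketch below. The function name row_labels_at_least is an assumption, and the check for a missing match is an addition: re.search returns None when the setting is absent, in which case calling group() would fail.

```python
import re

text = 'Format={ Window_Type="Tabular", Tabular={ Num_row_labels=10\n}\n}'

def row_labels_at_least(text, threshold):
    """True if Num_row_labels is present and its value is >= threshold."""
    m = re.search(r'Num_row_labels\s*=\s*(\d+)', text)
    # re.search returns None when the setting is missing entirely
    return m is not None and int(m.group(1)) >= threshold

print(row_labels_at_least(text, 10))  # True
print(row_labels_at_least(text, 11))  # False
```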
The regex is Num_row_labels=[1-9][0-9]{1}.*
Now you can use the Python re module (take a look here) to analyze your text and extract those.
The regex looks like:
Num_row_labels=[0-9]*[1-9][0-9]+
Example of usage:
if re.search(r'Num_row_labels=[0-9]*[1-9][0-9]+', line):
    print(line)
The regular expression [0-9]*[1-9][0-9]+ means that the string must contain at least
one digit from 1 to 9 ([1-9]; a character class [] in a regular expression matches any one symbol from the range given in the brackets);
and at least one digit from 0 to 9 after it (there can be more of them) ([0-9]+; the + sign means that the symbol or expression before it can be repeated 1 or more times).
Any other digits can come before these ([0-9]*, meaning any digit, 0 or more times). Once you have those two digits, any digits in front of them do not matter: the number is greater than or equal to 10 anyway.
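A quick way to convince yourself of this is to run the pattern over a few values, including ones with leading zeros. The sample values below are illustrative assumptions.

```python
import re

pattern = re.compile(r'Num_row_labels=[0-9]*[1-9][0-9]+')

values = ['9', '10', '007', '010', '250']
matches = {v: bool(pattern.search('Num_row_labels=' + v)) for v in values}
for value, found in matches.items():
    print(value, found)
# 9 False
# 10 True
# 007 False
# 010 True
# 250 True
```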

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False
It seems to check only the first group of three letters and ignore the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} three repetitions of the preceding expression
(...)* the group in the parentheses repeated 0 to n times
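As a sketch of how this pattern behaves on the examples from the question (the helper name is_valid and the extra test strings are assumptions):

```python
import re

TRIPLES = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid(text):
    """True if text consists only of comma-separated triples of a, b, c."""
    return TRIPLES.match(text) is not None

for s in ['abc,bca,cbb', 'bcb', 'abc,defx,df', 'abc,', 'ab']:
    print(repr(s), is_valid(s))
# 'abc,bca,cbb' True
# 'bcb' True
# 'abc,defx,df' False
# 'abc,' False
# 'ab' False
```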
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the string matches the pattern will be
data_string = "abc,bca,df"
# the trailing $ forces the match to run to the end of the string
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, at the start of the string; it does not care what follows. So the most important part is to make sure that there is nothing more in the string, as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute-force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, ' ', bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means the end of the string. Its presence forces the match to extend to the very end of the string.
By the way, I like Sonya's form too; in a way it is clearer:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example: re.findall, and all other regex methods that use it behind the scenes, return only the captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that the pattern has match groups ("To actually get the groups, use str.extract."), and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
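The re.findall effect is easy to reproduce with the standard library alone. In this sketch (the variable names are assumptions), the only difference between the two patterns is the ?: inside the group.

```python
import re

text = 'abc,bca'

# Capturing group: findall returns what the group captured, not the whole match.
with_capture = re.findall(r'([abc]{3},)*[abc]{3}', text)
# Non-capturing group: findall returns the whole match.
without_capture = re.findall(r'(?:[abc]{3},)*[abc]{3}', text)

print(with_capture)     # ['abc,']
print(without_capture)  # ['abc,bca']
```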
