Regular Expressions using Substitution to convert numbers - python

I'm a Python beginner, so keep in mind my regex skills are level -122.
I need to convert a string with text containing file1 to file01, but not convert file10 to file010.
My program is wrong, but this is the closest I can get. I've tried dozens of combinations without getting any closer:
import re
txt = 'file8, file9, file10'
pat = r"[0-9]"
regexp = re.compile(pat)
print(regexp.sub(r"0\d", txt))
Can someone tell me what's wrong with my pattern and substitution and give me some suggestions?

You could capture the number and check the length before adding 0, but you might be able to use this instead:
import re
txt = 'file8, file9, file10'
pat = r"(?<!\d)(\d)(?=,|$)"
regexp = re.compile(pat)
print(regexp.sub(r"0\1", txt))
regex101 demo
(?<! ... ) is called a negative lookbehind. It prevents a match if the pattern inside the lookbehind matches immediately before the current position. For example, (?<!a)b matches every b in a string except one that has an a directly before it: the b in cb and both b's in bb match, but the b in ab doesn't. (?<!\d)(\d) thus matches a digit, unless it has another digit before it.
(\d) is a single digit enclosed in a capture group, denoted by plain parentheses. The matched digit is stored in the first capture group, which the replacement refers to as \1.
(?= ... ) is a positive lookahead. The pattern before it matches only if the pattern inside the lookahead matches immediately after it. In other words, a(?=b) matches an a only if there's a b right after it. The a in ab matches, but the a in ac or aa doesn't.
(?=,|$) is a positive lookahead containing ,|$ meaning either a comma, or the end of the string.
(?<!\d)(\d)(?=,|$) thus matches any digit, as long as there's no digit before it and there's a comma after it, or if that digit is at the end of the string.
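To check the lookarounds in isolation, here is a minimal sketch that wraps the pattern above in a helper (pad_single_digits is a hypothetical name, not from the answer). The second pattern, which uses a negative lookahead (?!\d) in place of (?=,|$), is an assumed generalisation for input where a number may be followed by something other than a comma.

```python
import re

# Pattern from the answer: a lone digit followed by a comma or the end of the string.
comma_pat = re.compile(r"(?<!\d)(\d)(?=,|$)")
# Assumed variant: a lone digit with no digit on either side.
lone_pat = re.compile(r"(?<!\d)(\d)(?!\d)")

def pad_single_digits(text, pattern=comma_pat):
    """Prefix every digit matched by `pattern` with a zero."""
    return pattern.sub(r"0\1", text)

first = pad_single_digits("file8, file9, file10")
second = pad_single_digits("file1 and file2")
third = pad_single_digits("file1 and file2", lone_pat)
print(first)   # file08, file09, file10
print(second)  # file1 and file02
print(third)   # file01 and file02
```

The second call shows the limit of (?=,|$): file1 is followed by a space, so it is left alone.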

how about?
a='file1'
a='file' + "%02d" % int(a.split('file')[1])

This approach uses a regex to find every sequence of digits and str.zfill to pad with zeros:
>>> import re
>>> txt = 'file8, file9, file10'
>>> re.sub(r'\d+', lambda m : m.group().zfill(2), txt)
'file08, file09, file10'
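If the padding width may vary, the same idea can be wrapped in a small function. This is only a sketch; pad_numbers and its width parameter are hypothetical names introduced for illustration.

```python
import re

def pad_numbers(text, width=2):
    """Zero-pad every run of digits in `text` to at least `width` characters."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), text)

two = pad_numbers("file8, file9, file10")
three = pad_numbers("file8, file9, file10", width=3)
print(two)    # file08, file09, file10
print(three)  # file008, file009, file010
```

str.zfill never truncates, so numbers that are already wide enough pass through unchanged.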

Related

Searching for a pattern in a sentence with regex in python

I want to capture the digits that follow a certain phrase and also the start and end index of the number of interest.
Here is an example:
text = 'The special code is 034567 in this particular case and not 98675'
In this example, I am interested in capturing the number 034567, which comes after the phrase special code, and also the start and end index of the number 034567.
My code is:
p = re.compile('special code \s\w.\s (\d+)')
re.search(p, text)
But this does not match anything. Could you explain why and how I should correct it?
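Your expression requires a literal space followed by any whitespace (\s) after code, so two whitespace characters in a row. Then \w. matches one word character plus any character other than a line break, and \s followed by a literal space again requires two whitespace characters before the digits. The text has only single spaces, so nothing matches.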
You may simply match any 1+ whitespaces using \s+ between words, and to match any chunk of non-whitespaces, instead of \w., you may use \S+.
Use
import re
text = 'The special code is 034567 in this particular case and not 98675'
p = re.compile(r'special code\s+\S+\s+(\d+)')
m = p.search(text)
if m:
    print(m.group(1)) # 034567
    print(m.span(1)) # (20, 26)
See the Python demo and the regex demo.
Use re.findall with a capture group:
import re
text = "The special code is 034567 in this particular case and not 98675"
matches = re.findall(r'\bspecial code (?:\S+\s+)?(\d+)', text)
print(matches)
This prints:
['034567']
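re.findall returns only the captured strings, so it does not give the start and end index the question asks for. One way to get both, sketched here with the same pattern, is re.finditer, which yields match objects:

```python
import re

text = "The special code is 034567 in this particular case and not 98675"
pattern = re.compile(r"\bspecial code (?:\S+\s+)?(\d+)")

# Collect (digits, start, end) for each match; the indices refer to group 1 only.
hits = [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]
print(hits)  # [('034567', 20, 26)]
```

The end index is exclusive, so text[20:26] gives back the digits.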

Regex : matching integers inside of brackets

I am trying to strip the bracketed suffix from strings such as version = 10.9.8[35]. I want to replace the bracketed integer (all of [35], including the brackets) with an empty string using the regex [\[+0-9*\]+], but this also matches numbers not surrounded by brackets. Am I not using the + quantifier properly?
You could match the format of the version number and then match one or more digits between square brackets.
In the replacement, use the first capturing group, r'\1', to keep only the version number.
\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]
\b Word boundary
( Capture group 1
[0-9]+ Match 1+ digits
(?:\.[0-9]+)+ Match a . and 1+ digits and repeat that 1 or more times
) Close group
\[[0-9]+\] Match 1+ digits between square brackets
Regex demo
For example
import re
regex = r"\b([0-9]+(?:\.[0-9]+)+)\[[0-9]+\]"
test_str = "version = 10.9.8[35]"
result = re.sub(regex, r'\1', test_str)
print(result)
Output
version = 10.9.8
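As for why the original attempt over-matched: the outer [ ... ] turns the whole expression into a single character class, so the + and * inside it are literal characters rather than quantifiers, and the class matches any one bracket, digit, + or *. A short sketch comparing it with a plain bracketed-digits pattern (the variable names are only for illustration):

```python
import re

test_str = "version = 10.9.8[35]"

# The pattern from the question: one character class, matched one char at a time.
char_class = re.findall(r"[\[+0-9*\]+]", test_str)
# A literal "[", one or more digits, then a literal "]".
bracketed = re.findall(r"\[[0-9]+\]", test_str)

print(char_class)  # ['1', '0', '9', '8', '[', '3', '5', ']']
print(bracketed)   # ['[35]']
```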
No need for regex
s = '10.9.8[35]'
t = s[:s.rfind("[")]
print(t)
But if you insist ;-)
import re
s = '10.9.8[35]'
t = re.sub(r"^(.*?)[[]\d+[]]$", r"\1", s)
print(t)
Breakdown of regex:
^ - begins with
() - Capture Group 1 you want to keep
.*? - Any number of chars (non-greedy)
[[] - an opening [
\d+ 1+ digit
[]] - closing ]
$ - ends with
\1 - capture group 1 - used in replace part of regex replace. The bit you want to keep.
Output in both cases:
10.9.8
Use regex101.com to familiarise yourself more. If you click on any of the regex samples at bottom right of the website, it will give you more info. You can also use it to generate regex code in a variety of languages too. (not good for Java though!).
There's also a great series of Python regex videos on Youtube by PyMoondra.
A simpler regex solution:
import re
pattern = re.compile(r'\[\d+\]$')
s = '10.9.8[35]'
r = pattern.sub('', s)
print(r) # 10.9.8
The pattern matches square brackets containing one or more digits at the end of the string. sub then replaces the brackets and digits with an empty string.
If you want to keep the number from inside the square brackets, change the pattern and replacement like this:
import re
pattern = re.compile(r'\[(\d+)\]$')
s = '10.9.8[35]'
r = pattern.sub(r'.\1', s)
print(r) # 10.9.8.35
Alternatively, as the other answer says, you can just find the bracket and slice the string to get rid of it.
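One thing to watch with the slicing approach: str.rfind returns -1 when there is no "[", so s[:s.rfind("[")] would silently drop the last character of an unbracketed string. A minimal sketch of a guarded version follows; strip_bracket_suffix is a hypothetical helper name, and it assumes the suffix to remove is digits only.

```python
def strip_bracket_suffix(s):
    """Remove a trailing "[digits]" suffix; return s unchanged otherwise."""
    idx = s.rfind("[")
    # Only slice when a "[" exists, the string ends in "]",
    # and everything between the two brackets is digits.
    if idx != -1 and s.endswith("]") and s[idx + 1:-1].isdigit():
        return s[:idx]
    return s

with_suffix = strip_bracket_suffix("10.9.8[35]")
without_suffix = strip_bracket_suffix("10.9.8")
print(with_suffix)     # 10.9.8
print(without_suffix)  # 10.9.8
```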

Regular expression: alphanumerics without pure numerics

I need to extract the names of variables from a function string.
A variable can be [a-zA-Z0-9]+ but not a real number notated like 1, 3.5, 1e4, 1e5...
Is there a smart way of doing this?
Here's a M(not)WE in python:
import re
pattern = r"[a-zA-z0-9.]+"
function_string = "(A+B1)**2.5"
re.findall(pattern, function_string)
The above code returns:
A, B1 and 2.5.
My desired output is
A and B1.
And here's a nice way of testing the regular expressions:
https://regex101.com/r/fv0DfR/1
import re
pattern = r'[a-zA-Z_][a-zA-Z0-9_]{0,31}'
function_string = "(A+B1)**2.5"
print(re.findall(pattern, function_string))
OUTPUT:
['A', 'B1']
Try this Regex:
\b(?!\d)[a-zA-Z0-9]+
Click for Demo
Explanation:
\b - matches a word boundary
(?!\d) - negative lookahead to make sure that the next character is not a digit. This will make sure that the variable name does not start with a digit. Will also exclude words like 1e3
[a-zA-Z0-9]+ - matches 1+ letters or digits
If you also want variables that start with a digit but are still alphanumeric (such as 2abc), you can use \b(?!\d+(?:[eE]\d+)?\b)[a-zA-Z0-9]+
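A quick way to see what the second pattern accepts and rejects is to run it over an expression that mixes names, decimals and exponent notation. The expression below is made up for illustration and is not from the question.

```python
import re

pattern = re.compile(r"\b(?!\d+(?:[eE]\d+)?\b)[a-zA-Z0-9]+")

# Hypothetical test expression: 2.5 and 1e4 are numbers, 3y starts with a digit.
expr = "(A+B1)**2.5 + 1e4*x2 - 3y"
names = pattern.findall(expr)
print(names)  # ['A', 'B1', 'x2', '3y']
```

The lookahead rejects a token only when it is entirely digits, optionally followed by an exponent, so 3y survives while 2, 5 and 1e4 are skipped.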

python regular expression : how to remove all punctuation characters from a string but keep those between numbers?

I am working on a Chinese NLP project. I need to remove all punctuation characters except those between numbers, keeping only Chinese characters (\u4e00-\u9fff) and alphanumeric characters (0-9a-zA-Z). For example, the hyphen in 12-34 should be kept, while the equals sign after 123 should be removed.
Here is my python script.
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.sub(u'(?<=[^0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[^0-9])','',s)
print(res)
the expected output should be
中国中国foo中国bar中123国中国12-34中国
but the result is
中国中国foo中国bar中123=国中国12-34中国
I can't figure out why there is an extra equal sign in the output?
Your regex removes a run of punctuation only when both lookarounds succeed: the lookbehind (?<=[^0-9]) needs a non-digit before the run and the lookahead (?=[^0-9]) needs a non-digit after it. The "=" is preceded by 3, so the lookbehind fails, nothing matches, and the character is kept. This means your code actually keeps any non-alphanumeric, non-Chinese characters that have a digit on either side, not only those between two digits.
You can try the following regex:
u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))'
You can use it as such:
import re
s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
res = re.findall(u'([\u4e00-\u9fff0-9a-zA-Z]|(?<=[0-9])[^\u4e00-\u9fff0-9a-zA-Z]+(?=[0-9]))',s)
print(''.join(res))
I suggest matching and capturing these characters in between digits (to restore them later in the output), and just match them in other contexts.
In Python 2, it will look like
import re
s = u"中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"
pat_block = u'[^\u4e00-\u9fff0-9a-zA-Z]+'
pattern = u'([0-9]+{0}[0-9]+)|{0}'.format(pat_block)
res = re.sub(pattern, lambda x: x.group(1) if x.group(1) else u"" ,s)
print(res.encode("utf8")) # => 中国中国foo中国bar中123国中国12-34中国
See the Python demo
If you need to preserve those symbols inside any Unicode digits, you need to replace [0-9] with \d and pass the re.UNICODE flag to the regex.
The regex will look like
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+)|[^\u4e00-\u9fff0-9a-zA-Z]+
It works like this:
([0-9]+[^\u4e00-\u9fff0-9a-zA-Z]+[0-9]+) - Group 1 capturing
[0-9]+ - 1+ digits
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
[0-9]+ - 1+ digits
| - or
[^\u4e00-\u9fff0-9a-zA-Z]+ - 1+ chars other than those defined in the specified ranges
In Python 2.x, when a group does not participate in the match, x.group(1) is None and a plain \1 backreference in the replacement string fails, which is why a lambda expression is required to check whether Group 1 matched first.
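For reference, a sketch of the same approach ported to Python 3, where str is already Unicode, so the u prefixes and the encode call are not needed. The variable name keep is introduced here for illustration; the lambda is kept so the behaviour does not depend on how unmatched groups are handled in the replacement.

```python
import re

s = "中国,中,。》%国foo中¥国bar#中123=国%中国12-34中国"

# Characters to keep: CJK unified ideographs plus ASCII letters and digits.
keep = "\u4e00-\u9fff0-9a-zA-Z"
# Group 1: digits, other characters, digits (restored as-is).
# Second alternative: any other run of unwanted characters (dropped).
pattern = re.compile("([0-9]+[^{0}]+[0-9]+)|[^{0}]+".format(keep))

res = pattern.sub(lambda m: m.group(1) or "", s)
print(res)  # 中国中国foo中国bar中123国中国12-34中国
```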

Add [] around numbers in strings

I'd like to add [] around any sequence of digits in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way but it's eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here (\d+) matches any run of one or more digits, and re.sub replaces each run with the first group's match wrapped in brackets, r'[\1]'.
You can start here to learn regular expression http://www.regular-expressions.info/
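As a small variation on the patterns above, the capture group can be left out, because the replacement template can refer to the entire match as \g<0>. A minimal sketch:

```python
import re

inputstring = "pixel1blue pin10off output2high foo9182bar"
# \g<0> stands for the entire match, so the pattern needs no parentheses.
result = re.sub(r"\d+", r"[\g<0>]", inputstring)
print(result)  # pixel[1]blue pin[10]off output[2]high foo[9182]bar
```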
