Using regular expressions to manipulate strings - python

So I have a string in the format ABCD-EFGH-IJ, where A through J are digits 0-9, in a list containing many other strings. I have a regular expression that identifies it, but how do I also rewrite it in the format IJABCDEFGH?

You can use the following regular expression with substitution:
import re
s = '1234-5678-90'
print(re.sub(r'(\d{4})-(\d{4})-(\d{2})', r'\3\1\2', s))
Result:
9012345678
In the replacement string, \3 refers to the text captured by the third pair of parentheses. So \3\1\2 means: replace the match with the third group of digits, followed by the first group, followed by the second.
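Since the question mentions a list of many strings, here is a minimal sketch of applying the same substitution across a list. The names PATTERN, reorder, and strings are made up for illustration, and the \b word boundaries are an added assumption to avoid matching inside longer runs of digits.

```python
import re

# Compile once so the pattern can be reused for every string in the list.
# The \b anchors are an assumption, not part of the original answer.
PATTERN = re.compile(r'\b(\d{4})-(\d{4})-(\d{2})\b')

def reorder(text):
    """Rewrite every ABCD-EFGH-IJ occurrence in text as IJABCDEFGH."""
    return PATTERN.sub(r'\3\1\2', text)

# Hypothetical sample data.
strings = ['1234-5678-90', 'no match here', 'id: 0000-1111-22 ok']
print([reorder(s) for s in strings])
# ['9012345678', 'no match here', 'id: 2200001111 ok']
```

Strings that do not contain the pattern are returned unchanged, so no separate matching step is needed before substituting.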

Related

How can I replace a string match with part of itself in Python?

I need to process text in Python and replace any occurrence of "[xz]" by "x", where "x" is the first letter enclosed in the brackets, and "z" can be a string of variable length. Note that I do not want the brackets in the output.
For example, "alEhos[cr#e]sjt" should become "alEhoscsjt"
I think re.sub() could be a way to go, but I am not sure how to implement it.
This will work for the example given.
import re
example = "alEhos[cr#e]sjt"
result = re.sub(r'(.*)\[(.).*\](.*)', r'\1\2\3', example)
print(result)
The regular expression uses three capturing groups. \1 and \3 capture the text before and after the square brackets. \2 captures the first character inside the bracket.
Output:
alEhoscsjt
If you have more than one occurrence of square brackets in your string, you can use the following:
example = "alEhos[cr#e]sjt[abc]xyz"
result = re.sub(r'\[(.).*?\]', r'\1', example)
print(result)
This version replaces all of the bracketed substrings (including brackets) by the first character found inside the brackets. (Note the use of the non-greedy qualifier to avoid consuming everything between the first [ and last ].)
Output:
alEhoscsjtaxyz
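As an alternative to the non-greedy qualifier, a negated character class can keep each match inside a single pair of brackets. This is only a sketch; BRACKETED and keep_first_char are illustrative names, and leaving empty brackets untouched is an assumption about the desired behavior.

```python
import re

# [^\]] matches any character except a closing bracket, so a match can
# never run past the first "]" that follows the opening "[".
BRACKETED = re.compile(r'\[([^\]])[^\]]*\]')

def keep_first_char(text):
    """Replace each [xz] with x, the first character inside the brackets."""
    return BRACKETED.sub(r'\1', text)

print(keep_first_char("alEhos[cr#e]sjt[abc]xyz"))  # alEhoscsjtaxyz
print(keep_first_char("a[]b[c]d"))                 # a[]bcd
```

The second call shows where the two patterns differ: with the non-greedy version, the "." after "[" can match the "]" of an empty pair, so the match would extend to the next closing bracket.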
Instead of using the re.sub() method directly, you can use the re.findall() method to find all substrings (in a non-greedy fashion) that begin and end with square brackets.
Then, iterate through the matches and use the str.replace() method to replace each match in the string with the second character of the match (the first character inside the brackets):
import re
s = "alEhos[cr#e]sjt"
for m in re.findall(r"\[.*?\]", s):
    s = s.replace(m, m[1])
print(s)
Output:
alEhoscsjt
You could use the split() method:
str1 = "alEhos[cr#e]sjt"
lst1 = str1.split("[")
lst2 = lst1[1].split("]")
print(lst1[0]+lst2[0][0]+lst2[1])

python regular expression pattern to get digits in between_

I want to find a regular expression pattern that matches the characters between XYZ_ and the next underscore.
For example, I want to get 2M284904C4 from the string below.
XYZ_2M284904C4_20210201_120032.xyz
I tried XYZ_.*_, but it matches XYZ_2M284904C4_20210201_
Try string.split() at the underscores, and then select the item you want from the returned list. For example,
string = 'XYZ_2M284904C4_20210201_120032.xyz'
string_list = string.split('_')
result = string_list[1]
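If a regular expression is still preferred, the attempted pattern fails because .* is greedy and runs to the last underscore. A minimal sketch of one fix uses a negated character class so the match stops at the first underscore; the variable names here are illustrative.

```python
import re

name = 'XYZ_2M284904C4_20210201_120032.xyz'

# [^_]+ matches one or more characters that are not underscores,
# so the capture ends at the first underscore after XYZ_.
match = re.search(r'XYZ_([^_]+)_', name)
token = match.group(1) if match else None
print(token)  # 2M284904C4
```

A lazy quantifier, as in XYZ_(.*?)_, would give the same result for this input.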

Regular Expression replacement in Python

I have a regular expression to match all instances of 1 followed by a letter. I would like to remove all these instances.
EXPRESSION = re.compile(r"1([A-Z])")
I can use re.split.
result = EXPRESSION.split(input)
This would return a list. So we could do
result = ''.join(EXPRESSION.split(input))
to convert it back to a string.
or
result = EXPRESSION.sub('', input)
Are there any differences to the end result?
Yes, the results are different. Here is a simple example:
import re
EXPRESSION = re.compile(r"1([A-Z])")
s = 'hello1Aworld'
result_split = ''.join(EXPRESSION.split(s))
result_sub = EXPRESSION.sub('', s)
print('split:', result_split)
print('sub: ', result_sub)
Output:
split: helloAworld
sub: helloworld
The reason is that because of the capture group, EXPRESSION.split(s) includes the A, as noted in the documentation:
re.split = split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
When removing the capturing parentheses, i.e., using
EXPRESSION = re.compile(r"1[A-Z]")
then so far I have not found a case where result_split and result_sub are different, even after reading this answer to a similar question about regular expressions in JavaScript, and changing the replacement string from '' to '-'.
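The difference can be checked directly. The sketch below compares the two patterns on a few made-up inputs; agreement on these samples illustrates the observation above but is not a proof that split and sub always agree.

```python
import re

CAPTURING = re.compile(r"1([A-Z])")
NON_CAPTURING = re.compile(r"1[A-Z]")

# With a capture group, split() keeps the captured text in the list.
print(CAPTURING.split('hello1Aworld'))      # ['hello', 'A', 'world']
print(NON_CAPTURING.split('hello1Aworld'))  # ['hello', 'world']

# Hypothetical sample inputs for comparing the two removal approaches.
samples = ['hello1Aworld', '1A1B', 'no digits', '']
for s in samples:
    joined = ''.join(NON_CAPTURING.split(s))
    substituted = NON_CAPTURING.sub('', s)
    print(repr(s), joined == substituted)
```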

python regex get value after string

I am trying to parse a comma separated string keyword://pass#ip:port.
The string is a comma separated string, however the password can contain any character including comma. hence I can not use a split operation based on comma as delimiter.
I have tried to use a regex to get the string after "myserver://", so that I can later split the rest of the information (pass#ip:port/key1) with string operations, but I could not get it working because I cannot fetch the information after that keyword.
myserver:// is a hardcoded string, and I need to get whatever follows each myserver as a comma separated list (i.e. pass#ip:port/key1, pass2#ip2:port2/key2, etc)
This is the closest I can get:
import re
my_servers="myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
result = re.search(r'myserver:\/\/(.*)[,(.*)|\s]', my_servers)
Using search, I try to find each occurrence of the "myserver://" keyword followed by any characters and ending with either a comma (meaning another entry follows, as in myserver://zzz,myserver://qqq) or a space (in the case of a single myserver:// element; I do not know a better end indicator than a space). However, this does not come out right. How can I do this better with regex?
You may consider the following splitting approach if you do not need to keep myserver:// in the results:
list(filter(None, re.split(r'\s*,?\s*myserver://', s)))
The \s*,?\s*myserver:// pattern matches an optional , enclosed with 0+ whitespaces and then myserver:// substring. See this regex demo. Note we need to remove empty entries to get rid of an empty leading entry as when the match is found at the string start, the empty string at the beginning will be added to the resulting list.
Alternatively, you can use the lookahead based pattern with a lazy dot matching pattern with re.findall:
rx = r"myserver://(.*?)(?=\s*,\s*myserver://|$)"
See the Python demo
Details:
myserver:// - a literal substring
(.*?) - Capturing group 1 whose contents will be returned by re.findall matching any 0+ chars other than line break chars, as few as possible, up to the first occurrence (but excluding it)
(?=\s*,\s*myserver://|$) - either of the 2 alternatives:
\s*,\s*myserver:// - , enclosed with 0+ whitespaces and then a literal myserver:// substring
| - or
$ - end of string.
Here is the regex demo.
See a Python demo for both approaches:
import re
s = "myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
rx1 = r'\s*,?\s*myserver://'
res1 = list(filter(None, re.split(rx1, s)))  # list() so that Python 3 prints a list
print(res1)
#or
rx2 = r"myserver://(.*?)(?=\s*,\s*myserver://|$)"
res2 = re.findall(rx2, s)
print(res2)
Both will print ['password,123#ip:port/key1', 'pass2#ip2:port2/key2'].
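The question also mentions splitting each entry further with string operations. A possible sketch follows; parse_servers and the dictionary keys are hypothetical names, and it assumes the last "#" in an entry separates the password from ip:port/key, which holds only if the part after the password contains no "#".

```python
import re

# Same lookahead-based pattern as above.
ENTRY = re.compile(r"myserver://(.*?)(?=\s*,\s*myserver://|$)")

def parse_servers(text):
    """Split each myserver:// entry into password, host, port, and key."""
    parsed = []
    for entry in ENTRY.findall(text):
        # rpartition splits at the LAST '#', so '#' may appear in the password.
        password, _, rest = entry.rpartition('#')
        host_port, _, key = rest.partition('/')
        host, _, port = host_port.partition(':')
        parsed.append({'password': password, 'host': host,
                       'port': port, 'key': key})
    return parsed

s = "myserver://password,123#ip:port/key1,myserver://pass2#ip2:port2/key2"
for server in parse_servers(s):
    print(server)
```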

Allowing escape sequences in my regular expression

I'm trying to create a regular expression which finds occurrences of $VAR or ${VAR}. If something like \$VAR or \${VAR} is given, it should not match. If it is given something like \\$VAR or \\${VAR}, or any other even number of backslashes, it should match.
i.e.
$BLOB matches
\$BLOB doesn't match
\\$BLOB matches
\\\$BLOB doesn't match
\\\\$BLOB matches
... etc
I'm currently using the following regex:
line = re.sub("[^\\][\\\\]*\$(\w[^-]+)|"
"[^\\][\\\\]*\$\{(\w[^-]+)\}",replace,line)
However, this doesn't work properly. When I give it \$BLOB, it still matches for some reason. Why is this?
The second group of backslashes is written as a redundant character class, [\\\\]*, which matches zero or more single backslashes. It should be a repeated group, ((?:\\\\)*), which matches zero or more pairs of backslashes:
re.sub(r'(?<!\\)((?:\\\\)*)\$(\w[^-]+|\{(\w[^-]+)\})',r'\1' + replace, line)
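To see the difference between the character class and the repeated group in isolation, here is a small sketch using raw-string patterns and fullmatch. The variable names are illustrative only.

```python
import re

one = '\\'        # a string holding a single backslash
three = one * 3   # three backslashes

char_class = re.compile(r'[\\\\]*')   # a class listing the backslash twice
pair_group = re.compile(r'(?:\\\\)*') # zero or more PAIRS of backslashes

print(bool(char_class.fullmatch(one)))      # True: any count is accepted
print(bool(pair_group.fullmatch(one)))      # False: odd count
print(bool(pair_group.fullmatch(one * 2)))  # True: one pair
print(bool(pair_group.fullmatch(three)))    # False: odd count
```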
To write a regular expression that finds $ unless it is escaped with E, where E itself can be escaped as EE:
import re
values = dict(BLOB='some value')
def repl(m):
    return m.group('before') + values[m.group('name').strip('{}')]
regex = r"(?<!E)(?P<before>(?:EE)*)\$(?P<name>N|\{N\})"
regex = regex.replace('E', re.escape('\\'))
regex = regex.replace('N', r'\w+') # name
line = re.sub(regex, repl, line)
Using the placeholder E instead of '\\\\' lets you express the escaping rules of your embedded language without having to think about backslashes in Python string literals and regular expression patterns.
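A self-contained sketch of the same idea, written with the backslashes spelled out and checked against the cases from the question. VALUES, VAR, and expand are illustrative names; an unknown variable name would raise KeyError in this version.

```python
import re

# Hypothetical variable table used for the substitution.
VALUES = {'BLOB': 'some value'}

# (?<!\\)      : not preceded by another backslash
# (?:\\\\)*    : zero or more pairs of backslashes, kept in "before"
# \$           : the dollar sign itself
# \w+|\{\w+\}  : NAME or {NAME}
VAR = re.compile(r"(?<!\\)(?P<before>(?:\\\\)*)\$(?P<name>\w+|\{\w+\})")

def expand(line):
    """Replace unescaped $NAME and ${NAME} with values from VALUES."""
    def repl(m):
        # A function replacement avoids backslash processing
        # that a replacement string would go through.
        return m.group('before') + VALUES[m.group('name').strip('{}')]
    return VAR.sub(repl, line)

for text in ["$BLOB", r"\$BLOB", r"\\$BLOB", r"\\\$BLOB", r"\\\\$BLOB"]:
    print(text, '->', expand(text))
```

Inputs with an odd number of leading backslashes come back unchanged, while those with an even number have the variable expanded and the backslashes preserved.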
