I have a regular expression to match all instances of 1 followed by a letter. I would like to remove all these instances.
EXPRESSION = re.compile(r"1([A-Z])")
I can use re.split.
result = EXPRESSION.split(input)
This would return a list. So we could do
result = ''.join(EXPRESSION.split(input))
to convert it back to a string.
or
result = EXPRESSION.sub('', input)
Are there any differences to the end result?
Yes, the results are different. Here is a simple example:
import re
EXPRESSION = re.compile(r"1([A-Z])")
s = 'hello1Aworld'
result_split = ''.join(EXPRESSION.split(s))
result_sub = EXPRESSION.sub('', s)
print('split:', result_split)
print('sub: ', result_sub)
Output:
split: helloAworld
sub: helloworld
The reason is that because of the capture group, EXPRESSION.split(s) includes the A, as noted in the documentation:
re.split = split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
When removing the capturing parentheses, i.e., using
EXPRESSION = re.compile(r"1[A-Z]")
then so far I have not found a case where result_split and result_sub are different, even after reading this answer to a similar question about regular expressions in JavaScript, and changing the replacement string from '' to '-'.
Related
I need to process text in Python and replace any occurrence of "[xz]" by "x", where "x" is the first letter enclosed in the brackets, and "z" can be a string of variable length. Note that I do not want the brackets in the output.
For example, "alEhos[cr#e]sjt" should become "alEhoscsjt"
I think re.sub() could be a way to go, but I am not sure how to implement it.
This will work for the example given.
import re
example = "alEhos[cr#e]sjt"
result = re.sub(r'(.*)\[(.).*\](.*)', r'\1\2\3', example)
print(result)
The regular expression uses three capturing groups. \1 and \3 capture the text before and after the square brackets. \2 captures the first character inside the bracket.
Output:
alEhoscsjt
If you have more than one occurrence of square brackets in your string, you can use the following:
example = "alEhos[cr#e]sjt[abc]xyz"
result = re.sub(r'\[(.).*?\]', r'\1', example)
print(result)
This version replaces all of the bracketed substrings (including brackets) by the first character found inside the brackets. (Note the use of the non-greedy qualifier to avoid consuming everything between the first [ and last ].)
Output:
alEhoscsjtaxyz
Instead of directly using the re.sub() method, you can use the re.findall() method to find all substrings (in a non-greedy fashion) that begins and ends with the proper square brackets.
Then, iterate through the matches and use the str.replace() method to replace each match in the string with the second character in the match:
import re
s = "alEhos[cr#e]sjt"
for m in re.findall("\[.*?\]", s):
s = s.replace(m, m[1])
print(s)
Output:
alEhoscsjt
You could use the split() method:
str1 = "alEhos[cr#e]sjt"
lst1 = str1.split("[")
lst2 = lst1[1].split("]")
print(lst1[0]+lst2[0][0]+lst2[1])
I want to find regular expression pattern to match digits between XYZ_ and first underscore.
Example want to get 2M284904C4 from below string.
XYZ_2M284904C4_20210201_120032.xyz
I tried XYZ_.*_,but it matches XYZ_2M284904C4_20210201_
Try string.split() at the underscores, and then select the item you want from the returned list. For example,
string = 'XYZ_2M284904C4_20210201_120032.xyz'
string_list = string.split('_')
result = string_list[1]
So I have a string in the format of ABCD-EFGH-IJ where A through J are numbers 0-9 in a list of a ton of other strings. I have a regular expression identifying it, but how do I get it to also replace it with the format IJABCDEFGH?
You can use the following regular expression with substitution:
import re
s = '1234-5678-90'
print re.sub(r'(\d{4})-(\d{4})-(\d{2})', r'\3\1\2', s)
Result:
9012345678
\3 matches the content of what inside the third pair of parentheses. So \3\1\2 means to replace with the third group of your numbers, followed by the first followed by the second.
I am trying to get first pair of numbers from "09_135624.jpg"
My code now:
import re
string = "09_135624.jpg"
pattern = r"(?P<pair>(.*))_135624.jpg"
match = re.findall(pattern, string)
print match
Output:
[('09', '09')]
Why I have tuple in output?
Can you help me modify my code to get this:
['09']
Or:
'09'
re.findall returns differently according to the number of capturing group in the pattern:
>>> re.findall(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
['09']
According to the documentation:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Alternative using re.search:
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg")
<_sre.SRE_Match object at 0x00000000025D0D50>
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group('pair')
'09'
>>> re.search(r"(?P<pair>.*)_135624\.jpg", "09_135624.jpg").group(1)
'09'
UPDATE
To match . literally, you need to escape it: \..
(?P<pair>(?:.*))_135624.jpg
Try this. You are getting two results because you are capturing them twice. I have modified it to capture only once:
http://regex101.com/r/lS5tT3/62
Such as "example123" would be 123, "ex123ample" would be None, and "123example" would be None.
You can use regular expressions from the re module:
import re
def get_trailing_number(s):
m = re.search(r'\d+$', s)
return int(m.group()) if m else None
The r'\d+$' string specifies the expression to be matched and consists of these special symbols:
\d: a digit (0-9)
+: one or more of the previous item (i.e. \d)
$: the end of the input string
In other words, it tries to find one or more digits at the end of a string. The search() function returns a Match object containing various information about the match or None if it couldn't match what was requested. The group() method, for example, returns the whole substring that matched the regular expression (in this case, some digits).
The ternary if at the last line returns either the matched digits converted to a number or None, depending on whether the Match object is None or not.
I'd use a regular expression, something like /(\d+)$/. This will match and capture one or more digits, anchored at the end of the string.
Read about regular expressions in Python.
Oops, correcting (sorry, I missed the point)
you should do something like this ;)
Import the RE module
import re
Then write a regular expression, "searching" for an expression.
s = re.search("[a-zA-Z+](\d{3})$", "string123")
This will return "123" if match or NoneType if not.
s.group(0)