I want to find a regular expression pattern that matches the characters between XYZ_ and the next underscore.
For example, I want to get 2M284904C4 from the string below.
XYZ_2M284904C4_20210201_120032.xyz
I tried XYZ_.*_, but it matches XYZ_2M284904C4_20210201_
Try string.split() at the underscores, and then select the item you want from the returned list. For example,
string = 'XYZ_2M284904C4_20210201_120032.xyz'
string_list = string.split('_')
result = string_list[1]
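If you would rather stay with a regular expression, note that the original attempt fails because .* is greedy and runs to the last underscore it can reach. A minimal sketch, assuming the wanted part never contains an underscore itself, uses a negated character class inside a capturing group (the variable names are only illustrative):

```python
import re

filename = 'XYZ_2M284904C4_20210201_120032.xyz'

# [^_]+ matches one or more characters that are not underscores,
# so the capture stops at the first underscore after "XYZ_".
match = re.search(r'XYZ_([^_]+)_', filename)
result = match.group(1) if match else None
print(result)  # 2M284904C4
```

A non-greedy quantifier, XYZ_(.*?)_, would give the same result for this string.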
I need to process text in Python and replace any occurrence of "[xz]" by "x", where "x" is the first letter enclosed in the brackets, and "z" can be a string of variable length. Note that I do not want the brackets in the output.
For example, "alEhos[cr#e]sjt" should become "alEhoscsjt"
I think re.sub() could be a way to go, but I am not sure how to implement it.
This will work for the example given.
import re
example = "alEhos[cr#e]sjt"
result = re.sub(r'(.*)\[(.).*\](.*)', r'\1\2\3', example)
print(result)
The regular expression uses three capturing groups. Groups 1 and 3 capture the text before and after the square brackets, and group 2 captures the first character inside the brackets. In the replacement string, \1, \2 and \3 refer back to those groups.
Output:
alEhoscsjt
If you have more than one occurrence of square brackets in your string, you can use the following:
example = "alEhos[cr#e]sjt[abc]xyz"
result = re.sub(r'\[(.).*?\]', r'\1', example)
print(result)
This version replaces all of the bracketed substrings (including brackets) by the first character found inside the brackets. (Note the use of the non-greedy qualifier to avoid consuming everything between the first [ and last ].)
Output:
alEhoscsjtaxyz
Instead of using the re.sub() method directly, you can use the re.findall() method to find all substrings (in a non-greedy fashion) that begin and end with square brackets.
Then iterate through the matches and use the str.replace() method to replace each match in the string with the second character of the match, which is the first character inside the brackets:
import re
s = "alEhos[cr#e]sjt"
for m in re.findall(r"\[.*?\]", s):
    s = s.replace(m, m[1])
print(s)
Output:
alEhoscsjt
You could use the split() method:
str1 = "alEhos[cr#e]sjt"
lst1 = str1.split("[")
lst2 = lst1[1].split("]")
print(lst1[0]+lst2[0][0]+lst2[1])
I have a regular expression to match all instances of 1 followed by a letter. I would like to remove all these instances.
EXPRESSION = re.compile(r"1([A-Z])")
I can use re.split.
result = EXPRESSION.split(input)
This would return a list. So we could do
result = ''.join(EXPRESSION.split(input))
to convert it back to a string.
or
result = EXPRESSION.sub('', input)
Are there any differences to the end result?
Yes, the results are different. Here is a simple example:
import re
EXPRESSION = re.compile(r"1([A-Z])")
s = 'hello1Aworld'
result_split = ''.join(EXPRESSION.split(s))
result_sub = EXPRESSION.sub('', s)
print('split:', result_split)
print('sub: ', result_sub)
Output:
split: helloAworld
sub: helloworld
The reason is that because of the capture group, EXPRESSION.split(s) includes the A, as noted in the documentation:
re.split = split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
When removing the capturing parentheses, i.e., using
EXPRESSION = re.compile(r"1[A-Z]")
then so far I have not found a case where result_split and result_sub are different, even after reading this answer to a similar question about regular expressions in JavaScript, and changing the replacement string from '' to '-'.
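One way to check this is to compare both approaches on a few inputs. The sketch below also uses a non-capturing group, (?:...), which keeps the grouping without adding the letter to the split result. The sample strings are made up for illustration and are not an exhaustive test.

```python
import re

capturing = re.compile(r"1([A-Z])")
non_capturing = re.compile(r"1(?:[A-Z])")

samples = ['hello1Aworld', '1Astart', 'end1Z', 'no match', '1A1B1C']

for s in samples:
    via_split = ''.join(non_capturing.split(s))
    via_sub = non_capturing.sub('', s)
    # Without a capturing group, both approaches drop the matched text.
    assert via_split == via_sub

# With the capturing group, split() keeps the captured letter.
print(capturing.split('hello1Aworld'))      # ['hello', 'A', 'world']
print(non_capturing.split('hello1Aworld'))  # ['hello', 'world']
```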
My string is of the form my_str = "2a1^ath67e22^yz2p0". I would like to split based on the pattern '^(any characters) and get ["2a1", "67e22", "2p0"]. The pattern could also appear in the front or the back part of the string, such as ^abc27e4 or 27c2^abc. I tried re.split("(.*)\^[a-z]{1,100}(.*)", my_str) but it only splits one of those patterns. I am assuming here that the number of characters appearing after ^ will not be larger than 100.
You don't need regex for simple string operations; you can use
my_list = my_str.split('^')
EDIT: Sorry, I just saw that you don't want to split only on the ^ character but also on the letters that follow it. For that you will need regex.
my_list = re.split(r'\^[a-z]+', my_str)
If the pattern is at the front or the end of the string, this will create an empty list element. You can remove such elements with
my_list = list(filter(None, my_list))
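Putting the two steps together, here is a small sketch run against the three inputs from the question. The helper name split_on_caret_words is hypothetical, and it assumes the text after ^ consists only of lowercase letters:

```python
import re

def split_on_caret_words(text):
    # Split on "^" followed by one or more lowercase letters,
    # then drop empty strings produced by a match at either end.
    return [part for part in re.split(r'\^[a-z]+', text) if part]

print(split_on_caret_words("2a1^ath67e22^yz2p0"))  # ['2a1', '67e22', '2p0']
print(split_on_caret_words("^abc27e4"))            # ['27e4']
print(split_on_caret_words("27c2^abc"))            # ['27c2']
```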
If you want to use the re library, you can split on '\^' alone. Note that this keeps the letters that follow each ^:
re.split(r'\^', my_str)
# output : ['2a1', 'ath67e22', 'yz2p0']
I have a list of words and am creating a regular expression like so:
((word1)|(word2)|(word3){1,3})
Basically I want to match a string that contains 1 - 3 of those words.
This works, however I want it to match the string only if the string contains words from the regex. For example:
((investment)|(property)|(something)|(else){1,3})
This should match the string investmentproperty but not the string abcinvestmentproperty. Likewise it should match somethinginvestmentproperty because all those words are in the regex.
How do I go about achieving that?
Thanks
You can use ^...$ to anchor the pattern: ^ marks the beginning and $ marks the end of the string you want to match. Also note that you need to add (...) around your whole group of words so that {1,3} applies to all of them, not just the last one:
^((investment)|(property)|(something)|(else)){1,3}$
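As a quick check in Python (the question does not say which regex engine is in use, so this is only one possible setting), the anchored pattern can be tried against the strings from the question:

```python
import re

# The anchored pattern from the answer, unchanged.
pattern = re.compile(r'^((investment)|(property)|(something)|(else)){1,3}$')

tests = [
    'investmentproperty',            # two listed words
    'abcinvestmentproperty',         # starts with text that is not in the list
    'somethinginvestmentproperty',   # three listed words
]

results = {text: bool(pattern.match(text)) for text in tests}
for text, matched in results.items():
    print(text, matched)
```

This prints True, False and True for the three strings. A string made of four listed words would not match either, because {1,3} allows at most three repetitions.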
So I have a string in the format of ABCD-EFGH-IJ where A through J are numbers 0-9 in a list of a ton of other strings. I have a regular expression identifying it, but how do I get it to also replace it with the format IJABCDEFGH?
You can use the following regular expression with substitution:
import re
s = '1234-5678-90'
print(re.sub(r'(\d{4})-(\d{4})-(\d{2})', r'\3\1\2', s))
Result:
9012345678
\3 refers to whatever was matched inside the third pair of parentheses. So \3\1\2 means: replace the match with the third group of digits, followed by the first group, followed by the second.
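Since the question mentions a list of many other strings, here is a minimal sketch that applies the same substitution to each element. The sample list is made up for illustration:

```python
import re

# Hypothetical sample data: only some entries contain the ABCD-EFGH-IJ pattern.
strings = ['id 1234-5678-90 ok', 'nothing here', '0000-1111-22']

pattern = re.compile(r'(\d{4})-(\d{4})-(\d{2})')

# sub() returns strings without a match unchanged.
converted = [pattern.sub(r'\3\1\2', s) for s in strings]
print(converted)  # ['id 9012345678 ok', 'nothing here', '2200001111']
```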