Python regex multiline replacement - python

I searched existing questions but they do not seem to answer this specific question.
I have the following Python program:
import re
description = """\
before
{cs:id=841398|rep=myrepo}: after
"""
pattern = re.compile(r"(.*)\{cs:id=(.*)\|rep=(.*)\}(.*)")
I need to replace the text matched by the regex in description so that it looks like the following, but I can't get the pattern and replacement syntax right:
description="""\
before
841398 : after
"""
The crucible.app.com:9090 is a constant that I have beforehand so I basically need to substitute the pattern with my replacement.
Can someone show me what is the best python regex find and replace syntax for this?

There is no need for the first and last (.*) in your pattern. To write captured groups back in the replacement string, use \1 and \2. Put them in a raw string, otherwise Python turns "\1" into the control character \x01 before the regex engine ever sees it:
description = re.sub(pattern, r"\1", description)
By the way, another way to improve your pattern (performance- and robustness-wise) is to make the inner repetitions more explicit so that they cannot accidentally go past the | or }:
pattern = re.compile(r"\{cs:id=([^|]*)\|rep=([^}]*)\}")
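To see the tightened pattern in action, here is a minimal sketch that uses the sample text from the question. The variable name result is only illustrative and does not appear in the original post.

```python
import re

description = """\
before
{cs:id=841398|rep=myrepo}: after
"""

# Negated character classes stop each group at "|" or "}" respectively,
# so a group cannot run past its own delimiter.
pattern = re.compile(r"\{cs:id=([^|]*)\|rep=([^}]*)\}")

# The raw string keeps the backslash in \1 intact for the regex engine.
result = pattern.sub(r"\1", description)
print(result)
```

Only the {cs:...} tag is replaced; the surrounding lines are left alone because the pattern no longer has a leading or trailing (.*).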
You can also use named groups:
pattern = re.compile(r"\{cs:id=(?P<id>[^|]*)\|rep=(?P<rep>[^}]*)\}")
And then in the replacement string (again as a raw string):
r"\g<id>"

Use re.sub / RegexObject.sub:
>>> pattern = re.compile(r"{cs:id=(.*?)\|rep=(.*?)}")
>>> description = pattern.sub(r'\1', description)
>>> print(description)
before
841398: after
\1, \2 refer to matched group 1, 2.
I modified the regular expression slightly.
No need to escape {, }.
Removed capturing group before, after {..}.
Used non-greedy match: .*?
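The difference the non-greedy match makes only shows up when a line contains more than one tag. A small sketch, assuming a made-up input line that is not part of the question:

```python
import re

# Hypothetical input with two tags on one line (not from the question).
line = "{cs:id=1|rep=a} and {cs:id=2|rep=b}"

greedy = re.compile(r"{cs:id=(.*)\|rep=(.*)}")
lazy = re.compile(r"{cs:id=(.*?)\|rep=(.*?)}")

# Greedy: the first group runs on to the last "|rep=" in the line.
greedy_result = greedy.sub(r"\1", line)
# Lazy: each group stops at the first delimiter that lets the match succeed.
lazy_result = lazy.sub(r"\1", line)

print(greedy_result)  # 1|rep=a} and {cs:id=2
print(lazy_result)    # 1 and 2
```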

Related

Python matching dashes using Regular Expressions

I am currently new to Regular Expressions and would appreciate if someone can guide me through this.
import re
some = "I cannot take this B01234-56-K-9870 to the house of cards"
I have the above string and trying to extract the string with dashes (B01234-56-K-9870) using python regular expression. I have following code so far:
regex = r'\w+-\w+-\w+-\w+'
match = re.search(regex, some)
print(match.group()) #returns B01234-56-K-9870
Is there any simpler way to extract the dash pattern using regular expression? For now, I do not care about the order or anything. I just wanted it to extract string with dashes.
Try the following regex (as shortened by The fourth bird),
\w+-\S+
Original regex: (?=\w+-)\S+
Explanation:
\w+- matches one or more word characters followed by a -
\S+ matches non-space characters
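For comparison, here is a rough sketch of the shortened regex next to a stricter variant. The stricter pattern \w+(?:-\w+)+ and the punctuated sample sentence are my own assumptions, not part of the answer above; they illustrate that \S+ also swallows trailing punctuation.

```python
import re

some = "I cannot take this B01234-56-K-9870 to the house of cards"
found = re.search(r"\w+-\S+", some).group()
print(found)  # B01234-56-K-9870

# Hypothetical sentence where the code is followed by a comma.
punctuated = "Ship B01234-56-K-9870, then call me"

# \S+ keeps going until whitespace, so the comma is included.
loose = re.search(r"\w+-\S+", punctuated).group()
# Assumed alternative: word characters joined by one or more dashes.
strict = re.search(r"\w+(?:-\w+)+", punctuated).group()

print(loose)   # B01234-56-K-9870,
print(strict)  # B01234-56-K-9870
```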

Replace second and last second characters, using re.sub

I have a string "F(foo)", and I'd like to replace that string with "F('foo')". I know we can also use regular expression in the second parameter and do this replacement using re.sub(r"F\(foo\)", r"F\('foo'\)",str). But the problem here is, foo is a dynamic string variable. It is different every time we want to do this replacement. Is it possible by some sort of regex, to do such replacement in a cleaner way?
I remember one way: extract foo using () and then .group(1). But this would require me to define one more temporary variable just to store foo. I'm curious whether there is a way to replace "F(foo)" with "F('foo')" in a single line, or in other words, in a cleaner way.
Examples :
F(name) should be replaced with F('name').
F(id) should be replaced with F('id').
G(name) should not be replaced.
So, the regex would be r"F\((\w)+\)" to find such strings.
Using re.sub
Ex:
import re
s = "F(foo)"
print(re.sub(r"\((.*)\)", r"('\1')", s))
Output:
F('foo')
The following regex encloses valid [Python|C|Java] identifiers after F and in parentheses in single quotation marks:
re.sub(r"F\(([_a-z][_a-z0-9]*)\)", r"F('\1')", s, flags=re.I)
#"F('foo')"
There are several ways, depending on what foo actually is.
If it can't contain ( or ), you can just replace ( with (' and ) with '). Otherwise, try using
re.sub(r"F\((.*)\)", r"F('\1')", yourstring)
where the \1 in the replacement part will reference the (.*) capture group in the search regex
In your pattern F\((\w)+\) you are almost there, you just need to put the quantifier + after the \w to repeat matching 1+ word characters.
If you put it after the capturing group, you repeat the group which will give you the value of the last iteration in the capturing group which would be the second o in foo.
You could update your expression to:
F\((\w+)\)
And in the replacement refer to the capturing group using \1
F('\1')
For example:
import re
s = "F(foo)"  # avoid naming the variable str, which would shadow the built-in
print(re.sub(r"F\((\w+)\)", r"F('\1')", s)) # F('foo')
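As a quick check against the three examples from the question, and to show what happens when the quantifier sits outside the group, something like the following should work. The names samples, results and last_only are just illustrative.

```python
import re

pattern = re.compile(r"F\((\w+)\)")

# The three cases listed in the question.
samples = ["F(name)", "F(id)", "G(name)"]
results = [pattern.sub(r"F('\1')", s) for s in samples]
print(results)  # ["F('name')", "F('id')", 'G(name)']

# Quantifier outside the group: the group is repeated and only the
# last repetition is kept, here the final "o" of "foo".
last_only = re.search(r"F\((\w)+\)", "F(foo)").group(1)
print(last_only)  # o
```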

Does this regex fail, or do I need to modify the regex to support "optional followed by"?

I am trying the following regex: https://regex101.com/r/5dlRZV/1/, I am aware, that I am trying with \author and not \maketitle
In python, I try the following:
import re
text = r'''
\author{
\small
}
\maketitle
'''
regex = [re.compile(r'[\\]author*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S),
         re.compile(r'[\\]maketitle*|[{]((?:[^{}]*|[{][^{}]*[}])*)[}]', re.M | re.S)]
for p in regex:
    for m in p.finditer(text):
        print(m.group())
Python freezes. I suspect this has something to do with my pattern, and that the SRE engine fails on it.
EDIT: Is there something wrong with my regex? Can it be improved to actually work? Still I get the same results on my machine.
EDIT 2: Can this be fixed somehow so the pattern supports an optional "followed by", using ?: or ?= lookaheads? So that one can capture both?
After reading the heading, "Parentheses Create Numbered Capturing Groups", on this site: https://www.regular-expressions.info/brackets.html, I managed to find the answer which is:
Besides grouping part of a regular expression together, parentheses also create a
numbered capturing group. It stores the part of the string matched by the part of
the regular expression inside the parentheses.
The regex Set(Value)? matches Set or SetValue.
In the first case, the first (and only) capturing group remains empty.
In the second case, the first capturing group matches Value.
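In Python's re module the quoted behaviour can be checked directly. One detail worth knowing: a group that did not take part in the match is reported as None rather than as an empty string. A minimal sketch, with short and full as illustrative names:

```python
import re

pattern = re.compile(r"Set(Value)?")

short = pattern.fullmatch("Set")
full = pattern.fullmatch("SetValue")

# The optional group did not participate, so Python reports None.
print(short.group(0), short.group(1))  # Set None
# The optional group matched "Value".
print(full.group(0), full.group(1))    # SetValue Value
```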

Python Regex for alpha(alpha|digit)*

I'm trying to produce a python regex to represent identifiers for a lexical analyzer. My approach is:
([a-zA-Z]([a-zA-Z]|\d)*)
When I use this in:
regex = re.compile("\s*([a-zA-Z]([a-zA-Z]|\d)*)")
regex.findall(line)
It doesn't produce a list of identifiers like it should. Have I built the expression incorrectly?
What's a good way to represent the form:
alpha(alpha|digit)*
With the python re module?
like this:
regex = re.compile(r'[a-zA-Z][a-zA-Z\d]*')
Note the r before the quote to obtain a raw string, otherwise you need to escape all backslashes.
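Since the leading \s* is optional, you can remove it, along with the capture groups. The groups are the reason findall did not return a flat list of identifiers: when a pattern contains several groups, findall returns a tuple of the groups for each match instead of the whole match.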
If you want to ensure that the match isn't preceded by a digit, you can write it like this with a negative lookbehind (?<!...):
regex = re.compile(r'(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*')
Note that with re.compile you can use the case insensitive option:
regex = re.compile(r'(?:^|(?<![\da-z]))[a-z][a-z\d]*', re.I)
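Here is a small sketch comparing the three variants on one input. The sample line is made up for illustration and is not from the question.

```python
import re

line = "x1 = foo + 42bar"  # hypothetical input line

# Original pattern: two capture groups, so findall yields tuples of groups.
with_groups = re.findall(r"\s*([a-zA-Z]([a-zA-Z]|\d)*)", line)
# No capture groups: findall yields the matched identifiers themselves.
flat = re.findall(r"[a-zA-Z][a-zA-Z\d]*", line)
# Lookbehind variant: "bar" is skipped because it follows a digit.
guarded = re.findall(r"(?:^|(?<![\da-zA-Z]))[a-zA-Z][a-zA-Z\d]*", line)

print(with_groups)  # [('x1', '1'), ('foo', 'o'), ('bar', 'r')]
print(flat)         # ['x1', 'foo', 'bar']
print(guarded)      # ['x1', 'foo']
```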

Regular expressions: How do I find a sub-string that is between two regular expression matches?

Let's say I have a string like:
data = 'MESSAGE: Hello world!END OF MESSAGE'
And I want to get the string between 'MESSAGE: ' and the next capitalized word. There are never any fully capitalized words in the message.
I tried to get this by using this regular expression in re.search:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
Here I would like it to output 'Hello world!', but it always returns the wrong result. It is very easy in regular expressions to find a sub-string that occurs between two other strings, but how do you find a substring between strings that are themselves matches for a regular expression? I have tried making it a raw string, but that didn't seem to work.
I hope I am expressing myself well- I have extensive experience in Python but am new to regular expressions. If possible, I would like an explanation along with an example of how to make my specific example code work. Any helpful posts are greatly appreciated.
BTW, I am using Python 3.3.
Your code doesn't work but for the opposite reason:
re.search('MESSAGE: (.*)([A-Z]{2,})', data).group(1)
would match
'Hello world!END OF MESSA'
because (.*) is "greedy", i.e. it matches the most that will allow the rest (two uppercase chars) to match. You need to use a non-greedy quantifier with
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
that correctly matches
'Hello world!'
One little question mark:
re.search('MESSAGE: (.*?)([A-Z]{2,})', data).group(1)
Out[91]: 'Hello world!'
if you make the first capturing group lazy, it won't consume anything after the exclamation point.
You need your .* to be non-greedy (see the first ?), which means that it stops matching at the first point where the next item can match. The second group can also be made non-capturing (see the ?:), since its text is never used; this is optional and does not change what group(1) returns.
import re
data = 'MESSAGE: Hello world!END OF MESSAGE'
regex = r'MESSAGE: (.*?)(?:[A-Z]{2,})'
re.search(regex, data).group(1)
Returns:
'Hello world!'
Alternatively, you could use this:
regex = r'MESSAGE: (.*?)[A-Z]{2,}'
To break this down (I'll include the search line with the VERBOSE flag):
regex = r'''
MESSAGE:\s # first part, \s for the space (matches whitespace)
(.*?) # non-greedy, anything but a newline
(?:[A-Z]{2,}) # a secondary group, but non-capturing,
# good for alternatives separated by a pipe, |
'''
re.search(regex, data, re.VERBOSE).group(1)
