Python regular expression matching anything

My regular expression isn't doing anything to my string.
import re
data = 'random\n<article stuff\n</article>random stuff'
datareg = re.sub(r'.*<article(.*)</article>.*', r'<article\1</article>', data, flags=re.MULTILINE)
print(datareg)
I get:
random
<article stuff
</article>random stuff
I want:
<article stuff
</article>

re.MULTILINE doesn't actually make your regex multiline in the way you want it to be.
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
re.DOTALL does:
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Change flags=re.MULTILINE to flags=re.DOTALL and your regex will work.
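As a minimal sketch of that fix, here is the question's own data and pattern with only the flag changed (the import is added so the snippet runs on its own):

```python
import re

data = 'random\n<article stuff\n</article>random stuff'

# With re.DOTALL, '.' also matches '\n', so the leading and trailing '.*'
# can consume the text that sits on the other lines.
datareg = re.sub(r'.*<article(.*)</article>.*', r'<article\1</article>',
                 data, flags=re.DOTALL)
print(datareg)
```

Because every quantifier here is greedy, the substitution keeps the text from the last <article to the last </article>, so this sketch assumes the input contains a single article block.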

Related

Convert a TRIM on regex

I have a large string and one of the lines is in the form of
Description: something here....
I want to get everything after "Description:" (the something here.... part) without any leading or trailing whitespace. Currently I'm doing it with a mix of regex and strip(). How could this be done entirely in regex? Currently:
re.search(r'Description:\s+(.+)', body).group(1).strip()
Other thoughts:
re.search(r'Description:\s+\w(.+)\w', body).group(1) # works
Also, why doesn't putting an anchor work in the above context?
re.search(r'Description:\s+\w(.+)$', body).group(1) # fails
You can use
Description:\s+(.*\S)
See the regex demo.
The point is that you need to match up to the last non-whitespace character. .* matches any zero or more characters other than line break chars, as many as possible, so the \S matches the last non-whitespace character in the string.
If you have a multiline string and you need to get to the last non-whitespace character, you may add re.S / re.DOTALL option when passing the pattern above to a regex method, or re-write it as
Description:\s+(\S+(?:\s+\S+)*)
where \S+ matches one or more non-whitespace chars and (?:\s+\S+)* matches zero or more occurrences of one or more whitespaces followed with one or more non-whitespace chars.
See this regex demo.
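A small sketch of the patterns above in use. The sample strings and variable names are made up for illustration; only the "Description:" prefix comes from the question.

```python
import re

# Hypothetical sample text with padding around the description.
body = "Title: demo\nDescription:   something here....   \nAuthor: nobody"

# '.*' stops at the line break, then backtracks to the last non-whitespace char.
single = re.search(r'Description:\s+(.*\S)', body).group(1)
print(repr(single))

# Hypothetical two-line description with trailing whitespace.
multi_body = "Description:  first line\nsecond line  \n"

# Variant 1: the same pattern with re.DOTALL so '.' crosses line breaks.
multi_dotall = re.search(r'Description:\s+(.*\S)', multi_body, re.DOTALL).group(1)

# Variant 2: the flag-free rewrite; '\s+' already matches line breaks.
multi_rewrite = re.search(r'Description:\s+(\S+(?:\s+\S+)*)', multi_body).group(1)
print(repr(multi_dotall), repr(multi_rewrite))
```

Both multiline variants run to the last non-whitespace character of the whole string, so they would also swallow any lines that follow the description.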

How regular expression handles `^` or `$` in the middle of regex pattern?

I am trying to play with regular expressions in Python. I have framed a regular expression as given below. I know that ^ is used to match at the beginning of the search string. I have framed my match pattern so that it contains multiple ^, but I am not sure how re will try to match the pattern against the search string.
re.match("^def/^def", "def/def")
I was expecting re to raise an error about an invalid regular expression, but it doesn't raise any error and simply returns no match.
So, my question is: is "^def/^def" or "$def/$def" a valid regular expression?
You do not have an invalid regular expression, ^ has legal uses in the middle of a string. When you use the re.M flag for example:
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline).
It is also possible to create patterns with optional groups, where a later ^ would still match if all of the preceding pattern matched the empty string. Using the ^ in places it can't match is not something the parser checks for and no error will be raised.
Your specific pattern will never match anything, because the ^ in the middle is unconditional and there is no possibility that the / preceding it will ever match the requisite newline character, even if the multiline flag was enabled.
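The three cases described above can be checked with a short sketch. The second and third patterns are illustrative examples, not taken from the question:

```python
import re

# The pattern compiles without error even though the second '^' can never match.
pattern = re.compile("^def/^def")
no_match = pattern.match("def/def")
no_match_multiline = re.match("^def/^def", "def/def", re.M)

# With re.M, a mid-pattern '^' can match right after a newline.
after_newline = re.search(r"def\n^def", "def\ndef", re.M)

# An optional group that matches the empty string lets a later '^' match at position 0.
optional_prefix = re.match(r"(?:abc)?^def", "def")

print(no_match, no_match_multiline)
print(repr(after_newline.group()), repr(optional_prefix.group()))
```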

python regular_expression_ multiple expression within expression

My Python script needs an expression like
"\[.*[ERROR].*\n.*\n.*\n.*/\n.*is for multiple time/[\]]{2}"
Please let me know how to match "\n." multiple times. I'm stuck at this point.
There is the multiline flag, which changes where ^ and $ match in a string that contains several lines.
https://docs.python.org/2/library/re.html#re.MULTILINE
re.MULTILINE
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
You also have access to DOTALL that will have . match even newlines
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
Depending on your match, those two flags let you choose how newlines are handled. In your case, you probably want to adjust your pattern like this:
import re
text = '\n[ [ERROR]\n\nsome text\nis for multiple time]'
re.findall(r"\[.*\[ERROR\].*is for multiple time\]", text, re.DOTALL)
# result: ['[ [ERROR]\n\nsome text\nis for multiple time]']
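To see how the two flags differ, here is a minimal side-by-side sketch on a made-up two-line string (the input and variable names are only for illustration):

```python
import re

text = "one\ntwo"  # hypothetical two-line input

# re.MULTILINE only changes '^' and '$': they also match at line boundaries.
lines_default = re.findall(r'^\w+$', text)
lines_multiline = re.findall(r'^\w+$', text, re.MULTILINE)

# re.DOTALL only changes '.': it also matches the newline itself.
span_default = re.findall(r'one.two', text)
span_dotall = re.findall(r'one.two', text, re.DOTALL)

print(lines_default, lines_multiline)
print(span_default, span_dotall)
```

Neither flag implies the other. They can be combined with | when a pattern needs both behaviours.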

Matching only the beginning of a string in Python MULTILINE mode

Python's re module says this:
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
I want to use MULTILINE but I want to require a match at the beginning of the string (not just the beginning of a line). Is there a way to do this?
Just use the \A anchor that matches the start of string unambiguously.
Check the Regular Expression Syntax:
\A
Matches only at the start of the string.
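A quick sketch of the difference, using a made-up string in which the same word starts two lines:

```python
import re

text = "foo\nfoo"  # hypothetical input: 'foo' at the start of both lines

# In MULTILINE mode '^' matches at the start of every line...
caret_hits = re.findall(r'^foo', text, re.MULTILINE)

# ...while '\A' still matches only at the very start of the string.
anchor_hits = re.findall(r'\Afoo', text, re.MULTILINE)

print(caret_hits, anchor_hits)
```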

Python regex with *?

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.
* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but they don't bring up how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . will then match line break characters. So .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
. is a wildcard. It matches any character except \n, unless the re.DOTALL flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? after a quantifier makes that quantifier lazy: it matches as few repetitions as possible while still letting the rest of the pattern match.
Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
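The documentation's example can be reproduced with a short sketch (the variable names are only for illustration):

```python
import re

html = '<H1>title</H1>'

# Greedy: '.*' grabs as much as possible, so the match runs to the last '>'.
greedy = re.match(r'<.*>', html).group()

# Lazy: '.*?' grabs as little as possible, so the match stops at the first '>'.
lazy = re.match(r'<.*?>', html).group()

print(greedy)
print(lazy)
```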
