Python regex with *? - python

What does this Python regex match?
.*?[^\\]\n
I'm confused about why the . is followed by both * and ?.

* means "match the previous element as many times as possible (zero or more times)".
*? means "match the previous element as few times as possible (zero or more times)".
The other answers already address this, but what they don't bring up is how it changes the regex. If the re.DOTALL flag is provided it makes a huge difference, because . will then match line break characters. In that case .*[^\\]\n would match from the beginning of the string all the way to the last newline character that is not preceded by a backslash (so several lines would match).
If the re.DOTALL flag is not provided, the difference is more subtle: [^\\] will match everything other than a backslash, including line break characters. Consider the following example:
>>> import re
>>> s = "foo\n\nbar"
>>> re.findall(r'.*?[^\\]\n', s)
['foo\n']
>>> re.findall(r'.*[^\\]\n', s)
['foo\n\n']
So the purpose of this regex is to find non-empty lines that don't end with a backslash, but if you use .* instead of .*? you will match an extra \n if you have an empty line following a non-empty line.
This happens because .*? will only match fo, [^\\] will match the second o, and then the \n matches at the end of the first line. However, the .* will match foo, the [^\\] will match the \n that ends the first line, and the next \n will match because the second line is blank.
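The re.DOTALL case described above can be checked the same way. The following is a small sketch; the sample string s and the variable names are made up for illustration, with the last line ending in a backslash so that its newline cannot be the end of a match.

```python
import re

# Three lines; the last one ends with a backslash before its newline.
s = "one\ntwo\nthree\\\n"

# Greedy: with DOTALL the dot crosses newlines, so one match spans
# everything up to the last newline not preceded by a backslash.
greedy = re.findall(r'.*[^\\]\n', s, re.DOTALL)

# Lazy: each match stops at the first newline that qualifies.
lazy = re.findall(r'.*?[^\\]\n', s, re.DOTALL)

print(greedy)  # ['one\ntwo\n']
print(lazy)    # ['one\n', 'two\n']
```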

. indicates a wild card. It can match anything except a \n, unless the appropriate flag is used.
* indicates that you can have 0 or more of the thing preceding it.
? after a quantifier makes that quantifier lazy: it matches as few characters as possible, stopping at the first point where the rest of the pattern can match.

Opening the Python re module documentation, and searching for *?, we find:
*?, +?, ??:
The *, +, and ? qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <H1>title</H1>, it will match the entire string, and not just <H1>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only <H1>.
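The example from the documentation can be reproduced directly. This is a minimal sketch; the variable names are only illustrative.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* runs to the end and backs off only as far as the last '>'.
greedy = re.match(r"<.*>", html).group()

# Lazy: .*? grows one character at a time and stops at the first '>'.
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```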

Related

Regex match for non hyphenated words

I am trying to create a regex expression in Python for non-hyphenated words but I am unable to figure out the right syntax.
The requirements for the regex are:
It should not contain hyphens AND
It should contain at least 1 number
The expressions that I tried are:
^(?!.*-)
This matches all non-hyphenated words but I am not able to figure out how to additionally add the second condition.
^(?!.*-(?=/d{1,}))
I tried using double lookahead but I am not sure about the syntax to use for it. This matches ID101 but also matches STACKOVERFLOW
Sample Words Which Should Match:
1DRIVE , ID100 , W1RELESS
Sample Words Which Should Not Match:
Basically any non-numeric string (like STACK , OVERFLOW) or any hyphenated words (Test-11 , 24-hours)
Additional Info:
I am using library re and compiling the regex patterns and using re.search for matching.
Any assistance would be very helpful as I am new to regex matching and am stuck on this for quite a few hours.
Maybe,
(?!.*-)(?=.*\d)^.+$
might simply work OK.
Test
import re
string = '''
abc
abc1-
abc1
abc-abc1
'''
expression = r'(?m)(?!.*-)(?=.*\d)^.+$'
print(re.findall(expression, string))
Output
['abc1']
If you wish to simplify/modify/explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx 101 Explanation
/(?!.*-)(?=.*\d)^.+$/gm
Negative Lookahead (?!.*-)
  Assert that the Regex below does not match
  .* matches any character (except for line terminators)
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - matches the character - literally (case sensitive)
Positive Lookahead (?=.*\d)
  Assert that the Regex below matches
  .* matches any character (except for line terminators)
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  \d matches a digit (equal to [0-9])
^ asserts position at start of a line
.+ matches any character (except for line terminators)
  + Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Global pattern flags
  g modifier: global. All matches (don't return after first match)
  m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
I came up with -
^[^-]*\d[^-]*$
so we need at LEAST one digit (\d)
We need the rest of the string to contain anything BUT a - ([^-])
We can have unlimited number of those characters, so [^-]*
but putting them together like [^-]*\d would wrongly accept aaa3-, because the - comes after a valid match. Let's anchor the pattern so no dashes can sneak in before or after our match: ^[^-]*\d$
Unfortunately that means that aaa555D fails. So we actually need to add the first group again: ^[^-]*\d[^-]*$, which says: start, any number of chars that aren't dashes, a digit, any number of chars that aren't dashes, end.
Depending on style, we could also do ^([^-]*\d)+[^-]*$ since the order of the digits and other characters doesn't matter, and we can have as many of those as we want. The trailing [^-]* is still needed so that words ending in a non-digit, such as 1DRIVE, are accepted.
However, finally... this is how I would ACTUALLY solve this particular problem, since regexes may be powerful, but they tend to make the code harder to understand...
if ("-" not in text) and re.search(r"\d", text):
    ...  # text has no hyphen and contains at least one digit
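To compare the approaches from these answers, the sketch below runs the lookahead pattern, the negated-class pattern, and the plain check against the sample words from the question. The names LOOKAHEAD, NEGATED_CLASS, and plain_check are made up for this illustration.

```python
import re

# Patterns taken from the answers above.
LOOKAHEAD = re.compile(r"(?!.*-)(?=.*\d)^.+$")
NEGATED_CLASS = re.compile(r"^[^-]*\d[^-]*$")

def plain_check(word):
    """No hyphen anywhere, and at least one digit."""
    return "-" not in word and re.search(r"\d", word) is not None

should_match = ["1DRIVE", "ID100", "W1RELESS"]
should_not_match = ["STACK", "OVERFLOW", "Test-11", "24-hours"]

for word in should_match + should_not_match:
    results = (
        bool(LOOKAHEAD.search(word)),
        bool(NEGATED_CLASS.search(word)),
        plain_check(word),
    )
    print(f"{word:10} {results}")
```

All three checks should agree on every sample word: True for the first three, False for the rest.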

unexpected result for python re.sub() with non-capturing character

I cannot understand the following output :
import re
re.sub(r'(?:\s)ff','fast-forward',' ff')
'fast-forward'
According to the documentation :
Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.
So why is the whitespace included in the captured occurrence, and then replaced, since I added a non-capturing tag before it?
I would like to have the following output :
' fast-forward'
The non-capturing group still matches and consumes the matched text. Note that consuming means adding the matched text to the match value (the memory buffer allotted for the whole matched substring) and advancing the regex index accordingly. So, (?:\s) puts the whitespace into the match value, and it is replaced together with the ff.
You want to use a look-behind to check for a pattern without consuming it:
re.sub(r'(?<=\s)ff','fast-forward',' ff')
See the regex demo.
An alternative to this approach is using a capturing group around the part of the pattern one needs to keep and a replacement backreference in the replacement pattern:
re.sub(r'(\s)ff',r'\1fast-forward',' ff')
         ^  ^      ^^
Here, (\s) saves the whitespace in Group 1 memory buffer and \1 in the replacement retrieves it and adds to the replacement string result.
See the Python demo:
import re
print('"{}"'.format(re.sub(r'(?<=\s)ff','fast-forward',' ff')))
# => " fast-forward"
A non-capturing group still matches the pattern it contains. What you wanted to express was a look-behind, which does not match its pattern but simply asserts it is present before your match.
Although, if you are to use a look-behind for whitespace, you might want to consider using a word boundary metacharacter \b instead. It matches the empty string between a \w and a \W character, asserting that your pattern is at the beginning of a word.
import re
re.sub(r'\bff\b', 'fast-forward', ' ff') # ' fast-forward'
Adding a trailing \b will also make sure that you only match 'ff' as a whole word, not at the beginning of a longer word such as 'ffoo'.
See the demo.
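The variants discussed in these answers can be compared side by side. This is a small sketch; the variable names and the second sample string are invented for illustration.

```python
import re

text = " ff"

# Non-capturing group: the whitespace is part of the match, so it is replaced too.
non_capturing = re.sub(r"(?:\s)ff", "fast-forward", text)

# Look-behind: the whitespace is only asserted, not consumed.
look_behind = re.sub(r"(?<=\s)ff", "fast-forward", text)

# Capturing group plus backreference: the whitespace is consumed, then put back.
backreference = re.sub(r"(\s)ff", r"\1fast-forward", text)

# Word boundaries: only a standalone 'ff' is replaced.
word_boundary = re.sub(r"\bff\b", "fast-forward", "ff ffoo off")

print(repr(non_capturing))  # 'fast-forward'
print(repr(look_behind))    # ' fast-forward'
print(repr(backreference))  # ' fast-forward'
print(repr(word_boundary))  # 'fast-forward ffoo off'
```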

Strange behavior of capturing group in regular expression

Given the following simple regular expression whose goal is to capture the text between quote characters:
regexp = '"?(.+)"?'
When the input is something like:
"text"
The capturing group(1) has the following:
text"
I expected the group(1) to have text only (without the quotes). Could somebody explain what's going on and why the regular expression is capturing the " symbol even when it's outside the capturing group #1. Another strange behavior that I don't understand is why the second quote character is captured but not the first one given that both of them are optional. Finally I fixed it by using the following regex, but I would like to understand what I'm doing wrong:
regexp = '"?([^"]+)"?'
Quantifiers in regular expressions are greedy: they try to match as much text as possible. Because your last " is optional (you wrote "? in your regular expression), the .+ will match it.
Using [^"] is one acceptable solution. The drawback is that your string cannot contain " characters (which may or may not be desirable, depending on the case).
Another is to make " required:
regexp = '"(.+)"'
Another one is to make the + non-greedy, by using +?. However you also need to add anchors ^ and $ (or similar, depending on the context), otherwise it'll match only the first character (t in the case of "test"):
regexp = '^"?(.+?)"?$'
This regular expression allows " characters to be in the middle of the string, so that "t"e"s"t" will result in t"e"s"t being captured by the group.
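The alternatives above can be tried against the input from the question. This is a minimal sketch; the dictionary keys and variable names are just labels chosen for the illustration.

```python
import re

s = '"text"'

patterns = {
    "original":        r'"?(.+)"?',      # greedy .+ swallows the closing quote
    "negated class":   r'"?([^"]+)"?',   # [^"] can never match a quote
    "required quotes": r'"(.+)"',        # .+ must give the last quote back
    "lazy + anchors":  r'^"?(.+?)"?$',   # lazy, but forced to reach the end
}

captured = {name: re.match(p, s).group(1) for name, p in patterns.items()}
for name, value in captured.items():
    print(f"{name:16} {value}")

# The anchored lazy version still allows quotes in the middle.
inner_quotes = re.match(r'^"?(.+?)"?$', '"t"e"s"t"').group(1)
print(inner_quotes)  # t"e"s"t
```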
why the regular expression is capturing the " symbol even when it's outside the capturing group #1
The "?(.+)"? pattern contains a greedy dot matching subpattern. A . can match a ", too. The "? is an optional subpattern. It means that if the previous subpattern is greedy (and .+ is a greedy subpattern) and can match the subsequent subpattern (and . can match a "), the .+ will take over that optional value.
The negated character class is a correct way to match any characters but a certain one/range(s) of characters. [^"] will never match a ", so the last " will never get matched with this pattern.
why the second quote character is captured but not the first one given that both of them are optional
The first "? comes before the greedy dot matching pattern. The engine sees the " (if it is in the string) and matches the quote with the first "?.
.+ is greedy. It'll collect everything including the ". Your final "? doesn't require that a quote be present, hence .+ includes the quote.
The first quote isn't captured because it's matched by the "?
The regexp is greedy by default, it will try to match as much as possible as soon as possible.
Since your capturing group contains .+, it will match the closing quote before the "? gets a chance to. Then, when exiting the group, the regex is at the end of your line, where the optional "? matches the empty string.
.+ matches any character as long as it can (including the "). And when it reaches end of the input the "? is matching as it means the " is optional.
You should use "non greedy":
"(.+?)"

Exclude the characters before a given pattern

For this question, I am not interested in alternative pythonic methods, I am only interested in solving the Regex in my code. I can't figure out why it does not work.
Let's say I have the following string:
hello.world
I want to get all characters, excluding all characters before the dot, except the first one before it. So, I am trying to extract the following substring:
o.world
This is my code:
re.sub('^.*[^.\..*]', '', string)
My Regex logic is broken down as follows, the first characters ^.* which are not one character followed by a dot followed by any number of characters [^.\..*], are removed.
However, the Regex doesn't work, can someone help me out?
Your current code is not working because your pattern is not matching what you think it is. Putting .* in a character set does not mean "zero or more characters". Instead, it means the characters . or * literally. Also, \. inside a character set is still just an escaped ., which is redundant there because . already has no special meaning in a set.
This means that your pattern is actually equivalent to:
^.*[^.*]
which matches:
^ # The start of the string
.* # Zero or more characters
[^.*] # A character that is not . or *
Since the greedy .* only has to give back one character for the set to match the final d, the pattern matches all of hello.world and re.sub returns an empty string.
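A quick way to see what the character set actually contains is to test single characters against it. The sketch below assumes Python's re module; the variable names are made up for illustration.

```python
import re

char_set = r'[^.\..*]'

# Test individual characters against the negated set.
rejects_dot = re.fullmatch(char_set, '.') is None
rejects_star = re.fullmatch(char_set, '*') is None
accepts_backslash = re.fullmatch(char_set, '\\') is not None
accepts_letter = re.fullmatch(char_set, 'd') is not None

print(rejects_dot, rejects_star, accepts_backslash, accepts_letter)

# The full pattern from the question consumes the whole string:
# .* takes everything, then gives back one character for the set.
result = re.sub(r'^.*[^.\..*]', '', 'hello.world')
print(repr(result))  # ''
```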
To do what you want with re.sub, you can use:
>>> import re
>>> re.sub(r'[^.]*(.\..*)', r'\1', 'hello.world')
'o.world'
>>>
Below is an explanation of what the pattern does:
[^.]* # Matches zero or more characters that are not .
( # Starts a capture group
. # Matches any character (save a newline).
\. # Matches a literal .
.* # Matches zero or more characters
) # Closes the capture group
The important part though is the capture group. Inside the replace string, \1 will refer to whatever was matched by it, which in this case is the text that you want to keep. So, the code above can be seen as replacing all of the text with only that which we need.
That said, it seems like it would be better to just use re.search:
>>> import re
>>> re.search(r'[^.]*(.\..*)', 'hello.world').group(1)
'o.world'
>>>

match part of a string until it reaches the end of the line (python regex)

If I have a large string with multiple lines and I want to match part of a line only to end of that line, what is the best way to do that?
So, for example I have something like this and I want it to stop matching when it reaches the new line character.
r"(?P<name>[A-Za-z\s.]+)"
I saw this in a previous answer:
$ - indicates matching to the end of the string, or end of a line if
multiline is enabled.
My question is then how do you "enable multiline" as the author of that answer states?
Simply use
r"(?P<name>[A-Za-z\t .]+)"
This will match ASCII letters, spaces, tabs or periods. It'll stop at the first character that's not included in the character class, and newlines aren't (whereas they are included in \s). Because of that it's irrelevant whether multiline mode is turned on or off.
You can enable multiline matching by passing re.MULTILINE as the second argument to re.compile(). However, there is a subtlety to watch out for: since the + quantifier is greedy, this regular expression will match as long a string as possible, so if the next line is made up of letters and whitespace, the regex might match more than one line (in multiline mode $ matches at the end of every line, so a later line ending satisfies it just as well as the first).
There are three solutions to this:
Change your regex so that, instead of matching any whitespace including newline (\s) your repeated character set does not match that newline.
Change the quantifier to +?, the non-greedy ("minimal") version of +, so that it will match as short a string as possible and therefore stop at the first newline.
Change your code to first split the text up into an individual string for each line (using text.split('\n')).
Look at the flags parameter at http://docs.python.org/library/re.html#module-contents
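The behaviours described in these answers can be seen on a small multi-line string. This is a sketch only; the sample text and variable names are invented, and the $ anchor is added to the question's pattern to show the multiline cases.

```python
import re

text = "John Smith\nJane Doe\n42 Main St."

# \s includes the newline, so the match runs across lines.
with_s = re.search(r"(?P<name>[A-Za-z\s.]+)", text).group("name")

# Spaces and tabs only: the match stops at the first newline.
no_newline = re.search(r"(?P<name>[A-Za-z\t .]+)", text).group("name")

# Greedy + with $ in multiline mode: stops at the LAST line end it can reach.
greedy_multiline = re.compile(
    r"(?P<name>[A-Za-z\s.]+)$", re.MULTILINE
).search(text).group("name")

# Lazy +? with $ in multiline mode: stops at the FIRST line end.
lazy_multiline = re.compile(
    r"(?P<name>[A-Za-z\s.]+?)$", re.MULTILINE
).search(text).group("name")

# Splitting first: each line is matched on its own.
per_line = [re.match(r"[A-Za-z\t .]+", line) for line in text.split("\n")]

print(repr(with_s))            # 'John Smith\nJane Doe\n'
print(repr(no_newline))        # 'John Smith'
print(repr(greedy_multiline))  # 'John Smith\nJane Doe'
print(repr(lazy_multiline))    # 'John Smith'
```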
