Python regex * matches occurrences only in the starting of the string - python

When I use regex p* on string blackpink it returns the empty string as a match even though p is inside the string.
When I use the same regex p* on string pinkpink then it matches and returns p, indicating it's matching only at the start of the string even though I have not specified anything of the kind.
The peculiar behavior is that, when I use p+ on string pink and blackpink, in both cases it returns p, indicating it does not care if the match is at the beginning or inside the string.
Can anyone explain this?

There are three important things to understand here:
First, p* matches zero or more, while p+ matches one or more.
Second, you will get the first match, no matter if that match is an empty string or not.
Third, regex is greedy by default so once it found the first match it will include as many p as possible.
So, as a result of this,
p* on blackpink matches the zero p at the very beginning of the string, that is ''.
p* on pinkpink matches the first p (not the second).
p+ on blackpink matches the sixth letter, the p, since the empty string is no longer a match because of the +.
p+ on pinkpink matches the first p.
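The four cases above can be checked directly. This is a minimal sketch that assumes re.search is being used (the question does not say which function was called); the results dictionary is only an illustrative name.

```python
import re

# Map (pattern, text) -> (matched text, span) for the first match found.
results = {}
for pattern in ("p*", "p+"):
    for text in ("blackpink", "pinkpink"):
        m = re.search(pattern, text)
        results[(pattern, text)] = (m.group(), m.span())
        print(pattern, text, repr(m.group()), m.span())
```

The span makes the difference visible: p* on blackpink succeeds at position 0 with a zero-length match, while p+ has to move forward to index 5 before it can succeed.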

I think you're using re.match to find your pattern's matches. As you can see from the docs:
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
emphasis mine
Since p* means zero or more p characters, matched greedily, the empty string '' at the very start of blackpink already satisfies your pattern. In fact, p* can successfully match the empty (0-length) string at any position where no p follows, including between any two characters.
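A short sketch of the difference between re.match and re.search for this input, assuming the standard re module; the variable names are illustrative only.

```python
import re

text = "blackpink"

at_start_star = re.match(r"p*", text)   # zero p's at position 0 is a valid match
at_start_plus = re.match(r"p+", text)   # needs a p at position 0, so no match
anywhere_plus = re.search(r"p+", text)  # scans forward until a p is found

print(repr(at_start_star.group()))                    # ''
print(at_start_plus)                                  # None
print(anywhere_plus.group(), anywhere_plus.start())   # p 5
```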

Related

regex to get a substring where the main string's ending is also the substring's ending [duplicate]

I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
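To see the difference in practice, here is a small Python sketch using the two example URLs from the question. The capturing group around .*? in the second pattern is an addition for illustration, so that the value can be printed; it is not part of the pattern suggested above.

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

# Inside brackets, | and $ are literal characters, not alternation or an anchor.
bracket_version = re.compile(r"[&|\?]list=.*?([&|$])")
# Inside a group, | is alternation and $ is the end-of-string anchor.
group_version = re.compile(r"(&|\?)list=(.*?)(&|$)")

bracket_hits = [bool(bracket_version.search(u)) for u in urls]
group_hits = [group_version.search(u).group(2) for u in urls]

print(bracket_hits)  # [False, True]
print(group_hits)    # ['UL', 'UL']
```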
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors, it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
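In Python the distinction matters when the input ends with a newline: without the MULTILINE flag, $ also matches just before a trailing newline, while \Z matches only at the very end. A minimal sketch, with a made-up input string:

```python
import re

value = "index.php?list=UL\n"  # hypothetical input with a trailing newline

dollar = re.search(r"[&?]list=(.*?)(?=&|$)", value)
absolute = re.search(r"[&?]list=(.*?)(?=&|\Z)", value)

dollar_value = dollar.group(1) if dollar else None
absolute_value = absolute.group(1) if absolute else None

print(dollar_value)    # UL
print(absolute_value)  # None
```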
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
See demo. Details
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
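A quick Python check of the negated character class pattern. The first two URLs come from the question; the third, with an empty value, is a hypothetical edge case added for illustration.

```python
import re

pattern = re.compile(r"[&?]list=([^&]*)")
urls = [
    "index.php?test=1&list=UL",
    "index.php?list=UL&more=1",
    "index.php?list=",  # hypothetical: empty value at the end of the string
]

values = [pattern.search(u).group(1) for u in urls]
print(values)  # ['UL', 'UL', '']
```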
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match or not. They are crucial in case consecutive matches that may start and end with the same char are expected (see the original pattern, it may match a string starting and ending with &). Although it is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negated character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
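The effect of consuming the delimiter versus only checking for it shows up when matches are adjacent. A sketch in Python, using a made-up string in which every value is both preceded and followed by &:

```python
import re

query = "&a=1&a=2&a=3"  # hypothetical input where matches share delimiters

# Consuming the trailing & means the next match cannot start with it.
consuming = re.findall(r"&a=(.*?)(?:&|$)", query)
# Lookaheads only test for the delimiter and leave it for the next match.
positive = re.findall(r"&a=(.*?)(?=&|$)", query)
negative = re.findall(r"&a=(.*?)(?![^&])", query)

print(consuming)  # ['1', '3']
print(positive)   # ['1', '2', '3']
print(negative)   # ['1', '2', '3']
```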

Regex, find pattern only in middle of string

I am using python 2.6 and trying to find a bunch of repeating characters in a string, let's say a bunch of n's, e.g. nnnnnnnABCnnnnnnnnnDEF. In any place of the string the number of n's can be variable.
If I construct a regex like this:
re.findall(r'^(((?i)n)\2{2,})', s),
I can find occurrences of case-insensitive n's only in the beginning of the string, which is fine. If I do it like this:
re.findall(r'(((?i)n)\2{2,}$)', s),
I can detect the ones only in the end of the sequence. But what about just in the middle?
At first, I thought of using re.findall(r'(((?i)n)\2{2,})', s) and the two previous regex(-ices?) to check the length of the returned list and the presence of n's either in the beginning or end of the string and make logical tests, but it became an ugly if-else mess very quickly.
Then, I tried re.findall(r'(?!^)(((?i)n)\2{2,})', s), which seems to exclude the beginning just fine but (?!$) or (?!\z) at the end of the regex only excludes the last n in ABCnnnn. Finally, I tried re.findall(r'(?!^)(((?i)n)\2{2,})\w+', s) which seems to work sometimes, but I get weird results at others. It feels like I need a lookahead or lookbehind, but I can't wrap my head around them.
Instead of using a complicated regex to avoid matching the leading and trailing n characters, a more Pythonic approach is to strip() your string first, then find all the sequences of n's using re.findall() and a simple regex:
>>> s = "nnnABCnnnnDEFnnnnnGHInnnnnn"
>>> import re
>>>
>>> re.findall(r'n{2,}', s.strip('n'), re.I)
['nnnn', 'nnnnn']
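Note: re.I is the ignore-case flag, which makes the regex engine match both upper-case and lower-case characters. s.strip('n') itself only removes lower-case n characters; use s.strip('nN') if the leading or trailing runs may contain an upper-case N.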
Since "n" is a character (and not a subpattern), you can simply use:
re.findall(r'(?i)(?<=[^n])nn+(?=[^n])', s)
or better:
re.findall(r'(?i)n(?<=[^n]n)n+(?=[^n])', s)
NOTE: This solution assumes n may be a sequence of some characters. For more efficient alternatives when n is just 1 character, see other answers here.
You can use
(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)
See the regex demo
The regex will match repeated consecutive ns (ignoring case can be achieved with re.I flag) that are not at the beginning ((?<!^)) or end ((?!$)) of the string and not before ((?!n)) or after ((?<!n)) another n.
The (?<!^)(?<!n) is a sequence of 2 lookbehinds: (?<!^) means do not match the next pattern if the current position is the start of the string. The (?<!n) negative lookbehind means do not match the next pattern if it is preceded by n. The negative lookaheads (?!$) and (?!n) have similar meanings: (?!$) fails a match if the end of the string occurs right after the current position, and (?!n) fails a match if n occurs right after the current position in the string (that is, right after matching all consecutive ns). All the lookaround conditions must be met, which is why we only get the innermost matches.
See IDEONE demo:
import re
p = re.compile(r'(?<!^)(?<!n)((n)\2{2,})(?!$)(?!n)', re.IGNORECASE)
s = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print([x.group() for x in p.finditer(s)])
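The strip-based and lookaround-based ideas from the answers above can be compared side by side. This is a sketch only: the helper names and the min_len parameter are made up for illustration, and re.IGNORECASE is passed as a flag instead of an inline (?i).

```python
import re

def inner_runs_strip(s, char="n", min_len=2):
    """Strip leading/trailing runs of char first, then find the remaining runs."""
    trimmed = s.strip(char.lower() + char.upper())
    pattern = "%s{%d,}" % (re.escape(char), min_len)
    return re.findall(pattern, trimmed, re.IGNORECASE)

def inner_runs_lookaround(s, char="n", min_len=2):
    """Require a character other than char on both sides of the run."""
    c = re.escape(char)
    pattern = "(?<=[^%s])%s{%d,}(?=[^%s])" % (c, c, min_len, c)
    return re.findall(pattern, s, re.IGNORECASE)

s1 = "nnnABCnnnnDEFnnnnnGHInnnnnn"
s2 = "nnnnnnnABCnnnnnNnnnnDEFnNn"
print(inner_runs_strip(s1), inner_runs_lookaround(s1))
print(inner_runs_strip(s2, min_len=3), inner_runs_lookaround(s2, min_len=3))
```

Both helpers return only the runs that sit strictly inside the string; the lookaround version does this without modifying the input.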

Python regex - (\w+) results different output when used with complex expression

I have a question about Python regex behavior. Here is my sample test.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x7f51c0033210>
>>> re.match(r'(\w+):(\d+)', 'a-b:1')
>>>
Why doesn't the second regex return a match object, when the first one does for a plain string match, regardless of whether special characters are present in the string?
As far as I know, \w+ matches [a-z,A-Z,_]. I'm not clear why (\w+) gives a match object for the string 'a-b'. How can I check whether the given string doesn't contain any special characters?
Taking a look at the actual match will give you an idea of what happens.
>>> re.match(r'(\w+)', 'a-b')
<_sre.SRE_Match object at 0x0000000002DE45D0>
>>> _.groups()
('a',)
As you can see, the expression matched a. The character class \w only matches actual word characters, but not separators like dashes. So you can’t actually match a-b using just \w+.
Now in the second expression one might think that it would match b:1 at least, given that \w+ matches b and :(\d+) does match the :1. However, it does not happen due to how re.match works. As the documentation hints, it only tries to match “at the beginning of string”. So when using re.match, the pattern behaves as if it were anchored at the start of the string (like a leading \A), and it only tries to find a match starting with a.
Instead, you can use re.search which actually looks in the whole string if it can match the expression anywhere. So there, you will get a result:
>>> re.search(r'(\w+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('b', '1')
For further information on the search vs. match topic, check this section in the manual.
And finally, if you want to match dashes too, you can use a character class such as [\w-]:
>>> re.match(r'([\w-]+):(\d+)', 'a-b:1')
<_sre.SRE_Match object at 0x0000000002E01B58>
>>> _.groups()
('a-b', '1')
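The first pattern matches the a: one or more word chars.
The second requires one or more word chars immediately followed by a :, and at the start of the string there are none (a is followed by -).
\w is roughly equivalent to [a-zA-Z0-9_] for ASCII text. Inside a character class, a - between two characters (as in a-z) denotes a range rather than a literal hyphen; if you want a literal hyphen, put it as the first or last character of the class. Note that the commas in [a-z,A-Z,_] are matched literally.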
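A few checks that illustrate these character class points, plus one possible way to test for "no special characters". The helper name has_only_word_chars is hypothetical, and re.fullmatch requires Python 3.4 or later, which is newer than the Python version in the question.

```python
import re

# The commas inside [a-z,A-Z,_] are literal, and digits are not covered.
comma_ok = bool(re.fullmatch(r"[a-z,A-Z,_]+", "a,b"))
digit_ok = bool(re.fullmatch(r"[a-z,A-Z,_]+", "a1"))
word_digit_ok = bool(re.fullmatch(r"\w+", "a1"))

# A hyphen placed last in the class is a literal hyphen.
hyphen_ok = bool(re.fullmatch(r"[\w-]+", "a-b"))

def has_only_word_chars(s):
    """Hypothetical helper: True if s consists solely of \\w characters."""
    return re.fullmatch(r"\w+", s) is not None

print(comma_ok, digit_ok, word_digit_ok, hyphen_ok)  # True False True True
print(has_only_word_chars("a_b1"), has_only_word_chars("a-b"))  # True False
```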
Match's docs say
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding MatchObject
instance.
The match method returns a match object if it finds a match at the beginning of the string. (\w+) matches a in a-b.
print(re.match(r'(\w+)', 'a-b').group())
will print
a
In the second case ((\w+):(\d+)), the only part of the string that could match is b:1, which is not at the beginning of the string. That's why it returns None.
How can I check whether the given string doesn't contain any special characters?
I would say the second regular expression you used, together with the match function, should be enough. I insist on match because match and search behave differently: http://docs.python.org/2.7/library/re.html#search-vs-match
Remember, you

Python regular expressions acting strangely

url = "http://www.domain.com/7464535"
match = re.search(r'\d*',url)
match.group(0)
returns '' <----- empty string
but
url = "http://www.domain.com/7464535"
match = re.search(r'\d+',url)
match.group(0)
returns '7464535'
I thought '+' was supposed to be 1 or more and '*' was 0 or more correct? And RE is supposed to be greedy. So why don't they both return the same thing and more importantly why does the 1st one return nothing?
You are correct about the meanings of + and *. So \d* will match zero or more digits — and that's exactly what it's doing. Starting at the beginning of the string, it matches zero digits, and then it's done. It successfully matched zero or more digits.
* is greedy, but that only means that it will match as many digits as it can at the place where it matches. It won't give up a match to try to find a longer one later in the string.
Edit: A more detailed description of what the regex engine does:
Take the case where our string to search is "http://www.domain.com/7464535" and the pattern is \d+.
In the beginning, the regex engine is pointing at the beginning of our URL and the beginning of the regex pattern. \d+ needs to match one or more digits, so first the regex engine must find at least one digit to have a successful match.
The first place it looks it finds an 'h' character. That's not a digit, so it moves on to the 't', then the next 't', and so on until it finally reaches the '7'. Now we've matched one digit, so the "one or more" requirement is satisfied and we could have a successful match, except + is greedy so it will match as many digits as it can without changing the starting point of the match, the '7'. So it hits the end of the string and matches that whole number '7464535'.
Now consider if our pattern was \d*. The only difference now is that zero digits is a valid match. Since regex matches left-to-right, the first place \d* will match is the very start of the string. So we have a zero-length match at the beginning, but since * is greedy, it will extend the match as long as there are digits. Since the first thing we find is 'h', a non-digit, it just returns the zero-length match.
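The walkthrough above can be confirmed by looking at the span of each match. A minimal sketch using the URL from the question; the non_empty list is an illustrative extra showing that \d* does find the digits once empty matches are filtered out.

```python
import re

url = "http://www.domain.com/7464535"

star = re.search(r"\d*", url)
plus = re.search(r"\d+", url)
print(star.span(), repr(star.group()))  # (0, 0) ''
print(plus.span(), plus.group())        # (22, 29) 7464535

# \d* matches at every position; keeping only non-empty matches finds the digits.
non_empty = [m.group() for m in re.finditer(r"\d*", url) if m.group()]
print(non_empty)  # ['7464535']
```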
How is * even useful, then, if it will just give you a zero-length match? Consider if I was matching a config file like this:
foo: bar
baz: quux
blah:blah
I want to allow any amount of spaces (even zero) after the colon. I would use a regex like (\w+):\s*(\w+) where \s* matches zero or more spaces. Since it occurs after the colon in the pattern, it will match just after the colon in the string and then either match a zero-length string (as in the third line blah:blah because the 'b' after the colon ends the match) or all the spaces there are before the next non-space, because * is greedy.
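Applied to the config example, the pattern can be tried like this. The exact amount of whitespace after each colon in the sample string is an assumption made for illustration.

```python
import re

# One space, several spaces, and no space after the colon.
config = "foo: bar\nbaz:    quux\nblah:blah"

pairs = re.findall(r"(\w+):\s*(\w+)", config)
print(pairs)  # [('foo', 'bar'), ('baz', 'quux'), ('blah', 'blah')]
```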

clarifications on the re.findall() method in python

I wanted to strip a string of punctuation marks and I ended up using
re.findall(r"[\w]+|[^\s\w]", text)
It works fine and it does solve my problem. What I don't understand is the details within the parentheses and the whole pattern thing. What does r"[\w]+|[^\s\w]" really mean? I looked it up in the Python standard library and it says:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
I am not sure if I get this and the clarification sounds a little vague to me. Can anyone please tell me what a pattern in this context means and how exactly it is defined in the findall() method?
To break it down, [] creates a character class. You'll often see things like [abc], which will match a, b or c. Conversely, you might also see [^abc], which will match anything that isn't a, b or c. Finally, you'll also see character ranges: [a-cA-C]. This introduces two ranges and it will match any of a, b, c, A, B, C.
In this case, your character class contains the special tokens \w and \s. \w matches any word character: letters, digits and the underscore. Exactly which characters count depends on the flags in effect (for example LOCALE or UNICODE), but for ASCII text it is the same thing as [a-zA-Z0-9_], which matches anything in the ranges a-z, A-Z, 0-9 or _. \s is similar, but it matches anything that can be considered whitespace.
The + means that you can repeat the previous match 1 or more times. So [a]+ will match the entire string aaaaaaaaaaa. In your case, you're matching runs of word characters that are next to each other.
The | is basically like "or": match the stuff on the left, or match the stuff on the right if the left stuff doesn't match.
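\w means alphanumeric characters plus "_", and \s means whitespace characters: "\t\r\n\v\f" and the space character " ". So [\w]+|[^\s\w] matches either a run of one or more word characters, or a single character that is neither whitespace nor a word character (that is, a punctuation mark or other symbol). With findall(), this splits the text into words and individual punctuation marks and drops the whitespace.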
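A small sketch of what the pattern returns on a sample sentence (the sentence itself is made up). The second call, using only \w+, is shown as one possible way to drop the punctuation entirely, which is what the question originally set out to do.

```python
import re

text = "Hello, world! It's ok."

# Words and individual punctuation marks; whitespace is skipped.
tokens = re.findall(r"[\w]+|[^\s\w]", text)
# Words only; punctuation and whitespace are both skipped.
words_only = re.findall(r"\w+", text)

print(tokens)      # ['Hello', ',', 'world', '!', 'It', "'", 's', 'ok', '.']
print(words_only)  # ['Hello', 'world', 'It', 's', 'ok']
```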
