Python empty matches replaced - python

I really don't understand the following example found on docs.python.org:
>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b-d-'
Why is the regex 'x*' matching four times?
I thought the output should be: 'ab-d'

The * metacharacter matches zero or more times. So,
 a b x d
^ ^ --  ^
^ marks a position where x* matches zero times (an empty match) and -- marks the place where x* matches once. That is why the output is -a-b-d-.
To get the output ab-d, use x+ in the regular expression. It means "match one or more times", so it matches only at the following position:
abxd
  ^

An update on re.sub as of Python 3.7:
Empty matches for the pattern are replaced when adjacent to a previous non-empty match.
The result becomes "-a-b--d-" because there is now an empty match just before the "d". In earlier versions of Python this empty match was not allowed, since it is adjacent to the match of "x".
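The difference between the two quantifiers can be checked directly. This is only a minimal sketch; the variable names are illustrative, and the x* result depends on the Python version as described above.

```python
import re

text = "abxd"

# x+ needs at least one 'x', so only the literal 'x' is replaced.
plus_result = re.sub(r"x+", "-", text)
print(plus_result)  # ab-d

# x* also matches the empty string, so positions where no 'x' starts
# are replaced as well. Python 3.7+ additionally replaces the empty
# match right after the 'x'; older versions skipped that one.
star_result = re.sub(r"x*", "-", text)
print(star_result)  # -a-b--d- on Python 3.7+, -a-b-d- before that
```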

Related

Match strings with alternating characters

I want to match strings in which every second character is the same,
for example 'abababababab'.
I have tried this: '''(([a-z])[^/2])*'''
The match should return the complete string as it is, e.g. 'abababababab'.
This cannot be done compactly in a true regular expression: remembering an earlier character requires a backreference, which goes beyond regular (Chomsky type-3) grammars.
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could express your pattern as follows.
(..)\1*
(..) is a sequence of 2 characters. \1* matches that exact pair of characters an arbitrary (possibly zero) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
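To see how the two patterns differ, here is a small sketch. It uses re.fullmatch (Python 3.4+) so the whole string has to match; that choice and the variable names are my own assumptions, not part of the patterns themselves.

```python
import re

pair_repeated = re.compile(r"(..)\1*")     # the same two characters repeated
second_fixed = re.compile(r".(.)(.\1)*")   # only characters 2, 4, ... must agree

results = {}
for candidate in ("abababababab", "abcbdb", "abcbdc"):
    results[candidate] = (
        bool(pair_repeated.fullmatch(candidate)),
        bool(second_fixed.fullmatch(candidate)),
    )
    print(candidate, results[candidate])
```

Only the first string satisfies both patterns; 'abcbdb' satisfies just the second one, and 'abcbdc' satisfies neither.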
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z, then capture another a-z in group 1
(?:[a-z]\1)* Repeat 0+ times: match a-z followed by a backreference to group 1
$ End of string
Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
The slice string[1::2] says: start at the 2nd character (index 1) and then pull out every second character (the 2 part).
This returns:
All the same True
All the same False
This is a somewhat complicated expression; maybe we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.

Understanding * (zero or more) operator using re.search()

I am new to Python and was going through the "Google for Education" Python course.
Now, the line below confuses me:
* -- 0 or more occurrences of the pattern to its left
(all the examples are in python3)
e.g. 1
In [1]: re.search(r"pi*", "piiig!!").group()
Out[1]: 'piii'
This is fine since "pi" has 1 occurrence, so it is returned.
e.g. 2
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
Why does it not return "i"? In fact, from my understanding, it should be returning "iii". But the result is an empty string.
Also, what exactly does "0 or more" mean? I searched on Google, but everywhere it just says * -- 0 or more. But if there are 0 occurrences of an expression, does the match not succeed even if the expression is not there? What is the point of searching then?
I am so confused by this. Can you please help me by explaining this or pointing me in the right direction?
I hope the right explanation would also resolve this issue of mine:
In [3]: re.search(r"i?", "piiig!!").group()
Out[3]: ''
I have tried the examples in Spyder 3.2.4
The explanation is a bit more complicated than the answers we have seen so far.
First, unlike re.match(), the primitive operation re.search() checks for a match anywhere in the string (this is what Perl does by default) and finds the pattern once:
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
If we follow every step of the regex engine while it tries to find a match, we can observe the following for the pattern i* and the test string piigg!!:
The first character (at position 0) produces a match because p contains zero occurrences of i, so the result is an empty match (and not p, because we are not searching for p or any other character).
At the second character (position 1) the second match (spanning to position 2) is found, since ii is zero or more times i. At position 3 there is another empty match, and so on and so forth.
Because re.search only returns the first match, it sticks with the first empty match at position 0. That is why you get the (confusing) result you have posted:
In [2]: re.search(r"i*", "piiig!!").group()
Out[2]: ''
In order to match every occurrence, you need re.findall():
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
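The positions of the individual matches can be inspected with re.finditer. The following is just a sketch using the question's string piiig!!; the variable names are illustrative.

```python
import re

text = "piiig!!"

# re.search stops at the first match, which is the empty match at index 0.
first = re.search(r"i*", text)
print(first.span())  # (0, 0)

# re.finditer keeps going and reports every match with its position.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"i*", text)]
for start, end, matched in spans:
    print(start, end, repr(matched))
```

The second entry is the run 'iii' from index 1 to 4; every other entry is an empty match.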
You need to use * (0 or more) and + (1 or more) properly to get your desired output.
E.g. 1 matches because you have applied * only to "i"; this pattern will capture "p" alone or "p" followed by any number of "i".
E.g. 2: if you need to match only "i", you need to use "+" instead of "*".
If you use "*":
In: re.search(r"pi*g", "piiig!!").group()
This will match if your input contains "pig", "piig" or "pg".
If you use "+":
In: re.search(r"pi+g", "piiig!!").group()
This will match if your input contains "pig" or "piig", but not "pg".
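A quick way to compare the two quantifiers side by side (a sketch; the test words are just examples):

```python
import re

results = {}
for word in ("pg", "pig", "piiig"):
    star = re.search(r"pi*g", word)   # zero or more 'i' between p and g
    plus = re.search(r"pi+g", word)   # at least one 'i' between p and g
    results[word] = (bool(star), bool(plus))
    print(word, results[word])
```

Only "pg" separates the two: it matches with * but not with +.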
Because '' is the first match of r'i*' and 'iii' is the second.
In [1]: import re
In [2]: re.findall(r'i*', 'piiig!!')
Out[2]: ['', 'iii', '', '', '', '']
This website also explains how the regular expression works:
https://regex101.com/r/XVPXMv/1
The special character * means 0 or more occurrences of the preceding character. E.g. a* matches 0 or more occurrences of a, which could be '', 'a', 'aa', etc. The empty match happens because '' contains 0 occurrences of a.
To get iii you should use + instead of *, which gives the first non-empty sequence of 'i', namely iii:
re.search("i+", "piiig!!").group()

Why re.search(r'(ab*)','aaAaABBbbb',re.I) in python gives result 'a' instead of 'ABBbbb' though 're.I' is used?

In python, re.search() checks for a match anywhere in the string (this is what Perl does by default).
So why don't we get 'ABBbbb' as output in Ex(1), as we do in Ex(2) and Ex(3) below?
Ex(1)
>>> s=re.search(r'(ab*)','aaAaABBbbb',re.I)
>>> print(s.group())
a
Ex(2)
>>> s=re.search(r'(ab.*)','aaAaABBbbb',re.I)
>>> print(s.group())
ABBbbb
Ex(3)
>>> s=re.search(r'(ab+)','aaAaABBbbb',re.I)
>>> print(s.group())
ABBbbb
Example 1 is searching for a followed by zero or more b, ignoring case. This matches right at the beginning of the string. The regex engine will see that match and use it. It won't look for any other matches.
Example 2 is searching for ab followed by as much of the string as it can eat. Example 3 is searching for a followed by at least one b. The difference is that each of these needs at least one b, while Example 1 does not.
search:
checks for a match anywhere in the string (this is what Perl does by default).
re.search(r'(ab*)', 'aaAaABBbbb', re.I)
This will try to match a (case ignored) followed by zero or more b. It finds that match in the first a, since it's followed by zero b, and returns it.
re.search(r'(ab.*)', 'aaAaABBbbb', re.I)
This one will try to match a, followed by b and then with anything (.* is greedy). It matches ABBbbb because it's the first sequence that matches the regex.
re.search(r'(ab+)', 'aaAaABBbbb', re.I)
Finally, this will match a, followed by at least one b (again, case ignored). That first match is ABBbbb, and it's returned.
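Printing the span of each match makes it clear where the engine stopped. A minimal sketch of the three searches above (the results list is only there for illustration):

```python
import re

text = "aaAaABBbbb"

results = []
for pattern in (r"(ab*)", r"(ab.*)", r"(ab+)"):
    m = re.search(pattern, text, re.I)
    results.append((pattern, m.group(), m.span()))
    print(pattern, m.group(), m.span())
```

The first pattern already succeeds at index 0, while the other two cannot succeed before index 4, where an a is first followed by a b.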

Regular Expression: How to match using previous matches?

I am searching for string patterns of the form:
XXXAXXX
# exactly 3 Xs, followed by a non-X, followed by 3Xs
All of the Xs must be the same character and the A must not be an X.
Note: I am not searching explicitly for Xs and As - I just need to find this pattern of characters in general.
Is it possible to build this using a regular expression? I will be implementing the search in Python if that matters.
Thanks in advance!
-CS
Update:
#rohit-jain's answer in Python
x = re.search(r"(\w)\1{2}(?:(?!\1)\w)\1{3}", data_str)
#jerry's answer in Python
x = re.search(r"(.)\1{2}(?!\1).\1{3}", data_str)
You can try this:
(\w)\1{2}(?!\1)\w\1{3}
Break Up:
(\w) # Match a word character and capture in group 1
\1{2} # Match group 1 twice, to make the same character thrice - `XXX`
(?!\1) # Make sure the character in group 1 is not ahead. (X is not ahead)
\w # Then match a word character. This is `A`
\1{3} # Match the group 1 thrice - XXX
You can perhaps use this regex:
(.)\1{2}(?!\1).\1{3}
The first dot matches any character, then we backreference it twice, use a negative lookahead to make sure the captured character is not next, use another dot to accept any character once again, and finish with 3 more backreferences.
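As a rough check of the dot-based pattern, here is a small sketch. The helper name find_xxxaxxx and the sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"(.)\1{2}(?!\1).\1{3}")

def find_xxxaxxx(text):
    """Return the first XXXAXXX-style substring, or None if there is none."""
    m = pattern.search(text)
    return m.group() if m else None

print(find_xxxaxxx("zzaaabaaazz"))  # aaabaaa
print(find_xxxaxxx("aaaaaaa"))      # None, the middle character equals X
print(find_xxxaxxx("aaabaa"))       # None, only two X after the middle
```

Note that the dot does not match a newline unless re.DOTALL is passed.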

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only letters 'a', 'b', 'c' are allowed) separated by comma (last group is not ended with comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, because for the above example:
>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False
It seems to check only the first group of three letters and ignore the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$ from the start to the end of the string
[...] one of the given characters
...{3} three repetitions of the preceding element
(...)* zero or more repetitions of the group in the parentheses
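Wrapped in a small helper, the anchored pattern can be tried against the examples from the question. This is a sketch; the names TRIPLES and is_valid are hypothetical.

```python
import re

TRIPLES = re.compile(r"^[abc]{3}(,[abc]{3})*$")

def is_valid(line):
    """True if line consists only of comma-separated triples of a, b, c."""
    return TRIPLES.match(line) is not None

for sample in ("abc,bca,cbb", "bcb", "abc,defx,df", "abc,"):
    print(repr(sample), is_valid(sample))
```

One caveat: $ also matches just before a trailing newline, so "abc\n" would pass; \Z or re.fullmatch is stricter if that matters.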
What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', data)  # data is the string to check
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print(match.group('value'))
So the regex to check whether the whole string matches the pattern will be
data_string = "abc,bca,df"
# the trailing $ makes sure nothing invalid follows the last triple
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print("data string is correct")
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match three characters from [abc] followed by a comma, one or more times, and does not care what comes after that. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
...
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b, and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, ' ', bool(re.match(r'([abc]{3},?)+\Z', ss)))
result
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence forces the match to extend to the very end of the string.
By the way, I like the form of Sonya too, in a way it is clearer:
bool(re.match(r'([abc]{3},)*[abc]{3}\Z',ss))
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...) construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group.
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
          ^^
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example, where re.findall and all other regex methods using this function behind the scenes only return captured substrings if there is a capturing group in the pattern.
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract", and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.
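The re.findall effect mentioned above is easy to reproduce with the standard library alone. A minimal sketch (the sample text and variable names are illustrative):

```python
import re

text = "abc,bca"

# With a capturing group, findall returns only what the group captured
# last, not the whole match.
captured = re.findall(r"([abc]{3},?)+", text)
print(captured)  # ['bca']

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"(?:[abc]{3},?)+", text)
print(whole)     # ['abc,bca']
```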
