Regex pattern confusion - python

I am learning regex using Python and am a little confused by this tutorial I am following. Here is the example:
rand_str_2 = "doctor doctors doctor's"
# Match doctor doctors or doctor's
regex = re.compile("[doctor]+['s]*")
matches = re.findall(regex, rand_str_2)
print("Matches :", len(matches))
I get 3 matches
When I do the same thing but replace the * with a ? I still get three matches
regex = re.compile("[doctor]+['s]?")
When I look into the documentation I see that the * finds 0 or more and ? finds 0 or 1
My understanding of this is that it would not return "3 matches" because it is only looking for 0 or 1.
Can someone offer a better understanding of what I should expect out of these two quantifiers?
Thank you

You are correct about the behavior of the two quantifiers. When using the *, the three matches are "doctor", "doctors" and "doctor's". When using the ?, the three matches are "doctor", "doctors" and "doctor'". With the *, the regex tries to match the characters in the character class (' and s) 0 or more times, so for the final word it is greedy and matches as many as possible, taking both ' and s. The ?, however, matches at most one character from the character class, so it takes the ' and stops. The leftover s cannot start a new match because it is not in [doctor], which is why the count stays at three.

The reason this happens is because of the grouping in that specific expression. The square brackets are telling whatever is reading the expression to "match any single character in this list". This means that it is looking for either a ' or a s to satisfy the expression.
Now you can see how the quantifier affects this. Doing ['s]? tells the pattern to "match ' or s between 0 and 1 times, as many times as possible", so it matches the ' and stops right before the s.
Doing ['s]* on the other hand is telling it to "match ' or s between 0 and infinity, as many times as possible". In this case it will match both the ' and the s because they're both in the list of characters it's trying to match.
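A quick way to see the difference is to print the matches rather than only counting them. This is a minimal sketch using the string from the question; the names star_matches, optional_matches and intended are illustrative, and the last pattern is only one possible way to match the three intended words, not the tutorial's.

```python
import re

rand_str_2 = "doctor doctors doctor's"

# [doctor]+ is a character class: one or more of the letters d, o, c, t, r
star_matches = re.findall(r"[doctor]+['s]*", rand_str_2)
optional_matches = re.findall(r"[doctor]+['s]?", rand_str_2)

print(star_matches)      # ['doctor', 'doctors', "doctor's"]
print(optional_matches)  # ['doctor', 'doctors', "doctor'"]

# Assumed intent: the literal word "doctor" with an optional s or 's suffix
intended = re.findall(r"\bdoctor(?:'s|s)?\b", rand_str_2)
print(intended)          # ['doctor', 'doctors', "doctor's"]
```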
I hope this makes sense. If not, feel free to leave a comment and I'll try my best to clarify it.

Related

Match strings with alternating characters

I want to match strings in which every second character is same.
for example 'abababababab'
I have tried this : '''(([a-z])[^/2])*'''
The output should return the complete string as it is like 'abababababab'
Strictly speaking, this is awkward to do with a formal regular expression (a Chomsky type-3 grammar): without backreferences you would need a separate alternative for every possible repeated character, so the pattern grows with the alphabet size.
However, Python's regexes are not actually regular expressions, and can handle much more complex grammars than that. In particular, you could put your grammar as the following.
(..)\1*
(..) is a sequence of 2 characters. \1* matches the exact pair of characters an arbitrary (possibly null) number of times.
I interpreted your question as wanting every other character to be equal (ababab works, but abcbdb fails). If you needed only the 2nd, 4th, ... characters to be equal you can use a similar one.
.(.)(.\1)*
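Both patterns can be checked with re.fullmatch, which requires the whole string to match. This is a small sketch; the helper names are made up for illustration.

```python
import re

def repeats_one_pair(text):
    # (..) captures the first two characters; \1* repeats that exact pair
    return re.fullmatch(r"(..)\1*", text) is not None

def every_second_char_equal(text):
    # (.) captures the 2nd character; each later pair must end with it
    return re.fullmatch(r".(.)(.\1)*", text) is not None

print(repeats_one_pair("abababababab"))    # True
print(repeats_one_pair("abcbdb"))          # False
print(every_second_char_equal("abcbdb"))   # True
print(every_second_char_equal("abcbdc"))   # False
```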
You could match the first [a-z] followed by capturing ([a-z]) in a group. Then repeat 0+ times matching again a-z and a backreference to group 1 to keep every second character the same.
^[a-z]([a-z])(?:[a-z]\1)*$
Explanation
^ Start of the string
[a-z]([a-z]) Match a-z and capture in group 1 matching a-z
(?:[a-z]\1)* Repeat 0+ times matching a-z followed by a backreference to group 1
$ End of string
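The breakdown above can be tried on a few inputs. This is a minimal sketch; the test strings other than the first are illustrative choices, not from the question.

```python
import re

pattern = re.compile(r"^[a-z]([a-z])(?:[a-z]\1)*$")

results = {}
for candidate in ["abababababab", "ababacababab", "aba"]:
    results[candidate] = bool(pattern.match(candidate))
    print(candidate, results[candidate])
# abababababab True
# ababacababab False
# aba False
```

Because the pattern consumes characters in pairs, it rejects odd-length strings such as aba; whether that is wanted depends on the use case.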
Regex demo
Though not a regex answer, you could do something like this:
def all_same(string):
    return all(c == string[1] for c in string[1::2])
string = 'abababababab'
print('All the same {}'.format(all_same(string)))
string = 'ababacababab'
print('All the same {}'.format(all_same(string)))
the string[1::2] says start at the 2nd character (1) and then pull out every second character (the 2 part).
This returns:
All the same True
All the same False
This is a somewhat complicated expression; maybe we could start with:
^(?=^[a-z]([a-z]))([a-z]\1)+$
if I understand the problem right.
Demo

Find String Between Two Substrings in Python When There is A Space After First Substring

While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and the space that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the target string without any modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
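The difference between the character class and the escaped literal is easy to see side by side. This is a minimal sketch using the string from the question; the variable names unescaped and escaped are illustrative.

```python
import re

example_string = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"

# Unescaped: [?] is a character class that matches only the ? character
unescaped = re.search(r"[?](.*?)Reduced", example_string)
print(repr(unescaped.group(1)))  # '] I want this string.'

# Escaped: \[\?] matches the literal text [?]; \s* absorbs the space
escaped = re.search(r"\[\?]\s*(.*?)Reduced", example_string)
print(repr(escaped.group(1)))    # 'I want this string.'
```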
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
See the regex demo.
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(rx, s)
if m:
    print(m.group(1))
# => I want this string.
See the Python demo.
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved, followed by the index of Reduced, and the substring between them is printed.
Each time a [?]...Reduced pair is found, the search index moves past it and the search continues from there through the rest of the string.
Code
s = (' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this '
     'string.Reduced<insert_randomletters>')
idx = s.find('[?]')
while idx != -1:  # compare by value; "is not -1" tests identity
    start = idx
    end = s.find('Reduced', idx)
    print(s[start+3:end].strip())
    idx = s.find('[?]', end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.

Python RegEx Overlapping

The title of this question probably isn't sufficient to describe the problem I'm trying to solve so hopefully my example gets the point across. I am hoping a Python RegEx is the right tool for the job:
First, we're looking for any one of these strings:
CATGTG
CATTTG
CACGTG
Second, the pattern is:
string
6-7 letters
string
Example
match: CATGTGXXXXXXCACGTG
no match: CATGTGXXXCACGTG (because 3 letters between)
Third, when a match is found, begin the next search from the end of the previous match, inclusive. Report index of each match.
Example:
input (spaces for readability): XXX CATGTG XXXXXX CATTTG XXXXXXX CACGTG XXX
workflow (spaces for readability):
found match: CATGTG XXXXXX CATTTG
it starts at 3
resuming search at C in CATTTG
found match: CATTTG XXXXXXX CACGTG
it starts at 15
and so on...
After a few hours of tinkering, my sorry attempt did not yield what I expected:
regex = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group())
3 CATGTG
15 CATTTG (incorrect)
You're a genius if you can figure this out with a RegEx. Thanks :D
You can use this kind of pattern:
import re
s='XXXCATGTGXXXXXXCATTTGXXXXXXXCACGTGXXX'
regex = re.compile(r'(?=(((?:CATGTG|CATTTG|CACGTG).{6,7}?)(?:CATGTG|CATTTG|CACGTG)))\2')
for m in regex.finditer(s):
    print(m.start(), m.group(1))
The idea is to put the whole string inside the lookahead and to use a backreference to consume characters you don't want to test after.
The first capture group contains the whole sequence, the second contains all characters until the next start position.
Note that you can change (?:CATGTG|CATTTG|CACGTG) to CA(?:TGTG|TTTG|CGTG) to improve the pattern.
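Putting the lookahead, the backreference and the shortened motif together might look like the sketch below. The names MOTIF and find_spaced_motifs are illustrative, and the input is built by concatenation only to make the gap lengths visible.

```python
import re

MOTIF = r"CA(?:TGTG|TTTG|CGTG)"  # same three strings as CATGTG|CATTTG|CACGTG

# Group 1: the whole motif/gap/motif hit (captured inside the lookahead)
# Group 2: first motif plus gap; \2 consumes it so the next search
#          resumes at the second motif
pattern = re.compile(rf"(?=(({MOTIF}.{{6,7}}?){MOTIF}))\2")

def find_spaced_motifs(sequence):
    """Return (start index, matched text) for each hit."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]

s = "XXX" + "CATGTG" + "XXXXXX" + "CATTTG" + "XXXXXXX" + "CACGTG" + "XXX"
print(find_spaced_motifs(s))
# [(3, 'CATGTGXXXXXXCATTTG'), (15, 'CATTTGXXXXXXXCACGTG')]

print(find_spaced_motifs("CATGTGXXXCACGTG"))  # [] (only 3 letters between)
```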
The main issue is that in order to use the | character, you need to enclose the alternatives in parentheses.
Assuming from your example that you want only the first matching string, try the following:
regex = re.compile("(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")
for m in regex.finditer('ATTCATGTG123456CATTTGCCG'):
    print(m.start(), m.group(1))
Note the .group(1), which will match only what's in the first set of parentheses, as opposed to .group() which will return the whole match.
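To see why the parentheses matter, the ungrouped pattern from the question can be compared with the grouped one on the same input. This is a minimal sketch; the variable names are illustrative.

```python
import re

text = "ATTCATGTG123456CATTTGCCG"

# Ungrouped: | has the lowest precedence, so this is really five alternatives:
# CATGTG, CATTTG, CACGTG(?=.{6,7})CATGTG, CATTTG, CACGTG
ungrouped = re.compile("CATGTG|CATTTG|CACGTG(?=.{6,7})CATGTG|CATTTG|CACGTG")
ungrouped_hits = [(m.start(), m.group()) for m in ungrouped.finditer(text)]
print(ungrouped_hits)  # [(3, 'CATGTG'), (15, 'CATTTG')]

# Grouped: each set of alternatives is confined to its own parentheses
grouped = re.compile("(CATGTG|CATTTG|CACGTG).{6,7}(?:CATGTG|CATTTG|CACGTG)")
grouped_hits = [(m.start(), m.group(1)) for m in grouped.finditer(text)]
print(grouped_hits)    # [(3, 'CATGTG')]
```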

Regex to find&replace movie names python

I've been working on tweets about different movies (using the Twitter Search API) and now I wanted to replace the match by a fixed string.
I've been struggling with "XMen Apocalypse" because there are many ways to find this on tweets.
I looked for "XMen Apocalypse", "X-Men Apocalypse", "X Men Apocalypse", "XMen", "X-Men", "X Men" and it retrieved matches that also include "#xmenmovie", "#xmen", "x-men: apocalypse", etc...
This is the regex that I have:
xmen_regex = re.compile("(((#)x[\-]?men:?(apocalypse)?)|(x[\-]? ?men[:]?[ ]?(apocalypse)?))")
def re_place_moviename(text, compiled_regex):
    return re.sub(compiled_regex, "MOVIE_NAME", text.lower())
I have tested it with RegExr, but it still isn't accurate in some edge cases, like: '#xmen blabla' -> replace -> '#MOVIE_NAME blabla' or 'MOVIE_NAMEblabla'.
So, is there a better way to do this? Maybe compile several regexes (in increasing length order?) and apply them separately?
edit
Constraints (or summary):
1. I want to find "x-men", "x men", "xmen"
2. All of 1 + " apocalypse"
3. All of 1 + ": apocalypse"
4. Also: "#xmen", "#x-men", "#xmenapocalypse", "#x-menapocalypse"
5. None of them may be a substring ("#xmenmovie" or "lovexmen perfect"); each must have at least 1 space at the beginning and end of the expression.
PS: Other movies are easier, but for xmen and others like Rogue One there are many ways to express the title, and we want to catch most of them.
PS1: I know that \b can help, but I couldn't understand how it works.
This one should do the job:
(?:^|\s)#x[ -]?men:?\s?apocalypse\b
In case of replacement, if you want to keep the space before, use a capture group and put it in the replacement part:
(^|\s)#x[ -]?men:?\s?apocalypse\b
Explanation:
(?:^|\s) : non capture group, beginning of string or a space
# : #
x : x
[ -]? : optional space or dash
men : men
:? : optional colon
\s? : optional space
apocalypse : apocalypse
\b : word boundary
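The capture-group variant can be used with re.sub as in the sketch below. The function name replace_hashtag and the use of re.IGNORECASE (instead of lowercasing the text first) are assumptions for illustration.

```python
import re

# Pattern from the answer above; group 1 holds the leading whitespace (or
# nothing at the start of the string) so the replacement can put it back.
hashtag_regex = re.compile(r"(^|\s)#x[ -]?men:?\s?apocalypse\b", re.IGNORECASE)

def replace_hashtag(text):
    return hashtag_regex.sub(r"\1MOVIE_NAME", text)

print(replace_hashtag("loved #xmenapocalypse today"))  # loved MOVIE_NAME today
print(replace_hashtag("#X-MenApocalypse rocks"))       # MOVIE_NAME rocks
print(replace_hashtag("ugh #xmenapocalypsenow"))       # ugh #xmenapocalypsenow
```

The last call is left unchanged because \b requires a word boundary right after apocalypse, and the following n is still a word character.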
This should work per your (vague) constraints:
(?i)(?<![##])x[- ]?men(?!:)( apocalypse)?
(?i) -- ignore case flag
(?<![##]) -- no # or # before 'xmen'
[- ]? -- optional - or space
(?!:) -- no colon after 'xmen'
( apocalypse)? -- optional apocalypse string
Edit: Instead of requiring a space in front/behind, I think having a boundary (\b) would be more fitting, i.e. (?i)\b(?<!#)(x[- ]?men:?\s?(?:apocalypse)?)\b as 'xmen' may start the sentence.
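If the "must not be a substring" constraint from the question is taken literally, whitespace lookarounds are another option. The sketch below is an assumption about what the asker wants, not a tested production pattern; movie_regex and replace_movie_name are illustrative names.

```python
import re

movie_regex = re.compile(
    r"(?<!\S)"              # start of string or whitespace before
    r"#?x[- ]?men"          # xmen, x-men, x men, with an optional #
    r"(?::? ?apocalypse)?"  # optional 'apocalypse', ' apocalypse', ': apocalypse'
    r"(?!\S)",              # end of string or whitespace after
    re.IGNORECASE,
)

def replace_movie_name(text):
    return movie_regex.sub("MOVIE_NAME", text)

print(replace_movie_name("#xmen blabla"))               # MOVIE_NAME blabla
print(replace_movie_name("x-men: apocalypse was fun"))  # MOVIE_NAME was fun
print(replace_movie_name("#xmenmovie"))                 # #xmenmovie
print(replace_movie_name("lovexmen perfect"))           # lovexmen perfect
```

One trade-off of this approach: a title followed directly by punctuation, such as xmen!, is not replaced, because the lookahead only accepts whitespace or the end of the string.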

Regex search to extract float from string. Python

import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search('(have )(-\w[.]+)( dollars\w+)',sequence)
print(m.group(0))
print(m.group(1))
print(m.group(2))
I am looking for a way to extract text between two occurrences. In this case, the format is 'i have ' followed by a negative float and then ' dollars\w+'.
How do I use re.search to extract this float?
Why don't the groups work this way? I know there's something I can tweak to get it to work with these groups. Any help would be greatly appreciated.
I thought I could use groups with parentheses but I got an error.
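-\w[.]+ does not match -0.03: \w matches a single word character (the 0), and [.]+ then matches one or more literal dots (inside [...] the . is literal), so nothing in the pattern can match the trailing digits 03.
The \w+ after dollars also prevents the pattern from matching the sequence: there is no word character directly after dollars.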
Use (-?\d+\.\d+) as pattern:
import re
sequence = 'i have -0.03 dollars in my hand'
m = re.search(r'(have )(-?\d+\.\d+)( dollars)', sequence)
print(m.group(1))  # captured groups start from 1
print(m.group(2))
print(m.group(3))
BTW, captured group numbers start from 1. (group(0) returns the entire matched string)
Your regex doesn't match for several reasons:
it always requires a - (OK in this case, questionable in general)
it requires exactly one digit before the . (and it even allows non-digits like A).
it allows any number of dots, but no more digits after the dots.
it requires one or more alphanumerics immediately after dollars.
So it would match "I have -X.... dollarsFOO in my hand" but not "I have 0.10 dollars in my hand".
Also, there is no use in putting fixed texts into capturing parentheses.
m = re.search(r'\bhave (-?\d+\.\d+) dollars\b', sequence)
would make much more sense.
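The points above can be checked directly. This is a minimal sketch; the names original, odd, fixed and amount are illustrative.

```python
import re

sequence = 'i have -0.03 dollars in my hand'

# The question's pattern fails on the real input...
original = re.compile(r'(have )(-\w[.]+)( dollars\w+)')
print(original.search(sequence))  # None

# ...but matches this odd string
odd = original.search('I have -X.... dollarsFOO in my hand')
print(odd.group(2))  # -X....

# The suggested pattern captures only the number, which float() can parse
fixed = re.compile(r'\bhave (-?\d+\.\d+) dollars\b')
amount = float(fixed.search(sequence).group(1))
print(amount)  # -0.03
```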
This question has already been asked in many formulations before. You're looking for a regular expression that will find a number. Since number formats may include decimals, commas, exponents, plus/minus signs, and leading zeros, you'll need a robust regular expression. Fortunately, this regular expression has already been written for you.
See How to extract a floating number from a string and Regular expression to match numbers with or without commas and decimals in text
