I've been working on tweets about different movies (using the Twitter Search API), and now I want to replace each match with a fixed string.
I've been struggling with "XMen Apocalypse" because there are many ways it is written in tweets.
I searched for "XMen Apocalypse", "X-Men Apocalypse", "X Men Apocalypse", "XMen", "X-Men", "X Men", and the search returned matches that also include "#xmenmovie", "#xmen", "x-men: apocalypse", etc.
This is the regex that I have:
import re
xmen_regex = re.compile(r"(((#)x[\-]?men:?(apocalypse)?)|(x[\-]? ?men[:]?[ ]?(apocalypse)?))")
def re_place_moviename(text, compiled_regex):
    return re.sub(compiled_regex, "MOVIE_NAME", text.lower())
I have tested it with RegExr, but it still isn't accurate in some edge cases, like: '#xmen blabla' -> replace -> '#MOVIE_NAME blabla' or 'MOVIE_NAMEblabla'.
So, is there a better way to do this? Maybe compile several regexes (in order of increasing length?) and apply them separately?
Edit
Constraints (or summary):
1. I want to find "x-men", "x men", "xmen"
2. All of 1 + " apocalypse"
3. All of 1 + ": apocalypse"
4. Also: "#xmen", "#x-men", "#xmenapocalypse", "#x-menapocalypse"
5. None of these may be a substring of a longer word ("#xmenmovie" or "lovexmen perfect"); there must be at least one space at the beginning and end of the expression.
PS: Other movies are easier, but xmen and others like Rogue One can be written in many ways, and we want to catch most of them.
PS1: I know that \b can help, but I couldn't understand how it works.
This one should do the job:
(?:^|\s)#x[ -]?men:?\s?apocalypse\b
In case of replacement, if you want to keep the space before, use a capture group and put it in the replacement part:
(^|\s)#x[ -]?men:?\s?apocalypse\b
Explanation:
(?:^|\s) : non-capturing group, beginning of string or a whitespace character
# : #
x : x
[ -]? : optional space or dash
men : men
:? : optional colon
\s? : optional space
apocalypse : apocalypse
\b : word boundary
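As a minimal sketch of the capture-group variant above, the replacement can reuse group 1 so the whitespace in front of the match survives. The sample tweets and the MOVIE_NAME placeholder are only illustrations; note that this particular pattern requires both the leading # and the word "apocalypse", so a bare "#xmen" is left alone.

```python
import re

# Group 1 holds either the start of the string (empty) or the whitespace
# character in front of the hashtag, so it can be put back on replacement.
pattern = re.compile(r'(^|\s)#x[ -]?men:?\s?apocalypse\b')

def replace_hashtag(text):
    # \1 re-inserts whatever group 1 matched before the fixed string.
    return pattern.sub(r'\1MOVIE_NAME', text.lower())

middle = replace_hashtag("watching #xmen: apocalypse tonight")
start = replace_hashtag("#x-menapocalypse was fun")
untouched = replace_hashtag("#xmen was fun")
```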
This should work per your (vague) constraints:
(?i)(?<![##])x[- ]?men(?!:)( apocalypse)?
(?i) -- ignore case flag
(?<![##]) -- no # or # before 'xmen'
[- ]? -- optional - or space
(?!:) -- no colon after 'xmen'
( apocalypse)? -- optional apocalypse string
Edit: Instead of requiring a space in front/behind, I think having a boundary (\b) would be more fitting, i.e. (?i)\b(?<!#)(x[- ]?men:?\s?(?:apocalypse)?)\b as 'xmen' may start the sentence.
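One possible way to cover the constraints listed in the question in a single pattern is sketched below. It is an assumption-laden illustration, not a tested production regex: it uses (?<!\S) and (?!\S) to demand whitespace or a string edge on both sides, which also means a name followed directly by punctuation (for example "x-men!") is not replaced.

```python
import re

# (?<!\S)  : preceded by whitespace or the start of the string
# #?       : optional hashtag
# x[- ]?men: "xmen", "x-men" or "x men"
# (?::? ?apocalypse)? : optional "apocalypse", ": apocalypse", " apocalypse"
# (?!\S)   : followed by whitespace or the end of the string
xmen_regex = re.compile(r'(?<!\S)#?x[- ]?men(?::? ?apocalypse)?(?!\S)',
                        re.IGNORECASE)

def replace_movie_name(text, compiled_regex=xmen_regex):
    return compiled_regex.sub("MOVIE_NAME", text)

tweet = "saw #xmen today, X-Men: Apocalypse rocks but #xmenmovie and lovexmen stay"
result = replace_movie_name(tweet)
hashtag = replace_movie_name("#x-menapocalypse blabla")
```

Because the whole alternative is tried at each position, the longer form ("x-men: apocalypse") wins over the bare "x-men" whenever it fits, so there is no need to apply several regexes in order of length.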
I am learning regex using Python and am a little confused by this tutorial I am following. Here is the example:
import re
rand_str_2 = "doctor doctors doctor's"
# Match doctor doctors or doctor's
regex = re.compile("[doctor]+['s]*")
matches = re.findall(regex, rand_str_2)
print("Matches :", len(matches))
I get 3 matches
When I do the same thing but replace the * with a ? I still get three matches
regex = re.compile("[doctor]+['s]?")
When I look into the documentation I see that the * finds 0 or more and ? finds 0 or 1
My understanding of this is that it would not return "3 matches" because it is only looking for 0 or 1.
Can someone offer a better understanding of what I should expect out of these two Quantifiers?
Thank you
You are correct about the behavior of the two quantifiers. When using the *, the three matches are "doctor", "doctors" and "doctor's". When using the ?, the three matches are "doctor", "doctors" and "doctor'". With the * it tries to match the characters in the character class (' and s) 0 or more times. Thus, for the final match it is greedy and matches as many times as possible, matching both ' and s. However, the ? will only match at most one character in the character class, so it matches only the '.
The reason this happens is the character class in that specific expression. The square brackets are telling whatever is reading the expression to "match any single character in this list". This means that it is looking for either a ' or an s to satisfy the expression.
Now you can see how the quantifier affects this. Doing ['s]? is telling the pattern to "match ' or s between 0 and 1 times, as many times as possible", so it matches the ' and stops right before the s.
Doing ['s]* on the other hand is telling it to "match ' or s between 0 and infinity, as many times as possible". In this case it will match both the ' and the s because they're both in the list of characters it's trying to match.
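A short sketch that prints what each quantifier actually returns for the string from the question. The last pattern, with a non-capturing group instead of character classes, is an assumption about what the tutorial probably intended, not part of the original example.

```python
import re

rand_str_2 = "doctor doctors doctor's"

# * lets the class ['s] repeat, so both ' and s are consumed.
star_matches = re.findall(r"[doctor]+['s]*", rand_str_2)

# ? allows at most one character from the class, so the final s is left out.
question_matches = re.findall(r"[doctor]+['s]?", rand_str_2)

# Assumed intent: the literal word, optionally followed by 's or s.
group_matches = re.findall(r"\bdoctor(?:'s|s)?", rand_str_2)

print(star_matches)
print(question_matches)
print(group_matches)
```

Both versions return three matches, which is why counting with len() hides the difference; only the content of the last match changes.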
I hope this makes sense. If not, feel free to leave a comment and I'll try my best to clarify it.
I'm working on a sentencizer and tokenizer for a tutorial. This means splitting a document string into sentences and sentences into words. Examples:
#Sentencizing
"This is a sentence. This is another sentence! A third..."=>["This is a sentence.", "This is another sentence!", "A third..."]
#Tokenization
"Tokens are 'individual' bits of a sentence."=>["Tokens", "are", "'individual'", "bits", "of", "a", "sentence", "."]
As seen, there's a need for something more than just a string.split(). I'm using re.sub() to append a 'special' tag to each match (and later splitting on this tag), first for sentences and then for tokens.
So far it works great, but there's a problem: how to make a regex that can split at dots, but not at (...) or at numbers (3.14)?
I've been working with these options with lookahead (I need to match the group and then be able to recall it for appending), but none works:
#Do a negative look behind for preceding numbers or dots, central capture group is a dot, do the same as first for a look ahead.
(?![\d\.])(\.)(?<![\d\.])
The application is:
sentence = re.sub(pattern, r'\g<0>' + special_tag, raw_sentence)
I used the following to find the periods that looked relevant:
import re
m = re.compile(r'[0-9]\.[^0-9.]|[^0-9]\.[^0-9.]|[!?]')
st = "This is a sentence. This is another sentence! A third... Pi is 3.14. This is 1984. Hello?"
m.findall(st)
# if you want to use lookahead, you can use something like this:
m = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')
It's not particularly elegant, but I also tried to deal with the case of "We have a .1% chance of success."
Good luck!
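A minimal sketch of how the lookaround pattern above could be plugged into the tag-and-split approach from the question. SPECIAL_TAG and sentencize are hypothetical names chosen for this illustration; the tag only needs to be a string that cannot occur in the text.

```python
import re

# Hypothetical marker used purely for splitting.
SPECIAL_TAG = "<SPLIT>"

# The lookaround version from above: a dot that is not followed by a digit
# or another dot, or a ! or ?.
boundary = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')

def sentencize(text):
    # Append the tag right after every sentence terminator, then split on it.
    tagged = boundary.sub(lambda m: m.group(0) + SPECIAL_TAG, text)
    return [part.strip() for part in tagged.split(SPECIAL_TAG) if part.strip()]

st = "This is a sentence. This is another sentence! A third... Pi is 3.14. This is 1984. Hello?"
sentences = sentencize(st)
print(sentences)
```

Because the lookahead requires a following character, a period at the very end of the text is not tagged; in this sketch that is harmless, since the remainder after the last tag is kept as its own sentence.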
This might be overkill, or need a bit of cleanup, but here is the best regex I could come up with:
((([^\.\n ]+|(\.+\d+))\b[^\.]? ?)+)([\.?!\)\"]+)
To break it down:
[^\.\n ]+ // Matches 1+ times any char that isn't a dot, newline or space.
(\.+\d+) // Captures the special case of decimal numbers
\b[^\.]? ? // \b is a word boundary. This may be optionally
// followed by any non-dot character, and optionally a space.
All these previous parts are matched 1+ times. In order to determine that a sentence is finished, we use the following:
[\.?!\)\"] // Matches any of the common sentences terminators 1+ times
Try it out!
While there are several posts on StackOverflow that are similar to this, none of them involve a situation when the target string is one space after one of the substrings.
I have the following string (example_string):
<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>
I want to extract "I want this string." from the string above. The randomletters will always change, however the quote "I want this string." will always be between [?] (with a space after the last square bracket) and Reduced.
Right now, I can do the following to extract "I want this string".
target_quote_object = re.search('[?](.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text[2:])
This eliminates the ] and the space that always appear at the start of my extracted string, thus only printing "I want this string." However, this solution seems ugly, and I'd rather make re.search() return the correct target string without any modification. How can I do this?
Your '[?](.*?)Reduced' pattern matches a literal ?, then captures any 0+ chars other than line break chars, as few as possible up to the first Reduced substring. That [?] is a character class formed with unescaped brackets, and the ? inside a character class is a literal ? char. That is why your Group 1 contains the ] and a space.
To make your regex match [?] you need to escape [ and ? and they will be matched as literal chars. Besides, you need to add a space after ] to actually make sure it does not land into Group 1. A better idea is to use \s* (0 or more whitespaces) or \s+ (1 or more occurrences).
Use
re.search(r'\[\?]\s*(.*?)Reduced', example_string)
See the regex demo.
import re
rx = r"\[\?]\s*(.*?)Reduced"
s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"
m = re.search(rx, s)
if m:
    print(m.group(1))
# => I want this string.
See the Python demo.
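To make the difference concrete, here is a small side-by-side sketch of the unescaped character class and the escaped version on the sample string. The variable names are only for illustration.

```python
import re

s = "<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>"

# [?] is a character class containing only "?", so the match starts at the
# "?" and the closing bracket plus the space end up inside group 1.
unescaped = re.search(r'[?](.*?)Reduced', s).group(1)

# \[\?] matches the three literal characters "[?]" and \s* consumes the
# whitespace before the capture starts.
escaped = re.search(r'\[\?]\s*(.*?)Reduced', s).group(1)

print(repr(unescaped))
print(repr(escaped))
```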
Regex may not be necessary for this, provided your string is in a consistent format:
mystr = '<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
res = mystr.split('Reduced')[0].split('] ')[1]
# 'I want this string.'
The solution turned out to be:
target_quote_object = re.search('] (.*?)Reduced', example_string)
target_quote_text = target_quote_object.group(1)
print(target_quote_text)
However, Wiktor's solution is better.
You [co]/[sho]uld use Positive Lookbehind (?<=\[\?\]) :
import re
pattern=r'(?<=\[\?\])(\s\w.+?)Reduced'
string_data='<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
print(re.findall(pattern,string_data)[0].strip())
output:
I want this string.
Like the other answer, this might not be necessary. Or just too long-winded for Python.
This method uses one of the common string methods find.
str.find(sub,start,end) will return the index of the first occurrence of sub in the substring str[start:end] or returns -1 if none found.
In each iteration, the index of [?] is retrieved, followed by the index of Reduced. The resulting substring is printed.
Every time this [?]...Reduced pattern is returned, the index is updated to the rest of the string. The search is continued from that index.
Code
s = ' [?] Nice to meet you.Reduced efweww [?] Who are you? Reduced<insert_randomletters>[?] I want this string.Reduced<insert_randomletters>'
idx = s.find('[?]')
while idx != -1:  # find() returns -1 when there is no further match
    start = idx
    end = s.find('Reduced', idx)
    print(s[start+3:end].strip())
    idx = s.find('[?]', end)
Output
$ python splmat.py
Nice to meet you.
Who are you?
I want this string.
reso- lution
sug- gest
evolu- tion
are all words that contain hyphens due to limited space in a line in a piece of text, e.g.
Analysis of two high reso- lution nucleosome maps revealed strong
signals that even though they do not constitute a definite proof are
at least consistent with such a view. Taken together, all these
findings sug- gest the intriguing possibility that nucleosome
positions are the product of a mechanical evolu- tion of DNA
molecules.
I would like to replace them with their natural forms, i.e.
resolution
suggest
evolution
How can I do this in a text with python?
Make sure there is a lowercase letter before - and a lowercase letter after the -+space, capture the letters and use backreferences to get these letters back after replacement:
([a-z])- ([a-z])
See regex demo (replace with \1\2 backreference sequence). Note that you may adjust the number of spaces with a {1,max} quantifier (say, if there is one or two spaces between the parts of the word, use ([a-z])- {1,2}([a-z])). If there can be any whitespace, use \s rather than a literal space.
Python code:
import re
s = 'Analysis of two high reso- lution nucleosome maps revealed strong signals that even though they do not constitute a definite proof are at least consistent with such a view. Taken together, all these findings sug- gest the intriguing possibility that nucleosome positions are the product of a mechanical evolu- tion of DNA molecules.'
s = re.sub(r'([a-z])- ([a-z])', r'\1\2', s)
print(s)
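A small sketch of the \s variant mentioned above, for text where the break after the hyphen may be a newline or several spaces. The sample string is made up for illustration. It also shows why the letters are captured on both sides: a spaced dash between numbers is left untouched.

```python
import re

# Made-up sample: one hyphenation followed by a space, one by a newline,
# one by two spaces, plus a dash that must not be removed.
text = "high reso- lution maps sug-\ngest a range of 1 - 2 and evolu-  tion"

# \s+ accepts any run of whitespace (spaces, tabs, newlines) after the hyphen.
dehyphenated = re.sub(r'([a-z])-\s+([a-z])', r'\1\2', text)

print(dehyphenated)
```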
Use str.replace() to replace "- " with "". For example:
>>> my_text = 'reso- lution'
>>> my_text = my_text.replace('- ', '')
>>> my_text # Updated value without "- "
'resolution'
Given the regex and the word below, I want to match the part after the - (which can also be a _ or space) only if the part after the delimiter is a digit and nothing comes after it (I basically want it to be a number and number only). I am using group statements but it just doesn't seem to work right. It keeps matching the 3 at the beginning (or the 1 at the end if I modify it a bit). How do I achieve this (by using grouping)?
Target word: BR0227-3G1
Regex: ([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*)
It should not match 3G1, G1 , 1G
It should match only pure numbers like 3,10, 2 etc.
Here is also a helper web site for evaluating the regex: http://www.pythonregex.com/
More examples:
It should match:
BR0227-3
BR0227 3
BR0227_3
into groups (BR0227) (3)
It should only match (BR0227) for
BR0227-3G1
BR0227-CS
BR0227
BR0227-
I would use
re.findall(r'^([A-Z]*\s?[0-9]*)[\s_-]*([1-9][1-9]*$)?', str)
Each string starts with the first group and ends with the last group, so the ^ and $ anchors can assist in capture. The $ at the end requires that nothing follows the digits, but the second group is optional so the first group can still be captured.
Since you want the start and (possible) end of the word in groups, then do this:
r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b'
This will put the first part of the word in the first group, and optionally the remainder in the second group. The second group will be None if it didn't match.
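A quick sketch that runs this pattern over the examples from the question, using re.match so that each word is tested from its start. The results dictionary is just a convenience for displaying the groups.

```python
import re

pattern = re.compile(r'\b([A-Z0-9]+)(?:[ _-](\d+))?\b')

words = ["BR0227-3", "BR0227 3", "BR0227_3",
         "BR0227-3G1", "BR0227-CS", "BR0227", "BR0227-"]

# Map each word to its (first part, optional number) groups.
results = {word: pattern.match(word).groups() for word in words}

for word, groups in results.items():
    print(word, groups)
```

For "BR0227-3G1" the trailing \b cannot match between "3" and "G", so the optional part is dropped and the second group stays None.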
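This should match anything followed by '-', ' ', or '_' with only digits after it. The trailing $ is needed so that nothing may follow the digits; without it, BR0227-3G1 would also match.
(.*)[- _](\d+)$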