One rule that I need is that if the last vowel (aeiou) of a string is before a character from the set ('t','k','s','tk'), then a : needs to be added right after the vowel.
So, in Python if I have the string "orchestras" I need a rule that will turn it into "orchestra:s"
edit: The (t, k, s, tk) would be the final character(s) in the string
re.sub(r"([aeiou])(t|k|s|tk)([^aeiou]*)$", r"\1:\2\3", "orchestras")
re.sub(r"([aeiou])(t|k|s|tk)$", r"\1:\2", "orchestras")
You don't say if there can be other consonants after the t/k/s/tk. The first regex allows for this as long as there aren't any more vowels, so it'll change "fist" to "fi:st" for instance. If the word must end with the t/k/s/tk then use the second regex, which will do nothing for "fist".
If you have not figured it out yet, I recommend trying [python_root]/tools/scripts/redemo.py It is a nice testing area.
Another take on the replacement regex:
re.sub("(?<=[aeiou])(?=(?:t|k|s|tk)$)", ":", "orchestras")
This one does not need to replace using remembered groups.
Related
I need help to find words containing . in middle or at the end with regex in python.
Like N. or N.E. or North.East or N.East.
Not sure if you specifically need to use regex, but here's how you can do it without. Here are a couple of ways of looking at it:
If you're looking anywhere in the word (let's call it MyString) except the first character, you can use MyString[1:].contains('.'), or simply '.' in MyString[1:].
If you want to check the exact center of a string, you can use MyString[len(MyString)/2] == '.'; if the string has an even number of characters, the righthand character will be checked ('d' in 'abcdef', for instance).
If you want to check the very last character without checking anything else, MyString[-1] == '.' is enough.
Assuming that your words are sent as strings, anyway.
Maybe this is what you are looking for:
/\w+\.\w*\.?/g
https://regex101.com/r/iH9bO6/1
^\w+(?:\.\w+)*\.?$
Try this.See demo.
https://regex101.com/r/sS2dM8/15
How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string(used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input:It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output:It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
string,
flags=re.IGNORECASE)
Will print (demo here)
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non capturing group), followed by alpahnum and spaces (\w is [0-9a-zA-Z_], \s is all kind of whitespaces), up until something that's neither an alphanum nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with never going to work, kind of strings. Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with regexp. Rather I would;
Split the input on punctuation characters.
For each fragment do
Set negation counter to 0
Split input into words
For each word
Add negation counter number of NEG_ to the word. (Or mod 2, or 1 if greater than 0)
If original word is in {No,Never,Not} increase negation counter by one.
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
Is there any way to directly replace all groups using regex syntax?
The normal way:
re.match(r"(?:aaa)(_bbb)", string1).group(1)
But I want to achieve something like this:
re.match(r"(\d.*?)\s(\d.*?)", "(CALL_GROUP_1) (CALL_GROUP_2)")
I want to build the new string instantaneously from the groups the Regex just captured.
Have a look at re.sub:
result = re.sub(r"(\d.*?)\s(\d.*?)", r"\1 \2", string1)
This is Python's regex substitution (replace) function. The replacement string can be filled with so-called backreferences (backslash, group number) which are replaced with what was matched by the groups. Groups are counted the same as by the group(...) function, i.e. starting from 1, from left to right, by opening parentheses.
The accepted answer is perfect. I would add that group reference is probably better achieved by using this syntax:
r"\g<1> \g<2>"
for the replacement string. This way, you work around syntax limitations where a group may be followed by a digit. Again, this is all present in the doc, nothing new, just sometimes difficult to spot at first sight.
I am writing a regex that will be used for recognizing commands in a string. I have three possible words the commands could start with and they always end with a semi-colon.
I believe the regex pattern should look something like this:
(command1|command2|command3).+;
The problem, I have found, is that since . matches any character and + tells it to match one or more, it skips right over the first instance of a semi-colon and continues going.
Is there a way to get it to stop at the first instance of a semi-colon it comes across? Is there something other than . that I should be using instead?
The issue you are facing with this: (command1|command2|command3).+; is that the + is greedy, meaning that it will match everything till the last value.
To fix this, you will need to make it non-greedy, and to do that you need to add the ? operator, like so: (command1|command2|command3).+?;
Just as an FYI, the same applies for the * operator. Adding a ? will make it non greedy.
Tell it to find only non-semicolons.
[^;]+
What you are looking for is a non-greedy match.
.+?
The "?" after your greedy + quantifier will make it match as less as possible, instead of as much as possible, which it does by default.
Your regex would be
'(command1|command2|command3).+?;'
See Python RE documentation
This is in reference to a question I asked before here
I received a solution to the problem in that question but ended up needing to go with regex for this particular part.
I need a regular expression to search and replace a string for instances of two vowels in a row that are the same, so the "oo" in "took", or the "ee" in "bees" and replace it with the one of the letters that was replaced and a :.
Some examples of expected behavior:
"took" should become "to:k"
"waaeek" should become "wa:e:k"
"raaag" should become "ra:ag"
Thank you for the help.
Try this:
re.sub(r'([aeiou])\1', r'\1:', str)
Search for ([aeiou])\1 and replace it with \1:
I don't know about python, but you should be able to make the regex case insensitive and global with something like /([aeiou])\1/gi
What NOT to do:
As noted, this will match any two vowels together. Leaving this answer as an example of what NOT to do. The correct answer (in this case) is to use backreferences as mentioned in numerous other answers.
import re
data = ["took","waaeek","raaag"]
for s in data:
print re.sub(r'([aeiou]){2}',r'\1:',s)
This matches exactly two occurrences {2} of any member of the set [aeiou]. and replaces it with the vowel, captured with the parens () and placed in the sub string by the \1 followed by a ':'
Output:
to:k
wa:e:k
ra:ag
You'll need to use a back reference in your search expression. Try something like: ([a-z])+\1 (or ([a-z])\1 for just a double).