I would like to get 2 captured groups for a pair of consecutive words. I use this regular expression:
r'\b(hello)\b(world)\b'
However, searching "hello world" with this regular expression yields no results:
regex = re.compile(r'\b(hello)\b(world)\b')
m = regex.match('hello world') # m evaluates to None.
You need to allow for space between the words:
>>> import re
>>> regex = re.compile(r'\b(hello)\s*\b(world)\b')
>>> regex.match('hello world')
<_sre.SRE_Match object at 0x7f6fcc249140>
>>>
Discussion
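The regex \b(hello)\b(world)\b requires that the word hello end exactly where the word world begins, with a word boundary \b between them. That cannot happen: \b matches only between a word character and a non-word character (or at the start or end of the string), and both o and w are word characters. Allowing whitespace, \s*, between the two words fixes this.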
If you meant to allow punctuation or other separators between hello and world, then that possibility should be added to the regex.
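As a rough sketch of that last point, one option is to put \W+ (one or more non-word characters) between the two groups. The choice of \W+ and the helper name find_pair are assumptions for illustration, not part of the original question. search is used instead of match so the pair can appear anywhere in the text.

```python
import re

# \W+ is an illustrative separator: one or more non-word characters
# (spaces, commas, dashes, ...). Adjust it to the separators you expect.
PAIR = re.compile(r'\b(hello)\W+(world)\b')

def find_pair(text):
    """Return the two captured words as a tuple, or None if there is no match."""
    m = PAIR.search(text)
    return m.groups() if m else None

print(find_pair('hello world'))
print(find_pair('well, hello... world!'))
print(find_pair('helloworld'))
```

With no separator at all, as in helloworld, nothing matches, because \W+ needs at least one non-word character.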
Related
I have strings consistent with this example:
>>> s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
Every string has the "OS=" expression and its latter part comprises words linked by underscores. The first part of the string up to "OS=" and the actual words linked by underscores differ among strings.
I want to write a regular expression with the 're' module to ignore the first part of the string up to the pattern part, and then return the first two words within that pattern maintaining the underscore between them.
I want:
>>> 'puffin_CuteDeer'
I can get rid of the first part, and am getting close (I think) to handling the pattern part. Here's what I have and what it returns:
>>> example = re.search('(?<=OS=)(.*(?=_))',s)
>>> example.group(0)
>>> 'puffin_CuteDeer_cat'
I have tried many, many different possibilities and none of them are working.
I was surprised that
>>> example = re.search('(?<=OS=)(.*(?=_){2})',s)
did not work.
Your help is sincerely appreciated.
Update: I realize that there are non-regex ways of obtaining the desired output. However, for reasons that are probably beyond the scope of the question, I think regex is the best choice for me.
You can do:
(?<=OS=)[^_]+_[^_]+
The zero-width positive lookbehind, (?<=OS=), asserts that the match is immediately preceded by OS= without including it in the match.
[^_]+ matches one or more characters up to the next _, and _ matches a literal _.
Example:
In [90]: s
Out[90]: 'plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater'
In [91]: re.search(r'(?<=OS=)[^_]+_[^_]+', s).group()
Out[91]: 'puffin_CuteDeer'
You can try this:
import re
s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
s = re.findall(r'(?<=OS=)[a-zA-Z]+_[a-zA-Z]+', s)[0]
Output:
'puffin_CuteDeer'
The following uses a capturing group (...) and negation [^...] to get the desired part:
>>> re.search(r'OS=([^_]+_[^_]+)', s).group(1)
'puffin_CuteDeer'
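One thing to watch with the snippets above: re.search returns None when the string has no OS= part or fewer than two words after it, so calling .group() directly raises an AttributeError. A minimal sketch of a guarded version follows; the helper name first_two_words is made up for illustration.

```python
import re

# Same capturing-group pattern as above, compiled once for reuse.
OS_PAIR = re.compile(r'OS=([^_]+_[^_]+)')

def first_two_words(s):
    """Return the first two underscore-joined words after 'OS=', or None."""
    m = OS_PAIR.search(s)
    return m.group(1) if m else None

s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
print(first_two_words(s))
print(first_two_words("no marker in this one"))
```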
Regex may not be necessary:
s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
right_side = s.split("=")[-1]
"_".join(right_side.split("_")[:2])
# 'puffin_CuteDeer'
I'm very new to regex, and I'm trying to find instances in a string of a word consisting of the letter w or e followed by 2 digits, such as e77 or w10.
Here's the regex that I currently have, which I think finds that (correct me if I'm wrong):
([e|w])\d{0,2}(\.\d{1,2})?
How can I add a space right after the letter e or w? If there are no instances where the criteria is met, I would like to keep the string as is. Do I need to use re.sub? I've read a bit about that.
Input: hello e77 world
Desired output: hello e 77 world
Thank You.
Your regex needs to just look like this:
([ew])(\d{2})
if you want to only match specifically 2 digits, or
([ew])(\d{1,2})
if you also want to match single digits like e4
The parentheses form capturing groups, which can be back-referenced in a search and replace; in Python, that is done with re.sub.
Your replacement string should look like
\1 \2
So it should be as simple as a line like:
re.sub(r'([ew])(\d{1,2})', r'\1 \2', your_string)
EDIT: working code
>>> import re
>>> your_string = 'hello e77 world'
>>>
>>> re.sub(r'([ew])(\d{1,2})', r'\1 \2', your_string)
'hello e 77 world'
This is what you're after:
import re
print(re.sub(r'([ew])(\d{1,2})', r'\g<1> \g<2>', 'hello e77 world'))
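On the "keep the string as is" part of the question: re.sub returns the input unchanged when nothing matches, so no extra check is needed. The sketch below also adds \b on both sides, which is my own assumption about what "a word" means here, so that something like new10 is left alone. The function name space_after_letter is hypothetical.

```python
import re

def space_after_letter(text):
    """Insert a space between a standalone e/w and the two digits after it."""
    # \b on each side restricts the match to whole words such as e77 or w10.
    return re.sub(r'\b([ew])(\d{2})\b', r'\1 \2', text)

print(space_after_letter('hello e77 world'))
print(space_after_letter('hello world'))
print(space_after_letter('new10 w10'))
```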
I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
import re

def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that what is called "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or string boundaries, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
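To make the pieces of that pattern easier to follow, here is the same regex with each part commented, wrapped in a small function. The name collapse_repeats is just for illustration.

```python
import re

# (?<!\S)          not preceded by a non-space character, i.e. at a word start
# ((\S+)(?:\s+\2)) group 2 is the word; group 1 is the word plus one repetition
# (?:\s+\2)+       one or more further repetitions, which the substitution drops
# (?!\S)           not followed by a non-space character, i.e. at a word end
REPEATS = re.compile(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)')

def collapse_repeats(s):
    """Replace three or more consecutive copies of a word with two copies."""
    return REPEATS.sub(r'\1', s)

print(collapse_repeats('bye! bye! bye!'))
print(collapse_repeats('hi hi hi high'))
print(collapse_repeats('a a b'))
```

The two lookarounds are what stop the pattern from treating the hi at the start of high as another repetition.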
You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
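Sample code (this uses the third-party regex module, because the standard re module rejects the variable-width lookbehind (?<= |^)):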
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max:
            out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
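For what it's worth, the same loop idea can be written with itertools.groupby, which groups consecutive equal words for you. This is only an alternative sketch, not the code above; the name cap_repeats and the parameter limit are my own. It also loses extra whitespace between words, as the loop does.

```python
from itertools import groupby, islice

def cap_repeats(s, limit=2):
    """Keep at most `limit` consecutive copies of each word."""
    out = []
    # groupby yields one group per run of identical consecutive words.
    for _, run in groupby(s.split()):
        out.extend(islice(run, limit))
    return ' '.join(out)

print(cap_repeats('bye! bye! bye!'))
print(cap_repeats('hi hi hi hi some words words words', limit=1))
```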
Try the following:
import re
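s = "bye! bye! bye!"  # your string
# the lookarounds keep the match on whole words, so the text after the run is not swallowed
s = re.sub(r'(?<!\S)(\S+)(?: \1){2,}(?!\S)', r'\1 \1', s)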
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
I am trying to do this:
The word test should be found in some text and replaced with <strong>test</strong>. However, Test should also be caught and replaced with <strong>Test</strong>.
I tried this:
word = "someword"
text = "Someword and many words with someword"
pattern = re.compile(word, re.IGNORECASE)
result = pattern.sub('<strong>'+word+'</strong>',text)
but in this case, Someword is becoming someword. Am I using re somehow wrong?
I want <strong>Someword</strong> and many words with <strong>someword</strong>
You need to use a capturing group:
>>> import re
>>> word = "someword"
>>> text = "Someword and many words with someword"
>>> pattern = re.compile('(%s)' % word, re.IGNORECASE)
>>> pattern.sub(r'<strong>\1</strong>',text)
'<strong>Someword</strong> and many words with <strong>someword</strong>'
Here \1 refers to the first captured group, that is, to whatever was matched inside the parentheses.
Also see Search and Replace section of the python re module docs.
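A small variation, in case the word can contain characters that are special in a regex (a dot, a plus sign and so on): re.escape makes the word match literally, and \g<0> in the replacement stands for the whole match, so no explicit group is needed. The function name emphasize is only for illustration.

```python
import re

def emphasize(word, text):
    """Wrap every case-insensitive occurrence of word in <strong> tags."""
    # re.escape makes regex metacharacters in word match literally.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    # \g<0> refers to the entire matched text, preserving its original case.
    return pattern.sub(r'<strong>\g<0></strong>', text)

print(emphasize('someword', 'Someword and many words with someword'))
print(emphasize('a.b', 'axb A.B'))
```

Without re.escape, the pattern a.b would also match axb, since . matches any character.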
This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case, the string doesn't have a digit (\d) at the beginning, but it does start with t, which \w matches.
If you want to check for a digit in your string using the same mechanism, prepend .* to your regex:
digit_regex = re.compile(r'.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
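To put the options from both answers side by side, here is a short sketch using the string from the question. The variable names at_start, anywhere, all_digits and grouped are my own.

```python
import re

digit_regex = re.compile(r'\d')
string = 'this is a string with a 4 digit in it'

# match() only tries at the very start of the string, so this is None.
at_start = digit_regex.match(string)

# search() scans the whole string and returns the first match.
anywhere = digit_regex.search(string)

# findall() returns every non-overlapping match as a list of strings.
all_digits = digit_regex.findall(string)

# Alternatively, keep match() but let .* absorb the surrounding text.
grouped = re.match(r'.*(\d).*', string)

print(at_start)
print(anywhere.group())
print(all_digits)
print(grouped.groups())
```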