Why can I not use re.sub to replace a group?

Why can I not use re.sub to replace a group? - python

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> ' #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'

First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Which would then need to be replaced by a space, see a demo on regex101.com.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#

You can group the portions of the pattern you want to retain and use backreferences in your replacement string instead:
x = re.sub(r"([a-zA-Z\s]*)[\$\#\%\!\s]*([a-zA-Z])", r'\1 \2', word)

The problem is that your regex matches the wrong thing entirely.
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)

Related

How to substitute only second occurrence of re.search() group

I need to replace part of the string value with extra zeroes if it needs.
T-46-5-В,Г,6-В,Г ---> T-46-005-В,Г,006-В,Г or
T-46-55-В,Г,56-В,Г ---> T-46-055-В,Г,066-В,Г, for example.
I have Regex pattern ^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$ that retrieves 2 separate groups of the string, that i must change. The problem is I can't substitute back exact same groups with changed values if there is another occurrence of my re.search().group() in the whole string.
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^\D-\d{1,2}-([\d,]+)-[а-яА-я,]+,([\d,]+)-[а-яА-я,]+$"
new_string_parts = ["005", "006"]
new_string = re.sub(re.search(my_pattern, my_string).group(1), new_string_parts[0], my_string)
new_string = re.sub(re.search(my_pattern, my_string).group(2), new_string_parts[1], new_string)
print(new_string)
I get T-4006-005-В,Г,006-В,Г instead of T-46-005-В,Г,006-В,Г because there is another "6" in my_string. How can i solve this?
Thanks for your answers!

Capture the parts you need to keep and use a single re.sub pass with unambiguous backreferences in the replacement part (because they are mixed with numeric string variables):
import re
my_string = "T-46-5-В,Г,6-В,Г"
my_pattern = r"^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$"
new_string_parts = ["005", "006"]
new_string = re.sub(my_pattern, fr"\g<1>{new_string_parts[0]}\g<2>{new_string_parts[1]}\3", my_string)
print(new_string)
# => T-46-005-В,Г,006-В,Г
See the Python demo. Note I also added ёЁ to the Russian letter ranges.
The pattern - ^(\D-\d{1,2}-)[\d,]+(-[а-яёА-ЯЁ,]+,)[\d,]+(-[а-яёА-ЯЁ,]+)$ - now contains parentheses around the parts you do not need to change, and \g<1> refers to the string captured with (\D-\d{1,2}-), \g<2> refers to the value captured with (-[а-яёА-ЯЁ,]+,) and \3 - to (-[а-яёА-ЯЁ,]+).

python findall regex expression

I got a long string and i need to find words which contain the character 'd' and afterwards the character 'e'.
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
b=' '.join(l)
runs1=re.findall(r"\b\w?d.*e\w?\b",b)
print(runs1)
\b is the boundary of the word, which follows with any char (\w?) and etc.
I get an empty list.

You can massively simplify your solution by applying a regex based search on each string individually.
>>> p = re.compile('d.*e')
>>> list(filter(p.search, l))
Or,
>>> [x for x in l if p.search(x)]
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Why didn't re.findall work? You were searching one large string, and your greedy match in the middle was searching across strings. The fix would've been
>>> re.findall(r"\b\S*d\S*e\S*", ' '.join(l))
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']
Using \S to match anything that is not a space.

You can filter the result :
import re
l=[" xkn59438","yhdck2","eihd39d9","chdsye847","hedle3455","xjhd53e","45da","de37dp"]
pattern = r'd.*?e'
print(list(filter(lambda x:re.search(pattern,x),l)))
output:
['chdsye847', 'hedle3455', 'xjhd53e', 'de37dp']

Something like this maybe
\b\w*d\w*e\w*
Note that you can probably remove the word boundary here because
the first \w guarantees a word boundary before.
The same \w*d\w*e\w*

Remove a character in string if it doesn't belong to a group of matching pattern in Python

If I have a string such that it contains many words. I want to remove the closing parenthesis if the word in the string doesn't start with _.
Examples input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that without splitting the words using re.sub?

re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub('(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'

I wouldn't use regex here when a list comprehension can do it.
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
for word in words.split(" ")])
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
for word in words.split(" ")])

How to replace .. in a string in python

I am trying to replace this string to become this
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
my output i get haha hehe hoho
desired output
haha hehe.hoho
What am i doing wrong?

Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside a [] will be checked against your expression
And the dot character has a special meaning on a regex, to avoid this meaning you should prefix it with a escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified
range.
. Matches any single character except "n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = '\.\.' #If you want to remove when there's 2 dots
pattern = '\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)

Unless you are constrained to use regex, then I find the replace() function much simpler:
s = "haha..hehe.hoho"
print s.replace('..',' ')
gives your desired output:
haha hehe.hoho

Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)

[..+]+ , this meaning in regex is that use the any in the list at least one time. So it matches the .. as well as . in your input. Make the changes as below:
s = re.sub('\.\.+',' ', s)

[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)

Here is what your regex means:
So, you allow for 1 or more literal periods or plus symbols, which is not the case.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print s
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I'm tried out of various combinations of regular expressions but I am unable to grasp the working of the patterns properly, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I'm tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alpha alphanumerics greedily followed by either one or no white-space character and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong but I would like to know why. (I expected this to return the entire sentence as the result)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation(python help('re')) says that the ,+,? Match x or x (greedy) repetitions of the preceding RE.
In such a case is simply space the preceding RE for '?' or is '\w+ ' the preceding RE? And what will be the RE for the '' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.

Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and use re.group(0) to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'

Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)

Why do you want to limit the number of white space character in continuation? Because a sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row, but rather a sentence is the area of text that ends with a punctuation mark or rather something that is not in the above sequence including white space.
([a-zA-Z0-9\s])*
The above regex will match a sentence wherein it is a series or spaces in series zero or more times. You can refine it to be the following though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
Which simply states that the above sequence must be prefaced with a alphanumeric character.
Hope this is what you were looking for.

Maybe this will help:
import re
source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one followed by this one
"""
re_sentence = re.compile(r'[^ \n.].*?(\.|\n| +)')
def main():
i = 0
for s in re_sentence.finditer(source):
print "%d:%s" % (i, s.group(0))
i += 1
if __name__ == '__main__':
main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why can I not use re.sub to replace a group? - python

You can group the portions of the pattern you want to retain and use backreferences in your replacement string instead: x = re.sub(r"([a-zA-Z\s])[\$\#\%\!\s]([a-zA-Z])", r'\1 \2', word)

The problem is that your regex matches the wrong thing entirely. x = re.sub(r'\b[$#%!\s]+\b', ' ', word)

Related

How to substitute only second occurrence of re.search() group

python findall regex expression

Remove a character in string if it doesn't belong to a group of matching pattern in Python

How to replace .. in a string in python

Confusing Behaviour of regex in Python

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Why can I not use re.sub to replace a group? - python

You can group the portions of the pattern you want to retain and use backreferences in your replacement string instead: x = re.sub(r"([a-zA-Z\s]*)[\$\#\%\!\s]*([a-zA-Z])", r'\1 \2', word)

The problem is that your regex matches the wrong thing entirely. x = re.sub(r'\b[$#%!\s]+\b', ' ', word)

Related

How to substitute only second occurrence of re.search() group

python findall regex expression

Remove a character in string if it doesn't belong to a group of matching pattern in Python

How to replace .. in a string in python

Confusing Behaviour of regex in Python

Categories

Resources

You can group the portions of the pattern you want to retain and use backreferences in your replacement string instead: x = re.sub(r"([a-zA-Z\s])[\$\#\%\!\s]([a-zA-Z])", r'\1 \2', word)