How to replace .. in a string in python - python

I am trying to replace this string to become this
import re
s = "haha..hehe.hoho"
s = re.sub('[..+]+',' ', s)
my output i get haha hehe hoho
desired output
haha hehe.hoho
What am i doing wrong?

Test on sites like regexpal: http://regexpal.com/
It's easier to get the output and check if the regex is right.
You should change your regex to something like: '\.\.' if you want to remove only double dots.
If you want to remove when there's at least 2 dots you can use '\.{2,}'.
Every character you put inside a [] will be checked against your expression
And the dot character has a special meaning on a regex, to avoid this meaning you should prefix it with a escape character: \
You can read more about regular expressions metacharacters here: https://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php
[a-z] A range of characters. Matches any character in the specified
range.
. Matches any single character except "n".
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
Your new code:
import re
s = "haha..hehe.hoho"
#pattern = '\.\.' #If you want to remove when there's 2 dots
pattern = '\.{2,}' #If you want to remove when there's at least 2 dots
s = re.sub(pattern, ' ', s)

Unless you are constrained to use regex, then I find the replace() function much simpler:
s = "haha..hehe.hoho"
print s.replace('..',' ')
gives your desired output:
haha hehe.hoho

Change:
re.sub('[..+]+',' ', s)
to:
re.sub('\.\.+',' ', s)

[..+]+ , this meaning in regex is that use the any in the list at least one time. So it matches the .. as well as . in your input. Make the changes as below:
s = re.sub('\.\.+',' ', s)

[] is a character class and will match on anything in it (meaning any 1 .).
I'm guessing you used it because a simple . wouldn't work, because it's a meta character meaning any character. You can simply escape it to mean a literal dot with a \. As such:
s = re.sub('\.\.',' ', s)

Here is what your regex means:
So, you allow for 1 or more literal periods or plus symbols, which is not the case.
You do not have to repeat the same symbol when looking for it, you can use quantifiers, like {2}, which means "exactly 2 occurrences".
You can use split and join, see sample working program:
import re
s = "haha..hehe.hoho"
s = " ".join(re.split(r'\.{2}', s))
print s
Output:
haha hehe.hoho
Or you can use the sub with the regex, too:
s = re.sub(r'\.{2}', ' ', "haha..hehe.hoho")
In case you have cases with more than 2 periods, you should use \.{2,} regex.

Related

Merging three regex patterns used for text cleaning to improve efficiency

Given a text I want to make some modifications:
replace uppercase chars at the beginning of a sentence.
remove chars like ’ or ' (without adding whitespace)
remove unwanted chars for example ³ or ? , ! . (and replace with whitespace)
def multiple_replace(text):
# first sub so words like can't will change to cant and not can t
first_strip=re.sub("[’']",'',text)
def cap(match):
return (match.group().lower())
p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
#second sub to change all words that begin a sentence to lowercase
second_strip = p.sub(cap,first_strip)
# third_strip is to remove all . from text unless they are used in decimal numbers
third_strip= re.sub(r'(?<!\d)\.|\.(?!\d)','',second_strip)
# fourth strip to remove unexpected char that might be in text for example !,?³ and replace with whitespace
forth_strip=re.sub('[^A-Za-z0-9##_$&%]+',' ', third_strip)
return forth_strip
I am wondering if there is a more efficient way of doing it? Because I am going over the text 4 times just so it can be in the right format for me to parse. This seems a lot especially if there are millions of documents. Is there a more efficient way of doing this?
You could make use of an alternation to match either an uppercase char A-Z at the start of the string, or after . ? or ! followed by a whitespace char.
I think you can also add a . to the negated character class [^A-Za-z0-9##_$&%.]+ to not remove the dot for a decimal value and change the order of operations to use cap first before removing any dots.
import re
def cap(match):
return match.group().lower()
p = re.compile(r'(?<=[.?!]\s)[A-Z]|^[A-Z]', re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
first_strip = p.sub(cap, text)
second_strip = re.sub(r"[`']+|(?<!\d)\.|\.(?!\d)", '', first_strip)
third_strip = re.sub('[^A-Za-z0-9##_$&%.]+', ' ', second_strip)
print(third_strip)
Output
a test here this is test but keep 1.2
Python demo
You could also use a lambda with all 3 patterns and 2 capturing groups checking the group values in the callback, but I think that would not benefit the readability or making it easier to change or test.
import re
p = re.compile(r"(?:((?<=[.?!]\s)[A-Z]|^[A-Z])|[`']+|((?<!\d)\.|\.(?!\d))|[^A-Za-z0-9##_$&%.]+)", re.M)
text = "A test here. this `` (*)is. Test, but keep 1.2"
result = re.sub(p, lambda x: x.group(1).lower() if x.group(1) else ('' if x.group(2) else ' '), text)
print(result)
Output
a test here this is test but keep 1.2
Python demo

Why can I not use re.sub to replace a group?

My goal is to find a group in a string using regex and replace it with a space.
The group I am looking to find is a group of symbols only when they fall between strings. When I use re.findall() it works exactly as expected
word = 'This##Is # A # Test#'
print(word)
re.findall(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",word)
>>> ['##', '# ', '# ', '']
But when I use re.sub(), instead of replacing the group, it replaces the entire regex.
x = re.sub(r"[a-zA-Z\s]*([\$\#\%\!\s]*)[a-zA-Z]",r' ',word)
print(x)
>>> ' #'
How can I use regular expressions to replace ONLY the group? The outcome I expect is:
'This Is A Test#'
First, there's no need to escape every "magic" character within a character class, [$#%!\s]* is equally fine and much more readable.
Second, matching (i.e. retrieving) is different from replacing and you could use backreferences to achieve your goal.
Third, if you only want to have # at the end, you could help yourself with a much easier expression:
(?:[\s#](?!\Z))+
Which would then need to be replaced by a space, see a demo on regex101.com.
In Python this could be:
import re
string = "This##Is # A # Test#"
rx = re.compile(r'(?:[\s#](?!\Z))+')
new_string = rx.sub(' ', string)
print(new_string)
# This Is A Test#
You can group the portions of the pattern you want to retain and use backreferences in your replacement string instead:
x = re.sub(r"([a-zA-Z\s]*)[\$\#\%\!\s]*([a-zA-Z])", r'\1 \2', word)
The problem is that your regex matches the wrong thing entirely.
x = re.sub(r'\b[$#%!\s]+\b', ' ', word)

Python .replace() function, removing backslash in certain way

I have a huge string which contains emotions like "\u201d", AS WELL AS "\advance\"
all that I need is to remove back slashed so that:
- \u201d = \u201d
- \united\ = united
(as it breaks the process of uploading it to BigQuery database)
I know it should be somehow this way:
string.replace('\','') But not sure how to keep \u201d emotions.
ADDITIONAL:
Example of Unicode emotions
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f
You can split on all '\' and then use a regex to replace your emotions with adding leading '\'
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emotions are 'u' and 4 hexa numbers, 'u[a-f0-9]{4}' will match them all, and you just have to add leading backslashes
First of all, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\')
And then we match every "emotion" with the regex u[a-f0-9]{4} (Which is u with 4 hex letters behind)
And with the regex sub, you replace every match with a leading \\
You could simply add the backslash in front of your string after replacement if your string starts with \u and have at least one digit.
import re
def clean(s):
re1='(\\\\)' # Any Single Character "\"
re2='(u)' # Any Single Character "u"
re3='.*?' # Non-greedy match on filler
re4='(\\d)' # Any Single Digit
rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
m = rg.search(s)
if m:
r = '\\'+s.replace('\\','')
else:
r = s.replace('\\','')
return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your sting if multiple entries are in the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])
You can use this simple method to replace the last occurence of your character backslash:
Check the code and use this method.
def replace_character(s, old, new):
return (s[::-1].replace(old[::-1],new[::-1], 1))[::-1]
replace_character('\advance\', '\','')
replace_character('\u201d', '\','')
Ooutput:
\advance
\u201d
You can do it as simple as this
text = text.replace(text[-1],'')
Here you just replace the last character with nothing

Add space after full stops

What is the clean way in python to do this simple text fixing - checking if every full stop (except the last one) is followed by space. Assume that having a dot not followed by an empty space is the only possible error we can get in the input string.
I am doing this:
def textFix(text):
result = re.sub('\.(?!\s)', '. ', text)
if (result[len(result) - 1]) == ' ':
return result[:-1]
return result
You may check it with
\.(?!\s|$)
See the regex demo. It matches a dot not followed with whitespace or end of string, that is, any non-final dot that has no whitespace after it.
Or, you may also consider
\.(?=\S)
to match any dot followed with a non-whitespace char.
See another demo.
Python demo:
import re
rx = r"\.(?=\S)"
s = "Text1. Text2.Text3."
result = re.sub(rx, ". ", s)
print(result)
# => "Text1. Text2. Text3."
Your technique looks perfect. But also include a check to avoid adding space after last dot (.)
\.(?!\s)(?!$)
where (?!$) helps make sure if the . is followed by end of string $ then isn't matched and so no space is added after it.
Regex 101 demo

Replacing spaces after any digit using re.sub

I am using the following code to try and replace any spaces found after a digit in a string derived from a regex with a comma:
mystring = re.sub('\d ', '\d,',mystring)
This however gives me an input of 6 and replaces it with \d,. What is the correct syntax I need to give me 6, as my output?
You need to use capturing group to capture the digits which exists before the space. So that you could refer that particular digit in the replacement part.
mystring = re.sub(r'(\d) ', r'\1,',mystring)
or
Use positive lookbehind.
mystring = re.sub(r'(?<=\d) ', r',',mystring)

Categories

Resources