When I have a string "Mary's!!" I want to get "Mary's!": only one non-alphabetic character should be removed at the beginning and/or the end of each word in the string, never in the middle of a word.
I have this so far in Python 3:
import re
s = "Mary's!! string. With. Punctuation?" # Sample string
out = re.sub(r'[^\w\d\s]','', s)
print(out)
This outputs:
"Marys string With Punctuation"
It removes everything, while it should be like this:
"Mary's! string With Punctuation"
You could require that there is a space next to it (or start/end of string):
re.sub(r'(\s|^)[^\w\d\s]|[^\w\d\s](\s|$)', r'\1\2', s)
Or, alternatively with look-around:
re.sub(r'(?<!\S)[^\w\d\s]|[^\w\d\s](?!\S)', '', s)
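To compare the two patterns side by side, here is a small sketch that runs both on the sample string from the question. The variable names are only illustrative.

```python
import re

s = "Mary's!! string. With. Punctuation?"  # sample string from the question

# Variant 1: capture the neighbouring whitespace (or string boundary)
# and put it back through the \1 and \2 backreferences.
with_groups = re.sub(r'(\s|^)[^\w\d\s]|[^\w\d\s](\s|$)', r'\1\2', s)

# Variant 2: look-arounds only assert the boundary, so nothing is
# consumed and nothing needs to be restored.
with_lookaround = re.sub(r'(?<!\S)[^\w\d\s]|[^\w\d\s](?!\S)', '', s)

print(with_groups)
print(with_lookaround)
```

The first variant relies on unmatched groups being replaced by an empty string, which Python does from version 3.5 on; the look-around variant has no such dependency.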
As part of preprocessing my data, I want to be able to replace anything that comes with a slash till the occurrence of space with empty string. For example, \fs24 need to be replaced with empty or \qc23424 with empty. There could be multiple occurrences of tags with slashes which I want to remove. I have created a "tags to be eradicated" list which I aim to consume in a regular expression to clean the extracted text.
Input String: This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string.
Expected output: This is a string and it contains some texts and tags. which I want to remove from my string.
I am using the regular expression based replace function in Python:
udpated = re.sub(r'/\fs\d+', '')
However, this does not give the desired result. Alternatively, I have built an eradicate list and replace each entry in a loop, from the top number down to the lower ones, but this is a performance killer.
Assuming a 'tag' can also occur at the very beginning of your string, and to avoid false positives, maybe you could use:
\s?(?<!\S)\\[a-z\d]+
And replace with nothing. See an online demo.
\s? - Optionally match a whitespace character (if a tag is mid-string and therefore preceded by a space);
(?<!\S) - Assert position is not preceded by a non-whitespace character (to allow a position at the start of your input);
\\ - A literal backslash.
[a-z\d]+ - 1+ (Greedy) Characters as per given class.
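As a quick check of the pattern explained above, this sketch applies it to the input string from the question. The names `text`, `tag_pattern` and `cleaned` are illustrative only.

```python
import re

text = (r"This is a string \fs24 and it contains some texts and tags "
        r"\qc23424. which I want to remove from my string.")

# Optional leading whitespace, a position not preceded by a non-space
# character, a literal backslash, then lowercase letters and digits.
tag_pattern = re.compile(r'\s?(?<!\S)\\[a-z\d]+')

cleaned = tag_pattern.sub('', text)
print(cleaned)
```

Because only the whitespace in front of a tag is consumed, a tag at the very start of the input leaves the space that followed it, so a final `.strip()` may be wanted in that case.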
First, the / doesn't belong in the regular expression at all.
Second, even though you are using a raw string literal, \ itself has special meaning to the regular expression engine, so you still need to escape it. (Without a raw string literal, you would need '\\\\fs\\d+'.) The \ before f is meant to be used literally; the \ before d is part of the character class matching the digits.
Finally, sub takes three arguments: the pattern, the replacement text, and the string on which to perform the replacement.
>>> re.sub(r'\\fs\d+', '', r"This is a string \fs24 and it contains...")
'This is a string  and it contains...'
Does that work for you?
re.sub(
r"\\\w+\s*", # a backslash followed by alphanumerics and optional spacing;
'', # replace it with an empty string;
input_string # in your input string
)
>>> re.sub(r"\\\w+\s*", "", r"\fs24 hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", "hello there")
'hello there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello there")
'there'
>>> re.sub(r"\\\w+\s*", "", r"\fs24hello \qc23424 there")
'there'
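'\\' matches a literal '\' and '\w+' matches the word characters that follow it (letters, digits, underscore), stopping at the first space or punctuation mark: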
import re
s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""
re.sub(r'\\\w+', '', s)
output:
'This is a string  and it contains some texts and tags . which I want to remove from my string.'
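The output above keeps the whitespace that preceded each tag, which is why a space is left in front of the period. One possible tweak, offered as a sketch rather than as part of the answer above, is to let the pattern also consume the whitespace before the backslash:

```python
import re

s = r"""This is a string \fs24 and it contains some texts and tags \qc23424. which I want to remove from my string."""

# \s* in front of the backslash also removes the whitespace before each tag.
cleaned = re.sub(r'\s*\\\w+', '', s)
print(cleaned)
```

This matches the expected output from the question for this input; it would also glue a tag's neighbours together if the tag were not preceded by whitespace.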
I tried this and it worked fine for me:
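def remover(text, state):
    # Take whatever follows the first backslash, up to the next space.
    removable = text.split("\\")[1]
    removable = removable.split(" ")[0]
    # Rebuild the tag together with its trailing space and remove it.
    removable = "\\" + removable + " "
    text = text.replace(removable, "")
    # Keep looping while any backslash is left in the text.
    state = "\\" in text
    return text, state

text = "hello \\I'm new here \\good luck"
state = True
while state:
    text, state = remover(text, state)
print(text)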
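The loop above relies on every tag being followed by a space; a tag at the very end of the text would never be removed, so the loop would not terminate. A regex-free alternative, sketched here with a hypothetical helper named `drop_tags`, is to split on spaces and discard the tokens that start with a backslash:

```python
def drop_tags(text):
    """Remove every space-separated token that starts with a backslash."""
    kept = [token for token in text.split(" ") if not token.startswith("\\")]
    return " ".join(kept)

text = "hello \\I'm new here \\good luck"
print(drop_tags(text))
print(drop_tags("ends with a tag \\fs24"))
```

This makes a single pass over the text and treats a tag as everything up to the next space, the same assumption the loop version makes.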
I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
    print(x.group(1)) # this returns "col_name"
    print(x.group(2)) # this returns "a sentence of words"
See the Python demo
The [^'] is a negated character class that matches any char but a single quotation mark, see the regex demo.
In case the string can contain escaped single quotes, you may consider replacing ([^']+) with one of the following:
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).
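The two escaped-quote variants can be tried out as follows. This is a sketch under stated assumptions: the input strings are made up for illustration, and a closing quote is appended to each pattern so the match has to end at the real end of the literal.

```python
import re

# Hypothetical inputs: one escapes the quote by doubling it,
# the other escapes it with a backslash.
doubled = "when col_name = 'it''s a sentence'"
backslashed = r"when col_name = 'it\'s a sentence'"

doubled_re = re.compile(r"when\s+(\w+)\s*=\s*'([^']*(?:''[^']*)*)'")
backslash_re = re.compile(r"when\s+(\w+)\s*=\s*'([^\\']*(?:\\.[^'\\]*)*)'")

print(doubled_re.search(doubled).group(2))
print(backslash_re.search(backslashed).group(2))
```

Group 2 still contains the escape sequences themselves; unescaping them is a separate step.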
I have a huge string which contains emoticon escapes like "\u201d", as well as text like "\advance\".
All I need is to remove the backslashes so that:
- \u201d = \u201d
- \united\ = united
(as it breaks the process of uploading it to BigQuery database)
I know it should be something like this:
string.replace('\\', '') but I'm not sure how to keep the \u201d emoticons.
ADDITIONAL:
Examples of Unicode emoticons:
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f
You can remove every '\' and then use a regex to put the leading '\' back in front of your emoticons:
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emoticons are a 'u' followed by 4 hex digits, 'u[a-f0-9]{4}' will match them all, and you just have to add the leading backslashes back.
First of all, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\', '').
Then you match every "emoticon" with the regex u[a-f0-9]{4} (a u followed by 4 hex digits).
Finally, with re.sub, you replace every match with itself preceded by a backslash ('\\').
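A different technique from the one above, shown only as a sketch: instead of removing every backslash and re-adding some, a negative lookahead can skip the backslashes that start a \uXXXX escape in the first place. The sample string is made up from the question's examples.

```python
import re

# Literal backslashes throughout (raw string), as in the question's data.
s = r'\advance\ \united\ \ud83d\udc9e \u201c'

# Remove a backslash unless it is followed by 'u' and four hex digits.
cleaned = re.sub(r'\\(?!u[0-9a-fA-F]{4})', '', s)
print(cleaned)
```

Like the approach above, this cannot tell a real escape from an ordinary word that happens to start with u plus four hex digits after a backslash.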
You could simply add the backslash back in front of your string after the replacement if the string starts with \u and contains at least one digit.
import re
def clean(s):
    re1='(\\\\)' # Any Single Character "\"
    re2='(u)' # Any Single Character "u"
    re3='.*?' # Non-greedy match on filler
    re4='(\\d)' # Any Single Digit
    rg = re.compile(re1+re2+re3+re4,re.IGNORECASE|re.DOTALL)
    m = rg.search(s)
    if m:
        r = '\\'+s.replace('\\','')
    else:
        r = s.replace('\\','')
    return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your string if multiple entries are on the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])
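You can use this simple method to replace the last occurrence of the backslash character:
def replace_character(s, old, new):
    return (s[::-1].replace(old[::-1],new[::-1], 1))[::-1]
print(replace_character('\\advance\\', '\\', ''))
print(replace_character('\\u201d', '\\', ''))
Output:
\advance
u201d
Note that only the last backslash is removed, wherever it occurs, so this suits strings that end in a backslash; a \u201d escape with no trailing backslash loses its only one.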
You can do it as simply as this:
text = text[:-1]
Here you just drop the last character. (Avoid text.replace(text[-1], ''), which removes every occurrence of that character, not only the last one.)
I'm trying to use a regex to clean some data before I insert the items into the database. I haven't been able to solve the issue of removing trailing special characters at the end of my strings.
How do I write this regex to only remove trailing special characters?
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = (re.sub(r'([_+!##$?^])', '', item))
    print (clean_this)
outputs this:
string01 # correct
string02 # incorrect because it removes the _ inside the string
string03 # correct
string041 # incorrect because it removes the _ inside the string
string05a # incorrect because it removes the _ inside the string and not just the trailing _
You could also use the special purpose rstrip method of strings
[s.rstrip('_+!##$?^') for s in strings]
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
You could repeat the character class one or more times with +; otherwise only a single special character would be removed. Then assert the end of the string with $. Note that you don't need the capturing group around the character class:
[_+!##$?^]+$
For example:
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
    print (clean_this)
See the Regex demo | Python demo
If you also want to remove whitespace characters at the end you could add \s to the character class:
[_+!##$?^\s]+$
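A short sketch of the whitespace-aware variant, using made-up inputs that end in a mix of special characters and whitespace:

```python
import re

# Hypothetical inputs with trailing whitespace mixed in with the special characters.
items = ['string03_#_ \n', 'str_ing02_^ ']

# \s inside the class lets the trailing run include spaces and newlines.
cleaned = [re.sub(r'[_+!##$?^\s]+$', '', item) for item in items]
print(cleaned)
```

Underscores inside the string are left alone because the match is anchored to the end of the string.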
You need an end-of-string anchor $:
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
I have a string like this:
s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
I want this text:
result = 'the unicode text I want with an é'
I've tried to use this code:
expr = r'(?<=BEGIN)[\sa-zA-Z]+(?=END)'
result = re.search(expr, s)
result = re.sub(r'(^\s+)|(\s+$)', '', result) # just to strip out leading/trailing white space
But as long as the é is in the string s, re.search always returns None.
Note, I've tried using different combinations of .* instead of [\sa-zA-Z]+ without success.
The character ranges a-z and A-Z only capture ASCII characters. You can use . to capture Unicode characters:
>>> import re
>>> s = u'something extra BEGIN the unicode text I want with an é END some more extra stuff'
>>> print(re.search(r'BEGIN(.+?)END', s).group(1))
the unicode text I want with an é
>>>
Note too that I simplified your pattern a bit. Here is what it does:
BEGIN # Matches BEGIN
(.+?) # Captures one or more characters non-greedily
END # Matches END
Also, you do not need Regex to remove whitespace from the ends of a string. Just use str.strip:
>>> ' a '.strip()
'a'
>>>
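Putting the two suggestions together, here is a minimal sketch that assumes Python 3, where every str is Unicode. It also shows that the original look-around pattern works once the ASCII-only ranges are swapped for \w; the variable names are illustrative.

```python
import re

s = 'something extra BEGIN the unicode text I want with an \u00e9 END some more extra stuff'

# Option 1: capture anything between the markers, then strip the spaces.
between = re.search(r'BEGIN(.+?)END', s).group(1).strip()

# Option 2: keep the look-arounds but use \w, which matches Unicode
# letters for str patterns in Python 3.
with_word_class = re.search(r'(?<=BEGIN)[\s\w]+(?=END)', s).group(0).strip()

print(between)
print(with_word_class)
```

Option 2 stops matching if the text between the markers contains punctuation, so option 1 is the more forgiving of the two.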