I'm trying to use a regex to clean some data before I insert the items into the database. I haven't been able to solve the issue of removing trailing special characters at the end of my strings.
How do I write this regex to only remove trailing special characters?
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = re.sub(r'([_+!##$?^])', '', item)
    print(clean_this)
outputs this:
string01 # correct
string02 # incorrect because it removes the _ inside the string
string03 # correct
string041 # incorrect because it removes the _ inside the string
string05a # incorrect because it removes the _ inside the string and not just the trailing _
You could also use the special-purpose rstrip method of strings, which strips any of the given characters from the right-hand end:
[s.rstrip('_+!##$?^') for s in strings]
# ['string01', 'str_ing02', 'string03', 'string04_1', 'string05_a']
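As a rough check that rstrip and an anchored character class agree on this data, here is a minimal sketch. The TRAILING constant and the trailing_re name are made up for illustration, and re.escape is used so the characters are safe inside the class.

```python
import re

strings = ['string01_', 'str_ing02_^', 'string03_#_', 'string04_1', 'string05_a_']

# Hypothetical constant: the characters to strip from the right-hand end.
TRAILING = '_+!#$?^'

# str.rstrip treats its argument as a set of characters, not as a suffix.
by_rstrip = [s.rstrip(TRAILING) for s in strings]

# Regex counterpart: one or more of those characters, anchored at the end.
trailing_re = re.compile('[' + re.escape(TRAILING) + ']+$')
by_regex = [trailing_re.sub('', s) for s in strings]

print(by_rstrip)
print(by_regex)
```

One difference to keep in mind: $ also matches just before a final newline, whereas rstrip stops at a newline unless it is in the character set. The sample strings contain no newlines, so both approaches give the same list here.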
Repeat the character class one or more times with +; otherwise only a single special character is replaced. Then anchor the match to the end of the string with $. Note that you don't need the capturing group around the character class:
[_+!##$?^]+$
For example:
import re
strings = ['string01_','str_ing02_^','string03_#_', 'string04_1', 'string05_a_']
for item in strings:
    clean_this = re.sub(r'[_+!##$?^]+$', '', item)
    print(clean_this)
If you also want to remove whitespace characters at the end you could add \s to the character class:
[_+!##$?^\s]+$
You need the end-of-string anchor $, plus a + so that a whole run of trailing characters is matched:
clean_this = (re.sub(r'[_+!##$?^]+$', '', item))
Related
I have a string with multiple newline symbols:
text = 'foo\na\nb\n$\n\nxz\nbar'
I want to remove the lines that are shorter than 3 symbols. The desired output is
'foo\n\nbar'
I tried
re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags= re.S)
but this matches only some subset of the string and the result is
'foo\nX\nb\nX\nxz\nbar'
I need somehow to do greedy search and replace the longest string matching the pattern.
re.S makes . match everything including newline, and you don't want that. Instead use re.M so ^ matches beginning of string and after newline, and use:
>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'
That's "from start of a line, match 0-2 non-newline characters, followed by a newline".
I noticed your desired output has a \n\n in it. If that isn't a mistake and blank lines are to be left in, use .{1,2} instead.
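To make the difference between the two quantifiers concrete, here is a small sketch on the question's sample text. The variable names keep_blank and drop_blank are only for illustration.

```python
import re

text = 'foo\na\nb\n$\n\nxz\nbar'

# {1,2} requires at least one character, so the empty line survives.
keep_blank = re.sub(r'(?m)^.{1,2}\n', '', text)

# {0,2} also matches the empty line and removes it.
drop_blank = re.sub(r'(?m)^.{0,2}\n', '', text)

print(repr(keep_blank))
print(repr(drop_blank))
```

The first result is the 'foo\n\nbar' from the question; the second is the 'foo\nbar' shown above.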
You might also want to allow the final line of the string to have an optional terminating newline, for example:
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # same, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # <3 symbols, newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols, no newline
'foo\n'
Perhaps you can use re.findall instead:
text = 'foo\na\nb\n$\n\nxz\nbar'
import re
print(repr("".join(re.findall(r"\n?\w{3,}\n?", text))))
# 'foo\n\nbar'
You can use this regex, which looks for any run of fewer than 3 non-newline characters that follows either start-of-string or a newline and is followed by a newline or end-of-string, and replace it with an empty string:
(^|\n)[^\n]{0,2}(?=\n|$)
In python:
import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))
Output
foo
bar
There's no need to use regex for this.
raw_str = 'foo\na\nb\n$\n\nxz\nbar'
str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])
print(str_res)
Output:
foo
bar
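Joining with '\n' drops a trailing newline if the input had one. If that matters, a variant along these lines keeps each line's own ending. This is only a sketch; drop_short_lines and min_len are hypothetical names, not part of the answer above.

```python
def drop_short_lines(text, min_len=3):
    """Keep only lines with at least min_len characters (line ending excluded)."""
    kept = [
        line
        for line in text.splitlines(keepends=True)  # each line keeps its '\n'
        if len(line.rstrip('\r\n')) >= min_len
    ]
    return ''.join(kept)


print(repr(drop_short_lines('foo\na\nb\n$\n\nxz\nbar')))
print(repr(drop_short_lines('foo\na\nb\n$\n\nxz\nbar\n')))
```

Because the line endings travel with the lines, nothing has to be re-inserted when the pieces are joined back together.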
I have a huge string which contains emoticons such as "\u201d", as well as text like "\advance\".
All I need is to remove the backslashes so that:
- \u201d = \u201d
- \united\ = united
(as they break the process of uploading it to a BigQuery database)
I know it should be something like this:
string.replace('\\', '') but I'm not sure how to keep the \u201d emoticons.
ADDITIONAL:
Example of Unicode emoticons
\ud83d\udc9e
\u201c
\u2744\ufe0f\u2744\ufe0f\u2744\ufe0f
You can split on every '\' and then use a regex to put the leading '\' back in front of your emoticons:
s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'
import re
print(re.sub('(u[a-f0-9]{4})',lambda m: '\\'+m.group(0),''.join(s.split('\\'))))
As your emoticons are a 'u' followed by 4 hex digits, 'u[a-f0-9]{4}' will match them all, and you just have to add the leading backslashes back.
First of all, you delete every '\' in the string with either ''.join(s.split('\\')) or s.replace('\\', '').
Then you match every emoticon with the regex u[a-f0-9]{4} (a u followed by 4 hex digits).
Finally, with re.sub, you prefix every match with a backslash.
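A different way to get the same result in one pass is a negative lookahead: delete a backslash only when it is not the start of a \uXXXX escape. This is a swapped-in technique, not the method described above, and the name lone_backslash is made up for the sketch. Like the approach above, it assumes a real word never starts with u plus four hex digits.

```python
import re

s = '\\advance\\\\united\\ud83d\\udc9e\\u201c\\u2744\\ufe0f\\u2744\\ufe0f\\u2744\\ufe0f'

# Match a backslash only when it is NOT followed by "u" plus four hex digits.
lone_backslash = re.compile(r'\\(?!u[0-9a-fA-F]{4})')

cleaned = lone_backslash.sub('', s)
print(cleaned)
```

The backslashes around advance and united disappear, while every \uXXXX sequence keeps its backslash.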
You could simply put the backslash back in front of your string after the replacement if the string contains \u followed somewhere by at least one digit.
import re
def clean(s):
    re1 = '(\\\\)'  # Any Single Character "\"
    re2 = '(u)'     # Any Single Character "u"
    re3 = '.*?'     # Non-greedy match on filler
    re4 = '(\\d)'   # Any Single Digit
    rg = re.compile(re1 + re2 + re3 + re4, re.IGNORECASE | re.DOTALL)
    m = rg.search(s)
    if m:
        r = '\\' + s.replace('\\', '')
    else:
        r = s.replace('\\', '')
    return r
a = '\\u123'
b = '\\united\\'
c = '\\ud83d'
>>> print(a, b, c)
\u123 \united\ \ud83d
>>> print(clean(a), clean(b), clean(c))
\u123 united \ud83d
Of course, you have to split your string if multiple entries are on the same line:
string = '\\u123 \\united\\ \\ud83d'
clean_string = ' '.join([clean(word) for word in string.split()])
You can use this simple method to remove the last occurrence of the backslash character:
def replace_character(s, old, new):
    return s[::-1].replace(old[::-1], new[::-1], 1)[::-1]
print(replace_character('\\advance\\', '\\', ''))
print(replace_character('\\u201d', '\\', ''))
Output:
\advance
u201d
Note that the only backslash in \u201d is removed as well, so the \uXXXX sequences have to be handled separately if they must be kept.
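If the character to remove is always the last one, you can do it as simply as this:
text = text[:-1]
Here you just drop the last character. (text.replace(text[-1], '') would instead remove every occurrence of that character, not only the last one.)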
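A slightly more defensive sketch: only drop the final character when it actually is a backslash, so strings such as \u201d pass through untouched. The function name drop_trailing_backslash is hypothetical.

```python
def drop_trailing_backslash(text):
    """Remove one trailing backslash, if present; leave everything else alone."""
    return text[:-1] if text.endswith('\\') else text


print(drop_trailing_backslash('\\advance\\'))
print(drop_trailing_backslash('\\u201d'))
```

This leaves any leading backslash in place, so it only covers the trailing half of the problem.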
I have been using a regex to remove special characters from the strings in my list. But now I want to remove special characters except for a few that the user specifies.
The code that I am currently using to remove special character is :
final_list=[re.sub('[^a-zA-Z0-9]+', '', _)for _ in a]
This works fine when I want to remove all the special characters in my list.
Input:
["on#3", "two#", "thre%e"]
output:
['on3', 'two', 'three']
But what I expect, if I ask to remove special characters except $#%, is:
Input:
["on#3", "two#", "thre%e"]
output:
['on#3', 'two#', 'thre%e']
This is my expected output
$#% is just for example. The user can mention any special character and I need the code to not remove the special character mentioned by the user but remove all other special characters.
Add those characters to the character class in the regex:
[re.sub('[^a-zA-Z0-9$#%]+', '', _) for _ in a]
                    ^^^
As @DYZ mentioned, you could also use the regex '[^\w$#%]+'; note that \w also keeps underscores:
[re.sub(r'[^\w$#%]+', '', _) for _ in a]
UPDATE-1
import re
a = ["on#3", "two#", "thre%e"]
special_chars_to_keep = "%"  # here you can change the values
regex = r'[^\w{your_regex}]+'.format(your_regex=special_chars_to_keep)
[re.sub(regex, '', _) for _ in a]
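Since the characters come from the user, it may be safer to escape them before splicing them into the character class. A minimal sketch, assuming the alphanumeric whitelist from the question; the function name remove_special and its keep parameter are made up for illustration.

```python
import re


def remove_special(items, keep=''):
    """Strip every non-alphanumeric character except those listed in keep."""
    # re.escape protects characters such as ], ^, - and \ inside the class.
    pattern = re.compile('[^a-zA-Z0-9' + re.escape(keep) + ']+')
    return [pattern.sub('', item) for item in items]


a = ["on#3", "two#", "thre%e"]
print(remove_special(a))
print(remove_special(a, keep='%'))
print(remove_special(a, keep='$#%'))
```

With an empty keep string this behaves like the original '[^a-zA-Z0-9]+' pattern, so one function covers both cases.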
Just add the characters you want to keep to the character class.
import re
a = ["on#3", "two$", "thre%e"]
final_list = [re.sub('[^a-zA-Z0-9\$#%]+', '', _) for _ in a]
print(final_list)
outputs
['on3', 'two$', 'thre%e']
$ has a special meaning in regular expressions, but inside a character class it is literal, so escaping it with a \ is optional there (and harmless).
If you want to take user input, just use
import re
a = ["on#3", "two$", "thre%e"]
except_special_chars = input('Exceptions:')
final_list = [re.sub('[^a-zA-Z0-9' + re.escape(except_special_chars) + ']+', '', _) for _ in a]
print(final_list)
Then the user types the special characters to keep; re.escape takes care of any that would otherwise be special inside the character class.
When I have a string "Mary's!!" I want to get "Mary's!", so only one non alphabetic character is removed at the beginning and/or the end of each word in the string, not in the middle of the word.
I have this so far in Python 3
import re
s = "Mary's!! string. With. Punctuation?" # Sample string
out = re.sub(r'[^\w\d\s]','', s)
print(out)
This outputs:
"Marys string With Punctuation"
It removes everything, while it should be like this:
"Mary's! string With Punctuation"
You could require that there is a space next to it (or start/end of string):
re.sub(r'(\s|^)[^\w\d\s]|[^\w\d\s](\s|$)', r'\1\2', s)
Or, alternatively with look-around:
re.sub(r'(?<!\S)[^\w\d\s]|[^\w\d\s](?!\S)', '', s)
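For reference, both variants can be checked against the sample sentence. This is just a sketch: \d is dropped from the class because \w already covers digits, and the first variant relies on unmatched groups in the replacement being treated as empty strings, which is the behaviour from Python 3.5 onwards.

```python
import re

s = "Mary's!! string. With. Punctuation?"

# Variant 1: capture the neighbouring whitespace and put it back.
with_groups = re.sub(r'(\s|^)[^\w\s]|[^\w\s](\s|$)', r'\1\2', s)

# Variant 2: look-arounds check the word edge without consuming anything.
with_lookaround = re.sub(r'(?<!\S)[^\w\s]|[^\w\s](?!\S)', '', s)

print(with_groups)
print(with_lookaround)
```

Both remove exactly one punctuation character at a word edge and leave the apostrophe inside Mary's alone.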
I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT: To replace multiple consecutive newline characters (\n) with ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is a lookahead that asserts the newline characters are followed by a capital letter, without consuming that letter.
Well, let's take a look at your regex ([\n]+)([A-Z])+. The first part, ([\n]+), is fine: it matches multiple occurrences of a newline into one group (note that this won't match the carriage return \r). However, the second part, ([A-Z])+, leads to your error. It matches a single uppercase letter into a capturing group, repeatedly if there are several uppercase letters, and each repetition resets the group to the last matched uppercase letter, which is then used for the replacement.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
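A short sketch of what happens there: a quantified capturing group only remembers its final repetition, so every capital letter is consumed by the match but only the last one is put back. The variable names m and eaten are just for illustration.

```python
import re

# A quantified capturing group keeps only its final iteration.
m = re.search(r'([A-Z])+', 'ABRAXAS')
print(m.group(0))  # the whole match
print(m.group(1))  # the last repetition only

i = "Inc\n\nABRAXAS"
eaten = re.sub(r'([\n]+)([A-Z])+', r'____\2', i)
print(repr(eaten))
```

Here group 0 is the full word, group 1 is just its final letter, and the substitution therefore loses every capital except the last.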
You could simply place the + inside the capturing group, so that all the uppercase letters are captured together. You could also just leave it out, as it makes no difference how many uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
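As a sketch of that carriage-return-aware pattern, combined with a lookahead so the capital letter stays out of the match. The sample strings and the name line_breaks are assumptions for illustration.

```python
import re

windows_text = "Inc\r\n\r\nContact"
unix_text = "(2 months)\n\nML"

# [\r\n]+ swallows any mix of carriage returns and newlines;
# the lookahead requires a capital letter next but does not consume it.
line_breaks = re.compile(r'[\r\n]+(?=[A-Z])')

print(line_breaks.sub('____', windows_text))
print(line_breaks.sub('____', unix_text))
```

Because nothing after the line breaks is consumed, no capital letter can be lost, whichever line-ending style the input uses.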
Try:
import re
p = re.compile(r'\r?\n')
test_str = "(2 months)\n\nML"
subst = "_"
result = re.sub(p, subst, test_str)
It will reduce the string to
(2 months)__ML