I want to remove punctuation such as " ", ' ', ` `, "", '', `` from my string using regex. The code I've written so far only removes the pairs that have a space between them. How do I also remove the empty ones, such as ''?
#Code
s = "hey how ' ' is the ` ` what are '' you doing `` how is everything"
s = re.sub("' '|` `|" "|""|''|``","",s)
print(s)
My expected outcome:
hey how is the what are you doing how is everything
You may use this regex to match all such quotes:
r'([\'"`])\s*\1\s*'
Code:
>>> s = "hey how ' ' is the ` ` what are '' you doing `` how is everything"
>>> print (re.sub(r'([\'"`])\s*\1\s*', '', s))
hey how is the what are you doing how is everything
RegEx Details:
([\'"`]): Match one of the given quotes and capture it in group #1
\s*: Match 0 or more whitespaces
\1: Using back-reference of group #1 make sure we match same closing quote
\s*: Match 0 or more whitespaces
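As a small sketch of the breakdown above, the pattern can be wrapped in a helper. The name remove_empty_quotes is made up for illustration. It shows that the back-reference only removes pairs of the same quote character, so quoted words are left alone.

```python
import re

# Same pattern as in the answer: a quote, optional whitespace,
# the same quote again, then any trailing whitespace.
EMPTY_QUOTES = re.compile(r'([\'"`])\s*\1\s*')

def remove_empty_quotes(text):
    # Hypothetical helper name, not part of the original answer.
    return EMPTY_QUOTES.sub('', text)

s = "hey how ' ' is the ` ` what are '' you doing `` how is everything"
print(remove_empty_quotes(s))
# hey how is the what are you doing how is everything

# Non-empty quotes do not match, so this string comes back unchanged.
print(remove_empty_quotes("say 'hi' to `them`"))
```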
In this case, why not match all word characters, and then join them?
' '.join(re.findall(r'\w+', s))
# 'hey how is the what are you doing how is everything'
I want to parse a str into a list of float values, however I want to be flexible regarding my delimiters. Specifically, I would like to be able to use any of these
s = '3.14; 42.2' # delimiter is '; '
s = '3.14;42.2' # delimiter is ';'
s = '3.14, 42.2' # delimiter is ', '
s = '3.14,42.2' # delimiter is ','
s = '3.14 42.2' # delimiter is ' '
I thought about removing all spaces, but this would disable the last version; I tried the re.split()-function by doing re.split('[;, ]', s) which would work using a single character as delimiter but fails otherwise.
I can however do
s.replace('; ', ';').replace(', ', ';').replace(',', ';').replace(' ', ';')
s.split(';')
which works but seems not really like a good practice or useful - especially if I would add even more delimiters in the future. What would be a good approach to do this?
You can use re.split with the following pattern (the [ ] stands for a single space; the brackets are only there to make it visible):
[;,] ?|[ ]
The pattern matches
[;,] ? Match either ; or , followed by an optional space
| or
[ ] Match a single space
A slightly stricter pattern uses lookarounds to assert a digit on both the left and the right of the delimiter.
(?<=\d)(?:[;,] ?| )(?=\d)
The pattern matches:
(?<=\d) Positive lookbehind, assert a digit to the left
(?: Non capture group for the alternation
[;,] ? Match either ; or , followed by an optional space
| Or
Match a space
) Close non capture group
(?=\d) Positive lookahead, assert a digit to the right
Example code
import re
strings = [
"3.14; 42.2",
"3.14;42.2",
"3.14, 42.2",
"3.14,42.2",
"3.14 42.2"
]
for s in strings:
    print(re.split(r"[;,] ?| ", s))
Output
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
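The stricter lookaround pattern described above can be exercised in the same way. This is only a sketch: parse_floats is a hypothetical helper name, and the conversion to float is added because the question asks for a list of float values.

```python
import re

# Stricter pattern from above: only split between two digits.
STRICT_DELIM = re.compile(r"(?<=\d)(?:[;,] ?| )(?=\d)")

def parse_floats(text):
    # Hypothetical helper: split on the delimiter, then convert each piece.
    return [float(part) for part in STRICT_DELIM.split(text)]

for s in ["3.14; 42.2", "3.14;42.2", "3.14, 42.2", "3.14,42.2", "3.14 42.2"]:
    print(parse_floats(s))  # [3.14, 42.2] each time
```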
I think you can account for the last space(s) like this:
re.split(r'[;,]\s*', s)
Here \s* matches any spaces after the separator. Note that this pattern does not split on a space alone, so '3.14 42.2' stays in one piece.
You can also just do:
res = re.split('; |, |;|,| ', data)  # two-character delimiters first, so ', ' is not split into ',' and ' '
see https://www.geeksforgeeks.org/python-split-multiple-characters-from-string/
Assuming you know the delimiter of the input ahead of time, you could write a function that takes the delimiter as an argument, replaces it with a space, and splits the result:
def split_on_delim(strng, delim):
    return strng.replace(delim, ' ').split()
for example:
>>> s = '3.14; 42.2'
>>> split_on_delim(s, '; ')
['3.14', '42.2']
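For the concern in the question about adding more delimiters later, one possible sketch builds the alternation from a list. The helper name split_on_any is hypothetical. Sorting the delimiters longest first is a design choice so that ', ' is tried before ',' and a delimiter is never split in two.

```python
import re

def split_on_any(text, delimiters):
    # Hypothetical helper: try longer delimiters first, and escape each one
    # so that characters such as '|' or '.' are treated literally.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)

delims = [";", ",", " ", "; ", ", "]
for s in ["3.14; 42.2", "3.14;42.2", "3.14, 42.2", "3.14,42.2", "3.14 42.2"]:
    print(split_on_any(s, delims))  # ['3.14', '42.2'] each time
```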
I have :
string = 'Here it is, your gif! am a bot. [^(Report an issue)] ❤ that bot,I ❤ ur mom **YEET** 😎 ,GOTTEM!"'
and I try :
string = re.sub(r'\W+', ' ', string)
and that gives me :
'Here it is your gif am a bot Report an issue that bot I ur mom YEET GOTTEM'
But I would like this :
'Here it is, your gif! am a bot. (Report an issue) that bot,I ur mom YEET ,GOTTEM!"'
Just the 26 letters, no numbers and only the most used symbols in this group: .,()'"?!
Make a character class of the things you accept (with []) and invert it (with a leading ^, making it [^stuff]):
string = re.sub(r'[^a-zA-Z.,()\'"?! ]+', '', string)
Use this for your regex instead : [^a-zA-Z?!.,()\'" ]+
The brackets define the set of characters you wish to match; the caret at the front negates that set.
Thus leaving you with
pattern = r'[^a-zA-Z?!.,()\'" ]+'
string = re.sub(pattern, ' ', string)
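A minimal sketch of the negated character class in action, using short made-up inputs rather than the full string from the question. The helper name keep_allowed is hypothetical. It removes the unwanted characters outright, so spaces that surrounded a removed symbol remain next to each other.

```python
import re

# Everything NOT in this class is removed: letters, the listed
# punctuation and the space character are kept.
NOT_ALLOWED = re.compile(r'[^a-zA-Z.,()\'"?! ]+')

def keep_allowed(text):
    # Hypothetical helper name, used only for this illustration.
    return NOT_ALLOWED.sub('', text)

print(keep_allowed("a1b2, c3!"))  # ab, c!

# The asterisks and the emoji are dropped; the two spaces around the
# emoji are both kept.
print(keep_allowed("**YEET** \U0001F60E ,GOTTEM!"))
```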
I want to apply a regex on a Latin text, and I followed the solution in this question: How to account for accent characters for regex in Python?, where they suggest to add a # character before the regex.
def clean_str(string):
    string = re.sub(r"#(#[a-zA-Z_0-9]+)", " ", string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' \1 ', string, re.UNICODE)
    string = re.sub(r'#([^a-zA-Z0-9#])', r' ', string, re.UNICODE)
    string = re.sub(r'(\s{2,})', ' ', string, re.UNICODE)
    return string.lower().strip()
My problem is that the regex works in detecting the Latin characters, but none of the substitutions are applied to the text.
Example:
If I have a text like "#aaa bbb các. ddd",
it should become "bbb các . ddd", with a space before the dot and with the tag "#aaa" deleted.
But it produces the same text as the input: "#aaa bbb các. ddd"
Did I miss something?
You have several issues in the current code:
To match any Unicode word char, use \w (rather than [A-Za-z0-9_]) with a Unicode flag
When using re.U with re.sub, remember to either pass the count argument (set it to 0 to replace all occurrences) before the flag, or just use flags=re.U / flags=re.UNICODE
To match any non-word char but a whitespace, you may use [^\w\s]
When you want to replace with the whole match, you do not have to wrap the whole pattern with (...); just use the \g<0> backreference in the replacement pattern.
See an updated method to clean the strings:
>>> def clean_str(s):
...     s = re.sub(r'#\w+', ' ', s, flags=re.U)
...     s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.U)
...     s = re.sub(r'\s{2,}', ' ', s, flags=re.U)
...     return s.lower().strip()
...
>>> print(clean_str(s))
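To make the count/flags point concrete, here is a self-contained sketch for Python 3 that runs the updated function on the example text from the question. The signature of re.sub is (pattern, repl, string, count=0, flags=0), so a flag passed in the fourth position is interpreted as a number of replacements.

```python
import re

# A flag passed as the fourth positional argument is read as count.
# re.UNICODE has the integer value 32, so it would mean
# "replace at most 32 matches" and no flag would be set at all.
print(int(re.UNICODE))  # 32

def clean_str(s):
    s = re.sub(r'#\w+', ' ', s, flags=re.UNICODE)           # drop #tags
    s = re.sub(r'[^\w\s]', r' \g<0>', s, flags=re.UNICODE)  # space before punctuation
    s = re.sub(r'\s{2,}', ' ', s, flags=re.UNICODE)         # collapse whitespace runs
    return s.lower().strip()

print(clean_str("#aaa bbb các. ddd"))  # bbb các . ddd
```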
I want to add spaces before and after commas in a string, but only if the following character isn't a digit (0-9). I tried the following code:
newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
But it seems like the \1 is taking the 2 matching characters and not just the comma.
Example:
>>> newLine = "abc,abc"
>>> newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
"abc ,a bc"
Expected Output:
"abc , abc"
How can I tell the sub to take only the 'comma' ?
Use this one:
newLine = re.sub(r'[,]+(?![0-9])', r' , ', newLine)
Here using negative lookahead (?![0-9]) it is checking that the comma(s) are not followed by a digit.
Your regex didn't work because you picked the comma and the next character(using ([,]+[^0-9])) in a group and placed space on both sides.
UPDATE: If you need to handle other characters besides the comma, place them inside the character class [] and capture them in group 1 using ():
newLine = re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', newLine)
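A short sketch of the negative lookahead on two inputs. The helper name space_commas is made up, and for simplicity it matches a single comma rather than the [,]+ used in the answer.

```python
import re

def space_commas(line):
    # Hypothetical helper: pad a comma with spaces unless a digit follows.
    # The lookahead (?![0-9]) checks the next character without consuming it.
    return re.sub(r',(?![0-9])', ' , ', line)

print(space_commas("abc,abc"))        # abc , abc
print(space_commas("1,000 and a,b"))  # 1,000 and a , b
```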
I want to eliminate all the whitespace from a string, on both ends, and in between words.
I have this Python code:
def my_handle(self):
    sentence = ' hello apple '
    sentence.strip()
But that only eliminates the whitespace on both sides of the string. How do I remove all whitespace?
If you want to remove leading and trailing spaces, use str.strip():
>>> " hello apple ".strip()
'hello apple'
If you want to remove all space characters, use str.replace() (NB this only removes the “normal” ASCII space character ' ' U+0020 but not any other whitespace):
>>> " hello apple ".replace(" ", "")
'helloapple'
If you want to remove duplicated spaces, use str.split() followed by str.join():
>>> " ".join(" hello apple ".split())
'hello apple'
To remove only spaces use str.replace:
sentence = sentence.replace(' ', '')
To remove all whitespace characters (space, tab, newline, and so on) you can use split then join:
sentence = ''.join(sentence.split())
or a regular expression:
import re
pattern = re.compile(r'\s+')
sentence = re.sub(pattern, '', sentence)
If you want to only remove whitespace from the beginning and end you can use strip:
sentence = sentence.strip()
You can also use lstrip to remove whitespace only from the beginning of the string, and rstrip to remove whitespace from the end of the string.
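The options above can be compared side by side. This sketch uses a made-up sample string with a tab and a newline, so that the difference between removing spaces and removing all whitespace is visible.

```python
sentence = "  hello \t apple\n"

stripped = sentence.strip()            # both ends only
no_spaces = sentence.replace(" ", "")  # only the ' ' character
no_ws = "".join(sentence.split())      # every kind of whitespace
single = " ".join(sentence.split())    # runs collapsed to one space

print(repr(stripped))   # 'hello \t apple'
print(repr(no_spaces))  # 'hello\tapple\n'
print(repr(no_ws))      # 'helloapple'
print(repr(single))     # 'hello apple'
```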
An alternative is to use regular expressions and match these strange white-space characters too. Here are some examples:
Remove ALL spaces in a string, even between words:
import re
sentence = re.sub(r"\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the BEGINNING of a string:
import re
sentence = re.sub(r"^\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the END of a string:
import re
sentence = re.sub(r"\s+$", "", sentence, flags=re.UNICODE)
Remove spaces both in the BEGINNING and in the END of a string:
import re
sentence = re.sub(r"^\s+|\s+$", "", sentence, flags=re.UNICODE)
Remove ONLY DUPLICATE spaces:
import re
sentence = " ".join(re.split(r"\s+", sentence, flags=re.UNICODE))
(All examples work in both Python 2 and Python 3)
"Whitespace" includes space, tabs, and CRLF. So an elegant and one-liner string function we can use is str.translate:
Python 3
' hello apple '.translate(str.maketrans('', '', ' \n\t\r'))
OR if you want to be thorough:
import string
' hello apple'.translate(str.maketrans('', '', string.whitespace))
Python 2
' hello apple'.translate(None, ' \n\t\r')
OR if you want to be thorough:
import string
' hello apple'.translate(None, string.whitespace)
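One caveat worth sketching for the Python 3 variant: string.whitespace only contains the ASCII whitespace characters, so Unicode whitespace such as the no-break space (U+00A0) is not removed by translate. The sample text below is made up for illustration.

```python
import re
import string

text = "hello\u00a0apple\tpie"  # contains a no-break space and a tab

# string.whitespace lists ASCII whitespace only, so U+00A0 survives.
ascii_only = text.translate(str.maketrans('', '', string.whitespace))
print(ascii_only == "hello\u00a0applepie")  # True

# On str input, \s and str.split() both treat U+00A0 as whitespace.
via_regex = re.sub(r'\s+', '', text)
via_split = ''.join(text.split())
print(via_regex)  # helloapplepie
print(via_split)  # helloapplepie
```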
For removing whitespace from beginning and end, use strip.
>>> " foo bar ".strip()
'foo bar'
' hello \n\tapple'.translate({ord(c):None for c in ' \n\t\r'})
MaK already pointed out the translate method above. This variation works with Python 3.
In addition, strip has some variations:
Remove spaces in the BEGINNING and END of a string:
sentence= sentence.strip()
Remove spaces in the BEGINNING of a string:
sentence = sentence.lstrip()
Remove spaces in the END of a string:
sentence= sentence.rstrip()
All three string methods, strip, lstrip, and rstrip, accept an optional string of characters to strip; the default is all whitespace. This can be helpful when you are working with something particular. For example, you could remove only spaces but not newlines:
" 1. Step 1\n".strip(" ")
Or you could remove extra commas when reading in a string list:
"1,2,3,".strip(",")
Be careful:
strip does a rstrip and lstrip (removes leading and trailing spaces, tabs, returns and form feeds, but it does not remove them in the middle of the string).
If you only replace spaces and tabs you can end up with hidden CRLFs that appear to match what you are looking for, but are not the same.
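The behaviour of the optional argument can be checked with a few small cases. The sample strings beyond the two shown above are made up. Note that the argument is treated as a set of characters, not as a prefix or suffix.

```python
only_spaces = " 1. Step 1\n".strip(" ")  # newline is kept
no_commas = "1,2,3,".strip(",")          # trailing comma removed
char_set = "xyxhelloyx".strip("xy")      # any mix of 'x' and 'y' at the ends
inner_kept = "  a \t b  ".strip()        # whitespace in the middle stays

print(repr(only_spaces))  # '1. Step 1\n'
print(no_commas)          # 1,2,3
print(char_set)           # hello
print(repr(inner_kept))   # 'a \t b'
```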
eliminate all the whitespace from a string, on both ends, and in between words.
>>> import re
>>> re.sub(r"\s+",  # one or more repetitions of whitespace
...        '',      # replace with empty string (-> remove)
...        ''' hello
...    apple
... ''')
'helloapple'
https://en.wikipedia.org/wiki/Whitespace_character
Python docs:
https://docs.python.org/library/stdtypes.html#textseq
https://docs.python.org/library/stdtypes.html#str.replace
https://docs.python.org/library/string.html#string.replace
https://docs.python.org/library/re.html#re.sub
https://docs.python.org/library/re.html#regular-expression-syntax
I use split() to ignore all whitespace and join() to concatenate the strings.
sentence = ''.join(' hello apple '.split())
print(sentence) #=> 'helloapple'
I prefer this approach because it is a single expression (not a statement).
It is easy to use, and it can be used without binding the result to a variable.
print(''.join(' hello apple '.split())) # no need to bind to a variable
import re
sentence = ' hello apple'
re.sub(' ', '', sentence)   # 'helloapple' (remove all spaces)
re.sub(' +', ' ', sentence) # ' hello apple' (collapse runs of spaces into one)
In the following script we import the regular expression module which we use to substitute one space or more with a single space. This ensures that the inner extra spaces are removed. Then we use strip() function to remove leading and trailing spaces.
# Import regular expression module
import re
# Initialize string
a = " foo bar "
# First replace any number of spaces with a single space
a = re.sub(' +', ' ', a)
# Then strip any leading and trailing spaces.
a = a.strip()
# Show results
print(a)
I found that this works the best for me:
test_string = ' test a s test '
string_list = [s.strip() for s in str(test_string).split()]
final_string = ' '.join(string_list)
# final_string: 'test a s test'
It removes any whitespaces, tabs, etc.
Try this. Instead of using re, I think using split with strip is much better:
def my_handle(self):
    sentence = ' hello apple '
    ' '.join(x.strip() for x in sentence.split())
    # hello apple
    ''.join(x.strip() for x in sentence.split())
    # helloapple