I am trying to split inputted text at spaces, and all special characters like punctuation, while keeping the delimiters. My re pattern works exactly the way I want except that it will not split multiple instances of the punctuation.
Here is my re pattern: wordsWithPunc = re.split(r'([^-\w]+)', words)
If I have a word like "hello" with two punctuation marks after it then those punctuation marks are split but they remain as the same element. For example
"hello,-" will equal "hello",",-" but I want it to be "hello",",","-"
Another example. My name is mud!!! would be split into "My","name","is","mud","!!!" but I want it to be "My","name","is","mud","!","!","!"
You need to drop the + quantifier so that the pattern matches one non-word character at a time, something like:
import re
words = 'My name is mud!!!'
splitted = re.split(r'([^-\w])', words)
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']
This will also produce 'empty' matches between non-word characters (because you're splitting on each of them), but you can mitigate that by postprocessing the result to remove empty matches:
splitted = [match for match in re.split(r'([^-\w])', words) if match]
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']
You can further filter in the list comprehension (i.e. ... if match.strip() ...) if you want to get rid of the space matches as well.
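The two steps above (split on single characters, then drop empty and whitespace-only pieces) can be combined in one helper. This is a minimal sketch; the function name split_keep_punctuation is made up for illustration and is not part of the original answer:

```python
import re

def split_keep_punctuation(text):
    # Split on every single character that is neither a hyphen nor a word
    # character; the capturing group keeps each delimiter in the result.
    tokens = re.split(r'([^-\w])', text)
    # Drop empty strings and whitespace-only tokens.
    return [token for token in tokens if token.strip()]

print(split_keep_punctuation('My name is mud!!!'))
# ['My', 'name', 'is', 'mud', '!', '!', '!']
```

Note that the character class [^-\w] excludes the hyphen, so a hyphen is never a split point; in "hello,-" it survives as its own token only because the comma before it is split off.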
The answer to the question at Python remove all whitespace in a string shows separate ways to remove leading/ending, duplicated, and all spaces, respectively, from a string in Python. But strip() removes tabs and newlines, and lstrip() only affects leading spaces. The solution using .join(sentence.split()) also appears to remove Unicode whitespace characters.
Suppose I have a string, in this case scraped from a website using Scrapy, like this:
['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text', '
']
The newlines preserve the formatting of the text when I use it in other contexts, but all the extra space is a nuisance. How do I remove all the leading, ending, and duplicated internal spaces while preserving the newline characters (in addition to any \r or \t characters, if there are any)?
The result I want (after I join the individual strings) would then be:
['\n\n\nSome text and some more text\nand on another line some more text']
No sample code is provided because what I've tried so far is just the suggestions on the page referenced above, which gets the results I'm trying to avoid.
In that case str.strip() won't help you (even if you use " " as an argument), because it won't remove the spaces inside, only at the start/end of your string, and it would remove the single space before "and" as well.
Instead, use regex to remove 2 or more spaces from your strings:
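import re
# The scraped strings contain runs of several spaces
l = ['\n    \n    ',
     '\n    ',
     'Some text',
     ' and some more text\n',
     ' and on another a line some more text']
# " {2,}" matches two or more consecutive spaces; single spaces between words are kept
result = "".join([re.sub(" {2,}", "", x) for x in l])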
print(repr(result))
prints:
'\n\n\nSome text and some more text\n and on another a line some more text'
EDIT: if we apply the regex to each line, we cannot detect \n in some cases, as you noted. So, the alternate and more complex solution would be to join the strings before applying regex, and apply a more complex regex (note that I changed the test list of strings to add more corner cases):
l= ['\n \n ',
'\n ',
'Some text',
' and some more text \n',
'\n and on another a line some more text ']
import re
result = re.sub("(^ |(?<=\n) | {2,}| (?=\n)| $)","","".join(l))
print(repr(result))
prints:
'\n\n\nSome text and some more text\n\nand on another a line some more text'
There are 5 cases in the regex now that will be removed:
start by one space
space following a newline
2 or more spaces
space followed by a newline
end by one space
Afterthought: this looks (and is) complicated. There is a non-regex solution after all which gives exactly the same result (if there aren't multiple spaces between words):
result = "\n".join([x.strip(" ") for x in "".join(l).split("\n")])
print(repr(result))
Just join the strings, then split on newlines, apply strip with " " as the argument to preserve tabs, and join again with newlines.
Chain with re.sub(" +"," ",x.strip(" ")) to take care of possible double spaces between words:
result = "\n".join([re.sub(" +"," ",x.strip(" ")) for x in "".join(l).split("\n")])
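As a rough sketch, the join / split / strip / collapse steps above can be wrapped in a small function. The name tidy_spaces and the sample data (including the tab) are assumptions added for illustration, chosen to show that tabs and newlines survive while plain spaces are cleaned up:

```python
import re

def tidy_spaces(parts):
    # Join the fragments first so that spaces next to a newline are visible
    # even when the newline sits in a neighbouring fragment.
    lines = "".join(parts).split("\n")
    # Strip only plain spaces (tabs are kept), then collapse runs of spaces.
    cleaned = [re.sub(" +", " ", line.strip(" ")) for line in lines]
    return "\n".join(cleaned)

sample = ['\n \n ', '\n ', 'Some text', ' and some  more text \n', ' next\tline ']
print(repr(tidy_spaces(sample)))
# '\n\n\nSome text and some more text\nnext\tline'
```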
You can also do the whole thing in terms of built in string operations if you like.
l = ['\n \n ',
'\n ',
'Some text',
' and some more text\n',
' and on another a line some more text',
' ']
def remove_duplicate_spaces(l):
    words = [w for w in l.split(' ') if w != '']
    return ' '.join(words)
lines = ''.join(l).split('\n')
formatted_lines = map(remove_duplicate_spaces, lines)
u = "\n".join(formatted_lines)
print(repr(u))
gives
'\n\n\nSome text and some more text\nand on another a line some more text'
You can also collapse the whole thing into a one-liner:
s = '\n'.join([' '.join([s for s in x.strip(' ').split(' ') if s!='']) for x in ''.join(l).split('\n')])
# OR
t = '\n'.join(map(lambda x: ' '.join(filter(lambda s: s!='', x.strip(' ').split(' '))), ''.join(l).split('\n')))
I have a string that can contain - anywhere, along with some whitespace.
Using regex in Python, I want to remove - only if it comes before all other non-whitespace characters or after all non-whitespace characters. I also want to remove all whitespace at the beginning and the end.
For example:
string = ' - test '
it should return
string = 'test'
or:
string = ' -this - '
it should return
string = 'this'
or:
string = ' -this-is-nice - '
it should return
string = 'this-is-nice'
You don't need regex for this. str.strip removes all combinations of the characters passed to it, so pass ' -' or '- ' to it.
>>> s = ' - test '
>>> s.strip('- ')
'test'
>>> s = ' -this - '
>>> s.strip('- ')
'this'
>>> s = ' -this-is-nice - '
>>> s.strip('- ')
'this-is-nice'
To remove any type of white-space character and '-' use string.whitespace + '-'.
>>> from string import whitespace
>>> s = '\t\r\n -this-is-nice - \n'
>>> s.strip(whitespace+'-')
'this-is-nice'
import re
out = re.sub(r'^\s*(-\s*)?|(\s*-)?\s*$', '', input)
This will remove at most one instance of - at the beginning of the string and at most one instance of - at the end of the string. For example, given input - - text - - , the output will be - text -.
Note that \s matches Unicode whitespaces (in Python 3). You will need re.ASCII flag to revert it to matching only [ \t\n\r\f\v].
Since you are not very clear about cases such as -text, -text-, -text -, the regex above will just output text for those 3 cases.
For strings such as text , the regex will just strip the spaces.
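To make the difference between the two answers concrete, here is a small comparison sketch. The names EDGE_DASH and strip_one_dash are invented for this example; the pattern itself is the one from the answer above:

```python
import re

# Pattern from the answer above: at most one '-' at each end, plus edge whitespace.
EDGE_DASH = re.compile(r'^\s*(-\s*)?|(\s*-)?\s*$')

def strip_one_dash(text):
    return EDGE_DASH.sub('', text)

for sample in [' - test ', ' -this-is-nice - ', ' - - text - - ']:
    # Left: regex result; right: str.strip with the characters '-' and ' '.
    print(repr(strip_one_dash(sample)), repr(sample.strip('- ')))
# 'test' 'test'
# 'this-is-nice' 'this-is-nice'
# '- text -' 'text'
```

The two approaches agree until several dashes pile up at an edge: str.strip removes all of them, while the regex removes only one per side.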
I am trying to use regex to remove #tags from a string in Python; however, when I try to do this
import re
str = ' you #warui and #madawar '
h = re.search(r'#\w*', str, re.M|re.I)
print(h.group())
It outputs only the first #tag.
#warui
and when i try it on http://regexr.com?304a6 it works
"to use regex to remove #tags from a string"
import re
text = ' you #warui and #madawar '
stripped_text = re.sub(r'#\w+', '', text)
# stripped_text == ' you and '
or do you want to extract them?
import re
text = ' you #warui and #madawar '
tags = re.findall(r'#\w+', text)
# tags == ['#warui', '#madawar']
A #tag is defined as # followed by at least one alphanumeric character, which is why #\w+ is better than #\w*. You also don't need to change case sensitivity (re.I), because \w matches both lowercase and uppercase characters.
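Removing the tags with re.sub leaves the surrounding spaces behind. If that matters, one possible follow-up (an addition, not part of the answer above; the name remove_hashtags is made up) is to normalise the spacing afterwards with split and join:

```python
import re

def remove_hashtags(text):
    # Delete every '#' followed by one or more word characters.
    without_tags = re.sub(r'#\w+', '', text)
    # split() with no arguments drops leading/trailing whitespace and
    # treats any run of whitespace as one separator.
    return ' '.join(without_tags.split())

print(repr(remove_hashtags(' you #warui and #madawar ')))
# 'you and'
```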
re.search() will only match one occurrence of the pattern. If you want to find more, try using re.findall().
import re
s = ' you #warui and #madawar '
for h in re.findall(r'#\w*', s, re.M|re.I):
    print(h)
Prints:
#warui
#madawar
I want to eliminate all the whitespace from a string, on both ends, and in between words.
I have this Python code:
def my_handle(self):
    sentence = ' hello apple '
    sentence.strip()
But that only eliminates the whitespace on both sides of the string. How do I remove all whitespace?
If you want to remove leading and ending spaces, use str.strip():
>>> " hello apple ".strip()
'hello apple'
If you want to remove all space characters, use str.replace() (NB this only removes the “normal” ASCII space character ' ' U+0020 but not any other whitespace):
>>> " hello apple ".replace(" ", "")
'helloapple'
If you want to remove duplicated spaces, use str.split() followed by str.join():
>>> " ".join(" hello apple ".split())
'hello apple'
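The note about str.replace only touching U+0020 can be checked with a string that contains other whitespace. The sample string below, with a no-break space (U+00A0), a tab and a newline, is an assumption chosen for illustration:

```python
sentence = ' hello\u00a0apple\t\n'

# Only the ordinary space character is removed.
only_ascii_space = sentence.replace(' ', '')
# split() with no arguments splits on any Unicode whitespace in Python 3.
all_whitespace = ''.join(sentence.split())

print(repr(only_ascii_space))  # 'hello\xa0apple\t\n'
print(repr(all_whitespace))    # 'helloapple'
```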
To remove only spaces use str.replace:
sentence = sentence.replace(' ', '')
To remove all whitespace characters (space, tab, newline, and so on) you can use split then join:
sentence = ''.join(sentence.split())
or a regular expression:
import re
pattern = re.compile(r'\s+')
sentence = re.sub(pattern, '', sentence)
If you want to only remove whitespace from the beginning and end you can use strip:
sentence = sentence.strip()
You can also use lstrip to remove whitespace only from the beginning of the string, and rstrip to remove whitespace from the end of the string.
An alternative is to use regular expressions and match these strange white-space characters too. Here are some examples:
Remove ALL spaces in a string, even between words:
import re
sentence = re.sub(r"\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the BEGINNING of a string:
import re
sentence = re.sub(r"^\s+", "", sentence, flags=re.UNICODE)
Remove spaces in the END of a string:
import re
sentence = re.sub(r"\s+$", "", sentence, flags=re.UNICODE)
Remove spaces both in the BEGINNING and in the END of a string:
import re
sentence = re.sub(r"^\s+|\s+$", "", sentence, flags=re.UNICODE)
Remove ONLY DUPLICATE spaces:
import re
sentence = " ".join(re.split(r"\s+", sentence, flags=re.UNICODE))
(All examples work in both Python 2 and Python 3)
"Whitespace" includes space, tabs, and CRLF. So an elegant and one-liner string function we can use is str.translate:
Python 3
' hello apple '.translate(str.maketrans('', '', ' \n\t\r'))
OR if you want to be thorough:
import string
' hello apple'.translate(str.maketrans('', '', string.whitespace))
Python 2
' hello apple'.translate(None, ' \n\t\r')
OR if you want to be thorough:
import string
' hello apple'.translate(None, string.whitespace)
For removing whitespace from beginning and end, use strip.
>>> " foo bar ".strip()
'foo bar'
' hello \n\tapple'.translate({ord(c):None for c in ' \n\t\r'})
MaK already pointed out the "translate" method above. And this variation works with Python 3 (see this Q&A).
In addition, strip has some variations:
Remove spaces in the BEGINNING and END of a string:
sentence= sentence.strip()
Remove spaces in the BEGINNING of a string:
sentence = sentence.lstrip()
Remove spaces in the END of a string:
sentence= sentence.rstrip()
All three string functions, strip, lstrip, and rstrip, can take a string of characters to strip as a parameter, with the default being all whitespace. This can be helpful when you are working with something particular; for example, you could remove only spaces but not newlines:
" 1. Step 1\n".strip(" ")
Or you could remove extra commas when reading in a string list:
"1,2,3,".strip(",")
Be careful:
strip does a rstrip and lstrip (removes leading and trailing spaces, tabs, returns and form feeds, but it does not remove them in the middle of the string).
If you only replace spaces and tabs you can end up with hidden CRLFs that appear to match what you are looking for, but are not the same.
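A short sketch of the "hidden CRLF" pitfall described above. The sample value is hypothetical, standing in for a line read from a file with Windows line endings:

```python
raw = "hello apple\r\n"

# Replacing only spaces and tabs leaves the carriage return and newline behind.
spaces_and_tabs_removed = raw.replace(" ", "").replace("\t", "")
print(spaces_and_tabs_removed == "helloapple")  # False
print(repr(spaces_and_tabs_removed))            # 'helloapple\r\n'

# split()/join() removes every kind of whitespace, including \r and \n.
fully_cleaned = "".join(raw.split())
print(fully_cleaned == "helloapple")            # True
```

Printing with repr() is what makes the leftover characters visible; a plain print() would show two strings that look the same.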
eliminate all the whitespace from a string, on both ends, and in between words.
>>> import re
>>> re.sub(r"\s+",  # one or more whitespace characters
...        '',      # replace with empty string (-> remove)
...        ''' hello
... apple
... ''')
'helloapple'
https://en.wikipedia.org/wiki/Whitespace_character
Python docs:
https://docs.python.org/library/stdtypes.html#textseq
https://docs.python.org/library/stdtypes.html#str.replace
https://docs.python.org/library/string.html#string.replace
https://docs.python.org/library/re.html#re.sub
https://docs.python.org/library/re.html#regular-expression-syntax
I use split() to discard all whitespace and join() to concatenate the resulting strings.
sentence = ''.join(' hello apple '.split())
print(sentence) #=> 'helloapple'
I prefer this approach because it is a single expression (not a statement).
It is easy to use and can be used without binding the result to a variable.
print(''.join(' hello apple '.split())) # no need to bind to a variable
import re
sentence = ' hello  apple'
re.sub(' ', '', sentence)       # 'helloapple' (removes all spaces)
re.sub(' {2,}', ' ', sentence)  # ' hello apple' (collapses runs of spaces into one)
In the following script we import the regular expression module which we use to substitute one space or more with a single space. This ensures that the inner extra spaces are removed. Then we use strip() function to remove leading and trailing spaces.
# Import regular expression module
import re
# Initialize string
a = " foo bar "
# First replace any number of spaces with a single space
a = re.sub(' +', ' ', a)
# Then strip any leading and trailing spaces.
a = a.strip()
# Show results
print(a)
I found that this works the best for me:
test_string = ' test a s test '
string_list = [s.strip() for s in str(test_string).split()]
final_string = ' '.join(string_list)
# final_string: 'test a s test'
It removes any whitespaces, tabs, etc.
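Try this. Instead of using re, I think using split with strip is much better:
def my_handle(self):
    sentence = ' hello apple '
    # Collapse every run of whitespace into a single space
    single_spaced = ' '.join(x.strip() for x in sentence.split())
    # 'hello apple'
    # Remove all whitespace
    no_spaces = ''.join(x.strip() for x in sentence.split())
    # 'helloapple'
    return single_spaced, no_spaces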