How do I replace multiple spaces with just one character? - python

Here's my code so far:
input1 = input("Please enter a string: ")
newstring = input1.replace(' ','_')
print(newstring)
So if my input is:
I want only     one      underscore.
It currently shows up as:
I_want_only_____one______underscore.
But I want it to show up like this:
I_want_only_one_underscore.

This will replace any run of whitespace with a single underscore:
newstring = '_'.join(input1.split())
If you only want to replace spaces (not tabs/newlines etc.), it's probably easier to use a regex:
import re
newstring = re.sub(' +', '_', input1)
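A quick side-by-side of the two suggestions on input containing a tab (a sketch; the sample string is hypothetical):

```python
import re

input1 = "I want\tonly   one underscore."  # hypothetical sample input

# str.split() with no argument splits on ANY whitespace, so the tab is lost:
joined = '_'.join(input1.split())
# joined == 'I_want_only_one_underscore.'

# the regex collapses runs of spaces only, so the tab survives:
subbed = re.sub(' +', '_', input1)
# subbed == 'I_want\tonly_one_underscore.'
```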

Dirty way:
newstring = '_'.join(input1.split())
Nicer way (more configurable):
import re
newstring = re.sub(r'\s+', '_', input1)
Extra Super Dirty way using the replace function:
def replace_and_shrink(t):
    '''For when you absolutely, positively hate the normal ways to do this.'''
    t = t.replace(' ', '_')
    if '__' not in t:
        return t
    t = t.replace('__', '_')
    return replace_and_shrink(t)
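The same trick can also be written as a plain loop instead of recursion (a sketch, equivalent in behavior to the function above; the name is mine):

```python
def replace_and_shrink_loop(t):
    """Same trick as replace_and_shrink, but iterative: underscore every
    space, then collapse doubled underscores until none remain."""
    t = t.replace(' ', '_')
    while '__' in t:
        t = t.replace('__', '_')
    return t

demo = replace_and_shrink_loop('I want only     one      underscore.')
# demo == 'I_want_only_one_underscore.'
```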

First approach (doesn't work)
>>> a = '213     45435  fdgdu   '
>>> a
'213     45435  fdgdu   '
>>> b = ' '.join(a.split())
>>> b
'213 45435 fdgdu'
As you can see, the variable a contains many spaces between the "useful" substrings. The combination of split() without arguments and join() cleans the multiple spaces out of the initial string.
The previous technique fails, however, when the initial string contains special characters such as '\n':
>>> a = '213\n         45435\n             fdgdu\n '
>>> b = ' '.join(a.split())
>>> b
'213 45435 fdgdu'
(the newline characters have been lost)
To correct this we can use the following (more complex) solution.
Second approach (works)
>>> a = '213\n         45435\n             fdgdu\n '
>>> tmp = a.split( ' ' )
>>> tmp
['213\n', '', '', '', '', '', '', '', '', '45435\n', '', '', '', '', '', '', '', '', '', '', '', '', 'fdgdu\n', '']
>>> while '' in tmp: tmp.remove( '' )
...
>>> tmp
['213\n', '45435\n', 'fdgdu\n']
>>> b = ' '.join( tmp )
>>> b
'213\n 45435\n fdgdu\n'
Third approach (works)
This approach is a little bit more pythonic in my eyes. Check it:
>>> a = '213\n         45435\n             fdgdu\n '
>>> b = ' '.join( filter( len, a.split( ' ' ) ) )
>>> b
'213\n 45435\n fdgdu\n'

Related

how to abandon special characters and numbers when using re.split with string?

I want to split a string with spaces preserved, but I don't want to include special characters and numbers.
So it would look like this:
sentence = "jak3 love$ $b0x1n%"
list_after_split = ["jak", " ", "love", " ", "bxn"]
I want to use re.split(), but I am not sure what to write as a pattern.
Try filtering the unwanted characters out first:
>>> import re
>>> sentence = "jak3 love$ $b0x1n%"
>>> sentence_filtered = re.sub(r'[^a-zA-Z\s]+', '', sentence)
>>> # Alternative: sentence_filtered = ''.join(ch for ch in sentence if ch.isalpha() or ch.isspace())
>>> sentence_filtered
'jak love bxn'
>>> re.split(r'(\s+)', sentence_filtered)
['jak', ' ', 'love', ' ', 'bxn']
If you want to condense whitespaces into a single space:
import re
# String with multi-spaces, tab(s), and newline(s).
s='Jak3 \t love$s \n $D0ax1t3e90r%.'
print(s)
# Jak3 love$s
# $D0ax1t3e90r%.
# First, remove all characters which aren't letters or a space.
# Second, condense spaces together into a single space.
# Third, split into desired list.
print(re.split(r'( )', re.sub(r' +',' ',re.sub(r'[^a-zA-Z ]+', '', s))))
# ['Jak', ' ', 'loves', ' ', 'Daxter']

Split a Python string (sentence) with appended white spaces

Is it possible to split a Python string (sentence) so that it retains the whitespace between words in the output, keeping each space within the split substring by appending it after the preceding word?
For example:
given_string = 'This is my string!'
output = ['This ', 'is ', 'my ', 'string!']
I avoid regexes most of the time, but here it makes it really simple:
import re
given_string = 'This is my string!'
res = re.findall(r'\w+\W?', given_string)
# res == ['This ', 'is ', 'my ', 'string!']
Maybe this will help?
>>> given_string = 'This is my string!'
>>> l = given_string.split(' ')
>>> l = [item + ' ' for item in l[:-1]] + l[-1:]
>>> l
['This ', 'is ', 'my ', 'string!']
just split and add the whitespace back:
a = " "
output = [e + a for e in given_string.split(a) if e]
output[-1] = output[-1][:-1]
the last line removes the extra space appended after the final word. Thank you!
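One more option, not from the answers above: on Python 3.7+ (where re.split also splits on zero-width matches), a lookbehind keeps each space attached to the preceding word in a single step:

```python
import re

given_string = 'This is my string!'
# split at the zero-width position just after each space (Python 3.7+),
# so every space stays attached to the word before it
output = re.split(r'(?<= )', given_string)
# output == ['This ', 'is ', 'my ', 'string!']
```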

How to replace strings that are similar

I am creating some code that will replace spaces.
I want a double space to turn into a single space and a single space to become nothing.
Example:
string = "t e s t  t e s t"
string = string.replace('  ', ' ').replace(' ', '')
print(string)
The output is "testtest" because it replaces all the spaces.
How can I make the output "test test"?
Thanks
A regular expression approach is doubtless possible, but for a quick solution: first split on the double space, then rejoin on a single space, after using a comprehension to remove the single spaces in each element of the split:
>>> string = "t e s t  t e s t"
>>> ' '.join(word.replace(' ', '') for word in string.split('  '))
'test test'
Just another idea:
>>> s = 't e s t  t e s t'
>>> s.replace('  ', '\x00').replace(' ', '').replace('\x00', ' ')
'test test'
Seems to be faster:
>>> timeit(lambda: s.replace('  ', '\x00').replace(' ', '').replace('\x00', ' '))
2.7822862677683133
>>> timeit(lambda: ' '.join(w.replace(' ', '') for w in s.split('  ')))
7.702567737466012
And regex (at least this one) is shorter but a lot slower:
>>> timeit(lambda: re.sub(' ( ?)', r'\1', s))
37.2261058654488
I like this regex solution because you can easily read what's going on:
>>> import re
>>> string = "t e s t  t e s t"
>>> re.sub(' {1,2}', lambda m: '' if m.group() == ' ' else ' ', string)
'test test'
We search for one or two spaces, and substitute one space with the empty string but two spaces with a single space.
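As an aside (an extension of the answer, not part of it), the lambda idea generalizes to runs of any length by replacing each run of n spaces with n-1 spaces:

```python
import re

string = "t e s t  t e s t"
# each run of spaces shrinks by one: single spaces disappear,
# double spaces become single spaces
result = re.sub(' +', lambda m: ' ' * (len(m.group()) - 1), string)
# result == 'test test'
```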

Efficiently split a string using multiple separators and retaining each separator?

I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.
Furthermore, I need for the separators to remain in the output list, in between the items they separated in the string.
For example,
"Now is the winter of our discontent"
should output:
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
I'm not sure how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?
A different non-regex approach from the others:
>>> import string
>>> from itertools import groupby
>>>
>>> special = set(string.punctuation + string.whitespace)
>>> s = "One two three tab\ttabandspace\t end"
>>>
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
>>> split_combined
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
>>> split_separated
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']
Could use dict.fromkeys and .get instead of the lambda, I guess.
[edit]
Some explanation:
groupby accepts two arguments, an iterable and an (optional) keyfunction. It loops through the iterable and groups them with the value of the keyfunction:
>>> groupby("sentence", lambda c: c in 'nt')
<itertools.groupby object at 0x9805af4>
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]
where terms with contiguous values of the keyfunction are grouped together. (This is a common source of bugs, actually -- people forget that they have to sort by the keyfunc first if they want to group terms which might not be sequential.)
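A small illustration of that pitfall (a hypothetical example): grouping unsorted words by first letter yields two separate 'a' groups until the data is sorted by the same key first.

```python
from itertools import groupby

words = ['apple', 'avocado', 'banana', 'apricot']
first_letter = lambda w: w[0]

# without sorting, the two runs of 'a' words become separate groups
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]
# [('a', ['apple', 'avocado']), ('b', ['banana']), ('a', ['apricot'])]

# sorting by the same key first merges them (sorted() is stable)
sorted_groups = [(k, list(g))
                 for k, g in groupby(sorted(words, key=first_letter), first_letter)]
# [('a', ['apple', 'avocado', 'apricot']), ('b', ['banana'])]
```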
As @JonClements guessed, what I had in mind was
>>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
>>> s = "One two three tab\ttabandspace\t end"
>>> [''.join(g) for k,g in groupby(s, special.get)]
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']
for the case where we were combining the separators. .get returns None if the value isn't in the dict.
import re
import string
p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
    string.punctuation + string.whitespace)))
print(p.findall("Now is the winter of our discontent"))
I'm no big fan of using regexps for all problems, but I don't think you have much choice in this if you want it fast and short.
I'll explain the regexp since you're not familiar with it:
[...] matches any one of the characters inside the square brackets
[^...] matches any character not inside the square brackets
+ after an expression means one or more repetitions of it
x|y matches either x or y
So the regexp matches one or more characters which must either all be punctuation/whitespace, or all not be. The findall method finds all non-overlapping matches of the pattern.
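For example (a sketch reusing the same compiled pattern), a string with consecutive separators keeps each maximal run as a single token:

```python
import re
import string

# same construction as above: runs of non-separators OR runs of separators
p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
    string.punctuation + string.whitespace)))

tokens = p.findall("Now, really - winter!")
# tokens == ['Now', ', ', 'really', ' - ', 'winter', '!']
```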
Try this:
import re
import string
re.split('([' + re.escape(string.punctuation + string.whitespace) + ']+)', "Now is the winter of our discontent")
Explanation from the Python documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Solution in linear (O(n)) time:
Let's say you have a string:
original = "a, b...c d"
First convert all separators to spaces:
import string
splitters = string.punctuation + string.whitespace
trans = str.maketrans(splitters, ' ' * len(splitters))
s = original.translate(trans)
Now s == 'a  b   c d'. Now you can use itertools.groupby to alternate between spaces and non-spaces:
import itertools
result = []
position = 0
for _, letters in itertools.groupby(s, lambda c: c == ' '):
    letter_count = len(list(letters))
    result.append(original[position:position + letter_count])
    position += letter_count
Now result == ['a', ', ', 'b', '...', 'c', ' ', 'd'], which is what you need.
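Putting the whole approach together as one Python 3 function (str.maketrans here stands in for the Python 2 string.maketrans; the function name is mine):

```python
import itertools
import string

def split_keep_separators(original):
    """Linear-time split keeping each run of separators as its own item.
    (The name is mine; the logic is the approach described above.)"""
    splitters = string.punctuation + string.whitespace
    trans = str.maketrans(splitters, ' ' * len(splitters))
    s = original.translate(trans)  # every separator char becomes a space
    result = []
    position = 0
    for _, letters in itertools.groupby(s, lambda c: c == ' '):
        letter_count = len(list(letters))
        # slice the *original* string so the real separator text survives
        result.append(original[position:position + letter_count])
        position += letter_count
    return result

# split_keep_separators("a, b...c d") == ['a', ', ', 'b', '...', 'c', ' ', 'd']
```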
My take:
from string import whitespace, punctuation
import re
pattern = re.escape(whitespace + punctuation)
print(re.split('([' + pattern + '])', 'now is the winter of'))
Depending on the text you are dealing with, you may be able to simplify your concept of delimiters to "anything other than letters and numbers". If this will work, you can use the following regex solution:
re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)
This assumes that you want to split on each individual delimiter character even if they occur consecutively, so 'foo..bar' would become ['foo', '.', '.', 'bar']. If instead you expect ['foo', '..', 'bar'], use [a-zA-Z\d]+|[^a-zA-Z\d]+ (only difference is adding + at the very end).
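Both variants in action (a sketch):

```python
import re

text = 'foo..bar'
tokens1 = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)   # one char per delimiter
tokens2 = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]+', text)  # delimiter runs combined
# tokens1 == ['foo', '.', '.', 'bar']
# tokens2 == ['foo', '..', 'bar']
```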
from functools import reduce
from string import punctuation, whitespace
s = "..test. and stuff"
f = lambda s, c: s + ' ' + c + ' ' if c in punctuation else s + c
l = sum([reduce(f, word).split() for word in s.split()], [])
print(l)
For any arbitrary collection of separators:
def separate(myStr, seps):
    answer = []
    temp = []
    for char in myStr:
        if char in seps:
            answer.append(''.join(temp))
            answer.append(char)
            temp = []
        else:
            temp.append(char)
    answer.append(''.join(temp))
    return answer
In [4]: print(separate("Now is the winter of our discontent", set(' ')))
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
In [5]: print(separate("Now, really - it is the winter of our discontent", set(' ,-')))
['Now', ',', '', ' ', 'really', ' ', '', '-', '', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
Hope this helps
from itertools import chain, cycle
s = "Now is the winter of our discontent"
words = s.split()
wordsWithWhitespace = list(chain.from_iterable(zip(words, cycle([" "]))))
# result: ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']

Python: Splitting a string into words, saving separators

I have a string:
'Specified, if char, else 10 (default).'
I want to split it into two tuples
words=('Specified', 'if', 'char', 'else', '10', 'default')
separators=(',', ' ', ',', ' ', ' (', ').')
Does anyone have a quick solution of this?
PS: this symbol '-' is a word separator, not part of the word
import re
line = 'Specified, if char, else 10 (default).'
words = re.split(r'\)?[, .]\(?', line)
# words = ['Specified', '', 'if', 'char', '', 'else', '10', 'default', '']
separators = re.findall(r'\)?[, .]\(?', line)
# separators = [',', ' ', ' ', ',', ' ', ' ', ' (', ').']
If you really want tuples, pass the results to tuple(). If you do not want words to contain the empty entries (from between the commas and spaces), use the following:
words = [x for x in re.split(r'\)?[, .]\(?', line) if x]
or
words = tuple(x for x in re.split(r'\)?[, .]\(?', line) if x)
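A handy sanity check for this split/findall pair (my addition, not part of the answer): interleaving the words with the separators reconstructs the original line exactly, since re.split with a group-free pattern drops exactly what findall returns:

```python
import re
from itertools import zip_longest

line = 'Specified, if char, else 10 (default).'
pattern = r'\)?[, .]\(?'

words = re.split(pattern, line)        # one more element than separators
separators = re.findall(pattern, line)

# interleave words and separators; the last word pairs with the '' filler
rebuilt = ''.join(w + s for w, s in zip_longest(words, separators, fillvalue=''))
# rebuilt == line
```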
You can use regex for that.
>>> a='Specified, if char, else 10 (default).'
>>> from re import split
>>> split(r",? ?\(?\)?\.?", a)
['Specified', 'if', 'char', 'else', '10', 'default', '']
But in this solution you have to write that pattern yourself; to reuse the separators tuple you would have to convert its contents into a regex pattern. (Note also that every part of this pattern is optional, so it can match the empty string; on Python 3.7+, re.split splits on such zero-width matches as well, and the output will differ from the above.)
A regex to find all separators (assuming a separator is anything that's not alphanumeric):
import re
re.findall(r'[^\w]', string)
I would probably first .split() on spaces into a list, then iterate through the list, using a regex to check for a character after the word boundary.
import re
s = 'Specified, if char, else 10 (default).'
words = s.split()
separators = []
finalwords = []
for word in words:
    match = re.search(r'(\w+)\b(.*)', word)
    if match is None:
        continue
    finalwords.append(match.group(1))
    separators.append(match.group(2))
To get both separators and words in one pass you could use findall as follows:
import re
line = 'Specified, if char, else 10 (default).'
words = []
seps = []
for w, s in re.findall(r"(\w*)([), .(]+)", line):
    words.append(w)
    seps.append(s)
Here's my crack at it:
>>> p = re.compile(r'(\)? *[,.]? *\(?)')
>>> tmp = p.split('Specified, char, else 10 (default).')
>>> words = tmp[::2]
>>> separators = tmp[1::2]
>>> print(words)
['Specified', 'char', 'else', '10', 'default', '']
>>> print(separators)
[', ', ', ', ' ', ' (', ').']
The only problem is that you can get a '' at the beginning or end of words if there is a separator at the start/end of the sentence with nothing before/after it. However, that is easily checked for and eliminated. (Note also that every piece of this pattern is optional, so it can match the empty string; on Python 3.7+, re.split splits on zero-width matches too, so the output there will differ from the above.)
