eliminating multiple occurrences of whitespace in a string in python


If I have a string
"this  is   a    string"
How can I shorten it so that there is only one space between the words rather than several? (The number of whitespace characters is random.)
"this is a string"

You could use str.split() and " ".join(list) to do this in a reasonably Pythonic way. There are probably more efficient algorithms, but they won't look as nice.
Incidentally, this is a lot faster than using a regex, at least on the sample string (Python 2):
import re
import timeit
s = "this  is   a    string"
def do_regex():
    for x in xrange(100000):
        a = re.sub(r'\s+', ' ', s)
def do_join():
    for x in xrange(100000):
        a = " ".join(s.split())
if __name__ == '__main__':
    t1 = timeit.Timer(do_regex).timeit(number=5)
    print "Regex: ", t1
    t2 = timeit.Timer(do_join).timeit(number=5)
    print "Join: ", t2
$ python revsjoin.py
Regex: 2.70868492126
Join: 0.333452224731
Compiling this regex does improve performance, but only if you do call sub on the compiled regex, instead of passing the compiled form into re.sub as an argument:
def do_regex_compile():
    pattern = re.compile(r'\s+')
    for x in xrange(100000):
        # Don't do this
        # a = re.sub(pattern, ' ', s)
        a = pattern.sub(' ', s)
$ python revsjoin.py
Regex: 2.72924399376
Compiled Regex: 1.5852200985
Join: 0.33763718605

re.sub(r'\s+', ' ', 'this is a string')
You can pre-compile and store this for potentially better performance:
MULT_SPACES = re.compile(r'\s+')
MULT_SPACES.sub(' ', 'this is a string')
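One behavioural difference from the split/join approach is worth noting: substituting \s+ turns leading and trailing whitespace into a single space rather than removing it. A minimal sketch, assuming that trimming is wanted; the helper name collapse_ws is hypothetical and not part of the answer above.

```python
import re

# Hypothetical helper name; the pattern is the one from the answer above.
MULT_SPACES = re.compile(r'\s+')

def collapse_ws(text):
    """Collapse each run of whitespace to one space, then trim both ends."""
    return MULT_SPACES.sub(' ', text).strip()

sample = '  this   is a \t string\n'
print(repr(MULT_SPACES.sub(' ', sample)))  # ' this is a string '
print(repr(collapse_ws(sample)))           # 'this is a string'
```

If leading or trailing whitespace is meaningful in your data, drop the strip() call.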

Pretty much the same answer as Ben Gartner's, but this adds the "if this is not an empty string" check.
>>> a = 'this  is   a    string'
>>> ' '.join([k for k in a.split(" ") if k])
'this is a string'
>>>
If you don't check for empty strings, you'll get this:
>>> ' '.join([k for k in a.split(" ")])
'this  is   a    string'
>>>

Try this:
s = "this  is   a    string"
tokens = s.split()
neat_s = " ".join(tokens)
The string's split method, called without arguments, returns a list of non-empty tokens split on whitespace. So if you try
"this  is   a    string".split()
you will get back
['this', 'is', 'a', 'string']
The string's join method joins a list of tokens together using the string itself as a delimiter. In this case we want a space, so
" ".join("this  is   a    string".split())
will split on runs of whitespace, discard the empties, then join again, separating by single spaces. For more about string operations, check out Python's common string function documentation.
EDIT: I misunderstood what happens when you pass a delimiter to the split function. See markuz's answer for this.
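The difference between calling split with and without a delimiter can be seen directly. This is only an illustrative sketch; the sample string is an assumption chosen to have runs of two, three and four spaces.

```python
text = "this  is   a    string"

# No argument: split on runs of any whitespace and drop empty tokens.
print(text.split())
# ['this', 'is', 'a', 'string']

# Explicit " " delimiter: every single space is a separator, so
# consecutive spaces produce empty strings in the result.
print(text.split(" "))
# ['this', '', 'is', '', '', 'a', '', '', '', 'string']

print(" ".join(text.split()))
# this is a string
```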

Related

efficient way to split multi-word hashtag in python

Given a text like
THIS is a #hashtag and this is a #multiWordHashtag
I need to output
THIS is a hashtag and this is a multi Word Hashtag
For now, I use this function:
def do_process_eng_hashtag(input_text: str):
    result = []
    for word in input_text.split():
        if word.startswith('#') and len(word) > 1:
            word = list(word)
            word[1] = word[1].upper()
            word = ''.join(word)
            word = ' '.join(re.findall('[A-Z][^A-Z]*', word))
        result.append(word)
    return ' '.join(result)
But I wonder if there is a more efficient and neat way to do so?

Using re.sub, you can specify a replacement function:
def do_process_eng_hashtag(input_text: str) -> str:
    return re.sub(
        r'#[a-z]\S*',
        lambda m: ' '.join(re.findall('[A-Z][^A-Z]*|[a-z][^A-Z]*', m.group().lstrip('#'))),
        input_text,
    )
The replacement function (the lambda) splits a hashtag into multiple words:
>>> re.findall('[A-Z][^A-Z]*|[a-z][^A-Z]*', '#multiWordHashtag'.lstrip('#'))
['multi', 'Word', 'Hashtag']
>>> do_process_eng_hashtag('THIS is a #hashtag and this is a #multiWordHashtag')
'THIS is a hashtag and this is a multi Word Hashtag'

You can use a function with re.sub like so:
import re
example = 'THIS is a #hashtag and this is a #multiWordHashtag'
def rep(m):
    s = m.group(1)
    return ' '.join(re.split(r'(?=[A-Z])', s))
>>> re.sub(r'#(\w+)', rep, example)
'THIS is a hashtag and this is a multi Word Hashtag'
Works like this:
re.sub(r'#(\w+)', rep, example) calls rep function with a match group for all hashtags.
The rep function then uses a lookahead to split the string on capitalization:
>>> re.split(r'(?=[A-Z])','multiWordHashtag')
['multi', 'Word', 'Hashtag']
>>> re.split(r'(?=[A-Z])','hashtag')
['hashtag']
The ' '.join() adds space delimiters. If there are no capitals (i.e., the argument to join is a list of length 1), just the string is returned.
You can modify the regex in re.sub(r'#(\w+)', rep, example) to whatever YOU consider a hashtag. Perhaps re.sub(r'#([a-zA-Z]+)', rep, example)?
Alternatively, you can combine Python splitting with the same regex to detect upper case:
def word_func(s):
    return ' '.join(re.split(r'(?=[A-Z])', s[1:]))
' '.join([word_func(s) if s.startswith('#') else s for s in example.split()])
# same output

Is there a way to remove all characters except letters in a string in Python?

I call a function that returns code with all kinds of characters ranging from ( to ", and , and numbers.
Is there an elegant way to remove all of these so I end up with nothing but letters?

Given
s = '##24A-09=wes()&8973o**_##me' # contains letters 'Awesome'
You can filter out non-alpha characters with a generator expression:
result = ''.join(c for c in s if c.isalpha())
Or filter with filter:
result = ''.join(filter(str.isalpha, s))
Or you can remove the non-alpha characters using re.sub:
import re
result = re.sub(r'[^A-Za-z]', '', s)

A solution using RegExes is quite easy here:
import re
newstring = re.sub(r"[^a-zA-Z]+", "", string)
Where string is your string and newstring is the string without characters that are not alphabetic. What this does is replace every character that is not a letter by an empty string, thereby removing it. Note however that a RegEx may be slightly overkill here.
A more functional approach would be:
newstring = "".join(filter(str.isalpha, string))
Unfortunately you can't just call str on a filter object to turn it into a string; that would look much nicer...
Going the pythonic way it would be
newstring = "".join(c for c in string if c.isalpha())

You didn't mention that you want only English letters, so here's an international solution:
import unicodedata
text = u"hello, ѱϘяԼϷ!"
print(''.join(c for c in text if unicodedata.category(c).startswith('L')))
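The approaches in this thread do not all agree on what a letter is: str.isalpha() accepts any Unicode letter, while a character class such as [^A-Za-z] keeps only ASCII letters. A small sketch of the difference; the sample string is an assumption chosen for illustration.

```python
import re

s = 'Straße #42, Москва!'

# str.isalpha() is Unicode-aware: accented and non-Latin letters are kept.
unicode_letters = ''.join(c for c in s if c.isalpha())

# The character class [^A-Za-z] removes everything except ASCII letters.
ascii_letters = re.sub(r'[^A-Za-z]', '', s)

print(unicode_letters)  # StraßeМосква
print(ascii_letters)    # Strae
```

Which behaviour is right depends on whether the input is expected to contain non-English text.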

Here's another one, using string.ascii_letters:
>>> import string
>>> "".join(x for x in s if x in string.ascii_letters)

>>> import re
>>> string = "';''';;';1123123!##!##!#!$!sd sds2312313~~\"~s__"
>>> re.sub(r"[\W\d_]", "", string)
'sdsdss'

Well, I use this myself in this kind of situation.
Sorry if it's outdated :)
string = "The quick brown fox jumps over the lazy dog!"
alphabet = "abcdefghijklmnopqrstuvwxyz"
def letters_only(source):
    result = ""
    for i in source.lower():
        if i in alphabet:
            result += i
    return result
print(letters_only(string))

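In Python 2, filter returns a string when it is given a string:
s = '##24A-09=wes()&8973o**_##me'
print(filter(str.isalpha, s))
# Awesome
About the return value of filter in Python 2:
filter(function or None, sequence) -> list, tuple, or string
In Python 3, filter returns an iterator, so wrap it in ''.join(...) to get a string.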

python string manipulation [duplicate]

I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
I tried doing this:
p = re.compile(r"\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives the output: AX&EUr)
Is there any way to correct this without iterating over each character in the string?

Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile(r"\([^()]*\)")
count = 1
while count:
    s, count = p.subn("", s)
Working example: http://ideone.com/WicDK

You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.

>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile(r"""\([^\)]*\)""").sub('', s)
'AX&E'

Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile(r"\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'

Nested brackets (or tags, ...) cannot be handled in a general way using regex. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details on why. You would need a real parser.
It's possible to construct a regex that handles two levels of nesting, but it is already ugly, and three levels will be quite long. And you don't want to think about four levels. ;-)
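For the specific task in the question (dropping everything inside parentheses), a full parser may not be necessary: a single pass that tracks nesting depth handles any number of levels. This is a minimal sketch, not taken from any of the answers; the function name strip_parenthesized is hypothetical, and dropping unmatched closing parentheses is an assumption about the desired behaviour.

```python
def strip_parenthesized(text):
    """Remove every parenthesized group, including nested ones."""
    depth = 0   # number of '(' currently open
    kept = []
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            # A stray ')' is dropped and never pushes depth below zero.
            if depth > 0:
                depth -= 1
        elif depth == 0:
            # Only characters outside all parentheses are kept.
            kept.append(ch)
    return ''.join(kept)

print(strip_parenthesized("AX(p>q)&E((-p)Ur)"))  # AX&E
```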

You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?

You could use re.subn():
import re
s = 'AX(p>q)&E((-p)Ur)'
while True:
    s, n = re.subn(r'\([^)(]*\)', '', s)
    if n == 0:
        break
print(s)
Output
AX&E

this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # max alphabet from user string
print(min(user)) # min alphabet from user string
print(user.join(["1", "2", "3", "4"])) # join needs strings, not ints
input()

Python strip() multiple characters?

I want to remove any brackets from a string. Why doesn't this work properly?
>>> name = "Barack (of Washington)"
>>> name = name.strip("(){}<>")
>>> print name
Barack (of Washington

Because that's not what strip() does. It removes leading and trailing characters that are present in the argument, but not those characters in the middle of the string.
You could do:
name= name.replace('(', '').replace(')', '').replace ...
or:
name= ''.join(c for c in name if c not in '(){}<>')
or maybe use a regex:
import re
name= re.sub('[(){}<>]', '', name)
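To make the behaviour of strip() concrete: its argument is treated as a set of characters, and characters are peeled off each end only until a character outside that set is reached. A short sketch using the string from the question:

```python
name = "Barack (of Washington)"
brackets = "(){}<>"

# Only the trailing ')' is at an end of the string, so only it is removed;
# the '(' in the middle is never reached.
print(name.strip(brackets))        # Barack (of Washington

# Characters are removed from both ends until a non-bracket is found.
print("<<(x)>>".strip(brackets))   # x

# Removing the characters everywhere needs a different tool.
print(''.join(c for c in name if c not in brackets))  # Barack of Washington
```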

I did a time test here, using each method 100000 times in a loop. The results surprised me. (The results still surprise me after editing them in response to valid criticism in the comments.)
Here's the script (Python 2):
import timeit
bad_chars = '(){}<>'
setup = """import re
import string
s = 'Barack (of Washington)'
bad_chars = '(){}<>'
rgx = re.compile('[%s]' % bad_chars)"""
timer = timeit.Timer('o = "".join(c for c in s if c not in bad_chars)', setup=setup)
print "List comprehension: ", timer.timeit(100000)
timer = timeit.Timer("o= rgx.sub('', s)", setup=setup)
print "Regular expression: ", timer.timeit(100000)
timer = timeit.Timer('for c in bad_chars: s = s.replace(c, "")', setup=setup)
print "Replace in loop: ", timer.timeit(100000)
timer = timeit.Timer('s.translate(string.maketrans("", "", ), bad_chars)', setup=setup)
print "string.translate: ", timer.timeit(100000)
Here are the results:
List comprehension: 0.631745100021
Regular expression: 0.155561923981
Replace in loop: 0.235936164856
string.translate: 0.0965719223022
Results on other runs follow a similar pattern. If speed is not the primary concern, however, I still think string.translate is not the most readable; the other three are more obvious, though slower to varying degrees.

In Python 2, str.translate with table=None works fine:
>>> name = "Barack (of Washington)"
>>> name = name.translate(None, "(){}<>")
>>> print name
Barack of Washington
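The two-argument form translate(None, chars) belongs to Python 2's str type. In Python 3, str.translate takes a single table, and the characters to delete are passed as the third argument of str.maketrans. A minimal Python 3 sketch of the same operation:

```python
name = "Barack (of Washington)"

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', '(){}<>')
print(name.translate(table))  # Barack of Washington
```

The table can be built once and reused for many strings.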

Because strip() only strips trailing and leading characters, based on what you provided. I suggest:
>>> import re
>>> name = "Barack (of Washington)"
>>> name = re.sub(r'[\(\)\{\}<>]', '', name)
>>> print(name)
Barack of Washington

strip only strips characters from the very front and back of the string.
To delete a list of characters, you could use the string's translate method (Python 2):
import string
name = "Barack (of Washington)"
table = string.maketrans('', '')
print name.translate(table, "(){}<>")
# Barack of Washington

Since strip only removes characters from start and end, one idea could be to break the string into list of words, then remove chars, and then join:
s = 'Barack (of Washington)'
x = [j.strip('(){}<>') for j in s.split()]
ans = ' '.join(j for j in x)
print(ans)

For example string s="(U+007c)"
To remove only the parentheses from s, try the below one:
import re
a=re.sub("\\(","",s)
b=re.sub("\\)","",a)
print(b)

How can I capitalize the first letter of each word in a string?

s = 'the brown fox'
...do something here...
s should be:
'The Brown Fox'
What's the easiest way to do this?

The .title() method of a string (either ASCII or Unicode is fine) does this:
>>> "hello world".title()
'Hello World'
>>> u"hello world".title()
u'Hello World'
However, look out for strings with embedded apostrophes, as noted in the docs.
The algorithm uses a simple language-independent definition of a word as groups of consecutive letters. The definition works in many contexts but it means that apostrophes in contractions and possessives form word boundaries, which may not be the desired result:
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"

The .title() method does not handle apostrophes well:
>>> "they're bill's friends from the UK".title()
"They'Re Bill'S Friends From The Uk"
Try the string.capwords() function instead:
>>> import string
>>> string.capwords("they're bill's friends from the UK")
"They're Bill's Friends From The Uk"
From the Python documentation on capwords:
Split the argument into words using str.split(), capitalize each word using str.capitalize(), and join the capitalized words using str.join(). If the optional second argument sep is absent or None, runs of whitespace characters are replaced by a single space and leading and trailing whitespace are removed, otherwise sep is used to split and join the words.
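The documented steps can be written out by hand, which makes two side effects easy to see: the rest of each word is lowercased (UK becomes Uk), and runs of whitespace are collapsed. A sketch, with capwords_by_hand as a hypothetical name:

```python
import string

def capwords_by_hand(text):
    # The same steps the documentation describes when sep is None:
    # split on whitespace, capitalize each word, join with single spaces.
    return ' '.join(word.capitalize() for word in text.split())

sample = "  they're   bill's friends from the UK "
print(string.capwords(sample))    # They're Bill's Friends From The Uk
print(capwords_by_hand(sample))   # They're Bill's Friends From The Uk
```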

Just because this sort of thing is fun for me, here are two more solutions.
Split into words, initial-cap each word from the split groups, and rejoin. This will change the white space separating the words into a single white space, no matter what it was.
s = 'the brown fox'
lst = [word[0].upper() + word[1:] for word in s.split()]
s = " ".join(lst)
EDIT: I don't remember what I was thinking back when I wrote the above code, but there is no need to build an explicit list; we can use a generator expression to do it in lazy fashion. So here is a better solution:
s = 'the brown fox'
s = ' '.join(word[0].upper() + word[1:] for word in s.split())
Use a regular expression to match the beginning of the string, or white space separating words, plus a single non-whitespace character; use parentheses to mark "match groups". Write a function that takes a match object, and returns the white space match group unchanged and the non-whitespace character match group in upper case. Then use re.sub() to replace the patterns. This one does not have the punctuation problems of the first solution, nor does it redo the white space like my first solution. This one produces the best result.
import re
s = 'the brown fox'
def repl_func(m):
    """process regular expression match groups for word upper-casing problem"""
    return m.group(1) + m.group(2).upper()
s = re.sub(r"(^|\s)(\S)", repl_func, s)
>>> re.sub(r"(^|\s)(\S)", repl_func, "they're bill's friends from the UK")
"They're Bill's Friends From The UK"
I'm glad I researched this answer. I had no idea that re.sub() could take a function! You can do nontrivial processing inside re.sub() to produce the final result!

Here is a summary of different ways to do it, and some pitfalls to watch out for
They will work for all these inputs:
"" => ""
"a b c" => "A B C"
"foO baR" => "FoO BaR"
"foo    bar" => "Foo    Bar"
"foo's bar" => "Foo's Bar"
"foo's1bar" => "Foo's1bar"
"foo 1bar" => "Foo 1bar"
Splitting the sentence into words, capitalizing the first letter of each, then joining them back together:
# Be careful with multiple spaces, and empty strings
# for empty words w[0] would cause an index error,
# but with w[:1] we get an empty string as desired
def cap_sentence(s):
    return ' '.join(w[:1].upper() + w[1:] for w in s.split(' '))
Without splitting the string, checking for blank spaces to find the start of a word:
def cap_sentence(s):
    return ''.join( (c.upper() if i == 0 or s[i-1] == ' ' else c) for i, c in enumerate(s) )
Or using generators:
# Iterate through each of the characters in the string
# and capitalize the first char and any char after a blank space
from itertools import chain
def cap_sentence(s):
    return ''.join( (c.upper() if prev == ' ' else c) for c, prev in zip(s, chain(' ', s)) )
Using regular expressions, from steveha's answer:
# match the beginning of the string or a space, followed by a non-space
import re
def cap_sentence(s):
    return re.sub(r"(^|\s)(\S)", lambda m: m.group(1) + m.group(2).upper(), s)
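The inputs listed at the top of this answer can be turned into a quick self-check. The sketch below runs them against the split(' ') variant only; the "foo    bar" case is written with four spaces as an assumption about the original input, and the cases dictionary is a hypothetical name.

```python
def cap_sentence(s):
    # w[:1] is '' for the empty strings produced by consecutive spaces,
    # so no IndexError is raised and the spacing is preserved.
    return ' '.join(w[:1].upper() + w[1:] for w in s.split(' '))

# Input -> expected output, taken from the list at the top of the answer.
cases = {
    "": "",
    "a b c": "A B C",
    "foO baR": "FoO BaR",
    "foo    bar": "Foo    Bar",
    "foo's bar": "Foo's Bar",
    "foo's1bar": "Foo's1bar",
    "foo 1bar": "Foo 1bar",
}

for given, expected in cases.items():
    assert cap_sentence(given) == expected, (given, cap_sentence(given))
print("all cases pass")
```

Swapping in one of the other cap_sentence definitions is an easy way to compare them on the same inputs.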
Now, these are some other answers that were posted, and inputs for which they don't work as expected if we define a word as being the start of the sentence or anything after a blank space:
.title()
return s.title()
# Undesired outputs:
"foO baR" => "Foo Bar"
"foo's bar" => "Foo'S Bar"
"foo's1bar" => "Foo'S1Bar"
"foo 1bar" => "Foo 1Bar"
.capitalize() or .capwords()
return ' '.join(w.capitalize() for w in s.split())
# or
import string
return string.capwords(s)
# Undesired outputs:
"foO baR" => "Foo Bar"
"foo    bar" => "Foo Bar"
Using ' ' for the split will fix the second output, but not the first:
return ' '.join(w.capitalize() for w in s.split(' '))
# or
import string
return string.capwords(s, ' ')
# Undesired outputs:
"foO baR" => "Foo Bar"
.upper()
Be careful with multiple blank spaces; this gets fixed by using ' ' for the split (as shown at the top of the answer):
return ' '.join(w[0].upper() + w[1:] for w in s.split())
# Undesired outputs:
"foo    bar" => "Foo Bar"

Why complicate your life with joins and for loops when the solution is simple and safe?
Just do this:
string = "the brown fox"
string[0].upper()+string[1:]
# 'The brown fox'

Copy-paste-ready version of @jibberia's answer:
def capitalize(line):
    return ' '.join(s[:1].upper() + s[1:] for s in line.split(' '))

If you only want the first letter capitalized:
>>> 'hello world'.capitalize()
'Hello world'
But to capitalize each word:
>>> 'hello world'.title()
'Hello World'

If str.title() doesn't work for you, do the capitalization yourself.
Split the string into a list of words
Capitalize the first letter of each word
Join the words into a single string
One-liner:
>>> ' '.join([s[0].upper() + s[1:] for s in "they're bill's friends from the UK".split(' ')])
"They're Bill's Friends From The UK"
Clear example:
text = "they're bill's friends from the UK"
words = text.split(' ')
capitalized_words = []
for word in words:
    title_case_word = word[0].upper() + word[1:]
    capitalized_words.append(title_case_word)
output = ' '.join(capitalized_words)

An empty string will raise an IndexError if you access [0] (slicing with [1:] is safe). Therefore I would use:
def my_uppercase(title):
    if not title:
        return ''
    return title[0].upper() + title[1:]
to uppercase the first letter only.
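An alternative to the explicit emptiness check is to slice instead of index, because slices never raise on an empty string. A small sketch; upper_first is a hypothetical name.

```python
def upper_first(text):
    # text[:1] is '' for an empty string, so no special case is needed.
    return text[:1].upper() + text[1:]

print(repr(upper_first('')))          # ''
print(upper_first('the brown fox'))   # The brown fox

try:
    ''[0]
except IndexError as exc:
    # Indexing, not slicing, is what fails on an empty string.
    print('IndexError:', exc)
```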

Although all the answers are already satisfactory, I'll try to cover two extra cases along with all the previous cases.
If the spaces are not uniform and you want to keep them as they are:
string = hello    world  i am   here
If some of the words do not start with a letter:
string = 1 w 2 r 3g
Here you can use this:
def solve(s):
    a = s.split(' ')
    for i in range(len(a)):
        a[i] = a[i].capitalize()
    return ' '.join(a)
This will give you:
output = Hello    World  I Am   Here
output = 1 W 2 R 3g

As Mark pointed out, you should use .title():
"MyAwesomeString".title()
However, if you would like to make the first letter uppercase inside a Django template, you could use this:
{{ "MyAwesomeString"|title }}
Or using a variable:
{{ myvar|title }}

The suggested method str.title() does not work in all cases.
For example:
string = "a b 3c"
string.title()
> "A B 3C"
instead of "A B 3c".
I think, it is better to do something like this:
def capitalize_words(string):
    words = string.split(" ") # just change the split(" ") method
    return ' '.join([word.capitalize() for word in words])
capitalize_words(string)
>'A B 3c'

To capitalize words:
text = "this is string example.... wow!!!"
print("text.title() : ", text.title())
Following @Gary02127's comment, the solution below also handles titles with apostrophes:
import re
def titlecase(s):
    return re.sub(r"[A-Za-z]+('[A-Za-z]+)?", lambda mo: mo.group(0)[0].upper() + mo.group(0)[1:].lower(), s)
text = "He's an engineer, isn't he? SnippetBucket.com "
print(titlecase(text))

You can try this. Simple and neat.
def cap_each(string):
    list_of_words = string.split(" ")
    for i, word in enumerate(list_of_words):
        list_of_words[i] = word.capitalize()
    return " ".join(list_of_words)

Don't overlook the preservation of white space. If you want to process 'fred  flinstone' and you get 'Fred Flinstone' instead of 'Fred  Flinstone', you've corrupted your white space. Some of the above solutions will lose white space. Here's a solution that's good for Python 2 and 3 and preserves white space.
import re
def propercase(s):
    return ''.join(w.capitalize() for w in re.split(r'(\s+)', s))
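The whitespace survives because the pattern contains a capturing group, which makes re.split return the separators along with the words. A short sketch of the intermediate list; the sample string is an assumption.

```python
import re

s = 'fred   flinstone\tand  wilma'

# The parentheses in the pattern make re.split keep the separators.
pieces = re.split(r'(\s+)', s)
print(pieces)
# ['fred', '   ', 'flinstone', '\t', 'and', '  ', 'wilma']

# capitalize() leaves whitespace-only pieces unchanged, so joining the
# pieces back together restores the original spacing exactly.
result = ''.join(p.capitalize() for p in pieces)
print(repr(result))
# 'Fred   Flinstone\tAnd  Wilma'
```

Note that capitalize() also lowercases the rest of each word, so 'UK' would become 'Uk'.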

The .title() method won't work in all test cases, so one option is to combine .capitalize(), .replace() and .split() to capitalize the first letter of each word. Note that .replace() substitutes every occurrence of the word, including occurrences inside longer words.
def caps(y):
    k = y.split()
    for i in k:
        y = y.replace(i, i.capitalize())
    return y

You can use title() method to capitalize each word in a string in Python:
string = "this is a test string"
capitalized_string = string.title()
print(capitalized_string)
Output:
This Is A Test String

A quick function that works in Python 3:
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> capitalizeFirtChar = lambda s: s[:1].upper() + s[1:]
>>> print(capitalizeFirtChar('помните своих Предковъ. Сражайся за Правду и Справедливость!'))
Помните своих Предковъ. Сражайся за Правду и Справедливость!
>>> print(capitalizeFirtChar('хай живе вільна Україна! Хай живе Любовь поміж нас.'))
Хай живе вільна Україна! Хай живе Любовь поміж нас.
>>> print(capitalizeFirtChar('faith and Labour make Dreams come true.'))
Faith and Labour make Dreams come true.

Capitalize string with non-uniform spaces
I would like to add to @Amit Gupta's point about non-uniform spaces:
From the original question, we would like to capitalize every word in the string s = 'the brown fox'. What if the string were s = 'the brown      fox', with non-uniform spaces?
def solve(s):
    # If you want to maintain the spaces in the string, s = 'the brown      fox'
    # Use s.split(' ') instead of s.split().
    # s.split() returns ['the', 'brown', 'fox']
    # while s.split(' ') returns ['the', 'brown', '', '', '', '', '', 'fox']
    capitalized_word_list = [word.capitalize() for word in s.split(' ')]
    return ' '.join(capitalized_word_list)

Easiest solution for your question, it worked in my case:
import string
def solve(s):
    return string.capwords(s, ' ')
s=input()
res=solve(s)
print(res)

Another one-line solution could be:
" ".join(map(lambda d: d.capitalize(), word.split(' ')))

In case you want to convert to lowercase instead:
from csv import reader  # assumption: reader in the original snippet is csv.reader
# Assuming you are opening a new file
with open(input_file) as file:
    lines = [x for x in reader(file) if x]
# for loop to parse the file by line
for line in lines:
    name = [x.strip().lower() for x in line if x]
    print(name) # Check the result

If you use the .title() method, the letters after an apostrophe will also become uppercase. Like this:
>>> "hello world's".title()
"Hello World'S"
To avoid this, use the capwords function from the string library.
Like this:
>>> import string
>>> string.capwords("hello world's")
"Hello World's"

I really like this answer:
Copy-paste-ready version of @jibberia's answer:
def capitalize(line):
    return ' '.join([s[0].upper() + s[1:] for s in line.split(' ')])
But some of the lines that I was sending split into blank '' strings, which caused errors when evaluating s[0]. There is probably a better way to do this, but I had to add an if len(s)>0, as in
return ' '.join([s[0].upper() + s[1:] for s in line.split(' ') if len(s)>0])
