Python strip() multiple characters? - python

I want to remove any brackets from a string. Why doesn't this work properly?
>>> name = "Barack (of Washington)"
>>> name = name.strip("(){}<>")
>>> print name
Barack (of Washington

Because that's not what strip() does. It removes leading and trailing characters that are present in the argument, but not those characters in the middle of the string.
You could do:
name= name.replace('(', '').replace(')', '').replace ...
or:
name= ''.join(c for c in name if c not in '(){}<>')
or maybe use a regex:
import re
name= re.sub('[(){}<>]', '', name)

I did a time test here, using each method 100000 times in a loop. The results surprised me. (The results still surprise me after editing them in response to valid criticism in the comments.)
Here's the script:
import timeit
bad_chars = '(){}<>'
setup = """import re
import string
s = 'Barack (of Washington)'
bad_chars = '(){}<>'
rgx = re.compile('[%s]' % bad_chars)"""
timer = timeit.Timer('o = "".join(c for c in s if c not in bad_chars)', setup=setup)
print "List comprehension: ", timer.timeit(100000)
timer = timeit.Timer("o= rgx.sub('', s)", setup=setup)
print "Regular expression: ", timer.timeit(100000)
timer = timeit.Timer('for c in bad_chars: s = s.replace(c, "")', setup=setup)
print "Replace in loop: ", timer.timeit(100000)
timer = timeit.Timer('s.translate(string.maketrans("", "", ), bad_chars)', setup=setup)
print "string.translate: ", timer.timeit(100000)
Here are the results:
List comprehension: 0.631745100021
Regular expression: 0.155561923981
Replace in loop: 0.235936164856
string.translate: 0.0965719223022
Results on other runs follow a similar pattern. If speed is not the primary concern, however, I still think string.translate is not the most readable; the other three are more obvious, though slower to varying degrees.

string.translate with table=None works fine.
>>> name = "Barack (of Washington)"
>>> name = name.translate(None, "(){}<>")
>>> print name
Barack of Washington

Because strip() only strips trailing and leading characters, based on what you provided. I suggest:
>>> import re
>>> name = "Barack (of Washington)"
>>> name = re.sub('[\(\)\{\}<>]', '', name)
>>> print(name)
Barack of Washington

strip only strips characters from the very front and back of the string.
To delete a list of characters, you could use the string's translate method:
import string
name = "Barack (of Washington)"
table = string.maketrans( '', '', )
print name.translate(table,"(){}<>")
# Barack of Washington

Since strip only removes characters from start and end, one idea could be to break the string into list of words, then remove chars, and then join:
s = 'Barack (of Washington)'
x = [j.strip('(){}<>') for j in s.split()]
ans = ' '.join(j for j in x)
print(ans)

For example string s="(U+007c)"
To remove only the parentheses from s, try the below one:
import re
a=re.sub("\\(","",s)
b=re.sub("\\)","",a)
print(b)

Related

Is there a way to remove all characters except letters in a string in Python?

I call a function that returns code with all kinds of characters ranging from ( to ", and , and numbers.
Is there an elegant way to remove all of these so I end up with nothing but letters?
Given
s = '##24A-09=wes()&8973o**_##me' # contains letters 'Awesome'
You can filter out non-alpha characters with a generator expression:
result = ''.join(c for c in s if c.isalpha())
Or filter with filter:
result = ''.join(filter(str.isalpha, s))
Or you can substitute non-alpha with blanks using re.sub:
import re
result = re.sub(r'[^A-Za-z]', '', s)
A solution using RegExes is quite easy here:
import re
newstring = re.sub(r"[^a-zA-Z]+", "", string)
Where string is your string and newstring is the string without characters that are not alphabetic. What this does is replace every character that is not a letter by an empty string, thereby removing it. Note however that a RegEx may be slightly overkill here.
A more functional approach would be:
newstring = "".join(filter(str.isalpha, string))
Unfortunately you can't just call stron a filterobject to turn it into a string, that would look much nicer...
Going the pythonic way it would be
newstring = "".join(c for c in string if c.isalpha())
You didn't mention you want only english letters, here's an international solution:
import unicodedata
str = u"hello, ѱϘяԼϷ!"
print ''.join(c for c in str if unicodedata.category(c).startswith('L'))
Here's another one, using string.ascii_letters
>>> import string
>>> "".join(x for x in s if x in string.ascii_letters)
`
>>> import re
>>> string = "';''';;';1123123!##!##!#!$!sd sds2312313~~\"~s__"
>>> re.sub("[\W\d_]", "", string)
'sdsdss'
Well, I use this for myself in this kind of situations
Sorry, if it's outdated :)
string = "The quick brown fox jumps over the lazy dog!"
alphabet = "abcdefghijklmnopqrstuvwxyz"
def letters_only(source):
result = ""
for i in source.lower():
if i in alphabet:
result += i
return result
print(letters_only(string))
s = '##24A-09=wes()&8973o**_##me'
print(filter(str.isalpha, s))
# Awesome
About return value of filter:
filter(function or None, sequence) -> list, tuple, or string

python capitalize first letter only

I am aware .capitalize() capitalizes the first letter of a string but what if the first character is a integer?
this
1bob
5sandy
to this
1Bob
5Sandy
Only because no one else has mentioned it:
>>> 'bob'.title()
'Bob'
>>> 'sandy'.title()
'Sandy'
>>> '1bob'.title()
'1Bob'
>>> '1sandy'.title()
'1Sandy'
However, this would also give
>>> '1bob sandy'.title()
'1Bob Sandy'
>>> '1JoeBob'.title()
'1Joebob'
i.e. it doesn't just capitalize the first alphabetic character. But then .capitalize() has the same issue, at least in that 'joe Bob'.capitalize() == 'Joe bob', so meh.
If the first character is an integer, it will not capitalize the first letter.
>>> '2s'.capitalize()
'2s'
If you want the functionality, strip off the digits, you can use '2'.isdigit() to check for each character.
>>> s = '123sa'
>>> for i, c in enumerate(s):
... if not c.isdigit():
... break
...
>>> s[:i] + s[i:].capitalize()
'123Sa'
This is similar to #Anon's answer in that it keeps the rest of the string's case intact, without the need for the re module.
def sliceindex(x):
i = 0
for c in x:
if c.isalpha():
i = i + 1
return i
i = i + 1
def upperfirst(x):
i = sliceindex(x)
return x[:i].upper() + x[i:]
x = '0thisIsCamelCase'
y = upperfirst(x)
print(y)
# 0ThisIsCamelCase
As #Xan pointed out, the function could use more error checking (such as checking that x is a sequence - however I'm omitting edge cases to illustrate the technique)
Updated per #normanius comment (thanks!)
Thanks to #GeoStoneMarten in pointing out I didn't answer the question! -fixed that
Here is a one-liner that will uppercase the first letter and leave the case of all subsequent letters:
import re
key = 'wordsWithOtherUppercaseLetters'
key = re.sub('([a-zA-Z])', lambda x: x.groups()[0].upper(), key, 1)
print key
This will result in WordsWithOtherUppercaseLetters
As seeing here answered by Chen Houwu, it's possible to use string package:
import string
string.capwords("they're bill's friends from the UK")
>>>"They're Bill's Friends From The Uk"
a one-liner: ' '.join(sub[:1].upper() + sub[1:] for sub in text.split(' '))
You can replace the first letter (preceded by a digit) of each word using regex:
re.sub(r'(\d\w)', lambda w: w.group().upper(), '1bob 5sandy')
output:
1Bob 5Sandy
def solve(s):
for i in s[:].split():
s = s.replace(i, i.capitalize())
return s
This is the actual code for work. .title() will not work at '12name' case
I came up with this:
import re
regex = re.compile("[A-Za-z]") # find a alpha
str = "1st str"
s = regex.search(str).group() # find the first alpha
str = str.replace(s, s.upper(), 1) # replace only 1 instance
print str
def solve(s):
names = list(s.split(" "))
return " ".join([i.capitalize() for i in names])
Takes a input like your name: john doe
Returns the first letter capitalized.(if first character is a number, then no capitalization occurs)
works for any name length

Analyzing string input until it reaches a certain letter on Python

I need help in trying to write a certain part of a program.
The idea is that a person would input a bunch of gibberish and the program will read it till it reaches an "!" (exclamation mark) so for example:
input("Type something: ")
Person types: wolfdo65gtornado!salmontiger223
If I ask the program to print the input it should only print wolfdo65gtornado and cut anything once it reaches the "!" The rest of the program is analyzing and counting the letters, but those part I already know how to do. I just need help with the first part. I been trying to look through the book but it seems I'm missing something.
I'm thinking, maybe utilizing a for loop and then placing restriction on it but I can't figure out how to make the random imputed string input be analyzed for a certain character and then get rid of the rest.
If you could help, I'll truly appreciate it. Thanks!
The built-in str.partition() method will do this for you. Unlike str.split() it won't bother to cut the rest of the str into different strs.
text = raw_input("Type something:")
left_text = text.partition("!")[0]
Explanation
str.partition() returns a 3-tuple containing the beginning, separator, and end of the string. The [0] gets the first item which is all you want in this case. Eg.:
"wolfdo65gtornado!salmontiger223".partition("!")
returns
('wolfdo65gtornado', '!', 'salmontiger223')
>>> s = "wolfdo65gtornado!salmontiger223"
>>> s.split('!')[0]
'wolfdo65gtornado'
>>> s = "wolfdo65gtornadosalmontiger223"
>>> s.split('!')[0]
'wolfdo65gtornadosalmontiger223'
if it doesnt encounter a "!" character, it will just grab the entire text though. if you would like to output an error if it doesn't match any "!" you can just do like this:
s = "something!something"
if "!" in s:
print "there is a '!' character in the context"
else:
print "blah, you aren't using it right :("
You want itertools.takewhile().
>>> s = "wolfdo65gtornado!salmontiger223"
>>> '-'.join(itertools.takewhile(lambda x: x != '!', s))
'w-o-l-f-d-o-6-5-g-t-o-r-n-a-d-o'
>>> s = "wolfdo65gtornado!salmontiger223!cvhegjkh54bgve8r7tg"
>>> i = iter(s)
>>> '-'.join(itertools.takewhile(lambda x: x != '!', i))
'w-o-l-f-d-o-6-5-g-t-o-r-n-a-d-o'
>>> '-'.join(itertools.takewhile(lambda x: x != '!', i))
's-a-l-m-o-n-t-i-g-e-r-2-2-3'
>>> '-'.join(itertools.takewhile(lambda x: x != '!', i))
'c-v-h-e-g-j-k-h-5-4-b-g-v-e-8-r-7-t-g'
Try this:
s = "wolfdo65gtornado!salmontiger223"
m = s.index('!')
l = s[:m]
To explain accepted answer.
Splitting
partition() function splits string in list with 3 elements:
mystring = "123splitABC"
x = mystring.partition("split")
print(x)
will give:
('123', 'split', 'ABC')
Access them like list elements:
print (x[0]) ==> 123
print (x[1]) ==> split
print (x[2]) ==> ABC
Suppose we have:
s = "wolfdo65gtornado!salmontiger223" + some_other_string
s.partition("!")[0] and s.split("!")[0] are both a problem if some_other_string contains a million strings, each a million characters long, separated by exclamation marks. I recommend the following instead. It's much more efficient.
import itertools as itts
get_start_of_string = lambda stryng, last, *, itts=itts:\
str(itts.takewhile(lambda ch: ch != last, stryng))
###########################################################
s = "wolfdo65gtornado!salmontiger223"
start_of_string = get_start_of_string(s, "!")
Why the itts=itts
Inside of the body of a function, such as get_start_of_string, itts is global.
itts is evaluated when the function is called, not when the function is defined.
Consider the following example:
color = "white"
get_fleece_color = lambda shoop: shoop + ", whose fleece was as " + color + " as snow."
print(get_fleece_color("Igor"))
# [... many lines of code later...]
color = "pink polka-dotted"
print(get_fleece_color("Igor's cousin, 3 times removed"))
The output is:
Igor, whose fleece was white as snow.
Igor's cousin, 3 times removed Igor, whose fleece was as pink polka-dotted as snow.
You can extract the beginning of a string, up until the first delimiter is encountered, by using regular expressions.
import re
slash_if_special = lambda ch:\
"\\" if ch in "\\^$.|?*+()[{" else ""
prefix_slash_if_special = lambda ch, *, _slash=slash_if_special: \
_slash(ch) + ch
make_pattern_from_char = lambda ch, *, c=prefix_slash_if_special:\
"^([^" + c(ch) + "]*)"
def get_string_up_untill(x_stryng, x_ch):
i_stryng = str(x_stryng)
i_ch = str(x_ch)
assert(len(i_ch) == 1)
pattern = make_pattern_from_char(ch)
m = re.match(pattern, x_stryng)
return m.groups()[0]
An example of the code above being used:
s = "wolfdo65gtornado!salmontiger223"
result = get_string_up_untill(s, "!")
print(result)
# wolfdo65gtornado
We can use itertools
s = "wolfdo65gtornado!salmontiger223"
result = "".join(itertools.takewhile(lambda x : x!='!' , s))
>>"wolfdo65gtornado"

python string manipulation [duplicate]

I have a string s with nested brackets: s = "AX(p>q)&E((-p)Ur)"
I want to remove all characters between all pairs of brackets and store in a new string like this: new_string = AX&E
i tried doing this:
p = re.compile("\(.*?\)", re.DOTALL)
new_string = p.sub("", s)
It gives output: AX&EUr)
Is there any way to correct this, rather than iterating each element in the string?
Another simple option is removing the innermost parentheses at every stage, until there are no more parentheses:
p = re.compile("\([^()]*\)")
count = 1
while count:
s, count = p.subn("", s)
Working example: http://ideone.com/WicDK
You can just use string manipulation without regular expression
>>> s = "AX(p>q)&E(qUr)"
>>> [ i.split("(")[0] for i in s.split(")") ]
['AX', '&E', '']
I leave it to you to join the strings up.
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> re.compile("""\([^\)]*\)""").sub('', s)
'AX&E'
Yeah, it should be:
>>> import re
>>> s = "AX(p>q)&E(qUr)"
>>> p = re.compile("\(.*?\)", re.DOTALL)
>>> new_string = p.sub("", s)
>>> new_string
'AX&E'
Nested brackets (or tags, ...) are something that are not possible to handle in a general way using regex. See http://www.amazon.de/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&s=gateway&qid=1304230523&sr=8-1-spell for details why. You would need a real parser.
It's possible to construct a regex which can handle two levels of nesting, but they are already ugly, three levels will already be quite long. And you don't want to think about four levels. ;-)
You can use PyParsing to parse the string:
from pyparsing import nestedExpr
import sys
s = "AX(p>q)&E((-p)Ur)"
expr = nestedExpr('(', ')')
result = expr.parseString('(' + s + ')').asList()[0]
s = ''.join(filter(lambda x: isinstance(x, str), result))
print(s)
Most code is from: How can a recursive regexp be implemented in python?
You could use re.subn():
import re
s = 'AX(p>q)&E((-p)Ur)'
while True:
s, n = re.subn(r'\([^)(]*\)', '', s)
if n == 0:
break
print(s)
Output
AX&E
this is just how you do it:
# strings
# double and single quotes use in Python
"hey there! welcome to CIP"
'hey there! welcome to CIP'
"you'll understand python"
'i said, "python is awesome!"'
'i can\'t live without python'
# use of 'r' before string
print(r"\new code", "\n")
first = "code in"
last = "python"
first + last #concatenation
# slicing of strings
user = "code in python!"
print(user)
print(user[5]) # print an element
print(user[-3]) # print an element from rear end
print(user[2:6]) # slicing the string
print(user[:6])
print(user[2:])
print(len(user)) # length of the string
print(user.upper()) # convert to uppercase
print(user.lstrip())
print(user.rstrip())
print(max(user)) # max alphabet from user string
print(min(user)) # min alphabet from user string
print(user.join([1,2,3,4]))
input()

eliminating multiple occurrences of whitespace in a string in python

If I have a string
"this is a string"
How can I shorten it so that I only have one space between the words rather than multiple? (The number of white spaces is random)
"this is a string"
You could use string.split and " ".join(list) to make this happen in a reasonably pythonic way - there are probably more efficient algorithms but they won't look as nice.
Incidentally, this is a lot faster than using a regex, at least on the sample string:
import re
import timeit
s = "this is a string"
def do_regex():
for x in xrange(100000):
a = re.sub(r'\s+', ' ', s)
def do_join():
for x in xrange(100000):
a = " ".join(s.split())
if __name__ == '__main__':
t1 = timeit.Timer(do_regex).timeit(number=5)
print "Regex: ", t1
t2 = timeit.Timer(do_join).timeit(number=5)
print "Join: ", t2
$ python revsjoin.py
Regex: 2.70868492126
Join: 0.333452224731
Compiling this regex does improve performance, but only if you do call sub on the compiled regex, instead of passing the compiled form into re.sub as an argument:
def do_regex_compile():
pattern = re.compile(r'\s+')
for x in xrange(100000):
# Don't do this
# a = re.sub(pattern, ' ', s)
a = pattern.sub(' ', s)
$ python revsjoin.py
Regex: 2.72924399376
Compiled Regex: 1.5852200985
Join: 0.33763718605
re.sub(r'\s+', ' ', 'this is a string')
You can pre-compile and store this for potentially better performance:
MULT_SPACES = re.compile(r'\s+')
MULT_SPACES.sub(' ', 'this is a string')
Pretty the same answer by Ben Gartner, but, this adds the "if this is not an empty string" check.
>>> a = 'this is a string'
>>> ' '.join([k for k in a.split(" ") if k])
'this is a string'
>>>
if you don't check for empty strings you'll get this:
>>> ' '.join([k for k in a.split(" ")])
'this is a string'
>>>
Try this:
s = "this is a string"
tokens = s.split()
neat_s = " ".join(tokens)
The string's split function will return a list of non empty tokens split by whitespace. So if you try
"this is a string".split()
you will get back
['this', 'is', 'a', 'string']
The string's join function will join a list of tokens together using the string itself as a delimiter. In this case we want a space, so
" ".join("this is a string".split())
Will split on occurrences of a space, discard the empties, then join again, separating by spaces. For more about string operations, check out Python's common string function documentation.
EDIT: I misunderstood what happens when you pass a delimiter to the split function. See markuz's answer for this.

Categories

Resources