I have a fairly large amount of string data from which I need to remove every character other than A-Z, a-z and 0-9.
I can remove almost every character, but the backslash ('\') is a problem.
def replace_all(text, dic):
    # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
    for i, j in dic.items():
        text = text.replace(i, j)
    return text
reps = {' ':'-','.':'-','"':'-',',':'-','/':'-',
        '<':'-',';':'-',':':'-','*':'-','+':'-',
        '=':'-','_':'-','?':'-','%':'-','!':'-',
        '$':'-','(':'-',')':'-','\#':'-','[':'-',
        ']':'-','\&':'-','#':'-','\W':'-','\t':'-'}
x.name = x.name.lower()
x1 = replace_all(x.name,reps)
I've quite large string data in which I've to remove all characters other than A-Z,a-z and 0-9
In other words, you want to keep only those characters.
The string class already provides a test for "is every character a letter or digit?", called .isalnum(). So we can filter with that. On Python 3, filter() returns an iterator, so join the result back into a string:
>>> ''.join(filter(str.isalnum, 'foo-bar\\baz42'))
'foobarbaz42'
If you have a string:
a = 'hi how \\are you'
you can remove the backslash with:
a.replace('\\', '')
# => 'hi how are you'
If you have a specific context where you are having trouble, I recommend posting a bit more detail.
birryee is correct, you need to escape the backslash with a second backslash.
to remove all characters other than A-Z, a-z and 0-9
Instead of trying to list all the characters you want to remove (that would take a long time), use a regular expression to specify those characters you wish to keep:
import re
text = re.sub('[^0-9A-Za-z]', '-', text)
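As a small, hedged illustration of the keep-list approach above: the helper name keep_alnum and its repl parameter are made up for this sketch. Because the character class says what to keep rather than what to drop, a literal backslash needs no special handling.

```python
import re

# Anything that is NOT an ASCII letter or digit; compiled once for reuse.
_NON_ALNUM = re.compile(r"[^0-9A-Za-z]")

def keep_alnum(text, repl=""):
    """Replace every character outside A-Z, a-z, 0-9 with `repl` (hypothetical helper)."""
    return _NON_ALNUM.sub(repl, text)

print(keep_alnum("foo-bar\\baz42"))   # foobarbaz42
print(keep_alnum("a\\b c", "-"))      # a-b-c
```

Passing "-" as repl reproduces the behaviour of the one-liner above; the default removes the characters entirely.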
I have a string as follows where I tried to remove similar consecutive characters.
import re
input = "abccbcbbb"
for i in input:
    input = re.sub("(.)\\1+", "", input)
print(input)
Now I need to let the user specify the value of k.
I am using the following python code to do it, but I got the error message TypeError: can only concatenate str (not "int") to str
import re
input = "abccbcbbb"
k = 3
for i in input:
    input = re.sub("(.)\\1+{"+(k-1)+"}", "", input)
print(input)
The for i in input: loop does not do what you need. It iterates over each character of the string, whereas re.sub already processes the whole string in a single call, so the loop is unnecessary.
The TypeError comes from concatenating the integer k-1 to a string; the number has to be formatted into the pattern instead. If you plan to match a specific number of characters, you should also get rid of the + quantifier after \1: a limiting quantifier ({n}, {min,} or {min,max}) must be placed right after the pattern it modifies.
Also, it is more convenient to use raw string literals when defining regexps.
You can use
import re
input_text = "abccbcbbb"
k = 3
input_text = re.sub(fr"(.)\1{{{k-1}}}", "", input_text)
print(input_text)
# => abccbc
The fr"(.)\1{{{k-1}}}" raw f-string literal will translate into (.)\1{2} pattern. In f-strings, you need to double curly braces to denote a literal curly brace and you needn't escape \1 again since it is a raw string literal.
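To let the user choose k, the same pattern can be wrapped in a function. This is only a sketch: the name remove_runs and the k >= 2 check are assumptions, not part of the original answer. It makes a single left-to-right pass, so runs that only appear after a removal are left alone.

```python
import re

def remove_runs(text, k):
    """Remove each block of k identical consecutive characters (single pass)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # (.) captures one character; \1{k-1} requires k-1 more copies of it.
    pattern = re.compile(r"(.)\1{%d}" % (k - 1))
    return pattern.sub("", text)

print(remove_runs("abccbcbbb", 3))  # abccbc
print(remove_runs("abccbcbbb", 2))  # abbcb
```

With k=2 the run "bbb" loses only two of its three characters, because the pattern matches blocks of exactly k characters rather than whole runs.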
If I were you, I would do it as suggested above. But since I have already spent time on this question, here is my handmade solution.
The pattern below defines a named group called "letter". At each position in the string it captures the current character (first a, then b, and so on) and then uses a lookahead to check whether that same character is repeated immediately afterwards.
So every character that is followed by a copy of itself is replaced with the empty string, which leaves exactly one character from each run of repeats.
import re
input = 'abccbcbbb'
result = 'abcbcb'
pattern = r'(?P<letter>[a-z])(?=(?P=letter)+)'
substituted = re.sub(pattern, '', input)
assert substituted == result
Just to make sure I have the question right: you want to turn "abccbcbbb" into "abcbcb", removing only consecutive duplicate characters? Is there a reason you need to use a regex? You could do it with a simple loop. This is a quick and dirty way to do it, but you could just write
input = "abccbcbbb"
input = list(input)
previous = input.pop(0)
result = [previous]
for letter in input:
    # keep a letter only when it differs from the one before it
    if letter != previous:
        result.append(letter)
    previous = letter
result = "".join(result)
With an approach like this, you could make it easier to read and faster with a bit of modification, I'd assume.
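One possible tidier variant of the loop above, offered as a sketch rather than the answerer's own code, uses itertools.groupby from the standard library. The function name collapse_repeats is made up here.

```python
from itertools import groupby

def collapse_repeats(text):
    """Keep one character from each run of identical consecutive characters."""
    # groupby yields one (key, group) pair per run of equal characters.
    return "".join(key for key, _group in groupby(text))

print(collapse_repeats("abccbcbbb"))  # abcbcb
```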
I've got a problem with carets and dollar signs in Python.
I want to find every word which starts with a number and ends with a letter
Here is what I've tried already:
import re
text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"
phoneNumRegex = re.compile(r'^\d+\w+$')
print(phoneNumRegex.findall(text))
Result is an empty list:
[]
The result I want:
415kkk, 9999ll, 555jjj
Where is the problem?
Problems with your regex:
^...$ means you only want a match that spans the whole string; get rid of that.
\w means "any word character", that is letters (upper and lower case), digits and the underscore '_'. So \d+\w+ would also match an all-digit token such as '5555', with \d+ taking '555' and \w+ the final '5', and add it to the result.
Use this instead:
import re
text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"
phoneNumRegex = re.compile(r'\b\d+[a-zA-Z]+\b')
print(phoneNumRegex.findall(text))
which prints:
['415kkk', '9999ll', '555jjj']
The '\b' are word boundaries, so the pattern only matches whole tokens: it does not match '555jjj' inside 'abc555jjj9', for example.
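To see the difference between the two kinds of anchors side by side, here is a short sketch; the variable names anchored and bounded are just for illustration.

```python
import re

text = "Cell: 415kkk -555- 9999ll Work: 212-555jjj -0000"

anchored = re.compile(r"^\d+[a-zA-Z]+$")    # must span the whole string
bounded = re.compile(r"\b\d+[a-zA-Z]+\b")   # may match any whole token

print(anchored.findall(text))      # []
print(bounded.findall(text))       # ['415kkk', '9999ll', '555jjj']
print(anchored.findall("415kkk"))  # ['415kkk']
```

The anchored pattern only succeeds when the entire input is a single matching token, which is why it finds nothing in the longer text.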
Readup:
re-syntax
regex101.com - Regextester website that can handle python syntax
I want to replace certain words in a string, but I keep getting the following result:
String: "This is my sentence."
User types in what they want to replace: "is"
User types what they want to replace word with: "was"
New string: "Thwas was my sentence."
How can I make sure it only replaces the word "is" instead of any string of the characters it finds?
Code function:
import string
def replace(word, new_word):
    new_file = string.replace(word, new_word[1])
    return new_file
Any help is much appreciated, thank you!
Use a regular expression with word boundaries:
import re
print(re.sub(r"\bis\b","was","This is my sentence"))
This is better than a mere split because it works with punctuation as well:
print(re.sub(r"\bis\b","was","This is, of course, my sentence"))
gives:
This was, of course, my sentence
Note: don't skip the r prefix, or your regex would be corrupt: \b would be interpreted as backspace.
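Since the word comes from user input in the question, a sketch along these lines may help. The function name replace_word is hypothetical; re.escape guards against regex metacharacters in the typed word, and the lambda keeps backslashes in the replacement from being read as group references. It assumes the word starts and ends with a word character, otherwise \b will not behave as intended.

```python
import re

def replace_word(text, word, new_word):
    """Replace whole-word occurrences of `word` with `new_word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    # A function replacement inserts new_word literally.
    return pattern.sub(lambda match: new_word, text)

print(replace_word("This is my sentence.", "is", "was"))
# This was my sentence.
```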
A simple but less general solution (as noted by Jean-Francios Fabre) without regular expressions. It only matches words separated by whitespace, so it misses a word with punctuation attached:
' '.join(x if x != word else new_word for x in string.split())
I am wondering how I can implement a string check, where I want to make sure that the first (&last) character of the string is alphanumeric. I am aware of the isalnum, but how do I use this to implement this check/substitution?
So, I have a string like so:
st="-jkkujkl-ghjkjhkj*"
and I would want back:
st="jkkujkl-ghjkjhkj"
Thanks..
Though not exactly what you asked for, str.strip should serve your purpose:
import string
st.strip(string.punctuation)
Out[174]: 'jkkujkl-ghjkjhkj'
You could use regex like shown below:
import re
# \W matches any non-word character; '_' counts as a word character,
# so it is added explicitly in [\W_].
# Remove any run of [\W_] characters at the start or the end of the string.
p = re.compile(r'^[\W_]+|[\W_]+$')
st = "-jkkujkl-ghjkjhkj*"
print(p.sub('', st))
Output:
jkkujkl-ghjkjhkj
Edit:
If your special chars are in the set: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
@Abhijit's answer is much simpler and cleaner.
If you are not sure then this regex version is better.
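You can use the following two expressions:
st = re.sub(r'^\W*', '', st)
st = re.sub(r'\W*$', '', st)
This strips all non-word characters from the beginning and the end of the string, not just the first one on each side. Note that \W does not match the underscore, so leading or trailing '_' characters are kept.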
You could use a regular expression.
Something like this could work:
\w.*\w
However, I don't know how to do a regexp match in Python.
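For what it is worth, here is one way such a match might be done with re.search. This sketch makes two assumptions: it uses an explicit [A-Za-z0-9] class instead of \w, because \w also accepts the underscore, and it uses a greedy quantifier, because a lazy one would stop at the earliest possible closing character. The name search_alnum_span is made up.

```python
import re

# First alphanumeric character, then (optionally) everything up to the last one.
_SPAN = re.compile(r"[A-Za-z0-9](?:.*[A-Za-z0-9])?", re.DOTALL)

def search_alnum_span(s):
    """Return the part of s from its first to its last ASCII alphanumeric character."""
    match = _SPAN.search(s)
    return match.group(0) if match else ""

print(search_alnum_span("-jkkujkl-ghjkjhkj*"))  # jkkujkl-ghjkjhkj
```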
hint 1: ord() converts a character to its character code
hint 2: lowercase ASCII letters are between 97 and 122 in ord() (uppercase letters are 65 to 90, digits 48 to 57)
hint 3: st[0] returns the first character of the string st, and st[-1] returns the last
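An exact answer to your question may be the following (it checks only the first and the last character):
def stringCheck(astring):
    # Strings shorter than two characters have no separate first and last character.
    if len(astring) < 2:
        return astring if astring.isalnum() else ''
    firstChar = astring[0] if astring[0].isalnum() else ''
    lastChar = astring[-1] if astring[-1].isalnum() else ''
    return firstChar + astring[1:-1] + lastChar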
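If every leading and trailing non-alphanumeric character should go, not only one on each side, an isalnum-based loop is one option. This is a sketch with a hypothetical name, trim_non_alnum. Note that str.isalnum() also accepts non-ASCII letters and digits.

```python
def trim_non_alnum(s):
    """Strip all non-alphanumeric characters from both ends of s."""
    start, end = 0, len(s)
    # Move start right past leading non-alphanumeric characters.
    while start < end and not s[start].isalnum():
        start += 1
    # Move end left past trailing non-alphanumeric characters.
    while end > start and not s[end - 1].isalnum():
        end -= 1
    return s[start:end]

print(trim_non_alnum("-jkkujkl-ghjkjhkj*"))  # jkkujkl-ghjkjhkj
```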
>>> path = "/Volumes/Users"
>>> path.lstrip('/Volume')
's/Users'
>>> path.lstrip('/Volumes')
'Users'
>>>
I expected the output of path.lstrip('/Volumes') to be '/Users'
lstrip is character-based: it removes all characters from the left end that are in that string.
To verify this, try this:
"/Volumes/Users".lstrip("semuloV/") # also returns "Users"
Since / is part of the string, it is removed.
You need to use slicing instead:
if s.startswith("/Volumes"):
    s = s[len("/Volumes"):]
Or, on Python 3.9+ you can use removeprefix:
s = s.removeprefix("/Volumes")
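On older Python versions the slicing idiom can be wrapped in a small helper. The name strip_prefix is made up for this sketch; it is meant to behave like removeprefix for the simple case shown.

```python
def strip_prefix(s, prefix):
    """Remove `prefix` from the start of `s` once, if it is there."""
    if prefix and s.startswith(prefix):
        return s[len(prefix):]
    return s

path = "/Volumes/Users"
print(path.lstrip("/Volumes"))         # Users   (character set, strips the second '/' too)
print(strip_prefix(path, "/Volumes"))  # /Users  (exact prefix, removed once)
```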
Strip is character-based. If you are trying to do path manipulation you should have a look at os.path
>>> os.path.split("/Volumes/Users")
('/Volumes', 'Users')
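On Python 3, pathlib is another option for this kind of path manipulation. The snippet below is a sketch and uses PurePosixPath so that it behaves the same on any platform. Note that relative_to raises ValueError when the path does not start with the given directory.

```python
from pathlib import PurePosixPath

path = PurePosixPath("/Volumes/Users")

print(path.parts)                     # ('/', 'Volumes', 'Users')
print(path.name)                      # Users
print(path.relative_to("/Volumes"))   # Users
```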
The argument passed to lstrip is taken as a set of characters!
>>> ' spacious '.lstrip()
'spacious '
>>> 'www.example.com'.lstrip('cmowz.')
'example.com'
See also the documentation
You might want to use str.replace()
str.replace(old, new[, count])
# e.g.
'/Volumes/Home'.replace('/Volumes', '', 1)
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
For paths, you may want to use os.path.split(). It returns a (head, tail) tuple, where tail is the last path component and head is everything before it.
>>> os.path.split('/home/user')
('/home', 'user')
To your problem:
>>> path = "/vol/volume"
>>> path.lstrip('/vol')
'ume'
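The example above shows how lstrip() works. It does not remove the prefix '/vol' once; it keeps removing characters from the left for as long as each one is '/', 'v', 'o' or 'l', so the second '/vol' is removed as well and stripping stops at 'u'.
So, in your example, it removed '/Volumes', then also removed the following '/' because '/' is one of the given characters, and stopped at 'U', which is not.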
HTH
lstrip doc says:
Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
So you are removing every character that is contained in the given string, including both 's' and '/' characters.
Here is a primitive version of lstrip (that I wrote) that might help clear things up for you:
def lstrip(s, chars):
    for i in range(len(s)):
        # Stop at the first character that is not in chars.
        if s[i] not in chars:
            return s[i:]
    # Every character was in chars, so nothing is left.
    return ''
Thus, you can see that every leading character that is in chars is removed until a character that is not in chars is encountered. Once that happens, the deletion stops and the rest of the string is simply returned.