I can only use a string in my program if it contains no special characters except underscore _. How can I check this?
I tried using unicodedata library. But the special characters just got replaced by standard characters.
You can use string.punctuation and any function like this
import string
invalidChars = set(string.punctuation.replace("_", ""))
if any(char in invalidChars for char in word):
    print("Invalid")
else:
    print("Valid")
With this line
invalidChars = set(string.punctuation.replace("_", ""))
we are preparing the set of punctuation characters that are not allowed. Since you want _ to be allowed, we remove _ from string.punctuation and build the set invalidChars from what is left. A set is used because membership lookups are faster in sets.
The any function returns True if at least one of the characters in word is in invalidChars.
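The check above can be wrapped in a reusable function. This is only a minimal sketch: the names is_valid and INVALID_CHARS are hypothetical, and it keeps the same definition of "special" (ASCII punctuation other than _), so spaces and non-ASCII characters still pass.

```python
import string

# Every ASCII punctuation character except the underscore is disallowed.
INVALID_CHARS = set(string.punctuation) - {"_"}


def is_valid(word):
    """Return True if word contains none of the disallowed characters."""
    return not any(char in INVALID_CHARS for char in word)


print(is_valid("hello_world"))  # True
print(is_valid("hello-world"))  # False
```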
Edit: As asked in the comments, this is the regular expression solution. Regular expression taken from https://stackoverflow.com/a/336220/1903116
word = "Welcome"
import re
print("Valid" if re.match("^[a-zA-Z0-9_]*$", word) else "Invalid")
You will need to define "special characters", but it's likely that for some string s you mean:
import re
if re.match(r'^\w+$', s):
    pass  # s is good-to-go
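One thing to keep in mind with \w: in Python 3 it matches Unicode letters and digits by default, not just a-z, A-Z, 0-9 and _. The sketch below shows the difference, assuming Python 3; the helper name is_word_chars and its ascii_only parameter are made up for illustration.

```python
import re


def is_word_chars(s, ascii_only=False):
    """Return True if s consists only of \\w characters (at least one)."""
    # re.ASCII restricts \w to [a-zA-Z0-9_]; without it, \w is Unicode-aware.
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w+", s, flags) is not None


print(is_word_chars("café_1"))                   # True
print(is_word_chars("café_1", ascii_only=True))  # False
print(is_word_chars("a-b"))                      # False
```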
The other methods don't account for whitespace, and hardly anybody considers whitespace a special character.
Use this method to detect special characters, not counting whitespace:
import re
def detect_special_characer(pass_string):
    regex = re.compile(r'[@_!#$%^&*()<>?/\|}{~:]')
    if regex.search(pass_string) is None:
        res = False
    else:
        res = True
    return res
Like the method from Cybernetic, but to catch the extra characters it misses (the backslash and the square brackets), change the second line of the function from
regex = re.compile(r'[@_!#$%^&*()<>?/\|}{~:]')
to
regex = re.compile(r'[@_!#$%^&*()<>?/\\|}{~:\[\]]')
where the \, [ and ] characters are each escaped with a \
So in full:
import re
def detect_special_characer(pass_string):
    regex = re.compile(r'[@_!#$%^&*()<>?/\\|}{~:\[\]]')
    if regex.search(pass_string) is None:
        res = False
    else:
        res = True
    return res
If a character is not numeric, not a space, not a letter, and not an underscore, then it is special:
for character in my_string:
    if not (character.isnumeric() or character.isspace() or character.isalpha() or character == "_"):
        print(f"{character} is special")
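The same rule can be turned into a function that returns the offending characters instead of printing them. A small sketch; the name find_special and the sample string are made up here.

```python
def find_special(text):
    """Return the characters that are not letters, digits, whitespace or '_'."""
    return [ch for ch in text
            if not (ch.isalnum() or ch.isspace() or ch == "_")]


print(find_special("my_string: 100% done!"))  # [':', '%', '!']
```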
Related
Unclear on how to frame the following function correctly:
Creating a function that will take in a string and return the string in camel case without spaces (or pascal case if the first letter was already capital), removing special characters
text = "This-is_my_test_string,to-capitalize"
def to_camel_case(text):
    # Return 1st letter of text + all letters after
    return text[:1] + text.title()[1:].replace(i" ") if not i.isdigit()
# Output should be "ThisIsMyTestStringToCapitalize"
the "if" statement at the end isn't working out, and I wrote this somewhat experimentally, but with a syntax fix, could the logic work?
Providing the input string does not contain any spaces then you could do this:
from re import sub
def to_camel_case(text, pascal=False):
    r = sub(r'[^a-zA-Z0-9]', ' ', text).title().replace(' ', '')
    return r if pascal else r[0].lower() + r[1:]
ts = 'This-is_my_test_string,to-capitalize'
print(to_camel_case(ts, pascal=True))
print(to_camel_case(ts))
Output:
ThisIsMyTestStringToCapitalize
thisIsMyTestStringToCapitalize
Here is a short solution using regex. First it uses title() as you did, then the regex finds non-alphanumeric-characters and removes them, and finally we take the first character to handle pascal / camel case.
import re
def to_camel_case(s):
    s1 = re.sub(r'[^a-zA-Z0-9]+', '', s.title())
    return s[0] + s1[1:]
text = "this-is2_my_test_string,to-capitalize"
print(to_camel_case(text)) # thisIs2MyTestStringToCapitalize
The below should work for your example.
It splits your example on anything that isn't alphanumeric or whitespace, capitalizes each word, and returns the re-joined string.
import re
def to_camel_case(text):
    words = re.split(r'[^a-zA-Z0-9\s]', text)
    return "".join([word.capitalize() for word in words])
text_to_camelcase = "This-is_my_test_string,to-capitalize"
print(to_camel_case(text_to_camelcase))
Use re.split to split on anything that is not a letter, a digit, or whitespace, and .capitalize() to capitalize the individual words:
import re
text_to_camelcase = "This-is_my_test_string,to-capitalize"
def to_camel_case(text):
    split_text = re.split(r'[^a-zA-Z0-9\s]', text)
    cap_string = ''
    for word in split_text:
        cap_word = word.capitalize()
        cap_string += cap_word
    return cap_string
print(to_camel_case(text_to_camelcase))
I need to print a string using these rules:
The first letter should be capitalized and all other letters lowercase. Only the characters a-z and A-Z are allowed in the name; any other characters have to be deleted (spaces and tabs are not allowed, and underscores are used instead), and the string cannot be longer than 80 characters.
It seems to me that it is possible to do it somehow like this:
name = "hello2 sjsjs- skskskSkD"
string = name[0].upper() + name[1:].lower()
lenght = len(string) - 1
answer = ""
for letter in string:
    x = letter.isalpha()
    if x == False:
        answer = string.replace(letter,"")
........
return answer
I think it's better to use a for loop or isalpha() here, but I can't think of a better way to do it. Can someone tell me how to do this?
For one-to-one and one-to-None mappings of characters, you can use the .translate() method of strings. The string module provides lists (strings) of the various types of characters including one for all letters in upper and lowercase (string.ascii_letters) but you could also use your own constant string such as 'abcdef....xyzABC...XYZ'.
import string
def cleanLetters(S):
    nonLetters = S.translate(str.maketrans('','',' '+string.ascii_letters))
    return S.translate(str.maketrans(' ','_',nonLetters))
Output:
cleanLetters("hello2 sjsjs- skskskSkD")
'hello_sjsjs_skskskSkD'
One method to accomplish this is to use regular expressions (regex) via the built-in re library. This enables the capturing of only the valid characters, and ignoring the rest.
The result is then finished with basic string tools: translate() for the replacement, capitalize() for the capitalisation, and a slice at the end for the length limit.
For example:
import re
name = 'hello2 sjsjs- skskskSkD'
trans = str.maketrans({' ': '_', '\t': '_'})
''.join(re.findall(r'[a-zA-Z\s]', name)).translate(trans).capitalize()[:80]
>>> 'Hello_sjsjs_skskskskd'
Strings are immutable, so every time you do string.replace() it needs to iterate over the entire string to find characters to replace, and a new string is created. Instead of doing this, you could simply iterate over the current string and create a new list of characters that are valid. When you're done iterating over the string, use str.join() to join them all.
answer_l = []
for letter in string:
    if letter == " " or letter == "\t":
        answer_l.append("_") # Replace spaces or tabs with _
    elif letter.isalpha():
        answer_l.append(letter) # Use alphabet characters as-is
    # else do nothing
answer = "".join(answer_l)
With string = 'hello2 sjsjs- skskskSkD', we have answer = 'hello_sjsjs_skskskSkD'.
Now you could also write this using a generator expression instead of creating the entire list and then joining it. First, we define a function that returns the letter or "_" for our first two conditions, and an empty string for the else condition
def translate(letter):
    if letter == " " or letter == "\t":
        return "_"
    elif letter.isalpha():
        return letter
    else:
        return ""
Then,
answer = "".join(
    translate(letter) for letter in string
)
To enforce the 80-character limit, just take answer[:80]. Because of the way slices work in python, this won't throw an error even when the length of answer is less than 80.
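Putting the pieces above together, a complete function could look like the sketch below. The name clean_name and the max_len parameter are hypothetical. It assumes Python 3.7+ for str.isascii(), which is added so that only a-z and A-Z count as letters, as the question requires (isalpha() alone also accepts non-ASCII letters).

```python
def clean_name(name, max_len=80):
    """Keep ASCII letters, turn spaces/tabs into '_', capitalize, truncate."""
    chars = []
    for letter in name:
        if letter in " \t":
            chars.append("_")      # spaces and tabs become underscores
        elif letter.isascii() and letter.isalpha():
            chars.append(letter)   # keep a-z and A-Z as-is
        # anything else is dropped
    # capitalize() upper-cases the first character and lower-cases the rest
    return "".join(chars).capitalize()[:max_len]


print(clean_name("hello2 sjsjs- skskskSkD"))  # Hello_sjsjs_skskskskd
```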
I am massaging strings so that the first letter of the string and the first letter following either a dash or a slash are capitalized.
So the following string:
test/string - this is a test string
Should look like so:
Test/String - This is a test string
So in trying to solve this problem my 1st idea seems like a bad idea - iterate the string and check every character and using indexing etc. determine if a character follows a dash or slash, if it does set it to upper and write out to my new string.
def correct_sentence_case(test_phrase):
    corrected_test_phrase = ''
    firstLetter = True
    for char in test_phrase:
        if firstLetter:
            corrected_test_phrase += char.upper()
            firstLetter = False
        #elif char == '/':
        else:
            corrected_test_phrase += char
This just seems VERY un-pythonic. What is a pythonic way to handle this?
Something along the lines of the following would be awesome but I can't pass in both a dash and a slash to the split:
corrected_test_phrase = ' - '.join(i.capitalize() for i in test_phrase.split(' - '))
Which I got from this SO:
Convert UPPERCASE string to sentence case in Python
Any help will be appreciated :)
I was able to accomplish the desired transformation with a regular expression:
import re
capitalized = re.sub(
    r'(^|[-/])\s*([A-Za-z])', lambda match: match[0].upper(), phrase)
The expression says "anywhere you match either the start of the string, ^, or a dash or slash, followed by maybe some space and a letter, replace the whole match with its uppercase form", which in practice only changes the letter.
If you don't want to go with a messy splitting-joining logic, go with a regex:
import re
string = 'test/string - this is a test string'
print(re.sub(r'(^([a-z])|(?<=[-/])\s?([a-z]))',
lambda match: match.group(1).upper(), string))
# Test/String - This is a test string
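For reuse, the regex approach can be wrapped in a function. The following is a sketch rather than either answerer's exact code: the name capitalize_after_separators is made up, and it only upper-cases ASCII lowercase letters.

```python
import re


def capitalize_after_separators(phrase):
    """Upper-case the first letter and any letter following '-' or '/'."""
    # Group 1: start of string, or a dash/slash plus optional whitespace.
    # Group 2: the single lowercase letter to upper-case.
    return re.sub(r"(^|[-/]\s*)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(),
                  phrase)


print(capitalize_after_separators("test/string - this is a test string"))
# Test/String - This is a test string
```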
Using double split
import re
' - '.join([i.strip().capitalize() for i in re.split(' - ','/'.join([i.capitalize() for i in re.split('/',test_phrase)]))])
I'm using this:
import string
last = 'pierre-GARCIA'
if last not in [None, '']:
    last = last.strip()
    if '-' in last:
        last = string.capwords(last, sep='-')
    else:
        last = string.capwords(last, sep=None)
I'd like to remove all characters before a designated character or set of characters (for example):
intro = "<>I'm Tom."
Now I'd like to remove the <> before I'm (or more specifically, I). Any suggestions?
Use re.sub. Just match all the characters up to the first I, then replace the match with I.
re.sub(r'^.*?I', 'I', intro)
str.find returns the index of the first occurrence of a substring:
intro[intro.find('I'):]
Since intro.index(char) gives you the first index of the character, you can simply do intro[intro.index(char):].
For example, in this case intro.index("I") = 2, and intro[2:] = "I'm Tom."
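The two methods differ when the character is missing: str.find returns -1 (so intro[-1:] would silently return just the last character), while str.index raises ValueError. A small sketch that handles the missing case explicitly; the function name drop_before is made up.

```python
def drop_before(text, marker):
    """Return text from the first occurrence of marker, or text unchanged."""
    pos = text.find(marker)  # -1 when marker is absent
    return text[pos:] if pos != -1 else text


print(drop_before("<>I'm Tom.", "I"))  # I'm Tom.
print(drop_before("<>hello", "I"))     # <>hello
```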
If you know the character position of where to start deleting, you can use slice notation:
intro = intro[2:]
Instead of knowing where to start, if you know the characters to remove then you could use the lstrip() function:
intro = intro.lstrip("<>")
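Note that lstrip() treats its argument as a set of characters rather than as a prefix, so it removes any leading run of < and > in any order. If you want to remove one exact prefix instead, str.removeprefix() does that, assuming Python 3.9 or later. The sample strings below are made up for illustration.

```python
# lstrip() strips any leading run of the given characters, in any order.
stripped = "><<>I'm Tom.".lstrip("<>")
print(stripped)  # I'm Tom.

# removeprefix() (Python 3.9+) removes the exact prefix, at most once.
exact = "<><>I'm Tom.".removeprefix("<>")
print(exact)  # <>I'm Tom.
```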
s = "<>I'm Tom."  # avoid naming a variable str, which shadows the built-in
temp = s.split("I", 1)
temp[0] = temp[0].replace("<>", "")
s = "I".join(temp)
I looped through the string by index.
intro_list = []
intro = "<>I'm Tom."
for i in range(len(intro)):
    if intro[i] == '<' or intro[i] == '>':
        pass
    else:
        intro_list.append(intro[i])
intro = ''.join(intro_list)
print(intro)
import re
date_div = "Blah blah\nblah, Updated: Aug. 23, 2012 Blah blah Updated: Feb. 13, 2019"
up_to_word = ":"
rx_to_first = r'^.*?{}'.format(re.escape(up_to_word))
rx_to_last = r'^.*{}'.format(re.escape(up_to_word))
# (Dot.) In the default mode, this matches any character except a newline.
# If the DOTALL flag has been specified, this matches any character including a newline.
print("Remove all up to the first occurrence of the word including it:")
print(re.sub(rx_to_first, '', date_div, flags=re.DOTALL).strip())
print("Remove all up to the last occurrence of the word including it:")
print(re.sub(rx_to_last, '', date_div, flags=re.DOTALL).strip())
>>> intro = "<>I'm Tom."
>>> # Just split the string at the special symbol
>>> intro.split("<>")
['', "I'm Tom."]
>>> new = intro.split("<>")
>>> new[1]
"I'm Tom."
This solution also works when the character is not in the string, at the cost of an explicit if check.
if 'I' in intro:
    print('I' + intro.split('I', 1)[1])  # maxsplit=1 keeps any later 'I' intact
else:
    print(intro)
You can use itertools.dropwhile to drop all the characters that come before the character you want to stop at. Then, you can use ''.join() to turn the resulting iterable back into a string:
from itertools import dropwhile
intro = "<>I'm Tom."
stop = "I"  # character(s) to stop dropping at
''.join(dropwhile(lambda x: x not in stop, intro))
This outputs:
I'm Tom.
Based on @AvinashRaj's answer, you can use re.sub to substitute a substring with a string or a character using a regex:
import re
output_str = re.sub(r'^.*?I', 'I', input_str)
import re
intro = "<>I'm Tom."
re.sub(r'<>I', 'I', intro)
I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.
Here is a regex to match a string of characters that are not letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)
Shorter way :
import re
cleanString = re.sub('\W+','', string )
If you want spaces between words and numbers substitute '' with ' '
TLDR
I timed the provided answers.
import re
re.sub('\W+','', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
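The caution above can be checked directly. A small sketch, assuming Python 3, where \w (and therefore \W) is Unicode-aware by default; the sample text is made up.

```python
import re

text = "smørrebrød & øl!"

# By default, ø counts as a word character, so \W+ leaves it in place.
kept = re.sub(r"\W+", "", text)
print(kept)  # smørrebrødøl

# With re.ASCII, only [a-zA-Z0-9_] are word characters, so ø is removed.
stripped = re.sub(r"\W+", "", text, flags=re.ASCII)
print(stripped)  # smrrebrdl
```

Also note that \W never matches the underscore; use [\W_]+ if underscores should be removed too.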
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub('\W+','', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
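To reproduce this kind of comparison yourself, something along these lines should work. It is a sketch, not the exact benchmark used above: the three function names are made up, and the iteration count is kept small so it finishes quickly. Absolute timings depend on the machine and Python version.

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'


def with_isalnum(s):
    return ''.join(e for e in s if e.isalnum())


def with_char_class(s):
    return re.sub('[^A-Za-z0-9]+', '', s)


def with_non_word(s):
    return re.sub(r'\W+', '', s)


for func in (with_isalnum, with_char_class, with_non_word):
    # Take the best of 3 runs of 10,000 calls each.
    best = min(timeit.repeat(lambda: func(string1), repeat=3, number=10000))
    print(f"{func.__name__}: {best:.4f} s")
```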
Python 2.*
I think just filter(str.isalnum, string) works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, filter() returns an iterator (instead of a string, as above). You have to join the result to get a string back:
''.join(filter(str.isalnum, string))
or pass a list to join (not sure, but this may be slightly faster):
''.join([*filter(str.isalnum, string)])
Note: unpacking with [*args] is valid from Python 3.5 onwards.
#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print(strs)
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print(nstr)
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print(nestr)
You can add more special characters to the character class; each one is replaced by '' (nothing), i.e. it is removed.
Unlike the answers that list the unwanted characters, I would exclude every character that is not what I want, instead of enumerating explicitly what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you put the special character ^ first inside a character class, you get its negation.
Extra tip: if you also need to lowercase the result, lowercase the string first; the regex then becomes simpler, because there are no uppercase letters left to match.
import re
s = re.sub(r"[^a-z0-9]","",s.lower())
string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can use the translate and maketrans functions to map punctuation characters to nothing (i.e. delete them):
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:
import unicodedata
# strip out unwanted characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')
Look at the URL given above for all related categories. You can of course also filter by the punctuation categories.
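Filtering by the punctuation categories could be sketched as follows, assuming Python 3. In the Unicode database, punctuation categories start with "P" (e.g. Po, Pd) and symbol categories with "S" (e.g. Sc for currency). The function name and the sample string are made up.

```python
import unicodedata


def strip_punctuation_and_symbols(text):
    """Drop every character whose Unicode category starts with P or S."""
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("P", "S"))


print(strip_punctuation_and_symbols("Grüße, Wörld! 100€"))  # Grüße Wörld 100
```

Unlike an ASCII-only regex, this keeps letters such as ü and ß without listing them by hand.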
For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)
This will remove all special characters, punctuation, and spaces from a string and only have numbers and letters.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z]", "", sample_str)
print(f"op2 = ", op2)
special_char_list = ["$", "@", "#", "&", "%"]
# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = ", op1)
# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = ", op3)
Use translate:
import string
def clean(instr):
return instr.translate(None, string.punctuation + ' ')
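Caveat: This two-argument form of translate only works on Python 2 byte strings (str). It fails on Python 2 unicode objects and on Python 3 strings.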
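In Python 3, the equivalent is to build a deletion table with str.maketrans, whose third argument lists the characters to delete. A minimal sketch keeping the function name clean from above:

```python
import string


def clean(instr):
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation + " ")
    return instr.translate(table)


print(clean("This, is. A test!"))  # ThisisAtest
```

This only removes ASCII punctuation and the space character; other Unicode punctuation is left untouched.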
This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323
import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
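# Count the word "python", whether or not it is followed by ',' or '.'
# Assumption: text is the lowercased string split into words
# (text was not defined in the original snippet).
text = my_string.lower().split()
for count, word in enumerate(text):
    if word.endswith((".", ",")):
        # strip the trailing punctuation mark from the word
        text[count] = re.sub(r"^([a-z]+)[.,]?$", r"\1", word)
print("The count of Python : ", text.count("python"))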
Ten years on, I think the solution below is the best one.
You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.
from clean_text import clean
string = 'Special $#! characters spaces 888323'
new = clean(string,lower=False,no_currency_symbols=True, no_punct = True,replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
You can also remove the spaces if you want.
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'
function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // keep word characters [a-zA-Z0-9_] and whitespace
  st = st.replace(regx, ''); // remove everything else
  st = st.replace(/\s\s+/g, ' '); // collapse multiple spaces into one
  return st;
}
console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67
import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)
and you will see the result:
askhnlaskdjalsdk