I am supposed to write a little program that takes a Persian text and in some places changes the space to a half-space. The half-space, or zero-width non-joiner (ZWNJ), is used in some languages to prevent a ligature when normalizing text. Its Unicode character is supposedly '\u200c', and in some text editors it can be typed with SHIFT+SPACE:
import re
txt = input('Please enter a Persian text: ')
original_pattern = r'\b(\w+)\s*(ها|هايي|هايم|هاي)\b'
new_pattern = r'\1 \2'
new_txt = re.sub(original_pattern, new_pattern, txt)
print(new_txt)
In the code above, new_pattern is supposed to introduce a half-space between \1 and \2, currently there is a space between them.
The question is: How can I put a half-space there? I tried the following and in both cases got a syntax error:
new_pattern = ur'\1\u200c\2'
new_pattern = r'\1\u200c\2'
By the way, although the Wikipedia article gives the Unicode character for ZWNJ as U+200C, it doesn't seem to work that way in the Python shell, and it actually doubles the space:
>>> print ('He is a',u'\u200c','boy')
He is a boy
>>> print ("کتاب",u"\u200c","ها")
کتاب ها
>>> print ("کتاب ها")
کتاب ها
>>>
Python adds a separator between the arguments of the print function; you can control this with the sep argument. Try
print('He is a', '\u200c', 'boy', sep="")
For a pattern, try
new_pattern = '\\1\u200c\\2'
or
new_pattern = '\\1\N{ZERO WIDTH NON-JOINER}\\2'
The reason is that when you add an r prefix, backslash escapes are not processed, so the \u200c part of the pattern is treated as a six-character string (\, u, 2, 0, 0, c); i.e. the pattern equals \\1\\u200c\\2, hence your error.
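To see the fix in action, here is a minimal sketch (assuming Python 3, with a simplified pattern that only handles the ها suffix):

```python
import re

# Sample input: a plain space before the plural suffix
txt = "کتاب ها"
# Non-raw replacement string, so \u200c is decoded into a real ZWNJ
new_txt = re.sub(r'\b(\w+)\s*(ها)\b', '\\1\u200c\\2', txt)
print('\u200c' in new_txt)  # True: the space became a zero-width non-joiner
```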
I've exhausted myself trying either to build the right regex or just to remove the comma from the file that I'm parsing. Essentially, I am matching a specific string, then anything that follows it, up to the comma. I need to get the substring before the comma, not including the comma. I suppose I can do this either in the regex or by removing the comma further down in the code.
I'm pretty new at this, so this is probably basic stuff, but I can't seem to find the right thing in my searches.
Here is my code:
import re
str = "FullName=TECIBW04 TECIBW04, TargetResource=k2vFe6yPvBoEdrmrE9t3i5UE2muLVW,"
match = re.search(r'FullName=.+?,', str)
if match:
    print match.group()  # 'found a match'
else:
    print 'ainnussin zer'
I get:
FullName=TECIBW04 TECIBW04,
Great...I'm getting back what I need (and a little extra). I actually don't want the comma.
What's the best method to get rid of or not include that sucker?
Since the comma , is the delimiter here, just negate it in your regex:
match = re.search(r'FullName=[^,]+', str)
Put everything except comma in a saving group:
match = re.search(r'(FullName=.+?),', str)
if match:
    print match.group(1)  # 'found a match'
else:
    print 'ainnussin zer'
prints
FullName=TECIBW04 TECIBW04
How about using split on ,?
str.split(',')[0]
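For the sample string above, a quick check of this approach (using a name other than the built-in str for the variable):

```python
s = "FullName=TECIBW04 TECIBW04, TargetResource=k2vFe6yPvBoEdrmrE9t3i5UE2muLVW,"
print(s.split(',')[0])  # FullName=TECIBW04 TECIBW04
```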
Edit
These are ways to do it without the regex.
For checking if the string starts with another substring, you can use
if str.startswith("FullName="):
    print str.split(',')[0]
else:
    print "ainnussin zer"
For doing this in one line, you can try
print str.split(',')[0] if str.startswith("FullName=") else "ainnussin zer"
You can also use str.partition:
>>> str = "FullName=TECIBW04 TECIBW04, TargetResource=k2vFe6yPvBoEdrmrE9t3i5UE2muLVW,"
>>> str.partition(',')
('FullName=TECIBW04 TECIBW04', ',', ' TargetResource=k2vFe6yPvBoEdrmrE9t3i5UE2muLVW,')
>>> str.partition(',')[0]
'FullName=TECIBW04 TECIBW04'
If you are going to use a regex, I would use this:
match = re.search(r'^FullName=[^,]+', str)
if match:
    print match.group(0)  # 'found a match'
else:
    print 'ainnussin zer'
Or this, if you are just trying to capture the right-hand side of the =:
match = re.search(r'^FullName=([^,]+)', str)
if match:
    print match.group(1)  # 'found a match'
else:
    print 'ainnussin zer'
Place what you want to match inside of a capturing group.
m = re.search(r'(FullName=.*?),', str)
if m:
    print m.group(1)
I would like to replace (and not remove) all punctuation characters by " " in a string in Python.
Is there something efficient along the following lines?
text = text.translate(string.maketrans("",""), string.punctuation)
This answer is for Python 2 and will only work for ASCII strings:
The string module contains two things that will help you: a list of punctuation characters and the "maketrans" function. Here is how you can use them:
import string
replace_punctuation = string.maketrans(string.punctuation, ' '*len(string.punctuation))
text = text.translate(replace_punctuation)
Modified solution from Best way to strip punctuation from a string in Python
import string
import re
regex = re.compile('[%s]' % re.escape(string.punctuation))
out = regex.sub(' ', "This is, fortunately. A Test! string")
# out = 'This is fortunately A Test string'
This workaround works in Python 3:
import string
ex_str = 'SFDF-OIU .df !hello.dfasf sad - - d-f - sd'
#because len(string.punctuation) = 32
table = str.maketrans(string.punctuation,' '*32)
res = ex_str.translate(table)
# res = 'SFDF OIU df hello dfasf sad d f sd'
There is a more robust solution which relies on a regex exclusion rather than inclusion through an extensive list of punctuation characters.
import re
print(re.sub(r'[^\w\s]', '', 'This is, fortunately. A Test! string'))
#Output - 'This is fortunately A Test string'
The regex catches anything that is not an alphanumeric or whitespace character.
Replace with ''? What's the difference between translating all ; into '' and removing all ;?
Here is how to remove all ;:
s = 'dsda;;dsd;sad'
table = string.maketrans('','')
string.translate(s, table, ';')
And you can do your replacement with translate.
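In Python 3, the same replacement can be sketched with str.maketrans, which builds the mapping directly:

```python
# Python 3 sketch: map every ';' to a space instead of deleting it
s = 'dsda;;dsd;sad'
table = str.maketrans(';', ' ')
print(s.translate(table))  # dsda  dsd sad
```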
In my specific case, I removed "+" and "&" from the punctuation list:
all_punctuations = string.punctuation
selected_punctuations = re.sub(r'(\&|\+)', "", all_punctuations)
print selected_punctuations
str = "he+llo* ithis& place% if you * here ##"
punctuation_regex = re.compile('[%s]' % re.escape(selected_punctuations))
punc_free = punctuation_regex.sub("", str)
print punc_free
Result: he+llo ithis& place if you here
string = "im fine.gds how are you"
if '.gds' or '.cdl' in string:
    a = string.split("????????")
The above string may contain a .gds or a .cdl extension. I want to split the string based on whichever extension is present. How can the parameter be passed to the split function? (E.g. if .gds is present in the string it should use split(".gds"); if .cdl is there it should use split(".cdl").)
I think you have to split the if statements:
if '.gds' in string:
    a = string.split('.gds')
elif '.cdl' in string:
    a = string.split('.cdl')
else:
    a = string  # fallback in case neither pattern is in the string
Furthermore, your in statement is incorrect; it should have been
if '.gds' in string or '.cdl' in string:
Note that this solution assumes that only one of the patterns will be in the string. If both patterns can occur on the same string, see Vikas's answer.
Use regular expression module re to split either by pattern1 or pattern2
import re
re.split(r'\.gds|\.cdl', your_string)
Example:
>>> re.split(r'\.gds|\.cdl', "im fine.gds how are you")
['im fine', ' how are you']
>>> re.split(r'\.gds|\.cdl', "im fine.cdl how are you")
['im fine', ' how are you']
>>> re.split(r'\.gds|\.cdl', "im fine.cdl how are.gds you")
['im fine', ' how are', ' you']
You can try to define a function like:
def split_on_extensions(string, *extensions):
    for ext in extensions:
        if ext in string:
            return string.split(ext)
    return string
Of course, the order in which you give the extensions is critical, as you'll split on the first one...
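A quick usage check of the function above (redefined here so the snippet is self-contained):

```python
def split_on_extensions(string, *extensions):
    # Split on the first extension found; return the string unchanged otherwise
    for ext in extensions:
        if ext in string:
            return string.split(ext)
    return string

print(split_on_extensions("im fine.gds how are you", '.gds', '.cdl'))
# ['im fine', ' how are you']
print(split_on_extensions("im fine how are you", '.gds', '.cdl'))
# 'im fine how are you' -- neither extension present
```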
Are you guaranteed that one of the two of them will be there?
a = next( string.split(v) for v in ('.gds','.cdl') if v in string )
If you're not positive it will be there, you can catch the StopIteration that is raised in next:
try:
    a = next(string.split(v) for v in ('.gds', '.cdl') if v in string)
except StopIteration:
    a = string  # ????
Another option is to use the built-in str.partition. This is how it works:
s = "im fine.gds how are you"
three_parts = s.partition('.gds')
>>> three_parts
('im fine', '.gds', ' how are you')
Put it into a little function and you're set.
You can iterate over the separators:
string = "im fine.gds how are you"
separators = ['.gds', '.cdl']
for separator in separators:
    if separator in string:
        a = string.split(separator)
        break
else:
    a = []
I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using a regex, the other solutions will do fine. However, note that if it can be done without a regular expression, that's the best way to go about it.
Here is a regex to match a string of characters that are not letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)
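Applied to the example string from the question, this gives:

```python
import re

mystring = "Special $#! characters spaces 888323"
print(re.sub('[^A-Za-z0-9]+', '', mystring))  # Specialcharactersspaces888323
```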
A shorter way:
import re
cleanString = re.sub(r'\W+', '', string)
If you want spaces between words and numbers, substitute '' with ' '.
TLDR
I timed the provided answers.
import re
re.sub(r'\W+', '', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
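The caution is because \W is Unicode-aware in Python 3, so accented and other non-ASCII letters count as word characters and survive the substitution:

```python
import re

# 'ø' and 'ë' are Unicode word characters, so \W+ does not strip them
print(re.sub(r'\W+', '', 'søme tëxt!'))  # sømetëxt
```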
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub(r'\W+', '', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are the lowest value returned from an average of repeat(3, 2000000).
Example 3 can be 3x faster than Example 1.
Python 2.*
I think just filter(str.isalnum, string) works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, the filter() function returns an iterable object (instead of a string as above). One has to join it back to get a string:
''.join(filter(str.isalnum, string))
or, to pass a list to join (not sure, but this can be a bit faster):
''.join([*filter(str.isalnum, string)])
note: unpacking in [*args] valid from Python >= 3.5
#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ricidulous!!!"
print strs
nstr = re.sub(r'[?|$|.|!]',r'',strs)
print nstr
nestr = re.sub(r'[^a-zA-Z0-9 ]',r'',nstr)
print nestr
You can add more special characters and they will be replaced by '' (i.e. nothing; they will be removed).
Unlike the other regex answers, I would try to exclude every character that is not what I want, instead of explicitly enumerating what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.
Extra tip: if you also need to lowercase the result, you can make the regex even faster and simpler, because after lowercasing no uppercase letters remain:
import re
s = re.sub(r"[^a-z0-9]","",s.lower())
string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can use translate and maketrans functions to map punctuations to empty values (replace)
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
The most generic approach is to use the 'categories' of the unicodedata table, which classifies every single character. E.g. the following code keeps only printable characters based on their category:
import unicodedata
# Strip unwanted characters based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')
Look at the URL above for all related categories. You can of course also filter by the punctuation categories.
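To illustrate the category codes that the PRINTABLE set relies on:

```python
import unicodedata

print(unicodedata.category('A'))  # Lu: uppercase letter
print(unicodedata.category('a'))  # Ll: lowercase letter
print(unicodedata.category('5'))  # Nd: decimal digit
print(unicodedata.category(' '))  # Zs: space separator
print(unicodedata.category('!'))  # Po: punctuation -- filtered out above
```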
For other languages, like German, Spanish, Danish, French etc., that contain special characters (like the German umlauts ü, ä, ö), simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)
This will remove all special characters, punctuation, and spaces from a string, leaving only numbers and letters.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
print(f"op2 = ", op2)
special_char_list = ["$", "#", "@", "&", "%"]
# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = ", op1)
# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = ", op3)
Use translate:
import string
def clean(instr):
    return instr.translate(None, string.punctuation + ' ')
Caveat: this only works on ASCII strings, in Python 2.
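A Python 3 sketch of the same idea, using the three-argument form of str.maketrans, whose third argument lists characters to delete:

```python
import string

def clean(instr):
    # Characters in the third argument are mapped to None (deleted)
    return instr.translate(str.maketrans('', '', string.punctuation + ' '))

print(clean('Special $#! characters spaces 888323'))  # Specialcharactersspaces888323
```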
This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323
import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
# count the word "Python", whether or not it ends with '.' or ','
text = my_string.split()
for count, word in enumerate(text):
    if word.endswith("."):
        text[count] = re.sub(r"^(\w+)\.$", r"\1", word)
print("The count of Python : ", text.count("Python"))
After 10 years, here is another solution, using the third-party clean-text package. You can remove/clean all special characters, punctuation, non-ASCII characters, and spaces from the string.
from cleantext import clean  # pip install clean-text
string = 'Special $#! characters spaces 888323'
new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
You can also replace the spaces if you want:
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'
The same idea in JavaScript:
function regexFuntion(st) {
const regx = /[^\w\s]/gi; // allow : [a-zA-Z0-9, space]
st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
st = st.replace(/\s\s+/g, ' '); // remove multiple space
return st;
}
console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67
import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)
and you shall see your result as
askhnlaskdjalsdk