Half-space in regex - python

I am supposed to write a little program that takes in a Persian text and in some places changes the space to a half-space. The half-space, or zero-width non-joiner (ZWNJ), is used in some languages to avoid a ligature when normalizing a text. Its Unicode character is supposedly '\u200c', and in some text editors it can be typed with SHIFT+SPACE:
import re
txt = input('Please enter a Persian text: ')
original_pattern = r'\b(\w+)\s*(ها|هايي|هايم|هاي)\b'
new_pattern = r'\1 \2'
new_txt = re.sub (original_pattern, new_pattern, txt)
print (new_txt)
In the code above, new_pattern is supposed to introduce a half-space between \1 and \2; currently there is an ordinary space between them.
The question is: how can I put a half-space there? I tried the following, and in both cases got a syntax error:
new_pattern = ur'\1\u200c\2'
new_pattern = r'\1\u200c\2'
By the way, although in the Wikipedia article the unicode character for ZWNJ is given as U+200c, it doesn't seem to be working that way in the python shell and it is actually doubling the space:
>>> print ('He is a',u'\u200c','boy')
He is a ‌ boy
>>> print ("کتاب",u"\u200c","ها")
کتاب ‌ ها
>>> print ("کتاب ها")
کتاب ها
>>>

Python adds a separator between the arguments of the print function. You can control this with the sep argument; try
print ('He is a', '\u200c', 'boy', sep="")
For a pattern, try
new_pattern = '\\1\u200c\\2'
or
new_pattern = '\\1\N{ZERO WIDTH NON-JOINER}\\2'
The reason is that with an r prefix, backslash escapes are not processed, so the \u200c part of the pattern is treated as a six-character string rather than a single character, i.e. the pattern equals \\1\\u200c\\2, hence your error.
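As a minimal sketch of the fix applied to the question's substitution (the function name add_half_space and the constant ZWNJ are made up for illustration; the suffix list is copied from the question):

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, the Persian "half-space"

# Pattern taken from the question; only the replacement changes.
SUFFIX_PATTERN = re.compile(r"\b(\w+)\s*(ها|هايي|هايم|هاي)\b")


def add_half_space(text):
    """Join a word and one of the listed suffixes with a ZWNJ."""
    # "\\1" and "\\2" remain regex group references; ZWNJ is a real character.
    return SUFFIX_PATTERN.sub("\\1" + ZWNJ + "\\2", text)


word, suffix = "کتاب", "ها"
result = add_half_space(word + " " + suffix)
print(repr(result))  # the space is replaced by '\u200c'
```

Printing with repr makes the otherwise invisible character show up as an escape, which is easier to check than the rendered text.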

Related

Remove #{word} from a string using regex (python)

I'm not familiar with regex, and it would be great to have some help.
I have a string: string = "You get #{turns} guesses."
and I would like to remove #{turns} in order to have string = "You get guesses."
I've tried it with:
string = re.sub(re.compile('#{.*?}'),"",string)
Any suggestions?
For this specific question you can also do it like so:
import re
string = "You get #{turns} guesses."
re.sub(r'#\S+ ', r'', string)
Output:
'You get guesses.'
Regex:
'#\S+ ' Match # and match as many non space characters and a single space.
Your code works, except that it does not remove a sufficient amount of spaces and that compilation is rather useless if you only use it once:
>>> string = "You get #{turns} guesses."
>>> string = re.sub(re.compile('#{.*?}'),"",string)
>>> string
'You get  guesses.'
So you probably want to compile the regex once and then reuse it, and you had better alter it to, for instance, remove trailing spaces:
rgx = re.compile('#{.*?}\s*')
string = rgx.sub('',string)
Note the \s* which will match with an arbitrary amount of spaces after the tag, and thus remove these as well:
>>> string = "You get #{turns} guesses."
>>> rgx = re.compile('#{.*?}\s*')
>>> string = rgx.sub('',string)
>>> string
'You get guesses.'
In case it is one word between the curly brackets ({}), you better use \w to exclude spaces:
rgx = re.compile('#{\w*}\s*')
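Putting the pieces of this answer together, a small sketch might look like the following. The names PLACEHOLDER and strip_placeholders are hypothetical, and the braces are escaped only for readability:

```python
import re

# Compiled once and reused: a "#{word}" tag plus any whitespace that follows it.
PLACEHOLDER = re.compile(r"#\{\w*\}\s*")


def strip_placeholders(text):
    """Remove every #{word} tag together with its trailing whitespace."""
    return PLACEHOLDER.sub("", text)


print(strip_placeholders("You get #{turns} guesses."))  # You get guesses.
```

Note that a tag at the very end of the string would leave the space before it in place, since only whitespace after the tag is consumed.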

Why regex is not working?

I need to replace all occurences of normal whitespaces in «статья 1», «статьи 2» etc. with non-breaking spaces.
The construction below works fine:
re.sub('(стат.{0,4}) (\d+)', r'\1 \2', text) # 'r' in repl is important, otherwise the word is not replaced correctly, at least for texts in Russian.
however, I do not want to repeatedly use re.sub for «статья», then for «пункт», then for the names of months, I want to have a dictionary with regex expressions and replacements. Here's my code, but it does not work as expected: 'статья 1 статьи 2' should look like 'статья(non-breaking space here)1 статьи(non-breaking space here)2':
import re
text = 'статья 1 статьи 2'
dic = {'(cтат.{0,4}) (\d+)' : r'\1 \2'}
def replace():
    global text
    final_text = ''
    for i in dic:
        new_text = re.sub(str(i), str(dic[i]), text)
        text = new_text
    return text
print (replace())
The problem is that you copied and pasted wrong.
This pattern works:
'(стат.{0,4}) (\d+)'
This one doesn't:
'(cтат.{0,4}) (\d+)'
Why? Because in the first one, and in your search string, that first character is a U+0441, a Cyrillic small Es. But in the second one, it's a U+0063, a Latin small C. Of course the two look identical in most fonts, but they're not the same character.
So, how can you tell? Well, when I suspected this problem, here's what I did:
>>> a = '(стат.{0,4}) (\d+)' # copied and pasted from your working code
>>> b = '(cтат.{0,4}) (\d+)' # copied and pasted from your broken code
>>> print(a.encode('unicode-escape').decode('ascii'))
(\u0441\u0442\u0430\u0442.{0,4}) (\\d+)
>>> print(b.encode('unicode-escape').decode('ascii'))
(c\u0442\u0430\u0442.{0,4}) (\\d+)
And the difference is obvious: the first one has a \u0441 escape sequence where the second one has a plain ASCII c.
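Another way to spot look-alike characters is to print each character's Unicode name. This is only a sketch; describe_chars is a hypothetical helper, and the two sample strings are the first four characters of the working and broken patterns written as escapes:

```python
import unicodedata


def describe_chars(s):
    """Return one 'U+XXXX NAME' entry per character of s."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]


working = "\u0441\u0442\u0430\u0442"  # all Cyrillic
broken = "c\u0442\u0430\u0442"        # starts with a Latin c

for label, value in (("working", working), ("broken", broken)):
    print(label)
    for entry in describe_chars(value):
        print("  ", entry)
```

The first entry differs between the two lists (CYRILLIC SMALL LETTER ES versus LATIN SMALL LETTER C), even though the strings look the same on screen.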

Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.
This can be done without regex:
>>> string = "Special $#! characters spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'
You can use str.isalnum:
S.isalnum() -> bool
Return True if all characters in S are alphanumeric
and there is at least one character in S, False otherwise.
If you insist on using regex, other solutions will do fine. However note that if it can be done without using a regular expression, that's the best way to go about it.
Here is a regex to match a string of characters that are not letters or numbers:
[^A-Za-z0-9]+
Here is the Python command to do a regex substitution:
re.sub('[^A-Za-z0-9]+', '', mystring)
Shorter way :
import re
cleanString = re.sub('\W+','', string )
If you want spaces between words and numbers substitute '' with ' '
TLDR
I timed the provided answers.
import re
re.sub('\W+','', string)
is typically 3x faster than the next fastest provided top answer.
Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
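To illustrate the caution above: in Python 3, \W is Unicode-aware for str patterns and treats the underscore as a word character. The sketch below assumes that ASCII-only output is wanted and uses the re.ASCII flag for that; the sample string is made up:

```python
import re

sample = "søren_42!"

# Default behaviour: ø and _ count as word characters and survive.
unicode_aware = re.sub(r"\W+", "", sample)

# With re.ASCII, \W also matches ø; adding _ to the class removes underscores too.
ascii_only = re.sub(r"[\W_]+", "", sample, flags=re.ASCII)

print(unicode_aware)  # søren_42
print(ascii_only)     # sren42
```

The flags argument is passed by keyword because the fourth positional parameter of re.sub is count.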
After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:
string1 = 'Special $#! characters spaces 888323'
string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'
Example 1
''.join(e for e in string if e.isalnum())
string1 - Result: 10.7061979771
string2 - Result: 7.78372597694
Example 2
import re
re.sub('[^A-Za-z0-9]+', '', string)
string1 - Result: 7.10785102844
string2 - Result: 4.12814903259
Example 3
import re
re.sub('\W+','', string)
string1 - Result: 3.11899876595
string2 - Result: 2.78014397621
The above results are a product of the lowest returned result from an average of: repeat(3, 2000000)
Example 3 can be 3x faster than Example 1.
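A comparison along these lines can be reproduced with a sketch like the one below. The number of iterations is an arbitrary, much smaller value than the one used above so that it finishes quickly, and absolute timings will differ by machine and Python version:

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'

# The three approaches compared above, wrapped as zero-argument callables.
candidates = {
    "isalnum join": lambda: ''.join(e for e in string1 if e.isalnum()),
    "explicit class": lambda: re.sub(r'[^A-Za-z0-9]+', '', string1),
    "non-word class": lambda: re.sub(r'\W+', '', string1),
}

# Take the lowest of three repeats for each candidate.
timings = {
    name: min(timeit.repeat(fn, repeat=3, number=20000))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:15s} {seconds:.4f} s")
```

No expected timings are shown here on purpose; the ranking should be checked on the target interpreter rather than assumed.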
Python 2.*
I think just filter(str.isalnum, string) works
In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'
Python 3.*
In Python 3, the filter() function returns an iterable object (instead of a string, unlike above). One has to join it back to get a string from the iterable:
''.join(filter(str.isalnum, string))
or, to pass a list to join, use (not sure, but it may be a bit faster)
''.join([*filter(str.isalnum, string)])
note: unpacking in [*args] valid from Python >= 3.5
#!/usr/bin/python
import re
strs = "how much for the maple syrup? $20.99? That's ridiculous!!!"
print(strs)
# inside a character class, list the characters directly; a | would be matched literally
nstr = re.sub(r'[?$.!]', r'', strs)
print(nstr)
nestr = re.sub(r'[^a-zA-Z0-9 ]', r'', nstr)
print(nestr)
You can add more special characters to the class; each is replaced by '' (nothing), i.e. it is removed.
Differently than everyone else did using regex, I would try to exclude every character that is not what I want, instead of enumerating explicitly what I don't want.
For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:
import re
s = re.sub(r"[^a-zA-Z0-9]","",s)
This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z' with an empty string".
In fact, if you insert the special character ^ at the first place of your regex, you will get the negation.
Extra tip: if you also need to lowercase the result, you can make the regex simpler (and possibly faster) by lowercasing first, since there are then no uppercase letters left to match.
import re
s = re.sub(r"[^a-z0-9]","",s.lower())
string.punctuation contains following characters:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
You can use translate and maketrans functions to map punctuations to empty values (replace)
import string
'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))
Output:
'This is A test'
s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:
>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>
The most generic approach is using the 'categories' of the unicodedata table which classifies every single character. E.g. the following code filters only printable characters based on their category:
import unicodedata
# strip of crap characters (based on the Unicode database
# categorization:
# http://www.sql-und-xml.de/unicode-database/#kategorien
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))
def filter_non_printable(s):
    result = []
    ws_last = False
    for c in s:
        c = unicodedata.category(c) in PRINTABLE and c or u'#'
        result.append(c)
    return u''.join(result).replace(u'#', u' ')
Look at the given URL above for all related categories. You also can of course filter
by the punctuation categories.
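For the original task (letters and numbers only), the same category idea can be sketched more compactly. This is an illustrative variant, not the code above: the function name is hypothetical, and it keeps every character whose general category starts with L (letter) or N (number):

```python
import unicodedata


def keep_letters_and_numbers(s):
    """Keep characters whose Unicode general category starts with 'L' or 'N'."""
    return "".join(ch for ch in s if unicodedata.category(ch)[0] in ("L", "N"))


print(keep_letters_and_numbers("Grüße, 42 мир!"))  # Grüße42мир
```

Because the test is on the major category, it covers letters from any script without listing them in a regex.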
For other languages like German, Spanish, Danish, French etc that contain special characters (like German "Umlaute" as ü, ä, ö) simply add these to the regex search string:
Example for German:
re.sub('[^A-ZÜÖÄa-z0-9]+', '', mystring)
This will remove all special characters, punctuation, and spaces from a string and only have numbers and letters.
import re
sample_str = "Hel&&lo %% Wo$#rl#d"
# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))
# using regex
op2 = re.sub("[^A-Za-z]", "", sample_str)
print(f"op2 = ", op2)
special_char_list = ["$", "@", "#", "&", "%"]
# using list comprehension
op1 = "".join([k for k in sample_str if k not in special_char_list])
print(f"op1 = ", op1)
# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print(f"op3 = ", op3)
Use translate:
import string
def clean(instr):
    return instr.translate(None, string.punctuation + ' ')
Caveat: this two-argument form of translate exists only in Python 2, and it only works on ASCII (byte) strings.
This will remove all non-alphanumeric characters except spaces.
string = "Special $#! characters spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))
Special characters spaces 888323
import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the
same as double quotes."""
# count the word "python", whether or not it is followed by ',' or '.'
text = my_string.lower().split()
for count, word in enumerate(text):
    if word.endswith((".", ",")):
        # drop the single trailing punctuation mark
        text[count] = re.sub(r"^([a-z]+)[.,]$", r"\1", word)
print("The count of Python : ", text.count("python"))
Ten years on, here is what I consider the best solution.
You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.
from clean_text import clean
string = 'Special $#! characters spaces 888323'
new = clean(string,lower=False,no_currency_symbols=True, no_punct = True,replace_with_currency_symbol='')
print(new)
Output ==> 'Special characters spaces 888323'
you can replace space if you want.
update = new.replace(' ','')
print(update)
Output ==> 'Specialcharactersspaces888323'
function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // allow : [a-zA-Z0-9, space]
  st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
  st = st.replace(/\s\s+/g, ' '); // remove multiple space
  return st;
}
console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67
import re
abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%","")
print (ddd)
and you shall see your result as
askhnlaskdjalsdk

Python: Replace string with prefixStringSuffix keeping original case, but ignoring case when searching for match

So what I'm trying to do is replace a string "keyword" with
"<b>keyword</b>"
in a larger string.
Example:
myString = "HI there. You should higher that person for the job. Hi hi."
keyword = "hi"
result I would want would be:
result = "<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>."
I will not know what the keyword is until the user types it,
and won't know the corpus (myString) until the query is run.
I found a solution that works most of the time but has some false positives;
namely, it would return "<b>hi</b>gher", which is not what I want. Also note that I
am trying to preserve the case of the original text, and the matching should take
place irrespective of case. So if the keyword is "hi", it should replace
HI with <b>HI</b> and hi with <b>hi</b>.
The closest I have come is using a slightly derived version of this:
http://code.activestate.com/recipes/576715/
but I still could not figure out how to do a second pass of the string to fix all of the false positives mentioned above.
Or using NLTK's WordPunctTokenizer (which simplifies some things like punctuation),
but I'm not sure how I would put the sentences back together, given that it does not
have a reverse function and I want to keep the original punctuation of myString. Essentially, concatenating all the tokens does not return the original
string. For example, I would not want to replace "7 - 7" with "7-7" when regrouping the tokens into the original text if the original text had "7 - 7".
Hope that was clear enough. It seems like a simple problem, but it has turned out to be a little more difficult than I thought.
This ok?
>>> import re
>>> myString = "HI there. You should higher that person for the job. Hi hi."
>>> keyword = "hi"
>>> search = re.compile(r'\b(%s)\b' % keyword, re.I)
>>> search.sub('<b>\\1</b>', myString)
'<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>.'
The key to the whole thing is using word boundaries, groups and the re.I flag.
You should be able to do this very easily with re.sub using the word boundary assertion \b, which only matches at a word boundary:
import re
def SurroundWith(text, keyword, before, after):
    regex = re.compile(r'\b%s\b' % keyword, re.IGNORECASE)
    # \g<0> refers to the whole match; a bare \0 would be read as an octal escape (NUL)
    return regex.sub(r'%s\g<0>%s' % (before, after), text)
Then you get:
>>> SurroundWith('HI there. You should hire that person for the job. '
... 'Hi hi.', 'hi', '<b>', '</b>')
'<b>HI</b> there. You should hire that person for the job. <b>Hi</b> <b>hi</b>.'
If you have more complicated criteria for what constitutes a "word boundary," you'll have to do something like:
def SurroundWith2(text, keyword, before, after):
    regex = re.compile(r'([^a-zA-Z0-9])(%s)([^a-zA-Z0-9])' % keyword,
                       re.IGNORECASE)
    return regex.sub(r'\1%s\2%s\3' % (before, after), text)
You can modify the [^a-zA-Z0-9] groups to match anything you consider a "non-word."
I think the best solution would be regular expression...
import re
def reg(keyword, myString):
    regx = re.compile(r'\b(' + keyword + r')\b', re.IGNORECASE)
    return regx.sub(r'<b>\1</b>', myString)
Of course, you must first make your keyword "regular expression safe" (quote any regex special characters, for example with re.escape).
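A minimal sketch of that escaping step, assuming re.escape is an acceptable way to quote the user's keyword (the function name highlight and its default tags are made up for illustration):

```python
import re


def highlight(keyword, text, before="<b>", after="</b>"):
    """Wrap whole-word, case-insensitive matches of keyword, keeping their case."""
    # re.escape quotes regex metacharacters in the user-supplied keyword.
    pattern = re.compile(r"\b(%s)\b" % re.escape(keyword), re.IGNORECASE)
    # A function replacement avoids backslash handling in before/after.
    return pattern.sub(lambda m: before + m.group(1) + after, text)


my_string = "HI there. You should higher that person for the job. Hi hi."
print(highlight("hi", my_string))
# <b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>.
```

One limitation to keep in mind: \b only matches next to a word character, so keywords that begin or end with punctuation may need a different boundary test.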
Here's one suggestion, from the nitpicking committee. :-)
myString = "HI there. You should higher that person for the job. Hi hi."
myString.replace('higher','hire')

Python regex: Turn "ThisFileName.txt" into "This File Name.txt"

I'm trying to add a space before every capital letter, except the first one.
Here's what I have so far, and the output I'm getting:
>>> tex = "ThisFileName.txt"
>>> re.sub('[A-Z].', ' ', tex)
' his ile ame.txt'
I want:
'This File Name.txt'
(It'd be nice if I could also get rid of .txt, but I can do that in a separate operation.)
Key concept here is backreferences in regular expressions:
import re
text = "ThisFileName.txt"
print(re.sub('([a-z])([A-Z])', r'\1 \2', text))
# Prints: "This File Name.txt"
For pulling off the '.txt' in a reliable way, I recommend os.path.splitext()
import os
filename = "ThisFileName.txt"
print(os.path.splitext(filename))
# Prints: ('ThisFileName', '.txt')
Another possible regular expression using a look behind:
(?<!^)([A-Z])
re.sub('([a-z])([A-Z])', '\\1 \\2', 'TheFileName.txt')
EDIT: StackOverflow eats some \s, when not in 'code mode'... Because I forgot to add a newline after the code above, it was not interpreted in 'code mode' :-((. Since I added that text here I didn't have to change anything and it's correct now.
It is not clear what you want to do if the filename is Hello123There.txt. So, if you want a space before all capital letters regardless of what precedes them, you can:
import re
def add_space_before_caps(text):
    "Add a space before all caps except at start of text"
    return re.sub(r"(?<!^)(?=[A-Z])", " ", text)
>>> add_space_before_caps("Hello123ThereIBM.txt")
'Hello123 There I B M.txt'
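The answers above can be combined into one step that drops the extension first and then inserts the spaces. This is a sketch under the assumption that a space is wanted only between a lowercase letter and a following capital; spaced_stem is a hypothetical name:

```python
import os
import re


def spaced_stem(filename):
    """Drop the extension, then put a space at each lowercase-to-uppercase boundary."""
    stem, _extension = os.path.splitext(filename)
    # Zero-width match: a lowercase letter behind, an uppercase letter ahead.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", stem)


print(spaced_stem("ThisFileName.txt"))  # This File Name
```

Using lookarounds instead of capture groups means nothing has to be copied back into the replacement.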
