Python regular expression to remove non-Unicode characters

I am trying to use a Python regular expression to remove some characters that look like non-Unicode (non-ASCII) characters from a string.
Here is my code:
xxx='Juliana Gon\xe7alves Miguel'
t=re.sub('\w*','',xxx)
t
The result is:
>>> xxx='Juliana Gon\xe7alves Miguel'
>>> t=re.sub('\w*','',xxx)
>>> t
' \xe7 '
This \xe7 is what I am trying to remove.
Does anyone have any ideas?

If the desired output is
'Juliana Gonalves Miguel'
then the following regex should do the trick.
re.sub('(?![ -~]).', '', xxx)
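[ -~]: a short, readable character class covering the printable ASCII characters (space through tilde)
(?!...): negative lookahead, so the following . only matches a character that is not printable ASCII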
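A minimal sketch of the approach above, assuming Python 3 (where str is already Unicode). The variable names are illustrative only. The negated class [^ -~] is an alternative spelling; unlike the lookahead form it would also remove newlines, because . does not match a newline by default.

```python
import re

xxx = 'Juliana Gon\xe7alves Miguel'

# Lookahead form from the answer: consume any character that is
# NOT in the printable ASCII range (space .. tilde).
via_lookahead = re.sub(r'(?![ -~]).', '', xxx)

# Negated character class: same result for this input.
via_negated_class = re.sub(r'[^ -~]', '', xxx)

print(via_lookahead)      # Juliana Gonalves Miguel
print(via_negated_class)  # Juliana Gonalves Miguel
```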

Related

Python regex: Can I edit the set of numerical characters (\d)?

I would like to write a regex to insert a space between all non-numerical sequences and numerical sequences. For example:
"12Dollars" -> "12 Dollars"
My current solution looks like this:
string = re.sub(r'([\d]) *([^\d\W])', r'\1 \2', string)
However, now I realize I need to also account for the character "½", which is not considered a numerical character by regex.
Can I add ½ to \d somehow? (I figured that may be the cleanest way to handle this...)
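Unfortunately, \d in Python 3.x, although Unicode-aware by default, does not match every character in the \p{N} category. It only matches what \p{Nd} (decimal digits) matches, so \p{Nl} (letter numbers) and \p{No} (other numbers) must be added manually if you need to match them. Python's built-in re module does not support the \p{...} syntax itself, so the categories have to be spelled out as character ranges.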
\p{Nl} can be defined as
[\u16EE-\u16F0\u2160-\u2182\u2185-\u2188\u3007\u3021-\u3029\u3038-\u303A\uA6E6-\uA6EF\U00010140-\U00010174\U00010341\U0001034A\U000103D1-\U000103D5\U00012400-\U0001246E]
\p{No} can be defined as
[\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A48\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00010F1D-\U00010F26\U00010F51-\U00010F54\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00011FC0-\U00011FD4\U00016B5B-\U00016B61\U00016E80-\U00016E96\U0001D2E0-\U0001D2F3\U0001D360-\U0001D378\U0001E8C7-\U0001E8CF\U0001EC71-\U0001ECAB\U0001ECAD-\U0001ECAF\U0001ECB1-\U0001ECB4\U0001ED01-\U0001ED2D\U0001ED2F-\U0001ED3D\U0001F100-\U0001F10C]
To match ASCII digits together with \p{Nl} and \p{No}, use
[0-9\u16EE-\u16F0\u2160-\u2182\u2185-\u2188\u3007\u3021-\u3029\u3038-\u303A\uA6E6-\uA6EF\U00010140-\U00010174\U00010341\U0001034A\U000103D1-\U000103D5\U00012400-\U0001246E\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A48\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00010F1D-\U00010F26\U00010F51-\U00010F54\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00011FC0-\U00011FD4\U00016B5B-\U00016B61\U00016E80-\U00016E96\U0001D2E0-\U0001D2F3\U0001D360-\U0001D378\U0001E8C7-\U0001E8CF\U0001EC71-\U0001ECAB\U0001ECAD-\U0001ECAF\U0001ECB1-\U0001ECB4\U0001ED01-\U0001ED2D\U0001ED2F-\U0001ED3D\U0001F100-\U0001F10C]
Your code fix:
import re
s = "ↂ12½Dollars"
pN = r'0-9\u16EE-\u16F0\u2160-\u2182\u2185-\u2188\u3007\u3021-\u3029\u3038-\u303A\uA6E6-\uA6EF\U00010140-\U00010174\U00010341\U0001034A\U000103D1-\U000103D5\U00012400-\U0001246E\u00B2\u00B3\u00B9\u00BC-\u00BE\u09F4-\u09F9\u0B72-\u0B77\u0BF0-\u0BF2\u0C78-\u0C7E\u0D58-\u0D5E\u0D70-\u0D78\u0F2A-\u0F33\u1369-\u137C\u17F0-\u17F9\u19DA\u2070\u2074-\u2079\u2080-\u2089\u2150-\u215F\u2189\u2460-\u249B\u24EA-\u24FF\u2776-\u2793\u2CFD\u3192-\u3195\u3220-\u3229\u3248-\u324F\u3251-\u325F\u3280-\u3289\u32B1-\u32BF\uA830-\uA835\U00010107-\U00010133\U00010175-\U00010178\U0001018A\U0001018B\U000102E1-\U000102FB\U00010320-\U00010323\U00010858-\U0001085F\U00010879-\U0001087F\U000108A7-\U000108AF\U000108FB-\U000108FF\U00010916-\U0001091B\U000109BC\U000109BD\U000109C0-\U000109CF\U000109D2-\U000109FF\U00010A40-\U00010A48\U00010A7D\U00010A7E\U00010A9D-\U00010A9F\U00010AEB-\U00010AEF\U00010B58-\U00010B5F\U00010B78-\U00010B7F\U00010BA9-\U00010BAF\U00010CFA-\U00010CFF\U00010E60-\U00010E7E\U00010F1D-\U00010F26\U00010F51-\U00010F54\U00011052-\U00011065\U000111E1-\U000111F4\U0001173A\U0001173B\U000118EA-\U000118F2\U00011C5A-\U00011C6C\U00011FC0-\U00011FD4\U00016B5B-\U00016B61\U00016E80-\U00016E96\U0001D2E0-\U0001D2F3\U0001D360-\U0001D378\U0001E8C7-\U0001E8CF\U0001EC71-\U0001ECAB\U0001ECAD-\U0001ECAF\U0001ECB1-\U0001ECB4\U0001ED01-\U0001ED2D\U0001ED2F-\U0001ED3D\U0001F100-\U0001F10C'
rx = r'([{0}]) *([^{0}\W])'.format(pN)
s = re.sub(rx, r'\1 \2', s)
print(s) # => ↂ12½ Dollars
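As an alternative to spelling out the ranges by hand, the standard unicodedata module can classify a character at run time. The following is only a sketch: the helper names is_number and space_after_numbers are made up for illustration, and it assumes that every category whose name starts with N (Nd, Nl, No) should count as numeric.

```python
import re
import unicodedata

def is_number(ch):
    """True for characters in the Unicode categories Nd, Nl or No."""
    return unicodedata.category(ch).startswith('N')

def space_after_numbers(text):
    """Put one space between a numeric character and a following
    non-numeric word character (hypothetical helper)."""
    def repl(m):
        prev = m.group(1)
        # The lookahead guarantees that a character follows the match.
        nxt = m.string[m.end()]
        if is_number(prev) and not is_number(nxt):
            return prev + ' '
        return m.group(0)  # leave everything else untouched
    return re.sub(r'(\w) *(?=\w)', repl, text)

print(space_after_numbers("ↂ12½Dollars"))  # ↂ12½ Dollars
```

The trade-off is one Python-level callback per word character, so the precomputed character class is likely the faster choice on large inputs.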
In a regex, square brackets [ ] match any one of the characters inside them, so you can simply add ½ to both character classes:
string = re.sub(r'([\d½]) *([^\d½\W])', r'\1 \2', string)
If your old regex did what you wanted apart from ½, this version will handle ½ as well.
Hope this helps.
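A quick check of that pattern on two sample inputs, as a sketch; the function name add_space is an assumption made for illustration.

```python
import re

def add_space(string):
    # ½ is listed next to \d in both classes: once as a "number"
    # and once as something the following character must not be.
    return re.sub(r'([\d½]) *([^\d½\W])', r'\1 \2', string)

print(add_space("12Dollars"))   # 12 Dollars
print(add_space("12½Dollars"))  # 12½ Dollars
```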

regex unicode characters

The following regex works online but does not work in my Python code, where it shows no matches:
https://regex101.com/r/lY1kY8/2
s=re.sub(r'\x.+[0-9]',' ',s)
Required result:
re.sub(r'\x.+[0-9]* ',' ',r'cats\xe2\x80\x99 faces')
Out[23]: 'cats faces'
Basically, I want to remove the special characters "\xe2\x80\x99".
As another option that doesn't require a regex, you could remove the non-ASCII characters by keeping only the characters listed in string.printable:
>>> import string
>>> ''.join(i for i in 'cats\xe2\x80\x99 faces' if i in string.printable)
'cats faces'
print re.findall(r'\\x.*?[0-9]* ',r'cats\xe2\x80\x99 faces')
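Escape the backslash (\\x) and use a raw string so that the pattern matches a literal \x. Use findall, because match only matches at the beginning of the string.
print re.sub(ur'\\x.*?[0-9]+','',r'cats\xe2\x80\x99 faces')
The same with re.sub on a variable: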
s=r'cats\xe2\x80\x99 faces'
print re.sub(r'\\x.+?[0-9]*','',s)
EDIT:
The correct way would be to decode the byte string from UTF-8 and then apply the regex. The bytes \xe2\x80\x99 are the UTF-8 encoding of U+2019 (RIGHT SINGLE QUOTATION MARK):
s='cats\xe2\x80\x99 faces'
print re.sub(u'\u2019','',s.decode('utf-8'))
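The snippet above is Python 2. A rough Python 3 equivalent, assuming the data arrives as a bytes object (the names raw, text and cleaned are illustrative), might look like this:

```python
import re

raw = b'cats\xe2\x80\x99 faces'   # UTF-8 encoded bytes
text = raw.decode('utf-8')        # three bytes become one character, U+2019

# Remove the right single quotation mark once it is a real character.
cleaned = re.sub('\u2019', '', text)

print(len(raw), len(text))  # 13 11
print(cleaned)              # cats faces
```

Decoding first means the regex works on characters rather than on the individual bytes of an encoded character.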
Assuming you are using Python 2.x:
>>> s = 'cats\xe2\x80\x99 f'
>>> len(s), s[4]
(9, 'â')
This means that an escape like \xe2 is a single character of length 1 in the string, not a sequence of characters starting with a literal backslash. That is why a pattern such as r'\\x.+?[0-9]*' cannot match it.
>>> s = '\x63\x61\x74\x73\xe2\x80\x99 f'
>>> ''.join([c for c in s if c <= 'z'])
'cats f'
Hope this helps a bit.

Replace String in python with matched pattern

I have to remove any punctuation marks from the start and the end of a word.
I am using re.sub to do it.
re.sub(r'(\w.+)(?=[^\w]$)','\1',text)
The grouping is not working: all I get on the command line is ☺. for the input Mihir4.
If you have a string with multiple words, such as
text = ".adfdf. 'df' !3423? ld! :sdsd"
this will do the trick (it will also work for single words, of course):
>>> re.sub(r'[^\w\s]*(\w+)[^\w\s]*', r'\1', text)
'adfdf df 3423 ld sdsd'
Notice the r in r'\1'. This is equivalent to '\\1'.
>>> re.sub(r'[^\w\s]*(\w+)[^\w\s]*', '\\1', text)
'adfdf df 3423 ld sdsd'
Further reading: the backslash plague
The string literal '\1' is equivalent to '\x01'. You need to escape the backslash or use a raw string literal to refer to backreference group 1.
BTW, you don't need to use the capturing group.
>>> re.sub(r'^[^-\w]+|[^-\w]+$', '', 'Mihir4.')
'Mihir4'
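The points above can be put together in a short sketch. The function name strip_punct is hypothetical, and the str.strip variant is a non-regex alternative that was not part of the answers; it only treats the ASCII characters in string.punctuation as punctuation.

```python
import re
import string

def strip_punct(text):
    # Drop punctuation directly before or after each run of word
    # characters; r'\1' is the backreference to that run.
    return re.sub(r'[^\w\s]*(\w+)[^\w\s]*', r'\1', text)

# Without the r prefix, '\1' is the single control character \x01.
print('\1' == '\x01')   # True
print(r'\1' == '\\1')   # True

print(strip_punct(".adfdf. 'df' !3423? ld! :sdsd"))  # adfdf df 3423 ld sdsd

# Non-regex alternative for a single word.
print('Mihir4.'.strip(string.punctuation))           # Mihir4
```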

Why does my python regex not work?

I want to replace every run of a character that is repeated more than once. I used Python's re.sub with the regex data=re.sub('(.)\1+','##',data), but nothing happened...
Here is my Text:
Text
※※※※※※※※※※※※※※※※※Chapter One※※※※※※※※※※※※※※※※※※
This is the begining...
You need to use a raw string here. In a normal string literal, \1 is interpreted as an octal escape, so the character with that code point (here \x01) ends up in the string instead of a backslash followed by 1.
>>> '\1'
'\x01'
>>> chr(0o1)
'\x01'
>>> '\101'
'A'
>>> chr(0o101)
'A'
Use raw string to fix this:
>>> '(.)\1+'
'(.)\x01+'
>>> r'(.)\1+' #Note the `r`
'(.)\\1+'
Use a raw string, so the regex engine interprets backslashes instead of the Python parser. Just put an r in front of the string:
data=re.sub(r'(.)\1+', '##', data)
            ^ this r is the important bit
Otherwise, \1 is interpreted as character value 1 instead of a backreference.
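A small sketch showing the difference on a shortened version of the sample text; the variable names are assumptions made for illustration.

```python
import re

data = '※※※※Chapter One※※※※'

# Non-raw literal: the pattern really is '(.)\x01+', which looks for a
# character followed by control characters, so nothing is replaced.
broken = re.sub('(.)\1+', '##', data)

# Raw literal: \1 reaches the regex engine as a backreference.
fixed = re.sub(r'(.)\1+', '##', data)

print(broken == data)  # True
print(fixed)           # ##Chapter One##
```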

regular expression to parse option string in python

I can't seem to create the correct regular expression to extract the correct tokens from my string. Padding the beginning of the string with a space generates the correct output, but seems less than optimal:
>>> import re
>>> s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'
>>> re.findall(r'\W(-[\w_]+)',' '+s)
['-edge_0triggered', '-level_Sensitive'] # correct output
Here are some of the regular expressions I've tried. Does anyone have a regex suggestion that doesn't involve changing the original string and still generates the correct output?
>>> re.findall(r'(-[\w_]+)',s)
['-edge_0triggered', '-b', '-level_Sensitive', '-d', '-b', '-c']
>>> re.findall(r'\W(-[\w_]+)',s)
['-level_Sensitive']
r'(?:^|\W)(-\w+)'
\w already includes the underscore, so [\w_] can be written as \w.
Change the first qualifier to accept either a beginning anchor or a not-word, instead of only a not-word:
>>> re.findall(r'(?:^|\W)(-[\w_]+)', s)
['-edge_0triggered', '-level_Sensitive']
The ?: at the beginning of the group simply tells the regex engine to not treat that as a group for purposes of results.
You could use a negative lookbehind:
re.findall(r'(?<!\w)(-\w+)', s)
The (?<!\w) part means "match only if not preceded by a word character".
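For comparison, here is a sketch that runs both suggested patterns against the string from the question; the names anchored and lookbehind are illustrative only.

```python
import re

s = '-edge_0triggered a-b | -level_Sensitive c-d | a-b-c'

# Start-of-string anchor or a non-word character, kept out of the
# results by the non-capturing group (?:...).
anchored = re.findall(r'(?:^|\W)(-\w+)', s)

# Negative lookbehind: the hyphen must not follow a word character.
lookbehind = re.findall(r'(?<!\w)(-\w+)', s)

print(anchored)    # ['-edge_0triggered', '-level_Sensitive']
print(lookbehind)  # ['-edge_0triggered', '-level_Sensitive']
```

The lookbehind version does not consume the preceding character, while the anchored version does; for this input both give the same tokens.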
