regex unicode characters - python

The following regex works online but finds no matches when run in Python code:
https://regex101.com/r/lY1kY8/2
s=re.sub(r'\x.+[0-9]',' ',s)
Required result:
re.sub(r'\x.+[0-9]* ',' ',r'cats\xe2\x80\x99 faces')
Out[23]: 'cats faces'
Basically, I want to remove the Unicode special characters "\xe2\x80\x99".

As another option that doesn't require regex, you could remove the non-ASCII characters by dropping anything not listed in string.printable:
>>> import string
>>> ''.join(i for i in 'cats\xe2\x80\x99 faces' if i in string.printable)
'cats faces'
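For readers on Python 3, where str is already Unicode text, the same filter could be sketched roughly as follows. This is only an illustration and not part of the original answer: the variable names are assumptions, and the input uses the decoded character U+2019 rather than raw bytes.

```python
import string

# Assumed input: already-decoded text containing U+2019 (right single quotation mark).
text = 'cats\u2019 faces'

# Build the lookup set once; set membership is cheaper than scanning a string.
printable = set(string.printable)

# Keep only characters that appear in string.printable (ASCII only).
cleaned = ''.join(ch for ch in text if ch in printable)
print(cleaned)  # cats faces
```

Note that this drops every non-ASCII character, including legitimate accented letters.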

print re.findall(r'\\x.*?[0-9]* ',r'cats\xe2\x80\x99 faces')
Escape the backslash (\\) and use a raw string. Use findall rather than match, because match only matches at the beginning of the string.
print re.sub(ur'\\x.*?[0-9]+','',r'cats\xe2\x80\x99 faces')
Or with re.sub applied to a variable:
s=r'cats\xe2\x80\x99 faces'
print re.sub(r'\\x.+?[0-9]*','',s)
EDIT:
The correct way would be to decode from UTF-8 and then apply the regex.
s='cats\xe2\x80\x99 faces'
The byte sequence \xe2\x80\x99 is the UTF-8 encoding of U+2019 (right single quotation mark).
print re.sub(u'\u2019','',s.decode('utf-8'))
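The decode-then-substitute idea above can be sketched for Python 3 as follows. This assumes the data arrives as a bytes object; the variable names are illustrative only.

```python
import re

# Assumed input: UTF-8 encoded bytes, as they might arrive from a file or socket.
raw_bytes = b'cats\xe2\x80\x99 faces'

# Decoding turns the three bytes e2 80 99 into the single character U+2019.
text = raw_bytes.decode('utf-8')

# Now the regex sees one character instead of three separate bytes.
cleaned = re.sub('\u2019', '', text)

print(len(raw_bytes), len(text))  # 13 11
print(cleaned)                    # cats faces
```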

Assuming you use Python 2.x:
>>> s = 'cats\xe2\x80\x99 f'
>>> len(s), s[4]
(9, '\xe2')
This means that an escape like \xe2 is a single byte of length 1, not a sequence of characters starting with a literal backslash. That is why a pattern such as r'\\x.+?[0-9]*' cannot match it.
>>> s = '\x63\x61\x74\x73\xe2\x80\x99 f'
>>> ''.join([c for c in s if c <= 'z'])
'cats f'
Hope this helps a bit.
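The difference between an escape sequence and a literal backslash can be checked directly. The sketch below is written for Python 3 and uses a hypothetical, tighter pattern (\\x followed by two hex digits) purely for illustration.

```python
import re

escaped = 'cats\xe2\x80\x99 f'   # \xe2 etc. are single characters
raw = r'cats\xe2\x80\x99 f'      # raw string: literal backslash, x, e, 2, ...

# Hypothetical pattern: a literal backslash, an 'x', then two hex digits.
pattern = r'\\x[0-9a-f]{2}'

print(len(escaped), len(raw))        # 9 18
print(re.findall(pattern, escaped))  # []
print(re.findall(pattern, raw))      # ['\\xe2', '\\x80', '\\x99']
```

The pattern only finds something in the raw string, because only that string actually contains backslash characters.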

Related

Python Regular expression to remove non unicode characters

I am trying to use a Python regular expression to remove some characters that look like non-Unicode characters from a string.
Here is my code:
xxx='Juliana Gon\xe7alves Miguel'
t=re.sub('\w*','',xxx)
t
The result is like:
>>> xxx='Juliana Gon\xe7alves Miguel'
>>> t=re.sub('\w*','',xxx)
>>> t
' \xe7 '
This \xe7 is what I am trying to remove.
Does anyone have any ideas?
If the desired output is
'Juliana Gonalves Miguel'
then the following regex should do the trick.
re.sub('(?![ -~]).', '', xxx)
[ -~]: a short, readable range covering all printable ASCII characters (space through tilde)
(?!): negative lookahead
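As a rough comparison, the lookahead form can also be written as a negated character class. The sketch below assumes Python 3 and shows that both give the same result for this input; they differ only for newlines, which `.` does not match by default but `[^ -~]` does.

```python
import re

name = 'Juliana Gon\xe7alves Miguel'

# Lookahead form: any character that is not in the printable ASCII range.
with_lookahead = re.sub(r'(?![ -~]).', '', name)

# Negated class: the same idea without a lookahead.
with_negated_class = re.sub(r'[^ -~]', '', name)

print(with_lookahead)      # Juliana Gonalves Miguel
print(with_negated_class)  # Juliana Gonalves Miguel
```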

Why can’t I get rid of the L with this python regular expression?

I’m trying to get rid of the Ls at the ends of integers with a regular expression in python:
import re
s = '3535L sadf ddsf df 23L 2323L'
s = re.sub(r'\w(\d+)L\w', '\1', s)
However, this regex doesn't even change the string. I've also tried s = re.sub(r'\w\d+(L)\w', '', s) since I thought that maybe the L could be captured and deleted, but that didn't work either.
I'm not sure what you're trying to do with those \ws in the first place, but to match a string of digits followed by an L, just use \d+L, and to remove the L you just need to put the \d+ part in a capture group so you can sub it for the whole thing:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> re.sub(r'(\d+)L', r'\1', s)
'3535 sadf ddsf df 23 2323'
Of course this will also convert, e.g., 123LBQ into 123BQ, but I don't see anything in your examples or in your description of the problem that indicates that this is possible, or which possible result you want for that, so…
\w = [a-zA-Z0-9_]
In other words, \w does not include whitespace characters. Each L is at the end of the word and therefore doesn't have any "word characters" following it. Perhaps you were looking for word boundaries?
re.sub(r'\b(\d+)L\b', r'\1', s)
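Two details are easy to miss here: the replacement must be a raw string (otherwise '\1' is the control character with code 1, not a backreference), and word boundaries change what happens to inputs like 123LBQ. A minimal sketch, with an extra token added to the sample string as an assumption:

```python
import re

# The original sample plus one assumed edge case, '123LBQ'.
s = '3535L sadf ddsf df 23L 2323L 123LBQ'

# Without a raw string, '\1' is chr(1), not a reference to group 1.
print('\1' == '\x01')  # True

no_boundary = re.sub(r'(\d+)L', r'\1', s)
with_boundary = re.sub(r'\b(\d+)L\b', r'\1', s)

print(no_boundary)    # 3535 sadf ddsf df 23 2323 123BQ
print(with_boundary)  # 3535 sadf ddsf df 23 2323 123LBQ
```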
You can use a lookbehind assertion:
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> s = re.sub(r'(?<=\d)L\b', '', s)
>>> s
'3535 sadf ddsf df 23 2323'
(?<=\d)L asserts that the L is preceded by a digit, in which case it is replaced with the empty string ''.
Try this:
re.sub(r'(?<=\d)L', '', s)
This uses a lookbehind to match an "L" that is preceded by a digit.
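A lookbehind is zero-width: it checks for the digit without making it part of the match. The sketch below contrasts it with a pattern that consumes the digit, which then gets deleted along with the L. The variable names are illustrative.

```python
import re

s = '3535L sadf ddsf df 23L 2323L'

# Lookbehind: only the L is matched and removed.
zero_width = re.sub(r'(?<=\d)L\b', '', s)

# Consuming \d: the last digit is part of the match and is removed too.
consuming = re.sub(r'\dL\b', '', s)

print(zero_width)  # 3535 sadf ddsf df 23 2323
print(consuming)   # 353 sadf ddsf df 2 232
```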
Why not use a - IMO more readable - generator expression?
>>> s = '3535L sadf ddsf df 23L 2323L'
>>> ' '.join(x.rstrip('L') if x[-1:] =='L' and x[:-1].isdigit() else x for x in s.split())
'3535 sadf ddsf df 23 2323'

Python converting string to latex using regular expression

Say I have a string
string = "{1/100}"
I want to use regular expressions in Python to convert it into
new_string = "\frac{1}{100}"
I think I would need to use something like this
new_string = re.sub(r'{.+/.+}', r'', string)
But I'm stuck on what I would put in order to preserve the characters in the fraction, in this example 1 and 100.
You can use () to capture the numbers. Then use \1 and \2 to refer to them:
new_string = re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', string)
# \frac{1}{100}
Note: Don't forget to escape the backslash \\.
Capture the numbers using parens and then reference them in the replacement text using \1 and \2. For example:
>>> print re.sub(r'{(.+)/(.+)}', r'\\frac{\1}{\2}', "{1/100}")
\frac{1}{100}
Anything inside the braces will be of the form number/number, so use digits ([0-9]) in the regex instead of a dot (.):
>>> import re
>>> string = "{1/100}"
>>> new = re.sub(r'{([0-9]+)/([0-9]+)}', r'\\frac{\1}{\2}', string)
>>> print new
\frac{1}{100}
Use re.match. It's more flexible:
>>> m = re.match(r'{(.+)/(.+)}', string)
>>> m.groups()
('1', '100')
>>> new_string = "\\frac{%s}{%s}"%m.groups()
>>> print new_string
\frac{1}{100}
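When a string may contain several fractions, the digit-only pattern is the safer choice, because a greedy .+ can run across more than one pair of braces. A minimal Python 3 sketch, where to_latex is a hypothetical helper name and the braces are escaped explicitly as a precaution:

```python
import re

# Digits only, with the literal braces escaped.
FRACTION = re.compile(r'\{([0-9]+)/([0-9]+)\}')

def to_latex(text):
    """Replace every {a/b} in text with \\frac{a}{b} (hypothetical helper)."""
    # r'\\frac' yields a single literal backslash in the output.
    return FRACTION.sub(r'\\frac{\1}{\2}', text)

result = to_latex('{1/100} and {2/3}')
print(result)  # \frac{1}{100} and \frac{2}{3}
```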

replace all "\" with "\\" python

Does anyone know how to replace all \ with \\ in Python?
I've tried:
re.sub('\','\\',string)
But it breaks because of the escape sequence.
Does anyone know the answer to my question?
You just need to escape the backslashes in your strings: (also there's no need for regex stuff)
>>> s = "cats \\ dogs"
>>> print s
cats \ dogs
>>> print s.replace("\\", "\\\\")
cats \\ dogs
You should do:
re.sub(r'\\', r'\\\\', string)
since r'\' is not a valid string literal.
BTW, you should always use raw (r'') strings with regex, as many things are done with backslashes.
You should escape backslashes, and also you don't need regex for this simple operation:
>>> my_string = r"asd\asd\asd\\"
>>> print(my_string)
asd\asd\asd\\
>>> replaced = my_string.replace("\\", "\\\\")
>>> print(replaced)
asd\\asd\\asd\\\\
You need either re.sub("\\\\","\\\\\\\\",string) or re.sub(r'\\', r'\\\\', string), because each backslash has to be escaped twice: once for the string literal and once for the regex.
>>> whatever = r'z\w\r'
>>> print whatever
z\w\r
>>> print re.sub(r"\\",r"\\\\", whatever)
z\\w\\r
>>> print re.sub("\\\\","\\\\\\\\",whatever)
z\\w\\r
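The approaches above can be compared side by side. The sketch below assumes Python 3 and adds one variant not mentioned in the answers: passing a function as the replacement, which skips the replacement-template escaping entirely.

```python
import re

text = r'z\w\r'  # contains two single backslashes

# Plain string replacement: each literal is one level of escaping.
via_replace = text.replace('\\', '\\\\')

# Regex replacement: pattern and template each need their own escaping.
via_regex = re.sub(r'\\', r'\\\\', text)

# Callable replacement: the returned string is used as-is, no template parsing.
via_callable = re.sub(r'\\', lambda m: '\\\\', text)

print(via_replace)  # z\\w\\r
```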

Python remove anything that is not a letter or number

I'm having a little trouble with Python regular expressions.
What is a good way to remove all characters in a string that are not letters or numbers?
Thanks!
[\w] matches (alphanumeric or underscore).
[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore)
You need [\W_] to remove ALL non-alphanumerics.
When using re.sub(), it will be much more efficient if you reduce the number of substitutions (expensive) by matching using [\W_]+ instead of doing it one at a time.
Now all you need is to define alphanumerics:
str object, only ASCII A-Za-z0-9:
re.sub(r'[\W_]+', '', s)
str object, only locale-defined alphanumerics:
re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
unicode object, all alphanumerics:
re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
Examples for str object:
>>> import re, locale
>>> sall = ''.join(chr(i) for i in xrange(256))
>>> len(sall)
256
>>> re.sub('[\W_]+', '', sall)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
>>> locale.setlocale(locale.LC_ALL, '')
'English_Australia.1252'
>>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
# above output wrapped at column 80
Unicode example:
>>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
u'abAZ\xff\u0404'
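The examples above are for Python 2. As far as Python 3 is concerned, str patterns are Unicode-aware by default, and the re.ASCII flag restricts \w and \W to ASCII. A minimal sketch under that assumption, with an illustrative input string:

```python
import re

text = 'a_b A_Z \xff \u0404 caf\xe9!'

# Default for str patterns in Python 3: Unicode letters and digits are kept.
unicode_aware = re.sub(r'[\W_]+', '', text)

# re.ASCII: only A-Z, a-z and 0-9 survive.
ascii_only = re.sub(r'[\W_]+', '', text, flags=re.ASCII)

print(ascii_only)  # abAZcaf
```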
In the char set matching rule [...] you can specify ^ as first char to mean "not in"
import re
re.sub("[^0-9a-zA-Z]", # Anything except 0..9, a..z and A..Z
"", # replaced with nothing
"this is a test!!") # in this string
--> 'thisisatest'
'\W' is the same as [^A-Za-z0-9_]; with the re.LOCALE flag, accented characters from your locale also count as word characters and are not matched by \W.
>>> re.sub('\W', '', 'text 1, 2, 3...')
'text123'
Maybe you want to keep the spaces or have all the words (and numbers):
>>> re.findall('\w+', 'my. text, --without-- (punctuation) 123')
['my', 'text', 'without', 'punctuation', '123']
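If the goal is to keep the words but drop the punctuation, the findall result can simply be joined back together. A small sketch, assuming single spaces between words are acceptable:

```python
import re

sentence = 'my. text, --without-- (punctuation) 123'

# Extract runs of word characters, then rebuild the string with single spaces.
words = re.findall(r'\w+', sentence)
cleaned = ' '.join(words)

print(cleaned)  # my text without punctuation 123
```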
You can also try the isalpha and isnumeric methods in the following way (in Python 2, isnumeric exists only on unicode strings, hence the u prefix):
text = u'base, sample test;'
getVals = lambda word: (c for c in word if c.isalpha() or c.isnumeric())
map(lambda word: ''.join(getVals(word)), text.split(' '))
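A more compact variant of the same idea, sketched for Python 3, uses str.isalnum and a list comprehension. Note that isalnum is Unicode-aware, so accented letters would be kept.

```python
text = 'base, sample test;'

# Filter each space-separated word down to its letters and digits.
cleaned_words = [''.join(c for c in word if c.isalnum())
                 for word in text.split(' ')]

print(cleaned_words)  # ['base', 'sample', 'test']
```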
There are other ways you may also consider, e.g. simply loop through the string and skip unwanted characters. For example, assuming you want to delete all ASCII characters that are not letters or digits:
>>> import string
>>> newstring = [c for c in "a!1#b$2c%3\t\nx" if c in string.letters + string.digits]
>>> "".join(newstring)
'a1b2c3x'
Or use string.translate to map one character to another or to delete some characters, e.g.:
>>> todelete = [ chr(i) for i in range(256) if chr(i) not in string.letters + string.digits ]
>>> todelete = "".join(todelete)
>>> "a!1#b$2c%3\t\nx".translate(None, todelete)
'a1b2c3x'
This way you only need to compute todelete once (or hard-code it) and can reuse it wherever you need to convert a string.
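The translate call above uses the Python 2 signature. In Python 3, str.translate takes a single mapping, and string.letters is replaced by string.ascii_letters. A minimal sketch of the equivalent, with illustrative names:

```python
import string

# Characters to keep: ASCII letters and digits.
keep = set(string.ascii_letters + string.digits)

# Map every other code point below 256 to None, which deletes it.
delete_table = {code: None for code in range(256) if chr(code) not in keep}

cleaned = 'a!1#b$2c%3\t\nx'.translate(delete_table)
print(cleaned)  # a1b2c3x
```

As in the original, the table is built once and can be reused for every string.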
You can use a predefined character class in Python regexes: \W corresponds to the set [^a-zA-Z0-9_]. Then:
import re
s = 'Hello dutrow 123'
re.sub('\W', '', s)
--> 'Hellodutrow123'
You need to be more specific:
What about Unicode "letters", i.e. those with diacritics?
What about white space? (I assume this is what you DO want to delete along with punctuation)
When you say "letters" do you mean A-Z and a-z in ASCII only?
When you say "numbers" do you mean 0-9 only? What about decimals, separators and exponents?
It gets complex quickly...
A great place to start is an interactive regex site, such as RegExr.
You can also use a Python-specific tool such as Python Regex Tool.
