Once again, I am very confused by a Unicode question. I can't figure out how to use unicodedata.normalize successfully to convert non-ASCII characters as expected. For instance, I want to convert the string
u"Cœur"
To
u"Coeur"
I am pretty sure that unicodedata.normalize is the way to do this, but I can't get it to work. It just leaves the string unchanged.
>>> s = u"Cœur"
>>> unicodedata.normalize('NFKD', s) == s
True
What am I doing wrong?
You could try Unidecode:
# -*- coding: utf-8 -*-
from unidecode import unidecode # $ pip install unidecode
print(unidecode(u"Cœur"))
# -> Coeur
Your problem doesn't seem to be with Python, but with the fact that the character you are trying to decompose (u'\u0153', 'œ') is not itself a composition.
Check that your code works with a string containing ordinary composed characters like "ç" and "ã":
>>> a = u"maçã"
>>> for norm in ('NFC', 'NFKC', 'NFD', 'NFKD'):
...     b = unicodedata.normalize(norm, a)
...     print b, len(b)
...
maçã 4
maçã 4
maçã 6
maçã 6
And then, if you check the Unicode reference for both characters (yours and c + cedilla), you will see that the latter has a "decomposition" specification the former lacks:
http://www.fileformat.info/info/unicode/char/153/index.htm
http://www.fileformat.info/info/unicode/char/00e7/index.htm
It looks like "œ" is not formally equivalent to "oe" (at least not for the people who defined this part of Unicode), so the way to normalize text containing it is to manually replace the character with the sequence using unicode.replace, as hacky as that sounds.
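A minimal sketch of that manual-replacement approach combined with NFKD (the asciify name and the two-entry table here are mine, just for illustration):
# -*- coding: utf-8 -*-
import unicodedata

def asciify(s):
    # Ligatures like 'œ' have no compatibility decomposition, so map them by hand.
    s = s.replace(u'\u0153', u'oe').replace(u'\u0152', u'OE')
    # NFKD splits composed characters such as 'ç' into base letter + combining mark.
    s = unicodedata.normalize('NFKD', s)
    # Drop the leftover combining marks.
    return s.encode('ascii', 'ignore')

print asciify(u"Cœur")  # -> Coeur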
As jsbueno says, some letters just don't have a compatibility decomposition.
You can use the Unicode CLDR Latin-ASCII transform to generate a mapping of manual replacements.
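If you can install PyICU, the CLDR transform can also be applied directly rather than generating a mapping by hand; a sketch, assuming the icu package (pip install PyICU) ships the Latin-ASCII transliterator:
# -*- coding: utf-8 -*-
from icu import Transliterator

# Latin-ASCII is the CLDR transform mentioned above.
tr = Transliterator.createInstance('Latin-ASCII')
print(tr.transliterate(u'Cœur'))  # -> Coeur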
Related
import re
string="b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '"
print(re.findall(r"\x[0-9a-z]{2}",string))
The list returned by the findall() function is empty :(
The problem here is that your string is the Python representation of a Python bytes object, which is pretty much useless.
Most likely, you had a bytes object, like this:
b = b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '
… and you converted it to a string, like this:
s = str(b)
Don't do that. Instead, decode it:
s = b.decode('utf-8')
That will get you the actual characters, which you can then match easily, instead of trying to match the characters in the string representation of the bytes representation and then reconstructing the actual characters laboriously from the results.
However, it's worth noting that \xe2\x80\xa6 is not an emoji, it's a horizontal ellipsis character, …. If that isn't what you wanted, you already corrupted the data before this point.
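Once the bytes are decoded, you match the real character rather than its escape sequence; a small sketch (Python 3, reusing a shortened version of your bytes):
import re

b = b'most people know this already. A\xe2\x80\xa6 '
s = b.decode('utf-8')           # '\xe2\x80\xa6' becomes the single character '…'
print(re.findall('\u2026', s))  # -> ['…']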
Not a regexp per se, but might help you out any way.
def emojis(s):
    # Keep only codepoints in the Emoticons block (U+1F600..U+1F64E).
    return [c for c in s if ord(c) in range(0x1F600, 0x1F64F)]
print(emojis("hello world 😊")) # sample usage
You need to re.compile(ur'A\xe2\x80\xa6', re.UNICODE).
Compile a Unicode regex and use that pattern for your find, findall, sub, etc.
Try this. I joined the string in your question with the one in your title to build the final search string:
import re
k = r"#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python"
print(k)
print()
p = re.findall(r"((\\x[a-z0-9]{1,}){1,})", k)
for each in p:
    print(each[0])
Output
#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python
\xe2\x80\xa6
\x60\xe2\x4b
I would like to encode an IP address in as short a string as possible using all the printable characters. According to https://en.wikipedia.org/wiki/ASCII#Printable_characters these are codes 20hex to 7Ehex.
For example:
shorten("172.45.1.33") --> "^.1 9" maybe.
In order to make decoding easy I also need the length of the encoding always to be the same. I also would like to avoid using the space character in order to make parsing easier in the future.
How can one do this?
I am looking for a solution that works in Python 2.7.x.
My attempt so far to modify Eloims's answer to work in Python 2:
First I installed the ipaddress backport for Python 2 (https://pypi.python.org/pypi/ipaddress) .
#This is needed because ipaddress expects character strings and not byte strings for textual IP address representations
from __future__ import unicode_literals
import ipaddress
import base64
#Taken from http://stackoverflow.com/a/20793663/2179021
def to_bytes(n, length, endianess='big'):
    h = '%x' % n
    s = ('0'*(len(h) % 2) + h).zfill(length*2).decode('hex')
    return s if endianess == 'big' else s[::-1]
def encode(ip):
    ip_as_integer = int(ipaddress.IPv4Address(ip))
    ip_as_bytes = to_bytes(ip_as_integer, 4, endianess="big")
    ip_base85 = base64.a85encode(ip_as_bytes)
    return ip_base85
print(encode("192.168.0.1"))
This now fails because base64 doesn't have an attribute 'a85encode'.
An IP stored in binary is 4 bytes (32 bits).
You can encode it in 5 printable ASCII characters using Base85.
Using more printable characters won't shorten the result any further: even with all 95 printable ASCII characters, 4 of them can only represent log2(95^4) ≈ 26.3 bits, which is less than the 32 bits you need, so 5 characters is the minimum.
import ipaddress
import base64
def encode(ip):
    ip_as_integer = int(ipaddress.IPv4Address(ip))
    ip_as_bytes = ip_as_integer.to_bytes(4, byteorder="big")
    ip_base85 = base64.a85encode(ip_as_bytes)
    return ip_base85
print(encode("192.168.0.1"))
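For round-tripping, here is a decoding sketch to complement the snippet above (Python 3; the decode name is mine):
import ipaddress
import base64

def decode(text):
    # a85decode reverses a85encode, giving back the 4 packed bytes.
    ip_as_bytes = base64.a85decode(text)
    return str(ipaddress.IPv4Address(ip_as_bytes))

print(decode(encode("192.168.0.1")))  # -> 192.168.0.1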
I found this question looking for a way to use base85/ascii85 on Python 2. Eventually I discovered a couple of projects available to install via PyPI. I settled on one called hackercodecs because that project is specific to encoding/decoding, whereas the others I found offered the implementation only as a byproduct.
from __future__ import unicode_literals
import ipaddress
from hackercodecs import ascii85_encode
def encode(ip):
    return ascii85_encode(ipaddress.ip_address(ip).packed)[0]
print(encode("192.168.0.1"))
https://pypi.python.org/pypi/hackercodecs
https://github.com/jdukes/hackercodecs
I am trying to replace the space between two tokens written in the Arabic alphabet with a ZWNJ but what the function returns is not decoded properly on the screen:
>>> nm.normalize("رشته ها")
'رشته\u200cها'
\u200c should be rendered as a half-space placed between 'رشته' and 'ها' here, but it gets messed up like that. I am using Python 3.3.3.
The function returned a string object with the \u200c character as part of it, but Python shows you the representation. The \uxxxx syntax makes the representation useful as a debugging value: you can copy that representation and paste it back into Python to get the exact same value.
In other words, the function worked exactly as advertised; the space was indeed replaced by a U+200C ZERO WIDTH NON-JOINER codepoint.
If you wanted to write the string to your terminal or console, use print():
print(nm.normalize("رشته ها"))
Demo:
>>> result = 'رشته\u200cها'
>>> len(result)
7
>>> result[4]
'\u200c'
>>> print(result)
رشتهها
You can see that character 5 (index 4) is a single character here, not 6 separate characters.
I am trying to split a Unicode string into words (simplistically), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read the docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?
You're actually getting the stuff you expect in the unicode case. You only think you aren't because of the weird escaping, which comes from looking at the reprs of the strings rather than printing their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
... print w # This uses the terminal encoding -- _only_ utilize interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally, if you are going to send them to the screen, a file, over the wire, etc., you need to encode them into the correct encoding manually. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know whether there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
For this simple splitting-on-whitespace approach, you might not want to use a regex at all but simply use the unicode.split method:
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, and yours was not. Using unicode strings gives you the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.
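As a concrete example of encoding explicitly at the boundary instead of relying on the terminal, a sketch using io.open (available since Python 2.6; the filename is arbitrary):
# -*- coding: utf-8 -*-
import io
import re

words = re.findall(r'(?u)\w+', u"раз два три")
# io.open encodes the unicode strings on write, so nothing depends on the terminal.
with io.open('words.txt', 'w', encoding='utf-8') as f:
    f.write(u' '.join(words))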
I'm a Python beginner, and I have a utf-8 problem.
I have a utf-8 string and I would like to replace all German umlauts with ASCII replacements (in German, the u-umlaut 'ü' may be rewritten as 'ue').
u-umlaut has unicode code point 252, so I tried this:
>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'
I expected the last string to be u'ueber'.
What I ultimately want to do is replace all u-umlauts in a file with 'ue':
import sys
import codecs
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f:
    print repr(line).replace(unichr(252), 'ue')
Thanks for your help! (I'm using Python 2.3.)
I would define a dictionary of the special characters I want to map, then use the translate method.
line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))
you will get the following result:
Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.
I think it's easiest and clearest to do this in a more straightforward way, using the unicode representation of 'ü' directly rather than unichr(252).
>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'
There's no need to use repr, as that prints the 'Python representation' of the string; you just need to present the readable string.
You will also need to include the following line at the beginning of the .py file, if it's not already present, to declare the encoding of the file:
#-*- coding: UTF-8 -*-
Added: of course, the encoding declared must match the actual encoding of the file. Please check that, as it can cause problems (I had problems with Eclipse on Windows, for example, as it writes files as cp1252 by default). It should also match the encoding of your system, which could be utf-8, latin-1, or something else.
Also, don't use str as the name of a variable, as it shadows the built-in type. You could have problems later.
(I am trying on Python 2.6, I think in Python 2.3 the result is the same)
repr(str) returns a quoted version of str that, when printed out, is something you could type back into Python to get the string back. So it's a string that literally contains \xfcber, instead of a string that contains über.
You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
If you need to get a quoted version of the result of that (though I don't believe you should), you can wrap the entire expression in repr:
repr(str.replace(unichr(252), 'ue'))
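Applied to the file loop from your question, a sketch without the stray repr (assuming your output should be UTF-8; adjust to your terminal):
import sys
import codecs

f = codecs.open(sys.argv[1], encoding='utf-8')
for line in f:
    # Replace on the unicode line itself, then encode once for output.
    # The trailing comma avoids adding a second newline.
    print line.replace(unichr(252), u'ue').encode('utf-8'),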
You can avoid all that source-file encoding stuff and its problems. Use the Unicode names; then it's screamingly obvious what you are doing, and the code can be read and modified anywhere.
I don't know of any language where the only accented Latin letter is lower-case u with umlaut (a.k.a. diaeresis), so I've added code to loop over a table of translations, under the assumption that you'll need it.
# coding: ascii
translations = (
(u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
(u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
# et cetera
)
test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
out = test
for from_str, to_str in translations:
    out = out.replace(from_str, to_str)
print out
output:
Moeller von Muenchen