Iterate over utf-8 characters in python - python

I am using python 3.6 to read a file encoded in utf-8, in Spanish (thus, including letter ñ). I open the file with the utf-8 codec, and it loads correctly: while debugging, I can see ñ in the loaded text.
However, when I iterate over characters, ñ is read as two characters, n and ~. Concretely, when I run:
for c in text:
hexc = int(hex(ord(c)), 16)
if U_LETTERS[lang][0] <= hexc <= U_LETTERS[lang][1] \
or hexc in U_LETTERS[lang][2:] \
or hexc == U_SPACE:
filtered_text+=c
and text includes an ñ, the variable c takes it as an n (and therefore, hexc is 110 instead of 241), and then it takes ~ (and hexc is 771). I guess there is an internal conversion to an 8 bit char when iterating in this way. What is the proper way to do this?
Thanks in advance.

This has to do with Unicode normalisation. The letter "ñ" can be expressed either with a single character with the codepoint 0xF1 (241), or with the two character "n" and a combining character for the superposed tilde, ie. the codepoints 0x6E and 0x0303 (110 and 771).
These two ways of expressing the letter are considered equivalent; however, they are not the same in string comparison.
Python provides functionality to convert from one form to the other by means of the unicodedata module.
The first form is called composed (NFC), the second one decomposed (NFD) normalised form.
An example explains it the easiest way:
>>> import unicodedata
>>> '\xf1'
'ñ'
>>> [ord(c) for c in '\xf1']
[241]
>>> [ord(c) for c in unicodedata.normalize('NFD', '\xf1')]
[110, 771]
>>> [ord(c) for c in unicodedata.normalize('NFC', 'n\u0303')]
[241]
>>>
So, to solve your problem, convert all of the text to the desired normalisation form before any further processing.
Note: Unicode normalisation is a problem separate from encoding. You can have this with UTF16 or UTF32 just as well. In the decomposed form, you actually have two (or more) characters (each of which might be represented with multiple bytes, depending on the encoding). It's up the displaying device (the terminal emulator, an editor...) to show this as a single letter with marks above/below the base character.

Related

python special letters, fix string [duplicate]

I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any possible way in Python to have a character like ë́ be represented as 1?
I'm using UTF-8 encoding for the actual code and web page it is being outputted to.
edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.
UTF-8 is an unicode encoding which uses more than one byte for special characters. If you don't want the length of the encoded string, simple decode it and use len() on the unicode object (and not the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified on the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to an unicode object by decoding() it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in an unicode object like you would do in a str object (they are both inheriting from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
If you develop localized applications, it's generally a good idea to use only unicode-objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
PS: Please note, that the str and unicode datatype have changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings which can't be mixed anymore. That should help to avoid common pitfalls with unicode handling...
Regards,
Christoph
The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Yes. That's how code points are defined by Unicode. In general, you can ask Python to convert a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT using Unicode normalisation:
>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1' # single character: á
However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.
To work out where the ‘editable points’ in a string of Unicode code points are is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text and contextual glpyh shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).
If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:
def withoutcombining(s):
return ''.join(c for c in s if unicodedata.combining(c)==0)
>>> withoutcombining(u'ë́aúlt')
'\xeba\xfalt' # ëaúlt
>>> len(_)
5
The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.
Don't forget to use unicode and unicode literals in your code.
which Python version are you using?
Python 3.1 doesn't have this issue.
>>> print(len("ë́aúlt"))
6
Regards
Djoudi
You said: I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.
"Exactly what is in your data": use the repr() built-in function (for lots more things apart from unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.
You can use the unicodedata.name() function to tell you what each component is.
Here's an example:
# coding: utf8
import unicodedata
x = u"ë́aúlt"
print(repr(x))
for c in x:
try:
name = unicodedata.name(c)
except:
name = "<no name>"
print "U+%04X" % ord(c), repr(c), name
Results:
u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T
Now read #bobince's answer :-)

How many displayable characters in a unicode string (Japanese / Chinese)

I'd need to know how many displayable characters are in a unicode string containing japanese / chinese characters.
Sample code to make the question very obvious :
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<<
note that four characters are displayed
How can i know, from the string, that 4 characters are going to be displayed ?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
Is an encoded representation of unicode code points. It contain bytes, len(str) returns you amount of bytes.
You want to know, how many unicode codes contains the string. For that, you need to know, what encoding was used to encode those unicode codes. The most popular encoding is utf8. In utf8 encoding, one unicode code point can take from 1 to 6 bytes. But you must not remember that, just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 unicode points.
Print it, to see printable version:
>>> print str.decode('utf8')
睡眠時間
And get amount of unicode codes:
>>> len(str.decode('utf8'))
4
UPDATE: Look also at abarnert answer to respect all possible cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code unit. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode didn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
def displayable(c):
return unicodedata.category(c).startswith('C')
p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
return unicodedata.category(c) in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a nonspacing combining mark followed by a letter count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NKFC or NKFD) is enough to solve that—once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u is c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.

Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode strings using the escape char \?:
Executing a module with:
# -*- coding: utf-8 -*-
a = 'á'
ua = u'á'
print a, ua
Results in: á, á
More testing using Python shell:
>>> a = 'á'
>>> a
'\xc3\xa1'
>>> ua = u'á'
>>> ua
u'\xe1'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> ua
u'\xe1'
So, the unicode string seems to be encoded using latin1 instead of utf-8 and the raw string is encoded using utf-8? I'm even more confused now! :S
unicode is meant to handle text. Text is a sequence of code points which may be bigger than a single byte. Text can be encoded in a specific encoding to represent the text as raw bytes(e.g. utf-8, latin-1...).
Note that unicode is not encoded! The internal representation used by python is an implementation detail, and you shouldn't care about it as long as it is able to represent the code points you want.
On the contrary str in Python 2 is a plain sequence of bytes. It does not represent text!
You can think of unicode as a general representation of some text, which can be encoded in many different ways into a sequence of binary data represented via str.
Note: In Python 3, unicode was renamed to str and there is a new bytes type for a plain sequence of bytes.
Some differences that you can see:
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
Note that using str you have a lower-level control on the single bytes of a specific encoding representation, while using unicode you can only control at the code-point level. For example you can do:
>>> 'àèìòù'
'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
>>> print 'àèìòù'.replace('\xa8', '')
à�ìòù
What before was valid UTF-8, isn't anymore. Using a unicode string you cannot operate in such a way that the resulting string isn't valid unicode text.
You can remove a code point, replace a code point with a different code point etc. but you cannot mess with the internal representation.
Unicode and encodings are completely different, unrelated things.
Unicode
Assigns a numeric ID to each character:
0x41 → A
0xE1 → á
0x414 → Д
So, Unicode assigns the number 0x41 to A, 0xE1 to á, and 0x414 to Д.
Even the little arrow → I used has its Unicode number, it's 0x2192. And even emojis have their Unicode numbers, 😂 is 0x1F602.
You can look up the Unicode numbers of all characters in this table. In particular, you can find the first three characters above here, the arrow here, and the emoji here.
These numbers assigned to all characters by Unicode are called code points.
The purpose of all this is to provide a means to unambiguously refer to a each character. For example, if I'm talking about 😂, instead of saying "you know, this laughing emoji with tears", I can just say, Unicode code point 0x1F602. Easier, right?
Note that Unicode code points are usually formatted with a leading U+, then the hexadecimal numeric value padded to at least 4 digits. So, the above examples would be U+0041, U+00E1, U+0414, U+2192, U+1F602.
Unicode code points range from U+0000 to U+10FFFF. That is 1,114,112 numbers. 2048 of these numbers are used for surrogates, thus, there remain 1,112,064. This means, Unicode can assign a unique ID (code point) to 1,112,064 distinct characters. Not all of these code points are assigned to a character yet, and Unicode is extended continuously (for example, when new emojis are introduced).
The important thing to remember is that all Unicode does is to assign a numerical ID, called code point, to each character for easy and unambiguous reference.
Encodings
Map characters to bit patterns.
These bit patterns are used to represent the characters in computer memory or on disk.
There are many different encodings that cover different subsets of characters. In the English-speaking world, the most common encodings are the following:
ASCII
Maps 128 characters (code points U+0000 to U+007F) to bit patterns of length 7.
Example:
a → 1100001 (0x61)
You can see all the mappings in this table.
ISO 8859-1 (aka Latin-1)
Maps 191 characters (code points U+0020 to U+007E and U+00A0 to U+00FF) to bit patterns of length 8.
Example:
a → 01100001 (0x61)
á → 11100001 (0xE1)
You can see all the mappings in this table.
UTF-8
Maps 1,112,064 characters (all existing Unicode code points) to bit patterns of either length 8, 16, 24, or 32 bits (that is, 1, 2, 3, or 4 bytes).
Example:
a → 01100001 (0x61)
á → 11000011 10100001 (0xC3 0xA1)
≠ → 11100010 10001001 10100000 (0xE2 0x89 0xA0)
😂 → 11110000 10011111 10011000 10000010 (0xF0 0x9F 0x98 0x82)
The way UTF-8 encodes characters to bit strings is very well described here.
Unicode and Encodings
Looking at the above examples, it becomes clear how Unicode is useful.
For example, if I'm Latin-1 and I want to explain my encoding of á, I don't need to say:
"I encode that a with an aigu (or however you call that rising bar) as 11100001"
But I can just say:
"I encode U+00E1 as 11100001"
And if I'm UTF-8, I can say:
"Me, in turn, I encode U+00E1 as 11000011 10100001"
And it's unambiguously clear to everybody which character we mean.
Now to the often arising confusion
It's true that sometimes the bit pattern of an encoding, if you interpret it as a binary number, is the same as the Unicode code point of this character.
For example:
ASCII encodes a as 1100001, which you can interpret as the hexadecimal number 0x61, and the Unicode code point of a is U+0061.
Latin-1 encodes á as 11100001, which you can interpret as the hexadecimal number 0xE1, and the Unicode code point of á is U+00E1.
Of course, this has been arranged like this on purpose for convenience. But you should look at it as a pure coincidence. The bit pattern used to represent a character in memory is not tied in any way to the Unicode code point of this character.
Nobody even says that you have to interpret a bit string like 11100001 as a binary number. Just look at it as the sequence of bits that Latin-1 uses to encode the character á.
Back to your question
The encoding used by your Python interpreter is UTF-8.
Here's what's going on in your examples:
Example 1
The following encodes the character á in UTF-8. This results in the bit string 11000011 10100001, which is saved in the variable a.
>>> a = 'á'
When you look at the value of a, its content 11000011 10100001 is formatted as the hex number 0xC3 0xA1 and output as '\xc3\xa1':
>>> a
'\xc3\xa1'
Example 2
The following saves the Unicode code point of á, which is U+00E1, in the variable ua (we don't know which data format Python uses internally to represent the code point U+00E1 in memory, and it's unimportant to us):
>>> ua = u'á'
When you look at the value of ua, Python tells you that it contains the code point U+00E1:
>>> ua
u'\xe1'
Example 3
The following encodes Unicode code point U+00E1 (representing character á) with UTF-8, which results in the bit pattern 11000011 10100001. Again, for output this bit pattern is represented as the hex number 0xC3 0xA1:
>>> ua.encode('utf-8')
'\xc3\xa1'
Example 4
The following encodes Unicode code point U+00E1 (representing character á) with Latin-1, which results in the bit pattern 11100001. For output, this bit pattern is represented as the hex number 0xE1, which by coincidence is the same as the initial code point U+00E1:
>>> ua.encode('latin1')
'\xe1'
There's no relation between the Unicode object ua and the Latin-1 encoding. That the code point of á is U+00E1 and the Latin-1 encoding of á is 0xE1 (if you interpret the bit pattern of the encoding as a binary number) is a pure coincidence.
Your terminal happens to be configured to UTF-8.
The fact that printing a works is a coincidence; you are writing raw UTF-8 bytes to the terminal. a is a value of length two, containing two bytes, hex values C3 and A1, while ua is a unicode value of length one, containing a codepoint U+00E1.
This difference in length is one major reason to use Unicode values; you cannot easily measure the number of text characters in a byte string; the len() of a byte string tells you how many bytes were used, not how many characters were encoded.
You can see the difference when you encode the unicode value to different output encodings:
>>> a = 'á'
>>> ua = u'á'
>>> ua.encode('utf8')
'\xc3\xa1'
>>> ua.encode('latin1')
'\xe1'
>>> a
'\xc3\xa1'
Note that the first 256 codepoints of the Unicode standard match the Latin 1 standard, so the U+00E1 codepoint is encoded to Latin 1 as a byte with hex value E1.
Furthermore, Python uses escape codes in representations of unicode and byte strings alike, and low code points that are not printable ASCII are represented using \x.. escape values as well. This is why a Unicode string with a code point between 128 and 255 looks just like the Latin 1 encoding. If you have a unicode string with codepoints beyond U+00FF a different escape sequence, \u.... is used instead, with a four-digit hex value.
It looks like you don't yet fully understand what the difference is between Unicode and an encoding. Please do read the following articles before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
When you define a as unicode, the chars a and á are equal. Otherwise á counts as two chars. Try len(a) and len(au). In addition to that, you may need to have the encoding when you work with other environments. For example if you use md5, you get different values for a and ua

String characters misinterpreted when iterating over string with unicode chars

I am running on python 2.7 on mac os x 10.6 with file in utf8 and terminal in utf8.
I want to add a period after each occurrence of the vowels å,ä or ö that exist in a given string.
Here is the dumbed down version of what I am trying to do:
# coding: utf8
a = 'change these letters äöå'
b = map( (lambda x: a.replace(x, "{0}.".format(x))), 'åäö')
for c in b:
print c
which procudes the following output:
change these letters ?.??.??.?
change these letters äöå.
change these letters ?.??.??.?
change these letters ä.öå
change these letters ?.??.??.?
change these letters äö.å
Why do I get the lines with the question mark? Upon further research just doing this would produce the same question marks.
# coding: utf8
for letter in 'åäö':
print letter
output:
?
?
?
?
?
?
But explicitly adding the u before gives the
# coding: utf8
for letter in u'åäö':
print letter
output:
å
ä
ö
Decoding and encoding back the string explicitly to utf8 still produces the question marks. What is the problem here? What is happing in this loop?
Side note: In the dumb example you see what I am trying to do. In actuality I am using an object that saves the string so that the mapped operations occur on the same string. So the map() call actually calls the object's method with one new vowel each time, thus updating the string saved in the object. The object's method performs the replace with a vowel from the second argument of map and updates the stored string.
You're mapping the anonymous function over a string; you should be mapping it over a list of strings. The Python interpreter will still accept the instruction you're giving it, treating the string as a sequence and applying the lambda to each component of that sequence. But in that case the components are the individual bytes of the string, and each of the unicode characters is two bytes. So the replacement is performed six times.
Moreover, in three of those iterations the replacement is the identical operation of replacing the unicode prefix byte 0xc3 (which occurs three times in äöå), with 0xc3., which breaks the character encoding in the string a and produces raw byte gibberish. In the other three iterations you replace the second byte of a unicode char with that byte followed by a period, so the resulting string still contains a byte sequence for the character in question and you get your desired result. But that's not because you're replacing the entire character with that character followed by a period.
Compare:
>>> a = 'change these letters äöå'
>>> b = map( (lambda x: a.replace(x, "{0}.".format(x))), 'å ä ö'.split())
>>> for c in b:
... print c
...
change these letters äöå.
change these letters ä.öå
change these letters äö.å
You're iterating over the bytes in a bytestring. Since non-ASCII characters encoded as UTF-8 use multiple bytes, you're breaking the characters. If you must iterate over characters then iterate over the characters of a unicode.

Python returning the wrong length of string when using special characters

I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any possible way in Python to have a character like ë́ be represented as 1?
I'm using UTF-8 encoding for the actual code and web page it is being outputted to.
edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.
UTF-8 is an unicode encoding which uses more than one byte for special characters. If you don't want the length of the encoded string, simple decode it and use len() on the unicode object (and not the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified on the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to an unicode object by decoding() it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in an unicode object like you would do in a str object (they are both inheriting from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
If you develop localized applications, it's generally a good idea to use only unicode-objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
PS: Please note, that the str and unicode datatype have changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings which can't be mixed anymore. That should help to avoid common pitfalls with unicode handling...
Regards,
Christoph
The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Yes. That's how code points are defined by Unicode. In general, you can ask Python to convert a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT using Unicode normalisation:
>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1' # single character: á
However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.
To work out where the ‘editable points’ in a string of Unicode code points are is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text and contextual glpyh shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).
If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:
def withoutcombining(s):
return ''.join(c for c in s if unicodedata.combining(c)==0)
>>> withoutcombining(u'ë́aúlt')
'\xeba\xfalt' # ëaúlt
>>> len(_)
5
The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.
Don't forget to use unicode and unicode literals in your code.
which Python version are you using?
Python 3.1 doesn't have this issue.
>>> print(len("ë́aúlt"))
6
Regards
Djoudi
You said: I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.
"Exactly what is in your data": use the repr() built-in function (for lots more things apart from unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.
You can use the unicodedata.name() function to tell you what each component is.
Here's an example:
# coding: utf8
import unicodedata
x = u"ë́aúlt"
print(repr(x))
for c in x:
try:
name = unicodedata.name(c)
except:
name = "<no name>"
print "U+%04X" % ord(c), repr(c), name
Results:
u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T
Now read #bobince's answer :-)

Categories

Resources