How many displayable characters in a unicode string (Japanese / Chinese) - python

I'd need to know how many displayable characters are in a unicode string containing Japanese / Chinese characters.
Sample code to make the question very obvious:
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<<
note that four characters are displayed
How can I know, from the string, that 4 characters are going to be displayed?

This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes; len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode those code points. The most popular encoding is UTF-8. In UTF-8, one code point takes from 1 to 4 bytes. But you don't have to remember that, just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 code points.
Print it to see the printable version:
>>> print str.decode('utf8')
睡眠時間
And get the number of code points:
>>> len(str.decode('utf8'))
4
UPDATE: See also abarnert's answer below, which covers more of the corner cases.

If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function gives you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode doesn't really define anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
import unicodedata

def displayable(c):
    return not unicodedata.category(c).startswith('C')

p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a nonspacing combining mark followed by a letter count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFC or NFKC here) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
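For instance, at a Python prompt, NFC composes the pair mentioned above into the single code point U+00C7:
>>> import unicodedata
>>> s = u'\u0043\u0327'   # 'C' followed by COMBINING CEDILLA
>>> len(s)
2
>>> unicodedata.normalize('NFC', s)
u'\xc7'
>>> len(unicodedata.normalize('NFC', s))
1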
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.
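For example, a quick check of that separator rule at a Python 3 prompt:
>>> ' '.isprintable()
True
>>> '\xa0'.isprintable()    # NO-BREAK SPACE, a separator other than ' '
False
>>> '\u2028'.isprintable()  # LINE SEPARATOR
False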

Related

python special letters, fix string [duplicate]

I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any possible way in Python to have a character like ë́ be represented as 1?
I'm using UTF-8 encoding for the actual code and web page it is being outputted to.
edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.
UTF-8 is a Unicode encoding which uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not on the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified at the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in a unicode object like you would in a str object (both inherit from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
If you develop localized applications, it's generally a good idea to use only unicode-objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
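A minimal sketch of that decode-early / encode-late pattern in Python 2 (handle_request is just a made-up name for illustration):
# -*- coding: utf-8 -*-

def handle_request(raw_bytes):
    text = raw_bytes.decode('utf-8')    # decode once, at the input boundary
    result = text.upper()               # work only with unicode internally
    return result.encode('utf-8')       # encode once, at the output boundary

print handle_request('ë́aúlt')           # the argument is a UTF-8 byte string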
PS: Please note that the str and unicode datatypes have changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings, which can't be mixed anymore. That should help to avoid common pitfalls with unicode handling...
Regards,
Christoph
The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Yes. That's how code points are defined by Unicode. In general, you can ask Python to convert a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT using Unicode normalisation:
>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1' # single character: á
However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.
To work out where the ‘editable points’ in a string of Unicode code points are is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text and contextual glyph shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).
If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:
import unicodedata

def withoutcombining(s):
    return ''.join(c for c in s if unicodedata.combining(c) == 0)
>>> withoutcombining(u'ë́aúlt')
u'\xeba\xfalt' # ëaúlt
>>> len(_)
5
The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.
Don't forget to use unicode and unicode literals in your code.
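A short sketch of that approach in Python 2, decomposing with NFD and then dropping the combining marks:
# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(u):
    # decompose letters into base character + combining marks, then drop the marks
    decomposed = unicodedata.normalize('NFD', u)
    return u''.join(c for c in decomposed if not unicodedata.combining(c))

print strip_accents(u'ë́aúlt')        # eault
print len(strip_accents(u'ë́aúlt'))   # 5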
which Python version are you using?
Python 3.1 doesn't have this issue.
>>> print(len("ë́aúlt"))
6
Regards
Djoudi
You said: I have a string ë́aúlt that I want to get the length of a manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.
"Exactly what is in your data": use the repr() built-in function (for lots more things apart from unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.
You can use the unicodedata.name() function to tell you what each component is.
Here's an example:
# coding: utf8
import unicodedata

x = u"ë́aúlt"
print repr(x)
for c in x:
    try:
        name = unicodedata.name(c)
    except ValueError:
        name = "<no name>"
    print "U+%04X" % ord(c), repr(c), name
Results:
u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T
Now read #bobince's answer :-)

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except that many (but not all) of the Cyrillic characters are encoded as two bytes when in a str. For instance:
>>> print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>> print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7
There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to pass errors='ignore' to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
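For example, at a Python 2 prompt ('\xff' can never appear in valid UTF-8, so it is silently dropped):
>>> '\xd1\x91 ok \xff oops'.decode('utf-8', 'ignore')
u'\u0451 ok  oops'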
These are actually different encodings:
>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']
What you're seeing are the __repr__() results for the elements in the lists, not the __str__() versions of the strings.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to decode the two-byte UTF-8 sequences into unicode objects:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.
To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything; e.g., if the data comes from a web page, see "A good way to get the charset/encoding of an HTTP response in Python".
Don't use non-ASCII characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals, as sketched after the note below.
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё
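A minimal sketch of the unicode_literals advice above, assuming a UTF-8 source file under Python 2.7:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = "ё"           # a unicode literal now, not a byte string
print len(s)      # 1
print repr(s)     # u'\u0451'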

python: extended ASCII codes

Hi, I want to know how I can append and then print extended ASCII codes in Python.
I have the following.
code = chr(247)
li = []
li.append(code)
print li
The result Python prints out is ['\xf7'] when it should be a division symbol. If I simply print code directly ("print code") then I get the division symbol, but not if I append it to a list. What am I doing wrong?
Thanks.
When you print a list, it outputs the default representation of all its elements, i.e. by calling repr() on each of them. The repr() of a string is its escaped code, by design. If you want to output all the elements of the list properly you should convert it to a string, e.g. via ', '.join(li).
Note that as those in the comments have stated, there isn't really any such thing as "extended ASCII", there are just various different encodings.
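A small illustration of the difference in Python 2 (how the raw byte shows up on screen depends on your terminal's encoding):
code = chr(247)
li = [code]
print li              # ['\xf7'] -- the list shows each element's repr()
print ', '.join(li)   # writes the byte itself rather than its repr()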
You probably want the charmap encoding, which lets you turn unicode into bytes without 'magic' conversions.
s = u'\xf7'
b = s.encode('charmap')
with open('/dev/stdout', 'wb') as f:
    f.write(b)
    f.flush()
Will print ÷ on my system.
Note that 'extended ASCII' refers to any of a number of proprietary extensions to ASCII, none of which were ever officially adopted and all of which are incompatible with each other. As a result, the symbol output by that code will vary based on the controlling terminal's choice of how to interpret it.
There's no single defined standard named "extended ASCII codes" - there are, however, plenty of characters, tens of thousands, as defined in the Unicode standards.
You may be limited to the charset encoding of your text terminal, which you may think of as "extended ASCII", but which might be "latin-1", for example (if you are on a Unix system such as Linux or Mac OS X, your text terminal will likely use UTF-8 encoding and be able to display any of the tens of thousands of characters available in Unicode).
So, you should read this piece to understand what text is after 1992: http://www.joelonsoftware.com/articles/Unicode.html. If you build any production application believing in "extended ASCII", you are harming yourself, your users and the whole ecosystem at once.
That said, Python 2's (and Python 3's) print will call an implicit str conversion on the objects passed in. If you pass a list, this conversion does not recursively call str for each list element; instead, it uses each element's repr, which displays non-ASCII characters as their numeric escapes or other unsuitable notations.
You can simply join your desired characters in a unicode string, for example, and then print them normally, using the terminal encoding:
import sys
mytext = u""
mytext += unichr(247) #check the codes for unicode chars here: http://en.wikipedia.org/wiki/List_of_Unicode_characters
print mytext.encode(sys.stdout.encoding, errors="replace")
You are doing nothing wrong.
What you do is to add a string of length 1 to a list.
This string contains a character outside the range of printable characters, and outside of ASCII (which is only 7 bit). That's why its representation looks like '\xf7'.
If you print it, it will be rendered as well as the system can manage.
In Python 2, the byte will just be printed. The resulting output may be the division symbol, or something else, depending on your system's encoding.
In Python 3, it is a unicode character and will be processed according to how stdout is set up. Normally, this indeed should be the division symbol.
In a representation of a list, the __repr__() of the string is called, leading to what you see.

How to get rid of non-ascii characters in Perl & Python [both]?

How do I get rid of non-ASCII characters like "^L", "¢", "â" in Perl & Python? I'm getting these special characters while parsing PDF files in Python & Perl. Now I have text versions of these PDF files, but with these special characters still in them. Is there any function available which will ensure that a file or a variable does not contain any non-ASCII character?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', 'ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.
For the sake of completeness, some Perl solutions. Both return ",,". Unlike the accepted Python answer, I have used no magic numbers like 32 or 128. The constants here can be looked up much more easily in the documentation.
use 5.014; use Encode qw(encode); encode('ANSI_X3.4-1968', "\cL,¢,â", sub{q()}) =~ s/\p{PosixCntrl}//gr;
use 5.014; use Unicode::UCD qw(charinfo); join q(), grep { my $u = charinfo ord $_; 'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category} } split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since I have the errors flag on "ignore", it just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the example above, using Python 3, I get a bytes object back; I would need to convert it back to Unicode proper should I need a string object.
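A sketch of such a handler in Python 3 (the handler name 'mark' and the '?' replacement are arbitrary choices for illustration):
import codecs

def mark_errors(exc):
    # replace the characters that failed to encode and resume after them
    return ('?' * (exc.end - exc.start), exc.end)

codecs.register_error('mark', mark_errors)
print("hello swede åäö".encode("ascii", "mark"))   # b'hello swede ???'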
