I have run into a problem with Python 3.6.7 when I tried to stringify a hexadecimal value. The original hexadecimal number in the string is converted into the letter Ë. Is there any way to solve this?
>>> '\xcb\x85\x04\x08'
'Ë\x85\x04\x08'
You are using characters outside of the ASCII range. If you are trying to use Unicode characters, use \u____.
print("\xCB\x85\x04\x08")
print("\uCB89\u0408")
Output:
Ë
쮉Ј
You can find an ASCII table at asciitable.com. Characters outside of the range 00-7F are subject to variance across regions, because many countries used them to store extra characters useful in their own language, such as Russian characters in Russia.
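For instance, a quick check (Python 3) shows that \xcb and \u00cb name the same code point, and that the interactive display simply shows it as the letter Ë:
>>> '\xcb' == '\u00cb'
True
>>> hex(ord('Ë'))
'0xcb'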
Is there a way to access or look up control characters in Python, such as NUL, DEL, CR, LF, and BEL, in their form as a single ASCII/Unicode character, so they can be passed to the ord() built-in function to get a numeric value?
You can find their Unicode representation on the ASCII page on Wikipedia.
Key   Unicode   Unicode-Hex
NUL   ␀         \u2400
DEL   ␡         \u2421
CR    ␍         \u240d
LF    ␊         \u240a
BEL   ␇         \u2407
To "access" or "search" for the control characters, the unicodedata module provides access to the Unicode character database through functions keyed by character or by name. I was initially confused about how these characters are represented in the text data type and how they could be used with the ord() function; unicodedata.lookup() returns the character for a given name (and unicodedata.name() the name for a given character), which resolves the question of how to "search" for control characters, a point that was unclear to me through limited experience with the library and the ASCII standard.
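For instance, here is a small sketch using unicodedata (the names below are the official Unicode names of the "control picture" symbols in the table above, and the escapes are the real control characters):
import unicodedata

# the real control characters and their numeric values via ord():
for ch in ('\x00', '\x7f', '\r', '\n', '\x07'):
    print(hex(ord(ch)))                       # 0x0, 0x7f, 0xd, 0xa, 0x7

# the printable "control picture" symbols can be looked up by name:
print(unicodedata.lookup('SYMBOL FOR NULL'))  # ␀
print(unicodedata.name('\u2400'))             # SYMBOL FOR NULL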
It is also worth noting that the question is vaguely phrased and does not provide a reproducible example or a concrete algorithmic problem. These characters have several representations (as raw control characters, as printable symbols, and as names), which follows from the wide variety of ways ASCII is handled; that does not mean the question cannot be answered, and the clarification above addresses the doubts raised in this post.
I have tried to change the format of strings from latin1 to ascii, and most of the strings were converted correctly except for a few characters: æ, ø, Æ, and Ø.
I have checked that the characters are converted correctly when using the R package (stringi::stri_trans_general(loc1, "latin-ascii")), but Python's unicodedata package did not work well.
Is there any way to convert them correctly in Python? I guess it may need an additional dictionary.
For information, I have applied the following function to change the format:
unicodedata.normalize('NFKD', "Latin strings...").encode('latin1', 'ignore').decode('ascii')
It's important to understand a) what encodings and decodings are; b) how text works; and c) what unicode normalization does.
Strings do not have a "format" in the sense that you describe, so talking about converting from latin1 to ascii format does not make sense. The string has representations (what it looks like when you print it out; or what the code looks like when you create it directly in your code; etc.), and it can be encoded. latin1, ascii etc. are encodings - that means, rules that explain how to store your string as a raw sequence of bytes.
So if you have a string, it is not "in latin1 format" just because the source data was in latin1 encoding - it is not in any format, because that concept doesn't apply. It's just a string.
Similarly, we cannot ask for a string "in ascii format" that we convert to. We can ask for an ascii encoding of the string - which is a sequence of bytes, and not text. (That "not" is one of the most important "not"s in all of computer science, because many people, tools and programs will lie to you about this.)
Of course, the problem here is that ascii cannot represent all possible text. There are over a million "code points" that can theoretically be used as elements of a string (this includes a lot of really weird things like emoji). The latin-1 and ascii encodings both use a single byte per code point in the string. Obviously, this means they can't represent everything. Latin-1 represents only the first 256 possible code points, and ascii represents only the first 128. So if we have data that comes from a latin-1 source, we can get a string with those characters like Æ in it, which cause a problem in our encoding step.
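For instance, Æ is code point U+00C6 (198): it fits within latin-1's 256 values but is outside ascii's 128, so latin-1 can encode it and ascii cannot:
>>> ord('Æ')
198
>>> 'Æ'.encode('latin-1')
b'\xc6'
>>> 'Æ'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\xc6' in position 0: ordinal not in range(128)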
The 'ignore' option for .encode makes the encoder skip things that can't be handled by the encoding. So if you have the string 'barentsøya', since the ø cannot be represented in ascii, it gets skipped and you get the bytes b'barentsya' (using the unfortunately misleading way that Python displays bytes objects back to you).
When you normalize a string, you convert the code points into some plain format that's easier to work with and that treats distinct ways of writing a character - or distinct ways of writing very similar characters - the same way. There are a few different normalization schemes. NFKD chooses decomposed representations for accented characters - that is, instead of using a single symbol to represent a letter with an accent, it uses two symbols, one that represents the plain letter and one representing the "combining" version of the accent. That might seem useful - for example, it would turn an accented A into a plain A plus an accent character. You might think that you can then just encode this as ascii, let the accent characters be ignored, and get the result you want. However, it turns out that this is not enough: characters like æ, ø, Æ and Ø are not "letter plus accent" at all and have no decomposition mapping in Unicode, so NFKD leaves them unchanged and the ascii encoder simply drops them.
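To make this concrete, here is a small sketch (Python 3) showing that NFKD leaves ø untouched, so the ascii encoder with 'ignore' simply drops it:
>>> import unicodedata
>>> s = 'barentsøya'
>>> unicodedata.normalize('NFKD', s) == s   # ø has no decomposition, so nothing changes
True
>>> s.encode('ascii', 'ignore')
b'barentsya'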
Unfortunately, I think the best you can do is to either use a third-party library (and please note that recommendations are off-topic for Stack Overflow) or build the look-up table yourself and just translate each character. (Have a look at the built-in string methods translate and maketrans for help with this.)
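A minimal sketch of that look-up-table approach (the mapping below is only an illustrative assumption; extend it to whatever characters occur in your data):
# str.maketrans accepts a dict mapping characters to replacement strings
table = str.maketrans({'æ': 'ae', 'ø': 'o', 'Æ': 'AE', 'Ø': 'O'})
print('Ærø smørbrød'.translate(table))   # AEro smorbrod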
I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Is there any possible way in Python to have a character like ë́ be represented as 1?
I'm using UTF-8 encoding for the actual code and web page it is being outputted to.
edit: Just some background on why I need to do this. I am working on a project that translates English to Seneca (a form of Native American language) and ë́ shows up quite a bit. Some rewrite rules for certain words require knowledge of letter position (itself and surrounding letters) and other characteristics, such as accents and other diacritic markings.
UTF-8 is a Unicode encoding which uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified on the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to an unicode object by decoding() it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
Of course, you can also access single characters in a unicode object like you would do in a str object (they both inherit from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
If you develop localized applications, it's generally a good idea to use only unicode-objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
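A minimal sketch of that decode-early, encode-late pattern (the file names here are just illustrative):
# read raw bytes, decode once at the boundary, work with unicode internally
with open('input.txt', 'rb') as f:
    text = f.read().decode('utf-8')
# ... do all string manipulation on the decoded text ...
with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))   # encode again only when writing out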
PS: Please note that the str and unicode datatypes have changed significantly in Python 3. In Python 3 there are only unicode strings and plain byte strings, which can't be mixed anymore. That should help to avoid common pitfalls with unicode handling...
Regards,
Christoph
The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Yes. That's how code points are defined by Unicode. In general, you can ask Python to combine a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT into a single precomposed character using Unicode normalisation:
>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1' # single character: á
However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.
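As a small illustration (Python 3 literals for brevity): NFC can compose e + diaeresis into the precomposed ë (U+00EB), but since no precomposed "e with diaeresis and acute" exists, the acute stays behind as a separate combining code point:
>>> import unicodedata
>>> [hex(ord(c)) for c in unicodedata.normalize('NFC', 'e\u0308\u0301')]
['0xeb', '0x301']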
To work out where the ‘editable points’ in a string of Unicode code points are is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text and contextual glyph shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).
If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:
import unicodedata

def withoutcombining(s):
    # keep only code points that are not combining marks
    return ''.join(c for c in s if unicodedata.combining(c) == 0)
>>> withoutcombining(u'ë́aúlt')
u'\xeba\xfalt' # ëaúlt
>>> len(_)
5
The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.
Don't forget to use unicode and unicode literals in your code.
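A minimal sketch of that approach (Python 3 string shown; in Python 2 use a u'' literal as noted above):
import unicodedata

s = 'ë́aúlt'
decomposed = unicodedata.normalize('NFD', s)
# drop every combining mark, keeping only base characters
stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
print(stripped, len(stripped))   # eault 5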
which Python version are you using?
Python 3.1 doesn't have this issue.
>>> print(len("ë́aúlt"))
6
Regards
Djoudi
You said: I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.
"Exactly what is in your data": use the repr() built-in function (for lots more things apart from unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.
You can use the unicodedata.name() function to tell you what each component is.
Here's an example:
# coding: utf8
import unicodedata
x = u"ë́aúlt"
print(repr(x))
for c in x:
    try:
        name = unicodedata.name(c)
    except ValueError:
        name = "<no name>"
    print "U+%04X" % ord(c), repr(c), name
Results:
u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T
Now read #bobince's answer :-)
When downloading files from Korean websites, filenames are often wrongly encoded/decoded and end up all jumbled. I found out that by encoding with 'iso-8859-1' and decoding with 'euc-kr', I can fix this problem. However, I have a new problem where the same-looking character is in fact different. Check out the Python shell below:
>>> first_string = 'â'
>>> second_string = 'â'
>>> len(first_string)
1
>>> len(second_string)
2
>>> list(first_string)
['â']
>>> list(second_string)
['a', '̂']
>>>
Encoding the first string with 'iso-8859-1' is possible. The latter is not. So the questions:
What is the difference between these two strings?
Why would downloads from the same website have the same character in varying format? (If that's what the difference is.)
And how can I fix this? (e.g. convert second_string to the likeness of first_string)
Thank you.
An easy way to find out exactly what a character is is to ask vim. Put the cursor over a character and type ga to get info on it.
The first one is:
<â> 226, Hex 00e2, Octal 342
And the second:
<a> 97, Hex 61, Octal 141 < ̂> 770, Hex 0302, Octal 1402
In other words, the first is a complete "a with circumflex" character, and the second is a regular a followed by a circumflex combining character.
Ask the website operators. How would we know?!
You need something which turns combining characters into regular characters. A Google search yielded this question, for example.
As you pointed out in your comment, and as clemens pointed out in another answer, in Python you can use unicodedata.normalize with 'NFC' as the form.
There are different representations for accents and diaereses in Unicode. There is a single precomposed character at code point U+00E2, and there is the sequence of 'a' followed by COMBINING CIRCUMFLEX ACCENT (U+0302), which is created by u'a\u0302' in Python 2.7 and consists of two characters: the 'a' and the circumflex.
A possible reason for the different representations is, that the creator of the website had copied the texts from different sources. For example, PDF documents often display umlauts and accent marks using two composite characters, while typing these characters on keyboards produces single character representations generally.
You may use unicodedata.normalize to convert the combining characters into single characters, e.g.:
from unicodedata import normalize
s = u'a\u0302'
print s, len(s), len(normalize("NFC", s))
will output â 2 1.