String characters misinterpreted when iterating over string with unicode chars - python

I am running Python 2.7 on Mac OS X 10.6, with both my source file and terminal in UTF-8.
I want to add a period after each occurrence of the vowels å, ä, or ö in a given string.
Here is a dumbed-down version of what I am trying to do:
# coding: utf8
a = 'change these letters äöå'
b = map( (lambda x: a.replace(x, "{0}.".format(x))), 'åäö')
for c in b:
    print c
which produces the following output:
change these letters ?.??.??.?
change these letters äöå.
change these letters ?.??.??.?
change these letters ä.öå
change these letters ?.??.??.?
change these letters äö.å
Why do I get the lines with the question marks? Upon further investigation, I found that even this simple loop produces the same question marks:
# coding: utf8
for letter in 'åäö':
    print letter
output:
?
?
?
?
?
?
But explicitly adding the u prefix gives the expected characters:
# coding: utf8
for letter in u'åäö':
    print letter
output:
å
ä
ö
Explicitly decoding the string and encoding it back to UTF-8 still produces the question marks. What is the problem here? What is happening in this loop?
Side note: the dumbed-down example shows what I am trying to do. In actuality, I am using an object that stores the string, so that the mapped operations all apply to the same string. The map() call invokes the object's method with one new vowel each time; the method performs the replace with that vowel and updates the stored string.

You're mapping the anonymous function over a string; you should be mapping it over a list of strings. The Python interpreter will still accept the instruction, treating the string as a sequence and applying the lambda to each element of that sequence. But the elements of a Python 2 byte string are its individual bytes, and each of these non-ASCII characters occupies two bytes in UTF-8. So the replacement is performed six times.
Moreover, in three of those iterations the replacement is the identical operation of replacing the lead byte 0xc3 (which occurs three times in äöå) with 0xc3 followed by a period, which breaks the character encoding in the string a and produces raw byte gibberish. In the other three iterations you replace the second byte of a character with that byte followed by a period, so the resulting string still contains a valid byte sequence for the character in question and you get your desired result. But that's not because you're replacing the entire character with that character followed by a period.
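You can see those six byte-level elements directly. A quick REPL check (Python 2 with a UTF-8 source encoding; shown for illustration):
>>> list('åäö')   # a byte string: each letter is two UTF-8 bytes
['\xc3', '\xa5', '\xc3', '\xa4', '\xc3', '\xb6']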
Compare:
>>> a = 'change these letters äöå'
>>> b = map( (lambda x: a.replace(x, "{0}.".format(x))), 'å ä ö'.split())
>>> for c in b:
...     print c
...
change these letters äöå.
change these letters ä.öå
change these letters äö.å

You're iterating over the bytes in a byte string. Since non-ASCII characters encoded as UTF-8 use multiple bytes, you're breaking the characters apart. If you need to iterate over characters, decode the byte string first and iterate over the characters of a unicode object.
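A minimal sketch of that approach (Python 2, reusing the question's string; illustrative, not the only way):
# coding: utf8
a = 'change these letters äöå'.decode('utf8')  # a is now a unicode object
b = map(lambda x: a.replace(x, u"{0}.".format(x)), u'åäö')  # unicode vowels
for c in b:
    print c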

Related

Converting escaped characters to UTF-8 in Python

Is there an elegant way to convert "test\207\128" into "testπ" in python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy-to-parse format. The problem is that it outputs non-alphanumeric characters as escaped sequences, so a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping the leading zero, and using chr(35) to recover the #. This solution breaks on multi-byte UTF-8 characters such as "π", which gets output as "\207\128".
The input string you have is an escaped encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then let Python decode the resulting UTF-8 bytes to a string object:
import re
value = r"test\207\128"
# Encode to bytes so the substitution below can emit raw byte values
value = value.encode("utf-8")
# Replace each "\###" escape with the single byte named by
# the captured decimal number
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# The bytes now form valid UTF-8; decode them back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ

Python 2 string somehow saved as pure Unicode

I have the following Chinese strings, saved in the following form as "str" type:
\u72ec\u5230
\u7528\u8272
I am on Python 2.7; when I print those strings, they are printed as actual Chinese characters:
chinese_list = ["\u72ec\u5230", "\u7528\u8272", "\u72ec"]
print(chinese_list[0], chinese_list[1], chinese_list[2])
>>> 独到 用色 独
I can't really figure out how they were saved in that form; to me it looks like Unicode. The goal is to take other Chinese characters that I have and save them in the same kind of encoding. Say I have "国道": I would need it saved in the same way as the strings in the original chinese_list.
I've tried to encode it as utf-8 and also other encodings but I never get the same output as in the original:
new_string = u"国道"
print(new_string.encode("utf-8"))
# >>> b'\xe5\x9b\xbd\xe9\x81\x93'
print(new_string.encode("utf-16"))
# >>> b'\xff\xfe\xfdVS\x90'
Any help appreciated!
EDIT: it doesn't have to have 2 Chinese characters.
EDIT2: Apparently, the encoding was unicode-escape. Thanks @deceze.
print(u"国".encode('unicode-escape'))
>>> \u56fd
The \u.... is unicode escape syntax. It works similarly to how \n is a newline, not the two characters \ and n.
The elements of your list never actually contain the literal characters \, u, 7 and so on. They contain a unicode string with the actual unicode characters, i.e. 独 and so on.
Note that this only works with unicode strings! In Python 2, you need to write u"\u....". Python 3 always uses unicode strings.
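A quick Python 2 REPL illustration of that difference (assuming a UTF-8-capable terminal):
>>> print("\u72ec")   # byte string: \u is not an escape here, just six characters
\u72ec
>>> print(u"\u72ec")  # unicode string: \u72ec is the single character 独
独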
The numeric codepoint of a character can be obtained with the ord builtin. For example, ord(u"国") gives 22269, the same value as 0x56fd.
To get the hexadecimal escape value, convert the result to hex, padding to four digits.
>>> def escape_literal(character):
...     return r'\u{:04x}'.format(ord(character))
...
>>> print(escape_literal(u'国'))
\u56fd
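To go the other way, the same codec decodes escape text back into the character (Python 2; shown for illustration):
>>> print('\\u56fd'.decode('unicode-escape'))
国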

Iterate over UTF-8 characters in Python

I am using Python 3.6 to read a file encoded in UTF-8, in Spanish (thus including the letter ñ). I open the file with the utf-8 codec, and it loads correctly: while debugging, I can see ñ in the loaded text.
However, when I iterate over characters, ñ is read as two characters, n and ~. Concretely, when I run:
for c in text:
    hexc = int(hex(ord(c)), 16)
    if U_LETTERS[lang][0] <= hexc <= U_LETTERS[lang][1] \
            or hexc in U_LETTERS[lang][2:] \
            or hexc == U_SPACE:
        filtered_text += c
and text includes an ñ, the variable c takes it as an n (and therefore hexc is 110 instead of 241), and then takes ~ (and hexc is 771). I guess there is an internal conversion to an 8-bit char when iterating this way. What is the proper way to do this?
Thanks in advance.
This has to do with Unicode normalisation. The letter "ñ" can be expressed either as a single character with the codepoint 0xF1 (241), or as the two characters "n" and a combining character for the superimposed tilde, i.e. the codepoints 0x6E and 0x0303 (110 and 771).
These two ways of expressing the letter are considered equivalent; however, they are not the same in string comparison.
Python provides functionality to convert from one form to the other by means of the unicodedata module.
The first form is called composed (NFC), the second one decomposed (NFD) normalised form.
An example explains it the easiest way:
>>> import unicodedata
>>> '\xf1'
'ñ'
>>> [ord(c) for c in '\xf1']
[241]
>>> [ord(c) for c in unicodedata.normalize('NFD', '\xf1')]
[110, 771]
>>> [ord(c) for c in unicodedata.normalize('NFC', 'n\u0303')]
[241]
So, to solve your problem, convert all of the text to the desired normalisation form before any further processing.
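Applied to the question's loop, that could look like the following sketch (U_LETTERS, lang and U_SPACE are the question's own names; note that ord(c) already returns the integer codepoint, so the hex round trip is unnecessary):
import unicodedata

text = unicodedata.normalize('NFC', text)  # compose n + U+0303 into a single ñ (U+00F1)
filtered_text = ''
for c in text:
    hexc = ord(c)  # already an integer codepoint
    if U_LETTERS[lang][0] <= hexc <= U_LETTERS[lang][1] \
            or hexc in U_LETTERS[lang][2:] \
            or hexc == U_SPACE:
        filtered_text += c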
Note: Unicode normalisation is a problem separate from encoding. You can have this with UTF-16 or UTF-32 just as well. In the decomposed form, you actually have two (or more) characters (each of which might be represented with multiple bytes, depending on the encoding). It's up to the displaying device (the terminal emulator, an editor...) to show this as a single letter with marks above/below the base character.

Python reversing a UTF-8 string

I'm currently learning Python, and as a Slovenian I often use UTF-8 characters to test my programs. Normally everything works fine, but there is one catch that I can't get past. Even though I've declared the encoding at the top of the file, reversing a string containing special characters fails:
#-*- coding: utf-8 -*-
a = "čšž"
print a #prints čšž
b = a[::-1]
print b #prints �šō� instead of žšč
Is there any way to fix that?
Python 2 strings are byte strings, and UTF-8 encoded text uses multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters doesn't mean that Python knows which bytes form one UTF-8 character.
Your byte string consists of 6 bytes; every two bytes form one character:
>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'
However, how many bytes UTF-8 uses depends on where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
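A quick check of those lengths (illustrative; u'\U0001F40D' is the snake emoji):
>>> len(u'a'.encode('utf8')), len(u'š'.encode('utf8')), len(u'\U0001F40D'.encode('utf8'))
(1, 2, 4)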
In UTF-8, order is everything; reversing the above byte string reverses the bytes, resulting in gibberish as far as the UTF-8 standard is concerned, but the middle 4 bytes just happen to be valid UTF-8 sequences (for š and ō):
>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
  |     š       ō     |
  \                    \
invalid UTF-8 byte     opening UTF-8 byte missing a second byte
You'd have to decode the byte string to a unicode object, which consists of single characters. Reversing that object gives you the right results:
b = a.decode('utf8')[::-1]
print b
You can always encode the object back to UTF-8 again:
b = a.decode('utf8')[::-1].encode('utf8')
Note that even in Unicode, you can still run into issues when reversing text that uses combining characters. Reversing such text places each combining character before, rather than after, the character it combines with, so it combines with the wrong character:
>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe
You can mostly avoid this by converting the Unicode data to its normalised form (which replaces combining sequences with single-codepoint forms where they exist), but there are plenty of other exotic Unicode characters that don't play well with string reversals.
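A sketch of that mitigation using the example above (Python 2; unicodedata.normalize composes e + U+0301 into the single codepoint é before reversing):
# -*- coding: utf-8 -*-
import unicodedata

s = u'e\u0301a'  # 'e' + combining acute accent + 'a'
print unicodedata.normalize('NFC', s)[::-1]  # prints aé, as expected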

How do I convert a list of strings to a unicode value? [duplicate]

I receive the following:
value = ['\', 'n']
and my regular routine of converting to unicode and calling ord throws the error:
ord() expected a character, but string of length 2 found
It would seem that I need to join the characters within the list if len(value) > 2.
How do I go about doing this?
If you're trying to figure out how to treat this as a single string '\\n' that can then be interpreted as the single character '\n' according to some set of rules, like Python's unicode-escape rules, you have to decide exactly what you want before you can code it.
First, to turn a list of two single-character strings into one two-character string, just use join:
>>> value = ['\\', 'n']
>>> escaped_character = ''.join(value)
>>> escaped_character
'\\n'
Next, to interpret a two-character escape sequence as a single character, you have to know which escape rules you're trying to undo. If it's Python's Unicode escape, there's a codec named unicode_escape that does that:
>>> character = escaped_character.decode('unicode_escape')
>>> character
u'\n'
If, on the other hand, you're trying to undo UTF-8 encoding followed by Python string-escape, or C backslash escapes, or something different, you obviously have to write something different. And given what you've said about UTF-8, I think you probably do want something different. For example, u'é'.encode('UTF-8') is the two-byte sequence '\xc3\xa9'. Just calling decode('unicode_escape') on that will give you the two-character sequence u'\u00c3\u00a9', which is not what you want.
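To see that mismatch concretely (Python 2 REPL; shown for illustration):
>>> u'é'.encode('UTF-8')
'\xc3\xa9'
>>> u'é'.encode('UTF-8').decode('unicode_escape')
u'\xc3\xa9'
The result is the two characters Ã© rather than the single character é.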
Anyway, now that you've got a single character, just call ord:
>>> char_ord = ord(character)
>>> char_ord
10
I'm not sure what the convert-to-unicode bit is about. If this is Python 3.x, the strings are already Unicode. If it's 2.x, and the strings are ASCII, it's guaranteed that ord(s) == ord(unicode(s)). If it's 2.x, and the strings are in some other encoding, just calling unicode on them is going to give you a UnicodeError or mojibake; you need to pass an encoding in as well, in which case you might as well use the decode method.
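Putting the steps together in one line (Python 2, using the corrected list from above):
>>> ord(''.join(['\\', 'n']).decode('unicode_escape'))
10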
