Why does Emacs get my literal Unicode strings wrong?

Why does Emacs get my literal Unicode strings wrong? - python

As far as I know, these should be equivalent in a system that uses UTF-8 as the default encoding:
pattern1 = 'Wörterbuch Wortformen'.decode('utf8')
pattern2 = u'Wörterbuch Wortformen'
However, when I send these lines from an Emacs buffer to the Python process (M-x python-shell-send-region) something strange happens.
>>> pattern1
u'W\xf6rterbuch Wortformen'
>>> pattern2
u'W\xc3\xb6rterbuch Wortformen'
In a Python shell run in a terminal, both lines result in u'W\xf6rterbuch Wortformen'.
What is going on here?
My locale is configured to use UTF-8.

Here's what I did (might appear helpful later):
Created a single-bit encoded file, say /tmp/test.dat Opened it in Emacs using hexl-mode.
Using hexl-insert-hex-char command inserted bytes C3 and B6.
Opened this file as text (using text-mode). Emacs recognized it as file with multibyte encoding and displayed ö in place of the previous bytes.
Conclusion: you need the encoding system in your buffer which contains the source code to be utf-8 to send two bytes for ö. However, if it is a single-byte encoding, and given that you select the locale that maps the byte F6 to ö, you will get that byte.
PS. Make sure you have -*- coding: utf-8 -*- comment.

It turns out that it was a bug in python.el.

Related

Convert string of unknown encoding to UTF-8

I am consuming a text response from a third party API. This text is in an encoding which is unknown to me. I consume the text in python3 and want to change the encoding into UTF-8.
This is an example of the contents I get:
Danke
"TrÃ¤ume groÃŸ"
ðŸ™ŒðŸ¼
Super Idee!!!
I was able to get the messed up characters readable by doing the following manually:
Open new document in Notepad++
Via the Encoding menu switch the encoding of the document to ANSI
Paste the contents
Again use the Encoding menu, this time switch to UTF-8
Now the text is properly legible like below
Correct content:
Danke
"Träume groß"
🙌🏼
Super Idee!!!
I want to repeat this process in python3, but struggle to do so. From the notepad workflow I gather that the encoding shouldn't be converted, rather the existing characters should be interpreted with a different encoding. That's because if I select Convert to UTF-8 in the Encoding menu, it doesn't work.
From what I have read on SO, there are the encode and decode methods to do that. Also ANSI isn't really an encoding but rather refers to the standard encoding the current machine uses. This would most likely be cp1525 on my windows machine. I have messed around with all combinations of cp1252 and utf-8 as source and/or target, but to no avail. I always end up with a UnicodeEncodeError.
I have also tried using the chardet module to determine the encoding of my input string, but it requires bytes as input and b'ðŸ™ŒðŸ¼' is rejected with SyntaxError: bytes can only contain ASCII literal characters.

"TrÃ¤ume groÃŸ" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.
A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:
print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))
gives as expected:
"Träume groß"
But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.

You can use bytes() to convert a string to bytes, and then decode it with .decode()
>>> bytes("TrÃ¤ume groÃŸ", "cp1252").decode("utf-8")
'Träume groß'

chardet could probably be useful here -
Quoting straight from the docs
import urllib.request
rawdata = urllib.request.urlopen('http://yahoo.co.jp/').read()
import chardet
chardet.detect(rawdata) {'encoding': 'EUC-JP', 'confidence': 0.99}

UTF-8 Character Printing Error Python 2.7

I write codes that contain non-ASCII characters like this;
print "Öüç"
I know that Python's default encoding is ASCII. So I add this to my code.
#-*- coding:utf-8 -*-
When I launch my code, "Öüç" string appear like this;
├û├╝├ğ
What should I do?

That is only loosely related to Python. Even #-*- coding:utf-8 -*- is useless here: it is only meant to allow to use encoded unicode litterals in Python source.
It just allowed me to guess that your source was UTF-8 encoded, so "Öüç" is in fact the following string: '\xc3\x96\xc3\xbc\xc3\xa7'. And what you see is those characters in the code page 437.
I assume that you use a Windows system, and that the chcp command in a CMD windows would confirm you that the code page used is indeed 437.
What can be done? First you must select in the console a code page able to display the 3 characters, I would advise the 850 code page: chcp 850 before starting Python
Then in Python, you decode the UTF-8 string into unicode and encode it in cp850:
print "Öüç".decode("utf8").encode('cp850')
Alternatively, you can use the windows 1252 code page which is close to Latin1: chcp 1252 before starting Python and then:
print "Öüç".decode("utf8").encode('latin1')

SyntaxError: Non-ASCII character - Scrapy [duplicate]

Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.

I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.

Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-

First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals

The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.

Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !

You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.

Inconsistent output of unicode box-drawing characters in python IDLE

I have the following code:
# -*- coding: utf-8 -*-
print "╔╤╤╦╤╤╦╤╤╗"
print "╠╪╪╬╪╪╬╪╪╣"
print "╟┼┼╫┼┼╫┼┼╢"
print "╚╧╧╩╧╧╩╧╧╝"
print "║"
print "│"
and for some reason, only the third line (╚╧╧╩╧╧╩╧╧╝) actually outputs properly, the rest is an odd combination of symbols. I assume this is due to some encoding issues. The full output in IDLE is as follows:
â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
â• â•ªâ•ªâ•¬â•ªâ•ªâ•¬â•ªâ•ªâ•£
â•Ÿâ”¼â”¼â•«â”¼â”¼â•«â”¼â”¼â•¢
╚╧╧╩╧╧╩╧╧╝
â•‘
â”‚
What is causing this and how can I fix this? I'm using a tablet (Surface Pro 3 with Win10) with only a touch keyboard, so any solution with the least amount of typing (especially typing out weird characters) would be ideal, but obviously all help is appreciated.

Mojibake indicates that the text encoded in one encoding is shown in another incompatible encoding:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u"╔╤╤╦╤╤╦╤╤╗".encode('utf-8').decode('cp1252')) #XXX: DON'T DO IT
# -> â•”â•¤â•¤â•¦â•¤â•¤â•¦â•¤â•¤â•—
There are several places where the wrong encoding could be used.
# coding: utf-8 encoding declaration says how non-ascii characters in your source code (e.g., inside string literals) should be interpreted. If print u"╔╤╤╦╤╤╦╤╤╗" works in your case then it means that the source code itself is decoded to Unicode correctly. For debugging, you could write the string using only ascii characters: u'\u2554\u2557' == u'╔╗'.
print "╔╤╤╦╤╤╦╤╤╗" (DON'T DO IT) prints bytes (text encoded using utf-8 in this case) as is. IDLE itself works with Unicode (BMP). The bytes must be decoded into Unicode text before they can be shown in IDLE. It seems IDLE uses ANSI code page such as cp1252 (locale.getpreferredencoding(False)) to decode the output bytes on Windows. Don't print text as bytes. It will fail in any environment that uses a character encoding different from your source code e.g., you would get ΓòöΓòù... mojibake if you run the code from the question in Windows console that uses cp437 OEM code page.
You should use Unicode for all text in your program. Python 3 even forbids non-ascii characters inside a bytes literal. You would get SyntaxError there.
print(u'\u2554\u2557') might fail with UnicodeEncodeError if you would run the code in Windows console and OEM code page such as cp437 weren't be able to represent the characters. To print arbitrary Unicode characters in Windows console, use win-unicode-console package. You don't need it if you use IDLE.

Putting a u before the strings fixed the issue, as per #FredLarson's suggestion:
print u"╔╤╤╦╤╤╦╤╤╗"
print u"╠╪╪╬╪╪╬╪╪╣"
print u"╟┼┼╫┼┼╫┼┼╢"
print u"╚╧╧╩╧╧╩╧╧╝"
print u"║"
print u"│"
The exact cause still isn't known, since it seemed to work on other systems and it's odd that the third line worked fine.

hi-ascii characters python string

I am always perplexed with the whole hi-ascii handling in python 2.x. I am currently facing an issue in which I have a string with hiascii characters in it. I have a few questions related to it.
How can a string store hiascii characters in it (not a unicode string, but a normal str in python 2.x), which I thought can handle only ascii chars. Does python internally convert the hiascii to something else ?
I have a cli which I spawn as a subprocess from my python code, when I pass this string to the cli, it works fine. While, if I encode this string to utf-8, the cli fails( this string is a password, so it fails saying the password is invalid).
For the second point, I actually did a bit of research and found the following:
1) In windows(sucks), the command line args are encoded in mbcs (sys.getfilesystemencoding). The question I still don't get is, if I read the same string using raw_input, it is encoded in Windows console encoding(on EN windows, it was cp437).
I have a different question that am confused about now regarding Windows encoding. Is the windows sys.stdin.encoding different from Windows console encoding ?
If yes, is there a pythonic way to figure out what my windows console encoding is. I needed this because when I read input using raw_input, its encoded in Windows console encoding, and I want to convert it to say, utf-8. Note: I have already set my sys.stdin.encoding to utf-8, but it doesnt seem to make any effect in the read input.

To answer your first question, Python 2.x byte strings contain the source-encoded bytes of the characters, meaning the exact bytes used to store the string on disk in the source file. For example, here is a Python 2.x program where the source is saved in Windows-1252 encoding (Notepad's default on US Windows):
#!python2
#coding:windows-1252
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)
Output:
'\xe6\xfc\xff\x80\xe9\xea\xe8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'
The byte string contains the bytes that represent the characters in Windows-1252.
The Python decodes that same sequence of using the declared source encoding (!#coding:Windows-1252) into Unicode codepoints. Since Windows-1252 is very close to iso-8859-1, and iso-8859-1 is a 1:1 mapping to the first 0-255 Unicode codepoints, the code points are almost the same, except for the Euro character.
But save the source in a different encoding, and you'll get those bytes instead for the byte string:
#!python2
#coding:utf8
s = 'æüÿ€éêè'
u = u'æüÿ€éêè'
print repr(s)
print repr(u)
Output:
'\xc3\xa6\xc3\xbc\xc3\xbf\xe2\x82\xac\xc3\xa9\xc3\xaa\xc3\xa8'
u'\xe6\xfc\xff\u20ac\xe9\xea\xe8'
So, Python 2.X just gives you the source code bytes directly, without decoding them to Unicode codepoints, like a Unicode string would do.
Python 3.X notes that this is confusing, and just forbids non-ASCII characters in byte strings:
#!python3
#coding:utf8
s = b'æüÿ€éêè'
u = 'æüÿ€éêè'
print(repr(s))
print(repr(u))
Output:
File "C:\test.py", line 3
s = b'æüÿ\u20acéêè'
^
SyntaxError: bytes can only contain ASCII literal characters.
To answer your second question, please edit your question to show an example that demonstrates the problem.

Is the windows sys.stdin.encoding different from Windows console encoding?
Yes. There are two locale-specific codepages:
the ANSI code page, aka mbcs, used for strings in Win32 ...A APIs and hence for C runtime operations like reading the command line;
the IO code page, used for stdin/stdout/stderr streams.
These do not have to be the same encoding, and typically they aren't. In my locale (UK), the ANSI code page is 1252 and the IO code page defaults to 850. You can change the console code page using the chcp command, so you can make the two encodings match using eg chcp 1252 before running the Python command.
(You also have to be using a TrueType font in the console for chcp to make any difference.)
is there a pythonic way to figure out what my windows console encoding is.
Python reads it at startup using the Win32 API GetConsoleOutputCP and—unless overridden by PYTHONIOENCODING—writes that to the property sys.stdout.encoding. (Similarly GetConsoleCP for stdin though they will generally be the same code page.)
If you need to read this directly, regardless of whether PYTHONIOENCODING is set, you might have to use ctypes to call GetConsoleOutputCP directly.
Note: I have already set my sys.stdin.encoding to utf-8, but it doesnt seem to make any effect in the read input.
(How have you done that? It's a read-only property.)
Although you can certainly treat input and output as UTF-8 at your end, the Windows console won't supply or display content in that encoding. Most other tools you call via the command line will also be treating their input as encoded in the IO code page, so would misinterpret any UTF-8 sent to them.
You can affect what code page the console side uses by calling the Win32 SetConsoleCP/SetConsoleOutputCP APIs with ctypes (equivalent of the chcp command and also requires TTF console font). In principle you should be able to set code page 65001 and get something that is nearly UTF-8. Unfortunately long-standing console bugs usually make this approach infeasible.
windows(sucks)
yes.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.