below is the snippet in question:
print '{:─^10}'.format('') # Print '─' character 10 times
I'm using this to create nice console borders and such. The problem is, running this in my py file with # coding UTF-8 gives me: ValueError: Invalid conversion specification
If I run this same script in the python shell, it spits out the escaped characters: '\xc4\xc4\xc4\x...'
I don't know how to get my script to print the '─' character. It can print '─' just fine if I use print '─' directly, because of the UTF-8 encoding, but for some reason it won't work inside str.format().
Any suggestions? I'm sure this is probably easy to fix, but I'm very new to Python programming.
Thanks in advance.
Assuming you're using Python2, you need to use unicode (u'') strings:
print u'{:─^10}'.format(u'')
Related
1) How do I convert a variable with a string like "wdzi\xc4\x99czno\xc5\x9bci" into "wdzięczności"?
2) Also how do I convert string variable with characters like "±", "ę", "Ć" into correct letters?
I emphasise "variable" because all I got from googling were examples with u'some string' and the like, and I can't get anything like that to work.
I use "# -*- coding: utf-8 -*-" in the second line of my script and I still run into these problems.
Also, I was told that a simple print should output correctly, but it does not.
In Python 2.7 IDLE, I get this output:
>>> print "wdzi\xc4\x99czno\xc5\x9bci".decode('utf-8')
wdzięczności
Your first string appears to be a UTF-8 byte string, so all that's necessary is to decode it into a Unicode string. When Python prints that string, it will encode it back to the proper encoding based on your environment.
If you're using Python 3 then you have a string that has been decoded improperly and will need a little more work to fix the damage.
>>> print("wdzi\xc4\x99czno\xc5\x9bci".encode('iso-8859-1').decode('utf-8'))
wdzięczności
This is the first question I'm posting, so please pardon my ignorance. I am using Python to write to files and then read them back, using the usual suspects (file.read(), file.write()).
The code is being run on both windows and Linux.
A particular string I'm reading, say str, gives a length of 6 on Windows but 7 on Linux.
I tried exploring what this magic character is, but it turns out I can't print it!
If I try printing str, it gives the same results on Windows and Linux.
If I try printing str[6] on Linux, it prints blank!
I have verified that it is not a whitespace or newline (\n) character. I am even unable to print the ASCII value of this character. Are there characters out there without ASCII values?
I have found that strip() function eliminates this magic character but I am still curious as to what it is.
On Windows, the newline sequence is "\r\n", while on Linux it is "\n". The file was presumably written on Windows; when it is read back on Linux, where the "\r" is not stripped, that extra character remains at the end of the line, which is why the counts differ by one.
I am trying to make a random wiki page generator which asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters and I would like to display them in git bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is using
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
change your encode to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeError when an unprintable character is encountered
'ignore' - don't print the unencodable characters
'replace' - replace the unencodable characters with a ?
'xmlcharrefreplace' - replace unencodable characters with xml escape sequence
'backslashreplace' - replace unencodable characters with escaped unicode code point value
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
See the Python documentation on codec error handlers for more detail.
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely your confusion is that you read some code for Python 2.x, where bytes and str are the same type, and the type that print actually wants, and tried to use it in Python 3.x.
I'm trying to start learning Python, but I became confused from the first step.
I'm getting started with Hello, World, but when I try to run the script, I get:
Syntax Error: Non-UTF-8 code starting with '\xe9' in file C:\Documents and Settings\Home\workspace\Yassine frist stared\src\firstModule.py on line 5 but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details.
Add this as the first line of your file:
# -*- coding: utf-8 -*-
Put as the first line of your program this:
# coding: utf-8
See also Correct way to define Python source code encoding
First off, you should know what an encoding is. Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Now, the problem you are having is that most people write code in ASCII. Roughly speaking, that means they use Latin letters, numerals and basic punctuation only in the code files themselves. You appear to have used a non-ASCII character inside your program, which is confusing Python.
There are two ways to fix this. The first is to tell Python what encoding it should use to read the text file. You can do that by adding a # coding declaration at the top of the file. The second, and probably better, is to restrict yourself to ASCII in the code. Remember that you can always have whatever characters you like inside strings, by writing them in their encoded form as e.g. \x00 or whatever.
When you run the script from the command line, invoke it as python filename.py (with any command-line arguments after it), or you will also get this error. I mention this only because you said you were a beginner.
I'm trying to understand the difference in a bit of Python script behavior when run on the command line vs run as part of an Emacs elisp function.
The script looks like this (I'm using Python 2.7.1 BTW):
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape")
that is, [in general] take a JSON segment containing Unicode characters, dump it to its unicode-escaped version, then decode it back to its Unicode representation. When run on the command line, the dumps part of this returns:
'{"Foo": "\\u30b6"}'
which when printed looks like:
'{"Foo": "\u30b6"}'
the decode part of this looks like:
u'{"Foo": "\u30b6"}'
which when printed looks like:
{"Foo": "ザ"}
i.e., the original string representation of the structure, at least in a terminal/console that supports unicode (in my testbed, an xterm). In a Windows console, the output is not correct with respect to the unicode character, but the script does not error out.
In Emacs, the dumps conversion is the same as on the command line (at least as far as confirming with a print), but the decode part blows out with the dreaded:
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u30b6' in position 9: ordinal not in range(128)
I've a feeling I'm missing something basic here with respect to either the script or Emacs (in my testbed 23.1.1). Is there some auto-magic part of print invoking the correct codec/locale that happens at the command line but not in Emacs? I've tried explicitly setting the locale for an Emacs invocation (here's a stub test without the json logic):
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s'"
produces the same exception, while
"LC_ALL=\"en_US.UTF-8\" python -c 'import sys; enc=sys.stdout.encoding; print enc' "
indicates that the encoding is 'None'.
If I attempt to coerce the conversion using:
"LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Fooザ\"; print s.encode(\"utf8\",\"replace\")'"
the error goes away, but the result is the "garbled" version of the string seen in the non-unicode console:
Fooa?¶
Any ideas?
UPDATE: thanks to unutbu. Because the locale identification falls down, the command needs to be explicitly decorated with the utf-8 encode (see the answer for working directly with a unicode string). In my case, I am getting what I need from the dumps/decode sequence, so I add the required encode call to achieve the desired result:
import json; t = {"Foo":"ザ"}; print json.dumps(t).decode("unicode_escape").encode("utf8","replace")
Note that this is the "raw" Python without the necessary escaping required by Emacs.
As you may have guessed from looking at the original part of this question, I'm using this as part of some JSON formatting logic in Emacs -- see my answer to this question.
The Python wiki page, "PrintFails" says
When Python does not detect the desired character set of the output,
it sets sys.stdout.encoding to None, and print will invoke the "ascii"
codec.
It appears that when Python is run from an elisp function, it cannot detect the desired character set, so it defaults to "ascii". Printing Unicode then tacitly encodes it as ASCII, which is the reason for the error.
Replacing u\"Fooザ\" with u\"Foo\\u30b6\" seems to work:
(defun mytest ()
(interactive)
(shell-command-on-region (point)
(point) "LC_ALL=\"en_US.UTF-8\" python -c 's = u\"Foo\\u30b6\"; print s.encode(\"utf8\",\"replace\")'" nil t))
C-x C-e M-x mytest
yields
Fooザ