█ character string indexed in python - python

I'm trying to get the index of 'J' in a string that is similar to myString = "███ ███ J ██" so I use myString.find('J') but it returns a really high value and if I replace '█' by 'M' or another character of the alphabet I get a lower value. I don't really understand what's the cause of that.

Try myString = u"███ ███ J ██". The u prefix makes it a Unicode string; without it, Python 2.x treats the literal as a byte string, so find() counts bytes rather than characters.
If you are reading the text from a file or a file-like object, use file.read().decode('utf-8-sig') instead of a bare file.read().

To check your default encoding, run: python -c 'import sys; print(sys.getdefaultencoding())'
For Python 2.x the output is ascii, which is the default encoding for your programs. To handle non-ASCII characters, the developers provided a unicode type. See for yourself: create a variable myString = u"███ ███ J ██" and call .find('J') on it. The u prefix tells the interpreter that this is a Unicode string. You can then use the variable as if it were a normal str.
I've used "Unicode" in some places where I should have written "UTF-8". For the difference, check this great answer if you want to.
In Python 3.x, str is Unicode text by default, so this problem does not occur.

Check the settings of the console/ssh client you are using. Set it to be UTF-8.

Related

How does Python's "print" function work?

I'm interested in how Python's print function determines the string encoding, and how to handle it.
For example I've got the string:
str1 = u'\u041e\u0431\u044a\u0435\u043c'
print(str1) # Will be converted to Объем
What is going on under the hood of python?
Update
I'm interested in CPython 2.7 implementation of python
It uses the encoding in sys.stdout.encoding, which comes from the environment it's running in.
The u in front of the string makes a difference: it means the string is stored as Unicode, which can represent far more characters than plain ASCII.
In Python 3 the default encoding for source code is UTF-8, so you can simply include a Unicode character in a string literal; Python 2.7 assumes ASCII unless the file declares an encoding.
More info here
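A rough sketch of the step print performs on a Unicode string, written for Python 3. The fallback to UTF-8 when no encoding is reported is an assumption made for this example, not a description of CPython's exact behavior:

```python
import sys

str1 = u'\u041e\u0431\u044a\u0435\u043c'   # the Cyrillic word from the question

# The environment decides this value (terminal, locale, PYTHONIOENCODING).
# Assumption: fall back to UTF-8 if sys.stdout reports no encoding.
target = getattr(sys.stdout, 'encoding', None) or 'utf-8'

# Roughly what print does: text -> bytes in the output encoding.
# errors='replace' keeps the sketch from failing on a limited terminal.
encoded = str1.encode(target, errors='replace')

print(target)
print(len(str1), len(encoded))   # 5 characters; the byte count depends on target
```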

Python - printing "u" character for every list values [duplicate]

I tried the following on Codecademy's Python lesson
hobbies = []
# Add your code below!
for i in range(3):
    Hobby = str(raw_input("Enter a hobby:"))
    hobbies.append(Hobby)
print hobbies
With this, it works fine but if instead I try
Hobby = raw_input("Enter a hobby:")
I get [u'Hobby1', u'Hobby2', u'Hobby3']. Where are the extra us coming from?
The question's subject line might be a bit misleading: Python 2's raw_input() normally returns a byte string, NOT a Unicode string.
However, it could return a Unicode string if it or sys.stdin has been altered or replaced (by an application, or as part of an alternative implementation of Python).
Therefore, I believe @ByteCommander is on the right track with his comment:
Maybe this has something to do with the console it's running in?
The Python used by Codecademy is ostensibly 2.7, but (a) it was implemented by compiling the Python interpreter to JavaScript using Emscripten and (b) it's running in the browser; so between those factors, there could very well be some string encoding and decoding injected by Codecademy that isn't present in plain-vanilla CPython.
Note: I have not used Codecademy myself nor do I have any inside knowledge of its inner workings.
'u' means it's a Unicode string. You can also use raw_input().encode('utf8') to convert it to a byte string.
Edited:
I checked in Python 2.7: it returns a byte string, not a Unicode string. So the problem is something else here.
Edited:
raw_input() returns unicode if sys.stdin.encoding is unicode.
In the Codecademy Python environment, sys.stdin.encoding and sys.stdout.encoding are both None, and the default encoding scheme is ascii.
Python will use this default encoding only if it is unable to find a proper encoding scheme from the environment.
Where are the extra us coming from?
raw_input() returns Unicode strings in your environment
repr() is called for each item of a list if you print it (convert to string)
the text representation (repr()) of a Unicode string is the same as Unicode literal in Python: u'abc'.
that is why print [raw_input()] may produce: [u'abc'].
You don't see u'' in the first code example because str(unicode_string) calls the equivalent of unicode_string.encode(sys.getdefaultencoding()) i.e., it converts Unicode strings to bytestrings—don't do it unless you mean it.
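The point about repr() can be checked independently of Unicode. In this small sketch, Hobby is a hypothetical class whose only job is to make str() and repr() distinguishable:

```python
class Hobby(object):
    """Hypothetical class used only to tell str() and repr() apart."""

    def __str__(self):
        return 'str-form'

    def __repr__(self):
        return 'repr-form'


hobby = Hobby()
as_item = str(hobby)      # printing the object itself uses str()
as_list = str([hobby])    # a list formats each item with repr()

print(as_item)   # str-form
print(as_list)   # [repr-form]
```

The same mechanism is why a Unicode string inside a printed list shows its u'' literal form in Python 2.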
Can raw_input() return unicode?
Yes:
#!/usr/bin/env python2
"""Demonstrate that raw_input() can return Unicode."""
import sys
class UnicodeFile:
    def readline(self, n=-1):
        return u'\N{SNOWMAN}'
sys.stdin = UnicodeFile()
s = raw_input()
print type(s)
print s
Output:
<type 'unicode'>
☃
The practical example is win-unicode-console package which can replace raw_input() to support entering Unicode characters outside of the range of a console codepage on Windows. Related: here's why sys.stdout should be replaced.
May raw_input() return unicode?
Yes.
raw_input() is documented to return a string:
The function then reads a line from input, converts it to a string
(stripping a trailing newline), and returns that.
A string in Python 2 is either a bytestring or a Unicode string: isinstance(s, basestring).
CPython implementation of raw_input() supports Unicode strings explicitly: builtin_raw_input() can call PyFile_GetLine() and PyFile_GetLine() considers bytestrings and Unicode strings to be strings—it raises TypeError("object.readline() returned non-string") otherwise.
You could encode the strings before appending them to your list:
hobbies = []
# Add your code below!
for i in range(3):
    Hobby = raw_input("Enter a hobby:")
    hobbies.append(Hobby.encode('utf-8'))
print hobbies

python character set conversion by the compiler

C:\c>python -m pydoc wordspyth^A.split
no python documentation found for 'wordspyth\x01.split'
I understand that python documentation doesn't exist, but why does ^A convert to \x01?
Ctrl+A is a control character with value 1; such characters are echoed in hexadecimal by default, as they might break your prompt/terminal or be illegible.
That pydoc doesn't know about the non-standard function wordspyth doesn't mean there is no documentation.
As Anthon said, Ctrl+A is a non-printable character. When you add it to a string in Python and the string's representation is printed, Python converts such non-printable characters to printable escape sequences.
This was done in Python 3000 through http://legacy.python.org/dev/peps/pep-3138/
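The conversion can be reproduced without pydoc. This is a sketch only: it assumes the message is built from the string's repr(), and it uses the usual ASCII convention that Ctrl plus a letter keeps the low five bits of the letter's code:

```python
# Ctrl+A: keep the low five bits of 'A' (0x41), giving 0x01.
ctrl_a = chr(ord('A') & 0x1f)

name = 'wordspyth' + ctrl_a + '.split'
shown = repr(name)   # repr() escapes non-printable characters

print(ord(ctrl_a))   # 1
print(shown)         # 'wordspyth\x01.split'
```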

Is there any possible way to display accented characters in Python interpreter?

I am trying to make a random wiki page generator which asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters and I would like to display them in git bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is using
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
change your encode to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeEncodeError when an unencodable character is encountered
'ignore' - drop the unencodable characters
'replace' - replace the unencodable characters with a ?
'xmlcharrefreplace' - replace unencodable characters with an XML character reference
'backslashreplace' - replace unencodable characters with a backslash-escaped Unicode code point
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
Check here for more reference.
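To see what each handler does, here is a short sketch using the title from the question. Encoding to 'ascii' is an assumption chosen to force the failure; in real code the target would be sys.stdout.encoding:

```python
title = u'25\u201399'   # "25–99"; the en dash is what appeared as \xe2\x80\x93

# Assumption: 'ascii' stands in for a terminal encoding that lacks the dash.
results = {}
for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    results[handler] = title.encode('ascii', errors=handler)

try:
    title.encode('ascii', errors='strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

for handler in sorted(results):
    print(handler, results[handler])
print('strict raised:', strict_failed)
```

On Python 3 the four handlers yield b'2599', b'25?99', b'25&#8211;99' and b'25\\u201399' respectively, while 'strict' raises.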
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and the backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely your confusion is that you read some code for Python 2.x, where bytes and str are the same type, and the type that print actually wants, and tried to use it in Python 3.x.
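A compact way to see the difference in Python 3 without depending on the terminal (the names here are illustrative):

```python
text = 'abc\u00e9'             # 'abcé'
data = text.encode('utf-8')    # bytes, not text

bytes_shown = str(data)             # what print(data) would display
round_trip = data.decode('utf-8')   # back to the original text

print(bytes_shown)           # b'abc\xc3\xa9'
print(round_trip == text)    # True
```

Passing text straight to print, or decoding bytes before printing, avoids the b'...' form.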

Print the "approval" sign/check mark (✓) U+2713 in Python

How can I print the check mark sign "✓" in Python?
It's the sign for approval, not a square root.
You can print any Unicode character using an escape sequence. Make sure to make a Unicode string.
print u'\u2713'
Since Python 2.1 you can use \N{name} escape sequence to insert Unicode characters by their names. Using this feature you can get check mark symbol like so:
$ python -c "print(u'\N{check mark}')"
✓
Note: For this feature to work in Python 2 you must use a Unicode string literal, which is why the u prefix is used. In Python 3 the prefix is not required, since string literals are Unicode by default.
Solution defining python source file encoding:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
print '✓'
http://ideone.com/dTW5D8
Like this:
print u'\u2713'.encode('utf8')
The encoding should match the one of your terminal (or wherever you are sending output to).
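The approaches above refer to the same character. A small sketch, written for Python 3, that compares them without relying on the terminal being able to display the symbol:

```python
import unicodedata

by_code_point = u'\u2713'       # escape by code point
by_name = u'\N{CHECK MARK}'     # escape by Unicode name
utf8_bytes = by_code_point.encode('utf-8')

print(by_code_point == by_name)          # True
print(unicodedata.name(by_code_point))   # CHECK MARK
print(utf8_bytes)                        # b'\xe2\x9c\x93'
```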
