python character set conversion by the compiler - python

C:\c>python -m pydoc wordspyth^A.split
no python documentation found for 'wordspyth\x01.split'
I understand that python documentation doesn't exist, but why does ^A convert to \x01?

Ctrl+A is a control character with value 1. Such characters are echoed as hexadecimal escapes by default, because printing them raw might break your prompt or terminal, or would be illegible.
The fact that pydoc doesn't know about the non-standard function wordspyth doesn't mean there is no documentation for it.

Like Anthon said, Ctrl+A is a non-printable character. When such a character is part of a string and Python shows the string's repr() (as the pydoc message does), non-printable characters are replaced by printable escape sequences such as \x01.
This was done in Python 3000 through http://legacy.python.org/dev/peps/pep-3138/
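The escaping can be reproduced directly with repr(). This is a minimal sketch, assuming Python 3; the variable names are only illustrative, and the pydoc output above is consistent with the name being formatted this way.

```python
# The name pydoc received: "wordspyth", then Ctrl+A (code point 1), then ".split"
name = "wordspyth\x01.split"

# repr() escapes the non-printable control character as \x01
shown = repr(name)
print(shown)

# PEP 3138: repr() keeps printable non-ASCII characters as they are,
# while ascii() escapes everything outside ASCII
kept = repr("\u03b8")      # the quoted Greek letter theta, unescaped
escaped = ascii("\u03b8")  # the quoted escape sequence \u03b8
print(escaped)
```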

Related

python3 shell mode can output an utf-8 character for some bytes and cannot for some other, what is the reason?

What I already know:
b'\xce\xb8'.decode('UTF-8') gives 'θ', because the decode() function is designed for this job: decoding bytes.
What I want to know is: does Python 3 shell mode have some default config that controls the following behavior (Python 3)?
>>> sys.getdefaultencoding()
'utf-8'
>>> b'\xce\xb8'.decode()
'θ'
>>> b'\xce\xb8'
b'\xce\xb8'
>>> b'\x41'
b'A'
>>> print(b'\xce\xb6')
b'\xce\xb6'
>>> print(b'\xce\xb6'.decode('utf8'))
ζ
It seems like shell mode uses ASCII as the default encoding rather than UTF-8.
The question is: is this true? If yes, where is that config located?
This has nothing to do with the encoding. Python is just showing you in the shell what the value is that you just gave it, in a more literal sense. Try this instead:
a = b'\xce\xb8'
print(a)
result (on Python 2 with a UTF-8 terminal; Python 3 prints b'\xce\xb8' instead):
θ
So 'a' is indeed encoded as UTF-8, just as you expected. You're just misinterpreting what Python is echoing back to the console.
BTW, I also think you're not doing what you think you are with the 'b' prefix. It appears you're using Python 2.X. In that version of Python, the 'b' prefix is ignored. I know that because it doesn't show up in the echoed result. See here:
Python 2.x:
>>> b'\xce\xb8'
'\xce\xb8'
Python 3.X
>>> b'\xce\xb8'
b'\xce\xb8'
So in Python 2.X, you'll get the same result with and without the 'b'. In Python 3.X, you get different behavior either way than what you get in Python 2.X. I haven't done much with Python 3.X, but I believe that this is because how strings are represented changed in 3.X.
PS: If you really just care how Python is echoing strings back to you, I don't know that there's a way to change that. I wonder, however, why that matters to you.
Python 3 displays each byte as the equivalent ASCII character if the value of the byte is within the printable ASCII range; otherwise it displays an escape sequence (usually the escaped hex value).
From the docs for the bytes type:
Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.
This is a deliberate design decision (from the same doc)
to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data
The interpreter doesn't display characters for bytes outside the ASCII range because it cannot know whether the bytes are encoded as UTF-8, some other encoding, or even if they represent text data at all.
As user Steve points out in their answer, this behaviour is not related to encoding. It is not configurable; if you want to see the characters corresponding to a UTF-8 encoded bytestring, decode to str.
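A small sketch of that behaviour, assuming Python 3 (the variable names are illustrative). It shows that the display of a bytes object is independent of any encoding, and that the same bytes decode to different text depending on the codec you choose:

```python
data = b'\xce\xb8'  # the UTF-8 encoding of the Greek letter theta

# The repr of bytes shows printable ASCII as characters and the rest as escapes
print(repr(data))      # b'\xce\xb8'
print(repr(b'\x41'))   # b'A'

# Decoding is an explicit choice; the interpreter never guesses the codec
as_utf8 = data.decode('utf-8')      # one character: U+03B8
as_latin1 = data.decode('latin-1')  # two characters: U+00CE, U+00B8
print(len(as_utf8), len(as_latin1))
```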

How does Python's "print" function work?

I'm interested in how Python's print function determines the string's encoding, and how it handles it.
For example I've got the string:
str1 = u'\u041e\u0431\u044a\u0435\u043c'
print(str1)  # Will be printed as Объем
What is going on under the hood of python?
Update
I'm interested in CPython 2.7 implementation of python
It uses the encoding in sys.stdout.encoding, which comes from the environment it's running in.
The u in front of the string makes a difference. The 'u' prefix means the string is a unicode string, which can represent far more characters than plain ASCII can.
In Python 3 the default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal. In Python 2.7 the default source encoding is ASCII, so a non-ASCII literal needs an encoding declaration at the top of the file.
More info here
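To illustrate the role of the output stream's encoding, here is a minimal sketch. It assumes Python 3 (the question is about CPython 2.7, where the idea is similar but the types differ), and `fake_stdout` is a hypothetical stand-in for a terminal with a known encoding:

```python
import io
import sys

str1 = '\u041e\u0431\u044a\u0435\u043c'  # the five Cyrillic letters from the question

# The encoding print() will use for the real terminal depends on the environment
print(getattr(sys.stdout, 'encoding', None))

# Simulate a UTF-8 terminal: a text layer that encodes into a byte buffer
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')
print(str1, file=fake_stdout)
fake_stdout.flush()

written = raw.getvalue()  # the bytes that would reach the terminal
print(written)
```

The text layer, not print() itself, turns the characters into bytes, which is why the same program can behave differently in different terminals.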

Is there any possible way to display accented characters in Python interpreter?

I am trying to make a random wiki page generator which asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters and I would like to display them in git bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is using
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
change your encode to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeError when an unencodable character is encountered
'ignore' - drop the unencodable characters
'replace' - replace the unencodable characters with a ?
'xmlcharrefreplace' - replace unencodable characters with an XML character reference
'backslashreplace' - replace unencodable characters with a backslash escape of the Unicode code point
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
Check here for more reference.
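The handlers listed above can be compared side by side. This is a sketch assuming Python 3, with 'ascii' used as a stand-in for a terminal encoding that cannot represent the en dash in the title from the question:

```python
title = '25\u201399'  # "25", an en dash (U+2013), then "99"

# Encode with each non-strict handler and collect the results
results = {}
for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    results[handler] = title.encode('ascii', errors=handler)
    print(handler, results[handler])

# 'strict' (the default) raises instead of producing output
try:
    title.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```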
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and the backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely your confusion is that you read some code for Python 2.x, where bytes and str are the same type, and the type that print actually wants, and tried to use it in Python 3.x.
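A short sketch of the difference, assuming Python 3; the names are illustrative:

```python
title = 'abc\u00e9'              # "abc" followed by e with acute accent
encoded = title.encode('utf-8')  # a bytes object, not text

# print() calls str() on its argument; for bytes that is the b'...' form
as_printed = str(encoded)
print(as_printed)

# If you only have bytes (e.g. from a socket), decode them before printing
round_trip = encoded.decode('utf-8')
print(round_trip == title)
```

When you already have a str, as with the parsed JSON title in the question, passing it straight to print() avoids the problem.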

Operating on character '№'

I want to search and replace the character '№' in a string.
I am not sure if it's actually a single character or two.
How do I do it? What is its unicode?
If it's any help, I am using Python3.
EDIT: The sentence "I am not sure if it's actually a single character or two" kind of deformed my question. I actually wanted to know its unicode so that I could use the code instead of pasting the character in my python script.
In Python 3 it is always a single character.
>>> 'foo№bar'.replace('№', '#')
'foo#bar'
That character is U+2116 NUMERO SIGN.
You can just type it directly in your source file, taking care to specify the source file encoding as per PEP 263.
Alternatively, you can use either the numeric Unicode escapes, or the more readable named Unicode escapes:
>>> 'foo\u2116'
'foo№'
>>> 'foo\N{NUMERO SIGN}'
'foo№'
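If you have the character but not its code point or name, the standard library can tell you. A minimal sketch, assuming Python 3:

```python
import unicodedata

ch = '№'  # pasted once, just to look it up

# It is a single character; ord() gives its code point
code_point = ord(ch)
label = 'U+%04X' % code_point
name = unicodedata.name(ch)
print(len(ch), label, name.encode('ascii'))

# Both spellings produce the same character without pasting it
from_escape = '\u2116'
from_name = unicodedata.lookup('NUMERO SIGN')

replaced = 'foo\u2116bar'.replace('\u2116', '#')
print(replaced)
```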

How to get rid of non-ascii characters in Perl & Python [both]?

How do I get rid of non-ASCII characters like "^L,¢,â" in Perl and Python? I get these special characters while parsing PDF files in Python and Perl. I now have text versions of these PDF files, but they contain these special characters. Is there a function that ensures a file or a variable does not contain any non-ASCII characters?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', errors='ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.
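The two snippets above are written for Python 2. A rough Python 3 equivalent of the filtering step might look like the sketch below; `printable_ascii_only` is a hypothetical helper name, and str.isascii() requires Python 3.7 or later:

```python
def printable_ascii_only(text):
    """Keep only printable ASCII characters (space through tilde)."""
    # isascii() rejects code points above 127; isprintable() rejects
    # control characters such as form feed (\x0c)
    return ''.join(c for c in text if c.isascii() and c.isprintable())

cleaned = printable_ascii_only('abc\x0c¢â')
print(cleaned)
```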
For the sake of completeness, here are some Perl solutions. Both return ",,". Unlike the accepted Python answer, I have used no magic numbers like 32 or 128. The constants here can be looked up much more easily in the documentation.
use 5.014; use Encode qw(encode); encode('ANSI_X3.4-1968', "\cL,¢,â", sub{q()}) =~ s/\p{PosixCntrl}//gr;
use 5.014; use Unicode::UCD qw(charinfo); join q(), grep { my $u = charinfo ord $_; 'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category} } split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since I have the errors flag on "ignore", it just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the Python 3 example above I get a bytes object back; I would need to decode it back to str if I needed a string object.
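As a sketch of the codecs.register_error approach mentioned above, assuming Python 3: the handler name "mark_unencodable" and the function `replace_with_marker` are made up for this example, and the choice of "?" as the replacement is arbitrary.

```python
import codecs

def replace_with_marker(error):
    """Error handler: substitute one '?' per unencodable character."""
    if isinstance(error, UnicodeEncodeError):
        # Return the replacement text and the position to resume encoding at
        return ('?' * (error.end - error.start), error.end)
    raise error  # leave decoding and translation errors alone

# Register the handler under a (hypothetical) name usable with errors=...
codecs.register_error('mark_unencodable', replace_with_marker)

text = 'hello swede åäö'
ascii_bytes = text.encode('ascii', 'mark_unencodable')
ascii_text = ascii_bytes.decode('ascii')  # back to str, now pure ASCII
print(ascii_text)
```

Because every substitution passes through the handler, you could also log or count the replaced characters there instead of silently dropping them.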
