Hi, I want to know how I can append and then print extended ASCII codes in Python.
I have the following:
code = chr(247)
li = []
li.append(code)
print li
The result Python prints out is ['\xf7'] when it should be a division symbol. If I simply print code directly ("print code") then I get the division symbol, but not if I append it to a list. What am I doing wrong?
Thanks.
When you print a list, it outputs the default representation of all its elements, i.e. by calling repr() on each of them. The repr() of a string is its escaped form, by design. If you want to output the elements of the list properly, you should convert the list to a string first, e.g. via ', '.join(li).
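A minimal sketch of the difference (Python 3 syntax; under Python 2 the list display would show ['\xf7'], because its repr escapes non-ASCII bytes):

```python
code = chr(247)          # the division sign, '÷'
li = [code]

# Printing the list shows each element's repr(), quotes and all;
# join() concatenates the characters themselves.
print(li)                # list display of the repr (Python 2: ['\xf7'])
print(', '.join(li))     # prints: ÷
```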
Note that as those in the comments have stated, there isn't really any such thing as "extended ASCII", there are just various different encodings.
You probably want the charmap encoding, which lets you turn unicode into bytes without 'magic' conversions.
s = '\xf7'
b = s.encode('charmap')
with open('/dev/stdout', 'wb') as f:
    f.write(b)
    f.flush()
This will print ÷ on my system.
Note that 'extended ASCII' refers to any of a number of proprietary extensions to ASCII, none of which were ever officially adopted and all of which are incompatible with each other. As a result, the symbol output by that code will vary based on the controlling terminal's choice of how to interpret it.
There's no single defined standard named "extended ASCII codes" - there are, however, plenty of characters, tens of thousands of them, as defined in the Unicode standard.
You may be limited to the charset encoding of your text terminal, which you may think of as "extended ASCII", but which might be "latin-1", for example. (If you are on a Unix system such as Linux or Mac OS X, your text terminal will likely use the UTF-8 encoding, and can display any of the tens of thousands of characters available in Unicode.)
So, you should read this piece in order to understand what text is after 1992; if you build any production application believing in "extended ASCII", you are harming yourself, your users and the whole ecosystem at once: http://www.joelonsoftware.com/articles/Unicode.html
That said, Python 2's (and Python 3's) print calls an implicit str conversion on the objects passed in. If you pass it a list, this conversion does not recursively call str on each list element; instead, it uses each element's repr, which displays non-ASCII characters as numeric escapes or other unsuitable notations.
You can simply join your desired characters in a unicode string, for example, and then print them normally, using the terminal encoding:
import sys
mytext = u""
mytext += unichr(247) #check the codes for unicode chars here: http://en.wikipedia.org/wiki/List_of_Unicode_characters
print mytext.encode(sys.stdout.encoding, errors="replace")
You are doing nothing wrong.
What you are doing is adding a string of length 1 to a list.
This string contains a character outside the range of printable characters, and outside of ASCII (which is only 7-bit). That's why its representation looks like '\xf7'.
If you print it, it will be rendered as well as the system can manage.
In Python 2, the byte is simply written out. The resulting output may be the division symbol, or something else entirely, depending on what your system's encoding is.
In Python 3, it is a unicode character and will be processed according to how stdout is set up. Normally, this indeed should be the division symbol.
In a representation of a list, the __repr__() of the string is called, leading to what you see.
Related
I'm interested in how Python's print function determines the string's encoding, and how to handle it.
For example I've got the string:
str1 = u'\u041e\u0431\u044a\u0435\u043c'
print(str1) # Will be printed as Объем
What is going on under the hood of python?
Update
I'm interested in the CPython 2.7 implementation of Python.
It uses the encoding in sys.stdout.encoding, which comes from the environment it's running in.
The u in front of the string makes a difference: it means the string is represented as unicode. It is a way to represent more characters than normal ASCII can manage.
The default encoding for Python source code is UTF-8, so you can simply include a Unicode character in a string literal.
More info here
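A small sketch of what print is doing under the hood (Python 3 syntax; under Python 2 the same sys.stdout.encoding attribute exists): the stream encoding is picked up from the environment, and the text is encoded with it before being written out.

```python
import sys

# print() encodes text with the stream's encoding, which Python derives
# from the environment (locale, PYTHONIOENCODING, ...).
print(sys.stdout.encoding)

s = u'\u041e\u0431\u044a\u0435\u043c'   # 'Объем'
# These are the bytes that actually reach a UTF-8 terminal:
print(s.encode('utf-8'))
```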
I need to know how many displayable characters are in a unicode string containing Japanese / Chinese characters.
Sample code to make the question very obvious:
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<<
note that four characters are displayed
How can I know, from the string, that 4 characters are going to be displayed?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes; len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode them. The most popular encoding is UTF-8. In UTF-8, one code point can take from 1 to 4 bytes. But you don't need to remember that; just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 Unicode code points.
Print it to see the printable version:
>>> print str.decode('utf8')
睡眠時間
And get the number of Unicode code points:
>>> len(str.decode('utf8'))
4
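The same byte-count versus character-count distinction, written out for Python 3 (where the literal would be a bytes object to begin with):

```python
data = b'\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
text = data.decode('utf-8')

print(len(data))   # 12 -- bytes in the UTF-8 encoding
print(len(text))   # 4  -- Unicode code points
print(text)        # 睡眠時間
```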
UPDATE: See also abarnert's answer, which covers all the remaining cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function gives you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode doesn't really define "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
import unicodedata

def displayable(c):
    return not unicodedata.category(c).startswith('C')

p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a nonspacing combining mark followed by a letter count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is that same character as a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
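The combining-mark case can be checked directly with unicodedata.normalize (a small sketch; NFC composes the two-code-point form into the single precomposed code point):

```python
import unicodedata

decomposed = 'C\u0327'    # 'C' followed by a combining cedilla: 2 code points
composed = '\u00c7'       # 'Ç' as one precomposed code point

print(len(decomposed), len(composed))                        # 2 1
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
```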
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.
So I'm using TwitterSearch Library. The function is simple, to print Twitter search result.
So here's the trouble. Your tweet is passed by TwitterSearch from this dict (or list; whatever the actual type is):
tweet['text']
And if your Python 2.7 hits a unicode character that it can't handle, BOOM, the program errors out.
So I tried to make it like this:
a=unicode(tweet['text'], errors='ignore')
print a
The purpose is that I want the unicode converted to a string, while ignoring unresolved characters in the process (this is what I understand from the documentation; I may have failed to understand it, which is how I came up with this code).
I got this cute Error message.
TypeError: decoding Unicode is not supported
My question
1: Why? Isn't this Unicode stuff part of the default Python library?
2: What should I do so that I can have the unicode converted to a string, while ignoring unresolved characters in the process?
PS: This is my first unicode issue and this is the best I can do at this point. Don't kill me.
You need to understand the distinction between Unicode objects and byte strings. In Python 2.7, instances of the unicode class are Unicode objects. These already consist of characters defined in the Unicode standard. From the evidence you've provided, your tweet['text'] is already a unicode instance.
You can verify this by printing type(tweet['text']):
>>> print type(tweet['text'])
<type 'unicode'>
Now, unicode objects are a high-level representation of a concept that does not have a single defined representation in computer memory. They are very useful, as they allow you to use characters outside of the ASCII range, which is limited to basic Latin letters and numbers. But a character in Unicode is not stored by the computer as its shape; instead, we use the numbers assigned to it by the standard, referred to as code points.
On the other hand pretty much every part of your computer operates using bytes. Network protocols transfer bytes, input and output streams transfer bytes. To be able to send a Unicode string across the network or even print it on a device such as your terminal you need to use a protocol that both communicating parties (e.g. your program and the terminal) understand. We call these encodings.
>>> u'żółw'.encode('utf-8')
'\xc5\xbc\xc3\xb3\xc5\x82w'
>>> print type(u'żółw'.encode('utf-8'))
<type 'str'>
There are many encodings, and a single unicode object can often be encoded into many different byte strings depending on the encoding you choose. Picking the correct one requires knowing the context the resulting string will be used in. If your terminal understands UTF-8, then all unicode objects will require encoding to UTF-8 before being sent to the output stream. If it only understands ASCII, then you might need to drop some of the characters.
>>> print u'żółw'.encode('utf-8')
żółw
So if Python's default output encoding is either incorrect or cannot handle all the characters you're trying to print, you can always encode the object manually and output the resulting str instead. But before you do, please read all of the documents linked to in comments directly under your question.
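A compact round-trip check of the two types (a sketch in Python 3 naming, where unicode is called str and the byte-string type is bytes; under Python 2 the type names would be unicode and str):

```python
text = u'\u017c\u00f3\u0142w'    # 'żółw' as a Unicode object
data = text.encode('utf-8')      # its byte-string form

print(type(text).__name__, type(data).__name__)  # str bytes (Python 3)
print(data)                                      # b'\xc5\xbc\xc3\xb3\xc5\x82w'
print(data.decode('utf-8') == text)              # encoding round-trips: True
```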
I am trying to make a random wiki page generator which asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters and I would like to display them in git bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is using
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
change your encode to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeError when an unencodable character is encountered
'ignore' - don't print the unencodable characters
'replace' - replace the unencodable characters with a ?
'xmlcharrefreplace' - replace unencodable characters with xml escape sequence
'backslashreplace' - replace unencodable characters with escaped unicode code point value
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
Check here for more reference.
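Here is what each error handler produces for the en dash from the question (a sketch using ASCII as the deliberately too-small target encoding):

```python
s = u'25\u201399'   # '25–99', the en dash that showed up as \xe2\x80\x93

# Each handler deals differently with the character ASCII can't encode:
for errors in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace'):
    print(errors, s.encode('ascii', errors=errors))
```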
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely your confusion is that you read some code written for Python 2.x, where bytes and str are the same type (and the type that print actually wants), and tried to use it in Python 3.x.
I am retrieving a value that is set by another application from memcached using python-memcached library. But unfortunately this is the value that I am getting:
>>> mc.get("key")
'\x04\x08"\nHello'
Is it possible to parse this mixed ASCII code into plain string using python function?
Thanks heaps for your help
It is a "plain string", to the extent that such a thing exists. I have no idea what kind of output you're expecting, but:
There ain't no such thing as plain text.
The Python str type (in 2.x, anyway) is really a container for bytes, not characters. So it isn't really text in the first place :) It is displayed under the assumption of a very simple encoding, using escape sequences to represent every byte that's even slightly "weird". It would be formatted differently again if you printed the string (what you're seeing right now is the syntax for creating such a literal string in your code).
In simpler times, we naively assumed that we could just map bytes to these symbols we call "characters", and that would be that. Then it turned out that there were approximately a zillion different mappings that people wanted to use, and lots of them needed more symbols than a byte could represent. Which is why we have Unicode now: it represents every symbol you could conceivably need for any real-world language (and several for fake languages and other purposes), and it abstractly assigns numbers to those symbols but does not say how to collect and interpret the bytes as numbers. (That is the purpose of the encoding).
If you know that the string data is encoded in a particular way, you can decode it to a Unicode string. It could either be an encoding of actual Unicode data, or it could be in some other format (for example, Japanese text is often found in something called "Shift-JIS", because it has approximately the same significance to them as "Latin-1" - a common extension of ASCII - does to us). Either way, you get an in-memory representation of a series of Unicode code points (the numbers referred to in the previous paragraph). This, for all intents and purposes, is really "text", but it isn't really "plain" :)
But it looks like the data you have is really a binary blob of bytes that simply happens to consist mostly of "readable text" if interpreted as ASCII.
What you really need to do is figure out why the first byte has a value of 4 and the next byte has a value of 8, and proceed accordingly.
If you just need to trim the '\x04\x08"\n', and it's always the same (you haven't put your question very clearly, I'm not certain if that's what it is or what you want), do something like this:
to_trim = '\x04\x08"\n'
string = mc.get('key')
if string.startswith(to_trim):
    string = string[len(to_trim):]
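The trimming logic can be exercised against the sample value from the question (whether that prefix really is constant depends on whatever application wrote the key, so treat this as a sketch):

```python
to_trim = '\x04\x08"\n'
value = '\x04\x08"\nHello'   # sample value from the question

# Strip the leading binary prefix only if it is actually present.
trimmed = value[len(to_trim):] if value.startswith(to_trim) else value
print(trimmed)   # Hello
```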