I'm using the Python pycurl module to download content from various web pages. Since I also want to support potential unicode text, I've been avoiding the cStringIO.StringIO function, which according to the Python docs (cStringIO - Faster version of StringIO):
Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.
... does not support unicode strings. More precisely, it states that it does not support unicode strings that cannot be encoded as plain ASCII strings. Can someone please clarify this for me? Which strings can and which cannot be converted?
I've tested with the following code and it seems to work just fine with unicode:
import pycurl
import cStringIO
downloadedContent = cStringIO.StringIO()
curlHandle = pycurl.Curl()
curlHandle.setopt(pycurl.WRITEFUNCTION, downloadedContent.write)
curlHandle.setopt(pycurl.URL, 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')
curlHandle.perform()
content = downloadedContent.getvalue()
fileHandle = open('unicode-test.txt', 'w')
for char in content:
    fileHandle.write(char)
fileHandle.close()
And the file is correctly written. I can even print the whole content in the console; all characters show up fine... So what I'm puzzled about is: where does cStringIO fail? Is there any reason why I should not use it?
[Note: I'm using Python 2.6 and need to stick to this version]
Any text that uses only ASCII codepoints (byte values 00-7F hexadecimal) can be encoded as ASCII. Roughly speaking, any text that uses characters beyond those common in American English (accented letters, non-Latin scripts, typographic quotes, and so on) is not ASCII.
In your example code, you are not converting the input to Unicode text; you are treating it as un-decoded bytes. The test page in question is encoded in UTF-8, and you never decode that to Unicode.
If you were to decode the value to a Unicode string, you wouldn't be able to store that string in a cStringIO object (unless it happened to contain only ASCII characters).
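A minimal Python 2 sketch of exactly where it breaks (the buffer name is just illustrative):
import cStringIO

buf = cStringIO.StringIO()
buf.write(u'plain ascii')    # accepted: implicitly encoded to ASCII
buf.write(u'\u4e2d\u6587')   # raises UnicodeEncodeError: not encodable as ASCII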
You may want to read up on the difference between Unicode and text encodings such as ASCII and UTF-8. I can recommend:
Joel Spolsky's minimum Unicode article
The Python Unicode HOWTO.
Related
I am using a package in Python that returns a string using ASCII characters as opposed to unicode (e.g. it returns 'serÃ©' as opposed to 'seré').
Given this is Python 3.8, the string is actually unicode; the package just seems to output it as if it were ASCII. As such, when I try to perform x.decode('utf-8') or x.encode('ascii'), neither works. Is there a way to make Python treat the string as if it were ASCII, such that I can decode it to unicode? Or is there a package that can serve this purpose?
I am relatively new to python so I apologise if my explanation is unclear. I am happy to clarify things if needed.
Code
from spanishconjugator import Conjugator as c
verb = c().conjugate('pasar', 'preterite', 'indicative', 'yo')
print(verb)
This returns the string 'pasÃ©' where it should return 'pasé'.
Update
From further searching and from your answers, it appears to be an issue with single 2-byte UTF-8 characters (é) being literally interpreted as two 1-byte latin-1 characters (Ã©) (nothing to do with ASCII, my mistake).
Managed to fix it with:
verb.encode('latin-1').decode('utf-8')
Thank you to those that commented.
If the input string contains the raw byte ordinals (such as \xc3\xa9/Ã© instead of é), use latin1 to encode it to bytes verbatim, then decode with the desired encoding.
>>> "pasÃ©".encode('latin1').decode()
'pasé'
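Step by step, the same repair looks like this (a sketch; the variable names are just for illustration):
mojibake = 'pasÃ©'                    # é's UTF-8 bytes mis-decoded as latin-1
raw = mojibake.encode('latin-1')      # back to the original bytes b'pas\xc3\xa9'
fixed = raw.decode('utf-8')           # decode those bytes as UTF-8 this time
print(fixed)                          # pasé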
I'm using the Python 2.x library email to iterate over some .eml files, but I have Python 3.x installed.
I extract the filename in the header of each payload (attachment) using .get_filename(). Encoding is not set in the header and thus I believe Python 3.x interprets the returned string as utf-8. The string however looks like this, when it contains special characters, e.g. like "ø":
=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?=
I have failed in numerous ways to convert this string to utf-8: converting it to bytes or not, and decoding and encoding with latin-1, ISO-8859-1 (which should be the same, though) and utf-8.
I've also tried using:
ast.literal_eval(r"b'=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='")
and decoding that, but it still returns the original string containing the encoded characters.
How does one go about this?
You are handling email, so you can use email handling functions:
Try with https://docs.python.org/3.5/library/email.header.html.
The last example (and the second one) shows the idea; it's a very small module:
>>> from email.header import decode_header
>>> decode_header('=?iso-8859-1?q?p=F6stal?=')
[(b'p\xf6stal', 'iso-8859-1')]
There is also a version for Python 2.7.
So for your case:
from email.header import decode_header

subj = '=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='
subject, encoding = decode_header(subj)[0]
print(subject.decode(encoding))   # prints: Spørgeskema.doc
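Note that decode_header can return several parts, and the charset can be None for plain-ASCII segments; a slightly more defensive sketch (the helper name is just illustrative):
from email.header import decode_header

def decode_mime_header(value):
    # Each part is a (payload, charset) pair; the payload may be bytes or str.
    parts = []
    for payload, charset in decode_header(value):
        if isinstance(payload, bytes):
            payload = payload.decode(charset or 'ascii', errors='replace')
        parts.append(payload)
    return ''.join(parts)

print(decode_mime_header('=?ISO-8859-1?Q?Sp=F8rgeskema=2Edoc?='))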
Say I have a source file encoded in UTF-8. When the Python interpreter loads that source file, will it convert the file content to unicode in memory and then try to evaluate the source code as unicode?
If I have a string with a non-ASCII character in it, like
astring = '中文'
and the file is encoded in gbk.
Running that file with Python 2, I found that the string is actually still raw GBK bytes.
So I suspect the Python 2 interpreter does not convert source code to unicode, because if it did, the string content would be in unicode (I heard it is actually UTF-16).
Is that right? And if so, how about the Python 3 interpreter? Does it convert source code to unicode format?
Actually, I know how to define unicode and raw strings in both Python 2 and 3.
I'm just curious about one detail when the interpreter loads source code.
Will it convert the WHOLE raw source code (encoded bytes) to unicode at the very beginning and then try to interpret the unicode-format source code piece by piece?
Or does it instead load the raw source piece by piece and only decode what it thinks it should? For example, when it hits the statement u'中文', OK, decode it to unicode; when it hits the statement b'中文', OK, no need to decode.
Which way the interpreter will go?
If your source file is encoded with GBK, put this line at the top of the file (first or second line):
# coding: gbk
This is required for both Python 2 and 3.
If you omit this encoding declaration, the interpreter will assume ASCII in the case of Python 2, and UTF-8 for Python 3.
The encoding declaration controls how the interpreter reads the bytes of the source file. This is mainly relevant for string literals (like in your example), but theoretically also applies to comments and even identifiers (it's probably not a good idea to use non-ASCII in identifiers, though).
As for the question whether you get byte strings or unicode strings: this depends on the syntax, not on the choice and declaration of encoding.
As pointed out in Ignacio's answer, if you want to have unicode strings in Python 2, you need to use the u'...' notation.
In Python 3, the u prefix is optional.
So, with a correct encoding declaration in the file header, it is sufficient to write astring = '中文' to get a correct unicode string in Python 3.
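For example, a file saved in GBK with the declaration from above behaves like this in Python 3 (a sketch):
# coding: gbk
# The declaration tells Python 3 to decode the literal below from GBK,
# so astring ends up as a proper (unicode) str object.
astring = '中文'
print(type(astring), astring)   # <class 'str'> 中文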
Update
In a comment, the OP asks about the interpretation of b'中文'.
In Python 3, this isn't allowed (byte strings can only contain ASCII characters), but you can test this yourself in Python 2.x:
# coding: gbk
x = b'中文'
y = u'中文'
print repr(x)
print repr(y)
This will produce:
'\xd6\xd0\xce\xc4'
u'\u4e2d\u6587'
The first line reflects the actual bytes contained in the source file (if you saved it with GBK, of course).
So there seems to be no decoding happening for b'中文'.
However, I don't know how the interpreter internally represents the source code with respect to encoding (that seems to be your question).
This is implementation-dependent anyway, so the answer might even be different for CPython, Jython, IronPython, etc.
So I suspect the Python 2 interpreter does not convert source code to unicode.
It never does. If you want to use Unicode rather than bytes, then you need to use a unicode literal instead:
astring = u'中文'
Python source itself is only plain ASCII, meaning that the actual encoding does not matter except for literal strings, be they unicode strings or byte strings. Identifiers can use non-ASCII characters (IMHO it would be a very bad practice), but their meaning is normally internal to the Python interpreter, so the way it reads them is not really important.
Byte strings are always left unchanged. That means that normal strings in Python 2 and byte literal strings in Python 3 are never converted.
Unicode strings are always converted:
if the special string coding: charset_name appears in a comment on the first or second line, the original byte string is converted as it would be with decode(charset_name)
if no encoding is specified, Python 2 will assume ASCII and Python 3 will assume UTF-8
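Conceptually, the conversion described above amounts to this (an illustration only, not actual interpreter code; the byte values match the GBK example earlier):
raw = b'\xd6\xd0\xce\xc4'     # the GBK bytes for 中文 as they sit in the source file
text = raw.decode('gbk')      # u'\u4e2d\u6587', i.e. u'中文'
print(repr(text))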
I am using Python's re module to censor some text. I have to censor both ASCII and Unicode text, so I need to set re's Unicode flag if the text is Unicode. Is there a way that I can detect whether the text is Unicode or not?
ASCII is a subset of Unicode, so you don't have to do anything. Unless you have reasons to suspect your text is neither ASCII nor Unicode (e.g. Windows CP-1252 and the like), just go with Unicode by default.
You can just use
isinstance(s, unicode)
to see if the object is unicode. But if all your strings are encoded (byte strings), then you need to know the encoding. For email-processing applications, that can be a nightmare; in the past, I have used chardet to that end.
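Applied to the censoring use case from the question, a small Python 2 sketch (the function name and default replacement are just illustrative):
import re

def censor(text, pattern, replacement='***'):
    # Turn on re.UNICODE only when we actually hold a unicode object.
    flags = re.UNICODE if isinstance(text, unicode) else 0
    return re.compile(pattern, flags).sub(replacement, text)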
You can try text.decode('utf-8'): if it succeeds without error, the text is UTF-8-encoded Unicode (of which pure ASCII is a subset). If it's anything else, e.g. text in some legacy code page, it will probably raise an exception.
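A small sketch of that try-and-decode approach (the helper name is just illustrative):
def looks_like_utf8(data):
    # True if the byte string decodes cleanly as UTF-8 (which also covers
    # pure ASCII, since ASCII is a subset of UTF-8).
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True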
I wanted to URL-encode a Python string and got exceptions with Hebrew strings.
I couldn't fix it and started doing some guess-oriented programming.
Finally, doing mystr = mystr.encode("utf8") before sending it to the url encoder saved the day.
Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).
My original string was a unicode string anyways (i.e. prefixed by a u)
...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).
To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF8 and UTF16 are common choices. ASCII could be too, if it fits. u"abc".encode('ascii') works just fine.
Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).
Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
Your original string was a unicode object containing raw Unicode code points; after encoding it as UTF-8, it is a normal byte string that contains UTF-8-encoded data.
The URL encoder seems to expect a byte string, so that it can URL-encode one byte after another and doesn't have to deal with Unicode code points. When you give it a unicode object, it tries to convert it to a byte string using some default encoding, probably ASCII. For Hebrew characters that cannot be represented as ASCII, this will lead to errors.
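A Python 2 sketch of that failure and the fix (the exact exception raised by urllib.quote for non-ASCII unicode input varies between 2.x minor versions):
import urllib

hebrew = u'\u05e9\u05dc\u05d5\u05dd'          # the word "shalom" as a unicode object
# urllib.quote(hebrew) fails here: quote works byte by byte and implicitly
# tries to treat the unicode object as single-byte/ASCII characters.
print(urllib.quote(hebrew.encode('utf-8')))   # %D7%A9%D7%9C%D7%95%D7%9D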
What does .encode("utf8") do?
It depends on which version of Python you're using:
In Python 3.x, it converts a str object (encoded in UTF-16 or UTF-32) into a bytes object containing the UTF-8 representation of the string.
In Python 2.x, it converts a unicode object into a str object encoded in UTF-8. But str has an encode method too, and writing '...'.encode('UTF-8') is equivalent to writing '...'.decode('ascii').encode('UTF-8').
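A quick interactive check of that Python 2 equivalence:
>>> 'abc'.encode('UTF-8')        # pure ASCII: the implicit .decode('ascii') succeeds
'abc'
>>> '\xd7\xa9'.encode('UTF-8')   # non-ASCII bytes: the implicit ASCII decode fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)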
Since you mentioned the "u" prefix, you must be using 2.x. If you don't require any 2.x-only libraries, I'd recommend switching to 3.x, which has a nice clear distinction between text and binary data.
Dive into Python 3 has a good explanation of the issue.
Can somebody explain what happened?
It would help if you told us what the error message was.
The urllib.quote function expects a str object. It also happens to work with unicode objects that contain only ASCII characters, but not when they contain Hebrew letters.
In Python 3.x, urllib.parse.quote accepts both str (=Python 2.x unicode) and bytes objects. Strings are automatically encoded in UTF-8.
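For example, in Python 3 (the Hebrew word is just a sample value):
>>> import urllib.parse
>>> urllib.parse.quote('שלום')    # str input is encoded as UTF-8 automatically
'%D7%A9%D7%9C%D7%95%D7%9D'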
"...".encode("utf-8") transforms the in-memory representation of the string into an UTF-8 -encoded string.
url encoder likely expected a bytestring, that is, string representation where each character is represented with a single byte.
It returns a UTF-8 encoded version of the Unicode string mystr. It is important to realize that UTF-8 is simply one way of encoding Unicode. Python can work with many other encodings (e.g. mystr.encode("utf32") or even mystr.encode("ascii")).
The link that balpha posted explains it all. In short:
The fact that your string was prefixed with "u" just means it's composed of Unicode characters (or code points). UTF-8 is an encoding of this string into a sequence of bytes.
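A tiny Python 2 example of that distinction:
>>> s = u'\u05e9'             # one Unicode code point (the Hebrew letter shin)
>>> s.encode('utf-8')         # its UTF-8 encoding is two bytes
'\xd7\xa9'
>>> len(s), len(s.encode('utf-8'))
(1, 2)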