When using POST in Django, a byte string is automatically converted into a unicode string.
For example:
s = '\xe2\x80\x99'
is a str-type string (which is UTF-8 encoded).
When this string is POSTed to Django and then fetched from request.POST, it has been converted to the unicode string:
u'\xe2\x80\x99'
This may cause decode/encode errors, because Python thinks it is a unicode string when in fact it is UTF-8 bytes.
My question is: how do I FORCE-convert the unicode string to a byte string? That is, just remove the 'u' prefix, turning u'\xe2\x80\x99' into '\xe2\x80\x99'. The traditional decode/encode methods may not work in this situation.
When the request is received, the encoding is mis-declared as (probably) iso-8859-1, or perhaps not declared at all, defaulting to that encoding. The web site should declare its encoding correctly with a header:
<head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
</head>
But if that isn't under your control, you can undo the encoding and decode it correctly:
>>> s = u'\xe2\x80\x99'
>>> s.encode('iso-8859-1')
'\xe2\x80\x99'
>>> s.encode('iso-8859-1').decode('utf8')
u'\u2019'
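For completeness, a minimal sketch of the same round-trip in Python 3 syntax (where str is already unicode): the mojibake string holds one Latin-1 code point per original UTF-8 byte, so re-encoding with the wrong codec recovers the bytes, which can then be decoded correctly.

```python
# Python 3 sketch of the same fix
s = "\xe2\x80\x99"              # what the mis-decoded text looks like
raw = s.encode("iso-8859-1")    # recover the original bytes, b'\xe2\x80\x99'
fixed = raw.decode("utf-8")     # decode them with the right codec
print(ascii(fixed))             # → 'Decatur-style right single quote: \u2019'
```

The `encode('iso-8859-1')` step works because Latin-1 maps code points 0-255 one-to-one onto bytes 0-255, so it can losslessly undo the bad decode.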
According to the documentation, it is possible to define the encoding of the literals used in the python source like this:
# -*- coding: latin-1 -*-
u = u'abcdé' # This is a unicode string encoded in latin-1
Is there any syntax support to specify the encoding on a literal basis? I am looking for something like:
latin1 = u('latin-1')'abcdé' # This is a unicode string encoded in latin-1
utf8 = u('utf-8')'xxxxx' # This is a unicode string encoded in utf-8
I know that syntax does not make sense, but I am looking for something similar. What can I do? Or is it maybe not possible to have a single source file with unicode strings in different encodings?
There is no way to mark a unicode literal as using a different encoding from the rest of the source file, no.
Instead, you'd manually decode the literal from a bytestring:
latin1 = 'abcdé'.decode('latin1') # provided `é` is stored in the source as an E9 byte.
or using escape sequences:
latin1 = 'abcd\xe9'.decode('latin1')
The whole point of the source-code codec line is to support using an arbitrary codec in your editor. Source code should never use mixed encodings, really.
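As a sketch of the escape-sequence route, byte literals carry their bytes explicitly, so the intent is clear regardless of the file's own codec (the b'' prefix works in Python 2.6+ and in Python 3):

```python
# each literal spells out its own bytes; decode per literal
latin1 = b'abcd\xe9'.decode('latin1')    # é stored as the single byte E9
utf8 = b'abcd\xc3\xa9'.decode('utf-8')   # é stored as the two bytes C3 A9
assert latin1 == utf8                    # both yield the same unicode string
```

This sidesteps the source-file codec entirely, which is exactly why it is the usual answer when one file must embed data from several encodings.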
I have the following function
import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text

texto = seek()
print(texto)
When I decode to utf-8, I get the html code with indentation and carriage returns and all, just like it's seen on the actual website.
<!DOCTYPE html>
<html>
    <head>
        <title>We Cloud for You |
If I remove .decode('utf8'), I get the code, but the indentation is gone and it's replaced by \n.
<!DOCTYPE html>\n<html>\n <head>\n <title>We Cloud for You
So, why is this happening? As far as I know, when you decode, you are basically converting some encoded string into Unicode.
My sys.stdout.encoding is CP1252 (Windows 1252 encoding)
According to this thread: Why does Python print unicode characters when the default encoding is ASCII?
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent from the shell's.
So, it seems like Python needs to read the text as Unicode before it can convert it to CP1252 for printing on the terminal. But I don't understand why, if the text is not decoded, the indentation is replaced with \n.
sys.getdefaultencoding() returns utf8.
In Python 3, when you print a bytes value (raw bytes from the network, without decoding), you see the representation of that value as a Python bytes literal. This includes representing newlines as \n escape sequences.
By decoding, you now have a unicode string value instead, and print() can handle that directly:
>>> print(b'Newline\nAnother line')
b'Newline\nAnother line'
>>> print(b'Newline\nAnother line'.decode('utf8'))
Newline
Another line
This is perfectly normal behaviour.
When I get a webpage, I use UnicodeDammit to convert it to utf-8 encoding, like this:
import urllib2
import chardet
from lxml import html

content = urllib2.urlopen(url).read()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)
but when I use:
text = doc.text_content()
print type(text)
The output is <type 'lxml.etree._ElementUnicodeResult'>.
Why? I thought it would be a utf-8 string.
lxml.etree._ElementUnicodeResult is a class that inherits from unicode:
$ pydoc lxml.etree._ElementUnicodeResult
lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
| Method resolution order:
| _ElementUnicodeResult
| __builtin__.unicode
| __builtin__.basestring
| __builtin__.object
In Python, it's fairly common to have classes that extend from base types to add some module-specific functionality. It should be safe to treat the object like a regular Unicode string.
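A minimal sketch, with a hypothetical TaggedText class standing in for _ElementUnicodeResult, shows why subclassing the string type is harmless to callers:

```python
# hypothetical stand-in for a string subclass like _ElementUnicodeResult
class TaggedText(str):
    pass  # a real subclass would add module-specific attributes here

t = TaggedText("hello")
print(isinstance(t, str))   # → True: it passes ordinary string type checks
print(t.upper())            # → HELLO: every string method works unchanged
```

Because the subclass adds behaviour without overriding any string semantics, code that only expects a plain string never notices the difference.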
You might want to skip the re-encoding step: lxml.html will automatically use the encoding declared in the source document, and as long as the result is valid unicode, there is no reason to be concerned with how it was originally encoded.
Unless your project is so small and informal that you can be sure you will never encounter 8-bit strings (i.e. it's always 7-bit ASCII, English with no special characters), it's wise to get your text into unicode as early as possible (like right after retrieval) and keep it that way until you need to serialize it for writing to a file or sending over a socket.
The reason you're seeing <type 'lxml.etree._ElementUnicodeResult'> is that lxml.html.fromstring() is automatically doing the decode step for you. Note this means the code above will not work for a page encoded in UTF-16, for example: the 8-bit string will be encoded in UTF-8, but the HTML will still say utf-16:
<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
and lxml will try to decode the string based on utf-16 encoding rules, raising an exception in short order I would expect.
If you want the output serialized as a UTF-8 encoded 8-bit string, all you need is this:
>>> text = doc.text_content().encode('utf-8')
>>> print type(text)
<type 'str'>
Ok, I have a hardcoded string I declare like this
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8.
Down the road it's output to XML through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is output correctly in CDATA nodes, but that one hardcoded string keeps ... I don't see why an ascii codec is called.
What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
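A short sketch of the 3.x behaviour mentioned above: mixing the two types fails immediately with a TypeError instead of attempting a silent ASCII decode.

```python
s = "Par Catégorie"          # str (unicode) in Python 3
b = s.encode("utf-8")        # bytes
try:
    "prefix " + b            # mixing str and bytes
except TypeError as exc:
    print("refused:", exc)   # no implicit ASCII decode, just a TypeError
```

The error surfaces at the point of mixing rather than deep inside a library, which makes this class of bug far easier to locate.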
Wrong parameter name? From the doc, I can see the keyword argument name is supposed to be encoding and not coding.
I'm dealing with unknown data and trying to insert into a MySQL database using Python/Django. I'm getting some errors that I don't quite understand and am looking for some help. Here is the error.
Incorrect string value: '\xEF\xBF\xBDs m...'
My guess is that the string is not being properly converted to unicode? Here is my code for unicode conversion.
s = unicode(content, "utf-8", errors="replace")
Without the above unicode conversion, the error I get is
'utf8' codec can't decode byte 0x92 in position 31: unexpected code byte. You passed in 'Fabulous home on one of Decatur\x92s most
Any help is appreciated!
What is the original encoding? I'm assuming "cp1252", from pixelbeat's answer. In that case, you can do
>>> orig # Byte string, encoded in cp1252
'Fabulous home on one of Decatur\x92s most'
>>> uni = orig.decode('cp1252')
>>> uni # Unicode string
u'Fabulous home on one of Decatur\u2019s most'
>>> s = uni.encode('utf8')
>>> s # Correct byte string encoded in utf-8
'Fabulous home on one of Decatur\xe2\x80\x99s most'
0x92 is the right single curly quote in the Windows cp1252 encoding.
\xEF\xBF\xBD is the UTF8 encoding of the unicode replacement character
(which was inserted instead of the erroneous cp1252 character).
So it looks like your database is not accepting the valid UTF8 data?
2 options:
1. Perhaps you should be using unicode(content,"cp1252")
2. If you want to insert UTF-8 into the DB, then you'll need to config it appropriately. I'll leave that answer to others more knowledgeable
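For option 2, a sketch of the relevant Django settings; the names here ('mydb' and so on) are placeholders, and the 'charset' option is passed through to the MySQL driver to request a UTF-8 connection:

```python
# hypothetical Django settings fragment: ask MySQL for a UTF-8 connection
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',                      # placeholder database name
        'OPTIONS': {'charset': 'utf8'},      # connection character set
    },
}
```

The tables and columns themselves must also use a UTF-8 character set (e.g. DEFAULT CHARSET=utf8 in the CREATE TABLE), or the server will still reject the data.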
The "Fabulous..." string doesn't look like utf-8: 0x92 is above 128 and as such should be part of a multi-byte sequence. However, in that string it appears on its own (apparently representing an apostrophe).
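A quick sketch confirming this, in bytes-literal syntax: the lone 0x92 byte is rejected by the UTF-8 codec but decodes cleanly as cp1252.

```python
raw = b'Decatur\x92s'
try:
    raw.decode('utf-8')                 # 0x92 cannot stand alone in UTF-8
except UnicodeDecodeError as exc:
    print('utf-8 rejects it:', exc)
print(ascii(raw.decode('cp1252')))      # → 'Decatur\u2019s' (curly apostrophe)
```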