Python Web Server UTF8 encoding - python

I'm trying to send some data to my Python web server through POST. The problem is that the data contains special characters.
I printed the data back to the browser, but I'm getting this:
Sent data: text with spécial
Received Data: text with sp\xc3\xa9cial
I have already added # -*- coding: utf-8 -*- to the code and tried to encode or decode the string as UTF-8, but the browser still receives only the escaped form.

b'sp\xc3\xa9cial' is a valid Python bytes literal. You could decode it into a Unicode string (.decode('utf-8')) to get u'spécial'.
A likely reason is that you've printed a compound structure, such as a list, that contains the bytestring. repr() is called on the individual items:
>>> print 'spécial'
spécial
>>> print ['spécial']
['sp\xc3\xa9cial']
# -*- coding: utf-8 -*- defines the source code encoding. It has nothing to do with character encoding at runtime.
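For readers on Python 3, the same effect can be reproduced with a bytes object. This is only a minimal sketch, not the asker's server code: the variable names are made up, and it assumes the POST body is UTF-8 encoded.

```python
# Bytes as they might arrive in a POST body (assumed to be UTF-8 encoded).
received = b'sp\xc3\xa9cial'

# Converting a container to text calls repr() on each item,
# so the byte escapes show up in the result.
as_list_item = repr([received])

# Decoding turns the bytes into text that prints as expected.
text = received.decode('utf-8')

print(as_list_item)
print(text)
```

The escapes are how the bytes are displayed, not a corruption of the data. Decoding once, at the point where the request body is read, is usually enough.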

Related

Convert string from Latin-1 to UTF-8 and back to Latin-1

A system (not under my control) sends a latin-1 encoded string (such as Öland) which I can convert to utf-8 but not back to latin-1.
Consider this code:
text = '\xc3\x96land' # This is what the external system sends
iso = text.encode(encoding='latin-1') # this is my best guess
print(iso.decode('utf-8'))
print(u"Öland".encode(encoding='latin-1'))
This is the output:
Öland
b'\xd6land'
Now, how do I mimic the system?
Obviously '\xc3\x96land' is not '\xd6land'.
If your external system sends it to you, then you should first decode it rather than encode it, since it arrives already encoded.
You don't have to encode what is already encoded!
hey=u"Öland".encode('latin-1')
print hey
gives output like this ?land
print hey.decode('latin-1')
gives output like this Öland
It turns out the external system sends the data in UTF-8 already.
Converting the string back and forth now works like this:
#!/usr/bin/env python3.4
# -*- coding: utf-8 -*-
text = '\xc3\x96land'
encoded = text.encode(encoding='raw_unicode_escape')
print(encoded)
utf8 = encoded.decode('utf-8')
print(utf8)
mimic = utf8.encode('utf-8')  # the second positional argument of str.encode() is an error handler, not a codec
print(mimic)
And the output
b'\xc3\x96land'
Öland
b'\xc3\x96land'
Thank you for your support!
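The round trip can be condensed a little for Python 3. The sketch below assumes, as the asker concluded, that the payload is UTF-8 bytes that ended up being read as Latin-1 text. The helper name fix_mojibake is made up for illustration.

```python
def fix_mojibake(text):
    """Undo a UTF-8 payload that was wrongly decoded as Latin-1."""
    # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
    # recovers the original bytes without loss.
    raw = text.encode('latin-1')
    return raw.decode('utf-8')


garbled = '\xc3\x96land'           # what the program sees
fixed = fix_mojibake(garbled)      # the intended text

print(fixed)
print(fixed.encode('utf-8'))       # the bytes the external system sent
print(fixed.encode('latin-1'))     # the same text as real Latin-1
```

Using 'latin-1' here instead of 'raw_unicode_escape' is a design choice: both give the same bytes for code points below 256, but 'latin-1' raises an error on anything above that instead of silently writing \uXXXX escapes.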

Django - writing an hebrew string

I'm trying to send a Hebrew string through the Parse REST API in Django code.
The code is fine: sending a string in English works perfectly.
When the letters are in Hebrew I get the following error:
Non-ASCII character '\xd7' but no encoding declared;
How can I set the encoding programmatically for a specific line?
It's explained in the docs:
Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source file
In your case:
# -*- coding: utf-8 -*-
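Note that the declaration applies to the whole file, not to a single line. To illustrate what it actually controls, here is a small Python 3 sketch that compiles the same source bytes under two different declarations. The helper run_with_declaration and the sample literal are made up for the demonstration.

```python
# The two bytes below are the UTF-8 encoding of the Hebrew letter shin (U+05E9).
literal_line = b"s = '\xd7\xa9'\n"


def run_with_declaration(encoding):
    """Compile the same source bytes under the given coding declaration."""
    header = b"# -*- coding: " + encoding.encode('ascii') + b" -*-\n"
    namespace = {}
    exec(compile(header + literal_line, '<demo>', 'exec'), namespace)
    return namespace['s']


as_utf8 = run_with_declaration('utf-8')      # one Hebrew character
as_latin1 = run_with_declaration('latin-1')  # two unrelated characters

print(ascii(as_utf8))
print(ascii(as_latin1))
```

The bytes on disk are identical in both runs. Only the declaration changes how Python reads them, which is why it has to match the encoding your editor actually saved the file in.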

Python webpage source read with special characters

I am reading a page source from a webpage, then parsing a value from that source.
There I am facing a problem with special characters.
In my Python controller file I am using # -*- coding: utf-8 -*-.
But I am reading a web page source which uses charset=iso-8859-1.
So when I read the page content without specifying any encoding, it throws UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte
When I use string.decode("iso-8859-1").encode("utf-8"), it parses the data without any error, but it displays the value as 'F\u00fcnke' instead of 'Fünke'.
Please let me know how I can solve this issue.
I would greatly appreciate any suggestions.
Encoding is a PITA in Python3 for sure (and 2 in some cases as well).
Try checking these links out, they might help you:
Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)
http://docs.python.org/2/library/codecs.html
Also, it would be nice to see the code for "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use UTF-8 (for instance, on Windows). Your # -*- coding: utf-8 -*- only tells Python which encoding to expect within the source code, not the encoding of the data the code is going to parse or analyze.
For instance, I write:
# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, i print out the time)
print(time.strftime('%H:%M:%S'))
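Here is a Python 3 sketch of the decoding step itself, using the sample value from the question. The last part is only a guess at where the 'F\u00fcnke' form might come from: serializing to JSON with default settings produces exactly that kind of escape, but the asker's code isn't shown, so treat that part as an assumption.

```python
import json

# Bytes as they might come off the wire from a page that declares
# charset=iso-8859-1 (the sample value is taken from the question).
page_bytes = b'F\xfcnke'

# Decoding with the wrong codec reproduces the kind of error reported.
try:
    page_bytes.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the declared charset yields the expected text.
text = page_bytes.decode('iso-8859-1')

# Assumption: an escaped form can appear if the value is later dumped
# as JSON with the default ensure_ascii=True.
escaped = json.dumps(text)
readable = json.dumps(text, ensure_ascii=False)

print(utf8_error)
print(text)
print(escaped)
print(readable)
```

If the escaped form shows up in your output, the text itself is most likely fine and only its representation differs.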

How to encode 'Importação de petróleo' string in python?

I want to use "Importação de petróleo" in my program.
How can I do that because all encodings give me errors as cannot encode.
I think you're confusing the string __repr__ with its __str__:
>>> s = u"Importação de petróleo"
>>> s
u'Importa\xe7\xe3o de petr\xf3leo'
>>> print s
Importação de petróleo
There's no problem with \xe7 and friends; they are just the escaped representation of those special characters, shown when the string's repr is displayed. You can't avoid them and you shouldn't need to :)
A must-read link about Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
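In Python 3 the interactive prompt shows non-ASCII characters as they are, so the difference is less visible there. The built-in ascii() gives the escaped form explicitly. A minimal sketch, with variable names chosen for illustration:

```python
s = 'Importação de petróleo'

escaped = ascii(s)   # repr-like form with non-ASCII characters escaped
shown = str(s)       # the text itself, unchanged

print(escaped)
print(shown)
```

Both lines describe the same string object. Only the way it is rendered differs.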
Do this:
# -*- coding: utf-8 -*-
print 'Importação de petróleo'
Place
# -*- coding: utf-8 -*-
at the very top of the program (first line).
Also save your code as UTF-8 (the default if you are using Linux).
If you are using characters in a source (.py) file which are outside of the ASCII range, then you will need to specify the encoding at the top of the file, so that the Python lexer knows how to read and interpret the characters in the file.
If this is the case, then, as the very first line of your file, use this:
# coding: utf-8
(If your file is actually in a different encoding, such as ISO-8859-1, then you will need to use that instead. Python can handle several different character encodings; you just have to tell it what to expect)
Adding a 'u' in front of the string makes it a Unicode string. The documentation here gives details regarding Unicode handling in Python 2.x:
Python 2.x Unicode support
As specialscope mentioned, the first thing you have to do is add this as the first line of your program:
# -*- coding: utf-8 -*-
If you don’t, you’ll get an error which looks something like this:
SyntaxError: Non-ASCII character '\xc3' in file /tmp/blah.py on line 10,
but no encoding declared; see http://www.python.org/peps/pep-0263.html
for details
So far, so good. Now, you have to make sure that every string that contains anything besides plain ASCII is prefixed with u:
print u'Importação de petróleo'
But there’s one more step. This is a separate topic, but chances are that you’re going to have to end up re-encoding that string before you send it to stdout or a file.
Here are the rules of thumb for Unicode in Python:
If at all possible make sure that any data you’re working with is in UTF-8.
When you read external UTF-8 encoded data into your program, immediately decode it into Unicode.
When you send data out of your program (to a file or stdout), make sure that you re-encode it as UTF-8.
This all changes in Python 3, by the way.
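The rules of thumb above carry over to Python 3, where the split between bytes and str makes them easier to follow. A minimal sketch, assuming the external data really is UTF-8; the function name shout is made up for illustration:

```python
def shout(raw_bytes):
    """Decode at the boundary, work on text, encode on the way out."""
    text = raw_bytes.decode('utf-8')    # bytes -> str as soon as data comes in
    result = text.upper()               # all processing happens on str
    return result.encode('utf-8')       # str -> bytes only when data leaves


incoming = 'Importação de petróleo'.encode('utf-8')  # stand-in for external input
outgoing = shout(incoming)

print(outgoing)
print(outgoing.decode('utf-8'))
```

Keeping the decode and encode calls at the edges means the code in the middle never has to care which encoding was used.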
Help on class unicode in module __builtin__:
class unicode(basestring)
| unicode(string [, encoding[, errors]]) -> object
|
| Create a new Unicode object from the given encoded string.
| encoding defaults to the current default string encoding.
| errors can be 'strict', 'replace' or 'ignore' and defaults to 'strict'.
|
Try using "utf8" as the encoding argument for unicode().

convert a String '\u05d9\u05d7\u05e4\u05d9\u05dd' to its unicode character in python

I get a Json object from a URL which has values in the form like above:
title:'\u05d9\u05d7\u05e4\u05d9\u05dd'
I need to print these values as readable text. However, I'm not able to convert them, as they are treated as literal strings and not Unicode objects.
Doing unicode(myStr) does not work.
Doing a = u'%s' % myStr does not work.
Everything is escaped as a string, so it returns the same sequence of characters.
Does anyone know how I can do this conversion in Python?
Maybe the right approach is to change the encoding of the response. How do I do that?
You should use the json module to load the JSON data into a Python object. It will take care of this for you, and you'll have Unicode strings. Then you can encode them to match your output device, and print them.
JSON strings always use ", not ', so '\u05d9\u05d7\u05e4\u05d9\u05dd' is not a JSON string.
If you load a valid JSON text, then all Python strings in it are Unicode, so you don't need to decode anything. To display them you might need to encode them using a character encoding suitable for your terminal.
Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
d = json.loads(u'''{"title": "\u05d9\u05d7\u05e4\u05d9\u05dd"}''')
print d['title'].encode('utf-8') # -> יחפים
Note: it is a coincidence that the source encoding (specified in the coding declaration near the top) is the same as the output encoding (the last line). They are unrelated and can be different.
If you'd like to see fewer \uxxxx sequences in a JSON text, you could use ensure_ascii=False:
Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
L = ['יחפים']
json_text = json.dumps(L) # default encoding for input bytes is utf-8
print json_text # all non-ASCII characters are escaped
json_text = json.dumps(L, ensure_ascii=False)
print json_text # output as-is
Output
["\u05d9\u05d7\u05e4\u05d9\u05dd"]
["יחפים"]
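For Python 3, the two examples above can be combined into one sketch. The JSON text is written as a raw string so that the \u escapes reach the JSON parser untouched. The variable names are chosen for illustration.

```python
import json

# A JSON document whose string value uses \uXXXX escapes. The r prefix
# keeps Python from interpreting the backslashes itself.
json_text = r'{"title": "\u05d9\u05d7\u05e4\u05d9\u05dd"}'

data = json.loads(json_text)
title = data['title']   # already a str, nothing to decode

escaped = json.dumps(data)                        # non-ASCII characters escaped
readable = json.dumps(data, ensure_ascii=False)   # non-ASCII characters kept as-is

print(title)
print(escaped)
print(readable)
```

In Python 3, print() takes care of encoding for the terminal, so the explicit .encode('utf-8') from the Python 2 example is not needed here.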
If you have a string like this outside of your JSON object for some reason, you can decode the string using raw_unicode_escape to get the unicode string you want:
>>> '\u05d9\u05d7\u05e4\u05d9\u05dd'.decode('raw_unicode_escape')
u'\u05d9\u05d7\u05e4\u05d9\u05dd'
>>> print '\u05d9\u05d7\u05e4\u05d9\u05dd'.decode('raw_unicode_escape')
יחפים
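In Python 3, str has no .decode() method, so the same trick needs a detour through bytes. A minimal sketch, assuming the string contains only ASCII characters and backslash escapes:

```python
# A str holding literal backslash-u sequences, not Hebrew characters.
raw = r'\u05d9\u05d7\u05e4\u05d9\u05dd'

# Go through ASCII bytes first, then let the codec interpret the escapes.
decoded = raw.encode('ascii').decode('unicode_escape')

print(len(raw), len(decoded))
print(decoded)
```

If the input could already contain non-ASCII characters, the encode('ascii') step would raise an error, so for real JSON input the json module remains the safer choice.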
