Python convert hanzi character - python

How do I convert between a hanzi character and its Unicode value, as depicted below?
与 to U+4E0E
今 to U+4ECA
令 to U+4EE4
免 to U+514D
Appears unsupported by default:
>>> a = '安'
Unsupported characters in input

The small 'u' in front of the quote indicates that a Unicode string is supposed to be created.
>>> a = u'与'
>>> a
u'\u4e0e'
See the string documentation for more information: http://docs.python.org/tutorial/introduction.html#unicode-strings
Update:
Set the source file encoding according to the actual encoding of the file, so that the interpreter knows how to parse it.
For example, to use UTF-8 just add this string to the header of the file:
# -*- coding: utf8 -*-
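Neither step above shows the actual conversion asked about, so here is a minimal sketch of it, assuming Python 3 (in Python 2, unichr() and a u'' literal would play the same roles). The helper names to_codepoint and from_codepoint are made up for illustration; ord(), chr() and int() are the built-ins doing the work.

```python
# Python 3 sketch; to_codepoint / from_codepoint are hypothetical helper names.

def to_codepoint(ch):
    """Format one character as a 'U+XXXX' label."""
    return 'U+%04X' % ord(ch)


def from_codepoint(label):
    """Turn a 'U+XXXX' label back into the character it names."""
    return chr(int(label[2:], 16))


# Escapes are used so the snippet does not depend on the file's encoding.
for ch in ['\u4e0e', '\u4eca', '\u4ee4', '\u514d']:
    label = to_codepoint(ch)
    # Round trip: the label parses back to the same character.
    print(label, from_codepoint(label) == ch)
```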

Related

Python Web Server UTF8 encoding

I'm trying to send some data to my Python Web Server through POST, the problem is that the data contains special characters.
I printed the data back to the browser, but I'm getting this:
Sent data: text with spécial
Received Data: text with sp\xc3\xa9cial
I have already set the source encoding with # -*- coding: utf-8 -*- and tried to encode or decode the string as UTF-8, but the browser still receives the same escaped output.
b'sp\xc3\xa9cial' is a valid Python bytes literal. You could decode it into a Unicode string (.decode('utf-8')) to get u'spécial'.
A likely reason is that you've printed a compound structure, such as a list, that contains the bytestring; repr() is called on the individual items:
>>> print 'spécial'
spécial
>>> print ['spécial']
['sp\xc3\xa9cial']
# -*- coding: utf-8 -*- defines the source code encoding. It has nothing to do with character encoding at runtime.
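The difference between printing a value and printing a container that holds it can be sketched as follows. This assumes Python 3, where the split between bytes and text is explicit; the variable names are illustrative only.

```python
# Python 3 sketch: the same UTF-8 bytes seen three ways.
raw = b'sp\xc3\xa9cial'          # bytes as received from the POST body

# Decoding turns the UTF-8 bytes into text (a str in Python 3).
text = raw.decode('utf-8')

# A container shows the repr() of each item, so the escapes stay visible.
as_list_item = repr([raw])

print(as_list_item)
print(len(raw), len(text))       # 8 bytes, 7 characters
```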

Python unicode string literals in module declared as utf-8

I have a dummy Python module with the utf-8 header that looks like this:
# -*- coding: utf-8 -*-
a = "á"
print type(a), a
Which prints:
<type 'str'> á
But I thought that all string literals inside a Python module declared as utf-8 would automatically be of type unicode, instead of str. Am I missing something, or is this the correct behaviour?
In order to get a as a unicode string I use:
a = u"á"
But this doesn't seem very "polite", nor practical. Is there a better option?
# -*- coding: utf-8 -*-
doesn't make the string literals Unicode. Take this example, I have a file with an Arabic comment and string, file is utf-8:
# هذا تعليق عربي
print type('نص عربي')
If I run it, it throws a SyntaxError exception:
SyntaxError: Non-ASCII character '\xd9' in file file.py
on line 2, but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
so to allow this I have to add that line to tell the interpreter that the file is UTF-8 encoded:
# -*-coding: utf-8 -*-
# هذا تعليق عربي
print type('نص عربي')
now it runs fine but it still prints <type 'str'> unless I make the string Unicode:
# -*-coding: utf-8 -*-
# هذا تعليق عربي
print type(u'نص عربي')
No, the codec at the top only informs Python how to interpret the source code, and uses that codec to interpret Unicode literals. It does not turn literal bytestrings into unicode values. As PEP 263 states:
This PEP proposes to introduce a syntax to declare the encoding of
a Python source file. The encoding information is then used by the
Python parser to interpret the file using the given encoding. Most
notably this enhances the interpretation of Unicode literals in
the source code and makes it possible to write Unicode literals
using e.g. UTF-8 directly in an Unicode aware editor.
Emphasis mine.
Without the codec declaration, Python has no idea how to interpret non-ASCII characters:
$ cat /tmp/test.py
example = '☃'
$ python2.7 /tmp/test.py
File "/tmp/test.py", line 1
SyntaxError: Non-ASCII character '\xe2' in file /tmp/test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
If Python behaved the way you expect it to, you would not be able to write literal bytestring values that contain non-ASCII byte values either.
If your terminal is configured to display UTF-8 values, then printing a UTF-8 encoded byte string will look 'correct', but only by virtue of luck that the encodings match.
The correct way to get unicode values is by using unicode literals or by otherwise producing unicode (decoding from byte strings, converting integer codepoints to unicode characters, etc.):
unicode_snowman = '\xe2\x98\x83'.decode('utf8')
unicode_snowman = unichr(0x2603)
In Python 3, the codec also applies to how variable names are interpreted, as you can use letters and digits outside of the ASCII range in names. The default codec in Python 3 is UTF-8, as opposed to ASCII in Python 2.
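The two routes mentioned above (decoding bytes and converting a code point) can be compared side by side. A sketch assuming Python 3, where chr() replaces unichr() and every str literal is already Unicode:

```python
# Python 3 sketch of the same two routes to a Unicode snowman (U+2603).
from_bytes = b'\xe2\x98\x83'.decode('utf8')   # decode UTF-8 encoded bytes
from_codepoint = chr(0x2603)                  # convert an integer code point
from_literal = '\u2603'                       # plain literal with an escape

# All three are the same one-character string.
same = from_bytes == from_codepoint == from_literal
print(same, len(from_bytes), len(from_bytes.encode('utf8')))
```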
No, this is just the source code encoding. Please see http://www.python.org/dev/peps/pep-0263/
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
or (using formats recognized by popular editors)
#!/usr/bin/python
# -*- coding: <encoding name> -*-
or
#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
This doesn't make all literals unicode; it only specifies how unicode literals should be decoded.
Use the unicode function or the u prefix to make a literal unicode.
N.B. In Python 3 all strings are unicode.

using £ in function and write it to csv in Python

I want to remove the pound sign from a string that is parsed from a URL using BeautifulSoup, and I got the following error for the pound sign:
SyntaxError: Non-ASCII character '\xa3' in file
I tried to put this # -*- coding: utf-8 -*- at the start of the class but still got the error.
This is the code. After I get the float number, I want to write it to a CSV file.
mainTag = SoupStrainer('table', {'class':'item'})
soup = BeautifulSoup(resp,parseOnlyThese=mainTag)
tag= soup.findAll('td')[3]
price = tag.text.strip()
pr = float(price.lstrip(u'£').replace(',', ''))
The problem is likely one of encoding, and bytes vs characters. What encoding was the CSV file created with? What sequence of bytes is in the file where the £ symbol occurs? What are the bytes contained in the variable price? You'll need to replace the bytes that actually occur in the string. One piece of the puzzle is the contents of the data in your source code. That's where the # -*- coding: utf-8 -*- marker at the top of the source is significant: it tells python how to interpret the bytes in a string literal. It is possible you will need (or want) to decode the bytes from the CSV file to create a Unicode string before replacing the character.
I will point out that the documentation for the csv module in Python 2.7 says:
Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
The examples section includes the following code demonstrating decoding the bytes provided by the csv module to Unicode strings.
import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

# helper from the same documentation example: encode each line as UTF-8
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
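For comparison, here is a minimal sketch of the asker's goal (strip the £, parse the number, write it to CSV), assuming Python 3, where the csv module works with text directly. The sample price string and the in-memory buffer are stand-ins, not taken from the page in the question.

```python
import csv
import io

# Hypothetical scraped text; '\u00a3' is the pound sign.
price = '\u00a31,234.50'

# Drop the currency symbol and the thousands separator, then parse.
value = float(price.lstrip('\u00a3').replace(',', ''))

# io.StringIO stands in for a file opened with open(path, 'w', newline='').
buf = io.StringIO()
csv.writer(buf).writerow(['item', value])
print(repr(buf.getvalue()))
```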

How to encode 'Importação de petróleo' string in python?

I want to use "Importação de petróleo" in my program.
How can I do that? Every encoding I try gives me a "cannot encode" error.
I think you're confusing the string __repr__ with its __str__:
>>> s = u"Importação de petróleo"
>>> s
u'Importa\xe7\xe3o de petr\xf3leo'
>>> print s
Importação de petróleo
There's no problem with \xe7 and friends; they are just the escaped representation that repr uses for those special characters. You can't avoid them and you shouldn't need to :)
A must-read link about unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
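The same distinction exists in Python 3, although repr() there no longer escapes printable non-ASCII characters; ascii() gives the escaped view instead. A small sketch under that assumption:

```python
# Python 3 sketch: one string, two views.
s = 'Importa\xe7\xe3o de petr\xf3leo'

escaped_view = ascii(s)            # escapes every non-ASCII character
utf8_bytes = s.encode('utf-8')     # what would be written to a UTF-8 file

# The escapes are only a display form; the string still has 22 characters.
print(escaped_view)
print(len(s), len(utf8_bytes))
```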
Do this:
# -*- coding: utf-8 -*-
print 'Importação de petróleo'
Place
# -*- coding: utf-8 -*-
at the very top of the program (first line).
Also save your code as UTF-8 (the default if you are using Linux).
If you are using characters in a source (.py) file which are outside of the ASCII range, then you will need to specify the encoding at the top of the file, so that the Python lexer knows how to read and interpret the characters in the file.
If this is the case, then, as the very first line of your file, use this:
# coding: utf-8
(If your file is actually in a different encoding, such as ISO-8859-1, then you will need to use that instead. Python can handle several different character encodings; you just have to tell it what to expect)
Adding a 'u' in front of the string makes it unicode. The documentation here gives details regarding Unicode handling in Python 2.x:
Python 2.x Unicode support
As specialscope mentioned, the first thing you have to do is add this as the first line of your program:
# -*- coding: utf-8 -*-
If you don’t, you’ll get an error which looks something like this:
SyntaxError: Non-ASCII character '\xc3' in file /tmp/blah.py on line 10,
but no encoding declared; see http://www.python.org/peps/pep-0263.html
for details
So far, so good. Now, you have to make sure that every string that contains anything besides plain ASCII is prefixed with u:
print u'Importação de petróleo'
But there’s one more step. This is a separate topic, but chances are that you’re going to have to end up re-encoding that string before you send it to stdout or a file.
Here are the rules of thumb for Unicode in Python:
If at all possible make sure that any data you’re working with is in UTF-8.
When you read external UTF-8 encoded data into your program, immediately decode it into Unicode.
When you send data out of your program (to a file or stdout), make sure that you re-encode it as UTF-8.
This all changes in Python 3, by the way.
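The three rules above amount to decoding at the input boundary and encoding at the output boundary. A rough sketch, assuming Python 3 and using in-memory buffers as stand-ins for a real file or socket:

```python
import io

# Stand-in for external UTF-8 data (a file, a socket, an HTTP body).
incoming = io.BytesIO('Importa\xe7\xe3o de petr\xf3leo\n'.encode('utf-8'))

# 1. Decode immediately on the way in.
text = incoming.read().decode('utf-8')

# 2. Work with text only inside the program.
processed = text.strip().upper()

# 3. Encode again on the way out.
outgoing = io.BytesIO()
outgoing.write(processed.encode('utf-8'))
print(outgoing.getvalue())
```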
Help on class unicode in module __builtin__:
class unicode(basestring)
| unicode(string [, encoding[, errors]]) -> object
|
| Create a new Unicode object from the given encoded string.
| encoding defaults to the current default string encoding.
| errors can be 'strict', 'replace' or 'ignore' and defaults to 'strict'.
|
Try using "utf8" as the encoding for unicode().
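The encoding and errors arguments described in that help text behave as sketched below. This assumes Python 3, where str(bytes, encoding, errors) plays the role of Python 2's unicode(); the sample bytes are invented.

```python
# Python 3 sketch: str(bytes, encoding, errors) mirrors Python 2's unicode().
good = b'petr\xc3\xb3leo'     # valid UTF-8
bad = b'petr\xffleo'          # 0xff can never appear in UTF-8

decoded = str(good, 'utf8')
replaced = str(bad, 'utf8', 'replace')   # invalid byte -> U+FFFD
ignored = str(bad, 'utf8', 'ignore')     # invalid byte dropped

try:
    str(bad, 'utf8')                     # errors defaults to 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(decoded), ascii(replaced), ignored, strict_failed)
```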

convert a String '\u05d9\u05d7\u05e4\u05d9\u05dd' to its unicode character in python

I get a Json object from a URL which has values in the form like above:
title:'\u05d9\u05d7\u05e4\u05d9\u05dd'
I need to print these values as readable text, but I'm not able to convert them because they are treated as literal strings and not unicode objects.
Doing unicode(myStr) does not work.
Doing a = u'%s' % myStr does not work.
All are escaped as strings, so they return the same sequence of characters.
Does anyone know how I can do this conversion in Python?
Maybe the right approach is to change the encoding of the response; how do I do that?
You should use the json module to load the JSON data into a Python object. It will take care of this for you, and you'll have Unicode strings. Then you can encode them to match your output device, and print them.
json strings always use ", not ' so '\u05d9\u05d7\u05e4\u05d9\u05dd' is not a json string.
If you load a valid json text then all Python strings in it are Unicode so you don't need to decode anything. To display them you might need to encode them using a character encoding suitable for your terminal.
Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
d = json.loads(u'''{"title": "\u05d9\u05d7\u05e4\u05d9\u05dd"}''')
print d['title'].encode('utf-8') # -> יחפים
Note: it is a coincidence that the source encoding (declared at the top of the file) is equal to the output encoding (the last line); they are unrelated and can be different.
If you'd like to see fewer \uxxxx sequences in a json text then you could use ensure_ascii=False:
Example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
L = ['יחפים']
json_text = json.dumps(L) # default encoding for input bytes is utf-8
print json_text # all non-ASCII characters are escaped
json_text = json.dumps(L, ensure_ascii=False)
print json_text # output as-is
Output
["\u05d9\u05d7\u05e4\u05d9\u05dd"]
["יחפים"]
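In Python 3 the same two calls behave alike, except that the input is already a text string rather than UTF-8 bytes. A sketch under that assumption:

```python
import json

# Python 3 sketch; the title is written with escapes to keep the source ASCII.
titles = ['\u05d9\u05d7\u05e4\u05d9\u05dd']

escaped = json.dumps(titles)                     # non-ASCII becomes \uXXXX
as_is = json.dumps(titles, ensure_ascii=False)   # characters kept verbatim

# Both forms load back to the same list.
round_trip = json.loads(escaped) == json.loads(as_is) == titles
print(escaped, round_trip)
```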
If you have a string like this outside of your JSON object for some reason, you can decode the string using raw_unicode_escape to get the unicode string you want:
>>> '\u05d9\u05d7\u05e4\u05d9\u05dd'.decode('raw_unicode_escape')
u'\u05d9\u05d7\u05e4\u05d9\u05dd'
>>> print '\u05d9\u05d7\u05e4\u05d9\u05dd'.decode('raw_unicode_escape')
יחפים
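Python 3 strings have no .decode() method, so the equivalent there goes through bytes first. A sketch assuming the escaped text is pure ASCII, as in the example:

```python
# Python 3 sketch: turn literal backslash-u sequences into characters.
escaped = '\\u05d9\\u05d7\\u05e4\\u05d9\\u05dd'   # 30 ASCII characters

# Encode to bytes, then let the codec interpret the \uXXXX escapes.
decoded = escaped.encode('ascii').decode('raw_unicode_escape')

print(len(escaped), len(decoded))
```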
