Use latin characters in appengine - python

How can I store Latin characters in App Engine (e.g. "peña")? When I try to store this value I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf1 in position 2: ordinal not in range(128)
I can replace the Ñ with N, but is there a better way?
And if I encode the value, how can I print "Peña" again?

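GAE stores strings as unicode, so decode your byte string to unicode before saving it. The 0xf1 byte in the error is "ñ" in Latin-1, so that is probably the encoding of your input.
value = "pe\xf1a"  # byte string as received
value = value.decode("latin-1")  # now a unicode object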

From the error ("Unicode Decode Error"), it seems you could have more luck using Unicode - I'd try UTF-8.
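
For what it's worth, here is a minimal sketch of the round trip the question asks about, written for Python 3 (where text and bytes are separate types). It assumes the incoming bytes are Latin-1, which is what the 0xf1 byte in the error suggests; the variable names are only for illustration.

```python
# Assumption: the raw input is Latin-1 encoded (0xf1 is "ñ" in Latin-1).
raw = b"pe\xf1a"

# Decode once, at the boundary, to get real text.
text = raw.decode("latin-1")

# Encode explicitly when a byte representation is needed, e.g. for storage.
stored = text.encode("utf-8")

# Decoding with the same codec gives the original text back for display.
restored = stored.decode("utf-8")
display = restored.capitalize()
```

The point is that nothing has to be replaced or dropped: as long as the codec used to decode matches the one used to encode, the "ñ" survives the round trip.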

Related

Python mmh3: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)

I'm querying a DB for jokes and am getting back Python strs. I want to use them as Unicode objects, so I do:
joke = unicode(joke, 'utf-8')
This works for all my DB results and does not cause any issues.
Then I try to hash each word in each joke like this:
result = mmh3.hash(joke)
and I get back:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
I inspected the text and it's Japanese. Does this mean I should drop all non-ascii characters before hashing or is there a better way to handle this?
Thanks!
The .hash(...) function appears to require either bytes or ascii-convertible text.
The easiest way (if you're dealing entirely with unicode objects) is to convert them to bytes as you call mmh3.hash:
result = mmh3.hash(joke.encode('UTF-8'))
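
The same encode-before-hashing step can be sketched without mmh3 installed. The example below uses hashlib.sha256 from the standard library purely as a stand-in (it is not MurmurHash and returns different values), targets Python 3, and the helper name hash_text is made up for illustration.

```python
import hashlib

def hash_text(text, encoding="utf-8"):
    """Hash text by first turning it into bytes with an explicit codec."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()

joke = "\u732b\u304c\u597d\u304d"  # a short Japanese string
digest = hash_text(joke)

# Passing the text itself, without encoding, is rejected by hashlib.
try:
    hashlib.sha256(joke)
    rejected = False
except TypeError:
    rejected = True
```

Encoding to UTF-8 keeps every character, so there is no need to drop the non-ASCII text before hashing.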

Python: Unicode problems

I am getting an error at this line
logger.info(u"Data: {}".format(data))
I'm getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 4: ordinal not in range(128)
Before that line, I tried adding data = data.decode('utf8') and I still get the same error.
I tried data = data.encode('utf8') and it says UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
How do I fix this? I don't know if I should encode or decode but neither works.
Use a str (byte string) literal:
if isinstance(data, unicode):
    data = data.encode('utf8')
logger.info("Data: {}".format(data))
The logging module needs you to pass in str values, because these values are passed on unaltered to the formatters and handlers. Otherwise, writing log messages to a file means that unicode values are encoded with the default (ASCII) codec. You also need to pass in a byte string value when formatting.
Passing a str value into a unicode .format() template leads to decoding errors, passing a unicode value into a str .format() template leads to encoding errors, and passing a formatted unicode value to logger.info() leads to encoding errors too.
It is better not to mix the two types; encode explicitly beforehand.
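
The "pick one type and convert explicitly" advice can be sketched as follows. This is Python 3, where the natural choice is text rather than byte strings, so the direction of the conversion is the opposite of the Python 2 code in this answer; to_text and the logger name are hypothetical.

```python
import io
import logging

def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)

# Log into an in-memory stream so the result can be inspected.
stream = io.StringIO()
logger = logging.getLogger("unicode_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))

data = b"It\xe2\x80\x99s here"  # UTF-8 bytes containing U+2019
logger.info("Data: %s", to_text(data))
logged = stream.getvalue()
```

Normalizing the value in one place means the template and the argument are always the same type by the time they are combined.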
You could do something such as
data.decode('utf-8').encode("ascii", errors="ignore")
This drops the non-ASCII characters instead of raising an error.
Edit: data.encode('ascii', errors='ignore') may be enough, but I'm not in a position to test this currently.

Printing decoded JSON string

I am receiving a JSON string, passing it through json.loads, and ending up with an array of unicode strings. That's all well and good. One of the strings in the array is:
u'\xc3\x85sum'
which should translate into 'Åsum' when decoded using decode('utf8'), but instead I get an error:
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
To test what's wrong I did the following
>>> 'Åsum'.encode('utf8')
'\xc3\x85sum'
>>> print '\xc3\x85sum'.decode('utf8')
Åsum
So that worked fine, but if I make it a unicode string, as json.loads does, I get the same error:
>>> print u'\xc3\x85sum'.decode('utf8')
UnicodeEncodeError: 'charmap' codec can't encode character u'\x85' in position 1: character maps to <undefined>
I tried doing json.loads(jsonstring, encoding = 'uft8') but that changes nothing.
Is there a way to solve it? Make json.loads not make it unicode or make it decode using 'utf8' as I ask it to.
Edit:
The original string I receive looks like this, or at least the part that causes trouble does:
"\\u00c3\\u0085sum"
You already have a Unicode value, so trying to decode it forces an encode first, using the default codec.
It looks like you received malformed JSON instead; JSON values are already unicode. If you have UTF-8 data in your Unicode values, the only way to recover is to encode to Latin-1 (which maps the first 256 codepoints to bytes one-to-one), then decode the result as UTF-8:
>>> print u'\xc3\x85sum'.encode('latin1').decode('utf8')
Åsum
The better solution is to fix the JSON source, however; it should not doubly-encode to UTF-8. The correct representation would be:
json.dumps(u'Åsum')
'"\\u00c5sum"'
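
The repair and the correct representation can be checked together in a short sketch. It is written for Python 3, where every str is already Unicode, and the variable names are only illustrative.

```python
import json

# Text whose code points are really the UTF-8 bytes of "Åsum".
broken = "\u00c3\u0085sum"

# Latin-1 maps code points 0-255 to the same byte values, so this
# recovers the original bytes, which can then be decoded as UTF-8.
fixed = broken.encode("latin-1").decode("utf-8")

# What a correct JSON producer would emit for the repaired text.
correct_json = json.dumps(fixed)
```

The Latin-1 step only works because every code point in the broken string is below 256; it is a workaround for this specific kind of double encoding, not a general decoder.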

What happens when you call str() on a unicode string?

I'm wondering what happens internally when you call str() on a unicode string.
# coding: utf-8
s2 = str(u'hello')
Is s2 just the unicode byte representation of the str() arg?
It will try to encode it with your default encoding. On my system, that's ASCII, and if there are any non-ASCII characters, it will fail:
>>> str(u'あ')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
Note that this is the same error you'd get if you called encode('ascii') on it:
>>> u'あ'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u3042' in position 0: ordinal not in range(128)
As you might imagine, str working on some arguments and failing on others makes it easy to write code that at first glance seems to work, but stops working once you throw some international characters in there. Python 3 avoids this by making the problem blatantly obvious: you can't convert Unicode to a byte string without an explicit encoding:
>>> bytes(u'あ')
TypeError: string argument without an encoding
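
The Python 3 behaviour described here can be captured in a small runnable sketch; the flag names are just for illustration.

```python
text = "\u3042"  # HIRAGANA LETTER A

# bytes() refuses to guess an encoding for text.
try:
    bytes(text)
    needs_encoding = False
except TypeError:
    needs_encoding = True

# An explicit ASCII encode fails, because the character is outside ASCII.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit encoding that covers the character works.
utf8_bytes = text.encode("utf-8")
```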

Convert or strip out "illegal" Unicode characters

I've got a database in MSSQL that I'm porting to SQLite/Django. I'm using pymssql to connect to the database and save a text field to the local SQLite database.
However for some characters, it explodes. I get complaints like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1916: ordinal not in range(128)
Is there some way I can convert the chars to proper unicode versions? Or strip them out?
Once you have the string of bytes s, instead of using it as a unicode object directly, convert it explicitly with the right codec, e.g.:
u = s.decode('latin-1')
and use u instead of s in the code that follows this point (presumably the part that writes to sqlite). That's assuming latin-1 is the encoding that was used to make the byte string originally -- it's impossible for us to guess, so try to find out;-).
As a general rule, I suggest: don't process in your applications any text as encoded byte strings -- decode them to unicode objects right after input, and, if necessary, encode them back to byte strings right before output.
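
A rough sketch of that rule in Python 3 follows. The helper names are hypothetical, and cp1252 is only an assumption about the source data (byte 0x97 is an em dash in that code page); the real encoding has to be found out, as noted above.

```python
def read_text(raw, encoding):
    """Decode bytes to text right after input."""
    return raw.decode(encoding)

def write_bytes(text, encoding="utf-8"):
    """Encode text back to bytes right before output."""
    return text.encode(encoding)

raw = b"price \x97 10"            # bytes as they might arrive from the database
text = read_text(raw, "cp1252")   # assumption: the source used Windows-1252
processed = text.upper()          # all processing happens on text
out = write_bytes(processed)      # bytes again only at the output boundary
```

Keeping the decode and encode calls at the edges means the code in between never has to know which encoding the data arrived in.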
When you decode, just pass 'ignore' to strip those characters.
There are other ways of stripping or converting them:
'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'
'ignore': ignore malformed data and continue without further notice
'backslashreplace': replace with backslashed escape sequences (for encoding only)
Test
>>> "abcd\x97".decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 4: ordinal not in range(128)
>>>
>>> "abcd\x97".decode("ascii","ignore")
u'abcd'
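
For comparison, here is how the handlers listed above behave side by side, as a small Python 3 sketch (so results are shown without the u prefix, and the sample values are made up).

```python
raw = b"abcd\x97"

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD, the official replacement character.
replaced = raw.decode("ascii", "replace")

# 'backslashreplace' when encoding: the character becomes an escape sequence.
escaped = "caf\u00e9".encode("ascii", "backslashreplace")
```

Note that 'ignore' loses information without leaving a trace, while 'replace' at least marks where something was lost.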
