Emoticons in python string - \xF0\x9F\x92\x96 [duplicate] - python

Look at the following:
/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))
The string '\xF0\x9F\x91\x8A' is actually the 4-byte UTF-8 encoding of a single Unicode character (u'\U0001f44a'). MySQL's character set is utf-8, but inserting a 4-byte character truncates the inserted string.
I googled the problem and found that MySQL below 5.5.3 doesn't support 4-byte UTF-8, and unfortunately mine is 5.5.224.
I don't want to upgrade the MySQL server, so I just want to filter out the 4-byte characters in Python. I tried a regular expression but failed.
So, any help?

If MySQL cannot handle UTF-8 sequences of 4 bytes, then you'll have to filter out all Unicode characters at or above codepoint \U00010000; UTF-8 encodes every codepoint below that threshold in 3 bytes or fewer.
You could use a regular expression for that:
>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
Alternatively, you could use the .translate() function with a mapping table that only contains None values:
>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '
However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.
This all presumes you are using a UCS-4 (wide) build of Python. If your Python was compiled with UCS-2 (narrow) support, then characters above '\U0000ffff' are stored as surrogate pairs, and the wide character class above won't compile; you'd match the surrogate pairs instead, as in the sketch below.
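If you need code that works on both builds, you can pick the pattern based on sys.maxunicode; a minimal sketch:
import re
import sys

if sys.maxunicode > 0xffff:
    # wide (UCS-4) build: astral codepoints can be matched directly
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
else:
    # narrow (UCS-2) build: astral characters are stored as surrogate pairs
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')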
I note that as of MySQL 5.5.3 the newly-added utf8mb4 codec does support the full Unicode range.

I think you should use the utf8mb4 character set instead of utf8 and run
SET NAMES utf8mb4
after connecting to the DB.
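For example, with the MySQLdb driver (pymysql accepts the same arguments; the connection details below are placeholders), passing charset='utf8mb4' at connect time has the same effect:
import MySQLdb

# placeholder credentials; charset='utf8mb4' makes the driver issue
# the equivalent of SET NAMES utf8mb4 on connect
conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mydb', charset='utf8mb4', use_unicode=True)
Note that the table and column character sets must also be converted to utf8mb4, or the server will still reject or truncate 4-byte characters.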

Simple normalization for the string, without regex or translate:
def normalize_unicode(s):
    # replace each astral (non-BMP) character with U+FFFD, the replacement character
    return u''.join(c if ord(c) < 0x10000 else u'\ufffd' for c in s)
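Used on the earlier example (on a wide, UCS-4 build, where ord() sees the full codepoint), the emoji comes out as U+FFFD:
>>> normalize_unicode(u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: \ufffd'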

Related

Why doesn't 'encode("utf-8", 'ignore').decode("utf-8")' strip non-UTF8 chars in Python 3?

I'm using Python 3.7 and Django 2.0. I want to strip out non-UTF-8 characters from a string that I'm obtaining by reading a CSV file. I tried this ...
web_site = row['website'].strip().encode("utf-8", 'ignore').decode("utf-8")
but this doesn't seem to be doing the job, since I have a resulting string that looks like ...
web_site: "wbez.org<200e>"
Whatever this "<200e>" thing is, it is evidently not UTF-8, because when I try to insert it into a MySQL database (deployed as a Docker image), I get the following error ...
web_1 | django.db.utils.OperationalError: Problem installing fixture '/app/maps/fixtures/seed_data.yaml': Could not load maps.Coop(pk=191): (1366, "Incorrect string value: '\\xE2\\x80\\x8E' for column 'web_site' at row 1")
Your row['website'] is already a Unicode string. UTF-8 can support all valid Unicode code points, so .encode('utf8','ignore') doesn't typically ignore anything and encodes the entire string in UTF-8, and .decode('utf8') changes it back to a Unicode string again.
If you simply want to strip non-ASCII characters, use the following to filter only ASCII characters and ignore the rest.
row['website'].encode('ascii','ignore').decode('ascii')
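Applied to the string from the question, this drops the invisible character (U+200E, LEFT-TO-RIGHT MARK):
>>> web_site = 'wbez.org\u200e'
>>> web_site.encode('ascii', 'ignore').decode('ascii')
'wbez.org'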
I think you are confusing the encodings.
Python has a standard character set: Unicode.
UTF-8 is just an encoding of Unicode. All characters in Unicode can be encoded in UTF-8, and all valid UTF-8 byte sequences can be decoded to Unicode characters.
So you are just encoding and then decoding Unicode strings, and the code should do nothing. (There are really only some exceptional cases: Python strings are really a superset of Unicode, so your code would just remove non-Unicode characters; see surrogateescape for such extremely seldom cases, which you usually encounter only when reading sys.argv or os.environ.)
In any case, I think you are doing this wrong. Search this site for the general question (e.g. "remove non-ascii characters"). It is often better to decompose (with NFKD, the compatibility decomposition), then remove the accents, and then remove the remaining non-ASCII characters, so that more characters get translated rather than discarded; a sketch follows below. There are various functions for creating slugs which do a better job, and there are also libraries that translate more characters into "nearly equivalent" ASCII characters (Unicode has various representations of LETTER A, and you may also want to translate Alpha and Aleph and so on into A; that's better than discarding, especially for foreign-language text, where otherwise you might discard everything).
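A minimal sketch of that decompose-then-strip approach (the helper name is just for illustration):
>>> import unicodedata
>>> def asciify(s):
...     # NFKD splits 'ã' into 'a' plus a combining tilde, so the base
...     # letter survives the ASCII filter instead of being discarded
...     return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
...
>>> asciify('Grüße from Lãnez')
'Grue from Lanez'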

Python - decoded unicode string does not stay decoded

It may be too late at night for me to be still doing programming (so apologies if this is a very silly thing to ask), but I have spotted a weird behaviour with string decoding in Python:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> name = bs.decode("utf-8", "replace")
>>> print(name)
I n t e l ( R )
>>> list_of_dict = []
>>> list_of_dict.append({'name': name})
>>> list_of_dict
[{'name': 'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00'}]
How can the list contain unicode characters if it has already been decoded?
Decoding bytes by definition produces text ("Unicode"; Unicode is how arbitrary text is stored, and Python uses it internally for all text). So asking "How can the list contain unicode characters if it has already been decoded?" betrays a fundamental misunderstanding of what Unicode is. If you have a str in Python 3, it's text, and that text is composed of a series of Unicode code points (with an unspecified internal encoding; in fact, modern Python stores strings as ASCII, latin-1, UCS-2 or UCS-4, depending on the highest ordinal value, and sometimes also caches a UTF-8 representation, or a native wchar representation for use with legacy extension modules).
You're seeing the repr of the nul character (Unicode ordinal 0) and thinking it didn't decode properly, and you're likely right (there's nothing illegal about nul characters, they're just not common in plain text); your input data is almost certainly encoded in UTF-16-LE, not UTF-8. Use the correct codec, and the text comes out correctly:
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> bs.decode('utf-16-le') # No need to replace things, this is legit UTF-16-LE
'Intel(R)'
>>> list_of_dict = [{'name': _}]
>>> list_of_dict
[{'name': 'Intel(R)'}]
Point is, while producing nul characters is legal, unless it's a binary file, odds are it won't have any, and if you're getting them, you probably picked the wrong codec.
The discrepancy between printing the str and displaying it as part of a list/dict arises because lists and dicts stringify with the repr of their contents (in many cases, what you'd type to reproduce the object programmatically), so the string is rendered with the \x00 escapes. Printing the str directly doesn't involve the repr, so the nul characters get rendered as spaces (there is no printable character for nul, so your terminal chose to render it that way).
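You can see the difference at the prompt (exactly how nul renders is terminal-dependent):
>>> s = 'a\x00b'
>>> s          # the repr: escapes are shown explicitly
'a\x00b'
>>> print(s)   # the str: the terminal decides how to show nul
ab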
So what I think is happening is that the null characters \x00 are not properly decoded and remain in the string after decoding. However, since these are null characters they do not mess things up when you print the string, which renders them as nothing or as spaces (in my case, testing your code on Arch Linux on Python 2 and Python 3, they were completely omitted).
Now, the thing is that you got a \x00 after each of your string's characters when you decoded with utf-8, which means your byte stream actually consists of 16-bit characters, not 8-bit ones. Therefore, if you decode using utf-16 your code will work like a charm :)
>>> bs = bytearray(b'I\x00n\x00t\x00e\x00l\x00(\x00R\x00)\x00')
>>> t = bs.decode("utf-16", "replace")
>>> print(t)
Intel(R)
>>> t
'Intel(R)'

Python unicode vs utf-8

I am building a string query (cypher query) to execute it against a database (Neo4J).
I need to concatenate some strings but I am having trouble with encoding.
I am trying to build a unicode string.
# -*- coding: utf-8 -*-
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = "+ repr(value)
print line.encode("utf-8")
I expected to have:
Name = "D'Santana Carlos Lãnez"
But I am getting:
Name = u"D'Santana Carlos L\xe3nez"
I imagine that repr is returning a Unicode literal, or probably I am not using the right function.
Python literal (repr) syntax is not a valid substitute for Cypher string literal syntax. The leading u is only one of the differences between them; notably, Cypher string literals don't have \x escapes, which Python will use for characters between U+0080–U+00FF.
If you need to create Cypher string literals from Python strings you would need to write your own string escaping function that writes output matching that syntax. But you should generally avoid creating queries from variable input. As with SQL databases, the better answer is query parameterisation.
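A minimal sketch of a parameterised query, assuming the official neo4j Python driver (the URI and credentials are placeholders; py2neo offers the same $-style parameters):
# -*- coding: utf-8 -*-
from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
with driver.session() as session:
    # the driver quotes and escapes the value itself; no repr() needed
    session.run("CREATE (p:Person {Name: $name})", name=u"D'Santana Carlos Lãnez")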
value = u"D'Santana Carlos Lãnez"
key = u"Name"
line = key + u" = " + value
print(line)
value is already Unicode because you used the u prefix in u"...", so you don't need repr() (or unicode() or decode()).
Besides, repr() doesn't convert to Unicode; it returns a string that is very useful for debugging - it shows the hex codes of non-ASCII characters, among other things.
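You can see that at the Python 2 prompt:
>>> value = u"D'Santana Carlos Lãnez"
>>> repr(value)
'u"D\'Santana Carlos L\\xe3nez"'
>>> print value
D'Santana Carlos Lãnez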

Python's .format() minilanguage and Unicode

I'm trying to use some of the simple unicode characters in a command line program I'm writing, but drawing these things into a table becomes difficult because Python appears to be treating single-character symbols as multi-character strings.
For example, if I try to print(u"\u2714".encode("utf-8")) I see the unicode checkmark. However, if I try to add some padding to that character (as one might in tabular structure), Python seems to be interpreting this single-character string as a 3-character one. All three of these lines print the same thing:
print("|{:1}|".format(u"\u2714".encode("utf-8")))
print("|{:2}|".format(u"\u2714".encode("utf-8")))
print("|{:3}|".format(u"\u2714".encode("utf-8")))
Now I think I understand why this is happening: it's a multibyte string. My question is, how do I get Python to pad this string appropriately?
Make your format strings unicode:
from __future__ import print_function
print(u"|{:1}|".format(u"\u2714"))
print(u"|{:2}|".format(u"\u2714"))
print(u"|{:3}|".format(u"\u2714"))
outputs:
|✔|
|✔ |
|✔ |
Don't encode('utf-8') at that point; do it later:
>>> u"\u2714".encode("utf-8")
'\xe2\x9c\x94'
The UTF-8 encoding is three bytes long. Look at how format works with Unicode strings:
>>> u"|{:1}|".format(u"\u2714")
u'|\u2714|'
>>> u"|{:2}|".format(u"\u2714")
u'|\u2714 |'
>>> u"|{:3}|".format(u"\u2714")
u'|\u2714 |'
Tested on Python 2.7.3.
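The underlying mismatch is easy to confirm: format was padding the three-byte encoded str, not the one-character unicode string:
>>> len(u"\u2714")
1
>>> len(u"\u2714".encode("utf-8"))
3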

strange python regex behavior - maybe connected to unicode or sqlalchemy

I'm trying to search for a pattern in sqlalchemy results (actually filtering by a 'like' or 'op'('regexp')(pattern), which I believe is implemented with a regex somewhere) - the string and the search string are both in Hebrew, and presumably (maybe I'm wrong) Unicode,
where r = u'לבן' and c = u'לבן, ורוד, '.
when I do re.search(r,c) I get the SRE.match object
but when I query the db like:
f = session.query(classname)
c = f[0].color
and c gives me:
'\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,'
or print (c):
לבן,ורוד,
practically the same, but running re.search(r,c) gives me no match object.
Since I suspected a Unicode issue I tried to convert to Unicode with unicode(c),
and I get 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)', which I guess means this is already a Unicode string - so where's the catch here?
I would prefer using the sqlalchemy 'like', but I get the same problem there, where I know for sure (as I showed in my example) that the data contains the string.
Should I transform the search string or the pattern somehow? Is this related to Unicode? Something else?
The db table (which I'm querying) has collation utf8_unicode_ci.
c = f[0].color
is not returning a Unicode string (or its repr() would show a u'...' kind of string), but a UTF-8 encoded string.
Try
c = f[0].color.decode("utf-8")
which results in
u'\u05dc\u05d1\u05df,\u05d5\u05e8\u05d5\u05d3,'
or
u'לבן,ורוד,'
if your console can display Hebrew characters.
'\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,' is the UTF-8 encoded representation of the string u'לבן, ורוד, '. So in the second example you should write re.search(r, c.decode('utf-8')).
Calling unicode(c) tries to do almost the same thing, except without an encoding parameter, which makes Python try the default ascii codec - hence the UnicodeDecodeError.
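Putting both answers together - decode once, then search the Unicode string with the Unicode pattern (a sketch using the strings from the question):
# -*- coding: utf-8 -*-
import re

r = u'לבן'
c = '\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,'
print re.search(r, c.decode('utf-8'))  # now returns a match object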
