Strange Python regex behavior - maybe connected to unicode or SQLAlchemy

I'm trying to search for a pattern in SQLAlchemy results (actually to filter by a 'like' or 'op'('regexp')(pattern), which I believe is implemented with a regex somewhere). The string and the search string are both in Hebrew, and presumably (maybe I'm wrong) unicode,
where r = u'לבן' and c = u'לבן, ורוד, '
When I do re.search(r, c) I get an SRE_Match object,
but when I query the db like:
f = session.query(classname)
c = f[0].color
and c gives me:
'\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,'
or print (c):
לבן,ורוד,
Practically the same, but running re.search(r, c) gives me no match object.
Since I suspected a unicode issue, I tried to convert to unicode with unicode(c)
and I get a 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)', which I guess means this is already a unicode string - so where's the catch here?
I would prefer using the SQLAlchemy 'like', but I get the same problem there, even where I know for sure (as I showed in my example) that the data contains the string.
Should I transform the search string/pattern somehow? Is this related to unicode? Something else?
The db table (which I'm querying) has collation utf8_unicode_ci.

c = f[0].color
is not returning a Unicode string (or its repr() would show a u'...' kind of string), but a UTF-8 encoded string.
Try
c = f[0].color.decode("utf-8")
which results in
u'\u05dc\u05d1\u05df,\u05d5\u05e8\u05d5\u05d3,'
or
u'לבן,ורוד,'
if your console can display Hebrew characters.
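A minimal sketch of the fix (assuming Python 2 and that f is the query result shown in the question); decode the DB value once and the original search works:
# -*- coding: utf-8 -*-
import re
r = u'לבן'                       # unicode pattern
c = f[0].color.decode('utf-8')   # UTF-8 bytes from the DB -> unicode
print re.search(r, c)            # now returns a match object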

'\xd7\x9c\xd7\x91\xd7\x9f,\xd7\x95\xd7\xa8\xd7\x95\xd7\x93,' is the UTF-8 encoded representation of the string u'לבן, ורוד, '. So in the second example you should write re.search(r, c.decode('utf-8')).
With unicode(c) you're trying to do almost the same thing, except without the encoding parameter, which makes Python fall back to the ASCII codec - hence the UnicodeDecodeError.
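For comparison, a short sketch of the difference (Python 2, where c is the byte string returned by the query):
unicode(c)             # no encoding given -> Python tries ASCII -> UnicodeDecodeError
unicode(c, 'utf-8')    # u'לבן,ורוד,'
c.decode('utf-8')      # same result as the line above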


Python 3 Decoding Strings

I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
Expected value of string:
'"my quote"'
One last point of note: I'm working with Python 3, so raw is unicode and thus already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode - Python 3 does that for you - but you need the correct character code for a plain double quote, ".
If, however, the text is in a different character set - it appears you have Windows-1252, where \x94 is a curly closing quote - then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string you have to encode it first, I found the latin-1 encoding to work:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
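A quick usage sketch of that round trip (Python 3; the exact console output depends on your terminal):
raw = "\x94my quote\x94"
fixed = raw.encode("latin-1").decode("windows-1252")
print(fixed)    # ”my quote”  (curly quotes)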
I don't know if this is what you meant, but this works:
some_binary = b"\x94my quote\x94"
result = some_binary.decode("windows-1252")   # \x94 is not valid UTF-8, so name the encoding
and you get the result.
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
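A hedged usage sketch - chardet only guesses, and short inputs can fool it, so treat the result as a hint:
import chardet
guess = chardet.detect(some_binary)       # e.g. {'encoding': 'Windows-1252', 'confidence': ...}
result = some_binary.decode(guess['encoding'] or 'latin-1', 'replace')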
Did you try it like this? I think you need to call decode as a method on the bytes object and pass 'utf-8' as the argument. Add a b in front of the string literal too (note that 'ignore' will simply drop the \x94 bytes).
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)

String encode/decode issue - missing character from end

I have an NVARCHAR column in my database. I am unable to convert the content of this column to a plain string in my code. (I am using pyodbc for the database connection.)
# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'
# prints something in Chinese
>>> print my_string
䅗䍇湥整⵲㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭㄰〶〶ㄵ㐲㔸ⴷㄴ〹㔭
The closest I have got is by encoding it to utf-16:
>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5
But the actual value that I need as per the value store in database is:
WAGCenter-04190517953516060503-20160605124857-4190-51
I tried encoding it to utf-8, utf-16, ascii, and utf-32, but nothing seemed to work.
Does anyone have an idea what I am missing, and how to get the desired result from my_string?
Edit: On converting it to utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end:
>>> print t.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5
On trying some other columns, it works. What might be the cause of this intermittent issue?
You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can only explain what you are seeing, but not why, nor how to fix it, without knowing:
the type of the database
the way you input values in it
the way you extract values to obtain your pseudo unicode string
the actual content if you use direct (native) database access
What you get is an ASCII string where the 8-bit characters are grouped in pairs to build 16-bit unicode characters in little-endian order. As the expected string has an odd number of characters, the last character was (irremediably) lost in translation, because the original string ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 for '5'. Demo:
def cvt(ustring):
    l = []
    for uc in ustring:
        l.append(chr(ord(uc) & 0xFF))         # low order byte
        l.append(chr((ord(uc) >> 8) & 0xFF))  # high order byte
    return ''.join(l)

>>> cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
The issue was that I was using UTF-16 in my odbcinst.ini file, whereas I had to use the UTF-8 character encoding.
Earlier I was changing it via an OPTION parameter while making the connection in pyodbc, but changing it in the odbcinst.ini file fixed the issue.

Emoticons in python string - \xF0\x9F\x92\x96 \xF0 [duplicate]

Look at the following:
/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string
value: '\xF0\x9F\x91\x8A\xF0\x9F...' for column 't_content' at row 1
n = self.cursor.execute(self.sql, (item['topic'], item['url'], item['content']))
The string '\xF0\x9F\x91\x8A' is actually the UTF-8 encoding of a single 4-byte unicode character (u'\U0001f44a'). MySQL's character set is utf-8, but inserting a 4-byte unicode character truncates the inserted string.
I googled the problem and found that MySQL below 5.5.3 doesn't support 4-byte unicode, and unfortunately mine is 5.5.224.
I don't want to upgrade the mysql server, so I just want to filter the 4-byte unicode in python, I tried to use regular expression but failed.
So, any help?
If MySQL cannot handle UTF-8 sequences of 4 bytes or more, then you'll have to filter out all unicode characters at or above codepoint \U00010000; UTF-8 encodes codepoints below that threshold in 3 bytes or fewer.
You could use a regular expression for that:
>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
Alternatively, you could use the .translate() function with a mapping table that only contains None values:
>>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
>>> example.translate(nohigh)
u'Some example text with a sleepy face: '
However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.
This all presumes you are using a UCS-4 compiled python. If your python was compiled with UCS-2 support then you can only use codepoints up to '\U0000ffff' in regular expressions and you'll never run into this problem in the first place.
I note that as of MySQL 5.5.3 the newly added utf8mb4 codec does support the full Unicode range.
I think you should use the utf8mb4 character set instead of utf8 and run
SET NAMES utf8mb4
after connecting to the DB.
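A minimal connection sketch that has the same effect (assuming a MySQLdb/pymysql build recent enough to know utf8mb4; host and credentials are placeholders):
import MySQLdb
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", charset="utf8mb4", use_unicode=True)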
Simple normalization of the string, without regex or translate:
def normalize_unicode(s):
    # replace anything outside the BMP with U+FFFD, the replacement character
    return u''.join(unichr(k) if k < 0x10000 else u'\ufffd' for k in (ord(c) for c in s))
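Usage sketch (on a wide/UCS-4 Python 2 build):
>>> normalize_unicode(u'Some example text with a sleepy face: \U0001f62a')
u'Some example text with a sleepy face: \ufffd'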

Reading unicode from sqlite db using python

The data stored as unicode in the database has to be retrieved and converted into a different form.
The following snippet
import sqlite3

def convert(content):
    content = content.replace("ஜௌ", "n\[s")
    return content

mydatabase = "database.db"
connection = sqlite3.connect(mydatabase)
cursor = connection.cursor()
query = ''' select unicode_data from table1'''
cursor.execute(query)
for row in cursor.fetchone():
    print convert(row)
yields the following error message in convert method.
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in
position 0: ordinal not in range(128)
If the database content is "ஜௌஜௌஜௌ", the output should be "n\[sn\[sn\[s"
The documentation suggests using ignore or replace to avoid the error when creating the unicode string.
when the iteration is changed as follows:
for row in cursor.fetchone():
    print convert(unicode(row, errors='replace'))
it returns
exceptions.TypeError: decoding Unicode is not supported
which indicates that row is already unicode.
Any light on this to make it work is highly appreciated. Thanks in advance.
content = content.replace("ஜௌ", "n\[s");
I suggest you mean:
content = content.replace(u'ஜௌ', ur'n\[s');
or for safety where the encoding of your file is uncertain:
content = content.replace(u'\u0B9C\u0BCC', ur'n\[s');
The content you have is already Unicode, so you should do Unicode string replacements on it. "ஜௌ" without the u prefix is a string of bytes that represents those characters in some encoding dependent on your source file's charset. (Byte strings work smoothly together with Unicode strings only in the most unambiguous case, which is ASCII characters.)
(The r-string means not having to worry about including bare backslashes.)
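Putting the pieces together, a minimal working sketch (Python 2; the table and column names are taken from the question):
# -*- coding: utf-8 -*-
import sqlite3

def convert(content):
    # content comes back from sqlite3 as unicode, so replace with unicode literals
    return content.replace(u'\u0B9C\u0BCC', ur'n\[s')

connection = sqlite3.connect("database.db")
cursor = connection.cursor()
cursor.execute("select unicode_data from table1")
for (value,) in cursor.fetchall():
    print convert(value)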

Python encoding issue

I'm working with a Python script that gets a raw string with UTF-8 encoding. First I decode it from UTF-8, then some processing is done, and at the end I encode it back to UTF-8 and insert it into the DB (MySQL), but the characters in the DB are not stored in their real form.
str = '<term>Beiträge</term>'
str = str.decode('utf8')
...
...
...
str = str.encode('utf8')
After that, the string is found in a text file in its real form, but in the MySQL DB I find it like this:
<term>"Beiträge</term>
any idea why this happened? :-(
Assuming you are using the MySQLdb library, you need to create connections using the keyword arguments:
use_unicode
If True, text-like columns are returned as unicode objects using the connection's character set. Otherwise, text-like columns are returned as normal strings. Unicode objects will always be encoded to the connection's character set regardless of this setting.
&
charset
If supplied, the connection character set will be changed to this character set (MySQL-4.1 and newer). This implies use_unicode=True.
You should also check the encoding of your db tables.
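A connection sketch along those lines (assuming the MySQLdb driver; the host, credentials, and table name are placeholders):
# -*- coding: utf-8 -*-
import MySQLdb
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                       db="mydb", charset="utf8", use_unicode=True)
cursor = conn.cursor()
# pass a unicode object and let the driver encode it with the connection charset
cursor.execute("INSERT INTO terms (term) VALUES (%s)", (u'<term>Beiträge</term>',))
conn.commit()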
To make a string a Unicode string you should use the string prefix 'u'. See also http://docs.python.org/reference/lexical_analysis.html#literals
Maybe your example works by just adding the prefix in the initial assignment.
