I have an NVARCHAR column in my database, and I am unable to convert its content to a plain string in my code. (I am using pyodbc for the database connection.)
# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'
# prints something in Chinese
>>> print my_string
䅗䍇湥整㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭〶〶ㄵ㐲㔸ⴷㄴ〹㔭
The closest I have come is by encoding it to UTF-16:
>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5
But the actual value that I need, as per the value stored in the database, is:
WAGCenter-04190517953516060503-20160605124857-4190-51
I tried encoding it to utf-8, utf-16, ascii and utf-32, but nothing seemed to work.
Does anyone have an idea of what I am missing, and how to get the desired result from my_string?
Edit: On converting it to utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end:
>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5
For some other columns it works. What might be the cause of this intermittent issue?
You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can only explain what you are seeing; I cannot say why it happens or how to fix it without knowing:
the type of the database
the way you input values in it
the way you extract values to obtain your pseudo unicode string
the actual content if you use direct (native) database access
What you get is an ASCII string whose 8-bit characters have been grouped in pairs to build 16-bit Unicode characters in little-endian order. As the expected string has an odd number of characters, the last character was (irremediably) lost in translation, because the original string ends with u'\352d', where 0x2d is the ASCII code for '-' and 0x35 for '5'. Demo:
def cvt(ustring):
    l = []
    for uc in ustring:
        l.append(chr(ord(uc) & 0xFF))         # low order byte
        l.append(chr((ord(uc) >> 8) & 0xFF))  # high order byte
    return ''.join(l)

>>> cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
The issue was that I was using UTF-16 in my odbcinst.ini file, whereas I had to use the UTF-8 character encoding.
Earlier I was changing it as an OPTION parameter while making the connection with pyodbc, but changing it in the odbcinst.ini file is what fixed the issue.
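For anyone who cannot edit odbcinst.ini, a similar result can usually be reached on the connection itself. This is only a sketch, assuming pyodbc 4 or later (which provides setdecoding) and a hypothetical DSN; adjust the encoding to whatever your driver actually sends:

import pyodbc

conn = pyodbc.connect('DSN=mydsn;UID=user;PWD=secret')
# Tell pyodbc how the driver really returns (N)VARCHAR data, so the bytes
# are decoded once, correctly, instead of being repacked into UTF-16 units.
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')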
Related
Ok, I've found a lot of threads about how to convert a string from something like "\xe3" to "ã", but how the hell am I supposed to do it the other way around?
My concrete problem: I am using an API and everything works great, except that I provide some strings which then result in a JSON object. The result is sorted by the names (strings) I provided; however, they are returned as their Unicode representation, since JSON APIs always work in pure strings. So all I need is a way to get from "ã" to "\xe3", but I can't for the love of god get it to work.
Every type of encoding or decoding I try either defaults back to a normal string, a string without that character, a string with a plain A, or a UnicodeDecodeError saying ASCII can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)
All I want is the plain encoded string!
(Yeah, no, not at all, past me. All you want is the Unicode representation of a character as a string.)
PS: All in python if that wasn't obvious from the title already.
Edit: Even though this is quite old, I wanted to update it so as not to completely embarrass myself in the future.
The issue was an API which returned Unicode representations of characters, as strings, in its response. All I wanted to do was check whether they were the same; however, I had major issues getting Python to interpret the string as Unicode, especially since those characters were just a few inside a longer text, partly with backslashes.
This did help, but I just stumbled across this horribly written question and couldn't leave it like that.
"\xe3" in python is a string literal that represents a single byte with value 227:
>>> print len("\xe3")
1
>>> print ord("\xe3")
227
This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):
>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163
This byte sequence is the UTF-8 encoding of the character "ã".
So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:
>>> "ã".decode("utf-8")
u'\xe3'
Now, you can take that unicode string and encode it however you like (e.g. into latin-1):
>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'
Please read http://www.joelonsoftware.com/articles/Unicode.html. You should realize there is no such thing as "a plain encoded string"; there is only "an encoded string in a given text encoding". So you really need to understand the concepts of Unicode better.
Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.
Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the Unicode code point for the character ã. Not to be mistaken for the actual bytes that UTF-8 uses to reference that code point:
http://hexutf8.com/?q=U+e3
I.e. UTF-8 maps the byte sequence c3 a3 to the code point U+e3 which represents the character ã.
UTF-16 maps a different byte sequence, 00 e3 to that exact same code point. (Note how much simpler, but less space efficient the UTF-16 encoding is...)
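To make the byte-level difference concrete, here is what those encodings produce for that single code point in an interactive Python 2 session (utf-16-be is used so the bytes appear without a BOM):

>>> u'\xe3'.encode('utf-8')
'\xc3\xa3'
>>> u'\xe3'.encode('utf-16-be')
'\x00\xe3'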
I read a text file which has some characters like '\260' (it means '°'), and then I add the data to a DB (sqlite3).
After that, I try to get the information back from the DB, but the SQL query is built with '\xb0' (it also means '°'), because I get this information from an XML file.
I tried to replace the hex characters with octal characters: text = text.replace(r'\xb0', '\260'), but it doesn't work. Why? I cannot build a correct SQL query.
Maybe there are some solutions for this problem, e.g. encode, decode, etc.?
\260 is the same thing as \xb0:
>>> '\xb0'
'\xb0'
>>> '\260'
'\xb0'
You probably want to decode your input to unicode and insert that instead. If your data is encoded as Latin-1, then decode it:
>>> print '\xb0'.decode('latin1')
°
sqlite3 can handle unicode just fine, and by decoding you make sure you are handling text values, not byte values, which can differ from codec to codec.
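A minimal sketch of that round trip, assuming the file data really is Latin-1 encoded (the table name and values here are made up for illustration):

import sqlite3

raw = 'temperature: 20\260'       # byte string as read from the text file
text = raw.decode('latin1')       # u'temperature: 20\xb0'

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (value TEXT)')
conn.execute('INSERT INTO readings (value) VALUES (?)', (text,))

# The value coming from the XML side is already u'\xb0', so the two match:
row = conn.execute('SELECT value FROM readings WHERE value = ?',
                   (u'temperature: 20\xb0',)).fetchone()
print row[0].encode('utf-8')      # temperature: 20°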
I have a function like this:
def convert_to_unicode(data):
    row = {}
    if data == None:
        return data
    try:
        for key, val in data.items():
            if isinstance(val, str):
                row[key] = unicode(val.decode('utf8'))
            else:
                row[key] = val
        return row
    except Exception, ex:
        log.debug(ex)
to which I feed a result set (obtained using MySQLdb.cursors.DictCursor) row by row, to transform all the string values to unicode (for example, {'column_1':'XXX'} becomes {'column_1':u'XXX'}).
Problem is one of the rows has a value like {'column_1':'Gabriel García Márquez'}
and it does not get transformed. it throws this error:
'utf8' codec can't decode byte 0xed in position 12: invalid continuation byte
When I searched for this, it seemed to have to do with ASCII encoding.
The solutions I tried are:
adding # -*- coding: utf-8 -*- at the beginning of my file ... does not help
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('utf8', 'ignore')) ... as expected it ignores the non-ascii character and returns {'column_1':u'Gabriel Garca Mrquez'}
changing the line row[key] = unicode(val.decode('utf8')) to row[key] = unicode(val.decode('latin-1')) ... does the job, but I am afraid it will support only Western European characters (as per the link here)
Can anybody point me in the right direction, please?
Firstly:
The data you're getting in your result set is clearly latin-1 encoded, or you wouldn't be observing this behavior. It is entirely correct that trying to decode a latin-1-encoded byte string as though it were utf-8-encoded blows up in your face. Once you have a latin-1-encoded byte string foo, if you want to convert it to the unicode type, foo.decode('latin1') is the right thing to do.
I noticed the expression unicode(val.decode('utf8')) in your code. This is equivalent to just val.decode('utf8'); calling the .decode method of a byte string converts it to unicode, so you're calling unicode() on a unicode string, which just returns the unicode string.
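For illustration, a quick interactive check of that first point; the byte values below are only inferred from the error message (0xed is í and 0xe1 is á in latin-1):

>>> foo = 'Gabriel Garc\xeda M\xe1rquez'
>>> foo.decode('latin1')
u'Gabriel Garc\xeda M\xe1rquez'
>>> print foo.decode('latin1')
Gabriel García Márquez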
Secondly:
Your real problem here - if you want to be able to deal with characters not included in the character set supported by the latin-1 encoding - is not with Python's string types, per se, so much as it is with the MySQLdb library. I don't know this problem in intimate detail, but as I understand it, in ancient versions of MySQL, the default encoding used by MySQL databases was latin-1, but now it is utf-8 (and has been for many years). The MySQLdb library, however, still by default establishes latin-1-encoded connections with the database. There are literally dozens of StackOverflow questions relating to MySQL, Python, and string encoding, and while I don't fully understand them, one easy-to-use solution to all such problems that seems to work for people is this one:
http://www.dasprids.de/blog/2007/12/17/python-mysqldb-and-utf-8
I wish I could give you a more comprehensive and confident answer on the MySQLdb issue, but I've never even used MySQL and I don't want to risk posting anything untrue. Perhaps someone can come along and provide more detail. Nonetheless, I hope this helps you.
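For reference, one commonly suggested form of that connection-level fix is a sketch along these lines (hypothetical credentials; charset and use_unicode are standard MySQLdb.connect keyword arguments):

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret',
                       db='mydb', charset='utf8', use_unicode=True,
                       cursorclass=MySQLdb.cursors.DictCursor)
# With use_unicode=True the cursor already returns unicode objects,
# so the convert_to_unicode() pass over each row should become unnecessary.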
Your third solution - changing the encoding to "latin-1" - is correct. Your input data is encoded as Latin-1, so that's what you have to decode it as. Unless someone somewhere did something very silly, it should be impossible for that input data to contain invalid characters for that encoding.
I need to insert a series of names (like 'Alam\xc3\xa9') into a list, and then I have to save them into a SQLite database.
I know that I can render these names correctly by typing:
print eval(repr(NAME)).decode("utf-8")
But I have to insert them into a list, so I can't use print.
Is there another way of doing this without print?
Lots and lots of misconceptions here.
The string you quote is not Unicode. It is a byte string, encoded in UTF-8.
You can convert it to Unicode by decoding it:
unicode_name = name.decode('utf-8')
When you print the value of unicode_name to the console, you will see one of two things:
>>> unicode_name
u'Alam\xe9'
>>> print unicode_name
Alamé
Here, you can see that just typing the name and pressing enter shows a representation of the Unicode code points. This is the same as typing print repr(unicode_name). However, doing print unicode_name prints the actual characters; i.e. behind the scenes, it encodes the string to the correct encoding for your terminal and prints the result.
But this is all irrelevant, because Unicode strings can only be represented internally. As soon as you want to store it in a database, or a file, or anywhere, you need to encode it. And the most likely encoding to choose is UTF-8 - which is what it was in originally.
>>> name
'Alam\xc3\xa9'
>>> print name
Alamé
As you can see, using the original non-decoded version of the name, repr and print once again show the codes and the characters. So it's not that converting it to Unicode actually makes it any more "really" the correct character.
So, what to do if you want to store it in a database? Nothing. Nothing at all. Sqlite accepts UTF-8 input, and stores its data in UTF-8 format on the disk. So there is absolutely no conversion needed to store the original value of name in the database.
Are you looking for something like this?
[n.decode("utf-8") for n in ['Alam\xc3\xa9', 'Alam\xc3\xa9', 'Alam\xc3\xa9']]
I've been reading all the questions regarding conversion from Unicode to CSV in Python here on StackOverflow and I'm still lost. Every time, I receive a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)"
buffer = cStringIO.StringIO()
writer = csv.writer(buffer, csv.excel)
cr.execute(query, query_param)
while (1):
    row = cr.fetchone()
    writer.writerow([s.encode('ascii', 'ignore') for s in row])
The value of row is
(56, u"LIMPIADOR BA\xd1O 1'5 L")
where the value \xd1 in the database is ñ, an n with a tilde used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I am now simply trying to ignore those characters (I suppose I'd have the same problem with accented vowels).
I'd like to save the value to the CSV, preferably with the ñ ("LIMPIADOR BAÑO 1'5 L"), but if that is not possible, at least be able to save it without it ("LIMPIADOR BAO 1'5 L").
Correct, ñ is not a valid ASCII character, so you can't encode it to ASCII. So you can, as your code above does, ignore it. Another way, namely removing the accents, can be found here:
What is the best way to remove accents in a Python unicode string?
But note that both techniques can result in bad effects, like making words actually mean something different, etc. So the best option is to keep the accents, and then you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1, or ISO-8859-1, is a common one, but it includes only Western European characters. CP-1252 is common on Windows, etc., etc.
So just switch "ascii" for whatever encoding you want.
Your actual code, according to your comment is:
writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
where
row = (56, u"LIMPIADOR BA\xd1O 1'5 L")
Now, I believe that should work, but apparently it doesn't. I think unicode gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:
col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row)
Now your real error will not be hidden by the fact that you stuck everything on one line. This could probably also have been avoided if you had included a proper traceback.
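For example, with the row from the question, the encoding step of the unwrapped version behaves like this in an interactive session (shown only to illustrate what the bytes become):

>>> col1, col2 = (56, u"LIMPIADOR BA\xd1O 1'5 L")
>>> col2.encode('utf8')
"LIMPIADOR BA\xc3\x91O 1'5 L"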