I'm currently persisting filenames in a sqlite database for my own purposes. Whenever I try to insert a file that has a special character (like é etc.), it throws the following error:
pysqlite2.dbapi2.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
When I do "switch my application over to Unicode strings" by wrapping the value sent to pysqlite with the unicode method like: unicode(filename), it throws this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 66: ordinal not in range(128)
Is there something I can do to get rid of this? Modifying all of my files to conform isn't an option.
UPDATE
If I decode the text via filename.decode("utf-8"), I'm still getting the ProgrammingError above.
My actual code looks like this:
cursor.execute("select * from musiclibrary where absolutepath = ?;",
[filename.decode("utf-8")])
What should my code here look like?
You need to specify the encoding when converting filename to Unicode, for example: filename.decode('utf-8'). Plain unicode(filename) falls back to the default encoding, which is almost always ascii and fails on any non-ASCII byte.
You should pass the arguments of your SQL statement as Unicode.
Now, it all depends on how you obtain the filename list. Perhaps you're reading the filesystem with os.listdir or os.walk? If so, you can get the filenames directly as Unicode simply by passing a Unicode argument to either function:
Examples:
os.listdir(u'.')
os.walk(u'.')
Of course, you can substitute the u'.' directory with the actual directory whose contents you are reading. Just make sure it's a Unicode string.
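The difference is easiest to see in Python 3, where the same str-versus-bytes split still exists for paths (a minimal sketch using a throwaway temp directory, not the asker's music library):

```python
import os
import tempfile

# Hypothetical demo directory. In Python 3 a str path yields str
# (Unicode) names, while a bytes path yields bytes names that must be
# decoded explicitly; the same split that Python 2's unicode-vs-str
# argument gave os.listdir and os.walk.
d = tempfile.mkdtemp()
open(os.path.join(d, "café.txt"), "w").close()

unicode_names = os.listdir(d)               # list of str, already Unicode
byte_names = os.listdir(os.fsencode(d))     # list of bytes, raw
decoded = [os.fsdecode(name) for name in byte_names]

assert "café.txt" in unicode_names
assert sorted(decoded) == sorted(unicode_names)
```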
Have you tried passing the unicode string directly:
cursor.execute("select * from musiclibrary where absolutepath = ?;",(u'namé',))
You will need to add the file encoding at the beginning of the script:
# coding: utf-8
You figured this out already, but:
I don't think you could actually get that ProgrammingError exception from cursor.execute("select * from musiclibrary where absolutepath = ?;", [filename.decode("utf-8")]), as the question currently states.
Either the utf-8 decode would explode, or the cursor.execute call would be happy with the result.
Try to change to this:
cursor.execute("select * from musiclibrary where absolutepath = ?;",
[unicode(filename,'utf8')])
If your filenames were not encoded in UTF-8 to begin with, change 'utf8' to their actual encoding.
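In Python 3 terms (where str is already Unicode and sqlite3 binds it natively), the decode-then-bind pattern looks like this; the table name and filename are stand-ins for the asker's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table musiclibrary (absolutepath text)")

raw = b"caf\xc3\xa9.mp3"        # bytes as read from a UTF-8 filesystem
path = raw.decode("utf-8")      # decode once, then bind as a parameter
conn.execute("insert into musiclibrary values (?)", [path])

row = conn.execute(
    "select * from musiclibrary where absolutepath = ?", [path]).fetchone()
assert row == ("café.mp3",)
```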
Related
I need to call a MySQL stored procedure from my Python script. As one of the parameters I'm passing a Unicode string (Russian text), but I get an error:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256)
My script:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName")
self.cursor=self.db.cursor()
args=("какой-то текст") #this is string in russian
self.cursor.callproc('pr_MyProc', args)
self.cursor.execute('SELECT #_pr_MyProc_2') #getting result from sp
result=self.cursor.fetchone()
self.db.commit()
I've read that setting charset='utf8' should resolve this problem, but when I use:
self.db=MySQLdb.connect("localhost", "usr", "pass", "dbName", charset='utf8')
This gives me another error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1' in position 20: surrogates not allowed
I've also tried setting the parameter use_unicode=True, but that isn't working either.
More things to check on:
http://mysql.rjweb.org/doc.php/charcoll#python
Likely items:
Start code file with # -*- coding: utf-8 -*- -- (for literals in code)
Literals should be u'...'
Can you extract the hex? какой-то текст should be this in UTF-8: D0BA D0B0 D0BA D0BE D0B9 2D D182 D0BE 20 D182 D0B5 D0BA D181 D182
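If it helps, here's one way to extract that hex (Python 3 shown, printing byte by byte rather than grouped per character):

```python
# Dump the UTF-8 bytes of the Russian text as uppercase hex, one byte
# at a time, to compare against what the database actually stored.
text = u"какой-то текст"
hex_dump = " ".join("{:02X}".format(b) for b in text.encode("utf-8"))
print(hex_dump)
```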
Here are some thoughts; maybe not a full answer. I've played with Python/MySQL/utf-8/Unicode in the past, and these are the things I remember:
Looking at Saltstack mysql module's comment :
https://github.com/saltstack/salt/blob/develop/salt/modules/mysql.py#L314-L322
# MySQLdb states that this is required for charset usage
# but in fact it's more than it's internally activated
# when charset is used, activating use_unicode here would
# retrieve utf8 strings as unicode() objects in salt
# and we do not want that.
#_connarg('connection_use_unicode', 'use_unicode')
connargs['use_unicode'] = False
_connarg('connection_charset', 'charset')
We see that use_unicode is set to False to avoid altering the result strings, while the charset (which could be utf-8) is passed as a parameter. use_unicode is more of a 'request' to get responses back as unicode strings.
You can check real usage in the tests, here:
https://github.com/saltstack/salt/blob/develop/tests/integration/modules/test_mysql.py#L311-L361 with a database named '標準語'.
Now, about the message UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd1': you are passing unicode but telling the module it is utf-8. It is not utf-8 until you encode your unicode string as utf-8.
Maybe you should try with:
args=(u"какой-то текст".encode('utf-8'))
At least in Python 3 this is required, because your "какой-то текст" is not utf-8 by default; it is a Unicode string until you encode it.
The MySQLdb module is not compatible with Python 3, which might be why you are getting problems. I would advise using a different connector, such as PyMySQL or mysqlclient.
Related: 23376103.
Maybe you can reload sys with a utf-8 default encoding and decode the string into utf-8 as follows:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
...
stringUtf8 = u''.join(string_original).decode('utf-8')
I had a similar problem very recently, but with PostgreSQL. After trying tons of suggestions from SO and elsewhere, I realized the issue was with my database: for some reason it was not allowing me to change the database's default collation, so I had to drop the database and reinstall Postgres. I was in a hurry and couldn't find a better solution, but I'd recommend the same, since I was only just starting my application in its deployment environment.
All the best.
What's your database's charset? Use:
show variables like "character%";
to check your database's charset.
I see two problems here.
You have unicode, but you try to declare it as utf-8 by setting the charset parameter. You should first encode your unicode string to utf-8 (or whatever encoding your data actually uses).
If that doesn't work, try the init_command='SET NAMES UTF8' parameter.
So it will look like:
conn = MySQLdb.connect(charset='utf8', init_command='SET NAMES UTF8')
You can try also this:
cursor = db.cursor()
cursor.execute("SET NAMES UTF8;")
I encountered a similar issue, which was caused by invalid utf-8 data in the database; it seems that MySQL doesn't care about that, but Python does, because it's following the UTF-8 spec, which says that:
surrogate pairs are not allowed in utf-8
unpaired surrogates are not allowed in utf-8
If you want to "make it work", you'll have to intercept the MySQL packet and use your own converter which will perform ad-hoc replacements.
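Both rules can be seen at work in Python 3: strict UTF-8 refuses a lone surrogate, while the surrogatepass handler round-trips it anyway:

```python
s = "abc\udcd1"                 # a lone (unpaired) surrogate
try:
    s.encode("utf-8")           # strict UTF-8 rejects surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# "surrogatepass" encodes and decodes them anyway
raw = s.encode("utf-8", "surrogatepass")
assert raw.decode("utf-8", "surrogatepass") == s
```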
Here's one way to "handle" invalid data containing surrogates:
def borked_utf8_decode(data):
    """
    Work around input with unpaired surrogates or surrogate pairs,
    replacing by XML char refs: look for "&#\d+;" after.
    """
    return data.decode("utf-8", "surrogatepass") \
               .encode("utf-8", "xmlcharrefreplace") \
               .decode("utf-8")
Note that the proper way to handle that is context-dependent, but there are some common replacement scenarios, like this one.
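For a concrete sanity check, here's what such a converter produces for bytes containing an encoded lone surrogate (the helper is repeated so the snippet runs standalone):

```python
def borked_utf8_decode(data):
    # decode leniently, then re-encode strictly with XML char refs
    return (data.decode("utf-8", "surrogatepass")
                .encode("utf-8", "xmlcharrefreplace")
                .decode("utf-8"))

# "abc" followed by a lone surrogate, encoded with surrogatepass,
# standing in for invalid data handed back by MySQL
dirty = "abc\udcd1".encode("utf-8", "surrogatepass")
clean = borked_utf8_decode(dirty)

assert clean.startswith("abc&#")    # the surrogate became an XML char ref
```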
And here's one way of plugging this into pymysql (another way is to monkey-patch field processing, see eg. https://github.com/PyMySQL/PyMySQL/issues/631):
import pymysql.converters
# use this in your connection
pymysql_use_unicode = False
conversions = pymysql.converters.conversions
conversions[pymysql.converters.FIELD_TYPE.STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VAR_STRING] = borked_utf8_decode
conversions[pymysql.converters.FIELD_TYPE.VARCHAR] = borked_utf8_decode
The Unicode data stored in the database has to be retrieved and converted into a different form.
The following snippet
import sqlite3

def convert(content):
    content = content.replace("ஜௌ", "n\[s")
    return content

mydatabase = "database.db"
connection = sqlite3.connect(mydatabase)
cursor = connection.cursor()
query = '''select unicode_data from table1'''
cursor.execute(query)
for row in cursor.fetchone():
    print convert(row)
yields the following error message in convert method.
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in
position 0: ordinal not in range(128)
If the database content is "ஜௌஜௌஜௌ", the output should be "n\[sn\[sn\[s"
The documentation suggests to use ignore or replace to avoid the error, when creating the unicode string.
When the iteration is changed as follows:
for row in cursor.fetchone():
    print convert(unicode(row, errors='replace'))
it returns
exceptions.TypeError: decoding Unicode is not supported
which indicates that row is already unicode.
Any light on this to make it work is highly appreciated. Thanks in advance.
content = content.replace("ஜௌ", "n\[s");
Presumably you mean:
content = content.replace(u'ஜௌ', ur'n\[s');
or for safety where the encoding of your file is uncertain:
content = content.replace(u'\u0B9C\u0BCC', ur'n\[s');
The content you have is already Unicode, so you should do Unicode string replacements on it. "ஜௌ" without the u is a string of bytes that represents those characters in some encoding dependent on your source file charset. (Byte strings work smoothly together with Unicode strings only in the most unambiguous cases, which is for ASCII characters.)
(The r-string means not having to worry about including bare backslashes.)
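For completeness, the fixed replacement in Python 3, where the u and ur prefixes are unnecessary and a plain r-string suffices:

```python
# The database content from the question, as a Unicode string
content = u"ஜௌஜௌஜௌ"

# Replace the two-codepoint Tamil sequence with the ASCII target;
# escaped codepoints avoid any source-file encoding ambiguity
out = content.replace(u'\u0B9C\u0BCC', r'n\[s')

assert out == r"n\[sn\[sn\[s"
```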
I am a newbie in python.
I have a unicode in Tamil.
When I use the sys.getdefaultencoding() I get the output as "Cp1252"
The problem is that when I use text = testString.decode("utf-8") I get the error "UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>"
When I use the
sys.getdefaultencoding() I get the
output as "Cp1252"
Two comments on that: (1) it's "cp1252", not "Cp1252". Don't type from memory. (2) Whoever caused sys.getdefaultencoding() to produce "cp1252" should be told politely that that's not a very good idea.
As for the rest, let me guess. You have a unicode object that contains some text in the Tamil language. You try, erroneously, to decode it. Decode means to convert from a str object to a unicode object. Unfortunately you don't have a str object, and even more unfortunately you get bounced by one of the very few awkish/perlish warts in Python 2: it tries to make a str object by encoding your unicode string using the system default encoding. If that's 'ascii' or 'cp1252', encoding will fail. That's why you get a UnicodeEncodeError instead of a UnicodeDecodeError.
Short answer: do text = testString.encode("utf-8"), if that's what you really want to do. Otherwise please explain what you want to do, and show us the result of print repr(testString).
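The direction of the two conversions, sketched in Python 3 terms, where bytes plays the role of Python 2's str:

```python
s = u"abc\u0b9c"            # already Unicode (ends with Tamil letter JA)

raw = s.encode("utf-8")     # unicode -> bytes is *encode*
assert isinstance(raw, bytes)

assert raw.decode("utf-8") == s   # bytes -> unicode is *decode*
```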
Add this as the first line of your code:
# -*- coding: utf-8 -*-
Later in your code:
text = unicode(testString, "UTF-8")
You need to know which character encoding testString uses. If it's not utf-8, decode('utf8') will raise an error.
I am working with external data that's encoded in latin1. So I've added a sitecustomize.py and in it added
sys.setdefaultencoding('latin_1')
sure enough, now working with latin1 strings works fine.
But, in case I encounter something that is not encoded in latin1:
s=str(u'abc\u2013')
I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
What I would like is for the undecodable characters simply to be ignored, i.e. I would get s == 'abc?' in the above example, and to do that without explicitly calling decode() or encode() each time, i.e. without s.decode(..., 'replace') on each call.
I tried doing different things with codecs.register_error but to no avail.
please help?
There is a reason scripts can't call sys.setdefaultencoding. Don't do that, some libraries (including standard libraries included with Python) expect the default to be 'ascii'.
Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.
Explicit decoding takes a parameter specifying behavior for undecodable bytes.
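For example, the asker's desired 'abc?' result falls out of an explicit encode with the replace handler, no default-encoding tricks required:

```python
s = u"abc\u2013"                          # en dash, not representable in latin-1
encoded = s.encode("latin-1", "replace")  # per-call error policy, made explicit
assert encoded == b"abc?"
```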
You can define your own custom handler and use it instead to do as you please. See this example:
import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print( b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler') )
print( codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler") )
Running it, you will see:
invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
References:
https://docs.python.org/3/library/codecs.html#codecs.register_error
https://docs.python.org/3/library/exceptions.html#UnicodeError
OK, I have a hardcoded string I declare like this:
name = u"Par Catégorie"
I have a # -*- coding: utf-8 -*- magic header, so I am guessing it's converted to utf-8.
Down the road it's output to XML through
xml_output.toprettyxml(indent='....', encoding='utf-8')
And I get a
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
Most of my data is in French and is output correctly in CDATA nodes, but that one hardcoded string keeps failing. I don't see why an ascii codec is called. What's wrong?
The coding header in your source file tells Python what encoding your source is in. It's the encoding Python uses to decode the source of the unicode string literal (u"Par Catégorie") into a unicode object. The unicode object itself has no encoding; it's raw unicode data. (Internally, Python will use one of two encodings, depending on how it was configured, but Python code shouldn't worry about that.)
The UnicodeDecodeError you get means that somewhere, you are mixing unicode strings and bytestrings (normal strings.) When mixing them together (concatenating, performing string interpolation, et cetera) Python will try to convert the bytestring into a unicode string by decoding the bytestring using the default encoding, ASCII. If the bytestring contains non-ASCII data, this will fail with the error you see. The operation being done may be in a library somewhere, but it still means you're mixing inputs of different types.
Unfortunately the fact that it'll work just fine as long as the bytestrings contain just ASCII data means this type of error is all too frequent even in library code. Python 3.x solves that problem by getting rid of the implicit conversion between unicode strings (just str in 3.x) and bytestrings (the bytes type in 3.x.)
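In Python 3 the same mix fails fast with a TypeError instead of a late UnicodeDecodeError (a minimal sketch, not the asker's XML pipeline):

```python
title = u"Par Catégorie"
try:
    _ = title + b"<tag>"   # str + bytes: no implicit conversion in Python 3
    mixed = True
except TypeError:
    mixed = False
assert mixed is False
```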
Wrong parameter name? From the docs, the keyword argument is supposed to be encoding, not coding.