I am importing data from MS-Excel to PostgreSQL in python(2.6) using pyodbc.
The problem faced is:
There are characters like the left single quotation mark (ANSI hex code: 0x91) in the Excel source. When the data is imported into PostgreSQL using pyodbc, the import terminates with the error DatabaseError: invalid byte sequence for encoding "UTF8": 0x91.
What I tried: I used decode('unicode_escape') for the time being, but this is not acceptable because it simply removes/escapes the character concerned.
Alternate trial: decode initially, use Unicode everywhere, and encode later when needed from the database. This is also not feasible given the size of the project at hand.
Please suggest a method, procedure, or built-in function to accomplish the task.
Find out the real encoding of the source document. It might be WIN1251. Either transcode it (for instance with iconv) or set the client_encoding of PostgreSQL accordingly.
If you don't have a setting in pyodbc (which I don't know), you can always issue a plain SQL command:
SET CLIENT_ENCODING TO 'WIN1251';
More in the chapter "Automatic Character Set Conversion Between Server and Client" of the manual.
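As a rough illustration of the transcoding option in Python, the sketch below decodes a byte string containing 0x91 and re-encodes it as UTF-8. It assumes the source is Windows-1252, which is only a guess (0x91 is the left single quotation mark, U+2018, in both Windows-1251 and Windows-1252), and the helper name to_utf8 is hypothetical. It is written for Python 3; the same decode/encode idea applies to str objects on Python 2.6.

```python
def to_utf8(raw, source_encoding="cp1252"):
    """Decode bytes from the assumed source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Bytes as they might come out of the Excel source (hypothetical sample value).
raw = b"\x91quoted\x92"
converted = to_utf8(raw)

# Printing the bytes object shows the UTF-8 sequences that replaced 0x91 and 0x92.
print(converted)
```

If the guess about the source encoding is wrong, the decode step either raises UnicodeDecodeError or silently yields the wrong characters, so it is worth spot-checking a few known values.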
I have a few questions about character encoding between MySQL, python, and HTML.
Here's the situation. I have a python script that gets information from a website and loads it into a MySQL table. Then I have a PHP and HTML frontend that takes the data from the table and loads it into an HTML table. I'm running into what I believe are just some encoding incongruencies between all these languages and platforms. I've included pictures of the result of running the python script in shell, as well as the resulting output in MySQL and finally on the HTML frontend.
When I load from python to MySQL, it does fine with symbols like 'degrees' or 'micro' but doesn't like 'ohms' for example (that's what I believe is causing the error statement in the python shell). And when I go from MySQL to HTML, it doesn't like any of the special characters like 'degrees', 'ohms', or 'micro'. What is happening here and is there a way to fix it?
Hoping someone can shed some light on this as I am very confused and somewhat of a newbie when dealing with character encoding.
Try changing mysql database encoding.
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
How to convert an entire MySQL database characterset and collation to UTF-8?
That's not necessarily a problem. MySQL may just store special characters as e.g. ð, but when you select the data back into your application it is shown correctly.
I pasted an example for an HTML page, but it's the same for Python :)
If a question mark symbol is shown instead, there is a problem with the charset, so check whether your database uses the utf8_general_ci collation.
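One possible explanation for the symptoms in the question, assuming the connection or table is using Latin-1 (an assumption; the question does not state the charset): the degree and micro signs exist in Latin-1, but the ohm sign does not. A small Python 3 sketch, with the hypothetical helper fits_in, shows the difference:

```python
def fits_in(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Degree sign, micro sign, and Greek capital omega (commonly used for ohms).
symbols = [("degree", "\u00b0"), ("micro", "\u00b5"), ("ohm", "\u03a9")]

for name, symbol in symbols:
    # Columns: symbol name, encodable in Latin-1?, encodable in UTF-8?
    print(name, fits_in(symbol, "latin-1"), fits_in(symbol, "utf-8"))
```

If this is indeed the cause, using UTF-8 consistently for the table, the connection, and the HTML page would let all three symbols through.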
I am indexing data in Elasticsearch using the official Python library, elasticsearch-py. The data is taken directly from Oracle using the cx_Oracle Python library, cast into a document format and sent to Elasticsearch for indexing. For the most part this works great, but sometimes I encounter problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even within the same database entry: one variable can have the ö indexed correctly while another does not.
Does anyone have an idea what might cause this?
thanks in advance
If your "ö" is sometimes right - and sometimes not, the data must be corrupted in your database. This is not a problem of Elasticsearch. (I had the exact same problem one month ago!)
Strings in various encodings were likely put into your database without first being converted to a single encoding.
text = "ö"
asUtf=text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This problem could be solved before the insertion into Elasticsearch. Find the text sequences matching '\xXX\xXX', treat them as UTF-8 and decode them to unicode. Try to sanitize your database and fix the way you put information into it.
PS: a better practice to move information from a database to Elasticsearch is to use rivers or to make a script that would directly send the data to Elasticsearch, without saving them into a file first.
2016 edit: the rivers are deprecated now, so you should find an alternative like logstash.
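The sanitizing step suggested above could look roughly like the following Python 3 sketch. It assumes the bad values are UTF-8 byte sequences that were decoded as Latin-1 at some point (so "ö" shows up as "Ã¶"); the function name repair_mojibake is hypothetical, and other kinds of corruption would need a different repair.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters outside Latin-1, or its bytes
        # are not valid UTF-8, so it was probably not double-decoded.
        return text


damaged = "\u00c3\u00b6"   # the two characters a Latin-1 reader sees for UTF-8 "ö"
print(ascii(repair_mojibake(damaged)))
```

This is a heuristic: a value that legitimately contains a sequence like "Ã¶" would also be rewritten, so it is safer to review the changes before writing them back.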
I've written a wrapper around pymssql to connect to the DBs where I work. I've run into unicode decode/encode errors, and I'm trying to stem them at the source.
When I specify charset='latin1' or 'iso-8859-1', the connection fails with the following error:
File "pymssql.pyx", line 549, in pymssql.connect (pymssql.c:7672)
raise OperationalError(e[0])
pymssql.OperationalError: (20017, 'DB-Lib error message 20017, severity 9:\nUnexpected EOF from the server\nDB-Lib error message 20002, severity 9:\nAdaptive Server connection failed\n')
The DB encoding looks to be 'latin1':
SELECT SERVERPROPERTY('Collation')
returns
SQL_Latin1_General_CP1_CI_AS
which, I assume, is the same as Python's 'latin1'.
Am I doing this correctly? Did I choose the wrong codec (i.e., latin1 or iso-8859-1)?
It appears that it is quite picky about what you enter.
Consider entering charset="ISO-8859-1"
Use uppercase letters such as "ISO-8859-1" or "LATIN1".
pymssql is using the GNU iconv conventions.
https://www.gnu.org/software/libiconv/
For historical reasons, international text is often encoded using a language or country dependent character encoding. With the advent of the internet and the frequent exchange of text across countries - even the viewing of a web page from a foreign country is a "text exchange" in this context -, conversions between these encodings have become important. They have also become a problem, because many characters which are present in one encoding are absent in many other encodings. To solve this mess, the Unicode encoding has been created. It is a super-encoding of all others and is therefore the default encoding for new text formats like XML.
Still, many computers still operate in locale with a traditional (limited) character encoding. Some programs, like mailers and web browsers, must be able to convert between a given text encoding and the user's encoding. Other programs internally store strings in Unicode, to facilitate internal processing, and need to convert between internal string representation (Unicode) and external string representation (a traditional encoding) when they are doing I/O. GNU libiconv is a conversion library for both kinds of applications.
My system uses the "SQL_Latin1_General_CP1_CI_AS" collation setting as well, and I found that even when connecting with "LATIN1", characters in CHAR/VARCHAR columns are still returned incorrectly encoded.
According to Microsoft's documentation on SQL Server Code Page Architecture, the code page to use is Windows-1252.
Using charset='WINDOWS-1252' in pymssql.connect gives the correct result for me.
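The difference between the two charsets can be seen without a database connection. In this Python 3 sketch (the sample bytes are made up for illustration), bytes in the range 0x80-0x9F decode to invisible control characters under Latin-1 but to printable punctuation under Windows-1252:

```python
# Curly double quotes as they would be stored under code page 1252.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("latin-1")   # 0x93 and 0x94 become C1 control characters
as_cp1252 = raw.decode("cp1252")    # the same bytes become curly quotation marks

# ascii() keeps the output readable regardless of the terminal's encoding.
print(ascii(as_latin1))
print(ascii(as_cp1252))
```

Outside that byte range the two encodings agree on almost everything, which is why text may look mostly right with "LATIN1" and only break on characters such as curly quotes, dashes, or the euro sign.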
I apologize for asking a character encoding question, since I know you folks get many every day, but I couldn't figure out my problem, so I asked anyway.
Here is what we are doing:
Take Data from an Oracle DB using Python and cx_Oracle.
Write the data to a file using Python.
Ingest the file into Postgres using Python and psycopg2.
Here are the important Oracle settings:
SQL> select * from NLS_DATABASE_PARAMETERS;
PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE                   AMERICAN
NLS_TERRITORY                  AMERICA
NLS_CURRENCY                   $
NLS_ISO_CURRENCY               AMERICA
NLS_NUMERIC_CHARACTERS         .,
NLS_CHARACTERSET               US7ASCII
According to this NLS_LANG faq, you are meant to set the NLS_LANG according to what your client OS is using.
Running locale gives us: LANG=en_US.UTF-8 (all of the other fields were also en_US.UTF-8).
So, in our Python script, we set it like this:
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"
Then we import the data and write it to a file.
row = cur.fetchall()
fil.write(row[0][0]) #For this test, I am only writing one row and one field.
We ingest that file into our UTF-8 Postgres DB.
Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
(In some text editors, the symbol shows up as �).
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
And even if we are using regional pages, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Note: The Oracle field is a CHAR field and not a NCHAR field.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
Thanks for your time, I hope you have a good day.
Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
Mostly right but not quite. PostgreSQL will refuse to insert non-UTF8 text characters when using that encoding (do a search on StackOverflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF8 character that is not recognized by your font and therefore is showing the replacement character. If the symbol is in your Oracle db and is actually the replacement symbol there, then what do you want to replace it with? If that is the case, the information is already missing.
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
It is.
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine if the data is character for character the same (and flag any differences). That's how to check your current import.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
That's another possibility. Personally, for file transformations I prefer Perl because of its integrated regular expressions and absolutely top-rate PostgreSQL support. However, I recognize your import routine may not be readily convertible at this point. I am a little more familiar with troubleshooting UTF8 conversion issues in Perl than in Python. I do wonder, however, if you can check the data that is coming out in binary format for such symbols.
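Checking the data in binary form could be done along these lines. This is a Python 3 sketch, not something that runs as-is on Python 2.4; the helper name non_ascii_bytes and the sample value are hypothetical. It lists every byte above 0x7F, which should not occur at all if the data really is 7-bit ASCII, and shows how a stray byte turns into the replacement character U+FFFD when decoded as UTF-8:

```python
def non_ascii_bytes(raw):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(i, b) for i, b in enumerate(bytearray(raw)) if b > 0x7F]


# Hypothetical field value: a Latin-1 e-acute hiding in supposedly ASCII data.
sample = b"caf\xe9"

print(non_ascii_bytes(sample))
# Decoding with errors="replace" substitutes U+FFFD for the invalid byte.
print(ascii(sample.decode("utf-8", errors="replace")))
```

Running the same check on the raw value fetched from Oracle and on the bytes written to the file would show at which step the unexpected bytes first appear.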
Here's the scenario:
I have a url in a MySQL database that contains Unicode. The database uses the Latin-1 encoding. Now, when I read the record from MySQL using Python, it gets converted to Unicode because all strings follow the Unicode format in Python.
I want to write the URL into a text file -- to do so, it needs to be converted to bytes (UTF-8). This was done successfully.
Now, given the URLs that are in the text file, I want to query the db for these SAME URLs in the database. I do so by calling the source command to execute a few select queries.
Result: I get no matches.
I suspect that the problem stems from my conversion to UTF-8, which somehow is messing up the symbols.
You most probably need to set your mysql shell client to use utf8.
You can set it either directly in the mysql shell by running set character set utf8, or by adding default-character-set=utf8 to your ~/.my.cnf.
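To see why the queries might return no matches, here is a small Python 3 sketch with a made-up URL. It assumes the mysql client reads the UTF-8 file as Latin-1, which is the situation this answer guesses at rather than something confirmed by the question:

```python
# Hypothetical URL containing one non-ASCII character (e-acute).
url = "http://example.com/caf\u00e9"

file_bytes = url.encode("utf-8")               # what was written to the text file
seen_by_client = file_bytes.decode("latin-1")  # what a Latin-1 client reads back

print(ascii(url))
print(ascii(seen_by_client))
print(seen_by_client == url)
```

The single character has become two different characters, so a comparison against the stored URL fails even though the file itself is correct.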