Python encoding problems with PostgreSQL Database

I'm working on a program where I want to compare addresses I read from a CSV file to a PostgreSQL database. (It's a plugin for QGIS.)
I can successfully establish a connection and also read data from the database, as long as I send queries without my own parameters.
So what I do:
I read a CSV file and store it in a list.
Then I select an output file.
Next I click a button that, on click, should compare the entries in the CSV file to entries in my database.
If an entry from the CSV file (postal code, town, address) has the exact same properties in the database, I write it into a list "Successful Matches"; if one doesn't match, I write it into a list "Error List".
My problem now occurs when I execute a statement with my own parameters.
The SQL error message I get back says:
invalid byte sequence for encoding "UTF8": 0xdf 0x65
I think the error is in the first list I fill from the CSV file. My addresses have special characters like öäüß...
Here is the code that is used.
This method writes the successfully matched addresses to a file and the failed ones to a lineEdit:
def write_output_file(self):
    compare_input_with_database()
    try:
        with open(self.outputfile, 'wb') as csvfile:
            writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
            for row in geocoded_list:
                writer.writerow(row)
        if len(error_list) > 0:
            self.writefailedaddresses()
            raiseInformation("Es konnten nicht alle Adressen geocodiert werden!")
        else:
            raiseInformation("Adressen erfolgreich geocodiert!")
    except csv.Error:
        raiseException("Fehler beim schreiben der Datei")
This method compares a row entry from the list/CSV file to the database:
def compare_input_with_database():
    dbcursor = database_connection.open_connection()
    for row in addressList:
        entry = str(row[0])
        addresssplit = entry.split(';')
        try:
            resultset = database_connection.select_specific_address(dbcursor, int(addresssplit[0]), addresssplit[1], addresssplit[2])
            geocoded_list.append(resultset)
        except psycopg2.DatabaseError, e:
            raiseException(e)
            error_list.append(addresssplit)
    database_connection.close_connection()
def select_specific_address(cursor, plz, town, address):
    cursor.execute("SELECT plz,ort,strasse,breitengrad,laengengrad from addresses where plz=%s AND ort=%s AND strasse=%s", (plz, town, address))
    resultset = cursor.fetchone()
    return resultset
This method reads a CSV file and populates a list:
def loadFileToList(addressfile, dlg):
    del addressList[:]
    if os.path.exists(addressfile):
        if file_is_empty(addressfile):
            raiseException("Ungueltige Quelldatei! Quelldatei ist leer!")
            return -1
        else:
            with open(addressfile, 'rb') as csvfile:
                filereader = csv.reader(csvfile, delimiter=';')
                for row in filereader:
                    addressList.append(row)
            return addressList
    else:
        raiseException("Pfad der Quelldatei nicht gefunden!")
        return -1
Thanks!
EDIT:
When I display the address containing the special character, it shows as "Hauptstra\xdfe" instead of "Hauptstraße".
Sorry, I'm bad with encodings. Is this Unicode?
Does that mean it gets sent to the database like that and I need to encode it differently?
EDIT 2:
I took a look at the workaround and tried to implement it:
def loadFileToList(addressfile, dlg):
    del addressList[:]
    if os.path.exists(addressfile):
        if file_is_empty(addressfile):
            raiseException("Ungueltige Quelldatei! Quelldatei ist leer!")
            return -1
        else:
            # with open(addressfile, 'rb') as csvfile:
            #     filereader = csv.reader(csvfile, delimiter=';')
            reader = unicode_csv_reader(open(addressfile))
            for row in reader:
                addressList.append(row)
            return addressList
    else:
        raiseException("Pfad der Quelldatei nicht gefunden!")
        return -1

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]
But now I get the following error message when executing the code:
for row in reader:
  File "C:/Users/Constantin/.qgis2/python/plugins\Geocoder\logic.py", line 46, in unicode_csv_reader
    yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 19: invalid continuation byte
I just don't get why it can't decode it -.-
UPDATE:
Some lines from my csv file:
1190;Wien;Weinberggasse
1190;Wien;Hauptstraße
1190;Wien;Kärnterstraße

The syntax of your except implies you are using Python 2. But since you are using non-ASCII (Unicode) characters in your strings, Python 3 is a dramatically better choice. You seem to be working in German, so at least a few ü and ß are going to sneak in.
In Python 3, the reading is just:
rows = []
with open('test.csv', encoding='utf-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        rows.append(row)
And writing:
with open('testout.csv', mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)
Note that the read/write is not binary, and encoding is handled as a matter of course. And the results are exactly what you'd expect. Something like:
Artikelname;Menge
Äpfel;3
Bäume;12
With all characters properly encoded. On disk, the data are UTF-8 encoded. In memory, full Unicode.
If you can't use Python 3, then you must take great care that Unicode characters are properly encoded and decoded, especially at I/O boundaries--e.g. when reading and writing data, or communicating with an external system like PostgreSQL. There are at least a few ways the code as it stands does not take care. The use of str() to cast what are not necessarily ASCII characters, for example, and the lack of UTF-8 encoding.
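To make the str() point concrete, here is a minimal sketch (a hypothetical adaptation, not the plugin's actual code) of decoding each CSV cell to unicode before it reaches psycopg2, assuming the file really is UTF-8; psycopg2 then encodes unicode parameters using the connection's client encoding:
# Python 2 sketch; addressList and dbcursor are assumed to exist as in the question
for row in addressList:
    cells = [cell.decode('utf-8') for cell in row]   # bytes -> unicode at the I/O boundary
    plz, town, street = int(cells[0]), cells[1], cells[2]
    dbcursor.execute(
        "SELECT plz, ort, strasse FROM addresses "
        "WHERE plz=%s AND ort=%s AND strasse=%s",
        (plz, town, street))                          # unicode parameters, no str() cast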
Unfortunately, there is no trivial fix in Python 2. Python 2's CSV module is schrecklich kaput. From the docs: "The csv module doesn’t directly support reading and writing Unicode." In 2016?! There are, however, workarounds. One of the recipes is right there in the docs. This Stack Overflow answer restates it, and provides several other alternatives.
So, use Python 3 if you can. It will simplify this and many other non-ASCII I/O issues. Otherwise, deploy one of the CSV workarounds in that other SO answer.
Update
If you are having trouble using the workarounds, I don't like the standard answer either. Here is a non-canonical reader solution that works for me:
import csv

rows = []
with open('test.csv', mode='rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        urow = [unicode(cell, 'utf-8') for cell in row]
        rows.append(urow)

print rows
This is decidedly non-portable to Python 3, but it works and doesn't require importing any other modules. Note, if you're using IPython/Jupyter or the interactive Python console ("REPL"), you are going to see strings at a low level, such as:
[ [u'Artikelname', u'Menge'],
  [u'\xc4pfel', u'3'],
  [u'B\xe4ume', u'12'] ]
So instead of a nice, neat Ä, the string has \xc4. It's annoying, but it's not wrong. Entering u'\u00c4pfel' gives you the same thing. And it's easy to confirm that c4 is the correct Unicode code point. Python 2 is just doing a poor job of dealing with non-ASCII characters. Just one of 4,094 reasons to use Python 3 when you can.
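For instance, in a Python 2 interactive session (a quick check of my own, not from the original answer):
>>> u'\xc4' == u'\u00c4'      # two spellings of the same code point
True
>>> print u'\xc4pfel'         # print shows the glyph rather than the escape
Äpfel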
The manual equivalent for output, btw:
with open('testout.csv', mode='wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        urow = [cell.encode('utf-8') for cell in row]
        writer.writerow(urow)
All of this depends utterly on your input files truly being in UTF-8 encoding. If they're in anything else, that opens another horrible kettle of rotting fish.
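One hedged illustration of that caveat, since the 0xdf byte in the traceback is exactly how Latin-1 (and cp1252) encode ß: if the file might not be UTF-8, you can try a short list of candidate encodings and keep the first strict decode that succeeds. This is my sketch, not part of the original answer; adjust the candidates to your data:
raw = open('test.csv', 'rb').read()
for enc in ('utf-8', 'cp1252', 'latin-1'):   # latin-1 accepts any byte, so it acts as the catch-all
    try:
        text = raw.decode(enc)
        break
    except UnicodeDecodeError:
        continue
# text is now unicode; split it into lines and feed it to csv.reader as above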

Related

How to read CSV with NULL and different headers?

I have a task to convert one CSV file from UTF-8 encoding to ANSI encoding and format it. I have learned that ANSI is really the encoding of the system in some sense, but this is not the issue at the moment.
Before converting, I have to be able to read my CSV file first. The file contains NULL values and the headers are different from the rest of the file. Whenever I try to read it, I always get an error regarding the NULL values or the headers. The headers are different in the sense that they do not have any quotation at all, while the rest of the file has 3 quotation marks on each side of strings (and, for some reason, also around NULL values). The file columns are comma-separated and each row ends with a new line.
When I try to read with QUOTE_NONNUMERIC (to get rid of the null values), I get the error:
batch_data = list(reader)
ValueError: could not convert string to float: 'my_first_column_name'
When I try with QUOTE_ALL (in hopes to quote the headers as well), I get the error:
batch_data = list(reader)
_csv.Error: line contains NULL byte
Here is my code:
import csv

file_in = r'<file_in>'

with open(file_in, mode='r') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_NONNUMERIC)
    batch_data = list(reader)
    for row in batch_data:
        print(row, end="")
After reading some materials I understand that, I guess, I have to read the headers separately from the rest of the file. How exactly would one do that? I was trying to skip them with reader.next(), but then I get an error that next is not a method of reader. At this point I can't believe how much of my time this has already taken, reading through the documentation and trying different things.
SOLUTION
I ended up using list(next()) to skip over the header and then replacing all the NULL values with an empty string. Then I open the output file with the configuration I needed and write my rows into it. I still have to deal with different formatting, since this solution puts double quotes around every cell and I need some to be numbers. It also replaced all single quotes in cells with double quotes, which still has to be fixed. The file_in and file_out variables hold the input and output file locations.
import csv
import json

file_in = r'<input_filepath>'
file_out = r'<output_filepath>'

with open(file_in, mode='r', encoding='utf-8-sig') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_ALL)
    header_data = list(next(reader))
    infile = infile.read().replace('\0', '').splitlines()
    reader2 = csv.reader(infile)
    with open(file_out, 'w', newline='', encoding='cp1252') as outfile:
        writer = csv.writer(outfile, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(header_data)
        for row in reader2:
            row = str(row).replace('"', '').replace("'", '"')
            row = json.loads(row)
            writer.writerow(row)
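As a side note (my suggestion, not part of the original solution), the str()/json.loads round trip can be replaced by cleaning each cell directly, which avoids re-parsing the row's repr:
for row in reader2:
    row = [cell.replace('"', '') for cell in row]   # strip the stray quotation marks per cell
    writer.writerow(row)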

Header written as row in CSV Python

I have the following code which I am using to create and add a new row to a csv file.
import csv

def calcPrice(data):
    fieldnames = ["ReferenceID","clientName","Date","From","To","Rate","Price"]
    with open('rec2.csv', 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerow(data)
    return
However, it adds the header as a new row as well. How can I prevent this?
Here's a link to the gist with the whole code: https://gist.github.com/chriskinyua/5ff8a527b31451ddc7d7cf157c719bba
You could check if the file already exists
import os

def calcPrice(data):
    filename = 'rec2.csv'
    write_header = not os.path.exists(filename)
    fieldnames = ["ReferenceID","clientName","Date","From","To","Rate","Price"]
    with open(filename, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerow(data)
Let's assume there's a function we can call that will tell us whether we should write out the header or not, so the code would look like this:
import csv

def calcPrice(data):
    fieldnames = ["ReferenceID","clientName","Date","From","To","Rate","Price"]
    with open('rec2.csv', 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        if should_write_header(csvfile):
            writer.writeheader()
        writer.writerow(data)
What will should_write_header look like? Here are three possibilities. For all of them, we will need to import the io module:
import io
The logic of all these functions is the same: we want to work out if the end of the file is the same as the beginning of the file. If that is true, then we want to write the header row.
This function is the most verbose: it finds the current position using the file's tell method, moves to the beginning of the file using its seek method, then runs tell again to see if the reported positions are the same. If they are not, it seeks back to the end of the file before returning the result. We don't simply compare the value of EOF to zero because the Python docs state that the result of tell for text files does not necessarily correspond to the actual position of the file pointer.
def should_write_header1(fileobj):
    EOF = fileobj.tell()
    fileobj.seek(0, io.SEEK_SET)
    res = fileobj.tell() == EOF
    if not res:
        fileobj.seek(EOF, io.SEEK_SET)
    return res
This version assumes that while the tell method does not necessarily correspond to the position of the file pointer in general, tell will always return zero for an empty file. This will probably work in common cases.
def should_write_header2(fileobj):
    return fileobj.tell() == 0
This version accesses the tell method of the binary stream that TextIOWrapper (the text file object class) wraps. For binary streams, tell is documented to return the actual file pointer position. This removes the uncertainty of should_write_header2, but unfortunately buffer is not guaranteed to exist in all Python implementations, so this isn't portable.
def should_write_header3(fileobj):
    return fileobj.buffer.tell() == 0
So for 100% certainty, use should_write_header1. For less certainty but shorter code, use one of the others. If performance is a concern favour should_write_header3, because tell in binary streams is faster than tell in text streams.
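For illustration, a hypothetical call sequence (the field values here are invented) with any of these helpers plugged into calcPrice:
calcPrice({"ReferenceID": "R1", "clientName": "ACME", "Date": "2018-01-05",
           "From": "A", "To": "B", "Rate": 10, "Price": 100})
calcPrice({"ReferenceID": "R2", "clientName": "ACME", "Date": "2018-01-06",
           "From": "B", "To": "C", "Rate": 12, "Price": 120})
# rec2.csv now contains a single header row followed by two data rows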

unicode method doesn't work in Python3

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = '/Users/congminmin/Downloads/kg-temp.csv'
reader = unicode_csv_reader(open(filename))

out_filename = '/Users/congminmin/Downloads/kg-temp.out'
# writer = open(out_filename, "w", "utf-8")

for question, answer in reader:
    print(question + " " + json.loads(answer)[0]['content'])
    # writer.write(question + " " + answer)

reader.close();
This code works in Python 2.7. But it gives an error message in Python 3.6:
Unresolved reference 'unicode'
How to adapt it to Python 3.6?
Python 3 has excellent Unicode support already. Any time you open a file in text mode you can give a specific encoding, or let it default to the platform's preferred encoding (usually UTF-8). There is no longer a difference between str and unicode in Python 3: the latter does not exist, and the former has full Unicode support. This simplifies your job greatly, since you don't need your setup method at all. You can just iterate over a normal csv.reader.
As an additional note, you should always open files in a with block, so things get cleaned up if there is any kind of exception. As a bonus, your file will get closed automatically when the block ends:
with open(filename, encoding='utf-8') as f:  # text mode ('rt') is the default
    for question, answer in csv.reader(f):
        ...  # Do your thing here. Both question and answer are normal strings
This will only work properly if you are sure that each row contains exactly 2 elements. You may be better off doing something more like
with open(filename, encoding='utf-8') as f:
    for row in csv.reader(f):
        if len(row) != 2:
            continue  # Or handle the anomaly by other means
        question, answer = row
        ...  # Do your thing here as before
Simply ensure your data is str, not a bytestring, to begin with, and just use csv.reader without this decoding stuff.
data = utf8_data.decode('utf-8')
for row in csv.reader(data.splitlines(), dialect=csv.excel, ...):
    # ...

Writing results from SQL query to CSV and avoiding extra line-breaks

I have to extract data from several different database engines. After this data is exported, I send the data to AWS S3 and copy that data to Redshift using a COPY command. Some of the tables contain lots of text, with line breaks and other characters present in the column fields. When I run the following code:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"',
                   doublequote=True, lineterminator='\n')
    a.writerows(rows)
Some of the columns that have carriage returns/linebreaks will create new lines:
"2017-01-05 17:06:32.802700"|"SampleJob"|""|"Date"|"error"|"Job.py"|"syntax error at or near ""from"" LINE 34: select *, SYSDATE, from staging_tops.tkabsences;
^
-<class 'psycopg2.ProgrammingError'>"
which causes the import process to fail. I can work around this by hard-coding for exceptions:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"',
                   doublequote=True, lineterminator='\n')
    for row in rows:
        list_of_rows = []
        for c in row:
            if isinstance(c, str):
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\|")
                c = c.replace("\\", "\\\\")
                list_of_rows.append(c)
            else:
                list_of_rows.append(c)
        a.writerow([x.encode('utf-8') if isinstance(x, str) else x for x in list_of_rows])
But this takes a long time to process larger files, and seems like bad practice in general. Is there a faster way to export data from a SQL cursor to CSV that will not break when faced with text columns that contain carriage returns/line breaks?
If you're doing SELECT * FROM table without a WHERE clause, you could use COPY table TO STDOUT instead, with the right options:
copy_command = """COPY some_schema.some_message_log TO STDOUT
                  CSV QUOTE '"' DELIMITER '|' FORCE QUOTE *"""

with open('data.csv', 'w', newline='') as fp:
    cursor.copy_expert(copy_command, fp)
This, in my testing, results in literal '\n' instead of actual newlines, where writing through the csv writer gives broken lines.
If you do need a WHERE clause in production you could create a temporary table and copy it instead:
cursor.execute("""CREATE TEMPORARY TABLE copy_me AS
                  SELECT this, that, the_other FROM table_name WHERE conditions""")
(Edit) Looking at your question again, I see you mention "several different database engines". The above works with psycopg2 and PostgreSQL but could probably be adapted for other databases or libraries.
I suspect the issue is as simple as making sure the Python CSV export library and Redshift's COPY import speak a common interface. In short, check your delimiters and quoting characters and make sure both the Python output and the Redshift COPY command agree.
With slightly more detail: the DB drivers will have already done the hard work of getting to Python in a well-understood form. That is, each row from the DB is a list (or tuple, generator, etc.), and each cell is individually accessible. And at the point you have a list-like structure, Python's CSV exporter can do the rest of the work and -- crucially -- Redshift will be able to COPY FROM the output, embedded newlines and all. In particular, you should not need to do any manual escaping; the .writerow() or .writerows() functions should be all you need do.
Redshift's COPY implementation understands the most common dialect of CSV by default, which is to
delimit cells by a comma (,),
quote cells with double quotes ("),
and to escape any embedded double quotes by doubling (" → "").
To back that up with documentation from Redshift FORMAT AS CSV:
... The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character. ...
However, your Python CSV export code uses a pipe (|) as the delimiter and sets the quotechar to double quote ("). That, too, can work, but why stray from the defaults? Suggest using CSV's namesake and keeping your code simpler in the process:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    csvw = csv.writer(fp)
    csvw.writerows(rows)
From there, tell COPY to use the CSV format (again with no need for non-default specifications):
COPY your_table FROM your_csv_file auth_code FORMAT AS CSV;
That should do it.
Why write to the file after every row?
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp, delimiter='|', quoting=csv.QUOTE_ALL, quotechar='"',
                   doublequote=True, lineterminator='\n')
    list_of_rows = []
    for row in rows:
        clean_row = []
        for c in row:
            if isinstance(c, str):
                c = c.replace("\n", "\\n")
                c = c.replace("|", "\\|")
                c = c.replace("\\", "\\\\")
            clean_row.append(c)
        list_of_rows.append(clean_row)
    a.writerows(list_of_rows)
The problem is that you are using the Redshift COPY command with its default parameters, which use a pipe as a delimiter (see here and here) and require escaping of newlines and pipes within text fields (see here and here). However, the Python csv writer only knows how to do the standard thing with embedded newlines, which is to leave them as-is, inside a quoted string.
Fortunately, the Redshift COPY command can also use the standard CSV format. Adding the CSV option to your COPY command gives you this behavior:
Enables use of CSV format in the input data. To automatically escape delimiters, newline characters, and carriage returns, enclose the field in the character specified by the QUOTE parameter. The default quote character is a double quotation mark ( " ). When the quote character is used within a field, escape the character with an additional quote character."
This is exactly the approach used by the Python CSV writer, so it should take care of your problems. So my advice would be to create a standard csv file using code like this:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(fp)  # no need for special settings
    a.writerows(rows)
Then in Redshift, change your COPY command to something like this (note the added CSV tag):
COPY logdata
FROM 's3://mybucket/data/data.csv'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CSV;
Alternatively, you could continue manually converting your fields to match the default settings for Redshift's COPY command. Python's csv.writer won't do this for you on its own, but you may be able to speed up your code a bit, especially for big files, like this:
cursor.execute('''SELECT * FROM some_schema.some_message_log''')
rows = cursor.fetchall()

with open('data.csv', 'w', newline='') as fp:
    a = csv.writer(
        fp,
        delimiter='|', quoting=csv.QUOTE_ALL,
        quotechar='"', doublequote=True, lineterminator='\n'
    )
    a.writerows(
        [
            c.replace("\\", "\\\\").replace("\n", "\\\n").replace("|", "\\|").encode('utf-8')
            if isinstance(c, str)
            else c
            for c in row
        ]
        for row in rows
    )
As another alternative, you could experiment with importing the query data into a pandas DataFrame with pandas.read_sql, doing the replacements in the DataFrame (a whole column at a time), then writing the table out with .to_csv. Pandas has incredibly fast CSV code, so this may give you a significant speedup.
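A rough sketch of that pandas route (my sketch, not from the original answer; connection stands for whatever DB-API connection you already have, and selecting string columns via dtype is an assumption about the data):
import csv
import pandas as pd

df = pd.read_sql('SELECT * FROM some_schema.some_message_log', connection)
for col in df.select_dtypes(include='object').columns:   # string-like columns
    df[col] = (df[col].str.replace('\\', '\\\\', regex=False)
                      .str.replace('\n', '\\n', regex=False)
                      .str.replace('|', '\\|', regex=False))
df.to_csv('data.csv', sep='|', quoting=csv.QUOTE_ALL, index=False)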
Update: I just noticed that in the end I basically duplicated @hunteke's answer. The key point (which I missed the first time through) is that you probably haven't been using the CSV argument in your current Redshift COPY command; if you add that, this should get easy.

industrial strength csv reader (python)

Here's my use case: it's my job to clean CSV files which are often scraped from web pages (most are English, but some German and other weird non-Unicode characters sneak in there). Python 3 is "utf-8" by default and the usual
import csv

# open file
with open('input.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
fails with UnicodeEncodeError even with try/catch blocks everywhere
I can't figure out how to clean the input if I can't even open it. My end goal is simply to read each line into a list I call text.
I'm out of ideas; I've even tried the following:
for encoding in ('utf-8', 'latin-1'):  # etc., etc.
    try:
        # open the file
I can't make any assumptions about the encoding, as the files may be written on a Unix machine in another part of the world and I'm on a Windows machine. The input is just simple strings, for example:
test case: "This is an example of a test case and the test may wrap around to a new line when opened in a text processor"
Maybe try reading in the contents entirely, then using bytes.decode() in much the same way you mentioned:
#!python3
import csv
from io import StringIO

with open('input.csv', 'rb') as binfile:
    csv_bytes = binfile.read()

for enc in ('utf-8', 'utf-16', 'latin1'):
    try:
        csv_string = csv_bytes.decode(encoding=enc, errors='strict')
        break
    except UnicodeError as e:
        last_err = e
else:  # none worked
    raise last_err

with StringIO(csv_string) as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(row[0])
