How to read CSV with NULL and different headers? - python

I have a task to convert a CSV file from UTF-8 encoding to ANSI encoding and format it. I have learned that ANSI is really just the encoding of the local system in some sense, but that is not the issue at the moment.
Before converting, I have to be able to read my CSV file first. The file contains NULL values, and the headers are different from the rest of the file. Whenever I try to read it, I always get an error regarding either the NULL values or the headers. The headers are different in the sense that they have no quotation at all, while the rest of the file has three quotation marks on each side of strings (and, for some reason, also around NULL values). The columns are comma-separated and each row ends with a newline.
When I try to read with QUOTE_NONNUMERIC (to get rid of the null values), I get the error:
batch_data = list(reader)
ValueError: could not convert string to float: 'my_first_column_name'
When I try QUOTE_ALL (in hopes of quoting the headers as well), I get the error:
batch_data = list(reader)
_csv.Error: line contains NULL byte
Here is my code:
import csv

file_in = r'<file_in>'
with open(file_in, mode='r') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_NONNUMERIC)
    batch_data = list(reader)
    for row in batch_data:
        print(row, end="")
After reading some material, I understand that I probably have to read the headers separately from the rest of the file. How exactly would one do that? I tried to skip them with reader.next(), but then I get an error that next is not a method of reader. At this point I can hardly believe how much of my time this has already taken, reading through the documentation and trying different things.
SOLUTION
I ended up using list(next(reader)) to skip over the header and then replacing all the NULL values with an empty string, then opening the output file with the configuration I needed and writing my rows into it. I still have to deal with the formatting, since this solution puts double quotes around every cell and I need some cells to be numbers. It also replaced all single quotes in cells with double quotes, which still has to be fixed. The file_in and file_out variables hold the input and output file paths.
import csv
import json

file_in = r'<input_filepath>'
file_out = r'<output_filepath>'

with open(file_in, mode='r', encoding='utf-8-sig') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_ALL)
    # grab the unquoted header line before touching the rest of the file
    header_data = list(next(reader))
    # read the remainder, strip the NUL bytes, and split back into lines
    infile = infile.read().replace('\0', '').splitlines()
    reader2 = csv.reader(infile)
    with open(file_out, 'w', newline='', encoding='cp1252') as outfile:
        writer = csv.writer(outfile, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
        writer.writerow(header_data)
        for row in reader2:
            # round-trip each row through a JSON string to collapse the quoting
            row = str(row).replace('"', '').replace("'", '"')
            row = json.loads(row)
            writer.writerow(row)
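For what it's worth, a hedged alternative sketch that avoids the str()/json.loads round-trip (and therefore leaves quote characters inside cells alone): strip the NUL bytes once, then let a single reader feed the writer. It reuses the file_in/file_out placeholders from above and QUOTE_MINIMAL so numeric-looking cells stay unquoted; it does not attempt to undo the tripled quotation marks described in the question.
import csv

file_in = r'<input_filepath>'
file_out = r'<output_filepath>'

with open(file_in, mode='r', encoding='utf-8-sig') as infile:
    # drop NUL bytes before the csv module ever sees them
    cleaned = infile.read().replace('\0', '').splitlines()

reader = csv.reader(cleaned)
header = next(reader)  # the unquoted header line parses fine with default quoting

with open(file_out, 'w', newline='', encoding='cp1252') as outfile:
    writer = csv.writer(outfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerows(reader)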

Related

How to print certain specific character from row to column?

I have an input file that contains data in the same format repeatedly across 5 rows. I need to format this data into one row (a CSV file), and only a few fields are relevant to me. How do I achieve the mentioned output with the input file provided?
Note - I'm very new to learning any language and haven't reached this depth of detail yet to write my own. I have already written the code where I import the input file, reach a specific word, and then print the rest of the data (this is where I need help, as I don't need all the information in the input, and using space as the delimiter does not give the output in the correct columns). I have also written the code to write the output to a CSV file.
Note 2 - I'm very new to this forum as well, so kindly excuse any mistakes I have made in posting my query.
(Input and output samples were attached as image files and are not reproduced here.)
You should read in the file and parse it manually, then use the csv module to write it to a .csv file:
import csv
import re

with open('myfile.txt', 'r') as f:
    lines = f.readlines()

# divide on runs of whitespace, but not single spaces
lines = [re.split(r"\s\s+", line) for line in lines]

with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        writer.writerow(line)
But this will include every piece of data. You can iterate through lines and remove the fields you don't want to keep. So before you do the csv writing, you could do:
def filter_line(line):
    # see how the input file was parsed
    print(line)
    # for example, only keep the first 2 columns
    return [line[0], line[1]]

lines = [filter_line(line) for line in lines]

Line contains null byte error while reading csv file

I have tried reading a csv file using the following code.
import csv

def csv_dict_reader(file_obj):
    reader = csv.reader(file_obj, delimiter=',')
    for row in reader:
        # some operation
        pass

file = open("data.csv", "r")
csv_dict_reader(file)
I have referred to the solutions given here, but none of them seem to work. What could be the most probable reason for this?
The error:
for row in reader:
_csv.Error: line contains NULL byte
The file contains one or more NULL bytes, which are not compatible with the CSV reader. As a workaround, you could read the file a line at a time and, if a NULL byte is detected, replace it with a space character. The resulting line can then be parsed by a CSV reader by converting the string into a file-like object. Note that the default delimiter is , so it does not need to be specified. By adding enumerate(), you can also display which lines in your file contain the NULL bytes.
As you are using a DictReader(), an extra step is needed to first extract the header from your file using a normal csv.reader(). This row can be used to manually specify the fieldnames parameter to your DictReader.
import csv
import StringIO

with open('data.csv', 'rb') as f_input:
    # Use a normal CSV reader to get the header line
    header = next(csv.reader(f_input))

    for line_number, raw_line in enumerate(f_input, start=1):
        if '\x00' in raw_line:
            print "Line {} - NULL found".format(line_number)
            raw_line = raw_line.replace('\x00', ' ')
        row = next(csv.DictReader(StringIO.StringIO(raw_line), fieldnames=header))
        print row
Lastly, when using a csv.reader() in Python 2, you should open the file in binary mode, e.g. 'rb'.
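For reference, a rough Python 3 translation of the same workaround (assuming the file decodes as UTF-8; io.StringIO replaces the old StringIO module, and the file is opened in text mode with newline=''):
import csv
import io

with open('data.csv', 'r', encoding='utf-8', newline='') as f_input:
    # Use a normal CSV reader to get the header line
    header = next(csv.reader(f_input))

    for line_number, raw_line in enumerate(f_input, start=1):
        if '\x00' in raw_line:
            print("Line {} - NULL found".format(line_number))
            raw_line = raw_line.replace('\x00', ' ')
        row = next(csv.DictReader(io.StringIO(raw_line), fieldnames=header))
        print(row)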

Python encoding problems with PostgreSQL Database

I'm working on a program where I want to compare addresses I read from a CSV file to a PostgreSQL database. (It's a plugin for QGIS.)
I can successfully establish a connection and also read data from the database as long as I send queries without my own parameters.
So what I do:
I read a csv file and store it in a list.
Then I select an output file.
Next I click a button that, on click, should compare the entries in the csv file to entries in my database.
If an entry from the csv file (postal code, town, address) has the exact same properties in the database, I write it into a list "Successful Matches"; if one doesn't match, I write it into a list "Error List".
My problem occurs when I execute a statement with my own parameters.
The SQL error message I get back says:
Invalid Byte-Sequence for Encoding UTF8: 0xdf 0x65
I think the error is in the first list I fill from the csv file. My addresses have special characters like öäüß...
Here is the code that is used:
This method writes the successfully matched addresses to a file and the failed ones to a lineEdit:
def write_output_file(self):
    compare_input_with_database()
    try:
        with open(self.outputfile, 'wb') as csvfile:
            writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
            for row in geocoded_list:
                writer.writerow(row)
        if len(error_list) > 0:
            self.writefailedaddresses()
            raiseInformation("Es konnten nicht alle Adressen geocodiert werden!")
        else:
            raiseInformation("Adressen erfolgreich geocodiert!")
    except csv.Error:
        raiseException("Fehler beim schreiben der Datei")
This method compares a row entry from the list/csv file to the database:
def compare_input_with_database():
    dbcursor = database_connection.open_connection()
    for row in addressList:
        entry = str(row[0])
        addresssplit = entry.split(';')
        try:
            resultset = database_connection.select_specific_address(dbcursor, int(addresssplit[0]), addresssplit[1], addresssplit[2])
            geocoded_list.append(resultset)
        except psycopg2.DatabaseError, e:
            raiseException(e)
            error_list.append(addresssplit)
    database_connection.close_connection()

def select_specific_address(cursor, plz, town, address):
    cursor.execute("SELECT plz,ort,strasse,breitengrad,laengengrad from addresses where plz=%s AND ort=%s AND strasse=%s", (plz, town, address))
    resultset = cursor.fetchone()
    return resultset
This method reads a csv file and loads it into a list:
def loadFileToList(addressfile, dlg):
    del addressList[:]
    if os.path.exists(addressfile):
        if file_is_empty(addressfile):
            raiseException("Ungueltige Quelldatei! Quelldatei ist leer!")
            return -1
        else:
            with open(addressfile, 'rb') as csvfile:
                filereader = csv.reader(csvfile, delimiter=';')
                for row in filereader:
                    addressList.append(row)
            return addressList
    else:
        raiseException("Pfad der Quelldatei nicht gefunden!")
        return -1
Thanks!
EDIT:
When I display an address containing the special character, it shows as "Hauptstra\xdfe" instead of "Hauptstraße".
Sorry, I'm bad with encodings; is this Unicode?
Does that mean it gets sent to the database like that, and I need to encode it differently?
EDIT 2:
I took a look at the workaround and tried to implement it:
def loadFileToList(addressfile, dlg):
    del addressList[:]
    if os.path.exists(addressfile):
        if file_is_empty(addressfile):
            raiseException("Ungueltige Quelldatei! Quelldatei ist leer!")
            return -1
        else:
            #with open(addressfile, 'rb') as csvfile:
            #    filereader = csv.reader(csvfile, delimiter=';')
            reader = unicode_csv_reader(open(addressfile))
            for row in reader:
                addressList.append(row)
            return addressList
    else:
        raiseException("Pfad der Quelldatei nicht gefunden!")
        return -1

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]
But now I get the following error message when executing the code:
for row in reader:
  File "C:/Users/Constantin/.qgis2/python/plugins\Geocoder\logic.py", line 46, in unicode_csv_reader
    yield [unicode(cell, 'utf-8') for cell in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 19: invalid continuation byte
I just don't get why it can't decode it -.-
UPDATE:
Some lines from my csv file:
1190;Wien;Weinberggasse
1190;Wien;Hauptstraße
1190;Wien;Kärnterstraße
The syntax of your except implies you are using Python 2. But since you are using non-ASCII (Unicode) characters in your strings, Python 3 is a dramatically better choice. You seem to be working in German, so at least a few ü and ß are going to sneak in.
In Python 3, the reading is just:
import csv

rows = []
# newline='' lets the csv module handle line endings itself
with open('test.csv', encoding='utf-8', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        rows.append(row)
And writing:
with open('testout.csv', mode="w", encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)
Note that the read/write is not binary, and encoding is handled as a matter of course. And the results are exactly what you'd expect. Something like:
Artikelname;Menge
Äpfel;3
Bäume;12
With all characters properly encoded. On disk, the data are UTF-8 encoded. In memory, full Unicode.
If you can't use Python 3, then you must take great care that Unicode characters are properly encoded and decoded, especially at I/O boundaries--e.g. when reading and writing data, or communicating with an external system like PostgreSQL. There are at least a few ways the code as it stands does not take care. The use of str() to cast what are not necessarily ASCII characters, for example, and the lack of UTF-8 encoding.
Unfortunately, there is no trivial fix in Python 2. Python 2's CSV module is schrecklich kaput. From the docs: "The csv module doesn’t directly support reading and writing Unicode." In 2016?! There are, however, workarounds. One of the recipes is right there in the docs. This Stack Overflow answer restates it, and provides several other alternatives.
So, use Python 3 if you can. It will simplify this and many other non-ASCII I/O issues. Otherwise, deploy one of the CSV workarounds in that other SO answer.
Update
If you are having trouble using the workarounds, I don't like the standard answer either. Here is a non-canonical reader solution that works for me:
import csv

rows = []
with open('test.csv', mode='rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        # decode each cell from UTF-8 bytes to a unicode string
        urow = [unicode(cell, 'utf-8') for cell in row]
        rows.append(urow)

print rows
This is decidedly non-portable to Python 3, but it works and doesn't require importing any other modules. Note, if you're using IPython/Jupyter or the interactive Python console ("REPL"), you are going to see strings at a low level, such as:
[ [u'Artikelname', u'Menge'],
  [u'\xc4pfel', u'3'],
  [u'B\xe4ume', u'12']
]
So instead of a nice, neat Ä, the string has \xc4. It's annoying, but it's not wrong. Entering u'\u00c4pfel' gives you the same thing. And it's easy to confirm that c4 is the correct Unicode code point. Python 2 is just doing a poor job of dealing with non-ASCII characters. Just one of 4,094 reasons to use Python 3 when you can.
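For instance, a quick Python 2 REPL check (assuming a terminal that can display the character):
>>> print u'\xc4'
Ä
>>> import unicodedata
>>> unicodedata.name(u'\xc4')
'LATIN CAPITAL LETTER A WITH DIAERESIS'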
The manual equivalent for output, btw:
with open('testout.csv', mode='wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=';', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        # encode each unicode cell back to UTF-8 bytes before writing
        urow = [cell.encode('utf-8') for cell in row]
        writer.writerow(urow)
All of this depends utterly on your input files truly being in UTF-8 encoding. If they're in anything else, that opens another horrible kettle of rotting fish.
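On that note, since the traceback in the question dies on the byte 0xdf (which is ß in cp1252/Latin-1 and never valid UTF-8), a hedged Python 3 sketch that tries UTF-8 first and falls back to cp1252 might look like this; read_csv_any_encoding is a hypothetical helper name, not part of the code above:
import csv

def read_csv_any_encoding(path):
    # hypothetical helper: try UTF-8 first, then fall back to cp1252
    for encoding in ('utf-8', 'cp1252'):
        try:
            with open(path, encoding=encoding, newline='') as f:
                return list(csv.reader(f, delimiter=';'))
        except UnicodeDecodeError:
            continue
    raise ValueError('could not decode {} as UTF-8 or cp1252'.format(path))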

Generating CSV and blank line

I am generating and parsing CSV files and I'm noticing something odd.
When the CSV gets generated, there is always an empty line at the end, which is causing issues when subsequently parsing them.
My code to generate is as follows:
import csv

with open(file, 'wb') as fp:
    a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='')
    a.writerow(["Id", "Builing", "Age", "Gender"])
    results = get_results()
    for val in results:
        if any(val):
            a.writerow(val)
It doesn't show up via the command line, but I do see it in my IDE/text editor.
Does anyone know why it is doing this?
Could it be possible whitespace?
Is the problem the line terminator? It could be as simple as changing one line:
a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='', lineterminator='\n')
I suspect this is it, since I know that csv.writer defaults to using carriage return + line feed ("\r\n") as the line terminator. The program you are using to read the file might be expecting just a line feed ("\n"). This is common when switching files back and forth between *nix and Windows.
If this doesn't work, then the program you are using to read the file seems to be expecting no line terminator for the last row, and I'm not sure the csv module supports that. For that, you could write the csv to a StringIO, strip() it, and then write that to your file.
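A minimal sketch of that StringIO route (Python 2, to match the 'wb' mode above; get_results() and file are taken from the question, and the question's quoting options are left out for brevity):
import csv
import StringIO

buf = StringIO.StringIO()
writer = csv.writer(buf, delimiter=",", lineterminator="\n")
writer.writerow(["Id", "Builing", "Age", "Gender"])
for val in get_results():
    if any(val):
        writer.writerow(val)

with open(file, 'wb') as fp:
    # strip() drops the trailing newline so the last row has no terminator
    fp.write(buf.getvalue().strip())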
Also, since you are not quoting anything, is there a reason to use csv at all? Why not:
with open(file, 'wb') as fp:
    fp.write("\n".join([",".join([field for field in record]) for record in get_results()]))

Python csv writer : "Unknown Dialect" Error

I have a very large string in CSV format that will be written to a CSV file.
I try to write it to the CSV file using the simplest of Python scripts:
results=""" "2013-12-03 23:59:52","/core/log","79.223.39.000","logging-4.0",iPad,Unknown,"1.0.1.59-266060",NA,NA,NA,NA,3,"1385593191.865",true,ERROR,"app_error","iPad/Unknown/webkit/537.51.1",NA,"Does+not",false
"2013-12-03 23:58:41","/core/log","217.7.59.000","logging-4.0",Win32,Unknown,"1.0.1.59-266060",NA,NA,NA,NA,4,"1385593120.68",true,ERROR,"app_error","Win32/Unknown/msie/9.0",NA,"Does+not,false
"2013-12-03 23:58:19","/core/client_log","79.240.195.000","logging-4.0",Win32,"5.1","1.0.1.59-266060",NA,NA,NA,NA,6,"1385593099.001",true,ERROR,"app_error","Win32/5.1/mozilla/25.0",NA,"Could+not:+{"url":"/all.json?status=ongoing,scheduled,conflict","code":0,"data":"","success":false,"error":true,"cached":false,"jqXhr":{"readyState":0,"responseText":"","status":0,"statusText":"error"}}",false"""
resultArray = results.split('\n')

with open(csvfile, 'wb') as f:
    writer = csv.writer(f)
    for row in resultArray:
        writer.writerows(row)
The code returns an "Unknown Dialect" error.
Is the error because of the script or is it due to the string that is being written?
EDIT
If the problem is bad input, how do I sanitize it so that it can be used by the csv.writer() method?
You need to specify the format of your string:
with open(csvfile, 'wb') as f:
    writer = csv.writer(f, delimiter=',', quotechar="'", quoting=csv.QUOTE_ALL)
You might also want to re-visit your writing loop; the way you have it written you will get one column in your file, and each row will be one character from the results string.
To really exploit the module, try this:
import csv

lines = ["'A','bunch+of','multiline','CSV,LIKE,STRING'"]
reader = csv.reader(lines, quotechar="'")
with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(list(reader))
out.csv will have:
A,bunch+of,multiline,"CSV,LIKE,STRING"
If you want to quote all the column values, then add quoting=csv.QUOTE_ALL to the writer object; then your file will have:
"A","bunch+of","multiline","CSV,LIKE,STRING"
To change the quotes to ', add quotechar="'" to the writer object.
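Putting those two options together (a sketch; f is the open output file from above):
writer = csv.writer(f, quoting=csv.QUOTE_ALL, quotechar="'")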
The above code does not give csv.writer.writerows input that it expects. Specifically:
resultArray = results.split('\n')
This creates a list of strings. Then, you pass each string to your writer and tell it to writerows with it:
for row in resultArray:
    writer.writerows(row)
But writerows does not expect a single string. From the docs:
csvwriter.writerows(rows)
Write all the rows parameters (a list of row objects as described above) to the writer’s file object, formatted according to the current dialect.
So you're passing a string to a method that expects its argument to be a list of row objects, where a row object is itself expected to be a sequence of strings or numbers:
A row must be a sequence of strings or numbers for Writer objects
Are you sure your listed example code accurately reflects your attempt? While it certainly won't work, I would expect the exception produced to be different.
For a possible fix - if all you are trying to do is to write a big string to a file, you don't need the csv library at all. You can just write the string directly. Even splitting on newlines is unnecessary unless you need to do something like replacing Unix-style linefeeds with DOS-style linefeeds.
If you need to use the csv module after all, you need to give your writer something it understands - in this example, that would be something like writer.writerow(['A','bunch+of','multiline','CSV,LIKE,STRING']). Note that that's a true Python list of strings. If you need to turn your raw string "'A','bunch+of','multiline','CSV,LIKE,STRING'" into such a list, I think you'll find the csv library useful as a reader - no need to reinvent the wheel to handle the quoted commas in the substring 'CSV,LIKE,STRING'. And in that case you would need to care about your dialect.
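A minimal sketch of that reader-then-writer route (Python 2, to match the 'wb' mode in the question; it assumes the results string is well-formed CSV, which the sample above is not quite, given its unbalanced quotes):
import csv
import StringIO

# parse the raw CSV-formatted string into real rows first,
# then hand the whole list of rows to writerows
rows = list(csv.reader(StringIO.StringIO(results)))

with open('out.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(rows)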
You can use register_dialect; for example, for escaped formatting:
csv.register_dialect('escaped', escapechar='\\', doublequote=True, quoting=csv.QUOTE_ALL)
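Once registered, the dialect is referenced by name (a usage sketch; rows is any list of row sequences):
with open('out.csv', 'wb') as f:
    writer = csv.writer(f, dialect='escaped')
    writer.writerows(rows)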
