Line contains null byte error while reading csv file - python

I have tried reading a csv file using the following format of code.
import csv
def csv_dict_reader(file_obj):
    reader = csv.reader(file_obj, delimiter=',')
    for row in reader:
        pass  # some operation
file = open("data.csv", "r")
csv_dict_reader(file)
I have referred to the solutions given here, but none of them seem to work. What could be the most probable reason for this?
The error:
for row in reader:
_csv.Error: line contains NULL byte

The file contains one or more NULL bytes, which are not compatible with the CSV reader. As a workaround, you could read the file a line at a time and, if a NULL byte is detected, replace it with a space character. The resulting line could then be parsed by a CSV reader by converting the string into a file-like object. Note that the delimiter is , by default, so it does not need to be specified. By adding enumerate(), you can also display which lines in your file contain the NULL bytes.
As you are using a DictReader(), an extra step is needed to first extract the header from your file using a normal csv.reader(). This row can be used to manually specify the fieldnames parameter to your DictReader.
import csv
import StringIO
with open('data.csv', 'rb') as f_input:
    # Use a normal CSV reader to get the header line
    header = next(csv.reader(f_input))
    for line_number, raw_line in enumerate(f_input, start=1):
        if '\x00' in raw_line:
            print "Line {} - NULL found".format(line_number)
            raw_line = raw_line.replace('\x00', ' ')
        row = next(csv.DictReader(StringIO.StringIO(raw_line), fieldnames=header))
        print row
Lastly, when using a csv.reader() on Python 2, you should open the file in binary mode, e.g. rb. On Python 3, open it in text mode with newline='' instead.
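On Python 3 the same idea can be written more compactly. The following is a minimal sketch, not the answer's original code: it assumes a text-mode file object, strips NUL characters from each line with a generator, and lets csv.DictReader read the header itself. The helper names strip_nulls and read_rows and the sample data are hypothetical.

```python
import csv
import io


def strip_nulls(lines, replacement=""):
    """Yield each line with NUL characters replaced (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        if "\x00" in line:
            print("Line {} - NULL found".format(line_number))
            line = line.replace("\x00", replacement)
        yield line


def read_rows(file_obj):
    """Parse a text-mode file object into a list of dicts, ignoring NULs."""
    return list(csv.DictReader(strip_nulls(file_obj)))


# Stand-in for open('data.csv', newline=''); the contents are made up.
sample = io.StringIO("name,value\r\nfoo,1\r\nb\x00ar,2\r\n", newline="")
rows = read_rows(sample)
print(rows)
```

If nearly every other byte in the file is NUL, the file may actually be UTF-16 encoded, in which case opening it with encoding='utf-16' is probably a better fix than stripping bytes.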

Related

write text file into csv file in python

I'm trying to write a text file into csv file using csv package in python using this code:
import csv
with open(r'C:\Users\soso-\Desktop\SVM\DataSet\drug_list.txt') as f:
    with open('integrated_x.csv', 'w') as out_file:
        writer= csv.writer(out_file)
        for line in f:
            print(line.strip())
            writer.writerow(line.strip())
I expected the file to look like the terminal output, but I don't know why it is written to the csv file like this.
How can I fix it?
The data appears in the csv like this:
D,B,0,0,6,3,2
D,B,0,0,4,3,3
D,B,0,1,1,9,4
D,B,0,0,2,7,3
D,B,0,2,5,3,0
I want the csv file to be like this :
DB00632,
DB00433,
DB01194,
DB00273,
DB02530,
The unexpected output you are seeing, such as "D,B,0,0,6,3,2", is due to writer.writerow() expecting a list or other iterable of fields. The string DB00632 is iterated character by character, so the CSV output has each character separated by a comma.
If you just want to append a comma (,) to the end of each drug id, you can write the drug id followed by a comma and a newline, without using a CSV writer.
Note that the CSV file doesn't yet have a header line and there is nothing in the second column.
with open('drug_list.txt') as f:
    with open('integrated_x.csv', 'w') as out_file:
        for line in f:
            line = line.strip()
            print(line)
            out_file.write(f"{line},\n")
If input looks like a drug id on each line (e.g. DB00632) then the output would be as expected:
DB00632,
DB00433,
DB01194,
DB00273,
DB02530,
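If you would rather keep csv.writer, the difference between passing a string and passing a list can be seen in this small sketch. It writes to in-memory buffers instead of files, and the two drug ids are just sample values taken from the question.

```python
import csv
import io

drug_ids = ["DB00632", "DB00433"]  # sample lines, as read from the text file

wrong = io.StringIO(newline="")
right = io.StringIO(newline="")
wrong_writer = csv.writer(wrong)
right_writer = csv.writer(right)

for drug_id in drug_ids:
    wrong_writer.writerow(drug_id)        # a string: one field per character
    right_writer.writerow([drug_id, ""])  # a list: one field per element

print(wrong.getvalue())
print(right.getvalue())
```

The empty string in the list produces the empty second column, which is why each line ends with a comma.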

How to read CSV with NULL and different headers?

I have a task to convert one CSV file from UTF8 encoding to ANSI encoding and format it. I have learned that ANSI is really the encoding of the system in some sense, but this is not the issue at the moment.
Before converting, I have to be able to read my CSV file first. The file contains null values, and the headers are formatted differently from the rest of the file. Whenever I try to read it, I always get an error regarding either the NULL values or the headers. The headers are different in the sense that they are not quoted at all, while the rest of the file has 3 quotation marks on each side of strings (and for some reason also around NUL values). The columns are comma separated and each row ends with a new line.
When I try to read with QUOTE_NONNUMERIC (to get rid of the null values), I get the error:
batch_data = list(reader)
ValueError: could not convert string to float: 'my_first_column_name'
When I try with QUOTE_ALL (in hopes to quote the headers as well), I get the error:
batch_data = list(reader)
_csv.Error: line contains NULL byte
Here is my code:
import csv
file_in = r'<file_in>'
with open(file_in, mode='r') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_NONNUMERIC)
    batch_data = list(reader)
    for row in batch_data:
        print(row, end="")
After reading some material, I understand that I probably have to read the headers separately from the rest of the file. How would one do that exactly? I was trying to skip them with reader.next(), but then I get an error saying that next is not a method of reader. At this point I can't really believe how much of my time it has already taken to read through the documentation and try different things.
SOLUTION
I ended up using list(next(reader)) to read the header separately and then replacing all the null values with an empty string. Then I open the output file with the configuration I need and write my rows to it. I still have to deal with the formatting, since this solution puts double quotes around every cell and I need some of them to be numbers. It also replaces all single quotes in cells with double quotes, which still has to be fixed. The file_in and file_out variables hold the input and output file locations.
import csv
import json
file_in = r'<input_filepath>'
file_out = r'<output_filepath>'
with open(file_in, mode='r', encoding='utf-8-sig') as infile:
    reader = csv.reader(infile, quoting=csv.QUOTE_ALL)
    header_data = list(next(reader))
    infile = infile.read().replace('\0', '').splitlines()
    reader2 = csv.reader(infile)
with open(file_out,'w',newline='', encoding='cp1252') as outfile:
    writer = csv.writer(outfile, delimiter=';', quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(header_data)
    for row in reader2:
        row = str(row).replace('"','').replace("'",'"')
        row = json.loads(row)
        writer.writerow(row)
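A possibly simpler way to read such a file on Python 3 is sketched below. It is only an illustration under assumptions: the helper name read_without_nulls and the sample data are made up, and it uses the built-in next(reader), since reader objects have no .next() method on Python 3. NUL characters are removed line by line before the csv module sees them, so the header and the data rows can go through the same reader.

```python
import csv
import io


def read_without_nulls(file_obj):
    """Return (header, rows) from a text-mode file, dropping NUL characters."""
    cleaned = (line.replace("\0", "") for line in file_obj)
    reader = csv.reader(cleaned)
    header = next(reader)  # built-in next(), not reader.next()
    return header, list(reader)


# Stand-in for open(file_in, encoding='utf-8-sig', newline=''); made-up data.
sample = io.StringIO('id,name\r\n"1","a\0b"\r\n"2",""\r\n', newline="")
header, rows = read_without_nulls(sample)
print(header)
print(rows)
```

With the default quoting every cell comes back as a string, so any numeric conversion still has to be done explicitly before writing with QUOTE_NONNUMERIC.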

Reading CSV with newlines within row

I'm using Pandas to read a CSV file, but some of the rows continue on the next line, and the quote character (") ends up at the start of the next line. I'm trying to find a parameter for 'pd.read_csv' that will fix it, but after reading the documentation I'm still not sure if there is one.
Ex:
"0","","0","0","0","0","0
","0","0","0","0","0","0"
In other words,
"0","","0","0","0","0","0\r\n","0","0","0","0","0","0"
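import csv
# 'rU' is Python 2 universal-newline mode; on Python 3 use 'r' ('U' was removed in 3.11)
with open(file, 'rU') as myfile:
    filtered = (line.replace('\n', '') for line in myfile)
    for row in csv.reader(filtered):
        print(row)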
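On Python 3, the csv module can be left to handle the embedded line break itself, as long as the data reaches it untranslated. The sketch below assumes the line break always falls inside a quoted field, as in the example above. It reads from an in-memory string; for a real file the equivalent would be open(path, newline='').

```python
import csv
import io

raw = '"0","","0","0","0","0","0\r\n","0","0","0","0","0","0"\r\n'

# newline="" keeps the embedded \r\n intact, so csv sees it inside the quotes
# and treats both physical lines as a single record.
reader = csv.reader(io.StringIO(raw, newline=""))

# Remove the stray line breaks from the parsed fields afterwards.
rows = [
    [field.replace("\r", "").replace("\n", "") for field in row]
    for row in reader
]
print(rows)
```

Cleaning the fields after parsing, rather than the raw lines before, keeps the decision about what counts as a record boundary with the parser.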

How to write output file in CSV format in python?

I tried to write the output as a CSV file, but I get either an error or an unexpected result. I am using Python 3.5.2 and also 2.7.
Getting error in Python 3.5:
wr.writerow(var)
TypeError: a bytes-like object is required, not 'str'
and
In Python 2.7, I get all the columns in a single column.
Expected Result:
An output file in the same format as the input file.
Code:
import csv
f1 = open("input_1.csv", "r")
resultFile = open("out.csv", "wb")
wr = csv.writer(resultFile, quotechar=',')
def sort_duplicates(f1):
    for i in range(0, len(f1)):
        f1.insert(f1.index(f1[i])+1, f1[i])
        f1.pop(i+1)
for var in f1:
    #print (var)
    wr.writerow([var])
If I use resultFile = open("out.csv", "w"), I get one extra row in the output file.
If I use the code above, I get one extra row and one extra column.
On Python 3, csv requires that you open the file in text mode, not binary mode. Drop the b from your file mode. You should really use newline='' too:
resultFile = open("out.csv", "w", newline='')
Better still, use the file object as a context manager to ensure it is closed automatically:
with open("input_1.csv", "r") as f1, \
     open("out.csv", "w", newline='') as resultFile:
    wr = csv.writer(resultFile, dialect='excel')
    for var in f1:
        wr.writerow([var.rstrip('\n')])
I've also stripped the lines from f1 (just to remove the newline) and put the line in a list; csv.writer.writerow wants a sequence with columns, not a single string.
Quoting the csv.writer() documentation:
If csvfile is a file object, it should be opened with newline='' [1]. [...] All other non-string data are stringified with str() before being written.
[1] If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n linendings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
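The TypeError from the question can be reproduced without touching the disk. In this small sketch io.BytesIO stands in for a file opened with "wb" and io.StringIO for one opened with "w" and newline=''; the row contents are arbitrary.

```python
import csv
import io

# Binary target: csv.writer passes str to write(), which a bytes buffer rejects.
try:
    csv.writer(io.BytesIO()).writerow(["a", "b"])
    error = None
except TypeError as exc:
    error = exc

# Text target: the same call succeeds.
text_target = io.StringIO(newline="")
csv.writer(text_target).writerow(["a", "b"])

print(repr(error))
print(repr(text_target.getvalue()))
```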
Others have answered that you should open the output file in text mode when using Python 3, i.e.
with open('out.csv', 'w', newline='') as resultFile:
    ...
But you also need to parse the incoming CSV data. As it is, your code reads each line of the input CSV file as a single string. Then, without splitting that line into its constituent fields, it passes the string to the CSV writer. As a result, the csv.writer will treat the string as a sequence and output each character, including any terminating newline character, as a separate field. For example, if your input CSV file contains:
1,2,3,4
Your output file would be written like this:
1,",",2,",",3,",",4,"
"
You should change the for loop to this:
for row in csv.reader(f1):
    # process the row
    wr.writerow(row)
Now the input CSV file will be parsed into fields and row will contain a list of strings - one for each field. For the previous example, row would be:
for row in csv.reader(f1):
    print(row)
['1', '2', '3', '4']
And when that list is passed to the csv.writer the output to the file will be:
1,2,3,4
Putting all of that together you get this code:
import csv
with open('input_1.csv') as f1, open('out.csv', 'w', newline='') as resultFile:
    wr = csv.writer(resultFile, dialect='excel')
    for row in csv.reader(f1):
        wr.writerow(row)
Open the file without the b mode flag. The b flag opens the file in binary mode, which the csv module on Python 3 cannot write to. You can open the file with w instead:
open_file = open("filename.csv", "w")
You are opening the input file in normal read mode, but the output file in binary mode. The correct way is:
resultFile = open("out.csv", "w")
As shown above, replacing "wb" with "w" makes it work.

Generating CSV and blank line

I am generating and parsing CSV files and I'm noticing something odd.
When the CSV gets generated, there is always an empty line at the end, which is causing issues when subsequently parsing them.
My code to generate is as follows:
with open(file, 'wb') as fp:
    a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='')
    a.writerow(["Id", "Builing", "Age", "Gender"])
    results = get_results()
    for val in results:
        if any(val):
            a.writerow(val)
It doesn't show up via the command line, but I do see it in my IDE/text editor
Does anyone know why it is doing this?
Could it be possible whitespace?
Is the problem the line terminator? It could be as simple as changing one line:
a = csv.writer(fp, delimiter=",", quoting=csv.QUOTE_NONE, quotechar='', lineterminator='\n')
I suspect this is it, since csv.writer defaults to using carriage return + line feed ("\r\n") as the line terminator. The program you are using to read the file might be expecting just a line feed ("\n"). This is common when moving files back and forth between *nix and Windows.
If this doesn't work, then the program you are using to read the file seems to expect no line terminator after the last row. I'm not sure the csv module supports that. For that, you could write the CSV to a StringIO, strip() the result, and then write that to your file.
Also, since you are not quoting anything, is there a reason to use csv at all? Why not:
with open(file, 'wb') as fp:
    fp.write("\n".join( [ ",".join([ field for field in record ]) for record in get_results()]))
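The StringIO approach mentioned above could look roughly like this on Python 3. It is a sketch under assumptions: rows_to_csv_text is a hypothetical helper, and the sample rows stand in for whatever get_results() returns.

```python
import csv
import io


def rows_to_csv_text(rows):
    """Render rows as CSV text with no line terminator after the last row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue().rstrip("\n")


# Made-up data in place of the header row plus get_results().
results = [["Id", "Building", "Age", "Gender"], [1, "A", 30, "F"]]
text = rows_to_csv_text(results)
print(repr(text))
```

The text can then be written with a plain fp.write(text) on a file opened with "w" and newline='', so that no extra line terminator is added on the way out.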
