I'm working with some CSV files, with the following code:
reader = csv.reader(open(filepath, "rU"))
try:
for row in reader:
print 'Row read successfully!', row
except csv.Error, e:
sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
And one file is throwing this error:
file my.csv, line 1: line contains NULL byte
What can I do? Google seems to suggest that it may be an Excel file that's been saved as a .csv improperly. Is there any way I can get round this problem in Python?
== UPDATE ==
Following #JohnMachin's comment below, I tried adding these lines to my script:
print repr(open(filepath, 'rb').read(200)) # dump 1st 200 bytes of file
data = open(filepath, 'rb').read()
print data.find('\x00')
print data.count('\x00')
And this is the output I got:
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00\x00\x00\x00\x00\x00\x00\ .... <snip>
8
13834
So the file does indeed contain NUL bytes.
As #S.Lott says, you should be opening your files in 'rb' mode, not 'rU' mode. However that may NOT be causing your current problem. As far as I know, using 'rU' mode would mess you up if there are embedded \r in the data, but not cause any other dramas. I also note that you have several files (all opened with 'rU' ??) but only one causing a problem.
If the csv module says that you have a "NULL" (silly message, should be "NUL") byte in your file, then you need to check out what is in your file. I would suggest that you do this even if using 'rb' makes the problem go away.
repr() is (or wants to be) your debugging friend. It will show unambiguously what you've got, in a platform independant fashion (which is helpful to helpers who are unaware what od is or does). Do this:
print repr(open('my.csv', 'rb').read(200)) # dump 1st 200 bytes of file
and carefully copy/paste (don't retype) the result into an edit of your question (not into a comment).
Also note that if the file is really dodgy e.g. no \r or \n within reasonable distance from the start of the file, the line number reported by reader.line_num will be (unhelpfully) 1. Find where the first \x00 is (if any) by doing
data = open('my.csv', 'rb').read()
print data.find('\x00')
and make sure that you dump at least that many bytes with repr or od.
What does data.count('\x00') tell you? If there are many, you may want to do something like
for i, c in enumerate(data):
if c == '\x00':
print i, repr(data[i-30:i]) + ' *NUL* ' + repr(data[i+1:i+31])
so that you can see the NUL bytes in context.
If you can see \x00 in the output (or \0 in your od -c output), then you definitely have NUL byte(s) in the file, and you will need to do something like this:
fi = open('my.csv', 'rb')
data = fi.read()
fi.close()
fo = open('mynew.csv', 'wb')
fo.write(data.replace('\x00', ''))
fo.close()
By the way, have you looked at the file (including the last few lines) with a text editor? Does it actually look like a reasonable CSV file like the other (no "NULL byte" exception) files?
data_initial = open("staff.csv", "rb")
data = csv.reader((line.replace('\0','') for line in data_initial), delimiter=",")
This works for me.
Reading it as UTF-16 was also my problem.
Here's my code that ended up working:
f=codecs.open(location,"rb","utf-16")
csvread=csv.reader(f,delimiter='\t')
csvread.next()
for row in csvread:
print row
Where location is the directory of your csv file.
You could just inline a generator to filter out the null values if you want to pretend they don't exist. Of course this is assuming the null bytes are not really part of the encoding and really are some kind of erroneous artifact or bug.
with open(filepath, "rb") as f:
reader = csv.reader( (line.replace('\0','') for line in f) )
try:
for row in reader:
print 'Row read successfully!', row
except csv.Error, e:
sys.exit('file %s, line %d: %s' % (filename, reader.line_num, e))
I bumped into this problem as well. Using the Python csv module, I was trying to read an XLS file created in MS Excel and running into the NULL byte error you were getting. I looked around and found the xlrd Python module for reading and formatting data from MS Excel spreadsheet files. With the xlrd module, I am not only able to read the file properly, but I can also access many different parts of the file in a way I couldn't before.
I thought it might help you.
Converting the encoding of the source file from UTF-16 to UTF-8 solve my problem.
How to convert a file to utf-8 in Python?
import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "utf-16") as sourceFile:
with codecs.open(targetFileName, "w", "utf-8") as targetFile:
while True:
contents = sourceFile.read(BLOCKSIZE)
if not contents:
break
targetFile.write(contents)
Why are you doing this?
reader = csv.reader(open(filepath, "rU"))
The docs are pretty clear that you must do this:
with open(filepath, "rb") as src:
reader= csv.reader( src )
The mode must be "rb" to read.
http://docs.python.org/library/csv.html#csv.reader
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
appparently it's a XLS file and not a CSV file as http://www.garykessler.net/library/file_sigs.html confirm
Instead of csv reader I use read file and split function for string:
lines = open(input_file,'rb')
for line_all in lines:
line=line_all.replace('\x00', '').split(";")
I got the same error. Saved the file in UTF-8 and it worked.
This happened to me when I created a CSV file with OpenOffice Calc. It didn't happen when I created the CSV file in my text editor, even if I later edited it with Calc.
I solved my problem by copy-pasting in my text editor the data from my Calc-created file to a new editor-created file.
I had the same problem opening a CSV produced from a webservice which inserted NULL bytes in empty headers. I did the following to clean the file:
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
data = myfile.read()
# clean file first if dirty
if data.count( '\x00' ):
print 'Cleaning...'
with codecs.open('my.csv.tmp', 'w', 'utf-8') as of:
for line in data:
of.write(line.replace('\x00', ''))
shutil.move( 'my.csv.tmp', 'my.csv' )
with codecs.open ('my.csv', 'rb', 'utf-8') as myfile:
myreader = csv.reader(myfile, delimiter=',')
# Continue with your business logic here...
Disclaimer:
Be aware that this overwrites your original data. Make sure you have a backup copy of it. You have been warned!
I opened and saved the original csv file as a .csv file through Excel's "Save As" and the NULL byte disappeared.
I think the original encoding for the file I received was double byte unicode (it had a null character every other character) so saving it through excel fixed the encoding.
For all those 'rU' filemode haters: I just tried opening a CSV file from a Windows machine on a Mac with the 'rb' filemode and I got this error from the csv module:
Error: new-line character seen in unquoted field - do you need to
open the file in universal-newline mode?
Opening the file in 'rU' mode works fine. I love universal-newline mode -- it saves me so much hassle.
I encountered this when using scrapy and fetching a zipped csvfile without having a correct middleware to unzip the response body before handing it to the csvreader. Hence the file was not really a csv file and threw the line contains NULL byte error accordingly.
Have you tried using gzip.open?
with gzip.open('my.csv', 'rb') as data_file:
I was trying to open a file that had been compressed but had the extension '.csv' instead of 'csv.gz'. This error kept showing up until I used gzip.open
One case is that - If the CSV file contains empty rows this error may show up. Check for row is necessary before we proceed to write or read.
for row in csvreader:
if (row):
do something
I solved my issue by adding this check in the code.
I have an csv sheet that i read it like this:
with open(csvFilePath, 'rU') as csvFile:
reader = csv.reader(csvFile, delimiter= '|')
numberOfMovies = 0
for row in reader:
title = row[1:2][0]
as you see, i am taking the value of title
Then i surf the internet for some info about that value and then i write to a file, the writing is like this:
def writeRDFToFile(rdf, fileName):
f = open("movies/" + fileName + '.ttl','a')
try:
#rdf = rdf.encode('UTF-8')
f.write(rdf) # python will convert \n to os.linesep
except:
print "exception happened for movie " + movieTitle
f.close()
In that function, i am writing the rdf variable to a file.
As you see there is a commetted line
If the value of rdf variable contains unicode char and that line was not commeted, that code doesn't write anything to the file.
However, if I just commet that line, that code writes to a file.
Okay you can say that: commit that line and everything will be fine, but that is not correct, because i have another java process (which is Fuseki server) that reads the file and if the file contains unicode chars, it throws an error.
so i need to solve the file myself, i need to encode that data to ut8,
help please
The normal csv library can have difficulty writing unicode to files. I suggest you use the unicodecsv library instead of the csv library. It supports writing unicode to CSVs.
Practically speaking, just write:
import unicodecsv as csv
So I am trying to create a Python script that will sort a specified column of an Excel sheet. So far my code is...
import csv
import operator
with open('case_name.csv') as infile:
data = list(csv.reader(infile, dialect=csv.excel_tab))
data.sort(key=operator.itemgetter(2))
with open('case_name_sorted.csv', 'w') as outfile:
writer = csv.writer(outfile, dialect='excel')
writer.writerows(data)
However, when I run this code I continue to get an error that says...
data = list(csv.reader(infile, dialect=csv.excel_tab))
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
I did some research and found out that .csv files do not work well on a Mac. So what should I change the file to that will keep it working like an Excel sheet? Also, if anyone has any pointers on how else I could sort a column, I would very much appreciate some tips. Thanks!
try opening the file with the mode
open('case_name.csv', mode='rU')
compare with: https://docs.python.org/2/library/functions.html#open
How can I tell Python to open a CSV file, and merge all columns per line, into new lines in a new TXT file?
To explain:
I'm trying to download a bunch of member profiles from a website, for a research project. To do this, I want to write a list of all the URLs in a TXT file.
The URLs are akin to this: website.com-name-country-title-id.html
I have written a script that takes all these bits of information for each member and saves them in columns (name/country/title/id), in a CSV file, like this:
mark japan rookie married
john sweden expert single
suzy germany rookie married
etc...
Now I want to open this CSV and write a TXT file with lines like these:
www.website.com/mark-japan-rookie-married.html
www.website.com/john-sweden-expert-single.html
www.website.com/suzy-germany-rookie-married.html
etc...
Here's the code I have so far. As you can probably tell I barely know what I'm doing so help will be greatly appreciated!!!
import csv
x = "http://website.com/"
y = ".html"
csvFile=csv.DictReader(open("NameCountryTitleId.csv")) #This file is stored on my computer
file = open("urls.txt", "wb")
for row in csvFile:
strArgument=str(row['name'])+"-"+str(row['country'])+"-"+str(row['title'])+"-"+str(row['id'])
try:
file.write(x + strArgument + y)
except:
print(strArgument)
file.close()
I don't get any error messages after running this, but the TXT file is completely empty.
Rather than using a DictReader, use a regular reader to make it easier to join the row:
import csv
url_format = "http://website.com/{}.html"
csv_file = 'NameCountryTitleId.csv'
urls_file = 'urls.txt'
with open(csv_file, 'rb') as infh, open(urls_file, 'w') as outfh:
reader = csv.reader(infh)
for row in reader:
url = url_format.format('-'.join(row))
outfh.write(url + '\n')
The with statement ensures the files are closed properly again when the code completes.
Further changes I made:
In Python 2, open a CSV files in binary mode, the csv module handles line endings itself, because correctly quoted column data can have embedded newlines in them.
Regular text files should be opened in text mode still though.
When writing lines to a file, do remember to add a newline character to delineate lines.
Using a string format (str.format()) is far more flexible than using string concatenations.
str.join() lets you join a sequence of strings together with a separator.
its actually quite simple, you are working with strings yet the file you are opening to write to is being opened in bytes mode, so every single time the write fails and it prints to the screen instead. try changing this line:
file = open("urls.txt", "wb")
to this:
file = open("urls.txt", "w")
EDIT:
i stand corrected, however i would like to point out that with an absence of newlines or some other form of separator, how do you intend to use the URLs later on? if you put newlines between each URL they would be easy to recover
I am reading a text file which was copied from a CSV file. When I read the file in python, I get a ton of unnecessary repeating lines as seen below. How can i strip away those three unwanted lines, including \cf0 and \cell\row at the beginning and end of each text?
Or should I read the text directly from the csv file itself? the text is in just one of the columns of the CSV file.
\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil
\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640
\pard\intbl\itap1\pardeftab720
\cf0 i have been using your product and it has been helping me a lot to solve business problem,\cell \row
\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil
\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640
\pard\intbl\itap1\pardeftab720
\cf0 I am very happy with your products. Very easy to use.\cell \row
\itap1\trowd \taflags1 \trgaph108\trleft-108 \trbrdrl\brdrnil \trbrdrr\brdrnil
\clvertalc \clshdrawnil \clbrdrt\brdrs\brdrw20\brdrcf2 \clbrdrl\brdrs\brdrw20\brdrcf2 \clbrdrb\brdrs\brdrw20\brdrcf2 \clbrdrr\brdrs\brdrw20\brdrcf2 \clpadl100 \clpadr100 \gaph\cellx8640
\pard\intbl\itap1\pardeftab720
\cf0 Many improvements with income tracker, and other time saving elements. Newer look, easier navigation. I believe there definitely is a time savings from past versions.\cell \row
Here is a snippet of the csv file:
page_url Review_title Product_id Rating Publish_date Review_Description
www.blabla.com Great! 777777 5 01/01/14 Excellent upgrade! Was not disappointed!
I only copied text from the Review_Description column and pasted them all in a text file.
Here is my python code to just read the file:
text_file=open("my_text.txt", "r")
lines=text_file.readlines()
print lines
Your real problem here appears to be that you pasted the CSV into an RTF file, not a text file. Pasting into Wordpad on Windows or TextEdit on Mac (especially if you copied from, say, Excel or Numbers) and saving it without explicitly telling it to "save as plain text" or "convert to plain text" will generally "help" you this way automatically.
While you could try to parse the RTF to recover the original text, you're much better off just using the original text if possible. Parsing CSV files in Python—either with Pandas, or with the stdlib's csv module—is very easy.
For example, your file appears to use tabs as delimiters, and no other non-default features. If so:
import csv
with open('my_csv.csv', 'rb') as f:
reader = csv.DictReader(f, delimiter='\t')
reviews = [row['Review_Description'] for row in reader]
Now you have a list of all the reviews, and can do anything you want with them. If you just want to print them out, it's even simpler:
import csv
with open('my_csv.csv', 'rb') as f:
reader = csv.DictReader(f, delimiter='\t')
for row in reader:
print row['Review_Description']