Trouble with UTF-8 CSV input in Python

This seems like it should be an easy fix, but so far a solution has eluded me. I have a single column csv file with non-ascii chars saved in utf-8 that I want to read in and store in a list. I'm attempting to follow the principle of the "Unicode Sandwich" and decode upon reading the file in:
import codecs
import csv
with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
    input_file = csv.reader(file, delimiter=",", quotechar='|')
    list = []
    for row in input_file:
        list.extend(row)
This produces the dreaded 'codec can't encode characters in position, ordinal not in range(128)' error.
I've also tried adapting a solution from this answer, which returns a similar error:
def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]
filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)
A very similar solution adapted from the docs returns the same error.
def unicode_csv_reader(utf8_data, dialect=csv.excel):
    csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)
Clearly I'm missing something. Most of the questions that I've seen regarding this problem seem to predate Python 2.7, so an update here might be useful.

Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.
Your 2nd and 3rd snippets are confused. Something like the following is all that you need:
f = open('your_utf8_encoded_file.csv', 'rb')
reader = csv.reader(f)
for utf8_row in reader:
    unicode_row = [x.decode('utf8') for x in utf8_row]
    print unicode_row
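For readers on Python 3, the same idea can be sketched as follows. This is an illustration under the assumption that Python 3 is available, where open() decodes the file to str and csv.reader accepts str directly; the sample file and its contents are hypothetical and are created only so the sketch runs on its own.

```python
import csv
import os
import tempfile

# Hypothetical single-column sample file with non-ASCII text,
# created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "utf8file.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("café\nnaïve\nZürich\n")

# In Python 3 the decoding happens in open(); csv.reader then
# receives str lines, so no per-cell decode step is needed.
values = []
with open(path, encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter=",", quotechar="|"):
        values.extend(row)

print(values)
```

The delimiter and quotechar are copied from the question; newline="" is the setting the csv documentation recommends for files passed to csv.reader.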

As it fails at the first character read, you may have a BOM. Use codecs.open('utf8file.csv', 'rU', encoding='utf-8-sig') if your file is UTF-8 and has a BOM at the beginning.
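A small sketch of the difference between the two codecs, assuming Python 3. The file name and contents are hypothetical, and first_cell is a helper introduced only for this illustration.

```python
import codecs
import csv
import os
import tempfile

# Hypothetical file that starts with a UTF-8 byte order mark.
path = os.path.join(tempfile.mkdtemp(), "bom.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "naïve,1\n".encode("utf-8"))

def first_cell(encoding):
    """Return the first cell of the first row, decoded with `encoding`."""
    with open(path, encoding=encoding, newline="") as f:
        return next(csv.reader(f))[0]

with_bom = first_cell("utf-8")         # BOM is kept as a leading '\ufeff'
without_bom = first_cell("utf-8-sig")  # BOM is stripped by the codec

print(repr(with_bom), repr(without_bom))
```

With plain utf-8 the BOM ends up inside the first cell, which is one way the first character of a file can cause trouble.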

I'd suggest trying just:
input_file = csv.reader(open('utf8file.csv', 'r'), delimiter=",", quotechar='|')
or
input_file = csv.reader(open('utf8file.csv', 'rb'), delimiter=",", quotechar='|')
csv should be unicode aware, and it should just work.

Related

Missing headers in csv output file (python)

I am trying to output a CSV file, but the headers are missing. I have gone through my code line by line and cannot see what is wrong with it.
My sample data is:
ABC.csv (it may contain duplicate rows, so I have also included the code that removes them)
KeyID,GeneralID
145258,KL456
145259,BG486
145260,HJ789
145261,KL456
145259,BG486
145259,BG486
My code:
import csv
import fileinput
from collections import Counter
file_path_1 = "ABC.csv"
key_id = []
general_id = []
total_general_id = []
with open(file_path_1, 'rU') as f:
    reader = csv.reader(f)
    header = next(reader)
    lines = [line for line in reader]
    counts = Counter([l[1] for l in lines])
new_lines = [l + [str(counts[l[1])] for l in lines]
with open(file_path_1, 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(header + ['Total_GeneralID'])
    writer.writerows(new_lines)
with open(file_path_1, 'rU') as f:
    reader = csv.DictReader(f)
    for row in reader:
        key_id.append(row['KeyID'])
        general_id.append(row['GeneralID'])
        total_general_id.append(['Total_GeneralID'])
New_List = [[] for _ in range(len(key_id))]
for attr in range(len(key_id)):
    New_List[attr].append(key_id[attr])
    New_List[attr].append(general_id[attr])
    New_List[attr].append(total_general_id[attr])
with open('result_id_with_total.csv', 'wb+') as newfile:
    header = ['KEY ID', 'GENERAL ID' , 'TOTAL GENERAL ID']
    wr = csv.writer(newfile, delimiter=',', quoting = csv.QUOTE_MINIMAL)
    wr.writerow(header) #I already add the headers but it won't work.
    for item in New_List:
        if item not in newfile:
            wr.writerow(item)
Unfortunately, my output (result_id_with_total.csv) looks like this:
145258,KL456,2
145259,BG486,1
145260,HJ789,1
145261,KL456,2
What I am trying to achieve:
KEY ID,GENERAL ID,TOTAL GENERAL ID
145258,KL456,2
145259,BG486,1
145260,HJ789,1
145261,KL456,2
My main problem in this code:
wr.writerow(header)
won't work.

This is caused by opening the file with wb+ (write bytes). When a file is opened in bytes mode, you must write bytes to it, not strings.
I get this error in the console when I run it:
TypeError: a bytes-like object is required, not 'str'
Try changing wb+ to just w; this does the trick.
with open('result_id_with_total.csv', 'w') as newfile:
    header = ['KEY ID', 'GENERAL ID' , 'TOTAL GENERAL ID']
    wr = csv.writer(newfile, delimiter=',', quoting = csv.QUOTE_MINIMAL)
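A complete version of the write step might look like the sketch below, assuming Python 3 (where the TypeError above occurs). The rows are copied from the desired output in the question, the temporary directory is used only to keep the example self-contained, and newline='' is an addition following the csv documentation's recommendation rather than part of the answer above.

```python
import csv
import os
import tempfile

# Rows taken from the desired output shown in the question.
rows = [
    ["145258", "KL456", "2"],
    ["145259", "BG486", "1"],
    ["145260", "HJ789", "1"],
    ["145261", "KL456", "2"],
]
header = ["KEY ID", "GENERAL ID", "TOTAL GENERAL ID"]

out_path = os.path.join(tempfile.mkdtemp(), "result_id_with_total.csv")

# Text mode ('w'), not 'wb+': csv.writer emits str in Python 3.
with open(out_path, "w", newline="") as newfile:
    wr = csv.writer(newfile, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    wr.writerow(header)
    wr.writerows(rows)

# Read the file back to confirm the header is the first row.
with open(out_path, newline="") as f:
    written = list(csv.reader(f))

print(written[0])
```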

Trying to read the first line of CSV file returns ['/']

I have uploaded a CSV file through Django and I am trying to read the first line of it. The file is stored on the server in
/tmp/csv_file/test.csv
The file looks like this:
column_1,column_2,column_3
2175,294,Nuristan
2179,299,Sar-e-Pul
I am trying to get the headings of the file like:
absolute_base_file = '/tmp/csv_file/test.csv'
csv_reader = csv.reader(absolute_base_file)
csv_headings = next(csv_reader)
print csv_headings
I only get this in return:
['/']
EDITED
The permissions of the CSV file are:
-rw-rw-r--
Which should be ok.
EDITED AGAIN
Based on the recommendations and the help of @EdChum and @Moses Koledoye
I have checked if the file is read correctly using:
print (os.stat(absolute_base_file).st_size) # returns 64
Then I tried to see if seek(0) and csvfile.read(1) return a single printable character.
print csvfile.seek(0) returns None
print csvfile.read(1) returns 'c'
Then I thought perhaps there is a particular issue with next() function and I tried an alternative:
csv_reader = csv.reader(csvfile)
for row in csv_reader:
    print ("csv_reader")
Again this didn't work.

You passed a string instead of a file object, which is why you get the slash. Change to this:
with open (absolute_base_file) as csvfile:
    csv_reader = csv.reader(csvfile)
check the docs
See this working:
In [5]:
import csv
with open (r'c:\data\csv_test.csv') as csvfile:
    csv_reader = csv.reader(csvfile)
    csv_headings = next(csv_reader)
    print (csv_headings)
['column_1', 'column_2', 'column_3']
To successively access each row call next:
In [7]:
import csv
with open (r'c:\data\csv_test.csv') as csvfile:
    csv_reader = csv.reader(csvfile)
    csv_headings = next(csv_reader)
    print (csv_headings)
    first_row = next(csv_reader)
    print( 'first row: ', first_row)
['column_1', 'column_2', 'column_3']
first row: ['2175', '294', 'Nuristan']

You should pass a file object to your csv.reader not a string literal.
absolute_base_file = open(r'/tmp/csv_file/test.csv') # notice open
csv_reader = csv.reader(absolute_base_file)
csv_headings = next(csv_reader)
print csv_headings
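Why the result is exactly ['/'] can be shown with a short sketch, assuming Python 3. csv.reader accepts any iterable of strings, so a path string is iterated one character at a time and its first character becomes the first "row". The io.StringIO object below stands in for an opened file and holds the sample data from the question.

```python
import csv
import io

path = "/tmp/csv_file/test.csv"

# A str is iterable, so csv.reader treats each character as a line.
# The first "line" is the leading slash of the path.
from_string = next(csv.reader(path))

# A file-like object yields real lines; StringIO stands in for open(path).
csvfile = io.StringIO(
    "column_1,column_2,column_3\n"
    "2175,294,Nuristan\n"
    "2179,299,Sar-e-Pul\n"
)
from_file = next(csv.reader(csvfile))

print(from_string)
print(from_file)
```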

How to write to a CSV line by line?

I have data which is accessed via an HTTP request and sent back by the server in a comma-separated format. I have the following code:
site= 'www.example.com'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
soup = soup.get_text()
text=str(soup)
The content of text is as follows:
april,2,5,7
may,3,5,8
june,4,7,3
july,5,6,9
How can I save this data into a CSV file?
I know I can do something along the lines of the following to iterate line by line:
import StringIO
s = StringIO.StringIO(text)
for line in s:
But I'm unsure how to properly write each line to a CSV file.
EDIT---> Thanks for the feedback as suggested the solution was rather simple and can be seen below.
Solution:
import StringIO
s = StringIO.StringIO(text)
with open('fileName.csv', 'w') as f:
    for line in s:
        f.write(line)

General way:
##text=List of strings to be written to file
with open('csvfile.csv','wb') as file:
    for line in text:
        file.write(line)
        file.write('\n')
OR
Using CSV writer :
import csv
with open(<path to output_csv>, "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in data:
        writer.writerow(line)
OR
Simplest way:
f = open('csvfile.csv','w')
f.write('hi there\n') #Give your csv text here.
## Python will convert \n to os.linesep
f.close()

You could just write to the file as you would write any normal file.
with open('csvfile.csv','wb') as file:
    for l in text:
        file.write(l)
        file.write('\n')
If it happens to be a list of lists, you could use the built-in csv module directly:
import csv
with open("csvfile.csv", "wb") as file:
    writer = csv.writer(file)
    writer.writerows(text)
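A sketch of the list-of-lists route, assuming Python 3. The text value is the sample from the question; io.StringIO stands in for a real output file opened with open('csvfile.csv', 'w', newline='') so that the example needs no files on disk.

```python
import csv
import io

# Sample response body from the question.
text = "april,2,5,7\nmay,3,5,8\njune,4,7,3\njuly,5,6,9\n"

# Parse the comma-separated text into a list of rows (a list of lists).
rows = list(csv.reader(io.StringIO(text)))

# Write the rows back out. `out` stands in for a file opened with
# open('csvfile.csv', 'w', newline='').
out = io.StringIO()
csv.writer(out).writerows(rows)

print(rows[0])
print(repr(out.getvalue()))
```

Note that csv.writer ends each row with '\r\n' by default, which is why the written text differs from the input only in its line endings.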

I would simply write each line to a file, since it's already in a CSV format:
write_file = "output.csv"
with open(write_file, "wt", encoding="utf-8") as output:
    for line in text:
        output.write(line + '\n')
I can't recall how to write lines with line-breaks at the moment, though :p
Also, you might like to take a look at this answer about write(), writelines(), and '\n'.

To complement the previous answers, I whipped up a quick class to write to CSV files. It makes it easier to manage and close open files and achieve consistency and cleaner code if you have to deal with multiple files.
import csv
import os
class CSVWriter():
    filename = None
    fp = None
    writer = None
    def __init__(self, filename):
        self.filename = filename
        self.fp = open(self.filename, 'w', encoding='utf8')
        self.writer = csv.writer(self.fp, delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL, lineterminator='\n')
    def close(self):
        self.fp.close()
    def write(self, elems):
        self.writer.writerow(elems)
    def size(self):
        return os.path.getsize(self.filename)
    def fname(self):
        return self.filename
Example usage:
mycsv = CSVWriter('/tmp/test.csv')
mycsv.write((12,'green','apples'))
mycsv.write((7,'yellow','bananas'))
mycsv.close()
print("Written %d bytes to %s" % (mycsv.size(), mycsv.fname()))
Have fun
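One possible variation, offered as a sketch rather than as part of the class above: adding __enter__ and __exit__ lets the class be used in a with statement, so the file is closed even if a write raises. The two extra methods, the newline='' argument, and the temporary path are assumptions introduced for this illustration.

```python
import csv
import os
import tempfile

class CSVWriter:
    """Variant of the class above that also works as a context manager."""

    def __init__(self, filename):
        self.filename = filename
        self.fp = open(self.filename, "w", encoding="utf8", newline="")
        self.writer = csv.writer(
            self.fp, delimiter=";", quotechar='"',
            quoting=csv.QUOTE_ALL, lineterminator="\n",
        )

    def write(self, elems):
        self.writer.writerow(elems)

    def close(self):
        self.fp.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always close the file; returning False re-raises any exception.
        self.close()
        return False

# Hypothetical output path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with CSVWriter(path) as mycsv:
    mycsv.write((12, "green", "apples"))
    mycsv.write((7, "yellow", "bananas"))

with open(path, encoding="utf8") as f:
    content = f.read()
print(content)
```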

What about this:
with open("your_csv_file.csv", "w") as f:
    f.write("\n".join(text))
str.join() returns a string which is the concatenation of the strings in the iterable. The separator between elements is the string providing this method.

In my situation...
with open('UPRN.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('Name', 'UPRN','ADMIN_AREA','TOWN','STREET','NAME_NUMBER'))
    writer.writerows(lines)
You need to pass the newline='' argument to open() and it will work.
https://www.programiz.com/python-programming/writing-csv-files
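The effect of newline='' can be checked by reading the written file back as bytes. This is a sketch assuming Python 3; the column names are shortened and the lines value is hypothetical data.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for `lines` in the answer above.
lines = [("A", "1"), ("B", "2")]

path = os.path.join(tempfile.mkdtemp(), "UPRN.csv")

# newline='' stops open() from translating the '\r\n' that csv.writer
# emits, which on Windows would otherwise become '\r\r\n' and show up
# as a blank line between rows.
with open(path, "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerow(("Name", "UPRN"))
    writer.writerows(lines)

with open(path, "rb") as f:
    raw = f.read()

print(raw)
```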

I want an unicode type of the following code ::

I got the below code from SO expert but it's working for ANSI Strings and my input is UNICODE STRING. How to make this code work for both of the versions? TIA
import csv
from collections import defaultdict
summary = defaultdict(list)
csvin = csv.reader(open('qwetry.txt'), delimiter='|')
for row in csvin:
    summary[(row[1].split()[0], row[2])].append(int(row[5]))
csvout = csv.writer(open('datacopy.out','wb'), delimiter='|')
for who, what in summary.iteritems():
    csvout.writerow( [' '.join(who), len(what), sum(what)] )
Courtesy: Jon Clements

The csv module doesn't directly support reading and writing Unicode. You can find the details here. A generator that works around this is shown below:
import csv
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
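The limitation described above applies to Python 2. For comparison, here is a sketch of the question's summary code under Python 3, where the csv module works on str and needs no encode/decode wrapper. The pipe-delimited sample data is hypothetical, since the question does not show the contents of qwetry.txt, and io.StringIO stands in for the input and output files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical pipe-delimited input with non-ASCII text; the column
# positions (1, 2 and 5) follow the code in the question.
data = (
    "1|Élodie Durand|café|x|y|3\n"
    "2|Élodie Martin|café|x|y|4\n"
    "3|Jürgen Klein|thé|x|y|5\n"
)

summary = defaultdict(list)
for row in csv.reader(io.StringIO(data), delimiter="|"):
    # Key: first word of column 1 plus column 2; value: column 5 as int.
    summary[(row[1].split()[0], row[2])].append(int(row[5]))

out = io.StringIO()  # stands in for open('datacopy.out', 'w', newline='')
csvout = csv.writer(out, delimiter="|", lineterminator="\n")
for who, what in summary.items():  # items() replaces Python 2's iteritems()
    csvout.writerow([" ".join(who), len(what), sum(what)])

print(out.getvalue())
```

With real files, passing encoding='utf-8' to open() would be the place where the decoding and encoding happen.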

python csv write only certain fieldnames, not all

I must be missing something, but I don't get it. I have a csv, it has 1200 fields. I'm only interested in 30. How do you get that to work? I can read/write the whole shebang, which is ok, but i'd really like to just write out the 30. I have a list of the fieldnames and I'm kinda hacking the header.
How would I translate below to use DictWriter/Reader?
for file in glob.glob( os.path.join(raw_path, 'P12*.csv') ):
    fileReader = csv.reader(open(file, 'rb'))
    fileLength = len(file)
    fileGeom = file[fileLength-7:fileLength-4]
    table = TableValues[fileGeom]
    filename = file.split(os.sep)[-1]
    with open(out_path + filename, "w") as fileout:
        for line in fileReader:
            writer = csv.writer(fileout, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            if 'ID' in line:
                outline = line.insert(0,"geometryTable")
            else:
                outline = line.insert(0,table) #"%s,%s\n" % (line, table)
            writer.writerow(line)

Here's an example of using DictWriter to write out only fields you care about. I'll leave the porting work to you:
import csv
headers = ['a','b','d','g']
with open('in.csv','rb') as _in, open('out.csv','wb') as out:
    reader = csv.DictReader(_in)
    writer = csv.DictWriter(out,headers,extrasaction='ignore')
    writer.writeheader()
    for line in reader:
        writer.writerow(line)
in.csv
a,b,c,d,e,f,g,h
1,2,3,4,5,6,7,8
2,3,4,5,6,7,8,9
Result (out.csv)
a,b,d,g
1,2,4,7
2,3,5,8
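The same example can be run without any files on disk, which makes it easy to try. This sketch assumes Python 3 and uses io.StringIO objects in place of in.csv and out.csv; lineterminator='\n' is set only so the printed result matches the listing above.

```python
import csv
import io

headers = ["a", "b", "d", "g"]

# In-memory stand-ins for in.csv and out.csv.
_in = io.StringIO(
    "a,b,c,d,e,f,g,h\n"
    "1,2,3,4,5,6,7,8\n"
    "2,3,4,5,6,7,8,9\n"
)
out = io.StringIO()

reader = csv.DictReader(_in)
# extrasaction='ignore' silently drops keys that are not in `headers`;
# the default ('raise') would raise ValueError for the extra columns.
writer = csv.DictWriter(out, headers, extrasaction="ignore",
                        lineterminator="\n")
writer.writeheader()
for line in reader:
    writer.writerow(line)

print(out.getvalue())
```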

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.
