I’m updating a Python script from 2 to 3. It reads in a manifest (i.e., [batchdate]xmlList.xml), iterates through each XML file identified in the manifest, collects stats, then outputs a stats file in tab-delimited text format. The formatting and encoding of the tab file is off, and I can’t figure out how to fix it.
for encoding in utf-8:
class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
    def writerow(self, row):
        self.writer.writerow([str(s).encode("utf-8") for s in row])
        data = self.queue.getvalue()
        self.stream.write(data)
        self.queue.truncate(0)
read in xmllist.xml manifest:
xmlListPath = input('Enter the filepath of the xmlList.xml file: ').replace('"', '')
xmlListFile = codecs.open(xmlListPath)
xmlList = etree.parse(xmlListFile)
listRoot = xmlList.getroot()
xmlListFile.close()
create stats file and write header:
batchID = path.split(xmlListPath)[1]
statsFile = 'S:/Metadata/ETD/Documentation/Statistics/' + batchID.replace('xmlList.xml', '.stats.txt')
stats = open(statsFile, 'w')
wtrStats = UnicodeWriter(stats, delimiter='\t')
statsHeader = ['Author', 'Degree', 'Department', 'Embargo Start Date', 'Date Web Available',
'Embargo Code', 'Identifier', 'PURL', 'Title', 'Comments']
wtrStats.writerow(statsHeader)
Here is how the tab file is coming out:
b'Author' b'Degree' b'Department' b'Embargo Start Date' b'Date Web Available' b'Embargo Code' b'Identifier' b'PURL' b'Title' b'Comments'
b'Confer, Matthew Phelan' b'Ph.D.' b'Chemical & Biological Engineering' b'01/01/2021' b'01/01/2026' b'4' b'u0015_0000001_0003682' b'http://purl.lib.ua.edu/177826' b'EXPERIMENTAL AND COMPUTATIONAL STUDIES OF MATERIALS DECOMPOSITION' b''
Thanks for any help.
In Python 3, the csv module's readers and writers work with strings (Unicode text). When you pre-encode your strings and feed the writer bytes, it falls back to the string representation of each bytes object, which is a b'...'-prefixed string.
TL;DR: simply open your output file in the desired encoding, and point your csv.writer object to it - there is absolutely no need for this UnicodeWriter intermediate class you are listing.
import csv
...
stats = open(statsFile, 'w', encoding="utf-8")
wtrStats = csv.writer(stats, delimiter="\t")
...
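To see the difference in isolation, here is a minimal sketch that writes the same row twice, once pre-encoded and once as plain strings. It uses in-memory io.StringIO buffers instead of the real stats file, and the sample row is made up for illustration.

```python
import csv
import io

row = ["Author", "Ph.D.", "Chemical & Biological Engineering"]

# Pre-encoding hands csv.writer bytes objects. It converts non-string
# cells with str(), so the b'...' repr ends up in the output.
broken = io.StringIO()
csv.writer(broken, delimiter="\t").writerow([s.encode("utf-8") for s in row])

# Passing plain str cells produces the expected tab-delimited line.
fixed = io.StringIO()
csv.writer(fixed, delimiter="\t").writerow(row)

print(repr(broken.getvalue()))
print(repr(fixed.getvalue()))
```

When writing to a real file, opening it with newline='' in addition to the encoding lets the csv module control the line endings itself.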
I am trying to pickle a python object into a csv file. I want to write the pickle of an object as the third column in my file. I want to use pickle to avoid writing serialization for my complex objects.
Code to write to csv :
with open(self.file_path, 'a') as csv_file:
    wr = csv.writer(csv_file, delimiter='|')
    row = ['klines', symbol]
    row.extend(pickle.dumps(object))
    wr.writerow(row)
Code to read csv :
with open(self.simulation_file_name, 'r') as csv_file:
    line = csv_file.readline()
    while line != '':
        line = line.strip('\n')
        columns = line.split('|')
        event_type = line.pop(0)
        symbol = line.pop(0)
        pickled = line.pop(0)
        klines = pickle.loads(klines)
I get the following error :
TypeError: a bytes-like object is required, not 'str'
To store bytes or other binary data in a text file such as a CSV, encode it first with base64 (or a similar text-safe encoding) to avoid escaping problems. The code is simplified and assumes Python 3.
import base64
import csv
import pickle
obj = []  # placeholder; any picklable object works
with open('a.csv', 'a', encoding='utf8') as csv_file:
    wr = csv.writer(csv_file, delimiter='|')
    pickle_bytes = pickle.dumps(obj)            # unsafe to write
    b64_bytes = base64.b64encode(pickle_bytes)  # safe to write but still bytes
    b64_str = b64_bytes.decode('utf8')          # safe and in utf8
    wr.writerow(['col1', 'col2', b64_str])
# the file now contains a line like this one
# (the exact payload depends on the object and the pickle protocol):
# col1|col2|gANdcQAu
with open('a.csv', 'r') as csv_file:
    for line in csv_file:
        line = line.strip('\n')
        b64_str = line.split('|')[2]                   # take the pickled obj
        obj = pickle.loads(base64.b64decode(b64_str))  # retrieve
P.S. If you are not writing a utf8 file (e.g. ascii file), simply replace the encoding method.
P.S. Writing bytes in CSV is possible yet hardly elegant. One alternative is dumping a whole dict with dumped objects as values and storing keys in the CSV.
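As a rough sketch of how the asker's two snippets could look with this approach, the helpers below write and read one event per row and use csv.reader for parsing rather than splitting lines by hand. The function names write_event and read_events, the symbol "BTCUSDT", and the sample payload are all hypothetical.

```python
import base64
import csv
import os
import pickle
import tempfile


def write_event(path, event_type, symbol, obj):
    """Append one row whose third column is the base64-encoded pickle of obj."""
    payload = base64.b64encode(pickle.dumps(obj)).decode("ascii")
    with open(path, "a", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file, delimiter="|").writerow([event_type, symbol, payload])


def read_events(path):
    """Yield (event_type, symbol, obj) tuples, unpickling the third column."""
    with open(path, "r", newline="", encoding="utf-8") as csv_file:
        for event_type, symbol, payload in csv.reader(csv_file, delimiter="|"):
            yield event_type, symbol, pickle.loads(base64.b64decode(payload))


# Hypothetical sample data, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "events.csv")
write_event(path, "klines", "BTCUSDT", {"open": 1.0, "close": 2.5})
events = list(read_events(path))
print(events)
```

The base64 alphabet contains neither '|' nor quote characters, so the payload never collides with the delimiter. Only unpickle files you trust, since pickle.loads can execute arbitrary code.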
I have some variables in Unicode.
title
u'\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 \u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430'
type(title)
unicode
If I print this variable, I get:
print (title)
Администратор интернет-магазина
When I try to write this data (Cyrillic symbols) to CSV file:
with open('avito.csv','a') as f:
    writer=csv.writer(f)
    writer.writerow((title))
This error occurs:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0410' in position 0: ordinal not in range(128)
How can I write this variable as Cyrillic symbols to a CSV?
You have to write to the file with the correct encoding, and from your comment I guess, it is cp1251:
import io
title = u'\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 \u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430'
with io.open('avito.csv', 'a', encoding='cp1251') as output:
    output.write(title + '\n')
Here are three ways to do it on Python 2.7. Note that Excel expects a UTF-8 BOM at the start of the file in order to open it correctly. I write the BOM manually in the brute-force method; in the other two, the utf-8-sig codec handles it for you. Skip the BOM signature if you aren't dealing with lame editors (Windows Notepad) or Excel.
import csv
import codecs
import cStringIO
title = u'\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 \u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430'
print(title)
# Brute force
with open('avito.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8')) # writes "byte order mark" UTF-8 signature
    writer=csv.writer(f)
    writer.writerow([title.encode('utf8')])
# Example from the documentation for csv module
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
with open('avito2.csv','wb') as f:
    w = UnicodeWriter(f)
    w.writerow([title])
# 3rd party module, install from pip
import unicodecsv
with open('avito3.csv','wb') as f:
    w = unicodecsv.writer(f,encoding='utf-8-sig')
    w.writerow([title])
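For comparison, none of this machinery is needed on Python 3, where the csv module works with text and the file object does the encoding. A minimal sketch, assuming Python 3 and a temporary file with the made-up name avito_py3.csv:

```python
import csv
import os
import tempfile

title = ("\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 "
         "\u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-"
         "\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430")
path = os.path.join(tempfile.mkdtemp(), "avito_py3.csv")

# utf-8-sig writes the BOM once at the start of the file;
# newline='' leaves line endings to the csv module.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow([title])

# Read the raw bytes back to confirm the BOM is there.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])
```

Use encoding="utf-8" instead if the file is not meant for Excel or another BOM-sensitive program.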
I'd like some help with part of my code. The output file should be a Unicode .csv that is easy to read in Excel. Instead, it comes out unformatted, and the text in it is 7-bit ASCII.
I'd really appreciate your help; I've been on this for 4 hours now and can't find the problem yet :/
The last part of the script:
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8").replace("\n"," ").replace("\r"," ").replace("\t",'') for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
Python Version is 2.7 on windows 10
is in Ascii
Writing .csv format using unicode, for instance:
import io, csv
outfile = 'test/out.csv'
fieldnames = ['field1', 'field2']
# one dict per output row
content_rows = [{'field1': 'John', 'field2': 'Doo'}]
# Python 3: the csv module writes text straight to a file opened with an encoding
with io.open(outfile, 'w', newline='', encoding='utf-8') as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=fieldnames)
    writer.writeheader()
    for row_dict in content_rows:
        writer.writerow(row_dict)
I would like to export data from a csv file which contains unicode strings.
Previously I tried a Python script which works fine for ASCII data only. But it won't support unicode stuff either:
#! /usr/bin/env python
import csv
csv.register_dialect('custom', delimiter=',',
                     doublequote=True,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, skipinitialspace=False)
with open('input.csv') as ifile:
    data = csv.reader(ifile, dialect='custom')
    for record in data:
        for i, field in enumerate(record):
            print (" <field%s>" % i + field + "</field%s>" % i)
Traceback (most recent call last):
  for record in data:
_csv.Error: line contains NULL byte
Use the unicodecsv library instead:
https://github.com/jdunck/python-unicodecsv
import unicodecsv as csv
with open('input.csv', 'rb') as ifile:  # unicodecsv expects a byte stream
    rows = [row for row in csv.reader(ifile, encoding='utf-8')]
print(rows)
You can wrap the csv.reader in a class to handle it for you. The following is taken from the csv documentation examples and works for me:
#! /usr/bin/env python
import csv, codecs
class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self
csv.register_dialect('custom', delimiter=',',
                     doublequote=True,
                     escapechar=None,
                     quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, skipinitialspace=False)
with open('input.csv') as ifile:
    data = UnicodeReader(ifile, dialect='custom')
    for record in data:
        for i, field in enumerate(record):
            print (" <field%s>" % i + field + "</field%s>" % i)
There is also a UnicodeWriter class there if you need that functionality.
It seems you are using Python 3. Follow the very first code example in the docs:
#!/usr/bin/env python3
import csv
with open('input.csv', newline='', encoding=encoding) as csvfile:
    reader = csv.reader(csvfile, dialect="custom")
    for row in reader:
        print(", ".join(row))
Here the "custom" dialect is the one defined in the code in your question, and encoding is the character encoding of your file, such as "utf-16". If you omit the encoding argument, the encoding returned by locale.getpreferredencoding(False) is used.
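A "line contains NULL byte" error is often a sign that the file is UTF-16, because ASCII characters encoded as UTF-16 carry zero bytes. The sketch below assumes that is the case here: it first creates a small UTF-16 sample file (made-up contents, temporary location) and then reads it back through the question's "custom" dialect.

```python
import csv
import os
import tempfile

csv.register_dialect("custom", delimiter=",", doublequote=True,
                     escapechar=None, quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, skipinitialspace=False)

# Create a hypothetical UTF-16 input file to read back.
path = os.path.join(tempfile.mkdtemp(), "input.csv")
with open(path, "w", newline="", encoding="utf-16") as f:
    f.write('id,name\r\n1,"Zoë, Jr."\r\n')

# Decoding happens in open(); csv.reader only ever sees text.
with open(path, newline="", encoding="utf-16") as csvfile:
    records = list(csv.reader(csvfile, dialect="custom"))

for record in records:
    for i, field in enumerate(record):
        print("  <field%s>%s</field%s>" % (i, field, i))
```

If the real file uses a different encoding, only the encoding argument of the second open() call needs to change.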
I got the code below from an SO expert, but it only works for ANSI strings and my input is Unicode. How can I make this code work for both? TIA
import csv
from collections import defaultdict
summary = defaultdict(list)
csvin = csv.reader(open('qwetry.txt'), delimiter='|')
for row in csvin:
    summary[(row[1].split()[0], row[2])].append(int(row[5]))
csvout = csv.writer(open('datacopy.out','wb'), delimiter='|')
for who, what in summary.iteritems():
    csvout.writerow( [' '.join(who), len(what), sum(what)] )
Courtesy: Jon Clements
In Python 2, the csv module doesn’t directly support reading and writing Unicode; see the csv module documentation for details. A generator that works around this is shown below:
import csv
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]
def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
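If moving to Python 3 is an option, the summary script needs no wrapper at all, since csv handles Unicode text natively there. Below is a minimal sketch of such a port; the sample rows are invented, and io.StringIO stands in for the real input and output files.

```python
import csv
import io
from collections import defaultdict

# Hypothetical pipe-delimited input; column 1 is a name, column 2 a
# category, column 5 a number, matching the indexes used in the question.
sample = (
    "1|Иван Петров|sale|x|y|10\n"
    "2|Иван Сидоров|sale|x|y|5\n"
    "3|Anna Lee|refund|x|y|7\n"
)

summary = defaultdict(list)
for row in csv.reader(io.StringIO(sample), delimiter="|"):
    summary[(row[1].split()[0], row[2])].append(int(row[5]))

out = io.StringIO()
csvout = csv.writer(out, delimiter="|")
for who, what in summary.items():  # items() replaces Python 2's iteritems()
    csvout.writerow([" ".join(who), len(what), sum(what)])

print(out.getvalue())
```

With real files, open both with newline='' and an explicit encoding, for example open('qwetry.txt', newline='', encoding='utf-8'), choosing whichever encoding the input actually uses.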