ASCII codec error while converting from xlsx to CSV - Python

I have read some posts related to Unicode errors but didn't find a solution to my problem. I am converting xlsx to csv from a workbook of 6 sheets.
I use the following code:
import csv
import os
import xlrd

def csv_from_excel(file_loc):
    # file access check
    print os.access(file_loc, os.R_OK)
    wb = xlrd.open_workbook(file_loc)
    print wb.nsheets
    sheet_names = wb.sheet_names()
    print sheet_names
    counter = 0
    while counter < wb.nsheets:
        try:
            sh = wb.sheet_by_name(sheet_names[counter])
            file_name = str(sheet_names[counter]) + '.csv'
            print file_name
            fh = open(file_name, 'wb')
            wr = csv.writer(fh, quoting=csv.QUOTE_ALL)
            for rownum in xrange(sh.nrows):
                wr.writerow(sh.row_values(rownum))
        except Exception as e:
            print str(e)
        finally:
            fh.close()
            counter += 1
I get an error on the 4th sheet:
'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
but position 0 is blank, and it had converted to CSV up to the 33rd row.
I am unable to figure out why. CSV was an easy way to read the content and put it in my data structure.

You'll need to manually encode the Unicode values to bytes; for CSV, UTF-8 is usually fine:
for rownum in xrange(sh.nrows):
    wr.writerow([unicode(c).encode('utf8') for c in sh.row_values(rownum)])
Here I use unicode() for column data that is not text.
The character you encountered is U+2018 LEFT SINGLE QUOTATION MARK, which is just a fancy form of the ' single quote. Office software (spreadsheets, word processors, etc.) often auto-replaces plain single and double quotes with these 'fancy' versions. You could also just replace those with ASCII equivalents. You can do that with the Unidecode package:
from unidecode import unidecode

for rownum in xrange(sh.nrows):
    wr.writerow([unidecode(unicode(c)) for c in sh.row_values(rownum)])
Use this when non-ASCII codepoints are only used for quotes and dashes and other punctuation.
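Putting it together, here is a minimal sketch of the per-sheet loop with the UTF-8 fix applied (the workbook path and the helper name are placeholders, not from the original post):
import csv
import xlrd

def sheet_to_csv(sh, file_name):
    # Write one xlrd sheet to a CSV file, encoding every cell as UTF-8 bytes
    fh = open(file_name, 'wb')
    try:
        wr = csv.writer(fh, quoting=csv.QUOTE_ALL)
        for rownum in xrange(sh.nrows):
            wr.writerow([unicode(c).encode('utf8') for c in sh.row_values(rownum)])
    finally:
        fh.close()

wb = xlrd.open_workbook('workbook.xlsx')
for name in wb.sheet_names():
    sheet_to_csv(wb.sheet_by_name(name), name + '.csv')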


Python can't parse CSV as list (UTF-8 BOM) [duplicate]

Edit: this question, Convert UTF-8 with BOM to UTF-8 with no BOM in Python, only covers txt files and does not solve my issue with csv files.
I have two csv files:
rtc_csv_file="csv_migration\\rtc-test.csv"
ads_csv_file="csv_migration\\ads-test.csv"
Here is the ads-test.csv file (which is causing issues):
https://easyupload.io/bk1krp
The file is UTF-8 with BOM, according to the bottom-right corner of VS Code when I open the csv.
I am trying to write a Python function to read in every row and convert it to a dict object.
My function works fine for the first file, rtc-test.csv, but for the second file, ads-test.csv, I get the error UTF-16 stream does not start with BOM when I use utf-16. So I've tried utf-8 and utf-8-sig, but they only read in each line as one string with commas separating the values. I can't just split on commas, because some column values themselves contain commas.
My Python code correctly reads in rtc-test.csv as a list of values. How can I read in ads-test.csv as a list of values when the csv is encoded using UTF-8 with BOM?
code:
rtc_csv_file="csv_migration\\rtc-test.csv"
ads_csv_file="csv_migration\\ads-test.csv"

from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                if csv_cols is None:
                    csv_cols = row
                    dict['csv_cols']=csv_cols
                    print('csv_cols=',csv_cols)
                else:
                    row_id_val = row[csv_cols.index(str(id_format))]
                    print('row_id_val=',row_id_val)
                    dict['rows'][row_id_val] = row
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

rtc_dict = read_csv_as_map(rtc_csv_file, 'Id', 'utf-16')
ads_dict = read_csv_as_map(ads_csv_file, 'ID', 'utf-16')
console output:
filename: csv_migration\rtc-test.csv, id_format: Id, encoding: utf-16
csv_cols= ['Summary', 'Status', 'Type', 'Id', '12NC']
row_id_val= 262998
done
filename: csv_migration\ads-test.csv, id_format: ID, encoding: utf-16
err= UTF-16 stream does not start with BOM
If I try to use utf-16-le instead, I get a different error: 'utf-16-le' codec can't decode byte 0x22 in position 0: truncated data
If I try to use utf-16-be, I get this error: 'utf-16-be' codec can't decode byte 0x22 in position 0: truncated data
Why can't my Python code read this csv file?
Your CSV is encoded with UTF-8 (the default) instead of UTF-16, so pass that as the encoding:
ads_csv_file="ads-test.csv"

from csv import reader

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                if csv_cols is None:
                    csv_cols = row
                    dict['csv_cols']=csv_cols
                    print('csv_cols=',csv_cols)
                else:
                    row_id_val = row[csv_cols.index(str(id_format))]
                    print('row_id_val=',row_id_val)
                    dict['rows'][row_id_val] = row
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(ads_csv_file, 'ID', 'utf-8')  # <- updated here
Here's the CSV for reference:
Title,State,Work Item Type,ID,12NC
"453560751251 TOOL, SQ-59 CORNER CLAMP","To Do","FRUPS","6034","453560751251"

Handling special characters from Excel to CSV using Python

Hello, I am having issues handling a special character when converting from an Excel sheet to CSV using Python.
When I use
else:
    # Encode strings into format to preserve content of cell
    row_values.append(cell.value.encode("UTF-8").strip())
I get the special character as 'Â',
and when I use
else:
    # Encode strings into ISO-8859-1 format to preserve content of cell
    row_values.append(cell.value.encode("iso-8859-1").strip())
I get the special character as '�' (the ? in a diamond).
I believe it's something to do with the encoding, but I am not sure which one to use. These characters come from an Excel sheet converted to CSV.
Here is the code I used:
def convert_to_csv(excel_file, input_dir, output_dir):
    """Convert an excel file to a CSV file by removing irrelevant data"""
    try:
        sheet = read_excel(excel_file)
    except UnicodeDecodeError:
        print 'File %s is possibly corrupt. Please check again.' % (excel_file)
        sys.exit(1)
    row_num = sheet.get_highest_row()  # Number of rows
    col_num = sheet.get_highest_column()  # Number of columns
    all_rows = []
    # Loop through rows and columns
    for row in range(row_num):
        row_values = []
        for column in range(col_num):
            # Get cell element
            cell = sheet.cell(row=row, column=column)
            # Ignore empty cells
            if cell.value is not None:
                if type(cell.value) == int or type(cell.value) == float:
                    # String encoding not applicable for integers and floating point numbers
                    row_values.append(cell.value)
                else:
                    # Encode strings into ISO-8859-1 format to preserve content of cell
                    row_values.append(cell.value.encode("iso-8859-1").strip())
            else:
                row_values.append('')
        # Append rows only having more than three values each
        if len(set(row_values)-{''}) > 3:
            # print row_values
            all_rows.append(row_values)
    # Saving the data to a csv extension with the same name as the given excel file
    output_path = os.path.join(output_dir, excel_file.split('.')[0] + '.csv')
    with open(output_path, 'wb') as f:
        writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
        writer.writerows(all_rows[1:])
Using Python 2.6.9.
I was wondering if we can use a regular expression just before writing to the CSV.
Is there any way we can handle this?
Thanks in advance.
Well, I got it fixed:
else:
    # Encode strings into ISO-8859-1 format to preserve content of cell
    row_values.append(
        re.sub(r'[^\x00-\x7f]', r'', cell.value).strip())
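Note that this regex simply deletes every non-ASCII character, accents included. If the goal were to keep that text rather than drop it, a sketch of the alternative (the sample values are hypothetical stand-ins for cell.value, assumed to be unicode strings) is to encode to UTF-8 only at write time:
import csv

# Hypothetical cell values standing in for cell.value from the worksheet
row_values = [u'Montr\u00e9al ', u'caf\u00e9']

with open('out.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=';', quoting=csv.QUOTE_ALL)
    # Encode each unicode value to UTF-8 bytes only when writing the row
    writer.writerow([v.strip().encode('utf-8') for v in row_values])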

How to remove non-UTF-8 content and save as a CSV file in Python

I have some Amazon review data that I have converted from text format to CSV format successfully. The problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove the non-UTF-8 content and save to another CSV file?
thank you!
EDIT1:
Here is the code I use to convert the text to csv:
import csv
import string

INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"

header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

f = open(INPUT_FILE_NAME,encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME,"w")
outfile.write(",".join(header) + "\n")

currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the review text won't end up in many columns
    line = line.replace(',','')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":",1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))

f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I solved it by modifying the output encoding in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...
You have basically 2 ways to deal with decode errors:
use a charset that will accept any byte, such as iso-8859-15 (also known as latin9)
if the output should be UTF-8 but contains errors, use errors='ignore' to silently remove non-UTF-8 characters, or errors='replace' to replace the undecodable bytes with a replacement marker (the '�' character)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
If you are using Python 3, it provides built-in support for Unicode content:
f = open('file.csv', encoding="utf-8")
If you still want to remove all unicode data from it, you can read it as a normal text file and remove the unicode content
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        # drop any non-ASCII bytes and come back to str
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # drop non-ASCII characters, then decode back to str so the regex below works
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # drop any leftover tail from the original, longer content
Now you can read it without any unicode data issues.
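Since the original goal was loading the result into pandas, a minimal sketch of that step once the file decodes cleanly as UTF-8 (assuming pandas is installed; the file name is the OUTPUT_FILE_NAME from the question's code):
import pandas as pd

# Read the cleaned CSV; 'utf-8' should now decode without errors
df = pd.read_csv("small-movies1.csv", encoding="utf-8")
print(df.head())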

Python error "Ordinal not in range" with accents

I'm scraping a table from the Internet and saving as a CSV file. There are characters with French accents in the text, resulting in a unicode error on save:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-6: ordinal not in range(128)
I'd like to find an elegant solution for saving accented characters that I can apply to any situation. I've sometimes used the following:
encode('ascii','ignore')
but it doesn't work this time, for reasons unknown. I'm also trying to replace the <sup> tags in a cell, so I'm converting it using str() first.
Here's the pertinent part of my code:
data = [
    str(td[0]).split('<sup')[0].split('>')[1].split('<')[0],
    td[1].getText()
]
output.append(data)

csv_file = csv.writer(open('savedFile.csv', 'w'), delimiter=',')
for line in output:
    csv_file.writerow(line)
If td[0] is u"a<sup>b</sup>c":
td[0].split('<sup')[0] is u"a".
td[0].partition('>')[2].split('<')[0] is u"b".
td[0][td[0].rindex('>') + 1:] is u"c".
If this kind of string indexing and matching is too simple, you might consider creating a regular expression and matching it against the text in the HTML tag:
import re
r = re.compile("[^<]*<sup>([^<]*)</sup>")
m = r.match("some<sup>text</sup>")
print(m.groups()[0])
The csv.reader() and csv.writer() require the files to be opened in binary mode. You should also close the file at the end. Therefore, you should write it like:
f = open('output.csv', 'wb')
writer = csv.writer(f, delimiter=',')
for row in output:
    writer.writerow(row)
f.close()
Or you can use the with construct when using newer versions of Python:
with open('output.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=',')
    for row in output:
        writer.writerow(row)
... and the file will be closed automatically.
Anyway, csv.writer() expects each row to be composed of byte strings (not Unicode strings). If you have Unicode strings, convert them using .encode('utf-8'):
for row in output:
    encoded_row = [s.encode('utf-8') for s in row]
    writer.writerow(encoded_row)

Python - Finding unicode/ascii problems

I am using csv.reader to pull in info from a very long sheet. I am doing work on that data set, and then I am using the xlwt package to give me a workable Excel file.
However, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)
My question to you all is, how can I find exactly where that error is in my data set? Also, is there some code that I can write which will look through my data set and find out where the issues lie (because some data sets run without the above error and others have problems)?
The answer is actually quite simple: as soon as you read your data from your file, convert it to Unicode using the encoding of your file, and handle the UnicodeDecodeError exception:
try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError, e:
    print "The error is there !"
This will save you from many troubles; you won't have to worry about multibyte character encoding, and external libraries (including xlwt) will just do The Right Thing if they need to write it.
Python 3.0 will make it mandatory to specify the encoding of a string, so it's a good idea to do it now.
The csv module doesn't support Unicode or null characters. You might be able to replace them by doing something like this, though (replace 'utf-8' with the encoding your CSV data is in):
import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' for unicode characters -> ?, 'ignore' to ignore them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?')  # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
If you want to find the positions of the characters which can't be handled by the csv module, you could do e.g.:
import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
Alternatively again, you could use this CSV opener which I wrote which can handle Unicode characters:
import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0:
            StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line:
                if c == Qualifier and c == cB41 == cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c == Qualifier:
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode:
                    # Not in qual mode and delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line:
                cB42 = cB41; cB41 = c
                if c in Delims:
                    # Delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...
You can refer to code snippets in the question below to get a csv reader with unicode encoding support:
General Unicode/UTF-8 support for csv files in Python 2.6
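For reference, here is a minimal sketch along the lines of that recipe (essentially the UnicodeReader pattern from the Python 2 csv documentation; the file name in the usage comment is a placeholder):
import csv
import codecs

class UTF8Recoder:
    """Iterator that reads a stream in the given encoding and re-encodes it to UTF-8."""
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode('utf-8')

class UnicodeReader:
    """CSV reader that yields rows of unicode strings from a file in any encoding."""
    def __init__(self, f, dialect=csv.excel, encoding='utf-8', **kwds):
        self.reader = csv.reader(UTF8Recoder(f, encoding), dialect=dialect, **kwds)

    def next(self):
        return [unicode(s, 'utf-8') for s in self.reader.next()]

    def __iter__(self):
        return self

# usage (hypothetical file name):
# with open('data.csv', 'rb') as f:
#     for row in UnicodeReader(f, encoding='utf-8'):
#         print row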
PLEASE give the full traceback that you got along with the error message. When we know where you are getting the error (reading CSV file, "doing work on that data set", or in writing an XLS file using xlwt), then we can give a focused answer.
It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?
To find where the problems (not necessarily errors) are, try a little script like this (untested):
import sys, glob

for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)
It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
I'm curious about using "csv.reader to pull in info from a very long sheet" -- what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.
Have you worked through the tutorial from the python-excel.org site?
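A minimal sketch of that xlrd-to-xlwt route (the file and sheet names are placeholders; xlrd returns unicode for text cells, which xlwt accepts directly):
import xlrd
import xlwt

# Read the source workbook; xlrd hands back unicode for text cells.
in_book = xlrd.open_workbook('input.xls')
in_sheet = in_book.sheet_by_index(0)

# Write straight to a new workbook with xlwt; no encode/decode in between.
out_book = xlwt.Workbook(encoding='utf-8')
out_sheet = out_book.add_sheet('copy')

for r in xrange(in_sheet.nrows):
    for c in xrange(in_sheet.ncols):
        out_sheet.write(r, c, in_sheet.cell_value(r, c))

out_book.save('output.xls')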
