Python - Finding unicode/ascii problems

I am using csv.reader to pull in info from a very long sheet. I am doing work on that data set, and then I am using the xlwt package to give me a workable Excel file.
However, I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 34: ordinal not in range(128)
My question to you all is, how can I find exactly where that error is in my data set? Also, is there some code that I can write which will look through my data set and find out where the issues lie (because some data sets run without the above error and others have problems)?

The answer is quite simple, actually: as soon as you read your data from your file, convert it to unicode using the encoding of your file, and handle the UnicodeDecodeError exception:
try:
    # decode using utf-8 (use ascii if you want)
    unicode_data = str_data.decode("utf-8")
except UnicodeDecodeError, e:
    print "The error is there!"
This will save you from many troubles; you won't have to worry about multibyte character encodings, and external libraries (including xlwt) will just do the Right Thing if they need to write it.
Python 3.0 makes the separation between bytes and unicode text mandatory, so it's a good idea to do it now.
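In Python 3 terms, the same "decode at the boundary" idea looks like the sketch below; `try_decode` is just an illustrative helper name, but the `UnicodeDecodeError` attributes (`start`, `reason`) are real and pinpoint the offending byte for you:

```python
def try_decode(raw_bytes, encoding="utf-8"):
    """Return decoded text, or a diagnostic locating the first bad byte."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError as e:
        # e.start is the byte offset of the failure, e.reason explains it
        return "bad byte 0x%02x at offset %d (%s)" % (
            raw_bytes[e.start], e.start, e.reason)

print(try_decode(b"hello \x92world"))  # bad byte 0x92 at offset 6 (invalid start byte)
```

That 0x92 is exactly the byte from the error message in the question, so a helper like this tells you where in each record the problem sits.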

The csv module doesn't support unicode or null characters. You might be able to replace them by doing something like this, though (replace 'utf-8' with the encoding which your CSV data is encoded in):
import codecs
import csv

class AsciiFile:
    def __init__(self, path):
        self.f = codecs.open(path, 'rb', 'utf-8')

    def close(self):
        self.f.close()

    def __iter__(self):
        for line in self.f:
            # 'replace' turns unicode characters into '?', 'ignore' drops them
            y = line.encode('ascii', 'replace')
            y = y.replace('\0', '?')  # Can't handle null characters!
            yield y

f = AsciiFile(PATH)
r = csv.reader(f)
...
f.close()
If you want to find the positions of the characters which can't be handled by the CSV module, you could do e.g.:
import codecs

lineno = 0
f = codecs.open(PATH, 'rb', 'utf-8')
for line in f:
    for x, c in enumerate(line):
        if not c.encode('ascii', 'ignore') or c == '\0':
            print "Character ordinal %s line %s character %s is unicode or null!" % (ord(c), lineno, x)
    lineno += 1
f.close()
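For completeness, a Python 3 sketch of the same scan (the function name is made up; adjust the predicate to whatever your pipeline chokes on):

```python
def find_problem_chars(text):
    """Return (line, col, char) for each non-ASCII or null character."""
    problems = []
    for lineno, line in enumerate(text.splitlines()):
        for col, c in enumerate(line):
            if ord(c) > 127 or c == "\0":
                problems.append((lineno, col, c))
    return problems

print(find_problem_chars("abc\ndef\u2018g\n"))  # [(1, 3, '‘')]
```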
Alternatively again, you could use this CSV opener which I wrote which can handle Unicode characters:
import codecs

def OpenCSV(Path, Encoding, Delims, StartAtRow, Qualifier, Errors):
    infile = codecs.open(Path, "rb", Encoding, errors=Errors)
    for Line in infile:
        Line = Line.strip('\r\n')
        if (StartAtRow - 1) and StartAtRow > 0:
            StartAtRow -= 1
        elif Qualifier != '(None)':
            # Take a note of the chars 'before' just
            # in case of excel-style """ quoting.
            cB41 = ''; cB42 = ''
            L = ['']
            qMode = False
            for c in Line:
                if c == Qualifier and c == cB41 == cB42 and qMode:
                    # Triple qualifiers, so allow it with one
                    L[-1] = L[-1][:-2]
                    L[-1] += c
                elif c == Qualifier:
                    # A qualifier, so reverse qual mode
                    qMode = not qMode
                elif c in Delims and not qMode:
                    # Not in qual mode and delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
                cB42 = cB41
                cB41 = c
            yield L
        else:
            # There aren't any qualifiers.
            cB41 = ''; cB42 = ''
            L = ['']
            for c in Line:
                cB42 = cB41; cB41 = c
                if c in Delims:
                    # Delim
                    L.append('')
                else:
                    # Nothing to see here, move along
                    L[-1] += c
            yield L

for listItem in OpenCSV(PATH, Encoding='utf-8', Delims=[','], StartAtRow=0, Qualifier='"', Errors='replace'):
    ...

You can refer to code snippets in the question below to get a csv reader with unicode encoding support:
General Unicode/UTF-8 support for csv files in Python 2.6
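As a point of comparison, Python 3's csv module reads Unicode natively once the file is opened in text mode with the right encoding; a minimal sketch, using StringIO in place of a real file:

```python
import csv
import io

# In Python 3 the csv module handles Unicode by itself, provided the
# underlying file (here simulated with StringIO) is opened in text mode
# with the correct encoding.
data = io.StringIO("tag1,ol\u00e1\ntag2,a\u00e7\u00e3o\n")
rows = list(csv.reader(data))
print(rows)  # [['tag1', 'olá'], ['tag2', 'ação']]
```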

PLEASE give the full traceback that you got along with the error message. When we know where you are getting the error (reading CSV file, "doing work on that data set", or in writing an XLS file using xlwt), then we can give a focused answer.
It is very possible that your input data is not all plain old ASCII. What produces it, and in what encoding?
To find where the problems (not necessarily errors) are, try a little script like this (untested):
import sys, glob

for pattern in sys.argv[1:]:
    for filepath in glob.glob(pattern):
        for linex, line in enumerate(open(filepath, 'r')):
            if any(c >= '\x80' for c in line):
                print "Non-ASCII in line %d of file %r" % (linex+1, filepath)
                print repr(line)
It would be useful if you showed some samples of the "bad" lines that you find, so that we can judge what the encoding might be.
I'm curious about using "csv.reader to pull in info from a very long sheet" -- what kind of "sheet"? Do you mean that you are saving an XLS file as CSV, then reading the CSV file? If so, you could use xlrd to read directly from the input XLS file, getting unicode text which you can give straight to xlwt, avoiding any encode/decode problems.
Have you worked through the tutorial from the python-excel.org site?

Related

Python codecs encoding not working

I have this code
import collections
import csv
import sys
import codecs
from xml.dom.minidom import parse
import xml.dom.minidom

String = collections.namedtuple("String", ["tag", "text"])

def read_translations(filename):
    # Reads a csv file with rows made up of 2 columns:
    # the string tag, and the translated tag
    with codecs.open(filename, "r", encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=",")
        result = [String(tag=row[0], text=row[1]) for row in csv_reader]
    return result
The CSV file I'm reading contains Brazilian portuguese characters. When I try to run this, I get an error:
'utf8' codec can't decode byte 0x88 in position 21: invalid start byte
I'm using Python 2.7. As you can see, I'm encoding with codecs, but it doesn't work.
Any ideas?
The idea of this line:
with codecs.open(filename, "r", encoding='utf-8') as csvfile:
is to say "This file was saved as utf-8. Please make appropriate conversions when reading from it."
That works fine if the file was actually saved as utf-8. If some other encoding was used, then it is bad.
What then?
Determine which encoding was used. Assuming the information cannot be obtained from the software which created the file - guess.
Open the file normally and print each line:
with open(filename, 'rt') as f:
    for line in f:
        print repr(line)
Then look for a character which is not ASCII, e.g. ñ - this letter will be printed as some code, e.g.:
'espa\xc3\xb1ol'
Above, ñ is represented as \xc3\xb1, because that is the utf-8 sequence for it.
Now, you can check what various encodings would give and see which is right:
>>> ntilde = u'\N{LATIN SMALL LETTER N WITH TILDE}'
>>>
>>> print repr(ntilde.encode('utf-8'))
'\xc3\xb1'
>>> print repr(ntilde.encode('windows-1252'))
'\xf1'
>>> print repr(ntilde.encode('iso-8859-1'))
'\xf1'
>>> print repr(ntilde.encode('macroman'))
'\x96'
Or print all of them:
import encodings

for c in encodings.aliases.aliases:
    try:
        encoded = ntilde.encode(c)
        print c, repr(encoded)
    except:
        pass
Then, when you have guessed which encoding it is, use that, e.g.:
with codecs.open(filename, "r", encoding='iso-8859-1') as csvfile:
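If guessing by eye gets tedious, a tiny helper can try candidate encodings in order until one decodes cleanly. This is only a heuristic (iso-8859-1 accepts every byte, so it must come last), and the candidate list here is an assumption:

```python
def guess_encoding(raw, candidates=("utf-8", "windows-1252", "iso-8859-1")):
    """Return the first candidate encoding that decodes raw without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

print(guess_encoding(b"espa\xc3\xb1ol"))  # utf-8
print(guess_encoding(b"espa\xf1ol"))      # windows-1252
```

A clean decode does not prove the guess is right (several 8-bit charsets overlap); always eyeball the decoded text afterwards.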

Python Split not working properly

I have the following code to read the lines in a file and split them with a delimiter specified. After split I have to write some specific fields into another file.
Sample Data:
Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
Code:
import os
import codecs

sfilename = "WEEK_RPT_1108" + os.extsep + "dat"
sfilepath = "Club" + "/" + sfilename
sbackupname = "Club" + "/" + sfilename + os.extsep + "bak"

try:
    os.unlink(sbackupname)
except OSError:
    pass
os.rename(sfilepath, sbackupname)

try:
    inputfile = codecs.open(sbackupname, "r", "utf-16-le")
    outputfile = codecs.open(sfilepath, "w", "utf-16-le")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(';')
        outputfile.write(record[1])
except IOError, err:
    pass
I can see that position 0 of the list contains the whole line instead of just the first field:
record[0] = Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63
while printing record[1] gives an index-out-of-range error.
Need help as new to python.
Thanks!
After your comment saying that print line outputs u'\u6557\u6b65\u3934\u415f\u365f\u3030\u3230\u3030\u3b30\u614d\u3b72\u5946\u3431\u413b\u7463\u6175\u3b6c\u6f57\u6b72\u6e69\u3b67\u5f45\u3031\u3030\u503b\u5f43\u3030\u3030\u3030\u343b\u3832\u2e37\u3336', I can explain what happens and how to fix it.
What happens:
You have a normal 8-bit-characters file, and the line you show is even plain ASCII, but you try to decode it as if it were UTF-16 little endian. So you wrongly combine every two bytes into a single 16-bit unicode character! If your system had been able to display them correctly, and if you had used print line directly instead of repr(line), you would have got 敗步㤴䅟㙟〰㈰〰㬰慍㭲奆㐱䄻瑣慵㭬潗歲湩㭧彅〱〰倻彃〰〰〰㐻㠲⸷㌶. Of course, none of those unicode characters is the semicolon (; i.e. \x3b or \u003b), so the line cannot be split on it.
But as you encode it back before writing, record[0] ends up as the whole line in the new file, which leads you to believe, erroneously, that the problem is in the split function.
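The mis-combination is easy to reproduce in a couple of lines (Python 3 shown):

```python
# Two ASCII bytes fuse into one UTF-16 code unit when plain 8-bit text is
# decoded as utf-16-le, so the semicolons vanish and split(';') finds nothing.
raw = b"We;ek;"               # six ASCII bytes as stored on disk
wrong = raw.decode("utf-16-le")
print(len(raw), len(wrong))   # 6 3
print(";" in wrong)           # False
```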
How to fix:
Just open the file normally, or use the correct encoding if it contains non-ASCII characters. But since you are using Python 2, I would just do:
try:
    inputfile = open(sbackupname, "r")
    outputfile = open(sfilepath, "w")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(sdelimdatfile)
        outputfile.write(record[1])
except IOError, err:
    pass
If you really need to use the codecs module, for example if the file contains UTF8 or latin1 characters, you can replace the open part with:
encoding = "utf8"  # or "latin1" or whatever the actual encoding is...
inputfile = codecs.open(sbackupname, "r", encoding)
outputfile = codecs.open(sfilepath, "w", encoding)
Then there is the possibility that index [1] does not exist: either skip the line with "continue" when len(record) < 2, or simply don't write anything for it (as here):
for line in inputfile:
    record = line.split(';')
    if len(record) >= 2:
        outputfile.write(record[1])

how to remove non utf 8 code and save as a csv file python

I have some Amazon review data, and I have converted it from the text format to CSV format successfully. Now the problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove the non-UTF-8 bytes and save to another CSV file?
Thank you!
EDIT1:
Here is the code I use to convert the text to CSV:
import csv
import string

INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"

header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

f = open(INPUT_FILE_NAME, encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME, "w")
outfile.write(",".join(header) + "\n")

currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the comment review text won't be in many columns
    line = line.replace(',', '')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":", 1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I have solved it by modifying the output format in my code:
outfile = open(OUTPUT_FILE_NAME, "w", encoding="utf-8")
If the input file is not utf-8 encoded, it is probably not a good idea to try to read it as utf-8...
You basically have 2 ways to deal with decode errors:
use a charset that will accept any byte, such as iso-8859-15, also known as latin9
if the output should be utf-8 but contains errors, use errors='ignore' (silently removes non-utf-8 characters) or errors='replace' (replaces non-utf-8 characters with a replacement marker, usually ?)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
If you are using Python 3, it provides built-in support for unicode content -
f = open('file.csv', encoding="utf-8")
If you still want to remove all non-ASCII data from it, you can read it as a normal text file and remove the unicode content:
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # encode/decode round-trip drops anything outside ASCII
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # the cleaned text may be shorter than the original
Now you can read it without any unicode data issues.

Struggling with unicode in Python

I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position
5: ordinal not in range(128)
How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
for i in range(len(filenames)):
    pathname = filenames[i]
    fin = open(pathname, 'r')
    with codecs.open('Assets' + '.log', mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>", brand_start)
        brand = lines[brand_start+47:brand_end-2]
        f.write(u'{}|{}\n'.format(pathname[4:35], brand))
flog.close()
I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.
You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing it with Unicode here:
f.write(u'{}|{}\n'.format(pathname[4:35],brand))
brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files:
with io.open('Assets.log', 'w', encoding='utf-8') as f, \
        io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
You also appear to be parsing out an XML file by hand; perhaps you want to use the ElementTree API instead to parse out those values. In that case, you'd open the file without io.open(), so producing byte strings, so that the XML parser can correctly decode the information to Unicode values for you.
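A minimal stdlib sketch of that suggestion; the element and attribute names here are invented for illustration:

```python
import xml.etree.ElementTree as ET

# ElementTree decodes the raw bytes itself (honouring the XML declaration)
# and hands attribute values back as proper Unicode strings, so no manual
# encode/decode is needed.
xml_doc = b'<?xml version="1.0" encoding="utf-8"?><asset brand="Espa\xc3\xb1ol"/>'
root = ET.fromstring(xml_doc)
print(root.get("brand"))  # Español
```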
This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):
import io
import os
import lxml
from lxml import etree
from glob import glob

nsmap = {'xmlns': 'thisnamespace'}
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')
    for i in range(len(filenames)):
        pathname = filenames[i]
        parser = lxml.etree.XMLParser()
        tree = lxml.etree.parse(pathname, parser)
        root = tree.getroot()
        with io.open(pathname, encoding='utf-8') as fin:
            for info in root.xpath('//somepath'):
                series_x = info.find('./somemorepath')
                series = series_x.get('Asset_Name') if series_x is not None else 'Missing'
                lines = fin.read()
                brand_start = lines.find(u"sometext")
                brand_end = lines.find(u"/>", brand_start)
                brand = lines[brand_start:brand_end-2]
                brand = brand[(brand.rfind("/"))+1:]
                f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))
Someone will now come along and do it all in one line!

Ascii Code error while converting from xlsx to csv

I have referred to some posts related to unicode errors but didn't find any solution for my problem. I am converting xlsx to csv from a workbook of 6 sheets.
I use the following code:
def csv_from_excel(file_loc):
    # file access check
    print os.access(file_loc, os.R_OK)
    wb = xlrd.open_workbook(file_loc)
    print wb.nsheets
    sheet_names = wb.sheet_names()
    print sheet_names
    counter = 0
    while counter < wb.nsheets:
        try:
            sh = wb.sheet_by_name(sheet_names[counter])
            file_name = str(sheet_names[counter]) + '.csv'
            print file_name
            fh = open(file_name, 'wb')
            wr = csv.writer(fh, quoting=csv.QUOTE_ALL)
            for rownum in xrange(sh.nrows):
                wr.writerow(sh.row_values(rownum))
        except Exception as e:
            print str(e)
        finally:
            fh.close()
        counter += 1
I get an error on the 4th sheet:
'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
but position 0 is blank, and it has converted to csv up to the 33rd row.
I am unable to figure it out. CSV was an easy way to read the content and put it in my data structure.
You'll need to manually encode Unicode values to bytes; for CSV usually UTF-8 is fine:
for rownum in xrange(sh.nrows):
    wr.writerow([unicode(c).encode('utf8') for c in sh.row_values(rownum)])
Here I use unicode() for column data that is not text.
The character you encountered is the U+2018 LEFT SINGLE QUOTATION MARK, which is just a fancy form of the ' single quote. Office software (spreadsheets, word processors, etc.) often auto-replace single and double quotes with the 'fancy' versions. You could also just replace those with ASCII equivalents. You can do that with the Unidecode package:
from unidecode import unidecode
for rownum in xrange(sh.nrows):
    wr.writerow([unidecode(unicode(c)) for c in sh.row_values(rownum)])
Use this when non-ASCII codepoints are only used for quotes and dashes and other punctuation.
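If pulling in a third-party package is overkill, a small str.translate table covers the usual "smart" punctuation (Python 3 shown; the mapping is an illustrative subset, not a full Unidecode replacement):

```python
# str.translate accepts a {codepoint: replacement} dict; anything not in
# the table passes through unchanged.
SMART_PUNCT = {
    0x2018: "'", 0x2019: "'",    # left/right single quotation marks
    0x201C: '"', 0x201D: '"',    # left/right double quotation marks
    0x2013: "-", 0x2014: "--",   # en dash, em dash
}
print("\u2018quoted\u2019 \u2013 done".translate(SMART_PUNCT))  # 'quoted' - done
```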
