I am trying to replace non-Unicode characters with an underscore (_), but this program, despite raising no syntax errors, does not solve the issue and I cannot determine why.
import csv
import unicodedata
import pandas as pd
df = pd.read_csv('/Users/pabbott/Desktop/Unicode.csv', sep = ',',
index_col=False, converters={'ClinetEMail':str, 'ClientZip':str,
'LocationZip':str, 'LicenseeName': str, 'LocationState':str,
'AppointmentType':str, 'ClientCity':str, 'ClientState':str})
data = df
for row in data:
    for val in row:
        try:
            val.encode("utf-8")
        except UnicodeDecodeError:
            replace(val, "_")
data.to_csv('UnicodeExport.csv', sep=',', index=False,
            quoting=csv.QUOTE_NONNUMERIC)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 4: invalid start byte
The above message (thrown from pd.read_csv) shows that the file is not saved in utf-8. You need to
either save the file as utf-8,
or read the file using the proper encoding.
For instance (the latter option), add encoding='windows-1252' to the df = pd.read_csv(… call, as follows:
df = pd.read_csv('/Users/pabbott/Desktop/Unicode.csv', sep = ',', encoding='windows-1252',
index_col=False, converters={'ClinetEMail':str, 'ClientZip':str,
'LocationZip':str, 'LicenseeName': str, 'LocationState':str,
'AppointmentType':str, 'ClientCity':str, 'ClientState':str})
Then you can drop the whole try: val.encode("utf-8") machinery inside the for row in data: / for val in row: loops.
From the pandas.read_csv documentation:
encoding : str, default None
Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of
Python standard encodings.
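If, after reading the file with the correct encoding, you still want the original goal of replacing every non-ASCII character with an underscore, a minimal sketch (an assumption on my part, using applymap and a regular expression instead of the original loop) could look like this:

import re
import pandas as pd

df = pd.read_csv('/Users/pabbott/Desktop/Unicode.csv', sep=',',
                 encoding='windows-1252', index_col=False)

# Replace every character outside the ASCII range with "_" in string cells;
# non-string cells are passed through unchanged.
df = df.applymap(lambda v: re.sub(r'[^\x00-\x7F]', '_', v)
                 if isinstance(v, str) else v)

df.to_csv('UnicodeExport.csv', sep=',', index=False)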
Related
I'm trying to build a method to import multiple kinds of CSV or Excel files and standardize them. Everything was running smoothly until a certain CSV showed up and gave me this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 133: invalid continuation byte
I'm building a set of try/excepts to cover the variations I run into, but for this one I couldn't figure out how to prevent the error.
if csv_or_excel_path[-3:]=='csv':
    try: table=pd.read_csv(csv_or_excel_path)
    except:
        try: table=pd.read_csv(csv_or_excel_path,sep=';')
        except:
            try: table=pd.read_csv(csv_or_excel_path,sep='\t')
            except:
                try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8')
                except:
                    try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep=';')
                    except: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep='\t')
By the way, the separator of the file is ";".
So:
a) I understand it would be easier to track down the problem if I could identify what the character in "position 133" actually is, but I'm not sure how to find that out. Any suggestions?
b) Does anyone have a suggestion on what to include in that try/except sequence to get past this problem?
For the record, this is probably better than multiple try/excepts:
def read_csv(filepath):
    if os.path.splitext(filepath)[1] != '.csv':
        return  # or whatever
    seps = [',', ';', '\t']                    # ',' is default
    encodings = [None, 'utf-8', 'ISO-8859-1']  # None is default
    for sep in seps:
        for encoding in encodings:
            try:
                return pd.read_csv(filepath, encoding=encoding, sep=sep)
            except Exception:  # should really be more specific
                pass
    raise ValueError("{!r} has no encoding in {} or separator in {}"
                     .format(filepath, encodings, seps))
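Usage then collapses to a single call; a minimal sketch (assuming os and pandas as pd are imported as in the snippets above):

table = read_csv(csv_or_excel_path)  # tries each separator/encoding combination in turn

Swallowing Exception keeps the loop short, but catching UnicodeDecodeError and pandas.errors.ParserError specifically would avoid hiding unrelated bugs.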
Another possibility is to do:
with open(path_to_file, encoding="utf8", errors="ignore") as f:
    table = pd.read_csv(f, sep=";")
With errors="ignore", problematic byte sequences are simply omitted from read() calls; you can also ask for a replacement character instead of dropping them. Either way, this should reduce the need for lots of painful error handling and nested try/excepts.
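For example, a small variant (same path_to_file as above) that keeps a visible marker in place of the bad bytes rather than silently dropping them:

import pandas as pd

# errors="replace" substitutes U+FFFD for each undecodable byte instead of
# removing it, which makes the problem spots easy to find in the result.
with open(path_to_file, encoding="utf8", errors="replace") as f:
    table = pd.read_csv(f, sep=";")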
Thanks for the support @woblers and @FHTMitchell. The problem was a weird encoding the CSV had: ISO-8859-1.
I fixed it by adding a few lines to the try/except sequence. Here is the full version of it:
if csv_or_excel_path[-3:]=='csv':
    try: table=pd.read_csv(csv_or_excel_path)
    except:
        try: table=pd.read_csv(csv_or_excel_path,sep=';')
        except:
            try: table=pd.read_csv(csv_or_excel_path,sep='\t')
            except:
                try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8')
                except:
                    try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep=';')
                    except:
                        try: table=pd.read_csv(csv_or_excel_path,encoding='utf-8',sep='\t')
                        except:
                            try: table=pd.read_csv(csv_or_excel_path,encoding="ISO-8859-1",sep=";")
                            except:
                                try: table=pd.read_csv(csv_or_excel_path,encoding="ISO-8859-1",sep=";")
                                except: table=pd.read_csv(csv_or_excel_path,encoding="ISO-8859-1",sep="\t")
I have a script that was working earlier but now stops due to UnicodeEncodeError.
I am using Python 3.4.3.
The full error message is the following:
Traceback (most recent call last):
File "R:/A/APIDevelopment/ScivalPubsExternal/Combine/ScivalPubsExt.py", line 58, in <module>
outputFD.writerow(row)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8a' in position 413: character maps to <undefined>
How can I address this error?
The Python script is below:
import pdb
import csv, sys, os
import glob
import os
import codecs

os.chdir('R:/A/APIDevelopment/ScivalPubsExternal/Combine')
joinedFileOut='ScivalUpdate'
csvSourceDir="R:/A/APIDevelopment/ScivalPubsExternal/Combine/AustralianUniversities"

# create dictionary from Codes file (Institution names and codes)
codes = csv.reader(open('Codes.csv'))
# rows of the file are stored as lists/arrays
InstitutionCodesDict = {}
InstitutionYearsDict = {}
for row in codes:
    # keys: institution names, values: institution codes
    InstitutionCodesDict[row[0]] = row[1]
    # define year dictionary with empty values field
    InstitutionYearsDict[row[0]] = []

# create a file descriptor for the output file; 'wt' means text mode ('rt' or 'r' is the same for reading)
with open(joinedFileOut,'wt') as csvWriteFD:
    # write the file (it is still empty here)
    outputFD=csv.writer(csvWriteFD,delimiter=',')
    # 'with' closes the file at the end, or earlier if an exception occurs
    # open each Scival file, create a file descriptor (encoding needed), then read it and print the name of the file
    if not glob.glob(csvSourceDir+"/*.csv"):
        print("CSV source files not found")
        sys.exit()
    for scivalFile in glob.glob(csvSourceDir+"/*.csv"):
        #with open(scivalFile,"rt", encoding="utf8") as csvInFD:
        with open(scivalFile,"rt", encoding="ISO-8859-1") as csvInFD:
            fileFD = csv.reader(csvInFD)
            print(scivalFile)
            # condition for the loop below
            printon=False
            # read all rows in the file; each row becomes a list
            for row in fileFD:
                if len(row)>1:
                    # the printon branch is skipped for the rows above the data because it is not yet set to True
                    if printon:
                        # insert instcode and institution at the front of each data row (after the header row)
                        row.insert(0, InstitutionCode)
                        row.insert(0, Institution)
                        if row[10].strip() == "-":
                            row[10] = " "
                        else:
                            p = row[10].zfill(8)
                            q = p[0:4] + '-' + p[4:]
                            row[10] = q
                        # write the output file
                        outputFD.writerow(row)
                    else:
                        if "Publications at" in row[1]:
                            # get institution name from cell B1
                            Institution=row[1].replace('Publications at the ', "").replace('Publications at ',"")
                            print(Institution)
                            # look up institution code from the dictionary
                            InstitutionCode=InstitutionCodesDict[Institution]
                    # printon gets set to True after the header row
                    if "Title" in row[0]: printon=True
                    if "Publication years" in row[0]:
                        # get the year, to print later which years were pulled
                        year=row[1]
                        # add year to institution in dictionary
                        if not year in InstitutionYearsDict[Institution]:
                            InstitutionYearsDict[Institution].append(year)

# Write a report showing the institution name followed by the years for
# which we have that institution's data.
with open("Instyears.txt","w") as instReportFD:
    for inst in InstitutionYearsDict:
        instReportFD.write(inst)
        for yr in InstitutionYearsDict[inst]:
            instReportFD.write(" "+yr)
        instReportFD.write("\n")
Make sure to use the correct encoding of your source and destination files. You open files in three locations:
codes = csv.reader(open('Codes.csv'))
: : :
with open(joinedFileOut,'wt') as csvWriteFD:
    outputFD=csv.writer(csvWriteFD,delimiter=',')
: : :
with open(scivalFile,"rt", encoding="ISO-8859-1") as csvInFD:
    fileFD = csv.reader(csvInFD)
This should look something like:
# Use the correct encoding. If you made this file on
# Windows it is likely Windows-1252 (also known as cp1252):
with open('Codes.csv', encoding='cp1252') as f:
    codes = csv.reader(f)
: : :
# The output encoding can be anything you want. UTF-8
# supports all Unicode characters. Windows apps tend to like
# the files to start with a UTF-8 BOM if the file is UTF-8,
# so 'utf-8-sig' is an option.
with open(joinedFileOut, 'w', encoding='utf-8-sig') as csvWriteFD:
    outputFD = csv.writer(csvWriteFD)
: : :
# This file is probably the cause of your problem and is not ISO-8859-1.
# Maybe UTF-8 instead? 'utf-8-sig' will safely handle and remove a UTF-8 BOM
# if present.
with open(scivalFile, 'r', encoding='utf-8-sig') as csvInFD:
    fileFD = csv.reader(csvInFD)
The error is caused by an attempt to write a string containing the U+008A character using the default cp1252 encoding of your system. It is trivial to fix: just declare a latin1 (or iso-8859-1) encoding for your output file, because that codec simply writes the original byte back without conversion:
with open(joinedFileOut,'wt', encoding='latin1') as csvWriteFD:
But this will only hide the real problem: where does this 0x8a character come from? My advice is to intercept the exception and dump the line where it occurs:
try:
    outputFD.writerow(row)
except UnicodeEncodeError:
    # print the row, the name of the file being processed and the line number
    print(scivalFile, fileFD.line_num, row)
The underlying cause is probably that one of the input files is not ISO-8859-1 encoded, but rather UTF-8 encoded...
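To test that hypothesis, here is a minimal sketch (reusing csvSourceDir and glob from the script above) that reports which input files decode cleanly as UTF-8:

import glob

# Files that raise UnicodeDecodeError here are the ones that need a
# different encoding than UTF-8.
for scivalFile in glob.glob(csvSourceDir + "/*.csv"):
    try:
        with open(scivalFile, encoding="utf-8") as f:
            f.read()
        print(scivalFile, "decodes cleanly as UTF-8")
    except UnicodeDecodeError as exc:
        print(scivalFile, "is not UTF-8:", exc)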
In Python 2.7 and Ubuntu 14.04 I am trying to write to a csv file:
csv_w.writerow( map( lambda x: flatdata.get( x, "" ), columns ))
This gives me the notorious
UnicodeEncodeError: 'ascii' codec can't encode character u'\u265b' in position 19: ordinal not in range(128)
error.
The usual advice on here is to use unicode(x).encode("utf-8")
I have tried this and also just .encode("utf-8") for both parameters in the get:
csv_w.writerow( map( lambda x: flatdata.get( unicode(x).encode("utf-8"), unicode("").encode("utf-8") ), columns ))
but I still get the same error.
Any help is much appreciated in getting rid of the error. (I imagine the unicode("").encode("utf-8") is clumsy but I'm still a newb).
EDIT:
My full program is:
#!/usr/bin/env python
import json
import csv
import fileinput
import sys
import glob
import os
def flattenjson( b, delim ):
    val = {}
    for i in b.keys():
        if isinstance( b[i], dict ):
            get = flattenjson( b[i], delim )
            for j in get.keys():
                val[ i + delim + j ] = get[j]
        else:
            val[i] = b[i]
    return val

def createcolumnheadings(cols):
    #create column headings
    print ('a', cols)
    columns = cols.keys()
    columns = list( set( columns ) )
    print('b', columns)
    return columns

doOnce=True
out_file= open( 'Excel.csv', 'wb' )
csv_w = csv.writer( out_file, delimiter="\t" )

print sys.argv, os.getcwd()
os.chdir(sys.argv[1])

for line in fileinput.input(glob.glob("*.txt")):
    print('filename:', fileinput.filename(),'line #:',fileinput.filelineno(),'line:', line)
    data = json.loads(line)
    flatdata = flattenjson(data, "__")
    if doOnce:
        columns=createcolumnheadings(flatdata)
        print('c', columns)
        csv_w.writerow(columns)
        doOnce=False
    csv_w.writerow( map( lambda x: flatdata.get( unicode(x).encode("utf-8"), unicode("").encode("utf-8") ), columns ))
A redacted single tweet that throws the error UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 14: ordinal not in range(128) is available here.
SOLUTION: as per Alistair's advice I installed unicodecsv.
The steps were:
Download the zip from here
install it: sudo pip install /path/to/zipfile/python-unicodecsv-master.zip
import unicodecsv as csv
csv_w = csv.writer(f, encoding='utf-8')
csv_w.writerow(flatdata.get(x, u'') for x in columns)
Without seeing your data, it would seem that your data contains Unicode data types (see How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte" for a brief explanation of Unicode vs. str types).
Your attempt to encode it is then error-prone: any str with non-ASCII bytes in it will throw an error when you unicode() it (see the previous link for an explanation).
You should get all your data into Unicode types before writing to CSV. As Python 2.7's csv module doesn't handle Unicode, you will need to use the drop-in replacement: https://github.com/jdunck/python-unicodecsv.
You may also wish to break out your map into a separate statement to avoid confusion. Make sure to provide the full stack trace and examples of your code.
csv_w.writerow( map( lambda x: flatdata.get( unicode(x).encode("utf-8"), unicode("").encode("utf-8") ), columns ))
You've encoded the parameters passed to flatdata.get(), i.e. the dict key. But the unicode characters aren't in the key, they're in the value. You should encode the value returned by get():
csv_w.writerow([flatdata.get(x, u'').encode('utf-8') for x in columns])
I have a huge amount of JSON data that I need to transfer to Excel (10,000 or so rows and 20-ish columns). I'm using csv. My code:
x = json.load(urllib2.urlopen('#####'))
f = csv.writer(codecs.open("fsbmigrate3.csv", "wb+", encoding='utf-8'))
y = #my headers
f.writerow(y)
for row in x:
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
is what comes up.
I have tried encoding the JSON data:
dict((k.encode('utf-8'), v.encode('utf-8')) for (k,v) in x)
but there is too much data to handle.
Any ideas on how to pull this off? (Apologies for the lack of SO convention, it's my first post.)
The full traceback is:
Traceback (most recent call last):
File "C:\Users\bryand\Desktop\bbsports_stuff\gba2.py", line 22, in <module>
f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
[Finished in 6.2s]
Since you didn't specify, here's a Python 3 solution. The Python 2 solution is much more painful. I've included some short sample data with non-ASCII characters:
#!python3
import json
import csv
json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}, {"a": "one", "c": "three", "b": "two"}]'
data = json.loads(json_data)
with open('fsbmigrate3.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
    w.writeheader()
    w.writerows(data)
The utf-8-sig codec makes sure a byte order mark character (BOM) is written at the start of the output file, since Excel will assume the local ANSI encoding otherwise.
Since you have json data with key/value pairs, using DictWriter allows the headers to be specified; otherwise, the header order isn't predictable.
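For completeness, a rough Python 2 sketch (an assumption on my part, using the third-party unicodecsv package that comes up elsewhere in these answers) would look very similar, since unicodecsv encodes the unicode values while writing:

# Hedged sketch for Python 2, assuming `pip install unicodecsv`.
import json
import unicodecsv

json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}]'
data = json.loads(json_data)

with open('fsbmigrate3.csv', 'wb') as f:  # the csv modules want binary mode on Python 2
    w = unicodecsv.DictWriter(f, fieldnames=sorted(data[0].keys()),
                              encoding='utf-8')
    w.writeheader()
    w.writerows(data)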
So I have two arrays of tuples that are arranged with Restaurant Name and an Int:
("Restaurant Name", 0)
One is called ArrayForInitialSpots, and the other is called ArrayForChosenSpots. What I want to do is write the tuples from both arrays side by side in a CSV file, like this:
"First Restaurant in ArrayForInitialSPots",0,"First Restaurant in ArrayForChosenSpots", 1
"Second Restaurant in ArrayForInitialSpots",0,"Second Restaurant in ArrayForChosenSpots",0
So far I've tried doing this:
with open('data.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Restaurant Name','Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow(x + y)
        #csv_out.writerow(y)
But I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-6: ordinal not in range(128)
If I remove the zip function, I get "too many values to unpack". Any suggestions? Thank you very much in advance.
There are two things that you could use to handle extended ASCII characters while writing to files:
Set default encoding to utf-8
import sys
reload(sys).setdefaultencoding("utf-8")
Use the unicodecsv writer to write data to files (see the sketch below):
import unicodecsv
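As a rough sketch (assuming a pip-installed unicodecsv and the two arrays from the question), the writer is used exactly like the stdlib one but takes an encoding argument:

import unicodecsv

# unicodecsv encodes unicode values on write, so the tuples can be passed
# through unchanged; note the binary file mode on Python 2.
with open('data.csv', 'wb') as out:
    csv_out = unicodecsv.writer(out, encoding='utf-8')
    csv_out.writerow(['Restaurant Name', 'Change'] * 2)
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow(x + y)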
With mhawke's help, here is my solution:
with open('data.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Restaurant Name','Change'])
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        list_ = [str(word).decode("utf8") for word in (x+y)]
        counter = 0
        while counter < len(list_):
            s=""
            for i in range(counter,counter+4):
                s+=list_[i].encode('utf-8')
                s+=","
            counter = counter + 4
            csv_out.writerow(s[:-1])
The problem is not due to your use of zip(); that looks OK. Instead it is an encoding issue: probably the restaurant names are unicode strings, or byte strings in some encoding other than ASCII or UTF-8? ISO-8859-1 perhaps?
The csv module does not handle unicode; other encodings might work, but it depends. The module does handle 8-bit values OK (except ASCII NUL), so you should be able to encode the values as UTF-8 like this:
ENCODING = 'iso-8859-1'  # assume strings are encoded in this encoding

def to_utf8(item, from_encoding):
    if isinstance(item, str):
        # byte strings are first decoded to unicode
        item = unicode(item, from_encoding)
    return unicode(item).encode('utf8')

with open('data.csv', 'w') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Restaurant Name', 'Change'] * 2)
    for x, y in zip(arrayForInitialSpots, arrayForChosenSpots):
        csv_out.writerow([to_utf8(item, ENCODING) for item in x+y])
This works by converting each element of the tuple formed by x+y into a UTF-8 string. This includes byte strings in other encodings, as well as other objects such as integers that can be converted to a unicode string via unicode(). If your strings are unicode, just set ENCODING to None.
I'd suggest using numpy:
import numpy as np
IniSpots=[("Restaurant Name0a", 0),("Restaurant Name1a", 1)]
ChoSpots=[("Restaurant Name0b", 0),("Restaurant Name1b", 0)]
c=np.hstack((IniSpots,ChoSpots))
np.savetxt("data.csv", c, fmt='%s',delimiter=",")