UnicodeDecodeError: 'utf8' codec can't decode byte error - Python

I have a CSV file with 4 columns: tweet_id, label, topic, and text. In one of the rows, the "text" column has the value:
I'm wit chu!! “#ShayDiddy: Officially boycotting #ups!!! Calling #apple to curse them out next for using them wasting my time!â€
I am using this code for importing the data:
def createTrainingCorpus(corpusFile):
    import csv
    corpus = []
    with open(corpusFile, 'rb') as csvfile:
        lineReader = csv.reader(csvfile, delimiter=',')
        r = 1
        for row in lineReader:
            if r < 257:
                corpus.append({"tweet_id": row[2], "label": row[1],
                               "topic": row[0], "text": row[4]})
                r = r + 1
    return corpus

corpusFile = "/Users/name/Desktop/corpus.csv"
TrainingData = createTrainingCorpus(corpusFile)
This row doesn't get added to the TrainingData list, and I receive an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
The TrainingData list has all the elements as expected until the loop reaches the row with the "text" value shown above. I googled the error but couldn't find a solution that worked for me. Please help.
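For what it's worth, a common fix is to decode each cell explicitly instead of letting Python 2 fall back to the implicit ascii codec. A minimal sketch, assuming the file is actually UTF-8 encoded (the 0xe2 byte suggests UTF-8 curly quotes):

def createTrainingCorpus(corpusFile):
    import csv
    corpus = []
    with open(corpusFile, 'rb') as csvfile:
        lineReader = csv.reader(csvfile, delimiter=',')
        r = 1
        for row in lineReader:
            if r < 257:
                # Decode each byte-string cell; 'replace' substitutes any
                # malformed bytes rather than raising UnicodeDecodeError.
                row = [cell.decode('utf-8', 'replace') for cell in row]
                corpus.append({"tweet_id": row[2], "label": row[1],
                               "topic": row[0], "text": row[4]})
                r = r + 1
    return corpus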

Related

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data

I want to read an Excel file using pandas but get the following error:
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
--------------------------------------------------------------------------
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
Here's the code that produces the error:
if not os.path.exists("Raw_Data"):
    os.mkdir("Raw_Data")

path = 'Raw_Data'
all_files = glob.glob(path + "/*.xls")

li = []
for filename in all_files:
    df_updated = pd.read_excel(filename, index_col=None, header=0)
    li.append(df_updated)
The file was exported to .xlsx from a .aspx internal server page.
I've spent the morning troubleshooting to no avail - any suggestions on how to proceed?
This may solve your decoding problem:

with open('dataset.xls', 'rb') as data:
    new = data.read().decode('utf-16-le')

Note the file needs to be opened in binary read mode ('rb', not 'w') so the raw bytes can be decoded; you can then do your operations on the decoded string new.
To decode using utf-16, the size of a string (in bytes) must be even.
This produces an error:

a = 'abcde'.encode().decode('utf-16')
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x65 in position 4: truncated data

while this runs fine:

a = 'abcdef'.encode().decode('utf-16')

You can add a space or newline at the end if the string length is not even, but this is a quick workaround and cannot be applied in all scenarios.
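A short sketch of that padding workaround; note that the decode succeeding only means no exception is raised, since pairs of ASCII bytes get reinterpreted as single UTF-16 code units:

data = 'abcde'.encode()          # 5 bytes -- odd length, would raise "truncated data"
if len(data) % 2:
    data += b' '                 # pad to an even length
decoded = data.decode('utf-16')  # no exception, though the text comes out scrambled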

Python Pandas in Colab: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte

I have been trying to load some datasets from Kaggle which have already been downloaded.
hist_trans = pd.read_csv('historical_transactions.csv')
new_trans = pd.read_csv('new_merchant_transactions.csv')
train = pd.read_csv('train.csv', parse_dates=['first_active_month'])
test = pd.read_csv('test.csv', parse_dates=['first_active_month'])
And I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte
Try the encoding option of read_csv, like below:

pd.read_csv('file', encoding="utf-8")

or

pd.read_csv('file', encoding="ISO-8859-1")

UnicodeEncodeError: 'utf-8' codec can't encode character '\udc43' in position 1: surrogates not allowed

I have a list containing place names, and I want to create another array, initially empty, then iterate over the list of place names and fill the empty array with them.
For example, my first location is 'CHARTRIDGE', and accessing this element in my LOCARY list via LOCARY[S[0][0]] gives: 'CHARTRIDGE'
I created an empty array: LOCLIST = np.empty([len(LOCARY),1])
I then wrote a for loop to fill it up with the items from LOCARY using:
for i in range(len(LOCARY)):
    LOCLIST[i] = LOCARY[S[i][0]]
But I get the error:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc43' in position 1: surrogates not allowed
I'm wondering if it doesn't like the ' characters in the place name.
Any help would be appreciated, thank you.
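Two things may be going on here: np.empty([len(LOCARY), 1]) allocates a float array, which can't hold strings, and '\udc43' is a lone surrogate, which typically means the place names were read with errors='surrogateescape'. A hedged sketch under that diagnosis, with hypothetical sample data:

import numpy as np

LOCARY = ['CHARTRIDGE', 'AMERSHAM']   # hypothetical sample data

# Use dtype=object so the array can hold Python strings.
LOCLIST = np.empty((len(LOCARY), 1), dtype=object)
for i in range(len(LOCARY)):
    # Re-encoding with 'replace' scrubs unpaired surrogates such as '\udc43'.
    LOCLIST[i, 0] = LOCARY[i].encode('utf-8', 'replace').decode('utf-8')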

python encoding for huge volumes of data

I have a huge amount of JSON data that I need to transfer to Excel (10,000 or so rows and 20-ish columns). I'm using csv. My code:
x = json.load(urllib2.urlopen('#####'))
f = csv.writer(codecs.open("fsbmigrate3.csv", "wb+", encoding='utf-8'))
y = #my headers
f.writerow(y)
for row in x:
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
is what comes up.
I have tried encoding the JSON data with

dict((k.encode('utf-8'), v.encode('utf-8')) for (k, v) in x)

but there is too much data to handle. Any ideas on how to pull this off? (Apologies for the lack of SO convention; it's my first post.)
The full traceback is:

Traceback (most recent call last):
  File "C:\Users\bryand\Desktop\bbsports_stuff\gba2.py", line 22, in <module>
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
[Finished in 6.2s]
Since you didn't specify, here's a Python 3 solution. The Python 2 solution is much more painful. I've included some short sample data with non-ASCII characters:
#!python3
import json
import csv
json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}, {"a": "one", "c": "three", "b": "two"}]'
data = json.loads(json_data)
with open('fsbmigrate3.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
    w.writeheader()
    w.writerows(data)
The utf-8-sig codec makes sure a byte order mark character (BOM) is written at the start of the output file, since Excel will assume the local ANSI encoding otherwise.
Since you have json data with key/value pairs, using DictWriter allows the headers to be specified; otherwise, the header order isn't predictable.

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2201: ordinal not in range(128)

I'm getting UnicodeDecodeError in the following code:
cr.execute(" SELECT c.nouveau_champs AS nouveau_champs FROM parametrage_consultation c ORDER BY c.order " )
nom = cr.fetchall() #un tuple d'elements
cr.execute(" SELECT c.type AS type FROM parametrage_consultation c ORDER BY c.order" )
type = cr.fetchall()
for i in range(len(nom)):
nom_str=''.join(nom[i])
type_str=''.join(type[i])
result = file("E:/addons/consultation/consultation_temp.py","r").read().replace(" #put a new field here"," '"+nom_str+"': fields."+type_str+"('"+nom_str+"'),\n #put a new field here\n")
file("E:/addons/consultation/consultation_temp.py","w").write(result)
result1 = file("E:/addons/consultation/consultation_view_new.txt","r").read().encode("utf-8").replace(" <!--put a new field here-->",' <field name="'+nom_str+'"/>\n <!--put a new field here-->')
file("E:/addons/consultation/consultation_view_new.txt","w").write(result1)
The problem is that I'm getting a UnicodeDecodeError while reading the second .txt file.
Your file uses a character encoding different from the one expected by Python. You can specify the encoding explicitly by using codecs.open instead of open.
See https://docs.python.org/2/library/codecs.html
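A minimal sketch of that suggestion, in the question's Python 2 style; utf-8 is an assumption about the file's real encoding, and nom_str stands in for the value produced in the loop above:

import codecs

nom_str = 'nouveau_champs'  # hypothetical field name

# codecs.open decodes/encodes with an explicit codec instead of implicit ascii.
with codecs.open("E:/addons/consultation/consultation_view_new.txt", "r", encoding="utf-8") as f:
    result1 = f.read().replace(" <!--put a new field here-->", ' <field name="' + nom_str + '"/>\n <!--put a new field here-->')
with codecs.open("E:/addons/consultation/consultation_view_new.txt", "w", encoding="utf-8") as f:
    f.write(result1)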
