Python 3 now throwing ASCII error loading data from JSON file

I am relatively new to Python and have spent at least two hours searching the internet and now Stack Overflow, but I cannot find out what the problem is here. My code was working in Python 2; now I get the persistent error message below. I even found this code online as an apparent answer to my question, claiming it worked in Python 3, but it does not work for me. Odd.
import json, sys

with open('2689364.json', 'r', encoding='utf-8') as json_data:
    d = json.load(json_data)

print(d)
UnicodeEncodeError: 'ascii' codec can't encode character '\xb6' in position 1938: ordinal not in range(128)
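Note that this is an encode error, not a decode error: json.load() has already succeeded, and the failure most likely happens when print() encodes the output for a terminal whose encoding is ASCII. A minimal sketch of one workaround, assuming Python 3 and that the traceback points at the print() line (exporting PYTHONIOENCODING=utf-8 before running is an alternative):

import io
import json
import sys

with open('2689364.json', 'r', encoding='utf-8') as json_data:
    d = json.load(json_data)

# Wrap stdout in a UTF-8 writer so characters like '\xb6' can be
# printed even when the terminal advertises an ASCII locale.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
print(d)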

Related

'charmap' codec can't encode character '\u0110' in position 1: character maps to <undefined> while reading csv file using pd.read_csv

I have searched Google a lot for why I am getting this error, but everywhere I go the solution is to use a different encoding, like cp1252, iso-8859-1, latin1, or utf-8. I am actually using utf-8 and have tried all the other encodings too while using pd.read_csv.
When I read the CSV on a different PC with the same encoding, it does not throw this error, so I think this is a fault with my local machine.
This is how I read my CSV:
dataframe = pd.read_csv(csv_path + file_name, dtype='object', encoding="UTF-8")
If there are any other characters, like Arabic or Chinese, I get this error while reading the CSV file using pd.read_csv:
'charmap' codec can't encode character '\u0110' in position 1: character maps to <undefined>
I have visited a lot of Stack Overflow posts and a lot of other solution providers, but none of them fix my problem. The fault seems to be with my local machine, so can anyone help me figure out where this problem lies?
Thank you
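One observation worth testing: 'charmap' is an encode error, so it probably occurs when the dataframe is printed, not inside pd.read_csv itself, and a Windows console defaulting to cp1252 would explain why the same code works on another PC. A hedged sketch, with placeholder values for the question's csv_path and file_name (sys.stdout.reconfigure requires Python 3.7+):

import sys
import pandas as pd

# Placeholder values standing in for the question's csv_path and file_name.
csv_path = './'
file_name = 'data.csv'

# Force stdout to UTF-8 so printing the dataframe cannot hit the
# console's cp1252 'charmap' codec (Python 3.7+).
sys.stdout.reconfigure(encoding='utf-8', errors='replace')

dataframe = pd.read_csv(csv_path + file_name, dtype='object', encoding='utf-8')
print(dataframe.head())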

UnicodeDecodeError with nltk

I am working with Python 2.7 and nltk on a large .txt file of content scraped from various websites; however, I am getting various Unicode errors such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this with Python, but whether there is anything I can do to the .txt file itself (as in formatting, such as 'make plain text') before feeding it to Python, to avoid this issue entirely?
Update:
I looked around and found a solution within Python that seems to work perfectly:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Replace "ascii" with the desired encoding. (Note: this is Python 3 syntax; the encoding argument and the surrogateescape handler are not available with Python 2's built-in open.)
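As a Python 2.7 usage sketch, assuming the scraped file is actually UTF-8 (common for web content) and using a hypothetical filename 'scraped.txt'; errors='replace' substitutes any undecodable bytes instead of raising:

import io
import nltk

# Decode the scraped text explicitly so nltk receives unicode, not raw bytes.
with io.open('scraped.txt', encoding='utf-8', errors='replace') as f:
    text = f.read()

tokens = nltk.word_tokenize(text)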

Python encoding issue while reading a file

I am trying to read a file that contains the character "ë" in it. The problem is that I cannot figure out how to read it, no matter what I try with the encoding. When I look at the file manually in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to utf-8, utf-16, or anything else, it either does not work or messes up the entire file. I tried reading the file with standard Python commands as well as with codecs and cannot come up with anything that will read it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.
import codecs

readFile = codecs.open("FileName", encoding='utf-8')
The line I am trying to read is this with nothing else in it.
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UnicodeError: UTF-16 stream does not start with BOM -- I know this one means it is not a UTF-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a codec, the word comes in as Ae?tes, which then crashes later in the program. Just to be clear, none of the suggested questions, or any others anywhere on the net, have pointed to an answer. One other detail that might help: I am using OS X, not Windows.
Credit for this answer goes to RadLexus for figuring out the proper encoding and also to Mad Physicist who pointed me in the right track even if I did not consider all possible encodings.
The issue is apparently that the Mac saved the .txt file as mac_roman. If you use that encoding, it will work perfectly.
This is the line of code that I used to convert it.
readFile = codecs.open("FileName", encoding='mac_roman')
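For similar "unknown 8-bit" files, a hedged sketch of a fallback loop that tries candidate encodings in order until one decodes without error. Single-byte encodings such as mac_roman and latin-1 accept any byte sequence, so they belong after stricter candidates like utf-8:

import codecs

def read_with_fallback(path, encodings=('utf-8', 'mac_roman', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with codecs.open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings worked')

text, enc = read_with_fallback("FileName")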

load .json into python; UnicodeDecodeError

I am trying to load a JSON file into Python with no success. I have been googling a solution for the past few hours and just cannot seem to get it to load. I have tried to load it using the same json.load('filename') approach that has worked for everyone. I keep getting:
"UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 124: invalid continuation byte"
Here is the code I am using
import json

json_data = open('myfile.json')
for line in json_data:
    data = json.loads(line)  # <-- I get an error at this line.
Here is a sample line from my file
{"topic":"security","question":"Putting the Biba-LaPadula Mandatory Access Control Methods to Practise?","excerpt":"Text books on database systems always refer to the two Mandatory Access Control models; Biba for the Integrity objective and Bell-LaPadula for the Secrecy or Confidentiality objective.\n\nText books ...\r\n "}
What is my error, if this seems to have worked for everyone in every example I have googled?
Have you tried:
json.loads(line.decode("utf-8"))
Similar question asked here: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2
Edit:
If the above does not work,
json.loads(line.decode("utf-8","ignore"))
will.
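Putting it together, a minimal Python 2 sketch of the full loop with the fallback applied. Note that "ignore" silently drops undecodable bytes, so use it only when losing those characters is acceptable:

import json

json_data = open('myfile.json')
for line in json_data:
    try:
        data = json.loads(line.decode('utf-8'))
    except UnicodeDecodeError:
        # Drop bytes that are not valid UTF-8 rather than crashing.
        data = json.loads(line.decode('utf-8', 'ignore'))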

Converting ASCII output to UTF-8

I'm really close to having a script that fetches JSON from the New York Times API and then converts it to CSV. However, occasionally I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 21: ordinal not in range(128)
I think I could avoid this all together if I converted the output to UTF-8, but I am unsure how to do so. Here is my python script:
import urllib2
import json
import csv

outfile_path = '/NYTComments.csv'
writer = csv.writer(open(outfile_path, 'w'))

url = urllib2.Request('http://api.nytimes.com/svc/community/v2/comments/recent?api-key=ea7aac6c5d0723d7f1e06c8035d27305:5:66594855')
parsed_json = json.load(urllib2.urlopen(url))
print parsed_json

for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(str(comment['commentBody']))
    row.append(str(comment['commentTitle']))
    row.append(str(comment['approveDate']))
    writer.writerow(row)
A few things...
I don't know anything about the New York Times API, but I would guess you probably shouldn't publish a code snippet with your "api-key". Just a guess on this point (I've never used this API before).
If you look, the API tells you the encoding. You are getting the following back in the header:
Content-Type=application/json; charset=UTF-8
Googling "python and UnicodeEncodeError" will give you a lot of help. But here, it seems your problem is probably calling the "str" on the comments. In which case, it will use the 'ascii' codec. And if there is a char above 128, then boom. You get the error you are seeing. Here is a pretty good blog post on the topic. It might help you to read over it.
Edit: This solution works for me:
for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(comment['commentBody'].encode('UTF-8', 'replace'))
    row.append(comment['commentTitle'].encode('UTF-8', 'replace'))
    row.append(str(comment['approveDate']))
    writer.writerow(row)
Replace the second and third calls to str() with unicode().
for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(unicode(comment['commentBody']))
    row.append(unicode(comment['commentTitle']))
    row.append(str(comment['approveDate']))
    writer.writerow(row)
