Converting ASCII output to UTF-8 - python

I'm really close to having a script that fetches JSON from the New York Times API and then converts it to CSV. However, occasionally I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in
position 21: ordinal not in range(128)
I think I could avoid this altogether if I converted the output to UTF-8, but I am unsure how to do so. Here is my Python script:
import urllib2
import json
import csv
outfile_path='/NYTComments.csv'
writer = csv.writer(open(outfile_path, 'w'))
url = urllib2.Request('http://api.nytimes.com/svc/community/v2/comments/recent?api-key=ea7aac6c5d0723d7f1e06c8035d27305:5:66594855')
parsed_json = json.load(urllib2.urlopen(url))
print parsed_json
for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(str(comment['commentBody']))
    row.append(str(comment['commentTitle']))
    row.append(str(comment['approveDate']))
    writer.writerow(row)

A few things...
I don't know anything about the New York Times API, but I would guess you probably shouldn't publish a code snippet with your "api-key". Just a guess on this point (I've never used this API before).
If you look, the API tells you the encoding. You are getting the following back in the header:
Content-Type=application/json; charset=UTF-8
Googling "python and UnicodeEncodeError" will give you a lot of help. But here, your problem is probably the calls to str() on the comments: str() uses the 'ascii' codec, and if there is a character above 128, then boom, you get the error you are seeing. Here is a pretty good blog post on the topic; it might help you to read over it.
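For illustration, here is a minimal Python 2 sketch of that failure mode (the comment text is made up):
comment_body = u'\u201cGreat piece\u201d'   # curly quotes, outside ASCII
try:
    str(comment_body)                       # implicit ascii encode
except UnicodeEncodeError as e:
    print e                                 # 'ascii' codec can't encode character u'\u201c' ...
print comment_body.encode('utf-8')          # explicit UTF-8 encoding returns a byte string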
Edit: This solution works for me:
for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(comment['commentBody'].encode('UTF-8', 'replace'))
    row.append(comment['commentTitle'].encode('UTF-8', 'replace'))
    row.append(str(comment['approveDate']))
    writer.writerow(row)

Replace the second and third calls to str() with unicode().
for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(unicode(comment['commentBody']))
    row.append(unicode(comment['commentTitle']))
    row.append(str(comment['approveDate']))
    writer.writerow(row)

Related

decoding issue while parsing JSON [python]

I am reading a JSON file in Python which has lots of fields and values (~8000 records).
Environment: Windows 10, Python 3.6.4.
code:
import json
json_data = json.load(open('json_list.json'))
print (json_data)
With this I get an error. Below is the stack trace:
json_data = json.load(open('json_list.json'))
File "C:\Program Files (x86)\Python36-32\lib\json\__init__.py", line 296, in load
return loads(fp.read(),
File "C:\Program Files (x86)\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7977319: character maps to <undefined>
Along with this I have tried
import json
with open('json_list.json', encoding='utf-8') as fd:
    json_data = json.load(fd)
print (json_data)
With this, my program runs for a long time and then hangs with no output.
I have searched almost all topics related to this and could not find a solution.
Note: The JSON data is a valid one as when I see it on Postman/any REST client it doesn't report any anomalies.
Any help on this, or an alternative way to load my JSON data (for example by converting it to a string and back to JSON), would be greatly appreciated.
Here is what the file looks like around the reported error:
>>> from pprint import pprint
>>> f = open('C:/Users/c5242046/Desktop/test2/dblist_rest.json', 'rb')
>>> f.seek(7977319)
7977319
>>> pprint(f.read(100))
(b'\x81TICA EL ABGEN INGL\xc3\x83\xc2\x89S, S.A.","memory_size_gb":"64","since'
b'":"2017-04-10","storage_size_gb":"84.747')
The snippet you are asking about seems to have been double-encoded. Basically, whatever originally generated this data produced text in Latin-1 or some related encoding (Windows code page 1252?). It was then fed to a process which converts Latin-1 to UTF-8 ... twice.
Of course, "converting" data which is already UTF-8 but telling the computer that it's Latin-1 just produces mojibake.
The string INGL\xc3\x83\xc2\x89S suggests this analysis, if you can guess that it is supposed to say Inglés in upper case, and realize that the UTF-8 encoding for É is \xC3 \x89 and then examine which characters these two bytes encode in Latin-1 (or, as it happens, Unicode, which is a superset of Latin-1, though they are not compatible on the encoding level).
Notice that being able to guess which string a problematic sequence is supposed to represent is the crucial step here; it also explains why including a representative snippet of the problematic data - with enough context! - is vital for debugging.
Anyway, if the entire file has the same symptom, you should be able to undo the second, superfluous and incorrect round of re-encoding; though an error this far into the file makes me imagine it's probably a local problem with just one or a few records. Maybe they were merged from multiple input files, only one of which had this error. Then fixing it requires a fair bit of detective work, and manual editing, or identifying and fixing the erroneous source. A quick and dirty workaround is to simply manually remove any erroneous records.
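If it does turn out that the whole file (or a whole record) has this symptom, the repair is to undo one round of conversion: decode as UTF-8, map the characters back to the Latin-1 bytes they came from, then decode as UTF-8 again. A small sketch using the byte string from your snippet (Python 3, matching your environment):
mangled = b'INGL\xc3\x83\xc2\x89S'
fixed = mangled.decode('utf-8').encode('latin-1').decode('utf-8')
print(fixed)   # INGLÉS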

GAE Python - CSV Writer gives ascii codec error

I have a Google App Engine app that stores data in a number of languages, such as German and Russian. To do so, I need to store the strings as UTF-8, which luckily is done automatically since I use the webapp2 request handler. As a beginning programmer, this allowed me to avoid the complications of encoding and decoding the data.
However, now I am trying to write the contents to a CSV file, and it seems that for csv.writer the encoding has to be dealt with explicitly anyway. I am not sure whether I should be decoding or encoding, but currently the error I get is the following:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
The code I am using is:
import csv, webapp2, codecs

class AdminShopExport(webapp2.RequestHandler):
    def get(self):
        shops = Shop.all()
        shops.order('name')
        self.response.headers['Content-Type'] = 'application/csv'
        writer = csv.writer(self.response.out)
        writer.writerow(["id", "name", "domain", "category"])
        for shop in shops:
            writer.writerow([shop.keyname, shop.name, shop.url, shop.category])
Regarding the contents, the error is currently coming from a category which is given in Russian. However as said, any of these fields could contain a UTF-8 character. What would be the best way to handle this? Thanks for your help!
The Python 2.x csv module does not support Unicode. Try this:
https://github.com/jdunck/python-unicodecsv
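As a rough sketch of how that would look (not tested against your app; unicodecsv mirrors the stdlib csv API but accepts unicode rows, and the encoding argument here is taken from the project's README):
import unicodecsv

# A standalone example; in the handler above you would pass self.response.out
# as the file object instead of opening a file.
with open('/tmp/shops.csv', 'wb') as f:
    writer = unicodecsv.writer(f, encoding='utf-8')
    writer.writerow([u"id", u"name", u"domain", u"category"])
    writer.writerow([u"1", u"Книжный магазин", u"example.ru", u"Книги"])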

International characters in Python

I'm currently working on a Python script that takes a list of log files (from a search engine) and produces a file with all the queries within these, for later analysis.
Another feature of the script is that it removes the most common words, which I've also implemented, but I've faced a problem I can't seem to overcome. The removal of words works as intended, as long as the queries do not contain special characters. As the search logs are in Danish, the characters æ, ø and å appear regularly.
Searching on the topic I'm now aware that I need to encode these into UTF-8, which I'm doing when obtaining the query:
tmp = t_query.encode("UTF-8").lower().split()
t_query is the query and I split it up to later compare each word with my list of forbidden words. If I do not use the encoding I'll get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 1: ordinal not in range(128)
Edit: I also tried using the decode instead, but get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa7' in position 3: ordinal not in range(128)
I loop through the words like this:
for i in tmp:
    if i in words_to_filter:
        tmp.remove(i)
As said, this works perfectly for words not including special characters. I've tried printing i along with the current forbidden word and get e.g.:
færdelsloven - færdelsloven
where the first word is the i-th element in tmp and the last word is the one from the forbidden words. Obviously something has gone wrong, but I just can't manage to find a solution. I've tried many suggestions found on Google and on here, but nothing has worked so far.
Edit 2: if it makes a difference, I've tried loading the log files both with and without the use of codecs:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
I'm running Python 2.7.2 from a Windows environment, if it matters. The script should be able to run on other platforms (namely Linux and Mac OS).
I would really appreciate if one of you are able to help me out.
Best regards
Casper
If you are reading files, you want to decode them.
tmp = t_query.decode("UTF-8").lower().split()
Given a utf-8 file with json object per line, you could read all objects:
with open(filename) as file:
    jlogs = [json.loads(line) for line in file]
Except for the treatment of embedded newlines, the above code should produce the same result as yours:
with codecs.open(file_name, "r", "utf-8") as f_src:
    jlogs = map(json.loads, f_src.readlines())
At this point all strings in jlogs are Unicode; you don't need to do anything special to handle "special" characters. Just make sure you are not mixing bytes and Unicode text in your code.
To get Unicode text from bytes: some_bytes.decode(character_encoding)
To get bytes from Unicode text: some_text.encode(character_encoding)
Don't encode bytes or decode Unicode.
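A tiny Python 2 illustration of that round trip, using the UTF-8 bytes of one of your Danish words:
raw = 'f\xc3\xa6rdelsloven'         # bytes as read from the log file
text = raw.decode('utf-8')          # bytes -> unicode text
assert text == u'f\xe6rdelsloven'   # u'færdelsloven'
back = text.encode('utf-8')         # unicode text -> bytes
assert back == raw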
If the encoding is right and you just want to ignore unexpected characters, you could pass errors='ignore' or errors='replace' to the codecs.open function.
with codecs.open(file_name, encoding='utf-8', mode='r', errors='ignore') as f:
    jlogs = map(json.loads, f.readlines())
Details in docs:
http://docs.python.org/2/howto/unicode.html#reading-and-writing-unicode-data
I've finally solved it. As Lattyware said, Python 3.x seems to handle this much better. After changing versions and saving the Python file with a Unicode encoding, it works as intended.

Python Encoding\Decoding for writing to a text file

I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)
The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".
I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.
outputFile.write(kmlDescription).decode('utf-8')
Please forgive me if this is basic, I'm still learning Python (2.7).
Cheers!
EDIT1: Sample data looks something like the following:
Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.
When I add the print type(raw), I get
Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)
I will check out the suggested thread and video. Thanks folks!
Edit 3: I'm using Python 2.7
Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:
text = "Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"
textEncoded = text.encode('utf-8')
textEncodedX2 = textEncoded.encode('utf-8')
outputfile.write(textEncoded) #Works!
outputfile.write(textEncodedX2) #failed
Once I figured out I was trying to double encode, the solution was the following:
allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
It's working now, and I sure appreciate all of your help!!
My guess is that the output file you have opened was opened with a latin-1 or even utf-8 codec, so you cannot write UTF-8-encoded byte data to it, because the codec tries to re-convert it. To a normally opened file you can write any arbitrary byte string. Here is an example recreating a similar error:
import codecs

u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)
output:
Traceback (most recent call last):
File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "/usr/lib/python2.7/codecs.py", line 691, in write
return self.writer.write(data)
File "/usr/lib/python2.7/codecs.py", line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Solution:
This will work if you don't set any codec:
f = open('del.txt', 'wb')
f.write(s)
Another option is to write the Unicode string directly to the file without encoding it first, provided outputFile has been opened with the correct codec, e.g.:
f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)
Your error message doesn't actually relate to your Python syntax; rather, you're trying to decode a byte value that has no equivalent in ASCII.
Hex 0xC2 represents a Latin-1 character: an uppercase A with a circumflex. Therefore, instead of using allTheNTMs.append(contentRaw[s1:].encode("utf-8")), try:
allTheNTMs.append(contentRaw[s1:].encode("latin-1"))
I'm not an expert in Python, so this may not work, but it would appear you're trying to encode a Latin character. Given the error message you are receiving, it would appear that Python is only looking through the first 128 entries, and your error indicates that 0xC2 is out of that range, which it indeed is (the first 128 entries of UTF-8 coincide with ASCII).

How to parse unicode strings with minidom?

I'm trying to parse a bunch of XML files with the xml.dom.minidom library, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
Try to decode it:
>>> print u'abcdé'.encode('utf-8')
abcdé
>>> print u'abcdé'.encode('utf-8').decode('utf-8')
abcdé
In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
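In other words, something like this (a sketch; 'queries.xml' is a placeholder filename):
from xml.dom import minidom

# Read the raw bytes and let the parser honour the encoding declared in the file
with open('queries.xml', 'rb') as f:
    doc = minidom.parseString(f.read())

# Or simply let minidom read the file itself
doc = minidom.parse('queries.xml')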
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This wraps the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
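Incidentally, codecs.getwriter is an arguably clearer way to spell the same wrapping (a sketch; dom stands for the minidom Document being written and 'out.xml' is a placeholder path):
import codecs

with open('out.xml', 'wb') as raw:
    fh = codecs.getwriter('utf-8')(raw)   # same StreamWriter class as codecs.lookup('utf-8')[3]
    dom.writexml(fh, encoding='utf-8')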
I've encountered this error a few times, and my hacky way of dealing with it is just to do this:
def getCleanString(word):
    cleaned = ""
    for character in word:
        try:
            cleaned = cleaned + str(character)
        except UnicodeEncodeError:
            pass  # this happens if the character is non-ASCII; skip it
    return cleaned
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
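For what it's worth, the same effect can be had in a single call, assuming word is a unicode string: the 'ignore' error handler simply drops anything the ascii codec can't represent.
cleaned = word.encode('ascii', 'ignore')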
