How to extract a string from a Unicode JSON object in Python?

I'm getting the error below when I try to parse a string containing Unicode characters such as the ' symbol, emojis, etc.:
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f33b' in position 19: ordinal not in range(128)
Sample Object:
{"user":{"name":"\u0e2a\u0e31\u0e48\u0e07\u0e14\u0e48\u0e27\u0e19 \u0e2b\u0e21\u0e14\u0e44\u0e27 \u0e40\u0e14\u0e23\u0e2a\u0e41\u0e1f\u0e0a\u0e31\u0e48\u0e19\u0e21\u0e32\u0e43\u0e2b\u0e21\u0e48 \u0e23\u0e32\u0e04\u0e32\u0e40\u0e1a\u0e32\u0e46 \u0e2a\u0e48\u0e07\u0e17\u0e31\u0e48\u0e27\u0e44\u0e17\u0e22 \u0e44\u0e14\u0e49\u0e02\u0e2d\u0e07\u0e0a\u0e31\u0e27\u0e23\u0e4c\u0e08\u0e49\u0e32 \u0e2a\u0e19\u0e43\u0e08\u0e15\u0e34\u0e14\u0e15\u0e48\u0e2d\u0e2a\u0e2d\u0e1a\u0e16\u0e32\u0e21 Is it","tag":"XYZ"}}
I'm able to extract tag value, but I'm unable to extract name value.
Here is my code:
import json
data = json.loads(json_data)  # named `data` so the built-in `dict` is not shadowed
print('Tag - ' + data['user']['tag'])
print('Name - ' + data['user']['name'])

You can save the data in CSV format, which can also be opened in Excel. When you open a file with open(filename, "w") and no encoding argument, Python uses the platform's default encoding. If that default is ASCII (or another encoding that cannot represent your characters), writing Unicode data raises UnicodeEncodeError. To store Unicode data reliably, open the file with UTF-8 encoding.
mydict = json.loads(json_data) # or whatever dictionary it is...
# Open the file with UTF-8 encoding, most important step
with open("userdata.csv", "w", encoding='utf-8') as f:
    f.write(mydict['user']['name'] + ", " + mydict['user']['tag'] + "\n")
Feel free to change the code based on the data you have.
That's it...
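As a minimal sketch of the approach above, assuming Python 3: the shortened `json_data` sample and the `write_user_csv` helper are illustrative and not from the question. The `csv` module handles quoting, while the file itself is opened as UTF-8.

```python
import csv
import json
import os
import tempfile

# Shortened, hypothetical sample in the same shape as the question's object.
# The raw string keeps the \u escapes intact so that json.loads decodes them.
json_data = r'{"user": {"name": "\u0e2a\u0e31\u0e48\u0e07 \ud83c\udf3b", "tag": "XYZ"}}'


def write_user_csv(user, path):
    """Write one name/tag row to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow([user["name"], user["tag"]])


user = json.loads(json_data)["user"]
path = os.path.join(tempfile.mkdtemp(), "userdata.csv")
write_user_csv(user, path)

# Read the file back with the same encoding to check the round trip.
with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
```

Note that `print(user["name"])` can still raise UnicodeEncodeError if the console encoding is ASCII; setting the `PYTHONIOENCODING=utf-8` environment variable is one possible workaround.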

Related

JSON import in Python

I would like to import the JSON file located at "https://www.drivy.com/cars/458342/reviews?page=1&paginate_per=6&rel=next" into Python.
When I run this:
with open('C:/Users/coppe/Documents/py trials/eval.json') as json_file:
    reviews = json.load(json_file)
I get an error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6776: character maps to <undefined>
This error is caused by a special character contained in the value of the html key. Knowing that this character is an emoticon (a thumbs-up), how can I still import my JSON and ignore it?
You need to tell open() which encoding to use when decoding the file. Most JSON files are UTF-8, so use something like:
reviews = json.load(
    open("C:/Users/coppe/Documents/py trials/eval.json", encoding="utf8")
)
or
# The encoding belongs on open(); json.load() no longer accepts an
# encoding argument (it was removed in Python 3.9).
with open('C:/Users/coppe/Documents/py trials/eval.json', encoding="utf8") as json_file:
    reviews = json.load(json_file)
Good Luck!
Use:
open(json_file, encoding="utf8")
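The fix can be checked with a self-contained sketch, assuming Python 3. The temporary file stands in for eval.json, and its "html" content is made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical stand-in for eval.json: UTF-8 JSON containing an emoji.
path = os.path.join(tempfile.mkdtemp(), "eval.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"html": "Great car \U0001f44d"}, f, ensure_ascii=False)

# Passing encoding= to open() makes decoding independent of the
# platform default (cp1252, reported as "charmap", on many Windows setups).
with open(path, encoding="utf-8") as json_file:
    reviews = json.load(json_file)

# The thumbs-up emoji U+1F44D ends in byte 0x8d when encoded as UTF-8.
thumb_bytes = "\U0001f44d".encode("utf-8")
```

The last byte of the encoded emoji is 0x8d, which cp1252 leaves undefined. That is consistent with the "can't decode byte 0x8d" message in the question, although the actual file may of course contain a different character.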

Error in outputting CSV with Django?

I am trying to output my model as a CSV file. It works fine when the model holds a small amount of data, but it is very slow with large data. In addition, I get an error when outputting the model as CSV. The logic I am using is:
import csv
from django.http import HttpResponse
def some_view(request):
    # Create the HttpResponse object with the appropriate CSV header.
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="news.csv"'
    writer = csv.writer(response)
    news_obj = News.objects.using('cms').all()
    for item in news_obj:
        #writer.writerow([item.newsText])
        writer.writerow([item.userId.name])
    return response
and the error I am facing is:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-6: ordinal not in range(128)
It further says:
The string that could not be encoded/decoded was: عبدالله الحذ
Replace line
writer.writerow([item.userId.name])
with:
writer.writerow([item.userId.name.encode('utf-8')])
Before saving a Unicode string to a file in Python 2, you must encode it. Most systems use UTF-8 by default, so it's a safe choice.
The error shows that the CSV content is being written with the ASCII codec. Another option is to encode the text as ASCII and drop the characters that cannot be represented:
>>> u'aあä'.encode('ascii', 'ignore')
'a'
Dropping the non-ASCII characters avoids the error, at the cost of losing them from the output:
writer.writerow([item.userId.name.encode('ascii', 'ignore')])
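On Python 3 the `csv` module writes text rather than bytes, so the explicit `.encode()` calls above are not needed there (and would put `b'...'` literals into the file). A minimal sketch, assuming Python 3 and using a hypothetical `names_to_csv_bytes` helper with an in-memory buffer in place of the Django response:

```python
import csv
import io


def names_to_csv_bytes(names):
    """Return the names as UTF-8 encoded CSV, one name per row."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for name in names:
        writer.writerow([name])  # plain str; no per-field encoding
    # Encode once, at the boundary where text becomes bytes.
    return buffer.getvalue().encode("utf-8")


csv_bytes = names_to_csv_bytes(["عبدالله", "Alice"])
```

Encoding once at the output boundary keeps the rest of the code working with ordinary strings, and no characters are lost.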

Python how to "ignore" ascii text?

I'm trying to scrape some text off a page using Selenium, but some of the text contains non-ASCII characters, so I get this:
f.write(database_text.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1462: ordinal not in range(128)
I was wondering, is there any way to simply ignore the non-ASCII characters?
Thanks!
print("â")
I'm not looking to write it in my text file, but ignore it.
note: It's not just "â" it has other chars like that also.
window_before = driver.window_handles[0]
nmber_one = 1
f = open(str(unique_filename) + ".txt", 'w')
for i in range(5, 37):
    time.sleep(3)
    driver.find_element_by_xpath("""/html/body/center/table[2]/tbody/tr[2]/td/table/tbody/tr""" + "[" + str(i) + "]" + """/td[2]/a""").click()
    time.sleep(3)
    driver.switch_to.window(driver.window_handles[nmber_one])
    nmber_one = nmber_one + 1
    database_text = driver.find_element_by_xpath("/html/body/pre")
    f = open(str(unique_filename) + ".txt", 'w')
    f.write(database_text.text)
    driver.switch_to.window(window_before)
import uuid
import io
unique_filename = uuid.uuid4()
which generates a new filename, well it should anyway, it worked before.
The problem is that some of the text is not ASCII. database_text.text is likely Unicode text (you can run print type(database_text.text) to verify) and contains non-English text. If you are on Windows, it may be "codepage" text, which depends on how your user account is configured.
Often, one wants to store text like this as UTF-8, so open your output file accordingly:
import io
text = u"â"
with io.open('somefile.txt', 'w', encoding='utf-8') as f:
    f.write(text)
If you really do want to drop the non-ASCII characters from the file completely, you can set up an error policy:
text = u"ignore funky â character"
with io.open('somefile.txt', 'w', encoding='ascii', errors='ignore') as f:
    f.write(text)
In the end, you need to choose what representation you want to use for non-ascii (roughly speaking, non-English) text.
A try/except block would also work, though it skips the whole write whenever the text contains a non-ASCII character:
try:
    f.write(database_text.text)
except UnicodeEncodeError:
    pass
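The available error policies differ in what they leave behind. The sketch below, assuming Python 3, compares three of the standard handlers on the sample string from the answer above; the variable names are illustrative.

```python
text = "ignore funky â character"

# "ignore" silently drops characters that ASCII cannot represent.
dropped = text.encode("ascii", errors="ignore")

# "replace" substitutes a question mark for each such character.
replaced = text.encode("ascii", errors="replace")

# "backslashreplace" keeps the information as a Python-style escape.
escaped = text.encode("ascii", errors="backslashreplace")

print(dropped)   # b'ignore funky  character'
print(replaced)  # b'ignore funky ? character'
print(escaped)   # b'ignore funky \\xe2 character'
```

Only the affected characters are touched, so the surrounding text is still written.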

How can I get my Python to parse the following text?

I have a sample of the text:
"PROTECTING-ħarsien",
I'm trying to parse with the following
import csv, json
with open('./dict.txt') as maltese:
    entries = maltese.readlines()
    for entry in entries:
        tokens = entry.replace('"', '').replace(",", "").replace("\r\n", "").split("-")
        if len(tokens) == 1:
            pass
        else:
            print tokens[0] + "," + unicode(tokens[1])
But I'm getting an error message
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
What am I doing wrong?
It appears that dict.txt is UTF-8 encoded (ħ is 0xc4 0xa7 in UTF-8).
You should open the file as UTF-8, then:
import codecs
with codecs.open('./dict.txt', encoding="utf-8") as maltese:
    # etc.
You will then have Unicode strings instead of bytestrings to work with; you therefore don't need to call unicode() on them, but you may have to re-encode them to the encoding of the terminal you're outputting to.
You have to change your last line to (this has been tested to work on your data):
print tokens[0] + "," + unicode(tokens[1], 'utf8')
If you don't have that utf8, Python assumes that the source is ascii encoding, hence the error.
See http://docs.python.org/2/howto/unicode.html#the-unicode-type
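For comparison, a Python 3 sketch of the same parsing. The `parse_entries` helper is hypothetical, and splitting only on the first hyphen is a design choice that differs slightly from the question's code.

```python
def parse_entries(lines):
    """Split lines such as '"PROTECTING-ħarsien",' into (english, maltese) pairs."""
    pairs = []
    for line in lines:
        cleaned = line.replace('"', "").replace(",", "").strip()
        # Split on the first hyphen only, so hyphenated translations survive.
        tokens = cleaned.split("-", 1)
        if len(tokens) == 2:
            pairs.append((tokens[0], tokens[1]))
    return pairs


pairs = parse_entries(['"PROTECTING-ħarsien",\n', "no separator here\n"])

# ħ (U+0127) is the two-byte sequence 0xc4 0xa7 in UTF-8.
h_bar_bytes = "ħ".encode("utf-8")
```

With a real file, the lines would come from `open('./dict.txt', encoding='utf-8')`, which yields already decoded strings, so no `unicode()` call is needed.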

UnicodeEncodeError when reading pdf with pyPdf

I posted an earlier question about the pyPdf Python tool. Please don't mark this as a duplicate, as I now get the error shown below.
import sys
import pyPdf
def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content
# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()
# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
I get this error for the first PDF file:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
and the following error for this pdf http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
How can I resolve this?
I tried it myself and got the same result. Ignore my comment, I hadn't seen that you're writing the output to a file as well. This is the problem:
f.write(convertPdf2String(sys.argv[1]))
As convertPdf2String returns a Unicode string, but file.write can only write bytes, the call to f.write tries to automatically convert the Unicode string using ASCII encoding. As the PDF obviously contains non-ASCII characters, that fails. So it should be something like
f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
EDIT:
The working source code, only one line changed.
# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf
def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content
# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()
# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
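The two encoding choices from the answer can be compared in a small Python 3 sketch. It does not use pyPdf: the `collapse_whitespace` helper and the sample text are hypothetical stand-ins for the extracted PDF content.

```python
import os
import tempfile


def collapse_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to single spaces."""
    return " ".join(text.replace("\xa0", " ").split())


# Hypothetical extracted text standing in for the PDF content.
content = collapse_whitespace("fran\xe7ais \xa0 \u0939\u093f\u0928\u094d\u0926\u0940 \n")

# Option 1: let the file object encode the text as UTF-8 (lossless).
path = os.path.join(tempfile.mkdtemp(), "a.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(content)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Option 2: stay within ASCII by writing numeric character references.
ascii_only = content.encode("ascii", "xmlcharrefreplace")
```

UTF-8 preserves the text as written, while `xmlcharrefreplace` produces output such as `&#231;` that is only readable again after an HTML or XML parser has decoded it.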
