UnicodeEncodeError when reading pdf with pyPdf - python

I posted an earlier question about the pyPdf Python tool. Please don't mark this as a duplicate, because this time I get the error shown below.
import sys
import pyPdf
def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content
# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]))
f.close()
# or print contents to the standard out stream
print convertPdf2String("/home/tom/Desktop/Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
I get this error for the first PDF file:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
and the following error for this PDF, http://www.envis-icpe.com/pointcounterpointbook/Hindi_Book.pdf:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 38: ordinal not in range(128)
How can I resolve this?

I tried it myself and got the same result. Ignore my comment, I hadn't seen that you're writing the output to a file as well. This is the problem:
f.write(convertPdf2String(sys.argv[1]))
convertPdf2String returns a Unicode string, but in Python 2 file.write can only write bytes, so the call to f.write implicitly encodes the Unicode string with the ASCII codec. The PDF contains non-ASCII characters, so that conversion fails. Encode the string explicitly instead, for example:
f.write(convertPdf2String(sys.argv[1]).encode("utf-8"))
# or
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
EDIT:
The working source code, only one line changed.
# Execute with "Hindi_Book.pdf" in the same directory
import sys
import pyPdf
def convertPdf2String(path):
    content = ""
    # load PDF file
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # iterate pages
    for i in range(0, pdf.getNumPages()):
        # extract the text from each page
        content += pdf.getPage(i).extractText() + " \n"
    # collapse whitespaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content
# convert contents of a PDF file and store result to TXT file
f = open('a.txt','w+')
f.write(convertPdf2String(sys.argv[1]).encode("ascii", "xmlcharrefreplace"))
f.close()
# or print contents to the standard out stream
print convertPdf2String("Hindi_Book.pdf").encode("ascii", "xmlcharrefreplace")
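The explicit-encoding idea in this answer can be tried without a PDF at hand. The following is only an illustrative sketch, not part of the original answer: the sample string and the write_text helper are made up for the example, and io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io

def write_text(path, text, encoding="utf-8"):
    """Hypothetical helper: write a unicode string with an explicit encoding."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)

# Made-up sample: a Hindi word followed by the c-cedilla (U+00E7) from the traceback.
sample = u"\u0939\u093f\u0928\u094d\u0926\u0940 \xe7"

# Explicit encodings succeed where the implicit ASCII conversion fails.
as_utf8 = sample.encode("utf-8")
as_xml = sample.encode("ascii", "xmlcharrefreplace")

try:
    sample.encode("ascii")  # roughly what the implicit conversion attempts
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Writing UTF-8 keeps the text readable in an editor, while xmlcharrefreplace produces pure ASCII with numeric character references such as &#231; in place of each non-ASCII character.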

Related

How to read a non-english text in file and write a non-english text to another file

I need to read non-English text from a text file line by line, translate it to English, and write the result to another text file. I also need the reverse: read English text from a text file, translate it to a non-English language (whichever language is required), and save the translated text to a new text file.
I installed googletrans 3.1.0a0 for the translation.
The reading and writing part fails when the text is non-English.
My code:
import googletrans
from googletrans import Translator
translator = Translator(service_urls=['translate.googleapis.com'])
with open('Tobetranslated.txt', 'r') as f:
    with open('output.txt', 'w') as w:
        f_contents = str(f.readline())
        while len(f_contents) > 0:
            print(f_contents, end="")
            translated = translator.translate(f_contents, src='ko', dest='en')
            print(translated.text)
            w.write(str(translated.text) + "\n")
            f_contents = f.readline()
The error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4: character maps to <undefined>
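The 'charmap' codec in this traceback suggests that open() decoded the file with a Windows code page rather than the encoding the file was saved in. A minimal sketch of one way around that, assuming the input file is saved as UTF-8 (the question does not say so), is to pass encoding explicitly when opening both files. The translate_file helper and its translate_line parameter are hypothetical names; translate_line stands in for whatever call produces the translated text.

```python
def translate_file(src_path, dst_path, translate_line, encoding="utf-8"):
    """Read src_path line by line, translate each line, write results to dst_path.

    translate_line is any callable that takes a str and returns a str
    (a hypothetical stand-in for the real translator call).
    """
    with open(src_path, "r", encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            text = line.rstrip("\n")
            if text:  # skip empty lines instead of sending them to the translator
                dst.write(translate_line(text) + "\n")
```

With the translator object from the question, the call might look like translate_file('Tobetranslated.txt', 'output.txt', lambda s: translator.translate(s, src='ko', dest='en').text).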

How to convert UTF-8 list to String in python

I need to save BeautifulSoup results to a .txt file. Converting each result to a string with str() did not work, because the results contain non-ASCII (UTF-8) text:
# -*- coding: utf-8 -*-
page_content = soup(page.content, "lxml")
links = page_content.select('h3', class_="LC20lb")
for link in links:
    with open("results.txt", 'a') as file:
        file.write(str(link) + "\n")
and I get this error:
File "C:\Users\omido\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 183-186: character maps to <undefined>
If you want to write to the file as UTF-8 as well, you’ll need to specify that:
with open("results.txt", 'a', encoding='utf-8') as file:
    file.write(str(link) + "\n")
and it’s a good idea to only open the file once:
with open("results.txt", 'a', encoding='utf-8') as file:
    for link in links:
        file.write(str(link) + "\n")
(You can also print(link, file=file).)

How to extract String from a Unicoded JSONObject in Python?

I get the error below when I try to parse a string containing Unicode characters such as the ' symbol and emojis:
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f33b' in position 19: ordinal not in range(128)
Sample Object:
{"user":{"name":"\u0e2a\u0e31\u0e48\u0e07\u0e14\u0e48\u0e27\u0e19 \u0e2b\u0e21\u0e14\u0e44\u0e27 \u0e40\u0e14\u0e23\u0e2a\u0e41\u0e1f\u0e0a\u0e31\u0e48\u0e19\u0e21\u0e32\u0e43\u0e2b\u0e21\u0e48 \u0e23\u0e32\u0e04\u0e32\u0e40\u0e1a\u0e32\u0e46 \u0e2a\u0e48\u0e07\u0e17\u0e31\u0e48\u0e27\u0e44\u0e17\u0e22 \u0e44\u0e14\u0e49\u0e02\u0e2d\u0e07\u0e0a\u0e31\u0e27\u0e23\u0e4c\u0e08\u0e49\u0e32 \u0e2a\u0e19\u0e43\u0e08\u0e15\u0e34\u0e14\u0e15\u0e48\u0e2d\u0e2a\u0e2d\u0e1a\u0e16\u0e32\u0e21 Is it","tag":"XYZ"}}
I'm able to extract tag value, but I'm unable to extract name value.
Here is my code:
dict = json.loads(json_data)
print('Tag - ' + dict['user']['tag'])
print('Name - ' + dict['user']['name'])
You can save the data in CSV format, which can also be opened in Excel. When you open a file with open(filename, "w"), Python uses the platform's default encoding, which is often ASCII or a legacy code page such as cp1252; writing characters outside that encoding raises UnicodeEncodeError. To store Unicode data, open the file with UTF-8 encoding.
mydict = json.loads(json_data) # or whatever dictionary it is...
# Open the file with UTF-8 encoding, most important step
f = open("userdata.csv", "w", encoding='utf-8')
f.write(mydict['user']['name'] + ", " + mydict['user']['tag'] + "\n")
f.close()
Feel free to change the code based on the data you have.
That's it...
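The answer above builds the CSV line by hand, which can produce a broken row if the name itself contains a comma. As a hedged alternative sketch, the standard csv module can do the quoting. The save_user_csv helper and the shortened sample JSON are made up for illustration; only the user/name/tag structure comes from the question.

```python
import csv
import json

# Shortened, made-up sample in the same shape as the object in the question.
json_data = '{"user": {"name": "\\u0e2a\\u0e31\\u0e48\\u0e07, Is it", "tag": "XYZ"}}'

def save_user_csv(path, json_text):
    """Hypothetical helper: write the user's name and tag as one CSV row."""
    user = json.loads(json_text)["user"]
    # newline="" lets the csv module control line endings;
    # encoding="utf-8" allows Thai text and emoji to be written.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerow([user["name"], user["tag"]])
```

Some Excel versions only detect UTF-8 when the file starts with a byte order mark; if the text shows up garbled there, encoding="utf-8-sig" may be worth trying instead.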

How to save an encoded pdf in a zip file using django?

I've read some posts about this problem, but most of them didn't help in my case. I'm trying to save an encoded PDF in a zip file (I'm using the DocRaptor API for the PDF generation, which returns the encoded PDF).
def toZip(request, ...):
    ...
    response = docraptor_api_call() #api call to generate pdf (encoded pdf)
    with open('creation.pdf', 'wb') as f:
        f.write(response)
    #decode pdf
    with open(f.name, 'rb') as pdf:
        # this will download the pdf to the user
        # doc = HttpResponse(pdf.read(), content_type='application/pdf')
        # doc['Content-Disposition'] = "attachment; filename=filename.pdf"
        # return doc
        zip_io = io.BytesIO()
        # create zipFile
        zf = zipfile.ZipFile(zip_io, mode='w')
        # write PDF in ZIP ?
        save_zf = zf.write(pdf.read())
        # save zip to FileField
        zip = ZipStore.objects.create(zip=save_zf)
When I run the code above I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 43: character maps to <undefined>
I don't really understand what I'm doing wrong or how I should fix it. Any suggestions?
You've got an error in the way you're calling zf.write. You should be using:
# ZipFile.write takes the path of a file to add, not the bytes to be written.
# With writestr, f.name is the name of the file in the zip archive. So if I passed
# in "foo.txt", "1", I'd get a file named `foo.txt` after decompressing, and its
# contents would be 1
zf.writestr(f.name, pdf.read())
This method does not appear to return anything, so you'll need to change zip = ZipStore.objects.create(zip=save_zf), probably to:
zip = ZipStore.objects.create(zip=zip_io)
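To see the writestr approach in isolation, here is a small standard-library sketch that builds the archive in memory. The zip_bytes helper and the placeholder payload are hypothetical; in the question the payload would be the bytes returned by the PDF API.

```python
import io
import zipfile

def zip_bytes(member_name, payload):
    """Hypothetical helper: return an in-memory zip archive holding one file."""
    buffer = io.BytesIO()
    # Leaving the with block closes the archive, which writes the zip's
    # central directory; without that the archive is incomplete.
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(member_name, payload)
    buffer.seek(0)  # rewind so callers can read the archive from the start
    return buffer

archive = zip_bytes("creation.pdf", b"%PDF-1.4 placeholder bytes")
```

Note that this skips the temporary creation.pdf file entirely, since writestr accepts the bytes directly. Storing the buffer in a Django FileField usually requires wrapping its contents in a file object first (for example with django.core.files.base.ContentFile); that step depends on how ZipStore is defined and is not shown here.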

Python how to "ignore" ascii text?

I'm trying to scrape some text off a page using Selenium, but some of the text contains non-ASCII characters, so I get this:
f.write(database_text.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1462: ordinal not in range(128)
I was wondering, is there any way to simply ignore the non-ASCII characters?
Thanks!
print("â")
I'm not looking to write it to my text file, but to ignore it.
Note: it's not just "â"; the text has other characters like that too.
window_before = driver.window_handles[0]
nmber_one = 1
f = open(str(unique_filename) + ".txt", 'w')
for i in range(5, 37):
    time.sleep(3)
    driver.find_element_by_xpath("""/html/body/center/table[2]/tbody/tr[2]/td/table/tbody/tr""" + "[" + str(i) + "]" + """/td[2]/a""").click()
    time.sleep(3)
    driver.switch_to.window(driver.window_handles[nmber_one])
    nmber_one = nmber_one + 1
    database_text = driver.find_element_by_xpath("/html/body/pre")
    f = open(str(unique_filename) + ".txt", 'w',)
    f.write(database_text.text)
    driver.switch_to.window(window_before)
import uuid
import io
unique_filename = uuid.uuid4()
This generates a new filename, or at least it should; it worked before.
The problem is that some of the text is not ASCII. database_text.text is likely Unicode text (you can run print type(database_text.text) to verify) and contains non-English characters. If you are on Windows, it may be "code page" text, which depends on how your user account is configured.
Often, one wants to store text like this as UTF-8, so open your output file accordingly:
import io
text = u"â"
with io.open('somefile.txt', 'w', encoding='utf-8') as f:
    f.write(text)
If you really do want to drop the non-ASCII characters from the file completely, you can set up an error policy:
text = u"ignore funky â character"
with io.open('somefile.txt', 'w', encoding='ascii', errors='ignore') as f:
    f.write(text)
In the end, you need to choose what representation you want to use for non-ascii (roughly speaking, non-English) text.
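To make that choice concrete, here is a short sketch comparing a few of the standard error handlers on the sample string from this answer. The variable names are made up for illustration; the handler names ("ignore", "replace", "backslashreplace") are the ones built into Python's codecs.

```python
# -*- coding: utf-8 -*-
# U+00E2 is the "â" character used in the answer's sample text.
text = u"ignore funky \xe2 character"

dropped = text.encode("ascii", "ignore")            # character removed
replaced = text.encode("ascii", "replace")          # character becomes "?"
escaped = text.encode("ascii", "backslashreplace")  # character becomes a backslash escape
as_utf8 = text.encode("utf-8")                      # character kept as two bytes
```

Only the UTF-8 version can be decoded back to the original text; the other three lose or alter the character.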
A try/except block would also work, although it skips the entire write whenever any character in the text cannot be encoded:
try:
    f.write(database_text.text)
except UnicodeEncodeError:
    pass
