Encoding error when printing tesseract output - python

I'm just trying to make a simple program to OCR an entire page, but I am getting an encoding error, which is something I've always had trouble fixing.
My code:
from PIL import Image
import pytesseract
text = pytesseract.image_to_string(Image.open('005.png'))
print(text)
My error:
File "c:/Users/Dylan C/Desktop/Comparitor/image.py", line 4, in <module>
print(text)
File "C:\Users\Dylan C\AppData\Local\Programs\Python\Python35\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 187: character maps to <undefined>
Sorry if this is a stupid question. I have JUST downloaded tesseract, and am no expert in programming.

As the error states, the problem is in print(text): text is a Unicode string, and the console's encoding (cp437, according to the traceback) has no mapping for the character '\u2019' (a right single quotation mark).
Search for a Windows solution to UnicodeEncodeError on print, e.g. "Python, Unicode, and the Windows console".
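A minimal sketch of one way around this, assuming it is acceptable to lose the unsupported characters: encode the text for the console with errors="replace", so that unmappable characters become "?" instead of raising. The helper name printable_for is hypothetical, and cp437 is passed explicitly here only to mirror the traceback above.

```python
import sys


def printable_for(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Fall back to the console's own encoding when none is given.
    target = encoding or sys.stdout.encoding or "ascii"
    # errors="replace" substitutes '?' for every unencodable character.
    return text.encode(target, errors="replace").decode(target)


# Stand-in for the OCR result; U+2019 is the character from the traceback.
ocr_text = "the author\u2019s note"
safe_text = printable_for(ocr_text, "cp437")
print(safe_text)
```

In the original program this would be used as print(printable_for(text)), which keeps the output readable at the cost of replacing the curly quote.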

Related

UnicodeEncodeError: 'charmap' codec can't encode character '\u2264'

I'm using Python 3.6 on Windows 7 with Django 1.9.
I got this error when I ran my code.
In my code I'm parsing XML data to write an HTML page.
I gather that some character cannot be encoded properly, which is why it throws an error.
\u2264 is the character (less than or equal to) that is the root cause of the error.
My question is how to encode this properly in Python 3.
Detailed Error Log:
Traceback (most recent call last):
File "C:\Dev\EXE\TEMP\cookie\crumbs\views.py", line 1520, in parser
html_file.write(html_text)
File "C:\Users\Cookie1\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2264' in position 389078: character maps to <undefined>
The error message indicates that you're trying to encode to the Windows-1252 character encoding. That encoding doesn't have a representation of the less than or equal symbol.
>>> "\u2264".encode("cp1252")
Traceback... [as above]
The answer is to use UTF-8, which can encode every Unicode character, instead of Windows-1252, a single-byte encoding that covers only a small set of characters.
Your question doesn't include much context, but the line html_file.write(html_text) makes me think you're using Python's file objects. The documentation for open() shows how to set the encoding, e.g.
html_file = open("file.html", mode="w", encoding="utf8")
Note that "the default encoding is platform dependent (whatever locale.getpreferredencoding() returns)", which is why you're getting Windows-1252 on Windows 7.
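A small sketch of the difference, assuming a short stand-in for html_text and a temporary directory instead of the real output path: the same write fails when the file is opened as cp1252 and succeeds when it is opened as UTF-8.

```python
import locale
import os
import tempfile

# Stand-in for the generated page; U+2264 is the character from the traceback.
html_text = "<p>x \u2264 y</p>"
out_dir = tempfile.mkdtemp()

# This is the encoding open() falls back to when none is given.
print("platform default:", locale.getpreferredencoding())

# Opening the file as cp1252 reproduces the error from the question.
try:
    with open(os.path.join(out_dir, "cp1252.html"), "w", encoding="cp1252") as html_file:
        html_file.write(html_text)
    cp1252_failed = False
except UnicodeEncodeError:
    cp1252_failed = True

# Opening it as UTF-8 accepts the same text.
utf8_path = os.path.join(out_dir, "utf8.html")
with open(utf8_path, "w", encoding="utf-8") as html_file:
    html_file.write(html_text)

# Read it back with the same encoding to confirm nothing was lost.
with open(utf8_path, encoding="utf-8") as html_file:
    round_tripped = html_file.read()
```

If the page is served to browsers, declaring the same encoding in the HTML (for example with a meta charset tag) keeps the reader and writer in agreement.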

writing to text file - 'ascii' codec can't encode character

I'm having a bit of trouble outputting words from a text-image to a .txt file.
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
text = pytesseract.image_to_string(Image.open("book_image.jpg"))
file = open("text_file","w")
file.write(text)
print(text)
The code that reads the image file and prints out the words on the image works fine. The problem comes when I try to write the text to the file; I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 366: ordinal not in range(128)
Could anyone please explain how I can convert the variable text to a string?
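Try this (the encoding argument is accepted by Python 3's built-in open(); on Python 2, which the u'\u2019' in the error message suggests, use io.open with the same arguments):
file = open("text_file", "w", encoding='utf8', errors="ignore")
Or, on Python 2 with a plain open(), encode the text before writing it:
file.write(text.encode('utf-8').strip())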
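A hedged sketch of the first suggestion using io.open, which takes the encoding argument on both Python 2 and Python 3. The sample string and the temporary path are assumptions standing in for the OCR result and the real output file.

```python
import io
import os
import tempfile

# Stand-in for the OCR result; U+2019 is the character from the error message.
text = u"the author\u2019s note"

path = os.path.join(tempfile.mkdtemp(), "text_file")

# The with block closes (and flushes) the file even if the write fails.
with io.open(path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Read the raw bytes back to see how the quote was stored.
with io.open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()
print(raw_bytes)
```

With UTF-8 as the target encoding there is normally nothing left for errors="ignore" to drop, so the sketch leaves that argument out.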

Error while importing Wikipedia using pip

I'm using Python 2.7 and trying to work with the following code
import wikipedia
input = raw_input("Question: ")
print wikipedia.summary(input)
I see this error when the code is run:
Traceback (most recent call last):
File "wik.py", line 5, in <module>
print wikipedia.summary(input)
File "C:\Anaconda2\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 38: character maps to <undefined>
How can I fix this? Thanks in advance.
Python 2 defaults to ASCII, which only maps characters between \u0000 and \u007F, and the cp437 console code page shown in your traceback has no mapping for this character either. You need to use a different encoding in order to properly output this character (\u2013 is an en dash) and many others outside of ASCII.
Using UTF-8 should work for you, and I believe this print statement will properly output text:
print wikipedia.summary(input).encode("utf8")
For more information on this, check this similar question: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3 2: ordinal not in range(128).

UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

I am trying to do OCR on an image file in Python using Tesseract-OCR.
My environment is-
Python 3.5 Anaconda on Windows Machine.
Here is the code:
from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))
The error I am getting is :
File "Anaconda3\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "Anaconda3\lib\encodings\cp1252.py", line 23 in decode
return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>
I have tried the solution mentioned here
The hack is not working
I have tried my code on Mac OS it is working.
I have looked into the pytesseract issues:
There is an open issue about this
Thanks
Hmm... something very weird is going on there.
The byte "\x81" maps to an unprintable control character in the "latin1" text encoding. However, in the "cp1252" encoding the library is using, it is explicitly left undefined, so decoding it raises an error.
What happens is that "latin1" is somewhat of a "no-op" codec, sometimes used in Python to simply translate a byte sequence to a Unicode string (the default string type in Python 3.x). The "cp1252" codec is almost the same thing, and in some contexts it is used interchangeably with latin1 - but this "\x81" code is one difference between the two. In your case, a crucial one.
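The difference between the two codecs can be checked directly. A minimal sketch, using a made-up byte string that contains 0x81:

```python
# Hypothetical sample bytes; 0x81 is the byte from the traceback.
raw = b"caf\x81"

# latin1 maps every byte to the code point with the same number.
as_latin1 = raw.decode("latin1")

# cp1252 leaves 0x81 undefined, so strict decoding raises.
try:
    raw.decode("cp1252")
    cp1252_strict_ok = True
except UnicodeDecodeError:
    cp1252_strict_ok = False

# With errors="replace" the undefined byte becomes U+FFFD instead.
as_cp1252_replaced = raw.decode("cp1252", errors="replace")

print(repr(as_latin1), cp1252_strict_ok, repr(as_cp1252_replaced))
```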
The correct thing to do is to try supplying the image_to_string function with the optional lang parameter, so that it might use the correct codec to decode your text, if it can better recognize the character it is currently exposing as "0x81". However, this might not work, as it might simply be an OCR error producing a very weird character not related to the language at all.
So, the workaround for you is to monkeypatch the "cp1252" codec so that instead of raising an error, it fills in a Unicode "unrecognized" character. One way to do that is to insert these lines before calling tesseract:
from encodings import cp1252
original_decode = cp1252.Codec.decode
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)
But please, if you can, open a bug report against the pytesseract project. My guess is they should be using "latin1" and not "cp1252" encoding at this point.

unicode error printing \u2002 using Python 3

I am getting the error that Python can't encode character \u2002 when trying to print a block of text:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 355: character maps to <undefined>
What I don't understand is that, as far as I can tell, this is a Unicode character (the EN SPACE character), so I'm not sure why it isn't printing.
For reference, the content was read in using file_content = open (file_name, encoding="utf8")
Works for me! (on a Linux terminal)
>>> print("\u2002")
The output is invisible, as it's EN SPACE.
If you are on Windows, however, you are likely using a 125x code page in your terminal and...
>>> "\u2002".encode("cp1250")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.4/encodings/cp1250.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2002' in position 0: character maps to <undefined>
There is no problem using that character in Unicode (as a unicode string in Python). But when you write it out ("print it") it needs to be encoded into an encoding. Some encodings don't support some characters. The encoding you are using to print does not support that particular character.
Probably you are using the Windows console, which typically uses a code page like 850 or 437, neither of which includes this character.
There are ways to change the Windows console code page (chcp), or you could try IDLE or some other IDE.
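To see which of the encodings mentioned in these answers can represent the character, a small check like the following can help. The helper name encodable_in is hypothetical.

```python
def encodable_in(char, encodings):
    """Map each encoding name to True if it can encode char, else False."""
    result = {}
    for name in encodings:
        try:
            char.encode(name)
            result[name] = True
        except UnicodeEncodeError:
            result[name] = False
    return result


# U+2002 is EN SPACE, the character from the question.
support = encodable_in("\u2002", ["cp437", "cp850", "cp1250", "cp1252", "utf-8"])
print(support)
```

Of the encodings listed, only UTF-8 can represent EN SPACE, which is why switching the console or the output file to UTF-8 avoids the error.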
