UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

I am trying to do OCR on an image file in Python using Tesseract OCR.
My environment is:
Python 3.5 (Anaconda) on a Windows machine.
Here is the code:
from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))
The error I am getting is:
File "Anaconda3\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "Anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>
I have tried the solution mentioned here.
The hack is not working.
I have tried my code on Mac OS and there it works.
I have looked into the pytesseract issues; there is an open issue for this.
Thanks

Hmm... something very weird is going on there.
The character "\x81" is unprintable in the "latin1" text encoding. In the "cp1252" encoding the library is using, however, it is explicitly mapped to an "undefined character".
What happens is that "latin1" is somewhat of a "no-op" codec, sometimes used in Python simply to translate a byte sequence into a Unicode string (the default string in Python 3.x). The codec "cp1252" is almost the same thing, and in some contexts it is used interchangeably with latin1 - but this "\x81" code is one difference between the two. In your case, a crucial one.
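A quick way to see the difference for yourself:
b'\x81'.decode('latin1')   # works: latin1 maps all 256 byte values
b'\x81'.decode('cp1252')   # raises UnicodeDecodeError: 0x81 is undefined in cp1252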
The correct thing to do is to try supplying the image_to_string function with the optional lang parameter, so that it might use the correct codec to decode your text - if it can better recognize what character it is exposing as "0x81". However, this might not work - it might simply be an OCR error producing a very weird character not related to the language at all.
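For example, a minimal sketch (lang='deu' is just an illustrative guess here; use the code for your text's language, and make sure the matching Tesseract language data is installed):
from PIL import Image
from pytesseract import image_to_string
# 'deu' (German) is an assumed example; substitute your actual language code.
out = image_to_string(Image.open('sample.png'), lang='deu')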
So, the workaround for you is to monkeypatch the "cp1252" codec so that instead of raising an error it fills in a Unicode "unrecognized" character - one way to do that is to insert these lines before calling tesseract:
from encodings import cp1252
# Keep a reference to the original decode method, then wrap it so that
# undecodable bytes are replaced with U+FFFD instead of raising UnicodeDecodeError.
original_decode = cp1252.Codec.decode
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)
But please, if you can, open a bug report against the pytesseract project. My guess is that they should be using the "latin1" encoding, not "cp1252", at this point.

Related

Python 2.7: Printing out a decoded string

I have a file that is called: Abrázame.txt
I want to decode this so that Python understands what this 'á' char is and will print Abrázame.txt for me.
This is the code I have in a scratch file:
import os
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\\Test\\AutoTest')[0])
print(unicode(s.decode(encoding='utf-16', errors='strict')))
The error I get from the above is:
Traceback (most recent call last):
File "C:/Users/naythan_onfri/.PyCharmCE2017.2/config/scratches/scratch_3.py", line 12, in <module>
print(unicode(s.decode(encoding='utf-16', errors='strict')))
File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x74 in position 28: truncated data
I have looked up the UTF-16 character set and it does indeed have the 'á' character in it. So why is it that this string cannot be decoded with UTF-16?
I also know that 'latin-1' will work and produce the string I'm looking for. However, since this is for an automation project, I want to ensure that any filename with any registered character can be decoded and used for other things within the project, for example:
"Opening up file explorer at the directory of the file with the file already selected."
Is looping through each of the codecs (mind you, I believe there are 93 codecs) to find whichever one can decode the string the best way of getting the result I'm looking for? I figure there is something far better than that solution.
You want to decode at the edge, when you first read a string, so that you don't have surprises later in your code. At the edge, you have some reasonable chance of guessing what that encoding is. For this code, the edge is
os.listdir(r'C:\\Test\\AutoTest')[0]
and you can get the current file system directory encoding. So,
import os
import sys
fs_encoding = sys.getfilesystemencoding()
s = os.path.join(r'C:\Test\AutoTest',
                 os.listdir(r'C:\Test\AutoTest')[0].decode(encoding=fs_encoding, errors='strict'))
print(s)
Note that once you decode, you have a unicode string and you don't need to build a new unicode() object from it.
latin-1 works if that's your current code page. It's an interesting curiosity that even though Windows has supported "wide" characters with "W" versions of its API for many years, Python 2 is single-byte-character based and doesn't use them.
Long live Python 3.
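For comparison, a minimal Python 3 sketch of the same join (assuming the directory layout from the question):
import os
# In Python 3, os.listdir() on a str path already returns decoded str objects,
# so no manual .decode() step is needed.
s = os.path.join(r'C:\Test\AutoTest', os.listdir(r'C:\Test\AutoTest')[0])
print(s)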

Python pandas load csv ANSI Format as UTF-8

I want to load a CSV file with pandas in a Jupyter notebook; it contains characters like ä, ö, ü, ß.
When I open the CSV file with Notepad++, here is one example row which causes trouble in ANSI format:
Empf„nger;Empf„ngerStadt;Empf„ngerStraáe;Empf„ngerHausnr.;Empf„ngerPLZ;Empf„ngerLand
The correct UTF-8 outcome for Empf„nger should be: Empfänger
Now when I load the CSV data in Python 3.6 pandas on Windows with the following code:
df_a = pd.read_csv('file.csv',sep=';',encoding='utf-8')
I get an error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position xy: invalid continuation byte
Position 'xy' is the position where the character that causes the error message occurs.
When I use the ANSI format to load my CSV file, it works but displays the umlauts incorrectly.
Example code:
df_a = pd.read_csv('afile.csv',sep=';',encoding='ANSI')
Empfänger is represented as: Empf„nger
Note: I have tried to convert the file to UTF-8 in Notepad++ and load it afterwards with the pandas module, but I still get the same error.
I have searched online for a solution, but the provided solutions, such as "change format in Notepad++ to UTF-8", "use encoding='UTF-8'", or 'latin1' (which gives me the same result as the ANSI format), or
import chardet
with open('afile.csv', 'rb') as f:
    result = chardet.detect(f.readline())
df_a = pd.read_csv('afile.csv', sep=';', encoding=result['encoding'])
didn't work for me.
encoding='cp1252'
throws the following exception:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>
I also tried to replace strings afterwards with the x.replace() method, but the character ü disappears completely after being loaded into a pandas DataFrame.
If you don't know what your file's encoding is, I think the fastest approach is to open the file in a text editor, like Notepad++, to check how your file is encoded.
Then you go to the Python documentation and look for the correct codec to use.
In your case, ANSI, the codec is 'mbcs', so your code will look like this:
df_a = pd.read_csv('file.csv',sep=';',encoding='mbcs')
When EmpfängerStraße shows up as Empf„ngerStraáe when decoded as "ANSI", or more correctly cp1250 in this case, then the actual encoding of the data is most likely cp850:
print 'Empf„ngerStraáe'.decode('utf8').encode('cp1250').decode('cp850')
Or in Python 3, where literal strings are already Unicode strings:
print("Empf„ngerStraáe".encode("cp1250").decode("cp850"))
I couldn't find a proper solution after trying out all the well-known encodings, from ISO-8859-1 to ISO-8859-15, from UTF-8 to UTF-32, and from Windows-1250 through 1258; nothing worked properly. So my guess is that the text encoding got corrupted during the export. My own solution to this is to load the text file into a DataFrame with Windows-1251, as it does not cut out special characters in my text file, and then replace all broken characters with the corresponding ones. It's a rather dissatisfying solution that takes a lot of time to compute, but it's better than nothing.
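A minimal sketch of that approach (the replacement pairs below are hypothetical; build the mapping from the broken/correct pairs you actually observe in your data):
import pandas as pd
# Hypothetical mapping of broken characters to the intended ones.
fixes = {'„': 'ä', 'á': 'ß'}
df = pd.read_csv('file.csv', sep=';', encoding='windows-1251')
for broken, correct in fixes.items():
    # regex=True makes replace() operate on substrings inside string cells
    df = df.replace(broken, correct, regex=True)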
You could use the encoding value UTF-16LE to solve the problem:
pd.read_csv("./file.csv", encoding="UTF-16LE")
The file.csv should be saved using encoding UTF-16LE in Notepad++, option UCS-2 LE BOM.
cp1252 works on both Linux and Windows to decode latin1-encoded files.
df = pd.read_csv('data.csv',sep=';',encoding='cp1252')
However, if you are running on a Windows machine, I would recommend using
df = pd.read_csv('data.csv', sep=';', encoding='mbcs')
Ironically, using 'latin1' as the encoding does not always work, especially if you want to convert the file to a different encoding.
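One way around that is to re-encode the file once up front (a sketch, assuming the source really is cp1252; adjust both encodings to your case):
# Read with the source encoding and write a UTF-8 copy for later use;
# 'data_utf8.csv' is just an illustrative output name.
with open('data.csv', encoding='cp1252') as src, \
        open('data_utf8.csv', 'w', encoding='utf-8') as dst:
    dst.write(src.read())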

Python encoding issue while reading a file

I am trying to read a file that contains the character "ë" in it. The problem is that I cannot figure out how to read it, no matter what I try to do with the encoding. When I look at the file manually in TextEdit, it is listed as an unknown 8-bit file. If I try changing it to utf-8, utf-16 or anything else, it either does not work or messes up the entire file. I tried reading the file with standard Python commands as well as with codecs, and cannot come up with anything that will read it correctly. I will include a code sample of the read below. Does anyone have any clue what I am doing wrong? This is Python 2.7.10, by the way.
readFile = codecs.open("FileName",encoding='utf-8')
The line I am trying to read is this, with nothing else in it:
Aeëtes
Here are some of the errors I get:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x91 in position 0: invalid start byte
UnicodeError: UTF-16 stream does not start with BOM -- I know this one means that it is not a utf-16 file.
UnicodeDecodeError: 'ascii' codec can't decode byte 0x91 in position 0: ordinal not in range(128)
If I don't use a codec, the word comes in as Ae?tes, which then crashes later in the program. Just to be clear, none of the suggested questions, or any others anywhere on the net, have pointed to an answer. One other detail that might help is that I am using OS X, not Windows.
Credit for this answer goes to RadLexus for figuring out the proper encoding, and also to Mad Physicist, who put me on the right track even if I did not consider all possible encodings.
The issue is apparently that a Mac will save the .txt file as mac_roman. If you use that encoding, it will work perfectly.
This is the line of code that I used to convert it.
readFile = codecs.open("FileName",encoding='mac_roman')

UnicodeDecodeError for Reading files in Python

pythonNotes = open('E:\\Python Notes.docx','r')
read_it_now = pythonNotes.read()
print(read_it_now.encode('utf-16'))
When I try this code, I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 591: character maps to <undefined>
I am running this in visual studio with python tools - starting without debugging.
I have tried putting enc='utf-8' at the top and passing it in as a parameter; I've looked at other questions and just couldn't find a solution to this simple issue.
Please assist.
This error can occur when text that is already in utf-8 format is read in as an 8-bit encoding, and python tries to "decode" it to Unicode: Bytes that have no meaning in the supposed encoding throw a UnicodeDecodeError. But you'll always get an error if you try to read a file as utf-8 that is not in the utf-8 encoding.
In your case, the problem is that a docx file is not a regular text file; no single text encoding can meaningfully import it. See this SO answer for directions on how to read it on a low level, or use python-docx to get access to the document in a way that resembles what you see in Word.
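A minimal python-docx sketch (assuming the package is installed, e.g. via pip install python-docx; the path is the one from the question):
from docx import Document
# Document() parses the .docx container; .paragraphs yields the visible text.
doc = Document('E:\\Python Notes.docx')
for paragraph in doc.paragraphs:
    print(paragraph.text)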

Why does my Python program get UnicodeDecodeError in IntelliJ but is OK from the command line?

I have a simple program that loads a .json file which contains a funny character. The program (see below) runs fine in Terminal but gets this error in IntelliJ:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
The crucial code is:
with open(jsonFileName) as f:
    jsonData = json.load(f)
If I replace the open with:
with open(jsonFileName, encoding='utf-8') as f:
Then it works in both IntelliJ and Terminal. I'm still new to Python and the IntelliJ plugin, and I don't understand why they're different. I thought sys.path might be different, but the output makes me think that's not the cause. Could someone please explain? Thanks!
Versions:
OS: Mac OS X 10.7.4 (also tested on 10.6.8)
Python 3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) /Library/Frameworks/Python.framework/Versions/3.2/bin/python3.2
IntelliJ: 11.1.3 Ultimate
Files (2):
1. unicode-error-demo.py
#!/usr/bin/python
import json
from pprint import pprint as pp
import sys

def main():
    if len(sys.argv) != 2:
        print(sys.argv[0], "takes one arg: a .json file")
        return
    jsonFileName = sys.argv[1]
    print("sys.path:")
    pp(sys.path)
    print("processing", jsonFileName)
    # with open(jsonFileName) as f:  # OK in Terminal, but BUG in IntelliJ: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
    with open(jsonFileName, encoding='utf-8') as f:  # OK in both
        jsonData = json.load(f)
    pp(jsonData)

if __name__ == "__main__":
    main()
2. encode-temp.json
["™"]
The JSON .load() function expects Unicode data, not raw bytes. Python automatically tries to decode the byte string to a Unicode string for you, using a default codec (in your case ASCII), and fails. By opening the file with the UTF-8 codec, Python makes an explicit conversion for you. See the open() function documentation, which states:
In text mode, if encoding is not specified the encoding used is platform dependent.
The encoding that would be used is determined as follows:
Try os.device_encoding() to see if there is a terminal encoding.
Use the locale.getpreferredencoding() function, which depends on the environment you run your code in. The do_setlocale argument of that function is set to False.
Use 'ASCII' as a default if both methods have returned None.
This is all done in C, but its Python equivalent would be:
if encoding is None:
    encoding = os.device_encoding()
if encoding is None:
    encoding = locale.getpreferredencoding(False)
if encoding is None:
    encoding = 'ASCII'
So when you run your program in a terminal, os.device_encoding() returns 'UTF-8', but when running under IntelliJ there is no terminal, and if no locale is set either, Python uses 'ASCII'.
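To see what your own environment would pick, a quick diagnostic like this can help (run it both from the terminal and inside IntelliJ and compare):
import locale
import sys
# Encoding attached to stdout, if any (often None when no terminal is attached):
print(sys.stdout.encoding)
# The locale fallback that open() would use:
print(locale.getpreferredencoding(False))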
The Python Unicode HOWTO tells you all about the difference between unicode strings and bytestrings, as well as encodings. Another essential article on the subject is Joel Spolsky's Absolute Minimum Unicode knowledge article.
Python 2.x has byte strings and unicode strings. The basic strings are ASCII-encoded. ASCII uses only 7 bits per character, which allows it to encode 128 characters, while modern UTF-8 uses up to 4 bytes per character. UTF-8 is compatible with ASCII (so any ASCII-encoded string is a valid UTF-8 string), but not the other way round.
Apparently, your file contains non-ASCII characters, and Python by default wants to read it in as a simple ASCII-encoded string. It spots a non-ASCII byte (its first bit is not 0, as it's 0xe2) and says: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128).
This has nothing to do with Python, but it is still my favourite tutorial about encodings:
http://hektor.umcs.lublin.pl/~mikosmul/computing/articles/linux-unicode.html
