writing to text file - 'ascii' codec can't encode character - python

I'm having a bit of trouble outputting words from a text-image to a .txt file.
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
text = pytesseract.image_to_string(Image.open("book_image.jpg"))
file = open("text_file","w")
file.write(text)
print(text)
The code which reads the image file and prints out the words on the image works fine. The problem is that when I try to take the text and write it to a file, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 366: ordinal not in range(128)
Could anyone please explain how I can convert the variable text to a string?

Try this:
file = open("text_file", "w", encoding='utf8', errors="ignore")

Also try:
file.write(text.encode('utf-8').strip())
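Putting the first suggestion together with the question's code, a minimal sketch assuming Python 3 (the book_image.jpg and text_file names come from the question):

import pytesseract
from PIL import Image

# OCR the page; pytesseract returns a Unicode string
text = pytesseract.image_to_string(Image.open("book_image.jpg"))

# Open the output with an explicit UTF-8 encoding so characters such as
# u'\u2019' (right single quotation mark) can be written without error
with open("text_file", "w", encoding="utf-8") as f:
    f.write(text)

print(text)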

Related

How to put Arabic text in a txt file

I have been trying to put Arabic text in a .txt file, and when I do so using the code below I get this error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>
code:
Log1 = open("File.txt", "a")
Log1.write("سلام")
Log1.close()
This question has been asked on Stack Overflow many times, but all the answers suggest using utf-8, which in this case outputs \xd8\xb3\xd9\x84\xd8\xa7\xd9\x85. I was wondering if there is any way to make the text in the file look like سلام instead of \xd8\xb3\xd9\x84\xd8\xa7\xd9\x85.
This works perfectly:
FILENAME = 'foo.txt'
with open(FILENAME, "w", encoding='utf-8') as data:
    data.write("سلام")
Then from a zsh shell:
cat foo.txt
سلام
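If you need append mode, as in the question's Log1 code, the same encoding argument applies; a minimal sketch using the File.txt name from the question:

# Append mode with an explicit UTF-8 encoding; the file stores UTF-8 bytes,
# and any UTF-8-aware editor (or cat) will display سلام rather than \x.. escapes
with open("File.txt", "a", encoding="utf-8") as log1:
    log1.write("سلام\n")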

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 100: character maps to <undefined>

I have two files, both in the same directory:
http://nlp.lsi.upc.edu/awn/AWNDatabaseManagement.py.gz
the xml database of Arabic WordNet (http://nlp.lsi.upc.edu/awn/get_bd.php) upc_db.xml
When I try to run the .py file, it gives me the error in the image.
I am trying to check that the .py file is working so I can import it as a WordNet for Arabic words.
Can you help me through it?
Thanks
(image of the error)
To read any binary file/db, use encoding="utf-8" while opening the file/db.
UTF-8 is capable of encoding all 1,112,064 valid Unicode code points using one to four one-byte code units.
So, simple is best.
To read the above binary file, use
ent = open(ent, 'rb')
instead of,
ent = open(ent)
Try encoding it.
with open(file, encoding="utf-8") as file:
    # Read the whole file contents as text
    file.read()
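For the Arabic WordNet database above, a minimal sketch of both approaches (the upc_db.xml filename comes from the question; how AWNDatabaseManagement.py actually opens it is not shown here):

# Option 1: open the XML database as text with an explicit encoding
with open("upc_db.xml", encoding="utf-8") as db:
    xml_text = db.read()

# Option 2: open it as raw bytes and let an XML parser handle the decoding
with open("upc_db.xml", "rb") as db:
    xml_bytes = db.read()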

Encoding error when printing tesseract output

I'm just trying to make a simple program to OCR an entire page; however, I am getting an encoding error, which I've always had trouble fixing.
My code:
from PIL import Image
import pytesseract
text = pytesseract.image_to_string(Image.open('005.png'))
print(text)
My error:
File "c:/Users/Dylan C/Desktop/Comparitor/image.py", line 4, in
print(text)
File "C:\Users\Dylan C\AppData\Local\Programs\Python\Python35\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 187: character maps to
Sorry if this is a stupid question; I have JUST downloaded tesseract and am no expert in programming.
As the error states, the problem is in print(text): you are trying to print Unicode (UTF-8) text to a console/environment that does not support it.
Search for a "print UnicodeEncodeError windows" solution, e.g. Python, Unicode, and the Windows console.
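For example, on Python 3.7+ you can switch stdout to UTF-8 before printing; the asker's path shows Python 3.5, where setting the PYTHONIOENCODING=utf-8 environment variable before running the script is the usual equivalent. A sketch of the 3.7+ approach:

import sys
import pytesseract
from PIL import Image

# Reconfigure stdout so the cp437 console codec is not used; replace any
# characters the terminal still cannot show instead of raising an error
sys.stdout.reconfigure(encoding="utf-8", errors="replace")

text = pytesseract.image_to_string(Image.open("005.png"))
print(text)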

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128) chinese characters

I'm trying to write Chinese characters into a text file from a SQL output called result.
result looks like this:
[('你好吗', 345re4, '2015-07-20'), ('我很好',45dde2, '2015-07-20').....]
This is my code:
#result is a list of tuples
file = open("my.txt", "w")
for row in result:
    print >> file, row[0].encode('utf-8')
file.close()
row[0] contains Chinese text like this: 你好吗
I also tried:
print >> file, str(row[0]).encode('utf-8')
and
print >> file, 'u'+str(row[0]).encode('utf-8')
but both gave the same error.
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-10: ordinal not in range(128)
Found a simple solution: instead of doing the encoding and decoding myself, open the file as "utf-8" from the beginning using codecs.
import codecs
file = codecs.open("my.txt", "w", "utf-8")
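A minimal sketch of the full loop with codecs, assuming Python 2 (as implied by the print >> syntax); the sample result row here is only shaped like the question's output:

import codecs

# Sample row shaped like the question's SQL output (values are placeholders)
result = [(u'你好吗', '345re4', '2015-07-20')]

# codecs.open returns a file object that encodes unicode strings to UTF-8 on write
file = codecs.open("my.txt", "w", "utf-8")
for row in result:
    file.write(row[0] + u"\n")  # row[0] is a unicode string such as u'你好吗'
file.close()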
Don't forget to add the UTF-8 BOM at the beginning of the file if you wish to view the file correctly in a text editor:
file = open(...)
file.write("\xef\xbb\xbf")
for row in result:
    print >> file, u""+row[0].decode("mbcs").encode("utf-8")
file.close()
I think you'll have to decode from your machine's default encoding to unicode(), then encode it as UTF-8.
mbcs represents (at least it did ages ago) the default encoding on Windows.
But do not rely on that.
Did you try the codecs module?

how to visualise results in Spyder?

I was trying to read a txt file in python. After the following input:
f = open("test.txt","r") #opens file with name of "test.txt"
print(f.read(1))
print(f.read())
instead of looking at the text, this is what I'm returned:
How do I visualize the output?
Thanks
I think you need to go line by line. Be careful: if it's a big text file, this could go on for a while.
for line in f:
    print line.decode('utf-8').strip()
Some strings are not being read correctly, so you'll need the decode line.
See: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)
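If you are on Python 3 (which current Spyder installs use), a minimal sketch of the same idea with an explicit encoding, using the test.txt name from the question:

# Open the file as text with an explicit encoding so decoding is handled for you
with open("test.txt", "r", encoding="utf-8") as f:
    for line in f:
        print(line.strip())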
try this setup...
it worked for my console printing problems:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
