UnicodeDecodeError in Python 2.7 - python

I am trying to read a UTF-8 encoded XML file in Python and do some processing on the lines read from it, like this:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Here doc_content is the line read from the file and word_value is one of the strings from the same line. I get an encoding-related error on the line above whenever doc_content or word_value contains non-ASCII characters. So I tried decoding them as UTF-8 first (instead of relying on the default ASCII codec), as below:
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting an error, as below:
Traceback (most recent call last):
File "snippetRetriver.py", line 402, in <module>
sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
File "snippetRetriver.py", line 201, in getSentenceList
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest a suitable way to avoid these kinds of encoding errors in Python 2.7?

codecs.utf_8_decode(input.encode('utf8'))
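The traceback in the question ends in a UnicodeEncodeError from the 'ascii' codec. In Python 2 that is what happens when .decode() is called on a value that is already a unicode object: Python first encodes it implicitly with ASCII. Below is a minimal sketch of one way to guard against that, assuming the values may arrive either as byte strings or as already-decoded text. The helper name to_text and the sample values are hypothetical.

```python
from __future__ import unicode_literals


def to_text(value, encoding="utf-8"):
    """Decode byte strings; return already-decoded text unchanged."""
    # In Python 2, `bytes` is an alias for `str`, so this check works on 2.7 and 3.x.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical sample data: one UTF-8 byte string and one already-decoded value.
doc_content = b"Le caf\xc3\xa9 est ferm\xc3\xa9."
word_value = "ferm\xe9"

# Both operands are text by the time find() runs, so no implicit ASCII step occurs.
index = to_text(doc_content).find(to_text(word_value), 3)
print(index)
```

Decoding once, at the point where the data is read, and working with text everywhere after that avoids scattering .decode() calls through the code.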

Related

How to decode Windows-1254 file to a readable format in Python

I have a file that contains Windows-1254 encoded data and I want to decode it to a readable format with Python, but I have had no success so far.
I tried this code but I am getting an error:
file = open('file_name', 'rb')
content_of_file = file.read()
file.close()
decoded_data = content_of_file.decode('cp1254') # here I am trying to decode that file
print(decoded_data)
This is the error that I get:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/encodings/cp1254.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 658: character maps to <undefined>
The way I know that this file was encoded in Windows-1254 is by using chardet.detect(data)['encoding'], so keep that in mind.
Does anyone know how I can decode that data in a way I can understand?
This is the file content, if someone needs it:
b'c\x00\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00C\x00\x00\x00s3\x03\x00\x00d\x01\x00\x01\x03\x03\x03\t\x02\x02\t\x03\x03\x03\t\x02\x02\td\x02\x00d\x00\x00l\x00\x00}\x01\x00d\x02\x00d\x00\x00l\x01\x00}\x02\x00t\x02\x00\x83\x00\x00}\x03\x00d\x03\x00}\x00\x00t\x03\x00|\x03\x00\x83\x01\x00d\x04\x00k\x03\x00rZ\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\xd0\x02|\x03\x00d\x06\x00\x19d\x07\x00k\x03\x00ry\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x08\x00\x19\x83\x01\x00d\t\x00Ad\n\x00k\x03\x00r\xa2\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x0b\x00d\x0c\x00!d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19j\x06\x00d\r\x00\x83\x01\x00s\xd4\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19j\x08\x00\x83\x00\x00\x0co\x07\x01|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19d\x0f\x00j\t\x00d\x10\x00\x83\x01\x00k\x02\x00s\x19\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00|\x01\x00j\n\x00d\x11\x00d\x04\x00\x83\x02\x00d\x06\x00\x14d\x11\x00\x17\x19j\x0b\x00d\x12\x00\x83\x01\x00d\x13\x00k\x03\x00rU\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x14\x00\x19\x83\x01\x00t\x05\x00|\x03\x00d\x15\x00\x19\x83\x01\x00Ad\x16\x00k\x03\x00s\x89\x01|\x03\x00d\x14\x00\x19d\x17\x00k\x03\x00r\x98\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x18\x00\x19d\x19\x00j\t\x00d\x1a\x00\x83\x01\x00k\x03\x00r\xc0\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x1b\x00\x19d\x1c\x00j\x0c\x00d\x1c\x00d\x1d\x00\x83\x02\x00k\x03\x00r\xeb\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\r\x00d\x1e\x00d\x0b\x00\x83\x02\x00\x19t\x0e\x00d\x11\x00\x83\x01\x00d\x1f\x00\x17j\t\x00d\x10\x00\x83\x01\x00k\x03\x00r&\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d 
\x00d!\x00!j\x0c\x00d"\x00d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19d\x1c\x00\x83\x02\x00d\x1c\x00k\x03\x00ra\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d!\x00\x19t\x0f\x00i\x00\x00\x83\x01\x00j\x10\x00d\x0b\x00\x19k\x03\x00r\x8d\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d#\x00\x19t\x0e\x00d\x06\x00\x83\x01\x00k\x03\x00r\xb2\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d$\x00\x19d%\x00k\x07\x00r\xd1\x02d&\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x02\x00j\x01\x00|\x03\x00d'\x00\x19\x83\x01\x00j\x11\x00\x83\x00\x00d(\x00k\x03\x00r\xff\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d)\x00\x19\x83\x01\x00t\x07\x00d*\x00\x83\x01\x00k\x03\x00r*\x03d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00d+\x00GHd\x00\x00S(,\x00\x00\x00Ns\x1a\x00\x00\x00#Import random,md5 modulesi\xff\xff\xff\xffs\x14\x00\x00\x00#Start checking flagi\x12\x00\x00\x00s\x05\x00\x00\x00NO!!!i\x00\x00\x00\x00t\x01\x00\x00\x00Mi\x01\x00\x00\x00i\xff\x00\x00\x00i\xaa\x00\x00\x00i\x02\x00\x00\x00i\x04\x00\x00\x00t\x02\x00\x00\x00HCs\x10\x00\x00\x00([] > {}) << -~1t\x02\x00\x00\x007bt\x03\x00\x00\x00hexi\x05\x00\x00\x00t\x06\x00\x00\x00base64s\x05\x00\x00\x00ZA==\ni\x06\x00\x00\x00i\x07\x00\x00\x00i\x1a\x00\x00\x00t\x01\x00\x00\x00ii\x08\x00\x00\x00t\x01\x00\x00\x00nt\x05\x00\x00\x00rot13i\t\x00\x00\x00t\x00\x00\x00\x00t\x01\x00\x00\x00st\x04\x00\x00\x001010t\x01\x00\x00\x00fi\x0b\x00\x00\x00i\r\x00\x00\x00t\x02\x00\x00\x00ypi\x0e\x00\x00\x00i\x0f\x00\x00\x00t\x08\x00\x00\x00dddddddds\x06\x00\x00\x00NO!!!!i\x10\x00\x00\x00t 
\x00\x00\x00eccbc87e4b5ce2fe28308fd9f2a7baf3i\x11\x00\x00\x00sM\x00\x00\x00~(~(~((((~({}<[])<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))s\n\x00\x00\x00Bazinga!!!(\x12\x00\x00\x00t\x06\x00\x00\x00randomt\x03\x00\x00\x00md5t\t\x00\x00\x00raw_inputt\x03\x00\x00\x00lent\x04\x00\x00\x00quitt\x03\x00\x00\x00ordt\n\x00\x00\x00startswitht\x04\x00\x00\x00evalt\x07\x00\x00\x00isalphat\x06\x00\x00\x00decodet\x07\x00\x00\x00randintt\x06\x00\x00\x00encodet\x07\x00\x00\x00replacet\x03\x00\x00\x00intt\x03\x00\x00\x00strt\x04\x00\x00\x00typet\x08\x00\x00\x00__name__t\t\x00\x00\x00hexdigest(\x04\x00\x00\x00t\x07\x00\x00\x00commentR\x0f\x00\x00\x00R\x10\x00\x00\x00t\x03\x00\x00\x00ans(\x00\x00\x00\x00(\x00\x00\x00\x00sA\x00\x00\x00XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXt\x0c\x00\x00\x00get_the_flag\x0b\x00\x00\x00sl\x00\x00\x00\x00\x02\x12\x01\x0c\x01\x0c\x01\t\x02\x06\x01\x12\x01\x05\x01\n\x02\x10\x01\x05\x01\n\x01\x1a\x01\x05\x01\n\x01#\x01\x05\x01\n\x016\x01\x05\x01\n\x01-\x01\x05\x01\n\x014\x01\x05\x01\n\x01\x19\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01,\x01\x05\x01\n\x01,\x01\x05\x01\n\x01\x1d\x01\x05\x01\n\x01\x16\x01\x05\x01\n\x01\x10\x01\x05\x01\n\x01\x1f\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01'
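One way to see exactly which bytes the codec rejects is to decode with an error handler instead of the default strict mode. The sketch below assumes Python 3.5 or later, where backslashreplace is accepted for decoding, and uses a short made-up sample rather than the real file. cp1254 leaves a handful of byte values (0x8d among them) unassigned. Since chardet only returns a statistical guess and the dump above is full of NUL and control bytes, the data may well be binary rather than Windows-1254 text.

```python
# Hypothetical sample containing 0x8d, which cp1254 does not map to a character.
sample = b"abc\x8ddef"

# Strict decoding raises on the first unmapped byte.
failed_at = None
try:
    sample.decode("cp1254")
except UnicodeDecodeError as exc:
    failed_at = exc.start
print("strict decode failed at position", failed_at)

# backslashreplace keeps going and shows unmapped bytes as \xNN escapes.
readable = sample.decode("cp1254", errors="backslashreplace")
print(readable)

# Collect every single byte value that cp1254 cannot decode.
unmapped = [
    b for b in range(256)
    if bytes([b]).decode("cp1254", errors="ignore") == ""
]
print([hex(b) for b in unmapped])
```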

Emoji support reading from file in python? [duplicate]

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, this is the error I get. How do I avoid this?
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>
Since you are using Python 3, just add the encoding parameter to open():
corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()

Python: Can't seem to decode the textfile

I am trying to open the file i.md3 in python. Only the first line is displayed properly.
This file is correct as I can open this in C easily with structures and pointers.
How can I decode this file? I have tried many encodings, most of which display only the first line correctly.
Without the "encoding=cp850" there is an error:
Traceback (most recent call last):
File "D:\Eclipse Workspace\IGG Project\Main.py", line 40, in <module>
line = fp1.read()
File "C:\Program Files (x86)\Python38-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 35: character maps to <undefined>
Code:
fp1 = open("i.md3", encoding="cp850")
while 1:
    line = fp1.read()
    if not line:
        break
    print (line)
The first few lines of output are in the image linked below:
https://i.stack.imgur.com/TGZhq.png
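Being able to read the file in C with structures and pointers suggests that i.md3 is binary data rather than text, in which case no text encoding will produce meaningful lines. A minimal sketch of reading such a file in binary mode with the struct module follows. The 8-byte header layout (a 4-byte tag followed by a little-endian 32-bit integer) and the function name read_header are assumptions for illustration, not the actual layout of this file.

```python
import io
import struct

# Hypothetical header layout: 4-byte tag, then a little-endian signed 32-bit integer.
HEADER = struct.Struct("<4si")


def read_header(fileobj):
    """Read and unpack the assumed header from a file opened in binary mode."""
    raw = fileobj.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("file too short for the assumed header")
    tag, number = HEADER.unpack(raw)
    return tag, number


# Demonstration on in-memory bytes; for a real file use open("i.md3", "rb").
fake_file = io.BytesIO(HEADER.pack(b"ABCD", 15) + b"\x90\x00\xff")
print(read_header(fake_file))
```

The format string would need to mirror the C struct definitions field by field, including byte order and padding.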

Decoding HappyBase data from HBase

While trying to decode values from HBase, I am seeing an error. Python apparently thinks the data is not valid UTF-8, but the Java application that put the data into HBase encoded it as UTF-8.
a = '\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y'
a.decode("UTF-8")
Traceback (most recent call last):
File "", line 1, in
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 9: invalid continuation byte
Any thoughts?
That data is not valid UTF-8, so if you really retrieved it as such from the database, you should check who or what put it in there.
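To confirm that diagnosis, the value can be checked for UTF-8 validity and inspected as raw bytes. This is a minimal sketch written for Python 3 (the question uses 2.7, where the literal would be a plain str). The helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


a = b"\x00\x00\x00\x00\x10j\x00\x00\x07\xe8\x02Y"

# 0xe8 starts a multi-byte sequence, but the next byte (0x02) is not a continuation byte.
print(is_valid_utf8(a))

# View the value as raw bytes instead of trying to read it as text.
print(a.hex())
```

If the Java side wrote numbers in a binary form rather than text, the value would need to be unpacked as binary data instead of decoded.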

Printing unicode characters from SQL in Excel with xlwt

I am using Python to extract data from an MSSQL database, using an ODBC connection. I am then trying to put the extracted data into an Excel file, using xlwt.
However this generates the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 20: ordinal not in range(128)
I have run the script to just print the data and established that the offending character in the database is an O with a slash through it (Ø). When printed from Python it shows as "\xd8".
The worksheet encoding for xlwt is set as UTF-8.
Is there any way to have this come straight through into Excel?
Edit
Full error message below:
C:\>python dbtest1.py
Traceback (most recent call last):
File "dbtest1.py", line 24, in <module>
ws.write(i,j,item)
File "build\bdist.win32\egg\xlwt\Worksheet.py", line 1032, in write
File "build\bdist.win32\egg\xlwt\Row.py", line 240, in write
File "build\bdist.win32\egg\xlwt\Workbook.py", line 309, in add_str
File "build\bdist.win32\egg\xlwt\BIFFRecords.py", line 25, in add_str
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 20: invalid continuation byte
Setting the workbook encoding to 'latin-1' seems to have achieved the same:
wb = xlwt.Workbook(encoding='latin-1')
(It was set at 'UTF-8' before)
The other answer didn't work in my case as there were other fields that were not strings.
The SQL extraction seems to be returning byte strings in an 8-bit encoding such as Latin-1 (where 0xd8 is Ø), not ASCII or UTF-8. You can convert them to unicode with:
data = unicode(input_string, 'latin-1')
You can then put them into a spreadsheet with xlwt.
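Since some fields are not strings, one option is to decode only the byte-string cells before writing a row. This is a minimal sketch, assuming the driver returns Latin-1 encoded byte strings. decode_row and the sample row are hypothetical, and the code relies on bytes being an alias of str in Python 2.7.

```python
def decode_row(row, encoding="latin-1"):
    """Decode byte-string cells to text; leave numbers, None, etc. untouched."""
    return [
        item.decode(encoding) if isinstance(item, bytes) else item
        for item in row
    ]


# Hypothetical row: 0xd8 is the letter O with a stroke in Latin-1.
row = [1, b"S\xd8REN", 3.5, None]
decoded = decode_row(row)
print(decoded)
```

Each decoded item can then be passed to ws.write(i, j, item) as in the question.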
