Python: Can't seem to decode the text file

I am trying to open the file i.md3 in Python, but only the first line is displayed properly.
The file itself is valid: I can read it easily in C with structures and pointers.
How can I decode this file? I have tried many encodings, and most of them show only the first line of output correctly.
Without encoding="cp850" I get this error:
Traceback (most recent call last):
  File "D:\Eclipse Workspace\IGG Project\Main.py", line 40, in <module>
    line = fp1.read()
  File "C:\Program Files (x86)\Python38-32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 35: character maps to <undefined>
Code:
fp1 = open("i.md3", encoding="cp850")
while 1:
    line = fp1.read()
    if not line:
        break
    print(line)
The first few lines of the output are shown at the link below:
https://i.stack.imgur.com/TGZhq.png
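The remark that the file reads fine in C with structures suggests it holds binary records rather than text, in which case no text encoding will decode it cleanly. The following is a minimal sketch under that assumption: it opens a file in "rb" mode and unpacks a fixed-size header with the struct module. The header layout, the names HEADER and read_header, and the sample file are all hypothetical; the real layout of i.md3 has to come from the C structure definitions.

```python
import os
import struct
import tempfile

# Hypothetical header layout, for illustration only: a 4-byte magic tag,
# a little-endian 32-bit version, and a 16-byte NUL-padded name.
HEADER = struct.Struct("<4si16s")


def read_header(path):
    """Read the fixed-size header as raw bytes and unpack it into fields."""
    with open(path, "rb") as fp:  # "rb": no text decoding is attempted
        raw = fp.read(HEADER.size)
    magic, version, name = HEADER.unpack(raw)
    # Only the name field is text, so only that field is decoded.
    return magic, version, name.split(b"\x00", 1)[0].decode("ascii")


# Build a small sample file so the sketch runs without the real i.md3.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(sample_path, "wb") as fp:
    fp.write(HEADER.pack(b"DEMO", 15, b"model") + b"\x90\x00\xff")

header = read_header(sample_path)
print(header)
```

The format string plays the role of the C struct: each code corresponds to one member, and the "<" prefix fixes the byte order and disables padding.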

Related

How to decode Windows-1254 file to a readable format in Python

I have a file that contains Windows-1254 encoded data, and I want to decode it to a readable format with Python, but I have had no success so far.
I tried this code but I am getting an error:
file = open('file_name', 'rb')
content_of_file = file.read()
file.close()
decoded_data = content_of_file.decode('cp1254') # here I am trying to decode that file
print(decoded_data)
This is the error that I get:
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 658: character maps to <undefined>
I know that this file was encoded in Windows-1254 because chardet.detect(data)['encoding'] reported it, so keep that in mind.
Does anyone know how I can decode that data into a form I can understand?
This is the file content, in case anyone needs it:
b'c\x00\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00C\x00\x00\x00s3\x03\x00\x00d\x01\x00\x01\x03\x03\x03\t\x02\x02\t\x03\x03\x03\t\x02\x02\td\x02\x00d\x00\x00l\x00\x00}\x01\x00d\x02\x00d\x00\x00l\x01\x00}\x02\x00t\x02\x00\x83\x00\x00}\x03\x00d\x03\x00}\x00\x00t\x03\x00|\x03\x00\x83\x01\x00d\x04\x00k\x03\x00rZ\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\xd0\x02|\x03\x00d\x06\x00\x19d\x07\x00k\x03\x00ry\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x08\x00\x19\x83\x01\x00d\t\x00Ad\n\x00k\x03\x00r\xa2\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x0b\x00d\x0c\x00!d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19j\x06\x00d\r\x00\x83\x01\x00s\xd4\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19j\x08\x00\x83\x00\x00\x0co\x07\x01|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19d\x0f\x00j\t\x00d\x10\x00\x83\x01\x00k\x02\x00s\x19\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00|\x01\x00j\n\x00d\x11\x00d\x04\x00\x83\x02\x00d\x06\x00\x14d\x11\x00\x17\x19j\x0b\x00d\x12\x00\x83\x01\x00d\x13\x00k\x03\x00rU\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x14\x00\x19\x83\x01\x00t\x05\x00|\x03\x00d\x15\x00\x19\x83\x01\x00Ad\x16\x00k\x03\x00s\x89\x01|\x03\x00d\x14\x00\x19d\x17\x00k\x03\x00r\x98\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x18\x00\x19d\x19\x00j\t\x00d\x1a\x00\x83\x01\x00k\x03\x00r\xc0\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x1b\x00\x19d\x1c\x00j\x0c\x00d\x1c\x00d\x1d\x00\x83\x02\x00k\x03\x00r\xeb\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\r\x00d\x1e\x00d\x0b\x00\x83\x02\x00\x19t\x0e\x00d\x11\x00\x83\x01\x00d\x1f\x00\x17j\t\x00d\x10\x00\x83\x01\x00k\x03\x00r&\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d 
\x00d!\x00!j\x0c\x00d"\x00d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19d\x1c\x00\x83\x02\x00d\x1c\x00k\x03\x00ra\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d!\x00\x19t\x0f\x00i\x00\x00\x83\x01\x00j\x10\x00d\x0b\x00\x19k\x03\x00r\x8d\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d#\x00\x19t\x0e\x00d\x06\x00\x83\x01\x00k\x03\x00r\xb2\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d$\x00\x19d%\x00k\x07\x00r\xd1\x02d&\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x02\x00j\x01\x00|\x03\x00d'\x00\x19\x83\x01\x00j\x11\x00\x83\x00\x00d(\x00k\x03\x00r\xff\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d)\x00\x19\x83\x01\x00t\x07\x00d*\x00\x83\x01\x00k\x03\x00r*\x03d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00d+\x00GHd\x00\x00S(,\x00\x00\x00Ns\x1a\x00\x00\x00#Import random,md5 modulesi\xff\xff\xff\xffs\x14\x00\x00\x00#Start checking flagi\x12\x00\x00\x00s\x05\x00\x00\x00NO!!!i\x00\x00\x00\x00t\x01\x00\x00\x00Mi\x01\x00\x00\x00i\xff\x00\x00\x00i\xaa\x00\x00\x00i\x02\x00\x00\x00i\x04\x00\x00\x00t\x02\x00\x00\x00HCs\x10\x00\x00\x00([] > {}) << -~1t\x02\x00\x00\x007bt\x03\x00\x00\x00hexi\x05\x00\x00\x00t\x06\x00\x00\x00base64s\x05\x00\x00\x00ZA==\ni\x06\x00\x00\x00i\x07\x00\x00\x00i\x1a\x00\x00\x00t\x01\x00\x00\x00ii\x08\x00\x00\x00t\x01\x00\x00\x00nt\x05\x00\x00\x00rot13i\t\x00\x00\x00t\x00\x00\x00\x00t\x01\x00\x00\x00st\x04\x00\x00\x001010t\x01\x00\x00\x00fi\x0b\x00\x00\x00i\r\x00\x00\x00t\x02\x00\x00\x00ypi\x0e\x00\x00\x00i\x0f\x00\x00\x00t\x08\x00\x00\x00dddddddds\x06\x00\x00\x00NO!!!!i\x10\x00\x00\x00t 
\x00\x00\x00eccbc87e4b5ce2fe28308fd9f2a7baf3i\x11\x00\x00\x00sM\x00\x00\x00~(~(~((((~({}<[])<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))s\n\x00\x00\x00Bazinga!!!(\x12\x00\x00\x00t\x06\x00\x00\x00randomt\x03\x00\x00\x00md5t\t\x00\x00\x00raw_inputt\x03\x00\x00\x00lent\x04\x00\x00\x00quitt\x03\x00\x00\x00ordt\n\x00\x00\x00startswitht\x04\x00\x00\x00evalt\x07\x00\x00\x00isalphat\x06\x00\x00\x00decodet\x07\x00\x00\x00randintt\x06\x00\x00\x00encodet\x07\x00\x00\x00replacet\x03\x00\x00\x00intt\x03\x00\x00\x00strt\x04\x00\x00\x00typet\x08\x00\x00\x00__name__t\t\x00\x00\x00hexdigest(\x04\x00\x00\x00t\x07\x00\x00\x00commentR\x0f\x00\x00\x00R\x10\x00\x00\x00t\x03\x00\x00\x00ans(\x00\x00\x00\x00(\x00\x00\x00\x00sA\x00\x00\x00XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXt\x0c\x00\x00\x00get_the_flag\x0b\x00\x00\x00sl\x00\x00\x00\x00\x02\x12\x01\x0c\x01\x0c\x01\t\x02\x06\x01\x12\x01\x05\x01\n\x02\x10\x01\x05\x01\n\x01\x1a\x01\x05\x01\n\x01#\x01\x05\x01\n\x016\x01\x05\x01\n\x01-\x01\x05\x01\n\x014\x01\x05\x01\n\x01\x19\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01,\x01\x05\x01\n\x01,\x01\x05\x01\n\x01\x1d\x01\x05\x01\n\x01\x16\x01\x05\x01\n\x01\x10\x01\x05\x01\n\x01\x1f\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01'
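A result from chardet is a statistical guess, not a guarantee, and the dump above is full of NUL and control bytes, which is more typical of binary data than of Windows-1254 text. The sketch below does not recover any meaning from the data; it only shows, on a short made-up byte string, how the strict decode fails on 0x8d and how the errors argument lets you inspect the content anyway.

```python
# A short stand-in for the file content; 0x8d is the byte the traceback rejects.
data = b"NO!!!\x8d\x00\x00flag"

try:
    data.decode("cp1254")
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first undecodable byte
    print("strict decoding failed at byte", failed_at)

# errors="replace" never raises; undecodable bytes become U+FFFD.
replaced = data.decode("cp1254", errors="replace")

# errors="backslashreplace" keeps the raw byte value visible as an escape.
escaped = data.decode("cp1254", errors="backslashreplace")
print(repr(escaped))
```

If the output is still mostly escapes and control characters, the data is probably not text in any encoding and is better handled as bytes.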

Error when printing random line from text file

I need to print a random line from the file "Long films".
My code is:
import random
with open('Long films') as f:
    lines = f.readlines()
print(random.choice(lines))
But it prints this error:
Traceback (most recent call last):
  line 3, in <module>
    lines = f.readlines()
  line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
What do I need to do in order to avoid this error?
The problem is not with printing; it is with reading. Your file contains non-ASCII characters, and it is being decoded as ASCII. Try opening the file with an explicit encoding:
with open('Long films', encoding='latin-1') as f:
    ...
Also, have you changed your locale settings? When no encoding is given, Python 3 decodes text files with the locale's preferred encoding, which is UTF-8 on most Linux and macOS systems, so you would not normally see an 'ascii' codec error unless the locale is set to something like C or POSIX.
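One caveat about latin-1: it maps every byte to some character, so it never raises, but it can silently produce the wrong text. Byte 0xe2 is a common lead byte in UTF-8, so it is plausible (though not certain) that the file is actually UTF-8. The sketch below compares both encodings on a made-up sample file; the film titles are invented for illustration.

```python
import os
import random
import tempfile

# Sample file standing in for 'Long films'; the titles are made up.
path = os.path.join(tempfile.mkdtemp(), "Long films")
with open(path, "w", encoding="utf-8") as f:
    f.write("Gone with the Wind\nLes Mis\u00e9rables\nOnce Upon a Time\u2026 in Hollywood\n")

# Decoding with the encoding the file was written in gives the right text.
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

# latin-1 never raises, but multi-byte UTF-8 sequences come out as mojibake.
with open(path, encoding="latin-1") as f:
    garbled = f.read().splitlines()

chosen = random.choice(lines)
print(ascii(chosen))  # ascii() keeps the output printable on any console
```

Comparing lines[1] with garbled[1] shows the difference: the first holds the accented title, the second holds two unrelated characters in place of the accent.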

About UnicodeDecodeError

I am writing a program to count words with Python 3.6. The code runs smoothly from the terminal, but if I run it in IDLE, the error below occurs:
Traceback (most recent call last):
  File "/Users/zhangchaont/python/Course Python Programming/6.7V2.py", line 122, in <module>
    main()
  File "/Users/zhangchaont/python/Course Python Programming/6.7V2.py", line 21, in main
    for line in txtFile:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 33: ordinal not in range(128)
How can I solve this?
Since there is not much information about your code, I can only suggest that, instead of codecs, you could also use the unidecode package:
https://github.com/iki/unidecode
The function below may solve your problem. Open your file with open(), call file_handle.read(), and pass the resulting string to it:
unidecode.unidecode_expect_nonascii(string)
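Because the traceback is raised while iterating over the file, the decoding fails before any later transliteration step could run. One likely cause, offered here as an assumption, is that IDLE starts with a different locale than the terminal, so open() falls back to ASCII. A minimal sketch that prints the default and then counts words with an explicit encoding; the sample file and its contents are made up.

```python
import locale
import os
import tempfile
from collections import Counter

# The default used by open() when no encoding is passed; it can differ
# between a terminal session and a GUI-launched IDLE.
print("default text encoding:", locale.getpreferredencoding(False))

# Sample input standing in for the real text file (contents are made up).
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\u3053\u3093\u306b\u3061\u306f world\nhello world\n")

counts = Counter()
# Passing encoding= makes the result independent of the environment.
with open(path, encoding="utf-8") as txt_file:
    for line in txt_file:
        counts.update(line.split())

print(counts.most_common(1))
```

If the two environments print different default encodings, passing encoding="utf-8" (or whatever the file really uses) to open() should make them behave the same.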

UnicodeDecodeError for writing file

I know that this is a very common error, but it's the first time I've encountered it when trying to write a file.
I'm using networkx to work with graphs for network analysis, and when I try to write into any format:
nx.write_gml(G, "Graph.gml")
nx.write_pajek(G, "Graph.net")
nx.write_gexf(G, "graph.gexf")
I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 2, in write_pajek
  File "/Library/Python/2.7/site-packages/networkx/utils/decorators.py", line 263, in _open_file
    result = func(*new_args, **kwargs)
  File "/Library/Python/2.7/site-packages/networkx/readwrite/pajek.py", line 100, in write_pajek
    path.write(line.encode(encoding))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
I haven't found documentation on this, so I am quite confused.
I wonder whether you can use the codecs module to solve it. Create a file object with codecs, as follows, before passing it to networkx.
For example:
import codecs
f = codecs.open("graph.gml", "w", "utf-8")
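A UnicodeDecodeError on a write is typical of Python 2 when .encode() is called on a byte string that already contains non-ASCII bytes: the string is first decoded implicitly as ASCII, and that step fails. The sketch below assumes the graph labels are UTF-8 byte strings and shows the general remedy of converting everything to text before writing through a codecs file object. It does not use networkx; the labels, the file name, and the to_text helper are hypothetical.

```python
import codecs
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical node labels: one is a UTF-8 byte string, one is already text.
labels = [b"S\xc3\xa3o Paulo", u"Berlin"]

path = os.path.join(tempfile.mkdtemp(), "labels.txt")
# codecs.open returns a file object that encodes text on write.
with codecs.open(path, "w", "utf-8") as f:
    for label in labels:
        f.write(to_text(label) + u"\n")

with codecs.open(path, "r", "utf-8") as f:
    written = f.read().splitlines()
```

Applied to the question, the same idea would mean making sure node names and attributes are unicode objects before handing the graph to the writer.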

UnicodeDecodeError in Python 2.7

I am trying to read a UTF-8 encoded XML file in Python, and I process the lines read from the file along these lines:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Here doc_content is the line read from the file and word_value is one of the strings from the same line. I get an encoding-related error on the line above whenever doc_content or word_value contains non-ASCII characters. So I tried to decode them first as UTF-8 (instead of the default ASCII), as below:
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting an error, this time a UnicodeEncodeError, as below:
Traceback (most recent call last):
  File "snippetRetriver.py", line 402, in <module>
    sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
  File "snippetRetriver.py", line 201, in getSentenceList
    next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest a suitable way to avoid this kind of encoding error in Python 2.7?
codecs.utf_8_decode(input.encode('utf8'))
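A UnicodeEncodeError raised from inside .decode() in Python 2 usually means the object was already a unicode string, so it was first encoded implicitly as ASCII before the decode could run. A sketch of one way to avoid mixing the two types, assuming the file really is UTF-8: decode once when reading, with io.open, and then call find() on text directly. The names doc_content and word_value come from the question; the file name and the sample sentence are made up.

```python
import io
import os
import tempfile

# Made-up sample document standing in for the real XML file.
path = os.path.join(tempfile.mkdtemp(), "doc.xml")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"<s>Le caf\u00e9 est ferm\u00e9. Le caf\u00e9 ouvre demain.</s>\n")

# Decode once, at the file boundary; every line is then already text.
with io.open(path, encoding="utf-8") as f:
    doc_content = f.readline()

word_value = u"caf\u00e9"
first = doc_content.find(word_value)
# Search again past the first hit; no .decode() calls are needed on text.
next_index = doc_content.find(word_value, first + 1)
```

io.open behaves the same way in Python 2.7 and Python 3, so the offsets are character offsets in both.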
