Python: Can't read file encoded in ASCII

I generated a bugreport in Android through ADB and extracted the large report file. But when I open and read that file, it prints:
>>> f = open('bugreport.txt')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 12788794: invalid start byte
>>> f = open('bugreport.txt', encoding='ascii')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 5455694: ordinal not in range(128)
It seems that neither the UTF-8 nor the ASCII codec can decode the file.
Then I checked the file encoding with two commands:
$ enca bugreport.txt
7bit ASCII characters
$ file -i bugreport.txt
bugreport.txt: text/plain; charset=us-ascii
Both report that the file is encoded in ASCII, yet I can't read it with the ascii codec.
Some other clues:
1. The interpreter above is Python 3.6.3. I tried Python 2.7.14 and it worked fine.
2. If the file is opened with errors='ignore' and encoding='ascii', it can be read, but all Chinese characters are lost.
So how can I open this peculiar file in Python 3? Can anyone help me?

In Python 3 you can specify the encoding when opening the file:
with open(file, encoding='utf-8') as f:
    data = f.read()
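Passing encoding='utf-8' alone would raise the same error here, because the first traceback shows that UTF-8 was already the default. Byte 0xef is a valid UTF-8 lead byte, while 0xc0 can never appear in valid UTF-8, so the file may be mostly UTF-8 with a few stray bytes. A minimal sketch under that assumption, using the errors argument of open() to keep the Chinese text and mark the bad bytes (the file name demo_bugreport.txt and its contents are made up for illustration):

```python
import os
import tempfile

# Hypothetical sample: two Chinese characters in valid UTF-8,
# followed by ASCII text containing one stray 0xc0 byte.
sample = "\u65e5\u5fd7".encode("utf-8") + b" ok \xc0 end\n"

path = os.path.join(tempfile.mkdtemp(), "demo_bugreport.txt")
with open(path, "wb") as f:
    f.write(sample)

# errors="replace" substitutes U+FFFD for each undecodable byte.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# errors="backslashreplace" keeps the byte value visible as an escape.
with open(path, encoding="utf-8", errors="backslashreplace") as f:
    escaped = f.read()

# ascii() keeps the output printable on any terminal.
print(ascii(replaced))
print(ascii(escaped))
```

Unlike errors='ignore' combined with encoding='ascii', this keeps every character that is valid UTF-8 and only touches the bytes that cannot be decoded.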

It's likely that the file is encoded as latin-1 or utf-16 (little-endian).
>>> bytes_ = [b'\xc0', b'\xef']
>>> for b in bytes_:
...     print(repr(b), b.decode('latin-1'))
...
b'\xc0' À
b'\xef' ï
>>> bytes_ = [b'\xc0\x00', b'\xef\x00']
>>> for b in bytes_:
...     print(repr(b), b.decode('utf-16le'))
...
b'\xc0\x00' À
b'\xef\x00' ï
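Both guesses can be checked against the raw bytes. Latin-1 maps every byte value to a character, so a successful latin-1 decode does not by itself confirm the encoding, and tools such as file may sample only part of a large file. A minimal sketch for dumping the bytes around the position reported in the traceback (the helper name bytes_around and the demo file are hypothetical):

```python
import os
import tempfile


def bytes_around(path, offset, width=8):
    """Return the raw bytes from offset - width up to offset + width."""
    start = max(0, offset - width)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(offset - start + width)


# Hypothetical demo file; with the real report, pass 'bugreport.txt'
# and the position from the traceback (e.g. 12788794).
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"A" * 20 + b"\xc0" + b"B" * 20)

chunk = bytes_around(path, 20, width=4)
print(chunk)
```

If the neighbouring bytes are plain ASCII with no interleaved NUL bytes, UTF-16 becomes an unlikely explanation.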

Related

How to decode Windows-1254 file to a readable format in Python

I have a file that contains Windows-1254 encoded data, and I want to decode it to a readable format with Python, but I have had no success so far.
I tried this code but I am getting an error:
file = open('file_name', 'rb')
content_of_file = file.read()
file.close()
decoded_data = content_of_file.decode('cp1254') # here I am trying to decode that file
print(decoded_data)
This is the error that I get:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python3.10/encodings/cp1254.py", line 15, in decode
return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 658: character maps to <undefined>
I determined that the file is encoded in Windows-1254 by using chardet.detect(data)['encoding'], so keep that in mind.
Does anyone know how I can decode that data into a form I can understand?
This is the file content, in case someone needs it:
b'c\x00\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00C\x00\x00\x00s3\x03\x00\x00d\x01\x00\x01\x03\x03\x03\t\x02\x02\t\x03\x03\x03\t\x02\x02\td\x02\x00d\x00\x00l\x00\x00}\x01\x00d\x02\x00d\x00\x00l\x01\x00}\x02\x00t\x02\x00\x83\x00\x00}\x03\x00d\x03\x00}\x00\x00t\x03\x00|\x03\x00\x83\x01\x00d\x04\x00k\x03\x00rZ\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\xd0\x02|\x03\x00d\x06\x00\x19d\x07\x00k\x03\x00ry\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x08\x00\x19\x83\x01\x00d\t\x00Ad\n\x00k\x03\x00r\xa2\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x0b\x00d\x0c\x00!d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19j\x06\x00d\r\x00\x83\x01\x00s\xd4\x00d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19j\x08\x00\x83\x00\x00\x0co\x07\x01|\x03\x00t\x07\x00d\x0e\x00\x83\x01\x00\x19d\x0f\x00j\t\x00d\x10\x00\x83\x01\x00k\x02\x00s\x19\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00|\x01\x00j\n\x00d\x11\x00d\x04\x00\x83\x02\x00d\x06\x00\x14d\x11\x00\x17\x19j\x0b\x00d\x12\x00\x83\x01\x00d\x13\x00k\x03\x00rU\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d\x14\x00\x19\x83\x01\x00t\x05\x00|\x03\x00d\x15\x00\x19\x83\x01\x00Ad\x16\x00k\x03\x00s\x89\x01|\x03\x00d\x14\x00\x19d\x17\x00k\x03\x00r\x98\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x18\x00\x19d\x19\x00j\t\x00d\x1a\x00\x83\x01\x00k\x03\x00r\xc0\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d\x1b\x00\x19d\x1c\x00j\x0c\x00d\x1c\x00d\x1d\x00\x83\x02\x00k\x03\x00r\xeb\x01d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00t\r\x00d\x1e\x00d\x0b\x00\x83\x02\x00\x19t\x0e\x00d\x11\x00\x83\x01\x00d\x1f\x00\x17j\t\x00d\x10\x00\x83\x01\x00k\x03\x00r&\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d 
\x00d!\x00!j\x0c\x00d"\x00d\x00\x00d\x00\x00d\x02\x00\x85\x03\x00\x19d\x1c\x00\x83\x02\x00d\x1c\x00k\x03\x00ra\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d!\x00\x19t\x0f\x00i\x00\x00\x83\x01\x00j\x10\x00d\x0b\x00\x19k\x03\x00r\x8d\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d#\x00\x19t\x0e\x00d\x06\x00\x83\x01\x00k\x03\x00r\xb2\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x03\x00d$\x00\x19d%\x00k\x07\x00r\xd1\x02d&\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00|\x02\x00j\x01\x00|\x03\x00d'\x00\x19\x83\x01\x00j\x11\x00\x83\x00\x00d(\x00k\x03\x00r\xff\x02d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00t\x05\x00|\x03\x00d)\x00\x19\x83\x01\x00t\x07\x00d*\x00\x83\x01\x00k\x03\x00r*\x03d\x05\x00GHt\x04\x00\x83\x00\x00\x01n\x00\x00d+\x00GHd\x00\x00S(,\x00\x00\x00Ns\x1a\x00\x00\x00#Import random,md5 modulesi\xff\xff\xff\xffs\x14\x00\x00\x00#Start checking flagi\x12\x00\x00\x00s\x05\x00\x00\x00NO!!!i\x00\x00\x00\x00t\x01\x00\x00\x00Mi\x01\x00\x00\x00i\xff\x00\x00\x00i\xaa\x00\x00\x00i\x02\x00\x00\x00i\x04\x00\x00\x00t\x02\x00\x00\x00HCs\x10\x00\x00\x00([] > {}) << -~1t\x02\x00\x00\x007bt\x03\x00\x00\x00hexi\x05\x00\x00\x00t\x06\x00\x00\x00base64s\x05\x00\x00\x00ZA==\ni\x06\x00\x00\x00i\x07\x00\x00\x00i\x1a\x00\x00\x00t\x01\x00\x00\x00ii\x08\x00\x00\x00t\x01\x00\x00\x00nt\x05\x00\x00\x00rot13i\t\x00\x00\x00t\x00\x00\x00\x00t\x01\x00\x00\x00st\x04\x00\x00\x001010t\x01\x00\x00\x00fi\x0b\x00\x00\x00i\r\x00\x00\x00t\x02\x00\x00\x00ypi\x0e\x00\x00\x00i\x0f\x00\x00\x00t\x08\x00\x00\x00dddddddds\x06\x00\x00\x00NO!!!!i\x10\x00\x00\x00t 
\x00\x00\x00eccbc87e4b5ce2fe28308fd9f2a7baf3i\x11\x00\x00\x00sM\x00\x00\x00~(~(~((((~({}<[])<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))<<({}<[]))s\n\x00\x00\x00Bazinga!!!(\x12\x00\x00\x00t\x06\x00\x00\x00randomt\x03\x00\x00\x00md5t\t\x00\x00\x00raw_inputt\x03\x00\x00\x00lent\x04\x00\x00\x00quitt\x03\x00\x00\x00ordt\n\x00\x00\x00startswitht\x04\x00\x00\x00evalt\x07\x00\x00\x00isalphat\x06\x00\x00\x00decodet\x07\x00\x00\x00randintt\x06\x00\x00\x00encodet\x07\x00\x00\x00replacet\x03\x00\x00\x00intt\x03\x00\x00\x00strt\x04\x00\x00\x00typet\x08\x00\x00\x00__name__t\t\x00\x00\x00hexdigest(\x04\x00\x00\x00t\x07\x00\x00\x00commentR\x0f\x00\x00\x00R\x10\x00\x00\x00t\x03\x00\x00\x00ans(\x00\x00\x00\x00(\x00\x00\x00\x00sA\x00\x00\x00XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXt\x0c\x00\x00\x00get_the_flag\x0b\x00\x00\x00sl\x00\x00\x00\x00\x02\x12\x01\x0c\x01\x0c\x01\t\x02\x06\x01\x12\x01\x05\x01\n\x02\x10\x01\x05\x01\n\x01\x1a\x01\x05\x01\n\x01#\x01\x05\x01\n\x016\x01\x05\x01\n\x01-\x01\x05\x01\n\x014\x01\x05\x01\n\x01\x19\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01,\x01\x05\x01\n\x01,\x01\x05\x01\n\x01\x1d\x01\x05\x01\n\x01\x16\x01\x05\x01\n\x01\x10\x01\x05\x01\n\x01\x1f\x01\x05\x01\n\x01\x1c\x01\x05\x01\n\x01'
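The dump above is full of NUL and other control bytes, mixed with readable names such as raw_input and get_the_flag, which suggests binary data rather than text in any single-byte encoding; an encoding detector's guess may simply be wrong for such input. A rough heuristic sketch for spotting this before decoding (the helper name looks_binary and the 10% threshold are arbitrary assumptions, not part of any library):

```python
def looks_binary(data, threshold=0.10):
    """Guess whether data is binary: many control bytes is a strong hint."""
    if not data:
        return False
    # Count control bytes other than tab, newline, and carriage return.
    control = sum(1 for b in data if b < 0x20 and b not in (0x09, 0x0A, 0x0D))
    return control / len(data) > threshold


# The first bytes of the content shown in the question.
head = b'c\x00\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00C\x00\x00\x00s3\x03\x00\x00d\x01\x00'

print(looks_binary(head))
print(looks_binary(b"plain text\n"))
```

If the check reports binary data, no text codec will produce a meaningful result, and the file format itself has to be identified instead.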

Emoji support reading from file in python? [duplicate]

I need to analyse a text file in Tamil (UTF-8 encoded). I'm using the nltk package for Python in IDLE. When I try to read the text file, I get the error below. How do I avoid this?
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
Traceback (most recent call last):
File "<pyshell#2>", line 1, in <module>
corpus = open('C:\\Users\\Customer\\Desktop\\DISSERTATION\\ettuthokai.txt').read()
File "C:\Users\Customer\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33: character maps to <undefined>
Since you are using Python 3, just add the encoding parameter to open():
corpus = open(
    r"C:\Users\Customer\Desktop\DISSERTATION\ettuthokai.txt", encoding="utf-8"
).read()
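A likely cause, judging from the traceback: without an encoding argument, open() used the Windows default cp1252, which has no character for byte 0x8d, while UTF-8 encoded Tamil text contains that byte frequently (for example in U+0BCD, the Tamil virama). A small sketch that reproduces the error on a single Tamil word, written with escape sequences:

```python
# The Tamil word for "Tamil", written with escapes to keep this snippet ASCII.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
raw = word.encode("utf-8")
print(raw)  # the last byte is 0x8d, from U+0BCD (TAMIL SIGN VIRAMA)

try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError as exc:
    # cp1252 has no mapping for byte 0x8d.
    cp1252_ok = False
    print(exc)

# Decoding with the encoding the text was written in round-trips cleanly.
roundtrip = raw.decode("utf-8")
```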

Error when printing random line from text file

I need to print a random line from the file "Long films".
My code is:
import random
with open('Long films') as f:
    lines = f.readlines()
print(random.choice(lines))
But it prints this error:
Traceback (most recent call last):
line 3, in <module>
lines = f.readlines()
line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 36: ordinal not in range(128)
What do I need to do in order to avoid this error?
The problem is not with printing; it is with reading. Your file seems to contain some non-ASCII characters. Try opening it with a different encoding:
with open('Long films', encoding='latin-1') as f:
    ...
Also, have you changed your locale settings? Have you set any encoding scheme at the top of your file? When no encoding is passed, Python 3 decodes the file with the locale's preferred encoding (see locale.getpreferredencoding()), which is usually UTF-8 on Linux and macOS, so an 'ascii' codec error suggests that the locale is set to C/POSIX.
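Note that byte 0xe2 is also the lead byte of many three-byte UTF-8 sequences, including typographic quotes and dashes, so the file may well be UTF-8 rather than Latin-1. Latin-1 never fails to decode, but it would then show mojibake. A small sketch of the difference, using a made-up line (the title text is an assumption, not taken from the question):

```python
# A made-up line containing a right single quotation mark (U+2019).
raw = "Schindler\u2019s List\n".encode("utf-8")
print(hex(raw[9]))  # 0xe2, the lead byte of the three-byte sequence

as_utf8 = raw.decode("utf-8")      # one character for the quote
as_latin1 = raw.decode("latin-1")  # three unrelated characters instead

# ascii() keeps the output printable on any terminal.
print(ascii(as_utf8))
print(ascii(as_latin1))
```

If the latin-1 result shows sequences like the second one, encoding='utf-8' is probably the better choice for this file.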

UnicodeDecodeError, utf-8 invalid continuation byte

I'm trying to extract lines from a log file using this code:
with open('fichier.01') as f:
    content = f.readlines()
print(content)
but it always fails with this error:
Traceback (most recent call last):
File "./parsepy", line 4, in <module>
content = f.readlines()
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2213: invalid continuation byte
How can I fix it?
Try one of the following:
open('fichier.01', 'rb')
open('fichier.01', encoding='utf-8')
open('fichier.01', encoding='ISO-8859-1')
You can also use the io module, although in Python 3 io.open is the same function as the built-in open:
import io
io.open('fichier.01')
This is a common error when opening files when using Python (or any language really). This is an error you will soon learn to catch.
If it's not encoded as text then you will have to open it in binary mode e.g.:
with open('fichier.01', 'rb') as f:
    content = f.readlines()
If it's encoded as something other than UTF-8 and it can be opened in text mode then open takes an encoding argument: https://docs.python.org/3.5/library/functions.html#open
You can also tell open to skip the bytes it cannot decode, at the cost of silently losing them:
with open('fichier.01', errors='ignore') as f:
    content = f.readlines()
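Byte 0xe9 is "é" in ISO-8859-1, which fits a French log file, so a lossless alternative is to try UTF-8 first and fall back to Latin-1. A minimal sketch, assuming those two candidates (the helper name read_lines_with_fallback and the demo file are hypothetical):

```python
import os
import tempfile


def read_lines_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return (lines, encoding_used)."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.readlines(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode " + path)


# Hypothetical demo file written in Latin-1; 0xe9 followed by a space
# is not valid UTF-8, so the first attempt fails.
path = os.path.join(tempfile.mkdtemp(), "fichier.demo")
with open(path, "wb") as f:
    f.write("caf\u00e9 ouvert\n".encode("latin-1"))

lines, used = read_lines_with_fallback(path)
print(used, ascii(lines))
```

Latin-1 should come last in the list because it accepts any byte sequence, so nothing after it would ever be tried.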

Getting UnicodeDecodeError while accessing csv file

Input file : chars.csv :
4,,x,,2,,9.012,2,,,,
6,,y,,2,,12.01,±4,,,,
7,,z,,2,,14.01,_3,,,,
When I try to parse this file, I get this error even after specifying utf-8 encoding.
>>> f=open('chars.csv',encoding='utf-8')
>>> f.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.2/codecs.py", line 300, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 36: invalid start byte
How to correct this error?
Version: Python 3.2.3
Your input file is clearly not UTF-8 encoded, so you have at least these options:
1. f=open('chars.csv', encoding='utf-8', errors='ignore'), if the file is mostly UTF-8 and you don't mind a small loss of data. For other values of the errors parameter, check the manual.
2. Simply use the proper encoding, such as latin-1, if you know it.
This is not UTF-8 encoding. The UTF-8 encoding of ± is \xC2\xB1, and that of the control character U+0083 is \xC2\x83. As RobertT suggested, try Latin-1:
f=open('chars.csv',encoding='latin-1')
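In Latin-1 the single byte 0xb1 is "±", which matches the error message. A short sketch that parses such a file with the csv module, assuming the file really is Latin-1 (the temporary demo file stands in for chars.csv):

```python
import csv
import os
import tempfile

# Two rows from the question, with the plus-minus sign stored as byte 0xb1.
raw = b"4,,x,,2,,9.012,2,,,,\n6,,y,,2,,12.01,\xb14,,,,\n"

path = os.path.join(tempfile.mkdtemp(), "chars_demo.csv")
with open(path, "wb") as f:
    f.write(raw)

# newline="" is the setting the csv module documentation recommends.
with open(path, encoding="latin-1", newline="") as f:
    rows = list(csv.reader(f))

# ascii() keeps the output printable on any terminal.
print(ascii(rows[1][7]))
```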
