How to decode this byte string from a file?

I used code provided by someone on Stack Overflow to read a binary file:
with open('OutputFile', 'rb') as f:
    text = f.read()
I printed text and it appeared as shown below. Now I need to convert it into a string of binary digits, like 01010010....
I tried text.decode('utf-8'), but it raised an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 0: invalid start byte
How can I solve this? How do I turn the bytes into a binary string?
b'\x88\xc6\xe7\x1a3\'1(\xd2\xb7*\xa5{a\xac\x0f\xabf\\\xe9Z\xa4\xb0\x116v\xe0}\xc5\xb8\xe1P\xf1\x01\xd7\xf63\x11\xec\xe7\xb7\xbeG\xaa\xdf\xd87\xdaK\x9c?\xe8\x0e\x84\xce|%\xcb\xc0\x1d\xe4\xe4\x02#c\xb1\xd0o\x1da\x87\nW\xb9\xc9\xb2\x08\x1c\xffP,4\x86\'\xecx$\x05\xa2\x1b\x8d\xc9\xe9\x12(t\xcba\xa5\xc9H\xb5[X\xf9\xdd\xa7q\x8e\x92r\xe2T\xe0a*\x13$\xdaS\x8bx\r\xc1\xa9~\xb7\xd8-\xef&\xad\xa8\xdd\xe5\x1d#\x99`\xa0\xb7\xbc\xe1\x96;:#SG\xea\xd2]\xfc\xc02\xb9\x01\xbe%\xbb?\x99\x0e\xa0{7Z\xa3\xa4\xcb\xe4\xe0C%1\xaf\xcb\x1e\xb3\xc6\xb1\xd7\xf8,,\x08\xc0\'\xbf\xde\xb5\xe3)~$\x05\x8c>\x88\xb8`\xce\xe3\x1a\x97zs\x05\x91\xcd\xee\xb9^S\x8c\x8f=\x0f\xe6\xf5TZ\xb24c\xf0YZ\xac\xf9\x87\x05,\x04\xf7>\x0c\xf6c/t\xbayB\x06\x0cd\x0f\x15\x1eZ\x9c#\xb7\\-\x16A\x06#\r\x12\x19\x85YV\xb3\x7f$\xc4}\xab\xda\xf5\xebO\xcf#/\x1ea\xa7\x03E\xb3\xef4\x11\x05hCJF(\x93\r\xb9\xa6\x84\x15\x8a\xda\xbe\x12\xff\xd2\xa5\x19y\xea\xb5H\xbd\x97\xc8\x81\xd5\'\xadN\xd8s\x0c\x0f\x97\xcb3d\xfa\xf6&\n\xdc\xd5\xd4\x15\x87\x08\xcb\xeb\xb4\x07\xf8)IE\xfd\x1am_C\xf2x\x04a\xa8\xdc\xb3G\xa4\xeaq6O\xe6D\xb4]d\x93\x95`\xe6W\xe2w\xc8^\t\xdc\x13aJE\xafU];V\x1f\xda\x96\xc8t\xdfk\x96\xc6\xd5\xc0B.\xeb\xac)<\xa7F\xce\x0c\xf1\xf1v\x18\xba\xf6^#0\x14\x1c#\xc7r\x86\xc4\xd6\x0e\xca\x94c\xf8m!\xeb57\xc7P\t\x1a\xed\xc8#7h\xc2\x03\xd9M\xdf\xdd\x05\x7f\xecS\x1c\xd4\xca\x84\xf5\xb3\xe5<\x1f\xb5\x05\xd8$\x8dC_J\n\x89\xe7\x8b\xb7\x00\x95\xe9\x8ct!\xe8\xf3\x82|\x9f|6ORa_J\x9c*\xf9\x0b\x1emV\x91\x93#\x91+\x18^wfK\x01\xc8\xd7&[\x13\xeb[\xb8\x0b\xf0.8\xb1\t)#\xc5H\xa0O0H"n\xeaQ\xc0p\xdaZm\xf0A^\xed)m\xd2\xef[$\xce\x9d\xd9\x97\xd3K\xdc\x1c\xeb\x17\xc4\x0e-\xc7p\xe5\x7f\xcf\xa5l\x95q\xb9\xe7WB2\xb5\x8c\xf3\xdf_\xea\x02\x1cx5\x92>\xf1\xcec\x9b\x01\xd3\xa8\x89\xb6\x85O\x04\xd4W\xe5\xfa\t\xe6-\xaa?r\x166\xe6\xed\x80\xf4\xe2\xe6\x83\x0f\xae!\xc7C\xff#'

The data is already in binary form internally, since you read the file in mode "rb". It is not UTF-8 text, so there is nothing to decode. Here's how to print it as 0s and 1s:
with open('OutputFile', 'rb') as f:
    data = f.read()
binary = ''.join('{:08b}'.format(byte) for byte in data)
print(binary)
Output:
10001000110001101110011100011010001100110010011100110001001010001101001010110111001010101010010101111011011000011010110000001111101010110110011001011100111010010101101010100100101100000001000100110110011101101110000001111101110001011011100011100001010100001111000100000001110101111111011000110011000100011110110011100111101101111011111001000111101010101101111111011000001101111101101001001011100111000011111111101000000011101000010011001110011111000010010111001011110000000001110111100100111001000000001000100011011000111011000111010000011011110001110101100001100001110000101001010111101110011100100110110010000010000001110011111111010100000010110000110100100001100010011111101100011110000010010000000101101000100001101110001101110010011110100100010010001010000111010011001011011000011010010111001001010010001011010101011011010110001111100111011101101001110111000110001110100100100111001011100010010101001110000001100001001010100001001100100100110110100101001110001011011110000000110111000001101010010111111010110111110110000010110111101111001001101010110110101000110111011110010100011101001000111001100101100000101000001011011110111100111000011001011000111011001110100100000001010011010001111110101011010010010111011111110011000000001100101011100100000001101111100010010110111011001111111001100100001110101000000111101100110111010110101010001110100100110010111110010011100000010000110010010100110001101011111100101100011110101100111100011010110001110101111111100000101100001011000000100011000000001001111011111111011110101101011110001100101001011111100010010000000101100011000011111010001000101110000110000011001110111000110001101010010111011110100111001100000101100100011100110111101110101110010101111001010011100011001000111100111101000011111110011011110101010101000101101010110010001101000110001111110000010110010101101010101100111110011000011100000101001011000000010011110111001111100000110011110110011000110010111101110100101110100111100101000010000001100000110001100100000011110001010100011110
01011010100111000010001110110111010111000010110100010110010000010000011001000000000011010001001000011001100001010101100101010110101100110111111100100100110001000111110110101011110110101111010111101011010011111100111101000000001011110001111001100001101001110000001101000101101100111110111100110100000100010000010101101000010000110100101001000110001010001001001100001101101110011010011010000100000101011000101011011010101111100001001011111111110100101010010100011001011110011110101010110101010010001011110110010111110010001000000111010101001001111010110101001110110110000111001100001100000011111001011111001011001100110110010011111010111101100010011000001010110111001101010111010100000101011000011100001000110010111110101110110100000001111111100000101001010010010100010111111101000110100110110101011111010000111111001001111000000001000110000110101000110111001011001101000111101001001110101001110001001101100100111111100110010001001011010001011101011001001001001110010101011000001110011001010111111000100111011111001000010111100000100111011100000100110110000101001010010001011010111101010101010111010011101101010110000111111101101010010110110010000111010011011111011010111001011011000110110101011100000001000010001011101110101110101100001010010011110010100111010001101100111000001100111100011111000101110110000110001011101011110110010111100100000000110000000101000001110000100011110001110111001010000110110001001101011000001110110010101001010001100011111110000110110100100001111010110011010100110111110001110101000000001001000110101110110111001000001000110011011101101000110000100000001111011001010011011101111111011101000001010111111111101100010100110001110011010100110010101000010011110101101100111110010100111100000111111011010100000101110110000010010010001101010000110101111101001010000010101000100111100111100010111011011100000000100101011110100110001100011101000010000111101000111100111000001001111100100111110111110000110110010011110101001001100001010111110100101010011100001010101111100100001011
000111100110110101010110100100011001001100100011100100010010101100011000010111100111011101100110010010110000000111001000110101110010011001011011000100111110101101011011101110000000101111110000001011100011100010110001000010010010100101000000110001010100100010100000010011110011000001001000001000100110111011101010010100011100000001110000110110100101101001101101111100000100000101011110111011010010100101101101110100101110111101011011001001001100111010011101110110011001011111010011010010111101110000011100111010110001011111000100000011100010110111000111011100001110010101111111110011111010010101101100100101010111000110111001111001110101011101000010001100101011010110001100111100111101111101011111111010100000001000011100011110000011010110010010001111101111000111001110011000111001101100000001110100111010100010001001101101101000010101001111000001001101010001010111111001011111101000001001111001100010110110101010001111110111001000010110001101101110011011101101100000001111010011100010111001101000001100001111101011100010000111000111010000111111111101000000
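The same conversion can be checked on a small sample and reversed. This is a minimal sketch, assuming a two-byte sample in place of the file contents; the helper names to_bits and from_bits are made up for illustration:

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, most significant bit first."""
    return ''.join('{:08b}'.format(byte) for byte in data)


def from_bits(bits: str) -> bytes:
    """Inverse of to_bits: pack every group of eight digits back into a byte."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


sample = b'\x88\xc6'  # first two bytes of the data shown in the question
bits = to_bits(sample)
print(bits)  # 1000100011000110
restored = from_bits(bits)
print(restored == sample)  # True
```

The {:08b} format pads every byte to eight digits, which is what makes the bit string reversible.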

Related

How to figure out how this file was encoded? Can't seem to open it properly to read the contents

I can't seem to open this .dat file to read ECG data from it. Here's the code I tried:
import chardet
with open("rec_1.dat", "rb") as f:
    result = chardet.detect(f.read())
with open("rec_1.dat", "r", encoding=result["encoding"]) as f:
    data = f.read()
print(data)
This is the error I keep getting:
"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1396: character maps to <undefined>"
I can't seem to decode this file, and the website (https://physionet.org/content/ecgiddb/1.0.0/) doesn't give me any hints on how it was encoded either.

Opening .dat file UnicodeDecodeError: invalid start byte

I am using Python to convert a .dat file (which you can find here) to csv in order for me to use it later in numpy or csv reader.
import csv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./i2019.dat").readlines()]
# write it as a new CSV file
with open("./i2019.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
But this results in an error message of
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 68: invalid start byte
Any help would be appreciated!

It seems your .dat file uses the Shift JIS (Japanese) encoding, so you can pass shift_jis as the encoding argument to open():
datContent = [i.strip().split() for i in open("./i2019.dat", encoding='shift_jis').readlines()]
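Put together, the read-then-write flow might look like the sketch below. It assumes a small made-up sample in place of i2019.dat, and it opens the CSV in text mode with newline="" because csv.writer in Python 3 expects a text file rather than one opened with "wb":

```python
import csv
import os
import tempfile

# Hypothetical sample standing in for i2019.dat: whitespace-separated
# fields, written out in Shift JIS so the example is self-contained.
workdir = tempfile.mkdtemp()
dat_path = os.path.join(workdir, "sample.dat")
csv_path = os.path.join(workdir, "sample.csv")
with open(dat_path, "w", encoding="shift_jis") as f:
    f.write("東京 1 2\n大阪 3 4\n")

# Read as text with the source encoding stated explicitly.
with open(dat_path, encoding="shift_jis") as f:
    rows = [line.split() for line in f]

# In Python 3, csv.writer expects a text-mode file opened with newline="".
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

print(rows)
```

Writing the CSV as UTF-8 is a choice made for this sketch; any encoding that can represent the characters would do.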

Process unicode strings in python

I am using a fastText pre-trained model based on English Wikipedia. It works as expected...
https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_english.ipynb
But when I try the same code with some other language, I get an error as shown on this page...
https://github.com/shantanuo/pandas_examples/blob/master/nlp/fasttext_marathi.ipynb
The error is related to unicode:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 15: invalid start byte
I tried to open the file using Raw Binary option. I changed the function load_words_raw in load.py file:
with open(file_path, 'rb') as f:
And now I get a different error:
ValueError: could not convert string to float: b'\x00l\x02'
I have no idea how to handle this.

You should change the second line of the notebook to:
#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.vec.gz
That points to the .vec (text) file instead of the .bin file:
#!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.mr.300.bin.gz
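The difference matters because a .vec file is plain text, while a .bin file is a binary model that cannot be decoded as UTF-8 or parsed with float(). Below is a minimal sketch of reading the text format, assuming the usual layout (a header line with the word count and dimension, then one word and its values per line). The sample data and the parse_vec helper are made up for illustration and are not the notebook's load.py:

```python
import io


def parse_vec(stream):
    """Parse .vec-style text: header 'count dim', then 'word v1 v2 ...' lines."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vectors[word] = values
    if len(vectors) != count:
        raise ValueError(f"expected {count} words, got {len(vectors)}")
    return vectors


# Tiny made-up sample with 2 words of dimension 3.
sample = io.StringIO("2 3\nनमस्ते 0.1 0.2 0.3\nजग -0.5 0.0 1.5\n")
vectors = parse_vec(sample)
print(vectors["जग"])  # [-0.5, 0.0, 1.5]
```

For a real file, the stream would come from open(path, encoding='utf-8') on the decompressed .vec file.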

The error mentions byte 0x80 in position 15. The file may be encoded in UTF-16.
Try this:
with open(path, encoding='utf-16') as f:
    pass  # your logic

Try this:
data: str
with open('crawl-D.txt', 'r', encoding='utf8') as file:
    data = file.read()
data will contain the whole file as a string.
Parse floats with float().

Python - Decoding Error ('ascii' codec can't decode byte 0x94 in position 19.....)

Hello :) I have a big bin file which has been gzipped (so it's a blabla.bin.gz).
I need to decompress and write it to a txt file with ascii format.
Here's my code :
import gzip
with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()
    file_content.decode("ascii")
    output = open("new_file.txt", "w", encoding="ascii")
    output.write(file_content)
    output.close()
But I got this error :
file_content.decode("ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 19: ordinal not in range(128)
I'm not so new to Python but format/coding problems have always been my greatest weakness :(
Please, could you help me?
Thank you !!!

First, there is no reason to decode anything only to immediately write it back as raw bytes. So a simpler (and more robust) implementation could be:
with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()
with open("new_file.txt", "wb") as output:  # just directly write raw bytes
    output.write(file_content)
If you really want to decode but are unsure of the encoding, you could use Latin1. Every byte is valid in Latin1 and is translated to the Unicode character with the same value. So whatever the byte string bs is, bs.decode('Latin1').encode('Latin1') is just a copy of bs.
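The Latin1 round trip can be verified over every possible byte value. A short sketch, where the variable names are only for illustration:

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to the Unicode code point N, so decoding never fails...
text = all_bytes.decode('latin1')

# ...and encoding the result gives back exactly the original bytes.
round_trip = text.encode('latin1')
print(round_trip == all_bytes)  # True
print([ord(ch) for ch in text[:3]])  # [0, 1, 2]
```

Note that this only guarantees a lossless round trip; the decoded text is meaningful only if the data really was Latin1 text.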
Finally, if you really need to filter out all non-ASCII characters, you could use the errors parameter of decode:
file_content = file_content.decode("ascii", errors="ignore")  # just remove any non-ASCII byte
or:
with gzip.open("GoogleNews-vectors-negative300.bin.gz", "rb") as f:
    file_content = f.read()
file_content = file_content.decode("ascii", errors="replace")  # non-ASCII bytes are
                            # replaced with the U+FFFD replacement character
output = open("new_file.txt", "w", encoding="ascii", errors="replace")  # non-ASCII chars
                            # are replaced with a question mark "?"
output.write(file_content)
output.close()
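To see what the two error handlers do, here is a small sketch on a made-up byte string (not taken from the GoogleNews file):

```python
raw = b'caf\xe9 \x94ok'  # made-up sample; 0xe9 and 0x94 are not ASCII

# errors="ignore" silently drops the offending bytes.
ignored = raw.decode('ascii', errors='ignore')
print(ignored == 'caf ok')  # True

# errors="replace" substitutes U+FFFD for each undecodable byte.
replaced = raw.decode('ascii', errors='replace')
print(replaced == 'caf\ufffd \ufffdok')  # True

# Encoding U+FFFD back to ASCII with errors="replace" turns it into "?".
encoded = replaced.encode('ascii', errors='replace')
print(encoded)  # b'caf? ?ok'
```

Both handlers lose information, so neither result can be turned back into the original bytes.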

'utf-8' codec can't decode byte 0x80

I'm trying to download a BVLC-trained model and I'm stuck with this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte
I think it's because of the following function (complete code)
# Closure-d function for checking SHA1.
def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
    with open(filename, 'r') as f:
        return hashlib.sha1(f.read()).hexdigest() == sha1
Any idea how to fix this?

You are opening a file that is not UTF-8 encoded, while the default encoding for your system is set to UTF-8.
Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require you pass in bytes:
with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1
Note the addition of b in the file mode.
See the open() documentation:
mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
and from the hashlib module documentation:
You can now feed this object with bytes-like objects (normally bytes) using the update() method.
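Because update() accepts bytes incrementally, a large model file can also be hashed in chunks rather than read in one call. A minimal sketch, assuming a made-up payload written to a temporary file; the sha1_of_file helper and its chunk_size default are illustrative choices, not part of the original script:

```python
import hashlib
import os
import tempfile


def sha1_of_file(filename, chunk_size=65536):
    """Hash a file's raw bytes without loading it all into memory."""
    digest = hashlib.sha1()
    with open(filename, 'rb') as f:  # binary mode: no decoding happens
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Made-up binary payload containing bytes that are invalid as UTF-8.
payload = b'\x80\xb8\xff' * 50000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

checksum = sha1_of_file(tmp.name)
print(checksum == hashlib.sha1(payload).hexdigest())  # True
os.remove(tmp.name)
```

The chunked digest matches the one-shot digest because SHA-1 over concatenated updates equals SHA-1 over the whole input.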

You didn't specify to open the file in binary mode, so f.read() is trying to read the file as a UTF-8-encoded text file, which doesn't seem to be working. But since we take the hash of bytes, not of strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open it, and then read it, as a binary file.
>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte
but
>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325

There is not a single hint in the documentation or the source code, so I have no clue why, but adding the b character to the mode (I guess for binary) works (TensorFlow version 1.1.0):
image_data = tf.gfile.FastGFile(filename, 'rb').read()
For more information, check out: gfile
