Opening .dat file UnicodeDecodeError: invalid start byte - python

I am using Python to convert a .dat file (which you can find here) to CSV so that I can use it later with NumPy or the csv reader.
import csv
# read flash.dat to a list of lists
datContent = [i.strip().split() for i in open("./i2019.dat").readlines()]
# write it as a new CSV file
with open("./i2019.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(datContent)
But this results in an error message of
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 68: invalid start byte
Any help would be appreciated!

It seems your .dat file uses the Shift JIS (Japanese) encoding.
You can therefore pass shift_jis as the encoding argument to the open() function:
datContent = [i.strip().split() for i in open("./i2019.dat", encoding='shift_jis').readlines()]
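For completeness, here is a minimal sketch of the whole conversion on Python 3, assuming the file really is Shift JIS. The helper name dat_to_csv and the sample data are made up for illustration. Note that on Python 3 csv.writer expects a text-mode file opened with newline="", so the output is opened with "w" rather than "wb".

```python
import csv
import os
import tempfile


def dat_to_csv(dat_path, csv_path, encoding="shift_jis"):
    """Split each line of dat_path on whitespace and write it as a CSV row."""
    with open(dat_path, encoding=encoding) as src:
        rows = [line.split() for line in src]
    # csv.writer on Python 3 needs a text-mode file opened with newline="".
    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(rows)
    return len(rows)


# Self-contained demo using a small, made-up Shift JIS file.
with tempfile.TemporaryDirectory() as tmp:
    dat_path = os.path.join(tmp, "sample.dat")
    csv_path = os.path.join(tmp, "sample.csv")
    with open(dat_path, "wb") as f:
        f.write("東京 1 2\n大阪 3 4\n".encode("shift_jis"))
    row_count = dat_to_csv(dat_path, csv_path)
    with open(csv_path, encoding="utf-8", newline="") as f:
        converted = f.read()
```

The output is written as UTF-8 here; that choice is an assumption, so pass a different encoding to the second open() if a downstream tool needs one.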

Related

How to figure out how this file was encoded? Can't seem to open it properly to read the contents

I can't seem to open this .dat file to read ECG data from it.
import chardet
with open("rec_1.dat", "rb") as f:
    result = chardet.detect(f.read())
with open("rec_1.dat", "r", encoding=result["encoding"]) as f:
    data = f.read()
print(data)
That is the code I tried, and this is the error I keep getting:
"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1396: character maps to <undefined>"
I can't seem to decode this file, and the website (https://physionet.org/content/ecgiddb/1.0.0/) doesn't give me any hints about how it was encoded either.

How to decode this byte string from a file?

I used code provided by someone on Stack Overflow to read a binary file. The code is:
with open('OutputFile', 'rb') as f:
    text = f.read()
I printed text and it appeared as shown below. Now I need to convert it into a string of binary digits, like 01010010....
I used text.decode('utf-8') to decode the text, but it raised an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 0: invalid start byte
How can I solve this issue? How can I read the data as binary strings?
b'\x88\xc6\xe7\x1a3\'1(\xd2\xb7*\xa5{a\xac\x0f\xabf\\\xe9Z\xa4\xb0\x116v\xe0}\xc5\xb8\xe1P\xf1\x01\xd7\xf63\x11\xec\xe7\xb7\xbeG\xaa\xdf\xd87\xdaK\x9c?\xe8\x0e\x84\xce|%\xcb\xc0\x1d\xe4\xe4\x02#c\xb1\xd0o\x1da\x87\nW\xb9\xc9\xb2\x08\x1c\xffP,4\x86\'\xecx$\x05\xa2\x1b\x8d\xc9\xe9\x12(t\xcba\xa5\xc9H\xb5[X\xf9\xdd\xa7q\x8e\x92r\xe2T\xe0a*\x13$\xdaS\x8bx\r\xc1\xa9~\xb7\xd8-\xef&\xad\xa8\xdd\xe5\x1d#\x99`\xa0\xb7\xbc\xe1\x96;:#SG\xea\xd2]\xfc\xc02\xb9\x01\xbe%\xbb?\x99\x0e\xa0{7Z\xa3\xa4\xcb\xe4\xe0C%1\xaf\xcb\x1e\xb3\xc6\xb1\xd7\xf8,,\x08\xc0\'\xbf\xde\xb5\xe3)~$\x05\x8c>\x88\xb8`\xce\xe3\x1a\x97zs\x05\x91\xcd\xee\xb9^S\x8c\x8f=\x0f\xe6\xf5TZ\xb24c\xf0YZ\xac\xf9\x87\x05,\x04\xf7>\x0c\xf6c/t\xbayB\x06\x0cd\x0f\x15\x1eZ\x9c#\xb7\\-\x16A\x06#\r\x12\x19\x85YV\xb3\x7f$\xc4}\xab\xda\xf5\xebO\xcf#/\x1ea\xa7\x03E\xb3\xef4\x11\x05hCJF(\x93\r\xb9\xa6\x84\x15\x8a\xda\xbe\x12\xff\xd2\xa5\x19y\xea\xb5H\xbd\x97\xc8\x81\xd5\'\xadN\xd8s\x0c\x0f\x97\xcb3d\xfa\xf6&\n\xdc\xd5\xd4\x15\x87\x08\xcb\xeb\xb4\x07\xf8)IE\xfd\x1am_C\xf2x\x04a\xa8\xdc\xb3G\xa4\xeaq6O\xe6D\xb4]d\x93\x95`\xe6W\xe2w\xc8^\t\xdc\x13aJE\xafU];V\x1f\xda\x96\xc8t\xdfk\x96\xc6\xd5\xc0B.\xeb\xac)<\xa7F\xce\x0c\xf1\xf1v\x18\xba\xf6^#0\x14\x1c#\xc7r\x86\xc4\xd6\x0e\xca\x94c\xf8m!\xeb57\xc7P\t\x1a\xed\xc8#7h\xc2\x03\xd9M\xdf\xdd\x05\x7f\xecS\x1c\xd4\xca\x84\xf5\xb3\xe5<\x1f\xb5\x05\xd8$\x8dC_J\n\x89\xe7\x8b\xb7\x00\x95\xe9\x8ct!\xe8\xf3\x82|\x9f|6ORa_J\x9c*\xf9\x0b\x1emV\x91\x93#\x91+\x18^wfK\x01\xc8\xd7&[\x13\xeb[\xb8\x0b\xf0.8\xb1\t)#\xc5H\xa0O0H"n\xeaQ\xc0p\xdaZm\xf0A^\xed)m\xd2\xef[$\xce\x9d\xd9\x97\xd3K\xdc\x1c\xeb\x17\xc4\x0e-\xc7p\xe5\x7f\xcf\xa5l\x95q\xb9\xe7WB2\xb5\x8c\xf3\xdf_\xea\x02\x1cx5\x92>\xf1\xcec\x9b\x01\xd3\xa8\x89\xb6\x85O\x04\xd4W\xe5\xfa\t\xe6-\xaa?r\x166\xe6\xed\x80\xf4\xe2\xe6\x83\x0f\xae!\xc7C\xff#'
The data is already in binary format internally, since you read the file in mode "rb". Here's how to print it out as 0s and 1s:
with open('OutputFile', 'rb') as f:
    data = f.read()
binary = ''.join('{:08b}'.format(byte) for byte in data)
print(binary)
Output:
10001000110001101110011100011010001100110010011100110001001010001101001010110111001010101010010101111011011000011010110000001111101010110110011001011100111010010101101010100100101100000001000100110110011101101110000001111101110001011011100011100001010100001111000100000001110101111111011000110011000100011110110011100111101101111011111001000111101010101101111111011000001101111101101001001011100111000011111111101000000011101000010011001110011111000010010111001011110000000001110111100100111001000000001000100011011000111011000111010000011011110001110101100001100001110000101001010111101110011100100110110010000010000001110011111111010100000010110000110100100001100010011111101100011110000010010000000101101000100001101110001101110010011110100100010010001010000111010011001011011000011010010111001001010010001011010101011011010110001111100111011101101001110111000110001110100100100111001011100010010101001110000001100001001010100001001100100100110110100101001110001011011110000000110111000001101010010111111010110111110110000010110111101111001001101010110110101000110111011110010100011101001000111001100101100000101000001011011110111100111000011001011000111011001110100100000001010011010001111110101011010010010111011111110011000000001100101011100100000001101111100010010110111011001111111001100100001110101000000111101100110111010110101010001110100100110010111110010011100000010000110010010100110001101011111100101100011110101100111100011010110001110101111111100000101100001011000000100011000000001001111011111111011110101101011110001100101001011111100010010000000101100011000011111010001000101110000110000011001110111000110001101010010111011110100111001100000101100100011100110111101110101110010101111001010011100011001000111100111101000011111110011011110101010101000101101010110010001101000110001111110000010110010101101010101100111110011000011100000101001011000000010011110111001111100000110011110110011000110010111101110100101110100111100101000010000001100000110001100100000011110001010100011110
01011010100111000010001110110111010111000010110100010110010000010000011001000000000011010001001000011001100001010101100101010110101100110111111100100100110001000111110110101011110110101111010111101011010011111100111101000000001011110001111001100001101001110000001101000101101100111110111100110100000100010000010101101000010000110100101001000110001010001001001100001101101110011010011010000100000101011000101011011010101111100001001011111111110100101010010100011001011110011110101010110101010010001011110110010111110010001000000111010101001001111010110101001110110110000111001100001100000011111001011111001011001100110110010011111010111101100010011000001010110111001101010111010100000101011000011100001000110010111110101110110100000001111111100000101001010010010100010111111101000110100110110101011111010000111111001001111000000001000110000110101000110111001011001101000111101001001110101001110001001101100100111111100110010001001011010001011101011001001001001110010101011000001110011001010111111000100111011111001000010111100000100111011100000100110110000101001010010001011010111101010101010111010011101101010110000111111101101010010110110010000111010011011111011010111001011011000110110101011100000001000010001011101110101110101100001010010011110010100111010001101100111000001100111100011111000101110110000110001011101011110110010111100100000000110000000101000001110000100011110001110111001010000110110001001101011000001110110010101001010001100011111110000110110100100001111010110011010100110111110001110101000000001001000110101110110111001000001000110011011101101000110000100000001111011001010011011101111111011101000001010111111111101100010100110001110011010100110010101000010011110101101100111110010100111100000111111011010100000101110110000010010010001101010000110101111101001010000010101000100111100111100010111011011100000000100101011110100110001100011101000010000111101000111100111000001001111100100111110111110000110110010011110101001001100001010111110100101010011100001010101111100100001011
000111100110110101010110100100011001001100100011100100010010101100011000010111100111011101100110010010110000000111001000110101110010011001011011000100111110101101011011101110000000101111110000001011100011100010110001000010010010100101000000110001010100100010100000010011110011000001001000001000100110111011101010010100011100000001110000110110100101101001101101111100000100000101011110111011010010100101101101110100101110111101011011001001001100111010011101110110011001011111010011010010111101110000011100111010110001011111000100000011100010110111000111011100001110010101111111110011111010010101101100100101010111000110111001111001110101011101000010001100101011010110001100111100111101111101011111111010100000001000011100011110000011010110010010001111101111000111001110011000111001101100000001110100111010100010001001101101101000010101001111000001001101010001010111111001011111101000001001111001100010110110101010001111110111001000010110001101101110011011101101100000001111010011100010111001101000001100001111101011100010000111000111010000111111111101000000
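To see the conversion on a smaller scale, here is a short sketch using the first two bytes of the data above. The function names bytes_to_bits and bits_to_bytes are made up for illustration, and the reverse direction is included only to show that no information is lost.

```python
def bytes_to_bits(data: bytes) -> str:
    """Render every byte as eight binary digits, most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits: pack each group of eight digits into a byte."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


sample = b"\x88\xc6"          # first two bytes of the file shown above
bits = bytes_to_bits(sample)
print(bits)                   # 1000100011000110
restored = bits_to_bytes(bits)
```

No call to decode() is involved at any point: decoding is only meaningful when the bytes represent text in some character encoding.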

codec can't decode byte 0x81

I have this simple bit of code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in open(filename))
I simply want to get the number of lines in the file. However I keep getting this error. I'm thinking of just skipping Python and doing it in C# ;-)
Can anyone help? I added 'utf-8' after searching for the error and read it should fix it. The file is just a simple text file, not an image. Albeit a large file. It's actually a CSV string, but I just want to get an idea of the number of lines before I start processing it.
Many thanks.
in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4344:
character maps to <undefined>
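This is an encoding problem.
Your example code opens the file twice, and the second open() call omits the encoding, so Python falls back to the platform default codec (the 'charmap' codec named in the traceback).
Iterate over the file object you already opened with the encoding instead:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in file)
Or, preferably, use a with block so the file is closed automatically:
with open(filename, "r", encoding="utf-8") as file:
    num_lines = sum(1 for line in file)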
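If only the line count is needed and the file's encoding is uncertain, one alternative is to skip decoding entirely and count lines in binary mode. This is a sketch rather than part of the answer above; the helper name count_lines and the sample bytes are made up for illustration.

```python
import os
import tempfile


def count_lines(path):
    """Count lines without decoding, so no UnicodeDecodeError can occur."""
    with open(path, "rb") as f:
        # Iterating a binary file yields chunks that end at b"\n".
        return sum(1 for _ in f)


# Self-contained demo: the byte 0x81 would break a cp1252 or UTF-8 decode.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write(b"a,b\n\x81,1\nc,2\n")
    num_lines = count_lines(path)
```

Binary mode splits only on "\n", so a file that uses bare "\r" line endings would be counted differently than in text mode.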

'utf-8' codec can't decode byte 0x8a in position 170: invalid start byte

I am trying to do this:
fh = request.FILES['csv']
fh = io.StringIO(fh.read().decode())
reader = csv.DictReader(fh, delimiter=";")
This always fails with the error in the title, and I have spent almost 8 hours on it.
Here is my understanding:
I am using Python 3, so the file fh contains bytes. I am decoding it into a string and putting it in memory via StringIO.
Then csv.DictReader() tries to read it into memory as dicts. It fails at this point.
I also tried io.StringIO(fh.read().decode('utf-8')), but I get the same error.
What am I missing?
The error occurs because the file contains non-ASCII characters whose bytes are not valid in the codec being used to decode them. One way to handle this is to decode the bytes explicitly with decode(), naming the encoding the file was actually saved in (here a is the bytes object; replace 'utf-8' if the file uses a different encoding):
a.decode('utf-8')
Alternatively, when reading from disk, open the file with an explicit encoding:
with open('filename', 'r', encoding='utf-8') as f:
    ...  # your code using f as the file object
Use 'rb' instead of 'r' if your file is binary.
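As a rough sketch of that idea for uploaded bytes, the helper below tries a short list of candidate encodings in order. The function name read_semicolon_csv and the candidate list ("utf-8-sig", then "cp1252") are assumptions for illustration; the right encodings depend on where the files come from.

```python
import csv
import io


def read_semicolon_csv(raw, encodings=("utf-8-sig", "cp1252")):
    """Decode raw bytes with the first encoding that works, then parse rows."""
    last_error = None
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError as exc:
            last_error = exc      # remember the failure and try the next one
            continue
        return list(csv.DictReader(io.StringIO(text), delimiter=";"))
    raise ValueError(f"none of {encodings} could decode the data") from last_error


# Byte 0x8a is not valid UTF-8 here, but it is the letter U+0160 in cp1252.
rows = read_semicolon_csv(b"name;qty\n\x8a;1\n")
```

In the view from the question, the bytes would come from fh.read(). A successful decode does not prove the guess is right, so it is worth checking a few non-ASCII values by eye.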

UnicodeDecodeError: save to file in python

I want to read a file, find something in it, and save the result, but when I try to save it I get an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)
Code to save to file:
fileout.write((key + ';' + nameDict[key]+ ';'+src + alt +'\n').decode('utf-8'))
What can I do to fix it?
Thank you
You are trying to concatenate unicode values with byte strings, then turn the result to unicode, to write it to a file object that most likely only takes byte strings.
Don't mix unicode and byte strings like that.
Open the file to write to with io.open() to automate encoding Unicode values, then handle only unicode in your code:
import io
with io.open(filename, 'w', encoding='utf8') as fileout:
    # code gathering stuff from BeautifulSoup
    fileout.write(u'{};{};{}{}\n'.format(key, nameDict[key], src, alt))
You may want to check out the csv module to handle writing out delimiter-separated values. If you do go that route on Python 2, you'll have to explicitly encode your columns, because its csv module works with byte strings:
import csv
with open(filename, 'wb') as fileout:
    writer = csv.writer(fileout, delimiter=';')
    # code gathering stuff from BeautifulSoup
    row = [key, nameDict[key], src + alt]
    writer.writerow([c.encode('utf8') for c in row])
If some of this data comes from other files, make sure you also decode to Unicode first; again, io.open() to read these files is probably the best option, to have the data decoded to Unicode values for you as you read.
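The code in this answer targets Python 2. On Python 3, where str is already Unicode, the same approach can be sketched as follows: open the output in text mode with an explicit encoding and let the file object do the encoding. The helper name write_rows and the sample row are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(path, rows, delimiter=";"):
    """Write rows of str values; the file object encodes them as UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as fileout:
        csv.writer(fileout, delimiter=delimiter).writerows(rows)


# Self-contained demo with one made-up row containing a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "out.csv")
    write_rows(out_path, [["key1", "café", "img.png"]])
    with open(out_path, "rb") as f:
        raw = f.read()
```

No encode() or decode() calls appear in the writing code, which keeps byte strings and Unicode strings from being mixed.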
