codec can't decode byte 0x81 - python

I have this simple bit of code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in open(filename))
I simply want to get the number of lines in the file, but I keep getting this error. I'm thinking of just skipping Python and doing it in C# ;-)
Can anyone help? I added 'utf-8' after searching for the error and reading that it should fix it. The file is just a simple text file, not an image, albeit a large one. It's actually CSV data, but I just want to get an idea of the number of lines before I start processing it.
Many thanks.
The traceback is:
in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4344: character maps to <undefined>

It seems to be an encoding problem.
In your example code, you are opening the file twice, and the second call doesn't include the encoding.
Try the following code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in file)
Or, better, using a with statement:
with open(filename, "r", encoding="utf-8") as file:
    num_lines = sum(1 for line in file)
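As an aside (my suggestion, not part of the original answer): since you only need a line count, you can also sidestep decoding entirely by counting the lines in binary mode, which is handy for a large CSV:
# Counting in binary mode never decodes the bytes, so no UnicodeDecodeError can occur.
with open(filename, "rb") as f:
    num_lines = sum(1 for _ in f)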

Related

How to figure out how this file was encoded? Can't seem to open it properly to read the contents

I can't seem to open this .dat file to read ECG data from it.
import chardet
with open("rec_1.dat", "rb") as f:
    result = chardet.detect(f.read())
with open("rec_1.dat", "r", encoding=result["encoding"]) as f:
    data = f.read()
    print(data)
Here's the code I tried, and this is the error I keep getting:
"UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1396: character maps to <undefined>"
I can't seem to decode this file, and the website (https://physionet.org/content/ecgiddb/1.0.0/) doesn't give me any hints about how it was encoded either.
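One thing worth checking (a sketch of my own, with an illustrative confidence threshold): chardet.detect also returns a confidence score, and a missing encoding or a very low confidence usually means the .dat file contains binary data rather than text, in which case decoding it as text will keep failing.
import chardet

# Read the raw bytes once and ask chardet for its guess plus a confidence score.
with open("rec_1.dat", "rb") as f:
    raw = f.read()

result = chardet.detect(raw)
print(result)  # e.g. {'encoding': ..., 'confidence': ..., 'language': ...}

# Illustrative threshold: only try to decode when chardet is reasonably sure.
if result["encoding"] and result["confidence"] > 0.8:
    data = raw.decode(result["encoding"])
    print(data[:200])
else:
    # Low confidence suggests binary signal data, not text; keep it as bytes.
    print("Looks binary:", len(raw), "bytes")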

Script crashes when trying to read a specific Japanese character from file

I was trying to save some Japanese characters from a text file into a string. Most characters, like "道", cause no problems, but others, like "坂", don't work. When I try to read them, my script crashes.
Do I have to use a specific encoding while reading the file?
That's my code btw:
with open(path, 'r') as file:
    lines = [line.rstrip() for line in file]
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 310: character maps to <undefined>
You have to specify the encoding when working with non-ASCII text, like this:
file = open(filename, encoding="utf8")
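Applied to your snippet, that would look something like this (assuming the file really is UTF-8; if it was saved in another encoding such as Shift-JIS, pass that name instead):
# Assumes the file is UTF-8; use encoding='shift_jis' if it was saved as Shift-JIS.
with open(path, 'r', encoding='utf-8') as file:
    lines = [line.rstrip() for line in file]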

Encoding Error While Merging Multiple Text Files in Python

I want to merge multiple files using Python 3. All the files are in a single folder with a .txt extension.
The folder contains files whose names start with special characters like a dot (.) and braces (), etc. The code and the dataset are in separate folders. Please help.
What I have tried is as follows:
#encoding:utf-8
import os
from pprint import pprint
path = 'F:/abc/intt'
file_names = os.listdir(path)
with open('F://hi//Datasets//megef_fil.txt', 'w', encoding = 'utf-8') as outfile:
    for fname in file_names:
        with open(os.path.join(path, fname)) as infile:
            for line in infile:
                outfile.write(line)
The traceback I am getting is something like this:
File "C:\Users\Shanky\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 31: character maps to <undefined>
I have absolutely no clue where I am going wrong. Please help me fix this issue. Any help is appreciated.
It seems that you are reading your infile without the proper encoding. Try:
with open(os.path.join(path,fname), mode="r", encoding="utf-8") as infile:
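Putting that into the full merge loop would look roughly like this (this assumes every input file really is UTF-8; files in other encodings would still need separate handling):
import os

path = 'F:/abc/intt'
file_names = os.listdir(path)

with open('F://hi//Datasets//megef_fil.txt', 'w', encoding='utf-8') as outfile:
    for fname in file_names:
        # Read each input with an explicit encoding so Windows doesn't fall back
        # to cp1252 and choke on bytes like 0x8d.
        with open(os.path.join(path, fname), mode='r', encoding='utf-8') as infile:
            for line in infile:
                outfile.write(line)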

Python Reading File and Identifying Source of UnicodeDecodeError

I am trying to read a text file using the following statement:
with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip());
The problem is that I get the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>
My question is how I can identify exactly where in the file the error is encountered, since the position Python gives is tied to the location in the record being read at the time and not the absolute position in the file. So is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.
Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?
(And yes I know that 0x9d is a tab character and that I can do a search for that but that is not what I am after.)
Thanks.
Update: the post at "UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function" has nothing to do with the question I am asking, which is how I can get Python to tell me what record of the input file it was reading when it encountered the Unicode error.
I think the only way is to track the line number separately and output it yourself.
with open(inputFile) as fp:
    num = 0
    try:
        for num, line in enumerate(fp):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        print('Line ', num, e)
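A small addition of my own (not part of the original answer): the exception object itself carries the raw data, so the same except block can also report which byte failed; e.object is the chunk of bytes being decoded and e.start is the index of the bad byte within that chunk:
lineRecords = []
with open(inputFile) as fp:
    num = 0
    try:
        for num, line in enumerate(fp):
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        # Report the record reached, the offending byte, and the decoder's reason.
        print('Line', num, 'offending byte:', hex(e.object[e.start]), '-', e.reason)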
You can use the read method of the file object to obtain the first 6880 characters, encode it, and the length of the resulting bytes object will be the index of the starting byte of the offending character:
with open(inputFile) as fp:
    print(len(fp.read(6880).encode()))
I have faced this issue before, and the easiest fix is to open the file in UTF-8 mode:
with open(inputFile, encoding="utf8") as fp:
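If the file turns out not to be valid UTF-8 either, another option (my suggestion, not from the original answer) is to tell the decoder explicitly how to handle bad bytes, so the read completes and you can inspect the damage afterwards:
# errors="replace" substitutes U+FFFD for any undecodable byte instead of raising.
with open(inputFile, encoding="utf8", errors="replace") as fp:
    lineRecords = [line.strip() for line in fp if len(line) > 0]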

Encoding issues with a Word document in Python

I have a huge Word document with more than 10,000 lines in it. It contains random empty lines and also weird characters, and I want to save it as a .txt or a .fasta file so I can read each line as a string and run it through my program to pull out only the FASTA headers and their sequences.
I have searched online, and all of the posts about encoding issues just make it more confusing for me.
So far I have tried:
1) Saving the Word document as a .txt file with the Unicode (UTF-8) option and running my code below; about 1,000 lines were output until it hit an error.
with open('TemplateMaster2.txt', encoding='utf-8') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)
            fout.write(next(fin))
Error message:
UnicodeEncodeError: 'charmap' codec can't encode character '\uf044' in position 11: character maps to <undefined>
2) Saving the Word document as a .txt file with the Unicode (UTF-8) option, but this time reading it without an explicit encoding; about 1,000 or so lines were output until it hit a different error.
with open('TemplateMaster2.txt') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)
            fout.write(next(fin))
Error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5664: character maps to <undefined>
I can try different options for saving that Word document as a .txt file, but there are too many options and I am not sure what the problem really is. Should I save the Word document as .txt with the option 'Unicode', 'Unicode (Big-Endian)', 'Unicode (UTF-7)', 'Unicode (UTF-8)', 'US-ASCII', etc.?
The only thing that seems to be missing is encoding='utf-8' in your open statement for fout.
with open('TemplateMaster2.txt', 'r', encoding='utf-8') as fin, open('OnlyFastaseq.fasta', 'w', encoding='utf-8') as fout:
    for line in fin:
        if line.startswith('>'):
            fout.write(line)
            seq = next(fin)
            fout.write(seq)
Did you double check if your sequences are really every time only in one line?
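If they are not, a sketch like the following (my own variant, assuming '>' marks every header and a blank line ends a record) collects all the sequence lines instead of grabbing just the single line returned by next(fin):
with open('TemplateMaster2.txt', encoding='utf-8') as fin, \
        open('OnlyFastaseq.fasta', 'w', encoding='utf-8') as fout:
    in_record = False
    for line in fin:
        if line.startswith('>'):
            # A new FASTA header: write it and start collecting its sequence lines.
            fout.write(line)
            in_record = True
        elif not line.strip():
            # A blank line ends the current record.
            in_record = False
        elif in_record:
            # A non-empty line belonging to the current header's sequence.
            fout.write(line)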
