Python Reading File and Identifying Source of UnicodeDecodeError

I am trying to read a text file using the following statement:
with open(inputFile) as fp:
    for line in fp:
        if len(line) > 0:
            lineRecords.append(line.strip())
The problem is that I get the following error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6880: character maps to <undefined>
My question is: how can I identify exactly where in the file the error is encountered, since the position Python gives is tied to the location in the record being read at the time, not the absolute position in the file? Is it the 6,880th character in record 20 or the 6,880th character in record 2000? Without record information, the position value returned by Python is worthless.
Bottom line: is there a way to get Python to tell me what record it was processing at the time it encountered the error?
(And yes, I know I could search the file for the 0x9d byte, but that is not what I am after.)
Thanks.
Update: the post UnicodeEncodeError: 'charmap' codec can't encode - character maps to <undefined>, print function has nothing to do with the question I am asking, which is how I can get Python to tell me which record of the input file it was reading when it encountered the Unicode error.

I think the only way is to track the line number separately and report it yourself.
with open(inputFile) as fp:
    num = 0  # stays 0 if the very first record fails to decode
    try:
        for num, line in enumerate(fp, 1):  # number records from 1
            if len(line) > 0:
                lineRecords.append(line.strip())
    except UnicodeDecodeError as e:
        # num is the last record read successfully; the error was hit
        # while fetching the next one
        print('Error after record', num, e)

You can use the read method of the file object to obtain the first 6880 characters and encode them; the length of the resulting bytes object is the offset of the offending character's first byte:
with open(inputFile) as fp:
    # encode with the file's actual encoding if it is not UTF-8
    print(len(fp.read(6880).encode()))
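If you want the record number and the absolute byte offset in one pass, here is a sketch that is not from the answer above: read the file in binary and decode each line separately. It assumes the encoding is cp1252 (the Windows default implied by the charmap traceback).
# e.start is the index of the bad byte within the current line,
# and offset tracks the absolute position of that line in the file
offset = 0
with open(inputFile, 'rb') as fp:
    for num, raw in enumerate(fp, 1):
        try:
            raw.decode('cp1252')  # the encoding the file was being read with
        except UnicodeDecodeError as e:
            print(f'record {num}: bad byte at index {e.start}, '
                  f'absolute offset {offset + e.start}')
        offset += len(raw)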

I have faced this issue before, and the easiest fix is to open the file with an explicit UTF-8 encoding:
with open(inputFile, encoding="utf8") as fp:
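For completeness, here is the question's loop with the encoding specified (a sketch assuming the file really is UTF-8):
lineRecords = []
with open(inputFile, encoding="utf8") as fp:
    for line in fp:
        line = line.strip()
        if line:  # skip lines that are empty after stripping
            lineRecords.append(line)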

Related

Why am I able to print a good number of lines until I reach a certain point, and once that point is reached I get an error?

What I am basically trying to do is read and print each individual line of an RTF file. However, the code I currently have only seems to do the job until it reaches line 937. At that point it stops reading lines and gives me this error:
Traceback (most recent call last):
File "/private/var/mobile/Library/Mobile Documents/iCloud~com~omz-software~Pythonista3/Documents/openFolders.py", line 8, in <module>
for element in file:
File "/var/containers/Bundle/Application/8F2965B6-AC1F-46FA-8104-6BB24F1ECB97/Pythonista3.app/Frameworks/Py3Kit.framework/pylib/encodings/ascii.py", line 27, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 4250: ordinal not in range(128)
file = open("Steno Dictionary.rtf", "r")
#line_number is just to know what line number has been printed on the console.
line_number = 1
for element in file:
#print(line_number) prints until it reaches 937 and then the error occurs.
print(line_number)
print(element)
line_number +=1
How would I modify my current code to make it keep reading lines until the end of the file? There are still many more lines left, and I have searched high and low but cannot seem to figure it out. Thank you very much to whoever can help me out! As a note: I'm using Pythonista on iOS.
The error you are getting means that Python doesn't know how to translate a specific byte in the document using the default text encoding.
There are a few things you can try. The first is to check whether explicitly setting the encoding to UTF-8 works:
file = open("Steno Dictionary.rtf", "r", encoding="utf-8")
...
If that doesn't work, you can try other encodings, or you can tell Python to replace the bytes it doesn't recognize, like this:
file = open("Steno Dictionary.rtf", "r", encoding="utf-8", errors="replace")
...
That will decode everything it knows how to and replace what it doesn't with the Unicode replacement character (ļæ½) rather than raising an error.
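A tiny demonstration of that behaviour (the byte string here is contrived; 0xe9 is the byte from the traceback above):
# 0xe9 is not a valid UTF-8 sequence on its own, so with errors="replace"
# it decodes to U+FFFD instead of raising UnicodeDecodeError
text = b"caf\xe9".decode("utf-8", errors="replace")
print(text)  # cafļæ½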

Script crashes when trying to read a specific Japanese character from file

I was trying to save some Japanese characters from a text file into a string. Most characters, like "道", cause no problems, but others, like "坂", don't work: when I try to read them, my script crashes.
Do I have to use a specific encoding while reading the file?
That's my code btw:
with open(path, 'r') as file:
    lines = [line.rstrip() for line in file]
The error is:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 310: character maps to <undefined>
You have to specify the encoding when working with non-ASCII text, like this:
file = open(filename, encoding="utf8")
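If you are not sure which encoding the file actually uses (Japanese text files are often Shift JIS rather than UTF-8), here is a sketch using the third-party chardet package to guess it; this assumes chardet is installed (pip install chardet) and that path names your file:
import chardet

# guess the encoding from the raw bytes, then reopen the file with it
with open(path, 'rb') as f:
    guess = chardet.detect(f.read())  # e.g. {'encoding': 'SHIFT_JIS', ...}
with open(path, encoding=guess['encoding']) as file:
    lines = [line.rstrip() for line in file]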

codec can't decode byte 0x81

I have this simple bit of code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in open(filename))
I simply want to get the number of lines in the file. However I keep getting this error. I'm thinking of just skipping Python and doing it in C# ;-)
Can anyone help? I added 'utf-8' after searching for the error and read it should fix it. The file is just a simple text file, not an image. Albeit a large file. It's actually a CSV string, but I just want to get an idea of the number of lines before I start processing it.
Many thanks.
in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4344:
character maps to <undefined>
It seems to be an encoding problem.
In your example code, you are opening the file twice, and the second open doesn't include the encoding.
Try the following code:
file = open(filename, "r", encoding="utf-8")
num_lines = sum(1 for line in file)
Or, better, using a context manager:
with open(filename, "r", encoding="utf-8") as file:
    num_lines = sum(1 for line in file)
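If the file is large and you only need a line count before processing, here is a sketch that counts newline bytes in binary mode, which sidesteps decoding entirely:
def count_lines(filename):
    # read 1 MiB chunks until EOF and count newline bytes;
    # a final line with no trailing newline is not counted
    with open(filename, "rb") as f:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: f.read(1 << 20), b""))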

Removing b' and \n from data being transferred from .csv file to text file

I currently have data in a .csv file and am trying to move it to a temp.txt file. When I transfer the data, each line begins with b' and ends with \n', which I want to remove.
I previously had it working, but ran into trouble with non-ASCII text, getting the error: UnicodeEncodeError: 'charmap' codec can't encode character '\u0336' in position 113: character maps to <undefined>
import sys

def data(file):
    for i in range(1000):
        print(file.readline().encode("utf-8"))

file = open(sys.argv[1], encoding="utf-8")
data(file)
Currently getting these sort of results:
b'Datahere\n'
And I would prefer just getting:
Datahere
It is a bit of a hack, but you can slice the string form of each line: str(line)[2:-3] drops the leading b' (two characters) and the trailing \n' (three characters, since \n prints as a backslash and an n).
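A cleaner route (my sketch, not part of the answer above) is to drop the encode() call and write the text straight to temp.txt with UTF-8, so the b'...' wrapper never appears:
import sys

# copy the text as text: both files use UTF-8, so nothing needs encoding by hand
with open(sys.argv[1], encoding="utf-8") as src, \
        open("temp.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)  # line already ends with '\n'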

when writing to csv file writerow fails with UnicodeEncodeError

I have the line:
c.writerow(new_values)
That writes a number of values to a csv file. Normally it works fine, but sometimes it throws an exception and doesn't write the line to the csv file. I have no idea how to find out why.
This is my exception handling right now:
try:
    c.writerow(new_values)
except:
    print()
    print("Write Error: ", new_values)
With my own exception handling commented out, the traceback says:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 14: character maps to <undefined>
OK, I solved it by myself: I just had to add encoding='utf-8' to my csv.writer line:
c = csv.writer(open("Myfile.csv", 'w', newline='', encoding='utf-8'))
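The same fix with a context manager, so the file is closed reliably (the sample new_values row is made up; '\u03b1' is the Greek α from the traceback):
import csv

new_values = ["alpha", "\u03b1"]  # hypothetical row containing α
with open("Myfile.csv", "w", newline="", encoding="utf-8") as f:
    c = csv.writer(f)
    c.writerow(new_values)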
The csv module in Python 2 is notorious for not handling Unicode characters well; unless every character falls within ASCII, you probably won't be able to write the row. There is a (somewhat) drop-in replacement called unicodecsv that you may want to look into: https://pypi.python.org/pypi/unicodecsv
