Reading Hong Kong Supplementary Character Set in Python 3

I have an HKSCS dataset that I am trying to read in Python 3. Below is my code:
encoding = 'big5hkscs'
lines = []
num_errors = 0
for line in open('file.txt'):
    try:
        lines.append(line.decode(encoding))
    except UnicodeDecodeError as e:
        num_errors += 1
It throws the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 0: invalid start byte. It seems like there is a non-UTF-8 character in the dataset that the code is not able to decode.
I tried adding errors='ignore' in this line:
lines.append(line.decode(encoding, errors='ignore'))
But that does not solve the problem.
Can anyone please suggest?

If a text file is encoded with an encoding other than the default, that encoding must be specified when opening the file to avoid decoding errors:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        # do something with line
        ...
Alternatively, the file may be opened in binary mode, and text decoded afterwards:
encoding = 'big5hkscs'
path = 'file.txt'
with open(path, 'rb') as f:
    for line in f:
        decoded = line.decode(encoding)
        # do something with decoded text
In the question, the file is opened without specifying an encoding, so its contents are automatically decoded with the default encoding - apparently UTF-8 in this case.
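To see which default encoding open() uses on your platform when none is given, you can check it with the standard library:
import locale

# open() with no encoding argument falls back to this platform default
print(locale.getpreferredencoding(False))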

Looks like if I do NOT add the except clause (except UnicodeDecodeError as e), it works fine:
encoding = 'big5hkscs'
lines = []
path = 'file.txt'
with open(path, encoding=encoding, errors='ignore') as f:
    for line in f:
        line = '\t' + line
        lines.append(line)

snakecharmerb's answer is correct, but possibly you need an explanation.
You didn't say so in the original question, but I assume the error occurs on the for line (for line in open('file.txt')). On that line the file is being decoded as UTF-8 (probably the default in your environment, which suggests you are not on Windows), and later you try to decode it again. So the error is not about decoding big5hkscs, but about opening the file as text with the default encoding.
As in the second part of snakecharmerb's answer, you should open the file in binary mode, so that nothing is decoded implicitly, and then decode the text yourself with line.decode(encoding). (Iterating over a file opened in binary mode does give you lines, as bytes objects split on b'\n'.) This way you can still catch the errors, e.g. to write a message. Otherwise the normal way is to decode at open() time, but then you lose the ability to fall back and to give users a better error message (e.g. with the line number).
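A minimal sketch of that approach, assuming the same file name and encoding as in the question:
encoding = 'big5hkscs'
path = 'file.txt'
lines = []
num_errors = 0
with open(path, 'rb') as f:                  # binary mode: no implicit decoding
    for lineno, raw in enumerate(f, start=1):
        try:
            lines.append(raw.decode(encoding))
        except UnicodeDecodeError as e:
            num_errors += 1
            print(f'decode error on line {lineno}: {e}')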

Related

Can't read a .csv with python: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position..." [duplicate]

I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to @LennartRegebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools you can use to identify the encoding. They aren't perfect, but usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set...
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
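If you would rather guess programmatically than use an online tool or an editor, here is a rough sketch using the third-party chardet package (assumed to be installed; detection is a heuristic, not a guarantee):
import chardet

with open(filename, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)   # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'])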
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it with that codepage. If the file contains characters with values not defined in this codepage (like 0x90), we get a UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), and sometimes the file contains mixed encodings.
If such characters are unneeded, one may decide to replace them with the replacement character, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then simply dropped, but other decoding errors will be masked too.
A very good solution is to specify the encoding, yet not just any encoding (like cp1252), but one which has ALL 256 byte values defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, and the characters are preserved (not necessarily rendered as originally intended, but still distinguishable).
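To illustrate why cp437 is lossless at the byte level, a small sketch (filename as above); since every byte value maps to a distinct character, re-encoding recovers the original bytes exactly:
with open(filename, 'rb') as f:
    original = f.read()

text = original.decode('cp437')          # never raises: every byte value is defined
assert text.encode('cp437') == original  # round trip is exact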
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding", go to "Character sets" and patiently look for the encoding that you need. In my case, the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check which character corresponds to the byte that appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090), and then consider removing it from the file.
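If you would rather locate the offending byte programmatically, a rough sketch (the file name 'data.txt' is just a placeholder):
with open('data.txt', 'rb') as f:
    raw = f.read()
try:
    raw.decode('utf-8')
except UnicodeDecodeError as e:
    # e.start is the offset of the first undecodable byte
    print(f'byte 0x{raw[e.start]:02x} at offset {e.start}: {e.reason}')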
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
Or, going the other way (writing):
def write_files(text, file_path):
    with open(file_path, 'wb') as f:   # 'wb', since encoded bytes are written
        f.write(text.encode('utf8', 'ignore'))
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit Configurations (in the Configuration tab, change the value of the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environment variable PYTHONUTF8 to 1.
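You can check from inside the interpreter whether UTF-8 mode is actually active:
import sys

print(sys.flags.utf8_mode)   # 1 when -Xutf8 or PYTHONUTF8=1 is in effect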
For me, changing the MySQL character encoding to match my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')

how to read both ANSI and Unicode txt files in Python?

I'm new to Python and face a strange problem:
I have 50 txt files in a directory, and I want to read each .txt file and save its content in its own variable, like:
file = open(fcf[i], 'r')
text[i] = file.read()
When I only read one file, it's ok:
count = 0
for file_flag in fcf:
    if file_flag == 'feature.txt':
        file = open(fcf[count], 'r')
        features = file.read().split()  # a list, word by word
    count = count + 1
However, when reading the txt files in a loop, it goes wrong. Below is my code, and a very strange error comes up:
text = np.zeros((np.shape(fcf)[0], 1))
for flag in range(np.shape(fcf)[0]):
    file = open(fcf[flag], 'r')
    text = file.read()  # string
    file.close()
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-41-7e544d88ee9d> in <module>()
2 for flag in range(np.shape(fcf)[0]):
3 file = open(fcf[flag], 'r')
----> 4 text = file.read() # string
5 file.close()
UnicodeDecodeError: 'gbk' codec can't decode byte 0x94 in position 418: illegal multibyte sequence
Update:
with the line in the loop changed to:
file = open(fcf[flag], 'r', encoding='UTF-8')
the error also occurs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 418: invalid start byte
Could anyone help me? Thank you very much!
Update2:
It seems that most of these .txt files are in Unicode, which Python can read without problems. I found that, in Notepad, there are 2 .txt files in ANSI encoding, which leads to this problem.
How could I read both ANSI and Unicode files together in Python?
Update3:
Thanks everyone. This problem is fixed.
There are 2 reasons for this problem:
some ANSI txt files are mixed in among files that are otherwise UTF-8;
some weird mojibake appears with the ANSI encoding:
didn’t - didn抰
weren’t - weren抰, etc. (’t -> 抰)
("Well - 揥ell)
Although my PC is set entirely to English, this problem still happens for the ANSI txt files. (Manual modification is needed, since Notepad only changes the encoding, not the weird characters above...)
Hope it helps other people facing a similar problem. Thx
You open your file in default text mode. When reading it, Python tries to decode it, using the default encoding for your platform, which seems to be 'gbk'.
Apparently, the file you're trying to read uses another encoding, which causes this error.
You have to indicate the encoding to use in open. If it is 'UTF-8', for example:
file = open(fcf[flag], 'r', encoding='UTF-8')
If your file uses a different encoding, you must figure it out first; I don't know what is common in your part of the world. You can have a look at the list of standard encodings.
For Chinese, the listed encodings are 'gb2312', 'gb18030', and 'hz'; you could try these.
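One way to handle a mix of UTF-8 and ANSI files is to try a few encodings in order and fall back; a rough sketch, where the candidate list is an assumption you may need to adjust for your data:
def read_any(path, encodings=('utf-8', 'gbk', 'cp1252')):
    # Try each encoding in turn and return the first successful decode.
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode('latin-1')   # last resort: never fails

texts = [read_any(p) for p in fcf]   # fcf is the list of file paths from the question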

How to filter Unicode characters [duplicate]


Opening file in utf-8 and illegal bytes

I think my file may have mixed encodings, and it is a pretty weird file. The program I made works fine when I open a more normally encoded file. I have been extremely confused for the past 4 hours about how to get this working properly.
Actually, probably quite a bit longer than 4 >.>
import os
os.chdir("C:\\Users\\Kingsaber\\documents\\Desktop\\coding")

with open("file1.txt", "r", encoding="utf-8") as a:
    line1 = a.read().splitlines()
with open("file2.txt", "r", encoding="utf-8") as b:
    line2 = b.read().splitlines()

temp3 = tuple(set(line1) - set(line2))
print(temp3)

changes = open("output.txt", "w")
temp3 = list(temp3)
with open("output.txt", 'w') as file_handler:
    for item in temp3:
        file_handler.write("{}\n".format(item))
Python throws the error:
Traceback (most recent call last):
File "C:\Users\Kingsaber\Documents\Desktop\diff2.py", line 11, in <module>
line1 = a.read().splitlines()
File "C:\Python34\lib\codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 725130-725131: invalid continuation byte
The idea is to open 2 very large files of about 100000 lines each and compare file 1 to file 2 for unique lines. I found someone using a set to do this, and so far, after testing it with a quick txt file I created in Notepad, it has worked fine.
It seems, however, like the file that I am trying to open has bytes in it that are invalid for UTF-8. I would like to remove these invalid bytes before passing the lines into the tuple. Any help would be much obliged, as I have actually tried to google for the correct way to do this but haven't found or understood a correct solution. I will link one of the files in case it helps, since it is quite abnormal. Also, is there a way to actually check which bytes are invalid in Notepad++? I was curious to find out what was causing the error. Viewing the file in Notepad++ as a UTF-8 encoded file seems to display the text fine.
http://www.mediafire.com/file/5uax2g962ad1ali/file1.txt
Is there no way to have Python just ignore these bytes?
Your problem can be boiled down to
text = open("file1.txt", "r", encoding = "utf-8").read()
You can fix it by changing how the decoder handles errors. The choices are "strict" (the default), "replace" (insert a replacement character) and "ignore" (skip the bad bytes). UTF-8 has the interesting property that the decoder can figure out where the next character starts, so you shouldn't lose too much.
...and you can build the set from the get-go:
with open("file1.txt", "r", encoding="utf-8", errors="replace") as a:
    set1 = set(a)
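Putting the pieces together, a minimal sketch of the whole comparison under the same assumptions (both files decoded with errors="replace"):
with open("file1.txt", "r", encoding="utf-8", errors="replace") as a:
    set1 = set(a)
with open("file2.txt", "r", encoding="utf-8", errors="replace") as b:
    set2 = set(b)

with open("output.txt", "w", encoding="utf-8") as out:
    for item in set1 - set2:   # lines present in file1 but not in file2
        out.write(item)        # lines keep their trailing newlines from the source file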

Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?
source files:
Tue Jan 17$ file brh-m-157.json
brh-m-157.json: UTF-8 Unicode (with BOM) text
Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output UTF-8 without a BOM?
edit 1 proposed sol'n from below (thanks!)
fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding
fp.write(s)
This gives me the following error:
IOError: [Errno 9] Bad file descriptor
Newsflash: I'm being told in the comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
Simply use the "utf-8-sig" codec:
fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")
That gives you a unicode string without the BOM. You can then use
s = u.encode("utf-8")
to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:
import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()
It opens the file, reads a chunk, and writes it out to the file 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
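A rough sketch of that simpler copy-to-a-new-file approach (the output file name here is made up):
import codecs

src = "brh-m-157.json"
dst = "brh-m-157.nobom.json"   # hypothetical output path

with open(src, "rb") as fin, open(dst, "wb") as fout:
    chunk = fin.read(4096)
    if chunk.startswith(codecs.BOM_UTF8):
        chunk = chunk[len(codecs.BOM_UTF8):]
    while chunk:
        fout.write(chunk)
        chunk = fin.read(4096)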
As for guessing the encoding, you can just loop through the encodings from most to least specific:
def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1")  # will always work
A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, then we try UTF-16. Finally, we use Latin-1; this will always work, since all 256 bytes are legal values in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
In Python 3 it's quite easy: read the file and rewrite it with utf-8 encoding:
s = open(bom_file, mode='r', encoding='utf-8-sig').read()
open(bom_file, mode='w', encoding='utf-8').write(s)
import codecs
import shutil
import sys

# Python 2 version; in Python 3, use sys.stdin.buffer and sys.stdout.buffer
s = sys.stdin.read(3)
if s != codecs.BOM_UTF8:
    sys.stdout.write(s)
shutil.copyfileobj(sys.stdin, sys.stdout)
I found this question because I was having trouble with configparser.ConfigParser().read(fp) when opening files with a UTF-8 BOM header.
For those who are looking for a solution to remove the header so that ConfigParser can open the config file instead of reporting the error File contains no section headers, open the file like the following:
configparser.ConfigParser().read(config_file_path, encoding="utf-8-sig")
This could save you tons of effort by making removal of the file's BOM header unnecessary.
(I know this sounds unrelated, but hopefully this helps people struggling like me.)
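For completeness, a short usage sketch (config.ini is just a placeholder path):
import configparser

config = configparser.ConfigParser()
config.read("config.ini", encoding="utf-8-sig")   # tolerates a UTF-8 BOM
print(config.sections())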
This is my implementation to convert any kind of encoding to UTF-8 without BOM, replacing Windows line endings with the universal format:
import codecs
import os

import chardet

def utf8_converter(file_path, universal_endline=True):
    '''
    Convert any type of file to UTF-8 without BOM
    and using universal endlines by default.
    Parameters
    ----------
    file_path : string, file path.
    universal_endline : boolean (True),
        by default convert endlines to universal format.
    '''
    # Fix file path
    file_path = os.path.realpath(os.path.expanduser(file_path))
    # Read the raw bytes from the file
    with open(file_path, 'rb') as file_open:
        raw = file_open.read()
    # Decode using the detected encoding
    text = raw.decode(chardet.detect(raw)['encoding'])
    # Remove Windows end lines
    if universal_endline:
        text = text.replace('\r\n', '\n')
    # Encode to UTF-8
    raw = text.encode('utf8')
    # Remove BOM
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw.replace(codecs.BOM_UTF8, b'', 1)
    # Write back to the file
    with open(file_path, 'wb') as file_open:
        file_open.write(raw)
    return 0
You can use codecs.
import codecs

with open("test.txt", 'rb') as filehandle:
    content = filehandle.read()
if content[:3] == codecs.BOM_UTF8:
    content = content[3:]
print(content.decode("utf-8"))
In Python 3 you should add encoding='utf-8-sig':
with open(file_name, mode='a', encoding='utf-8-sig') as csvfile:
    csvfile.writelines(rows)
that's it.
