encode a file from ASCII to UTF8 - python

I want to encode a csv file from ASCII to UTF-8 encoding and this is the code I tried:
import codecs
import chardet

BLOCKSIZE = 9048576 # or some other, desired size in bytes

with codecs.open("MFile2016-05-22.csv", "r", "ascii") as sourceFile:
    with codecs.open("tmp.csv", "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

file = open("tmp.csv", "r")
try:
    content = file.read()
finally:
    file.close()

encoding = chardet.detect(content)['encoding']
print encoding
After testing it, I still get "ascii" in the value of encoding. The encoding didn't change. What am I missing?

ASCII is a subset of UTF-8. Any ASCII-encoded file is also valid UTF-8.
From the Wikipedia article on UTF-8:
The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single octet with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.
In other words, your operation is a no-op, nothing should change.
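You can verify this directly. A minimal sketch (reusing the filename from the question): re-encoding pure-ASCII data as UTF-8 produces exactly the same bytes.
with open("MFile2016-05-22.csv", "rb") as f:
    original = f.read()

# For pure-ASCII content, decoding as ASCII and re-encoding as UTF-8
# is a byte-for-byte no-op.
assert original.decode("ascii").encode("utf-8") == original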
Any tools that detect codecs (like chardet) would rightly mark it as ASCII still. Marking it as UTF-8 would also be valid, but so would marking it as ISO-8859-1 (Latin-1) or CP-1252 (the Windows latin-1 based codepage), or any number of codecs that are supersets of ASCII. That would be confusing, however, since your data only consists of ASCII codepoints. Tools that would accept ASCII only would accept your CSV file, while they would not accept UTF-8 data that consists of more than just ASCII codepoints.
If the goal is to validate that any piece of text is valid UTF-8 by using chardet, then you'll have to accept ASCII too:
def is_utf8(content):
    encoding = chardet.detect(content)['encoding']
    return encoding in {'utf-8', 'ascii'}

ASCII is a subset of UTF-8; all ASCII files are automatically also UTF-8. You don't need to do anything.

Related

type of encoding to read csv files in pandas

Alright, so I'm writing code where I read a CSV file using pandas.read_csv. The problem is the encoding: I was using utf-8-sig and that works, but it gives me an error with other CSV files. I found out that some files need other encodings, such as cp1252. The problem is that I can't restrict the user to a specific CSV type that matches my encoding.
So is there any solution for this? For example, is there a universal encoding type that works for all CSVs? Or can I pass an array of all the possible encodings?
A CSV file is a text file. If it contains only ASCII characters, there is no problem nowadays; most encodings handle plain ASCII characters correctly. The problem arises with non-ASCII characters. Example:
character    Latin1 code    cp850 code    UTF-8 codes
é            '\xe9'         '\x82'        '\xc3\xa9'
è            '\xe8'         '\x8a'        '\xc3\xa8'
ö            '\xf6'         '\x94'        '\xc3\xb6'
Things are even worse, because single-byte character sets can represent at most 256 characters, while UTF-8 can represent them all. For example, besides the normal quote character ', Unicode contains left ‘ and right ’ versions of it, neither of which is represented in Latin1 or CP850.
Long story short, there is no such thing as a universal encoding. But certain encodings, Latin1 for example, have a special property: they can decode any byte. So if you declare a Latin1 encoding, no UnicodeDecodeError will ever be raised; it is just that, if the file was actually UTF-8 encoded, an é will look like Ã©. And a right single quote, '\xe2\x80\x99' in UTF-8, would appear as a bare â (plus two invisible control characters) on a Latin1 system and as â€™ on a cp1252 one.
Since you mentioned CP1252: it is a Windows variant of Latin1, but it does not share the property of being able to decode any byte.
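To see the mojibake effect in a couple of lines (a small sketch, assuming Python 3; not specific to CSV):
text = 'é'
utf8_bytes = text.encode('utf-8')      # b'\xc3\xa9'

# Latin1 maps every byte value to a character, so decoding never fails,
# but UTF-8 data read as Latin1 comes out as mojibake.
print(utf8_bytes.decode('latin1'))     # Ã©

# cp1252 leaves a few byte values (e.g. 0x81, 0x8d) undefined, so unlike
# Latin1 it can still raise UnicodeDecodeError on arbitrary bytes.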
The common way is to ask the people sending you CSV files to use an agreed-upon encoding and to try to decode with that encoding. You then have two workarounds for badly encoded files. The first is the one proposed by CygnusX: try a sequence of encodings ending with Latin1, for example encodings = ["utf-8-sig", "utf-8", "cp1252", "latin1"] (by the way, Latin1 is an alias for ISO-8859-1, so there is no need to test both).
The second one is to open the file with errors='replace': any offending byte will be replaced with a replacement character. At least all ASCII characters will be correct:
with open(filename, encoding='utf-8-sig', errors='replace') as file:
    fd = pd.read_csv(file, other_parameters...)
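To see what errors='replace' does on a single offending byte (a small sketch, independent of pandas):
# b'\xe9' is 'é' in Latin1/cp1252 but is not valid UTF-8 on its own, so it
# becomes the U+FFFD replacement character instead of raising an error.
print(b'caf\xe9'.decode('utf-8', errors='replace'))   # caf�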
You could try this: https://stackoverflow.com/a/48556203/11246056
Or iterate over several encodings in a try/except statement:
encodings = ["utf-8-sig", "cp1252", "iso-8859-1", "latin1"]
for encoding in encodings:
    try:
        pandas.read_csv(..., encoding=encoding, ...)
        break  # stop at the first encoding that works
    except ValueError:  # or the error you receive
        continue

Output difference after reading files saved in different encoding option in python

I have a file containing a list of Unicode strings, saved with the UTF-8 encoding option. I have another input file, saved as plain ANSI. I read a directory path from that ANSI file, do os.walk(), and try to match whether any file is present in the list (the one saved as UTF-8). But it doesn't match even when the file is present.
Later I did a simple check with a single string "40M_Ãz­µ´ú¸ÕÀÉ" and saved this particular string (from Notepad) in three different files with the encoding options ANSI, Unicode and UTF-8. I wrote a Python script to print:
print repr(string)
print string
And the output is like:
ANSI Encoding
'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
40M_Ãz­µ´ú¸ÕÀÉ
UNICODE Encoding
'\x004\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
4 0 M _ Ã z ­µ ´ ú ¸ Õ À É
UTF-8 Encoding
'40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
40M_Ãz­µ´ú¸ÕÀÉ
I really can't understand how to compare the same string coming from differently encoded files. Please help.
PS: I have some typical Unicode file names like 唐朝小栗子第集.mp3 which are very difficult to handle.
I really can't understand how to compare the same string coming from differently encoded files.
Notepad encoded your character string with three different encodings, resulting in three different byte sequences. To retrieve the character string you must decode those bytes using the same encodings:
>>> ansi_bytes = '40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
>>> utf16_bytes = '4\x000\x00M\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
>>> utf8_bytes = '40M_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
>>> ansi_bytes.decode('mbcs')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
>>> utf16_bytes.decode('utf-16le')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
>>> utf8_bytes.decode('utf-8')
u'40M_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40M_Ãz­µ´ú¸ÕÀÉ
‘ANSI’ is what Windows (somewhat misleadingly) calls its default locale-specific code page, which in your case is 1252 (Western European, available in Python as windows-1252), but this will vary from machine to machine. You can get whatever this encoding is from Python on Windows using the name mbcs.
‘Unicode’ is the name Windows uses for the UTF-16LE encoding (very misleadingly, because Unicode is the character set standard and not any kind of bytes⇔characters encoding in itself). Unlike ANSI and UTF-8 this is not an ASCII-compatible encoding, so your attempt to read a line from the file has failed because the line terminator in UTF-16LE is not \n, but \n\x00. This has left a spurious \x00 at the start of the byte string you have above.
‘UTF-8’ is at least accurately named, but Windows likes to put fake Byte Order Marks at the front of its “UTF-8” files that will give you an unwanted u'\uFEFF' character when you decode them. If you want to accept “UTF-8” files saved from Notepad you can manually remove this or use Python's utf-8-sig encoding.
You can use codecs.open() instead of open() to read a file with automatic Unicode decoding. This also fixes the UTF-16 newline problem, because then the \n characters are detected after decoding instead of before.
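For the file Notepad saved as "Unicode" (UTF-16LE), that would look something like this (a sketch; unicode_saved.txt is just a placeholder name):
import codecs

# The 'utf-16' codec reads Notepad's BOM, picks the right endianness and
# strips the BOM; lines are then split on the decoded u'\r\n'.
with codecs.open(u'unicode_saved.txt', 'r', 'utf-16') as fp:
    for line in fp:
        print repr(line.rstrip(u'\r\n'))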
I read a directory path from that ANSI file and do os.walk()
Windows filenames are natively handled as Unicode, so when you give Windows a byte string it has to guess which encoding is needed to convert those bytes into characters. It chooses the ANSI code page, not UTF-8. That would be fine if the byte string came from a file also encoded in the same machine's ANSI encoding, but even then you would be limited to filenames that fit within your machine's locale. In a Western European locale 40M_Ãz­µ´ú¸ÕÀÉ would fit, but 唐朝小栗子第集.mp3 would not, so you wouldn't be able to refer to Chinese files at all.
Python supports passing Unicode filenames directly to Windows, which avoids the problem (most other languages can't do this). Pass a Unicode string into filesystem functions like os.walk() and you should get Unicode strings out, instead of failure.
So, for UTF-8-encoded input files, something like:
import codecs
import os

with codecs.open(u'directory_path.txt', 'rb', 'utf-8-sig') as fp:
    directory_path = fp.readline().strip(u'\r\n')   # unicode dir path

good_names = set()
with codecs.open(u'filename_list.txt', 'rb', 'utf-8-sig') as fp:
    for line in fp:
        good_names.add(line.strip(u'\r\n'))         # set of unicode file names

for dirpath, dirnames, filenames in os.walk(directory_path):  # names will be unicode strings
    for filename in filenames:
        if filename in good_names:
            # do something with the file
            pass

Reading Unicode file data with BOM chars in Python

I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
As you can see, I'm detecting the encoding using chardet, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?
There is no reason to check whether a BOM exists or not; utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and don't worry about it.
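Applied to the code in the question, that means the chardet step can be dropped entirely. A minimal sketch, assuming the source files are UTF-8 with or without a BOM:
# utf-8-sig consumes a leading BOM if present and is otherwise identical
# to utf-8, so no detection step is needed.
with open(filename, encoding='utf-8-sig') as infile:
    data = infile.read()
print(data)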
BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:
import io
import os
import chardet
import codecs

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)
I've composed a nifty BOM-based detector based on Chewie's answer.
It autodetects the encoding in the common use case where data can be either in a known local encoding or in Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:
import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so the UTF-32 BOMs must be tried first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
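Usage might look like this (a sketch; some_file.txt is a placeholder and cp1252 an assumed local fallback encoding):
encoding = detect_by_bom('some_file.txt', default='cp1252')
with open('some_file.txt', encoding=encoding) as f:
    text = f.read()  # any BOM is consumed by the codec itself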
chardet has detected BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:
#!/usr/bin/env python
import chardet  # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32)  # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()

print(text)
Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings that leave the BOM in the text. 'LE'/'BE' should be stripped to avoid that -- though it is easier to detect the BOM yourself at this point, e.g. as in @ivan_pozdeev's answer.
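If you do want to act on that note programmatically, one option (a sketch of the normalisation only, not something chardet does for you) is to map the endian-specific names back to the generic codec so that Python consumes the BOM itself during decoding:
# e.g. 'UTF-16LE' -> 'UTF-16', 'UTF-32BE' -> 'UTF-32'; the generic codec
# reads the BOM and strips it from the decoded text.
if encoding and encoding.upper().startswith(('UTF-16', 'UTF-32')):
    encoding = encoding[:6]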
To avoid UnicodeEncodeError while printing Unicode text to Windows console, see Python, Unicode, and the Windows console.
I find the other answers overly complex. There is a simpler way that doesn't need dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.
The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining if it's there and/or adding/removing it is easy. To read a file with a possible BOM:
BOM = '\ufeff'
with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
if text.startswith(BOM):
    text = text[1:]
This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.
To write a BOM:
text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
f.write(text_with_BOM)
This works with any encoding. UTF-16 big endian is just an example.
This is not, btw, to dismiss chardet. It can help when you have no information what encoding a file uses. It's just not needed for adding / removing BOMs.
In case you want to edit the file, you will want to know which BOM was used. This version of @ivan_pozdeev's answer returns both the encoding and the optional BOM:
import codecs
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596"""
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so the UTF-32 BOMs must be tried first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None
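For example (a sketch; data.txt is a placeholder path):
encoding, bom = encoding_by_bom('data.txt')
with open('data.txt', encoding=encoding) as f:
    text = f.read()  # the BOM is already consumed by the chosen codec
# 'bom' tells you which marker to write back if you later re-save the file.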
A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517):
import codecs

def detect_encoding(bytes_str):
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        if any(bytes_str.startswith(bom) for bom in boms):
            return enc
    return 'utf-8'  # default

def safe_exc_to_str(exc):
    try:
        return str(exc)
    except UnicodeEncodeError:
        return unicode(exc).encode(detect_encoding(exc.content))
Alternatively, this much simpler code is able to delete non-ascii characters without much fuss:
def just_ascii(str):
    return unicode(str).encode('ascii', 'ignore')

Dealing with UTF-8 numbers in Python

Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file is in UTF-8 and it has 1 row with the values 115,113,12, then:
with open(file) as f:
    a, b, c = map(int, f.readline().split(','))
would throw this:
invalid literal for int() with base 10: '\xef\xbb\xbf115'
The first number is always mangled with these '\xef\xbb\xbf' characters. For the other two numbers the conversion works fine. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.
Is there a better way of doing this for any type of encoded file?
import codecs

with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))
This works in Python 2.6.4. The codecs.open call opens the file and returns data as unicode, decoding from UTF-8 and ignoring the initial BOM.
What you're seeing is a UTF-8 encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec, and skip over the U+FEFF character if present.
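A minimal sketch of that second approach (assuming Python 2 as in the answer above, and codecs.open with plain 'utf-8', so the BOM, if any, survives decoding as u'\ufeff' and is stripped by hand):
import codecs

with codecs.open(file, "r", "utf-8") as f:
    line = f.readline()
    if line.startswith(u'\ufeff'):  # a BOM decodes to U+FEFF; drop it
        line = line[1:]
    a, b, c = map(int, line.split(","))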
