Re-encode Unicode stream as Ascii ignoring errors - python

I'm trying to take a Unicode file stream, which contains odd characters, and wrap it with a stream reader that will convert it to Ascii, ignoring or replacing all characters that can't be encoded.
My stream looks like:
"EventId","Rate","Attribute1","Attribute2","(。・ω・。)ノ"
...
My attempt to alter the stream on the fly looks like this:
import chardet, io, codecs

with open(self.csv_path, 'rb') as rawdata:
    detected = chardet.detect(rawdata.read(1000))
detectedEncoding = detected['encoding']

with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = codecs.getreader('ascii')(csv_file, errors='ignore')
    log( csv_ascii_stream.read() )
The result on the log line is: UnicodeEncodeError: 'ascii' codec can't encode characters in position 36-40: ordinal not in range(128) even though I explicitly constructed the StreamReader with errors='ignore'
I would like the resulting stream (when read) to come out like this:
"EventId","Rate","Attribute1","Attribute2","(?????)?"
...
or alternatively, "EventId","Rate","Attribute1","Attribute2","()" (using 'ignore' instead of 'replace')
Why is the Exception happening anyway?
I've seen plenty of problems/solutions for decoding strings, but my challenge is to change the stream as it's being read (using .next()), because the file is potentially too large to be loaded into memory all at once using .read()

You're mixing up the encode and decode sides.
For decoding, you're doing fine. You open it as binary data, chardet the first 1K, then reopen in text mode using the detected encoding.
But then you're trying to further decode that already-decoded data as ASCII, by using codecs.getreader. That function returns a StreamReader, which decodes data from a stream. That isn't going to work. You need to encode that data to ASCII.
But it's not clear why you're using a codecs stream decoder or encoder in the first place, when all you want to do is encode a single chunk of text in one go so you can log it. Why not just call the encode method?
log(csv_file.read().encode('ascii', 'ignore'))
If you want something that you can use as a lazy iterable of lines, you could build something fully general, but it's a lot simpler to just do something like the UTF8Recoder example in the csv docs:
class AsciiRecoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("ascii", "ignore")
Or, even more simply:
with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
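For example, here's a sketch of feeding that generator to the csv module (Python 2 style, to match the .next()-based recoder above; process_row is a hypothetical placeholder, not part of the original answer):
import csv, io

with io.open(self.csv_path, 'r', encoding=detectedEncoding) as csv_file:
    csv_ascii_stream = (line.encode('ascii', 'ignore') for line in csv_file)
    for row in csv.reader(csv_ascii_stream):   # csv.reader consumes the generator lazily, line by line
        process_row(row)                       # hypothetical placeholder for per-row work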

I'm a little late to the party with this, but here's an alternate solution, using codecs.StreamRecoder:
from codecs import getencoder, getdecoder, getreader, getwriter, StreamRecoder

with io.open(self.csv_path, 'rb') as f:
    csv_ascii_stream = StreamRecoder(f,
                                     getencoder('ascii'),
                                     getdecoder(detectedEncoding),
                                     getreader(detectedEncoding),
                                     getwriter('ascii'),
                                     errors='ignore')
    print(csv_ascii_stream.read())
I guess you may want to use this if you need the flexibility to be able to call read()/readlines()/seek()/tell() etc. on the stream that gets returned. If you just need to iterate over the stream, the generator expression abarnert provided is a bit more concise.

Related

Reading files with non ascii characters [duplicate]

How can I fix "UnicodeDecodeError: 'utf-8' codec can't decode bytes..." in python?

I need to read specified rows and columns of a CSV file and write them into a txt file, but I get a Unicode decode error.
import csv

with open('output.csv', 'r', encoding='utf-8') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list)
The reason for this error is perhaps that your CSV file does not use UTF-8 encoding. Find out the original encoding used for your document.
First of all, try using the default encoding by leaving out the encoding parameter:
with open('output.csv', 'r') as f:
    ...
If that does not work, try alternative encoding schemes that are commonly used, for example:
with open('output.csv', 'r', encoding="ISO-8859-1") as f:
    ...
If you get a Unicode decode error with this code, it is likely that the CSV file is not UTF-8 encoded... The correct fix is to find out what the correct encoding is and use it.
If you only want quick and dirty workarounds, Python offers the errors=... option of open. From the documentation of the open() function in the standard library:
'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
'backslashreplace' replaces malformed data by Python’s backslashed escape sequences.
'namereplace' (also only supported when writing) replaces unsupported characters with \N{...} escape sequences.
I often use errors='replace' when I only want to know that there were erroneous bytes, or errors='backslashreplace' when I want to know what they were.
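For instance, a quick sketch of those two handlers (assuming 'output.csv' contains bytes that are not valid UTF-8; note that 'backslashreplace' is only supported for reading on Python 3.5+):
with open('output.csv', 'r', encoding='utf-8', errors='replace') as f:
    print(f.read())   # undecodable bytes appear as the U+FFFD replacement character
with open('output.csv', 'r', encoding='utf-8', errors='backslashreplace') as f:
    print(f.read())   # undecodable bytes appear as escapes like \xe9, so you can see what they were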

Python reading a file into unicode strings

I am having some trouble understanding the correct way to handle Unicode strings in Python. I have read many questions about it, but it is still unclear what I should do to avoid problems when reading and writing files.
My goal is to read some huge (up to 7GB) files efficiently, line by line. I was doing it with a simple with open(filename) as f:, but I ended up with an ASCII decoding error.
Then I read that the correct way to do it would be to write:
with codecs.open(filename, 'r', encoding='utf-8') as logfile:
However this ends up in:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 13: invalid start byte
Frankly I haven't understood why this exception is raised.
I have found a working solution doing:
with open(filename) as logfile:
    for line in logfile:
        line = unicode(line, errors='ignore')
But this approach ended up being incredibly slow.
Therefore my question is:
Is there a correct way of doing this, and what is the fastest way?
Thanks
Your data is probably not UTF-8 encoded. Figure out the correct encoding and use that instead. We can't tell you what codec is right, because we can't see your data.
If you must specify an error handler, you may as well do so when opening the file. Use the io.open() function; codecs is an older library with some issues, while io (which underpins all I/O in Python 3 and was backported to Python 2) is far more robust and versatile.
The io.open() function takes an errors argument too:
import io
with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
I picked replace as the error handler so that you at least get placeholder characters for anything that could not be decoded.
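Put together for the line-by-line case in the question, a sketch (process_line is a hypothetical placeholder for whatever you do with each line):
import io

with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
    for line in logfile:        # reads lazily, one line at a time, so a 7GB file is fine
        process_line(line)      # hypothetical placeholder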

Reading Unicode file data with BOM chars in Python

I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:
bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']
infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
As you can see, I'm detecting the encoding using chardet, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2:
character maps to <undefined>
I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?
There is no reason to check if a BOM exists or not, utf-8-sig manages that for you and behaves exactly as utf-8 if the BOM does not exist:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see that utf-8-sig correctly decodes the given string regardless of the existence of a BOM. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and don't worry about it.
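Applied to file reading, a minimal sketch (assuming the file really is UTF-8, with or without a BOM):
with open(filename, encoding='utf-8-sig') as infile:   # Python 3; use io.open or codecs.open on Python 2
    data = infile.read()    # any leading BOM has already been stripped by the codec
print(data)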
BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:
import io
import os
import chardet
import codecs

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()
print(data)
I've composed a nifty BOM-based detector based on Chewie's answer.
It autodetects the encoding in the common use case where data can be either in a known local encoding or in Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:
import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so we need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
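For example, a minimal sketch of how it might be used ('cp1252' is just a hypothetical fallback for BOM-less files, not part of the original answer):
import io

encoding = detect_by_bom(filename, 'cp1252')     # example default for files without a BOM
with io.open(filename, encoding=encoding) as f:
    text = f.read()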
chardet has detected BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:
#!/usr/bin/env python
import chardet  # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32)  # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()
    print(text)
Note: chardet may return 'UTF-XXLE' and 'UTF-XXBE' encodings that leave the BOM in the text. 'LE'/'BE' should be stripped to avoid that -- though it is easier to detect the BOM yourself at that point, e.g., as in @ivan_pozdeev's answer.
To avoid UnicodeEncodeError while printing Unicode text to the Windows console, see Python, Unicode, and the Windows console.
I find the other answers overly complex. There is a simpler way that doesn't need dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.
The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining if it's there and/or adding/removing it is easy. To read a file with a possible BOM:
BOM = '\ufeff'

with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()
    if text.startswith(BOM):
        text = text[1:]
This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.
To write a BOM:
text_with_BOM = text if text.startswith(BOM) else BOM + text
with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(text_with_BOM)
This works with any encoding. UTF-16 big endian is just an example.
This is not, btw, to dismiss chardet. It can help when you have no information what encoding a file uses. It's just not needed for adding / removing BOMs.
In case you want to edit the file, you will want to know which BOM was used. This version of @ivan_pozdeev's answer returns both the encoding and the optional BOM:
import codecs
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596"""
    with open(path, 'rb') as f:
        raw = f.read(4)    # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so we need to try the former first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None
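A minimal sketch of how it might be used (the variable names here are assumptions, not part of the original answer):
enc, bom = encoding_by_bom(path)
with open(path, encoding=enc) as f:   # the chosen codec strips the BOM, if any, while decoding
    text = f.read()
# 'bom' tells you which byte order mark to prepend if you later re-encode the edited text yourself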
A variant of @ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with Unicode HTML content that was stuffed into a Python exception (see http://bugs.python.org/issue2517):
def detect_encoding(bytes_str):
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        if any(bytes_str.startswith(bom) for bom in boms):
            return enc
    return 'utf-8'  # default
def safe_exc_to_str(exc):
    try:
        return str(exc)
    except UnicodeEncodeError:
        return unicode(exc).encode(detect_encoding(exc.content))
Alternatively, this much simpler code is able to delete non-ascii characters without much fuss:
def just_ascii(str):
    return unicode(str).encode('ascii', 'ignore')

Python: convert Chinese characters into pinyin with CJKLIB

I'm trying to convert a bunch of Chinese characters into pinyin, reading the characters from one file and writing the pinyin into another. I'm working with the CJKLIB functions to do this.
Here's the code,
from cjklib.characterlookup import CharacterLookup

source_file = 'cities_test.txt'
dest_file = 'output.txt'

s = open(source_file, 'r')
d = open(dest_file, 'w')

cjk = CharacterLookup('T')

for line in s:
    p = line.split('\t')
    for p_shard in p:
        for c in p_shard:
            readings = cjk.getReadingForCharacter(c.encode('utf-8'), 'Pinyin')
            d.write(readings[0].encode('utf-8'))
        d.write('\t')
    d.write('\n')

s.close()
d.close()
My problem is that I keep running into Unicode-related errors; the error comes up when I call the getReadingForCharacter function. If I call it as written,
readings = cjk.getReadingForCharacter(c.encode('utf-8'), 'Pinyin')
I get: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range (128).
If I call it like this, without the .encode(),
readings = cjk.getReadingForCharacter(c, 'Pinyin')
I get an error thrown by sqlalchemy (the CJKLIB uses sqlalchemy and sqlite): You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings ... etc.
Can someone help me out? Thanks!
Oh, also: is there a way for CJKLIB to return the pinyin without any tones? I think by default it returns pinyin with these weird characters to represent tones; I just want the letters without the tones.
Your bug is that you are not decoding the input stream, and yet you are turning around and re-encoding it as though it were UTF-8. That’s going the wrong way.
You have two choices.
You can codecs.open the input file with an explicit encoding so you always get back regular Unicode strings whenever you read from it because the decoding is automatic. This is always my strong preference. There is no such thing as a text file anymore.
Your other choice is to manually decode your binary string before you pass it to the function. I hate this style, because it almost always indicates that you're doing something wrong, and even when it doesn't, it is clumsy as all get out.
I would do the same thing for the output file. I just hate seeing manually .encode("utf-8") and .decode("utf-8") all over the place. Set the stream encoding and be done with it.
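For example, a hedged sketch of the question's loop rewritten that way (assuming the files are UTF-8 and using the cjklib API exactly as the question does; the empty-readings guard is an added assumption for characters with no pinyin reading):
import codecs
from cjklib.characterlookup import CharacterLookup

cjk = CharacterLookup('T')
with codecs.open('cities_test.txt', 'r', encoding='utf-8') as s, \
     codecs.open('output.txt', 'w', encoding='utf-8') as d:
    for line in s:                          # each line is already a unicode string
        for p_shard in line.split(u'\t'):
            for c in p_shard:
                readings = cjk.getReadingForCharacter(c, 'Pinyin')  # pass unicode, not encoded bytes
                if readings:                # skip characters with no reading (punctuation, digits, ...)
                    d.write(readings[0])    # the stream encodes to UTF-8 on write
            d.write(u'\t')
        d.write(u'\n')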
