Here is the code:
s = 'Waitematā'
w = open('test.txt','w')
w.write(s)
w.close()
I get the following error.
UnicodeEncodeError: 'charmap' codec can't encode character '\u0101' in position 8: character maps to <undefined>
The string will print with the macron a, ā. However, I am not able to write this to a .txt or .csv file.
Am I able to swap out the macron a, ā, for a plain a? Thanks for the help in advance.
Note that if you open a file with open('text.txt', 'w') and write a string to it, you are not writing the string itself to the file, but its encoded bytes. Which encoding is used depends on your LANG environment variable or other platform factors.
To force UTF-8, as suggested in the title, you can open the file in binary mode and encode the string yourself:
w = open('text.txt', 'wb')    # note 'b' for binary mode
w.write(s.encode('utf-8'))    # encode the str to bytes explicitly
w.close()
As documented in open:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
Not all encodings support all Unicode characters. Since the encoding is platform dependent when not specified, it is better and more portable to be explicit and call out the encoding when reading or writing a text file. UTF-8 supports all Unicode code points:
s = 'Waitematā'
with open('text.txt', 'w', encoding='utf8') as w:
    w.write(s)
Related
I'm working on an application that uses UTF-8 encoding. For debugging purposes I need to print the text. If I use print() directly with the variable containing my Unicode string, e.g. print(pred_str), I get this error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
So I tried print(pred_str.encode('utf-8')) and my output looks like this:
b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
b'avipar\xc4\xabta-pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81dana-artham'
b'tri\xe1\xb9\x83\xc5\x9bik\xc4\x81-vij\xc3\xb1apti-prakara\xe1\xb9\x87a-\xc4\x81rambha\xe1\xb8\xa5'
b'pudgala-dharma-nair\xc4\x81tmya-pratip\xc4\x81danam punar kle\xc5\x9ba-j\xc3\xb1eya-\xc4\x81vara\xe1\xb9\x87a-prah\xc4\x81\xe1\xb9\x87a-artham'
But, I want my output to look like this:
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
aviparīta-pudgala-dharma-nairātmya-pratipādana-artham
triṃśikā-vijñapti-prakaraṇa-ārambhaḥ
pudgala-dharma-nairātmya-pratipādanam punar kleśa-jñeya-āvaraṇa-prahāṇa-artham
If I save my string to a file using:
with codecs.open('out.txt', 'w', 'UTF-8') as f:
    f.write(pred_str)
it saves string as expected.
Your data is encoded with the "UTF-8-SIG" codec, which is sometimes used in Microsoft environments.
This variant of UTF-8 prefixes the encoded text with a byte order mark (the bytes b'\xef\xbb\xbf') to make it easier for applications to distinguish UTF-8 encoded text from other encodings.
You can decode such bytestrings like this:
>>> bs = b'\xef\xbb\xbfpudgala-dharma-nair\xc4\x81tmyayo\xe1\xb8\xa5 apratipanna-vipratipann\xc4\x81n\xc4\x81m'
>>> text = bs.decode('utf-8-sig')
>>> print(text)
pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām
To read such data from a file:
with open('myfile.txt', 'r', encoding='utf-8-sig') as f:
    text = f.read()
Note that even after decoding from UTF-8-SIG, you may still be unable to print your data because your console's default code page may not be able to encode other non-ascii characters in the data. In that case you will need to adjust your console settings to support UTF-8.
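One way to do that (a minimal sketch, assuming Python 3.7+; the sample text is taken from the question) is to re-wrap sys.stdout so that print() emits UTF-8 instead of the console's default code page:
import sys

# ask Python to encode stdout as UTF-8 rather than the console code page
sys.stdout.reconfigure(encoding='utf-8')
print('pudgala-dharma-nairātmyayoḥ apratipanna-vipratipannānām')
You may still need a console font that can display the characters, and setting the environment variable PYTHONIOENCODING=utf-8 has the same effect without touching the code.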
Try this code:
if pred_str.startswith('\ufeff'):
    pred_str = pred_str.split('\ufeff')[1]
I have a very simple piece of code that converts a CSV file. Also note that I reference Notepad++ a few times, but my standard IDE is VS Code.
with codecs.open(filePath, "r", encoding="UTF-8") as sourcefile:
    lines = sourcefile.read()

with codecs.open(filePath, 'w', encoding='cp1252') as targetfile:
    targetfile.write(lines)
Now the job I'm doing requires a specific file to be encoded as windows-1252, and from what I understand cp1252 = windows-1252. The conversion works fine when I do it using the UI features in Notepad++, but when I try using Python codecs to encode this file it fails:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 561488: character maps to <undefined>
When I saw this failure I was confused, so I double-checked the output from when I manually convert the file using Notepad++, and the converted file is encoded in windows-1252. So what gives? Why is a UI feature in Notepad++ able to do the job when codecs seems not to be able to? Does Notepad++ just ignore errors?
Looks like your input text contains the character "�" (the actual "replacement character", not some other undefined character), which cannot be mapped to cp1252 (because cp1252 has no equivalent for it).
Depending on what you need, you can:
Filter it out (or replace it, or otherwise handle it) in Python before writing out lines to the output file.
Pass errors=... to the second codecs.open, choosing one of the other error-handling modes; the default is 'strict', and you can also use 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' or 'namereplace' (see the sketch after this list).
Check the input file and see why it's got the "�" character; is it corrupted?
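For example, option 2 could look like this (a sketch reusing filePath from the question; any of the handlers listed above can be substituted for 'replace'):
import codecs

with codecs.open(filePath, 'r', encoding='UTF-8') as sourcefile:
    lines = sourcefile.read()

# errors='replace' writes '?' for anything cp1252 cannot represent
with codecs.open(filePath, 'w', encoding='cp1252', errors='replace') as targetfile:
    targetfile.write(lines)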
Probably Python is simply more explicit in its error handling. If Notepad++ managed to represent every character correctly in CP-1252, then there would be a bug in the Python codec, because it should not fail where it currently does; but my guess is that Notepad++ silently replaces some characters with others and falsely claims success.
Maybe try converting the result back to UTF-8 and comparing the files byte by byte if the data is not easy to inspect manually.
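A rough sketch of that check (the file names here are hypothetical): round-trip the Notepad++ output back to UTF-8 and compare it with the original bytes.
with open('original_utf8.csv', 'rb') as f:
    original = f.read()

with open('converted_by_notepad.csv', 'rb') as f:
    roundtrip = f.read().decode('cp1252').encode('utf-8')

# False here means Notepad++ changed some characters along the way
print(original == roundtrip)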
Unicode U+FFFD is a reserved character that serves as a placeholder for a character which could not be represented; it is often an indication of an earlier conversion problem, when this data was presumably imperfectly input or converted at an earlier point in time.
(And yes, Windows-1252 is another name for Windows code page 1252.)
Why Notepad++ "succeeds"
Notepad++ does not offer to convert your file to cp1252, but to reinterpret it using this encoding. What led to your confusion is that its Encoding menu actually uses the wrong term for this.
When "Encode with cp1252" is selected, Notepad decodes the file using cp1252 and shows you the result. If you save the character '\ufffd' to a file using utf8:
with open('f.txt', 'w', encoding='utf8') as f:
f.write('\ufffd')`
and use "Encode with cp1252" you'd see three characters:
That means that Notepad++ does not read the character as utf8 and then write it as cp1252; if it did, you'd see exactly one character. You can achieve results similar to Notepad++ by reading the file using cp1252:
with open('f.txt', 'r', encoding='cp1252') as f:
    print(f.read())  # Prints ï¿½
Notepad++ only lets you actually convert to five encodings, via the conversion entries in the same Encoding menu.
What should you do
This character does not exist in the cp1252 encoding, which means you can't convert this file without losing information. Common solutions are to skip such characters or replace them with similar characters that do exist in your encoding (see encoding error handlers); a small illustration follows.
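As a quick illustration of those two options on a single string (a hedged example, not the asker's data):
s = 'value with a stray \ufffd in it'
print(s.encode('cp1252', errors='ignore'))   # skips the character entirely
print(s.encode('cp1252', errors='replace'))  # writes '?' in its place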
You are dealing with the "utf-8-sig" encoding -- please specify this one as the encoding argument instead of "utf-8".
There is information on it in the docs (search the page for "utf-8-sig").
To increase the reliability with which a UTF-8 encoding can be detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls "utf-8-sig") for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 0xef, 0xbb, 0xbf) is written. [...]
I'm reading a series of source code files using Python and running into a unicode BOM error. Here's my code:
import os
import chardet

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)
result = chardet.detect(raw)
encoding = result['encoding']

infile = open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)
As you can see, I'm detecting the encoding using chardet, then reading the file in memory and attempting to print it. The print statement fails on Unicode files containing a BOM with the error:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
I'm guessing it's trying to decode the BOM using the default character set and it's failing. How do I remove the BOM from the string to prevent this?
There is no reason to check whether a BOM exists or not; utf-8-sig manages that for you and behaves exactly like utf-8 if the BOM is not present:
# Standard UTF-8 without BOM
>>> b'hello'.decode('utf-8')
'hello'
>>> b'hello'.decode('utf-8-sig')
'hello'
# BOM encoded UTF-8
>>> b'\xef\xbb\xbfhello'.decode('utf-8')
'\ufeffhello'
>>> b'\xef\xbb\xbfhello'.decode('utf-8-sig')
'hello'
In the example above, you can see that utf-8-sig correctly decodes the given string regardless of whether a BOM is present. If you think there is even a small chance that a BOM might exist in the files you are reading, just use utf-8-sig and don't worry about it.
BOM characters should be automatically stripped when decoding UTF-16, but not UTF-8, unless you explicitly use the utf-8-sig encoding. You could try something like this:
import io
import os
import codecs
import chardet

bytes = min(32, os.path.getsize(filename))
raw = open(filename, 'rb').read(bytes)

if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'
else:
    result = chardet.detect(raw)
    encoding = result['encoding']

infile = io.open(filename, mode, encoding=encoding)
data = infile.read()
infile.close()

print(data)
I've composed a nifty BOM-based detector based on Chewie's answer.
It autodetects the encoding in the common use case where data can be either in a known local encoding or in Unicode with BOM (that's what text editors typically produce). More importantly, unlike chardet, it doesn't do any random guessing, so it gives predictable results:
import codecs

def detect_by_bom(path, default):
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so the 32-bit BOMs must be tried first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(raw.startswith(bom) for bom in boms):
            return enc
    return default
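Hypothetical usage, falling back to a local code page when no BOM is present (the file name is made up):
encoding = detect_by_bom('report.txt', default='cp1252')
with open('report.txt', encoding=encoding) as f:
    text = f.read()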
chardet has detected BOM_UTF8 automatically since version 2.3.0, released on Oct 7, 2014:
#!/usr/bin/env python
import chardet  # $ pip install chardet

# detect file encoding
with open(filename, 'rb') as file:
    raw = file.read(32)  # at most 32 bytes are returned
    encoding = chardet.detect(raw)['encoding']

with open(filename, encoding=encoding) as file:
    text = file.read()

print(text)
Note: chardet may return 'UTF-XXLE' or 'UTF-XXBE' encodings, which leave the BOM in the text. The 'LE'/'BE' suffix should be stripped to avoid that -- though it is easier to detect the BOM yourself at this point, e.g. as in #ivan_pozdeev's answer; a sketch of the stripping follows.
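A hedged sketch of that stripping step, applied to the detection code above:
encoding = chardet.detect(raw)['encoding']
if encoding and encoding.upper() in ('UTF-16LE', 'UTF-16BE', 'UTF-32LE', 'UTF-32BE'):
    # the endianness-agnostic codecs consume the BOM during decoding
    encoding = encoding[:-2]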
To avoid UnicodeEncodeError while printing Unicode text to Windows console, see Python, Unicode, and the Windows console.
I find the other answers overly complex. There is a simpler way that doesn't need dropping down into the lower-level idiom of binary file I/O, doesn't rely on a character set heuristic (chardet) that's not part of the Python standard library, and doesn't need a rarely-seen alternate encoding signature (utf-8-sig vs. the common utf-8) that doesn't seem to have an analog in the UTF-16 family.
The simplest approach I've found is dealing with BOM characters in Unicode, and letting the codecs do the heavy lifting. There is only one Unicode byte order mark, so once data is converted to Unicode characters, determining if it's there and/or adding/removing it is easy. To read a file with a possible BOM:
BOM = '\ufeff'

with open(filepath, mode='r', encoding='utf-8') as f:
    text = f.read()

if text.startswith(BOM):
    text = text[1:]
This works with all the interesting UTF codecs (e.g. utf-8, utf-16le, utf-16be, ...), doesn't require extra modules, and doesn't require dropping down into binary file processing or specific codec constants.
To write a BOM:
text_with_BOM = text if text.startswith(BOM) else BOM + text

with open(filepath, mode='w', encoding='utf-16be') as f:
    f.write(text_with_BOM)
This works with any encoding. UTF-16 big endian is just an example.
This is not, btw, to dismiss chardet. It can help when you have no information what encoding a file uses. It's just not needed for adding / removing BOMs.
In case you want to edit the file, you will want to know which BOM was used. This version of #ivan_pozdeev's answer returns both the encoding and the optional BOM:
import codecs
from typing import Optional, Tuple

def encoding_by_bom(path, default='utf-8') -> Tuple[str, Optional[bytes]]:
    """Adapted from https://stackoverflow.com/questions/13590749/reading-unicode-file-data-with-bom-chars-in-python/24370596#24370596"""
    with open(path, 'rb') as f:
        raw = f.read(4)  # will read less if the file is smaller
    # BOM_UTF32_LE starts with BOM_UTF16_LE, so the 32-bit BOMs must be tried first
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        for bom in boms:
            if raw.startswith(bom):
                return enc, bom
    return default, None
A variant of #ivan_pozdeev's answer for strings/exceptions (rather than files). I'm dealing with unicode HTML content that was stuffed in a python exception (see http://bugs.python.org/issue2517)
import codecs

def detect_encoding(bytes_str):
    # check the UTF-32 BOMs before the UTF-16 ones, since BOM_UTF32_LE starts with BOM_UTF16_LE
    for enc, boms in \
            ('utf-8-sig', (codecs.BOM_UTF8,)), \
            ('utf-32', (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)), \
            ('utf-16', (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        if any(bytes_str.startswith(bom) for bom in boms):
            return enc
    return 'utf-8'  # default

def safe_exc_to_str(exc):
    try:
        return str(exc)
    except UnicodeEncodeError:
        return unicode(exc).encode(detect_encoding(exc.content))  # Python 2
Alternatively, this much simpler code is able to delete non-ascii characters without much fuss:
def just_ascii(str):
    return unicode(str).encode('ascii', 'ignore')
I'm reading in a CSV file that has UTF8 encoding:
import csv

ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])
This works fine, and prints out what I expect it to print out; a UTF8 encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore, when I simply print the str (as opposed to its repr()) the output displays OK (which I don't understand either way: shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8')  # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
  File "scripts/script.py", line 33, in <module>
    print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused as these seem to be mangled hex values. I've read this question:
Reading a UTF8 CSV file with Python
and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF-8; but when I initially print out the repr values of the cells, they appear to be correct UTF-8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?
As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use this approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using).
As for u'\xc1lvaro Salazar', nothing here is mangled. The character Á is the Unicode codepoint U+00C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u codepoint notation for codepoints whose most significant byte would be 00, to save space (it could also have displayed this as \u00c1).
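To make that concrete, a tiny interactive check (Python 2, matching the question) shows that '\xc1' and '\u00c1' are the same codepoint, just displayed differently:
>>> u'\xc1' == u'\u00c1'
True
>>> print repr(u'\u00c1')
u'\xc1'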
To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html
Edit: http://pastebin.com/W4iG3tjS - the file
I have a text file encoded in utf8 with some Cyrillic text in it. To load it, I use the following code:
import codecs
fopen = codecs.open('thefile', 'r', encoding='utf8')
fread = fopen.read()
fread dumps the file on the screen with Unicode escape sequences; print fread displays it in readable form (ASCII, I guess).
I then try to split it and write it to an empty file with no encoding:
a = fread.split()
for l in a:
    print>>dasFile, l
But I get the following error message: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)
Is there a way to dump fread.split() into a file? How can I get rid of this error?
Since you've opened and read the file via codecs.open(), it's been decoded to Unicode. So to output it you need to encode it again, presumably back to UTF-8.
for l in a:
    dasFile.write(l.encode('utf-8'))
print is going to use the default encoding, which is normally "ascii". So you see that error with print. But you can open a file and write directly to it.
a = fopen.readlines()  # returns a list of lines already, with line endings intact
# do something with a
dasFile.writelines(a)  # doesn't add line endings; expects them to be present already
assuming the lines in a have been encoded (e.g. back to UTF-8) already.
PS. You should also investigate the io module.