How to fix broken utf-8 encoding in Python?

How to fix broken utf-8 encoding in Python? - python

My string is Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I see in that site can do that http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx
and I start to try by Python
mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')
but actually it is not correct because original string is utf-8 but the string show is not my expecting result.
Note: it is Vietnamese character.
How to resolve that case? Is that Windows Unicode or something? How to detect the encoding here.

The only thing that helped me with broken cyrillic string - https://github.com/LuminosoInsight/python-ftfy
This module fixes pretty much everything and works much better than online decoders.
>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'
It can be easily installed using pip install ftfy

I'm not sure what you can do with these kind of data, but for your example in your original post, this works (Python 3.x):
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.encode('latin1').decode('utf8')
>>> s
'09. Bát Nhã Tâm Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh

Try:
str.encode('ascii', 'ignore').decode('utf-8')
You're encoding the string in ASCII format / ignoring the errors and decoding in UTF-8. This may remove the accents, but it's one approach.

The correct method in python 3.9.6 is:
"string".encode('utf-8').decode('latin-1')
"string".encode('latin1').decode('utf8')

Related

python UTF-8 str.replace() vs re.sub()

When receiving a JSON from some OCR server the encoding seems to be broken. The image includes some characters that are not encoded(?) properly. Displayed in console they are represented by \uXXXX.
For example processing an image like this:
ends up with output:
"some text \u0141\u00f3\u017a"
It's confusing because if I add some code like this:
mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')
The output is ok:
"some text Ółźż"
What is more if I try to replace them by regex:
mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)
The output remain "broken":
"some text \u0141\u00f3\u017a"
This OCR is processing image to MathML / Latex prepared for use in Python. Full documentation can be found here. So for example:
Will produce the following RAW output:
"\\(\\Delta=b^{2}-4 a c\\)"
Take a note that quotes are included in string - maybe this implies something to the case.
Why the characters are not being displayed properly in the first place while after this silly mystr.replace(x, x) it goes just fine?
Why the first method is working and re.sub fails? The code seems to be okay and it works fine in other script. What am I missing?

Python strings are unicode-encoded by default, so the string you have is different from the string you output.
>>> txt = r"some text \u0141\u00f3\u017a"
>>> txt
'some text \\u0141\\u00f3\\u017a'
>>> print(txt)
some text \u0141\u00f3\u017a
The regex doesn't work since there only is one backslash and it doesn't do anything to replace it. The python code converts your \uXXXX into the actual symbol and inserts it, which obviously works. To reproduce:
>>> txt[-5:]
'u017a'
>>> txt[-6:]
'\\u017a'
>>> txt[-6:-5]
'\\'
What you should do to resolve it:
Make sure your response is received in the correct encoding and not as a raw string. (e.g. use response.text instead of reponse.body)
Otherwise
>>> txt.encode("raw-unicode-escape").decode('unicode-escape')
'some text Łóź'

How to convert ZÃƒÂ¤une to Zäune in Python 2.7

I receive a text string from a third party api with garbled character encodings.
When I print that string to the command line, the string contains words like
ZÃƒÂ¤une instead of Zäune
GartenmÃƒÂ¶bel instead of Gartenmöbel
etc.
What can I do, to fix the incoming text string with python 2.7, so it prints properly to the command line?
Thanks

In [36]: print('ZÃƒÂ¤une'.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'))
Zäune
In [37]: print('GartenmÃƒÂ¶bel'.decode('utf-8').encode('cp1252').decode('utf-8').encode('latin-1'))
Gartenmöbel
I found this chain of encodings guess_chain_encodings.py which performs a brute-force search:
In [51]: 'ZÃƒÂ¤une'
Out[51]: 'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'
In [52]: 'Zäune'
Out[52]: 'Z\xc3\xa4une'
Running
guess_chain_encodings.py "'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'" "'Z\xc3\xa4une'"
yielded
'Z\xc3\x83\xc6\x92\xc3\x82\xc2\xa4une'.decode('utf_8').encode('cp1254').decode('utf_8_sig').encode('palmos')
A little playing around suggested that cp1254 could be replaced by the (more common?) cp1252, and utf_8_sig could be replaced by utf-8, and the odd palmos could be replaced by latin-1.

The strings seem to be UTF-8 encoded twice.

Notice also the console encoding - sometimes you can see your printed strings fine in the app, but it could fail to print in the console. Here's very good guide about Unicode in Python and its using techniques.

Python: Encoding issues?

in my HTML file, the word "Schilderung" looks normally and it doesn't seem to have an (encoding?) problem.
But when I copy the word, I get the following: "Schilde rung", and if I'd like to find out the length with python, I get 13 (instead of 12...).
What's the problem here, and how can I handle this?
Thanks a lot for any help!
EDIT:
At the moment, I use the following: output.write(text.decode("utf-8"))
This handles correctly all umlaut and other special char, but the above problem is still present. print(repr(txt)) gives: Schilde\xc2\xadrung
How can we solve this problem? Thanks a lot!

There is U+00AD SOFT HYPHEN before r in the string:
>>> "Schilderung".decode('utf-8')
u'Schilde\xadrung'
To remove non-ascii characters:
>>> s = u'Schilde\xadrung'
>>> s.encode('ascii', 'ignore').decode()
u'Schilderung'
>>> len(_)
11

Seems like "r" isn't ASCII:
>>> u'Schilderung'
u'Schilde\xadrung'

Using unicodedata.normalize in Python 2.7

Once again, I am very confused with a unicode question. I can't figure out how to successfully use unicodedata.normalize to convert non-ASCII characters as expected. For instance, I want to convert the string
u"Cœur"
To
u"Coeur"
I am pretty sure that unicodedata.normalize is the way to do this, but I can't get it to work. It just leaves the string unchanged.
>>> s = u"Cœur"
>>> unicodedata.normalize('NFKD', s) == s
True
What am I doing wrong?

You could try Unidecode:
# -*- coding: utf-8 -*-
from unidecode import unidecode # $ pip install unidecode
print(unidecode(u"Cœur"))
# -> Coeur

Your problem seems not to have to do with Python, but that the character you are trying to decompose (u'\u0153' - 'œ') is not a composition itself.
Check as your code works with a string containing normal composite characters like "ç" and "ã":
>>> a1 = a
>>> a = u"maçã"
>>> for norm in ('NFC', 'NFKC', 'NFD','NFKD'):
... b = unicodedata.normalize(norm, a)
... print b, len(b)
...
maçã 4
maçã 4
maçã 6
maçã 6
And then, if you check the unicode reference for both characters (yours and c + cedila) you will see that the later has a "decomposition" specification the former lacks:
http://www.fileformat.info/info/unicode/char/153/index.htm
http://www.fileformat.info/info/unicode/char/00e7/index.htm
It like "œ" is not formally equivalent to "oe" - (at least not for the people who defined this unicode part) - so, the way to go to normalize text containing this is to make a manual replacement of the char for the sequence with unicode.replace - as hacky as it sounds.

As jsbueno says, some letters just don't have a compatibility decomposition.
You can use the Unicode CLDR Latin-ASCII transform to generate a mapping of manual replacements.

How to search and replace utf-8 special characters in Python?

I'm a Python beginner, and I have a utf-8 problem.
I have a utf-8 string and I would like to replace all german umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').
u-umlaut has unicode code point 252, so I tried this:
>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'
I expected the last string to be u'ueber'.
What I ultimately want to do is replace all u-umlauts in a file with 'ue':
import sys
import codecs
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f:
print repr(line).replace(unichr(252), 'ue')
Thanks for your help! (I'm using Python 2.3.)

I would define a dictionary of special characters (that I want to map) then I use translate method.
line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))
you will get the following result:
Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.

I think it's easiest and clearer to do it on a more straightforward way, using directly the unicode representation os 'ü' better than unichr(252).
>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'
There's no need to use repr, as this will print the 'Python representation' of the string, you just need to present the readable string.
You will need also to include the following line at the beggining of the .py file, in case it's not already present, to tell the encoding of the file
#-*- coding: UTF-8 -*-
Added: Of course, the coding declared must be the same as the encoding of the file. Please check that as can be some problems (I had problems with Eclipse on Windows, for example, as it writes by default the files as cp1252. Also it should be the same encoding of the system, which could be utf-8, or latin-1 or others.
Also, don't use str as the definition of a variable, as it is part of the Python library. You could have problems later.
(I am trying on Python 2.6, I think in Python 2.3 the result is the same)

repr(str) returns a quoted version of str, that when printed out, will be something you could type back in as Python to get the string back. So, it's a string that literally contains \xfcber, instead of a string that contains über.
You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:
repr(str.replace(unichr(252), 'ue'))

You can avoid all that sourcefile encoding stuff and its problems. Use the Unicode names, then its screamingly obvious what you are doing and the code can be read and modified anywhere.
I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.
# coding: ascii
translations = (
(u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
(u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
# et cetera
)
test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
out = test
for from_str, to_str in translations:
out = out.replace(from_str, to_str)
print out
output:
Moeller von Muenchen

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to fix broken utf-8 encoding in Python? - python

I'm not sure what you can do with these kind of data, but for your example in your original post, this works (Python 3.x): >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh' >>> s = mystr.encode('latin1').decode('utf8') >>> s '09. Bát Nhã Tâm Kinh' >>> print(s) 09. Bát Nhã Tâm Kinh

Try: str.encode('ascii', 'ignore').decode('utf-8') You're encoding the string in ASCII format / ignoring the errors and decoding in UTF-8. This may remove the accents, but it's one approach.

The correct method in python 3.9.6 is: "string".encode('utf-8').decode('latin-1') "string".encode('latin1').decode('utf8')

Related

python UTF-8 str.replace() vs re.sub()

How to convert ZÃƒÂ¤une to Zäune in Python 2.7

Python: Encoding issues?

Using unicodedata.normalize in Python 2.7

How to search and replace utf-8 special characters in Python?

Categories

Resources