I have the Python string "&#39;&#39;Grassmere&#39;&#39;" as retrieved from a website.
I would like to have the &#39; displayed as the actual character (') but for some reason Python insists on just printing the numeric character reference.
Batteries are included for this one
>>> import xmllib
>>> X=xmllib.XMLParser()
>>> X.translate_references("&#39;&#39;Grassmere&#39;&#39;")
"''Grassmere''"
Or without additional modules:
re.sub("&#(\d+);", lambda m: chr(int(m.group(1))), "''Grassmere''")
When receiving JSON from an OCR server, the encoding seems to be broken. The image includes some characters that are not encoded(?) properly; displayed in the console they are represented by \uXXXX.
For example, processing one such image ends up with this output:
"some text \u0141\u00f3\u017a"
It's confusing because if I add some code like this:
mystr = mystr.replace(r'\u0141', '\u0141')
mystr = mystr.replace(r'\u00d3', '\u00d3')
mystr = mystr.replace(r'\u0142', '\u0142')
mystr = mystr.replace(r'\u017c', '\u017c')
mystr = mystr.replace(r'\u017a', '\u017a')
The output is ok:
"some text Ółźż"
What's more, if I try to replace them with a regex:
mystr = re.sub(r'(\\u[0-9|abcdef|ABCDEF]{4})', r'\g<1>', mystr)
The output remains "broken":
"some text \u0141\u00f3\u017a"
This OCR processes images into MathML / LaTeX prepared for use in Python. Full documentation can be found here. So, for example, an image of a formula will produce the following RAW output:
"\\(\\Delta=b^{2}-4 a c\\)"
Note that the quotes are included in the string - maybe that is relevant to the problem.
Why are the characters not displayed properly in the first place, while after this silly mystr.replace(x, x) everything is fine?
Why does the first method work while re.sub fails? The code seems to be okay and it works fine in another script. What am I missing?
Python strings are Unicode by default, but the string you have is not what it looks like when printed: it actually contains a literal backslash, a 'u' and four hex digits, not the character they stand for.
>>> txt = r"some text \u0141\u00f3\u017a"
>>> txt
'some text \\u0141\\u00f3\\u017a'
>>> print(txt)
some text \u0141\u00f3\u017a
The regex doesn't work because the string contains only a single backslash (not an escape sequence), and the substitution simply re-inserts the text it matched, so nothing changes. The .replace() version works because its second argument, written as '\u0141' in your source code, is converted by Python into the actual symbol, so you really are swapping the six-character sequence for the real character. To reproduce:
>>> txt[-5:]
'u017a'
>>> txt[-6:]
'\\u017a'
>>> txt[-6:-5]
'\\'
What you should do to resolve it:
Make sure your response is received in the correct encoding and not as a raw string (e.g. use response.text instead of response.body).
Otherwise:
>>> txt.encode("raw-unicode-escape").decode('unicode-escape')
'some text Łóź'
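If the payload is JSON, as it seems to be, letting the json module parse it also takes care of the \uXXXX escapes, since the JSON decoder interprets them itself. A minimal sketch with a made-up payload:
>>> import json
>>> raw = r'{"text": "some text \u0141\u00f3\u017a"}'  # hypothetical server response
>>> json.loads(raw)["text"]  # the JSON decoder resolves the \uXXXX escapes
'some text Łóź'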
I had to rewrite my Python script from Python 3 to Python 2, and after that I got a problem parsing special characters with ElementTree.
This is a piece of my xml:
<account number="89890000" type="Kostnad" taxCode="597" vatCode="">Avsättning egenavgifter</account>
This is the output when I parse this row:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avs\xc3\xa4ttning egenavgifter')
So it seems to be a problem with the character "ä".
This is how I do it in the code:
sys.setdefaultencoding("UTF-8")

xmltree = ET()
xmltree.parse("xxxx.xml")
printAccountPlan(xmltree)

def printAccountPlan(xmltree):
    # i is an <account> element (the loop that sets it is not shown here)
    print("account:", str(i.attrib['number']), "AccountType:", str(i.attrib['type']), "Name:", str(i.text))
Does anyone have an idea how to get ElementTree to parse the character "ä", so the result will be like this:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')
You're running into two separate differences between Python 2 and Python 3 at the same time, which is why you're getting unexpected results.
The first difference is one you're probably already aware of: Python's print statement in version 2 became a print function in version 3. That change is creating a special circumstance in your case, which I'll get to a little later. But briefly, this is the difference in how 'print' works:
In Python 3:
>>> # Two arguments 'Hi' and 'there' get passed to the function 'print'.
>>> # They are concatenated with a space separator and printed.
>>> print('Hi', 'there')
Hi there
In Python 2:
>>> # 'print' is a statement which doesn't need parentheses.
>>> # The parentheses instead create a tuple containing two elements,
>>> # 'Hi' and 'there'. This tuple is then printed.
>>> print('Hi', 'there')
('Hi', 'there')
The second problem in your case is that tuples print themselves by calling repr() on each of their elements. In Python 3, repr() displays unicode as you want. But in Python 2, repr() uses escape characters for any byte values which fall outside the printable ASCII range (e.g., larger than 127). This is why you're seeing them.
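A quick Python 2 session illustrating that (the value here is just an example):
>>> name = 'Avsättning'  # in Python 2 this is a byte string holding UTF-8 bytes
>>> print name           # printing the string writes the bytes; the terminal shows the text
Avsättning
>>> print (name,)        # a tuple prints its elements via repr(), so the bytes get escaped
('Avs\xc3\xa4ttning',)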
You may decide to resolve this issue, or not, depending on what your goal is. The representation of a tuple in Python 2 uses escape characters because it's not designed to be displayed to an end user; it exists for your convenience as a developer, for troubleshooting and similar tasks. If you're simply printing it for yourself, you may not need to change anything, because Python is showing you that the encoded bytes for that non-ASCII character really are in your string. If you do want to show the end user something that looks like a tuple, one way to do it (which keeps Unicode printing correctly) is to build the formatting manually, like this:
def printAccountPlan(xmltree):
    data = (i.attrib['number'], i.attrib['type'], i.text)
    print "('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data

# Produces this:
# ('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')
Once again, I am very confused with a unicode question. I can't figure out how to successfully use unicodedata.normalize to convert non-ASCII characters as expected. For instance, I want to convert the string
u"Cœur"
To
u"Coeur"
I am pretty sure that unicodedata.normalize is the way to do this, but I can't get it to work. It just leaves the string unchanged.
>>> s = u"Cœur"
>>> unicodedata.normalize('NFKD', s) == s
True
What am I doing wrong?
You could try Unidecode:
# -*- coding: utf-8 -*-
from unidecode import unidecode # $ pip install unidecode
print(unidecode(u"Cœur"))
# -> Coeur
Your problem is not really with Python: the character you are trying to decompose (u'\u0153' - 'œ') is not itself a composition.
Check how your code works with a string containing ordinary composed characters like "ç" and "ã":
>>> a = u"maçã"
>>> for norm in ('NFC', 'NFKC', 'NFD','NFKD'):
...     b = unicodedata.normalize(norm, a)
...     print b, len(b)
...
maçã 4
maçã 4
maçã 6
maçã 6
And if you check the Unicode reference for both characters (yours and c + cedilla) you will see that the latter has a "decomposition" specification which the former lacks:
http://www.fileformat.info/info/unicode/char/153/index.htm
http://www.fileformat.info/info/unicode/char/00e7/index.htm
It like "œ" is not formally equivalent to "oe" - (at least not for the people who defined this unicode part) - so, the way to go to normalize text containing this is to make a manual replacement of the char for the sequence with unicode.replace - as hacky as it sounds.
As jsbueno says, some letters just don't have a compatibility decomposition.
You can use the Unicode CLDR Latin-ASCII transform to generate a mapping of manual replacements.
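A minimal Python 3 sketch combining the two ideas - a hand-maintained table for characters that have no decomposition, plus NFKD for those that do (the table below is illustrative, not a complete Latin-ASCII mapping):
import unicodedata

# characters with no compatibility decomposition get replaced by hand
manual_map = {u'\u0153': u'oe', u'\u0152': u'OE', u'\u00df': u'ss'}

def to_ascii(text):
    for char, repl in manual_map.items():
        text = text.replace(char, repl)
    # NFKD splits composed characters such as 'ç' into 'c' plus a combining cedilla ...
    decomposed = unicodedata.normalize('NFKD', text)
    # ... and dropping the combining marks leaves the plain ASCII base letters
    return u''.join(c for c in decomposed if not unicodedata.combining(c))

print(to_ascii(u"Cœur de maçã"))  # -> Coeur de maca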
Thanks in advance for your help.
When entering "example" at the command line, Python returns 'example'. I can not find anything on the web to explain this. All reference materials speaks to strings in the context of the print command, and I get all of the material about using double quotes, singles quotes, triple quotes, escape commands, etc.
I can not, however, find anything explaining why entering text surrounded by double quotes at the command line always returns the same text surrounded by single quotes. What gives? Thanks.
In Python both 'string' and "string" are used to represent string literals. It's not like Java where single and double quotes represent different data types to the compiler.
The interpreter evaluates each line you enter and displays this value to you. In both cases the interpreter is evaluating what you enter, getting a string, and displaying this value. The default way of displaying strings is in single quotes so both times the string is displayed enclosed in single quotes.
It does seem odd - it breaks Python's rule of "There should be one - and preferably only one - obvious way to do it" - but I think disallowing one of the options would have been worse.
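For what it's worth, one practical upside of having both forms is that each lets you embed the other kind of quote without escaping (and the echoed value also shows how the interpreter picks its quotes):
>>> 'He said "hi"'   # double quotes inside a single-quoted literal
'He said "hi"'
>>> "it's fine"      # an apostrophe inside a double-quoted literal
"it's fine"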
You can also enter a string literal using triple quotes:
>>> """characters
... and
... newlines"""
'characters\nand\nnewlines'
You can use the command line to confirm that these are the same thing:
>>> type("characters")
<type 'str'>
>>> type('characters')
<type 'str'>
>>> "characters" == 'characters'
True
The interpreter uses the __repr__ method of an object to get the display to print to you. So on your own objects you can determine how they are displayed in the interpreter. We can't change the __repr__ method for built in types, but we can customise the interpreter output using sys.displayhook:
>>> import sys
>>> def customoutput(value):
...     if isinstance(value, str):
...         print '"%s"' % value
...     else:
...         sys.__displayhook__(value)
...
>>> sys.displayhook = customoutput
>>> 'string'
"string"
In Python, single quotes and double quotes are semantically the same.
It struck me as strange at first, since in C++ and some other languages single quotes are for a char and double quotes for a string.
Python simply has no separate character type, so there's no special syntax for marking a string vs. a character. Don't let it cloud your perception of a great language.
Don't get confused: in Python, single quotes and double quotes are the same. Both create a string object.
I'm a Python beginner, and I have a utf-8 problem.
I have a utf-8 string and I would like to replace all German umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').
u-umlaut has unicode code point 252, so I tried this:
>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'
I expected the last string to be u'ueber'.
What I ultimately want to do is replace all u-umlauts in a file with 'ue':
import sys
import codecs

f = codecs.open(sys.argv[1], encoding='utf-8')
for line in f:
    print repr(line).replace(unichr(252), 'ue')
Thanks for your help! (I'm using Python 2.3.)
I would define a dictionary of the special characters I want to map and then use the translate method.
line = 'Ich möchte die Qualität des Produkts überprüfen, bevor ich es kaufe.'
special_char_map = {ord('ä'):'ae', ord('ü'):'ue', ord('ö'):'oe', ord('ß'):'ss'}
print(line.translate(special_char_map))
you will get the following result:
Ich moechte die Qualitaet des Produkts ueberpruefen, bevor ich es kaufe.
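If your text may also contain the capital umlauts, the same map extends naturally (the extra entries here are an assumption about what you need):
special_char_map = {
    ord('ä'): 'ae', ord('ü'): 'ue', ord('ö'): 'oe', ord('ß'): 'ss',
    ord('Ä'): 'Ae', ord('Ü'): 'Ue', ord('Ö'): 'Oe',  # capitals, if they occur
}
print('Übermäßig teuer'.translate(special_char_map))
# -> Uebermaessig teuer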
I think it's easier and clearer to do it in a more straightforward way, using the Unicode character 'ü' directly rather than unichr(252).
>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'
There's no need to use repr, which prints the 'Python representation' of the string; you just want to present the readable string.
You will also need to include the following line at the beginning of the .py file, if it's not already there, to declare the file's encoding:
#-*- coding: UTF-8 -*-
Added: Of course, the declared coding must match the actual encoding of the file. Please check that, as it can cause problems (I had problems with Eclipse on Windows, for example, which writes files as cp1252 by default). It should also match the encoding used by the system, which could be utf-8, latin-1 or something else.
Also, don't use str as a variable name, as it shadows the built-in str type. You could run into problems later.
(I am trying this on Python 2.6; I think the result is the same in Python 2.3.)
repr(str) returns a quoted version of str which, when printed, is something you could type back into Python to get the string back. So it's a string that literally contains the characters \xfcber, instead of a string that contains über.
You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.
If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:
repr(str.replace(unichr(252), 'ue'))
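A quick Python 2 session showing the difference (a sketch along the lines of the original code, using s instead of str as the variable name):
>>> s = unichr(252) + u'ber'
>>> print s.replace(unichr(252), 'ue')  # replacing on the string itself works
ueber
>>> print repr(s)                       # repr shows the escape, not the character
u'\xfcber'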
You can avoid all that sourcefile encoding stuff and its problems. Use the Unicode names, then its screamingly obvious what you are doing and the code can be read and modified anywhere.
I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.
# coding: ascii
translations = (
    (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
    (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
    # et cetera
    )
test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'
out = test
for from_str, to_str in translations:
    out = out.replace(from_str, to_str)
print out
output:
Moeller von Muenchen