ElementTree will not parse special characters with Python 2.7

I had to port my Python script from Python 3 to Python 2, and after that I ran into a problem parsing special characters with ElementTree.
This is a piece of my XML:
<account number="89890000" type="Kostnad" taxCode="597" vatCode="">Avsättning egenavgifter</account>
This is the output when I parse this row:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avs\xc3\xa4ttning egenavgifter')
So it seems to be a problem with the character "ä".
This is how I do it in the code:
sys.setdefaultencoding( "UTF-8" )
xmltree = ET()
xmltree.parse("xxxx.xml")
printAccountPlan(xmltree)
def printAccountPlan(xmltree):
    print("account:",str(i.attrib['number']), "AccountType:",str(i.attrib['type']),"Name:",str(i.text))
Does anyone have an idea how to get ElementTree to parse the character "ä", so that the result looks like this:
('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')

You're running into two separate differences between Python 2 and Python 3 at the same time, which is why you're getting unexpected results.
The first difference is one you're probably already aware of: Python's print statement in version 2 became a print function in version 3. That change is creating a special circumstance in your case, which I'll get to a little later. But briefly, this is the difference in how 'print' works:
In Python 3:
>>> # Two arguments 'Hi' and 'there' get passed to the function 'print'.
>>> # They are concatenated with a space separator and printed.
>>> print('Hi', 'there')
Hi there
In Python 2:
>>> # 'print' is a statement which doesn't need parentheses.
>>> # The parentheses instead create a tuple containing two elements,
>>> # 'Hi' and 'there'. This tuple is then printed.
>>> print('Hi', 'there')
('Hi', 'there')
The second problem in your case is that tuples print themselves by calling repr() on each of their elements. In Python 3, repr() displays unicode as you want. But in Python 2, repr() uses escape characters for any byte values which fall outside the printable ASCII range (e.g., larger than 127). This is why you're seeing them.
You may decide to resolve this issue, or not, depending on what your goal is with your code. The representation of a tuple in Python 2 uses escape characters because it is not designed to be displayed to an end user. It is there for your convenience as a developer, for troubleshooting and similar tasks. If you're simply printing it for yourself, then you may not need to change a thing: Python is showing you that the encoded bytes for that non-ASCII character are correctly present in your string. If you do want to show the end user something formatted the way a tuple looks, then one way to do it (which keeps the non-ASCII characters readable) is to build the formatting manually, like this:
def printAccountPlan(xmltree):
    # Assumes the rows are <account> elements, as in the XML excerpt in the question.
    for i in xmltree.iter('account'):
        data = (i.attrib['number'], i.attrib['type'], i.text)
        print "('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data
# Produces this:
# ('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')
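The same effect can be reproduced in Python 3, which may help when comparing the two versions. The following is only a sketch, not the asker's code: it parses the account row from the question with xml.etree.ElementTree, encodes the text to UTF-8 bytes (roughly what a Python 2 str holds), and shows that the escapes come from repr() inside the tuple rather than from the parser. The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <account> element from the question, as a text string.
XML = ('<account number="89890000" type="Kostnad" taxCode="597" vatCode="">'
       'Avsättning egenavgifter</account>')

account = ET.fromstring(XML)
name = account.text                # parsed text; 'ä' is a single character
name_bytes = name.encode("utf-8")  # roughly what a Python 2 str would hold

# Containers call repr() on their elements, so the bytes show up escaped.
tuple_view = repr(("Name:", name_bytes))
# Decoding (or printing the text itself) gives the readable form back.
plain_view = name_bytes.decode("utf-8")

print(tuple_view)
print(plain_view)
```

The parser handled the character correctly in both cases; only the display differs.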

Related

Do not transform special characters in Python

I just want to print the original string.
[Case 1] I know that putting "r" before the string works:
print r'123\r\n567"45'
>>
123\r\n567"45
[Case 2] But when it is a variable:
aaa = '123\r\n567"45'
print aaa
>>
123
567"45
Is there any function that can print aaa with the same effect as in Case 1?

The obvious way to make Case 2 work like Case 1 is to use a raw string in your assignment statement:
aaa = r'123\r\n567"45'
Now when you print aaa, you'll get the actual backslashes and r and n characters, rather than a carriage return and a newline.
If you're actually loading aaa from some other source (rather than using a string literal), your task is a little bit more complicated. You'll actually need to transform the string in some way to get the output you want.
One simple way of doing something close to what you want is to use the repr function:
aaa = some_function() # returns '123\r\n567"45' and some_function can't be changed
print repr(aaa)
This will not quite do what you want though, since it will add quotation marks around the string's text. If you care about that, you could remove them with a slice:
print repr(aaa)[1:-1]
Another approach to take is to manually transform the characters you want escaped, e.g. with str.replace or str.translate. This is easy to do if you only care about escaping a few special characters and not others.
print aaa.replace('\r', r'\r').replace('\n', r'\n')
A final option is to use str.encode with the special character set called unicode-escape, which will escape all characters that are not printable ASCII:
print aaa.encode('unicode-escape')
This only works as intended in Python 2 however. In Python 3, str.encode always returns a bytes instance which you'd need to decode again to get a str (aaa.encode('unicode-escape').decode('ascii') would work, but it's really ugly).
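For comparison, here is a small Python 3 sketch that runs the three approaches described above side by side (the slice of repr(), the chained replace, and the unicode-escape round trip). The variable names are chosen for illustration only.

```python
aaa = '123\r\n567"45'  # contains a real carriage return and newline

# 1. repr() with the surrounding quotes sliced off
via_repr = repr(aaa)[1:-1]

# 2. escape only the characters we care about
via_replace = aaa.replace('\r', r'\r').replace('\n', r'\n')

# 3. encode with unicode-escape, then decode the bytes back to str
via_codec = aaa.encode('unicode-escape').decode('ascii')

for candidate in (via_repr, via_replace, via_codec):
    print(candidate)
```

For this particular string all three give the same text. They can differ on other input: repr() also escapes quote characters when the string contains both kinds, and the replace chain only touches the characters you list.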

You can do it using the repr function in Python:
If you are using Python 2: print repr(aaa)
If you are using Python 3: print(repr(aaa))

If what you really want is just to print the original string, instead of prepending an r before the literal, you may want to use the python native function repr. E.g.
>>> aaa = '123\r\n567"45'
>>> print repr(aaa)
'123\r\n567"45'
which is equivalent in this (exact) case to
>>> print repr('123\r\n567"45')
'123\r\n567"45'

ZWNJ not shown properly in python 3.3

I am trying to replace the space between two tokens written in the Arabic alphabet with a ZWNJ but what the function returns is not decoded properly on the screen:
>>> nm.normalize("رشته ها")
'رشته\u200cها'
\u200c should be rendered as a half-space placed between 'رشته' and 'ها' here, but it gets messed up like that. I am using Python 3.3.3.

The function returned a string object with the \u200c character as part of it, but Python shows you the representation. The \uxxxx syntax is used to make the representation useful as a debugging value: you can copy that representation, paste it back into Python, and get the exact same value.
In other words, the function worked exactly as advertised; the space was indeed replaced by a U+200C ZERO WIDTH NON-JOINER codepoint.
If you wanted to write the string to your terminal or console, use print():
print(nm.normalize("رشته ها"))
Demo:
>>> result = 'رشته\u200cها'
>>> len(result)
7
>>> result[4]
'\u200c'
>>> print(result)
رشته‌ها
You can see that character 5 (index 4) is a single character here, not 6 separate characters.
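If you want to confirm which code point sits at that position, the standard unicodedata module can name it. This is a small sketch that uses the string from the demo above directly rather than calling nm.normalize(), which is not reproduced here.

```python
import unicodedata

result = 'رشته\u200cها'

# The fifth character (index 4) is a single code point.
joiner = result[4]
joiner_name = unicodedata.name(joiner)

# repr() escapes it because it is a non-printable format character;
# print() writes the character itself.
escaped_form = repr(result)

print(joiner_name)
print(escaped_form)
print(result)
```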

splitting unicode string into words

I am trying to split a Unicode string into words (simplistic), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?

You're actually getting the stuff you expect in the unicode case. You only think you are not because of the weird escaping: you're looking at the reprs of the strings, not printing their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
...     print w # This uses the terminal encoding -- _only_ utilize interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally if you were going to send them to screen, a file, over the wire, etc. you need to manually encode them into the correct encoding. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know if there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
In this simple splitting-on-whitespace approach, you might not want to use regex at all but simply to use the unicode.split method.
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, but yours was not. Using unicode strings allows you to get the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.
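The text-versus-bytes distinction can also be seen in Python 3, where it is explicit in the types. The sketch below is an illustration under that assumption, not the asker's Python 2 code: the text pattern finds the three words, while a bytes pattern run over the UTF-8 encoding finds nothing, because in Python 3 \w in a bytes pattern only matches ASCII word characters.

```python
import re

text = "раз два три"

# Text pattern on text: \w understands Cyrillic letters.
words = re.findall(r'\w+', text)

# Bytes pattern on the UTF-8 bytes: \w is ASCII-only here.
byte_words = re.findall(rb'\w+', text.encode('utf-8'))

# Plain whitespace splitting needs no regex at all.
split_words = text.split()

print(words)
print(byte_words)
print(split_words)
```

The exact garbage produced by the bytestring case differs between Python 2 and Python 3, but the remedy is the same: keep text as text and encode only at the boundary.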

Python: 2.6 and 3.1 string matching inconsistencies

I wrote my module in Python 3.1.2, but now I have to validate it for 2.6.4.
I'm not going to post all my code since it may cause confusion.
Brief explanation:
I'm writing an XML parser (my first interaction with XML) that creates objects from the XML file. There are a lot of objects, so I have a 'unit test' that manually scans the XML and tries to find a matching object. It will print out anything that doesn't have a match.
I open the XML file and use a simple 'for' loop to read line-by-line through the file. If I match a regular expression for an 'application' (XML has different 'application' nodes), then I add it to my dictionary, d, as the key. I perform a lxml.etree.xpath() query on the title and store it as the value.
After I go through the whole thing, I iterate through my dictionary, d, and try to match the key to my value (I have to use the get() method from my 'application' class). Any time a mismatch is found, I print the key and title.
Python 3.1.2 has all matching items in the dictionary, so nothing is printed. In 2.6.4, every single value is printed (~600 in all). I can't figure out why my string comparisons aren't working.
Without further ado, here's the relevant code:
for i in d:
    if i[1:-2] != d[i].get('id'):
        print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
I slice the strings because the strings are different. Where the key would be "9626-2008olympics_Prod-SH"\n the value would be 9626-2008olympics_Prod-SH, so I have to cut the quotes and newline. I also added the Xs and Ys to the print statements to make sure that there wasn't any kind of whitespace issues.
Here is an example line of output:
X9626-2008olympics_Prod-SHX Y9626-2008olympics_Prod-SHY
Remember to ignore the Xs and Ys. Those strings are identical. I don't understand why Python2 can't match them.
Edit:
So the problem seems to be the way that I am slicing.
In Python3,
if i[1:-2] != d[i].get('id'):
this comparison works fine.
In Python2,
if i[1:-3] != d[i].get('id'):
I have to change the offset by one.
Why would strings need different offsets? The only possible thing that I can think of is that Python2 treats a newline as two characters (i.e. '\' + 'n').
Edit 2:
Updated with requested repr() information.
I added a small amount of code to produce the repr() info from the "2008olympics" example above. I have not done any slicing. It actually looks like it might not be a unicode issue. There is now a "\r" character.
Python2:
'"9626-2008olympics_Prod-SH"\r\n'
'9626-2008olympics_Prod-SH'
Python3:
'"9626-2008olympics_Prod-SH"\n'
'9626-2008olympics_Prod-SH'
Looks like this file was created/modified on Windows. Is there a way in Python2 to automatically suppress '\r'?

You are printing i[1:-3] but comparing i[1:-2] in the loop.
Very Important Question
Why are you writing code to parse XML when lxml will do all that for you? The point of unit tests is to test your code, not to ensure that the libraries you are using work!

Russell Borogrove is right.
Python 3 defaults to unicode, and the newline character is correctly interpreted as one character. That's why my offset of [1:-2] worked in 3 because I needed to eliminate three characters: ", ", and \n.
In Python 2, the newline is being interpreted as two characters, meaning I have to eliminate four characters and use [1:-3].
I just added a manual check for the Python major version.
Here is the fixed code:
for i in d:
    # The keys in D contain quotes and a newline which need
    # to be removed. In v3, newline = 1 char and in v2,
    # newline = 2 char.
    if sys.version_info[0] < 3:
        if i[1:-3] != d[i].get('id'):
            print('%s %s' % (i[1:-3], d[i].get('id')))
    else:
        if i[1:-2] != d[i].get('id'):
            print('%s %s' % (i[1:-2], d[i].get('id')))
Thanks for the responses everyone! I appreciate your help.

repr() and %r format are your friends ... they show you (for basic types like str/unicode/bytes) exactly what you've got, including type.
Instead of
print('X%sX Y%sY' % (i[1:-3], d[i].get('id')))
do
print('%r %r' % (i, d[i].get('id')))
Note leaving off the [1:-3] so that you can see what is in i before you slice it.
Update after comment "You are perfectly right about comparing the wrong slice. However, once I change it, python2.6 works, but python3 has the problem now (i.e. it doesn't match any objects)":
How are you opening the file (two answers please, for Python 2 and 3). Are you running on Windows? Have you tried getting the repr() as I suggested?
Update after actual input finally provided by OP:
If, as it appears, your input file was created on Windows (lines are separated by "\r\n"), you can read Windows and *nix text files portably by using the "universal newlines" option: open('datafile.txt', 'rU') on Python 2. Universal newlines mode is the default in Python 3. The Python 3 docs of the time say that you can also use 'rU' in Python 3, which would save you having to test which Python version you are running. Note, however, that the 'U' flag was later deprecated and was removed in Python 3.11, so on current Python 3 a plain open('datafile.txt') is the portable choice.
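The effect of newline translation can be checked with a throwaway file. This Python 3 sketch assumes a line shaped like the one in the question and uses a temporary file instead of the real XML; passing newline="" disables translation, which mimics what the asker saw under Python 2 without 'rU'.

```python
import os
import tempfile

# One line as it would appear in a file saved on Windows.
raw = b'"9626-2008olympics_Prod-SH"\r\n'

with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(raw)
    path = handle.name

try:
    # newline="" turns translation off: the "\r" survives.
    with open(path, "r", newline="") as f:
        untranslated = f.readline()
    # Default text mode uses universal newlines: "\r\n" becomes "\n".
    with open(path, "r") as f:
        translated = f.readline()
finally:
    os.remove(path)

# Stripping whitespace and quotes avoids hard-coded slice offsets.
key = translated.strip().strip('"')
print(repr(untranslated), repr(translated), key)
```

Stripping instead of slicing gives the same key in both modes, so no version check is needed.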

I don't understand what you're doing exactly, but would you try using strip() instead of slicing and see whether it helps?
for i in d:
    # strip() removes the trailing "\r\n" or "\n"; strip('"') then removes the quotes
    stripped = i.strip().strip('"')
    if stripped != d[i].get('id'):
        print('X%sX Y%sY' % (stripped, d[i].get('id')))

String formatting issues and concatenating a string with a number

I'm coming from a c# background, and I do this:
Console.Write("some text" + integerValue);
So the integer automatically gets converted to a string and it outputs.
In python I get an error when I do:
print 'hello' + 10
Do I have to convert to a string every time?
How would I do this in python?
String.Format("www.someurl.com/{0}/blah.html", 100);
I'm beginning to really like python, thanks for all your help!

From Python 2.6:
>>> "www.someurl.com/{0}/blah.html".format(100)
'www.someurl.com/100/blah.html'
To support older environments, the % operator has a similar role:
>>> "www.someurl.com/%d/blah.html" % 100
'www.someurl.com/100/blah.html'
If you would like to support named arguments, then you can pass a dict.
>>> url_args = {'num' : 100 }
>>> "www.someurl.com/%(num)d/blah.html" % url_args
'www.someurl.com/100/blah.html'
In general, when types need to be mixed, I recommend string formatting:
>>> '%d: %s' % (1, 'string formatting',)
'1: string formatting'
String formatting coerces objects into strings by using their __str__ methods.[*] There is much more detailed documentation available on Python string formatting in the docs. This behaviour is different in Python 3+, as all strings are unicode.
If you have a list or tuple of strings, the join method is quite convenient. It applies a separator between all elements of the iterable.
>>> ' '.join(['2:', 'list', 'of', 'strings'])
'2: list of strings'
If you ever need to support a legacy environment (e.g. Python <2.5), you should generally avoid string concatenation. See the article referenced in the comments.
[*] Unicode strings use the __unicode__ method.
>>> u'3: %s' % ':)'
u'3: :)'
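To answer the "do I have to convert every time" part directly: Python does not convert the integer implicitly when concatenating, so the choice is between an explicit str() call and one of the formatting forms above. A short sketch, with variable names invented for illustration:

```python
integer_value = 10

# Concatenating str and int raises TypeError; there is no implicit conversion.
try:
    'hello' + integer_value
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Explicit conversion works, as do both formatting styles.
concatenated = 'hello' + str(integer_value)
formatted = 'www.someurl.com/{0}/blah.html'.format(100)
percent = 'www.someurl.com/%d/blah.html' % 100

print(error_name, concatenated, formatted, percent)
```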

>>> "www.someurl.com/{0}/blah.html".format(100)
'www.someurl.com/100/blah.html'
You can omit the 0 in Python 2.7 or 3.1 and later.

In addition to string formatting, you can always print like this:
print "hello", 10
Works since those are separate arguments and print converts non-string arguments to strings (and inserts a space in between).

For string formatting that includes different types of values, use the % operator to insert the value into a string:
>>> intvalu = 10
>>> print "hello %i" % intvalu
hello 10
so in your example:
>>> print "www.someurl.com/%i/blah.html" % 100
www.someurl.com/100/blah.html
In this example I'm using %i as the stand-in. This changes depending on what variable type you need to use; %s would be for strings. The Python documentation has a full list of the conversion types.
