Getting unicode error while parsing xml file

I have a directory of XML files, where each file has the form:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>Brand</word>
<lemma>brand</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>5</CharacterOffsetEnd>
<POS>NN</POS>
<NER>O</NER>
</token>
<token id="2">
<word>Blogs</word>
<lemma>blog</lemma>
<CharacterOffsetBegin>6</CharacterOffsetBegin>
<CharacterOffsetEnd>11</CharacterOffsetEnd>
<POS>NNS</POS>
<NER>O</NER>
</token>
<token id="3">
<word>Capture</word>
<lemma>capture</lemma>
<CharacterOffsetBegin>12</CharacterOffsetBegin>
<CharacterOffsetEnd>19</CharacterOffsetEnd>
<POS>VBP</POS>
<NER>O</NER>
</token>
I am parsing each XML file, storing the text inside the <word> tags, and then finding the top 100 words.
This is what I am doing:
import os
from collections import Counter
from bs4 import BeautifulSoup  # assumed import: the "xml" parser argument implies bs4
def find_top_words(xml_directory):
    file_list = []
    temp_list = []
    file_list2 = []
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page, "xml")
                for word in soup.find_all('word'):
                    file_list.append(str(word.string.strip()))
            f.close()
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    for w in sorted(counts, key=counts.get, reverse=True):
        temp_list.append(w)
    return temp_list[:100]
But, I'm getting this error:
File "prac31.py", line 898, in main
v = find_top_words('/home/xyz/xml_dir')
File "prac31.py", line 43, in find_top_words
file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)
What does this mean, and how do I fix it?

Don't use BeautifulSoup; it's deprecated. Why not use the standard library? If you need something more capable for XML handling, there is lxml (but I am pretty sure that you don't).
It will solve your problem easily.
Edit:
Forget my previous answer; it was wrong.
Your problem is str(my_string) in Python 2 when my_string contains non-ASCII characters: calling str() on a unicode string implicitly tries to encode it as ASCII. Use the encode('utf-8') method instead.

In Python 2, str() encodes with the ascii codec, and because word.string.strip() returns a non-ASCII character somewhere in your XML files, you get this error. The solution is to use:
file_list.append(word.string.strip().encode('utf-8'))
To get the text back from these values, decode them:
for item in file_list:
    print item.decode('utf-8')
Hope it helps.

In this line of code:
file_list.append(str(word.string.strip()))
why are you using str? The data is Unicode, and you can append unicode strings to a list. If you need a bytestring, then you can use word.string.strip().encode('utf8') instead.
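As an illustration of keeping the data as Unicode from start to finish, here is a minimal sketch for Python 3, where every str is already Unicode and the implicit ASCII encoding step does not exist. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, which is a substitution on my part and not the asker's code. SAMPLE_XML and top_words are hypothetical names, and the sample document is a cut-down stand-in for the CoreNLP output, with one non-ASCII token added.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample shaped like the CoreNLP output in the question;
# "na\u00efve" stands in for the non-ASCII token that triggered the error.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Brand</word></token>
<token id="2"><word>Blogs</word></token>
<token id="3"><word>na\u00efve</word></token>
<token id="4"><word>brand</word></token>
</tokens></sentence></sentences></document></root>"""


def top_words(xml_bytes, limit=100):
    """Return the most frequent lower-cased <word> texts, most common first."""
    root = ET.fromstring(xml_bytes)
    words = (
        w.text.strip().lower()
        for w in root.iter("word")
        if w.text and w.text.strip()
    )
    # The words stay Unicode strings all the way; nothing is forced to ASCII.
    return [word for word, _ in Counter(words).most_common(limit)]


# Pass bytes so the parser honours the encoding named in the XML declaration.
result = top_words(SAMPLE_XML.encode("utf-8"))
print(result)
```

For a directory of files, the same function could be fed the bytes of each file opened in binary mode, so the parser decodes each file according to its own declaration.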

Related

Matching the XML Req and Res

I need advice on the below
Below are the request and response XML documents. The request XML contains the phrases to be translated from the foreign language [the string elements inside the Texts node], and the response XML contains the English translations of these phrases [inside TranslatedText].
REQUEST XML
<TranslateArrayRequest>
<AppId />
<From>ru</From>
<Options>
<Category xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Category>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
<ReservedFlags xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" />
<State xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></State>
<Uri xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Uri>
<User xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></User>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">вк азиза и ринат</string>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">скачать кайда кайдк кайрат нуртас бесплатно</string>
</Texts>
<To>en</To>
</TranslateArrayRequest>
RESPONSE XML
<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>16</a:int>
</OriginalTextSentenceLengths>
<State/>
<TranslatedText>BK Aziza and Rinat</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>18</a:int>
</TranslatedTextSentenceLengths>
</TranslateArrayResponse>
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>43</a:int> </OriginalTextSentenceLengths>
<State/>
<TranslatedText>Kairat kajdk Qaeda nurtas download free</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>39</a:int></TranslatedTextSentenceLengths>
</TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>

There are two ways to relate the translated text to the original text:
Length of the original text; and
Order in the XML file
Relating by length is probably unreliable, because there is a significant chance that two or more phrases have the same number of characters.
So it comes down to order. I think it is relatively safe to assume that the files were processed and written in the same order, so I'll show you a way to relate the phrases using the order of the XML files.
This is relatively simple. We iterate through the trees and collect the phrases in lists. Because of the structure of the translated XML, we also need to grab the root's namespace:
import re
import xml.etree.ElementTree as ElementTree
def map_translations(origin_file, translate_file):
    origin_tree = ElementTree.parse(origin_file)
    origin_root = origin_tree.getroot()
    origin_text = [string.text for text_elem in origin_root.iter('Texts')
                   for string in text_elem]
    translate_tree = ElementTree.parse(translate_file)
    translate_root = translate_tree.getroot()
    namespace = re.match('{.*}', translate_root.tag).group()
    translate_text = [text.text for text in translate_root.findall(
        './/{}TranslatedText'.format(namespace))]
    return dict(zip(origin_text, translate_text))
origin_file = 'some_file_path.xml'
translate_file = 'some_other_path.xml'
mapping = map_translations(origin_file, translate_file)
print(mapping)
Update
The above code works on Python 2.7+. For Python 2.6 it changes slightly:
ElementTree objects do not have an iter function. Instead they have a getiterator function.
Change the appropriate line above to this:
origin_text = [string.text for text_elem in origin_root.getiterator('Texts')
               for string in text_elem]
XPath syntax is (most likely) not supported. To get down to the TranslatedText nodes we need to use the same strategy as above.
Change the appropriate line above to this:
translate_text = [string.text for text in translate_root.getiterator(
                      '{0}TranslateArrayResponse'.format(namespace))
                  for string in text.getiterator(
                      '{0}TranslatedText'.format(namespace))]
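One design point worth considering: a dict keyed by the original phrase silently collapses repeated phrases, and zip stops at the shorter list without complaint. The sketch below (Python 3) keeps the order-based pairing but returns a list of pairs and checks the counts first. REQUEST_XML and RESPONSE_XML are trimmed-down, hypothetical versions of the documents from the question, and pair_translations is an illustrative name.

```python
import xml.etree.ElementTree as ET

ARRAYS_NS = "http://schemas.microsoft.com/2003/10/Serialization/Arrays"
RESPONSE_NS = "http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2"

# Trimmed-down, hypothetical request and response documents.
REQUEST_XML = """<TranslateArrayRequest>
  <Texts>
    <string xmlns="%s">вк азиза и ринат</string>
    <string xmlns="%s">скачать бесплатно</string>
  </Texts>
</TranslateArrayRequest>""" % (ARRAYS_NS, ARRAYS_NS)

RESPONSE_XML = """<ArrayOfTranslateArrayResponse xmlns="%s">
  <TranslateArrayResponse><TranslatedText>BK Aziza and Rinat</TranslatedText></TranslateArrayResponse>
  <TranslateArrayResponse><TranslatedText>download free</TranslatedText></TranslateArrayResponse>
</ArrayOfTranslateArrayResponse>""" % RESPONSE_NS


def pair_translations(request_xml, response_xml):
    """Pair each original phrase with its translation by document order."""
    request_root = ET.fromstring(request_xml)
    response_root = ET.fromstring(response_xml)
    # Children of <Texts> are the namespaced <string> elements.
    originals = [s.text for texts in request_root.iter("Texts") for s in texts]
    # ElementTree spells a namespaced tag as "{namespace}localname".
    translated = [
        t.text for t in response_root.iter("{%s}TranslatedText" % RESPONSE_NS)
    ]
    if len(originals) != len(translated):
        raise ValueError("request and response contain different numbers of phrases")
    return list(zip(originals, translated))


pairs = pair_translations(REQUEST_XML, RESPONSE_XML)
for original, translation in pairs:
    print(original, "->", translation)
```

Here the response namespace is hard-coded as an assumption; extracting it from the root tag with a regular expression, as in the answer above, works equally well.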

xml parsing error special characters

I have the following XML that I want to parse with the xml.dom.minidom module:
<?xml version="1.0" encoding="UTF-8"?>
<RootTag>
<InnerTag>
<MyValue>"< here is special char."</MyValue>
</InnerTag>
</RootTag>
I have the following snippet for parsing the above XML:
import xml.dom.minidom
xml.dom.minidom.parse('input_xml')
But I get the following error:
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 4, column 26
The error occurs only when the MyValue element contains '&' or '<'.
How can I resolve this issue?
I do not want to change my XML by using escape sequences such as &lt;, and I want to keep the double quotes ("").

Your example is not well-formed XML. A literal < is not allowed anywhere in XML other than in the tags themselves. Your data needs to be wrapped in a CDATA section or escaped as &lt; (and & as &amp;):
<![CDATA[< here is special char.]]>
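Both options can be checked with the standard library. The sketch below (Python 3) builds the document once with xml.sax.saxutils.escape and once with a CDATA section, then reads the value back with minidom; in both cases the parsed text is the original string, double quotes included. The names raw_value, escaped_doc, cdata_doc and read_value are illustrative only.

```python
import xml.dom.minidom
from xml.sax.saxutils import escape

# The value the asker wants to store, including the double quotes.
raw_value = '"< here is special char."'

# Option 1: escape the markup characters (&, <, >) when writing the document.
escaped_doc = (
    "<RootTag><InnerTag><MyValue>%s</MyValue></InnerTag></RootTag>"
    % escape(raw_value)
)

# Option 2: wrap the raw text in a CDATA section.
# This only works as long as the text does not contain "]]>".
cdata_doc = (
    "<RootTag><InnerTag><MyValue><![CDATA[%s]]></MyValue></InnerTag></RootTag>"
    % raw_value
)


def read_value(xml_text):
    """Return the text content of the first MyValue element."""
    dom = xml.dom.minidom.parseString(xml_text)
    node = dom.getElementsByTagName("MyValue")[0]
    # An element can hold several text or CDATA child nodes; join them all.
    return "".join(child.data for child in node.childNodes)


print(read_value(escaped_doc))
print(read_value(cdata_doc))
```

Either way the file itself has to change; an XML parser will not accept the unescaped form.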

How to add a newline to a long multi-line string?

I have a string like this :
my_xml = '''
<?xml version="1.0" encoding="utf-8" ?> \n
<entries> \n
<entry>1</entry> \n
<entry>2</entry> \n
</entries>
'''
return HttpResponse(my_xml)
This is my output; it was printed without newlines and without the XML tags:
1 2
How do I add a newline in Python, and make the browser interpret the XML tags and display them?

You are looking at the output in a browser; HTML browsers consider newlines whitespace, collapsing successive whitespace characters to form one space.
Your browser is interpreting the response as HTML-formatted data; use <br/> tags instead:
s = """ this is line 1<br/>
this is line 2<br/>
and other lines ..."""
If you expected to see just newlines, look at the response source code instead; the newlines are there.
If you wanted to see XML output in a browser, you should set a content type header (text/xml) so the browser knows you are sending XML instead of HTML:
return HttpResponse(s, content_type='text/xml') # Assumes you are using Django
Your browser will use a default stylesheet to display XML data (usually as a tree with collapsible sections). You can use an XML stylesheet (XSLT) to override that behaviour. Add a stylesheet header:
<?xml-stylesheet type="text/xsl" href="script.xsl" ?>
The browser will fetch the named stylesheet and apply it to your XML.

Empty lines while using minidom.toprettyxml

I've been using minidom.toprettyxml to prettify my XML file.
When I create the XML file and use this method, everything works great. But if I use it after modifying the XML file (for example, after adding nodes) and then write it back, I get empty lines; each time I update the file, I get more and more empty lines.
My code:
file.write(prettify(xmlRoot))
def prettify(elem):
    rough_string = xml.tostring(elem, 'utf-8')  # xml is ElementTree
    reparsed = mini.parseString(rough_string)  # mini is minidom
    return reparsed.toprettyxml(indent=" ")
And the result:
<?xml version="1.0" ?>
<testsuite errors="0" failures="3" name="TestSet_2013-01-23 14_28_00.510935" skip="0" tests="3" time="142.695" timestamp="2013-01-23 14:28:00.515460">
<testcase classname="TC test" name="t1" status="Failed" time="27.013"/>
<testcase classname="TC test" name="t2" status="Failed" time="78.325"/>
<testcase classname="TC test" name="t3" status="Failed" time="37.357"/>
</testsuite>
Any suggestions?
Thanks.

I found a solution here: http://code.activestate.com/recipes/576750-pretty-print-xml/
Then I modified it to take a string instead of a file.
from xml.dom.minidom import parseString
pretty_print = lambda data: '\n'.join([line for line in parseString(data).toprettyxml(indent=' '*2).split('\n') if line.strip()])
Output:
<?xml version="1.0" ?>
<testsuite errors="0" failures="3" name="TestSet_2013-01-23 14_28_00.510935" skip="0" tests="3" time="142.695" timestamp="2013-01-23 14:28:00.515460">
<testcase classname="TC test" name="t1" status="Failed" time="27.013"/>
<testcase classname="TC test" name="t2" status="Failed" time="78.325"/>
<testcase classname="TC test" name="t3" status="Failed" time="37.357"/>
</testsuite>
This may help you work it into your function a little more easily:
def new_prettify():
    # CONTENT is the XML string to format
    reparsed = parseString(CONTENT)
    print '\n'.join([line for line in reparsed.toprettyxml(indent=' '*2).split('\n') if line.strip()])

I found an easy solution for this problem: just change the last line of your prettify() so that it becomes:
def prettify(elem):
    rough_string = xml.tostring(elem, 'utf-8')  # xml is ElementTree
    reparsed = mini.parseString(rough_string)  # mini is minidom
    return reparsed.toprettyxml(indent=" ", newl='')

Use this to resolve the problem with the extra lines:
toprettyxml(indent=' ', newl='\r', encoding="utf-8")

I am having the same issue with Python 2.7 (32-bit) on a Windows 10 machine. The issue seems to be that when Python parses XML text into an ElementTree object, it stores the line feeds between tags in the "text" or "tail" attribute of each element.
This script removes such line break characters:
import re
def removeAnnoyingLines(elem):
    hasWords = re.compile("\\w")
    for element in elem.iter():
        # Clear text/tail values that contain no word characters
        if not re.search(hasWords,str(element.tail)):
            element.tail=""
        if not re.search(hasWords,str(element.text)):
            element.text = ""
Use this function before "pretty-printing" your tree:
removeAnnoyingLines(element)
myXml = xml.dom.minidom.parseString(xml.etree.ElementTree.tostring(element))
print myXml.toprettyxml()
It worked for me. I hope it works for you!

Here's a Python 3 solution that gets rid of the ugly newline issue (tons of whitespace), and it only uses the standard library, unlike most other implementations.
import xml.etree.ElementTree as ET
import xml.dom.minidom
def pretty_print_xml_given_root(root, output_xml):
    """
    Useful for when you are editing xml data on the fly
    """
    xml_string = xml.dom.minidom.parseString(ET.tostring(root)).toprettyxml()
    # Remove the blank lines. Join with "\n" rather than os.linesep: the file is
    # opened in text mode, which already translates "\n" to the platform line ending.
    xml_string = "\n".join([s for s in xml_string.splitlines() if s.strip()])
    with open(output_xml, "w") as file_out:
        file_out.write(xml_string)
def pretty_print_xml_given_file(input_xml, output_xml):
    """
    Useful for when you want to reformat an already existing xml file
    """
    tree = ET.parse(input_xml)
    root = tree.getroot()
    pretty_print_xml_given_root(root, output_xml)
I found how to fix the common newline issue here.

The problem is that minidom doesn't handle newline characters well (on Windows).
It doesn't need them anyway, so the solution is to remove them from the string. Replace
reparsed = mini.parseString(rough_string)  # mini is minidom
with
reparsed = mini.parseString(rough_string.replace('\n',''))  # mini is minidom
But be aware that this solution only works on Windows.

Since minidom's toprettyxml inserts too many lines, my solution was to delete the lines that have no useful data in them, by checking whether each line contains at least one '<' character (there may be a better idea). This worked perfectly for a similar issue I had (on Windows).
text = md.toprettyxml() # get the prettyxml string from minidom Document md
# text = text.replace(' ', '\t') # for those using tabs :)
spl = text.split('\n') # split lines into a list
spl = [i for i in spl if '<' in i] # keep only element with data inside
text = '\n'.join(spl) # join again all elements of the filtered list into a string
# write the result to file (I use codecs because I needed the utf-8 encoding)
import codecs # if not imported yet (just to show this import is needed)
with codecs.open('yourfile.xml', 'w', encoding='utf-8') as f:
    f.write(text)
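To see why the blank lines pile up and why filtering them is enough, here is a small self-contained sketch for Python 3. It simulates the "modify and write back" cycle from the question in memory: the pretty-printed output is parsed again, a node is added, and the tree is prettified a second time. The whitespace that the first pass wrote between the tags comes back as text nodes, and toprettyxml writes those out on lines of their own, so dropping whitespace-only lines keeps the result stable. The function name prettify mirrors the question; the sample elements are made up.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET


def prettify(elem, indent="  "):
    """Pretty-print an ElementTree element without accumulating blank lines."""
    rough = ET.tostring(elem, encoding="unicode")
    pretty = xml.dom.minidom.parseString(rough).toprettyxml(indent=indent)
    # Indentation from an earlier pass is re-emitted as whitespace-only
    # lines; keep only the lines that carry content.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


# First pass: a freshly built tree.
root = ET.fromstring('<testsuite name="demo"><testcase name="t1"/></testsuite>')
first = prettify(root)

# Second pass: re-read the pretty output, add a node, prettify again.
reparsed = ET.fromstring(first)
ET.SubElement(reparsed, "testcase", name="t2")
second = prettify(reparsed)

print(second)
```

One caveat of the line filter: it also removes whitespace-only lines that sit inside multi-line text content, so it is best suited to data-style XML like the test report above.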

How to read special characters of a XML file with minidom

When reading data from an XML file, I cannot get the right string.
My XML file looks like this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<Root>
<Name>aa é bb</Name>
</Root>
I would like to read the <Name> tag correctly.
So I tried this command:
NameValue = Item.getElementsByTagName("Name")[0].childNodes[0].data
Which returns u'aa \xc3\xa9 bb' in NameValue.
So how can I get u'aa é bb' or 'aa é bb' in NameValue ?
I have tried encode and decode functions without success.
I would like to do that with Python 2.7.

OK, I have it.
I managed to do it with:
NameValue = unicode(Item.getElementsByTagName("Name")[0].childNodes[0].data.encode("latin-1"), "utf-8")
Thanks for your help, fanlix.
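A likely explanation, offered as an assumption since the file's actual bytes are not shown: the file was saved as UTF-8 while its declaration says ISO-8859-1, so the parser decoded the two UTF-8 bytes of "é" (0xC3 0xA9) as two separate Latin-1 characters. Encoding the garbled text back to Latin-1 recovers the original bytes, which can then be decoded as UTF-8. The sketch below shows the same round trip in Python 3; repair_mojibake is a hypothetical helper name.

```python
# What the parser returned: the UTF-8 bytes of "é" read as Latin-1 characters.
garbled = "aa \u00c3\u00a9 bb"


def repair_mojibake(text):
    """Undo a UTF-8 payload that was mistakenly decoded as Latin-1."""
    # Latin-1 maps code points 0-255 one-to-one onto bytes, so encoding
    # restores the original bytes; decoding those as UTF-8 gives the text.
    return text.encode("latin-1").decode("utf-8")


fixed = repair_mojibake(garbled)
print(fixed)
```

The cleaner long-term fix is to make the declaration and the real encoding of the file agree, so that no repair step is needed after parsing.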
