How to read special characters of an XML file with minidom - python

When reading data from an XML file, I cannot get the right string.
My XML file looks like this:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<Root>
<Name>aa é bb</Name>
</Root>
I would like to read the content of the <Name> element correctly.
So I tried this command:
NameValue = Item.getElementsByTagName("Name")[0].childNodes[0].data
Which returns u'aa \xc3\xa9 bb' in NameValue.
So how can I get u'aa é bb' or 'aa é bb' in NameValue ?
I have tried encode and decode functions without success.
I would like to do that with Python 2.7.

OK, I have it.
I managed to do it with:
NameValue = unicode(Item.getElementsByTagName("Name")[0].childNodes[0].data.encode("latin-1"), "utf-8")
Thanks for your help fanlix
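The round trip above presumably works because the bytes in the file are really UTF-8 even though the declaration says ISO-8859-1, so each UTF-8 byte of "é" was decoded as a separate Latin-1 character. A minimal sketch of the same repair in Python 3 syntax (the question uses Python 2.7, where unicode(..., "utf-8") does the decoding step); the variable names are illustrative only:

```python
# Text as the parser returned it: the two UTF-8 bytes of "é" (0xC3 0xA9)
# were each decoded as a separate Latin-1 character.
garbled = "aa \xc3\xa9 bb"

# Latin-1 maps code points 0-255 straight back to the same byte values,
# so encoding recovers the original bytes, which then decode as UTF-8.
raw_bytes = garbled.encode("latin-1")
repaired = raw_bytes.decode("utf-8")

print(repaired)
```

If the file can be changed, correcting its declared encoding (or re-saving it in the declared encoding) would avoid the need for this workaround.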

Related

Python code to remove XML Declaration tag

I have an XML file like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<product_info article_id="0006303562330" group_id="0006303562310" vendor_id="0006303562321"
subgroup_id="0006303562313">
<available>
...
Using pure Python, I want to get this:
<product_info article_id="0006303562330" group_id="0006303562310" vendor_id="0006303562321"
subgroup_id="0006303562313">
<available>
...
I get my xml code in response_xml.text (response_xml gives me Response (200)) and I have tried to do this:
response_xml = response_xml.text.replace('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>','')
but I get:
AttributeError: 'str' object has no attribute 'text'
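What the error is telling you is that response_xml is already a plain string at that point, and a str has no .text attribute (this happens, for example, if the line has already been run once, because it rebinds response_xml to the result of replace). Call replace on the string itself, or parse the XML with a parser such as BeautifulSoup or ElementTree and work with its output.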
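If the goal is only to drop the declaration from text that is already a str, a regular expression avoids hard-coding the exact attribute values. This is a sketch, assuming the declaration sits at the very start of the text; strip_xml_declaration is a hypothetical helper name, not part of any library:

```python
import re

def strip_xml_declaration(xml_text):
    """Remove a leading <?xml ... ?> declaration, if there is one."""
    # Match only at the start of the text, and only once.
    return re.sub(r'^\s*<\?xml[^>]*\?>\s*', '', xml_text, count=1)

sample = ('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
          '<product_info article_id="0006303562330">\n'
          '</product_info>')
stripped = strip_xml_declaration(sample)
print(stripped)
```

Storing the result under a new name (stripped here) rather than reassigning the response variable keeps the original object usable.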

Beautiful Soup fails to recognize UTF-8 encoding on Python 3, IPython 6 console

I am trying to read an xml document using Beautiful Soup on Python 3.6.2, IPython 6.1.0, Windows 10, and I can't get the encoding right.
Here's my test XML, saved as a file in UTF-8 encoding:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<info name="愛よ">ÜÜÜÜÜÜÜ</info>
<items>
<item thing="ÖöÖö">"23Äßßß"</item>
</items>
</root>
First check the XML using ElementTree:
import xml.etree.ElementTree as ET
def printXML(xml,indent=''):
    print(indent+str(xml.tag)+': '+(xml.text if xml.text is not None else '').replace('\n',''))
    if len(xml.attrib) > 0:
        for k,v in xml.attrib.items():
            print(indent+'\t'+k+' - '+v)
    # Iterating an Element yields its children (getchildren() was removed in Python 3.9).
    for child in xml:
        printXML(child,indent+'\t')
xml0 = ET.parse("test.xml").getroot()
printXML(xml0)
The output is correct:
root:
	info: ÜÜÜÜÜÜÜ
		name - 愛よ
	items:
		item: "23Äßßß"
			thing - ÖöÖö
Now read the same file with Beautiful Soup and pretty-print it:
import bs4
with open("test.xml") as ff:
    xml = bs4.BeautifulSoup(ff,"html5lib")
print(xml.prettify())
Output:
<!--?xml version="1.0" encoding="UTF-8"?-->
<html>
<head>
</head>
<body>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
</body>
</html>
This is just wrong. Passing an explicit encoding with bs4.BeautifulSoup(ff,"html5lib",from_encoding="UTF-8") doesn't change the result.
Doing
print(xml.original_encoding)
outputs
None
So Beautiful Soup is apparently unable to detect the original encoding, even though the file is encoded in UTF-8 (according to Notepad++), the XML declaration says UTF-8 as well, and I do have chardet installed as the documentation recommends.
Am I making a mistake here? What could be causing this?
EDIT:
When I invoke the code without the "html5lib" argument I get this warning:
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").
This usually isn't a problem, but if you run this code on another system, or in a different virtual environment,
it may use a different parser and behave differently.
The code that caused this warning is on line 241 of the file C:\Users\My.Name\AppData\Local\Continuum\Anaconda2\envs\Python3\lib\site-packages\spyder\utils\ipython\start_kernel.py.
To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "html5lib")
markup_type=markup_type))
EDIT 2:
As suggested in a comment I tried bs4.BeautifulSoup(ff,"html.parser"), but the problem remains.
Then I installed lxml and tried bs4.BeautifulSoup(ff,"lxml-xml"), still the same output.
What also strikes me as odd is that even when specifying an encoding like bs4.BeautifulSoup(ff,"lxml-xml",from_encoding='UTF-8') the value of xml.original_encoding is None contrary to what is written in the doc.
EDIT 3:
I put my xml contents into a string
xmlstring = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><info name=\"愛よ\">ÜÜÜÜÜÜÜ</info><items><item thing=\"ÖöÖö\">\"23Äßßß\"</item></items></root>"
And used bs4.BeautifulSoup(xmlstring,"lxml-xml"), now I'm getting the correct output:
<?xml version="1.0" encoding="utf-8"?>
<root>
<info name="愛よ">
ÜÜÜÜÜÜÜ
</info>
<items>
<item thing="ÖöÖö">
"23Äßßß"
</item>
</items>
</root>
So it seems something is wrong with the file after all.
Found the error: I have to specify the encoding when opening the file:
with open("test.xml",encoding='UTF-8') as ff:
    xml = bs4.BeautifulSoup(ff,"html5lib")
As I'm on Python 3, I thought the encoding defaulted to UTF-8, but it turned out to be system-dependent, and on my system it's cp1252.
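The effect can be reproduced with the standard library alone. The sketch below writes a small UTF-8 file to a temporary directory (the path and variable names are made up for the illustration), then reads it back once with the right encoding and once with cp1252, and also prints the default that open() uses when no encoding is passed:

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given (system-dependent).
print(locale.getpreferredencoding(False))

text = "ÜÜÜÜÜÜÜ"
path = os.path.join(tempfile.mkdtemp(), "test.xml")

# Write the sample as UTF-8, as the editor in the question did.
with open(path, "w", encoding="utf-8") as ff:
    ff.write(text)

# Reading with the matching encoding returns the original text.
with open(path, encoding="utf-8") as ff:
    read_correctly = ff.read()

# Reading the same bytes as cp1252 turns each "Ü" into two characters.
with open(path, encoding="cp1252") as ff:
    read_garbled = ff.read()

print(read_correctly)
print(read_garbled)
```

Because the file in the question was opened in text mode, Beautiful Soup received text that had already been decoded, which would also explain why from_encoding had no effect and original_encoding stayed None. Opening the file in binary mode ("rb") and letting the parser detect the encoding is another option.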

Getting unicode error while parsing xml file

I have a directory of XML files, where each XML file has this form:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>Brand</word>
<lemma>brand</lemma>
<CharacterOffsetBegin>0</CharacterOffsetBegin>
<CharacterOffsetEnd>5</CharacterOffsetEnd>
<POS>NN</POS>
<NER>O</NER>
</token>
<token id="2">
<word>Blogs</word>
<lemma>blog</lemma>
<CharacterOffsetBegin>6</CharacterOffsetBegin>
<CharacterOffsetEnd>11</CharacterOffsetEnd>
<POS>NNS</POS>
<NER>O</NER>
</token>
<token id="3">
<word>Capture</word>
<lemma>capture</lemma>
<CharacterOffsetBegin>12</CharacterOffsetBegin>
<CharacterOffsetEnd>19</CharacterOffsetEnd>
<POS>VBP</POS>
<NER>O</NER>
</token>
I am parsing each XML file, storing the words between the <word> tags, and then finding the top 100 words.
I am doing it like this:
import os
from collections import Counter
from bs4 import BeautifulSoup
def find_top_words(xml_directory):
    file_list = []
    temp_list=[]
    file_list2=[]
    for dir_file in os.listdir(xml_directory):
        dir_file_path = os.path.join(xml_directory, dir_file)
        if os.path.isfile(dir_file_path):
            with open(dir_file_path) as f:
                page = f.read()
                soup = BeautifulSoup(page,"xml")
                for word in soup.find_all('word'):
                    file_list.append(str(word.string.strip()))
            f.close()
    for element in file_list:
        s = element.lower()
        file_list2.append(s)
    counts = Counter(file_list2)
    for w in sorted(counts, key=counts.get, reverse=True):
        temp_list.append(w)
    return temp_list[:100]
But I'm getting this error:
  File "prac31.py", line 898, in main
    v = find_top_words('/home/xyz/xml_dir')
  File "prac31.py", line 43, in find_top_words
    file_list.append(str(word.string.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 2: ordinal not in range(128)
What does this mean and how do I fix it?
Don't use BeautifulSoup, it's totally deprecated. Why not the standard library? If you want something more complex for XML handling there is lxml (but I am pretty sure that you don't need it).
It will solve your problem easily.
Edit:
Forget the previous answer, it was bad -_-
Your problem is str(my_string) in Python 2 when my_string contains non-ASCII characters: calling str() on a unicode string implicitly tries to encode it as ASCII. Use the encode('utf-8') method instead.
The str() function encodes with the ASCII codec, and because word.string.strip() returns a non-ASCII character somewhere in your XML files, you get this error. The solution is to use:
file_list.append(word.string.strip().encode('utf-8'))
and to get the Unicode value back you need to do something like:
for item in file_list:
    print item.decode('utf-8')
Hope it helps.
In this line of code:
file_list.append(str(word.string.strip()))
why are you using str? The data is Unicode, and you can append unicode strings to a list. If you need a bytestring, then you can use word.string.strip().encode('utf8') instead.
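For comparison, here is a sketch of the same word count in Python 3, where str is already Unicode and no str() or encode() call is needed. It swaps in the standard library's ElementTree for BeautifulSoup and takes XML documents as bytes rather than a directory, so that it is self-contained; top_words and SAMPLE are hypothetical names made up for this example:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def top_words(xml_documents, n=100):
    """Return the n most frequent lower-cased <word> values."""
    counts = Counter()
    for xml_bytes in xml_documents:
        root = ET.fromstring(xml_bytes)
        for word in root.iter("word"):
            if word.text:
                # Text stays Unicode throughout; no ASCII conversion happens.
                counts[word.text.strip().lower()] += 1
    return [w for w, _ in counts.most_common(n)]

# Small sample with a non-ASCII word, passed as UTF-8 bytes.
SAMPLE = ("<root><word>Brand</word><word>brand</word>"
          "<word>naïve</word></root>").encode("utf-8")

result = top_words([SAMPLE], 2)
print(result)
```

Counter.most_common replaces the manual sort over counts.get in the question; the ordering by frequency is the same idea.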

How to add a newline to a long multi-line string?

I have a string like this :
my_xml = '''
<?xml version="1.0" encoding="utf-8" ?> \n
<entries> \n
<entry>1</entry> \n
<entry>2</entry> \n
</entries>
'''
return HttpResponse(my_xml)
This is my output; it was printed without newlines and without the XML tags:
1 2
How can I add newlines in Python and make the browser interpret the XML tags and display them?
You are looking at the output in a browser; HTML browsers consider newlines whitespace, collapsing successive whitespace characters to form one space.
Your browser is interpreting the response as HTML-formatted data, use <br/> tags instead:
s = """ this is line 1<br/>
this is line 2<br/>
and other lines ..."""
If you expected to see just newlines, look at the response source code instead; the newlines are there.
If you wanted to see XML output in a browser, you should set a content type header (text/xml) so the browser knows you are sending XML instead of HTML:
return HttpResponse(s, content_type='text/xml') # Assumes you are using Django
Your browser will use a default stylesheet to display XML data (usually as a tree with collapsible sections). You can use an XML stylesheet (XSLT) to override that behaviour by adding a stylesheet processing instruction:
<?xml-stylesheet type="text/xsl" href="script.xsl" ?>
The browser will then fetch the named stylesheet and apply it to your XML.
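One more detail matters once the response is served as XML: the XML declaration has to be the very first thing in the document, and the string in the question starts with a newline before it. The sketch below uses the standard library's ElementTree only to check well-formedness (it is not part of the Django answer above); is_well_formed is a hypothetical helper:

```python
import xml.etree.ElementTree as ET

# The declaration starts at the first character of the string.
my_xml = """<?xml version="1.0" encoding="utf-8" ?>
<entries>
  <entry>1</entry>
  <entry>2</entry>
</entries>
"""

def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

entries = [e.text for e in ET.fromstring(my_xml).iter("entry")]
print(entries)
print(is_well_formed(my_xml))         # declaration at the start
print(is_well_formed("\n" + my_xml))  # newline before the declaration
```

Starting the triple-quoted string directly with the declaration, as above, avoids the leading newline.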

markup.py how to use ':' in a tag

Code:
import markup
url_= ('href1.com','href2.com')
mycxml=markup.page(mode='xml', case='given')
mycxml.init(encoding='utf-8')
mycxml.Collection.open()
mycxml.Items(url_)
mycxml.collection.close()
print mycxml
Output:
<?xml version='1.0' encoding='utf-8' ?>
<Collection>
<Items>href1.com</Items>
<Items>href2.com</Items>
</collection>
I would like to have a line like <Collection xmlns:p="somelines"> instead of <Collection>, but the : does not let me compile it. How can I "escape" it?
I don't know if markup.py has something built in to handle this, but it's easy to make Python accept it by unpacking a dict with the ** syntax:
import markup
url_= ('href1.com','href2.com')
mycxml=markup.page(mode='xml', case='given')
mycxml.init(encoding='utf-8')
mycxml.Collection(**{'xmlns:p': 'somelines'})
mycxml.Items(url_)
mycxml.collection.close()
print mycxml
output:
<?xml version='1.0' encoding='utf-8' ?>
<Collection xmlns:p="somelines">
<Items>href1.com</Items>
<Items>href2.com</Items>
</collection>
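The reason this works is general Python behaviour rather than anything specific to markup.py: a keyword written literally must be a valid identifier, but the keys of a dict unpacked with ** only need to be strings. A small sketch with a made-up function, render_open_tag, which is not part of markup.py:

```python
def render_open_tag(tag, **attrs):
    """Build an opening tag from arbitrary keyword attributes."""
    parts = [tag] + ['%s="%s"' % (key, value) for key, value in attrs.items()]
    return "<" + " ".join(parts) + ">"

# render_open_tag("Collection", xmlns:p="somelines") would be a SyntaxError,
# but the same key can be passed through a dict.
opening = render_open_tag("Collection", **{"xmlns:p": "somelines"})
print(opening)
```

The same trick is useful for attribute names that clash with Python keywords, such as class.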
