trouble parsing XML in python - python

I'm trying to query a database, then convert the file-like object it returns to an XML document. Here's what I've been doing:
>>> import urllib, xml.dom.minidom
>>> query = "http://sbol.bhi.washington.edu/openrdf-sesame/repositories/sbol_test?query=select%20distinct%20%3Fname%20%3Ffeaturename%20where%20%7B%3Fpart%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23annotation%3E%20%3Fannotation%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23status%3E%20'Available'%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Fname.%3Fannotation%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23feature%3E%20%3Ffeature.%3Ffeature%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23binding%3E%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Ffeaturename%7D"
>>> raw_result = urllib.urlopen(query)
>>> xml_result = xml.dom.minidom.parse(raw_result)
That last command gives me
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 4
Almost the same thing happens if I use xml.etree.ElementTree to do the parsing. I think they both use Expat. The weird part is, if instead of loading the file in python I just paste the query into Firefox, the resulting file can be read in perfectly well using open(path_to_file, "r").
Any ideas what this could be?
UPDATE:
This is the first line of the file:
<?xml version='1.0' encoding='UTF-8'?>
However that may not be what's in raw_result... that's what you get after downloading query-result.srx and changing the extension to .txt. The file extension doesn't matter, does it? Also, I'm pretty new to this whole XML thing: why is column 4 the 8th character?

Your server is picky about the accept header in deciding what to send back and in which format. The following should work:
In [264]: from xml.dom import minidom
In [265]: import urllib2
In [266]: req = urllib2.Request(query, headers={'Accept':'application/xml'})
In [267]: rsp = urllib2.urlopen(req)
In [268]: xml = minidom.parse(rsp)
In [269]: xml.toxml()[:64]
Out[269]: u'<?xml version="1.0" ?><sparql xmlns="http://www.w3.org/2005/spar'
Note the accept header in urllib2.Request.

Any chance you could post the XML snippet? The parser is indicating that the error is happening on the very first line. My guess is the formatting is off or being reported incorrectly, which is causing Expat to raise an exception right off the bat.
My guess is that the first line violates one of the "well-formed XML" rules anyway. For reference, you might compare against http://en.wikipedia.org/wiki/XML

Looks like something is wrong with your XML file, right about line 1, column 4.
I tried this, and what I got doesn't look like XML to me. Here are the first eight characters, as Alex suggested:
>>> raw_result.read(8)
'BRTR\x00\x00\x00\x03'
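A quick way to catch this before handing the response to a parser is to look at the first few bytes. The following is only a sketch (Python 3); the helper name looks_like_xml is made up for illustration, and the check is a heuristic, not a well-formedness test:

```python
UTF8_BOM = b"\xef\xbb\xbf"

def looks_like_xml(head):
    """Heuristic: an XML document starts with '<' after an optional BOM and whitespace."""
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    return head.lstrip().startswith(b"<")

# The bytes shown above are clearly not XML, while an XML declaration passes.
binary_head = b"BRTR\x00\x00\x00\x03"
xml_head = b"<?xml version='1.0' encoding='UTF-8'?>"
print(looks_like_xml(binary_head), looks_like_xml(xml_head))
```

If the check fails, the server most likely answered in a different format than the one Firefox received.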

It seems that the RDF server is delivering plain text to your urllib.urlopen call.
With the right header set,
Accept: application/sparql-results+xml, */*;q=0.5
you should be able to get the XML response. You will have to read the openRDF protocol specification for details, since openRDF supports more than one result format.
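For Python 3, where urllib2 became urllib.request, the same idea might look like the sketch below. The function names build_request and fetch_xml and the example.org URL are hypothetical, used only for illustration; the Accept value is the one suggested above.

```python
import urllib.request
from xml.dom import minidom

SPARQL_XML = "application/sparql-results+xml"

def build_request(url, accept=SPARQL_XML):
    """Create a request that explicitly asks the server for XML results."""
    return urllib.request.Request(url, headers={"Accept": accept})

def fetch_xml(url):
    """Send the request and parse the response body with minidom."""
    with urllib.request.urlopen(build_request(url)) as rsp:
        return minidom.parse(rsp)

# Building the request does not touch the network; fetch_xml(url) would.
req = build_request("http://example.org/repositories/demo?query=select")
print(req.get_header("Accept"))
```

Calling fetch_xml(query) with the query URL from the question should then return a DOM document, assuming the server honours the header.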

Related

Beautifulsoup4, parsing Tableau XML file, and writing to file

I'm having an issue where I'm using BeautifulSoup to parse the XML generated from a Tableau workbook, and when I write the results to file it doesn't behave as expected. I chose bs4 and its standard XML parser because I find it easiest for my brain to comprehend and I don't need the speed of the lxml parser/package.
Background: I have a calculated field in my Tableau workbook that will programmatically change during publish depending on the server and site location that template workbook will go to. I've already gone through and built some functions and scripted out everything I need to get the data to do this, but when my script writes the XML to file it adds some encodings for the ampersand. This results in the file being valid and able to be opened in Tableau, but the field is considered invalid, despite looking like it is valid. I'm thinking the XML is somehow getting malformed somewhere in my process.
Code so far for where I think the issue is occurring:
from bs4 import BeautifulSoup
with open(Script_config['local_file_location'], 'r') as twb:
    bs_content = BeautifulSoup(twb, 'xml')
# formula_final below comes from another script that handles getting the data I need to programmatically generate the formula I need.
# Here is what I use to generate the bulk of the formula for Tableau
# 'When &apos;[{}]&apos; then {} '.format(rows['Column_Name'], rows['Formatted_ColumnName']))
# Does some other stuff and slaps together the formula I need as a string that can be written into my XML
# Verified that my result is coming over correctly and only changes once I do the replacement here and/or the writing of the file.
for calculation in bs_content.find_all('column', {'caption': 'Group By', 'datatype':'string', 'name':'[Calculation_12345678910]'}):
    calculation.find('calculation')['formula'] = formula_final
with open('test.twb', 'w') as file:
    file.write(str(bs_content))
Sample XML:
<?xml version="1.0" encoding="utf-8"?>
<workbook source-build="2021.1.4 (20211.21.0712.0907)" source-platform="win" version="18.1" xml:base="https://localhost" xmlns:user="http://www.tableausoftware.com/xml/user">
...
<column caption="Group By" datatype="string" name="[Calculation_12345678910]" role="dimension" type="nominal">
<calculation class="tableau" formula="Case [Parameters].[Location External ID Parameter] When &apos;[Territory]&apos; then [Territory] End"/>
</column>
Problem:
In the sample XML, Tableau is expecting the XML to be formatted without the &amp; in front of the apos;. It should just read as &apos;.
What I've tried:
Thinking that I could just escape the & character I put the necessary slashes in place to escape it before the apos; portion, but to no avail I can't figure out how to get my XML to be formed so that it doesn't always put the ampersand code as part of the other special characters in my XML.
Any help would be much appreciated!
Good problem description.
Your problem is known as 'double escaping'. Your program is reading data which has already been serialized by an XML processor. That's why it contains &apos;[{}]&apos; and not '[{}]'
I think your program reads that XML value from a file as a simple string and assigns it to the value of a tag. But when BeautifulSoup's XML processor encounters the & in the tag value, it must replace it with &amp;. So you end up with &amp;apos; instead of &apos; in the XML output.
The quick and dirty solution is to write some code to replace all XML entities with the equivalent text. A better solution would be to read the XML data using an XML parser - that way, your program will receive the intended string value automatically.
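The effect can be reproduced with a small sketch. It uses the standard library's ElementTree instead of BeautifulSoup so that it is self-contained (that swap is an assumption of this example), and the helper name serialize_formula is made up for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# A formula string that was built with XML entities already in it.
already_escaped = "When &apos;[Territory]&apos; then [Territory]"

def serialize_formula(formula):
    """Store the formula in an attribute and serialize the element."""
    elem = ET.Element("calculation")
    elem.set("formula", formula)
    return ET.tostring(elem, encoding="unicode")

# The serializer escapes the '&' again: this is the double escaping.
double_escaped = serialize_formula(already_escaped)

# Turning the entities back into plain text first avoids the problem.
plain = unescape(already_escaped, {"&apos;": "'", "&quot;": '"'})
correct = serialize_formula(plain)

print(double_escaped)
print(correct)
```

In the asker's script, the simplest fix is probably to build the formula with a literal ' in the format string and let the XML library do the escaping when it writes the file.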

Python ElementTree generate not well formed XML file with special character '\x0b'

I used ElementTree to generate XML containing the special character '\x0b', then used minidom to parse it. It throws a "not well-formed" error.
import xml.etree.ElementTree as ET
from xml.dom import minidom
root = ET.Element('root')
root.text='\x0b'
xml = ET.tostring(root, 'UTF-8')
print(xml)
pretty_tree = minidom.parseString(xml)
Generated XML: <root>\x0b</root>
Error:
Traceback (most recent call last):
File "testXml.py", line 7, in <module>
pretty_tree = minidom.parseString(xml)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/minidom.py", line 1968, in parseString
return expatbuilder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 925, in parseString
return builder.parseString(string)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 6
This behaviour has been raised as a bug in the past and resolved as "won't fix".
The author of the ElementTree module commented
For ET, [this behaviour is] very much on purpose. Validating data provided by every
single application would kill performance for all of them, even if only a
small minority would ever try to serialize data that cannot be represented
in XML.
The closing comment (by the maintainer of lxml, who is also a Python core dev) includes these observations:
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
...
In the end, users who really care about correct output should run some kind of schema validation over it after serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
...
So in summary, ET.tostring can generate XML which is not well-formed, and this is by design. If necessary, the output can be parsed to check that it is well-formed, using ET.fromstring or another parser. Alternatively, lxml can be used instead of ElementTree.
\x0b is an XML restricted character. There is a good description of valid and restricted characters in the answers to this question.
As a workaround for myself, I wrote a helper method to clean the restricted chars before saving to XML model:
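import re
def clean(text):
    # Drop everything outside the XML 1.0 character range; astral code points need \UXXXXXXXX
    return re.sub(r'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)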
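Putting the two suggestions together (strip restricted characters, then re-parse the output to confirm it is well-formed) could look like this sketch. The names strip_invalid_xml_chars and is_well_formed are illustrative, not part of any library:

```python
import re
import xml.etree.ElementTree as ET

# Characters allowed by the XML 1.0 "Char" production; everything else is removed.
_INVALID_XML_CHARS = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    return _INVALID_XML_CHARS.sub("", text)

def is_well_formed(xml_bytes):
    """Re-parse serialized output; ElementTree does not validate on write."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True

root = ET.Element("root")
root.text = "a\x0bb"
broken = ET.tostring(root, "UTF-8")

root.text = strip_invalid_xml_chars(root.text)
fixed = ET.tostring(root, "UTF-8")

print(is_well_formed(broken), is_well_formed(fixed))
```

Whether silently dropping the characters is acceptable depends on the data; replacing them with a placeholder is an equally valid choice.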

Python ElementTree ParseError from iterparse when reaching escape character (XML)

This question appears related to this one from 2013, but it didn't help me.
I'm about to parse a large (2GB) XML file, and plan to do it with Python 3.5.2 and ElementTree. I'm new to Python, but it works well until reaching any escape character, such as:
<author>Sanjeev Sax&ouml;na</author>
returning:
test.xml
File "<string>", line unknown
ParseError: undefined entity &ouml;: line 5, column 19
My code looks something like this:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse('test_esc.xml'):
    pass  # do something with the node
What's the best way to deal with this? Parsing the unescaped 'ö' actually works fine:
<author>Sanjeev Saxöna</author>
Is there an easy way to programmatically unescape the whole XML file?
As suggested by the answer linked by Soulaimane Sahmi, I added an inline DTD to the XML file. It is maybe not the best solution out there, but it works for now.
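For illustration, here is a minimal sketch of the inline DTD approach on an in-memory document. The root element name records and the sample content are assumptions made for this example; a real file would need one ENTITY declaration per named entity it uses.

```python
import io
import xml.etree.ElementTree as ET

# The internal DTD subset declares the named entity, so the parser can expand it.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE records [
  <!ENTITY ouml "&#246;">
]>
<records>
  <author>Sanjeev Sax&ouml;na</author>
</records>"""

authors = []
for event, elem in ET.iterparse(io.BytesIO(xml_text.encode("utf-8"))):
    if elem.tag == "author":
        authors.append(elem.text)
        elem.clear()  # release the node's content once it has been handled

print(authors)
```

For a 2 GB file the declaration block would have to be added to the file itself (or streamed in front of it), since iterparse reads the document incrementally.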

Handling `&#xD;&#xA;` in Python

Problem Background:
I have an XML file that I'm importing into BeautifulSoup and parsing through. One node has the following:
<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>
Notice that the value has &#xD; and &#xA; within the text. I understand those are the XML representation of carriage return and line feed.
When I import into BeautifulSoup, the value gets converted into the following:
<DIAttribute name="ObjectDesc" value="Line1
Line2
Line3"/>
You'll notice that the &#xD;&#xA; gets converted to a newline.
My use case requires that the value remains as the original. Any idea how to get that to stay? Or convert it back?
Source Code:
python: (2.7.11)
from bs4 import BeautifulSoup #version 4.4.0
s = BeautifulSoup(open('test.xml'),'lxml-xml',from_encoding="ansi")
print s.DIAttribute
#XML file looks like
'''
<?xml version="1.0" encoding="UTF-8" ?>
<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>
'''
Notepad++ says the encoding of the source XML file is ANSI.
Things I've Tried:
I've scoured the documentation without any success.
Variations for line 3:
print s.DIAttribute.prettify('ascii')
print s.DIAttribute.prettify('windows-1252')
print s.DIAttribute.prettify('ansi')
print s.DIAttribute.prettify('utf-8')
print s.DIAttribute['value'].replace('\r','&#xD;').replace('\n','&#xA;') #This works, but it feels like a band-aid and other problems will likely remain.
Any ideas anyone? I appreciate any comments/suggestions.
Just for the record, first the libraries that do NOT handle the entity properly: BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES), lxml.html.soupparser.unescape, xml.sax.saxutils.unescape
And this is what works (in Python 2.x):
import sys
import HTMLParser
## accept file name as argument, or read stdin if nothing passed
data = len(sys.argv) > 1 and open(sys.argv[1]).read() or sys.stdin.read()
parser = HTMLParser.HTMLParser()
print parser.unescape(data)
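Going the other way, that is, writing an attribute value so that carriage returns and line feeds survive as character references, can be done with the standard library. This is a sketch for Python 3; note that xml.sax.saxutils.quoteattr emits the decimal forms &#13; and &#10;, which an XML parser treats the same as the hexadecimal forms. If a consumer insists on the exact hexadecimal spelling, a plain string replace is still needed.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

value = "Line1\r\nLine2\r\nLine3"

# quoteattr adds the surrounding quotes and escapes CR, LF and TAB
# as numeric character references.
quoted = quoteattr(value)
element = '<DIAttribute name="ObjectDesc" value=%s/>' % quoted

# Parsing the result gives back the original CR/LF pairs unchanged.
round_tripped = ET.fromstring(element).get("value")

print(element)
```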

Python Regex on File Read Input

So I'm reading from a file like so.
f = open("log.txt", 'w+r')
f.read()
So f now has a bunch of lines, but I'm mainly concerned with it having a number, and then a specific string (for example, "COMPLETE" would be the string).
How...exactly would you go about checking this?
I thought it'd be something like:
r.search(['[0-9]*'+,"COMPLETE")
but that doesn't seem to work? Maybe it's my regex that's wrong (I'm pretty terrible at it),
but basically it just needs to check the entire string (which is multiple lines and contains \n's) for a number (specifically 200) and the word COMPLETE (in caps).
edit: For reference here is what the logfile looks like
Using https
Sending install data... DONE
200 COMPLETE
<?xml version="1.0" encoding="UTF-8"?>
<SolutionsRevision version="185"/>
I just need to make sure it says "200" and COMPLETE
Regular expressions are overkill here if you're just looking for "200 COMPLETE". Just do this:
if "200 COMPLETE" in log:
    pass  # Do something
You should use Rafe's answer instead of a regex to search for "200 COMPLETE" in the file content. However, your current code won't work for reading the file: using "w+r" as the mode will truncate it. You need to do something like this:
f = open("log.txt", "r")
log = f.read()
if "200 COMPLETE" in log:
    pass  # Do something
It should be something like
import re
m = re.search(r'[0-9]+\s+COMPLETE', line)
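If the status code itself is needed rather than a yes/no answer, a regex over the whole log text is a reasonable option. A small sketch, using the sample log from the question as an inline string (in practice log would come from f.read()):

```python
import re

log = """Using https
Sending install data... DONE
200 COMPLETE
<?xml version="1.0" encoding="UTF-8"?>
<SolutionsRevision version="185"/>
"""

# re.MULTILINE lets ^ and $ match at every line boundary,
# so the status line is found anywhere in the text.
match = re.search(r"^(\d+)\s+COMPLETE$", log, re.MULTILINE)
status = int(match.group(1)) if match else None

if status == 200:
    print("install completed")
```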
