Extracting Raw XML via lxml etree - python

I'm trying to extract raw XML from an XML file.
So if my data is:
<xml>
... Lots of XML ...
<getThese>
<clonedKey>1</clonedKey>
<clonedKey>2</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>this is a sentence</randomStuff>
</getThese>
<getThese>
<clonedKey>6</clonedKey>
<clonedKey>8</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>more words</randomStuff>
</getThese>
... Lots of XML ...
</xml>
I can get the keys I want easily using etree:
from lxml import etree
search_me = etree.fromstring(xml_str)
search_me.findall('./xml/getThis')
But how do I get the actual content as raw XML? All the stuff I can see in the docs is for getting elements/text/attributes rather than the raw XML.
My desired output would be a list with two elements:
["<getThese>
<clonedKey>1</clonedKey>
<clonedKey>2</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>this is a sentence</randomStuff>
</getThese>",
"<getThese>
<clonedKey>6</clonedKey>
<clonedKey>8</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>more words</randomStuff>
</getThese>"]

You should be able to use tostring() to serialize the XML.
Example...
from lxml import etree
xml = """
<xml>
<getThese>
<clonedKey>1</clonedKey>
<clonedKey>2</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>this is a sentence</randomStuff>
</getThese>
<getThese>
<clonedKey>6</clonedKey>
<clonedKey>8</clonedKey>
<clonedKey>3</clonedKey>
<randomStuff>more words</randomStuff>
</getThese>
</xml>
"""
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.fromstring(xml, parser=parser)
elems = []
for elem in tree.xpath("getThese"):
elems.append(etree.tostring(elem).decode())
print(elems)
Printed output...
['<getThese><clonedKey>1</clonedKey><clonedKey>2</clonedKey><clonedKey>3</clonedKey><randomStuff>this is a sentence</randomStuff></getThese>', '<getThese><clonedKey>6</clonedKey><clonedKey>8</clonedKey><clonedKey>3</clonedKey><randomStuff>more words</randomStuff></getThese>']

Related

How can I retrieve specific information from a XML file using python?

I am working with Sentinel-2 Images, and I want to retrieve the Cloud_Coverage_Assessment from the XML file. I need to do this with Python.
Does anyone have any idea how to do this? I think I have to use the xml.etree.ElementTree but I'm not sure how?
The XML file:
<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
...
</n1:General_Info>
<n1:Geometric_Info>
...
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
...
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
...
</Technical_Quality_Assessment>
<Quality_Control_Checks>
...
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>
read xml from file
import xml.etree.ElementTree as ET
tree = ET.parse('sentinel2.xml')
root = tree.getroot()
print(root.find('.//Cloud_Coverage_Assessment').text)
..and I want to retrieve the Cloud_Coverage_Assessment
Try the below (use xpath)
import xml.etree.ElementTree as ET
xml = '''<n1:Level-1C_User_Product xmlns:n1="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://psd-14.sentinel2.eo.esa.int/PSD/User_Product_Level-1C.xsd">
<n1:General_Info>
</n1:General_Info>
<n1:Geometric_Info>
</n1:Geometric_Info>
<n1:Auxiliary_Data_Info>
</n1:Auxiliary_Data_Info>
<n1:Quality_Indicators_Info>
<Cloud_Coverage_Assessment>90.7287</Cloud_Coverage_Assessment>
<Technical_Quality_Assessment>
</Technical_Quality_Assessment>
<Quality_Control_Checks>
</Quality_Control_Checks>
</n1:Quality_Indicators_Info>
</n1:Level-1C_User_Product>'''
root = ET.fromstring(xml)
print(root.find('.//Cloud_Coverage_Assessment').text)
output
90.7287

Parsing of xml in Python

I am having issue parsing an xml result using python. I tried using etree.Element(text), but the error says Invalid tag name. Does anyone know if this is actually an xml and any way of parsing the result using a standard package? Thank you!
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
print(text)
<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G>A,NC_000013.10:g.28676705G>A,NG_007066.1:g.3001C>T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>
You're using the wrong method to parse your XML. The etree.Element
class is for creating a single XML element. For example:
>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'
As Jayvee has pointed how, to parse XML contained in a string you use
the etree.fromstring method (to parse XML content in a file you
would use the etree.parse method):
>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>
Note that because this XML document sets a default namespace, you'll
need properly set namespaces when looking for elements. E.g., this
will fail:
>>> doc.find('DocumentSummary')
>>>
But this works:
>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>
You can check if the xml is well formed by try converting it:
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
try:
doc=etree.fromstring(text)
print("valid")
except:
print("not a valid xml")

Restore CDATA during lxml serialization

I know that I can preserve CDATA sections during XML parsing, using the following:
from lxml import etree
parser = etree.XMLParser(strip_cdata=False)
root = etree.XML('<root><![CDATA[test]]></root>', parser)
See APIs specific to lxml.etree
But, is there a simple way to "restore" CDATA section during serialization?
For example, by specifying a list of tag names…
For instance, I want to turn:
<CONFIG>
<BODY>This is a <message>.</BODY>
</CONFIG>
to:
<CONFIG>
<BODY><![CDATA[This is a <message>.]]></BODY>
</CONFIG>
Just by telling that BODY should contains CDATA…
Something like this?
from lxml import etree
parser = etree.XMLParser(strip_cdata=True)
root = etree.XML('<root><x><![CDATA[<test>]]></x></root>', parser)
print etree.tostring(root)
for elem in root.findall('x'):
elem.text = etree.CDATA(elem.text)
print etree.tostring(root)
Produces:
<root><x><test></x></root>
<root><x><![CDATA[<test>]]></x></root>

xml parsing with beautifulsoup4, namespaces issue

While parsing .docx file contents in form of xml (word/document.xml) with beautifulsoup4 (with lxml installed, as required) I encountered one problem. This part from xml:
...
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
...
becomes this:
...
<graphic>
<graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic>
...
Even when I just parse file and save it, without any modifications. Like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filepath_in), 'xml')
with open(filepath_out, "w+") as fd:
fd.write(str(soup))
Or parse xml from python console.
For me it looks like namespaces, declared like this, not in root document node, get eaten by parser.
Is this a bug, or feature? And is there a way to preserve these while parsing with beautifulesoup4? Or do I need to switch to something else for that?
UPDATE 1: if with some regex and text replacement I add these namespace declarations to the root document node, then beautifulsoup parses it just fine. But I'm still interested if this can be solved without modification of xml before parsing.
UPDATE 2: after playing with beutifulsoup a bit, I figured out that namespace declarations are parsed only in first occurrence. Means that if tag declares namespace, then if its children have namespace declarations, they will not be parsed. Below is code example with output to illustrate that.
from bs4 import BeautifulSoup
xmls = []
xmls.append("""<name1:tag xmlns:name1="namespace1" xmlns:name2="namespace2">
<name2:intag>
text
</name2:intag>
</name1:tag>
""")
xmls.append("""<tag>
<name2:intag xmlns:name2="namespace2">
text
</name2:intag>
</tag>
""")
xmls.append("""<name1:tag xmlns:name1="namespace1">
<name2:intag xmlns:name2="namespace2">
text
</name2:intag>
</name1:tag>
""")
for i, xml in enumerate(xmls):
print "============== xml {} ==============".format(i)
soup = BeautifulSoup(xml, "xml")
print soup
Will produce output:
============== xml 0 ==============
<?xml version="1.0" encoding="utf-8"?>
<name1:tag xmlns:name1="namespace1" xmlns:name2="namespace2">
<name2:intag>
text
</name2:intag>
</name1:tag>
============== xml 1 ==============
<?xml version="1.0" encoding="utf-8"?>
<tag>
<name2:intag xmlns:name2="namespace2">
text
</name2:intag>
</tag>
============== xml 2 ==============
<?xml version="1.0" encoding="utf-8"?>
<name1:tag xmlns:name1="namespace1">
<intag>
text
</intag>
</name1:tag>
See, how first two xmls are parsed correctly, while second declaration in third one gets eaten.
Actually this problem does not involve docx anymore. And my question is rounded to such: is this behavior hardcoded in beautifulsoup4, and if not, then how can I change it?
From the W3C recommendation:
The Prefix provides the namespace prefix part of the qualified name, and MUST be associated with a namespace URI reference in a namespace declaration.
https://www.w3.org/TR/REC-xml-names/#ns-qualnames
So I think it is the expected behaviour: discard the namespaces not declared to gracefully allow some parsing on documents that not respect the recommendation.
change this line:
soup = BeautifulSoup(open(filepath_in), 'xml')
to
soup = BeautifulSoup(open(filepath_in), 'lxml')
or
soup = BeautifulSoup(open(filepath_in), 'html.parser')

What's the best way to handle -like entities in XML documents with lxml?

Consider the following:
from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11
This is because resolve_entities=False doesn't ignore them, it just doesn't resolve them.
If I use etree.HTMLParser instead, it creates html and body tags, plus a lot of other special handling it tries to do for HTML.
What's the best way to get a â text child under the aa tag with lxml?
You can't ignore entities as they are part of the XML definition. Your document is not well-formed if it doesn't have a DTD or standalone="yes" or if it includes entities without an entity definition in the DTD. Lie and claim your document is HTML.
https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html
You can try lying and putting an XHTML DTD on your document. e.g.
from lxml import etree
try:
from StringIO import StringIO
except ImportError:
from io import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r) # '<aa> â</aa>'
#Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document to replace bogus entities with their utf-8 characters:
entities = [
(' ', u'\u00a0'),
('â', u'\u00e2'),
...
]
for before, after in entities:
x = x.replace(before, after.encode('utf8'))
Of course, this can be broken by sufficiently weird "xml" also.
Your best bet is to fix your input XML documents to be well-formed XML.
When I was trying to do something similar, I just used x.replace('&', '&') before parsing the string.

Categories

Resources