Python, extract date from XML


Apologies, my Python knowledge is pretty non-existent. I need to extract a date from some XML which is in a format similar to:
<Header>
<Version>1.0</Version>
....
<cd:Data>...</cd:Data>
.....
<cd:DateReceived>20070620171524</cd:DateReceived>
From looking around here I found something similar
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
print collection.getElementsByTagName("cd:DateReceived").item(0)
However, this only prints the element object's representation (including its memory address in hex), not its contents:
<DOM Element: cd:DateReceived at 0x1529e0>
How can I get the date 20070620171524?
I've tried using the following
#!/usr/bin/python
from xml.dom.minidom import parse
import xml.dom.minidom
# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("date.xml")
collection = DOMTree.documentElement
date = cd:DateReceived[0].firstChild.nodeValue
print date
but it gives an error as it doesn't like the "cd" part of the tag
date = cd:DateReceived[0].firstChild.nodeValue
^
SyntaxError: invalid syntax
Any help would be appreciated. Thanks!

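collection.getElementsByTagName("cd:DateReceived").item(0) returns an element node. The element's own nodeValue is None; the date is stored in its child text node, so collection.getElementsByTagName("cd:DateReceived").item(0).firstChild.nodeValue returns the text 20070620171524.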
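As a minimal sketch of that lookup, the snippet below parses a cut-down copy of the question's XML from a string and reads the text node under cd:DateReceived. The xmlns:cd declaration and its URI (http://example.com/cd) are assumptions added only so the sample parses on its own; the real file presumably declares the prefix somewhere above the shown fragment. Converting the string with datetime.strptime is an optional extra step, not something the original answer mentions.

```python
from datetime import datetime
from xml.dom import minidom

# Cut-down sample. The namespace URI is a placeholder (assumption),
# needed because minidom rejects an undeclared "cd:" prefix.
SAMPLE = """<Header xmlns:cd="http://example.com/cd">
  <Version>1.0</Version>
  <cd:DateReceived>20070620171524</cd:DateReceived>
</Header>"""


def get_date_received(xml_text):
    """Return the text inside the first cd:DateReceived element."""
    dom = minidom.parseString(xml_text)
    element = dom.getElementsByTagName("cd:DateReceived").item(0)
    # The element itself has nodeValue None; the text is in its first child.
    return element.firstChild.nodeValue


raw = get_date_received(SAMPLE)
# Optional: turn the YYYYMMDDHHMMSS string into a datetime object.
parsed = datetime.strptime(raw, "%Y%m%d%H%M%S")

print(raw)                 # 20070620171524
print(parsed.isoformat())  # 2007-06-20T17:15:24
```

Note that getElementsByTagName matches on the qualified name as written in the file, so the "cd:" prefix goes inside the string argument rather than into a Python identifier.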

Related

Parsing of xml in Python

I am having an issue parsing an XML result using Python. I tried using etree.Element(text), but the error says "Invalid tag name". Does anyone know whether this is actually XML, and whether there is a way of parsing the result using a standard package? Thank you!
import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text=response.text
print(text)
<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G>A,NC_000013.10:g.28676705G>A,NG_007066.1:g.3001C>T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>

You're using the wrong method to parse your XML. The etree.Element
class is for creating a single XML element. For example:
>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'
As Jayvee has pointed out, to parse XML contained in a string you use
the etree.fromstring function (to parse XML content in a file you
would use the etree.parse function):
>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>
Note that because this XML document sets a default namespace, you'll
need to set namespaces properly when looking for elements. E.g., this
finds nothing (find returns None):
>>> doc.find('DocumentSummary')
>>>
But this works:
>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>
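The same namespace handling can be sketched without a network call. The example below is an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml (both accept a prefix-to-URI mapping in find and findtext), and it works on a shortened copy of the response shown in the question. The prefix name docsum is arbitrary; only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this script only; the URI must match the xmlns value.
NS = {"docsum": "https://www.ncbi.nlm.nih.gov/SNP/docsum"}

# Shortened copy of the efetch response from the question.
SAMPLE = """<ExchangeSet xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum">
<DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><CHR>13</CHR></DocumentSummary>
</ExchangeSet>"""

doc = ET.fromstring(SAMPLE)

# Without a namespace the lookup silently finds nothing.
unqualified = doc.find("DocumentSummary")

# With the mapping, both the element and its children can be reached.
summary = doc.find("docsum:DocumentSummary", NS)
chrom = summary.findtext("docsum:CHR", namespaces=NS)

print(unqualified)         # None
print(summary.get("uid"))  # 1593319917
print(chrom)               # 13
```

Every step of the path needs the prefix, because the default namespace applies to all descendant elements, not just the root.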

You can check whether the XML is well formed by trying to parse it:
import requests
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
text = response.text
try:
    # fromstring raises XMLSyntaxError if the text is not well-formed XML
    doc = etree.fromstring(text)
    print("valid")
except etree.XMLSyntaxError:
    print("not a valid xml")

xml minidom - get the full content of childnodes text

I have a Test.xml file as:
<?xml version="1.0" encoding="utf-8"?>
<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>
The number of "Dir" nodes depends on user input: there may be 1, 2, 5, or 10 directories defined. With the help of @Padraic Cunningham I am able to extract the data from Test.xml using the Python code below:
from xml.dom import minidom
from StringIO import StringIO
dom = minidom.parse('Test.xml')
Src = dom.getElementsByTagName('Src')
output = ", ".join([a.childNodes[0].nodeValue for node in Src for a in node.getElementsByTagName('Dir')])
print [output]
And getting the output:
C:\User1\test1, C:\User2\log, D:\Users\Checkup, D:\Work1, E:\job1
But the expected output is:
['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']

I solved it myself:
from xml.dom import minidom
DOMTree = minidom.parse('Test0001.xml')
dom = DOMTree.documentElement
Src = dom.getElementsByTagName('Src')
for node in Src:
    output = [a.childNodes[0].nodeValue for a in node.getElementsByTagName('Dir')]
    print output
And getting output:
[u'C:\User1\test1', u'C:\User2\log', u'D:\Users\Checkup', u'D:\Work1', u'E:\job1']
I am sure there is a simpler way. Please let me know. Thanks in advance.
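One possibly simpler route, offered as a sketch and not as the approach used in the posts above, is to walk the children of Src with xml.etree.ElementTree. This avoids matching on tag names at all, so it works whether the children are called Dir1 through Dir5 or anything else. The sample is embedded as a raw string (without the XML declaration) purely so the snippet is self-contained.

```python
import xml.etree.ElementTree as ET

# Raw string so the backslashes in the Windows paths are kept literally.
SAMPLE = r"""<SetupConf>
<LocSetup>
<Src>
<Dir1>C:\User1\test1</Dir1>
<Dir2>C:\User2\log</Dir2>
<Dir3>D:\Users\Checkup</Dir3>
<Dir4>D:\Work1</Dir4>
<Dir5>E:\job1</Dir5>
</Src>
</LocSetup>
</SetupConf>"""

root = ET.fromstring(SAMPLE)

# Iterating over an element yields its direct children in document order.
dirs = [child.text for child in root.find("LocSetup/Src")]

print(dirs)
# ['C:\\User1\\test1', 'C:\\User2\\log', 'D:\\Users\\Checkup', 'D:\\Work1', 'E:\\job1']
```

The doubled backslashes in the printed list are only how Python displays string literals; each path contains single backslashes.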

Python (xml.etree) not reading XML text

I've not worked with XML before, but am having trouble with getting text out of the following XML:
<w>
<shortening>n</shortening>
ūmi
<mor type="mor">
<mw>
[extra stuff]
</mw>
<menx>rest</menx>
<menx>sleep</menx>
<gra type="gra" relation="ROOT" head="0" index="1"/>
</mor>
</w>
The Element.text property corresponding to the w tag doesn't have the text ūmi inside, instead it has None. I think this is because it is preceded by the <shortening> tag. This shouldn't be a Unicode issue, because there are plenty of other Unicode characters that read just fine (this is transliterated Hebrew).
Is there an easy way to fix this? Is this malformed XML?

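That is because the text is not stored in the text attribute of w: text only holds what comes before the first child element. ElementTree stores text that follows a child's closing tag in the tail attribute of that child, so here it belongs to the shortening element and you can access it through that node: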
import xml.etree.ElementTree as ET
from StringIO import StringIO
s = '''<w>
<shortening>n</shortening>
ūmi
<mor type="mor">
<mw>
[extra stuff]
</mw>
<menx>rest</menx>
<menx>sleep</menx>
<gra type="gra" relation="ROOT" head="0" index="1"/>
</mor>
</w>'''
tree = ET.parse(StringIO(s))
root = tree.getroot()
for i in root.iter('shortening'):
    print i.tail
Results:
ūmi
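To make the text/tail split concrete, here is a small Python 3 sketch on a condensed version of the question's XML. Whitespace between tags has been removed from the sample (an assumption made for clarity) so that the attribute values are easy to read; with the original indentation, tail would also contain the surrounding newlines and would need strip().

```python
import xml.etree.ElementTree as ET

# Condensed sample: same structure as the question, no inter-tag whitespace.
SAMPLE = (
    '<w><shortening>n</shortening>ūmi'
    '<mor type="mor"><menx>rest</menx><menx>sleep</menx></mor></w>'
)

root = ET.fromstring(SAMPLE)
shortening = root.find("shortening")

# text: characters between an element's start tag and its first child.
# tail: characters between an element's end tag and the next tag.
print(root.text)        # None
print(shortening.text)  # n
print(shortening.tail)  # ūmi

# itertext() walks text and tail values in document order.
all_text = "".join(root.itertext())
print(all_text)         # nūmirestsleep
```

If only the words directly inside w are wanted, collecting root.text plus the tail of each direct child is one way to get them without the text of nested elements.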

Parsing XML using Python minidom

<PacketHeader>
<HeaderField>
<name>number</name>
<dataType>int</dataType>
</HeaderField>
</PacketHeader>
This is my small XML file and I want to extract out the text which is within the name tag.
Here is my code snippet:-
from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
packetHeader = xmldoc.getElementsByTagName("PacketHeader")
headerField = packetHeader.getElementsByTagName("HeaderField")
for field in headerField:
    getFieldName = field.getElementsByTagName("name")
    print getFieldName
But I am getting the location but not the text.

from xml.dom import minidom
from xml.dom.minidom import parse
xmldoc = minidom.parse('sample.xml')
# find the name element, if found return a list, get the first element
name_element = xmldoc.getElementsByTagName("name")[0]
# this will be a text node that contains the actual text
text_node = name_element.childNodes[0]
# get text
print text_node.data
Please check this.
Update
By the way, I suggest ElementTree. Below is a snippet using ElementTree that does the same thing as the minidom code above:
import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")
# the tree root is the toplevel `PacketHeader` element
print tree.findtext("HeaderField/name")

A small variant of the accepted and correct answer above is:
from xml.dom import minidom
xmldoc = minidom.parse('fichier.xml')
name_element = xmldoc.getElementsByTagName('name')[0]
print name_element.childNodes[0].nodeValue
This simply uses nodeValue instead of its alias data
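The answers above read only the first name element. As a sketch of the loop the question was attempting, the snippet below iterates over every HeaderField and collects the text of its name child. The second HeaderField (with the name length) is a hypothetical addition to the sample so the loop has more than one item to show.

```python
from xml.dom import minidom

# The second HeaderField is hypothetical, added for illustration only.
SAMPLE = """<PacketHeader>
<HeaderField><name>number</name><dataType>int</dataType></HeaderField>
<HeaderField><name>length</name><dataType>int</dataType></HeaderField>
</PacketHeader>"""

dom = minidom.parseString(SAMPLE)

names = []
# getElementsByTagName returns a NodeList; call it on the document or on
# a single element, not on another NodeList.
for field in dom.getElementsByTagName("HeaderField"):
    name_element = field.getElementsByTagName("name")[0]
    # The text lives in the element's first child (a text node).
    names.append(name_element.firstChild.data)

print(names)  # ['number', 'length']
```

Printing the NodeList or an element directly shows only object representations, which is the "location" output described in the question.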

Lxml: Return a part of the xml file as string

I am parsing an XML file with the lxml parser. The file looks like this:
# some elements above
<contact>
<phonenumber>
#something
</phonenumber>
</contact>
I want to be able to return only a part of the XML file.
For example, if I am on phonenumber, I want lxml to return everything in that element as a string.
I don't want just the text between the phonenumber tags, but the entire string, tags included:
<phonenumber>something</phonenumber>
Is that possible?

To print a part of the XML tree, you can use lxml.etree.tostring. On Python 2:
In [1]: from lxml.etree import tostring, parse
In [2]: tree = parse('test.xml')
In [3]: elem = tree.xpath('//phonenumber')[0]
In [4]: print tostring(elem)
<phonenumber>
something
</phonenumber>
For more information you can refer to the "Serialisation" section of the lxml tutorial.
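For Python 3, a comparable sketch is shown below. It swaps in the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree.tostring accepts the same encoding="unicode" argument, so the call should carry over, but treat that as something to confirm against the lxml documentation. The sample document is a compact stand-in for the question's file.

```python
import xml.etree.ElementTree as ET

# Compact stand-in for the file described in the question.
SAMPLE = "<contact><phonenumber>something</phonenumber></contact>"

root = ET.fromstring(SAMPLE)
elem = root.find("phonenumber")

# encoding="unicode" makes tostring return str instead of bytes.
fragment = ET.tostring(elem, encoding="unicode")

print(fragment)  # <phonenumber>something</phonenumber>
```

If the element is followed by whitespace or other text in the source file, tostring includes that tail text after the closing tag, so strip() may be needed on the result.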
