Filtering XML in Python

I need to write a filter that discards some elements, tags and blocks from my XML files. Below you can see my XML examples and expected outputs. I am somewhat confused about the differences between element, tag and attribute in ElementTree. My test does not work!
Filter:
import xml.etree.ElementTree as xee
def test(input):
    doc=xee.fromstring(input)
    print xee.tostring(doc)
    #RemoveTimeStampAttribute
    for elem in doc.findall('Component'):
        if 'timeStamp' in elem.attrib:
            del elem.attrib['timeStamp']
    #RemoveTimeStampElements
    for elem in doc.findall('TimeStamp'):
        del elem
    print xee.tostring(doc)
    return xee.tostring(doc)

First of all, you are removing the attribute incorrectly. Check whether timeStamp is in the element's attrib dictionary, then use del to remove it:
def amdfilter(input):
    doc = xee.fromstring(input)
    for node in doc.findall('Component'):
        if 'timeStamp' in node.attrib:
            del node.attrib['timeStamp']
    return xee.tostring(doc)
Also, since you are testing only the attribute removal here, change your expectation to:
expected = '<ComponentMain><Component /></ComponentMain>'
Complete test (it passes):
import unittest
from amdfilter import *
class FilterTest(unittest.TestCase):
    def testRemoveTimeStampAttribute(self):
        input = '<?xml version="1.0"?><ComponentMain><Component timeStamp="2014"></Component></ComponentMain>'
        output = amdfilter(input)
        expected = '<ComponentMain><Component /></ComponentMain>'
        self.assertEqual(expected, output)
Note that I don't care here about the xml declaration line (it could be easily added).
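The question also tries to drop TimeStamp elements with del elem, which only unbinds the loop variable and leaves the tree untouched. Here is a minimal sketch of one way to handle both the attribute and the element, written for Python 3. The helper name strip_timestamps and the sample document are illustrative assumptions, not part of the original code:

```python
import xml.etree.ElementTree as ET

def strip_timestamps(xml_text):
    """Hypothetical helper: drop timeStamp attributes and TimeStamp elements."""
    root = ET.fromstring(xml_text)
    # Take a snapshot of all elements so the tree is not modified mid-iteration.
    for parent in list(root.iter()):
        # pop() with a default does nothing if the attribute is absent.
        parent.attrib.pop('timeStamp', None)
        for child in list(parent):
            if child.tag == 'TimeStamp':
                # An element has to be removed through its parent.
                parent.remove(child)
    return ET.tostring(root, encoding='unicode')

# Assumed sample input, loosely modelled on the test in the answer.
sample = ('<ComponentMain><Component timeStamp="2014">'
          '<TimeStamp>2014</TimeStamp><Name>a</Name>'
          '</Component></ComponentMain>')
result = strip_timestamps(sample)
print(result)
```

Note that removing an element this way also discards any tail text that follows it, which may or may not matter for your files.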

Related

How to read all DICOM attributes/tags with pydicom?

I'm trying to get a list of all the attributes (tags) of a given DICOM instance using pydicom.
The list should include the attribute key/id, its vr, the value, and also the corresponding name.
For example:
Tag: (2,0)
VR: UL
Name: File Meta Information Group Length
Value: 246
I'd like to get some guidance on how to obtain this information since I can't find anything useful in the pydicom docs.
My code is the following:
import pydicom
from io import BytesIO
dicom_data = await client.download_dicom_file(image_id)
data = pydicom.dcmread(BytesIO(dicom_data))

To get all tags, you just iterate over all elements in a dataset. Here is an example in the documentation that does that. It boils down to:
from pydicom import dcmread
ds = dcmread(file_name)
for element in ds:
    print(element)
The example also shows how to handle sequences (by recursively iterating the sequence items). Here is a simple example for handling sequence items using just the string representation of the elements:
def show_dataset(ds, indent):
    for elem in ds:
        if elem.VR == "SQ":
            indent += 4 * " "
            for item in elem:
                show_dataset(item, indent)
            indent = indent[4:]
        print(indent + str(elem))
def print_dataset(file_name):
    ds = dcmread(file_name)
    show_dataset(ds, indent="")
If you want to print your own representation of the data elements, you can access the element attributes.
Each element is a DataElement, which has the information you need:
>>> from pydicom import dcmread
>>> ds = dcmread("ct_small.dcm") # from the test data
>>> len(ds)
258
>>> element = ds[0x00080008]
>>> element
(0008, 0008) Image Type CS: ['ORIGINAL', 'PRIMARY', 'AXIAL']
>>> type(element)
<class 'pydicom.dataelem.DataElement'>
>>> element.VR
'CS'
>>> element.tag
(0008, 0008)
>>> element.name
'Image Type'
>>> element.value
['ORIGINAL', 'PRIMARY', 'AXIAL']
>>> element.VM
3
You will find similar information in the documentation of Dataset, and probably in other examples.
Note that there is also a command line interface that shows the contents of a DICOM file.
Edit:
As this has been asked in the other answer: if you want to access the file meta information, e.g. the tags in group 2, you can do so by iterating over ds.file_meta (ds being the dataset). file_meta is also of type Dataset and can be accessed the same way. Note that file_meta may be None if no meta information is present in the dataset:
from pydicom import dcmread
ds = dcmread(file_name)
file_meta = ds.file_meta
if file_meta is not None:
    for element in file_meta:
        print(element)

Use to_json()
https://pydicom.github.io/pydicom/stable/tutorials/dicom_json.html
Please note that tags with group number 0x0002 cannot be read with to_json() or with the for loop from MrBean Bremen's answer. I am sorry that I have no solution for this limitation.

Can you iterate over only tags with the .children iterator from BeautifulSoup?

I am pulling down an xml file using BeautifulSoup with this code
dlink = r'https://www.sec.gov/Archives/edgar/data/1040188/000104018820000126/primary_doc.xml'
dreq = requests.get(dlink).content
dsoup = BeautifulSoup(dreq, 'lxml')
There is a level I'm trying to access and then place the elements into a dictionary. I've got it working with this code:
if dsoup.otherincludedmanagerscount.text != '0':
    inclmgr = []
    for i in dsoup.find_all('othermanagers2info'):
        for m in i.find_all('othermanager2'):
            for o in m.find_all('othermanager'):
                imd={}
                if o.cik:
                    imd['cik'] = o.cik.text
                if o.form13ffilenumber:
                    imd['file_no'] = o.form13ffilenumber.text
                imd['name'] = o.find('name').text
                inclmgr.append(imd)
    comp_dict['incl_mgr'] = inclmgr
I assume it's easier to use the .children or .descendants generators, but every time I run it, I get an error. Is there a way to iterate over only tags using the BeautifulSoup generators?
Something like this?
for i in dsoup.othermanagers2info.children:
    imd['cik'] = i.cik.text
AttributeError: 'NavigableString' object has no attribute 'cik'

Assuming othermanagers2info is a single item, you can produce the same results using one for loop:
for i in dsoup.find('othermanagers2info').find_all('othermanager'):
    imd={}
    if i.cik:
        imd['cik'] = i.cik.text
    if i.form13ffilenumber:
        imd['file_no'] = i.form13ffilenumber.text
    imd['name'] = i.find('name').text
    inclmgr.append(imd)
comp_dict['incl_mgr'] = inclmgr
You can also do for i in dsoup.find('othermanagers2info').findChildren():. However, this will produce different results (unless you add additional code): it flattens the list and includes both parent and child items. You can also pass in a node name
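On the literal question of iterating over only tags: the AttributeError appears because .children also yields the whitespace between tags as NavigableString objects. A minimal sketch of filtering those out with isinstance follows. The inline sample is a shortened, made-up stand-in for the SEC document, and html.parser is used here only so the snippet does not depend on lxml:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened stand-in for the document in the question.
sample = """<othermanagers2info>
<othermanager2><othermanager><cik>0001</cik><name>Alpha LLC</name></othermanager></othermanager2>
<othermanager2><othermanager><name>Beta LP</name></othermanager></othermanager2>
</othermanagers2info>"""

soup = BeautifulSoup(sample, 'html.parser')

managers = []
for child in soup.othermanagers2info.children:
    # Skip NavigableString nodes (the newlines between tags).
    if not isinstance(child, Tag):
        continue
    # 'name' must be looked up with find(), since .name is the tag's own name.
    entry = {'name': child.find('name').text}
    if child.cik:
        entry['cik'] = child.cik.text
    managers.append(entry)
print(managers)
```

Whether this is actually simpler than find_all is debatable, since find_all already returns only Tag objects.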

lxml XPath gobbles an element from the next record

I am using lxml to obtain names from each record via XPath. For some reason XPath always fetches the first name from the next record, even though I feed it only one record at a time. In addition, it fetches the same name again when the next record is loaded. What am I doing wrong?
Example: parse the following sample.xml:
<?xml version="1.0" encoding="UTF-8"?>
<records>
<REC>
<name>Alpha</name>
<name>Beta</name>
<name>Gamma</name>
</REC>
<REC>
<name>Delta</name>
</REC>
</records>
Code:
#!/usr/bin/env python3
from lxml import etree
class Nam:
    XPATH = '/records/REC/name'
    def __init__(self):
        self.xp = etree.XPath(self.XPATH)
    def getvals(self, doc):
        for no, el in enumerate(self.xp(doc)):
            print("{} val: {} ".format(no, el.text))
        print()
def main():
    nam = Nam()
    context = etree.iterparse("sample.xml", events=('end',), tag='REC')
    for event, elem in context:
        print("Element: {}".format( etree.tostring(elem).decode()))
        nam.getvals(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
if __name__ == '__main__':
    main()
Output:
Element: <REC>
<name>Alpha</name>
<name>Beta</name>
<name>Gamma</name> </REC>
0 val: Alpha
1 val: Beta
2 val: Gamma
3 val: Delta
Element: <REC>
<name>Delta</name> </REC>
0 val: Delta
Thank you for your help.

When iterparse emits an event, that does not mean it has parsed the input only up to the current element. It may already have parsed beyond that point, because it reads the input file in chunks of a fixed size as you iterate over it.
This means there is no guarantee about how much of the input XML has already been parsed. In a start event you should not access an element's content (other than its attributes), as it may not have been parsed yet, and you should not access any of the following siblings in either start or end events.
In this case your sample XML is very short, so it is parsed as a single chunk. Your XPath expression is rooted, so it always returns all matching elements of the document, regardless of the element you pass in.
Given that you only handle REC tags anyway, your XPath expression should probably be ./name instead.
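To illustrate the effect of a relative path, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of lxml (a substitution made only to keep the snippet self-contained). The function name names_per_record is an assumption, and the sample is fed from memory rather than from sample.xml:

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<records>
<REC>
<name>Alpha</name>
<name>Beta</name>
<name>Gamma</name>
</REC>
<REC>
<name>Delta</name>
</REC>
</records>"""

def names_per_record(source):
    """Hypothetical helper: collect the names of each REC separately."""
    records = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "REC":
            continue
        # "./name" is relative to the current REC, so other records
        # cannot leak into the result.
        records.append([name.text for name in elem.findall("./name")])
        # Free the children of the record that has just been handled.
        elem.clear()
    return records

result = names_per_record(io.BytesIO(SAMPLE))
print(result)
```

With lxml, the corresponding change would be to set XPATH to './name' in the Nam class from the question.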

Parsing XML in Python using the cElementTree module

I have an XML file, which I wanted to convert to a dictionary. I have tried to write the following code but the output is not as expected. I have the following XML file named core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Temporary Directory.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.XXX.X.XXX:XXXX</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>
The code that I wrote is:
import xml.etree.cElementTree
import xml.etree.ElementTree as ET
import warnings
warnings.filterwarnings("ignore")
class XmlListConfig(list):
    def __init__(self, aList):
        for element in aList:
            if element:
                # treat like dict
                if len(element) == 1 or element[0].tag != element[1].tag:
                    self.append(XmlDictConfig(element))
                # treat like list
                elif element[0].tag == element[1].tag:
                    self.append(XmlListConfig(element))
            elif element.text:
                text = element.text.strip()
                if text:
                    self.append(text)
class XmlDictConfig(dict):
    def __init__(self, parent_element):
        if parent_element.items():
            self.update(dict(parent_element.items()))
        for element in parent_element:
            if element:
                # treat like dict - we assume that if the first two tags
                # in a series are different, then they are all different.
                if len(element) == 1 or element[0].tag != element[1].tag:
                    aDict = XmlDictConfig(element)
                # treat like list - we assume that if the first two tags
                # in a series are the same, then the rest are the same.
                else:
                    # here, we put the list in dictionary; the key is the
                    # tag name the list elements all share in common, and
                    # the value is the list itself
                    aDict = {element[0].tag: XmlListConfig(element)}
                # if the tag has attributes, add those to the dict
                if element.items():
                    aDict.update(dict(element.items()))
                self.update({element.tag: aDict})
            # this assumes that if you've got an attribute in a tag,
            # you won't be having any text. This may or may not be a
            # good idea -- time will tell. It works for the way we are
            # currently doing XML configuration files...
            elif element.items():
                self.update({element.tag: dict(element.items())})
            # finally, if there are no child tags and no attributes, extract
            # the text
            else:
                self.update({element.tag: element.text})
tree = ET.parse('core-site.xml')
root = tree.getroot()
xmldict = XmlDictConfig(root)
print xmldict
This is the output that I am getting:
{
'property':
{
'name': 'fs.defaultFS',
'value': 'hdfs://192.X.X.X:XXXX',
'description': 'Use HDFS as file storage engine'
}
}
Why isn't the first property tag being shown? It only shows the data in the last property tag.

Since you are using a dict, the second element with the same key, property, replaces the first one previously stored in the dict.
You have to use a different data structure, for instance a list of dicts.
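A minimal sketch of the list-of-dicts idea, using xml.etree.ElementTree and Python 3. It deliberately sidesteps the XmlDictConfig recipe, and the function name properties_to_list is an assumption. The XML is inlined from the question instead of being read from core-site.xml:

```python
import xml.etree.ElementTree as ET

CORE_SITE = """<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdfs/tmp</value>
<description>Temporary Directory.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.XXX.X.XXX:XXXX</value>
<description>Use HDFS as file storage engine</description>
</property>
</configuration>"""

def properties_to_list(xml_text):
    """Hypothetical helper: one dict per <property>, kept in document order."""
    root = ET.fromstring(xml_text)
    # A list keeps every property, whereas a dict keyed on the tag name
    # would keep only the last one.
    return [
        {child.tag: child.text for child in prop}
        for prop in root.findall("property")
    ]

props = properties_to_list(CORE_SITE)
print(props)
```

If lookups by property name are needed, the list can be turned into a dict keyed on each entry's name value afterwards.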

Populating Python list using data obtained from lxml xpath command

I'm reading instrument data from a specialty server that delivers the info in xml format. The code I've written is:
from lxml import etree as ET
xmlDoc = ET.parse('http://192.168.1.198/Bench_read.xml')
print ET.tostring(xmlDoc, pretty_print=True)
dmtCount = xmlDoc.xpath('//dmt')
print(len(dmtCount))
dmtVal = []
for i in range(1, len(dmtCount)):
    dmtVal[i:0] = xmlDoc.xpath('./address/text()')
    dmtVal[i:1] = xmlDoc.xpath('./status/text()')
    dmtVal[i:2] = xmlDoc.xpath('./flow/text()')
    dmtVal[i:3] = xmlDoc.xpath('./dp/text()')
    dmtVal[i:4] = xmlDoc.xpath('./inPressure/text()')
    dmtVal[i:5] = xmlDoc.xpath('./actVal/text()')
    dmtVal[i:6] = xmlDoc.xpath('./temp/text()')
    dmtVal[i:7] = xmlDoc.xpath('./valveOnPercent/text()')
print dmtVal
And the results I get are:
$python XMLparse2.py
<response>
<heartbeat>0x24</heartbeat>
<dmt node="1">
<address>0x21</address>
<status>0x01</status>
<flow>0.000000</flow>
<dp>0.000000</dp>
<inPressure>0.000000</inPressure>
<actVal>0.000000</actVal>
<temp>0x00</temp>
<valveOnPercent>0x00</valveOnPercent>
</dmt>
<dmt node="2">
<address>0x32</address>
<status>0x01</status>
<flow>0.000000</flow>
<dp>0.000000</dp>
<inPressure>0.000000</inPressure>
<actVal>0.000000</actVal>
<temp>0x00</temp>
<valveOnPercent>0x00</valveOnPercent>
</dmt>
</response>
...Starting to parse XML nodes
2
[]
...Done
Sooo, nothing is coming out. I've tried using /value in place of the /text() in the xpath call, but the results are unchanged. Is my problem:
1) An incorrect xpath command in the for loop? or
2) A problem in the way I've structured list variable dmtVal ? or
3) Something else I'm missing completely?
I'd welcome any suggestions! Thanks in advance...

dmtVal[i:0] is the syntax for slicing.
You probably wanted indexing: dmtVal[i][0]. But that wouldn't work either.
You don't typically loop over the indices of a list in Python; you loop over its elements instead.
So, you'd use
for element in some_list:
rather than
for i in xrange(len(some_list)):
    element = some_list[i]
The way you handle your XPath queries is also wrong: they are evaluated against the whole document rather than against each dmt element.
Something like this should work (not tested):
from lxml import etree as ET
xml_doc = ET.parse('http://192.168.1.198/Bench_read.xml')
dmts = xml_doc.xpath('//dmt')
dmt_val = []
for dmt in dmts:
    values = []
    values.append(dmt.xpath('./address/text()'))
    # do this for all values
    # making this a loop would be a good idea
    dmt_val.append(values)
print dmt_val
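To make the slicing point concrete, here is a small pure-Python sketch of what slice assignment does. It suggests that the empty result in the question presumably comes from the XPath calls returning empty lists, not from the slice syntax alone. The variable names are illustrative:

```python
values = []

# Assigning an empty sequence to a slice changes nothing. This is what
# happens when the right-hand side is an empty XPath result.
values[1:0] = []
after_empty = list(values)

# Assigning a non-empty sequence inserts its items at the slice position.
# Out-of-range bounds are clamped, so this works even on an empty list.
values[1:0] = ["0x21"]
after_first = list(values)

# A slice whose stop is below its start is an empty slice at 'start',
# so the new item is inserted at index 1.
values[1:0] = ["0x01"]
after_second = list(values)

print(after_empty, after_first, after_second)
```

So slice assignment silently inserts whatever is on the right-hand side, which makes it easy to miss that the right-hand side was empty.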

Counting <dmt/> tags and then iterating over them by index is both inefficient and un-Pythonic. Apart from that, you are using the wrong syntax (a slice instead of an index) to index the list. In fact you don't need to index val at all; the Pythonic way is to use a list comprehension.
Here's a slightly modified version of what stranac suggested:
from lxml import etree as ET
xmlDoc = ET.parse('http://192.168.1.198/Bench_read.xml')
print ET.tostring(xmlDoc, pretty_print=True)
response = xmlDoc.getroot()
tags = (
    'address',
    'status',
    'flow',
    'dp',
    'inPressure',
    'actVal',
    'temp',
    'valveOnPercent',
)
dmtVal = []
for dmt in response.iter('dmt'):
    val = [dmt.xpath('./%s/text()' % tag) for tag in tags]
    dmtVal.append(val)
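For a version that can be run without the instrument server, here is a minimal sketch of the same per-element approach using the standard library's xml.etree.ElementTree and findtext in place of lxml's xpath (a deliberate substitution). The function name read_dmts is an assumption, and the response from the question is inlined:

```python
import xml.etree.ElementTree as ET

# Response copied from the question; normally this would come from the server.
RESPONSE = """<response>
<heartbeat>0x24</heartbeat>
<dmt node="1">
<address>0x21</address><status>0x01</status><flow>0.000000</flow>
<dp>0.000000</dp><inPressure>0.000000</inPressure><actVal>0.000000</actVal>
<temp>0x00</temp><valveOnPercent>0x00</valveOnPercent>
</dmt>
<dmt node="2">
<address>0x32</address><status>0x01</status><flow>0.000000</flow>
<dp>0.000000</dp><inPressure>0.000000</inPressure><actVal>0.000000</actVal>
<temp>0x00</temp><valveOnPercent>0x00</valveOnPercent>
</dmt>
</response>"""

TAGS = ("address", "status", "flow", "dp",
        "inPressure", "actVal", "temp", "valveOnPercent")

def read_dmts(xml_text):
    """Hypothetical helper: one list of text values per <dmt> element."""
    root = ET.fromstring(xml_text)
    # findtext() is evaluated relative to each dmt and returns a plain
    # string (or None if the child is missing), not a one-item list.
    return [[dmt.findtext(tag) for tag in TAGS] for dmt in root.iter("dmt")]

rows = read_dmts(RESPONSE)
print(rows)
```

Unlike the xpath version, each cell here is a single string, which is usually easier to convert to a number later.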

Can you explain this:
dmtVal[i:0]
If the iteration starts with a count of 0 and increments over time, you're not actually storing anything in the list.
