Parsing orphaned XML children - python

I've been trying to parse the following XML file using xml.etree: Bills.xml
This is the simple python source: xml.py
I'm able to successfully print the children under BILLFIXED using the for loop. The output is as follows:
1-Apr-2017 [Registered Creditor] 1
1-Apr-2017 [Registered Creditor] 58
However, as you can see in the XML, certain orphaned children (BILLCL, BILLOVERDUE, BILLDUE) which must logically belong under BILLFIXED are not taken into consideration when outputting the XML, since we are finding all the elements under BILLFIXED using the following code:
billfixed = dom.findall('BILLFIXED')
Is there any way for BILLCL, BILLDUE and BILLOVERDUE to be included under their respective listings? I'm unable to think of any logic that would let me treat those orphaned children as sub-children of BILLFIXED.
Thanks!

You could use zip:
for bill_fixed_node, bill_cl in zip(root.findall('BILLFIXED'), root.iter('BILLCL')):
    print(bill_fixed_node)
    print(bill_cl.text)
# <Element 'BILLFIXED' at 0x07905120>
# 600.00
# <Element 'BILLFIXED' at 0x079052D0>
# 10052.00
But it would probably be better to fix the structure of the XML file if you have control over it.
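Note that zip also silently stops at the shorter list if the element counts ever get out of step. A more defensive sketch (assuming the orphaned tags always appear as siblings following their BILLFIXED element in document order) walks the root's children and groups them positionally:

from xml.etree import ElementTree

dom = ElementTree.parse('bills.xml')
root = dom.getroot()

bills = []
for child in root:  # direct children, in document order
    if child.tag == 'BILLFIXED':
        bills.append({'fixed': child})
    elif child.tag in ('BILLCL', 'BILLDUE', 'BILLOVERDUE') and bills:
        # attach each orphaned sibling to the most recent BILLFIXED
        bills[-1][child.tag] = child.text

for bill in bills:
    print(bill['fixed'].find('BILLDATE').text, bill.get('BILLCL'))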

A friend of mine was able to answer and help me with the following code: https://gist.github.com/anonymous/dba333b6c6342d13d21fd8c0781692cb
from xml.etree import ElementTree

dom = ElementTree.parse('bills.xml')
billfixed = dom.findall('BILLFIXED')
billcl = dom.findall('BILLCL')
billdue = dom.findall('BILLDUE')
billoverdue = dom.findall('BILLOVERDUE')

for fixed, cl, due, overdue in zip(billfixed, billcl, billdue, billoverdue):
    date = fixed.find('BILLDATE').text
    ref = fixed.find('BILLREF').text
    party = fixed.find('BILLPARTY').text
    print(' * {} [{}] {} + {} + {} + {}'.format(
        date, party, ref, cl.text, due.text, overdue.text
    ))

Related

How to avoid running into IndexError: list index out of range error if an element is nonexistent while parsing xml with BeautifulSoup in python

I have the following code to parse from an xml file to produce a pandas dataframe. The XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>
And my code is below:
from bs4 import BeautifulSoup
import pandas as pd

fd = open("file_120123.xml", 'r')
data = fd.read()
Bs_data = BeautifulSoup(data, 'xml')
ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try:
    Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
    Cat = ''
CatDict = {
    "ENG": "English",
    "MAT": "Mathematics"
}
dataDf = []
for i in range(0, len(ID)):
    if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
    rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
    dataDf.append(rows)
df = pd.DataFrame(dataDf, columns=['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')
As you see, the code reads an XML file called 'file_120123.xml' using the BeautifulSoup library and collects each of the elements present in the file. Now one of the elements is a key, and I have created a dictionary listing all possible keys. Not all parents have that element. I want to compare the extracted key with the ones in the dictionary and replace it with the value corresponding to that key.
With this code, I get the error IndexError: list index out of range on Cat[i], on the if (Cat[i] == CatDict): line. Any insights on how to resolve this?
If you just want to avoid raising the error, add a conditional break:
for i in range(0, len(ID)):
    if not i < len(Cat): break  ## <-- break loop if length of Cat is exceeded
    if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
    rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
    dataDf.append(rows)
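A more robust alternative (a sketch, not from the original answers) is to iterate per parent node, so a missing CategoryOfEntry can never knock the three lists out of alignment; note it takes the first CategoryOfEntry when several are present:

from bs4 import BeautifulSoup
import pandas as pd

with open("file_120123.xml") as fd:
    soup = BeautifulSoup(fd.read(), "xml")

cat_dict = {"ENG": "English", "MAT": "Mathematics"}

rows = []
for entry in soup.find_all("EntrySynopsisDetail_1_0"):
    cat = entry.find("CategoryOfEntry")  # None when the element is missing
    rows.append([
        entry.find("EntryID").get_text(),
        entry.find("EntryTitle").get_text(),
        cat_dict.get(cat.get_text(), cat.get_text()) if cat else "",
    ])

df = pd.DataFrame(rows, columns=["ID", "Title", "Category"])
df.to_csv("120123.csv")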
First, as to why lxml is better than BeautifulSoup for xml, the answer is simple: the best way to query xml is with xpath. lxml supports xpath (though only version 1.0; for more complex xml and queries you will need xpath 2.0 to 3.1 and a library like elementpath). BS doesn't support xpath, though it does have excellent support for css selectors, which works better with html.
Having said all that - in your particular case, you probably don't need lxml either - only pandas and a one-liner! Though you haven't shown your expected output, my guess is you expect the output below. Note that in your sample xml there is probably an error: the 2nd <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:
entries = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""
pd.read_xml(entries, xpath="//EntrySynopsisDetail_1_0")
Output:
   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT

Navigating XML based on the last node you processed in Python

In Python I am trying to navigate XML nodes, creating links and traversing through the nodes based on the last node processed. I have a set of source and target nodes, where I have to traverse from source to target, then from that target (as the new source) to its target, and so on; the same node may appear multiple times.
The XML structure is attached below:
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
In the XML above, I have to start from the 1st sourceNode (FCMComposite_1_1) to the 1st targetNode (FCMComposite_1_2), then I have to navigate from this targetNode (the last node) to the sourceNode having the same value, in this case the 4th row, then from there to its destination node, and so on.
What is the best way to achieve this? Is a graph a good option? I am trying this in Python. Can someone please help me?
You can use a dictionary to store the connections. What you posted isn't actually XML, so I just use re to parse it, but you can do the parsing differently.
import re

data = '''
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
'''

beginning = None
connections = {}
for line in data.split('\n'):
    m = re.match(r'targetNode="([^"]+)" sourceNode="([^"]+)"', line)
    if m:
        target = m.group(1)
        source = m.group(2)
        if beginning is None:
            beginning = source
        connections[source] = target

print('Starting at', beginning)
current = beginning
while current in connections:
    print(current, '->', connections[current])
    current = connections[current]
Output:
Starting at FCMComposite_1_1
FCMComposite_1_1 -> FCMComposite_1_2
FCMComposite_1_2 -> FCMComposite_1_8
FCMComposite_1_8 -> FCMComposite_1_3
FCMComposite_1_3 -> FCMComposite_1_5
FCMComposite_1_5 -> FCMComposite_1_6
I'm not sure what's supposed to happen with the multiple targets for FCMComposite_1_5.
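If a source can legitimately have several targets (as FCMComposite_1_5 does here, where the dict approach silently overwrites the 1_4 entry), one possible extension (a sketch; it assumes you want to follow every branch and stop on repeated edges) is to store lists of targets and walk them depth-first:

import re
from collections import defaultdict

# reuses the `data` string defined above
connections = defaultdict(list)
for line in data.split('\n'):
    m = re.match(r'targetNode="([^"]+)" sourceNode="([^"]+)"', line)
    if m:
        connections[m.group(2)].append(m.group(1))

def walk(node, visited=None):
    # depth-first walk that follows each edge at most once
    if visited is None:
        visited = set()
    for target in connections.get(node, []):
        if (node, target) not in visited:
            visited.add((node, target))
            print(node, '->', target)
            walk(target, visited)

walk('FCMComposite_1_1')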

Check if a dataset exists, using a regex, without first reading the paths of all datasets

How can I check if a dataset exists using something like a regex, without first reading the paths of all datasets?
For example, I want to check if a dataset 'completed' exists in a file that may (or may not) contain
/123/completed
(Suppose that I do not know the complete path a priori; I just want to check for a dataset name. So this answer will not work in my case.)
Custom recursion
No need for regex. You can build a set of dataset names by recursively traversing the groups in your HDF5 file:
import h5py

def traverse_datasets(hdf_file):
    """Traverse all datasets across all groups in HDF5 file."""
    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{}/{}'.format(prefix, key)
            if isinstance(item, h5py.Dataset):  # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group):  # test for group (go down)
                yield from h5py_dataset_iterator(item, path)
    with h5py.File(hdf_file, 'r') as f:
        for (path, dset) in h5py_dataset_iterator(f):
            yield path.split('/')[-1]
all_datasets = set(traverse_datasets('file.h5'))
Then just check for membership: 'completed' in all_datasets.
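If you would rather match a pattern than an exact name (the question asks about regex), a minimal sketch against that same set:

import re

pattern = re.compile(r'comp.*')  # illustrative pattern
exists = any(pattern.fullmatch(name) for name in all_datasets)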
Group.visit
Alternatively, you can use Group.visit. Note you need your searching function to return None to iterate all groups.
res = []

def searcher(name, k='completed'):
    """ Find all objects with k anywhere in the name """
    if k in name:
        res.append(name)
    return None

with h5py.File('file.h5', 'r') as f:
    group = f['/']
    group.visit(searcher)

print(res)  # print list of dataset names matching criterion
Complexity is O(n) in both cases. You need to test the name of each dataset, but nothing more. The first option may be preferable if you need a lazy solution.
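Note that Group.visit only hands you names, so the searcher above will also match groups called 'completed'. If you need to be sure the match is a dataset, Group.visititems passes the object too (a small sketch):

import h5py

res = []

def dataset_searcher(name, obj, k='completed'):
    # visititems passes the object, so we can require a Dataset
    if k in name and isinstance(obj, h5py.Dataset):
        res.append(name)
    return None  # keep visiting

with h5py.File('file.h5', 'r') as f:
    f.visititems(dataset_searcher)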
Recursion to Find All Valid Paths to dataset(s)
The following code uses recursion to find valid data paths to all dataset(s). After getting the valid paths (terminating possible circular references after 3 repeats), I can then use a regular expression against the returned list (a sketch of that step follows the output below).
import numpy as np
import h5py
import collections
import warnings

def visit_data_sets(group, max_len_check=20, max_repeats=3):
    # print(group.name)
    # print(list(group.items()))
    if len(group.name) > max_len_check:
        # this section terminates a likely circular reference after
        # `max_repeats` repeats. However it will incorrectly terminate a
        # tree if identical repetitive sequences of names are actually
        # used in the tree.
        name_list = group.name.split('/')
        current_name = name_list[-1]
        res_list = [i for i in range(len(name_list)) if name_list[i] == current_name]
        res_deq = collections.deque(res_list)
        res_deq.rotate(1)
        res_deq2 = collections.deque(res_list)
        diff = [res_deq2[i] - res_deq[i] for i in range(0, len(res_deq))]
        if len(diff) >= max_repeats:
            if diff[-1] == diff[-2]:
                message = 'Terminating likely circular reference "{}"'.format(group.name)
                warnings.warn(message, UserWarning)
                print()
                return []
    dataset_list = list()
    for key, value in group.items():
        if isinstance(value, h5py.Dataset):
            current_path = group.name + '/{}'.format(key)
            dataset_list.append(current_path)
        elif isinstance(value, h5py.Group):
            dataset_list += visit_data_sets(value)
        else:
            print('Unhandled class name {}'.format(value.__class__.__name__))
    return dataset_list

def visit_callback(name, object):
    print('Visiting name = "{}", object name = "{}"'.format(name, object.name))
    return None
hdf_fptr = h5py.File('link_test.hdf5', mode='w')
group1 = hdf_fptr.require_group('/junk/group1')
group1a = hdf_fptr.require_group('/junk/group1/group1a')
# group1a1 = hdf_fptr.require_group('/junk/group1/group1a/group1ai')
group2 = hdf_fptr.require_group('/junk/group2')
group3 = hdf_fptr.require_group('/junk/group3')
# create a circular reference
group1ai = group1a['group1ai'] = group1
avect = np.arange(0,12.3, 1.0)
dset = group1.create_dataset('avect', data=avect)
group2['alias'] = dset
group3['alias3'] = h5py.SoftLink(dset.name)
print('\nThis demonstrates "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"')
print('Visiting Root - {}'.format(hdf_fptr.name))
hdf_fptr.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group2" with a Hard Link to "avect"')
print('Visiting Group - {}'.format(group2.name))
group2.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"')
print('Visiting Group - {}'.format(group3.name))
group3.visititems(visit_callback)
print('\n\nNow demonstrate recursive visit of Root looking for datasets')
print('using the function "visit_data_sets" in this code snippet.\n')
data_paths = visit_data_sets(hdf_fptr)
for data_path in data_paths:
    print('Data Path = "{}"'.format(data_path))
hdf_fptr.close()
The following output shows how "visititems" works, or for my purposes fails to work, in identifying all valid paths while the recursion meets my needs and possibly yours.
This demonstrates "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"
Visiting Root - /
Visiting name = "junk", object name = "/junk"
Visiting name = "junk/group1", object name = "/junk/group1"
Visiting name = "junk/group1/avect", object name = "/junk/group1/avect"
Visiting name = "junk/group1/group1a", object name = "/junk/group1/group1a"
Visiting name = "junk/group2", object name = "/junk/group2"
Visiting name = "junk/group3", object name = "/junk/group3"
This demonstrates "h5py visititems" visiting "group2" with a Hard Link to "avect"
Visiting Group - /junk/group2
Visiting name = "alias", object name = "/junk/group2/alias"
This demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"
Visiting Group - /junk/group3
Now demonstrate recursive visit of Root looking for datasets
using the function "visit_data_sets" in this code snippet.
link_ref_test.py:26: UserWarning: Terminating likely circular reference "/junk/group1/group1a/group1ai/group1a/group1ai/group1a"
warnings.warn(message, UserWarning)
Data Path = "/junk/group1/avect"
Data Path = "/junk/group1/group1a/group1ai/avect"
Data Path = "/junk/group1/group1a/group1ai/group1a/group1ai/avect"
Data Path = "/junk/group2/alias"
Data Path = "/junk/group3/alias3"
The first "Data Path" result is the original dataset. The second and third are references to the original dataset caused by a circular reference. The fourth result is a Hard Link and the fifth is a Soft Link to the original dataset.
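For completeness, the regex step the answer leaves out might look like this (a sketch; the pattern is illustrative):

import re

# data_paths is the list returned by visit_data_sets above
pattern = re.compile(r'/avect$')
matches = [p for p in data_paths if pattern.search(p)]
print(matches)  # every path whose final component is "avect"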

How do you keep table rows together in python-docx?

As an example, I have a generic script that outputs the default table styles using python-docx (this code runs fine):
import docx

d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)
for stylenum, style in enumerate(styles, start=1):
    label = d.add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = d.add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell
d.save('tablestyles.docx')
Next, I opened the document, highlighted a split table and, under paragraph format, selected "Keep with next", which successfully prevented the table from being split across a page.
Inspecting the XML of the non-broken table shows the w:keepNext paragraph property that should be keeping the table together. So I wrote this function and stuck it in the code above the d.save('tablestyles.docx') line:
def no_table_break(document):
    tags = document.element.xpath('//w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True

no_table_break(d)
When I inspect the XML code the paragraph property tag is set properly and when I open the Word document, the "Keep with next" box is checked for all tables, yet the table is still split across pages. Am I missing an XML tag or something that's preventing this from working properly?
Ok, I also needed this. I think we were all making the incorrect assumption that the setting in Word's table properties (or the equivalent ways to achieve this in python-docx) was about keeping the table from being split across pages. It's not -- instead, it's simply about whether or not a table's rows can be split across pages.
Given that we know how to successfully do this in python-docx, we can prevent tables from being split across pages by putting each table within the row of a larger master table. The code below successfully does this. I'm using Python 3.6 and python-docx 0.8.6.
import docx
from docx.oxml.shared import OxmlElement
import os
import sys

def prevent_document_break(document):
    """https://github.com/python-openxml/python-docx/issues/245#event-621236139
    Globally prevent table cells from splitting across pages.
    """
    tags = document.element.xpath('//w:tr')
    rows = len(tags)
    for row in range(0, rows):
        tag = tags[row]  # Specify which <w:tr> tag you want
        child = OxmlElement('w:cantSplit')  # Create arbitrary tag
        tag.append(child)  # Append in the new tag

d = docx.Document()
type_of_table = docx.enum.style.WD_STYLE_TYPE.TABLE
list_table = [['header1', 'header2'], ['cell1', 'cell2'], ['cell3', 'cell4']]
numcols = max(map(len, list_table))
numrows = len(list_table)
styles = (s for s in d.styles if s.type == type_of_table)

big_table = d.add_table(1, 1)
big_table.autofit = True

for stylenum, style in enumerate(styles, start=1):
    cells = big_table.add_row().cells
    label = cells[0].add_paragraph('{}) {}'.format(stylenum, style.name))
    label.paragraph_format.keep_with_next = True
    label.paragraph_format.space_before = docx.shared.Pt(18)
    label.paragraph_format.space_after = docx.shared.Pt(0)
    table = cells[0].add_table(numrows, numcols)
    table.style = style
    for r, row in enumerate(list_table):
        for c, cell in enumerate(row):
            table.row_cells(r)[c].text = cell

prevent_document_break(d)
d.save('tablestyles.docx')

# because I'm lazy...
openers = {'linux': 'libreoffice tablestyles.docx',
           'linux2': 'libreoffice tablestyles.docx',
           'darwin': 'open tablestyles.docx',
           'win32': 'start tablestyles.docx'}
os.system(openers[sys.platform])
I had been struggling with the problem for some hours and finally found a solution that worked fine for me. I just changed the XPath in the topic starter's code, so now it looks like this:
def keep_table_on_one_page(doc):
    tags = doc.element.xpath('//w:tr[position() < last()]/w:tc/w:p')
    for tag in tags:
        ppr = tag.get_or_add_pPr()
        ppr.keepNext_val = True
The key moment is this selector:
[position() < last()]
We want all but the last row in each table to keep with the next one.
I would have left this as a comment under @DeadAd's answer, but I had low rep.
In case anyone is looking to stop a specific table from breaking, rather than all tables in a doc, change the xpath to the following:
tags = table._element.xpath('./w:tr[position() < last()]/w:tc/w:p')
where table refers to the instance of <class 'docx.table.Table'> which you want to keep together.
"//" will select all nodes that match the xpath (regardless of relative location), "./" will start selection from current node

First time parsing XML in Python: this can't be like it was meant to be, can it?

I need to read data from an XML file and am using ElementTree. Reading a number of nodes looks like this ATM:
def read_edl_ev_ids(xml_tree):
    # Read all EDL events (those which start with "EDL_EV_") from XML
    # and put them into a dict with
    # symbolic name as key and number as value. The XML looks like:
    # <...>
    # <COMPU-METHOD>
    #     <SHORT-NAME>DT_EDL_EventType</SHORT-NAME>
    #     <...>
    #     <COMPU-SCALE>
    #         <LOWER-LIMIT>number</LOWER-LIMIT>
    #         <....>
    #         <COMPU-CONST>
    #             <VT>EDL_EV_symbolic_name</VT>
    #         </COMPU-CONST>
    #     </COMPU-SCALE>
    # </COMPU-METHOD>
    edl_ev = {}
    for node in xml_tree.findall('.//COMPU-METHOD'):
        if node.find('./SHORT-NAME').text == 'DT_EDL_EventType':
            for subnode in node.findall('.//COMPU-SCALE'):
                lower_limit = subnode.find('./LOWER-LIMIT').text
                edl_ev_name = subnode.find('./COMPU-CONST/VT').text
                if edl_ev_name.startswith('EDL_EV_'):
                    edl_ev[edl_ev_name] = lower_limit or '0'
    return edl_ev
To sum it up: I don't like it. It's clearly XML-parsing beginner's code, and it's ugly, tedious to maintain, inflexible, DRY-violating, etc. Is there a better (declarative?) way to read in XML?
Try taking a look at the lxml library's examples (look here). Specifically, I think you'll want to take a look at the XPath functionality.
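For instance, an XPath predicate can replace the outer loop and the SHORT-NAME check entirely. A minimal sketch with lxml, assuming the element names from the comment block in the question:

from lxml import etree

def read_edl_ev_ids(xml_tree):
    # one XPath predicate replaces the nested SHORT-NAME test
    edl_ev = {}
    scales = xml_tree.xpath(
        './/COMPU-METHOD[SHORT-NAME="DT_EDL_EventType"]//COMPU-SCALE')
    for scale in scales:
        name = scale.findtext('COMPU-CONST/VT', default='')
        if name.startswith('EDL_EV_'):
            edl_ev[name] = scale.findtext('LOWER-LIMIT') or '0'
    return edl_ev

# usage: read_edl_ev_ids(etree.parse('my_file.xml'))
# ('my_file.xml' is a placeholder name)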
