Breadth First Search traversal for LXML Files Python - python

I am trying to perform a breadth-first search (BFS) traversal on an XML file.
A depth-first search algorithm is shown at https://lxml.de/3.3/api.html#lxml-etre. However, I need help applying BFS based on this code.
Below is the code given in the documentation:
>>> root = etree.XML('<root><a><b/><c/></a><d><e/></d></root>')
>>> print(etree.tostring(root, pretty_print=True, encoding='unicode'))
<root>
  <a>
    <b/>
    <c/>
  </a>
  <d>
    <e/>
  </d>
</root>
>>> queue = deque([root])
>>> while queue:
...     el = queue.popleft()  # pop next element
...     queue.extend(el)      # append its children
...     print(el.tag)
root
a
d
b
c
e
I need help adapting it so that it performs a BFS traversal. Below is the code I tried to write, but it doesn't work correctly. Can someone please help?
My code:
>>> from collections import deque
>>> d = deque([root])
>>> while d:
...     el = d.pop()
...     d.extend(el)
...     print(el.tag)
Thank You

Your implementation of BFS is popping from the wrong end of the queue. pop() removes the most recently added element (the right end), which makes the deque behave like a stack. You should use popleft() rather than pop().
d = deque([root])
while d:
    el = d.popleft()
    d.extend(el)
    print(el.tag)
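To see the difference between the two ends of the deque side by side, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies (an assumption made for illustration; iterating an element yields its children in both libraries), and the helper name traverse_tags is hypothetical.

```python
from collections import deque
import xml.etree.ElementTree as ET


def traverse_tags(root, breadth_first=True):
    """Return element tags in visiting order (hypothetical helper)."""
    pending = deque([root])
    tags = []
    while pending:
        # popleft() takes the oldest entry (queue, breadth-first);
        # pop() takes the newest entry (stack, depth-first style order).
        el = pending.popleft() if breadth_first else pending.pop()
        pending.extend(el)  # iterating an element yields its children
        tags.append(el.tag)
    return tags


root = ET.fromstring('<root><a><b/><c/></a><d><e/></d></root>')
print(traverse_tags(root))                       # order produced by popleft()
print(traverse_tags(root, breadth_first=False))  # order produced by pop()
```

With popleft() the tags come out level by level; with pop() the traversal dives into the last child first, which is the behaviour the question ran into.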

This can also be implemented with XPath:
>>> root = etree.XML('<root><a><b/><c><f/></c></a><d><e/></d></root>')
>>> queue = deque([root])
>>> while queue:
...     el = queue.popleft()
...     queue.extend(el.xpath('./child::*'))
...     print(el.tag)
...
root
a
d
b
c
e
f

Related

Algorithm to split the values of a list into a specific format

Can you help me with my algorithm in Python to parse a list, please?
List = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']
From each item in this list, I have to get the last two words (PARENT_CHILD). For example, for PPPP_TOTO_TATA_TITI_TUTU, I only get TITI_TUTU.
Where there are duplicates, for example PPPP_TOTO_TATA_TITI_TUTU and PPPP_TOTO_EHEH_TITI_TUTU would both give TITI_TUTU, I want to recover the GRANDPARENT for each of them, that is: TATA_TITI_TUTU and EHEH_TITI_TUTU.
As long as the names are duplicated, we take the level above.
But in this case, once I have added the GRANDPARENT for EHEH_TITI_TUTU, I also want it to be added for every item that has EHEH in its name, so instead of having OOOO_AAAAA, I would like to have EHEH_OOOO_AAAAA, and likewise EHEH_IIII_SSSS_RRRR.
My final list =
['ZZZZ_XXXX', 'TATA_TITI_TUTU', 'MMMM_TITI_TUTU', 'EHEH_TITI_TUTU', 'EHEH_OOOO_AAAAA', 'EHEH_IIII_SSSS_RRRR']
Thank you in advance.
Here is the code I started to write:
json_paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
              'PPPP_TOTO_EHEH_TITI_TUTU', 'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_OOOO_AAAAA']
cols_name = []
for path in json_paths:
    acc = 2
    col_name = '_'.join(path.split('_')[-acc:])
    tmp = cols_name
    while col_name in tmp:
        acc += 1
        idx = tmp.index(col_name)
        cols_name[idx] = '_'.join(json_paths[idx].split('_')[-acc:])
        col_name = '_'.join(path.split('_')[-acc:])
        tmp = ['_'.join(item.split('_')[-acc:]) for item in json_paths].pop()
    cols_name.append(col_name)
    print(cols_name.index(col_name), col_name)
cols_name
help ... with ... algorithm
Use a dictionary as the initial container while iterating.
The keys will be the PARENT_CHILD strings and the values will be lists containing their grandparents.
>>> import collections
>>> s = 'PPPP_TOTO_TATA_TITI_TUTU'
>>> d = collections.defaultdict(list)
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA']})
>>> s = 'PPPP_TOTO_EHEH_TITI_TUTU'
>>> *_,grandparent,parent,child = s.rsplit('_',maxsplit=3)
>>> d['_'.join([parent,child])].append(grandparent)
>>> d
defaultdict(<class 'list'>, {'TITI_TUTU': ['TATA', 'EHEH']})
>>>
After iteration, determine whether any value contains multiple grandparents:
if it does, prepend each grandparent to the parent_child
additionally, find all the parent_child's that have these grandparents and prepend their grandparents as well. To facilitate this, build a second dictionary during iteration - {grandparent:[list_of_children],...}.
if the parent_child only has one grandparent, use it as-is
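The steps above can be sketched roughly as follows. This is only one reading of the question, not a definitive implementation: it handles a single level of promotion (it does not keep climbing if names are still duplicated), it assumes every path has at least three '_'-separated tokens, and it assumes that a path containing a "promoted" grandparent should start at the first such token, which is what the expected EHEH_IIII_SSSS_RRRR suggests. The name short_names is hypothetical.

```python
from collections import defaultdict


def short_names(paths):
    """One-level sketch; assumes every path has at least three '_' tokens."""
    split_paths = [path.split('_') for path in paths]

    # PARENT_CHILD -> set of grandparents seen in front of it
    grandparents = defaultdict(set)
    for tokens in split_paths:
        grandparents['_'.join(tokens[-2:])].add(tokens[-3])

    # Grandparents needed to tell duplicated PARENT_CHILD names apart
    promoted = set()
    for seen in grandparents.values():
        if len(seen) > 1:
            promoted |= seen

    names = []
    for tokens in split_paths:
        start = len(tokens) - 2  # default: keep PARENT_CHILD only
        for i, token in enumerate(tokens[:-2]):
            if token in promoted:  # assumption: start at the first promoted token
                start = i
                break
        names.append('_'.join(tokens[start:]))
    return names


paths = ['PPPP_YYYY_ZZZZ_XXXX', 'PPPP_TOTO_TATA_TITI_TUTU',
         'PPPP_TOTO_MMMM_TITI_TUTU', 'PPPP_TOTO_EHEH_TITI_TUTU',
         'PPPP_TOTO_EHEH_OOOO_AAAAA', 'PPPP_TOTO_EHEH_IIII_SSSS_RRRR']
print(short_names(paths))
```

For the list from the question this reproduces the desired final list; other inputs may need the multi-level handling that this sketch leaves out.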
Instead of splitting each string the info could be extracted with a regular expression.
pattern = r'^.*?_([^_]*)_([^_]*_[^_]*)$'
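As a quick illustration of how that pattern could be applied (the helper name split_tail is hypothetical), note that it needs at least four '_'-separated tokens to match, so shorter strings return None here:

```python
import re

pattern = re.compile(r'^.*?_([^_]*)_([^_]*_[^_]*)$')


def split_tail(path):
    """Return (grandparent, 'PARENT_CHILD') or None if the path is too short."""
    match = pattern.match(path)
    if match is None:
        return None
    return match.groups()


print(split_tail('PPPP_TOTO_TATA_TITI_TUTU'))  # ('TATA', 'TITI_TUTU')
print(split_tail('PPPP_YYYY_ZZZZ_XXXX'))       # ('YYYY', 'ZZZZ_XXXX')
```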

Text of xml element not reassigned to new value

I have an XML document defined as the string my_xml.
I then tried to duplicate one of its elements and change some of the values.
my_xml = """<root><foo><bar>spamm.xml</bar></foo></root>"""
from xml.etree import ElementTree as et
tree = et.fromstring(my_xml)
el = list(tree)[0].copy()
tree.insert(0, el)
tree.insert(0, el)
cnt = 0
elements = [elem for elem in tree.iter() if elem.text is not None]
for elem in elements:
    if cnt != 0:
        print elem.text[:4]+str(cnt)+elem.text[5:]
        elem.text = elem.text[:4]+str(cnt)+elem.text[5:]  # strange behaviour
    cnt += 1
print et.tostring(tree)
Why does the line elem.text = elem.text[:4]+str(cnt)+elem.text[5:] not give each element its own new value?
Expected output
<root>
<foo><bar>spamm.xml</bar></foo>
<foo><bar>spamm1.xml</bar></foo>
<foo><bar>spamm2.xml</bar></foo>
</root>
Actual output
<root>
<foo><bar>spam2.xml</bar></foo>
<foo><bar>spam2.xml</bar></foo>
<foo><bar>spam2.xml</bar></foo>
</root>
The problem is in your copy phase:
you should make a new copy for each insertion, otherwise both inserted elements are the same object
you should use copy.deepcopy(), because a shallow copy still shares the child <bar> element with the original
I use Python 3, so the copy() method doesn't exist. I had to use the copy module, calling deepcopy once per inserted element (otherwise you're copying only once) to make sure nothing is shared.
The part of the code I changed (better with a loop):
import copy
tree = et.fromstring(my_xml)
for _ in range(2):
    el = copy.deepcopy(list(tree)[0])
    tree.insert(0, el)
result:
<root><foo><bar>spam1.xml</bar></foo><foo><bar>spam1.xml</bar></foo><foo><bar>spam2.xml</bar></foo></root>
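For reference, here is a self-contained Python 3 sketch of the same idea. It is not the code from the question: as an assumption about the intended result, it renames by splitting on the last dot instead of slicing at fixed positions, so that the text matches the expected output in the question (spamm1.xml, spamm2.xml).

```python
import copy
from xml.etree import ElementTree as et

my_xml = "<root><foo><bar>spamm.xml</bar></foo></root>"
tree = et.fromstring(my_xml)

# One deep copy per insertion, so no <foo> or <bar> object is shared.
for _ in range(2):
    tree.insert(0, copy.deepcopy(tree[0]))

bars = tree.findall('./foo/bar')
for cnt, bar in enumerate(bars):
    if cnt:  # leave the first element untouched
        stem, ext = bar.text.rsplit('.', 1)  # assumption: rename before the extension
        bar.text = '{}{}.{}'.format(stem, cnt, ext)

result = et.tostring(tree, encoding='unicode')
print(result)
```

Because every <bar> is now a distinct object, assigning to one element's text no longer changes the others.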

How to extract from nested dictionaries using list comprehension

I'm trying to extract some data from XML. I'm using xmltodict to load the data into a dictionary, then using list comprehensions to pull out individual parts into separate lists. I will later be plotting these using matplotlib.
XML:
<?xml version="1.0" ?>
<MYDATA>
  <SESSION ID="1234">
    <INFO>
      <BEGIN LOAD="23"/>
    </INFO>
    <TRANSACTION ID="2103645570">
      <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="4315547431">
      <ANSWER>This is an answer</ANSWER>
    </TRANSACTION>
  </SESSION>
  <SESSION ID="5678">
    <INFO>
      <BEGIN LOAD="28"/>
    </INFO>
    <TRANSACTION ID="4099381642">
      <ANSWER>Hello</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="1220404184">
      <ANSWER>A Different answer</ANSWER>
    </TRANSACTION>
    <TRANSACTION ID="201506542">
      <ANSWER>Yet another one</ANSWER>
    </TRANSACTION>
  </SESSION>
</MYDATA>
My code:
from collections import OrderedDict
# doc contains the xml exactly as loaded by xmltodict
doc = OrderedDict([(u'MYDATA', OrderedDict([(u'SESSION', [OrderedDict([(u'#ID', u'1234'), (u'INFO', OrderedDict([(u'BEGIN', OrderedDict([(u'#LOAD', u'23')]))])), (u'TRANSACTION', [OrderedDict([(u'#ID', u'2103645570'), (u'ANSWER', u'Hello')]), OrderedDict([(u'#ID', u'4315547431'), (u'ANSWER', u'This is an answer')])])]), OrderedDict([(u'#ID', u'5678'), (u'INFO', OrderedDict([(u'BEGIN', OrderedDict([(u'#LOAD', u'28')]))])), (u'TRANSACTION', [OrderedDict([(u'#ID', u'4099381642'), (u'ANSWER', u'Hello')]), OrderedDict([(u'#ID', u'1220404184'), (u'ANSWER', u'A Different answer')]), OrderedDict([(u'#ID', u'201506542'), (u'ANSWER', u'Yet another one')])])])])]))])
sess_ids = [i['#ID'] for i in doc['MYDATA']['SESSION']]
print sess_ids
sess_loads = [i['INFO']['BEGIN']['#LOAD'] for i in doc['MYDATA']['SESSION']]
print sess_loads
trans_ids = [[j['#ID'] for j in i['TRANSACTION']] for i in doc['MYDATA']['SESSION']]
print trans_ids
Output:
sess_ids: [u'1234', u'5678']
sess_loads: [u'23', u'28']
trans_ids: [[u'2103645570', u'4315547431'], [u'4099381642', u'1220404184', u'201506542']]
You can see that I'm able to access the ID attributes from the SESSION elements and also the LOAD attributes from the BEGIN elements.
I need to get the ID attributes from the TRANSACTION elements as a single list. Currently I'm getting a list of lists in variable trans_ids.
How can I get just a flat list of the values?
I have tried:
[j['#ID'] for j in i['TRANSACTION'] for i in doc['MYDATA']['SESSION']]
but that just repeats the second session twice, giving:
[u'4099381642',
u'4099381642',
u'1220404184',
u'1220404184',
u'201506542',
u'201506542']
Is there a reason you need to go to a dictionary? This sort of thing is fairly straightforward in XML:
>>> import xml.etree.ElementTree as etree
>>> txml = etree.fromstring(xml_string)  # xml_string holds the XML shown above
>>> txml.findall('SESSION/TRANSACTION')
[<Element TRANSACTION at 0x4064f9d8>,
 <Element TRANSACTION at 0x4064fa20>,
 <Element TRANSACTION at 0x4064f990>,
 <Element TRANSACTION at 0x4064fa68>,
 <Element TRANSACTION at 0x4064fab0>]
>>> [x.get('ID') for x in txml.findall('SESSION/TRANSACTION')]
['2103645570', '4315547431', '4099381642', '1220404184', '201506542']
At least, it seems more compact to me.
I have tried:
[j['#ID'] for j in i['TRANSACTION'] for i in doc['MYDATA']['SESSION']]
You nearly had it. Just reverse the inner for..in parts:
>>> [j['#ID'] for i in doc['MYDATA']['SESSION'] for j in i['TRANSACTION']]
[u'2103645570', u'4315547431', u'4099381642', u'1220404184', u'201506542']
To understand this, take a look at this example:
>>> a = [[1, 2, 3], [4, 5, 6]]
>>> [j for j in i for i in a]
[4, 4, 5, 5, 6, 6]
>>> [j for i in a for j in i]
[1, 2, 3, 4, 5, 6]
When there are multiple for..in parts in a list comprehension, they are evaluated from left to right. So if your loop would look like this:
for i in a:
    for j in i:
        j
Then you have to specify the parts in the same order, from outer to inner, instead of from inner to outer:
[j for i in a for j in i]
from itertools import chain
list(chain(*trans_ids))
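A small sketch of the flattening options, using the trans_ids values from the question as plain strings. chain.from_iterable is a variant of chain(*...) that avoids unpacking the outer list into arguments, and the comprehension is the same outer-to-inner form shown earlier.

```python
from itertools import chain

trans_ids = [['2103645570', '4315547431'],
             ['4099381642', '1220404184', '201506542']]

flat_unpacked = list(chain(*trans_ids))            # unpacks the outer list as arguments
flat_lazy = list(chain.from_iterable(trans_ids))   # consumes the outer list lazily
flat_comprehension = [tid for session in trans_ids for tid in session]

print(flat_lazy)
```

All three produce the same flat list; which one reads best is largely a matter of taste.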

How to loop over a circular list, while peeking ahead and behind current element?

With the following example list: L = ['a','b','c','d']
I'd like to achieve the following output:
>>> a d b
>>> b a c
>>> c b d
>>> d c a
Pseudo-code would be:
for e in L:
    print(e, letter_before_e, letter_after_e)
You could just loop over L and take the index i minus and plus 1 modulo len(L) to get the previous and next element.
You're pretty much there
for i, e in enumerate(L):
    print(e, L[i-1], L[(i+1) % len(L)])
EDITED to add mod
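Wrapped up as a small generator, the same idea might look like this (the name with_neighbours is hypothetical). The previous element relies on Python's negative indexing, so L[-1] is the last item, and only the next element needs the modulo.

```python
def with_neighbours(items):
    """Yield (item, previous, next) for each item, wrapping around the ends."""
    n = len(items)
    for i, item in enumerate(items):
        # items[i - 1] wraps via negative indexing; (i + 1) % n wraps at the end.
        yield item, items[i - 1], items[(i + 1) % n]


L = ['a', 'b', 'c', 'd']
for item, before, after in with_neighbours(L):
    print(item, before, after)
```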
it would probably be overkill in this case, but this is the general use-case for a circular doubly-linked list http://ada.rg16.asn-wien.ac.at/~python/how2think/english/chap17.htm
It's often simpler conceptually to keep track of items you've already seen than to look ahead. The deque class is ideal for keeping track of the n previous items because it lets you set a maximum length; appending new items automatically pushes old items off.
from collections import deque
l = ['a','b','c','d']
d = deque(l[-2:], maxlen=3)
for e in l:
    d.append(e)
    print d[1], d[0], d[2]
The only difference in this solution is that d c a will come first instead of last. If that matters, you can start out as though you've already seen one iteration:
from collections import deque
l = ['a','b','c','d']
d = deque(l[-1:] + l[:1], maxlen=3)
for e in l[1:] + l[:1]:
    d.append(e)
    print d[1], d[0], d[2]
In my code I would use a moving window of 3 elements over the list prepended by the last element and appended by the first element:
from itertools import tee, izip, chain  # izip is Python 2; on Python 3 use the built-in zip
def window(iterable,n):
    '''Moving window
    window([1,2,3,4,5],3) -> (1,2,3), (2,3,4), (3,4,5)
    '''
    els = tee(iterable,n)
    for i,el in enumerate(els):
        for _ in range(i):
            next(el, None)
    return izip(*els)
def chunked(L):
    it = chain(L[-1:], L, L[:1]) # (1,2,3,4,5) -> (5,1,2,3,4,5,1)
    for a1,a2,a3 in window(it,3): # (3,1,2,3,1) -> (3,1,2), (1,2,3), (2,3,1)
        yield (a2,a1,a3)
## Usage example ##
L = ['a','b','c','d']
for t in chunked(L):
    print(' '.join(t))

Move an entire element in with lxml.etree

Within lxml, is it possible, given an element, to move the entire thing elsewhere in the XML document without having to read all of its children and recreate it? My best example would be changing parents. I've rummaged around the docs a bit but haven't had much luck. Thanks in advance!
.append, .insert and other operations do that by default
>>> from lxml import etree
>>> tree = etree.XML('<a><b><c/></b><d><e><f/></e></d></a>')
>>> node_b = tree.xpath('/a/b')[0]
>>> node_d = tree.xpath('/a/d')[0]
>>> node_d.append(node_b)
>>> etree.tostring(tree) # complete 'b'-branch is now under 'd', after 'e'
'<a><d><e><f/></e><b><c/></b></d></a>'
>>> node_f = tree.xpath('/a/d/e/f')[0] # Nothing stops us from moving it again
>>> node_f.append(node_b) # Now 'b' and its child are under 'f'
>>> etree.tostring(tree)
'<a><d><e><f><b><c/></b></f></e></d></a>'
Be careful when moving nodes that have tail text. In lxml, tail text belongs to the node and moves around with it. (Also, when you delete a node, its tail text is deleted too.)
>>> tree = etree.XML('<a><b><c/></b>TAIL<d><e><f/></e></d></a>')
>>> node_b = tree.xpath('/a/b')[0]
>>> node_d = tree.xpath('/a/d')[0]
>>> node_d.append(node_b)
>>> etree.tostring(tree)
'<a><d><e><f/></e><b><c/></b>TAIL</d></a>'
Sometimes that is the desired effect, but sometimes you will need something like this:
>>> tree = etree.XML('<a><b><c/></b>TAIL<d><e><f/></e></d></a>')
>>> node_b = tree.xpath('/a/b')[0]
>>> node_d = tree.xpath('/a/d')[0]
>>> node_a = tree.xpath('/a')[0]
>>> # Manually move text
>>> node_a.text = node_b.tail
>>> node_b.tail = None
>>> node_d.append(node_b)
>>> # Now TAIL text stays within its old place
>>> etree.tostring(tree)
'<a>TAIL<d><e><f/></e><b><c/></b></d></a>'
You could use the .append() or .insert() methods to add a subelement to an existing element:
>>> from lxml import etree
>>> from_ = etree.fromstring("<from/>")
>>> to = etree.fromstring("<to/>")
>>> to.append(from_)
>>> etree.tostring(to)
'<to><from/></to>'
