How to remove duplicate nodes xml Python - python

I have a special case xml file structure is something like :
<Root>
<parent1>
<parent2>
<element id="Something" >
</parent2>
</parent1>
<parent1>
<element id="Something">
</parent1>
</Root>
My use case is to remove the duplicated element , I want to remove the elements with same Id . I tried the following code with no positive outcome (its not finding the duplicate node)
import xml.etree.ElementTree as ET
path = 'old.xml'
tree = ET.parse(path)
root = tree.getroot()
prev = None
def elements_equal(e1, e2):
if type(e1) != type(e2):
return False
if e1.tag != e1.tag: return False
if e1.text != e2.text: return False
if e1.tail != e2.tail: return False
if e1.attrib != e2.attrib: return False
if len(e1) != len(e2): return False
return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])
for page in root: # iterate over pages
elems_to_remove = []
for elem in page:
for insideelem in page:
if elements_equal(elem, insideelem) and elem != insideelem:
print("found duplicate: %s" % insideelem.text) # equal function works well
elems_to_remove.append(insideelem)
continue
for elem_to_remove in elems_to_remove:
page.remove(elem_to_remove)
# [...]
tree.write("out.xml")
Can someone help me in letting me know how can i solve it. I am very new to python with almost zero experience .

First of all what you're doing is a hard problem in the library you're using, see this question: How to remove a node inside an iterator in python xml.etree.ElemenTree
The solution to this would be to use lxml which "implements the same API but with additional enhancements". Then you can do the following fix.
You seem to be only traversing the second level of nodes in your XML tree. You're getting root, then walking the children its children. This would get you parent2 from the first page and the element from your second page. Furthermore you wouldn't be comparing across pages here:
your comparison will only find second-level duplicates within the same page.
Select the right set of elements using a proper traversal function such as iter:
# Use a `set` to keep track of "visited" elements with good lookup time.
visited = set()
# The iter method does a recursive traversal
for el in root.iter('element'):
# Since the id is what defines a duplicate for you
if 'id' in el.attr:
current = el.get('id')
# In visited already means it's a duplicate, remove it
if current in visited:
el.getparent().remove(el)
# Otherwise mark this ID as "visited"
else:
visited.add(current)

Related

Find the smallest item in a singly-linked List and move to the head?

I had a quiz recently and this is what the question looked like:-
You may use the following Node class:
class Node:
"""Lightweight, nonpublic class for storing a singly linked node."""
__slots__ = 'element', 'next' # streamline memory usage
def __init__(self, element, next): # initialize node's fields
self.element = element # reference to user's element
self.next = next # reference to next node
Assume you have a singly-linked list of unique integers. Write a Python method that traverses this list to find the smallest element, removes the node that contains that value, and inserts the smallest value in a new node at the front of the list. Finally, return the head pointer. For simplicity, you may assume that the node containing the smallest value is not already at the head of the list (ie, you will never have to remove the head node and re-add it again).
Your method will be passed the head of the list as a parameter (of type Node), as in the following method signature:
def moveSmallest(head):
You may use only the Node class; no other methods (like size(), etc) are available. Furthermore, the only pointer you have is head (passed in as a parameter); you do not have access to a tail pointer.
For example, if the list contains:
5 → 2 → 1 → 3
the resulting list will contain:
1 → 5 → 2 → 3
Hint 1: There are several parts to this question; break the problem down and think about how to do each part separately.
Hint 2: If you need to exit from a loop early, you can use the break command.
Hint 3: For an empty list or a list with only one element, there is nothing to do!
My answer:
def moveSmallest(h):
if h==None or h.next==None:
return h
# find part
temp=h
myList=[]
while temp!=None:
myList.append(temp.element)
temp=temp.next
myList.sort()
sv=myList[0]
# remove part
if h.element==sv and h.next!=None:
h=h.next
else:
start=h
while start!=None:
if start.next.element==sv and start.next.next!=None:
start.next=start.next.next
break
if start.next.element==sv and start.next.next==None:
start.next=None
break
start=start.next
# Insert part
newNode=Node(sv)
newNode.next=h
h=newNode
return h
Mark received=10/30
Feedback on my answer:
"Not supposed to use sorting; searching the list should be the way we've covered in class.
You're advancing too far ahead in the list without checking whether nodes exist.
Review the 'singly-linked list' slides and answer this question as the examples suggest."
As you can see I am finding the element in the list and removing it and then adding it to the list as a head node. I ran this code and it works fine. As you can see in the feedback he says "You're advancing too far ahead in the list without checking whether nodes exist." which is taken care by the first if statement in my answer and for "Not supposed to use sorting; searching the list should be the way we've covered in class." I believe my mistake was to use the list at the first place but given the code the final score should be or be more than 20/30. Can you guys please check this or give your opinion on this feedback?
As you can see in the feedback he says "You're advancing too far ahead
in the list without checking whether nodes exist." which is taken care
by the first if statement in my answer
It's not taken care of. The first if-statement of your function just checks to see if the head exists and that the node following the head exists as well (basically, asserting that you have at least two nodes in the linked list).
What you have:
if h.element==sv and h.next!=None:
h=h.next
else:
start=h
while start!=None:
if start.next.element==sv and start.next.next!=None:
If you enter the while loop, you only know the following things:
Your linked list has at least two elements
h.element != sv or h.next == None
The current node is not None
The node following the current node (start.next), however, may be None at some point - when you reach the end of your linked list. Therefore, you are trying to access a node that doesn't exist.
Here's how I would have done it. I haven't tested this, but I'm pretty sure this works:
def moveSmallest(head):
if head is None or head.next is None:
return head
# First, determine the smallest element.
# To do this, we need to visit each node once (except the head).
# Create a cursor that "iterates" through the nodes
# (The cursor can start at the 2nd element because we're guaranteed the head will never have the smallest element.)
cursor = head.next
current_minimum = head.next.element
while cursor is not None:
if current_minimum > cursor.element:
# We've found the new minimum.
current_minimum = cursor.element
cursor = cursor.next
# At this point, current_minimum is the smallest element.
# Next, go through the linked list again until right before we reach the node with the smallest element.
cursor = head
while cursor.next is not None:
if cursor.next.element == current_minimum:
# We want to reconnect the arrows.
if cursor.next.next is None:
cursor.next = None
else:
cursor.next = cursor.next.next
break
new_node = Node(current_minimum, head)
head = new_node
return head

How can I implement doubly linked lists into my tree structure in Python?

So I'm trying to represent an XML file using python. I already managed to do that. But each child node in my tree has to be doubly linked and I don't know how to do that. I've found a few examples of code online but they all use classes and the professor doesn't want us using classes. Here's my code:
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import Element
import xml.etree.ElementTree as etree
def create_tree(): #This function creates the root element of my tree
n = input("Enter your root Element: ")
root = Element(n)
tree = ElementTree(root)
print(etree.tostring(root))
return root
def add_child():
root = create_tree()
new = True
while new == True:
ask = input("If you wish to add a new child type 'yes', else type something else: ")
if ask == 'yes':
n = input("Enter the name of the node: ") #This block of code creates a child and appends it to the root element
node = Element(n)
root.append(node)
print(etree.tostring(root))
else:
break
return etree.tostring(root)
add_child()
For those of you wondering, the purpose of this project is to create a rooted tree with unbounded branching. I hope that once I can implement the doubly linked list I will be able to add a child node within a child node.
You should be able to implement a linked list using lists. You can define
each element as a list with the element itself, the item before it, and the item after it. If the list contains the previous node, the next node, and the element itself, respectively, and the head is stored in head, the syntax to append newElement to the beginning of the list would be as follows:
newNode = [None,head,newElement]
head[0] = newNode
head = newNode

Binary search tree insertion Python

What is wrong with my insert function? I'm passing along the tr and the element el that I wish to insert, but I keep getting errors...
def insert( tr,el ):
""" Inserts an element into a BST -- returns an updated tree """
if tr == None:
return createEyecuBST( el,None )
else:
if el > tr.value:
tr.left = createEyecuBST( el,tr )
else:
tr.right = createEyecuBST( el,tr )
return EyecuBST( tr.left,tr.right,tr)
Thanks in advance.
ERROR:
ValueError: Not expected BST with 2 elements
It's a test function that basically tells me whether or not what I'm putting in is what I want out.
So, the way insertion in a binary tree usually works is that you start at the root node, and then decide which side, i.e. which subtree, you want to insert your element. Once you have made that decision, you are recursively inserting the element into that subtree, treating its root node as the new root node.
However, what you are doing in your function is that instead of going down towards the tree’s leaves, you are just creating a new subtree with the new value immediately (and generally mess up the existing tree).
Ideally, an binary tree insert should look like this:
def insert (tree, value):
if not tree:
# The subtree we entered doesn’t actually exist. So create a
# new tree with no left or right child.
return Node(value, None, None)
# Otherwise, the subtree does exist, so let’s see where we have
# to insert the value
if value < tree.value:
# Insert the value in the left subtree
tree.left = insert(tree.left, value)
else:
# Insert the value in the right subtree
tree.right = insert(tree.right, value)
# Since you want to return the changed tree, and since we expect
# that in our recursive calls, return this subtree (where the
# insertion has happened by now!).
return tree
Note, that this modifies the existing tree. It’s also possible that you treat a tree as an immutable state, where inserting an element creates a completely new tree without touching the old one. Since you are using createEyecuBST all the time, it is possible that this was your original intention.
To do that, you want to always return a newly created subtree representing the changed state of that subtree. It looks like this:
def insert (tree, value):
if tree is None:
# As before, if the subtree does not exist, create a new one
return Node(value, None, None)
if value < tree.value:
# Insert in the left subtree, so re-build the left subtree and
# return the new subtree at this level
return Node(tree.value, insert(tree.left, value), tree.right)
elif value > tree.value:
# Insert in the right subtree and rebuild it
return Node(tree.value, tree.left, insert(tree.right, value))
# Final case is that `tree.value == value`; in that case, we don’t
# need to change anything
return tree
Note: Since I didn’t know what’s the difference in your createEyecuBST function and the EyecuBST type is, I’m just using a type Node here which constructer accepts the value as the first parameter, and then the left and right subtree as the second and third.
Since the binary doesn't have the need to balance out anything , you can write as simple logic as possible while traversing at each step .
--> Compare with root value.
--> Is it less than root then go to left node.
--> Not greater than root , then go to right node.
--> Node exists ? Make it new root and repeat , else add the new node with the value
def insert(self, val):
treeNode = Node(val)
placed = 0
tmp = self.root
if not self.root:
self.root = treeNode
else:
while(not placed):
if val<tmp.info:
if not tmp.left:
tmp.left = treeNode
placed = 1
else:
tmp = tmp.left
else:
if not tmp.right:
tmp.right = treeNode
placed = 1
else:
tmp = tmp.right
return
You can also make the function recursive , but it shouldn't return anything. It will just attach the node in the innermost call .

Python - lxml - how to 'move' around the tree when building the tree

Basic question - how do you 'move' around in a tree when you are building a tree.
I can populate the first level:
import lxml.etree as ET
def main():
root = ET.Element('baseURL')
root.attrib["URL"]='www.com'
root.attrib["title"]='Level Title'
myList = [["www.1.com","site 1 Title"],["www.2.com","site 2 Title"],["www.3.com","site 3 Title"]]
for i in xrange(len(myList)):
ET.SubElement(root, "link_"+str(i), URL=myList[i][0], title=myList[i][1])
This gives me something like:
baseURL:
link_0
link_1
link_2
from there, I want to add a subtree from each of the new nodes so it looks something like:
baseURL:
link_0:
link_A
link_B
link_C
link_1
link_2
I can't see how to 'point' the subElement call to the next node down - I tried:
myList2 = [["www.A.com","site A Title"],["www.B.com","site B Title"],["www.C.com","site C Title"]]
for i in xrange(len(myList2)):
ET.SubElement('link_0', "link_"+str(i), URL=myList2[i][0], title=myList2[i][1])
But that throws the error:
TypeError: Argument '_parent' has incorrect type (expected lxml.etree._Element, got str)
as I am giving the subElement call a string, not an element reference. I also tried it as a variable, (i.e. link_0' rather than"link_0"`) and that gives a global missing variable, so my reference is obviously incorrect.
How do I 'point' my lxml builder to a child as a parent, and write a new child?
ET.SubElement(parent_node,type) creates a new XML element node as a child of parent_node. It also returns this new node.
So you could do this:
import lxml.etree as ET
def main():
root = ET.Element('baseURL')
myList = [1,2,3]
children = []
for x in myList:
children.append( ET.SubElement(root, "link_"+str(x)) )
for y in myList:
ET.SubElement( children[0], "child_"+str(y) )
But keeping track of the children is probably excessive since lxml already provides you with many ways to get to them.
Here's a way using lxmls built in children lists:
node = root[0]
for y in myList:
ET.SubElement( node, "child_"+str(y) )
Here's a way using XPath (possibly better if your XML is getting ugly)
node = root.xpath("/baseURL/link_0")[0]
for y in myList:
ET.SubElement( node, "child_"+str(y) )
Found the answer. I should be using the python array referencing, root[n] not trying to get to it via list_0

Python Lxml: Adding and deleting tags

I am attempting to add and remove tags in an xml tree (snip below). I have a dict of boolean values that I use to determine whether to add or remove a tag. If the value is true, and the element does not exist, it creates the tag (and its parent if it doesn't exist). If false, it deletes the value.
However, it doesn't seem to work, and I can't find out why.
<Assets>
<asset name="Adham">
<pos>
<x>27913.769923</x>
<y>5174.627773</y>
</pos>
<GFX>
<space>P03.png</space>
<exterior>snow.png</exterior>
</GFX>
<presence>
<faction>Dvaered</faction>
<value>10.000000</value>
<range>1</range>
</presence>
<general>
<class>P</class>
<population>100</population>
<services>
<land/>
<refuel/>
</services>
<commodities/>
<description>Fooo</description>
<bar>(null)</bar>
</general>
</asset>
</Assets>
Code:
def writeflagX(self, root, x_path, _flag):
''' Writes flag to tree: deletes if false and already exists
and adds if true but doesn't exist yet)
'''
try:
if root.xpath(x_path):
if not self.flag[_flag]:
#delete value
temp1 = root.xpath(x_path)
temp1.getparent().remove(temp1)
print "removed"
#yeah, pretty ugly
except AttributeError:
#element does not exist, so create it if true value is here
#first, see if parent tag of list items exists, create it if neccesary
#split xpath into leader and item
leader = x_path.split("/")[0]
print leader
item = x_path.split("/")[1]
try:
if root.xpath(leader): #try to see if parent tag exists
child = etree.Subelement(root.xpath(leader), item)
print "no errors"
print "not caught"
except AttributeError:
l2 = leader.split("/")[0]
print l2 + " hi"
try:
l3 = leader.split("/")[1]
if l3: #if this tag is not a direct child of the root
child1 = etree.Subelement(root.xpath(l2), l3)
child1.append(etree.Element(item))
print "no dex error"
except IndexError: #if this tag is a direct child of the root
print "dex error"
child2 = etree.SubElement(root, l2)
def writeALLflagsX(self, _root):
'''Uses writeflagX and sets all the flags
'''
for k in self.flag:
self.writeflagX(_root, self.flagPaths[k], k)
I try to change the mission flag from false to true, and the refuel flag from true to false.
#Change Missions to true and refuel to false
foo = Asset()
###parsing code###
foo.alist["Adham"].flag["Is_missions"] = True
foo.alist["Adham"].flag["Is_refuel"] = False
foo.alist["Adham"].writeALLflagsX(foo.alist["Adham"].node)
foo.writeXML("output.xml")
I am stumped. The missions tag does not get added and the refuel tag does not get deleted.
Does this have something to do with me nesting the try/except statements?
Edit:
Ok, fixed the deleting problem by using a for loop as suggested:
temp1 = root.xpath(x_path)
for n in temp1:
n.getparent().remove(n)
Still can't add a node.
I think I am going to set up a new question that is simpler, since this is too convoluted.
Edit edit: new question that is much better: How to handle adding elements and their parents using xpath
There are several things could be improved in the code:
node.xpath returns the list of nodes - i.e. you can't do root.xpath(path).getparent(), check the list and take the node #0 if you're sure that it should exist (you node deletion code uses this);
when working with attributes try using node.attrib dictionary. Working with attributes becomes as easy as modifying python dictionary (del node.attrib[attr] and node.attrib[attr] = value, make sure the value is str though);
it might be useful to use etree.XML('<myelement><child/></myelement>') for creating new nodes.
Hope it helps.

Categories

Resources