Navigating XML based on the last node you processed in Python - python

In Python, I am trying to navigate XML nodes and build links by traversing them based on the last node processed. I have a set of source and target nodes, and I have to traverse from a source to its target, then from that target (taken as the next source) to its own target, and so on. The same node may appear multiple times.
The XML structure is below:
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
In the XML above, I have to go from the first sourceNode (FCMComposite_1_1) to the first targetNode (FCMComposite_1_2). From this targetNode (the last node processed) I then have to find the row whose sourceNode has the same value, in this case the 4th row, follow it to its targetNode, and so on.
What is the best way to achieve this? Is a graph a good option? Can someone please help me?

You can use a dictionary to store the connections. What you posted isn't actually XML, so I just use re to parse it, but you can do the parsing differently.
import re

data = '''
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
'''

beginning = None
connections = {}
for line in data.split('\n'):
    m = re.match(r'targetNode="([^"]+)" sourceNode="([^"]+)"', line)
    if m:
        target = m.group(1)
        source = m.group(2)
        if beginning is None:
            beginning = source
        connections[source] = target

print('Starting at', beginning)
current = beginning
while current in connections:
    print(current, '->', connections[current])
    current = connections[current]
Output:
Starting at FCMComposite_1_1
FCMComposite_1_1 -> FCMComposite_1_2
FCMComposite_1_2 -> FCMComposite_1_8
FCMComposite_1_8 -> FCMComposite_1_3
FCMComposite_1_3 -> FCMComposite_1_5
FCMComposite_1_5 -> FCMComposite_1_6
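I'm not sure what's supposed to happen with the multiple targets for FCMComposite_1_5. A plain dictionary keeps only the last target seen for each source, which is why FCMComposite_1_4 does not appear in the output.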
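One way to keep every target of a source such as FCMComposite_1_5 is to map each source to a list of targets and walk the result depth-first. The following is only a sketch: it assumes the attributes sit on elements named `connections` inside a root element named `flow` (both names are hypothetical, since the real file was not posted), and it uses `xml.etree.ElementTree` instead of `re`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical wrapper: the element names "flow" and "connections" are assumptions.
XML = """<flow>
  <connections targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"/>
  <connections targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"/>
  <connections targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"/>
  <connections targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"/>
  <connections targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"/>
  <connections targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"/>
  <connections targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"/>
</flow>"""


def build_graph(xml_text):
    """Map every sourceNode to the list of its targetNodes, in document order."""
    graph = defaultdict(list)
    first = None
    for conn in ET.fromstring(xml_text).iter("connections"):
        source, target = conn.get("sourceNode"), conn.get("targetNode")
        if first is None:
            first = source  # the first source in the file is the starting point
        graph[source].append(target)
    return first, graph


def walk(graph, start):
    """Depth-first walk from start; returns each reachable (source, target) edge once."""
    seen, stack, edges = {start}, [start], []
    while stack:
        node = stack.pop()
        for target in graph.get(node, []):
            edges.append((node, target))
            if target not in seen:  # guard against cycles and repeated nodes
                seen.add(target)
                stack.append(target)
    return edges


start, graph = build_graph(XML)
edges = walk(graph, start)
for source, target in edges:
    print(source, "->", target)
```

With this representation both FCMComposite_1_4 and FCMComposite_1_6 are reported as targets of FCMComposite_1_5, while FCMComposite_1_9 is never visited because nothing leads to it from the starting node.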

Related

How to traverse an ordered "network"?

I have an interesting python "network" challenge. Not sure what data construct to use and am hoping to learn something new from the experts!
The following is a snippet from an ordered table (see SORT column) with a FROM NODE and TO NODE. This table represents a hydrologic network (watershed sequencing) from upstream to downstream. TO NODE = 0 indicates the bottom of the network.
Given there can be many branches, how does one traverse a network from any given FROM NODE to TO NODE = 0 and store the list of nodes that are sequenced over?
For example, starting at HYBAS_ID (the same as FROM NODE) = 2120523430 (see the <------- marker about two-thirds of the way down the table), the sequence of nodes would be:
[2120523430, 2120523020, 2120523290, 2120523280, 2120523270, 2120020790, 0]
Table:
SORT FROM NODE TO NODE
30534 2121173490 2120522070
30533 2120521930 2120521630
30532 2120522070 2120521630
30531 2120521620 2120522930
30530 2120521630 2120522930
30529 2121172200 2121173080
30528 2120522930 2120523360
30527 2120522940 2120523360
30526 2120519380 2120523170
30525 2120519520 2120523170
30524 2120523440 2120523350
30523 2120523360 2120523350
30522 2120525750 2120523430
30521 2120525490 2120523430
30520 2121173080 2120522820
30519 2120523430 2120523020 <------- if you start network here
30518 2120523350 2120523020
30517 2120522820 2120523290
30516 2120523020 2120523290 <------ the second node is here
30515 2120523170 2120523280
30514 2120523290 2120523280 <------ third node here
30513 2120523160 2120523270
30512 2120523280 2120523270 <------ fourth
30511 2120523150 2120020790
30510 2120523270 2120020790 <------ fifth
30509 2120020790 0 <------ End.
Would a dictionary and some kind of graph structure work to traverse this network? The code may be fed a list of thousands or millions of FROM NODES to calculate the routing so efficiency is important.
The standard way to do this is with a dictionary representing a directed graph, mapping each FROM NODE to its TO NODE, and then to follow the links. First populate the dictionary from the table (let's say it is stored as a CSV file, for the sake of argument), then walk from the starting node until you reach 0:
my_graph = {}
with open("myfile.csv") as f:
    for l in f:
        # columns: SORT, FROM NODE, TO NODE (assumes the file has no header row)
        split_lines = l.split(",")
        my_graph[int(split_lines[1])] = int(split_lines[2])

HYBAS_ID = 2120523430
current_hybas = my_graph[HYBAS_ID]
return_value = [HYBAS_ID]
while current_hybas != 0:
    return_value.append(current_hybas)
    current_hybas = my_graph[current_hybas]
return_value.append(0)  # include the terminal node, as in the expected output
print(return_value)
Should approximately get you what you want.
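Since the question shows a whitespace-separated table rather than a CSV, here is a hedged variant of the same idea that parses that layout directly and guards against cycles. The function names `load_network` and `route` are made up for this sketch, and the table is a six-row excerpt of the one in the question.

```python
def load_network(lines):
    """Build {from_node: to_node} from whitespace-separated SORT/FROM/TO rows."""
    network = {}
    for line in lines:
        fields = line.split()
        # Skip the header, blank lines and anything whose first three fields
        # are not integers; trailing annotations such as "<---" are ignored.
        if len(fields) < 3 or not all(f.isdigit() for f in fields[:3]):
            continue
        network[int(fields[1])] = int(fields[2])
    return network


def route(network, start):
    """Follow TO NODE links from start down to 0 and return every node visited."""
    path, seen = [start], {start}
    node = start
    while node != 0:
        node = network[node]  # a KeyError here means the table is incomplete
        if node in seen:
            raise ValueError("cycle detected at node %d" % node)
        seen.add(node)
        path.append(node)
    return path


TABLE = """SORT FROM NODE TO NODE
30519 2120523430 2120523020 <------- if you start network here
30516 2120523020 2120523290
30514 2120523290 2120523280
30512 2120523280 2120523270
30510 2120523270 2120020790
30509 2120020790 0 <------ End.
"""

network = load_network(TABLE.splitlines())
result = route(network, 2120523430)
print(result)
```

Each lookup is a single dictionary access, so one route costs time proportional to its length. Building the dictionary once and reusing it for every starting node is what keeps thousands or millions of queries affordable.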

Parsing orphaned XML children

I've been trying to parse the following XML file using xml.etree: Bills.xml
This is the simple python source: xml.py
I'm able to successfully print the child under BILLFIXED using the for loop. The result of which is as under:
1-Apr-2017 [Registered Creditor] 1
1-Apr-2017 [Registered Creditor] 58
However, as you can see in the XML, certain orphaned children (BILLCL, BILLOVERDUE and BILLDUE), which logically belong under BILLFIXED, are not included in the output, because we only find the elements under BILLFIXED using the following code:
billfixed = dom.findall('BILLFIXED')
Is there any way to include BILLCL, BILLDUE and BILLOVERDUE under their respective listings? I can't think of any logic that would let me treat those orphaned children as sub-children of BILLFIXED.
Thanks!
You could use zip:
for bill_fixed_node, bill_cl in zip(root.findall('BILLFIXED'), root.iter('BILLCL')):
    print(bill_fixed_node)
    print(bill_cl.text)

# <Element 'BILLFIXED' at 0x07905120>
# 600.00
# <Element 'BILLFIXED' at 0x079052D0>
# 10052.00
But it would probably be better to fix the structure of the XML file if you have control over it.
A friend of mine was able to answer and help me with the following code: https://gist.github.com/anonymous/dba333b6c6342d13d21fd8c0781692cb
from xml.etree import ElementTree

dom = ElementTree.parse('bills.xml')
billfixed = dom.findall('BILLFIXED')
billcl = dom.findall('BILLCL')
billdue = dom.findall('BILLDUE')
billoverdue = dom.findall('BILLOVERDUE')

# pair the n-th BILLFIXED with the n-th BILLCL, BILLDUE and BILLOVERDUE
for fixed, cl, due, overdue in zip(billfixed, billcl, billdue, billoverdue):
    date = fixed.find('BILLDATE').text
    ref = fixed.find('BILLREF').text
    party = fixed.find('BILLPARTY').text
    print(' * {} [{}] {} + {} + {} + {}'.format(
        date, party, ref, cl.text, due.text, overdue.text
    ))
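The zip-based answers pair elements purely by position, so they silently misalign if one bill lacks, say, a BILLOVERDUE element. An alternative sketch is to walk the children in document order and attach each orphan to the most recent BILLFIXED. The sample below is an assumption about the file's shape: the root name `ENVELOPE` and the BILLDUE/BILLOVERDUE values are placeholders, and only the dates, references, party and BILLCL amounts come from the output shown above.

```python
from xml.etree import ElementTree

# Assumed layout: orphans follow the BILLFIXED element they belong to.
SAMPLE = """<ENVELOPE>
  <BILLFIXED>
    <BILLDATE>1-Apr-2017</BILLDATE>
    <BILLREF>1</BILLREF>
    <BILLPARTY>Registered Creditor</BILLPARTY>
  </BILLFIXED>
  <BILLCL>600.00</BILLCL>
  <BILLDUE>due-1</BILLDUE>
  <BILLOVERDUE>overdue-1</BILLOVERDUE>
  <BILLFIXED>
    <BILLDATE>1-Apr-2017</BILLDATE>
    <BILLREF>58</BILLREF>
    <BILLPARTY>Registered Creditor</BILLPARTY>
  </BILLFIXED>
  <BILLCL>10052.00</BILLCL>
  <BILLDUE>due-2</BILLDUE>
</ENVELOPE>"""


def group_bills(root):
    """Return one dict per BILLFIXED, including the orphans that follow it."""
    bills = []
    for child in root:  # direct children, in document order
        if child.tag == "BILLFIXED":
            bills.append({sub.tag: sub.text for sub in child})
        elif bills:
            # an orphan sibling: attach it to the most recent BILLFIXED
            bills[-1][child.tag] = child.text
    return bills


bills = group_bills(ElementTree.fromstring(SAMPLE))
for bill in bills:
    print(bill)
```

Here the second bill simply has no BILLOVERDUE key, instead of shifting every later value by one position.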

First time parsing XML in Python: this can't be like it was meant to be, can it?

I need to read data from an XML file and am using ElementTree. Reading a number of nodes looks like this ATM:
def read_edl_ev_ids(xml_tree):
    # Read all EDL events (those which start with "EDL_EV_") from XML
    # and put them into a dict with
    # symbolic name as key and number as value. The XML looks like:
    # <...>
    #   <COMPU-METHOD>
    #     <SHORT-NAME>DT_EDL_EventType</SHORT-NAME>
    #     <...>
    #     <COMPU-SCALE>
    #       <LOWER-LIMIT>number</LOWER-LIMIT>
    #       <....>
    #       <COMPU-CONST>
    #         <VT>EDL_EV_symbolic_name</VT>
    #       </COMPU-CONST>
    #     </COMPU-SCALE>
    #   </COMPU-METHOD>
    edl_ev = {}
    for node in xml_tree.findall('.//COMPU-METHOD'):
        # .text is an attribute in ElementTree, not a method
        if node.find('./SHORT-NAME').text == 'DT_EDL_EventType':
            for subnode in node.findall('.//COMPU-SCALE'):
                lower_limit = subnode.find('./LOWER-LIMIT').text
                edl_ev_name = subnode.find('./COMPU-CONST/VT').text
                if edl_ev_name.startswith('EDL_EV_'):
                    edl_ev[edl_ev_name] = lower_limit or '0'
    return edl_ev
To sum it up: I don't like it. It's clearly XML-parsing beginner's code, and it is ugly, tedious to maintain, inflexible, DRY-violating, etc. Is there a better (declarative?) way to read in XML?
Try taking a look at the lxml library's examples. Specifically, I think you'll want to take a look at its XPath support.
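To give an idea of what a more declarative version could look like, here is a sketch that pushes the SHORT-NAME check into the path expression. Note the swap: it uses the limited XPath subset of the standard library's `xml.etree.ElementTree` rather than lxml, so that it runs without extra packages. The sample document and its `ROOT` element are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the tag names come from the question.
SAMPLE = """<ROOT>
  <COMPU-METHOD>
    <SHORT-NAME>DT_EDL_EventType</SHORT-NAME>
    <COMPU-SCALE>
      <LOWER-LIMIT>1</LOWER-LIMIT>
      <COMPU-CONST><VT>EDL_EV_start</VT></COMPU-CONST>
    </COMPU-SCALE>
    <COMPU-SCALE>
      <LOWER-LIMIT>2</LOWER-LIMIT>
      <COMPU-CONST><VT>OTHER_name</VT></COMPU-CONST>
    </COMPU-SCALE>
  </COMPU-METHOD>
  <COMPU-METHOD>
    <SHORT-NAME>SomethingElse</SHORT-NAME>
    <COMPU-SCALE>
      <LOWER-LIMIT>9</LOWER-LIMIT>
      <COMPU-CONST><VT>EDL_EV_ignored</VT></COMPU-CONST>
    </COMPU-SCALE>
  </COMPU-METHOD>
</ROOT>"""

# The [tag='text'] predicate selects only the COMPU-METHOD whose SHORT-NAME matches.
SCALES = ".//COMPU-METHOD[SHORT-NAME='DT_EDL_EventType']//COMPU-SCALE"


def read_edl_ev_ids(root):
    """Return {symbolic name: lower limit} for the EDL_EV_* entries."""
    events = {}
    for scale in root.iterfind(SCALES):
        name = scale.findtext("COMPU-CONST/VT", default="")
        if name.startswith("EDL_EV_"):
            events[name] = scale.findtext("LOWER-LIMIT") or "0"
    return events


events = read_edl_ev_ids(ET.fromstring(SAMPLE))
print(events)
```

lxml accepts full XPath 1.0, so with it the `startswith` test could presumably move into the expression as well; the standard library subset does not support that, which is why it stays in Python here.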

Parsing a list of words into a tree

I have a list of words. For example:
reel
road
root
curd
I would like to store this data in a manner that reflects the following structure:
Start -> r -> e -> reel
-> o -> a -> road
o -> root
c -> curd
It is apparent to me that I need to implement a tree. From this tree, I must be able to easily obtain statistics such as the height of a node, the number of descendants of a node, searching for a node and so on. Adding a node should 'automatically' add it to the correct position in the tree, since this position is unique.
I would also like to be able to visualize the data in the form of an actual graphical tree. Since the tree is going to be huge, I would need zoom / pan controls on the visualization. And of course, a pretty visualization is always better than an ugly one.
Does anyone know of a Python package which would allow me to achieve all this simply? Writing the code myself will take quite a while. Do you think http://packages.python.org/ete2/ would be appropriate for this task?
I'm on Python 2.x, btw.
I discovered that NLTK has a trie class - nltk.containers.trie. This is convenient for me, since I already use NLTK. Does anyone know how to use this class? I can't find any examples anywhere! For example, how do I add words to the trie?
ETE2 is an environment for tree exploration, made in principle for browsing, building and exploring phylogenetic trees; I used it a long time ago for these purposes.
But it's possible that, if you set up your data properly, you could get it done.
You just have to place parentheses wherever you need to split your tree and create a branch. See the following example, taken from the ETE documentation.
If you replace "(A,B,(C,D));" with your words/letters, it should work.
from ete2 import Tree
unrooted_tree = Tree( "(A,B,(C,D));" )
print unrooted_tree
output:
     /-A
    |
----|--B
    |
    |     /-C
     \---|
          \-D
...and this package will let you do most of the operations you want, giving you the chance to select every branch individually and operate on it in an easy way.
I recommend you have a look at the tutorial anyway; it is not very difficult :)
I think the following example does pretty much what you want, using the ETE toolkit.
from ete2 import Tree

words = ["reel", "road", "root", "curd", "curl", "whatever", "whenever", "wherever"]

# Creates an empty tree
tree = Tree()
tree.name = ""
# Let's keep the tree structure indexed
name2node = {}
# Make sure there are no duplicates
words = set(words)
# Populate tree
for wd in words:
    # If no similar words exist, add it to the base of tree
    target = tree
    # Find relatives in the tree
    for pos in xrange(len(wd), -1, -1):
        root = wd[:pos]
        if root in name2node:
            target = name2node[root]
            break
    # Add new nodes as necessary
    fullname = root
    for letter in wd[pos:]:
        fullname += letter
        new_node = target.add_child(name=letter, dist=1.0)
        name2node[fullname] = new_node
        target = new_node

# Print structure
print tree.get_ascii()

# You can also use all the visualization machinery from ETE
# (http://packages.python.org/ete2/tutorial/tutorial_drawing.html)
# tree.show()

# You can find, isolate and operate with a specific node using the index
wh_node = name2node["whe"]
print wh_node.get_ascii()

# You can rebuild words under a given node
def reconstruct_fullname(node):
    name = []
    while node.up:
        name.append(node.name)
        node = node.up
    name = ''.join(reversed(name))
    return name

for leaf in wh_node.iter_leaves():
    print reconstruct_fullname(leaf)
                    /n-- /e-- /v-- /e-- /-r
               /e--|
     /w-- /h--|     \r-- /e-- /v-- /e-- /-r
    |         |
    |          \a-- /t-- /e-- /v-- /e-- /-r
    |
    |     /e-- /e-- /-l
----|-r--|
    |    |     /o-- /-t
    |     \o--|
    |          \a-- /-d
    |
    |               /-d
     \c-- /u-- /r--|
                    \-l
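For comparison, the same prefix tree can be sketched without any package by nesting plain dictionaries, which also makes the requested statistics (searching, descendants, height) short to write. This is not how ETE or NLTK store their trees; the sentinel key `END` and all function names below are choices made for this sketch, and it offers no visualization.

```python
END = "$"  # sentinel key marking the end of a word (an arbitrary choice)


def build_trie(words):
    """Return a nested-dict trie with one level per letter."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})  # reuse or create the branch
        node[END] = word
    return root


def find(trie, prefix):
    """Return the node reached by following prefix, or None if it is absent."""
    node = trie
    for letter in prefix:
        node = node.get(letter)
        if node is None:
            return None
    return node


def words_below(node):
    """List every complete word stored at or below node."""
    found = []
    for key, child in node.items():
        if key == END:
            found.append(child)
        else:
            found.extend(words_below(child))
    return sorted(found)


def height(node):
    """Number of letters on the longest path below node."""
    children = [child for key, child in node.items() if key != END]
    return 1 + max(map(height, children)) if children else 0


words = ["reel", "road", "root", "curd", "curl", "whatever", "whenever", "wherever"]
trie = build_trie(words)
print(words_below(find(trie, "whe")))
print(height(trie))
```

Adding a word places it at its unique position automatically, because `setdefault` follows an existing branch as far as it goes and only creates the missing letters.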

Given a list of maze houses (2 dimensions), how do I create a directed graph with a dictionary?

I can only make an undirected graph. I have no idea how to make a directed one.
Any ideas?
Apologies for the rather long winded post. I had time to kill on the train.
I'm guessing what you're after is a directed graph representing all paths leading away from your starting position (as opposed to a graph representation of the maze which one can use to solve arbitrary start/end positions).
(No offence meant, but) this sounds like homework, or at least, a task that is very suitable for homework. With this in mind, the following solution focuses on simplicity rather than performance or elegance.
Approach
One straightforward way to do this would be to first store your map in a more navigable format, then, beginning with the start node, do the following:
1. Look up its neighbours (top, bottom, left, right).
2. For each neighbour:
   - if it is not a possible path, ignore it;
   - if we have processed this node before, ignore it;
   - otherwise, add this node as an edge and push it onto a queue (not a stack, more on this later) for further processing.
3. For each node in the queue/stack, repeat from step 1.
(See example implementation below)
At this point, you'll end up with a directed acyclic graph (DAG) with the starting node at the top of the tree and the end node as one of the leaves. Solving the maze is easy at this point. See this answer on solving a maze represented as a graph.
A possible optimisation when building the graph would be to stop once the end point is found. You'll end up with an incomplete graph, but if you're only concerned about the final solution this doesn't matter.
Stack or queue?
Note that using a stack (first in last out) would mean building the graph in a depth-first manner, while using a queue (first in first out) would result in a breadth-first approach.
You would generally want to use a queue (breadth-first) if the intention is to look for the shortest path. Consider the following map:
       START
########   ######
########   ######
### b    a ######
###   ##   ######
###   ## e ######
### c    d ######
########   ######
########      END
#################
If the path is traversed depth-first and at branch a you happen to take the a->b path before a->e, you end up with the graph:
 START
   |
   a
  / \
 b   e   <-- end, since d already visited
 |
 c
 |
 d
  \
   END
However, using a breadth-first approach the a->e path would come across node d earlier, resulting in a shorter graph and a better solution:
 START
   |
   a
  / \
 b   e
 |   |
 c   d
     |
    END
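The breadth-first case can be sketched in a few lines with `collections.deque`. The graph below is a hand transcription of the small map above (its node names and edges are read off the drawing, not generated from a maze file), and `shortest_path` is a name chosen for this sketch.

```python
from collections import deque

# Edges transcribed by hand from the example map; both routes to d are present.
GRAPH = {
    "START": ["a"],
    "a": ["b", "e"],
    "b": ["c"],
    "c": ["d"],
    "e": ["d"],
    "d": ["END"],
    "END": [],
}


def shortest_path(graph, start, end):
    """Breadth-first search; returns the node list from start to end, or None."""
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()  # FIFO order gives breadth-first traversal
        if node == end:
            path = []
            while node is not None:  # walk the parent links back to start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


path = shortest_path(GRAPH, "START", "END")
print(path)
```

Because e is dequeued before c, node d is first reached through e, so the recovered path is the shorter one. Swapping `popleft()` for `pop()` would turn the queue into a stack and the search into a depth-first one.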
Example code
Sample input provided:
..........
#########.
..........
.#########
......#...
#####...#.
##...####.
##.#......
...#######
e = (0,0)
s = (8,0)
DISCLAIMER: The following code is written for clarity, not speed. It is not fully tested so there is no guarantee of correctness but it should give you an idea of what is possible.
We assume that the input file is formatted consistently. Most error checking is left out for brevity.
# regex to extract start/end positions
import re

re_sepos = re.compile(r"""
    ^([se])\s*   # capture first char (s or e) followed by arbitrary spaces
    =\s*         # equal sign followed by arbitrary spaces
    \(           # left parenthesis
    (\d+),(\d+)  # capture 2 sets of integers separated by comma
    \)           # right parenthesis
""", re.VERBOSE)


def read_maze(filename):
    """
    Reads input from file and returns tuple (maze, start, end)
      maze : dict holding value of each maze cell { (x1,y1):'#', ... }
      start: start node coordinate (x1,y1)
      end  : end node coordinate (x2,y2)
    """
    # read whole file into a list
    with open(filename, "r") as f:
        data = f.readlines()
    # parse start and end positions from last 2 lines
    pos = {}
    for line in data[-2:]:
        match = re_sepos.match(line)
        if not match:
            raise ValueError("invalid input file")
        c, x, y = match.groups()  # extract values
        pos[c] = (int(x), int(y))
    try:
        start = pos["s"]
        end = pos["e"]
    except KeyError:
        raise ValueError("invalid input file")
    # read ascii maze, '#' for wall '.' for empty slot
    # store as maze = { (x1,y1):'#', (x2,y2):'.', ... }
    # NOTE: this is of course not optimal, but leads to a simpler access later
    maze = {}
    for line_num, line in enumerate(data[:-3]):  # ignore last 3 lines
        for col_num, value in enumerate(line[:-1]):  # ignore \n at end
            maze[(line_num, col_num)] = value
    return maze, start, end


def maze_to_dag(maze, start, end):
    """
    Traverses the map starting from start coordinate.
    Returns directed acyclic graph in the form {(x,y):[(x1,y1),(x2,y2)], ...}
    """
    dag = {}         # directed acyclic graph
    queue = [start]  # queue of nodes to process
    # repeat till queue is empty
    while queue:
        x, y = queue.pop(0)       # get next node in queue
        edges = dag[(x, y)] = []  # init list to store edges
        # for each neighbour (top, bottom, left, right)
        for coord in ((x, y-1), (x, y+1), (x-1, y), (x+1, y)):
            if coord in dag: continue  # visited before, ignore
            node_value = maze.get(coord, None)  # Return None if outside maze
            if node_value == ".":    # valid path found
                edges.append(coord)  # add as edge
                queue.append(coord)  # push into queue
                # uncomment this to stop once we've found the end point
                #if coord == end: return dag
    return dag


if __name__ == "__main__":
    maze, start, end = read_maze("l4.txt")
    dag = maze_to_dag(maze, start, end)
    print(dag)
This page provides a nice tutorial on implementing graphs with Python. From the article, this is an example of a directed graph represented by a dictionary:
graph = {'A': ['B', 'C'],
         'B': ['C', 'D'],
         'C': ['D'],
         'D': ['C'],
         'E': ['F'],
         'F': ['C']}
That said, you might also want to look into existing graph libraries such as NetworkX and igraph.
Since you already have a list, try creating an Adjacency Matrix instead of a dictionary.
list_of_houses = []
n = len(list_of_houses)
# n x n matrix of zeros: no edges yet
directed_graph = [[0] * n for _ in range(n)]
Then for any new edge from one house to another (or whatever the connection is), where from_house and to_house are indices into the list:
directed_graph[from_house][to_house] = 1
and you're done. If there is an edge from house_a to house_b then directed_graph[house_a][house_b] == 1.
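To make the directedness explicit, here is a small sketch that fills such a matrix for three hypothetical houses and then converts it to the dictionary form asked about in the question. The labels "A", "B", "C" and the helper names are invented for the example.

```python
houses = ["A", "B", "C"]  # hypothetical labels
index = {name: i for i, name in enumerate(houses)}
matrix = [[0] * len(houses) for _ in houses]  # all zeros: no edges yet


def add_edge(src, dst):
    """Record an edge src -> dst only; the reverse cell is left untouched."""
    matrix[index[src]][index[dst]] = 1


add_edge("A", "B")
add_edge("B", "C")
add_edge("C", "A")


def to_dict(matrix, names):
    """Convert the adjacency matrix into {node: [successors]}."""
    return {names[i]: [names[j] for j, flag in enumerate(row) if flag]
            for i, row in enumerate(matrix)}


graph = to_dict(matrix, houses)
print(graph)
```

An undirected graph would set both `matrix[i][j]` and `matrix[j][i]`; setting only one of them is all it takes to make the graph directed.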
