Parsing a list of words into a tree

I have a list of words. For example:
reel
road
root
curd
I would like to store this data in a manner that reflects the following structure:
Start -> r -> e -> reel
           -> o -> a -> road
                   o -> root
         c -> curd
It is apparent to me that I need to implement a tree. From this tree, I must be able to easily obtain statistics such as the height of a node, the number of descendants of a node, searching for a node and so on. Adding a node should 'automatically' add it to the correct position in the tree, since this position is unique.
I would also like to be able to visualize the data in the form of an actual graphical tree. Since the tree is going to be huge, I would need zoom / pan controls on the visualization. And of course, a pretty visualization is always better than an ugly one.
Does anyone know of a Python package which would allow me to achieve all this simply? Writing the code myself will take quite a while. Do you think http://packages.python.org/ete2/ would be appropriate for this task?
I'm on Python 2.x, btw.
I discovered that NLTK has a trie class - nltk.containers.trie. This is convenient for me, since I already use NLTK. Does anyone know how to use this class? I can't find any examples anywhere! For example, how do I add words to the trie?

ETE2 is an environment for tree exploration, in principle made for browsing, building and exploring phylogenetic trees, and I used it a long time ago for these purposes.
But it's possible that, if you set up your data properly, you could get it done.
You just have to place parentheses wherever you need to split your tree and create a branch. See the following example, taken from the ETE docs.
If you replace the "(A,B,(C,D));" string with your words/letters, it should be done.
from ete2 import Tree
unrooted_tree = Tree( "(A,B,(C,D));" )
print unrooted_tree
output:
     /-A
    |
----|--B
    |
    |    /-C
     \---|
          \-D
...and this package will let you do most of the operations you want, giving you the chance to select every branch individually and operate on it in an easy way.
I recommend you take a look at the tutorial anyway; it's not very difficult :)

I think the following example does pretty much what you want, using the ETE toolkit.
from ete2 import Tree

words = [ "reel", "road", "root", "curd", "curl", "whatever","whenever", "wherever"]

# Create an empty tree
tree = Tree()
tree.name = ""
# Keep the tree structure indexed by prefix
name2node = {}
# Make sure there are no duplicates
words = set(words)
# Populate tree
for wd in words:
    # If no similar words exist, add it to the base of the tree
    target = tree
    # Find relatives in the tree (longest prefix already indexed)
    for pos in xrange(len(wd), -1, -1):
        root = wd[:pos]
        if root in name2node:
            target = name2node[root]
            break
    # Add new nodes as necessary
    fullname = root
    for letter in wd[pos:]:
        fullname += letter
        new_node = target.add_child(name=letter, dist=1.0)
        name2node[fullname] = new_node
        target = new_node

# Print structure
print tree.get_ascii()
# You can also use all the visualization machinery from ETE
# (http://packages.python.org/ete2/tutorial/tutorial_drawing.html)
# tree.show()

# You can find, isolate and operate with a specific node using the index
wh_node = name2node["whe"]
print wh_node.get_ascii()

# You can rebuild the words under a given node
def reconstruct_fullname(node):
    name = []
    while node.up:
        name.append(node.name)
        node = node.up
    name = ''.join(reversed(name))
    return name

for leaf in wh_node.iter_leaves():
    print reconstruct_fullname(leaf)
Output:
                    /n-- /e-- /v-- /e-- /-r
               /e--|
     /w-- /h--|     \r-- /e-- /v-- /e-- /-r
    |         |
    |          \a-- /t-- /e-- /v-- /e-- /-r
    |
    |     /e-- /e-- /-l
----|-r--|
    |    |     /o-- /-t
    |     \o--|
    |          \a-- /-d
    |
    |               /-d
     \c-- /u-- /r--|
                    \-l
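
For comparison, the same letter tree can be sketched without any package as a small trie built on plain dictionaries. This is only a minimal illustration, not the NLTK or ETE API: the names TrieNode, Trie, height and count_descendants are made up for this example. It covers the operations the question asks for (automatic placement on insert, prefix search, node height, number of descendants) but not the visualization.

```python
class TrieNode(object):
    """One letter position in the trie (hypothetical helper class)."""

    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if a stored word ends at this node


class Trie(object):
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        # Walk down letter by letter, creating nodes only where missing,
        # so every word lands in its unique position.
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True

    def find(self, prefix):
        # Return the node reached by spelling out prefix, or None.
        node = self.root
        for letter in prefix:
            node = node.children.get(letter)
            if node is None:
                return None
        return node


def height(node):
    # Number of edges on the longest path from node down to a leaf.
    if not node.children:
        return 0
    return 1 + max(height(child) for child in node.children.values())


def count_descendants(node):
    # Number of nodes strictly below node.
    return sum(1 + count_descendants(child)
               for child in node.children.values())


trie = Trie(["reel", "road", "root", "curd"])
ro_node = trie.find("ro")
```

With the four words from the question, find("ro") returns the node shared by "road" and "root", and find("x") returns None. Unlike the structure sketched in the question, this version stores one node per letter all the way down to the end of each word.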

Related

Improving BFS performance with some kind of memoization

I'm trying to build an algorithm that finds the distances from one vertex to all the others in a graph.
Let's say, as a really simple example, that my network looks like this:
network = [[0,1,2],[2,3,4],[4,5,6],[6,7]]
I wrote BFS code that is supposed to find the lengths of the paths from a specified source to the graph's other vertices:
from itertools import chain
import numpy as np

n = 8
graph = {}
for i in range(0, n):
    graph[i] = []
# `network` is the list of groups shown above
# (the original snippet iterated over an undefined `communities2`)
for communes in network:
    for vertice in communes:
        work = communes.copy()
        work.remove(vertice)
        graph[vertice].append(work)
for k, v in graph.items():
    graph[k] = list(chain(*v))

def bsf3(graph, s):
    matrix = np.zeros([n,n])
    dist = {}
    visited = []
    queue = [s]
    dist[s] = 0
    visited.append(s)
    matrix[s][s] = 0
    while queue:
        v = queue.pop(0)
        for neighbour in graph[v]:
            if neighbour in visited:
                pass
            else:
                matrix[s][neighbour] = matrix[s][v] + 1
                queue.append(neighbour)
                visited.append(neighbour)
    return matrix

bsf3(graph,2)
First I create the graph (a dictionary) and then use the function to find distances.
My concern is that this approach doesn't scale to larger networks (say, with 1000 people). I'm thinking about using some kind of memoization (which is actually why I used a matrix instead of a list). The idea is that when the algorithm calculates the path from, say, 0 to 3 (which it already does), it should also keep track of other routes, so that matrix[1][3] = 1, etc.
Then, when I call the function as bsf3(graph, 1), it would not calculate everything from scratch but could reuse some values from the matrix.
Thanks in advance!

I know this does not fully answer your question, but here is another approach you can try.
In networking, each node has a routing table. For every destination in the network, the table simply stores which neighbouring node you have to go to next. Example of a routing table for node D:
A -> B
B -> B
C -> E
D -> D
E -> E
You need to run BFS from each node to build all the routing tables, which takes O(|V|*(|V|+|E|)) time. The space complexity is quadratic, but you have to check all possible paths anyway.
Once you have built all this information, you can simply start from a node, look up your destination node in its table, and find the next node to go to. This gives a better time complexity per lookup (if you use the right data structure for the table).
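
The routing-table idea above could be sketched roughly as follows. This is a minimal illustration that assumes an undirected, unweighted graph built from the question's network list; the function names build_graph, bfs_tree, build_tables and route are invented for this example. One BFS is run per destination, and each node records both its distance to that destination and the neighbour to step to next.

```python
from collections import deque


def build_graph(network):
    # Everyone in the same group is a neighbour of everyone else in it.
    graph = {}
    for group in network:
        for vertex in group:
            graph.setdefault(vertex, set()).update(
                other for other in group if other != vertex)
    return graph


def bfs_tree(graph, root):
    # Plain BFS from root; a deque gives O(1) pops from the left and a
    # dict lookup replaces the linear "in visited" scan on a list.
    dist = {root: 0}
    toward_root = {root: root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for neighbour in sorted(graph[v]):
            if neighbour not in dist:
                dist[neighbour] = dist[v] + 1
                toward_root[neighbour] = v  # next hop on the way to root
                queue.append(neighbour)
    return dist, toward_root


def build_tables(graph):
    # dist[a][b] is the hop count from a to b;
    # next_hop[a][b] is the neighbour of a to visit next on the way to b.
    dist, next_hop = {}, {}
    for target in graph:
        d, hop = bfs_tree(graph, target)
        for node in d:
            dist.setdefault(node, {})[target] = d[node]
            next_hop.setdefault(node, {})[target] = hop[node]
    return dist, next_hop


def route(next_hop, source, target):
    # Follow the table one hop at a time.
    # Raises KeyError if target is unreachable from source.
    path = [source]
    while path[-1] != target:
        path.append(next_hop[path[-1]][target])
    return path


network = [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7]]
graph = build_graph(network)
dist, next_hop = build_tables(graph)
path_0_to_7 = route(next_hop, 0, 7)
```

After the tables are built once, any distance is a dictionary lookup and any path is recovered in time proportional to its length, which is the trade of quadratic space for fast queries described above.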

Random ultrametric trees

I've implemented a program in Python that generates random binary trees. Now I'd like to assign a distance to each internal node of the tree so that it becomes ultrametric, meaning the distance between the root and every leaf must be the same. If a node is a leaf, its distance is zero. Here is the node class:
class Node() :
    def __init__(self, G = None , D = None) :
        self.id = ""
        self.distG = 0
        self.distD = 0
        self.G = G
        self.D = D
        self.parent = None
My idea is to set the distance h at the beginning and to decrease it each time an internal node is found, but it's only working on the left side.
def lgBrancheRand(self, h) :
    self.distD = h
    self.distG = h
    hrandomD = round(np.random.uniform(0,h),3)
    hrandomG = round(np.random.uniform(0,h),3)
    if self.D.D is not None :
        self.D.distD = hrandomD
        self.distD = round(h-hrandomD,3)
        lgBrancheRand(self.D,hrandomD)
    if self.G.G is not None :
        self.G.distG = hrandomG
        self.distG = round(h-hrandomG,3)
        lgBrancheRand(self.G,hrandomG)

In summary, you would create random matrices and apply UPGMA to each.
A more complete answer follows.
Simply use the UPGMA algorithm. This is a clustering algorithm used to resolve a pairwise distance matrix.
You take the genetic distance between a pair of "taxa" (technically OTUs) and divide it by two. You assign the closest members of the pairwise matrix as the first 'node'. Then you reformat the matrix so that this pair is combined into a single group (the two members are 'removed') and find the next 'nearest neighbor', ad infinitum. I suspect R's 'ape' package will have an ultrametric algorithm, which would save you from programming. I see that you are using Python, so BioPython MIGHT have this (big MIGHT). Personally, I would pipe this through a precompiled C program and collect the results, via PAUP or that sort of thing. I'm not going to write code, because I prefer Perl and get flamed if any Perl code appears in a Python question (the Empire has been established).
Anyway, you will find that this algorithm produces a perfectly ultrametric tree. Purists do not like ultrametric trees derived through this sort of algorithm. However, in your calculation it could be useful, because you could find which phylogeny from real data is most "clock-like" against the null distribution you are producing. In this context it would be cool.
You might prefer to raise the question on the Bioinformatics Stack Exchange.
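
As a rough illustration of the procedure described above, here is a small pure-Python UPGMA sketch applied to a random symmetric distance matrix. It is not taken from any library: the function names, the nested-tuple tree format (left, right, height) and the uniform(1, 10) distances are all assumptions made for this example. Each merge is placed at half the distance between the two clusters being joined, and a branch length is the parent's height minus the child's height.

```python
import random


def upgma(dist, labels):
    """Build a tree of nested tuples (left, right, height) by UPGMA."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average of the old distances (the "UPGMA" rule).
            merged[(c, next_id)] = (
                (size_a * d_ac + size_b * d_bc) / float(size_a + size_b))
        d = {pair: v for pair, v in d.items()
             if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2.0), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


def height(tree):
    # Leaves are plain labels and sit at height 0.
    return tree[2] if isinstance(tree, tuple) else 0.0


def leaf_depths(tree, depth=0.0):
    # Root-to-leaf distance for every leaf, summing branch lengths.
    if not isinstance(tree, tuple):
        return {tree: depth}
    left, right, h = tree
    depths = leaf_depths(left, depth + h - height(left))
    depths.update(leaf_depths(right, depth + h - height(right)))
    return depths


def random_matrix(n, rng):
    # Symmetric matrix with a zero diagonal and random off-diagonal entries.
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.uniform(1.0, 10.0)
    return m


rng = random.Random(0)
labels = ["A", "B", "C", "D", "E"]
tree = upgma(random_matrix(len(labels), rng), labels)
depths = leaf_depths(tree)
```

Every value in depths should be the same number (up to floating-point rounding), namely the height of the root, which is the ultrametric property the question asks for.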

Navigating XML based on the last node you processed in Python

In Python, I am trying to navigate XML nodes, creating links and traversing them based on the last node processed. I have a set of source and target nodes, and I have to traverse from a source to its target, then from that target to the matching source, and so on. The same node may appear multiple times.
The XML structure is shown below:
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
In the XML above, I have to start from the 1st sourceNode (FCMComposite_1_1) and go to the 1st targetNode (FCMComposite_1_2). Then I have to navigate from this targetNode (the last node) to the sourceNode that has the same value, in this case the 4th row, then from there to its target node, and so on.
What is the best way to achieve this? Is a graph a good option for this? I am trying this in Python. Can someone please help me?

You can use a dictionary to store the connections. What you posted isn't actually XML, so I just use re to parse it, but you can do the parsing differently.
import re

data = '''
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
'''

beginning = None
connections = {}
for line in data.split('\n'):
    m = re.match(r'targetNode="([^"]+)" sourceNode="([^"]+)"', line)
    if m:
        target = m.group(1)
        source = m.group(2)
        if beginning is None:
            beginning = source
        connections[source] = target

print('Starting at', beginning)
current = beginning
while current in connections:
    print(current, '->', connections[current])
    current = connections[current]
Output:
Starting at FCMComposite_1_1
FCMComposite_1_1 -> FCMComposite_1_2
FCMComposite_1_2 -> FCMComposite_1_8
FCMComposite_1_8 -> FCMComposite_1_3
FCMComposite_1_3 -> FCMComposite_1_5
FCMComposite_1_5 -> FCMComposite_1_6
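I'm not sure what's supposed to happen with the multiple targets for FCMComposite_1_5. With a plain dictionary, the later line overwrites the earlier one, which is why only FCMComposite_1_6 shows up in the output.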
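
If every target of a source should be followed, one possible variation is to store a list of targets per source and walk the result as a graph. The sketch below is only an assumption about the intended behaviour: the function names parse_connections and walk are made up here, and a set of already visited nodes is added so that a cycle in the data cannot cause an endless loop.

```python
import re
from collections import defaultdict

DATA = '''
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_1"
targetNode="FCMComposite_1_4" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_6" sourceNode="FCMComposite_1_5"
targetNode="FCMComposite_1_8" sourceNode="FCMComposite_1_2"
targetNode="FCMComposite_1_2" sourceNode="FCMComposite_1_9"
targetNode="FCMComposite_1_3" sourceNode="FCMComposite_1_8"
targetNode="FCMComposite_1_5" sourceNode="FCMComposite_1_3"
'''

PATTERN = re.compile(r'targetNode="([^"]+)" sourceNode="([^"]+)"')


def parse_connections(text):
    # Map each source to ALL of its targets, in file order.
    connections = defaultdict(list)
    first_source = None
    for match in PATTERN.finditer(text):
        target, source = match.groups()
        if first_source is None:
            first_source = source
        connections[source].append(target)
    return first_source, connections


def walk(connections, start):
    # Depth-first walk that lists every edge reachable from start.
    edges = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already expanded; prevents endless loops on cycles
        seen.add(node)
        targets = connections.get(node, [])
        for target in targets:
            edges.append((node, target))
        # Reverse so the first-listed target is expanded first.
        stack.extend(reversed(targets))
    return edges


beginning, connections = parse_connections(DATA)
edges = walk(connections, beginning)
for source, target in edges:
    print(source + ' -> ' + target)
```

For the sample data this lists both edges out of FCMComposite_1_5 instead of only the last one. For real XML input, the regular expression would be replaced by a proper parser such as xml.etree.ElementTree.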

Merge two lists of objects containing lists

I have a directory tree containing html files called slides. Something like:
slides_root
|
|_slide-1
|   |_slide-1.html
|   |_slide-2.html
|
|_slide-2
|   |
|   |_slide-1
|   |   |_slide-1.html
|   |   |_slide-2.html
|   |   |_slide-3.html
|   |
|   |_slide-2
|       |_slide-1.html
...and so on. They could go even deeper. Now imagine I have to replace some slides in this structure by merging it with another tree that is a subset of it.
For example, say that I want to replace slide-1.html and slide-3.html inside "slides_root/slide-2/slide-1" by merging "slides_root" with:
slide_to_change
|
|_slide-2
    |
    |_slide-1
        |_slide-1.html
        |_slide-3.html
I would merge "slide_to_change" into "slides_root". The structure is the same, so everything goes fine. But I have to do it on a Python object representation of this scheme.
The two trees are represented by two instances, slide1 and slide2, of the same "Slide" class, which is structured as follows:
class Slide(object):
    def __init__(self, path):
        self.path = path
        self.slides = [Slide(path)]  # schematic: a list of child Slide objects
Both slide1 and slide2 contain a path and a list of other Slide objects, each with its own path and list of Slide objects, and so on.
The rule is that if the relative path is the same, I replace the slide object in slide1 with the one in slide2.
How can I achieve this result? It is really difficult and I can see no way out. Ideally something like:
for slide_root in slide1.slides:
    for slide_dest in slide2.slides:
        if slide_root.path == slide_dest.path:
            slide_root = slide_dest
            # now restart the loop at a deeper level
            # repeat
Thanks to everyone for any answers.

That doesn't sound so complicated.
Just use a recursive function that walks the to-be-inserted tree while keeping hold of the corresponding place in the old tree.
If the parts match:
    If the parts are both leaves (html thingies):
        Insert (overwrite) the value.
    If the parts are both nodes (slides):
        Call yourself with the subslides (here's the recursion).
I know this is just a hint, a sketch of how to do it, but maybe it is enough to get you started. In Python it could look something like this (also not completely fleshed out):
def merge_slide(slide, old_slide):
    for sub_slide in slide.slides:
        # find_sub_slide_position_by_path is not written out here; it should
        # return the index of the matching child of old_slide, or -1 if none
        sub_slide_position_in_old_slide = find_sub_slide_position_by_path(old_slide, sub_slide.path)
        if sub_slide_position_in_old_slide >= 0:  # we found a match!
            sub_slide_in_old_slide = old_slide.slides[sub_slide_position_in_old_slide]
            if sub_slide.slides:  # this is a node!
                merge_slide(sub_slide, sub_slide_in_old_slide)  # here we recurse
            else:  # this is a leaf! so we replace it:
                old_slide.slides[sub_slide_position_in_old_slide] = sub_slide
        else:  # nothing like this in old_slide
            pass  # ignore (you might want to consider this case!)
Maybe that gives you an idea on how I would approach this.
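
To make the sketch concrete, here is one way it might be filled in and exercised on the example from the question. Several details are assumptions rather than part of the original: the Slide constructor takes its children as an optional second argument, each path is stored relative to the tree root, and the helper find_sub_slide_position_by_path does a simple linear search.

```python
class Slide(object):
    def __init__(self, path, slides=None):
        self.path = path  # path relative to the tree root (assumption)
        self.slides = slides if slides is not None else []


def find_sub_slide_position_by_path(old_slide, path):
    # Index of the child of old_slide with the given path, or -1.
    for index, candidate in enumerate(old_slide.slides):
        if candidate.path == path:
            return index
    return -1


def merge_slide(slide, old_slide):
    for sub_slide in slide.slides:
        position = find_sub_slide_position_by_path(old_slide, sub_slide.path)
        if position < 0:
            continue  # no counterpart in the old tree: ignored here
        if sub_slide.slides:
            # A directory-like node: descend into both trees in parallel.
            merge_slide(sub_slide, old_slide.slides[position])
        else:
            # A leaf: the new object replaces the old one in place.
            old_slide.slides[position] = sub_slide


old_1 = Slide("slide-2/slide-1/slide-1.html")
old_2 = Slide("slide-2/slide-1/slide-2.html")
old_3 = Slide("slide-2/slide-1/slide-3.html")
slides_root = Slide("", [
    Slide("slide-1", [Slide("slide-1/slide-1.html"),
                      Slide("slide-1/slide-2.html")]),
    Slide("slide-2", [
        Slide("slide-2/slide-1", [old_1, old_2, old_3]),
        Slide("slide-2/slide-2", [Slide("slide-2/slide-2/slide-1.html")]),
    ]),
])

new_1 = Slide("slide-2/slide-1/slide-1.html")
new_3 = Slide("slide-2/slide-1/slide-3.html")
slide_to_change = Slide("", [
    Slide("slide-2", [Slide("slide-2/slide-1", [new_1, new_3])]),
])

merge_slide(slide_to_change, slides_root)
merged_dir = slides_root.slides[1].slides[0]
```

After the call, merged_dir.slides holds new_1, the untouched old_2, and new_3. One caveat of using "has no children" as the leaf test is that an empty directory in the incoming tree would be treated as a leaf and would replace its counterpart wholesale.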

Discover All Paths in Single Source, Multi-Terminal (possibly cyclic) Directed Graph

I have a graph G = (V,E), where
V is a subset of {0, 1, 2, 3, …}
E is a subset of VxV
There are no unconnected components in G
The graph may contain cycles
There is a known node v in V, which is the source; i.e. there is no u in V such that (u,v) is an edge
There is at least one sink/terminal node v in V; i.e. there is no u in V such that (v,u) is an edge. The identities of the terminal nodes are not known - they must be discovered through traversal
What I need to do is to compute a set of paths P such that every possible path from the source node to any terminal node is in P. Now, if the graph contains cycles, it is possible that by this definition, P becomes an infinite set. This is not what I need. Rather, what I need is for P to contain a path that doesn't explore the loop and at least one path that does explore the loop.
I say "at least one path that does explore the loop", as the loop may contain branches internally, in which case all of those branches will need to be explored as well. Thus, if the loop contains two internal branches, each with a branching factor of 2, then I need a total of four paths in P that explore the loop.
For example, an algorithm run on the following graph:
         +-------+
         |       |
         v       |
1->2->3->4->5->6 |
         |  |  | |
         v  |  v |
         9  +->7-+
               |
               v
               8
which can be represented as:
1:{2}
2:{3}
3:{4}
4:{5,9}
5:{6,7}
6:{7}
7:{4,8}
8:{}
9:{}
Should produce the set of paths:
1,2,3,4,9
1,2,3,4,5,6,7,8
1,2,3,4,5,6,7,4,9
1,2,3,4,5,7,8
1,2,3,4,5,7,4,9
1,2,3,4,5,7,4,5,6,7,8
1,2,3,4,5,7,4,5,7,8
Thus far, I have the following algorithm (in python) that works in some simple cases:
def extractPaths(G, s=None, explored=None, path=None):
    _V,E = G
    if s is None: s = 0
    if explored is None: explored = set()
    if path is None: path = [s]
    explored.add(s)
    if not len(set(E[s]) - explored):
        print path
    for v in set(E[s]) - explored:
        if len(E[v]) > 1:
            path.append(v)
            for vv in set(E[v]) - explored:
                extractPaths(G, vv, explored-set(n for n in path if len(E[n])>1), path+[vv])
        else:
            extractPaths(G, v, explored, path+[v])
but it fails horribly in the more complex cases.
I'd appreciate any help as this is a tool to validate an algorithm that I have developed for my Master's thesis.
Thank you in advance

I've thought about this for a couple of hours and have come up with this algorithm. It doesn't quite give the result you're asking for, but it's similar (and might be equivalent).
Observation: If we try to go to a node that has been seen before, the stretch of the path from the most recent visit of that node up to the current node can be considered a loop. If we have already seen that loop, we cannot go to that node.
def extractPaths(current_node,path,loops_seen):
    path.append(current_node)
    # if node has outgoing edges
    if nodes[current_node] is not None:
        for thatnode in nodes[current_node]:
            valid=True
            # if the node we are going to has been
            # visited before, we are completing
            # a loop.
            if thatnode-1 in path:
                i=len(path)-1
                # find the last time we visited
                # that node
                while path[i]!=thatnode-1:
                    i-=1
                # from the last time to this time is
                # a single loop.
                new_loop=path[i:len(path)]
                # if we haven't seen this loop, go to
                # the node and note that we have seen
                # this loop. else don't go to the node.
                if new_loop in loops_seen:
                    valid=False
                else:
                    loops_seen.append(new_loop)
            if valid:
                extractPaths(thatnode-1,path,loops_seen)
    # this is the end of the path
    else:
        newpath=list()
        # increment all the values for printing
        for i in path:
            newpath.append(i+1)
        found_paths.append(newpath)
    # backtrack
    path.pop()

# graph defined by lists of outgoing edges
# (edge targets are 1-based labels, list indices are 0-based)
nodes=[[2],[3],[4],[5,9],[6,7],[7],[4,8],None,None]
found_paths=list()
extractPaths(0,list(),list())
for i in found_paths:
    print(i)
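
For cross-checking, a much blunter bounding rule can also be sketched: allow each node to appear at most max_visits times on a path. This is not the loop-tracking rule above and it does not reproduce the question's list exactly; the function name bounded_paths and the max_visits parameter are assumptions made for this example. With max_visits=2 on the sample graph it yields nine paths, which include the seven listed in the question plus two more that go around the loop via node 6 on the first pass.

```python
def bounded_paths(graph, source, max_visits=2):
    """All source-to-terminal paths using each node at most max_visits times."""
    paths = []
    counts = {}

    def visit(node, path):
        counts[node] = counts.get(node, 0) + 1
        path.append(node)
        successors = graph.get(node, ())
        if not successors:
            # Terminal node: record a copy of the current path.
            paths.append(list(path))
        else:
            for nxt in sorted(successors):
                if counts.get(nxt, 0) < max_visits:
                    visit(nxt, path)
        # Backtrack so sibling branches see the correct state.
        path.pop()
        counts[node] -= 1

    visit(source, [])
    return paths


graph = {
    1: {2}, 2: {3}, 3: {4}, 4: {5, 9}, 5: {6, 7},
    6: {7}, 7: {4, 8}, 8: set(), 9: set(),
}
paths = bounded_paths(graph, 1)
for p in paths:
    print(p)
```

Because the bound is per node rather than per loop, the output is a superset that may need filtering, but it is finite on any finite graph and easy to reason about when validating another algorithm.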
