BeautifulSoup lowest common ancestor - python

Does the BeautifulSoup library for Python have any function that can take a list of nodes and return the lowest common ancestor?
If not, has any of you ever implemented such a function and care to share it?

I think this is what you want, with link1 being one element and link2 being another:
link_1_parents = list(link1.parents)[::-1]
link_2_parents = list(link2.parents)[::-1]
common_parent = [x for x, y in zip(link_1_parents, link_2_parents) if x is y][-1]
print(common_parent)
print(common_parent.name)
It basically walks both elements' parents from the root down and returns the last common one.
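The root-down walk can be extended from two elements to a list of nodes. The following is a minimal sketch, not BeautifulSoup code: `Node` is a hypothetical stand-in that only models `parent` and a `parents` generator (nearest ancestor first), and the tag names are made up for illustration.

```python
class Node:
    """Hypothetical stand-in for a parsed tag; only `parent` and `parents` are modelled."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    @property
    def parents(self):
        # Yield ancestors from the nearest one up to the root.
        node = self.parent
        while node is not None:
            yield node
            node = node.parent


def lowest_common_ancestor(nodes):
    # Build one root-first ancestor chain per node.
    chains = [list(node.parents)[::-1] for node in nodes]
    common = None
    # zip stops at the shortest chain, so nodes at different depths are fine.
    for level in zip(*chains):
        if all(candidate is level[0] for candidate in level):
            common = level[0]
        else:
            break
    return common


root = Node("html")
body = Node("body", root)
div = Node("div", body)
p = Node("p", div)
a = Node("a", p)
span = Node("span", div)

print(lowest_common_ancestor([a, span]).name)  # div
```

Note that a node is not treated as its own ancestor here, so for `[a, p]` the sketch returns `div` rather than `p`.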

The accepted answer does not work if the distance from a tag in the input list to the lowest common ancestor is not exactly the same for every node in the input.
It also uses every ancestor of each node, which is unnecessary and could be very expensive in some cases.
import collections

# Call as lowest_common_ancestor(None, tag_a, tag_b): the first positional
# argument is the counter dict, which starts out as None.
def lowest_common_ancestor(parents=None, *args):
    if parents is None:
        parents = collections.defaultdict(int)
    for tag in args:
        if not tag:
            continue
        parents[tag] += 1
        if parents[tag] == len(args):
            return tag
    return lowest_common_ancestor(parents, *[tag.parent if tag else None for tag in args])

Arthur's answer is not correct in some cases, so I modified it and am posting my version. I have tested the code for the LCA of two input nodes.
import collections

def lowest_common_ancestor(parents=None, *args):
    if parents is None:
        parents = collections.defaultdict(int)
    for tag in args:
        parents[tag] += 1
        if parents[tag] == NUM_OF_NODES:
            return tag
    next_arg_list = [tag.parent for tag in args if tag.parent is not None]
    return lowest_common_ancestor(parents, *next_arg_list)
Call the function like this:
list_of_tag = [tag_a, tag_b]
NUM_OF_NODES = len(list_of_tag)
lca = lowest_common_ancestor(None, *list_of_tag)
print(lca)

You could also compute the XPaths of all elements and then use os.path.commonprefix. I am not familiar with BeautifulSoup, but in lxml, I have done this:
import os

import lxml.etree
import lxml.html

def lowest_common_ancestor(nodes: list[lxml.html.HtmlElement]):
    if len(set(nodes)) == 1:  # all nodes are the same
        return nodes[0]
    tree: lxml.etree._ElementTree = nodes[0].getroottree()
    xpaths = [tree.getpath(node) for node in nodes]
    lca_xpath = os.path.commonprefix(xpaths)
    lca_xpath = lca_xpath.rsplit('/', 1)[0]  # strip partially matching tag names
    return tree.xpath(lca_xpath)[0]
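The reason for the rsplit step is that os.path.commonprefix compares character by character, not path step by path step. A small standard-library illustration, using made-up XPath strings of the kind getpath() produces:

```python
import os

# Hypothetical XPath strings; real ones would come from tree.getpath(node).
xpaths = ["/html/body/div[1]/p", "/html/body/div[2]/a"]

prefix = os.path.commonprefix(xpaths)
print(prefix)  # /html/body/div[

# Cut back to the last complete path step.
lca_xpath = prefix.rsplit("/", 1)[0]
print(lca_xpath)  # /html/body
```

One consequence worth knowing: if one path is a whole-step prefix of another (say "/html/body/div" and "/html/body/div/p"), the rsplit still drops the last step, so the result is the parent of the ancestor node rather than the ancestor node itself.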

Related

Returning shortest path using breadth-first search

I have a graph:
graph = {}
graph['you'] = ['alice', 'bob', 'claire']
graph['bob'] = ['anuj', 'peggy']
graph['alice'] = ['peggy']
graph['claire'] = ['thom', 'jonny']
graph['anuj'] = []
graph['peggy'] = []
graph['thom'] = []
graph['jonny'] = []
A function that determines my end node:
def person_is_seller(name):
    return name[-1] == 'm'  # thom is the mango seller
And a breadth-first search algorithm:
from collections import deque

def search(name):
    search_queue = deque()
    search_queue += graph[name]
    searched = set()
    while search_queue:
        # print(search_queue)
        person = search_queue.popleft()
        if person not in searched:
            # print(f'Searching {person}')
            if person_is_seller(person):
                return f'{person} is a mango seller'
            else:
                search_queue += graph[person]
                searched.add(person)
    return f'None is a mango seller'

search('you')
# 'thom is a mango seller'
I am wondering if this algorithm can return the shortest path from you to thom?
[you, claire, thom] # as this is the shortest path to thom which is my end node
I checked this answer, which states that BFS does not let me find the shortest path, but the second answer states that it is possible, I assume without using Dijkstra's algorithm. So I am a bit confused: can I somehow keep track of the previous node and, once the final node is reached, return the shortest path as in the last code snippet or in any other format?
You can make searched a dictionary instead of a set, and then let the value for each key be a backreference to the node where you came from.
When you find the target, you can recover the path by walking back over those backreferences and then return the reverse of that.
Adapted code:
def search(name):
    search_queue = deque()
    search_queue.append((name, None))
    searched = {}  # dict instead of set
    while search_queue:
        # print(search_queue)
        person, prev = search_queue.popleft()
        if person not in searched:
            searched[person] = prev
            # print(f'Searching {person}')
            if person_is_seller(person):
                result = []
                while person is not None:
                    result.append(person)
                    person = searched[person]
                return result[::-1]
            else:
                search_queue += [(neighbor, person) for neighbor in graph[person]]
    return []
Now the function returns a list. It will have the start and end node when a path is found, so in this case:
['you', 'claire', 'thom']
If no path is found, the result is an empty list.
You can use BFS to find the shortest path provided that every edge has the same length. Dijkstra's algorithm is necessary when different edges have different weights.
Dijkstra's algorithm in its pure form only computes the length of the shortest path. Since you probably want the shortest path itself, you'll want to associate each visited node with the node on the other end of the edge, which is generally done using an associative array (a "dictionary", in Python).
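For the weighted case, the predecessor dictionary works the same way as the back-references in the BFS version. Below is a minimal sketch under stated assumptions: the function name `shortest_path`, the `{node: {neighbor: weight}}` graph layout, and the sample weights are all made up for illustration, and weights are assumed to be non-negative.

```python
import heapq


def shortest_path(graph, start, goal):
    # graph maps node -> {neighbor: edge_weight}; weights must be non-negative.
    dist = {start: 0}
    prev = {start: None}  # back-references used to rebuild the path
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry, a shorter route was already found
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    return []


# Hypothetical weights: the direct route via claire is expensive.
weighted = {
    "you": {"alice": 1, "bob": 1, "claire": 5},
    "alice": {"peggy": 1},
    "bob": {"anuj": 1, "peggy": 1},
    "claire": {"thom": 1},
    "peggy": {"thom": 1},
    "anuj": {},
    "thom": {},
}
print(shortest_path(weighted, "you", "thom"))  # ['you', 'alice', 'peggy', 'thom']
```

With these weights the three-edge route costs 3 while the two-edge route via claire costs 6, which is the situation where plain BFS would pick the wrong path.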

Python BST, change data of leaf nodes

I'm trying to create a function that changes the data in the leaf nodes (the ones with no child nodes) of a binary search tree to "Leif". Currently I have this code for my BST:
def add(tree, value, name):
    if tree is None:
        return {'data': value, 'data1': name, 'left': None, 'right': None}
    elif value < tree['data']:
        tree['left'] = add(tree['left'], value, name)
        return tree
    elif value > tree['data']:
        tree['right'] = add(tree['right'], value, name)
        return tree
    else:  # value == tree['data']
        return tree  # ignore duplicate
Essentially, I want to make a function that changes the name in data1 to "Leif" when a node has no children. What is the best way for me to achieve this? Thanks in advance.
Split the problem into smaller problems which can be solved with simple functions.
# Python 3: the built-in filter() is lazy and replaces itertools.ifilter.
def is_leaf(tree):
    return tree['left'] is None and tree['right'] is None

def traverse(tree):
    # Yield every node of the tree in pre-order.
    if tree is not None:
        yield tree
        for side in ['left', 'right']:
            for child in traverse(tree[side]):
                yield child

def set_data1_in_leafes_to_leif(tree):
    for leaf in filter(is_leaf, traverse(tree)):
        leaf['data1'] = 'Leif'
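An end-to-end run can make the leaf logic easier to check. The sketch below rebuilds a small tree using the question's dict layout and, as an alternative to the generator pipeline, uses a direct recursion. The name `rename_leaves` and the sample values are hypothetical.

```python
def add(tree, value, name):
    # Same dict-based node layout as in the question.
    if tree is None:
        return {'data': value, 'data1': name, 'left': None, 'right': None}
    if value < tree['data']:
        tree['left'] = add(tree['left'], value, name)
    elif value > tree['data']:
        tree['right'] = add(tree['right'], value, name)
    return tree  # duplicates are ignored


def rename_leaves(tree, new_name='Leif'):
    # Visit every node and rename the ones without children.
    if tree is None:
        return
    if tree['left'] is None and tree['right'] is None:
        tree['data1'] = new_name
        return
    rename_leaves(tree['left'], new_name)
    rename_leaves(tree['right'], new_name)


tree = None
for value, name in [(5, 'Eve'), (3, 'Bob'), (8, 'Zoe'), (1, 'Al')]:
    tree = add(tree, value, name)

rename_leaves(tree)
print(tree['data1'])                  # Eve  (has children, unchanged)
print(tree['left']['data1'])          # Bob  (has a left child, unchanged)
print(tree['left']['left']['data1'])  # Leif
print(tree['right']['data1'])         # Leif
```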

Finding all common, non-overlapping substrings

Given two strings, I would like to identify all common sub-strings from longest to shortest.
I want to remove any "sub-"sub-strings. As an example, any substrings of '1234' would not be included in the match between '12345' and '51234'.
string1 = '51234'
string2 = '12345'
result = ['1234', '5']
I was thinking of finding the longest common substring, then recursively finding the longest substring(s) to the left/right. However, I do not want to remove a common substring after found. For example, the result below shares a 6 in the middle:
string1 = '12345623456'
string2 = '623456'
result = ['623456', '23456']
Lastly, I need to check one string against a fixed list of thousands of strings. I am unsure if there is a smart step I could take in hashing out all the substrings in these strings.
Previous Answers:
In this thread, a dynamic programming solution is found that takes O(nm) time, where n and m are the lengths of the strings. I am interested in a more efficient approach, which would use suffix trees.
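For reference, the usual O(nm) dynamic-programming idea for a single longest common substring can be sketched as follows. This is a generic textbook-style sketch, not necessarily the code from the linked thread, and the function name is made up; it returns only one longest match, not the full list asked for here.

```python
def longest_common_substring(a, b):
    # curr[j] holds the length of the longest common suffix of a[:i] and b[:j].
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


print(longest_common_substring('51234', '12345'))         # 1234
print(longest_common_substring('12345623456', '623456'))  # 623456
```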
Background:
I am composing song melodies from snippets of melodies. Sometimes, a combination manages to generate a melody matching too many notes in a row of an existing one.
I can use a string similarity measure, such as Edit Distance, but believe that tunes with very small differences to melodies are unique and interesting. Unfortunately, these tunes would have similar levels of similarity to songs that copy many notes of a melody in a row.
Let's start with the Tree
from collections import defaultdict

def identity(x):
    return x

class TreeReprMixin(object):
    def __repr__(self):
        base = dict(self)
        return repr(base)

class PrefixTree(TreeReprMixin, defaultdict):
    '''
    A hash-based Prefix or Suffix Tree for testing for
    sequence inclusion. This implementation works for any
    slice-able sequence of hashable objects, not just strings.
    '''
    def __init__(self):
        defaultdict.__init__(self, PrefixTree)
        self.labels = set()

    def add(self, sequence, label=None):
        layer = self
        if label is None:
            label = sequence
        if label:
            layer.labels.add(label)
        for i in range(len(sequence)):
            layer = layer[sequence[i]]
            if label:
                layer.labels.add(label)
        return self

    def add_ngram(self, sequence, label=None):
        if label is None:
            label = sequence
        for i in range(1, len(sequence) + 1):
            self.add(sequence[:i], label)

    def __contains__(self, sequence):
        layer = self
        j = 0
        for i in sequence:
            j += 1
            if not dict.__contains__(layer, i):
                break
            layer = layer[i]
        return len(sequence) == j

    def depth_in(self, sequence):
        layer = self
        count = 0
        for i in sequence:
            if not dict.__contains__(layer, i):
                print("Breaking")
                break
            else:
                layer = layer[i]
                count += 1
        return count

    def subsequences_of(self, sequence):
        layer = self
        for i in sequence:
            layer = layer[i]
        return layer.labels

    def __iter__(self):
        return iter(self.labels)

class SuffixTree(PrefixTree):
    '''
    A hash-based Prefix or Suffix Tree for testing for
    sequence inclusion. This implementation works for any
    slice-able sequence of hashable objects, not just strings.
    '''
    def __init__(self):
        defaultdict.__init__(self, SuffixTree)
        self.labels = set()

    def add_ngram(self, sequence, label=None):
        if label is None:
            label = sequence
        for i in range(len(sequence)):
            self.add(sequence[i:], label=label)
To populate the tree, you'd use the .add_ngram method.
The next part is a little trickier, since you're looking for a concurrent traversal of strings whilst keeping track of tree coordinates. To pull all this off, we need some functions which operate on the tree and a query string:
def overlapping_substrings(string, tree, solved=None):
    if solved is None:
        solved = PrefixTree()
    i = 1
    last = 0
    matching = True
    solutions = []
    while i < len(string) + 1:
        if string[last:i] in tree:
            if not matching:
                matching = True
            else:
                i += 1
                continue
        else:
            if matching:
                matching = False
                solutions.append(string[last:i - 1])
                last = i - 1
                i -= 1
        i += 1
    if matching:
        solutions.append(string[last:i])
    for solution in solutions:
        if solution in solved:
            continue
        else:
            solved.add_ngram(solution)
            yield solution

def slide_start(string):
    for i in range(len(string)):
        yield string[i:]

def seek_subtree(tree, sequence):
    # Find the node of the search tree which
    # is found by this sequence of items
    node = tree
    for i in sequence:
        if i in node:
            node = node[i]
        else:
            raise KeyError(i)
    return node

def find_all_common_spans(string, tree):
    # We can keep track of solutions to avoid duplicates
    # and incomplete prefixes using a Prefix Tree
    seen = PrefixTree()
    for substring in slide_start(string):
        # Drive generator forward
        list(overlapping_substrings(substring, tree, seen))
    # Some substrings are suffixes of other substrings, which you do not
    # want
    compress = SuffixTree()
    for solution in sorted(seen.labels, key=len, reverse=True):
        # A substring may be a suffix of another substring while also being
        # a repeating pattern. If a solution is a repeating pattern,
        # `not solution in seek_subtree(tree, solution)` will tell us.
        # Otherwise, discard the solution
        if solution in compress and not solution in seek_subtree(tree, solution):
            continue
        else:
            compress.add_ngram(solution)
    return compress.labels

def search(query, corpus):
    tree = SuffixTree()
    if isinstance(corpus, SuffixTree):
        tree = corpus
    else:
        for elem in corpus:
            tree.add_ngram(elem)
    return list(find_all_common_spans(query, tree))
So now to do the thing you wanted, do this:
search("12345", ["51234"])
search("623456", ["12345623456"])
If something is unclear, please let me know, and I'll try to clarify.

Retrieve Graph Lowest Height Node with Filter

Given a tree T, which may or may not be binary, I need to retrieve the lowest node in each branch that matches a criterion.
So, I need to retrieve a list (array) of those red-marked nodes, i.e. the nodes whose label is "NP": node.label() == 'NP'.
I'm actually using the NLTK Tree (nltk.tree.Tree) data structure, but you can post pseudocode only and I can implement it.
Here is the code that I've tried:
import nltk

def traverseTree(tree):
    if not isinstance(tree, nltk.Tree): return []
    h = []
    for subtree in tree:
        if type(subtree) == nltk.tree.Tree:
            t = traverseTree(subtree)
            if subtree.label() == 'NP' and len(t) == 0: h.append(subtree)
    return h
Your conditional appends subtree when there are no better candidates below it, but what if len(t) > 0? In that case you want to keep the nodes found in the sub-calls:
def traverseTree(tree):
    if not isinstance(tree, nltk.Tree): return []
    h = []
    for subtree in tree:
        if type(subtree) == nltk.tree.Tree:
            t = traverseTree(subtree)
            # RIGHT HERE!! need to extend by t or the other found nodes are thrown out
            h.extend(t)
            if subtree.label() == 'NP' and len(t) == 0:
                h.append(subtree)
    return h
Keep in mind that if t were always empty you would append all the valid nodes one level below, but any end-of-branch "NP" nodes will be found and returned in t, so you want to pass them up a level in the recursion.
Edit: the only case where this would fail is if the top-level node is "NP" and there are no "NP" sub-nodes, in which case tree itself should be added to h:
    # after the for loop has finished
    if len(h) == 0 and tree.label() == "NP":
        h.append(tree)
    return h
Edit 2: if you add tree to h, then the check on the subtrees will never actually come true, since both checks test the same node with the same conditions, just at different levels of recursion. So you can simply write the function like this:
def traverseTree(tree):
    if not isinstance(tree, nltk.Tree): return []
    h = []
    for subtree in tree:
        # no need to check here as well as right inside the call
        h.extend(traverseTree(subtree))
    if tree.label() == 'NP' and len(h) == 0:
        h.append(tree)
    return h
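The final version can be tried without NLTK installed. The sketch below uses a hypothetical minimal `Tree` class (a label plus a list of children, iterable like nltk's Tree) and a made-up sentence; `lowest_matching` is the same recursion with the label passed in as a parameter.

```python
class Tree:
    """Hypothetical minimal tree: a label plus children (subtrees or leaf strings)."""

    def __init__(self, label, children=()):
        self._label = label
        self.children = list(children)

    def label(self):
        return self._label

    def __iter__(self):
        return iter(self.children)


def lowest_matching(tree, wanted='NP'):
    if not isinstance(tree, Tree):
        return []  # leaf token, nothing to report
    found = []
    for subtree in tree:
        found.extend(lowest_matching(subtree, wanted))
    # Report this node only if no descendant already matched.
    if not found and tree.label() == wanted:
        found.append(tree)
    return found


sentence = Tree('S', [
    Tree('NP', [
        Tree('NP', ['the', 'dog']),
        Tree('PP', ['in', Tree('NP', ['the', 'yard'])]),
    ]),
    Tree('VP', ['barks']),
])
print([' '.join(t.children) for t in lowest_matching(sentence)])
# ['the dog', 'the yard']
```

The outer NP is skipped because two NP nodes were already found below it.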

Python: Recursive function for browsing all tree nodes

I'm trying to make a recursive function which finds all nodes in a tree. My function, let's call it child(), can find all children of the current node and returns a list of them.
global nodes
nodes = []
def func():
    if len(child(c))==0:
        return []
    else:
        for ch in child(c):
            nodes.append(ch)
            return func(ch)
It seems it does not work for some reason.
Do you have an idea what's wrong, or is the problem somewhere else in my code?
EDIT: The problem is probably
    if len(child(c))==0:
        return []
It should check another child instead of returning []. But I don't know what to put there.
It's because of the return statement inside the for loop: the function only ever descends into the first child in the list.
This should work:
global nodes
nodes = []
def func(c):
    for ch in child(c):
        nodes.append(ch)
        func(ch)
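If a global list is not wanted, the same traversal can return its result instead. This is a small sketch under assumptions: `all_descendants` is a hypothetical name, `child` is passed in as a function returning the direct children of a node, and the tree is a made-up adjacency dict.

```python
def all_descendants(node, child):
    # `child(node)` is assumed to return the list of direct children.
    found = []
    for ch in child(node):
        found.append(ch)
        found.extend(all_descendants(ch, child))
    return found


# Hypothetical tree stored as an adjacency dict.
tree = {'a': ['b', 'c'], 'b': ['d'], 'c': [], 'd': []}
print(all_descendants('a', lambda n: tree[n]))  # ['b', 'd', 'c']
```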
