Build a hierarchy from a relational dataset using PySpark - python

I am new to Python and stuck with building a hierarchy out of a relational dataset.
It would be of immense help if someone has an idea on how to proceed with this.
I have a relational dataset with data like
_currentnode, childnode_
root, child1
child1, leaf2
child1, child3
child1, leaf4
child3, leaf5
child3, leaf6
and so on. I am looking for some Python or PySpark code to
build a hierarchy dataframe like below
_level1, level2, level3, level4_
root, child1, leaf2, null
root, child1, child3, leaf5
root, child1, child3, leaf6
root, child1, leaf4, null
The data is alphanumeric and the dataset is huge (~50 million records).
Also, the root of the hierarchy is known and can be hardwired in the code.
So in the example, above, the root of the hierarchy is 'root'.

Shortest Path with PySpark
The input data can be interpreted as a graph with connections between currentnode and childnode. Then the question becomes: what is the shortest path from the root node to every leaf node? This problem is called single source shortest path (SSSP).
Spark has GraphX to handle parallel computations of graphs. Unfortunately, GraphX does not provide a Python API (more details can be found here). A graph library with Python support is GraphFrames. GraphFrames uses parts of GraphX.
Both GraphX and GraphFrames provide a solution for SSSP. Unfortunately again, both implementations return only the length of the shortest paths, not the paths themselves (GraphX and GraphFrames). But this answer provides an implementation of the algorithm for GraphX and Scala that also returns the paths. All three solutions use Pregel.
Translating the aforementioned answer to GraphFrames/Python:
1. Data preparation
Provide unique IDs for all nodes and rename the columns so that they fit the names described here:
import pyspark.sql.functions as F

df = ...
vertices = df.select("currentnode").withColumnRenamed("currentnode", "node") \
    .union(df.select("childnode")).distinct() \
    .withColumn("id", F.monotonically_increasing_id()).cache()
edges = df.join(vertices, df.currentnode == vertices.node).drop(F.col("node")).withColumnRenamed("id", "src") \
    .join(vertices, df.childnode == vertices.node).drop(F.col("node")).withColumnRenamed("id", "dst").cache()
Nodes Edges
+------+------------+ +-----------+---------+------------+------------+
| node| id| |currentnode|childnode| src| dst|
+------+------------+ +-----------+---------+------------+------------+
| leaf2| 17179869184| | child1| leaf4| 25769803776|249108103168|
|child1| 25769803776| | child1| child3| 25769803776| 68719476736|
|child3| 68719476736| | child1| leaf2| 25769803776| 17179869184|
| leaf6|103079215104| | child3| leaf6| 68719476736|103079215104|
| root|171798691840| | child3| leaf5| 68719476736|214748364800|
| leaf5|214748364800| | root| child1|171798691840| 25769803776|
| leaf4|249108103168| +-----------+---------+------------+------------+
+------+------------+
2. Create the GraphFrame
from graphframes import GraphFrame
graph = GraphFrame(vertices, edges)
3. Create UDFs that will form the individual parts of the Pregel algorithm
The message type:
from pyspark.sql.types import *

vertColSchema = StructType() \
    .add("dist", DoubleType()) \
    .add("node", StringType()) \
    .add("path", ArrayType(StringType(), True))
The vertex program:
def vertexProgram(vd, msg):
    if msg is None or vd[0] < msg[0]:
        return (vd[0], vd[1], vd[2])
    else:
        return (msg[0], vd[1], msg[2])
vertexProgramUdf = F.udf(vertexProgram, vertColSchema)
The outgoing messages:
def sendMsgToDst(src, dst):
    srcDist = src[0]
    dstDist = dst[0]
    if srcDist < (dstDist - 1):
        return (srcDist + 1, src[1], src[2] + [dst[1]])
    else:
        return None
sendMsgToDstUdf = F.udf(sendMsgToDst, vertColSchema)
Message aggregation:
def aggMsgs(agg):
    # keep the message with the smallest distance (element 0 of the struct)
    shortest_dist = sorted(agg, key=lambda tup: tup[0])[0]
    return (shortest_dist[0], shortest_dist[1], shortest_dist[2])
aggMsgsUdf = F.udf(aggMsgs, vertColSchema)
4. Combine the parts
from graphframes.lib import Pregel
result = graph.pregel.withVertexColumn(
        colName="vertCol",
        initialExpr=F.when(
            F.col("node") == F.lit("root"),
            F.struct(F.lit(0.0), F.col("node"), F.array(F.col("node")))
        ).otherwise(
            F.struct(F.lit(float("inf")), F.col("node"), F.array(F.lit("")))
        ).cast(vertColSchema),
        updateAfterAggMsgsExpr=vertexProgramUdf(F.col("vertCol"), Pregel.msg())) \
    .sendMsgToDst(sendMsgToDstUdf(F.col("src.vertCol"), Pregel.dst("vertCol"))) \
    .aggMsgs(aggMsgsUdf(F.collect_list(Pregel.msg()))) \
    .setMaxIter(10) \
    .setCheckpointInterval(2) \
    .run()
result.select("vertCol.path").show(truncate=False)
Remarks:
maxIter should be set to a value at least as large as the longest path. If the value is higher, the result stays unchanged but the computation takes longer. If the value is too small, the longer paths will be missing from the result. The current version of GraphFrames (0.8.0) does not support stopping the loop when no more new messages are sent.
checkpointInterval should be set to a value smaller than maxIter. The actual value depends on the data and the available hardware. When OutOfMemory exceptions occur or the Spark session hangs for some time, the value could be reduced.
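Note that checkpointing also requires a checkpoint directory to be set on the SparkContext before run() is called; assuming spark is your SparkSession, for example (the path is only an illustration):
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")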
The final result is a regular dataframe with the content
+-----------------------------+
|path |
+-----------------------------+
|[root, child1] |
|[root, child1, leaf4] |
|[root, child1, child3] |
|[root] |
|[root, child1, child3, leaf6]|
|[root, child1, child3, leaf5]|
|[root, child1, leaf2] |
+-----------------------------+
If necessary, the non-leaf nodes can be filtered out here, and the paths can be spread over level columns, as sketched below.
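A minimal sketch of that last step, producing the level1..level4 layout from the question. Assumptions: Spark >= 2.4 (for element_at), a fixed maximum depth of 4 as in the example, and df being the original currentnode/childnode dataframe from step 1:
max_depth = 4
# leaf nodes are exactly those that never appear as a parent
leaves = df.select("childnode").subtract(df.select("currentnode"))
# keep only the paths that end in a leaf
paths = result.select(F.col("vertCol.path").alias("path"))
leaf_paths = paths.join(leaves, F.element_at("path", -1) == leaves.childnode).select("path")
# spread the path array over levelN columns; missing levels become null
levels = leaf_paths.select(
    *[F.col("path").getItem(i).alias("level%d" % (i + 1)) for i in range(max_depth)])
levels.show()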

Related

Python dictionaries not copied properly causing repetitions, how to get this right?

I'm writing a function which is supposed to compare lists (significant genes for a test) and list the common elements (genes) for all possible combinations of the selected lists.
These results are to be used for a Venn diagram thingy...
The number of tests and genes is flexible.
The input JSON file looks something like this:
| test            | genes                                             |
|-----------------|---------------------------------------------------|
| p-7trt_1/0con_1 | [ENSMUSG00000000031, ENSMUSG00000000049, ENSMU... |
| p-7trt_2/0con_1 | [ENSMUSG00000000031, ENSMUSG00000000037, ENSMU... |
| p-7trt_1/0con_2 | [ENSMUSG00000000037, ENSMUSG00000000049, ENSMU... |
| p-7trt_2/0con_2 | [ENSMUSG00000000028, ENSMUSG00000000031, ENSMU... |
| p-7trt_1/0con_3 | [ENSMUSG00000000088, ENSMUSG00000000094, ENSMU... |
| p-7trt_2/0con_3 | [ENSMUSG00000000028, ENSMUSG00000000031, ENSMU... |
So the function is as follows:
import pandas as pd

def get_venn_compiled_data(dir_loc):
    """return json of compiled data for the venn thing"""
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    number_of_tests = data_frame.shape[0]
    venn_data = []
    venn_data_point = {"tests": [], "genes": []}  # list of genes which are common across listed tests
    binary = lambda x: bin(x)[2:]  # to directly get the binary number
    for dec_number in range(1, 2 ** number_of_tests):
        # resetting
        venn_data_point["tests"] = []
        venn_data_point["genes"] = []
        # using a binary number to get all the cases
        for index, state in enumerate(binary(dec_number)):
            if state == "0":
                continue
            # putting in all the genes from the first test
            if venn_data_point["tests"] == []:
                venn_data_point["genes"] = data_frame["data"][index].copy()
            # removing the ones which are not common in current genes state and this.tests
            else:
                for gene_index, gene in enumerate(venn_data_point["genes"]):
                    if gene not in data_frame["data"][index]:
                        venn_data_point["genes"].pop(gene_index)
            # putting the test in the tests list
            venn_data_point["tests"].append(data_frame["name"][index])
        venn_data.append(venn_data_point.copy())
    return venn_data
I'm basically abusing the fact that binary numbers generate all possible combinations of 1s and 0s. I correspond each place of the binary number with a test, and for every binary number, if a 0 is present, the list corresponding to that test is not taken for the comparison.
I tried my best to explain, please ask in the comments if I was not clear.
After running the function I am getting an output in which there are random places where test sets are repeated.
This is the test input file, and this is what came out as the output.
Any help is highly appreciated. Thank you.
I realized what error I was making: I assumed that the binary function would magically always generate the string with the number of places I needed, which it doesn't.
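A quick illustration of the problem, with format() as one way to zero-pad:
>>> bin(3)[2:]        # only two places, so with four tests the leading tests are skipped
'11'
>>> format(3, "04b")  # zero-padded to a fixed width of four
'0011'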
After updating the binary function to pad with those zeros, things work fine:
import pandas as pd

def get_venn_compiled_data(dir_loc):
    """return json of compiled data for the venn thing"""
    # internal variables
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    number_of_tests = data_frame.shape[0]
    venn_data = []

    # defining internal function
    def binary(dec_no, length=number_of_tests):
        """Just to convert decimal number to binary of specified length"""
        bin_number = bin(dec_no)[2:]
        if len(bin_number) < length:
            bin_number = "0" * (length - len(bin_number)) + bin_number
        return bin_number

    # list of genes which are common across listed tests
    venn_data_point = {
        "tests": [],
        "genes": [],
    }
    for dec_number in range(1, 2 ** number_of_tests):
        # resetting
        venn_data_point["tests"] = []
        venn_data_point["genes"] = []
        # using a binary number to get all the cases
        for index, state in enumerate(binary(dec_number)):
            if state == "0":
                continue
            # putting in all the genes from the first test
            if venn_data_point["tests"] == []:
                venn_data_point["genes"] = data_frame["data"][index].copy()
            # removing the ones which are not common in current genes state and this.tests
            else:
                for gene_index, gene in enumerate(venn_data_point["genes"]):
                    if gene not in data_frame["data"][index]:
                        venn_data_point["genes"].pop(gene_index)
            # putting the test in the tests list
            venn_data_point["tests"].append(data_frame["name"][index])
        venn_data.append(venn_data_point.copy())
    return venn_data
If anyone else has a more optimized algorithm for this, help is appreciated.
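One such simplification, as a sketch: represent each gene list as a set, so the common elements become a plain set intersection, and let itertools.combinations enumerate the test subsets directly. This follows the name/data columns that the code above reads (an assumption carried over from it), and it also sidesteps a subtle pitfall in the inner loop above: popping from a list while enumerating it skips the element that follows each removed one.
import pandas as pd
from itertools import combinations

def get_venn_compiled_data(dir_loc):
    """Return the common genes for every non-empty combination of tests."""
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    names = list(data_frame["name"])
    gene_sets = [set(genes) for genes in data_frame["data"]]
    venn_data = []
    for size in range(1, len(names) + 1):
        for combo in combinations(range(len(names)), size):
            common = set.intersection(*(gene_sets[i] for i in combo))
            venn_data.append({"tests": [names[i] for i in combo],
                              "genes": sorted(common)})
    return venn_data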

Subtrees of Phylogenetic Trees in BioPython

I have a phylogenetic tree in newick format. I want to pull out a subtree based on the labels of the terminal nodes (so based on a list of species). A copy of the tree I am using can be found here: http://hgdownload.soe.ucsc.edu/goldenPath/dm6/multiz27way/dm6.27way.nh
Currently I have read in the tree using BioPython like so:
from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
#list of species of interest
species_list = ['dm6', 'droSim1', 'droSec1', 'droYak3', 'droEre2', 'droBia2', 'droSuz1', 'droAna3', 'droBip2', 'droEug2', 'droEle2', 'droKik2', 'droTak2', 'droRho2', 'droFic2']
How would I pull out the subtree of only the species in species_list?
Assuming you want the smallest tree that has all the species in your species list, you want the root node of this tree to be the most recent common ancestor (MRCA) of all the species in the list, which is thankfully already implemented in Phylo:
from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
#list of species of interest
species_list = ['dm6', 'droSim1', 'droSec1', 'droYak3', 'droEre2',
                'droBia2', 'droSuz1', 'droAna3', 'droBip2', 'droEug2',
                'droEle2', 'droKik2', 'droTak2', 'droRho2', 'droFic2']
common_ancestor = tree.common_ancestor(species_list)
Phylo.draw_ascii(common_ancestor)
output:
Clade
___ dm6
___|
| | , droSim1
| |_|
__________| | droSec1
| |
| | _____ droYak3
,| |_|
|| |____ droEre2
||
|| _______ droBia2
||_____|
| |_____ droSuz1
|
__| _______ droAna3
| |_________________________________|
| | |________ droBip2
| |
| |___________________ droEug2
|
|_____________ droEle2
,|
||______________________________ droKik2
__||
| ||______________ droTak2
___________________| |
| |____________ droRho2
|
|_______________ droFic2
Instead of using BioPython, use ete3.
from ete3 import Tree
t = Tree('dm6.27way.nh')
t.prune(species_list, preserve_branch_length=True)
t.write()
From the documentation,
From version 2.2, this function includes also the preserve_branch_length flag, which allows to remove nodes from a tree while keeping original distances among remaining nodes.
As #mitoRibo has pointed out in their comment to the OP, the question can be understood differently:
extracting the subtree rooted at the most recent common ancestor of the clades in the species_list (which may include other clades)
OR
extracting the subtree that has as its clades only the species in the species_list
In the example given in the OP these two are indistinguishable, as species_list contains all the clades descendant from their common ancestor, so the answer by #mitoRibo does the trick. However, if we were interested in the second interpretation above, it wouldn't suffice. Indeed, if we limit the list to only 2 species:
from Bio import Phylo
#read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
species_list = ['dm6', 'droFic2']
ca = tree.common_ancestor(species_list)
the result will still be the tree with 15 clades (the same as with the original list in the OP).
I therefore post a solution that extracts the subtree containing only the clades in species_list:
from Bio import Phylo
from copy import deepcopy

# read in Phylogenetic Tree
tree = Phylo.read('dm6.27way.nh', 'newick')
# list of species of interest
species_list = ['dm6', 'droFic2']
subtree = deepcopy(tree)
for t in tree.get_terminals():
    if t.name not in species_list:
        subtree.prune(t.name)
with the result:
______________________________________________________________________ dm6
_|
|___________________________________________ droFic2
This is admittedly a pedestrian solution, as looping over all clades might be a bit slow for big trees. However, to my knowledge, it is the only way it can be done with BioPython. This is why the solution with ete3, suggested by #IanFiddes, is well worth looking at.

Merge two lists of objects containing lists

I have a directory tree containing html files called slides. Something like:
slides_root
|
|_slide-1
| |_slide-1.html
| |_slide-2.html
|
|_slide-2
| |
| |_slide-1
| | |_slide-1.html
| | |_slide-2.html
| | |_slide-3.html
| |
| |_slide-2
| |_slide-1.html
...and so on. They could go even deeper. Now imagine I have to replace some slides in this structure by merging it with another tree which is a subset of this.
WITH AN EXAMPLE: say that I want to replace slide-1.html and slide-3.html inside "slides_root/slide-2/slide-1" merging "slides_root" with:
slide_to_change
|
|_slide-2
|
|_slide-1
|_slide-1.html
|_slide-3.html
I would merge "slide_to_change" into "slides_root". The structure is the same, so everything goes fine. But I have to do it in a Python object representation of this scheme.
So the two trees are represented by two instances - slide1, slide2 - of the same "Slide" class, which is structured as follows:
class Slide(object):
    def __init__(self, path):
        self.path = path
        self.slides = []  # list of child Slide objects
Both slide1 and slide2 contain a path and a list of other Slide objects, each with its own path and list of Slide objects, and so on.
The rule is that if the relative path is the same, then I would replace the slide object in slide1 with the one in slide2.
How can I achieve this result? It is really difficult and I can see no way out. Ideally something like:
for slide_root in slide1.slides:
    for slide_dest in slide2.slides:
        if slide_root.path == slide_dest.path:
            slide_root = slide_dest
            # now restart the loop at a deeper level
            # repeat
Thanks to everyone for any answer.
Sounds not so complicated.
Just use a recursive function for walking the to-be-inserted tree and keep a hold on the corresponding place in the old tree.
If the parts match:
    If the parts are both leaves (html thingies):
        Insert (overwrite) the value.
    If the parts are both nodes (slides):
        Call yourself with the subslides (here's the recursion).
I know this is just kind of a hint, just a sketch of how to do it. But maybe you want to start from this. In Python it could look something like this (also not completely fleshed out):
def merge_slide(slide, old_slide):
    for sub_slide in slide.slides:
        sub_slide_position_in_old_slide = find_sub_slide_position_by_path(sub_slide.path)
        if sub_slide_position_in_old_slide >= 0:  # we found a match!
            sub_slide_in_old_slide = old_slide.slides[sub_slide_position_in_old_slide]
            if sub_slide.slides:  # this is a node!
                merge_slide(sub_slide, sub_slide_in_old_slide)  # here we recurse
            else:  # this is a leaf! so we replace it:
                old_slide.slides[sub_slide_position_in_old_slide] = sub_slide
        else:  # nothing like this in old_slide
            pass  # ignore (you might want to consider this case!)
Maybe that gives you an idea on how I would approach this.
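To make that concrete, here is a self-contained sketch of the same approach; the Slide layout and the find_sub_slide_position_by_path helper are assumptions filled in from the outline in the question, not the asker's actual code:
class Slide(object):
    def __init__(self, path):
        self.path = path
        self.slides = []  # child Slide objects

def find_sub_slide_position_by_path(old_slide, path):
    """Return the index of old_slide's child with the given path, or -1."""
    for i, child in enumerate(old_slide.slides):
        if child.path == path:
            return i
    return -1

def merge_slide(slide, old_slide):
    """Merge the tree rooted at slide into the tree rooted at old_slide."""
    for sub_slide in slide.slides:
        pos = find_sub_slide_position_by_path(old_slide, sub_slide.path)
        if pos < 0:
            continue  # no counterpart in the old tree; ignore (or append, if desired)
        if sub_slide.slides:  # a directory node: recurse into the matching subtree
            merge_slide(sub_slide, old_slide.slides[pos])
        else:  # a leaf (an html file): overwrite it
            old_slide.slides[pos] = sub_slide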

How do I find which attributes my tree splits on, when using scikit-learn?

I have been exploring scikit-learn, making decision trees with both entropy and gini splitting criteria, and exploring the differences.
My question, is how can I "open the hood" and find out exactly which attributes the trees are splitting on at each level, along with their associated information values, so I can see where the two criterion make different choices?
So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.
Thanks for your help and my apologies if I've missed something completely obvious.
Directly from the documentation ( http://scikit-learn.org/0.12/modules/tree.html ):
from io import StringIO
out = StringIO()
out = tree.export_graphviz(clf, out_file=out)
The standalone StringIO module is no longer available in Python 3; import StringIO from the io module instead, as shown above.
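To render the exported dot text, one option is the graphviz package (an assumption: the package and the dot binary must be installed separately; with newer scikit-learn, export_graphviz returns the dot source directly when out_file=None):
import graphviz
dot_text = tree.export_graphviz(clf, out_file=None)  # newer scikit-learn returns the dot source
graphviz.Source(dot_text).render("decision_tree", format="png")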
There is also the tree_ attribute in your decision tree object, which allows the direct access to the whole structure.
And you can simply read it
clf.tree_.children_left #array of left children
clf.tree_.children_right #array of right children
clf.tree_.feature #array of nodes splitting feature
clf.tree_.threshold #array of nodes splitting points
clf.tree_.value #array of nodes values
For more details, look at the source code of the export method.
In general you can use the inspect module
from inspect import getmembers
print( getmembers( clf.tree_ ) )
to get all the object's elements
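As an illustration of how these arrays fit together, here is a sketch that walks the tree and prints the splitting feature, threshold, and impurity (the entropy or gini value) at each level. The iris data and classifier setup are only example assumptions; in tree_, a leaf is marked by children equal to -1:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)

def print_splits(node=0, depth=0):
    indent = "  " * depth
    if clf.tree_.children_left[node] == -1:  # leaf node
        print(indent + "leaf (impurity=%.3f)" % clf.tree_.impurity[node])
        return
    print(indent + "split on feature %d at %.3f (impurity=%.3f)"
          % (clf.tree_.feature[node], clf.tree_.threshold[node], clf.tree_.impurity[node]))
    print_splits(clf.tree_.children_left[node], depth + 1)
    print_splits(clf.tree_.children_right[node], depth + 1)

print_splits()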
If you just want a quick look at what is going on in the tree, try:
zip(X.columns[clf.tree_.feature], clf.tree_.threshold, clf.tree_.children_left, clf.tree_.children_right)
where X is the data frame of independent variables and clf is the decision tree object. Notice that clf.tree_.children_left and clf.tree_.children_right together contain the order that the splits were made (each one of these would correspond to an arrow in the graphviz visualization).
Scikit learn introduced a delicious new method called export_text in version 0.21 (May 2019) to view all the rules from a tree. Documentation here.
Once you've fit your model, you just need two lines of code. First, import export_text:
from sklearn.tree import export_text
Second, create an object that will contain your rules. To make the rules look more readable, use the feature_names argument and pass a list of your feature names. For example, if your model is called model and your features are named in a dataframe called X_train, you could create an object called tree_rules:
tree_rules = export_text(model, feature_names=list(X_train))
Then just print or save tree_rules. Your output will look like this:
|--- Age <= 0.63
| |--- EstimatedSalary <= 0.61
| | |--- Age <= -0.16
| | | |--- class: 0
| | |--- Age > -0.16
| | | |--- EstimatedSalary <= -0.06
| | | | |--- class: 0
| | | |--- EstimatedSalary > -0.06
| | | | |--- EstimatedSalary <= 0.40
| | | | | |--- EstimatedSalary <= 0.03
| | | | | | |--- class: 1

Parsing a list of words into a tree

I have a list of words. For example:
reel
road
root
curd
I would like to store this data in a manner that reflects the following structure:
Start -> r -> e -> reel
           -> o -> a -> road
                -> o -> root
      -> c -> curd
It is apparent to me that I need to implement a tree. From this tree, I must be able to easily obtain statistics such as the height of a node, the number of descendants of a node, searching for a node and so on. Adding a node should 'automatically' add it to the correct position in the tree, since this position is unique.
I would also like to be able to visualize the data in the form of an actual graphical tree. Since the tree is going to be huge, I would need zoom / pan controls on the visualization. And of course, a pretty visualization is always better than an ugly one.
Does anyone know of a Python package which would allow me to achieve all this simply? Writing the code myself will take quite a while. Do you think http://packages.python.org/ete2/ would be appropriate for this task?
I'm on Python 2.x, btw.
I discovered that NLTK has a trie class - nltk.containers.trie. This is convenient for me, since I already use NLTK. Does anyone know how to use this class? I can't find any examples anywhere! For example, how do I add words to the trie?
ETE2 is an environment for tree exploration, in principle made for browsing, building and exploring phylogenetic trees; I've used it for those purposes a long time ago.
But it's possible that, if you set up your data properly, you could get it done.
You just have to place parentheses wherever you need to split your tree and create a branch. See the following example, taken from the ETE doc.
If you replace "(A,B,(C,D));" with your words/letters, it should work.
from ete2 import Tree
unrooted_tree = Tree( "(A,B,(C,D));" )
print unrooted_tree
output:
/-A
|
----|--B
|
| /-C
\---|
\-D
...and this package will let you do most of the operations you want, giving you the chance to select every branch individually and operate with it in an easy way.
I recommend you take a look at the tutorial anyway, it's not too difficult :)
I think the following example does pretty much what you want, using the ETE toolkit.
from ete2 import Tree

words = ["reel", "road", "root", "curd", "curl", "whatever", "whenever", "wherever"]

# Creates an empty tree
tree = Tree()
tree.name = ""
# Lets keep tree structure indexed
name2node = {}
# Make sure there are no duplicates
words = set(words)
# Populate tree
for wd in words:
    # If no similar words exist, add it to the base of tree
    target = tree
    # Find relatives in the tree
    for pos in xrange(len(wd), -1, -1):
        root = wd[:pos]
        if root in name2node:
            target = name2node[root]
            break
    # Add new nodes as necessary
    fullname = root
    for letter in wd[pos:]:
        fullname += letter
        new_node = target.add_child(name=letter, dist=1.0)
        name2node[fullname] = new_node
        target = new_node

# Print structure
print tree.get_ascii()

# You can also use all the visualization machinery from ETE
# (http://packages.python.org/ete2/tutorial/tutorial_drawing.html)
# tree.show()

# You can find, isolate and operate with a specific node using the index
wh_node = name2node["whe"]
print wh_node.get_ascii()

# You can rebuild words under a given node
def reconstruct_fullname(node):
    name = []
    while node.up:
        name.append(node.name)
        node = node.up
    name = ''.join(reversed(name))
    return name

for leaf in wh_node.iter_leaves():
    print reconstruct_fullname(leaf)
/n-- /e-- /v-- /e-- /-r
/e--|
/w-- /h--| \r-- /e-- /v-- /e-- /-r
| |
| \a-- /t-- /e-- /v-- /e-- /-r
|
| /e-- /e-- /-l
----|-r--|
| | /o-- /-t
| \o--|
| \a-- /-d
|
| /-d
\c-- /u-- /r--|
\-l
