I have a model fitted by DecisionTreeClassifier (class DecisionTreeClassificationModel) and need to parse its tree nodes in order to visualize a subset of the tree or the whole tree, but the methods available in the PySpark API seem very limited.
For example, I'd like to take a node N and get its parent or all of its leaves.
Would this be possible using the PySpark API? So far all I can do is call:
model.toDebugString
and parse the string to recreate the tree structure.
I saw that the Java API provides more options, but I don't know how to use it from a PySpark script.
I also found the spark-tree-plotting package on the web, which even visualizes the tree, but I got some failures when trying to install it (it seems it is no longer maintained).
I would appreciate any tips on how to efficiently parse the decision tree returned by the model.
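For context, one direction I am considering is going through the underlying Java node objects via PySpark internals. A rough sketch of what I mean (it relies on the private _call_java helper and the InternalNode/LeafNode Java classes, so presumably not a supported API):

# Rough sketch, relying on PySpark internals (not a stable public API):
# walk the wrapped org.apache.spark.ml.tree.Node objects of the fitted tree.
def collect_leaves(node, leaves):
    # InternalNode exposes leftChild()/rightChild(); LeafNode does not.
    if node.getClass().getSimpleName() == "InternalNode":
        collect_leaves(node.leftChild(), leaves)
        collect_leaves(node.rightChild(), leaves)
    else:
        leaves.append(node)
    return leaves

java_root = model._call_java("rootNode")   # wrapped Java root node of the tree
for leaf in collect_leaves(java_root, []):
    print(leaf.prediction(), leaf.impurity())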
I want to implement Elasticsearch on a customized corpus. I have installed Elasticsearch version 7.5.1 and I do all my work in Python using the official client.
Here I have a few questions:
How do I customize the preprocessing pipeline? For example, I want to use a BertTokenizer to convert strings to tokens instead of n-grams.
How do I customize the scoring function of each document with respect to the query? For example, I want to compare the effects of TF-IDF with BM25, or even use some neural models for scoring.
If there is a good tutorial in Python, please share it with me. Thanks in advance.
You can customize the similarity function when creating an index. See the Similarity Module section of the documentation. You can find a good article that compares classical TF-IDF with BM25 on the OpenSource Connections site.
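For example, something along these lines should define a tunable BM25 similarity and attach it to a field at index creation time (the index and field names, and the k1/b values, are just placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Create the index with a custom BM25 similarity bound to the "text" field.
es.indices.create(
    index="my_corpus",
    body={
        "settings": {
            "index": {
                "similarity": {
                    "my_bm25": {"type": "BM25", "k1": 1.2, "b": 0.75}
                }
            }
        },
        "mappings": {
            "properties": {
                "text": {"type": "text", "similarity": "my_bm25"}
            }
        },
    },
)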
In general, Elasticsearch uses an inverted index to look up all documents that contain a specific word or token.
It sounds like you want to use vector fields for scoring; there is a good article on the Elastic blog that explains how you can achieve that. Be aware that, as of now, Elasticsearch uses vector fields only for scoring, not for retrieval. If you want to use vector fields for retrieval, you have to use a plugin, use the OpenSearch fork, or wait for version 8.
In my opinion, using ANN in real time during search is too slow and expensive, and I have yet to see improvements in relevancy over normal search requests.
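A minimal sketch of the scoring-only approach with a dense_vector field and a script_score query (the field name, the 768 dimensions, and the embedding source are assumptions you would replace with your own):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Mapping with a dense_vector field next to the regular text field.
es.indices.create(
    index="neural_corpus",
    body={"mappings": {"properties": {
        "text": {"type": "text"},
        "text_vector": {"type": "dense_vector", "dims": 768},
    }}},
)

query_vector = [0.0] * 768  # replace with the embedding your own model produces

# Retrieve with BM25, then re-score the hits with cosine similarity on the vectors.
response = es.search(
    index="neural_corpus",
    body={"query": {"script_score": {
        "query": {"match": {"text": "some query"}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
            "params": {"query_vector": query_vector},
        },
    }}},
)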
I would do the preprocessing of your documents in your own Python environment before indexing and not use any Elasticsearch pipelines or plugins. It is easier to debug and iterate outside of Elasticsearch.
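For instance, the BERT tokenization could happen in Python before indexing. This assumes the Hugging Face transformers package and illustrative index/field names:

from elasticsearch import Elasticsearch
from transformers import BertTokenizer

es = Elasticsearch()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def index_document(doc_id, text):
    # Tokenize in Python and store the tokens next to the raw text; a whitespace
    # analyzer on "bert_tokens" keeps Elasticsearch from re-tokenizing them.
    tokens = tokenizer.tokenize(text)
    es.index(
        index="my_corpus",
        id=doc_id,
        body={"text": text, "bert_tokens": " ".join(tokens)},
    )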
You could also take a look at the Haystack project; it might have a lot of the functionality that you are looking for already built in.
I have a clustering algorithm in Python that I am trying to convert to PySpark (for parallel processing).
I have a dataset that contains regions, and stores within those regions. I want to perform my clustering algorithm for all stores in a single region.
I have a few for loops before getting to the ML part. How can I modify the code to remove the for loops in PySpark? I have read that for loops in PySpark are generally not good practice, but I need to be able to run the model on many sub-datasets. Any advice?
For reference, I'm currently looping (through Pandas DataFrames) like this pseudocode below:
for region in df_region:
for distinct stores in region:
[apply ML clustering algorithm]
Search Built-in Algorithms
You could first look up the built-in RDD-based clustering algorithms, since they cover the common cases and went through a rigorous validation process before release.
Clustering - RDD-based API
If you're more familiar with the DataFrame-based API, you could go here for a glance. Keep in mind that as of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode (no new features, only bug fixes). The primary ML API is now the DataFrame-based API in the spark.ml package.
Implement Yourself
Pandas UDFs
If you already have a model object, consider Pandas UDFs, since they have iterator support now (since 3.0.0). Simply put, this means a model won't be loaded for each row.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf(...)  # pass the UDF's return type here, e.g. "double"
def categorize(iterator):
    model = ...  # load the model once per executor, not once per row
    for features in iterator:
        yield pd.Series(model.predict(features))

# GROUP BY in Spark SQL or window functions can also be considered.
# It depends on your scenario; just remember that DataFrames are still based on RDDs:
# they are immutable, high-level abstractions.
spark_df.withColumn("clustered_result", categorize("some_column")).show()
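For the per-region case in the question, a grouped-map pandas UDF via groupBy().applyInPandas (Spark 3.0+) is the usual way to drop the outer for loops. A rough sketch, assuming illustrative column names and scikit-learn's KMeans as the per-group algorithm:

import pandas as pd
from sklearn.cluster import KMeans

def cluster_region(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains all rows of one region; cluster that subset locally.
    features = pdf[["feature_1", "feature_2"]]
    pdf["cluster"] = KMeans(n_clusters=3, random_state=0).fit_predict(features)
    return pdf

clustered = spark_df.groupBy("region").applyInPandas(
    cluster_region,
    schema="region string, store string, feature_1 double, feature_2 double, cluster long",
)
clustered.show()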
RDD Exploring
If, unfortunately, your clustering algorithm is not among the Spark built-in clustering algorithms and it has no training step that produces a model object, you could consider converting the Pandas DataFrame into an RDD and implementing your clustering algorithm on top of it. A rough outline looks like the following:
pandas_df = ...
spark_df = spark.createDataFrame(pandas_df)
...
clustering_result = spark_df.rdd.map(lambda p: cluster_algorithm(p))
note1: This is only a rough outline; you would probably want to partition the whole dataset into several RDDs based on region and then execute the clustering algorithm within each partition. Because the details of the clustering algorithm are not very clear, I can only give advice based on some assumptions.
note2: RDD implementation should be your last option
RDD Programming Guide
2017, Chen Jin, A Scalable Hierarchical Clustering Algorithm Using Spark
I have successfully generated an AST using ANTLR in Python, but I cannot figure out for the life of me how to save it for later use. The only option I have been able to find is the tree.toStringTree() method, but its output is messy and not particularly convenient or easy to work with.
How do I save it and what format would be best/easiest to work with and be able to visualise and load it in in the future?
EDIT: I can see in the Java documentation there is a DotGenerator() to generate a DOT file of the tree, but I can't find a way to do anything like this in Python.
What you are looking for is a serializer/deserializer of the parse tree. Serialization was previously addressed on StackOverflow here. It isn't supported in the runtime (AFAIK) because it is rarely useful: one can reconstruct the tree very quickly by re-parsing the input. Even if you want to change the tree using a transformation, you can replace nodes in the tree with sub-trees whose node types don't even exist in your parser, print out the tree, then re-parse to reconstruct the tree with the node types for your grammar. Serialization only makes sense if parsing with semantic analysis is very slow. So, you should consider the problem carefully.
However, it's not difficult to write a crude serializer/deserializer that does not consider "off-channel" content like spacing or comments. This C# program (which you could adapt to python) is an example that reconstructs the tree using the grammars-v4/sexpression.g4 grammar for a target grammar arithmetic.g4. Using toStringTree(rule-names), the tree is first serialized into a string. (Note, toStringTree() without the parser rule names is difficult to read, that is why I asked.) Then, the s-expression is parsed and a bottom-up reconstruction is performed using an Antlr visitor. Since toStringTree() does not mark the parse tree leaves with the type of the token (e.g., to distinguish between a number versus a symbol), the string is lexed to reconstruct the value. It also uses reflection to create the correct parse tree node type.
Outputting a Dot graph of the parse tree is also easy, using a top-down recursive visitor, which I included in the program. The recursive function outputs an edge to each child of a particular node. Since each node name has to be unique (it's a tree), I appended the pre-order number of the node to its name.
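If you want something similar directly in the Python runtime, a small recursive DOT emitter is easy to sketch. This is only an illustration: it assumes the antlr4-python3-runtime package, uses Trees.getNodeText for the labels, and numbers the nodes in pre-order to keep the ids unique.

from antlr4.tree.Trees import Trees

def to_dot(tree, rule_names):
    lines = ["digraph parse_tree {"]
    counter = [0]  # pre-order numbering keeps node ids unique

    def walk(node):
        my_id = counter[0]
        counter[0] += 1
        label = Trees.getNodeText(node, rule_names).replace('"', '\\"')
        lines.append('  n%d [label="%s"];' % (my_id, label))
        for i in range(node.getChildCount()):
            child_id = walk(node.getChild(i))
            lines.append("  n%d -> n%d;" % (my_id, child_id))
        return my_id

    walk(tree)
    lines.append("}")
    return "\n".join(lines)

# usage, after parsing:  open("tree.dot", "w").write(to_dot(tree, parser.ruleNames))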
--Ken
I was wondering whether there is a way to perform interactive variable grouping (similar to what the SAS Miner software enables) in the PySpark/Python world. Variable grouping is an integral part of model development, so I suppose there must already be some tool/library that supports this. Does anyone have experience with this?
Currently no such library exists for Python.
Interactive variable grouping is a multi-step process (offered as a node called IGN in SAS Enterprise Miner) that is part of the SAS EM Credit Scoring solution, not base SAS. There are, however, tools in the Python world for some of the IGN steps, such as binning, WoE, Gini, and decision trees; scikit-learn is a good starting point for that.
There are a lot of Scikit-learn related projects including domain-specific ones. A project for credit scoring is a potential candidate in that list.
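As an illustration of one of those steps, here is a minimal weight-of-evidence (WoE) binning sketch in plain pandas/numpy; the column names, bin count, and smoothing constant are arbitrary choices for the example:

import numpy as np
import pandas as pd

def woe_table(df, feature, target, bins=5):
    # Equal-width bins of the feature; target is assumed to be 0/1.
    binned = pd.cut(df[feature], bins=bins)
    grouped = df.groupby(binned)[target].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    # Share of events / non-events per bin, with a small constant to avoid log(0).
    dist_event = (grouped["events"] + 0.5) / (grouped["events"].sum() + 0.5)
    dist_non_event = (grouped["non_events"] + 0.5) / (grouped["non_events"].sum() + 0.5)
    grouped["woe"] = np.log(dist_event / dist_non_event)
    return grouped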
I have an XML file that looks like this:
<rebase>
<Organism>
<Name>Aminomonas paucivorans</Name>
<Enzyme>M1.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
<Enzyme>M2.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
</Organism>
<Organism>
<Name>Bacillus cellulosilyticus</Name>
<Enzyme>M1.BceNI</Enzyme>
<Motif>CCCNNNNNCTC</Motif>
<Enzyme>M2.BceNI</Enzyme>
<Motif>CCCNNNNNCTC</Motif>
</Organism>
</rebase>
I want to visualize this XML data in a graphical format. The connectivity is such that many enzymes can contain common motifs, but no two organisms can have the same enzymes. I looked at d3.js but I don't think it has what I'm looking for. I was really excited by the visualization Neo4j seems to provide, but I would need to learn it from scratch. However, I haven't come across any good tutorials for importing or creating a graph in Neo4j from XML datasets. I know in the world of programming anything is possible, so I wanted to know the possible ways I could import my data (preferably using Python) into a Neo4j database to visualize it.
UPDATE
I tried following this answer (the second answer under this question). I created the 2 CSV files that it suggested. However, the query produces a lot of syntax errors, such as:
Invalid input 'S': expected 'n/N' (line 6, column 2)
"USING PERIODIC COMMIT"
WITH is required between CREATE and LOAD CSV (line 6, column 1)
"MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})"
My Cypher query skills are extremely limited and I couldn't get any imports to work, so fixing the query by myself is proving to be really difficult. Any help will be greatly appreciated.
There is also a series of posts on how to import XML into Neo4j:
http://supercompiler.wordpress.com/2014/07/22/navigating-xml-graph-using-cypher/
http://supercompiler.wordpress.com/2014/04/06/visualizing-an-xml-as-a-graph-neo4j-101/
First you should model what your data should look like as a graph: which entities you need for your use cases and which semantic connections between them.
In general if you can load the data in python, you can use py2neo or neo4jrestclient (see https://neo4j.com/developer/python/) to import your data into your model.
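A rough sketch of that route with py2neo and ElementTree, following the model above (the connection details, labels, and relationship names are assumptions you would adapt):

import xml.etree.ElementTree as ET
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

root = ET.parse("rebase.xml").getroot()
for organism in root.findall("Organism"):
    name = organism.findtext("Name")
    graph.run("MERGE (:Organism {name: $name})", name=name)
    # MERGE keeps enzyme and motif nodes unique even when they repeat in the XML.
    for enzyme in organism.findall("Enzyme"):
        graph.run(
            "MATCH (o:Organism {name: $name}) "
            "MERGE (e:Enzyme {name: $enzyme}) "
            "MERGE (o)-[:HAS_ENZYME]->(e)",
            name=name, enzyme=enzyme.text,
        )
    for motif in organism.findall("Motif"):
        graph.run(
            "MATCH (o:Organism {name: $name}) "
            "MERGE (m:Motif {name: $motif}) "
            "MERGE (o)-[:HAS_MOTIF]->(m)",
            name=name, motif=motif.text,
        )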
For this I would suggest using Gephi directly. At least a year ago it worked flawlessly; it supports XML/CSV data import directly and there is no need to use Neo4j as a pre-processor.
EDIT
Oh, I see now; I thought the connections were already included. In this case, you must create all the data from the XML as separate nodes: a new node for each enzyme and motif, and also for each organism (with a name parameter). The enzyme and motif nodes must be unique, i.e. no duplicates. When creating an organism node, you connect the organism to its enzyme and motif nodes by a relationship. After this is done, querying/visualizing similar nodes is no problem, since similar organisms share at least one of the enzyme/motif nodes.
I don't know any smart way to import XML data into Neo4j, but you should have no problem converting it into two CSV files. The format of those CSVs would then be:
first file:
name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
second file (I don't understand why the motif is duplicated, though):
name,motif
Aminomonas paucivorans,GGAGNNNNNGGC
Aminomonas paucivorans,GGAGNNNNNGGC
Bacillus cellulosilyticus,CCCNNNNNCTC
Bacillus cellulosilyticus,CCCNNNNNCTC
Now we are going to do the import, which creates unique nodes and relationships (so the duplicated motifs above translate into just one unique relationship; if necessary, it is also possible to have multiple relationships to the same motif node):
(I'm not completely sure about this import, but it should work:)
// Run each LOAD CSV statement as its own query (running both at once causes the
// "WITH is required between CREATE and LOAD CSV" error), and use a file:/// URL.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
MERGE (o:Organism { name: csvLine.name })
MERGE (e:Enzyme { name: csvLine.enzyme })
MERGE (o)-[:has_enzyme]->(e)

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///file2.csv" AS csvLine
MERGE (o:Organism { name: csvLine.name })
MERGE (m:Motif { name: csvLine.motif })
MERGE (o)-[:has_motif]->(m)
This will create the graph with 2 organism nodes, 4 enzyme nodes and 2 motif nodes. Each organism node will then have a relationship to its enzymes and motifs. After this is done, you can move on to the visualization part described at the beginning.