RDF is a schema-free system to represent data. However, most of the time I find myself writing a sort of well-known graph structure, and I have to build triple by triple.
In the more general case, this well known graph structure is of course not guaranteed to be complete nor fixed (e.g. something else can be added). However, if a more or less invariant backbone exists, it would be nice to describe this backbone with placeholders and then pass a context to produce the fully deployed RDF graph.
Does something like this exist in Python?
Sounds a little like using a SPARQL CONSTRUCT query to make the final graph. Run a regular query (WHERE {} ) against a graph to form some variable bindings and then use the CONSTRUCT {} block to make the templated graph into your final answer. Any modern rdf library should have support for SPARQL.
Related
are there any tools or examples for how to visualize things like linked lists and decisions trees using matplotlib?
I ask because I wrote a linked list type of class (each node can have multiple inputs/outputs, and there's a class variable that stores node names), and want to visualize it. Unfortunately, my computer at work is locked down so much that I can't download other packages, so I've got to use whatever is on hand- which is matplotlib
I've started reading it, and if I do it by hand, I can probably make something that visualizes one-directional linked lists (just give it the root node, and plop down a square with text for each operation). But if there's branching, or multiple inputs into a node things get a bit more complicated- for example is it possible to expand a figure after creating it?
Yes, you can use networkx library and the draw_networkx method. There are plenty of examples on Stack Overflow. Here is one example: https://stackoverflow.com/a/52683100/6361531
I'm looking for some general advice on how to either re-write application code to be non-naive, or whether to abandon neo4j for another data storage model. This is not only "subjective", as it relates significantly to specific, correct usage of the neo4j driver in Python and why it performs the way it does with my code.
Background:
My team and I have been using neo4j to store graph-friendly data that is initially stored in Python objects. Originally, we were advised by a local/in-house expert to use neo4j, as it seemed to fit our data storage and manipulation/querying requirements. The data are always specific instances of a set of carefully-constructed ontologies. For example (pseudo-data):
Superclass1 -contains-> SubclassA
Superclass1 -implements->SubclassB
Superclass1 -isAssociatedWith-> Superclass2
SubclassB -hasColor-> Color1
Color1 -hasLabel-> string::"Red"
...and so on, to create some rather involved and verbose hierarchies.
For prototyping, we were storing these data as sequences of grammatical triples (subject->verb/predicate->object) using RDFLib, and using RDFLib's graph-generator to construct a graph.
Now, since this information is just a complicated hierarchy, we just store it in some custom Python objects. We also do this in order to provide an easy API to others devs that need to interface with our core service. We hand them a Python library that is our Object model, and let them populate it with data, or, we populate it and hand it to them for easy reading, and they do what they want with it.
To store these objects permanently, and to hopefully accelerate the writing and reading (querying/filtering) of these data, we've built custom object-mapping code that utilizes the official neo4j python driver to write and read these Python objects, recursively, to/from a neo4j database.
The Problem:
For large and complicated data sets (e.g. 15k+ nodes and 15k+ relations), the object relational mapping (ORM) portion of our code is too slow, and scales poorly. But neither I, nor my colleague are experts in databases or neo4j. I think we're being naive about how to accomplish this ORM. We began to wonder if it even made sense to use neo4j, when more traditional ORMs (e.g. SQL Alchemy) might just be a better choice.
For example, the ORM commit algorithm we have now is a recursive function that commits an object like this (pseudo code):
def commit(object):
for childstr in object: # For each child object
child = getattr(object, childstr) # Get the actual object
if attribute is <our object base type): # Open transaction, make nodes and relationship
with session.begin_transaction() as tx:
<construct Cypher query with:
MERGE object (make object node)
MERGE child (make its child node)
MERGE object-[]->child (create relation)
>
tx.run(<All 3 merges>)
commit(child) # Recursively write the child and its children to neo4j
Is it naive to do it like this? Would an OGM library like Py2neo's OGM be better, despite ours being customized? I've seen this and similar questions that recommend this or that OGM method, but in this article, it says not to use OGMs at all.
Must we really just implement every method and benchmark for performance? It seems like there must be some best-practices (other than using the batch IMPORT, which doesn't fit our use cases). And we've read through articles like those linked, and seen the various tips on writing better queries, but it seems better to step back and examine the case more generally before attempting to optimize code line-by line. Although it's clear that we can improve the ORM algorithm to some degree.
Does it make sense to write and read large, deep hierarchical objects to/from neo4j using a recursive strategy like this? Is there something in Cypher, or the neo4j drivers that we're missing? Or is it better to use something like Py2neo's OGM? Is it best to just abandon neo4j altogether? The benefits of neo4j and Cypher are difficult to ignore, and our data does seem to fit well in a graph. Thanks.
It's hard to know without looking at all the code and knowing the class hierarchy, but at the moment I'd hazard a guess that your code is slow in the OGM bit because every relationship is created in its own transaction. So you're doing a huge number of transactions for a larger graph which is going to slow things down.
I'd suggest for an initial import where you're creating every class/object, rather than just adding a new one or editing the relationships for one class, that you use your class inspectors to simply create a graph representation of the data, and then use Cypher to construct it in a lot fewer transactions in Neo4J. Using some basic topological graph theory you could then optimise it by reducing the number of lookups you need to do, too.
You can create a NetworkX MultiDiGraph in your python code to model the structure of your classes. From there on in there are a few different strategies to put the data into Neo4J - I also just found this but have no idea about whether it works or how efficient it is.
The most efficient way to query to import your graph will depend on the topology of the graph, and whether it is cyclical or not. Some options are below.
1. Create the Graph in Two Sets of Queries
Run one query for every node label to create every node, and then another to create every edge between every combination of node labels (the efficiency of this will depend on how many different node labels you're using).
2. Starting from the topologically highest or lowest point in the graph, create the graph as a series of paths
If you have lots of different edge labels and node labels, this might involve writing a lot of cypher logic combining UNWIND and FOREACH (CASE r.label = 'SomeLabel' THEN [1] ELSE [] | CREATE (n:SomeLabel {node_unique_id: x})->, but if the graph is very hierarchical you could also use python to keep track of which nodes have all their lower nodes and relationships created already and then use that knowledge to limit the size of paths that get sent to Neo4J in a query.
3. Use APOC to import the whole graph
Another option, which may or may not fit your use case and may or may not be more performant would be to export the graph to GraphML using NetworkX and then use the APOC GraphML import tool.
Again, it's hard to offer a precise solution without seeing all your data, but I hope this is somewhat useful as a steer in the right direction! Happy to help / answer any other questions based on more data.
There is a lot going on here so I'll try to address this in smaller questions
Would an OGM library like Py2neo's OGM be better
With any ORM/OGM library, the reality is that you can always get better performance by bypassing them and delving into the belly of the beast. That is not really the ORMs entire job though. An ORM is meant to save you time and effort by making relatively efficient DB use easy.
So it depends, if you want best performance, skip the ORM, and invest your time working on as low a level as you can (*Requires advanced low level knowledge of the beast you are working with, and a lot of your time). Otherwise, an ORM library is usually your best bet.
Our code is too slow, and scales poorly
Databases are complex. If at all possible, I would recommend bringing someone(s) on board to be a company wide database admin/expert. (This is harder when you don't already have one to vet new hires actually know what they are talking about)
Assuming that is not an option, here are some things to consider.
IO is expensive. Especially over the network. Minimize data that has to be sent in either direction. (This is why you page return results. Only return the data you need, as you actually need it)
Caveat to that, creating request connections is very expensive. Minimize calls to the DB. (Have fun balancing the two ^_^) (Note: ORMs usually have built in mechanics to only commit what has changed)
Get to the data you want fast. Create indexes in the database to vastly improve fetch speed. The more unique and consistent the id is, the better.
Caveat, indexes have to be updated on writes that alter a value in them. So indexes reduce write speed and eat more memory to gain read speed. Minimize indexes.
Transactions are a memory operation. Committing a transaction is a disk IO operation. This is why batch jobs are far more efficient.
Caveat, Memory isn't infinite. Keep your jobs a reasonable size.
As you can probably tell, scaling DB operations to production levels is not fun. It's too easy to burn yourself over-optimizing on any axis, and this is just surface level over simplifications.
For prototyping, we were storing these data as sequences of grammatical triples
Less a question, and more a statement, but different types of databases have different strengths and weaknesses. Scheme-less DBs are more specialized for cache stores; Graph DBs are specialized for querying based on relationships (edges); Relational DBs are specialized for fetching/updating records (tables); And Triplestores are more Specialized for, well, triples (RDF); (ect. there are more types)
I mention this because it sounds like your data might be mostly "write once, read many". In this case, you probably actually should be using a Triplestore. You can use any DB type for anything, but picking the best DB requires you to know how you use your data, and how that use can possible evolve.
Must we really just implement every method and benchmark for performance?
Well, this is part of why stored procedures are so important. ORMs help abstract this part, and having an in house domain expert would really help. It could just be that you are pushing the limits of what 1 machine can do. Maybe you just need to upgrade to a cluster; or maybe you have horrible code inefficiencies that have you touching a node 10k times in 1 save operation when no (or 1) value changed. To be honest though, bench-marking doesn't do much unless you know what you are looking for. For example, usually the difference between 5 hours and 0.5 seconds could be as simple as creating 1 index.
(To be fair, while buying bigger and better database servers/clusters may be the inefficient solution, it is sometimes the most cost effective compared to the salary of 1 Database Admin. So, again, depends your your priorities. And I'm sure your boss would probably prioritize differently from what you'd like)
TL;DR
You should hire a domain expert to help you.
If that is not an option, go to the bookstore (or google) pick up Databases 4 dummies (hands on learn databases online tutorial classes), and become the domain expert yourself. (Which you can than use to boost your worth to the company)
If you don't have time for that, probably your only saving grace would be to just upgrade your hardware to solve the problem with brute force. (*As long as growth isn't exponential)
As a biology undergrad i'm often writing python software in order to do some data analysis. The general structure is always :
There is some data to load, perform analysis on (statistics, clustering...) and then visualize the results.
Sometimes for a same experiment, the data can come in different formats, you can have different ways to analyses them and different visualization possible which might or not depend of the analysis performed.
I'm struggling to find a generic "pythonic" and object oriented way to make it clear and easily extensible. It should be easy to add new type of action or to do slight variations of existing ones, so I'm quite convinced that I should do that with oop.
I've already done a Data object with methods to load the experimental data. I plan to create inherited class if I have multiple data source in order to override the load function.
After that... I'm not sure. Should I do a Analysis abstract class with child class for each type of analysis (and use their attributes to store the results) and do the same for Visualization with a general Experiment object holding the Data instance and the multiple Analysis and Visualization instances ? Or should the visualizations be functions that take an Analysis and/or Data object(s) as parameter(s) in order to construct the plots ? Is there a more efficient way ? Am I missing something ?
Your general idea would work, here are some more details that will hopefully help you to proceed:
Create an abstract Data class, with some generic methods like load, save, print etc.
Create concrete subclasses for each specific form of data you are interested in. This might be task-specific (e.g. data for natural language processing) or form-specific (data given as a matrix, where each row corresponds to a different observation)
As you said, create an abstract Analysis class.
Create concrete subclasses for each form of analysis. Each concrete subclass should override a method process which accepts a specific form of Data and returns a new instance of Data with the results (if you think the form of the results would be different of that of the input data, use a different class Result)
Create a Visualization class hierarchy. Each concrete subclass should override a method visualize which accepts a specific instance of Data (or Result if you use a different class) and returns some graph of some form.
I do have a warning: Python is abstract, powerful and high-level enough that you don't generally need to create your own OO design -- it is always possible to do what you want with mininal code using numpy, scipy, and matplotlib, so before start doing the extra coding be sure you need it :)
It has been a while since you asked your question, but this might be interesting.
I created and actively develop a python library to do exactly this (albeit with a slightly broader scope). It is designed so that you can fully customize your data processing, while still having some basic tools (including for plot
The library is called Experiment NoteBook (enb), and is available in github (https://github.com/miguelinux314/experiment-notebook) and via pip (e.g., pip install enb).
I recommend any interested reader to take a look at the tutorial-like documentation (https://miguelinux314.github.io/experiment-notebook) to get an idea of the intended workflow.
I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.
Example: I'd like to transform this node tree: (please excuse the S-expressions)
(A (B) (C) (D))
Into this one:
(C (B) (D))
As long as the parent is A and the second ancestor is C, regardless of context (there may be more parents or ancestors). I'd like to express this transformation in a simple, concise, and re-usable way. Of course this example is very specific. Please try to address the general case.
Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which i'd like to avoid. I'm still looking for more and/or better examples.
Background:
I'm able to convert python and cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I plan to then re-organize the tree and write-out the results in order to implement automated refactoring. XSLT seems to be the standard tool to rewrite XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.
I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I will wind up with a bunch of purpose-built spaghetti code which can't be re-used.
Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.
def lift_middle_child(in_tree):
(A, (B,), (C,), (D,)) = in_tree
return (C, (B,), (D,))
print lift_middle_child(('A', ('B',), ('C',), ('D',))) # could use lists too
This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.
Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.
Here's the code in action: http://ideone.com/02Uv0i
Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:
def compose2(a,b): # might want to get this from the functional library
return lambda *x: a(b(*x))
def compose(*funcs): #compose(a,b,c) = a(b(c(x))) - you might want to reverse that
return reduce(compose2,funcs)
def reverse_children(in_tree):
return in_tree[0:1] + in_tree[1:][::-1] # slightly cryptic, but works for anything subscriptable
lift_and_reverse = compose(reverse_children,lift_middle_child) # right most function applied first - if you find this confusing, reverse order in compose function.
print lift_and_reverse(('A', ('B',), ('C',), ('D',)))
What you really want IMHO is an program transformation system, which allows you to parse and transform code using the patterns expressed in the surface syntax of the source code (and even the target language) to express the rewrites directly.
You will find that even if you can get your hands on an XML representation of the Python tree, that the effort to write an XSLT/XPath transformation is more than you expect; trees representing real code are messier than you'd expect, XSLT isn't that convenient a notation, and it cannot express directly common conditions on trees that you'd like to check (e.g., that two subtrees are the same). An final complication with XML: assume its has been transformed. How do you regenerate the source code syntax from which came? You need some kind of prettyprinter.
A general problem regardless of how the code is represented is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform python into a language that uses different operators for string concat and arithmetic (unlike Java which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).
To express your rewrite in DMS on Python syntax close to your example (which isn't Python?)
domain Python;
rule revise_arguments(f:IDENTIFIER,A:expression,B:expression,
C:expression,D:expression):primary->primary
= " \f(\A,(\B),(\C),(\D)) "
-> " \f(\C,(\B),(\D)) ";
The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain notation declaration) from the DMS RSL language. The \n inside the meta quote refers to the syntax variable placeholders of the named nonterminal type defined in the rule parameter list. Yes, (...) inside the metaquotes are Python ( ) ... they exist in the syntax trees as far as DMS is concerned, because they, like the rest of the language, are just syntax.
The above rule looks a bit odd because I'm trying to follow your example as close as possible, and from and expression language point of view, your example is odd precisely because it does have unusual parentheses.
With this rule, DMS could parse Python (using its Python parser) like
foobar(2+3,(x-y),(p),(baz()))
build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:
foobar(p,(x-y),(baz()))
and then prettyprint the surface syntax (valid) python back out.
If you intended your example to be a transformation on LISP code, you'd
need a LISP grammar for DMS (not hard to build, but we don't have much
call for this), and write corresponding surface syntax:
domain Lisp;
rule revise_form(A:form,B:form, C:form, D:form):form->form
= " (\A,(\B),(\C),(\D)) "
-> " (\C,(\B),(\D)) ";
You can get a better feel for this by looking at Algebra as a DMS domain.
If your goal is to implement all this in Python... I don't have much help.
DMS is a pretty big system, and it would be a lot of effort to replicate.
I want to implement a driving direction in Python using something like Djikstra's shortest path. The algorithm requires the data to be represents in graph structure. Raw GIS data (e.g. shape files or OpenStreetMap data, however, represent their data differently. Therefore, I was wondering is there any Python library that can convert GIS data to graph structure?
In Java I found that GeoTools has exactly what I described. Is there any similar library in Python?
Haven't used it yet, but there's a function that generates directed graphs from shapefiles in Networkx: http://networkx.lanl.gov/reference/readwrite.nx_shp.html. If it doesn't do exactly what you need, it might suggest a solution. Uses OGR's Python bindings to read data.
See also Graphserver http://bmander.github.com/graphserver/.