Querying the shared nodes in an RDF graph


I have an RDF graph that is the result of a SPARQL query in rdflib, although this question applies to any endpoint. The graph looks like the picture below.
I want to find a way to query the nodes that are shared between two clusters. Those are basically the nodes that are:
Subject to two objects
Object to two subjects
Object to a subject, and, then subject to another object
I tried Graph.subjects() and Graph.objects() in rdflib, but as far as I can tell they only return iterators, so I would have to iterate over the whole graph three times, once for each of the scenarios above, and that would result in a lot of double counting.
I was wondering if anyone has an idea on how to do this in a better way, perhaps within SPARQL to begin with.
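One possible single-pass approach, offered only as a sketch: read the three criteria above as "a node linked to at least two distinct other nodes, in either direction", and collect each node's neighbours while walking the triples once. The function name `shared_nodes` and the sample triples are hypothetical. The input is assumed to be any iterable of (subject, predicate, object) tuples, which is also what iterating over an rdflib Graph yields.

```python
from collections import defaultdict


def shared_nodes(triples):
    """Return nodes linked to at least two distinct other nodes.

    `triples` is any iterable of (subject, predicate, object) tuples.
    Direction is ignored, so the three cases (subject of two objects,
    object of two subjects, object of one and subject of another)
    are all covered by a single neighbour count.
    """
    neighbours = defaultdict(set)
    for s, _p, o in triples:
        if s == o:
            continue  # a self-loop does not connect two different nodes
        neighbours[s].add(o)
        neighbours[o].add(s)
    return {node for node, linked in neighbours.items() if len(linked) >= 2}


# Hypothetical sample data standing in for the real graph.
example = [
    ("a", "p", "x"),
    ("b", "p", "x"),
    ("a", "p", "y"),
    ("x", "p", "z"),
]
print(sorted(shared_nodes(example)))
```

Because this follows the criteria literally, the hub of a cluster also qualifies, so a further filter may be needed if only the bridging nodes between clusters are wanted.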

Related

What is the best way to create a graph from a compact pandas dataframe?

I have a dataset of postal codes for each store and nearby postal codes for each. It looks like the following:
PostalCode | nearPC | Travel Time
L2L 3J9 | ['N1K 0A1', 'N1K 0A2', 'N1K 0A3', 'N1K 0A4', '... | [nan,nan,9,5,nan...]
I know I can explode the data, but that would result in many more rows (about 40M). Another preprocessing step I could perform is to remove the values in each list where the travel time is not available, but then I would also need to remove the matching entries from the nearPC list.
Is there a way to incorporate networkx to create this graph? I've tried using
G = nx.from_pandas_edgelist(df,'near_PC','PostalCode',['TravelTime'])
but I don't think it allows lists as the source or targets.
TypeError: unhashable type: 'list'
Is there a way around this? If not, how can I efficiently remove the same indices from the lists in each row based on a condition?

You've identified your problem, although you may not realize it. You have a graph with 40M edges, but you appropriately avoid the table explosion. You do have to code that explosion in some form, because your graph needs all 40M edges.
For what little trouble it might save you, I suggest that you write a simple generator expression for the edges: take one node from PostalCode, iterating through the nearPC list for the other node. Let Python and NetworkX worry about the in-line expansion.
Which NetworkX build method you call depends on the format you generate. You do slow down the processing somewhat, but the explosion details get hidden in the language syntax. Also, if there is any built-in parallelization between that generator and the NetworkX method, you'll get that advantage implicitly.
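The generator idea above could look roughly like the following sketch. The helper name `iter_edges` and the sample row are made up for illustration, and the code assumes each row carries two lists of the same length, with missing travel times stored as NaN or None.

```python
import math


def iter_edges(rows):
    """Yield (postal_code, near_code, travel_time) for every usable pair.

    `rows` is an iterable of (postal_code, near_codes, travel_times)
    tuples, one per row of the compact table.
    """
    for postal_code, near_codes, travel_times in rows:
        for near_code, travel_time in zip(near_codes, travel_times):
            # Skip pairs whose travel time is missing (None or NaN).
            if travel_time is None or math.isnan(travel_time):
                continue
            yield postal_code, near_code, travel_time


# Tiny made-up sample standing in for the real dataframe rows.
sample_rows = [
    ("L2L 3J9", ["N1K 0A1", "N1K 0A2", "N1K 0A3"], [float("nan"), 9.0, 5.0]),
]
edges = list(iter_edges(sample_rows))
print(edges)
```

With a dataframe, the rows could come from something like `zip(df["PostalCode"], df["nearPC"], df["TravelTime"])` (the column names here are guesses based on the question), and the generator can be handed straight to `G.add_weighted_edges_from(iter_edges(rows), weight="TravelTime")`, so the 40M edges are never materialized as a table.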

What's the difference between index and internal ID in neo4j?

I'm setting up my database and sometimes I'll need to use an ID. At first, I added an ID as a property to my nodes of interest, but then realized I could also just use neo4j's internal id. Then I stumbled upon CREATE INDEX ON :label(something) and wondered what exactly it does. I thought an index and the internal id would be the same thing?
This might be a stupid question, but since I'm kind of a beginner in databases, I may be missing some of these concepts.
Also, I've been reading about which kind of database to use (MySQL, MongoDB or neo4j) and decided on neo4j since my data pretty much follows a graph structure. (It will be used to build metabolic models: connections genes->proteins->reactions->compounds.)
In SQL the syntax just seemed too complex, as I had to go through several tables to make simple connections that neo4j accomplishes quite easily...
From what I understand, MongoDB stores data independently, and since my data is connected, it doesn't really seem to fit the data structure.
But again, since my knowledge of this subject is limited, perhaps I'm not making the right choice?
Thanks in advance.

Graph databases are ideal for connected data like this; they are a more natural fit for both storing and querying it than relational databases or document stores.
As far as indexes and ids, here's the index section of the docs, but the gist of it is that this has to do with how Neo4j can look up starting nodes. Neo4j only uses indexes for finding these starting nodes (though in 3.5 when we do index lookup like this, if you have ORDER BY on the indexed property, it will use the index to augment the performance of the ordering).
Here is what Neo4j will attempt to use, depending on availability, from fastest to slowest:
Lookup by internal ID - This is always quick, however we don't recommend preserving these internal ids outside the context of a query. The reason for that is that when graph elements are deleted, their ids become eligible for reuse. If you preserve the internal ids outside of Neo4j, and perform a lookup with them later, there is a chance that whatever you expected it to reference could have been deleted, and may point at nothing, or may point at some new node with completely different data.
Lookup by index - This is where you would want to use CREATE INDEX ON (or add a unique constraint, if that makes sense for your model). When you use a MATCH or MERGE using the label and property (or properties) associated with the index, then this is a fast and direct lookup of the node(s) you want.
Lookup by label scan - If you perform a MATCH with a label present in the pattern, but no means to use an index (either no index present for the label/property combination, or only a label is present but no property), then a label scan will be performed, and every node of the given label will be matched to and filtered. This becomes more expensive as more nodes with those labels are added.
All nodes scan - If you do not supply any label in your MATCH pattern, then every node in your db will be scanned and filtered. This is very expensive as your db grows.
You can EXPLAIN or PROFILE a query to see its query plan, which will show you which means of lookup are used to find the starting nodes, and the rest of the operations for executing the query.
Once a starting node or nodes are found, then Neo4j uses relationship traversal and filtering to expand and find all paths matching your desired pattern.

Create a connected graph of common DBpedia entities

My problem is this: say I have 4 entities: Renoir, Newton, Leibniz and Pissarro. I need to create a connected graph of all entities common to them from the DBpedia Ontology.
Example: This is a connected graph between Renoir and Pissarro from DBpedia. The nodes in between are the DBpedia schemas common to both. See image: http://postimg.org/image/6037y9lu1/
We need such a graph between all 4: Renoir, Newton, Leibniz and Pissarro.
http://postimg.org/image/vud0o1lu1/
How should this be done?
I'm a novice with DBpedia, R, or anything related. Any help is useful.
My objective in doing this is to find transitive connections between entities at a conceptual level.

Have you tried RelFinder? (http://www.visualdataweb.org/relfinder/relfinder.php) It serves precisely this purpose. I attach the graph I obtained when I entered the four entities in your example:
As you can see, if you want to find a connection between them at a conceptual level you should aim for the "influencedBy"/"influences" relationship.

Tree of trees? Table of trees? What kind of data structure have I created?

I am creating a python module that creates and operates on data structures to store lots of semantically tagged data and metadata from real experiments. So in an experiment you have:
subjects
treatments
replicates
Enclosing these 3 categories is the experiment, and combinations of the three categories are what I am calling "units". Now there is no inherently correct hierarchy between the 3 (table-like) but for certain analyses it is useful to think of a certain permutation of the 3 as a hierarchy,
e.g. (subjects-->(treatments-->(replicates)))
or
(replicates-->(treatments-->(subjects)))
Moreover, when collecting data, files will be copy-pasted into a folder on a desktop, so data is at least coming in as a tree. I have thought a lot about which hierarchy is "better" but I keep coming up with use cases for most of the 6 possible permutations. I want my module to be flexible, so that the user can think of the experiment or collect the data using whatever hierarchy, table, or hierarchy-table hybrid makes sense to them.
Also, the "units" (or table entries) are containers for arbitrary amounts of data (bytes to gigabytes, ideally whatever) of any organizational complexity. This is why I didn't think a relational database approach was really the way to go, and why a NoSQL-type solution makes more sense. But then I have the problem of how to order the three categories if none is "correct".
So my question is what is this multifaceted data structure?
Does some sort of fluid data structure or set of algorithms exist to easily inter-convert or produce structured views?

The short answer is that HDF5 addresses these fairly common concerns and I would suggest it. http://www.hdfgroup.org/HDF5/
In Python, http://docs.h5py.org/en/latest/high/group.html and http://odo.pydata.org/en/latest/hdf5.html will help.
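Whatever the storage format, the "no correct hierarchy" problem from the question can be illustrated in plain Python. This is only a sketch of one possible design, not part of HDF5 or h5py: keep the units flat, keyed by a (subject, treatment, replicate) tuple, and derive a nested view for whichever ordering an analysis needs. The function name `nested_view` and the sample labels are hypothetical.

```python
def nested_view(units, order):
    """Build a nested dict from flat units, nesting by the given order.

    `units` maps (subject, treatment, replicate) tuples to data;
    `order` is a permutation of the three category names.
    """
    categories = ("subject", "treatment", "replicate")
    positions = [categories.index(name) for name in order]
    tree = {}
    for key, data in units.items():
        path = [key[i] for i in positions]
        branch = tree
        # Walk or create the intermediate levels, then store the data.
        for label in path[:-1]:
            branch = branch.setdefault(label, {})
        branch[path[-1]] = data
    return tree


# Made-up units; real entries could be arrays or file handles instead.
units = {
    ("s1", "t1", "r1"): "data-a",
    ("s1", "t2", "r1"): "data-b",
    ("s2", "t1", "r1"): "data-c",
}
by_subject = nested_view(units, ("subject", "treatment", "replicate"))
by_replicate = nested_view(units, ("replicate", "treatment", "subject"))
print(by_subject)
print(by_replicate)
```

The flat mapping stays the single source of truth, and each of the 6 permutations is just a cheap derived view.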

Graphical Visualization of XML data

I have an XML file that looks like this:
<rebase>
<Organism>
<Name>Aminomonas paucivorans</Name>
<Enzyme>M1.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
<Enzyme>M2.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
</Organism>
<Organism>
<Name>Bacillus cellulosilyticus</Name>
<Enzyme>M1.BceNI</Enzyme>
<Motif>CCCNNNNNCTC</Motif>
<Enzyme>M2.BceNI</Enzyme>
<Motif>CCCNNNNNCTC</Motif>
</Organism>
</rebase>
I want to visualize this XML data as a graph. The connectivity is such that many enzymes can share common motifs, but no organisms can have similar enzymes. I looked at d3.js but I don't think it has what I'm looking for. I was really excited by the visualization neo4j seems to provide, but I will need to learn it from scratch. However, I haven't come across any good tutorials for importing or creating a graph in neo4j from XML datasets. I know that in the world of programming anything is possible, so I wanted to know the possible ways I could import my data (preferably using Python) into a neo4j database to visualize it.
UPDATE
I tried following this answer (second answer under this question). I created the 2 CSV files that he suggested. However, the query has a lot of syntax errors, such as:
Invalid input 'S': expected 'n/N' (line 6, column 2)
"USING PERIODIC COMMIT"
WITH is required between CREATE and LOAD CSV (line 6, column 1)
"MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})"
My Cypher query skills are extremely limited and I couldn't get any imports to work, so fixing the query by myself is proving to be really difficult. Any help will be greatly appreciated.

There is also a series of posts on how to import XML into Neo4j:
http://supercompiler.wordpress.com/2014/07/22/navigating-xml-graph-using-cypher/
http://supercompiler.wordpress.com/2014/04/06/visualizing-an-xml-as-a-graph-neo4j-101/
First you should model what your data should look like as a graph: which entities you need for your use cases, and which semantic connections.
In general if you can load the data in python, you can use py2neo or neo4jrestclient (see https://neo4j.com/developer/python/) to import your data into your model.

For this I would suggest using Gephi directly. At least a year ago it worked flawlessly; it supports direct import of XML/CSV data, so there is no need to use Neo4j as a pre-processor.
Edit
Oh, I see now; I thought the connections were already included. In this case, you must create a separate node for each item in the XML: a new node for each enzyme and each motif, and also for each organism (with a name property). The enzyme and motif nodes must be unique, i.e. no duplicates. When creating an organism node, you connect the organism to its enzyme and motif nodes by relationships. After this is done, querying/visualizing similar nodes is no problem, since similar organisms share at least one of the enzyme/motif nodes.
I don't know of any smart way to import XML data into Neo4j, but you should have no problem converting it into two CSV files. The format of the CSV files would then be:
first file:
name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
second file (I don't understand why the motif is duplicated, though):
name,motif
Aminomonas paucivorans,GGAGNNNNNGGC
Aminomonas paucivorans,GGAGNNNNNGGC
Bacillus cellulosilyticus,CCCNNNNNCTC
Bacillus cellulosilyticus,CCCNNNNNCTC
Now we do the import, which creates unique nodes and relationships (thus the duplicated motifs above would turn into just one unique relationship; if necessary, it is possible to have multiple relationships to the same motif node, too).
I'm not sure about this import, but it should work:
// Run the two statements separately, or end each one with a semicolon.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
// MERGE creates a node or relationship only if it does not already exist.
MERGE (o:Organism { name: csvLine.name })
MERGE (e:Enzyme { name: csvLine.enzyme })
MERGE (o)-[:has_enzyme]->(e);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///file2.csv" AS csvLine
MERGE (o:Organism { name: csvLine.name })
MERGE (m:Motif { name: csvLine.motif })
MERGE (o)-[:has_motif]->(m);
This should create the graph with 2 organism nodes, 4 enzyme nodes and 2 motif nodes. Each organism node should then have a relationship to its enzymes and motifs. After this is done, you can move on to the visualization part described at the beginning.
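The XML-to-CSV conversion described above can be sketched with the Python standard library alone. The function names `organism_rows` and `to_csv` are hypothetical, the element names are taken from the sample in the question, and the sketch assumes every Organism element has exactly one Name child.

```python
import csv
import io
import xml.etree.ElementTree as ET


def organism_rows(xml_text):
    """Return (name, enzyme) rows and (name, motif) rows from the XML."""
    root = ET.fromstring(xml_text)
    enzyme_rows, motif_rows = [], []
    for organism in root.iter("Organism"):
        name = organism.findtext("Name")
        enzyme_rows += [(name, e.text) for e in organism.findall("Enzyme")]
        motif_rows += [(name, m.text) for m in organism.findall("Motif")]
    return enzyme_rows, motif_rows


def to_csv(header, rows):
    """Serialise rows to CSV text with the given header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Shortened copy of the XML from the question.
sample = """<rebase>
<Organism>
<Name>Aminomonas paucivorans</Name>
<Enzyme>M1.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
<Enzyme>M2.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
</Organism>
</rebase>"""

enzyme_rows, motif_rows = organism_rows(sample)
print(to_csv(("name", "enzyme"), enzyme_rows))
print(to_csv(("name", "motif"), motif_rows))
```

Writing the returned text to file1.csv and file2.csv gives the two inputs expected by the import.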
