I have an XML file that looks like this:
<rebase>
<Organism>
<Name>Aminomonas paucivorans</Name>
<Enzyme>M1.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
<Enzyme>M2.Apa12260I</Enzyme>
<Motif>GGAGNNNNNGGC</Motif>
</Organism>
<Organism>
<Name>Bacillus cellulosilyticus</Name>
<Enzyme>M1.BceNI</Enzyme>
<Motif>CCCNNNNNCTC</Motif>
<Enzyme>M2.BceNI</Enzyme>
<Motif>CCCNNNNNCTC</Motif>
</Organism>
</rebase>
I want to visualize this XML data in a graphical format. The connectivity is such that many enzymes can share common motifs, but no two organisms have the same enzymes. I looked at d3.js but I don't think it has what I'm looking for. I was really excited by the visualization Neo4j seems to provide, but I would need to learn it from scratch. However, I haven't come across any good tutorials for importing or creating a graph in Neo4j from XML datasets. I know that in the world of programming anything is possible, so I wanted to know the possible ways I could import my data (preferably using Python) into a Neo4j database to visualize it.
UPDATE
I tried following this answer (the second answer under this question). I created the two CSV files it suggests. However, the query produces a lot of syntax errors, such as:
Invalid input 'S': expected 'n/N' (line 6, column 2)
"USING PERIODIC COMMIT"
WITH is required between CREATE and LOAD CSV (line 6, column 1)
"MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})"
My Cypher query skills are extremely limited and I couldn't get any imports to work, so fixing the query by myself is proving to be really difficult. Any help will be greatly appreciated.
There is also a series of posts on how to import XML into Neo4j:
http://supercompiler.wordpress.com/2014/07/22/navigating-xml-graph-using-cypher/
http://supercompiler.wordpress.com/2014/04/06/visualizing-an-xml-as-a-graph-neo4j-101/
First you should model what your data should look like as a graph: which entities you need for your use cases and which semantic connections between them.
In general, if you can load the data in Python, you can use py2neo or neo4jrestclient (see https://neo4j.com/developer/python/) to import your data into your model.
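For example, a minimal py2neo sketch (a rough illustration, assuming a local Neo4j instance reachable over Bolt and the XML saved as rebase.xml - adjust the URI, credentials and file name to your setup):

import xml.etree.ElementTree as ET
from py2neo import Graph, Node, Relationship

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
root = ET.parse("rebase.xml").getroot()

for organism in root.findall("Organism"):
    # one unique node per organism, keyed on its name
    org_node = Node("Organism", name=organism.findtext("Name"))
    graph.merge(org_node, "Organism", "name")
    for enzyme in organism.findall("Enzyme"):
        enz_node = Node("Enzyme", name=enzyme.text)
        graph.merge(enz_node, "Enzyme", "name")   # no duplicate enzyme nodes
        graph.merge(Relationship(org_node, "HAS_ENZYME", enz_node))
    for motif in organism.findall("Motif"):
        motif_node = Node("Motif", name=motif.text)
        graph.merge(motif_node, "Motif", "name")  # motifs shared across organisms collapse into one node
        graph.merge(Relationship(org_node, "HAS_MOTIF", motif_node))

Once the nodes and relationships are in, the Neo4j browser can show which motif nodes are shared between organisms.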
For this I would suggest using Gephi directly. At least a year ago it worked flawlessly; it supports XML/CSV data import directly, and there is no need to use Neo4j as a pre-processor.
edit
Oh, I see now; I thought the connections were already included. In this case, you must create all the data from the XML as separate nodes - a new node for each enzyme and motif, and also one for each organism (with a name parameter). The enzyme and motif nodes must be unique, i.e. no duplicates. When creating an organism node, you connect the organism to its enzyme and motif nodes by relationships. After this is done, querying/visualizing similar nodes is no problem, since related organisms share at least one enzyme/motif node.
I don't know of any smart way to import XML into Neo4j, but you should have no problem converting it into two CSV files. The format of those CSV files would then be:
first file:
name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
second file (I don't understand why the motif is duplicated, though):
name,motif
Aminomonas paucivorans,GGAGNNNNNGGC
Aminomonas paucivorans,GGAGNNNNNGGC
Bacillus cellulosilyticus,CCCNNNNNCTC
Bacillus cellulosilyticus,CCCNNNNNCTC
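For the XML-to-CSV conversion itself, a quick Python sketch (assuming the XML from the question is saved as rebase.xml) could look like this:

import csv
import xml.etree.ElementTree as ET

root = ET.parse("rebase.xml").getroot()

with open("file1.csv", "w", newline="") as f1, open("file2.csv", "w", newline="") as f2:
    enzymes = csv.writer(f1)
    motifs = csv.writer(f2)
    enzymes.writerow(["name", "enzyme"])
    motifs.writerow(["name", "motif"])
    for organism in root.findall("Organism"):
        name = organism.findtext("Name")
        for enzyme in organism.findall("Enzyme"):
            enzymes.writerow([name, enzyme.text])
        for motif in organism.findall("Motif"):
            motifs.writerow([name, motif.text])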
Now we do the import, which creates unique nodes and relationships (so the duplicated motifs above turn into just one unique relationship; if necessary, it is also possible to have multiple relationships to the same motif node).
(I'm not 100% sure about the exact syntax, but something like this should work. Note that the two LOAD CSV statements have to be run as two separate queries - pasting them together as one statement is what causes errors like "WITH is required between CREATE and LOAD CSV".)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///file1.csv" AS csvLine
// MERGE (rather than MATCH + CREATE) creates a node or relationship only if it does not exist yet
MERGE (o:Organism { name: csvLine.name })
MERGE (e:Enzyme { name: csvLine.enzyme })
MERGE (o)-[:has_enzyme]->(e)
and then, as a second, separate query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///file2.csv" AS csvLine
MERGE (o:Organism { name: csvLine.name })
MERGE (m:Motif { name: csvLine.motif })
MERGE (o)-[:has_motif]->(m)
This should create a graph with 2 organism nodes, 4 enzyme nodes and 2 motif nodes. Each organism node will then have a relationship to its enzymes and motifs. After this is done, you can move on to the visualization part described at the beginning.
Related
I have multiple node- and edge-lists which form a large graph; let's call it the maingraph. My current strategy is to first read all the node lists and import them with add_vertices. Every node then gets an internal id which depends on the order in which they are ingested, and is therefore not very reliable (as I've read, if you delete one node, all ids higher than the deleted one change). I assign every node a name attribute corresponding to the external ID I use, so I can keep track of my nodes between frameworks, plus a type attribute.
Now, how do I add the edges? When I read an edge list it will start a new graph (subgraph) and hence the internal IDs start at 0 again. Therefore, "merging" the graphs with maingraph.add_edges(subgraph.get_edgelist()) inevitably fails.
It is possible to work around this and use the name attribute from both maingraph and subgraph to find out which internal ID each edge's incident nodes have in the maingraph:
def _get_real_source_and_target_id(edge):
    '''Takes an edge from the to-be-added subgraph and returns the ids of the
    corresponding nodes in the maingraph, looked up by their name.'''
    source_id = maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index
    target_id = maingraph.vs.select(name_eq=subgraph.vs[edge[1]]["name"])[0].index
    return (source_id, target_id)
And then I tried
edgelist = [_get_real_source_and_target_id(x) for x in subgraph.get_edgelist()]
maingraph.add_edges(edgelist)
But that is horribly slow. The graph has millions of nodes and edges, which take 10 seconds to load with the fast but incorrect maingraph.add_edges(subgraph.get_edgelist()) approach. With the correct approach explained above, it takes minutes (I usually stop it after 5 minutes or so), and I will have to do this tens of thousands of times. I switched from NetworkX to igraph because of the fast loading, but it doesn't really help if I have to do it like this.
Does anybody have a more clever way to do this? Any help much appreciated!
Thanks!
Never mind, I figured out that the mistake was elsewhere. I used numpy.loadtxt() to read the nodes' names as strings, which somehow did funny stuff when the names were incrementing numbers with more than five figures (see my issue report here). The solution above therefore got stuck when it tried to fetch a node whose name numpy had mangled: maingraph.vs.select(name_eq=subgraph.vs[edge[0]]["name"])[0].index simply sat there when it couldn't find the node. Now I use pandas to read the node names and it works fine.
The solution above is still ~10x faster than my previous NetworkX solution, so I will just leave it here in case it helps someone. Feel free to delete it otherwise.
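(For reference, the per-edge select() lookup can also be avoided entirely by building a name-to-index dictionary once per merge - a rough sketch, assuming node names are unique within the maingraph:)

# Build the lookup table once: name -> internal id in the maingraph.
name_to_id = {v["name"]: v.index for v in maingraph.vs}

# Translate the subgraph's edge list in one pass, then add all edges in bulk.
edgelist = [
    (name_to_id[subgraph.vs[src]["name"]], name_to_id[subgraph.vs[dst]["name"]])
    for src, dst in subgraph.get_edgelist()
]
maingraph.add_edges(edgelist)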
I have a graph of RDF data that is the result of a SPARQL query in rdflib, but this question applies just as well to any endpoint. The graph looks like the picture below.
I want to find a way to query the nodes that are shared between two clusters. Those are basically the nodes that are:
Subject to two objects
Object to two subjects
Object to a subject, and, then subject to another object
I tried Graph.subjects() and Graph.objects() in rdflib, but it seems to me that they are only iterables, so I would have to iterate the whole graph three times, once for each of the above scenarios, which would result in a lot of double counting.
I was wondering if anyone has an idea on how to do this in a better way, perhaps within SPARQL to begin with.
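One rough way to express all three cases in a single query, so the data is only traversed once, might be the following (a sketch against rdflib; data.ttl is a hypothetical file standing in for the actual graph):

import rdflib

g = rdflib.Graph()
g.parse("data.ttl")  # hypothetical input; in the original setting g is already the result graph

query = """
SELECT DISTINCT ?node WHERE {
  { ?node ?p1 ?o1 . ?node ?p2 ?o2 . FILTER(?o1 != ?o2) }   # subject to two objects
  UNION
  { ?s1 ?p1 ?node . ?s2 ?p2 ?node . FILTER(?s1 != ?s2) }   # object to two subjects
  UNION
  { ?s ?p ?node . ?node ?q ?o . }                          # object of one triple, subject of another
}
"""

# DISTINCT keeps each shared node once, even if it matches several branches.
for row in g.query(query):
    print(row.node)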
I just need an algorithm to solve the following problem in an efficient manner.
I have tuples with combinations of tags which usually come together. For example:
(python, django, flask, numpy),
(java, spring),
(mysql, sql, join),
(javascript, angularjs, ajax, deferred)
Now I have two requirements:
I need to form different categories from the given data.
Given a new tag or tuple of tags, I need to find the probability of this tag coming together with all the other distinct tags in the data.
For example:
Say the new tuple is (nodejs, ajax);
then the probabilities might be
(nodejs, ajax) - (javascript, angularjs, ajax, deferred) - .60
(nodejs, ajax) - (mysql, sql, join) - .20
(nodejs, ajax) - (java, spring) - .20
etc
How should I go about solving this?
I would suggest treating this as a graph problem: tags are nodes, and the number of occurrences of, say, (tag1, tag2) is the weight of the edge between the tag1 and tag2 nodes. You can then generate recommended tags using a nearest-neighbour algorithm or even community detection (which tags are always co-mentioned together).
With a well-constructed graph, enough initial data and some normalisation, I think it would be possible to output, say, the probability of a link between cluster1 = (tag1, tag2) and cluster2 = (tag3, tag4, tag5).
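As a starting point, the weighted co-occurrence graph can be built in a few lines of plain Python (a sketch using only the standard library; the tuples come from the question):

from collections import Counter
from itertools import combinations

tag_tuples = [
    ("python", "django", "flask", "numpy"),
    ("java", "spring"),
    ("mysql", "sql", "join"),
    ("javascript", "angularjs", "ajax", "deferred"),
]

# Edge weight = number of tuples in which the two tags appear together.
edge_weights = Counter()
for tags in tag_tuples:
    for a, b in combinations(sorted(set(tags)), 2):
        edge_weights[(a, b)] += 1

print(edge_weights.most_common(5))

These weighted edges can then be fed into igraph/networkx for community detection or nearest-neighbour queries.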
So, the approach that best solved this problem was the Apriori algorithm. It provides association rules for a transactional database (treating every row as a transaction).
Below is a link to a very simple tutorial with an implementation:
http://aimotion.blogspot.com/2013/01/machine-learning-and-data-mining.html
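For a concrete implementation, the mlxtend library (one option among several; it is not the one used in the tutorial above) exposes Apriori and association-rule mining directly. A small sketch with the tag tuples from the question:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Each tag tuple is treated as one transaction.
transactions = [
    ["python", "django", "flask", "numpy"],
    ["java", "spring"],
    ["mysql", "sql", "join"],
    ["javascript", "angularjs", "ajax", "deferred"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets, then derive association rules from them.
frequent_itemsets = apriori(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print(rules[["antecedents", "consequents", "confidence"]])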
I am new to Python. Recently, I got a project that processes a huge amount of health data in XML files.
Here is an example:
In my data there are about 100 of them, and each one has a different id, origin, type and text. I want to store all of them so that I can train on this dataset. The first idea in my mind was to use a 2D array (one storing id and origin, the other storing text). However, I found there are too many features, and I want to know which features belong to each document.
Could anyone recommend a good way to do it?
For scalability, simplicity and maintenance, you should normalise the data, build a database schema and move it into a database (sqlite, postgres, mysql, whatever).
This will move complicated data logic out of Python. This is typical Model-View-Controller practice.
Creating a Python dictionary and traversing it is quick and dirty, but it will become a huge technical time sink very soon if you want to make practical sense of the data.
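As a rough illustration of that normalisation (the table and column names are guesses based on the fields mentioned in the question - id, origin, type and text):

import sqlite3

conn = sqlite3.connect("documents.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        doc_id   TEXT PRIMARY KEY,
        origin   TEXT,
        doc_type TEXT,
        body     TEXT
    )
""")

# Hypothetical records parsed from the XML: (id, origin, type, text).
records = [("doc-001", "clinic-a", "discharge-note", "...")]
conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?, ?)", records)
conn.commit()

From here each document's features stay attached to its id, and the training set can be pulled out with ordinary SELECT queries.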
I am trying to create an interface between structured data and NLTK. NLP libraries generally work with bags of words, hence I need to turn my structured data into bags of words.
I need to associate the offset of a word with its meta-data. Therefore my best bet is to have some sort of container that holds ranges as keys (allowing nested ranges) and can retrieve all the meta-data (multiple entries if the word offset is part of a nested range).
What code can I pick up that would do this efficiently (i.e., a sparse representation of the data)? Efficient because my global corpus will be at least a few hundred megabytes.
Note:
I am serialising structured forum posts, which will include posts with sections of quotes in them. I want to know which topic a word belonged to, and whether it is a quote or user text. There will probably be additional metadata as my work progresses. Note that a word belonging to a quote is what I meant by nested meta-data: the word is part of a quote, which belongs to a post made by a user.
I know that one can tag words in NLTK, but I haven't looked into it; if it is possible to do what I want that way, please comment. However, I am still looking for the original approach.
There is probably something in numpy that can solve my problem; I am looking at that now.
edit
The input data is far too complex to rip out and post. I have found what I was looking for, though: http://packages.python.org/PyICL/. I needed to talk about intervals and not ranges :D I have used Boost extensively, but making it a dependency makes me a bit uneasy (sadly, I am getting compiler errors with PyICL :( ).
The question now is: does anyone know an interval container library or data structure that can be used to index nested intervals in a sparse fashion? Or, put differently, one that provides semantics similar to boost.icl.
If you don't want to use PyICL or boost.icl, then instead of relying on a specialized library you could just use sqlite3 to do the job. If you use an in-memory database it will still be a few orders of magnitude slower than boost.icl (from experience coding other data structures vs sqlite3), but it should be more effective than a C++ std::vector-style approach on top of Python containers.
You can use two integers and have a date_type_low < offset < date_type_high predicate in your WHERE clause. Depending on your table structure, this will return nested/overlapping ranges.
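A minimal sketch of that idea (the table and column names are made up; low/high are the interval bounds as character offsets):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE intervals (low INTEGER, high INTEGER, metadata TEXT)")
conn.execute("CREATE INDEX idx_low_high ON intervals (low, high)")

# Nested intervals: a quote sitting inside a larger post.
conn.executemany(
    "INSERT INTO intervals VALUES (?, ?, ?)",
    [(0, 500, "post by user A"), (120, 180, "quote of user B inside that post")],
)

offset = 150
rows = conn.execute(
    "SELECT metadata FROM intervals WHERE low <= ? AND ? <= high",
    (offset, offset),
).fetchall()
print(rows)  # returns both the post and the nested quote for this offset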