Parsing py2neo paths into Pandas - python

We are returning paths from a Cypher query using py2neo and would like to parse the result into a Pandas DataFrame. The Cypher query is similar to the following:
query = '''MATCH p=allShortestPaths((p1:Type1)-[r*..3]-(p2:Type1))
WHERE p1.ID = 123456
RETURN distinct(p)'''
result = graph.run(query)
The result is a walkable object that can be traversed. Note that the Nodes and Relationships don't all have the same properties.
What would be the most Pythonic way to iterate over the object? Is it necessary to process the entire path, or, since the object is a dictionary, is it possible to use the pandas DataFrame.from_dict method? One complication is that the paths are not always the same length.
Currently we enumerate the object: if the index is even, the item is a Node; otherwise we process it as a Relationship.
for index, item in enumerate(paths):
    if index % 2 == 0:
        pass  # process as Node
    else:
        pass  # process as Relationship
We can also use isinstance, i.e.
if isinstance(item, py2neo.types.Node):
    pass  # process as Node
But that still requires processing every element separately.

I solved the problem as follows:
I wrote a function that receives a list of paths together with the property names of the nodes and relationships:
from collections import OrderedDict
from py2neo import Node, Relationship

def neo4j_graph_to_dict(paths, node_properties, rels_properties):
    paths_dict = OrderedDict()
    for (pathID, path) in enumerate(paths):
        paths_dict[pathID] = {}
        for (i, node_rel) in enumerate(path):
            n_properties = [node_rel[np] for np in node_properties]
            r_properties = [node_rel[rp] for rp in rels_properties]
            if isinstance(node_rel, Node):
                # "<first label>: <property>: <value>| ..."
                node_format = [np + ': {}|' for np in node_properties]
                paths_dict[pathID]['Node' + str(i)] = ('{}: ' + ' '.join(node_format)).format(list(node_rel.labels())[0], *n_properties)
            elif isinstance(node_rel, Relationship):
                # "<relationship type>: <property>: <value>| ..."
                rel_format = [rp + ': {}|' for rp in rels_properties]
                reltype = 'Rel' + str(i - 1)
                paths_dict[pathID][reltype] = ('{}: ' + ' '.join(rel_format)).format(node_rel.type(), *r_properties)
    return paths_dict
Assuming the query returns the paths, nodes and relationships we can run the following code:
import pandas as pd
from py2neo import walk

query = '''MATCH paths=allShortestPaths(
    (pr1:Type1 {ID:'123456'})-[r*1..9]-(pr2:Type2 {ID:'654321'}))
RETURN paths, nodes(paths) as nodes, rels(paths) as rels'''
df_qf = pd.DataFrame(graph.data(query))
node_properties = set([k for series in df_qf.nodes for node in series for k in node.keys()])  # unique Node property names
rels_properties = set([k for series in df_qf.rels for rel in series for k in rel.keys()])  # unique Relationship property names
wg = [walk(path) for path in df_qf.paths]
paths_dict = neo4j_graph_to_dict(wg, node_properties, rels_properties)
df = pd.DataFrame(paths_dict).transpose()
df = pd.DataFrame(df, columns=paths_dict[0].keys()).drop_duplicates()
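One way to deal with paths of unequal length, sketched here without a database: treat each walked path as an alternating node/relationship sequence, number the positions, and pad the shorter rows. The sample paths and the paths_to_rows helper are made up for illustration, and the columns are numbered Node0, Rel0, Node1, ... rather than with the keys used by the function above.

```python
def paths_to_rows(paths):
    """Flatten alternating node/relationship sequences into one dict per path."""
    rows = []
    for path in paths:
        row = {}
        for i, element in enumerate(path):
            # even positions hold nodes, odd positions hold relationships
            kind = 'Node' if i % 2 == 0 else 'Rel'
            row['{}{}'.format(kind, i // 2)] = element
        rows.append(row)
    # use the longest row as the column template and pad the others with None
    columns = list(max(rows, key=len)) if rows else []
    return [{c: row.get(c) for c in columns} for row in rows]


# hypothetical walked paths: node, relationship, node, ...
paths = [
    ['a', 'KNOWS', 'b'],
    ['a', 'KNOWS', 'c', 'LIKES', 'd'],
]
rows = paths_to_rows(paths)
```

A list of dicts like rows can be passed to pd.DataFrame directly; pandas fills missing keys with NaN on its own, so the padding step is optional there.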

Related

List vs Dict for list of objects with an ID

Firstly, speed is not a massive issue here as the length of lists is relatively small. I'm more interested in style, and code-economy.
I have a graph (nodes and edges) where I need to store data for each node. I use a class like this:
class Node:
    def __init__(self,node_id,name,edges,[more data]):
        self.node_id = node_id
        self.name = name
        # etc.
My nodes are then (currently) read from a file and put into a list, like this:
import ast

with open("filepath.txt") as f:
    content = f.readlines()
nodes = []
for line in content:
    lst = ast.literal_eval(line)
    nodes.append(Node(lst[0],lst[1],lst[2]...))
I don't really use the position of a node in the list nodes to mean anything; the node is always identified by node_id which is uniquely determined previously.
This means if I want to get the attribute someData from the node with node_id of 7, say, I have to use:
for n in nodes:
    if n.node_id == 7:
        print(n.someData)
which seems awfully inefficient.
So, I decided to use a dictionary, removing node_id from the Node class and using it as the key instead. A dictionary seems like the 'correct' structure to use, surely? However, in many places this has made my code worse!
For example, where before I had:
sumTotal = sum(n.someData for n in nodes)
I now have to use:
sumTotal = sum(nodes[k].someData for k in nodes)
or
sumTotal = sum(n.someData for n in nodes.values())
Am I missing something here? What would be the best practice for this type of data?
If the node_id is a unique key, you can do this:
nodes = {}
for line in content:
    lst = ast.literal_eval(line)
    nodes[lst[0]] = Node(lst[0],lst[1],lst[2]...)
And if you need to do anything with them later it will be faster and cleaner:
print(nodes[7].someData)
You will have to do something like this to get the sum though:
sumTotal = sum(nodes[k].someData for k in nodes)
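A small sketch of the dictionary approach, assuming a simplified Node with only three fields and made-up records. Keeping node_id on the object while also using it as the key means that iterating over nodes.values() reads much like the original list version.

```python
class Node:
    def __init__(self, node_id, name, someData):
        self.node_id = node_id
        self.name = name
        self.someData = someData


# hypothetical parsed lines: (node_id, name, someData)
records = [(7, 'seven', 10), (3, 'three', 5)]

# key by node_id, but keep node_id on the object as well
nodes = {rec[0]: Node(*rec) for rec in records}

looked_up = nodes[7].someData                        # direct lookup by id
sumTotal = sum(n.someData for n in nodes.values())   # iterate over the objects
```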

py2neo - How can I use merge_one function along with multiple attributes for my node?

I have avoided creating duplicate nodes in my DB by using the merge_one function, which works like this:
t=graph.merge_one("User","ID","someID")
This creates the node with a unique ID. My problem is that I can't find a way to add further attributes/properties (a date, for example) to my node along with the ID, which is added automatically.
I have managed to achieve this the old "duplicate" way but it doesn't work now since merge_one can't accept more arguments! Any ideas???
Graph.merge_one only allows you to specify one key-value pair because it's meant to be used with a uniqueness constraint on a node label and property. Is there anything wrong with finding the node by its unique id with merge_one and then setting the properties?
t = graph.merge_one("User", "ID", "someID")
t['name'] = 'Nicole'
t['age'] = 23
t.push()
I know I am a bit late... but still useful I think
Using py2neo==2.0.7 and the docs (about Node.properties):
... and the latter is an instance of PropertySet which extends dict.
So the following worked for me:
m = graph.merge_one("Model", "mid", MID_SR)
m.properties.update({
    'vendor':"XX",
    'model':"XYZ",
    'software':"OS",
    'modelVersion':"",
    'hardware':"",
    'softwareVesion':"12.06"
})
graph.push(m)
This hacky function iterates through the labels, properties and values, gradually eliminating every node that doesn't match each criterion submitted. The final result is a list of all (if any) nodes that match all the properties and labels supplied.
def find_multiProp(graph, *labels, **properties):
    # look up nodes by one label and one key/value pair
    genNodes = lambda l,k,v: graph.find(l, property_key=k, property_value=v)
    results = None
    for l in labels:
        for k,v in properties.items():
            if results is None:
                results = [r for r in genNodes(l,k,v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l,k,v) if n in prevResults]
    return results
The final result can be used to assess uniqueness and (if empty) create a new node, by combining the two functions together...
def merge_one_multiProp(graph, *labels, **properties):
    r = find_multiProp(graph, *labels, **properties)
    if not r:
        # remove tuple association
        node,= graph.create(Node(*labels, **properties))
    else:
        node = r[0]
    return node
example...
from py2neo import Node, Graph
graph = Graph()
properties = {'p1':'v1', 'p2':'v2'}
labels = ('label1', 'label2')
graph.create(Node(*labels, **properties))
for l in labels:
    graph.create(Node(l, **properties))
graph.create(Node(*labels, p1='v1'))
node = merge_one_multiProp(graph, *labels, **properties)
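The matching rule that find_multiProp applies (keep only nodes that carry every label and every key/value pair) can be illustrated without a database. The snippet below is an in-memory stand-in, not py2neo API: the record layout and the find_multi_prop name are assumptions made for this sketch, and the four records mirror the four nodes created in the example above.

```python
def find_multi_prop(nodes, labels, properties):
    """Return the records that carry every label and every key/value pair."""
    wanted = set(labels)
    return [
        n for n in nodes
        if wanted <= n['labels']
        and all(n['props'].get(k) == v for k, v in properties.items())
    ]


# hypothetical records mirroring the four nodes created in the example
nodes = [
    {'labels': {'label1', 'label2'}, 'props': {'p1': 'v1', 'p2': 'v2'}},
    {'labels': {'label1'}, 'props': {'p1': 'v1', 'p2': 'v2'}},
    {'labels': {'label2'}, 'props': {'p1': 'v1', 'p2': 'v2'}},
    {'labels': {'label1', 'label2'}, 'props': {'p1': 'v1'}},
]
matches = find_multi_prop(nodes, ('label1', 'label2'), {'p1': 'v1', 'p2': 'v2'})
```

Only the first record satisfies every criterion, so a merge built on this rule would reuse it rather than create a new node.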

Error when searching trees iteratively

I need to search a tree by checking if the sum of the branches from a node is greater than zero. However, I'm running into a problem with the sum - I get a type error (int object is not callable) on the
branch_sum = [t[0] for t in current]
line. I thought it was because eventually I'll get a single node
current = [[1,'b']]
(for example), and so I added the if/else statement. I.e. I thought that I was trying to sum something that looked like this:
first = [1]
However, the problem still persists. I'm unsure of what could be causing this.
For reference, current is a list of lists; in each inner list the first slot is the node data and the second slot is the node id. The group() function groups the data on a node based on the ids of the sub-nodes (left subnodes have ids beginning with 1, right have ids beginning with 0).
The tree I'm searching is stored as a list of lists like:
tree = [[0, '1'], [1,'01'], [0,'001']]
i.e. it's a set of Huffman Codes.
from collections import deque
def group(items):
    right = [[item[0],item[1][1:]] for item in items if item[1].startswith('1')]
    left = [[item[0],item[1][1:]] for item in items if item[1].startswith('0')]
    return left, right
def search(node):
    loops = 0
    to_crawl = deque(group(node))
    while to_crawl:
        current = to_crawl.popleft() # this is the left branch of the tree
        branch_sum = 0
        if len(current)==1:
            branch_sum = sum([t for t in current])
        else:
            branch_sum = sum([t[0] for t in current])
        if branch_sum !=0 :
            l,r = group(current)
            to_crawl.extendleft(r)
            to_crawl.extendleft(l)
        loops += 1
    return loops
Here's what I'm trying to do:
Given a tree, with a lot of the data being 0, find the 1. To do this, split the tree into two branches (via the group() function) and push them onto the deque. Pop a branch off the deque, then sum the data in the branch. If the sum is not zero, split the branch into two sub-branches and push the sub-branches onto the deque. Keep on doing this until I've found the non-zero datum. I should end up with a single item of the form [1,'101'] in the deque when I exit.
I strongly assume that the error says
TypeError: 'int' object is not iterable
because you end up passing a 2-tuple as node to
to_crawl = deque(group(node))
which gives you a 2-element deque. Then
current = to_crawl.popleft()
gives you a single element (an integer) as current. This is clearly not iterable, which leads to the given error.
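It may also be worth checking what extendleft does with the branches in the question's loop. The sketch below uses a made-up branch and only shows standard deque behaviour: extendleft pushes each element of its argument separately, while appendleft pushes the argument as a single entry.

```python
from collections import deque

branches = deque()
left = [[1, '1'], [0, '01']]  # hypothetical branch of [data, id] pairs

# extendleft iterates over its argument and pushes each element on its own,
# so the deque ends up holding individual [data, id] pairs (in reversed order)
branches.extendleft(left)
as_pairs = list(branches)

# appendleft pushes the whole list as one entry, i.e. one branch
branches.clear()
branches.appendleft(left)
as_branch = list(branches)
```

With extendleft, a later popleft() returns a single pair such as [0, '01'], and evaluating t[0] for each t in that pair means indexing into an integer, which raises a TypeError.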
Side note: For brevity, you can use sum like this
sum(current)
instead of
sum([x for x in current])
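For the procedure described in the question (split, sum, descend into the non-zero branch), one possible sketch is shown below. It is not the asker's code: the find_nonzero name and the idea of carrying the consumed prefix alongside each branch are assumptions made here, and it assumes the data values are non-negative so that a zero sum really means an all-zero branch.

```python
from collections import deque


def group(items):
    """Split [data, code] pairs by the first character of the code."""
    left = [[d, c[1:]] for d, c in items if c.startswith('0')]
    right = [[d, c[1:]] for d, c in items if c.startswith('1')]
    return left, right


def find_nonzero(tree):
    """Return [data, code] for every leaf whose data is non-zero."""
    found = []
    # each entry is (code prefix consumed so far, remaining [data, code] pairs)
    to_crawl = deque([('', tree)])
    while to_crawl:
        prefix, items = to_crawl.popleft()
        if sum(d for d, _ in items) == 0:
            continue  # nothing non-zero below this branch
        if len(items) == 1 and items[0][1] == '':
            found.append([items[0][0], prefix])  # reached a leaf
            continue
        left, right = group(items)
        # append keeps each branch as a single entry in the deque
        to_crawl.append((prefix + '0', left))
        to_crawl.append((prefix + '1', right))
    return found


tree = [[0, '1'], [1, '01'], [0, '001']]
result = find_nonzero(tree)
```

Each branch stays a list of pairs for its whole life in the deque, so the sum never has to special-case a single node.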

py2neo how to retrieve a node based on node's property?

I've found related methods:
find - doesn't work because this version of neo4j doesn't support labels.
match - doesn't work because I cannot specify a relation, because the node has no relations yet.
match_one - same as match.
node - doesn't work because I don't know the id of the node.
I need an equivalent of:
start n = node(*) where n.name? = "wvxvw" return n;
Cypher query. Seems like it should be basic, but it really isn't...
PS. I'm opposed to using Cypher for too many reasons to mention. So that's not an option either.
Well, you should create indexes so that your start nodes are reduced. This will be automatically taken care of with the use of labels, but in the meantime, there can be a work around.
Create an index, say "label", which will have keys pointing to the different types of nodes you will have (in your case, say 'Person')
Now while searching you can write the following query :
START n = node:label(key_name='Person') WHERE n.name = 'wvxvw' RETURN n; //key_name is the key's name you will assign while creating the node.
user797257 seems to be out of the game, but I think this could still be useful:
If you want to get nodes, you need to create an index. An index in Neo4j is the same as in MySQL or any other database (If I understand correctly). Labels are basically auto-indexes, but an index offers additional speed. (I use both).
Somewhere near the top of your code, or in Neo4j itself, create an index:
index = graph_db.get_or_create_index(neo4j.Node, "index_name")
Then, create your node as usual, but do add it to the index:
new_node = batch.create(node({"key":"value"}))
batch.add_indexed_node(index, "key", "value", new_node)
Now, if you need to find your new_node, execute this:
new_node_ref = index.get("key", "value")
This returns a list. new_node_ref[0] has the top item, in case you want/expect a single node.
Use a selector to obtain the node from the graph.
The following code fetches the first node from the list of nodes matching the search:
from py2neo import NodeSelector
selector = NodeSelector(graph)
node = selector.select("Label",key='value')
nodelist=list(node)
m_node=node.first()
Using py2neo, this hacky function iterates through the labels, properties and values, gradually eliminating every node that doesn't match each criterion submitted. The final result is a list of all (if any) nodes that match all the properties and labels supplied.
def find_multiProp(graph, *labels, **properties):
    # look up nodes by one label and one key/value pair
    genNodes = lambda l,k,v: graph.find(l, property_key=k, property_value=v)
    results = None
    for l in labels:
        for k,v in properties.items():
            if results is None:
                results = [r for r in genNodes(l,k,v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l,k,v) if n in prevResults]
    return results
see my other answer for creating a merge_one() that will accept multiple properties...

Indexing nodes in neo4j in python

I'm building a database with tag nodes and url nodes, and the url nodes are connected to tag nodes. If the same url is inserted into the database again, it should link to the tag node rather than create a duplicate url node. I think indexing would solve this problem. How can I do indexing and traversal with the neo4jrestclient? A link to a tutorial would be fine. I'm currently using versae's neo4jrestclient.
Thanks
The neo4jrestclient supports both indexing and traversing the graph, but I think indexing alone could be enough for your use case. However, I don't know if I understood your problem properly. Anyway, something like this could work:
>>> from neo4jrestclient.client import GraphDatabase
>>> gdb = GraphDatabase("http://localhost:7474/db/data/")
>>> idx = gdb.nodes.indexes.create("urltags")
>>> url_node = gdb.nodes.create(url="http://foo.bar", type="URL")
>>> tag_node = gdb.nodes.create(tag="foobar", type="TAG")
We add the property count to the relationship to keep track of the number of URLs "http://foo.bar" tagged with the tag foobar.
>>> url_node.relationships.create(tag_node["tag"], tag_node, count=1)
After that, we index the url node according to the value of the URL.
>>> idx["url"][url_node["url"]] = url_node
Then, when we need to create a new URL node tagged with a TAG node, we first query the index to check whether the URL is already indexed. If it is not, we create the node and index it.
>>> new_url = "http://foo.bar2"
>>> nodes = idx["url"][new_url]
>>> if len(nodes):
...     rel = nodes[0].relationships.all(types=[tag_node["tag"]])[0]
...     rel["count"] += 1
... else:
...     new_url_node = gdb.nodes.create(url=new_url, type="URL")
...     new_url_node.relationships.create(tag_node["tag"], tag_node, count=1)
...     idx["url"][new_url_node["url"]] = new_url_node
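The look-up-then-create pattern above can be sketched in plain Python, with a dict standing in for idx["url"]. Nothing here is neo4jrestclient API; the tag_url helper and the record layout are hypothetical and only illustrate why the index prevents duplicate URL nodes.

```python
def tag_url(index, url, tag):
    """Reuse the indexed record for url if present, otherwise create and index it."""
    node = index.get(url)
    if node is None:
        node = {'url': url, 'tags': {}}
        index[url] = node
    # count how often this url has been tagged with this tag
    node['tags'][tag] = node['tags'].get(tag, 0) + 1
    return node


index = {}  # hypothetical stand-in for idx["url"]
first = tag_url(index, 'http://foo.bar2', 'foobar')
second = tag_url(index, 'http://foo.bar2', 'foobar')
```

The second call finds the record through the index, so only the count changes and no second record is created.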
An important concept is that the indexes are key/value/object triplets where the object is either a node or a relationship you want to index.
Steps to create and use the index:
Create an instance of the graph database rest client.
from neo4jrestclient.client import GraphDatabase
gdb = GraphDatabase("http://localhost:7474/db/data/")
Create a node or relationship index (Creating a node index here)
index = gdb.nodes.indexes.create('latin_genre')
Add nodes to the index
nelly = gdb.nodes.create(name='Nelly Furtado')
shakira = gdb.nodes.create(name='Shakira')
index['latin_genre'][nelly.get('name')] = nelly
index['latin_genre'][shakira.get('name')] = shakira
Fetch nodes based on the index and do further processing:
for artist in index['latin_genre']['Shakira']:
    print(artist.get('name'))
More details can be found in the notes in the webadmin:
Neo4j has two types of indexes, node and relationship indexes. With
node indexes you index and find nodes, and with relationship indexes
you do the same for relationships.
Each index has a provider, which is the underlying implementation
handling that index. The default provider is lucene, but you can
create your own index providers if you like.
Neo4j indexes take key/value/object triplets ("object" being a node or
a relationship), it will index the key/value pair, and associate this
with the object provided. After you have indexed a set of
key/value/object triplets, you can query the index and get back
objects that were indexed with key/value pairs matching your query.
For instance, if you have "User" nodes in your database, and want to
rapidly find them by username or email, you could create a node index
named "Users", and for each user index username and email. With the
default lucene configuration, you can then search the "Users" index
with a query like: "username:bob OR email:bob@gmail.com".
You can use the data browser to query your indexes this way, the
syntax for the above query is "node:index:Users:username:bob OR
email:bob@gmail.com".
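The key/value/object idea can be made concrete with a toy in-memory index. This is only a sketch of the concept, not how Neo4j or lucene store anything; the TripletIndex class and the sample users are invented for illustration.

```python
from collections import defaultdict


class TripletIndex:
    """Toy index: (key, value) pairs pointing at the objects stored under them."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, key, value, obj):
        self._entries[(key, value)].append(obj)

    def get(self, key, value):
        # return a copy so callers cannot modify the index by accident
        return list(self._entries.get((key, value), []))


users = TripletIndex()
bob = {'username': 'bob', 'email': 'bob@example.com'}        # hypothetical user
alice = {'username': 'alice', 'email': 'alice@example.com'}  # hypothetical user
for user in (bob, alice):
    # each user is indexed twice: once per key/value pair
    users.add('username', user['username'], user)
    users.add('email', user['email'], user)

by_name = users.get('username', 'bob')
by_email = users.get('email', 'alice@example.com')
missing = users.get('username', 'carol')
```

The same object can be reached through any key/value pair it was indexed under, which is what makes the username-or-email lookup described above possible.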
