I'm building a database with tag nodes and URL nodes, where the URL nodes are connected to tag nodes. If the same URL is inserted into the database again, it should be linked to the tag node instead of creating a duplicate URL node. I think indexing would solve this problem. How can I do indexing and traversal with neo4jrestclient? A link to a tutorial would be fine. I'm currently using versae's neo4jrestclient.
Thanks
The neo4jrestclient supports both indexing and traversing the graph, but I think indexing alone could be enough for your use case. I'm not sure I have understood your problem properly, though. Anyway, something like this could work:
>>> from neo4jrestclient.client import GraphDatabase
>>> gdb = GraphDatabase("http://localhost:7474/db/data/")
>>> idx = gdb.nodes.indexes.create("urltags")
>>> url_node = gdb.nodes.create(url="http://foo.bar", type="URL")
>>> tag_node = gdb.nodes.create(tag="foobar", type="TAG")
We add the property count to the relationship to keep track of how many times the URL "http://foo.bar" is tagged with the tag foobar.
>>> url_node.relationships.create(tag_node["tag"], tag_node, count=1)
And after that, we index the URL node according to the value of the URL.
>>> idx["url"][url_node["url"]] = url_node
Then, when we need to create a new URL node tagged with a TAG node, we first query the index to check whether that URL is already indexed. If it is, we increment the count on the existing relationship; otherwise, we create the node and index it.
>>> new_url = "http://foo.bar2"
>>> nodes = idx["url"][new_url]
>>> if len(nodes):
...     rel = nodes[0].relationships.all(types=[tag_node["tag"]])[0]
...     rel["count"] += 1
... else:
...     new_url_node = gdb.nodes.create(url=new_url, type="URL")
...     new_url_node.relationships.create(tag_node["tag"], tag_node, count=1)
...     idx["url"][new_url_node["url"]] = new_url_node
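The session above can be folded into one helper. This is only a sketch: the function name `tag_url` is made up for illustration, and it uses nothing beyond the calls already shown in the session (`idx[key][value]` lookup and assignment, `gdb.nodes.create`, `relationships.create`, `relationships.all`). It also assumes that a URL found in the index may not yet have a relationship to this particular tag, which the session above does not check.

```python
def tag_url(gdb, idx, url, tag_node):
    """Return the node for `url`, creating and indexing it only if it is new.

    Hypothetical helper: `gdb` and `idx` are expected to behave like the
    GraphDatabase and node index objects used in the session above.
    """
    rel_type = tag_node["tag"]
    found = idx["url"][url]  # same index lookup as in the session
    if len(found):
        node = found[0]
        rels = node.relationships.all(types=[rel_type])
        if len(rels):
            # URL is already tagged with this tag: bump the counter.
            rels[0]["count"] += 1
        else:
            # URL exists but has no relationship to this tag yet.
            node.relationships.create(rel_type, tag_node, count=1)
        return node
    # URL not indexed yet: create it, link it to the tag, then index it.
    node = gdb.nodes.create(url=url, type="URL")
    node.relationships.create(rel_type, tag_node, count=1)
    idx["url"][url] = node
    return node
```

Calling `tag_url(gdb, idx, "http://foo.bar", tag_node)` twice should leave a single URL node whose relationship to the tag has a count of 2.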
An important concept is that the indexes are key/value/object triplets where the object is either a node or a relationship you want to index.
Steps to create and use the index:
Create an instance of the graph database rest client.
from neo4jrestclient.client import GraphDatabase
gdb = GraphDatabase("http://localhost:7474/db/data/")
Create a node or relationship index (Creating a node index here)
index = gdb.nodes.indexes.create('latin_genre')
Add nodes to the index
nelly = gdb.nodes.create(name='Nelly Furtado')
shakira = gdb.nodes.create(name='Shakira')
index['latin_genre'][nelly.get('name')] = nelly
index['latin_genre'][shakira.get('name')] = shakira
Fetch nodes based on the index and do further processing:
for artist in index['latin_genre']['Shakira']:
    print(artist.get('name'))
More details can be found in the notes in the webadmin:
Neo4j has two types of indexes, node and relationship indexes. With
node indexes you index and find nodes, and with relationship indexes
you do the same for relationships.
Each index has a provider, which is the underlying implementation
handling that index. The default provider is lucene, but you can
create your own index providers if you like.
Neo4j indexes take key/value/object triplets ("object" being a node or
a relationship). The index stores the key/value pair and associates it
with the object provided. After you have indexed a set of
key/value/object triplets, you can query the index and get back
objects that were indexed with key/value pairs matching your query.
For instance, if you have "User" nodes in your database, and want to
rapidly find them by username or email, you could create a node index
named "Users", and for each user index username and email. With the
default lucene configuration, you can then search the "Users" index
with a query like: "username:bob OR email:bob@gmail.com".
You can use the data browser to query your indexes this way; the
syntax for the above query is "node:index:Users:username:bob OR
email:bob@gmail.com".
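To make the key/value/object triplet idea concrete, here is a minimal in-memory sketch. It is not Neo4j or Lucene code: `ToyIndex`, its `add` and `get` methods, and the example address are invented for illustration, and it only supports exact-match lookups rather than Lucene query syntax.

```python
from collections import defaultdict


class ToyIndex(object):
    """In-memory stand-in for a node index: maps (key, value) pairs to objects."""

    def __init__(self, name):
        self.name = name
        self._entries = defaultdict(list)

    def add(self, key, value, obj):
        # Store one key/value/object triplet.
        self._entries[(key, value)].append(obj)

    def get(self, key, value):
        # Exact-match lookup; an unknown pair gives an empty list.
        return list(self._entries.get((key, value), []))


# A plain dict stands in for a "User" node.
users = ToyIndex("Users")
bob = {"username": "bob", "email": "bob@example.com"}
users.add("username", bob["username"], bob)
users.add("email", bob["email"], bob)

by_name = users.get("username", "bob")           # finds bob by username
by_mail = users.get("email", "bob@example.com")  # finds the same object by email
missing = users.get("username", "alice")         # nothing was indexed for alice
```

The point to take away is that an object is only found through pairs it was explicitly indexed under; creating the object does not index it.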
I have produced a set of matching IDs from a database collection that looks like this:
{ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
Each ObjectId represents an ID on a collection in the DB.
I got that list by doing this: (which incidentally I also think I am doing wrong, but I don't yet know another way)
# Find all question IDs
question_list = list(mongo.db.questions.find())
all_questions = []
for x in question_list:
    all_questions.append(x["_id"])
# Find all con IDs that match the question IDs
con_id = list(mongo.db.cons.find())
con_id_match = []
for y in con_id:
    con_id_match.append(y["question_id"])
matches = set(con_id_match).intersection(all_questions)
print("matches", matches)
print("all_questions", all_questions)
print("con_id_match", con_id_match)
And that brings up all the IDs that are associated with a match such as the three at the top of this post. I will show what each print prints at the bottom of this post.
Now I want to get each ObjectId separately as a variable so I can search for these in the collection.
mongo.db.cons.find_one({"con": matches})
Here matches (which will probably need to be a new variable) would be one of the ObjectIds that match the DB reference.
So, how do I separate the ObjectIds in matches so that I get one at a time while iterating? I tried a for loop, but it threw an error, and I guess I am writing it wrong for a set. Thanks for the help.
Print Statements:
**matches** {ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
**all_questions** [ObjectId('5feafb52ae1b389f59423a91'), ObjectId('5feafb64ae1b389f59423a92'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feb247f1bb7a1297060342e'), ObjectId('6009b6e42b74a187c02ba9d7'), ObjectId('6010822e08050e32c64f2975'), ObjectId('601d125b3c4d9705f3a9720d')]
**con_id_match** [ObjectId('5feb247f1bb7a1297060342e'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8')]
Usually you can just use the find method, which yields documents one by one, and filter the documents in Python while iterating, like this:
# fetch only ids
question_ids = {question['_id'] for question in mongo.db.questions.find({}, {'_id': 1})}
matches = []
for con in mongo.db.cons.find():
    con_id = con['question_id']
    if con_id in question_ids:
        matches.append(con_id)
        # you can process matched and loaded con here
print(matches)
If you have a huge amount of data, you can take a look at the aggregation framework.
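As for iterating over the set itself: a set can be looped over like any other iterable, and the whole set can also be sent in a single query with MongoDB's `$in` operator. The sketch below only builds the filter documents, so it runs without a database; plain strings stand in for the ObjectId values, and the helper name `con_filter_for` is made up for illustration. The field name `question_id` is the one used in the question.

```python
def con_filter_for(question_ids):
    """Build a filter matching cons whose question_id is any of the given ids."""
    return {"question_id": {"$in": list(question_ids)}}


# Strings stand in for ObjectId values so the sketch runs on its own.
matches = {"5feafffbb4cf9e627842b1d9", "5feaffcfb4cf9e627842b1d8"}

# Option 1: one filter per id, obtained by looping over the set.
per_id_filters = [{"question_id": qid} for qid in matches]

# Option 2: one filter for the whole set.
combined_filter = con_filter_for(matches)

# With a live connection these would be used roughly as:
#   for f in per_id_filters:
#       con = mongo.db.cons.find_one(f)
#   for con in mongo.db.cons.find(combined_filter):
#       ...
```

The second option usually means fewer round trips to the server, since all ids are matched in one query.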
I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff
    def __str__(self):
        return self.ref
class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField, I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py and I got a "numeric" value in the column in PostgreSQL with format 'Decimal:( the id )', but there's probably some way around that that would force it to just shove the id into a string.
Use some feature of many-to-many fields which I don't know about to check for matches more efficiently?
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref points are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]
You should take a look at annotate:
okmems = Member.objects.annotate(
    feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)
Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics in both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra({'feature_id_str': "CAST(feature_id AS VARCHAR)"}) \
    .order_by('-feature_id_str') \
    .values_list('feature_id_str', flat=True).distinct()
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
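The two earlier attempts in the question most likely failed for the same reason: `str()` was applied to a container rather than to each id. The plain-Python sketch below shows the difference with a list and a dict standing in for the query set and its rows (the second id is invented, and the exact string a real query set produces depends on the Django version). Since an `__in` lookup iterates over whatever it is given, a single string is treated as a sequence of characters.

```python
# Stand-ins for the data in the question; no Django is needed to run this.
feature_ids = [2286047272, 2286047273]
row = {"feature_id": 2286047272}  # shape of one item from .values('feature_id')

# str() of a whole row gives the dict's text, not the id.
str_of_row = str(row)

# str() of the whole list gives ONE string; iterating it yields characters.
str_of_list = str(feature_ids)
chars = sorted(set(str_of_list))

# Converting each id separately gives the list of strings that ref can match.
as_strings = [str(i) for i in feature_ids]
```

Here `as_strings` contains `'2286047272'`, while `chars` contains only single characters such as `'2'` and `','`, which is consistent with the empty result in the question.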
I have overcome the problem of duplicate nodes in my DB by using the merge_one function, which works like this:
t=graph.merge_one("User","ID","someID")
which creates the node with a unique ID. My problem is that I can't find a way to add multiple attributes/properties (a date, for example) to my node along with the ID, which is added automatically.
I managed to achieve this the old "duplicate" way, but that doesn't work now, since merge_one can't accept more arguments. Any ideas?
Graph.merge_one only allows you to specify one key-value pair because it's meant to be used with a uniqueness constraint on a node label and property. Is there anything wrong with finding the node by its unique id with merge_one and then setting the properties?
t = graph.merge_one("User", "ID", "someID")
t['name'] = 'Nicole'
t['age'] = 23
t.push()
I know I am a bit late... but still useful I think
Using py2neo==2.0.7 and the docs (about Node.properties):
... and the latter is an instance of PropertySet which extends dict.
So the following worked for me:
m = graph.merge_one("Model", "mid", MID_SR)
m.properties.update({
    'vendor':"XX",
    'model':"XYZ",
    'software':"OS",
    'modelVersion':"",
    'hardware':"",
    'softwareVesion':"12.06"
})
graph.push(m)
This hacky function iterates through the labels, properties and values, gradually eliminating all nodes that don't match each criterion submitted. The final result is a list of all nodes (if any) that match all the properties and labels supplied.
def find_multiProp(graph, *labels, **properties):
    results = None
    for l in labels:
        for k,v in properties.iteritems():
            if results == None:
                genNodes = lambda l,k,v: graph.find(l, property_key=k, property_value=v)
                results = [r for r in genNodes(l,k,v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l,k,v) if n in prevResults]
    return results
The final result can be used to assess uniqueness and (if empty) create a new node, by combining the two functions together...
def merge_one_multiProp(graph, *labels, **properties):
    r = find_multiProp(graph, *labels, **properties)
    if not r:
        # remove tuple association
        node,= graph.create(Node(*labels, **properties))
    else:
        node = r[0]
    return node
example...
from py2neo import Node, Graph
graph = Graph()
properties = {'p1':'v1', 'p2':'v2'}
labels = ('label1', 'label2')
graph.create(Node(*labels, **properties))
for l in labels:
    graph.create(Node(l, **properties))
graph.create(Node(*labels, p1='v1'))
node = merge_one_multiProp(graph, *labels, **properties)
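Note that find_multiProp relies on `dict.iteritems()`, which exists only in Python 2. Below is a sketch of the same elimination idea that also runs on Python 3. The name `find_all_props` is made up, it takes the labels and properties as ordinary arguments, and it assumes only that `graph.find(label, property_key=..., property_value=...)` behaves as in the function above. Unlike the original, it returns an empty list rather than None when no labels or properties are given.

```python
def find_all_props(graph, labels, properties):
    """Return the nodes that graph.find() yields for every (label, key, value) pair.

    Hypothetical variant of find_multiProp; `graph` only needs a find()
    method with the same keyword arguments as used above.
    """
    results = None
    for label in labels:
        for key, value in properties.items():
            found = list(graph.find(label, property_key=key, property_value=value))
            if results is None:
                # First criterion: start from everything it matches.
                results = found
            else:
                # Later criteria: keep only nodes that match this one too.
                results = [n for n in results if n in found]
    return results if results is not None else []
```

Like the original, this issues one find call per label/property combination, so it is best suited to small result sets.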
I've found related methods:
find - doesn't work because this version of neo4j doesn't support labels.
match - doesn't work because I cannot specify a relation, because the node has no relations yet.
match_one - same as match.
node - doesn't work because I don't know the id of the node.
I need an equivalent of:
start n = node(*) where n.name? = "wvxvw" return n;
Cypher query. Seems like it should be basic, but it really isn't...
PS. I'm opposed to using Cypher for too many reasons to mention. So that's not an option either.
Well, you should create indexes so that the set of start nodes is reduced. Labels will take care of this automatically, but in the meantime there is a workaround.
Create an index, say "label", whose keys point to the different types of nodes you will have (in your case, say, 'Person').
Now, while searching, you can write the following query:
START n = node:label(key_name='Person') WHERE n.name = 'wvxvw' RETURN n; //key_name is the key's name you will assign while creating the node.
user797257 seems to be out of the game, but I think this could still be useful:
If you want to get nodes, you need to create an index. An index in Neo4j is the same as in MySQL or any other database (If I understand correctly). Labels are basically auto-indexes, but an index offers additional speed. (I use both).
Somewhere at the top of your code, or in Neo4j itself, create an index:
index = graph_db.get_or_create_index(neo4j.Node, "index_name")
Then, create your node as usual, but do add it to the index:
new_node = batch.create(node({"key":"value"}))
batch.add_indexed_node(index, "key", "value", new_node)
Now, if you need to find your new_node, execute this:
new_node_ref = index.get("key", "value")
This returns a list. new_node_ref[0] has the top item, in case you want/expect a single node.
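Use a selector to obtain a node from the graph.
The following code fetches the first node from the list of nodes matching the search:
from py2neo import NodeSelector  # available in py2neo v3
selector = NodeSelector(graph)
selection = selector.select("Label", key='value')
nodelist = list(selection)  # all matching nodes
m_node = selection.first()  # first matching node, or None if there is none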
Using py2neo, this hacky function iterates through the labels, properties and values, gradually eliminating all nodes that don't match each criterion submitted. The final result is a list of all nodes (if any) that match all the properties and labels supplied.
def find_multiProp(graph, *labels, **properties):
    results = None
    for l in labels:
        for k,v in properties.iteritems():
            if results == None:
                genNodes = lambda l,k,v: graph.find(l, property_key=k, property_value=v)
                results = [r for r in genNodes(l,k,v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l,k,v) if n in prevResults]
    return results
See my other answer for creating a merge_one() that accepts multiple properties.
I have nodes in an index with the following properties:
{'user_id': u'00050714572570434939', 'hosts': [u'http://shyjive.blogspot.com/'], 'follows': ['null']}
Now I have an index, and I am trying a simple query against it to get the nodes:
index = gdb.nodes.indexes.create('blogger2')
uid = gdb.nodes.create()
uid["hosts"] = ['http://shyjive.blogspot.com/']
uid["user_id"] = "00050714572570434939"
uid["follows"] = ['null']
print index["user_id"]["00050714572570434939"][:]
This returns []. What is wrong here?
The reason I am using a Python list, as suggested by developers on the Neo4j group, is that I want to store multi-valued properties on the node, so I am using a list here instead of an array.
You first need to index the node. If you are not using automatic indexing, the code for neo4j-rest-client would be:
index["user_id"]["00050714572570434939"] = uid
Now you have:
>>> index["user_id"]["00050714572570434939"][:]
[<Neo4j Node: http://localhost:7474/db/data/node/38>]