I have a directed, weighted, complete graph with 100 vertices. The vertices represent movies, and the edges represent preferences between two movies. Each time a user visits my site, I query a set of 5 vertices to show to the user (the set changes frequently). Let's call these vertices A, B, C, D, E. The user orders them (i.e. ranks these movies from most to least favorite). For example, he might order them D, B, A, C, E. I then need to update the graph as follows:
Graph[D][B] +=1
Graph[B][A] +=1
Graph[A][C] +=1
Graph[C][E] +=1
So the count Graph[V1][V2] ends up representing how many users ranked (movie) V1 directly above (movie) V2. When the data is collected, I can do all kinds of offline graph analysis, e.g. find the sinks and sources of the graph to identify the most and least favorite movies.
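The update step is independent of any particular storage backend. As a minimal sketch (pure Python, with a dict of counters standing in for the datastore), recording one user's ranking looks like this:

```python
from collections import defaultdict

# In-memory stand-in for the datastore: counts[(v1, v2)] is how many
# users ranked movie v1 directly above movie v2.
counts = defaultdict(int)

def record_ranking(ranking):
    # Increment the edge count for each adjacent pair in the ordering.
    for better, worse in zip(ranking, ranking[1:]):
        counts[(better, worse)] += 1

record_ranking(['D', 'B', 'A', 'C', 'E'])
# counts[('D', 'B')], counts[('B', 'A')], counts[('A', 'C')], counts[('C', 'E')] are each 1
```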
The problem is: how do I store a directed, weighted, complete graph in the datastore? The obvious answer is this:
class Vertex(db.Model):
    name = db.StringProperty()

class Edge(db.Model):
    better = db.ReferenceProperty(Vertex, collection_name='better_set')
    worse = db.ReferenceProperty(Vertex, collection_name='worse_set')
    count = db.IntegerProperty()
But the problem I see with this is that I have to make 4 separate ugly queries along the lines of:
edge = Edge.all().filter('better =', vertex1).filter('worse =', vertex2).get()
Then I need to update and put() the new edges in a fifth query.
A more efficient (fewer queries) but hacky implementation would be this one, which uses pairs of lists to simulate a dict:
class Vertex(db.Model):
    name = db.StringProperty()
    better_keys = db.ListProperty(db.Key)
    better_values = db.ListProperty(int)
So to add a score saying that A is better than B, I would do:
index = vertexA.better_keys.index(vertexB.key())
vertexA.better_values[index] += 1
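A quick pure-Python sketch of that parallel-list lookup (with placeholder key strings standing in for real db.Key objects) shows the index-then-increment pattern:

```python
# Parallel lists simulating a dict: better_keys[i] pairs with better_values[i].
better_keys = ['keyB', 'keyC']   # hypothetical stand-ins for db.Key objects
better_values = [0, 0]

def bump(keys, values, target_key):
    # Find the position of the key, then increment the matching counter.
    index = keys.index(target_key)
    values[index] += 1

bump(better_keys, better_values, 'keyB')
# better_values is now [1, 0]
```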
Is there a more efficient way to model this?
I solved my own problem with a minor modification to the first design I suggested in my question.
I learned about the key_name argument that lets me set my own key names. So every time I create a new edge, I pass in the following argument to the constructor:
key_name = vertex1.name + ' > ' + vertex2.name
Then, instead of running this query multiple times:
edge = Edge.all().filter('better =', vertex1).filter('worse =', vertex2).get()
I can retrieve the edges easily since I know how to construct their keys. Using the Key.from_path() method, I construct a list of keys that refer to edges. Each key is obtained by doing this:
db.Key.from_path('Edge', vertex1.name + ' > ' + vertex2.name)
I then pass that list of keys to get all the objects in one query.
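The key-name scheme itself is easy to check outside the datastore. A sketch in plain Python (in the real code each name would be fed to db.Key.from_path('Edge', name), and the resulting list of keys passed to db.get() for the single batch fetch):

```python
def edge_key_name(better_name, worse_name):
    # Mirrors the key_name scheme above: 'vertex1 > vertex2'
    return better_name + ' > ' + worse_name

ranking = ['D', 'B', 'A', 'C', 'E']
key_names = [edge_key_name(a, b) for a, b in zip(ranking, ranking[1:])]
# key_names == ['D > B', 'B > A', 'A > C', 'C > E']
```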
I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff

    def __str__(self):
        return self.ref

class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField; I only did because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py, and I got a "numeric" value in the column in PostgreSQL with format 'Decimal:( the id )', but there's probably some way around that which would force it to just shove the id into a string.
Use some feature of many-to-many fields which I don't know about to check for matches more efficiently?
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref points are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]
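The empty result from the ref__in=str(ndids) attempt can be reproduced in plain Python: str() of a list yields one single string rather than a list of id strings, and even element-wise an int never compares equal to its string form. (The ids below are made-up stand-ins.)

```python
ndids = [2286047272, 2286047273]   # hypothetical stand-in for the values_list result
as_one_string = str(ndids)         # one string: "[2286047272, 2286047273]"
# An int never equals its string form:
found_raw = '2286047272' in ndids                      # False: str vs int
found_cast = '2286047272' in [str(i) for i in ndids]   # True after explicit casting
```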
You should take a look at annotate:
okmems = Member.objects.annotate(
    feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)
Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics in both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra(
    {'feature_id_str': "CAST(feature_id AS VARCHAR)"}
).order_by('-feature_id_str').values_list('feature_id_str', flat=True).distinct()
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
I'm just starting out with neo4j, using Python 3.5 and py2neo.
I built two graph nodes with the following code, and they were created successfully:
>>> u1 = Node("Person",name='Tom',id=1)
>>> u2 = Node('Person', name='Jerry', id=2)
>>> graph.create(u1,u2)
After that, I want to create a relationship between 'Tom' and 'Jerry'.
Tom's id property is 1 and Jerry's id property is 2, so I figured I should point at the two existing nodes via the id property.
I then tried to create the relation like this:
>>> u1 = Node("Person",id=1)
>>> u2 = Node("Person",id=2)
>>> u1_knows_u2=Relationship(u1, 'KKNOWS', u2)
>>> graph.create(u1_knows_u2)
The above executed successfully, but the resulting graph looks strange: two unknown nodes were created, and the relationship connects those unknown nodes.
Why were those nodes created, and why is the relation between them rather than between Tom and Jerry?
You can have two nodes with the same label and same properties. The second node you get with u1 = Node("Person",id=1) is not the same one you created before. It's a new node with the same label/property.
When you define two nodes (i.e. your new u1 and u2) and create a relationship between them, the whole pattern, new nodes included, will be created.
To get the two nodes and create a relationship between them you would do:
# create Tom and Jerry as before
u1 = Node("Person",name='Tom',id=1)
u2 = Node('Person', name='Jerry', id=2)
graph.create(u1,u2)
# either use u1 and u2 directly
u1_knows_u2 = Relationship(u1, 'KKNOWS', u2)
graph.create(u1_knows_u2)
# or find existing nodes and create a relationship between them
existing_u1 = graph.find_one('Person', property_key='id', property_value=1)
existing_u2 = graph.find_one('Person', property_key='id', property_value=2)
existing_u1_knows_u2 = Relationship(existing_u1, 'KKNOWS', existing_u2)
graph.create(existing_u1_knows_u2)
find_one() assumes that your id properties are unique.
Note also that you can use the Cypher query language with Py2neo:
graph.cypher.execute('''
    MERGE (tom:Person {name: "Tom"})
    MERGE (jerry:Person {name: "Jerry"})
    CREATE UNIQUE (tom)-[:KNOWS]->(jerry)
''')
The MERGE statement in Cypher is similar to "get or create". If a Person node with the given name "Tom" already exists it will be bound to the variable tom, if not the node will be created and then bound to tom. This, combined with adding uniqueness constraints allows for avoiding unwanted duplicate nodes.
Check this query, which matches the two nodes by Neo4j's internal node ids (note that id(n) here is the internal id, not the id property you set):
MATCH (a),(b) WHERE id(a) =1 and id(b) = 2 create (a)-[r:KKNOWS]->(b) RETURN a, b
I have overcome the problem of duplicate nodes in my DB by using the merge_one function, which works like this:
t = graph.merge_one("User", "ID", "someID")
This creates the node with a unique ID. My problem is that I can't find a way to add multiple attributes/properties to my node along with the ID, which is added automatically (a date, for example).
I managed to achieve this the old "duplicate" way, but that doesn't work any more, since merge_one can't accept more arguments. Any ideas?
Graph.merge_one only allows you to specify one key-value pair because it's meant to be used with a uniqueness constraint on a node label and property. Is there anything wrong with finding the node by its unique id with merge_one and then setting the properties?
t = graph.merge_one("User", "ID", "someID")
t['name'] = 'Nicole'
t['age'] = 23
t.push()
I know I am a bit late... but still useful I think
Using py2neo==2.0.7 and the docs (about Node.properties):
... and the latter is an instance of PropertySet which extends dict.
So the following worked for me:
m = graph.merge_one("Model", "mid", MID_SR)
m.properties.update({
    'vendor': "XX",
    'model': "XYZ",
    'software': "OS",
    'modelVersion': "",
    'hardware': "",
    'softwareVersion': "12.06"
})
graph.push(m)
This hacky function iterates through the supplied labels and property key/value pairs, gradually eliminating all nodes that fail any one criterion. The final result is a list of all nodes (if any) that match every property and label supplied.
def find_multiProp(graph, *labels, **properties):
    # one label/property lookup, reused for each filtering pass
    genNodes = lambda l, k, v: graph.find(l, property_key=k, property_value=v)
    results = None
    for l in labels:
        for k, v in properties.items():
            if results is None:
                results = [r for r in genNodes(l, k, v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l, k, v) if n in prevResults]
    return results
The final result can be used to assess uniqueness and (if empty) create a new node, by combining the two functions together...
def merge_one_multiProp(graph, *labels, **properties):
    r = find_multiProp(graph, *labels, **properties)
    if not r:
        # remove tuple association
        node, = graph.create(Node(*labels, **properties))
    else:
        node = r[0]
    return node
example...
from py2neo import Node, Graph
graph = Graph()
properties = {'p1':'v1', 'p2':'v2'}
labels = ('label1', 'label2')
graph.create(Node(*labels, **properties))
for l in labels:
graph.create(Node(l, **properties))
graph.create(Node(*labels, p1='v1'))
node = merge_one_multiProp(graph, *labels, **properties)
I've found related methods:
find - doesn't work because this version of neo4j doesn't support labels.
match - doesn't work because I cannot specify a relation, because the node has no relations yet.
match_one - same as match.
node - doesn't work because I don't know the id of the node.
I need an equivalent of:
start n = node(*) where n.name? = "wvxvw" return n;
Cypher query. Seems like it should be basic, but it really isn't...
PS. I'm opposed to using Cypher for too many reasons to mention. So that's not an option either.
Well, you should create indexes so that the set of start nodes is reduced. This will be taken care of automatically once you can use labels, but in the meantime there is a workaround.
Create an index, say "label", which will have keys pointing to the different types of nodes you will have (in your case, say 'Person')
Now while searching you can write the following query :
START n = node:label(key_name='Person') WHERE n.name = 'wvxvw' RETURN n; //key_name is the key's name you will assign while creating the node.
user797257 seems to be out of the game, but I think this could still be useful:
If you want to get nodes, you need to create an index. An index in Neo4j is the same as in MySQL or any other database (If I understand correctly). Labels are basically auto-indexes, but an index offers additional speed. (I use both).
somewhere on top, or in neo4j itself create an index:
index = graph_db.get_or_create_index(neo4j.Node, "index_name")
Then, create your node as usual, but do add it to the index:
new_node = batch.create(node({"key":"value"}))
batch.add_indexed_node(index, "key", "value", new_node)
Now, if you need to find your new_node, execute this:
new_node_ref = index.get("key", "value")
This returns a list. new_node_ref[0] has the top item, in case you want/expect a single node.
Use a NodeSelector to obtain nodes from the graph. The following code fetches the first node matching the search:
selector = NodeSelector(graph)
node = selector.select("Label", key='value')
nodelist = list(node)
m_node = node.first()
Using py2neo, this hacky function iterates through the supplied labels and property key/value pairs, gradually eliminating all nodes that fail any one criterion. The final result is a list of all nodes (if any) that match every property and label supplied.
def find_multiProp(graph, *labels, **properties):
    # one label/property lookup, reused for each filtering pass
    genNodes = lambda l, k, v: graph.find(l, property_key=k, property_value=v)
    results = None
    for l in labels:
        for k, v in properties.items():
            if results is None:
                results = [r for r in genNodes(l, k, v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l, k, v) if n in prevResults]
    return results
see my other answer for creating a merge_one() that will accept multiple properties...
I have a file from which I am reading data, and I need advice on how to design a data structure around it.
The data is of the form
id_1::id_2::similarity_score
but since similarity is symmetric, the same pair could equally appear as
id_2::id_1::same_similarity_score
What I want is a data structure I can query from my program. For example, to find the item most similar to a given one:
object.maxSimilarity(object_id_1)
returns object_id_2 # has max score
The catch is that object_id_1 can appear in either column of the data, i.e. a row can have either form:
object_id_1::object_id_2::score
object_id_2::object_id_1::score
So I want to design the data structure in a way that
k_1,k_2::value <--> k_2,k_1::value
A general trick for this sort of thing is to find a canonicalisation - a function that maps all members of a particular class to the same object. In this case, you might achieve it by sorting the first two components, which will transform B::A::Score to A::B::Score, while leaving A::B::Score as it is.
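As a minimal sketch of that canonicalisation (sorted tuples used as dict keys; the names and scores are illustrative):

```python
def canonical(id1, id2):
    # Sort the pair so (a, b) and (b, a) map to the same key.
    return tuple(sorted((id1, id2)))

scores = {}
scores[canonical('B', 'A')] = 5.0
# Both orderings now retrieve the same score:
# scores[canonical('A', 'B')] == 5.0
```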
It seems to me that you could use the scores to build lists of best to worst matches:
d = {
'id1': [id_best_match_to_id1, id_next_best_match_to_id1, ..., id_worst_match_to_id1],
'id2': [id_best_match_to_id2, id_next_best_match_to_id2, ..., id_worst_match_to_id2],
...
}
If the similarity scores need to be retained, use a list of tuples in the form (id_best_match_to_id1, similarity_score_to_id1).
I don't see a way to exploit that similarity is a symmetric relation where sim(x,y)==sim(y,x).
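A small sketch of building the best-to-worst structure described above from raw rows (made-up data; symmetry is handled by recording each score under both ids):

```python
from collections import defaultdict

rows = [('a', 'b', 2.0), ('b', 'a', 3.0), ('a', 'c', 5.0)]  # hypothetical data

matches = defaultdict(list)
for id1, id2, score in rows:
    # Similarity is symmetric, so record the score under both ids.
    matches[id1].append((id2, score))
    matches[id2].append((id1, score))

# Sort each id's matches from best to worst.
for lst in matches.values():
    lst.sort(key=lambda pair: pair[1], reverse=True)

matches['a'][0]  # ('c', 5.0): best match for 'a'
```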
The data look very much like nodes and edges of a weighted graph. If a is similar to b with a score 5.0, and similar to c with a score 1.0, you might visualise it thus:
a
/ \
/ \
5.0 1.0
/ \
b c
Networkx is a Python lib that provides ready-made graph objects and algorithms. Loading your data into a weighted multigraph (that is, one supporting multiple connections between the same pair of nodes, A--B and B--A) is trivial. After that, getting the most similar object for a given object id is a case of finding its node, picking the most heavily weighted incident edge, and returning the node at the other end of it.
import networkx as nx

## Test data
data = """\
a::b::2
b::a::3
a::c::5
b::e::1
"""
rows = (row.split('::') for row in data.split())

class Similarity(object):
    def __init__(self, data):
        self.g = nx.MultiGraph()
        self.load(data)

    def load(self, data):
        ## Turn each row into (node, node, float weight) for networkx
        rows = ((row[0], row[1], float(row[2])) for row in data)
        self.g.add_weighted_edges_from(rows)

    def most_similar(self, obj_id):
        ## Get all edges incident to the obj_id node
        edges = self.g.edges(obj_id, data=True)
        ## Pick the edge with the greatest weight; return the node at its far end
        return max(edges, key=lambda e: e[2].get('weight', 0))[1]

sc = Similarity(rows)
sc.most_similar('a')  ## 'c'

## Add some more data linking a -- f with a high score
sc.load([('a', 'f', 10)])
sc.most_similar('a')  ## 'f'