I am reading data from a file and need advice on how to design a data structure for it.
The data is of the form
id_1::id_2::similarity_score
Although each pair is stored in only one direction, it implies the reverse as well:
id_2::id_1::same_similarity_score
What I want is a data structure I can use in my program to find which two items are most similar, e.g.
object.maxSimilarity(object_id_1)
returns object_id_2 # the id with the max score
The catch is that object_id_1 can also appear in the second column of the data, so a pair can be stored either as
object_id_1::object_id_2::score
or as
object_id_2::object_id_1::score
So I sort of want to design this data structure so that
k_1,k_2::value <--> k_2,k_1::value
A general trick for this sort of thing is to find a canonicalisation - a function that maps all members of a particular class to the same object. In this case, you might achieve it by sorting the first two components, which will transform B::A::Score to A::B::Score, while leaving A::B::Score as it is.
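For illustration, a minimal sketch of that canonicalisation with a plain dict (the file name and helper names here are made up):
def canonical(id_a, id_b):
    # sort the two ids so (a, b) and (b, a) map to the same key
    return tuple(sorted((id_a, id_b)))

scores = {}
with open('similarities.txt') as f:  # hypothetical input file
    for line in f:
        id_1, id_2, score = line.strip().split('::')
        scores[canonical(id_1, id_2)] = float(score)

def max_similarity(obj_id):
    # scan every canonical pair containing obj_id and return the partner with the best score
    candidates = [(s, b if a == obj_id else a)
                  for (a, b), s in scores.items() if obj_id in (a, b)]
    return max(candidates)[1] if candidates else None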
It seems to me that you could use the scores to build lists of best to worst matches:
d = {
    'id1': [id_best_match_to_id1, id_next_best_match_to_id1, ..., id_worst_match_to_id1],
    'id2': [id_best_match_to_id2, id_next_best_match_to_id2, ..., id_worst_match_to_id2],
    ...
}
If the similarity scores need to be retained, use a list of tuples in the form (id_best_match_to_id1, similarity_score_to_id1).
I don't see a way to exploit that similarity is a symmetric relation where sim(x,y)==sim(y,x).
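For illustration, here is one way such a dict of (id, score) tuples could be built, assuming rows holds the parsed (id_1, id_2, score) triples; adding each row under both ids at least covers the symmetric storage:
from collections import defaultdict

matches = defaultdict(list)
for id_1, id_2, score in rows:  # rows: parsed (id_1, id_2, score) triples
    matches[id_1].append((float(score), id_2))
    matches[id_2].append((float(score), id_1))

# keep (id, score) tuples, sorted best to worst
d = {k: [(other, s) for s, other in sorted(v, reverse=True)] for k, v in matches.items()}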
The data look very much like nodes and edges of a weighted graph. If a is similar to b with a score 5.0, and similar to c with a score 1.0, you might visualise it thus:
a
/ \
/ \
5.0 1.0
/ \
b c
NetworkX is a Python library that provides ready-made graph objects and algorithms. Loading your data into a weighted multigraph (one that supports multiple connections between nodes, so both A--B and B--A can be stored) is trivial. After that, getting the most similar object for a given object id is a matter of finding the node, finding its most heavily weighted edge, and returning the node at the other end of it.
import networkx as nx
## Test data
data = """\
a::b::2
b::a::3
a::c::5
b::e::1
"""
rows = (row.split('::') for row in data.split())
class Similarity(object):
    def __init__(self, data):
        self.g = nx.MultiGraph()
        self.load(data)

    def load(self, data):
        ## Turn each row into data suitable for a networkx weighted graph
        rows = ((row[0], row[1], float(row[2])) for row in data)
        self.g.add_weighted_edges_from(rows)

    def most_similar(self, obj_id):
        ## Get edges from the obj_id node
        edges = self.g.edges(obj_id, data=True)
        ## Sort by weight and return the node joined by the heaviest edge
        return sorted((d.get('weight', 0), v) for u, v, d in edges)[-1][1]
sc = Similarity(rows)
sc.most_similar('a') ## 'c'
## Add some more data linking a --> f with a high score
sc.load([('a', 'f', 10)])
sc.most_similar('a') ## 'f'
I have a MultiDiGraph with all my data in it. Now I want to do some math on a filtered view of it that has only single directed edges between nodes.
>>> filtered_view[0][1]
Out[23]: AtlasView(FilterAtlas({0: {'d': 0.038, 'l': 2, 'showfl': True, 'type': 'pipe', 'q': 0.0001}}, <function FilterMultiInner.__getitem__.<locals>.new_node_ok at 0x7fa0987b55a0>))
I already have a lot of code that was written for a DiGraph, and much of it no longer works because of the differences in how information is accessed and stored. Hence my question:
Is there a way to have the view behave like a DiGraph?
Alternatively, I can do ndg = nx.DiGraph(filtered_view) to get a DiGraph, but is there a smart (simple, clear, error-free) way of merging it back into the main graph?
This is the implementation I came up with. It allows either merging only the data on existing nodes and edges (allnodes=False) or merging the entire results_graph, which is a DiGraph (allnodes=True). The condition is that the MultiDiGraph has not changed since the filtered view was created.
def merge_results_back(results_graph, multidigraph, allnodes=False):
    for n in results_graph.nodes:
        if n not in multidigraph.nodes and allnodes:
            multidigraph.add_node(n)
        if n in multidigraph.nodes:
            nx.set_node_attributes(multidigraph, {n: results_graph.nodes[n]})
    for e in results_graph.edges:
        if e in multidigraph.edges:
            for ed1, ed2, key, data in multidigraph.edges(e[0], keys=True, data=True):
                if data['type'] == results_graph.edges[e]['type']:
                    nx.set_edge_attributes(multidigraph, {(e[0], e[1], key): results_graph.edges[e]})
        else:
            nx.set_edge_attributes(multidigraph, {(e[0], e[1], 0): results_graph.edges[e]})
Offering a couple of suggestions for improvement here based on the code that you posted. It's unclear under what circumstances a node would be added (if the DiGraph is based on the MultiDiGraph, how is a new node possible?), so I'll leave that part alone.
In the loop for modifying edges, you end up looping through multidigraph every time a common edge is found. As an improvement, I'd suggest the following (assuming the type attribute differs based on the edge index, which wasn't clear in your question):
for u, v, data in results_graph.edges(data=True):
    # only loops through each edge in the multidigraph one time
    for i in range(multidigraph.number_of_edges(u, v)):
        if multidigraph.edges[u, v, i]['type'] == data['type']:
            multidigraph.edges[u, v, i].update(data)
If the type doesn't change based on the index, just eliminate that if statement line.
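With that line removed, the simplified loop would look roughly like this:
for u, v, data in results_graph.edges(data=True):
    for i in range(multidigraph.number_of_edges(u, v)):
        multidigraph.edges[u, v, i].update(data)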
I think you can also get rid of the else block:
else:
    nx.set_edge_attributes(multidigraph, {(e[0], e[1], 0): results_graph.edges[e]})
If edge e from results_graph isn't in multidigraph, then setting the edge attributes won't create edge e and it will be silently ignored. If you have a new edge and attributes though (again, unclear how this is possible if results_graph was created from multidigraph), you can add the following directly under the for u, v, data... line:
if (u, v) not in multidigraph.edges:
    multidigraph.add_edge(u, v, **data)
I have been stuck on this simple problem for a while and can't quite figure out the solution. I have a dictionary structured like {(node1, node2): weight}, called EdgeDictFull. I want to create a DiGraph that has the weight stored as an edge attribute. I have tried a whole bunch of different ideas but none seem to work. When I run this code...
(weights is just a list of all the weights I want to add to the edges as attributes)
TG = nx.DiGraph()
for x in weights:
    TG.add_edges_from(EdgeDictFull.keys(), weight=x)
TG.edges(data=True)
This creates all the correct edges, but every edge ends up with the value of the last integer in my weights list as its attribute. I think I understand why that happens; however, I can't seem to figure out how to fix it. I know it's something really simple. Any advice would be great!
import networkx as nx
import numpy as np

# the problem with your code is that in every iteration of your loop you add
# *all* edges, and all of them get the same weight.
# you can do either of the following:

# zip:
TG = nx.DiGraph()
for edge, weight in zip(EdgeDictFull.keys(), weights):
    TG.add_edge(*edge, weight=weight)

# or directly work with the dictionary:
## dummy dictionary:
EdgeDictFull = {(np.random.randint(5), np.random.randint(5)): np.random.rand() for i in range(3)}
TG = nx.DiGraph()
TG.add_weighted_edges_from((a, b, c) for (a, b), c in EdgeDictFull.items())
TG.edges(data=True)
I have overcome the problem of duplicate nodes being created in my DB by using the merge_one function, which works like this:
t=graph.merge_one("User","ID","someID")
which creates the node with a unique ID. My problem is that I can't find a way to add multiple attributes/properties to my node along with the ID (a date, for example).
I managed to achieve this the old "duplicate-prone" way, but that doesn't work now, since merge_one can't accept more arguments. Any ideas?
Graph.merge_one only allows you to specify one key-value pair because it's meant to be used with a uniqueness constraint on a node label and property. Is there anything wrong with finding the node by its unique id with merge_one and then setting the properties?
t = graph.merge_one("User", "ID", "someID")
t['name'] = 'Nicole'
t['age'] = 23
t.push()
I know I am a bit late, but I think this is still useful.
Using py2neo==2.0.7 and the docs (about Node.properties):
... and the latter is an instance of PropertySet which extends dict.
So the following worked for me:
m = graph.merge_one("Model", "mid", MID_SR)
m.properties.update({
    'vendor': "XX",
    'model': "XYZ",
    'software': "OS",
    'modelVersion': "",
    'hardware': "",
    'softwareVesion': "12.06"
})
graph.push(m)
This hacky function iterates through the labels and the property key/value pairs, gradually eliminating all nodes that don't match each criterion. The final result is a list of all nodes (if any) that match every label and property supplied.
def find_multiProp(graph, *labels, **properties):
    results = None
    for l in labels:
        for k, v in properties.iteritems():
            if results == None:
                genNodes = lambda l, k, v: graph.find(l, property_key=k, property_value=v)
                results = [r for r in genNodes(l, k, v)]
                continue
            prevResults = results
            results = [n for n in genNodes(l, k, v) if n in prevResults]
    return results
The final result can be used to assess uniqueness and (if empty) create a new node, by combining the two functions together...
def merge_one_multiProp(graph, *labels, **properties):
    r = find_multiProp(graph, *labels, **properties)
    if not r:
        # remove tuple association
        node, = graph.create(Node(*labels, **properties))
    else:
        node = r[0]
    return node
example...
from py2neo import Node, Graph
graph = Graph()
properties = {'p1':'v1', 'p2':'v2'}
labels = ('label1', 'label2')
graph.create(Node(*labels, **properties))
for l in labels:
    graph.create(Node(l, **properties))
graph.create(Node(*labels, p1='v1'))
node = merge_one_multiProp(graph, *labels, **properties)
I have a directed, weighted, complete graph with 100 vertices. The vertices represent movies, and the edges represent preferences between two movies. Each time a user visits my site, I query a set of 5 vertices to show to the user (the set changes frequently). Let's call these vertices A, B, C, D, E. The user orders them (i.e. ranks these movies from most to least favorite). For example, he might order them D, B, A, C, E. I then need to update the graph as follows:
Graph[D][B] += 1
Graph[B][A] += 1
Graph[A][C] += 1
Graph[C][E] += 1
So the count Graph[V1][V2] ends up representing how many users ranked (movie) V1 directly above (movie) V2. When the data is collected, I can do all kinds of offline graph analysis, e.g. find the sinks and sources of the graph to identify the most and least favorite movies.
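As a small illustration of the update step (assuming Graph here is just an in-memory dict-of-dicts of counts), one user's ordering turns into consecutive-pair increments like this:
ranking = ['D', 'B', 'A', 'C', 'E']  # the example ordering above
for better, worse in zip(ranking, ranking[1:]):
    Graph[better][worse] += 1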
The problem is: how do I store a directed, weighted, complete graph in the datastore? The obvious answer is this:
class Vertex(db.Model):
    name = db.StringProperty()

class Edge(db.Model):
    better = db.ReferenceProperty(Vertex, collection_name='better_set')
    worse = db.ReferenceProperty(Vertex, collection_name='worse_set')
    count = db.IntegerProperty()
But the problem I see with this is that I have to make 4 separate ugly queries along the lines of:
edge = Edge.all().filter('better =', vertex1).filter('worse =', vertex2).get()
Then I need to update and put() the new edges in a fifth query.
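Spelled out, that update pattern looks roughly like this (a sketch using the D, B, A, C, E vertices from the example):
edges = []
for v1, v2 in [(D, B), (B, A), (A, C), (C, E)]:  # the ranked pairs from the example
    edge = Edge.all().filter('better =', v1).filter('worse =', v2).get()
    edge.count += 1
    edges.append(edge)
db.put(edges)  # the extra write on top of the four reads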
A more efficient (fewer queries) but hacky implementation would be this one, which uses pairs of lists to simulate a dict:
class Vertex(db.Model):
    name = db.StringProperty()
    better_keys = db.ListProperty(db.Key)
    better_values = db.ListProperty(int)
So to add a score saying that A is better than B, I would do:
index = vertexA.better_keys.index(vertexB.key())
vertexA.better_values[index] += 1
Is there a more efficient way to model this?
I solved my own problem with a minor modification to the first design I suggested in my question.
I learned about the key_name argument that lets me set my own key names. So every time I create a new edge, I pass in the following argument to the constructor:
key_name = vertex1.name + ' > ' + vertex2.name
Then, instead of running this query multiple times:
edge = Edge.all().filter('better =', vertex1).filter('worse =', vertex2).get()
I can retrieve the edges easily since I know how to construct their keys. Using the Key.from_path() method, I construct a list of keys that refer to edges. Each key is obtained by doing this:
db.Key.from_path('Edge', vertex1.name + ' > ' + vertex2.name)
I then pass that list of keys to get all the objects in one query.
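Roughly, the whole lookup can be sketched like this (the helper name is made up; ranked_vertices would be e.g. [D, B, A, C, E]):
from google.appengine.ext import db

def get_ranked_edges(ranked_vertices):
    # build a key for each consecutive pair in the ranking
    keys = [db.Key.from_path('Edge', v1.name + ' > ' + v2.name)
            for v1, v2 in zip(ranked_vertices, ranked_vertices[1:])]
    return db.get(keys)  # fetches all the Edge entities in a single call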
pydot has a huge number of bound methods for getting and setting every little thing in a dot graph, reading and writing, you-name-it, but I can't seem to find a simple membership test.
>>> d = pydot.Dot()
>>> n = pydot.Node('foobar')
>>> d.add_node(n)
>>> n in d.get_nodes()
False
is just one of many things that didn't work. It appears that nodes, once added to a graph, acquire a new identity
>>> d.get_nodes()[0]
<pydot.Node object at 0x171d6b0>
>>> n
<pydot.Node object at 0x1534650>
Can anyone suggest a way to create a node and test to see if it's in a graph before adding it so you could do something like this:
d = pydot.Dot()
n = pydot.Node('foobar')
if n not in d:
    d.add_node(n)
Looking through the source code, http://code.google.com/p/pydot/source/browse/trunk/pydot.py, it seems that node names are unique values, used as the keys to locate the nodes within a graph's node dictionary (though, interestingly, rather than return an error for an existing node, it simply adds the attributes of the new node to those of the existing one).
So unless you want to add an implementation of __contains__() that performs this check to one of the classes in pydot.py, you can just do the following in your code:
if n.get_name() not in d.obj_dict['nodes'].keys():
    d.add_node(n)
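If you would rather keep the "if n not in d" spelling from the question, a small subclass built on the same name lookup would work too (a sketch, not part of pydot itself):
import pydot

class DotWithContains(pydot.Dot):
    def __contains__(self, node):
        # same name-based check as above, wrapped in the membership protocol
        return node.get_name() in self.obj_dict['nodes']

d = DotWithContains()
n = pydot.Node('foobar')
if n not in d:
    d.add_node(n)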