I am trying to create hierarchical queries on App Engine.
My datastore has parents and children, and each parent has children. Imagine that I have to find children, with one condition on the parent and another on the children. For example, in a real family datastore, my conditions would be: I want all children who are boys and whose parent is 35 years old or more.
The query I have for now is something like:
P = Parent.query(Parent.age >= 35)
for p in P:
    C = Children.query(Children.gender == "boy", ancestor=p.key)
    for c in C:
        # here I print information about the child
But this is very slow when there are many parents and children to test. I want to avoid iterating with for loops because I think that is what takes so long. What is the best practice for running this kind of query quickly?
The children also have brothers, and I can query on them as well. For example, if I want all children whose parent is more than 35 years old and who have a sister named "Sisi", I have the following (each child stores his brothers in the property "brother"):
P = Parent.query(Parent.age >= 35)
for p in P:
    C = Children.query(Children.gender == "girl", Children.name == "Sisi", ancestor=p.key)
    for c in C:
        C1 = Children.query(Children.gender == "boy", Children.brother == c.key, ancestor=p.key)
        for c1 in C1:
            # here I print information about the child
This family example is not exactly my project, but it gives a good idea of the problem I have.
The way I've done this before is to store the keys in a separate lookup entity. This follows the key-value store premise that duplicating information is sometimes necessary for faster lookups. For example:
class ParentChildLookup(ndb.Model):
    parent_key = ndb.KeyProperty()
    child_key = ndb.KeyProperty()
You could even add a third dimension if you are getting grandchildren:
class ParentChildLookup(ndb.Model):
    parent_key = ndb.KeyProperty()
    child_key = ndb.KeyProperty()
    grandchildren_key = ndb.KeyProperty()
If you want to look everything up in a single query, you can add repeated=True to make the children and grandchildren keys lists:
class ParentChildLookup(ndb.Model):
    parent_key = ndb.KeyProperty()
    child_key = ndb.KeyProperty(repeated=True)
    grandchildren_key = ndb.KeyProperty(repeated=True)
You then need to insert/update these lookup values anytime there's a change in relationships. The benefit of this is that you can avoid a lot of queries, specifically nested or multi-property queries. If you don't like this approach, I would recommend taking a look at the "Relationship Model" explanation here: https://cloud.google.com/appengine/articles/modeling. You can store the relationships between many-to-many objects without the need to store them all in the same entity.
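As a concrete illustration of the lookup approach above, here is a minimal sketch of how the lookup entity might be maintained and queried. The register_child helper and the age/gender filters are assumptions taken from the original question, not part of any library API:
from google.appengine.ext import ndb

class ParentChildLookup(ndb.Model):
    parent_key = ndb.KeyProperty()
    child_key = ndb.KeyProperty(repeated=True)

# whenever a child is added, update (or create) the lookup row for its parent
def register_child(parent, child):
    lookup = ParentChildLookup.query(
        ParentChildLookup.parent_key == parent.key).get()
    if lookup is None:
        lookup = ParentChildLookup(parent_key=parent.key)
    lookup.child_key.append(child.key)
    lookup.put()

# later: one query over parents, then one batch get over all their children
parent_keys = Parent.query(Parent.age >= 35).fetch(keys_only=True)
lookups = ParentChildLookup.query(
    ParentChildLookup.parent_key.IN(parent_keys)).fetch()
child_keys = [k for lookup in lookups for k in lookup.child_key]
boys = [c for c in ndb.get_multi(child_keys) if c.gender == "boy"]
This replaces the nested query-per-parent loops with one parent query, one lookup query, and a single batch get.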
I have a class which looks something like this:
class A(Base):
    __tablename__ = "a"
    id = Column(Integer, primary_key=True)
    stuff = relationship('Stuff', secondary=stuff_a)
    more_stuff = relationship('Stuff', secondary=more_stuff_a)
Basically two lists, stuff and more_stuff containing lists of Stuff.
I want to do a query which selects all A which have Stuff with id=X in either stuff list or in more_stuff list.
This is how I would do it for one list:
session.query(A).join(Stuff).filter(Stuff.id==X)
But that won't pick up Stuff from more_stuff.
If you have two relationships from A to Stuff, then even when you join against only one of them you need to specify which relationship explicitly, or SQLAlchemy will rightfully complain. You can do this as follows:
q = (
session
.query(A)
    .join(Stuff, A.stuff)  # note: here you specify the relationship
.filter(Stuff.id == X)
)
To filter on both lists, you need to use an or_ operator in the filter. In order to reference both relationships, the easiest approach is to create an alias (a different name) for each of them. The code then looks like this:
from sqlalchemy import or_
from sqlalchemy.orm import aliased

S1 = aliased(Stuff)
S2 = aliased(Stuff)
q = (
session
.query(A)
.join(S1, A.stuff) # S1 will refer to `A.stuff`
.join(S2, A.more_stuff) # S2 will refer to `A.more_stuff`
.filter(or_(S1.id == X, S2.id == X))
)
Alternatively, cleaner code can be achieved with relationship.any():
q = (
session
.query(A)
.filter(or_(
A.stuff.any(Stuff.id == X), # here Stuff will refer to `A.stuff`
A.more_stuff.any(Stuff.id == X), # here Stuff will refer to `A.more_stuff`
))
)
but you will need to compare the performance of the two versions, as the latter is implemented using EXISTS with sub-selects.
I'm just starting out with neo4j, and I'm using Python 3.5 and py2neo.
I built two graph nodes with the following code, and they were created successfully.
>>> u1 = Node("Person",name='Tom',id=1)
>>> u2 = Node('Person', name='Jerry', id=2)
>>> graph.create(u1,u2)
After that, I want to make a relationship between 'Tom' and 'Jerry'.
Tom's id property is 1, Jerry's id property is 2.
So I think I have to point to the two existing nodes using the id property,
and then I tried to create the relationship like this:
>>> u1 = Node("Person",id=1)
>>> u2 = Node("Person",id=2)
>>> u1_knows_u2=Relationship(u1, 'KKNOWS', u2)
>>> graph.create(u1_knows_u2)
The above executed successfully, but the resulting graph is strange.
I don't know why unknown nodes were created, or why the relationship was created between those two unknown nodes.
You can have two nodes with the same label and same properties. The second node you get with u1 = Node("Person",id=1) is not the same one you created before. It's a new node with the same label/property.
When you define two nodes (i.e. your new u1 and u2) and create a relationship between them, the whole pattern will be created.
To get the two nodes and create a relationship between them you would do:
# create Tom and Jerry as before
u1 = Node("Person",name='Tom',id=1)
u2 = Node('Person', name='Jerry', id=2)
graph.create(u1,u2)
# either use u1 and u2 directly
u1_knows_u2 = Relationship(u1, 'KKNOWS', u2)
graph.create(u1_knows_u2)
# or find existing nodes and create a relationship between them
existing_u1 = graph.find_one('Person', property_key='id', property_value=1)
existing_u2 = graph.find_one('Person', property_key='id', property_value=2)
existing_u1_knows_u2 = Relationship(existing_u1, 'KKNOWS', existing_u2)
graph.create(existing_u1_knows_u2)
find_one() assumes that your id properties are unique.
Note also that you can use the Cypher query language with Py2neo:
graph.cypher.execute('''
MERGE (tom:Person {name: "Tom"})
MERGE (jerry:Person {name: "Jerry"})
CREATE UNIQUE (tom)-[:KNOWS]->(jerry)
''')
The MERGE statement in Cypher is similar to "get or create". If a Person node with the given name "Tom" already exists it will be bound to the variable tom; if not, the node will be created and then bound to tom. This, combined with adding uniqueness constraints, allows you to avoid unwanted duplicate nodes.
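For example, a uniqueness constraint on the id property could be added like this (a sketch using the same graph object as above; the exact constraint syntax depends on your Neo4j version):
graph.cypher.execute('''
    CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE
''')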
Check this query:
MATCH (a),(b) WHERE id(a) =1 and id(b) = 2 create (a)-[r:KKNOWS]->(b) RETURN a, b
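Note that id(a) in Cypher refers to Neo4j's internal node id, which is not necessarily the same as the id property you set on the nodes. If the intent is to match on the id property instead, a variant might look like this (a sketch, assuming the Person label and id property from the question):
graph.cypher.execute('''
    MATCH (a:Person {id: 1}), (b:Person {id: 2})
    CREATE (a)-[:KKNOWS]->(b)
''')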
I'm trying to query a range of valid dates:
q = Licence.query(Licence.valid_from <= today,
                  Licence.valid_to >= today,
                  ancestor=customer.key
                  ).fetch(keys_only=True)
I know that the Datastore doesn't support inequality filters on two different properties.
So I do this:
kl = Licence.query(Licence.valid_from <= today,
                   ancestor=customer.key
                   ).fetch(keys_only=True)
licences = ndb.get_multi(kl)
for item in licences:
    if item.valid_to < today:
        licences.remove(item)
But I don't like this, because I think I use too much RAM by retrieving more entities (or keys) from the Datastore than I finally need.
Does anybody know a better way of doing this type of query?
Is it enough to use .filter() before .get()?
Thanks
One solution would be to create a new field, like start_week, which buckets the queries and allows you to use an IN query to filter:
q = Licence.query(Licence.start_week.IN(range(5, 30)),
                  Licence.valid_to >= today,
                  ancestor=customer.key)
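A minimal sketch of what that bucket field might look like; the property name and the week computation are assumptions, and the point is only that the bucket is computed at write time so the query can combine IN with a single inequality:
class Licence(ndb.Model):
    valid_from = ndb.DateProperty()
    valid_to = ndb.DateProperty()
    start_week = ndb.IntegerProperty()  # bucket, computed at write time

    def _pre_put_hook(self):
        # use the ISO week number of valid_from as the bucket
        self.start_week = self.valid_from.isocalendar()[1]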
Even simpler: Use a projection query to identify the right set of data without fetching full entities. This is faster than a regular query.
it = Licence.query(Licence.valid_from <= today,
                   ancestor=customer.key
                   ).iter(projection=[Licence.valid_to])
keys = [e.key for e in it if e.valid_to >= today]
licences = ndb.get_multi(keys)
I have a directed, weighted, complete graph with 100 vertices. The vertices represent movies, and the edges represent preferences between two movies. Each time a user visits my site, I query a set of 5 vertices to show to the user (the set changes frequently). Let's call these vertices A, B, C, D, E. The user orders them (i.e. ranks these movies from most to least favorite). For example, he might order them D, B, A, C, E. I then need to update the graph as follows:
Graph[D][B] += 1
Graph[B][A] += 1
Graph[A][C] += 1
Graph[C][E] += 1
So the count Graph[V1][V2] ends up representing how many users ranked (movie) V1 directly above (movie) V2. When the data is collected, I can do all kinds of offline graph analysis, e.g. find the sinks and sources of the graph to identify the most and least favorite movies.
The problem is: how do I store a directed, weighted, complete graph in the datastore? The obvious answer is this:
class Vertex(db.Model):
    name = db.StringProperty()

class Edge(db.Model):
    better = db.ReferenceProperty(Vertex, collection_name='better_set')
    worse = db.ReferenceProperty(Vertex, collection_name='worse_set')
    count = db.IntegerProperty()
But the problem I see with this is that I have to make 4 separate ugly queries along the lines of:
edge = Edge.all().filter('better =', vertex1).filter('worse =', vertex2).get()
Then I need to update and put() the new edges in a fifth query.
A more efficient (fewer queries) but hacky implementation would be this one, which uses pairs of lists to simulate a dict:
class Vertex(db.Model):
    name = db.StringProperty()
    better_keys = db.ListProperty(db.Key)
    better_values = db.ListProperty(int)
So to add a score saying that A is better than B, I would do:
index = vertexA.better_keys.index(vertexB.key())
vertexA.better_values[index] += 1
Is there a more efficient way to model this?
I solved my own problem with a minor modification to the first design I suggested in my question.
I learned about the key_name argument that lets me set my own key names. So every time I create a new edge, I pass in the following argument to the constructor:
key_name = vertex1.name + ' > ' + vertex2.name
Then, instead of running this query multiple times:
edge = Edge.all().filter('better =', vertex1).filter('worse =', vertex2).get()
I can retrieve the edges easily since I know how to construct their keys. Using the Key.from_path() method, I construct a list of keys that refer to edges. Each key is obtained by doing this:
db.Key.from_path('Edge', vertex1.name + ' > ' + vertex2.name)
I then pass that list of keys to get all the objects in one query.
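Put together, the lookup might look something like this (a sketch; the entity names follow the models above, and db.get() with a list of keys provides the single batch fetch described here; it assumes every edge entity already exists):
# build the keys for every consecutive pair in the user's ranking
ranking = [vertexD, vertexB, vertexA, vertexC, vertexE]
keys = [db.Key.from_path('Edge', v1.name + ' > ' + v2.name)
        for v1, v2 in zip(ranking, ranking[1:])]

# one batch get instead of one query per edge
edges = db.get(keys)  # missing edges would come back as None
for edge in edges:
    edge.count += 1
db.put(edges)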
I am writing a simple Python web application that consists of several pages of business data formatted for the iPhone. I'm comfortable programming Python, but I'm not very familiar with Python "idiom," especially regarding classes and objects. Python's object oriented design differs somewhat from other languages I've worked with. So, even though my application is working, I'm curious whether there is a better way to accomplish my goals.
Specifics: How does one typically implement the request-transform-render database workflow in Python? Currently, I am using pyodbc to fetch data, copying the results into attributes on an object, performing some calculations and merges using a list of these objects, then rendering the output from the list of objects. (Sample code below, SQL queries redacted.) Is this sane? Is there a better way? Are there any specific "gotchas" I've stumbled into in my relative ignorance of Python? I'm particularly concerned about how I've implemented the list of rows using the empty "Record" class.
class Record(object):
    pass

def calculate_pnl(records, node_prices):
    for record in records:
        try:
            # fill RT and DA prices from the hash retrieved above
            if hasattr(record, 'sink') and record.sink:
                record.da = node_prices[record.sink][0] - node_prices[record.id][0]
                record.rt = node_prices[record.sink][1] - node_prices[record.id][1]
            else:
                record.da = node_prices[record.id][0]
                record.rt = node_prices[record.id][1]
            # calculate dependent values: RT-DA and PNL
            record.rtda = record.rt - record.da
            record.pnl = record.rtda * record.mw
        except:
            print sys.exc_info()

def map_rows(cursor, mappings, callback=None):
    records = []
    for row in cursor:
        record = Record()
        for field, attr in mappings.iteritems():
            setattr(record, attr, getattr(row, field, None))
        if not callback or callback(record):
            records.append(record)
    return records
def get_positions(cursor):
    # get the latest position time
    cursor.execute("SELECT latest data time")
    time = cursor.fetchone().time
    hour = eelib.util.get_hour_ending(time)

    # fetch the current positions
    cursor.execute("SELECT stuff FROM atable", (hour))

    # read the rows
    nodes = {}
    def record_callback(record):
        if abs(record.mw) > 0:
            if record.id: nodes[record.id] = None
            return True
        else:
            return False
    records = util.map_rows(cursor, {
        'id': 'id',
        'name': 'name',
        'mw': 'mw'
    }, record_callback)

    # query prices
    for node_id in nodes:
        # RT price
        row = cursor.execute("SELECT price WHERE ? ? ?", (node_id, time, time)).fetchone()
        rt5 = row.lmp if row else None
        # DA price
        row = cursor.execute("SELECT price WHERE ? ? ?", (node_id, hour, hour)).fetchone()
        da = row.da_lmp if row else None
        # update the hash value
        nodes[node_id] = (da, rt5)

    # calculate the position pricing
    calculate_pnl(records, nodes)

    # sort
    records.sort(key=lambda r: r.name)

    # return the records
    return records
The empty Record class and the free-floating function that (generally) applies to an individual Record are a hint that you haven't designed your class properly.
class Record(object):
    """Assuming rtda and pnl must exist."""
    def __init__(self):
        self.da = 0
        self.rt = 0
        self.rtda = 0  # or whatever
        self.pnl = None
        self.sink = None  # Not clear what this is

    def setPnl(self, node_prices):
        # fill RT and DA prices from the hash retrieved above
        # (body follows the calculation from the question's calculate_pnl)
        if self.sink:
            self.da = node_prices[self.sink][0] - node_prices[self.id][0]
            self.rt = node_prices[self.sink][1] - node_prices[self.id][1]
        else:
            self.da = node_prices[self.id][0]
            self.rt = node_prices[self.id][1]
        # calculate dependent values: RT-DA and PNL
        self.rtda = self.rt - self.da
        self.pnl = self.rtda * self.mw
Now your calculate_pnl(records, node_prices) is simpler and uses the object properly.
def calculate_pnl(records, node_prices):
    for record in records:
        record.setPnl(node_prices)
The point isn't to trivially refactor the code in small ways.
The point is this: A Class Encapsulates Responsibility.
Yes, an empty-looking class is usually a problem. It means the responsibilities are scattered somewhere else.
A similar analysis holds for the collection of records. This is more than a simple list, since the collection -- as a whole -- has operations it performs.
The "Request-Transform-Render" isn't quite right. You have a Model (the Record class). Instances of the Model get built (possibly because of a Request.) The Model objects are responsible for their own state transformations and updates. Perhaps they get displayed (or rendered) by some object that examines their state.
It's that "Transform" step that often violates good design by scattering responsibility all over the place. "Transform" is a hold-over from non-object design, where responsibility was a nebulous concept.
Have you considered using an ORM? SQLAlchemy is pretty good, and Elixir makes it beautiful. It can really reduce the amount of boilerplate code needed to deal with databases. Also, a lot of the gotchas mentioned have already shown up and been dealt with by the SQLAlchemy developers.
Depending on how much you want to do with the data, you may not need to populate an intermediate object. The cursor's description attribute will give you the column names, and a bit of introspection will let you make a dictionary of col-name:value pairs for each row.
You can pass the dictionary to the % operator. The docs for the odbc module will explain how to get at the column metadata.
This snippet of code to shows the application of the % operator in this manner.
>>> a={'col1': 'foo', 'col2': 'bar', 'col3': 'wibble'}
>>> 'Col1=%(col1)s, Col2=%(col2)s, Col3=%(col3)s' % a
'Col1=foo, Col2=bar, Col3=wibble'
>>>
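A minimal sketch of that introspection step, assuming a DB-API cursor such as pyodbc's (cursor.description holds the column metadata; the query and the template keys are made up for illustration):
cursor.execute("SELECT stuff FROM atable")
columns = [col[0] for col in cursor.description]  # column names
for row in cursor:
    row_dict = dict(zip(columns, row))  # col-name:value pairs
    print 'Id=%(id)s, Name=%(name)s, MW=%(mw)s' % row_dict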
Using an ORM for an iPhone app might be a bad idea because of performance issues; you want your code to be as fast as possible, so you can't avoid some boilerplate code. If you are considering an ORM, besides SQLAlchemy I'd recommend Storm.