I have a table with the following model:
CREATE TABLE IF NOT EXISTS {} (
    user_id bigint,
    pseudo text,
    importance float,
    is_friend_following bigint,
    is_friend boolean,
    is_following boolean,
    PRIMARY KEY ((user_id), is_friend_following)
);
I also have a table containing my seeds. Those (20) users are the starting point of my graph. I select their IDs and look them up in the table above to get their followers and friends, and from there I build my graph (networkX).
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        friend_ids = []
        follower_ids = []
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    friend_ids.append(row.friend_follower_id)
                if row.is_follower:
                    follower_ids.append(row.friend_follower_id)
        if friend_ids:
            for friend_id in friend_ids:
                obj.graph.add_edge(seed, friend_id)
        if follower_ids:
            for follower_id in follower_ids:
                obj.graph.add_edge(follower_id, seed)
    return obj
The problem is that the time it takes to build the graph is too long and I would like to optimize it.
I've got approximately 5 million rows in my table 'network_table'.
I'm wondering whether it would be faster to do a single query on the whole table instead of one query per seed with a WHERE clause. Will it fit in memory? Is that a good idea? Is there a better way?
I suspect the real issue may not be the queries but rather the processing time.
I'm wondering whether it would be faster to do a single query on the whole table instead of one query per seed with a WHERE clause. Will it fit in memory? Is that a good idea? Is there a better way?
There should not be any problem with doing a single query on the whole table if you enable paging (https://datastax.github.io/python-driver/query_paging.html - using fetch_size). Cassandra will return up to fetch_size rows at a time and will fetch additional results as you read them from the result set.
Please note that if many rows in the table are not seed-related, a full scan may be slower, since you will receive rows that do not involve a seed.
Disclaimer - I am part of the team building ScyllaDB - a Cassandra-compatible database.
ScyllaDB recently published a blog post on how to do an efficient full table scan in parallel (http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/), which applies to Cassandra as well. If a full scan is relevant and you can build the graph in parallel, this may help you.
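For illustration, here is a minimal sketch of such a paged full-table scan with the DataStax Python driver, reusing the obj.session, obj.network_table, obj.graph and obj.seeds names (and the column names of the per-seed query) from the question; the driver pages transparently, fetch_size rows at a time, as the loop consumes the result set:

from cassandra.query import SimpleStatement

def build_graph_from_full_scan(obj):
    # one statement over the whole table; rows are fetched in pages of fetch_size
    query = "SELECT user_id, friend_follower_id, is_friend, is_follower FROM {0}".format(obj.network_table)
    statement = SimpleStatement(query, fetch_size=1000)
    for row in obj.session.execute(statement):
        # only rows connecting two seeds contribute edges, as in the per-seed version
        if row.user_id in obj.seeds and row.friend_follower_id in obj.seeds:
            if row.is_friend:
                obj.graph.add_edge(row.user_id, row.friend_follower_id)
            if row.is_follower:
                obj.graph.add_edge(row.friend_follower_id, row.user_id)
    return obj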
It seems like you can get rid of the last two if statements, since they iterate again over data you have already looped through once:
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    obj.graph.add_edge(seed, row.friend_follower_id)
                elif row.is_follower:
                    obj.graph.add_edge(row.friend_follower_id, seed)
    return obj
This also gets rid of the many append operations on intermediate lists that you don't otherwise use, and should speed up this function.
I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff

    def __str__(self):
        return self.ref

class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField; I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py and I got a "numeric" value in the column in PostgreSQL with the format 'Decimal:( the id )', but there is probably a way around that which would force it to just store the id as a string.
Use some feature of Many-To-Many Fields which I don't know about to more efficiently check for matches
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref values are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]
You should take a look at annotate:
okmems = Member.objects.annotate(
    feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)
Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics on both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra({'feature_id_str': "CAST(feature_id AS VARCHAR)"}) \
              .order_by('-feature_id_str') \
              .values_list('feature_id_str', flat=True).distinct()
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
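As an aside (not part of the original solution): on Django 1.10 and later the same cast can be written without extra(), using the Cast database function. A minimal sketch under that assumption:

from django.db.models import CharField
from django.db.models.functions import Cast

# cast feature_id to text in the database, then pull out the distinct string ids
strids = (ndtmp.annotate(feature_id_str=Cast('feature_id', CharField()))
               .values_list('feature_id_str', flat=True)
               .distinct())
okmems = Member.objects.filter(ref__in=strids)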
I want to achieve something like the map drag search on airbnb (https://www.airbnb.com/s/Paris--France?source=ds&page=1&s_tag=PNoY_mlz&allow_override%5B%5D=)
I am saving the data like this in the datastore:
user.lat = float(lat)
user.lon = float(lon)
user.geoLocation = ndb.GeoPt(float(lat),float(lon))
and whenever I drag the map or zoom in or out, I get the following parameters in my controller:
def get(self):
    """
    This is an ajax function. It gets the place name, north_east, and south_west
    coordinates. Then it fetches the results matching the search criteria and
    creates a result list. After that it returns the result in json format.
    :return: result
    """
    self.response.headers['Content-type'] = 'application/json'
    results = []
    north_east_latitude = float(self.request.get('nelat'))
    north_east_longitude = float(self.request.get('nelon'))
    south_west_latitude = float(self.request.get('swlat'))
    south_west_longitude = float(self.request.get('swlon'))
    points = Points.query(Points.lat < north_east_latitude, Points.lat > south_west_latitude)
    for row in points:
        if row.lon > north_east_longitude and row.lon < south_west_longitude:
            listingdic = {'name': row.name, 'desc': row.description, 'contact': row.contact, 'lat': row.lat, 'lon': row.lon}
            results.append(listingdic)
    self.write(json.dumps({'listings': results}))
My model class is given below
class Points(ndb.Model):
    name = ndb.StringProperty(required=True)
    description = ndb.StringProperty(required=True)
    contact = ndb.StringProperty(required=True)
    lat = ndb.FloatProperty(required=True)
    lon = ndb.FloatProperty(required=True)
    geoLocation = ndb.GeoPtProperty()
I want to improve the query.
Thanks in advance.
No, you cannot improve the solution by checking all 4 conditions in the query because ndb queries do not support inequality filters on multiple properties. From NDB Queries (emphasis mine):
Limitations: The Datastore enforces some restrictions on queries. Violating these will cause it to raise exceptions. For example, combining too many filters, using inequalities for multiple properties, or combining an inequality with a sort order on a different property are all currently disallowed. Also filters referencing multiple properties sometimes require secondary indexes to be configured.
and
Note: As mentioned earlier, the Datastore rejects queries using inequality filtering on more than one property.
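To make the limitation concrete, here is a small illustrative sketch using the question's Points model (nelat, nelon, swlat, swlon stand for the already-parsed floats): the four-sided bounding-box filter is rejected by the datastore, so one axis has to go into the query and the other has to be checked in Python, which is what the question's handler already does.

# Disallowed (per the limitation quoted above): inequality filters on two
# different properties (lat and lon) in one query; this raises an exception
# when the query is executed.
# Points.query(Points.lat < nelat, Points.lat > swlat,
#              Points.lon < nelon, Points.lon > swlon).fetch()

# Allowed: a single inequality property in the query, the second axis
# filtered in Python (mirroring the handler's own check above).
rows = Points.query(Points.lat < nelat, Points.lat > swlat)
results = [row for row in rows
           if row.lon > nelon and row.lon < swlon]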
We are using Google App Engine in Python. I have code that saves a new object to the database and then queries the database to retrieve all the objects. The problem is that the query returns all the objects except the new object I created. Only after refreshing the page do I see the new object. Is there a way to update the query to include all the objects, including the new object I created? Here is my code:
if (self.request.get("add_a_new_feature") == "true"):
    features = Feature.gql("WHERE feature_name=:1 ORDER BY last_modified DESC LIMIT 1", NEW_FEATURE_NAME)  # class Feature inherits from ndb.Model
    if (features.count() == 0):
        new_feature = Feature(feature_name=NEW_FEATURE_NAME)
        new_feature.put()
...
features = Feature.gql("ORDER BY date_created")
if (features.count() > 0):
    features_list = features.fetch()
    for feature in features_list:
        ...  # the list doesn't contain new_feature
As mentioned in the comments - this is an expected behaviour. Take a look at this article for additional information. As a quick fix/hack you could simply get the data from datastore before adding the new entity and then append it to the list.
features = Feature.gql("ORDER BY date_created")
if (self.request.get("add_a_new_feature") == "true"):
    if (Feature.gql("WHERE feature_name=:1 ORDER BY last_modified DESC LIMIT 1", NEW_FEATURE_NAME).count() == 0):
        new_feature = Feature(feature_name=NEW_FEATURE_NAME)
        new_feature.put()
        features.append(new_feature)
...
if (features.count() > 0):
    features_list = features.fetch()
    for feature in features_list:
        ...  # the list now contains the new_feature at the end
Depending on what Entity.gql() returns when there are no results (None or [ ]?) you may need to check whether features is a list before appending. You could also probably avoid the second query since you already have a list of features and could loop through it in Python rather than sending another request to datastore.
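A rough sketch of that last suggestion (illustrative only, reusing the question's names): fetch the list once and do the existence check in Python instead of issuing a second gql query.

features_list = Feature.gql("ORDER BY date_created").fetch()

if self.request.get("add_a_new_feature") == "true":
    # existence check done in Python over the already-fetched list,
    # instead of a second datastore query
    if not any(f.feature_name == NEW_FEATURE_NAME for f in features_list):
        new_feature = Feature(feature_name=NEW_FEATURE_NAME)
        new_feature.put()
        # append locally so the just-written entity shows up in this
        # request's list despite eventual consistency
        features_list.append(new_feature)

for feature in features_list:
    ...  # the list now contains new_feature at the end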
I have data with the following structure:
"ID NAME BIRTH AGE SEX"
=================================
1 Joe 01011980 30 M
2 Rose 12111986 24 F
3 Tom 31121965 35 M
4 Joe 15091990 20 M
I want to use Python + sqlite to store and query data in an easy way. I am trying to design a dict-like object to store and retrieve this information, and the database should also be easy to share with other applications (just a plain database table from their point of view, so pickle- or y_serial-style objects won't fit).
For example:
d = mysqlitedict.open('student_table')
d['1'] = ["Joe","01011980","30","M"]
d['2'] = ["Rose","12111986","24","F"]
This is reasonable because I can use __setitem__() to handle it, with "ID" as the key and the rest as the value of the dict-like object.
The problem is if I want to use another field as the key semantically, take "NAME" for example:
d['Joe'] = ["1","01011980","30","M"]
That will be a problem, because a dict-like object should have a single key/value semantic; since "ID" is already the key, "NAME" cannot override it as the key here.
So my question is: can I design my class so that I can do something like this?
d[key="NAME", "Joe"] = ["1","01011980","30","M"]
d[key="ID",'1'] = ["Joe","01011980","30","M"]
d.update(key = "ID", {'1':["Joe","01011980","30","M"]})
>>>d[key="NAME", 'Joe']
["1","Joe","01011980","30","M"]
["1","Joe","15091990","20","M"]
>>>d.has_key(key="NAME", 'Joe']
True
Any reply will be appreciated!
KC
sqlite is a SQL database and works by far best when used as such (wrapped in SQLAlchemy or whatever if you really insist;-).
Syntax such as d[key="NAME", 'Joe'] is simply illegal Python, no matter how much wrapping and huffing and puffing you may do. A simple class wrapper around the DB connection is easy, but it will never give you that syntax -- something like d.fetch('Joe', key='Name') is reasonably easy to achieve, but indexing has very different syntax from function calls, and even in the latter named arguments must come after positional ones.
If you're willing to renounce your ambitious syntax dreams in favor of sensible Python syntax, and need help designing a class to implement the latter, feel free to ask, of course (I'm off to bed pretty soon, but I'm sure other, later-sleepers will be eager to help;-).
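(For concreteness, a minimal sketch of such a "simple class wrapper around the DB connection", with an illustrative fetch method; the table and column names are assumptions taken from the question, not an existing API:)

import sqlite3

class StudentTable(object):
    _columns = ('ID', 'NAME', 'BIRTH', 'AGE', 'SEX')

    def __init__(self, path):
        self.conn = sqlite3.connect(path)

    def fetch(self, value, key='ID'):
        # key names a column and is interpolated into the SQL,
        # so whitelist it against the known column names first
        if key not in self._columns:
            raise ValueError('unknown column: %r' % (key,))
        c = self.conn.cursor()
        c.execute('SELECT * FROM student_table WHERE %s=?' % key, (value,))
        return c.fetchall()

With that, d.fetch('Joe', key='NAME') returns every matching row as a list of tuples.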
Edit: given the OP's clarifications (in a comment), it looks like a set_key method is acceptable to maintain Python-acceptable syntax (though the semantics of course will still be a tad off, since the OP wants a "dict-like" object which may have non unique keys -- no such thing in Python, really... but, we can approximate it a bit, at least).
So, here's a very first sketch (requires Python 2.6 or better -- just because I've used collections.MutableMapping to get other dict-like methods and .format to format strings; if you're stuck in 2.5, %-formatting of strings and UserDict.DictMixin will work instead):
import collections
import sqlite3

class SqliteDict(collections.MutableMapping):

    @classmethod
    def create(cls, path, columns):
        conn = sqlite3.connect(path)
        conn.execute('DROP TABLE IF EXISTS SqliteDict')
        conn.execute('CREATE TABLE SqliteDict ({0})'.format(','.join(columns.split())))
        conn.commit()
        return cls(conn)

    @classmethod
    def open(cls, path):
        conn = sqlite3.connect(path)
        return cls(conn)

    def __init__(self, conn):
        # looks like for some weird reason you want str, not unicode, when feasible, so...:
        conn.text_factory = sqlite3.OptimizedUnicode
        c = conn.cursor()
        c.execute('SELECT * FROM SqliteDict LIMIT 0')
        self.cols = [x[0] for x in c.description]
        self.conn = conn
        # start with a keyname (==column name) of `ID`
        self.set_key('ID')

    def set_key(self, key):
        self.i = self.cols.index(key)
        self.kn = key

    def __len__(self):
        c = self.conn.cursor()
        c.execute('SELECT COUNT(*) FROM SqliteDict')
        return c.fetchone()[0]

    def __iter__(self):
        # yield the value of the current key column for each row
        c = self.conn.cursor()
        c.execute('SELECT * FROM SqliteDict')
        while True:
            result = c.fetchone()
            if result is None:
                break
            yield result[self.i]

    def __getitem__(self, k):
        c = self.conn.cursor()
        # print 'doing:', 'SELECT * FROM SqliteDict WHERE {0}=?'.format(self.kn)
        # print ' with:', repr(k)
        c.execute('SELECT * FROM SqliteDict WHERE {0}=?'.format(self.kn), (k,))
        result = [list(r) for r in c.fetchall()]
        # print ' resu:', repr(result)
        for r in result:
            del r[self.i]
        return result

    def __contains__(self, k):
        c = self.conn.cursor()
        c.execute('SELECT * FROM SqliteDict WHERE {0}=?'.format(self.kn), (k,))
        return c.fetchone() is not None

    def __delitem__(self, k):
        c = self.conn.cursor()
        c.execute('DELETE FROM SqliteDict WHERE {0}=?'.format(self.kn), (k,))
        self.conn.commit()

    def __setitem__(self, k, v):
        r = list(v)
        r.insert(self.i, k)
        if len(r) != len(self.cols):
            raise ValueError, 'len({0}) is {1}, must be {2} instead'.format(r, len(r), len(self.cols))
        c = self.conn.cursor()
        # print 'doing:', 'REPLACE INTO SqliteDict VALUES({0})'.format(','.join(['?']*len(r)))
        # print ' with:', r
        c.execute('REPLACE INTO SqliteDict VALUES({0})'.format(','.join(['?'] * len(r))), r)
        self.conn.commit()

    def close(self):
        self.conn.close()

def main():
    d = SqliteDict.create('student_table', 'ID NAME BIRTH AGE SEX')
    d['1'] = ["Joe", "01011980", "30", "M"]
    d['2'] = ["Rose", "12111986", "24", "F"]
    print len(d), 'items in table created.'
    print d['2']
    print d['1']
    d.close()
    d = SqliteDict.open('student_table')
    d.set_key('NAME')
    print len(d), 'items in table opened.'
    print d['Joe']

if __name__ == '__main__':
    main()
The class is not meant to be instantiated directly (though it's OK to do so by passing an open sqlite3 connection to a DB with an appropriate SqliteDict table) but through the two class methods create (to make a new DB or wipe out an existing one) and open, which seems to match the OP's desires better than the alternative (have __init__ take a DB file path and an option string describing how to open it, just like modules such as gdbm take -- 'r' to open read-only, 'c' to create or wipe out, 'w' to open read-write -- easy to adjust of course). Among the columns passed (as a whitespace-separated string) to create, there must be one named ID (I haven't given much care to raising "the right" errors for any of the many, many user errors that can occur on building and using instances of this class; errors will occur on all incorrect usage, but not necessarily ones obvious to the user).
Once an instance is opened (or created), it behaves as closely to a dict as possible, except that all values set must be lists of exactly the right length, while the values returned are lists of lists (due to the weird "non-unique key" issue). For example, the above code, when run, prints
2 items in table created.
[['Rose', '12111986', '24', 'F']]
[['Joe', '01011980', '30', 'M']]
2 items in table opened.
[['1', '01011980', '30', 'M']]
The "Pythonically absurd" behavior is that d[x] = d[x] will fail -- because the right hand side is a list e.g. with a single item (which is a list of the column values) while the item assignment absolutely requires a list with e.g. four items (the column values). This absurdity is in the OP's requested semantics, and could be altered only by drastically changing such absurd required semantics again (e.g., forcing item assignment to have a list of lists on the RHS, and using executemany in lieu of plain execute).
Non-uniqueness of keys also makes it impossible to guess whether d[k] = v, for a key k which corresponds to some number n of table entries, is meant to replace one (and if so, which one?!) or all of those entries, or add another new entry instead. In the code above I've taken the "add another entry" interpretation, but with a SQL REPLACE statement that, should the CREATE TABLE be changed to specify some uniqueness constraints, will change some semantics from "add entry" to "replace entries" if and when uniqueness constraints would otherwise be violated.
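Purely to illustrate the executemany variant mentioned above (a sketch, not part of the class itself): a __setitem__ that expects a list of rows on the right-hand side, keeping the same "add another entry" interpretation as the code above:

    def __setitem__(self, k, rows):
        # the RHS is now a list of rows (each lacking the key column),
        # i.e. the same shape that __getitem__ returns
        full_rows = []
        for v in rows:
            r = list(v)
            r.insert(self.i, k)
            if len(r) != len(self.cols):
                raise ValueError, 'len({0}) is {1}, must be {2} instead'.format(r, len(r), len(self.cols))
            full_rows.append(r)
        c = self.conn.cursor()
        # one executemany in lieu of a plain execute per row
        c.executemany('REPLACE INTO SqliteDict VALUES({0})'.format(','.join(['?'] * len(self.cols))), full_rows)
        self.conn.commit()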
I'll let you all play with this code, and reflect on how huge the semantic gap is between Python mappings and relational tables, a gap the OP is desperately keen to bridge (apparently as a side effect of his urge to "use nicer syntax" than SQL affords -- I wonder if he has looked at SQLAlchemy as I recommended).
I think, in the end, the important lesson is what I stated right at the start, in the first paragraph of the part of the answer I wrote yesterday, and I self-quote...:
sqlite is a SQL database and works by far best when used as such (wrapped in SQLAlchemy or whatever if you really insist;-).