I have this query that returns a list of student objects:
query = db.session.query(Student).filter(Student.is_deleted == false())
query = query.options(joinedload('project'))
query = query.options(joinedload('image'))
query = query.options(joinedload('student_locator_map'))
query = query.options(subqueryload('attached_addresses'))
query = query.options(subqueryload('student_meta'))
query = query.order_by(Student.student_last_name, Student.student_first_name,
                       Student.student_middle_name, Student.student_grade, Student.student_id)
query = query.filter(filter_column == field_value)
students = query.all()
The query itself does not take much time. The problem is converting all these objects (there can be 5000+) to Python dicts, which takes over a minute with this many objects. Currently, the code loops through the objects and converts each one using to_dict(). I have also tried __dict__, which was much faster, but it seems this does not convert all the related objects.
How can I convert all these Student objects and related objects quickly?
Maybe this will help you...
from collections import defaultdict
from sqlalchemy import inspect

def query_to_dict(student_results):
    # Collect each mapped attribute (columns and loaded relationships) into column-oriented lists
    result = defaultdict(list)
    for obj in student_results:
        instance = inspect(obj)
        for key, x in instance.attrs.items():
            result[key].append(x.value)
    return result

output = query_to_dict(students)
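If you need one dict per Student rather than column-oriented lists, a row-oriented variant of the same inspect() approach might look like this (a sketch, not tested against your models):

from sqlalchemy import inspect

def rows_to_dicts(student_results):
    # One dict per object; attr.value also covers relationships that were eagerly loaded
    return [
        {key: attr.value for key, attr in inspect(obj).attrs.items()}
        for obj in student_results
    ]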
query = query.options(joinedload('attached_addresses').joinedload('address'))
By chaining address joinedload to attached_addresses I was able to significantly speed up the query.
My understanding of why this is the case:
Address objects were not being loaded with the initial query. On every iteration through the loop, the database would get hit to retrieve the Address object. With the chained joined load, the Address objects are now loaded as part of the initial query.
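For reference, here is a minimal sketch of how the chained option fits alongside the other eager loads from the question (string-based loader arguments work on older SQLAlchemy versions; newer versions expect attributes such as Student.attached_addresses):

from sqlalchemy import false
from sqlalchemy.orm import joinedload, subqueryload

query = (
    db.session.query(Student)
    .filter(Student.is_deleted == false())
    .options(
        joinedload('project'),
        joinedload('image'),
        joinedload('student_locator_map'),
        # chained load: Address rows come back with the initial query instead of one query per student
        joinedload('attached_addresses').joinedload('address'),
        subqueryload('student_meta'),
    )
)
students = query.all()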
Thanks to Corley Brigman for the help.
I have this structure in Firestore: many collections whose IDs are user IDs, and inside each of them many documents whose IDs are departure dates. The documents contain the fields "from" and "to" with the airport names.
I want to retrieve the IDs of all collections (the user IDs) that contain the same documents as a chosen input user, to see who shared a flight with this user across all the flights he made.
I'm using Python.
UPDATE: I solved my issue this way.
from datetime import datetime

# db, app and render_template come from the app's Firestore / Flask setup
@app.route('/infos/<string:user_id>/', methods=['GET'])
def user_info(user_id):
    docs = db.collection(f'{user_id}').stream()
    travels = []
    for doc in docs:
        sharing_travellers = []
        tmp = doc.to_dict()
        tmp['date'] = doc.id
        colls = db.collections()
        for coll in colls:
            if coll.id != user_id:
                date = datetime.strptime(doc.id, '%Y-%m-%d')
                query = db.collection(f'{coll.id}').stream()
                for q in query:
                    other_date = datetime.strptime(q.id, '%Y-%m-%d')
                    if abs((date - other_date).days) < 1:
                        json_obj = q.to_dict()
                        if json_obj['from'] == tmp['from'] and json_obj['to'] == tmp['to']:
                            sharing_travellers.append(coll.id)
        tmp['shared'] = sharing_travellers
        travels.append(tmp)
    return render_template('user_info.html', title=user_id, travels=travels)
The only way to read across collections is if those collections have the same name; in that case you could use a collection group query.
Since your collections don't have the same name, you'll have to get the list of collections and then look in each collection separately.
I support Frank's answer, but I want to add that it might be wise to restructure your database to better accommodate this type of situation. Cross-collection searching is limited to collection group queries, which are themselves limited, and other approaches require costly workarounds.
It's often better to have a dedicated collection with those IDs as field values, which you can then query per user or as a whole.
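For illustration, one possible restructuring (the flights collection and its field names are hypothetical, not part of the original schema) keeps every trip in a single top-level collection, which turns the lookup into a plain query instead of a scan over every user's collection. Note that combining several equality filters may require a composite index:

from google.cloud import firestore

db = firestore.Client()

# Hypothetical schema: one top-level 'flights' collection with the user as a field, e.g.
# db.collection('flights').add({'user_id': uid, 'date': '2023-05-01', 'from': 'JFK', 'to': 'LHR'})

def sharing_travellers(user_id, date, origin, destination):
    # Everyone on the same route on the same day, minus the user themself
    matches = (
        db.collection('flights')
        .where('date', '==', date)
        .where('from', '==', origin)
        .where('to', '==', destination)
        .stream()
    )
    return {doc.get('user_id') for doc in matches if doc.get('user_id') != user_id}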
While importing SQL data into MongoDB, I am merging a few tables into an embedded array, but when I run the code I get a KeyError.
Below is my code.
import pyodbc, json, collections, pymongo, datetime

arrayCol = []

mongoConStr = 'localhost:27017'
sqlConStr = 'DRIVER={MSSQL-NC1311};SERVER=tcp:172.16.1.75,1433;DATABASE=devdb;UID=qauser;PWD=devuser'

mongoConnect = pymongo.MongoClient(mongoConStr)
sqlConnect = pyodbc.connect(sqlConStr)
dbo = mongoConnect.eaedw.ctArrayData
sqlCur = sqlConnect.cursor()

sqlCur.execute('''SELECT M.fldUserId, TRU.intRuleGroupId, TGM.strGroupName
                  FROM TBL_USER_MASTER M
                  JOIN TBL_RULEGROUP_USER TRU ON M.fldUserId = TRU.intUserId
                  JOIN tbl_Group_Master TGM ON TRU.intRuleGroupId = TGM.intGroupId''')

tuples = sqlCur.fetchall()

for tuple in tuples:
    doc = collections.OrderedDict()
    doc['fldUserId'] = tuple.fldUserId
    doc['groups.gid'].append(tuple.intRuleGroupId)
    doc['groups.gname'].append(tuple.strGroupName)
    arrayCol.append(doc)

mongoImp = dbo.insert_many(arrayCol)

sqlCur.close()
mongoConnect.close()
sqlConnect.close()
Here, I am trying to create an embedded array named groups, which should hold gid and gname as a sub-document inside the array.
I get an error on the append calls; the code runs successfully without the embedded array.
Is there a mistake in the array definition?
You can't append to a list that doesn't exist. When you call append on them, doc['groups.gid'] and doc['groups.gname'] don't exist yet, which is what raises the KeyError. Even once you fix that, PyMongo prohibits inserting a document whose keys contain dots, like "groups.gid". I think you intend to do this:
for tuple in tuples:
    doc = collections.OrderedDict()
    doc['fldUserId'] = tuple.fldUserId
    doc['groups'] = collections.OrderedDict([
        ('gid', tuple.intRuleGroupId),
        ('gname', tuple.strGroupName)
    ])
    arrayCol.append(doc)
I'm only guessing, based on your question, at the schema you really want to create.
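If the goal is actually one document per user whose groups array holds all of that user's groups (which is how I read "embedded array"), a sketch of that grouping could look like this, reusing the column names from the question:

import collections

# One document per user; each SQL row adds one entry to that user's 'groups' array.
docs_by_user = collections.OrderedDict()
for row in tuples:
    doc = docs_by_user.setdefault(
        row.fldUserId,
        {'fldUserId': row.fldUserId, 'groups': []},
    )
    doc['groups'].append({'gid': row.intRuleGroupId, 'gname': row.strGroupName})

dbo.insert_many(list(docs_by_user.values()))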
I have a table with the following model:
CREATE TABLE IF NOT EXISTS {} (
    user_id bigint,
    pseudo text,
    importance float,
    is_friend_following bigint,
    is_friend boolean,
    is_following boolean,
    PRIMARY KEY ((user_id), is_friend_following)
);
I also have a table containing my seeds. Those (20) users are the starting point of my graph, so I select their IDs and look them up in the table above to get their followers and friends, and from there I build my graph (networkX).
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        friend_ids = []
        follower_ids = []
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    friend_ids.append(row.friend_follower_id)
                if row.is_follower:
                    follower_ids.append(row.friend_follower_id)
        if friend_ids:
            for friend_id in friend_ids:
                obj.graph.add_edge(seed, friend_id)
        if follower_ids:
            for follower_id in follower_ids:
                obj.graph.add_edge(follower_id, seed)
    return obj
The problem is that building the graph takes too long, and I would like to optimize it.
I've got approximately 5 million rows in my table 'network_table'.
I'm wondering whether it would be faster, instead of doing a query with a WHERE clause per seed, to do a single query on the whole table. Would it fit in memory? Is that a good idea? Is there a better way?
I suspect the real issue may not be the queries but rather the processing time.
I'm wondering whether it would be faster, instead of doing a query with a WHERE clause per seed, to do a single query on the whole table. Would it fit in memory? Is that a good idea? Is there a better way?
There should not be any problem with doing a single query on the whole table if you enable paging (https://datastax.github.io/python-driver/query_paging.html - using fetch_size). Cassandra will return up to fetch_size rows at a time and will fetch additional pages as you read through the result set.
Please note that if the table contains many rows that are not seed-related, a full scan may be slower, since you will also receive rows that do not involve a seed.
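A minimal sketch of that paged full scan, reusing the column names from the question (the contact point, keyspace, table name, and the seeds set are placeholders for whatever your code already has):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# The driver pages transparently: it fetches `fetch_size` rows at a time
# as the loop advances through the result set.
statement = SimpleStatement(
    "SELECT user_id, friend_follower_id, is_friend, is_follower FROM network_table",
    fetch_size=1000,
)
for row in session.execute(statement):
    if row.user_id in seeds and row.friend_follower_id in seeds:
        pass  # add the edge to the graph here, as in the original loop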
Disclaimer: I am part of the team building ScyllaDB, a Cassandra-compatible database.
ScyllaDB recently published a blog post on how to do an efficient full table scan in parallel (http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/), which applies to Cassandra as well. If a full scan is relevant and you can build the graph in parallel, this may help you.
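The idea in that post is to split the token range of the partition key into slices and scan them concurrently. A rough sketch of that approach (the split count, contact point, and table name are placeholders; this assumes the default Murmur3 partitioner's token range):

from concurrent.futures import ThreadPoolExecutor
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

def token_slices(n_splits):
    # Murmur3 tokens span [-2**63, 2**63 - 1]; cut that range into n contiguous slices
    start, stop = -2**63, 2**63 - 1
    step = (stop - start) // n_splits
    edges = [start + i * step for i in range(n_splits)] + [stop]
    return list(zip(edges[:-1], edges[1:]))

def scan_slice(bounds):
    lo, hi = bounds
    query = ("SELECT user_id, friend_follower_id, is_friend, is_follower "
             "FROM network_table WHERE token(user_id) > %s AND token(user_id) <= %s")
    return list(session.execute(query, (lo, hi)))

with ThreadPoolExecutor(max_workers=8) as pool:
    rows = [row for chunk in pool.map(scan_slice, token_slices(32)) for row in chunk]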
It seems like you can get rid of the last two if statements, since you're iterating over data that you have already looped through once:
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    obj.graph.add_edge(seed, row.friend_follower_id)
                elif row.is_follower:
                    obj.graph.add_edge(row.friend_follower_id, seed)
    return obj
This also gets rid of many append operations on lists that you're not using, and should speed up this function.
I have two long lists of objects in Python:
queries_list (list of Query objects) and
results_list (list of Result objects)
I'd like to find the Result objects that are related to each Query via a common field, 'search_id', and then append the related results to that query's results list.
The pseudocode is as below:
for q in queries_list
    for r in results_list
        if q.search_id == r.search_id
            q.results.append(r)
Your pseudocode is almost Python code, but here is a Python variant using filter.
for query in queries_list:
    hasQueryId = lambda result: result.search_id == query.search_id
    query.results.extend(filter(hasQueryId, results_list))
This should result in all your queries' result lists being populated. This is still O(m*n); if you're looking for something more efficient, I'd try sorting the results and queries by id.
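As an alternative to sorting, grouping the results by search_id in a dict brings the matching down to roughly O(m + n); a minimal sketch, assuming both lists are already loaded:

from collections import defaultdict

# Index the results once, then each query does a single dict lookup.
results_by_id = defaultdict(list)
for result in results_list:
    results_by_id[result.search_id].append(result)

for query in queries_list:
    query.results.extend(results_by_id.get(query.search_id, []))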
Your pseudo-code is almost Python. You're just missing colons:
for q in queries_list:
    for r in results_list:
        if q.search_id == r.search_id:
            q.results.append(r)
This is assuming your query objects already have results attributes.
If not, you can create them at runtime:
for q in queries_list:
    for r in results_list:
        if q.search_id == r.search_id:
            try:
                q.results.append(r)
            except AttributeError:
                q.results = [r]
I'm trying to query an NDB model using a list of provided key ID strings. The model has string IDs that are assigned at creation, for example:
objectKey = MyModel(
    id="123456ABC",
    name="An Object"
).put()
Now I can't figure out how to filter on the NDB key IDs with a list. Normally you can use MyModel.property.IN() to query properties:
names = ['An Object', 'Something else', 'etc']
# This query works
query = MyModel.query(MyModel.name.IN(names))
When I try to filter by a list of keys, I can't get it to work:
# This simple get works
object = MyModel.get_by_id("123456ABC")
ids = ["123456ABC", "CBA654321", "etc"]
# These queries DON'T work
query = MyModel.query(MyModel.id.IN(ids))
query = MyModel.query(MyModel.key.id.IN(ids))
query = MyModel.query(MyModel.key.id().IN(ids))
query = MyModel.query(MyModel._properties['id'].IN(ids))
query = MyModel.query(getattr(MyModel, 'id').IN(ids))
...
I always get an error like AttributeError: type object 'MyModel' has no attribute 'id'.
I need to be able to filter by a list of IDs, rather than iterate through each ID in the list (which is sometimes long). How do I do it?
The following should work:
keys = [ndb.Key(MyModel, anid) for anid in ids]
objs = ndb.get_multi(keys)
You can also use urlsafe keys if you have problems using the IDs.
keys = ndb.get_multi([ndb.Key(urlsafe=k) for k in ids])
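One small follow-up, in case some of the IDs might not exist: get_multi returns None for missing keys, so you may want to drop those (a sketch based on the first snippet):

from google.appengine.ext import ndb

keys = [ndb.Key(MyModel, anid) for anid in ids]
objs = [obj for obj in ndb.get_multi(keys) if obj is not None]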