find all documents with same id in different collections in firestore - python

I have this structure in firestore. Many collections with id the user_id and inside each of them many documents with IDs the date of departure. The documents contain the fields "from" and "to" with the airport name.
I want to retrieve all the IDs of collections (the users IDs) that have the same documents of a choosed user in input for see who shared the flight with this user in all the flights he made.
I'm using python.
UPDATE: I solved my issue in this way.
#app.route('/infos/<string:user_id>/', methods=['GET'])
def user_info(user_id):
docs = db.collection(f'{user_id}').stream()
travels = []
for doc in docs:
sharing_travellers = []
tmp = doc.to_dict()
tmp['date'] = doc.id
colls = db.collections()
for coll in colls:
if coll.id != user_id:
date = datetime.strptime(doc.id, '%Y-%m-%d')
query = db.collection(f'{coll.id}').stream()
for q in query:
other_date = datetime.strptime(q.id, '%Y-%m-%d')
if abs((date - other_date).days) < 1:
json_obj = q.to_dict()
if json_obj['from'] == tmp['from'] and json_obj['to'] == tmp['to']:
sharing_travellers.append(coll.id)
tmp['shared'] = sharing_travellers
travels.append(tmp)
return render_template('user_info.html', title=user_id, travels=travels)

The only way to read across collections is if those collections have the same name. If that was the case, you could use a collection group query.
Since your collections don't have the same name though, you'll have to get the list collections, and then look in each collection separately.

I support Frank's answer but I want to elaborate that it might be wise to reformat the structure of your database to better accommodate this type of situation. cross collection searching is limited to collection group queries which are already limited, and additional methods will require costly solutions.
It's often better to have a dedicated collection with those ID as field values of which you can query per user and in a collective group.

Related

How do you iterate over a set or a list in Flask and PyMongo?

I have produced a set of matching IDs from a database collection that looks like this:
{ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
Each ObjectId represents an ID on a collection in the DB.
I got that list by doing this: (which incidentally I also think I am doing wrong, but I don't yet know another way)
# Find all question IDs
question_list = list(mongo.db.questions.find())
all_questions = []
for x in question_list:
all_questions.append(x["_id"])
# Find all con IDs that match the question IDs
con_id = list(mongo.db.cons.find())
con_id_match = []
for y in con_id:
con_id_match.append(y["question_id"])
matches = set(con_id_match).intersection(all_questions)
print("matches", matches)
print("all_questions", all_questions)
print("con_id_match", con_id_match)
And that brings up all the IDs that are associated with a match such as the three at the top of this post. I will show what each print prints at the bottom of this post.
Now I want to get each ObjectId separately as a variable so I can search for these in the collection.
mongo.db.cons.find_one({"con": matches})
Where matches (will probably need to be a new variable) will be one of each ObjectId's that match the DB reference.
So, how do I separate the ObjectId in the matches so I get one at a time being iterated. I tried a for loop but it threw an error and I guess I am writing it wrong for a set. Thanks for the help.
Print Statements:
**matches** {ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
**all_questions** [ObjectId('5feafb52ae1b389f59423a91'), ObjectId('5feafb64ae1b389f59423a92'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feb247f1bb7a1297060342e'), ObjectId('6009b6e42b74a187c02ba9d7'), ObjectId('6010822e08050e32c64f2975'), ObjectId('601d125b3c4d9705f3a9720d')]
**con_id_match** [ObjectId('5feb247f1bb7a1297060342e'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8')]
Usually you can just use find method that yields documents one-by-one. And you can filter documents during iterating with python like that:
# fetch only ids
question_ids = {question['_id'] for question in mongo.db.questions.find({}, {'_id': 1})}
matches = []
for con in mongo.db.cons.find():
con_id = con['question_id']
if con_id in question_ids:
matches.append(con_id)
# you can process matched and loaded con here
print(matches)
If you have huge amount of data you can take a look to aggregation framework

Fast way to convert SQLAlchemy objects to Python dicts

I have this query that returns a list of student objects:
query = db.session.query(Student).filter(Student.is_deleted == false())
query = query.options(joinedload('project'))
query = query.options(joinedload('image'))
query = query.options(joinedload('student_locator_map'))
query = query.options(subqueryload('attached_addresses'))
query = query.options(subqueryload('student_meta'))
query = query.order_by(Student.student_last_name, Student.student_first_name,
Student.student_middle_name, Student.student_grade, Student.student_id)
query = query.filter(filter_column == field_value)
students = query.all()
The query itself does not take much time. The problem is converting all these objects (can be 5000+) to Python dicts. It takes over a minute with this many objects.Currently, the code loops thru the objects and converts using to_dict(). I have also tried _dict__ which was much faster but this does not convert all relational objects it seems.
How can I convert all these Student objects and related objects quickly?
Maybe this will help you...
from collections import defaultdict
def query_to_dict(student_results):
result = defaultdict(list)
for obj in student_results:
instance = inspect(obj)
for key, x in instance.attrs.items():
result[key].append(x.value)
return result
output = query_to_dict(students)
query = query.options(joinedload('attached_addresses').joinedload('address'))
By chaining address joinedload to attached_addresses I was able to significantly speed up the query.
My understanding of why this is the case:
Address objects were not being loaded with the initial query. Every iteration thru the loop, the db would get hit to retrieve the Address object. With joined load, Address objects are now loaded upon initial query.
Thanks to Corley Brigman for the help.

Querying objects using attribute of member of many-to-many

I have the following models:
class Member(models.Model):
ref = models.CharField(max_length=200)
# some other stuff
def __str__(self):
return self.ref
class Feature(models.Model):
feature_id = models.BigIntegerField(default=0)
members = models.ManyToManyField(Member)
# some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField, I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py and I got a "numeric" value in the column in PostgreSQL with format 'Decimal:( the id )', but there's probably some way around that that would force it to just shove the id into a string.
Use some feature of Many-To-Many Fields which I don't know abut to more efficiently check for matches
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref points are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]
You should take a look at annotate:
okmems = Member.objects.annotate(
feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)
Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but as some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics on both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = ndtmp.extra({'feature_id_str':"CAST( \
feature_id AS VARCHAR)"}).order_by( \
'-feature_id_str').values_list('feature_id_str',flat=True).distinct()
# find all members whose ref values can be found in stride
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.

Optimizing a inequality query in ndb over two properties

I'm trying to do a query into a range of valid dates
q = Licence.query(Licence.valid_from <= today,
Licence.valid_to >= today,
ancestor = customer.key
).fetch(keys_only=True)
I know that Datastore doesn't support inequality queries over two propperties.
So I do this:
kl = Licence.query(Licence.valid_from <= today,
ancestor = customer.key
).fetch(keys_only=True)
licences = ndb.get_multi(kl)
for item in licences:
if item.valid_to < today:
licence.remove(item)
But I don't like because I think that I use too much RAM retrieving more entities (or keys) from the Datastore that I finally need.
Any body knows a better way of doing this type of queries?
Is enough to use .filter() before .get()?
Thanks
One solution would be to create a new field, like start_week, which buckets the queries and allows you to use an IN query to filter:
q = Licence.query(Licence.start_week in range(5,30),
Licence.valid_to >= today,
ancestor = customer.key)
Even simpler: Use a projection query to identify the right set of data without fetching full entities. This is faster than a regular query.
it = Licence.query(License.valid_from <= today,
ancestor = customer.key
).iter(projection=[License.valid_to])
keys = [e.key for e in it if e.valid_to >= today]
licenses = ndb.get_multi(keys)

Python Collections.DefaultDict Sort + Output Top X Custom Class Object

Problem: I need to output the TOP X Contributors determined by the amount of messages posted.
Data: I have a collection of the messages posted. This is not a Database/SQL question by the sample query below just give an overview of the code.
tweetsSQL = db.GqlQuery("SELECT * FROM TweetModel ORDER BY date_created DESC")
My Model:
class TweetModel(db.Model):
# Model Definition
# Tweet Message ID is the Key Name
to_user_id = db.IntegerProperty()
to_user = db.StringProperty(multiline=False)
message = db.StringProperty(multiline=False)
date_created = db.DateTimeProperty(auto_now_add=False)
user = db.ReferenceProperty(UserModel, collection_name = 'tweets')
From examples on SO, I was able to find the TOP X Contributors by doing this:
visits = defaultdict(int)
for t in tweetsSQL:
visits[t.user.from_user] += 1
Now I can then sort it using:
c = sorted(visits.iteritems(), key=operator.itemgetter(1), reverse=True)
But the only way now to retrieve the original Objects is to loop through object c, find the KeyName and then look in TweetsSQL for it to obtain the TweetModel Object.
Is there a better way?
*** Sorry I should have added that Count(*) is not available due to using google app engine
[EDIT 2]
In Summary, given a List of Messages, how do I order them by User's message Count.
IN SQL, it would be:
SELECT * FROM TweetModel GROUP BY Users ORDER BY Count(*)
But I cannot do it in SQL and need to duplicate this functionality in code. My starting point is "SELECT * FROM TweetModel"
Use heapq.nlargest() instead of sorted(), for efficiency; it's what it's for. I don't know the answer about the DB part of your question.
I think your job would be a lot easier if you change the SQL query to something like:
SELECT top 100 userId FROM TweetModel GROUP BY userId ORDER BY count(*)
I wouldn't bother with the TweetModel class if you only need the data to solve the stated problem.
Why not invert the dictionary, once you have constructed it, so that the keys are the message counts and the values are the users? Then you can sort the keys and easily get to the users.

Categories

Resources