Select a batch of rows with SQLAlchemy and MySQL in Python

I have a MySQL database with a few thousand forum posts plus their text. I would like to grab them in batches, say 1000 at a time, and do stuff to them in Python 3.
My single post query looks like:
pquery = session.query(Post).\
    filter(Post.post_id.like(post_id))
How can I change this so that given a post_id, it returns that post and the 999 posts after it?

Use limit and offset:
pquery = (session.query(Post)
          .filter(Post.post_id.like(post_id))
          .limit(1000)
          .offset(the_offset_val))
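For the literal ask (that post and the 999 after it), filtering on post_id with an explicit ordering avoids the offset bookkeeping entirely. A minimal sketch, assuming post_id is a sortable (e.g. integer) primary key:
# Fetch the post with the given id plus the next 999.
batch = (
    session.query(Post)
    .filter(Post.post_id >= post_id)
    .order_by(Post.post_id)
    .limit(1000)
    .all()
)
Without an order_by, MySQL is free to return rows in any order, so plain limit/offset pagination is not guaranteed to be stable between queries.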

Related

Faster Cosmos DB query

I followed the Cosmos DB example using the SQL API, but getting the data is quite slow. I'm trying to get data for one week (around 1M records). Sample code below.
import azure.cosmos.cosmos_client as cosmos_client

client = cosmos_client.CosmosClient(HOST, {'masterKey': KEY})
database = client.get_database_client(DB_ID)
container = database.get_container_client(COLLECTION_ID)

query = """
    SELECT some columns
    FROM c
    WHERE columna = 'a'
    AND columnb >= '100'
"""

result = list(container.query_items(
    query=query, enable_cross_partition_query=True))
My question is: is there any other way to query the data faster? Does putting the query result into a list make it slow? What am I doing wrong here?
There are a couple of things you could do.
Model your data so that you don't have to do a cross-partition query. Cross-partition queries always take more time because they have to touch multiple partitions to find the data. You can learn more here: Model and partition data in Cosmos DB.
When you only need a single item, you can go even faster by using a point read (read_item) instead of a query.
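A point read looks roughly like this (a sketch; item_id and partition_key_value are placeholders for the document's id and its partition key value):
# Point read: fetch one document directly by id and partition key.
# This is far cheaper in request units than an equivalent SQL query.
item = container.read_item(
    item=item_id,
    partition_key=partition_key_value,
)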

How can I get the last N items of a query on PostgreSQL

Question:
How can I get the last 750 records of a query at the database level?
Here is What I have tried:
# Get last 750 applications
apps = MyModel.active_objects.filter(
    **query_params
).order_by('-created_at').values_list('id', flat=True)[:750]
This query fetches all records that hit the query_params filter and only after that returns the last 750 records. I want this work done at the database level, like MongoDB aggregation queries. Is that possible?
Thanks.
Actually, that's not how Django works. The limit part is also done at the database level.
Django docs - Limiting QuerySets:
Generally, slicing a QuerySet returns a new QuerySet – it doesn’t evaluate the query.
To see what query is actually being run in the database you can simply print the query like this:
apps = MyModel.active_objects.filter(
    **query_params
).order_by('-created_at').values_list('id', flat=True)[:750]
print(apps.query)
The result will be something like this:
SELECT * FROM "app_mymodel" WHERE <...> ORDER BY "app_mymodel"."created_at" DESC LIMIT 750

Django ORM: Get latest record for distinct field

I'm having loads of trouble translating some SQL into Django.
Imagine we have some cars, each with a unique VIN, and we record the dates that they are in the shop with some other data. (Please ignore the reason one might structure the data this way. It's specifically for this question. :-) )
class ShopVisit(models.Model):
    vin = models.CharField(...)
    date_in_shop = models.DateField(...)
    mileage = models.DecimalField(...)
    boolfield = models.BooleanField(...)
We want a single query to return a Queryset with the most recent record for each vin and update it!
special_vins = [...]
# Doesn't work
ShopVisit.objects.filter(vin__in=special_vins).annotate(max_date=Max('date_in_shop')).filter(date_in_shop=F('max_date')).update(boolfield=True)
# Distinct doesn't work with update
ShopVisit.objects.filter(vin__in=special_vins).order_by('vin', '-date_in_shop').distinct('vin').update(boolfield=True)
Yes, I could iterate over a queryset. But that's not very efficient and it takes a long time when I'm dealing with around 2M records. The SQL that could do this is below (I think!):
SELECT *
FROM cars
INNER JOIN (
    SELECT MAX(dateInShop) AS maxtime, vin
    FROM cars
    GROUP BY vin
) AS latest_record
    ON (cars.dateInShop = maxtime)
    AND (latest_record.vin = cars.vin)
So how can I make this happen with Django?
This is somewhat untested, and relies on Django 1.11 for Subqueries, but perhaps something like:
latest_visits = Subquery(ShopVisit.objects.filter(vin=OuterRef('vin')).order_by('-date_in_shop').values('id')[:1])
ShopVisit.objects.filter(id__in=latest_visits)
I had a similar model, so went to test it but got an error of:
"This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery"
The SQL it generated looked reasonably like what you want, so I think the idea is sound. PostgreSQL does support LIMIT in subqueries of this kind, so if you use PostgreSQL it should work.
Here's the SQL it produced (trimmed up a bit and replaced actual names with fake ones):
SELECT `mymodel_activity`.*
FROM `mymodel_activity`
WHERE `mymodel_activity`.`id` IN (
    SELECT U0.`id`
    FROM `mymodel_activity` U0
    WHERE U0.`vin` = (`mymodel_activity`.`vin`)
    ORDER BY U0.`date_in_shop` DESC
    LIMIT 1
)
I wonder if you found the solution yourself.
I could only come up with a raw query string. See the Django raw SQL query manual.
UPDATE "yourapplabel_shopvisit"
SET boolfield = True WHERE date_in_shop
IN (SELECT MAX(date_in_shop) FROM "yourapplabel_shopvisit" GROUP BY vin);
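For completeness, the two answers' ideas can be combined into plain ORM code. An untested sketch, assuming Django 1.11+ for OuterRef/Subquery; PostgreSQL accepts the correlated scalar subquery, while MySQL may still reject an UPDATE whose subquery reads the same table:
from django.db.models import OuterRef, Subquery

# id of the most recent visit for each vin, correlated with the outer row
latest_per_vin = (ShopVisit.objects
                  .filter(vin=OuterRef('vin'))
                  .order_by('-date_in_shop')
                  .values('id')[:1])

# flag only the latest visit of each special vin
ShopVisit.objects.filter(
    vin__in=special_vins,
    id=Subquery(latest_per_vin),
).update(boolfield=True)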

How to distinctly bulk update all objects of a Django model without iterating over them in Python?

Basically, can we achieve the same result without doing this?
from my_app import models

for prd, count in x.iteritems():
    models.AggregatedResult.objects.filter(product=prd).update(linked_epp_count=count)
As is evident, x is a dictionary whose keys match AggregatedResult's product field, and each value is the count I wish to set. It takes more than 2-3 minutes to run on a test table with fewer than 15k rows; the real table is currently ~200k rows and is expected to grow up to a million. So, I need help.
The easiest (but not the safest) way is to use a raw SQL query.
Something like:
from django.db import connection, transaction

cursor = connection.cursor()
for prd, count in x.iteritems():
    query = """
        UPDATE {table}
        SET {column} = {value}
        WHERE {condition} = '{condition_value}'""".format(
        table=AggregatedResult._meta.db_table,
        condition='product',
        condition_value=prd,
        column='linked_epp_count',
        value=count,
    )
    cursor.execute(query)
    transaction.commit_unless_managed()
Warning: not tested and extremely vulnerable to SQL injection. Use at your own risk.
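A middle ground (a sketch): keep one UPDATE per product, but let the database driver handle quoting by passing query parameters, batched through the DB-API's executemany. Table and column names are the ones from the question:
from django.db import connection

table = AggregatedResult._meta.db_table  # from the model, not user input
sql = "UPDATE {} SET linked_epp_count = %s WHERE product = %s".format(table)

with connection.cursor() as cursor:
    # one statement, executed once per (count, product) pair
    cursor.executemany(sql, [(count, prd) for prd, count in x.items()])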
An alternative (much safer) approach is to first load the contents of x into a temporary table, then issue just one raw query to do the update. Assuming the temp table for x is temp_prod:
update aggregated_result ar
set linked_epp_count=tp.count
from temp_prod tp
where ar.product = tp.product
How to upload the data from x into the temp table is something I'm not very proficient with, so that part is left to you. :)
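To round that out, here is one way the upload could look. A sketch in PostgreSQL syntax; temp_prod and its columns are the placeholder names from the answer:
from django.db import connection

with connection.cursor() as cursor:
    # temp table lives only for the duration of this connection's session
    cursor.execute(
        "CREATE TEMPORARY TABLE temp_prod (product varchar(255), count integer)"
    )
    cursor.executemany(
        "INSERT INTO temp_prod (product, count) VALUES (%s, %s)",
        list(x.items()),
    )
    # single set-based update, as in the answer above
    cursor.execute("""
        UPDATE aggregated_result ar
        SET linked_epp_count = tp.count
        FROM temp_prod tp
        WHERE ar.product = tp.product
    """)
On Django 2.2+, QuerySet.bulk_update is another option that batches the writes without raw SQL.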

Prefetch a limited number of related objects in Django

I want to display a list of Posts with the 5 latest Comments for each of them. How do I do that with the minimum number of DB queries?
Post.objects.filter(...).prefetch_related('comment_set')
retrieves all comments, while I need only a few of them.
I would go with two queries. First get posts:
posts = list(Post.objects.filter(...))
Now build a raw SQL query with UNION (note: ordering is omitted for simplicity, and MySQL requires each SELECT to be wrapped in parentheses when it carries its own LIMIT inside a UNION):
sql = "(SELECT * FROM comments WHERE post_id = %s LIMIT 5)"
parts = []
for post in posts:
    parts.append(sql % post.id)
query = " UNION ".join(parts)
and run it:
comments = Comments.objects.raw(query)
After that you can loop over comments and group them on the Python side.
I haven't tried it, but it looks ok.
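The Python-side grouping could look like this (a minimal sketch):
from collections import defaultdict

# bucket the fetched comments by the post they belong to
comments_by_post = defaultdict(list)
for comment in comments:
    comments_by_post[comment.post_id].append(comment)

for post in posts:
    post.latest_comments = comments_by_post.get(post.id, [])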
There are other possible solutions for your problem (possibly getting down to one query), have a look at this:
http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/
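For example, on databases with window functions (MySQL 8+, PostgreSQL) this can collapse into a single raw query. A sketch, assuming the comments table has a created_at column to order by:
sql = """
    SELECT * FROM (
        SELECT c.*,
               ROW_NUMBER() OVER (PARTITION BY post_id
                                  ORDER BY created_at DESC) AS rn
        FROM comments c
    ) ranked
    WHERE rn <= 5
"""
comments = Comments.objects.raw(sql)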
