What is the best way to bulk create in a through table? - python

This is what my through table looks like:
class ThroughTable(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    society = models.ForeignKey(Society, on_delete=models.CASCADE)
I am given two lists containing the ids of the two kinds of model objects that have to be added to my through table.
user_list = [1,2,5,6,9]
society_list = [1,2,3,4]
I want to create an entry in the through table for each possible pair from these two lists.
I was thinking about using nested loops to iterate and create the objects in the through table, but that seems very naive and has a complexity of O(n*m).
Is there a better approach to this?

Django provides a bulk_create() method for making many entries in the database at once. It takes an optional batch_size argument: with millions of records you cannot insert everything in one go, so you split the records into batches before writing them to the database.
ThroughTable.objects.bulk_create(items, batch_size=batch_size)

From the docs:
bulk_create(objs, batch_size=None, ignore_conflicts=False)
This method inserts the provided list of objects into the database in an efficient manner (generally only 1 query, no matter how many objects there are):
For your case, first build a list with all the possible combinations and then save it:
items = []
for user in user_list:
    for society in society_list:
        # The lists contain primary keys, so assign them via the *_id attributes.
        items.append(ThroughTable(user_id=user, society_id=society))
ThroughTable.objects.bulk_create(items)
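The same Cartesian product can also be built with itertools.product from the standard library; a minimal sketch, assuming the ThroughTable model above:
from itertools import product

# product() yields every (user_id, society_id) pair from the two lists.
items = [
    ThroughTable(user_id=u, society_id=s)
    for u, s in product(user_list, society_list)
]
ThroughTable.objects.bulk_create(items)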

Related

Django storing a lot of data in table

Right now, I use this code to save the data to the database:
for i in range(len(companies)):
    for j in range(len(final_prices)):
        linechartdata = LineChartData()
        linechartdata.foundation = company  # refers to a foreign key of a different model
        linechartdata.date = finald[j]
        linechartdata.price = finalp[j]
        linechartdata.save()
Now len(companies) can vary from 3 to 50, and len(final_prices) can be anywhere between 5000 and 10000. I know it's a very inefficient way to store the data and it takes a lot of time. What should I do to make it efficient and less time-consuming?
If you really need to store them in the database, you might check bulk_create. From the docs:
This method inserts the provided list of objects into the database in an efficient manner (generally only 1 query, no matter how many objects there are):
Although I have never personally used it with that many objects, the docs say it can handle them. It should make your code far more efficient in terms of database hits than calling save() once per row.
Basically, try this: create the list of objects (without saving them), then use bulk_create. Like this:
arr = []
for i in range(len(companies)):
    for j in range(len(final_prices)):
        arr.append(
            LineChartData(
                foundation=company,
                date=finald[j],
                price=finalp[j],
            )
        )
LineChartData.objects.bulk_create(arr)
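With up to 50 companies times 10,000 prices this can be around 500,000 rows, so it may also be worth passing batch_size to keep each INSERT statement a manageable size; the value below is only an illustrative guess:
# Split the half-million-row insert into chunks of 5000 rows per query.
LineChartData.objects.bulk_create(arr, batch_size=5000)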

How to efficiently fetch objects after created using bulk_create function of Django ORM?

I have to insert multiple objects into a table. There are two ways to do that:
1) Insert each one using save(). But in this case there will be n SQL queries for n objects.
2) Insert all of them together using bulk_create(). In this case there will be one SQL query for n objects.
Clearly the second option is better, and hence I am using it. Now, the problem with bulk_create is that it does not return the ids of the inserted objects, so they cannot be used further to create objects of other models that have a foreign key to the created objects.
To overcome this, we need to fetch the objects created by bulk_create.
Now the question is: assuming, as in my situation, there is no way to uniquely identify the created objects, how do we fetch them?
Currently I am maintaining a timestamp to fetch them, something like this:
import datetime

my_objects = []
# Timestamp to be used for fetching the created objects
time_stamp = datetime.datetime.now()
# Build the list of instantiated objects
for obj_data in obj_data_list:
    my_objects.append(MyModel(**obj_data))
# Bulk-insert the instantiated objects into the DB
MyModel.objects.bulk_create(my_objects)
# Use the timestamp to fetch the created objects
MyModel.objects.filter(created_at__gte=time_stamp)
Now this works well, but it will fail in one case.
If, at the time of bulk-creating these objects, some more objects are created from somewhere else, then those objects will also be fetched by my query, which is not desired.
Can someone come up with a better solution?
As bulk_create will not set the primary keys on the created objects, you'll have to supply the keys yourself.
This process is simple if you are not using the default generated primary key, which is an AutoField.
If you are sticking with the default, you'll need to wrap your code in an atomic transaction and supply the primary keys yourself. This way you'll know which records were inserted.
from django.db import transaction

inserted_ids = []
with transaction.atomic():
    my_objects = []
    # Start numbering after the current highest primary key.
    max_id = int(MyModel.objects.latest('pk').pk)
    id_count = max_id
    for obj_data in obj_data_list:
        id_count += 1
        obj_data['id'] = id_count
        inserted_ids.append(obj_data['id'])
        my_objects.append(MyModel(**obj_data))
    MyModel.objects.bulk_create(my_objects)
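Note that this assumes no other writer inserts rows concurrently: the transaction keeps the batch atomic, but it does not by itself reserve the id range. On PostgreSQL you could take an explicit table lock first; a sketch, where the table name myapp_mymodel is a hypothetical placeholder:
from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        # Hypothetical table name; EXCLUSIVE MODE blocks concurrent writes
        # (but not reads) until the transaction commits.
        cursor.execute('LOCK TABLE myapp_mymodel IN EXCLUSIVE MODE')
    # ... compute the ids and call bulk_create as above ...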
As you already know.
If the model’s primary key is an AutoField it does not retrieve and
set the primary key attribute, as save() does.
The way you're doing it is the way people usually do it.
An alternative that is better in some cases:
my_ids = list(MyModel.objects.values_list('id', flat=True))  # force evaluation before the insert
objs = MyModel.objects.bulk_create(my_objects)
new_objs = MyModel.objects.exclude(id__in=my_ids).values_list('id', flat=True)
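Worth noting: on PostgreSQL (since Django 1.10), bulk_create does set the primary key attribute on the objects it returns, so on that backend no extra query is needed at all:
objs = MyModel.objects.bulk_create(my_objects)
new_ids = [obj.pk for obj in objs]  # pk is populated on PostgreSQL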

Retrieve all related records using values_list or similar

I have 3 Models: Album, Picture and PictureFile, linked by Picture.album and Picture.pic_file
I'd like to do something like the following, and end up with a list of PictureFile records:
pic_files = Picture.objects.filter(album_id=1).exclude(pic_file=None).values_list('pic_file', flat=True)
The above just seems to spit out a list of pic_file ids, however. I'm wondering if Django provides a built-in way to handle this, or if I simply need to loop through each and construct the list myself.
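One built-in way, for reference: query PictureFile directly and follow the reverse relation from Picture. A sketch, assuming the default reverse query name picture for the Picture.pic_file foreign key:
# Returns PictureFile objects (not ids) for pictures in album 1.
pic_files = PictureFile.objects.filter(picture__album_id=1)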

Remove rows with same ID from two lists in Django

I have a list of objects called checkins that lists all the times a user has checked into something, and another set of objects called flagged_checkins, which holds the check-ins the user has flagged. Both reference a third table via a location_id.
I'd like to take the two lists of objects and remove any of the checkins that have a location_id present in flagged_checkins.
How do I compare these sets and remove the rows from checkins?
As per this SO question:
checkins.objects.filter(location_id__in=list(flagged_checkins.objects.values_list('location_id', flat=True))).delete()
If you are talking about a queryset, then you can definitely try:
checkins.objects.exclude(location_id__in=list(flagged_checkins.objects.values_list('location_id', flat=True)))
This removes the objects based on your criteria, but only from the queryset, not at the database level.

Filter and sort music info on Google App Engine

I've enjoyed building out a couple of simple applications on GAE, but now I'm stumped about how to architect a music collection organizer on the App Engine. In brief, I can't figure out how to filter on multiple properties while sorting on another.
Let's assume the core model is an Album that contains several properties, including:
Title
Artist
Label
Publication Year
Genre
Length
List of track names
List of moods
Datetime of insertion into database
Let's also assume that I would like to filter the entire collection using those properties, and then sort the results by one of:
Publication year
Length of album
Artist name
When the info was added into the database
I don't know how to do this without running into the exploding index conundrum. Specifically, I'd love to do something like:
Albums.all().filter('publication_year <', 1980).order('artist_name')
I know that's not possible, but what's the workaround?
This seems like a fairly general type of application. The music albums could be restaurants, bottles of wine, or hotels. I have a collection of items with descriptive properties that I'd like to filter and sort.
Is there a best practice data model design that I'm overlooking? Any advice?
There are a couple of options here: you can filter as best as possible and then sort the results in memory, as Alex suggests, or you can rework your data structures to use equality filters instead of inequality filters.
For example, assuming you only want to filter by decade, you can add a field encoding the decade in which the album was released. To find everything before or after a decade, do an IN query for the decades you want to span. This will require one underlying query per decade included, but if the number of records is large, it can still be cheaper than fetching all the results and sorting them in memory.
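A minimal sketch of that decade-bucket idea under the old GAE db API; the decade property and the values passed to the filter are illustrative assumptions:
class Album(db.Model):
    publication_year = db.IntegerProperty()
    decade = db.IntegerProperty()  # set at write time, e.g. (publication_year // 10) * 10
    artist_name = db.StringProperty()

# 'Everything before 1980' becomes an IN query over the decades to span;
# IN and equality filters can be combined with a sort on another property,
# so the order() no longer needs an exploding index.
albums = Album.all().filter('decade IN', [1950, 1960, 1970]).order('artist_name')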
Since storage is cheap, you could create your own ListProperty-based index files with key_names that reflect the sort criteria:
import time

from google.appengine.ext import db

class album_pubyear_List(db.Model):
    words = db.StringListProperty()

class album_length_List(db.Model):
    words = db.StringListProperty()

class album_artist_List(db.Model):
    words = db.StringListProperty()

class Album(db.Model):
    # blah...

    def save(self):
        super(Album, self).save()
        # You could do this at save time, or batch it and do it
        # with a cron job or the task queue.
        words = []
        for field in ["title", "artist", "label", "genre", ...]:
            words.append("%s:%s" % (field, getattr(self, field)))
        word_records = []
        now = repr(time.time())
        word_records.append(album_pubyear_List(parent=self, key_name="%s_%s" % (self.pubyear, now), words=words))
        word_records.append(album_length_List(parent=self, key_name="%s_%s" % (self.album_length, now), words=words))
        word_records.append(album_artist_List(parent=self, key_name="%s_%s" % (self.artist_name, now), words=words))
        db.put(word_records)
Now, when it's time to search, you build an appropriate WHERE clause and query the matching index model:
where = "WHERE words = '%s:%s' AND words = '%s:%s'" % (field_a, value_a, field_b, value_b)
aModel = "album_pubyear_List"  # or any of the other key_name-sorted word-list models
indexes = db.GqlQuery("SELECT __key__ FROM %s %s" % (aModel, where))
keys = [k.parent() for k in indexes[offset:numresults + 1]]  # +1 for pagination
object_list = db.get(keys)  # returns a list of Albums sorted by key_name
As you say, you can't have an inequality condition on one field and order by another (or inequalities on two fields, etc.). The workaround is simply to use the "best" inequality condition to get the data into memory (where "best" means the one expected to yield the least data), and then refine and order it further in Python code in your application.
Python's list comprehensions (and other forms of loops, etc.), the list sort method and the sorted built-in function, the itertools module in the standard library, and so on, all help to make these kinds of tasks quite simple to perform in Python itself.
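For instance, a minimal sketch of that pattern using the query from the question (the fetch limit is an arbitrary assumption):
# Apply the single allowed inequality in the datastore, then sort in memory.
albums = Albums.all().filter('publication_year <', 1980).fetch(1000)
albums.sort(key=lambda a: a.artist_name)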
