Right now, I use this code to save the data to the database:
for i in range(len(companies)):
    for j in range(len(final_prices)):
        linechartdata = LineChartData()
        linechartdata.foundation = companies[i]  # this refers to a foreign key of a different model
        linechartdata.date = finald[j]
        linechartdata.price = finalp[j]
        linechartdata.save()
Now len(companies) can vary from 3 to 50 and len(final_prices) can vary from somewhere between 5000 and 10000. I know it's a very inefficient way to store the data in the database and it takes a lot of time. What should I do to make it more efficient and less time-consuming?
If you really need to store them in the database, you might check bulk_create. From the docs:
This method inserts the provided list of objects into the database in an efficient manner (generally only 1 query, no matter how many objects there are):
Although I have never personally used it for that many objects, the docs say it can handle them. This should make your code more efficient in terms of database hits, compared to calling save() multiple times.
Basically, give this a try: create a list of objects (without saving them) and then use bulk_create. Like this:
arr = []
for i in range(len(companies)):
    for j in range(len(final_prices)):
        arr.append(
            LineChartData(
                foundation=companies[i],
                date=finald[j],
                price=finalp[j],
            )
        )
LineChartData.objects.bulk_create(arr)
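With up to 50 companies and up to 10,000 prices each, arr can hold several hundred thousand objects, so it may also be worth passing batch_size to split the insert into smaller queries. A minimal sketch (the value 1000 is just an illustrative choice, not something from the docs):
# Insert in chunks of 1000 rows per query instead of one huge statement;
# tune the number for your database and row size.
LineChartData.objects.bulk_create(arr, batch_size=1000)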
This is what my through table looks like:
class ThroughTable(models.Model):
    user = models.ForeignKey(User)
    society = models.ForeignKey(Society)
I am getting 2 lists containing ids of 2 model objects, which have to be added in my through table.
user_list = [1,2,5,6,9]
society_list = [1,2,3,4]
Here I want to create the entry in Through table for each possible pair in these 2 lists.
I was thinking about using nested loops to iterate and create the objects in the through table, but it seems very naive and has a complexity of n*n.
Is there a better approach to solve this issue?
Django provides a bulk_create() method to insert records into the database. It takes an optional batch_size argument: if you have millions of records, you cannot insert them all in one go, so you break the records into batches before hitting the database.
ThroughTable.objects.bulk_create(items, batch_size)
From the docs:
bulk_create(objs, batch_size=None, ignore_conflicts=False)
This method inserts the provided list of objects into the database in an efficient manner (generally only 1 query, no matter how many objects there are):
For your case, first create a list with all the possible combinations and then save it.
items = []
for user_id in user_list:
    for society_id in society_list:
        item = ThroughTable(user_id=user_id, society_id=society_id)
        items.append(item)
ThroughTable.objects.bulk_create(items)
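The nested loops are unavoidable in the sense that every pair really is needed, but itertools.product keeps the code compact and batch_size caps the size of each query. A sketch (the batch size of 1000 is an illustrative assumption):
from itertools import product

# One unsaved ThroughTable instance per (user_id, society_id) pair,
# inserted in chunks so no single query gets too large.
items = [
    ThroughTable(user_id=u, society_id=s)
    for u, s in product(user_list, society_list)
]
ThroughTable.objects.bulk_create(items, batch_size=1000)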
I have to insert multiple objects into a table, and there are two ways to do that:
1) Insert each one using save(). In this case there will be n SQL queries for n objects.
2) Insert all of them together using bulk_create(). In this case there will be one SQL query for n objects.
Clearly the second option is better, so that is what I am using. Now the problem with bulk_create is that it does not return the ids of the inserted objects, so they cannot be used further to create objects of other models that have a foreign key to the created objects.
To overcome this, we need to fetch the objects created by bulk_create.
Now the question is: assuming, as in my situation, there is no way to uniquely identify the created objects, how do we fetch them?
Currently I am maintaining a timestamp to fetch them, something like this:
my_objects = []
# Timestamp to be used for fetching the created objects
time_stamp = datetime.datetime.now()
# Creating the list of instantiated (unsaved) objects
for obj_data in obj_data_list:
    my_objects.append(MyModel(**obj_data))
# Bulk inserting the instantiated objects into the DB
MyModel.objects.bulk_create(my_objects)
# Using the timestamp to fetch the created objects
MyModel.objects.filter(created_at__gte=time_stamp)
Now this works well, but it will fail in one case.
If, at the time of bulk-creating these objects, some more objects are created from somewhere else, then those objects will also be fetched by my query, which is not desired.
Can someone come up with a better solution?
As bulk_create will not set the primary keys on the created objects, you'll have to supply the keys yourself.
This process is simple if you are not using the default generated primary key, which is an AutoField.
If you are sticking with the default, you'll need to wrap your code into an atomic transaction and supply the primary key yourself. This way you'll know what records are inserted.
from django.db import transaction

inserted_ids = []

with transaction.atomic():
    my_objects = []
    max_id = int(MyModel.objects.latest('pk').pk)
    id_count = max_id

    for obj_data in obj_data_list:
        id_count += 1
        obj_data['id'] = id_count
        inserted_ids.append(obj_data['id'])
        my_objects.append(MyModel(**obj_data))

    MyModel.objects.bulk_create(my_objects)

# Equivalent to the ids collected in the loop above
inserted_ids = range(max_id + 1, id_count + 1)
As you already know:
If the model’s primary key is an AutoField it does not retrieve and
set the primary key attribute, as save() does.
The way you're doing it is how people usually do it.
In some cases, though, the following approach works better:
# Snapshot the existing ids before the insert (list() forces evaluation now)
my_ids = list(MyModel.objects.values_list('id', flat=True))
objs = MyModel.objects.bulk_create(my_objects)
# Anything not in the snapshot was just inserted by bulk_create
new_objs = MyModel.objects.exclude(id__in=my_ids).values_list('id', flat=True)
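As a side note: if you are on PostgreSQL with a recent enough Django (1.10+, if I recall correctly), bulk_create sets the primary keys on the objects it returns, which removes the problem entirely. A quick sketch:
objs = MyModel.objects.bulk_create(my_objects)
# On backends that support it (currently PostgreSQL), obj.pk is
# populated after the call; on other backends it remains None.
inserted_ids = [obj.pk for obj in objs]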
I've got a Django 1.1 app that needs to import data from some big JSON files on a daily basis. To give an idea, one of these files is over 100 MB and has 90K entries that are imported into a PostgreSQL database.
The problem I'm experiencing is that it takes a really long time for the data to be imported, i.e. on the order of hours. I would have expected it to take some time to write that number of entries to the database, but certainly not that long, which makes me think I'm doing something inherently wrong. I've read similar Stack Exchange questions, and the solutions proposed suggest using the transaction.commit_manually or transaction.commit_on_success decorators to commit in batches instead of on every .save(), which I'm already doing.
As I say, I'm wondering if I'm doing anything wrong (e.g. are the batches I commit too big? too many foreign keys?...), or whether I should just move away from Django models for this function and use the DB API directly. Any ideas or suggestions?
Here are the basic models I'm dealing with when importing data (I've removed some of the fields from the original code for the sake of simplicity):
class Template(models.Model):
    template_name = models.TextField(_("Name"), max_length=70)
    sourcepackage = models.TextField(_("Source package"), max_length=70)
    translation_domain = models.TextField(_("Domain"), max_length=70)
    total = models.IntegerField(_("Total"))
    enabled = models.BooleanField(_("Enabled"))
    priority = models.IntegerField(_("Priority"))
    release = models.ForeignKey(Release)

class Translation(models.Model):
    release = models.ForeignKey(Release)
    template = models.ForeignKey(Template)
    language = models.ForeignKey(Language)
    translated = models.IntegerField(_("Translated"))
And here's the bit of code that seems to take ages to complete:
@transaction.commit_manually
def add_translations(translation_data, lp_translation):
    releases = Release.objects.all()
    # There are 5 releases
    for release in releases:
        # translation_data has about 90K entries
        # this is the part that takes a long time
        for lp_translation in translation_data:
            try:
                language = Language.objects.get(
                    code=lp_translation['language'])
            except Language.DoesNotExist:
                continue
            translation = Translation(
                template=Template.objects.get(
                    sourcepackage=lp_translation['sourcepackage'],
                    template_name=lp_translation['template_name'],
                    translation_domain=lp_translation['translation_domain'],
                    release=release),
                translated=lp_translation['translated'],
                language=language,
                release=release,
            )
            translation.save()
        # I realize I should commit every n entries
        transaction.commit()
        # I've also got another bit of code to fill in some data I'm
        # not getting from the json files
        # Add missing templates
        languages = Language.objects.filter(visible=True)
        languages_total = len(languages)
        for language in languages:
            templates = Template.objects.filter(release=release)
            for template in templates:
                try:
                    translation = Translation.objects.get(
                        template=template,
                        language=language,
                        release=release)
                except Translation.DoesNotExist:
                    translation = Translation(template=template,
                                              language=language,
                                              release=release,
                                              translated=0,
                                              untranslated=0)
                    translation.save()
        transaction.commit()
Going through your app and processing every single row is a lot slower than loading the data directly into the server, even with optimized code. And inserting or updating one row at a time is, again, a lot slower than processing everything at once.
If the import files are available locally on the server you can use COPY. Otherwise you could use the meta-command \copy in the standard interface psql. You mention JSON; for this to work, you would have to convert the data to a suitable flat format like CSV.
If you just want to add new rows to a table:
COPY tbl FROM '/absolute/path/to/file' (FORMAT csv);
Or if you want to INSERT / UPDATE some rows:
First off: Use enough RAM for temp_buffers (at least temporarily, if you can) so the temp table does not have to be written to disk. Be aware that this has to be done before accessing any temporary tables in this session.
SET LOCAL temp_buffers='128MB';
The in-memory representation takes somewhat more space than the on-disk representation of the data. So for a 100 MB JSON file ... minus the JSON overhead, plus some Postgres overhead, 128 MB may or may not be enough. But you don't have to guess, just do a test run and measure it:
SELECT pg_size_pretty(pg_total_relation_size('tmp_x'));
Create the temporary table:
CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);
Or, to just duplicate the structure of an existing table:
CREATE TEMP TABLE tmp_x AS SELECT * FROM tbl LIMIT 0;
Copy values (should take seconds, not hours):
COPY tmp_x FROM '/absolute/path/to/file' (FORMAT csv);
From there INSERT / UPDATE with plain old SQL. As you are planning a complex query, you may even want to add an index or two on the temp table and run ANALYZE:
ANALYZE tmp_x;
For instance, to update existing rows, matched by id:
UPDATE tbl
SET    col_a = tmp_x.col_a
FROM   tmp_x
WHERE  tbl.id = tmp_x.id;
Finally, drop the temporary table:
DROP TABLE tmp_x;
Or have it dropped automatically at the end of the session.
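To tie this back to the Python side, here is a rough sketch of what the import could look like with psycopg2, assuming the tmp_x layout above; the file paths and field names are placeholders, not taken from the question:
import csv
import json

import psycopg2

# Flatten the JSON entries into a CSV file that COPY can read
# (assuming the file is a JSON array of objects).
with open('/path/to/data.json') as f:
    entries = json.load(f)

with open('/tmp/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for entry in entries:
        writer.writerow([entry['id'], entry['val_a'], entry['val_b']])

conn = psycopg2.connect('dbname=mydb')
with conn, conn.cursor() as cur:
    cur.execute("SET LOCAL temp_buffers = '128MB';")
    cur.execute("CREATE TEMP TABLE tmp_x (id int, val_a int, val_b text);")
    # copy_expert streams the file through the client, so it does not
    # need to live on the database server.
    with open('/tmp/data.csv') as f:
        cur.copy_expert("COPY tmp_x FROM STDIN (FORMAT csv)", f)
    # From here, plain SQL against the temp table, e.g. the UPDATE above.
    cur.execute("""
        UPDATE tbl
        SET    col_a = tmp_x.col_a
        FROM   tmp_x
        WHERE  tbl.id = tmp_x.id;
    """)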
I have a few hundred keys, all of the same Model, which I have pre-computed:
candidate_keys = [db.Key(...), db.Key(...), db.Key(...), ...]
Some of these keys refer to actual entities in the datastore, and some do not. I wish to determine which keys do correspond to entities.
It is not necessary to know the data within the entities, just whether they exist.
One solution would be to use db.get():
keys_with_entities = set()
for entity in db.get(candidate_keys):
    if entity:
        keys_with_entities.add(entity.key())
However this procedure would fetch all entity data from the store which is unnecessary and costly.
A second idea is to use a Query with an IN filter on key_name, manually fetching in chunks of 30 to fit the requirements of the IN pseudo-filter. However keys-only queries are not allowed with the IN filter.
Is there a better way?
IN filters are not supported directly by the App Engine datastore; they're a convenience that's implemented in the client library. An IN query with 30 values is translated into 30 equality queries on one value each, resulting in 30 regular queries!
Due to round-trip times and the expense of even keys-only queries, I suspect you'll find that simply attempting to fetch all the entities in one batch fetch is the most efficient. If your entities are large, however, you can make a further optimization: For every entity you insert, insert an empty 'presence' entity as a child of that entity, and use that in queries. For example:
foo = AnEntity(...)
foo.put()
presence = PresenceEntity(key_name='x', parent=foo)
presence.put()
...
def exists(keys):
    test_keys = [db.Key.from_path('PresenceEntity', 'x', parent=x) for x in keys]
    return [x is not None for x in db.get(test_keys)]
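Used with the pre-computed candidate_keys from the question, that might look something like this:
# Keep only the candidate keys whose presence entity exists.
flags = exists(candidate_keys)
keys_with_entities = set(
    key for key, present in zip(candidate_keys, flags) if present
)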
At this point, the only solution I have is to manually query by key with keys_only=True, once per key.
for key in candidate_keys:
    if MyModel.all(keys_only=True).filter('__key__ =', key).count():
        keys_with_entities.add(key)
This may in fact be slower than just loading the entities in a batch and discarding them, although the batch load also hammers the Data Received from API quota.
How not to do it (update based on Nick Johnson's answer):
I am also considering adding a parameter specifically for the purpose of being able to scan for it with an IN filter.
class MyModel(db.Model):
    """Some model"""
    # ... all the old stuff
    the_key = db.StringProperty(required=True)  # just a duplicate of the key_name

# ... meanwhile back in the example
for key_batch in batches_of_30(candidate_keys):
    key_names = [x.name() for x in key_batch]
    found_keys = MyModel.all(keys_only=True).filter('the_key IN', key_names)
    keys_with_entities.update(found_keys)
The reason this should be avoided is that an IN filter on a property performs an index scan plus one lookup per item in your IN set. Each lookup takes 160-200 ms, so this very quickly becomes a very slow operation.
I've enjoyed building out a couple simple applications on the GAE, but now I'm stumped about how to architect a music collection organizer on the app engine. In brief, I can't figure out how to filter on multiple properties while sorting on another.
Let's assume the core model is an Album that contains several properties, including:
Title
Artist
Label
Publication Year
Genre
Length
List of track names
List of moods
Datetime of insertion into database
Let's also assume that I would like to filter the entire collection using those properties, and then sort the results by one of:
Publication year
Length of album
Artist name
When the info was added into the database
I don't know how to do this without running into the exploding index conundrum. Specifically, I'd love to do something like:
Albums.all().filter('publication_year <', 1980).order('artist_name')
I know that's not possible, but what's the workaround?
This seems like a fairly general type of application. The music albums could be restaurants, bottles of wine, or hotels. I have a collection of items with descriptive properties that I'd like to filter and sort.
Is there a best practice data model design that I'm overlooking? Any advice?
There's a couple of options here: you can filter as best you can, then sort the results in memory, as Alex suggests, or you can rework your data structures to use equality filters instead of inequality filters.
For example, assuming you only want to filter by decade, you can add a field encoding the decade in which the album was recorded. To find everything before or after a given decade, do an IN query for the decades you want to span. This will require one underlying query per decade included, but if the number of records is large, this can still be cheaper than fetching all the results and sorting them in memory.
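A sketch of what that could look like, assuming a decade IntegerProperty has been added to Album (the property name and the cut-off are illustrative assumptions, not from the question):
# Equality (and IN) filters can be combined with an order on another
# property, so this avoids the inequality-plus-sort restriction.
decades_before_1980 = [1900 + d for d in range(0, 80, 10)]
albums = Album.all().filter('decade IN', decades_before_1980).order('artist_name')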
Since storage is cheap, you could create your own ListProperty-based index files, with key_names that reflect the sort criteria.
class album_pubyear_List(db.Model):
    words = db.StringListProperty()

class album_length_List(db.Model):
    words = db.StringListProperty()

class album_artist_List(db.Model):
    words = db.StringListProperty()

class Album(db.Model):
    blah...

    def save(self):
        super(Album, self).save()
        # you could do this at save time or batch it and do
        # it with a cronjob or taskqueue
        words = []
        for field in ["title", "artist", "label", "genre", ...]:
            words.append("%s:%s" % (field, getattr(self, field)))
        word_records = []
        now = repr(time.time())
        word_records.append(album_pubyear_List(
            parent=self, key_name="%s_%s" % (self.pubyear, now), words=words))
        word_records.append(album_length_List(
            parent=self, key_name="%s_%s" % (self.album_length, now), words=words))
        word_records.append(album_artist_List(
            parent=self, key_name="%s_%s" % (self.artist_name, now), words=words))
        db.put(word_records)
Now, when it's time to search, you create an appropriate WHERE clause and query the appropriate model:
where = "WHERE words = '%s:%s' AND words = '%s:%s'" % (field_a, value_a, field_b, value_b)
aModel = "album_pubyear_List"  # or any of the other key_name-sorted word-list models
indexes = db.GqlQuery("SELECT __key__ FROM %s %s" % (aModel, where))
keys = [k.parent() for k in indexes[offset:numresults + 1]]  # +1 for pagination
object_list = db.get(keys)  # returns a list of Albums sorted by key_name
As you say, you can't have an inequality condition on one field and order by another (or inequalities on two fields, etc.). The workaround is simply to use the "best" inequality condition to get the data into memory (where "best" means the one expected to yield the least data) and then further refine and order it in Python code in your application.
Python's list comprehensions (and other forms of loops, etc.), the list sort method and the sorted built-in function, the itertools module in the standard library, and so on, all help to make these kinds of tasks quite simple to perform in Python itself.
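For the example query in the question, a minimal sketch of that approach might be (the fetch limit of 1000 is an illustrative assumption):
# Apply only the inequality filter in the datastore...
albums = Album.all().filter('publication_year <', 1980).fetch(1000)
# ...then do the ordering (and any further filtering) in memory.
albums.sort(key=lambda a: a.artist_name)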