Django Table with Millions of Rows

I have a project with 2 applications (books and reader).
The books application has a table with 4 million rows and these fields:
book_title = models.CharField(max_length=40)
book_description = models.CharField(max_length=400)
To avoid querying a table with 4 million rows, I am thinking of dividing it by subject: 20 models with 20 tables of 200,000 rows each (book_horror, book_drammatic, etc.).
In the "reader" application, I am thinking of adding these fields:
reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()
So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which selects the appropriate table) and "book_id" (which identifies the book in the table specified by "book_subject").
Is this a good solution to avoid querying a table with 4 million rows?
Is there an alternative solution?

Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.
Indexes are the first step, though it sounds like you've already done this. 4 million rows should be fine for the db to handle with an index.
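To see what an index changes, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever database the project really uses. The table and index names (book, idx_book_title) are assumptions made up for the illustration. In Django the rough equivalent would be setting db_index=True on the field you filter by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE book ("
    "id INTEGER PRIMARY KEY, book_title TEXT, book_description TEXT)"
)
conn.executemany(
    "INSERT INTO book (book_title, book_description) VALUES (?, ?)",
    [("title %d" % i, "description %d" % i) for i in range(10000)],
)


def query_plan(connection):
    """Return the planner's description of a lookup by title."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM book WHERE book_title = ?",
        ("title 42",),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # no index on book_title yet
conn.execute("CREATE INDEX idx_book_title ON book (book_title)")
plan_after = query_plan(conn)   # the planner can now use the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the second plan should mention the index while the first describes a scan of the whole table.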
Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
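As a rough illustration of how unnecessary queries creep in, the sketch below counts SQL statements with sqlite3's trace callback. It is only a stand-in for what the debug toolbar reports, and the schema and data are invented for the example. The first pattern issues one query per reader (the classic N+1 problem); the second fetches the same data with a single join, which in Django would typically come from select_related().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, book_title TEXT);
    CREATE TABLE reader (
        id INTEGER PRIMARY KEY,
        reader_name TEXT,
        book_id INTEGER REFERENCES book (id)
    );
    INSERT INTO book (id, book_title)
        VALUES (1, 'Dracula'), (2, 'Emma'), (3, 'Ulysses');
    INSERT INTO reader (reader_name, book_id)
        VALUES ('ann', 1), ('bob', 2), ('cy', 3);
""")

statements = []
conn.set_trace_callback(statements.append)  # record every statement run


def select_count():
    """Number of SELECT statements recorded so far."""
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


# Pattern 1: one query for the readers, then one more per reader.
readers = conn.execute(
    "SELECT reader_name, book_id FROM reader ORDER BY id"
).fetchall()
naive = []
for name, book_id in readers:
    title = conn.execute(
        "SELECT book_title FROM book WHERE id = ?", (book_id,)
    ).fetchone()[0]
    naive.append((name, title))
naive_count = select_count()

# Pattern 2: the same result from a single joined query.
del statements[:]
joined = conn.execute(
    "SELECT r.reader_name, b.book_title "
    "FROM reader r JOIN book b ON b.id = r.book_id ORDER BY r.id"
).fetchall()
joined_count = select_count()

print(naive_count, joined_count)
```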
Caching is the next step: use memcached for pages or parts of pages that are unchanged for most users. This is where you will see the biggest performance boost for the least effort.
If you really, really need to split up the tables, the latest version of Django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (Postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, such as the author, divided by the first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big, which is why most people here are advising against it!
[edit]
I left out denormalisation! Put common counts, sums, etc. in, for example, the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until Django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you, but well before sharding or horizontal partitioning.
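A minimal sketch of what maintaining a denormalised count yourself can look like, again with sqlite3 standing in for the real database. The book_count column and the add_book helper are hypothetical names chosen for this example, not anything Django provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (
        id INTEGER PRIMARY KEY,
        name TEXT,
        book_count INTEGER NOT NULL DEFAULT 0  -- denormalised counter
    );
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        book_title TEXT,
        author_id INTEGER REFERENCES author (id)
    );
    INSERT INTO author (id, name) VALUES (1, 'Stoker');
""")


def add_book(connection, title, author_id):
    """Insert a book and bump the author's counter in one transaction."""
    with connection:  # commits both statements together, or rolls both back
        connection.execute(
            "INSERT INTO book (book_title, author_id) VALUES (?, ?)",
            (title, author_id),
        )
        connection.execute(
            "UPDATE author SET book_count = book_count + 1 WHERE id = ?",
            (author_id,),
        )


add_book(conn, "Dracula", 1)
add_book(conn, "The Jewel of Seven Stars", 1)

# Reading the counter needs no join and no COUNT(*) over the book table.
stored = conn.execute("SELECT book_count FROM author WHERE id = 1").fetchone()[0]
counted = conn.execute("SELECT COUNT(*) FROM book WHERE author_id = 1").fetchone()[0]
print(stored, counted)
```

The cost is that every code path that adds or removes a book has to go through something like add_book, otherwise the counter drifts away from the real data.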

ForeignKey is implemented as IntegerField in the database, so you save little to nothing at the cost of crippling your model.
Edit:
And for pete's sake, keep it in one table and use indexes as appropriate.

You haven't mentioned which database you're using. Some databases - like MySQL and PostgreSQL - have extremely conservative settings out-of-the-box, which are basically unusable for anything except tiny databases on tiny servers.
If you tell us which database you're using, and what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example) then we may be able to give you some specific tuning advice.
For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.

I'm not familiar with Django, but I have a general understanding of DB.
When you have large databases, it's pretty normal to index them. That way, retrieving data should be pretty quick.
When it comes to associating a book with a reader, you should create another table that links readers to books.
It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.

Are you having performance problems? If so, you might need to add a few indexes.
One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).
If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.

A common approach to this type of problem is sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.
Perhaps you should look into caching with something like memcached. Django supports this quite well.

You can use a server-side datatable. If you can implement one, you will be able to display a table backed by more than 4 million records in less than a second.

Related

Does Django Create Indexes for ManyToMany Fields?

Does Django create database indexes for a ManyToMany field? And if so, does it also do this for the model I provide in "through"?
I just want to make sure the database stays fast when it holds a lot of data.

It does for plain m2m fields - can't tell for sure for "through" tables since I don't have any in my current project and don't have access to other projects ATM, but it still should be the case since this index is necessary for the UNIQUE constraint.
FWIW you can easily check it yourself by looking at the table definitions in your database (in MySQL / MariaDB, using show create table yourtablename).
That being said, db indexes are only useful when the values are mostly distinct, and can actually degrade performance (depending on your db vendor etc.) if that's not the case. For example, if you have a field with only 3 or 4 possible values (i.e. "gender" or something similar), indexing it might not yield the expected results. This should not be an issue for m2m tables since the (table1_id, table2_id) pair is supposed to be unique, but the point is that you should not believe that just adding indexes will automagically boost your performance. Tuning SQL database performance is a trade in and of itself, and actually depends a lot on how you actually use your data.
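Here is a small sketch of the "check it yourself" approach, using sqlite3 so it runs anywhere. The book_readers table is a hand-written approximation of a join table, not the exact schema Django generates. It shows that declaring a UNIQUE constraint on the pair of columns makes the database create a backing index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE reader (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book_readers (
        id INTEGER PRIMARY KEY,
        book_id INTEGER NOT NULL REFERENCES book (id),
        reader_id INTEGER NOT NULL REFERENCES reader (id),
        UNIQUE (book_id, reader_id)
    );
""")

# PRAGMA index_list rows start with (seq, name, unique, ...).
unique_indexes = []
for row in conn.execute("PRAGMA index_list('book_readers')").fetchall():
    name, is_unique = row[1], row[2]
    # PRAGMA index_info rows are (seqno, cid, column_name).
    columns = [
        col[2] for col in conn.execute("PRAGMA index_info('%s')" % name)
    ]
    if is_unique:
        unique_indexes.append((name, columns))

for name, columns in unique_indexes:
    print(name, columns)
```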

Building dynamic SQL queries with psycopg2 and postgresql

I'm not really sure of the best way to go about this, or if I'm just asking for a life that's easier than it should be. I have a backend for a web application and I like to write all of the queries in raw SQL. For instance, to get a specific user profile, or a number of users, I have a query like this:
SELECT accounts.id,
accounts.username,
accounts.is_brony
FROM accounts
WHERE accounts.id IN %(ids)s;
This is really nice because I can get one user profile, or many user profiles with the same query. Now my real query is actually almost 50 lines long. It has a lot of joins and other conditions for this profile.
Let's say I want to get all of the same information from a user profile, but instead of getting a specific user ID I want to get a single random user. I don't think it makes sense to copy and paste 50 lines of code just to modify two lines at the end.
SELECT accounts.id,
accounts.username,
accounts.is_brony
FROM accounts
ORDER BY Random()
LIMIT 1;
Is there some way to use some sort of inheritance in building queries, so that at the end I can modify a couple of conditions while keeping the core similarities the same?
I'm sure I could manage it by concatenating strings and such, but I was curious if there's a more widely accepted method for approaching such a situation. Google has failed me.

The canonical answer is to create a view and use that with different WHERE and ORDER BY clauses in queries.
But, depending on your query and your tables, that might not be a good solution for your special case.
A query that is blazingly fast with WHERE accounts.id IN (1, 2, 3) might perform abysmally with ORDER BY random() LIMIT 1. In that case you'll have to come up with a different query for the second requirement.
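A minimal sketch of the view approach, with sqlite3 standing in for PostgreSQL and psycopg2 so the example is self-contained. The view name profiles and both helper functions are assumptions for illustration. The long shared SELECT is written once in the view, and each query only adds its own WHERE or ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY, username TEXT, is_brony INTEGER
    );
    INSERT INTO accounts (id, username, is_brony)
        VALUES (1, 'ann', 1), (2, 'bob', 0), (3, 'cy', 1);
    -- The long shared part of the query lives here, once.
    CREATE VIEW profiles AS
        SELECT accounts.id AS id,
               accounts.username AS username,
               accounts.is_brony AS is_brony
        FROM accounts;
""")


def profiles_by_id(connection, ids):
    """Fetch the profiles for the given ids from the shared view."""
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT id, username, is_brony FROM profiles "
        "WHERE id IN (%s) ORDER BY id" % placeholders
    )
    return connection.execute(sql, list(ids)).fetchall()


def random_profile(connection):
    """Fetch one random profile from the same view."""
    return connection.execute(
        "SELECT id, username, is_brony FROM profiles "
        "ORDER BY random() LIMIT 1"
    ).fetchone()


print(profiles_by_id(conn, [1, 3]))
print(random_profile(conn))
```

Only the placeholders are interpolated into the SQL string; the values themselves are still passed as parameters.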

Dynamic Scalable Mysql Table

Here is my situation. I am using Python, Django and MySQL for web development.
I have several tables for form posting, whose fields may change dynamically. Here is an example.
Take a table called Article. It has three fields now: id INT, title VARCHAR(50) and author VARCHAR(20). It should be able to store other values dynamically in the future, like source VARCHAR(100) or something else.
How can I implement this gracefully? Is MySQL able to handle it? In any case, I don't want to give up MySQL entirely, because I'm not really familiar with NoSQL databases, and it may be risky to change the technical plan in the middle of development.
Any ideas welcome. Thanks in advance!

You might be interested in this post about FriendFeed's schemaless SQL approach.
Loosely:
Store documents in JSON, extracting the ID as a column but no other columns
Create new tables for any indexes you require
Populate the indexes via code
There are several drawbacks to this approach, such as indexes not necessarily reflecting the actual data. You'll also need to hack up django's ORM pretty heavily. Depending on your requirements you might be able to keep some of your fields as pure DB columns and store others as JSON?
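The three steps above can be sketched roughly as follows. This is not FriendFeed's code, just an illustration under assumptions: sqlite3 instead of MySQL, and made-up names (article, article_by_author, save_article). Documents are stored as JSON text, and a separate index table is filled in by application code.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only the id is a real column; everything else lives in the JSON body.
    CREATE TABLE article (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    -- A separate table acts as the index on the 'author' field.
    CREATE TABLE article_by_author (
        author TEXT NOT NULL, article_id INTEGER NOT NULL
    );
    CREATE INDEX idx_article_by_author ON article_by_author (author);
""")


def save_article(connection, doc):
    """Store the document and populate the index table from code."""
    with connection:
        cur = connection.execute(
            "INSERT INTO article (body) VALUES (?)", (json.dumps(doc),)
        )
        if "author" in doc:
            connection.execute(
                "INSERT INTO article_by_author (author, article_id) "
                "VALUES (?, ?)",
                (doc["author"], cur.lastrowid),
            )
    return cur.lastrowid


def articles_by_author(connection, author):
    """Look documents up through the index table, then decode the JSON."""
    rows = connection.execute(
        "SELECT a.body FROM article a "
        "JOIN article_by_author i ON i.article_id = a.id "
        "WHERE i.author = ? ORDER BY a.id",
        (author,),
    ).fetchall()
    return [json.loads(body) for (body,) in rows]


save_article(conn, {"title": "First", "author": "ann"})
# A new field such as 'source' needs no schema change.
save_article(conn, {"title": "Second", "author": "ann", "source": "example.org"})
save_article(conn, {"title": "Third", "author": "bob"})

found = articles_by_author(conn, "ann")
print(found)
```

If a document is changed without updating article_by_author, the index goes stale, which is the drawback mentioned above.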

I've never actually used it, but django-not-eav looks like the tool for the job.
"This app attempts the impossible: implement a bad idea the right way." I already love it :)
That said, this question sounds like a "rethink your approach" situation, for sure. But yes, sometimes that is simply not an option...

SqlAlchemy look-ahead caching?

Edit: Main Topic:
Is there a way I can force SqlAlchemy to pre-populate the session as far as it can? Synchronize as much state from the database as possible (there will be no DB updates at this point).
I am having some mild performance issues and I believe I have traced it to SqlAlchemy. I'm sure there are changes in my declarative and db-schema that could improve time, but that is not what I am asking about here.
My SqlAlchemy declarative defines 8 classes, my database has 11 tables with only 7 of them holding my real data, and total my database has 800 records (all Integers and UnicodeText). My database engine is sqlite and the actual size is currently 242Kb.
Really, the number of entities is quite small, but many of the table relationships have recursive behavior (5-6 levels deep). My problem starts with the wonderful automagic that SA does for me, and my reluctance to properly extract the data with my own python classes.
I have ORM attribute access scattered across all kinds of iterators, recursive evaluators, right up to my file I/O streams. The access on these attributes is largely non-linear, and every time I do a lookup, my callstack disappears into SqlAlchemy for quite some time, and I am getting lots of singleton queries.
I am using mostly default SA settings (python 2.7.2, sqlalchemy 0.7).
Considering that RAM is not an issue, and that my database is so small (for the time being), is there a way I can just force SqlAlchemy to pre-populate the session as far as it can? I am hoping that if I just load the raw data into memory, then the most I will have to do is chase a few joins dynamically (almost all queries are pretty straightforward).
I am hoping for a 5 minute fix so I can run some reports ASAP. My next month of TODO is likely going to be full of direct table queries and tighter business logic that can pipeline tuples.

A five minute fix for that kind of issue is unlikely, but for many-to-one "singleton" gets there is a simple recipe I use often. Suppose you're loading lots of User objects and they all have many-to-one references to a Category of some kind:
# load all categories, then hold onto them
categories = Session.query(Category).all()
for user in Session.query(User):
    print user, user.category  # no SQL will be emitted for the Category
This works because the query.get() that a many-to-one emits will look in the local identity map for the primary key first.
If you're looking for more caching than that (and have a bit more than five minutes to spare), the same concept can be expanded to also cache the results of SELECT statements in a way that the cache is associated only with the current Session - check out the local_session_caching.py recipe included with the distribution examples.

Database change underneath SQLObject

I'm starting a web project that likely should be fine with SQLite. I have SQLObject on top of it, but thinking long term here: if this project should require a more robust database (e.g. one able to handle high traffic), I will need to have a transition plan ready. My questions:
How easy is it to transition from one DB (SQLite) to another (MySQL, Firebird or PostgreSQL) under SQLObject?
Does SQLObject provide any tools to make such a transition easier? Is it simply a matter of taking the objects I've defined and calling createTable?
What about having multiple SQLite databases instead? E.g. one per visitor group? Does SQLObject provide a mechanism for handling this scenario and if so, what is the mechanism to use?
Thanks,
Sean

3) Is quite an interesting question. In general, SQLite is pretty useless for web-based stuff. It scales fairly well for size, but scales terribly for concurrency, and so if you are planning to hit it with a few requests at the same time, you will be in trouble.
Now your idea in part 3) of the question is to use multiple SQLite databases (eg one per user group, or even one per user). Unfortunately, SQLite will give you no help in this department. But it is possible. The one project I know that has done this before is Divmod's Axiom. So I would certainly check that out.
Of course, it would probably be much easier to just use a good concurrent DB like the ones you mention (Firebird, PG, etc).
For completeness:
1 and 2) It should be straightforward without you actually writing much code. I find SQLObject a bit restrictive in this department, and would strongly recommend SQLAlchemy instead. This is far more flexible, and if I was starting a new project today, I would certainly use it over SQLObject. It won't be moving "Objects" anywhere. There is no magic involved here, it will be transferring rows in tables in a database. Which as mentioned you could do by hand, but this might save you some time.

Your success with createTable() will depend on your existing underlying table schema / data types. In other words, how well SQLite maps to the database you choose and how SQLObject decides to use your data types.
The safest option may be to create the new database by hand. Then you'll have to deal with data migration, which may be as easy as instantiating two SQLObject database connections over the same table definitions.
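The migration idea can be sketched without SQLObject at all. The example below uses two plain sqlite3 connections as stand-ins for the old and new databases, with the same table definition created on both. The visitor table and the copy_table helper are hypothetical. With SQLObject you would go through its own connection objects instead, but the shape of the job is the same: read rows from one side, write them to the other.

```python
import sqlite3

# The same table definition is applied to both databases.
SCHEMA = "CREATE TABLE visitor (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"


def copy_table(source, target, table, columns):
    """Copy every row of `table` from source to target; return the row count."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = source.execute(
        "SELECT %s FROM %s" % (column_list, table)
    ).fetchall()
    with target:  # one transaction for the whole copy
        target.executemany(
            "INSERT INTO %s (%s) VALUES (%s)"
            % (table, column_list, placeholders),
            rows,
        )
    return len(rows)


old_db = sqlite3.connect(":memory:")  # stands in for the existing SQLite file
new_db = sqlite3.connect(":memory:")  # stands in for the new database server
for db in (old_db, new_db):
    db.execute(SCHEMA)

old_db.executemany("INSERT INTO visitor (name) VALUES (?)", [("ann",), ("bob",)])
old_db.commit()

copied = copy_table(old_db, new_db, "visitor", ["id", "name"])
print(copied)
```

Table and column names are interpolated into the SQL here, so they must come from your own code, never from user input.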
Why not just start with the more full-featured database?

I'm not sure I understand the question.
The SQLObject documentation lists six kinds of connections available. Further, the database connection (or scheme) is specified in a connection string. Changing database connections from SQLite to MySQL is trivial. Just change the connection string.
The documentation lists the different kinds of schemes that are supported.
