Fast and efficient random queryset - Django [duplicate]

This question already has answers here:
Best way to select random rows PostgreSQL
(13 answers)
Closed 8 years ago.
I have looked at some posts on SO about random querysets in Django. All I can find is that order_by('?') is not fast or efficient at all, but my experience does not bear that out.
I have a table with 30,000 entries (the final production table will have around 200,000 entries). I could create separate tables of around 10-15k entries each.
So, I want a very fast and efficient way to get 100 (maybe 200) random items.
The idea of creating a list of 100 random numbers is, in my opinion, not good enough, because some PKs will be missing (due to deletions, etc.).
And I don't want to generate one random number and then take the 99 following items.
I will be using PostgreSQL (no special reason... I can choose another database if it is better).
I tested order_by('?')[:100] and it seems very fast (I think). It took only 0.017s per list.
Why do the docs say that this is not a good operation for random ordering?
Which random approach do you prefer?
Is there any better way to do this?

ORDER BY random()
LIMIT n
is a valid approach but slow, because every single row in the table has to be considered.
This is still fast with 30k rows, but with 30M rows... not so much.
I suggest this related question:
Best way to select random rows PostgreSQL
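If you stay on PostgreSQL and the id space is reasonably dense, one pattern from that thread is to sample candidate primary keys in Python and fetch only those rows. A minimal sketch, assuming a hypothetical Django model named Item and oversampling to absorb gaps left by deletions:
import random

from django.db.models import Max, Min

def random_items(n=100, oversample=2):
    # Sample ints from the pk range; oversample because deleted rows leave gaps,
    # then trim the result to n. Item is a placeholder model name.
    bounds = Item.objects.aggregate(lo=Min('pk'), hi=Max('pk'))
    candidates = random.sample(range(bounds['lo'], bounds['hi'] + 1), n * oversample)
    return list(Item.objects.filter(pk__in=candidates)[:n])
If too many candidates fall into gaps, this can return fewer than n rows, so production code would loop until it has enough.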

Related

smart way to structure my SQLite Database

I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A table where I add another table into it every 10 min containing all the data I got for that interval
I am not very sure about the first option, since there will be about 2000 players and I don't know if I want to have 2000 tables (would that be a problem?). But I also don't feel like the second option is a good idea.
It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically it sounds like you are in need of a Player Dimension and a Play Fact table...maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.
In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements the Third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to say the top 15 is likewise a simple single query.
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows, which might be close to the maximum number of rows in a sqlite table. But if you got to that point I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?
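To make that concrete, here is a rough sketch of the single-history-table layout, assuming SQLite; the table and column names are only illustrative:
import sqlite3

conn = sqlite3.connect('leaderboard.db')
conn.executescript("""
CREATE TABLE IF NOT EXISTS player (
    player_id      INTEGER PRIMARY KEY,
    account_name   TEXT NOT NULL,
    character_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS xp_snapshot (
    player_id   INTEGER NOT NULL REFERENCES player(player_id),
    captured_at TEXT    NOT NULL,   -- timestamp of the 10-minute snapshot
    xp          INTEGER NOT NULL,
    PRIMARY KEY (player_id, captured_at)
);
""")

# "Top 15 by xp right now" stays a single query over the one history table.
top15 = conn.execute("""
    SELECT p.account_name, p.character_name, s.xp
    FROM xp_snapshot AS s
    JOIN player AS p USING (player_id)
    WHERE s.captured_at = (SELECT MAX(captured_at) FROM xp_snapshot)
    ORDER BY s.xp DESC
    LIMIT 15
""").fetchall()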

Which has optimal performance for generating a randomized list: `random.shuffle(ids)` or `.order_by("?")`?

I need to generate a randomized list of 50 items to send to the front-end for a landing page display. The landing page already loads much too slowly, so any optimization would be wonderful!
Given the pre-existing performance issues and the large size of this table, I'm wondering which implementation is better practice, or if the difference is negligible:
Option A:
unit_ids = list(units.values_list('id', flat=True).distinct())
random.shuffle(unit_ids)
unit_ids = unit_ids[:50]
Option B:
list(units.values_list('id', flat=True).order_by("?")[:50])
My concern is that according to the django docs, order_by('?') "may be expensive and slow"
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.order_by
We are using a MySQL db. I've tried searching for more info about implementation, but I'm not seeing anything more specific than what's in the docs. Help!
Option B should be faster in most cases, since the database engine is usually faster than code in Python.
In Option A, you are retrieving some ids (all of them, by my guess) and then shuffling them in Python. According to you, the table is large, which makes doing this in Python a bad idea. Also, you are only getting the ids, which means that if you need the actual data, you have to make another query.
With all that said, you should still try both and see which one is faster, because they depend on different variables. Just time them both (a sketch follows below) and go with the faster one.
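For example, a quick timing harness, assuming Python 3 and the units queryset from the question, might look like this:
import random
import time

def time_option_a():
    start = time.perf_counter()
    unit_ids = list(units.values_list('id', flat=True).distinct())
    random.shuffle(unit_ids)
    unit_ids = unit_ids[:50]
    return time.perf_counter() - start

def time_option_b():
    start = time.perf_counter()
    list(units.values_list('id', flat=True).order_by('?')[:50])
    return time.perf_counter() - start

# Run each several times against production-sized data and compare averages.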
Tradeoffs:
Shoveling large amounts of data to the client (TEXT columns; all the rows; etc)
Whether the table is so big that fetching N random rows is likely to hit the disk N times.
My first choice would be simply:
SELECT * FROM t ORDER BY RAND() LIMIT 50;
My second choice would be to use "lazy loading" (not unlike your random.shuffle, but better because it does not need a second round-trip):
SELECT t.*
FROM ( SELECT id FROM t ORDER BY RAND() LIMIT 50 ) AS r
JOIN t USING(id)
If that is not "fast enough", then first find out whether the subquery is the slowdown or the outer query.
If the inner query is the problem, then see http://mysql.rjweb.org/doc.php/random
If the outer query is the problem, you are doomed. It is already optimal (assuming PRIMARY KEY(id)).

Dataframe writing to Postgresql poor performance

Working in PostgreSQL, I have a Cartesian join producing ~4 million rows.
The join takes ~5sec and the write back to the DB takes ~1min 45sec.
The data will be required for use in python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here:cartesian product in pandas) takes consistently under 3 secs, impressive.
Writing the data back to a table in the database, however, takes anything from 8 minutes (best method) to 36+ minutes (plus some methods I rejected because I had to stop them after >1 hr).
While I was not expecting to reproduce the "sql only" time, I would hope to get closer than 8 minutes (I'd have thought 3-5 mins would not be unreasonable).
Slower methods include:
36 min - SQLAlchemy's table.insert (from 'test_sqlalchemy_core' here https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13 min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15 min (depending on chunksize) - pandas.DataFrame.to_sql (again using SQLAlchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
The best way (~8 min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to a CSV first (in memory via io.StringIO); that alone takes 2 minutes.
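For reference, a stripped-down sketch of that copy_from route (assuming a psycopg2 connection and an existing destination table whose columns match the dataframe) looks roughly like this:
import io

def copy_dataframe(df, conn, table_name):
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)   # the in-memory CSV dump mentioned above
    buf.seek(0)
    with conn.cursor() as cur:
        # copy_from does not parse quoted CSV fields, so this assumes the data
        # contains no embedded separators; copy_expert with COPY ... CSV is safer.
        cur.copy_from(buf, table_name, sep=',', columns=list(df.columns))
    conn.commit()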
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2 - pandas can now use a custom callable for to_sql, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e. it copies CSV data directly from STDIN using StringIO).
I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
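For anyone looking for it, the callable is essentially the psql_insert_copy example from the pandas docs; a lightly trimmed version, assuming a SQLAlchemy engine backed by psycopg2:
import csv
import io

def psql_insert_copy(table, conn, keys, data_iter):
    dbapi_conn = conn.connection               # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)   # dump the chunk to in-memory CSV
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# usage: df.to_sql('destination_table', engine, method=psql_insert_copy)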
Answering Q 1 myself:
It seems the issue had more to do with PostgreSQL (or rather with databases in general). Taking into account points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in PostgreSQL) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense, as there should be little or no shuffling of the index rows required. I also verified that this is why my Cartesian join in PostgreSQL was faster in the first place (i.e. the rows happened to be ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.
3) I tried similar experiments on our MySQL systems and found the same increase in insert speed when removing indexes. With MySQL, however, it seemed that rebuilding the indexes used up any time gained.
I hope this helps anyone else who comes across this question from a search.
I still wonder if it is possible to remove the write-to-CSV step in Python (Q2 above), as I believe I could then write something in Python that would be faster than pure PostgreSQL.
Thanks, Giles

Storing entries in a very large database

I am writing a Django application that will have entries entered by users of the site. Now suppose that everything goes well, and I get the expected number of visitors (unlikely, but I'm planning for the future). This would result in hundreds of millions of entries in a single PostgreSQL database.
As iterating through such a large number of entries and checking their values is not a good idea, I am considering ways of grouping entries together.
Is grouping entries in to sets of (let's say) 100 a better idea for storing this many entries? Or is there a better way that I could optimize this?
Store one at a time until you absolutely cannot anymore, then design something else around your specific problem.
SQL is a declarative language, meaning "give me all records matching X" doesn't tell the db server how to do this. Consequently, you have a lot of ways to help the db server do this quickly even when you have hundreds of millions of records. Additionally, RDBMSs have been optimized for this problem over many years, so up to a point you will not beat a system like PostgreSQL.
So as they say, premature optimization is the root of all evil.
So let's look at two ways PostgreSQL might go through a table to give you the results.
The first is a sequential scan, where it iterates over a series of pages, scans each page for the values and returns the records to you. This works better than any other method for very small tables. It is slow on large tables. Complexity is O(n) where n is the size of the table, for any number of records.
So a second approach might be an index scan. Here PostgreSQL traverses a series of pages in a b-tree index to find the records. Complexity is O(log(n)) to find each record.
Internally PostgreSQL stores the rows in batches with fixed sizes, as pages. It already solves this problem for you. If you try to do the same, then you have batches of records inside batches of records, which is usually a recipe for bad things.
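In Django terms, that usually just means keeping one model (one table) and indexing the columns you filter on; a hypothetical sketch:
from django.db import models

class Entry(models.Model):
    # one row per user-submitted entry; the fields are only illustrative
    author = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    value = models.IntegerField(db_index=True)   # b-tree index -> O(log n) lookups
    created = models.DateTimeField(auto_now_add=True)

# Entry.objects.filter(value=42).explain() should report an Index Scan rather
# than a Seq Scan once the index exists (QuerySet.explain() needs Django 2.1+).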

Efficient way to store millions of arrays, and perform IN check

There are around 3 million arrays - or Python lists/tuples (it does not really matter). Each array consists of the following elements:
['string1', 'string2', 'string3', ...] # 10,000 elements in total
These arrays should be stored in some kind of key-value storage. Let's assume for now it's a Python dict, for the sake of a simple explanation.
So, 3 million keys, and each key represents a 10,000-element array.
Lists/tuples or any other custom thing - it doesn't really matter. What matters is that the arrays consist of strings - utf8 or unicode strings, from 5 to about 50 chars each. There are about 3 million possible strings as well. It is possible to replace them with integers if it's really needed, but for more efficient further operations, I would prefer to have strings.
Though it's hard to give you a full description of the data (it's complicated and odd), it's something similar to synonyms - let's assume we have 3 million words - as the dict keys - and 10k synonyms for each word - i.e. the elements of its list.
Like that (not real synonyms but it will give you the idea):
{
'computer': ['pc', 'mac', 'laptop', ...], # (10k totally)
'house': ['building', 'hut', 'inn', ...], # (another 10k)
...
}
Elements - 'synonyms' - can be sorted if it's needed.
Later, after the arrays are populated, there's a loop: we go through all the keys and check whether some variable is in its value. For example, the user inputs the words 'computer' and 'laptop' - and we must quickly reply whether the word 'laptop' is a synonym of the word 'computer'. The issue here is that we have to check this millions of times, probably 20 million or so. Just imagine we have a lot of users entering random word pairs - 'computer' and 'car', 'phone' and 'building', etc. They may 'match', or they may not.
So, in short - what I need is to:
store these data structures memory-efficiently,
be able to quickly check if some item is in array.
I should be able to keep memory usage below 30 GB. I should also be able to perform all the iterations in less than 10 hours on a Xeon CPU.
It's OK to have around 0.1% false answers - both positives and negatives - though it would be better to reduce them or avoid them entirely.
What is the best approach here? Algorithms, links to code - anything is really appreciated. Also, a friend of mine suggested using Bloom filters or MARISA tries here - is he right? I haven't worked with either of them.
I would map each unique string to a numeric ID, then associate each key with a Bloom filter using around 20 bits per element for your <0.1% error rate. 20 bits × 10,000 elements × 3 million keys is 75 GB, so if you are space-limited, store a smaller, less accurate filter in memory and the more accurate filter on disk, which is consulted if the first filter says the item might be a match.
There are alternatives, but they will only reduce the size from 1.44·n·log₂(1/ε) to n·log₂(1/ε) bits per key; in your case ε = 0.001, so the theoretical limit is a data structure of 99,658 bits per key, or about 10 bits per element, which would be 298,974,000,000 bits or 38 GB.
So 30 GB is below the theoretical limit for a data structure with the performance and number of entries that you require, but within the ball park.
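To make the suggestion concrete, here is a tiny hand-rolled sketch of one per-key Bloom filter (in practice you would use a tuned library and pick the hash count more carefully; the sizes below are only illustrative):
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256('{}:{}'.format(i, item).encode('utf-8')).digest()
            yield int.from_bytes(digest[:8], 'big') % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# ~20 bits per element for the 10k synonyms of one key word:
computer_synonyms = BloomFilter(num_bits=20 * 10000)
for word in ('pc', 'mac', 'laptop'):
    computer_synonyms.add(word)
print('laptop' in computer_synonyms)     # True
print('building' in computer_synonyms)   # False (with ~0.1% false-positive risk)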
Why do you want to maintain your own in-memory data structure? Why not use a regular database for this purpose? If that is too slow, why not use an in-memory database? One solution is to use in-memory sqlite3. Check this SO link, for example: Fast relational Database for simple use with Python
You create the in-memory database by passing ':memory:' to connect method.
import sqlite3
conn = sqlite3.connect(':memory:')
What will your schema be? I can think of a wide schema, with a string as the id key (e.g. 'computer', 'house' in your example) and about 10,000 additional columns, 'field1' to 'field10000', one for each element of your array. Once you construct the schema, iteratively inserting your data into the database will be simple: one SQL statement per row of your data. And from your description, the insert part is one-time-only; there are no further modifications to the database.
The biggest question is retrieval (more crucially, the speed of retrieval). Retrieving the entire array for a single key like 'computer' is again a simple SQL statement. The scalability and speed are something I don't have a feel for, and something you will have to experiment with. There is still hope that an in-memory database will speed up the retrieval part. Still, I believe this is the cheapest and fastest solution you can implement and test (much cheaper than a multi-node cluster).
Why am I suggesting this solution? Because the setup you have in mind is extremely similar to that of a fast-growing, database-backed internet startup. All good startups handle a similar number of requests per day and use some sort of database with caching (caching would be the next thing to look at for your problem if a simple database doesn't scale to millions of requests; again, it is much easier and cheaper than buying RAM/nodes).
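If you want to try the in-memory route quickly, here is a minimal sketch - shown with a narrow (word, synonym) layout instead of the wide one described above, partly because SQLite caps the column count well below 10,000 by default:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE synonym (word TEXT NOT NULL, syn TEXT NOT NULL)')
conn.execute('CREATE INDEX idx_word_syn ON synonym (word, syn)')

conn.executemany('INSERT INTO synonym VALUES (?, ?)',
                 [('computer', 'pc'), ('computer', 'mac'), ('computer', 'laptop')])

def is_synonym(word, candidate):
    # the composite index makes this lookup a single b-tree probe
    row = conn.execute('SELECT 1 FROM synonym WHERE word = ? AND syn = ? LIMIT 1',
                       (word, candidate)).fetchone()
    return row is not None

print(is_synonym('computer', 'laptop'))   # True
print(is_synonym('computer', 'car'))      # False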
