Most efficient way to get one random row from Oracle - Python

I have a scenario where I have to obfuscate data in a database (scramble it for testing purposes so that the real data cannot be seen; there is no need to unscramble it later). Several tables reference the address_table. I cannot obfuscate the address_table itself, so I figured I would simply replace the references in those tables with other, random address_table IDs. The address_table contains 6M+ records. So I would create a temp table with all the address IDs and then, when needed, call some function to get a random one from there. I could generate a random value and take that row like this:
Select * From (
Select Id, Rownum Rn From myTempTable )
WHERE RN = x;
where x is some random value generated by dbms_random. Although this does what I need, it performs nowhere near as well as I expect.
Another thing I have tried is the sample() function; this (at least on a small table) performs a bit better, but it is still not good enough.
I know there are several threads on this matter like this or this on mySql, but they do not directly answer it in terms of performance.
Also, I am not limited to using PL/SQL. I know very little PL/SQL; how is it in terms of performance? I mean, it is just another process in the DB server's processing queue, so perhaps I could get better performance doing the processing (generating the update scripts, populating the random values, etc.) on the client side using something like Python, even considering network latency? Does anybody have any experience with this?

Use the SAMPLE clause:
select * from myTempTable SAMPLE(10);
This will return approximately 10% of the rows, chosen at random.

If you just want to hide the real data, why don't you take care of that in the select part of the query? Instead of querying:
select column_name from table;
you could select
select scrambling_function(column_name) from table;
scrambling_function can be whatever you like.

There is no good way to sample randomly using SQL that I am aware of. The sample function available in some SQL versions does not give a sufficiently random sample. The best way is to export the full sample set and use random-number software to determine the indexes of the rows to include in your final solution. Or, if you have a simple numeric index (1, 2, 3, ..., n) and know how many rows you need to select from, you could upload a list of indexes to include and query against that. Try random.org for random number generation; their API is located at http://www.random.org/clients/http/.
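A minimal sketch of the client-side approach in Python, assuming the address IDs have already been fetched into a list once. The names `assign_random_ids`, `address_ids` and `customer_ids` are hypothetical, and Python's built-in `random` module is used here in place of an external random-number service.

```python
import random


def assign_random_ids(row_keys, address_ids, seed=None):
    """Map each referencing row key to a randomly chosen address ID."""
    rng = random.Random(seed)
    # choices() samples with replacement, so one address may be reused.
    picks = rng.choices(address_ids, k=len(row_keys))
    return dict(zip(row_keys, picks))


# Hypothetical data standing in for IDs fetched from the database.
address_ids = list(range(1000, 1100))
customer_ids = [1, 2, 3, 4, 5]

mapping = assign_random_ids(customer_ids, address_ids, seed=42)
for customer_id, address_id in mapping.items():
    print(customer_id, "->", address_id)
```

The resulting mapping could then be turned into UPDATE statements or bulk-loaded into a work table; which of the two is faster depends on the database and the network.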

Related

smart way to structure my SQLite Database

I am new to database things and only have a very basic understanding of them.
I need to save historic data of a leaderboard and I am not sure how to do that in a good way.
I will get a list of accountName, characterName and xp.
Options I was thinking of so far:
An extra table for each account where I add their xp as another entry every 10 min (not sure where to put the character name in that option)
A table where I add another table into it every 10 min containing all the data I got for that interval
I am not very sure about the first option: since there will be about 2000 players, I don't know if I want to have 2000 tables (would that be a problem?). But I also don't feel like the second option is a good idea.
It feels like with some basic dimensional modeling techniques you will be able to solve this.
Specifically it sounds like you are in need of a Player Dimension and a Play Fact table...maybe a couple more supporting tables along the way.
It is my pleasure to introduce you to the Guru of Dimensional Modeling (IMHO): Kimball Group - Dimensional Modeling Techniques
My advice - invest a bit of time there, put a few basic dimensional modeling tools in your toolbox, and this build should be quite enjoyable.
In general you want to have a small number of tables, and the number of rows per table doesn't matter so much. That's the case databases are optimized for. Technically you'd want to strive for a structure that implements the Third normal form.
If you wanted to know which account had the most xp, how would you do it? If each account has a separate table, you'd have to query each table. If there's a single table with all the accounts, it's a trivial single query. Expanding that to say the top 15 is likewise a simple single query.
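A minimal sketch of the single-table layout, using Python's built-in `sqlite3` module and an in-memory database. The table name `xp_snapshot`, its column names and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE xp_snapshot (
        account_name   TEXT    NOT NULL,
        character_name TEXT    NOT NULL,
        taken_at       TEXT    NOT NULL,  -- timestamp of the 10-minute poll
        xp             INTEGER NOT NULL,
        PRIMARY KEY (account_name, character_name, taken_at)
    )
    """
)

# Hypothetical sample data: one row per character per snapshot.
rows = [
    ("acc1", "Alice", "2024-01-01T00:00", 100),
    ("acc1", "Alice", "2024-01-01T00:10", 150),
    ("acc2", "Bob", "2024-01-01T00:00", 300),
    ("acc2", "Bob", "2024-01-01T00:10", 310),
    ("acc3", "Cara", "2024-01-01T00:10", 200),
]
conn.executemany("INSERT INTO xp_snapshot VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Top 15 characters by xp in the most recent snapshot: one query, one table.
top = conn.execute(
    """
    SELECT account_name, character_name, xp
    FROM xp_snapshot
    WHERE taken_at = (SELECT MAX(taken_at) FROM xp_snapshot)
    ORDER BY xp DESC
    LIMIT 15
    """
).fetchall()
print(top)
```

Every 10 minutes the poller would simply insert one more batch of rows with a new `taken_at` value; no new tables are ever created.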
If you had a history table with a snapshot every 10 minutes, that would get pretty big over time but should still be reasonable by database standards. A snapshot every 10 minutes for 2000 characters over 10 years would result in 1,051,920,000 rows. That is a lot, but still far below SQLite's hard limits: the theoretical maximum is 2^64 rows per table, and the maximum database size is reached long before that. But if you got to that point I think you might be better off splitting the data into multiple databases rather than multiple tables. How far back do you want easily accessible history?
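The row count quoted above can be checked with a few lines of arithmetic. This sketch assumes 365.25 days per year, which is what reproduces the figure in the answer.

```python
PLAYERS = 2000
SNAPSHOTS_PER_HOUR = 6      # one snapshot every 10 minutes
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25      # assumption: average year including leap days
YEARS = 10

total_rows = int(
    PLAYERS * SNAPSHOTS_PER_HOUR * HOURS_PER_DAY * DAYS_PER_YEAR * YEARS
)
print(f"{total_rows:,}")
```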

Dynamically update MySQL table in Python in linear time

I have an existing table with a large number of entries and I want to calculate a new column for every row. I have only found the following solution. This works, but it's slow as it needs to scan most of the entries of the table.
What I would like is a way to:
Read a row
Calculate the value for new column based on contents of row
Update into database
This way it would only go through the table once and would have linear complexity.
cursor.execute("SELECT tweet FROM Table")
row = cursor.fetchone()
while row is not None:
    # row is a 1-tuple; pass the tweet text itself to the analyser
    vader = analyser.polarity_scores(row[0])
    sentiment_vader = vader["compound"]
    cursor2.execute(
        "UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
        (sentiment_vader, row[0]))
    kody.cnx.commit()
    row = cursor.fetchone()
The main performance issue I see is that you should not commit after each row update, as this adds overhead. You should commit at the end of the while loop or after each batch.
while row is not None:
    ...
else:
    kody.cnx.commit()
Also, if the tweet column is not indexed just create an index on that column in order not to make a table scan during the update.
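A minimal sketch of committing once per batch rather than once per row. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs anywhere; the `tweets` table, the `score` function (standing in for the sentiment analyser) and `BATCH_SIZE` are all hypothetical. Rows are matched on an integer primary key instead of the tweet text, which also sidesteps the missing-index problem.

```python
import sqlite3


def score(text):
    # Hypothetical stand-in for analyser.polarity_scores(text)["compound"].
    return len(text) / 100.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, tweet TEXT, sentiment REAL)"
)
conn.executemany(
    "INSERT INTO tweets (tweet) VALUES (?)",
    [("tweet %d" % i,) for i in range(250)],
)
conn.commit()

BATCH_SIZE = 100
commits = 0

# Read everything once, then update in batches keyed on the primary key.
rows = conn.execute("SELECT id, tweet FROM tweets").fetchall()
for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    conn.executemany(
        "UPDATE tweets SET sentiment = ? WHERE id = ?",
        [(score(tweet), row_id) for row_id, tweet in batch],
    )
    conn.commit()  # one commit per batch, not per row
    commits += 1

missing = conn.execute(
    "SELECT COUNT(*) FROM tweets WHERE sentiment IS NULL"
).fetchone()[0]
print(commits, "commits,", missing, "rows left without a score")
```

With MySQL the placeholders would be `%s` instead of `?`, and for very large tables the initial read would itself need to be chunked rather than fetched in one go.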
OK, so first: this is not to criticize the other answers, which are correct given the general assumption that you have to do it in Python.
However, when you really have bulk volumes, chasing after a client-side, in-Python answer is often not the best approach. Since you want to update all the rows, assuming you can translate your polarity_scores algorithm into SQL,
UPDATE Table
SET sentiment_vader = <sql expressing your polarity_scores>;
would be the best performer. There is no back and forth with the database and everything gets committed at once.
Now, I am not saying it's easy or even possible. Often in these cases, even assuming the algorithm can be expressed in SQL, you may have to use work tables to store intermediate results and there is a lot of SQL going on. It's a different skill set than writing Python code.
But, if you truly need performance and you have large volumes, letting the server do the job on its own, in SQL, can be the way to go. That can be done via a series of sql commands, or using stored procedures.
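A minimal sketch of the set-based approach, again using `sqlite3` in memory as a stand-in for the real server. It assumes the score can be written as a SQL expression; the trivial `LENGTH(tweet)` below is only a placeholder for that expression, and the `tweets` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, tweet TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO tweets (tweet) VALUES (?)",
    [("good",), ("very bad",), ("ok",)],
)

# One statement updates every row on the server side: no per-row round trips.
cur = conn.execute("UPDATE tweets SET score = LENGTH(tweet)")
conn.commit()

updated = cur.rowcount
scores = [r[0] for r in conn.execute("SELECT score FROM tweets ORDER BY id")]
print(updated, scores)
```

Whether a real sentiment algorithm can be expressed this way is exactly the open question the answer raises; the sketch only shows the shape of the statement.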
In a previous job, we had explicit instructions to avoid loop-and-write constructs in client code, and code reviews would almost always reject them for bulk data manipulations. I remember advising a colleague that doing a select-update on a table with potentially up to 5M rows seemed a bad approach. He certainly ignored me at the time, but 3 months later his mission-critical code had all mysteriously shifted to a no-loop approach.
Note however one key conceptual difference: an error on a server-side update would rollback the transaction for all rows indiscriminately, whereas you could maybe choose to commit row-by-row using a loop construct like yours (even though you don't want it in your case).
The expected performance profile server-side is usually considerably better than O(n) linear time. Most of the time you should be nearly at constant O(1) time complexity, once you have correctly written your queries and indexes. Linear time to update, for a RDBMS vendor, would be commercial suicide. Usually what you see is a near constant time, followed by non-linear and hard-to-predict performance degradation past very high volume thresholds. You will see linear time earlier when indices can't be used and for your queries the RDBMS falls back to performing full-table scans.
Is this MySQLdb ? Maybe you can try an executemany.
cursor.execute("SELECT tweet FROM Table")
cursor2.executemany(
    "UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
    # row is a 1-tuple; pass the tweet text itself to the analyser
    ((analyser.polarity_scores(row[0])["compound"], row[0]) for row in cursor)
)
kody.cnx.commit()
Just as #abc suggested above, you should also make sure autocommit is set to False, so that each query isn't committed separately during the executemany.

Which has optimal performance for generating a randomized list: `random.shuffle(ids)` or `.order_by("?")`?

I need to generate a randomized list of 50 items to send to the front-end for a landing page display. The landing page already loads much too slowly, so any optimization would be wonderful!
Given the pre-existing performance issues and the large size of this table, I'm wondering which implementation is better practice, or if the difference is negligible:
Option A:
unit_ids = list(units.values_list('id', flat=True).distinct())
random.shuffle(unit_ids)
unit_ids = unit_ids[:50]
Option B:
list(units.values_list('id', flat=True).order_by("?")[:50])
My concern is that according to the django docs, order_by('?') "may be expensive and slow"
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.order_by
We are using a MySQL db. I've tried searching for more info about implementation, but I'm not seeing anything more specific than what's in the docs. Help!
Option B should be faster in most cases, since a database engine is usually faster than Python code.
In option A, you are retrieving the IDs (all of them, I would guess) and then shuffling them in Python. According to you, the table is large, which makes doing this in Python a bad idea. Also, you are only getting the IDs, which means that if you need the actual data, you have to make another query.
Having said all that, you should still try both and see which one is faster, because they depend on different variables. Just time them both and go with whichever works faster for you.
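A minimal sketch of how the Python side could be timed with the standard `timeit` module. It only measures in-memory work on a hypothetical list of IDs; timing Option B would additionally need the real database and queryset, which are not reproduced here. `random.sample` is included as an alternative that avoids shuffling the whole list.

```python
import random
import timeit

# Hypothetical stand-in for the IDs returned by values_list('id', flat=True).
ids = list(range(100_000))


def shuffle_then_slice(values, k=50):
    """Option A as written: shuffle everything, keep the first k."""
    copy = list(values)
    random.shuffle(copy)
    return copy[:k]


def sample_directly(values, k=50):
    """Pick k distinct items without shuffling the whole list."""
    return random.sample(values, k)


def compare(runs=5):
    for fn in (shuffle_then_slice, sample_directly):
        seconds = timeit.timeit(lambda: fn(ids), number=runs)
        print(f"{fn.__name__}: {seconds:.4f}s for {runs} runs")


if __name__ == "__main__":
    compare()
```

Note that in both variants the dominant cost in a real application is likely to be fetching all the IDs from the database in the first place, which this sketch does not measure.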
Tradeoffs:
Shoveling large amounts of data to the client (TEXT columns; all the rows; etc)
Whether the table is so big that fetching N random rows is likely to hit the disk N times.
My first choice would be simply:
SELECT * FROM t ORDER BY RAND() LIMIT 50;
My second choice would be to use "lazy loading" (not unlike your random.shuffle, but better because it does not need a second round-trip):
SELECT t.*
FROM ( SELECT id FROM t ORDER BY RAND() LIMIT 50 ) AS r
JOIN t USING(id)
If that is not "fast enough", then first find out whether the subquery is the slowdown or the outer query.
If the inner query is the problem, then see http://mysql.rjweb.org/doc.php/random
If the outer query is the problem, you are doomed. It is already optimal (assuming PRIMARY KEY(id)).

SqlAlchemy mapped bulk update - make safer and faster?

I'm using Postgres 9.2 and SqlAlchemy. Currently, this is my code to update the rankings of my Things in my database:
lock_things = session.query(Thing).\
    filter(Thing.group_id == 4).\
    with_for_update().all()
tups = RankThings(lock_things) # return sorted tuple (<numeric>, <primary key Thing id>)
rank = 1
for prediction, id in tups:
    thing = session.query(Thing).\
        filter(Thing.group_id == 4).\
        filter(Thing.id == id).one()
    thing.rank = rank
    rank += 1
session.commit()
However, this seems slow. It's also something I want to be atomic, which is why I use the with_for_update() syntax.
I feel like there must be a way to "zip" up the query and do an update in that way.
How can I make this faster and done all in one query?
EDIT: I think I need to create a temp table to join and make a fast update, see:
https://stackoverflow.com/a/20224370/712997
http://tapoueh.org/blog/2013/03/15-batch-update
Any ideas how to do this in SqlAlchemy?
Generally speaking with such operations you aim for two things:
Do not execute a query inside a loop
Reduce the number of queries required by performing computations on the SQL side
Additionally, you might want to merge some of the queries you have, if possible.
Let's start with 2), because this is very specific and often not easily possible. Generally, the fastest operation here would be to write a single query that returns the rank. There are two options with this:
The query is quick to run so you just execute it whenever you need the ranking. This would be the very simple case of something like this:
SELECT
thing.*,
(POINTS_QUERY) as score
FROM thing
ORDER BY score DESC
In this case, this will give you an ordered list of things by some artificial score (e.g. if you build some kind of competition). The POINTS_QUERY would be something that uses a specific thing in a subquery to determine its score, e.g. aggregate the points of all the tasks it has solved.
In SQLAlchemy, this would look like this:
score = session.query(func.sum(Task.points)).filter(Task.thing_id == Thing.id).correlate(Thing).label("score")
thing_ranking = session.query(Thing, score).order_by(desc("score"))
This is a slightly more advanced usage of SQLAlchemy: we construct a subquery that returns a scalar value, which we label score. With correlate we tell it that thing will come from an outer query (this is important).
So that was the case where you run a single query that gives you a ranking (the ranks are determined by the index in the list and depend on your ranking strategy). If you can achieve this, it is the best case.
The query itself is expensive, so you want the values cached. This means you can either use the solution above and cache the values outside of the database (e.g. in a dict or using a caching library), or compute them as above but update a database field (like Thing.rank). Again, the query from above gives us the ranking. Additionally, I assume the simplest kind of ranking, where the index denotes the rank:
for rank, (thing, score) in enumerate(thing_ranking):
    thing.rank = rank
Notice how I base the rank on the index using enumerate. Additionally, I take advantage of the fact that since I just queried thing, I already have it in the session, so there is no need for an extra query. So this might be your solution right here, but read on for some additional info.
Using the last idea from above, we can now tackle 1): Get the query outside the loop. In general I noticed that you pass a list of things to a sorting function that only seems to return IDs. Why? If you can change it, make it so that it returns the things as a whole.
However, it might be possible that you cannot change this function so let's consider what we do if we can't change it. We already have a list of all relevant things. And we get a sorted list of their IDs. So why not build a dict as a lookup for ID -> Thing?
things_dict = {thing.id: thing for thing in lock_things}
We can use this dict instead of querying inside the loop:
for prediction, id in tups:
    thing = things_dict[id]
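Putting the lookup and the ranking together, a minimal plain-Python sketch might look like the following. The `Thing` dataclass is a hypothetical stand-in for the mapped class, and `tups` imitates what RankThings is described as returning; no database or SQLAlchemy session is involved.

```python
from dataclasses import dataclass


@dataclass
class Thing:  # hypothetical plain-Python stand-in for the mapped class
    id: int
    rank: int = 0


# Stand-in for the rows already loaded (and locked) by the first query.
lock_things = [Thing(id=10), Thing(id=20), Thing(id=30)]

# Hypothetical RankThings output: sorted (prediction, id) tuples.
tups = [(0.9, 30), (0.5, 10), (0.1, 20)]

# Build the ID -> object lookup once, then rank without any further queries.
things_dict = {thing.id: thing for thing in lock_things}
for rank, (prediction, thing_id) in enumerate(tups, start=1):
    things_dict[thing_id].rank = rank

print([(t.id, t.rank) for t in lock_things])
```

With real mapped objects, the modified `rank` attributes would be written back by the session when it is flushed or committed.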
However, it may be possible (for some reason I missed in your example) that not all IDs were returned previously. In that case (or in general) you can take advantage of a similar mapping SQLAlchemy keeps itself: You can ask it for a primary key and it will not query the database if it already has it:
for prediction, id in tups:
    thing = session.query(Thing).get(id)
So that way we have reduced the problem and only execute queries for objects we don't already have.
One last thing: What if we don't have most of the things? Then I didn't solve your problem, I just replaced the query. In that case, you will have to create a new query that fetches all the elements you need. In general this depends on the source of the IDs and how they are determined, but you could always go the least efficient way (which is still way faster than inside-loop queries): Using SQL's IN:
all_things = session.query(Thing).filter(Thing.group_id == 4).filter(Thing.id.in_([id for _, id in tups])).all()
This would construct a query that filters with the IN keyword. However, with a large list of things this is terribly inefficient and thus if you are in this case, it is most likely better you construct some more efficient way in SQL that determines if this is an ID you want.
Summary
So this was a long text. To sum up:
Perform queries in SQL as much as possible if you can write it efficiently there
Use SQLAlchemy's awesomeness to your advantage, e.g. create subqueries
Try to never execute queries inside a loop
Create some mappings for yourself (or use that of SQLAlchemy to your advantage)
Do it the pythonic way: Keep it simple, keep it explicit.
One final thought: If your queries get really complex and you fear you are losing control over the queries executed by the ORM, drop it and use the Core instead. It is almost as awesome as the ORM and gives you a huge amount of control over the queries, since you build them yourself. With it you can construct almost any SQL query you can think of, and I am certain that the batch updates you mentioned are also possible there (if you see that my queries above lead to many UPDATE statements, you might want to use the Core).

What should I do to accommodate large-scale data storage and retrieval?

There are two columns in the table inside the MySQL database. The first column contains the fingerprint, while the second contains the list of documents that have that fingerprint. It's much like an inverted index built by search engines. An example record from the table is shown below:
34 "doc1, doc2, doc45"
The number of fingerprints is very large (it can range up to trillions). There are basically two operations on the database: inserting/updating a record, and retrieving the record that matches a given fingerprint. The Python snippet defining the table is:
self.cursor.execute("CREATE TABLE IF NOT EXISTS `fingerprint` (fp BIGINT, documents TEXT)")
And the snippet for insert/update operation is:
if self.cursor.execute("UPDATE `fingerprint` SET documents=CONCAT(documents,%s) WHERE fp=%s",(","+newDocId, thisFP)) == 0:
    self.cursor.execute("INSERT INTO `fingerprint` VALUES (%s, %s)", (thisFP,newDocId))
The only bottleneck I have observed so far is the query time in MySQL. My whole application is web based, so time is a critical factor. I have also thought of using Cassandra but have little knowledge of it. Please suggest a better way to tackle this problem.
Get a high end database. Oracle has some offers. SQL Server also.
TRILLIONS of entries is well beyond the scope of a normal database. This is very high-end, very special stuff, especially if you want decent performance. Also get the hardware for it: this means a decent mid-range server, 128+ GB of memory for caching, and either a decent SAN or a good enough DAS setup via SAS.
Remember, TRILLIONS means:
1000gb used for EVERY BYTE.
If the fingerprint is stored as an int64 this is 8000gb disc space alone for this data.
Or are you trying to run that on a small cheap server with a couple of 2 TB discs? Good luck.
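The storage estimate above is simple arithmetic, sketched here under the assumption of exactly one trillion fingerprints and decimal gigabytes (10^9 bytes). It counts only the fingerprint column, not the document lists, indexes or row overhead.

```python
FINGERPRINTS = 10**12        # assumption: exactly one trillion entries
BYTES_PER_INT64 = 8

total_bytes = FINGERPRINTS * BYTES_PER_INT64
total_gb = total_bytes / 10**9   # decimal gigabytes

print(f"{total_gb:,.0f} GB for the fingerprint column alone")
```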
That data structure isn't a great fit for SQL - the 'correct' design in SQL would be to have a row for each fingerprint/document pair, but querying would be impossibly slow unless you add an index that would take up too much space. For what you are trying to do, SQL adds a lot of overhead to support functions you don't need while not supporting the multiple value column that you do need.
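For reference, the row-per-pair design mentioned above could be sketched as follows, using `sqlite3` in memory as a stand-in for MySQL. The table name `fingerprint_doc` and the helper functions are hypothetical, and `INSERT OR IGNORE` is SQLite syntax (MySQL spells it `INSERT IGNORE`). The sketch shows only the shape of the schema; the concern about index size at this scale still applies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fingerprint_doc (
        fp     INTEGER NOT NULL,
        doc_id TEXT    NOT NULL,
        PRIMARY KEY (fp, doc_id)   -- also serves as the lookup index on fp
    )
    """
)


def add_document(fp, doc_id):
    """Record that doc_id contains fingerprint fp; duplicates are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO fingerprint_doc (fp, doc_id) VALUES (?, ?)",
        (fp, doc_id),
    )


def documents_for(fp):
    """Return all document IDs that contain fingerprint fp."""
    rows = conn.execute(
        "SELECT doc_id FROM fingerprint_doc WHERE fp = ? ORDER BY doc_id",
        (fp,),
    )
    return [r[0] for r in rows]


for doc in ("doc1", "doc2", "doc45", "doc2"):  # "doc2" repeated on purpose
    add_document(34, doc)
conn.commit()

print(documents_for(34))
```

Compared with the comma-separated TEXT column, this removes the read-modify-write on every insert and makes duplicate document IDs impossible.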
A redis cluster might be a good fit - the atomic set operations should be perfect for what you are doing, and with the right virtual memory setup and consistent hashing to distribute the fingerprints across nodes it should be able to handle the data volume. The commands would then be
SADD fingerprint docid
to add or update the record, and
SMEMBERS fingerprint
to get all the document ids with that fingerprint.
SADD is O(1). SMEMBERS is O(n), but n is the number of documents in the set, not the number of documents/fingerprints in the system, so effectively also O(1) in this case.
The SQL insert you are currently using is O(n) with n being the very large total number of records, because the records are stored as an ordered list which must be reordered on insert rather than a hash table which is constant time for both get and set.
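To make the set semantics concrete, here is a small in-memory Python sketch that mimics the two commands with a dict of sets. It is not Redis and does not talk to a Redis server; the function names `sadd` and `smembers` are only chosen to mirror the commands above.

```python
from collections import defaultdict

# fingerprint -> set of document IDs (in-memory stand-in for Redis sets)
index = defaultdict(set)


def sadd(fingerprint, doc_id):
    """Add doc_id to the fingerprint's set; return how many items were new."""
    members = index[fingerprint]
    before = len(members)
    members.add(doc_id)          # hash-based insert, average O(1)
    return len(members) - before


def smembers(fingerprint):
    """Return a copy of the fingerprint's set; O(n) in the set's own size."""
    return set(index.get(fingerprint, ()))


sadd(34, "doc1")
sadd(34, "doc2")
sadd(34, "doc1")   # duplicate: the set is unchanged

print(sorted(smembers(34)))
```

The same two operations cover both the insert/update path and the lookup path from the question, without any string concatenation on the document list.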
Greenplum data warehouse, FOC, postgres driven, good luck ...
