What is the fastest way of checking whether a record exists, when I know the primary key? select, count, filter, where or something else?
When you use count, the database has to continue the search even if it found the record, because a second one might exist.
So you should search for the actual record, and tell the database to stop after the first one.
When you ask to return data from the record, the database has to read that data from the table. But if the record can be found by looking up the ID in an index, that table access is superfluous.
So you should return nothing but the ID you're using to search:
SELECT id FROM MyTable WHERE id = ? LIMIT 1;
Both optimizations, not reading the actual data and stopping after the first match, are implied when you use EXISTS, which peewee makes even simpler:
SELECT EXISTS (SELECT * FROM MyTable WHERE id = ?);
MyTable.select().where(MyTable.id == x).exists()
You can check yourself via EXPLAIN QUERY PLAN, which will tell you the cost and what the database intends to do for a particular query.
Costs don't directly compare between runs, but you should get a decent idea of whether there are any major differences.
That being said, I would expect SELECT COUNT(id) FROM table WHERE table.id = 'KEY' to be close to ideal, as it can take advantage of any partial lookup ability (particularly fast in columnar databases such as Amazon Redshift) and the primary key index.
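For example, here is one way to inspect the plan from Python's built-in sqlite3 module (a minimal sketch; the table and column names are illustrative, not taken from the question):
import sqlite3

# Minimal sketch: inspect the query plan for an existence check in SQLite.
conn = sqlite3.connect("example.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS MyTable (id INTEGER PRIMARY KEY, payload TEXT)")

# EXPLAIN QUERY PLAN returns one row per plan step; you want to see
# "SEARCH ... USING INTEGER PRIMARY KEY" rather than "SCAN TABLE".
for row in cur.execute(
    "EXPLAIN QUERY PLAN SELECT EXISTS (SELECT 1 FROM MyTable WHERE id = ?)", (42,)
):
    print(row)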
Following DataStax's advice to 'use roughly one table per query pattern' as mentioned here, I have set up the same table twice, but keyed differently to optimize read times.
-- This table supports queries that filter on specific first_ids and a gt/lt filter on time
CREATE TABLE IF NOT EXISTS table_by_first_Id
(
first_id INT,
time TIMESTAMP,
second_id INT,
value FLOAT,
PRIMARY KEY (first_id, time, second_id)
);
-- Same table, but rearranged to filter on specific second_ids and the same gt/lt time filter
CREATE TABLE IF NOT EXISTS table_by_second_Id
(
second_id INT,
time TIMESTAMP,
first_id INT,
value FLOAT,
PRIMARY KEY (second_id, time, first_id)
);
Then, I have created 2 models using DataStax's Python driver, one for each table.
class ModelByFirstId (...)
class ModelBySecondId (...)
The Problem
I can't seem to figure out how to cleanly ensure that an insert into one of the tables is atomically accompanied by the corresponding insert into the other table. The only thing I can think of is
def insert_some_data(...):
ModelByFirstId.create(...)
ModelBySecondId.create(...)
I'm looking to see if there's an alternative way to ensure that insertion into one table is reflected into the other - perhaps in the model or table definition, in order to hopefully protect against errant inserts into just one of the models.
I'm also open to restructuring or remaking my tables altogether to accommodate this if needed.
NoSQL databases built for high availability and partition tolerance (the A and P of CAP) are not designed to provide strong referential integrity; they are designed for high-throughput, low-latency reads and writes. Cassandra itself has no concept of referential integrity across tables. Do, however, look at lightweight transactions (LWT) and batches for your use case.
Please find some good material to read for the same:
https://www.oreilly.com/content/cassandra-data-modeling/
https://docs.datastax.com/en/cql-oss/3.3/cql/cql_using/useBatch.html
Specifically for your use case, see whether you can use the following single-table data model:
CREATE TABLE IF NOT EXISTS table_by_Id
(
primary_id INT,
secondary_id INT,
time TIMESTAMP,
value FLOAT,
PRIMARY KEY (primary_id ,secondary_id ,time)
);
For each input record you create two entries in the table: one with first_id as primary_id (and second_id as secondary_id), and a second with second_id as primary_id (and first_id as secondary_id). Then use batch inserts (as described in the documentation above) so that both writes are applied together; a sketch follows. This might not be the best solution for your problem, but think about it.
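Here is a minimal sketch of a batched insert using the cqlengine models from the question (it assumes ModelByFirstId and ModelBySecondId are defined as above; the function signature is illustrative):
from cassandra.cqlengine.query import BatchQuery

# Sketch only: a logged batch guarantees that both writes are eventually
# applied together (atomic across the two tables, though not isolated).
def insert_some_data(first_id, second_id, time, value):
    with BatchQuery() as b:
        ModelByFirstId.batch(b).create(
            first_id=first_id, time=time, second_id=second_id, value=value
        )
        ModelBySecondId.batch(b).create(
            second_id=second_id, time=time, first_id=first_id, value=value
        )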
I'm writing a simple flask-restful API and I need to insert a resource into the database. I want to have a hash id visible in the URL, like /api/resource/hSkR3V9aS, rather than just a simple auto-increment id, /api/resource/34.
My first thought was to use Hashids and just generate the hash_id from the auto-increment id and store both values in the database, but the problem is that I would have to first INSERT the new row, GET the id and then UPDATE the hash_id field.
My second attempt was to generate the hash_id (e.g. sha1) not from the id but from some other field that I'm passing to the database, and use it as the primary key (getting rid of the auto-increment id), but I fear that searching on and comparing strings each time rather than ints will be much, much slower.
What is the best way to achieve the desired hash_id-based URL while keeping the speed of the database SELECT queries acceptable?
I think this is the most related stack question, but it doesn't answer my question.
Major technology details: Python 3.6, flask_mysqldb library, MySQL database
Please let me know if I omitted some information and I will provide it.
I think I found a decent solution myself in this answer
Use cursor.lastrowid to get the last row ID inserted on the cursor
object, or connection.insert_id() to get the ID from the last insert
on that connection.
It's per-connection based so there is no fear that I'll have 2 rows with the same ID.
I'll now use the Hashids library I mentioned earlier and return the hashed value to the client. Hashids can also be decoded, and I'll do that each time I get a request whose URL includes the hash id; a sketch of the flow follows.
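A minimal sketch of that flow with the hashids package (the salt, table name, and column names are illustrative assumptions, not from my actual schema):
from hashids import Hashids

# Illustrative salt and length; keep the real salt out of source control.
hashids = Hashids(salt="my-secret-salt", min_length=9)

def create_resource(cursor, name):
    cursor.execute("INSERT INTO resource (name) VALUES (%s)", (name,))
    # lastrowid is the auto-increment id generated on this connection.
    return hashids.encode(cursor.lastrowid)

def fetch_resource(cursor, hash_id):
    decoded = hashids.decode(hash_id)  # returns a tuple, empty if invalid
    if not decoded:
        return None
    cursor.execute("SELECT id, name FROM resource WHERE id = %s", (decoded[0],))
    return cursor.fetchone()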
I also found out that the MongoDB database generates this kind of hashed id by itself; maybe that is a solution for someone else with a similar problem.
Sorry for the vague question, let me explain...
I have a list of words and counts in a database that has, no doubt, reached a gigantic size: roughly an 80 MB database, with each entry being two columns (word, integer).
Now when I am trying to add a word, I check to see if it is already in the database like this...python sqlite3 class method...
self.c.execute('SELECT * FROM {tn} WHERE {cn} = """{wn}"""'.format(tn=self.table1, cn=self.column1, wn=word_name))
exist = self.c.fetchall()
if exist:
do something
So you're checking for the existence of a word within a very large table of words? I think the short and simple answer to your question is to create an index for your word column.
The next step would be to set up a real database (e.g. Postgres) instead of SQLite. SQLite doesn't have the optimization tweaks of a production database, and you'd likely see a performance gain after switching.
Even for a table with millions of rows, this shouldn't be a super time-intensive query if your table is properly indexed. If you already have an index and are still facing performance issues, there's either something wrong with your database setup/environment or a bottleneck in your Python code or DB adapter. Hard to say without more information.
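A minimal sketch of adding the index and doing an index-backed existence check (the table and column names words/word are placeholders for whatever self.table1 and self.column1 actually are):
import sqlite3

conn = sqlite3.connect("words.db")
cur = conn.cursor()

# One-time: index the word column so lookups stop being full-table scans.
cur.execute("CREATE INDEX IF NOT EXISTS idx_words_word ON words (word)")

def word_exists(word):
    # Parameter binding avoids the quoting gymnastics and SQL injection.
    cur.execute("SELECT EXISTS (SELECT 1 FROM words WHERE word = ?)", (word,))
    return bool(cur.fetchone()[0])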
I would imagine that using COUNT within SQL would be faster:
self.c.execute('SELECT COUNT(*) FROM {tn} WHERE {cn} = ?'.format(tn=self.table1, cn=self.column1), (word_name,))
num = self.c.fetchone()[0]
if num:
    # do something
though I haven't tested it.
See How to check the existence of a row in SQLite with Python? for a similar question.
I have to insert massive data (from a Python programme into a SQLite DB), where many fields are validated via foreign keys.
The query looks like this, and I perform the insertion with executemany()
INSERT INTO connections_to_jjos(
connection_id,
jjo_error_id,
receiver_task_id,
sender_task_id
)
VALUES
(
:connection_id,
(select id from rtt_errors where name = :rtx_error),
(select id from tasks where name = :receiver_task),
(select id from tasks where name = :sender_task)
)
About 300 insertions take something like 15 seconds, which I think is way too much. In production there should be blocks of around 1500 insertions in bulk. In similar cases without subqueries for the foreign keys, the speed is unbelievable. It's quite clear that the FKs add overhead and slow down the process, but this is too much.
I could do a pre-query to catch all the foreign key id's, and then insert them directly, but I feel there must be a cleaner option.
On the other hand, I have read about the isolation level, and if I understand it correctly, there may be an automatic COMMIT before each SELECT query to enforce integrity... that could slow the process down as well, but my attempts to work this way were totally unsuccessful.
Maybe I'm doing something essentially wrong with the FK's. How can I improve the performance?
ADDITIONAL INFORMATION
The query:
EXPLAIN QUERY PLAN select id from rtt_errors where name = '--Unknown--'
Outputs:
SEARCH TABLE rtt_errors USING COVERING INDEX sqlite_autoindex_rtt_errors_1 (name=?) (~1 rows)
I have created an index on rtt_errors.name, but apparently it is not being used.
In theory, Python's automatic COMMITs should not happen between consecutive INSERTs, but your extremely poor performance looks as if that is what is happening.
Set the isolation level to None, and then execute a pair of BEGIN/COMMIT commands once around all the INSERTs.
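A minimal sketch with the standard sqlite3 module (the database path and the rows parameter list are placeholders for whatever you already pass to executemany()):
import sqlite3

# isolation_level=None disables the sqlite3 module's implicit transaction
# handling, so BEGIN/COMMIT are entirely under our control.
conn = sqlite3.connect("production.db", isolation_level=None)
cur = conn.cursor()

cur.execute("BEGIN")
cur.executemany(
    """INSERT INTO connections_to_jjos(
           connection_id, jjo_error_id, receiver_task_id, sender_task_id)
       VALUES (
           :connection_id,
           (SELECT id FROM rtt_errors WHERE name = :rtx_error),
           (SELECT id FROM tasks WHERE name = :receiver_task),
           (SELECT id FROM tasks WHERE name = :sender_task))""",
    rows,  # the list of parameter dicts from the question
)
cur.execute("COMMIT")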
OK my giant friends, once again I seek a little space on your shoulders :P
Here is the issue: I have a Python script that is fixing some database issues, but it is taking way too long. The main update statement is this:
cursor.execute("UPDATE jiveuser SET username = '%s' WHERE userid = %d" % (newName,userId))
That is getting called about 9500 times with different newName and userid pairs...
Any suggestions on how to speed up the process? Maybe somehow a way where I can do all updates with just one query?
Any help will be much appreciated!
PS: Postgres is the db being used.
Insert all the data into another empty table (called userchanges, say) then UPDATE in a single batch:
UPDATE jiveuser
SET username = userchanges.username
FROM userchanges
WHERE userchanges.userid = jiveuser.userid
AND userchanges.username <> jiveuser.username
See this documentation on the COPY command for bulk loading your data.
There are also tips for improving performance when populating a database.
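A rough sketch of that approach with psycopg2 (the connection parameters and the changes list of (userid, username) pairs are assumptions; it also assumes usernames contain no tabs or newlines):
import io
import psycopg2

conn = psycopg2.connect("dbname=jive")  # illustrative connection string
cur = conn.cursor()

# Stage the ~9500 (userid, username) pairs with COPY, which is far faster
# than row-by-row INSERTs.
cur.execute("CREATE TEMP TABLE userchanges (userid integer, username text)")
buf = io.StringIO("".join("{}\t{}\n".format(uid, name) for uid, name in changes))
cur.copy_from(buf, "userchanges", columns=("userid", "username"))

# Apply every change in a single statement.
cur.execute("""
    UPDATE jiveuser
       SET username = userchanges.username
      FROM userchanges
     WHERE userchanges.userid = jiveuser.userid
       AND userchanges.username <> jiveuser.username
""")
conn.commit()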
First of all, do not use the % operator to construct your SQL. Instead, pass your tuple of arguments as the second parameter to cursor.execute, which also negates the need to quote your argument and allows you to use %s for everything:
cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s", (newName, userId))
This is important to prevent SQL Injection attacks.
To answer your question, you can speed up these updates by creating an index on the userid column, which lets the database locate each row through an index lookup (roughly O(log n)) instead of scanning the entire table, which is O(n). Since you're using PostgreSQL, here's the syntax to create your index:
CREATE INDEX username_lookup ON jiveuser (userid);
EDIT: Since your comment reveals that you already have an index on the userid column, there's not much you could possibly do to speed up that query. So your main choices are either living with the slowness, since this sounds like a one-time fix-something-broken thing, or following VeeArr's advice and testing whether cursor.executemany will give you a sufficient boost.
The reason it's taking so long is probably that you've got autocommit enabled and each update gets done in its own transaction.
This is slow because even if you have a battery-backed raid controller (which you should definitely have on all database servers, of course), it still needs to do a write into that device for every transaction commit to ensure durability.
The solution is to do more than one row per transaction. But don't make transactions TOO big or you run into problems too. Try committing every 10,000 rows of changes as a rough guess.
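A minimal sketch of batching the commits (conn is assumed to be an open database connection with autocommit disabled, and changes is the list of (newName, userId) pairs):
cur = conn.cursor()
for i, (new_name, user_id) in enumerate(changes, start=1):
    cur.execute(
        "UPDATE jiveuser SET username = %s WHERE userid = %s",
        (new_name, user_id),
    )
    if i % 10000 == 0:
        conn.commit()  # one commit per 10,000 updates instead of one per row
conn.commit()  # commit the remainder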
You might want to look into executemany(): Information here
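For example, with the same parameterized statement and list of pairs (names are assumptions):
# executemany() loops over the parameter list for you; with some drivers it
# also reduces per-statement overhead compared with 9500 separate execute() calls.
cursor.executemany(
    "UPDATE jiveuser SET username = %s WHERE userid = %s",
    changes,  # list of (newName, userId) tuples
)
conn.commit()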
Perhaps you can create an index on userid to speed things up.
I'd do an explain on this. If it's doing an indexed lookup to find the record -- which it should if you have an index on userid -- then I don't see what you could do to improve performance. If it's not using the index, then the trick is figuring out why not and fixing it.
Oh, you could try using a prepared statement. With 9500 updates, that should help.
Move this to a stored procedure and execute it from the database itself.
First, ensure you have an index on 'userid'; this ensures the DBMS doesn't have to do a table scan each time:
CREATE INDEX jiveuser_userid ON jiveuser (userid);
Next, try preparing the statement and then calling EXECUTE on it. This stops the planner from having to examine the query each time:
PREPARE update_username(text, integer) AS UPDATE jiveuser SET username = $1 WHERE userid = $2;
EXECUTE update_username('New Name', 123);
Finally, a bit more performance could be squeezed out by turning off autocommit
\set AUTOCOMMIT off