Speed up massive insertion with subqueries for foreign keys - python

I have to insert a massive amount of data (from a Python programme into a SQLite DB), where many fields are validated via foreign keys.
The query looks like this, and I perform the insertion with executemany():
INSERT INTO connections_to_jjos(
connection_id,
jjo_error_id,
receiver_task_id,
sender_task_id
)
VALUES
(
:connection_id,
(select id from rtt_errors where name = :rtx_error),
(select id from tasks where name = :receiver_task),
(select id from tasks where name = :sender_task)
)
About 300 insertions take something like 15 seconds, which I think is way too much. In production, there should be blocks of around 1500 insertions in bulk. In similar cases without subqueries for the foreign keys, the insertions are unbelievably fast. It's quite clear that FKs will add overhead and slow down the process, but this is too much.
I could do a pre-query to catch all the foreign key id's, and then insert them directly, but I feel there must be a cleaner option.
On the other hand, I have read about the isolation level, and if I understand it correctly, an automatic COMMIT may be issued before each SELECT query to enforce integrity... that could slow down the process as well, but my attempts to work around it were totally unsuccessful.
Maybe I'm doing something essentially wrong with the FKs. How can I improve the performance?
ADDITIONAL INFORMATION
The query:
EXPLAIN QUERY PLAN select id from rtt_errors where name = '--Unknown--'
Outputs:
SEARCH TABLE rtt_errors USING COVERING INDEX sqlite_autoindex_rtt_errors_1 (name=?) (~1 rows)
I have created an index on rtt_errors.name, but apparently SQLite is not using it; it uses the automatic unique-constraint index (sqlite_autoindex_rtt_errors_1) instead.

In theory, Python's implicit COMMITs should not happen between consecutive INSERTs, but your extremely poor performance looks as if that is what is happening.
Set the isolation level to None, and then execute a pair of BEGIN/COMMIT commands once around all the INSERTs.
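A minimal sketch of that approach with the sqlite3 module (the database filename and the rows list are assumptions; the statement is the one from the question):
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the sqlite3
# module will not inject its own implicit transaction handling around statements.
conn = sqlite3.connect("mydata.db", isolation_level=None)
cur = conn.cursor()

rows = [
    {"connection_id": 1, "rtx_error": "--Unknown--",
     "receiver_task": "task_a", "sender_task": "task_b"},
    # ... roughly 1500 dicts per batch
]

cur.execute("BEGIN")  # one explicit transaction around the whole batch
cur.executemany("""
    INSERT INTO connections_to_jjos(
        connection_id, jjo_error_id, receiver_task_id, sender_task_id
    )
    VALUES (
        :connection_id,
        (SELECT id FROM rtt_errors WHERE name = :rtx_error),
        (SELECT id FROM tasks WHERE name = :receiver_task),
        (SELECT id FROM tasks WHERE name = :sender_task)
    )
""", rows)
cur.execute("COMMIT")  # all the inserts hit the disk in a single commit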

Related

Is it more efficient to store id values in dictionary or re-query database

I have a script that repopulates a large database and generates id values from other tables when needed.
An example would be recording order information when given only customer names. I check whether the customer exists in a CUSTOMER table; if so, I run a SELECT to get the ID and insert the new record, otherwise I create a new CUSTOMER entry and use Last_Insert_Id().
Since these values repeat a lot and I don't always need to generate a new ID, would it be better to store the CUSTOMER => ID relationship in a dictionary that gets checked before reaching the database, or should the script constantly requery the database? I think the first approach is best since it reduces load on the database, but I'm concerned about how large the ID dictionary would get and the impact of that.
The script is running on the same box as the database, so network delays are negligible.
"Is it more efficient"?
Well, a dictionary is storing the values in a hash table. This should be quite efficient for looking up a value.
The major downside is maintaining the dictionary. If you know the database is not going to be updated, then you can load it once and the in-application memory operations are probably going to be faster than anything you can do with a database.
However, if the data is changing, then you have a real challenge. How do you keep the memory version aligned with the database version? This can be very tricky.
My advice would be to keep the work in the database, using indexes for the dictionary key. This should be fast enough for your application. If you need to eke out further speed, then using a dictionary is one possibility -- but no doubt, one possibility out of many -- for improving the application performance.
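If you do go the in-memory route, a minimal sketch of the cache pattern could look like this (sqlite3 is used only for illustration; the database filename, CUSTOMER schema, and get_or_create_customer name are assumptions):
import sqlite3

conn = sqlite3.connect("orders.db")  # assumed filename
customer_ids = {}  # name -> id cache, filled lazily

def get_or_create_customer(name):
    """Return the CUSTOMER id for name, hitting the database only on a cache miss."""
    if name in customer_ids:
        return customer_ids[name]
    cur = conn.cursor()
    row = cur.execute("SELECT id FROM CUSTOMER WHERE name = ?", (name,)).fetchone()
    if row is None:
        cur.execute("INSERT INTO CUSTOMER (name) VALUES (?)", (name,))
        conn.commit()
        customer_id = cur.lastrowid  # id generated by the INSERT
    else:
        customer_id = row[0]
    customer_ids[name] = customer_id
    return customer_id
The cache only stays valid as long as this script is the sole writer, which is exactly the alignment problem described above.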

Fastest way of checking whether a record exists

What is the fastest way of checking whether a record exists, when I know the primary key? select, count, filter, where or something else?
When you use count, the database has to continue the search even if it found the record, because a second one might exist.
So you should search for the actual record, and tell the database to stop after the first one.
When you ask to return data from the record, then the database has to read that data from the table. But if the record can be found by looking up the ID in an index, that table access would be superfluous.
So you should return nothing but the ID you're using to search:
SELECT id FROM MyTable WHERE id = ? LIMIT 1;
Anyway, both optimizations (not reading the actual data, and stopping at the first match) are implied when you use EXISTS, which is also simpler in peewee:
SELECT EXISTS (SELECT * FROM MyTable WHERE id = ?);
MyTable.select().where(MyTable.id == x).exists()
You can check for yourself via EXPLAIN QUERY PLAN, which will tell you the cost and what the engine intends to do for a particular query.
Costs don't directly compare between runs, but you should get a decent idea of whether there are any major differences.
That being said, I would expect SELECT COUNT(id) FROM table WHERE table.id = 'KEY' to be close to ideal, as it will take advantage of any partial lookup ability (particularly fast in columnar databases like Amazon's Redshift) and the primary key indexing.
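A quick way to compare the plans from Python, assuming a SQLite database file example.db and a MyTable whose primary key is id (both names taken from the examples above):
import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()

for query in (
    "SELECT id FROM MyTable WHERE id = ? LIMIT 1",
    "SELECT EXISTS (SELECT * FROM MyTable WHERE id = ?)",
    "SELECT COUNT(id) FROM MyTable WHERE id = ?",
):
    # EXPLAIN QUERY PLAN returns rows describing how SQLite would execute the query.
    plan = cur.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
    print(query)
    for row in plan:
        print("   ", row)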

Understanding SQLAlchemy performance when iterating over an expired query and updating

I was iterating over a query in SQLAlchemy and performing updates on the rows, and I was experiencing unexpectedly poor performance, orders of magnitude slower than I anticipated. The database has about 12K rows. My original query and loop looked like this:
query_all = session.query(MasterImages).all()
for record_counter, record in enumerate(query_all):
    # Some stuff happens here, set_id and set_index are defined
    session.query(MasterImages).\
        filter(MasterImages.id == record.id).\
        update({'set_id': set_id, 'set_index': set_index})
    if record_counter % 100 == 0:
        session.commit()
        print 'Updated {:,} records'.format(record_counter)
session.commit()
The first iteration through the loop was very fast, but then it would seemingly stall after the first commit. I tried a bunch of different approaches but was getting nowhere. Then I tried changing my query so it only selected the fields I needed to calculate the set_id and set_index values I use in my update, like this:
query_all = session.query(MasterImages.id, MasterImages.project_id,
                          MasterImages.visit, MasterImages.orbit,
                          MasterImages.drz_mode, MasterImages.cr_mode).all()
This produced the performance I was expecting, plowing through all the records in well under a minute. After thinking about it for a while, I believe my issue was that my first query turned stale after the commit because I had updated a field that was (unnecessarily) part of the query I was iterating over. This, I think, forced SQLAlchemy to regenerate the query after every commit. By removing the fields I was updating from the query I was iterating over, I was able to keep using the same query, which explains the performance increase.
Am I correct?
Turn off expire_on_commit.
With expire_on_commit on, SQLAlchemy marks all of the objects in query_all as expired (or "stale", as you put it) after the .commit(). What being expired means is that the next time you try to access an attribute on your object SQLAlchemy will issue a SELECT to refresh the object. Turning off expire_on_commit will prevent it from doing that. expire_on_commit is an option so that naive uses of the ORM are not broken, so if you know what you're doing you can safely turn it off.
Your fix works because when you specify columns (MasterImages.id, etc) instead of a mapped class (MasterImages) in your query(), SQLAlchemy returns to you a plain python tuple instead of an instance of the mapped class. The tuple does not offer ORM features, i.e. it does not expire and it will never re-fetch itself from the database.
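A minimal sketch of turning it off at session-creation time (the engine URL is an assumption; expire_on_commit is a real sessionmaker option):
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("sqlite:///images.db")  # assumed database URL

# With expire_on_commit=False, objects loaded by the session keep their attribute
# values after commit() instead of being marked expired and re-SELECTed on next access.
Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()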

Python sqlite3 user defined queries (selecting tables)

I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to injection?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters on values, not on the names of tables, columns, etc. And this is true of sqlite itself, and Python's sqlite module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which placeholder number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize table or column names.
On the one hand, you don't need to worry about quoting or type conversion for table and column names the way you do for values. On the other hand, once you start letting end-user-sourced text specify a table or column, it's hard to see what additional harm injection could do that the user couldn't do already.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because to some other code it shouldn't be normalized)… functions are great for that. Register a wrapper with create_function, and you can pass parameters to that function when you execute it.
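Coming back to the dynamic-SQL option: the usual way to keep it safe is to whitelist the user-selectable names before they ever touch the SQL string, and still bind the search value as a parameter. A rough sketch (the column list is an assumption based on the question, and REGEXP is assumed to be registered on the connection, as in the question):
# Columns and sort orders the user is allowed to pick from.
ALLOWED_COLUMNS = {"title", "author", "publisher", "isbn"}
ALLOWED_ORDER = {"ASC", "DESC"}

def search_books(cursor, category, criteria, order_by, asc_desc):
    if category not in ALLOWED_COLUMNS or order_by not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    if asc_desc.upper() not in ALLOWED_ORDER:
        raise ValueError("unknown sort order")
    # Names are interpolated only after validation; the search value stays a parameter.
    sql = "SELECT * FROM Books WHERE {} REGEXP ? ORDER BY {} {}".format(
        category, order_by, asc_desc.upper())
    return cursor.execute(sql, (criteria,)).fetchall()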

Database query optimization

Ok my Giant friends once again I seek a little space in your shoulders :P
Here is the issue: I have a Python script that is fixing some database issues, but it is taking way too long. The main update statement is this:
cursor.execute("UPDATE jiveuser SET username = '%s' WHERE userid = %d" % (newName,userId))
That is getting called about 9500 times with different newName and userid pairs...
Any suggestions on how to speed up the process? Maybe there is a way to do all the updates with just one query?
Any help will be much appreciated!
PS: Postgres is the db being used.
Insert all the data into another empty table (called userchanges, say) then UPDATE in a single batch:
UPDATE jiveuser
SET username = userchanges.username
FROM userchanges
WHERE userchanges.userid = jiveuser.userid
AND userchanges.username <> jiveuser.username
See the PostgreSQL documentation on the COPY command for bulk loading your data; it also has tips for improving performance when populating a database.
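A rough sketch of that staging-table approach with psycopg2 (the connection string and the pairs list are assumptions; names are assumed to contain no tabs or newlines):
import io
import psycopg2

conn = psycopg2.connect("dbname=jive")  # assumed connection string
cur = conn.cursor()

pairs = [(123, "new_name_a"), (456, "new_name_b")]  # (userid, newName) pairs

# Stage the changes in a temp table, bulk-load them with COPY,
# then apply them to jiveuser in a single UPDATE ... FROM.
cur.execute("CREATE TEMP TABLE userchanges (userid integer, username text)")
buf = io.StringIO("".join("{}\t{}\n".format(uid, name) for uid, name in pairs))
cur.copy_from(buf, "userchanges", columns=("userid", "username"))
cur.execute("""
    UPDATE jiveuser
       SET username = userchanges.username
      FROM userchanges
     WHERE userchanges.userid = jiveuser.userid
       AND userchanges.username <> jiveuser.username
""")
conn.commit()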
First of all, do not use the % operator to construct your SQL. Instead, pass your tuple of arguments as the second parameter to cursor.execute, which also negates the need to quote your argument and allows you to use %s for everything:
cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s", (newName, userId))
This is important to prevent SQL Injection attacks.
To answer your question, you can speed up these updates by creating an index on the userid column, which will allow the database to find each row with an index lookup (roughly O(log n)) rather than scanning the entire table, which is O(n). Since you're using PostgreSQL, here's the syntax to create your index:
CREATE INDEX username_lookup ON jiveuser (userid);
EDIT: Since your comment reveals that you already have an index on the userid column, there's not much you could possibly do to speed up that query. So your main choices are either living with the slowness, since this sounds like a one-time fix-something-broken thing, or following VeeArr's advice and testing whether cursor.executemany will give you a sufficient boost.
The reason it's taking so long is probably that you've got autocommit enabled and each update gets done in its own transaction.
This is slow because even if you have a battery-backed raid controller (which you should definitely have on all database servers, of course), it still needs to do a write into that device for every transaction commit to ensure durability.
The solution is to do more than one row per transaction. But don't make transactions TOO big or you run into problems too. Try committing every 10,000 rows of changes as a rough guess.
You might want to look into cursor.executemany().
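For instance, something like this sketch (psycopg2 assumed as the driver; the connection string and the changes list are assumptions):
import psycopg2

conn = psycopg2.connect("dbname=jive")  # assumed connection string
cur = conn.cursor()

changes = [("alice_new", 101), ("bob_new", 102)]  # (newName, userId) pairs, assumed

# One executemany call and a single commit instead of 9500
# individually committed UPDATE statements.
cur.executemany("UPDATE jiveuser SET username = %s WHERE userid = %s", changes)
conn.commit()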
Perhaps you can create an index on userid to speed things up.
I'd do an explain on this. If it's doing an indexed lookup to find the record -- which it should if you have an index on userid -- then I don't see what you could do to improve performance. If it's not using the index, then the trick is figuring out why not and fixing it.
Oh, you could try using a prepared statement. With 9,500 updates, that should help.
Move this into a stored procedure and execute it from the database itself.
First, make sure you have an index on userid; this ensures the DBMS doesn't have to do a table scan each time:
CREATE INDEX jiveuser_userid ON jiveuser (userid);
Next, try preparing the statement and then executing it. This stops the planner from having to re-examine the query each time:
PREPARE update_username(text, integer) AS UPDATE jiveuser SET username = $1 WHERE userid = $2;
EXECUTE update_username('New Name', 123);
Finally, a bit more performance could be squeezed out by turning off autocommit
\set AUTOCOMMIT off
