Ok my Giant friends, once again I seek a little space on your shoulders :P
Here is the issue: I have a Python script that is fixing some database issues, but it is taking way too long. The main update statement is this:
cursor.execute("UPDATE jiveuser SET username = '%s' WHERE userid = %d" % (newName,userId))
That is getting called about 9500 times with different newName and userid pairs...
Any suggestions on how to speed up the process? Maybe there's a way I can do all the updates with just one query?
Any help will be much appreciated!
PS: Postgres is the db being used.
Insert all the data into another empty table (called userchanges, say) then UPDATE in a single batch:
UPDATE jiveuser
SET username = userchanges.username
FROM userchanges
WHERE userchanges.userid = jiveuser.userid
AND userchanges.username <> jiveuser.username
See this documentation on the COPY command for bulk loading your data.
There are also tips for improving performance when populating a database.
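A rough sketch of that approach from Python with psycopg2 (changes is assumed to be your list of (userid, newName) pairs, and the values are assumed not to contain tabs or newlines):
import io
import psycopg2

conn = psycopg2.connect('dbname=jive')   # connection details are placeholders
cur = conn.cursor()

cur.execute('CREATE TEMP TABLE userchanges (userid integer, username text)')

# Bulk-load the pairs with COPY, then apply them in a single UPDATE.
buf = io.StringIO(''.join('%d\t%s\n' % (uid, name) for uid, name in changes))
cur.copy_from(buf, 'userchanges', columns=('userid', 'username'))

cur.execute("""
    UPDATE jiveuser
       SET username = userchanges.username
      FROM userchanges
     WHERE userchanges.userid = jiveuser.userid
       AND userchanges.username <> jiveuser.username
""")
conn.commit()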
First of all, do not use the % operator to construct your SQL. Instead, pass your tuple of arguments as the second parameter to cursor.execute, which also negates the need to quote your argument and allows you to use %s for everything:
cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s", (newName, userId))
This is important to prevent SQL Injection attacks.
To answer your question, you can speed up these updates by creating an index on the userid column, which lets the database find each row in roughly O(log n) time rather than having to scan the entire table, which is O(n). Since you're using PostgreSQL, here's the syntax to create your index:
CREATE INDEX username_lookup ON jiveuser (userid);
EDIT: Since your comment reveals that you already have an index on the userid column, there's not much you could possibly do to speed up that query. So your main choices are either living with the slowness, since this sounds like a one-time fix-something-broken thing, or following VeeArr's advice and testing whether cursor.executemany will give you a sufficient boost.
The reason it's taking so long is probably that you've got autocommit enabled and each update gets done in its own transaction.
This is slow because even if you have a battery-backed raid controller (which you should definitely have on all database servers, of course), it still needs to do a write into that device for every transaction commit to ensure durability.
The solution is to do more than one row per transaction. But don't make transactions TOO big or you run into problems too. Try committing every 10,000 rows of changes as a rough guess.
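A rough sketch of that batching (assuming a psycopg2 connection conn with autocommit off, and changes as the list of (newName, userId) pairs):
for i, (new_name, user_id) in enumerate(changes, start=1):
    cursor.execute("UPDATE jiveuser SET username = %s WHERE userid = %s",
                   (new_name, user_id))
    if i % 10000 == 0:
        conn.commit()          # flush a batch of 10,000 updates
conn.commit()                  # commit the final partial batch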
You might want to look into cursor.executemany().
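For example, a minimal sketch (assuming the same psycopg2 cursor and a list of (newName, userId) pairs called changes):
cursor.executemany(
    "UPDATE jiveuser SET username = %s WHERE userid = %s",
    changes,
)
conn.commit()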
Perhaps you can create an index on userid to speed things up.
I'd do an explain on this. If it's doing an indexed lookup to find the record -- which it should if you have an index on userid -- then I don't see what you could do to improve performance. If it's not using the index, then the trick is figuring out why not and fixing it.
Oh, you could try using a prepared statement. With 9500 updates, that should help.
Move this to a stored procedure and execute it from the database itself.
First, ensure you have an index on 'userid'; this will ensure the DBMS doesn't have to do a table scan each time:
CREATE INDEX jiveuser_userid ON jiveuser (userid);
Next, try preparing the statement and then calling EXECUTE on it. This will stop the optimizer from having to examine the query each time:
PREPARE update_username(text, integer) AS UPDATE jiveuser SET username = $1 WHERE userid = $2;
EXECUTE update_username('New Name', 123);
Finally, a bit more performance could be squeezed out by turning off autocommit (note that psql's variable name is case-sensitive):
\set AUTOCOMMIT off
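If you are driving this from Python with psycopg2, a sketch of the same idea (changes is assumed to be the list of (newName, userId) pairs):
cursor.execute(
    "PREPARE update_username(text, integer) AS "
    "UPDATE jiveuser SET username = $1 WHERE userid = $2"
)
for new_name, user_id in changes:
    cursor.execute("EXECUTE update_username(%s, %s)", (new_name, user_id))
conn.commit()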
Related
I am building a little interface where I would like users to be able to write out their entire SQL statement and then see the data that is returned. However, I don't want a user to be able to do anything funny, e.g. delete from user_table;. Actually, the only thing I would like users to be able to do is run select statements. I know there aren't specific users for SQLite, so I am thinking that what I am going to have to do is have a set of rules that reject certain queries. Maybe a regex string or something (regex scares me a little bit). Any ideas on how to accomplish this?
def input_is_safe(input):
    input = input.lower()
    if "select" not in input:
        return False
    # more stuff
    return True
I can suggest a different approach to your problem: you can restrict access to your database by opening it read-only. That way, even if users try to execute delete/update queries, they will not be able to damage your data.
Here is how to open a read-only connection from Python:
db = sqlite3.connect('file:/path/to/database?mode=ro', uri=True)
Python's sqlite3 execute() method will only execute a single SQL statement, so if you ensure that all statements start with the SELECT keyword, you are reasonably protected from dumb stuff like SELECT 1; DROP TABLE USERS. But you should check sqlite's SQL syntax to ensure there is no way to embed a data definition or data modification statement as a subquery.
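For what it's worth, a minimal sketch of that "starts with SELECT" check (a heuristic only, per the caveats above):
def looks_like_select(sql):
    # Heuristic: accept only statements whose first keyword is SELECT.
    return sql.lstrip().lower().startswith("select")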
My personal opinion is that if "regex scares you a little bit", you might as well just put your computer in a box and mail it off to <stereotypical country of hackers>. Letting untrusted users write SQL code is playing with fire, and you need to know what you're doing or you'll get fried.
Open the database as read only, to prevent any changes.
Many statements, such as PRAGMA or ATTACH, can be dangerous. Use an authorizer callback (C docs) to allow only SELECTs.
Queries can run for a long time, or generate a large amount of data. Use a progress handler to abort queries that run for too long.
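A rough sketch of both ideas using Python's standard sqlite3 module (the constants and methods are real sqlite3 APIs; the five-second timeout and 10,000-instruction interval are arbitrary choices):
import sqlite3
import time

conn = sqlite3.connect('file:/path/to/database?mode=ro', uri=True)

def select_only(action, arg1, arg2, db_name, trigger):
    # Permit reading columns and running SELECT statements; deny the rest.
    # Depending on your queries you may need to whitelist further action codes.
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY

conn.set_authorizer(select_only)

def run_query(sql, timeout_seconds=5):
    # The progress handler is called every 10,000 VM instructions; returning
    # a non-zero value aborts the running query.
    deadline = time.time() + timeout_seconds
    conn.set_progress_handler(lambda: 1 if time.time() > deadline else 0, 10000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 10000)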
Sorry for the vague question, let me explain...
I have a list of words and counts in a database that has, no doubt, reached a gigantic size: an ~80 MB database where each entry has two columns (word, integer).
Now when I am trying to add a word, I check to see if it is already in the database like this (a Python sqlite3 class method)...
self.c.execute('SELECT * FROM {tn} WHERE {cn} = """{wn}"""'.format(tn=self.table1, cn=self.column1, wn=word_name))
exist = self.c.fetchall()
if exist:
    # do something
So you're checking for the existence of a word within a very large table of words? I think the short and simple answer to your question is to create an index for your word column.
The next step would be to set up a real database (e.g. Postgres) instead of SQLite. SQLite doesn't have the optimization tweaks of a production database, and you'd likely see a performance gain after switching.
Even for a table with millions of rows, this shouldn't be a super time-intensive query if your table is properly indexed. If you already have an index and are still facing performance issues, there's either something wrong with your database setup/environment or a bottleneck in your Python code or DB adapter. Hard to say without more information.
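For example, something like this (a sketch that borrows the self.table1 / self.column1 names from the question; self.conn is assumed to be the underlying connection object):
self.c.execute('CREATE INDEX IF NOT EXISTS word_lookup ON {tn} ({cn})'.format(
    tn=self.table1, cn=self.column1))
self.conn.commit()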
I would imagine that using COUNT within SQL would be faster:
self.c.execute('SELECT COUNT(*) FROM {tn} WHERE {cn} = """{wn}"""'.format(tn=self.table1, cn=self.column1, wn=word_name))
num = self.c.fetchone()[0]
if num:
    # do something
though I haven't tested it.
See How to check the existence of a row in SQLite with Python? for a similar question.
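A slightly cheaper and safer variant of these checks is to bind the word as a parameter and stop at the first match (a sketch; it assumes self.table1 and self.column1 are trusted, non-user-supplied names):
self.c.execute('SELECT 1 FROM {tn} WHERE {cn} = ? LIMIT 1'.format(
    tn=self.table1, cn=self.column1), (word_name,))
if self.c.fetchone() is not None:
    # do something
    pass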
I was iterating over a query in SQLAlchemy and performing updates on the rows, and I was experiencing unexpectedly poor performance, orders of magnitude slower than I anticipated. The database has about 12K rows. My original query and loop looked like this:
query_all = session.query(MasterImages).all()
for record_counter, record in enumerate(query_all):
    # Some stuff happens here, set_id and set_index are defined
    session.query(MasterImages).\
        filter(MasterImages.id == record.id).\
        update({'set_id': set_id, 'set_index': set_index})
    if record_counter % 100 == 0:
        session.commit()
        print 'Updated {:,} records'.format(record_counter)
session.commit()
The first iteration through the loop was very fast, but then it would seemingly stop after the first commit. I tried a bunch of different approaches but was getting nowhere. Then I tried changing my query so it only selected the fields I needed to calculate the set_id and set_index values I use in my update, like this:
query_all = session.query(MasterImages.id, MasterImages.project_id,
                          MasterImages.visit, MasterImages.orbit,
                          MasterImages.drz_mode, MasterImages.cr_mode).all()
This produced the performance I was expecting, plowing through all the records in well under a minute. After thinking about it for a while, I believe the issue was that my first query, after the commit, turned into a stale query because I had updated a field that was (unnecessarily) in the query I was iterating over. This, I think, forced Alchemy to regenerate the query after every commit. By removing the fields I was updating from the query I was iterating over, I think I was able to reuse the same query, and this resulted in my performance increase.
Am I correct?
Turn off expire_on_commit.
With expire_on_commit on, SQLAlchemy marks all of the objects in query_all as expired (or "stale", as you put it) after the .commit(). What being expired means is that the next time you try to access an attribute on your object SQLAlchemy will issue a SELECT to refresh the object. Turning off expire_on_commit will prevent it from doing that. expire_on_commit is an option so that naive uses of the ORM are not broken, so if you know what you're doing you can safely turn it off.
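For example, a minimal sketch of turning it off at the session-factory level (assuming the usual sessionmaker setup and an existing engine):
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()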
Your fix works because when you specify columns (MasterImages.id, etc) instead of a mapped class (MasterImages) in your query(), SQLAlchemy returns to you a plain python tuple instead of an instance of the mapped class. The tuple does not offer ORM features, i.e. it does not expire and it will never re-fetch itself from the database.
I have to insert a massive amount of data (from a Python programme into a SQLite DB), where many fields are validated via foreign keys.
The query looks like this, and I perform the insertion with executemany()
INSERT INTO connections_to_jjos(
    connection_id,
    jjo_error_id,
    receiver_task_id,
    sender_task_id
)
VALUES
(
    :connection_id,
    (select id from rtt_errors where name = :rtx_error),
    (select id from tasks where name = :receiver_task),
    (select id from tasks where name = :sender_task)
)
About 300 insertions take something like 15 seconds, which I think is way too much. In production, there should be blocks of around 1,500 insertions in bulk. In similar cases without subqueries for the foreign keys, the speed is unbelievable. It's quite clear that FKs will add overhead and slow down the process, but this is too much.
I could do a pre-query to catch all the foreign key id's, and then insert them directly, but I feel there must be a cleaner option.
On the other hand, I have read about the isolation level, and if I understand it correctly, it could be that before each SELECT query there is an automatic COMMIT to enforce integrity... that could be slowing down the process as well, but my attempts to work this way were totally unsuccessful.
Maybe I'm doing something essentially wrong with the FK's. How can I improve the performance?
ADDITIONAL INFORMATION
The query:
EXPLAIN QUERY PLAN select id from rtt_errors where name = '--Unknown--'
Outputs:
SEARCH TABLE rtt_errors USING COVERING INDEX sqlite_autoindex_rtt_errors_1 (name=?) (~1 rows)
I have created an index on rtt_errors.name, but apparently it is not using it.
In theory, Python's default COMMITs should not happen between consecutive INSERTs, but your extremely poor performance looks as if this is what is happening.
Set the isolation level to None, and then execute a pair of BEGIN/COMMIT commands once around all the INSERTs.
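A sketch of what that looks like (assuming the sqlite3 connection is called conn, query holds the INSERT shown in the question, and rows is the list of parameter dictionaries passed to executemany):
conn.isolation_level = None     # take manual control of transactions
conn.execute("BEGIN")
conn.executemany(query, rows)
conn.execute("COMMIT")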
The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below). It looks like setFetchSize is the key to solving this. In my case I execute the query from python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 dbapi driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs and, if you're using the ORM, the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
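Roughly, the two options look like this (a sketch; the connection URL, the Data mapped class, and process() are placeholders, not names from your code):
from sqlalchemy import create_engine

# Engine-wide server-side (named) cursors with the psycopg2 dialect:
engine = create_engine('postgresql+psycopg2://user:password@host/dbname',
                       server_side_cursors=True)

# Or, with the ORM, stream rows in batches of 1000 instead of buffering them all:
for row in session.query(Data.time, Data.value).order_by(Data.time).yield_per(1000):
    process(row.time, row.value)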
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PGSQL does not. *shrug*
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are your results will come back in time order anyway (although strictly speaking SQL gives no ordering guarantee without an ORDER BY), and you might get your results more quickly.