I have a job which copies a large file into a table temp_a and also creates an index idx_temp_a_j on a column j. Once the job finishes copying all the data, I have to rename this table to prod_a, which is production facing; queries run against it almost constantly, with very little idle time. But once I run the rename statements, both the incoming queries and the queries that are already running get backed up, producing high API error rates. I want to know what strategies I could implement so that the renaming of the table happens with less downtime.
So far, below are the strategies I came up with:
First, just rename the table and let the queries back up. This approach seems unreliable: the rename acquires an ACCESS EXCLUSIVE lock, all other queries queue behind it, and I get a high level of API errors.
Second, write a polling function which checks whether any queries are currently running, and renames the table and index only when none are. The function polls periodically: if queries are running, it waits; if not, it runs the ALTER TABLE ... RENAME. This approach only queues up the queries that arrive after the rename has taken its exclusive lock on the table; once the renaming finishes, the queued-up queries get executed. I still need to find the database APIs that would help me write this function.
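To make that second idea concrete, here is a rough, untested sketch of the polling function I have in mind, using psycopg2 and pg_stat_activity (whether these are the right APIs is exactly what I am unsure about; all names besides temp_a, prod_a and idx_temp_a_j are made up):

```python
import time

import psycopg2


def wait_and_rename(dsn, poll_interval=1.0):
    """Rough sketch: wait until nothing is querying prod_a, then swap the tables."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            while True:
                # Count sessions (other than ours) actively running queries that mention prod_a.
                cur.execute("""
                    SELECT count(*)
                    FROM pg_stat_activity
                    WHERE state <> 'idle'
                      AND pid <> pg_backend_pid()
                      AND query ILIKE '%prod_a%'
                """)
                (active,) = cur.fetchone()
                if active == 0:
                    break  # note: a new query can still start between this check and the rename
                time.sleep(poll_interval)
            # Swap the tables and rename the index in one short transaction.
            cur.execute("BEGIN")
            cur.execute("ALTER TABLE prod_a RENAME TO prod_a_old")
            cur.execute("ALTER TABLE temp_a RENAME TO prod_a")
            cur.execute("ALTER INDEX idx_temp_a_j RENAME TO idx_prod_a_j")
            cur.execute("COMMIT")
    finally:
        conn.close()
```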
What other strategies could allow this "seamless" renaming of the table? I am using PostgreSQL 11.4, and the job which does all this is written in Python.
You cannot avoid blocking concurrent queries while a table is renamed.
The operation itself is blazingly fast, so any delay you experience must be because the ALTER TABLE is blocked by long-running transactions that are using the table. All later operations on the table then have to queue behind the ALTER TABLE.
The solution for painless renaming is to keep database transactions very short (which is always desirable, since it also reduces the danger of deadlocks).
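If you cannot make every transaction short, one mitigation worth trying (this goes beyond the rename itself) is to run the ALTER TABLE with a short lock_timeout and retry: the rename then fails fast instead of making all later queries queue behind it while it waits for the lock. A minimal psycopg2 sketch with made-up names (psycopg2 >= 2.8 for the errors module):

```python
import time

import psycopg2


def rename_with_retry(dsn, attempts=10):
    """Sketch: give up quickly if the lock is not available, back off, and retry."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            cur.execute("SET lock_timeout = '2s'")  # wait at most 2 seconds for the lock
            for _ in range(attempts):
                try:
                    cur.execute("BEGIN")
                    cur.execute("ALTER TABLE prod_a RENAME TO prod_a_old")
                    cur.execute("ALTER TABLE temp_a RENAME TO prod_a")
                    cur.execute("COMMIT")
                    return True
                except psycopg2.errors.LockNotAvailable:
                    cur.execute("ROLLBACK")
                    time.sleep(1)  # brief back-off before the next attempt
            return False
    finally:
        conn.close()
```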
Related
I am using a MySQL database via Python for storing logs.
I was wondering if there is an efficient way to remove the oldest row once the number of rows exceeds a limit.
I was able to do this by executing a query to find the total number of rows and then deleting the oldest ones by ordering them in ascending order. But this method takes too much time. Is there a way to make this efficient by setting up a rule when creating the table, so that MySQL itself takes care of it when the limit is exceeded?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also, putting the burden of deleting on the thread that inserts new data adds extra work to the insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine based on a schedule. But you can only do tasks you can do in a stored routine, and it's not easy to make it do retries, or archive to any external system (e.g. cloud archive), or notify you when it's done.
Sorry, there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
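A rough sketch of the deletion part of such a job (pymysql and the auto-increment id column are illustrative assumptions; cron or a timer thread would call this once an hour):

```python
import pymysql  # any MySQL DB-API driver works the same way; pymysql is just an example

ROW_LIMIT = 1_000_000   # keep at most this many rows
BATCH_SIZE = 1000       # delete in modest batches so locks are held only briefly


def trim_log_table():
    """Delete the oldest rows in batches until the table is back under the limit."""
    conn = pymysql.connect(host="localhost", user="app", password="secret", database="logs")
    try:
        with conn.cursor() as cur:
            while True:
                cur.execute("SELECT COUNT(*) FROM log")
                (count,) = cur.fetchone()
                if count <= ROW_LIMIT:
                    break
                # Assumes an auto-increment primary key `id`, so the smallest ids are the oldest rows.
                cur.execute(f"DELETE FROM log ORDER BY id LIMIT {BATCH_SIZE}")
                conn.commit()  # commit each batch before moving on
    finally:
        conn.close()
```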
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
I'm working on a project that uses Psycopg2 to connect to a Postgres database from Python, and in the project there is a place where all the modifications to the database are committed after performing a certain number of insertions.
So I have a basic question:
Is there any benefit to committing in smaller chunks, or is it the same as waiting to commit until the end?
For example, say I'm going to insert 7,000 rows of data: should I commit after every 1,000 inserts, or just wait until all the data is added?
If there are problems with large commits, what are they? Could I possibly crash the database, or cause some sort of overflow?
Unlike some other database systems, there is no problem with modifying arbitrarily many rows in a single transaction.
Usually it is better to do everything in a single transaction, so that the whole thing succeeds or fails, but there are two considerations:
If the transaction takes a lot of time, it will keep VACUUM from doing its job on the rest of the database for the duration of the transaction. That may cause table bloat if there is a lot of concurrent write activity.
It may be useful to do the operation in batches if you expect many failures and don't want to restart from the beginning every time.
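If you do go with batches for that second reason, a minimal psycopg2 sketch (table, columns and DSN are made up) could look like this:

```python
import psycopg2

rows = [(i, f"value {i}") for i in range(7000)]  # stand-in for the real data
BATCH_SIZE = 1000

conn = psycopg2.connect("dbname=mydb")  # illustrative DSN
try:
    with conn.cursor() as cur:
        for offset in range(0, len(rows), BATCH_SIZE):
            batch = rows[offset:offset + BATCH_SIZE]
            cur.executemany("INSERT INTO my_table (id, val) VALUES (%s, %s)", batch)
            conn.commit()  # each batch of 1,000 succeeds or fails on its own
finally:
    conn.close()
```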
I want to migrate from sqlite3 to MySQL in Django. I have worked with Oracle and MS SQL Server, and I know I can catch the exception and retry over and over until the statement succeeds... However, these are inserts into the same table, and the data must be INSERTED right away, because users will not be happy waiting their turn to insert into the same table.
So I was wondering: will deadlocks happen on the table if too many users insert at the same time, and what should I do to avoid that, so that users don't notice it?
I don't think you can get a deadlock just from rapid insertions. Deadlock occurs when you have two processes that are each waiting for the other one to do something before they can make the change that the other one is waiting for. If two processes are just inserting, the database will simply process them in the order that they're received; there's no dependency between them.
If you're using InnoDB, it uses row-level locking. So unless two inserts both try to insert the same unique key, they shouldn't even lock each other out, they can be done concurrently.
We have a smallish (40k rows, thus far) transactional table with an index on a single column. This index is extremely valuable to us, as reads from the table tend to be quite frequent.
At certain times, multiple bulk INSERT statements are performed on this transactional table, quite often hundreds of mini bulk inserts (< 50 rows each) in, say, an hour or two. Then it might lie idle for a while. While each individual insert tends to work quite well, these concurrent INSERT statements tend to break (i.e. fail) after a while and won't work unless we restart the instance.
Is this because of the index? How can we work around that limitation? Is cursor.executemany preferable to cursor.execute in this case? Would sending these INSERT queries to a task queue make a difference?
Any help would be appreciated!
What's the symptom of failure? Any error messages?
Could you check the InnoDB monitor status when it happens? http://dev.mysql.com/doc/refman/5.5/en/innodb-monitors.html
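For reference, the monitor output can be pulled from Python with something along these lines (the driver here is just an assumption):

```python
import pymysql  # illustrative choice; any MySQL DB-API driver can run the statement

conn = pymysql.connect(host="localhost", user="app", password="secret", database="mydb")
try:
    with conn.cursor() as cur:
        # Returns one row with three columns: Type, Name, Status (Status is one large text blob).
        cur.execute("SHOW ENGINE INNODB STATUS")
        engine, name, status = cur.fetchone()
        print(status)  # look at the TRANSACTIONS and LATEST DETECTED DEADLOCK sections
finally:
    conn.close()
```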
I have run a few trials and there seems to be some improvement in speed if I set autocommit to False.
However, I am worried that if I do only one commit at the end of my code, the database rows will not be updated in the meantime. So, for example, if I do several updates to the database and none are committed, does querying the database then give me the old data? Or does it know it should commit first?
Or, am I completely mistaken as to what commit actually does?
Note: I'm using pyodbc and MySQL. Also, the tables I'm using are InnoDB; does that make a difference?
There are some situations which trigger an implicit commit. However, in most situations not committing means the data will be unavailable to other connections.
It also means that if another connection tries to perform an action that conflicts with an ongoing transaction (one that has locked that resource), the later request will have to wait for the lock to be released.
As for performance concerns, autocommit makes every change take effect immediately, and the performance hit is quite noticeable on big tables, since each commit also has to flush the changes (including the index and constraint work) durably to disk. If you only commit after a series of queries, that flush happens once per batch rather than once per statement.
On the other hand, committing too infrequently means the server has to keep more and more pending (undo) information around to maintain both the old and the new versions of the data, so there is a trade-off.
And yes, using InnoDB makes a difference. If you were using, for instance, MyISAM, you wouldn't have transactions at all, so every change would be permanent immediately (similar to autocommit=True). With MyISAM you can play with the delay-key-write option.
For more information about transactions have a look at the official documentation. For more tips about optimization have a look at this article.
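To make the trade-off concrete, here is a minimal sketch of committing in batches with pyodbc (connection string, table and data are placeholders):

```python
import pyodbc

rows = [(i, f"row {i}") for i in range(5000)]  # stand-in data
BATCH_SIZE = 500

conn = pyodbc.connect("DSN=mydb", autocommit=False)  # placeholder connection string
cur = conn.cursor()

for i, row in enumerate(rows, start=1):
    cur.execute("INSERT INTO my_table (a, b) VALUES (?, ?)", row)
    if i % BATCH_SIZE == 0:
        conn.commit()  # one commit (and its disk flush) per batch instead of per row

conn.commit()  # flush the final partial batch
conn.close()
```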
The default isolation level for InnoDB is REPEATABLE READ, so all reads within a transaction are consistent with a snapshot taken when the transaction first reads. A transaction always sees its own uncommitted changes, but other transactions will not see your newly inserted rows until you commit (and, under REPEATABLE READ, not even then if their snapshot was taken before your commit). If you want other sessions to pick up newly committed rows straight away, they can use the READ COMMITTED isolation level, where each statement gets a fresh snapshot.
As long as you use the same connection, the database will show you a consistent view of the data, i.e. with all changes made so far in this transaction.
Once you commit, the changes will be written to disk and be visible to other (new) transactions and connections.
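A small pyodbc sketch of that behaviour (DSN and table are made up):

```python
import pyodbc

# Placeholder DSN and table; the point is only when changes become visible.
writer = pyodbc.connect("DSN=mydb", autocommit=False)
reader = pyodbc.connect("DSN=mydb", autocommit=True)

wcur = writer.cursor()
wcur.execute("INSERT INTO logs (msg) VALUES ('hello')")

# Same connection/transaction: the uncommitted row is visible to itself.
wcur.execute("SELECT COUNT(*) FROM logs WHERE msg = 'hello'")
print(wcur.fetchone()[0])  # 1

# Another connection: uncommitted changes are not visible (InnoDB allows no dirty reads).
rcur = reader.cursor()
rcur.execute("SELECT COUNT(*) FROM logs WHERE msg = 'hello'")
print(rcur.fetchone()[0])  # 0

writer.commit()

# After the commit, a new statement on the other connection sees the row.
rcur.execute("SELECT COUNT(*) FROM logs WHERE msg = 'hello'")
print(rcur.fetchone()[0])  # 1
```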