Concurrent bulk insert statements on transactional table with index - python

We have a smallish (40k rows so far) transactional table with an index on a single column. This index is extremely valuable to us, as reads from the table are quite frequent.
At certain times, multiple bulk insert statements are performed against this table, often hundreds of mini bulk inserts (< 50 rows each) within an hour or two. Then it might lie idle for a while. While each individual insert tends to work quite well, these concurrent INSERT statements tend to break (i.e. fail) after a while and won't work again unless we restart the instance.
Is this because of the index? How can we work around that limitation? Is cursor.executemany preferable to cursor.execute in this case? Would sending these INSERT queries to a task queue make a difference?
Any help would be appreciated!
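Regarding executemany, here is a minimal sketch of one mini bulk insert done that way, assuming a DB-API driver such as mysql-connector-python and a hypothetical events table (not from the original post):

```python
# Minimal sketch of one mini bulk insert via cursor.executemany.
# Assumes mysql-connector-python; table and column names are hypothetical.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="appdb"
)
cur = conn.cursor()

rows = [(1, "a"), (2, "b"), (3, "c")]   # < 50 rows per batch, as described above
cur.executemany(
    "INSERT INTO events (event_id, payload) VALUES (%s, %s)",
    rows,
)
conn.commit()   # commit promptly so row/index locks are released quickly
cur.close()
conn.close()
```

With this driver, executemany typically rewrites a plain INSERT into a single multi-row statement, which usually reduces round trips compared with looping over cursor.execute.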

What's the symptom of failure? Any error messages?
Could you check the innodb monitor status when it happens? http://dev.mysql.com/doc/refman/5.5/en/innodb-monitors.html
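If it helps, a small sketch of pulling that monitor output from Python when the failures start (the SHOW ENGINE INNODB STATUS statement is standard MySQL; the helper itself is just an illustration):

```python
# Sketch: dump the InnoDB monitor output when an insert starts failing.
# Works with any DB-API connection to MySQL; the helper name is hypothetical.
def dump_innodb_status(conn):
    cur = conn.cursor()
    cur.execute("SHOW ENGINE INNODB STATUS")
    engine, name, status_text = cur.fetchone()   # single row: Type, Name, Status
    cur.close()
    return status_text
```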

Related

How to limit the number of rows in MySQL and remove the oldest row when the limit is exceeded

I am using a MySQL database via Python for storing logs.
I was wondering if there is any efficient way to remove the oldest row when the number of rows exceeds the limit.
I was able to do this by executing a query to find the total number of rows and then deleting the older ones by sorting them in ascending order and deleting. But this method takes too much time. Is there a way to make this more efficient by defining a rule when creating the table, so that MySQL itself takes care of it when the limit is exceeded?
Thanks in advance.
Well, there's no simple and built-in way to do this in MySQL.
Solutions that use triggers to delete old rows when you insert a new row are risky, because the trigger might fail. Or the transaction that spawned the trigger might be rolled back. In either of these cases, your intended deletion will not happen.
Also, putting the burden of deleting on the thread that inserts new data adds extra work to the insert request, and usually we'd prefer not to make things slower for our current users.
It's more common to run an asynchronous job periodically to delete older data. This can be scheduled to run at off-hours, and run in batches. It also gives more flexibility to archive old data, or execute retries if the deletion or archiving fails or is interrupted.
MySQL does support an EVENT system, so you can run a stored routine based on a schedule. But you can only do tasks you can do in a stored routine, and it's not easy to make it do retries, or archive to any external system (e.g. cloud archive), or notify you when it's done.
Sorry there is no simple solution. There are just too many variations on how people would like it to work, and too many edge cases of potential failure.
The way I'd implement this is to use cron or else a timer thread in my web service to check the database, say once per hour. If it finds the number of rows is greater than the limit, it deletes the oldest rows in modestly sized batches (e.g. 1000 rows at a time) until the count is under the threshold.
I like to write scheduled jobs in a way that can be easily controlled and monitored. So I can make it run immediately if I want, and I can disable or resume the schedule if I want, and I can view a progress report about how much it deleted the last time it ran, and how long until the next time it runs, etc.
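As an illustration of that hourly job, here is a rough sketch assuming a logs table with an auto-increment primary key id and mysql-connector-python (both assumptions, not from the question):

```python
# Sketch of the periodic cleanup job: delete the oldest rows in batches of 1000
# until the table is back under the limit. Table/column names are hypothetical.
import mysql.connector

ROW_LIMIT = 100_000
BATCH_SIZE = 1000

def trim_logs(conn):
    cur = conn.cursor()
    while True:
        cur.execute("SELECT COUNT(*) FROM logs")
        (count,) = cur.fetchone()
        if count <= ROW_LIMIT:
            break
        # Oldest rows have the smallest auto-increment ids.
        cur.execute("DELETE FROM logs ORDER BY id ASC LIMIT %s", (BATCH_SIZE,))
        conn.commit()   # commit each batch so locks are held only briefly
    cur.close()
```

Scheduling this once an hour with cron (or a timer thread) keeps the delete work off the insert path, as described above.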

Why do writes through Cassandra Python Driver add records with a delay?

I'm writing 2.5 million records into Cassandra using a Python program. The program finishes quickly, but when I query the data, the records only show up after a long time. The number of records gradually increases, and it seems like the database is performing the writes to the tables in a queued fashion. The writes continue until all the records are there. Why do the writes show up late?
It is customary to provide a minimal code example plus steps to replicate the issue but you haven't provided much information.
My guess is that you've issued a lot of asynchronous writes, which means those queries get queued up on the client side; that's how asynchronous programming works. Until they eventually reach the cluster and get processed, you won't be able to see the results immediately.
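For illustration, a sketch of what bounding and waiting on those asynchronous writes might look like with the DataStax cassandra-driver (keyspace, table, and data are hypothetical):

```python
# Sketch: bound the number of in-flight async writes and block on the futures,
# so the program only finishes once the cluster has acknowledged the writes.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")
insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

records = [f"payload-{i}" for i in range(10_000)]   # stand-in for the real data
futures = []
for i, payload in enumerate(records):
    futures.append(session.execute_async(insert, (i, payload)))
    if len(futures) >= 500:            # don't let the client queue grow unbounded
        for f in futures:
            f.result()                 # blocks until the write is acknowledged
        futures = []

for f in futures:
    f.result()
cluster.shutdown()
```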
In addition, you haven't provided information on how you're verifying the data so I'm going to make another guess and say you're doing a SELECT COUNT(*) which requires a full table scan in Cassandra. Given that you've issued millions of writes, chances are the nodes are overloaded and take a while to respond.
For what it's worth, if you are doing a COUNT() you might be interested in this post where I've explained why it's bad to do it in Cassandra -- https://community.datastax.com/questions/6897/. Cheers!

Strategies for renaming table and indices while select queries are running

I have a job which copies a large file into a table temp_a and also creates an index idx_temp_a_j on a column j. Once the job finishes copying all the data, I have to rename this table to prod_a, which is production-facing; queries are always running against it with very little idle time. But once I run the rename queries, both the incoming queries and the queries which are already running get backed up, producing high API error rates. I want to know what strategies I can use so that renaming the table happens with less downtime.
So far, below are the strategies I came up with:
First, just rename the table and allow queries to be backed up. This approach seems unreliable: the rename acquires an exclusive lock, all other queries get backed up behind it, and I see high API error rates.
Second, write a polling function which checks whether any queries are currently running; if so, wait, and if not, rename the table and index. This approach only queues up queries which arrive after the ALTER TABLE ... RENAME has placed its exclusive lock on the table. Once the renaming finishes, the queued-up queries get executed. I still need to find database APIs which will help me write this function.
What other strategies would allow this "seamless" renaming of the table? I am using PostgreSQL 11.4, and the job which does all this is written in Python.
You cannot avoid blocking concurrent queries while a table is renamed.
The operation itself is blazingly fast, so any delay you experience must be because the ALTER TABLE itself is blocked by long running transactions using the table. All later operations on the table then have to queue behind the ALTER TABLE.
The solution for painless renaming is to keep database transactions very short (which is always desirable, since it also reduces the danger of deadlocks).
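One mitigation consistent with that (an assumption on my part, not something from the answer above) is to attempt the rename with a short lock_timeout, so that instead of queuing every incoming query behind the ALTER TABLE, the rename simply fails fast and is retried when the table is less busy:

```python
# Sketch (assumption, not the answer's prescription): retry the rename with a
# short lock_timeout so it fails fast rather than queuing all queries behind it.
import time
import psycopg2

def rename_with_retry(dsn, attempts=20):
    conn = psycopg2.connect(dsn)
    try:
        for _ in range(attempts):
            try:
                with conn.cursor() as cur:
                    cur.execute("SET lock_timeout = '2s'")
                    cur.execute("ALTER TABLE prod_a RENAME TO old_a")
                    cur.execute("ALTER TABLE temp_a RENAME TO prod_a")
                conn.commit()
                return True
            except psycopg2.errors.LockNotAvailable:
                conn.rollback()
                time.sleep(1)   # back off and try again
        return False
    finally:
        conn.close()
```

The rename is still briefly exclusive, but long-running readers no longer cause an ever-growing queue behind it.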

psycopg2, should I commit in chunks?

I'm working on a project that uses psycopg2 to connect to a PostgreSQL database in Python, and in the project there is a place where all the modifications to the database are committed after performing a certain number of insertions.
So I have a basic question:
Is there any benefit to committing in smaller chunks, or is it the same as waiting to commit until the end?
For example, say I'm going to insert 7,000 rows of data: should I commit after inserting every 1,000, or just wait until all the data is added?
If there are problems with large commits, what are they? Could I possibly crash the database, or cause some sort of overflow?
Unlike in some other database systems, there is no problem in PostgreSQL with modifying arbitrarily many rows in a single transaction.
Usually it is better to do everything in a single transaction, so that the whole thing succeeds or fails, but there are two considerations:
If the transaction takes a long time, it will keep VACUUM from doing its job on the rest of the database for the duration of the transaction. That may cause table bloat if there is a lot of concurrent write activity.
It may be useful to do the operation in batches if you expect many failures and don't want to restart from the beginning every time.
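For concreteness, a sketch of the batched variant with psycopg2 (table and data are made up):

```python
# Sketch: insert 7,000 rows, committing every 1,000.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=app")
cur = conn.cursor()

rows_to_insert = [(i, i * 1.5) for i in range(7_000)]   # stand-in data
BATCH = 1000
for i, row in enumerate(rows_to_insert, start=1):
    cur.execute("INSERT INTO measurements (id, value) VALUES (%s, %s)", row)
    if i % BATCH == 0:
        conn.commit()   # each committed batch stays even if a later one fails
conn.commit()           # commit any remainder
cur.close()
conn.close()
```

Dropping the two inner commit lines gives the single-transaction version, which is usually preferable unless one of the two considerations above applies.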

How can I commit all pending queries until an exception occurs in a python connection object

I am using psycopg2 in python, but my question is DBMS agnostic (as long as the DBMS supports transactions):
I am writing a python program that inserts records into a database table. The number of records to be inserted is more than a million. When I wrote my code so that it ran a commit on each insert statement, my program was too slow. Hence, I altered my code to run a commit every 5000 records and the difference in speed was tremendous.
My problem is that at some point an exception occurs while inserting records (some integrity check fails), and I wish to commit my changes up to that point, except of course for the last command that caused the exception, and then continue with the rest of my insert statements.
I haven't found a way to achieve this; the only thing I've managed was to catch the exception, roll back my transaction and carry on from that point, but then I lose my pending insert statements. Moreover, I tried (deep)copying the cursor object and the connection object, without any luck either.
Is there a way to achieve this functionality, either directly or indirectly, without having to rollback and recreate/re-run my statements?
Thank you all in advance,
George.
I doubt you'll find a fast cross-database way to do this. You just have to optimize the balance between the speed gains from batch size and the speed costs of repeating work when an entry causes a batch to fail.
Some DBs can continue with a transaction after an error, but PostgreSQL can't. However, it does allow you to create subtransactions with the SAVEPOINT command. These are far from free, but they're lower cost than a full transaction. So what you can do is every (say) 100 rows, issue a SAVEPOINT and then release the prior savepoint. If you hit an error, ROLLBACK TO SAVEPOINT, commit, then pick up where you left off.
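A sketch of that savepoint pattern with psycopg2 (table name and batch handling are illustrative, not a drop-in implementation):

```python
# Sketch: insert in batches of 100 between savepoints; on an integrity error,
# roll back only the current batch, remember the offender, and redo the batch.
import psycopg2

def insert_with_savepoints(conn, rows, batch_size=100):
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        skip = set()
        while True:
            cur.execute("SAVEPOINT batch_sp")
            try:
                for i, row in enumerate(batch):
                    if i in skip:
                        continue
                    cur.execute("INSERT INTO items (id, name) VALUES (%s, %s)", row)
            except psycopg2.IntegrityError:
                cur.execute("ROLLBACK TO SAVEPOINT batch_sp")
                skip.add(i)          # skip the offending row and redo the batch
                continue
            cur.execute("RELEASE SAVEPOINT batch_sp")
            break
    conn.commit()   # or commit every N batches, as in the original code
```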
If you are committing your transactions at every 5000-record interval, it seems like you could do a little bit of preprocessing of your input data and break it out into a list of 5000-record chunks, i.e. [[row1_data, row2_data, ..., row5000_data], [row5001_data, ...], ..., [..., row1000000_data]].
Then run your inserts and keep track of which chunk you are processing as well as which record you are currently inserting. When you get the error, rerun the chunk but skip the offending record, as in the sketch below.
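A sketch of that chunked variant (psycopg2; table and data are hypothetical): commit per chunk, and on an error roll the chunk back, note the offending record, and rerun the chunk without it.

```python
# Sketch: commit per chunk; on failure, roll back the chunk, remember the
# offending record, and rerun the chunk skipping it.
import psycopg2

def load_chunks(conn, chunks):
    cur = conn.cursor()
    for chunk in chunks:
        bad = set()
        while True:
            try:
                for i, row in enumerate(chunk):
                    if i in bad:
                        continue
                    cur.execute("INSERT INTO records (id, name) VALUES (%s, %s)", row)
                conn.commit()
                break
            except psycopg2.IntegrityError:
                conn.rollback()   # the whole chunk is undone
                bad.add(i)        # rerun it without the offender
```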
