MySQL bulk insert taking a long time - Python

I'm using Python with MySQL Connector. I have nearly 67 million entries (14 GB) in a table. When I do a bulk insert of 2K rows at a time, it takes a very long time:
Inserted 2000 rows in 23 Seconds
Inserted 2000 rows in 25 Seconds
Inserted 2000 rows in 29 Seconds
Inserted 2000 rows in 28 Seconds
For another table (with less data), insertion speed is fine (2-4 seconds).
After wrapping the inserts in a transaction:
Inserted 2000 rows in 21 Seconds
Inserted 2000 rows in 20 Seconds
Inserted 2000 rows in 20 Seconds
Inserted 2000 rows in 18 Seconds
How can I improve the speed?
I'm using AWS RDS, Aurora MySQL version 5.7.12 (db.t3.medium), with CPU usage between 4% and 8%. My objective is to insert around 50K rows into a table that already has nearly 67 million rows (14 GB). The data must be inserted as soon as possible; this near-real-time data is very important for the client. The table has 18 columns:
id (PK, auto-increment), customer, serial_number, batch, data, and some others.
Indexes are on (customer, serial_number) to make the combination unique, on batch for searching, and on data (unique). All are BTREE indexes by default.
This insertion should take less than 1 minute for 50K rows, but it currently takes around 15 minutes. I've tried inserting into an empty table: 50K rows go in within just 5-7 seconds. As the number of rows in the table grows, the insertion time increases.
Will upgrading the MySQL version speed up the insertion process at all?
Is splitting or partitioning the table the last resort?
I cannot consolidate the data because every row is important, especially the last 2 years of data.
Please help.
My table schema already has default values for 8 columns, and these rows are never updated later, because the real-time data is very important for us.
There are not many read/write operations going on: roughly 2, or in some cases 3, selects per second according to the RDS monitor.
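For reference, this is roughly how I'm doing the batched insert (a minimal sketch; connection details and the table/column names are placeholders, not my real schema):
import mysql.connector
conn = mysql.connector.connect(host="...", user="...", password="...", database="mydb")
cur = conn.cursor()
sql = ("INSERT INTO readings (customer, serial_number, batch, data) "
       "VALUES (%s, %s, %s, %s)")
BATCH = 2000
try:
    for i in range(0, len(rows), BATCH):         # rows: the ~50K tuples to insert
        cur.executemany(sql, rows[i:i + BATCH])  # 2K rows per executemany call
    conn.commit()                                # single commit for the whole load
except mysql.connector.Error:
    conn.rollback()
    raise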

Not an expert on MySQL, but here are a few strategies you can try:
Partition the table (see the sketch after this list): https://dev.mysql.com/doc/refman/5.7/en/partitioning.html
Archive older data into separate tables if feasible. The smaller the index memory footprint, the faster the writes will be.
Use a bigger machine so that InnoDB has more memory and processing power.
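As a concrete illustration of the partitioning idea, a range partition by month might look roughly like this (a sketch only, with cur being a mysql.connector cursor; it assumes a DATETIME column such as created_at exists and, per MySQL's rules, that the partitioning column is part of every unique key, which would mean reworking the existing unique indexes):
cur.execute("""
    ALTER TABLE readings
    PARTITION BY RANGE (TO_DAYS(created_at)) (
        PARTITION p2021 VALUES LESS THAN (TO_DAYS('2022-01-01')),
        PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    )
""")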

I've had the same problem with the UPDATE command - some of the delay may be attributable to Python's intrinsic speed issues, but most is likely due to MySQL and general server latencies.
I've gone "serverless" using SQLite (local db, everything in "core") and it's improved performance.

Depending on what your goal is, there are several options you might consider. More information would ultimately be useful.
If you are simply looking to free up availability, you may consider using INSERT LOW_PRIORITY:
https://dev.mysql.com/doc/refman/5.7/en/insert.html
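For example (a sketch only; table and column names are hypothetical, and note that LOW_PRIORITY only affects storage engines that use table-level locking, such as MyISAM):
cur.executemany(
    "INSERT LOW_PRIORITY INTO readings (customer, serial_number, batch, data) "
    "VALUES (%s, %s, %s, %s)",
    batch_rows,   # the current chunk of tuples
)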
What type of database engine are you using?
What indexes do you have on the table? Unique indexes?
Is it possible to insert the rows with default values and run updates later asynchronously?
Are there a lot of write/read operations on that table happening at the same time?

Use the following in your my.cnf (or my.ini on Windows):
innodb_flush_neighbors=2 # to expedite reducing innodb_buffer_pool_pages_dirty ASAP
innodb_change_buffer_max_size=50 # to expedite insert capacity per second
See dba.stackexchange.com question 196715, Rolando's suggestion #2.
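On RDS/Aurora these are changed through a DB parameter group rather than an on-disk my.cnf. To check what the server is currently running with (a quick sketch, assuming a mysql.connector cursor named cur; whether Aurora honors these particular InnoDB settings is a separate question):
cur.execute("SHOW VARIABLES LIKE 'innodb_flush_neighbors'")
print(cur.fetchone())
cur.execute("SHOW VARIABLES LIKE 'innodb_change_buffer_max_size'")
print(cur.fetchone())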

Related

Turning fast_executemany on is even slower than when it is off

Using pyodbc, I wrote a Python program that extracts data from Oracle and loads it into SQL Server. The extraction from Oracle is instant, but some tables take a very long time to load, especially tables with many columns (100+), a few of which are VARCHAR(4000) (I am running pyodbc's executemany for the INSERT).
Turning on fast_executemany = True seems to make the INSERT even slower. When it is turned off, loading a table of 40k rows takes about 3 minutes; when it is turned on, loading the same number of rows takes about 15 minutes.
Not sure if this means anything, but I turned on SQL Profiler during each try and here is what I found: when fast_executemany is off, the backend does a bunch of "sp_prepexec" and "sp_unprepare" calls for each insert; when it is on, the backend does a single "sp_prepare" and then a bunch of "sp_execute" calls.
Any idea why fast_executemany is not speeding up the INSERT, and in fact makes it much slower?
Update: I was able to resolve my problem by limiting how many rows get inserted each time. I set the batch size for each INSERT operation to 1000 rows at a time, and now the same load of 40k rows takes about 40 seconds, compared to 15 minutes without a batch size.
I am guessing fast_executemany puts everything into memory before executing the INSERT, so if the column sizes are huge and there are many rows per operation, it puts a lot of burden on memory and hence gets much slower (?).
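A minimal sketch of the chunked approach that worked (connection string, table, and column names are placeholders):
import pyodbc
conn = pyodbc.connect("DSN=sqlserver_target")    # placeholder connection string
cur = conn.cursor()
cur.fast_executemany = True
sql = "INSERT INTO dbo.target_table (col1, col2) VALUES (?, ?)"
BATCH = 1000
for i in range(0, len(rows), BATCH):             # rows: data already fetched from Oracle
    cur.executemany(sql, rows[i:i + BATCH])      # keeps pyodbc's parameter buffers small
conn.commit()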

How to divide and iterate over an SQL query result in chunks in Python - will it decrease the execution time if I have millions of records?

I am using Impala as the database and am trying to fetch 25 million records, but it is taking too much time (approx. 40 minutes) when I execute the query from Python. I was thinking of dividing the entire query result into chunks and iterating over the chunks one by one. Is this going to boost the performance?
But how would I do that?
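One way to iterate in chunks without pulling all 25 million rows into memory at once is cursor.fetchmany (a sketch; it assumes the impyla driver, and the host, query, and chunk size are placeholders):
from impala.dbapi import connect
conn = connect(host="impala-host", port=21050)   # placeholder host/port
cur = conn.cursor()
cur.execute("SELECT * FROM big_table")           # placeholder query
CHUNK = 100000
while True:
    rows = cur.fetchmany(CHUNK)                  # fetch one chunk at a time
    if not rows:
        break
    process(rows)                                # placeholder for your own handling
Note that chunking the client-side fetch mainly bounds memory use; it is unlikely to shorten the time the query itself takes to run on Impala.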

MongoDB Update-Upsert Performance Barrier (Performance falls off a cliff)

I'm performing a repetitive update operation to add documents into my MongoDB as part of some performance evaluation. I've discovered a huge non-linearity in execution time based on the number of updates (w/ upserts) I'm performing:
Looping with the following command in Python...
collection.update({'timestamp': x}, {'$set': {'value1':y, v1 : y/2, v2 : y/4}}, upsert=True)
Gives me these results...
500 document upserts 2 seconds.
1000 document upserts 3 seconds.
2000 document upserts 3 seconds.
4000 document upserts 6 seconds.
8000 document upserts 14 seconds.
16000 document upserts 77 seconds.
32000 document upserts 280 seconds.
Notice how after 8k document updates the performance starts to rapidly degrade, and by 32k document updates we're seeing a 6x reduction in throughput. Why is this? It seems strange that "manually" running 4k document updates 8 times in a row would be 6x faster than having Python perform them all consecutively.
I've seen in mongostat that I'm getting a ridiculously high locked db ratio (>100%), and top shows >85% CPU usage while this is running. I've got an i7 processor with 4 cores available to the VM.
You should put an ascending index on your "timestamp" field:
collection.ensure_index("timestamp") # shorthand for single-key, ascending index
If this index should contain unique values:
collection.ensure_index("timestamp", unique=True)
Since the spec is not indexed and you are performing updates, the database has to check every document in the collection to see if any documents already exist with that spec. When you do this for 500 documents (in a blank collection), the effects are not so bad...but when you do it for 32k, it does something like this (in the worst case):
document 1 - assuming blank collection, definitely gets inserted
document 2 - check document 1, update or insert occurs
document 3 - check documents 1-2, update or insert occurs
...etc...
document 32000 - check documents 1-31999, update or insert
When you add the index, the database no longer has to check every document in the collection; instead, it can use the index to find any possible matches much more quickly using a B-tree cursor instead of a basic cursor.
You should compare the results of collection.find({"timestamp": x}).explain() with and without the index (note you may need to use the hint() method to force it to use the index). The critical factor is how many documents you have to iterate over (the "nscanned" result of explain()) versus how many documents match your query (the "n" key). If the db only has to scan exactly what matches or close to that, that is very efficient; if you scan 32000 items but only found 1 or a handful of matches, that is terribly inefficient, especially if the db has to do something like that for each and every upsert.
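For example, a quick comparison of the two plans might look like this (a sketch in the same legacy pymongo/explain format this answer refers to, where explain() returns n and nscanned; newer servers report a different explain structure):
plan = collection.find({"timestamp": x}).explain()
print(plan["n"], plan["nscanned"])               # matches vs. documents scanned (collection scan)
collection.ensure_index("timestamp")
plan = collection.find({"timestamp": x}).hint([("timestamp", 1)]).explain()
print(plan["n"], plan["nscanned"])               # with the index these should be nearly equal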
A notable wrinkle for you to double-check: since you have not set multi=True in your update call, an update operation that finds a matching document will update just that one and not continue to check the entire collection.
Sorry for the link spam, but these are all must-reads:
http://docs.mongodb.org/manual/core/indexes/
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.ensure_index
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update
http://docs.mongodb.org/manual/reference/method/cursor.explain/

Optimizing Sqlite3 for 20,000+ Updates

I have lists of about 20,000 items that I want to insert into a table (with about 50,000 rows in it). Most of these items update certain fields in existing rows and a minority will insert entirely new rows.
I am accessing the database twice for each item. First is a select query that checks whether the row exists. Next I insert or update a row depending on the result of the select query. I commit each transaction right after the update/insert.
For the first few thousand entries, I am getting through about 3 or 4 items per second, then it starts to slow down. By the end it takes more than 1/2 second for each iteration. Why might it be slowing down?
My average times are: 0.5 seconds per iteration, divided up as 0.18 s for the select query and 0.31 s for the insert/update. The remaining 0.01 s is due to a couple of unmeasured processes to do with parsing the data before it goes into the database.
Update
I've commented out all the commits as a test and got no change, so that's not it (any more thoughts on optimal committing would be welcome, though).
As to table structure:
Each row has twenty columns. The first four are TEXT fields (all set with the first insert) and the remaining 16 are REAL fields, one of which is populated by the initial insert statement.
Over time the 'outstanding' REAL fields will be populated with the process I'm trying to optimize here.
I don't have an explicit index, though one of the fields is a unique key for each row.
I should note that as the database has gotten larger both the SELECT and UPDATE queries have taken more and more time, with a particularly remarkable deterioration in performance in the SELECT operation.
I initially thought this might be some kind of structural problem with SQLite (whatever that means), but haven't been able to find any documentation anywhere that suggests there are natural limits to the program.
The database is about 60 MB now.
I think your bottleneck is that you commit after each insert/update:
I commit each transaction right after the update/insert.
Either stop doing that, or at least switch to WAL journaling; see this answer of mine for why:
SQL Server CE 4.0 performance comparison
If you have a primary key you can optimize out the select by using the ON CONFLICT clause with INSERT INTO:
http://www.sqlite.org/lang_conflict.html
EDIT: Earlier I meant to write "if you have a primary key" rather than "foreign key"; I fixed it.
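A minimal sketch combining the two suggestions (WAL plus a single commit for the whole batch); here the per-item SELECT is replaced by checking UPDATE's rowcount rather than an ON CONFLICT clause, since the updates touch only some columns, and the table/column names are hypothetical:
import sqlite3
conn = sqlite3.connect("mydata.db")
conn.execute("PRAGMA journal_mode=WAL")                    # WAL journaling, as suggested above
cur = conn.cursor()
with conn:                                                 # one transaction for all 20,000 items
    for item_key, value in items:                          # hypothetical shape of the incoming items
        cur.execute("UPDATE items SET r5 = ? WHERE item_key = ?", (value, item_key))
        if cur.rowcount == 0:                              # no existing row, so insert a new one
            cur.execute("INSERT INTO items (item_key, r5) VALUES (?, ?)", (item_key, value))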
Edit: shame on me. I misread the question and somehow understood this was for MySQL rather than SQLite... Oops.
Please disregard this response, other than for generic ideas about updating DBMSes. The likely solution to the OP's problem is the overly frequent commits, as pointed out in sixfeetsix's response.
A plausible explanation is that the table gets fragmented.
You can verify this by defragmenting the table every so often and checking whether performance returns to the 3-or-4-items-per-second rate. (Which, by the way, is a priori relatively slow, though it may depend on hardware, data schema and other specifics.) Of course, you'll need to consider the amount of time defragmentation takes and balance it against the time lost to the slow update rate to find an optimal defragmentation frequency.
If the slowdown is effectively caused, at least in part, by fragmentation, you may also look into performing the updates in a particular order. It is hard to be more specific without knowing details of the overall schema and the statistical profile of the data, but fragmentation is indeed sensitive to the order in which various changes to the database take place.
A final suggestion to boost overall update performance is (if possible) to drop a few indexes on the table, perform the updates, and then recreate the indexes. This counter-intuitive approach works for relatively big updates because the cost of re-creating the indexes is often less than the cumulative cost of maintaining them as the update progresses.

Insert performance with Cassandra

Sorry in advance for my English.
I am a beginner with Cassandra and its data model. I am trying to insert one million rows into a local Cassandra database on a single node. Each row has 10 columns, and I insert them into only one column family.
With one thread, that operation takes around 3 minutes. I would like to do the same with 2 million rows while keeping a good time, so I tried 2 threads to insert 2 million rows, expecting a similar result of around 3-4 minutes. But I got a result of about 7 minutes... twice the first result. From what I've read on various forums, multithreading is recommended to improve performance.
That is why I am asking this question: is it useful to use multithreading to insert data into a local node (client and server on the same computer), into only one column family?
Some information:
- I use pycassa
- I have separated the commitlog directory and data directory onto different disks
- I use batch insert for each thread
- Consistency level: ONE
- Replication factor: 1
It's possible you're hitting the Python GIL, but more likely you're doing something wrong.
For instance, putting 2M rows in a single batch would be Doing It Wrong.
Try running multiple clients in multiple processes, NOT threads.
Then experiment with different insert sizes.
1M inserts in 3 mins is about 5500 inserts/sec, which is pretty good for a single local client. On a multi-core machine you should be able to get several times this amount provided that you use multiple clients, probably inserting small batches of rows, or individual rows.
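A rough sketch of the multiple-process approach with pycassa (keyspace, column family, and batch size are assumptions; each worker process opens its own connection pool):
from multiprocessing import Pool
import pycassa
def insert_chunk(rows):
    # every worker gets its own connection pool and column family handle
    pool = pycassa.ConnectionPool('MyKeyspace')            # hypothetical keyspace
    cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')      # hypothetical column family
    b = cf.batch(queue_size=100)                           # small batches, not one huge batch
    for key, columns in rows:                              # columns: dict of the 10 column values
        b.insert(key, columns)
    b.send()                                               # flush whatever is left in the batch
    pool.dispose()
if __name__ == '__main__':
    # sample data: 100k rows of 10 columns each, purely for illustration
    rows = [(str(i), dict(('col%d' % c, 'value') for c in range(10))) for i in range(100000)]
    chunks = [rows[i::4] for i in range(4)]                # split the work across 4 processes
    Pool(processes=4).map(insert_chunk, chunks)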
You might consider Redis. Its single-node throughput is supposed to be faster. It's different from Cassandra though, so whether or not it's an appropriate option would depend on your use case.
The time taken doubled because you inserted twice as much data. Is it possible that you are I/O bound?
