I'm performing a repetitive update operation to add documents into my MongoDB as part of some performance evaluation. I've discovered a huge non-linearity in execution time based on the number of updates (w/ upserts) I'm performing:
Looping with the following command in Python...
collection.update({'timestamp': x}, {'$set': {'value1': y, 'v1': y/2, 'v2': y/4}}, upsert=True)
Gives me these results...
500 document upserts: 2 seconds
1000 document upserts: 3 seconds
2000 document upserts: 3 seconds
4000 document upserts: 6 seconds
8000 document upserts: 14 seconds
16000 document upserts: 77 seconds
32000 document upserts: 280 seconds
Notice how after 8k document updates the performance starts to rapidly degrade, and by 32k document updates we're seeing a 6x reduction in throughput. Why is this? It seems strange that "manually" running 4k document updates 8 times in a row would be 6x faster than having Python perform them all consecutively.
I've noticed in mongostat that I'm getting a ridiculously high locked db ratio (>100%), and
top shows >85% CPU usage while this is running. I've got an i7 processor with 4 cores available to the VM.
You should put an ascending index on your "timestamp" field:
collection.ensure_index("timestamp") # shorthand for single-key, ascending index
If this index should contain unique values:
collection.ensure_index("timestamp", unique=True)
Since the spec is not indexed and you are performing updates, the database has to check every document in the collection to see if any documents already exist with that spec. When you do this for 500 documents (in a blank collection), the effects are not so bad...but when you do it for 32k, it does something like this (in the worst case):
document 1 - assuming blank collection, definitely gets inserted
document 2 - check document 1, update or insert occurs
document 3 - check documents 1-2, update or insert occurs
...etc...
document 32000 - check documents 1-31999, update or insert
When you add the index, the database no longer has to check every document in the collection; instead, it can use the index to find any possible matches much more quickly using a B-tree cursor instead of a basic cursor.
You should compare the results of collection.find({"timestamp": x}).explain() with and without the index (note you may need to use the hint() method to force it to use the index). The critical factor is how many documents the database has to iterate over (the "nscanned" value of explain()) versus how many documents match your query (the "n" value). If the db only has to scan what matches, or close to that, it is very efficient; if it scans 32000 documents but finds only one or a handful of matches, that is terribly inefficient, especially if it has to do that for each and every upsert.
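For example, a rough before/after comparison might look like this (a sketch only: x is just a sample value here, and the explain() keys shown are from older MongoDB versions that still report "nscanned" and "n" at the top level; newer versions nest these under executionStats):

x = 1364343434  # example timestamp value

before = collection.find({'timestamp': x}).explain()
print(before.get('nscanned'), before.get('n'))   # expect nscanned >> n without an index

collection.ensure_index('timestamp')             # build the ascending index

after = collection.find({'timestamp': x}).explain()
print(after.get('nscanned'), after.get('n'))     # expect nscanned close to n with the index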
A notable wrinkle to double-check: since you have not set multi=True in your update call, if the update finds a matching document it will update just that one and not continue scanning the entire collection.
Sorry for the link spam, but these are all must-reads:
http://docs.mongodb.org/manual/core/indexes/
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.ensure_index
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update
http://docs.mongodb.org/manual/reference/method/cursor.explain/
Related
I'm using Python with the MySQL connector. I have nearly 67 million (14 GB) rows in a table. When I do a bulk insert of 2K rows at a time, it takes very long to insert:
Inserted 2000 rows in 23 Seconds
Inserted 2000 rows in 25 Seconds
Inserted 2000 rows in 29 Seconds
Inserted 2000 rows in 28 Seconds
For another table (with less data), insertion speed is fine (2-4 seconds).
After using the transaction:
Inserted 2000 rows in 21 Seconds
Inserted 2000 rows in 20 Seconds
Inserted 2000 rows in 20 Seconds
Inserted 2000 rows in 18 Seconds
How can I improve the speed?
I'm using AWS RDS Aurora MySQL 5.7.12 (db.t3.medium) with CPU usage between 4% and 8%. My objective is to insert around 50K rows into a table that already holds nearly 67 million rows (14 GB). The data must be inserted as soon as possible; this near-real-time data is very important for the client. The table has 18 columns:
id(PK auto-increment), customer, serial_number, batch, data, and some others.
Indexes are on (customer, serial_number) to make the combination unique, on batch for searching, and on data (unique). All are BTREE indexes by default.
This insertion should take less than 1 minute for 50K rows, but it currently takes around 15 minutes. I've tried inserting into an empty table: 50K rows go in within just 5-7 seconds. As the number of rows in the table grows, the insertion time increases.
Would upgrading the MySQL version speed up the insertion process at all?
Is splitting or partitioning the table the last resort?
I cannot consolidate the data because every record is important, especially the last 2 years of data.
Please help.
My table schema already has default values in 8 columns, and this data is never updated later because real-time data is very important for us.
There are not many read/write operations going on: only 2, or in some cases 3, selects per second according to the RDS monitor.
Not an expert on MySQL, but here are a few strategies you can try:
Partitioning the table. https://dev.mysql.com/doc/refman/5.7/en/partitioning.html
Archive older data into separate tables if feasible. The smaller the index memory footprint, the faster the writes will be.
Use a bigger machine so that InnoDB has more memory and processing power.
I've had the same problem with the UPDATE command - some of the delay may be attributable to Python's intrinsic speed issues, but most is likely due to MySQL and general server latencies.
I've gone "serverless" using SQLite (local db, everything in "core") and it's improved performance.
Depending on what your goal is, there are several options you might consider. More information would ultimately be useful.
If you are simply looking to free up availability, you might consider using INSERT LOW_PRIORITY:
https://dev.mysql.com/doc/refman/5.7/en/insert.html
What type of database engine are you using?
What indexes do you have on the table? Unique indexes?
Is it possible to insert the rows with default values and run updates later asynchronously?
Are there a lot of write/read operations on that table happening at the same time?
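If batching from Python turns out to be the answer to that last question, one common pattern is to send the rows as multi-row statements inside a single transaction rather than one statement and commit per row. A rough sketch, shown with pymysql; the table name, column names, connection parameters, and sample rows below are placeholders, not the actual schema:

import pymysql

conn = pymysql.connect(host='...', user='...', password='...', database='...')

# example rows only; the real load would be the ~50K (customer, serial_number, batch, data) tuples
rows = [
    ('cust-1', 'SN-0001', 'batch-A', 'payload-1'),
    ('cust-1', 'SN-0002', 'batch-A', 'payload-2'),
]

sql = ('INSERT INTO my_table (customer, serial_number, batch, data) '
       'VALUES (%s, %s, %s, %s)')

try:
    with conn.cursor() as cur:
        for i in range(0, len(rows), 2000):
            # executemany lets the driver send each 2000-row slice as a multi-row INSERT
            cur.executemany(sql, rows[i:i + 2000])
    conn.commit()   # one commit for the whole load instead of one per batch
finally:
    conn.close()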
Use in your my.cnf (or my.ini on Windows):
innodb_flush_neighbors=2 # to expedite reducing innodb_buffer_pool_pages_dirty ASAP
innodb_change_buffer_max_size=50 # to expedite insert capacity per second
See dba.stackexchange.com question 196715, Rolando's suggestion #2.
I had an INSERT query that got its values from a SELECT statement. But since the SELECT returns millions of records, it put too much load on the MySQL server. So we decided to break the SELECT query into parts and execute it with a LIMIT clause:
INSERT INTO target_table
SELECT * FROM source_table
WHERE my_condition = value
...
LIMIT <start>, <end>
We will keep increasing start and end values until SELECT returns 0 rows. I'm also thinking of making this multi-threaded.
How can I do it with PyMySQL?
Do I need to execute the SELECT, get the results and then generate the INSERT?
First of all, to answer your question: in PyMySQL, you get that value as the result of cursor.execute:
execute(query, args=None)
Execute a query
Parameters:
query (str) – Query to execute.
args (tuple, list or dict) – parameters used with query. (optional)
Returns: Number of affected rows
So you could just execute your query repeatedly until you get back a value less than your selected range.
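A rough sketch of that loop (table names, the condition, the connection parameters, and the id ordering column are placeholders; ordering by the primary key matters, as discussed further below):

import pymysql

conn = pymysql.connect(host='...', user='...', password='...', database='...')
value = 42          # example value for my_condition
batch_size = 10000
offset = 0

with conn.cursor() as cur:
    while True:
        # cursor.execute returns the number of affected rows
        affected = cur.execute(
            'INSERT INTO target_table '
            'SELECT * FROM source_table '
            'WHERE my_condition = %s '
            'ORDER BY id '
            'LIMIT %s, %s',
            (value, offset, batch_size),
        )
        conn.commit()
        if affected < batch_size:   # fewer rows than the batch size means we are done
            break
        offset += batch_size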
Anyway, please consider:
the first thing you should check is whether you can optimize your select (assuming it's not as simple as in your example), e.g. by adding indexes. You may also want to test the difference between just selecting and actually inserting, to get a rough idea of which part is more relevant.
if the insertion is causing the problem, it can be due to the size of the transaction. In that case, splitting it up will only reduce the problems if you can also split up the transaction (although since you consider executing queries in parallel, this doesn't seem to be a concern)
if a query generates too much (CPU) load, running multiple instances of that query in parallel can, at best, only spread it over multiple cores, which will actually reduce the available CPU time for other queries. If the "load" is related to I/O, limited resources, or "general responsiveness", splitting can help, e.g. a small query might generate a small temporary table in memory while a big query generates a big temporary table on disk (although specifically with offset, this is unlikely, see below). Otherwise, you would usually need to add small pauses between (small enough) parts that you run consecutively, to spread the same workload over a longer time.
limit only makes sense if you have an order by (probably by the primary key), otherwise, in successive runs, the m-th row can be a different row than before (because the order is not fixed). This may or may not increase the load (and resource requirements) depending on your indexes and your where-condition.
the same is true for updates to your source table, as if you add or remove a row from the resultset (e.g. changing the value of my_condition of the first row), all successive offsets will shift, and you may skip a row or get a row twice. You will probably need to lock the rows, which might prevent running your queries in parallel (as they lock the same rows), and also might influence the decision if you can split the transaction (see 2nd bullet point).
using an offset requires MySQL to first find and then skip rows. So if you split the query in n parts, the first row will need to be processed n times (and the last row usually once), so the total work (for selecting) will be increased by (n^2-n)/2. So especially if selecting the rows is the most relevant part (see 1st bullet point), this can actually make your situation much worse: just the last run will need to find the same amount of rows as your current query (although it throws most of them away), and might even need more resources for it depending on the effect of order by.
You may be able to get around some of the offset-problems by using the primary key in the condition, e.g. have a loop that contains something like this:
select max(id) as new_max from (
    select id from <your table>
    where id > last_id and <your condition>
    order by id limit 1000   -- no offset!
) as next_batch
Exit the loop if new_max is null, otherwise do the insert:
insert ... select ...
where id > last_id and id <= new_max and <your condition>
Then set last_id = new_max and continue the loop.
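In PyMySQL that loop might look roughly like this (a sketch; it reuses the conn and value placeholders from the earlier sketch, and the max(id) of the next batch is taken via a subquery):

last_id = 0
with conn.cursor() as cur:
    while True:
        cur.execute(
            'SELECT MAX(id) FROM ('
            '  SELECT id FROM source_table'
            '  WHERE id > %s AND my_condition = %s'
            '  ORDER BY id LIMIT 1000'
            ') AS next_batch',
            (last_id, value),
        )
        new_max = cur.fetchone()[0]
        if new_max is None:         # nothing left to copy: exit the loop
            break
        cur.execute(
            'INSERT INTO target_table '
            'SELECT * FROM source_table '
            'WHERE id > %s AND id <= %s AND my_condition = %s',
            (last_id, new_max, value),
        )
        conn.commit()
        last_id = new_max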
It doubles the number of queries, as in contrast to limit with an offset, you need to know the actual id. It still requires your primary key and your where-condition to be compatible (so you may need to add an index that fits). If your search condition finds a significant percentage (more than about 15% or 20%) of your source table, using the primary key might be the best execution plan anyway though.
If you want to parallelize this (depending on your transaction requirements and on whether it is potentially worthwhile, see above), you could first get the maximum value of the primary key (select max(id) as max_id from ...), and give each thread a range to work with. E.g. for max_id=3000 and 3 threads, start them with one of (0..1000), (1001..2000), (2001..3000) and include that range in the first query:
select max(id) as new_max from (
    select id from <your table>
    where id > last_id
      and id >= $threadmin_id and id <= $threadmax_id
      and <your condition>
    order by id limit 1000
) as next_batch
Whether those ranges contain equal numbers of rows depends on your data distribution (you may find better ranges for your situation; calculating exact ranges would require executing the query though, so you probably can't be exact).
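For illustration, computing the ranges and handing them to threads could look something like this (a sketch; copy_range would wrap the keyset loop above restricted to its id range, and each thread needs its own connection):

import threading

def copy_range(threadmin_id, threadmax_id):
    # open a dedicated connection here, initialize last_id = threadmin_id, and run
    # the keyset loop above with "AND id <= threadmax_id" added to both queries;
    # the loop's "id > last_id" then makes the lower bound exclusive
    ...

with conn.cursor() as cur:
    cur.execute('SELECT MAX(id) FROM source_table')
    max_id = cur.fetchone()[0] or 0

n_threads = 3
step = (max_id + n_threads - 1) // n_threads            # ceiling division so the last range reaches max_id
workers = [
    threading.Thread(target=copy_range, args=(i * step, min((i + 1) * step, max_id)))
    for i in range(n_threads)
]
for w in workers:
    w.start()
for w in workers:
    w.join()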
I have 2 sets of indexes, indexes A_* and indexes B_*. I need to create a parent-child structure in B_*. B_* already contains the parent documents, and A_* contains the child documents. So, essentially, I need to copy the child documents from A_* into B_* with some logic in the middle that matches child documents to parent documents based on matching on several fields that serve as a unique key.
A_* contains about 40 indexes with document counts ranging between 100-250 million. Each index is between 100-500 GB. B_* contains 16 indexes with 15 million documents each and of size 20 GB each.
I have tried to do this via a python script, with the main logic being the following:
# scroll through the source index in batches of 4000, keeping the scroll context alive for 5 minutes
doc_chunk = helpers.scan(self.es, index=some_index_from_A, size=4000, scroll='5m')
# generator mapping each child document to a bulk action that targets its parent
actions = self.doc_iterator(doc_chunk)
# parallel_bulk returns a lazy generator; deque() just drains it so the requests are actually sent
deque(helpers.parallel_bulk(self.es, actions, chunk_size=1000, thread_count=4))
The function doc_iterator scrolls through the iterator returned by helpers.scan and, based on values of certain fields in a given child document, determines the id of that document's parent. For each document, it yields indexing actions that index the child documents under the appropriate parent in B_*.
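(For reference, the general shape of such a generator is sketched below; the field names, target index, and parent-id derivation are placeholders since the real matching logic isn't shown here. On ES 5.x the parent id goes into the _parent meta-field of each bulk action.)

def doc_iterator(self, docs):
    # docs is the generator returned by helpers.scan over an A_* index
    for hit in docs:
        src = hit['_source']
        # hypothetical: derive the parent id from the fields that form the unique key
        parent_id = '{}|{}'.format(src['key_field_1'], src['key_field_2'])
        yield {
            '_op_type': 'index',
            '_index': 'some_index_from_B',   # placeholder target index in B_*
            '_type': 'child_type',           # the child mapping type
            '_id': hit['_id'],
            '_parent': parent_id,            # routes the child onto its parent's shard
            '_source': src,
        }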
I've tried several different approaches to create this parent-child index, but nothing seems to work:
Running the script in parallel using xargs results in BulkIndexingErrors and leads to, at most, only 1/3 of the corpus being indexed. If this worked, it would be the ideal approach as it would cut down this whole process to 2-4 days.
Running the python script in 1 process doesn't result in BulkIndexingErrors, but it only indexes about 22-28 million documents, at which point a read timeout occurs, and the whole process just hangs indefinitely. This is the less ideal approach as in the best case it would take 7-8 days to finish. During one of my attempts to run it this way, I was monitoring the cluster in Kibana and noticed that searches had spiked to 30,000 documents/second, after which they immediately plummeted to 0 and never picked up afterwards. Indexing tapered off at that point.
I have tried different values for scan size, chunk size, and thread count. I get the fastest performance for 1 process with scan size of 6000, chunk size of 1000, and thread count of 6, but I also noticed the aforementioned read spike with this setting, so it seems like I may be reading too much. Taking it down to a scan size of 4000 still resulted in the read timeouts (I was unable to monitor the search rate at that setting).
Some more details:
ES version: 5.2.1
Nodes: 6
Primary shards: 956
Replicas: 76
I currently need to run the script from a different server from the one where ES is running.
I need to find a way to finish the parent-child index in as few days as possible. Any tips to fix the problems with my aforementioned attempts would help, and new ideas are also welcome.
There are around 3 million arrays - or Python lists/tuples (it does not really matter). Each array consists of the following elements:
['string1', 'string2', 'string3', ...]  # 10,000 elements in total
These arrays should be stored in some kind of key-value storage. For a simple explanation, let's assume it's a Python dict.
So, 3 million keys, each representing a 10000-element array.
Lists/tuples or any other custom type - it doesn't really matter. What matters is that the arrays consist of strings - utf8 or unicode strings, from 5 to about 50 chars each. There are about 3 million possible strings as well. It is possible to replace them with integers if really needed, but for more efficient further operations I would prefer to have strings.
Though it's hard to give you a full description of the data (it's complicated and odd), it's something similar to synonyms - let's assume we have 3 million words - the dict keys - and 10k synonyms for each word - the elements of the list.
Like that (not real synonyms but it will give you the idea):
{
'computer': ['pc', 'mac', 'laptop', ...], # (10k totally)
'house': ['building', 'hut', 'inn', ...], # (another 10k)
...
}
Elements - 'synonyms' - can be sorted if it's needed.
Later, after the arrays are populated, there's a loop: we go through all the keys and check whether some variable is in its value. For example, a user inputs the words 'computer' and 'laptop' - and we must quickly reply whether the word 'laptop' is a synonym of the word 'computer'. The issue here is that we have to check this millions of times, probably 20 million or so. Just imagine we have a lot of users entering random words - 'computer' and 'car', 'phone' and 'building', etc. They may 'match', or they may not.
So, in short - what I need is to:
store these data structures memory-efficiently,
be able to quickly check if some item is in array.
I should be able to keep memory usage below 30GB. Also I should be able to perform all the iterations in less than 10 hours on a Xeon CPU.
It's OK to have around 0.1% of false answers - both positive and negative - though it would be better to reduce them or avoid them entirely.
What is the best approach here? Algorithms, links to code, anything is really appreciated. Also, a friend of mine suggested using Bloom filters or marisa tries here - is he right? I haven't worked with either of them.
I would map each unique string to a numeric ID, then associate with each key a Bloom filter using around 20 bits per element for your <0.1% error rate. 20 bits * 10000 elements * 3 million keys is 75 GB, so if you are space-limited, store a smaller, less accurate filter in memory and the more accurate filter on disk, which is consulted only if the first filter says the item might be a match.
There are alternatives, but they will only reduce the size from 1.44·n·log₂(1/ε) to n·log₂(1/ε) bits per key. In your case ε = 0.001, so the theoretical limit is a data structure of 99658 bits per key, or about 10 bits per element, which would be 298,974,000,000 bits in total, or roughly 38 GB.
So your 30 GB budget is below the theoretical limit for a data structure with the error rate and number of entries you require, but it is in the ballpark.
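To make the idea concrete, here is a minimal hand-rolled per-key filter (illustrative only: strings are hashed directly rather than first mapped to numeric IDs, and the sizing and hashing scheme are not tuned):

import hashlib

class BloomFilter:
    def __init__(self, n_items, bits_per_item=20, n_hashes=14):
        self.size = n_items * bits_per_item            # ~20 bits/item gives roughly 0.1% false positives
        self.n_hashes = n_hashes
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, item):
        # derive k bit positions from slices of one SHA-512 digest (simple, not optimal)
        digest = hashlib.sha512(item.encode('utf-8')).digest()
        for i in range(self.n_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# one filter per key word, e.g. for 'computer':
computer_synonyms = BloomFilter(10000)
for word in ('pc', 'mac', 'laptop'):
    computer_synonyms.add(word)
print('laptop' in computer_synonyms)   # True; rare false positives, never false negatives
print('car' in computer_synonyms)      # almost certainly False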
Why do you want to maintain your own in-memory data structure? Why not use a regular database for this purpose? If that is too slow, why not use an in-memory database? One solution is to use in-memory sqlite3. Check this SO link, for example: Fast relational Database for simple use with Python
You create the in-memory database by passing ':memory:' to the connect method:
import sqlite3
conn = sqlite3.connect(':memory:')
What will your schema be? I can think of a wide schema, with a string as the id key (e.g. 'computer', 'house' in your example) and about 10000 additional columns ('field1' to 'field10000'), one for each element of your array. Once you construct the schema, iteratively inserting your data into the database is simple: one SQL statement per row of your data. And from your description, the insert part is one-time-only; there are no further modifications to the database.
The biggest question is retrieval (more crucially, speed of retrieval). Retrieving the entire array for a single key like 'computer' is again a simple SQL statement. The scalability and speed are something I can't predict; you will have to experiment. There is still hope that an in-memory database will speed up the retrieval part. Yet I believe this is the cheapest and fastest solution you can implement and test (much cheaper than a multi-node cluster).
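As a scaled-down illustration of that wide schema (5 synonym columns instead of 10000 so it stays readable; the real table would be generated the same way, and note that SQLite's default column limit is 2000, so the full 10000-column version would need a raised compile-time limit or a different layout):

import sqlite3

N_FIELDS = 5                                           # the real schema would use 10000
field_names = ['field{}'.format(i) for i in range(1, N_FIELDS + 1)]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE synonyms (word TEXT PRIMARY KEY, {})'.format(
    ', '.join('{} TEXT'.format(f) for f in field_names)))

# one INSERT statement per row of the source data
row = ('computer', 'pc', 'mac', 'laptop', 'desktop', 'workstation')
conn.execute('INSERT INTO synonyms VALUES ({})'.format(
    ', '.join(['?'] * (N_FIELDS + 1))), row)

# retrieval: the whole array for one key is a single SELECT
cur = conn.execute('SELECT * FROM synonyms WHERE word = ?', ('computer',))
print(cur.fetchone())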
Why am I suggesting this solution? Because the setup you have in mind is extremely similar to that of a fast-growing, database-backed internet startup. All good startups handle a similar number of requests per day using some sort of database with caching. (Caching would be the next thing to look at for your problem if a simple database doesn't scale to millions of requests; again, it is much easier and cheaper than buying RAM/nodes.)
I have lists of about 20,000 items that I want to insert into a table (with about 50,000 rows in it). Most of these items update certain fields in existing rows and a minority will insert entirely new rows.
I am accessing the database twice for each item. First is a select query that checks whether the row exists. Next I insert or update a row depending on the result of the select query. I commit each transaction right after the update/insert.
For the first few thousand entries, I am getting through about 3 or 4 items per second, then it starts to slow down. By the end it takes more than 1/2 second for each iteration. Why might it be slowing down?
My average times are 0.5 seconds per iteration, divided up as 0.18s for the select query and 0.31s for the insert/update. The remaining 0.01s comes from a couple of unmeasured processes to do with parsing the data before entering it into the database.
Update
I've commented out all the commits as a test and got no change, so that's not it (any more thoughts on optimal committing would be welcome, though).
As to table structure:
Each row has twenty columns. The first four are TEXT fields (all set with the first insert) and the remaining 16 are REAL fields, one of which is populated by the initial insert statement.
Over time the 'outstanding' REAL fields will be populated by the process I'm trying to optimize here.
I don't have an explicit index, though one of the fields is a unique key for each row.
I should note that as the database has gotten larger both the SELECT and UPDATE queries have taken more and more time, with a particularly remarkable deterioration in performance in the SELECT operation.
I initially thought this might be some kind of structural problem with SQLite (whatever that means), but haven't been able to find any documentation anywhere that suggests there are natural limits to the program.
The database is about 60 MB now.
I think your bottleneck is that you commit with each insert/update:
I commit each transaction right after the update/insert.
Either stop doing that, or at least switch to WAL journaling; see this answer of mine for why:
SQL Server CE 4.0 performance comparison
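Concretely, in Python's sqlite3 that comes down to something like this (a sketch; the table, columns, and sample data here are placeholders):

import sqlite3

conn = sqlite3.connect('mydata.db')
conn.execute('PRAGMA journal_mode=WAL')    # persistent per-database setting; only needs to run once
conn.execute('CREATE TABLE IF NOT EXISTS mytable (keyfield TEXT PRIMARY KEY, value1 REAL)')

items = [('k1', 1.0), ('k2', 2.0)]         # example data; the real list has ~20,000 entries

with conn:                                 # one transaction for the whole batch, committed on exit
    for key, value in items:
        conn.execute('UPDATE mytable SET value1 = ? WHERE keyfield = ?', (value, key))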
If you have a primary key, you can optimize out the select by using the ON CONFLICT clause with INSERT INTO:
http://www.sqlite.org/lang_conflict.html
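Continuing the sketch above, for example (note that OR REPLACE deletes the existing row and re-inserts it, so any columns you don't list are reset to their defaults, which matters if the other REAL fields are being filled in incrementally):

# requires keyfield to be declared PRIMARY KEY (or UNIQUE) so the conflict can be detected
conn.execute(
    'INSERT OR REPLACE INTO mytable (keyfield, value1) VALUES (?, ?)',
    ('k1', 3.5),
)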
EDIT: Earlier I meant to write "if you have a primary key" rather than "foreign key"; I fixed it.
Edit: shame on me. I misread the question and somehow understood this was for MySQL rather than SQLite... Oops.
Please disregard this response, other than to get generic ideas about updating DBMSes. The likely solution to the OP's problem is the overly frequent commits, as pointed out in sixfeetsix's response.
A plausible explanation is that the table gets fragmented.
You can verify this by defragmenting the table every so often and checking whether the performance returns to the 3 or 4 items per second rate (which, BTW, is a priori relatively slow, but may depend on hardware, data schema and other specifics). Of course, you'll need to consider the amount of time defragmentation takes, and balance this against the time lost to the slow update rate to find an optimal frequency for the defragmentation.
If the slowdown is effectively caused, at least in part, by fragmentation, you may also look into performing the updates in a particular order. It is hard to be more specific without knowing details of the overall schema and of the data's statistical profile, but fragmentation is indeed sensitive to the order in which various changes to the database take place.
A final suggestion to boost the overall update performance is (if possible) to drop a few indexes on the table, perform the updates, and recreate the indexes anew. This counter-intuitive approach works for relatively big updates because the cost of re-creating the indexes is often less than the cumulative cost of maintaining them as the update progresses.