I had an INSERT query that took its values from a SELECT statement. Because the SELECT returns millions of records, it put too much load on the MySQL server, so we decided to break the SELECT query into parts and execute them one at a time using a LIMIT clause.
INSERT INTO target_table
SELECT * FROM source_table
WHERE my_condition = value
...
LIMIT <start>, <end>
We will keep increasing start and end values until SELECT returns 0 rows. I'm also thinking of making this multi-threaded.
How can I do it with PyMySQL?
Do I need to execute the SELECT, get the results and then generate the INSERT?
First of all, to answer your question: in PyMySQL, you get that value as the result of cursor.execute:
execute(query, args=None)
Execute a query
Parameters:
query (str) – Query to execute.
args (tuple, list or dict) – parameters used with query. (optional)
Returns: Number of affected rows
So you could just execute your query repeatedly until you get a value less than your selected range as a result.
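A minimal sketch of that loop, assuming a PyMySQL-style connection whose cursor.execute() returns the affected row count as documented above. The id column used for ordering and the copy_in_chunks name are assumptions for illustration; the table and column names come from the question.

```python
def copy_in_chunks(conn, value, chunk_size=1000):
    """Run the INSERT ... SELECT chunk by chunk; return total rows inserted.

    `conn` is assumed to behave like a PyMySQL connection: cursor() works as
    a context manager and cursor.execute() returns the affected row count.
    """
    sql = (
        "INSERT INTO target_table "
        "SELECT * FROM source_table "
        "WHERE my_condition = %s "
        "ORDER BY id "   # assumed primary key, keeps chunk boundaries stable
        "LIMIT %s, %s"   # MySQL syntax: LIMIT <offset>, <row_count>
    )
    offset = 0
    total = 0
    with conn.cursor() as cur:
        while True:
            affected = cur.execute(sql, (value, offset, chunk_size))
            conn.commit()  # one transaction per chunk
            total += affected
            if affected < chunk_size:
                break      # a short (or empty) chunk means we are done
            offset += chunk_size
    return total
```

Note that the second LIMIT argument is a row count, not an end position, so only the offset grows between iterations.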
Anyway, please consider:
the first thing you should check is whether you can optimize your select (assuming it's not as simple as in your example), e.g. by adding indexes. You may also want to time the select on its own against the full insert to get a rough idea which part is more relevant.
if the insertion is causing the problem, it can be due to the size of the transaction. In that case, splitting it up will only reduce the problems if you can also split up the transaction (although since you consider executing queries in parallel, this doesn't seem to be a concern)
if a query generates too much (CPU) load, running multiple instances of that query in parallel can, at best, only spread it over multiple cores, which will actually reduce the available CPU time for other queries. If "load" is related to I/O load, effects of limited resources or "general responsiveness", splitting may help though, e.g. a small query might generate a small temporary table in memory, while a big query generates a big temporary table on disk (although specifically with offset, this is unlikely, see below). Otherwise, you would usually need to add small pauses between (small enough) parts that you run consecutively, to spread the same workload over a longer time.
limit only makes sense if you have an order by (probably by the primary key), otherwise, in successive runs, the m-th row can be a different row than before (because the order is not fixed). This may or may not increase the load (and resource requirements) depending on your indexes and your where-condition.
the same is true for updates to your source table, as if you add or remove a row from the resultset (e.g. changing the value of my_condition of the first row), all successive offsets will shift, and you may skip a row or get a row twice. You will probably need to lock the rows, which might prevent running your queries in parallel (as they lock the same rows), and also might influence the decision if you can split the transaction (see 2nd bullet point).
using an offset requires MySQL to first find and then skip rows. So if you split the query in n parts, the first row will need to be processed n times (and the last row usually once), so the total work (for selecting) will be increased by (n^2-n)/2 times the size of one part. So especially if selecting the rows is the most relevant part (see 1st bullet point), this can actually make your situation much worse: just the last run will need to find the same amount of rows as your current query (although it throws most of them away), and might even need more resources for it depending on the effect of order by.
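To make the offset cost concrete, here is a small arithmetic sketch. It assumes, as described in the last point, that every skipped row is still found and discarded; the function name is made up for illustration.

```python
def rows_scanned_with_offset(total_rows, chunk_size):
    """Rows the server has to find when paging with LIMIT offset, chunk_size."""
    scanned = 0
    for offset in range(0, total_rows, chunk_size):
        rows_in_chunk = min(chunk_size, total_rows - offset)
        # Each run finds the skipped rows again, plus its own chunk.
        scanned += offset + rows_in_chunk
    return scanned


total, chunk = 10_000, 1_000          # n = 10 parts
scanned = rows_scanned_with_offset(total, chunk)
extra = scanned - total               # work added by splitting
```

With 10 parts of 1,000 rows, the split version finds 55,000 rows instead of 10,000, and the extra 45,000 matches (n^2-n)/2 parts of 1,000 rows each.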
You may be able to get around some of the offset-problems by using the primary key in the condition, e.g. have a loop that contains something like this:
select max(id) as new_max from (
  select id from <your table>
  where id > last_id and <your condition>
  order by id limit 1000 -- no offset!
) as batch
Exit the loop if new_max is null, otherwise do the insert:
insert ... select ...
where id > last_id and id <= new_max and <your condition>
Then set last_id = new_max and continue the loop.
It doubles the number of queries, as in contrast to limit with an offset, you need to know the actual id. It still requires your primary key and your where-condition to be compatible (so you may need to add an index that fits). If your search condition finds a significant percentage (more than about 15% or 20%) of your source table anyway, using the primary key might be the best execution plan anyway though.
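The loop can be sketched as follows. This uses sqlite3 from the standard library as a stand-in so that it runs without a server; with PyMySQL the placeholders would be %s instead of ?. The bounding query is written as a subquery so that LIMIT is applied before MAX(). An integer primary key named id with positive values is an assumption.

```python
import sqlite3


def copy_by_key_range(conn, value, batch_size=1000):
    """Copy matching rows in primary-key ranges; return rows copied."""
    last_id = 0      # assumes ids are positive integers
    copied = 0
    while True:
        # Find the highest id among the next batch_size matching rows.
        (new_max,) = conn.execute(
            "SELECT MAX(id) FROM ("
            "  SELECT id FROM source_table"
            "  WHERE id > ? AND my_condition = ?"
            "  ORDER BY id LIMIT ?"
            ") AS batch",
            (last_id, value, batch_size),
        ).fetchone()
        if new_max is None:
            break    # no matching rows left
        cur = conn.execute(
            "INSERT INTO target_table "
            "SELECT * FROM source_table "
            "WHERE id > ? AND id <= ? AND my_condition = ?",
            (last_id, new_max, value),
        )
        conn.commit()  # one transaction per batch
        copied += cur.rowcount
        last_id = new_max
    return copied
```

Because each batch is bounded by id values rather than an offset, no run has to re-read the rows handled by earlier runs.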
If you want to parallelize this (depending on your transaction requirements and whether it is potentially worthwhile, see above), you could first get the maximum value of the primary key (select max(id) as max_id from ...), and give each thread a range to work with. E.g. for max_id=3000 and 3 threads, start them with one of (0..1000), (1001..2000), (2001..3000) and include that in the first query:
select max(id) as new_max from (
  select id from <your table>
  where id > last_id
  and id >= $threadmin_id and id <= $threadmax_id
  and <your condition>
  order by id limit 1000
) as batch
It may depend on your data distribution if those ranges are equally sized (and you may find better ranges in your situation; calculating the exact ranges would require to execute the query though, so you probably can't be exact).
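A small sketch of how the per-thread ranges could be computed once max_id is known. The helper name is hypothetical, and it assumes ids start at 1; as noted, equal id ranges do not guarantee equal numbers of matching rows.

```python
def split_id_ranges(max_id, workers):
    """Return inclusive (low, high) id ranges that together cover 1..max_id."""
    size, extra = divmod(max_id, workers)
    ranges = []
    low = 1
    for i in range(workers):
        # The first `extra` ranges get one additional id each.
        high = low + size - 1 + (1 if i < extra else 0)
        if high >= low:          # skip empty ranges when workers > max_id
            ranges.append((low, high))
        low = high + 1
    return ranges


ranges = split_id_ranges(3000, 3)
```

Each tuple would be passed to one worker as its $threadmin_id and $threadmax_id.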
I've got a function which interacts with a postgres DB.
The function takes a parameter called pagination_data_required (boolean).
If pagination_data_required is set to true, the function executes a query as well as a query.count(), which according to the peewee documentation wraps the query in a count() function.
def list_records(pagination_data_required):
    query = table1.select(table1.columns...).join(table2....).distinct()  ## returns nearly 500k rows
    if (filter_request_body.pagination_data_required):
        total_count = query.count()
My problem arises when .count() is called. Without a .count() my api returns results within a second whereas with .count(), the response time skyrockets to ~18 seconds.
I need this total count due to a requirement from the frontend team.
The query is returning roughly 500k records (which is needed, plus there's a .paginate() function being called)
How do I efficiently count the number of rows returned in query ?
I've tried the api with pagination_data_required True and False and the results remain the same.
I've tried to call .dicts() on the original query and take the count of items but it gives the same response time.
The only way to count the number of rows returned by a query is to execute the query and count the results. I don't know how your ORM implements pagination, but I assume that it will append a LIMIT clause at the end of the query. That can speed up execution, because only the first few rows of the result set have to be calculated. But calculating the count will take much longer for a large result set.
So there is no good solution for this problem other than not showing an exact result count. See my article for a discussion of the problem and potential workarounds.
This one's a classic.
It is usually not possible to count the rows returned by a query without actually running the query. If it includes things that don't change the count, like left joins, sorts, joins on foreign keys that don't add or remove rows, etc, then you could remove them and get a bit of a speedup, but you will still be running the query. But if it is using a LIMIT'ed index scan for efficient searching of the most recent rows (for example) then that optimization won't work with a count. Also, reading such a large amount of useless data will trash your cache. If the count query is run often, all the data it uses will fill your cache and evict data that other, more useful queries need, which will make those queries slow. Or you will have to upgrade your RAM.
In some cases, like a forum, displaying a topic always uses the same search criteria. It is simply "where topic_id=... order by post_id". In this case, counting the posts is very wasteful, always doing the exact same query all over again, and paginating results with (LIMIT+OFFSET) is also slow as it discards all the selected rows before the requested offset. Since the most often requested page is the last one, the worst case is the most common.
However, with such fixed search and ordering criteria, the row number of any row in the result set is always the same, so it is possible to cache it as "post number in topic" in the posts table. Then, to get one specific page, it is simply a matter of "post_number BETWEEN ... AND ...", and to count the posts in a topic, just select the post_number of the last one. In this case it is possible to get the exact count without actually counting, and to paginate without using OFFSET, which is much faster.
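A minimal sketch of the cached post number idea, using sqlite3 so it runs standalone. The posts table, its columns and the function names are assumptions, and the sketch also assumes post numbers within a topic are contiguous and never reused (e.g. deletions are soft).

```python
import sqlite3


def fetch_topic_page(conn, topic_id, page, page_size=20):
    """Fetch one page by post number range; no OFFSET involved."""
    first = (page - 1) * page_size + 1
    last = page * page_size
    return conn.execute(
        "SELECT post_number, body FROM posts "
        "WHERE topic_id = ? AND post_number BETWEEN ? AND ? "
        "ORDER BY post_number",
        (topic_id, first, last),
    ).fetchall()


def count_topic_posts(conn, topic_id):
    """The count is simply the post number of the last post."""
    row = conn.execute(
        "SELECT post_number FROM posts WHERE topic_id = ? "
        "ORDER BY post_number DESC LIMIT 1",
        (topic_id,),
    ).fetchone()
    return row[0] if row else 0
```

Both queries can be served from an index on (topic_id, post_number), so neither has to walk the whole topic.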
For a generic search query that can use many criteria, it is not possible to store the row number in such a simple way. However, knowing the exact count is usually not necessary. When the GUI displays:
Page: 1 2 3 4 .... 50000 50001
Will a user ever navigate to page 837? Probably not. What users do in this case is use sort to get the result they want on top, or refine their search criteria to reduce the number of results to something manageable. So the time spent in this huge count() query is almost always wasted. Basically, the information that is relevant to the user is: are there few pages, so it's possible to scan them by eye, or are there a lot, so he should refine his search criteria?
This does not need an accurate count, so the easiest way to fix this is to limit the counted results to something that would fill a number of pages like 5 or 10. Instead of:
SELECT count(*) FROM ...
use:
SELECT count(*) FROM (subquery ORDER BY ... LIMIT ...) AS foo
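As a sketch, the capped count could look like this in Python, with sqlite3 as a stand-in database. The items table, the category filter and the function name are hypothetical. The ORDER BY is left out of the subquery here because it does not affect the count, and the limit is one more than the cap so the caller can tell "exactly this many" from "more than this".

```python
import sqlite3


def capped_count(conn, category, page_size=20, max_pages=10):
    """Return (count, has_more), counting at most max_pages pages of rows."""
    cap = page_size * max_pages
    (n,) = conn.execute(
        "SELECT count(*) FROM ("
        "  SELECT 1 FROM items WHERE category = ? LIMIT ?"
        ") AS foo",
        (category, cap + 1),   # one extra row reveals whether more exist
    ).fetchone()
    return min(n, cap), n > cap
```

The GUI can then show something like "more than 10 pages" instead of an exact page total.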
The next step is to realize selecting a few pages of results will quite often be almost as fast as selecting one page, so this is a good opportunity to cache the results for at least the first few pages when the first page is requested. This allows getting rid of the count, as you retrieve more results than necessary.
It is also possible to return the first few pages to the client and paginate on the client side using JavaScript, which saves server-side queries.
Quite often the user will click on the last page instead of reversing the order, in this case you should flip the ORDER BY direction to keep a small LIMIT, not count all the rows and use a huge OFFSET to skip all pages except the last. When using the correct ORDER BY direction depending on which page is requested, the most common ones (first and last page) are fastest, with the worst case being in the middle, which is rarely clicked.
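A small sketch of the flipped ordering for the last page, again with sqlite3 as a stand-in. The results table and its columns are assumptions; note that this variant always returns a full page of the final rows, rather than the partial page that OFFSET-based paging would produce.

```python
import sqlite3


def fetch_last_page(conn, page_size=20):
    """Read the last page with a reversed ORDER BY and a small LIMIT."""
    rows = conn.execute(
        "SELECT id, title FROM results ORDER BY id DESC LIMIT ?",
        (page_size,),
    ).fetchall()
    rows.reverse()   # restore ascending order for display
    return rows
```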
Another option is to cache the counts. The largest counts will most likely be for queries involving few search criteria, perhaps with common values, which results in a few combinations that can be cached beforehand. In addition, if the user clicks on page 2, reuse the cached count from the previous page. Of course the counts won't be exact, but that doesn't matter. It would only matter if the pagination logic was done wrong, i.e. not flipping the ORDER BY when pages close to the last one are requested.
I need this total count due to a requirement from the frontend team.
It's not possible, so the frontend team needs to read the answers to your question and act accordingly.
Currently, we have a table containing a varchar2 column with 4000 characters. This became a limitation, as the 'text' being inserted can grow bigger than 4000 characters, so we decided to use CLOB as the data type for this specific column. What happens now is that both the insertions and selections are way too slow compared to the previous varchar2(4000) data type.
We are using Python combined with SqlAlchemy to do both the insertions and the retrieval of the data. In simple words, the implementation itself did not change at all, only the column data type in the database.
Does anyone have any idea on how to tweak the performance?
There are two kinds of storage for CLOBs:
in row
The CLOB is stored like any other column in the row. This can only be done for CLOBs up to a certain size (approx 4k). CLOBs larger than this will be stored in a separate segment (the "lobsegment").
out of row
The CLOB is always stored out of the row in the lobsegment.
You can see which is being used for your table by checking USER_LOBS.
It is possible, particularly in the first 'in row' instance, that your table consumes more blocks for the "normal" rows because of the interspersed LOB data, and hence takes longer to scan.
See here: https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:9536389800346757758
If your LOB is stored separately, you have to navigate a separate indexing architecture and additional data files to get to the data. This will slow things down because of the additional background operations and disk I/O involved. There isn't much you can do about that except minimize the number of operations that access the LOB data, or get faster storage media.
LOBs have to be read and/or written behind the scenes by your client driver using a loop that reads or writes into a small buffer and keeps doing so until it reaches the end of the LOB. That buffer is smaller than you'd want it to be, and so it becomes quite chatty (lots of back and forth packets across the network between client and database). Not to mention creating the LOB locator and closing it. Too much back-and-forth.
Given that you recently made this change because some data exceeded 4KB, that suggests that most of your data is < 4KB. Take advantage of that situation and work with those < 4KB values using varchar2(4000), and only use CLOB where you have to (with values > 4KB).
Actual implementation of course depends on what you're doing. But this concept should be applied in some way. If you are selecting data, then select like this:
SELECT CAST(clobcolumn AS varchar2(4000)) clobcolumn_as_varchar
FROM mytable
WHERE NVL(dbms_lob.getlength(clobcolumn),0) <= 4000
AND [other predicates you need]
Then do a separate select for the larger rows:
SELECT clobcolumn
FROM mytable
WHERE NVL(dbms_lob.getlength(clobcolumn),0) > 4000
AND [other predicates you need]
For inserts, define one client variable that will be mapped to a varchar and one that will be mapped to a CLOB. Use the first for the < 4KB values and the second for the > 4KB values, in separate insert statements. Even if the varchar2 version is inserted into a CLOB column, Oracle will do the implicit conversion for you within the database, and it's fast. The key thing is, the network conversation used varchar2.
Note that this approach can make a world of difference if the great majority of your values are small (< 4KB). If most of them are over the 4KB boundary, then there's little benefit.
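On the client side, the routing step could be sketched like this. The helper name is made up, and measuring the limit in UTF-8 bytes is an assumption (it depends on the database character set and length semantics). How each group is then bound as varchar or CLOB is driver-specific and not shown.

```python
def split_by_size(values, limit=4000):
    """Split values into (small, large) by encoded size in bytes."""
    small, large = [], []
    for value in values:
        size = len(value.encode("utf-8")) if value is not None else 0
        (small if size <= limit else large).append(value)
    return small, large


small, large = split_by_size(["a", "x" * 4000, "y" * 4001, None])
```

The small group would go through the insert statement bound as a string, and only the large group through the one bound as a CLOB.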
You could also ask your DBA if possible to upgrade the DB to max_string_size=EXTENDED, then the max VARCHAR2 size would be 32K.
When using df = connectorx.read_sql(conn=cx_con, query=f"SELECT * FROM table;"), I occasionally get a Dataframe returned with the correct columns, but all the rows are zeros or it crashes with the message Process finished with exit code -1073741819 (0xC0000005). This only happens when the table is being updated at the same time with df.to_sql("table", con=con_in, if_exists="append")
My program reads a table from a local database that I am continuously updating in a concurrently running program. This issue does not occur when I try to read from the table using pandas.read_sql_query() (which is far slower). This indicates that there is some issue with the handling of the read/write traffic accessing the table when using connectorx that does not exist with the pandas read. Is this a bug with connectorx or is there something I can do to prevent this from happening?
I'm using PyCharm 2022.2.1, Windows11, and Python 3.10
Reason for the error
ConnectorX is designed more for OLAP scenarios where the data is static and the queries are read-only. The zero rows / crashes are caused by inconsistency between multiple queries. In order to achieve maximum speedup, ConnectorX issues queries to fetch metadata before fetching the real query result, including:
limit 1 query to get result schema (column types and names)
count query to get the number of the rows in the result
min/max query to get the min and max value of the partition column (if partition enabled)
The first two are used to pre-allocate the destination pandas.DataFrame in advance (step 1 and 2 in the below example workflow). Getting the dataframe in the beginning makes it possible for ConnectorX to stream the result values directly to the final destination, avoiding extra data copy and result concatenations (step 4-6 below, done in streaming fashion).
If the data is updating, the result of the count query X may be different from the real number of rows that the query returns Y. In such case, the program may crash (if X < Y) or return some rows with all zeros (if X > Y).
Possible workarounds
Avoid COUNT query through arrow
One possible way to avoid the count query is to set the return_type to arrow2 to get the arrow format first. Since an arrow table consists of multiple record batches, ConnectorX can allocate the memory on demand without issuing the count query. After getting the arrow result, you can then convert arrow to pandas using the efficient to_pandas API provided by pyarrow. Here is an example:
import connectorx as cx
table = cx.read_sql(conn, query, return_type="arrow2")
df = table.to_pandas(split_blocks=False, date_as_object=False)
However, note that if you are using partitioning, the result might still be incorrect due to the inconsistency among the min/max query and the multiple partitioned queries.
Add predicates for consistency
If your data is append-only in a certain way, for example with a monotonic ID column, you can add a predicate like ID <= current max ID so that concurrently appended data is filtered out in both the count query and the fetch query. If you are using partitioning, you can also partition on this ID column so that the result is consistent.
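The effect of such a predicate can be illustrated without ConnectorX. The sketch below uses sqlite3 as a stand-in and a hypothetical events table with a monotonic id column; bounded_query builds the kind of query string that would be passed to read_sql.

```python
import sqlite3


def bounded_query(max_id):
    """Query text that ignores rows appended after max_id was read."""
    return f"SELECT * FROM events WHERE id <= {int(max_id)}"


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO events (payload) VALUES (?)",
                     [("a",), ("b",), ("c",)])
    (max_id,) = conn.execute("SELECT MAX(id) FROM events").fetchone()
    query = bounded_query(max_id)
    # Metadata step: count the rows the query will return.
    (counted,) = conn.execute(f"SELECT count(*) FROM ({query}) AS q").fetchone()
    # Simulated concurrent append between the count and the fetch.
    conn.execute("INSERT INTO events (payload) VALUES ('d')")
    fetched = conn.execute(query).fetchall()
    return counted, len(fetched)
```

Because both statements share the same upper bound, the row appended in between does not change either result.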
I have an existing table with a large number of entries and I want to calculate a new column for every row. I have only found the following solution. This works, but it's slow as it needs to scan most of the entries of the table.
What I would like is a way to:
Read a row
Calculate the value for new column based on contents of row
Update into database
This way it would only go through the table once and would have linear complexity.
cursor.execute("SELECT tweet FROM Table")
row = cursor.fetchone()
while row is not None:
    vader = analyser.polarity_scores(row)
    sentiment_vader = vader["compound"]
    cursor2.execute(
        "UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
        (sentiment_vader, row[0]))
    kody.cnx.commit()
    row = cursor.fetchone()
The main performance issue I see is that you commit after each row update, which adds overhead. You should commit once at the end of the while loop, or after each batch.
while row is not None:
    ...
else:
    kody.cnx.commit()
Also, if the tweet column is not indexed just create an index on that column in order not to make a table scan during the update.
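A sketch of committing per batch instead of per row, using sqlite3 as a stand-in so it runs on its own. The tweets table, its integer id primary key and the score function are assumptions; score stands in for analyser.polarity_scores(text)["compound"]. Updating by primary key also avoids the lookup on the tweet text.

```python
import sqlite3


def score(text):
    """Hypothetical stand-in for analyser.polarity_scores(text)["compound"]."""
    return float(len(text))


def update_in_batches(conn, batch_size=1000):
    """Score every tweet once, committing after each batch."""
    last_id = 0
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id, tweet FROM tweets WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        conn.executemany(
            "UPDATE tweets SET sentiment_vader = ? WHERE id = ?",
            [(score(tweet), row_id) for row_id, tweet in rows],
        )
        conn.commit()            # one commit per batch, not per row
        updated += len(rows)
        last_id = rows[-1][0]    # continue after the last id seen
    return updated
```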
OK, so first, not to criticize the other answers, which are correct, given a generalized assumption that you have to do it in Python.
However, when you really have bulk volumes, chasing after a client-side, in-Python answer is often not the best approach. Since you want to update all the rows, assuming you can translate your polarity_scores algorithm into sql
UPDATE Table
SET sentiment_vader = <sql expressing your polarity_scores>;
would be the best performer. There is no back and forth with the database and everything gets committed at once.
Now, I am not saying it's easy or even possible. Often in these cases, even assuming the algorithm can be expressed in SQL, you may have to use work tables to store intermediate results and there is a lot of SQL going on. It's a different skill set than writing Python code.
But, if you truly need performance and you have large volumes, letting the server do the job on its own, in SQL, can be the way to go. That can be done via a series of sql commands, or using stored procedures.
In a previous job, we had explicit instructions to avoid loop and write constructs in client code, and code reviews would almost always reject them on bulk data manipulations. I remember advising a colleague that doing a select-update on a table with potentially up to 5M rows seemed a bad approach. He certainly ignored me at the time, but 3 months later his mission-critical code had all mysteriously shifted to a no-loop approach.
Note however one key conceptual difference: an error on a server-side update would rollback the transaction for all rows indiscriminately, whereas you could maybe choose to commit row-by-row using a loop construct like yours (even though you don't want it in your case).
The expected performance profile server-side is usually considerably better than O(n) linear time. Most of the time you should be nearly at constant O(1) time complexity, once you have correctly written your queries and indexes. Linear time to update, for a RDBMS vendor, would be commercial suicide. Usually what you see is a near constant time, followed by non-linear and hard-to-predict performance degradation past very high volume thresholds. You will see linear time earlier when indices can't be used and for your queries the RDBMS falls back to performing full-table scans.
Is this MySQLdb ? Maybe you can try an executemany.
cursor.execute("SELECT tweet FROM Table")
cursor2.executemany(
    "UPDATE Table SET sentiment_vader = %s WHERE tweet = %s LIMIT 1",
    ((analyser.polarity_scores(row[0])["compound"], row[0]) for row in cursor)
)
kody.cnx.commit()
Just as #abc suggested above, you should also make sure autocommit is set to False, so that each query isn't committed separately during the executemany.
I have lists of about 20,000 items that I want to insert into a table (with about 50,000 rows in it). Most of these items update certain fields in existing rows and a minority will insert entirely new rows.
I am accessing the database twice for each item. First is a select query that checks whether the row exists. Next I insert or update a row depending on the result of the select query. I commit each transaction right after the update/insert.
For the first few thousand entries, I am getting through about 3 or 4 items per second, then it starts to slow down. By the end it takes more than 1/2 second for each iteration. Why might it be slowing down?
My average times are: 0.5 seconds for an entire run divided up as .18s per select query and .31s per insert/update. The last 0.01 is due to a couple of unmeasured processes to do with parsing the data before entering into the database.
Update
I've commented out all the commits as a test and got no change, so that's not it (any more thoughts on optimal committing would be welcome, though).
As to table structure:
Each row has twenty columns. The first four are TEXT fields (all set with the first insert) and the other 16 are REAL fields, one of which is set by the initial insert statement.
Over time the 'outstanding' REAL fields will be populated with the process I'm trying to optimize here.
I don't have an explicit index, though one of the fields is a unique key for each row.
I should note that as the database has gotten larger both the SELECT and UPDATE queries have taken more and more time, with a particularly remarkable deterioration in performance in the SELECT operation.
I initially thought this might be some kind of structural problem with SQLITE (whatever that means), but haven't been able to find any documentation anywhere that suggests there are natural limits to the program.
The database is about 60ish megs, now.
I think your bottleneck is that you commit with each insert/update:
I commit each transaction right after the update/insert.
Either stop doing that, or at least switch to WAL journaling; see this answer of mine for why:
SQL Server CE 4.0 performance comparison
If you have a primary key you can optimize out the select by using the ON CONFLICT clause with INSERT INTO:
http://www.sqlite.org/lang_conflict.html
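A minimal sketch of dropping the SELECT, using Python's sqlite3 module. It uses SQLite's UPSERT form (ON CONFLICT ... DO UPDATE, available since SQLite 3.24), which is related to but not the same as the conflict clause linked above; the readings table and its columns are hypothetical. The whole list is written in one transaction rather than committing per item.

```python
import sqlite3


def upsert_items(conn, items):
    """Insert new rows or update existing ones, in a single transaction.

    `items` is an iterable of (item_key, value) pairs.
    """
    with conn:   # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO readings (item_key, value) VALUES (?, ?) "
            "ON CONFLICT(item_key) DO UPDATE SET value = excluded.value",
            items,
        )
```

Only the listed column is overwritten on conflict, so fields set by the initial insert are left untouched.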
EDIT : Earlier I meant to write "if you have a primary key " rather than foreign key; I fixed it.
Edit: shame on me. I misread the question and somehow understood this was for MySQL rather than SQLite... Oops.
Please disregard this response, other than to get generic ideas about updating DBMSes. The likely solution to the OP's problem is with the overly frequent commits, as pointed out in sixfeetsix' response.
A plausible explanation is that the table gets fragmented.
You can verify this by defragmenting the table every so often, and checking if the performance returns to the rate of 3 or 4 items per second. (Which, BTW, is a priori relatively slow, but then may depend on hardware, data schema and other specifics.) Of course, you'll need to consider the amount of time defragmentation takes, and balance this against the time lost by the slow update rate to find an optimal frequency for the defragmentation.
If the slowdown is effectively caused, at least in part, by fragmentation, you may also look into performing the updates in a particular order. It is hard to be more specific without knowing details of the overall schema and the statistical profile of the data, but fragmentation is indeed sensitive to the order in which various changes to the database take place.
A final suggestion, to boost the overall update performance, is (if this is possible) to drop a few indexes on the table, perform the updates, and recreate the indexes anew. This counter-intuitive approach works for relatively big updates because the cost of re-creating the indexes is often less than the cumulative cost of maintaining them as the update progresses.