What is the fastest way to update thousands of rows in MySQL - Python

Let's assume you have a table with 1M rows and growing.
Every five minutes of every day you run a Python program which has to update some fields of 50K rows.
My question is: what is the fastest way to do the work?

1. Run those updates in a loop and fire a cursor commit after the last one is executed?
2. Generate a file and then run it through the command line?
3. Create a temp table with one huge, fast insert and then run a single UPDATE against the production table?
4. Use prepared statements?
5. Split it up into 1K updates per execute, to generate smaller log files?
6. Turn off logging while running the updates?
7. Use CASE expressions as in the MySQL examples (but this only works for up to 255 rows)?

I don't know... has anyone done something like this? What is the best practice? I need to run it as fast as possible.

Here are some ways you could speed up your UPDATEs.
When you UPDATE, the table records are just rewritten with new data. With REPLACE, the existing row is deleted first and all of that work has to be done again by the INSERT. That's why you should always use INSERT ... ON DUPLICATE KEY UPDATE instead of REPLACE: the former is an UPDATE operation in case of a key violation, while the latter is a DELETE / INSERT.
Here's an example:
INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1;
More on this in the MySQL documentation.
UPDATE1: It's a good idea to do your inserts all in a single query (a multi-row INSERT). This should speed up your UPDATEs; the MySQL manual's INSERT syntax page shows how to do that.
UPDATE2: Now that I have had a chance to read your other sub-questions, here's what I know, point by point:
1. Instead of a loop, try to execute all the UPDATEs in a single SQL statement with a single commit.
2. Not sure this is going to make any difference; the SQL queries themselves matter more.
3. Now this is something you could experiment with. Benchmark it. This kind of thing depends on the size of the table and the indexes you have, plus InnoDB vs. MyISAM.
4. No idea about this one.
5. Refer to the first point.
6. Yes, this might speed your stuff up slightly. Also check whether you have slow_query_log turned on; it logs all slow queries to a separate log file. Turn that off too.
7. Again, refer to the first point.
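To make point 1 concrete, here is a minimal sketch of applying all 50K changes as a handful of multi-row INSERT ... ON DUPLICATE KEY UPDATE statements followed by a single commit. It assumes MySQL Connector/Python and a hypothetical prices(id, price) table with id as the primary key; any DB-API driver works similarly.

import mysql.connector

conn = mysql.connector.connect(user="app", password="secret", database="mydb")
cursor = conn.cursor()

rows = [(1, 9.99), (2, 4.50)]   # (id, price) pairs to apply; ~50K of them in practice
CHUNK = 5000                    # keeps each statement under max_allowed_packet

for start in range(0, len(rows), CHUNK):
    chunk = rows[start:start + CHUNK]
    # Only the placeholder list is built dynamically; the values stay bound parameters.
    placeholders = ", ".join(["(%s, %s)"] * len(chunk))
    sql = ("INSERT INTO prices (id, price) VALUES " + placeholders +
           " ON DUPLICATE KEY UPDATE price = VALUES(price)")
    cursor.execute(sql, [value for row in chunk for value in row])

conn.commit()   # single commit after all chunks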

Regarding the query execution process: the server first parses your query and then executes it, so you need to analyze your query. The less time the server spends parsing, the faster it can execute.
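If repeated parsing is the cost you are worried about, one way to avoid it is a server-side prepared statement. A sketch, assuming MySQL Connector/Python (whose cursor(prepared=True) prepares the statement on first execute) and the hypothetical prices table from the sketch above:

cursor = conn.cursor(prepared=True)
update_sql = "UPDATE prices SET price = %s WHERE id = %s"
for price, row_id in updates:                     # updates: list of (price, id) pairs
    cursor.execute(update_sql, (price, row_id))   # parsed once, executed many times
conn.commit()

Whether this actually helps depends on your driver and server; benchmark it against the multi-row statement above.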

Related

What is the best practice for recording a run number into a Postgres database through a Python connection?

Every few days I have to load around 50 rows of data into a Postgres 14 database through a Python script. As part of this I want to record a run number as a column in the db. This number should be the same for all rows I am inserting at that time and larger than any number currently in that column in the database, but other than that its actual value doesn't matter (i.e., I need to make the number myself, I'm not just pulling it in from somewhere else).
The obvious way to do this would be with two calls from Python to the database: one to get the current max run number and one to save the data with the run number set to one more than the number retrieved in the first query. Is this best practice? Is there a better way to do this with only one call? Would people recommend making a function in Postgres that does this instead and then calling that function? I feel like this is likely a common situation with an accepted best practice, but I don't know what it is.
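For reference, a minimal sketch of the two-call approach wrapped in a single transaction, using psycopg2. The table and column names (measurements, run_number) are hypothetical, and this assumes only one loader runs at a time; concurrent loaders would need a Postgres sequence or an explicit lock instead of MAX() + 1.

import psycopg2

rows = [("2024-05-01", 1.23), ("2024-05-01", 4.56)]   # example payload

conn = psycopg2.connect("dbname=mydb")
with conn:                                # one transaction; commits on success
    with conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(run_number), 0) + 1 FROM measurements")
        run_number = cur.fetchone()[0]
        cur.executemany(
            "INSERT INTO measurements (run_number, run_date, value) VALUES (%s, %s, %s)",
            [(run_number, d, v) for d, v in rows],
        )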

psycopg2 not improving speed when using execute_batch method

So, I'm working on updating thousands of rows in a Postgres DB with Python (v3.6). After cleaning the data and preparing it, I'm having issues with the time the row updates take. I've already indexed the columns that are used in the query.
I'm using psycopg2 to run an execute_batch update on the table after having created the column, but the timings just don't make sense. It takes 40 seconds to update 10k rows, and what is breaking my mind is that changing the page_size parameter of the function doesn't seem to change the speed of the updates.
These two calls give the same timings:
psycopg2.extras.execute_batch(self.cursor, query, field_list, page_size=1000)
psycopg2.extras.execute_batch(self.cursor, query, field_list, page_size=10)
With all this, am I doing something wrong? Is it necessary to change anything in the database configuration so that the page_size argument actually changes the behaviour?
So far I've found a post that obtains improvements when using this method, but I cannot reproduce its results:
https://hakibenita.com/fast-load-data-python-postgresql#measuring-time
Any light on this would be awesome.
Many thanks!
Unless the bottleneck which execute_batch removes is the bottleneck you actually face, there is no reason to expect a performance improvement.
If the time to do the update is dominated by index maintenance (which is likely, if your table is indexed), then nothing else is going to matter.
If python is running on the same server as your database, or they are on a reasonably fast LAN, reducing network round trips is probably of little importance, until every other bottleneck has been removed first.
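Not something the answer above prescribes, but if the per-statement round trips ever do turn out to be the limiting factor, psycopg2's execute_values can fold all the rows into a single UPDATE joined against a VALUES list. A sketch with hypothetical table and column names, reusing the cursor and field_list from the question:

from psycopg2.extras import execute_values

execute_values(
    cursor,
    """
    UPDATE items AS t
    SET score = v.score
    FROM (VALUES %s) AS v (id, score)
    WHERE t.id = v.id
    """,
    field_list,          # e.g. a list of (id, score) tuples
    page_size=1000,
)
connection.commit()

As the answer notes, though, if index maintenance dominates, this will not move the needle either.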

peewee with bulk insert is very slow into sqlite db

I'm trying to do a large-scale bulk insert into a sqlite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to SQLITE_MAX_VARIABLE_NUMBER I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way I have my code written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. This also didn't really change anything for me performance wise. Lastly, as per this test on the peewee github page it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds) but it also seems that the author used raw sql code to get this performance. Has anyone been able to do such a high performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign key PRAGMA is going to slow things down; why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
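A small sketch of applying those settings before the bulk load, assuming helper.db is the peewee SqliteDatabase from the question (execute_sql is peewee's standard way to run raw statements):

db = helper.db

# Settings recommended above; apply them once, before the big insert loop.
db.execute_sql("PRAGMA journal_mode = wal;")
db.execute_sql("PRAGMA synchronous = 0;")
db.execute_sql("PRAGMA locking_mode = EXCLUSIVE;")

with db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()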
In all of the documentation where atomic appears as a context manager, it's called as a function, i.e. with db.atomic():, with parentheses. Since it sounds like your code never finishes the with block, you're probably just not seeing the error about the bare atomic attribute not having an __exit__ method.
Can you try with helper.db.atomic():?
atomic() starts a transaction. Without an open transaction, inserts are much slower because some expensive bookkeeping has to be done for every write, as opposed to only at the beginning and end.
EDIT
Since the code that started the question was changed, can I have some more information about the table you're inserting into? Is it large, and how many indices are there?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS.

How to write a proper big data loader for SQLite

I'm trying to write a loader that will load simple rows into a SQLite DB as fast as possible.
The input data looks like rows retrieved from a Postgres DB. The approximate number of rows that will go into SQLite: from 20 million to 100 million.
I cannot use any DB other than SQLite due to project restrictions.
My question is: what is the proper logic for writing such a loader?
On the first try I wrote a set of encapsulated generators that take one row from Postgres, slightly amend it and put it into SQLite. I ended up creating a separate SQLite connection and cursor for each row, and that looks awful.
On the second try I moved the SQLite connection and cursor out of the generator into the body of the script, and it became clear that I don't commit data to SQLite until I have fetched and processed all 20 million records. That could possibly exhaust all my hardware.
On the third try I started to consider keeping the SQLite connection away from the loops, but creating/closing a cursor each time I process and push one row to SQLite. This is better, but I think it still has some overhead.
I also considered playing with transactions: one connection, one cursor, one transaction, and a commit called in the generator each time a row is pushed to SQLite. Is this the right way to go?
Is there some widely used pattern for writing such a component in Python? Because I feel as if I am reinventing the wheel.
SQLite can handle huge transactions with ease, so why not commit at the end? Have you tried this at all?
If you do feel one big transaction is a problem, why not commit every n rows? Process rows one by one, insert as needed, and after every n executed insertions add a connection.commit() to spread the load (see the sketch below).
See my previous answer about bulk and SQLite. Possibly my answer here as well.
A question: do you control the SQLite database? There are compile-time options related to cache sizes, etc., that you can tweak for your purposes as well.
In general, the steps in #1 are going to get you the biggest bang-for-your-buck.
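A minimal sketch of the commit-every-n idea using the standard sqlite3 module; row_source() and the target table are placeholders for whatever the Postgres-reading generator and destination schema actually are:

import sqlite3

BATCH_SIZE = 10000
conn = sqlite3.connect("target.db")
cur = conn.cursor()

pending = 0
for row in row_source():                     # e.g. a generator reading from Postgres
    cur.execute("INSERT INTO target (a, b, c) VALUES (?, ?, ?)", row)
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()                        # one commit per batch, not per row
        pending = 0

conn.commit()                                # flush the final partial batch
conn.close()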
Finally I managed to resolve my problem. The main issue was the excessive number of insertions into SQLite. After I started to load all the data from Postgres into memory and aggregate it properly to reduce the number of rows, I was able to decrease the processing time from 60 hours to 16 hours.

How do you make Python / PostgreSQL faster?

Right now I have a log parser reading through 515mb of plain-text files (a file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4ghz C2D, 2GB RAM)
Is it possible for this to go faster or is that a limitation on the language/database?
Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.
Three Things.
One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.
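A sketch of what that looks like in practice for one dimension; the table and column names are illustrative, not taken from the gist:

cursor.execute("SELECT name, id FROM hostnames")
hostname_ids = dict(cursor.fetchall())       # {'example.com': 42, ...}

def host_id(name):
    # One dictionary lookup instead of a SELECT per log line.
    return hostname_ids[name]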
Two. Don't Update.
Specifically, do not do this. It's bad code for two reasons.
cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)
It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data; your data model should always provide correct complete counts. Never update.]
And. Never build SQL using string substitution. Completely dumb.
If, for some reason the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame) you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
Three. Use Bind Variables. Always.
cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )
The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.
In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single sql statement with bind variables, to be executed with different values. So you could put this before the for loop:
insert_statement="""
INSERT INTO chats(person_id, message_type, created_at, channel)
VALUES(:person_id,:message_type,:created_at,:channel)
"""
Then, in place of each SQL statement you execute, put this:
cursor.execute(insert_statement, person_id='person',message_type='msg',created_at=some_date, channel=3)
This will make things run faster because:
The cursor object won't have to reparse the statement each time
The db server won't have to generate a new execution plan, as it can use the one it created previously.
You won't have to call sanitize(), as special characters in the bind variables won't be part of the SQL statement that gets executed.
Note: The bind variable syntax I used is Oracle specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
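For reference, a sketch of the same statement in psycopg2's pyformat style (the table and column names follow the snippet above; the values are the same illustrative ones):

insert_statement = """
    INSERT INTO chats (person_id, message_type, created_at, channel)
    VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""
cursor.execute(insert_statement, {
    "person_id": person_id,
    "message_type": "msg",
    "created_at": some_date,
    "channel": 3,
})

cursor.executemany(insert_statement, list_of_such_dicts) goes one step further and reuses the statement for a whole batch of rows.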
Other optimizations:
You're incrementing with the "UPDATE people SET chats_count" statement after each loop iteration. Instead, keep a dictionary mapping user to chat count and then execute one statement per user with the total number you've seen (see the sketch below). This will be faster than hitting the db after every record.
Use bind variables on ALL your queries, not just the insert statement; I chose that as an example.
Change all the find_*() functions that do db lookups to cache their results so they don't have to hit the db every time.
Psyco optimizes Python programs that perform a large number of numeric operations. This script is IO-heavy rather than CPU-heavy, so I wouldn't expect it to give you much, if any, optimization.
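Here is a minimal sketch of the chat-count dictionary mentioned above, with the per-user totals written out in one batch at the end; person_ids_seen_in_log is a placeholder for whatever the parsing loop produces:

from collections import Counter

chat_counts = Counter()
for person_id in person_ids_seen_in_log:     # one entry per parsed chat line
    chat_counts[person_id] += 1

cursor.executemany(
    "UPDATE people SET chats_count = chats_count + %(n)s WHERE id = %(id)s",
    [{"id": pid, "n": count} for pid, count in chat_counts.items()],
)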
Use bind variables instead of literal values in the SQL statements, and create a cursor for each unique SQL statement so that the statement does not need to be reparsed the next time it is used. From the Python DB API doc:
Prepare and execute a database operation (query or command). Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]

A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).
ALWAYS ALWAYS ALWAYS use bind variables.
As Mark suggested, use binding variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program isn't handling).
Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)
Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them while you do your INSERTs. For example, read all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
In addition to the many fine suggestions Mark Roddy has given, do the following:
don't use readlines(); you can iterate over file objects directly
try to use executemany rather than execute: do batch inserts rather than single inserts. This tends to be faster because there's less overhead, and it also reduces the number of commits
str.rstrip will work just fine instead of stripping off the newline with a regex
Batching the inserts will use more memory temporarily, but that should be fine when you don't read the whole file into memory.
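A sketch of how those three suggestions fit together; the file name, parse_line() helper, and SQL are placeholders rather than the asker's real code:

insert_sql = (
    "INSERT INTO chats (person_id, message_type, created_at, channel) "
    "VALUES (%s, %s, %s, %s)"
)

batch = []
with open("chat.log") as logfile:
    for line in logfile:                         # iterate the file object, no readlines()
        line = line.rstrip("\n")                 # plain rstrip instead of a regex
        person_id, msg_type, created_at, channel = parse_line(line)   # hypothetical parser
        batch.append((person_id, msg_type, created_at, channel))
        if len(batch) >= 1000:
            cursor.executemany(insert_sql, batch)    # batched insert
            batch.clear()

if batch:
    cursor.executemany(insert_sql, batch)            # final partial batch
connection.commit()                                  # single commit at the end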
