How do you make Python / PostgreSQL faster?

Right now I have a log parser reading through 515 MB of plain-text files (a file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's doing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4 GHz Core 2 Duo, 2 GB RAM).
Is it possible for this to go faster or is that a limitation on the language/database?

Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.
Three Things.
One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python. (See the sketch after point Three.)
Two. Don't Update.
Specifically, do not do this; it's bad code for two reasons.
cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)
It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct complete counts. Never update.]
And. Never build SQL using string substitution. Completely dumb.
If, for some reason the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame) you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
Three. Use Bind Variables. Always.
cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )
The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.
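To make those three points concrete, here is a rough sketch of how they could fit together with psycopg2. The chat_counts table, the parsed_log_lines() helper and the connection string are invented for illustration, not taken from the question's code:
import psycopg2

conn = psycopg2.connect("dbname=chatlog")   # placeholder connection string
cursor = conn.cursor()

# One: fetch each dimension once and keep it in a Python dict.
cursor.execute("SELECT name, id FROM people")
people = dict(cursor.fetchall())            # name -> person_id, looked up in memory

# Three: one INSERT with bind variables, reused for every row.
insert_chat = """
    INSERT INTO chats (person_id, created_at, channel)
    VALUES (%(person_id)s, %(created_at)s, %(channel)s)
"""
for name, created_at, channel in parsed_log_lines():    # hypothetical parser
    cursor.execute(insert_chat, {'person_id': people[name],
                                 'created_at': created_at,
                                 'channel': channel})

# Two: never UPDATE a counter; derive the counts once, after the load.
cursor.execute("""
    INSERT INTO chat_counts (person_id, chats_count)
    SELECT person_id, COUNT(*) FROM chats GROUP BY person_id
""")
conn.commit()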

In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single sql statement with bind variables, to be executed with different values. So you could put this before the for loop:
insert_statement="""
INSERT INTO chats(person_id, message_type, created_at, channel)
VALUES(:person_id,:message_type,:created_at,:channel)
"""
Then, in place of each SQL statement you execute, put this:
cursor.execute(insert_statement, person_id='person', message_type='msg', created_at=some_date, channel=3)
This will make things run faster because:
The cursor object won't have to reparse the statement each time
The db server won't have to generate a new execution plan as it can use the one it created previously.
You won't have to call sanitize() as special characters in the bind variables won't be part of the sql statement that gets executed.
Note: The bind variable syntax I used is Oracle specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
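For what it's worth, psycopg2's paramstyle is pyformat, so the same idea written for psycopg2 would look roughly like this (the bind values are illustrative):
insert_statement = """
INSERT INTO chats(person_id, message_type, created_at, channel)
VALUES(%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""
# The bind values are passed as a mapping; %s placeholders with a sequence also work.
cursor.execute(insert_statement, {'person_id': person_id,
                                  'message_type': 'msg',
                                  'created_at': some_date,
                                  'channel': 3})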
Other optimizations:
You're incrementing with the "UPDATE people SET chats_count" after each loop iteration. Keep a dictionary mapping person to chat_count and then execute one UPDATE per person with the total number you've seen (see the sketch after this list). This will be faster than hitting the db after every record.
Use bind variables on ALL your queries. Not just the insert statement; I chose that as an example.
Change all the find_*() functions that do db lookups to cache their results so they don't have to hit the db every time.
Psyco optimizes Python programs that perform a large number of numeric operations. The script is IO-heavy rather than CPU-heavy, so I wouldn't expect it to give you much, if any, optimization.
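Here is a sketch of the counting idea from the first bullet, assuming the people table from the question and a hypothetical find_person() lookup:
from collections import defaultdict

chat_counts = defaultdict(int)

for line in logfile:                       # inside the existing parsing loop
    person_id = find_person(line)          # hypothetical (and ideally cached) lookup
    chat_counts[person_id] += 1            # count in memory instead of UPDATEing the db

# After the loop: one UPDATE per person instead of one per log line.
for person_id, count in chat_counts.items():
    cursor.execute("UPDATE people SET chats_count = chats_count + %(n)s WHERE id = %(id)s",
                   {'n': count, 'id': person_id})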

Use bind variables instead of literal values in the sql statements and create a cursor for each unique sql statement so that the statement does not need to be reparsed the next time it is used. From the python db api doc:
Prepare and execute a database operation (query or command). Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]
A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).
ALWAYS ALWAYS ALWAYS use bind variables.

As Mark suggested, use bind variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program isn't handling).
Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
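With psycopg2 a transaction is already open by default (autocommit is off), so this boils down to committing once at the end and letting any exception roll the whole run back; a minimal sketch with a placeholder connection string and a hypothetical insert_log_line() helper:
conn = psycopg2.connect("dbname=chatlog")   # placeholder connection string
cursor = conn.cursor()
try:
    for line in open(logfile_path):         # logfile_path is illustrative
        insert_log_line(cursor, line)       # hypothetical per-line insert
    conn.commit()                           # one commit for the whole run
except Exception:
    conn.rollback()                         # nothing is persisted on failure
    raise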
Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)
Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them in memory while you do your INSERTs. For example, read in all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
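A sketch of that in-memory approach for hostnames (the same pattern works for people and dates); the table and column names are illustrative, and RETURNING is PostgreSQL-specific:
# Load the existing hostnames once at startup.
cursor.execute("SELECT name, id FROM hostnames")
hostname_ids = dict(cursor.fetchall())      # name -> id

def hostname_id(cursor, name):
    # Return the id for a hostname, inserting it only if we haven't seen it yet.
    if name not in hostname_ids:
        cursor.execute("INSERT INTO hostnames (name) VALUES (%(name)s) RETURNING id",
                       {'name': name})
        hostname_ids[name] = cursor.fetchone()[0]
    return hostname_ids[name]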

In addition to the many fine suggestions @Mark Roddy has given, do the following:
don't use readlines(); you can iterate over file objects directly
try to use executemany rather than execute: try to do batch inserts rather than single inserts, as this tends to be faster because there's less overhead. It also reduces the number of commits (see the sketch after this list).
str.rstrip will work just fine instead of stripping off the newline with a regex
Batching the inserts will use more memory temporarily, but that should be fine as long as you don't read the whole file into memory.
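A sketch of the batched insert mentioned above, assuming a hypothetical parse_line() helper and the insert_statement with bind variables from the earlier answers:
batch = []
for line in logfile:
    person_id, message_type, created_at, channel = parse_line(line)   # hypothetical parser
    batch.append({'person_id': person_id,
                  'message_type': message_type,
                  'created_at': created_at,
                  'channel': channel})
    if len(batch) >= 1000:                  # flush in chunks to bound memory use
        cursor.executemany(insert_statement, batch)
        batch = []
if batch:                                   # flush the remainder
    cursor.executemany(insert_statement, batch)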

Related

How to execute many SELECT statements at once using python sqlite

I have some business logic that iterates many many times and needs to perform a simple query every time. Rather than make a call to the db every time I would like to store the SELECT statements as an array of strings or something similar and then execute all of the statements at once after the loop. Is this possible with python and sqlite?
The documentation says:
execute() will only execute a single SQL statement. If you try to execute more than one statement with it, it will raise a Warning. Use executescript() if you want to execute multiple SQL statements with one call.
However, executescript() does not allow you to access all the results.
To get multiple query results, you have to do the loop yourself:
def execute_many_selects(cursor, queries):
    return [cursor.execute(query).fetchall() for query in queries]
SQLite is an embedded library, so there is no client/server communication overhead when doing multiple database calls.
I suspect you'd be better off if you work out a "larger" query and then decompose the result set after retrieving the information.
In other words, rather than three calls to the database (one each for Alice, Betty and Claire), use something like:
select stuff from a_table
where person in ('Alice', 'Betty', 'Claire')
and then process the actual data taking person into account.
Obviously, that will only work in the case where you can figure out the query before executing any of the person-based actions, but it looks like that's the case anyway, based on your question.
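A sketch of that approach with sqlite3, building one placeholder per name and grouping the rows afterwards (the database path is a placeholder; the table and column names follow the example query above):
import sqlite3
from collections import defaultdict

def stuff_by_person(cursor, people):
    placeholders = ",".join("?" * len(people))    # one ? per name
    cursor.execute("SELECT person, stuff FROM a_table WHERE person IN (%s)" % placeholders,
                   people)
    grouped = defaultdict(list)
    for person, stuff in cursor.fetchall():
        grouped[person].append(stuff)
    return grouped

conn = sqlite3.connect("example.db")
results = stuff_by_person(conn.cursor(), ["Alice", "Betty", "Claire"])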

peewee with bulk insert is very slow into sqlite db

I'm trying to do a large-scale bulk insert into a sqlite database with peewee. I'm using atomic but the performance is still terrible. I'm inserting the rows in blocks of ~2500 rows, and due to the SQLITE_MAX_VARIABLE_NUMBER limit I'm inserting about 200 of them at a time. Here is the code:
with helper.db.atomic():
    for i in range(0, len(expression_samples), step):
        gtd.GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()
And the list expression_samples is a list of dictionaries with the appropriate fields for the GeneExpressionRead model. I've timed this loop, and it takes anywhere from 2-8 seconds to execute. I have millions of rows to insert, and the way I have my code written now it will likely take 2 days to complete. As per this post, there are several pragmas that I have set in order to improve performance. This also didn't really change anything for me performance wise. Lastly, as per this test on the peewee github page it should be possible to insert many rows very fast (~50,000 in 0.3364 seconds) but it also seems that the author used raw sql code to get this performance. Has anyone been able to do such a high performance insert using peewee methods?
Edit: Did not realize that the test on peewee's github page was for MySQL inserts. May or may not apply to this situation.
Mobius was trying to be helpful in the comments but there's a lot of misinformation in there.
Peewee creates indexes for foreign keys when you create the table. This happens for all database engines currently supported.
Turning on the foreign key PRAGMA is going to slow things down; why would it be otherwise?
For best performance, do not create any indexes on the table you are bulk-loading into. Load the data, then create the indexes. This is much much less work for the database.
As you noted, disabling auto increment for the bulk-load speeds things up.
Other information:
Use PRAGMA journal_mode=wal;
Use PRAGMA synchronous=0;
Use PRAGMA locking_mode=EXCLUSIVE;
Those are some good settings for loading in a bunch of data. Check the sqlite docs for more info:
http://sqlite.org/pragma.html
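As a rough sketch (not the poster's exact setup), those pragmas can be passed when the database is opened in recent peewee versions, or issued with db.execute_sql(); combined with the chunked insert_many from the question it might look like this (the database path is a placeholder and the model/list names are adapted from the question):
from peewee import SqliteDatabase

db = SqliteDatabase('genes.db', pragmas={      # 'genes.db' is a placeholder path
    'journal_mode': 'wal',
    'synchronous': 0,
    'locking_mode': 'EXCLUSIVE'})

step = 200                                     # rows per INSERT, as in the question
with db.atomic():                              # one transaction for the whole load
    for i in range(0, len(expression_samples), step):
        GeneExpressionRead.insert_many(expression_samples[i:i+step]).execute()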
In all of the documentation where atomic appears as a context manager, it is called as a function (db.atomic()) rather than referenced as a bare attribute. Since it sounds like your code never exits the with block, you may not be getting the error you would otherwise expect about the missing __exit__ method.
Can you try with helper.db.atomic():?
atomic() starts a transaction. Without an open transaction, inserts are much slower because some expensive bookkeeping has to be done for every write, as opposed to only at the beginning and end.
EDIT
Since the code in the question has been changed, can I have some more information about the table you're inserting into? Is it large, and how many indexes are there?
Since this is SQLite, you're just writing to a file, but do you know if that file is on a local disk or on a network-mounted drive? I've had issues just like this because I was trying to insert into a database on an NFS mount.

Can anyone tell me what's the point of connection.commit() in Python pyodbc?

I used to be able to run and execute Python using simply an execute statement. This would insert the values 1, 2 into A, B accordingly. But starting last week, I got no error, but nothing happened in my database. No flag, nothing... 1, 2 didn't get inserted or replaced into my table.
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
I finally found an article saying that I need commit(), otherwise the changes are lost when the connection to the server goes away. So I have added:
connect.execute("REPLACE INTO TABLE(A,B) VALUES(1,2)")
connect.commit()
Now it works, but I just want to understand it a little bit: why do I need this if I know my connection did not get lost?
New to python - Thanks.
This isn't a Python or ODBC issue, it's a relational database issue.
Relational databases generally work in terms of transactions: any time you change something, a transaction is started and is not ended until you either commit or rollback. This allows you to make several changes serially that appear in the database simultaneously (when the commit is issued). It also allows you to abort the entire transaction as a unit if something goes awry (via rollback), rather than having to explicitly undo each of the changes you've made.
You can make this functionality transparent by turning auto-commit on, in which case a commit will be issued after each statement, but this is generally considered a poor practice.
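A minimal pyodbc sketch of both styles, using a placeholder connection string and the statement from the question:
import pyodbc

# Explicit transaction: changes become permanent only when commit() is called.
conn = pyodbc.connect("DSN=mydsn")             # placeholder connection string
cursor = conn.cursor()
try:
    cursor.execute("REPLACE INTO TABLE(A,B) VALUES (1, 2)")
    conn.commit()                              # make the change permanent
except Exception:
    conn.rollback()                            # undo everything since the last commit
    raise

# Autocommit: every statement is committed immediately (generally discouraged).
auto_conn = pyodbc.connect("DSN=mydsn", autocommit=True)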
Not committing puts all your queries into one transaction, which is safer (and possibly better performance-wise) when the queries are related to each other. What if the power goes out between two queries that don't make sense independently? For instance, transferring money from one account to another using two UPDATE queries.
You can set autocommit to true if you don't want it, but there aren't many reasons to do that.

what is the fastest way to update thousands of rows in MySQL

Let's assume you have a table with 1M rows and growing...
Every five minutes of every day you run a Python program which has to update some fields of 50K rows.
My question is: what is the fastest way to do the work?
run those updates in a loop and fire a cursor commit after the last one is executed?
or generate a file and then run it through the command line?
create a temp table with a huge, fast insert and then run a single update against the production table?
use prepared statements?
split it up into 1K updates per execute, to generate smaller log files?
turn off logging while running the updates?
or use a CASE expression in MySQL (but this works only up to 255 rows)?
I don't know... has anyone done something like this? What is the best practice? I need to run it as fast as possible...
Here are some ways you could speed up your UPDATEs.
When you UPDATE, the table records are just rewritten with new data. A REPLACE, on the other hand, deletes the old row and inserts a new one, so all of that work (including index maintenance) has to be done again. That's why you should always use INSERT ... ON DUPLICATE KEY UPDATE instead of REPLACE.
The former is an UPDATE operation in case of a key violation, while the latter is a DELETE / INSERT.
Here's an example: INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE c=c+1; More on this here.
UPDATE1: It's a good idea to do your inserts all in a single query. This should speed up your UPDATES. See here on how to do that.
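A sketch of doing that from Python with MySQLdb; the table, column names and connection settings are invented, and whether executemany collapses the rows into multi-row INSERT statements depends on the driver version (at worst it still runs everything inside one transaction):
import MySQLdb

conn = MySQLdb.connect(db="mydb")              # placeholder connection settings
cursor = conn.cursor()

rows = [(1, 'x'), (2, 'y')]                    # (id, new_value) pairs to apply
sql = """
    INSERT INTO production (id, value)
    VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE value = VALUES(value)
"""
cursor.executemany(sql, rows)
conn.commit()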
UPDATE2: Now that I have had a chance to read your other sub-questions, here's what I know:
Instead of a loop, try to execute all the UPDATEs in a single SQL statement and a single commit.
Not sure this is going to make any difference. The SQL queries themselves are more important.
Now this is something you could experiment with. Benchmark it. This kind of thing depends on the size of the TABLE and the INDEXES you have, plus whether it's InnoDB or MyISAM.
No idea about this.
Refer to the first point.
Yes, this might speed your stuff up slightly. Also see if you have slow_query_log turned on. This logs all slow queries to a separate logfile. Turn this off too.
Again, refer to the first point.
On query execution, the server first parses your query and then executes it. If the query only needs to be analyzed and parsed once (as with a prepared statement), the server spends less time parsing and the query executes faster than it otherwise would.

Python + MySQLDB Batch Insert/Update command for two of the same databases

I'm working with two databases, a local version and the version on the server. The server has the most up-to-date version, and instead of recopying all values in all tables from the server to my local version,
I would like to go through each table and only insert/update the values that have changed on the server into my local version.
Is there some simple method to handle such a case? Some sort of batch insert/update? Googling for the answer isn't working and I've tried my hand at coding one but am starting to get tied up in error handling...
I'm using Python and MySQLDB... Thanks for any insight
Steve
If all of your tables' records had timestamps, you could identify "the values that have changed in the server" -- otherwise, it's not clear how you plan to do that part (which has nothing to do with insert or update, it's a question of "selecting things right").
Once you have all the important values, somecursor.executemany will let you apply them all as a batch. Depending on your indexing it may be faster to put them into a non-indexed auxiliary temporary table, then insert/update from all of that table into the real one (before dropping the aux/temp one), the latter of course being a single somecursor.execute.
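A sketch of that temporary-table variant with MySQLdb; the table and column names are made up, and changed_rows stands for the list of (id, value) pairs you identified as changed:
# Stage the changed rows into an un-indexed temporary table with one executemany ...
cursor.execute("CREATE TEMPORARY TABLE changes (id INT, value VARCHAR(255))")
cursor.executemany("INSERT INTO changes (id, value) VALUES (%s, %s)", changed_rows)

# ... then apply them to the real table in a single statement.
cursor.execute("""
    INSERT INTO real_table (id, value)
    SELECT id, value FROM changes
    ON DUPLICATE KEY UPDATE value = VALUES(value)
""")
cursor.execute("DROP TEMPORARY TABLE changes")
conn.commit()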
You can reduce wall-clock time for the whole job by using one (or a few) threads to do the selects and put the results onto a Queue.Queue, and a few worker threads to apply results plucked from the queue into the internal/local server. (Best balance of reading vs writing threads is best obtained by trying a few and measuring -- writing per se is slower than reading, but your bandwidth to your local server may be higher than to the other one, so it's difficult to predict).
However, all of this is moot unless you do have a strategy to identify "the values that have changed in the server", so it's not necessarily very useful to enter into more discussion about details "downstream" from that identification.
