Psycopg2, PostgreSQL, Python: Fastest way to bulk-insert

I'm looking for the most efficient way to bulk-insert some millions of tuples into a database. I'm using Python, PostgreSQL and psycopg2.
I have created a long list of tuples that should be inserted into the database, sometimes with modifiers like geometric Simplify.
The naive way to do it would be string-formatting a list of INSERT statements, but there are three other methods I've read about:
Using pyformat binding style for parametric insertion
Using executemany on the list of tuples, and
Writing the results to a file and using COPY.
It seems that the first way is the most efficient, but I would appreciate your insights and code snippets telling me how to do it right.

Yeah, I would vote for COPY, provided you can write a file to the server's hard drive (not the drive the app is running on), as COPY will only read off the server.

There is a new psycopg2 manual containing examples for all the options.
The COPY option is the most efficient, followed by executemany, and then execute with pyformat.
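For reference, here is a minimal sketch of the COPY route using psycopg2's copy_from and an in-memory buffer; the table name, columns and connection string are placeholders, not from the question:
import io
import psycopg2

conn = psycopg2.connect("dbname=mydatabase")
cur = conn.cursor()

rows = [(1, "alice"), (2, "bob")]  # your tuples
buf = io.StringIO()                # build a tab-separated buffer in memory
for r in rows:
    buf.write("\t".join(map(str, r)) + "\n")
buf.seek(0)

# COPY the buffer into the (hypothetical) target table
cur.copy_from(buf, "mytable", columns=("id", "name"))
conn.commit()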

In my experience, executemany is not any faster than running many inserts yourself;
the fastest way is to format a single INSERT with many values yourself. Maybe executemany will improve in the future, but for now it is quite slow.
I subclass a list and overload the append method, so when the list reaches a certain size I format the INSERT and run it, roughly as sketched below.
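The answerer's exact code isn't shown, but a rough sketch of that idea, with an invented table mytable(id, name) and mogrify used to build the multi-row VALUES list, might look like this:
class InsertBuffer(list):
    """Collect rows and flush them as one multi-row INSERT once a size limit is hit."""

    def __init__(self, cur, limit=1000):
        super().__init__()
        self.cur = cur
        self.limit = limit

    def append(self, row):
        super().append(row)
        if len(self) >= self.limit:
            self.flush()

    def flush(self):
        if not self:
            return
        # mogrify safely quotes each tuple; join them into one INSERT statement
        values = ",".join(self.cur.mogrify("(%s, %s)", row).decode() for row in self)
        self.cur.execute("INSERT INTO mytable (id, name) VALUES " + values)
        del self[:]
Remember to call flush() one final time at the end and then commit the connection.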

You could use a new upsert library:
$ pip install upsert
(you may have to pip install decorator first)
import psycopg2
from upsert import Upsert  # assuming the library exposes Upsert at the top level

conn = psycopg2.connect('dbname=mydatabase')
cur = conn.cursor()
upsert = Upsert(cur, 'mytable')
for (selector, setter) in myrecords:
    upsert.row(selector, setter)
Where selector is a dict object like {'name': 'Chris Smith'} and setter is a dict like { 'age': 28, 'state': 'WI' }
It's almost as fast as writing custom INSERT[/UPDATE] code and running it directly with psycopg2... and it won't blow up if the row already exists.

The newest way of inserting many items is using the execute_values helper (https://www.psycopg.org/docs/extras.html#fast-execution-helpers).
from psycopg2.extras import execute_values

insert_sql = "INSERT INTO table (id, name, created) VALUES %s"
# the value template is optional
value_template = "(%s, %s, to_timestamp(%s))"

cur = conn.cursor()  # conn is an existing psycopg2 connection
items = []
items.append((1, "name", 123123))
# append more...
execute_values(cur, insert_sql, items, value_template)
conn.commit()

Anyone using SQLAlchemy could try version 1.2, which added support for bulk insert using psycopg2.extras.execute_batch() instead of executemany when you initialize your engine with use_batch_mode=True, like:
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)
http://docs.sqlalchemy.org/en/latest/changelog/migration_12.html#change-4109
Then anyone who has to use SQLAlchemy won't need to bother trying different combinations of SQLAlchemy, psycopg2, and direct SQL together.
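Once batch mode is enabled, a plain Core-style executemany insert goes through execute_batch() under the hood. A small sketch, where MyModel and the data are made up:
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    use_batch_mode=True)

with engine.begin() as conn:  # commits on success, rolls back on error
    # one statement, many parameter sets -> batched by psycopg2
    conn.execute(
        MyModel.__table__.insert(),
        [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    )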

After some testing, unnest often seems to be an extremely fast option, as I learned from Clodoaldo Neto's answer to a similar question.
data = [(1, 100), (2, 200), ...]  # list of tuples

cur.execute("""CREATE TABLE table1 AS
    SELECT u.id, u.var1
    FROM unnest(%s) u(id INT, var1 INT)""", (data,))
However, it can be tricky with extremely large data.

The first and the second would be used together, not separately. The third would be the most efficient server-wise though, since the server would do all the hard work.

A very related question: Bulk insert with SQLAlchemy ORM
All roads lead to Rome, but some of them cross mountains or require ferries; if you want to get there quickly, just take the motorway.
In this case the motorway is to use the execute_batch() feature of psycopg2. The documentation says it best:
The current implementation of executemany() is (using an extremely charitable understatement) not particularly performing. These functions can be used to speed up the repeated execution of a statement against a set of parameters. By reducing the number of server roundtrips the performance can be orders of magnitude better than using executemany().
In my own test execute_batch() is approximately twice as fast as executemany(), and gives the option to configure the page_size for further tweaking (if you want to squeeze the last 2-3% of performance out of the driver).
The same feature can easily be enabled if you are using SQLAlchemy by setting use_batch_mode=True as a parameter when you instantiate the engine with create_engine()
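A minimal execute_batch() sketch, with placeholder table, columns and data:
from psycopg2.extras import execute_batch

sql = "INSERT INTO mytable (id, name) VALUES (%s, %s)"
params = [(1, "alice"), (2, "bob")]  # thousands of tuples in practice

cur = conn.cursor()  # conn is an existing psycopg2 connection
execute_batch(cur, sql, params, page_size=1000)
conn.commit()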

Related

Doing a bulk copy of dictionary list to postgres using sqlalchemy

I have a list of dictionary list_dict like the one below:
[{u'default_bhp': False, u'price_period': u'Monthly'},{u'default_bhp': False, u'price_period': u'Yearly'}]
At the moment I am inserting this to my db using:
conn = engine.connect()
conn.execute(ModelClass.__table__.insert(), list_dict)
I just wanted to check if there is a faster way to insert data to the database, because I have a huge amount of data.
Can we use a bulk copy or something here?
How do I use the use_batch_mode functionality? Something like below:
engine = create_engine('postgresql+psycopg2://postgres:postgres@localhost/test_db', use_batch_mode=True)
conn = engine.connect()
conn.execute_batch(ModelClass.__table__.insert(), list_dict)
Consider using bulk_insert_mappings (if you don't know about this already); it is probably the closest to what you want to achieve.
But if you actually have a lot of data, eagerly unpacking it into a list of dicts may not be the best way to go, so you may need grouper/chunk management on top of bulk_insert_mappings, as sketched below.
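A sketch of bulk_insert_mappings with simple chunking; the chunk size is arbitrary, while ModelClass, engine and list_dict come from the question:
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)
session = Session()

chunk_size = 10000  # tune to your data and memory
for start in range(0, len(list_dict), chunk_size):
    session.bulk_insert_mappings(ModelClass, list_dict[start:start + chunk_size])
session.commit()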

Why is fetching a postgreSQL table so much slower in Python than in R?

I have to carry out some statistical treatments on data that is stored in PostgreSQL tables. I have been hesitating between using R and Python.
With R I use the following code:
require("RPostgreSQL")
(...) #connection to the database, etc
my_table <- dbGetQuery(con, "SELECT * FROM some_table;")
which is very fast: it takes only 5 seconds to fetch a table with ~200,000 rows and 15 columns and almost no NULLs in it.
With Python, I use the following code:
import psycopg2
conn = psycopg2.connect(conn_string)
cursor = conn.cursor()
cursor.execute("SELECT * FROM some_table;")
my_table = cursor.fetchall()
and surprisingly, it causes my Python session to freeze and my computer to crash.
As I use these libraries as "black boxes", I don't understand why something that is so quick in R can be that slow (and thus almost impossible for practical use) in Python.
Can someone explain this difference in performance, and can someone tell if there exists a more efficient method to fetch a pgSQL table in Python?
I am no expert in R, but very obviously what dbGetQuery() (actually: what dbFetch()) returns is a lazy object that will not load all results in memory, else it would of course take ages and eat all your RAM too.
With Python / psycopg, you definitely don't want to fetchall() a huge dataset. The proper solution here is to use a server-side cursor and iterate over it, as sketched below.
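A minimal sketch of a server-side (named) cursor in psycopg2; the cursor name and itersize are arbitrary:
import psycopg2

conn = psycopg2.connect(conn_string)
cursor = conn.cursor("my_server_side_cursor")  # naming it makes it server-side
cursor.itersize = 10000                        # rows fetched per network round trip
cursor.execute("SELECT * FROM some_table;")
for row in cursor:
    pass  # process one row at a time instead of loading the whole table into RAM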
Edit - answering the questions in your comments:
so the option cursor_factory=psycopg2.extras.DictCursor when executing fetchall() does the trick, right?
Not at all. As spelled out in the example I linked to, what "does the trick" is using a server-side cursor, which is done (in psycopg) by naming the cursor:
HERE IS THE IMPORTANT PART, by specifying a name for the cursor
psycopg2 creates a server-side cursor, which prevents all of the
records from being downloaded at once from the server.
cursor = conn.cursor('cursor_unique_name')
The DictCursor stuff is actually irrelevant (and should not be mentioned in this example, since it obviously confuses newcomers).
I have a side question regarding the concept of lazy object (the one returned in R). How is it possible to return the object as a data-frame without storing it in my RAM? I find it a bit magical.
As I mentioned, I don't know zilch about R and its implementation (I deduce that whatever dbFetch returns is a lazy object from the behaviour you describe), but there's nothing magical about an object that lazily fetches values from an external source. Python's file object is a well-known example:
with open("/some/huge/file.txt") as f:
    for line in f:
        print(line)
In the above snippet, the file object f fetches data from disk only when needed. All that needs to be stored is the file pointer position (and a buffer of the last N bytes that were read from disk, but that's an implementation detail).
If you want to learn more, read about Python's iterables and iterators.

How to bulk insert data to mysql with python

Currently I'm using SQLAlchemy as an ORM, and I'm looking for a way to speed up my insert operation; I have a bundle of XML files to import.
for name in names:
    p = Product()
    p.name = "xxx"
    session.add(p)
    session.commit()
I use the above code to insert data parsed from the batch of XML files into MySQL, and it's very slow.
I also tried:
for name in names:
    p = Product()
    p.name = "xxx"
    session.add(p)
    session.commit()
but it didn't seem to change anything.
You could bypass the ORM for the insertion operation and use the SQL Expression generator instead.
Something like:
conn.execute(Product.__table__.insert(), [dict(name=name) for name in names])
That should create a single statement to do your inserting.
That example was taken from lower down the same page.
(I'd be interested to know what speedup you got from that)
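For completeness, a slightly fuller sketch of that Core-level insert; the connection URL is a placeholder, while Product and names come from the question:
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

with engine.begin() as conn:  # commits on success, rolls back on error
    conn.execute(
        Product.__table__.insert(),
        [{"name": name} for name in names],
    )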

Join with Python's SQLite module is slower than doing it manually

I am using pythons built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, the result contains about 150000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
"FROM Proteins, Annotations " +
"WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins ").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
There is a discussion about it here: http://www.mail-archive.com/python-list@python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is advice on how to make your queries faster:
make sure that you do have indices on the join columns
use pysqlite
You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)
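If you would rather create the indexes from Python, a small sketch (the database file name is assumed):
import sqlite3

conn = sqlite3.connect("proteins.db")
cur = conn.cursor()
cur.execute("CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)")
cur.execute("CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)")
conn.commit()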
I wanted to update this because I am noticing the same issue and it is now 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100000 rows * >200 columns). In particular, I have noticed that my 3 table inner join clocks in around ~12 minutes of run time in python, whereas running the same join query in sqlite3 from the CLI runs in ~100 seconds. All the join predicates are properly indexed and the EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due so this is my fix.
Write out the query to an .sql file (I am using f-strings to pass variables in so I used an example with {foo} here)
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Run os.system from inside python and send the .sql file to sqlite3
os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connection before running this so you don't end up locked out and you'll have to re-instantiate any connection objects afterward if you're going back to working in sqlite within python.
Hope this helps and if anyone has figured the source of this out, please link to it!

Large Sqlite database search

How is it possible to implement an efficient large Sqlite db search (more than 90000 entries)?
I'm using Python and SQLObject ORM:
import re
...
def search1():
    cr = re.compile(ur'foo')
    for item in Item.select():
        if cr.search(item.name) or cr.search(item.skim):
            print item.name
This function runs in more than 30 seconds. How should I make it run faster?
UPD: The test:
for item in Item.select():
    pass
... takes almost the same time as my initial function (0:00:33.093141 to 0:00:33.322414). So the regexps eat no time.
A Sqlite3 shell query:
select '' from item where name like '%foo%';
runs in about a second. So the main time consumption happens due to the inefficient ORM's data retrieval from db. I guess SQLObject grabs entire rows here, while Sqlite touches only necessary fields.
The best way would be to rework your logic to do the selection in the database instead of in your python program.
Instead of doing Item.select(), you should rework it to do Item.select("""name LIKE ....
If you do this, and make sure you have the name and skim columns indexed, it will return very quickly. 90000 entries is not a large database.
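For example, a sketch assuming SQLObject accepts a raw SQL where-clause string (as the truncated example above suggests); 'foo' is just an illustrative pattern:
for item in Item.select("name LIKE '%foo%' OR skim LIKE '%foo%'"):
    print(item.name)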
30 seconds to fetch 90,000 rows might not be all that bad.
Have you benchmarked the time required to do the following?
for item in Item.select():
    pass
Just to see if the time is DB time, network time or application time?
If your SQLite DB is physically very large, you could be looking at -- simply -- a lot of physical I/O to read all that database stuff in.
If you really need to use a regular expression, there's not really anything you can do to speed that up tremendously.
The best thing would be to write an sqlite function that performs the comparison for you in the db engine, instead of Python.
You could also switch to a db server like PostgreSQL that has support for SIMILAR TO.
http://www.postgresql.org/docs/8.3/static/functions-matching.html
I would definitely take Reed's suggestion to pass the filter to the SQL (forget the index part though).
I do not think that selecting only specified fields versus all fields makes a difference (unless you do have a lot of large fields). I would bet that SQLObject creates/instantiates 80K objects and puts them into a Session/UnitOfWork for tracking. This could definitely take some time.
Also, if you do not need the objects in your session, there must be a way to select just the fields you need using custom-query creation, so that no Item objects are created, but only tuples.
Initially, doing regex via Python was considered for y_serial, but that was dropped in favor of SQLite's GLOB (which is far faster).
GLOB is similar to LIKE except that its syntax is more conventional: * instead of %, ? instead of _.
See the Endnotes at http://yserial.sourceforge.net/ for more details.
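A small sqlite3 sketch of a GLOB filter against the question's item table (the database file name is assumed):
import sqlite3

conn = sqlite3.connect("items.db")
cur = conn.cursor()
# GLOB uses * and ? wildcards and is case-sensitive, unlike LIKE
cur.execute("SELECT name FROM item WHERE name GLOB ? OR skim GLOB ?", ("*foo*", "*foo*"))
for (name,) in cur:
    print(name)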
Given your example and expanding on Reed's answer, your code could look a bit like the following:
import re
import sqlalchemy.sql.expression as expr
...
def search1():
    searchStr = ur'foo'
    whereClause = expr.or_(itemsTable.c.nameColumn.contains(searchStr), itemsTable.c.skimColumn.contains(searchStr))
    for item in itemsTable.select().where(whereClause):
        print item.name
which translates to
SELECT * FROM items WHERE name LIKE '%foo%' or skim LIKE '%foo%'
This will have the database do all the filtering work for you instead of fetching all 90000 records and doing possibly two regex operations on each record.
You can find some info on the .contains() method here.
As well as the SQLAlchemy SQL Expression Language Tutorial here.
Of course the example above assumes variable names for your itemsTable and the column it has (nameColumn and skimColumn).
