SQLAlchemy Core bulk insert slow - python

I'm trying to truncate a table and insert only ~3000 rows of data using SQLAlchemy, and it's very slow (~10 minutes).
I followed the recommendations in this doc and used SQLAlchemy Core for the inserts, but it's still running very slowly. What are possible culprits for me to look at? The database is a Postgres RDS instance. Thanks!
import sqlalchemy as sa

engine = sa.create_engine(db_string, **kwargs, pool_recycle=3600)
with engine.begin() as conn:
    conn.execute("TRUNCATE my_table")
    conn.execute(
        MyTable.__table__.insert(),
        data  # where data is a list of dicts
    )

I was bummed when I saw this didn't have an answer... I ran into the exact same problem the other day: trying to bulk-insert millions of rows into a Postgres RDS instance using Core. It was taking hours.
As a workaround, I ended up writing my own bulk-insert script that generated the raw SQL itself:
bulk_insert_str = []
for entry in entry_list:
    val_str = "('{}', '{}', ...)".format(entry["column1"], entry["column2"], ...)
    bulk_insert_str.append(val_str)

engine.execute(
    """
    INSERT INTO my_table (column1, column2 ...)
    VALUES {}
    """.format(",".join(bulk_insert_str))
)
While ugly, this gave us the performance we needed (~500,000 rows/minute).
Did you find a Core-based solution? If not, I hope this helps!
UPDATE: I ended up moving my old script onto a spare EC2 instance that we weren't using, which actually fixed the slow performance. Not sure what your setup is, but apparently there's significant network overhead when communicating with RDS from an external (non-AWS) connection.
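For anyone reusing the raw-SQL workaround above: interpolating values straight into the statement is fragile (quoting, injection). Below is a parameterized sketch of the same multi-row insert using psycopg2's execute_values; raw_conn, the page size, and the column names are my own placeholders, not from the original post.

from psycopg2.extras import execute_values

# Sketch only: same multi-row INSERT, but with bound parameters
rows = [(entry["column1"], entry["column2"]) for entry in entry_list]
raw_conn = engine.raw_connection()  # plain psycopg2 connection from the SQLAlchemy engine
with raw_conn.cursor() as cur:
    execute_values(
        cur,
        "INSERT INTO my_table (column1, column2) VALUES %s",
        rows,
        page_size=10_000,  # rows packed into each generated statement
    )
raw_conn.commit()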

Some time ago I struggled with this problem at work, so we created a library with functions for bulk insert and update. I hope we've taken all the performance and security concerns into account. The library is open source and available on PyPI; its name is bulky.
Let me show you some examples of usage:
insert:
import bulky
from random import random

from your.sqlalchemy.models import Model
from your.sqlalchemy.session import Session

data = [
    {Model.column_float: random()}
    for _ in range(100_000_000)
]

rows_inserted = bulky.insert(
    session=Session,
    table_or_model=Model,
    values_series=data,
    returning=[Model.id, Model.column_float],
)

new_items = {row.id: row.column_float for row in rows_inserted}
update:
import bulky

from your.sqlalchemy.models import ManyToManyTable
from your.sqlalchemy.session import Session

data = [
    {
        ManyToManyTable.fk1: i,
        ManyToManyTable.fk2: j,
        ManyToManyTable.value: i + j,
    }
    for i in range(100_000_000)
    for j in range(100_000_000)
]

rows_updated = bulky.update(
    session=Session,
    table_or_model=ManyToManyTable,
    values_series=data,
    returning=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
        ManyToManyTable.value,
    ],
    reference=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
    ],
)

updated_items = {(row.fk1, row.fk2): row.value for row in rows_updated}
Not sure if links are allowed, so I'll put them under a spoiler:
Readme and PyPI

Related

Python 3 CouchDB single update vs batch

This is my first post :) so I'll apologize beforehand. I'm working on exporting data from MySQL to CouchDB; when an item has been saved, I mark the MySQL item with a recently-updated date. Below I have a Python function which takes in a JSON object one at a time, plus an id used to mark the row in MySQL locally:
def write_json_to_couchdb(json_obj, id):
    # CouchDB auto-creates doc_id and rev
    doc_id = ''
    revision_or_exception = ''
    for (success, doc_id, revision_or_exception) in db.update(json_obj):
        print(success, doc_id, revision_or_exception)

    # mark id inside the MySQL db, so we know it's been saved to CouchDB
    mysql.update_item_date(id)
The solution above works but is quite slow, both writing to CouchDB and updating MySQL. How can I use the "bulk api" or "batch api" without using curl? I believe CouchDB's db.update(item) can also take a list, like db.update(dict_of_items). How can I specify "batch ok"? Are there any other methods I'm unaware of? There seem to be few examples online.
Would this increase speed significantly? Also, how can I specify a "batch size" of, let's say, 1000 records?
Here's what I'm thinking a better solution would be:
def write_json_to_couchdb_bulk(json_obj_list, id_list):
    doc_id = ''
    revision_or_exception = ''
    for (success, doc_id, revision_or_exception) in db.update(json_obj_list):
        print(success, doc_id, revision_or_exception)

    # update added_date with current datetime
    for id in id_list:
        mysql.update_item_date(id)
Thanks,
SW
import couchdb
from couchdb import Document

def write_json_to_couchdb_bulk(json_obj_list):
    for doc in db.update(json_obj_list):
        print(repr(doc))

json_obj_list = [
    Document(type='Person', name='John Doe'),
    Document(type='Person', name='Mary Jane'),
    Document(type='City', name='Gotham City')
]

write_json_to_couchdb_bulk(json_obj_list)
Here's the solution I came up with; it's much faster.
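The question also asks about capping the batch size at around 1000 records; a chunked variant is sketched below. The slicing is my own addition, not part of the original answer, and db is assumed to be the same couchdb database object used above.

# Sketch: send the docs to CouchDB in fixed-size batches
def write_json_to_couchdb_in_batches(json_obj_list, batch_size=1000):
    for start in range(0, len(json_obj_list), batch_size):
        batch = json_obj_list[start:start + batch_size]
        for doc in db.update(batch):  # db: the couchdb database object from above
            print(repr(doc))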

Iterating over query results from sqlalchemy

I have a SQLAlchemy query function like this:
def foo():
    local_session = Session()
    results = local_session.query(
        T.x, T.y, T.z, T.a, T.b, T.c,
        T.d, T.e, T.f, T.g, T.h, T.i, T.j, T.k, T.l,
        T.m, T.n, T.o, T.p, T.q, T.r, T.s, T.t, T.u,
        T.v, User.gender
    ).join(User).filter(
        T.language == 'en', T.where_i_am_from == 'US',
        User.some_num >= 0.9
    ).limit(1000000)
    local_session.close()
    return results, results.count()
The query works fine.
And then I call this function here:
def fubar():
    raw_data, raw_data_length = myModule.foo()
    df = pd.DataFrame()
    for each in raw_data:
        df = df.append(pd.DataFrame({...}))  # add each.x etc. to df
    return df
The issue is that it won't iterate over the "for each in raw_data" loop when the .limit on my foo query is above 5000, or when I use .all() or have no limit. The program just hangs and does nothing (0% CPU usage). I've tested this both on my local SQL server and my Amazon one. When I run the SQL directly against the database it returns around 800,000 rows. Why is this happening?
I'm using the latest MySQL and the latest SQLAlchemy.
This looks like a MySQL driver problem. I would do the following, in order:
Run Python with the -v flag, like python -v yourprogram.py.
This has the potential of showing you where the program gets stuck.
Get those 800,000 results and stick them in SQLite with tables in an equivalent schema.
That's relatively cheap to do; all you have to do afterwards is change the SQLAlchemy database string. This will show you whether the problem lies with the driver or in your code.
You're doing a join between two classes (T, User) - use eager loading instead of the default lazy loading.
If you have 800,000 rows and are doing a lazy join, that may be the problem. Add a joinedload (eagerload in earlier versions of SQLAlchemy) to the query options; a minimal sketch follows below.
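The sketch below assumes T defines a relationship attribute to User (called T.user here purely for illustration; it is not shown in the question):

from sqlalchemy.orm import joinedload

# Sketch only: eager-load the related User rows in the same SELECT
# instead of lazily, one query per row. `T.user` is a hypothetical
# relationship name; adjust it to your model.
results = (
    local_session.query(T)
    .options(joinedload(T.user))
    .filter(T.language == 'en', T.where_i_am_from == 'US')
    .limit(1000)
    .all()
)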

python 5x slower than perl mySql query

I am translating code from Perl to Python.
Even though it works exactly the same, one part of the code is 5x slower in Python than in Perl, and I cannot figure out why.
Both Perl and Python run on the same machine, as does the MySQL database.
The code queries a DB to download all columns of a table and then processes each row.
There are more than 5 million rows to process, and the big issue is in getting the data from the database into the Python processing.
Here are the two code samples:
Python:
import os
import mysql.connector    # <--- changed to: import MySQLdb
import time

outDict = dict()

## DB parameters
db = mysql.connector.connect(    # <--- changed to: MySQLdb.connect(
    host=dbhost,
    user=username,    # your username
    passwd=passw,     # your password
    db=database)      # name of the database

cur = db.cursor(prepared=True)
sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;"
cur.execute(sql)

print('\t eDiVa public omics start')
s = time.time()
sz = 1000

rows = cur.fetchall()
for row in rows:
    pass    ## process out dict

print(time.time() - s)

cur.close()
db.close()
And here is the Perl equivalent script:
use strict;
use Digest::MD5 qw(md5);
use DBI;
use threads;
use threads::shared;

my $dbh = DBI->connect('dbi:mysql:'.$database.';host='.$dbhost.'', $username, $pass)
    or die "Connection Error!!\n";

my $sql = "select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;";

## prepare statement and query
my $stmt = $dbh->prepare($sql);
$stmt->execute or die "SQL Error!!\n";

my $c = 0;

## process query result
while (my @res = $stmt->fetchrow_array)
{
    $edivaStr{ $res[0].";".$res[1] } = $res[4].",".$res[2];
    $c += 1;
}
print($c."\n");

## close DB connection
$dbh->disconnect();
The runtime for these two scripts is:
~40s for the Perl script
~200s for the Python script
I cannot figure out why this happens [I tried using fetchone() or fetchmany() to see if there were memory issues, but the runtime dropped by at most 10% from the 200s].
My main problem is understanding why there is such a significant performance difference between the two functionally equivalent code blocks.
Any idea about how I can verify what is happening would be greatly appreciated.
Thanks!
UPDATE ABOUT THE SOLUTION
Peeyush's comment could be an answer, and I'd like him to post it, because it allowed me to find the solution.
The problem is the Python connector. I just swapped it for the MySQLdb module, which is a C-compiled module, and that made the Python code slightly faster than the Perl code.
I marked the changes in the Python code above with "<---" arrows to show how easy it was to gain the performance.
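For context, a sketch of what that swap looks like (this is not the author's actual code; the connection parameters are the same placeholders as in the question):

import MySQLdb
import time

# Sketch only: same query through the C-compiled MySQLdb driver
db = MySQLdb.connect(host=dbhost, user=username, passwd=passw, db=database)
cur = db.cursor()
cur.execute("select chr,pos,lengthofrepeat,copyNum,region from db.Table_simpleRepeat;")

s = time.time()
for row in cur.fetchall():
    pass    # process out dict, as in the original loop
print(time.time() - s)

cur.close()
db.close()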
cursor.fetchall() means you load all your data into memory at once, instead of fetching it lazily as needed.
Replace
rows = cur.fetchall()
for row in rows:
with
for row in cur:
I ran into the same problem. With Python and cx_Oracle, here are the performance stats from my environment -- Python takes a very long time to connect to the Oracle DB:
connect to DB, elaps:0.38108
run query, elaps:0.00092
get filename from table, elaps:8e-05
run query to read BLOB, elaps:0.00058
decompress data and write to file, elaps:0.00187
close DB connection, elaps:0.00009
Over all, elaps:0.38476
same function in Perl, elaps:0.00213
If anyone else is struggling with Python and MySQL: I think Oracle's mysql.connector for Python tends to be really slow for UPDATEs and DELETEs. I've found that mysql.connector is really fast for SELECT queries, and using .executemany() for INSERTs is blazing fast as well, but UPDATEs and DELETEs are painfully slow. The solution I've decided to go with is to move my data over to PostgreSQL, simply because I know Postgres has a really good Python library (psycopg2). Anyway, hope my feedback helps!
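For reference, a minimal sketch of the .executemany() pattern mentioned above, using mysql.connector; the connection parameters and table/column names are placeholders, not anything from the original post.

import mysql.connector

# Sketch only: batched INSERT via executemany()
conn = mysql.connector.connect(
    host="localhost", user="user", password="pw", database="mydb"  # placeholders
)
cur = conn.cursor()
cur.executemany(
    "INSERT INTO my_table (col1, col2) VALUES (%s, %s)",
    [("a", 1), ("b", 2), ("c", 3)],  # one tuple per row
)
conn.commit()
cur.close()
conn.close()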
Python for loops are quite slow. You should look into an alternative way of processing your query results.
From the Python wiki: https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops

Join with Python's SQLite module is slower than doing it manually

I am using Python's built-in sqlite3 module to access a database. My query executes a join between a table of 150,000 entries and a table of 40,000 entries, and the result contains about 150,000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
                                "FROM Proteins, Annotations " +
                                "WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
    annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all); the performance is great. I'm still wondering why this is so slow with the sqlite3 module.
There is a discussion about it here: http://www.mail-archive.com/python-list#python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is some advice on how to make your queries faster:
make sure that you have indices on the join columns
use pysqlite
You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)
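If you prefer to create them from Python before running the join, a minimal sketch (the database filename is a placeholder):

import sqlite3

# Sketch: create the suggested indexes via the sqlite3 module
conn = sqlite3.connect("proteins.db")  # placeholder filename
conn.execute("CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)")
conn.execute("CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)")
conn.commit()
conn.close()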
I wanted to update this because I am noticing the same issue, and we are now in 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100,000 rows * >200 columns). In particular, I have noticed that my 3-table inner join clocks in at around ~12 minutes of run time in Python, whereas running the same join query in sqlite3 from the CLI runs in ~100 seconds. All the join predicates are properly indexed, and EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due, so this is my fix.
Write out the query to a .sql file (I am using f-strings to pass variables in, so I used an example with {Foo} here):
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {Foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Run os.system from inside Python and send the .sql file to sqlite3:
import os
os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connections before running this so you don't end up locked out, and you'll have to re-instantiate any connection objects afterward if you're going back to working with SQLite inside Python.
Hope this helps, and if anyone has figured out the source of this, please link to it!

Memory usage with Django + SQLite3

I've got a very large SQLite table with over 500,000 rows and about 15 columns (mostly floats). I want to transfer data from the SQLite DB to a Django app (which could be backed by many RDBMSes, but Postgres in my case). Everything works OK, but as the iteration continues, memory usage jumps by 2-3 MB a second for the Python process. I've tried using 'del' to delete the EVEMapDenormalize and row objects at the end of each iteration, but the bloat continues. Here's an excerpt; any ideas?
class Importer_mapDenormalize(SQLImporter):
    def run_importer(self, conn):
        c = conn.cursor()
        for row in c.execute('select * from mapDenormalize'):
            mapdenorm, created = EVEMapDenormalize.objects.get_or_create(id=row['itemID'])
            mapdenorm.x = row['x']
            mapdenorm.y = row['y']
            mapdenorm.z = row['z']

            if row['typeID']:
                mapdenorm.type = EVEInventoryType.objects.get(id=row['typeID'])
            if row['groupID']:
                mapdenorm.group = EVEInventoryGroup.objects.get(id=row['groupID'])
            if row['solarSystemID']:
                mapdenorm.solar_system = EVESolarSystem.objects.get(id=row['solarSystemID'])
            if row['constellationID']:
                mapdenorm.constellation = EVEConstellation.objects.get(id=row['constellationID'])
            if row['regionID']:
                mapdenorm.region = EVERegion.objects.get(id=row['regionID'])

            mapdenorm.save()
        c.close()
I'm not at all interested in wrapping this SQLite DB with the Django ORM. I'd just really like to figure out how to get the data transferred without sucking all of my RAM.
Silly me, this was addressed in the Django FAQ.
I needed to clear the DB query cache while in DEBUG mode:
from django import db
db.reset_queries()
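For illustration, this is roughly where the call would go in the importer loop from the question; the every-1000-rows interval is an arbitrary choice of mine, not something from the FAQ.

from django import db

# Sketch: clear Django's DEBUG-mode query log periodically during the import
def run_importer(self, conn):
    c = conn.cursor()
    for i, row in enumerate(c.execute('select * from mapDenormalize')):
        # ... same per-row get_or_create / save work as in the question ...
        if i % 1000 == 0:
            db.reset_queries()  # drop the statements Django accumulates when DEBUG = True
    c.close()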
I think a select * from mapDenormalize that loads the whole result into memory will always be a bad idea. My advice is to split the script into chunks: use LIMIT to get the data in portions.
Get the first portion, work with it, close the cursor, and then get the next portion.
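A minimal sketch of that chunked approach with the sqlite3 module; the filename and chunk size are placeholders of mine:

import sqlite3

# Sketch: read mapDenormalize in LIMIT/OFFSET chunks instead of all at once
conn = sqlite3.connect("eve.db")  # placeholder filename
chunk_size = 5000                 # arbitrary chunk size
offset = 0
while True:
    c = conn.cursor()
    rows = c.execute(
        "SELECT * FROM mapDenormalize LIMIT ? OFFSET ?", (chunk_size, offset)
    ).fetchall()
    c.close()
    if not rows:
        break
    for row in rows:
        pass  # process the row as in the importer above
    offset += chunk_size
conn.close()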
