How to scale psycopg2 insert and select with a single process in Python?

It takes an average of about 0.300095081329 seconds for my insert to go through and finish the commit to Postgres.
Here is my table pattern
id_table
    latest_update_id (primary index)
    product_id (index)
    publish_date
product_meta_table
    latest_update_id (index)
    product_id (index)
    meta_related_info1
    meta_related_info2
    ...etc
product_table
    latest_update_id (index)
    product_id (index)
    note_related_info1
    note_related_info2
    ...etc
Here are some of my inserts
db_cursor.execute("INSERT INTO id_table (product_id, publish_date) \
VALUES (%s, %s) RETURNING latest_update_id",
(my_dict["product_id"], my_dict["publish_date"])
)
db_cursor.execute("INSERT INTO product_table ( \
latest_update_id, \
product_id, \
note_related_info1, \
note_related_info2, \
...etc) \
VALUES (%s, %s, %s, %s) RETURNING *",
(my_dict["latest_update_id"],
my_dict["product_id"],
my_dict["note_related_info1"],
my_dict["note_related_info2"])
)
Using that insert time, my throughput is about 1/0.3 ≈ 3 qps.
I know I can scale this horizontally by adding more instances, but I want to see if I can hit at least 3000 qps.
I am thinking of using either async or threading, but I am not sure whether the GIL is going to interfere.
Is there a general good practice and technique on how to scale insert statements using psycopg2?
Thanks
Note: I am using Python 2.7
Note: the Python process is communicating with the SQL server through HTTPS
Note: the inserts to each table are staggered: table2 inserts after table1, and table3 inserts after table2. Technically, table2 and table3 only have to wait for table1 to finish its insert, because they need latest_update_id.

Do a single insert query instead of 3. Notice the triple quotes and dictionary parameter passing:
insert_query = """
with i as (
insert into id_table (product_id, publish_date)
values (%(product_id)s, %(publish_date)s)
returning latest_update_id
)
insert into product_table (
latest_update_id,
product_id,
note_related_info1,
note_related_info2
) values (
(select latest_update_id from i),
%(product_id)s, %(note_related_info1)s, %(note_related_info2)s
)
returning *
"""
db_cursor.execute(insert_query, my_dict)
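If it helps, a minimal usage sketch of the statement above (db_cursor and my_dict come from the question; the connection object name db_connection is an assumption):

# Run the chained insert and read back the row returned by RETURNING *.
db_cursor.execute(insert_query, my_dict)
inserted_row = db_cursor.fetchone()

# One commit covers both inserts, since they run as a single statement.
db_connection.commit()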

Followup on my network comment.
Say you have 100ms roundtrip (like the time for SELECT 1).
If you want to chain queries, then you will have no other choice than to do INSERT... with tons of values to amortize the roundtrip time.
This is cumbersome, as you then will have to sort through the returned ids, to insert the dependent rows. Also, if your bandwidth is low, you will saturate it, and it won't be that fast anyway.
If your bandwidth is high enough but your ping is slow, you may be tempted to multithread... but this creates another problem...
Instead of having, say, 1-2 server processes churning through queries very fast, you'll have 50 processes sitting there doing nothing except wasting valuable server RAM while they wait for the queries to come over the slow network.
Also, concurrency and lock issues may arise. You won't do just INSERTs... You're going to do some SELECT FOR UPDATE which grabs a lock...
...and then other processes pile up to acquire that lock while your next query crawls over the network...
This feels like using MyISAM in a concurrent write-intensive scenario. Locks should be held for the shortest time possible... fast pings help; putting the whole chain of queries, from lock acquisition to lock release, inside a stored procedure is even better, so the lock is held for only a very short time.
So, consider executing your python script on the DB server, or on a server on the same LAN.
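As a concrete, hedged illustration of the "INSERT with tons of values" idea above, psycopg2's execute_values helper sends many rows in one statement and one round trip; db_cursor/db_connection and the id_table columns are taken from the question, while the rows list is made up:

from psycopg2.extras import execute_values

# Hypothetical batch of (product_id, publish_date) tuples collected client-side.
rows = [(101, "2017-01-01"), (102, "2017-01-02"), (103, "2017-01-03")]

# One round trip inserts every row and returns all generated ids.
ids = execute_values(
    db_cursor,
    "INSERT INTO id_table (product_id, publish_date) VALUES %s RETURNING latest_update_id",
    rows,
    fetch=True,  # fetch= requires psycopg2 >= 2.8
)
db_connection.commit()

The dependent product_table rows can then be built from those returned ids and inserted with another execute_values call.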

Related

Inserting into MySQL from Python using mysql.connector module - How to use executemany() to insert rows and child rows?

I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using a loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way to accomplish this task?
What if my cars dictionary were empty? No rows would be inserted so what would the last_insert_id() return?
My tables...
Table: makes
    pkey  bigint  autoincrement  primary_key
    make  varchar(255)  not_null

Table: models
    pkey      bigint  autoincrement  primary_key
    make_key  bigint  not_null
    model     varchar(255)  not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = list(cars.keys())
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sqldata)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
for model in cars[car]:
sql_data.append((this_key, car))
this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. There is LAST_INSERT_ID(), but that works only for a single-row insert; I have found no straightforward equivalent for a batch.
So, I have developed the following to do a batch of "normalization" inserts. This is where you have a table that maps strings to ids (where the string is likely to show up repeatedly). It takes 2 steps: first a batch insert of the "new" strings, then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
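A rough Python sketch of that two-step pattern applied to the question's tables, under the loud assumption that makes.make has a UNIQUE index (so INSERT IGNORE skips already-known makes); cnx is a placeholder name for the mysql.connector connection:

# Step 1: batch-insert any new makes; existing ones are skipped via the UNIQUE index.
cursor.executemany("INSERT IGNORE INTO makes (make) VALUES (%s)",
                   [(make,) for make in cars])

# Step 2: read the ids back and build the child rows from them.
# (For large tables, restrict with WHERE make IN (...) instead of reading everything.)
cursor.execute("SELECT make, pkey FROM makes")
make_ids = dict(cursor.fetchall())

model_rows = [(make_ids[make], model)
              for make, models in cars.items()
              for model in models]
cursor.executemany("INSERT INTO models (make_key, model) VALUES (%s, %s)", model_rows)
cnx.commit()

This avoids guessing at auto-increment sequences entirely, at the cost of one extra SELECT.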

Potential problems rolling back multiple-line SQL Transaction

I need to insert a CSV file into a table on SQL Server using Python (BULK INSERT is turned off). Instead of using SQLAlchemy I'm writing my own function (may God forgive me). I'm creating lists of SQL code as strings
sql_code_list = ["insert into table_name values (1,'aa'),(2,'ab'),(3,'ac')...(100,'az')",
                 "insert into table_name values (101,'ba'),(102,'bb'),(103,'bc')...(200,'bz')"]
and I plan to run them in the DB using the pyodbc package one by one. To ensure data integrity, I want to use BEGIN TRAN ... ROLLBACK / COMMIT TRAN ... syntax. So I want to send the command
DECLARE @TransactionName varchar(20) = 'TransInsert'
BEGIN TRAN @TransactionName
then send all my INSERT statements, and on success send
DECLARE @TransactionName varchar(20) = 'TransInsert'
COMMIT TRAN @TransactionName
or on failure
DECLARE @TransactionName varchar(20) = 'TransInsert'
ROLLBACK TRAN @TransactionName
There will be many INSERT statements, let's say 10,000 statements each inserting 100 rows, and I plan to send them from the same connection.cursor object but in multiple batches. Does this overall look like a correct procedure? What problems may I run into when I send these commands from a Python application?
There is no need for a named transaction here.
You could submit a transactional batch of multiple statements like this to conditionally rollback and throw on error:
SET XACT_ABORT, NOCOUNT ON;
BEGIN TRY
    BEGIN TRAN;
    <insert-statements-here>;
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    THROW;
END CATCH;
The maximum SQL Server batch size is 64K * the network packet size, and the default network packet size is 4K, so each batch may be up to 256MB by default. 10K inserts will likely fit within that limit, so you could try sending them all in a single batch and break it into multiple smaller batches only if needed.
An alternative method to insert multiple rows is with an INSERT...SELECT from a table-valued parameter source. See this answer for an example of passing a TVP value. I would expect much better performance with that technique because it avoids parsing a large batch and SQL Server internally bulk-inserts TVP data into tempdb.
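For completeness, a minimal pyodbc sketch of sending such a batch from Python; sql_code_list is the list from the question, conn_str is a placeholder connection string, and autocommit=True is assumed so the explicit BEGIN TRAN/COMMIT inside the batch controls the transaction:

import pyodbc

conn = pyodbc.connect(conn_str, autocommit=True)  # conn_str is a placeholder

batch = """
SET XACT_ABORT, NOCOUNT ON;
BEGIN TRY
    BEGIN TRAN;
    {inserts};
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    THROW;
END CATCH;
""".format(inserts=";\n    ".join(sql_code_list))

cursor = conn.cursor()
cursor.execute(batch)  # THROW surfaces as a pyodbc error if any insert fails
cursor.close()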

Speed up insertion of pandas dataframe using fast_executemany Python pyodbc

I am trying to insert data contained in a .csv file from my pc to a remote server. The values are inserted in a table that contains 3 columns, namely Timestamp, Value and TimeseriesID. I have to insert approximately 3000 rows at a time, therefore I am currently using pyodbc and executemany.
My code up to now is the one shown below:
with contextlib.closing(pyodbc.connect(connection_string, autocommit=True)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True  # new in pyodbc 4.0.19
        # Insert values into the DataTable table
        insert_df = df[["Time (UTC)", column]]
        insert_df["id"] = timeseriesID
        insert_df = insert_df[["id", "Time (UTC)", column]]
        sql = "INSERT INTO %s (%s, %s, %s) VALUES (?, ?, ?)" % (
            sqltbl_datatable, 'TimeseriesId', 'DateTime', 'Value')
        params = [i.tolist() for i in insert_df.values]
        cursor.executemany(sql, params)
As I am using pyodbc 4.0.19, I have the option fast_executemany set to True, which is supposed to speed up things. However, for some reason, I do not see any great improvement when I enable the fast_executemany option. Is there any alternative way that I could use in order to speed up insertion of my file?
Moreover, regarding the performance of the code shown above, I noticed that when I disabled the autocommit=True option and instead included a cursor.commit() call at the end, my data was imported significantly faster. Is there any specific reason why this happens that I am not aware of?
Any help would be greatly appreciated :)
Regarding the cursor.commit() speedup that you are noticing: when you use autocommit=True you are asking the code to execute one database transaction per insert. This means that the code resumes only after the database confirms the data is stored on disk. When you use cursor.commit() after the numerous INSERTs, you are effectively executing one database transaction, and the data is held in RAM in the interim (it may be written to disk, but not all of it, until you instruct the database to finalize the transaction).
The process of finalizing the transaction typically entails updating tables on disk, updating indexes, flushing logs, syncing copies, etc. which is costly. That is why you observe such a speed up between the 2 scenarios you describe.
When going the faster way, please note that until you execute cursor.commit() you cannot be 100% sure that the data is in the database, so you may need to reissue the query in case of an error (any partial transaction is going to be rolled back).
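As a hedged sketch of that single-transaction variant, reusing the names from the question's snippet (connection_string, insert_df and sqltbl_datatable are assumed to be prepared exactly as there):

import contextlib
import pyodbc

with contextlib.closing(pyodbc.connect(connection_string, autocommit=False)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True
        sql = "INSERT INTO %s (TimeseriesId, DateTime, Value) VALUES (?, ?, ?)" % sqltbl_datatable
        params = [row.tolist() for row in insert_df.values]
        cursor.executemany(sql, params)
        conn.commit()  # one transaction for the whole batch instead of one per insert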

Python sqlite3 never returns an inner join with 28 million+ rows

I have an SQLite database with two tables, each over 28 million rows. Here's the schema:
CREATE TABLE MASTER (ID INTEGER PRIMARY KEY AUTOINCREMENT,PATH TEXT,FILE TEXT,FULLPATH TEXT,MODIFIED_TIME FLOAT);
CREATE TABLE INCREMENTAL (INC_ID INTEGER PRIMARY KEY AUTOINCREMENT,INC_PATH TEXT,INC_FILE TEXT,INC_FULLPATH TEXT,INC_MODIFIED_TIME FLOAT);
Here's an example row from MASTER:
ID PATH FILE FULLPATH MODIFIED_TIME
---------- --------------- ---------- ----------------------- -------------
1 e:\ae/BONDS/0/0 100.bin e:\ae/BONDS/0/0/100.bin 1213903192.5
The tables have mostly identical data, with some differences between MODIFIED_TIME in MASTER and INC_MODIFIED_TIME in INCREMENTAL.
If I execute the following query in sqlite, I get the results I expect:
select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;
That query will pause for a minute or so, return a number of rows, pause again, return some more, etc., and finish without issue. Takes about 2 minutes to fully return everything.
However, if I execute the same query in Python:
changed_files = conn.execute("select ID from MASTER inner join INCREMENTAL on FULLPATH = INC_FULLPATH and MODIFIED_TIME != INC_MODIFIED_TIME;")
It will never return - I can leave it running for 24 hours and still have nothing. The python32.exe process doesn't start consuming a large amount of cpu or memory - it stays pretty static. And the process itself doesn't actually seem to go unresponsive - however, I can't Ctrl-C to break, and have to kill the process to actually stop the script.
I do not have these issues with a small test database - everything runs fine in Python.
I realize this is a large amount of data, but if sqlite is handling the actual queries, python shouldn't be choking on it, should it? I can do other large queries from python against this database. For instance, this works:
new_files = conn.execute("SELECT DISTINCT INC_FULLPATH, INC_PATH, INC_FILE from INCREMENTAL where INC_FULLPATH not in (SELECT DISTINCT FULLPATH from MASTER);")
Any ideas? Are the pauses in between sqlite returning data causing a problem for Python? Or is something never occurring at the end to signal the end of the query results (and if so, why does it work with small databases)?
Thanks. This is my first stackoverflow post and I hope I followed the appropriate etiquette.
Python tends to have older versions of the SQLite library, especially Python 2.x, where it is not updated.
However, your actual problem is that the query is slow.
Use the usual mechanisms to optimize it, such as creating a two-column index on INC_FULLPATH and INC_MODIFIED_TIME.
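For example, a minimal sqlite3 sketch of adding that index (the database path and index name are hypothetical):

import sqlite3

conn = sqlite3.connect("files.db")  # placeholder path to your database file

# Two-column index so the join can look up INC_FULLPATH and compare
# INC_MODIFIED_TIME without scanning all of INCREMENTAL for every MASTER row.
conn.execute("CREATE INDEX IF NOT EXISTS idx_inc_fullpath_mtime "
             "ON INCREMENTAL (INC_FULLPATH, INC_MODIFIED_TIME)")
conn.commit()
conn.close()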

Remove all data from table but last N entries

I'm using psycopg2 with Python.
I'd like to periodically flush data from my db. I've set up a task with Timer for this. I had asked this question before, but using the answer listed there freezes up my machine (the keyboard stops responding and the entire system grinds to a halt). Instead, I would like to delete all entries in my table except the last N (not sure that this is the right approach either).
Basically, there is another python process that is running (separate executable), which is populating the db that I wish to interrogate. It seems that if I delete all entries, and that other process is running, that it can lead to the freeze. I don't know of a safe way in which I can remove entries; it's almost as if the other process is relying on an incrementing ID as it writes to the db.
If anyone could help me work this out it'd be greatly appreciated. Thoughts?
A possible solution is to run a DELETE on all ids except those returned by select ... order by pk desc limit N, given an auto-incrementing pk. If no such pk exists, having a created_date column and ordering by it should do the same.
Untested example:
import psycopg2

connection = psycopg2.connect('dbname=test user=postgres')
cursor = connection.cursor()
query = """delete from my_table where id not in (
               select id from my_table order by id desc limit 30)"""
cursor.execute(query)
connection.commit()  # needed: psycopg2 does not autocommit by default
cursor.close()
connection.close()
This is probably much faster:
CREATE TEMP TABLE tbl_tmp AS
SELECT * FROM tbl ORDER BY <undisclosed> LIMIT <N>;
TRUNCATE TABLE tbl;
INSERT INTO tbl SELECT * FROM tbl_tmp;
Do it all in one session. Specifics depend on additional circumstances you did not disclose.
Compare to this related, comprehensive answer (your case is simpler):
Remove duplicates from table based on multiple criteria and persist to other table
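A hedged psycopg2 sketch of running that in one session; the ORDER BY column and N are placeholders, just like <undisclosed> and <N> in the SQL above:

import psycopg2

connection = psycopg2.connect('dbname=test user=postgres')
cursor = connection.cursor()

# All three statements run inside one transaction and commit together.
cursor.execute("CREATE TEMP TABLE tbl_tmp AS "
               "SELECT * FROM tbl ORDER BY id DESC LIMIT 30")  # id / 30 are placeholders
cursor.execute("TRUNCATE TABLE tbl")
cursor.execute("INSERT INTO tbl SELECT * FROM tbl_tmp")

connection.commit()
cursor.close()
connection.close()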
