I can do very efficient bulk inserts in SQLite3 from Python (2.7) with this code:
cur.executemany("INSERT INTO " + tableName + " VALUES (?, ?, ?, ?);", data)
But I can't get updates to work efficiently. I thought it might be a problem of the database structure/indexing, but even on a test database with only one table of 100 rows, the update still takes about 2-3 seconds.
I've tried different code variations. The latest code I have is from this answer to a previous question about update and executemany, but it's just as slow for me as any other attempt I've made:
data = []
for s in sources:
    source_id = s['source_id']
    val = get_value(s['source_attr'])
    x = [val, source_id]
    data.append(x)
cur.executemany("UPDATE sources SET source_attr = ? WHERE source_id = ?", data)
con.commit()
How could I improve this code to do a big bulk update efficiently?
When inserting a record, the database just needs to write a row at the end of the table (unless you have something like UNIQUE constraints).
When updating a record, the database needs to find the row. This requires scanning through the entire table (for each command), unless you have an index on the search column:
CREATE INDEX whatever ON sources(source_id);
But if source_id is the primary key, you should just declare it as such (which creates an implicit index):
CREATE TABLE sources(
    source_id INTEGER PRIMARY KEY,
    source_attr TEXT,
    [...]
);
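A rough sketch of the effect in Python (the sources schema is the one from the question; the file name, index name and sample values here are made up):

import sqlite3

con = sqlite3.connect("example.db")
cur = con.cursor()

# Skip this if source_id is already declared INTEGER PRIMARY KEY,
# which carries an implicit index.
cur.execute("CREATE INDEX IF NOT EXISTS idx_sources_source_id ON sources(source_id)")

# (source_attr, source_id) pairs, as built in the question's loop.
data = [("new value", 1), ("another value", 2)]

# Each UPDATE now does an indexed lookup instead of a full table scan.
cur.executemany("UPDATE sources SET source_attr = ? WHERE source_id = ?", data)
con.commit()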
Instead of using a JSON file to store data, I've decided I wanted to use a database instead. Here is how the data currently looks inside of the JSON file:
{"userID": ["reason 1", "reason 2", "reason 3"]}
I made it so that after a certain amount of time a reason is removed. For example, "reason 2" will be removed after 12 hours of it being added. However, I realised that if I terminate the process and then run it again the reason would just stay there until I manually remove it.
I've decided to use sqlite3 to make a database and have a discord.py task loop to remove it for me. How can I replicate the dictionary inside the database? Here is what I'm thinking at the moment:
import sqlite3

c = sqlite3.connect('file_name.db')
cursor = c.cursor()
cursor.execute("""CREATE TABLE table_name (
    userID text,
    reason blob
)""")
Try the following table to store the reasons:
CREATE TABLE reasons (
    reason_id PRIMARY KEY,
    user_id,
    reason,
    is_visible,
    created_at
)
Then the reasons could be soft deleted for every user by running:
UPDATE reasons
SET is_visible = 0
WHERE created_at + 3600 < CAST( strftime('%s', 'now') AS INT )
The example shows hiding reasons after 1 hour (3600 seconds).
The reasons can be hard deleted later by running the following query:
DELETE FROM reasons
WHERE is_visible = 0
The soft delete comes in handy for verification and getting data back in case of a future bug in the software.
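As a rough sketch (not part of the original answer), both statements could be run from Python on a schedule, for example from the discord.py task loop mentioned in the question; the file name matches the question's snippet and the cutoff is an assumption:

import sqlite3

DB_PATH = 'file_name.db'

def soft_delete_old_reasons(max_age_seconds=3600):
    # Hide reasons older than max_age_seconds (1 hour here, as in the example above).
    con = sqlite3.connect(DB_PATH)
    con.execute(
        "UPDATE reasons SET is_visible = 0 "
        "WHERE created_at + ? < CAST(strftime('%s', 'now') AS INT)",
        (max_age_seconds,),
    )
    con.commit()
    con.close()

def hard_delete_hidden_reasons():
    # Run this later, on a slower schedule, so hidden rows stay around
    # long enough to verify them or recover the data.
    con = sqlite3.connect(DB_PATH)
    con.execute("DELETE FROM reasons WHERE is_visible = 0")
    con.commit()
    con.close()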
Simply use nested loops to insert each reason into a row of the table.
insert_query = """INSERT INTO table_name (userID, reason) VALUES (?, ?)"""
for user, reasons in json_data.items():
    for reason in reasons:
        cursor.execute(insert_query, (user, reason))
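If the JSON file is large, the same migration can also be done in one call with executemany(); a sketch, assuming the same json_data dict, cursor and connection c as above:

rows = [(user, reason)
        for user, reasons in json_data.items()
        for reason in reasons]
cursor.executemany("INSERT INTO table_name (userID, reason) VALUES (?, ?)", rows)
c.commit()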
I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using a loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way do accomplish this task?
What if my cars dictionary were empty? No rows would be inserted so what would the last_insert_id() return?
My tables...
Table: makes
pkey bigint autoincrement primary_key
make varchar(255) not_null
Table: models
pkey bigint autoincrement primary_key
make_key bigint not null
model varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = list(cars.keys())
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sqldata)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
for model in cars[car]:
sql_data.append((this_key, car))
this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. I have found no straightforward way other than LAST_INSERT_ID(), and that works only for a single-row insert.
So, I have developed the following approach to do a batch of "normalization" inserts. This is where you have a table that maps strings to ids (and the strings are likely to show up repeatedly). It takes two steps: first a batch insert of the "new" strings, then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
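A minimal Python sketch of that two-step idea, using the makes/models tables from the question (this is not the linked code; it assumes a UNIQUE index on makes.make, and it skips the careful handling of multiple connections and burned ids mentioned above):

# Step 1: batch-insert only the strings that are not there yet.
# INSERT IGNORE relies on the UNIQUE index on makes.make to skip duplicates.
cursor.executemany("INSERT IGNORE INTO makes (make) VALUES (%s)",
                   [(make,) for make in cars])

# Step 2: read back the ids for all the strings we need.
placeholders = ", ".join(["%s"] * len(cars))
cursor.execute("SELECT make, pkey FROM makes WHERE make IN (%s)" % placeholders,
               list(cars.keys()))
make_ids = dict(cursor.fetchall())

# Build the child rows with real foreign keys instead of guessing them.
model_rows = [(make_ids[make], model)
              for make, models in cars.items()
              for model in models]
cursor.executemany("INSERT INTO models (make_key, model) VALUES (%s, %s)", model_rows)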
I need to store a few tables of strings in Python (each table contains a few million records). Let the header be ("A", "B", "C"), with ("A", "B") being the data's primary key. Then I need the following operations to be fast:
Add new record (need O(1) complexity).
Find / update, delete record with (A="spam", B="eggs") (need O(1) complexity).
Find all records with (A="spam", C="foo") (need O(k) complexity, where k is the number of result rows).
I see a solution based on a nested-dict structure for each index. It fits my needs, but I think there is a better existing solution.
As suggested in the comments, use a database. sqlite3 is small and fairly easy: it creates a database that lives in a single file, and you interact with it through SQL.
Here is an adapted example from the API documentation:
import sqlite3

# Connect to your database (or create it if it was not there)
db = sqlite3.connect('data.db')

# Create the table
cur = db.cursor()
cur.execute("""
    CREATE TABLE my_table (
        A text,
        B text,
        C text
    )
""")

# Add an entry to the db and persist it
cur.execute("INSERT INTO my_table VALUES ('spam','eggs','foo')")
db.commit()

# Read all the entries under a condition
for row in cur.execute("SELECT * FROM my_table WHERE A='spam' AND C='foo'"):
    print(row)

# Safely close the db connection
db.close()
Note: example is in python3
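To get the fast lookups the question asks for, the table also needs indexes. A sketch (same schema and file as above; the index names are made up):

import sqlite3

db = sqlite3.connect('data.db')
cur = db.cursor()

# (A, B) is the data's primary key, so back it with a unique index;
# lookups, updates and deletes by (A, B) become index seeks instead of scans.
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_ab ON my_table (A, B)")

# Queries filtering on (A, C) get their own index, so they touch only
# the k matching rows rather than the whole table.
cur.execute("CREATE INDEX IF NOT EXISTS idx_ac ON my_table (A, C)")

db.commit()
db.close()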
I am using SQLite (sqlite3), interfaced with Python, to hold parameters in a table which I use for processing a large amount of data. Suppose I have already populated the table initially, but then change the parameters and want to update the table. If I create a Python list holding the updated parameters, for every row and column in the table, how do I update the table?
I have looked here and here (though the latter refers to C++ as opposed to Python) but these don't really answer my question.
To make this concrete, I show some of my code below:
import sqlite3 as sql
import numpy as np
db = sql.connect('./db.sq3')
cur = db.cursor()
#... Irrelevant Processing Code ...#
cur.execute("""CREATE TABLE IF NOT EXISTS process_parameters (
parameter_id INTEGER PRIMARY KEY,
exciton_bind_energy REAL,
exciton_bohr_radius REAL,
exciton_mass REAL,
exciton_density_per_QW REAL,
box_trap_side_length REAL,
electron_hole_overlap REAL,
dipole_matrix_element REAL,
k_cutoff REAL)""")
#Parameter list
process_params = [(E_X/1.6e-19, a_B/1e-9, m_exc/9.11e-31, 1./(np.sqrt(rho_0)*a_B), D/1e-6, phi0/1e8, d/1e-28, k_cut/(1./a_B)) for i in range(0,14641)]
#Check to see if table is populated or not
count = cur.execute("""SELECT COUNT (*) FROM process_parameters""").fetchone()[0]
#If it's not, fill it up
if count == 0:
    cur.executemany("""INSERT INTO process_parameters VALUES(NULL, ?, ?, ?, ?, ?, ?, ?, ?);""", process_params)
    db.commit()
Now, suppose that on a subsequent processing run I change one or more of the parameters in process_params. What I'd like is that on any subsequent run, Python updates the database with the most recent version of the parameters. So I do:
else:
    cur.executemany("""UPDATE process_parameters SET exciton_bind_energy=?, exciton_bohr_radius=?, exciton_mass=?, exciton_density_per_QW=?, box_trap_side_length=?, electron_hole_overlap=?, dipole_matrix_element=?, k_cutoff=?;""", process_params)
    db.commit()

db.close()
But when I do this, the script seems to hang (or be going very slowly) such that Ctrl+C doesn't even quit the script (being run via ipython).
I know that in this case updating every row from a huge Python list may seem unnecessary, but it's the principle I want to clarify, since at another time I may not be updating every row with the same values. If someone could help me understand what's happening and/or how to fix this, I'd really appreciate it. Thank you.
cur.executemany("""
UPDATE process_parameters SET
exciton_bind_energy=?,
exciton_bohr_radius=?,
exciton_mass=?,
exciton_density_per_QW=?,
box_trap_side_length=?,
electron_hole_overlap=?,
dipole_matrix_element=?,
k_cutoff=?
;
""", process_params)
You forgot the WHERE clause. Without a WHERE clause, the UPDATE statement updates every row in the table. Since you provide 14641 parameter sets, SQLite performs 14641 (parameter sets) × 14641 (rows in the table) ≈ 214 million row updates, which is why it is so slow.
The proper way is to update only the relevant row every time:
cur.executemany("""
UPDATE process_parameters SET
exciton_bind_energy=?,
exciton_bohr_radius=?,
exciton_mass=?,
exciton_density_per_QW=?,
box_trap_side_length=?,
electron_hole_overlap=?,
dipole_matrix_element=?,
k_cutoff=?
WHERE parameter_id=?
-- ^~~~~~~~~~~~~~~~~~~~ don't forget this
;
""", process_params)
Of course, this means each entry in process_params must also include the parameter ID (as the last element, matching the final ?), and you need to modify the INSERT statement to insert the parameter ID as well.
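For example (a sketch, reusing the parameter expressions from the question and simply adding an explicit ID):

# Explicit IDs 1..14641 so the same list serves both INSERT and UPDATE.
process_params = [(i + 1, E_X/1.6e-19, a_B/1e-9, m_exc/9.11e-31,
                   1./(np.sqrt(rho_0)*a_B), D/1e-6, phi0/1e8,
                   d/1e-28, k_cut/(1./a_B))
                  for i in range(14641)]

if count == 0:
    # Insert the ID explicitly instead of NULL.
    cur.executemany("INSERT INTO process_parameters VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
                    process_params)
else:
    # The UPDATE expects the ID last, so rotate each tuple accordingly.
    cur.executemany("""
        UPDATE process_parameters SET
            exciton_bind_energy=?, exciton_bohr_radius=?, exciton_mass=?,
            exciton_density_per_QW=?, box_trap_side_length=?,
            electron_hole_overlap=?, dipole_matrix_element=?, k_cutoff=?
        WHERE parameter_id=?
    """, [p[1:] + (p[0],) for p in process_params])
db.commit()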
I have a simple table in mysql with the following fields:
id -- Primary key, int, autoincrement
name -- varchar(50)
description -- varchar(256)
Using MySQLdb, a python module, I want to insert a name and description into the table, and get back the id.
In pseudocode:
db = MySQLdb.connection(...)
queryString = "INSERT into tablename (name, description) VALUES" % (a_name, a_desc);"
db.execute(queryString);
newID = ???
I think it might be
newID = db.insert_id()
Edit by Original Poster
Turns out, in the version of MySQLdb that I am using (1.2.2), you would do the following:
conn = MySQLdb.connect(host=...)
c = conn.cursor()
c.execute("INSERT INTO...")
newID = c.lastrowid
I am leaving this as the correct answer, since it got me pointed in the right direction.
I don't know if there's a MySQLdb specific API for this, but in general you can obtain the last inserted id by SELECTing LAST_INSERT_ID()
It is on a per-connection basis, so you don't risk race conditions if some other client performs an insert as well.
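A minimal sketch of that approach with MySQLdb (table and variable names taken from the question):

cursor = conn.cursor()
cursor.execute("INSERT INTO tablename (name, description) VALUES (%s, %s)",
               (a_name, a_desc))

# LAST_INSERT_ID() is per connection, so concurrent inserts from other
# clients cannot change the value we read back here.
cursor.execute("SELECT LAST_INSERT_ID()")
new_id = cursor.fetchone()[0]
conn.commit()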
You could also call:
conn.insert_id()
The easiest way of all is to wrap your insert with a select count query into a single stored procedure and call that in your code. You would pass in the parameters needed to the stored procedure and it would then select your row count.