I have a large database with hundreds of tables. Each table has a functionally equivalent structure: one column is a unique identifier and the other a group membership indicator; that is, each row in the table has a unique identifier, but any number of rows can have the same group membership indicator. The tables are also created in pairs in the same schema, so the naming scheme for this database is project_abbreviation.<name>_<suffix>; for example, the pair proj_abc.original_a and proj_abc.original_b.
I inherited this database, and when the original developers constructed it, they did not add UNIQUE constraints to the unique identifier columns when the tables were created. As a result, whenever someone wants to change the group membership indicator for a row or set of rows in a given table, I have to add a UNIQUE constraint on the column if the table hasn't been modified since its creation. I do that programmatically:
@connect
def make_column_unique(self, cursor, connection, column, suffix):
    sql = f"ALTER TABLE {self._schema}.{self._table}_{suffix} "
    sql += f"ADD CONSTRAINT unique_{column} UNIQUE ({column});"
    cursor.execute(sql)
    connection.commit()
where @connect is a decorator which connects to the db instance, and the cursor and connection parameters are psycopg2 Cursor and Connection objects, respectively. I then call this in a try/except block:
...
for suffix in ["a", "b"]:
    try:
        self.modify_table(...)
    except (Exception, psycopg2.DatabaseError) as e:
        self.make_column_unique("uid", suffix)
        self.modify_table(...)
...
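For reference, a hypothetical version of such a @connect decorator (not the original implementation; the DSN attribute below is a placeholder) could look like this:

import functools
import psycopg2

def connect(method):
    # Hypothetical decorator, not the original: open a connection, inject a
    # cursor and the connection into the wrapped method, and always clean up.
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        connection = psycopg2.connect(self._dsn)  # placeholder DSN attribute
        try:
            with connection.cursor() as cursor:
                return method(self, cursor, connection, *args, **kwargs)
        finally:
            connection.close()
    return wrapper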
Here is the function signature for self.modify_table:
@connect
def modify_table(self, cursor, connection, data, suffix):
    sql = f"INSERT INTO {self._schema}.{self._table}_{suffix} (uid, group) "
    sql += "VALUES "
    zipped = list(zip(list(data["uid"]), list(data["group"])))
    row = 0
    for uid, group in zipped:
        row += 1
        sql += f"({uid},'{group}')" + ("," if row < len(zipped) else " ")
    sql += "ON CONFLICT (uid) DO UPDATE "
    sql += "SET group = EXCLUDED.group;"
    cursor.execute(sql)
    connection.commit()
This approach worked exceedingly well: it modified table entries properly and set the UNIQUE constraint whenever one needed to be set.
Now, when I attempt to modify a table which has yet to be modified, I get a There is no unique or exclusion constraint matching the ON CONFLICT specification error, which kicks off the call to make_column_unique. However, when the program attempts to make the provided column unique, I get back a relation "unique_<column>" already exists error. Furthermore, this only happens for tables of suffix a, not suffix b. I went into pgAdmin4 to verify, and the desired modification occurred on the table with suffix b, but before and after the database transaction, the table with suffix a had no constraints applied to it:
(Screenshot: pgAdmin database viewer showing no constraints on the table.)
Why am I getting these contradictory errors for only one type of table? It makes no sense to me to be told that a UNIQUE constraint doesn't exist, and then when I alter the table to include the constraint, to be told that it already exists.
This is the dumbest possible answer I could give to this question.
It turns out that, in my predecessor's infinite wisdom, they decided to randomly sprinkle duplicate values into columns that were supposed to be unique. Easily fixed.
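For anyone else hitting this, a hypothetical check for such duplicates (not part of the original post; the table name is the example pair from the question and the connection string is a placeholder) could look like:

import psycopg2

with psycopg2.connect("dbname=mydb") as connection:  # placeholder DSN
    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT uid, COUNT(*) AS n "
            "FROM proj_abc.original_a "  # example table from the question
            "GROUP BY uid "
            "HAVING COUNT(*) > 1;"
        )
        # Any rows returned here are the uid values that make
        # ADD CONSTRAINT ... UNIQUE fail.
        print(cursor.fetchall())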
Related
I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using a loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way to accomplish this task?
What if my cars dictionary were empty? No rows would be inserted so what would the last_insert_id() return?
My tables...
Table: makes
    pkey     bigint autoincrement primary_key
    make     varchar(255) not_null

Table: models
    pkey     bigint autoincrement primary_key
    make_key bigint not_null
    model    varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = [(make,) for make in cars]  # executemany() expects a sequence of parameter tuples
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sql_data)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
    for model in cars[car]:
        sql_data.append((this_key, model))
    this_key += 1  # assumes the generated keys are sequential, one per make
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. I have found no straightforward way to do it; LAST_INSERT_ID() comes close, but it works only for a single-row insert.
So, I have developed the following to do a batch of "normalization" inserts. This is where you have a table that maps strings to ids (where the string is likely to show up repeatedly). It takes 2 steps: first a batch insert of the "new" strings, then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
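Since the answer above deliberately gives no Python, here is a hedged Python sketch of the described two-step normalization, reusing the cars dict, cursor, and makes/models tables from the question; it assumes a UNIQUE index on makes.make so that re-inserted makes are ignored:

# Step 1: batch-insert any makes that are not already present.
# INSERT IGNORE is the simplest form; it can "burn" auto_increment ids,
# which the linked article discusses how to avoid.
make_rows = [(make,) for make in cars]
cursor.executemany("INSERT IGNORE INTO makes (make) VALUES (%s)", make_rows)

# Step 2: read back the ids that were actually assigned, then build the
# child rows from that mapping instead of guessing at LAST_INSERT_ID().
cursor.execute("SELECT make, pkey FROM makes")
make_ids = dict(cursor.fetchall())

model_rows = [(make_ids[make], model)
              for make, models in cars.items()
              for model in models]
cursor.executemany("INSERT INTO models (make_key, model) VALUES (%s, %s)", model_rows)
cursor.execute("COMMIT")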
import sqlite3

def delete_data(db_name, table, col, search_condition):
    with sqlite3.connect(db_name) as conn:
        with conn.cursor() as cur:
            code_piece = (f"FROM {table} WHERE {col}={search_condition}",)
            self.cur.execute("DELETE ?", code_piece)
Taking the above code, is the data from the function arguments sanitized, or is there still a possibility of an SQL injection attack?
Understanding qmark-Style Parameters
Here's a fix for a bunch of syntactical errors in your code example that prevent it from running:
def delete_data(db_name, table, col, search_condition):
    with sqlite3.connect(db_name) as conn:
        cur = conn.cursor()
        code_piece = (f"FROM {table} WHERE {col}={search_condition}",)
        cur.execute("DELETE ?", code_piece)
If you actually ran this function, it would throw an exception on the last line that reads something like the following:
sqlite3.OperationalError: near "?": syntax error
Why? As far as I know, you can only use qmark-style parameters for things that could slot in as a value in a valid SQL statement; you cannot use them to replace large parts of a statement, and you can't replace table names either. The piece of code closest to your intent that runs without raising an exception is the following:
def delete_data(db_name, col, search_condition):
    with sqlite3.connect(db_name) as conn:
        cur = conn.cursor()
        cur.execute("DELETE FROM TABLE_NAME WHERE ?=?;", (col, search_condition,))
However, imagine that your table had an actual column called PRICE, with integer values, and several entries had the value 5 for that column. The following statement would not delete any of them: the value of col is not interpreted as the name of a column but slotted in as a string, so you end up comparing the string 'PRICE' with the integer 5 in the WHERE clause, which is never true:
delete_data("sqlite3.db", 'PRICE', 5) # DELETE FROM TABLE_NAME WHERE 'PRICE'=5;
So really, the only thing your function can end up being is the following, which is a long way from the generic helper you were trying to write; however, it uses qmark-style parameters properly and should be safe from SQL injection:
def delete_data(db_name, search_condition):
    with sqlite3.connect(db_name) as conn:
        cur = conn.cursor()
        cur.execute("DELETE FROM TABLE_NAME WHERE PRICE=?;", (search_condition,))

delete_data("sqlite3.db", 5)  # DELETE FROM TABLE_NAME WHERE PRICE=5;
But honestly, this is great, because you really don't want functions that can end up resulting in a bunch of unpredictable queries to your database. My general advice is to just wrap each query in a simple function, and keep it all as simple as possible.
Your Original Question and SQL Injection
But let's imagine that your original code would actually run as you intended it to. There is nothing that prevents an attacker from abusing any of the parameters to alter the intended purpose of the statement: if user input affects the table parameter, it can be used to delete the content of any table; and the col and search_condition parameters could be altered to delete all entries of a table.
However, it all depends on whether an attacker has the ability to alter the parameter values through user input. It is unlikely that user input is used directly to select the table or the column to compare against, but it is quite likely that user input ends up as the value of the search_condition parameter. If so, the following function call would be possible:
delete_data(db_name, "USERS", "NAME", "Marc OR 1=1")
This would send the following query to the database, deleting every entry in the USERS table:
DELETE FROM USERS WHERE NAME=Marc OR 1=1;
So yeah, your code was still susceptible to SQL injection.
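Not part of the original answer: if the table and column genuinely have to be dynamic, one common pattern is to validate those identifiers against a hard-coded allow-list and still pass the value through a qmark parameter. A minimal sketch, with purely illustrative names:

import sqlite3

# Hypothetical allow-list of the identifiers this helper may touch.
ALLOWED_COLUMNS = {"products": {"PRICE", "NAME"}}

def delete_data(db_name, table, col, search_condition):
    if col not in ALLOWED_COLUMNS.get(table, set()):
        raise ValueError(f"table/column not allowed: {table}.{col}")
    with sqlite3.connect(db_name) as conn:
        cur = conn.cursor()
        # Identifiers are interpolated only after validation;
        # the value itself stays a bound parameter.
        cur.execute(f"DELETE FROM {table} WHERE {col} = ?", (search_condition,))

delete_data("sqlite3.db", "products", "PRICE", 5)  # DELETE FROM products WHERE PRICE = 5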
I am fairly new to PostgreSQL and would like to know about possible best practices and whether it is possible at all to automatically generate and populate tables in one schema based on tables present in another schema, possibly using triggers and functions. My reason for doing this is that I have been told that it is preferable to do calculations within the database, compared to pulling the data, running calculations and inserting them again. I should mention that I am able to do the latter in python using psycopg2.
I understand that triggers and functions may be used for automatically populating columns based on other columns within the same table, but I have not yet been able to produce code that does what I would like, therefore I am seeking help & hints here. To clarify my question I would like to describe how my database looks right now:
A schema named raw_data, populated by an arbitrary and increasing number of tables related to measurements performed at different locations:
area1 (timestamp, value)
area2 (timestamp, value)
area3 (timestamp, value)
...
Each table consists of two columns timestamp and value. New data is added continuously to each table. A table is created using the following code in python, using psycopg2 with an active connection con to the database:
table_name = schema_name + '.' + table_name.lower()
sql = ('CREATE TABLE ' + table_name + ' ('
       'timestamp varchar (19) PRIMARY KEY, '
       'value numeric (5,2) NOT NULL'
       ');')
try:
    cur = con.cursor()
    cur.execute(sql)
    con.commit()
except psycopg2.Error as e:
    con.rollback()
    print(e)
finally:
    cur.close()
My aim is to do a "live" analysis (calculations performed as soon as new values are inserted into a table in the raw_data schema) on the data available in each table in the raw_data schema. I also want to avoid altering the tables in raw_data, as I later plan to run multiple "live" analyses with different methods, all based on the data in the raw_data tables. Therefore, I would like to make a schema (named method1) that automatically generates tables inside itself, based on the tables present in the raw_data schema.
If possible, I would also like the new tables to be populated with a specified number of rows from the timestamp column, as well as values calculated from the value column of the corresponding raw_data table.
Is this even feasible, or should I stick with pulling the data, doing the calculations, and reinserting using python and psycopg2?
I would like to apologize in advance if I am unclear in my use of technical terms, as I have not received any formal training in SQL or python.
Thank you for taking the time to read my question!
You can create a new table using:
https://www.postgresql.org/docs/current/sql-createtableas.html
A generic example below:
CREATE TABLE another_schema.new_table AS
SELECT ... FROM
    some_schema.existing_table
WHERE
    <specify conditions>
LIMIT
    14400;
Not sure if it applies here, but there is also a SAMPLING method for pulling out data:
https://www.postgresql.org/docs/current/sql-select.html
TABLESAMPLE sampling_method ( argument [, ...] ) [ REPEATABLE ( seed ) ]
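To address the trigger-and-function part of the question directly, here is a hedged sketch (not part of the answer above) of an AFTER INSERT trigger that keeps a method1 table in step with a raw_data table. All names follow the question, the doubled value stands in for the real calculation, the connection string is a placeholder, and EXECUTE FUNCTION requires PostgreSQL 11 or newer:

import psycopg2

ddl = """
CREATE SCHEMA IF NOT EXISTS method1;

CREATE TABLE IF NOT EXISTS method1.area1 (
    timestamp varchar(19) PRIMARY KEY,
    value numeric(5,2) NOT NULL
);

CREATE OR REPLACE FUNCTION method1.copy_area1() RETURNS trigger AS $$
BEGIN
    INSERT INTO method1.area1 (timestamp, value)
    VALUES (NEW.timestamp, NEW.value * 2)  -- placeholder calculation
    ON CONFLICT (timestamp) DO UPDATE SET value = EXCLUDED.value;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS area1_to_method1 ON raw_data.area1;
CREATE TRIGGER area1_to_method1
    AFTER INSERT ON raw_data.area1
    FOR EACH ROW EXECUTE FUNCTION method1.copy_area1();
"""

con = psycopg2.connect("dbname=mydb")  # placeholder connection string
with con, con.cursor() as cur:
    cur.execute(ddl)
con.close()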
I can do very efficient bulk inserts in Sqlite3 on Python (2.7) with this code:
cur.executemany("INSERT INTO " + tableName + " VALUES (?, ?, ?, ?);", data)
But I can't get updates to work efficiently. I thought it might be a problem of the database structure/indexing, but even on a test database with only one table of 100 rows, the update still takes about 2-3 seconds.
I've tried different code variations. The latest code I have is from this answer to a previous question about update and executemany, but it's just as slow for me as any other attempt I've made:
data = []
for s in sources:
    source_id = s['source_id']
    val = get_value(s['source_attr'])
    x = [val, source_id]
    data.append(x)
cur.executemany("UPDATE sources SET source_attr = ? WHERE source_id = ?", data)
con.commit()
How could I improve this code to do a big bulk update efficiently?
When inserting a record, the database just needs to write a row at the end of the table (unless you have something like UNIQUE constraints).
When updating a record, the database needs to find the row. This requires scanning through the entire table (for each command), unless you have an index on the search column:
CREATE INDEX whatever ON sources(source_id);
But if source_id is the primary key, you should just declare it as such (which creates an implicit index):
CREATE TABLE sources(
    source_id INTEGER PRIMARY KEY,
    source_attr TEXT,
    [...]
);
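For completeness, a hedged sketch of how the index fix slots into the question's code; the database file name and sample data below are placeholders, not from the original post:

import sqlite3

data = [("new value", 1), ("another value", 2)]  # illustrative rows only

con = sqlite3.connect("test.db")
cur = con.cursor()
# Create the index once; after that, the executemany() UPDATE no longer
# scans the whole table for every row.
cur.execute("CREATE INDEX IF NOT EXISTS idx_sources_source_id ON sources(source_id)")
cur.executemany("UPDATE sources SET source_attr = ? WHERE source_id = ?", data)
con.commit()
con.close()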
I am trying to use the INSERT INTO ... ON DUPLICATE KEY UPDATE clause in Python to update MySQL records, where name is the primary key. If the name exists, update the record's age column; otherwise insert it:
sql = """INSERT INTO mytable(name, age) \
VALUES ('Tim',30),('Sam',21),('John','35') \
ON DUPLICATE KEY UPDATE age=VALUES(age)"""
with db.connection() as conn:
    with conn.cursor() as cursor:
        cursor.execute(sql)
        if cursor.rowcount == 0:
            result = 'UPDATE'
        else:
            result = 'INSERT'
I want to find out whether this execution has added one or more new rows or not. But cursor.rowcount is not correct for each insert and update. Any comments about that?
I ran into this problem before, where I wanted to know if my insert was successful or not. My short-term solution was to call a count(*) on the table before and after the insert and compare the numbers.
I never found a way to determine which action was taken, for either INSERT IGNORE or INSERT ... ON DUPLICATE KEY.
Just to add more clarification to the previous answer.
With cursor.rowcount it is particularly hard to achieve your goal when inserting multiple rows.
The reason is that rowcount returns the number of affected rows.
Here is how it is defined:
The affected-rows value per row is 1 if the row is inserted as a new row, 2 if an existing row is updated, and 0 if an existing row is set to its current values. (https://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html)
So, to solve your problem, you will need to do a count(*) before the insert and after the insert.
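A minimal sketch of that count(*)-before/after approach (not from either answer; it assumes a DB-API connection called conn and the mytable table from the question):

def upsert_and_count_new(conn, sql):
    # Returns how many genuinely new rows the
    # INSERT ... ON DUPLICATE KEY UPDATE statement added.
    with conn.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM mytable")
        before = cursor.fetchone()[0]

        cursor.execute(sql)
        conn.commit()

        cursor.execute("SELECT COUNT(*) FROM mytable")
        after = cursor.fetchone()[0]
    return after - before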