MySQL Join Unknown Number of Tables - python

I have a database containing a primary table and some unknown number of secondary tables. The primary table has two columns: ID (which is the primary key) and name. Secondary tables can have any number of columns, but are guaranteed to have the ID column as their primary key. Furthermore, the ID column of secondary tables is guaranteed to reference the ID column of the primary table as a foreign key. Additionally, it can be assumed that every table in the database has a row for every ID.
I am currently using Python to construct and send queries to the MySQL database. What I would like to be able to do is to construct a query that will join all the tables in the database together by their ID columns. I understand that I can use SHOW TABLES; to return all the tables in the database, but I can't figure out how to generate a MySQL query for an unknown number of joins from there.
Example: I have three tables: A, B, and C. A is the primary table and thus has columns ID and name. B has columns ID, foo, and bar. C has columns ID and baz. I need to construct a join such that I get a table that has columns ID, name, foo, bar, and baz.
Edit: To give more detail, my Python program is running some unknown number of modules, each of which stores information about the object in its own table in the database, fitting the constraints for secondary tables I mentioned previously. I could store everything in one much wider table (thus avoiding the joins entirely), but that would require me to add columns as more modules get added, which feels weird to me. I'm hardly an SQL expert though, so that might be the way to go. Even if that is what I end up doing, I'm still curious about how one might go about doing this.
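For what it's worth, here is a minimal sketch of how one might build such a join dynamically from SHOW TABLES, assuming a mysql.connector connection and that the primary table is named A as in the example (the connection details are placeholders):

import mysql.connector

conn = mysql.connector.connect(user="user", password="pass", database="mydb")  # placeholder credentials
cur = conn.cursor()

# Discover every table in the schema.
cur.execute("SHOW TABLES")
tables = [row[0] for row in cur.fetchall()]

primary = "A"  # the primary table, as in the example
secondary = [t for t in tables if t != primary]

# One JOIN clause per secondary table, all keyed on ID.
# USING (ID) also collapses the duplicated ID columns in the SELECT * output.
joins = " ".join(f"JOIN `{t}` USING (ID)" for t in secondary)
query = f"SELECT * FROM `{primary}` {joins}"

cur.execute(query)
for row in cur.fetchall():
    print(row)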

Related

Batch insert only unique records into PostgreSQL with Python (millions of records per day)

I have 10M+ records per day to insert into a Postgres database.
90% are duplicates and only the unique records should be inserted (this can be checked on a specific column value).
Because of the large volume, batch inserting seems like the only sensible option.
I'm trying to figure out how to make this work.
I've tried:
SQLAlchemy, but it throws an error. So I assume it's not possible.
s = Session(bind=engine)
s.bulk_insert_mappings(Model, rows)
s.commit()
Throws:
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "..._key"
Pandas' to_sql doesn't have this unique-record capability.
So I'm thinking of putting new records in an "intermediate table", then running background jobs in parallel to add those records to the main table if they don't already exist. I don't know if this is the most efficient procedure.
Is there a better approach?
Is there some way to make SQLAlchemy or Pandas do this?
There are two common ways to go about solving this problem. To pick between these, you need to examine where you're willing to spend the compute power, and whether or not the extra network transfer is going to be an issue. We don't have enough information to make that judgement call for you.
Option 1: Load to a temporary table
This option is basically what you described. Have a temporary table or a table that's dedicated to the load, which matches the schema of your destination table. Obviously this should exclude the unique constraints.
Load the entirety of your batch into this table, and once it's all there, insert from this table into your destination table. You can then use standard SQL statements to do whatever manipulation you need, such as selecting distinct rows or keeping only the first occurrence of each key.
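Here is a rough sketch of that pattern with psycopg2, assuming a destination table named events with a unique event_id column (the table, columns, and connection string are all placeholders):

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

rows = [(1, "a"), (2, "b"), (2, "b")]  # the incoming batch; duplicates allowed here

# Staging table mirrors the destination but carries no unique constraint.
cur.execute("CREATE TEMP TABLE staging (LIKE events INCLUDING DEFAULTS)")

# Bulk-load the whole batch into the staging table.
psycopg2.extras.execute_values(
    cur,
    "INSERT INTO staging (event_id, payload) VALUES %s",  # placeholder columns
    rows,
)

# Move only the rows that don't already exist in the destination.
cur.execute("""
    INSERT INTO events (event_id, payload)
    SELECT DISTINCT ON (event_id) event_id, payload  -- de-duplicate within the batch
    FROM staging
    ON CONFLICT (event_id) DO NOTHING
""")
conn.commit()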
Option 2: Only load unique values, filtering with pandas
Pandas has a drop_duplicates() function which limits your dataframe to unique entries, and you can specify things such as which columns to check and which row to keep.
df = df.drop_duplicates(subset = ["Age"])
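If the whole batch fits in memory, drop_duplicates can be combined with to_sql roughly like this (the engine URL and column names are placeholders); note that this only removes duplicates within the batch itself, not against rows already in the table:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder URL

rows = [(1, "a"), (2, "b"), (2, "b")]  # the incoming batch
df = pd.DataFrame(rows, columns=["event_id", "payload"])  # placeholder columns
df = df.drop_duplicates(subset=["event_id"])  # one row per key within this batch

# Append the de-duplicated batch; existing rows in the table are not checked.
df.to_sql("events", engine, if_exists="append", index=False)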

What happens if cursor.executemany fails in Python

I am using cursor.executemany to insert thousands of rows into a Snowflake database from some other source. So if the insert fails for some reason, does it roll back all the inserts?
Is there some way to insert only if the same row does not exist yet? There is no primary key nor unique key in the table
So if the insert fails for some reason, does it roll back all the inserts?
Snowflake's Python Connector implements cursor.executemany(…) by building a single multi-row INSERT INTO command whose values are pre-evaluated by the query compiler before the inserts run, so they either all run together or fail early if a value is unacceptable to its defined column type.
Is there some way to insert only if the same row does not exist yet? There is no primary key nor unique key in the table
If there are no ID-like columns, you'll need to define a condition that qualifies two rows to be the same (such as a multi-column match).
Assuming your new batch of inserts are in a temporary table TEMP, the following SQL can insert into the DESTINATION table by performing a check of all rows against a set from the DESTINATION table.
Using HASH(…) as a basis for comparison, comparing all columns in each row together (in order):
INSERT INTO DESTINATION
SELECT *
FROM TEMP
WHERE
HASH(*) NOT IN ( SELECT HASH(*) FROM DESTINATION )
As suggested by Christian in the comments, the MERGE SQL command can also be used, once you have an identifier strategy (join keys). This too requires the new rows to be placed in a temporary table first, and offers an ability to perform an UPDATE if a row is already found.
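For illustration, a sketch of what that MERGE might look like when issued through the Snowflake Python Connector, with KEY1/KEY2 standing in for whatever join keys you settle on (all table, column, and credential names here are placeholders):

import snowflake.connector

conn = snowflake.connector.connect(user="user", password="pass", account="account")  # placeholders
cur = conn.cursor()

# Insert rows from TEMP that have no match in DESTINATION on the chosen keys.
# A WHEN MATCHED THEN UPDATE clause could be added to update existing rows instead.
cur.execute("""
    MERGE INTO DESTINATION d
    USING TEMP t
        ON d.KEY1 = t.KEY1 AND d.KEY2 = t.KEY2
    WHEN NOT MATCHED THEN
        INSERT (KEY1, KEY2, VALUE)
        VALUES (t.KEY1, t.KEY2, t.VALUE)
""")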
Note: HASH(…) may have collisions and isn't the best fit. A better approach is to form an identifier from one or more of your table's columns and compare those. Your question lacks information about table and data characteristics, so I've picked a very simple approach involving HASH(…) here.

Are there disadvantages to making all columns except the primary key column in a table a unique index?

I want to avoid making duplicate records, but on some occasions, when updating a record, the values I receive are exactly the same as what the record already holds. This results in 0 affected rows, which is the value I rely on to determine whether I need to insert a new transaction.
I've tried using a SELECT statement to look for the exact transaction, but some of the many fields can be NULL, which is a problem because my WHERE clauses are built as 'field1 = %s' for every field, when I'd need 'field1 IS NULL' instead to get an accurate result back.
My last thought is using a unique index on all of the columns except the one for the table's primary key, but I'm not too familiar with using unique indexes. Should I be able to update these records after the fact? Are there risks to consider when implementing this solution?
Or is there another way I can tell whether I have an unchanged transaction or a new one when provided with values to update with?
The language I'm using is Python with mysql.connector
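Not a full answer, but one way around the NULL comparison problem described above is MySQL's NULL-safe equality operator <=>, which treats NULL <=> NULL as true, so the same parameterized template works whether or not a value is NULL. A rough sketch with placeholder table and column names:

import mysql.connector

conn = mysql.connector.connect(user="user", password="pass", database="mydb")  # placeholder credentials
cur = conn.cursor()

columns = ["field1", "field2", "field3"]   # placeholder column names
values = [10, None, "abc"]                 # incoming values; some may be NULL

# Build a WHERE clause with the NULL-safe operator for every column.
where = " AND ".join(f"{col} <=> %s" for col in columns)
cur.execute(f"SELECT COUNT(*) FROM transactions WHERE {where}", values)  # placeholder table
(count,) = cur.fetchone()

if count == 0:
    # No identical record exists, so this is a new transaction.
    placeholders = ", ".join(["%s"] * len(columns))
    cur.execute(
        f"INSERT INTO transactions ({', '.join(columns)}) VALUES ({placeholders})",
        values,
    )
    conn.commit()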

Proper way to insert records with foreign keys in sqlalchemy

I created a number of classes in SQLAlchemy to represent my various tables. I now want to insert records into these tables from various csv files that contain the data in an unnormalized format. What is the best way to deal with foreign keys?
In a simplified model, I have two tables: Child and Parent, with a one to many relationship. The parent table is already filled up, with a unique parent_name for each primary key. I am currently doing this:
for index, row in df.iterrows():
    u = session.query(Parent).filter_by(parent_name=row['parent_name']).first()
    session.add(Child(child_name=row['child_name'], parent_id=u.id))
Is there a way with sqlalchemy to avoid the first query? This question implies that using relationships is the correct/easy way to do it, but only explains the hard way.
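One way to avoid the per-row query (without going through relationships) is to load the parent_name → id mapping once and reuse it; a sketch built on the Parent, Child, session, and df objects already defined in the question:

# Build the parent_name -> id lookup once, instead of querying per row.
parent_ids = dict(session.query(Parent.parent_name, Parent.id).all())

children = [
    Child(child_name=row["child_name"], parent_id=parent_ids[row["parent_name"]])
    for _, row in df.iterrows()
]
session.add_all(children)
session.commit()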

Pandas find columns with unique values

I have two databases (each with 1000's of tables) which are supposed to reflect the same data, but they come from two different sources. I compared two tables to see what the differences were, but to do that I joined the two on a common ID key. I checked the table manually to see what the ID key was, but when I have to check 1000's of tables it's not practical to do so.
Is there a way in pandas to find what column (or columns) in a table have only unique values?
Use a Python library that allows you to query your database (pymysql, psycopg2, etc.). Programmatically use the metadata available from the DB to iterate over the tables and columns, and dynamically create SQL queries along the lines of "select count(field) - count(distinct field) from table" for each column.
Or you could also potentially use the metadata to see which columns in each table are indexed.
The SQL query to pull the relevant metadata will vary based on the kind of DBMS.
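A rough sketch of that approach for MySQL with pymysql, reading the column list from information_schema and running the count(field) vs count(distinct field) comparison per column (credentials and schema name are placeholders):

import pymysql

conn = pymysql.connect(user="user", password="pass", database="mydb")  # placeholder credentials
cur = conn.cursor()

# Every (table, column) pair in the schema, straight from the DB's metadata.
cur.execute(
    "SELECT TABLE_NAME, COLUMN_NAME FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = %s",
    ("mydb",),
)
pairs = cur.fetchall()

unique_columns = {}
for table, column in pairs:
    # A column holds only unique (non-NULL) values when COUNT(col) equals COUNT(DISTINCT col).
    cur.execute(f"SELECT COUNT(`{column}`) - COUNT(DISTINCT `{column}`) FROM `{table}`")
    (diff,) = cur.fetchone()
    if diff == 0:
        unique_columns.setdefault(table, []).append(column)

print(unique_columns)  # candidate join-key columns per table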
