I have two databases (each with thousands of tables) which are supposed to reflect the same data, but they come from two different sources. I compared two tables to see what the differences were, and to do that I joined the two on a common ID key. I checked the table manually to find the ID key, but when I have to check thousands of tables it's not practical to do so.
Is there a way in pandas to find which column (or columns) in a table contain only unique values?
Use the Python library that lets you query your database (pymysql, psycopg2, etc.). Programmatically use the metadata available from the DB to iterate over the tables and columns, and dynamically build SQL queries of the form "select count(field) - count(distinct field) from table"; a result of zero means no non-NULL value in that column repeats.
Or you could also potentially use the metadata to see which columns in each table are indexed.
The SQL query to pull the relevant metadata will vary based on the kind of DBMS.
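For example, against PostgreSQL with psycopg2, a minimal sketch might look like this (the credentials, schema name, and information_schema query are assumptions; other DBMSs expose similar metadata under different names):

import psycopg2  # or pymysql, etc., depending on your DBMS

# Placeholder connection details.
conn = psycopg2.connect(dbname="mydb", user="me", password="secret", host="localhost")
cur = conn.cursor()

# List every column of every table in the schema.
cur.execute("""
    SELECT table_name, column_name
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name
""")
columns = cur.fetchall()

candidate_keys = {}
for table, column in columns:
    # COUNT(col) - COUNT(DISTINCT col) is 0 exactly when no non-NULL value repeats.
    cur.execute(f'SELECT COUNT("{column}") - COUNT(DISTINCT "{column}") FROM "{table}"')
    if cur.fetchone()[0] == 0:
        candidate_keys.setdefault(table, []).append(column)

print(candidate_keys)  # table -> columns whose values are all unique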
I have 10M+ records per day to insert into a Postgres database.
90% are duplicates and only the unique records should be inserted (this can be checked on a specific column value).
Because of the large volume, batch inserts seem like the only sensible option.
I'm trying to figure out how to make this work.
I've tried:
SQLAlchemy, but it throws an error. So I assume it's not possible.
s = Session(bind=engine)
s.bulk_insert_mappings(Model, rows)
s.commit()
Throws:
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "..._key"
Pandas' to_sql doesn't have this unique-record capability.
So I'm thinking of putting new records in an "intermediate table", then running background jobs in parallel to add those records to the main table if they don't already exist. I don't know if this is the most efficient procedure.
Is there a better approach?
Is there some way to make SQLAlchemy or Pandas do this?
There are two common ways to go about solving this problem. To pick between these, you need to examine where you're willing to spend the compute power, and whether or not the extra network transfer is going to be an issue. We don't have enough information to make that judgement call for you.
Option 1: Load to a temporary table
This option is basically what you described. Have a temporary table or a table that's dedicated to the load, which matches the schema of your destination table. Obviously this should exclude the unique constraints.
Load the entirety of your batch into this table, and once it's all there, insert from this table into your destination table. You can very easily use standard SQL statements to do any kind of manipulation you need, such as selecting DISTINCT rows, keeping only the first record per key, or whatever else.
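A rough sketch of Option 1 with SQLAlchemy against Postgres (the staging/destination table names, the event_id/payload columns, and the connection string are all invented for illustration; the ON CONFLICT clause relies on the destination's existing unique constraint):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder

with engine.begin() as conn:
    # 1) Bulk-load the raw batch, duplicates and all, into the staging table,
    #    which mirrors the destination's schema but has no unique constraints.
    conn.execute(
        text("INSERT INTO staging (event_id, payload) VALUES (:event_id, :payload)"),
        rows,  # list of dicts, e.g. [{"event_id": 1, "payload": "..."}, ...]
    )

    # 2) Move only new, de-duplicated rows into the destination.
    #    DISTINCT ON keeps one row per event_id within the batch;
    #    ON CONFLICT DO NOTHING skips rows that already exist in the destination.
    conn.execute(text("""
        INSERT INTO destination (event_id, payload)
        SELECT DISTINCT ON (event_id) event_id, payload
        FROM staging
        ON CONFLICT (event_id) DO NOTHING
    """))

    # 3) Empty the staging table for the next batch.
    conn.execute(text("TRUNCATE staging"))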
Option 2: Only load unique values, filtering with pandas
Pandas has a drop_duplicates() function which limits your DataFrame to unique entries, and you can specify things such as which columns to check and which row to keep.
df = df.drop_duplicates(subset=["Age"])
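For the batch in the question, that might look roughly like this (event_id is a stand-in for whichever column your uniqueness check uses; note that this only removes duplicates within the batch, not against rows already in the database):

# Keep the first occurrence of each event_id in this batch, then append the rest.
unique_rows = df.drop_duplicates(subset=["event_id"], keep="first")
unique_rows.to_sql("destination", engine, if_exists="append", index=False)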
I am using cursor.executemany to insert thousands of rows into a Snowflake database from another source. So if the insert fails for some reason, does it roll back all the inserts?
Is there some way to insert only if the same row does not exist yet? There is no primary key or unique key in the table.
So if the insert fails for some reason, does it roll back all the inserts?
The cursor.executemany(…) implementation in Snowflake's Python connector builds a single multi-row INSERT INTO statement whose values are evaluated by the query compiler before any rows are inserted, so the rows all go in together, or the statement fails up front if a value is unacceptable for its column type.
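For instance, a call like the following (the connection parameters, table, and columns are placeholders) is compiled into one multi-row INSERT, so either every row lands or the whole statement errors out:

import snowflake.connector

# Placeholder credentials -- substitute your own account details.
conn = snowflake.connector.connect(
    account="my_account", user="me", password="secret",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

rows = [(1, "alice"), (2, "bob"), (3, "carol")]
# executemany rewrites this into a single INSERT INTO ... VALUES (...), (...), (...) statement.
cur.executemany("INSERT INTO my_table (id, name) VALUES (%s, %s)", rows)
conn.commit()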
Is there some way to insert only if the same row does not exist yet? There is no primary key or unique key in the table.
If there are no ID-like columns, you'll need to define a condition that qualifies two rows to be the same (such as a multi-column match).
Assuming your new batch of inserts is in a temporary table TEMP, the following SQL inserts into the DESTINATION table only the rows that don't already appear there, checking each row against the existing set in DESTINATION.
Using HASH(…) as a basis for comparison, comparing all columns in each row together (in order):
INSERT INTO DESTINATION
SELECT *
FROM TEMP
WHERE
  HASH(*) NOT IN ( SELECT HASH(*) FROM DESTINATION )
As suggested by Christian in the comments, the MERGE SQL command can also be used once you have an identifier strategy (join keys). This too requires the new rows to be placed in a temporary table first, and it offers the ability to perform an UPDATE when a matching row is already found.
Note: HASH(…) may have collisions and isn't the best fit. Better is to form an identifier using one or more of your table's columns, and compare them together. Your question lacks information about table and data characteristics, so I've picked a very simple approach involving HASH(…) here.
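For completeness, a rough sketch of the MERGE variant mentioned above, with hypothetical join keys col_a and col_b standing in for whatever identifier strategy fits your data:

MERGE INTO DESTINATION d
USING TEMP t
  ON d.col_a = t.col_a AND d.col_b = t.col_b
WHEN MATCHED THEN
  UPDATE SET col_c = t.col_c
WHEN NOT MATCHED THEN
  INSERT (col_a, col_b, col_c) VALUES (t.col_a, t.col_b, t.col_c);

Rows that already match on the keys get updated (drop the WHEN MATCHED clause if you only want inserts), and everything else is inserted.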
I created a number of classes in SQLAlchemy to represent my various tables. I now want to insert records into these tables from various csv files that contain the data in an unnormalized format. What is the best way to deal with foreign keys?
In a simplified model, I have two tables: Child and Parent, with a one to many relationship. The parent table is already filled up, with a unique parent_name for each primary key. I am currently doing this:
for index, row in df.iterrows():
    u = session.query(Parent).filter_by(parent_name=row['parent_name']).first()
    session.add(Child(child_name=row['child_name'], parent_id=u.id))
Is there a way with sqlalchemy to avoid the first query? This question implies that using relationships is the correct/easy way to do it, but only explains the hard way.
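One common way to avoid the per-row query (a sketch, not necessarily the relationship-based approach that question discusses) is to prefetch the parent keys into a dict, assuming the parent table fits comfortably in memory; the names follow the simplified Parent/Child model above:

# One SELECT up front: map parent_name -> primary key.
parent_ids = dict(session.query(Parent.parent_name, Parent.id).all())

for index, row in df.iterrows():
    session.add(Child(child_name=row['child_name'],
                      parent_id=parent_ids[row['parent_name']]))

session.commit()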
I have a table in SQL Server and the table already has data for the month of November. I have to insert data for the previous months, starting from January through October. I have the data in a spreadsheet. I want to do a bulk insert using Python. I have successfully established the connection to the server using Python and am able to access the table. However, I don't know how to insert data above the rows that are already present in the table on the server. The table doesn't have any constraints, primary keys, or indexes.
I am not sure whether insertion based on that condition is possible. If it is, kindly share some clues.
Notes: I don't have access to SSIS. I can't do the insertion using BULK INSERT because I can't map my shared drive to the SQL Server machine. That's why I have decided to use a Python script to do the operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
However, I don't know how to insert data above the rows that are already present in the table on the server
Tables are ordered or structured based on the clustered index. Since you don't have one (you said there are no PKs or indexes), inserting the records "below" or "above" existing rows isn't something that can happen. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order will be determined by any order by clauses you place on a statement (at least the order of the results) or the clustered index on the table if you create one.
I assume you think your data is ordered because, by chance, when you run select * from table your results appear to be in the same order each time. However, this blog will show you that this isn't guaranteed and elaborates on the fact that your results truly aren't ordered without an order by clause.
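Put together, a hedged sketch of the load might look like this (the file, table, and column names and the connection string are all assumptions):

import pandas as pd
import pyodbc

# Spreadsheet with the January-October data; file and sheet names are placeholders.
df = pd.read_excel("historic_data.xlsx", sheet_name="Sheet1")

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.fast_executemany = True  # speeds up executemany for large batches

# Just insert the rows; their physical position in the heap doesn't matter.
cur.executemany(
    "INSERT INTO dbo.MyTable (report_month, value_col) VALUES (?, ?)",
    df[["report_month", "value_col"]].values.tolist(),
)
conn.commit()

# When you read the data back, impose the order explicitly.
cur.execute("SELECT report_month, value_col FROM dbo.MyTable ORDER BY report_month")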
I have a database containing a primary table and some unknown number of secondary tables. The primary table has two columns: ID (which is the primary key) and name. Secondary tables can have any number of columns, but are guaranteed to have the ID column as their primary key. Furthermore, the ID column of secondary tables is guaranteed to reference the ID column of the primary table as a foreign key. Additionally, it can be assumed that every table in the database has every ID.
I am currently using Python to construct and send queries to the MySQL database. What I would like to be able to do is to construct a query that will join all the tables in the database together by their ID columns. I understand that I can use SHOW TABLES; to return all the tables in the database, but I can't figure out how to generate a MySQL query for an unknown number of joins from there.
Example: I have three tables: A, B, and C. A is the primary table and thus has columns ID and name. B has columns ID, foo, and bar. C has columns ID and baz. I need to construct a join such that I get a table that has columns ID, name, foo, bar, and baz.
Edit: To give more detail, my Python program is running some unknown number of modules, each of which stores information about its object in its own table in the database, fitting the constraints for secondary tables I mentioned previously. I could store everything in one much wider table (thus avoiding the joins entirely), but that would require me to add columns as more modules get added, which feels weird to me. I'm hardly an SQL expert though, so that might be the way to go. Even if that is what I end up doing, I'm still curious about how one might go about doing this.
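A minimal sketch of building such a query dynamically with pymysql (the connection details are placeholders, and the primary table is assumed to be named A as in the example):

import pymysql

conn = pymysql.connect(host="localhost", user="me", password="secret", database="mydb")
cur = conn.cursor()

PRIMARY = "A"  # name of the primary table

# Every other table is a secondary table with an ID column referencing A.ID,
# so each one can simply be joined on ID.
cur.execute("SHOW TABLES")
secondary = [t[0] for t in cur.fetchall() if t[0] != PRIMARY]

joins = " ".join(f"JOIN `{t}` ON `{t}`.ID = `{PRIMARY}`.ID" for t in secondary)
query = f"SELECT * FROM `{PRIMARY}` {joins}"

cur.execute(query)
rows = cur.fetchall()

Note that SELECT * will repeat the ID column once per table; if that matters, pull the column names from information_schema.columns and list them explicitly instead.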