I am using cursor.executemany to insert thousands of rows into a Snowflake database from another source. If the insert fails for some reason, does it roll back all the inserts?
Is there some way to insert a row only if the same row does not exist yet? There is neither a primary key nor a unique key in the table.
If the insert fails for some reason, does it roll back all the inserts?
The cursor.executemany(…) implementation in Snowflake's Python Connector builds a single multi-row INSERT INTO statement, whose values are evaluated by the query compiler before any rows are written, so the batch either succeeds as a whole or fails up front if a value is not acceptable for its defined column type.
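For illustration, a minimal sketch of such a batched insert; the connection parameters and the table T(C1, C2) below are placeholders, not anything from your setup:

import snowflake.connector

con = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)
rows = [(1, "a"), (2, "b"), (3, "c")]  # thousands of tuples in practice
cur = con.cursor()
# The connector expands this into one multi-row INSERT statement,
# so the whole batch is inserted together or fails up front.
cur.executemany("INSERT INTO T (C1, C2) VALUES (%s, %s)", rows)
con.commit()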
Is there some way to insert a row only if the same row does not exist yet? There is neither a primary key nor a unique key in the table.
If there are no ID-like columns, you'll need to define a condition that qualifies two rows as the same (such as a multi-column match).
Assuming your new batch of inserts is in a temporary table TEMP, the following SQL inserts into the DESTINATION table after checking each row against the set of rows already in DESTINATION.
It uses HASH(…) as the basis for comparison, hashing all columns of each row together (in order):
INSERT INTO DESTINATION
SELECT *
FROM TEMP
WHERE
HASH(*) NOT IN ( SELECT HASH(*) FROM DESTINATION )
As suggested by Christian in the comments, the MERGE SQL command can also be used once you have an identifier strategy (join keys). It too requires the new rows to be staged in a temporary table first, and it adds the ability to perform an UPDATE when a row is already found.
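For example, a hedged sketch of such a MERGE, assuming the new rows are already staged in TEMP and that columns C1 and C2 together identify a row (the join keys and column C3 here are hypothetical, picked only for illustration):

# "con" is the snowflake.connector connection from the earlier sketch.
con.cursor().execute("""
    MERGE INTO DESTINATION d
    USING TEMP t
      ON d.C1 = t.C1 AND d.C2 = t.C2
    WHEN MATCHED THEN UPDATE SET d.C3 = t.C3
    WHEN NOT MATCHED THEN INSERT (C1, C2, C3) VALUES (t.C1, t.C2, t.C3)
""")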
Note: HASH(…) can produce collisions, so it isn't the best fit. It is better to form an identifier from one or more of your table's columns and compare those together. Your question lacks information about the table and data characteristics, so I've picked the very simple HASH(…) approach here.
I have 10M+ records per day to insert into a Postgres database.
90% are duplicates and only the unique records should be inserted (this can be checked on a specific column value).
Because of the large volume, batch inserts seem like the only sensible option.
I'm trying to figure out how to make this work.
I've tried:
SQLAlchemy, but it throws an error, so I assume it's not possible:
s = Session(bind=engine)
s.bulk_insert_mappings(Model, rows)
s.commit()
Throws:
IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "..._key"
Pandas' to_sql doesn't have this unique-record capability.
So I'm thinking of putting new records in an "intermediate table", then running background jobs in parallel to add those records to the main table if they don't already exist. I don't know if this is the most efficient procedure.
Is there a better approach?
Is there some way to make SQLAlchemy or Pandas do this?
There are two common ways to go about solving this problem. To pick between these, you need to examine where you're willing to spend the compute power, and whether or not the extra network transfer is going to be an issue. We don't have enough information to make that judgement call for you.
Option 1: Load to a temporary table
This option is basically what you described. Have a temporary table or a table that's dedicated to the load, which matches the schema of your destination table. Obviously this should exclude the unique constraints.
Load the entirety of your batch into this table, and once it's all there, insert from it into your destination table. You can easily use standard SQL statements to do any manipulation you need, such as selecting distinct rows, keeping only the first record per key, or whatever else.
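A minimal sketch of that flow, assuming the unique value lives in a column called external_id, the tables are named records and staging_records, and the existing unique constraint on records covers external_id (all of these names are hypothetical; adjust to your schema):

from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

with engine.begin() as conn:
    # The batch has already been bulk-loaded into staging_records
    # (e.g. with COPY or to_sql). Move over only the rows that are new.
    conn.execute(text("""
        INSERT INTO records
        SELECT DISTINCT ON (external_id) *
        FROM staging_records
        ON CONFLICT (external_id) DO NOTHING
    """))
    conn.execute(text("TRUNCATE staging_records"))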
Option 2: Only load unique values, filtering with pandas
Pandas has a drop_duplicates() function which limits your DataFrame to unique entries, and you can specify things such as which columns to check and which row to keep.
df = df.drop_duplicates(subset=["Age"])
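Building on that, a hedged sketch that also filters out rows already in the database, assuming the unique value sits in a column called external_id and the destination table is records (hypothetical names again):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")

# df is the incoming batch (a tiny stand-in here).
df = pd.DataFrame({"external_id": [1, 1, 2], "payload": ["a", "a", "b"]})

# Drop duplicates inside the batch itself.
df = df.drop_duplicates(subset=["external_id"])

# Drop rows whose key already exists in the table.
existing = pd.read_sql("SELECT external_id FROM records", engine)
df = df[~df["external_id"].isin(existing["external_id"])]

df.to_sql("records", engine, if_exists="append", index=False)

Note that reading every existing key back into pandas can itself become expensive at 10M+ rows per day, which is one reason the temporary-table option often scales better.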
I want to avoid creating duplicate records, but on some occasions the values I receive when updating a record are exactly the same as the record's current version. That update results in 0 affected rows, which is the value I rely on to decide whether I need to insert a new transaction.
I've tried using a SELECT statement to look for the exact transaction, but some fields (out of many) can be NULL, which doesn't work well when my WHERE clause strings all use 'field1 = %s' and I'd need something like 'field1 IS NULL' instead to get an accurate result back.
My last thought is using a unique index on all of the columns except the table's primary key column, but I'm not too familiar with unique indexes. Would I still be able to update these records after the fact? Are there risks to consider with this solution?
Or is there another way I can tell whether I have an unchanged transaction or a new one when provided with values to update with?
The language I'm using is Python with mysql.connector
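One hedged option for the exact-match check is MySQL's NULL-safe comparison operator <=>, which treats two NULLs as equal, so the same parameterized WHERE clause works whether a value is NULL or not. A minimal sketch; the transactions table and field1/field2/field3 columns are placeholders:

import mysql.connector

con = mysql.connector.connect(
    host="localhost", user="me", password="...", database="mydb"
)
cur = con.cursor()

def transaction_exists(field1, field2, field3):
    # <=> returns TRUE when both sides are NULL, unlike =.
    cur.execute(
        "SELECT COUNT(*) FROM transactions "
        "WHERE field1 <=> %s AND field2 <=> %s AND field3 <=> %s",
        (field1, field2, field3),
    )
    return cur.fetchone()[0] > 0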
I have a table in SQL Server that already contains data for the month of November. I have to insert data for the previous months, from January through October, which I have in a spreadsheet. I want to do a bulk insert using Python. I have successfully established the connection to the server with Python and can access the table. However, I don't know how to insert data above the rows that are already present in the table on the server. The table doesn't have any constraints, primary keys, or indexes.
I am not sure whether such a conditional insert is possible. If it is, kindly share some clues.
Notes: I don't have access to SSIS. I can't do the insertion using "BULK INSERT" because I can't map my shared drive to the SQL Server machine. That's why I have decided to use a Python script for this operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
However, I don't know how to insert data above the rows that are already present in the table on the server
Tables are ordered or structured based on the clustered index. Since you said there aren't any primary keys or indexes, you don't have one, so inserting records "below" or "above" existing rows isn't something that can happen. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order you get back is determined by any ORDER BY clause you place on a statement (at least the order of the results), or by the clustered index on the table if you create one.
I assume you think your data is ordered because, by chance, when you run select * from table your results appear to be in the same order each time. However, this blog shows that this isn't guaranteed, and it elaborates on the fact that your results truly aren't ordered without an ORDER BY clause.
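To make that concrete, a minimal sketch with pyodbc, assuming the spreadsheet rows carry a date column that can drive the ordering; the DSN, table, and column names (monthly_data, report_date, amount) are placeholders I've made up:

import pyodbc

con = pyodbc.connect("DSN=my_sqlserver;UID=me;PWD=...")
cur = con.cursor()

# rows read from the spreadsheet, e.g. via openpyxl or pandas
rows = [("2019-01-31", 100.0), ("2019-02-28", 110.0)]
cur.fast_executemany = True
cur.executemany(
    "INSERT INTO monthly_data (report_date, amount) VALUES (?, ?)", rows
)
con.commit()

# The "order" only exists when you ask for it:
cur.execute("SELECT report_date, amount FROM monthly_data ORDER BY report_date")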
I have a database containing a primary table and some unknown number of secondary tables. The primary table has two columns: ID (which is the primary key) and name. Secondary tables can have any number of columns, but are guaranteed to have the ID column as their primary key. Furthermore, the ID column of secondary tables is guaranteed to reference the ID column of the primary table as a foreign key. Additionally, it can be assumed that every table in the database has every ID.
I am currently using Python to construct and send queries to the MySQL database. What I would like to be able to do is to construct a query that will join all the tables in the database together by their ID columns. I understand that I can use SHOW TABLES; to return all the tables in the database, but I can't figure out how to generate a MySQL query for an unknown number of joins from there.
Example: I have three tables: A, B, and C. A is the primary table and thus has columns ID and name. B has columns ID, foo, and bar. C has columns ID and baz. I need to construct a join such that I get a table that has columns ID, name, foo, bar, and baz.
Edit: To give more detail, my Python program runs some unknown number of modules, each of which stores information about its object in its own table in the database, fitting the constraints for secondary tables I mentioned previously. I could store everything in one much wider table (thus avoiding the joins entirely), but that would require me to add columns as more modules get added, which feels weird to me. I'm hardly an SQL expert, though, so that might be the way to go. Even if that is what I end up doing, I'm still curious how one might go about this.
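One hedged way to sketch this with mysql.connector, assuming the primary table is named A as in the example above (connection parameters are placeholders):

import mysql.connector

con = mysql.connector.connect(
    host="localhost", user="me", password="...", database="mydb"
)
cur = con.cursor()

cur.execute("SHOW TABLES")
tables = [row[0] for row in cur.fetchall()]

primary = "A"
secondary = [t for t in tables if t != primary]

# Every table shares the ID column, so USING (ID) joins them and keeps
# a single ID column in the result.
query = f"SELECT * FROM `{primary}` " + " ".join(
    f"JOIN `{t}` USING (ID)" for t in secondary
)
cur.execute(query)
wide_rows = cur.fetchall()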
I have a single table in an Sqlite DB, with many rows. I need to get the number of rows (total count of items in the table).
I tried select count(*) from table, but that seems to access each row and is super slow.
I also tried select max(rowid) from table. That's fast, but not really safe -- ids can be re-used, table can be empty etc. It's more of a hack.
Any ideas on how to find the table size quickly and cleanly?
Using Python 2.5's sqlite3 version 2.3.2, which uses Sqlite engine 3.4.0.
Do you have any kind of index on a not-null column (for example a primary key)? If yes, the index can be scanned (which hopefully does not take that long). If not, a full table scan is the only way to count all rows.
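For example, a small sketch of that idea in sqlite3, assuming a NOT NULL integer column named item_id on a table named items (both hypothetical):

import sqlite3

con = sqlite3.connect("my.db")
# An index on a NOT NULL column gives the engine something smaller
# than the full table to scan when counting.
con.execute("CREATE INDEX IF NOT EXISTS idx_items_item_id ON items(item_id)")
(count,) = con.execute("SELECT COUNT(*) FROM items").fetchone()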
Another way to get the number of rows in a table is to use a trigger that stores the current count in another table (each insert operation increments a counter).
This makes inserting a new record a little slower, but you can read the number of rows immediately.
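A hedged sketch of that trigger approach, assuming an existing table named items and a counter table named row_counts (both names made up for illustration):

import sqlite3

con = sqlite3.connect("my.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS row_counts (table_name TEXT PRIMARY KEY, n INTEGER);
    INSERT OR IGNORE INTO row_counts VALUES ('items', 0);

    CREATE TRIGGER IF NOT EXISTS items_insert_count AFTER INSERT ON items
    BEGIN
        UPDATE row_counts SET n = n + 1 WHERE table_name = 'items';
    END;

    CREATE TRIGGER IF NOT EXISTS items_delete_count AFTER DELETE ON items
    BEGIN
        UPDATE row_counts SET n = n - 1 WHERE table_name = 'items';
    END;
""")
(count,) = con.execute(
    "SELECT n FROM row_counts WHERE table_name = 'items'"
).fetchone()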
To follow up on Thilo's answer, as a data point: I have a SQLite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried SELECT rowid FROM table (thinking that rowid is a default primary indexed key), but that was no faster. Then I created an index on one of the fields in the database (an arbitrary field, but I chose an integer one because I knew from past experience that indexes on short fields can be very fast, I think because a copy of the value is stored in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.