Python MySQL check for duplicate before insert

here is the table
CREATE TABLE IF NOT EXISTS kompas_url
(
id BIGINT(20) NOT NULL AUTO_INCREMENT,
url VARCHAR(1000),
created_date datetime,
modified_date datetime,
PRIMARY KEY(id)
)
I am trying to INSERT into the kompas_url table only if the url does not exist yet.
Any ideas?
Thanks

You can either find out whether it's in there first, by SELECTing by url, or you can make the url field unique:
CREATE TABLE IF NOT EXISTS kompas_url
...
url VARCHAR(1000) UNIQUE,
...
)
This will stop MySQL from inserting a duplicate row, but it will also report an error when you try and insert. This isn't good—although we can handle the error, it might disguise others. To get around this, we use the ON DUPLICATE KEY UPDATE syntax:
INSERT INTO kompas_url (url, created_date, modified_date)
VALUES ('http://example.com', NOW(), NOW())
ON DUPLICATE KEY UPDATE modified_date = NOW()
This allows us to provide an UPDATE statement in the case of a duplicate value in a unique field (this can include your primary key). In this case, we probably want to update the modified_date field with the current date.
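From Python, that statement runs as an ordinary parameterized query. A minimal sketch, assuming the MySQLdb driver; the connection details and URL value are placeholders:

import MySQLdb

# Connection parameters are placeholders; adjust for your environment.
conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO kompas_url (url, created_date, modified_date) "
    "VALUES (%s, NOW(), NOW()) "
    "ON DUPLICATE KEY UPDATE modified_date = NOW()",
    ("http://example.com",))
conn.commit()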
EDIT: As suggested by ~unutbu, if you don't want to change anything on a duplicate, you can use the INSERT IGNORE syntax. This simply works as follows:
INSERT IGNORE INTO kompas_url (url, created_date, modified_date)
VALUES ('http://example.com', NOW(), NOW())
This turns certain kinds of errors into warnings, most usefully the error raised for a duplicate entry in a unique column. With the keyword IGNORE in your statement, you won't get an error: the conflicting row is simply skipped rather than inserted. In complex queries this may also hide other errors that would have been useful, though, so it's best to make doubly sure your code is correct if you want to use it.
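If you would rather keep the check in Python, the SELECT-first approach mentioned at the top of this answer looks roughly like this (a sketch, where url holds the value to insert; note that a separate check-then-insert is racy under concurrent writers, which is why the unique-key approaches above are generally safer):

cursor.execute("SELECT 1 FROM kompas_url WHERE url = %s LIMIT 1", (url,))
if cursor.fetchone() is None:
    cursor.execute(
        "INSERT INTO kompas_url (url, created_date, modified_date) "
        "VALUES (%s, NOW(), NOW())",
        (url,))
conn.commit()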

Related

Python SQLITE3 Insert into and On conflict do update with variables - Syntax problem?

I am running this code in a method that is called in a for loop, passing 'text_body' to it:
*connecting to db*
cursor.execute('CREATE TABLE IF NOT EXISTS temp_data (temp_text_body text, id integer)')
cursor.execute('INSERT INTO temp_data(temp_text_body, id) VALUES (?, 1) ON CONFLICT (id) DO UPDATE '
'SET temp_text_body = temp_text_body || (?)', (text_body, text_body))
*committing connection*
My goal is, on the first call, to create the table and fill it with 'text_body'; then, on the second call, append the second 'text_body' to the first, and so on.
I tried many combinations of this code. I also tried doing it with UPDATE (which was fine, except that UPDATE needs some existing data to work on).
When my code reaches this place it just stops running, no error, nothing. Could someone please tell me where I made a mistake?
You cannot set 'ON CONFLICT' on a field which is not a primary key or which does not have a 'unique constraint'. Here's the error message I get when running your code:
sqlite3.OperationalError:
ON CONFLICT clause does not match any PRIMARY KEY or UNIQUE constraint
This in essence is because you cannot have a conflict if there is no Primary Key/ Unique constraint on that field.
Under the assumption that you want id to be a unique Primary Key, I've created a working snippet:
import sqlite3

# Note: sqlite3.connect returns a Connection object; Connection.execute
# is a shortcut that creates a cursor behind the scenes.
conn = sqlite3.connect('stack_test.db')
conn.execute('''CREATE TABLE IF NOT EXISTS temp_data
                (temp_text_body TEXT,
                 id INTEGER PRIMARY KEY)''')
conn.execute('''
    INSERT INTO
        temp_data(temp_text_body, id)
    VALUES
        (?, 1)
    ON CONFLICT (id) DO UPDATE
    SET
        temp_text_body = temp_text_body || (?)''',
    ("foo", "bar"))
conn.commit()
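Running the snippet a second time appends 'bar' to the stored text, which you can verify with a quick check:

print(conn.execute("SELECT temp_text_body FROM temp_data WHERE id = 1").fetchone())
# ('foo',) after the first run, ('foobar',) after the second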
I've also made some stylistic changes to how you write your query statements, which I hope will make them a little easier for you to debug!
Best of luck with the SQLite'ing, feel free to ask if any part of my answer is unclear :)

Insert record into PostgreSQL table and expire old record

I have a set of data that gets updated periodically by a client. Once a month or so we will download a new set of this data. The dataset is about 50k records with a couple hundred columns of data.
I am trying to create a database that houses all of this data so we can run our own analysis on it. I'm using PostgreSQL and Python (psycopg2).
Occasionally, the client will add columns to the dataset, so there are a number of steps I want to take:
Add new records to the database table
Compare the old set of data with the new set of data and update the table where necessary
Keep the old records, and either add an "expired" flag or a "db_expire_date" to keep track of whether a record is active or expired
Add any new columns of data to the database for all records
I know how to add new records to the database (1) using INSERT INTO, and how to add new columns of data to the database (4) using ALTER TABLE, but I'm having issues with (2) and (3). I figured out how to update a record using the following code:
rows = zip(*[update_records[col] for col in update_records])
cursor = conn.cursor()
cursor.execute("""CREATE TEMP TABLE temptable (""" + schema_list + """) ON COMMIT DROP""")
cursor.executemany("""INSERT INTO temptable (""" + var +""") VALUES ("""+ perc_s + """)""", rows)
cursor.execute("""
UPDATE tracking.test_table
SET mfg = temptable.mfg, db_updt_dt = CURRENT_TIMESTAMP
FROM temptable
WHERE temptable.app_id = tracking.test_table.app_id;
""");
cursor.rowcount
conn.commit()
cursor.close()
conn.close()
However, this just updates the record in place, matched on app_id as the primary key.
What I'd like to figure out is how to keep the original record, mark it as "expired", and then create a new, updated record. It seems that app_id shouldn't be my primary key, so I've created a new primary key as '"primary_key" INT GENERATED ALWAYS AS IDENTITY not null,'.
I'm just not sure where to go from here. I think I could probably just use INSERT INTO to send the new records to the database, but I'm not sure how to "expire" the old records that way. Possibly I could use UPDATE to set the older rows to "expired", but I am wondering if there is a more straightforward way to do this.
I hope my question is clear. I'm hoping someone can point me in the right direction. Thanks
A pretty standard data warehousing technique is to define two additional date fields, a from-effective-date and a to-effective-date. You only append rows, never update. You add the candidate record if the source primary key does not exist in your table OR if any column value is different from the most recently added prior record in your table with the same primary key. (Each record supersedes the last).
As you add your record to the table you do 3 things:
The new record's from-effective-date gets the transaction file's date.
The new record's to-effective-date gets a date WAY in the future, like 9999-12-31. The important thing here is that it will not expire until you say so.
The most recent prior record (the one you compared values against for changes) has its to-effective-date updated to the transaction file's date minus one day. This has the effect of expiring the old record.
This creates a chain of records with the same source primary key with each one covering a non-overlapping time period. This format is surprisingly easy to select from:
If you want to reproduce the most current transaction file you select Where to-effective-date > Current Date
If you want to reproduce the transaction file at any date for a report, you select Where myreportdate Between from-effective-date And to-effective-date.
If you want the entire update history for a key you select * Where the key = mykeyvalue Order By from-effective-date.
The only thing that is ugly about this scheme is when columns are added, the comparison test also must be altered to include those new columns in case something changes. If you want that to be dynamic, you're going to have to loop through the reflection meta data for each column in the table, but Python will need to know how comparing a text field might be different from comparing a BLOB, for example.
If you actually care about having a primary key (many data warehouses do not have primary keys) you can define a compound key on the source primary key + one of those effective dates, it doesn't really matter which one.
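Putting those pieces together, a rough sketch with psycopg2 might look like the following. This is only a sketch: it reuses tracking.test_table, app_id, and mfg from the question, but the effective-date column names (from_eff_date, to_eff_date) are assumptions, and the compare-before-append step is left as a comment:

from datetime import timedelta

def apply_record(conn, app_id, mfg, file_date):
    # conn is an open psycopg2 connection; file_date is a datetime.date.
    with conn.cursor() as cur:
        # Expire the most recent prior record for this key, if any.
        # (In practice you would first compare it to the incoming values
        # and skip the whole operation if nothing changed.)
        cur.execute("""
            UPDATE tracking.test_table
            SET to_eff_date = %s
            WHERE app_id = %s AND to_eff_date = DATE '9999-12-31'
        """, (file_date - timedelta(days=1), app_id))
        # Append the new version: from-effective-date is the file date,
        # to-effective-date is far in the future so the row stays active.
        cur.execute("""
            INSERT INTO tracking.test_table (app_id, mfg, from_eff_date, to_eff_date)
            VALUES (%s, %s, %s, DATE '9999-12-31')
        """, (app_id, mfg, file_date))
    conn.commit()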
You're looking for the concept of a "natural key", which is how you would identify a unique row, regardless of what the explicit logical constraints on the table are.
This means you're spot on that you need to change your primary key to be more inclusive. Your new identity-based primary key doesn't actually help you decipher which row you are looking for once both versions are in there, unless you already know which row you want (that "identity" field).
I can think of two likely candidates to add to your natural key: date, or batch.
Either way, you would look for "App = X, [Date|batch] = Y" in the data to find that one. Batch would be upload 1, upload 2, etc. You just make it up, or derive it from the date, or something along those lines.
If you aren't sure which to add, and you aren't ever going to upload multiple times in one day, I would go with Date. That will give you more visibility over time, as you can see when and how often things change.
Once you have a natural key, you want to make it explicit in your data. You can either keep your identity column (see: Surrogate Key) or you can have a compound primary key. With no other input or constraints, I would go with a compound primary key for your situation.
I'm a MySQL DBA, so I'm cribbing a bit from the docs here: https://www.postgresqltutorial.com/postgresql-primary-key/
You do NOT want this:
CREATE TABLE test_table (
app_id INTEGER PRIMARY KEY,
date DATE,
active BOOLEAN
);
Instead, you want this:
CREATE TABLE test_table (
app_id INTEGER,
date DATE,
active BOOLEAN,
PRIMARY KEY (app_id, date)
);
I've added an active column here as well, since you wanted to deactivate rows. This isn't explicitly necessary from what you've described though - you can always assume the most recent upload is active. Or you can expand the columns to have a "active_start" date and an "active_end" date, which will enable another set of queries. But for what you've stated here so far, just the date column should suffice. :)
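For example, under the most-recent-is-active convention, fetching the current version of a record is a single query. A sketch, using psycopg2-style placeholders:

cursor.execute(
    "SELECT * FROM test_table WHERE app_id = %s ORDER BY date DESC LIMIT 1",
    (app_id,))
current_row = cursor.fetchone()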
For step 2)
First, you have to identify the records that already contain the same data. For this, you can run a SELECT query with a WHERE clause before inserting any record and count the number of rows you receive as output. If the count is more than 0, don't insert the record; otherwise, insert it.
For step 3)
For this, you can add a column as you mention above with the name 'db_expire_date' and set the expiration value at the time of record insertion only.
You could also use a column like 'is_expire', but for that you would need a cron job that periodically updates the value of this column.
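A rough sketch of that existence check with psycopg2 (the compared columns and the db_expire_date variable are illustrative, and note that a separate check-then-insert is not safe if several writers run concurrently):

cursor.execute(
    "SELECT COUNT(*) FROM tracking.test_table WHERE app_id = %s AND mfg = %s",
    (app_id, mfg))
if cursor.fetchone()[0] == 0:
    cursor.execute(
        "INSERT INTO tracking.test_table (app_id, mfg, db_expire_date) "
        "VALUES (%s, %s, %s)",
        (app_id, mfg, db_expire_date))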

Getting "Duplicate entry 'foo' for key 'PRIMARY'" While using ON DUPLICATE KEY UPDATE

Right off the bat I want to say, due to my position I cannot paste the full code. So I will do what I can to symbolize the code and get straight to the point.
Programmed in: Python.
Simply put, I am getting a Duplicate Key Error. I have looked into other questions that have been raised about this, and to my knowledge I am following the suggestions those answers provided.
Table Structure Snippet:
CREATE TABLE `BAR_TABLE` (
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`foo` tinytext,
`bar` tinytext,
`uuid` varchar(25) NOT NULL,
PRIMARY KEY (`uuid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
In this case 'uuid' is the PRIMARY KEY: there is already a dataset this is taking in, and that uuid is always unique (unless it already exists, in which case the row should be updated with potentially changed info).
SQL Snippet
INSERT INTO BAR_TABLE (
foo,
bar,
uuid
) VALUES (
%(foo)s,
%(bar)s,
%(uuid)s
) ON DUPLICATE KEY UPDATE
foo = VALUES(foo),
bar = VALUES(bar)
The reason the SQL looks like this is that I am using executemany(), and the data that comes in is a Python dictionary. The named placeholders let the values in each dictionary be assigned to the SQL statement, and then everything gets shifted into the DB using executemany().
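To illustrate the shape of the call (values made up, since I can't paste the real data; sql holds the statement above):

data = [
    {"foo": "a", "bar": "b", "uuid": "uuid-001"},
    {"foo": "c", "bar": "d", "uuid": "uuid-002"},
]
cursor.executemany(sql, data)
connection.commit()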
The item for which it throws the duplicate entry is actually in the table. I managed to get this to run a couple of times, and then at some point during testing it hit this error and hasn't moved past it.
Obviously I am misunderstanding something. Am I misunderstanding how the PRIMARY KEY works? Or what counts as a KEY for ON DUPLICATE KEY UPDATE?

How to get the last row id on update?

I am a beginner in MySQL, and maybe it's my fault somewhere; I am not able to understand how this can be resolved.
This is the structure of my table:
CREATE TABLE `nearest_product_type` (
`id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
`created` datetime NOT NULL,
`modified` datetime NOT NULL,
`name` varchar(15) NOT NULL UNIQUE
)
;
And this is the code I am trying:
import datetime
import MySQLdb

base = MySQLdb.connect(host="localhost", user="root", passwd="sheeshmohsin", db="points")
basecursor = base.cursor()
queryone = """INSERT INTO nearest_product_type (name,created,modified) VALUES (%s,%s,%s) ON DUPLICATE KEY UPDATE name=name """
category = "Indica"
valueone = (category,datetime.datetime.now(),datetime.datetime.now())
basecursor.execute(queryone, valueone)
product_id = basecursor.lastrowid
basecursor.close()
base.commit()
base.close()
print product_id
On running this Python script the first time, when the category does not exist yet, it works fine; but on running it again with the same category, the last row id returns 0. I need the id of the row that was updated.
And when I checked the rows in the table, the auto-increment is also skipping values: if I run the script four times, the first unique category gets id 1, and if another unique category comes in on the fourth run, that row is assigned id 4, even though it is only the second row. How can I solve this?
The ON DUPLICATE KEY UPDATE name=name part here is effectively a no-op: when the UNIQUE name column collides, the row is left unchanged, so no insert happens and lastrowid has nothing meaningful to return.
It is almost certainly this clause that is causing the unexpected ids as well: with InnoDB, each collision still consumes an AUTO_INCREMENT value, which is why you see holes in the sequence.
You can try using something like SELECT MAX(id) FROM nearest_product_type to get the last id added.
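Alternatively, since name is declared UNIQUE, you can simply look the row up by name after the statement runs. A minimal sketch, reusing the variables from the question:

basecursor.execute("SELECT id FROM nearest_product_type WHERE name = %s", (category,))
product_id = basecursor.fetchone()[0]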
Something is wrong in the way you access the database. When you try to insert a new row with a name that already exists, since the name column is declared UNIQUE, the insert will fail.
If you want to modify an existing row, you must use an UPDATE statement, not an INSERT one. And standard SQL has no single statement to insert-or-update (MySQL's INSERT ... ON DUPLICATE KEY UPDATE is a vendor extension).
And nothing about auto-increment guarantees that ids are consecutive. All you know is that the database will allot a different id for each inserted row, but insertion failures can (and in your case do) result in holes in the id sequence.
Furthermore, some drivers may allow pre-reservation of ids, particularly over network connections, so that a client connection can grab a bunch of ids in case it inserts more than one row. In that case, if another client asks for ids and both clients insert rows alternately, the ids will not follow insertion order.

Sqlite insert not working with python

I'm working with sqlite3 on Python 2.7 and I am facing a problem with a many-to-many relationship. I have a table from which I am fetching its primary key like this:
current.execute("SELECT ExtensionID FROM tblExtensionLookup where ExtensionName = ?",[ext])
and then I am fetching another primary key from another table:
current.execute("SELECT HostID FROM tblHostLookup where HostName = ?",[host])
Now, I have a third table with these two keys as foreign keys, and I insert into it like this:
current.execute("INSERT INTO tblExtensionHistory VALUES(?,?)",[Hid,Eid])
The problem is, I don't know why, but the last insertion is not working; it keeps giving errors. What I have tried:
First I thought it was because I have an auto-increment primary id in the last mapping table which I didn't provide, but isn't it supposed to take care of itself since it's auto-incremented? However, I went ahead and tried adding Null, None, and 0, but nothing works.
Secondly I thought maybe it's because I'm not getting the values from the tables above, so I tried printing them out, and they show, so that works.
Any suggestions on what I am doing wrong here?
EDIT:
When I don't provide the primary key, I get the error:
The table has three columns but you provided only two values
and when I do provide it as None, Null, or 0, it says:
Parameter 0 is not supported probably because of unsupported type
I tried implementing the #abarnet way, but it still keeps saying parameter 0 is not supported:
connection = sqlite3.connect('WebInfrastructureScan.db')
with connection:
current = connection.cursor()
current.execute("SELECT ExtensionID FROM tblExtensionLookup where ExtensionName = ?",[ext])
Eid = current.fetchone()
print Eid
current.execute("SELECT HostID FROM tblHostLookup where HostName = ?",[host])
Hid = current.fetchone()
print Hid
current.execute("INSERT INTO tblExtensionHistory(HostID,ExtensionID) VALUES(?,?)",[Hid,Eid])
EDIT 2:
The database schema is:
Table 1:
CREATE TABLE tblHostLookup (
HostID INTEGER PRIMARY KEY AUTOINCREMENT,
HostName TEXT);
Table 2:
CREATE TABLE tblExtensionLookup (
ExtensionID INTEGER PRIMARY KEY AUTOINCREMENT,
ExtensionName TEXT);
Table 3:
CREATE TABLE tblExtensionHistory (
ExtensionHistoryID INTEGER PRIMARY KEY AUTOINCREMENT,
HostID INTEGER,
ExtensionID INTEGER,
FOREIGN KEY(HostID) REFERENCES tblHostLookup(HostID),
FOREIGN KEY(ExtensionID) REFERENCES tblExtensionLookup(ExtensionID));
It's hard to be sure without full details, but I think I can guess the problem.
If you use the INSERT statement without column names, the values must exactly match the columns as given in the schema. You can't skip over any of them.*
The right way to fix this is to just use the column names in your INSERT statement. Something like:
current.execute("INSERT INTO tblExtensionHistory (HostID, ExtensionID) VALUES (?,?)",
[Hid, Eid])
Now you can skip any columns you want (as long as they're autoincrement, nullable, or otherwise skippable, of course), or provide them in any order you want.
For your second problem, you're trying to pass in rows as if they were single values. You can't do that. From your code:
Eid = current.fetchone()
This will return something like:
(3,)
And then you try to bind that whole row to the ExtensionID column, which gives you an error.
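The minimal fix is to unpack the single column from each row before binding it (and likewise for Hid), guarding against a missing row:

row = current.fetchone()
Eid = row[0] if row is not None else None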
In the future, you may want to write and debug the SQL statements in the sqlite3 command-line tool and/or your favorite GUI database manager (there's a simple extension that runs in Firefox if you don't want anything fancy) and get them right before you try getting the Python right.
* This is not true with all databases. For example, in MSJET/Access, you must skip over autoincrement columns. See the SQLite documentation for how SQLite interprets INSERT with no column names, or similar documentation for other databases.
