I'm using psycopg2 with Python.
I'd like to periodically flush data from my db. I've set up a task with a Timer for this. I had asked this question before, but using the answer listed there freezes up my machine (the keyboard stops responding and the entire system grinds to a halt). Instead, I would like to delete all entries in my table except the last N (not sure this is the right approach either).
Basically, there is another Python process running (a separate executable) that is populating the db I wish to interrogate. It seems that if I delete all entries while that other process is running, it can lead to the freeze. I don't know of a safe way to remove entries; it's almost as if the other process relies on an incrementing ID as it writes to the db.
If anyone could help me work this out it'd be greatly appreciated. Thoughts?
A possible solution is to run a DELETE on all ids except those returned by select ... order by pk desc limit N, given an auto-incrementing pk. If no such pk exists, having a created_date column and ordering by it should do the same.
Untested example:

import psycopg2

connection = psycopg2.connect('dbname=test user=postgres')
cursor = connection.cursor()
query = """delete from my_table where id not in
           (select id from my_table order by id desc limit 30)"""
cursor.execute(query)
connection.commit()  # commit() belongs to the connection, not the cursor
cursor.close()
connection.close()
This is probably much faster:
CREATE TEMP TABLE tbl_tmp AS
SELECT * FROM tbl ORDER BY <undisclosed> LIMIT <N>;
TRUNCATE TABLE tbl;
INSERT INTO tbl SELECT * FROM tbl_tmp;
Do it all in one session. Specifics depend on additional circumstances you did not disclose.
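For reference, a minimal psycopg2 sketch of running those statements in a single session/transaction; the table name tbl, the ordering column id, and the limit of 30 are placeholders standing in for the undisclosed specifics:

import psycopg2

conn = psycopg2.connect('dbname=test user=postgres')
with conn:  # one transaction: commits on success, rolls back on error
    with conn.cursor() as cur:
        cur.execute("CREATE TEMP TABLE tbl_tmp AS "
                    "SELECT * FROM tbl ORDER BY id DESC LIMIT 30")
        cur.execute("TRUNCATE TABLE tbl")
        cur.execute("INSERT INTO tbl SELECT * FROM tbl_tmp")
conn.close()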
Compare to this related, comprehensive answer (your case is simpler):
Remove duplicates from table based on multiple criteria and persist to other table
I'm using server-side cursor in PostgreSQL with psycopg2, based on this well-explained answer.
with conn.cursor(name='name_of_cursor') as cursor:
    query = "SELECT * FROM tbl FOR UPDATE"
    cursor.execute(query)
    for row in cursor:
        pass  # process row
In processing each row, I'd like to update a few fields in the row using PostgreSQL's UPDATE tbl SET ... WHERE CURRENT OF name_of_cursor (docs), but it seems that, by the time the for loop yields row, the server-side cursor is positioned at a different record, so while I can run the command, the wrong record is updated.
How can I make sure the result iterator is in the same position as the cursor? (also preferably in a way that won't make the loop slower than updating using an ID)
The reason a different record was being updated is that psycopg2 internally does a FETCH FORWARD 1000 (or whatever the default chunk size is), which leaves the cursor positioned at the end of that block. You can override this by fetching one record at a time:
updcursor = conn.cursor()

with conn.cursor(name='name_of_cursor') as cursor:
    cursor.itersize = 1  # make the server-side cursor stay in the same position as the iterator
    cursor.execute('SELECT * FROM tbl FOR UPDATE')
    for row in cursor:
        # process row...
        updcursor.execute('UPDATE tbl SET fld1 = %s WHERE CURRENT OF name_of_cursor', [val])
The snippet above will update the correct record. Note that you cannot use the same cursor for selecting and updating; they must be different cursors.
Performance
Reducing the FETCH size to 1 hurts retrieval performance considerably. I definitely wouldn't recommend this technique if you're iterating over a large dataset (which is probably why you're reaching for server-side cursors in the first place) from a different host than the PostgreSQL server.
I ended up using a combination of exporting records to CSV, then importing them later using COPY FROM (with the function copy_expert).
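For illustration, a rough sketch of that CSV round trip with copy_expert; the file name and the table name tbl are placeholders, not names from the original setup:

import psycopg2

conn = psycopg2.connect('dbname=test user=postgres')
cur = conn.cursor()

# Export the records to a CSV file.
with open('tbl_export.csv', 'w') as f:
    cur.copy_expert("COPY tbl TO STDOUT WITH CSV HEADER", f)

# Later, load them back with COPY FROM.
with open('tbl_export.csv', 'r') as f:
    cur.copy_expert("COPY tbl FROM STDIN WITH CSV HEADER", f)
conn.commit()

cur.close()
conn.close()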
I'm completely stumped.
The query looks something like this:
WITH e AS(
INSERT INTO TEAMS(TEAM_NAME, SPORT_ID, TEAM_GENDER)
VALUES ('Cameroon U23','1','M')
ON CONFLICT (TEAM_NAME, SPORT_ID, TEAM_GENDER)
DO NOTHING
RETURNING TEAM_ID
)
SELECT * FROM e
UNION
SELECT TEAM_ID FROM TEAMS WHERE LOWER(TEAM_NAME)=LOWER('Cameroon U23') AND SPORT_ID='1' AND LOWER(TEAM_GENDER)=LOWER('M');
And the python code like this:
sqlString = """WITH e AS(
INSERT INTO TEAMS(TEAM_NAME, SPORT_ID, TEAM_GENDER)
VALUES (%s,%s,%s)
ON CONFLICT (TEAM_NAME, SPORT_ID, TEAM_GENDER)
DO NOTHING
RETURNING TEAM_ID
)
SELECT * FROM e
UNION
SELECT TEAM_ID FROM TEAMS WHERE LOWER(TEAM_NAME)=LOWER(%s) AND SPORT_ID=%s AND LOWER(TEAM_GENDER)=LOWER(%s);"""
cur.execute(sqlString, (TEAM_NAME, SPORT_ID, TEAM_GENDER, TEAM_NAME, SPORT_ID, TEAM_GENDER,))
fetch = cur.fetchone()[0]
The error that I get is on "cur.fetchone()[0]", because "cur.fetchone()" doesn't return any rows for some reason. I have also tried "cur.fetchall()", but it's the same issue.
This query works every time without fail in the normal postgres shell. However, in my Python code using psycopg2, it will sometimes error out and not return anything. When I check the DB from the shell, the data I am looking for is there, so it's the SELECT that should be returning something but isn't.
I am not sure if this is relevant, but I am creating concurrent connections (not connection pools) and doing multiple of these queries at once. Each query has a different team, however, to prevent deadlock.
I have found the issue. It was down to my use of concurrency. I was wrong in saying that each query has a different team; the teams might sometimes be the same.
But the main issue was occurring because my INSERT would try to put some data in and find a duplicate, because a concurrent query was also trying to put the same data in. But then, for some reason, the SELECT wouldn't find that data (most likely because the conflicting row was committed by the other transaction after my statement's snapshot was taken, so the SELECT couldn't see it yet). I don't know exactly what the issue is, but that's my understanding.
I had to change to doing a SELECT, checking if there was a result, doing an INSERT if there wasn't, and then doing a final SELECT if the INSERT didn't return anything. The INSERT sometimes does not return anything because it hits a conflict with a row that appeared after the first SELECT was executed.
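A minimal sketch of that select-then-insert-then-select fallback, assuming the same TEAMS table and a psycopg2 cursor cur as in the question (the helper function name is mine):

def get_or_create_team(cur, name, sport_id, gender):
    select_sql = ("SELECT TEAM_ID FROM TEAMS "
                  "WHERE LOWER(TEAM_NAME) = LOWER(%s) AND SPORT_ID = %s "
                  "AND LOWER(TEAM_GENDER) = LOWER(%s)")
    cur.execute(select_sql, (name, sport_id, gender))
    row = cur.fetchone()
    if row:
        return row[0]

    cur.execute("INSERT INTO TEAMS (TEAM_NAME, SPORT_ID, TEAM_GENDER) "
                "VALUES (%s, %s, %s) "
                "ON CONFLICT (TEAM_NAME, SPORT_ID, TEAM_GENDER) DO NOTHING "
                "RETURNING TEAM_ID",
                (name, sport_id, gender))
    row = cur.fetchone()
    if row:
        return row[0]

    # The INSERT hit a conflict with a row that appeared after the first SELECT,
    # so look the team up one more time.
    cur.execute(select_sql, (name, sport_id, gender))
    return cur.fetchone()[0]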
EDIT:
Never mind. The problem was, in fact, that my deadlock_timeout was too low. My program wasn't actually reaching deadlock (where two processes are waiting on each other and neither can proceed because each depends on the other finishing). So increasing deadlock_timeout to be larger than the average time one of my processes takes to complete was the solution.
THIS WILL NOT WORK if your program is actually reaching deadlock. In that case, fix it, because it should never be reaching deadlock.
Hope this helps someone.
The statement is set up so that when a record already exists it doesn't add a record; otherwise, it does.
I've tried changing the query, even though I don't see anything wrong with it.
I've let the script run in Python and printed the query it executed. Then I pasted that query into phpMyAdmin, where it executed successfully.
I have also double checked all parameters.
Query (blank params):
INSERT INTO users (uname,pass) SELECT * FROM (SELECT '{}','{}') AS tmp WHERE NOT EXISTS(SELECT uname FROM users WHERE uname = '{}') LIMIT 1;
Query (filled in parameters):
INSERT INTO users (uname,pass) SELECT * FROM (SELECT 'john_doe','password') AS tmp WHERE NOT EXISTS(SELECT uname FROM users WHERE uname = 'john_doe') LIMIT 1;
Python script (the important part)
if action == "add_user":
username = form.getvalue('username')
password = form.getvalue('password')
query = """
INSERT INTO users (uname,pass) SELECT * FROM
(SELECT '{}','{}') AS tmp WHERE NOT EXISTS(SELECT uname FROM users WHERE uname = '{}') LIMIT 1;
""".format(username, password, username)
mycursor.execute(query)
I know a couple of things.
There is nothing wrong with the database connection.
The parameters are not empty (ex. username="john_doe" & password="secret")
The query actually executes in that specific table.
The query seems to add a record and delete it directly afterwards (as AUTO_INCREMENT increases each time, even when the python script executes and doesn't add anything)
A try/except doesn't do anything, as mysql.connector.Error doesn't report any error (which is obvious, since the query actually executes successfully)
phpMyAdmin practical example:
(Removed INSERT INTO part in order to be able to show the resulting tables)
The first time you enter the query (the query above as an example), it will result in a table with the values appearing both as the column names and as the column values.
Screenshot of table output: http://prntscr.com/nkgaka
Once that result has been entered once, the next time you try to insert it the result will contain only the column names and no values. This means nothing is inserted, as there are no actual values to insert.
Screenshot of table output: http://prntscr.com/nkgbp3
Help is greatly appreciated.
If you want to ensure a field is unique in a table, make that field a PRIMARY KEY or UNIQUE KEY. In this case you want uname to be unique.
CREATE UNIQUE INDEX uname ON users(uname)
With this in place you only need to INSERT. If a duplicate key error occurs, the uname already exists.
To avoid SQL injection, don't build queries with """SELECT {}""".format(); that approach is vulnerable to SQL injection. Pass the values as query parameters to cursor.execute() instead.
Also, never store plain-text passwords. Use salted hashes at least. There are plenty of frameworks that already do this well, so you don't need to invent your own.
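A minimal mysql-connector sketch of that approach: a parameterized INSERT that relies on the unique key and treats a duplicate-key error as "already exists". The connection settings, username, and hashed_password values are placeholders, not values from the original code:

import mysql.connector
from mysql.connector import errorcode

conn = mysql.connector.connect(user='dbuser', password='dbpass', database='mydb')
cur = conn.cursor()

username, hashed_password = 'john_doe', '<salted hash>'  # placeholder values

try:
    # %s placeholders let the driver escape the values, which avoids SQL injection.
    cur.execute("INSERT INTO users (uname, pass) VALUES (%s, %s)",
                (username, hashed_password))
    conn.commit()
except mysql.connector.Error as err:
    if err.errno == errorcode.ER_DUP_ENTRY:
        pass  # uname already exists, so there is nothing to insert
    else:
        raise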
I wrote this Python script to import a specific xls file into MySQL. It works fine, but if it's run twice on the same data it will create duplicate entries. I'm pretty sure I need to use a MySQL JOIN but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...
for row in range(sheet.nrows):
    """name is in the 0th col. email is the 4th col."""
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()

for n, e in sorted(mailing_list.iteritems()):
    rows.append((n, e))

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany("""
    INSERT IGNORE INTO mailing_list (name, email) VALUES (%s,%s)""", rows)
CLARIFICATION...
I read here that...
To be sure, executemany() is effectively the same as simple iteration.
However, it is typically faster. It provides an optimized means of
affecting INSERT and REPLACE across multiple rows.
Also, I took Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better than ON DUPLICATE KEY UPDATE because I want it to fail silently.
TL;DR
1. What's the best way to prevent duplicate inserts?
ANSWER 1: a UNIQUE constraint on the column, combined with INSERT IGNORE to fail silently, or ON DUPLICATE KEY UPDATE to update the existing row on a duplicate.
2. Is executemany() as expensive as INSERT in a loop?
@Unode says it's not, but my research tells me otherwise. I would like a definitive answer.
3. Is this the best way, or is it going to be really slow with bigger tables, and how would I test to be sure?
1 - What's the best way to prevent duplicate inserts?
Depending on what "preventing" means in your case, you have two strategies and one requirement.
The requirement is that you add a UNIQUE constraint on the column/columns that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However given you are using executemany the outcome may not be what you would expect.
Then as strategies you can do:
An initial filter step: run a SELECT statement beforehand. This means running one SELECT per item in your rows to check whether it already exists. This strategy works but is inefficient.
Using ON DUPLICATE KEY UPDATE: this automatically triggers an update if the row already exists. For more information refer to the official documentation; a sketch follows after this list.
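For illustration, a minimal sketch of the ON DUPLICATE KEY UPDATE strategy combined with executemany, assuming the mailing_list table from the question has a UNIQUE key on name; the connection values and the rows data are placeholders:

import MySQLdb

db = MySQLdb.connect(host='localhost', user='dbuser', passwd='dbpass', db='mydb')
cursor = db.cursor()

rows = [('Alice', 'alice@example.com'), ('Bob', 'bob@example.com')]  # example data

# One bulk statement: new names are inserted, existing names get their email updated.
cursor.executemany(
    """INSERT INTO mailing_list (name, email) VALUES (%s, %s)
       ON DUPLICATE KEY UPDATE email = VALUES(email)""",
    rows)
db.commit()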
2 - Is executemany() as expensive as INSERT in a loop?
No. executemany creates one query that inserts in bulk, whereas a for loop issues as many queries as there are elements in your rows.
I am trying to insert thousands of rows into SQLite3 with INSERT, but the time it takes is way too long. I've heard speed is greatly increased if the inserts are combined into one transaction. However, I cannot seem to get SQLite3 to skip checking that the file is written to the hard disk.
This is a sample:
if repeat != 'y':
    c.execute('INSERT INTO Hand (number, word) VALUES (null, ?)', [wordin[wordnum]])
    print wordin[wordnum]
data.commit()
This is what I have at the beginning:
data = connect('databasenew')
data.isolation_level = None
c = data.cursor()
c.execute('begin')
However, it does not seem to make a difference. Any way to increase the insert speed would be much appreciated.
According to the SQLite documentation, a transaction started with BEGIN should be ended with COMMIT:
Transactions can be started manually using the BEGIN command. Such
transactions usually persist until the next COMMIT or ROLLBACK
command. But a transaction will also ROLLBACK if the database is
closed or if an error occurs and the ROLLBACK conflict resolution
algorithm is specified. See the documentation on the ON CONFLICT
clause for additional information about the ROLLBACK conflict
resolution algorithm.
So, your code should be like this:
data = connect('databasenew')
data.isolation_level = None
c = data.cursor()
c.execute('begin')

if repeat != 'y':
    c.execute('INSERT INTO Hand (number, word) VALUES (null,?)', [wordin[wordnum]])
    print wordin[wordnum]

data.commit()
c.execute('commit')
https://stackoverflow.com/a/3689929/1147726 answers the question. execute('begin') does not have any effect. Apparently, a connection.commit() is sufficient.
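In other words, with the default isolation level you can let sqlite3 open its implicit transaction and commit once at the end. A minimal sketch, with the loop and word list purely illustrative:

import sqlite3

conn = sqlite3.connect('databasenew')  # default isolation_level: implicit BEGIN before the first INSERT
c = conn.cursor()

words = ['alpha', 'beta', 'gamma']  # example data
for word in words:
    # All of these INSERTs accumulate in a single implicit transaction.
    c.execute('INSERT INTO Hand (number, word) VALUES (null, ?)', (word,))

conn.commit()  # one commit at the end writes everything to disk at once
conn.close()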
(In case someone is still looking for an answer to this)
You should use executemany if you are just doing thousands of inserts in succession.
Look at What is the optimized way to insert large number of records (more than 40,000) in sqlite3
I just struggled with a LOT (on the order of millions) of execute calls that were taking about 30 minutes to complete. I switched to executemany and now have it down to about 10 minutes.
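A minimal sqlite3 sketch of the executemany approach for this table (the word list is illustrative):

import sqlite3

conn = sqlite3.connect('databasenew')
c = conn.cursor()

words = [('alpha',), ('beta',), ('gamma',)]  # one tuple per row to insert

# A single executemany call runs inside one transaction, which is far faster
# than committing each INSERT individually.
c.executemany('INSERT INTO Hand (number, word) VALUES (null, ?)', words)
conn.commit()
conn.close()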
You can use executemany, see this SO question: python sqlite question - Insert method