Wrapping INSERT by BEGIN TRANSACTION and COMMIT - python

I have some code from my class. It's about making one database from another. There is a INSERT method. It takes really long time. I read the FAQ, and i know that I need to put BEGIN TRANSACTION and COMMIT around the multiple INSERT, but i swear, I tried every place of c.execute(''BEGIN TRANSACTION") and c.execute("COMMIT") - always same ca 5 kb/s. Please show me where is the proper place for those instruction, or tell me what else could be a problem.
For the record - I'm working with 5400 RPM hard drive.
Here is original code:
import sqlite3
conn = sqlite3.connect('/path/to/database.db')
c = conn.cursor()
with open('sqlite-sakila-schema.sql', 'r', encoding='utf-8') as create_file:
create_query = create_file.read()
with open('sqlite-sakila-insert-data.sql', 'r', encoding='utf-8') as insert_file:
insert_query = insert_file.read()
c.executescript(create_query)
c.executescript(insert_query)
conn.commit()
conn.close()
edited:
first file:
https://raw.githubusercontent.com/jOOQ/jOOQ/master/jOOQ-examples/Sakila/sqlite-sakila-db/sqlite-sakila-schema.sql
second one:
https://raw.githubusercontent.com/jOOQ/jOOQ/master/jOOQ-examples/Sakila/sqlite-sakila-db/sqlite-sakila-insert-data.sql
It is all about INSERTs. There is couple tables, whole sql file started with deleting from those tables, and then 231K lines of INSERTS code like below.
Insert into language
(language_id,name,last_update)
Values
('1','English','2006-02-15 05:02:19.000')
;
Insert into language
(language_id,name,last_update)
Values
('2','Italian','2006-02-15 05:02:19.000')
;
Insert into language
(language_id,name,last_update)
Values
('3','Japanese','2006-02-15 05:02:19.000')
;
Insert into language
(language_id,name,last_update)
Values
('4','Mandarin','2006-02-15 05:02:19.000')
;

Try combining your INSERT queries into a single query:
INSERT into language (language_id,name,last_update) VALUES
('1','English','2006-02-15 05:02:19.000'),
('2','Italian','2006-02-15 05:02:19.000'),
('3','Japanese','2006-02-15 05:02:19.000'),
('4','Mandarin','2006-02-15 05:02:19.000'),
...
;
SQLite has a limit on the size of a single query, which is the value of SQLITE_MAX_SQL_LENGTH and defaults to 1,000,000 bytes. So you'll need to either increase this limit or split this query up into groups that fit into the limit. Doing them in groups of something like 1,000 rows will probably make a noticeable difference.

Related

Using Python - How can I parse CSV list of both integers and strings and add to SQL table through Insert Statement?

I am automating a task through Python that will run an SQL statement to insert into an existing table in a DB.
My CSV headers look as such:
ID,ACCOUNTID,CATEGORY,SUBCATEGORY,CREATION_DATE,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE
My values:
seq_addnoteid.nextval,123456,TEST,ADMN_TEST,sysdate,ME,This is a test,Y,1,A
NOTE: Currently, seq_addnote works from DB but in my code i added a small snippet to get the max ID and the rows will increase this by one for each iteration.
Sysdate could also be passed as format '19-MAY-22'
If i was to run from DB, this would work:
insert into notes values(seq_addnoteid.nextval,'123456','TEST','ADMN_TEST',sysdate,'ME','This is a test','Y',1,'A');
# Snippet to get function
cursor.execute("SELECT MAX(ID) from NOTES")
max = cursor.fetchone()
max = int(max[0])
with open ('sample.csv', 'r') as f:
reader = csv.reader(f)
columns = next(reader)
query = 'INSERT INTO NOTES({0}) values ({1})'
query = query.format(','.join(columns), ','.join('?' * len(columns)))
cursor = conn.cursor()
for data in reader:
cursor.execute(query, data)
conn.commit()
print("Records inserted successfully")
cursor.close()
conn.close()
Currently, i'm getting Oracle-Error-Message: ORA-01036: illegal variable name/number and i think its because of my query.format line. However, I'm looking for help to get this code to handle the data types properly.
Thanks!
Try printing your query before you execute it. I think you'll find that it's printing this:
INSERT INTO NOTES(ID,ACCOUNTID,CATEGORY,SUBCATEGORY,CREATION_DATE,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE)
values(seq_addnoteid.nextval,123456,TEST,ADMN_TEST,sysdate,ME,This is a test,Y,1,A);
Which will also give you a ORA-01036 if you try to run it manually.
The problem is that you want some of your column values to be literal values, and some of them to be strings escaped in single-quotes, and your code doesn't do that. I don't think there's an easy to way to do it with ','.join(), so you'll either need to modify your CSVs to quote the strings, like:
seq_addnoteid.nextval,"'123456'","'TEST'","'ADMN_TEST'",sysdate,"'ME'","'This is a test'","'Y'",1,"'A'"
Or modify your query.format to add the quotes around the parameters that you want to treat as strings:
query.format(','.join(columns), "?,'?','?','?',?,'?','?','?',?,'?'")
As the commenters mentioned, pandas does handle this all very nicely.
EDIT: I see what you're saying. I'm not sure pandas will help with the literal functions you want to pass to the insert. But yes, you should be able to change your CSV and then do:
query.format(','.join(columns) + ',ID,CREATION_DATE', "'?','?','?','?','?','?',?,'?',seq_addnoteid.nextval,sysdate")
As a side note, a lot of people do this sort of thing on the database side in a BEFORE INSERT trigger, e.g.:
create or replace trigger NOTES_INS_TRG
before insert on NOTES
for each row
begin
:NEW.ID := seq_addnoteid.nextval;
:NEW.CREATION_DATE := sysdate;
end;
/
Then you could leave those columns out of your insert entirely.
Edit again:
I'm not sure you can use ? for bind/substitution variables in cx_oracle (see documentation ). So where your raw query is currently:
INSERT INTO NOTES(ACCOUNTID,CATEGORY,SUBCATEGORY,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE,ID,CREATION_DATE)
values (seq_addnoteid.nextval,sysdate,'?','?','?','?','?','?',?,'?')
You'd need something like:
INSERT INTO NOTES(ACCOUNTID,CATEGORY,SUBCATEGORY,CREATED_BY,REMARK,ISIMPORTANT,TYPE,ENTITY_TYPE,ID,CREATION_DATE)
values (seq_addnoteid.nextval,sysdate,:1,:2,:3,:4,:5,:6,:7,:8)
We can probably do that by modifying the format string again to generate some bind variables:
query.format('ID,CREATION_DATE,' + ','.join(columns),
"seq_addnoteid.nextval,sysdate," + ','.join([':'+c for c in columns])
Again, try printing the query before executing it to make sure the column names and values are lining up correctly.

query of sqlite3 with python using '?'

I have a table of three columnsid,word,essay.I want to do a query using (?). The sql sentence is sql1 = "select id,? from training_data". My code is below:
def dbConnect(db_name,sql,flag):
conn = sqlite3.connect(db_name)
cursor = conn.cursor()
if (flag == "danci"):
itm = 'word'
elif flag == "wenzhang":
itm = 'essay'
n = cursor.execute(sql,(itm,))
res1 = cursor.fetchall()
return res1
However, when I print dbConnect("data.db",sql1,"danci")
The result I obtained is [(1,'word'),(2,'word'),(3,'word')...].What I really want to get is [(1,'the content of word column'),(2,'the content of word column')...]. What should I do ? Please give me some ideas.
You can't use placeholders for identifiers -- only for literal values.
I don't know what to suggest in this case, as your function takes a database nasme, an SQL string, and a flag to say how to modify that string. I think it would be better to pass just the first two, and write something like
sql = {
"danci": "SELECT id, word FROM training_data",
"wenzhang": "SELECT id, essay FROM training_data",
}
and then call it with one of
dbConnect("data.db", sql['danci'])
or
dbConnect("data.db", sql['wenzhang'])
But a lot depends on why you are asking dbConnect to decide on the columns to fetch based on a string passed in from outside; it's an unusual design.
Update - SQL Injection
The problems with SQL injection and tainted data is well documented, but here is a summary.
The principle is that, in theory, a programmer can write safe and secure programs as long as all the sources of data are under his control. As soon as they use any information from outside the program without checking its integrity, security is under threat.
Such information ranges from the obvious -- the parameters passed on the command line -- to the obscure -- if the PATH environment variable is modifiable then someone could induce a program to execute a completely different file from the intended one.
Perl provides direct help to avoid such situations with Taint Checking, but SQL Injection is the open door that is relevant here.
Suppose you take the value for a database column from an unverfied external source, and that value appears in your program as $val. Then, if you write
my $sql = "INSERT INTO logs (date) VALUES ('$val')";
$dbh->do($sql);
then it looks like it's going to be okay. For instance, if $val is set to 2014-10-27 then $sql becomes
INSERT INTO logs (date) VALUES ('2014-10-27')
and everything's fine. But now suppose that our data is being provided by someone less than scrupulous or downright malicious, and your $val, having originated elsewhere, contains this
2014-10-27'); DROP TABLE logs; SELECT COUNT(*) FROM security WHERE name != '
Now it doesn't look so good. $sql is set to this (with added newlines)
INSERT INTO logs (date) VALUES ('2014-10-27');
DROP TABLE logs;
SELECT COUNT(*) FROM security WHERE name != '')
which adds an entry to the logs table as before, end then goes ahead and drops the entire logs table and counts the number of records in the security table. That isn't what we had in mind at all, and something we must guard against.
The immediate solution is to use placeholders ? in a prepared statement, and later passing the actual values in a call to execute. This not only speeds things up, because the SQL statement can be prepared (compiled) just once, but protects the database from malicious data by quoting every supplied value appropriately for the data type, and escaping any embedded quotes so that it is impossible to close one statement and another open another.
This whole concept was humourised in Randall Munroe's excellent XKCD comic

Remove all data from table but last N entries

I'm using psycopg2 with Python.
I'd like to periodically flush data from my db. I've set up a task with Timer for this. I had asked this question before, but using the answer listed there freezes up my machine (keyboard stops responding and entire system grinds to halt). Instead, I would like to delete all entries in my table albeit the last N (Not sure that this is the right approach either).
Basically, there is another python process that is running (separate executable), which is populating the db that I wish to interrogate. It seems that if I delete all entries, and that other process is running, that it can lead to the freeze. I don't know of a safe way in which I can remove entries; it's almost as if the other process is relying on an incrementing ID as it writes to the db.
If anyone could help me work this out it'd be greatly appreciated. Thoughts?
A possible solution is to run a DELETE on all ids except those returned by select ... order by pk desc limit N given an autoincremental pk. If no such pk exists, having a created_date and ordering by it should do the same.
Non tested example:
import psycopg2
connection = psycopg2.connect('dbname=test user=postgres')
cursor = conn.cursor()
query = 'delete from my_table where id not in (
select id from my_table order by id desc limit 30)'
cursor.execute(query)
cursor.commit() #Don't know if necessary
cursor.close()
connection.close()
This is probably much faster:
CRETE TEMP TABLE tbl_tmp AS
SELECT * FROM tbl ORDER BY <undisclosed> LIMIT <N>;
TRUNCATE TABLE tbl;
INSERT INTO tbl SELECT * FROM tbl_tmp;
Do it all in one session. Specifics depend on additional circumstances you did not disclose.
Compare to this related, comprehensive answer (your case is simpler):
Remove duplicates from table based on multiple criteria and persist to other table

How to determine if field exists in same table in another field in any other row

I am having troubles finding out if I can even do this. Basically, I have a csv file that looks like the following:
1111,804442232,1
1112,312908721,1
1113,A*2434,1
1114,A*512343128760987,1
1115,3512748,1
1116,1111,1
1117,1234,1
This is imported into a sqlite database in memory for manipulation. I will be importing multiple files into this database after some manipulation. Sqlite is allowing me to keep constraints on the tables and receive errors where needed without creating additional functions just to check each constraint while using arrays in python. I want to do a few things but the first of which is to prepend field2 where all field2 strings match an entry in field1.
For example, in the above data field2 in entry 6 matches entry 1. In this case I would like to prepend field2 in entry 6 with '555'
If this is not possible I do believe I could make do using a regex and just do this on every row with 4 digits in field2... though... I have yet to successfully get REGEX working using python/sqlite as it always throws me an error.
I am working within Python using Sqlite3 to connect/manipulate my sqlite database.
EDIT: I am looking for a method to manipulate the resultant tables which reside in a sqlite database rather than manipulating just the csv data. The data above is just a simple representation of what is contained in the files I am working with. Would it be better to work with arrays containing the data from the csv files? These files have 10,000+ entries and about 20-30 columns.
If you must do it in SQLite, how about this:
First, get the column names of the table by running the following and parsing the result
def get_columns(table_name, cursor):
cursor.execute('pragma table_info(%s)' % table_name)
return [row[1] for row in cursor]
conn = sqlite3.connect('test.db')
columns = get_columns('test_table',conn.cursor())
For each of those columns, run the following update, that does your prepending
def prepend(column, reference, prefix, cursor):
query = '''
UPDATE %s
SET %s = 'prefix' || %s
WHERE %s IN (SELECT %s FROM %s)
''' % (table, column, column, column, reference, table)
cursor.execute(query)
reference = 'field1'
[prepend('test_table', column, reference, '555', conn.cursor())
for column in columns
if column != reference]
Note that this is expensive: O(n^2) for each column you want to do it for.
As per your edit and Nathan's answer, it might be better to simply work with python's builtin datastructures. You can always insert it into SQLite after.
10,000 entries is not really much so it might not matter in the end. It all depends on your reason for requiring it to be done in SQLite (which we don't have much visibility of).
There is no need to use regex expressions to do this, just throw the contents from the first column into a set and then iterate through the rows and update the second field.
first_col_values = set(row[0] for row in rows)
for row in rows:
if row[1] in first_col_values:
row[1] = '555' + row[1]
So... I found the answer to my own question after a ridiculous amount of my own searching and trial and error. My unfamiliarity with SQL had me stumped as I was trying all kinds of crazy things. In the end... this was the simple type of solution I was looking for:
prefix="555"
cur.execute("UPDATE table SET field2 = %s || field2 WHERE field2 IN (SELECT field1 FROM table)"% (prefix))
I kept the small amount of python in there but what I was looking for was the SQL statement. Not sure why nobody else came up with something that simple =/. Unsatisfied with the answers so far, I had been searching far and wide for this simple line >_<.

Combine inserts into one transaction Python SQLite3

I am trying to input 1000's of rows on SQLite3 with insert however the time it takes to insert is way too long. I've heard speed is greatly increased if the inserts are combined into one transactions. However, i cannot seem to get SQlite3 to skip checking that the file is written on the hard disk.
this is a sample:
if repeat != 'y':
c.execute('INSERT INTO Hand (number, word) VALUES (null, ?)', [wordin[wordnum]])
print wordin[wordnum]
data.commit()
This is what i have at the begining.
data = connect('databasenew')
data.isolation_level = None
c = data.cursor()
c.execute('begin')
However, it does not seem to make a difference. A way to increase the insert speed would be much appreciated.
According to Sqlite documentation, BEGIN transaction should be ended with COMMIT
Transactions can be started manually using the BEGIN command. Such
transactions usually persist until the next COMMIT or ROLLBACK
command. But a transaction will also ROLLBACK if the database is
closed or if an error occurs and the ROLLBACK conflict resolution
algorithm is specified. See the documentation on the ON CONFLICT
clause for additional information about the ROLLBACK conflict
resolution algorithm.
So, your code should be like this:
data = connect('databasenew')
data.isolation_level = None
c = data.cursor()
c.execute('begin')
if repeat != 'y':
c.execute('INSERT INTO Hand (number, word) VALUES (null,?)', [wordin[wordnum]])
print wordin[wordnum]
data.commit()
c.execute('commit')
https://stackoverflow.com/a/3689929/1147726 answers the question. execute('begin') does not have any effect. Apparently, a connection.commit() is sufficient.
(In case someone is still looking for an answer to this)
You should use executemany if you are just doing 1000's of inserts successively.
Look at What is the optimized way to insert large number of records (more than 40,000) in sqlite3
I just struggled with a LOT (order millions) of execute's that were taking about 30 minutes to complete - Switched to executemany and I now have it down to about 10 minutes.
You can use executemany, see this SO question: python sqlite question - Insert method

Categories

Resources