Until now our application has been using a single SQLite database with SQLObject as the ORM. We always knew that at some point we would have to face SQLite's concurrency problem, and that time has come.
We ended up splitting the single database into multiple databases: each table schema stays the same, but the tables are distributed across several databases, keeping tightly coupled tables together.
This works very well for a clean install of the new version of our application, but upgrading from a previous version to this new version requires a special data migration before the application can start working. In this case the migration simply means moving the tables from the single database into the appropriate new databases.
To illustrate, this is the older structure:
single_db.db --- A single db
* A -- Table A
* B -- Table B
* C -- Table C
* D -- Table D
* E -- Table E
* F -- Table F
The new structure:
db1.db --- Database 1
- A -- Table A
- B -- Table B
- C -- Table C
- D -- Table D
db2.db --- Database 2
- E -- Table E
db3.db --- Database 3
- F -- Table F
When the upgrade happens, our application will create the new structure: the three databases above, each with empty tables. The older database single_db.db, with all the tables and the actual data, will also still be there. Before the application can begin working, it has to move the tables, or rather copy the data, from each table in the older database into the corresponding table in the corresponding new database.
I will need to write the code for this migration. I know I can query a table using the older database connection and insert the returned rows into the corresponding table using the newer database connection. One caveat I should mention is that some of these tables can contain a large number of rows: two or three of them have up to 2-2.5 million rows.
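Roughly, the straightforward version of that looks like the sketch below (plain sqlite3 connections rather than SQLObject ones, file and table names from the example above, batched with fetchmany so the 2+ million row tables don't have to fit in memory):

import sqlite3

def copy_table(src_path, dest_path, table, batch_size=10000):
    """Copy all rows of `table` from the old database into the same table in the new one."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    cur = src.execute("SELECT * FROM %s" % table)
    placeholders = ",".join(["?"] * len(cur.description))
    insert_sql = "INSERT INTO %s VALUES (%s)" % (table, placeholders)
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        dest.executemany(insert_sql, rows)
    dest.commit()
    src.close()
    dest.close()

# tables A-D go to db1.db, E to db2.db, F to db3.db
for table, dest_db in [('A', 'db1.db'), ('B', 'db1.db'), ('C', 'db1.db'),
                       ('D', 'db1.db'), ('E', 'db2.db'), ('F', 'db3.db')]:
    copy_table('single_db.db', dest_db, table)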
So I want to ask whether there are any other SQLObject tricks I can use, since I am using SQLObject on top of SQLite, and whether anyone has done this before.
Thanks for your help.
I realise you probably solved this by now, but for anyone googling: I had to do almost exactly the same as the OP. This was the core part of the code that I used (it's modified from something I found, but I can't find it again to credit the original author, apologies!):
def _iterdump(connection, table_name):
    """
    Returns an iterator to dump a database table in SQL text format.
    """
    cu = connection.cursor()
    yield('BEGIN TRANSACTION;')

    # sqlite_master table contains the SQL CREATE statements for the database.
    q = """
        SELECT name, type, sql
        FROM sqlite_master
        WHERE sql NOT NULL AND
              type == 'table' AND
              name == :table_name
        """
    schema_res = cu.execute(q, {'table_name': table_name})
    for table_name, type, sql in schema_res.fetchall():
        if table_name == 'sqlite_sequence':
            yield('DELETE FROM sqlite_sequence;')
        elif table_name == 'sqlite_stat1':
            yield('ANALYZE sqlite_master;')
        elif table_name.startswith('sqlite_'):
            continue
        else:
            yield('%s;' % sql)

        # Build the insert statement for each row of the current table
        res = cu.execute("PRAGMA table_info('%s')" % table_name)
        column_names = [str(table_info[1]) for table_info in res.fetchall()]
        q = "SELECT 'INSERT INTO \"%(tbl_name)s\" VALUES("
        q += ",".join(["'||quote(" + col + ")||'" for col in column_names])
        q += ")' FROM '%(tbl_name)s'"
        query_res = cu.execute(q % {'tbl_name': table_name})
        for row in query_res:
            yield("%s;" % row[0])
If you pass this generator the sqlite connection for the original db and the name of a table in that db, it will yield SQL statements that you can execute on the sqlite connection for the new db.
When I did this I also did a count of rows on all the tables first, and incremented a counter as I executed the INSERT lines, so I could show progress on the migration.
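To make that concrete, the driving loop looked more or less like this (the file names are just the ones from the question; since the OP's new databases already contain empty tables, the CREATE statement is skipped):

import sqlite3

old_conn = sqlite3.connect('single_db.db')
new_conn = sqlite3.connect('db1.db')

total = old_conn.execute('SELECT COUNT(*) FROM A').fetchone()[0]
done = 0
for statement in _iterdump(old_conn, 'A'):
    # the destination table already exists, so don't replay the CREATE statement
    if statement.startswith('CREATE'):
        continue
    new_conn.execute(statement)
    if statement.startswith('INSERT'):
        done += 1
        if done % 10000 == 0:
            print('%d / %d rows migrated' % (done, total))
new_conn.commit()
old_conn.close()
new_conn.close()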
Below is my code that I'd like some help with.
I have to run it over 1,300,000 rows, and it takes up to 40 minutes to insert ~300,000 rows.
I figure bulk insert is the route to go to speed it up?
Or is it because I'm iterating over the rows via for data in reader: portion?
# Opens the prepped csv file
with open(os.path.join(newpath, outfile), 'r') as f:
    # hooks csv reader to file
    reader = csv.reader(f)
    # pulls out the columns (which match the SQL table)
    columns = next(reader)
    # trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    # starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    # puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print 'Query is: %s' % query

    # starts cursor from cnxn (which works)
    cursor = cnxn.cursor()
    # uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()
I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).
SpikeData123 is the table name.
As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.
pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
crsr.fast_executemany = True # new in pyodbc 4.0.19
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... reduced the execution time to just over 1 second.
Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.
Update - April 2019: As noted in the comment from #SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).
BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...
1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07
... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...
1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07
... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...
1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07
... or a tab-delimited file (where → represents the tab character) ...
1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07
... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:
import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
FIELDTERMINATOR='\\t',
ROWTERMINATOR='\\n'
);
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.
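As an aside, the pre-processing step mentioned above (rewriting a quoted CSV as a tab-delimited file that BULK INSERT can cope with) can be done with Python's csv module. A minimal sketch with hypothetical file paths, assuming the data itself contains no tabs or embedded newlines:

import csv

# read the comma-delimited, quote-qualified file and write it back out tab-delimited
with open('C:\\__tmp\\source.csv', 'r', newline='') as fin, \
        open('C:\\__tmp\\biTest.txt', 'w', newline='') as fout:
    reader = csv.reader(fin)   # handles the "quoted, embedded comma" values
    writer = csv.writer(fout, delimiter='\t', quoting=csv.QUOTE_NONE, escapechar='\\')
    for row in reader:
        writer.writerow(row)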
Yes, bulk insert is the right path for loading large files into a DB. At a glance I would say that the reason it takes so long is, as you mentioned, that you are looping over each row of data from the file, which effectively removes the benefit of using a bulk insert and makes it behave like a normal insert. Just remember that, as its name implies, it is used to insert chunks of data.
I would remove the loop and try again.
Also, I'd double-check your syntax for BULK INSERT, as it doesn't look correct to me. Check the SQL that is generated by pyodbc, as I have a feeling it might only be executing a normal insert.
Alternatively, if it is still slow, I would try using BULK INSERT directly from SQL: either load the whole file into a temp table with BULK INSERT and then insert the relevant columns into the right tables, or use a mix of BULK INSERT and bcp to get the specific columns inserted, or OPENROWSET.
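As a rough illustration of the temp/staging table idea (the staging table, its columns, and the file path are made up, and this assumes the file is readable by the SQL Server instance):

import pyodbc

cnxn = pyodbc.connect("DSN=myDb_SQLEXPRESS;")
crsr = cnxn.cursor()

# 1) load the whole file into a staging table whose schema matches the file
crsr.execute("""
    BULK INSERT dbo.SpikeData123_staging
    FROM 'C:\\__tmp\\biTest.txt'
    WITH (FIELDTERMINATOR='\\t', ROWTERMINATOR='\\n');
""")

# 2) copy just the relevant columns into the real table
crsr.execute("""
    INSERT INTO dbo.SpikeData123 (col1, col2, col3)
    SELECT col1, col2, col3
    FROM dbo.SpikeData123_staging;
""")

cnxn.commit()
crsr.close()
cnxn.close()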
This problem was frustrating me, and I didn't see much improvement using fast_executemany until I found this post on SO, specifically Bryan Bailliache's comment regarding max varchar. I had been using SQLAlchemy, and even ensuring better datatype parameters did not fix the issue for me; however, switching to pyodbc did. I also took Michael Moura's advice of using a temp table and found it shaved off even more time. I wrote a function in case anyone might find it useful. I wrote it to take either a list or a list of lists for the insert. My insert of the same data, which with SQLAlchemy and Pandas to_sql sometimes took upwards of 40 minutes, went down to just under 4 seconds. I may have been misusing my former method, though.
Connection:
def mssql_conn():
    conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                          server=os.environ.get('MS_SQL_SERVER'),
                          database='EHT',
                          uid=os.environ.get('MS_SQL_UN'),
                          pwd=os.environ.get('MS_SQL_PW'),
                          autocommit=True)
    return conn
Insert function:
def mssql_insert(table, val_lst, truncate=False, temp_table=False):
    '''Use as direct connection to database to insert data, especially for
       large inserts. Takes either a single list (for one row),
       or a list of lists (for multiple rows). Can either append to the table
       (default) or, if truncate=True, replace existing data.'''
    conn = mssql_conn()
    cursor = conn.cursor()
    cursor.fast_executemany = True
    tt = False
    qm = '?,'
    if isinstance(val_lst[0], list):
        rows = len(val_lst)
        params = qm * len(val_lst[0])
    else:
        rows = 1
        params = qm * len(val_lst)
        val_lst = [val_lst]
    params = params[:-1]
    if truncate:
        cursor.execute(f"TRUNCATE TABLE {table}")
    if temp_table:
        # create a temp table with the same schema as the target table
        start_time = time.time()
        cursor.execute(f"SELECT * INTO ##{table} FROM {table} WHERE 1=0")
        table = f"##{table}"
        # set flag to indicate temp table was used
        tt = True
    else:
        start_time = time.time()
    # insert into either the existing table or the newly created temp table
    stmt = f"INSERT INTO {table} VALUES ({params})"
    cursor.executemany(stmt, val_lst)
    if tt:
        # remove the temp moniker and insert from the temp table
        dest_table = table[2:]
        cursor.execute(f"INSERT INTO {dest_table} SELECT * FROM {table}")
        print('Temp table used!')
        print(f'{rows} rows inserted into the {dest_table} table in '
              f'{time.time() - start_time} seconds')
    else:
        print('No temp table used!')
        print(f'{rows} rows inserted into the {table} table in '
              f'{time.time() - start_time} seconds')
    cursor.close()
    conn.close()
And my console results, first without using a temp table and then using one (in both cases the table contained data at the time of execution and truncate=True):
No temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 10.595500707626343 seconds
Temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 3.810380458831787 seconds
FWIW, I did some testing of my own on a few methods of inserting into SQL Server. I actually got the fastest results by using SQL Server batches and pyodbc Cursor.execute statements. I did not test the save-to-csv and BULK INSERT approach; I wonder how it compares.
Here's my blog on the testing I did:
http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
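For anyone who can't reach the link, the general shape of that approach (not the exact code from the post; the table and column names here are made up) is to build multi-row INSERT ... VALUES statements and send them to Cursor.execute in chunks:

def batched_insert(cursor, rows, batch_size=500):
    """Insert rows using multi-row INSERT ... VALUES statements, batch_size rows per execute."""
    # T-SQL allows at most 1000 rows per VALUES clause and 2100 parameters per
    # statement, so batch_size * column_count has to stay under those limits.
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        placeholders = ",".join("(?, ?, ?)" for _ in chunk)   # three columns assumed
        sql = "INSERT INTO SpikeData123 (col1, col2, col3) VALUES " + placeholders
        params = [value for row in chunk for value in row]    # flatten the chunk
        cursor.execute(sql, params)
    cursor.commit()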
Adding to Gord Thompson's answer:
# add the below line for controlling batch size of insert
cursor.fast_executemany_rows = batch_size # by default it is 1000
I use the cx_Oracle library in Python to work with my Oracle database.
import cx_Oracle as Cx

# Parameters for the server connection
dsn_tns = Cx.makedsn(_ip, _port, service_name=_service_name)
# Connection to the Oracle database
db = Cx.connect(_user, _password, dsn_tns)
# Obtain a cursor for running SQL queries
cursor = db.cursor()
One of my queries performs an INSERT of a Python dataframe into my Oracle target table, subject to some conditions:
query = """INSERT INTO ORA_TABLE (ID1, ID2)
           SELECT :1, :2
           FROM DUAL
           WHERE (:1 != 'NF' AND :1 NOT IN (SELECT ID1 FROM ORA_TABLE))
              OR (:1 = 'NF' AND :2 NOT IN (SELECT ID2 FROM ORA_TABLE))"""
The goal of this query is to insert only the rows that satisfy the conditions in the WHERE clause.
This query works well when my Oracle target table has few rows. But if the target table has more than 100,000 rows it is very slow, because the WHERE condition reads through the whole table.
Is there a way to improve the performance of this query, with a join or something else?
End of code:

# SQL query incoming
cursor.prepare(query)
# Launch the query with the Python dataset
cursor.executemany(None, _py_table.values.tolist())
# Commit changes to the Oracle database
db.commit()
# Close the cursor
cursor.close()
# Close the server connection
db.close()
Here is a possible solution that could help: your SQL has an OR condition, and only one side of it can be true for a given value of :1. So I would split it into two parts, check the value of :1 in the code, and construct two inserts instead of one; for any given row only one of them would execute.
IF :1 != 'NF' then use the following insert:
INSERT INTO ORA_TABLE (ID1, ID2)
SELECT :1, :2
FROM DUAL
WHERE (:1 NOT IN (SELECT ID1
FROM ORA_TABLE));
and IF :1 = 'NF' then use the following insert:
INSERT INTO ORA_TABLE (ID1, ID2)
SELECT :1, :2
FROM DUAL
WHERE (:2 NOT IN (SELECT ID2
FROM ORA_TABLE));
So you check in code what the value of :1 is and, depending on that, use the appropriate simplified insert. Please check whether this is functionally the same as the original query, and verify whether it improves the response time.
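In code that could look something like the sketch below (it assumes the rows are available as (id1, id2) pairs, reusing the _py_table, cursor and db objects from the question):

# split the rows by the value of ID1 and run the matching simplified insert for each group
rows = _py_table.values.tolist()              # [[id1, id2], ...] as in the question
nf_rows = [r for r in rows if r[0] == 'NF']
other_rows = [r for r in rows if r[0] != 'NF']

sql_other = """INSERT INTO ORA_TABLE (ID1, ID2)
               SELECT :1, :2 FROM DUAL
               WHERE :1 NOT IN (SELECT ID1 FROM ORA_TABLE)"""
sql_nf = """INSERT INTO ORA_TABLE (ID1, ID2)
            SELECT :1, :2 FROM DUAL
            WHERE :2 NOT IN (SELECT ID2 FROM ORA_TABLE)"""

if other_rows:
    cursor.executemany(sql_other, other_rows)
if nf_rows:
    cursor.executemany(sql_nf, nf_rows)
db.commit()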
Assuming Pandas, consider exporting your data as a table to be used as staging for the final migration, so that you run your subquery only once rather than for every row of the data set. In Pandas, you would need to interface with sqlalchemy to run the to_sql export operation. Note: this assumes your connected user has DROP TABLE and CREATE TABLE privileges.
Also, consider using an EXISTS subquery to combine both IN subqueries. The subquery below attempts to run the opposite of your logic for exclusion.
import sqlalchemy
...
engine = sqlalchemy.create_engine("oracle+cx_oracle://user:password@dsn")

# EXPORT DATA - ALWAYS REPLACING
pandas_df.to_sql('myTempTable', con=engine, if_exists='replace')

# RUN TRANSACTION
with engine.begin() as cn:
    sql = """INSERT INTO ORA_TABLE (ID1, ID2)
             SELECT t.ID1, t.ID2
             FROM myTempTable t
             WHERE EXISTS
                 (
                  SELECT 1 FROM ORA_TABLE sub
                  WHERE (t.ID1 != 'NF' AND t.ID1 = sub.ID1)
                     OR (t.ID1 = 'NF' AND t.ID2 = sub.ID2)
                 )
          """
    cn.execute(sql)
I'm querying a JSON endpoint on a website for data, then saving that data into variables so I can put it into a sqlite table. I'm 2 out of 3 for what I'm trying to do, but the sqlite side is just mystifying. I'm able to request the data, and I can verify that the variables have data when I test them with a print, but all of my sqlite stuff is failing. It's not even creating a table, much less updating the table (but it is printing all the results to the buffer for some reason). Any idea what I'm doing wrong here? Disclaimer: bit of a Python noob. I've successfully created test tables just by copying the stuff off of the Python sqlite docs.
# this is requesting the data and seems to work
for ticket in zenpy.search("bananas"):
    id = ticket.id
    subj = ticket.subject
    created = ticket.created_at
    for comment in zenpy.tickets.comments(ticket.id):
        body = comment.body

# connecting to a sqlite db that exists; things seem to go awry here
conn = sqlite3.connect('example.db')
c = conn.cursor()

# Creating the table (for some reason the table is not being created at all)
c.execute('''CREATE TABLE tickets_test
             (ticket id, ticket subject, creation date, body text)''')

# Inserting the variables into the sqlite table
c.execute("INSERT INTO ticketstest VALUES (id, subj, created, body)")

# committing the changes and closing
c.commit()
c.close()
I'm on Windows 64-bit and using PyCharm to do this.
Your table likely isn't created because you haven't committed yet, and your sql fails before it commits. It should work when you fix your 2nd sql statement.
You're not inserting the variables you've created into the table. You need to use parameters. There are two ways of parameterizing your sql statement. I'll show the named placeholders one:
c.execute("INSERT INTO ticketstest VALUES (:id, :subj, :created, :body)",
{'id':id, 'subj':subj, 'created':created, 'body':body}
)
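Putting both points together (parameters for the insert, and a commit on the connection rather than the cursor), the tail end of the script would look roughly like this. I've also used one consistent table name and single-token column names, since the multi-word names in the original CREATE would collide (SQLite reads the second word as the column type, so both "ticket id" and "ticket subject" declare a column called ticket):

conn = sqlite3.connect('example.db')
c = conn.cursor()

c.execute('''CREATE TABLE IF NOT EXISTS tickets_test
             (ticket_id, ticket_subject, creation_date, body_text)''')

c.execute("INSERT INTO tickets_test VALUES (:id, :subj, :created, :body)",
          {'id': id, 'subj': subj, 'created': created, 'body': body})

# commit() is a method of the connection, not the cursor
conn.commit()
conn.close()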
Say I need a table that has to have two columns (A TEXT, B TEXT).
Every time before I run the program, I want to check if the table exists, and create it if it doesn't. Now say that a table with that name exists already, but has only one column (A TEXT), or maybe (A INT, B INT).
So in general, different columns.
How do I check for that when I run the CREATE query? And if there's a conflict, back the table up somewhere, drop it, then create a new, correct table. If there's no conflict, don't do anything.
I am working in Python, using sqlite3 by the way. The database is stored locally for now and the program is distributed to multiple people; that's why I need to check the database.
Currently I have
con = sqlite3.connect(path)
with con:
    cur = con.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS table (A TEXT, B TEXT);')
You can use the pragma table_info in order to get information about the table, and use the result to check your columns:
def validate(connection):
    cursor = connection.cursor()
    cursor.execute('PRAGMA table_info(table)')
    columns = cursor.fetchall()
    cursor.close()
    return (len(columns) == 2
            and columns[0][1:3] == ('A', 'TEXT')
            and columns[1][1:3] == ('B', 'TEXT'))
So if validate returns False you can rename the table and create the new one.
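The rename-and-recreate step could look something like the sketch below (my_table is a stand-in for the real table name, since "table" itself is an SQL keyword, and validate() is the function above pointed at the same name; a leftover my_table_backup from a previous run is not handled):

import sqlite3

TABLE = 'my_table'   # stand-in for the real table name

con = sqlite3.connect(path)
with con:
    cur = con.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' AND name=?", (TABLE,))
    exists = cur.fetchone() is not None
    if exists and not validate(con):
        # schema mismatch: keep the old data under a backup name, then recreate
        cur.execute('ALTER TABLE %s RENAME TO %s_backup;' % (TABLE, TABLE))
    cur.execute('CREATE TABLE IF NOT EXISTS %s (A TEXT, B TEXT);' % TABLE)
    cur.close()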
If I run this query directly in sqlite3.exe on the same database, I get 20 records.
When I run it in Python using sqlite3, it returns every single record from table a (200000+).
import sqlite3
db = sqlite3.connect("path/to/my.db")
c = db.cursor()
c.execute("""SELECT a.*, b.*, c.* FROM t_data a NATURAL LEFT JOIN t_finished b
NATURAL LEFT JOIN user_info c WHERE user_id=1;""")
for row in c:
    print row
How can this be possible?
Here is how the tables are related.
CREATE TABLE t_data ( t_id INTEGER REFERENCES t_finished (t_id),
ui_id INTEGER NOT NULL REFERENCES user_info (ui_id), ...);
CREATE TABLE t_finished ( t_id INTEGER PRIMARY KEY, ...);
CREATE TABLE user_info ( ui_id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES accounts, ...);
No other columns are shared between them.
Trying explicit JOINs, I have the same problem:
SELECT * FROM t_data a LEFT JOIN t_finished b USING(t_id) LEFT JOIN user_info c USING(ui_id) WHERE user_id=1;
This query works in sqlite3.exe, but throws an error in Python:
OperationalError: cannot join using column ui_id - column not present in both tables
If you are 100% certain that you are dealing with the same database, you probably are using a newer SQLite library version with the command-line tool.
You can verify what versions are being used with:
print sqlite3.sqlite_version
in Python and
sqlite3 -version
with the command-line tool. You can check against the SQLite changelog to see if anything relevant changed, or you could just update your SQLite3 DLLs to the latest version to make sure that you are not running into a bug or new feature here.