I've got a MySQL table with about ~10m rows. I created a parallel schema in SQLite3, and I'd like to copy the table somehow. Using Python seems like an acceptable solution, but this way --
# ...
mysqlcursor.execute('SELECT * FROM tbl')
rows = mysqlcursor.fetchall() # or mysqlcursor.fetchone()
for row in rows:
# ... insert row via sqlite3 cursor
...is incredibly slow (hangs at .execute(), I wouldn't know for how long).
I'd only have to do this once, so I don't mind if it takes a couple of hours, but is there a different way to do this? Using a different tool rather than Python is also acceptable.
The simplest way might be to use mysqldump to get a SQL file of the whole db, then use the SQLite command-line tool to execute the file.
You don't show exactly how you insert rows, but you mention execute().
You might try executemany()* instead.
For example:
import sqlite3
conn = sqlite3.connect('mydb')
c = conn.cursor()
# one '?' placeholder for each column you're inserting
# "rows" needs to be a sequence of values, e.g. ((1,'a'), (2,'b'), (3,'c'))
c.executemany("INSERT INTO tbl VALUES (?,?);", rows)
conn.commit()
*executemany() as described in the Python DB-API:
.executemany(operation,seq_of_parameters)
Prepare a database operation (query or
command) and then execute it against
all parameter sequences or mappings
found in the sequence
seq_of_parameters.
You can export a flat file from mysql using select into outfile and import those with sqlite's .import:
mysql> SELECT * INTO OUTFILE '/tmp/export.txt' FROM sometable;
sqlite> .separator "\t"
sqlite> .import /tmp/export.txt sometable
This handles the data export/import but not copying the schema, of course.
If you really want to do this with python (maybe to transform the data), I would use a MySQLdb.cursors.SSCursor to iterate over the data - otherwise the mysql resultset gets cached in memory which is why your query is hanging on execute. So that would look something like:
import MySQLdb
import MySQLdb.cursors
connection = MySQLdb.connect(...)
cursor = connection.cursor(MySQLdb.cursors.SSCursor)
cursor.execute('SELECT * FROM tbl')
for row in cursor:
# do something with row and add to sqlite database
That will be much slower than the export/import approach.
Related
Below is my code that I'd like some help with.
I am having to run it over 1,300,000 rows meaning it takes up to 40 minutes to insert ~300,000 rows.
I figure bulk insert is the route to go to speed it up?
Or is it because I'm iterating over the rows via for data in reader: portion?
#Opens the prepped csv file
with open (os.path.join(newpath,outfile), 'r') as f:
#hooks csv reader to file
reader = csv.reader(f)
#pulls out the columns (which match the SQL table)
columns = next(reader)
#trims any extra spaces
columns = [x.strip(' ') for x in columns]
#starts SQL statement
query = 'bulk insert into SpikeData123({0}) values ({1})'
#puts column names in SQL query 'query'
query = query.format(','.join(columns), ','.join('?' * len(columns)))
print 'Query is: %s' % query
#starts curser from cnxn (which works)
cursor = cnxn.cursor()
#uploads everything by row
for data in reader:
cursor.execute(query, data)
cursor.commit()
I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).
SpikeData123 is the table name.
As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.
pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
crsr.fast_executemany = True # new in pyodbc 4.0.19
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... reduced the execution time to just over 1 second.
Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.
Update - April 2019: As noted in the comment from #SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).
BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...
1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07
... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contains commas) ...
1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07
... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...
1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07
... or a tab-delimited file (where → represents the tab character) ...
1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07
... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:
import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
FIELDTERMINATOR='\\t',
ROWTERMINATOR='\\n'
);
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.
yes bulk insert is right path for loading large files into a DB. At a glance I would say that the reason it takes so long is as you mentioned you are looping over each row of data from the file which effectively means are removing the benefits of using a bulk insert and making it like a normal insert. Just remember that as it's name implies that it is used to insert chucks of data.
I would remove loop and try again.
Also I'd double check your syntax for bulk insert as it doesn't look correct to me. check the sql that is generated by pyodbc as I have a feeling that it might only be executing a normal insert
Alternatively if it is still slow I would try using bulk insert directly from sql and either load the whole file into a temp table with bulk insert then insert the relevant column into the right tables. or use a mix of bulk insert and bcp to get the specific columns inserted or OPENROWSET.
This problem was frustrating me and I didn't see much improvement using fast_executemany until I found this post on SO. Specifically, Bryan Bailliache's comment regarding max varchar. I had been using SQLAlchemy and even ensuring better datatype parameters did not fix the issue for me; however, switching to pyodbc did. I also took Michael Moura's advice of using a temp table and found it shaved of even more time. I wrote a function in case anyone might find it useful. I wrote it to take either a list or list of lists for the insert. It took my insert of the same data using SQLAlchemy and Pandas to_sql from taking upwards of sometimes 40 minutes down to just under 4 seconds. I may have been misusing my former method though.
connection
def mssql_conn():
conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
server=os.environ.get('MS_SQL_SERVER'),
database='EHT',
uid=os.environ.get('MS_SQL_UN'),
pwd=os.environ.get('MS_SQL_PW'),
autocommit=True)
return conn
Insert function
def mssql_insert(table,val_lst,truncate=False,temp_table=False):
'''Use as direct connection to database to insert data, especially for
large inserts. Takes either a single list (for one row),
or list of list (for multiple rows). Can either append to table
(default) or if truncate=True, replace existing.'''
conn = mssql_conn()
cursor = conn.cursor()
cursor.fast_executemany = True
tt = False
qm = '?,'
if isinstance(val_lst[0],list):
rows = len(val_lst)
params = qm * len(val_lst[0])
else:
rows = 1
params = qm * len(val_lst)
val_lst = [val_lst]
params = params[:-1]
if truncate:
cursor.execute(f"TRUNCATE TABLE {table}")
if temp_table:
#create a temp table with same schema
start_time = time.time()
cursor.execute(f"SELECT * INTO ##{table} FROM {table} WHERE 1=0")
table = f"##{table}"
#set flag to indicate temp table was used
tt = True
else:
start_time = time.time()
#insert into either existing table or newly created temp table
stmt = f"INSERT INTO {table} VALUES ({params})"
cursor.executemany(stmt,val_lst)
if tt:
#remove temp moniker and insert from temp table
dest_table = table[2:]
cursor.execute(f"INSERT INTO {dest_table} SELECT * FROM {table}")
print('Temp table used!')
print(f'{rows} rows inserted into the {dest_table} table in {time.time() -
start_time} seconds')
else:
print('No temp table used!')
print(f'{rows} rows inserted into the {table} table in {time.time() -
start_time} seconds')
cursor.close()
conn.close()
And my console results first using a temp table and then not using one (in both cases, the table contained data at the time of execution and Truncate=True):
No temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 10.595500707626343
seconds
Temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 3.810380458831787
seconds
FWIW, I gave a few methods of inserting to SQL Server some testing of my own. I was actually able to get the fastest results by using SQL Server Batches and using pyodbcCursor.execute statements. I did not test the save to csv and BULK INSERT, I wonder how it compares.
Here's my blog on the testing I did:
http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
adding to Gord Thompson's answer:
# add the below line for controlling batch size of insert
cursor.fast_executemany_rows = batch_size # by default it is 1000
I'm using server-side cursor in PostgreSQL with psycopg2, based on this well-explained answer.
with conn.cursor(name='name_of_cursor') as cursor:
query = "SELECT * FROM tbl FOR UPDATE"
cursor.execute(query)
for row in cursor:
# process row
In processing each row, I'd like to update a few fields in the row using PostgreSQL's UPDATE tbl SET ... WHERE CURRENT OF name_of_cursor (docs), but it seems that, when the for loop enters and row is set, the position of the server-side cursor is in a different record, so while I can run the command, the wrong record is updated.
How can I make sure the result iterator is in the same position as the cursor? (also preferably in a way that won't make the loop slower than updating using an ID)
The reason why a different record was being updated was because internally psycopg2 does a FETCH FORWARD 1000 (or whatever the default chunk size is), positioning the cursor at the end of the block. You can override this by fetching one record at a time:
updcursor = conn.cursor()
with conn.cursor(name='name_of_cursor') as cursor:
cursor.itersize = 1 # to make server-side cursor be in the same position as the iterator
cursor.execute('SELECT * FROM tbl FOR UPDATE')
for row in cursor:
# process row...
updcursor.execute('UPDATE tbl SET fld1 = %s WHERE CURRENT OF name_of_cursor', [val])
The snippet above will update the correct record. Note that you cannot use the same cursor for selecting and updating, they must be different cursors.
Performance
Reducing the FETCH size to 1 reduces the retrieval performance by a lot. I definitely wouldn't recommend using this technique if you're iterating a large dataset (which is probably the case you're searching for server-side cursors) from a different host than the PostgreSQL server.
I ended up using a combination of exporting records to CSV, then importing them later using COPY FROM (with the function copy_expert).
Does anybody know of a good Python code that can copy large number of tables (around 100 tables) from one database to another in SQL Server?
I ask if there is a way to do it in Python, because due to restrictions at my place of employment, I cannot copy tables across databases inside SQL Server alone.
Here is a simple Python code that copies one table from one database to another. I am wondering if there is a better way to write it if I want to copy 100 tables.
print('Initializing...')
import pandas as pd
import sqlalchemy
import pyodbc
db1 = sqlalchemy.create_engine("mssql+pyodbc://user:password#db_one")
db2 = sqlalchemy.create_engine("mssql+pyodbc://user:password#db_two")
print('Writing...')
query = '''SELECT * FROM [dbo].[test_table]'''
df = pd.read_sql(query, db1)
df.to_sql('test_table', db2, schema='dbo', index=False, if_exists='replace')
print('(1) [test_table] copied.')
SQLAlchemy is actually a good tool to use to create identical tables in the second db:
table = Table('test_table', metadata, autoload=True, autoload_with=db1)
table.create(engine=db2)
This method will also produce correct keys, indexes, foreign keys. Once the needed tables are created, you can move the data by either select/insert if the tables are relatively small or use bcp utility to dump table to disk and then load it into the second database (much faster but more work to get it to work correctly)
If using select/insert then it is better to insert in batches of 500 records or so.
You can do something like this:
tabs = pd.read_sql("SELECT table_name FROM INFORMATION_SCHEMA.TABLES", db1)
for tab in tabs['table_name']:
pd.read_sql("select * from {}".format(tab), db1).to_sql(tab, db2, index=False)
But it might be be awfully slow. Use SQL Server tools to do this job.
Consider using sp_addlinkedserver procedure to link one SQL Server from another. After that you can execute:
SELECT * INTO server_name...table_name FROM table_name
for all tables from the db1 database.
PS this might be done in Python + SQLAlchemy as well...
I have problem with the python PostgreSQL. I am using psycopg2.
here's my postgres DB looks like:
Database: qcdata
---information_schema
---pg_catalog
---prod
---activation
--- *** Many other table ***
---public
I want to pull out the information in schema: prod - table: activation
here's my code
import psycopg2
conn = psycopg2.connect(host='10.0.80.180', port = '5432',
dbname = 'qcdata',
user = 'username', password = 'pwd')
cur = conn.cursor()
print cur.execute("SELECT * FROM prod.activation")
But it returns None... I am sure there's data in it. How could that be?
Thanks guys.
According to the docs (http://initd.org/psycopg/docs/cursor.html), the cur.execute() cursor always returns None. You have to follow up with one of the fetch methods, like:
print cur.fetchall()
-g
While I will use cursor.fetchone() in cases where I'm getting a single value, i.e. when using COUNT:
SELECT count(*) FROM prod.activation
If I want to process a set of rows, I like to leverage the cursor's ability to iterate over the result set. That allows more control over the results, how they're processed, and how they're returned.
The cursor can be iterated over just like a Python list:
for row in cur:
dostuff()
If you use named cursors (simply set the name attribute when constructing the cursor), psycopg2 will chunk the SELECT for you, by 2000 by default. This will automatically happen in the background while iterating.
You can also use the with syntax to automatically close the cursor when you're done with it. (Useful for smaller-scoped uses.)
I am "converting" a large (~1.6GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:
import csv, sqlite3
conn = sqlite3.connect( "path/to/file.db" )
conn.text_factory = str #bugger 8-bit bytestrings
cur = conn.cur()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')
reader = csv.reader(open(filecsv.txt, "rb"))
for field1, field2, field3, field4, field5 in reader:
cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
Everything works as I expect it to with the exception... IT TAKES AN INCREDIBLE AMOUNT OF TIME TO PROCESS. Am I coding it incorrectly? Is there a better way to achieve a higher performance and accomplish what I'm needing (simply convert a few fields of a CSV into SQLite table)?
**EDIT -- I tried directly importing the csv into sqlite as suggested but it turns out my file has commas in fields (e.g. "My title, comma"). That's creating errors with the import. It appears there are too many of those occurrences to manually edit the file...
any other thoughts??**
Chris is right - use transactions; divide the data into chunks and then store it.
"... Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database." - Source: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
"... there is another trick you can use to speed up SQLite: transactions. Whenever you have to do multiple database writes, put them inside a transaction. Instead of writing to (and locking) the file each and every time a write query is issued, the write will only happen once when the transaction completes." - Source: How Scalable is SQLite?
import csv, sqlite3, time
def chunks(data, rows=10000):
""" Divides the data into 10000 rows each """
for i in xrange(0, len(data), rows):
yield data[i:i+rows]
if __name__ == "__main__":
t = time.time()
conn = sqlite3.connect( "path/to/file.db" )
conn.text_factory = str #bugger 8-bit bytestrings
cur = conn.cur()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')
csvData = csv.reader(open(filecsv.txt, "rb"))
divData = chunks(csvData) # divide into 10000 rows each
for chunk in divData:
cur.execute('BEGIN TRANSACTION')
for field1, field2, field3, field4, field5 in chunk:
cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
cur.execute('COMMIT')
print "\n Time Taken: %.3f sec" % (time.time()-t)
It's possible to import the CSV directly:
sqlite> .separator ","
sqlite> .import filecsv.txt mytable
http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
As it's been said (Chris and Sam), transactions do improve a lot insert performance.
Please, let me recommend another option, to use a suite of Python utilities to work with CSV, csvkit.
To install:
pip install csvkit
To solve your problem
csvsql --db sqlite:///path/to/file.db --insert --table mytable filecsv.txt
Try using transactions.
begin
insert 50,000 rows
commit
That will commit data periodically rather than once per row.
Pandas makes it easy to load big files into databases in chunks. Read the CSV file into a Pandas DataFrame and then use the Pandas SQL writer (so Pandas does all the hard work). Here's how to load the data in 100,000 row chunks.
import pandas as pd
orders = pd.read_csv('path/to/your/file.csv')
orders.to_sql('orders', conn, if_exists='append', index = False, chunksize=100000)
Modern Pandas versions are very performant. Don't reinvent the wheel. See here for more info.