basic pyodbc bulk insert - python

In a Python script, I need to run a query on one datasource and insert each row from that query into a table on a different datasource. I'd normally do this with a single INSERT/SELECT statement using a T-SQL linked server join, but I don't have a linked server connection to this particular datasource.
I'm having trouble finding a simple pyodbc example of this. Here's how I'd do it, but I'm guessing executing an insert statement inside a loop is pretty slow.
result = ds1Cursor.execute(selectSql)
for row in result:
    insertSql = "insert into TableName (Col1, Col2, Col3) values (?, ?, ?)"
    ds2Cursor.execute(insertSql, row[0], row[1], row[2])
    ds2Cursor.commit()
Is there a better bulk way to insert records with pyodbc? Or is this a relatively efficient way to do it anyway? I'm using SQL Server 2012 and the latest pyodbc and Python versions.

The best way to handle this is to use the pyodbc function executemany.
ds1Cursor.execute(selectSql)
result = ds1Cursor.fetchall()
ds2Cursor.executemany('INSERT INTO [TableName] (Col1, Col2, Col3) VALUES (?, ?, ?)', result)
ds2Cursor.commit()
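If the source result set is large, fetchall() pulls every row into client memory at once. A hedged variant of the same approach that batches with fetchmany (the batch size here is an arbitrary assumption):
batch_size = 10000  # arbitrary; tune for your data and memory
ds1Cursor.execute(selectSql)
while True:
    rows = ds1Cursor.fetchmany(batch_size)
    if not rows:
        break
    ds2Cursor.executemany('INSERT INTO [TableName] (Col1, Col2, Col3) VALUES (?, ?, ?)', rows)
ds2Cursor.commit()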

Here's a function that can do a bulk insert into a SQL Server database.
import pyodbc
import contextlib

def bulk_insert(table_name, file_path):
    # Note: FORMAT = 'CSV' requires SQL Server 2017 or later.
    string = "BULK INSERT {} FROM '{}' WITH (FORMAT = 'CSV');"
    with contextlib.closing(pyodbc.connect("MYCONN")) as conn:
        with contextlib.closing(conn.cursor()) as cursor:
            cursor.execute(string.format(table_name, file_path))
            conn.commit()
This definitely works.
UPDATE: I've noticed in the comments, as well as while coding regularly, that pyodbc is better supported than pypyodbc.
NEW UPDATE: removed conn.close(), since the with statement handles that automatically.
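A hedged usage sketch; the table name and file path are hypothetical, and the file must be readable by the SQL Server instance itself, not just by the client running the script:
bulk_insert("MySchema.MyTable", r"C:\data\mytable.csv")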

Since the discontinuation of the pymssql library (which seems to be under development again), we started using the cTDS library developed by the smart people at Zillow, and to our surprise it supports the FreeTDS bulk insert.
As the name suggests, cTDS is written in C on top of the FreeTDS library, which makes it fast, really fast. IMHO this is the best way to bulk insert into SQL Server, since the ODBC driver does not support bulk insert, and executemany or fast_executemany as suggested aren't really bulk insert operations. The BCP tool and T-SQL BULK INSERT have their limitations, since they need the file to be accessible by the SQL Server, which can be a deal breaker in many scenarios.
Below is a naive implementation of bulk inserting a CSV file. Please forgive any bugs, I wrote this from memory without testing.
I don't know why, but for my server, which uses Latin1_General_CI_AS, I needed to wrap the data going into NVarChar columns with ctds.SqlVarChar. I opened an issue about this, but the developers said the naming is correct, so I changed my code to keep myself mentally healthy.
import csv
import time

import ctds


def _to_varchar(txt: str) -> ctds.SqlNVarChar:
    """
    Wraps strings into ctds.SqlNVarChar.
    """
    if txt == "null":
        return None
    return ctds.SqlNVarChar(txt)


def _to_nvarchar(txt: str) -> ctds.SqlVarChar:
    """
    Wraps strings into ctds.SqlVarChar.
    """
    if txt == "null":
        return None
    return ctds.SqlVarChar(txt.encode("utf-16le"))


def read(file):
    """
    Open CSV File.
    Each line is a column:value dict.
    https://docs.python.org/3/library/csv.html?highlight=csv#csv.DictReader
    """
    with open(file, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row


def transform(row):
    """
    Do transformations to data before loading.

    Data specified for bulk insertion into text columns (e.g. VARCHAR,
    NVARCHAR, TEXT) is not encoded on the client in any way by FreeTDS.
    Because of this behavior it is possible to insert textual data with
    an invalid encoding and cause the column data to become corrupted.

    To prevent this, it is recommended the caller explicitly wrap the
    object with either ctds.SqlVarChar (for CHAR, VARCHAR or TEXT
    columns) or ctds.SqlNVarChar (for NCHAR, NVARCHAR or NTEXT columns).

    For non-Unicode columns, the value should first be encoded to the
    column's encoding (e.g. latin-1). By default ctds.SqlVarChar will
    encode str objects to utf-8, which is likely incorrect for most SQL
    Server configurations.

    https://zillow.github.io/ctds/bulk_insert.html#text-columns
    """
    row["col1"] = _to_datetime(row["col1"])
    row["col2"] = _to_int(row["col2"])
    row["col3"] = _to_nvarchar(row["col3"])
    row["col4"] = _to_varchar(row["col4"])
    return row


def load(rows):
    stime = time.time()
    with ctds.connect(**DBCONFIG) as conn:
        with conn.cursor() as curs:
            curs.execute("TRUNCATE TABLE MYSCHEMA.MYTABLE")
            loaded_lines = conn.bulk_insert("MYSCHEMA.MYTABLE", map(transform, rows))
    etime = time.time()
    print(loaded_lines, " rows loaded in ", etime - stime)


if __name__ == "__main__":
    load(read('data.csv'))
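The snippet above calls _to_datetime, _to_int, and DBCONFIG without defining them. A hedged sketch of what they might look like; the date format and connection parameters are pure assumptions:
from datetime import datetime

# Assumed connection parameters; adjust to your environment.
DBCONFIG = {
    "server": "myserver",
    "port": 1433,
    "user": "myuser",
    "password": "mypassword",
    "database": "mydb",
}

def _to_datetime(txt: str):
    """Parse a CSV date string; the format here is only a guess."""
    if txt == "null":
        return None
    return datetime.strptime(txt, "%Y-%m-%d %H:%M:%S")

def _to_int(txt: str):
    """Convert a CSV value to int, treating "null" as None."""
    if txt == "null":
        return None
    return int(txt)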

You should use executemany with cursor.fast_executemany = True to improve performance.
pyodbc's default behaviour is to send a separate insert for each row, which is inefficient. By enabling fast_executemany, you can drastically improve performance.
Here is an example:
connection = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server}',
                            host='host', database='db', user='usr', password='foo')
cursor = connection.cursor()
# I'm the important line
cursor.fast_executemany = True
sql = "insert into TableName (Col1, Col2, Col3) values (?, ?, ?)"
tuples = [('foo', 'bar', 'ham'), ('hoo', 'far', 'bam')]
cursor.executemany(sql, tuples)
cursor.commit()
cursor.close()
connection.close()
Docs.
Note that fast_executemany has been available since pyodbc 4.0.19 (Oct 23, 2017).

Helpful function for generating the SQL required for using executemany():
import pandas as pd

def generate_bulk_insert_sql(self, data: pd.DataFrame, table_name) -> str:
    table_sql = str([c for c in data.columns]).replace("'", "").replace("[", "").replace("]", "")
    return f'INSERT INTO {table_name} ({table_sql}) VALUES ({("?," * len(data.columns))[:-1]})'
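A hedged usage sketch; the DataFrame, table name, and cursor are hypothetical, and None is passed for the unused self parameter:
df = pd.DataFrame({"Col1": ["foo", "hoo"], "Col2": ["bar", "far"], "Col3": ["ham", "bam"]})
sql = generate_bulk_insert_sql(None, df, "TableName")
# -> "INSERT INTO TableName (Col1, Col2, Col3) VALUES (?,?,?)"
cursor.fast_executemany = True
cursor.executemany(sql, df.values.tolist())
cursor.commit()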


Upload an entire CSV into SQL Server [duplicate]

Below is my code that I'd like some help with.
I am having to run it over 1,300,000 rows meaning it takes up to 40 minutes to insert ~300,000 rows.
I figure bulk insert is the route to go to speed it up?
Or is it because I'm iterating over the rows via for data in reader: portion?
# Opens the prepped csv file
with open(os.path.join(newpath, outfile), 'r') as f:
    # hooks csv reader to file
    reader = csv.reader(f)
    # pulls out the columns (which match the SQL table)
    columns = next(reader)
    # trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    # starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    # puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))
    print 'Query is: %s' % query
    # starts cursor from cnxn (which works)
    cursor = cnxn.cursor()
    # uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()
I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).
SpikeData123 is the table name.
As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.
pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
crsr.fast_executemany = True # new in pyodbc 4.0.19
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... reduced the execution time to just over 1 second.
Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.
Update - April 2019: As noted in the comment from #SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).
BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...
1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07
... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...
1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07
... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...
1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07
... or a tab-delimited file (where → represents the tab character) ...
1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07
... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:
import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
FIELDTERMINATOR='\\t',
ROWTERMINATOR='\\n'
);
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.
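For the pre-processing step described above (re-writing a comma-delimited file with text qualifiers as a tab-delimited file), here is a minimal sketch using Python's csv module; the file names are assumptions, and it assumes no tab characters or newlines appear inside the field values:
import csv

with open('biTest.csv', 'r', newline='') as src, open('biTest.txt', 'w') as dst:
    reader = csv.reader(src)  # handles the "..." text qualifiers
    for row in reader:
        # write each row tab-delimited, ready for BULK INSERT with FIELDTERMINATOR='\t'
        dst.write('\t'.join(row) + '\n')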
Yes, bulk insert is the right path for loading large files into a DB. At a glance, I would say that the reason it takes so long is, as you mentioned, that you are looping over each row of data from the file, which effectively removes the benefits of using a bulk insert and makes it like a normal insert. Just remember that, as its name implies, it is used to insert chunks of data.
I would remove the loop and try again.
Also, I'd double-check your syntax for bulk insert, as it doesn't look correct to me. Check the SQL that is generated by pyodbc, as I have a feeling it might only be executing a normal insert.
Alternatively, if it is still slow, I would try using bulk insert directly from SQL: either load the whole file into a temp table with BULK INSERT and then insert the relevant columns into the right tables, or use a mix of BULK INSERT and bcp to get the specific columns inserted, or OPENROWSET.
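A hedged sketch of the temp-table variant, run through pyodbc; the staging columns, file path, and target table are assumptions, and the file must be readable by the SQL Server instance:
crsr = cnxn.cursor()
crsr.execute("""
    CREATE TABLE #staging (col1 VARCHAR(50), col2 VARCHAR(50), col3 VARCHAR(50));

    BULK INSERT #staging
    FROM 'C:\\__tmp\\SpikeData123.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2);

    INSERT INTO SpikeData123 (col1, col3)
    SELECT col1, col3 FROM #staging;
""")
cnxn.commit()
crsr.close()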
This problem was frustrating me, and I didn't see much improvement using fast_executemany until I found this post on SO, specifically Bryan Bailliache's comment regarding max varchar. I had been using SQLAlchemy, and even ensuring better datatype parameters did not fix the issue for me; however, switching to pyodbc did. I also took Michael Moura's advice of using a temp table and found it shaved off even more time. I wrote a function in case anyone might find it useful; it takes either a list or a list of lists for the insert. It took my insert of the same data, which with SQLAlchemy and Pandas to_sql sometimes took upwards of 40 minutes, down to just under 4 seconds. I may have been misusing my former method, though.
Connection:
import os
import time
import pyodbc

def mssql_conn():
    conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                          server=os.environ.get('MS_SQL_SERVER'),
                          database='EHT',
                          uid=os.environ.get('MS_SQL_UN'),
                          pwd=os.environ.get('MS_SQL_PW'),
                          autocommit=True)
    return conn
Insert function:
def mssql_insert(table, val_lst, truncate=False, temp_table=False):
    '''Use as direct connection to database to insert data, especially for
    large inserts. Takes either a single list (for one row),
    or a list of lists (for multiple rows). Can either append to table
    (default) or, if truncate=True, replace existing.'''
    conn = mssql_conn()
    cursor = conn.cursor()
    cursor.fast_executemany = True
    tt = False
    qm = '?,'
    if isinstance(val_lst[0], list):
        rows = len(val_lst)
        params = qm * len(val_lst[0])
    else:
        rows = 1
        params = qm * len(val_lst)
        val_lst = [val_lst]
    params = params[:-1]
    if truncate:
        cursor.execute(f"TRUNCATE TABLE {table}")
    if temp_table:
        # create a temp table with same schema
        start_time = time.time()
        cursor.execute(f"SELECT * INTO ##{table} FROM {table} WHERE 1=0")
        table = f"##{table}"
        # set flag to indicate temp table was used
        tt = True
    else:
        start_time = time.time()
    # insert into either existing table or newly created temp table
    stmt = f"INSERT INTO {table} VALUES ({params})"
    cursor.executemany(stmt, val_lst)
    if tt:
        # remove temp moniker and insert from temp table
        dest_table = table[2:]
        cursor.execute(f"INSERT INTO {dest_table} SELECT * FROM {table}")
        print('Temp table used!')
        print(f'{rows} rows inserted into the {dest_table} table in '
              f'{time.time() - start_time} seconds')
    else:
        print('No temp table used!')
        print(f'{rows} rows inserted into the {table} table in '
              f'{time.time() - start_time} seconds')
    cursor.close()
    conn.close()
And my console results, first without a temp table and then using one (in both cases, the table contained data at the time of execution and truncate=True):
No temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 10.595500707626343 seconds
Temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 3.810380458831787 seconds
FWIW, I gave a few methods of inserting to SQL Server some testing of my own. I was actually able to get the fastest results by using SQL Server batches and pyodbc Cursor.execute statements. I did not test saving to CSV and using BULK INSERT; I wonder how it compares.
Here's my blog on the testing I did:
http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
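A hedged sketch of the batched approach described in the linked post, as I understand it: build multi-row INSERT ... VALUES statements and send each batch with a single cursor.execute call. The helper below is hypothetical; the 1000-row batch size reflects SQL Server's cap on row value expressions per VALUES clause:
def insert_in_batches(cursor, table, columns, rows, batch_size=1000):
    placeholders = "(" + ", ".join("?" * len(columns)) + ")"
    col_list = ", ".join(columns)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = ("INSERT INTO {} ({}) VALUES ".format(table, col_list)
               + ", ".join([placeholders] * len(batch)))
        # flatten the batch into a single parameter sequence
        params = [value for row in batch for value in row]
        cursor.execute(sql, params)
    cursor.commit()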
adding to Gord Thompson's answer:
# add the below line for controlling batch size of insert
cursor.fast_executemany_rows = batch_size # by default it is 1000

Using prepared statements with mysql in python

I am trying to use SQL with prepared statements in Python. Python doesn't have its own mechanism for this so I try to use SQL directly:
sql = "PREPARE stmt FROM ' INSERT INTO {} (date, time, tag, power) VALUES (?, ?, ?, ?)'".format(self.db_scan_table)
self.cursor.execute(sql)
Then later, in the loop:
sql = "EXECUTE stmt USING \'{}\', \'{}\', {}, {};".format(d, t, tag, power)
self.cursor.execute(sql)
And in the loop I get:
MySQL Error [1064]: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ''2014-12-25', '12:31:46', 88000000, -6.64' at line 1
What's going on?
Using prepared statements with MySQL in Python is explained e.g at http://zetcode.com/db/mysqlpython/ -- look within that page for Prepared statements.
In your case, that would be, e.g:
sql = ('INSERT INTO {} (date, time, tag, power) VALUES '
       '(%s, %s, %s, %s)'.format(self.db_scan_table))
and later, "in the loop" as you put it:
self.cursor.execute(sql, (d, t, tag, power))
with no further string formatting -- the MySQLdb module does the prepare and execute parts on your behalf (and may cache things to avoid repeating work needlessly, etc, etc).
Do consider, depending on the nature of "the loop" you mention, that it's possible that a single call to .executemany (with a sequence of tuples as the second argument) could take the place of the whole loop (unless you need more processing within that loop beyond just the insertion of data into the DB).
Added: a better alternative nowadays may be to use mysql's own Connector/Python and the explicit prepare=True option in the .cursor() factory -- see http://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursorprepared.html . This lets you have a specific cursor on which statements are prepared (with the "more efficient than using PREPARE and EXECUTE" binary protocol, according to that mysql.com page) and another one for statements that are better not prepared; "explicit is better than implicit" is after all one of the principles in "The Zen of Python" (import this from an interactive prompt to read all those principles). mysqldb doing things implicitly (and it seems the current open-source version doesn't use prepared statements) can't be as good an architecture as Connector/Python's more explicit one.
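As a small sketch of the executemany suggestion above (the rows list of (d, t, tag, power) tuples is assumed to have been collected beforehand):
sql = ('INSERT INTO {} (date, time, tag, power) VALUES '
       '(%s, %s, %s, %s)'.format(self.db_scan_table))
self.cursor.executemany(sql, rows)  # rows: list of (d, t, tag, power) tuples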
import mysql.connector

db_con = mysql.connector.connect(host='',
                                 database='',
                                 user='',
                                 password='')
cursor = db_con.cursor(prepared=True)

sql = """INSERT INTO table (xy, zy) VALUES (%s, %s)"""
input = (1, 2)
cursor.execute(sql, input)
db_con.commit()
SELECT statement:
sql = """SELECT * FROM TABLE WHERE XY=%s ORDER BY id DESC LIMIT 1"""
ID = 1
input = (ID,)  # note: must be a tuple; (ID) without the comma is just an int
cursor.execute(sql, input)
data = cursor.fetchall()
rowsNumber = cursor.rowcount
Python does support prepared statements:
sql = "INSERT INTO {} (date, time, tag, power) VALUES (%s, %s, %s, %s);"
sql = sql.format(self.db_scan_table)
self.cursor.execute(sql, (d, t, tag, power))
(You should ensure self.db_scan_table is not vulnerable to SQL injection)
This assumes your paramstyle is 'format', which it should be for MySQL.

Print if MySQL returns no results

This is my code so far. I'm attempting to print "No results found" if no results are returned by MySQL, but I can't figure it out. Perhaps I'm using incorrect arguments. Could anyone provide me with an example? Much appreciated!
def movie_function(film):
    connection = mysql connection info
    cursor = connection.cursor()
    sql = "SELECT * FROM film_database WHERE film_name = '"+film+"' ORDER BY actor"
    cursor.execute(sql)
    rows = cursor.fetchall()
    for row in rows:
        print row[1]
When you execute a select statement, cursor.rowcount is set to the number of results retrieved. Also, there is no real need to call cursor.fetchall(); looping over the cursor directly is easier:
def movie_function(film):
    connection = mysql connection info
    cursor = connection.cursor()
    sql = "SELECT * FROM film_database WHERE film_name = %s ORDER BY actor"
    cursor.execute(sql, (film,))
    if not cursor.rowcount:
        print "No results found"
    else:
        for row in cursor:
            print row[1]
Note that I also switched your code to use SQL parameters; there is no need to use string interpolation here, leave that to the database adapter. The %s placeholder is replaced for you by a correctly quoted value taken from the second argument to cursor.execute(), a sequence of values (here a tuple of one element).
Using SQL parameters also lets a good database reuse the query plan for the select statement, and leaving the quoting up to the database adapter prevents SQL injection attacks.
You could use cursor.rowcount after your code to see how many rows were actually returned. See here for more.
I guess this should work.
def movie_function(film):
    connection = mysql connection info
    cursor = connection.cursor()
    sql = "SELECT * FROM film_database WHERE film_name = %s ORDER BY actor"
    cursor.execute(sql, [film])
    rows = cursor.fetchall()
    if not rows:
        print 'No results found'
        return
    for row in rows:
        print row[1]
Note that I changed the way the film parameter is passed to the query. I don't know exactly how it should be (this depends on which MySQL driver for Python you use), but the important thing to know is that you should not interpolate your parameters directly into the query string, for security reasons.
You can also use:
rows_affected = cursor.execute("SELECT ... ")
which directly gives you the number of returned rows.

oursql extremely slow in inserting data

I am trying to store some data generated by a python script in a MySQL database. Essentially I am using the commands:
con = oursql.connect(user="user", host="host", passwd="passwd",
                     db="testdb")
c = con.cursor()
c.executemany(insertsimoutput, zippedsimoutput)
con.commit()
c.close()
where,
insertsimoutput = '''insert into simoutput
(repnum,
timepd,
...) values (?, ?, ...?)'''
About 30,000 rows are inserted and there are about 15 columns. The above takes about 7 minutes. If I use MySQLdb instead of oursql, it takes about 2 seconds. Why this huge difference? Is this supposed to be done some other way in oursql, or is oursql just plain slow? If there is a better way to insert this data with oursql, I would appreciate it if you could let me know.
Thank you.
The difference is that MySQLdb does some hackery to your query while oursql does not...
Taking this:
cursor.executemany("INSERT INTO sometable VALUES (%s, %s, %s)",
[[1,2,3],[4,5,6],[7,8,9]])
MySQLdb translates it before running into this:
cursor.execute("INSERT INTO sometable VALUES (1,2,3),(4,5,6),(7,8,9)")
But if you do:
cursor.executemany("INSERT INTO sometable VALUES (?, ?, ?)",
[[1,2,3],[4,5,6],[7,8,9]])
In oursql, it gets translated into something like this pseudocode:
stmt = prepare("INSERT INTO sometable VALUES (?, ?, ?)")
for params in [[1,2,3],[4,5,6],[7,8,9]]:
stmt.execute(*params)
So if you want to emulate what mysqldb is doing but benefit from prepared statements and other goodness with oursql, you need to do this:
from itertools import chain
data = [[1,2,3],[4,5,6],[7,8,9]]
one_val = "({})".format(','.join("?" for i in data[0]))
vals_clause = ','.join(one_val for i in data)
cursor.execute("INSERT INTO sometable VALUES {}".format(vals_clause),
chain.from_iterable(data))
I bet oursql will be faster when you do this :-)
Also, if you think it's ugly, you are right. But just remember that MySQLdb is doing something uglier internally: it's using regular expressions to parse your INSERT statement and break off the parameterized part, and THEN doing what I suggested you do for oursql.
I would say to check if oursql supports a bulk insert sql command to boost performance.
Oursql does support bulk insert statements. I've written code to do so, using the sqlalchemy wrapper.
For pure oursql, something like this should be fine:
with open('tmp.csv', 'wb') as tmp:
    for item in zippedsimoutput:
        # write each row as comma-separated values, matching the LOAD DATA options below
        tmp.write(",".join(str(field) for field in item) + "\r\n")
c.execute("""LOAD DATA LOCAL INFILE 'tmp.csv' INTO TABLE flags FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\r\n' ;""")
Note that the rows must be in the same order as the columns on the database.

Python CSV to SQLite

I am "converting" a large (~1.6GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:
import csv, sqlite3

conn = sqlite3.connect("path/to/file.db")
conn.text_factory = str  # bugger 8-bit bytestrings
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')
reader = csv.reader(open("filecsv.txt", "rb"))
for field1, field2, field3, field4, field5 in reader:
    cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
Everything works as I expect it to with the exception... IT TAKES AN INCREDIBLE AMOUNT OF TIME TO PROCESS. Am I coding it incorrectly? Is there a better way to achieve a higher performance and accomplish what I'm needing (simply convert a few fields of a CSV into SQLite table)?
**EDIT -- I tried directly importing the csv into sqlite as suggested but it turns out my file has commas in fields (e.g. "My title, comma"). That's creating errors with the import. It appears there are too many of those occurrences to manually edit the file...
any other thoughts??**
Chris is right - use transactions; divide the data into chunks and then store it.
"... Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database." - Source: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html
"... there is another trick you can use to speed up SQLite: transactions. Whenever you have to do multiple database writes, put them inside a transaction. Instead of writing to (and locking) the file each and every time a write query is issued, the write will only happen once when the transaction completes." - Source: How Scalable is SQLite?
import csv, sqlite3, time

def chunks(data, rows=10000):
    """ Divides the data into 10000 rows each """
    for i in xrange(0, len(data), rows):
        yield data[i:i+rows]

if __name__ == "__main__":
    t = time.time()
    conn = sqlite3.connect("path/to/file.db")
    conn.text_factory = str  # bugger 8-bit bytestrings
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')
    # materialize the reader so chunks() can use len() and slicing
    csvData = list(csv.reader(open("filecsv.txt", "rb")))
    divData = chunks(csvData)  # divide into 10000 rows each
    for chunk in divData:
        cur.execute('BEGIN TRANSACTION')
        for field1, field2, field3, field4, field5 in chunk:
            cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', (field2, field4))
        cur.execute('COMMIT')
    print "\n Time Taken: %.3f sec" % (time.time()-t)
It's possible to import the CSV directly:
sqlite> .separator ","
sqlite> .import filecsv.txt mytable
http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
As has been said (by Chris and Sam), transactions do improve insert performance a lot.
Please let me recommend another option: csvkit, a suite of Python utilities for working with CSV.
To install:
pip install csvkit
To solve your problem
csvsql --db sqlite:///path/to/file.db --insert --table mytable filecsv.txt
Try using transactions.
begin
insert 50,000 rows
commit
That will commit data periodically rather than once per row.
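A hedged sketch of that advice using the sqlite3 module, committing once per 50,000-row batch; the table and column layout from the question are assumed:
import csv, sqlite3

conn = sqlite3.connect("path/to/file.db")
cur = conn.cursor()
batch = []
for field1, field2, field3, field4, field5 in csv.reader(open("filecsv.txt", "rb")):
    batch.append((field2, field4))
    if len(batch) >= 50000:
        cur.executemany('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', batch)
        conn.commit()  # one commit per 50,000 rows instead of one per row
        batch = []
if batch:
    cur.executemany('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?,?)', batch)
    conn.commit()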
Pandas makes it easy to load big files into databases in chunks. Read the CSV file into a Pandas DataFrame and then use the Pandas SQL writer (so Pandas does all the hard work). Here's how to load the data in 100,000 row chunks.
import pandas as pd
orders = pd.read_csv('path/to/your/file.csv')
orders.to_sql('orders', conn, if_exists='append', index = False, chunksize=100000)
Modern Pandas versions are very performant. Don't reinvent the wheel. See here for more info.
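If the CSV itself is too large to read into memory at once, a hedged variant that also chunks the read side; the file path and chunk size are assumptions:
import pandas as pd
import sqlite3

conn = sqlite3.connect("path/to/file.db")
# stream the CSV in 100,000-row pieces and append each piece to the table
for chunk in pd.read_csv('path/to/your/file.csv', chunksize=100000):
    chunk.to_sql('orders', conn, if_exists='append', index=False)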
