How do I speed up my sqlite3 queries in Python? - python

I have an sqlite table with a few hundred million rows:
sqlite> create table t1(id INTEGER PRIMARY KEY,stuff TEXT );
I need to query this table by its integer primary key hundreds of millions of times. My code:
conn = sqlite3.connect('stuff.db')
with conn:
cur = conn.cursor()
for id in ids:
try:
cur.execute("select stuff from t1 where rowid=?",[id])
stuff_tuple = cur.fetchone()
#do something with the fetched row
except:
pass #for when id is not in t1's key set
Here, ids is a list that may have tens of thousands of elements. Forming t1 did not take very long (ie ~75K inserts per second). Querying t1 the way I've done it is unacceptably slow (ie ~1K queries in 10 seconds).
I am completely new to SQL. What am I doing wrong?

Since you're retrieving values by their keys, it seems like a key/value store would be more appropriate in this case. Relational databases (Sqlite included) are definitely feature-rich, but you can't beat the performance of a simple key/value store.
There are several to choose from:
Redis: "advanced key-value store", very fast, optimized for in-memory operation
Cassandra: extremely high performance, scalable, used by multiple high-profile sites
MongoDB: feature-rich, tries to be "middle ground" between relational and NoSQL (and they've started offering free online classes)
And there's many, many more.

You should make one sql call instead, should be must faster
conn = sqlite3.connect('stuff.db')
with conn:
cur = conn.cursor()
for row in cur.execute("SELECT stuff FROM t1 WHERE rowid IN (%s)" % ','.join('?'*len(ids)), ids):
#do something with the fetched row
pass
you do not need a try except since ids not in the db will not show up. If you want to know which ids are not in the results, you can do:
ids_res = set()
for row in c.execute(...):
ids_res.add(row['id'])
ids_not_found = ids_res.symmetric_difference(ids)

Related

Upload an entire CSV into SQL Server [duplicate]

Below is my code that I'd like some help with.
I am having to run it over 1,300,000 rows meaning it takes up to 40 minutes to insert ~300,000 rows.
I figure bulk insert is the route to go to speed it up?
Or is it because I'm iterating over the rows via for data in reader: portion?
#Opens the prepped csv file
with open (os.path.join(newpath,outfile), 'r') as f:
#hooks csv reader to file
reader = csv.reader(f)
#pulls out the columns (which match the SQL table)
columns = next(reader)
#trims any extra spaces
columns = [x.strip(' ') for x in columns]
#starts SQL statement
query = 'bulk insert into SpikeData123({0}) values ({1})'
#puts column names in SQL query 'query'
query = query.format(','.join(columns), ','.join('?' * len(columns)))
print 'Query is: %s' % query
#starts curser from cnxn (which works)
cursor = cnxn.cursor()
#uploads everything by row
for data in reader:
cursor.execute(query, data)
cursor.commit()
I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).
SpikeData123 is the table name.
As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.
pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...
cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")
crsr.fast_executemany = True # new in pyodbc 4.0.19
sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')
... reduced the execution time to just over 1 second.
Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.
Update - April 2019: As noted in the comment from #SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).
BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...
1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07
... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contains commas) ...
1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07
... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...
1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07
... or a tab-delimited file (where → represents the tab character) ...
1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07
... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:
import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
FIELDTERMINATOR='\\t',
ROWTERMINATOR='\\n'
);
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.
yes bulk insert is right path for loading large files into a DB. At a glance I would say that the reason it takes so long is as you mentioned you are looping over each row of data from the file which effectively means are removing the benefits of using a bulk insert and making it like a normal insert. Just remember that as it's name implies that it is used to insert chucks of data.
I would remove loop and try again.
Also I'd double check your syntax for bulk insert as it doesn't look correct to me. check the sql that is generated by pyodbc as I have a feeling that it might only be executing a normal insert
Alternatively if it is still slow I would try using bulk insert directly from sql and either load the whole file into a temp table with bulk insert then insert the relevant column into the right tables. or use a mix of bulk insert and bcp to get the specific columns inserted or OPENROWSET.
This problem was frustrating me and I didn't see much improvement using fast_executemany until I found this post on SO. Specifically, Bryan Bailliache's comment regarding max varchar. I had been using SQLAlchemy and even ensuring better datatype parameters did not fix the issue for me; however, switching to pyodbc did. I also took Michael Moura's advice of using a temp table and found it shaved of even more time. I wrote a function in case anyone might find it useful. I wrote it to take either a list or list of lists for the insert. It took my insert of the same data using SQLAlchemy and Pandas to_sql from taking upwards of sometimes 40 minutes down to just under 4 seconds. I may have been misusing my former method though.
connection
def mssql_conn():
conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
server=os.environ.get('MS_SQL_SERVER'),
database='EHT',
uid=os.environ.get('MS_SQL_UN'),
pwd=os.environ.get('MS_SQL_PW'),
autocommit=True)
return conn
Insert function
def mssql_insert(table,val_lst,truncate=False,temp_table=False):
'''Use as direct connection to database to insert data, especially for
large inserts. Takes either a single list (for one row),
or list of list (for multiple rows). Can either append to table
(default) or if truncate=True, replace existing.'''
conn = mssql_conn()
cursor = conn.cursor()
cursor.fast_executemany = True
tt = False
qm = '?,'
if isinstance(val_lst[0],list):
rows = len(val_lst)
params = qm * len(val_lst[0])
else:
rows = 1
params = qm * len(val_lst)
val_lst = [val_lst]
params = params[:-1]
if truncate:
cursor.execute(f"TRUNCATE TABLE {table}")
if temp_table:
#create a temp table with same schema
start_time = time.time()
cursor.execute(f"SELECT * INTO ##{table} FROM {table} WHERE 1=0")
table = f"##{table}"
#set flag to indicate temp table was used
tt = True
else:
start_time = time.time()
#insert into either existing table or newly created temp table
stmt = f"INSERT INTO {table} VALUES ({params})"
cursor.executemany(stmt,val_lst)
if tt:
#remove temp moniker and insert from temp table
dest_table = table[2:]
cursor.execute(f"INSERT INTO {dest_table} SELECT * FROM {table}")
print('Temp table used!')
print(f'{rows} rows inserted into the {dest_table} table in {time.time() -
start_time} seconds')
else:
print('No temp table used!')
print(f'{rows} rows inserted into the {table} table in {time.time() -
start_time} seconds')
cursor.close()
conn.close()
And my console results first using a temp table and then not using one (in both cases, the table contained data at the time of execution and Truncate=True):
No temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 10.595500707626343
seconds
Temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 3.810380458831787
seconds
FWIW, I gave a few methods of inserting to SQL Server some testing of my own. I was actually able to get the fastest results by using SQL Server Batches and using pyodbcCursor.execute statements. I did not test the save to csv and BULK INSERT, I wonder how it compares.
Here's my blog on the testing I did:
http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
adding to Gord Thompson's answer:
# add the below line for controlling batch size of insert
cursor.fast_executemany_rows = batch_size # by default it is 1000

How to use UPDATE/DELETE WHERE CURRENT OF in psycopg2?

I'm using server-side cursor in PostgreSQL with psycopg2, based on this well-explained answer.
with conn.cursor(name='name_of_cursor') as cursor:
query = "SELECT * FROM tbl FOR UPDATE"
cursor.execute(query)
for row in cursor:
# process row
In processing each row, I'd like to update a few fields in the row using PostgreSQL's UPDATE tbl SET ... WHERE CURRENT OF name_of_cursor (docs), but it seems that, when the for loop enters and row is set, the position of the server-side cursor is in a different record, so while I can run the command, the wrong record is updated.
How can I make sure the result iterator is in the same position as the cursor? (also preferably in a way that won't make the loop slower than updating using an ID)
The reason why a different record was being updated was because internally psycopg2 does a FETCH FORWARD 1000 (or whatever the default chunk size is), positioning the cursor at the end of the block. You can override this by fetching one record at a time:
updcursor = conn.cursor()
with conn.cursor(name='name_of_cursor') as cursor:
cursor.itersize = 1 # to make server-side cursor be in the same position as the iterator
cursor.execute('SELECT * FROM tbl FOR UPDATE')
for row in cursor:
# process row...
updcursor.execute('UPDATE tbl SET fld1 = %s WHERE CURRENT OF name_of_cursor', [val])
The snippet above will update the correct record. Note that you cannot use the same cursor for selecting and updating, they must be different cursors.
Performance
Reducing the FETCH size to 1 reduces the retrieval performance by a lot. I definitely wouldn't recommend using this technique if you're iterating a large dataset (which is probably the case you're searching for server-side cursors) from a different host than the PostgreSQL server.
I ended up using a combination of exporting records to CSV, then importing them later using COPY FROM (with the function copy_expert).

Inserting into MySQL from Python using mysql.connector module - How to use executemany() to inert rows and child rows?

I have a MySQL server running on a remote host. The connection to the host is fairly slow and it affects the performance of the Python code I am using. I find that using the executemany() function makes a big improvement over using an loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows in another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
Is there a way to get a list of auto generated keys that were created using an executemany() call?
If not, can I use last_insert_id() and assume that the auto generated keys will be in sequence?
Looking at the sample code below, is there a simpler or better way do accomplish this task?
What if my cars dictionary were empty? No rows would be inserted so what would the last_insert_id() return?
My tables...
Table: makes
pkey bigint autoincrement primary_key
make varchar(255) not_null
Table: models
pkey bigint autoincrement primary_key
make_key bigint not null
model varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = list(cars.keys())
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sqldata)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
for model in cars[car]:
sql_data.append((this_key, car))
this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
Yes, it is clumsy to get the ids generated by an insert. I have found no straightforward way like LAST_INSERT_ID, but that works only for a single-row insert.
So, I have developed the following to do a batch of "normalization" inserts. This is where you a have a table that maps strings to ids (where the string is likely to show up repeatedly). It takes 2 steps: First a batch insert of the "new" strings. Then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization
(Sorry, I am not fluent in python or the hundred other ways to talk to MySQL, so I can't give you python code.)
Your use case example is "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoiding 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.

Psycopg2 rowcount for server side cursor

I am querying a Postgres database for a large number of results and want to use server side cursors to stream the results to my client. It looks like when I do this, the rowcount attribute of the cursor is now set to -1 after I execute the query. I'm creating the cursor like so:
with db.cursor('cursor_name') as cursor:
Is there a way to find the number of results of my query while streaming results from the database? (I could do a SELECT COUNT(*), but I'd like to avoid that because I'm trying to abstract away the code around the query and that would complicate the API).
In the case of a server-side cursor, although cursor.execute() returns, the query has not necessarily been executed by the server at that point, and so the row count is not available to psycopg2. This is consistent with the DBAPI 2.0 spec which states that rowcount should be -1 if the row count of the last operation is indeterminate.
Attempts to coerce it with cursor.fetchone(), for example, updates cursor.rowcount, but only by the number of items retrieved, so that is not useful. cursor.fetchall() will result in rowcount being correctly set, however, that performs the full query and transfer of data that you seek to avoid.
A possible workaround that avoids a completely separate query to get the count, and which should give accurate results is:
select *, (select count(*) from test) from test;
This will result in each row having the table row count appended as the final column. You can then get the table row count using cursor.fetchone() and then taking the final column:
with db.cursor('cursor_name') as cursor:
cursor.execute('select *, (select count(*) from test) from test')
row = cursor.fetchone()
data, count = row[:-1], row[-1]
Now count will contain the number of rows in the table. You can use row[:-1] to refer to the row data.
This might slow down the query because a possibly expensive SELECT COUNT(*) will be performed, but once done retrieving the data should be fast.

Proper way to store table in python efficiently: low memory usage and fast search by few complex indexes

I need to store few tables of strings in python (each table contains few million records). Let the header be ("A", "B", "C") and ("A, "B") is the data primary key. Then I need following operations to proceed fast:
Add new record (need O(1) complexity).
Find / update, delete record with (A="spam", B="eggs") (need O(1) complexity).
Find all records with (A="spam", C="foo") (need O(k) complexity, where k is the number of result rows).
I see a solution based on nested dicts structure for each index. It fits my needs, but I think, there is a better existing solution.
As suggested in the comments use a database. sqlite3 is small and fairly easy. It creates a database that exists in a single file and you interact with it.
Here is an adapted example from the API
import sqlite3
# Connect to your database (or create it if it was not there)
db = sqlite3.connect('data.db')
# Create the table
conn = db.cursor()
conn.execute("""
CREATE TABLE my_table
A text,
B text,
C text
""")
# Add an entry to the db
conn.execute("INSERT INTO my_table VALUES ('spam','eggs','foo')")
# Read all the entries under a condition
for row in conn.execute("SELECT * FROM my_table WHERE A='spam' AND C='foo'"):
print(row)
#safely close the db connection
conn.close()
Note: example is in python3

Categories

Resources