I have an SQLite3 database with a table that has twenty million rows.
I would like to update the values of some of the columns in the table (for all rows).
I am running into performance issues (only about 1,000 rows processed per second).
I would like to continue using the peewee module in python to interact with the
database.
So I'm not sure if I am taking the right approach with my code. After trying some ideas that all failed, I attempted to perform the update in batches. My first solution was to iterate over the cursor with islice, like so:
import math, itertools
from tqdm import tqdm
from cool_project.database import db, MyTable

def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()
total_rows = rows.count()
page_size = 1000
total_pages = math.ceil(total_rows / page_size)

# Start #
with db.atomic():
    for page_num in tqdm(range(total_pages)):
        page = list(itertools.islice(rows, page_size))
        for row in page: update_row(row)
        MyTable.bulk_update(page, fields=fields)
This failed because it would attempt to load the result of the whole query into memory. So I adapted the code to use the paginate function.
import math
from tqdm import tqdm
from cool_project.database import db, MyTable

def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()
total_rows = rows.count()
page_size = 1000
total_pages = math.ceil(total_rows / page_size)

# Start #
with db.atomic():
    for page_num in tqdm(range(1, total_pages+1)):
        # Get a batch #
        page = MyTable.select().paginate(page_num, page_size)
        # Update #
        for row in page: update_row(row)
        # Commit #
        MyTable.bulk_update(page, fields=fields)
But it's still quite slow, and would take >24 hours to complete.
What is strange is that the speed (in number of rows per second) notably decreases as time goes by. The script starts at ~1000 rows per second, but after half an hour it's down to 250 rows per second.
Am I missing something? Thanks!
The issues are twofold -- you are pulling all results into memory and you are using the bulk update API, which is quite complex/special and which is also completely unnecessary for SQLite.
Try the following:
def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()

with db.atomic():
    for row in rows.iterator():  # Add ".iterator()" to avoid caching rows
        update_row(row)
        row.save(only=fields)
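A note on why this helps: .iterator() streams rows from the cursor instead of caching them, and paginate() in the second attempt relies on LIMIT/OFFSET, which forces SQLite to scan and discard ever more rows as the offset grows, which is consistent with the slowdown over time. If committing twenty million updates in a single transaction turns out to be a problem, one possible variation (a sketch, not part of the answer above, assuming the same update_row()/computation() helpers) is to group the saves into batches using peewee's chunked() helper:

# Sketch: batch the commits, assuming the same helpers and models as above.
from peewee import chunked

from cool_project.database import db, MyTable

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select().iterator()   # stream rows without caching them
for batch in chunked(rows, 1000):    # peewee utility: yields lists of up to 1000 rows
    with db.atomic():                # one transaction (and one commit) per batch
        for row in batch:
            update_row(row)
            row.save(only=fields)    # one UPDATE per row, committed with its batch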
Related
I have a task to run 8 equal queries (1 query per country) and return data from a MySQL database. The reason I can't run 1 query with all countries in one is that each country needs to have different column names. Also, results need to be updated daily with a dynamic date range (last 7 days). Yes, I could run all countries and do the column naming and everything with Pandas, but I thought that the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists with all the countries, their respective dimensions, and date range variables that change according to the current date. The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem and how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date
#Create a connection to new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)
#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'
cursor = coon.cursor()
#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']
#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)
#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
    from aio.CreditApplication ca
    join aio.ScoringResult sr
        on sr.creditApplication_ID = ca.ID
    join aio.ScorecardVariableLine svl
        on svl.id = sr.scorecardVariableLine_ID
    join aio.ScorecardVariable sv
        on sv.ID = svl.scorecardVariable_ID
    where sv.country='{c}'
    #and sv.subType ="asc"
    and sv.subType != 'fsc'
    and sr.created >= '2020-01-01'
    and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
    group by ca.id,sv.subType"""
    #Check of sql query:
    print('query is done', time.time()-start_time)
    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #Check of assigning sql:
    print('sql is assigned', time.time()-start_time)
    start_time = time.time()
    df = pd.DataFrame(sql
        #, columns = ['created','ID','state']
    )
    #Check the df assignment:
    print('df has been assigned', time.time()-start_time)
    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print('csv has been created', time.time()-start_time)
    #Close the session
    start_time = time.time()
    cursor.close()
    #Check the session closing:
    print('The cursor is closed', time.time()-start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. That was me thinking that I have some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running them separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the step of querying the data. Every other step takes less than a second, but the querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables will have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
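To illustrate the kind of rewrite described (a hypothetical, simplified example with invented aliases and columns, not the author's actual query): joining onto a derived subquery can prevent the engine from using indexes, whereas joining the underlying tables directly and aggregating once at the end lets the existing indexes do the work.

# Hypothetical before/after, for illustration only.

# Before: left join onto a derived subquery -- no index is available on the derived table.
slow_query = """
select ca.ID, agg.total_score
from aio.CreditApplication ca
left join (
    select sr.creditApplication_ID, sum(svl.score) as total_score
    from aio.ScoringResult sr
    join aio.ScorecardVariableLine svl on svl.id = sr.scorecardVariableLine_ID
    group by sr.creditApplication_ID
) agg on agg.creditApplication_ID = ca.ID
"""

# After: join the indexed tables directly and aggregate at the end.
fast_query = """
select ca.ID, sum(svl.score) as total_score
from aio.CreditApplication ca
join aio.ScoringResult sr on sr.creditApplication_ID = ca.ID
join aio.ScorecardVariableLine svl on svl.id = sr.scorecardVariableLine_ID
group by ca.ID
"""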
I have two tables with ~9 to 12 million records that I am joining on their primary key, selecting all columns (30 columns each). I then load 100,000 rows into memory at a time (I have also tried 1,000, 10,000, etc.) to process them in chunks.
It takes ~9 seconds in SQL for this query:
9477056 rows returned in 9646ms from: SELECT A.*, B.* FROM 'tableA_2019-12-08' A JOIN 'tableB_2019-12-08' B USING(Symbol);
But I am seeing 250-300 seconds in python.
Here's the python code:
myQuery = "SELECT {} FROM '{}' A JOIN '{}' B USING(Symbol)".format(ColumnsToSelect,table1_name,table2_name)
c.execute(myQuery)
nr = 0
while c.fetchmany(100000) != []: # this will break when no more rows will be returned i.e end of table
nr+=1
print(nr)
results = c.fetchmany(100000)
#code to process results goes here but I commented it out to test the fetchmany speed```
Does anyone know why this is happening?
Edit: Tried changing it to this but I have the same issue, taking 250 seconds:
DataProcessed = False
start_time_diffgen = time.time()
while DataProcessed is not True:  # this will break when no more rows will be returned, i.e. end of table
    result = c.fetchmany(1000000)
    nr += 1
    if result == []:
        DataProcessed = True
I was trying to read a very large MySQL table made of several million rows. I used the Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
            #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
Actually the execution time is acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example, 1 million rows), then the execution time dramatically increases.
Is there a better/faster way to select the data from a relational database within Python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a timeseries, or if there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show different cases.
Example 1
import multiprocessing
import pandas as pd
import MySQLdb

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x

p.close()
results = pd.concat(data)
Example 2
import multiprocessing
import datetime
import pandas as pd
import MySQLdb

def worker(a, b):
    # where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= {0} AND x < {1}".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)

date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary here, and will depend on your data / knowing your data beforehand (i.e. d1, d2 and an appropriate freq to use)

date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)  # starmap unpacks each (a, b) pair into worker's two arguments

p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe).
pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
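For example, the chunked read from the question should work essentially unchanged with mysqlclient; only the connection object differs. A sketch, reusing the example_table query and placeholder credentials from the question:

import MySQLdb
import pandas as pd

# Same placeholders as in the question; only the driver changes.
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')

chunks = []
for chunk in pd.read_sql("SELECT * FROM example_table;", connection, chunksize=1000):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)

connection.close()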
For those using Windows and having trouble installing MySQLdb: I'm using this approach to fetch data from a huge table.
import mysql.connector

# Connection setup (credentials are placeholders, as in the other examples)
connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0  # OFFSET starts at 0
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):  # break the loop when no more rows
        print("Done!")
        break
    for row in rows:  # do something with results
        print(row)
    i += limit
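One caveat with this pattern (a side note, not part of the answer above): LIMIT with a growing offset makes MySQL scan and discard all the skipped rows, so later pages get progressively slower. If the table has a numeric primary key, keyset pagination avoids that. A rough sketch, assuming a hypothetical id column on super_table and the same placeholder credentials:

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

last_id = 0
limit = 1000
while True:
    # Seek directly to the next batch via the primary key instead of skipping offset rows.
    cursor.execute(
        "SELECT * FROM super_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, limit),
    )
    rows = cursor.fetchall()
    if not rows:
        print("Done!")
        break
    for row in rows:
        print(row)
    last_id = rows[-1][0]  # assumes id is the first selected column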
While reading large relations from a SQL database to a pandas dataframe, it would be nice to have a progress bar, because the number of tuples is known statically and the I/O rate could be estimated. It looks like the tqdm module has a function tqdm_pandas which will report progress on mapping functions over columns, but by default calling it does not have the effect of reporting progress on I/O like this. Is it possible to use tqdm to make a progress bar on a call to pd.read_sql?
Edit: Answer is misleading - chunksize has no effect on database side of the operation. See comments below.
You could use the chunksize parameter to do something like this:
chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)
df = pd.DataFrame()
for chunk in tqdm(chunks):
    df = pd.concat([df, chunk])
I think this would use less memory as well.
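A side note on the snippet above (not part of the original answer): concatenating into df inside the loop copies all previously accumulated rows on every iteration. Collecting the chunks and concatenating once keeps the tqdm progress bar and avoids the repeated copies. A sketch, assuming the same conn as above:

from tqdm import tqdm
import pandas as pd

chunks = pd.read_sql('SELECT * FROM table', con=conn, chunksize=100)
df = pd.concat([chunk for chunk in tqdm(chunks)], ignore_index=True)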
Yes, you can! Expanding the answer here, and Alex's answer, to include tqdm, we get:
# get total number or rows
q = f"SELECT COUNT(*) FROM table"
total_rows = pd.read_sql_query(q, conn).values[0, 0]
# note that COUNT implementation should not download the whole table.
# some engine will prefer you to use SELECT MAX(ROWID) or whatever...
# read table with tqdm status bar
q = f"SELECT * FROM table"
rows_in_chunk = 1_000
chunks = pd.read_sql_query(q, conn, chunksize=rows_in_chunk)
df = tqdm(chunks, total=total_rows/rows_in_chunk)
df = pd.concat(df)
output example:
39%|███▉ | 99/254.787 [01:40<02:09, 1.20it/s]
I have around 20,000 rows in my Azure table. I wanted to query all the rows in the Azure table, but due to a certain Azure limitation I am getting only 1,000 rows.
My code
from azure.storage import TableService

table_service = TableService(account_name='xxx', account_key='YYY')
i = 0
tasks = table_service.query_entities('ValidOutputTable', "PartitionKey eq 'tasksSeattle'")
for task in tasks:
    i += 1
    print task.RowKey, task.DomainUrl, task.Status
print i
I want to get all the rows from the ValidOutputTable. Is there a way to do so?
But due to certain azure limitation i am getting only 1000 rows.
This is a documented limitation. Each query request to Azure Table will return no more than 1000 rows. If there are more than 1000 entities, table service will return a continuation token that must be used to fetch next set of entities (See Remarks section here: http://msdn.microsoft.com/en-us/library/azure/dd179421.aspx)
Please see the sample code to fetch all entities from a table:
from azure import *
from azure.storage import TableService

table_service = TableService(account_name='xxx', account_key='yyy')

i = 0
next_pk = None
next_rk = None
while True:
    entities = table_service.query_entities('Address', "PartitionKey eq 'Address'", next_partition_key=next_pk, next_row_key=next_rk, top=1000)
    i += 1
    for entity in entities:
        print(entity.AddressLine1)
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
Update 2019
Just running a for loop over the query result (as the author of the topic does) will get all the data from the query.
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name='account_name', account_key='key')

# counter to keep track of records
counter = 0

# get the rows. Debugger shows the object has only 100 records
rows = table_service.query_entities(table, "PartitionKey eq 'mykey'")

for row in rows:
    if (counter % 100 == 0):
        # just to keep output smaller, print every 100 records
        print("Processing {} record".format(counter))
    counter += 1
The output proves that the loop goes past 1,000 records:
...
Processing 363500 record
Processing 363600 record
...
Azure Table Storage has a new Python library in preview release that is available for installation via pip. To install, use the following pip command:
pip install azure-data-tables
To query all rows for a given table with the newest library, you can use the following code snippet:
import os
from azure.data.tables import TableClient

key = os.environ['TABLES_PRIMARY_STORAGE_ACCOUNT_KEY']
account_name = os.environ['tables_storage_account_name']
endpoint = os.environ['TABLES_STORAGE_ENDPOINT_SUFFIX']
account_url = "{}.table.{}".format(account_name, endpoint)
table_name = "myBigTable"

with TableClient(account_url=account_url, credential=key, table_name=table_name) as table_client:
    try:
        table_client.create_table()
    except:
        pass

    i = 0
    for entity in table_client.list_entities():
        print(entity['value'])
        i += 1
        if i % 100 == 0:
            print(i)
Your output would look like this (modified for brevity, assuming there are 2000 entities):
...
1100
1200
1300
...