I have a large dataset of 50M+ records in a PostgreSQL database that requires heavy calculations and inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially: each of the 50M records is fetched separately, then another query (see the example below) needs to run against it before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total), and the complex query (that returns a calculated value - example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query, and then save the result?
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query was written by our experienced DBA, so I trust that it is optimised.
Don't loop over records and call the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably: all) of the data,
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
    FROM (
        SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
        FROM tt_tweets tw1
        WHERE 1=1
        AND tw1.in_reply_to_id > 0
        AND tw1.is_retweet = False
        AND tw1.did_resolve = False
        AND NOT EXISTS ( SELECT * FROM tweets nx
            WHERE nx.id = tw1.in_reply_to_id)
        AND NOT EXISTS ( SELECT * FROM tt_tweets nx
            WHERE nx.id = tw1.in_reply_to_id)
        UNION ALL
        SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
        FROM tweets tw2
        WHERE 1=1
        AND tw2.in_reply_to_id > 0
        AND tw2.is_retweet = False
        AND tw2.did_resolve = False
        AND NOT EXISTS ( SELECT * FROM tweets nx
            WHERE nx.id = tw2.in_reply_to_id)
        AND NOT EXISTS ( SELECT * FROM tt_tweets nx
            WHERE nx.id = tw2.in_reply_to_id)
        -- ORDER BY tw2.created_at DESC
    ) twx
    LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ))
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup is None:
            break
        print('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0])            # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[tup[1]] = [tup[0]]  # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
- you only select fields from the er table
- so the two LEFT JOINed tables could be omitted
- if ex and ef do contain multiple matching rows, the result set could be larger than just all the rows selected from er, resulting in duplicated er records
- there is a GROUP BY present, but no aggregates are in the select list
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
where er.gg >= '123'
and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
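Applied back to the original question, the set-based version of this advice might look like the sketch below: one INSERT ... SELECT computes and stores the calculated value for every record, so Python never loops over individual rows. results_table and the connection string are placeholders, and it assumes one matching row per x in table_two and table_three.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection
with conn, conn.cursor() as cur:
    # one statement computes and stores the value for every record,
    # instead of one round trip per record from Python
    cur.execute("""
        INSERT INTO results_table (gg, calculated_field)
        SELECT CR.gg, (AX.b + BF.f) / CR.d
        FROM table_one CR
        LEFT JOIN table_two   AX ON AX.x = CR.x
        LEFT JOIN table_three BF ON BF.x = CR.x
    """)
conn.close()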
Since you are doing a join in your query, the logical thing to do is to work around it, meaning create what's known as a summary table. This summary table, residing on the database, will hold the final joined dataset, so in your Python code you will just fetch/select data from it.
Another way is to use a materialized view.
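For example, a minimal sketch of the materialized-view variant (the view name calc_summary and connection string are placeholders following the query above): build the joined dataset once on the server, refresh it as needed, and have Python read only pre-computed rows.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection
with conn, conn.cursor() as cur:
    # one-off: materialize the joined/calculated dataset on the server
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS calc_summary AS
        SELECT CR.gg, (AX.b + BF.f) / CR.d AS calculated_field
        FROM table_one CR
        LEFT JOIN table_two   AX ON AX.x = CR.x
        LEFT JOIN table_three BF ON BF.x = CR.x
    """)
    # later runs only need a refresh, then plain SELECTs
    cur.execute("REFRESH MATERIALIZED VIEW calc_summary")
    cur.execute("SELECT gg, calculated_field FROM calc_summary WHERE gg = '123'")
    print(cur.fetchone())
conn.close()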
I took @wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient, to say the least, and total run time dropped from a couple of hours to somewhere between minutes and roughly an hour.
To recap:
- Database records are no longer fetched one by one in the sequence described earlier.
- Calculations happen inside the database via a PostgreSQL function (a rough sketch of that shape follows below).
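For reference, a minimal sketch of that shape, with hypothetical names (calc_value, results_table) rather than the actual production function: the calculation lives in a server-side SQL function, and a single set-based statement applies it to every record.

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection
with conn, conn.cursor() as cur:
    # the calculation, expressed once as a server-side function
    cur.execute("""
        CREATE OR REPLACE FUNCTION calc_value(p_gg text) RETURNS numeric AS $$
            SELECT (AX.b + BF.f) / CR.d
            FROM table_one CR
            LEFT JOIN table_two   AX ON AX.x = CR.x
            LEFT JOIN table_three BF ON BF.x = CR.x
            WHERE CR.gg = p_gg
        $$ LANGUAGE sql STABLE;
    """)
    # apply it to all records in one statement instead of 50M round trips
    cur.execute("""
        INSERT INTO results_table (gg, calculated_field)
        SELECT gg, calc_value(gg) FROM table_one
    """)
conn.close()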
Related
I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to update the existing table with any new API data.
At the moment, the way the query works, I have a table that looks like this:
From this I am taking the suburbs of each state and copying them into a csv for each different state.
Then, using this script, I am cleaning them into a list (the API needs "%20" for any spaces):
suburbs = ["want this", "want this (meh)", "this as well (nope)"]  # example input
dont_want = frozenset(["(meh)", "(nope)"])

suburb_cleaned = []
for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()
    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)
    suburb_cleaned.append('%20'.join(cleaned_name))
Then taking the suburbs for each state and putting them into this API to return a csv,
import time
import requests
import pandas as pd

timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT" + timestr + ".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)
    api_results[urbs] = pd.DataFrame(request.json())
    n = n + 1
    if n == y:
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y + 2
        continue
    print("made it through " + urbs)
    # print(request.json())
    # print(api_results)

dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then adding the states manually in excel, and combining and cleaning the suburb names.
# use pd.concat
df = pd.concat([act, vic,nsw,SA,QLD,WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally inserting it into a db
engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)
Leading to this sort of output:
Now, this is a heck of a process. I would love to simplify it and make the database only update the values that are needed from the API, rather than have this much complexity in getting the data.
I would love some helpful tips on achieving this goal. I'm thinking I could do an UPDATE on the MySQL database instead of an INSERT, or something along those lines? And with the querying of the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?
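For instance, a rough sketch of a CSV-free flow (placeholder URL, credentials, table and column names), where each API response is appended straight to the existing table:

import pandas as pd
import requests
from sqlalchemy import create_engine

engine = create_engine('mysql://username:password@localhost/dbname')    # placeholder credentials
suburb_cleaned = ["Alice%20Springs", "Darwin"]                          # example list from the cleaning step

frames = []
for urb in suburb_cleaned:
    resp = requests.get("http://mwap.com/api" + urb + "gxg&state=" + "NT")
    # tag each response with its suburb and state so the manual Excel step is no longer needed
    frames.append(pd.DataFrame(resp.json()).assign(suburb=urb.replace('%20', ' '), state='NT'))

df = pd.concat(frames, ignore_index=True)
# append to the existing table instead of rebuilding it from CSV files
df.to_sql('Price_historic', engine, if_exists='append', index=False)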
Generally, you wouldn't want to overwrite the mysql table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX for them. For example if your street and price values designate a unique entry, then in mysql you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
"VALUES (%s, %s, %s, %s, %s, %s, %s, %s) " \
"ON DUPLICATE KEY UPDATE " \
"state = VALUES(state), date = VALUES(date)"
con = pdb.connect(db_host, db_user, db_pass, db_name)  # pdb: your MySQL DB-API driver, e.g. pymysql
with con:
    cur = con.cursor()
    cur.executemany(final_str, insert_list)
    con.commit()
If the setup you are building is something longer term, I would suggest running 2 different processes in parallel:
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag that specifies that only API 1 has been called.
Process 2:
Run a query on the DB to obtain all records needed for API call 2, based on the binary/bit flag set in process 1. For the corresponding data, run call 2 and update the data back to the DB table based on the primary key.
Database: I would suggest adding a primary key as well as a [Bit Flag][1] that tracks the status of the different API calls. A bit flag also helps you
- in case you want to double-check whether a specific API call has been made for a specific record or not;
- expand your project to additional API calls while still tracking the status of each API call at record level.
A rough sketch of this two-process flow is shown below the reference link.
[1]: Bit Flags: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612
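A rough sketch of this two-process flow, with hypothetical table, column and flag names (using pymysql as the driver):

import pymysql   # assuming MySQL; any DB-API driver works the same way

conn = pymysql.connect(host='localhost', user='user', password='pass', db='dbname')

def process_one(rows):
    """Insert API 1 results with the api1_done flag set and api2_done cleared."""
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO price_data (suburb, price, api1_done, api2_done) "
            "VALUES (%s, %s, 1, 0)",
            rows)
    conn.commit()

def process_two(call_api2):
    """Pick up rows API 1 has produced but API 2 has not processed yet."""
    with conn.cursor() as cur:
        cur.execute("SELECT id, suburb FROM price_data WHERE api1_done = 1 AND api2_done = 0")
        for row_id, suburb in cur.fetchall():
            extra = call_api2(suburb)          # second API call
            cur.execute("UPDATE price_data SET extra_info = %s, api2_done = 1 WHERE id = %s",
                        (extra, row_id))
    conn.commit()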
I am using python version 3.6.8.
Recently, with code I have been using for several months in a Jupyter notebook, I began getting the above error. The code uses cx_Oracle to connect and run several queries. I have never gotten this error before when running these queries, and the queries generally completed within 1 minute.
The code creates a sql query string in the following form (here I use *** to mean the query is actually much longer but I have stripped it to the basics):
def function(d1, d2):
    conn = cx_Oracle.connect(dsn="connection")
    sqlstr = f"""select *** from table
                 where date between {d1} and {d2}
              """
    query1 = pd.read_sql(f"""
        select * from
        ({sqlstr})
        """, conn)
    # ...several similar queries...
    conn.close()
    return [(query1, query2, ...)]
Based on the testing I have done, I think the error likely comes from the string formatting, but I am not sure why it has become an issue now, seemingly at random, when the code has been working for quite a while. If I remove the date formatting, the SQL runs perfectly fine, and very quickly, elsewhere, which is why the formatting seemed to be the issue.
EDIT:
Even after editing out the string formatting I was still having issues, but I was able to isolate it down to one single query, which is not able to run as a direct SQL query either, so it is likely not a Python/string issue but a DB issue (my apologies for the misplaced assurance).
the query is:
select column1,column2,sum(column3)
from
(select several variables,Case When ... as column2
from table1 inner join table2 inner join table3
where several conditions
UNION ALL
select several variables, Case When ... as column2
from table1 inner join table2 inner join table3
where several conditions)
group by column1, column2
having (ABS(sum(column3))>=10000 and column1 in ('a','b','c'))
or (column1 not in ('a','b','c'))
order by column1, sum(column3) desc
I assume there must have been some changes on the DB end that make running this somewhat hefty query currently impossible and produce the above error? Isolating it further, it looks like it is potentially related to grouping by the CASE WHEN variable, column2.
As others have commented, this error has to do with TEMP tablespace sizing and the workload on the DB at the time. Also, when you get TEMP “blowouts”, it is possible that a particular SQL execution becomes the “victim” of, rather than the cause of, high TEMP usage. Sadly, there is no way to predict in advance how much TEMP tablespace a particular workload will require, so getting to the right setting is a process of incrementally increasing TEMP as per your best estimates; TEMP needs to be big enough to handle peak workload conditions. Having said that, you can use the Active Session History (requires the Diagnostics Pack) to find the high TEMP-consuming SQL and possibly tune it to use less TEMP. Alternatively, you can use/instrument v$sort_usage to find which SQL statements are using the most TEMP. The following query can be used to examine the current TEMP tablespace usage:
with sort as
(
SELECT username, sqladdr, sqlhash, sql_id, tablespace, session_addr
, sum(blocks) sum_blocks
FROM v$sort_usage
GROUP BY username, sqladdr, sqlhash, sql_id, tablespace, session_addr
)
, temp as
(
SELECT A.tablespace_name tablespace, D.mb_total,
SUM (A.used_blocks * D.block_size) / 1024 / 1024 mb_used,
D.mb_total - SUM (A.used_blocks * D.block_size) / 1024 / 1024 mb_free
FROM v$sort_segment A,
(
SELECT B.name, C.block_size, SUM (C.bytes) / 1024 / 1024 mb_total
FROM v$tablespace B, v$tempfile C
WHERE B.ts#= C.ts#
GROUP BY B.name, C.block_size
) D
WHERE A.tablespace_name = D.name
GROUP by A.tablespace_name, D.mb_total
)
SELECT to_char(sysdate, 'DD-MON-YYYY HH24:MI:SS') sample_time
, sess.sql_id
, CASE WHEN elapsed_time > 2*86399*1000000
THEN '2 ' || to_char(to_date(round((elapsed_time-(2*86399*1000000))/decode(executions, 0, 1, executions)/1000000) ,'SSSSS'), 'HH24:MI:SS')
WHEN elapsed_time > 86399*1000000
THEN '1 ' || to_char(to_date(round((elapsed_time-(86399*1000000))/decode(executions, 0, 1, executions)/1000000) ,'SSSSS'), 'HH24:MI:SS')
WHEN elapsed_time <= 86399*1000000
THEN to_char(to_date(round(elapsed_time/decode(executions, 0, 1, executions)/1000000) ,'SSSSS'), 'HH24:MI:SS')
END as time_per_execution
, sum_blocks*dt.block_size/1024/1024 usage_mb, sort.tablespace
, temp.mb_used, temp.mb_free, temp.mb_total
, sort.username, sess.sid, sess.serial#
, p.spid, sess.osuser, sess.module, sess.machine, p.program
, vst.sql_text
FROM sort,
v$sqlarea vst,
v$session sess,
v$process p,
dba_tablespaces dt
, temp
WHERE sess.sql_id = vst.sql_id (+)
AND sort.session_addr = sess.saddr (+)
AND sess.paddr = p.addr (+)
AND sort.tablespace = dt.tablespace_name (+)
AND sort.tablespace = temp.tablespace
order by 4 desc
;
When I use the term "instrument" I mean that you can periodically persist the results of running this query, so that you can look back later to see what was running when you got a TEMP blowout.
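By way of illustration only (temp_usage_history is a hypothetical history table, and the capture query is a much-simplified version of the one above), the instrumentation could be as small as a scheduled job like this:

import cx_Oracle

conn = cx_Oracle.connect(dsn="connection")   # placeholder DSN
cur = conn.cursor()

# simplified capture: per-SQL TEMP block usage right now, stamped with sysdate
cur.execute("""
    INSERT INTO temp_usage_history (sample_time, sql_id, sum_blocks)
    SELECT sysdate, sql_id, SUM(blocks)
    FROM v$sort_usage
    GROUP BY sql_id
""")
conn.commit()
cur.close()
conn.close()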
In my case, there were some SELECT statements that ran for a very long time, about 10 minutes, while a typical query usually completes in 5 seconds.
I found that the SQL used SELECT COLUMN_A||COLUMN_B FROM TABLE_C ..., that is, it used || to concatenate columns and compared the result against another table, and that other table also used ||. This can force Oracle Database to do full table scans and use a lot of memory, so ORA-01652 occurs.
After changing || to ordinary column comparisons, SELECT COLUMN_A, COLUMN_B FROM TABLE_C ..., it executed normally, without the error.
I'm trying to sum up the values in two columns and truncate my date fields to the day. I've constructed the SQL query to do this (which works):
SELECT date_trunc('day', date) AS Day,
       SUM(fremont_bridge_nb) AS Sum_NB,
       SUM(fremont_bridge_sb) AS Sum_SB
FROM bike_count
GROUP BY Day
ORDER BY Day;
But I then run into issues when I try to format this into peewee:
query = (Bike_Count
         .select(fn.date_trunc('day', Bike_Count.date).alias('Day'),
                 fn.SUM(Bike_Count.fremont_bridge_nb).alias('Sum_NB'),
                 fn.SUM(Bike_Count.fremont_bridge_sb).alias('Sum_SB'))
         .group_by('Day')
         .order_by('Day'))
I don't get any errors, but when I print out the variable I stored this in, it shows:
<class 'models.Bike_Count'> SELECT date_trunc(%s, "t1"."date") AS
Day, SUM("t1"."fremont_bridge_nb") AS Sum_NB,
SUM("t1"."fremont_bridge_sb") AS Sum_SB FROM "bike_count" AS t1 ORDER
BY %s ['day', 'Day']
The only thing that I've written in Python to get data successfully is:
Bike_Count.get(Bike_Count.id == 1).date
If you just stick a string into your group by / order by, Peewee will try to parameterize it as a value. This is to avoid SQL injection haxx.
To solve the problem, you can use SQL('Day') in place of 'Day' inside the group_by() and order_by() calls.
Another way is to just stick the function call into the GROUP BY and ORDER BY. Here's how you would do that:
day = fn.date_trunc('day', Bike_Count.date)
nb_sum = fn.SUM(Bike_Count.fremont_bridge_nb)
sb_sum = fn.SUM(Bike_Count.fremont_bridge_sb)
query = (Bike_Count
.select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
.group_by(day)
.order_by(day))
Or, if you prefer:
query = (Bike_Count
.select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
.group_by(SQL('Day'))
.order_by(SQL('Day')))
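Either version can then be iterated like any other peewee select; for example (assuming the aliases above come back as the row keys):

# iterate the aggregated rows as plain dicts keyed by the aliases
for row in query.dicts():
    print(row['Day'], row['Sum_NB'], row['Sum_SB'])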
I have a csv file of customer ids (CRM_id). I need to get their primary keys (an autoincrement int) from the customers table of the database. (I can't be assured of the integrity of the CRM_ids so I chose not to make that the primary key).
So:
customers = []
with open("CRM_ids.csv", 'r', newline='') as csvfile:
    customerfile = csv.DictReader(csvfile, delimiter=',', quotechar='"', skipinitialspace=True)
    # only one "CRM_id" field per row
    customers = [c for c in customerfile]
So far so good? I think this is the most pythonesque way of doing that (but happy to hear otherwise).
Now comes the ugly code. It works, but I hate appending to the list because that has to copy and reallocate memory on each loop, right? Is there a better way (pre-allocating plus enumerate to keep track of the index comes to mind, but maybe there's an even quicker/better way by being clever with the SQL so as not to do several thousand separate queries...)?
cnx = mysql.connector.connect(user='me', password=sys.argv[1], host="localhost", database="mydb")
cursor = cnx.cursor()

select_customer = "SELECT id FROM customers WHERE CRM_id = %(CRM_id)s LIMIT 1;"

c_ids = []
for row in customers:
    cursor.execute(select_customer, row)
    # note: fetchall() returns a list of single-column tuples,
    # so we pull that column out with the [0] below
    c_ids.extend(cursor.fetchall())

c_ids = [c[0] for c in c_ids]
Edit:
The purpose is to get the primary keys into a list so I can use them to allocate some other data from other CSV files into linked tables (the customer id primary key is a foreign key in these other tables, and the allocation algorithm changes, so it's best to have the flexibility to do the allocation in Python rather than hard-coding SQL queries). I know this sounds a little backwards, but the "client" only works with spreadsheets rather than an ERP/PLM, so I have to build the "relations" for this small app myself.
What about changing your query to get what you want?
crm_ids = ",".join(customers)
select_customer = "SELECT UNIQUE id FROM customers WHERE CRM_id IN (%s);" % crm_ids
MySQL should be fine with even a multi-megabyte query, according to the manual; if it gets to be a really long list, you can always break it up - two or three queries is guaranteed much faster than a few thousand.
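A sketch of the chunked, parameterized version, reusing the customers list and cursor from the question (the chunk size is an arbitrary placeholder):

c_ids = []
chunk_size = 1000                                   # arbitrary; tune to taste
crm_values = [c['CRM_id'] for c in customers]
for i in range(0, len(crm_values), chunk_size):
    chunk = crm_values[i:i + chunk_size]
    placeholders = ",".join(["%s"] * len(chunk))
    cursor.execute("SELECT DISTINCT id FROM customers WHERE CRM_id IN (%s)" % placeholders,
                   tuple(chunk))
    c_ids.extend(row[0] for row in cursor.fetchall())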
how about storing your csv in a dict instead of a list:
customers = [c for c in customerfile]
becomes:
customers = {c['CRM_id']:c for c in customerfile}
then select the entire xref:
cursor.execute('select id, CRM_id from customers')
and add the new rowid as a new entry in the dict:
for row in cursor:
    customers[row[1]]['newid'] = row[0]
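If you still need the primary keys as a flat list afterwards (as in the original snippet), they can then be pulled straight out of the dict:

# ids for every CRM_id that was found in the customers table
c_ids = [c['newid'] for c in customers.values() if 'newid' in c]
# CRM ids from the CSV that had no match in the database
missing = [crm for crm, c in customers.items() if 'newid' not in c]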
So I found a great script over at QuantStart that had a great walk-through on setting up my own securities database and loading in historical pricing information. However, I'm now trying to modify the script so that I can run it daily and have the latest stock quotes added.
I adjusted the initial data load to just download 1 week's worth of historicals, but I've been having issues with writing the SQL statement to check whether the row already exists before adding it. Can anyone help me out with this? Here's what I have so far:
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    # Create the time now
    now = datetime.datetime.utcnow()

    # Amend the data to include the vendor ID and symbol ID
    daily_data = [(data_vendor_id, symbol_id, d[0], now, now,
                   d[1], d[2], d[3], d[4], d[5], d[6]) for d in daily_data]

    # Create the insert strings
    column_str = """data_vendor_id, symbol_id, price_date, created_date,
                    last_updated_date, open_price, high_price, low_price,
                    close_price, volume, adj_close_price"""
    insert_str = ("%s, " * 11)[:-2]
    final_str = "INSERT INTO daily_price (%s) VALUES (%s) WHERE NOT EXISTS (SELECT 1 FROM daily_price WHERE symbol_id = symbol_id AND price_date = insert_str[2])" % (column_str, insert_str)

    # Using the postgres connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
Some notes regarding your code above:
It's generally easier to defer to now() in pure SQL than to compute the timestamp in Python whenever possible. It avoids lots of potential pitfalls with timezones, library differences, etc.
If you construct a list of columns, you can dynamically generate a string of %s's based on its size, and don't need to hardcode the length into a repeated string which is then sliced.
Since it appears that insert_daily_data_into_db is meant to be called from within a loop on a per-stock basis, I don't believe you want to use executemany here, which would require a list of tuples and is very different semantically.
You were comparing symbol_id to itself in the sub select, instead of a particular value (which would mean it's always true).
To prevent possible SQL injection, you should always pass values in the WHERE clause as bound parameters, including in sub-selects.
Note: I'm assuming that you're using psycopg2 to access Postgres, and that the primary key for the table is a tuple of (symbol_id, price_date). If not, the code below would need to be tweaked at least a bit.
With those points in mind, try something like this (untested, since I don't have your data, db, etc. but it is syntactically valid Python):
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a tuple of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: Tuple of the OHLC data (with
    adj_close and volume)"""
    column_list = ["data_vendor_id", "symbol_id", "price_date", "created_date",
                   "last_updated_date", "open_price", "high_price", "low_price",
                   "close_price", "volume", "adj_close_price"]
    insert_list = ['%s'] * len(column_list)
    # 'now' is PostgreSQL's special timestamp input for the current time,
    # so the timestamps are resolved by the database rather than by Python
    values_tuple = (data_vendor_id, symbol_id, daily_data[0], 'now', 'now', daily_data[1],
                    daily_data[2], daily_data[3], daily_data[4], daily_data[5], daily_data[6])

    # INSERT ... SELECT ... WHERE NOT EXISTS: only insert when no row for this
    # (symbol_id, price_date) pair is already present
    final_str = """INSERT INTO daily_price ({0})
                   SELECT {1}
                   WHERE NOT EXISTS (SELECT 1
                                     FROM daily_price
                                     WHERE symbol_id = %s
                                     AND price_date = %s)""".format(', '.join(column_list),
                                                                    ', '.join(insert_list))

    # Using the postgres connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.execute(final_str, values_tuple + (values_tuple[1], values_tuple[2]))
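Called per symbol and per trading day (hypothetical driver variables shown, matching the one-row-per-call shape above), usage might look like:

# hypothetical driver loop: one (symbol, trading day) row per call
for symbol_id, day_tuple in rows_to_load:        # rows_to_load assembled elsewhere
    insert_daily_data_into_db(data_vendor_id, symbol_id, day_tuple)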