I have a series of computations on some data which I’m modelling as a graph with dask delayed. This works well; however, the graph itself takes as long or longer to create than the calculations take to run.
I add data throughout the day, so I would like to be able to change the inputs without recreating the graph. Is there a way to do this?
This is an advanced topic, so I am going to provide only a somewhat-hacky solution:
import dask
from dask.multiprocessing import get
@dask.delayed
def myfunc(x):
    return x+1
nested = 0
for x in range(1, 3):
    nested = myfunc(x*nested, dask_key_name=f'{x}')
# 1*0 + 1 = 1 -> 2*1 + 1 = 3
dag_modified = nested.dask.to_dict()
# keep the function of task '1' but replace its argument with 2
dag_modified['1'] = dag_modified['1'][0], 2
# 1*2 + 1 = 3 -> 2*3 + 1 = 7
print(get(dag_modified, '2'))
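To see why swapping a single entry is enough, here is a toy version of the same idea in plain Python, with no dask involved. The `get` evaluator, the `mul` helper, and the key names `input` and `scaled` are all made up for this sketch. dask's real graph format and schedulers are more involved, so treat it only as an illustration:

```python
def myfunc(x):
    return x + 1

def mul(a, b):
    return a * b

def get(graph, key):
    """Evaluate one key of a dict graph whose tasks are (function, *args) tuples."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # arguments that name another key are computed first; literals pass through
        resolved = [get(graph, a) if isinstance(a, str) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task  # plain data, e.g. the input value

graph = {
    "input": 0,
    "1": (myfunc, "input"),
    "scaled": (mul, 2, "1"),
    "2": (myfunc, "scaled"),
}

first = get(graph, "2")    # 0 + 1 = 1 -> 2*1 + 1 = 3
graph["input"] = 2         # change only the data; the tasks are untouched
second = get(graph, "2")   # 2 + 1 = 3 -> 2*3 + 1 = 7
print(first, second)
```

Keeping the input under its own key means the task entries never need to be rebuilt when new data arrives.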
I have a big dataframe with many columns. For simplicity, let's say:
import numpy as np
import pandas as pd
import scipy.interpolate
df_sample = pd.DataFrame({'a':np.arange(10)})
I need to define a new column in df_sample (say column 'b') which needs to use some interpolation function, the argument of which is to be taken from column 'a'.
Now, the problem is that the interpolation function is different for each row. For each row, I interpolate from a different 1D grid, so I have a different interpolation function for each row. What I did was generate these interpolation functions beforehand and store them in an array. As an example, the code below generates a sample array 'list_interpfns':
list_interpfns = np.array([None]*10)
for j in range(10):
    list_interpfns[j] = scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10),np.linspace(0,50,10))
To generate df_sample.b[j], I need to use the list_interpfns[j], with the argument df_sample.a[j]. Since I am not able to directly apply a column formula for this purpose, I put this inside a loop.
df_sample['b'] = 0.0  # float column, since the interpolated values are floats
for j in range(10):
    df_sample.loc[j,'b'] = list_interpfns[j](df_sample.a[j])
The problem is that this operation takes a lot of time. In this small example, the computation might seem fast. But my actual program is much larger, and when I compared the time taken by all operations, this particular sequence of operations took 84% of the total time, so I need to speed it up.
If there is some way to avoid the for loop (like using df.apply or something), then I believe it could reduce the operation time. Could you give possible alternatives?
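Consider avoiding the multiple for loops and the bookkeeping of initializing and updating arrays and series. Instead, use Series.apply() to pass each column value both to the function builder and to the resulting function as its argument (this works here because in the sample each value of 'a' equals its row number j):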
def interp_(j):
    return scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10), np.linspace(0,50,10))
df_sample['b_'] = df_sample['a'].apply(lambda x: interp_(x)(x))
The results replicate your original:
# a b b_
# 0 0 0.000000 0.000000
# 1 1 2.500000 2.500000
# 2 2 3.333333 3.333333
# 3 3 3.750000 3.750000
# 4 4 4.000000 4.000000
# 5 5 4.166667 4.166667
# 6 6 4.285714 4.285714
# 7 7 4.375000 4.375000
# 8 8 4.444444 4.444444
# 9 9 4.500000 4.500000
And timings suggest slightly faster processing even though Series.apply() is still a loop:
def run1():
    list_interpfns = np.array([None]*10)
    for j in range(10):
        list_interpfns[j] = scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10),
                                                       np.linspace(0,50,10))
    df_sample['b'] = 0
    for j in range(10):
        df_sample.loc[j,'b'] = list_interpfns[j](df_sample.a[j])
def run2():
    def interp_(j):
        return scipy.interpolate.interp1d(np.linspace(0,10*(j+1),10), np.linspace(0,50,10))
    df_sample['b_'] = df_sample['a'].apply(lambda x: interp_(x)(x))
if __name__=='__main__':
    from timeit import Timer
    f1 = Timer("run1()", "from __main__ import run1")
    res1 = f1.repeat(repeat=100, number=1)
    print('LOOP: {}'.format(np.mean(res1)))
    f2 = Timer("run2()", "from __main__ import run2")
    res2 = f2.repeat(repeat=100, number=1)
    print('APPLY: {}'.format(np.mean(res2)))
# LOOP: 0.006322918700000002
# APPLY: 0.0015046094699999867
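If the per-row grids are available as arrays, another option is to skip building interp1d objects altogether and call np.interp, which performs the same piecewise-linear interpolation for a query that lies inside the grid. This is only a sketch: `interp_rows`, `grids_x`, and `grids_y` are names invented here, and it assumes linear interpolation on increasing grids, as in the sample. It is still a Python-level loop and was not part of the timings above, so measure before relying on it.

```python
import numpy as np

def interp_rows(values, grids_x, grids_y):
    """Interpolate values[j] on the j-th grid (grids_x[j], grids_y[j])."""
    return np.array([np.interp(v, gx, gy)
                     for v, gx, gy in zip(values, grids_x, grids_y)])

a = np.arange(10)
# same sample grids as in the question: row j maps [0, 10*(j+1)] onto [0, 50]
grids_x = [np.linspace(0, 10 * (j + 1), 10) for j in range(10)]
grids_y = [np.linspace(0, 50, 10) for _ in range(10)]

b = interp_rows(a, grids_x, grids_y)
print(b)
```

Unlike interp1d with its default settings, np.interp clamps queries outside the grid to the end values instead of raising an error, so the two only agree for in-range arguments.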
I have a pd.DataFrame of return series corresponding to years, with a fixed spending rate of 5%. I am looking to find the ending portfolio value after spending for each year. Spending in year t is equal to the spending rate times the average of year t val_before_spending and year t-1 val_after_spending, and val_after_spending in year t is val_before_spending minus that spending. For the first year, the val_after_spending in t-1 is assumed to be 1.
I currently have a working implementation (below), but it is incredibly slow. Is there a faster way to implement this?
import pandas as pd
import numpy as np
port_rets = pd.DataFrame({'port_ret': [.10,-.25,.15]})
spending_rate = .05
for index, row in port_rets.iterrows():
    if index != 0:
        port_rets.at[index, 'val_before_spending'] = port_rets['val_after_spending'][index - 1] * (1 + port_rets['port_ret'][index])
        port_rets.at[index, 'spending'] = np.mean([port_rets['val_after_spending'][index - 1], port_rets['val_before_spending'][index]]) * spending_rate
    else:
        port_rets.at[index, 'val_before_spending'] = 1 * (1 + port_rets['port_ret'][index])
        port_rets.at[index, 'spending'] = np.mean([1, port_rets['val_before_spending'][index]]) * spending_rate
    port_rets.at[index, 'val_after_spending'] = port_rets['val_before_spending'][index] - port_rets['spending'][index]
# port_ret val_before_spending spending val_after_spending
#0 0.100000 1.100000 0.052500 1.047500
#1 -0.250000 0.785625 0.045828 0.739797
#2 0.150000 0.850766 0.039764 0.811002
Your code interfaces with pandas very heavily, which is a bad idea as far as performance is concerned. To be as easy to use as it is, pandas has to do a lot of bookkeeping on every access, which reduces performance.
Instead, do all the calculation in numpy and, once all the building blocks are ready, build the dataframe at the end. The code then translates to:
def get_vals(rates, spending_rate):
    n = len(rates)
    vals_after_spending = np.zeros((n+1, ))
    vals_before_spending = np.zeros((n+1, ))
    vals_after_spending[0] = 1.0
    for i in range(n):
        vals_before_spending[i+1] = vals_after_spending[i] * (1 + rates[i])
        spending = np.mean(np.array([vals_after_spending[i], vals_before_spending[i+1]])) * spending_rate
        vals_after_spending[i+1] = vals_before_spending[i+1] - spending
    return vals_before_spending[1:], vals_after_spending[1:]
rates = np.array(port_rets["port_ret"].tolist())
vals_before_spending, vals_after_spending = get_vals(rates, spending_rate)
port_rets = pd.DataFrame({'port_ret': rates, "val_before_spending": vals_before_spending, "val_after_spending": vals_after_spending})
We can improve further by JIT-compiling the code, since Python loops are slow.
Below I use numba:
import numba as nb
@nb.njit(cache=True) # as easy as putting this decorator
def get_vals(rates, spending_rate):
    n = len(rates)
    vals_after_spending = np.zeros((n+1, ))
    vals_before_spending = np.zeros((n+1, ))
    # ... code remains same, we are just compiling the function
If we consider a random list of rates like this :
port_rets = pd.DataFrame({'port_ret': np.random.uniform(low=-1.0, high=1.0, size=(100000,))})
We get the performance comparisons:
Your code : 15.758s
get_vals : 1.407s
JITed get_vals : 0.093s (on second run to discount for compile time)
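The loop can also be removed entirely. Substituting the spending formula into the update gives val_after[t] = val_after[t-1] * ((1 + r[t]) - spending_rate * (2 + r[t]) / 2), so the whole series is a cumulative product. The sketch below assumes the same starting value of 1 as the question. `get_vals_vectorized` and its `start` parameter are names made up here, and this variant was not part of the timings above.

```python
import numpy as np

def get_vals_vectorized(rates, spending_rate, start=1.0):
    """Closed-form version of the loop: one growth factor per year, then cumprod."""
    rates = np.asarray(rates, dtype=float)
    # per-year factor applied to the previous year's value after spending
    growth = (1 + rates) - spending_rate * (2 + rates) / 2
    after = start * np.cumprod(growth)
    # previous year's value after spending, with `start` for the first year
    prev_after = np.concatenate(([start], after[:-1]))
    before = prev_after * (1 + rates)
    spending = (prev_after + before) / 2 * spending_rate
    return before, spending, after

before, spending, after = get_vals_vectorized([.10, -.25, .15], .05)
print(np.round(after, 6))
```

For the three sample returns this reproduces the table in the question to the printed precision.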
Here is my code:
import xlwings as xw
import datetime as dt
import numpy as np
import pandas as pd
import threading
import time
#connect to workbook
wb = xw.Book(r'C:\Users\Ryan\AppData\Local\Programs\Python\Python37-32\constituents.xlsx')
sht = wb.sheets['constituents']
#store data in np array, pass to Pandas
a = sht.range('A2:C1760').options(np.array).value
df = pd.DataFrame(a)
df = df.rename(index=str, columns={0: "tickers", 1: "start_dates", 2: "end_dates"})
#initialize variables
start_quarter = 0
start_year = 0
fiscal_dates = []
s1 = pd.date_range(start='1/1/1964', end='12/31/2018', freq='B')
df2 = pd.DataFrame(data=np.ndarray(shape=(len(s1),500), dtype=str), index=s1)
#create list of fiscal quarters
def fiscal_quarters(start_year):
    year_count = start_year - 1
    quarter_count = 1
    for n in range(2019 - start_year):
        year_count += 1
        for i in range(1,5):
            fiscal_dates.append(str(quarter_count) + 'Q'+ str(year_count)[-2:])
            quarter_count += 1
        quarter_count = 1
#iterate over list of tickers to create self-named spreadsheets
def populate_worksheets():
for n in range(len(fiscal_dates)):
#populate df2 with appropriate tickers
def populate_tickers():
    count = 0
    for n in range(len(s1)):
        for i in range(len(df['tickers'])):
            if df.loc[str(i), 'start_dates'] <= s1[n] and df.loc[str(i), 'end_dates'] > s1[n]:
                count += 1
                df2.loc[str(s1[n]), str(count)] = df.loc[str(i), 'tickers']
        count = 0
#run populate_tickers function with status updates
def pt_thread():
    t = threading.Thread(target=populate_tickers)
    c = 0
    t.start()
    while (t.is_alive()):
        time.sleep(5)
        c += 5
        print("Working... " + str(c) + 's')
First, I run fiscal_quarters(1964) in the Python Shell, then pt_thread(), which appears to be particularly resource intensive. It's been running for over half an hour at this point on my (admittedly slow) laptop. However, without waiting for it to finish running, is there any way to see whether it's working as intended? Or at all? It's still printing "Working..." to the shell, which I suppose is a good sign, but I'd like to start troubleshooting if something's wrong rather than waiting an indefinite amount of time before giving up on it.
For reference, the s1 Series contains ~17,500 items, while the df['tickers'] column contains ~2,000 items, so there should be somewhere in the neighborhood of 35,000,000 iterations with 4 operations each. Is this a lot, or should a modern PC be able to work through this rather quickly and my program is probably just not working?
If you are running loops that are taking a long time and want to see what is going on you could use tqdm. It gives iterations per second and estimated time remaining. Here is a quick example:
from tqdm import tqdm
def sim(sims):
    x = 0
    pb = tqdm(total=sims, initial=x)
    while x < sims:
        x += 1
        pb.update(1)  # advance the bar by one iteration
    pb.close()
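tqdm wraps the loop itself. If the work runs in a background thread, as in the question, a similar effect can be had with the standard library alone by letting the worker publish how many outer iterations it has finished. This is a rough sketch: `run_with_progress`, `work`, and the `progress` dict are names invented here, and the worker below is only a stand-in for populate_tickers.

```python
import threading
import time

def run_with_progress(work, total, interval=5.0):
    """Run work(progress) in a thread and report progress['done'] periodically."""
    progress = {"done": 0}
    t = threading.Thread(target=work, args=(progress,))
    start = time.time()
    t.start()
    while t.is_alive():
        t.join(timeout=interval)   # returns early if the worker finishes
        done = progress["done"]
        elapsed = time.time() - start
        if 0 < done < total:
            # extrapolate from the average time per finished iteration
            remaining = elapsed / done * (total - done)
            print(f"{done}/{total} done, about {remaining:.0f}s left")
    return progress["done"]

def work(progress, total=200):
    # stand-in for the real loop: bump the counter once per outer iteration
    for _ in range(total):
        time.sleep(0.001)
        progress["done"] += 1

finished = run_with_progress(work, total=200, interval=0.05)
print("finished", finished, "iterations")
```

A counter that never moves between two reports is an early sign that the loop is stuck rather than merely slow.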
UPDATE: My question has been fully answered, and I have applied jarmod's answer to my program. Although the code looks neater, it has not affected the speed at which my graph appears (I plot this data using matplotlib). I am a little confused about why my program runs slowly and how I can increase the speed: it takes about 30 seconds, and I know this portion of the code is what slows it down. I have shown my real code in the second block of code. The speed also depends strongly on the Range I set; with a short range it is quite fast.
I have this sample code that shows the calculation I need for forecasting and extracting values. I use for loops to run through a specific range of CSV files that I labeled 1-100. I return numbers for each month (1-12) to get the forecasting average for a forecast over a given number of months.
My full code includes 12 functions for a full-year forecast, but I feel the code is inefficient because the functions are nearly identical except for one number, and reading the CSV files so many times slows the program.
Is there a way I can combine these functions, perhaps by adding another parameter? My biggest concern is that it would be hard to return separate numbers and categorize them. Ideally, I would like to have only one function for all 12 monthly accuracy predictions. The only way I can see to do that is to add another parameter and another loop, but I have no idea how to go about that or whether it is possible. Essentially, I would like to store all the values of onemonthaccuracy (which goes into the file before the current file and compares its predicted value for the date associated with the current file), then all the values of twomonthaccuracy, and so on, so that I can later use these variables for graphing and other purposes.
import csv
import pandas as pd
import matplotlib.pyplot as plt
def onemonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    onemonthread = pd.read_csv(str(basefilenumber-1)+'.csv', encoding='latin-1')
    onemonthvalue = onemonthread.loc[onemonthread['Customer'].str.contains('Customer A', na=False),'Jun-16\nQty']
    onetotal = int(onemonthvalue)/int(basefilevalue)
    return onetotal
def twomonthaccuracy(basefilenumber):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twomonthread = pd.read_csv(str(basefilenumber-2)+'.csv', encoding = 'Latin-1')
    twomonthvalue = twomonthread.loc[twomonthread['Customer'].str.contains('Customer A', na=False), 'Jun-16\nQty']
    twototal = int(twomonthvalue)/int(basefilevalue)
    return twototal
onetotal = 0
twototal = 0
onetotallist = []
twototallist = []
for basefilenumber in range(24,36):
    onetotal += onemonthaccuracy(basefilenumber)
    twototal +=twomonthaccuracy(basefilenumber)
onetotalpermonth = onetotal/12
twototalpermonth = twototal/12
x = [1,2]
y = [onetotalpermonth, twototalpermonth]
z = [1,2]
w = [(onetotallist),(twototallist)]
for ze, we in zip(z, w):
    plt.scatter([ze] * len(we), we, marker='D', s=5)
This is the real block of code I am using in my program, perhaps something is slowing it down that I am unaware of?
#other parts of code
#StartRange = yearvalue+Value
#EndRange = endValue + endyearvalue
#Range = EndRange - StartRange
# Department
#more code....
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    baseheader = getfileheader(basefilenumber)
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains(Department, na=False), baseheader]
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains(Department, na=False), baseheader]
    return (1-(int(basefilevalue)/int(nmonthvalue))+1) if int(nmonthvalue) > int(basefilevalue) else int(nmonthvalue)/int(basefilevalue)
N = 13
total = [0] * N
total_by_month_list = [[] for _ in range(N)]
for basefilenumber in range(int(StartRange),int(EndRange)):
    for n in range(N):
        total[n] += nmonthaccuracy(basefilenumber, n)
onetotal=total[1]/ Range
twototal=total[2]/ Range
threetotal=total[3]/ Range
fourtotal=total[4]/ Range
fivetotal=total[5]/ Range #... all the way to 12
fivetotallist=total_by_month_list[5] #... all the way to 12
# a lot more code after this
Something like this:
def nmonthaccuracy(basefilenumber, n):
    basefileread = pd.read_csv(str(basefilenumber)+'.csv', encoding='Latin-1')
    basefilevalue = basefileread.loc[basefileread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    nmonthread = pd.read_csv(str(basefilenumber-n)+'.csv', encoding = 'Latin-1')
    nmonthvalue = nmonthread.loc[nmonthread['Customer'].str.contains('Lam DepT', na=False), 'Jun-16\nQty']
    return int(nmonthvalue)/int(basefilevalue)
N = 2
total_by_month = [0] * N
total_aggregate = 0
for basefilenumber in range(20,30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        total_aggregate += a
In case you are wondering what the following code does:
N = 2
total_by_month = [0] * N
It sets N to the number of months desired (2, but you could make it 12 or another value) and it then creates a total_by_month array that can store N results, one per month. It then initializes total_by_month to all zeroes (N zeroes) so that each of the N monthly totals starts at zero.
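On the speed side, note that the nested loop asks for the same file many times: with N months, each CSV is read up to 2*N times. One possible remedy is to cache the reader so that each file is loaded once. The sketch below only illustrates the caching pattern. `read_file`, `read_count`, and `values_by_month` are names invented here, and the reader returns a dummy dict where the real code would call pd.read_csv, so that it runs without any CSV files.

```python
from functools import lru_cache

read_count = 0  # counts how many times a "file" is really read

@lru_cache(maxsize=None)
def read_file(filenumber):
    # Stand-in for pd.read_csv(str(filenumber) + '.csv', encoding='Latin-1');
    # each "file" is just a dict holding one made-up value.
    global read_count
    read_count += 1
    return {"value": 100 + filenumber}

def nmonthaccuracy(basefilenumber, n):
    base = read_file(basefilenumber)["value"]
    earlier = read_file(basefilenumber - n)["value"]
    return earlier / base

N = 3
total_by_month = [0.0] * N
values_by_month = [[] for _ in range(N)]   # individual values, e.g. for plotting
for basefilenumber in range(20, 30):
    for n in range(N):
        a = nmonthaccuracy(basefilenumber, n)
        total_by_month[n] += a
        values_by_month[n].append(a)

print(read_count, "reads instead of", 2 * N * 10)
```

Appending each value to a per-month list also keeps the individual results available for the scatter plot, not just the running totals.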
This is a follow up question related to this question.
Thanks to previous help I have successfully imported a netCDF file (or files with MFDataset) and am able to compare the different times to one another to create another cumulative dataset. Here is a piece of the current code.
from numpy import *
import netCDF4
import os
f = netCDF4.MFDataset('air.2m.1979.nc')
atemp = f.variables['air']
ntimes, ny, nx = atemp.shape
cold_days = zeros((ntimes, ny, nx), dtype=int)
for i in range(ntimes):
    for b in range(ny):
        for c in range(nx):
            if i == 1:
                if atemp[i,b,c] < 0:
                    cold_days[i,b,c] = 1
                else:
                    cold_days[i,b,c] = 0
            else:
                if atemp[i,b,c] < 0:
                    cold_days[i,b,c] = cold_days[i-1,b,c] + 1
                else:
                    cold_days[i,b,c] = 0
This seems like a brute force way to get the job done, and though it works it takes a very long time. I'm not sure if it takes such a long time because I'm dealing with 365 349x277 matrices (35,285,645 pixels) or if my old school brute force way is simply slow in comparison to some built in python methods.
Below is an example of what I believe the code is doing. It looks at Time and increments cold days if temp < 0. If temp >= 0, then cold days resets to 0. In the below image you will see that the cell at row 2, column 1 increments each Time that passes, but the cell at row 2, column 2 increments at Time 1 and resets to zero at Time 2.
Is there a more efficient way to rip through this netCDF dataset to perform this type of operation?
Seems like this is a minor modification -- just write the data out at each time step. Something close to this should work:
from pylab import *
import netCDF4
# open NetCDF input files
f = netCDF4.MFDataset('/usgs/data2/rsignell/models/ncep/narr/air.2m.19??.nc')
# print variables
atemp = f.variables['air']
print(atemp)
ntimes, ny, nx = shape(atemp)
cold_days = zeros((ny,nx),dtype=int)
# create output NetCDF file
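nco = netCDF4.Dataset('/usgs/data2/notebook/cold_days.nc','w',clobber=True)
# define the dimensions used by the variables below (time is unlimited)
nco.createDimension('x',nx)
nco.createDimension('y',ny)
nco.createDimension('time',None)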
cold_days_v = nco.createVariable('cold_days', 'i4', ( 'time', 'y', 'x'))
cold_days_v.long_name='total number of days below 0 degC'
cold_days_v.grid_mapping = 'Lambert_Conformal'
timeo = nco.createVariable('time','f8',('time'))
lono = nco.createVariable('lon','f4',('y','x'))
lato = nco.createVariable('lat','f4',('y','x'))
xo = nco.createVariable('x','f4',('x'))
yo = nco.createVariable('y','f4',('y'))
lco = nco.createVariable('Lambert_Conformal','i4')
# copy all the variable attributes from original file
for var in ['time','lon','lat','x','y','Lambert_Conformal']:
    for att in f.variables[var].ncattrs():
        setattr(nco.variables[var], att, getattr(f.variables[var], att))
# copy variable data for time, lon,lat,x and y
timeo[:] = f.variables['time'][:]
lono[:] = f.variables['lon'][:]
lato[:] = f.variables['lat'][:]
xo[:] = f.variables['x'][:]
yo[:] = f.variables['y'][:]
for i in range(ntimes):
    cold_days += atemp[i,:,:].data-273.15 < 0
    # write the cold_days data
    cold_days_v[i,:,:] = cold_days
# copy Global attributes from original file
for att in f.ncattrs():
    setattr(nco, att, getattr(f, att))
nco.close()
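Note that the loop above accumulates the total number of cold days. For the consecutive-day count described in the question, where the counter resets whenever the temperature is not below the threshold, the same one-slice-per-time-step pattern can be combined with np.where. The following is a small sketch on made-up data. `cold_streaks` and `threshold` are names invented here, and the temperatures are assumed to be in the same units as the threshold.

```python
import numpy as np

def cold_streaks(temps, threshold=0.0):
    """For each time step, count consecutive steps so far with temps < threshold."""
    ntimes = temps.shape[0]
    streaks = np.zeros(temps.shape, dtype=int)
    run = np.zeros(temps.shape[1:], dtype=int)
    for i in range(ntimes):
        # whole-grid update: extend the run where it is cold, reset elsewhere
        run = np.where(temps[i] < threshold, run + 1, 0)
        streaks[i] = run
    return streaks

# three time steps on a 2x2 grid
temps = np.array([
    [[-1.0,  1.0], [-2.0, -3.0]],
    [[-1.0, -1.0], [-2.0,  2.0]],
    [[ 1.0, -1.0], [-2.0, -5.0]],
])
print(cold_streaks(temps)[-1])
```

Only the loop over time remains in Python; the work over the y and x axes happens inside numpy, which is where the triple loop in the question spends its time.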