Hi have created a OLS regression using Statsmodels
I've written some code that loops through every variable in a dataframe and enters it into the model and then records the T Stat in a new dataframe and builds a list of potential variables.
However I have 20,000 variables so it takes ages to run each time.
Can anyone think of a better approach?
This is my current approach
TStatsOut=pd.DataFrame()
for i in VarsOut:
try:
xstrout='+'.join([baseterms,i])
fout='ymod~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
s=pd.concat([j, k], axis=1, join_axes=[j.index])
TStatsOut=TStatsOut.append(s)
Here is what I have found in regards to your question. My answer uses the approach of using dask for distributed computing, and also just general clean up of you current approach.
I made a smaller fake dataset with 1000 variables, one will be the outcome, and two will be the baseterms, so there is really 997 variables to loop through.
import dask
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
#make some toy data for the case you showed
df_train = pd.DataFrame(np.random.randint(low=0,high=10,size=(10000, 1000)))
df_train.columns = ['var'+str(x) for x in df_train.columns]
baseterms = 'var1+var2'
VarsOut = df_train.columns[3:]
Baseline for your current Code (20s +- 858ms):
%%timeit
TStatsOut=pd.DataFrame()
for i in VarsOut:
xstrout='+'.join([baseterms,i])
fout='var0~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
s=pd.concat([j, k], axis=1)
s=s.reindex(j.index)
TStatsOut=TStatsOut.append(s)
Created a function for readability, but returns just the pval, and regression coefficient for each variable tested instead of the one line dataframes.
def testVar(i):
xstrout='+'.join([baseterms,i])
fout='var0~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
pval=modout.pvalues[i]
coef=modout.params[i]
return pval, coef
Now runs at (14.1s +- 982ms)
%%timeit
pvals=[]
coefs=[]
for i in VarsOut:
pval, coef = testVar(i)
pvals.append(pval)
coefs.append(coef)
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
index=VarsOut)[['PValue','Coeff']]
Using Dask delayed for parallel processing. Keep in mind each delayed task that is created cause a slight overhead as well, so sometimes it it may not be beneficial, but will depend on your exact dataset and how long the regressions are taking. My data example may be too simple to show any benefit.
#define the same function as before, but tell dask how many outputs it has
#dask.delayed(nout=2)
def testVar(i):
xstrout='+'.join([baseterms,i])
fout='var0~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
pval=modout.pvalues[i]
coef=modout.params[i]
return pval, coef
Now run through the 997 candidate variables and create the same dataframe with dask delayed. (18.6s +- 588ms)
%%timeit
pvals=[]
coefs=[]
for i in VarsOut:
pval, coef = dask.delayed(testVar)(i)
pvals.append(pval)
coefs.append(coef)
pvals, coefs = dask.compute(pvals,coefs)
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
index=VarsOut)[['PValue','Coeff']]
Again, dask delayed creates more overhead as it creates the tasks to be sent across many processors, so any performance gain will depend on the time your data actually takes in the regressions as well as how many CPUs you have availible. Dask can be scaled from a single workstation to a cluster of workstations.
Related
I need to calculate the plane of array (POA) irradiance using python's pvlib package (https://pvlib-python.readthedocs.io/en/stable/). For this I would like to use the output data from the WRF model (GHI, DNI, DHI). The output data is in netCDF format, which I open using the netCDF4 package and then I extract the necessary variables using the wrf-python package.
With that I get a xarray.Dataset with the variables I will use. I then use the xarray.Dataset.to_dataframe() method to transform it into a pandas dataframe, and then I transform the dataframe into a numpy array using the dataframe.values. And then I do a loop where in each iteration I calculate the POA using the function irradiance.get_total_irradiance (https://pvlib-python.readthedocs.io/en/stable/auto_examples/plot_ghi_transposition.html) for a grid point.
That's the way I've been doing it so far, however I have over 160000 grid points in the WRF domain, the data is hourly and spans 365 days. This gives a very large amount of data. I believe if pvlib could work directly with xarray.dataset it could be faster. However, I could only do it this way, transforming the data into a numpy.array and looping through the rows. Could anyone tell me how I can optimize this calculation? Because the code I developed is very time-consuming.
If anyone can help me with this I would appreciate it. Maybe an improvement to the code, or another way to calculate the POA from the WRF data...
I'm providing the code I've built so far:
from pvlib import location
from pvlib import irradiance
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
import netCDF4
import wrf
Getting WRF data
variaveis = ['T2',
'U10',
'V10',
'SWDDNI',
'SWDDIF',
'SWDOWN']
netcdf_data = netCDF4.Dataset('wrfout_d02_2003-11-01_00_00_00')
first = True
for v in variaveis:
var = wrf.getvar(netcdf_data, v, timeidx=wrf.ALL_TIMES)
if first:
met_data = var
first = False
else:
met_data = xr.merge([met_data, var])
met_data = xr.Dataset.reset_coords(met_data, ['XTIME'], drop=True)
met_data['T2'] = met_data['T2'] - 273.15
WS10 = (met_data['U10']**2 + met_data['V10']**2)**0.5
met_data['WS10'] = WS10
df = met_data[['SWDDIF',
'SWDDNI',
'SWDOWN',
'T2',
'WS10']].to_dataframe().reset_index().drop(columns=['south_north',
'west_east'])
df.rename(columns={'SWDOWN': 'ghi',
'SWDDNI':'dni',
'SWDDIF':'dhi',
'T2':'temp_air',
'WS10':'wind_speed',
'XLAT': 'lat',
'XLONG': 'lon',
'Time': 'time'}, inplace=True)
df.set_index(['time'], inplace=True)
df = df[df.ghi>0]
df.index = df.index.tz_localize('America/Recife')
Function to get POA irradiance
def get_POA_irradiance(lon, lat, date, dni, dhi, ghi, tilt=10, surface_azimuth=0):
site_location = location.Location(lat, lon, tz='America/Recife')
# Get solar azimuth and zenith to pass to the transposition function
solar_position = site_location.get_solarposition(times=date)
# Use the get_total_irradiance function to transpose the GHI to POA
POA_irradiance = irradiance.get_total_irradiance(
surface_tilt = tilt,
surface_azimuth = surface_azimuth,
dni = dni,
ghi = ghi,
dhi = dhi,
solar_zenith = solar_position['apparent_zenith'],
solar_azimuth = solar_position['azimuth'])
# Return DataFrame with only GHI and POA
return pd.DataFrame({'lon': lon,
'lat': lat,
'GHI': ghi,
'POA': POA_irradiance['poa_global']}, index=[date])
Loop in each row (time) of the array
array = df.reset_index().values
list_poa = []
def loop_POA():
for i in tqdm(range(len(array) - 1)):
POA = get_POA_irradiance(lon=array[i,6],
lat=array[i,7],
dni=array[i,2],
dhi=array[i,1],
ghi=array[i,3],
date=str(array[i,0]))
list_poa.append(POA)
return list_poa
poa_final = pd.concat(lista)
Thanks both for a good question and for using pvlib! You're right that pvlib is intended for modeling single locations and is not designed for use with xarray datasets, although some functions might coincidentally work with them.
I strongly suspect that the majority of the runtime you're seeing is for the solar position calculations. You could switch to a faster method (see the method options here), as the default solar position method is very accurate but also quite slow when calculating bulk positions. Installing numba will help, but it still might be too slow for you, so you might check the other models (ephemeris, pyephem). There are also some fast but low-precision methods, but you will need to change your code a bit to use them. See the list under "Correlations and analytical expressions for low precision solar position calculations" here.
Like Michael Delgado suggests in the comments, parallel processing is an option. But that can be a headache in python. You will probably want multiprocessing, not multithreading.
Another idea is to use atlite, a python package designed for this kind of spatial modeling. But its solar modeling capabilities are not nearly as detailed as pvlib, so it might not be useful for your case.
One other note: I don't know if the WRF data are interval averages or instantaneous values, but if you care about accuracy you should handle them differently for transposition. See this example.
Edit to add: after looking at your code again, there might be another significant speedup to be had. Are you calling get_POA_irradiance for single combinations of position and timestamp? If so, that is unnecessary and very slow. It would be much faster to pass in the full time series for each location, i.e. scalar lat/lon but vector irradiance.
I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
BACKGROUND:
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
#delayed
def gen_matrix():
return np.random.rand(1000, 1000)
#delayed
def calc_sum(matrices):
return reduce(lambda a, b: a + b, matrices)
if __name__ == '__main__':
num_matrices = 16
# Plop them into a big list
matrices = [gen_matrix() for _ in range(num_matrices)]
# Here's the Big Sum
matrices = calc_sum(matrices)
# Go!
with dd.Client('localhost:8786') as client:
f = client.submit(compute, matrices)
result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
They time out sending data to the main worker performing the sum
The main worker runs out of memory gathering all of the matrices
The overall sum is not running in parallel (only matrix ganeration is)
Note that none of these issues are Dask's fault, it's working as advertised. I've just set up the computation poorly.
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
#delayed
def gen_matrix():
return np.random.rand(1000, 1000)
#delayed
def calc_sum(a, b):
return a + b
if __name__ == '__main__':
num_matrices = 16
# Plop them into a big list
matrices = [gen_matrix() for _ in range(num_matrices)]
# This tells us the depth of the calculation portion
# of the tree we are constructing in the next step
depth = int(math.log(num_matrices, 2))
# This is the code I don't want to have to manually write
for _ in range(depth):
matrices = [
calc_sum(matrices[i], matrices[i+1])
for i in range(0, len(matrices), 2)
]
# Go!
with dd.Client('localhost:8786') as client:
f = client.submit(compute, matrices)
result = client.gather(f)
And the graph:
QUESTION:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?
Dask bag has a reduction/aggregation method that will generate tree-like DAG: fold.
The workflow would be to 'bag' the delayed objects and then fold them.
I have a custom workflow, that requires using resample to get to a higher temporal frequency, applying a ufunc, and groupby + mean to compute the final result.
I would like to apply this to a big xarray dataset, which is backed by a chunked dask array. For computation, I'd like to use dask.distributed.
However, when I apply this to the full dataset, the number of tasks skyrockets, overwhelming the client and most likely also the scheduler and workers if submitted.
The xarray docs explain:
Do your spatial and temporal indexing (e.g. .sel() or .isel()) early
in the pipeline, especially before calling resample() or groupby().
Grouping and rasampling triggers some computation on all the blocks,
which in theory should commute with indexing, but this optimization
hasn’t been implemented in dask yet.
But I really need to apply this to the full temporal axis.
So how to best implement this?
My approach was to use map_blocks, to apply this function for each chunk individually as to keep the individual xarray sub-datasets small enough.
This seems to work on a small scale, but when I use the full dataset, the workers run out of memory and quickly die.
Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
So my questions are:
Is this approach valid?
How could I implement this workflow otherwise, besides manually implementing the resample and groupby part and putting it in a ufunc?
Any ideas regarding the performance issues at scale (specifically the number of executions vs chunks)?
Here's a small example that mimics the workflow and shows the number of executions vs chunks:
from time import sleep
import dask
from dask.distributed import Client, LocalCluster
import numpy as np
import pandas as pd
import xarray as xr
def ufunc(x):
# computation
sleep(2)
return x
def fun(x):
# upsample to higher res
x = x.resample(time="1h").asfreq().fillna(0)
# apply function
x = xr.apply_ufunc(ufunc, x, input_core_dims=[["time"]], output_core_dims=[['time']], dask="parallelized")
# average over dates
x['time'] = x.time.dt.strftime("%Y-%m-%d")
x = x.groupby("time").mean()
return x
def create_xrds(shape):
''' helper function to create dataset'''
x,y,t = shape
tv = pd.date_range(start="1970-01-01", periods=t)
ds = xr.Dataset({
"band": xr.DataArray(
dask.array.zeros(shape, dtype="int16"),
dims=['x', 'y', 'time'],
coords={"x": np.arange(0, x), "y": np.arange(0, y), "time": tv})
})
return ds
# set up distributed
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
ds = create_xrds((500,500,500)).chunk({"x": 100, "y": 100, "time": -1})
# create template
template = ds.copy()
template['time'] = template.time.dt.strftime("%Y-%m-%d")
# map fun to blocks
ds_out = xr.map_blocks(fun, ds, template=template)
# persist
ds_out.persist()
Using the example above, this is how the dask array (25 chunks) looks like:
But the function fun gets executed 125 times:
Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
This is misleading because of an unfortunate choice made when making the graph. The number includes tasks that make a block of the input Dataset (one per variable per chunk) & for the output Dataset as well as tasks that apply the function. This will get fixed soon (https://github.com/pydata/xarray/pull/5007)
I am trying to solve a multiobjective optimization problem with 3 objectives and 2 decision variables using NSGA 2. The pymoo code for NSGA2 algorithm and termination criteria is given below. My pop_size is 100 and n_offspring is 100. The algorithm is iterated over 100 generations. I want to store all 100 values of decision variables considered in each generation for all 100 generations in a dataframe.
NSGA2 implementation in pymoo code:
from pymoo.algorithms.nsga2 import NSGA2
from pymoo.factory import get_sampling, get_crossover, get_mutation
algorithm = NSGA2(
pop_size=20,
n_offsprings=10,
sampling=get_sampling("real_random"),
crossover=get_crossover("real_sbx", prob=0.9, eta=15),
mutation=get_mutation("real_pm", prob=0.01,eta=20),
eliminate_duplicates=True
)
from pymoo.factory import get_termination
termination = get_termination("n_gen", 100)
from pymoo.optimize import minimize
res = minimize(MyProblem(),
algorithm,
termination,
seed=1,
save_history=True,
verbose=True)
What I have tried (My reference: stackoverflow question):
import pandas as pd
df2 = pd.DataFrame (algorithm.pop)
df2.head(10)
The result from above code is blank and on passing
print(df2)
I get
Empty DataFrame
Columns: []
Index: []
Glad you intend to use pymoo for your research. You have correctly enabled the save_history option, which means you can access the algorithm objects.
To have all solutions from the run, you can combine the offsprings (algorithm.off) from each generation. Don't forget the Population objects contain Individual objectives. With the get method you can get the X and F or other values. See the code below.
import pandas as pd
from pymoo.algorithms.nsga2 import NSGA2 from pymoo.factory import get_sampling, get_crossover, get_mutation, ZDT1 from pymoo.factory import get_termination from pymoo.model.population import Population from pymoo.optimize import minimize
problem = ZDT1()
algorithm = NSGA2(
pop_size=20,
n_offsprings=10,
sampling=get_sampling("real_random"),
crossover=get_crossover("real_sbx", prob=0.9, eta=15),
mutation=get_mutation("real_pm", prob=0.01,eta=20),
eliminate_duplicates=True )
termination = get_termination("n_gen", 10)
res = minimize(problem,
algorithm,
termination,
seed=1,
save_history=True,
verbose=True)
all_pop = Population()
for algorithm in res.history:
all_pop = Population.merge(all_pop, algorithm.off)
df = pd.DataFrame(all_pop.get("X"), columns=[f"X{i+1}" for i in range(problem.n_var)])
print(df)
Another way would be to use a callback and fill the data frame each generation. Similar as shown here: https://pymoo.org/interface/callback.html
I have a task to fit about 10000 1D profiles obtained from an electron beam to Gaussian.
The raw data is basically very noisy. I had to denoise before fitting. I was advised to use KernelReg to do this.
For each profile, I firstly call KernelReg and then lmfit to extract the center, sigma, amp and offset of raw data in a for loop.
When I tested with 100 profiles:
If I use only lmfit the runtime is 2.4 seconds (cprofiler).
If I combine KernelReg and lmfit the runtime is 272 seconds.
The cprofiler displays a bottleneck in the KernelReg call.
An example profile:
data_array=np.array([-0.000159229
-0.000213496
-1.37e-05
-0.00021545
1.24e-05
0.000181446
-0.000133793
-7.84e-05
-0.000266477
-0.000206505
-0.000376277
-9.38e-05
0.000174166
-0.000365068
-0.00291559
-6.37e-05
-0.000314041
-0.000426127
-0.000322608
-0.000293555
-0.000306628
-0.000379695
-6.12e-05
-0.000336458
-0.000296795
-6.57e-06
-0.000121408
-0.000327136
-0.000215139
-0.000221265
-0.000112685
-0.000244148
-0.000318746
-0.00039916
-0.00454921
-0.00026823
-0.000153014
-0.000423619
-0.000348621
-0.000311244
-0.000318724
-0.000145046
-0.000164001
-0.000224927
-0.000568133
-0.000106227
-0.00022688
-0.000417715
-0.000382891
2.87e-05
-0.00267422
-0.000207038
-0.000239531
-0.000174655
-0.000145335
-0.000202266
-0.000455647
-0.000348444
-0.000346801
-5.86e-05
-8.12e-05
-0.00016733
-0.000241884
-0.000227368
-0.000229987
-0.000121697
-0.00030503
-0.000244148
7.8e-05
-0.000253847
0.000289293
-0.000123672
0.0175145
-0.000436537
-0.000320966
-0.000177473
-0.000148553
-9.91e-05
-0.000197605
-0.000155855
-0.000259152
-0.000221265
-0.00014023
-8.7e-05
-0.000532443
-7.29e-05
-0.0001464
-0.000401024
-0.000176963
-0.000318946
5.83e-05
-0.000281947
-0.000120476
-0.000313708
-0.000114594
-0.000242483
-0.000162958
-0.000144203
-0.000445903
-1.44e-05
-0.000186307
-0.000197738
8.11e-05
-0.000203264
-0.000407749
0.00026843
-0.000268629
-0.000228789
3e-06
-0.000199136
-0.000201023
1.69e-05
-0.000168995
7.79e-05
6.05e-05
-0.000280038
-0.000305252
-0.000308625
-5.47e-05
-0.00034032
-0.000169572
-0.0001193
-0.000234626
5.73e-05
-0.000235869
8.41e-06
-0.000331353
0.000407483
0.000226658
-4.63e-05
-3.39e-05
-0.000163224
-4.31e-05
-0.000191434
9.93e-05
0.000193032
-9.16e-05
-0.000144513
-0.00010616
-4.39e-05
-5.87e-05
9.19e-06
0.000276642
4.48e-05
-7.63e-05
0.000100678
1.18e-05
0.000209568
0.000472049
0.0291889
0.000433762
0.000433607
0.000574392
0.000997101
0.00142112
0.00391844
0.00768485
0.0138782
0.0216787
0.0281597
0.028893
0.0259644
0.0190644
0.01219
0.00574567
-0.00154854
0.00143113
0.000845396
0.00076789
0.000276753
0.000351285
0.000284233
0.00057053
0.000433873
-0.000197183
1.29e-05
0.000118878
0.000203819
0.000132328
1.84e-05
-4.34e-05
-7.95e-05
-0.000400492
6e-05
-9.98e-05
0.00441493
5.23e-05
-7.08e-05
5.7e-05
-0.000148531
-0.000139475
-3.74e-05
-0.000149086
-0.000234826
-3.42e-05
5.27e-05
-0.000171436
-0.00021778
-0.000175076
-0.000198071
])
position=position = np.linspace(1, 200, 200)
My code:
This code takes about 1~2 seconds to run a call. It is the bottleneck at run time.
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg
data_array_kr = KernelReg(data_array, position,'c')
data_array_pred, data_array_std = data_array_kr.fit(position)
data_array_pred[data_array_pred<0] = 0
Questions:
How to improve the performance of my KernelReg call?
Is this a good choice for denoising and fit using KernelReg+lmfit?
Answering your second question:
You might try comparing a smoothing algorithm such as Savitzky-Golay (if it can be applied to your data -- it requires samples on a uniform interval). It definitely takes some time, but might be faster than KernelReg. You might also try to assess how much smoothing you actually need to get stable fit results.