KernelReg performance in a for loop

KernelReg performance in a for loop - python

I have a task to fit about 10000 1D profiles obtained from an electron beam to Gaussian.
The raw data is basically very noisy. I had to denoise before fitting. I was advised to use KernelReg to do this.
For each profile, I firstly call KernelReg and then lmfit to extract the center, sigma, amp and offset of raw data in a for loop.
When I tested with 100 profiles:
If I use only lmfit the runtime is 2.4 seconds (cprofiler).
If I combine KernelReg and lmfit the runtime is 272 seconds.
The cprofiler displays a bottleneck in the KernelReg call.
An example profile:
data_array=np.array([-0.000159229
-0.000213496
-1.37e-05
-0.00021545
1.24e-05
0.000181446
-0.000133793
-7.84e-05
-0.000266477
-0.000206505
-0.000376277
-9.38e-05
0.000174166
-0.000365068
-0.00291559
-6.37e-05
-0.000314041
-0.000426127
-0.000322608
-0.000293555
-0.000306628
-0.000379695
-6.12e-05
-0.000336458
-0.000296795
-6.57e-06
-0.000121408
-0.000327136
-0.000215139
-0.000221265
-0.000112685
-0.000244148
-0.000318746
-0.00039916
-0.00454921
-0.00026823
-0.000153014
-0.000423619
-0.000348621
-0.000311244
-0.000318724
-0.000145046
-0.000164001
-0.000224927
-0.000568133
-0.000106227
-0.00022688
-0.000417715
-0.000382891
2.87e-05
-0.00267422
-0.000207038
-0.000239531
-0.000174655
-0.000145335
-0.000202266
-0.000455647
-0.000348444
-0.000346801
-5.86e-05
-8.12e-05
-0.00016733
-0.000241884
-0.000227368
-0.000229987
-0.000121697
-0.00030503
-0.000244148
7.8e-05
-0.000253847
0.000289293
-0.000123672
0.0175145
-0.000436537
-0.000320966
-0.000177473
-0.000148553
-9.91e-05
-0.000197605
-0.000155855
-0.000259152
-0.000221265
-0.00014023
-8.7e-05
-0.000532443
-7.29e-05
-0.0001464
-0.000401024
-0.000176963
-0.000318946
5.83e-05
-0.000281947
-0.000120476
-0.000313708
-0.000114594
-0.000242483
-0.000162958
-0.000144203
-0.000445903
-1.44e-05
-0.000186307
-0.000197738
8.11e-05
-0.000203264
-0.000407749
0.00026843
-0.000268629
-0.000228789
3e-06
-0.000199136
-0.000201023
1.69e-05
-0.000168995
7.79e-05
6.05e-05
-0.000280038
-0.000305252
-0.000308625
-5.47e-05
-0.00034032
-0.000169572
-0.0001193
-0.000234626
5.73e-05
-0.000235869
8.41e-06
-0.000331353
0.000407483
0.000226658
-4.63e-05
-3.39e-05
-0.000163224
-4.31e-05
-0.000191434
9.93e-05
0.000193032
-9.16e-05
-0.000144513
-0.00010616
-4.39e-05
-5.87e-05
9.19e-06
0.000276642
4.48e-05
-7.63e-05
0.000100678
1.18e-05
0.000209568
0.000472049
0.0291889
0.000433762
0.000433607
0.000574392
0.000997101
0.00142112
0.00391844
0.00768485
0.0138782
0.0216787
0.0281597
0.028893
0.0259644
0.0190644
0.01219
0.00574567
-0.00154854
0.00143113
0.000845396
0.00076789
0.000276753
0.000351285
0.000284233
0.00057053
0.000433873
-0.000197183
1.29e-05
0.000118878
0.000203819
0.000132328
1.84e-05
-4.34e-05
-7.95e-05
-0.000400492
6e-05
-9.98e-05
0.00441493
5.23e-05
-7.08e-05
5.7e-05
-0.000148531
-0.000139475
-3.74e-05
-0.000149086
-0.000234826
-3.42e-05
5.27e-05
-0.000171436
-0.00021778
-0.000175076
-0.000198071
])
position=position = np.linspace(1, 200, 200)
My code:
This code takes about 1~2 seconds to run a call. It is the bottleneck at run time.
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg
data_array_kr = KernelReg(data_array, position,'c')
data_array_pred, data_array_std = data_array_kr.fit(position)
data_array_pred[data_array_pred<0] = 0
Questions:
How to improve the performance of my KernelReg call?
Is this a good choice for denoising and fit using KernelReg+lmfit?

Answering your second question:
You might try comparing a smoothing algorithm such as Savitzky-Golay (if it can be applied to your data -- it requires samples on a uniform interval). It definitely takes some time, but might be faster than KernelReg. You might also try to assess how much smoothing you actually need to get stable fit results.

Related

Fitting two voigt curves, one after the other using lmfit

I have the following emission spectra of Neon collected on a Raman (background subtracted data):
x=np.array([[1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981]])
y=np.array([[-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01]])
I have fitted a single Voigt function using lmfit, specifically:
model = VoigtModel()+ ConstantModel()
params=model.make_params(center=1123.096389, amplitude=1000, sigma=0.27)
result = model.fit(y.flatten(), params, x=x.flatten())
There is a second peak on the LH shoulder (sorry can't post image)- people using commercial peak fitting software fit the first voigt, then add the second, and then it adjusts the fits of both. How would I do this in python?
A related question - is there a way to optimize how many points to include in the peak fit. Right now, I am only feeding x and y data covering a set spectral range to do the peak fitting. But commercial software optimizes how much range to include in a given peak fit (I presume using residuals). How would I recreate this?
Thanks!

You can do it manually as so:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import VoigtModel, ConstantModel
x=np.array([1114.120887, 1114.682293, 1115.243641, 1115.80493 , 1116.366161, 1116.927334, 1117.488449, 1118.049505, 1118.610503, 1119.171443, 1119.732324, 1120.293147, 1120.853912, 1121.414619, 1121.975267, 1122.535857, 1123.096389, 1123.656863, 1124.217278, 1124.777635, 1125.337934, 1125.898175, 1126.458357, 1127.018482, 1127.578548, 1128.138556, 1128.698505, 1129.258397, 1129.81823 , 1130.378005, 1130.937722, 1131.497381, 1132.056981])
y=np.array([-4.89046878e+00, -4.90985832e+00, -5.92924587e+00, -3.28194437e+00, -1.96801488e+00, -3.32070938e+00, -5.34008887e+00, -3.59466330e-01, -2.04552879e+00, -1.06490224e+00, 8.24910035e+00, 5.32297309e+01, 1.11543677e+02, 8.98576241e+01, 2.18504948e+02, 7.15152212e+02, 7.62799601e+02, 2.89446870e+02, 7.24275144e+01, 1.94081610e+01, 1.72212272e+00, 7.02773412e-01, -3.16573861e-01, 4.99745483e+00, 7.97811157e+00, 6.25396305e-01, 6.27274408e+00, -4.41328018e+00, -7.76592840e+00, 3.88142539e+00, 6.52872017e+00, 1.50939096e+00, -8.43249208e-01])
model = VoigtModel() + ConstantModel()
params=model.make_params(center=1123.0, amplitude=1000, sigma=0.27)
result1 = model.fit(y.flatten(), params, x=x.flatten())
rest = y-result1.best_fit
model = VoigtModel() + ConstantModel()
params=model.make_params(center=1120.5, amplitude=200, sigma=0.27)
result2 = model.fit(rest, params, x=x.flatten())
rest -= result2.best_fit
plt.plot(x, y, label='Original')
plt.plot(x, result1.best_fit, label='1123.0')
plt.plot(x, result2.best_fit, label='1120.5')
plt.plot(x, rest, label='residual')
plt.legend()
plt.show()
You have to make sure that the residual makes sense. In this case, is quite close to 0, so I'd argue that it is fine.
lmfit does optimize the fit, so it is not necessary to pinpoint the exact value of the peak position. Also, it is important to point out that because of the resolution of this data (and spectroscopy in general), the highest points are not necessarily the centre of the peak. Additionally, because of the same, some shoulders might not be shoulders, though in this case looks like it is.
For your related question - judging by the documentation of lmfit it uses all the range you input. Residuals seem like not a solution since you fall in the same problem (what range to consider). I believe that the commercial SW you mention uses Multivariate Curve Resolution (MCR). These deconvolution problems have been a hot topic for decades. If you are interested in this kind of solution, I suggest reading about Multivariate Curve Resolution (MCR).

resample and groupby on big dask array with xarray - using map_blocks?

I have a custom workflow, that requires using resample to get to a higher temporal frequency, applying a ufunc, and groupby + mean to compute the final result.
I would like to apply this to a big xarray dataset, which is backed by a chunked dask array. For computation, I'd like to use dask.distributed.
However, when I apply this to the full dataset, the number of tasks skyrockets, overwhelming the client and most likely also the scheduler and workers if submitted.
The xarray docs explain:
Do your spatial and temporal indexing (e.g. .sel() or .isel()) early
in the pipeline, especially before calling resample() or groupby().
Grouping and rasampling triggers some computation on all the blocks,
which in theory should commute with indexing, but this optimization
hasn’t been implemented in dask yet.
But I really need to apply this to the full temporal axis.
So how to best implement this?
My approach was to use map_blocks, to apply this function for each chunk individually as to keep the individual xarray sub-datasets small enough.
This seems to work on a small scale, but when I use the full dataset, the workers run out of memory and quickly die.
Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
So my questions are:
Is this approach valid?
How could I implement this workflow otherwise, besides manually implementing the resample and groupby part and putting it in a ufunc?
Any ideas regarding the performance issues at scale (specifically the number of executions vs chunks)?
Here's a small example that mimics the workflow and shows the number of executions vs chunks:
from time import sleep
import dask
from dask.distributed import Client, LocalCluster
import numpy as np
import pandas as pd
import xarray as xr
def ufunc(x):
# computation
sleep(2)
return x
def fun(x):
# upsample to higher res
x = x.resample(time="1h").asfreq().fillna(0)
# apply function
x = xr.apply_ufunc(ufunc, x, input_core_dims=[["time"]], output_core_dims=[['time']], dask="parallelized")
# average over dates
x['time'] = x.time.dt.strftime("%Y-%m-%d")
x = x.groupby("time").mean()
return x
def create_xrds(shape):
''' helper function to create dataset'''
x,y,t = shape
tv = pd.date_range(start="1970-01-01", periods=t)
ds = xr.Dataset({
"band": xr.DataArray(
dask.array.zeros(shape, dtype="int16"),
dims=['x', 'y', 'time'],
coords={"x": np.arange(0, x), "y": np.arange(0, y), "time": tv})
})
return ds
# set up distributed
cluster = LocalCluster(n_workers=2)
client = Client(cluster)
ds = create_xrds((500,500,500)).chunk({"x": 100, "y": 100, "time": -1})
# create template
template = ds.copy()
template['time'] = template.time.dt.strftime("%Y-%m-%d")
# map fun to blocks
ds_out = xr.map_blocks(fun, ds, template=template)
# persist
ds_out.persist()
Using the example above, this is how the dask array (25 chunks) looks like:
But the function fun gets executed 125 times:

Looking at the dashboard, the function I'm applying to the array gets executed multiple times of the number of chunks I have. Shouldn't these two numbers line up?
This is misleading because of an unfortunate choice made when making the graph. The number includes tasks that make a block of the input Dataset (one per variable per chunk) & for the output Dataset as well as tasks that apply the function. This will get fixed soon (https://github.com/pydata/xarray/pull/5007)

How to compare scipy noise filters?

I need to reduce my noise like behavior in my data. I tried one of the method called Savitzky-Golay Filter . On the other hand, I need to find fastest method, because the filtering algorithm will be in the most running script in my code.
I am not familiar with the signal processing methods. Can you suggest faster methods and usage of them briefly?
I do not need complex structure like low-pass, high-pass etc (I know there are thousands of them). As fast as possible smoothening method is what I want to use.
Here my test script:
import numpy as np
import matplotlib.pyplot as plt
noisyData=np.array([
2.77741650e+43, 1.30016392e+42, 8.05792443e+42, 1.74277713e+43,
2.33814198e+43, 6.75553976e+42, 2.56642073e+43, 4.71467220e+43,
4.25047666e+43, 3.07095152e+43, 7.30694187e+43, 7.54411548e+43,
1.29555422e+43, 8.09272000e+42, 9.18193162e+43, 2.25447063e+44,
3.43044832e+41, 7.02901256e+43, 2.54438379e+43, 8.72303015e+43,
7.80333557e+42, 7.55039871e+43, 7.70164773e+43, 4.38740319e+43,
8.43139041e+43, 6.12168640e+43, 5.64352020e+43, 3.63824769e+42,
2.35296604e+43, 4.66272666e+43, 5.03660902e+44, 1.65071897e+44,
2.81055925e+44, 1.46401444e+44, 5.44407940e+43, 4.50672710e+43,
1.60833084e+44, 1.68038069e+44, 1.08588606e+44, 7.00867980e+43])
xAxis=np.arange(len(noisyData))
# ------------- Savitzky-Golay Filter ---------------------
windowLength = len(xAxis) - 5
polyOrder = 6
from scipy.signal import savgol_filter
# Function
def set_SavgolFilter(noisyData,windowLength,polyOrder):
return savgol_filter(noisyData, windowLength, polyOrder)
plt.plot(xAxis,noisyData,alpha=0.5)
plt.plot(xAxis,set_SavgolFilter(noisyData,windowLength,polyOrder))
# ------------- Time Comparison ----------------------
import time
start_time = time.time()
for i in range(50):
savgolfilter1 = set_SavgolFilter(noisyData,windowLength,polyOrder)
print(" %s seconds " % (time.time() - start_time))
# === OTHER METHODS WILL BE HERE

Unless you really need polynomial-based smoothing, Savitzky-Golay does not have any particular advantages. It's basically a bad lowpass filter. For more details, see https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5888646
Using a basic Butterworth lowpass filter instead:
from scipy.signal import butter, filtfilt
b, a = butter(5, .2)
datafilt = filtfilt(b, a, noisyData)
The filtfilt call seems to be several times faster than savgol_filter. How much faster do you need? Using lfilter from scipy is at least 10 times faster, but the result will be delayed with respect to the input signal.

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
data_t = data[0,:,0]
fftAxis = 0
BATCH = np.int32(1)
else:
data_t = data
fftAxis = 1
BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.

Post my attempts, I would want to conclude that:
Using cufft library from skcuda is a bit tricky and to get to the correct FFT output might take a long time, in development. I also noticed that there wasn't an order of magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda)
Using CuPy and arranging your data in a format so that the FFT dimension is laid out in contiguous memory gives an order of magnitude improvement in the FFT compute time. For my case, the order was a little better than 10!
Using CuPy for FFTs is a great option if one wants to stick to Py-based development only. Also the to and fro from C to Python when writing C GPU kernels is an added overhead which is very conveniently resolved with CuPy. Though CuPy itself calls laying out the plan and calling the FFT exec engine internally.

python OLS statsmodels T Stats of variables not entered into the model

Hi have created a OLS regression using Statsmodels
I've written some code that loops through every variable in a dataframe and enters it into the model and then records the T Stat in a new dataframe and builds a list of potential variables.
However I have 20,000 variables so it takes ages to run each time.
Can anyone think of a better approach?
This is my current approach
TStatsOut=pd.DataFrame()
for i in VarsOut:
try:
xstrout='+'.join([baseterms,i])
fout='ymod~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
s=pd.concat([j, k], axis=1, join_axes=[j.index])
TStatsOut=TStatsOut.append(s)

Here is what I have found in regards to your question. My answer uses the approach of using dask for distributed computing, and also just general clean up of you current approach.
I made a smaller fake dataset with 1000 variables, one will be the outcome, and two will be the baseterms, so there is really 997 variables to loop through.
import dask
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
#make some toy data for the case you showed
df_train = pd.DataFrame(np.random.randint(low=0,high=10,size=(10000, 1000)))
df_train.columns = ['var'+str(x) for x in df_train.columns]
baseterms = 'var1+var2'
VarsOut = df_train.columns[3:]
Baseline for your current Code (20s +- 858ms):
%%timeit
TStatsOut=pd.DataFrame()
for i in VarsOut:
xstrout='+'.join([baseterms,i])
fout='var0~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
s=pd.concat([j, k], axis=1)
s=s.reindex(j.index)
TStatsOut=TStatsOut.append(s)
Created a function for readability, but returns just the pval, and regression coefficient for each variable tested instead of the one line dataframes.
def testVar(i):
xstrout='+'.join([baseterms,i])
fout='var0~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
pval=modout.pvalues[i]
coef=modout.params[i]
return pval, coef
Now runs at (14.1s +- 982ms)
%%timeit
pvals=[]
coefs=[]
for i in VarsOut:
pval, coef = testVar(i)
pvals.append(pval)
coefs.append(coef)
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
index=VarsOut)[['PValue','Coeff']]
Using Dask delayed for parallel processing. Keep in mind each delayed task that is created cause a slight overhead as well, so sometimes it it may not be beneficial, but will depend on your exact dataset and how long the regressions are taking. My data example may be too simple to show any benefit.
#define the same function as before, but tell dask how many outputs it has
#dask.delayed(nout=2)
def testVar(i):
xstrout='+'.join([baseterms,i])
fout='var0~'+xstrout
modout = smf.ols(fout, data=df_train).fit()
pval=modout.pvalues[i]
coef=modout.params[i]
return pval, coef
Now run through the 997 candidate variables and create the same dataframe with dask delayed. (18.6s +- 588ms)
%%timeit
pvals=[]
coefs=[]
for i in VarsOut:
pval, coef = dask.delayed(testVar)(i)
pvals.append(pval)
coefs.append(coef)
pvals, coefs = dask.compute(pvals,coefs)
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
index=VarsOut)[['PValue','Coeff']]
Again, dask delayed creates more overhead as it creates the tasks to be sent across many processors, so any performance gain will depend on the time your data actually takes in the regressions as well as how many CPUs you have availible. Dask can be scaled from a single workstation to a cluster of workstations.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

KernelReg performance in a for loop - python

Related

Fitting two voigt curves, one after the other using lmfit

resample and groupby on big dask array with xarray - using map_blocks?

How to compare scipy noise filters?

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

python OLS statsmodels T Stats of variables not entered into the model

Categories

Resources