How to compare scipy noise filters? - python

I need to reduce the noise-like behavior in my data. I tried one method, the Savitzky-Golay filter. However, I need the fastest method possible, because the filtering will run in the most frequently executed part of my code.
I am not familiar with signal processing methods. Can you suggest faster methods and briefly describe how to use them?
I do not need a complex structure like low-pass, high-pass, etc. (I know there are thousands of them). The fastest possible smoothing method is what I want.
Here is my test script:
import numpy as np
import matplotlib.pyplot as plt
noisyData=np.array([
2.77741650e+43, 1.30016392e+42, 8.05792443e+42, 1.74277713e+43,
2.33814198e+43, 6.75553976e+42, 2.56642073e+43, 4.71467220e+43,
4.25047666e+43, 3.07095152e+43, 7.30694187e+43, 7.54411548e+43,
1.29555422e+43, 8.09272000e+42, 9.18193162e+43, 2.25447063e+44,
3.43044832e+41, 7.02901256e+43, 2.54438379e+43, 8.72303015e+43,
7.80333557e+42, 7.55039871e+43, 7.70164773e+43, 4.38740319e+43,
8.43139041e+43, 6.12168640e+43, 5.64352020e+43, 3.63824769e+42,
2.35296604e+43, 4.66272666e+43, 5.03660902e+44, 1.65071897e+44,
2.81055925e+44, 1.46401444e+44, 5.44407940e+43, 4.50672710e+43,
1.60833084e+44, 1.68038069e+44, 1.08588606e+44, 7.00867980e+43])
xAxis=np.arange(len(noisyData))
# ------------- Savitzky-Golay Filter ---------------------
windowLength = len(xAxis) - 5
polyOrder = 6
from scipy.signal import savgol_filter
# Function
def set_SavgolFilter(noisyData, windowLength, polyOrder):
    return savgol_filter(noisyData, windowLength, polyOrder)
plt.plot(xAxis,noisyData,alpha=0.5)
plt.plot(xAxis,set_SavgolFilter(noisyData,windowLength,polyOrder))
# ------------- Time Comparison ----------------------
import time
start_time = time.time()
for i in range(50):
    savgolfilter1 = set_SavgolFilter(noisyData, windowLength, polyOrder)
print(" %s seconds " % (time.time() - start_time))
# === OTHER METHODS WILL BE HERE

Unless you really need polynomial-based smoothing, Savitzky-Golay does not have any particular advantages. It's basically a bad lowpass filter. For more details, see https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5888646
Using a basic Butterworth lowpass filter instead:
from scipy.signal import butter, filtfilt
b, a = butter(5, .2)
datafilt = filtfilt(b, a, noisyData)
The filtfilt call seems to be several times faster than savgol_filter. How much faster do you need? Using lfilter from scipy is at least 10 times faster, but the result will be delayed with respect to the input signal.
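For reference, a minimal sketch of the lfilter variant, reusing the Butterworth coefficients from above (the speedup comes at the cost of a phase delay):
from scipy.signal import butter, lfilter

b, a = butter(5, .2)
# One-pass (causal) filtering: faster than the forward-backward filtfilt,
# but the output is delayed relative to the input signal
datafilt_fast = lfilter(b, a, noisyData)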

Related

Parallelizing a simple matrix assignment in Python with MPI

I have a rather simple parallelization question that I can't seem to work out. I had parallelized a simple matrix assignment using joblib in Python, which worked nicely on my workstation, but now I need to run the code on an HPC cluster and the as-is code is not playing nicely with MPI. A skeleton of the code is below (I have stripped out a lot of non-relevant computation). Basically I have a large matrix that I want to fill in, and at each point the value is a sum over many energies and eigenvalues, so this is the 'slow step' of the calculation. When I run this on my workstation I just parallelize that fill-in using Parallel and delayed from joblib, but of course when I run it on the cluster using mpirun --bind-to none -n 16 python KZ_spectral_function.py | tee spectral.out, for example, the code runs basically in serial (although with some odd behavior).
So, what I think I need to do is convert that joblib line over to mpi4py, include an if rank == 0: statement encompassing everything in the main function, and just modify the contents of gen_spec_func() and divvy up the calls to spec_func() to the different cores. This is the part where I am stuck: all the examples I have read that were simple enough for me to understand use some variation of COMM.scatter() and then append the results to a list, as far as I can tell in a random order, and I don't know quite enough to adapt them to something where I want the results to go in a specific place in the matrix. Any help or advice would be greatly appreciated, as neither parallelization nor Python are strengths of mine ...
Code Snippet (simplified):
import numpy as np
import numpy.linalg as lina
import time
from functools import partial
from joblib import Parallel, delayed

# Helper Functions
def get_eigenvals(k_cart, cellMap, Hwannier, G):
    ## [....] some linear algebra, not important
    return Ek

def gen_spec_func(eigenvals, Nkpts, Energies, Sigma):
    ## This is really the only part that I care to parallelize
    ## This is the joblib version
    num_cores = 16
    tempfunc = partial(spec_func, Energies, Sigma, eigenvals)
    spectral = np.reshape(
        Parallel(n_jobs=num_cores)(delayed(tempfunc)(i, j)
                                   for j in range(0, Nkpts)
                                   for i in range(0, len(Energies))),
        (Nkpts, len(Energies))).T
    return np.matrix(spectral)

def spec_func(Energies, Sigma, eigenvals, i, j):
    return sum([(1.0/((Energies[i]-val)**2 + (Sigma)**2)) for val in eigenvals[j,:]])
#--- Start of main script
Tstart = time.time()
# [...] Declare Constants & Parameters
# [...] Read data from disk
# [...] Some calculations on that data we want done in serial
Energies = [Emin + (Emax-Emin)*i/(Nenergies-1) for i in range(0, Nenergies)]
kzs = [kzi + (kzf-kzi)*l/(nkzs-1) for l in range(0, nkzs)]
DomainAvg = np.matrix([[0.0 for j in range(0, Nkpts)] for i in range(0, Nenergies)])
for kz in kzs:
    ## An outer loop over Kz
    print("Starting loop for kz = ", kz)
    # Generate the base k-grid (symmetric) in 1/A for convenience
    # [...] Generate the appropriate kpoint grid
    for angle in range(0, Nangles):
        ## Inner loop over rotation angles
        #--- For each angle generate the kpoint grid for that domain
        # [...] Calculate some eigenvalues, small matrices not a big deal, serial fine
        #--- Now we Generate the spectral function for that grid
        ### Ok, this is the slow part that we want to parallelize
        DomainAvg += gen_spec_func(eigenvals, Nkpts, Energies, Sigma)
        if (angle % 20 == 0):
            Tend = time.time()
            print("Completed iteration ", angle, "of ", Nangles, " at T = ", Tend-Tstart)
    # Output the results (one file for each Kz)
    DomainAvg = DomainAvg/Nangles
    outfile = "Spectral" + str(kz) + ".txt"
    np.savetxt(outfile, DomainAvg)
# And we are done
Tend = time.time()
print("Total execution time was :", Tend-Tstart)
EDIT: A very hacky solution I came up with was to encode the matrix indices in the matrix itself as floats, then use scatter() and gather() to distribute the matrix, replace each value with the calculation output, and reassemble the matrix. This is of course not a good idea, since it requires int<->float conversion, but it was the only way I could come up with that didn't require rebuilding the entire matrix from the gathered data index by index (instead just using hstack() and reshape() to put it together). I feel like there must be some tool I am missing that assists in distributed calculation for arrays/matrices where the index matters, so I would still be interested if someone has a tip/pointer in this regard.
Minimum Working Example:
import numpy as np
import numpy.linalg as lina
import time
import math
from functools import partial
from mpi4py import MPI

#-- Standard Comms
COMM = MPI.COMM_WORLD
size = COMM.Get_size()
rank = COMM.Get_rank()

Nkpts = 3
Energies = [1.00031415926*i for i in range(0, 11)]

#--- Now we Generate the spectral function for that grid
# This will be done in parallel using scatter/gather in MPI
if rank == 0:
    # List that we will scatter to the different nodes
    # Encode the matrix index from which each element came as a float
    datalist = [float(j + i*Nkpts) for j in range(0, Nkpts) for i in range(0, len(Energies))]
    data = np.array_split(datalist, COMM.Get_size())
else:
    data = None
# Distribute to the different nodes
data = COMM.scatter(data, root=0)
print("I am processor ", rank, " and my data is", data)
for index in range(0, len(data)):
    # Decode the indices
    j = data[index] % Nkpts
    i = math.floor(data[index]/Nkpts)
    data[index] = 100.100*j + Energies[i]
COMM.Barrier()
dataMPI = COMM.gather(data, root=0)
if (rank == 0):
    spectral = np.reshape(np.hstack(dataMPI), (Nkpts, len(Energies))).T
    spectral_func = np.matrix(spectral)
    print(spectral_func)
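One pattern worth noting here (a sketch of an alternative, not code from the post): COMM.gather() returns the per-rank results ordered by rank, so scattering explicit (i, j) index pairs and gathering (i, j, value) triples lets rank 0 place every result exactly where it belongs, with no float encoding:
import numpy as np
from mpi4py import MPI

COMM = MPI.COMM_WORLD
rank = COMM.Get_rank()

Nkpts = 3
Energies = [1.00031415926*i for i in range(0, 11)]

if rank == 0:
    # Explicit index pairs instead of indices encoded as floats
    pairs = [(i, j) for j in range(0, Nkpts) for i in range(0, len(Energies))]
    chunks = np.array_split(pairs, COMM.Get_size())
else:
    chunks = None

chunks = COMM.scatter(chunks, root=0)
# Each rank computes (i, j, value) triples for its share of the indices
local = [(int(i), int(j), 100.100*j + Energies[int(i)]) for i, j in chunks]
gathered = COMM.gather(local, root=0)

if rank == 0:
    spectral = np.zeros((len(Energies), Nkpts))
    for part in gathered:            # parts arrive in rank order
        for i, j, val in part:
            spectral[i, j] = val     # each value lands at its own index
    print(np.matrix(spectral))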

Can Dask automatically create a tree to parallelize a computation and reduce the copies between workers?

I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
BACKGROUND:
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np

@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)

@delayed
def calc_sum(matrices):
    return reduce(lambda a, b: a + b, matrices)

if __name__ == '__main__':
    num_matrices = 16
    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]
    # Here's the Big Sum
    matrices = calc_sum(matrices)
    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
They time out sending data to the main worker performing the sum
The main worker runs out of memory gathering all of the matrices
The overall sum is not running in parallel (only matrix generation is)
Note that none of these issues are Dask's fault, it's working as advertised. I've just set up the computation poorly.
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np

@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)

@delayed
def calc_sum(a, b):
    return a + b

if __name__ == '__main__':
    num_matrices = 16
    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]
    # This tells us the depth of the calculation portion
    # of the tree we are constructing in the next step
    depth = int(math.log(num_matrices, 2))
    # This is the code I don't want to have to manually write
    for _ in range(depth):
        matrices = [
            calc_sum(matrices[i], matrices[i+1])
            for i in range(0, len(matrices), 2)
        ]
    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And the graph:
QUESTION:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?
Dask bag has a reduction/aggregation method that will generate a tree-like DAG: fold.
The workflow would be to 'bag' the delayed objects and then fold them.
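A minimal sketch of that workflow under my own assumptions (I generate the matrices inside the bag via from_sequence/map rather than bagging existing delayed objects, and split_every=2 is an arbitrary fan-in choice):
import dask.bag as db
import numpy as np

def gen_matrix(_):
    return np.random.rand(1000, 1000)

# One matrix per partition; folding with a binary combine and split_every=2
# builds a pairwise reduction tree instead of summing everything on one worker
bag = db.from_sequence(range(16), npartitions=16).map(gen_matrix)
total = bag.fold(lambda a, b: a + b, split_every=2)
result = total.compute()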

resample and groupby on big dask array with xarray - using map_blocks?

I have a custom workflow, that requires using resample to get to a higher temporal frequency, applying a ufunc, and groupby + mean to compute the final result.
I would like to apply this to a big xarray dataset, which is backed by a chunked dask array. For computation, I'd like to use dask.distributed.
However, when I apply this to the full dataset, the number of tasks skyrockets, overwhelming the client and most likely also the scheduler and workers if submitted.
The xarray docs explain:
Do your spatial and temporal indexing (e.g. .sel() or .isel()) early in the pipeline, especially before calling resample() or groupby(). Grouping and resampling trigger some computation on all the blocks, which in theory should commute with indexing, but this optimization hasn't been implemented in dask yet.
But I really need to apply this to the full temporal axis.
So how to best implement this?
My approach was to use map_blocks, to apply this function for each chunk individually as to keep the individual xarray sub-datasets small enough.
This seems to work on a small scale, but when I use the full dataset, the workers run out of memory and quickly die.
Looking at the dashboard, the function I'm applying to the array gets executed many more times than the number of chunks I have. Shouldn't these two numbers line up?
So my questions are:
Is this approach valid?
How could I implement this workflow otherwise, besides manually implementing the resample and groupby part and putting it in a ufunc?
Any ideas regarding the performance issues at scale (specifically the number of executions vs chunks)?
Here's a small example that mimics the workflow and shows the number of executions vs chunks:
from time import sleep

import dask
from dask.distributed import Client, LocalCluster
import numpy as np
import pandas as pd
import xarray as xr

def ufunc(x):
    # computation
    sleep(2)
    return x

def fun(x):
    # upsample to higher res
    x = x.resample(time="1h").asfreq().fillna(0)
    # apply function
    x = xr.apply_ufunc(ufunc, x, input_core_dims=[["time"]],
                       output_core_dims=[["time"]], dask="parallelized")
    # average over dates
    x['time'] = x.time.dt.strftime("%Y-%m-%d")
    x = x.groupby("time").mean()
    return x

def create_xrds(shape):
    '''helper function to create dataset'''
    x, y, t = shape
    tv = pd.date_range(start="1970-01-01", periods=t)
    ds = xr.Dataset({
        "band": xr.DataArray(
            dask.array.zeros(shape, dtype="int16"),
            dims=['x', 'y', 'time'],
            coords={"x": np.arange(0, x), "y": np.arange(0, y), "time": tv})
    })
    return ds

# set up distributed
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

ds = create_xrds((500, 500, 500)).chunk({"x": 100, "y": 100, "time": -1})

# create template
template = ds.copy()
template['time'] = template.time.dt.strftime("%Y-%m-%d")

# map fun to blocks
ds_out = xr.map_blocks(fun, ds, template=template)

# persist
ds_out.persist()
Using the example above, this is what the dask array (25 chunks) looks like:
But the function fun gets executed 125 times:
Looking at the dashboard, the function I'm applying to the array gets executed many more times than the number of chunks I have. Shouldn't these two numbers line up?
This is misleading because of an unfortunate choice made when constructing the graph. The displayed number includes the tasks that make a block of the input Dataset (one per variable per chunk) and of the output Dataset, as well as the tasks that actually apply the function. This will be fixed soon (https://github.com/pydata/xarray/pull/5007).

Modify 3D numpy array in slices in parallel

I have a complex numpy array signal with dimensions [10,1000,50000]
I need to modify this array in slices. This is done in a for loop:
for k in range(signal.shape[2]):
    signal[:,:,k] = myfunction(signal[:,:,k], constant1, constant2, constant5=constant5, constant6=constant6)
I optimized myfunction as much as possible. When I run the script it takes quite some time, but it only uses 1 of 24 CPUs.
The code cannot be rewritten to perform myfunction on the entire array with numpy.
Therefore I want to speed up my code with parallel computing.
There seem to be many different approaches to parallel computing in Python.
Which one seems to be the best for my problem? And how can I implement it?
Joblib provides easy execution for such 'embarrassingly-parallel' tasks:
import numpy as np

# Initialize array and define function
np_array = np.random.rand(100, 100, 100)
my_function = lambda x: x / np.sum(x)

# Option 1: Loop over array and apply function
serial_result = np_array.copy()
for i in range(np_array.shape[2]):
    serial_result[:,:,i] = my_function(np_array[:,:,i])
Now using parallel execution with joblib:
# Option 2: Parallel execution
# ... Apply function in Parallel
from joblib import delayed, Parallel

sub_arrays = Parallel(n_jobs=6)(           # Use 6 cores
    delayed(my_function)(np_array[:,:,i])  # Apply my_function
    for i in range(np_array.shape[2]))     # For each 3rd dimension

# ... Concatenate the list of returned arrays
parallel_results = np.stack(sub_arrays, axis=2)

# Compare results
np.equal(serial_result, parallel_results).all()  # True

KernelReg performance in a for loop

I have a task to fit about 10000 1D profiles obtained from an electron beam to Gaussians.
The raw data is very noisy, so I have to denoise it before fitting. I was advised to use KernelReg for this.
For each profile, I first call KernelReg and then lmfit to extract the center, sigma, amplitude and offset of the raw data in a for loop.
When I tested with 100 profiles:
If I use only lmfit, the runtime is 2.4 seconds (cProfile).
If I combine KernelReg and lmfit, the runtime is 272 seconds.
cProfile shows the bottleneck is in the KernelReg call.
An example profile:
data_array = np.array([
    -0.000159229, -0.000213496, -1.37e-05, -0.00021545, 1.24e-05,
    0.000181446, -0.000133793, -7.84e-05, -0.000266477, -0.000206505,
    -0.000376277, -9.38e-05, 0.000174166, -0.000365068, -0.00291559,
    -6.37e-05, -0.000314041, -0.000426127, -0.000322608, -0.000293555,
    -0.000306628, -0.000379695, -6.12e-05, -0.000336458, -0.000296795,
    -6.57e-06, -0.000121408, -0.000327136, -0.000215139, -0.000221265,
    -0.000112685, -0.000244148, -0.000318746, -0.00039916, -0.00454921,
    -0.00026823, -0.000153014, -0.000423619, -0.000348621, -0.000311244,
    -0.000318724, -0.000145046, -0.000164001, -0.000224927, -0.000568133,
    -0.000106227, -0.00022688, -0.000417715, -0.000382891, 2.87e-05,
    -0.00267422, -0.000207038, -0.000239531, -0.000174655, -0.000145335,
    -0.000202266, -0.000455647, -0.000348444, -0.000346801, -5.86e-05,
    -8.12e-05, -0.00016733, -0.000241884, -0.000227368, -0.000229987,
    -0.000121697, -0.00030503, -0.000244148, 7.8e-05, -0.000253847,
    0.000289293, -0.000123672, 0.0175145, -0.000436537, -0.000320966,
    -0.000177473, -0.000148553, -9.91e-05, -0.000197605, -0.000155855,
    -0.000259152, -0.000221265, -0.00014023, -8.7e-05, -0.000532443,
    -7.29e-05, -0.0001464, -0.000401024, -0.000176963, -0.000318946,
    5.83e-05, -0.000281947, -0.000120476, -0.000313708, -0.000114594,
    -0.000242483, -0.000162958, -0.000144203, -0.000445903, -1.44e-05,
    -0.000186307, -0.000197738, 8.11e-05, -0.000203264, -0.000407749,
    0.00026843, -0.000268629, -0.000228789, 3e-06, -0.000199136,
    -0.000201023, 1.69e-05, -0.000168995, 7.79e-05, 6.05e-05,
    -0.000280038, -0.000305252, -0.000308625, -5.47e-05, -0.00034032,
    -0.000169572, -0.0001193, -0.000234626, 5.73e-05, -0.000235869,
    8.41e-06, -0.000331353, 0.000407483, 0.000226658, -4.63e-05,
    -3.39e-05, -0.000163224, -4.31e-05, -0.000191434, 9.93e-05,
    0.000193032, -9.16e-05, -0.000144513, -0.00010616, -4.39e-05,
    -5.87e-05, 9.19e-06, 0.000276642, 4.48e-05, -7.63e-05,
    0.000100678, 1.18e-05, 0.000209568, 0.000472049, 0.0291889,
    0.000433762, 0.000433607, 0.000574392, 0.000997101, 0.00142112,
    0.00391844, 0.00768485, 0.0138782, 0.0216787, 0.0281597,
    0.028893, 0.0259644, 0.0190644, 0.01219, 0.00574567,
    -0.00154854, 0.00143113, 0.000845396, 0.00076789, 0.000276753,
    0.000351285, 0.000284233, 0.00057053, 0.000433873, -0.000197183,
    1.29e-05, 0.000118878, 0.000203819, 0.000132328, 1.84e-05,
    -4.34e-05, -7.95e-05, -0.000400492, 6e-05, -9.98e-05,
    0.00441493, 5.23e-05, -7.08e-05, 5.7e-05, -0.000148531,
    -0.000139475, -3.74e-05, -0.000149086, -0.000234826, -3.42e-05,
    5.27e-05, -0.000171436, -0.00021778, -0.000175076, -0.000198071])
position = np.linspace(1, 200, 200)
My code:
This code takes about 1-2 seconds per call; it is the bottleneck at run time.
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg
data_array_kr = KernelReg(data_array, position,'c')
data_array_pred, data_array_std = data_array_kr.fit(position)
data_array_pred[data_array_pred<0] = 0
Questions:
How to improve the performance of my KernelReg call?
Is KernelReg + lmfit a good choice for denoising and fitting?
Answering your second question:
You might try comparing a smoothing algorithm such as Savitzky-Golay (if it can be applied to your data -- it requires samples on a uniform interval). It definitely takes some time, but might be faster than KernelReg. You might also try to assess how much smoothing you actually need to get stable fit results.
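For example, a minimal sketch of that comparison on the profile above (the window length and polynomial order here are arbitrary starting points to tune, not values from the answer):
from scipy.signal import savgol_filter

# The profile above sits on a uniform 200-point grid, so Savitzky-Golay
# applies directly; clip negatives as in the KernelReg version
data_array_smooth = savgol_filter(data_array, window_length=11, polyorder=3)
data_array_smooth[data_array_smooth < 0] = 0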
