I have a dataset: temperature and pressure values in different ranges.
I want to filter out all data that deviates more than x% from the "normal" value. Such data occurs on process failures.
Extra: the normal value can change over a longer time, so what is an exception at timestamp1 can be normal at timestamp2.
I looked into some noise filters, but I'm not sure this is noise.
You asked two questions.
1.
Tack on a derived column, so it's easy to filter.
For "x%", like five percent, you might use
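import numpy as np  # df is assumed to be a pandas DataFrame with a 'pressure' column
avg = np.mean(df.pressure)
df['pres_deviation'] = abs(df.pressure - avg) / abs(avg)  # relative deviation from the mean
print(df[df.pres_deviation < .05])  # keep rows within 5% of the mean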
But rather than working with a percentage,
you might find it more natural to work with standard deviations,
filtering out e.g. values more than three standard deviations from the mean.
See
https://en.wikipedia.org/wiki/Standard_score
sklearn StandardScaler
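A minimal sketch of the standard-deviation variant, assuming a pandas DataFrame with a `pressure` column. The sample data, the column name `pres_z`, and the threshold of three are illustrative choices, not part of the question:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: steady pressure with one failure spike.
rng = np.random.default_rng(0)
pressure = rng.normal(loc=100.0, scale=1.0, size=200)
pressure[50] = 150.0  # simulated process failure
df = pd.DataFrame({"pressure": pressure})

# Standard score: distance from the mean in units of standard deviation.
df["pres_z"] = (df.pressure - df.pressure.mean()) / df.pressure.std()

# Keep only the rows within three standard deviations of the mean.
filtered = df[df.pres_z.abs() <= 3]
```

Note that large outliers inflate both the mean and the standard deviation, so on data with many failures a median-based score may behave better.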
2.
(Extra: the normal value can change over time.)
You could use a window of "most recent 100 samples" to define a smoothed average, store that as an extra column, and it replaces the avg scalar in the calculations above.
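The rolling-window idea might look roughly like this. It is a sketch under assumptions: the drifting sample data, the column name `pres_avg`, and the 5% threshold are made up for illustration, and the window is the "most recent 100 samples" mentioned above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "normal" pressure drifts slowly from 100 to 120,
# with one failure spike at index 300.
rng = np.random.default_rng(1)
n = 500
pressure = np.linspace(100.0, 120.0, n) + rng.normal(scale=0.2, size=n)
pressure[300] *= 1.3  # simulated process failure
df = pd.DataFrame({"pressure": pressure})

# Smoothed baseline over the most recent 100 samples (fewer at the start).
df["pres_avg"] = df.pressure.rolling(window=100, min_periods=1).mean()

# Relative deviation from the rolling baseline instead of a global mean.
df["pres_deviation"] = (df.pressure - df.pres_avg).abs() / df.pres_avg.abs()

# Keep rows within 5% of the rolling baseline.
filtered = df[df.pres_deviation < 0.05]
```

Because the spike itself enters the rolling mean, a rolling median (`.rolling(...).median()`) may be a more robust baseline when failures are frequent.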
More generally you could manually set high / low thresholds as a time series in your data.
The area you're describing is called "change point detection", and there is an extensive literature on it; see e.g. https://paperswithcode.com/task/change-point-detection .
I have used ruptures to good effect, and I recommend it to you.
I have 2 pink noise signals created with a random generator
and I put that into a for-loop like:
for i in range(1000):
    input[i] = numpy.random.uniform(-1,1)
for i in range(1000):
    z[i] = z[i-1] + (1-b)*(z[i] - input[i-1])
Now I try to convert this via the snntorch library. I already used the rate coding part of this library and want to compare it with the latency coding part. So I want to use snntorch.spikegen.latency() but I don't know how to use it right. I changed all the parameters and got no good result.
Do you have any tips for the Encoding/Decoding part to convert this noise into a spike train and convert it back?
Thanks to everyone!
Can you share how you're currently trying to use the latency() function?
It should be similar to rate() in that you just pass z to the latency function. Though there are many more options involved (e.g., normalize=True finds the time constant to ensure all spike times occur within the range of time num_steps).
Each element in z will correspond to one spike. So if it is of dimension N, then the output should be T x N.
The value/intensity of the element corresponds to what time that spike occurs. Negative intensities don't make sense here, so either take the absolute value of z before passing it in, or level shift it.
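To make the idea concrete, here is a hand-rolled NumPy sketch of a linear latency code with a level shift, plus the matching decoder. It is not snntorch's implementation: the function names `latency_encode` and `latency_decode` are hypothetical, and snntorch's own mapping and options may differ. It only illustrates "one spike per element, larger intensity spikes earlier, output of shape T x N":

```python
import numpy as np

def latency_encode(values, num_steps):
    """Map each value to a single spike; larger values spike earlier."""
    values = np.asarray(values, dtype=float)
    # Level-shift so all intensities are non-negative, then scale to [0, 1].
    shifted = values - values.min()
    span = shifted.max()
    norm = shifted / span if span > 0 else np.zeros_like(shifted)
    # Linear latency code: intensity 1 -> step 0, intensity 0 -> last step.
    spike_times = np.round((1.0 - norm) * (num_steps - 1)).astype(int)
    spikes = np.zeros((num_steps, values.size), dtype=np.uint8)
    spikes[spike_times, np.arange(values.size)] = 1
    return spikes, spike_times

def latency_decode(spikes, v_min, v_max):
    """Recover approximate values from spike times (quantized to num_steps)."""
    num_steps = spikes.shape[0]
    spike_times = spikes.argmax(axis=0)
    norm = 1.0 - spike_times / (num_steps - 1)
    return v_min + norm * (v_max - v_min)

# Stand-in for the noise signal z from the question.
z = np.random.default_rng(0).uniform(-1, 1, size=1000)
spikes, _ = latency_encode(z, num_steps=100)       # shape (T, N) = (100, 1000)
z_hat = latency_decode(spikes, z.min(), z.max())   # quantized reconstruction
```

The reconstruction is only as fine as the time grid: with `num_steps` steps, each value is rounded to one of `num_steps` levels, so decoding needs the original minimum and maximum and cannot be exact.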
I want to generate many randomized realizations of a low-discrepancy sequence using scipy.stats.qmc. I only know this way, which directly provides a randomized sequence:
from scipy.stats import qmc
ld = qmc.Sobol(d=2, scramble=True)
r = ld.random_base2(m=10)
But if I run
r = ld_deterministic.random_base2(m=10)
twice I get
The balance properties of Sobol' points require n to be a power of 2. 2048 points have been previously generated, then: n=2048+2**10=3072. If you still want to do this, the function 'Sobol.random()' can be used.
The documentation seems to discourage using Sobol.random().
What I would like (and it should be faster) is to first get
ld = qmc.Sobol(d=2, scramble=False)
then generate about 1000 scramblings (or use another randomization method) from this initial series.
This avoids regenerating the Sobol sequence for each sample; only the scrambling is redone.
How can I do that?
It seems to me that this is the proper way to do many randomized QMC runs, but I might be wrong and there might be other ways.
As the warning suggests, Sobol' is a sequence, meaning that each sample is linked to the previous ones. You have to respect the 2^m property. It's perfectly fine to use Sobol.random() if you understand how to use it; this is why we created Sobol.random_base2(), which prints a warning if you try to do something that would break the properties of the sequence. Remember that with Sobol' you cannot skip 10 points and then sample 5, or do arbitrary things like that. If you do, you will not get the convergence rate guaranteed by Sobol'.
In your case, what you want to do is reset the sequence between the draws (Sobol.reset). A new draw will be different from the previous one if scramble=True. Another way (using a non-scrambled sequence, for instance) is to sample 2^k points and skip the first 2^(k-1); you can then sample 2^n points with n<k-1.
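One way to obtain several independently scrambled realizations is sketched below. Note the swap: instead of calling Sobol.reset, it builds a fresh scrambled engine per realization, which sidesteps any question about what a reset keeps. The number of realizations and the use of integer seeds are assumptions for illustration; the `seed` keyword is the spelling assumed here, and newer SciPy releases also accept `rng`:

```python
import numpy as np
from scipy.stats import qmc

n_realizations = 5  # hypothetical; the question mentions about 1000
m = 10              # 2**10 = 1024 points per realization

samples = []
for seed in range(n_realizations):
    # A fresh engine per realization: a different seed gives a different scrambling.
    engine = qmc.Sobol(d=2, scramble=True, seed=seed)
    # Drawing exactly 2**m points from a new engine respects the balance properties.
    samples.append(engine.random_base2(m=m))

samples = np.stack(samples)  # shape: (n_realizations, 2**m, 2)
```

Each realization starts at the beginning of its own sequence, so the power-of-two warning from the question is never triggered.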
I've designed a model using PyMC3, and I have some trouble optimizing it with multiple data.
The model is a bit similar to the coal-mining disaster (as in the PyMC3 tutorial, for those who know it), except there are multiple switchpoints.
The output of the network is a series of real numbers, for instance:
[151,152,150,20,19,18,0,0,0]
with Model() as accrochage_model:
    time = np.linspace(0, n_cycles*data_length, n_cycles*data_length)
    poisson = [Normal('poisson_0', 5, 1), Normal('poisson_1', 10, 1)]
    variance = 3
    t = [Normal('t_0', 0.5, 0.01), Normal('t_1', 0.7, 0.01)]
    taux = [Bernoulli('taux_{}'.format(i), t[i]) for i in range(n_peaks)]
    switchpoint = [Poisson('switchpoint_{}'.format(i), poisson[i])*taux[i] for i in range(n_peaks)]
    peak = [Normal('peak_0', 150, 2), Normal('peak_1', 50, 2), Normal('peak_2', 0, 2)]
    z_init = switch(switchpoint[0] >= time % n_cycles, 0, peak[0])
    z_list = [switch(sum(switchpoint[j] for j in range(i)) >= time % n_cycles, 0, peak[i]-peak[i-1]) for i in range(1, n_peaks)]
    z = (sum(z_list[i] for i in range(len(z_list))))
    z += z_init
    m = Normal('m', z, variance, observed=data)
I have multiple realisations of the true distribution and I'd like to take all of them into account while optimizing the parameters of the system.
Right now my "data" that appears in observed=data is just one list of results, such as:
[151,152,150,20,19,18,0,0,0]
What I would like to do is give not just one but several lists of results,
for instance:
data=([151,152,150,20,19,18,0,0,0],[145,152,150,21,17,19,1,0,0],[151,149,153,17,19,18,0,0,1])
I tried using the shape parameter and making data an array of results but none of it seemed to work.
Does anyone have an idea of how it's possible to do the inference so that the network is optimized for an entire dataset and not a single output?
I am working with IFFT and have a set of real and imaginary values with their respective frequencies (x-axis). The frequencies are not equidistant, so I can't use a discrete IFFT, and I am unable to fit my data correctly because the values are so jumpy at the beginning. So my plan is to "stretch out" my frequency data points on a log scale, fit them (with polyfit), and then somehow return to the normal scale.
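f = data[0:27,0]   # x-values (frequencies)
re = data[0:27,5]  # y-values (real parts)
lgf = p.log10(f)   # p is presumably the numpy or pylab alias
polylog_re = p.poly1d(p.polyfit(lgf, re, 6))  # degree-6 polynomial in log10(f)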
The fit definitely works better (http://imgur.com/btmC3P0), but is it possible to then transform my polynomial back into the normal x-scaling? Right now I'm using those logarithmic fits for my IFFT and take the log10 of my transformed values for plotting etc., but that probably defies all mathematical logic and results in errors.
Your fit is perfectly valid, but it is not a regular polynomial fit. By using log_10(x), you use a different model function: y(x) = sum_i a_i * (log_10(x))^i. If this is okay for you, you are done. If you want to do some more maths, I would suggest using the natural logarithm instead of the one to base 10.
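In practice, "returning to the normal scale" can simply mean evaluating the fitted polynomial at log10 of whatever frequency is needed. A small sketch, assuming NumPy; the synthetic data, the degree, and the helper name `re_model` are illustrative and not taken from the question:

```python
import numpy as np

# Hypothetical data: non-equidistant frequencies and values that are
# smooth in log10(f) (stand-ins for the measured real parts).
f = np.logspace(0, 6, 27)
re = 2.0 - 0.5 * np.log10(f) + 0.1 * np.log10(f) ** 2

# Fit a polynomial in log10(f), as in the question (degree 2 here).
polylog_re = np.poly1d(np.polyfit(np.log10(f), re, 2))

def re_model(freq):
    """Evaluate the fit on the normal frequency scale."""
    return polylog_re(np.log10(freq))

# Equidistant frequency grid, e.g. as input for a discrete IFFT.
f_uniform = np.linspace(f.min(), f.max(), 1024)
re_uniform = re_model(f_uniform)
```

The values returned by `re_model` are already on the original y-scale, so there is no need to take log10 of them again; only the x-argument is transformed.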