I am creating a system for logging data from sensors. (Just a series of numbers)
I would like to be able to put the system into a "learn" mode for a couple of days so it can see what its "normal" operational values are and that once it is out of this any deviation from this behaviour past a certain point can be flagged. The data is all stored in a MySQL database.
Any suggestions on how to carry this out would be welcome, as would locations for further reading on the topic.
I would preferably like to use python for this task.
The data temperature and humidity values ever 5 minutes in a temperature controlled area that is accessed and used during the day. This means the it will have fluctuations for when it is in use and some temperature changes. But anything different to this such as cooling or heating systems failing needs to be detected
Essentially what you should be looking at is density estimation: the task of determining a model of how some variables behave, so that you can look for deviations from it.
Here's some very simple example code. I've assumed that temperature and humidity have independent normal distributions on their untransformed scales:
import numpy as np
from matplotlib.mlab import normpdf
from itertools import izip
class TempAndHumidityModel(object):
def __init__(self):
self.tempMu=0
self.tempSigma=1
self.humidityMu=0
self.humiditySigma=1
def setParams(self, tempMeasurements, humidityMeasurements, quantile):
self.tempMu=np.mean(tempMeasurements)
self.tempSigma=np.std(tempMeasurements)
self.humidityMu=np.mean(humidityMeasurements)
self.humiditySigma=np.std(humidityMeasurements)
if not 0 < quantile <= 1:
raise ValueError("Quantile for threshold must be between 0 and 1")
self._thresholdDensity(quantile, tempMeasurements, humidityMeasurements)
def _thresholdDensity(self, quantile, tempMeasurements, humidityMeasurements):
tempDensities = np.apply_along_axis(
lambda x: normpdf(x, self.tempMu, self.tempSigma),0,tempMeasurements)
humidityDensities = np.apply_along_axis(
lambda x: normpdf(x, self.humidityMu, self.humiditySigma),0,humidityMeasurements)
densities = sorted(tempDensities * humidityDensities, reverse=True)
#Here comes the massive oversimplification: just choose the
#density value at the quantile*length position, and use this as the threshold
self.threshold = densities[int(np.round(quantile*len(densities)))]
def probOfObservation(self, temp, humidity):
return normpdf(temp, self.tempMu, self.tempSigma) * \
normpdf(humidity, self.humidityMu, self.humiditySigma)
def isNormalMeasurement(self, temp, humidity):
return self.probOfObservation(temp, humidity) > self.threshold
if __name__ == '__main__':
#Create some simulated data
temps = np.random.randn(100)*10 + 50
humidities = np.random.randn(100)*2 + 10
thm = TempAndHumidityModel()
#going to hard code in the 95% threshold
thm.setParams(temps, humidities, 0.95)
#Create some new data from same dist and see how many false positives
newTemps = np.random.randn(100)*10 + 50
newHumidities = np.random.randn(100)*2 + 10
numFalseAlarms = sum(~thm.isNormalMeasurement(t,h) for t,h in izip(newTemps,newHumidities))
print '{} false alarms!'.format(numFalseAlarms)
#Now create some abnormal data: mean temp drops to 20
lowTemps = np.random.randn(100)*10 + 20
normalHumidities = np.random.randn(100)*2 + 10
numDetections = sum(~thm.isNormalMeasurement(t,h) for t,h in izip(lowTemps,normalHumidities))
print '{} abnormal measurements flagged'.format(numDetections)
Example output:
>> 3 false alarms!
>> 77 abnormal measurements flagged
Now, I have no idea whether the assumption of normality is appropriate for your data (you may want to transform the data onto a different scale so that it is); it's probably wildly inaccurate to assume independence between temperature and humidity; and the trick that I have used to find the density value corresponding to the requested quantile of the distribution should be replaced by something that uses the inverse CDF of the distribution. However, this should give you a flavour of what to do.
Note additionally that there are many good non-parametric density estimators: kernel density estimators immediately spring to mind. These may be more appropriate if your data doesn't look like any standard distribution.
It looks like you are attempting to perform anomaly detection but your description of your data is vague. You should start by trying to define/constrain what it means, in general, for your data to be "normal".
Is there a different "normal" for each sensor?
Is a sensor measurement somehow dependent on it's previous measurement(s)?
Does "normal" change over the course of a day?
Can the "normal" measurements from a sensor be characterized by a statistical model (e.g., are the data Gaussian or log-normal)?
Once you have answered those types of questions, then you can train a classifier or anomaly detector with a batch of data from your database and use the result to assess future log output. If machine learning algorithms are applicable to your data, you might consider using scikit-learn. For statistical models, you could use the stats subpackage of SciPy. And, of course, for any kind of numerical data manipulation in python, NumPy is your friend.
Related
Good day
EDIT:
What I want: From any current/voltage waveform on a Power System(PS) I want the filtered 50Hz (fundamental) RMS values magnitudes (and effectively their angles). The current as measured contains all harmonics from 100Hz to 1250Hz depending on the equipment. One cannot analyse and calculate using a wave with these harmonics your error gets so big (depending on equipment) that PS protection equipment calculates incorrect quantities. The signal attached also has MANY many other frequency components involved.
My aim: PS protection Relays are special and calculate a 20ms window in a very short time. I.m not trying to get this. I'm using external recording tech and testing what the relays see are true and they operate correctly. Thus I need to do what they do and only keep 50Hz values without any harmonic and DC.
Important expected result: Given any frequency component that MAY be in the signal I want to see the magnitude of any given harmonic (150,250 - 3rd harmonic magnitudes and 5th harmonic of the fundamental) as well as the magnitude of the DC. This will tell me what type of PS equipment possibly injects these frequencies. It is important that I provide a frequency and the answer is a vector of that frequency only with all other values filtered OUT.
RMS-of-the-fundamental vs RMS differs with a factor of 4000A (50Hz only) and 4500A (with other freqs included)
This code calculates a One Cycle Fourier value (RMS) for given frequency. Sometimes called a Fourier filter I think? I use it for Power System 50Hz/0Hz/150Hz analogues analysis. (The answers have been tested and are correct fundamental RMS values. (https://users.wpi.edu/~goulet/Matlab/overlap/trigfs.html)
For a large sample the code is very slow. For 55000 data points it takes 12seconds. For 3 voltages and 3 currents this gets to be VERY slow. I look at 100s of records a day.
How do I enhance it? What Python tips and tricks/ libraries are there to append my lists/array.
(Also feel free to rewrite or use the code). I use the code to pick out certain values out of a signal at given times. (which is like reading the values from a specialized program for power system analysis)
Edited: with how I load the files and use them, code works with pasting it:
import matplotlib.pyplot as plt
import csv
import math
import numpy as np
import cmath
# FILES ATTACHED TO POST
filenamecfg = r"E:/Python_Practise/2019-10-21 13-54-38-482.CFG"
filename = r"E:/Python_Practise/2019-10-21 13-54-38-482.DAT"
t = []
IR = []
newIR=[]
with open(filenamecfg,'r') as csvfile1:
cfgfile = [row for row in csv.reader(csvfile1, delimiter=',')]
numberofchannels=int(np.array(cfgfile)[1][0])
scaleval = float(np.array(cfgfile)[3][5])
scalevalI = float(np.array(cfgfile)[8][5])
samplingfreq = float(np.array(cfgfile)[numberofchannels+4][0])
numsamples = int(np.array(cfgfile)[numberofchannels+4][1])
freq = float(np.array(cfgfile)[numberofchannels+2][0])
intsample = int(samplingfreq/freq)
#TODO neeeed to get number of samples and frequency and detect
#automatically
#scaleval = np.array(cfgfile)[3]
print('multiplier:',scaleval)
print('SampFrq:',samplingfreq)
print('NumSamples:',numsamples)
print('Freq:',freq)
with open(filename,'r') as csvfile:
plots = csv.reader(csvfile, delimiter=',')
for row in plots:
t.append(float(row[1])/1000000) #get time from us to s
IR.append(float(row[6]))
newIR = np.array(IR) * scalevalI
t = np.array(t)
def mag_and_theta_for_given_freq(f,IVsignal,Tsignal,samples): #samples are the sample window size you want to caclulate for (256 in my case)
# f in hertz, IVsignal, Tsignal in numpy.array
timegap = Tsignal[2]-Tsignal[3]
pi = math.pi
w = 2*pi*f
Xr = []
Xc = []
Cplx = []
mag = []
theta = []
#print("Calculating for frequency:",f)
for i in range(len(IVsignal)-samples):
newspan = range(i,i+samples)
timewindow = Tsignal[newspan]
#print("this is my time: ",timewindow)
Sig20ms = IVsignal[newspan]
N = len(Sig20ms) #get number of samples of my current Freq
RealI = np.multiply(Sig20ms, np.cos(w*timewindow)) #Get Real and Imaginary part of any signal for given frequency
ImagI = -1*np.multiply(Sig20ms, np.sin(w*timewindow)) #Filters and calculates 1 WINDOW RMS (root mean square value).
#calculate for whole signal and create a new vector. This is the RMS vector (used everywhere in power system analysis)
Xr.append((math.sqrt(2)/N)*sum(RealI)) ### TAKES SO MUCH TIME
Xc.append((math.sqrt(2)/N)*sum(ImagI)) ## these steps make RMS
Cplx.append(complex(Xr[i],Xc[i]))
mag.append(abs(Cplx[i]))
theta.append(np.angle(Cplx[i]))#th*180/pi # this can be used to get Degrees if necessary
#also for freq 0 (DC) id the offset is negative how do I return a negative to indicate this when i'm using MAGnitude or Absolute value
return Cplx,mag,theta #mag[:,1]#,theta # BUT THE MAGNITUDE WILL NEVER BE zero
myZ,magn,th = mag_and_theta_for_given_freq(freq,newIR,t,intsample)
plt.plot(newIR[0:30000],'b',linewidth=0.4)#, label='CFG has been loaded!')
plt.plot(magn[0:30000],'r',linewidth=1)
plt.show()
The code as pasted runs smoothly given the files attached
Regards
EDIT: Please find a test csvfile and COMTRADE TEST files here:
CSV:
https://drive.google.com/open?id=18zc4Ms_MtYAeTBm7tNQTcQkTnFWQ4LUu
COMTRADE
https://drive.google.com/file/d/1j3mcBrljgerqIeJo7eiwWo9eDu_ocv9x/view?usp=sharing
https://drive.google.com/file/d/1pwYm2yj2x8sKYQUcw3dPy_a9GrqAgFtD/view?usp=sharing
Forewords
As I said in my previous comment:
Your code mainly relies on a for loop with a lot of indexation and
scalar operations. You already have imported numpy so you should take
advantage of vectorization.
This answer is a start towards your solution.
Light weight MCVE
First we create a trial signal for the MCVE:
import numpy as np
# Synthetic signal sampler: 5s sampled as 400 Hz
fs = 400 # Hz
t = 5 # s
t = np.linspace(0, t, fs*t+1)
# Synthetic Signal: Amplitude is about 325V #50Hz
A = 325 # V
f = 50 # Hz
y = A*np.sin(2*f*np.pi*t) # V
Then we can compute the RMS of this signal using the usual formulae:
# Actual definition of RMS:
yrms = np.sqrt(np.mean(y**2)) # 229.75 V
Or alternatively we can compute it using DFT (implemented as rfft in numpy.fft):
# RMS using FFT:
Y = np.fft.rfft(y)/y.size
Yrms = np.sqrt(np.real(Y[0]**2 + np.sum(Y[1:]*np.conj(Y[1:]))/2)) # 229.64 V
A demonstration of why this last formulae works can be found here. This is valid because of the Parseval Theorem implies Fourier transform does conserve Energy.
Both versions make use of vectorized functions, no need of splitting real and imaginary part to perform computation and then reassemble into a complex number.
MCVE: Windowing
I suspect you want to apply this function as a window on a long term time serie where RMS value is about to change. Then we can tackle this problem using pandas library which provides time series commodities.
import pandas as pd
We encapsulate the RMS function:
def rms(y):
Y = 2*np.fft.rfft(y)/y.size
return np.sqrt(np.real(Y[0]**2 + np.sum(Y[1:]*np.conj(Y[1:]))/2))
We generate a damped signal (variable RMS)
y = np.exp(-0.1*t)*A*np.sin(2*f*np.pi*t)
We wrap our trial signal into a dataframe to take advantage of the rolling or resample methods:
df = pd.DataFrame(y, index=t*pd.Timedelta('1s'), columns=['signal'])
A rolling RMS of your signal is:
df['rms'] = df.rolling(int(fs/f)).agg(rms)
A periodically sampled RMS is:
df['signal'].resample('1s').agg(rms)
The later returns:
00:00:00 2.187840e+02
00:00:01 1.979639e+02
00:00:02 1.791252e+02
00:00:03 1.620792e+02
00:00:04 1.466553e+02
Signal Conditioning
Addressing your need of keeping only fundamental harmonic (50 Hz), a straightforward solution could be a linear detrend (to remove constant and linear trend) followed by a Butterworth filter (bandpass filter).
We generate a synthetic signal with other frequencies and linear trend:
y = np.exp(-0.1*t)*A*(np.sin(2*f*np.pi*t) \
+ 0.2*np.sin(8*f*np.pi*t) + 0.1*np.sin(16*f*np.pi*t)) \
+ A/20*t + A/100
Then we condition the signal:
from scipy import signal
yd = signal.detrend(y, type='linear')
filt = signal.butter(5, [40,60], btype='band', fs=fs, output='sos', analog=False)
yfilt = signal.sosfilt(filt, yd)
Graphically it leads to:
It resumes to apply the signal conditioning before the RMS computation.
First, I must admit that my statistics knowledge is rusty at best: even when it was shining new, it's not a discipline I particularly liked, which means I had a hard time making sense of it.
Nevertheless, I took a look at how the barplot graphs were calculating error bars, and was surprised to find a "confidence interval" (CI) used instead of (the more common) standard deviation. Researching more CI led me to this wikipedia article which seems to say that, basically, a CI is computed as:
Or, in pseudocode:
def ci_wp(a):
"""calculate confidence interval using Wikipedia's formula"""
m = np.mean(a)
s = 1.96*np.std(a)/np.sqrt(len(a))
return m - s, m + s
But what we find in seaborn/utils.py is:
def ci(a, which=95, axis=None):
"""Return a percentile range from an array of values."""
p = 50 - which / 2, 50 + which / 2
return percentiles(a, p, axis)
Now maybe I'm missing this completely, but this seems just like a completely different calculation than the one proposed by Wikipedia. Can anyone explain this discrepancy?
To give another example, from comments, why do we get so different results between:
>>> sb.utils.ci(np.arange(100))
array([ 2.475, 96.525])
>>> ci_wp(np.arange(100))
[43.842250270646467,55.157749729353533]
And to compare with other statistical tools:
def ci_std(a):
"""calculate margin of error using standard deviation"""
m = np.mean(a)
s = np.std(a)
return m-s, m+s
def ci_sem(a):
"""calculate margin of error using standard error of the mean"""
m = np.mean(a)
s = sp.stats.sem(a)
return m-s, m+s
Which gives us:
>>> ci_sem(np.arange(100))
(46.598850802411796, 52.401149197588204)
>>> ci_std(np.arange(100))
(20.633929952277882, 78.366070047722118)
Or with a random sample:
rng = np.random.RandomState(10)
a = rng.normal(size=100)
print sb.utils.ci(a)
print ci_wp(a)
print ci_sem(a)
print ci_std(a)
... which yields:
[-1.9667006 2.19502303]
(-0.1101230745774124, 0.26895640045116026)
(-0.017774461397903049, 0.17660778727165088)
(-0.88762281417683186, 1.0464561400505796)
Why are Seaborn's numbers so radically different from the other results?
Your calculation using this Wikipedia formula is completely right. Seaborn just uses another method: https://en.wikipedia.org/wiki/Bootstrapping_(statistics). It's well described by Dragicevic [1]:
[It] consists of generating many alternative datasets from the experimental data by randomly drawing observations with replacement. The variability across these datasets is assumed to approximate sampling error and is used to compute so-called bootstrap confidence intervals. [...] It is very versatile and works for many kinds of distributions.
In the Seaborn's source code, a barplot uses estimate_statistic which bootstraps the data then computes the confidence interval on it:
>>> sb.utils.ci(sb.algorithms.bootstrap(np.arange(100)))
array([43.91, 55.21025])
The result is consistent with your calculation.
[1] Dragicevic, P. (2016). Fair statistical communication in HCI. In Modern Statistical Methods for HCI (pp. 291-330). Springer, Cham.
You need to check the code of percentiles. The seaborn ci code you posted simply computes the percentile limits. This interval has a defined mean of 50 (median) and a default range of 95% confidence interval. The actual mean, the standard deviation, etc. will appear in the percentiles routine.
I am trying to write code to produce confidence intervals for the number of different books in a library (as well as produce an informative plot).
My cousin is at elementary school and every week is given a book by his teacher. He then reads it and returns it in time to get another one the next week. After a while we started noticing that he was getting books he had read before and this became gradually more common over time.
Say the true number of books in the library is N and the teacher picks one uniformly at random (with replacement) to give to you each week. If at week t the number of occasions on which you have received a book you have read is x, then I can produce a maximum likelihood estimate for the number of books in the library following https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library .
Example: Consider a library with five books A, B, C, D, and E. If you receive books [A, B, A, C, B, B, D] in seven successive weeks, then the value for x (the number of duplicates) will be [0, 0, 1, 1, 2, 3, 3] after each of those weeks, meaning after seven weeks, you have received a book you have already read on three occasions.
To visualise the likelihood function (assuming I have understood what one is correctly) I have written the following code which I believe plots the likelihood function. The maximum is around 135 which is indeed the maximum likelihood estimate according to the MSE link above.
from __future__ import division
import random
import matplotlib.pyplot as plt
import numpy as np
#N is the true number of books. t is the number of weeks.unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t):
return t - len(set([random.randint(0,N) for i in xrange(t)]))
iters = 1000
ydata = []
for N in xrange(10,500):
sampledunk = [numberrepeats(N,t) for i in xrange(iters)].count(unk)
ydata.append(sampledunk/iters)
print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
The output looks like
My questions are these:
Is there an easy way to get a 95% confidence interval and plot it on the diagram?
How can you superimpose a smoothed curve over the plot?
Is there a better way my code should have been written? It isn't very elegant and is also quite slow.
Finding the 95% confidence interval means finding the range of the x axis so that 95% of the time the empirical maximum likelihood estimate we get by sampling (which should theoretically be 135 in this example) will fall within it. The answer #mbatchkarov has given does not currently do this correctly.
There is now a mathematical answer at https://math.stackexchange.com/questions/656101/how-to-find-a-confidence-interval-for-a-maximum-likelihood-estimate .
Looks like you're ok on the first part, so I'll tackle your second and third points.
There are plenty of ways to fit smooth curves, with scipy.interpolate and splines, or with scipy.optimize.curve_fit. Personally, I prefer curve_fit, because you can supply your own function and let it fit the parameters for you.
Alternatively, if you don't want to learn a parametric function, you could do simple rolling-window smoothing with numpy.convolve.
As for code quality: you're not taking advantage of numpy's speed, because you're doing things in pure python. I would write your (existing) code like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
# N is the true number of books.
# t is the number of weeks.
# unk is the true number of repeats found
t = 30
unk = 3
def numberrepeats(N, t, iters):
rand = np.random.randint(0, N, size=(t, iters))
return t - np.array([len(set(r)) for r in rand])
iters = 1000
ydata = np.empty(500-10)
for N in xrange(10,500):
sampledunk = np.count_nonzero(numberrepeats(N,t,iters) == unk)
ydata[N-10] = sampledunk/iters
print "MLE is", np.argmax(ydata)
xdata = range(10, 500)
print len(xdata), len(ydata)
plt.plot(xdata,ydata)
plt.show()
It's probably possible to optimize this even more, but this change brings your code's runtime from ~30 seconds to ~2 seconds on my machine.
The a simple (numerical) way to get a confidence interval is simply to run your script many times, and see how much your estimate varies. You can use that standard deviation to calculate the confidence interval.
In the interest of time, another option is to run a bunch of trials at each value of N (I used 2000), and then use random subsampling of those trials to get an estimate of the estimator standard deviation. Basically, this involves selecting a subset of the trials, generating your likelihood curve using that subset, then finding the maximum of that curve to get your estimator. You do this over many subsets and this gives you a bunch of estimators, which you can use to find a confidence interval on your estimator. My full script is as follows:
import numpy as np
t = 30
k = 3
def trial(N):
return t - len(np.unique(np.random.randint(0, N, size=t)))
def trials(N, n_trials):
return np.asarray([trial(N) for i in xrange(n_trials)])
n_trials = 2000
Ns = np.arange(1, 501)
results = np.asarray([trials(N, n_trials=n_trials) for N in Ns])
def likelihood(results):
L = (results == 3).mean(-1)
# boxcar filtering
n = 10
L = np.convolve(L, np.ones(n) / float(n), mode='same')
return L
def max_likelihood_estimate(Ns, results):
i = np.argmax(likelihood(results))
return Ns[i]
def max_likelihood(Ns, results):
# calculate mean from all trials
mean = max_likelihood_estimate(Ns, results)
# randomly subsample results to estimate std
n_samples = 100
sample_frac = 0.25
estimates = np.zeros(n_samples)
for i in xrange(n_samples):
mask = np.random.uniform(size=results.shape[1]) < sample_frac
estimates[i] = max_likelihood_estimate(Ns, results[:,mask])
std = estimates.std()
sterr = std * np.sqrt(sample_frac) # is this mathematically sound?
ci = (mean - 1.96*sterr, mean + 1.96*sterr)
return mean, std, sterr, ci
mean, std, sterr, ci = max_likelihood(Ns, results)
print "Max likelihood estimate: ", mean
print "Max likelihood 95% ci: ", ci
There are two drawbacks to this method. One is that, since you're taking many subsamples from the same set of trials, your estimates are not independent. To limit the effect of this, I only used 25% of the results for each subset. Another drawback is that each subsample is only a fraction of your data, so estimates derived from these subsets will have more variance than estimates derived from running the full script many times. To account for this, I computed the standard error as the standard deviation divided by the square root of 4, since I had four times as much data in my full data set than in one of the subsamples. However, I'm not familiar enough with Monte Carlo theory to know if this is mathematically sound. Running my script a number of times did seem to indicate that my results were reasonable.
Lastly, I did use a boxcar filter on the likelihood curves to smooth them out a bit. Ideally, this should improve results, but even with the filtering there was still a considerable amount of variability in the results. When calculating the value for the overall estimator, I wasn't sure if it would be better compute one likelihood curve from all the results and use the max of that (this is what I ended up doing), or to use the mean of all the subset estimators. Using the mean of the subset estimators might be able to help cancel out some of the roughness in the curves that remains after filtering, but I'm not sure on this.
Here is an answer to your first question and a pointer to a solution for the second:
plot(xdata,ydata)
# calculate the cumulative distribution function
cdf = np.cumsum(ydata)/sum(ydata)
# get the left and right boundary of the interval that contains 95% of the probability mass
right=argmax(cdf>0.975)
left=argmax(cdf>0.025)
# indicate confidence interval with vertical lines
vlines(xdata[left], 0, ydata[left])
vlines(xdata[right], 0, ydata[right])
# hatch confidence interval
fill_between(xdata[left:right], ydata[left:right], facecolor='blue', alpha=0.5)
This produces the following figure:
I'll try to answer question 3 when I have more time :)
I have a dataset of time-series examples. I want to calculate the similarity between various time-series examples, however I do not want to take into account differences due to scaling (i.e. I want to look at similarities in the shape of the time-series, not their absolute value). So, to this end, I need a way of normalizing the data. That is, making all of the time-series examples fall between a certain region e.g [0,100]. Can anyone tell me how this can be done in python
The solutions given are good for a series that aren’t incremental nor decremental(stationary). In financial time series( or any other series with a a bias) the formula given is not right. It should, first be detrended or perform a scaling based in the latest 100-200 samples.
And if the time series doesn't come from a normal distribution ( as is the case in finance) there is advisable to apply a non linear function ( a standard CDF funtion for example) to compress the outliers.
Aronson and Masters book (Statistically sound Machine Learning for algorithmic trading) uses the following formula ( on 200 day chunks ):
V = 100 * N ( 0.5( X -F50)/(F75-F25)) -50
Where:
X : data point
F50 : mean of the latest 200 points
F75 : percentile 75
F25 : Percentile 25
N : normal CDF
Assuming that your timeseries is an array, try something like this:
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
This will confine your values between 0 and 1
Following my previous comment, here it is a (not optimized) python function that does scaling and/or normalization:
( it needs a pandas DataFrame as input, and it’s doesn’t check that, so it raises errors if supplied with another object type. If you need to use a list or numpy.array you need to modify it. But you could convert those objects to pandas.DataFrame() first.
This function is slow, so it’s advisable run it just once and store the results.
from scipy.stats import norm
import pandas as pd
def get_NormArray(df, n, mode = 'total', linear = False):
'''
It computes the normalized value on the stats of n values ( Modes: total or scale )
using the formulas from the book "Statistically sound machine learning..."
(Aronson and Masters) but the decission to apply a non linear scaling is left to the user.
It is modified to fit the data from -1 to 1 instead of -100 to 100
df is an imput DataFrame. it returns also a DataFrame, but it could return a list.
n define the number of data points to get the mean and the quartiles for the normalization
modes: scale: scale, without centering. total: center and scale.
'''
temp =[]
for i in range(len(df))[::-1]:
if i >= n: # there will be a traveling norm until we reach the initian n values.
# those values will be normalized using the last computed values of F50,F75 and F25
F50 = df[i-n:i].quantile(0.5)
F75 = df[i-n:i].quantile(0.75)
F25 = df[i-n:i].quantile(0.25)
if linear == True and mode == 'total':
v = 0.5 * ((df.iloc[i]-F50)/(F75-F25))-0.5
elif linear == True and mode == 'scale':
v = 0.25 * df.iloc[i]/(F75-F25) -0.5
elif linear == False and mode == 'scale':
v = 0.5* norm.cdf(0.25*df.iloc[i]/(F75-F25))-0.5
else: # even if strange values are given, it will perform full normalization with compression as default
v = norm.cdf(0.5*(df.iloc[i]-F50)/(F75-F25))-0.5
temp.append(v[0])
return pd.DataFrame(temp[::-1])
I'm not going to give the Python code, but the definition of normalizing, is that for every value (datapoint) you calculate "(value-mean)/stdev". Your values will not fall between 0 and 1 (or 0 and 100) but I don't think that's what you want. You want to compare the variation. Which is what you are left with if you do this.
from sklearn import preprocessing
normalized_data = preprocessing.minmax_scale(data)
You can take a look here normalize-standardize-time-series-data-python
and
sklearn.preprocessing.minmax_scale
Is it possible to perform "pseudoexperiments" using PyMC?
By pseudoexperiments, I mean generating random "observations" by sampling from the prior, and then, given each pseudoexperiment, drawing samples from the posterior. Afterwards, one would compare the trace for each parameter to the sample (obtained from the prior) used in sampling from the posterior.
A more concrete example: Suppose that I want to know the rate of process X. I count how many occurrences there are in a certain period of time. However, I know that process Y also sometimes occurs and will contaminate my count. The rate of process Y is known with some uncertainty. So, I build a model, include my observations, and sample from the posterior:
import pymc
class mymodel:
rate_x = pymc.Uniform('rate_x', lower=0, upper=100)
rate_y = pymc.Normal('rate_y', mu=150, tau=1./(15**2))
total_rate = pymc.LinearCombination('total_rate', [1,1], [rate_x, rate_y])
data = pymc.Poisson('data', mu=total_rate, value=193, observed=True)
Mod = pymc.Model(mymodel)
MCMC = pymc.MCMC(Mod)
MCMC.sample(100000, burn=5000, thin=5)
print MCMC.stats()['rate_x']['quantiles']
However, before I do my experiment (or before I "unblind" my analysis and look at my data), I would like to know how sensitive I expect to be -- what will be the uncertainty on my measurement of rate_x?
To answer this, I could sample from the prior
Mod.draw_from_prior()
but this only samples rate_x, rate_y, and calculates total_rate. But once the values of those are set by draw_from_prior(), I can draw a pseudoexperiment:
Mod.data.random()
This just returns a number, so I have to set the value of Mod.data to a random sample. Because Mod.data has the observed flag set, I have to also "force" it:
Mod.data.set_value(Mod.data.random(), force=True)
Now I can sample from the posterior again
MCMC.sample(100000, burn=500, thin=5)
print MCMC.stats()['rate_x']['quantiles']
All this works, so I suppose the simple answer to my question is "yes". But it feels very hacky. Is there a better or more natural way to accomplish this?