I've got a set of 780 monthly temperature anomalies spanning 65 years, and I want to analyse it for the frequencies that are driving the anomalies. I've used the spectrum package to do this, and I've included pictures of the series before and after the analysis.
from spectrum import *
import matplotlib.pyplot as plt
p = Periodogram(anomalies, sampling=1/12)
p.run()
p.plot(marker='.')
plt.title('Power Spectrum of Monthly Temperature Anomalies')
plt.show()
The resulting spectrum has several clear negative spikes. Now, I understand that a negative value in dB isn't actually a negative absolute value, but why is this happening? Does it imply there's some specific frequency missing from my data? Because a positive spike implies one is present.
Also, why are most of the displayed values negative? What is the reference value that the dB scale is relative to?
A part of me thinks I should be taking the absolute value of this spectrum, but I'd like to understand why, if that's the case. Also, I put in 1/12 for sampling because the data points are monthly, so hopefully the frequency scale is in cycles per year?
Many thanks, this is my first post here so let me know if I need to be clearer about anything.
[Image: Negative Energies]
[Image: The Series being Analysed]
As you can see in the plots, the units on the y-axis are dB (decibel, https://en.wikipedia.org/wiki/Decibel), so what you see is not the raw data (in the frequency domain) but something like 10*log10(data). This explains the presence of negative values and is perfectly normal.
Here you have both positive and negative values, but in general you would normalise the data (by its maximum) so that all values are negative and the highest value is set to 0. This is possible using:
p.plot(norm=True)
You can plot the raw data (without the log function), but then you need to work directly with the data in the frequency domain. For instance, to reproduce the behaviour of the p.plot function, you can use:
from pylab import plot, log10
plot(p.frequencies(), 10*log10(p.psd/p.psd.max()))
So, if you do not want to use the decibel unit, use:
from pylab import plot
plot(p.frequencies(), p.psd)
disclaimer: I'm the main author of Spectrum (http://pyspectrum.readthedocs.io/).
With the information below, can I say that my data follow a normal distribution?
If not, is there another way to check that?
PS: the data are pressure values from sensors inside oil pipelines, sampled every minute.
Try increasing the number of bins in your histogram plot.
You can also try visualising your data with a Q-Q plot, looking at other statistics such as kurtosis, or performing a test for normality (a minimal sketch follows below).
Looking at the sample size of your data, it seems it should produce something close to a normal distribution. However, looking at your histogram, there seem to be extreme values more than 2 s.d. from the mean. You might want to look into those to decide whether to add a dummy variable to account for them in any future analysis.
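A minimal sketch of the Q-Q plot, the extra statistics and a normality test, assuming the pressure readings are in a 1-D array (the name pressure and the synthetic data below are only for illustration):
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# hypothetical 1-D array of pressure readings; replace with your own data
pressure = np.random.normal(loc=100.0, scale=5.0, size=10000)
# Q-Q plot against normal quantiles
stats.probplot(pressure, dist="norm", plot=plt)
plt.title("Q-Q plot of pressure readings")
plt.show()
# skewness and kurtosis of the sample
print("skewness:", stats.skew(pressure))
print("excess kurtosis:", stats.kurtosis(pressure))
# D'Agostino-Pearson test for normality (a small p-value suggests non-normality)
k2, p_value = stats.normaltest(pressure)
print("normaltest p-value:", p_value)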
I have some large datasets of sensor values, each consisting of a single sensor value sampled at a one-minute interval, like a waveform. The total dataset spans a few years.
I wish to (using Python) select an arbitrary stretch of sensor data (for instance 600 values, i.e. 10 hours' worth of data) and find all the timestamps where roughly the same shape occurs in these datasets.
The matches should be made by shape (relative differences), not by actual values, as different sensors are used, with different biases and in different environments. Also, I wish to retrieve multiple matches within a single dataset, to analyse further.
I've been looking into pandas, but I'm stuck at the moment... any guru here?
I don't know much about the functionality available in pandas.
I think you first need to decide on the typical time span T over which the correlation is supposed to occur. What I would do is split all your time series into (possibly overlapping) segments of duration T using NumPy (see here for instance). This will lead to a long list of segments.
I would then compute the correlation between all pairs of segments using e.g. corrcoef. You get a large correlation matrix in which you can spot pairs of similar segments by applying a threshold on the absolute value of the correlation. You can estimate the correct threshold by applying this algorithm to a data set where you don't expect any correlation, or by randomising your data.
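A minimal sketch of that approach, assuming the readings are in a 1-D NumPy array (the names values, window and step, and the random data, are only for illustration):
import numpy as np
# hypothetical 1-D array of one-minute sensor readings; replace with your data
values = np.random.randn(5000)
window = 600   # segment length T, e.g. 10 hours of one-minute samples
step = 100     # start a new (overlapping) segment every 100 samples
# split the series into overlapping segments
starts = range(0, len(values) - window + 1, step)
segments = np.array([values[s:s + window] for s in starts])
# correlation between all pairs of segments (n_segments x n_segments matrix)
corr = np.corrcoef(segments)
# spot pairs of similar segments with a threshold on |correlation|
threshold = 0.9
i_idx, j_idx = np.where(np.abs(corr) > threshold)
pairs = [(i, j) for i, j in zip(i_idx, j_idx) if i < j]  # drop self/duplicate pairs
print(pairs[:10])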
I am writing some Python code to produce Kaplan-Meier (KM) curves using lifelines' KaplanMeierFitter, and I usually plot 4 curves in the same graph to compare different groups. The basic way to get a KM curve is:
from lifelines import KaplanMeierFitter
#Create the KMF object
KM_curve = KaplanMeierFitter()
#Give data to object. Status is 0 if alive, 1 if deceased (in my case)
KM_curve.fit(durations=My_Data["Time"], event_observed=My_Data["Status"])
#I do a figure in which I use this line 4 times (one per group)
KM_curve.plot(ci_show=False)
With those four lines of code and a pandas DataFrame (here called My_Data), KaplanMeierFitter automatically does all the calculations and plotting, but I was wondering if anyone knows how to stop the curve prematurely. I have made around 50 different graphs; they look nice and give me the information I need, but sometimes the last part of a curve drops dramatically (vertically) to 0%, or very close to it. That is strange, since none of my groups has 0 survivors at the end of my x-axis [see the red line in this example: https://i.stack.imgur.com/bn6Vy.png].
I did read that KM curves are good for seeing trends in the middle section, but that the last part of the curves may be misleading and has to be examined carefully. That is especially true if there are not enough patients left in a group, in which case the % survival estimate drops dramatically. Someone who does bioinformatics told me she usually stops plotting the curve once only 10% of the patients are left, to prevent this issue. Is it possible to do that with KaplanMeierFitter in Python?
There are a few ways to achieve this:
1.
ax = KM_curve.plot(ci_show=False)
ax.set_xlim(0, <your upper limit here>)
2.
KM_curve.plot(ci_show=False, loc=slice(0, <your upper limit here>))
See more documentation on loc and iloc here: https://lifelines.readthedocs.io/en/latest/fitters/univariate/KaplanMeierFitter.html#lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter.plot
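If you want the cut-off to be "when only 10% of the group is still at risk" rather than a fixed time, one possible sketch (assuming the event_table attribute that lifelines exposes; the 0.1 threshold is just the 10% rule of thumb mentioned in the question) is:
# after KM_curve.fit(...): find the first time at which fewer than 10% of the
# group is still at risk, and plot the curve only up to that time
at_risk = KM_curve.event_table["at_risk"]
below = at_risk[at_risk < 0.1 * at_risk.max()]
cutoff = below.index[0] if len(below) else at_risk.index[-1]
KM_curve.plot(ci_show=False, loc=slice(0, cutoff))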
I have a data set consisting of the number of page views over 6 months for 30k customers. It also contains the following:
Number of unique OSs used
Number of unique browsers used
Number of unique cookies used
All these numbers are taken over a period of six months.
Now I did try a normality test using:
from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)
This returns 0.0, meaning the data does not follow a normal distribution.
Now I want to know why that is. I thought that, generally, as the sample size increases we see a normal distribution in the data, and since this data set has 30k rows I could not understand why it was not normally distributed.
I did try converting the values into z-scores, but still no luck. Can I transform my data so that it has a normal distribution? Is there a method with which I can do that?
In the area I work in, we typically log-transform data that is heteroscedastic, like yours probably is. In my area (mass spectrometry), small values are far more likely than large ones, so we end up with an exponential distribution.
I'm guessing your data will look like mine, in which case you will need to log-transform it to make it approximately normally distributed. I would do this so that I can apply t-tests and other statistical models.
Something like:
import numpy as np
df_visits = df_visits.apply(np.log)
Of course, you will also need to get rid of any zeros before you can log-transform.
[Image: histograms before vs. after the log transform]
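As a small sketch of the zeros point, with a made-up pageviews column: you can either drop the zeros before taking the log, or use np.log1p (i.e. log(1 + x)), which is defined at zero:
import numpy as np
import pandas as pd
# hypothetical page-view counts; replace with your own data frame
df_visits = pd.DataFrame({"pageviews": [0, 1, 3, 10, 250, 4000]})
# option 1: drop the zeros, then log-transform
logged = np.log(df_visits.loc[df_visits["pageviews"] > 0, "pageviews"])
# option 2: log1p handles zeros without dropping rows
logged_1p = np.log1p(df_visits["pageviews"])
print(logged)
print(logged_1p)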
I am currently trying to get the frequency of the audio data that I have obtained from pyAudioStream.read(). I have already counted the number of zero crossings in the chunk, but now I want to determine the frequency from those zero crossings. I've heard it is possible, but I don't know how to do it and I can't seem to find it with a Google search. Can anyone help me out here?
Let's assume that the variable num_crossings holds the number of zero crossings in your chunk.
Then you have:
frequency = num_crossings * sampling_rate / (2 * len(chunk))
For frequency detection, you can also use Fourier transform (with numpy.fft for instance).
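A minimal sketch of both approaches on a synthetic signal (the names chunk and sampling_rate, and the 440 Hz tone, are only for illustration):
import numpy as np
sampling_rate = 44100                   # samples per second
t = np.arange(1024) / sampling_rate
chunk = np.sin(2 * np.pi * 440.0 * t)   # synthetic 440 Hz tone
# zero-crossing estimate: count sign changes, two per full period
num_crossings = np.sum(chunk[:-1] * chunk[1:] < 0)
freq_zero_crossing = num_crossings * sampling_rate / (2 * len(chunk))
# FFT estimate: pick the frequency bin with the largest magnitude
spectrum = np.abs(np.fft.rfft(chunk))
freqs = np.fft.rfftfreq(len(chunk), d=1 / sampling_rate)
freq_fft = freqs[np.argmax(spectrum)]
print(freq_zero_crossing, freq_fft)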