Check for normal distribution - python

With the information below, can I say that my data follow a normal distribution? If not, is there another way to check that?
PS: the data are pressure values inside oil pipelines, recorded every minute by sensors.

Try increasing the number of bins in your histogram plot.
You can also try visualising your data with a Q-Q plot, looking at other statistics such as kurtosis, or performing a formal test for normality; a short sketch of these checks is given below.
Looking at the sample size of your data, it seems it should produce something close to a normal distribution. However, looking at your histogram, there seem to be extreme values more than 2 s.d. out. You might want to look into those to determine whether you want to add a dummy variable to account for them in any future analysis.
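As a minimal sketch of those checks with scipy, assuming the pressure readings sit in a 1-D array called pressure (placeholder data is used here in place of the real sensor values):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

pressure = np.random.normal(loc=50.0, scale=2.0, size=10000)  # placeholder for the sensor readings

# Q-Q plot against a theoretical normal distribution
stats.probplot(pressure, dist="norm", plot=plt)
plt.show()

# Kurtosis and a formal normality test (D'Agostino-Pearson)
print("excess kurtosis:", stats.kurtosis(pressure))
stat, p = stats.normaltest(pressure)
print("normaltest p-value:", p)  # a very small p-value suggests the data is not normal
A roughly straight Q-Q plot together with a non-tiny p-value would make normality a reasonable working assumption.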

Related

How to tell the Kaplan-Meier Fitter (Python) to plot only 90% of data points to avoid sudden drops

I am writing some Python code to do Kaplan-Meier (KM) curves using the KM Fitter, and I usually plot 4 curves in the same graph to compare different groups. The basic way to get a KM curve is:
from lifelines import KaplanMeierFitter
#Create the KMF object
KM_curve = KaplanMeierFitter()
#Give data to object. Status is 0 if alive, 1 if deceased (in my case)
KM_curve.fit(durations=My_Data["Time"], event_observed=My_Data["Status"])
#I do a figure in which I use this line 4 times (one per group)
KM_curve.plot(ci_show=False)
With those 4 lines of code and a pandas DataFrame (here called My_Data), the KM Fitter automatically does all the calculations and plotting, but I was wondering if anyone knows how to stop the curve prematurely. I have done around 50 different graphs; they look nice and give me the info I need, but sometimes the last part of some curves drops dramatically (vertically) to 0%, or very close to it. That is odd, since none of my groups has 0 survivors at the end of my x-axis [see the red line in this example: https://i.stack.imgur.com/bn6Vy.png ]
I did read that KM curves are good for seeing trends in the middle section, but the last part of the curve may be misleading and has to be examined carefully. That is especially true if there are not enough patients left in a group, so the survival estimate drops dramatically. Someone who does bioinformatics told me she usually stops plotting the curve when 10% of patients are left, to prevent this issue. Is it possible to do that with the Python KMF?
There are a few ways to achieve this:
1. Limit the x-axis of the plot:
ax = KM_curve.plot(ci_show=False)
ax.set_xlim(0, <your upper limit here>)
2. Slice the fitted curve itself with loc:
KM_curve.plot(ci_show=False, loc=slice(0, <your upper limit here>))
See more documentation on loc and iloc here: https://lifelines.readthedocs.io/en/latest/fitters/univariate/KaplanMeierFitter.html#lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter.plot
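If you specifically want the rule of thumb of cutting each curve once only ~10% of that group remains at risk, one hedged sketch (not part of the original answer) is to read the at_risk column of the fitter's event_table and pass the resulting time to loc. Here cutoff_time is a hypothetical helper name and My_Data is the DataFrame from the question:
from lifelines import KaplanMeierFitter

def cutoff_time(kmf, min_fraction=0.10):
    # event_table is indexed by event time and has an "at_risk" column
    table = kmf.event_table
    initial = table["at_risk"].iloc[0]
    below = table[table["at_risk"] < initial * min_fraction]
    # If the at-risk count never drops that low, keep the full curve
    return below.index[0] if len(below) else table.index[-1]

KM_curve = KaplanMeierFitter()
KM_curve.fit(durations=My_Data["Time"], event_observed=My_Data["Status"])
KM_curve.plot(ci_show=False, loc=slice(0, cutoff_time(KM_curve)))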

How to convert data into normal distribution

I have a data set consisting of the number of page views over 6 months for 30k customers. It also contains the following:
Number of unique OSes used
Number of unique browsers used
Number of unique cookies used
All these numbers are taken over the same six-month period.
I did try running a normality test using:
from scipy.stats import normaltest
k2, p = normaltest(df)
print(p)
which returns 0.0, meaning the data does not follow a normal distribution.
Now I want to know why that is. I thought that, generally, as the sample size increases we see a normal distribution in the data, and since the data has a size of 30k I could not understand why it was not normally distributed.
I did try converting the values into z-scores, but still no luck. Can I transform my data so that it follows a normal distribution? Is there a method I can use to do that?
In the area I work in, we typically log-transform data which is heteroscedastic, as yours probably is. In my area (mass spectrometry), small values are far more likely than large ones, so we end up with an exponential-looking distribution.
I'm guessing your data will look like mine, in which case you will need to log-transform it to make it approximately normally distributed. I would do this so that I can apply t-tests and other statistical models.
Something like:
import numpy as np
df_visits = df_visits.apply(lambda x: np.log(x))
Of course, you will also need to get rid of any zeros before you can log-transform.
[Image showing pre vs. post log transform]
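As a hedged sketch of that idea with placeholder data, using np.log1p (i.e. log(1 + x)) as one common way to sidestep the zeros rather than dropping them:
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Placeholder for the real page-view counts
df_visits = pd.DataFrame({"page_views": np.random.geometric(0.01, 30000)})

df_log = np.log1p(df_visits)  # log(1 + x), defined at zero
k2, p = normaltest(df_log["page_views"])
print("normaltest p-value after transform:", p)
Whether the transformed data actually passes a normality test depends on the shape of the original counts; the main point of the transform is to tame the long right tail.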

Negative power spectrum

I've got a set of 780 monthly temperature anomalies spanning 65 years, and I want to analyse them for the frequencies driving the anomalies. I've used the spectrum package to do this; I've included pictures of the series before and after the analysis.
from spectrum import *
import matplotlib.pyplot as plt

p = Periodogram(anomalies, sampling=1/12)
p.run()
plt.title('Power Spectrum of Monthly Temperature Anomalies')
p.plot(marker='.')
plt.show()
The resulting spectrum has several clear negative spikes. Now, I understand that a negative value in dB isn't actually a negative absolute value, but why is this happening? Does it imply there is some specific frequency missing from my data? Because a positive spike implies one is present.
Also, why are most of the displayed values negative? What is the reference value that the dB figures are an amplification of?
Part of me thinks I should be taking the absolute value of this spectrum, but I'd like to understand why, if that's the case. Also, I set sampling to 1/12 because the data points are monthly, so hopefully the frequency scale is in cycles per year?
Many thanks; this is my first post here, so let me know if I need to be clearer about anything.
[Image: power spectrum showing negative values]
[Image: the series being analysed]
As you can see on the y-axis of the plots, the units are in dB (decibels, https://en.wikipedia.org/wiki/Decibel), so what you see is not the raw data (in the frequency domain) but something like 10*log10(data). This explains the presence of negative values and is perfectly normal.
Here you have both positive and negative values, but in general you would normalise the data (by its maximum) so that all values are negative and the highest value is set to 0. This is possible using:
p.plot(norm=True)
You can plot the data without the log function, but then you need to work with the raw data in the frequency domain. For instance, to reproduce the behaviour of the p.plot function, you can use:
from pylab import plot, log10
plot(p.frequencies(), 10*log10(p.psd/p.psd.max()))
So, if you do not want to use the decibel unit, use:
from pylab import plot
plot(p.frequencies(), p.psd)
disclaimer: I'm the main author of Spectrum (http://pyspectrum.readthedocs.io/).
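Putting the above together, a minimal self-contained sketch (placeholder anomaly series; the sampling value simply mirrors the question) that plots the raw, non-logged spectrum with matplotlib could look like:
import numpy as np
import matplotlib.pyplot as plt
from spectrum import Periodogram

anomalies = np.random.randn(780)  # placeholder for the real monthly anomaly series

p = Periodogram(anomalies, sampling=1/12)
p.run()

plt.plot(p.frequencies(), p.psd)  # raw power values, always non-negative
plt.xlabel('Frequency')
plt.ylabel('Power')
plt.title('Raw Power Spectrum of Monthly Temperature Anomalies')
plt.show()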

Extracting the threshold value of some given raw numbers

I am trying to determine the condition of a wireless channel by analysing captured I/Q samples. I have 50000 data samples and, as shown in the attached figure, there are spikes in the graph whenever there is activity (e.g. a data transmission) on the channel. I am trying to count the number of spikes, i.e. data values higher than a threshold.
I need an accurate estimate of the threshold, and then I can find the channel load. The threshold value in the attached figure is around 0.0025, and it should be noted that it varies over time. So, each time I take 50000 samples, I have to find the threshold value first, using some sort of unsupervised learning.
I tried k-means (in Python scikit-learn) to cluster the data and find the centroids of the estimated clusters, but it can't give me a good estimate of the threshold value (especially when there is no activity on the channel and the channel is idle).
Does anyone have prior experience with similar problems?
[Image: captured data]
Since the idle noise seems relatively consistent and very different from when data is transmitted, I can think of several simple algorithms that could give you a reasonable threshold in an unsupervised manner.
The most direct method would be to sort the values (perhaps first grouping them into buckets), then find the lowest-valued region where a large enough proportion (at least ~5%) of the values falls. Take a reasonable margin above the highest of those values (50%?) and you should be good to go.
You'll need to fiddle with the thresholds a bit. I'd collect sample data and tweak the values until it works 100% of the time and the values used make sense. A rough sketch of this idea follows.
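This is one possible reading of that bucket-based heuristic: the knob values (bins, density_cutoff, margin) are made up and would need tuning against real captures, the ~5% size check is omitted for brevity, and estimate_threshold is a hypothetical helper name.
import numpy as np

def estimate_threshold(samples, bins=200, density_cutoff=0.01, margin=0.5):
    # Bucket the samples, walk up from the densest bucket (assumed to sit
    # inside the idle-noise cluster) until the bucket counts fall off, then
    # add a relative margin above that point.
    samples = np.asarray(samples)
    counts, edges = np.histogram(samples, bins=bins)
    start = int(np.argmax(counts))          # densest bucket
    end = start
    while end < bins and counts[end] > density_cutoff * counts[start]:
        end += 1
    noise_top = edges[end]                  # upper edge of the noise cluster
    return noise_top * (1.0 + margin)       # e.g. 50% margin

# Hypothetical usage with synthetic "captured" magnitudes
rng = np.random.default_rng(0)
samples = np.abs(rng.normal(0.0, 0.0005, 50000))   # idle noise
samples[rng.random(50000) < 0.01] += 0.004         # a few "transmission" spikes
threshold = estimate_threshold(samples)
n_spikes = int(np.sum(samples > threshold))
print(f"threshold ~ {threshold:.4f}, spikes counted: {n_spikes}")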

How to set x lim to 99.5 percentile of my data series for matplotlib histogram?

I'm currently producing some histograms with matplotlib. The issue is that, because of one or two outliers, my whole graph is squashed and almost impossible to read (there are two separate histograms plotted on the same figure). The fix I am having trouble with is dropping the outliers at around the 99th/99.5th percentile. I have tried:
plt.xlim([np.percentile(df,0), np.percentile(df,99.5)])
plt.xlim([df.min(),np.percentile(df,99.5)])
It seems like it should be a simple fix, but I'm missing some key piece of information to make it happen. Any input would be much appreciated; thanks in advance.
To restrict focus to just the middle 99% of the values, you could do something like this:
trimmed_data = df[(df.Column > df.Column.quantile(0.005)) & (df.Column < df.Column.quantile(0.995))]
Then you could do your histogram on trimmed_data. Exactly how to exclude outliers is more of a stats question than a Python question, but the idea I was suggesting in a comment is to clean up the data set using whatever methods you can defend, and then do everything (plots, stats, etc.) on the cleaned data only, rather than trying to tweak each individual plot to look right while the outlier data is still in there.
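As a minimal sketch (made-up data, and the hypothetical column name Column carried over from the snippet above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({"Column": np.concatenate([rng.normal(10, 2, 9950),
                                             rng.normal(200, 5, 50)])})  # a few extreme outliers

lo, hi = df["Column"].quantile([0.005, 0.995])
trimmed_data = df[(df["Column"] > lo) & (df["Column"] < hi)]

plt.hist(trimmed_data["Column"], bins=50)
plt.show()
If you would rather keep the outliers in the data and only hide them in the plot, passing those same quantile values to plt.xlim also works, which is essentially what the original attempt was doing.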
