I am trying to draw a Q-Q plot using Python. I was looking at scipy.stats.probplot, and its input seems to be the measurements, which are compared against a normal distribution.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
In my own code, I had
stats.probplot(mean, dist="norm", plot=plt)
to compare distributions.
But where can I specify the standard deviation? I thought that was a very important factor when comparing distributions, but so far I can only pass in the mean.
Thanks
Let's suppose you have a list of floats:
X = [-1.31,
4.82,
2.18,
1.99,
4.37,
2.58,
7.22,
3.93,
6.95,
2.41,
2.02,
2.48,
-1.01,
2.3,
2.87,
-0.06,
2.13,
3.62,
5.24,
0.57]
If you want to run a Q-Q plot test, you need to compare X against a distribution.
For example, N(0, 1), a normal distribution with mean 0 and standard deviation 1.
In OpenTURNS, it goes like this:
import openturns as ot
from openturns.viewer import View
# Wrap each value in its own list: 20 points of dimension 1
sample = ot.Sample([[p] for p in X])
graph = ot.VisualTest.DrawQQplot(sample, ot.Normal(0,1))
View(graph)
Explanation: wrapping each value as [p] tells OpenTURNS that I have a sample of 20 points of dimension 1 coming from X, not 1 point of dimension 20. Then I call ot.VisualTest.DrawQQplot with 2 arguments: the sample and the normal distribution ot.Normal(0,1).
We see on the graph that the test fails:
The question now is: what is the best Normal Distribution fitting the sample?
Thanks to NormalFactory() the answer is simple:
BestNormalDistribution = ot.NormalFactory().build(sample)
If you print(BestNormalDistribution) you get the parameters of this distribution:
Normal(mu = 2.76832, sigma = 2.27773)
If we repeat the Q-Q plot test of sample against BestNormalDistribution, the result would be much better.
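For readers staying with SciPy, as in the question, a rough equivalent of the steps above might look like the sketch below. It assumes that stats.norm.fit is an acceptable stand-in for NormalFactory (it returns maximum-likelihood estimates, so the numbers may differ slightly from the OpenTURNS output), and it uses the sparams argument of stats.probplot to pass the location and scale of the reference normal distribution.

```python
import numpy as np
import scipy.stats as stats

X = [-1.31, 4.82, 2.18, 1.99, 4.37, 2.58, 7.22, 3.93, 6.95, 2.41,
     2.02, 2.48, -1.01, 2.3, 2.87, -0.06, 2.13, 3.62, 5.24, 0.57]

# Maximum-likelihood estimates of the mean (loc) and standard deviation (scale).
mu, sigma = stats.norm.fit(X)

# sparams carries (loc, scale) for dist="norm". Without plot=..., probplot
# only returns the quantile pairs and the least-squares line through them.
(theoretical_q, ordered_x), (slope, intercept, r) = stats.probplot(
    X, sparams=(mu, sigma), dist="norm"
)

print(f"fitted mean = {mu:.3f}, fitted standard deviation = {sigma:.3f}")
print(f"Q-Q line: slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

Passing plot=pylab (or plot=plt) to the same call draws the plot. A slope near 1 and an intercept near 0 suggest that the sample is close to the reference distribution.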
Let's say I have the following data:
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()
What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that this shows up only at the far left of the graph, followed by a flat line for all the other categories.
In the real data the x axis will be categorical with about 18000 categories. About 4% of the counts will be around 10000, and the rest will drop off to around 50.
I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.
Update: see @unutbu's answer.
I updated the code, and I'm getting an error from qcut when trying to use tuples.
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop='True')
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
You could keep the normalized value counts above a certain threshold. Then sum together the values below the threshold and clump them together in one category which could be called, say, "other".
By choosing a suitable threshold, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":
import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
There is a limit to the number of category labels you can sensibly display on a
bar graph. For a normal-sized graph 3000 is way too many. Moreover, it is
probably not reasonable to expect an audience to glean any meaning out of
reading 3000 labels.
The graph should summarize the data. And the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to categorize the cases into simple categories such as bottom 25%, mid 70%, and top 5%:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
np.random.randint(0, 100, size=N-M), ]), index=categories)
prob /= prob.sum()
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
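If a single headline number is easier for the audience than a chart, the same idea can be sketched as a cumulative share: how many of the most frequent categories are needed to cover a given fraction of the cases. The helper name categories_for_share and the toy data below are hypothetical, chosen only for illustration.

```python
import pandas as pd

def categories_for_share(values, share=0.8):
    """Return (n, covered): the n most frequent categories and the share they cover."""
    # value_counts sorts by frequency, highest first.
    prob = values.value_counts(normalize=True)
    cum = prob.cumsum()
    # Count the categories whose running total is still below the target, plus one more.
    # Clamp to the number of categories to guard against round-off at share=1.0.
    n = min(int((cum < share).sum()) + 1, len(cum))
    return n, float(cum.iloc[n - 1])

# Hypothetical toy data: two dominant categories and a tail of singletons.
toy = pd.Series(list("aaaaaabbbbcdefg"))
n, covered = categories_for_share(toy, share=0.6)
print(f"{n} of {toy.nunique()} categories cover {covered:.0%} of the cases")
```

A sentence such as "n categories out of 18000 account for most of the cases" can then sit next to the bar chart as its summary.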
Just take the log of the values (I don't have pandas, but it should be similar):
import numpy as np
import matplotlib.pyplot as plt
s2 = np.log([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.show()
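One caveat with the snippet above: the data contains zeros, and np.log(0) evaluates to -inf (with a runtime warning), so those points cannot be drawn. A possible workaround, offered as a sketch and not as the answerer's method, is np.log1p, which computes log(1 + x) and maps zero to zero:

```python
import numpy as np

# A short excerpt of the data from the question, including zeros and the large outliers.
data = np.array([1, 2, 3, 333, 434, 2535334, 0, 0, 30000, 2])

# log1p(x) = log(1 + x), so zeros stay finite instead of becoming -inf.
scaled = np.log1p(data)
print(scaled)
```

The result can be passed to plt.plot in the same way as s2 above.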
I'm new to Python and SciPy, and I am trying to filter acceleration data taken in 3 dimensions at 25 Hz. I'm having a weird problem: after applying the filter, the graph of my data is smoothed, but the values seem to be amplified quite a bit depending on the order and cutoff frequencies of the filter. Here is my code:
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
# loadtxt lives in NumPy; the scipy.loadtxt alias is deprecated
my_data = np.loadtxt("DATA-001.CSV", delimiter=",", skiprows=8)
N, Wn = signal.buttord( [3,11], [.3,18], .1, 10, True)
print(N)
print(Wn)
b,a = signal.butter(N, Wn, 'bandpass', analog=True)
filtered_z = signal.filtfilt(a,b,[my_data[1:500,3]],)
filtered_z = np.reshape(filtered_z, (499,))
plt.figure(1)
plt.subplot(411)
plt.plot(my_data[1:500,0],my_data[1:500,3])
plt.subplot(412)
plt.plot(my_data[1:500,0], filtered_z, 'k')
plt.show()
Right now, this code returns this graph:
I'm unsure how to get rid of this weird gain issue. Does anyone have any suggestions?
Thank you!
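You have your coefficients the wrong way around in signal.filtfilt. It expects the numerator b first and the denominator a second, so the call should be:
filtered_z = signal.filtfilt(b,a,[my_data[1:500,3]],)
With the two swapped, you apply the inverse of the filter you designed, so the size and ratio of the coefficients can result in amplification of the signal.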
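To check the argument order without the original CSV file, here is a small sketch on a synthetic signal. The 5 Hz test tone, the constant offset, the fixed filter order of 4, and the digital design via the fs argument are all assumptions made for illustration (the question designs an analog filter with buttord). Only the 25 Hz sampling rate and the 3 to 11 Hz passband come from the question.

```python
import numpy as np
from scipy import signal

fs = 25.0                              # sampling rate from the question, in Hz
t = np.arange(500) / fs
tone = np.sin(2 * np.pi * 5.0 * t)     # hypothetical in-band 5 Hz component
x = tone + 1.0                         # plus a hypothetical constant offset

# Digital band-pass design; order 4 is an arbitrary choice for this sketch.
b, a = signal.butter(4, [3.0, 11.0], btype="bandpass", fs=fs)

# Numerator b first, denominator a second.
filtered = signal.filtfilt(b, a, x)

# Away from the edges, the in-band tone should pass with roughly unit gain
# and the constant offset should be removed.
middle = slice(100, 400)
error = np.max(np.abs(filtered[middle] - tone[middle]))
print("max |filtered - tone| in the middle:", error)
```

If the printed difference is small, the filter is passing the in-band component without the gain seen in the question.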