Python: Applying a CDF after Fitter

I would like to apply the best-fit CDF found by Fitter to each value in a number of pandas DataFrame columns, ideally by passing the Fitter results to scipy.stats (or another library).
I can get the distribution function easily enough from Fitter with the following code:
import numpy as np
import pandas as pd
import seaborn as sns
from fitter import Fitter
from fitter import get_common_distributions
from fitter import get_distributions
dataset = pd.read_csv("econ.csv")
dataset.head()
sns.set_style('white')
sns.set_context("paper", font_scale = 2)
sns.displot(data = dataset, x = "Value_1",kind = "hist", bins = 100, aspect = 1.5)
spac = dataset['Value_1'].values
f = Fitter(spac, distributions=get_distributions())
f.fit()
f.summary()
f.get_best(method='sumsquare_error')
This provides me with an output for Value_1:
{'norminvgauss': {'a': 1.87,
'b': -0.65,
'loc': 0.46,
'scale': 1.24}}
Now this is where I am stuck:
Is there a way to pass this information back to Scipy Stats (or another library) so I can calculate the cumulative distribution function (CDF) of the best fit for each value in each column?
The dataset columns range from Value_1 to Value_99, with about 400 rows. Once I know how to feed the Fitter results back into scipy.stats, I should be able to write a simple for loop to apply this to each column.
An example of the result would look like this:

ID     Value1    CDF_BestFit_Value1
n      0.9       0.33
n+1    0.7       0.07
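For what it's worth, this is roughly what I am imagining for a single column, assuming the name returned by get_best matches a scipy.stats distribution of the same name and that the fitted parameters can be passed straight through as keyword arguments (untested sketch):
import scipy.stats as st
best = f.get_best(method='sumsquare_error')   # e.g. {'norminvgauss': {'a': ..., 'b': ..., 'loc': ..., 'scale': ...}}
dist_name, params = list(best.items())[0]
dist = getattr(st, dist_name)                 # look up the matching scipy.stats distribution
dataset['CDF_BestFit_Value1'] = dist.cdf(dataset['Value_1'], **params)
The same lines would then go inside the for loop over Value_1 to Value_99.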
Much appreciated in advance to anyone who is able to help with this.

Related

QQ-plot python mean and standard deviation

I am trying to plot a Q-Q plot using Python. I was checking scipy.stats.probplot, and the input seems to be the measurements compared against a normal distribution.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
and in my code, I had
stats.probplot(mean, dist="norm", plot=plt)
to compare distributions.
But I am wondering: where can I input the standard deviation? I think that's a very important factor when comparing distributions, but so far I can only input the mean.
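Edit: I did notice that probplot has an sparams argument, described in the docs as the distribution's shape parameters plus location and scale. Is that where the mean and standard deviation would go, i.e. something like this (untested)?
stats.probplot(measurements, sparams=(20, 5), dist="norm", plot=plt)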
Thanks
Let's suppose you have a list of floats:
X = [-1.31, 4.82, 2.18, 1.99, 4.37, 2.58, 7.22, 3.93, 6.95, 2.41,
     2.02, 2.48, -1.01, 2.3, 2.87, -0.06, 2.13, 3.62, 5.24, 0.57]
If you want to make a QQ-plot test, you need to compare X against a distribution.
For example: N(0, 1), a normal distribution with mean = 0 and sigma = 1.
In OpenTURNS, it goes like this:
import openturns as ot
from openturns.viewer import View

sample = ot.Sample([[p] for p in X])
graph = ot.VisualTest.DrawQQplot(sample, ot.Normal(0, 1))
View(graph)
Explanation: I tell OpenTURNS that I have a sample of 20 points [p] coming from X, not one point in dimension 20. Then I call ot.VisualTest.DrawQQplot with two arguments: the sample and the normal distribution N(0, 1), ot.Normal(0, 1).
We see on the graph that the test fails:
The question now is: what is the best Normal Distribution fitting the sample?
Thanks to NormalFactory() the answer is simple:
BestNormalDistribution = ot.NormalFactory().build(sample)
If you print(BestNormalDistribution) you get the parameters of this distribution:
Normal(mu = 2.76832, sigma = 2.27773)
If we repeat the QQ-plot test of the sample against BestNormalDistribution, the result is much better.
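A minimal sketch of that second test, reusing the sample and the View import from above:
graph = ot.VisualTest.DrawQQplot(sample, BestNormalDistribution)
View(graph)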

mono-energetic gamma ray mean free path

I am writing a code for a mono-energetic gamma beam for which the dominant interaction is photoelectric absorption, with mu = 2 cm⁻¹. I need to generate 50000 random numbers and sample the interaction depth (which I am not sure I did correctly).
I know that the mean free path = 1/mu, but I need to find the mean free path from the simulation as well as from mu and compare them. Is what I did in the code right or not?
import random
import matplotlib.pyplot as plt
import numpy as np
mu=(2)
random.seed=()
data = np.random.randn(50000)*10
bins = np.arange(data.min(), data.max()+1e-8, 0.1)
meanfreepath = 1/mu
print(meanfreepath)
plt.hist(data, bins=bins)
plt.show()
Well, the interaction depth distribution is an exponential one, not a Gaussian: p(x) = mu * exp(-mu*x), whose mean is 1/mu, i.e. exactly the mean free path. So the code would be:
import numpy as np

lmbda = 2            # absorption coefficient, cm^-1
beta = 1.0/lmbda     # scale parameter = mean free path, cm
data = np.random.exponential(scale=beta, size=50000)
mfp = np.mean(data)  # mean free path estimated from the simulation
print(mfp)
# build histogram, e.g. plt.hist(data, bins=100); plt.show()
More details at https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.exponential.html
The code above produced 0.4977168417102998, which looks like 2⁻¹ to me.

How to plot a value_counts in pandas that has a huge number of different counts not distributed evenly

Let's say I have the following data:
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
s2.value_counts(normalize=True).plot()
What I want to show in the plot is that a few numbers make up the majority of cases. The problem is that these will be squeezed into the far left side of the graph, followed by a long, flat line for all the other categories.
In the real data the x axis will be categorical with about 18000 categories; roughly 4% of the categories will have counts around 10000, and the rest will drop off to around 50.
I want to show this to an audience of "ordinary" business people, so it can't be some fancy, hard-to-read solution.
Update: see @unutbu's answer below.
I updated the code and I'm getting an error from qcut when trying to use tuples:
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'
df = pd.DataFrame({'s1':[1,0,1,0], 's2':[1,0,1,1], 's3':[1,0,1,1], 's4':[0,0,0,1]})
perms = df.apply(tuple, axis=1)
prob = perms.value_counts(normalize=True).reset_index(drop=True)
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
You could keep the normalized value counts above a certain threshold. Then sum together the values below the threshold and clump them together in one category which could be called, say, "other".
By choosing the threshold high enough, you will be able to display the most important contributors to the overall probability distribution, while still showing the size of the tail in the bar labeled "other":
import matplotlib.pyplot as plt
import pandas as pd
s2 = pd.Series([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
prob = s2.value_counts(normalize=True)
threshold = 0.02
mask = prob > threshold
tail_prob = prob.loc[~mask].sum()
prob = prob.loc[mask]
prob['other'] = tail_prob
prob.plot(kind='bar')
plt.xticks(rotation=25)
plt.show()
There is a limit to the number of category labels you can sensibly display on a
bar graph. For a normal-sized graph 3000 is way too many. Moreover, it is
probably not reasonable to expect an audience to glean any meaning out of
reading 3000 labels.
The graph should summarize the data. And the main point seems to be that 4 or 5% of the categories constitute the vast majority of the cases. So to drive home that point, perhaps use pd.qcut to categorize the cases into simple categories such as bottom 25%, mid 70%, and top 5%:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = 18000
categories = np.arange(N)
np.random.shuffle(categories)
M = int(N*0.04)
prob = pd.Series(np.concatenate([np.random.randint(9000, 11000, size=M),
np.random.randint(0, 100, size=N-M), ]), index=categories)
prob /= prob.sum()
category_classes = pd.qcut(prob, q=[0, .25, 0.95, 1.],
labels=['bottom 25%', 'mid 70%', 'top 5%'])
prob_groups = prob.groupby(category_classes).sum()
prob_groups.plot(kind='bar')
plt.xticks(rotation=0)
plt.show()
Just log the axis (I have no pandas, but it should be similar):
import numpy as np
import matplotlib.pyplot as plt
s2 = np.log([1,2,3,4,5,2,3,333,2,123,434,1,2,3,1,11,11,432,3,2,4,3,3,3,54,34,24,2,223,2535334,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30000, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2])
plt.plot(s2)
plt.show()
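If you do have pandas, the same idea, a log scale on the y axis, can be sketched like this (assuming s2 is the pandas Series from the question and matplotlib is the plotting backend):
import matplotlib.pyplot as plt
# s2: the pandas Series defined in the question
s2.value_counts(normalize=True).plot(logy=True)   # log-scale the y axis rather than log-transforming the values
plt.show()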

Filtering accelerometry data in scipy

I'm new to Python and scipy, and I am trying to filter acceleration data taken in 3 dimensions at 25 Hz. I'm having a weird problem: after applying the filter, the graph of my data is smoothed, however the values seem to be amplified quite a bit depending on the order and cutoff frequencies of the filter. Here is my code:
from scipy import loadtxt
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
my_data = loadtxt("DATA-001.CSV",delimiter=",",skiprows=8)
N, Wn = signal.buttord( [3,11], [.3,18], .1, 10, True)
print(N)
print(Wn)
b,a = signal.butter(N, Wn, 'bandpass', analog=True)
filtered_z = signal.filtfilt(a,b,[my_data[1:500,3]],)
filtered_z = np.reshape(filtered_z, (499,))
plt.figure(1)
plt.subplot(411)
plt.plot(my_data[1:500,0],my_data[1:500,3])
plt.subplot(412)
plt.plot(my_data[1:500,0], filtered_z, 'k')
plt.show()
Right now, this code returns this graph:
I'm unsure how to get rid of this weird gain issue; does anyone have any suggestions?
Thank you!
You have your coefficients the wrong way around in signal.filtfilt. Should be:
filtered_z = signal.filtfilt(b,a,[my_data[1:500,3]],)
The size and ratio of the coefficients can result in amplification of the signal.
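If you want to sanity-check the gain of the analog band-pass before applying it, something along these lines (a rough sketch) plots its magnitude response with scipy.signal.freqs:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

N, Wn = signal.buttord([3, 11], [.3, 18], .1, 10, True)
b, a = signal.butter(N, Wn, 'bandpass', analog=True)
w, h = signal.freqs(b, a, worN=np.logspace(-1, 2, 500))   # analog response, frequencies in rad/s
plt.semilogx(w, 20 * np.log10(abs(h)))
plt.xlabel('Frequency [rad/s]')
plt.ylabel('Gain [dB]')
plt.show()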

Why is there a difference in magnitude response between scipy.filtfilt and scipy.lfilter?

I was trying to filter a signal using the scipy module of python and I wanted to see which of lfilter or filtfilt is better.
I tried to compare them, and I got the following plot from my MWE:
import numpy as np
import scipy.signal as sp
import matplotlib.pyplot as plt
frequency = 100. #cycles/second
samplingFrequency = 2500. #samples/second
amplitude = 16384
signalDuration = 2.3
cycles = frequency*signalDuration
time = np.linspace(0, 2*np.pi*cycles, int(signalDuration*samplingFrequency))
freq = np.fft.fftfreq(time.shape[-1])
inputSine = amplitude*np.sin(time)
#Create IIR Filter
b, a = sp.iirfilter(1, 0.3, btype = 'lowpass')
#Apply filter to input
filteredSignal = sp.filtfilt(b, a, inputSine)
filteredSignalInFrequency = np.fft.fft(filteredSignal)
filteredSignal2 = sp.lfilter(b, a, inputSine)
filteredSignal2InFrequency = np.fft.fft(filteredSignal2)
plt.close('all')
plt.figure(1)
plt.subplot(121)
plt.title('Sine filtered with filtfilt')
plt.plot(freq, abs(filteredSignalInFrequency))
plt.subplot(122)
plt.title('Sine filtered with lfilter')
plt.plot(freq, abs(filteredSignal2InFrequency))
print(max(abs(filteredSignalInFrequency)))
print(max(abs(filteredSignal2InFrequency)))
plt.show()
Can someone please explain why there is a difference in the magnitude response?
Thanks a lot for your help.
Looking at your graphs shows that the signal filtered with filtfilt has a peak magnitude of 4.43×10⁷ in the frequency domain, compared with 4.56×10⁷ for the signal filtered with lfilter. In other words, the peak magnitude after filtfilt is about 0.97 times the peak magnitude after lfilter.
Now we should note that scipy.signal.filtfilt applies the filter twice (once forward and once backward), whereas scipy.signal.lfilter applies it only once. As a result, the input signal is attenuated by the filter's magnitude response twice, i.e. by its square. To confirm this, we can look at the frequency response of the Butterworth filter you used (obtained with iirfilter) around the input tone's normalized frequency of 100/2500 = 0.04:
which indeed shows that a single application of this filter causes an attenuation of ~0.97 at a frequency of 0.04, exactly the extra factor observed above.
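A quick numeric check of that explanation (a sketch; the fs argument of freqz needs scipy >= 1.2):
import scipy.signal as sp

b, a = sp.iirfilter(1, 0.3, btype='lowpass')
w, h = sp.freqz(b, a, worN=[0.04], fs=1)   # response at 100/2500 = 0.04 cycles/sample
gain = abs(h[0])
print(gain)        # ~0.97: one pass, as applied by lfilter
print(gain ** 2)   # ~0.94: two passes, as applied by filtfilt (forward and backward)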
