I am running an OLS regression with statsmodels, using clustered standard errors (https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html):
import statsmodels.api as sm
import statsmodels.formula.api as smf
model = smf.ols(model_string, data=df1).fit(cov_type='cluster', cov_kwds={'groups': df1['correctedSE']})
I run a loop with different specifications and would like to visualise the coefficients from all iterations.
I extract the coefficient and standard error in every iteration with:
coe = model.params["variable_of_interest"]
se = model.bse["variable_of_interest"]
List_coefficients.append(coe)
List_standard_errors.append(se)
I create a simple dataframe out of the lists and then want to visualise the coefficients with matplotlib's errorbar:
import matplotlib
import matplotlib.pyplot as plt
ax = plt.errorbar(df["loop_rounds"], df["Coefficient"], yerr=df["Standard_e"])
However, when doing so, the confidence intervals are somehow calculated differently, and some coefficients now appear significant in the graph.
What would be the best way to correct for the calculation difference (e.g. extract different parameters from statsmodels, manually adjust the standard errors, or change the errorbar call)?
If manual adjustment is the solution, where can I find the formula for the clustered standard errors?
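One option I am considering (a minimal sketch, assuming the mismatch is simply that errorbar draws bars of exactly ±yerr, so passing raw standard errors gives roughly a 68% interval rather than the 95% interval statsmodels reports; df and the column names are the ones from my snippets above):
import scipy.stats as st
import matplotlib.pyplot as plt

# Scale the standard errors by the normal critical value (~1.96) so the bars
# approximately match a 95% confidence interval. (statsmodels' own intervals
# for clustered errors may use a t critical value instead.)
crit = st.norm.ppf(0.975)
plt.errorbar(df["loop_rounds"], df["Coefficient"],
             yerr=crit * df["Standard_e"], fmt="o", capsize=3)
plt.show()
Alternatively, the interval bounds could be taken directly from model.conf_int().loc["variable_of_interest"] inside the loop, instead of rebuilding them from the standard errors.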
Thank you
I am writing code to obtain a linear regression of some measurements. I have used different codes, but with all of them I get strange results. Instead of being a line with a constant slope, the line I get is first horizontal, and then the slope drops between the penultimate point and the last point.
The code I am using is:
import matplotlib.pyplot as plt
import numpy as np
x0 = [0.00000001, 0.000001, 0.0001, 0.01]
y0 = [0.9974209723854539, 0.9945196648709005, 0.9914759279447916, 0.9852556749265332]
x = np.array(x0)
y = np.array(y0)
m, b = np.polyfit(x, y, 1)  # linear fit in x
print(m, b)
plt.scatter(x, y)
plt.plot(x, m*x + b, color='green')
plt.xscale('log')
d = ['linear fit ' + str(round(m, 4)) + '*x+' + str(round(b, 4)), 'real measure']
plt.legend(d, loc="upper right", borderaxespad=0.1, title="")
And I get the following graph:
[Python plot]
This is very different from what I should get, which I have been able to draw in Origin:
[Origin plot]
I have tried various linear fit methods, but all of them give this shape, which is wrong.
Hopefully you can help me find the error.
Thank you very much.
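In case it helps narrow things down: a line that is straight in x cannot look straight on a log-scaled x axis, because the points span several orders of magnitude. If the goal is a line that looks straight on that axis, one option is to fit y against log10(x) instead. A minimal sketch using the numbers above (this is a guess at the intent, not necessarily what Origin does):
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0.00000001, 0.000001, 0.0001, 0.01])
y = np.array([0.9974209723854539, 0.9945196648709005,
              0.9914759279447916, 0.9852556749265332])

# Fit y against log10(x), i.e. against the quantity that is evenly spaced
# on the log-scaled axis, then draw the fitted line on that same axis.
m, b = np.polyfit(np.log10(x), y, 1)

plt.scatter(x, y, label='real measure')
plt.plot(x, m * np.log10(x) + b, color='green',
         label='fit: ' + str(round(m, 4)) + '*log10(x)+' + str(round(b, 4)))
plt.xscale('log')
plt.legend(loc="upper right")
plt.show()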
Apologies. I know what I want to do, but am not sure what it is called and so haven't been able to search for it.
I am chasing down some anomalies in data (two reports which should add to the same total based on about 50K readings differ slightly). I therefore want to generate some random data which is the same "shape" as the data in question in order to determine whether this might be down to rounding error.
Is there a way of analysing the existing 50K or so numbers and then generating random numbers which would look pretty much the same shape on a histogram? My presumption is that numpy is probably the best tool for this, but I am open to advice.
You can use scipy's stats package to do this, if I'm interpreting your question correctly:
First, we compute a histogram of the data and build a distribution from it with the scipy.stats.rv_histogram() method:
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
data = scipy.stats.norm.rvs(size=50000, loc=0)
hist = np.histogram(data, bins=100)
dist = scipy.stats.rv_histogram(hist)
To generate new data from this histogram, we simply call the rvs() method on the dist variable:
fake_data = dist.rvs(size=50000)
Then, we show the two distributions to prove we are getting what we expect:
plt.figure()
plt.hist(data, bins=100, alpha=0.5, label='real data')
plt.hist(fake_data, bins=100, alpha=0.5, label='fake data')
plt.legend(loc='upper right')
plt.show()
Hopefully this is what you're looking to do.
The magic words are "inverse transform sampling" (you can generate the CDF from your histogram distribution). See this nice tutorial: https://usmanwardag.github.io/python/astronomy/2016/07/10/inverse-transform-sampling-with-python.html
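If you want to do that sampling by hand rather than through rv_histogram, here is a minimal NumPy-only sketch of the inverse transform idea (the helper name sample_from_histogram is mine, not part of NumPy or SciPy):
import numpy as np

def sample_from_histogram(data, n_samples, bins=100, seed=None):
    # Build the empirical CDF at the bin edges, then push uniform random
    # numbers through its inverse using linear interpolation.
    rng = np.random.default_rng(seed)
    counts, edges = np.histogram(data, bins=bins)
    cdf = np.cumsum(counts) / counts.sum()
    cdf = np.insert(cdf, 0, 0.0)      # CDF is 0 at the leftmost edge
    u = rng.uniform(size=n_samples)   # uniform draws in [0, 1)
    return np.interp(u, cdf, edges)   # invert the CDF

# e.g. fake_data = sample_from_histogram(data, 50000)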
I wanted to generate a very simple example of anomaly detection for time series, so I created sample data with one very obvious outlier, but so far none of the methods I tried detects it reliably. I tried local outlier factor, isolation forest, and k-nearest neighbors. From what I have read, at least one of those methods should be suitable. I also tried tweaking the parameters, but that didn't really help.
What mistake do I make here? Are the methods not appropriate?
Below is a code example.
Thanks in advance!
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
t = np.linspace(0, 10, 101).reshape(-1, 1)
y_test = 0.5 + t + t**2 + 2 * np.random.randn(len(t), 1)
y_test[10] = y_test[10] * 7  # the obvious outlier

plt.figure(1)
plt.plot(t, y_test)
plt.show()

# Local outlier factor
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(contamination='auto')
pred = clf.fit_predict(y_test)
plt.figure(2)
plt.plot(t[pred == 1], y_test[pred == 1], 'bx')
plt.plot(t[pred == -1], y_test[pred == -1], 'ro')
plt.show()

# Isolation forest (behaviour='new' is only accepted by older scikit-learn versions)
from sklearn.ensemble import IsolationForest
clf = IsolationForest(behaviour='new', contamination='auto')
pred = clf.fit_predict(y_test)
plt.figure(3)
plt.plot(t[pred == 1], y_test[pred == 1], 'bx')
plt.plot(t[pred == -1], y_test[pred == -1], 'ro')
plt.show()

# k-nearest neighbors (pyod)
from pyod.models.knn import KNN
clf = KNN()
clf.fit(y_test)
pred = clf.predict(y_test)
plt.figure(4)
plt.plot(t[pred == 0], y_test[pred == 0], 'bx')
plt.plot(t[pred == 1], y_test[pred == 1], 'ro')
plt.show()
I guess it is because of two reasons:
these algorithms are not designed specifically for 1-d data
they are not designed for time-series problems
You may need to use a time-series tool for it, as sketched below.
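To illustrate that last point, here is a minimal sketch of a time-series-style approach on the same sample data: remove the trend first (a quadratic fit here), then flag points whose residual is far from the rest. The 3-standard-deviation threshold is just a common rule of thumb, not anything specific to the libraries above:
import numpy as np

np.random.seed(1)
t = np.linspace(0, 10, 101).reshape(-1, 1)
y_test = 0.5 + t + t**2 + 2 * np.random.randn(len(t), 1)
y_test[10] = y_test[10] * 7

# Detrend with a quadratic fit, then score the residuals.
coefs = np.polyfit(t.ravel(), y_test.ravel(), 2)
residuals = y_test.ravel() - np.polyval(coefs, t.ravel())
z = (residuals - residuals.mean()) / residuals.std()
print(np.where(np.abs(z) > 3)[0])  # index 10 should be flagged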
I would like to plot the softmax probabilities for a neural network classification task, similar to the plot below
However, most of the code I've found on SO and in the matplotlib docs uses histograms.
Examples:
plotting histograms whose bar heights sum to 1 in matplotlib
Python: matplotlib - probability mass function as histogram
http://matplotlib.org/gallery.html
But none of them match what I'm trying to achieve in that plot. Code and a sample figure would be highly appreciated.
I guess you are just looking for a different plot type. Adapted from here:
# Import
import numpy as np
import matplotlib.pyplot as plt

# Generate random normally distributed data
data = np.random.randn(10000)

# Histogram
heights, bins = np.histogram(data, bins=50)

# Normalize so the heights sum to 1
heights = heights / float(sum(heights))

# Plot the bin mid-points against the normalized heights
binMids = bins[:-1] + np.diff(bins) / 2.
plt.plot(binMids, heights)
Which produces something like this:
Hope that is what you are looking for.
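A small variant of the same idea, in case you want the curve to integrate to 1 (a proper density) instead of having the heights sum to 1; np.histogram can do the normalisation, and plt.step keeps the piecewise-constant look of a histogram:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(10000)

# density=True normalizes the counts so the curve integrates to 1.
heights, bins = np.histogram(data, bins=50, density=True)
binMids = bins[:-1] + np.diff(bins) / 2.
plt.step(binMids, heights, where='mid')
plt.show()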
lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.
The following code:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5)
plt.show()
yields
Which looks like a Gaussian with bandwidth much larger than 5 MHz.
I'm guessing that for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 , -75.71982528, -12.1897835 , -73.94903264,
-178.14293936, -123.51339541, -118.11826988, -50.19812838,
-43.69282206, -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()
I get:
With lines that seem to have the expected 5 MHz bandwidth.
As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this is.
Thanks,
Samuel
One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more-or-less as-is to either SciPy or the Statsmodels packages, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)
The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.
So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide it by your data's standard deviation. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment is effectively what you did by scaling by the range -- but the right number to divide by is not the range of the data, it's the standard deviation of the data.
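So for the laser example from the question, something along these lines should give kernels that really are about 5 MHz wide when SciPy is the backend (a sketch using the old bw/shade keywords from the question; as far as I can tell gaussian_kde uses the sample standard deviation, hence ddof=1):
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])

# With the SciPy backend, bw is a factor multiplied by the data's standard
# deviation, so divide the desired 5 MHz width by that standard deviation.
desired_bw = 5.0
sns.kdeplot(lineslist, shade=True, color="r", bw=desired_bw / lineslist.std(ddof=1))
plt.show()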
To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably trying to import statsmodels, and seeing if that succeeds (takes bandwidth directly) or fails (takes bandwidth factor).
By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that versions in the future might change this behavior.