Is it feasible to plot a line of predicted values in Python? In R, you can plot the line of predicted values using lines(x$age, predict(result, data.frame(x$height, x$weight))). I want to plot predicted values from a regression result in the same way, and not only for a simple one-variable regression, which can already be done with NumPy and matplotlib.
So for example, in the following code:
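import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv("sample.csv")
# Inside I(), the expression is evaluated as Python, so powers are written ** (^ is bitwise XOR)
result = smf.ols("res~height+weight+I((180-height**2))+I((60-weight)**3)", data=df).fit()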
I want to use the regression result from smf.ols(), which is a bit more complicated than the usual multiple regression (say, res~height+weight). How can I compute the predicted values and plot them?
Thanks.
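A minimal sketch of one way to do this, assuming synthetic data in place of sample.csv (the column values and the choice to hold weight at its mean are illustrative assumptions, not part of the question). With the formula API, result.predict() accepts a DataFrame containing the raw columns and applies the I(...) transformations itself:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Synthetic stand-in for sample.csv (assumed data, for illustration only)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "height": rng.uniform(150, 200, n),
    "weight": rng.uniform(45, 100, n),
})
df["res"] = 0.5 * df["height"] - 0.2 * df["weight"] + rng.normal(0, 1, n)

result = smf.ols(
    "res ~ height + weight + I((180 - height**2)) + I((60 - weight)**3)",
    data=df,
).fit()

# In-sample predictions: predict() on the training frame matches fittedvalues
in_sample = result.predict(df)

# A smooth prediction line over height, holding weight at its mean (an assumption)
grid = pd.DataFrame({
    "height": np.linspace(df["height"].min(), df["height"].max(), 100),
    "weight": df["weight"].mean(),
})
line = result.predict(grid)

fig, ax = plt.subplots()
ax.scatter(df["height"], df["res"], s=10, label="observed")
ax.plot(grid["height"], line, color="red", label="predicted (weight at mean)")
ax.set_xlabel("height")
ax.set_ylabel("res")
ax.legend()
```

Call plt.show() to display the figure. Because the model has two predictors, a single line can only show a slice of the fitted surface, which is why the other predictor is fixed here.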
I have the following dataset in normal space; let's call it func:
I transformed it to Fourier space using NumPy's FFT (from numpy.fft import fft as fourier). I computed the Fourier transform with func_fourier = np.fft.fftshift(fourier(func)) and plotted the absolute values with plt.plot(np.abs(func_fourier)), which results in the following plot:
I now want to fit a Gaussian model to this function in Fourier space. The problem is that I don't have x-values (frequencies) to plot func_fourier against. How do I create the correct frequency array in Fourier space, which I also need for fitting the Gaussian model to my transformed function?
When plt.plot is called with y-values only, the default x-values are the sample indices, created as follows:
frequencies = list(range(len(y)))
Note: According to your explanation, your Fourier-transformed values are stored in func_fourier, so y here is func_fourier.
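If physical frequencies are needed rather than sample indices, one option is np.fft.fftfreq, shifted with np.fft.fftshift so that it lines up with the shifted transform. A minimal sketch, assuming a synthetic Gaussian signal and a known sample spacing (both are assumptions, since the original data is not shown):

```python
import numpy as np

def centered_frequencies(n_samples, sample_spacing=1.0):
    """Frequency axis matching np.fft.fftshift(np.fft.fft(signal))."""
    return np.fft.fftshift(np.fft.fftfreq(n_samples, d=sample_spacing))

# Synthetic stand-in for func (assumed): a Gaussian sampled on a regular grid
t = np.linspace(-5, 5, 256, endpoint=False)
func = np.exp(-t**2 / 2)

func_fourier = np.fft.fftshift(np.fft.fft(func))
frequencies = centered_frequencies(len(func), sample_spacing=t[1] - t[0])
```

The pair (frequencies, np.abs(func_fourier)) can then be passed to plt.plot or to a curve-fitting routine. The sample spacing must be the real spacing of the original data for the frequency units to be meaningful.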
I am:
(A) running the example from the Seaborn documentation, Discovering structure in heatmap data, but using the Distance Correlation from the dcor library, instead of pandas.DataFrame.corr, which is limited to linear or rank coefficients.
Then I want to:
(B) do the same using a couple of DataFrames with my own data.
I supply the distance correlation to sns.clustermap directly, as done in the documentation example, because I am interested in the structure in the heatmap, as opposed to using the Distance Correlation matrix to calculate the linkage, as done in this SO answer, for example. I create the distance correlation matrix with a modification of code from this excellent SO answer.
(A) No issues here
As I execute:
distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
dcor_df= df.apply(lambda col1: df.apply(lambda col2: distcorr(col1, col2)))
sns.clustermap(dcor_df, cmap="mako",
row_colors=network_colors, col_colors=network_colors,
linewidths=.75, figsize=(13, 13))
I get the result I expected:
(B) I do encounter issues here, as I move to my own data
For some background: I have two DataFrames with variables labeled A, B, ..., P in both. The variables are identical (same measurement, same units), but the measurements were collected in two locations that are spatially separated, hence my goal was to run the analysis separately, to see if the variables correlate in a similar way (i.e. with similar structure in the heatmap) in different locations.
Data from the first location is stored in here.
I execute the following code:
df_1 = pd.read_csv('df_1.csv')
pd.options.display.float_format = '{:,.3f}'.format
distcorr = lambda column1, column2: dcor.distance_correlation(column1, column2)
rslt_1 = df_1.apply(lambda col1: df_1.apply(lambda col2: distcorr(col1, col2)))
rslt_1
and I get the expected (square, symmetric) Distance Correlation matrix:
which I can plot with sns.heatmap as:
h=sns.heatmap(rslt_1, cmap="mako", vmin=0, vmax=1,
xticklabels=True, yticklabels=True, square=True)
fig = plt.gcf()
fig.set_size_inches(14, 10)
However, when I try to pass the Distance Correlation matrix to sns.clustermap with:
s=sns.clustermap(rslt_1, cmap="mako", standard_scale=1, linewidths=0)
fig = plt.gcf()
fig.set_size_inches(10, 10);
I get this:
which is very weird to me because I'm expecting the same ordering on both rows and columns as in the above modified documentation example. Unless I'm totally out to lunch and am missing or misunderstand something important.
If I pass metric='correlation' like this:
s=sns.clustermap(rslt_1, cmap="mako", metric='correlation',
standard_scale=1, linewidths=0)
fig = plt.gcf()
fig.set_size_inches(10, 10);
I get a result that is symmetric about the diagonal as I expected, and if I 'eyeball' those clusters they make more sense to me when I compare to the matrix in tabular form:
With the data from the second location, which is stored here, I get reasonable results (and fairly similar, although not identical) whether I pass metric='correlation' or not:
I cannot explain the behavior in the first case. Am I missing something?
Thank you.
PS I am on a Windows 10 PC.
Some info:
Remove the standard_scale parameter from your clustermap call.
According to the seaborn documentation (seaborn.pydata.org/generated/seaborn.clustermap.html), standard_scale=1 standardizes the column dimension, which means subtracting the minimum of each column and dividing by its maximum. The matrix you pass to clustermap already appears to lie in [0, 1], and rescaling each column separately makes it asymmetric.
s=sns.clustermap(rslt_1, cmap="mako", metric='correlation', linewidths=0)
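To illustrate the effect of column-wise scaling on a symmetric matrix, here is a small sketch. The helper standard_scale_columns is hypothetical and only mimics what the documentation describes (subtract each column's minimum, then divide by the resulting maximum); it is not seaborn's own code, and the 3x3 matrix is made up:

```python
import numpy as np

def standard_scale_columns(matrix):
    """Subtract each column's minimum, then divide by each column's maximum."""
    shifted = matrix - matrix.min(axis=0)
    return shifted / shifted.max(axis=0)

# A small symmetric matrix with ones on the diagonal (made-up values)
sym = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])

scaled = standard_scale_columns(sym)

print(np.allclose(sym, sym.T))        # the input is symmetric
print(np.allclose(scaled, scaled.T))  # the scaled matrix is not
```

Each column is stretched by a different amount, so entry (i, j) no longer equals entry (j, i). Rows and columns are then clustered on different data, which would explain why their orderings differ.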
Clustering is basically grouping data based on relationships among the variables in the data. Clustering algorithms help in getting structured data in unsupervised learning. The most common types of clustering are shown in the picture below.
The clustermap() function of seaborn plots a hierarchically-clustered heat map of the given matrix dataset. It returns a ClusterGrid instance.
In Agglomerative clustering, we start with considering each data point as a cluster and then repeatedly combine two nearest clusters into larger clusters until we are left with a single cluster. The graph plotted after performing agglomerative clustering on data is called "Dendrogram".
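The merging process described above can be sketched with SciPy's hierarchical clustering routines. The five points and the choice of average linkage are assumptions made for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Five 2-D points forming two tight pairs and one isolated point (made-up data)
points = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
    [5.0, 6.0],
    [10.0, 0.0],
])

# Each row of Z records one merge: the two cluster ids joined,
# the distance between them, and the size of the new cluster
Z = linkage(points, method="average")

# Order of the original points along the bottom of the dendrogram
leaf_order = leaves_list(Z)

print(Z)
print(leaf_order)
```

With n points there are n - 1 merges, and the last row of Z describes the single cluster containing every point. The leaf order is the kind of reordering that a clustered heat map applies to rows or columns.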
Clustering necessarily reorders your data.
I want to detect the outliers in time series data that contains trend and seasonality components. I want to leave out the peaks that are seasonal, consider only the other peaks, and label those as outliers. As I am new to time series analysis, please help me approach this problem.
I am using Python.
Attempt 1 : Using ARIMA model
I trained the model and forecast the test data. I then compute the difference between the forecasts and the actual test values, and flag outliers based on the observed variance.
Implementation of Auto Arima
!pip install pyramid-arima
from pyramid.arima import auto_arima
stepwise_model = auto_arima(train_log, start_p=1, start_q=1, max_p=3, max_q=3, m=7,
                            start_P=0, seasonal=True, d=1, D=1, trace=True,
                            error_action='ignore', suppress_warnings=True, stepwise=True)
import math
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from sklearn.metrics import mean_squared_error
Split data into train and test-sets
train, test = actual_vals[0:-70], actual_vals[-70:]
Log Transformation
train_log, test_log = np.log10(train), np.log10(test)
Converting to list
history = [x for x in train_log]
predictions = list()
predict_log=list()
Fitting Stepwise ARIMA model
for t in range(len(test_log)):
    stepwise_model.fit(history)
    output = stepwise_model.predict(n_periods=1)
    predict_log.append(output[0])
    yhat = 10**output[0]
    predictions.append(yhat)
    obs = test_log[t]
    history.append(obs)
Plotting
import matplotlib.pyplot as plt
figsize=(12, 7)
plt.figure(figsize=figsize)
plt.plot(test,label='Actuals')
plt.plot(predictions, color='red',label='Predicted')
plt.legend(loc='upper right')
plt.show()
But this way I can detect outliers only in the test data. I actually need to detect outliers in the whole time series, including the training data.
Attempt 2 : Using Seasonal Decomposition
I used the code below to split the original data into seasonal, trend, and residual components, which can be seen in the image below.
from statsmodels.tsa.seasonal import seasonal_decompose
decomposed = seasonal_decompose()
I then use the residuals to find outliers with a boxplot, since the seasonal and trend components have been removed. Does this make sense?
Or is there a simpler or better approach?
You can:
In the 4th graph (the residual plot) under "Attempt 2 : Using Seasonal Decomposition", check for extreme points; these may point you to anomalies in the seasonal series.
Supervised (if you have some labeled data): do some classification.
Unsupervised: try to predict the next value and build a confidence interval, then check whether the observed value lies inside it.
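Checking the residuals for extreme points can be done with the same rule a boxplot uses for its whiskers. A minimal sketch, assuming the residuals are already available as an array (for example the residual component of a decomposition, which may contain NaN at the ends). The helper name iqr_outliers, the factor 1.5, and the sample values are assumptions:

```python
import numpy as np

def iqr_outliers(residuals, k=1.5):
    """Return indices of residuals outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN."""
    residuals = np.asarray(residuals, dtype=float)
    q1, q3 = np.nanpercentile(residuals, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so NaN entries are never flagged
    return np.flatnonzero((residuals < lower) | (residuals > upper))

# Made-up residuals with one obvious spike and a trailing NaN
residuals = [0.1, -0.2, 0.0, 0.3, 5.0, -0.1, 0.2, np.nan]
print(iqr_outliers(residuals))
```

Because the rule is applied to the residuals of the whole series, it is not restricted to a held-out test segment. The factor k controls how aggressive the flagging is.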
You can also try to calculate the relative extrema of the data using argrelextrema, as shown here for example:
import numpy as np
from scipy.signal import argrelextrema
x = np.array([2, 1, 2, 3, 2, 0, 1, 0])
argrelextrema(x, np.greater)
output:
(array([3, 6]),)
Some random data (My implementation of the above argrelextrema):
With the following code I just want to fit a regression curve to sample data, but it is not working as expected.
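import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = 10*np.random.rand(100)
y = 2*X**2+3*X-5+3*np.random.rand(100)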
xfit=np.linspace(0,10,100)
poly_model=make_pipeline(PolynomialFeatures(2),LinearRegression())
poly_model.fit(X[:,np.newaxis],y)
y_pred=poly_model.predict(X[:,np.newaxis])
plt.scatter(X,y)
plt.plot(X[:,np.newaxis],y_pred,color="red")
plt.show()
Shouldn't there be a curve that fits the data points perfectly? The training data (X[:,np.newaxis]) and the data used to compute y_pred are the same (also X[:,np.newaxis]).
If I instead use the xfit data for the prediction, the result is as desired...
...
y_pred=poly_model.predict(xfit[:,np.newaxis])
plt.scatter(X,y)
plt.plot(xfit[:,np.newaxis],y_pred,color="red")
plt.show()
So what is the issue, and what explains this behaviour?
The difference between the two plots is that in the line
plt.plot(X[:,np.newaxis],y_pred,color="red")
The values in X[:,np.newaxis] are not sorted, while in
plt.plot(xfit[:,np.newaxis],y_pred,color="red")
the values of xfit[:,np.newaxis] are sorted.
Now, plt.plot connects each pair of consecutive points in the array with a line, and since the points are not sorted by x, you get the tangle of lines in your first figure.
Replace
plt.plot(X[:,np.newaxis],y_pred,color="red")
with
plt.scatter(X[:,np.newaxis],y_pred,color="red")
and you'll get this nice looking figure:
Based on Miriam Farber's answer I have figured out another way. Since the X values are not sorted, I can fix the issue by simply sorting them right after generating X and before computing y:
X=np.sort(X)
Now the rest of the code can stay unchanged and will deliver the desired result.
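If X and y already exist and should keep their pairing, another option is to sort only for plotting, using np.argsort. A minimal sketch, in which np.polyfit stands in for the scikit-learn pipeline (a swapped-in technique, used here only to keep the example dependency-free) and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = 10 * rng.random(100)
y = 2 * X**2 + 3 * X - 5 + 3 * rng.random(100)

# Quadratic fit; np.polyfit is a stand-in for the PolynomialFeatures pipeline
coeffs = np.polyfit(X, y, 2)
y_pred = np.polyval(coeffs, X)

# Indices that sort X; apply the same order to the predictions
order = np.argsort(X)
X_sorted = X[order]
y_pred_sorted = y_pred[order]
```

Plotting plt.plot(X_sorted, y_pred_sorted, color="red") then draws a single smooth curve, while X, y and y_pred keep their original order for anything else that uses them.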
I have a plot that is logarithmic on both axes, which I create with pyplot's loglog function.
Now, using numpy, I fit a straight line to the set of points that I have. However, when I draw this line on the plot, I do not get a straight line. I get a curved line.
The blue line is the supposedly "straight line". It is not plotted straight. I want to fit this straight line to the curve plotted by the red dots.
Here is the code I am using to plot the points:
import numpy
from matplotlib import pyplot as plt
import math
fp=open("word-rank.txt","r")
a=[]
b=[]
for line in fp:
    string=line.strip().split()
    a.append(float(string[0]))
    b.append(float(string[1]))
coefficients=numpy.polyfit(b,a,1)
polynomial=numpy.poly1d(coefficients)
ys=polynomial(b)
print(polynomial)
plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()
To better understand this problem, let's first talk about plain ol' linear regression (the polyfit function, in this case, is your linear regression algorithm).
Suppose you have a set of data points (x,y), shown below:
You want to create a model that predicts y as a function of x, so you use linear regression. That uses the model:
y = mx + b
and computes the values of m and b that best predict your data, using some linear algebra.
Next, you use your model to predict values of y as a function of x. You do this by picking a set of values for x (think linspace) and computing the corresponding values of y. Plotting these (x,y) pairs gives you your regression line.
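The fit-then-predict steps just described can be sketched with np.polyfit and np.linspace. The data points are made up so that the true line is y = 2x + 1:

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit y = m*x + b; polyfit returns the highest-degree coefficient first
m, b = np.polyfit(x, y, 1)

# Pick a set of x values and compute the corresponding y values
x_line = np.linspace(x.min(), x.max(), 50)
y_line = m * x_line + b
```

Plotting (x_line, y_line) on ordinary linear axes gives the regression line.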
Now, let's talk about logarithmic regression. In this case, we still have two variables, y versus x, and we are still interested in their relationship, i.e., being able to predict y given x. The only difference is that y and x now happen to be logarithms of two other variables, F and R: y = log(F) and x = log(R). Thus far, this is nothing more than a simple change of name.
The linear regression also works the same way. You're still regressing y versus x. The linear regression algorithm doesn't care that y and x are actually log(F) and log(R) - it makes no difference to the algorithm.
The last step is a little bit different - and this is where you're getting tripped up in your plot above. What you're doing is computing
F = m R + b
but this is incorrect, because the relationship between F and R is not linear. (That's why you're using a log-log plot.)
Instead, you should compute
log(F) = m log(R) + b
If you transform this (raise 10 to the power of both sides and rearrange), you get
F = c R^m
where c = 10^b. This is the relationship between F and R: it is a power law relationship. (Power law relationships are what log-log plots are best at.)
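The derivation above can be checked numerically. This sketch fits a line to log10(R) and log10(F) and then maps it back to F = c R^m; the synthetic data follow an exact power law with c = 1000 and m = -1, values assumed purely for illustration:

```python
import numpy as np

# Synthetic rank/frequency data: F = 1000 * R**-1 (assumed values)
R = np.arange(1, 51, dtype=float)
F = 1000.0 * R ** -1.0

# Linear regression in log space: log10(F) = m * log10(R) + b
m, b = np.polyfit(np.log10(R), np.log10(F), 1)

# Back-transform: F = c * R**m with c = 10**b
c = 10.0 ** b
F_fit = c * R ** m

print(m, c)
```

Drawing (R, F_fit) with plt.loglog gives a straight line on the log-log axes, because the curve is a power law in the original variables.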
In your code, you're passing a and b to polyfit, but you should be passing log(a) and log(b).
Your linear fit is not performed on the same data as shown in the loglog-plot.
Make a and b numpy arrays like this
a = numpy.asarray(a, dtype=float)
b = numpy.asarray(b, dtype=float)
Now you can perform operations on them. What the loglog-plot does, is to take the logarithm to base 10 of both a and b. You can do the same by
logA = numpy.log10(a)
logB = numpy.log10(b)
This is what the loglog plot visualizes. Check this by plotting logA against logB as a regular plot. Repeat the linear fit on the log data and plot your line in the same plot as the logA, logB data.
coefficients = numpy.polyfit(logB, logA, 1)
polynomial = numpy.poly1d(coefficients)
# Evaluate and plot the fitted line on the log-transformed data it was fitted to
ys = polynomial(logB)
plt.plot(logB, logA)
plt.plot(logB, ys)
The other answers offer great explanations and a solution. However, I would like to propose a solution that helped me a lot and may help you as well.
Another simple way of writing a line fit for log-log scale is the function powerfit in the code below. It takes in the original x and y data and by using a number of new x-points you can get a straight line on log-log scale. In the current case the values xnew are the same as x (which are both b).
The advantage of defining new x-coordinates is that you can get as few or as many points of the powerfitted line for whatever purpose you might need them.
import numpy as np
from matplotlib import pyplot as plt
import math
def powerfit(x, y, xnew):
    """line fitting on log-log scale"""
    # Fit log(y) = k*log(x) + m, then evaluate y = exp(m) * xnew**k
    k, m = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(m) * np.asarray(xnew, dtype=float)**k
fp=open("word-rank.txt","r")
a=[]
b=[]
for line in fp:
    string=line.strip().split()
    a.append(float(string[0]))
    b.append(float(string[1]))
ys = powerfit(b, a, b)
plt.loglog(b,a,'ro')
plt.plot(b,ys)
plt.xlabel("Log (Rank of frequency)")
plt.ylabel("Log (Frequency)")
plt.title("Frequency vs frequency rank for words")
plt.show()