I generated some data y with a linear relationship to log(x). I put y and x in a dataframe, sorted it by the value of x, fitted a model, and then tried to plot the data points along with the fitted line. However, what I got was a very messy fitted line, so I must have done something wrong. This can easily be done in R, but with statsmodels I still cannot figure it out. Help needed; thanks in advance.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
%matplotlib inline
B0 = 3
B1 = 0.5
X = np.random.rand(1000)
epsilon = np.random.normal(0,0.1, size=1000)
y = B0 + B1*np.log(X) + epsilon
df1 = pd.DataFrame({'Y':y, 'X':X})
df1.sort_values('X', inplace=True)
model1 = smf.ols('Y~np.log(X)', data=df1).fit()
plt.scatter(df1.X, df1.Y)
plt.plot(df1.X, model1.predict(np.log(df1.X)), 'r-')
This is what I got:
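A likely fix, as a sketch: with the formula API, predict applies the np.log transform from the formula itself, so it should be given the raw data rather than its log (otherwise the transform is applied twice):
plt.scatter(df1.X, df1.Y)
plt.plot(df1.X, model1.predict(df1), 'r-')  # pass df1; the formula handles np.log(X)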
I'm having trouble using scatter to create a scatter plot. Can someone help me? I've highlighted the line causing the error:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('vetl8.csv')
df = pd.DataFrame(data=data)
clusterNum = 3
X = df.iloc[:, 1:].values
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
k_means = KMeans(init="k-means++", n_clusters=clusterNum, n_init=12)
k_means.fit(X)
labels = k_means.labels_
df["Labels"] = labels
df.to_csv('dfkmeans.csv')
plt.scatter(df[2], df[1], c=labels)  # <-- the line causing the error
plt.xlabel('K', fontsize=18)
plt.ylabel('g', fontsize=16)
plt.show()
#data set correct
You are close; just a minor adjustment to access the x and y columns by position should fix it:
plt.scatter(df[df.columns[2]], df[df.columns[1]], c=df["Labels"])
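Equivalently, positional indexing with iloc does the same thing and reads a bit more directly:
plt.scatter(df.iloc[:, 2], df.iloc[:, 1], c=df["Labels"])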
I have some data and want to show it in a scatter matrix. That's fine and something I was able to do with:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
np.random.seed(100)
xlin = np.linspace(0, 10, 20)
a = xlin*0.2 + np.random.randn(20)*3
b = a*2 + 2 + np.random.randn(20)*3
c = b*0.1 - a*2 + np.random.randn(20)*3
d = a*a*0.1 + np.random.randn(20)*3
df = pd.DataFrame(np.array([a,b,c,d]).T, columns=list('ABCD'))
df.plot()
plt.show()
But now I want to draw a scatter matrix of my dataframe. So far, so good; I did it with:
from pandas.plotting import scatter_matrix
from scipy.stats import linregress
scatter_matrix(df, figsize=(12,12))
plt.show()
Now I want to show the correlations of the datasets using linregress. I even succeeded in doing this, but only in an extra plot below the scatter matrix. How do I manage to get it as part of the scatter matrix?
In the end it should look like this: [scatter matrix image]
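One way to do it, sketched here as a suggestion (reusing df and the imports from above): scatter_matrix returns the array of Axes it drew, so you can loop over the off-diagonal panels and write the r value from linregress into each one:
axes = scatter_matrix(df, figsize=(12, 12))
for i, row_var in enumerate(df.columns):
    for j, col_var in enumerate(df.columns):
        if i != j:
            r = linregress(df[col_var], df[row_var]).rvalue
            # put the correlation in the corner of each off-diagonal panel
            axes[i, j].annotate(f'r = {r:.2f}', xy=(0.05, 0.85),
                                xycoords='axes fraction')
plt.show()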
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import RidgeCV
tips = sns.load_dataset('tips')
X = tips.drop(columns=['tip','sex', 'smoker', 'day', 'time'])
y = tips['tip']
alphas = 10**np.linspace(10,-2,100)*0.5
ridge_clf = RidgeCV(alphas=alphas,scoring='r2').fit(X, y)
ridge_clf.score(X, y)
I wanted to plot the following graph for RidgeCV. I don't see any option to do that as with GridSearchCV. I appreciate your suggestions!
There is no indication what the colors stand for. I assume they stand for features, and that we investigate the size of each feature weight as a function of alpha. Here is my solution:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import RidgeCV
tips = sns.load_dataset('tips')
X = tips.drop(columns=['tip','sex', 'smoker', 'day', 'time'])
y = tips['tip']
alphas = 10**np.linspace(10,-2,100)*0.5
w = list()
for a in alphas:
    ridge_clf = RidgeCV(alphas=[a], cv=10).fit(X, y)
    w.append(ridge_clf.coef_)
w = np.array(w)
plt.semilogx(alphas,w)
plt.title('Ridge coefficients as function of the regularization')
plt.xlabel('alpha')
plt.ylabel('weights')
plt.legend(X.keys())
Output:
Since you only have two features in X there are only two lines.
Here is the code for generating the plot that you posted.
Firstly, we need to understand that RidgeCV will not return the coef for each alpha value passed in the alphas param.
The motivation behind RidgeCV is that it tries the different alpha values given in alphas and then, based on cross-validation scoring, returns the best alpha along with the fitted model.
Hence, the only way to get the coef for each alpha value under cross-validation is to iterate, fitting RidgeCV with one alpha at a time.
Example:
# Author: Fabian Pedregosa -- <fabian.pedregosa#inria.fr>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
# X is the 10x10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)
# #############################################################################
# Compute paths
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
coefs = []
for a in alphas:
    ridge = linear_model.RidgeCV(alphas=[a], fit_intercept=False, cv=3)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)
# #############################################################################
# Display results
ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1]) # reverse axis
plt.xlabel('alpha')
plt.ylabel('weights')
plt.title('RidgeCV coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
I was hoping someone might be able to clarify something for me. I am trying to do time-series forecasting with the GaussianRandomWalk function in PyMC3. It has been suggested that my code is wrong because I've modeled it so that the standard deviation of the latent walk is the same as the observation noise, which seems like it might be a mistake. Is it a mistake? How would I change it?
import pymc3 as pm
#import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# generate a random walk
sd = .1
N = 200
deltas = np.random.normal(scale=sd, size=N)
y = np.cumsum(deltas)
x = np.arange(N)
df = pd.DataFrame({'y': y})
df = df.reindex(np.arange(250))
with pm.Model() as model:
    sd = pm.HalfNormal('sd')
    mu = pm.Uniform('mu', 0, 100)
    prior = pm.GaussianRandomWalk('prior', mu=mu, sd=sd, shape=len(df))
    obs = pm.Normal('obs', mu=prior, sd=sd, observed=df['y'])
    # graph = pm.model_to_graphviz(model)
    # print(graph)
    trace = pm.sample(2000, chains=1)
pm.traceplot(trace)
plt.show()
with model:
    ppc = pm.sample_posterior_predictive(trace)
pm.traceplot(ppc)
plt.show()
print(ppc)
GaussianRandomWalk is purely random, without any trend/inertia. You might want to look into tfp.sts.LocalLinearTrend or pm.AR, which have some "inertia" in them.
I don't know much more about how to model time series.
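On the "how would I change it" part, a minimal sketch, assuming the rest of the script above stays the same: give the latent walk and the observation noise separate scale priors instead of reusing one sd for both:
with pm.Model() as model:
    sd_walk = pm.HalfNormal('sd_walk')  # scale of the latent walk's steps
    sd_obs = pm.HalfNormal('sd_obs')    # scale of the measurement noise
    mu = pm.Uniform('mu', 0, 100)
    prior = pm.GaussianRandomWalk('prior', mu=mu, sd=sd_walk, shape=len(df))
    obs = pm.Normal('obs', mu=prior, sd=sd_obs, observed=df['y'])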
I have a plot with some outliers (wrong measurements):
The base data is good, though. I want to delete everything that is too far off the "current average". I tried using pd.rolling().mean(), but with no satisfactory result:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
plt.plot(df)
plt.plot(df2)
plt.show()
I searched the web for a good solution but couldn't find one. It shouldn't be that hard to delete data points that jump through the roof, should it?
Edit:
data file can be downloaded here: https://ufile.io/pviuc
Edit2:
I tackled this problem of too many outliers by improving my data-set creation.
The core of it:
if abs(D - D_List[-2]) > 30:
    D = D_List[-2]
    D_List.pop()
D_List.append(D)
Basically, this checks whether the change in a value is larger than 30; if so, it deletes the last value and replaces it with the second-to-last. Not very spectacular, but just what I need. I still used one of the answers, though, because it is so much prettier. Thank you guys very much.
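For reference, the same thresholding idea can be written against the whole series at once; a sketch, assuming the dataframe from above (values in column 1) and the threshold of 30 from the snippet, with a rolling median standing in for the "current average":
med = df[1].rolling(20, center=True, min_periods=1).median()
# drop points more than 30 away from the local median, then fill the holes
cleaned = df[1].where((df[1] - med).abs() <= 30).interpolate()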
Let's try using scipy.signal (see the docs):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
b, a = signal.butter(3, 0.05)  # third-order low-pass Butterworth filter
y = signal.filtfilt(b, a, df[1].values)  # zero-phase (forward-backward) filtering
df3 = pd.DataFrame(y, index=df2.index)
plt.plot(df, alpha=.3)
plt.plot(df2, alpha=.3)
plt.plot(df3)
plt.show()
Output:
Use medfilt:
y = signal.medfilt(df[1].values)  # median filter; default kernel size is 3
Output:
There are many ways to smooth a curve (rolling mean, GAM, smoothing spline, etc.); my favorite is the Savitzky-Golay method.
It works as follows: a small window around a data point y is regressed onto a polynomial (with least squares), and that polynomial is used to produce the estimate ŷ of the data point. Then the window is shifted forward by one data point.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,5,150)
y = np.cos(x) + np.random.random(150) * 0.15
yhat = savgol_filter(y, 49, 3)
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
Note that a rolling mean can't work in your case with a window as small as 20, since the outlier point will have a non-negligible weight (5%) and will always induce a big bias...
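A quick toy check of that 5% figure: a single spike of height 100 shifts a 20-point rolling mean by at most 100/20 = 5:
import numpy as np
import pandas as pd

s = pd.Series(np.zeros(100))
s[50] = 100                         # one outlier
print(s.rolling(20).mean().max())   # 5.0 -- the spike taints 20 windows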