PyMC3: Different predictions for identical inputs - python

In PyMC3, single new observations passed via set_data() are currently not handled correctly by sample_posterior_predictive(), which in such cases predicts the training data instead (see #3640). Therefore, I decided to add a second artificial row, which is identical to the first one, to my input data in order to bypass this behavior.
Now, I stumbled across something that I currently fail to make sense of: the predictions for the first and second row are different. With a constant random_seed, I would have expected the two predictions to be identical. Can anyone please (i) affirm that this is intended behavior rather than a bug and, if so, (ii) explain why sample_posterior_predictive() creates different results for one and the same input data?
Here's a reproducible example based on the iris dataset, where petal width and length serve as predictor and response, respectively, and everything but the last row is used for training. The model is subsequently tested against the last row. pd.concat() is used to duplicate the first row of the test data frame to circumvent the above bug.
import seaborn as sns
import pymc3 as pm
import pandas as pd
import numpy as np
### . training ----
dat = sns.load_dataset('iris')
trn = dat.iloc[:-1]
with pm.Model() as model:
s_data = pm.Data('s_data', trn['petal_width'])
outcome = pm.glm.GLM(x = s_data, y = trn['petal_length'], labels = 'petal_width')
trace = pm.sample(500, cores = 1, random_seed = 1899)
### . testing ----
tst = dat.iloc[-1:]
tst = pd.concat([tst, tst], axis = 0, ignore_index = True)
with model:
pm.set_data({'s_data': tst['petal_width']})
ppc = pm.sample_posterior_predictive(trace, random_seed = 1900)
np.mean(ppc['y'], axis = 0)
# array([5.09585088, 5.08377112]) # mean predicted value for [first, second] row

I don't think it's a bug and I also don't find it troubling. Since PyMC3 doesn't check whether the points being predicted are identical, it treats them separately and each one results in a random draw from the model. While each PPC draw (row in ppc['y']) is using the same random parameter settings for the GLM taken from the trace, the model is still stochastic (i.e., there is always measurement error). I think this explains the difference.
If you increase the number of draws in the PPC, you will see that the difference in the means decreases, which is consistent with this just being a difference in sampling.

Related

How to give more importance to some features in sklearn Isolation Forest

I am using sklearn isolation forest for an anomaly detection task. Isolation forest consists of iTrees. As this paper describes, the nodes of the iTrees are split in the following way:
We select any feature (uniformly) randomly and perform a split on a random value of that feature.
But I want to give more weight to some features than the others. So instead of selecting the features with equal probability, I want to draw some features with a higher probability (giving more weight to those features) and other features with a lower probability.
How can I do that? From the source code it seems I have to change the function _generate_bagging_indices in _bagging.py, but not sure.
You can achieve this without changing the source code. Instead, you can tweak your input data by duplicating the features you wish to increase the weight for. If you have a feature appearing twice, the trees will use it twice to split your data, which in practice will mean the same as having doubled the weight of the feature.
In addition to this, you can also choose to reduce the amount of features used by your isolation forest in each tree. This is controlled by the argument max_features. The default value of 1.0 ensures that every feature will be used for each tree. By reducing it, more trees will be trained without the less frequent features in your input.
Illustration
Load Data
from sklearn.ensemble import IsolationForest
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris()
X = data.data
df = pd.DataFrame(X, columns=data.feature_names)
Default settings
IF = IsolationForest()
IF.fit(df)
preds = IF.predict(df)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds)
plt.title("Default settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
Weighted Settings
df1 = df.copy()
weight_feature = 10
for i in range(weight_feature):
df1["duplicated_" + str(i)] = df1["sepal length (cm)"]
IF1 = IsolationForest(max_features=0.3)
IF1.fit(df1)
preds1 = IF1.predict(df1)
plt.scatter(df.iloc[:, 0], df.iloc[:, 1], c=preds1)
plt.title("Weighted settings")
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.show()
As you can see visually, the second option has used the X-axis more intensively to determine which are the outliers.

Clustering data- Poor results, feature extraction

I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represent a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
Scale the data between 0,1 ->
Perform PCA (reduce from 15 to 5)
Cluster using db scan since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) in dbscan:
# optimal Epsilon (distance):
X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances,color="#0F215A")
plt.grid(True)
The result so far are not giving any clear indication that the data is changing over time:
Of course, the case could be that the data is not changing over these data points. Howver, what are some other things I could try? Kind of an open question, but I am running out of ideas.
First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all kinds of numeric variables, set the target variable, and let the machine do the lift for you. Here is a simple example using the canonical Iris dataset. You can EASILY modify this to fit your specific dataset. Just change the 'X' variables (all but the target variable) and 'y' variable (just one target variable). Try that and feedback.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # we only take the first two features.
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make scatter plot showing species for each sample
sns.FacetGrid(data, hue="species", size=4) \
.map(plt.scatter, "sepal_length", "sepal_width") \
.add_legend();

partially define initial centroid for scikit-learn K-Means clustering

Scikit documentation states that:
Method for initialization:
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
My data has 10 (predicted) clusters and 7 features. However, I would like to pass array of 10 by 6 shape, i.e. I want 6 dimensions of centroid of be predefined by me, but 7th dimension to be iterated freely using k-mean++.(In another word, I do not want to specify initial centroid, but rather control 6 dimension and only leave one dimension to vary for initial cluster)
I tried to pass 10x6 dimension, in hope it would work, but it just throw up the error.
Sklearn does not allow you to perform this kind of fine operations.
The only possibility is to provide a 7th feature value that is random, or similar to what Kmeans++ would have achieved.
So basically you can estimate a good value for this as follows:
import numpy as np
from sklearn.cluster import KMeans
nb_clust = 10
# your data
X = np.random.randn(7*1000).reshape( (1000,7) )
# your 6col centroids
cent_6cols = np.random.randn(6*nb_clust).reshape( (nb_clust,6) )
# artificially fix your centroids
km = KMeans( n_clusters=10 )
km.cluster_centers_ = cent_6cols
# find the points laying on each cluster given your initialization
initial_prediction = km.predict(X[:,0:6])
# For the 7th column you'll provide the average value
# of the points laying on the cluster given by your partial centroids
cent_7cols = np.zeros( (nb_clust,7) )
cent_7cols[:,0:6] = cent_6cols
for i in range(nb_clust):
init_7th = X[ np.where( initial_prediction == i ), 6].mean()
cent_7cols[i,6] = init_7th
# now you have initialized the 7th column with a Kmeans ++ alike
# So now you can use the cent_7cols as your centroids
truekm = KMeans( n_clusters=10, init=cent_7cols )
That is a very nonstandard variation of k-means. So you cannot expect sklearn to be prepared for every exotic variation. That would make sklearn slower for everybody else.
In fact, your approach is more like certain regression approaches (predicting the last value of the cluster centers) rather than clustering. I also doubt the results will be much better than simply setting the last value to the average of all points assigned to the cluster center using the other 6 dimensions only. Try partitioning your data based on the nearest center (ignoring the last column) and then setting the last column to be the arithmetic mean of the assigned data.
However, sklearn is open source.
So get the source code, and modify k-means. Initialize the last component randomly, and while running k-means only update the last column. It's easy to modify it this way - but it's very hard to design an efficient API to allow such customizations through trivial parameters - use the source code to customize at his level.

PYMC3 Bayesian Prediction Cones

I'm still learning PYMC3, but I cannot find anything on the following problem in the docs. Consider the Bayesian Structure Time Series (BSTS) model from this question with no seasonality. This can be modeled in PYMC3 as follows:
import pymc3, numpy, matplotlib.pyplot
# generate some test data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
y_train = y_full[:90]
y_test = y_full[90:]
# specify the model
with pymc3.Model() as model:
grw = pymc3.GaussianRandomWalk('grw',mu=0,sd=1,shape=y_train.size)
y = pymc3.Normal('y',mu=grw,sd=1,observed=y_train)
trace = pymc3.sample(1000)
y_mean_pred = pymc3.sample_ppc(trace,samples=1000,model=model)['y'].mean(axis=0)
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='b')
ax.plot(t[:90],y_mean_pred,c='r')
matplotlib.pyplot.show()
Now I would like to predict the behavior for the next 10 time steps, i.e., y_test. I would also like to include credible regions over this area produce a Bayesian cone, e.g., see here. Unfortunately the mechanism for producing the cones in the aforementioned link is a little vague. In a more conventional AR model one could learn the mean regression coefficients and manually extend the mean curve. However, in this BSTS model there is no obvious way to do this. Alternatively, if there were regressors, then I could use a theano.shared and update it with a finer/extended grid to impute and extrapolate with sample_ppc, but thats not really an option in this setting. Perhaps sample_ppc is a red herring here, but its unclear how else to proceed. Any help would be welcome.
I think the following work. However, its super clunky and requires that I know how far in advance I want to predict before I train (in particular it percludes streaming usage or simple EDA). I suspect there is a better way and I would much rather accept a better solution by someone with more Pymc3 experience
import numpy, pymc3, matplotlib.pyplot, seaborn
# generate some data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
# mask the data that I want to predict (requires knowledge
# that one might not always have at training time).
cutoff_idx = 80
y_obs = numpy.ma.MaskedArray(y_full,numpy.arange(t.size)>cutoff_idx)
# specify and train the model, used the masked array to supply only
# the observed data
with pymc3.Model() as model:
grw = pymc3.GaussianRandomWalk('grw',mu=0,sd=1,shape=y_obs.size)
y = pymc3.Normal('y',mu=grw,sd=1,observed=y_obs)
trace = pymc3.sample(5000)
y_pred = pymc3.sample_ppc(trace,samples=20000,model=model)['y']
y_pred_mean = y_pred.mean(axis=0)
# compute percentiles
dfp = numpy.percentile(y_pred,[2.5,25,50,70,97.5],axis=0)
# plot actual data and summary posterior information
pal = seaborn.color_palette('Purples')
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='g',label='true value',alpha=0.5)
ax.plot(t,y_pred_mean,c=pal[5],label='posterior mean',alpha=0.5)
ax.plot(t,dfp[2,:],alpha=0.75,color=pal[3],label='posterior median')
ax.fill_between(t,dfp[0,:],dfp[4,:],alpha=0.5,color=pal[1],label='CR 95%')
ax.fill_between(t,dfp[1,:],dfp[3,:],alpha=0.4,color=pal[2],label='CR 50%')
ax.axvline(x=t[cutoff_idx],linestyle='--',color='r',alpha=0.25)
ax.legend()
matplotlib.pyplot.show()
This outputs the following which seems like a really bad prediction, but at least the code is supplying out of sample values.

Does sklearn silhouette_samples depend on the order of the data?

I’ve been trying to determine the silhouette scores for each sample in a data set, which contains two different classes. However, the distribution and sample values change depending on how I’ve sorted my data ahead of time. For example, if I sort my dataframe by the class labels (0 & 1) in ascending vs descending order prior to calling silhouette_samples(), the silhouette scores change.
Can someone help me figure out what’s going on? I’d like to know
Is there a bug in my code that I’m not aware of?
Is this normal behavior of the sklearn silhouette_samples function that I’m
ignorant of?
Or is this a bug in the sklearn silhouette_samples?
The effect occurs with the following code:
import pandas as pd
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
df_myfeatures #data frame containing features and class labels
'''data frame sorted by output labels in ascending order'''
df1 = df_myfeatures.copy().sort_values(['output_label'], ascending = True)
'''data frame sorted by output labels in descending order'''
df2 = df_myfeatures.copy().sort_values(['output_label'], ascending = False)
'''
standardize the features ahead of time since they’re on different scales
the X matrix has 26k rows (observations) and 9 columns (features)
'''
standard_scaler = StandardScaler()
X1 = standard_scaler.fit_transform(df1[cols]) #cols is just a list of columns for fitting
X2 = standard_scaler.fit_transform(df2[cols])
y1 = df1['output_label']
y2 = df2['output_label']
'''find the silhouette scores'''
ss1 = silhouette_samples(X1,y1)
ss2 = silhouette_samples(X2,y2)
'''plot the distribution'''
plt.hist(ss1, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted ascending')
plt.hist(ss2, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted descending')
plt.legend()
plt.title('distribution of silhouette scores')
Which generates the following distributions of scores:
histogram of silhouette scores
As you can see, the distribution of scores changes depending on the order of data. I’ve verified that there’s not an issue with standard scaler producing different results for the data depending on the order, and that there’s not an issue with the pandas sorting the data and somehow messing up the alignment of different rows across columns. I'm completely stumped, as far as I was aware, there shouldn't be any order effects in the calculation of silhouette scores.
Please, help me understand this behavior! Thanks!
Note: I’m running on a Windows 10 machine, using Anaconda 4.0 (64 bit), Python v3.5.1, sklearn v0.17.1 and Pandas v0.18.0

Categories

Resources