Clustering an array of values without using thresholds - python

I want to segment a 1D dataset, where each value represents an error, into 2 segments:
A cluster with the smallest values
All the others
Example:
X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5, 21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
In this small example, I would like to group the first 4 values into a cluster and forget about the others. I do not want a solution based on a threshold, because the centroid of the cluster of interest will not always have the same value. It might be 1e-6, or it might be 1e-3, or it might be 1.
My idea was to use a k-means clustering algorithm, which would work fine if I knew how many clusters exist in my data. In the example above, the number is 3: one around 1 (the cluster of interest), one around 22, and one around 51. But sadly, I do not know the number of clusters. Simply searching for 2 clusters does not segment the dataset as intended.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
This returns a cluster 1 that is far too large: it also includes the data from the cluster centered around 22.
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
I did find some interesting answers on methods to select k, but they complicate the algorithm, and I feel like there must be a far better way to solve this problem.
I'm open to any suggestions and examples that work on the X array provided.

You might find AffinityPropagation useful here, as it does not require you to specify the number of clusters to generate. You might, however, have to tune the damping factor and the preference so that it produces the expected results.
On the provided example, the default parameters seem to do the job:
import numpy as np
from sklearn.cluster import AffinityPropagation
X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
              21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
ap = AffinityPropagation(random_state=12).fit(X)
y = ap.predict(X)
print(y)
# array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], dtype=int64)
To obtain individual clusters from X, you can index using y:
first_cluster = X[y==0].ravel()
first_cluster
# array([1. , 1.5, 0.4, 1.1])
second_cluster = X[y==1].ravel()
second_cluster
# array([23. , 24. , 22.5, 21. , 20. , 25. ])
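Since the goal is the cluster with the smallest values, it may be safer not to assume that this cluster always gets label 0. A minimal sketch, assuming the fit converged, that picks the cluster with the lowest center using the fitted model's cluster_centers_ and labels_ attributes (the names smallest_label and lowest_cluster are only illustrative):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
              21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)

ap = AffinityPropagation(random_state=12).fit(X)

# Label of the cluster whose center (exemplar) has the smallest value
smallest_label = np.argmin(ap.cluster_centers_.ravel())

# All points assigned to that cluster
lowest_cluster = X[ap.labels_ == smallest_label].ravel()
print(lowest_cluster)
```

This keeps working if the clusters come back in a different order, for example after changing the damping or preference.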

Related

Alternatives for numpy.random generation with choice values and specific frequency of values

I am working on generating a (1109, 8) array with random values drawn from a fixed set of numbers [18, 24, 36, 0]. I need to ensure each row contains exactly 5 zeros, but that wasn't happening even after adjusting the probability weights.
My workaround code is below, but I wanted to know if there is an easier way with another function, or perhaps by adjusting some of the parameters of the generator.
https://numpy.org/doc/stable/reference/random/generator.html
# Random output using the new Generator API
import numpy as np
from numpy.random import default_rng
rng = default_rng(1)
# arr is the existing (1109, 8) array; only its shape is used here
# generate an array with random values of test duration
test_duration = rng.choice([18, 24, 36, 0], size=arr.shape, p=[0.075, 0.1, 0.2, 0.625])
# ensure the number of tests in each row equals n_tests
n_tests = 3
non_tested = arr.shape[1] - n_tests
for row in range(len(test_duration)):
    # redraw the whole row until it has exactly n_tests non-zero entries
    while np.count_nonzero(test_duration[row, :]) != n_tests:
        new_test = rng.choice([18, 24, 36, 0], size=arr.shape[1], p=[0.075, 0.1, 0.2, 0.625])
        test_duration[row, :] = np.array(new_test)
print('There are no days exceeding n_tests')
#print(test_durations)
print(test_duration[:10, :])
If you need 5 zeros in every row, you can just randomly select 3 values from [18, 24, 36], pad the rest with zeros and then do a per-row random shuffle. The numpy shuffle happens in-place, so you don't need to reassign.
import numpy as np
c = [18, 24, 36]
p = np.array([0.075, 0.1, 0.2])
p = p / p.sum() # normalize the probs
a = np.random.choice(c, size=(1109, 3), replace=True, p=p)
a = np.hstack([a, np.zeros((1109, 5), dtype=np.int32)])
for row in a: # each row is a view, so shuffling it modifies a in place
    np.random.shuffle(row)
a
# returns (values vary from run to run):
array([[ 0, 0, 0, 0, 36, 0, 36, 36],
       [ 0, 36, 0, 24, 24, 0, 0, 0],
       [ 0, 0, 0, 0, 36, 36, 36, 0],
       ...
       [ 0, 0, 0, 24, 24, 36, 0, 0],
       [ 0, 24, 0, 0, 0, 36, 0, 18],
       [ 0, 0, 0, 36, 36, 24, 0, 0]])
You could simply draw random positions for the 5 zeros in each row, which guarantees that there are exactly 5 zeros, and then sample the remaining entries from [18, 24, 36] with their normalized probabilities.
Note that doing this no longer respects the probability distribution you specified in the first place. I don't know what application you are using this for, but it is a point to consider.
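The position-first idea could be sketched roughly as follows. This is only an illustration: it picks the 3 non-zero positions per row (equivalent to picking the 5 zero positions) by ranking random numbers, and the names order, test_cols and out are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols, n_tests = 1109, 8, 3

values = np.array([18, 24, 36])
p = np.array([0.075, 0.1, 0.2])
p = p / p.sum()  # normalize the probabilities of the non-zero values

# A random ordering of the column indices for every row;
# the first n_tests entries of each ordering are the non-zero positions
order = np.argsort(rng.random((n_rows, n_cols)), axis=1)
test_cols = order[:, :n_tests]

# Start from all zeros and fill only the chosen positions
out = np.zeros((n_rows, n_cols), dtype=int)
rows = np.arange(n_rows)[:, None]  # column vector so it broadcasts against test_cols
out[rows, test_cols] = rng.choice(values, size=(n_rows, n_tests), p=p)

print(out[:5])
```

Every row ends up with exactly 5 zeros without any redraw loop, at the cost mentioned above: the zeros no longer follow the original 0.625 probability.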

Calculating percentile of bins from numpy digitize?

I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each of the elements in bins, I want to know the base percentile. For example, in bins, the smallest bin should start at the 0th percentile. Then the next bin, for example, the 20th percentile. So that if a value in data falls between the 0th and 20th percentile of data, it belongs in the first bin.
I've looked into pandas rank(pct=True) but can't seem to get this done correctly.
Suggestions?
You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore :
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster) :
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize :
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take :
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
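If what is needed is instead the percentile at which each of the original thresholds starts (one possible reading of the question), a rough sketch is to compute the share of the data lying strictly below each threshold. The name base_percentile is made up for illustration:

```python
import numpy as np

data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0, 5, 10])

# Number of data points strictly below each threshold (searchsorted needs sorted input)
n_below = np.searchsorted(np.sort(data), thresholds, side="left")

# Share of the data below each threshold, expressed as a percentile
base_percentile = n_below / len(data) * 100
print(base_percentile)
# approximately [0., 72.73, 100.]
```

Here 8 of the 11 values are below 5, so the bin that starts at threshold 5 starts at roughly the 73rd percentile.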

Scikit learn wrong predictions with SVC

I am trying to predict the MNIST (http://pjreddie.com/projects/mnist-in-csv/) dataset with an SVM using the radial kernel. I want to train with few examples (e.g. 1000) and predict many more. The problem is that whenever I predict, the predictions are constant unless the indices of the test set coincide with those of the training set. That is, suppose I train with examples 1:1000 from my training examples. Then, the predictions are correct (i.e. the SVM does its best) for 1:1000 of my test set, but then I get the same output for the rest. If however I train with examples 2001:3000, then only the test examples corresponding to those rows in the test set are labeled correctly (i.e. not with the same constant). I am completely at a loss, and I think that there is some sort of bug, because the exact same code works just fine with LinearSVC, although evidently the accuracy of the method is lower.
First, I train with examples 501:1000 of training data:
import pandas as pd
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
# dat_train/test are pandas DFs corresponding to both MNIST datasets
dat_train = pd.read_csv('data/mnist_train.csv', header=None)
dat_test = pd.read_csv('data/mnist_train.csv', header=None)
svm = SVC(C=10.0)
idx = range(1000)
#idx = np.random.choice(range(len(dat_train)), size=1000, replace=False)
X_train = dat_train.iloc[idx,1:].reset_index(drop=True).as_matrix()
y_train = dat_train.iloc[idx,0].reset_index(drop=True).as_matrix()
X_test = dat_test.reset_index(drop=True).as_matrix()[:,1:]
y_test = dat_test.reset_index(drop=True).as_matrix()[:,0]
svm.fit(X=X_train[501:1000,:], y=y_train[501:1000])
Here you can see that about half the predictions are wrong
y_pred = svm.predict(X_test[:1000,:])
confusion_matrix(y_test[:1000], y_pred)
All wrong (i.e. constant)
y_pred = svm.predict(X_test[:500,:])
confusion_matrix(y_test[:500], y_pred)
This is what I would expect to see for all test data
y_pred = svm.predict(X_test[501:1000,:])
confusion_matrix(y_test[501:1000], y_pred)
You can check that all of the above are correct using LinearSVC!
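The default kernel is RBF, in which case gamma matters. If gamma is not provided, it defaults to auto in this scikit-learn version, which means 1/n_features (newer releases default to 'scale' instead). With unscaled pixel values in the range 0-255, that gamma is far too large: the kernel value is close to zero for any two distinct images, so the model essentially memorizes the training rows and returns the same label for everything else. It is best to run a grid search to find good parameters. Here I just show that the result is reasonable given suitable parameters.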
In [120]: svm = SVC(C=1, gamma=0.0000001)
In [121]: svm.fit(X=X_train[501:1000,:], y=y_train[501:1000])
Out[121]:
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=1e-07, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
In [122]: y_pred = svm.predict(X_test[:1000,:])
In [123]: confusion_matrix(y_test[:1000], y_pred)
Out[123]:
array([[ 71, 0, 2, 0, 2, 9, 1, 0, 0, 0],
[ 0, 123, 0, 0, 0, 1, 1, 0, 1, 0],
[ 2, 5, 91, 1, 1, 1, 3, 7, 5, 0],
[ 0, 1, 4, 48, 0, 40, 1, 5, 7, 1],
[ 0, 0, 0, 0, 88, 2, 3, 2, 0, 15],
[ 1, 1, 1, 0, 2, 77, 0, 3, 1, 1],
[ 3, 0, 3, 0, 5, 4, 72, 0, 0, 0],
[ 0, 2, 3, 0, 3, 0, 1, 88, 1, 1],
[ 2, 0, 1, 2, 3, 9, 1, 4, 63, 4],
[ 0, 1, 0, 0, 16, 3, 0, 11, 1, 62]])
Finding good parameters for an SVC is an art in itself. Grid search might help, but population-based training, as described in this article, worked better when I recently tried it. Given the same running time, it produced better results than grid search; run until the accuracy was the same, it was faster.
It also helps to make a plot: put C and gamma on the x and y axes and show the prediction scores as color. Usually you will find a kind of V shape, with the best training results at the point where the two lines meet. That point also tends to have a low C value, which is desirable because C drives the runtime of the SVC: a high C means a long runtime.
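A minimal sketch of such a search over C and gamma with scikit-learn's GridSearchCV. The small load_digits dataset is used here only as a stand-in for the MNIST CSV files, and the grid values are arbitrary assumptions, not recommendations:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images bundled with scikit-learn (not the MNIST CSVs)
X, y = load_digits(return_X_y=True)

C_values = [0.1, 1, 10]
gamma_values = [1e-4, 1e-3, 1e-2]
param_grid = {"C": C_values, "gamma": gamma_values}

# 3-fold cross-validated search over every (C, gamma) pair
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X[:500], y[:500])

# Mean validation accuracy arranged as a grid: rows follow C, columns follow gamma
scores = search.cv_results_["mean_test_score"].reshape(len(C_values), len(gamma_values))

print(search.best_params_)
print(scores)
```

The scores array is what you would pass to a heatmap to get the C versus gamma picture described above.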

How to organize values in a numpy array into bins that contain a certain range of values?

I am trying to sort the values of a numpy array so that I can store all of the values that fall in a certain range (that could probably be phrased better). Anyway, I'll give an example of what I am trying to do. I have an array called bins that looks like this:
bins = array([11,11.5,12,12.5,13,13.5,14])
I also have another array called avgs:
avgs = array([11.02, 13.67, 11.78, 12.34, 13.24, 12.98, 11.3, 12.56, 13.95, 13.56,
11.64, 12.45, 13.23, 13.64, 12.46, 11.01, 11.87, 12.34, 13,87, 13.04,
12.49, 12.5])
What I am trying to do is to find the index values of the avgs array that are in the ranges between the values of the bins array. For example I was trying to make a while loop that would create new variables for each bin. The first bin would be everything that is between bins[0] and bins[1] and would look like:
bin1 = array([0, 6, 15])
Those index values would correspond to the values 11.02, 11.3, and 11.01 in the avgs and would be the values of avgs that were between index values 0 and 1 in bins. I also need the other bins so another example would be:
bin2 = array([2, 10, 16])
However the challenging part of this for me was that the size of bins and avgs changes based on other parameters so I was trying to build something that would be able to be expanded to larger or smaller bins and avgs arrays.
Numpy has some pretty powerful bin counting functions.
>>> binplace = np.digitize(avgs, bins) #Returns which bin an average belongs
>>> binplace
array([1, 6, 2, 3, 5, 4, 1, 4, 6, 6, 2, 3, 5, 6, 3, 1, 2, 3, 5, 7, 5, 3, 4])
>>> np.where(binplace == 1)
(array([ 0, 6, 15]),)
>>> np.where(binplace == 2)
(array([ 2, 10, 16]),)
>>> avgs[np.where(binplace == 1)]
array([ 11.02, 11.3 , 11.01])
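Because the sizes of bins and avgs change, it may be easier to collect the indices for every bin in one dictionary rather than creating one variable per bin. A small sketch, using only the first 17 values of avgs from the question; the name indices_by_bin is made up for the example:

```python
import numpy as np

bins = np.array([11, 11.5, 12, 12.5, 13, 13.5, 14])
avgs = np.array([11.02, 13.67, 11.78, 12.34, 13.24, 12.98, 11.3, 12.56, 13.95,
                 13.56, 11.64, 12.45, 13.23, 13.64, 12.46, 11.01, 11.87])

# Bin number of every average (1 means bins[0] <= value < bins[1], and so on)
binplace = np.digitize(avgs, bins)

# Map each bin number to the indices of avgs that fall into it
indices_by_bin = {b: np.flatnonzero(binplace == b) for b in range(1, len(bins))}

print(indices_by_bin[1])  # [ 0  6 15]
print(indices_by_bin[2])  # [ 2 10 16]
```

The dictionary grows or shrinks with len(bins), so nothing needs to change when the inputs do.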

Grouping 2D numpy array in average

I am trying to reduce a numpy array to a smaller size by averaging its elements, for example taking the average of each 5x5 sub-array in a 100x100 array to create a 20x20 array. As I have a huge amount of data to manipulate, is there an efficient way to do that?
I have only tried this on smaller arrays, so test it with yours:
import numpy as np
nbig = 100
nsmall = 20
big = np.arange(nbig * nbig).reshape([nbig, nbig]) # 100x100
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
An example with 6x6 -> 3x3:
nbig = 6
nsmall = 3
big = np.arange(36).reshape([6,6])
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])
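The same reshape trick can be wrapped in a small helper that also handles non-square arrays and averages both block axes in one call. This is just a sketch; the function name block_mean and its arguments are made up here:

```python
import numpy as np

def block_mean(a, block_h, block_w):
    """Average a 2D array over non-overlapping blocks of size block_h x block_w."""
    h, w = a.shape
    if h % block_h or w % block_w:
        raise ValueError("array shape must be divisible by the block shape")
    # Axes 1 and 3 index the position inside each block
    return a.reshape(h // block_h, block_h, w // block_w, block_w).mean(axis=(1, 3))

big = np.arange(36).reshape(6, 6)
print(block_mean(big, 2, 2))
# [[ 3.5  5.5  7.5]
#  [15.5 17.5 19.5]
#  [27.5 29.5 31.5]]
```

No Python-level loop is involved, so it should scale well to large arrays.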
This is pretty straightforward, although I feel like it could be faster:
from __future__ import division
import numpy as np
Norig = 100
Ndown = 20
step = Norig//Ndown
assert step == Norig/Ndown # ensure Ndown is an integer factor of Norig
x = np.arange(Norig*Norig).reshape((Norig,Norig)) #for testing
y = np.empty((Ndown,Ndown)) # for testing
for yr, xr in enumerate(np.arange(0, Norig, step)):
    for yc, xc in enumerate(np.arange(0, Norig, step)):
        y[yr, yc] = np.mean(x[xr:xr+step, xc:xc+step])
You might also find scipy.signal.decimate interesting. It applies a more sophisticated low-pass filter than simple averaging before downsampling the data, although you'd have to decimate one axis, then the other.
Average a 2D array over subarrays of size NxN (average and split are the numpy functions):
from numpy import average, split
height, width = data.shape
data = average(split(average(split(data, width // N, axis=1), axis=-1), height // N, axis=1), axis=-1)
Note that eumiro's approach does not work for masked arrays as .mean(3).mean(1) assumes that each mean along axis 3 was computed from the same number of values. If there are masked elements in your array, this assumption does not hold any more. In that case, you have to keep track of the number of values used to compute .mean(3) and replace .mean(1) by a weighted mean. The weights are the normalized number of values used to compute .mean(3).
Here is an example:
import numpy as np
def gridbox_mean_masked(data, Nbig, Nsmall):
    # Reshape data
    rshp = data.reshape([Nsmall, Nbig//Nsmall, Nsmall, Nbig//Nsmall])
    # Compute mean along axis 3 and remember the number of values each mean
    # was computed from
    mean3 = rshp.mean(3)
    count3 = rshp.count(3)
    # Compute weighted mean along axis 1
    mean1 = (count3*mean3).sum(1)/count3.sum(1)
    return mean1
# Define test data
big = np.ma.array([[1, 1, 2],
                   [1, 1, 1],
                   [1, 1, 1]])
big.mask = [[0, 0, 0],
            [0, 0, 1],
            [0, 0, 0]]
Nbig = 3
Nsmall = 1
# Compute gridbox mean
print(gridbox_mean_masked(big, Nbig, Nsmall))
