Given some SymPy matrix M
M = Matrix([
[0.000111334436666596, 0.00114870370895408, -0.000328330524152990, 5.61388353859808e-6, -0.000464532588930332, -0.000969955779635878, 1.70579589853818e-5, -5.77891177019884e-6, -0.000186812539472235, -2.37115911398055e-5],
[-0.00105346453420510, 0.000165063406707273, -0.00184449574409890, 0.000658080565333929, 0.00197652092300241, 0.000516180213512589, 9.53823860082390e-5, 0.000189858427211978, -3.80494288487685e-5, 0.000188984043643408],
[-0.00102465075104153, -0.000402915220398109, 0.00123785300884241, -0.00125808154543978, 0.000126618511490838, 0.00185985865307693, 0.000123626008509804, 0.000211557638637554, 0.000407232404255796, 1.89851719447102e-5],
[0.230813497584639, -0.209574389008468, 0.742275067362657, -0.202368828927654, -0.236683258718819, 0.183258819107153, 0.180335891933511, -0.530606389541138, -0.379368598768419, 0.334800403899511],
[-0.00102465075104153, -0.000402915220398109, 0.00123785300884241, -0.00125808154543978, 0.000126618511490838, 0.00185985865307693, 0.000123626008509804, 0.000211557638637554, 0.000407232404255796, 1.89851719447102e-5],
[0.00105346453420510, -0.000165063406707273, 0.00184449574409890, -0.000658080565333929, -0.00197652092300241, -0.000516180213512589, -9.53823860082390e-5, -0.000189858427211978, 3.80494288487685e-5, -0.000188984043643408],
[0.945967255845168, -0.0468645728473480, 0.165423896937049, -0.893045423193559, -0.519428986944650, -0.0463256408085840, -0.0257001217930424, 0.0757328764368606, 0.0541336731317414, -0.0477734271777646],
[-0.0273371493900004, -0.954100482348723, -0.0879282784854250, 0.100704543595514, -0.243312734473589, -0.0217088779350294, 0.900584332231093, 0.616061129532614, 0.0651163853434486, -0.0396603397583054],
[0.0967584768347089, -0.0877680087304911, -0.667679934757176, -0.0848411039101494, -0.0224646387789634, -0.194501966574153, 0.0755161040544943, 0.699388977592066, 0.394125039254254, -0.342798611994521],
[-0.000222668873333193, -0.00229740741790816, 0.000656661048305981, -1.12277670771962e-5, 0.000929065177860663, 0.00193991155927176, -3.41159179707635e-5, 1.15578235403977e-5, 0.000373625078944470, 4.74231822796110e-5]
])
I have calculated SymPy rank() and rref() of the matrix. Rank is 7 and rref() result is:
Matrix([
[1, 0, 0, 0, 0, 0, 0, -5.14556976678473, -3.72094268951566, 3.48581267477014],
[0, 1, 0, 0, 0, 0, 0, -5.52930150663022, -4.02230308325653, 3.79193678096199],
[0, 0, 1, 0, 0, 0, 0, 2.44893308665325, 1.83777402439421, -1.87489784909824],
[0, 0, 0, 1, 0, 0, 0, -7.33732284392352, -5.25036238623229, 4.97256759287563],
[0, 0, 0, 0, 1, 0, 0, 5.48049237370489, 3.90091366576548, -3.83642187384021],
[0, 0, 0, 0, 0, 1, 0, -10.6826798792866, -7.56560803870182, 7.45974067056387],
[0, 0, 0, 0, 0, 0, 1, -3.04726210012149, -2.66388837034592, 2.48327234504403],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
The weird thing is that if I calculate the rank with either NumPy or MATLAB I get 6, and calculating the rref with MATLAB gives the expected result: the last 4 rows are all zero (instead of only the last 3).
Does anyone know where this difference comes from, and why I am unable to get correct results with SymPy? I know that rank 6 is correct because this is a system of equations in which some linear dependencies exist.
Looking at the eigenvalues of your matrix, the rank is indeed 6:
array([ 1.14550481e+00+0.00000000e+00j, -1.82137718e-01+6.83443168e-01j,
-1.82137718e-01-6.83443168e-01j, 2.76223053e-03+0.00000000e+00j,
-3.51138883e-04+8.61508469e-04j, -3.51138883e-04-8.61508469e-04j,
5.21160131e-17+0.00000000e+00j, -2.65160469e-16+0.00000000e+00j,
-2.67753616e-18+9.70937977e-18j, -2.67753616e-18-9.70937977e-18j])
With the SymPy version I have, I even get a rank of 8, compared with the rank of 6 that NumPy returns.
In fact, SymPy cannot compute the eigenvalues of this matrix because of its size (probably related to "SymPy could not compute the eigenvalues of this matrix").
So one of them, SymPy, tries to solve the equations symbolically and find the rank from imperfect floating-point numbers, whereas the other, NumPy, works numerically (LAPACK, IIRC) and treats anything below a tolerance as zero. With an adequate threshold NumPy finds the proper rank, but a different threshold could have given a different answer. SymPy tries to find the rank of a floating-point approximation of an exactly rank-6 system and finds that it is of rank 7 or 8. That is not surprising given the floating-point differences (for instance, SymPy moves to integers when trying to find the eigenvalues instead of staying in the floating-point realm).
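To make the role of the threshold concrete, here is a minimal sketch. It assumes a small 3x3 stand-in matrix (not the 10x10 matrix from the question) whose third row is an exact linear combination of the first two. `np.linalg.matrix_rank` counts singular values above a tolerance; the explicit `threshold` value is an arbitrary choice made for illustration.

```python
import numpy as np

# Illustrative stand-in: row 3 = 2 * row 2 - row 1, so the exact rank is 2.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Singular values in descending order; the last one is tiny,
# but in floating point it is usually not exactly zero.
singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values)

# Default tolerance: derived from the largest singular value and machine epsilon.
rank_default = np.linalg.matrix_rank(A)

# Explicit cut-off (hypothetical value): count singular values above it.
threshold = 1e-10
rank_threshold = int((singular_values > threshold).sum())

print(rank_default, rank_threshold)
```

A row reduction that treats every non-zero float as a valid pivot has no such cut-off, which is one plausible way to end up with a rank that is too high.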
I am using Random Forest as a binary classifier for a dataset and the results just don't seem believable, but I can't find where the problem is.
The problem lies in the fact that the examples are clearly not separable by setting a threshold, as the values for the feature of interest for the positive/negative examples are highly homogeneous. When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right? If that's the case, how can the code below result in perfect performance on the test set?
P.S. In practice I have many more than the ~30 examples shown below, but only included these as an example. Same performance when evaluating >100.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X_train = np.array([0.427948, 0.165065, 0.31179, 0.645415, 0.125764,
0.448908, 0.417467, 0.524891, 0.038428, 0.441921,
0.927511, 0.556332, 0.243668, 0.565939, 0.265502,
0.122271, 0.275983, 0.60786, 0.670742, 0.565939,
0.117031, 0.117031, 0.001747, 0.148472, 0.038428,
0.50393, 0.49607, 0.148472, 0.275983, 0.191266,
0.254148, 0.430568, 0.198253, 0.323144, 0.29869,
0.344978, 0.524891, 0.323144, 0.344978, 0.28821,
0.441921, 0.127511, 0.31179, 0.254148, 0, 0.001747,
0.243668, 0.281223, 0.281223, 0.427948, 0.548472,
0.927511, 0.417467, 0.282969, 0.367686, 0.198253,
0.572926, 0.29869, 0.570306, 0.183406, 0.310044,
1, 1, 0.60786, 0, 0.282969, 0.349345, 0.521106,
0.430568, 0.127511, 0.50393, 0.367686, 0.310044,
0.556332, 0.670742, 0.30393, 0.548472, 0.193886,
0.349345, 0.122271, 0.193886, 0.265502, 0.537991,
0.165065, 0.191266])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1,
0, 0, 1, 0, 0, 0, 0])
X_test = np.array((0.572926, 0.521106, 0.49607, 0.570306, 0.645415,
0.125764, 0.448908, 0.30393, 0.183406, 0.537991))
y_test = np.array((1, 1, 1, 0, 0, 0, 1, 1, 0, 0))
# Instantiate model and set parameters
clf = RandomForestClassifier()
clf.set_params(n_estimators=500, criterion='gini', max_features='sqrt')
# Note: reshape is needed because RF requires column-vector format,
# but the default NumPy array is a row
clf.fit(X_train.reshape(-1, 1), y_train)
pred = clf.predict(X_test.reshape(-1, 1))
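# Sort by feature value for comparison.
# Columns printed: index, feature value, true label (y_test), prediction (pred);
# note that the header calls the last two 'Y_test' and 'Y_true'.
o = np.argsort(X_test)
print('Example#\tX\t\t\tY_test\tY_true')
for i in o:
    print('%d\t\t\t%f\t%d\t%d' % (i, X_test[i], y_test[i], pred[i]))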
Which then returns:
Example# X Y_test Y_true
5 0.125764 0 0
8 0.183406 0 0
7 0.303930 1 1
6 0.448908 1 1
2 0.496070 1 1
1 0.521106 1 1
9 0.537991 0 0
3 0.570306 0 0
0 0.572926 1 1
4 0.645415 0 0
How can an RF model with a single feature possibly discriminate these examples? Isn't there something wrong? I've looked into the configuration of the classifier and can't find any problems. I thought it might be a problem of overfitting (although I'm doing 10-fold cross-validation, so that seems less likely), but then I came across this quote on the official webpage for Random Forest classification: "Random forests does not overfit. You can run as many trees as you want." (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#remarks)
When only a single feature is used for binary classification, RF should only be able to discriminate between examples by setting an absolute threshold for positive/negative identification, right?
Each branch can discriminate by only one threshold, but each tree is built from several branches. If the X-space can be split into several intervals such that each interval has the same y-value, then as long as the classifier has enough data to find the boundaries of those intervals, it will be able to predict the test set. However, I noticed that your "test" set seems to be a subset of your training set, which defeats the purpose of having a test set. Of course, if you test on the same data that you trained on, the accuracy will be high. Try sorting your data by X-value, then picking X-values that aren't in your training set but lie between two adjacent X_train values that have different y-values, for instance x=.001. You should see the accuracy plummet.
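A quick way to check for this kind of overlap is sketched below. The helper name `leaked_fraction` and the toy arrays are assumptions made for illustration; they are not the full arrays from the question.

```python
import numpy as np

def leaked_fraction(x_train, x_test):
    """Fraction of test values that also occur, exactly, among the training values."""
    x_train = np.asarray(x_train)
    x_test = np.asarray(x_test)
    return float(np.isin(x_test, x_train).mean())

# Toy values for illustration only.
x_train_toy = np.array([0.10, 0.20, 0.30, 0.40, 0.50])
x_test_leaky = np.array([0.20, 0.40])   # both values were seen in training
x_test_fresh = np.array([0.25, 0.45])   # neither value was seen in training

print(leaked_fraction(x_train_toy, x_test_leaky))  # 1.0
print(leaked_fraction(x_train_toy, x_test_fresh))  # 0.0
```

A fraction near 1.0 means the reported accuracy mostly measures how well the model memorised its training points.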
The Scenario:
I'm performing clustering on the MovieLens dataset, which I have in 2 formats:
OLD FORMAT:
uid iid rat
941 1 5
941 7 4
941 15 4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4
NEW FORMAT:
uid 1 2 3 4
1 5 3 4 3
2 4 3.6185548023 3.646073985 3.9238342172
3 2.8978348799 2.6692556753 2.7693015618 2.8973463681
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
940 3.7996234581 3.4979386925 3.5707888503 2
941 5 NaN NaN NaN
942 4.5762594612 4.2752554573 4.2522440019 4.3761477591
943 3.8252406362 5 3.3748860659 3.8487417604
over which I need to perform Clustering using KMeans, DBSCAN and HDBSCAN.
With KMeans I'm able to set and get clusters.
The Problem
The problem occurs only with DBSCAN and HDBSCAN: I'm unable to get a sufficient number of clusters (I do know that the number of clusters cannot be set manually).
Techniques Tried:
Tried this with the Iris dataset, where I found that Species isn't included (it is a string and, besides, is the value to be predicted); everything works fine with that dataset (Snippet 1).
Tried with the MovieLens 100K dataset in OLD FORMAT (with and without UID): by analogy with Iris I treated UID like Species, and hence also tried without it (Snippet 2).
Tried the same with NEW FORMAT (with and without UID), yet the results were the same.
Snippet 1:
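import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

print "\n\n FOR IRIS DATA-SET:"
iris = load_iris()
dbscan = DBSCAN()  # default parameters (eps=0.5, min_samples=5)
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)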
Snippet 1 (Output):
FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]:
array([ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, -1, -1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, -1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, -1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Snippet 2:
import pandas as pd
from sklearn.cluster import DBSCAN
data_set = pd.DataFrame
ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
# Compare integers with ==, not with the identity operator 'is'
if ch == 1:
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
    data_set = data_set.iloc[:, 1:]  # drop the first column
    data_set = data_set.dropna()     # drop rows containing NaN
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
else:
    print "Enter Proper choice!"
print "Starting with DBSCAN for Clustering on\n", data_set.info()
db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
Snippet 2 (Output):
Extended Cluster Methods for:
1. Main Matrix IBCF
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])
As seen, it returns only one label, -1 (which DBSCAN uses for noise). I'd like to hear what I am doing wrong.
As pointed out by @faraway and @Anony-Mousse, the solution has more to do with the mathematics of the dataset than with programming.
I could finally figure out the clusters. Here's how:
import numpy as np  # needed for np.unique below
db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)
# Count how many points fall into each label (-1 is noise)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d
The epsilon and outlier concepts became much clearer after reading this SO question: How can I choose eps and minPts (two parameters for DBSCAN algorithm) for efficient results?.
You need to choose appropriate parameters. With too small an epsilon, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen separately for each data set.
You also need to preprocess your data.
It's trivial to get "clusters" with kmeans that are meaningless...
Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.
First, you need to preprocess your data, removing any useless attributes such as ids, as well as incomplete instances (in case your chosen distance measure can't handle them).
It's good to understand that these algorithms come from two different paradigms: centroid-based (KMeans) and density-based (DBSCAN and HDBSCAN*). While centroid-based algorithms usually take the number of clusters as an input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).
In the literature, the number of neighbors (minPts) is normally set to 4 and the radius (eps) is found by analyzing different values. You may find HDBSCAN* easier to use, as you only need to specify the number of neighbors (minPts).
If, after trying different configurations, you are still getting useless clusterings, maybe your data doesn't have clusters at all and the KMeans output is meaningless.
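One common heuristic for "analyzing different values" of eps is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbour, sort those distances, and look for the value where the curve bends sharply. The sketch below is a NumPy-only illustration on toy data; the helper name `k_distances` and the toy points are assumptions, and the bend still has to be judged by eye or by plotting.

```python
import numpy as np

def k_distances(points, k=4):
    """Sorted distance from each point to its k-th nearest neighbour (itself excluded)."""
    points = np.asarray(points, dtype=float)
    sq_norms = (points ** 2).sum(axis=1)
    # Squared Euclidean distances via the Gram matrix; clip tiny negatives from rounding.
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, 0.0)
    dist = np.sqrt(np.maximum(d2, 0.0))
    dist.sort(axis=1)              # column 0 is each point's distance to itself (0.0)
    return np.sort(dist[:, k])

# Toy data: one dense blob plus two far-away points that should look like noise.
rng = np.random.default_rng(0)
dense_blob = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
far_points = np.array([[5.0, 5.0], [-5.0, 4.0]])
toy = np.vstack([dense_blob, far_points])

kd = k_distances(toy, k=4)
print(kd[:5])    # small values: points inside the blob
print(kd[-2:])   # large values: the two isolated points
```

An eps just above the flat part of the curve keeps the blob together while leaving the isolated points as noise. On high-dimensional rating data the distances are typically much larger than 0.5, which would be consistent with the default eps labelling everything as noise.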
Have you tried looking at how the clusters appear in 2D space, e.g. using PCA? If the whole dataset is dense and actually forms a single group, you will probably get a single cluster.
Try changing other parameters, such as min_samples=5, algorithm and metric. The possible values of algorithm and metric can be checked in sklearn.neighbors.VALID_METRICS.
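The 2D inspection suggested above can be sketched with plain NumPy. The function name `pca_2d` and the random stand-in matrix are assumptions made for illustration; the real input would be the cleaned rating matrix with no NaN values.

```python
import numpy as np

def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in for a small user-by-item rating matrix (values between 1 and 5).
rng = np.random.default_rng(0)
toy_ratings = rng.uniform(1.0, 5.0, size=(30, 8))

projected = pca_2d(toy_ratings)
print(projected.shape)  # (30, 2)
```

Scattering the two columns of `projected` with any plotting library gives a rough picture of whether the data separates into groups or forms one dense mass.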
I have 2d binary numpy arrays of varying size, which contain certain patterns.
Just like this:
import numpy
a = numpy.zeros((6,6), dtype=int)  # numpy.int was removed in recent NumPy releases
a[1,2] = a[1,3] = 1
a[4,4] = a[5,4] = a[4,3] = 1
Here the "image" contains two patches one with 2 and one with 3 connected cells.
print a
array([[0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 1, 0]])
I want to know how often a non-zero cell borders another non-zero cell ( neighbours defined as rook's case, so the cells to the left, right, below and above each cell) including their pseudo-replication (so vice-versa).
A previous approach for inner boundaries returns wrong values (5) as it was intended to calculate outer boundaries.
numpy.abs(numpy.diff(a, axis=1)).sum()
So for the above test array, the correct total result would be 6 (The upper patch has two internal borders, the lower four ).
Grateful for any tips!
EDIT:
Mistake: The lower obviously has 4 internal edges (neighbouring cells with the same value)
Explained the desired neighbourhood a bit more
I think the result is 8 if it's an 8-connected neighborhood. Here is the code:
import numpy as np
from scipy.ndimage import convolve
a = np.zeros((6,6), dtype=int)
a[1,2] = a[1,3] = 1
a[4,4] = a[5,4] = a[4,3] = 1
# 3x3 kernel of ones with the centre removed: counts the 8 surrounding cells
kernel = np.ones((3, 3))
kernel[1, 1] = 0
b = convolve(a, kernel, mode="constant")
# sum the neighbour counts over the non-zero cells only
b[a != 0].sum()
but you said rook's case.
edit
Here is the code for 4-connected neighborhood:
import numpy as np
from scipy import ndimage
a = np.zeros((6,6), dtype=int)
a[1,2] = a[1,3] = 1
a[4,4] = a[5,4] = a[4,3] = 1
# cross-shaped structure (rook's case) with the centre removed
kernel = ndimage.generate_binary_structure(2, 1)
kernel[1, 1] = 0
b = ndimage.convolve(a, kernel, mode="constant")
# sum the neighbour counts over the non-zero cells only
b[a != 0].sum()
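For the rook's case the same count can also be obtained without SciPy by comparing the array with shifted copies of itself. This is a sketch of an alternative, not the method used in the answer above; the variable names are illustrative.

```python
import numpy as np

a = np.zeros((6, 6), dtype=int)
a[1, 2] = a[1, 3] = 1
a[4, 4] = a[5, 4] = a[4, 3] = 1

nz = a != 0

# Pairs of non-zero cells that touch left-right and up-down.
horizontal_pairs = int((nz[:, :-1] & nz[:, 1:]).sum())
vertical_pairs = int((nz[:-1, :] & nz[1:, :]).sum())

# Each adjacent pair is counted once from each side ("vice versa").
total = 2 * (horizontal_pairs + vertical_pairs)
print(total)  # 6
```

Here the upper patch contributes one horizontal pair and the lower patch one horizontal and one vertical pair, giving 3 pairs and a doubled total of 6, as requested in the question.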
I tried a Machine Learning algorithm on a hypothetical problem :-
I made a fake feature vector and a fake result data set by the following python code :-
x=[]
y=[]
for i in range(0,100000):
    mylist=[]
    mylist.append(i)
    mylist.append(i)
    x.append(mylist)
    if(i%2)==0:
        y.append(0)
    else:
        y.append(1)
The above code gives me 2 python lists, namely,
x = [[0,0],[1,1],[2,2]....and so on] #this list contains the fake feature vector, with 2 same numbers
y = [0,1,0..... and so on] #this has the fake test labels, 0 for even, 1 for odd
I think the test data is good enough for a ML algorithm to learn. I use the following python code to train a couple of different machine learning models.
Approach 1 : Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x,y)
x_pred = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10],[11,11],[12,12],[13,13],[14,14],[15,15],[16,16]]
y_pred=gnb.predict(x_pred)
print y_pred
I get the following incorrect output, the classifier fails to predict :-
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Approach 2 : Support Vector Machines
from sklearn import svm
clf = svm.SVC()
clf.fit(x, y)
x_pred = [[1,1],[2,2],[3,3],[4,4],[5,5],[6,6],[7,7],[8,8],[9,9],[10,10],[11,11],[12,12],[13,13],[14,14],[15,15],[16,16]]
y_pred=clf.predict(x_pred)
print y_pred
I get the following correct output, the classifier predicts correctly :-
[1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0]
Can someone shed light on this and explain why one approach had 50% accuracy and the other had 100% accuracy?
Let me know if this question is tagged with a wrong category.
Naive Bayes is a parametric model: it tries to summarize your training set in nine parameters, the class prior (50% for either class) and the per-class, per-feature means and variances. However, your target value y is not a function of the means and variances of the inputs x in any way,(*) so the parameters are irrelevant and the model resorts to what is effectively random guessing.
By contrast, the support vector machine remembers its training set and compares new inputs to its training inputs using a kernel function. It's supposed to pick a subset of its training samples, but for this problem it's forced to just remember all of them:
>>> x = np.vstack([np.arange(100), np.arange(100)]).T
>>> y = x[:, 0] % 2
>>> from sklearn import svm
>>> clf = svm.SVC()
>>> clf.fit(x, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> clf.support_vectors_.shape
(100, 2)
Since you're using test samples that occurred in the training set, all it has to do is look up the label that the samples you presented had in the training set and return those, so you get 100% accuracy. If you feed the SVM samples outside of the training set, you'll see that it too starts guessing randomly:
>>> clf.predict(x * 2)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1])
Since multiplying by two makes all the features even, the true labeling would have been all zero and the accuracy is 50%: the accuracy of a random guess.
(*) Actually there is some dependence in the training set, but that drops off with more data.
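To see why the Gaussian summary carries almost no information here, the per-class statistics of one feature can be computed directly. This is a sketch using the same even/odd construction as the question, written with NumPy rather than the original loop.

```python
import numpy as np

values = np.arange(100000)
labels = values % 2          # 0 for even, 1 for odd, as in the question

even = values[labels == 0]
odd = values[labels == 1]

# Per-class mean and standard deviation of the feature.
print(even.mean(), odd.mean())   # 49999.0 50000.0
print(even.std(), odd.std())     # identical spreads
```

The two classes have the same spread and means that differ by only 1 on a range of 100000, so the fitted Gaussians are practically indistinguishable. The slightly lower mean of the even class is the small dependence mentioned in the footnote, and it is consistent with the classifier answering 0 for every small input.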