varying number of clusters without recalculating tree every time - python

We are using sklearn in Python and trying to run agglomerative clustering (Ward's) over a range of cluster numbers (i.e. N = 2-9) with compute_full_tree, without having to re-compute the tree for each individual value of N, by using the cache. This was answered in an old post from 2016, but that answer doesn't seem to work anymore (see "sklearn agglomerative clustering: dynamically updating the number of clusters").
In other words, we want to run fit over different values of N without re-clustering every time. However, we are getting syntax errors and are not able to retrieve the labels for any of the clusters stored in the cache afterwards. The code looks something like:
x = AgglomerativeClustering(memory="mycachedir", compute_full_tree=True)
but x.fit_predict(inputDF{2}) is a syntax error, and we can't work out the correct call for accessing the cached results.
Does anybody know the syntax for retrieving labels from the cache in this scenario? Thanks.
P.S. I'm a newbie, so apologies in advance if I am not being clear.
We expect to run clustering on a given input array and retrieve the labels of each cluster as we vary the number of clusters N over a range, using the cache rather than re-computing the tree every time.
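For reference, a minimal sketch of the pattern the 2016 answer seems to describe. Note that inputDF here stands for whatever array or DataFrame is being clustered (it is not a dict, so inputDF{2} is invalid syntax); with memory set, refitting with a new n_clusters should reuse the cached tree computation:

from sklearn.cluster import AgglomerativeClustering

# inputDF is assumed to be the (n_samples, n_features) data being clustered
x = AgglomerativeClustering(memory="mycachedir", compute_full_tree=True)
labels_by_n = {}
for n in range(2, 10):
    x.set_params(n_clusters=n)
    labels_by_n[n] = x.fit_predict(inputDF)  # tree build is cached after the first fit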

The sklearn API is badly suited for this.
It's much better to use agglomerative clustering from scipy, because there it consists of two steps: building the linkage/dendrogram, and then extracting a flat clustering from it. The first step is O(n³) with Ward, but the second step is only O(n), I think. A similar approach can be found in ELKI, too. Unfortunately, sklearn follows the narrow "fit-predict" view originating from classification, and that does not support such a two-step approach.
There is also other functionality available in scipy but not in sklearn, if I am not mistaken. Just have a look.
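For illustration, a minimal sketch of the two-step scipy approach, assuming X is a plain (n_samples, n_features) array:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 4)       # stand-in for the real data
Z = linkage(X, method="ward")    # expensive step: build the dendrogram once
# cheap step: cut the same tree into N flat clusters for each N of interest
labels_by_n = {n: fcluster(Z, t=n, criterion="maxclust") for n in range(2, 10)}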

Related

How to set good parameters clustering high density data with DBSCAN?

I want to cluster some stars based on their given positions (X, Y, Z) using DBSCAN, but I do not know how to adjust the parameters to get the right number of clusters so I can plot them afterward.
This is what the data looks like:
What are the right parameters for this data?
The number of rows is 1,202,672.
import pandas as pd
from sklearn.cluster import DBSCAN

data = pd.read_csv('datasets/full_dataset.csv')
# note: this only constructs the estimator; .fit(data) must be called to actually cluster
clusters = DBSCAN(eps=0.5, min_samples=40, metric="euclidean", algorithm="auto")
min_samples is arguably one of the tougher ones to choose, but you can decide it by just looking at the results and deciding how much noise you are okay with.
Choosing eps can be aided by running k-NN to understand the density distribution of your data. I believe the DBSCAN paper discusses this in more detail. There may even be a way to plot this in Python (in R it is kNNdistplot).
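As an illustration of that suggestion, here is a sketch of a k-distance plot (a rough Python analogue of R's kNNdistplot); the X, Y, Z column names are assumptions carried over from the question:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 40  # match min_samples
coords = data[["X", "Y", "Z"]].to_numpy()
dist, _ = NearestNeighbors(n_neighbors=k).fit(coords).kneighbors(coords)
plt.plot(np.sort(dist[:, -1]))  # each point's distance to its k-th neighbor, sorted
plt.xlabel("points, sorted by k-distance")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()  # the 'elbow' of this curve is a reasonable eps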
I would prefer to use OPTICS, which essentially evaluates all eps values simultaneously. However, I haven't found a decent implementation of it in either Python or R. In fact, there is an incorrect implementation in Python which doesn't follow the original OPTICS paper at all.
If you really want to use OPTICS, I recommend the Java implementation available in ELKI.
If anyone else has heard of a proper Python implementation, I'd love to hear about it.
If you want to go the trial-and-error route, start with a much smaller eps and go from there.

Cluster analysis algorithm for identifying line clusters on a map

I have a reasonably large set of (r,g,b)-colored data points with (x,y)-coordinates that looks like this:
Before committing them to my database, I'd like to automatically identify all point clusters (most of which look like lines) and assign a category to each colored point according to the cluster it belongs to.
According to the scikit-learn roadmap I should be using either Meanshift or Gaussian mixture models, but I'd like to know if there is any solution available that will also take into account that nearby points that share similar colors are more likely to belong to the same cluster.
I have access to a GPU so any kind of solution is welcome, even if it's based on deep learning.
I tried @mcdowella's answer and it worked surprisingly well. I ran it over the higher-dimensional version of these points (which were generated through t-SNE), using the HDBSCAN Robust Single Linkage implementation, and it approximated many lines without any parameter tuning.
I would try single-linkage clustering (https://en.wikipedia.org/wiki/Single-linkage_clustering) - it has a tendency to follow lines, which is sometimes even a disadvantage for people who want nice compact rounded clusters and instead get straggling spaghetti (there is a nice picture on p. 7 of https://www.stat.cmu.edu/~cshalizi/350/lectures/08/lecture-08.pdf).
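A minimal scikit-learn sketch of that suggestion; the coordinates and the distance_threshold value are placeholders, and to factor in color, scaled r, g, b columns could be appended to the feature matrix:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

points = np.random.rand(500, 2)  # stand-in for the real (x, y) data
single = AgglomerativeClustering(linkage="single", n_clusters=None,
                                 distance_threshold=0.05)  # threshold needs tuning
labels = single.fit_predict(points)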

method for implementing regression tree on raster data - python

I'm trying to build and apply a regression tree algorithm to some raster data in Python, and can't seem to find the best way to do so. I will attempt to explain what I'm trying to do:
My desired output is a raster image, whose values represent lake depth, call it depth.tif. I have a series of raster images, each representing the reflectance values in different Landsat bands, say [B1.tif, B2.tif, ..., B7.tif] that I want to use as my independent variables to predict lake depth.
For my training data, I have a shapefile of ~6000 points of known lake depth. To create a tree, I extracted the corresponding reflectance values for each of those points, then exported that to a table. I then used that table in Weka, a machine-learning package, to create a 600-branch regression tree that predicts depth values from the set of reflectance values. But because the tree is so large, I can't write it out in Python manually. I ran into the python-weka-wrapper module so I can use Weka from Python, but I have gotten stuck on the raster part: since my data has an extra dimension (converted to an array, each independent variable is actually a set of ncolumns x nrows values instead of just a row of values, as in all of the examples), I don't know if it can do what I want. Among all the examples for python-weka-wrapper, I can't find one that deals with spatial data, and I think this is what's throwing me off.
To clarify: I want to use the training data (which is a point shapefile/table right now, but can, if necessary, be converted into a raster of the same size as the reflectance rasters, with no data in all cells except the few points where I have known depth data) to build a regression tree that will use the reflectance rasters to predict depth. Then I want to apply that tree to the same set of reflectance rasters, in order to obtain a raster of predicted depth values everywhere.
I realize this is confusing and I may not be doing the best job of explaining it. I am open to options besides Weka in Python, such as sklearn, as long as they are open source. My question is: can what I described be done? I'm pretty sure it can, as it's very similar to image classification, except that the target values (depth) are continuous rather than discrete classes, but so far I have failed. If so, what is the best/most straightforward method, and are there any examples that might help?
Thanks
I have had some experience using Landsat data for the prediction of environmental properties of soil, which seems somewhat related to the problem you have described above. Although I developed my own models at the time, I can describe the general process I went through to map the predicted data.
For the training data, I extracted the Landsat values (in addition to other properties) at the spatial points where known soil samples were taken. This way, I could use the Landsat data as inputs for predicting the environmental data. A part of this data was also reserved for testing, to confirm that the trained models were not overfitting the training data and predicted the outputs well.
Once this process was completed, it was possible to map the desired area by getting the spatial information at each point of the area (matching the resolution of the desired image). From there, you should be able to feed these Landsat factors into the model for prediction, and use the output to map the predicted depth. You could likely just use Weka in this case to predict all of the cases, then use another tool to build the map from your estimates.
I believe I whipped up some code long ago to extract each of my required factors in ArcGIS, but it's been a while since I did that. There should be some good tutorials out there that can help you in that direction.
I hope this helps in your particular situation.
It sounds like you are not using any spatial information to build your tree (such as information on neighboring pixels), just reflectance. So you can apply your decision tree to the pixels as if the pixels were all in a one-dimensional list or array; see the sketch below.
A 600-branch tree for a 6000-point training data file seems like it may be overfit. Consider putting in an option that requires the tree to stop splitting when there are fewer than N points at a node, or something similar. There may be a pruning factor that can be set as well. You can test different settings until you find the one that gives you the best statistics from cross-validation or a held-out test set.
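A rough sklearn sketch of that idea, assuming rasterio is available, the band rasters are aligned, and X_train / y_train are the reflectance values and depths already extracted at the ~6000 known points:

import numpy as np
import rasterio
from sklearn.tree import DecisionTreeRegressor

bands = [rasterio.open("B%d.tif" % i).read(1) for i in range(1, 8)]
stack = np.stack(bands, axis=-1)               # (nrows, ncols, 7)
nrows, ncols, nbands = stack.shape
X_all = stack.reshape(-1, nbands)              # one row of band values per pixel

# X_train, y_train: assumed already extracted from the point shapefile/table
tree = DecisionTreeRegressor(min_samples_leaf=20)  # limit splitting to curb overfit
tree.fit(X_train, y_train)

depth = tree.predict(X_all).reshape(nrows, ncols)  # raster of predicted depths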

K-means algorithm suitable?

I am writing a Python script to analyse some data captured from a device, and I want to automate the task of finding out whether the data matches a certain pattern. In the image linked below, I want to determine with a script whether the given set of captured data can be categorized into 3 different clusters [as shown]. The ranges of these clusters are not predefined. All I want to know is whether I see three different clusters in my data that are reasonably far apart from each other; if not, my test fails. I am just trying to figure out the best data-analysis algorithm to use here. I was reading about clustering algorithms and was going to start with k-means clustering, but does anyone have a better idea?
Example set of captured data (note the color-coded clusters): http://imgur.com/I4jMqpk
The better idea is to start with a good problem statement. If you are not able to strictly define what you are looking for, then no method is suitable; if you can write down exactly what you need, then you can search for a solution. Clustering methods are quite weird objects: they will always "succeed", in that they will always cluster your data, often in a way which is completely unacceptable to a human being. If your data looks like what you plotted (a 2D case, with points forming "dense" point clouds), then the most appropriate approach seems to be something like DBSCAN/OPTICS: a very simple method which will result in more "human-like" clusters (as opposed to k-means, which won't divide your data into those "clouds", but rather will often split them).
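A minimal sketch of that suggestion, assuming points is an (n, 2) array of the captured data; the eps and min_samples values are guesses that would need tuning:

from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in set(labels) else 0)  # -1 labels noise
test_passed = (n_clusters == 3)  # the test fails unless exactly three clusters emerge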

Utilising Genetic algorithm to overcome different size datasets in model

So I realise the question I am asking here is large and complex: a potential solution to variances in the sizes of datasets.
In all of my searching through statistical forums and posts, I haven't come across a scientifically sound method of taking into account the type of data that I am encountering, but I have thought up a (novel?) potential solution to account perfectly (in my mind) for large and small datasets within the same model.
The proposed method involves using a genetic algorithm to alter two numbers defining a relationship between the size of the dataset making up an implied strike rate and the percentage of the implied strike rate to be used, with the target of the model being to maximise the homology of the number 1 in two columns of the following csv (ultra-simplified, but hopefully it demonstrates the principle).
Example data
Date,PupilName,Unique class,Achieved rank,x,y,x/y,Average xy
12/12/2012,PupilName1,UniqueClass1,1,3000,9610,0.312174818,0.08527
12/12/2012,PupilName2,UniqueClass1,2,300,961,0.312174818,0.08527
12/12/2012,PupilName3,UniqueClass1,3,1,3,0.333333333,0.08527
13/12/2012,PupilName1,UniqueClass2,1,2,3,0.666666667,0.08527
13/12/2012,PupilName2,UniqueClass2,2,0,1,0,0.08527
13/12/2012,PupilName3,UniqueClass2,3,0,5,0,0.08527
13/12/2012,PupilName4,UniqueClass2,4,0,2,0,0.08527
13/12/2012,PupilName5,UniqueClass2,5,0,17,0,0.08527
14/12/2012,PupilName1,UniqueClass3,1,1,2,0.5,0.08527
14/12/2012,PupilName2,UniqueClass3,2,0,1,0,0.08527
14/12/2012,PupilName3,UniqueClass3,3,0,5,0,0.08527
14/12/2012,PupilName4,UniqueClass3,4,0,6,0,0.08527
14/12/2012,PupilName5,UniqueClass3,5,0,12,0,0.08527
15/12/2012,PupilName1,UniqueClass4,1,0,0,0,0.08527
15/12/2012,PupilName2,UniqueClass4,2,1,25,0.04,0.08527
15/12/2012,PupilName3,UniqueClass4,3,1,29,0.034482759,0.08527
15/12/2012,PupilName4,UniqueClass4,4,1,38,0.026315789,0.08527
16/12/2012,PupilName1,UniqueClass5,1,12,24,0.5,0.08527
16/12/2012,PupilName2,UniqueClass5,2,1,2,0.5,0.08527
16/12/2012,PupilName3,UniqueClass5,3,13,59,0.220338983,0.08527
16/12/2012,PupilName4,UniqueClass5,4,28,359,0.077994429,0.08527
16/12/2012,PupilName5,UniqueClass5,5,0,0,0,0.08527
17/12/2012,PupilName1,UniqueClass6,1,0,0,0,0.08527
17/12/2012,PupilName2,UniqueClass6,2,2,200,0.01,0.08527
17/12/2012,PupilName3,UniqueClass6,3,2,254,0.007874016,0.08527
17/12/2012,PupilName4,UniqueClass6,4,2,278,0.007194245,0.08527
17/12/2012,PupilName5,UniqueClass6,5,1,279,0.003584229,0.08527
So I have created a tiny model dataset which contains some good examples of where my current methods fall short and of how I feel a genetic algorithm can be used to fix this. The dataset above contains 6 unique classes, and the ultimate objective of the algorithm is to create as high as possible a correspondence between the rank of an adjusted x/y and the achieved rank in column 3 (zero-based referencing). In UniqueClass1 we have two identical x/y values. These are comparatively large x/y values if you compare them with the average (note the average isn't calculated from this dataset), but it would be common sense to expect that 3000/9610 is more significant, and therefore more likely to have an achieved rank of 1, than 300/961. So what I want to do is make an adjusted x/y to overcome these differences in dataset sizes, using a logarithmic growth relationship defined by the equation:
adjusted xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average xy
where α is the only dynamic number.
If I can explain my logic a little and open myself up to (hopefully) constructive criticism: the equation describes an exponential growth relationship between the size of the dataset and the % of x/y contributing to the adjusted x/y. Essentially, what the above equation says is that as the dataset gets larger, the percentage of the original x/y used in the adjusted x/y gets larger, and whatever percentage is left is made up by the average xy. Hypothetically it could be 75% x/y and 25% average xy for 300/961, and 95%/5% for 3000/9610, creating an adjusted x/y which clearly reflects the different sample sizes.
To help with understanding: lowering α would produce a relationship whereby a larger dataset is required to achieve the same "% of xy contributed". Conversely, increasing α would produce a relationship whereby a smaller dataset is required to achieve the same "% of xy contributed".
So I have explained my logic. I am also open to code snippets to help me overcome the problem. I have plans to make a multitude of genetic/evolutionary algorithms in the future, and could really use a working example to pick apart and play with in order to help my understanding of how to utilise such abilities of Python. If additional detail or further clarification about the problem or methods is required, please do ask; I really want to be able to solve this problem, and future problems of this nature.
So after much discussion about the methods available to overcome the problem presented here, I have come to the conclusion that the best method would be a genetic algorithm to iterate α in order to maximise the homology/correspondence between the rank of an adjusted x/y and the achieved rank in column 3. It would be greatly appreciated if anyone is able to help in that department.
So, to clarify, this post is no longer a discussion about methodology.
I am hoping someone can help me produce a genetic algorithm to maximise the homology between the results of the equation
adjusted xy = (1 - exp(-y*α)) * (x/y) + (1 - (1 - exp(-y*α))) * Average xy
where adjusted xy applies to each row of the csv. Maximising homology could be achieved by minimising the difference between the rank of the adjusted xy (where the rank is computed within each Unique class only) and the Achieved rank.
Minimising this value would maximise the homology and essentially solve the problem presented to me of different-sized datasets. If any more information is required, please ask; I check this post about 20 times a day at the moment, so I should reply rather promptly. Many thanks, SMNALLY.
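To make the target concrete, here is a sketch of the fitness evaluation for one candidate α, using the column names from the csv above (lower is better; "example.csv" is a placeholder filename):

import numpy as np
import pandas as pd

df = pd.read_csv("example.csv")  # the data shown above

def fitness(alpha, df):
    w = 1 - np.exp(-df["y"] * alpha)                     # weight on the raw x/y
    adjusted = w * df["x/y"] + (1 - w) * df["Average xy"]
    # rank the adjusted x/y within each Unique class (largest value = rank 1)
    ranks = adjusted.groupby(df["Unique class"]).rank(ascending=False)
    return (ranks - df["Achieved rank"]).abs().sum()     # total rank disagreement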
The problem you are facing sounds to me like the "bias-variance dilemma", from a general point of view. In a nutshell, a more precise model favours variance (sensitivity to changes in a single training set), while a more general model favours bias (the model works for many training sets).
May I suggest not focusing on GAs, but looking at instance-based learning and advanced regression techniques. The Andrew Moore page at CMU is a good entry point.
And particularly those slides.
[EDIT]
After a second reading, here is my second understanding:
You have a set of example data with two related attributes X and Y.
You do not want X/Y to dominate when Y is small (it is considered less representative).
As a consequence, you want to "weight" the examples with an adapted value, adjusted_xy.
You want adjusted_xy to be related to a third attribute R (rank), in the sense that, per class, adjusted_xy is sorted like R.
To do so, you suggest framing it as an optimization problem, searching for the PARAMS of a given function F(X, Y, PARAMS) = adjusted_xy,
with the constraint that D = Distance(achieved rank for this class, rank of adjusted_xy for this class) is minimal.
Your question, at least for me, is in the field of attribute selection/attribute adaptation (I guess the dataset will later be used for supervised learning).
One problem that I see in your approach (if I have understood it well) is that, in the end, rank will be highly related to adjusted_xy, which will therefore bring no interesting supplementary information.
Once this is said, I think you surely know how a GA works. You have to:
define the content of the chromosome: this appears to be your alpha parameter;
define an appropriate fitness function.
The fitness function for one individual can be a sum of distances over all examples of the dataset.
As you are dealing with real values, other metaheuristics such as Evolution Strategies (ES) or simulated annealing may be better adapted than a GA (a minimal sketch follows below).
As solving optimization problems is CPU-intensive, you might eventually consider C or Java instead of Python (the fitness function, at least, will be interpreted and will thus cost a lot).
Alternatively, I would look at using Y as a weight in some supervised learning algorithm (if supervised learning is the target).
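As a sketch of that suggestion: since the chromosome is a single real value, even a bare-bones (1+1) evolution strategy will do. This reuses the hypothetical fitness() and df from the sketch after the question:

import random

alpha = 0.01                    # initial guess for α
best = fitness(alpha, df)
for _ in range(1000):
    candidate = abs(alpha + random.gauss(0, 0.005))  # mutate, keep α positive
    score = fitness(candidate, df)
    if score < best:                                 # keep improvements only
        alpha, best = candidate, score
print(alpha, best)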
Let's start with the problem: you consider that certain features lead to a 'strike' for some of your classes. You take a subset of your data and try to establish a rule for the strikes. You do establish one, but then you notice that the accuracy of your rule depends on the volume of the dataset that was used to establish the 'strike' rate in the first place. You are also commenting on the effect some samples have in biasing your 'strike' estimate.
The immediate answer is that it looks like you have a lot of variation in your data; therefore you will, in one way or another, need to collect more of it to account for that variation (that is, variation that is inherent to the problem).
The fact that in some cases the numbers end up as 'unusable cases' could also be down to outliers: measurements that are 'out of bounds' for a number of reasons, and which you would have to find a way to either exclude or re-adjust. But this depends a lot on the context of the problem.
'Strike rates' on their own will not help, but they are perhaps a step in the right direction. In any case, you cannot compare strike rates if they come from samples of different sizes, as you have found out too. If your problem is purely to determine the size of your sample so that your results conform to some specific accuracy, then I would recommend that you have a look at statistical power and how sample size affects it. But still, to determine the sample size you need to know a bit more about your data, which brings us back to point #1 about the inherent variation.
Therefore, my attempt at an answer is this: if I have understood your question correctly, you are dealing with a classification problem in which you seek to assign a number of items (patients) to a number of classes (types of cancer) on the evidence of some features (the existence of genetic markers, the frequency of their appearance, or any other quantity) about these items. But some features might not exist for all items, or there is a core group of features with some more that do not appear all the time. The question now is: which classifier do you use to achieve this? Logistic regression was mentioned previously and has not helped. Therefore, what I would suggest is going for a naive Bayes classifier. The classifier can be trained with the datasets you have used to derive the 'strike rates', which will provide the a priori probabilities. When the classifier is 'running', it will use the features of new data to construct a likelihood that the patient who provided the data should be assigned to each class.
Perhaps the most common example of such a classifier is the spam-email detector, where the likelihood that an email is spam is judged on the existence of specific words in the email (given a suitable training dataset that provides a good starting point, of course).
Now, in terms of trying this out practically (and since your post is tagged with Python-related tags :) ), I would like to recommend Weka. Weka contains a lot of related functionality, including bootstrapping, that could potentially help you with the differences in the sizes of the datasets. Although Weka is Java, bindings exist for it in Python too. I would definitely give it a go; the Weka package, book, and community are very helpful.
No. Don't use a genetic algorithm.
The bigger the search space of models and parameters, the better your chances of finding a good fit for your data points. But the less this fit will mean. Especially since for some groups your sample sizes are small and therefore the measurements have a high random component to them. This is why, somewhat counterintuitively, it is often actually harder to find a good model for your data after collecting it than before.
You have taken the question to the programmer's lair. This is not the place for it. We solve puzzles.
This is not a puzzle to find the best line through the dots. You are searching for a model that makes sense and brings understanding on the subject matter. A genetic algorithm is very creative at line-through-dot drawing but will bring you little understanding.
Take the problem back where it belongs and ask the statisticians instead.
A good model should be based on the theory behind the data. It will have to match the points on the right side of the graph, where (if I understand you correctly) most of the samples are. It will be able to explain, in hard probabilities, how likely the deviations on the left are, and tell you whether they are significant or not.
If you do want to do some programming, I'd suggest you take the simplest linear model, add some random noise, and do a couple of simulation runs for a population like your subjects. See if the data looks like the data you're looking at, or if it generally 'looks' different, in which case there really is something nonlinear (and possibly interesting) going on on the left.
I once tackled a similar problem (as similar as problems like this ever are), in which there were many classes and high variance in the features per data point. I personally used a random forest classifier (which I wrote in Java). Since your data is highly variant, and therefore hard to model, you could create multiple forests from different random samples of your large dataset, put a control layer on top to classify the data against all the forests, and then take the best score. I don't write Python, but I found this link:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
which may give you something to play with.
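A rough sketch of the "several forests plus a control layer" idea in scikit-learn. X, y, and X_new are assumed numpy arrays, and for simplicity each random subset is assumed to contain every class, so all forests share the same class order:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
forests = []
for seed in range(5):                                 # five forests on half-sized samples
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    forests.append(RandomForestClassifier(random_state=seed).fit(X[idx], y[idx]))

probas = np.stack([f.predict_proba(X_new) for f in forests])  # (forest, sample, class)
best_f = probas.max(axis=2).argmax(axis=0)            # most confident forest per sample
labels = np.array([forests[b].classes_[probas[b, i].argmax()]
                   for i, b in enumerate(best_f)])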
Following Occam's razor, you should select a simpler model for a small dataset, and may want to switch to a more complex model as your dataset grows.
There are no [good] statistical tests that show you whether a given model, in isolation, is a good predictor of your data. Or rather, a test may tell you that a given model's fitness is N, but you can never tell what an acceptable value of N is.
Thus, build several models and pick the one with the better tradeoff between predictive power and simplicity, using the Akaike information criterion (AIC). It has useful properties and is not too hard to understand. :)
There are other tests of course, but AIC should get you started.
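For a concrete starting point, the least-squares form of AIC is often written AIC = n·ln(RSS/n) + 2k, with k the number of fitted parameters; a toy sketch:

import numpy as np

def aic(y_true, y_pred, k):
    n = len(y_true)
    rss = np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return n * np.log(rss / n) + 2 * k  # lower AIC = better fit/simplicity tradeoff

# e.g. compare a straight line (k = 2) against a cubic (k = 4) fitted to the
# same data, and keep the model with the smaller aic(...) value.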
For a simple test, check out the p-value.
