Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I have a series of line data; each line consists of 2-3 connected points.
What is the best machine learning algorithm for grouping lines by how similar their locations are? (image below)
Preferably one available in a Python library such as scikit-learn.
Edit:
I have tried DBSCAN, but the problem I faced was that when two lines intersect each other, DBSCAN sometimes puts them into one group even though they run in completely different directions.
Here is a solution I found so far:
GeoPath Clustering Algorithm
The idea here is to cluster geo paths that travel very similarly to each other into groups.
Steps:
1- Cluster lines based on slope
2- Within each cluster from step 1, find the centroid of each line and use the k-means algorithm to cluster the centroids into smaller groups
3- Within each group from step 2, calculate the length of each line and group lines whose lengths fall within a defined threshold
The result is small groups of lines that have a similar slope, are close to each other, and cover a similar travel distance.
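The three steps could be sketched roughly as follows. This is only one reading of the description, not the original implementation: the function name `geopath_clusters`, its parameters, and the sample lines are made up for illustration. It also bins by undirected angle instead of raw slope (so vertical lines do not give an infinite slope), and it interprets the length threshold as "start a new group when the gap between sorted lengths exceeds `length_tol`".

```python
import numpy as np
from sklearn.cluster import KMeans

def geopath_clusters(lines, n_slope_bins=8, n_location_clusters=2, length_tol=0.5):
    """Return one (slope_bin, location_cluster, length_group) label per line."""
    first = np.array([ln[0] for ln in lines], dtype=float)
    last = np.array([ln[-1] for ln in lines], dtype=float)
    delta = last - first
    # Step 1: bin by undirected angle in [0, pi), a stand-in for slope
    angle = np.arctan2(delta[:, 1], delta[:, 0]) % np.pi
    slope_bin = np.floor(angle / (np.pi / n_slope_bins)).astype(int)
    slope_bin = np.minimum(slope_bin, n_slope_bins - 1)

    centroid = np.array([np.mean(ln, axis=0) for ln in lines])
    length = np.array(
        [np.sum(np.linalg.norm(np.diff(ln, axis=0), axis=1)) for ln in lines]
    )

    labels = [None] * len(lines)
    for b in np.unique(slope_bin):
        idx = np.flatnonzero(slope_bin == b)
        # Step 2: k-means on the centroids inside this slope group
        k = min(n_location_clusters, len(idx))
        loc = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(centroid[idx])
        for c in np.unique(loc):
            members = idx[loc == c]
            # Step 3: walk the lines in order of length and split on large gaps
            order = members[np.argsort(length[members])]
            group, prev = 0, None
            for i in order:
                if prev is not None and length[i] - prev > length_tol:
                    group += 1
                labels[i] = (int(b), int(c), group)
                prev = length[i]
    return labels

# Made-up sample: two near-identical paths, one far away, one in another direction
lines = [
    [(0, 0), (2, 1)],
    [(0, 0.2), (2, 1.2)],
    [(10, 10), (12, 11)],
    [(0, 0), (-1, 2)],
]
labels = geopath_clusters(lines)
```

Lines whose angle sits right on a bin edge can land in different bins, so the number of bins (like the other two parameters) would need tuning on real data.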
Here are screen shots of visualization:
Yellow lines are all the lines; red lines are clusters of paths that travel together.
I'll throw in an answer since I think the current one is incomplete...and I also think the comment about a "simple heuristic" is premature. I think that if you cluster on points, you'll get a different result from what your diagram depicts, because the clusters will form near the end-points and you won't get your nice ellipses.
So, if your data really does behave the way you display it, I would take a stab at turning each set of 2/3 points into a longer list of points that basically trace out the lines. (You will need to experiment with how dense to make them.)
Then run HDBSCAN on the result (see this video: https://www.youtube.com/watch?v=AgPQ76RIi6A ) to get your clusters. I believe "pip install hdbscan" installs it.
Now, when testing a new sample, first decompose it into many (N) points and assign each of them to a cluster with your fitted HDBSCAN model. I reckon that if you take a majority vote over the N points, you'll get the best overall cluster to which the "line" belongs.
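Here is a minimal sketch of the densify-then-vote idea. To keep it dependent on scikit-learn only, it swaps in `sklearn.cluster.DBSCAN` for HDBSCAN; the helper `densify`, the sample lines, and the `eps` / `min_samples` values are assumptions chosen for this toy data.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import DBSCAN

def densify(line, points_per_segment=10):
    """Trace a polyline with evenly spaced points along each segment."""
    line = np.asarray(line, dtype=float)
    t = np.linspace(0.0, 1.0, points_per_segment, endpoint=False)[:, None]
    pieces = [a + t * (b - a) for a, b in zip(line[:-1], line[1:])]
    return np.vstack(pieces + [line[-1:]])

# Made-up lines: two near y=0 and two near y=10
lines = [
    [(0, 0), (4, 0)],
    [(0, 0.2), (4, 0.2)],
    [(0, 10), (4, 10)],
    [(0, 10.2), (2, 10.1), (4, 10.2)],
]
dense = [densify(ln) for ln in lines]
points = np.vstack(dense)
# owner[i] records which line the i-th dense point came from
owner = np.concatenate([np.full(len(d), i) for i, d in enumerate(dense)])

# DBSCAN used here as a stand-in for HDBSCAN
point_labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(points)

# Majority vote: each line takes the most common label among its own points
line_labels = []
for i in range(len(lines)):
    votes = Counter(point_labels[owner == i].tolist())
    line_labels.append(votes.most_common(1)[0][0])
```

With the hdbscan package itself, I believe new points can be scored against a fitted model via `hdbscan.approximate_predict` (the clusterer has to be built with `prediction_data=True`), and the same vote then applies to a new line's points.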
So, while I sort of agree with the "simple heuristic" comment, it's not so simple if you want the whole thing automated. And once you watch the video you may be convinced that HDBSCAN, because of its density-based algorithm, will suit this problem (if you decide to create many points from each sample).
I'll wrap up by saying that I'm sure there are line-intersection models that have done this before...and that there do exist heuristics and rules that can do the job. Likely, they're computationally more economical too. My answer is just something organic using sklearn as you requested...and I haven't even tested it! It's just how I would proceed if I were in your shoes.
edit
I poked around and there are a couple of line similarity measures you could try: the Fréchet and Hausdorff distances.
Fréchet: http://arxiv.org/pdf/1307.6628.pdf
Hausdorff: see the question "distance matrix of curves in python" for a Python example.
If you generate all pair-wise similarities and then group them according to similarity and/or into N bins, you can then call those bins your "clusters" (not kmeans clusters though!). For each new line, generate all similarities and see which bin it belongs to. I revise my original comment of possibly being computationally less intensive...you're lucky your lines only have 2 or 3 points!
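As a rough illustration of the pairwise-similarity route, the sketch below computes a discrete Hausdorff distance on the line vertices only (real use would probably sample points along each segment first) and then groups lines whose distance falls under a threshold. The helper names, the threshold of 1.0, and the sample lines are all assumptions.

```python
import numpy as np

def hausdorff(a, b):
    """Discrete (vertex-to-vertex) Hausdorff distance between two point sets."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def group_by_threshold(dist, threshold):
    """Give the same label to lines linked by a chain of distances < threshold."""
    n = len(dist)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] < threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Two horizontal lines crossed by two vertical ones
lines = [
    [(0, 0), (4, 0)],
    [(0, 0.3), (4, 0.3)],
    [(2, -2), (2, 2)],
    [(2.2, -2), (2.2, 2)],
]
dist = np.array([[hausdorff(p, q) for q in lines] for p in lines])
labels = group_by_threshold(dist, threshold=1.0)
```

In this toy case the crossing lines end up in different groups, because the distance looks at the whole line rather than at the points near the intersection.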
The problem you're trying to solve is called clustering. For an overview of clustering algorithms in sklearn, see http://scikit-learn.org/stable/modules/clustering.html#clustering.
Edit 2: KMeans was what sprang to mind when I first saw your post, but based on feedback from the comments it looks like it's not a good fit. You may want to try sklearn's DBSCAN instead.
A potential transformation or extra feature you could add would be to fit a straight line to each set of points, and then use the (slope, intercept) pair. You may also want to use the centroid of each line.
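A small sketch of that feature extraction, assuming each line is a short list of (x, y) points and none of them is vertical (a vertical line has no finite slope, so `np.polyfit` would not give a usable fit). The function name `line_features` and the sample lines are illustrative.

```python
import numpy as np

def line_features(line):
    """Return (slope, intercept, centroid_x, centroid_y) for one polyline."""
    pts = np.asarray(line, dtype=float)
    # Least-squares straight line y = slope * x + intercept through the points
    slope, intercept = np.polyfit(pts[:, 0], pts[:, 1], deg=1)
    cx, cy = pts.mean(axis=0)
    return np.array([slope, intercept, cx, cy])

lines = [
    [(0, 0), (2, 2)],
    [(0, 1), (1, 2), (2, 3)],
    [(0, 4), (4, 0)],
]
features = np.array([line_features(ln) for ln in lines])
```

The resulting rows can be passed to a clusterer such as DBSCAN; since the four columns have different units, scaling them first is probably a good idea.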
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I have a project I'm working on and am running into an issue. Essentially, I have these points scattered across an x/y plot. I have one test point, where I get the target data (y) for the classification (a number from 1 to 6). I have lots of points where I have depth-indexed data with some features. The issue with these points is that I don't get a lot of data per point (maybe 100 observations).
I'm using the point closest to the test point to fit the model, then trying to generalize that to the other points that are farther apart. It's not giving me great results.
I understand there's not a lot of data to fit to so I'm trying to improve the model by adding a set of 'k' points close to the test point.
These points all share the same columns, so I've tried to stack them vertically, but then my indexes don't match those of the target variable y.
I've tried to concatenate them side by side, using a suffix denoting the specific point id, but then I get an error about the number of input features (for one point) when I try predicting again with the model fitted on the combined features.
Essentially what I'm trying to do is the following :
model.fit([X_1,X_2,X_3,X_4],y)
model.predict(X_5)
Where :
All features are numeric (floats)
X_1.columns = X_i.columns
Each X matrix is about 100 points long with a continuous index [0:100].
I only have one test point (with 100 observations) for each group of points, so it's imperative I use as much data close to the test point as possible.
Is there another model or technique I can use for this? I've done a bit more research into NN models (I'm not familiar with them, so I would prefer to avoid them), and found that Keras can take multiple inputs to fit using its functional API, but can I predict with only one input after the model has been fitted to multiple?
Keras Sequential model with multiple inputs
Could you give more information about the features / classes, and the model you're using? It would make things easier to understand.
However, I can give two pointers based on what you've said so far.
To have a better measurement of how well your model is generalizing, you should have more than one test point. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
Sounds like you're using a k-Nearest Neighbors approach. If you aren't already, using the sklearn implementation will save a lot of time, and you can easily experiment with different hyperparameters: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Other techniques: I like to start off with XGBoost or Random Forest, as those methods require little tuning and are reasonably robust. However, there is no magic bullet cure for modeling on a small dataset. The best thing to do would be to collect more data, or if that's impossible, you need to drill down and really understand your data (identify outliers, plot histograms / KDE, etc.).
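For the stacking problem in the question, one option is to stack the neighbouring points row-wise and repeat y once per stacked block, so the feature count stays the same at predict time. This is a sketch on synthetic data and it rests on a strong assumption that may not hold for your data: that every neighbouring point shares the test point's label at the same depth index. The array names and the Random Forest settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows, n_features = 100, 3
y = rng.integers(1, 7, size=n_rows)   # classes 1..6, one label per depth index

# Four hypothetical neighbouring points with identical columns (synthetic data)
neighbours = [
    y[:, None] + rng.normal(scale=0.3, size=(n_rows, n_features)) for _ in range(4)
]
X_5 = y[:, None] + rng.normal(scale=0.3, size=(n_rows, n_features))

X_train = np.vstack(neighbours)          # (400, 3): rows stacked, columns unchanged
y_train = np.tile(y, len(neighbours))    # y repeated per block so the lengths match

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_5)                # X_5 has the same 3 columns as X_train
```

Concatenating side by side instead multiplies the number of columns, which is why predicting on a single point then fails with a feature-count error.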
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Edit: This question was written with little knowledge of clustering techniques and, in hindsight, does not even meet the standards of Stack Overflow. However, SO won't let me delete it, saying that others have invested time and energy in it (a valid point), and that if I proceed to delete it I may not be able to ask questions for a while. So I am updating this question to make it relevant in a way that others can learn from. It still doesn't strictly comply with SO guidelines, as I myself would flag it as too broad, but in its current state it is of no value, so adding a little value to it is worth the downvotes.
Updated Conversation topic
The issue was selecting the optimal number of clusters for a clustering algorithm that groups various shapes produced by contour detection on an image; shapes that deviate from the cluster properties were then to be marked as noise or anomalies. The main point that raised the question at the time was that all the datasets were different: the shapes obtained from them differed, and the number of shapes also varied from dataset to dataset. The proper solution is to use DBSCAN (Density-Based Spatial Clustering of Applications with Noise), an implementation of which can be found in scikit-learn and which I was unaware of at the time. That works, and the product is now in testing. I just wanted to come back to this and correct this old mistake.
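A minimal sketch of how DBSCAN covers both needs at once: the number of clusters is not specified up front, and points that fit no dense group are labelled -1. The two shape features and the `eps` / `min_samples` values are hypothetical; the real features from contour detection are not described here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical 2-D shape features (for example area and perimeter), already scaled
shapes = np.vstack([
    rng.normal(loc=[1.0, 1.0], scale=0.05, size=(30, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.05, size=(30, 2)),
    [[2.0, 5.0]],   # one isolated shape
])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(shapes)

n_clusters = len(set(labels.tolist()) - {-1})   # -1 is DBSCAN's noise label
anomalies = np.flatnonzero(labels == -1)        # row indices flagged as noise
```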
Old Question
Old Title Dynamic selection of k in kmeans clustering
I have to generate a k-means clustering model in which the number of classes is not known in advance. Is there a way to automatically determine the value of k based on the Euclidean distances within the clusters?
How I want it to work: start with a value of k, perform clustering, check whether the result satisfies a threshold criterion, and increase or decrease k accordingly. The problem is framework-independent, so if you have an idea or an implementation in a language other than Python, please share that as well.
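The loop described above could look something like this. The criterion used here (mean Euclidean distance of points to their assigned centre), the function name `kmeans_until_tight`, and the threshold value are assumptions; this is not the modified k-means from the linked paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_until_tight(X, max_mean_dist, k_max=10):
    """Increase k until the mean point-to-centre distance is under a threshold."""
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
        if dist.mean() <= max_mean_dist:
            return k, km
    return k_max, km   # threshold never met; fall back to the largest k tried

# Synthetic data: three well-separated blobs
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

k, model = kmeans_until_tight(X, max_mean_dist=1.0)
```

The threshold is in the same units as the data, so it has to be chosen per dataset, which is the main weakness of this approach.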
I found this while researching the problem https://www.researchgate.net/publication/267752474_Dynamic_Clustering_of_Data_with_Modified_K-Means_Algorithm.
I couldn't find its Implementation.
I am looking for similar ideas to select the best and implement it myself, or an implementation that can be ported to my code.
Edit
The Ideas I am considering right now are:
The elbow method
X-means clustering
You can use the elbow method. It runs the clustering for various values of k (the number of clusters) and, for each one, calculates the distance of each point from its cluster center. Beyond a certain k there won't be any major improvement; you can take that value as k.
For further reading, see this link.
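A short sketch of the elbow method on synthetic data. The elbow is normally read off a plot of SSE against k; picking it automatically via the largest second difference of the SSE curve, as done here, is just one simple heuristic and an assumption of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

ks = list(range(1, 8))
# inertia_ is the sum of squared distances of points to their closest centre (SSE)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The bend is where the curve flattens most sharply: largest second difference
second_diff = np.diff(sse, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
```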
Iterate over the values of k and check cluster validity for each one using the silhouette score.
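For example, a minimal version with scikit-learn's `silhouette_score` on synthetic data (the range of k values tried is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

scores = {}
for k in range(2, 8):   # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # k with the highest mean silhouette
```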
You can iterate over any range of k values. Either check the silhouette score for each k, or calculate the difference between the SSE values of consecutive k values. The elbow point is where that difference is highest after 0.4 * the number of k values.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 years ago.
So I am fairly new to machine learning, and I am trying to create a Python script to analyse an energy dataset of a computer.
The script should in the end determine the different states of the computer (like idle, standby, working, etc...) and how much energy those states are using on average.
And I was wondering if this task could be done by some clustering method like k-means or DBSCAN.
I tinkered a bit with some clustering methods in scikit-learn, but the results so far were not as good as I expected.
I researched a lot about clustering methods but I could never find a scenario similar to mine.
So my question is whether it's even worth the trouble and, if so, which clustering method (or machine learning algorithm in general) would be best suited to the task. Or are there better ways to do it?
The energy dataset is just a single-column table, with each cell holding one energy value per second over a few days.
You will not be able to apply supervised learning for this dataset as you do not have labels for your dataset (there is no known state given an energy value). This means that models like SVM, decision trees, etc. are not feasible given your dataset.
What you have is a timeseries with a single variable output. As I understand it, your goal is to determine whether or not there are different energy states, and what the average value is for those state(s).
I think it would be incredibly helpful to plot the timeseries using something like matplotlib or seaborn. After plotting the data, you can have a better feel for whether your hypothesis is reasonable and how you might further want to approach the problem. You may be able to solve your problem by just plotting the timeseries and observing that there are, say, four distinct energy states (e.g. idle, standby, working, etc.), avoiding any complex statistical techniques, machine learning, etc.
To answer your question, you can in principle use k-means for one-dimensional data. However, this is probably not recommended, as these techniques are usually used on multidimensional data.
I would recommend that you look into Jenks natural breaks optimization or kernel density estimation. Similar questions to yours can be found here and here, and should help you get started.
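To illustrate the kernel density route, here is a sketch on synthetic readings: estimate the density of the energy values, take the local minima of the density as thresholds between states, and average the readings in each band. The three states, their wattages, the bandwidth of 4.0, and the helper `gaussian_kde_1d` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical one-value-per-second power readings for three states (watts)
energy = np.concatenate([
    rng.normal(5, 0.5, 500),    # standby
    rng.normal(30, 2.0, 500),   # idle
    rng.normal(80, 5.0, 500),   # working
])

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on a grid of values."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm

grid = np.linspace(energy.min(), energy.max(), 400)
density = gaussian_kde_1d(energy, grid, bandwidth=4.0)

# Interior local minima of the density are candidate thresholds between states
is_min = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
thresholds = grid[1:-1][is_min]

# Assign each reading to the band it falls in, then average per band
state = np.digitize(energy, thresholds)
state_means = [float(energy[state == s].mean()) for s in range(len(thresholds) + 1)]
```

The bandwidth controls how many minima appear, so it is worth plotting `density` for a few values before trusting the thresholds.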
Don't ignore time.
First of all, if your signal is noisy, temporal smoothing will likely help.
Secondly, you'll want to perform some feature extraction first, for example by using segmentation to cut your time series into separate states. You can then try to cluster these states, but I am not convinced that clustering is applicable here at all. You will probably want to use a histogram or a density plot. It's one-dimensional data: you can visualize it and choose thresholds manually instead of hoping that some automated technique will work (because it may not...).
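A rough sketch of that workflow on a synthetic trace: smooth over time with a moving median, apply a manually chosen threshold, and read the segments off the points where the state changes. The trace, the window size, and the 55 W threshold are invented for the example; `moving_median` is a helper written here, not a library function.

```python
import numpy as np

def moving_median(x, window):
    """Centred moving median; the ends are padded by repeating the edge values."""
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)

rng = np.random.default_rng(1)
# Hypothetical trace: 60 s idle (~30 W), 60 s working (~80 W), 60 s idle again
true_state = np.repeat([0, 1, 0], 60)
energy = np.where(true_state == 0, 30.0, 80.0) + rng.normal(0, 3.0, true_state.size)
energy[20] = 95.0   # a one-sample spike while idle

smooth = moving_median(energy, window=5)
thresholds = [55.0]   # chosen by eye, e.g. from a histogram of the smoothed values
state = np.digitize(smooth, thresholds)

# Segment boundaries are the seconds at which the state changes
boundaries = np.flatnonzero(np.diff(state)) + 1
state_means = {int(s): float(energy[state == s].mean()) for s in np.unique(state)}
```

Without the smoothing step, the single spike at second 20 would show up as its own one-second "working" segment.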
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I'm trying to locate a (possibly perspective-deformed) book in an image and extract it so that it is "straight" and "front-on" (i.e. perspective-corrected).
The particular book is unknown -- there is no query or reference image to check for matches against (i.e. by some sort of feature descriptor matching process). In other words, I'm trying to hunt through the image and find a bunch of pixels that look like they belong to the object class "book", not a particular book.
The book may be somewhat rotated or otherwise perspective-deformed. However, it is assumed the amount of deformation is within fairly reasonable bounds: the person taking the photo is working "with" me. This means as well that the book should feature prominently in the image -- perhaps 30-90% of total image area (and not as some random item amidst a bunch of other clutter).
Good resources exist for (superficially) similar problems online. For example, this well-written tutorial covers automatic perspective-correction of playing cards: https://opencv-code.com/tutorials/automatic-perspective-correction-for-quadrilateral-objects/.
Currently, the system follows a process loosely similar to the one in this tutorial, with some additions. The general technique stack is:
Pre-processing
Find edges with Canny edge detection
Find edges that look like lines with Hough transform
Find intersection points between lines in the hope of finding book corners
Filter out implausible lines and intersection points based on simple geometric properties
Take convex hull of intersection points
Get polygon approximation to the convex hull and use this to get four corners
Apply perspective/homographic transform
The output points (used to calculate the perspective transform) are known because we assume a known aspect ratio (i.e. book dimensions).
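To make the last step concrete, here is a sketch of how the output points follow from an assumed aspect ratio and how the homography can be solved from the four corner pairs with a direct linear transform. The corner coordinates, the 2:3 aspect ratio, and the output height are made-up values; in OpenCV the equivalent calls would be `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective`.

```python
import numpy as np

def homography_from_points(src, dst):
    """Direct linear transform: 3x3 H with dst ~ H @ src, from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map 2-D points through H using homogeneous coordinates."""
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical detected corners in pixels, ordered TL, TR, BR, BL
corners = np.array([[120, 80], [410, 110], [390, 520], [90, 470]], dtype=float)

aspect = 2 / 3   # assumed width / height of the book
height = 600     # chosen output height in pixels
width = int(round(height * aspect))
target = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=float)

H = homography_from_points(corners, target)
```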
It works for some images where the book is against fairly homogeneous backgrounds (around 1/3 to 1/2 of "nicer" images). After experimenting with the fairly dumb convex hull approach as well as a more involved quadrilateral-enumeration approach, I've concluded that the problem may be impossible using just geometric/spatial information alone -- it would probably need augmenting with colour/texture information (well, this is obvious when you consider the case of 180 degrees rotation/upside-down books).
The obvious challenge is that there is an almost infinite variety of possible book covers, and an almost infinite variety of possible backgrounds. Therefore, solving for the general case would be impossible or at least intractably hard. I knew this when I began the task. But, I hoped it would be the sort of problem that may have a solution enough of the time.
Other approaches I've considered looking at include OCRing the titles/text to work out orientation or possibly general position. The other approach that might conceivably be fruitful is some sort of learning-based classifier.
A related subtask I'm working on is the same goal but in a webcam video stream. This is definitely easier since I can use temporal information (i.e. position across frames). I just started this one yesterday but, after some initial progress, plateaued. A human holding the book generates background movement noise which throws off trivial approaches like frame differencing / background subtraction. Compared with the static image problem, however, I feel this is far more doable.
Sorry if that was a little long-winded. I wanted to make sure I made a sincere effort to articulate the problem(s). What do people think? Anyone have any thoughts as to how these problems might best be tackled?
Does calculating the homography from 4 lines instead of 4 points help? As you probably know, if points are related as p2 = H p1, the corresponding lines are related as l2 = H^(-T) l1 (the inverse transpose of H). The lines on the book border should be quite prominent, especially if the deformation is not large. Is your main problem selecting the right lines (you did NOT actually say what your problem was)? Maybe some kind of Hough-rectangle detection can help to find the lines?
Anyway, selecting lines as the homography input has an additional advantage: a RANSAC homography with a constraint on the aspect ratio is likely to keep the right lines as inliers in the presence of numerous outliers from the background. And if those outliers sneak in, they probably look like another book.
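A quick numerical check of the point/line relationship, using an arbitrary homography made up for the example: points map with H, while lines in homogeneous form map with the inverse transpose of H, so the mapped line still passes through the mapped points.

```python
import numpy as np

# Arbitrary example homography (mild perspective)
H = np.array([
    [1.1, 0.1, 5.0],
    [-0.05, 0.9, 3.0],
    [1e-4, 2e-4, 1.0],
])

# Two points in homogeneous coordinates and the line through them
p1a = np.array([10.0, 20.0, 1.0])
p1b = np.array([200.0, 40.0, 1.0])
l1 = np.cross(p1a, p1b)

p2a, p2b = H @ p1a, H @ p1b       # points: p2 = H p1
l2 = np.linalg.inv(H).T @ l1      # lines: l2 = H^(-T) l1

# A point lies on a line when their dot product is zero
residuals = [float(l2 @ p2a), float(l2 @ p2b)]
```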
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I'm testing some things in image retrieval and I was thinking about how to sort bad pictures out of a dataset. For example, the dataset should contain only pictures of houses, but in between there is a picture of people and some pictures of cars. In the end I want to keep only the houses.
At the moment my approach looks like this:
computing descriptors (SIFT) of all pictures
clustering all descriptors with k-means
creating a histogram for each picture by computing the Euclidean distances between the cluster centers and the picture's descriptors
clustering the histograms again.
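For reference, the four steps above roughly correspond to the sketch below. Random arrays stand in for real SIFT descriptors (extracting them would need OpenCV), and the vocabulary size of 16 and the 3 picture clusters are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for SIFT output: one (n_keypoints, 128) array per picture
descriptors = [rng.random((int(rng.integers(40, 80)), 128)) for _ in range(12)]

# Cluster all descriptors together to build the visual vocabulary
n_words = 16
vocab = KMeans(n_clusters=n_words, n_init=10, random_state=0)
vocab.fit(np.vstack(descriptors))

def bow_histogram(desc):
    """Normalised count of each visual word among one picture's descriptors."""
    words = vocab.predict(desc)   # nearest cluster centre (Euclidean) per descriptor
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()

histograms = np.array([bow_histogram(d) for d in descriptors])

# Cluster the pictures by their histograms
picture_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(histograms)
```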
At this point I have a first sort (which isn't really good). Now my idea is to take all pictures that were assigned to a center with len(center) > 1 and cluster them again and again. The result is that the pictures that are alone in a center get sorted out. Maybe it's enough to fit the result to the same k-means again without re-clustering?!
The result isn't satisfying, so maybe someone has a good idea.
For clustering etc. I'm using k-means from scikit-learn.
K-means is not very robust to noise; and your "bad pictures" probably can be considered as such. Furthermore, k-means doesn't work too well for sparse data; as the means will not be sparse.
You may want to try other, more modern, clustering algorithms that can handle this situation much better.
I don't have the solution to your problem but here is a sanity check to perform prior to the final clustering, to check that the kind of features you extracted is suitable for your problem:
extract the histogram features for all the pictures in your dataset
compute the pairwise distances of all the pictures in your dataset using the histogram features (you can use sklearn.metrics.pairwise_distances)
np.argsort the raveled distance matrix to find the indices of the top 20 closest pairs of distinct pictures according to your features (you have to filter out the zero-valued diagonal elements of the distance matrix), and do the same to extract the top 20 farthest pairs of pictures based on your histogram features.
Visualize (for instance with plt.imshow) the pictures of the top closest pairs and check that they are all pairs that you would expect to be very similar.
Visualize the pictures of the top farthest pairs and check that they are all very dissimilar.
If either of those 2 checks fails, it means that a histogram of bag-of-SIFT-words features is not suitable for your task. Maybe you need to extract other kinds of features (e.g. HoG features) or reorganize the way you extract the clusters of SIFT descriptors, maybe using a pyramidal pooling structure to extract info on the global layout of the pictures at various scales.
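The distance part of this sanity check could be sketched as follows, with random features standing in for the real histograms. Using `np.triu_indices` to keep each unordered pair once is a choice made here; it also drops the zero diagonal, so no separate filtering is needed.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
histograms = rng.random((30, 16))        # stand-in for per-picture histogram features
dist = pairwise_distances(histograms)    # (30, 30) matrix, Euclidean by default

# Keep each unordered pair of distinct pictures once (upper triangle, no diagonal)
i, j = np.triu_indices(len(histograms), k=1)
order = np.argsort(dist[i, j])

top = 20
closest_pairs = list(zip(i[order[:top]].tolist(), j[order[:top]].tolist()))
farthest_pairs = list(zip(i[order[-top:]].tolist(), j[order[-top:]].tolist()))
```

Each pair is a tuple of picture indices, which can then be loaded and shown side by side with plt.imshow.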