I have created a 4-cluster k-means customer segmentation in scikit-learn (Python). The idea is that every month the business gets an overview of how the number of customers in each cluster has shifted.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may shift slightly, but I want to keep the old clusters (even though they fit the new data slightly worse).
My guess is that there should be a way to extract the parameters that decide which cluster each case is assigned to, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then, when new data comes in, compare each point to the means and assign it to the cluster with the closest mean.
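A minimal sketch of that suggestion, assuming a 4-feature customer matrix and placeholder file paths: fit once, persist the centroids, and score later batches against those frozen centroids instead of refitting.

```python
# Hedged sketch: keep the original cluster centres fixed and assign new
# monthly data to them. The feature matrices and the file name are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

# --- month 0: fit once and save the cluster centres ---
X_initial = np.random.rand(500, 4)             # placeholder customer features
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X_initial)
np.save("cluster_centers.npy", kmeans.cluster_centers_)

# --- later months: assign new customers to the frozen clusters ---
centers = np.load("cluster_centers.npy")
X_new = np.random.rand(600, 4)                 # placeholder updated data
labels = pairwise_distances_argmin(X_new, centers)

# Monthly overview: how many customers sit in each of the original clusters.
counts = np.bincount(labels, minlength=len(centers))
print(counts)
```

Equivalently, persisting the fitted KMeans object itself (e.g. with joblib) and calling its predict method on new data assigns points to the nearest of the original centroids.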
I'm currently trying to use a sensor to measure a process's consistency. The sensor output varies wildly in its actual reading but displays features that are statistically different across three categories [dark, appropriate, light], with dark and light being out-of-control items. For example, one output could read approximately 0V; the process repeats and the sensor then reads 0.6V. Both the 0V reading and the 0.6V reading could represent an in-control process. There is a consistent difference between sensor readings for out-of-control items and in-control items. An example set for an in-control item can be found here, and an example set for two out-of-control items can be found here.
Because of the wildness of the sensor and the characteristic shape of each category's data, I think the best way to assess the readings is to process them with a machine-learning model. This is my first foray into creating a model that produces a categorical prediction from a time-series window, and I haven't been able to find anything on the internet with my searches (I'm possibly looking for the wrong thing). I'm certain that what I'm attempting is feasible and is a strong use case for a model; I'm just not certain what the optimal way to build it is. One idea I had was to treat the data similarly to how an image is treated by an object-detection model, with the readings as the input array and the category as the output, but I'm not certain that this is the best way to go about solving the problem. If anyone can help point me in the right direction or give me a resource, I would greatly appreciate it. Thanks for reading my post!
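One way to act on the "treat the window like an image" idea, purely as a sketch and assuming TensorFlow/Keras is available, is a small 1D convolutional network that maps a fixed-length window of readings to one of the three categories. The window length, layer sizes, and the X/y arrays are placeholders, not something taken from the post.

```python
# Hypothetical sketch: a small 1D CNN that maps a fixed-length window of
# sensor readings to one of three classes (dark / appropriate / light).
import tensorflow as tf

WINDOW = 500          # samples per reading window (assumption)
N_CLASSES = 3         # dark, appropriate, light

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.Conv1D(16, kernel_size=7, activation="relu"),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X: (n_windows, WINDOW, 1) float array of readings, y: integer labels 0..2
# model.fit(X, y, epochs=20, validation_split=0.2)
```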
In accounting, the data set representing the transactions is called a 'general ledger' and takes the following form:
Note that a 'journal', i.e. a transaction, consists of two line items. E.g. transaction (journal number) 1 has two lines: the receipt of cash and the income. Companies can also have transactions (journals) consisting of three or more line items.
Will I first need to cleanse the data so that each journal has only one line item, i.e. reduce the above 8 rows to 4?
Are there any Python machine learning algorithms that will allow me to cluster the above data without further manipulation?
The aim of this is to detect anomalies in transaction data. I do not know what anomalies look like, so this would need to be unsupervised learning.
Fit a Gaussian to each dimension of the data to determine what is an anomaly. Mean and variance are estimated per dimension, and if the probability of a new data point under those Gaussians falls below a threshold, it is considered an outlier. This gives one Gaussian per dimension. You can use some feature engineering here rather than just fitting Gaussians to the raw data.
If features don't look Gaussian (plot their histogram), apply transformations such as log(x) or sqrt(x) until they look more Gaussian.
Use anomaly detection if labelled data for supervised learning is not available, or if you want to find new, previously unseen kinds of anomalies (such as the failure of a power plant, or someone acting suspiciously), rather than separating well-known classes (such as male vs. female).
Error analysis: what if p(x), the probability that an example is normal, is large even for the anomalies? Add another dimension and hope it helps to expose the anomaly; you could create this dimension by combining some of the others.
To fit the Gaussian more closely to the shape of your data, you can make it multivariate. It then takes a mean vector and a covariance matrix, and you can vary those parameters to change its shape. It will also capture feature correlations if your features are not all independent.
https://stats.stackexchange.com/questions/368618/multivariate-gaussian-distribution
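A compact sketch of both variants described above, assuming a purely numerical feature matrix; the placeholder data, feature count, and epsilon threshold are made up for illustration.

```python
# Minimal sketch of the per-dimension Gaussian approach, plus the multivariate
# variant. X_train, X_new and epsilon are assumptions, not real ledger data.
import numpy as np
from scipy.stats import norm, multivariate_normal

X_train = np.random.randn(1000, 3)      # placeholder "normal" transactions
X_new = np.random.randn(10, 3)          # placeholder new points to score

# Independent Gaussians: one mean/standard deviation per dimension.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# p(x) is the product of the per-dimension densities.
p = norm.pdf(X_new, loc=mu, scale=sigma).prod(axis=1)

epsilon = 1e-4                          # threshold, tune on examples you trust
anomalies = p < epsilon

# Multivariate variant: captures feature correlations via the covariance matrix.
mvn = multivariate_normal(mean=mu, cov=np.cov(X_train, rowvar=False))
anomalies_multi = mvn.pdf(X_new) < epsilon
```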
I have a set of product data with specifications etc.
I have applied k-modes clustering to the dataset to form clusters of the most similar products.
When a new data point comes in, I want to know which cluster it belongs to and which other products are most similar to it. How do I go about this?
Use the nearest neighbors.
No need to rely on clustering, which tends to be unstable and produce unbalanced clusters. It's fairly common to have 90% of your data rightfully in the same cluster (e.g. a "normal users" cluster, or a "single visit" cluster). So you should ask yourself: what do you gain by doing this, what is the cost-benefit ratio?
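A hedged sketch of the nearest-neighbours suggestion, assuming the product specifications are categorical (as k-modes implies) and one-hot encoding them first; the example products and the number of neighbours are placeholders.

```python
# Given an encoded feature matrix of existing products, find the products
# closest to a new one. The specification values below are invented.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import NearestNeighbors

products = np.array([["red", "metal", "large"],
                     ["red", "metal", "small"],
                     ["blue", "plastic", "large"]])   # placeholder specs

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(products)

nn = NearestNeighbors(n_neighbors=2).fit(X)
new_product = enc.transform([["red", "plastic", "large"]])
distances, indices = nn.kneighbors(new_product)
print(indices)    # rows of `products` most similar to the new product
```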
I have a pandas dataframe with the following 2 columns:
Database Name    Name
db1_user         Login
db1_client       Login
db_care          Login
db_control       LoginEdit
db_technology    View
db_advanced      LoginEdit
I have to cluster the Database Name based on the field "Name". When I convert the dataframe to a NumPy array using
dataset = df2.values
and print dataset.dtype, the type is object. I have just started with clustering; from what I have read, I understand that object is not a dtype suitable for k-means clustering.
Any help would be appreciated!
What is the mean of
Login
LoginEdit
View
supposed to be?
There is a reason why k-means only works on continuous numerical data: the mean is only well defined for such data.
I don't think clustering is applicable to your problem at all (look into data cleaning instead). But clearly you need a method that works with arbitrary distances, and k-means does not.
I don't understand whether you want to develop clusters within each group of "Name" values, or create n clusters regardless of the value of "Name"; and I don't understand what clustering would achieve here.
In any case, just a few days ago there was a similar question on the Data Science SE site (from an R user, though), asking about the similarity of the local parts of email addresses (the part before the "@") rather than of database names. The problem is similar to yours.
Check this out:
https://datascience.stackexchange.com/questions/14146/text-similarities/14148#14148
The answer was comprehensive with respect to the different distance measures for strings.
Maybe this is what you should investigate. Then decide on a proper distance measure that is available in python (or one that you can program yourself), and that fits your needs.
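As a rough illustration only, here is one way to combine a string distance with a clustering method that accepts arbitrary (precomputed) distances; the distance function, the eps value, and the short name list are assumptions, and a proper string metric such as Levenshtein distance could be substituted.

```python
# Sketch: cluster database names with an arbitrary string distance, which
# k-means cannot do. difflib (standard library) gives a similarity ratio;
# 1 - ratio serves as a crude distance.
import numpy as np
from difflib import SequenceMatcher
from sklearn.cluster import DBSCAN

names = ["db1_user", "db1_client", "db_care",
         "db_control", "db_technology", "db_advanced"]

def string_distance(a, b):
    # 0.0 for identical strings, up to 1.0 for completely different ones
    return 1.0 - SequenceMatcher(None, a, b).ratio()

# Precompute the pairwise distance matrix.
D = np.array([[string_distance(a, b) for b in names] for a in names])

labels = DBSCAN(eps=0.5, min_samples=1, metric="precomputed").fit_predict(D)
print(dict(zip(names, labels)))
```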
I am trying to determine the conditions of a wireless channel by analysing captured I/Q samples. I have 50000 data samples and, as shown in the attached figure, there are spikes in the graph when there is activity (e.g. a data transmission) on the channel. I am trying to count the number of spikes, i.e. data values higher than a threshold.
I need an accurate estimate of the threshold, and then I can find the channel load. The threshold value in the attached figure is around 0.0025, and it should be noted that it varies over time. So each time I take 50000 samples, I have to find the threshold value first using some sort of unsupervised learning.
I tried k-means (in Python scikit-learn) to cluster the data and find the centroids of the estimated clusters, but it can't give me a good estimate of the threshold (especially when there is no activity on the channel and the channel is idle).
Does anyone have prior experience with similar problems?
Captured data
Since the idle noise seems relatively consistent and very different from the readings when data is transmitted, I can think of several simple algorithms that could give you a reasonable threshold in an unsupervised manner.
The most direct method would be to sort the values (perhaps first grouping them into buckets), then find the lowest-valued region where a large enough proportion (at least ~5%) of the values falls. Take a reasonable margin above the highest values in that region (50%?) and you should be good to go; a rough sketch follows below.
You'll need to fiddle with the thresholds a bit. I'd collect sample data and tweak the values until it works 100% of the time and the values used make sense.
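Purely as a sketch of that idea, and not a tested estimator: bucket the magnitudes, treat the lowest contiguous populated region (holding at least ~5% of the samples) as idle noise, and set the threshold a margin above the top of that region. All constants and the synthetic `samples` array are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = np.abs(rng.normal(0.0, 0.0008, 48000))     # idle-channel readings
bursts = np.abs(rng.normal(0.02, 0.003, 2000))     # readings during activity
samples = np.concatenate([noise, bursts])

def estimate_threshold(samples, n_bins=200, min_fraction=0.05, margin=0.5):
    counts, edges = np.histogram(samples, bins=n_bins)
    covered = 0.0
    upper = edges[1]
    for count, edge in zip(counts, edges[1:]):
        # Stop at the first empty bucket once the region already holds
        # enough of the data to count as the noise population.
        if covered >= min_fraction and count == 0:
            break
        covered += count / len(samples)
        upper = edge
    return upper * (1.0 + margin)

threshold = estimate_threshold(samples)
spikes = np.sum(samples > threshold)
print(f"threshold ~ {threshold:.4f}, spikes above threshold: {spikes}")
```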