I have recently got into using scikit-learn, especially classification models, and I have a question more about use-case examples than about being stuck on any particular bit of code, so apologies in advance if this isn't the right place to be asking questions like this.
So far I have been using sample data where one trains the model on data that has already been classified. In the 'Iris' data set, for example, every row is classified into one of three species. But what if one wants to group/classify the data without knowing the classifications in the first place?
Let's take this imaginary data:
Name Feat_1 Feat_2 Feat_3 Feat_4
0 A 12 0.10 0 9734
1 B 76 0.03 1 10024
2 C 97 0.07 1 8188
3 D 32 0.21 1 6420
4 E 45 0.15 0 7723
5 F 61 0.02 1 14987
6 G 25 0.22 0 5290
7 H 49 0.30 0 7107
If one wanted to split the names into 4 separate groups using the different features, is this possible, and which scikit-learn model(s) would be needed? I'm not asking for any code; I'm quite able to research on my own if someone could point me in the right direction. So far I can only find examples where the classifications are already known.
In the example above, if I wanted to break the data down into 4 classes, I would want my outcome to be something like this (note the new column denoting the class):
Name Feat_1 Feat_2 Feat_3 Feat_4 Class
0 A 12 0.10 0 9734 4
1 B 76 0.03 1 10024 1
2 C 97 0.07 1 8188 3
3 D 32 0.21 1 6420 3
4 E 45 0.15 0 7723 2
5 F 61 0.02 1 14987 1
6 G 25 0.22 0 5290 4
7 H 49 0.30 0 7107 4
Many thanks for any help
You can use k-means clustering, which partitions the data into a number of clusters that you choose up front. To get 4 classes, run it with the number of clusters set to 4; if you are unsure how many groups there should be, you can rerun it with different cluster counts and compare the results.
sklearn.cluster.KMeans doc
Classification is a supervised approach, meaning that the training data comes with features and labels. If you want to group the data according to the features, then you can go for some clustering algorithms (unsupervised), such as sklearn.cluster.KMeans (with k = 4).
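For example, here is a minimal sketch of what that could look like on the example data from the question (the scaling step and the random_state are additions for illustration, not from the question):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# example data from the question
df = pd.DataFrame({
    'Name':   list('ABCDEFGH'),
    'Feat_1': [12, 76, 97, 32, 45, 61, 25, 49],
    'Feat_2': [0.10, 0.03, 0.07, 0.21, 0.15, 0.02, 0.22, 0.30],
    'Feat_3': [0, 1, 1, 1, 0, 1, 0, 0],
    'Feat_4': [9734, 10024, 8188, 6420, 7723, 14987, 5290, 7107],
})

# k-means is distance based, so scale the features first; otherwise Feat_4
# (thousands) would dominate Feat_2 (fractions)
X = StandardScaler().fit_transform(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])

# ask for 4 clusters and attach the resulting labels as the new 'Class' column
df['Class'] = KMeans(n_clusters=4, random_state=0).fit_predict(X)
print(df)

Note that the returned cluster labels (0-3) are arbitrary integers, unlike hand-assigned class names; they only indicate which rows were grouped together.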
Start with an unsupervised method to determine clusters... use those clusters as your labels.
I recommend using sklearn's GMM instead of k-means.
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
K-means assumes roughly spherical (circular) clusters, whereas a Gaussian mixture can also fit elongated, elliptical ones.
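As a rough sketch, reusing the scaled matrix X from the k-means snippet above (the covariance_type and random_state are illustrative assumptions):

from sklearn.mixture import GaussianMixture

# four mixture components play the role of the four classes
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0)
labels = gmm.fit(X).predict(X)

# unlike k-means, a GMM also gives soft assignments: one probability per cluster
probabilities = gmm.predict_proba(X)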
This topic is called unsupervised learning.
One definition is:
Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs.[1] It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.
There are tons of algorithms out there; you need to try what fits your data best. Some examples are:
Hierarchical clustering (implemented in SciPy: https://en.wikipedia.org/wiki/Single-linkage_clustering; see the sketch after this list)
k-means (implemented in sklearn: https://en.wikipedia.org/wiki/K-means_clustering)
DBSCAN (implemented in sklearn: https://en.wikipedia.org/wiki/DBSCAN)
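For the hierarchical option, a minimal SciPy sketch (assuming X is a scaled feature matrix as in the earlier snippets) could look like this:

from scipy.cluster.hierarchy import linkage, fcluster

# build the single-linkage dendrogram, then cut it into 4 flat clusters
Z = linkage(X, method='single')
labels = fcluster(Z, t=4, criterion='maxclust')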
Related
For a project, I have a dataset:
Eruptions Waiting
0 3.600 79
1 1.800 54
2 3.333 74
3 2.283 62
4 4.533 85
and was instructed to turn it into a Seaborn pairplot.
I am then asked: what correlation method should I use based on this graph? I am stuck between Pearson and Spearman and am unsure which to choose. Based on this, I am also asked: are the durations correlated with the waiting time between eruptions? (Any correlation >= .7 counts as a correlation.) Please help.
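For reference, both candidate coefficients can be computed directly with pandas; a minimal sketch, assuming the data above sits in a DataFrame named df with those column names:

# Pearson measures the strength of a linear relationship,
# Spearman the strength of a monotonic (rank-based) one
pearson = df['Eruptions'].corr(df['Waiting'], method='pearson')
spearman = df['Eruptions'].corr(df['Waiting'], method='spearman')
print(pearson, spearman)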
To evaluate diagnostic performance, I want to plot a ROC curve, calculate the AUC, and determine a cutoff value.
I have the concentration of some protein and the actual disease diagnosis result (true or false).
I found some references, but I think they are optimized for machine learning.
And I'm not a Python expert; I can't figure out how to replace the test data with my own.
Here are some references and my sample data.
Could you please help me?
Sample Value Real
1 74.9 T
2 64.22 T
3 45.11 T
4 12.01 F
5 61.43 T
6 96 T
7 74.22 T
8 79.9 T
9 5.18 T
10 60.11 T
11 14.96 F
12 26.01 F
13 26.3 F
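Not part of the original post, but a minimal sketch of how this sample data could be fed to sklearn.metrics might look like this (choosing the cutoff by Youden's J statistic is one common convention, not the only one):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# values and labels copied from the sample data above ('T' -> 1, 'F' -> 0)
scores = np.array([74.9, 64.22, 45.11, 12.01, 61.43, 96, 74.22, 79.9,
                   5.18, 60.11, 14.96, 26.01, 26.3])
y_true = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0])

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# cutoff that maximises Youden's J statistic (tpr - fpr)
cutoff = thresholds[np.argmax(tpr - fpr)]

plt.plot(fpr, tpr, label='AUC = %.3f, cutoff = %.2f' % (auc, cutoff))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()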
My original data looks like this.
id season home_team away_team home_goals away_goals result winner
0 0 2006-07 Shu Liv 1 1 D NaN
1 1 2006-07 Ars Avl 1 1 D NaN
2 2 2006-07 Eve Wat 2 1 H Eve
3 3 2006-07 New Wig 2 1 H New
4 4 2006-07 Por Bla 3 0 H Por
The purpose is to build a model that predicts the probability of each match outcome, e.g.
Home Team Win 55%
Draw 13%
Away Team Win 32%
I selected these 3 columns and label-encoded them:
home_team, away_team, winner
Then I created these new classes/labels:
df.loc[df["winner"]==df["home_team"],"home_team_win"]=1
df.loc[df["winner"]!=df["home_team"],"home_team_win"]=0
df.loc[df["result"]=='D',"draw"]=1
df.loc[df["result"]!='D',"draw"]=0
df.loc[df["winner"]==df["away_team"],"away_team_win"]=1
df.loc[df["winner"]!=df["away_team"],"away_team_win"]=0
Now the encoded data looks like this:
home_team away_team home_team_win away_team_win draw
0 28 19 0 0 1
1 1 2 0 0 1
2 14 34 1 0 0
3 23 37 1 0 0
4 25 4 1 0 0
Initially, I used the code below for the single label 'home_team_win' and it worked fine, but it doesn't support multiple classes/labels.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X = prediction_df.drop(['home_team_win'], axis=1)
y = prediction_df['home_team_win']

logReg = LogisticRegression(solver='lbfgs')
rfe = RFE(logReg, n_features_to_select=20)
rfe = rfe.fit(X, y.values.ravel())
How do I do multi-label or multi-class classification for this problem?
The target binary variables home_team_win, away_team_win, and draw are mutually exclusive. It does not seem to be a good idea to use multi-label methods in this problem, since, in general, they are designed to exploit dependencies among labels, which do not exist in this dataset.
I suggest modelling it as a multi-class problem in its most common form, where there is a single column with three classes: 0, 1, and 2 (representing home_team_win, draw, and away_team_win).
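For illustration, one possible way to build that single target column (assuming away wins are coded 'A' in the result column; the sample above only shows 'H' and 'D'):

import numpy as np

# 0 = home win, 1 = draw, 2 = away win
Y = np.select(
    [df['result'] == 'H', df['result'] == 'D', df['result'] == 'A'],
    [0, 1, 2],
)
X = df[['home_team', 'away_team']]  # the label-encoded team columns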
Many implementations of classifiers in scikit-learn can work directly in this manner. Logistic Regression is one of them:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression(solver='lbfgs', multi_class='ovr')
logReg.fit(X, Y)
logReg.predict_proba(X)
This code will output the desired probabilities for each class of each row of X.
In particular, this code trains one Logistic Regression for each class separately (this is what the multi_class='ovr' parameter does).
Take a look at https://scikit-learn.org/stable/supervised_learning.html for other classifiers that directly work in this multi-class dataset form that I suggested.
I am working on the Kaggle insurance pricing competition (https://www.kaggle.com/floser/french-motor-claims-datasets-fremtpl2freq). The data looks like this:
IDpol ClaimNb Exposure Area VehPower VehAge DrivAge BonusMalus VehBrand VehGas Density Region
0 1.0 1 0.10 D 5 0 55 50 B12 Regular 1217 R82
1 3.0 1 0.77 D 5 0 55 50 B12 Regular 1217 R82
2 5.0 1 0.75 B 6 2 52 50 B12 Diesel 54 R22
3 10.0 1 0.09 B 7 0 46 50 B12 Diesel 76 R72
4 11.0 1 0.84 B 7 0 46 50 B12 Diesel 76 R72
For preprocessing etc. I am using a Pipeline including a ColumnTransformer:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, OneHotEncoder, StandardScaler
from sklearn.linear_model import PoissonRegressor

pt_columns = ['BonusMalus', 'RBVAge']
log_columns = ['Density']
kbins_columns = ['VehAge', 'VehPower', 'DrivAge']
cat_columns = ['Area', 'Region', 'VehBrand', 'VehGas']

# manually engineered flag column
X_train['RBVAge'] = 0
X_train.loc[(X_train['VehGas'] == 'Regular') & (X_train['VehBrand'] == 'B12') & (X_train['VehAge'] == 0), 'RBVAge'] = 1

ct = ColumnTransformer([('pt', 'passthrough', pt_columns),
                        ('log', FunctionTransformer(np.log1p, validate=False), log_columns),
                        ('kbins', KBinsDiscretizer(), kbins_columns),
                        ('ohe', OneHotEncoder(), cat_columns)])

pipe_poisson_reg = Pipeline([('cf_trans', ct),
                             ('ssc', StandardScaler(with_mean=False)),
                             ('poisson_regressor', PoissonRegressor())])
Once the model is fitted I'd like to display the feature importances and names of the features like this:
name      feature_importances_
Area_A    0.25
Area_B    0.10
VehAge    0.30
...       ...
The problem I am facing is getting the feature names while using a ColumnTransformer. In particular, getting the feature names from the KBinsDiscretizer, which uses one-hot encoding internally, does not work easily. What I have tried so far is creating a NumPy array with the feature names manually, but as I said I cannot get the feature names from the KBinsDiscretizer, and this solution does not seem very elegant.
# index 3 of transformers_ is the fitted ('ohe', OneHotEncoder(), cat_columns) entry
columnNames = np.append(pt_columns, pipe_poisson_reg['cf_trans'].transformers_[3][1].get_feature_names(cat_columns))
Is there a simple (maybe even built-in) way to create a DataFrame including both the feature names and feature importances?
And since we are already here (this might be a bit off-topic):
Is there a simple way to create a custom ColumnTransformer which adds the new column 'RBVAge', which I currently add manually?
This will be a whole lot easier in the upcoming v1.0 (or you can get the current GitHub code), which includes PR18444, adding a get_feature_names_out to KBinsDiscretizer as well as Pipeline (and others).
If you are stuck on your version of sklearn, you can have a look at the source code that does this to help you patch together the same; it appears not to do very much itself, just delegating to the internal _encoder object to do the work, and OneHotEncoder has had get_feature_names for a while (soon to be replaced by get_feature_names_out).
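For illustration, a sketch of what that lookup could look like once those pieces are available (assuming scikit-learn >= 1.1 so that the FunctionTransformer can pass names through when built with feature_names_out='one-to-one', and noting that PoissonRegressor exposes coefficients via coef_ rather than feature_importances_):

import pandas as pd

# the log step is assumed to be FunctionTransformer(np.log1p, feature_names_out='one-to-one')
# so every step before the final regressor can report feature names
feature_names = pipe_poisson_reg[:-1].get_feature_names_out()

coefficients = pd.DataFrame({
    'name': feature_names,
    'coefficient': pipe_poisson_reg[-1].coef_,
})
print(coefficients)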
I am working in the field of pharmaceutical sciences. I work on chemical compounds, and by calculating their chemical properties or descriptors we can predict certain biological functions of those compounds. I use the Python and R programming languages for this and also use the Weka machine learning tool. Weka provides facilities for binary prediction using SVM and other supporting algorithms.
Example data set: training set
Chem_ID MW LogP HbD HbE IC50 Class_label
001 232 5 0 2 20 0
002 280 2 1 4 41 1
003 240 5 0 2 22 0
004 300 4 1 5 48 1
005 245 2 0 2 24 0
006 255 1 0 2 20 0
007 299 5 1 4 49 1
Test set
Chem_ID MW LogP HbD HbE IC50 Class_label
000 255 1 0 2 20
In Weka there are a few algorithms with which we can predict the "class_label", or we can predict a specific variable (we usually predict the "IC50" values). Does scikit-learn or any other machine learning library in Python have that capability? If yes, how can we use it? Thanks.
Yes, this is a regression problem. There are many different models to solve a regression problem, from a simple Linear Regression, to Support Vector Regression or Decision Tree Regressors (and many more).
They work similarly to a binary classifier: you give them your training data, and instead of 0/1 labels you give them target values to train on. In your case you would take the feature you want to predict as the target value and delete it from the training data.
Short example:
from sklearn.linear_model import LinearRegression

target_values = training_set['IC50']
training_data = training_set.drop('IC50', axis=1)  # drop the target from the features

clf = LinearRegression()
clf.fit(training_data, target_values)

test_data = test_set.drop('IC50', axis=1)
predicted_values = clf.predict(test_data)