To evaluate diagnostic performance, I want to plot an ROC curve, calculate the AUC, and determine a cutoff value.
I have the concentration of a protein, and the actual disease diagnosis (true or false).
I found some references, but I think they are optimized for machine learning.
And I'm not a Python expert, so I can't figure out how to replace the test data with my own.
Here are some references and my sample data.
Could you please help me?
Sample Value Real
1 74.9 T
2 64.22 T
3 45.11 T
4 12.01 F
5 61.43 T
6 96 T
7 74.22 T
8 79.9 T
9 5.18 T
10 60.11 T
11 14.96 F
12 26.01 F
13 26.3 F
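In case it helps, here is a minimal sketch using scikit-learn on the sample above. Mapping T/F to 1/0 and picking the cutoff via Youden's J statistic are my assumptions; Youden's J is one common convention for choosing a cutoff, not the only one.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# protein concentrations and diagnoses from the sample data above (T -> 1, F -> 0)
values = np.array([74.9, 64.22, 45.11, 12.01, 61.43, 96, 74.22,
                   79.9, 5.18, 60.11, 14.96, 26.01, 26.3])
labels = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0])

fpr, tpr, thresholds = roc_curve(labels, values)
roc_auc = auc(fpr, tpr)

# Youden's J statistic (tpr - fpr): the threshold maximising it is one common cutoff
best = np.argmax(tpr - fpr)
print(f"AUC = {roc_auc:.3f}, suggested cutoff = {thresholds[best]}")

plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()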
Related
I'm trying to implement a paper where PIMA Indians Diabetes dataset is used. This is the dataset after imputing missing values:
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age Outcome
0 1 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50 1
1 1 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31 0
2 1 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32 1
3 1 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21 0
4 0 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33 1
5 1 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30 0
The description:
df.describe()
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 0.855469 121.686763 72.405184 29.153420 155.548223 32.457464 0.471876 33.240885
std 0.351857 30.435949 12.096346 8.790942 85.021108 6.875151 0.331329 11.760232
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000
25% 1.000000 99.750000 64.000000 25.000000 121.500000 27.500000 0.243750 24.000000
50% 1.000000 117.000000 72.202592 29.153420 155.548223 32.400000 0.372500 29.000000
75% 1.000000 140.250000 80.000000 32.000000 155.548223 36.600000 0.626250 41.000000
max 1.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
The description of normalization from the paper is as follows:
As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V’ with the equation below:
V' = (V - Y) / Z
where V' is the new normalized value, V the previous value, Y the mean, and Z the standard deviation.
import scipy.stats
z = scipy.stats.zscore(df)
But when I run the code above, I'm getting negative values and values greater than one, i.e., not in the range [0, 1].
There are several points to note here.
Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.
Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.
Set up data
import string
import pandas as pd

# the raw file has no header row, so pass header=None
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv',
                 header=None)
# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]
print(df.head())
   a    b   c   d    e     f      g   h  i
0  6  148  72  35    0  33.6  0.627  50  1
1  1   85  66  29    0  26.6  0.351  31  0
2  8  183  64   0    0  23.3  0.672  32  1
3  1   89  66  23   94  28.1  0.167  21  0
4  0  137  40  35  168  43.1  2.288  33  1
Standardisation
# standardise each column: subtract the mean and divide by the standard deviation
standardised = (df - df.mean()) / df.std()

# print the minimum and maximum values in the entire dataset with a little formatting
print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
Min: -4.058 Max: 6.649
As you can see, the values are far from being in [0, 1]. Note the range of the resulting data from z-score normalisation will vary depending on the distribution of the input data.
Min-max scaling
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")
Min: 0.000 Max: 1.000
Here we do indeed get values in [0, 1].
Discussion
These and a number of other scalers exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons:
There are fewer chances of making a mistake as you have to do less typing.
sklearn will be at least as computationally efficient and often more so.
You should use the same scaling parameters from training on the test data to avoid leakage of test data information. (In most real world uses, this is unlikely to be significant but it is good practice.) By using sklearn you don't need to store the min/max/mean/SD etc. from scaling training data to reuse subsequently on test data. Instead, you can just use scaler.fit_transform(X_train) and scaler.transform(X_test).
If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).
I'm sure there are other reasons, but these are the main ones that come to mind.
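For illustration, here is a short sketch of that workflow, assuming the df loaded above and a hypothetical train/test split (StandardScaler works the same way if you want z-scores instead):
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# hypothetical split; the original question has no explicit test set
X_train, X_test = train_test_split(df, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training parameters
X_train_restored = scaler.inverse_transform(X_train_scaled)  # undo the scaling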
Your standardization formula isn't meant to put values in the range [0, 1].
If you want to normalize data into that range, you can use the following formula:
z = (actual_value - min_value_in_database) / (max_value_in_database - min_value_in_database)
And you're not obliged to do it manually: just use the sklearn library; you'll find various standardization and normalization methods in the preprocessing module.
Assuming your original dataframe is df and it has no invalid float values, this should work:
df2 = (df - df.values.min()) / (df.values.max()-df.values.min())
I have recently gotten into using sklearn, especially its classification models, and have a question that is more about use cases than about being stuck on any particular bit of code, so apologies in advance if this isn't the right place to ask.
So far I have been using sample data where one trains the model on data that has already been classified. In the 'Iris' dataset, for example, every observation is classified into one of the three species. But what if one wants to group/classify the data without knowing the classifications in the first place?
Let's take this imaginary data:
Name Feat_1 Feat_2 Feat_3 Feat_4
0 A 12 0.10 0 9734
1 B 76 0.03 1 10024
2 C 97 0.07 1 8188
3 D 32 0.21 1 6420
4 E 45 0.15 0 7723
5 F 61 0.02 1 14987
6 G 25 0.22 0 5290
7 H 49 0.30 0 7107
If one wanted to split the names into 4 separate classifications using the different features, is this possible, and which sklearn model(s) would be needed? I'm not asking for any code; I'm quite able to research on my own if someone could point me in the right direction. So far I can only find examples where the classifications are already known.
In the example above, if I wanted to break the data down in to 4 classifications I would want my outcome to be something like this (note the new column, denoting the class):
Name Feat_1 Feat_2 Feat_3 Feat_4 Class
0 A 12 0.10 0 9734 4
1 B 76 0.03 1 10024 1
2 C 97 0.07 1 8188 3
3 D 32 0.21 1 6420 3
4 E 45 0.15 0 7723 2
5 F 61 0.02 1 14987 1
6 G 25 0.22 0 5290 4
7 H 49 0.30 0 7107 4
Many thanks for any help
You can use k-means clustering and simply set the number of clusters you want (here n_clusters=4). Alternatively, hierarchical clustering groups the data into fewer and fewer classes at each step until everything ends up in one group; you can either stop once the number of clusters is what you want, or cut the already-built hierarchy at the level where the data were clustered into 4 classes.
sklearn.cluster.KMeans doc
Classification is a supervised approach, meaning that the training data comes with features and labels. If you want to group the data according to the features, then you can go for some clustering algorithms (unsupervised), such as sklearn.cluster.KMeans (with k = 4).
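A minimal sketch of that suggestion on the imaginary data above. Scaling first is my assumption (Feat_4 is in the thousands and would otherwise dominate the distances), and note that the cluster labels k-means returns are arbitrary integers 0-3, not the exact numbering shown in the question.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Name':   list('ABCDEFGH'),
    'Feat_1': [12, 76, 97, 32, 45, 61, 25, 49],
    'Feat_2': [0.10, 0.03, 0.07, 0.21, 0.15, 0.02, 0.22, 0.30],
    'Feat_3': [0, 1, 1, 1, 0, 1, 0, 0],
    'Feat_4': [9734, 10024, 8188, 6420, 7723, 14987, 5290, 7107],
})

# scale the features so they contribute comparably to the Euclidean distances
X = StandardScaler().fit_transform(df[['Feat_1', 'Feat_2', 'Feat_3', 'Feat_4']])
df['Class'] = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
print(df)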
Start with an unsupervised method to determine clusters, then use those clusters as your labels.
I recommend using sklearn's GaussianMixture (GMM) instead of k-means:
https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
K-means assumes spherical clusters; a Gaussian mixture can also model elongated ones.
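For reference, a minimal sketch of the GMM route, reusing a numeric feature matrix X such as the scaled array from the k-means sketch above:
from sklearn.mixture import GaussianMixture

# fit a 4-component mixture and assign each row to its most likely component
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
classes = gmm.predict(X)        # hard labels, integers 0-3
probs = gmm.predict_proba(X)    # soft assignments, one probability per component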
This topic is called unsupervised learning.
One definition is:
Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in a data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs. It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques.
There are tons of algorithms out there; you need to try what fits your data best. Some examples (see the sketch after this list):
Hierarchical clustering (implemented in SciPy: https://en.wikipedia.org/wiki/Single-linkage_clustering)
k-means (implemented in sklearn: https://en.wikipedia.org/wiki/K-means_clustering)
DBSCAN (implemented in sklearn: https://en.wikipedia.org/wiki/DBSCAN)
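As a quick illustration of the SciPy option, here is a hedged sketch on placeholder data (substitute your own feature matrix for X):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(8, 4)         # placeholder feature matrix

# build a single-linkage hierarchy, then cut it into at most 4 flat clusters
Z = linkage(X, method='single')
labels = fcluster(Z, t=4, criterion='maxclust')
print(labels)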
I am new to Python. I need to plot an ROC curve from two columns in my pandas data frame; any solution or recommendation?
I need to use this formula:
x = (1-dfpercentiles['acum_0%'])
y = (1-dfpercentiles['acum_1%'])
I tried using sklearn and matplotlib, but I didn't find a solution.
This is my DF:
In [109]: dfpercentiles['acum_0%']
Out[109]:
0 10.89
1 22.93
2 33.40
3 44.83
4 55.97
5 67.31
6 78.15
7 87.52
8 95.61
9 100.00
Name: acum_0%, dtype: float64
and
In [111]: dfpercentiles['acum_1%']
Out[111]:
0 2.06
1 5.36
2 8.30
3 13.49
4 18.98
5 23.89
6 29.72
7 42.87
8 62.31
9 100.00
Name: acum_1%, dtype: float64
This seems to be a matplotlib question.
Before anything, note that your percentiles are in the range 0-100 but your adjustment is 1 - percentile_value, so you need to rescale your values to 0-1.
I just used pyplot.plot to generate the ROC curve:
import matplotlib.pyplot as plt
plt.plot([1-(x/100) for x in [10.89, 22.93, 33.40, 44.83, 55.97, 67.31, 78.15, 87.52, 95.61, 100.00]],
[1-(x/100) for x in [2.06, 5.36, 8.30, 13.49, 18.98, 23.89, 29.72, 42.87, 62.31, 100.0]])
Using your dataframe, it would be:
plt.plot((1 - (dfpercentiles['acum_0%'] / 100)), (1 - (dfpercentiles['acum_1%'] / 100)))
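If you also want the area under this curve, one option is trapezoidal integration with NumPy. Appending the (1, 1) corner is my assumption, so that the curve spans the full [0, 1] range:
import numpy as np

# reverse the decile order so x runs from 0 upwards
x = (1 - dfpercentiles['acum_0%'] / 100).values[::-1]
y = (1 - dfpercentiles['acum_1%'] / 100).values[::-1]

# append the (1, 1) corner so the curve covers the whole range
x = np.append(x, 1.0)
y = np.append(y, 1.0)

print(f"approximate AUC: {np.trapz(y, x):.3f}")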
I have a dataset with n observations, where all the observations have m timesteps. I also have a n*m array which contains the label for each timestep on each given observation.
I am doing feature engineering on this dataset to find meaningful features in the data, according to the labels I have. Is there any Python package out there to facilitate this process?
I came across tsfresh (https://github.com/blue-yonder/tsfresh), although it seems like it's only intended to be used when we have a single label to classify each observation, and not a label to classify each timestep, as is my case.
You could try Featuretools. It is an open-source automated feature engineering library that explicitly deals with time to make sure you don't introduce label leakage.
For your data, you could create two entities: "observations" and "timesteps", and then apply featuretools.dfs (Deep Feature Synthesis) to generate features for each timestep. You can think of an entity as being the same as a table in a relational database.
Of particular usefulness for your problem would be the Cumulative primitives in Featuretools, which are operations that use many instances ordered by time to compute a single value. In your case, if there were an observation with multiple timesteps, each with a certain value, you could calculate the mean value of the previous timesteps using the CumMean primitive.
Here is an example:
from featuretools.primitives import Day, Weekend, Percentile, CumMean, CumSum
import featuretools as ft
import pandas as pd
import numpy as np
timesteps = pd.DataFrame({'ts_id': range(12),
                          'timestamp': pd.date_range(start='1/1/2018', freq='1d', periods=12),
                          'attr1': np.random.random(12),
                          'obs_id': [1, 2, 3] * 4})
print(timesteps)
attr1 obs_id timestamp ts_id
0 0.663216 1 2018-01-01 0
1 0.455353 2 2018-01-02 1
2 0.800848 3 2018-01-03 2
3 0.938645 1 2018-01-04 3
4 0.442037 2 2018-01-05 4
5 0.724044 3 2018-01-06 5
6 0.304241 1 2018-01-07 6
7 0.134359 2 2018-01-08 7
8 0.275078 3 2018-01-09 8
9 0.499343 1 2018-01-10 9
10 0.608565 2 2018-01-11 10
11 0.340991 3 2018-01-12 11
entityset = ft.EntitySet("timeseries")
entityset.entity_from_dataframe("timesteps",
                                timesteps,
                                index='ts_id',
                                time_index='timestamp')
entityset.normalize_entity(base_entity_id='timesteps',
                           new_entity_id='observations',
                           index='obs_id',
                           make_time_index=True)
# per timestep
cutoffs = timesteps[['ts_id', 'timestamp']]
feature_matrix, feature_list = ft.dfs(entityset=entityset,
                                      target_entity='timesteps',
                                      cutoff_time=cutoffs,
                                      trans_primitives=[Day, Weekend, Percentile, CumMean, CumSum],
                                      agg_primitives=[])
print(feature_matrix.iloc[:, -6:])
CUMMEAN(attr1 by obs_id) CUMSUM(attr1 by obs_id) CUMMEAN(PERCENTILE(attr1) by obs_id) CUMSUM(CUMMEAN(attr1 by obs_id) by obs_id) CUMSUM(PERCENTILE(attr1) by obs_id) observations.DAY(first_timesteps_time)
ts_id
0 0.100711 0.100711 1.000000 0.100711 1.000000 1
1 0.811898 0.811898 1.000000 0.811898 1.000000 2
2 0.989166 0.989166 1.000000 0.989166 1.000000 3
3 0.442035 0.442035 0.500000 0.442035 0.500000 1
4 0.910106 0.910106 0.800000 0.910106 0.800000 2
5 0.427610 0.427610 0.333333 0.427610 0.333333 3
6 0.832516 0.832516 0.714286 0.832516 0.714286 1
7 0.035121 0.035121 0.125000 0.035121 0.125000 2
8 0.178202 0.178202 0.333333 0.178202 0.333333 3
9 0.085608 0.085608 0.200000 0.085608 0.200000 1
10 0.891033 0.891033 0.818182 0.891033 0.818182 2
11 0.044010 0.044010 0.166667 0.044010 0.166667 3
This example also used “cutoff times” to tell the feature computation engine to only use data prior to the specified time for each “ts_id” or “obs_id”. You can read more about cutoff times on this page in the documentation.
Another cool thing Featuretools lets you do is construct features per observation in the "observations" table, instead of per timestep. To do this, change the "target_entity" parameter. In the following example, we take the last timestamp per observation to use as the cutoff time, which makes sure no data from after that time is used (e.g. data from obs_id = 2 at 2018-01-11 will not be included in the Percentile() computation for obs_id = 1, with cutoff time 2018-01-10).
# per observation
ocutoffs = timesteps[['obs_id', 'timestamp']].drop_duplicates(['obs_id'], keep='last')
ofeature_matrix, ofeature_list = ft.dfs(entityset=entityset,
                                        target_entity='observations',
                                        cutoff_time=ocutoffs,
                                        trans_primitives=[Day, Weekend, Percentile, CumMean, CumSum])
print(ofeature_matrix.iloc[:, -6:])
PERCENTILE(STD(timesteps.attr1)) PERCENTILE(MAX(timesteps.attr1)) PERCENTILE(SKEW(timesteps.attr1)) PERCENTILE(MIN(timesteps.attr1)) PERCENTILE(MEAN(timesteps.attr1)) PERCENTILE(COUNT(timesteps))
obs_id
1 0.666667 1.000000 0.666667 0.666667 0.666667 1.000000
2 0.333333 0.666667 0.666667 0.666667 0.333333 0.833333
3 1.000000 1.000000 0.333333 0.333333 1.000000 0.666667
Lastly, it is actually possible to use tsfresh in conjunction with Featuretools as a "custom primitive". This is an advanced feature, but I could explain more if you are interested.
I could answer in more detail if you gave more details about the question. However, based on my understanding, you want to predict something from the time-series data you have.
There is a Python machine learning package called keras. You can use LSTMs to train your model; the support for LSTMs in keras is very good.
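As a minimal, hedged sketch of that idea (all shapes, layer sizes, and the placeholder data are assumptions; the key point is return_sequences=True plus TimeDistributed, which yields one label per timestep as in your setup):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, TimeDistributed, Dense

n, m, k = 100, 20, 3                         # hypothetical sizes: observations, timesteps, features
X = np.random.rand(n, m, k)                  # placeholder data
y = np.random.randint(0, 2, size=(n, m, 1))  # one binary label per timestep

model = Sequential([
    LSTM(32, return_sequences=True, input_shape=(m, k)),  # emit an output at every timestep
    TimeDistributed(Dense(1, activation='sigmoid')),      # per-timestep prediction
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, y, epochs=2, batch_size=16)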
I am working in the field of pharmaceutical sciences. I work on chemical compounds, and by calculating their chemical properties or descriptors we can predict certain biological functions of those compounds. I use the Python and R programming languages for this, and also the Weka machine learning tool. Weka provides binary prediction using SVM and other supporting algorithms.
Example data set (training set):
Chem_ID MW LogP HbD HbE IC50 Class_label
001 232 5 0 2 20 0
002 280 2 1 4 41 1
003 240 5 0 2 22 0
004 300 4 1 5 48 1
005 245 2 0 2 24 0
006 255 1 0 2 20 0
007 299 5 1 4 49 1
Test set
Chem_ID MW LogP HbD HbE IC50 Class_label
000 255 1 0 2 20
In Weka there are a few algorithms with which we can predict the class_label, or we can also predict a specific variable (we usually predict the IC50 values). Does scikit-learn or any other machine learning library in Python have these capabilities? If yes, how can we use it? Thanks.
Yes, this is a regression problem. There are many different models to solve a regression problem, from simple Linear Regression to Support Vector Regression or Decision Tree Regressors (and many more).
They work similarly to binary classifiers: you give them your training data, but instead of 0/1 labels you give them target values to train on. In your case you would take the feature you want to predict as the target value and delete it from the training data.
Short example:
from sklearn.linear_model import LinearRegression

# use IC50 as the regression target and remove it from the features
target_values = training_set['IC50']
training_data = training_set.drop(columns=['IC50'])

clf = LinearRegression()
clf.fit(training_data, target_values)

test_data = test_set.drop(columns=['IC50'])
predicted_values = clf.predict(test_data)