How to obtain an ROC Curve? - python

I am new to Python. I need to obtain the ROC curve from two columns in my pandas data frame. Any solution or recommendation?
I need to use this formula:
x = (1-dfpercentiles['acum_0%'])
y = (1-dfpercentiles['acum_1%'])
I tried using sklearn and matplotlib, but I didn't find a solution.
This is my DF:
In [109]: dfpercentiles['acum_0%']
Out[110]:
0 10.89
1 22.93
2 33.40
3 44.83
4 55.97
5 67.31
6 78.15
7 87.52
8 95.61
9 100.00
Name: acum_0%, dtype: float64
and
In [111]:dfpercentiles['acum_1%']
Out[112]:
0 2.06
1 5.36
2 8.30
3 13.49
4 18.98
5 23.89
6 29.72
7 42.87
8 62.31
9 100.00
Name: acum_1%, dtype: float64

This seems to be a matplotlib question.
First, your percentiles are in the range 0-100, but your formula is 1 - percentile_value, so you need to rescale your values to 0-1.
I just used pyplot.plot to generate the ROC curve:
import matplotlib.pyplot as plt
plt.plot([1-(x/100) for x in [10.89, 22.93, 33.40, 44.83, 55.97, 67.31, 78.15, 87.52, 95.61, 100.00]],
[1-(x/100) for x in [2.06, 5.36, 8.30, 13.49, 18.98, 23.89, 29.72, 42.87, 62.31, 100.0]])
Using your dataframe, it would be
plt.plot(1 - dfpercentiles['acum_0%'] / 100, 1 - dfpercentiles['acum_1%'] / 100)
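If you also want a number to go with the plot, here is a minimal sketch that estimates the area under that curve with the trapezoidal rule. It assumes that x plays the role of the false positive rate and y the true positive rate, and it adds the (1, 1) corner point, which is not in the original data, so the curve spans the whole unit square.

```python
import numpy as np

# Cumulative percentages copied from the question (0-100 scale)
acum_0 = np.array([10.89, 22.93, 33.40, 44.83, 55.97, 67.31, 78.15, 87.52, 95.61, 100.00])
acum_1 = np.array([2.06, 5.36, 8.30, 13.49, 18.98, 23.89, 29.72, 42.87, 62.31, 100.00])

# Rescale to 0-1, then apply the formula from the question
x = 1 - acum_0 / 100
y = 1 - acum_1 / 100

# Assumption: prepend the (1, 1) corner, then reverse so x is increasing
xs = np.concatenate(([1.0], x))[::-1]
ys = np.concatenate(([1.0], y))[::-1]

# Trapezoidal rule written out by hand
auc = float(np.sum(np.diff(xs) * (ys[1:] + ys[:-1]) / 2))
print(f"approximate AUC: {auc:.3f}")
```

With only ten points the result is a coarse estimate; more percentile bins would give a smoother curve.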

Related

How can I plot ROC curve and calculate AUC, determine cutoff value?

To evaluate diagnostic performance, I want to plot a ROC curve, calculate the AUC, and determine a cutoff value.
I have the concentration of a protein and the actual disease diagnosis result (true or false).
I found some references, but I think they are optimized for machine learning.
I'm not a Python expert, and I can't figure out how to replace the test data with my own.
Here are some references and my sample data.
Could you please help me?
Sample Value Real
1 74.9 T
2 64.22 T
3 45.11 T
4 12.01 F
5 61.43 T
6 96 T
7 74.22 T
8 79.9 T
9 5.18 T
10 60.11 T
11 14.96 F
12 26.01 F
13 26.3 F

Why don't all of the factor variables appear in the legend?

I'm pretty new to plotting with matplotlib and I'm having a few problems with legends. I have this data set:
Wavelength CD Time
0 250.0 0.00000 1
1 249.8 -0.04278 1
2 249.6 -0.03834 1
3 249.4 -0.02384 1
4 249.2 -0.04817 1
... ... ... ...
3760 200.8 0.99883 15
3761 200.6 0.50277 15
3762 200.4 -0.19228 15
3763 200.2 0.81317 15
3764 200.0 0.90226 15
[3765 rows x 3 columns]
Column types:
Wavelength float64
CD float64
Time int64
dtype: object
Why are not all of the values shown in the legend when I plot with Time as the categorical variable?
x = df1['Wavelength']
y = df1['CD']
z = df1['Time']
sns.lineplot(x=x, y=y, hue=z)
plt.tight_layout()
plt.show()
But I can plot using pandas' built-in matplotlib function with a colorbar like this:
df1.plot.scatter('Wavelength', 'CD', c='Time', cmap='RdYlBu')
What's the best way of choosing between discrete and continuous legends using matplotlib/seaborn?
Many thanks!

Boxplots for grouped dataframe

In a Python script I have the output of pandas' describe(), called df, which looks like the following. The output has a two-level index: Class and EL_base.
I want to make an individual boxplot for each class. How can I do it?
               count    mean   std     min     25%     50%     75%     max
Class EL_base
PC1   0            8  247.04  8.16  236.90  244.15  245.17  247.71  265.41
      1            8  243.25  2.96  237.22  242.57  243.84  244.49  247.29
PC2   0            8  243.25  2.96  237.22  242.57  243.84  244.49  247.29
      1            8  518.96  6.35  507.27  515.38  519.72  523.65  526.25
      2            8  519.52  2.84  513.77  518.17  520.50  521.46  522.39

Filtering pandas dataframe for a steady speed condition

Below is a sample dataframe which is similar to mine except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df=pd.DataFrame([
[10.07,5], [10.24,5], [12.85,5], [11.85,5],
[11.10,5], [14.56,5], [14.43,5], [14.85,5],
[14.95,5], [10.41,5], [15.20,5], [15.47,5],
[15.40,5], [15.31,5], [15.43,5], [15.65,5]
], columns=['speed','delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'speed' and 'delta_t'. delta_t is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to keep only the data points where the speed is nearly steady, say by filtering for a standard deviation of < 0.5 and delta_t >= 15 min. For example, if we start with the first speed, the code should keep stepping to the next speeds and keep calculating the standard deviation; if it is less than 0.5 and the delta_t values sum to 30 min or more, I should copy that data into a new dataframe.
So for this dataframe I will be left with index 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestions on how to do it? Sorry, I am stuck. It seems too complicated to me.
Thank you.
Best Regards Arun
Let's use rolling, shift, and std:
Calculate the rolling std for a window of 3, then find those stds less than 0.5 and use shift(-2) to get the values at the start of each window where the std was less than 0.5. Using boolean indexing with | (or), we can get the entire steady-state range.
df_std = df['speed'].rolling(3).std()
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
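The rolling-std mask can be extended to cover the duration condition from the question as well. The following is only a sketch under assumptions: the constants WINDOW, MAX_STD and MIN_DURATION are illustrative names that do not come from the original post, and the duration of a steady run is approximated by summing delta_t over its rows. It marks every row of each low-std window (not only the first and last), groups consecutive steady rows into runs, and keeps the runs that last long enough.

```python
import pandas as pd

df = pd.DataFrame(
    [[10.07, 5], [10.24, 5], [12.85, 5], [11.85, 5],
     [11.10, 5], [14.56, 5], [14.43, 5], [14.85, 5],
     [14.95, 5], [10.41, 5], [15.20, 5], [15.47, 5],
     [15.40, 5], [15.31, 5], [15.43, 5], [15.65, 5]],
    columns=["speed", "delta_t"],
)

WINDOW = 3         # rows per rolling window (assumed, as in the answer above)
MAX_STD = 0.5      # steadiness threshold from the question
MIN_DURATION = 15  # minimum run length in minutes (assumed; try 30 as well)

# True at the last row of every window whose std is below the threshold
low_std = df["speed"].rolling(WINDOW).std() < MAX_STD

# Mark every row that belongs to at least one low-std window
steady = low_std.copy()
for k in range(1, WINDOW):
    steady |= low_std.shift(-k, fill_value=False)

# Give each run of consecutive equal flags its own id
run_id = (steady != steady.shift(fill_value=False)).cumsum()

# Approximate each run's duration as the sum of its delta_t values
duration = df.groupby(run_id)["delta_t"].transform("sum")

df_ss = df[steady & (duration >= MIN_DURATION)]
print(df_ss)
```

On the sample data this keeps the same rows as the answer above (5 to 8 and 10 to 15); with MIN_DURATION = 30 only rows 10 to 15 remain.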

spline interpolation between two arrays in python

I am trying to do spline interpolation between two arrays in Python. My data set looks like this:
| 5 15
-------+--------------------
1 32.68 29.16
2 32.73 27.20
3 32.78 28.24
4 32.83 27.27
5 32.88 25.27
6 32.93 31.35
7 32.98 27.39
8 33.03 26.42
9 33.08 27.46
10 33.13 30.50
11 33.18 27.53
12 33.23 29.57
13 33.23 27.99
14 33.23 28.64
15 33.23 26.68
16 33.23 29.72
And I am trying to do a spline interpolation between the two points and produce the values for 10, something that will eventually look like this (but spline interpolated):
| 10
-----+--------
1 30.92
2 29.965
3 30.51
4 30.05
5 29.075
6 32.14
7 30.185
8 29.725
9 30.27
10 31.815
11 30.355
12 31.4
13 30.61
14 30.935
15 29.955
16 31.475
I have been looking at examples of using scipy.interpolate.InterpolatedUnivariateSpline, but it seems to take only one array for x and one for y, and I can't figure out how to make it interpolate these two arrays.
Can someone please help point me in the right direction?
With the amount of data you have, only two points for each x value, piecewise linear interpolation is the most practical tool. Taking your two arrays to be v5 and v15 (values along y=5 line and y=15 line), and the x-values to be 1,2, ..., 16, we can create a piecewise linear interpolant like this:
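import numpy as np
# Note: interp2d was deprecated in SciPy 1.10 and removed in 1.14, so this needs an older SciPy
from scipy.interpolate import interp2d
f = interp2d(np.arange(1, 17), [5, 15], np.stack((v5, v15)), kind='linear')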
This can be evaluated in the usual way: for example, f(np.arange(1, 17), 10) returns precisely the numbers you expected.
[ 30.92 , 29.965, 30.51 , 30.05 , 29.075, 32.14 , 30.185,
29.725, 30.27 , 31.815, 30.355, 31.4 , 30.61 , 30.935,
29.955, 31.475]
interp2d can also create a cubic bivariate spline, but not from this data: your grid is too small in the y-direction.
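Because there are only two values per row, the same linear interpolation can also be written directly in NumPy, which may be handy where interp2d is not available. This is a minimal sketch; the helper name interpolate_between is made up for illustration, and v5 and v15 are the two columns from the question.

```python
import numpy as np

# Values along the y=5 and y=15 lines, copied from the question
v5 = np.array([32.68, 32.73, 32.78, 32.83, 32.88, 32.93, 32.98, 33.03,
               33.08, 33.13, 33.18, 33.23, 33.23, 33.23, 33.23, 33.23])
v15 = np.array([29.16, 27.20, 28.24, 27.27, 25.27, 31.35, 27.39, 26.42,
                27.46, 30.50, 27.53, 29.57, 27.99, 28.64, 26.68, 29.72])

def interpolate_between(y_new, y0=5.0, y1=15.0):
    """Linearly interpolate each row between the y0 and y1 columns."""
    t = (y_new - y0) / (y1 - y0)  # 0 at y0, 1 at y1
    return v5 + t * (v15 - v5)

v10 = interpolate_between(10)
print(v10)
```

For y = 10 this is simply the midpoint of the two columns, which matches the values listed in the question.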
