Pandas: Calculate true positive rate for each row - python

I have a dataframe like this, with one column being the label and the other columns being predictions
label pred1 pred2 pred3
0 Apple Apple Orange Apple
1 Orange Orange Orange Orange
I would like to extend this dataframe with the true positive rate, TP / (TP + FN), for each row. This column should look like this:
Score
0 0.66
1 1.00
I am unsure how to go about this. Are there pandas functions that would help with this task?
Executable code: https://www.online-python.com/WP7wbgcqMS

Here is one approach where we convert the data to long format and check if the label equals the prediction. The average of the True/False values will be your Score.
import pandas as pd
d = {'Label': ['Apple','Orange'], 'pred1': ['Apple','Orange'], 'pred2': ['Orange','Orange'], 'pred3': ['Apple','Orange']}
df = pd.DataFrame(data=d)
df = df.melt(id_vars='Label', value_name='pred')
df['match'] = df['Label'].eq(df['pred'])
df.groupby('Label')['match'].mean().reset_index(name='Score')
Output
Label Score
0 Apple 0.666667
1 Orange 1.000000

Maybe like this, transposing so each row becomes a column and comparing every value against the first one (the label):
temp = df.T.apply(lambda x: x.iloc[0] == x).astype(int)  # 1 where a value matches that row's label
(temp.sum() - 1) / (temp.count() - 1)  # subtract 1 to discount the label matching itself
Out:
0 0.666667
1 1.000000
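If you want the Score aligned to each original row rather than grouped by label, a minimal sketch (assuming the prediction columns all share the pred prefix, as in the example):
import pandas as pd

d = {'Label': ['Apple', 'Orange'], 'pred1': ['Apple', 'Orange'],
     'pred2': ['Orange', 'Orange'], 'pred3': ['Apple', 'Orange']}
df = pd.DataFrame(data=d)

# Compare every pred* column against Label row-wise; the mean of the
# booleans per row (True counts as 1) is the per-row Score.
df['Score'] = df.filter(like='pred').eq(df['Label'], axis=0).mean(axis=1)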

Related

Compare the values of 2 columns in pandas dataframe to fill a third column

I have a dataframe (figure). Suppose that I add more observations to my dataframe.
For these new observations (9 and 10) I only fill in the color, food and age columns. For the score column, I want to compare against the other observations: if the "food" and "color" columns carry the same labels as an existing observation, the score should equal that observation's score.
In this case the score values are 5.0 and 6.0 respectively. How can I automate this process when I add a lot of observations without a score value?
You could try something like below:
import pandas as pd
#Shortened working data list, for demonstration purposes
data = [[1,'Red','Apple',70,5.0],[2,'Yellow','Pizza',90,6.0],[9,'Red','Apple', 2, None],[10,'Yellow','Pizza',2,None]]
#Set up data frame
df = pd.DataFrame(data, columns=['Observations', 'Color', 'Food', 'Age', 'Score'])
# Remove nan values
df_cleaned = df[df['Score'].notna()]
#Generate a dictionary with a key that combines Color and Food, and a value that equals Score
targetValues = dict(zip((df_cleaned.Color + df_cleaned.Food), df_cleaned.Score))
#Replace nan values in our original data frame with the values from our dictionary created above
df['Score']=df['Score'].fillna((df.Color + df.Food).map(targetValues))
print(df)
That will yield an output like below:
Observations Color Food Age Score
0 1 Red Apple 70 5.0
1 2 Yellow Pizza 90 6.0
2 9 Red Apple 2 5.0
3 10 Yellow Pizza 2 6.0
The general idea is to create a dictionary and use those key-value pairs to replace the NaN values in your data frame.
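For comparison, a merge-based sketch of the same idea (same demo df before the fill, and assuming the default RangeIndex so the join result aligns with the original frame):
# Keep only rows with a known Score, one per (Color, Food) pair
known = df.dropna(subset=['Score']).drop_duplicates(['Color', 'Food'])
# Left-join the known scores back onto every row
merged = df.merge(known[['Color', 'Food', 'Score']], on=['Color', 'Food'],
                  how='left', suffixes=('', '_known'))
# Fill the gaps in the original Score column
df['Score'] = df['Score'].fillna(merged['Score_known'])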

How to groupby and calculate new field with python pandas?

I'd like to group by a specific column, 'Fruit', within a data frame and calculate the percentage of each particular fruit that is 'Good'.
See below for my initial dataframe
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple','Apple','Banana'], 'Condition': ['Good','Bad','Good']})
Dataframe
Fruit Condition
0 Apple Good
1 Apple Bad
2 Banana Good
See below for my desired output data frame
Fruit Percentage
0 Apple 50%
1 Banana 100%
Note: Because there is 1 "Good" Apple and 1 "Bad" Apple, the percentage of Good Apples is 50%.
See below for my attempt, which overwrites all the columns
groupedDF = df.groupby('Fruit')
groupedDF.apply(lambda x: x[(x['Condition'] == 'Good')].count()/x.count())
See below for resulting table, which seems to calculate percentage but within existing columns instead of new column:
Fruit Condition
Fruit
Apple 0.5 0.5
Banana 1.0 1.0
We can compare Condition with eq, take advantage of the fact that True evaluates to 1 and False to 0 when processed as numbers, and take the groupby mean over Fruit:
new_df = (
    df['Condition'].eq('Good').groupby(df['Fruit']).mean().reset_index()
)
new_df:
Fruit Condition
0 Apple 0.5
1 Banana 1.0
We can further map to a format string and rename to match the desired output shown above:
new_df = (
    df['Condition'].eq('Good')
    .groupby(df['Fruit']).mean()
    .map('{:.0%}'.format)  # Change to percent format
    .rename('Percentage')  # Rename column to Percentage
    .reset_index()         # Restore RangeIndex and make Fruit a column
)
new_df:
Fruit Percentage
0 Apple 50%
1 Banana 100%
Naturally, further manipulations can be done as well.
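For instance, pd.crosstab with row normalisation reaches the same numbers; a sketch on the same df (normalize requires a reasonably recent pandas):
# Row-normalised contingency table: share of each Condition per Fruit
pct = pd.crosstab(df['Fruit'], df['Condition'], normalize='index')
pct['Good'].map('{:.0%}'.format).rename('Percentage').reset_index()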

How to fill a column under certain conditions?

I have two data frames df (with 15000 rows) and df1 ( with 20000 rows)
Where df looks like
Number Color Code Quantity
1 Red 12380 2
2 Blue 14440 3
3 Red 15601 1
and df1, which has two columns, Code and Quantity, where I want to fill the Quantity column under certain conditions using Python in order to obtain this:
Code Quantity
12380 2
15601 1
15640 1
14400 0
The conditions that I want to take into consideration are:
If the last two characters of the Code column of df1 are both zero, I want 0 in the Quantity column of df1.
If I don't find the Code in df, I put 1 in the Quantity column of df1.
Otherwise I take the quantity value from df.
Let us try:
mask = df1['Code'].astype(str).str[-2:].eq('00')
mapped = df1['Code'].map(df.set_index('Code')['Quantity'])
df1['Quantity'] = mapped.mask(mask, 0).fillna(1)
Details:
Create a boolean mask specifying the condition where the last two characters of Code are both 0:
>>> mask
0 False
1 False
2 False
3 True
Name: Code, dtype: bool
Using Series.map, map the values in the Code column of df1 to the Quantity column in df based on the matching Code:
>>> mapped
0 2.0
1 1.0
2 NaN
3 NaN
Name: Code, dtype: float64
Finally, mask the values in the mapped column where the boolean mask is True (replacing them with 0), and fill the remaining NaN values with 1:
>>> df1
Code Quantity
0 12380 2.0
1 15601 1.0
2 15640 1.0
3 14400 0.0
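The same three rules can also be spelled out explicitly with NumPy's select; a sketch reusing the mask and mapped series from above (note the extra import):
import numpy as np

# Order matters: the trailing-'00' rule takes priority over the lookup
conditions = [mask, mapped.notna()]
choices = [0, mapped]  # 0 for codes ending in '00', otherwise the mapped quantity
df1['Quantity'] = np.select(conditions, choices, default=1)  # 1 when Code is absent from df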

python: use agg with more than one customized function

I have a data frame like this.
import numpy as np
import pandas as pd
mydf = pd.DataFrame({'a': [1, 1, 3, 3], 'b': [np.nan, 2, 3, 6], 'c': [1, 3, 3, 9]})
a b c
0 1 NaN 1
1 1 2.0 3
2 3 3.0 3
3 3 6.0 9
I would like to have a resulting dataframe like this.
myResults = pd.concat([mydf.groupby('a').apply(lambda x: (x.b/x.c).max()), mydf.groupby('a').apply(lambda x: (x.b/x.c).min())], axis =1)
myResults.columns = ['max','min']
max min
a
1 0.666667 0.666667
3 1.000000 0.666667
Basically I would like to have the max and min of the ratio of column b to column c for each group (grouped by column a).
Is it possible to achieve this with agg?
I tried mydf.groupby('a').agg([lambda x: (x.b/x.c).max(), lambda x: (x.b/x.c).min()]). It does not work; it seems the column names b and c are not recognized.
Another way I can think of is to add the ratio column to mydf first, i.e. mydf['ratio'] = mydf.b/mydf.c, and then use agg on the updated mydf, like mydf.groupby('a')['ratio'].agg([max, min]).
Is there a better way to achieve this through agg or another function? In summary, I would like to apply a customized function to a grouped DataFrame, where the customized function needs to read multiple columns from the original DataFrame.
You can use a customized function to achieve this. It can create any number of new columns from any input columns, as in the function below.
def f(x):
    t = {}
    t['max'] = (x['b'] / x['c']).max()
    t['min'] = (x['b'] / x['c']).min()
    return pd.Series(t)

mydf.groupby('a').apply(f)
Output:
max min
a
1 0.666667 0.666667
3 1.000000 0.666667
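The asker's own second idea also works cleanly with a throwaway column and named aggregation (pandas >= 0.25); a sketch on the same mydf that avoids apply entirely:
(
    mydf.assign(ratio=mydf['b'] / mydf['c'])  # precompute the ratio once
        .groupby('a')['ratio']
        .agg(max='max', min='min')            # named aggregation keeps the column labels
)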

Counting values in several columns

I have the following DataFrame:
KPI_01 KPI_02 KPI_03
date
2015-05-24 green green red
2015-06-24 orange red NaN
And I want to count the number of colors for each date in order to obtain:
value green orange red
date
2015-05-24 2 0 1
2015-06-24 0 1 1
Here is my code that does the job. Is there a better (shorter) way to do that?
# Test data
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['05-24-2015', '06-24-2015'],
                   'KPI_01': ['green', 'orange'],
                   'KPI_02': ['green', 'red'],
                   'KPI_03': ['red', np.nan]})
df.set_index('date', inplace=True)
# Transforming to long format
df.reset_index(inplace=True)
long = pd.melt(df, id_vars=['date'])
# Pivoting data
pivoted = pd.pivot_table(long, index='date', columns=['value'], aggfunc='count', fill_value=0)
# Dropping unnecessary level
pivoted.columns = pivoted.columns.droplevel()
You could apply value_counts:
>>> df.apply(pd.Series.value_counts,axis=1).fillna(0)
green orange red
date
05-24-2015 2 0 1
06-24-2015 0 1 1
apply tends to be slow, and row-wise operations are slow as well, but to be honest, if your frame isn't very big you might not even notice the difference.
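Another compact route, a sketch assuming df is indexed by date as in the output above, is to stack the KPI columns into long form and cross-tabulate:
# stack drops NaN cells by default, giving one value per (date, KPI) pair
long_form = df.stack()
pd.crosstab(long_form.index.get_level_values('date'), long_form)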
