Why doesn't Jupyter Notebook format this dictionary well? (python)

I'm using the classification_report function, which is supposed to return a dictionary, but it doesn't return a well-structured one. How can I print it nicely in a Jupyter notebook?
from sklearn.metrics import classification_report
classification_report(test_preds, y_test)
Here is my output:
' precision recall f1-score support\n\n 0 0.81 0.97 0.88 1653\n 1 0.76 0.28 0.41 533\n\n accuracy 0.80 2186\n macro avg 0.78 0.63 0.65 2186\nweighted avg 0.79 0.80 0.77 2186\n'

Use print(classification_report(test_preds, y_test)). The report is returned as a single string containing newlines, so printing it (rather than letting the cell echo its repr) restores the table layout.
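If you actually want a dictionary rather than a formatted string, classification_report accepts output_dict=True; a minimal sketch, reusing test_preds and y_test from the question (note that the documented argument order is (y_true, y_pred)):

from sklearn.metrics import classification_report
import pandas as pd

# output_dict=True returns a nested dict instead of the pre-formatted string
report = classification_report(y_test, test_preds, output_dict=True)

# A DataFrame of that dict renders as a clean table in a notebook cell
pd.DataFrame(report).transpose()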

Related

Default positive class in multilevel sklearn classification

I am working on a churn classification problem with 3 classes (0, 1, 2) but want to optimize classes 0 and 1 for recall. Does that mean sklearn needs to treat classes 0 and 1 as the positive classes? How can I explicitly specify which class I want to optimize recall for? If that is not possible, should I consider renaming the classes in ascending order so that 1 and 2 are the default positives?
              precision    recall  f1-score   support

           0       0.71      0.18      0.28      2611
           1       0.57      0.54      0.56      5872
           2       0.70      0.88      0.78      8913

    accuracy                           0.66     17396
   macro avg       0.66      0.53      0.54     17396
weighted avg       0.66      0.66      0.63     17396
Here is the code I am using for reference (although what I really need is an understanding of how to optimize recall for only classes 0 and 1 here):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_test1 = {'learning_rate': (0.05, 0.1), 'max_depth': (3, 5)}
estimator = GridSearchCV(
    estimator=GradientBoostingClassifier(loss='deviance', subsample=0.8,
                                         random_state=10, n_estimators=200),
    param_grid=param_test1, cv=2, refit='recall_score')
estimator.fit(df[predictors], df[target])
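One way to target recall for specific classes only is a custom scorer; a hedged sketch using make_scorer and recall_score's labels argument (df, predictors, and target are the names from the question):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Score = recall averaged over classes 0 and 1 only; class 2 is ignored by the scorer
recall_0_1 = make_scorer(recall_score, labels=[0, 1], average='macro')

param_test1 = {'learning_rate': (0.05, 0.1), 'max_depth': (3, 5)}
estimator = GridSearchCV(
    estimator=GradientBoostingClassifier(subsample=0.8, random_state=10, n_estimators=200),
    param_grid=param_test1, cv=2, scoring=recall_0_1)
estimator.fit(df[predictors], df[target])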

Python: how can I pass parameters in def to inputs in pandas loc?

I want to pass the parameters of my function to the inputs of pandas loc, but I am not sure how to do so, as loc requires defined labels as inputs. Or is there any other way I can perform the Excel INDEX MATCH equivalent in Python without using loc? Many thanks!
Below please find my code:
def get_correl_diff_tenor(p1, p2):
    correl = IRCorrMatrix.loc['p1', 'p2']
    return correl
p1 and p2 in loc['p1', 'p2'] refer to the tenor pairs for calling the corresponding correlation value in the matrix below.
IRCorrMatrix is shown below, which is a correlation matrix defined by tenor pairs.
         2w    1m    3m    6m    1y
Tenor
2w     1.00  0.73  0.64  0.57  0.44
1m     0.73  1.00  0.78  0.67  0.50
3m     0.64  0.78  1.00  0.85  0.66
6m     0.57  0.67  0.85  1.00  0.81
1y     0.44  0.50  0.66  0.81  1.00
IIUC, remove the quotes from 'p1' and 'p2' so the function's variables are passed instead of literal strings:
IRCorrMatrix.loc[p1, p2]
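Put back into the function, a minimal sketch (IRCorrMatrix and the tenor labels are the ones from the question):

def get_correl_diff_tenor(p1, p2):
    # p1 and p2 are used as variables, so their values become the row/column labels
    return IRCorrMatrix.loc[p1, p2]

get_correl_diff_tenor('2w', '1m')   # 0.73, the 2w/1m entry of the matrix above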

Sklearn - plotting classification report gives a different output than basic avg?

I wanted to leverage this answer, How to plot scikit learn classification report?, which turns an sklearn classification report into a heatmap.
It all works with their sample report; however, my classification report looks slightly different and is thus breaking the functions.
Their report (notice the avg / total):
sampleClassificationReport =
             precision    recall  f1-score   support

     Acacia       0.62      1.00      0.76        66
    Blossom       0.93      0.93      0.93        40
   Camellia       0.59      0.97      0.73        67
      Daisy       0.47      0.92      0.62       272
     Echium       1.00      0.16      0.28       413

avg / total       0.77      0.57      0.49       858
My report with metrics.classification_report(valid_y, y_pred):
              precision    recall  f1-score   support

           0       1.00      0.18      0.31        11
           1       0.00      0.00      0.00        14
           2       0.00      0.00      0.00        19
           3       0.50      0.77      0.61        66
           4       0.39      0.64      0.49        47
           5       0.00      0.00      0.00        23

    accuracy                           0.46       180
   macro avg       0.32      0.27      0.23       180
weighted avg       0.35      0.46      0.37       180
The issue, from the selected answer in the heatmap link, is here:
for line in lines[2 : (len(lines) - 2)]:
    t = line.strip().split()
    if len(t) < 2:
        continue
    classes.append(t[0])
    v = [float(x) for x in t[1: len(t) - 1]]
    support.append(int(t[-1]))
    class_names.append(t[0])
    print(v)
    plotMat.append(v)
Because I get the error:
ValueError: could not convert string to float: 'avg'
So the problem truly is how my classification report is being outputted. What can I change here to match the sample?
EDIT: what I've tried:
df = pd.DataFrame(metrics.classification_report(valid_y, y_pred)).T
df['support'] = df.support.apply(int)
df.style.background_gradient(cmap='viridis',
                             subset=pd.IndexSlice['0':'9', :'f1-score'])
Error:
ValueError: DataFrame constructor not properly called!
With the advent of the output_dict param in classification_report, there is no hassle in parsing the report. You can read the output of classification_report directly into a pd.DataFrame, and then use the DataFrame's style attribute to render the heat map.
Example:
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=12,
                           n_clusters_per_class=1, n_classes=10,
                           class_sep=2.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y)

clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

df = pd.DataFrame(classification_report(clf.predict(X_test),
                                        y_test, digits=2,
                                        output_dict=True)).T
df['support'] = df.support.apply(int)
df.style.background_gradient(cmap='viridis',
                             subset=pd.IndexSlice['0':'9', :'f1-score'])
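The styled DataFrame only renders inside a notebook; if you want a standalone figure instead, here is a hedged sketch with seaborn, assuming the df built above (and that its last three rows are the accuracy / macro avg / weighted avg summaries):

import matplotlib.pyplot as plt
import seaborn as sns

# Keep only the per-class rows and drop the support column so counts
# don't dominate the colour scale
scores = df.iloc[:-3, :-1].astype(float)
sns.heatmap(scores, annot=True, cmap='viridis')
plt.title('Classification report')
plt.tight_layout()
plt.show()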

python pandas How to remove outliers from a dataframe and replace with an average value of preceding records

I have a dataframe of 16k records with multiple groups of countries and other fields. I have produced an initial output of the data that looks like the snippet below. Now I need to do some data cleansing and manipulation: remove skews or outliers and replace them with a value based on certain rules.
i.e. in the table below, how could I identify the skewed points (any value greater than 1) and replace them with the average of the next two records, or the previous record if there are no later records (within that group)?
So in the dataframe below I would like to replace Bill%4 for IT week1, which is 1.21, with the average of week2 and week3 for IT, so it becomes 0.81.
Any tricks for this?
Country   Week    Bill%1  Bill%2  Bill%3  Bill%4  Bill%5  Bill%6
IT        week1     0.94    0.88    0.85    1.21    0.77    0.75
IT        week2     0.93    0.88    1.25    0.80    0.77    0.72
IT        week3     0.94    1.33    0.85    0.82    0.76    0.76
IT        week4     1.39    0.89    0.86    0.80    0.80    0.76
FR        week1     0.92    0.86    0.82    1.18    0.75    0.73
FR        week2     0.91    0.86    1.22    0.78    0.75    0.71
FR        week3     0.92    1.29    0.83    0.80    0.75    0.75
FR        week4     1.35    0.87    0.84    0.78    0.78    0.74
I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))
df.index = list('abcdefghij')

# Define cutoff value
cutoff = 0.90

for col in df.columns:
    # Identify index locations above cutoff
    outliers = df[col][df[col] > cutoff]
    # Browse through outliers and average according to index location
    for idx in outliers.index:
        # Get index location
        loc = df.index.get_loc(idx)
        if loc < df.shape[0] - 2:
            # Not one of the last two rows: use the mean of the next two values
            df.loc[idx, col] = np.mean(df[col][loc + 1:loc + 3])
        else:
            # Otherwise fall back to the mean of two earlier values
            df.loc[idx, col] = np.mean(df[col][loc - 3:loc - 1])
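The question also asks for replacements to stay within each country's group; a hedged sketch of one way to do that with groupby, assuming the Country column and the Bill% layout shown in the question:

import pandas as pd

def replace_outliers(group, cutoff=1.0):
    # Only the numeric Bill% columns are checked
    bill_cols = [c for c in group.columns if c.startswith('Bill')]
    for col in bill_cols:
        col_pos = group.columns.get_loc(col)
        for loc in range(len(group)):
            if group[col].iloc[loc] > cutoff:
                if loc < len(group) - 2:
                    # Average of the next two weeks within the same country
                    group.iloc[loc, col_pos] = group[col].iloc[loc + 1:loc + 3].mean()
                else:
                    # No two later records: fall back to the previous week
                    group.iloc[loc, col_pos] = group[col].iloc[loc - 1]
    return group

cleaned = df.groupby('Country', group_keys=False).apply(replace_outliers)

For the IT week1 Bill%4 example above, this replaces 1.21 with the mean of 0.80 and 0.82, i.e. 0.81.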

Precision of sklearn.metric classification_report

I would like to know if it is possible to get more digits after the decimal point with classification_report from sklearn (scikit-learn).
At the moment it looks like this:
             precision    recall  f1-score   support

          1       0.61      0.73      0.67     71194
          2       0.64      0.33      0.43     13877
          3       0.56      0.59      0.57     61591
          4       0.64      0.51      0.57     13187
          5       0.66      0.69      0.67     57530
          6       0.54      0.06      0.11      2391
          7       0.54      0.40      0.46     30223

avg / total       0.60      0.60      0.60    249993
I don't think it is possible with that method, but maybe someone had the same idea (probably).
I know that sklearn.metrics.precision_score exists, though the classification_report is such a nice way to display all the results at once.
Not possible according to the source code. See lines 819 and 830: the format strings are hardcoded to %0.2f. If you really want it, just change it in your local file sklearn/metrics/metrics.py. Better yet, add an argument to classification_report with a precision number and use that. And submit your patch to the project!
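For what it is worth, this answer predates the current API: classification_report now takes a digits parameter, and output_dict=True returns the unrounded numbers; a quick sketch, with y_true and y_pred as placeholder arrays:

from sklearn.metrics import classification_report

# digits controls how many decimals appear in the formatted string
print(classification_report(y_true, y_pred, digits=4))

# output_dict=True skips the rounding entirely and returns a nested dict
report = classification_report(y_true, y_pred, output_dict=True)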
