A quick question as I'm currently changing from R to pandas for some projects:
I get the following print output from metrics.classification_report from sci-kit learn:
precision recall f1-score support
0 0.67 0.67 0.67 3
1 0.50 1.00 0.67 1
2 1.00 0.80 0.89 5
avg / total 0.83 0.78 0.79 9
I want to use this (and similar ones) as a matrix/dataframe so, that I could subset it to extract, say the precision of class 0.
In R, I'd give the first "column" a name like 'outcome_class' and then subset it:
my_dataframe[my_dataframe$class_outcome == 1, 'precision']
And I can do this in pandas but the dataframe that I want to use is simply a string see sckikit's doc
How can I make the table output here to a useable dataframe in pandas?
Assign it to a variable, s:
s = classification_report(y_true, y_pred, target_names=target_names)
Or directly:
s = '''
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
avg / total 0.70 0.60 0.61 5
'''
Use that as the string input for StringIO:
import io # For Python 2.x use import StringIO
df = pd.read_table(io.StringIO(s), sep='\s{2,}') # For Python 2.x use StringIO.StringIO(s)
df
Out:
precision recall f1-score support
class 0 0.5 1.00 0.67 1
class 1 0.0 0.00 0.00 1
class 2 1.0 0.67 0.80 3
avg / total 0.7 0.60 0.61 5
Now you can slice it like an R data.frame:
df.loc['class 2']['f1-score']
Out: 0.80000000000000004
Here, classes are the index of the DataFrame. You can use reset_index() if you want to use it as a regular column:
df = df.reset_index().rename(columns={'index': 'outcome_class'})
df.loc[df['outcome_class']=='class 1', 'support']
Out:
1 1
Name: support, dtype: int64
Related
My goal is to predict the count (variable y) based on various features (variable x). My y is most of the time (98.4%) equal to 0, so this data is inflated by 0.
Based on this premise, I thought that using the Zero Inflated Model with SVC could be an asset, given the characteristics of my data.
So I found a code on the internet and i'm trying to apply it to my problem (i'm very new to this)
from sklego.meta import ZeroInflatedRegressor
zir = ZeroInflatedRegressor(
classifier=SVC(),
regressor=LinearRegression()
)
zir.fit(scaled_train, y_train.values.ravel())
predictions = zir.predict(scaled_test)
I believe my problem is exactly in this part of the code below, where I am appending only 0s and 1s. However, how do I make predictions for the other counts? This is what I'm not able to understand.
trashold = .5
preds =[]
for val in predictions:
if val>trashold:
preds.append(1)
else:
preds.append(0)
accuracy_score(y_test,preds)
0.9976208095270875
Even in my classification_report I have an alert precisely because I don't have samples for the other labels. How to fix such a problem?
print(classification_report(y_test,preds))
Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
precision recall f1-score support
0.0 1.00 1.00 1.00 916493
1.0 0.86 0.97 0.91 11398
2.0 0.00 0.00 0.00 1498
3.0 0.00 0.00 0.00 231
4.0 0.00 0.00 0.00 73
5.0 0.00 0.00 0.00 22
6.0 0.00 0.00 0.00 5
7.0 0.00 0.00 0.00 5
8.0 0.00 0.00 0.00 2
9.0 0.00 0.00 0.00 1
accuracy 1.00 929728
macro avg 0.19 0.20 0.19 929728
weighted avg 1.00 1.00 1.00 929728
To illustrate my problem, I have the following image. Which shows that 0 values are reasonably well predicted, but from 1 I don't have any predictions anymore.
plt.clf()
plt.hist([y_train.values, preds], log=True)
plt.legend(('orig','pred'))
plt.show()
Any idea how I can improve this model and have predictions for all classes? Thank you very much in advance.
EDIT 1 - Based on Alex Comments
I tried to implement Alex's suggestions on a subset of the data.
df_count = df.groupby(['count']).size().to_frame(name = 'size').reset_index()
df_count
count size
0.0 859225
1.0 9208
2.0 1224
3.0 180
4.0 59
5.0 3
6.0 3
7.0 2
8.0 1
zir = ZeroInflatedRegressor(
classifier=SVC(),
regressor=LinearRegression()
)
zir.fit(scaled_train, y_train.values.ravel())
predictions = zir.predict(scaled_test)
As indicated, I switched the assessment measure to the MSE
mean_squared_error(y_test['count'].values,predictions)
0.0033405170885248804
I made a new plot of this data subset, which demonstrates that the model has some success in predicting when there are counts of up to 3. for counts greater than 3, as the sample is very small, the model has difficulty in making predictions.
plt.clf()
plt.hist([y_test['count'].values, predictions], log=True)
plt.legend(('real','predction'))
plt.show()
I am working on a churn classification with 3 classes 0, 1,2 but want to optimize class 0 and 1 for recall, does that mean sklearn needs to take classes 0 & 1 to be the positive classes. How can I explicitly mention for which class do I want to optimise recall , if that is not possible should I consider renaming the classes in an ascending order so that 1, 2 are default positive?
precision recall f1-score support
0 0.71 0.18 0.28 2611
1 0.57 0.54 0.56 5872
2 0.70 0.88 0.78 8913
accuracy 0.66 17396
macro avg 0.66 0.53 0.54 17396
weighted avg 0.66 0.66 0.63 17396
Here is the code I am using for reference (although I need more of an understanding of how to optimize for recall for only 0, 1 class here)
param_test1={'learning_rate':(0.05,0.1),'max_depth':(3,5)}
estimator=GridSearchCV(estimator=GradientBoostingClassifier(loss='deviance',subsample=0.8,random_state=10,
n_estimators=200),param_grid=param_test1,cv=2, refit='recall_score')
estimator.fit(df[predictors],df[target])
I want to pass the parameters in my def to inputs in pandas loc but I am not sure how to do so, as loc requires defined labels as inputs. Or is there any other way I can perform Excel INDEX MATCH equivalent in Python but not using loc? Many thanks!
Below please find my code:
def get_correl_diff_tenor(p1, p2):
correl = IRCorrMatrix.loc['p1', 'p2']
return correl
p1 and p2 in loc['p1', 'p2'] refer to the tenor pairs for calling the corresponding correlation value in the matrix below.
IRCorrMatrix is shown below, which is a correlation matrix defined by tenor pairs.
2w 1m 3m 6m 1y
Tenor
2w 1.00 0.73 0.64 0.57 0.44
1m 0.73 1.00 0.78 0.67 0.50
3m 0.64 0.78 1.00 0.85 0.66
6m 0.57 0.67 0.85 1.00 0.81
1y 0.44 0.50 0.66 0.81 1.00
IIUC remove '' from 'p1', 'p2' for pass variables from function:
IRCorrMatrix.loc[p1, p2]
I called describe on one column of a dataframe and ended up with the following output,
count 1.048575e+06
mean 8.232821e+01
std 2.859016e+02
min 0.000000e+00
25% 3.000000e+00
50% 1.400000e+01
75% 6.000000e+01
max 8.599700e+04
What parameter do I pass to get meaningful integer values. What I mean is when I check the SQL count its about 43 million. All the other values are also different.Can someone help me understand what this conversion means and how do I get float rounded to 2 decimal places. I'm new to Pandas.
You can directly use round() and pass the number of decimals you want as argument
# importing pandas as pd
import pandas as pd
# importing numpy as np
import numpy as np
# setting the seed to create the dataframe
np.random.seed(25)
# Creating a 5 * 4 dataframe
df = pd.DataFrame(np.random.random([5, 4]), columns =["A", "B", "C", "D"])
# rounding describe
df.describe().round(2)
A B C D
count 5.00 5.00 5.00 5.00
mean 0.52 0.47 0.38 0.42
std 0.21 0.23 0.19 0.29
min 0.33 0.12 0.16 0.11
25% 0.41 0.37 0.28 0.19
50% 0.45 0.58 0.37 0.44
75% 0.56 0.59 0.40 0.52
max 0.87 0.70 0.68 0.84
DOCS
There are two ways to control the output of pandas, either by controlling it or by using apply.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
df['X'].describe().apply("{0:.5f}".format)
I have a dataframe 16k records and multiple groups of countries and other fields. I have produced an initial output of the a data that looks like the snipit below. Now i need to do some data cleansing, manipulating, remove skews or outliers and replace it with a value based on certain rules.
i.e. on the below how could i identify the skewed points (any value greater than 1) and replace them with the average of the next two records or previous record if there no later records.(in that group)
So in the dataframe below I would like to replace Bill%4 for IT week1 of 1.21 with the average of week2 and week3 for IT so it is 0.81.
any tricks for this?
Country Week Bill%1 Bill%2 Bill%3 Bill%4 Bill%5 Bill%6
IT week1 0.94 0.88 0.85 1.21 0.77 0.75
IT week2 0.93 0.88 1.25 0.80 0.77 0.72
IT week3 0.94 1.33 0.85 0.82 0.76 0.76
IT week4 1.39 0.89 0.86 0.80 0.80 0.76
FR week1 0.92 0.86 0.82 1.18 0.75 0.73
FR week2 0.91 0.86 1.22 0.78 0.75 0.71
FR week3 0.92 1.29 0.83 0.80 0.75 0.75
FR week4 1.35 0.87 0.84 0.78 0.78 0.74
I don't know of any built-ins to do this, but you should be able to customize this to meet your needs, no?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,5),columns=list('ABCDE'))
df.index = list('abcdeflght')
# Define cutoff value
cutoff = 0.90
for col in df.columns:
# Identify index locations above cutoff
outliers = df[col][ df[col]>cutoff ]
# Browse through outliers and average according to index location
for idx in outliers.index:
# Get index location
loc = df.index.get_loc(idx)
# If not one of last two values in dataframe
if loc<df.shape[0]-2:
df[col][loc] = np.mean( df[col][loc+1:loc+3] )
else:
df[col][loc] = np.mean( df[col][loc-3:loc-1] )