Discretizing Continuous Data into Columns for Confusion Matrix - python

The goal is to create a confusion matrix for a chosen model column and compare it with the true column, by discretizing the values into regions.
I have a large dataset where I have constructed a large number of models and created predictions (modelx), alongside the true values (true).
The values of both the model columns and the true column are between [0, 1]. I want to create a function where I can specify region breakpoints (e.g. [0, 0.25, 0.5, 0.75, 1]) and discretize a chosen model (a column) into labels (binary indicators, or categorical strings if those work better) according to which region each value falls into.
With the breakpoints above there are four regions, and from there I would like to create a confusion matrix for the chosen model.

Here's one solution - use pd.cut:
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import plotly.express as px

# Example data: six model columns plus a "true" column, all values in [0, 1]
df = pd.DataFrame(np.random.random((100, 7)), columns=[j for j in range(6)] + ["true"])

# Discretize every column into the four regions, labelled l/m/h/s
df_binned = pd.DataFrame()
for col in df.columns:
    df_binned[col] = pd.cut(df[col], bins=[0, 0.25, 0.5, 0.75, 1.0], labels=list("lmhs"))

# Generate the confusion matrix for model column 0 against the true column
cm = confusion_matrix(y_true=df_binned.true, y_pred=df_binned[0])

# Plot
px.imshow(cm).show()
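If you also want the region labels on the axes, px.imshow accepts tick values for x and y (a minimal sketch assuming the cm from above; recent plotly versions support these arguments):

labels = list("lmhs")
px.imshow(cm, x=labels, y=labels,
          labels=dict(x="predicted", y="true")).show()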

Related

How can I drop low correlated features

I am making a preprocessing script for my LSTM training. My csv contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped without any effect on training.
Right now I am dropping such features manually using pandas.
I want to write code which can drop such features automatically.
I wrote code to visualize a heat map and the correlations this way:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# I am making a class, so this part is from preprocessing.
# self.data is a DataFrame which contains all the csv data
def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    for column in columns:
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')
This gives me a good view of my features and their relationships with each other.
Now I want to drop the columns which are not important.
Let's say those with correlation less than 0.4.
How can I apply this logic in my code?
Here is an approach to remove variables with a correlation coefficient below some threshold:
import pandas as pd
from scipy.stats import spearmanr
data = pd.DataFrame([{"A":1, "B":2, "C":3},{"A":2, "B":3, "C":1},{"A":3, "B":4, "C":0},{"A":4, "B":4, "C":1},{"A":5, "B":6, "C":2}])
targetVar = "A"
corr_threshold = 0.4
corr = spearmanr(data)
# corr[0] is the full correlation matrix; column 0 corresponds to the target
# variable "A" because it is the first column of the dataframe
corrSeries = pd.Series(corr[0][:, 0], index=data.columns)  # Series with column names and their correlation coefficients
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]  # apply the threshold
vars_to_keep = list(corrSeries.index.values)  # list of variables to keep
vars_to_keep.append(targetVar)  # add the target variable back in
data2 = data[vars_to_keep]
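As a follow-up sketch (not part of the original answer), pandas can produce the same correlation series directly, which also works when the target variable is not the first column; use .abs() if strong negative correlations should be kept as well:

corrSeries = data.corr(method='spearman')[targetVar]
keep = corrSeries[(corrSeries.index != targetVar) & (corrSeries.abs() > corr_threshold)].index.tolist()
data2 = data[keep + [targetVar]]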

How to set a seaborn color map in an arbitrary range?

I am creating a heatmap for the correlations between items.
sns.heatmap(df_corr, fmt=".2g", cmap='vlag', cbar=True, annot=True)
I choose vlag as it has red for high values, blue for low values, and white for the middle.
Seaborn automatically sets red for the highest value and blue for the lowest value in the dataframe.
However, as I am tracking Pearson's correlation, the value range is between -1 and 1, so I would like 1 to be represented by red, -1 by blue, and 0 by white.
[Image: how the result currently looks - the color scale spans only the data's own min and max]
[Image: how it should be - the color scale fixed from -1 to 1]*
*(Of course this was generated by "cheating" - setting -1 as value(s) to force the range to be from -1 to 1; I want to set this range without warping my data)
Use vmin=-1 and vmax=1:
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
# Sample data well inside (-1, 1); vmin/vmax pin the color scale to the full range
data = np.random.uniform(low=-0.5, high=0.5, size=(5, 5))
hm = sn.heatmap(data=data, cmap='vlag', annot=True, vmin=-1, vmax=1)
plt.show()
Here is an unorthodox solution. You can "standardize" your data to the range -1 to 1. Even though the theoretical range of the Pearson coefficient is [-1, 1], strong negative correlations are not present in your dataset.
So, you can create another dataframe containing the data rescaled so that its max is 1 and its min is -1. You can then plot this dataframe to get the desired effect. The advantage of this procedure is that the technique generalizes to pretty much any dataframe (not verified though).
Here is the code -
import pandas as pd
import numpy as np

# df is assumed to be the correlation dataframe to plot
# Setting the target scale of the data
scale_minimum = -1
scale_maximum = 1
scale_range = scale_maximum - scale_minimum

# Applying the scaling
df_minimum, df_maximum = df.min(), df.max()   # range of the current dataframe
df_range = df_maximum - df_minimum            # the range of the data
df = (df - df_minimum) / df_range             # scaling between 0 and 1
df_scaled = df * scale_range + scale_minimum  # scaling between -1 and 1
Hope this solves your problem.
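For completeness, a usage sketch (my addition, assuming the df_scaled from above and the same vlag colormap as the question):

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df_scaled, fmt=".2g", cmap='vlag', cbar=True, annot=True)
plt.show()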

Selecting specific features based on correlation values

I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation and only keep the features that correlate with SalePrice from 0.5 to 0.9. I tried to use this function to filter some of it, but I am only removing the correlation values that are above .9.
How would I update this function to only keep those specific features that I need to generate a correlation heat map?
data = train
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i+1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd

data = pd.read_csv('train.csv')
col = data.columns
c = [i for i in col if data[i].dtypes == 'int64' or data[i].dtypes == 'float64']  # keep only numeric columns (drop dtype == object)
main_col = ['SalePrice']  # column against which we compare correlations

corr_saleprice = data[c].corr().filter(main_col).drop(main_col)
c1 = (corr_saleprice['SalePrice'] >= 0.5) & (corr_saleprice['SalePrice'] <= 0.9)
c2 = (corr_saleprice['SalePrice'] >= -0.9) & (corr_saleprice['SalePrice'] <= -0.5)
req_index = list(corr_saleprice[c1 | c2].index)  # select columns meeting the criteria
#req_index.append('SalePrice')  # if you want the SalePrice column in your final dataframe too, uncomment this line
data = data[req_index]
data
Also, using for loops is not very efficient; a direct implementation like this is preferable. I hope this is what you want!
For generating the heatmap, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

a = data.corr()
mask = np.triu(np.ones_like(a, dtype=bool))  # mask the upper triangle (np.bool is deprecated, plain bool works)
plt.figure(figsize=(10, 10))
_ = sns.heatmap(a, cmap=sns.diverging_palette(250, 20, n=250), square=True, mask=mask, annot=True, center=0.5)

Find a pivot point to which minimize the "overlap" of two python lists

I have two Python lists of float numbers and I want to find a pivot point which minimizes the "overlap" of these two lists.
The problem is illustrated in the figure below, where I would like to get the cross point of the two curves (each curve can be imagined as the histogram plot of a list), and the "overlap" is defined as the green area.
For example, I have two lists [2.1, 3.5, 3.8, 3.8, 3.8, 4.2] and [3.7, 4.1, 4.1, 4.1, 5.0]. A good pivot point could be 4.0 (or any number between 3.8 and 4.1), where the "overlap" corresponds to only one number (4.2) from the 1st list and one number (3.7) from the 2nd list.
Apparently the set() & set() method doesn't apply here, as the numbers won't be identical across the two lists. The only method I came up with is a brute-force search, starting from 4.2 and ending at 3.7, which is not ideal.
Following the comments, I need to separate this into two questions:
1) What's the Python solution to find such a pivot point for the two lists?
2) Much better, and maybe too much to ask here, but how do I get a statistically rigorous solution that minimizes the separation of the two sets of values? I am not sure if I can assume a Gaussian distribution of the values, but let's assume we can if that helps to formulate a solution.
We have two lists a and b. We are looking for a value x for which the cumulative probability of higher values in a is equal to the cumulative probability of lower values in b.
Formally:
1 − CDF(a, x) == CDF(b, x)
Alternatively:
1 − CDF(a, x) − CDF(b, x) == 0
Let's implement it in Python.
import itertools
import random

def boundary(a, b):
    """Return interval of boundary values."""
    # Merge both lists, pairing each value with its probability mass,
    # and sort by value
    cc = sorted(itertools.chain(
        ((x, 1/len(a)) for x in a),
        ((x, 1/len(b)) for x in b)))
    # Mark all values with 1 - CDF(a, x) - CDF(b, x)
    pp = [(x[0], 1 - sum(z[1] for z in cc[:i+1])) for i, x in enumerate(cc)]
    # Find the index of the value closest to zero
    m = min(enumerate(pp), key=lambda x: abs(x[1][1]))
    # Return the range of values
    index = m[0]
    return pp[index][0], pp[index+1][0]
Test simple cases:
print(boundary([1, 2], [3, 4])) # -> (2, 3)
print(boundary([1], [3])) # -> (1, 3)
print(boundary([1, 3], [2, 4])) # -> (2, 3)
And test a more complicated case:
a = sorted(random.gauss(0, 1) for _ in range(300))
b = sorted(random.gauss(1, 1) for _ in range(200))
print(boundary(a, b)) # -> approx (0.5, 0.5 + Δ)
Please note that the algorithm correctly processes lists of different lengths.
And with slight performance optimizations it can successfully handle lists with millions of items.
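For instance, the quadratic recomputation of sum(cc[:i+1]) can be replaced by a running total with itertools.accumulate, which makes the function O(n log n) overall. A sketch of that optimization (my addition, under the same assumptions as the original):

import itertools

def boundary_fast(a, b):
    """Same as boundary(), but with a running cumulative sum."""
    cc = sorted(itertools.chain(
        ((x, 1/len(a)) for x in a),
        ((x, 1/len(b)) for x in b)))
    # The running total of probability mass is CDF(a, x) + CDF(b, x)
    cum = itertools.accumulate(w for _, w in cc)
    pp = [(x, 1 - c) for (x, _), c in zip(cc, cum)]
    index = min(range(len(pp)), key=lambda i: abs(pp[i][1]))
    return pp[index][0], pp[index + 1][0]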
One idea is to use a decision tree classifier to determine the best separation point.
Code
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier

# Setup data: label points from the first list 0 and points from the second list 1
df = pd.DataFrame({'Feature': [1.0, 2.1, 3.5, 4.2, 3.7, 4.1, 5.0], 'Label': [0, 0, 0, 0, 1, 1, 1]})
feature_cols = ['Feature']
X = df[feature_cols]  # features
y = df.Label          # target variable

# Create decision tree classifier object (use max_depth of 1 to have one boundary)
clf = DecisionTreeClassifier(max_depth=1)

# Train the decision tree classifier
clf = clf.fit(X, y)

# Find the decision boundary by creating test data in 0.1 steps from min to max
# (i.e. 1 to 5)
arr = np.arange(1, 5.1, 0.1)
test_set = pd.DataFrame({'Feature': arr})

# Predict on the grid so we can see where the boundary is created
y_pred = clf.predict(test_set)
indexes = np.where(y_pred > 0)  # all points with label 1
pivot_index = indexes[0][0]     # first point with label 1
pivot_value = arr[pivot_index]  # the pivot value
print(f'Pivot value: {pivot_value}')
Output
Pivot value: 3.7000000000000024
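A side note (my addition, not part of the answer): the fitted split can also be read directly from the tree instead of scanning a grid; scikit-learn exposes it on the tree_ attribute:

# The root node's learned split threshold (the midpoint the tree chose)
print(clf.tree_.threshold[0])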
If I understood correctly, the values you are looking for do not necessarily belong to the lists.
If that is the case, you can "artificially" resample your lists with decimal spacing between the min and max of the original lists, transform them to sets, and compute the intersection between them.
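A minimal sketch of that idea (my reading of the suggestion, with an assumed 0.1 spacing):

import numpy as np

a = [2.1, 3.5, 3.8, 3.8, 3.8, 4.2]
b = [3.7, 4.1, 4.1, 4.1, 5.0]

# Resample each list's range on a 0.1 grid and intersect the grids
grid_a = set(np.round(np.arange(min(a), max(a) + 0.1, 0.1), 1))
grid_b = set(np.round(np.arange(min(b), max(b) + 0.1, 0.1), 1))
overlap = sorted(grid_a & grid_b)   # the interval where the two ranges overlap
print(overlap[0], overlap[-1])      # approximately 3.7 and 4.2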

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method. I came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers while the last 20 are outliers. When you run fit_predict on X, you get either outlier (-1) or inlier (1) for each row in y_pred. So to get the predicted outliers, you need to take the entries where y_pred == -1 and get the corresponding values in X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into pairs and, wherever y == -1, collect the X values.
However, there are eight errors in the predictions (8 out of 220): -1 values in y_pred[:200] and 1 values in y_pred[200:]. Please be aware of these errors as well.
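A note on style (my addition, not from the answer): since X is a NumPy array, boolean indexing does the same selection in one line:

X_pred_outliers = X[y_pred == -1]   # rows of X predicted as outliers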
