mlxtend plot_decision_regions with model fit on Pandas DataFrame? - python

I'm a big fan of mlxtend's plot_decision_regions function, (http://rasbt.github.io/mlxtend/#examples , https://stackoverflow.com/a/43298736/1870832)
It accepts an X (just two columns at a time), y, and a (fitted) classifier clf object, and then provides a pretty awesome visualization of the relationship between model predictions, true y-values, and a pair of independent variables.
A couple restrictions:
X and y have to be numpy arrays, and clf needs to have a predict() method. Fair enough. My problem is that in my case, the classifier clf object I would like to visualize has already been fitted on a Pandas DataFrame...
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib
matplotlib.use('Agg')
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'Planned_End': np.random.uniform(low=-5, high=5, size=50),
'Actual_End': np.random.uniform(low=-1, high=1, size=50),
'Late': np.random.randint(low=0, high=3, size=50)}  # integers 0, 1, 2 (random_integers is deprecated)
)
# Fit a Classifier to the data
# This classifier is fit on the data as a Pandas DataFrame
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = xgb.XGBClassifier()
clf.fit(X, y)
So now when I try to use plot_decision_regions passing X/y as numpy arrays...
# Plot Decision Region using mlxtend's awesome plotting function
plot_decision_regions(X=X.values,
y=y.values,
clf=clf,
legend=2)
I (understandably) get an error that the model can't find the column names of the dataset it was trained on
ValueError: feature_names mismatch: ['Planned_End', 'Actual_End'] ['f0', 'f1']
expected Planned_End, Actual_End in input data
training data did not have the following fields: f1, f0
In my actual case, it would be a big deal to avoid training our model on Pandas DataFrames. Is there a way to still produce decision_regions plots for a classifier trained on a Pandas DataFrame?

Try to change:
X = df[['Planned_End', 'Actual_End']].values
y = df['Late'].values
and proceed to:
clf = xgb.XGBClassifier()
clf.fit(X, y)
plot_decision_regions(X=X,
y=y,
clf=clf,
legend=2)
Alternatively, keep X and y as they are and both fit and plot using X.values and y.values.
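If refitting on arrays is not an option, one possible workaround is to wrap the already fitted model in a small adapter that rebuilds the DataFrame before calling predict(). This is only a sketch under assumptions: DataFrameAdapter and ColumnCheckingModel are hypothetical names, and the stand-in model merely imitates a classifier that rejects input without the training column names.

```python
import numpy as np
import pandas as pd


class DataFrameAdapter:
    """Wrap a classifier fitted on a DataFrame so it accepts plain arrays."""

    def __init__(self, clf, columns):
        self.clf = clf
        self.columns = list(columns)

    def predict(self, X):
        # Rebuild the column layout the wrapped model was trained on.
        frame = pd.DataFrame(np.asarray(X), columns=self.columns)
        return np.asarray(self.clf.predict(frame))


class ColumnCheckingModel:
    """Hypothetical stand-in for a model that insists on named columns."""

    def fit(self, X, y):
        self.columns_ = list(X.columns)
        return self

    def predict(self, X):
        if list(getattr(X, "columns", [])) != self.columns_:
            raise ValueError("feature_names mismatch")
        # Toy rule: predict class 1 when the first feature is positive.
        return (X[self.columns_[0]] > 0).astype(int).to_numpy()


df = pd.DataFrame({"Planned_End": [-1.0, 2.0, 3.0],
                   "Actual_End": [0.5, -0.5, 0.1]})
model = ColumnCheckingModel().fit(df, [0, 1, 1])

adapter = DataFrameAdapter(model, df.columns)
preds = adapter.predict(df.to_numpy())  # plain array in, predictions out
```

With the real objects from the question, the call would presumably become plot_decision_regions(X=X.values, y=y.values, clf=DataFrameAdapter(clf, X.columns), legend=2), provided the plotting function needs nothing from clf beyond predict().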

Related

Feature importance using gridsearchcv for logistic regression

I've trained a logistic regression model like this:
reg = LogisticRegression(random_state = 40)
cvreg = GridSearchCV(reg, param_grid={'C':[0.05,0.1,0.5],
'penalty':['none','l1','l2'],
'solver':['saga']},
cv = 5)
cvreg.fit(X_train, y_train)
Now to show the feature's importance I've tried this code, but I don't get the names of the coefficients in the plot:
from matplotlib import pyplot
importance = cvreg.best_estimator_.coef_[0]
pyplot.bar([x for x in range(len(importance))], importance)
pyplot.show()
Obviously, the plot isn't very informative. How do I add the names of the coefficients to the x-axis?
The importance of the coeff is:
cvreg.best_estimator_.coef_
array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])
The coefficients correspond to the columns of X_train, so pass in the X_train names instead of range(len(importance)).
Assuming X_train is a pandas dataframe:
import matplotlib.pyplot as plt
features = X_train.columns
importance = cvreg.best_estimator_.coef_[0]
plt.bar(features, importance)
plt.show()
Note that if X_train is just a numpy array without column names, you will have to define the features list based on your own data dictionary.
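As a small illustration of pairing names with coefficients, the sketch below reuses the coefficient values printed in the question and attaches a hand-written feature list. The names feat_a to feat_d are placeholders, since the real column names are not shown.

```python
import numpy as np
import pandas as pd

# Coefficients as printed in the question (shape: 1 x n_features).
coef = np.array([[1.10303023e+00, 7.48816905e-01, 4.27705027e-04, 6.01404570e-01]])

# Placeholder names; with a DataFrame these would come from X_train.columns.
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]

# A Series keeps each coefficient attached to its feature name.
importance = pd.Series(coef[0], index=feature_names)

# Order by absolute size so the most influential features come first.
order = importance.abs().sort_values(ascending=False).index
ranked = importance.reindex(order)
```

Calling ranked.plot.bar() (which needs matplotlib) should then draw the bars with the feature names on the x-axis.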

LinearSVC: Equation of a straight line that separates two classes from a scatterplot graph and pandas DataFrame

I'm trying to create a straight line that separates two classes. I'm using a pandas DataFrame with a scatterplot.
Here is my code before I get you into my problem:
Libraries:
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay
from scipy.io import arff
Data:
arquivo_arff = arff.loadarff(r"/content/Rice_MSC_Dataset.arff")
dados = pd.DataFrame(arquivo_arff[0])
Filter:
dados = dados[['MINOR_AXIS', 'MAJOR_AXIS', 'CLASS']]
Another filter:
dados = dados[dados['CLASS'].isin([b"Arborio", b"Ipsala"])]
Graph with two parameters:
sns.scatterplot(
data=dados,
x="MINOR_AXIS",
y="MAJOR_AXIS",
hue="CLASS")
plt.show()
My problem is here, when I use LinearSVC to find the parameters and coefficients of my equation:
model = LinearSVC()
model.fit(dados.drop('CLASS', axis=1), dados['CLASS'])
a, b = model.coef_[0]
d = model.intercept_[0]
print('a:', a)
print('b:', b)
print('d:', d)
You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
I didn't understand that error very well. Is there any way I can fix this in my code?
The documentation for MultiLabelBinarizer has some good examples for specific uses, but a general workflow for sklearn transformers is:
Split data into features and labels
X = dados.drop('CLASS', axis=1)
y = dados['CLASS']
#optionally, use train_test_split to split data into training and validation sets
#X_train,X_test,y_train,y_test=train_test_split(X,y)
Do transformations on input and target data
mb = MultiLabelBinarizer()
mb.fit(y)
mb.transform(y)
#can also be done in one step with mb.fit_transform(y)
#if using train_test_split: mb.fit_transform(y_train); mb.transform(y_test)
Fit your model
model = LinearSVC()
model.fit(X,y) #or model.fit(X_train,y_train) if using training and validation sets
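Another possible reading of the error, offered as an assumption because the data file is not available here: the b"Arborio" and b"Ipsala" values in the question suggest that arff.loadarff returned the class labels as bytes objects, and scikit-learn may treat each bytes value as a sequence of labels. Decoding the CLASS column to ordinary strings before fitting is a common workaround. The measurements below are made up for illustration.

```python
import pandas as pd

# Made-up rows mimicking the structure of the question's DataFrame.
dados = pd.DataFrame({
    "MINOR_AXIS": [70.1, 75.3, 120.4, 118.9],
    "MAJOR_AXIS": [250.0, 248.2, 410.7, 405.1],
    "CLASS": [b"Arborio", b"Arborio", b"Ipsala", b"Ipsala"],
})

# Turn bytes labels such as b"Arborio" into plain str labels.
dados["CLASS"] = dados["CLASS"].str.decode("utf-8")

labels = sorted(dados["CLASS"].unique())
```

After decoding, model.fit(dados.drop('CLASS', axis=1), dados['CLASS']) can be tried again with the string labels.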

ValueError: Expected 2D array, got 1D array instead: array=[-1]

Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to convert my 1-D data to 2-D as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science so if you can explain a bit then it would be real helpful
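x1_train and x1_test are pandas Series objects, whereas reshape() is a method of numpy arrays. Scikit-learn also expects the features as a 2-D array with one row per sample, which is why the original fit call failed on a 1-D Series.
Convert each Series to a numpy array first:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)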
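To make the shapes concrete, here is a minimal sketch with made-up numbers (only the median_income column name comes from the question). It shows three ways to get a single feature into the 2-D form that scikit-learn estimators expect.

```python
import pandas as pd

# Made-up stand-in for x_train from the question.
x_train = pd.DataFrame({
    "median_income": [1.5, 3.2, 4.8],
    "households": [100, 250, 80],
})

as_series = x_train["median_income"]            # 1-D: shape (3,)
as_array = as_series.to_numpy().reshape(-1, 1)  # 2-D numpy array: shape (3, 1)
as_frame = x_train[["median_income"]]           # 2-D DataFrame via double brackets
via_to_frame = as_series.to_frame()             # 2-D DataFrame from the Series

print(as_series.shape, as_array.shape, as_frame.shape, via_to_frame.shape)
```

Selecting with double brackets keeps the column name attached, which can be convenient when the fitted model is later used on DataFrames again.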

Identifying arrays with a band structure

I would like to identify arrays with a band like structure (first image) as compared to a more homogeneous structure shown in the homogenous image.
I have so far used some skewness and RMS techniques to test for this but it doesn't work well if the bands are evenly spaced. Are there any more refined ways of identifying such arrays in Python?
Try sns.pairplot.
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset: wages
# We fetch the data from OpenML. Note that setting the parameter as_frame to True will retrieve the data as a pandas dataframe.
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# Then, we identify features X and targets y: the column WAGE is our target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
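As a different angle from the pair plot above, and purely as an assumption about what "band structure" means here (whole rows or columns sharing similar values), one could compare the variance of the row or column means with what unstructured noise would give. band_score is a hypothetical helper, not a library function, and the threshold for calling an array "banded" would have to be chosen on real data.

```python
import numpy as np


def band_score(arr):
    """Ratio of line-mean variance to the value expected for unstructured noise.

    Returns the larger of the row-wise and column-wise ratios. Values near 1
    suggest no banding; much larger values suggest bands along that axis.
    """
    arr = np.asarray(arr, dtype=float)
    total_var = arr.var()
    if total_var == 0:
        return 0.0  # a constant array has no bands
    n_rows, n_cols = arr.shape
    # A row mean averages n_cols values, so for independent noise its
    # variance is roughly total_var / n_cols (and likewise for columns).
    row_ratio = arr.mean(axis=1).var() / (total_var / n_cols)
    col_ratio = arr.mean(axis=0).var() / (total_var / n_rows)
    return max(row_ratio, col_ratio)


# Evenly spaced horizontal bands: rows alternate between 0 and 1.
banded = np.repeat(np.array([0.0, 1.0] * 3)[:, None], 8, axis=1)

# Unstructured noise as the homogeneous comparison case.
homogeneous = np.random.default_rng(0).normal(size=(50, 50))

banded_score = band_score(banded)
homogeneous_score = band_score(homogeneous)
```

Because the score depends on row and column averages rather than on the overall value distribution, evenly spaced bands still stand out, which is the case the question says skewness handles poorly.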

How to get predicted values along with test data, and visualize actual vs predicted?

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
data = pd.read_csv('student_selection.csv')
x = data[['Average','Pass','Division','Domicile']]
y = data[['Selected']]
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=1,random_state=0)
ppn = Perceptron(eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, random_state=0)
ppn.fit(x_train, y_train)
y_pred = ppn.predict(x_train)
x_train['Predicted'] = pd.Series(y_pred)
How can I see the actual vs. predicted values as a table, along with a plot? I am adding the predictions to x_train, but I am unable to merge them with the actual data to see the deviation.
How to see the actual vs predicted as a table and along with a plot?
Just run:
y_predict = ppn.predict(x)
data['y_predict'] = y_predict
This adds the predictions as a column in your DataFrame. If you want to plot them, you can use:
import matplotlib.pyplot as plt
plt.scatter(data['Selected'], data['y_predict'])
plt.show()
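When predictions are made on a shuffled subset such as x_train, wrapping them in pd.Series gives them a fresh 0..n-1 index, so pandas aligns them against the wrong rows. The sketch below uses made-up labels and stand-in predictions (no model is fitted) to show the effect and one way to build an aligned comparison table.

```python
import numpy as np
import pandas as pd

# Made-up actual labels with a shuffled index, as train_test_split produces.
actual = pd.Series([1, 0, 1, 0], index=[7, 2, 9, 4], name="Selected")

# Stand-in for the array returned by ppn.predict(x_train).
y_pred = np.array([1, 1, 1, 0])

# Problem: pd.Series(y_pred) is indexed 0..3, so only label 2 lines up.
misaligned = pd.DataFrame({"Actual": actual})
misaligned["Predicted"] = pd.Series(y_pred)

# Fix: attach the predictions positionally and reuse the original index.
comparison = pd.DataFrame(
    {"Actual": actual.to_numpy(), "Predicted": y_pred},
    index=actual.index,
)
comparison["Match"] = comparison["Actual"] == comparison["Predicted"]
print(comparison)
```

Because comparison keeps the original index, it can be joined back to the full data with data.join(comparison) if the other columns are needed next to the predictions.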
