I am following the book Building Machine Learning Systems with Python. After loading the dataset with sklearn I need to extract the indices of all samples belonging to setosa, but I am unable to do so, probably because I am not using a NumPy array. Can someone please help me extract the index numbers? Code below:
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']
for t, marker, c in zip(xrange(3), ">ox", "rgb"):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0], features[target == t, 1],
                marker=marker, c=c)
plength = features[:, 2]
# use numpy operations to get setosa features
is_setosa = (labels == 'setosa')
# This is the important step:
max_setosa = plength[is_setosa].max()
min_non_setosa = plength[~is_setosa].min()
print('Maximum of setosa: {0}.'.format(max_setosa))
print('Minimum of others: {0}.'.format(min_non_setosa))
Define labels before the problem line:
target_names = data['target_names']
labels = target_names[target]
Now these lines will work fine:
is_setosa = (labels == 'setosa')
setosa_petal_length = plength[is_setosa]
Extra.
The Bunch returned by sklearn (data = load_iris()) contains a target array with the numbers 0-2, one per row of features, encoding the species of each flower. Using that, you can extract all features belonging to setosa (where target equals 0) like this:
petal_length = features[:, 2]
setosa_petal_length = petal_length[target == 0]
Compare this with data['target_names'] and you will see that the two lines above solve your question. By the way, all arrays in the Bunch are NumPy ndarrays.
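Since the question asks for the index numbers themselves, here is a minimal sketch (mine, not from the answers above) that turns the boolean mask into integer row indices with np.flatnonzero:
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
target = data['target']
labels = data['target_names'][target]
# Boolean mask of the setosa samples
is_setosa = (labels == 'setosa')
# Integer row indices of all setosa samples (np.where(is_setosa)[0] is equivalent)
setosa_indices = np.flatnonzero(is_setosa)
print(setosa_indices)   # 0 .. 49 for the iris dataset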
Related
Let's say that I have two 1-D arrays with two different statistical distributions. Now I want to match both distributions, using one of them as the "target".
In the example I "shifted" one of the distributions using MinMaxScaler() from scikit-learn to match it with the other one, but I am sure I can achieve an "automatic" and "better" match with some API or some code.
In the example I have both arrays in the same DataFrame (and both have the same length), but I'd be very pleased if somebody knows a way to achieve this using two different DataFrames and/or two arrays with different lengths.
Thank you!!
CODE
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import plotly.figure_factory as ff
################## DATA ######################
np.random.seed(54)
crv = np.random.uniform(1,99,(1,100)).flatten()
np.random.seed(115)
crv_target = np.random.uniform(51,149,(1,100)).flatten()
# Create DataFrame
df = pd.DataFrame(data=[crv, crv_target]).T
df = df.rename(columns={0: "crv", 1: "crv_target"})
# Scaler
scale = MinMaxScaler(feature_range=(50,150))
df['crv_shifted'] = scale.fit_transform(X=df['crv'].values.reshape(-1, 1),y=df['crv_target'].values.reshape(-1, 1))
# Create distplot
data = [df['crv_shifted'],df['crv_target'],df['crv']]
labels = ['crv_shifted','crv_target','crv']
colors = ['#F8C471', '#22D2E6','#CD6155']
fig = ff.create_distplot(data, labels,show_hist=False,show_rug=False,colors=colors)
fig.show()
LINK TO PLOT
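One way to get an automatic match of the whole shape, not just the range, is quantile mapping: send every value of crv to the value sitting at the same empirical quantile of crv_target. Below is a minimal sketch under the assumption that crv and crv_target are the arrays built above; because np.interp does the lookup, the two samples may also have different lengths.
import numpy as np
# Empirical quantile (0..1) of each crv value within its own sample
ranks = np.argsort(np.argsort(crv))
crv_quantiles = ranks / (len(crv) - 1)
# Look up the same quantiles in the sorted target sample
target_quantiles = np.linspace(0, 1, len(crv_target))
crv_matched = np.interp(crv_quantiles, target_quantiles, np.sort(crv_target))
# crv_matched now has (approximately) the same distribution as crv_target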
I found some code on SO which seems to work quite well.
This code, directly below, produces the plot, also below.
from sklearn import datasets
from sklearn import cluster
import plotly
plotly.offline.init_notebook_mode()
iris = datasets.load_iris()
kmeans = cluster.KMeans(n_clusters=5, random_state=42).fit(iris.data[:,0:1])
data = [plotly.graph_objs.Scatter(x=iris.data[:,0],
y=iris.data[:,1],
mode='markers',
marker=dict(color=kmeans.labels_)
)]
plotly.offline.iplot(data)
Now, I make a simple substitution in the code, to point to my own data, like this.
from sklearn import datasets
from sklearn import cluster
import plotly
plotly.offline.init_notebook_mode()
x = df[['Spend']]
y = df[['Revenue']]
kmeans = cluster.KMeans(n_clusters=5, random_state=42).fit(x,y)
data = [plotly.graph_objs.Scatter(x=df[['Spend']],
y=df[['Revenue']],
mode='markers',
marker=dict(color=kmeans.labels_))]
plotly.offline.iplot(data)
That gives me this plot.
Here is my data frame.
# Import pandas library
import pandas as pd
# initialize list of lists
data = [[110,'CHASE CENTER',53901,8904,44997,4], [541,'METS STADIUM',57999,4921,53078,1], [538,'DEN BRONCOS',91015,9945,81070,1], [640,'LAMBEAU WI',76214,5773,70441,3], [619,'SAL AIRPORT',93000,8278,84722,5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Location', 'Location_Description', 'Revenue','Spend','Profit_Or_Loss','cluster_number'])
# print dataframe.
df
I must be missing something silly, but I don't see what it is.
You have a problem with the dimension:
# In the iris dataset
>>> iris.data[:,0].shape
(150,)
# Your data
>>> x.shape
(5, 1)
# You need to flatten your array
>>> x.values.flatten().shape
(5,)
For example:
from sklearn import datasets
from sklearn import cluster
import plotly
plotly.offline.init_notebook_mode()
x = df[['Spend']]
y = df[['Revenue']]
x_flat = x.values.flatten()
y_flat = y.values.flatten()
kmeans = cluster.KMeans(n_clusters=5, random_state=42).fit(x)
data = [plotly.graph_objs.Scatter(x=x_flat,
y=y_flat,
mode='markers',
marker=dict(color=kmeans.labels_))]
plotly.offline.iplot(data)
On the other hand, cluster.KMeans.fit expects a single feature matrix (the second array you are passing is ignored). You are going to have to combine them into something of shape (n_samples, n_features):
X = np.zeros((x_flat.shape[0], 2))
X[:, 0] = x_flat
X[:, 1] = y_flat
# X.shape -> (5, 2)
kmeans = cluster.KMeans(n_clusters=5, random_state=42).fit(X)
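As a shorter equivalent, assuming df really has the Spend and Revenue columns shown above, you can also build the (n_samples, 2) matrix straight from the DataFrame:
X = df[['Spend', 'Revenue']].values   # shape (5, 2)
kmeans = cluster.KMeans(n_clusters=5, random_state=42).fit(X)
Note that with only five rows, n_clusters=5 puts every point in its own cluster, so every marker in the plot gets its own color.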
I'm trying to match feature scores with column names using a Pipeline with SelectPercentile and RandomForestClassifier.
From my initial EDA, the feature importance analysis seems to make sense. Nevertheless, I'm not sure if I've done it correctly.
My main concern is that I'm recovering the original DataFrame column indices (and hence the original column names) via SelectPercentile.scores_ and assuming that these features, when ordered, are aligned with RandomForestClassifier.feature_importances_.
I haven't seen this done anywhere. Am I doing it right?
My Pipeline:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = df_delays_final_ml
features = tpot_data.drop('class', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['class'], random_state=None)
# Average CV score on the training set was: 0.9131840089838837
exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_classif, percentile=56),
    RandomForestClassifier(bootstrap=False, criterion="entropy", max_features=0.9000000000000001,
                           min_samples_leaf=9, min_samples_split=9, n_estimators=100)
)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
The Feature analysis bit:
nbr_of_features = exported_pipeline.named_steps['selectpercentile'].get_support().sum()
top_features = [tpot_data.columns.tolist()[i]
                for i in np.argsort(exported_pipeline.named_steps['selectpercentile'].scores_)[::-1]][0:nbr_of_features]
importances = exported_pipeline.named_steps['randomforestclassifier'].feature_importances_
std = np.std([tree.feature_importances_
              for tree in exported_pipeline.named_steps['randomforestclassifier'].estimators_], axis=0)
indices = np.argsort(importances)[::-1]
col_arr = []
for f in range(nbr_of_features):
    col_name = top_features[f]
    col_arr.insert(len(col_arr), col_name)
    print("%d. feature %d (%f) [%s]" % (f + 1, indices[f], importances[indices[f]], col_name))
plt.figure(figsize=(19, 6))
plt.title("Feature importances")
plt.bar(range(nbr_of_features), importances[indices], color="r", yerr=std[indices], align="center")
plt.xticks(range(nbr_of_features), col_arr, rotation=45)
plt.xlim([-1, 9])
plt.show()
And the result:
feature 0 (0.291853) [dew_point_celsius]
feature 3 (0.272066) [air_temperature_celcius]
feature 1 (0.114740) [minute_of_day]
feature 2 (0.084233) [course]
feature 5 (0.079613) [business_weekday]
feature 8 (0.069989) [precipitation_mm]
feature 6 (0.041912) [wind_gusts_10m_ms]
feature 7 (0.037132) [line_number]
feature 4 (0.008462) [wind_speed_10m_ms]
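Since the question is really about whether the two orderings line up, here is a minimal sketch (not from the post above) of a mapping that does not rely on the ordering of scores_ at all: SelectPercentile.get_support() returns a boolean mask over the original columns, and the selected columns reach the forest in their original order, so the mask can be paired with feature_importances_ directly. It assumes the exported_pipeline and training_features defined earlier:
# Column names that survived SelectPercentile, in the order the forest saw them
support_mask = exported_pipeline.named_steps['selectpercentile'].get_support()
selected_names = training_features.columns[support_mask]
# Pair each selected column with its importance and rank by importance
importances = exported_pipeline.named_steps['randomforestclassifier'].feature_importances_
ranked = sorted(zip(selected_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print('%s: %.6f' % (name, score))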
I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to try to find a better method, and I came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X data as a new column, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 rows are inliers while the last 20 are outliers. When you call fit_predict on X, you get either -1 (outlier) or 1 (inlier) for each row in y_pred. So to get the predicted outliers, you need to find the positions where y_pred == -1 and take the corresponding rows of X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into pairs and check whether the prediction is -1; if so, I collect the corresponding X values.
However, there are eight errors in the predictions (8 out of 220): -1 values in y_pred[:200] and 1 values in y_pred[200:]. Please be aware of these errors as well.
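Since X is a NumPy array here, the same selection can also be written with boolean indexing (a small sketch assuming the X and y_pred from the code above):
# Rows predicted as outliers (-1) and as inliers (1)
X_pred_outliers = X[y_pred == -1]
X_pred_inliers = X[y_pred == 1]
# Integer row indices of the predicted outliers, useful for looking up the original claims
outlier_indices = np.flatnonzero(y_pred == -1)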
I have performed a PCA analysis on my original dataset, and from the compressed dataset transformed by the PCA I have also selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling to identify which of the original features matter in the reduced dataset.
How do I find out which features are important and which are not among the remaining principal components after the dimensionality reduction?
Here is my code:
from sklearn.decomposition import PCA
pca = PCA(n_components=8)
pca.fit(scaledDataset)
projection = pca.transform(scaledDataset)
Furthermore, I also tried to run a clustering algorithm on the reduced dataset, but surprisingly for me the score is lower than on the original dataset. How is that possible?
First of all, I assume that you call features the variables and not the samples/observations. In this case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data.
Before the example, please note that the basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute values) of their coefficients (loadings). See my last paragraph after the plot for more details.
Overview:
PART 1: I explain how to check the importance of the features and how to plot a biplot.
PART 2: I explain how to check the importance of the features and how to save them into a pandas dataframe using the feature names.
PART 1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
#In general a good idea is to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
pca = PCA()
x_new = pca.fit_transform(X)
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()
#Call the function. Use only the 2 PCs.
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))
plt.show()
Visualize what's going on using the biplot
Now, the importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors (higher magnitude - higher importance)
Let's first see how much variance each PC explains.
pca.explained_variance_ratio_
[0.72770452, 0.23030523, 0.03683832, 0.00515193]
PC1 explains 72% and PC2 23%. Together, if we keep PC1 and PC2 only, they explain 95%.
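(A quick way to check these cumulative figures on the fitted pca from above is np.cumsum:)
print(np.cumsum(pca.explained_variance_ratio_))
# [0.72770452 0.95800975 0.99484807 1.        ]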
Now, let's find the most important features.
print(abs( pca.components_ ))
[[0.52237162 0.26335492 0.58125401 0.56561105]
[0.37231836 0.92555649 0.02109478 0.06541577]
[0.72101681 0.24203288 0.14089226 0.6338014 ]
[0.26199559 0.12413481 0.80115427 0.52354627]]
Here, pca.components_ has shape [n_components, n_features]. Thus, by looking at PC1 (the first principal component), which is the first row [0.52237162 0.26335492 0.58125401 0.56561105], we can conclude that features 1, 3 and 4 (or Var 1, 3 and 4 in the biplot) are the most important. This is also clearly visible from the biplot (that's why we often use this plot to summarize the information in a visual way).
To sum up, look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues. In sklearn the components are sorted by explained variance. The larger these absolute values are, the more a specific feature contributes to that principal component.
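As a small sketch of that last paragraph (assuming the fitted pca and the iris data from above), ranking the features of PC1 by the absolute value of their loadings looks like this:
loadings_pc1 = np.abs(pca.components_[0])   # absolute loadings on PC1
order = np.argsort(loadings_pc1)[::-1]      # feature indices, most important first
for idx in order:
    print(iris.feature_names[idx], loadings_pc1[idx])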
PART 2:
The important features are the ones that influence more the components and thus, have a large absolute value/score on the component.
To get the most important features on the PCs with names and save them into a pandas dataframe use this:
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)
# 10 samples with 5 features
train_features = np.random.rand(10,5)
model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)
# number of components
n_pcs= model.components_.shape[0]
# get the index of the most important feature on EACH component
# LIST COMPREHENSION HERE
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]
initial_feature_names = ['a','b','c','d','e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
# LIST COMPREHENSION HERE AGAIN
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
# build the dataframe
df = pd.DataFrame(dic.items())
This prints:
0 1
0 PC0 e
1 PC1 d
So for the first component (labelled PC0 here) the feature named e is the most important, and for the second (PC1) it is d.
Nice article as well here: https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e?source=friends_link&sk=65bf5440e444c24aff192fedf9f8b64f
The pca library contains this functionality.
pip install pca
A demonstration to extract the feature importance is as following:
# Import libraries
import numpy as np
import pandas as pd
from pca import pca
# Lets create a dataset with features that have decreasing variance.
# We want to extract feature f1 as most important, followed by f2 etc
f1=np.random.randint(0,100,250)
f2=np.random.randint(0,50,250)
f3=np.random.randint(0,25,250)
f4=np.random.randint(0,10,250)
f5=np.random.randint(0,5,250)
f6=np.random.randint(0,4,250)
f7=np.random.randint(0,3,250)
f8=np.random.randint(0,2,250)
f9=np.random.randint(0,1,250)
# Combine into dataframe
X = np.c_[f1,f2,f3,f4,f5,f6,f7,f8,f9]
X = pd.DataFrame(data=X, columns=['f1','f2','f3','f4','f5','f6','f7','f8','f9'])
# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)
# Print the top features. The results show that f1 is best, followed by f2 etc
print(out['topfeat'])
# PC feature
# 0 PC1 f1
# 1 PC2 f2
# 2 PC3 f3
# 3 PC4 f4
# 4 PC5 f5
# 5 PC6 f6
# 6 PC7 f7
# 7 PC8 f8
# 8 PC9 f9
Plot the explained variance
model.plot()
Make the biplot. It can be nicely seen that the feature with the most variance (f1) is almost horizontal in the plot, whereas the feature with the second most variance (f2) is almost vertical. This is expected because most of the variance is in f1, followed by f2, etc.
ax = model.biplot(n_feat=10, legend=False)
Biplot in 3d. Here we see the nice addition of the expected f3 in the plot in the z-direction.
ax = model.biplot3d(n_feat=10, legend=False)
# original_num_df is the original numeric dataframe
# pca is the fitted model
def create_importance_dataframe(pca, original_num_df):
    # Change the pca components ndarray to a dataframe
    importance_df = pd.DataFrame(pca.components_)
    # Assign the original column names
    importance_df.columns = original_num_df.columns
    # Change to absolute values
    importance_df = importance_df.apply(np.abs)
    # Transpose so that rows are features and columns are PCs
    importance_df = importance_df.transpose()
    # Change the column names again
    ## First get the number of PCs
    num_pcs = importance_df.shape[1]
    ## Generate the new column names
    new_columns = [f'PC{i}' for i in range(1, num_pcs + 1)]
    ## Now rename
    importance_df.columns = new_columns
    # Return the importance df
    return importance_df
# Call the function to create the importance df
importance_df = create_importance_dataframe(pca, original_num_df)
# Show the first few rows
display(importance_df.head())
# Sort depending on the PC of interest
## PC1 top 10 important features
pc1_top_10_features = importance_df['PC1'].sort_values(ascending=False)[:10]
print('\nPC1 top 10 features are:')
display(pc1_top_10_features)
## PC2 top 10 important features
pc2_top_10_features = importance_df['PC2'].sort_values(ascending=False)[:10]
print('\nPC2 top 10 features are:')
display(pc2_top_10_features)