Python pandas.dataframe.isin returning unexpected results - python

I have encountered this several times where I'm trying to filter a dataframe using a column from another dataframe. isin incorrectly returns true for every row. It is probably just a misunderstanding on my part as to how it should work. Why is it doing this, and is there a better way to code it?
#Read the data into a pandas dataframe
ar_data = pd.read_excel('~/data/Accounts-Receivable.xlsx')
ar_data.set_index('customerID', inplace=True)
#randomly select records for 70/30 train/test split
train = ar_data.sample(frac=.7, random_state = 1)
mask = ~ar_data.index.isin(list(train.index)) #why does this return False for every value?
test = ar_data[mask]
ar_data.shape #returns (2466, 11)
train.shape #(1726, 11)
test.shape #returns (0, 11). Should return 740 rows!
Example

I tried to execute you code with a sample DataFrame and it works:
import pandas as pd
ar_data = [[10,20],[11,2],[9,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (1,1)
The problem you may probably have is that the index you used is not a key, hence there are multiple lines with the same Customer_id.
In fact executing your code with duplicated indexes leads to the bug you encountered.
import pandas as pd
ar_data = [[10,20],[10,2],[10,3]]
df = pd.DataFrame(ar_data,columns=["1","2"])
df.set_index("1", inplace=True)
train = df.sample(frac=.7, random_state = 1)
mask = ~df.index.isin(list(train.index))
test = df[mask]
train.shape #shape = (2,1)
test.shape #shape = (0,1)
Anyways an easier and faster way to split your dataset would be:
from sklearn.model_selection import train_test_split
X = ar_data
y = ar_data
train, test, _, _ = train_test_split(X,y,test_size=0.3,random_state=1)
with that possibility, you can also split the features and the predictions with only one function, and it doesn't rely on the indexes.

Related

Spliting dataset to train and test in python

I have dataset whose Label is 0 or 1.
I want to divide my data into test and train sets.For this, I used the
train_test_split method from sklearn at first,
But I want to select the test data in such a way that 10% of them are from class 0 and 90% are from class 1.
How can I do this?
Refer to the official documentation sklearn.model_selection.train_test_split.
You want to specify the response variable with the stratify parameter when performing the split.
Stratification preserves the ratio of the class variable when the split is performed.
You should write your own function to do this,
One way to do this is select rows by index and shuffle it after take them.
Split your dataset in class 1 and class 0, then split as you want:
df_0 = df.loc[df.class == 0]
df_1 = df.loc[df.class == 1]
test_0, train_0 = train_test_split(df_0, 0.1)
test_1, train_1 = train_test_split(df_1, 0.9)
test = pd.concat((test_0, test_1),
axis = 1,
ignore_index = True).sample(1) # sample(1) is to shuffle the df
train = pd.concat((train_0, train_1),
axis = 1,
ignore_index = True).sample(1)

How to scale all columns except last column?

I'm using python 3.7.6.
I'm working on classification problem.
I want to scale my data frame (df) features columns.
The dataframe contains 56 columns (55 feature columns and the last column is the target column).
I want to scale the feature columns.
I'm doing it as follows:
y = df.iloc[:,-1]
target_name = df.columns[-1]
from FeatureScaling import feature_scaling
df = feature_scaling.scale(df.iloc[:,0:-1], standardize=False)
df[target_name] = y
but it seems not effective, because I need to recreate dataframe (add the target column to the scaling result).
Is there a way to scale just some columns without change the others, in effective way ?
(i.e the result from scale will contain the scaled columns and one column which is not scale)
Using index of columns for scaling or other pre-processing operations is not a very good idea as every time you create a new feature it breaks the code. Rather use column names. e.g.
using scikit-learn:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
features = [<featues to standardize>]
scalar = StandardScaler()
# the fit_transform ops returns a 2d numpy.array, we cast it to a pd.DataFrame
standardized_features = pd.DataFrame(scalar.fit_transform(df[features].copy()), columns = features)
old_shape = df.shape
# drop the unnormalized features from the dataframe
df.drop(features, axis = 1, inplace = True)
# join back the normalized features
df = pd.concat([df, standardized_features], axis= 1)
assert old_shape == df.shape, "something went wrong!"
or you can use a function like this if you don't prefer splitting and joining the data back.
import numpy as np
def normalize(x):
if np.std(x) == 0:
raise ValueError('Constant column')
return (x -np.mean(x)) / np.std(x)
for col in features:
df[col] = df[col].map(normalize)
You can slice the columns you want:
df.iloc[:, :-1] = feature_scaling.scale(df.iloc[:, :-1], standardize=False)

Replacing Manual Standardization with Standard Scaler Function

I want to replace the manual calculation of standardizing the monthly data with the StandardScaler package from sklearn. I tried the line of code below the commented out code, but I am receiving the following error.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
arr = pd.DataFrame(np.arange(1,21), columns=['Output'])
arr2 = pd.DataFrame(np.arange(10, 210, 10), columns=['Output2'])
index2 = pd.date_range('20180928 10:00am', periods=20, freq="W")
# index3 = pd.DataFrame(index2, columns=['Date'])
df2 = pd.concat([pd.DataFrame(index2, columns=['Date']), arr, arr2], axis=1)
print(df2)
cols = df2.columns[1:]
# df2_grouped = df2.groupby(['Date'])
df2.set_index('Date', inplace=True)
df2_grouped = df2.groupby(pd.Grouper(freq='M'))
for c in cols:
#df2[c] = df2_grouped[c].apply(lambda x: (x-x.mean()) / (x.std()))
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
print(df2)
ValueError: Expected 2D array, got 1D array instead:
array=[1.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message says that StandardScaler().fit_transform only accept a 2-D argument.
So you could replace:
df2[c] = df2_grouped[c].apply(lambda x: StandardScaler().fit_transform(x))
with:
from sklearn.preprocessing import scale
df2[c] = df2_grouped[c].transform(lambda x: scale(x.astype(float)))
as a workaround.
From sklearn.preprocessing.scale:
Standardize a dataset along any axis
Center to the mean and component wise scale to unit variance.
So it should work as a standard scaler.

Python - How to determine the feature / column names returned by Chi Squared test [duplicate]

I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:
Let's assume I would like to conduct the experiment selecting 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I receive a new dataframe without feature names (only index starting from 0 to 4), but I want to create a dataframe with the new selected features, in a way like this:
dataframe = pd.DataFrame(fit_transofrmed_features, columns=features_names)
My question is how to create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
Which returns an array of boolean values, where true values indices represent the column that should be selected in the original dataframe.
How should I use this boolean array with the array of all features names I can get via the method feature_names = list(features_dataframe.columns.values) ?
This doesn't require loops.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
For me this code works fine and is more 'pythonic':
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
You can do the following :
mask = select_k_best_classifier.get_support() #list of booleans
new_features = [] # The list of your K best features
for bool_val, feature in zip(mask, feature_names):
if bool_val:
new_features.append(feature)
Then change the name of your features:
dataframe = pd.DataFrame(fit_transofrmed_features, columns=new_features)
Following code will help you in finding top K features with their F-scores. Let, X is the pandas dataframe, whose columns are all the features and y is the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
#Suppose, we select 5 features with top 5 Fisher scores
selector = SelectKBest(f_classif, k = 5)
#New dataframe with the selected features for later use in the classifier. fit() method works too, if you want only the feature names and their corresponding scores
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
Select Best 10 feature according to chi2;
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get features with get_support()
f = KBest.get_support(1) #the most important features
Create new df called X_new;
X_new = X[X.columns[f]] # final features`
As of Scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_features_names_out())
There is an another alternative method, which ,however, is not fast as above solutions.
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
index=train.index,
columns= feature_cols)
selected_columns = selected_features.columns[selected_features.var() !=0]
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
Suppose that you want to choose 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(score_func=chi2, k = 10)
selector.fit_transform(X, y)
features_names = selector.feature_names_in_
print(features_names)

The easiest way for getting feature names after running SelectKBest in Scikit Learn

I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:
Let's assume I would like to conduct the experiment selecting 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I receive a new dataframe without feature names (only index starting from 0 to 4), but I want to create a dataframe with the new selected features, in a way like this:
dataframe = pd.DataFrame(fit_transofrmed_features, columns=features_names)
My question is how to create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
Which returns an array of boolean values, where true values indices represent the column that should be selected in the original dataframe.
How should I use this boolean array with the array of all features names I can get via the method feature_names = list(features_dataframe.columns.values) ?
This doesn't require loops.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
For me this code works fine and is more 'pythonic':
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
You can do the following :
mask = select_k_best_classifier.get_support() #list of booleans
new_features = [] # The list of your K best features
for bool_val, feature in zip(mask, feature_names):
if bool_val:
new_features.append(feature)
Then change the name of your features:
dataframe = pd.DataFrame(fit_transofrmed_features, columns=new_features)
Following code will help you in finding top K features with their F-scores. Let, X is the pandas dataframe, whose columns are all the features and y is the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
#Suppose, we select 5 features with top 5 Fisher scores
selector = SelectKBest(f_classif, k = 5)
#New dataframe with the selected features for later use in the classifier. fit() method works too, if you want only the feature names and their corresponding scores
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
Select Best 10 feature according to chi2;
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get features with get_support()
f = KBest.get_support(1) #the most important features
Create new df called X_new;
X_new = X[X.columns[f]] # final features`
As of Scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_features_names_out())
There is an another alternative method, which ,however, is not fast as above solutions.
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
index=train.index,
columns= feature_cols)
selected_columns = selected_features.columns[selected_features.var() !=0]
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
Suppose that you want to choose 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(score_func=chi2, k = 10)
selector.fit_transform(X, y)
features_names = selector.feature_names_in_
print(features_names)

Categories

Resources