Python SciKitLearn and Pandas categoric data

I'm working on multivariable regression from a csv, predicting crop performance based on multiple factors. Some of my columns are numerical and meaningful. Others are numerical and categorical, or strings and categorical (for instance, crop variety, or plot code or whatever.) How do I teach Python to use them? I've found One Hot Encoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')
df.drop(df[df['LabeledDataColumn'].isnull()].index.tolist(),inplace=True)
scale = StandardScaler()
pd.options.mode.chained_assignment = None # default='warn'
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']]
y = df['LabeledDataColumn']
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].to_numpy())
#print (X)
est = sm.OLS(y, X).fit()
est.summary()

You could use the get_dummies function that pandas provides to convert the categorical values into indicator (dummy) columns.
Something like this:
# dtype=float keeps the dummy columns numeric (pandas 2.x returns booleans by default)
predictor = pd.concat([data.get(['numerical_column_1', 'numerical_column_2', 'label']),
                       pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1', dtype=float),
                       pd.get_dummies(data['categorical_column2'], prefix='categorical_col2', dtype=float)],
                      axis=1)
then you could get the outcome/label column by doing
outcome = predictor['label']
del predictor['label']
Then call the model on the data doing
est = sm.OLS(outcome, predictor).fit()

Related

Feature selection using mixed data types

I am trying to create some code that gives weight to the most impactful features.
My dataframe contains both numerical and categorical data.
example data:
[Brand]  [Model]  [Car_price]  [...]  [Prime]
BMW      X1       40,000              300
The Y is the prime and X is all other columns.
I tried using the following:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
df = pd.read_csv(data, delimiter=";")
#df = df.dropna(axis=1)
array = df.values
X = array[:,(6,7,9,12,13,14,15,16,17,18,19,20,21,22,23,24,25,27,34,35,37,44,45,47,48,54,61,62)]
Y = array[:,51]
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
forest.fit(X, Y)
And get the following error: ValueError: could not convert string to float
I know there is a way to transform from string into numerical data, but was wondering if it is necessary. What fixes can I apply to get weighted features?

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
'Q1-2019': [9.05, 8.64, 6.3],
'Q2-2019': [8.94, 8.56, 7.09],
'Q3-2019': [8.86, 8.45, 7.09],
'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so, I want to preprocess it first by either standardizing or normalizing it.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column using the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)

If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As described in the ColumnTransformer documentation, each transformation is passed as a tuple containing:
a name for the step
the chosen transformer (e.g. StandardScaler), or a Pipeline
the list of columns to which the transformation is applied
Code example
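# imports needed for this example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
    ('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)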
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer outputs a numpy array with the transformed values, fitted on the input dataset df. Although the column names are gone, the array columns keep the same order as the columns list passed to the transformer, so it is easy to convert the array back to a pandas dataframe if you need to.

In addition to @RicS's answer, note that scikit-learn functions return a numpy array, not a dataframe, and that the Company column is not included in the result. You can convert the result back to a dataframe like this:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
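
Both answers scale each quarter column separately. If "across the entire dataframe" means a single mean and standard deviation (or a single min and max) for all sixteen values, a plain pandas/numpy sketch like the following might be closer to what was asked. This is an assumption about the intent, not something the question confirms.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})

quarters = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
values = df[quarters].to_numpy()

# Standardize with ONE mean and ONE (population) standard deviation for all values.
standardized = (df[quarters] - values.mean()) / values.std()

# Min-max normalize with ONE global minimum and maximum.
normalized = (df[quarters] - values.min()) / (values.max() - values.min())

# Put the company names back next to the scaled values.
result = pd.concat([df['Company'], standardized], axis=1)
print(result)
```

With a global scale, the differences between quarters are preserved, whereas per-column scaling removes them.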

Is there simple way to convert the values into int format using Python?

There are about 40 different values in the column "Weather_Condition" of a .csv file. Is there a simple way to map them to integer values, so the column can later be used as the y variable in Logistic Regression?

import pandas as pd
df = pd.read_csv(<filename>)
df['Weather_Condition'] = pd.Categorical(df['Weather_Condition'])
y = df['Weather_Condition'].cat.codes  # integer code per row; a Series exposes codes through the .cat accessor
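
As a small self-contained illustration of the categorical approach, the sketch below uses a few hypothetical weather values in place of the CSV and also shows how to recover the label behind each integer code.

```python
import pandas as pd

# Hypothetical values standing in for the real column.
df = pd.DataFrame({
    "Weather_Condition": ["Light Rain", "Overcast", "Cloudy", "Snow", "Cloudy"]
})

weather = df["Weather_Condition"].astype("category")

# Integer codes, assigned in sorted order of the category labels.
y = weather.cat.codes.to_numpy()

# Mapping from code back to the original label.
code_to_label = dict(enumerate(weather.cat.categories))

print(y)              # [1 2 0 3 0]
print(code_to_label)  # {0: 'Cloudy', 1: 'Light Rain', 2: 'Overcast', 3: 'Snow'}
```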

How about using LabelEncoder?
from sklearn.preprocessing import LabelEncoder
data = ['Light Rain', 'Overcast', 'Cloudy', 'Snow']
lencoder = LabelEncoder()
encoded = lencoder.fit_transform(data)
print(encoded)  # [1 2 0 3] (codes follow the sorted order of the labels)
The inverse transform looks like this:
inverted = lencoder.inverse_transform([0])
print(inverted)  # ['Cloudy']
Hope this helps.
PS: If you are interested, also take a look at OneHotEncoder.

Numpy Array for SVM model rather than a DataFrame

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
2 Questions:
the data goes into a numpy array from a pandas Dataframe (by pd.read_csv).
Is that better? Is there a good reason for that? why not stay with the DataFrame?
I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two contain the coordinates of the points, and the third contains the label.
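
To illustrate the notation being asked about, here is a small sketch with a made-up three-row array (only the first row is taken from the question). In data[rows, columns], a bare : selects every row, 0:2 selects columns 0 and 1 (the end index is excluded), and 2 selects the third column only.

```python
import numpy as np

# First row is from the question; the other two are hypothetical.
data = np.array([
    [0.28917, 0.65643, 0],
    [0.10000, 0.20000, 1],
    [0.50000, 0.90000, 0],
])

X = data[:, 0:2]  # all rows, columns 0 and 1 -> the two coordinates
y = data[:, 2]    # all rows, column 2 only   -> the labels, as a 1-D array

print(X.shape)  # (3, 2)
print(y.shape)  # (3,)
```

As for the first question, scikit-learn estimators also accept a DataFrame directly, so converting to a numpy array is a convenience rather than a requirement; with a DataFrame the equivalent selection would be .iloc[:, 0:2] and .iloc[:, 2].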

The easiest way for getting feature names after running SelectKBest in Scikit Learn

I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:
Let's assume I would like to conduct the experiment selecting 5 best features:
from sklearn.feature_selection import SelectKBest, f_classif
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)
Now, if I add the line:
import pandas as pd
dataframe = pd.DataFrame(select_k_best_classifier)
I receive a new dataframe without feature names (only index starting from 0 to 4), but I want to create a dataframe with the new selected features, in a way like this:
dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)
My question is how to create the features_names list?
I know that I should use:
select_k_best_classifier.get_support()
Which returns an array of boolean values, where true values indices represent the column that should be selected in the original dataframe.
How should I use this boolean array with the array of all features names I can get via the method feature_names = list(features_dataframe.columns.values) ?

This doesn't require loops.
# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]

For me this code works fine and is more 'pythonic':
mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
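
For reference, a self-contained sketch of the boolean-mask approach end to end. The data is synthetic (make_classification) and the feat_0 ... feat_7 column names are hypothetical; the point is only to show the selector being fitted first and then queried with get_support().

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for features_dataframe / targeted_class.
X_arr, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
features_df = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Keep the fitted selector itself (not the output of fit_transform).
selector = SelectKBest(score_func=f_classif, k=5).fit(features_df, y)

# Boolean mask over the original columns -> names of the kept features.
mask = selector.get_support()
selected_names = features_df.columns[mask]

# Rebuild a DataFrame that carries the original feature names.
selected_df = pd.DataFrame(
    selector.transform(features_df),
    columns=selected_names,
    index=features_df.index,
)
print(list(selected_names))
```

Note that get_support() is a method of the fitted SelectKBest object, so the variable must hold the selector rather than the transformed array.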

You can do the following:
mask = select_k_best_classifier.get_support()  # array of booleans
new_features = []  # the list of your K best features
for bool_val, feature in zip(mask, feature_names):
    if bool_val:
        new_features.append(feature)
Then use these names as the column names of the new dataframe:
dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)

The following code finds the top K features along with their F-scores. Let X be a pandas dataframe whose columns are all the features, and y the list of class labels.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
# Suppose we select the 5 features with the highest F-scores
selector = SelectKBest(f_classif, k = 5)
#New dataframe with the selected features for later use in the classifier. fit() method works too, if you want only the feature names and their corresponding scores
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)

To select the 10 best features according to chi2:
from sklearn.feature_selection import SelectKBest, chi2
KBest = SelectKBest(chi2, k=10).fit(X, y)
Get the indices of the selected features with get_support():
f = KBest.get_support(indices=True)  # column indices of the selected features
Create a new dataframe called X_new:
X_new = X[X.columns[f]]  # final features

As of Scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write
dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_feature_names_out())

There is another method, which, however, is not as fast as the solutions above.
# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols],train['is_attributed'])
# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
index=train.index,
columns= feature_cols)
selected_columns = selected_features.columns[selected_features.var() !=0]

# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)
# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)

Suppose that you want to choose the 10 best features:
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)
# feature_names_in_ lists every input column; get_feature_names_out() lists only the selected ones
features_names = selector.get_feature_names_out()
print(features_names)
