can you help me to solve the error I get when I run the code below?
enter image description here
enter image description here
enter image description here
The error is returned as you are trying to scale string variables, which you cannot do. You can only scale numerical variables.
Depending on what your string data looks like, if it's simple categorical data you could try one-hot-encoding. If it's more complicated, you'll probably have to go down the natural language processing route.
You need to encode your string columns (categorical features) first. Use OrdinalEncoder(), LabelEncoder() or OneHotEncoder() to convert categorical columns to numeric. You can only scale numerical variables.
for example:
# Preform label encoding for gender variable (two values possible for male and female)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
lableencoder_X_2 = LabelEncoder()
X[:, 2] = lableencoder_X_2.fit_transform(X[:, 2])
# preform one hot encoding for geography varaible (3 values for states)
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('ohe', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype = str)
X = X[:, 1:]
Related
I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
'Q1-2019': [9.05, 8.64, 6.3],
'Q2-2019': [8.94, 8.56, 7.09],
'Q3-2019': [8.86, 8.45, 7.09],
'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first using either standardize or normalize.
How would I standardize/normalize across the entire dataframe. What is the best way to go about this?
I can do this for a row or column using but struggling across the dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood correctly your need, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read from the linked documentation, you need to provide inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline as well
a list of columns to which apply the selected transformations
Code example
# specify columns
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer will output a numpy array with the transformed value, which were fitted on the input dataset df. Even though there are no column names now, the array columns are still ordered in the same way as the input dataframe, so it's easy to convert the array to a pandas dataframe if you need to.
In addition to #RicS's answer, note that what scikit-learn function return is a numpy array, and it is not a dataframe anymore. Also Company column is not included. You may consider this to convert results to dataframe again:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()
I need help understanding this line of code:
y_train2 = train_target2.astype('category').cat.codes
Am I right in saying that y_train2 is being changed to a categorical variable using astype(category) and then cat.codes is used to change it into integers?
Below is the full block of code.
# Train data pre-processing
train_target2 = df_train_01['class_2']
train_target5 = df_train_01['class_5']
df_train_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
# convert text labels to integers
y_train2 = train_target2.astype('category').cat.codes
y_train5 = train_target5.astype('category').cat.codes
# Test data pre-processing
test_target2 = df_test_01['class_2']
test_target5 = df_test_01['class_5']
# drop 'class_2' and 'class_5' columns
df_test_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_test2 = test_target2.astype('category').cat.codes
y_test5 = test_target5.astype('category').cat.codes
I think your understanding on the dataframe function and attribute is correct; pdf.astype('category') is turning values into categorical data and pdf.Categorical.codes() (or pdf.Series.codes() ) is an attribute that converts the values into a set of integers that start with 0.
Try to type some simple snippet below to see how they work.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
pdf = pd.DataFrame(iris.data, columns=['s-length', 's-width', 'p-length', 'p-width'])
print(
iris['s-length'].astype('category'),
len(np.unique(iris['s-length'])), # -> 35
len( set(iris['s-length'].astype('category').cat.codes ), # -> 35
np.unique(iris['s-length'].astype('category').cat.codes)), # -> array([ 0, 1,...34]), dtype=int8)
)
In essence, a pandas categorical data type is a mapping between values that do not have a numeric interpretation and a unique number for each value.
Let's break down your code:
# Take the series `train_target2` and convert it to categorical type
train_target2.astype('category')
# Access the attributes or methods of a categorical series
train_target2.astype('category').cat
# Take the `codes` attribute
train_target2.astype('category').cat.codes
In reality, .codes is not converting the data into numbers. Rather, you are only taking the numeric equivalent of each category. Strictly speaking, .astype('category') is the part that converted your data to categorical.
You can find the attributes and methods of this data type here.
There are about 40 different values in the col "Weather_Condition", in .csv. Is there a simple way to map them to integer values for later use in the col as y variable in Logistic Regression?
import pandas as pd
df = pd.read_csv(<filename>)
df['Weather_Condition'] = pd.Categorical(df['Weather_Condition'])
y = df['Weather_Condition'].codes
How about using Label Encoder ?
from sklearn.preprocessing import LabelEncoder
data = ['Light Rain', 'Overcast', 'Cloudy', 'Snow']
lencoder = LabelEncoder()
encoded = lencoder.fit_transform(data)
print(encoded)
Inverse transform would look like below;
inverted = lencoder.inverse_transform([0])
print(inverted)
Hope it was helpful.
PS:Also if you are interested take a look at one-hot-encoder;
I have a data set. It has some string columns. I want to convert these string columns. I'm developing a Neural network using this data set. But since the dataset has some string values I can't train my Neural network. What is the best way to convert these string values to Neural Network readable format?
This is the data set that I have
type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1
I want to convert those type,nameOrig,nameDest fields to neural network readable format.
I have used below method. But I don't know wheater it's right or wrong.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
test_set = pd.read_csv('cs.csv')
new_test_set['type'] = enc.fit(new_test_set['type'])
I have gone through below questions. But most of them are not worked for me
How to convert string based data frame to numeric
converting non-numeric to numeric value using Panda libraries
converting non-numeric to numeric value using Panda libraries
In this case you can use the datatype category of pandas to map strings to indices (see categorical data). So it's not necessary to use LabelEncoder or OneHotEncoder of scikit-learn.
import pandas as pd
df = pd.read_csv('54055554.csv', header=0, dtype={
'type': 'category', # <--
'amount': float,
'nameOrig': str,
'oldbalanceOrg': float,
'newbalanceOrig': float,
'nameDest': str,
'oldbalanceDest': float,
'newbalanceDest': float,
'isFraud': bool,
'isFlaggedFraud': bool
})
print(dict(enumerate(df['type'].cat.categories)))
# {0: 'PAYMENT', 1: 'TRANSFER'}
print(list(df['type'].cat.codes))
# [0, 0, 1]
The data from the CSV:
type, ...
PAYMENT, ...
PAYMENT, ...
TRANSFER, ...
Transformation
First you need to transform the three columns using LableEncoder class.
Encoding Categorical Data
Well here you have the type as categorical value. For this you can use the class OneHotEncoder available in sklearn.preprocessing.
Avoiding Dummy Variable Trap
Then you need to avoid the Dummy Variable Trap by removing any one of the column that are being used to represent type.
Code
Here I have put the sample code for your reference.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
dataset = pd.read_csv('cs.csv')
X = dataset.iloc[:].values
labelencoder = LabelEncoder()
X[:, 0] = labelencoder.fit_transform(X[:, 0])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
X[:, 5] = labelencoder.fit_transform(X[:, 5])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
# Avoiding the Dummy Variable Trap
X = X[:, 1:]
You need to encode the string values into numeric ones. What I usually do in this case is creating a table by a non numeric feature, the created table contains all the possible value of that feature. And then, the index of the value in the corresponding features table is used when training a model.
Example:
type_values = ['PAYMENT', 'TRANSFER']
I'm working on multivariable regression from a csv, predicting crop performance based on multiple factors. Some of my columns are numerical and meaningful. Others are numerical and categorical, or strings and categorical (for instance, crop variety, or plot code or whatever.) How do I teach Python to use them? I've found One Hot Encoder (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) but don't really understand how to apply it here.
My code so far:
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('filepath.csv')
df.drop(df[df['LabeledDataColumn'].isnull()].index.tolist(),inplace=True)
scale = StandardScaler()
pd.options.mode.chained_assignment = None # default='warn'
X = df[['inputColumn1', 'inputColumn2', ...,'inputColumn20']]
y = df['LabeledDataColumn']
X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']] = scale.fit_transform(X[['inputColumn1', 'inputColumn2', ...,'inputColumn20']].as_matrix())
#print (X)
est = sm.OLS(y, X).fit()
est.summary()
You could use the get_dummies function pandas provides and convert the categorical values.
Something like this..
predictor = pd.concat([data.get(['numerical_column_1','numerical_column_2','label']),
pd.get_dummies(data['categorical_column1'], prefix='Categorical_col1'),
pd.get_dummies(data['categorical_column2'], prefix='categorical_col2'),
axis=1)
then you could get the outcome/label column by doing
outcome = predictor['label']
del predictor['label']
Then call the model on the data doing
est = sm.OLS(outcome, predictor).fit()