How to convert pandas data frame string values to numeric values - python

I have a data set. It has some string columns. I want to convert these string columns. I'm developing a Neural network using this data set. But since the dataset has some string values I can't train my Neural network. What is the best way to convert these string values to Neural Network readable format?
This is the data set that I have
type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1
I want to convert those type,nameOrig,nameDest fields to neural network readable format.
I have used the method below, but I don't know whether it's right or wrong.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
test_set = pd.read_csv('cs.csv')
new_test_set['type'] = enc.fit(new_test_set['type'])
I have gone through the questions below, but most of them did not work for me:
How to convert string based data frame to numeric
converting non-numeric to numeric value using Panda libraries

In this case you can use the datatype category of pandas to map strings to indices (see categorical data). So it's not necessary to use LabelEncoder or OneHotEncoder of scikit-learn.
import pandas as pd
df = pd.read_csv('54055554.csv', header=0, dtype={
'type': 'category', # <--
'amount': float,
'nameOrig': str,
'oldbalanceOrg': float,
'newbalanceOrig': float,
'nameDest': str,
'oldbalanceDest': float,
'newbalanceDest': float,
'isFraud': bool,
'isFlaggedFraud': bool
})
print(dict(enumerate(df['type'].cat.categories)))
# {0: 'PAYMENT', 1: 'TRANSFER'}
print(list(df['type'].cat.codes))
# [0, 0, 1]
The data from the CSV:
type, ...
PAYMENT, ...
PAYMENT, ...
TRANSFER, ...

Transformation
First you need to transform the three columns using the LabelEncoder class.
Encoding Categorical Data
Well here you have the type as categorical value. For this you can use the class OneHotEncoder available in sklearn.preprocessing.
Avoiding Dummy Variable Trap
Then you need to avoid the Dummy Variable Trap by removing any one of the columns that are being used to represent type.
Code
Here I have put the sample code for your reference.
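import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
dataset = pd.read_csv('cs.csv')
X = dataset.iloc[:].values
labelencoder = LabelEncoder()
# Label-encode the string columns: type (0), nameOrig (2), nameDest (5)
X[:, 0] = labelencoder.fit_transform(X[:, 0])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
X[:, 5] = labelencoder.fit_transform(X[:, 5])
# One-hot encode column 0 (type); the categorical_features argument of
# OneHotEncoder is gone in newer scikit-learn, so ColumnTransformer is used.
# sparse_threshold=0 keeps the result a dense array.
ct = ColumnTransformer([('type', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)
# Avoiding the Dummy Variable Trap: drop the first one-hot column
X = X[:, 1:]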
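As a rough alternative sketch of the same idea (one-hot encode type, then drop one dummy column), pandas' get_dummies with drop_first=True does both steps at once. This swaps in pandas for the scikit-learn encoders used above; the small data frame only repeats the type values from the question's sample rows.

```python
import pandas as pd

# The 'type' values from the three sample rows in the question
df = pd.DataFrame({'type': ['PAYMENT', 'PAYMENT', 'TRANSFER']})

# One dummy column per category, minus the first one (dummy variable trap)
dummies = pd.get_dummies(df['type'], prefix='type', drop_first=True)

print(list(dummies.columns))
print(dummies['type_TRANSFER'].astype(int).tolist())
```

With only two categories, a single column (type_TRANSFER) is enough to represent the feature.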

You need to encode the string values into numeric ones. What I usually do in this case is create a lookup table for each non-numeric feature, containing all the possible values of that feature. The index of a value in the corresponding table is then used when training the model.
Example:
type_values = ['PAYMENT', 'TRANSFER']
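A minimal sketch of how such a table could be applied, assuming the data lives in a pandas data frame. The names type_to_index and type_index are made up for illustration.

```python
import pandas as pd

# Table of all possible values of the 'type' feature
type_values = ['PAYMENT', 'TRANSFER']

# Hypothetical helper: map each value to its index in the table
type_to_index = {value: index for index, value in enumerate(type_values)}

# The 'type' values from the question's sample rows
df = pd.DataFrame({'type': ['PAYMENT', 'PAYMENT', 'TRANSFER']})

# Replace each string by its index in the table
df['type_index'] = df['type'].map(type_to_index)
print(df['type_index'].tolist())
```

Values that are missing from the table are mapped to NaN, so the table has to list every value that occurs in the data.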

Related

Standard Scaler Python Pandas

Can you help me solve the error I get when I run the code below?
The error is returned as you are trying to scale string variables, which you cannot do. You can only scale numerical variables.
Depending on what your string data looks like, if it's simple categorical data you could try one-hot-encoding. If it's more complicated, you'll probably have to go down the natural language processing route.
You need to encode your string columns (categorical features) first. Use OrdinalEncoder(), LabelEncoder() or OneHotEncoder() to convert categorical columns to numeric. You can only scale numerical variables.
For example:
# Perform label encoding for the gender variable (two possible values: male and female)
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# Perform one-hot encoding for the geography variable (3 values for states)
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('ohe', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X), dtype = str)
# Drop the first one-hot column
X = X[:, 1:]
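The encode-then-scale order can also be sketched in one step with ColumnTransformer, which one-hot encodes the string column and scales only the numeric one. The data frame and its column names (geography, age) are invented for illustration, since the original data was only posted as images.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one string column and one numeric column
df = pd.DataFrame({
    'geography': ['France', 'Spain', 'France', 'Germany'],
    'age': [42.0, 41.0, 39.0, 50.0],
})

ct = ColumnTransformer(
    [
        ('ohe', OneHotEncoder(), ['geography']),  # strings -> 0/1 columns
        ('scale', StandardScaler(), ['age']),     # numbers -> zero mean, unit variance
    ],
    sparse_threshold=0,  # always return a dense array
)

X = ct.fit_transform(df)
print(X.shape)  # three one-hot columns plus one scaled column
```

The scaler never sees the string column, so the error from the question does not occur.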

scikit preprocessing across entire dataframe

I have a dataframe:
df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
'Q1-2019': [9.05, 8.64, 6.3],
'Q2-2019': [8.94, 8.56, 7.09],
'Q3-2019': [8.86, 8.45, 7.09],
'Q4-2019': [8.34, 8.61, 7.25]})
The data is an average response of the same question asked across 4 quarters.
I am trying to create a benchmark index from this data. To do so I wanted to preprocess it first using either standardize or normalize.
How would I standardize/normalize across the entire dataframe? What is the best way to go about this?
I can do this for a single row or column using the code below, but I am struggling to do it across the whole dataframe.
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
#define scaler
scaler = MinMaxScaler() #or StandardScaler
X = df.loc[1].T
X = X.to_numpy()
#transform data
scaled = scaler.fit_transform(X)
If I understood your need correctly, you can use ColumnTransformer to apply the same transformation (e.g. scaling) separately to different columns.
As you can read from the linked documentation, you need to provide inside a tuple:
a name for the step
the chosen transformer (e.g. StandardScaler) or a Pipeline as well
a list of columns to which the selected transformation is applied
Code example
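# imports needed by this snippet
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
# specify columns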
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']
# create a ColumnTransformer instance
ct = ColumnTransformer([
('scaler', StandardScaler(), columns)
])
# fit and transform the input dataframe
ct.fit_transform(df)
array([[ 0.86955718, 0.93177476, 0.96056682, 0.46493449],
[ 0.53109031, 0.45544147, 0.41859563, 0.92419906],
[-1.40064749, -1.38721623, -1.37916245, -1.38913355]])
ColumnTransformer will output a numpy array with the transformed values, which were fitted on the input dataset df. Even though there are no column names now, the array columns are still ordered in the same way as the input dataframe, so it's easy to convert the array to a pandas dataframe if you need to.
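A short sketch of that conversion, assuming the same df and columns as above. The name scaled_df is made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Company': ['abc', 'xyz', 'def'],
                   'Q1-2019': [9.05, 8.64, 6.3],
                   'Q2-2019': [8.94, 8.56, 7.09],
                   'Q3-2019': [8.86, 8.45, 7.09],
                   'Q4-2019': [8.34, 8.61, 7.25]})
columns = ['Q1-2019', 'Q2-2019', 'Q3-2019', 'Q4-2019']

ct = ColumnTransformer([('scaler', StandardScaler(), columns)])
scaled = ct.fit_transform(df)

# The array columns follow the order of `columns`, so the names can be reattached
scaled_df = pd.DataFrame(scaled, columns=columns, index=df.index)
print(scaled_df)
```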
In addition to @RicS's answer, note that what scikit-learn functions return is a numpy array, not a dataframe. Also, the Company column is not included. You may consider this to convert the results to a dataframe again:
scaler = StandardScaler()
x = scaler.fit_transform(df.drop("Company",axis=1)) # scale all columns except Company
y = pd.concat([df["Company"],pd.DataFrame(x, columns=df.columns[1:])],axis=1) # adds results and company into dataframe again
y.head()

How is pandas converting my data into categories in this line of code?

I need help understanding this line of code:
y_train2 = train_target2.astype('category').cat.codes
Am I right in saying that y_train2 is being changed to a categorical variable using astype(category) and then cat.codes is used to change it into integers?
Below is the full block of code.
# Train data pre-processing
train_target2 = df_train_01['class_2']
train_target5 = df_train_01['class_5']
df_train_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
# convert text labels to integers
y_train2 = train_target2.astype('category').cat.codes
y_train5 = train_target5.astype('category').cat.codes
# Test data pre-processing
test_target2 = df_test_01['class_2']
test_target5 = df_test_01['class_5']
# drop 'class_2' and 'class_5' columns
df_test_02.drop(['class_2', 'class_5'], axis=1, inplace=True)
y_test2 = test_target2.astype('category').cat.codes
y_test5 = test_target5.astype('category').cat.codes
I think your understanding of the function and attribute is correct: astype('category') turns the values into categorical data, and .cat.codes is an attribute (not a method) of the resulting categorical Series that gives the integer code of each value, starting at 0.
Try the simple snippet below to see how they work.
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
iris = load_iris()
pdf = pd.DataFrame(iris.data, columns=['s-length', 's-width', 'p-length', 'p-width'])
# Index the data frame `pdf`, not the `iris` bunch object
print(
pdf['s-length'].astype('category'),
len(np.unique(pdf['s-length'])), # -> 35
len(set(pdf['s-length'].astype('category').cat.codes)), # -> 35
np.unique(pdf['s-length'].astype('category').cat.codes), # -> array([ 0, 1, ..., 34], dtype=int8)
)
In essence, a pandas categorical data type is a mapping between values that do not have a numeric interpretation and a unique number for each value.
Let's break down your code:
# Take the series `train_target2` and convert it to categorical type
train_target2.astype('category')
# Access the attributes or methods of a categorical series
train_target2.astype('category').cat
# Take the `codes` attribute
train_target2.astype('category').cat.codes
In reality, .codes is not converting the data into numbers. Rather, you are only taking the numeric equivalent of each category. Strictly speaking, .astype('category') is the part that converted your data to categorical.
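A tiny sketch of the two steps on made-up labels (the label strings are hypothetical, not taken from the question's data):

```python
import pandas as pd

# Hypothetical text labels
labels = pd.Series(['normal', 'attack', 'normal', 'probe'])

# Step 1: convert to the categorical dtype; categories are the sorted unique values
as_category = labels.astype('category')
print(list(as_category.cat.categories))

# Step 2: read the integer code of each value (its position in the categories)
codes = as_category.cat.codes
print(codes.tolist())
```

Because the codes depend on which values are present, encoding the train and test targets separately, as in the question's code, can give the same label different integers if the two sets do not contain the same values.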
You can find the attributes and methods of this data type here.

Is there a way to reorder columns in a pandas data frame for hundreds of attributes?

I am implementing a classification machine learning algorithm. The dataset contains numerical as well as nominal values in no specific order. The numeric and nominal values are in separate columns but these columns are scattered throughout the dataset. Also, the columns/ attributes are exactly 279 in number. How can I reorder or sort the dataset to have numerical valued columns on one side and nominal valued columns on the other side of the dataset?
P.S. I need to sort the columns so I can implement standardization in this way:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 49: ] = sc.fit_transform(x_train[:, 49: ])
x_test[:, 49: ] = sc.transform(x_test[:, 49: ])
The link to the dataset is provided below for more information.
https://archive.ics.uci.edu/ml/datasets/Arrhythmia
Select number columns and object (string) columns and then concat them side by side:
df1 = pd.concat((df.select_dtypes('number'), df.select_dtypes('object')), axis=1)
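A small sketch of this on an invented frame (the column names and values are hypothetical, not from the Arrhythmia dataset). It also shows that, with a data frame, the numeric columns can be standardized by name, so positional slices such as 49: are not needed. Here exclude='number' is used for the non-numeric side so that string columns are picked up regardless of how pandas stores them.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with numeric and nominal columns interleaved
df = pd.DataFrame({
    'sex': ['m', 'f', 'm'],
    'age': [30.0, 40.0, 50.0],
    'rhythm': ['a', 'b', 'a'],
    'height': [170.0, 160.0, 180.0],
})

num_cols = df.select_dtypes('number').columns          # numeric columns
other_cols = df.select_dtypes(exclude='number').columns  # everything else

# Numeric columns on the left, nominal columns on the right
df1 = pd.concat((df[num_cols], df[other_cols]), axis=1)
print(list(df1.columns))

# Standardize only the numeric columns, selected by name
df1[num_cols] = StandardScaler().fit_transform(df1[num_cols])
```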
If I understand correctly, you can first determine the dtype of the columns with
import pandas as pd
import itertools
g = df.columns.to_series().groupby(df.dtypes).groups
And then simply get a list of this new order
columns_list = []
for key in g.keys():
    columns_list.append(g[key].values.tolist())
columns_list = list(itertools.chain(*columns_list))
And finally reindex
df = df.reindex(columns_list, axis=1)

Is there simple way to convert the values into int format using Python?

There are about 40 different values in the col "Weather_Condition", in .csv. Is there a simple way to map them to integer values for later use in the col as y variable in Logistic Regression?
import pandas as pd
df = pd.read_csv(<filename>)
df['Weather_Condition'] = pd.Categorical(df['Weather_Condition'])
y = df['Weather_Condition'].cat.codes
How about using Label Encoder ?
from sklearn.preprocessing import LabelEncoder
data = ['Light Rain', 'Overcast', 'Cloudy', 'Snow']
lencoder = LabelEncoder()
encoded = lencoder.fit_transform(data)
print(encoded)
Inverse transform would look like below;
inverted = lencoder.inverse_transform([0])
print(inverted)
Hope it was helpful.
PS: If you are interested, also take a look at OneHotEncoder.
