Standardizing data column-wise before using Keras models - Python

I'm working with a large dataset that I want to standardize before using it with a CNN.
Does Keras have a quick utility to standardize a block of numbers column-wise that you can use inside a Sequential model? I'm asking because I expect the data to eventually be used online, so ideally this standardization could be applied to incoming data as well, i.e. a trailing moving average of the mean and std used to normalize incoming samples.
import numpy as np
import pandas as pd
np.random.seed(42)
col_names = ['Column' + str(x+1) for x in range(5)]
training_data = pd.DataFrame(np.random.randint(1, 10**6, 50).reshape(-1, 5), columns=col_names)

I am not sure about the online case, but sklearn's StandardScaler() should do the right thing here, as described in the scikit-learn documentation.

We can use sklearn's StandardScaler, which standardizes each column (feature) by default:
from sklearn.preprocessing import StandardScaler
training_data[:] = StandardScaler().fit_transform(training_data)
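If you want the standardization to live inside the model itself, recent versions of TensorFlow/Keras ship a Normalization layer that learns per-column mean and variance from the training data via adapt(). A minimal sketch, assuming TensorFlow 2.x (the layer sizes here are illustrative):

import numpy as np
import tensorflow as tf

norm = tf.keras.layers.Normalization(axis=-1)           # per-column mean/variance
norm.adapt(np.asarray(training_data, dtype='float32'))  # learn stats from training data

model = tf.keras.Sequential([
    norm,                                   # standardizes inputs column-wise
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1),
])

For the online part (a trailing moving average of mean and std), I'm not aware of a built-in, so here is a hypothetical exponential-moving-average normalizer you could adapt; the class name, alpha, and eps are all assumptions of this sketch:

import numpy as np

class EMANormalizer:
    """Sketch: normalize incoming rows against exponentially decaying
    estimates of the per-column mean and variance."""
    def __init__(self, n_cols, alpha=0.01, eps=1e-8):
        self.alpha = alpha           # smoothing factor for the moving stats
        self.eps = eps               # avoids division by zero early on
        self.mean = np.zeros(n_cols)
        self.var = np.ones(n_cols)

    def update_transform(self, x):
        x = np.asarray(x, dtype=float)
        self.mean = (1 - self.alpha) * self.mean + self.alpha * x
        self.var = (1 - self.alpha) * self.var + self.alpha * (x - self.mean) ** 2
        return (x - self.mean) / np.sqrt(self.var + self.eps)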

Related

LinearSVC: Equation of a straight line that separates two classes from a scatterplot graph and pandas DataFrame

I'm trying to create a straight line that separates two classes. I'm using a pandas DataFrame with a seaborn scatterplot.
Here is my code before I get you into my problem:
Libraries:
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay
from scipy.io import arff
Data:
arquivo_arff = arff.loadarff(r"/content/Rice_MSC_Dataset.arff")
dados = pd.DataFrame(arquivo_arff[0])
Filter:
dados = dados[['MINOR_AXIS', 'MAJOR_AXIS', 'CLASS']]
Another filter:
dados = dados[dados['CLASS'].isin([b"Arborio", b"Ipsala"])]
Graph with two parameters:
sns.scatterplot(
    data=dados,
    x="MINOR_AXIS",
    y="MAJOR_AXIS",
    hue="CLASS")
plt.show()
My problem is here, when I use LinearSVC to find the parameters and coefficients of my equation:
model = LinearSVC()
model.fit(dados.drop('CLASS', axis=1), dados['CLASS'])
a, b = model.coef_[0]
d = model.intercept_[0]
print('a:', a)
print('b:', b)
print('d:', d)
You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
I didn't understand that error quite well. Is there any way I can fix this in my code?
The documentation for MultiLabelBinarizer has some good examples for specific uses, but a general workflow for sklearn transformers is:
Split data into features and labels
X = dados.drop('CLASS', axis=1)
y = dados['CLASS']
#optionally, use train_test_split to split data into training and validation sets
#X_train,X_test,y_train,y_test=train_test_split(X,y)
Do transformations on input and target data
from sklearn.preprocessing import MultiLabelBinarizer

mb = MultiLabelBinarizer()
mb.fit(y)
y_bin = mb.transform(y)
# can also be done in one step with y_bin = mb.fit_transform(y)
# if using train_test_split: y_train_bin = mb.fit_transform(y_train); y_test_bin = mb.transform(y_test)
Fit your model
model = LinearSVC()
model.fit(X, y_bin) # or model.fit(X_train, y_train_bin) if using training and validation sets
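Back on the original goal (the equation of the separating line): with the model fitted on the two numeric columns, the boundary a*x + b*y + d = 0 rearranges to y = -(a*x + d) / b and can be drawn over the scatterplot. A sketch, assuming a, b and d were extracted as in the question:

import numpy as np
import matplotlib.pyplot as plt

# decision boundary: a*x + b*y + d = 0  =>  y = -(a*x + d) / b
xs = np.linspace(dados['MINOR_AXIS'].min(), dados['MINOR_AXIS'].max(), 100)
plt.plot(xs, -(a * xs + d) / b, 'k--', label='decision boundary')
plt.legend()
plt.show()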

Creating a Decision Tree in Python, Numerical and Categorical Variables: "Unable to coerce to Series"

I'm creating a Decision Tree; the dataset has 21 columns, a mix of numeric and categorical variables. I understand sklearn does not support categorical variables, so I converted the categorical columns to numeric using label encoding while keeping the numeric variables separate. I then figured I'd have to add both groups back together so I can split into training and testing data. However, when I tried to combine the two (the originally numeric variables with the categorical variables converted to numeric), I received a ValueError.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
credit = pd.read_csv('german_credit_risk.csv')
credit.head(10)
credit.info()
credit.describe(include='all')
col_names = ['Duration', 'Credit.Amount', 'Disposable.Income', 'Present.Residence', 'Age', 'Existing.Credits', 'Number.Liable', 'Cost.Matrix']
obj_cols = list(credit.select_dtypes(include='O').columns)
obj_cols
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
encoded_obj_df = pd.DataFrame(columns=obj_cols)
for col in obj_cols:
    encoded_obj_df[col] = le.fit_transform(credit[col])
encoded_obj_df.head(10)
credit.columns = col_names + encoded_obj_df
This raises a ValueError.
Do I have the right idea, and am I just not adding the two together properly?
The error occurred because you are adding a list of strings to a DataFrame and trying to assign the result of this operation to the column names of another DataFrame.
You would need to concatenate the data frames (one with the numerical and one with the label-encoded values) on axis 1 with the pd.concat function, as shown below.
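For example, a sketch reusing the names from your code, where credit[col_names] holds the numeric columns and encoded_obj_df the label-encoded ones:

import pandas as pd

# place numeric and encoded categorical columns side by side
credit_encoded = pd.concat([credit[col_names], encoded_obj_df], axis=1)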
However, as you are using scikit-learn, I would advise you to use it to the full extent. The Pipeline and ColumnTransformer classes can help you with the task of preprocessing and classification.
The Pipeline chains a sequence of scikit-learn transformers, so you don't need to pass the data between components yourself.
The ColumnTransformer selects the given data slices and applies the given transformers to them, then automatically combines the processed (and remaining) columns into a single np.array.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer

# LabelEncoder is meant for target labels and cannot be used inside a
# ColumnTransformer; OrdinalEncoder applies the same integer encoding column-wise.
clf = make_pipeline(
    ColumnTransformer(
        [('categorical', OrdinalEncoder(), credit.select_dtypes(include='O').columns)],
        remainder='passthrough'
    ),
    DecisionTreeClassifier()
)
You can then use the standard clf.fit and clf.predict on the resulting pipeline and all of the data processing and prediction will happen at once.
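A minimal usage sketch; note that 'Risk' is a hypothetical target column name, and the target should be dropped before the object-dtype columns are selected for the ColumnTransformer:

# 'Risk' is a placeholder for your actual target column
X = credit.drop('Risk', axis=1)
y = credit['Risk']
clf.fit(X, y)
predictions = clf.predict(X)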

Logistic Regression test input format help in Python

I have the dataset below.
I've created a Logistic Regression model from it and checked the accuracy, which works fine. Now the requirement is: I have new data with Age 30 and EstimatedSalary 50000, and I would like to predict whether Purchased will be 0 or 1. How do I pass the new values 30 and 50000 into my Python code?
Below is the python code which I've used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline
dataset = pd.read_csv(r"suv_data.csv")
X=dataset.iloc[:,[0,1]].values
y=dataset.iloc[:,2].values
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2, random_state=1)
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
classifier=LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
accuracy_score(y_test,y_pred)*100
In general, to evaluate a trained model (i.e. call .predict in sklearn), you need to input samples with the same shape as the samples the model was trained on.
In your case I suppose (see my comment on your question) you want samples with Age and EstimatedSalary as features, using Purchased as the label.
Then, to test on a single sample just try this:
single_test_sample = pd.DataFrame({'Age':[30], 'EstimatedSalary':[50000]}).iloc[:,[0,1]].values
single_test_sample = sc.transform(single_test_sample)
single_test_prediction = classifier.predict(single_test_sample)
Note that you can also add more rows to the Age and EstimatedSalary columns of the test dataframe; here I only added the sample you were interested in. If you add more, the model will output a prediction for each row.
Also note that both your code and mine will work without the .values at the end of the train/test sets, since sklearn accepts pandas dataframes directly, as sketched below.
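For example, a sketch of predicting straight from a DataFrame (newer sklearn versions may emit a feature-name warning because the scaler was fitted on a plain array, but the result is the same):

df_new = pd.DataFrame({'Age': [30], 'EstimatedSalary': [50000]})
print(classifier.predict(sc.transform(df_new)))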
Your question is not entirely clear, but I understand that you want to use the fitted model to predict a new sample.
After having fitted your model just use this:
new_sample = np.array([[30,50000]]) # 2D numpy array
new_sample_sc = sc.transform(new_sample)
y_pred_new = classifier.predict(new_sample_sc)
print(y_pred_new)

I would like to build a feature set (vector) for my data in Python for my machine learning algorithm. How can I do it?

I have data in the following form:
Class - Feature set list
classlabel1 - [size, time], example: [6780.3, 350.00]
classlabel2 - [size, time]
classlabel3 - [size, time]
classlabel4 - [size, time]
How do I save this data in an Excel sheet, and how can I train the model using this feature set? Currently I am working on an SVM classifier.
I have tried saving the feature set list in a dataframe and saving that dataframe to a csv file, but size and time get split into two different columns.
The dataframe ends up in the csv file in the following way:
col0   col1         col2
62309  396.5099154  label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do, because they are on two different scales (if they are actually what their names suggest), and merging them into one value would lose the information each provides on its own; they are two totally independent features for any supervised ML algorithm. So I would suggest treating these two features separately rather than combining them into one.
Now let's move onto to next section:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data: In my opinion you can store the data in whichever format you want, but I would prefer csv, as it is convenient and loads quickly.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from csv and training an SVM:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data
data = pd.read_csv("sample_data.csv", error_bad_lines=True,
warn_bad_lines=True)
# Dividing into dependent and independent features
Y = data.class_label.values
X = data.drop("class_label", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))
# split training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, label_encoded_Y,
                                                    train_size=0.8,
                                                    test_size=0.2)
# Now use whichever training algo you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
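The accuracy_score imported above is never used; to close the loop, a quick evaluation of the fitted classifier could look like this (reusing the variables from the block above):

# compare predictions against the held-out labels
print("accuracy:", accuracy_score(y_test, y_pred))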
Since size and time are different features, you should keep them in 2 separate columns so your model can assign a separate weight to each of them, i.e.
# data.csv
size    time    label
6780.3  350.00  classLabel1
...
If you want to transform the data you have into the format above, you could use pandas.read_excel and the ast module to turn the string list into a Python list object.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
size_time = [ast.literal_eval(x) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv
new_df.to_csv("data_fix.csv")
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
...
Hope this helps

What is leaf_values from Python LightGBM?

I'm using the LightGBM Package.
I have successfully created a tree visualization using "create_tree_digraph", but I'm having trouble understanding the result.
There is a "leaf_value" in each leaf node, and I don't know what it means. Please, somebody help me understand this. Thanks. :)
I used this example code from here: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
#importing standard libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import graphviz
import lightgbm as lgb
#loading our training dataset 'adult.csv' with name 'data' using pandas
data=pd.read_csv('./adult.csv',header=None)
#Assigning names to the columns
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income']
# Label Encoding our target variable
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder()
l.fit(data.Income)
data.Income=Series(l.transform(data.Income)) #label encoding our target variable
#One Hot Encoding of the Categorical features
one_hot_workclass=pd.get_dummies(data.workclass)
one_hot_education=pd.get_dummies(data.education)
#removing categorical features
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)
#Merging one hot encoded features with our dataset 'data'
data=pd.concat([data,one_hot_workclass,one_hot_education],axis=1)
#Here our target variable is 'Income' with values as 1 or 0.
#Separating our data into features dataset x and our target dataset y
x=data.drop('Income',axis=1)
y=data.Income
#Imputing missing values in our target variable
y.fillna(y.mode()[0],inplace=True)
#Now splitting our dataset into test and train
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
train_data=lgb.Dataset(x_train,label=y_train)
#setting parameters for lightgbm
param = {'num_leaves':150, 'objective':'binary','max_depth':3,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']
#training our model using light gbm
num_round=50
lgbm=lgb.train(param,train_data,num_round)
graph = lgb.create_tree_digraph(lgbm)
graph.render(view=True)
Then I applied the 'create_tree_digraph' function.
[images of the rendered tree]
These are the raw scores (log-odds) before the sigmoid function is applied, not probabilities themselves. One thing to be aware of: your image shows only 1 tree out of the entire model, so its leaf values will not match the actual model output (unless your model is just this 1 tree).
The second image shows what it would look like if you applied the sigmoid to the leaf values prior to creating the plots.
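To make the mapping concrete, here is a small sketch of applying the sigmoid to a leaf value yourself; the leaf_value below is a made-up number standing in for one read off the diagram:

import numpy as np

leaf_value = -0.123                          # hypothetical raw score from a leaf
probability = 1 / (1 + np.exp(-leaf_value))  # sigmoid maps log-odds to probability
print(probability)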
