I am trying to create a machine learning model to suggest treatment for stroke patients based on their responses to various questionnaires and assessments. For instance, the patient will be asked to rate the stiffness of the fingers, elbow, shoulder, and pectoral muscles (each on a scale of 0 to 100) or answer 14 questions related to mental health (each on a scale of 0 to 3).
I would like to create an sklearn pipeline roughly as follows:
1. The patient responses are aggregated. For example, the four stiffness responses should be averaged to create a single “stiffness” value, while the fourteen mental health questions should be summed up to create a single “mental health” value. The “stiffness” and “mental health” values would then be features in the model.
2. Once the features have been aggregated in this way, a decision tree classifier is trained on labeled data to assign each patient to the appropriate therapy.
3. The trained pipeline is exported as a PMML file for production.
I assume this must be doable with some code like this:
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import sklearn2pmml
from sklearn.tree import DecisionTreeClassifier
from somewhere import Something
pipeline = PMMLPipeline([
("input_aggregation", Something()),
("classifier", DecisionTreeClassifier())
])
pipeline.fit(patient_input, therapy_labels)
sklearn2pmml(pipeline, "ClassificationPipeline.pmml", with_repr = True)
I’ve been poking around the documentation and I can figure out how to apply PCA to a group of columns, but not how to do something as straightforward as collapsing a group of columns by summing or averaging. Does anyone have any hints about how I could do this?
Thanks for your help.
Sample code:
from sklearn_pandas import DataFrameMapper
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import Aggregator
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (["stiffness_1", "stiffness_2", "stiffness_3", "stiffness_4"], Aggregator(function = "mean")),
        (["mental_health_1", "mental_health_2", ..., "mental_health_14"], Aggregator(function = "sum"))  # spell out all 14 column names
    ])),
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)
Explanation: you can use sklearn_pandas.DataFrameMapper to define a column group and apply a transformation to it. For the conversion to PMML to work, you need to provide a transformer class, not a plain function. Chances are all your transformation needs are covered by the sklearn2pmml.preprocessing.Aggregator transformer class. If not, you can always define your own.
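If the built-in Aggregator falls short, a custom transformer is only a few lines. Here is a minimal sketch of a stateless summing transformer (the class name ColumnSum is my own invention; also note that a custom transformer can only be exported to PMML if the converter knows how to translate it):
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class ColumnSum(BaseEstimator, TransformerMixin):
    # Collapses a group of columns into a single summed feature
    def fit(self, X, y = None):
        return self  # stateless: nothing to learn
    def transform(self, X):
        # sum across the column group, keeping a 2D (n_rows, 1) shape
        return np.asarray(X).sum(axis = 1).reshape(-1, 1)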
While #makis has provided a 100% valid Python example, it wouldn't work in the Python-to-PMML case, because the converter cannot parse/handle custom Python functions.
You just need to define a custom function and use it in the Pipeline.
Here is the full code:
from sklearn.preprocessing import FunctionTransformer
import numpy as np
from sklearn2pmml import make_pmml_pipeline
# fake data with 7 columns
X = np.random.rand(10,7)
def custom_function(X):
    # averages the first 4 columns, sums the remaining ones, column-wise
    return np.concatenate([np.mean(X[:, 0:4], axis = 1).reshape(-1, 1),
                           np.sum(X[:, 4:], axis = 1).reshape(-1, 1)], axis = 1)
# Now, if you run `custom_function(X)`, it should return an array of shape (10, 2).
pipeline = make_pmml_pipeline(
FunctionTransformer(custom_function),
)
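As a quick, illustrative sanity check, fitting and applying this pipeline should collapse the 7 fake input columns down to 2:
Xt = pipeline.fit_transform(X)
print(Xt.shape)  # expected: (10, 2)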
I'm new to Python, but I'm trying to run a regression with a bunch of different variables. So far I've narrowed it down to scikit-learn. I've been searching for hours but can't seem to find a way to import the data and then run a linear regression on it while returning the coefficients of each variable. Any help is much appreciated. I have 15 columns that I want to run against the X.
X = Margin
Ys = A1, B1, C1, D1 etc.
Example set below:
Margin,A1
-8,110.7
-10,112
-1,106.7
9,109
-9,107.5
1,108.1
-19,109.2
Here's what I've got so far; I know it's not much:
import pandas as pd
data = pd.read_csv("NBA.csv")
As a convention in machine learning, we consider X to be the features and y to be the target, so your X and Ys above are swapped relative to the usual naming.
If you want to run a linear regression and extract the coefficients, you can do the following:
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = data.columns.tolist() # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
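To make that output easier to read, you can pair each coefficient with its feature name, for example:
for name, coef in zip(features, model.coef_):
    print(f"{name}: {coef:.4f}")  # one line per feature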
I have the dataset below.
I've created a logistic regression from it and checked its accuracy, which looks fine. Now the requirement is: I have a new data point with Age 30 and EstimatedSalary 50000, and I would like to predict whether Purchased will be 0 or 1. How do I pass the new values 30 and 50000 into my Python code?
Below is the Python code I've used.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
%matplotlib inline
dataset = pd.read_csv(r"suv_data.csv")
X=dataset.iloc[:,[0,1]].values
y=dataset.iloc[:,2].values
X_train,X_test,y_train,y_test=train_test_split(X, y, test_size=0.2, random_state=1)
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
classifier=LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
y_pred=classifier.predict(X_test)
accuracy_score(y_test,y_pred)*100
In general, to evaluate (i.e. call .predict in sklearn) a trained model, you need to input samples that have the same shape as the samples the model was trained on.
In your case I suppose (see my comment on your question) you wanted to have samples with Age and EstimatedSalary in the training set using Purchased as label.
Then, to test on a single sample just try this:
single_test_sample = pd.DataFrame({'Age':[30], 'EstimatedSalary':[50000]}).iloc[:,[0,1]].values
single_test_sample = sc.transform(single_test_sample)
single_test_prediction = classifier.predict(single_test_sample)
Note that you can also add more values to the test dataframe's Age and EstimatedSalary columns; here I only added the sample you were interested in. If you add more, the model will output a prediction for each row in the test dataframe.
Also note that both your code and mine will work without the .values at the end of the train/test sets, as sklearn accepts pandas dataframes directly.
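For instance, the single-sample test above could equivalently be written without .values (a stylistic variant, not a requirement):
single_test_sample = pd.DataFrame({'Age': [30], 'EstimatedSalary': [50000]})
single_test_prediction = classifier.predict(sc.transform(single_test_sample))
print(single_test_prediction)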
Your question is not entirely clear, but I understand that you want to use the fitted model to predict a new sample.
After having fitted your model just use this:
new_sample = np.array([[30,50000]]) # 2D numpy array
new_sample_sc = sc.transform(new_sample)
y_pred_new = classifier.predict(new_sample_sc)
print(y_pred_new)
I am new to ML and learning the concepts. I am importing a csv file containing columns with customer codes and product details. I am trying to predict which product a customer will buy in the future, but I am only getting array([10]) back.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import numpy as np
sales_data = pd.read_csv("salesdata.csv")
sales_data = sales_data.astype('int32')
X = sales_data.drop(columns=['Product'])
y = sales_data['Product']
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict([[7301, 52199000]])
predictions
So the good news is, your code is all great. There's just a little misunderstanding about what the model is returning... Basically it is saying that product '10' is its prediction, based on the values you plugged into model.predict(). This is the model's way of talking to us, and sometimes we need to take an extra step to see what that class label means in our language. Try model.classes_; this should print all of the class labels the model was trained on.
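For illustration, a minimal sketch of that extra inspection step:
print(model.classes_)  # all Product labels the model was trained on
prediction = model.predict([[7301, 52199000]])
print(prediction[0])  # the predicted Product label itself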
I'm using the LightGBM Package.
I have successfully created a new tree using "create_tree_digraph", but I'm having some trouble understanding the result.
There is a "leaf_value" in each leaf node, and I don't know what it means. Please, somebody help me understand this. Thanks. :)
I used this example code from here: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
#importing standard libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import graphviz
import lightgbm as lgb
#loading our training dataset 'adult.csv' with name 'data' using pandas
data=pd.read_csv('./adult.csv',header=None)
#Assigning names to the columns
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income']
# Label Encoding our target variable
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder()
l.fit(data.Income)
data.Income=Series(l.transform(data.Income)) #label encoding our target variable
#One Hot Encoding of the Categorical features
one_hot_workclass=pd.get_dummies(data.workclass)
one_hot_education=pd.get_dummies(data.education)
#removing categorical features
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)
#Merging one hot encoded features with our dataset 'data'
data=pd.concat([data,one_hot_workclass,one_hot_education],axis=1)
#Here our target variable is 'Income' with values as 1 or 0.
#Separating our data into features dataset x and our target dataset y
x=data.drop('Income',axis=1)
y=data.Income
#Imputing missing values in our target variable
y.fillna(y.mode()[0],inplace=True)
#Now splitting our dataset into test and train
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
train_data=lgb.Dataset(x_train,label=y_train)
#setting parameters for lightgbm
param = {'num_leaves':150, 'objective':'binary','max_depth':3,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']
#training our model using light gbm
num_round=50
lgbm=lgb.train(param,train_data,num_round)
graph = lgb.create_tree_digraph(lgbm)
graph.render(view=True)
Then I applied the 'create_tree_digraph' function.
(screenshots of the rendered tree omitted)
These are the raw scores (log-odds) before the sigmoid function is applied, not probabilities yet. However, one thing to be aware of is that your image is only showing 1 tree out of the entire model, so it will not be the same as the actual outcome (unless your model is just this 1 tree).
This image shows what it would look like if you applied the sigmoid to the leaf values prior to creating the plots.
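For a rough sense of the conversion (the numbers here are made up for illustration), the mapping from a summed raw score to a probability is the standard sigmoid:
import numpy as np
raw_score = 0.12  # hypothetical sum of leaf values across all trees for one sample
probability = 1.0 / (1.0 + np.exp(-raw_score))  # sigmoid
print(probability)  # roughly 0.53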
This is an example of my data:
filename,2,3,4,5,6,7,class
a.txt,0,0,0,0,0,0,0
b.txt,0,0,0,0,0,1,0
c.txt,0,0,0,0,1,0,0
d.txt,1,0,1,0,0,1,1
When I train my model, I just use columns 2 through 7 as input and class as output. But when I test the model after it has been trained and saved, I need to know which files belong to which class, i.e. how to know that d.txt is class 1.
I use pandas to import the data from the .csv files; the train set and test set are in 2 different csv files. In the training phase, I use columns 2-7 as input and the class column as target; these columns are numerical. The filename column is just text. In the test phase, I need to know the filename along with the predicted class, but I don't know how to do that.
Thanks
P/S: I used MLP, SVM, and NB as classifiers.
Assuming your data is in .csv format:
filename,2,3,4,5,6,7,class
a.txt,0,0,0,0,0,0,0
b.txt,0,0,0,0,0,1,0
c.txt,0,0,0,0,1,0,0
d.txt,1,0,1,0,0,1,1
You can output the filenames corresponding to a predicted class using:
features=[1,0,1,0,0,1] #input
output=clf.predict([features])[0] #predicted class
print(df[df["class"]==output]["filename"]) #corresponding filename
Note that in your example you're facing the problem where the number of features is greater than the number of examples, so the classifier may generalize poorly.
Hopefully you just gave a sample of your data; in that case you're likely to be fine. Just watch out for which classifier you use.
Full code:
import numpy as np
import pandas as pd
from sklearn import svm
df=pd.read_csv('file.csv')
X = df.iloc[:,1:7].values  # feature columns 2-7
y = df.iloc[:,7].values    # class column, as a 1D array
clf = svm.SVC() #using SVM as classifier
clf.fit(X, y)
features=[1,0,1,0,0,1]
output=clf.predict([features])[0]
print(df[df["class"]==output]["filename"])
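Also, since your test set lives in its own csv, a more direct way to tie filenames to predictions is to predict on that file and keep its filename column alongside the output (a sketch; 'test.csv' is a placeholder name):
test_df = pd.read_csv('test.csv')  # assumed to have the same layout as the training file
X_test = test_df.iloc[:,1:7].values
test_df['predicted_class'] = clf.predict(X_test)
print(test_df[['filename', 'predicted_class']])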