CatBoost - Unexpected behavior in the Java API vs Python

I'm training a toy model in python, saving the model binary to disk, and then loading it into java (kotlin) and evaluating. My predictions don't agree between python and kotlin. Anyone know what I'm doing wrong?
import catboost as cb
import pandas as pd

x = pd.DataFrame(data={'a': [1, 3, 4, 99, 12],
                       'b': [0.5, 0, 1.3, 3, 44],
                       'c': [0.5, 0, 1.3, 0.91, 0],
                       'd': ['a', 'b', 'c', 'd', 'e']})
y = pd.DataFrame(data={'y': [1.23, 3.2, 1.0, 1.5, 0.2]})

model = cb.CatBoostRegressor()
model.fit(x, y, cat_features=[3], verbose=False, plot=False)
proba = model.predict([1, 0.5, 0.5, 'c'])
print(proba)  # 1.2274747745772228
model.save_model('./very_basic_model.cbm')
val model = CatBoostModel.loadModel("~/path/very_basic_model.cbm")
val floatFeatures = floatArrayOf(
    1.0f,
    0.5f,
    0.5f
)
val categoricalFeatures: Array<String> = arrayOf("c")
val pred = model.predict(floatFeatures, categoricalFeatures).get(0, 0)
System.out.println(pred) // -0.198525224469103

The answer is that the CatBoost Java API wasn't correctly implemented until a fairly recent version. When I updated to version 0.24, I could confirm parity between Python and Java.
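To double-check the fix (a minimal sketch, not from the original post), you can print the catboost version used for training and re-score the saved model on the same feature row that the Kotlin code evaluates, so the two numbers can be compared directly; the Java side needs a matching, recent catboost-prediction artifact version as well.

# Parity check sketch: confirm the Python training version and re-score the saved
# model on the same row that the Kotlin code evaluates.
import catboost as cb

print(cb.__version__)  # should be a recent release (>= 0.24) on both sides

model = cb.CatBoostRegressor()
model.load_model('./very_basic_model.cbm')
print(model.predict([1, 0.5, 0.5, 'c']))  # should now match the Kotlin output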

Related

Not being able to run grid search for Prophet on AWS EC2 instance

EC2 instance used: d2.4xlarge on EU servers
As the question says, I tried to run a grid search with multiple configs for FBProphet on an AWS EC2 Ubuntu instance and was not able to.
On my laptop it runs fine, just slowly, which is why I want to do the grid search on the VM.
I suspect the problem is related to the VM apparently only using a single vCPU (I do not know why this happens).
Moreover, I tried changing the parallelization setting on Prophet's side to None/threads/processes and nothing changed; I still got the same error.
Can somebody please help me out? I am stuck.
Error code:
The error I get:
10:52:44 - cmdstanpy - INFO - CmdStan done processing.
Traceback (most recent call last):
  File "demo_prophet.py", line 72, in <module>
    m = Prophet(**params).fit(df_to_process)  # Fit model with given params
  File "/home/ubuntu/.local/lib/python3.8/site-packages/prophet/forecaster.py", line 1169, in fit
    self.params = self.stan_backend.sampling(stan_init, dat, self.mcmc_samples, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/prophet/models.py", line 140, in sampling
    self.stan_fit = self.model.sample(**args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/cmdstanpy/model.py", line 1188, in sample
    raise RuntimeError(msg)
RuntimeError: Error during sampling
Actual code snippet:
param_grid = {
    'growth': ["linear", "logistic", "flat"],
    'changepoint_range': [0.1, 0.5, 0.6, 0.8, 0.9],
    'changepoint_prior_scale': [0.1, 0.5, 0.6, 0.9, 1, 10, 20],
    'seasonality_prior_scale': [0.1, 0.5, 0.6, 1, 10.4, 20, 50],
    'holidays_prior_scale': [1, 3, 7, 8, 9, 10, 10.4, 15, 20, 50],
    'n_changepoints': [1, 5, 10, 25, 50, 70, 100, 500],
    "mcmc_samples": [0, 1, 6, 10, 20, 50, 70],
    "interval_width": [0.1, 0.5, 0.9],
    "uncertainty_samples": [0, 1, 5, 10, 50, 100],
}

name = "Amo"
df_to_process = dataframe_FKG79[dataframe_FKG79["Name"] == name]
if df_to_process.shape[0] >= 2:
    df_to_process.drop("Name", axis=1, inplace=True)
    df_to_process.reset_index(inplace=True)
    df_to_process.columns = ["ds", "y"]
    all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
    rmses = []  # Store the RMSEs for each params here
    for params in all_params:
        m = Prophet(**params).fit(df_to_process)  # Fit model with given params
        cutoffs = [pd.to_datetime('2021-12-10'), pd.to_datetime('2021-12-31'), pd.to_datetime('2022-01-10')]
        df_cv = cross_validation(m, initial='950 days', cutoffs=cutoffs, horizon='28 days')
        df_p = performance_metrics(df_cv, rolling_window=1)
        rmses.append(df_p['rmse'].values[0])
    # Find the best parameters
    tuning_results = pd.DataFrame(all_params)

Interpreting results from linearmodels PanelOLS .predict() method

Suppose I have the following toy data:
import pandas as pd
from linearmodels.panel import PanelOLS

y = pd.DataFrame(
    index=[[1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3]],
    data=[70, 60, 50, 30, 33, 27],
    columns=["y"],
)
y.index.set_names(["Entity", "Time"], inplace=True)

x = pd.DataFrame(
    index=[[1, 1, 1, 2, 2, 2], [1, 2, 3, 1, 2, 3]],
    data=[[100], [89], [62], [29], [49], [23]],
    columns=["X"],
)
x.index.set_names(["Entity", "Time"], inplace=True)
And build a model using PanelOLS with entity_effects=True:
model_within = PanelOLS(dependent=y, exog=x, entity_effects=True).fit()
I then wanted to use the predict() method to see how a new "entity" would be modelled, first creating the new entity with:
new_x = pd.DataFrame(
    index=[[3, 3, 3], [1, 2, 3]],
    data=[[40], [70], [33]],
    columns=["X"],
)
new_x.index.set_names(["Entity", "Time"], inplace=True)
Then predicting with:
model_within.predict(new_x)
To get the following output:
             predictions
Entity Time
3      1       16.136230
       2       28.238403
       3       13.312390
According to Wooldridge, 2012, pg 485, the within estimator is estimating the time-demeaned equation y_it - ȳ_i = β (x_it - x̄_i) + (u_it - ū_i).
Since this models the deviation of expected y from that entity's own time average, how should the predictions be interpreted? My intuition is that the prediction is saying:
For this new entity, 3, in time period 1, given these X inputs, y at time 1 should be 16 units higher than its average y across all time for this entity. Is this interpretation correct? How might it be improved?
linearmodels .predict() documentation
Posting results from seeking clarification through the repo:
https://github.com/bashtage/linearmodels/issues/465
"The model is always Y=XB + epsilon + (eta_t ) + (nu_i ). The effects are treated as errors, and so when you predict you get new_x # params and so the entity effects are not used."
So the predictions are for actual values of y, not time-demeaned predictions. However, to achieve time-demeaned predictions, one can create the same model using data that has first been time-demeaned, and pass in new time-demeaned data to predict on.
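A minimal sketch of that time-demeaning approach (mine, not from the linked issue), reusing the y, x and new_x frames defined in the question; the within_transform helper name is just for illustration:

# Illustrative within-estimator sketch: fit on time-demeaned data, then demean the
# new entity's data the same way before predicting.
def within_transform(df):
    # Subtract each entity's own time average from every observation.
    return df - df.groupby(level="Entity").transform("mean")

y_within = within_transform(y)
x_within = within_transform(x)

# Entity effects are removed by the transform itself, so no entity_effects flag is needed.
model_demeaned = PanelOLS(dependent=y_within, exog=x_within).fit()

# These predictions are deviations from entity 3's own time average of y.
demeaned_preds = model_demeaned.predict(within_transform(new_x))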

How to get multi-label output after using tf.nn.sigmoid()

Purpose:
I want to get the label names directly from TensorFlow Serving at prediction time; my question is how to convert pred = tf.nn.sigmoid(last_layer_output) into the real label names.
Question Description:
I know how to do it for the multi-class case:
CLASSES = tf.constant(['a', 'b', 'c', 'd', 'e'])
pred = tf.nn.softmax(last_layer_output)
# pred is pretty similar to:
pred = [[0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]]
classes_value = tf.argmax(pred, 1)
classes_name = tf.gather(CLASSES, classes_value)
# classes_name: [b'b' b'd' b'e']
# batch_size = 3
So classes_name is what I want; I can use it to design the signature for TensorFlow Serving, and at prediction time I get the final labels.
But how do I do the same for the multi-label case?
e.g.
CLASSES = tf.constant(['a', 'b', 'c', 'd', 'e'])
pred = tf.nn.sigmoid(last_layer_output)
pred = tf.round(pred)
# pred is pretty similar to:
pred = [[0, 1, 1, 0, 1],  # 'b', 'c', 'e'
        [1, 0, 0, 1, 0],  # 'a', 'd'
        [1, 1, 1, 0, 1]]  # 'a', 'b', 'c', 'e'
How can I convert pred to label names? I can't do it after sess.run() or with another API like numpy, because this is for TensorFlow Serving; I think I need to do it with TensorFlow ops.
Any suggestions are appreciated !
You should first decide, given the per-class probabilities, which classes you want to return, for example every class whose probability is above 0.5.
Then you can use tf.where with the proper condition to get the indices, and then the same tf.gather to get the classes.
Like this:
indices = tf.where(tf.greater(pred, 0.5))
classes = tf.gather(CLASSES, indices[:, 1])
Then you need to re-organize the result using indices[:, 0], which tells you which example each class belongs to.
Also, be aware that the natural form for this answer is a SparseTensor, which is not well supported by Serving. So you may want to return two tensors (the strings, and indicators of which strings belong to which example) and handle the grouping on your client side.
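Here is a minimal sketch of that idea, assuming TF 2.x eager execution (in TF 1.x the same ops run inside a session); it uses a RaggedTensor instead of the SparseTensor mentioned above just to make the per-example grouping visible, and the probabilities are made up:

import tensorflow as tf

CLASSES = tf.constant(['a', 'b', 'c', 'd', 'e'])
# Illustrative sigmoid outputs for a batch of two examples.
pred = tf.constant([[0.1, 0.8, 0.9, 0.2, 0.7],
                    [0.9, 0.3, 0.1, 0.6, 0.4]])

indices = tf.where(tf.greater(pred, 0.5))   # (example_index, class_index) pairs
names = tf.gather(CLASSES, indices[:, 1])   # flat list of matching class names
# Group the flat names back per example using the example indices; pass nrows=batch_size
# if the last examples in the batch may have no class above the threshold.
per_example = tf.RaggedTensor.from_value_rowids(names, indices[:, 0])
print(per_example)  # roughly [[b'b', b'c', b'e'], [b'a', b'd']]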

Pandas sklearn one-hot encoding dataframe or numpy?

How can I transform a pandas data frame to sklearn one-hot-encoded (dataframe / numpy array) where some columns do not require encoding?
mydf = pd.DataFrame({'Target': [0, 1, 0, 0, 1, 1, 1],
                     'GroupFoo': [1, 1, 2, 2, 3, 1, 2],
                     'GroupBar': [2, 1, 1, 0, 3, 1, 2],
                     'GroupBar2': [2, 1, 1, 0, 3, 1, 2],
                     'SomeOtherShouldBeUnaffected': [2, 1, 1, 0, 3, 1, 2]})
columnsToEncode = ['GroupFoo', 'GroupBar']
mydf is an already label-encoded data frame, and I would like to encode only the columns marked by columnsToEncode.
My problem is that I am unsure whether a pd.DataFrame or the numpy array representation is better, and how to re-merge the encoded part with the rest.
My attempts so far:
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(X_train)

df = pd.concat([
    df[~columnsToEncode],  # select all other / numeric
    # select category to one-hot encode
    pd.Dataframe(encoder.transform(X_train[columnsToEncode]))  # .toarray() # not sure what this is for
], axis=1).reindex_axis(X_train.columns, axis=1)
Note: I am aware of pandas get_dummies (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html), but that does not play well with a train/test split where I require such an encoding per fold.
This library provides several categorical encoders which make sklearn / numpy play nicely with pandas: https://github.com/wdm0006/categorical_encoding
However, they do not yet support "handle unknown category".
For now I will use:
myEncoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
myEncoder.fit(df[columnsToEncode])

pd.concat([df.drop(columnsToEncode, 1),
           pd.DataFrame(myEncoder.transform(df[columnsToEncode]))], axis=1).reindex()
as this handles categories unseen during fitting. For now I will stick with half pandas, half numpy, because of the nice pandas labels for the numeric columns.
For One Hot Encoding I recommend using ColumnTransformer and OneHotEncoder instead of get_dummies. That's because OneHotEncoder returns an object which can be used to encode unseen samples using the same mapping that you used on your training data.
The following code encodes all the columns provided in the columns_to_encode variable:
import pandas as pd
import numpy as np

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
                   'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

columns_to_encode = [0, 2, 3]  # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)],
                       remainder='passthrough')
X = np.array(ct.fit_transform(X))
X:
array([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 100],
       [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 200],
       [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 300]], dtype=object)
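As a small follow-up (mine, not part of the original answer), the fitted ColumnTransformer can be reused on unseen rows, which is what makes it preferable to get_dummies for a train/test split; the new row below is made up:

# Encode a new, unseen row with the mapping learned above. Its categories must have
# been seen during fit, since OneHotEncoder defaults to handle_unknown='error'.
df_new = pd.DataFrame({'cat_1': ['B1'], 'num_1': [400],
                       'cat_2': ['A2'], 'cat_3': ['C3'],
                       'label': [1]})
X_new = np.array(ct.transform(df_new.iloc[:, :-1].values))
# X_new: roughly array([[0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 400]], dtype=object)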
To avoid multicollinearity due to the dummy variable trap, I would also suggest removing one of the columns returned for each column that you encoded. The following code encodes all the columns provided in the columns_to_encode variable AND removes the last column of each one-hot encoded group:
import pandas as pd
import numpy as np

def sum_prev(l_in):
    # Cumulative sums minus one: indices of the last dummy column of each encoded group.
    l_out = []
    l_out.append(l_in[0])
    for i in range(len(l_in) - 1):
        l_out.append(l_out[i] + l_in[i + 1])
    return [e - 1 for e in l_out]

df = pd.DataFrame({'cat_1': ['A1', 'B1', 'C1'], 'num_1': [100, 200, 300],
                   'cat_2': ['A2', 'B2', 'C2'], 'cat_3': ['A3', 'B3', 'C3'],
                   'label': [1, 0, 0]})
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

columns_to_encode = [0, 2, 3]  # Change here
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), columns_to_encode)],
                       remainder='passthrough')
# Turn the column indices into the indices of the dummy columns to delete.
columns_to_encode = [df.iloc[:, del_idx].nunique() for del_idx in columns_to_encode]
columns_to_encode = sum_prev(columns_to_encode)
X = np.array(ct.fit_transform(X))
X = np.delete(X, columns_to_encode, 1)
X:
array([[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 100],
       [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 200],
       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 300]], dtype=object)
I believe that this update to the initial answer is even better in order to perform dummy coding:
import logging

import pandas as pd
from sklearn.base import TransformerMixin

log = logging.getLogger(__name__)


class CategoricalDummyCoder(TransformerMixin):
    """Identifies categorical columns by dtype of object and dummy codes them.

    Optionally a pandas.DataFrame can be returned where categories are of pandas.Category
    dtype and not binarized, for better coding strategies than dummy coding."""

    def __init__(self, only_categoricals=False):
        self.categorical_variables = []
        self.categories_per_column = {}
        self.only_categoricals = only_categoricals

    def fit(self, X, y):
        self.categorical_variables = list(X.select_dtypes(include=['object']).columns)
        logging.debug(f'identified the following categorical variables: {self.categorical_variables}')
        for col in self.categorical_variables:
            self.categories_per_column[col] = X[col].astype('category').cat.categories
        logging.debug('fitted categories')
        return self

    def transform(self, X):
        for col in self.categorical_variables:
            logging.debug(f'transforming cat col: {col}')
            X[col] = pd.Categorical(X[col], categories=self.categories_per_column[col])
            if self.only_categoricals:
                X[col] = X[col].cat.codes
        if not self.only_categoricals:
            return pd.get_dummies(X, sparse=True)
        else:
            return X
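A hypothetical usage sketch (the toy frames are mine, not the answer's), showing why this fits the per-fold requirement from the question: the categories are learned once on the training fold and then applied to the test fold, so unseen categories do not create new columns.

train = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
test = pd.DataFrame({'color': ['blue', 'green'], 'size': [4, 5]})  # 'green' never seen in training

coder = CategoricalDummyCoder()
coder.fit(train, y=None)
train_coded = coder.transform(train.copy())
test_coded = coder.transform(test.copy())
# Both frames get the same dummy columns (color_blue, color_red); the unseen 'green'
# row simply gets zeros in every color dummy instead of introducing a new column.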

How to run hidden Markov models in Python with hmmlearn?

I tried to use hmmlearn from GitHub to run a binary hidden Markov model. This does not work:
import numpy as np
import hmmlearn.hmm as hmm

transmat = np.array([[0.7, 0.3],
                     [0.3, 0.7]])
emitmat = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
obs = np.array([0, 0, 1, 0, 0])
startprob = np.array([0.5, 0.5])

h = hmm.MultinomialHMM(n_components=2, startprob=startprob,
                       transmat=transmat)
h.emissionprob_ = emitmat
# fails
h.fit([0, 0, 1, 0, 0])
# fails
h.decode([0, 0, 1, 0, 0])
print h
I get this error:
ValueError: zero-dimensional arrays cannot be concatenated
What is the right way to use this module? Note I am using the version of hmmlearn that was separated from sklearn, because apparently sklearn doesn't maintain hmmlearn anymore.
fit accepts a list of sequences, not a single sequence (in general you can have multiple independent sequences observed from different runs of your experiments/observations). So simply put your list inside another list:
import hmmlearn.hmm as hmm
import numpy as np

transmat = np.array([[0.7, 0.3],
                     [0.3, 0.7]])
emitmat = np.array([[0.9, 0.1],
                    [0.2, 0.8]])
startprob = np.array([0.5, 0.5])

h = hmm.MultinomialHMM(n_components=2, startprob=startprob,
                       transmat=transmat)
h.emissionprob_ = emitmat
# works fine
h.fit([[0, 0, 1, 0, 0]])
# h.fit([[0, 0, 1, 0, 0], [0, 0], [1, 1, 1]])  # this is the reason for such syntax:
#                                              # you can fit to multiple sequences
print h.decode([0, 0, 1, 0, 0])
print h
gives
(-4.125363362578882, array([1, 1, 1, 1, 1]))
MultinomialHMM(algorithm='viterbi',
        init_params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
        n_components=2, n_iter=10,
        params='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ',
        random_state=<mtrand.RandomState object at 0x7fe245ac7510>,
        startprob=None, startprob_prior=1.0, thresh=0.01, transmat=None,
        transmat_prior=1.0)
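Note that this answer targets a very old hmmlearn. In recent releases the discrete model is hmm.CategoricalHMM, parameters are set as attributes, and observations are passed as a 2-D column of integer symbols (with an optional lengths argument for multiple sequences). A rough equivalent of the model above, assuming a recent hmmlearn version (this is my sketch, not part of the original answer):

import numpy as np
from hmmlearn import hmm

# Same toy HMM, written against the newer attribute-based API.
model = hmm.CategoricalHMM(n_components=2, init_params="")
model.startprob_ = np.array([0.5, 0.5])
model.transmat_ = np.array([[0.7, 0.3],
                            [0.3, 0.7]])
model.emissionprob_ = np.array([[0.9, 0.1],
                                [0.2, 0.8]])

obs = np.array([[0, 0, 1, 0, 0]]).T   # shape (n_samples, 1), one symbol per row
print(model.decode(obs))              # (log probability, most likely state sequence)

# Multiple sequences are concatenated and described by a lengths list, e.g.
# model.fit(np.concatenate([seq1, seq2]), lengths=[len(seq1), len(seq2)])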
