I am predicting IPL match win probability. While deploying the model with Streamlit, it shows this error:
AttributeError: 'ColumnTransformer' object has no attribute '_name_to_fitted_passthrough'
Here is my code:
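from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the categorical columns and pass the remaining columns through
trf = ColumnTransformer([('trf', OneHotEncoder(sparse=False, drop='first'),
                          ['batting_team', 'bowling_team', 'city'])], remainder='passthrough')
# Pipeline code
pipe = Pipeline(steps=[
    ('step1', trf),
    ('step2', LogisticRegression(solver='liblinear'))])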
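This kind of error is commonly reported when a pipeline pickled under one scikit-learn version is unpickled under a different one (for example, trained locally and then deployed with a newer release), because private attributes such as _name_to_fitted_passthrough exist only in some versions. Assuming that is the cause here, installing the same scikit-learn version in both environments should resolve it. Below is a minimal sketch of a guard that stores the library version next to the model; save_model and load_model are hypothetical helper names, not part of scikit-learn.

```python
import pickle


def save_model(model, path, library_version):
    """Pickle the model together with the library version used to fit it."""
    with open(path, "wb") as f:
        pickle.dump({"version": library_version, "model": model}, f)


def load_model(path, library_version):
    """Unpickle the model and fail early if the installed version differs."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    if bundle["version"] != library_version:
        raise RuntimeError(
            f"Model was saved with version {bundle['version']}, "
            f"but version {library_version} is installed."
        )
    return bundle["model"]
```

In the training script this could be called as save_model(pipe, "pipe.pkl", sklearn.__version__) and in the Streamlit app as load_model("pipe.pkl", sklearn.__version__), where "pipe.pkl" is only a placeholder file name.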
Related
Hi, I am trying to use the kds package from PyPI.
I installed it with: pip install kds
I didn't have any installation problems, but when I ran the following example script:
# REPRODUCIBLE EXAMPLE
# Load Dataset and train-test split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,random_state=3)
clf = tree.DecisionTreeClassifier(max_depth=1,random_state=3)
clf = clf.fit(X_train, y_train)
y_prob = clf.predict_proba(X_test)
# The magic happens here
import kds
kds.metrics.report(y_test, y_prob)
It gives an error:
AttributeError Traceback (most recent call last)
<ipython-input-4-fa00bcb248e7> in <module>
13 # The magic happens here
14 import kds
---> 15 kds.metrics.report(y_test, y_prob)
AttributeError: module 'kds' has no attribute 'metrics'
The issue is resolved in the latest version. Please upgrade the package using pip install --upgrade kds
Also, don't forget the y_prob[:,1] in the last line (predict_proba in scikit-learn returns one column per class, so select a single column).
# The magic happens here
import kds
kds.metrics.plot_lift(y_test, y_prob[:,1])
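To confirm that the upgrade actually took effect in the environment the notebook is running in, the installed version can be read with the standard library's importlib.metadata (Python 3.8+). A small sketch; installed_version is a hypothetical helper name:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None
```

Calling installed_version("kds") before and after the upgrade shows whether the interpreter in use sees the new release.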
When I run the following code, I get an error saying AttributeError: module 'python_utils' has no attribute 'clean_data', but I don't know how to fix it.
import python_utils
import pandas as pd
from sklearn import linear_model
def clean_data(data):
    # Fill missing numeric values with the column median
    data["Fare"] = data["Fare"].fillna(data["Fare"].dropna().median())
    data["Age"] = data["Age"].fillna(data["Age"].dropna().median())
    # Encode the categorical columns as integers
    data.loc[data["Sex"] == "male", "Sex"] = 0
    data.loc[data["Sex"] == "female", "Sex"] = 1
    data["Embarked"] = data["Embarked"].fillna("S")
    data.loc[data["Embarked"] == "S", "Embarked"] = 0
    data.loc[data["Embarked"] == "C", "Embarked"] = 1
    data.loc[data["Embarked"] == "Q", "Embarked"] = 2
train = pd.read_csv('train.csv')
python_utils.clean_data(train)  # this is the line that raises the AttributeError
target = train["Survived"].values
features = train[["Pclass", "Age", "Sex", "SibSp", "Parch"]].values
classifier = linear_model.LogisticRegression()
classifier = classifier.fit(features, target)
print(classifier.score(features, target))
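A likely cause, assuming clean_data is meant to be the function defined in this same script: python_utils is a separate installed module that does not define clean_data, so the call has to go to the local function by its plain name. The sketch below is one way to write it. It uses Series.map instead of repeated .loc assignments and returns a cleaned copy, both of which are choices made here rather than part of the original code, and the small DataFrame is made-up data standing in for train.csv.

```python
import pandas as pd


def clean_data(data):
    """Return a cleaned copy of a Titanic-style frame (columns as in the question)."""
    data = data.copy()
    # Fill missing numeric values with the column median (median skips NaN).
    data["Fare"] = data["Fare"].fillna(data["Fare"].median())
    data["Age"] = data["Age"].fillna(data["Age"].median())
    # Encode the categorical columns as integers.
    data["Sex"] = data["Sex"].map({"male": 0, "female": 1})
    data["Embarked"] = data["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return data


# Tiny made-up frame standing in for train.csv (illustrative values only).
train = pd.DataFrame({
    "Fare": [10.0, None, 30.0],
    "Age": [22.0, 38.0, None],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", None, "Q"],
})
train = clean_data(train)  # call the local function, not python_utils.clean_data
```

If clean_data really lives in a separate file of your own, that file would need to be the one Python imports under that module name; an installed package with the same name takes its place otherwise.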
I am trying to tokenize some numerical strings using a WordLevel/BPE tokenizer, create a data collator and eventually use it in a PyTorch DataLoader to train a new model from scratch.
However, I am getting an error
AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'
when running the following code
import torch
from transformers import DataCollatorForLanguageModeling
from tokenizers import ByteLevelBPETokenizer
from tokenizers.pre_tokenizers import Whitespace
from torch.utils.data import DataLoader, TensorDataset
data = ['4814 4832 4761 4523 4999 4860 4699 5024 4788 <unk>']
# Tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(data, vocab_size=1000, min_frequency=1,
                              special_tokens=[
                                  "<s>",
                                  "</s>",
                                  "<unk>",
                                  "<mask>",
                              ])
# Data Collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)
train_dataset = TensorDataset(torch.tensor(tokenizer(data, ......)))
# DataLoader
train_dataloader = DataLoader(
    train_dataset,
    collate_fn=data_collator
)
Is this error due to not having configured the pad_token_id for the tokenizer? If so, how can we do this?
Thanks!
Error trace:
AttributeError: Caught AttributeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/anaconda3/envs/x/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
data = fetcher.fetch(index)
File "/opt/anaconda3/envs/x/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/opt/anaconda3/envs/x/lib/python3.8/site-packages/transformers/data/data_collator.py", line 351, in __call__
if self.tokenizer.pad_token_id is not None:
AttributeError: 'ByteLevelBPETokenizer' object has no attribute 'pad_token_id'
Conda packages:
pytorch             1.7.0    py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
pytorch-lightning   1.2.5    pyhd8ed1ab_0                      conda-forge
tokenizers          0.10.1   pypi_0                            pypi
transformers        4.4.2    pypi_0                            pypi
The error tells you that the tokenizer needs an attribute called pad_token_id. You can either wrap the ByteLevelBPETokenizer in a class that has such an attribute (and then run into other missing attributes down the road) or use the wrapper class from the transformers library:
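from transformers import PreTrainedTokenizerFast
# ... your tokenizer training code ...
tokenizer_path = "tokenizer.json"  # placeholder path, choose your own
tokenizer.save(tokenizer_path)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)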
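For completeness, here is a fuller sketch of the wrapper approach, assuming the tokenizers and transformers packages are installed. The "<pad>" token, the temporary file location, and the variable name wrapped are additions made for this illustration and are not in the question. Declaring a pad token both at training time and in the wrapper gives pad_token_id a concrete value rather than None.

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

data = ["4814 4832 4761 4523 4999 4860 4699 5024 4788"]

# Train the low-level tokenizer, including a "<pad>" token (an addition here).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    data,
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Save to a single JSON file and reload through the transformers wrapper.
tokenizer_path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
tokenizer.save(tokenizer_path)
wrapped = PreTrainedTokenizerFast(
    tokenizer_file=tokenizer_path,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

# The wrapper exposes the attributes that the data collator looks up.
encoded = wrapped(data[0])
```

The wrapped object can then be passed as tokenizer= to DataCollatorForLanguageModeling in place of the raw ByteLevelBPETokenizer.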
I am getting an error
AttributeError: 'RandomForestClassifier' object has no attribute 'fit_transform'
However, there is a method named fit_transform(X,y) in sklearn.ensemble.RandomForestClassifier. This can be seen here
I don't understand why I am getting this error or how to resolve it.
Here is the code snippet:
from sklearn.ensemble import RandomForestClassifier
import pickle
import sys
import numpy as np
X1=np.array(pickle.load(open('X2g_train.p','rb')))
X2=np.array(pickle.load(open('X3g_train.p','rb')))
X3=np.array(pickle.load(open('X4g_train.p','rb')))
X4=np.array(pickle.load(open('Xhead_train.p','rb')))
X=np.hstack((X2,X1,X3,X4))
y = np.array(pickle.load(open('y.p','rb')))
rf=RandomForestClassifier(n_estimators=200)
Xr=rf.fit_transform(X,y)
There's no such method in the scikit-learn API documentation.
To train your model and get predictions, you need to do something like this:
rf = RandomForestClassifier()
# train the model
rf.fit(X_train, y_train)
# get predictions
predictions = rf.predict(X_test)
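If the goal of fit_transform was to reduce X to its most important features, which is what the transform method on tree ensembles did in old scikit-learn releases before it was removed, the replacement is SelectFromModel. A minimal sketch on synthetic data; the dataset from make_classification is a stand-in for the pickled arrays in the question:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the question's X and y.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Fit the forest and keep only features whose importance is at least the mean.
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
Xr = selector.fit_transform(X, y)

print(X.shape, "->", Xr.shape)
```

The threshold argument of SelectFromModel controls how many features survive; the default for estimators with feature_importances_ is the mean importance.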
I need to convert my random forest model into PMML format in Python. I've installed sklearn2pmml from GitHub and tried to create a PMML file. I ran the code below:
import pandas
import sklearn_pandas
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]), pandas.DataFrame(iris.target, columns = ["species"])), axis = 1)
iris_mapper = sklearn_pandas.DataFrameMapper([('sepal_length', None),
                                              ('sepal_width', None),
                                              ('petal_length', None),
                                              ('petal_width', None),
                                              ('species', None)])
iris = iris_mapper.fit_transform(iris_df)
from sklearn.ensemble import RandomForestClassifier
iris_X = iris[:, 0:4]
iris_y = iris[:, 4]
iris_classifier = RandomForestClassifier(n_estimators=10)
iris_classifier.fit(iris_X, iris_y)
from sklearn2pmml import sklearn2pmml
sklearn2pmml(iris_classifier, iris_mapper, "randomforest.pmml")
However, I get an error:
TypeError: The pipeline object is not an instance of PMMLPipeline
Any suggestions on what I am missing, or another way to create the PMML format?
TypeError: The pipeline object is not an instance of PMMLPipeline
The first argument of the sklearn2pmml function call must be an instance of sklearn2pmml.PMMLPipeline. You're passing an instance of sklearn.ensemble.RandomForestClassifier instead.
Any suggestions on what I am missing, or another way to create the PMML format?
You're pairing a pre-historic code example with the latest version of the sklearn2pmml library. These are your options:
Upgrade the code example to the latest sklearn2pmml library version. Please take two minutes to read through the "Usage" section of its README file.
Downgrade the sklearn2pmml library to version 0.13.0 (or older).
sklearn2pmml() needs a PMMLPipeline model, so try wrapping iris_classifier in a PMMLPipeline like this:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
d = load_iris()
iris_X = d.data
iris_y = d.target
iris_classifier = RandomForestClassifier(n_estimators=10)
# Wrap the classifier in a PMMLPipeline and fit the pipeline instead of the bare estimator
pipeline_model = PMMLPipeline([('iris_classifier', iris_classifier)]).fit(iris_X, iris_y)
# Export the fitted pipeline to a PMML file
sklearn2pmml(pipeline_model, 'rfc.pmml', with_repr = True)