Pickling a Complicated Trained FeatureUnion in Python

I have a complicated FeatureUnion consisting of several pipelines with custom and standard transformers.
I am trying to pickle the fitted FeatureUnion for later use, but I'm getting errors.
I fit and pickle my FeatureUnion as follows:
# Pickle fit pipeline
feature_union = feature_union.fit(X_train)
pickle.dump(feature_union, open("feature_union.p","wb"))
Elsewhere I load the pickled FeatureUnion and try to transform new data like this:
# Open fit pipeline and transform new data
feature_union = pickle.load(open("feature_union.p","rb"))
X_validation_enc = feature_union.transform(X_validation)
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-15-7b78df603a5a> in <module>
1 # Open fit pipeline
2
----> 3 feature_union = pickle.load(open("feature_union.p","rb"))
4
5 X_validation_enc = feature_union.transform(X_validation)
AttributeError: Can't get attribute 'column_selector' on <module '__main__'>
The pickle works when I have the entire FeatureUnion, pipeline, and transformer code in the 'new' (destination) script. Does that mean the only thing I can pickle is the fitted FeatureUnion object? The error suggests I need all the code in the new script, and all I'm loading is a fitted FeatureUnion object, so the only 'savings' is that I don't need to fit the FeatureUnion on training data. Is this correct? Is there some way to pickle so I can remove all the FeatureUnion/pipeline/transformer code from the new script?
My FeatureUnion consists of numerous custom and 'library-based' transformers and actions. In some instances I pass outside lists and variables into the class (transformer). All of these lists and variables are present in the new code.
At a loss.
If it helps, the structure of my featureunion, pipelines, and some code for the transformers is shown below.
Guidance appreciated.
The structure looks like this:
feature_union = FeatureUnion([
    ('cat_binary', pipeline_categorical_binary),
    ('cat_ordinal_string', pipeline_categorical_ordinal_string),
    ('cont', pipeline_continuous)
])
One of the pipelines has this structure:
pipeline_continuous = Pipeline(steps=[
    ('column_selector', column_selector(numeric_features)),
    ('numerical_impute', numerical_imputer(numerical_impute_approach)),
    ('continuous_transform', continuous_transformer(continuous_transform_dict, do_transform)),
    ('scaler', DFStandardScaler(perform_scaling))
])
Within the pipeline, I have custom and packaged transformers. For example, the 'continuous_transform' custom transformer referenced in the above pipeline log-transforms the data and looks like this:
# 3 Transform continuous features
class continuous_transformer(BaseEstimator, TransformerMixin):
    def __init__(self, type_transform, do_transform='No'):
        self.do_transform = do_transform
        self.type_transform = type_transform
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        if self.do_transform == 'Yes':
            for key, value in self.type_transform.items():
                if value == 'log_transform':
                    X[key] = X[key].apply(lambda x: np.log(x + 1.0))
                    X.rename(columns={key: 'log_' + key}, inplace=True)
            X_continuous_transformed_df = X
            return X_continuous_transformed_df
        else:
            return X
And the 'scaler' transformer uses the StandardScaler module and looks like this:
# 3 Standardize continuous features
class DFStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self, perform_scaling):
        self.ss = None
        self.perform = perform_scaling
    def fit(self, X, y=None):
        self.ss = StandardScaler().fit(X)
        return self
    def transform(self, X):
        if self.perform == 'Yes':
            Xss = self.ss.transform(X)
            X_continuous_scaled_df = pd.DataFrame(Xss, index=X.index, columns=X.columns)
            return X_continuous_scaled_df
        else:
            X_continuous_scaled_df = X
            return X_continuous_scaled_df
The above hierarchy is well defined in my code.
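One pattern worth noting, hedged as a general pickle rule rather than a guaranteed fix for this code: pickle stores custom classes by reference (module name plus class name), so a transformer defined in the training script is recorded as living in __main__, and the destination script can only unpickle it if the same name exists there. Moving the transformer classes into a small importable module removes the need to repeat their definitions in every script. A minimal sketch, assuming a hypothetical module named custom_transformers.py that holds column_selector and the other classes:

# --- custom_transformers.py (hypothetical module holding the class definitions) ---
# from sklearn.base import BaseEstimator, TransformerMixin
# class column_selector(BaseEstimator, TransformerMixin): ...
# class continuous_transformer(BaseEstimator, TransformerMixin): ...

# --- training script ---
import pickle
from custom_transformers import column_selector  # classes now resolve to custom_transformers, not __main__

# ... build feature_union from the imported classes, then:
feature_union = feature_union.fit(X_train)
with open("feature_union.p", "wb") as f:
    pickle.dump(feature_union, f)

# --- destination script ---
# No class definitions needed here: as long as custom_transformers.py is importable
# (on the path or installed), pickle re-imports it automatically while loading.
import pickle

with open("feature_union.p", "rb") as f:
    feature_union = pickle.load(f)
X_validation_enc = feature_union.transform(X_validation)

The trade-off is that the module must still be available wherever the pickle is loaded; the sketch only moves the code out of the scripts, it does not remove the dependency.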

Related

Creating a custom dataset class like sst2 throws `Cannot index by location index with a non-integer key`

I'm trying to experiment with a PyTorch model; the dataset used in the original experiment is SST.
But I'm also learning PyTorch, so I thought it would be better to play with the Dataset class and create my own dataset.
So this was my approach:
class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.column_names = ['text', 'label']
    def __getitem__(self, index):
        print('index: ', index)
        row = self.dataframe.iloc[index].to_numpy()
        features = row[1:]
        label = row[0]
        return features, label
    def __len__(self):
        return len(self.dataframe)
df = pd.DataFrame(np.array([
    ["hello", 0],
    ["sex", 1],
    ["beshi kore sex", 1],
]), columns=['text', 'label'])
dataset = CustomDataset(dataframe=df)
Instead of creating sub-categories like validation/test/train, I'm just trying to create one custom Dataset class at first.
And it keeps giving me Cannot index by location index with a non-integer key. During development I tried df.iloc[0].to_numpy(), and that works absolutely fine, but __getitem__ is being passed index: text for some reason. I even tried adding an 'id' column.
I'm sure there must be some other way to achieve this. How can I resolve this issue? My code worked fine for sst, but it's not working any longer. I'm pretty sure this is not a one-to-one mapping.
Complete code:
#!pip install sentence_transformers -q
#!pip install setfit -q
from sentence_transformers.losses import CosineSimilarityLoss
from torch.utils.data import Dataset
import pandas as pd
import numpy as np
from setfit import SetFitModel, SetFitTrainer, sample_dataset
class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.column_names = ['id', 'text', 'label']
    def __getitem__(self, index):
        print('index: ', index)
        row = self.dataframe.iloc[index].to_numpy()
        features = row[1:]
        label = row[0]
        return features, label
    def __len__(self):
        return len(self.dataframe)
df = pd.DataFrame(np.array([
    [1, "hello", 0],
    [2, "sex", 1],
    [3, "beshi kore sex", 1],
]), columns=['id', 'text', 'label'])
# df.head()
dataset = CustomDataset(dataframe=df)
# Load a dataset from the Hugging Face Hub
# dataset = load_dataset("sst2") # HERE, previously I was simply using sst/sst2
# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = dataset
eval_dataset = dataset
# Load a SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=1,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
)
# Train and evaluate
trainer.train()
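One hedged guess at the root cause, based only on the error shown: SetFitTrainer works with a Hugging Face datasets.Dataset, which it indexes by column name (e.g. dataset["text"]), not a PyTorch Dataset. With the PyTorch class above, __getitem__ therefore receives the string 'text', and df.iloc['text'] raises the non-integer-key error. A minimal sketch of converting the DataFrame instead, assuming the datasets library is installed and using placeholder rows:

import pandas as pd
from datasets import Dataset  # Hugging Face datasets library

# Placeholder data with the same 'text' / 'label' columns as above.
df = pd.DataFrame(
    [["hello", 0], ["some other text", 1], ["more text", 1]],
    columns=["text", "label"],
)

# A datasets.Dataset supports column access (dataset["text"], dataset["label"]),
# which is how SetFitTrainer reads its training data.
dataset = Dataset.from_pandas(df)

train_dataset = dataset
eval_dataset = dataset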

'MeanEmbeddingVectorizer' object has no attribute 'transform'

Hello, I'm working with text classification.
I have a dataset with 2 columns: one contains text and the other is the label.
Since I'm a beginner, I'm following a Word2Vec tutorial step by step, trying to understand whether it can work for my use case, but I keep getting this error.
This is my code:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))

def fit(self, X, y):
    return self

def transform(self, X):
    return np.array([
        np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                or [np.zeros(self.dim)], axis=0)
        for words in X
    ])
train_df['clean_text_tok']=[nltk.word_tokenize(i) for i in train_df['clean_text']]
model = Word2Vec(train_df['clean_text_tok'],min_count=1)
w2v = dict(zip(model.wv.index_to_key, model.wv.vectors))
modelw = MeanEmbeddingVectorizer(w2v)
# converting text to numerical data using Word2Vec
X_train_vectors_w2v = modelw.transform(X_train_tok)
X_val_vectors_w2v = modelw.transform(X_test_tok)
The error I'm getting is:
Dimension: 100
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-127-289141692350> in <module>
4 modelw = MeanEmbeddingVectorizer(w2v)
5 # converting text to numerical data using Word2Vec
----> 6 X_train_vectors_w2v = modelw.transform(X_train_tok)
7 X_val_vectors_w2v = modelw.transform(X_test_tok)
AttributeError: 'MeanEmbeddingVectorizer' object has no attribute 'transform'
If your MeanEmbeddingVectorizer is defined in your code exactly as it's shown here, the failure to indent the .fit() and .transform() functions means they're not part of the class, as you likely intended.
Indenting each of them an extra 4 spaces – as was likely the intent of whatever source you copied this code from! – will put them "inside" the MeanEmbeddingVectorizer class, as class methods. Then objects of that class won't give the same "no attribute" error.
For example:
class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

How to make predictions from a trained PyTorch and TorchText model?

Generally speaking, after successfully training a text RNN model with PyTorch, using TorchText to handle data loading from the original source, I would like to test it with other data sets (a sort of blink test) that come from different sources but have the same text format.
First I defined a class to handle the data loading.
class Dataset(object):
    def __init__(self, config):
        # init what I need

    def load_data(self, df: pd.DataFrame, *args):
        # implementation below
        # Data format like `(LABEL, TEXT)`

    def load_data_but_error(self, df: pd.DataFrame):
        # implementation below
        # Data format like `(TEXT)`
Here is the detail of load_data, with which I load the data that trained successfully:
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=self.config.max_sen_len)
LABEL = data.Field(sequential=False, use_vocab=False)
datafields = [(label_col, LABEL), (data_col, TEXT)]
# split my data to train/test
train_df, test_df = train_test_split(df, test_size=0.33, random_state=random_state)
train_examples = [data.Example.fromlist(i, datafields) for i in train_df.values.tolist()]
train_data = data.Dataset(train_examples, datafields)
# split train to train/val
train_data, val_data = train_data.split(split_ratio=0.8)
# build vocab
TEXT.build_vocab(train_data, vectors=Vectors(w2v_file))
self.word_embeddings = TEXT.vocab.vectors
self.vocab = TEXT.vocab
test_examples = [data.Example.fromlist(i, datafields) for i in test_df.values.tolist()]
test_data = data.Dataset(test_examples, datafields)
self.train_iterator = data.BucketIterator(
    (train_data),
    batch_size=self.config.batch_size,
    sort_key=lambda x: len(x.title),
    repeat=False,
    shuffle=True)
self.val_iterator, self.test_iterator = data.BucketIterator.splits(
    (val_data, test_data),
    batch_size=self.config.batch_size,
    sort_key=lambda x: len(x.title),
    repeat=False,
    shuffle=False)
Next is my code (load_data_but_error) to load the other source, which causes the error:
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True, fix_length=self.config.max_sen_len)
datafields = [('title', TEXT)]
examples = [data.Example.fromlist(i, datafields) for i in df.values.tolist()]
blink_test = data.Dataset(examples, datafields)
self.blink_test = data.BucketIterator(
    (blink_test),
    batch_size=self.config.batch_size,
    sort_key=lambda x: len(x.title),
    repeat=False,
    shuffle=True)
When executing the code, I got the error AttributeError: 'Field' object has no attribute 'vocab'. There is an existing question about that error here, but it doesn't match my situation: I already built the vocab in load_data and want to reuse it for the blink tests.
My question is: what is the correct way to load and feed new data to a trained PyTorch model for testing?
What I needed was to:
keep TEXT from load_data and reuse it in load_data_but_error by assigning it to an instance variable, and
add train=True to the data.BucketIterator call in load_data_but_error (a sketch of both changes is shown below).
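A sketch of those two changes, reusing the code from the question (the attribute name self.TEXT is an assumption about how the class might store the fitted Field; this is not a verified torchtext recipe):

# In load_data, keep the fitted Field on the instance so its vocab can be reused:
self.TEXT = TEXT

# In load_data_but_error, build the new Dataset with the already-fitted Field
# instead of creating a fresh one (which would have no vocab):
datafields = [('title', self.TEXT)]
examples = [data.Example.fromlist(i, datafields) for i in df.values.tolist()]
blink_test = data.Dataset(examples, datafields)
self.blink_test = data.BucketIterator(
    blink_test,
    batch_size=self.config.batch_size,
    sort_key=lambda x: len(x.title),
    repeat=False,
    shuffle=True,
    train=True)  # the second point above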
Not really sure, but considering you have re-defined TEXT, you will have to explicitly create the vocab for your Field TEXT again. This can be done as follows:
TEXT.build_vocab(examples, min_freq = 2)
This particular statement adds a word from your data to the vocab only if it occurs at least twice in your dataset examples; you can change that as per your requirements.
You can read about build_vocab method at https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field.build_vocab.

Expected output of a step in a scikit-learn feature union

I have a scikit-learn pipeline which includes a feature union, like so:
from sklearn.pipeline import Pipeline, FeatureUnion
pipeline = Pipeline([
    ('feats', FeatureUnion([
        ('Vec', Doc2vec()),
        ('Counter', I_counter()),
    ])),
    ('clf', LogisticRegression())  # classifier
])
Each of the two steps in the feature union is a class I've written myself. The first is a self-written vectorizer based on the Gensim Doc2Vec model. Full code here
If I understand the feature union documentation correctly, it runs each step in parallel and concatenates the output vectors into a single vector passed to the next step (the clf step in this case).
I wrote each class to return a single numpy array; however, the above code triggers an error:
TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None, steps=[('vec', Doc2vec())])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't
If I understand the error correctly (?), it's stating that the Doc2vec class is not outputting a suitable feature?
The Doc2vec class outputs a single numpy array containing a 100-dimensional vector for each inputted text sequence. I naively assumed it would simply be concatenated with the I_counter output and everything would work happily.
Might someone be able to highlight where my logic is wrong?
--
EDIT, more code
class Doc2vec(BaseEstimator, TransformerMixin):
def fit(self, x, y=None):
return self
def vec(data):
print('starting')
SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')
alldocs = []
for line_no, line in data.iterrows():
#tokens = gensim.utils.to_unicode(line).split()
words = gensim.utils.simple_preprocess(line['post'])
tags = [line_no] # 'tags = [tokens[0]]' would also work at extra memory cost
split = ['train', 'test'][line_no//1200] # 25k train, 25k test, 25k extra
if gensim.utils.simple_preprocess(line['type']) == ['depression']:
sentiment = (1.0)
else:
sentiment = (0.0)
alldocs.append(SentimentDocument(words, tags, split, sentiment))
train_docs = [doc for doc in alldocs if doc.split == 'train']
test_docs = [doc for doc in alldocs if doc.split == 'test']
#print('%d docs: %d train-sentiment, %d test-sentiment' % (len(alldocs), len(train_docs), len(test_docs)))
from random import shuffle
doc_list = alldocs[:]
shuffle(doc_list)
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"
simple_models = [
# PV-DM w/ default averaging; a higher starting alpha may improve CBOW/PV-DM modes
Doc2Vec(dm=1, vector_size=100, window=10, negative=5, hs=0, min_count=2, sample=0,
epochs=20, workers=cores, alpha=0.05, comment='alpha=0.05')
]
for model in simple_models:
model.build_vocab(train_docs)
#print("%s vocabulary scanned & state initialized" % model)
models_by_name = OrderedDict((str(model), model) for model in simple_models)
model.train(train_docs, total_examples=len(train_docs), epochs=model.epochs)
train_targets, train_regressors = zip(*[(doc.words, doc.sentiment) for doc in train_docs])
import numpy as np
X = []
for i in range(len(train_targets)):
X.append(model.infer_vector(train_targets[i]))
train_x = np.asarray(X)
print(type(train_x))
return(train_x)
class I_counter(BaseEstimator, TransformerMixin):
def fit(self, x, y=None):
return self
​
​
def transform(self, data):
def i_count(name):
tokens = nltk.word_tokenize(name)
count = tokens.count("I")
count2 = tokens.count("i")
return(count+count2)
vecfunc = np.vectorize(i_count)
data = np.transpose(np.matrix(data['post']))
result = vecfunc(data)
return result
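A hedged observation on the code as posted: the error complains that a step does not implement transform, and the Doc2vec class above only defines fit and a method named vec(data), whereas FeatureUnion looks specifically for transform (or fit_transform). A minimal sketch of the required shape, with a placeholder body standing in for the Doc2Vec logic shown above:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Doc2vec(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data):
        # The inference logic from vec() above would go here; the essential points
        # are the method name, the explicit self parameter, and returning a 2-D
        # array with one row per input document so FeatureUnion can stack it
        # alongside the I_counter output.
        return np.zeros((len(data), 100))  # placeholder for the inferred vectors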

scikit-learn: How to compose LabelEncoder and OneHotEncoder with a pipeline?

While preprocessing the labels for a machine learning classification task, I need to one-hot encode labels that take string values. It happens that OneHotEncoder from sklearn.preprocessing and to_categorical from keras.np_utils require int inputs. This means I need to precede the one-hot encoder with a LabelEncoder. I have done it by hand with a custom class:
class LabelOneHotEncoder():
    def __init__(self):
        self.ohe = OneHotEncoder()
        self.le = LabelEncoder()
    def fit_transform(self, x):
        features = self.le.fit_transform(x)
        return self.ohe.fit_transform(features.reshape(-1, 1))
    def transform(self, x):
        return self.ohe.transform(self.le.transform(x).reshape(-1, 1))
    def inverse_transform(self, x):
        return self.le.inverse_transform(self.ohe.inverse_transform(x))
    def inverse_labels(self, x):
        return self.le.inverse_transform(x)
I am confident there must be a way of doing it within the sklearn API using a sklearn.pipeline, but when using:
LabelOneHotEncoder = Pipeline( [ ("le",LabelEncoder), ("ohe", OneHotEncoder)])
I get the error ValueError: bad input shape () from the OneHotEncoder. My guess is that the output of the LabelEncoder needs to be reshaped, by adding a trivial second axis. I am not sure how to add this feature though.
It's strange that they don't play together nicely... I'm surprised. I'd extend the class to return the reshaped data like you suggested.
class ModifiedLabelEncoder(LabelEncoder):
def fit_transform(self, y, *args, **kwargs):
return super().fit_transform(y).reshape(-1, 1)
def transform(self, y, *args, **kwargs):
return super().transform(y).reshape(-1, 1)
Then using the pipeline should work.
pipe = Pipeline([("le", ModifiedLabelEncoder()), ("ohe", OneHotEncoder())])
pipe.fit_transform(['dog', 'cat', 'dog'])
https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/preprocessing/label.py#L39
From scikit-learn 0.20, OneHotEncoder accepts strings, so you don't need a LabelEncoder before it anymore. And you can just use it in a pipeline.
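For example, a minimal sketch of that newer usage with made-up data:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X = [["dog"], ["cat"], ["dog"]]
y = [1, 0, 1]

# OneHotEncoder (scikit-learn >= 0.20) encodes string categories directly,
# so no LabelEncoder step is needed in front of it.
pipe = Pipeline([
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)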
I have used a customized class to wrap my label encoder function and it returns the whole updated dataset.
class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        le = LabelEncoder()
        for i in X[cat_cols]:
            X[i] = le.fit_transform(X[i])
        return X

cat_cols = ['Family', 'Education', 'Securities Account', 'CDAccount', 'Online', 'CreditCard']
le_ct = make_column_transformer((CustomLabelEncode(), cat_cols), remainder='passthrough')
pd.DataFrame(le_ct.fit_transform(X))  # This will show you your changes
Final_pipeline = make_pipeline(le_ct)
I have implemented it; you can see it at my GitHub link: https://github.com/Ayushmina-20/sklearn_pipeline
This is not exactly what was asked, but to apply just a LabelEncoder to all non-numeric columns you can use the format below:
df_non_numeric = df.select_dtypes(['object'])
non_numeric_cols = df_non_numeric.columns.values

from sklearn.preprocessing import LabelEncoder
for col in non_numeric_cols:
    df[col] = LabelEncoder().fit_transform(df[col].values)
df.head()
