ValueError: blocks[0,:] has incompatible row dimensions


I'm trying to extract a few text features (word_count, char_count, ...) and TF-IDF from a Twitter dataset for sentiment analysis, using sklearn's FeatureUnion to combine them and pass them to a classifier in a Pipeline.
I'm getting the following error: ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,8].shape[0] == 7920, expected 1. Here is the code:
features_union = FeatureUnion(transformer_list=[
    ('word_count', WordCalculator()),
    ('char_count', CharCalculator()),
    ('avg_word_len', AvdWordLengthCalculater()),
    ('stop_words_count', StopWordsCalculater()),
    ('spl_char_count', SplCharCalculater()),
    ('hash_tag_count', HashTagCalculator()),
    ('num_count', NumericsCalculator()),
    ('cap_letter_count', CapsCalculator()),
    ('tfidf_feature', Pipeline([
        ('preprocessor', Preprocessor()),
        ('selector', ItemSelector('tweet')),
        ('count', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
])
pipeline = Pipeline([
    ('noise_remover', UrlRemover()),
    ('features', features_union),
    ('model', MultinomialNB()),
])
pipeline.fit(train, train['label'])
Here is the complete error log
ValueError Traceback (most recent call last)
<ipython-input-33-bb532fc90bb0> in <module>
14 ('features', features_union),
15 ('model', MultinomialNB())])
---> 16 pipeline.fit(train, train['label'])
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
348 This estimator
349 """
--> 350 Xt, fit_params = self._fit(X, y, **fit_params)
351 with _print_elapsed_time('Pipeline',
352 self._log_message(len(self.steps) - 1)):
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
313 message_clsname='Pipeline',
314 message=self._log_message(step_idx),
--> 315 **fit_params_steps[name])
316 # Replace the transformer of the step with the fitted
317 # transformer. This is necessary when loading the transformer
~/opt/anaconda3/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
353
354 def __call__(self, *args, **kwargs):
--> 355 return self.func(*args, **kwargs)
356
357 def call_and_shelve(self, *args, **kwargs):
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
726 with _print_elapsed_time(message_clsname, message):
727 if hasattr(transformer, 'fit_transform'):
--> 728 res = transformer.fit_transform(X, y, **fit_params)
729 else:
730 res = transformer.fit(X, y, **fit_params).transform(X)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
943
944 if any(sparse.issparse(f) for f in Xs):
--> 945 Xs = sparse.hstack(Xs).tocsr()
946 else:
947 Xs = np.hstack(Xs)
~/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
~/opt/anaconda3/lib/python3.7/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
584 exp=brow_lengths[i],
585 got=A.shape[0]))
--> 586 raise ValueError(msg)
587
588 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,8].shape[0] == 7920, expected 1.
data set sample:
0 1 0 #fingerprint #Pregnancy Test https://google.com...
1 2 0 Finally a transparant silicon case ^^ Thanks t...
2 3 0 We love this! Would you go? #talk #makememorie...
3 4 0 I'm wired I know I'm George I was made that wa...
4 5 1 What amazing service! Apple won't even talk to...
dataset shape - (7920, 3)
Any help with this would be appreciated.
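The calculator classes aren't shown, so what follows is a guess rather than a diagnosis. The message says block 8 (the TF-IDF pipeline) has 7920 rows while the first block has 1, which is what one would expect if the custom calculators return a 1-D result of length 7920 that ends up being treated as a single row. Below is a minimal sketch of a calculator that returns a 2-D column of shape (n_samples, 1) instead. WordCountSketch, select_tweet, and the toy data are hypothetical stand-ins, not the original classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


class WordCountSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for WordCalculator: one word count per tweet."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        counts = X["tweet"].str.split().str.len()
        # Return shape (n_samples, 1) rather than (n_samples,), so the block
        # has as many rows as the TF-IDF block it is stacked next to.
        return counts.to_numpy().reshape(-1, 1)


def select_tweet(frame):
    # Text vectorizers expect a 1-D sequence of strings, so pass a Series.
    return frame["tweet"]


# Toy data; the real dataset has 7920 rows.
df = pd.DataFrame(
    {"tweet": ["We love this", "What amazing service", "Finally a case"]}
)

union = FeatureUnion([
    ("word_count", WordCountSketch()),
    ("tfidf", Pipeline([
        ("selector", FunctionTransformer(select_tweet)),
        ("tfidf", TfidfVectorizer()),
    ])),
])

features = union.fit_transform(df)
print(features.shape)  # one row per tweet
```

If every calculator follows the same pattern, each block handed to the horizontal stack has n_samples rows and the row dimensions agree.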

Related

StackingClassifier Raises Exception 'numpy.ndarray' object has no attribute 'columns'

I am trying to train a StackingClassifier in Sklearn, but I keep running into this error where the fit method seems to be complaining about me having passed it numpy arrays. To my knowledge, this is how all the fit methods in sklearn are supposed to work. I read and followed the example from the documentation and expanded on it to include a more complex and comprehensive pipeline that would process categorical, ordinal, scalar, and text data.
Sorry in advance for the lengthy code sample, but I felt it was necessary to provide a complete reproducible example. Breaking the pipeline down into its constituent estimators and testing those individually did not raise any exceptions, so I figure that the error somehow comes from the combined estimator.
Select Features
categorical_data = [
"race",
"gender",
"admission_type_id",
"discharge_disposition_id",
"admission_source_id",
"insulin",
"diabetesMed",
"change",
"payer_code",
"A1Cresult",
"metformin",
"repaglinide",
"nateglinide",
"chlorpropamide",
"glimepiride",
"glipizide",
"glyburide",
"tolbutamide",
"pioglitazone",
"rosiglitazone",
"acarbose",
"miglitol",
"tolazamide",
"glyburide.metformin",
"glipizide.metformin",
]
ordinal_data = [
"age"
]
scalar_data = [
"num_medications",
"time_in_hospital",
"num_lab_procedures",
"num_procedures",
"number_outpatient",
"number_emergency",
"number_inpatient",
"number_diagnoses",
]
text_data = [
"diag_1_desc",
"diag_2_desc",
"diag_3_desc"
]
Create Column Transformers
impute_trans = compose.make_column_transformer(
(
impute.SimpleImputer(
strategy="constant",
fill_value="missing"
),
categorical_data
)
)
encode_trans = compose.make_column_transformer(
(
preprocessing.OneHotEncoder(
sparse=False,
handle_unknown="ignore"
),
categorical_data
),
(
preprocessing.OrdinalEncoder(),
ordinal_data
)
)
scalar_trans = compose.make_column_transformer(
(preprocessing.StandardScaler(), scalar_data),
)
text_trans = compose.make_column_transformer(
(TfidfVectorizer(ngram_range=(1,2)), "diag_1_desc"),
(TfidfVectorizer(ngram_range=(1,2)), "diag_2_desc"),
(TfidfVectorizer(ngram_range=(1,2)), "diag_3_desc"),
)
Create Estimators
cat_pre_pipe = make_pipeline(impute_trans, encode_trans)
logreg = LogisticRegression(
solver = "saga",
penalty="elasticnet",
l1_ratio=0.5,
max_iter=1000
)
text_pipe = make_pipeline(text_trans, logreg)
scalar_pipe = make_pipeline(scalar_trans, logreg)
cat_pipe = make_pipeline(cat_pre_pipe, logreg)
estimators = [
("cat", cat_pipe),
("text", text_pipe),
("scalar", scalar_pipe)
]
Create Stacking Classifier
stack_clf = StackingClassifier(
estimators=estimators,
final_estimator=logreg
)
diabetes_data = pd.read_csv("8k_diabetes.csv", delimiter=',')
x_train, x_test, y_train, y_test = train_test_split(
pd.concat([
preprocess_dataframe(diabetes_data[text_data]),
diabetes_data[categorical_data + scalar_data]
], axis=1),
diabetes_data["readmitted"].astype(int)
)
stack_clf.fit(x_train, y_train)
Complete Stack Trace
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/__init__.py:409, in _get_column_indices(X, key)
408 try:
--> 409 all_columns = X.columns
410 except AttributeError:
AttributeError: 'numpy.ndarray' object has no attribute 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [19], in <cell line: 1>()
----> 1 stack_clf.fit(x_train, y_train)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py:488, in StackingClassifier.fit(self, X, y, sample_weight)
486 self._le = LabelEncoder().fit(y)
487 self.classes_ = self._le.classes_
--> 488 return super().fit(X, self._le.transform(y), sample_weight)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_stacking.py:158, in _BaseStacking.fit(self, X, y, sample_weight)
153 stack_method = [self.stack_method] * len(all_estimators)
155 # Fit the base estimators on the whole training data. Those
156 # base estimators will be used in transform, predict, and
157 # predict_proba. They are exposed publicly.
--> 158 self.estimators_ = Parallel(n_jobs=self.n_jobs)(
159 delayed(_fit_single_estimator)(clone(est), X, y, sample_weight)
160 for est in all_estimators
161 if est != "drop"
162 )
164 self.named_estimators_ = Bunch()
165 est_fitted_idx = 0
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:1043, in Parallel.__call__(self, iterable)
1034 try:
1035 # Only set self._iterating to True if at least a batch
1036 # was dispatched. In particular this covers the edge
(...)
1040 # was very quick and its callback already dispatched all the
1041 # remaining jobs.
1042 self._iterating = False
-> 1043 if self.dispatch_one_batch(iterator):
1044 self._iterating = self._original_iterator is not None
1046 while self.dispatch_one_batch(iterator):
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:861, in Parallel.dispatch_one_batch(self, iterator)
859 return False
860 else:
--> 861 self._dispatch(tasks)
862 return True
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:779, in Parallel._dispatch(self, batch)
777 with self._lock:
778 job_idx = len(self._jobs)
--> 779 job = self._backend.apply_async(batch, callback=cb)
780 # A job can complete so quickly than its callback is
781 # called before we get here, causing self._jobs to
782 # grow. To ensure correct results ordering, .insert is
783 # used (rather than .append) in the following line
784 self._jobs.insert(job_idx, job)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/_parallel_backends.py:208, in SequentialBackend.apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/_parallel_backends.py:572, in ImmediateResult.__init__(self, batch)
569 def __init__(self, batch):
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:262, in BatchedCalls.__call__(self)
258 def __call__(self):
259 # Set the default nested backend to self._backend but do not set the
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/parallel.py:262, in <listcomp>(.0)
258 def __call__(self):
259 # Set the default nested backend to self._backend but do not set the
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/fixes.py:216, in _FuncWrapper.__call__(self, *args, **kwargs)
214 def __call__(self, *args, **kwargs):
215 with config_context(**self.config):
--> 216 return self.function(*args, **kwargs)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/ensemble/_base.py:42, in _fit_single_estimator(estimator, X, y, sample_weight, message_clsname, message)
40 else:
41 with _print_elapsed_time(message_clsname, message):
---> 42 estimator.fit(X, y)
43 return estimator
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:390, in Pipeline.fit(self, X, y, **fit_params)
364 """Fit the model.
365
366 Fit all the transformers one after the other and transform the
(...)
387 Pipeline with fitted steps.
388 """
389 fit_params_steps = self._check_fit_params(**fit_params)
--> 390 Xt = self._fit(X, y, **fit_params_steps)
391 with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
392 if self._final_estimator != "passthrough":
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:348, in Pipeline._fit(self, X, y, **fit_params_steps)
346 cloned_transformer = clone(transformer)
347 # Fit or load from cache the current transformer
--> 348 X, fitted_transformer = fit_transform_one_cached(
349 cloned_transformer,
350 X,
351 y,
352 None,
353 message_clsname="Pipeline",
354 message=self._log_message(step_idx),
355 **fit_params_steps[name],
356 )
357 # Replace the transformer of the step with the fitted
358 # transformer. This is necessary when loading the transformer
359 # from the cache.
360 self.steps[step_idx] = (name, fitted_transformer)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/joblib/memory.py:349, in NotMemorizedFunc.__call__(self, *args, **kwargs)
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:893, in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/pipeline.py:434, in Pipeline.fit_transform(self, X, y, **fit_params)
432 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
433 if hasattr(last_step, "fit_transform"):
--> 434 return last_step.fit_transform(Xt, y, **fit_params_last_step)
435 else:
436 return last_step.fit(Xt, y, **fit_params_last_step).transform(Xt)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py:672, in ColumnTransformer.fit_transform(self, X, y)
670 self._check_n_features(X, reset=True)
671 self._validate_transformers()
--> 672 self._validate_column_callables(X)
673 self._validate_remainder(X)
675 result = self._fit_transform(X, y, _fit_transform_one)
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/compose/_column_transformer.py:352, in ColumnTransformer._validate_column_callables(self, X)
350 columns = columns(X)
351 all_columns.append(columns)
--> 352 transformer_to_input_indices[name] = _get_column_indices(X, columns)
354 self._columns = all_columns
355 self._transformer_to_input_indices = transformer_to_input_indices
File ~/anaconda3/envs/assignment2/lib/python3.8/site-packages/sklearn/utils/__init__.py:411, in _get_column_indices(X, key)
409 all_columns = X.columns
410 except AttributeError:
--> 411 raise ValueError(
412 "Specifying the columns using strings is only "
413 "supported for pandas DataFrames"
414 )
415 if isinstance(key, str):
416 columns = [key]
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Full Pipeline Diagram

Your categorical pipeline chains two column transformers together. The first one outputs a numpy array, so the second one can no longer select columns by name as you've requested. Notice that the final error message is the more informative one here: ValueError: Specifying the columns using strings is only supported for pandas DataFrames.
For this reason, I'd suggest using one column transformer with separate pipelines instead of one pipeline with multiple column transformers.
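A minimal sketch of that layout, assuming a cut-down version of the question's column lists and a made-up toy frame: imputation and one-hot encoding are chained inside a pipeline, and that pipeline is one branch of a single column transformer, so column names are only ever looked up on the original DataFrame. The `sparse=False` argument from the question is left out here because its name differs between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Shortened, illustrative column lists (the question has many more).
categorical_data = ["race", "gender"]
ordinal_data = ["age"]

# Impute, then encode: both steps live in one pipeline for one column group.
cat_steps = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

# A single column transformer selects by name from the DataFrame.
preprocess = make_column_transformer(
    (cat_steps, categorical_data),
    (OrdinalEncoder(), ordinal_data),
)

cat_pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# Toy data, invented for this sketch.
toy = pd.DataFrame({
    "race": ["A", np.nan, "B", "A"],
    "gender": ["F", "M", "F", "M"],
    "age": ["[10-20)", "[20-30)", "[10-20)", "[20-30)"],
})
labels = [0, 1, 0, 1]

cat_pipe.fit(toy, labels)
preds = cat_pipe.predict(toy)
print(len(preds))
```

A pipeline built this way can be passed to StackingClassifier as one of the base estimators, just like cat_pipe in the question.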

Error when running a pipeline to process the PUBG data from Kaggle

I want to build a pipeline to process the PUBG data from Kaggle, but when I run the pipeline I get this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/tmp/ipykernel_35/3879657662.py in <module>
8 ])
9
---> 10 pubg_num_tr = num_pipeline.fit_transform(pubg_num)
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
424 """
425 fit_params_steps = self._check_fit_params(**fit_params)
--> 426 Xt = self._fit(X, y, **fit_params_steps)
427
428 last_step = self._final_estimator
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
353 message_clsname="Pipeline",
354 message=self._log_message(step_idx),
--> 355 **fit_params_steps[name],
356 )
357 # Replace the transformer of the step with the fitted
/opt/conda/lib/python3.7/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
347
348 def __call__(self, *args, **kwargs):
--> 349 return self.func(*args, **kwargs)
350
351 def call_and_shelve(self, *args, **kwargs):
/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
891 with _print_elapsed_time(message_clsname, message):
892 if hasattr(transformer, "fit_transform"):
--> 893 res = transformer.fit_transform(X, y, **fit_params)
894 else:
895 res = transformer.fit(X, y, **fit_params).transform(X)
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
845 if y is None:
846 # fit method of arity 1 (unsupervised transformation)
--> 847 return self.fit(X, **fit_params).transform(X)
848 else:
849 # fit method of arity 2 (supervised transformation)
/tmp/ipykernel_35/2077244363.py in transform(self, X)
13 total_distance = X[:, walkDistance_ix] + X[:, rideDistance_ix]+X[:, swimDistance_ix]
14 if self.add_total_distance_per_seconda:
---> 15 add_total_distance_per_seconda = X[:, total_distance] / X[:, matchDuration_ix]
16 return np.c_[X, walk_distance_per_seconda, total_distance,
17 add_total_distance_per_seconda]
IndexError: arrays used as indices must be of integer (or boolean) type
My pipeline code is:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler())
])
pubg_num_tr = num_pipeline.fit_transform(pubg_num)
I implemented an attribute adder and it worked properly on its own, but it fails when I run it inside the pipeline. I need a solution that does not require converting floats to integers, because that would damage the data.
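Judging only from the traceback, line 15 uses total_distance as a column index (X[:, total_distance]) even though line 13 already computed it as an array of float values, which would explain the IndexError without any need for an integer conversion. A minimal sketch of the transformer with that one expression changed is shown below; the column positions and the toy array are assumptions, the parameter is renamed for readability, and the walk_distance_per_seconda feature is omitted because its definition is not shown.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical column positions; the real ones depend on pubg_num's layout.
walkDistance_ix, rideDistance_ix, swimDistance_ix, matchDuration_ix = 0, 1, 2, 3


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_total_distance_per_second=True):
        self.add_total_distance_per_second = add_total_distance_per_second

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # total_distance is a 1-D array of values, one per row.
        total_distance = (
            X[:, walkDistance_ix] + X[:, rideDistance_ix] + X[:, swimDistance_ix]
        )
        if self.add_total_distance_per_second:
            # Divide the values directly; do not index X with them.
            per_second = total_distance / X[:, matchDuration_ix]
            return np.c_[X, total_distance, per_second]
        return np.c_[X, total_distance]


# Toy input standing in for the imputer's output (a float ndarray).
X = np.array([[100.0, 0.0, 20.0, 60.0],
              [50.0, 250.0, 0.0, 100.0]])
out = CombinedAttributesAdder().transform(X)
print(out.shape)
```

The values stay as floats throughout; only the indexing expression changes.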

AttributeError and TypeError using CustomTransformers

I am building a model using customized transformers (KeyError: "None of [Index([('A','B','C')] , dtype='object')] are in the [columns]).
When I run the below code, I get an error because of .fit:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-165-289e1d466eb9> in <module>
10
11 # fit on the complete pipeline
---> 12 training = full_pipeline.fit(X, y)
13
14 # metrics
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
339 """
340 fit_params_steps = self._check_fit_params(**fit_params)
--> 341 Xt = self._fit(X, y, **fit_params_steps)
342 with _print_elapsed_time('Pipeline',
343 self._log_message(len(self.steps) - 1)):
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
301 cloned_transformer = clone(transformer)
302 # Fit or load from cache the current transformer
--> 303 X, fitted_transformer = fit_transform_one_cached(
304 cloned_transformer, X, y, None,
305 message_clsname='Pipeline',
~/opt/anaconda3/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
385 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
386 if hasattr(last_step, 'fit_transform'):
--> 387 return last_step.fit_transform(Xt, y, **fit_params_last_step)
388 else:
389 return last_step.fit(Xt, y,
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
978 sum of n_components (output dimension) over transformers.
979 """
--> 980 results = self._parallel_func(X, y, fit_params, _fit_transform_one)
981 if not results:
982 # All transformers are None
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _parallel_func(self, X, y, fit_params, func)
1000 transformers = list(self._iter())
1001
-> 1002 return Parallel(n_jobs=self.n_jobs)(delayed(func)(
1003 transformer, X, y, weight,
1004 message_clsname='FeatureUnion',
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1042 self._iterating = self._original_iterator is not None
1043
-> 1044 while self.dispatch_one_batch(iterator):
1045 pass
1046
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
857 return False
858 else:
--> 859 self._dispatch(tasks)
860 return True
861
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
775 with self._lock:
776 job_idx = len(self._jobs)
--> 777 job = self._backend.apply_async(batch, callback=cb)
778 # A job can complete so quickly than its callback is
779 # called before we get here, causing self._jobs to
~/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
220 def __call__(self, *args, **kwargs):
221 with config_context(**self.config):
--> 222 return self.function(*args, **kwargs)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
385 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
386 if hasattr(last_step, 'fit_transform'):
--> 387 return last_step.fit_transform(Xt, y, **fit_params_last_step)
388 else:
389 return last_step.fit(Xt, y,
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
1112 for doc in raw_documents:
1113 feature_counter = {}
-> 1114 for feature in analyze(doc):
1115 try:
1116 feature_idx = vocabulary[feature]
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
102 else:
103 if preprocessor is not None:
--> 104 doc = preprocessor(doc)
105 if tokenizer is not None:
106 doc = tokenizer(doc)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in _preprocess(doc, accent_function, lower)
67 """
68 if lower:
---> 69 doc = doc.lower()
70 if accent_function is not None:
71 doc = accent_function(doc)
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
The code is
# MODEL
from sklearn import tree
# Decision Tree
decision_tree = tree.DecisionTreeClassifier()
# define full pipeline --> preprocessing + model
full_pipeline = Pipeline(steps=[
('preprocess_pipeline', preprocess_pipeline),
('model', decision_tree)])
# fit on the complete pipeline
training = full_pipeline.fit(X, y) # <- this step returns the error
# metrics
score_test = \
round(training.score(X, y) * 100, 2)
print(f"\nTraining Accuracy: {score_test}")
I have also tried with .fit_transform but I get the same error.
I read this: AttributeError: 'numpy.ndarray' object has no attribute 'lower' fitting logistic model data but it seems that I am not passing X or y in the Decision tree like in that example, but maybe I am wrong.
Adding
# Defining the steps in the text pipeline
text_pipeline = Pipeline(steps=[
('text_transformer', TextTransformer()),
('cv', CountVectorizer(analyzer='word', ngram_range=(2, 2), lowercase=False))])
I get this new error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-159-289e1d466eb9> in <module>
10
11 # fit on the complete pipeline
---> 12 training = full_pipeline.fit(X, y)
13
14 # metrics
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
339 """
340 fit_params_steps = self._check_fit_params(**fit_params)
--> 341 Xt = self._fit(X, y, **fit_params_steps)
342 with _print_elapsed_time('Pipeline',
343 self._log_message(len(self.steps) - 1)):
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params_steps)
301 cloned_transformer = clone(transformer)
302 # Fit or load from cache the current transformer
--> 303 X, fitted_transformer = fit_transform_one_cached(
304 cloned_transformer, X, y, None,
305 message_clsname='Pipeline',
~/opt/anaconda3/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
385 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
386 if hasattr(last_step, 'fit_transform'):
--> 387 return last_step.fit_transform(Xt, y, **fit_params_last_step)
388 else:
389 return last_step.fit(Xt, y,
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
978 sum of n_components (output dimension) over transformers.
979 """
--> 980 results = self._parallel_func(X, y, fit_params, _fit_transform_one)
981 if not results:
982 # All transformers are None
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _parallel_func(self, X, y, fit_params, func)
1000 transformers = list(self._iter())
1001
-> 1002 return Parallel(n_jobs=self.n_jobs)(delayed(func)(
1003 transformer, X, y, weight,
1004 message_clsname='FeatureUnion',
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
1042 self._iterating = self._original_iterator is not None
1043
-> 1044 while self.dispatch_one_batch(iterator):
1045 pass
1046
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
857 return False
858 else:
--> 859 self._dispatch(tasks)
860 return True
861
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
775 with self._lock:
776 job_idx = len(self._jobs)
--> 777 job = self._backend.apply_async(batch, callback=cb)
778 # A job can complete so quickly than its callback is
779 # called before we get here, causing self._jobs to
~/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
206 def apply_async(self, func, callback=None):
207 """Schedule a func to be run"""
--> 208 result = ImmediateResult(func)
209 if callback:
210 callback(result)
~/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
570 # Don't delay the application, to avoid keeping the input
571 # arguments in memory
--> 572 self.results = batch()
573
574 def get(self):
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
260 # change the default number of processes to -1
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262 return [func(*args, **kwargs)
263 for func, args, kwargs in self.items]
264
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
220 def __call__(self, *args, **kwargs):
221 with config_context(**self.config):
--> 222 return self.function(*args, **kwargs)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
385 fit_params_last_step = fit_params_steps[self.steps[-1][0]]
386 if hasattr(last_step, 'fit_transform'):
--> 387 return last_step.fit_transform(Xt, y, **fit_params_last_step)
388 else:
389 return last_step.fit(Xt, y,
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
1112 for doc in raw_documents:
1113 feature_counter = {}
-> 1114 for feature in analyze(doc):
1115 try:
1116 feature_idx = vocabulary[feature]
~/opt/anaconda3/lib/python3.8/site-packages/sklearn/feature_extraction/text.py in _analyze(doc, analyzer, tokenizer, ngrams, preprocessor, decoder, stop_words)
104 doc = preprocessor(doc)
105 if tokenizer is not None:
--> 106 doc = tokenizer(doc)
107 if ngrams is not None:
108 if stop_words is not None:
TypeError: cannot use a string pattern on a bytes-like object
If I remove text_pipeline, the error does not occur, so it seems that something is going wrong in the way CountVectorizer is used.
An example of text is
an example
example number 1
this is another small example
I have other columns that are numerical and categorical.
Have you experienced a similar issue? If yes, how did you handle it?

A common source of errors with sklearn's text transformers is the shape of the data: unlike most other sklearn preprocessors, text transformers generally expect a one-dimensional input (an iterable of strings). Because both arrays and strings are iterable, Python's duck typing lets a wrongly shaped input through and the failure shows up later as a confusing error.
Your TextTransformer.transform returns X[['Tweet']], which is 2-dimensional, and will cause problems with the subsequent CountVectorizer. (Converting to a numpy array with .values doesn't change the dimensionality problem, but there's also no compelling reason to do that conversion.) Returning X['Tweet'] instead should cure that problem.
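To make the dimensionality point concrete, here is a minimal sketch using the three example texts from the question. The DataFrame and its column name `Tweet` are assumptions made for illustration, and the exact failure you see depends on the container you pass (a DataFrame, as here, or a 2-D numpy array from `.values`), so treat this as a demonstration of the mechanism rather than a reproduction of the traceback above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data frame built from the example texts in the question.
df = pd.DataFrame({
    "Tweet": ["an example", "example number 1", "this is another small example"],
})

# 1-D input: iterating a Series yields one string per row,
# so the vectorizer sees three documents.
counts_1d = CountVectorizer().fit_transform(df["Tweet"])

# 2-D input: iterating a DataFrame yields its column labels,
# so the vectorizer sees a single "document" (the string "Tweet").
counts_2d = CountVectorizer().fit_transform(df[["Tweet"]])

print("rows from df['Tweet']:  ", counts_1d.shape[0])
print("rows from df[['Tweet']]:", counts_2d.shape[0])
```

A row count that does not match the number of samples is the kind of mismatch that later surfaces as an "incompatible row dimensions" error when FeatureUnion stacks the blocks.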

Custom FeatureUnion won't work?

I'm trying to modify this example to use a Pandas dataframe instead of the test datasets. I am not able to do so, as ItemSelector does not seem to recognise the column name.
Note that df_resolved.columns returns:
Index(['u_category', ... ... 'resolution_time', 'rawtext'],
dtype='object')
So I obviously do have this in my dataframe.
However, when I try to run the solution, I get the error
"ValueError: no field of name u_category"
Also, I don't seem to be able to modify the code to support choosing multiple columns in the ItemSelector, so in this solution, I'd have to apply the transformers separately with each column.
My code is:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, x, y=None):
        return self
    def transform(self, data_dict):
        return data_dict[self.key]
class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""
    def fit(self, x, y=None):
        return self
    def transform(self, posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]
class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.
    Takes a sequence of strings and produces a dict of sequences. Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self
    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('subject', object), ('body', object)])
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features['body'][i] = bod
            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features['subject'][i] = sub
        return features
pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),
    # Use FeatureUnion to combine the features from subject and body
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for pulling features from the post's subject line
            ('rawtext', Pipeline([
                ('selector', ItemSelector(key='u_category')),
                ('labelenc', preprocessing.LabelEncoder()),
            ])),
            # Pipeline for standard bag-of-words model for body
            ('features', Pipeline([
                ('selector', ItemSelector(key='rawtext')),
                ('tfidf', TfidfVectorizer(max_df=0.5, min_df=1,
                                          stop_words='english',
                                          token_pattern=u'(?ui)\\b\\w*[a-z]{2,}\\w*\\b')),
            ])),
        ],
        # weight components in FeatureUnion
        transformer_weights={
            'rawtext': 1.0,
            'features': 1.0,
        },
    )),
    # Use a SVC classifier on the combined features
    ('linear_svc', LinearSVC(penalty="l2")),
])
# limit the list of categories to make running this example faster.
X_train, X_test, y_train, y_test = train_test_split(df_resolved.ix[:, (df_resolved.columns != 'assignment_group.name')], df_resolved['assignment_group.name'], test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
How can I modify this code to work properly with my dataframe, and possibly support applying a transformer to multiple columns at once?
If I take the ItemSelector out, it seems to work. So this works:
ds = ItemSelector(key='u_category')
ds.fit(df_resolved)
labelenc = preprocessing.LabelEncoder()
labelenc_transformed = labelenc.fit_transform(ds.transform(df_resolved))
FULL STACK TRACE:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-93-a4ba29c137ec> in <module>()
136
137
--> 138 pipeline.fit(X_train, y_train)
139 #y = pipeline.predict(X_test)
140 #print(classification_report(y, test.target))
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
266 This estimator
267 """
--> 268 Xt, fit_params = self._fit(X, y, **fit_params)
269 if self._final_estimator is not None:
270 self._final_estimator.fit(Xt, y, **fit_params)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
732 delayed(_fit_transform_one)(trans, name, weight, X, y,
733 **fit_params)
--> 734 for name, trans, weight in self._iter())
735
736 if not result:
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
606 return False
607 else:
--> 608 self._dispatch(tasks)
609 return True
610
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
569 dispatch_timestamp = time.time()
570 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571 job = self._backend.apply_async(batch, callback=cb)
572 self._jobs.append(job)
573
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
107 def apply_async(self, func, callback=None):
108 """Schedule a func to be run"""
--> 109 result = ImmediateResult(func)
110 if callback:
111 callback(result)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
324 # Don't delay the application, to avoid keeping the input
325 # arguments in memory
--> 326 self.results = batch()
327
328 def get(self):
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, name, weight, X, y, **fit_params)
575 **fit_params):
576 if hasattr(transformer, 'fit_transform'):
--> 577 res = transformer.fit_transform(X, y, **fit_params)
578 else:
579 res = transformer.fit(X, y, **fit_params).transform(X)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
299 """
300 last_step = self._final_estimator
--> 301 Xt, fit_params = self._fit(X, y, **fit_params)
302 if hasattr(last_step, 'fit_transform'):
303 return last_step.fit_transform(Xt, y, **fit_params)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
495 else:
496 # fit method of arity 2 (supervised transformation)
--> 497 return self.fit(X, y, **fit_params).transform(X)
498
499
<ipython-input-93-a4ba29c137ec> in transform(self, data_dict)
55
56 def transform(self, data_dict):
---> 57 return data_dict[self.key]
58
59
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/numpy/core/records.py in __getitem__(self, indx)
497
498 def __getitem__(self, indx):
--> 499 obj = super(recarray, self).__getitem__(indx)
500
501 # copy behavior of getattr, except that here
ValueError: no field of name u_category
UPDATE:
Even if I use dataframes (NO train_test_split), the issue persists:
UPDATE 2:
OK so I removed the SubjectBodyExtractor, since I won't need that. Now the ValueError: no field of name u_category is gone, but I have a new error: TypeError: fit_transform() takes 2 positional arguments but 3 were given.
Stack trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-110-292294015e44> in <module>()
129
130
--> 131 pipeline.fit(X_train.ix[:, (X_test.columns != 'assignment_group.name')], X_test['assignment_group.name'])
132 #y = pipeline.predict(X_test)
133 #print(classification_report(y, test.target))
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
266 This estimator
267 """
--> 268 Xt, fit_params = self._fit(X, y, **fit_params)
269 if self._final_estimator is not None:
270 self._final_estimator.fit(Xt, y, **fit_params)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
732 delayed(_fit_transform_one)(trans, name, weight, X, y,
733 **fit_params)
--> 734 for name, trans, weight in self._iter())
735
736 if not result:
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
606 return False
607 else:
--> 608 self._dispatch(tasks)
609 return True
610
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
569 dispatch_timestamp = time.time()
570 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571 job = self._backend.apply_async(batch, callback=cb)
572 self._jobs.append(job)
573
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
107 def apply_async(self, func, callback=None):
108 """Schedule a func to be run"""
--> 109 result = ImmediateResult(func)
110 if callback:
111 callback(result)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
324 # Don't delay the application, to avoid keeping the input
325 # arguments in memory
--> 326 self.results = batch()
327
328 def get(self):
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, name, weight, X, y, **fit_params)
575 **fit_params):
576 if hasattr(transformer, 'fit_transform'):
--> 577 res = transformer.fit_transform(X, y, **fit_params)
578 else:
579 res = transformer.fit(X, y, **fit_params).transform(X)
/Users/csanadpoda/Documents/Jupyter/anaconda/lib/python3.6/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
301 Xt, fit_params = self._fit(X, y, **fit_params)
302 if hasattr(last_step, 'fit_transform'):
--> 303 return last_step.fit_transform(Xt, y, **fit_params)
304 elif last_step is None:
305 return Xt
TypeError: fit_transform() takes 2 positional arguments but 3 were given

Yes, that's because LabelEncoder's fit_transform accepts only a single array y, whereas the Pipeline inside the FeatureUnion passes both X and y to it.
See this: https://github.com/scikit-learn/scikit-learn/issues/3956
A simple workaround is to define a custom label encoder like this:
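class MyLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.le = LabelEncoder()
    def fit(self, x, y=None):
        self.le.fit(x)
        # Return self (not the inner LabelEncoder) so that
        # fit(...).transform(...) uses the reshaping transform below.
        return self
    def transform(self, x, y=None):
        # FeatureUnion needs 2-d output, hence the column reshape.
        return self.le.transform(x).reshape(-1,1)
    def fit_transform(self, x, y=None):
        self.fit(x)
        return self.transform(x)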
And in the pipeline, do this:
....
....
('selector', ItemSelector(key='u_category')),
('labelenc', MyLabelEncoder()),
Please note the reshape(-1,1) in the transform() method. That's because FeatureUnion only works with 2-d data: every individual transformer inside the FeatureUnion should return 2-d data.
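As a rough end-to-end check of this workaround, the sketch below wires a wrapper of the same shape into a FeatureUnion next to a TfidfVectorizer. The tiny DataFrame and its values are made up for illustration; only the column names `u_category` and `rawtext` come from the question. The wrapper here stores the fitted encoder in a hypothetical attribute `le_`, which is a stylistic variation and not required.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import LabelEncoder


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single column from a DataFrame by name."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


class MyLabelEncoder(BaseEstimator, TransformerMixin):
    """LabelEncoder wrapper that accepts (X, y) and returns a 2-d column."""

    def fit(self, X, y=None):
        self.le_ = LabelEncoder().fit(X)
        return self

    def transform(self, X):
        return self.le_.transform(X).reshape(-1, 1)


# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "u_category": ["network", "hardware", "network", "software"],
    "rawtext": [
        "router is down",
        "laptop screen broken",
        "vpn keeps dropping",
        "app crashes on start",
    ],
})

union = FeatureUnion([
    ("category", Pipeline([
        ("selector", ItemSelector("u_category")),
        ("labelenc", MyLabelEncoder()),
    ])),
    ("text", Pipeline([
        ("selector", ItemSelector("rawtext")),
        ("tfidf", TfidfVectorizer()),
    ])),
])

features = union.fit_transform(df)
print(features.shape)  # one row per sample, one label column plus the tf-idf columns
```

Both branches return one row per sample, so the union can stack them side by side. Note that integer label codes impose an arbitrary ordering on the categories, which may or may not suit the downstream classifier.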

You may need to add the fields to the features array in SubjectBodyExtractor, because ItemSelector looks up its key in that array and not in your dataframe. Please try declaring the two selector keys as fields like this and show me the results:
features = np.recarray(shape=(len(posts),),
                       dtype=[('u_category', object), ('rawtext', object)])
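The "no field of name u_category" error can be reproduced in isolation, which may help confirm where the lookup fails. This is a minimal sketch with made-up values; it only assumes the same recarray layout (`subject` and `body`) that SubjectBodyExtractor builds in the question.

```python
import numpy as np

# Same field layout as the recarray built by SubjectBodyExtractor.
features = np.recarray(shape=(2,),
                       dtype=[('subject', object), ('body', object)])
features['subject'][0] = 'first subject'
features['subject'][1] = 'second subject'

# A declared field can be selected by name.
subjects = features['subject']

# An undeclared field cannot, whatever columns the original dataframe had.
try:
    features['u_category']
    lookup_error = None
except ValueError as exc:
    lookup_error = str(exc)

print(lookup_error)
```

In other words, once the first pipeline step has replaced the dataframe with this array, ItemSelector can only select fields that the array declares.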

IndexError while fitting pipeline with FeatureUnion

I keep getting an
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
while trying to fit my dataframe to the following pipeline. Train and Test are two dataframes with the same columns. They contain several columns, but I only want to use three of them, selected through the ItemSelector.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn import preprocessing
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.column]
def predictCases(train, test):
    target_names = sorted(list(set(train['TARGET'].values)))
    y_train = np.array([target_names.index(x) for x in train['TARGET'].values])
    y_test = np.array([target_names.index(x) for x in test['TARGET'].values])
    # train and predict
    classifier = Pipeline([
        ('union', FeatureUnion([
            ('text', Pipeline([
                ('selector', ItemSelector(column='TEXT')),
                ('tfidf_vec', TfidfVectorizer())
            ])),
            ('feature1', Pipeline([
                ('selector', ItemSelector(column='CATEG_FEAT1')),
                ('lbe', LabelEncoder())
            ])),
            ('feature2', Pipeline([
                ('selector', ItemSelector(column='CATEG_FEAT2')),
                ('lbe', LabelEncoder())
            ]))
        ])),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    classifier.fit(train.values, y_train)
    predicted = classifier.predict(test.values)
    return(metrics.precision_recall_fscore_support(y_test, predicted))
Full Error:
IndexError Traceback (most recent call last)
<ipython-input-19-95d9d0c337f4> in <module>()
----> 1 tt = predictCases(train_resampled, validate)
<ipython-input-17-efc951f4192e> in predictCases(train, test)
24 ])),
25 ('clf', OneVsRestClassifier(LinearSVC()))])
---> 26 classifier.fit(train.values, y_train)
27 predicted = classifier.predict(test.values)
28 return(metrics.precision_recall_fscore_support(y_test, predicted))
C:\\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
266 This estimator
267 """
--> 268 Xt, fit_params = self._fit(X, y, **fit_params)
269 if self._final_estimator is not None:
270 self._final_estimator.fit(Xt, y, **fit_params)
C:\\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
C:\\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
732 delayed(_fit_transform_one)(trans, name, weight, X, y,
733 **fit_params)
--> 734 for name, trans, weight in self._iter())
735
736 if not result:
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
606 return False
607 else:
--> 608 self._dispatch(tasks)
609 return True
610
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
569 dispatch_timestamp = time.time()
570 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571 job = self._backend.apply_async(batch, callback=cb)
572 self._jobs.append(job)
573
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
107 def apply_async(self, func, callback=None):
108 """Schedule a func to be run"""
--> 109 result = ImmediateResult(func)
110 if callback:
111 callback(result)
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
324 # Don't delay the application, to avoid keeping the input
325 # arguments in memory
--> 326 self.results = batch()
327
328 def get(self):
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
C:\\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
C:\\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, name, weight, X, y, **fit_params)
575 **fit_params):
576 if hasattr(transformer, 'fit_transform'):
--> 577 res = transformer.fit_transform(X, y, **fit_params)
578 else:
579 res = transformer.fit(X, y, **fit_params).transform(X)
C:\\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
299 """
300 last_step = self._final_estimator
--> 301 Xt, fit_params = self._fit(X, y, **fit_params)
302 if hasattr(last_step, 'fit_transform'):
303 return last_step.fit_transform(Xt, y, **fit_params)
C:\\Anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
C:\\Anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
495 else:
496 # fit method of arity 2 (supervised transformation)
--> 497 return self.fit(X, y, **fit_params).transform(X)
498
499
<ipython-input-2-fdc42fd9d831> in transform(self, X)
10
11 def transform(self, X):
---> 12 return X[self.column]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Edit:
If I use train instead of train.values in fit I get the following error:
TypeError: fit_transform() takes 2 positional arguments but 3 were given

You're passing train.values and test.values (i.e. numpy arrays holding the raw DataFrame values) to classifier.fit and classifier.predict, while your ItemSelector indexes by column name and therefore expects a DataFrame object.
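The difference is easy to see outside the pipeline. The following sketch uses a made-up two-row DataFrame with two of the column names from the question; it is only meant to show why indexing by name works on the DataFrame and fails on `.values`.

```python
import pandas as pd

# Made-up data; only the column names come from the question.
train = pd.DataFrame({
    "TEXT": ["first case", "second case"],
    "CATEG_FEAT1": ["a", "b"],
})

# A DataFrame can be indexed by column name, which is what ItemSelector does.
text_column = train["TEXT"]

# .values is a plain numpy array: the column labels are gone,
# so indexing it with a string raises IndexError.
try:
    train.values["TEXT"]
    error_type = None
except IndexError as exc:
    error_type = type(exc).__name__

print(error_type)
```

Passing `train` and `test` themselves keeps the column labels available to ItemSelector. The TypeError mentioned in the edit is then a separate problem: LabelEncoder used as a pipeline step receives both X and y.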
