Scikit Learn - Combining output of TfidfVectorizer and OneHotEncoder - dimensionality


I am currently developing a machine learning algorithm for ticket classification that combines a Title, Description and Customer name to predict which team a ticket should be assigned to, but I have been stuck for the past few days.
Title and Description are both free text, so I am passing them through TfidfVectorizer. Customer name is a category, so for this I am using OneHotEncoder. I want these to work within a pipeline, so I join them with a ColumnTransformer to which I can pass an entire dataframe for processing.
import pandas as pd
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
train_file = "train_data.csv"
train_data = pd.read_csv(train_file)
string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer())])
categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder())])
preprocessor = ColumnTransformer(transformers=[('str', string_transformer, string_features), ('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', SGDClassifier())])
X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']
clf.fit(X_train, y_train)
However I get an error: all the input array dimensions except for the concatenation axis must match exactly.
After looking into it, print(OneHotEncoder().fit_transform(X_train['Customer'])) on its own returns an error: Expected 2d array got 1d array instead.
I believe that OneHotEncoder is failing because it expects an array of arrays (a pandas dataframe), each of length one and containing the customer name, but instead it is just getting a pandas series. After converting the series to a dataframe with .to_frame(), the printed output seems to match the output of the TfidfVectorizer, and the dimensions should match.
Is there a way I can modify OneHotEncoder in the pipeline so that it accepts the one-dimensional input as it is? Or is there something I can add to the pipeline that will convert it before it is passed into OneHotEncoder? Am I right that this is the reason for the error?
Thanks.

I believe the problem is that you are passing two columns to the TfidfVectorizer, so it receives a DataFrame. This will not work: TfidfVectorizer expects an iterable of strings. An immediate fix (and a check of whether this is in fact the source of the problem) is to change this line to: string_features = 'Description'. Note that this is a plain string, not a list, so a Series is passed to the TfidfVectorizer rather than a DataFrame.
If you would like to combine both string columns, you could either
concatenate the strings, so you keep one column (which is the easiest), or
fit two different TfidfVectorizers, which is more complex but might perform better. See for instance Computing separate tfidf scores for two different columns using sklearn
Should this not solve your problem, I would advise you to share some sample data so we can actually test what is happening.
I believe the difference between the error you see when testing OneHotEncoder on its own and what happens in the actual pipeline is that in your test you are giving it X_train['Customer'] (again a Series), whereas the pipeline gives it X_train[['Customer']] (a DataFrame).
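The second option above (one TfidfVectorizer per text column) can be sketched as follows. This is a minimal sketch, not the asker's actual setup: the sample rows are made up for illustration, and handle_unknown="ignore" and random_state=0 are assumptions added so the example runs on tiny data. The key point is that each text column is selected with a plain string (giving a 1-D Series), while the categorical column is selected with a list (giving a 2-D DataFrame).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data standing in for train_data.csv
train_data = pd.DataFrame({
    "Title": ["printer broken", "vpn down", "printer jam", "vpn slow"],
    "Description": ["paper stuck in tray", "cannot connect to vpn",
                    "tray jammed again", "vpn connection is slow"],
    "Customer": ["acme", "globex", "acme", "initech"],
    "Team": ["hardware", "network", "hardware", "network"],
})

preprocessor = ColumnTransformer(transformers=[
    # A string selector passes a 1-D Series, which TfidfVectorizer expects
    ("title", TfidfVectorizer(), "Title"),
    ("desc", TfidfVectorizer(), "Description"),
    # A list selector passes a 2-D DataFrame, which OneHotEncoder expects
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Customer"]),
])

clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("clf", SGDClassifier(random_state=0)),
])

X_train = train_data.drop("Team", axis=1)
y_train = train_data["Team"]
clf.fit(X_train, y_train)

# Inspect the combined feature matrix produced by the fitted preprocessor
features = clf.named_steps["preprocessor"].transform(X_train)
predictions = clf.predict(X_train)
```

Because each transformer now receives input of the dimensionality it expects, the three outputs have the same number of rows and can be stacked horizontally.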

Related

Pandas Dataframe, TensorFlow Dataset: Where to do the TensorFlow Tokenization step?

I am working on a logistic regression model to predict whether a customer is a business or non-business customer with the help of Keras in TensorFlow. At the moment I am able to use columns like latitude with the help of tf.feature_column. Now I am working on the NAME1 field. The name often has repeating parts like “GmbH” (e.g. “Mustermann GmbH”), which in this context has a similar meaning to Corp. and is an indicator that the customer is a business customer. To separate all the different parts of the name and to work with them separately, I am using tokenization with the help of the function text_to_word_sequence().
I import the data into a Pandas Dataframe and afterwards I convert this Dataframe to a TensorFlow Dataset with the function from_tensor_slices() so I can work with the tf.feature_column functions.
I tried two different strategies for the tokenization:
Tokenization before converting the pandas Dataframe to a TensorFlow Dataset
After importing the Dataframe I used the Pandas Dataframe method apply() to create a new tokenized column within the Dataframe:
data['NAME1TOKENIZED'] = data['NAME1'].apply(lambda x: text_to_word_sequence(x))
The new column has the following structure:
0 [palle]
1 [pertl]
2 [graf, robert]
3 [löberbauer, stefanie, asg]
4 [stauber, martin, asg]
...
99995 [truber]
99996 [mesgec]
99997 [mesgec]
99998 [miedl]
99999 [millegger]
Name: NAME1TOKENIZED, Length: 100000, dtype: object
As you can see, the lists have different numbers of entries, so I have problems converting the Dataframe into a Dataset:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I also tried the tf.ragged.constant() function to create a ragged Tensor, which allows lists of this kind.
Here is my function for converting the DataFrame to a Dataset:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    tok_names = dataframe.loc[:,'NAME1TOKENIZED']
    del dataframe['NAME1TOKENIZED']
    rt_tok_names = tf.ragged.constant(tok_names)
    labels = dataframe.pop('RECEIVERTYPE')
    labels = labels - 1
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), rt_tok_names, labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds
This works pretty well, but as you can imagine, I now have a problem on the other side. When I try to use the following function:
name_embedding = tf.feature_column.categorical_column_with_hash_bucket('NAME1TOKENIZED', hash_bucket_size=2500)
I get the following Error:
ValueError: Feature NAME1TOKENIZED is not in features dictionary.
I also tried to pass a Dataframe instead of a Series into tf.ragged.constant() so that I can use dict(rt_tok_names) to pass the label, but then I get the following error again:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
Tokenization after converting the pandas Dataframe to a TensorFlow Dataset
I have tried e.g. the following:
train_ds.map(lambda x, _: text_to_word_sequence(x['NAME1']))
But I got the following error:
AttributeError: 'Tensor' object has no attribute 'lower'
As you can see, I tried it several ways but without success. I would be happy for any recommendations on how to solve my problem.
Thanks!

I found a solution for my problem. I used the Tokenizer to transform the text to sequences and then padded each row's sequence to a maximum length of two. Finally, I added these two new columns to the Dataframe. Afterwards I was able to transform the Dataframe to a Dataset and then use these two columns with the help of tf.feature_column.
Here is the relevant code:
t = Tokenizer(num_words=name_num_words)
t.fit_on_texts(data['NAME1PRO'])
name1_tokenized = t.texts_to_sequences(data['NAME1PRO'])
name1_tokenized_pad = tf.keras.preprocessing.sequence.pad_sequences(name1_tokenized, maxlen=2, truncating='pre')
data = pd.concat([data, pd.DataFrame(name1_tokenized_pad, columns=['NAME1W1', 'NAME1W2'])], axis=1)

Getting ValueError: y contains new labels when using scikit learn's LabelEncoder

I have a series like:
df['ID'] = ['ABC123', 'IDF345', ...]
I'm using scikit's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.
During the training, I'm doing as follows:
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
But, now for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id i.e., if same values are present then transform it according to the above label encoder, otherwise assign a new numerical value.
In the test file, I was doing as follows:
new_df['ID'] = le_id.transform(new_df.ID)
But I'm getting the following error: ValueError: y contains new labels
How do I fix this? Thanks!
UPDATE:
So the task I have is to use the below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn the characteristics where a 'High' is given, where a 'Low' is given from the training dataset. For example, below a 'High' is given when there are multiple entries with same BankNum and different IDs.
df =
BankNum | ID | Labels
0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low
And then predict it on something like:
BankNum | ID |
00982222 | AB999 |
00982222 | AB999 |
00981111 | AB890 |
I'm doing something like this:
df['BankNum'] = df.BankNum.astype(np.float128)
le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)
X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)

I think the error message is very clear: your test data set contains ID labels which were not included in your training data set. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can either try to balance your data set, so that you are sure that each label is present not only in your test data but also in your training data. Otherwise, you can try to follow one of the ideas presented here.
One possible solution is to search through your data set at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.
Another possible solution is to check that the test data contains only labels which were seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something like this). Doing this, you put all new, unknown IDs in one class; for these items the prediction will then fail, but you can use the rest of your code as it is now.
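The fallback idea can be sketched like this. It is only an illustration under assumptions: the sample IDs are invented, and the sentinel name unknown_id is taken from the text above but must not collide with any real ID in your data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and test IDs
train_ids = pd.Series(["ABC123", "IDF345", "ABC123"])
test_ids = pd.Series(["IDF345", "ZZZ999"])  # ZZZ999 was never seen in training

UNKNOWN = "unknown_id"  # assumed sentinel value for unseen IDs

# Fit on the training IDs plus the sentinel so it gets its own integer code
le_id = LabelEncoder()
le_id.fit(pd.concat([train_ids, pd.Series([UNKNOWN])], ignore_index=True))

# Replace every test ID that the encoder does not know with the sentinel
known = set(le_id.classes_)
test_clean = test_ids.where(test_ids.isin(known), UNKNOWN)

encoded_train = le_id.transform(train_ids)
encoded_test = le_id.transform(test_clean)
```

All unseen IDs end up sharing the sentinel's code, so transform no longer raises, at the cost of the model being unable to distinguish between different new IDs.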

You can try the solution from "sklearn.LabelEncoder with never seen before values" https://stackoverflow.com/a/48169252/9043549
The idea is to create a dictionary from the classes, then map the column and fill new classes with some "known value".
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
suf="_le"
col="a"
df[col+suf] = le.fit_transform(df[col])
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col='b'
df[col+suf]=df[col].map(dic).fillna(dic["c"]).astype(int)

If your data is a pd.DataFrame, I suggest this simple solution.
I built a custom transformer that maps each categorical feature to integers. Once fitted, it can transform any data you want. It can compute either simple label encoding or one-hot encoding.
If new unseen categories or NaNs are present in new data:
1] For label encoding, 0 is a special token reserved for mapping these cases.
2] For onehot encoding, all the onehot columns are zeros in these cases.
class FeatureTransformer:
    def __init__(self, categorical_features):
        self.categorical_features = categorical_features
    def fit(self, X):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("Pass a pandas.DataFrame")
        if not isinstance(self.categorical_features, list):
            raise ValueError(
                "Pass categorical_features as a list of column names")
        self.encoding = {}
        for c in self.categorical_features:
            _, int_id = X[c].factorize()
            self.encoding[c] = dict(zip(list(int_id), range(1,len(int_id)+1)))
        return self
    def transform(self, X, onehot=True):
        if not isinstance(X, pd.DataFrame):
            raise ValueError("Pass a pandas.DataFrame")
        if not hasattr(self, 'encoding'):
            raise AttributeError("FeatureTransformer must be fitted")
        df = X.drop(self.categorical_features, axis=1)
        if onehot: # one-hot encoding
            for c in sorted(self.categorical_features):
                categories = X[c].map(self.encoding[c]).values
                for val in self.encoding[c].values():
                    df["{}_{}".format(c,val)] = (categories == val).astype('int16')
        else: # label encoding
            for c in sorted(self.categorical_features):
                df[c] = X[c].map(self.encoding[c]).fillna(0)
        return df
Usage:
X_train = pd.DataFrame(np.random.randint(10,20, (100,10)))
X_test = pd.DataFrame(np.random.randint(20,30, (100,10)))
ft = FeatureTransformer(categorical_features=[0,1,3])
ft.fit(X_train)
ft.transform(X_test, onehot=False).shape

This is in fact a known bug on LabelEncoder : BUG for fit_transform ... basically you have to fit it and then transform. It will work fine ! A suggestion is to keep a dictionary of your encoders to each and every column so that in the inverse transform you are able to retrieve the original categorical values.

I'm able to mentally process operations better when dealing in DataFrames. The approach below fits and transforms the LabelEncoder() using the training data, then uses a series of pd.merge joins to map the trained fit/transform encoder values to the test data. When there is a test data value not seen in the training data, the code defaults to the max trained encoder value + 1.
# encode class values as integers
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_y_train = encoder.transform(y_train)
# make a dataframe of the unique train values and their corresponding encoded integers
y_map = pd.DataFrame({'y_train': y_train, 'encoded_y_train': encoded_y_train})
y_map = y_map.drop_duplicates()
# map the unique test values to the trained encoded integers
y_test_df = pd.DataFrame({'y_test': y_test})
y_test_unique = y_test_df.drop_duplicates()
y_join = pd.merge(y_test_unique, y_map,
                  left_on = 'y_test', right_on = 'y_train',
                  how = 'left')
# if the test category is not found in the training category group, then make the
# value the maximum value of the training group + 1
y_join['encoded_y_test'] = np.where(y_join['encoded_y_train'].isnull(),
                                    y_map.shape[0] + 1,
                                    y_join['encoded_y_train']).astype('int')
encoded_y_test = pd.merge(y_test_df, y_join, on = 'y_test', how = 'left') \
    .encoded_y_test.values

I found an easy hack around this issue.
Assuming X is the dataframe of features,
First, we need to create a list of (index, column name) tuples, where the index starts from 0 and the second element is the categorical column name. We can easily accomplish this using enumerate.
cat_cols_enum = list(enumerate(X.select_dtypes(include = ['O']).columns))
Then the idea is to create a list of label encoders whose length is equal to the number of qualitative (categorical) columns present in the dataframe X.
le = [LabelEncoder() for i in range(len(cat_cols_enum))]
The next and last part is fitting each of the label encoders in the list with the unique values of the corresponding categorical column from the list of tuples.
for i in cat_cols_enum: le[i[0]].fit(X[i[1]].value_counts().index)
Now, we can transform the labels to their respective encodings using
for i in cat_cols_enum:
    X[i[1]] = le[i[0]].transform(X[i[1]])

This error occurs when transform encounters a value that LabelEncoder did not see during fit_transform, because that value was not present in the training samples. There are two workarounds: either fit on all the unique values (train and test combined) if you are sure that no new values will arrive later, or use a different encoding method suited to the problem, such as HashingEncoder.
Here is an example for the case where no further new values will appear at test time:
le_id.fit_transform(list(set(df['ID'].unique()).union(set(new_df['ID'].unique()))))
new_df['ID'] = le_id.transform(new_df.ID)
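As an illustration of the "different encoding method" route, here is a minimal sketch that swaps in scikit-learn's OrdinalEncoder (not the HashingEncoder mentioned above), which can assign a fixed code to categories unseen during fit. The sample IDs and the choice of -1 as the unknown code are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and prediction data
df = pd.DataFrame({"ID": ["ABC123", "IDF345", "ABC123"]})
new_df = pd.DataFrame({"ID": ["IDF345", "XYZ789"]})  # XYZ789 is unseen

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# OrdinalEncoder works on 2-D input, hence the double brackets
df["ID_enc"] = enc.fit_transform(df[["ID"]]).ravel().astype(int)
new_df["ID_enc"] = enc.transform(new_df[["ID"]]).ravel().astype(int)
```

Unlike fitting on the union of train and test values, this keeps the encoder fitted on training data only, so it also works when the new values are not known in advance.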

I also encountered the exact same error and was able to fix it.
my array: [[63, 1, 'True' , 1850000000.0, 61666.67]]
When I used the following code, the error occurred
pa_ = preprocessing.LabelEncoder()
pa_.fit([True,False])
data[:,2] = pa_.transform(data[:,2])
The reason for the error was that the true value in the array was of string type,
while I had fitted the encoder on booleans. I changed the fitted values from bool to string, and the error was fixed:
pa_.fit(['True','False'])
I hope my answer is helpful.

I hope this helps someone as it's more recent.
sklearn's fit_transform performs the fit and the transform in a single call on the label column.
To avoid the error raised for unseen values in the Y labels, use:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(Col)
This solves it!

I used
le.fit_transform(Col)
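and I was able to resolve the issue. It performs both the fit and the transform, so the error about unknown values in the test split no longer appears. Note, however, that calling it on the test data refits the encoder, so the integer codes are not guaranteed to match those learned from the training data.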

Extracting tf-idf values and features from TfidfVectorizer and making them into pandas Series

I was extracting tf-idf values for each feature name from a text document (in .csv format where each row entry represents a text message (dtype=str)) using TfidfVectorizer with default parameters. This is what I did:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from pandas import Series
# .csv document has been converted into pandas format
tf = TfidfVectorizer()
X_tf = tf.fit_transform(document)
# get feature names and tf-idf values
feature_names = tf.get_feature_names()
tfidf = tf.idf_
I also used the last two lines to extract feature names and tf-idf values. However, I was also asked to (1) sort features by their tf-idf values in both ascending and descending orders, then by alphabetical order (if multiple features' tf-idf values are tied) and (2) make the output into a pandas Series object using feature name as index, such that the output looks like this (this one in descending order):
feature tf-idf
he 0.031
she 0.047
i 0.068
a 0.084
the 1.527
It seems that I can achieve this simply by matching 'feature_names' and 'tfidf' and sorting them, but I am not sure whether their orders actually match, since 'feature_names' is a list object while 'tfidf' is a numpy array, and I don't really know what sklearn is doing under the hood.
If I want to compile a sorted Series in descending (and ascending) order with the exact feature name as index (sorted in alphabetical order on ties), how should I proceed from the last line of my code? It would be really appreciated if someone could enlighten me on this.
Thank you.

Convert pandas sparse dataframe to sparse numpy matrix for sklearn use?

I have some data, around 400 million rows, and some features are categorical. I apply pandas.get_dummies to do one-hot encoding, and I have to use the sparse=True option because the data is a little big (otherwise exceptions/errors are raised).
result = result.drop(["time", "Ds"], axis=1)
result_encoded = pd.get_dummies(result, columns=["id1", "id2", "id3", "id4"], sparse=True)
Then, I get a sparse dataframe (result_encoded) with 9000 features. After that, I want to run a ridge regression on the data. At first, I tried to feed dataframe.values into sklearn,
train_data = result_encoded.drop(['count'], axis=1).values
but that raised the error: "array is too big".
Then, I just fed the sparse dataframe to sklearn, and a similar error message showed up again.
train_data = result_encoded.drop(['count'], axis=1)
Do I need to consider a different method or preparation of the data so it can be used by sklearn directly?

You should be able to use the experimental .to_coo() method in pandas [1] in the following way:
result_encoded, idx_rows, idx_cols = result_encoded.stack().to_sparse().to_coo()
result_encoded = result_encoded.tocsr()
Instead of taking a DataFrame (rows / columns), this method takes a Series with rows and columns in a MultiIndex (this is why you need the .stack() method). This Series with the MultiIndex needs to be a SparseSeries, and even if your input is a SparseDataFrame, .stack() returns a regular Series. So, you need to use the .to_sparse() method before calling .to_coo().
The Series returned by .stack(), even if it's not a SparseSeries, only contains the elements that are not null, so it shouldn't take more memory than the sparse version (at least with np.nan when the type is np.float).
In general, you'll want the more efficient CSR or CSC format for your sparse scipy array, instead of the simpler COO, so you can convert it with the .tocsr() method.
http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
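SparseSeries and .to_sparse() were removed in pandas 1.0, so on current versions the same conversion can be sketched with the DataFrame.sparse accessor instead. This is a minimal sketch under assumptions: the tiny frame is made up, dtype=float is chosen so the sparse fill value is 0, and every column of the encoded frame must be sparse for the accessor to work (here the frame contains only the dummy columns).

```python
import pandas as pd
import scipy.sparse as sp

# Hypothetical categorical data standing in for the id columns
result = pd.DataFrame({
    "id1": ["a", "b", "a", "c"],
    "id2": ["x", "y", "x", "x"],
})

# sparse=True makes every dummy column a sparse column
result_encoded = pd.get_dummies(result, columns=["id1", "id2"],
                                sparse=True, dtype=float)

# Convert to a scipy COO matrix, then to CSR for efficient row operations
coo = result_encoded.sparse.to_coo()
csr = coo.tocsr()
```

The resulting CSR matrix can be passed to scikit-learn estimators that accept sparse input, such as Ridge, without ever materialising a dense array.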

Know feature names after imputation

I run an sk-learn classifier on a pandas dataframe (X). Since some data is missing, I use sk-learn's imputer like this:
imp=Imputer(strategy='mean',axis=0)
X=imp.fit_transform(X)
After doing that, however, my number of features is decreased, presumably because the imputer just gets rid of the empty columns.
That's fine, except that the imputer transforms my dataframe into a numpy ndarray, and thus I lose the column/feature names. I need them later on to identify the important features (with clf.feature_importances_).
How can I know the names of the features in clf.feature_importances_, if some of the columns of my initial dataframe have been dropped by the imputer?

you can do this:
invalid_mask = np.isnan(imp.statistics_)
valid_mask = np.logical_not(invalid_mask)
valid_idx, = np.where(valid_mask)
Now you have old indexes (Indexes that these columns had in matrix X) for valid columns. You can get feature names by these indexes from list of feature names of old X.
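Put together with a DataFrame, the mask approach might look like the sketch below. It uses SimpleImputer (the replacement for the older Imputer class) and an invented three-column frame in which column "b" is entirely missing; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: column "b" has no observed values at all
X = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],
    "c": [4.0, 5.0, np.nan],
})

imp = SimpleImputer(strategy="mean")
X_imputed = imp.fit_transform(X)  # all-NaN columns are dropped here

# statistics_ is NaN for columns that had no observed values
valid_mask = np.logical_not(np.isnan(imp.statistics_))
kept_names = X.columns[valid_mask]

# Re-attach the surviving column names to the imputed array
X_named = pd.DataFrame(X_imputed, columns=kept_names, index=X.index)
```

The names in kept_names line up with the columns of the imputed array, so they can be paired directly with clf.feature_importances_ after training.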

It's more difficult than it should be. The answer is that SimpleImputer should get an argument, add_indicator=True. Then, after fitting, simple_imputer.indicator_ takes the value of another transformer of the type sklearn.impute.MissingIndicator. This in turn will have a variable features_, which contains the features.
So it's roughly like this:
simple_imputer = SimpleImputer(add_indicator=True)
simple_imputer.fit(X)
print(simple_imputer.indicator_.features_)
I've implemented a thin wrapper around SimpleImputer, called SimpleImputerWithFeatureNames, which gives you feature names. It's available on GitHub.
>> import openml_speed_dating_pipeline_steps as pipeline_steps
>> imputer = pipeline_steps.SimpleImputerWithFeatureNames()
>> imputer.fit(X_train[numeric_features])
>> imputer.get_feature_names()
[...]
