How do I keep track of the columns of the transformed array produced by sklearn.compose.ColumnTransformer? By "keeping track of" I mean that every piece of information required to perform an inverse transform must be shown explicitly. This includes at least the following:
What is the source variable of each column in the output array?
If a column of the output array comes from one-hot encoding of a categorical variable, what is that category?
What is the exact imputed value for each variable?
What is the (mean, stdev) used to standardize each numerical variable? (These may differ from direct calculation because of imputed missing values.)
I am using the same approach based on this answer. My input dataset is also a generic pandas.DataFrame with multiple numerical and categorical columns. Yes, that answer can transform the raw dataset, but I lose track of the columns in the output array. I need this information for peer review, report writing, presentations, and further model-building steps. I've been searching for a systematic approach, but with no luck.
The answer you mentioned is based on this part of scikit-learn.
You can get the answer to your first two questions using the following snippet.
def get_feature_names(columnTransformer):
    # note: in newer scikit-learn versions, transformers expose
    # get_feature_names_out() instead of get_feature_names()
    output_features = []
    for name, pipe, features in columnTransformer.transformers_:
        if name != 'remainder':
            for i in pipe:
                trans_features = []
                if hasattr(i, 'categories_'):
                    trans_features.extend(i.get_feature_names(features))
                else:
                    trans_features = features
            output_features.extend(trans_features)
    return output_features
import pandas as pd
pd.DataFrame(preprocessor.fit_transform(X_train),
             columns=get_feature_names(preprocessor))
transformed_cols = get_feature_names(preprocessor)

def get_original_column(col_index):
    return transformed_cols[col_index].split('_')[0]

get_original_column(3)
# 'embarked'
get_original_column(0)
# 'age'

def get_category(col_index):
    new_col = transformed_cols[col_index].split('_')
    return 'no category' if len(new_col) < 2 else new_col[-1]

print(get_category(3))
# 'Q'
print(get_category(0))
# 'no category'
Tracking whether imputation or scaling has been applied to a feature is not trivial with the current version of scikit-learn.
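That said, the last two questions (the exact imputed values, and the mean/stdev actually used) can be read off the fitted steps themselves. A minimal sketch with made-up data; the transformer name 'num' and step names 'imputer'/'scaler' are assumptions, so substitute whatever names your ColumnTransformer uses:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0],
                  "fare": [7.0, 8.0, 9.0, np.nan]})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                     ("scaler", StandardScaler())])
pre = ColumnTransformer([("num", num_pipe, ["age", "fare"])])
pre.fit(X)

imputer = pre.named_transformers_["num"].named_steps["imputer"]
scaler = pre.named_transformers_["num"].named_steps["scaler"]

# Exact imputed value per column (question 4)
print(dict(zip(["age", "fare"], imputer.statistics_)))
# (mean, stdev) actually used, computed after imputation (question 5)
print(dict(zip(["age", "fare"], zip(scaler.mean_, scaler.scale_))))
```

Because the scaler is fitted downstream of the imputer, `scaler.mean_` and `scaler.scale_` already reflect the imputed values, which is exactly why they may differ from a direct calculation on the raw column.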
I'm working on an exercise on Kaggle, in their module on categorical variables, specifically the one-hot encoding section: https://www.kaggle.com/alexisbcook/categorical-variables
I'm through the entire workbook fine, and there's one last piece I'm trying to work out: the optional piece at the end, applying the one-hot encoder to predict the house sale values. I've worked out the following code, but on the line in bold, OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols])), I'm getting the error that the input contains NaN.
So my first question is: when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column? And my second question is: if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers. Can someone please let me know where I'm going wrong here? Thanks very much!
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
**OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))**
# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
So my first question is, when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column?
NAs are just the absence of data, so you can loosely think of rows with NAs as being incomplete. You may find yourself dealing with a dataset where NAs occur in half of the rows, which will require some clever feature engineering to compensate. Think about it this way: if one-hot encoding is a simple way to represent binary state (e.g. is_male, salary_is_less_than_100000, etc.), then what does NaN/null mean? You have a bit of a Schrödinger's cat on your hands there. You're generally safe to drop NAs as long as it doesn't mangle your dataset size. The amount of data loss you're willing to accept is entirely situation-based (it's probably fine for a practice exercise).
And my second question is, if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers?
May I suggest the following. I cover this topic on my blog; you can check the link at the bottom of this answer. All of my code and logic appears directly below.
# There are various ways to deal with missing data points.
# You can simply drop records if they contain any nulls.
# data.dropna()
# You can fill nulls with zeros
# data.fillna(0)
# You can also fill with mean, median, or do a forward-fill or back-fill.
# The problem with all of these options is that if you have a lot of missing values for one specific feature,
# you won't be able to do very reliable predictive analytics.
# A viable alternative is to impute missing values using some machine learning techniques
# (regression or classification).
import pandas as pd
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# Load data
data = pd.read_csv('C:\\Users\\ryans\\seaborn-data\\titanic.csv')
print(data)
list(data)
data.dtypes
# Now, we will use a simple regression technique to predict the missing values
data_with_null = data[['survived','pclass','sibsp','parch','fare','age']].copy()
data_without_null = data_with_null.dropna()
train_data_x = data_without_null.iloc[:, :5]
train_data_y = data_without_null.iloc[:, 5]
linreg.fit(train_data_x, train_data_y)
test_data = data_with_null.iloc[:, :5]
age_predictions = linreg.predict(test_data)
# Check for nulls, per column and in the entire data set
print(data_with_null.isnull().sum())
print(data_with_null.isnull().sum().sum())
# WOW, 177 NULLS!!
# LET'S IMPUTE THE MISSING VALUES...
# Fill in only the missing ages with the regression predictions,
# leaving the observed ages untouched
missing = data_with_null['age'].isnull()
data_with_null.loc[missing, 'age'] = age_predictions[missing]
# Check for nulls again
print(data_with_null.isnull().sum())
https://github.com/ASH-WICUS/Notebooks/blob/master/Fillna%20with%20Predicted%20Values.ipynb
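For completeness on the second question: sklearn's SimpleImputer is not limited to numbers. With strategy='most_frequent' (or 'constant') it fills string columns too, so you can impute categorical NAs before one-hot encoding instead of dropping rows. A minimal sketch with made-up data; note it fits on the training frame only and merely transforms the test frame, which also avoids the fit_transform-on-X_test issue in the question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", np.nan]})
X_test = pd.DataFrame({"Street": [np.nan, "Grvl"]})

cat_pipe = Pipeline([
    # most_frequent works on strings; NaNs become the modal category
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
cat_pipe.fit(X_train)                          # fit on training data only
encoded_test = cat_pipe.transform(X_test).toarray()
print(encoded_test)
```

Here the NaN in X_test is imputed to "Pave" (the most frequent training value) and then encoded, so the test matrix keeps all its rows.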
One final thought, just in case you don't already know about this. There are two kinds of categorical data:
Ordinal data: the categories have an inherent order (small, medium, large)
When your data is ordered like this, USE LABEL (ORDINAL) ENCODING!
Nominal data: the categories do not have an inherent order (states in the US)
When your data is nominal, and there is no specific order, USE ONE-HOT ENCODING!
https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
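A quick illustration of the two encodings on toy data (OrdinalEncoder is scikit-learn's column-wise counterpart of label encoding):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "small"]})
# Ordinal (label) encoding: preserves an explicit order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal = ord_enc.fit_transform(sizes).ravel()
print(ordinal)  # [0. 1. 2. 0.]

states = pd.DataFrame({"state": ["NY", "CA", "NY"]})
# One-hot encoding: no order implied; columns are sorted categories [CA, NY]
onehot = OneHotEncoder().fit_transform(states).toarray()
print(onehot)
```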
I work with a test system that outputs a large CSV matrix of values which I then process using the Pandas module in Python. The parameters that system uses when testing a given part are governed by a predetermined sequence. A simplified example is shown here:
Raw data frame
However, not all of these steps are desired in the output data. In fact, the rows containing a 'Clock Frequency' value of '3.0MHz' are only included to act as buffer points to allow a climate chamber to reach the intended temperature. I do not wish to include data collected at these parameters in my results.
I found I was pretty easily able to remove these rows from my data frame by using the below code. Note that in this example I am working with a Pandas data frame called 'csvDF'.
tempBuffers = csvDF[csvDF['Clock Frequency']==3e6].index
csvDF.drop(tempBuffers, inplace=True)
This produces the following output:
Data frame with buffer steps removed
The issue with this is that now my 'Sequence Step' column is wrong. I want the data table to appear as if those buffer steps never existed. The sequence steps should be sequential for all non-buffer steps. The desired output is shown below:
Data frame with buffer steps removed and corrected sequence step column
What code do I need to achieve this?
You can try something like this:
n = 3  # number of rows per sequence step
csvDF.reset_index(inplace=True, drop=True)
# after the reset, the row index is 0..len-1, so integer-dividing by n
# rebuilds a consecutive step number
csvDF['Sequence Step'] = csvDF.index // n
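If the number of rows per step is not constant, a variant that renumbers the surviving step labels directly avoids assuming a fixed n. Sketched here on a made-up frame:

```python
import pandas as pd

csvDF = pd.DataFrame({
    "Sequence Step": [1, 1, 2, 2, 3, 3],
    "Clock Frequency": [1e6, 2e6, 3e6, 3e6, 1e6, 2e6],
})
# drop the 3.0 MHz buffer rows, then renumber the remaining steps
csvDF = csvDF[csvDF["Clock Frequency"] != 3e6].reset_index(drop=True)
# factorize maps each distinct surviving step label to 0, 1, 2, ...
csvDF["Sequence Step"] = pd.factorize(csvDF["Sequence Step"])[0] + 1
print(csvDF["Sequence Step"].tolist())  # [1, 1, 2, 2]
```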
So I have two scripts for an artificial neural network on insurance claims: one to train/test, and one to execute going forward. I am done with the first and am developing the second, using real production data as a test of it. The target/class label is a binary 1 or 0. The input data starts as a dataframe of shape (5914, 23), all numeric. I then do df.values.tolist() on it and apply StandardScaler() to all values (other than the first, which is a claim ID); in the process it goes through np.asarray. I then run it through ann_model.predict_proba, which gives me a list of 5,914 pairs of probabilities. Now I want to merge the probabilities (called "predicted_probs") back into the dataframe I had before the tolist(), into a new column on that original dataframe (a column called "Results"), and only for the positive class (the only one I am interested in). I do so via the following code, but I don't know if the order of my results is the same as the order of the dataframe. Is it?
for i in range(0, len(predicted_probs)):
    original_df["Results"] = pd.Series(predicted_probs[i])
    print(predicted_probs[[i], [1]])
Should I be doing it another way? I have to replicate what is done in the training script to expect like-for-like results, hence the StandardScaler(), np.asarray, etc.
Thanks in advance
Your dataframe's shape is (5914, 23) and the output of ann_model.predict_proba has 5914 rows, one per input row and in the same order, so you can rely on the order of your results matching the order of your dataframe. To add the probability of the positive class to the dataframe:
original_df['Results'] = [i[1] for i in predicted_probs]
There is no need to loop through predicted_probs.
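To make the order guarantee concrete, here is a small self-contained sketch; a scikit-learn LogisticRegression stands in for the ANN, but the idea is the same for any predict_proba output:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
original_df = pd.DataFrame({"claim_id": range(6),
                            "x1": rng.normal(size=6),
                            "x2": rng.normal(size=6)})
y = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(original_df[["x1", "x2"]], y)
predicted_probs = model.predict_proba(original_df[["x1", "x2"]])

# predict_proba returns one row per input row, in the same order,
# so the positive-class column can be assigned directly
original_df["Results"] = predicted_probs[:, 1]
print(original_df.shape)  # (6, 4)
```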
I am using sklearn's MultiLabelBinarizer() to binarize multiple columns of the data I use to train my model.
After using it, I noticed it was mixing up my data when inverse transforming it. I created a test set of random values where I fit the data, transform it, and inverse_transform it to get back to the original data.
I ran a simple test in jupyter notebook to show the error:
In the inverse_transformed value it messes up row 1, mixing up the state and the month.
jupyter notebook code
First of all, is there an error in how I use the MultiLabelBinarizer? Is there a different way to achieve the same output?
EDIT:
Thank you to @Nicolas M. for helping me solve my question. I ended up solving this issue like this.
Forgive the rough explanation, but it turned out to be more complicated than I originally thought. I switched to using the LabelBinarizer instead of the MultiLabelBinarizer because it binarizes one column at a time, which avoids the mixed-up inverse transform.
I ended up pickling the label_binarizer defaultdict so I can load it and use it in different modules for my machine learning project.
One thing that might not be trivial is me adding new headers to dataframe I make for each column. It was in the form of column_name + column number. I did this because I needed to inverse transform the data. To do that I searched for the columns that contained the original column name which separated the larger dataframe into the individual column chunks.
Here are some variables that I used and what they mean, for reference:
lb_dict - default dict that stores the different label binarizers.
binarize_df - dataframe that stores the binarized data.
binarized_label - the binarized data for one column.
header - the new headers, of the form: column name + column number.
inverse_df - dataframe that stores the inverse_transformed data.
one_label_list - the list of column names that contain the original column tag.
one_label_df - a new dataframe that only stores the binarized data for one column.
single_label - binarized data that gets inverse_transformed into one column.
In this code, data is the dataframe that I pass to the function.
lb_dict = defaultdict(LabelBinarizer)
# create a placeholder dataframe to join the new binarized data to
binarize_df = pd.DataFrame(['x'] * len(data.index), columns=['place_holder'])
# loop through each column, create a binarizer, fit/transform the data,
# and add the new data to the binarize_df dataframe
for column in data.columns.values.tolist():
    lb_dict[column].fit(data[column])
    binarized_label = lb_dict[column].transform(data[column])
    header = [column + str(i) for i in range(0, len(binarized_label[0]))]
    binarize_df = binarize_df.join(pd.DataFrame(binarized_label, columns=header))
# drop the placeholder column
binarize_df.drop(labels=['place_holder'], axis=1, inplace=True)
Here is the inverse_transform function that I wrote:
inverse_df = pd.DataFrame(['x'] * len(output.index), columns=['place_holder'])
# run through the different output columns that need to be inverse_transformed
for column in output_cols:
    # list the headers whose names contain the original output column name
    one_label_list = [x for x in output.columns.values.tolist() if column in x]
    one_label_df = output[one_label_list]
    # inverse transform the data frame for one label
    single_label = label_binarizer[column].inverse_transform(one_label_df.values)
    # join the output of the single-label df to the entire output df
    inverse_df = inverse_df.join(pd.DataFrame(single_label, columns=[column]))
inverse_df.drop(labels=['place_holder'], axis=1, inplace=True)
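A round-trip check of the per-column idea, sketched on a tiny made-up frame:

```python
import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelBinarizer

data = pd.DataFrame({"State": ["California", "New York", "Alaska"],
                     "Month": ["January", "February", "May"]})

# one LabelBinarizer per column, as in the code above
lb_dict = defaultdict(LabelBinarizer)
encoded = {col: lb_dict[col].fit_transform(data[col]) for col in data}

# each column inverts independently, so nothing gets mixed up
restored = pd.DataFrame({col: lb_dict[col].inverse_transform(encoded[col])
                         for col in data})
print(restored.equals(data))  # True
```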
The issue comes from the data (and, in this case, a bad use of the model). If you create a DataFrame from your MultiLabelBinarizer output, you will have:
You can see that all columns are sorted in ascending order. When you ask it to reconstruct, the model "scans" the values row by row.
So if you take the line one, you have :
1000 - California - January
Now if you take the second one, you have :
750 - February - New York
And so on...
So your month is swapped because of the sorting order. If you renamed the month to "ZFebruary", it would be OK, but still only by luck.
What you should do is train one model per categorical feature and stack the matrices to get your final matrix. To revert it, extract each sub-matrix and call its inverse_transform.
To create 1 model per feature, you can refer to the answer of Napitupulu Jon in this SO question
EDIT 1:
I tried the code from the SO question and it doesn't work, as the number of columns has changed. This is what I have now (but you still have to save the columns for every feature somewhere):
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from collections import defaultdict

data = {
    "State": ["California", "New York", "Alaska", "Arizona", "Alaska", "Arizona"],
    "Month": ["January", "February", "May", "February", "January", "February"],
    "Number": ["1000", "750", "500", "25000", "2000", "1"]
}
df = pd.DataFrame(data)
d = defaultdict(MultiLabelBinarizer)  # dict of feature => model
list_encoded = []  # store the single matrices
for column in df:
    # wrap each value in a list so the whole string is one label;
    # otherwise MultiLabelBinarizer splits the strings into characters
    labels = df[column].apply(lambda value: [value])
    d[column].fit(labels)
    list_encoded.append(d[column].transform(labels))
merged = np.hstack(list_encoded)  # one block of columns per feature
I hope this helps and that the explanation is clear enough,
Nicolas
Okay, I don't know if I phrased it badly or something, but I can't seem to find anything similar here for my problem.
So I have a 2D list, each row representing a case and each column representing a feature (for machine learning). In addition, I have a separated list (column) as labels.
I want to randomly select the rows from the 2D list to train a classifier while using the rest to test for accuracy. Thus I want to be able to know all the indices of rows I used for training to avoid repeats.
I think there are 2 parts of the question:
1) how to randomly select
2) how to get indices
Again, I have no idea why I can't find good info here by searching (maybe I just suck).
Sorry I'm still new to the community so I might have made a lot of format mistake. If you have any suggestion, please let me know.
Here's the part of the code I'm using to get the 2D list:
# 273 = number of cases
# use a list comprehension so each row is a separate list;
# [[0]*len(mega_list)]*273 would create 273 references to the SAME row
feature_list = [[0] * len(mega_list) for _ in range(273)]
# create a counter to use as the row index
link_count = 0
for link in url_list[:-1]:
    # set up the url
    samp_url = 'http://www.mtsamples.com' + link
    samp_url = "%20".join(samp_url.split())
    # soup it for keywords
    samp_soup = BeautifulSoup(urllib2.urlopen(samp_url).read())
    keywords = samp_soup.find('meta')['content']
    keywords = keywords.split(',')
    for keys in keywords:
        if keys in mega_list:
            feature_list[link_count][mega_list.index(keys)] = 1
    # advance to the next row; without this, every case wrote to row 0
    link_count += 1
mega_list: a list of all keywords
feature_list: the 2D list; for any word in mega_list that appears in a case's keywords, that cell is set to 1, otherwise 0
I would store the data in a pandas data frame instead of a 2D list. If I understand your data correctly, you could do that like this:
import pandas as pd
df = pd.DataFrame(feature_list, columns = mega_list)
I don't see any mention of a dependent variable, but I'm assuming you have one, because you mentioned a classifier algorithm. If your dependent variable is called "Y" and is in a list format with indices that align with your features, then this code will work for you:
# train_test_split moved from sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df, Y, test_size=0.8, random_state=0)
As I understand the problem, you have a list and you want to sample from it and save the indices for future use.
See: https://docs.python.org/2/library/random.html
You could do random.sample(xrange(sizeoflist), sizeofsample), which returns the indices of your sample. You can then use that sample for training and skip over those indices (or get fancy and do a set difference) for validation.
Hope this helps
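A sketch of that suggestion in Python 3 (where range replaces xrange), using made-up data:

```python
import random

data = [[i, i * 2] for i in range(10)]   # stand-in 2D feature list
labels = [i % 2 for i in range(10)]      # stand-in label column

random.seed(0)
# sample row indices for training; the rest become the test set
train_idx = random.sample(range(len(data)), 7)
test_idx = sorted(set(range(len(data))) - set(train_idx))  # set difference

train_x = [data[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]
test_x = [data[i] for i in test_idx]

print(len(train_x), len(test_x))  # 7 3
```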