I have a model that runs the following:
import pandas as pd
import numpy as np
# initialize list of lists
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
, ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
, ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category'])
print(df.head(2))
Name Attempts Score Category
0 tom 10 1 a
1 tom 15 5 a
Then I have created a dummy df to use in the model using the following code:
from sklearn.linear_model import LogisticRegression
df_dum = pd.get_dummies(df)
print(df_dum.head(2))
Attempts Score Name_matt Name_tom Category_a Category_b Category_c
0 10 1 0 1 1 0 0
1 15 5 0 1 1 0 0
Then I have created the following model:
#Model
X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values
#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]
#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)
df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])
print(dfpredictions)
Name Attempts Score Category predictions
14 matt 18 1 a 1.0
15 matt 15 6 a 1.0
16 matt 17 3 a 1.0
17 matt 14 7 c 1.0
18 matt 16 6 b 1.0
19 matt 10 2 b 1.0
Now I have new data which i would like to predict:
newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]
newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category'])
print(newdf)
Name Attempts Category
0 tom 10 a
1 tom 15 a
2 tom 14 a
Then create dummies and run prediction
newpredict = pd.get_dummies(newdf)
predict = model.predict(newpredict)
Output:
ValueError: X has 3 features per sample; expecting 6
Which makes sense because there are no categories b and c and no name called matt.
My question is how is the best way to set this model up given my new data wont always have the full set of columns used in the original data. Each day i have new data so I'm not quite sure of the most efficient and error free way.
This is an example data - my dataset has 2000 columns when running pd.get_dummies. Thanks very much!
Let me explain Nicolas and BlueSkyz's recommendation a bit more in detail.
pd.get_dummies is useful when you are sure that there will not be any new categories for a specific categorical variable in production/new data set, e.g. Gender, Products, etc. based on your Company or Database's internal data classification/consistency rules.
However, for the majority of machine learning tasks where you can expect to have new categories in the future which were not used in model training, sklearn.OneHotEncoder should be the standard choice. handle_unknown parameter of sklearn.OneHotEncoder can be set to 'ignore' to do just that: ignore new categories when applying the encoder in future. From the documentation:
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None
The full flow based on LabelEncoding and OneHotEncoding for your example is as below:
# Create a categorical boolean mask
categorical_feature_mask = df.dtypes == object
# Filter out the categorical columns into a list for easy reference later on in case you have more than a couple categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()
# Instantiate the OneHotEncoder Object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse = False)
# Apply ohe on data
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df = pd.DataFrame(cat_ohe, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns = categorical_cols, axis=1)
# The following code is for your newdf after training and testing on original df
# Apply ohe on newdf
cat_ohe_new = ohe.transform(newdf[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df_new = pd.DataFrame(cat_ohe_new, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns = categorical_cols, axis=1)
# predict on df_ohe_new
predict = model.predict(df_ohe_new)
Output (that you can assign back to newdf):
array([1, 1, 1])
However, if you really want to use pd.get_dummies only, then the following can work as well:
newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)
The above code snippet will make sure that you have the same columns in your new dummies df (newpredict) as the original df_dum (with 0 values) and drop the 'Score' column. Output here is same as above. This code will ensure that any categorical values present in the new data set but now in the original trained data will be removed while keeping the order of the columns same as that in the original df.
EDIT:
One thing I forgot to add is that pd.get_dummies is usually much faster to execute than sklearn.OneHotEncoder
Related
I have a time series index with few variables and humidity reading. I have already trained an ML model to predict Humidity values based on X, Y and Z. Now, when I load the saved model using pickle, I would like to fill the Humidity missing values using X, Y and Z. However, it should consider the fact that X, Y and Z themselves shouldnt be missing.
Time X Y Z Humidity
1/2/2017 13:00 31 22 21 48
1/2/2017 14:00 NaN 12 NaN NaN
1/2/2017 15:00 25 55 33 NaN
In this example, the last row of humidity will be filled using the model. Whereas the 2nd row should not be predicted by the model since X and Z is also missing.
I have tried this so far:
with open('model_pickle','rb') as f:
mp = pickle.load(f)
for i, value in enumerate(df['Humidity'].values):
if np.isnan(value):
df['Humidity'][i] = mp.predict(df['X'][i],df['Y'][i],df['Z'][i])
This gave me an error 'predict() takes from 2 to 5 positional arguments but 6 were given' and also I did not consider X, Y and Z column values. Below is the code I used to train the model and save it to a file:
df = df.dropna()
dfTest = df.loc['2017-01-01':'2019-02-28']
dfTrain = df.loc['2019-03-01':'2019-03-18']
features = [ 'X', 'Y', 'Z']
train_X = dfTrain[features]
train_y = dfTrain.Humidity
test_X = dfTest[features]
test_y = dfTest.Humidity
model = xgb.XGBRegressor(max_depth=10,learning_rate=0.07)
model.fit(train_X,train_y)
predXGB = model.predict(test_X)
mae = mean_absolute_error(predXGB,test_y)
import pickle
with open('model_pickle','wb') as f:
pickle.dump(model,f)
I had no errors during training and saving the model.
For prediction, since you want to make sure you have all the X, Y, Z values, you can do,
df = df.dropna(subset = ["X", "Y", "Z"])
And now you can predict the values for the remaining valid examples as,
# where features = ["X", "Y", "Z"]
df['Humidity'] = mp.predict(df[features])
mp.predict will return prediction for all the rows, so there is no need to predict iteratively.
Edit:.
For inference, say you have a dataframe df, you can do,
# Get rows with missing Humidity where it can be predicted.
df_inference = df[df.Humidity.isnull()]
# remaining rows
df = df[df.Humidity.notnull()]
# This might still have rows with missing features.
# Since you cannot infer with missing features, Remove them too and add them to remaining rows
df = df.append(df_inference[df_inference[features].isnull().any(1)])
# and remove them from df_inference
df_inference = df_inference[~df_inference[features].isnull().any(1)]
#Now you can infer on these rows
df_inference['Humidity'] = mp.predict(df_inference[features])
# Now you can merge this back to the remaining rows to get the original number of rows and sort the rows by index
df = df.append(df_inference)
df.sort_index()
Can you report the error?
Anyway, if you have missing values you have different options to deal with them. You can either discard the datapoint entirely or try and infer the missing parts with a method of choice: mean, interpolation, etc.
Pandas documentation has a nice guide on how to deal with them:
https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
Try
df['Humidity'][i] = mp.predict(df[['X', 'Y', 'Z']][i])
This way you the data is passed as a single argument, as function expects. The way you wrote it, you split your data to 3 arguments.
I'm estimating an OLS model, as seen below. I need the coefficients on the categorical variable along with their values.
Here's my code:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 1), columns=list('A'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,5,5,5,5,5,6,6,6,6,6]
df['groupid'] = df['groupid'].astype('int')
###Fixed effects models
FE_ols = smf.ols(formula = 'A ~ C(groupid) - 1', data=df).fit()
FE_coeffs = FE_ols.params #Save coeffs
FE_coeffs.GroupID = FE_coeffs.index #Extract value of GroupID
FE_coeffs.GroupID = FE_coeffs.GroupID.str.extract('(\d+)') #Parse number from string
I'm able to extract the coefficients on the dummy variables. I put them in a new data frame.
C(groupid)[1] 0.2329694463342642
C(groupid)[2] 0.7567034333090062
C(groupid)[3] 0.31355791920072623
C(groupid)[5] -0.05131898650395289
C(groupid)[6] 0.31757453138500547
However, I want the data frame to be like:
1 0.2329694463342642
2 0.7567034333090062
3 0.31355791920072623
5 -0.05131898650395289
6 0.31757453138500547
The code seems to work, including the parsing. When I do this on Jupyter, it even shows the correct output. But the change isn't saved onto the data frame. There doesn't seem to be a inplace=True kind of command.
Will appreciate any help.
FE_coeffs is a Series, so adding an attribute GroupID as if it's adding a column is the wrong direction. Instead, just overwrite the index with the extracted integer values:
In [80]: FE_coeffs = FE_ols.params.copy()
In [81]: FE_coeffs.index = FE_coeffs.index.str.extract("(\d+)", expand=False).astype(int)
In [82]: FE_coeffs
Out[82]:
1 0.232969
2 0.756703
3 0.313558
5 -0.051319
6 0.317575
dtype: float64
I'm trying to run a linear regression in python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to do one hot encoding for the non-numeric columns and attach the new, numeric, columns to the old dataframe and drop the non-numeric columns. This is done on both the training data and test data.
I then took the intersection of the two columns features (since I had some encodings that were only located in the testing data). Afterwards, it goes into a linear regression. The code is the following:
non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)
train = train._get_numeric_data()
train.fillna(0, inplace = True)
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)
test = test._get_numeric_data()
test.fillna(0, inplace = True)
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']
lm = LinearRegression(normalize = False)
lm.fit(X, y)
import numpy
predictions = numpy.absolute(lm.predict(test).round(decimals = 2))
The issue that I'm having is that I get these absurdly high Sale Prices as output, somewhere in the hundreds of millions of dollars. Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars. I'm having trouble figuring out what changed.
Also, if there is a better way to do this I'd be eager to hear about it.
You seem to encounter collinearity due to introduction of categorical variables in feature column, since sum of the feature columns of "one-hot" encoded variables is always 1.
If you have one categorical variable , you need to set "fit_intercept=False" in your linear Regression (or drop one of the feature column of one-hot coded variable)
If you have more than one categorical variables, you need to drop one feature column for each of the category to break collinearity.
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
In [72]:
df = pd.read_csv('/home/siva/anaconda3/data.csv')
df
Out[72]:
C1 C2 C3 y
0 1 0 0 12.4
1 1 0 0 11.9
2 0 1 0 8.3
3 0 1 0 3.1
4 0 0 1 5.4
5 0 0 1 6.2
In [73]:
y
X = df.iloc[:,0:3]
y = df.iloc[:,-1]
In [74]:
reg = LinearRegression()
reg.fit(X,y)
Out[74]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [75]:
_
reg.coef_,reg.intercept_
Out[75]:
(array([ 4.26666667, -2.18333333, -2.08333333]), 7.8833333333333346)
we find that co_efficients for C1, C2 , C3 do not make sense according to given X.
In [76]:
reg1 = LinearRegression(fit_intercept=False)
reg1.fit(X,y)
Out[76]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)
In [77]:
reg1.coef_
Out[77]:
array([ 12.15, 5.7 , 5.8 ])
we find that co_efficients makes much more sense when the fit_intercept was set to False
A detailed explanation for a similar question at below.
https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn
I posted this at the stats site and Ami Tavory pointed out that the get_dummies should be run on the merged train and test dataframe to ensure that the same dummy variables were set up in both dataframes. This solved the issue.
I am training an sklearn.tree.DecisionTreeClassifier. I start out with a pandas.core.frame.DataFrame. Some of the columns of this data frame are strings that really should be categorical. For example, 'Color' is one such column and has values such as 'black', 'white', 'red', and so on. So I convert this column to be of type category like this:
data['Color'] = data['Color'].astype('category')
This works just fine. Now I split my data frame using sklearn.cross_validation.train_test_split, like this:
X = data.drop(['OutcomeType'], axis=1)
y = data['OutcomeType']
X_train, X_test, y_train, y_test = train_test_split(X, y)
Now X_train has type numpy.ndarray. However, the 'Color' values are no longer categorical, they are back to being strings.
So when I make the following calls:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
I get the following error:
ValueError: could not convert string to float: Black
What do I need to do to get this working correctly?
If you want to convert your categorical column to an integer, you can use data.Color.cat.codes; this uses data.Color.cat.categories to perform the mapping (the i'th array element gets mapped to the integer i)
As ayhan said, a workaround would be to create dummy features from your 'Color' variable (which are pretty common use with decision trees / RF).
You could use something like this :
def feature_to_dummy(df, column, drop=False):
''' take a serie from a dataframe,
convert it to dummy and name it like feature_value
- df is a dataframe
- column is the name of the column to be transformed
- if drop is true, the serie is removed from dataframe'''
tmp = pd.get_dummies(df[column], prefix=column, prefix_sep='_')
df = pd.concat([df, tmp], axis=1, join_axes=[df.index])
if drop:
del df[column]
return df
See documentation for pandas.get_dummies
Example
df
Out[1]:
color
0 red
1 black
2 green
df_dummy = feature_to_dummy(df, 'color', drop=True)
df_dummy
Out[2]:
color_black color_green color_red
0 0 0 1
1 1 0 0
2 0 1 0
dataset is pandas dataframe. This is sklearn.cluster.KMeans
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A,B,C are indices
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric,
# Convert DataFrame to matrix
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pandas.DataFrame([dataset.index,labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has an homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data with sklearn.preprocessing.StandardScaler for instance.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).