Dummy variables when not all categories are present - python

I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.
What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus creates the appropriate number of dummy variables. However, in the problem I'm working on right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear.
My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?
Something that would make this:
categories = ['a', 'b', 'c']
  cat
1   a
2   b
3   a
Become this:
  cat_a  cat_b  cat_c
1     1      0      0
2     0      1      0
3     1      0      0

TL;DR:
pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))
Older pandas: pd.get_dummies(cat.astype('category', categories=categories))
is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?
Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:
In [1]: import pandas as pd
In [2]: possible_categories = list('abc')
In [3]: dtype = pd.CategoricalDtype(categories=possible_categories)
In [4]: cat = pd.Series(list('aba'), dtype=dtype)
In [5]: cat
Out[5]:
0 a
1 b
2 a
dtype: category
Categories (3, object): [a, b, c]
Then, get_dummies will do exactly what you want!
In [6]: pd.get_dummies(cat)
Out[6]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
There are a bunch of other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.
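For instance, two equivalent ways to build the same categorical Series (a quick sketch, assuming pandas >= 0.21 for CategoricalDtype):
import pandas as pd
dtype = pd.CategoricalDtype(categories=list('abc'))
# 1) convert an existing Series with astype
cat = pd.Series(list('aba')).astype(dtype)
# 2) build it directly from a pd.Categorical
cat = pd.Series(pd.Categorical(list('aba'), categories=list('abc')))
Either way, pd.get_dummies(cat) produces the columns a, b and c.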
EDIT:
I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).
For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.
It looks like pandas 0.21.0 added CategoricalDtype, and creating categoricals that explicitly list their categories via astype('category', categories=...) as in the original answer was deprecated; I'm not quite sure in which version.

Using transpose and reindex
import pandas as pd
cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
dummies = pd.get_dummies(df, prefix='', prefix_sep='')
dummies = dummies.T.reindex(cats).T.fillna(0)
print(dummies)
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0

Try this:
In[1]: import pandas as pd
cats = ["a", "b", "c"]
In[2]: df = pd.DataFrame({"cat": ["a", "b", "a"]})
In[3]: pd.concat((pd.get_dummies(df.cat), pd.DataFrame(columns=cats))).fillna(0)
Out[3]:
a b c
0 1.0 0.0 0
1 0.0 1.0 0
2 1.0 0.0 0

I did ask this on the pandas GitHub. Turns out it is really easy to get around when you define the column as a Categorical that lists all the possible categories.
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])
get_dummies() will do the rest then as expected.
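A minimal sketch of that approach (column and category names are just placeholders):
import pandas as pd
df = pd.DataFrame({'col': ['a', 'b', 'a']})
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])
pd.get_dummies(df['col'])
#    a  b  c  d
# 0  1  0  0  0
# 1  0  1  0  0
# 2  1  0  0  0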

I don't think get_dummies provides this out of the box; it only allows for creating an extra column that highlights NaN values.
To add the missing columns yourself, you could use pd.concat along axis=0 to vertically 'stack' the DataFrames (the dummy columns plus a DataFrame id), which automatically creates any missing columns, then use fillna(0) to replace the missing values, and finally use .groupby('id') to separate the DataFrames again.
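A rough sketch of that idea (the frame contents and the 'id' values are made up for illustration):
import pandas as pd
df1 = pd.get_dummies(pd.DataFrame({'cat': ['a', 'b']})).assign(id=1)
df2 = pd.get_dummies(pd.DataFrame({'cat': ['a', 'c']})).assign(id=2)
# Stacking vertically creates the union of all dummy columns;
# columns missing from either frame become NaN and are filled with 0.
stacked = pd.concat([df1, df2], axis=0).fillna(0)
# Separate the frames again by their id.
frames = {k: g.drop(columns='id') for k, g in stacked.groupby('id')}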

Adding the missing categories to the test set:
# Get columns that are in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add the missing columns to the test set with a default value of 0
for c in missing_cols:
    test[c] = 0
# Ensure the columns in the test set are in the same order as in the training set
test = test[train.columns]
Notice that this code also removes columns that result from categories present in the test dataset but not in the training dataset.
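On reasonably recent pandas versions, the same alignment can be done in one line with reindex; note that, like the loop above, this also drops test columns that are absent from the training set:
test = test.reindex(columns=train.columns, fill_value=0)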

As suggested by others, converting your categorical features to the 'category' data type should resolve the unseen-label issue with get_dummies.
# Your DataFrame (df)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
X = df.loc[:, df.columns != 'label']
Y = df.loc[:, df.columns == 'label']
# Split the data into 70% training and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
# Convert categorical columns to the 'category' dtype, taking the full
# set of categories from the complete DataFrame
for col in df.select_dtypes(include=['object']).columns:
    dtype = pd.CategoricalDtype(categories=df[col].unique())
    X_train[col] = X_train[col].astype(dtype)
    X_test[col] = X_test[col].astype(dtype)
# Now, use get_dummies on training and test data and we will get the same set of columns
X_train = pd.get_dummies(X_train, columns=["Categorical_Columns"])
X_test = pd.get_dummies(X_test, columns=["Categorical_Columns"])

The shorter the better:
import pandas as pd
cats = pd.Index(['a', 'b', 'c'])
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
pd.get_dummies(df, prefix='', prefix_sep='').reindex(columns = cats, fill_value=0)
Result:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
Notes:
cats needs to be a pandas Index
prefix='' and prefix_sep='' need to be set in order to keep the category names as you defined them in the first place; otherwise get_dummies converts them into cat_a, cat_b and cat_c. To me this is better because it is explicit.
use fill_value=0 to convert the NaN in column c. Alternatively, you can use fillna(0) at the end of the statement (I don't know which is faster).
Here's a shorter-shorter version (changed the Index values):
import pandas as pd
cats = pd.Index(['cat_a', 'cat_b', 'cat_c'])
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
pd.get_dummies(df).reindex(columns = cats, fill_value=0)
Result:
cat_a cat_b cat_c
0 1 0 0
1 0 1 0
2 1 0 0
Bonus track!
I imagine you have the categories because you did a previous dummy/one-hot encoding on training data. You can save the original encoding (.columns) and then apply it at production time:
cats = pd.Index(['cat_a', 'cat_b', 'cat_c']) # it might come from the original onehot encoding (df_ohe.columns)
import pickle
with open('cats.pickle', 'wb') as handle:
    pickle.dump(cats, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('cats.pickle', 'rb') as handle:
    saved_cats = pickle.load(handle)
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
pd.get_dummies(df).reindex(columns = saved_cats, fill_value=0)
Result:
cat_a cat_b cat_c
0 1 0 0
1 0 1 0
2 1 0 0

If you know your categories you can first apply pd.get_dummies() as you suggested and add the missing category columns afterwards.
This will create your example with the missing cat_c:
import pandas as pd
categories = ['a', 'b', 'c']
df = pd.DataFrame(list('aba'), columns=['cat'])
df = pd.get_dummies(df)
print(df)
cat_a cat_b
0 1 0
1 0 1
2 1 0
Now simply add the missing category columns with a union operation (as suggested here).
possible_categories = ['cat_' + cat for cat in categories]
df = df.reindex(df.columns.union(possible_categories, sort=False), axis=1, fill_value=0)
print(df)
cat_a cat_b cat_c
0 1 0 0
1 0 1 0
2 1 0 0

I was recently looking to solve this same issue, but working with a multi-column dataframe and with two datasets (a train set and test set for a machine learning task). The test dataframe had the same categorical columns as the train dataframe, but some of these columns had missing categories that were present in the train dataframe.
I did not want to manually define all the possible categories for every column. Instead, I combined the train and test dataframes into one, called get_dummies, and then split that back into two.
# train_cat, test_cat are DataFrames instantiated elsewhere
n_train = train_cat.shape[0]
train_test_cat = pd.concat([train_cat, test_cat])
train_test_cat = pd.get_dummies(train_test_cat)
train_cat = train_test_cat.iloc[:n_train, :]
test_cat = train_test_cat.iloc[n_train:, :]
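A variant of the same idea that avoids relying on row counts is to tag the pieces with keys when concatenating (a sketch, reusing the names from the snippet above):
train_test_cat = pd.concat([train_cat, test_cat], keys=['train', 'test'])
train_test_cat = pd.get_dummies(train_test_cat)
train_cat = train_test_cat.xs('train')
test_cat = train_test_cat.xs('test')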

Related

add/combine columns after searching in a DataFrame

I'm trying to copy data from different columns to a particular column in the same DataFrame.
[table lost in extraction: a sample DataFrame with source columns col1A, col2A, colB and target columns CT, CW, CH]
But prior to that, I wanted to check whether those columns (col1A, col2A, colB) exist in the DataFrame, group the ones that are present, and move the grouped data to the relevant columns (CT, CH, etc.) like,
[table lost in extraction: the desired result, with the grouped data moved into columns CH, CW, CT]
I did,
col_list1 = ['ColA', 'ColB', 'ColC']
test1 = any([i in df.columns for i in col_list1])
if test1 == True:
    df['CH'] = df['Col1A'] + df['Col2A']
    df['CT'] = df['ColB']
but this code is throwing a KeyError. I want it to ignore columns that are not present and add only those that are present.
IIUC, you can use Python set or Series.isin to find the common columns
cols = list(set(col_list1) & set(df.columns))
# or
cols = df.columns[df.columns.isin(col_list1)]
df['CH'] = df[cols].sum(axis=1)
Instead of just concatenating the columns with +, collect them into a list and use np.sum with axis=1 (assuming numpy is imported as np):
df['CH'] = np.sum([df[c] for c in col_list1 if c in df], axis=1)

Add multi level column to dataframe

At the beginning, I'd like to add a multilevel column to an empty dataframe.
df = pd.DataFrame({"nodes": list(range(1, 5, 2))})
df.set_index("nodes", inplace=True)
So this is the dataframe to start with (still empty):
>>> df
nodes
1
3
Now I'd like to add a first multilevel column.
I tried the following:
new_df = pd.DataFrame.from_dict(dict(zip(df.index, [1, 2])), orient="index",
                                columns=["value"])
df = pd.concat([new_df], axis=1, keys=["test"])
Now the dataframe df looks like this:
>>> df
   test
  value
1     1
3     2
To add another column, I've done something similar.
new_df2 = pd.DataFrame.from_dict(dict(zip(df.index, [3, 4])), orient="index",
                                 columns=[("test2", "value2")])
df = pd.concat([df, new_df2], axis=1)
df.index.name = "nodes"
So the desired dataframe looks like this:
>>> df
       test  test2
nodes value value2
1         1      3
3         2      4
This way of adding multilevel columns seems a bit strange. Is there a better way of doing so?
Create a MultiIndex on the columns by storing your DataFrames in a dict, then concat along axis=1. The keys of the dict become levels of the column MultiIndex (tuple keys add multiple levels depending on their length, scalar keys add a single level) and the DataFrame columns stay as is. Alignment is enforced on the row Index.
import pandas as pd
d = {}
d[('foo', 'bar')] = pd.DataFrame({'val': [1,2,3]}).rename_axis(index='nodes')
d[('foo2', 'bar2')] = pd.DataFrame({'val2': [4,5,6]}).rename_axis(index='nodes')
d[('foo2', 'bar1')] = pd.DataFrame({'val2': [7,8,9]}).rename_axis(index='nodes')
pd.concat(d, axis=1)
      foo foo2
      bar bar2 bar1
      val val2 val2
nodes
0       1    4    7
1       2    5    8
2       3    6    9

fillna() for Multi-Index Pandas DataFrame

I have a multi-index Pandas dataframe and I want to use ffill() to fill any NaNs in certain columns. The following code shows the structure of the sample dataframe and the result of ffill().
import numpy as np
import pandas as pd
room = ['A', 'B']
val = range(3)
df = pd.DataFrame(columns=pd.MultiIndex.from_product([room, val]),
                  data=np.random.randn(3, 6))
df.loc[1, ('B', 0)] = np.nan
# print(df.loc[1, ('B', 0)])
display(df)
df = df.ffill(axis=1)
display(df)
What I was hoping to get is that the NaN at [1,('B',0)] is replaced with -0.392674 and not with -1.349675.
Generally, I want to be able to ffill() from the corresponding columns from level 1 ([0,1,2]).
How do I achieve this?
I think you are looking for groupby + fillna:
df = df.groupby(level=1, axis=1).fillna(method='ffill')
df
Out[496]:
A B
0 1 2 0 1 2
0 -0.177358 -1.531091 -0.945004 1.665143 0.602459 -0.008192
1 -0.006995 0.472267 -0.859471 -0.006995 -0.601538 -0.410391
2 0.101494 1.031941 0.499288 0.804391 -0.224750 -0.778403
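On newer pandas (2.x), groupby with axis=1 is deprecated; transposing first and grouping on the row index should behave the same (a sketch, not verified on every version):
df = df.T.groupby(level=1).ffill().T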

pandas dataframe with identical column names - is it valid process?

I was able to produce a pandas dataframe with identical column names.
Is this normal for a pandas DataFrame?
How can I choose only one of the two columns?
Selecting by the identical name returns, as a result, both columns of the DataFrame as output.
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
You can use .iloc to access either column.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns, just like you did when you set them both to 'id', using:
dataset_new.columns = ['id_1', 'id_2']
df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
df.rename(columns={0: 'id', 1: 'id'}, inplace=True)
# With duplicate names, select a single column by position:
print(df.iloc[:, [1]])

Check which columns in DataFrame are Categorical

I am new to Pandas... I want a simple and generic way to find which columns are categorical in my DataFrame, when I don't manually specify each column type, unlike in this SO question. The df is created with:
import pandas as pd
df = pd.read_csv("test.csv", header=None)
e.g.
          0         1         2         3       4
0  1.539240  0.423437 -0.687014   Chicago  Safari
1  0.815336  0.913623  1.800160    Boston  Safari
2  0.821214 -0.824839  0.483724  New York  Safari
UPDATE (2018/02/04) The question assumes numerical columns are NOT categorical; @Zero's accepted answer solves this.
BE CAREFUL - as @Sagarkar's comment points out, that's not always true. The difficulty is that Data Types and Categorical/Ordinal/Nominal types are orthogonal concepts, thus mapping between them isn't straightforward. @Jeff's answer below specifies the precise manner to achieve the manual mapping.
You could use df._get_numeric_data() to get the numeric columns and then find the categorical columns from the rest
In [66]: cols = df.columns
In [67]: num_cols = df._get_numeric_data().columns
In [68]: num_cols
Out[68]: Index([u'0', u'1', u'2'], dtype='object')
In [69]: list(set(cols) - set(num_cols))
Out[69]: ['3', '4']
The way I found was updating to Pandas v0.16.0, then excluding number dtypes with:
df.select_dtypes(exclude=["number","bool_","object_"])
This works, provided no types are changed and no more are added to NumPy. @Jeff's suggestion in the question's comments was include=["category"], but that didn't seem to work.
NumPy Types: link
For posterity. The canonical method to select dtypes is .select_dtypes. You can specify an actual numpy dtype or convertible, or 'category', which is not a numpy dtype.
In [1]: df = pd.DataFrame({'A': pd.Series(range(3)).astype('category'), 'B': range(3), 'C': list('abc'), 'D': np.random.randn(3)})
In [2]: df
Out[2]:
A B C D
0 0 0 a 0.141296
1 1 1 b 0.939059
2 2 2 c -2.305019
In [3]: df.select_dtypes(include=['category'])
Out[3]:
A
0 0
1 1
2 2
In [4]: df.select_dtypes(include=['object'])
Out[4]:
C
0 a
1 b
2 c
In [5]: df.select_dtypes(include=['object']).dtypes
Out[5]:
C object
dtype: object
In [6]: df.select_dtypes(include=['category','int']).dtypes
Out[6]:
A category
B int64
dtype: object
In [7]: df.select_dtypes(include=['category','int','float']).dtypes
Out[7]:
A category
B int64
D float64
dtype: object
You can get the list of categorical columns using this code:
dfName.select_dtypes(include=['object']).columns.tolist()
And, intuitively, for numerical columns:
dfName.select_dtypes(exclude=['object']).columns.tolist()
Hope that helps.
Select categorical column names:
cat_features = [i for i in df.columns if df.dtypes[i] == 'object']
# Get categorical and numerical variables
numCols = X.select_dtypes("number").columns
catCols = X.select_dtypes("object").columns
numCols = list(set(numCols))
catCols = list(set(catCols))
numeric_var = [key for key in dict(df.dtypes)
               if dict(df.dtypes)[key]
               in ['float64', 'float32', 'int32', 'int64']]  # Numeric variables
cat_var = [key for key in dict(df.dtypes)
           if dict(df.dtypes)[key] in ['object']]  # Categorical variables
Often columns get the pandas dtype of string (or "object") or category. Better to include both, in case the columns you look for don't get listed under the category dtype.
dataframe.select_dtypes(include=['object','category']).columns.tolist()
You don't need to query the data if you are just interested in which columns are of what type.
The fastest method (when %%timeit-ing it) is:
df.dtypes[df.dtypes == 'category'].index
(this will give you a pandas' Index. You can .tolist() to get a list out of it, if you need that.)
This works because df.dtypes is a pd.Series of strings (its own dtype is 'object'), so you can actually just select for the type that you need with normal pandas querying.
You don't have your categorical types as 'category' but as simple strings ('object')? Then just:
df.dtypes[df.dtypes == 'object'].index
Do you have a mix of 'object' and 'category'? Then use isin like you would do normally to query for multiple matches:
df.dtypes[df.dtypes.isin(['object','category'])].index
Use .dtypes
In [10]: df.dtypes
Out[10]:
0 float64
1 float64
2 float64
3 object
4 object
dtype: object
Use pandas.DataFrame.select_dtypes. Categorical dtypes can be found with the 'category' flag; for strings you might use the numpy object dtype.
More info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html
Example:
import pandas as pd
df = pd.DataFrame({'Integer': [1, 2] * 3,'Bool': [True, False] * 3,'Float': [1.0, 2.0] * 3,'String': ['Dog', 'Cat'] * 3})
df
Out[1]:
Integer Bool Float String
0 1 True 1.0 Dog
1 2 False 2.0 Cat
2 1 True 1.0 Dog
3 2 False 2.0 Cat
4 1 True 1.0 Dog
5 2 False 2.0 Cat
df.select_dtypes(include=['category', object]).columns
Out[2]:
Index(['String'], dtype='object')
I have faced a similar obstacle where categorizing variables was a challenge. However, I came up with some approaches based on the nature of the data, which give a general and flexible answer to your issue as well as to future data.
Normally, categorization of data is done on the basis of its datatype, which can sometimes result in wrong analysis. (This is usually done with df.select_dtypes(include=['object', 'category']).)
Approach:
The approach is to view the data not on a column level but on a row level: count the distinct values, which automatically distinguishes categorical variables from numerical ones. A column is treated as categorical if its count of unique values does not exceed a certain threshold (which is for you to decide, based on how many categories you presume a column can have).
For example, if ['Dog', 'Cat', 'Bird', 'Fish', 'Reptile'] makes up five unique categorical values for a particular column and the number of distinct values in that column doesn't exceed those five, the column falls under categorical variables; if the number of distinct values exceeds five, it falls under numerical variables.
if [col for col in df.columns if len(df[col].unique()) <= 5]:
    cat_var = [col for col in df.columns if len(df[col].unique()) <= 5]
elif [col for col in df.columns if len(df[col].unique()) > 5]:
    num_var = [col for col in df.columns if len(df[col].unique()) > 5]
# where 5 is the presumed number of categories per column; adjust as needed
I have used if and elif for better illustration; there is no need for that, and you can use the list comprehensions inside the conditions directly.
categorical_values = (df.dtypes == 'object')
categorical_variables = [categorical_values.index[ind]
                         for ind, val in enumerate(categorical_values) if val == True]
In the first line of code, we obtain a Series that tells us, with a Boolean value for each column, whether or not it is of object type.
In the second line, we use a list comprehension with enumerate (iterating through index and value), so that we can easily find the columns of categorical type and append them to the categorical_variables list.
First, we can segregate the data frame by the default types inferred when we read the dataset. This will list out all the different types and the corresponding columns.
for types in data.dtypes.unique():
    print(types)
    print(data.select_dtypes(types).columns)
This code will get all categorical variables:
cat_cols = [col for col in df.columns if col not in df.describe().columns]
This will give an array of all the categorical variables in a dataframe.
dataset.select_dtypes(include=['O']).columns.values
# Import packages
import numpy as np
import pandas as pd
# Data
df = pd.DataFrame({"Country" : ["France", "Spain", "Germany", "Spain", "Germany", "France"],
"Age" : [34, 27, 30, 32, 42, 30],
"Purchased" : ["No", "Yes", "No", "No", "Yes", "Yes"]})
df
Out[1]:
Country Age Purchased
0 France 34 No
1 Spain 27 Yes
2 Germany 30 No
3 Spain 32 No
4 Germany 42 Yes
5 France 30 Yes
# Checking data type
df.dtypes
Out[2]:
Country object
Age int64
Purchased object
dtype: object
# Saving CATEGORICAL Variables
cat_col = [c for i, c in enumerate(df.columns) if df.dtypes[i] == object]
cat_col
Out[3]: ['Country', 'Purchased']
This might help. Note that you still need to manually double-check columns whose number of unique values falls just below or just above the threshold of 10.
def find_cate(df):
    cols = df.columns
    i = 0
    for col in cols:
        if len(df[col].unique()) <= 10:
            print(col, len(df[col].unique()))
            i = i + 1
    print(i)
df.select_dtypes(exclude=["number"]).columns
This will directly display all the non-numerical columns.
This always worked pretty well for me:
categorical_columns = list(set(df.columns) - set(df.describe().columns))
Sklearn gives you a one-liner (or a two-liner if you want to use it on many DataFrames). Let's say your DataFrame object is df; then:
# good example in https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html
import numpy as np
from sklearn.compose import make_column_selector
cat_cols = make_column_selector(dtype_include=object)(df)
print(cat_cols)
# OR, to use with many DataFrames, create one selector object first
num_selector = make_column_selector(dtype_include=np.number)
num_cols = num_selector(df)
print(num_cols)
You can get a Boolean mask of the categorical columns using this code:
categorical_columns = (df.dtypes == 'object')
and get the categorical column names with:
object_cols = list(categorical_columns[categorical_columns].index)
