Let's say I have a dataframe in Python with a range of animals and a range of attributes, with dummy variables for whether the animal has that attribute. I'm interested in creating lists, both vertically and horizontally, based on the dummy variable value. E.g. I'd like to:
a) create a list of animals that have hair
b) create a list of all the attributes that a dog has.
Could anyone please assist with how I would do this in Python? Thanks very much!
Name  Hair  Eyes
Dog      1     1
Fish     0     1
You could use a dictionary to store the values for each animal, where the first element of each value list holds the 0 or 1 denoting whether the animal has hair.
animals = { "Dog": [ 1, 1 ], "Fish": [ 0, 1 ] }
(a)
df[ df['Hair'] == 1 ]['Name'].to_list()
df.loc[ df['Hair'] == 1, 'Name'].to_list()
(b)
You may need to transpose the dataframe (to convert rows into columns) and set the column names. Then you can use similar code:
df[ df['Dog'] == 1 ].index.to_list()
Minimal working code
text = '''Name,Hair,Eyes
Dog,1,1
Fish,0,1'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
print(df)
print('---')
print('Hair 1:', df[ df['Hair'] == 1 ]['Name'].to_list())
print('hair 2:', df.loc[ df['Hair'] == 1, 'Name'].to_list())
print('---')
# transpose
#new_df = df.transpose()
new_df = df.T  # shorter alias, without `()`
# convert the first row into column names
new_df.columns = new_df.loc['Name']
new_df = new_df[1:]  # drop the 'Name' row
print(new_df)
print('---')
print('Dog :', new_df[ new_df['Dog'] == 1 ].index.to_list())
print('Fish:', new_df[ new_df['Fish'] == 1 ].index.to_list())
Result:
Name Hair Eyes
0 Dog 1 1
1 Fish 0 1
---
Hair 1: ['Dog']
hair 2: ['Dog']
---
Name Dog Fish
Hair 1 0
Eyes 1 1
---
Dog : ['Hair', 'Eyes']
Fish: ['Eyes']
I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity, I would now like to add additional index rows to the DataFrame, as shown here:
However, in the link these multiple index rows are generated right when the DataFrame is created. I would like to add rows for e.g. unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.
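For example, a minimal sketch of from_tuples building the same columns as above (from_product instead builds the Cartesian product of its input lists, so it fits regularly-structured headers rather than this example):
index = pd.MultiIndex.from_tuples(
    [('A', 1, 'desc A'), ('B', 2, 'desc B')],
    names=['NAME', 'LENGTH', 'DESCRIPTION'])
df.columns = index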
Not an ideal title, but I wouldn't know how to describe it better.
I have a dataframe (df1) and want to split it on the column "chicken" so that:
each chicken that laid an egg becomes a distinct row
the chickens that didn't lay an egg are aggregated into a single row.
The output I need is df2, example:
In farm "A", there are 5 chicken, of which 2 chicken laid an egg, so there are 2 rows with egg = "True" and weight = 1 each, and 1 row with egg = "False" and weight = 3 (the 3 chicken that didn't lay an egg).
The code I came up with is messy; can you guys think of a cleaner way of doing it? Thanks!!
#code to create df1:
df1 = pd.DataFrame({'farm':["A","B","C"],"chicken":[5,10,5],"eggs":[2,3,0]})
df1=df1[["farm","chicken","eggs"]]
#code to transform df1 to df2:
df2 = pd.DataFrame()
for i in df1.index:
    number_of_trues = df1.iloc[i]["eggs"]
    number_of_falses = df1.iloc[i]["chicken"] - number_of_trues
    col_farm = [df1.iloc[i]["farm"]] * (number_of_trues + 1)
    col_egg = ["True"] * number_of_trues + ["False"]
    col_weight = [1] * number_of_trues + [number_of_falses]
    mini_df = pd.DataFrame({"farm": col_farm, "egg": col_egg, "weight": col_weight})
    df2 = df2.append(mini_df)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat instead
df2 = df2[["farm","egg","weight"]]
df2
This is a customized solution: create two sub-dataframes, then concat them back together to achieve the expected output. The key method is repeat.
s = pd.DataFrame({'farm': df1.farm.repeat(df1.eggs), 'egg': [True] * df1.eggs.sum(), 'weight': [1] * df1.eggs.sum()})
t = pd.DataFrame({'farm': df1.farm, 'egg': [False] * len(df1.farm), 'weight': df1.chicken - df1.eggs})
pd.concat([t, s]).sort_values(['farm', 'egg'], ascending=[True, False])
Out[847]:
egg farm weight
0 True A 1
0 True A 1
0 False A 3
1 True B 1
1 True B 1
1 True B 1
1 False B 7
2 False C 5
I have a big dataset (2m rows, 70 variables), which has many categorical variables. All categorical variables are coded in numbers (e.g. see df1)
df1:
obs gender job
1 1 1
2 1 2
3 2 2
4 1 1
I have another data frame with all the explanations, looking like this:
df2:
Var: Value: Label:
gender 1 male
gender 2 female
job 1 blue collar
job 2 white collar
Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me the work of always looking up the meaning of a value in df2. I found some solutions for replacing values by hand, but I am looking for an automatic way of doing this.
Thank you
You could use a dictionary generated from df2. Like this:
Firstly, generating some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['obs'] = range(1,1001)
df1['gender'] = np.random.choice([1,2],1000)
df1['job'] = np.random.choice([1,2],1000)
df2 = pd.DataFrame()
df2['var'] = ['gender','gender','job','job']
df2['value'] = [1,2,1,2]
df2['label'] = ['male','female','blue collar', 'white collar']
If you want to replace one variable, something like this works:
genderDict = dict(df2.loc[df2['var']=='gender'][['value','label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])
And if you'd like to replace a bunch of variables:
colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var'] == variable][['value', 'label']].values)
    df1[variable + '_name'] = df1[variable].apply(lambda x: varDict[x])
For a million rows it takes about 1 second, so it should be reasonably fast.
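As a side note, Series.map also accepts a dictionary directly, so the lambda isn't strictly needed. A sketch of the same loop using map (with map, unmatched codes become NaN rather than raising a KeyError):
for variable in colNames:
    varDict = dict(df2.loc[df2['var'] == variable][['value', 'label']].values)
    df1[variable + '_name'] = df1[variable].map(varDict)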
Create a mapper dictionary from df2 using groupby
d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()
{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}
Now map the values in df1, using the outer key of the dictionary as the column name and the inner dictionary as the mapper:
for col in df1.columns:
    if col in d.keys():
        df1[col] = df1[col].map(d[col])
You get
obs gender job
0 1 male blue collar
1 2 male white collar
2 3 female white collar
3 4 male blue collar
I have a set of dataframes where one of the columns contains a categorical variable. I'd like to convert it to several dummy variables, in which case I'd normally use get_dummies.
What happens is that get_dummies looks at the data available in each dataframe to find out how many categories there are, and thus creates the appropriate number of dummy variables. However, in the problem I'm working on right now, I actually know in advance what the possible categories are. But when looking at each dataframe individually, not all categories necessarily appear.
My question is: is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?
Something that would make this:
categories = ['a', 'b', 'c']
cat
1 a
2 b
3 a
Become this:
cat_a cat_b cat_c
1 1 0 0
2 0 1 0
3 1 0 0
TL;DR:
pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))
Older pandas: pd.get_dummies(cat.astype('category', categories=categories))
is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?
Yes, there is! Pandas has a special type of Series just for categorical data. One of the attributes of this series is the possible categories, which get_dummies takes into account. Here's an example:
In [1]: import pandas as pd
In [2]: possible_categories = list('abc')
In [3]: dtype = pd.CategoricalDtype(categories=possible_categories)
In [4]: cat = pd.Series(list('aba'), dtype=dtype)
In [5]: cat
Out[5]:
0 a
1 b
2 a
dtype: category
Categories (3, object): [a, b, c]
Then, get_dummies will do exactly what you want!
In [6]: pd.get_dummies(cat)
Out[6]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
There are a bunch of other ways to create a categorical Series or DataFrame; this is just the one I find most convenient. You can read about all of them in the pandas documentation.
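For instance, a minimal sketch of one alternative (not from the original answer), setting the categories on an already-categorical Series with cat.set_categories:
cat = pd.Series(list('aba')).astype('category').cat.set_categories(list('abc'))
pd.get_dummies(cat)  # now produces columns a, b and c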
EDIT:
I haven't followed the exact versioning, but there was a bug in how pandas treats sparse matrices, at least until version 0.17.0. It was corrected by version 0.18.1 (released May 2016).
For version 0.17.0, if you try to do this with the sparse=True option with a DataFrame, the column of zeros for the missing dummy variable will be a column of NaN, and it will be converted to dense.
It looks like pandas 0.21.0 added CategoricalDtype, and creating categoricals that explicitly include the categories, as in the original answer, was deprecated; I'm not quite sure when.
Using transpose and reindex
import pandas as pd
cats = ['a', 'b', 'c']
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
dummies = pd.get_dummies(df, prefix='', prefix_sep='')
dummies = dummies.T.reindex(cats).T.fillna(0)
print(dummies)
a b c
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
Try this:
In[1]: import pandas as pd
cats = ["a", "b", "c"]
In[2]: df = pd.DataFrame({"cat": ["a", "b", "a"]})
In[3]: pd.concat((pd.get_dummies(df.cat), pd.DataFrame(columns=cats))).fillna(0)
Out[3]:
a b c
0 1.0 0.0 0
1 0.0 1.0 0
2 1.0 0.0 0
I did ask this on the pandas github. It turns out it is really easy to get around when you define the column as a Categorical, specifying all the possible categories:
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])
get_dummies() will do the rest then as expected.
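A minimal end-to-end sketch of that approach, using hypothetical column and category names:
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'a']})
df['col'] = pd.Categorical(df['col'], categories=['a', 'b', 'c', 'd'])
dummies = pd.get_dummies(df)
print(dummies.columns.tolist())  # ['col_a', 'col_b', 'col_c', 'col_d']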
I don't think get_dummies provides this out of the box; it only allows for creating an extra column that highlights NaN values.
To add the missing columns yourself, you could use pd.concat along axis=0 to vertically 'stack' the DataFrames (the dummy columns plus a DataFrame id), automatically creating any missing columns, then use fillna(0) to replace the missing values, and finally use .groupby('id') to separate the DataFrames again, as sketched below.
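A minimal sketch of that idea; the frame names and id values here are made up for illustration:
import pandas as pd

df_a = pd.get_dummies(pd.DataFrame({'cat': ['a', 'b']}))
df_b = pd.get_dummies(pd.DataFrame({'cat': ['a', 'c']}))
df_a['id'] = 'frame_a'
df_b['id'] = 'frame_b'

# stacking creates the union of the dummy columns; missing ones become NaN
stacked = pd.concat([df_a, df_b], axis=0).fillna(0)

# split the frames apart again using the id column
frames = {key: grp.drop(columns='id') for key, grp in stacked.groupby('id')}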
Adding the missing category in the test set:
# Get missing columns in the training test
missing_cols = set( train.columns ) - set( test.columns )
# Add a missing column in test set with default value equal to 0
for c in missing_cols:
    test[c] = 0
# Ensure the order of column in the test set is in the same order than in train set
test = test[train.columns]
Notice that this code also removes columns resulting from categories that are in the test dataset but not present in the training dataset.
As suggested by others, converting your categorical features to the 'category' data type should resolve the unseen-label issue with get_dummies.
# Your Data frame(df)
from sklearn.model_selection import train_test_split
X = df.loc[:,df.columns !='label']
Y = df.loc[:,df.columns =='label']
# Split the data into 70% training and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
# Convert categorical columns in your data frame to type 'category'
# (the old astype('category', categories=...) call was removed in later pandas versions)
for col in df.select_dtypes(include=['object']).columns:
    cat_dtype = pd.CategoricalDtype(categories=df[col].unique())
    X_train[col] = X_train[col].astype(cat_dtype)
    X_test[col] = X_test[col].astype(cat_dtype)
# Now, use get_dummies on training, test data and we will get same set of columns
X_train = pd.get_dummies(X_train,columns = ["Categorical_Columns"])
X_test = pd.get_dummies(X_test,columns = ["Categorical_Columns"])
The shorter the better:
import pandas as pd
cats = pd.Index(['a', 'b', 'c'])
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
pd.get_dummies(df, prefix='', prefix_sep='').reindex(columns = cats, fill_value=0)
Result:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
Notes:
cats needs to be a pandas Index.
prefix='' and prefix_sep='' need to be set in order to keep the category names as you defined them in the first place. Otherwise, get_dummies converts them into cat_a, cat_b and cat_c. To me this is better because it is explicit.
Use fill_value=0 to convert the NaN in column c. Alternatively, you can use fillna(0) at the end of the statement (I don't know which is faster).
Here's an even shorter version (with the Index values changed):
import pandas as pd
cats = pd.Index(['cat_a', 'cat_b', 'cat_c'])
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
pd.get_dummies(df).reindex(columns = cats, fill_value=0)
Result:
cat_a cat_b cat_c
0 1 0 0
1 0 1 0
2 1 0 0
Bonus track!
I imagine you have the categories because you did a previous dummy/one-hot encoding using training data. You can save the original encoding (.columns) and then apply it at production time:
cats = pd.Index(['cat_a', 'cat_b', 'cat_c']) # it might come from the original onehot encoding (df_ohe.columns)
import pickle
with open('cats.pickle', 'wb') as handle:
pickle.dump(cats, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('cats.pickle', 'rb') as handle:
saved_cats = pickle.load(handle)
df = pd.DataFrame({'cat': ['a', 'b', 'a']})
pd.get_dummies(df).reindex(columns = saved_cats, fill_value=0)
Result:
cat_a cat_b cat_c
0 1 0 0
1 0 1 0
2 1 0 0
If you know your categories you can first apply pd.get_dummies() as you suggested and add the missing category columns afterwards.
This will create your example with the missing cat_c:
import pandas as pd
categories = ['a', 'b', 'c']
df = pd.DataFrame(list('aba'), columns=['cat'])
df = pd.get_dummies(df)
print(df)
cat_a cat_b
0 1 0
1 0 1
2 1 0
Now simply add the missing category columns with a union operation (as suggested here).
possible_categories = ['cat_' + cat for cat in categories]
df = df.reindex(df.columns.union(possible_categories, sort=False), axis=1, fill_value=0)
print(df)
cat_a cat_b cat_c
0 1 0 0
1 0 1 0
2 1 0 0
I was recently looking to solve this same issue, but working with a multi-column dataframe and with two datasets (a train set and test set for a machine learning task). The test dataframe had the same categorical columns as the train dataframe, but some of these columns had missing categories that were present in the train dataframe.
I did not want to manually define all the possible categories for every column. Instead, I combined the train and test dataframes into one, called get_dummies, and then split that back into two.
# train_cat, test_cat are dataframes instantiated elsewhere
train_test_cat = pd.concat([train_cat, test_cat], axis=0)
train_test_cat = pd.get_dummies(train_test_cat)
train_cat = train_test_cat.iloc[:train_cat.shape[0], :]
test_cat = train_test_cat.iloc[train_cat.shape[0]:, :]
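As a usage note, a slightly more robust variant of the same idea (a sketch, not from the original answer) is to let pd.concat tag the two parts with keys, so the split back doesn't depend on row counts:
combined = pd.concat([train_cat, test_cat], keys=['train', 'test'])
combined = pd.get_dummies(combined)
train_cat = combined.xs('train')
test_cat = combined.xs('test')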