I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.
df["col"] = "hello"
df["col"] = df["col"].astype("catgegory")
Do I really need to write df["col"] three times in order to achieve this?
After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)
Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?
An alternative solution is
df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))
but it requires itertools and the use of len(df), and I am not sure what the memory usage is like under the hood.
We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:
df['col'] = pd.Series('hello', index=df.index, dtype='category')
Sample Program:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['col'] = pd.Series('hello', index=df.index, dtype='category')
print(df)
print(df.dtypes)
print(df['col'].cat.categories)
   a    col
0  1  hello
1  2  hello
2  3  hello
a int64
col category
dtype: object
Index(['hello'], dtype='object')
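If the memory footprint is a worry, a quick check with Series.memory_usage(deep=True) shows the difference; this comparison is my own addition, not part of the answer above:
# compare a plain object column against the categorical one
plain = pd.Series('hello', index=df.index)                   # object dtype: one Python string reference per row
cat = pd.Series('hello', index=df.index, dtype='category')   # category dtype: the string stored once plus small integer codes
print(plain.memory_usage(deep=True), cat.memory_usage(deep=True))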
A simple way to do this would be to use df.assign to create your new variable, then change its dtype to category using df.astype along with a dictionary of dtypes for the specific columns.
df = df.assign(col="hello").astype({'col':'category'})
df.dtypes
A int64
col category
dtype: object
That way you don't have to create a Series with the same length as the dataframe. The input string is broadcast directly, which is a bit more time and memory efficient.
This approach also scales well, as you can see below. You can assign multiple variables as per your need, some based on complex functions as well, and then set the datatypes for them as required.
df = pd.DataFrame({'A':[1,2,3,4]})
df = (df.assign(col1='hello',                     # Define column based on series or broadcasting
                col2=lambda x: x['A']**2,         # Define column based on existing columns
                col3=lambda x: x['col2']/x['A'])  # Define column based on previously defined columns
        .astype({'col1': 'category',
                 'col2': 'float'}))
print(df)
print(df.dtypes)
   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0
A int64
col1 category #<-changed dtype
col2 float64 #<-changed dtype
col3 float64
dtype: object
This solution certainly solves the first point; I am not sure about the second:
df['col'] = pd.Categorical('hello' for _ in range(len(df)))
Essentially
we first create a generator of 'hello' with length equal to the number of records in df
then we pass it to pd.Categorical to make it a categorical column.
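If memory is the main concern, a possible variant (my own sketch, not part of the answer above) is pd.Categorical.from_codes, which stores the string exactly once and keeps only one small integer code per row:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(5)})  # stand-in for the large dataframe

# every row gets code 0, i.e. the single category 'hello'
df['col'] = pd.Categorical.from_codes(np.zeros(len(df), dtype='int8'),
                                      categories=['hello'])
print(df['col'].dtype)  # category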
I am trying to replace all the empty cells of a dataset with the mean of their column.
I use modifiedData = data.fillna(data.mean())
but it only works on columns of integer type.
I also have a column with float values, and on it fillna does not work.
Why?
.fillna() fills values that are NaN. The concept of NaN can't exist in an int column; the pandas int dtype does not support NaN.
If you have a column with what seem to be integers, it is more likely an object column, perhaps even filled with strings, some of which are empty.
Empty strings are not filled by .fillna():
In [8]: pd.Series(["2", "1", ""]).fillna(0)
Out[8]:
0 2
1 1
2
dtype: object
An easy way to figure out what's going on is to use the df.Column.isna() method.
If that method gives you all False, you know there are no NaN values to fill.
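For example, assuming the problem column is called 'col' (the name is just a placeholder), a quick check could look like this:
df['col'].isna().sum()    # number of real NaN cells
(df['col'] == '').sum()   # number of empty strings, which fillna ignores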
To turn empty strings into nan values
In [11]: s = pd.Series(["2", "1", ""])
In [12]: empty_string_mask = s.str.len() == 0
In [21]: s.loc[empty_string_mask] = float('nan')
In [22]: s
Out[22]:
0 2
1 1
2 NaN
dtype: object
After that you can fillna
In [23]: s.fillna(0)
Out[23]:
0 2
1 1
2 0
dtype: object
Another way of going about this problem is to check the dtype:
df.column.dtype
If it says 'object', that confirms your issue.
You can cast the column to a float column:
df['column'] = df['column'].astype(float)
Though manipulating dtypes in pandas usually leads to pain, this may be an easier route to take for this particular problem.
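Putting the pieces together, a minimal end-to-end sketch (assuming a column named 'col' that holds numeric strings with some empty cells):
import pandas as pd

df = pd.DataFrame({'col': ['2', '1', '']})
df['col'] = df['col'].replace('', float('nan')).astype(float)  # empty strings -> NaN, then cast to float
df['col'] = df['col'].fillna(df['col'].mean())                 # fillna with the column mean now works
print(df)
#    col
# 0  2.0
# 1  1.0
# 2  1.5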
I have a pandas dataframe to which I want to incrementally append rows. My issue is that when trying to append values, their type is lost. This is especially annoying for 'bool' columns, which become 'object' (int becoming float is still a bad thing, but at least the rest of the program can still run, just less efficiently):
data1 = pd.DataFrame()
data1['foo'] = 5
print("*\n",data1.dtypes)
data2 =pd.DataFrame()
data2['bar'] = True
print("**\n",data2.dtypes)
data3 = pd.concat([data1, data2])
print("***\n",data3.dtypes)
data4 = data1.append(data2)
print("****\n",data4.dtypes)
*
foo int64
dtype: object
**
bar bool
dtype: object
***
bar object
foo float64
dtype: object
****
bar object # <-- bool type becomes object
foo float64
dtype: object
Do you have an idea how to prevent it?
Solution to the issue:
The type of the columns is changed to allow the representation of missing values, which are represented by np.nan (either because the appended row adds some columns or misses some columns compared to the dataframe it gets appended to).
Empirically, appending/concatenating a new row that introduces missing information will change the types in this manner:
int64 --> float64
bool --> float64 if using a dictionary to set the new line
bool --> object if using a dataframe to set the new line (see the sketch below)
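A minimal sketch of the dataframe case; it uses pd.concat, since DataFrame.append has since been removed from pandas, and exact behaviour may vary slightly between versions:
import pandas as pd

left = pd.DataFrame({'a': [1], 'flag': [True]})
right = pd.DataFrame({'a': [2]})                   # no 'flag' column
out = pd.concat([left, right], ignore_index=True)
print(out.dtypes)
# a        int64
# flag    object   <- the bool column widens to object to hold the missing value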
Your question mixes up rows and columns.
In pandas, each column has a type, and each row gets the types of its columns.
When you do data1['foo'] = [some values] you define a new column,
and when you append two dataframes with different column names, then you:
Append rows of other to the end of this frame, returning a new object.
Columns not in this frame are added as new columns.
(see here)
On the other hand, using concat stacks the dataframes column-wise, keeping the column data types.
Last, note that your column assignment needs to use brackets, i.e.
data1['foo'] = [5]
instead of
data1['foo'] = 5
EDIT: in the spirit of your comment I did a small experiment trying to follow your intention:
df = pd.DataFrame() # Creating a DF
df['a'] = [1,2,3] # Adding a column of integers
df['b'] = [True, False, True] # Adding a column of Boolean
print(df['b'].dtype)
>bool
We see that indeed col 'b' is bool.
Adding a row with partial data:
df = df.append({'a':1}, ignore_index=True)
print(df['b'].dtype)
>float64
Now col 'b' changed to float64, to support the NaN type. That's the known numpy NaN gotcha.
Last, printing the df results with:
print(df)
     a    b
0  1.0  1.0
1  2.0  0.0
2  3.0  1.0
3  1.0  NaN
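If you need the bool dtype back after such a partial append, one option (my own suggestion, assuming it is acceptable to treat the missing value as False) is to fill the hole and cast:
df['b'] = df['b'].fillna(False).astype(bool)
print(df['b'].dtype)  # bool again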
import pandas as pd
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
# create DataFrame
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.dtypes
col1 int64
col2 int64
dtype: object
# Keep dtypes
ser_types = df.dtypes
ser_types
col1 int64
col2 int64
dtype: object
# Change dtypes for test purposes
df = df.astype('float64').copy()
df.dtypes
col1 float64
col2 float64
dtype: object
# Series dtype to dictionary
ser_types.to_dict()
{'col1': dtype('int64'), 'col2': dtype('int64')}
# Change dtypes back to those of the initial df
df = df.astype(ser_types.to_dict()).copy()
df.dtypes
col1 int64
col2 int64
dtype: object
df
col1 col2
0 1 3
1 2 4
I am new to Pandas... I want a simple and generic way to find which columns are categorical in my DataFrame, when I don't manually specify each column type, unlike in this SO question. The df is created with:
import pandas as pd
df = pd.read_csv("test.csv", header=None)
e.g.
          0         1         2         3       4
0  1.539240  0.423437 -0.687014   Chicago  Safari
1  0.815336  0.913623  1.800160    Boston  Safari
2  0.821214 -0.824839  0.483724  New York  Safari
.
UPDATE (2018/02/04): The question assumes numerical columns are NOT categorical; @Zero's accepted answer solves this.
BE CAREFUL - As @Sagarkar's comment points out, that's not always true. The difficulty is that data types and categorical/ordinal/nominal types are orthogonal concepts, so mapping between them isn't straightforward. @Jeff's answer below specifies the precise manner to achieve the manual mapping.
You could use df._get_numeric_data() to get numeric columns and then find out categorical columns
In [66]: cols = df.columns
In [67]: num_cols = df._get_numeric_data().columns
In [68]: num_cols
Out[68]: Index([u'0', u'1', u'2'], dtype='object')
In [69]: list(set(cols) - set(num_cols))
Out[69]: ['3', '4']
The way I found was updating to Pandas v0.16.0, then excluding number dtypes with:
df.select_dtypes(exclude=["number","bool_","object_"])
Which works, provided no types are changed and no more are added to NumPy. @Jeff's suggestion in the question's comments was include=["category"], but that didn't seem to work.
NumPy Types: link
For posterity. The canonical method to select dtypes is .select_dtypes. You can specify an actual numpy dtype or convertible, or 'category', which is not a numpy dtype.
In [1]: df = DataFrame({'A' : Series(range(3)).astype('category'), 'B' : range(3), 'C' : list('abc'), 'D' : np.random.randn(3) })
In [2]: df
Out[2]:
   A  B  C         D
0  0  0  a  0.141296
1  1  1  b  0.939059
2  2  2  c -2.305019
In [3]: df.select_dtypes(include=['category'])
Out[3]:
A
0 0
1 1
2 2
In [4]: df.select_dtypes(include=['object'])
Out[4]:
C
0 a
1 b
2 c
In [5]: df.select_dtypes(include=['object']).dtypes
Out[5]:
C object
dtype: object
In [6]: df.select_dtypes(include=['category','int']).dtypes
Out[6]:
A category
B int64
dtype: object
In [7]: df.select_dtypes(include=['category','int','float']).dtypes
Out[7]:
A category
B int64
D float64
dtype: object
You can get the list of categorical columns using this code:
dfName.select_dtypes(include=['object']).columns.tolist()
And, intuitively, for numerical columns:
dfName.select_dtypes(exclude=['object']).columns.tolist()
Hope that helps.
select categorical column names
cat_features=[i for i in df.columns if df.dtypes[i]=='object']
# Get categorical and numerical variables
numCols = X.select_dtypes("number").columns
catCols = X.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
numeric_var = [key for key in dict(df.dtypes)
               if dict(df.dtypes)[key]
               in ['float64', 'float32', 'int32', 'int64']]  # Numeric Variables
cat_var = [key for key in dict(df.dtypes)
           if dict(df.dtypes)[key] in ['object']]  # Categorical Variables
Often columns get the pandas dtype string (or "object") or category. It is better to include both, in case the columns you are looking for don't get listed under the category dtype.
dataframe.select_dtypes(include=['object','category']).columns.tolist()
You don't need to query the data if you are just interested in which columns are of what type.
The fastest method (when %%timeit-ing it) is:
df.dtypes[df.dtypes == 'category'].index
(this will give you a pandas' Index. You can .tolist() to get a list out of it, if you need that.)
This works because df.dtypes is a pd.Series (its own dtype is 'object') whose values compare equal to strings such as 'category', so you can select the type you need with normal pandas querying.
You don't have your categorical types as 'category' but as simple strings ('object')? Then just:
df.dtypes[df.dtypes == 'object'].index
Do you have a mix of 'object' and 'category'? Then use isin like you would do normally to query for multiple matches:
df.dtypes[df.dtypes.isin(['object','category'])].index
Use .dtypes
In [10]: df.dtypes
Out[10]:
0 float64
1 float64
2 float64
3 object
4 object
dtype: object
Use pandas.DataFrame.select_dtypes. Categorical dtypes can be found with the 'category' flag. For strings you might use the numpy object dtype.
More info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html
Example:
import pandas as pd
df = pd.DataFrame({'Integer': [1, 2] * 3,'Bool': [True, False] * 3,'Float': [1.0, 2.0] * 3,'String': ['Dog', 'Cat'] * 3})
df
Out[1]:
   Integer   Bool  Float String
0        1   True    1.0    Dog
1        2  False    2.0    Cat
2        1   True    1.0    Dog
3        2  False    2.0    Cat
4        1   True    1.0    Dog
5        2  False    2.0    Cat
df.select_dtypes(include=['category', object]).columns
Out[2]:
Index(['String'], dtype='object')
I have faced a similar obstacle where categorizing variables was a challenge. However, I came up with some approaches based on the nature of the data. This would give a general and flexible answer to your issue as well as to future data.
Normally, categorization of data is done on the basis of its datatype (usually via df.select_dtypes(include=['object', 'category'])), which sometimes may result in wrong analysis.
Approach:
The approach is to view the data not at the column-dtype level but at the value level. It counts the number of distinct values in a column, which automatically distinguishes categorical variables from numerical ones.
That is, a column is treated as categorical if its count of unique values does not exceed a certain number
(it is for you to decide how many categories you presume a column may have).
For example: if ['Dog', 'Cat', 'Bird', 'Fish', 'Reptile'] makes up five unique categorical values for a particular column and the number of distinct values in that column does not exceed those five, then that column falls under categorical variables.
Otherwise, if the number of distinct values exceeds that threshold, the column falls under numerical variables.
if [col for col in df.columns if len(df[col].unique()) <= 5]:
    cat_var = [col for col in df.columns if len(df[col].unique()) <= 5]
elif [col for col in df.columns if len(df[col].unique()) > 5]:
    num_var = [col for col in df.columns if len(df[col].unique()) > 5]
# where 5 is the presumed number of categorical values, and may be adjusted by the user
I have used if and elif for better illustration. There is no need for that; you can use the lines inside the conditions directly, as in the sketch below.
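For reference, the same heuristic without the if/elif scaffolding might look like this; the threshold of 5 is the assumption carried over from above:
threshold = 5  # presumed maximum number of categories; adjust as needed
cat_var = [col for col in df.columns if df[col].nunique() <= threshold]
num_var = [col for col in df.columns if df[col].nunique() > threshold]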
categorical_values = (df.dtypes == 'object')
categorical_variables = [categorical_values.index[ind]
                         for ind, val in enumerate(categorical_values) if val == True]
In the first line of code, we obtain a Series that gives information about all the columns: for each column, a Boolean value indicates whether it is of object type or not.
In the second line, we use a list comprehension with enumerate (iterating through index and value) so that we can easily find the columns that are of categorical type and collect them in the categorical_variables list.
First we can segregate the dataframe by the default types inferred when we read the dataset. This will list out all the different types and the corresponding columns.
for types in data.dtypes.unique():
    print(types)
    print(data.select_dtypes(types).columns)
This code will get all categorical variables:
cat_cols = [col for col in df.columns if col not in df.describe().columns]
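This works because df.describe() only summarises numeric columns by default, so its columns are exactly the numeric ones. A tiny demonstration with made-up column names:
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'city': ['Boston', 'Chicago']})
print(df.describe().columns.tolist())                                   # ['x'] -- numeric columns only
print([col for col in df.columns if col not in df.describe().columns])  # ['city']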
This will give an array of all the categorical variables in a dataframe.
dataset.select_dtypes(include=['O']).columns.values
# Import packages
import numpy as np
import pandas as pd
# Data
df = pd.DataFrame({"Country" : ["France", "Spain", "Germany", "Spain", "Germany", "France"],
"Age" : [34, 27, 30, 32, 42, 30],
"Purchased" : ["No", "Yes", "No", "No", "Yes", "Yes"]})
df
Out[1]:
   Country  Age Purchased
0   France   34        No
1    Spain   27       Yes
2  Germany   30        No
3    Spain   32        No
4  Germany   42       Yes
5   France   30       Yes
# Checking data type
df.dtypes
Out[2]:
Country object
Age int64
Purchased object
dtype: object
# Saving CATEGORICAL Variables
cat_col = [c for i, c in enumerate(df.columns) if df.dtypes.iloc[i] in [object]]
cat_col
Out[3]: ['Country', 'Purchased']
This might help, but you still need to manually check columns whose number of unique values is slightly below the threshold of 10, as well as columns whose number of unique values is slightly above it.
def find_cate(df):
    cols = df.columns
    i = 0
    for col in cols:
        if len(df[col].unique()) <= 10:
            print(col, len(df[col].unique()))
            i = i + 1
    print(i)
df.select_dtypes(exclude=["number"]).columns
This will directly display all the non-numerical columns.
This always worked pretty well for me:
categorical_columns = list(set(df.columns) - set(df.describe().columns))
Sklearn gives you a one-liner (or a two-liner if you want to use it on many DataFrames). Let's say your DataFrame object is df; then:
## good example in https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html
import numpy as np
from sklearn.compose import make_column_selector

cat_cols = make_column_selector(dtype_include=object)(df)
print(cat_cols)

## OR, to use with many DataFrames, create one _selector object first
num_selector = make_column_selector(dtype_include=np.number)
num_cols = num_selector(df)
print(num_cols)
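These selectors plug straight into a ColumnTransformer; a sketch under the assumption that you want to scale numeric columns and one-hot encode categorical ones (the preprocessing choices are mine, not part of the answer):
import numpy as np
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = make_column_transformer(
    (StandardScaler(), make_column_selector(dtype_include=np.number)),
    (OneHotEncoder(handle_unknown='ignore'), make_column_selector(dtype_include=object)),
)
X_transformed = preprocess.fit_transform(df)  # df being the DataFrame from above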
You can get the list of categorical columns using this code:
categorical_columns = (df.dtypes == 'object')
Get the categorical column names:
object_cols = list(categorical_columns[categorical_columns].index)