Check which columns in DataFrame are Categorical - python
I am new to Pandas... I want a simple and generic way to find out which columns in my DataFrame are categorical, without manually specifying each column type, unlike in this SO question. The df is created with:
import pandas as pd
df = pd.read_csv("test.csv", header=None)
e.g.
0 1 2 3 4
0 1.539240 0.423437 -0.687014 Chicago Safari
1 0.815336 0.913623 1.800160 Boston Safari
2 0.821214 -0.824839 0.483724 New York Safari
.
UPDATE (2018/02/04): The question assumes numerical columns are NOT categorical; @Zero's accepted answer solves this.
BE CAREFUL - As @Sagarkar's comment points out, that's not always true. The difficulty is that data types and categorical/ordinal/nominal types are orthogonal concepts, so mapping between them isn't straightforward. @Jeff's answer below specifies the precise manner to achieve the manual mapping.
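To illustrate that manual mapping, here is a minimal sketch (the column names are made up for the example): pandas only reports a column as categorical once you cast it explicitly, regardless of its original dtype.
import pandas as pd
df = pd.DataFrame({"city": ["Chicago", "Boston", "New York"],
                   "rating": [3, 1, 2]})
# The dtype alone does not decide this - declare the mapping yourself.
df["city"] = df["city"].astype("category")             # nominal
df["rating"] = pd.Categorical(df["rating"],
                              categories=[1, 2, 3],
                              ordered=True)             # ordinal
print(df.dtypes)   # both columns now report dtype 'category'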
You could use df._get_numeric_data() to get the numeric columns and then take the remaining columns as the categorical ones:
In [66]: cols = df.columns
In [67]: num_cols = df._get_numeric_data().columns
In [68]: num_cols
Out[68]: Index([u'0', u'1', u'2'], dtype='object')
In [69]: list(set(cols) - set(num_cols))
Out[69]: ['3', '4']
The way I found was updating to Pandas v0.16.0, then excluding number dtypes with:
df.select_dtypes(exclude=["number","bool_","object_"])
Which works, provided no types are changed and no more are added to NumPy. The suggestion in the question's comments by @Jeff was include=["category"], but that didn't seem to work.
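For what it's worth, on later pandas versions include=["category"] does work, but it only finds columns that were actually cast to the category dtype; columns read from a CSV default to object. A small sketch (hypothetical column name):
import pandas as pd
df = pd.DataFrame({"city": ["Chicago", "Boston", "Boston"]})
print(df.select_dtypes(include=["category"]).columns)   # empty: 'city' is still object
df["city"] = df["city"].astype("category")
print(df.select_dtypes(include=["category"]).columns)   # Index(['city'], dtype='object')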
For posterity. The canonical method to select dtypes is .select_dtypes. You can specify an actual numpy dtype or convertible, or 'category', which is not a numpy dtype.
In [1]: df = pd.DataFrame({'A' : pd.Series(range(3)).astype('category'), 'B' : range(3), 'C' : list('abc'), 'D' : np.random.randn(3) })
In [2]: df
Out[2]:
A B C D
0 0 0 a 0.141296
1 1 1 b 0.939059
2 2 2 c -2.305019
In [3]: df.select_dtypes(include=['category'])
Out[3]:
A
0 0
1 1
2 2
In [4]: df.select_dtypes(include=['object'])
Out[4]:
C
0 a
1 b
2 c
In [5]: df.select_dtypes(include=['object']).dtypes
Out[5]:
C object
dtype: object
In [6]: df.select_dtypes(include=['category','int']).dtypes
Out[6]:
A category
B int64
dtype: object
In [7]: df.select_dtypes(include=['category','int','float']).dtypes
Out[7]:
A category
B int64
D float64
dtype: object
You can get the list of categorical columns using this code:
dfName.select_dtypes(include=['object']).columns.tolist()
And intuitively for numerical columns:
dfName.select_dtypes(exclude=['object']).columns.tolist()
Hope that helps.
Select categorical column names:
cat_features=[i for i in df.columns if df.dtypes[i]=='object']
# Get categorical and numerical variables
numCols = X.select_dtypes("number").columns
catCols = X.select_dtypes("object").columns
numCols= list(set(numCols))
catCols= list(set(catCols))
numeric_var = [key for key in dict(df.dtypes)
               if dict(df.dtypes)[key]
               in ['float64', 'float32', 'int32', 'int64']]  # Numeric variables
cat_var = [key for key in dict(df.dtypes)
           if dict(df.dtypes)[key] in ['object']]  # Categorical variables
Often string columns get the pandas dtype object, and sometimes category. It is better to include both, in case the columns you are looking for are not listed under the category dtype.
dataframe.select_dtypes(include=['object','category']).columns.tolist()
You don't need to query the data if you are just interested in which columns are of what type.
The fastest method (when %%timeit-ing it) is:
df.dtypes[df.dtypes == 'category'].index
(this will give you a pandas' Index. You can .tolist() to get a list out of it, if you need that.)
This works because df.dtypes is a pd.Series of dtype objects (its own dtype is 'object'), and dtypes compare equal to their string names, so you can select the type you need with normal pandas querying.
You don't have your categorical types as 'category' but as simple strings ('object')? Then just:
df.dtypes[df.dtypes == 'object'].index
Do you have a mix of 'object' and 'category'? Then use isin like you would do normally to query for multiple matches:
df.dtypes[df.dtypes.isin(['object','category'])].index
Use .dtypes
In [10]: df.dtypes
Out[10]:
0 float64
1 float64
2 float64
3 object
4 object
dtype: object
Use pandas.DataFrame.select_dtypes. Categorical dtypes can be selected with the 'category' flag; for strings you can use the numpy object dtype.
More Info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html
Example:
import pandas as pd
df = pd.DataFrame({'Integer': [1, 2] * 3,'Bool': [True, False] * 3,'Float': [1.0, 2.0] * 3,'String': ['Dog', 'Cat'] * 3})
df
Out[1]:
Integer Bool Float String
0 1 True 1.0 Dog
1 2 False 2.0 Cat
2 1 True 1.0 Dog
3 2 False 2.0 Cat
4 1 True 1.0 Dog
5 2 False 2.0 Cat
df.select_dtypes(include=['category', object]).columns
Out[2]:
Index(['String'], dtype='object')
I have faced a similar obstacle where categorizing variables was a challenge. However, I came up with some approaches based on the nature of the data. This gives a general and flexible answer to your issue as well as to future data.
Normally, data is categorized on the basis of its datatype, which can sometimes result in wrong analysis. (This is usually done with df.select_dtypes(include=['object', 'category']).)
Approach:
The approach is to look not at the column's datatype but at the number of distinct values it holds, which automatically distinguishes categorical variables from numerical ones.
That is, a column is treated as categorical if its count of unique values does not exceed a certain number
(it is for you to decide how many categories you presume a column may have).
For example: if ['Dog', 'Cat', 'Bird', 'Fish', 'Reptile'] makes up five unique categorical values for a particular column, and the number of distinct values in that column does not exceed those five, then that column falls under categorical variables.
Otherwise, if the number of distinct values exceeds those five unique values, the column falls under numerical variables.
if [col for col in df.columns if len(df[col].unique()) <= 5]:
    cat_var = [col for col in df.columns if len(df[col].unique()) <= 5]
elif [col for col in df.columns if len(df[col].unique()) > 5]:
    num_var = [col for col in df.columns if len(df[col].unique()) > 5]
# where 5 is the presumed maximum number of categories per column; adjust it as needed.
I have used if and elif for better illustration; there is no need for that, and you can compute both lists directly, as in the sketch below.
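A minimal sketch of the direct version, assuming df is your DataFrame (the threshold of 5 is an assumption you tune for your data):
threshold = 5   # presumed maximum number of categories per column
cat_var = [col for col in df.columns if df[col].nunique() <= threshold]
num_var = [col for col in df.columns if df[col].nunique() > threshold]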
categorical_values = (df.dtypes == 'object')
categorical_variables = [categorical_values.index[ind]
                         for ind, val in enumerate(categorical_values) if val == True]
In the first line of code, we obtain a Series that gives information about all the columns: a Boolean value indicates whether each column is of object type or not.
In the second line, we use a list comprehension with enumerate (iterating through index and value) so that we can easily find the columns that are categorical and append them to the categorical_variables list.
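For what it's worth, the same list can also be obtained in one line (a sketch equivalent to the comprehension above):
categorical_variables = df.columns[df.dtypes == 'object'].tolist()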
First we can segregate the data frame with the default types available when we read the datasets. This will list out all the different types and the corresponding data.
for types in data.dtypes.unique():
    print(types)
    print(data.select_dtypes(types).columns)
This code will get all categorical variables:
cat_cols = [col for col in df.columns if col not in df.describe().columns]
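This works because df.describe() summarises only the numeric columns by default, so everything missing from its output is non-numeric. A quick sketch on a made-up frame:
import pandas as pd
df = pd.DataFrame({"price": [1.5, 2.0], "city": ["Boston", "Chicago"]})
print(df.describe().columns)                                              # Index(['price'], dtype='object')
print([col for col in df.columns if col not in df.describe().columns])   # ['city']
Note that if the frame has no numeric columns at all, describe() falls back to describing every column and this trick returns an empty list.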
This will give an array of all the categorical variables in a dataframe.
dataset.select_dtypes(include=['O']).columns.values
# Import packages
import numpy as np
import pandas as pd
# Data
df = pd.DataFrame({"Country" : ["France", "Spain", "Germany", "Spain", "Germany", "France"],
"Age" : [34, 27, 30, 32, 42, 30],
"Purchased" : ["No", "Yes", "No", "No", "Yes", "Yes"]})
df
Out[1]:
Country Age Purchased
0 France 34 No
1 Spain 27 Yes
2 Germany 30 No
3 Spain 32 No
4 Germany 42 Yes
5 France 30 Yes
# Checking data type
df.dtypes
Out[2]:
Country object
Age int64
Purchased object
dtype: object
# Saving CATEGORICAL Variables
cat_col = [c for i, c in enumerate(df.columns) if df.dtypes[i] == object]  # np.object is deprecated in newer NumPy; plain object works
cat_col
Out[3]: ['Country', 'Purchased']
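An equivalent, shorter sketch with select_dtypes gives the same list for the frame above:
cat_col = df.select_dtypes(include=['object']).columns.tolist()
cat_col
Out[4]: ['Country', 'Purchased']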
This might help, but you still need to manually check columns whose number of unique values falls slightly below the 10-value cutoff, as well as columns whose number of unique values is slightly above it.
def find_cate(df):
    cols = df.columns
    i = 0
    for col in cols:
        if len(df[col].unique()) <= 10:
            print(col, len(df[col].unique()))
            i = i + 1
    print(i)
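A quick usage sketch on a made-up frame (note that the low-cardinality numeric column is flagged too, which is the caveat mentioned above):
import pandas as pd
df = pd.DataFrame({"city": ["Boston", "Chicago", "Boston"],
                   "rating": [1, 2, 1]})
find_cate(df)
# city 2
# rating 2
# 2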
df.select_dtypes(exclude=["number"]).columns
This will directly display all the non-numerical columns.
This always worked pretty well for me:
categorical_columns = list(set(df.columns) - set(df.describe().columns))
Sklearn gives you a one-liner (or a two-liner if you want to use it on many DataFrames). Let's say your DataFrame object is df; then:
## good example in https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html
import numpy as np
from sklearn.compose import make_column_selector

cat_cols = make_column_selector(dtype_include=object)(df)
print(cat_cols)

## OR, to use with many DataFrames, create one _selector object first
num_selector = make_column_selector(dtype_include=np.number)
num_cols = num_selector(df)
print(num_cols)
You can get a Boolean mask of the categorical columns using this code:
categorical_columns = (df.dtypes == 'object')
Then get the categorical column names:
object_cols = list(categorical_columns[categorical_columns].index)
Related
Adding a column with one single categorical value to a pandas dataframe
I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following. df["col"] = "hello" df["col"] = df["col"].astype("category") Do I really need to write df["col"] three times in order to achieve this? After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.) Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues? An alternative solution is df["col"] = pd.Categorical(itertools.repeat("hello", len(df))) but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.
We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting: df['col'] = pd.Series('hello', index=df.index, dtype='category') Sample Program: import pandas as pd df = pd.DataFrame({'a': [1, 2, 3]}) df['col'] = pd.Series('hello', index=df.index, dtype='category') print(df) print(df.dtypes) print(df['col'].cat.categories) a col 0 1 hello 1 2 hello 2 3 hello a int64 col category dtype: object Index(['hello'], dtype='object')
A simple way to do this would be to use df.assign to create your new variable, then change dtype to category using df.astype along with dictionary of dtypes for the specific columns. df = df.assign(col="hello").astype({'col':'category'}) df.dtypes A int64 col category dtype: object That way you don't have to create a series of length equal to the dataframe. You can just broadcast the input string directly, which would be a bit more time and memory efficient. This approach is quite scalable as you can see. You can assign multiple variables as per your need, some based on complex functions as well. Then set datatypes for them as per requirement. df = pd.DataFrame({'A':[1,2,3,4]}) df = (df.assign(col1 = 'hello', #Define column based on series or broadcasting col2 = lambda x:x['A']**2, #Define column based on existing columns col3 = lambda x:x['col2']/x['A']) #Define column based on previously defined columns .astype({'col1':'category', 'col2':'float'})) print(df) print(df.dtypes) A col1 col2 col3 0 1 hello 1.0 1.0 1 2 hello 4.0 2.0 2 3 hello 9.0 3.0 3 4 hello 16.0 4.0 A int64 col1 category #<-changed dtype col2 float64 #<-changed dtype col3 float64 dtype: object
This solution surely solves the first point, not sure about the second: df['col'] = pd.Categorical('hello' for i in range(len(df))) Essentially we first create a generator of 'hello' with length equal to the number of records in df, then we pass it to pd.Categorical to make it a categorical column.
get value from dataframe based on row values without using column names
I am trying to get a value situated on the third column from a pandas dataframe by knowing the values of interest on the first two columns, which point me to the right value to fish out. I do not know the row index, just the values I need to look for on the first two columns. The combination of values from the first two columns is unique, so I do not expect to get a subset of the dataframe, but only a row. I do not have column names and I would like to avoid using them. Consider the dataframe df: a 1 bla b 2 tra b 3 foo b 1 bar c 3 cra I would like to get tra from the second row, based on the b and 2 combination that I know beforehand. I've tried subsetting with df = df.loc['b', :] which returns all the rows with b on the same column (provided I've read the data with index_col = 0) but I am not able to pass multiple conditions on it without crashing or knowing the index of the row of interest. I tried both df.loc and df.iloc. In other words, ideally I would like to get tra without even using row indexes, by doing something like: df[(df[,0] == 'b' & df[,1] == `2`)][2] Any suggestions? Probably it is something simple enough, but I have the tendency to use the same syntax as in R, which apparently is not compatible. Thank you in advance
As @anky has suggested, a way to do this without knowing the column names nor the row index where your value of interest is, would be to read the file into a pandas dataframe using multiple column indexing. For the provided example, knowing the column indexes at least, that would be: df = pd.read_csv(path, sep='\t', index_col=[0, 1]) then, you can use: df = df.iloc[df.index.get_loc(("b", 2)):] df.iloc[0] to get the value of interest. Thanks again @anky for your help. If you found this question useful, please upvote @anky's comment in the posted question.
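If the two key columns are set as a MultiIndex as above, a plain .loc lookup is an arguably simpler sketch of the same idea (made-up frame, positional column labels):
import pandas as pd
df = pd.DataFrame({0: ['a', 'b', 'b', 'b', 'c'],
                   1: [1, 2, 3, 1, 3],
                   2: ['bla', 'tra', 'foo', 'bar', 'cra']}).set_index([0, 1])
print(df.loc[('b', 2), 2])   # -> 'tra'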
I'd probably use DataFrame.query for that: import pandas as pd df = pd.DataFrame(index=['a', 'b', 'b', 'b', 'c'], data={"col1": [1, 2, 3, 1, 3], "col2": ['bla', 'tra', 'foo', 'bar', 'cra']}) df col1 col2 a 1 bla b 2 tra b 3 foo b 1 bar c 3 cra df.query('col1 == 2 and col2 == "tra"') col1 col2 b 2 tra
Adding an array to pandas dataframe
I have a dataframe, and I want to create a new column and add arrays to each row of this new column. I know that to do this I have to change the datatype of the column to 'object'. I tried the following but it doesn't work: import pandas import numpy as np df = pandas.DataFrame({'a':[1,2,3,4]}) df['b'] = np.nan df['b'] = df['b'].astype(object) df.loc[0,'b'] = [[1,2,4,5]] The error is ValueError: Must have equal len keys and value when setting with an ndarray However, it works if I convert the datatype of the whole dataframe into 'object': df = pandas.DataFrame({'a':[1,2,3,4]}) df['b'] = np.nan df = df.astype(object) df.loc[0,'b'] = [[1,2,4,5]] So my question is: why do I have to change the datatype of the whole DataFrame?
try this: In [12]: df.at[0,'b'] = [1,2,4,5] In [13]: df Out[13]: a b 0 1 [1, 2, 4, 5] 1 2 NaN 2 3 NaN 3 4 NaN PS be aware that as soon as you put non scalar value in any cells - the corresponding column's dtype will be changed to object in order to be able to contain non-scalar values: In [14]: df.dtypes Out[14]: a int64 b object dtype: object PPS generally it's a bad idea to store non-scalar values in cells, because the vast majority of Pandas/Numpy methods will not work properly with such data.
Assign a series to ALL columns of the dataFrame (columnwise)?
I have a dataframe, and a series of the same vertical size as df. I want to assign that series to ALL columns of the DataFrame. What is the natural way to do it? For example df = pd.DataFrame([[1, 2 ], [3, 4], [5 , 6]] ) ser = pd.Series([1, 2, 3 ]) I want all columns of "df" to be equal to "ser". PS Related: One way to solve it via this answer: How to assign dataframe[ boolean Mask] = Series - make it row-wise ? I.e. where Mask = true take values from the same row of the Series (creating an all-true mask), but I guess there should be some simpler way. If I need NOT all, but SOME columns - the answer is given here: Assign a Series to several Rows of a Pandas DataFrame
Use to_frame with reindex: a = ser.to_frame().reindex(columns=df.columns, method='ffill') print (a) 0 1 0 1 1 1 2 2 2 3 3 But the solution from the comment seems easier; the columns parameter was added in case you need the same column order as the original with real data: df = pd.DataFrame({c:ser for c in df.columns}, columns=df.columns)
Maybe a different way to look at it: df = pd.concat([ser] * df.shape[1], axis=1)
How to add an empty column to a dataframe?
What's the easiest way to add an empty column to a pandas DataFrame object? The best I've stumbled upon is something like df['foo'] = df.apply(lambda _: '', axis=1) Is there a less perverse method?
If I understand correctly, assignment should fill: >>> import numpy as np >>> import pandas as pd >>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]}) >>> df A B 0 1 2 1 2 3 2 3 4 >>> df["C"] = "" >>> df["D"] = np.nan >>> df A B C D 0 1 2 NaN 1 2 3 NaN 2 3 4 NaN
To add to DSM's answer and building on this associated question, I'd split the approach into two cases: Adding a single column: Just assign empty values to the new columns, e.g. df['C'] = np.nan Adding multiple columns: I'd suggest using the .reindex(columns=[...]) method of pandas to add the new columns to the dataframe's column index. This also works for adding multiple new rows with .reindex(index=[...]). Note that newer versions of Pandas (v>0.20) allow you to specify an axis keyword rather than explicitly assigning to columns or rows. Here is an example adding multiple columns: mydf = mydf.reindex(columns = mydf.columns.tolist() + ['newcol1','newcol2']) or mydf = mydf.reindex(mydf.columns.tolist() + ['newcol1','newcol2'], axis=1) # version > 0.20.0 You can also always concatenate a new (empty) dataframe to the existing dataframe, but that doesn't feel as pythonic to me :)
I like: df['new'] = pd.Series(dtype='int') # or use other dtypes like 'float', 'object', ... If you have an empty dataframe, this solution makes sure that no new row containing only NaN is added. Specifying dtype is not strictly necessary, however newer Pandas versions produce a DeprecationWarning if not specified.
an even simpler solution is: df = df.reindex(columns = header_list) where "header_list" is a list of the headers you want to appear. any header included in the list that is not found already in the dataframe will be added with blank cells below. so if header_list = ['a','b','c', 'd'] then c and d will be added as columns with blank cells
Starting with v0.16.0, DF.assign() could be used to assign new columns (single/multiple) to a DF. These columns get inserted in alphabetical order at the end of the DF. This becomes advantageous compared to simple assignment in cases wherein you want to perform a series of chained operations directly on the returned dataframe. Consider the same DF sample demonstrated by @DSM: df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]}) df Out[18]: A B 0 1 2 1 2 3 2 3 4 df.assign(C="",D=np.nan) Out[21]: A B C D 0 1 2 NaN 1 2 3 NaN 2 3 4 NaN Note that this returns a copy with all the previous columns along with the newly created ones. In order for the original DF to be modified accordingly, use it like: df = df.assign(...) as it does not support inplace operation currently.
if you want to add column name from a list df=pd.DataFrame() a=['col1','col2','col3','col4'] for i in a: df[i]=np.nan
df["C"] = "" df["D"] = np.nan Assignment will give you this warning SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead so it's better to use insert: df.insert(index, column-name, column-value)
@emunsing's answer is really cool for adding multiple columns, but I couldn't get it to work for me in python 2.7. Instead, I found this works: mydf = mydf.reindex(columns = np.append( mydf.columns.values, ['newcol1','newcol2']))
One can use df.insert(index_to_insert_at, column_header, init_value) to insert new column at a specific index. cost_tbl.insert(1, "col_name", "") The above statement would insert an empty Column after the first column.
this will also work for multiple columns: df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]}) >>> df A B 0 1 2 1 2 3 2 3 4 df1 = pd.DataFrame(columns=['C','D','E']) df = df.join(df1, how="outer") >>>df A B C D E 0 1 2 NaN NaN NaN 1 2 3 NaN NaN NaN 2 3 4 NaN NaN NaN Then do whatever you want to do with the columns pd.Series.fillna(),pd.Series.map() etc.
The below code address the question "How do I add n number of empty columns to my existing dataframe". In the interest of keeping solutions to similar problems in one place, I am adding it here. Approach 1 (to create 64 additional columns with column names from 1-64) m = list(range(1,65,1)) dd=pd.DataFrame(columns=m) df.join(dd).replace(np.nan,'') #df is the dataframe that already exists Approach 2 (to create 64 additional columns with column names from 1-64) df.reindex(df.columns.tolist() + list(range(1,65,1)), axis=1).replace(np.nan,'')
You can do df['column'] = None #This works. This will create a new column with None type df.column = None #This will work only when the column is already present in the dataframe
If you have a list of columns that you want to be empty, you can use assign, then a comprehension dict, then dict unpacking. >>> df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]}) >>> nan_cols_name = ["C","D","whatever"] >>> df.assign(**{col:np.nan for col in nan_cols_name}) A B C D whatever 0 1 2 NaN NaN NaN 1 2 3 NaN NaN NaN 2 3 4 NaN NaN NaN You can also unpack multiple dicts in a dict that you unpack if you want different values for different columns. df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4]}) nan_cols_name = ["C","D","whatever"] empty_string_cols_name = ["E","F","bad column with space"] df.assign(**{ **{col:np.nan for col in nan_cols_name}, **{col:"" for col in empty_string_cols_name} } )
Sorry, I did not explain my answer very well at the beginning. There is another way to add a new column to an existing dataframe. 1st step, make a new empty data frame (with all the columns in your data frame, plus the new column or columns you want to add) called df_temp. 2nd step, combine df_temp and your data frame. df_temp = pd.DataFrame(columns=(df_null.columns.tolist() + ['empty'])) df = pd.concat([df_temp, df]) It might not be the best solution, but it is another way to think about this question. The reason I am using this method is because I keep getting this warning: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df["empty1"], df["empty2"] = [np.nan, ""] Great, I found the way to disable the warning: pd.options.mode.chained_assignment = None
The reason I was looking for such a solution is simply to add spaces between multiple DFs which have been joined column-wise using the pd.concat function and then written to excel using xlsxwriter. df[' ']=df.apply(lambda _: '', axis=1) df_2 = pd.concat([df,df1],axis=1) #worked but only once. # Note: df & df1 have the same rows which is my index. # df_2[' ']=df_2.apply(lambda _: '', axis=1) #didn't work this time !!? df_4 = pd.concat([df_2,df_3],axis=1) I then replaced the second lambda call with df_2['']='' #which appears to add a blank column df_4 = pd.concat([df_2,df_3],axis=1) The output I tested it on was using xlsxwriter to excel. Jupyter blank columns look the same as in excel although it doesn't have xlsx formatting. Not sure why the second lambda call didn't work.