Convert categorical data in pandas dataframe - python

I have a dataframe with this type of data (too many columns):
col1 int64
col2 int64
col3 category
col4 category
col5 category
Columns look like this:
Name: col3, dtype: category
Categories (8, object): [B, C, E, G, H, N, S, W]
I want to convert all the values in each column to integer like this:
[1, 2, 3, 4, 5, 6, 7, 8]
I solved this for one column by this:
dataframe['c'] = pandas.Categorical.from_array(dataframe.col3).codes
Now I have two columns in my dataframe - old col3 and new c and need to drop old columns.
That's bad practice. It works but in my dataframe there are too many columns and I don't want do it manually.
How can I do this more cleverly?

First, to convert a Categorical column to its numerical codes, you can do this easier with: dataframe['c'].cat.codes.
Further, it is possible to select automatically all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply above operation on multiple and automatically selected columns.
First making an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then by using select_dtypes to select the columns, and then applying .cat.codes on each of these columns, you can get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1

This works for me:
pandas.factorize( ['B', 'C', 'D', 'B'] )[0]
Output:
[0, 1, 2, 0]

If your concern was only that you making a extra column and deleting it later, just dun use a new column at the first place.
dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes
You are done. Now as Categorical.from_array is deprecated, use Categorical directly
dataframe.col3 = pd.Categorical(dataframe.col3).codes
If you also need the mapping back from index to label, there is even better way for the same
dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()
check below
print(dataframe)
print(mapping_index.get_loc("c"))

Here multiple columns need to be converted. So, one approach i used is ..
for col_name in df.columns:
if(df[col_name].dtype == 'object'):
df[col_name]= df[col_name].astype('category')
df[col_name] = df[col_name].cat.codes
This converts all string / object type columns to categorical. Then applies codes to each type of category.

What I do is, I replace values.
Like this-
df['col'].replace(to_replace=['category_1', 'category_2', 'category_3'], value=[1, 2, 3], inplace=True)
In this way, if the col column has categorical values, they get replaced by the numerical values.

For converting categorical data in column C of dataset data, we need to do the following:
from sklearn.preprocessing import LabelEncoder
labelencoder= LabelEncoder() #initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C']) #fitting and transforming the desired categorical column.

To convert all the columns in the Dataframe to numerical data:
df2 = df2.apply(lambda x: pd.factorize(x)[0])

Answers here seem outdated. Pandas now has a factorize() function and you can create categories as:
df.col.factorize()
Function signature:
pandas.factorize(values, sort=False, na_sentinel=- 1, size_hint=None)

One of the simplest ways to convert the categorical variable into dummy/indicator variables is to use get_dummies provided by pandas.
Say for example we have data in which sex is a categorical value (male & female)
and you need to convert it into a dummy/indicator here is how to do it.
tranning_data = pd.read_csv("../titanic/train.csv")
features = ["Age", "Sex", ] //here sex is catagorical value
X_train = pd.get_dummies(tranning_data[features])
print(X_train)
Age Sex_female Sex_male
20 0 1
33 1 0
40 1 0
22 1 0
54 0 1

you can use .replace as the following:
df['col3']=df['col3'].replace(['B', 'C', 'E', 'G', 'H', 'N', 'S', 'W'],[1,2,3,4,5,6,7,8])
or .map:
df['col3']=df['col3'].map({1: 'B', 2: 'C', 3: 'E', 4:'G', 5:'H', 6:'N', 7:'S', 8:'W'})

categorical_columns =['sex','class','deck','alone']
for column in categorical_columns:
df[column] = pd.factorize(df[column])[0]
Factorize will make each unique categorical data in a column into a specific number (from 0 to infinity).

#Quickbeam2k1 ,see below -
dataset=pd.read_csv('Data2.csv')
np.set_printoptions(threshold=np.nan)
X = dataset.iloc[:,:].values
Using sklearn
from sklearn.preprocessing import LabelEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0])

You can do it less code like below :
f = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'),'col3':list('ababb')})
f['col1'] =f['col1'].astype('category').cat.codes
f['col2'] =f['col2'].astype('category').cat.codes
f['col3'] =f['col3'].astype('category').cat.codes
f

Just use manual matching:
dict = {'Non-Travel':0, 'Travel_Rarely':1, 'Travel_Frequently':2}
df['BusinessTravel'] = df['BusinessTravel'].apply(lambda x: dict.get(x))

For a certain column, if you don't care about the ordering, use this
df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique()==x)[0][0])
If you care about the ordering, specify them as a list and use this
df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))

Related

Opposite of factorize function (Map numeric to categorical values)

I am searching for a way to map some numeric columns to categorical features.
All columns are of categorical nature but are represented as integers. However I need them to be a "String".
e.g.
col1 col2 col3 -> col1new col2new col3new
0 1 1 -> "0" "1" "1"
2 2 3 -> "2" "2" "3"
1 3 2 -> "1" "3" "2"
It does not matter what kind of String the new column contains as long as all distinct values from the original data set map to the same new String value.
Any ideas?
I have a bumpy representation of my data right now but any pandas solution would be also helpful.
Thanks a lot!
You can use applymap method. Cosider the following example:
df = pd.DataFrame({'col1': [0, 2, 1], 'col2': [1, 2, 3], 'col3': [1, 3, 2]})
df.applymap(str)
col1 col2 col3
0 0 1 1
1 2 2 3
2 1 3 2
You can convert all elements of col1, col2, and col3 to str using the following command:
df = df.applymap(str)
you can modify the type of the elements in a list by using the dataframe.apply function which is offered by pandas-dataframe-apply.
frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1', 'col2', 'col3'])
in the new dataframe you can define columns and the value by:
updated_frame = pd.DataFrame(np.random.randint(0, 90, size =(5000000, 3)), columns =['col1new', 'col2new', 'col3new'])
updated_frame['col1new'] = frame['col1'].apply(str)
updated_frame['col2new'] = frame['col2'].apply(str)
updated_frame['col3new'] = frame['col3'].apply(str)
You could use the .astype method. If you want to replace all the current columns with a string version then you could do (df your dataframe):
df = df.astype(str)
If you want to add the string columns as new ones:
df = df.assign(**{f"{col}new": df[col].astype(str) for col in df.columns})

Python Pandas Selecting a Single Dataframe Column and Keeping as Dataframe instead of Series [duplicate]

When selecting a single column from a pandas DataFrame(say df.iloc[:, 0], df['A'], or df.A, etc), the resulting vector is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that takes a DataFrame as an input argument. Therefore, I prefer to deal with single-column DataFrame instead of Series so that the function can assume say df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame by using something like pd.DataFrame(df.iloc[:, 0]). This doesn't seem like the most clean method. Is there a more elegant way to index from a DataFrame directly so that the result is a single-column DataFrame instead of Series?
As #Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
A B
0 1 2
1 3 4
In [12]: df[['A']]
In [13]: df[[0]]
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
A
0 1
1 3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
A 0
0 1 2
1 3 4
In [18]: df[[0]] # ambiguous
Out[18]:
A
0 1
1 3
As Andy Hayden recommends, utilizing .iloc/.loc to index out (single-columned) dataframe is the way to go; another point to note is how to express the index positions.
Use a listed Index labels/positions whilst specifying the argument values to index out as Dataframe; failure to do so will return a 'pandas.core.series.Series'
Input:
A_1 = train_data.loc[:,'Fraudster']
print('A_1 is of type', type(A_1))
A_2 = train_data.loc[:, ['Fraudster']]
print('A_2 is of type', type(A_2))
A_3 = train_data.iloc[:,12]
print('A_3 is of type', type(A_3))
A_4 = train_data.iloc[:,[12]]
print('A_4 is of type', type(A_4))
Output:
A_1 is of type <class 'pandas.core.series.Series'>
A_2 is of type <class 'pandas.core.frame.DataFrame'>
A_3 is of type <class 'pandas.core.series.Series'>
A_4 is of type <class 'pandas.core.frame.DataFrame'>
These three approaches have been mentioned:
pd.DataFrame(df.loc[:, 'A']) # Approach of the original post
df.loc[:,[['A']] # Approach 2 (note: use iloc for positional indexing)
df[['A']] # Approach 3
pd.Series.to_frame() is another approach.
Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.
# Basic use case:
df['A'].to_frame()
# Use case 2 (this will give you pretty output in a Jupyter Notebook):
df['A'].describe().to_frame()
# Use case 3:
df['A'].str.strip().to_frame()
# Use case 4:
def some_function(num):
...
df['A'].apply(some_function).to_frame()
You can use df.iloc[:, 0:1], in this case the resulting vector will be a DataFrame and not series.
As you can see:
(Talking about pandas 1.3.4)
I'd like to add a little more context to the answers involving .to_frame(). If you select a single row of a data frame and execute .to_frame() on that, then the index will be made up of the original column names and you'll get numeric column names. You can just tack on a .T to the end to transpose that back into the original data frame's format (see below).
import pandas as pd
print(pd.__version__) #1.3.4
df = pd.DataFrame({
"col1": ["a", "b", "c"],
"col2": [1, 2, 3]
})
# series
df.loc[0, ["col1", "col2"]]
# dataframe (column names are along the index; not what I wanted)
df.loc[0, ["col1", "col2"]].to_frame()
# 0
# col1 a
# col2 1
# looks like an actual single-row dataframe.
# To me, this is the true answer to the question
# because the output matches the format of the
# original dataframe.
df.loc[0, ["col1", "col2"]].to_frame().T
# col1 col2
# 0 a 1
# this works really well with .to_dict(orient="records") which is
# what I'm ultimately after by selecting a single row
df.loc[0, ["col1", "col2"]].to_frame().T.to_dict(orient="records")
# [{'col1': 'a', 'col2': 1}]

How to convert data of type Panda to Panda.Dataframe?

I have a object of which type is Panda and the print(object) is giving below output
print(type(recomen_total))
print(recomen_total)
Output is
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this obejct in pd.DataFrame, how i can do it ?
i tried pd.DataFrame(object), from_dict also , they are throwing error
Interestingly, it will not convert to a dataframe directly but to a series. Once this is converted to a series use the to_frame method of series to convert it to a DataFrame
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
index=['a', 'b'])
for row in df.itertuples():
print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to save the column names use the _asdict() method like this:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
index=['a', 'b'])
for row in df.itertuples():
d = dict(row._asdict())
print(pd.Series(d).to_frame())
Output:
0
Index a
col1 1
col2 0.1
0
Index b
col1 2
col2 0.2
To create new DataFrame from itertuples namedtuple you can use list() or Series too:
import pandas as pd
# source DataFrame
df = pd.DataFrame({'a': [1,2], 'b':[3,4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x','y'], data=None)
for r in df.itertuples():
# create new DataFrame from itertuples() via list() ([1:] for skipping the index):
df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c','d'])
# or create new DataFrame from itertuples() via Series (drop(0) to remove index, T to transpose column to row)
df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
# or use append() to insert row into existing DataFrame ([1:] for skipping the index):
df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]
print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:
df_new_fromList:
c d
0 2 4
df_new_fromSeries:
1 2
0 2 4
df_new_fromAppend:
x y
0 1 3
1 2 4
To omit index, use param index=False (but I mostly need index for the iteration)
for r in df.itertuples(index=False):
# the [1:] needn't be used, for example:
df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
for row in df.itertuples():
row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
print(row_as_df)
The result is:
Index col1 col2
0 a 1 0.1
Index col1 col2
0 b 2 0.2
Sadly, AFAIU, there's no simple way to keep column names, without explicitly utilizing "protected attributes" such as _fields.
With some tweaks in #Igor's answer
I concluded with this satisfactory code which preserved column names and used as less of pandas code as possible.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# Or initialize another dataframe above
# Get list of column names
column_names = df.columns.values.tolist()
filtered_rows = []
for row in df.itertuples(index=False):
# Some code logic to filter rows
filtered_rows.append(row)
# Convert pandas.core.frame.Pandas to pandas.core.frame.Dataframe
# Combine filtered rows into a single dataframe
concatinated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatinated_df.to_csv("path_to_csv", index=False)
The result is a csv containing:
col1 col2
1 0.1
2 0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
# Example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])
animal top_speed
0 cheetah 120.00
1 human 44.72
2 dragonfly 54.00
Since Pandas does not recommended building DataFrames by adding single rows in a for loop, we will iterate and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50
list_ = list()
for animal in source_df.itertuples(index=False, name='animal'):
if animal.top_speed > 50:
list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
animal top_speed
0 cheetah 120.00
2 dragonfly 54.00

Keep selected column as DataFrame instead of Series

When selecting a single column from a pandas DataFrame(say df.iloc[:, 0], df['A'], or df.A, etc), the resulting vector is automatically converted to a Series instead of a single-column DataFrame. However, I am writing some functions that takes a DataFrame as an input argument. Therefore, I prefer to deal with single-column DataFrame instead of Series so that the function can assume say df.columns is accessible. Right now I have to explicitly convert the Series into a DataFrame by using something like pd.DataFrame(df.iloc[:, 0]). This doesn't seem like the most clean method. Is there a more elegant way to index from a DataFrame directly so that the result is a single-column DataFrame instead of Series?
As #Jeff mentions there are a few ways to do this, but I recommend using loc/iloc to be more explicit (and raise errors early if you're trying something ambiguous):
In [10]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
In [11]: df
Out[11]:
A B
0 1 2
1 3 4
In [12]: df[['A']]
In [13]: df[[0]]
In [14]: df.loc[:, ['A']]
In [15]: df.iloc[:, [0]]
Out[12-15]: # they all return the same thing:
A
0 1
1 3
The latter two choices remove ambiguity in the case of integer column names (precisely why loc/iloc were created). For example:
In [16]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 0])
In [17]: df
Out[17]:
A 0
0 1 2
1 3 4
In [18]: df[[0]] # ambiguous
Out[18]:
A
0 1
1 3
As Andy Hayden recommends, utilizing .iloc/.loc to index out (single-columned) dataframe is the way to go; another point to note is how to express the index positions.
Use a listed Index labels/positions whilst specifying the argument values to index out as Dataframe; failure to do so will return a 'pandas.core.series.Series'
Input:
A_1 = train_data.loc[:,'Fraudster']
print('A_1 is of type', type(A_1))
A_2 = train_data.loc[:, ['Fraudster']]
print('A_2 is of type', type(A_2))
A_3 = train_data.iloc[:,12]
print('A_3 is of type', type(A_3))
A_4 = train_data.iloc[:,[12]]
print('A_4 is of type', type(A_4))
Output:
A_1 is of type <class 'pandas.core.series.Series'>
A_2 is of type <class 'pandas.core.frame.DataFrame'>
A_3 is of type <class 'pandas.core.series.Series'>
A_4 is of type <class 'pandas.core.frame.DataFrame'>
These three approaches have been mentioned:
pd.DataFrame(df.loc[:, 'A']) # Approach of the original post
df.loc[:,[['A']] # Approach 2 (note: use iloc for positional indexing)
df[['A']] # Approach 3
pd.Series.to_frame() is another approach.
Because it is a method, it can be used in situations where the second and third approaches above do not apply. In particular, it is useful when applying some method to a column in your dataframe and you want to convert the output into a dataframe instead of a series. For instance, in a Jupyter Notebook a series will not have pretty output, but a dataframe will.
# Basic use case:
df['A'].to_frame()
# Use case 2 (this will give you pretty output in a Jupyter Notebook):
df['A'].describe().to_frame()
# Use case 3:
df['A'].str.strip().to_frame()
# Use case 4:
def some_function(num):
...
df['A'].apply(some_function).to_frame()
You can use df.iloc[:, 0:1], in this case the resulting vector will be a DataFrame and not series.
As you can see:
(Talking about pandas 1.3.4)
I'd like to add a little more context to the answers involving .to_frame(). If you select a single row of a data frame and execute .to_frame() on that, then the index will be made up of the original column names and you'll get numeric column names. You can just tack on a .T to the end to transpose that back into the original data frame's format (see below).
import pandas as pd
print(pd.__version__) #1.3.4
df = pd.DataFrame({
"col1": ["a", "b", "c"],
"col2": [1, 2, 3]
})
# series
df.loc[0, ["col1", "col2"]]
# dataframe (column names are along the index; not what I wanted)
df.loc[0, ["col1", "col2"]].to_frame()
# 0
# col1 a
# col2 1
# looks like an actual single-row dataframe.
# To me, this is the true answer to the question
# because the output matches the format of the
# original dataframe.
df.loc[0, ["col1", "col2"]].to_frame().T
# col1 col2
# 0 a 1
# this works really well with .to_dict(orient="records") which is
# what I'm ultimately after by selecting a single row
df.loc[0, ["col1", "col2"]].to_frame().T.to_dict(orient="records")
# [{'col1': 'a', 'col2': 1}]

Delete a column from a Pandas DataFrame

To delete a column in a DataFrame, I can successfully use:
del df['column_name']
But why can't I use the following?
del df.column_name
Since it is possible to access the Series via df.column_name, I expected this to work.
The best way to do this in Pandas is to use drop:
df = df.drop('column_name', axis=1)
where 1 is the axis number (0 for rows and 1 for columns.)
Or, the drop() method accepts index/columns keywords as an alternative to specifying the axis. So we can now just do:
df = df.drop(columns=['column_nameA', 'column_nameB'])
This was introduced in v0.21.0 (October 27, 2017)
To delete the column without having to reassign df you can do:
df.drop('column_name', axis=1, inplace=True)
Finally, to drop by column number instead of by column label, try this to delete, e.g. the 1st, 2nd and 4th columns:
df = df.drop(df.columns[[0, 1, 3]], axis=1) # df.columns is zero-based pd.Index
Also working with "text" syntax for the columns:
df.drop(['column_nameA', 'column_nameB'], axis=1, inplace=True)
As you've guessed, the right syntax is
del df['column_name']
It's difficult to make del df.column_name work simply as the result of syntactic limitations in Python. del df[name] gets translated to df.__delitem__(name) under the covers by Python.
Use:
columns = ['Col1', 'Col2', ...]
df.drop(columns, inplace=True, axis=1)
This will delete one or more columns in-place. Note that inplace=True was added in pandas v0.13 and won't work on older versions. You'd have to assign the result back in that case:
df = df.drop(columns, axis=1)
Drop by index
Delete first, second and fourth columns:
df.drop(df.columns[[0,1,3]], axis=1, inplace=True)
Delete first column:
df.drop(df.columns[[0]], axis=1, inplace=True)
There is an optional parameter inplace so that the original
data can be modified without creating a copy.
Popped
Column selection, addition, deletion
Delete column column-name:
df.pop('column-name')
Examples:
df = DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6]), ('C', [7,8, 9])], orient='index', columns=['one', 'two', 'three'])
print df:
one two three
A 1 2 3
B 4 5 6
C 7 8 9
df.drop(df.columns[[0]], axis=1, inplace=True)
print df:
two three
A 2 3
B 5 6
C 8 9
three = df.pop('three')
print df:
two
A 2
B 5
C 8
The actual question posed, missed by most answers here is:
Why can't I use del df.column_name?
At first we need to understand the problem, which requires us to dive into Python magic methods.
As Wes points out in his answer, del df['column'] maps to the Python magic method df.__delitem__('column') which is implemented in Pandas to drop the column.
However, as pointed out in the link above about Python magic methods:
In fact, __del__ should almost never be used because of the precarious circumstances under which it is called; use it with caution!
You could argue that del df['column_name'] should not be used or encouraged, and thereby del df.column_name should not even be considered.
However, in theory, del df.column_name could be implemented to work in Pandas using the magic method __delattr__. This does however introduce certain problems, problems which the del df['column_name'] implementation already has, but to a lesser degree.
Example Problem
What if I define a column in a dataframe called "dtypes" or "columns"?
Then assume I want to delete these columns.
del df.dtypes would make the __delattr__ method confused as if it should delete the "dtypes" attribute or the "dtypes" column.
Architectural questions behind this problem
Is a dataframe a collection of columns?
Is a dataframe a collection of rows?
Is a column an attribute of a dataframe?
Pandas answers:
Yes, in all ways
No, but if you want it to be, you can use the .ix, .loc or .iloc methods.
Maybe, do you want to read data? Then yes, unless the name of the attribute is already taken by another attribute belonging to the dataframe. Do you want to modify data? Then no.
TLDR;
You cannot do del df.column_name, because Pandas has a quite wildly grown architecture that needs to be reconsidered in order for this kind of cognitive dissonance not to occur to its users.
Pro tip:
Don't use df.column_name. It may be pretty, but it causes cognitive dissonance.
Zen of Python quotes that fits in here:
There are multiple ways of deleting a column.
There should be one-- and preferably only one --obvious way to do it.
Columns are sometimes attributes but sometimes not.
Special cases aren't special enough to break the rules.
Does del df.dtypes delete the dtypes attribute or the dtypes column?
In the face of ambiguity, refuse the temptation to guess.
A nice addition is the ability to drop columns only if they exist. This way you can cover more use cases, and it will only drop the existing columns from the labels passed to it:
Simply add errors='ignore', for example.:
df.drop(['col_name_1', 'col_name_2', ..., 'col_name_N'], inplace=True, axis=1, errors='ignore')
This is new from pandas 0.16.1 onward. Documentation is here.
From version 0.16.1, you can do
df.drop(['column_name'], axis = 1, inplace = True, errors = 'ignore')
It's good practice to always use the [] notation. One reason is that attribute notation (df.column_name) does not work for numbered indices:
In [1]: df = DataFrame([[1, 2, 3], [4, 5, 6]])
In [2]: df[1]
Out[2]:
0 2
1 5
Name: 1
In [3]: df.1
File "<ipython-input-3-e4803c0d1066>", line 1
df.1
^
SyntaxError: invalid syntax
Pandas 0.21+ answer
Pandas version 0.21 has changed the drop method slightly to include both the index and columns parameters to match the signature of the rename and reindex methods.
df.drop(columns=['column_a', 'column_c'])
Personally, I prefer using the axis parameter to denote columns or index because it is the predominant keyword parameter used in nearly all pandas methods. But, now you have some added choices in version 0.21.
In Pandas 0.16.1+, you can drop columns only if they exist per the solution posted by eiTan LaVi. Prior to that version, you can achieve the same result via a conditional list comprehension:
df.drop([col for col in ['col_name_1','col_name_2',...,'col_name_N'] if col in df],
axis=1, inplace=True)
Use:
df.drop('columnname', axis =1, inplace = True)
Or else you can go with
del df['colname']
To delete multiple columns based on column numbers
df.drop(df.iloc[:,1:3], axis = 1, inplace = True)
To delete multiple columns based on columns names
df.drop(['col1','col2',..'coln'], axis = 1, inplace = True)
TL;DR
A lot of effort to find a marginally more efficient solution. Difficult to justify the added complexity while sacrificing the simplicity of df.drop(dlst, 1, errors='ignore')
df.reindex_axis(np.setdiff1d(df.columns.values, dlst), 1)
Preamble
Deleting a column is semantically the same as selecting the other columns. I'll show a few additional methods to consider.
I'll also focus on the general solution of deleting multiple columns at once and allowing for the attempt to delete columns not present.
Using these solutions are general and will work for the simple case as well.
Setup
Consider the pd.DataFrame df and list to delete dlst
df = pd.DataFrame(dict(zip('ABCDEFGHIJ', range(1, 11))), range(3))
dlst = list('HIJKLM')
df
A B C D E F G H I J
0 1 2 3 4 5 6 7 8 9 10
1 1 2 3 4 5 6 7 8 9 10
2 1 2 3 4 5 6 7 8 9 10
dlst
['H', 'I', 'J', 'K', 'L', 'M']
The result should look like:
df.drop(dlst, 1, errors='ignore')
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Since I'm equating deleting a column to selecting the other columns, I'll break it into two types:
Label selection
Boolean selection
Label Selection
We start by manufacturing the list/array of labels that represent the columns we want to keep and without the columns we want to delete.
df.columns.difference(dlst)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
np.setdiff1d(df.columns.values, dlst)
array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype=object)
df.columns.drop(dlst, errors='ignore')
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
list(set(df.columns.values.tolist()).difference(dlst))
# does not preserve order
['E', 'D', 'B', 'F', 'G', 'A', 'C']
[x for x in df.columns.values.tolist() if x not in dlst]
['A', 'B', 'C', 'D', 'E', 'F', 'G']
Columns from Labels
For the sake of comparing the selection process, assume:
cols = [x for x in df.columns.values.tolist() if x not in dlst]
Then we can evaluate
df.loc[:, cols]
df[cols]
df.reindex(columns=cols)
df.reindex_axis(cols, 1)
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Boolean Slice
We can construct an array/list of booleans for slicing
~df.columns.isin(dlst)
~np.in1d(df.columns.values, dlst)
[x not in dlst for x in df.columns.values.tolist()]
(df.columns.values[:, None] != dlst).all(1)
Columns from Boolean
For the sake of comparison
bools = [x not in dlst for x in df.columns.values.tolist()]
df.loc[: bools]
Which all evaluate to:
A B C D E F G
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
Robust Timing
Functions
setdiff1d = lambda df, dlst: np.setdiff1d(df.columns.values, dlst)
difference = lambda df, dlst: df.columns.difference(dlst)
columndrop = lambda df, dlst: df.columns.drop(dlst, errors='ignore')
setdifflst = lambda df, dlst: list(set(df.columns.values.tolist()).difference(dlst))
comprehension = lambda df, dlst: [x for x in df.columns.values.tolist() if x not in dlst]
loc = lambda df, cols: df.loc[:, cols]
slc = lambda df, cols: df[cols]
ridx = lambda df, cols: df.reindex(columns=cols)
ridxa = lambda df, cols: df.reindex_axis(cols, 1)
isin = lambda df, dlst: ~df.columns.isin(dlst)
in1d = lambda df, dlst: ~np.in1d(df.columns.values, dlst)
comp = lambda df, dlst: [x not in dlst for x in df.columns.values.tolist()]
brod = lambda df, dlst: (df.columns.values[:, None] != dlst).all(1)
Testing
res1 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc slc ridx ridxa'.split(),
'setdiff1d difference columndrop setdifflst comprehension'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res2 = pd.DataFrame(
index=pd.MultiIndex.from_product([
'loc'.split(),
'isin in1d comp brod'.split(),
], names=['Select', 'Label']),
columns=[10, 30, 100, 300, 1000],
dtype=float
)
res = res1.append(res2).sort_index()
dres = pd.Series(index=res.columns, name='drop')
for j in res.columns:
dlst = list(range(j))
cols = list(range(j // 2, j + j // 2))
d = pd.DataFrame(1, range(10), cols)
dres.at[j] = timeit('d.drop(dlst, 1, errors="ignore")', 'from __main__ import d, dlst', number=100)
for s, l in res.index:
stmt = '{}(d, {}(d, dlst))'.format(s, l)
setp = 'from __main__ import d, dlst, {}, {}'.format(s, l)
res.at[(s, l), j] = timeit(stmt, setp, number=100)
rs = res / dres
rs
10 30 100 300 1000
Select Label
loc brod 0.747373 0.861979 0.891144 1.284235 3.872157
columndrop 1.193983 1.292843 1.396841 1.484429 1.335733
comp 0.802036 0.732326 1.149397 3.473283 25.565922
comprehension 1.463503 1.568395 1.866441 4.421639 26.552276
difference 1.413010 1.460863 1.587594 1.568571 1.569735
in1d 0.818502 0.844374 0.994093 1.042360 1.076255
isin 1.008874 0.879706 1.021712 1.001119 0.964327
setdiff1d 1.352828 1.274061 1.483380 1.459986 1.466575
setdifflst 1.233332 1.444521 1.714199 1.797241 1.876425
ridx columndrop 0.903013 0.832814 0.949234 0.976366 0.982888
comprehension 0.777445 0.827151 1.108028 3.473164 25.528879
difference 1.086859 1.081396 1.293132 1.173044 1.237613
setdiff1d 0.946009 0.873169 0.900185 0.908194 1.036124
setdifflst 0.732964 0.823218 0.819748 0.990315 1.050910
ridxa columndrop 0.835254 0.774701 0.907105 0.908006 0.932754
comprehension 0.697749 0.762556 1.215225 3.510226 25.041832
difference 1.055099 1.010208 1.122005 1.119575 1.383065
setdiff1d 0.760716 0.725386 0.849949 0.879425 0.946460
setdifflst 0.710008 0.668108 0.778060 0.871766 0.939537
slc columndrop 1.268191 1.521264 2.646687 1.919423 1.981091
comprehension 0.856893 0.870365 1.290730 3.564219 26.208937
difference 1.470095 1.747211 2.886581 2.254690 2.050536
setdiff1d 1.098427 1.133476 1.466029 2.045965 3.123452
setdifflst 0.833700 0.846652 1.013061 1.110352 1.287831
fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharey=True)
for i, (n, g) in enumerate([(n, g.xs(n)) for n, g in rs.groupby('Select')]):
ax = axes[i // 2, i % 2]
g.plot.bar(ax=ax, title=n)
ax.legend_.remove()
fig.tight_layout()
This is relative to the time it takes to run df.drop(dlst, 1, errors='ignore'). It seems like after all that effort, we only improve performance modestly.
If fact the best solutions use reindex or reindex_axis on the hack list(set(df.columns.values.tolist()).difference(dlst)). A close second and still very marginally better than drop is np.setdiff1d.
rs.idxmin().pipe(
lambda x: pd.DataFrame(
dict(idx=x.values, val=rs.lookup(x.values, x.index)),
x.index
)
)
idx val
10 (ridx, setdifflst) 0.653431
30 (ridxa, setdifflst) 0.746143
100 (ridxa, setdifflst) 0.816207
300 (ridx, setdifflst) 0.780157
1000 (ridxa, setdifflst) 0.861622
We can remove or delete a specified column or specified columns by the drop() method.
Suppose df is a dataframe.
Column to be removed = column0
Code:
df = df.drop(column0, axis=1)
To remove multiple columns col1, col2, . . . , coln, we have to insert all the columns that needed to be removed in a list. Then remove them by the drop() method.
Code:
df = df.drop([col1, col2, . . . , coln], axis=1)
If your original dataframe df is not too big, you have no memory constraints, and you only need to keep a few columns, or, if you don't know beforehand the names of all the extra columns that you do not need, then you might as well create a new dataframe with only the columns you need:
new_df = df[['spam', 'sausage']]
Deleting a column using the iloc function of dataframe and slicing, when we have a typical column name with unwanted values:
df = df.iloc[:,1:] # Removing an unnamed index column
Here 0 is the default row and 1 is the first column, hence :,1: is our parameter for deleting the first column.
The dot syntax works in JavaScript, but not in Python.
Python: del df['column_name']
JavaScript: del df['column_name'] or del df.column_name
Another way of deleting a column in a Pandas DataFrame
If you're not looking for in-place deletion then you can create a new DataFrame by specifying the columns using DataFrame(...) function as:
my_dict = { 'name' : ['a','b','c','d'], 'age' : [10,20,25,22], 'designation' : ['CEO', 'VP', 'MD', 'CEO']}
df = pd.DataFrame(my_dict)
Create a new DataFrame as
newdf = pd.DataFrame(df, columns=['name', 'age'])
You get a result as good as what you get with del / drop.
Taking advantage by using Autocomplete or "IntelliSense" over string literals:
del df[df.column1.name]
# or
df.drop(df.column1.name, axis=1, inplace=True)
It works fine with current Pandas versions.
To remove columns before and after specific columns you can use the method truncate. For example:
A B C D E
0 1 10 100 1000 10000
1 2 20 200 2000 20000
df.truncate(before='B', after='D', axis=1)
Output:
B C D
0 10 100 1000
1 20 200 2000
Viewed from a general Python standpoint, del obj.column_name makes sense if the attribute column_name can be deleted. It needs to be a regular attribute - or a property with a defined deleter.
The reasons why this doesn't translate to Pandas, and does not make sense for Pandas Dataframes are:
Consider df.column_name to be a “virtual attribute”, it is not a thing in its own right, it is not the “seat” of that column, it's just a way to access the column. Much like a property with no deleter.

Categories

Resources