python pandas: rename single column label in multi-index dataframe

I have a df that looks like this:
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([['1','2'],['A','B']])
print(df)
          1                   2
          A         B         A         B
0  0.030626  0.494912  0.364742  0.320088
1  0.178368  0.857469  0.628677  0.705226
2  0.886296  0.833130  0.495135  0.246427
3  0.391352  0.128498  0.162211  0.011254
How can I rename columns '1' and '2' to 'One' and 'Two'?
I thought df.rename() would help, but it doesn't. I have no idea how to do this.

That is indeed something missing in rename (ideally it should let you specify the level).
Another way is by setting the levels of the columns index, but then you need to know all values for that level:
In [41]: df.columns.levels[0]
Out[41]: Index([u'1', u'2'], dtype='object')
In [43]: df.columns = df.columns.set_levels(['one', 'two'], level=0)
In [44]: df
Out[44]:
        one                 two
          A         B         A         B
0  0.899686  0.466577  0.867268  0.064329
1  0.162480  0.455039  0.736870  0.759595
2  0.620960  0.922119  0.060141  0.669997
3  0.871107  0.043799  0.080080  0.577421
In [45]: df.columns.levels[0]
Out[45]: Index([u'one', u'two'], dtype='object')

As of pandas 0.22.0 (and probably much earlier), you can specify the level:
df = df.rename(columns={'1': 'One', '2': 'Two'}, level=0)
or, alternatively (new notation since pandas 0.21.0):
df = df.rename({'1': 'One', '2': 'Two'}, axis='columns', level=0)
But actually, it works even when omitting the level:
df = df.rename(columns={'1': 'One', '2': 'Two'})
In that case, every column level is checked for labels to rename.
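To make the difference between passing level and omitting it concrete, here is a minimal sketch. It builds the frame from fixed values instead of random ones, and the extra mapping entry 'A': 'a' is added purely for illustration; it assumes a pandas version in which rename accepts level.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4))
df.columns = pd.MultiIndex.from_product([['1', '2'], ['A', 'B']])

# Restrict the mapping to the outer level only
outer_only = df.rename(columns={'1': 'One', '2': 'Two'}, level=0)

# Without level, the mapping is tried against the labels of every level
all_levels = df.rename(columns={'1': 'One', 'A': 'a'})

print(outer_only.columns.tolist())
print(all_levels.columns.tolist())
```

Passing level is the safer choice when the same label could occur on more than one level.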

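Use set_levels (it returns a new index, so assign the result back; the inplace argument is no longer available in recent pandas versions):
>>> df.columns = df.columns.set_levels(['one', 'two'], level=0)
>>> print(df)
        one                 two
          A         B         A         B
0  0.731851  0.489611  0.636441  0.774818
1  0.996034  0.298914  0.377097  0.404644
2  0.217106  0.808459  0.588594  0.009408
3  0.851270  0.799914  0.328863  0.009914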

df.columns = df.columns.set_levels(['one', 'two'], level=0)

df.rename({'1':'one', '2':'two'}, axis='columns', inplace=True)

This is a good question. Combining the answers above, you can write a function:
def rename_col(df, columns, level=0):
    def rename_apply(x, rename_dict):
        # Return the new label if one is given, otherwise keep the old one
        try:
            return rename_dict[x]
        except KeyError:
            return x

    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.set_levels(
            [rename_apply(x, rename_dict=columns) for x in df.columns.levels[level]],
            level=level)
    else:
        df.columns = [rename_apply(x, rename_dict=columns) for x in df.columns]
    return df
It worked for me.
Ideally, functionality like this should be integrated into the "official" rename function in the future, so you don't need to write a hack like this.

Related

Pandas: Modify a particular level of Multiindex, using replace method several times

I am trying to use the replace method several times in order to change the labels of a given level of a MultiIndex in a pandas DataFrame.
As seen here: Pandas: Modify a particular level of Multiindex, @John got a solution that works great as long as the replace method is used once.
The problem is that it does not work if I chain the method several times.
E.g.
df.index = df.index.set_levels(df.index.levels[0].str.replace("dataframe_",'').replace("_r",' r'), level=0)
I get the following error message:
AttributeError: 'Index' object has no attribute 'replace'
What am I missing?
Use str.replace twice:
idx = df.index.levels[0].str.replace("dataframe_",'').str.replace("_r",' r')
df.index = df.index.set_levels(idx, level=0)
Another solution is to convert the level with to_series and then replace using a dictionary (note that this matches whole labels, not substrings):
d = {'dataframe_':'','_r':' r'}
idx = df.index.levels[0].to_series().replace(d)
df.index = df.index.set_levels(idx, level=0)
And a solution with map and fillna, for when the data is large and performance matters:
d = {'dataframe_':'','_r':' r'}
s = df.index.levels[0].to_series()
df.index = df.index.set_levels(s.map(d).fillna(s), level=0)
Sample:
df = pd.DataFrame({
'A':['dataframe_','_r', 'a'],
'B':[7,8,9],
'C':[1,3,5],
}).set_index(['A','B'])
print(df)
              C
A          B
dataframe_ 7  1
_r         8  3
a          9  5
d = {'dataframe_':'','_r':' r'}
idx = df.index.levels[0].to_series().replace(d)
df.index = df.index.set_levels(idx, level=0)
print(df)
      C
A  B
   7  1
 r 8  3
a  9  5
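The sample above works with the dictionary because each key equals a whole label. When the fragments to replace sit inside longer labels, chained str.replace calls are needed. The following is a minimal sketch of that case; the labels 'dataframe_x_r1' and 'dataframe_y_r2' are invented for illustration and do not come from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {'C': [1, 3, 5]},
    index=pd.MultiIndex.from_tuples(
        [('dataframe_x_r1', 7), ('dataframe_y_r2', 8), ('z', 9)],
        names=['A', 'B'],
    ),
)

d = {'dataframe_': '', '_r': ' r'}
labels = df.index.levels[0]

# Series.replace with a plain dict compares whole labels, so nothing changes here
whole = labels.to_series().replace(d)

# Apply each substring replacement in turn through the .str accessor
partial = labels
for old, new in d.items():
    partial = partial.str.replace(old, new, regex=False)

df.index = df.index.set_levels(partial, level=0)
print(df.index.get_level_values('A').tolist())
```

The loop is just the chained `.str.replace(...).str.replace(...)` call written so that it scales with the size of the dictionary.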

Rename variously formatted column headers in pandas

I'm working on a small tool that does some calculations on a dataframe, let's say something like this:
df['column_c'] = df['column_a'] + df['column_b']
For this to work, the dataframe needs to have the columns 'column_a' and 'column_b'. I would like this code to work even if the columns are named slightly differently in the import file (csv or xlsx), for example 'columnA', 'Col_a', etc.
The easiest way would be renaming the columns inside the imported file, but let's assume this is not possible. Therefore I would like to do something like this:
if column name is in list ['columnA', 'Col_A', 'col_a', 'a'... ] rename it to 'column_a'
I was thinking about having a dictionary of possible column names, so that any column whose name appears in this dictionary is renamed to 'column_a'. An additional complication is that the columns can be in arbitrary order.
How would one solve this problem?
I recommend you formulate the conversion logic and write a function accordingly:
lst = ['columnA', 'Col_A', 'col_a', 'a']
def converter(x):
    return 'column_'+x[-1].lower()
res = list(map(converter, lst))
['column_a', 'column_a', 'column_a', 'column_a']
You can then use this directly in pd.DataFrame.rename:
df = df.rename(columns=converter)
Example usage:
df = pd.DataFrame(columns=['columnA', 'col_B', 'c'])
df = df.rename(columns=converter)
print(df.columns)
Index(['column_a', 'column_b', 'column_c'], dtype='object')
Simply
new_columns = list(df.columns)  # an Index is immutable, so edit a list copy
for index, column_name in enumerate(new_columns):
    if column_name in ['columnA', 'Col_A', 'col_a']:
        new_columns[index] = 'column_a'
df.columns = new_columns
With a dictionary:
dico = {'column_a':['columnA', 'Col_A', 'col_a' ], 'column_b':['columnB', 'Col_B', 'col_b' ]}
new_columns = list(df.columns)
for index, column_name in enumerate(new_columns):
    for name, ex_names in dico.items():
        if column_name in ex_names:
            new_columns[index] = name
df.columns = new_columns
This should solve it:
df=pd.DataFrame({'colA':[1,2], 'columnB':[3,4]})
def rename_df(col):
    if col in ['columnA', 'Col_A', 'colA' ]:
        return 'column_a'
    if col in ['columnB', 'Col_B', 'colB' ]:
        return 'column_b'
    return col
df = df.rename(rename_df, axis=1)
If you have lists of the alternative names, such as list_othername_A or list_othername_B, you can do:
for col_name in df.columns:
    if col_name in list_othername_A:
        df = df.rename(columns = {col_name : 'column_a'})
    elif col_name in list_othername_B:
        df = df.rename(columns = {col_name : 'column_b'})
    elif ...
EDIT: using the dictionary of @djangoliv, you can make it even shorter:
dico = {'column_a':['columnA', 'Col_A', 'col_a' ], 'column_b':['columnB', 'Col_B', 'col_b' ]}
#create a dict to rename, kind of reverse dico:
dict_rename = {col:key for key in dico.keys() for col in dico[key]}
# then just rename:
df = df.rename(columns = dict_rename )
Note that this method breaks down if df contains both 'columnA' and 'Col_A', because both would be renamed to 'column_a'. Otherwise it should work, since rename ignores any key in dict_rename that is not in df.columns.
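Putting the inverted dictionary together with the duplicate caveat, a minimal end-to-end sketch could look like this. The alias lists and the sample frame are made up for illustration, and the duplicate check is an addition that the answers above do not include.

```python
import pandas as pd

# Canonical name -> accepted spellings (illustrative values)
aliases = {
    'column_a': ['columnA', 'Col_A', 'col_a', 'a'],
    'column_b': ['columnB', 'Col_B', 'col_b', 'b'],
}

# Invert to spelling -> canonical name, the shape that rename expects
rename_map = {alias: canonical
              for canonical, names in aliases.items()
              for alias in names}

# Column order does not matter, because rename matches by label
df = pd.DataFrame({'Col_B': [3, 4], 'columnA': [1, 2]})
df = df.rename(columns=rename_map)

# Fail loudly if two spellings of the same column were present
if df.columns.duplicated().any():
    raise ValueError('more than one alias maps to the same column name')

df['column_c'] = df['column_a'] + df['column_b']
print(df)
```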

Renaming columns in a list of dataframes?

I have a list of DataFrames that look like this,
dfs
[
var1 var1
14.171250 13.593813
13.578317 13.595329
10.301850 13.580139
9.930217 13.593278
6.192517 13.561943
7.738100 13.565149
6.197983 13.572509,
var1 var2
2.456183 5.907528
5.052017 5.955731
5.960000 5.972480
8.039317 5.984608
7.559217 5.985348
6.933633 5.979438
...
]
I want to rename var1 and var2 in each DataFrame to be Foo and Hoo.
I tried the following,
renames_dfs = []
for df in dfs:
    renames_dfs.append(df.rename(columns={'var1':'Foo','var2':'Hoo'},inplace = True))
This returns a list of None values. What mistake am I making here when I rename the columns?
You can do it like this.
[df.rename(columns={'var1':'Foo','var2':'Hoo'},inplace=True) for df in dfs]
Output:
[None,None]
BUT....
dfs
Output:
[ Foo Hoo
0 0.463904 0.765987
1 0.473314 0.609793
2 0.505549 0.449539
3 0.508157 0.444993
4 0.604366 0.368044, Foo Hoo
0 0.241526 0.225990
1 0.609949 0.454891
2 0.523094 0.443431
3 0.525026 0.714601
4 0.002260 0.763454]
Your existing code collects None values because rename with inplace=True modifies the DataFrame in place and returns None.
One efficient solution is to just assign to df.columns directly:
for df in dfs:
    df.columns = ['foo', 'bar']
This updates every DataFrame in the list in place, without having to create a new list.
Another option is using set_axis, if you are renaming all the columns:
df2 = [df.set_axis(['foo', 'bar'], axis=1) for df in dfs]
If renaming only a subset, use rename instead.
A call with inplace=True always returns None.
So you can use:
renames_dfs = []
for df in dfs:
    df.rename(columns={'var1':'Foo','var2':'Hoo'},inplace = True)
    renames_dfs.append(df)
But I think this is better:
renames_dfs = []
for df in dfs:
    renames_dfs.append(df.rename(columns={'var1':'Foo','var2':'Hoo'}))
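The two behaviours can be compared side by side in a small sketch. The two frames below are invented stand-ins for the dfs list in the question.

```python
import pandas as pd

dfs = [
    pd.DataFrame({'var1': [1.0, 2.0], 'var2': [3.0, 4.0]}),
    pd.DataFrame({'var1': [5.0, 6.0], 'var2': [7.0, 8.0]}),
]
mapping = {'var1': 'Foo', 'var2': 'Hoo'}

# inplace=True mutates each frame and returns None, so the list holds only None
# (copies are used here so the original frames stay untouched for the next step)
broken = [df.rename(columns=mapping, inplace=True) for df in [d.copy() for d in dfs]]

# Without inplace, rename returns a new renamed frame that can be collected
renamed = [df.rename(columns=mapping) for df in dfs]

print(broken)
print([list(df.columns) for df in renamed])
```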

adding rows to empty dataframe with columns

I am using Pandas and want to add rows to an empty DataFrame with columns already established.
So far my code looks like this...
def addRows(cereals,lines):
    for i in np.arange(1,len(lines)):
        dt = parseLine(lines[i])
        dt = pd.Series(dt)
        print(dt)
        # YOUR CODE GOES HERE (add dt to cereals)
        cereals.append(dt, ignore_index = True)
    return(cereals)
However, when I run...
cereals = addRows(cereals,lines)
cereals
the dataframe returns with no rows, just the columns. I am not sure what I am doing wrong but I am pretty sure it has something to do with the append method. Anyone have any ideas as to what I am doing wrong?
There are two probable reasons your code is not operating as intended:
1. cereals.append(dt, ignore_index = True) is not doing what you think it is. You're trying to append a Series, not a DataFrame, there.
2. cereals.append(dt, ignore_index = True) does not modify cereals in place, so when you return it, you're returning the unchanged original. A function with the same flaw looks like this:
>>> def foo(a):
...     a + 1
...     return a
...
>>> foo(1)
1
I haven't tested this on my machine, but I think your fixed solution would look like this:
def addRows(cereals, lines):
    for i in np.arange(1, len(lines)):
        data = parseLine(lines[i])
        new_df = pd.DataFrame(data, columns=cereals.columns)
        cereals = cereals.append(new_df, ignore_index=True)
    return cereals
By the way, I don't really know where lines is coming from, but right away I would at least modify it to look like this:
data = [parseLine(line) for line in lines]
cereals = cereals.append(pd.DataFrame(data, columns=cereals.columns), ignore_index=True)
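DataFrame.append was removed in pandas 2.0, so on current versions the same idea has to be written with pd.concat. The sketch below shows one way to do that. The parse_line helper and the sample lines are hypothetical stand-ins, since the question does not show parseLine or the input data.

```python
import pandas as pd

def parse_line(line):
    # Hypothetical stand-in for parseLine: split a comma-separated line
    name, calories = line.strip().split(',')
    return [name, int(calories)]

def add_rows(cereals, lines):
    # Skip the first line, as the original loop does, and parse the rest
    rows = [parse_line(line) for line in lines[1:]]
    new_rows = pd.DataFrame(rows, columns=cereals.columns)
    # concat returns a new DataFrame, so the result must be returned or assigned
    return pd.concat([cereals, new_rows], ignore_index=True)

cereals = pd.DataFrame(columns=['name', 'calories'])
lines = ['name,calories', 'Bran Flakes,90', 'Corn Pops,110']
cereals = add_rows(cereals, lines)
print(cereals)
```

Building one DataFrame from all parsed rows and concatenating once also avoids copying the growing frame on every loop iteration.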
How to add an extra row to a pandas dataframe
You could also create a new DataFrame and just append that DataFrame to your existing one. E.g.
>>> import pandas as pd
>>> empty_alph = pd.DataFrame(columns=['letter', 'index'])
>>> alph_abc = pd.DataFrame([['a', 0], ['b', 1], ['c', 2]], columns=['letter', 'index'])
>>> empty_alph.append(alph_abc)
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
As I noted in the link, you can also use the loc method on a DataFrame:
>>> df = empty_alph.append(alph_abc)
>>> df.loc[df.shape[0]] = ['d', 3]  # df.shape[0] is the next unused integer label
>>> df
  letter  index
0      a    0.0
1      b    1.0
2      c    2.0
3      d    3.0

Changing multiple column names but not all of them - Pandas Python

I would like to know if there is a function to change specific column names but without selecting a specific name or without changing all of them.
I have the code:
df=df.rename(columns = {'nameofacolumn':'newname'})
But with it I have to change each one manually, writing out each name.
To change all of them, I have
df.columns = ['name1','name2','etc']
I would like a function that changes columns 1 and 3 by stating their position, without writing their names.
Say you have a dictionary that maps each old column name to its new name:
df.rename(columns={'old_col':'new_col', 'old_col_2':'new_col_2'}, inplace=True)
But, if you don't have that, and you only have the indices, you can do this:
column_indices = [1,4,5,6]
new_names = ['a','b','c','d']
old_names = df.columns[column_indices]
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
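As a concrete check of the positional approach, here is a minimal sketch on an invented four-column frame. It also shows a list-based variant, which is an addition not taken from the answer above; that variant is useful when column labels are duplicated, because rename matches by label and would rename every duplicate.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=['w', 'x', 'y', 'z'])

positions = [0, 2]               # first and third column
new_names = ['first', 'third']

# Look up the current labels at those positions, then rename by label
old_names = df.columns[positions]
renamed = df.rename(columns=dict(zip(old_names, new_names)))

# Variant: edit a list copy of the labels by position and assign it back
cols = list(df.columns)
for pos, name in zip(positions, new_names):
    cols[pos] = name
df2 = df.copy()
df2.columns = cols

print(list(renamed.columns))
print(list(df2.columns))
```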
You can use a dict comprehension and pass this to rename:
In [246]:
df = pd.DataFrame(columns=list('abc'))
new_cols=['d','e']
df.rename(columns=dict(zip(df.columns[1:], new_cols)),inplace=True)
df
Out[246]:
Empty DataFrame
Columns: [a, d, e]
Index: []
It also works if you pass a list of ordinal positions:
df.rename(columns=dict(zip(df.columns[[1,2]], new_cols)),inplace=True)
You don't need to use rename method at all.
You simply replace the old column names with new ones using lists. To rename columns 1 and 3 (with index 0 and 2), you do something like this:
df.columns.values[[0, 2]] = ['newname0', 'newname2']
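or, if you are using a pandas version older than 0.16.0:
df.keys().values[[0, 2]] = ['newname0', 'newname2']
The advantage of this approach is that you don't need to copy the whole dataframe with the df = df.rename syntax; you just change the index values. Be aware, though, that an Index is meant to be immutable and this writes into its underlying array directly, so assigning a full list of names to df.columns is the safer variant.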
You should be able to reference the columns by position using df.columns[index]:
>> temp = pd.DataFrame(np.random.randn(10, 5),columns=['a', 'b', 'c', 'd', 'e'])
>> print(temp.columns[0])
a
>> print(temp.columns[1])
b
So to change the value of specific columns, first assign the values to an array and change only the values you want
>> newcolumns=temp.columns.values
>> newcolumns[0] = 'New_a'
Assign the new array back to the columns and you'll have what you need
>> temp.columns = newcolumns
>> temp.columns
>> print(temp.columns[0])
New_a
If you have a dict of {position: new_name}, you can use items(), e.g.:
new_columns = {3: 'fourth_column'}
df.rename(columns={df.columns[i]: new_col for i, new_col in new_columns.items()})
Full example:
$ ipython
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
...: import pandas as pd
...:
...: rng = np.random.default_rng(seed=0)
...: df = pd.DataFrame({key: rng.uniform(size=3) for key in list('abcde')})
...: df
Out[1]:
          a         b         c         d         e
0  0.636962  0.016528  0.606636  0.935072  0.857404
1  0.269787  0.813270  0.729497  0.815854  0.033586
2  0.040974  0.912756  0.543625  0.002739  0.729655
In [2]: new_columns = {3: 'fourth_column'}
...: df.rename(columns={df.columns[i]: new_col for i, new_col in new_columns.items()})
Out[2]:
          a         b         c  fourth_column         e
0  0.636962  0.016528  0.606636       0.935072  0.857404
1  0.269787  0.813270  0.729497       0.815854  0.033586
2  0.040974  0.912756  0.543625       0.002739  0.729655
