Renaming columns in a list of dataframes? - python

I have a list of DataFrames that look like this,
dfs
[
var1 var1
14.171250 13.593813
13.578317 13.595329
10.301850 13.580139
9.930217 13.593278
6.192517 13.561943
7.738100 13.565149
6.197983 13.572509,
var1 var2
2.456183 5.907528
5.052017 5.955731
5.960000 5.972480
8.039317 5.984608
7.559217 5.985348
6.933633 5.979438
...
]
I want to rename var1 and var2 in each DataFrame to be Foo and Hoo.
I tried the following,
renames_dfs = []
for df in dfs:
    renames_dfs.append(df.rename(columns={'var1':'Foo','var2':'Hoo'},inplace = True))
This returns a list containing only None. What mistake am I making when I rename the columns?

You can do it like this.
[df.rename(columns={'var1':'Foo','var2':'Hoo'},inplace=True) for df in dfs]
Output:
[None,None]
But the DataFrames in dfs themselves have been renamed in place:
dfs
Output:
[ Foo Hoo
0 0.463904 0.765987
1 0.473314 0.609793
2 0.505549 0.449539
3 0.508157 0.444993
4 0.604366 0.368044, Foo Hoo
0 0.241526 0.225990
1 0.609949 0.454891
2 0.523094 0.443431
3 0.525026 0.714601
4 0.002260 0.763454]

Your existing code produces None because rename with inplace=True modifies the DataFrame in place and returns None, and that None is what gets appended.
One efficient solution is to just assign to df.columns directly:
for df in dfs:
    df.columns = ['foo', 'bar']
This updates every DataFrame in the list without having to create a new list.
Another option is using set_axis, if you are renaming all the columns:
df2 = [df.set_axis(['foo', 'bar'], axis=1) for df in dfs]
If renaming only a subset, use rename instead.
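To make the difference between the two options above concrete, here is a minimal sketch. The two small DataFrames are made-up stand-ins for the ones in the question, and it assumes a pandas version where set_axis returns a new object by default (1.0 or later):

```python
import pandas as pd

# Made-up sample data standing in for the question's list of DataFrames.
dfs = [
    pd.DataFrame({"var1": [1.0, 2.0], "var2": [3.0, 4.0]}),
    pd.DataFrame({"var1": [5.0, 6.0], "var2": [7.0, 8.0]}),
]

# set_axis returns a relabeled DataFrame and leaves the original untouched.
relabeled = [df.set_axis(["foo", "bar"], axis=1) for df in dfs]
originals_after_set_axis = [list(df.columns) for df in dfs]

# Assigning to .columns changes each DataFrame in the list in place.
for df in dfs:
    df.columns = ["Foo", "Hoo"]

print(originals_after_set_axis)  # names before the in-place assignment
print([list(df.columns) for df in relabeled])
print([list(df.columns) for df in dfs])
```

Both routes require one new name per column, which is why rename is the better fit when only some columns change.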

With inplace=True, rename always returns None.
So you can use:
renames_dfs = []
for df in dfs:
    df.rename(columns={'var1':'Foo','var2':'Hoo'},inplace = True)
    renames_dfs.append(df)
But I think this is better:
renames_dfs = []
for df in dfs:
    renames_dfs.append(df.rename(columns={'var1':'Foo','var2':'Hoo'}))
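As a small self-contained check of what the answers describe, the sketch below uses made-up data and shows both behaviours side by side: the return value of rename with inplace=True, and a list built from the non-inplace form.

```python
import pandas as pd

# Made-up sample data; only the column names match the question.
dfs = [
    pd.DataFrame({"var1": [1.0, 2.0], "var2": [3.0, 4.0]}),
    pd.DataFrame({"var1": [5.0, 6.0], "var2": [7.0, 8.0]}),
]
mapping = {"var1": "Foo", "var2": "Hoo"}

# With inplace=True the call mutates its target and returns None.
scratch = dfs[0].copy()
in_place_result = scratch.rename(columns=mapping, inplace=True)

# Without inplace, rename returns a new DataFrame that can be collected.
renamed_dfs = [df.rename(columns=mapping) for df in dfs]

print(in_place_result)
print([list(df.columns) for df in renamed_dfs])
```

The originals in dfs keep their old names here, which is usually what you want when building a separate list.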

Related

pandas: using a list of dfs, perform some action on all dfs and list df names

I made a list of a set of dfs:
dflist = [df1, df2, df3, df4]
I want to loop through all the dfs in the list, print the df name, and print the first 2 lines.
In unix it is simple to do what I want:
cat dflist | while read i; do echo $i && head -2 $i; done
Which returns:
US_distribution_of_soybean_aphid_2022.csv
State,Status of Aphis glycines in 2022,Source,
Connecticut,Present,Rutledge (2004),
But in pandas,
for i in dflist:
    print('i')
    print(i.head(2))
returns literal i followed by the desired head(2) results.
i
Year Analyte Class SortOrder PercentAcresTreated \
0 1991 Methomyl carbamate 2840 0.00125
1 1991 Methyl parathion organophosphate 2900 0.01000
Using:
for i in dflist:
    print(i)
Prints each df in its entirety.
Very frustrating for a newbie trying to understand the Python equivalents of commands I use every day. I'm currently working in a Jupyter notebook, if that matters.
Before proceeding, give these answers a read. You can attach a name to each dataframe after you create it by setting df.name, and then print it with print(df.name). Just be careful not to have a column called name, because df.name would then refer to that column. The attribute does not have to be called name; df.dataframe_name works as well, as long as it does not clash with one of your column names:
test = pd.DataFrame({"Column":[1,2,3,4,5]})
test.name = "Test"
new_test = pd.DataFrame({"Column":[1,2,3,4,5]})
new_test.name = "New Test"
third_test = pd.DataFrame({"Column":[1,2,3,4,5]})
third_test.name = "Third Test"
last_test = pd.DataFrame({"Column":[1,2,3,4,5]})
last_test.name = "Last Test"
dflist = [test, new_test, third_test, last_test]
for i in dflist:
    print(i.name)
    print(i.head(2))
Output:
Test
Column
0 1
1 2
New Test
Column
0 1
1 2
Third Test
Column
0 1
1 2
Last Test
Column
0 1
1 2
You must use a container. Accessing names programmatically is highly discouraged.
For instance, a list:
dfs = [df1, df2, df3]
for df in dfs:
    print(df.head())
Or, if you need "names", a dictionary:
dfs = {'start': df1, 'middle': df2, 'end': df3}
for name, df in dfs.items():
    print(name)
    print(df.head(2))
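A runnable version of the dictionary approach might look like the sketch below. The keys and the tiny tables are invented for illustration; in practice the keys could be the file names the dataframes were read from.

```python
import pandas as pd

# Hypothetical names and made-up contents, purely for illustration.
frames = {
    "first_table": pd.DataFrame({"Year": [1991, 1991, 1992], "Value": [0.1, 0.2, 0.3]}),
    "second_table": pd.DataFrame({"State": ["X", "Y", "Z"], "Status": ["a", "b", "c"]}),
}

# Keep the name next to its two-row preview, like `echo $i && head -2 $i`.
previews = {name: df.head(2) for name, df in frames.items()}

for name, preview in previews.items():
    print(name)
    print(preview)
```

Because the name lives in the dictionary key rather than on the DataFrame, it survives operations that return new DataFrames.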

Rename variously formatted column headers in pandas

I'm working on a small tool that does some calculations on a dataframe, let's say something like this:
df['column_c'] = df['column_a'] + df['column_b']
For this to work, the dataframe needs to have the columns 'column_a' and 'column_b'. I would like this code to work even if the columns are named slightly differently in the import file (csv or xlsx), for example 'columnA', 'Col_a', etc.
The easiest way would be renaming the columns inside the imported file, but let's assume this is not possible. Therefore I would like to do something like this:
if column name is in list ['columnA', 'Col_A', 'col_a', 'a'... ] rename it to 'column_a'
I was thinking about having a dictionary with possible column names, when a column name would be in this dictionary it will be renamed to 'column_a'. An additional complication would be the fact that the columns can be in arbitrary order.
How would one solve this problem?
I recommend you formulate the conversion logic and write a function accordingly:
lst = ['columnA', 'Col_A', 'col_a', 'a']
def converter(x):
    return 'column_'+x[-1].lower()
res = list(map(converter, lst))
['column_a', 'column_a', 'column_a', 'column_a']
You can then use this directly in pd.DataFrame.rename:
df = df.rename(columns=converter)
Example usage:
df = pd.DataFrame(columns=['columnA', 'col_B', 'c'])
df = df.rename(columns=converter)
print(df.columns)
Index(['column_a', 'column_b', 'column_c'], dtype='object')
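Simply (a pandas Index is immutable, so edit a list copy of the names and assign it back):
new_columns = list(df.columns)
for index, column_name in enumerate(new_columns):
    if column_name in ['columnA', 'Col_A', 'col_a' ]:
        new_columns[index] = 'column_a'
df.columns = new_columns
With a dictionary:
dico = {'column_a':['columnA', 'Col_A', 'col_a' ], 'column_b':['columnB', 'Col_B', 'col_b' ]}
new_columns = list(df.columns)
for index, column_name in enumerate(new_columns):
    for name, ex_names in dico.items():
        if column_name in ex_names:
            new_columns[index] = name
df.columns = new_columns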
This should solve it:
df=pd.DataFrame({'colA':[1,2], 'columnB':[3,4]})
def rename_df(col):
    if col in ['columnA', 'Col_A', 'colA' ]:
        return 'column_a'
    if col in ['columnB', 'Col_B', 'colB' ]:
        return 'column_b'
    return col
df = df.rename(rename_df, axis=1)
If you have lists of the alternative names, such as list_othername_A or list_othername_B, you can do:
for col_name in df.columns:
    if col_name in list_othername_A:
        df = df.rename(columns = {col_name : 'column_a'})
    elif col_name in list_othername_B:
        df = df.rename(columns = {col_name : 'column_b'})
    # ... further elif branches for any other target names
EDIT: using the dictionary from @djangoliv's answer, you can make it even shorter:
dico = {'column_a':['columnA', 'Col_A', 'col_a' ], 'column_b':['columnB', 'Col_B', 'col_b' ]}
#create a dict to rename, kind of reverse dico:
dict_rename = {col:key for key in dico.keys() for col in dico[key]}
# then just rename:
df = df.rename(columns = dict_rename )
Note that this method does not work if in df you have two columns 'columnA' and 'Col_A' but otherwise, it should work as rename does not care if any key in dict_rename is not in df.columns.
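Putting the inverted-dictionary idea together with the caveat about clashing columns, a sketch could look like this. The helper name normalize_columns and the extra aliases 'a' and 'b' are assumptions made for the example, not something from the answers:

```python
import pandas as pd

# Canonical name -> accepted spellings (the alias lists are illustrative).
aliases = {
    "column_a": ["columnA", "Col_A", "col_a", "a"],
    "column_b": ["columnB", "Col_B", "col_b", "b"],
}

# Invert to alias -> canonical name, the shape that rename expects.
rename_map = {alias: canonical for canonical, names in aliases.items() for alias in names}


def normalize_columns(df):
    """Hypothetical helper: rename known aliases, fail loudly on clashes."""
    renamed = df.rename(columns=rename_map)
    if renamed.columns.duplicated().any():
        raise ValueError("several input columns map to the same canonical name")
    return renamed


# Column order in the input does not matter.
df = pd.DataFrame({"Col_B": [3, 4], "columnA": [1, 2]})
df = normalize_columns(df)
df["column_c"] = df["column_a"] + df["column_b"]
print(df)
```

Keys of rename_map that are absent from the DataFrame are simply ignored by rename, so one mapping can serve many differently labelled input files.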

unpack dictionary entries in pandas into dataframe

I have a dataframe where one of the columns has a dictionary in it
import pandas as pd
import numpy as np
def generate_dict():
return {'var1': np.random.rand(), 'var2': np.random.rand()}
data = {}
data[0] = {}
data[1] = {}
data[0]['A'] = generate_dict()
data[1]['A'] = generate_dict()
df = pd.DataFrame.from_dict(data, orient='index')
I would like to unpack the key/value pairs in the dictionary into a new dataframe, where each entry has its own row. I can do that by iterating over the rows and appending to a new DataFrame:
def expand_row(row):
    df_t = pd.DataFrame.from_dict({'value': row.A})
    df_t.index.rename('row', inplace=True)
    df_t.reset_index(inplace=True)
    df_t['column'] = 'A'
    return df_t
df_expanded = pd.DataFrame([])
for _, row in df.iterrows():
    T = expand_row(row)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement there
    df_expanded = df_expanded.append(T, ignore_index=True)
This is rather slow, and my application is performance critical. I think this is possible with df.apply. However, as my function returns a DataFrame instead of a Series, simply doing
df_expanded = df.apply(expand_row)
doesn't quite work. What would be the most performant way to do this?
Thanks in advance.
You can use nested list comprehension and then replace column 0 with constant A (column name):
d = df.A.to_dict()
df1 = pd.DataFrame([(key,key1,val1) for key,val in d.items() for key1,val1 in val.items()])
df1[0] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.013872
1 A var2 0.192230
2 A var1 0.176413
3 A var2 0.253600
Another solution:
df1 = pd.DataFrame.from_records(df.A.values.tolist()).stack().reset_index()
df1['level_0'] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.332594
1 A var2 0.118967
2 A var1 0.374482
3 A var2 0.263910
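The same idea as the first answer can be written so that the frame is built in a single constructor call instead of growing row by row. This is only a sketch with fixed, made-up values in place of the random ones, and it uses the question's column names ('column', 'row', 'value'):

```python
import pandas as pd

# Made-up fixed values instead of np.random.rand() so the result is repeatable.
df = pd.DataFrame({"A": [{"var1": 0.1, "var2": 0.2}, {"var1": 0.3, "var2": 0.4}]})

# Flatten every dictionary cell into one record per key/value pair.
records = [
    {"column": "A", "row": key, "value": value}
    for cell in df["A"]
    for key, value in cell.items()
]

# A single constructor call avoids repeatedly copying a growing DataFrame.
df_expanded = pd.DataFrame(records)
print(df_expanded)
```

Whether this is fast enough depends on the data; it is worth timing against the stack-based variant on a realistic sample.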

Changing multiple column names but not all of them - Pandas Python

I would like to know if there is a function to change specific column names but without selecting a specific name or without changing all of them.
I have the code:
df=df.rename(columns = {'nameofacolumn':'newname'})
But with it I have to rename each column manually, writing out each name.
Also, to change all of them I have
df.columns = ['name1','name2','etc']
I would like to have a function to change columns 1 and 3 without writing their names just stating their location.
Say you have a dictionary of the new column names and the names of the columns they should replace:
df.rename(columns={'old_col':'new_col', 'old_col_2':'new_col_2'}, inplace=True)
But, if you don't have that, and you only have the indices, you can do this:
column_indices = [1,4,5,6]
new_names = ['a','b','c','d']
old_names = df.columns[column_indices]
df.rename(columns=dict(zip(old_names, new_names)), inplace=True)
You can use a dict comprehension and pass this to rename:
In [246]:
df = pd.DataFrame(columns=list('abc'))
new_cols=['d','e']
df.rename(columns=dict(zip(df.columns[1:], new_cols)),inplace=True)
df
Out[246]:
Empty DataFrame
Columns: [a, d, e]
Index: []
It also works if you pass a list of ordinal positions:
df.rename(columns=dict(zip(df.columns[[1,2]], new_cols)),inplace=True)
You don't need to use rename method at all.
You simply replace the old column names with new ones using lists. To rename columns 1 and 3 (with index 0 and 2), you do something like this:
df.columns.values[[0, 2]] = ['newname0', 'newname2']
or possibly if you are using older version of pandas than 0.16.0, you do:
df.keys().values[[0, 2]] = ['newname0', 'newname2']
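The advantage of this approach is that you don't need to copy the whole dataframe with the df = df.rename syntax; you just change the index values. Be aware, though, that this writes into the Index's underlying array behind pandas' back, so label lookups may not see the new names; assigning a new list to df.columns is the safer route.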
You should be able to reference the columns by position using df.columns[index]:
>> temp = pd.DataFrame(np.random.randn(10, 5),columns=['a', 'b', 'c', 'd', 'e'])
>> print(temp.columns[0])
a
>> print(temp.columns[1])
b
So to change the names of specific columns, first copy the names into a list and change only the ones you want
>> newcolumns = temp.columns.tolist()
>> newcolumns[0] = 'New_a'
Assign the new array back to the columns and you'll have what you need
>> temp.columns = newcolumns
>> temp.columns
>> print(temp.columns[0])
New_a
If you have a dict of {position: new_name}, you can use items(),
e.g.,
new_columns = {3: 'fourth_column'}
df.rename(columns={df.columns[i]: new_col for i, new_col in new_columns.items()})
Full example:
$ ipython
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import numpy as np
...: import pandas as pd
...:
...: rng = np.random.default_rng(seed=0)
...: df = pd.DataFrame({key: rng.uniform(size=3) for key in list('abcde')})
...: df
Out[1]:
a b c d e
0 0.636962 0.016528 0.606636 0.935072 0.857404
1 0.269787 0.813270 0.729497 0.815854 0.033586
2 0.040974 0.912756 0.543625 0.002739 0.729655
In [2]: new_columns = {3: 'fourth_column'}
...: df.rename(columns={df.columns[i]: new_col for i, new_col in new_columns.items()})
Out[2]:
a b c fourth_column e
0 0.636962 0.016528 0.606636 0.935072 0.857404
1 0.269787 0.813270 0.729497 0.815854 0.033586
2 0.040974 0.912756 0.543625 0.002739 0.729655
In [3]:
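The position-based answers above can be wrapped in a small helper. The name rename_by_position is made up for this sketch; it only combines df.columns[i] with rename as the answers do:

```python
import pandas as pd


def rename_by_position(df, new_names):
    """Hypothetical helper: new_names maps 0-based positions to new labels."""
    mapping = {df.columns[pos]: name for pos, name in new_names.items()}
    # rename returns a new DataFrame; the input keeps its old labels.
    return df.rename(columns=mapping)


df = pd.DataFrame([[1, 2, 3, 4]], columns=list("abcd"))

# Rename "columns 1 and 3" (positions 0 and 2) without spelling out old names.
renamed = rename_by_position(df, {0: "first", 2: "third"})
print(list(renamed.columns))
```

One caveat: because the final step goes through labels, a DataFrame with duplicate column names would have every column sharing that label renamed. Building a full list of names and assigning it to df.columns avoids that.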

python pandas: rename single column label in multi-index dataframe

I have a df that looks like this:
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([['1','2'],['A','B']])
print(df)
1 2
A B A B
0 0.030626 0.494912 0.364742 0.320088
1 0.178368 0.857469 0.628677 0.705226
2 0.886296 0.833130 0.495135 0.246427
3 0.391352 0.128498 0.162211 0.011254
How can I rename column '1' and '2' as 'One' and 'Two'?
I thought df.rename() would help, but it doesn't. How can I do this?
That is indeed something missing in rename (ideally it should let you specify the level).
Another way is by setting the levels of the columns index, but then you need to know all values for that level:
In [41]: df.columns.levels[0]
Out[41]: Index([u'1', u'2'], dtype='object')
In [43]: df.columns = df.columns.set_levels(['one', 'two'], level=0)
In [44]: df
Out[44]:
one two
A B A B
0 0.899686 0.466577 0.867268 0.064329
1 0.162480 0.455039 0.736870 0.759595
2 0.620960 0.922119 0.060141 0.669997
3 0.871107 0.043799 0.080080 0.577421
In [45]: df.columns.levels[0]
Out[45]: Index([u'one', u'two'], dtype='object')
As of pandas 0.22.0 (and probably much earlier), you can specify the level:
df = df.rename(columns={'1': 'one', '2': 'two'}, level=0)
or, alternatively (new notation since pandas 0.21.0):
df = df.rename({'1': 'one', '2': 'two'}, axis='columns', level=0)
But actually, it works even when omitting the level:
df = df.rename(columns={'1': 'one', '2': 'two'})
In that case, all column levels are checked for occurrences to be renamed.
Use set_levels:
>>> df.columns = df.columns.set_levels(['one','two'], level=0)
>>> print(df)
one two
A B A B
0 0.731851 0.489611 0.636441 0.774818
1 0.996034 0.298914 0.377097 0.404644
2 0.217106 0.808459 0.588594 0.009408
3 0.851270 0.799914 0.328863 0.009914
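On older pandas versions, either of the following worked in place; recent releases no longer accept these forms:
df.columns.set_levels(['one', 'two'], level=0, inplace=True)
df.rename_axis({'1':'one', '2':'two'}, axis='columns', inplace=True)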
This is a good question. Combining the answers above, you can write a function:
def rename_col(df, columns, level=0):
    def rename_apply(x, rename_dict):
        try:
            return rename_dict[x]
        except KeyError:
            return x
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.set_levels([rename_apply(x, rename_dict=columns) for x in df.columns.levels[level]], level=level)
    else:
        df.columns = [rename_apply(x, rename_dict=columns) for x in df.columns]
    return df
It worked for me.
Ideally, a functionality like this should be integrated into the "official" "rename" function in the future, so you don't need to write a hack like this.
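For pandas versions where rename accepts a level argument (as one of the answers above notes), the helper is not needed. A short sketch with deterministic made-up values instead of random ones:

```python
import numpy as np
import pandas as pd

# Same column layout as the question, with fixed values for repeatability.
df = pd.DataFrame(
    np.arange(8).reshape(2, 4),
    columns=pd.MultiIndex.from_product([["1", "2"], ["A", "B"]]),
)

# Restrict the mapping to the outer level; inner labels are left alone.
renamed = df.rename(columns={"1": "One", "2": "Two"}, level=0)

print(renamed.columns.get_level_values(0).tolist())
print(renamed.columns.get_level_values(1).tolist())
```

Passing level explicitly guards against accidentally renaming a label on another level that happens to match a key in the mapping.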
