I've got a dataframe called data. How would I rename just one column header? For example, gdp to log(gdp)?
data =
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
data.rename(columns={'gdp':'log(gdp)'}, inplace=True)
The rename documentation shows that it accepts a dict for the columns parameter, so you just pass a dict with a single entry.
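As a minimal sketch of the single-entry dict approach, assuming a small frame that mirrors the first rows of the example data: without inplace=True, rename returns a new frame and leaves the original untouched.

```python
import pandas as pd

# Small frame mirroring the first rows of the example data
data = pd.DataFrame({'y': [1, 2, 8], 'gdp': [2, 3, 7], 'cap': [5, 9, 2]})

# Without inplace=True, rename returns a new DataFrame
renamed = data.rename(columns={'gdp': 'log(gdp)'})

print(list(renamed.columns))  # ['y', 'log(gdp)', 'cap']
print(list(data.columns))     # ['y', 'gdp', 'cap']
```

Keys that do not match any column are silently ignored by default, so a typo in the old name leaves the frame unchanged rather than raising.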
A faster alternative for renaming a single column is a list comprehension (see the timings below).
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
If the need arises to rename multiple columns, either use conditional expressions like:
df.columns = ['log(gdp)' if x=='gdp' else 'cap_mod' if x=='cap' else x for x in df.columns]
Or construct a mapping dictionary and use its get method in the list comprehension, with the old name as the default value:
col_dict = {'gdp': 'log(gdp)', 'cap': 'cap_mod'} ## key→old name, value→new name
df.columns = [col_dict.get(x, x) for x in df.columns]
Timings:
%%timeit
df.rename(columns={'gdp':'log(gdp)'}, inplace=True)
10000 loops, best of 3: 168 µs per loop
%%timeit
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
10000 loops, best of 3: 58.5 µs per loop
How do I rename a specific column in pandas?
From v0.24+, to rename one (or more) columns at a time,
DataFrame.rename() with axis=1 or axis='columns' (the axis argument was introduced in v0.21).
Index.str.replace() for string/regex based replacement.
If you need to rename ALL columns at once,
DataFrame.set_axis() method with axis=1. Pass a list-like sequence. Options are available for in-place modification as well.
rename with axis=1
df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))
df
y gdp cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
With 0.21+, you can now specify an axis parameter with rename:
df.rename({'gdp':'log(gdp)'}, axis=1)
# df.rename({'gdp':'log(gdp)'}, axis='columns')
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
(Note that rename is not in-place by default, so you will need to assign the result back.)
This addition has been made to improve consistency with the rest of the API. The new axis argument is analogous to the columns parameter—they do the same thing.
df.rename(columns={'gdp': 'log(gdp)'})
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
rename also accepts a callback that is called once for each column.
df.rename(lambda x: x[0], axis=1)
# df.rename(lambda x: x[0], axis='columns')
y g c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For this specific scenario, you would want to use
df.rename(lambda x: 'log(gdp)' if x == 'gdp' else x, axis=1)
Index.str.replace
Similar to the replace method of Python strings, pandas Index and Series (object dtype only) define a ("vectorized") str.replace method for string and regex-based replacement.
df.columns = df.columns.str.replace('gdp', 'log(gdp)')
df
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
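The advantage of this over the other methods is that str.replace supports regex. Regex matching was enabled by default in older pandas releases; since pandas 2.0 the default is regex=False, so pass regex=True explicitly when the pattern is a regular expression. See the docs for more information.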
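To illustrate the difference between literal and regex replacement, here is a small sketch. The gdp_per_cap column is a hypothetical addition, included only to show that a literal substring replacement also touches other columns containing 'gdp'.

```python
import pandas as pd

# 'gdp_per_cap' is a made-up column used only for illustration
df = pd.DataFrame('x', columns=['y', 'gdp', 'gdp_per_cap'], index=range(2))

# Literal replacement touches every column name containing the substring
literal = df.columns.str.replace('gdp', 'log(gdp)', regex=False)

# An anchored regex renames only the column that is exactly 'gdp'
anchored = df.columns.str.replace(r'^gdp$', 'log(gdp)', regex=True)

print(list(literal))   # ['y', 'log(gdp)', 'log(gdp)_per_cap']
print(list(anchored))  # ['y', 'log(gdp)', 'gdp_per_cap']
```

Passing regex explicitly keeps the behaviour the same across pandas versions with different defaults.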
Passing a list to set_axis with axis=1
Call set_axis with a list of headers. The list must be equal in length to the columns/index size. In pandas 0.21 through 0.25, set_axis mutated the original DataFrame by default; from 1.0 it returns a modified copy by default. The inplace argument was deprecated in 1.5 and removed in 2.0, so on current versions omit it and assign the result back.
df.set_axis(['cap', 'log(gdp)', 'y'], axis=1)
# df.set_axis(['cap', 'log(gdp)', 'y'], axis='columns')
# On pandas < 1.0, add inplace=False to get a copy instead of mutating df.
cap log(gdp) y
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
Method Chaining
Why choose set_axis when we already have an efficient way of assigning columns with df.columns = ...? As shown by Ted Petrou in this answer, set_axis is useful when chaining methods.
Compare
# new for pandas 0.21+
(df.some_method1()
   .some_method2()
   .set_axis(columns, axis=1)
   .some_method3())
Versus
# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()
The former is a more natural, free-flowing syntax.
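A concrete version of such a chain might look like the sketch below. It assumes pandas 1.0 or later, where set_axis returns a new frame by default; the sample values and the surrounding sort_values and reset_index steps are placeholders standing in for some_method1 and some_method3.

```python
import pandas as pd

df = pd.DataFrame({'y': [3, 1, 2], 'gdp': [7.0, 2.0, 3.0], 'cap': [2, 5, 9]})

new_names = ['y', 'log(gdp)', 'cap']  # one name per existing column, in order

result = (
    df.sort_values('y')
      .set_axis(new_names, axis=1)   # rename all columns mid-chain
      .reset_index(drop=True)
)

print(list(result.columns))  # ['y', 'log(gdp)', 'cap']
```

The original df keeps its column names, since every step in the chain returns a new object.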
There are at least five different ways to rename specific columns in pandas, and I have listed them below along with links to the original answers. I also timed these methods and found them to perform about the same (though YMMV depending on your data set and scenario). The test case below is to rename columns A M N Z to A2 M2 N2 Z2 in a dataframe with columns A to Z containing a million rows.
# Import required modules
import numpy as np
import pandas as pd
import timeit
# Create sample data
df = pd.DataFrame(np.random.randint(0,9999,size=(1000000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
# Standard way - https://stackoverflow.com/a/19758398/452587
def method_1():
    df_renamed = df.rename(columns={'A': 'A2', 'M': 'M2', 'N': 'N2', 'Z': 'Z2'})
# Lambda function - https://stackoverflow.com/a/16770353/452587
def method_2():
    df_renamed = df.rename(columns=lambda x: x + '2' if x in ['A', 'M', 'N', 'Z'] else x)
# Mapping function - https://stackoverflow.com/a/19758398/452587
def rename_some(x):
    if x=='A' or x=='M' or x=='N' or x=='Z':
        return x + '2'
    return x
def method_3():
    df_renamed = df.rename(columns=rename_some)
# Dictionary comprehension - https://stackoverflow.com/a/58143182/452587
def method_4():
    df_renamed = df.rename(columns={col: col + '2' for col in df.columns[
        np.asarray([i for i, col in enumerate(df.columns) if 'A' in col or 'M' in col or 'N' in col or 'Z' in col])
    ]})
# Dictionary comprehension - https://stackoverflow.com/a/38101084/452587
def method_5():
    df_renamed = df.rename(columns=dict(zip(df[['A', 'M', 'N', 'Z']], ['A2', 'M2', 'N2', 'Z2'])))
print('Method 1:', timeit.timeit(method_1, number=10))
print('Method 2:', timeit.timeit(method_2, number=10))
print('Method 3:', timeit.timeit(method_3, number=10))
print('Method 4:', timeit.timeit(method_4, number=10))
print('Method 5:', timeit.timeit(method_5, number=10))
Output:
Method 1: 3.650640267
Method 2: 3.163998427
Method 3: 2.998530871
Method 4: 2.9918436889999995
Method 5: 3.2436501520000007
Use the method that is most intuitive to you and easiest for you to implement in your application.
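Before choosing on style alone, it can help to confirm that the variants really produce the same headers. The sketch below checks three of them on a tiny frame with the same A to Z layout; the names small and targets are illustrative, and no timing claim is made.

```python
import numpy as np
import pandas as pd

# Tiny frame with the same column layout as the benchmark (A to Z)
small = pd.DataFrame(np.zeros((2, 26), dtype=int),
                     columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
targets = ['A', 'M', 'N', 'Z']

# Explicit mapping, callable, and zipped mapping
by_dict = small.rename(columns={c: c + '2' for c in targets})
by_lambda = small.rename(columns=lambda c: c + '2' if c in targets else c)
by_zip = small.rename(columns=dict(zip(targets, [c + '2' for c in targets])))

same = list(by_dict.columns) == list(by_lambda.columns) == list(by_zip.columns)
print(same)  # True
```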
Use the pandas.DataFrame.rename function; see its documentation for a description.
data.rename(columns = {'gdp': 'log(gdp)'}, inplace = True)
If you intend to rename multiple columns then
data.rename(columns = {'gdp': 'log(gdp)', 'cap': 'log(cap)', ..}, inplace = True)
df.rename(columns=lambda x: {"My_sample": "My_sample_new_name"}.get(x, x))
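We can also rename by rebuilding the table:
df = pd.DataFrame()
column_names = mydataframe.columns
# Iterate over the columns (not the rows) and copy each one under its new name
for i in range(len(column_names)):
    column = mydataframe.iloc[:, i]
    # Drop the last 8 characters of the old name and append the new text
    df[column_names[i][:-8] + "desired_text"] = column
print(df.columns)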
I discovered method chaining in pandas only very recently. I love how it makes the code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.
For example, let's say this is my DataFrame:
df = pd.DataFrame({
'num_1': [np.nan, 2., 2., 3., 1.],
'num_2': [9., 6., np.nan, 5., 7.],
'str_1': ['a', 'b', 'c', 'a', 'd'],
'str_2': ['C', 'B', 'B', 'D', 'A'],
})
And I have some manipulation I want to do on it:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})
My question is - what is the most pandas-y way / best practice to achieve all of the above with method chaining?
I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df
# to_rep is assumed to be the same replacement mapping used above
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}
df = (df
      .pipe(foo_numbers)
      .assign(str_2=df['str_2'].str.lower())
      .replace({'str_1': to_rep, 'str_2': to_rep})
)
which produces the same output. My problems with this are:
The pipe seems to just hide the handling of the numeric columns from the main chain, but the implementation inside hasn't improved at all.
The .replace requires me to manually name all the columns one by one. What if I have more than just two columns? (You can assume I want to apply the same replacement to all columns).
The .assign is OK, but I was hoping there is a way to pass str.lower as a callable to be applied to that one column, but I couldn't make it work.
So what's the correct way to approach this kind of change to a DataFrame, using method chaining?
I would do it this way with the help of DataFrame.select_dtypes and pandas.concat:
import numpy as np
df = (
pd.concat(
[df.select_dtypes(np.number)
.fillna(0)
.astype(int)
.mul(2),
df.select_dtypes('object')
.apply(lambda s: s.str.lower())
.replace({'a':'z', 'b':'y', 'c':'x'})], axis=1)
)
Output :
print(df)
num_1 num_2 str_1 str_2
0 0 18 z x
1 4 12 y y
2 4 0 x y
3 6 10 z d
4 2 14 d z
You already have a good approach, except for the fact that you mutate the input. Either make a copy, or chain operations:
def foo_numbers(df):
    df = df.copy()
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0, downcast='infer').mul(2)
    return df
Or:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    return (df[numeric_cols].fillna(0, downcast='infer').mul(2)
            .combine_first(df)
           )[df.columns]
Here are some more examples.
Using assign:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
(df.assign(**{c: lambda d, c=c: d[c].fillna(0, downcast='infer').mul(2)
for c in numeric_cols})
.assign(**{c: lambda d, c=c: d[c].str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'})
for c in str_cols})
)
Using apply:
def foo(s):
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(0, downcast='infer').mul(2)
    elif s.dtype == object:
        return s.str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'})
    return s
df.apply(foo)
Using pipe:
def foo(df):
    df = df.copy()
    df.update(df.select_dtypes('number')
                .fillna(0, downcast='infer').mul(2))
    df.update(df.select_dtypes(object)
                .apply(lambda s: s.str.lower().replace({'a': 'z', 'b':'y', 'c': 'x'}))
              )
    return df
df.pipe(foo)
One option, with the method chaining:
(df
.loc(axis=1)[numeric_cols]
.fillna(0,downcast='infer')
.mul(2)
.assign(**df.loc(axis=1)[str_cols]
.transform(lambda f: f.str.lower())
.replace({'a':'z', 'b':'y','c':'x'}))
)
num_1 num_2 str_1 str_2
0 0 18 z x
1 4 12 y y
2 4 0 x y
3 6 10 z d
4 2 14 d z
Another option, using pyjanitor's transform_columns:
(df.transform_columns(numeric_cols,
lambda f: f.fillna(0,downcast='infer').mul(2),
elementwise=False)
.transform_columns(str_cols, str.lower)
.replace({'a':'z', 'b':'y','c':'x'})
)
num_1 num_2 str_1 str_2
0 0 18 z x
1 4 12 y y
2 4 0 x y
3 6 10 z d
4 2 14 d z
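For comparison, here is one self-contained sketch that avoids both mutation of the input and third-party packages. The helper on_columns is hypothetical (not part of pandas); it applies a function to a subset of columns and returns a copy, so it can be passed to pipe. The callable given to assign receives the frame as it exists at that point in the chain.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

def on_columns(frame, cols, func):
    """Hypothetical helper: return a copy with func applied to frame[cols]."""
    out = frame.copy()
    out[cols] = func(frame[cols])
    return out

result = (
    df.pipe(on_columns, numeric_cols, lambda d: d.fillna(0).astype(int).mul(2))
      .assign(str_2=lambda d: d['str_2'].str.lower())  # callable sees the chained frame
      .pipe(on_columns, str_cols, lambda d: d.replace(to_rep))
)
print(result)
```

Because on_columns copies before assigning, the original df is left unchanged, which addresses the mutation concern raised above.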
I am trying to run a groupby on multiple columns with an aggregation column and an aggregation operator.
All of these are passed to my method as parameters, and I have to do the groupby:
result = df.groupby([groupByColumns])[aggColumn].agg(aggOperation)
Here
groupByColumns: clientId,state,branchId
aggColumn: amount
aggOperation: sum
But I am getting this error
KeyError: ''
I am not experienced with pandas. How can I correct the statement above?
If groupByColumns is already a list, remove [] in groupby:
groupByColumns = ['clientId', 'state', 'branchId']
aggColumn = 'amount'
aggOperation = 'sum'  # the string name is the idiomatic way to request a built-in aggregation
out = df.groupby(groupByColumns)[aggColumn].agg(aggOperation)
# OR
out = df.groupby(['clientId', 'state', 'branchId'])['amount'].sum()
print(out)
# Output
clientId state branchId
A M X 3
N Y 3
B M X 9
N Y 6
Name: amount, dtype: int64
Setup:
df = pd.DataFrame({'clientId': list('AAABBB'),
'state': list('MMNMMN'),
'branchId': list('XXYXXY'),
'amount': range(1, 7)})
print(df)
# Output
clientId state branchId amount
0 A M X 1
1 A M X 2
2 A N Y 3
3 B M X 4
4 B M X 5
5 B N Y 6
groupby expects a flat (1D) list of column names. In your case groupByColumns is already ['clientId', 'state', 'branchId'], and wrapping it in another pair of brackets turns it into a nested list of length 1. This is what is effectively being executed:
df.groupby([['clientId', 'state', 'branchId']])['amount'].sum()
Solution
As answered by @Corralien, use the same command without the extra brackets, so that groupby receives a 1D list, and it should work.
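Since the grouping columns, aggregation column, and operation all arrive as parameters, a small wrapper can guard against accidental nesting. This is only a sketch: the function name aggregate and its argument names are hypothetical, and the data is the setup frame from the answer above.

```python
import pandas as pd

df = pd.DataFrame({'clientId': list('AAABBB'),
                   'state': list('MMNMMN'),
                   'branchId': list('XXYXXY'),
                   'amount': range(1, 7)})

def aggregate(frame, group_cols, agg_col, agg_op):
    """Hypothetical wrapper: group_cols may be one name or a list of names."""
    # Normalise to a flat list so callers cannot accidentally nest it
    if isinstance(group_cols, str):
        group_cols = [group_cols]
    return frame.groupby(list(group_cols))[agg_col].agg(agg_op)

out = aggregate(df, ['clientId', 'state', 'branchId'], 'amount', 'sum')
print(out)
```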
I want to slice multiple columns that are located several columns away from each other. I'm trying to write this concisely, without repeating the same code:
df (See below for example) where columns are from A to H, with many rows containing some data (x).
How do I slice multiple irregularly spaced columns, say A, D, E, G, in a minimal amount of code? I don't want to repeat loc calls (df.loc['A'], df.loc['C:E'], df.loc['G']).
Can I generate a list and loop through it or is there a shorter/quicker way?
Ultimately my goal would be to drop the selected columns from the main DataFrame.
A B C D E F G H
0 x x x x x x x x
1 x x x x x x x x
2 x x x x x x x x
3 x x x x x x x x
4 x x x x x x x x
You might harness .iloc method to get columns by their position rather than name, for example:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9],'D':[10,11,12],'E':[13,14,15]})
df2 = df.iloc[:, [0,2,4]]
print(df2)
output:
A C E
0 1 7 13
1 2 8 14
2 3 9 15
If you just need x random columns from a df that has y columns, you can use random.sample. For example, to pick 3 columns out of 5:
import random
cols = sorted(random.sample(range(0,5),k=3))
This gives cols, a sorted list of three column positions (thanks to sorted, the original column order is preserved).
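For the specific columns named in the question (A, D, E, G), a short sketch of three related options follows: selection by label, selection by position with NumPy's np.r_ helper (which concatenates single positions and slices into one index array), and dropping the same columns from the main DataFrame. The frame below is an assumed stand-in for the one in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame('x', columns=list('ABCDEFGH'), index=range(3))

# By label: a plain list of names selects non-adjacent columns in one step
by_name = df[['A', 'D', 'E', 'G']]

# By position: np.r_ merges single positions and slices into one index array
by_pos = df.iloc[:, np.r_[0, 3:5, 6]]

# Dropping the same columns from the main DataFrame
remaining = df.drop(columns=['A', 'D', 'E', 'G'])

print(list(by_pos.columns))     # ['A', 'D', 'E', 'G']
print(list(remaining.columns))  # ['B', 'C', 'F', 'H']
```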
I have the following example and I cannot understand why it doesn't work.
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def balh(a, b):
    z = a + b
    if z.any() > 1:
        return z + 1
    else:
        return z
df['col3'] = balh(df.col1, df.col2)
Output: col3 contains 4 and 6.
My expected output in col3 would be 5 and 7 rather than 4 and 6, since 4 and 6 are greater than 1 and my intention is to add 1 if a + b is greater than 1.
The any method evaluates whether any element of the pandas.Series or pandas.DataFrame is True. A non-zero integer is evaluated as True. So with if z.any() > 1 you are comparing the True returned by the method with the integer 1, and True > 1 is False.
You need to apply the condition directly to the pandas.Series, which returns a boolean pandas.Series on which you can safely call the any method.
This will be the same for the all method.
def balh(a, b):
    z = a + b
    if (z > 1).any():
        return z + 1
    else:
        return z
As @arhr clearly explained, the issue was the incorrect call to z.any(), which returns True when there is at least one non-zero element in z. It resulted in True > 1, which evaluates to False.
A one line alternative to avoid the if statement and the custom function call would be the following:
df['col3'] = df.iloc[:, :2].sum(1).transform(lambda x: x + int(x > 1))
This gets the first two columns in the dataframe then sums the elements along each row and transforms the new column according to the lambda function.
The iloc can also be omitted because the dataframe is instantiated with only two columns col1 and col2, thus the line can be refactored to:
df['col3'] = df.sum(1).transform(lambda x: x + int(x > 1))
Example output:
col1 col2 col3
0 1 3 5
1 2 4 7
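If the intent is to add 1 row by row (only where that row's sum exceeds 1), a vectorised sketch with numpy.where is another option. Note that this differs from the (z > 1).any() version, which adds 1 to every row as soon as any single row qualifies. The extra first row below is an assumption added to make the difference visible.

```python
import numpy as np
import pandas as pd

# First row sums to 1, so it should be left unchanged
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [1, 3, 4]})

z = df['col1'] + df['col2']  # row sums: 1, 4, 6

# Add 1 only where the row sum exceeds 1; keep the other rows as they are
df['col3'] = np.where(z > 1, z + 1, z)

print(df['col3'].tolist())  # [1, 5, 7]
```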
How do I assign columns in my dataframe to be equal to another column if/where a condition is met?
Update
The problem
I need to assign many columns values (and sometimes a value from another column in that row) when the condition is met.
The condition is not the problem.
I need an efficient way to do this:
df.loc[some condition it doesn't matter,
['a','b','c','d','e','f','g','x','y']]=df['z'],1,3,4,5,6,7,8,df['p']
Simplified example data
d = {'var' : pd.Series([10,61]),
'c' : pd.Series([100,0]),
'z' : pd.Series(['x','x']),
'y' : pd.Series([None,None]),
'x' : pd.Series([None,None])}
df=pd.DataFrame(d)
Condition if var is not missing and first digit is less than 5
Result make df.x=df.z & df.y=1
Here is pseudocode that doesn't work, but it is what I would want.
df.loc[((df['var'].dropna().astype(str).str[0].astype(int) < 5)),
['x','y']]=df['z'],1
but I get
ValueError: cannot set using a list-like indexer with a different length than the value
ideal output
c var x z y
0 100 10 x x 1
1 0 61 None x None
The code below works, but is too inefficient because I need to assign values to multiple columns.
df.loc[((df['var'].dropna().astype(str).str[0].astype(int) < 5)),
['x']]=df['z']
df.loc[((df['var'].dropna().astype(str).str[0].astype(int) < 5)),
['y']]=1
You can work row wise:
def f(row):
    if row['var'] is not None and int(str(row['var'])[0]) < 5:
        row[['x', 'y']] = row['z'], 1
    return row
>>> df.apply(f, axis=1)
c var x y z
0 100 10 x 1 x
1 0 61 None NaN x
To overwrite the original df:
df = df.apply(f, axis=1)
This is one way of doing it:
import pandas as pd
import numpy as np
d = {'var' : pd.Series([1,6]),
'c' : pd.Series([100,0]),
'z' : pd.Series(['x','x']),
'y' : pd.Series([None,None]),
'x' : pd.Series([None,None])}
df = pd.DataFrame(d)
# Condition 1: if var is not missing
cond1 = ~df['var'].apply(np.isnan)
# Condition 2: first number is less than 5
cond2 = df['var'].apply(lambda x: int(str(x)[0])) < 5
mask = cond1 & cond2
# .ix was removed in pandas 1.0; .loc performs the same label/boolean indexing
df.loc[mask, 'x'] = df.loc[mask, 'z']
df.loc[mask, 'y'] = 1
print(df)
Output:
c var x y z
0 100 1 x 1 x
1 0 6 None None x
As you can see, the Boolean mask has to be applied on both sides of the assignment, and you need to broadcast the value 1 on the y column. It is probably cleaner to split the steps into multiple lines.
Question updated, edit: More generally, since some assignments depend on the other columns, and some assignments are just broadcasting along the column, you can do it in two steps:
df.loc[conds, ['a','y']] = df.loc[conds, ['z','p']].to_numpy()  # to_numpy() avoids alignment on the differing column names
df.loc[conds, ['b','c','d','e','f','g','x']] = [1,3,4,5,6,7,8]
You may profile and see if this is efficient enough for your use case.
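Putting the two-step idea together on the simplified data from the question, a minimal sketch could look like this. The mask is computed once and reused; pd.to_numeric with errors='coerce' is an assumption added here so that a missing var (whose string form does not start with a digit) yields NaN and therefore fails the comparison instead of raising.

```python
import pandas as pd

df = pd.DataFrame({'var': [10, 61],
                   'c': [100, 0],
                   'z': ['x', 'x'],
                   'y': [None, None],
                   'x': [None, None]})

# Build the mask once: var present and its first digit below 5
first_digit = pd.to_numeric(df['var'].astype(str).str[0], errors='coerce')
mask = df['var'].notna() & (first_digit < 5)

df.loc[mask, 'x'] = df.loc[mask, 'z']  # value taken from another column
df.loc[mask, 'y'] = 1                  # scalar broadcast over the selected rows

print(df)
```

Reusing one precomputed mask avoids re-evaluating the string conversion for every assigned column, which is the main repeated cost in the original attempt.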