Pandas - modifying single/multiple columns with method chaining - python

I discovered method chaining in pandas only very recently. I love how it makes the code cleaner and more readable, but I still can't figure out how to use it when I want to modify only a single column, or a group of columns, as part of the pipeline.
For example, let's say this is my DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num_1': [np.nan, 2., 2., 3., 1.],
    'num_2': [9., 6., np.nan, 5., 7.],
    'str_1': ['a', 'b', 'c', 'a', 'd'],
    'str_2': ['C', 'B', 'B', 'D', 'A'],
})
And I have some manipulation I want to do on it:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
df[numeric_cols] = df[numeric_cols] * 2
df['str_2'] = df['str_2'].str.lower()
df[str_cols] = df[str_cols].replace({'a': 'z', 'b':'y', 'c': 'x'})
My question is - what is the most pandas-y way / best practice to achieve all of the above with method chaining?
I went through the documentation of .assign and .pipe, and many answers here, and have gotten as far as this:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0.).astype('int')
    df[numeric_cols] = df[numeric_cols] * 2
    return df

to_rep = {'a': 'z', 'b': 'y', 'c': 'x'}

df = (df
      .pipe(foo_numbers)
      .assign(str_2=df['str_2'].str.lower())
      .replace({'str_1': to_rep, 'str_2': to_rep})
)
which produces the same output. My problems with this are:
The pipe seems to just hide the handling of the numeric columns from the main chain, but the implementation inside hasn't improved at all.
The .replace requires me to manually name all the columns one by one. What if I have more than just two columns? (You can assume I want to apply the same replacement to all columns).
The .assign is OK, but I was hoping there was a way to pass str.lower as a callable to be applied to that one column; I couldn't make it work.
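(A minimal sketch of that idea, since assign does accept a callable that receives the intermediate DataFrame:)
# a minimal sketch: the callable receives the intermediate DataFrame,
# so str.lower can be applied to one column inside the chain
df = df.assign(str_2=lambda d: d['str_2'].str.lower())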
So what's the correct way to approach these kinds of changes to a DataFrame using method chaining?

I would do it this way, with the help of DataFrame.select_dtypes and pandas.concat:
import numpy as np
import pandas as pd

df = (
    pd.concat(
        [df.select_dtypes(np.number)
           .fillna(0)
           .astype(int)
           .mul(2),
         df.select_dtypes('object')
           .apply(lambda s: s.str.lower())
           .replace({'a': 'z', 'b': 'y', 'c': 'x'})],
        axis=1)
)
Output:

print(df)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z

You already have a good approach, except for the fact that you mutate the input. Either make a copy, or chain operations:
def foo_numbers(df):
    df = df.copy()
    numeric_cols = ['num_1', 'num_2']
    df[numeric_cols] = df[numeric_cols].fillna(0, downcast='infer').mul(2)
    return df
Or:
def foo_numbers(df):
    numeric_cols = ['num_1', 'num_2']
    return (df[numeric_cols].fillna(0, downcast='infer').mul(2)
            .combine_first(df)
            )[df.columns]
Here are some more examples.
Using assign:
numeric_cols = ['num_1', 'num_2']
str_cols = ['str_1', 'str_2']
(df.assign(**{c: lambda d, c=c: d[c].fillna(0, downcast='infer').mul(2)
              for c in numeric_cols})
   .assign(**{c: lambda d, c=c: d[c].str.lower().replace({'a': 'z', 'b': 'y', 'c': 'x'})
              for c in str_cols})
)
Using apply:
def foo(s):
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(0, downcast='infer').mul(2)
    elif s.dtype == object:
        return s.str.lower().replace({'a': 'z', 'b': 'y', 'c': 'x'})
    return s

df.apply(foo)
Using pipe:
def foo(df):
    df = df.copy()
    df.update(df.select_dtypes('number')
                .fillna(0, downcast='infer').mul(2))
    df.update(df.select_dtypes(object)
                .apply(lambda s: s.str.lower().replace({'a': 'z', 'b': 'y', 'c': 'x'}))
              )
    return df

df.pipe(foo)

One option, with method chaining:
(df
.loc(axis=1)[numeric_cols]
.fillna(0,downcast='infer')
.mul(2)
.assign(**df.loc(axis=1)[str_cols]
.transform(lambda f: f.str.lower())
.replace({'a':'z', 'b':'y','c':'x'}))
)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z
Another option, using pyjanitor's transform_columns:
(df.transform_columns(numeric_cols,
lambda f: f.fillna(0,downcast='infer').mul(2),
elementwise=False)
.transform_columns(str_cols, str.lower)
.replace({'a':'z', 'b':'y','c':'x'})
)
   num_1  num_2 str_1 str_2
0      0     18     z     x
1      4     12     y     y
2      4      0     x     y
3      6     10     z     d
4      2     14     d     z

Related

For and If together - dataframe Python

If the value in column 'y' is K, multiply the column 'x' values by 1e3. If column 'y' is M, multiply the column 'x' values by 1e6. The code below multiplies all the values by 1e3:
value_list = []
for i in list(result['x'].values):
    if np.where(result['y'] == 'K'):
        value_list.append(float(i)*1e3)
    elif np.where(result['y'] == 'M'):
        value_list.append(float(i)*1e6)
    else:
        value_list.append(np.nan)
df['Value_numeric'] = value_list
df.head().Value_numeric
This case is simple enough that it's not necessary to use a loop or a custom function; one can use a simple assignment:
import pandas as pd
import numpy as np
d = {'x': [750, 5, 4, 240, 220], 'y': ['K', 'M', 'M', 'K', 'K']}
df = pd.DataFrame(data=d)
# here is the main operation:
df['value_numeric'] = np.where(df['y']=='K', df['x'] * 1e3, df['x'] * 1e6)
print(df)
Output:

     x  y  value_numeric
0  750  K       750000.0
1    5  M      5000000.0
2    4  M      4000000.0
3  240  K       240000.0
4  220  K       220000.0
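If there were more than two categories in column 'y', numpy.select extends the same idea; a minimal sketch, where the 'B' (billions) suffix is a hypothetical extra category, not part of the original question:
# sketch: np.select evaluates several conditions at once;
# 'B' is a hypothetical third suffix added for illustration
conditions = [df['y'] == 'K', df['y'] == 'M', df['y'] == 'B']
choices    = [df['x'] * 1e3, df['x'] * 1e6, df['x'] * 1e9]
df['value_numeric'] = np.select(conditions, choices, default=np.nan)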
You can do something like this:
df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['A', 'B'])

def calc(x):
    if x['B'] == 'a':
        return x['A'] * 10
    if x['B'] == 'b':
        return x['A'] * 20
    if x['B'] == 'c':
        return x['A'] * 30

df['calculate'] = df.apply(lambda x: calc(x), axis=1)
print(df)
#    A  B  calculate
# 0  1  a         10
# 1  2  b         40
# 2  3  c         90
You can adjust your calculations as needed based on the condition.

Calculate average of column x if column y meets criteria, for each y

How do I retrieve the value of column Z and its average if any value is > 1?
data = [9, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame(np.random.randn(8, 5), columns=['A', 'B', 'C', 'D', 'E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)

l = []
for x, y in df.iterrows():
    for i, s in y.iteritems():
        if s > 1:
            l.append(x)
print(df['Z'])
The expected output will most likely be a dictionary with the column name as key and the average of Z as its values.
Using a dictionary comprehension:
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}
# {'A': 9.0, 'B': 5.0, 'C': 8.0, 'D': 7.5, 'E': 6.666666666666667}
Setup used for above:
np.random.seed(0)
data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
Do you mean this?
df[df['Z']>1].loc[:,'Z'].mean(axis=0)
or
df[df['Z']>1]['Z'].mean()
I don't know if I understood your question correctly but do you mean this:
import pandas as pd
import numpy as np
data=[9,2,3,4,5,6,7,8]
columns = ['A', 'B', 'C', 'D','E']
df = pd.DataFrame(np.random.randn(8, 5),columns=columns)
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
print('df = \n', str(df))
anyGreaterThanOne = (df[columns] > 1).any(axis=1)
print('anyGreaterThanOne = \n', str(anyGreaterThanOne))
filtered = df[anyGreaterThanOne]
print('filtered = \n', str(filtered))
Zmean = filtered['Z'].mean()
print('Zmean = ', str(Zmean))
Result:
df = 
          A         B         C         D         E  Z
0 -2.170640 -2.626985 -0.817407 -0.389833  0.862373  9
1 -0.372144 -0.375271 -1.309273 -1.019846 -0.548244  2
2  0.267983 -0.680144  0.304727  0.302952 -0.597647  3
3  0.243549  1.046297  0.647842  1.188530  0.640133  4
4 -0.116007  1.090770  0.510190 -1.310732  0.546881  5
5 -1.135545 -1.738466 -1.148341  0.764914 -1.140543  6
6 -2.078396  0.057462 -0.737875 -0.817707  0.570017  7
7  0.187877  0.363962  0.637949 -0.875372 -1.105744  8
anyGreaterThanOne = 
0    False
1    False
2    False
3     True
4     True
5    False
6    False
7    False
dtype: bool
filtered = 
          A         B         C         D         E  Z
3  0.243549  1.046297  0.647842  1.188530  0.640133  4
4 -0.116007  1.090770  0.510190 -1.310732  0.546881  5
Zmean = 4.5

pandas data frame sort

I have a pandas dataframe like this which I try to sort by the column 'dist'. The sorted dataframe should start with E or F, as per below. I use sort_values but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order using the 'dist' column.
Could someone advise me why the sorting is not working?
from math import hypot

import numpy as np
import pandas as pd

locations = {'Start': (20, 5), 'A': (10, 3), 'B': (5, 3), 'C': (5, 7), 'D': (10, 7), 'E': (14, 4), 'F': (14, 6)}

loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']

def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])  # distance to each location
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
    from to                dist
0  Start  D   10.19803902718557
1  Start  A   10.19803902718557
2  Start  C  15.132745950421555
3  Start  B  15.132745950421555
4  Start  E    6.08276253029822
5  Start  F    6.08276253029822

closest_two_loc.dtypes
Out[247]: 
from    object
to      object
dist    object
dtype: object
Is this what you want?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
    x  y       dist
E  14  4   6.082763
F  14  6   6.082763
A  10  3  10.198039
D  10  7  10.198039
C   5  7  15.132746
B   5  3  15.132746
or if you want to wrap it in a function
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
You need to convert values in "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:

    from to       dist
4  Start  E   6.082763
5  Start  F   6.082763
0  Start  D  10.198039
1  Start  A  10.198039
2  Start  C  15.132746
3  Start  B  15.132746
Edit: As mentioned by #jezrael in comments, following is a more direct method:
df.dist = df.dist.astype(float)
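The underlying reason 'dist' ends up as object dtype is that np.array(lresults) coerces the mixed-type rows to a single string dtype, so the sort becomes lexicographic. Building the DataFrame directly from the list of lists keeps per-column dtypes; a minimal sketch under the question's own setup:
# sketch: let pandas infer dtypes per column instead of going through np.array,
# so 'dist' stays float and sort_values orders it numerically
RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])
RESULTS = RESULTS.sort_values('dist', ignore_index=True)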

Pandas Apply Function That returns two new columns

I have a pandas dataframe that I would like to use an apply function on to generate two new columns based on the existing data. I am getting this error:
ValueError: Wrong number of items passed 2, placement implies 1
import pandas as pd
import numpy as np

def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return [C, D]

df = pd.DataFrame(np.random.randint(0, 10, size=(2, 2)), columns=list('AB'))
df['C', 'D'] = df.apply(myfunc1, axis=1)
Starting DF:
   A  B
0  6  1
1  8  4
Desired DF:
   A  B   C   D
0  6  1  16  56
1  8  4  18  58
Based on your latest error, you can avoid the error by returning the new columns as a Series
def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return pd.Series([C, D])

df[['C', 'D']] = df.apply(myfunc1, axis=1)
Please be aware of the huge memory consumption and low speed of the accepted answer: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/ !
Using the suggestion presented there, the correct answer would be like this:
def run_loopy(df):
    Cs, Ds = [], []
    for _, row in df.iterrows():
        c, d = myfunc1(row['A'])
        Cs.append(c)
        Ds.append(d)
    return pd.Series({'C': Cs,
                      'D': Ds})

def myfunc1(a):
    c = a + 10
    d = a + 50
    return c, d

df[['C', 'D']] = run_loopy(df)
It works for me:
def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return C, D

df = pd.DataFrame(np.random.randint(0, 10, size=(2, 2)), columns=list('AB'))
df[['C', 'D']] = df.apply(myfunc1, axis=1, result_type='expand')
df
The key addition is result_type='expand'.
df['C','D'] is treated as a single column labelled ('C', 'D') rather than two columns. To assign two columns you need a sliced dataframe, so use df[['C','D']]:
df[['C', 'D']] = df.apply(myfunc1, axis=1)

   A  B   C   D
0  4  6  14  54
1  5  1  15  55
Or you can use chained assignment together with zip, i.e.
df['C'], df['D'] = zip(*df.apply(myfunc1, axis=1))
I believe you can achieve results similar to Federico Dorato's answer without using a for loop. Return a list rather than a Series and use apply with a lambda plus to_list() to expand the results.
It's cleaner code, and on a random df of 10,000,000 rows it performs as well or faster.
Federico's code
import time

run_time = []
for i in range(0, 25):
    df = pd.DataFrame(np.random.randint(0, 10000000, size=(2, 2)), columns=list('AB'))

    def run_loopy(df):
        Cs, Ds = [], []
        for _, row in df.iterrows():
            c, d = myfunc1(row['A'])
            Cs.append(c)
            Ds.append(d)
        return pd.Series({'C': Cs,
                          'D': Ds})

    def myfunc1(a):
        c = a / 10
        d = a + 50
        return c, d

    start = time.time()
    df[['C', 'D']] = run_loopy(df)
    end = time.time()
    run_time.append(end - start)

print(np.average(run_time))  # 0.001240386962890625
Using lambda and to_list
run_time = []
for i in range(0, 25):
    df = pd.DataFrame(np.random.randint(0, 10000000, size=(2, 2)), columns=list('AB'))

    def myfunc1(a):
        c = a / 10
        d = a + 50
        return [c, d]

    start = time.time()
    df[['C', 'D']] = df['A'].apply(lambda x: myfunc1(x)).to_list()
    end = time.time()
    run_time.append(end - start)

print(np.average(run_time))  # output 0.0009996891021728516
Add extra brackets when querying for multiple columns.
import pandas as pd
import numpy as np

def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return [C, D]

df = pd.DataFrame(np.random.randint(0, 10, size=(2, 2)), columns=list('AB'))
df[['C', 'D']] = df.apply(myfunc1, axis=1)

Rename specific column(s) in pandas

I've got a dataframe called data. How would I rename only one column header? For example, gdp to log(gdp)?
data =
    y  gdp  cap
0   1    2    5
1   2    3    9
2   8    7    2
3   3    4    7
4   6    7    7
5   4    8    3
6   8    2    8
7   9    9   10
8   6    6    4
9  10   10    7
data.rename(columns={'gdp':'log(gdp)'}, inplace=True)
The docs for rename show that it accepts a dict as a param for columns, so you just pass a dict with a single entry.
Also see related
A much faster implementation would be to use list-comprehension if you need to rename a single column.
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
If the need arises to rename multiple columns, either use conditional expressions like:
df.columns = ['log(gdp)' if x=='gdp' else 'cap_mod' if x=='cap' else x for x in df.columns]
Or, construct a mapping using a dictionary and perform the list-comprehension with its get method, setting the default value to the old name:
col_dict = {'gdp': 'log(gdp)', 'cap': 'cap_mod'} ## key→old name, value→new name
df.columns = [col_dict.get(x, x) for x in df.columns]
Timings:
%%timeit
df.rename(columns={'gdp':'log(gdp)'}, inplace=True)
10000 loops, best of 3: 168 µs per loop
%%timeit
df.columns = ['log(gdp)' if x=='gdp' else x for x in df.columns]
10000 loops, best of 3: 58.5 µs per loop
How do I rename a specific column in pandas?
From v0.24+, to rename one (or more) columns at a time,
DataFrame.rename() with axis=1 or axis='columns' (the axis argument was introduced in v0.21).
Index.str.replace() for string/regex based replacement.
If you need to rename ALL columns at once,
DataFrame.set_axis() method with axis=1. Pass a list-like sequence. Options are available for in-place modification as well.
rename with axis=1
df = pd.DataFrame('x', columns=['y', 'gdp', 'cap'], index=range(5))
df
y gdp cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
With 0.21+, you can now specify an axis parameter with rename:
df.rename({'gdp':'log(gdp)'}, axis=1)
# df.rename({'gdp':'log(gdp)'}, axis='columns')
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
(Note that rename is not in-place by default, so you will need to assign the result back.)
This addition has been made to improve consistency with the rest of the API. The new axis argument is analogous to the columns parameter—they do the same thing.
df.rename(columns={'gdp': 'log(gdp)'})
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
rename also accepts a callback that is called once for each column.
df.rename(lambda x: x[0], axis=1)
# df.rename(lambda x: x[0], axis='columns')
y g c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For this specific scenario, you would want to use
df.rename(lambda x: 'log(gdp)' if x == 'gdp' else x, axis=1)
Index.str.replace
Similar to the replace method of Python strings, pandas Index and Series (object dtype only) define a ("vectorized") str.replace method for string- and regex-based replacement.
df.columns = df.columns.str.replace('gdp', 'log(gdp)')
df
y log(gdp) cap
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
The advantage of this over the other methods is that str.replace supports regex (enabled by default in older pandas versions; from pandas 2.0 onward you pass regex=True explicitly). See the docs for more information.
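For example, a hedged sketch of a regex-based rename (the raw_ prefix here is a hypothetical naming convention, not from the question):
# sketch: strip a hypothetical 'raw_' prefix from every column name;
# pass regex=True explicitly on pandas >= 2.0
df.columns = df.columns.str.replace(r'^raw_', '', regex=True)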
Passing a list to set_axis with axis=1
Call set_axis with a list of header(s). The list must be equal in length to the columns/index size. set_axis mutates the original DataFrame by default, but you can specify inplace=False to return a modified copy.
df.set_axis(['cap', 'log(gdp)', 'y'], axis=1, inplace=False)
# df.set_axis(['cap', 'log(gdp)', 'y'], axis='columns', inplace=False)
cap log(gdp) y
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
Note: In future releases, inplace will default to False.
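In recent pandas releases (2.0 onward), set_axis no longer accepts inplace at all, so the result is simply assigned back:
# pandas >= 2.0: set_axis always returns a new object
df = df.set_axis(['cap', 'log(gdp)', 'y'], axis=1)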
Method Chaining
Why choose set_axis when we already have an efficient way of assigning columns with df.columns = ...? As shown by Ted Petrou in this answer, set_axis is useful when trying to chain methods.
Compare
# new for pandas 0.21+
(df.some_method1()
   .some_method2()
   .set_axis()
   .some_method3())
Versus
# old way
df1 = (df.some_method1()
         .some_method2())
df1.columns = columns
df1.some_method3()
The former offers a more natural, free-flowing syntax.
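A runnable sketch of that pattern with the toy df built above (the head and reset_index calls are placeholders for real chain steps):
# sketch: the rename lives inside the chain via set_axis
out = (df.head(3)                                   # placeholder preceding step
         .set_axis(['y', 'log(gdp)', 'cap'], axis=1)
         .reset_index(drop=True))                   # placeholder following step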
There are at least five different ways to rename specific columns in pandas, and I have listed them below along with links to the original answers. I also timed these methods and found them to perform about the same (though YMMV depending on your data set and scenario). The test case below is to rename columns A M N Z to A2 M2 N2 Z2 in a dataframe with columns A to Z containing a million rows.
# Import required modules
import numpy as np
import pandas as pd
import timeit

# Create sample data
df = pd.DataFrame(np.random.randint(0, 9999, size=(1000000, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# Standard way - https://stackoverflow.com/a/19758398/452587
def method_1():
    df_renamed = df.rename(columns={'A': 'A2', 'M': 'M2', 'N': 'N2', 'Z': 'Z2'})

# Lambda function - https://stackoverflow.com/a/16770353/452587
def method_2():
    df_renamed = df.rename(columns=lambda x: x + '2' if x in ['A', 'M', 'N', 'Z'] else x)

# Mapping function - https://stackoverflow.com/a/19758398/452587
def rename_some(x):
    if x == 'A' or x == 'M' or x == 'N' or x == 'Z':
        return x + '2'
    return x

def method_3():
    df_renamed = df.rename(columns=rename_some)

# Dictionary comprehension - https://stackoverflow.com/a/58143182/452587
def method_4():
    df_renamed = df.rename(columns={col: col + '2' for col in df.columns[
        np.asarray([i for i, col in enumerate(df.columns) if 'A' in col or 'M' in col or 'N' in col or 'Z' in col])
    ]})

# Dictionary comprehension - https://stackoverflow.com/a/38101084/452587
def method_5():
    df_renamed = df.rename(columns=dict(zip(df[['A', 'M', 'N', 'Z']], ['A2', 'M2', 'N2', 'Z2'])))

print('Method 1:', timeit.timeit(method_1, number=10))
print('Method 2:', timeit.timeit(method_2, number=10))
print('Method 3:', timeit.timeit(method_3, number=10))
print('Method 4:', timeit.timeit(method_4, number=10))
print('Method 5:', timeit.timeit(method_5, number=10))
Output:
Method 1: 3.650640267
Method 2: 3.163998427
Method 3: 2.998530871
Method 4: 2.9918436889999995
Method 5: 3.2436501520000007
Use the method that is most intuitive to you and easiest for you to implement in your application.
Use the pandas.DataFrame.rename function.
Check this link for description.
data.rename(columns = {'gdp': 'log(gdp)'}, inplace = True)
If you intend to rename multiple columns then
data.rename(columns = {'gdp': 'log(gdp)', 'cap': 'log(cap)', ..}, inplace = True)
df.rename(columns=lambda x: {"My_sample": "My_sample_new_name"}.get(x, x))
We can rename by rebuilding the table:
df = pd.DataFrame()
column_names = mydataframe.columns
for i in range(len(column_names)):
    column = mydataframe.iloc[:, i]
    df[column_names[i][:-8] + "desired_text"] = column
print(df.columns)
