applying a function to a subset of columns in pandas groupby - python

I have a df with many columns. I would like to group by id and transform a subset of those columns leaving the rest untouched. What is the optimal way to do this? In particular, I have a df with a bunch of id's and I would like to z-score columns a and b within each id. Column c should remain untouched. In my actual problem I have many more columns.
The best I can think of is passing a dict of {col_name: function_name} to transform. For some reason this raises a TypeError.
MWE:
import pandas as pd
import numpy as np
np.random.seed(123) #reproducible ex
df = pd.DataFrame(data = {"a": np.arange(10), "b": np.arange(10)[::-1], "c": np.random.choice(a = np.arange(10), size = 10)}, index = pd.Index(data = np.random.choice(a = [1,2,3], size = 10), name = "id"))
#create a dict for all columns other than "c" and the function to do the transform
fmap = {k: lambda x: (x - x.mean()) / x.std() for k in df.columns if k != "c"}
df.groupby("id").transform(fmap) #yields error that "dict" is unhashable
Turns out this is a known bug: https://github.com/pandas-dev/pandas/issues/17309

One possible solution is filter columns names first by difference, because dict cannot working with transfrom yet:
cols = df.columns.difference(['c'])
print (cols)
Index(['a', 'b'], dtype='object')
fmap = lambda x: (x - x.mean()) / x.std()
df[cols] = df.groupby("id")[cols].transform(fmap)
print (df)
a b c
id
3 -1.000000 1.000000 2
2 -1.091089 1.091089 2
1 -1.134975 1.134975 6
3 0.000000 0.000000 1
1 -0.529655 0.529655 3
2 0.218218 -0.218218 9
3 1.000000 -1.000000 6
2 0.872872 -0.872872 1
1 0.680985 -0.680985 0
1 0.983645 -0.983645 1

Related

Pandas percent difference from mean

I have a pandas data frame with many columns and for each column I would like to generate a new columns where the result is the percent difference of the value in relation to the mean of that column, as seen in the example below:
d = {'var1': [1, 2], 'var2': [3,4]}
df = pd.DataFrame(data=d)
df
var1 var2
0 1 3
1 2 4
result:
var1 var2 var1_avg var2_avg
0 1 3 -0.33 -0.14
1 2 4 0.33 0.14
I am aware of how to find the mean of the column and then compute the percent difference, but only for a singe column, as seen below:
df['var1_avg'] = (df.var1 - df.var1.mean()) / df.var1.mean()
However, I have 100's of columns and would like a way where I can apply this to each column and append the "_avg" to each new column name.
You can use pandas.concat and pandas.DataFrame.add_suffix:
>>> pd.concat([df, ((df - df.mean())/df.mean()).add_suffix("_avg")], axis = 1)
var1 var2 var1_avg var2_avg
0 1 3 -0.333333 -0.142857
1 2 4 0.333333 0.142857
Given that you have hundreds of columns, I opted to avoid using a for loop when declaring the new columns (that is, I did not use for loops in the calculation process itself). This is the most efficient solution I could think of.
new_cols = [col + '_avg' for col in df.columns]
df[new_cols] = df.apply(lambda x: (x - x.mean()) / x.mean())
Out of curiosity I timed all the solutions given so far (other than #moys, which got edited to be essentially the same as mine [now deleted as it is a poor result and the code is duplicated in this answer]). The results were quite consistent for 10 iterations over a 100x100 dataframe, with #PabloC solution far and away the best; about 35x faster than mine, 120x faster than #Arturo and 2000x faster than #unltd_J:
nick : 3.6475648000000005
pablo : 0.09790619999999972
unltd_J : 189.9798728
arturo : 11.067582600000009
Testing code:
import timeit
import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal
def nick(df):
for col in df.columns:
df[f'{col}_avg'] = (df[col] - df[col].mean()) / df[col].mean()
return df
def pablo(df):
pd.concat([df, ((df - df.mean())/df.mean()).add_suffix("_avg")], axis = 1)
return df
def unltd_J(df):
for col in df.columns:
df[f'{col}_avg'] = [(x - np.mean(df[col])) / np.mean(df[col]) for x in df[col]]
return df
def arturo(df):
new_cols = [col + '_avg' for col in df.columns]
df[new_cols] = df.apply(lambda x: (x - x.mean()) / x.mean())
return df
size = 100
df = pd.DataFrame(np.random.randint(0,100,size=(size, size)), columns=['var' + str(i) for i in range(size)])
print(df)
assert_frame_equal(nick(df), pablo(df))
assert_frame_equal(nick(df), unltd_J(df))
assert_frame_equal(nick(df), arturo(df))
for fn in [nick, pablo, unltd_J, arturo]:
name = fn.__name__
t = timeit.timeit(f'{name}(df)', setup='''
import pandas as pd
import numpy as np
''', number=10, globals=globals())
print(f'{name:8} : {t}')
You are basically just making a new list of values for each column you have in the df and adding it to the columns.
for col in df.columns:
df[f'{col}_avg'] = [(x - np.mean(df[col])) / np.mean(df[col]) for x in df[col]]
output
var1 var2 var1_avg var2_avg
0 1 3 -0.3333 -0.1428
1 2 4 0.3333 0.1428
You can do something like this. Here i am just multiplying all the "Var" column values by 3, you can use your logic of calculating the value you want. If you can provide the logic that you are using, I can update the code appropriately.
for col in df.columns:
if 'var'in col:
df[col+'_avg'] = (df[col] - df[col].mean()) / df[col].mean()
Output
var1 var2 var1_avg var2_avg
1 3 -0.333333 -0.142857
2 4 0.333333 0.142857

How I can merge the columns into a single column in Python?

I want to merge 3 columns into a single column. I have tried changing the column types. However, I could not do it.
For example, I have 3 columns such as A: {1,2,4}, B:{3,4,4}, C:{1,1,1}
Output expected: ABC Column {131, 241, 441}
My inputs are like this:
df['ABC'] = df['A'].map(str) + df['B'].map(str) + df['C'].map(str)
df.head()
ABC {13.01.0 , 24.01.0, 44.01.0}
The type of ABC seems object and I could not change via str, int.
df['ABC'].apply(str)
Also, I realized that there are NaN values in A, B, C column. Is it possible to merge these even with NaN values?
# Example
import pandas as pd
import numpy as np
df = pd.DataFrame()
# Considering NaN's in the data-frame
df['colA'] = [1,2,4, np.NaN,5]
df['colB'] = [3,4,4,3,np.NaN]
df['colC'] = [1,1,1,4,1]
# Using pd.isna() to check for NaN values in the columns
df['colA'] = df['colA'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colB'] = df['colB'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colC'] = df['colC'].apply(lambda x: x if pd.isna(x) else str(int(x)))
# Filling the NaN values with a blank space
df = df.fillna('')
# Transform columns into string
df = df.astype(str)
# Concatenating all together
df['ABC'] = df.sum(axis=1)
A workaround your NaN problem could look like this but now NaN will be 0
import numpy as np
df = pd.DataFrame({'A': [1,2,4, np.nan], 'B':[3,4,4,4], 'C':[1,np.nan,1, 3]})
df = df.replace(np.nan, 0, regex=True).astype(int).applymap(str)
df['ABC'] = df['A'] + df['B'] + df['C']
output
A B C ABC
0 1 3 1 131
1 2 4 0 240
2 4 4 1 441
3 0 4 3 043

pandas - Apply mean to a specific row in grouped dataframe [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: Multiple columns to group-by and having multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation only be run on that particular column. You could add it to the end, but then you will run it for all columns only to throw out all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do this.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
if big_df is None:
big_df = df.copy()
else:
big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportional to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])['value', 'other_value']\
.transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('group').value.transform('mean')
The featured high ranked answer only works for a pandas Dataframe with only two columns. If you have a more columns case use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize all above concerning the efficiency of the possible solution
I have a dataset with 97 906 rows and 48 columns.
I want to fill in 4 columns with the median of each group.
The column I want to group has 26 200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed once a column was not a numeric the times were going up exponentially (makes sense as I was computing the median).
def groupMeanValue(group):
group['value'] = group['value'].fillna(group['value'].mean())
return group
dft = df.groupby("name").transform(groupMeanValue)
I know that is an old question. But I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, unanimity for apply/lambda based solution made me doubt my instinct). And that is indeed 2.5 faster than the most upvoted solutions.
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also use "dataframe or table_name".apply(lambda x: x.fillna(x.mean())).

How to get non-NaN elements' index and the value from a DataFrame

I have a big data frame with lots of NaN, I want to store it into a smaller data frame which stores all the indexes and the values of the non-NaN, non-zero values.
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
The data frame may look like this:
A B C
0 NaN -2.268882 0.337074
1 NaN 0.000000 1.340350
2 -1.526945 0.000000 NaN
3 -1.223816 0.000000 -2.185926
I want a data frame looks like this:
0 B -2.268882
0 C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
4 C -2.185926
How can I do it quickly, as i have a relatively big data frame, thousands by thousands...
Many Thanks!
Replace 0 with np.nan and .stack() the result (see docs).
If there's a chance that you have all np.nan values in rows after .replace(), you could do .dropna(how='all') before .stack() to reduce the number of rows to pivot. If that could apply to columns do `.dropna(how='all', axis=1).
df.replace(0, np.nan).stack()
0 B -2.268882
C 0.337074
1 C 1.340350
2 A -1.526945
3 A -1.223816
C -2.185926
Combine with .reset_index() as needed.
To select from a Series with MultiIndex use .loc[(level_0, level_1)]:
df.loc[(0, 'B')] = -2.268882
Details on slicing etc in the docs.
I've come up with a bit ugly way of achieving things, but hey, it works. But this solution has index going from 0 and does not preserve the original order of 'A', 'B', 'C' as in your question, if that matters.
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(4,3), columns=list('ABC'))
dff.iloc[0:2,0] = np.nan
dff.iloc[2,2] = np.nan
dff.iloc[1:4,1] = 0
dff.iloc[2,1] = np.nan
​
# mask to do logical and for two lists
mask = lambda y,z: list(map(lambda x: x[0] and x[1], zip(y,z)))
# create new frame
new_df = pd.DataFrame()
types = []
vals = []
# iterate over columns
for col in dff.columns:
# get the non empty and non zero values from current column
data = dff[col][mask(dff[col].notnull(), dff[col] != 0)]
# add corresponding original column name
types.extend([col for x in range(len(data))])
vals.extend(data)
# populate the dataframe
new_df['Types'] = pd.Series(types)
new_df['Vals'] = pd.Series(vals)
​
print(new_df)
# A B C
#0 NaN -1.167975 -1.362128
#1 NaN 0.000000 1.388611
#2 1.482621 NaN NaN
#3 -1.108279 0.000000 -1.454491
# Types Vals
#0 A 1.482621
#1 A -1.108279
#2 B -1.167975
#3 C -1.362128
#4 C 1.388611
#5 C -1.454491
I am looking forward for more pandas/python like answer myself!

How to apply a function to two columns of Pandas dataframe

Suppose I have a df which has columns of 'ID', 'col_1', 'col_2'. And I define a function :
f = lambda x, y : my_function_expression.
Now I want to apply the f to df's two columns 'col_1', 'col_2' to element-wise calculate a new column 'col_3' , somewhat like :
df['col_3'] = df[['col_1','col_2']].apply(f)
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'
How to do ?
** Add detail sample as below ***
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
ID col_1 col_2 col_3
0 1 0 1 ['a', 'b']
1 2 2 4 ['c', 'd', 'e']
2 3 3 5 ['d', 'e', 'f']
There is a clean, one-line way of doing this in Pandas:
df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.
Example with data (based on original question):
import pandas as pd
df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']
def get_sublist(sta,end):
return mylist[sta:end+1]
df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
Output of print(df):
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:
df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)
Here's an example using apply on the dataframe, which I am calling with axis = 1.
Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.
In [49]: df
Out[49]:
0 1
0 1.000000 0.000000
1 -0.494375 0.570994
2 1.000000 0.000000
3 1.876360 -0.229738
4 1.000000 0.000000
In [50]: def f(x):
....: return x[0] + x[1]
....:
In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]:
0 1.000000
1 0.076619
2 1.000000
3 1.646622
4 1.000000
Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.
A simple solution is:
df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
A interesting question! my answer as below:
import pandas as pd
def sublst(row):
return lst[row['J1']:row['J2']]
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(sublst,axis=1)
print df
Output:
ID J1 J2
0 1 0 1
1 2 2 4
2 3 3 5
ID J1 J2 J3
0 1 0 1 [a]
1 2 2 4 [c, d]
2 3 3 5 [d, e]
I changed the column name to ID,J1,J2,J3 to ensure ID < J1 < J2 < J3, so the column display in right sequence.
One more brief version:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print df
The method you are looking for is Series.combine.
However, it seems some care has to be taken around datatypes.
In your example, you would (as I did when testing the answer) naively call
df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)
However, this throws the error:
ValueError: setting an array element with a sequence.
My best guess is that it seems to expect the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:
df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Returning a list from apply is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let's walk through a simple example:
df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
columns=['a', 'b', 'c'])
df
a b c
0 4 0 0
1 2 0 1
2 2 2 2
3 1 2 2
4 3 0 0
There are three possible outcomes with returning a list from apply
1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.
df.apply(lambda x: list(range(2)), axis=1) # returns a Series
0 [0, 1]
1 [0, 1]
2 [0, 1]
3 [0, 1]
4 [0, 1]
dtype: object
2) When the length of the returned list is equal to the number of
columns then a DataFrame is returned and each column gets the
corresponding value in the list.
df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
a b c
0 0 1 2
1 0 1 2
2 0 1 2
3 0 1 2
4 0 1 2
3) If the length of the returned list equals the number of columns for the first row but has at least one row where the list has a different number of elements than number of columns a ValueError is raised.
i = 0
def f(x):
global i
if i == 0:
i += 1
return list(range(3))
return list(range(4))
df.apply(f, axis=1)
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
Answering the problem without apply
Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.
Create larger dataframe
df1 = df.sample(100000, replace=True).reset_index(drop=True)
Timings
# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip - similar to #Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Thomas answer
%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data -
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
return mylist[sta:end+1]
df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list
We could pass as many arguments as we wanted into the function this way. The output is what we wanted
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
I'm going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it's great for functions you don't control or doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, 'foo').
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Here is a faster solution:
def func_1(a,b):
return a + b
df["C"] = func_1(df["A"].to_numpy(),df["B"].to_numpy())
This is 380 times faster than df.apply(f, axis=1) from #Aman and 310 times faster than df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1) from #ajrwhite.
I add some benchmarks too:
Results:
FUNCTIONS TIMINGS GAIN
apply lambda 0.7 x 1
apply 0.56 x 1.25
map 0.3 x 2.3
np.vectorize 0.01 x 70
f3 on Series 0.0026 x 270
f3 on np arrays 0.0018 x 380
f3 numba 0.0018 x 380
In short:
Using apply is slow. We can speed up things very simply, just by using a function that will operate directly on Pandas Series (or better on numpy arrays). And because we will operate on Pandas Series or numpy arrays, we will be able to vectorize the operations. The function will return a Pandas Series or numpy array that we will assign as a new column.
And here is the benchmark code:
import timeit
timeit_setup = """
import pandas as pd
import numpy as np
import numba
np.random.seed(0)
# Create a DataFrame of 10000 rows with 2 columns "A" and "B"
# containing integers between 0 and 100
df = pd.DataFrame(np.random.randint(0,10,size=(10000, 2)), columns=["A", "B"])
def f1(a,b):
# Here a and b are the values of column A and B for a specific row: integers
return a + b
def f2(x):
# Here, x is pandas Series, and corresponds to a specific row of the DataFrame
# 0 and 1 are the indexes of columns A and B
return x[0] + x[1]
def f3(a,b):
# Same as f1 but we will pass parameters that will allow vectorization
# Here, A and B will be Pandas Series or numpy arrays
# with df["C"] = f3(df["A"],df["B"]): Pandas Series
# with df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy()): numpy arrays
return a + b
#numba.njit('int64[:](int64[:], int64[:])')
def f3_numba_vectorize(a,b):
# Here a and b are 2 numpy arrays with dtype int64
# This function must return a numpy array whith dtype int64
return a + b
"""
test_functions = [
'df["C"] = df.apply(lambda row: f1(row["A"], row["B"]), axis=1)',
'df["C"] = df.apply(f2, axis=1)',
'df["C"] = list(map(f3,df["A"],df["B"]))',
'df["C"] = np.vectorize(f3) (df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3(df["A"],df["B"])',
'df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3_numba_vectorize(df["A"].to_numpy(),df["B"].to_numpy())'
]
for test_function in test_functions:
print(min(timeit.repeat(setup=timeit_setup, stmt=test_function, repeat=7, number=10)))
Output:
0.7
0.56
0.3
0.01
0.0026
0.0018
0.0018
Final note: things could be optimzed with Cython and other numba tricks too.
The way you have written f it needs two inputs. If you look at the error message it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[['col1','col2']] returns a single dataframe with two columns, not two separate columns.
You need to change your f so that it takes a single input, keep the above data frame as input, then break it up into x,y inside the function body. Then do whatever you need and return a single value.
You need this function signature because the syntax is .apply(f)
So f needs to take the single thing = dataframe and not two things which is what your current f expects.
Since you haven't provided the body of f I can't help in anymore detail - but this should provide the way out without fundamentally changing your code or using some other methods rather than apply
Another option is df.itertuples() (generally faster and recommended over df.iterrows() by docs and user testing):
import pandas as pd
df = pd.DataFrame([range(4) for _ in range(4)], columns=list("abcd"))
df
a b c d
0 0 1 2 3
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
df["e"] = [sum(row) for row in df[["b", "d"]].itertuples(index=False)]
df
a b c d e
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
Since itertuples returns an Iterable of namedtuples, you can access tuple elements both as attributes by column name (aka dot notation) and by index:
b, d = row
b = row.b
d = row[1]
My example to your questions:
def get_sublist(row, col1, col2):
return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')
It can be done in two simple ways:
Let's say, we want sum of col1 and col2 in output column named col_sum
Method 1
f = lambda x : x.col1 + x.col2
df['col_sum'] = df.apply(f, axis=1)
Method 2
def f(x):
x['col_sum'] = x.col_1 + col_2
return x
df = df.apply(f, axis=1)
Method 2 should be used when some complex function has to applied to the dataframe. Method 2 can also be used when output in multiple columns is required.
I suppose you don't want to change get_sublist function, and just want to use DataFrame's apply method to do the job. To get the result you want, I've wrote two help functions: get_sublist_list and unlist. As the function name suggest, first get the list of sublist, second extract that sublist from that list. Finally, We need to call apply function to apply those two functions to the df[['col_1','col_2']] DataFrame subsequently.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
return mylist[sta:end+1]
def get_sublist_list(cols):
return [get_sublist(cols[0],cols[1])]
def unlist(list_of_lists):
return list_of_lists[0]
df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)
df
If you don't use [] to enclose the get_sublist function, then the get_sublist_list function will return a plain list, it'll raise ValueError: could not broadcast input array from shape (3) into shape (2), as #Ted Petrou had mentioned.
If you have a huge data-set, then you can use an easy but faster(execution time) way of doing this using swifter:
import pandas as pd
import swifter
def fnc(m,x,c):
return m*x+c
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)

Categories

Resources