I have a pandas dataframe with two columns. I need to change the values of the first column without affecting the second one and get back the whole dataframe with just first column values changed. How can I do that using apply() in pandas?
Given a sample dataframe df as:
a b
0 1 2
1 2 3
2 3 4
3 4 5
what you want is:
df['a'] = df['a'].apply(lambda x: x + 1)
that returns:
a b
0 2 2
1 3 3
2 4 4
3 5 5
For a single column better to use map(), like this:
df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
a b c
0 15 15 5
1 20 10 7
2 25 30 9
df['a'] = df['a'].map(lambda a: a / 2.)
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
Given the following dataframe df and the function complex_function,
import pandas as pd
def complex_function(x, y=0):
if x > 5 and x > y:
return 1
else:
return 2
df = pd.DataFrame(data={'col1': [1, 4, 6, 2, 7], 'col2': [6, 7, 1, 2, 8]})
col1 col2
0 1 6
1 4 7
2 6 1
3 2 2
4 7 8
there are several solutions to use apply() on only one column. In the following I will explain them in detail.
I. Simple solution
The straightforward solution is the one from #Fabio Lamanna:
df['col1'] = df['col1'].apply(complex_function)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 1 8
Only the first column is modified, the second column is unchanged. The solution is beautiful. It is just one line of code and it reads almost like english: "Take 'col1' and apply the function complex_function to it."
However, if you need data from another column, e.g. 'col2', it won't work. If you want to pass the values of 'col2' to variable y of the complex_function, you need something else.
II. Solution using the whole dataframe
Alternatively, you could use the whole dataframe as described in this SO post or this one:
df['col1'] = df.apply(lambda x: complex_function(x['col1']), axis=1)
or if you prefer (like me) a solution without a lambda function:
def apply_complex_function(x):
return complex_function(x['col1'])
df['col1'] = df.apply(apply_complex_function, axis=1)
There is a lot going on in this solution that needs to be explained. The apply() function works on pd.Series and pd.DataFrame. But you cannot use df['col1'] = df.apply(complex_function).loc[:, 'col1'], because it would throw a ValueError.
Hence, you need to give the information which column to use. To complicate things, the apply() function does only accept callables. To solve this, you need to define a (lambda) function with the column x['col1'] as argument; i.e. we wrap the column information in another function.
Unfortunately, the default value of the axis parameter is zero (axis=0), which means it will try executing column-wise and not row-wise. This wasn't a problem in the first solution, because we gave apply() a pd.Series. But now the input is a dataframe and we must be explicit (axis=1). (I marvel how often I forget this.)
Whether you prefer the version with the lambda function or without is subjective. In my opinion the line of code is complicated enough to read even without a lambda function thrown in. You only need the (lambda) function as a wrapper. It is just boilerplate code. A reader should not be bothered with it.
Now, you can modify this solution easily to take the second column into account:
def apply_complex_function(x):
return complex_function(x['col1'], x['col2'])
df['col1'] = df.apply(apply_complex_function, axis=1)
Output:
col1 col2
0 2 6
1 2 7
2 1 1
3 2 2
4 2 8
At index 4 the value has changed from 1 to 2, because the first condition 7 > 5 is true but the second condition 7 > 8 is false.
Note that you only needed to change the first line of code (i.e. the function) and not the second line.
Side note
Never put the column information into your function.
def bad_idea(x):
return x['col1'] ** 2
By doing this, you make a general function dependent on a column name! This is a bad idea, because the next time you want to use this function, you cannot. Worse: Maybe you rename a column in a different dataframe just to make it work with your existing function. (Been there, done that. It is a slippery slope!)
III. Alternative solutions without using apply()
Although the OP specifically asked for a solution with apply(), alternative solutions were suggested. For example, the answer of #George Petrov suggested to use map(); the answer of #Thibaut Dubernet proposed assign().
I fully agree that apply() is seldom the best solution, because apply() is not vectorized. It is an element-wise operation with expensive function calling and overhead from pd.Series.
One reason to use apply() is that you want to use an existing function and performance is not an issue. Or your function is so complex that no vectorized version exists.
Another reason to use apply() is in combination with groupby(). Please note that DataFrame.apply() and GroupBy.apply() are different functions.
So it does make sense to consider some alternatives:
map() only works on pd.Series, but accepts dict and pd.Series as input. Using map() with a function is almost interchangeable with using apply(). It can be faster than apply(). See this SO post for more details.
df['col1'] = df['col1'].map(complex_function)
applymap() is almost identical for dataframes. It does not support pd.Series and it will always return a dataframe. However, it can be faster. The documentation states: "In the current implementation applymap calls func twice on the first column/row to decide whether it can take a fast or slow code path.". But if performance really counts you should seek an alternative route.
df['col1'] = df.applymap(complex_function).loc[:, 'col1']
assign() is not a feasible replacement for apply(). It has a similar behaviour in only the most basic use cases. It does not work with the complex_function. You still need apply() as you can see in the example below. The main use case for assign() is method chaining, because it gives back the dataframe without changing the original dataframe.
df['col1'] = df.assign(col1=df.col1.apply(complex_function))
Annex: How to speed up apply()?
I only mention it here because it was suggested by other answers, e.g. #durjoy. The list is not exhaustive:
Do not use apply(). This is no joke. For most numeric operations, a vectorized method exists in pandas. If/else blocks can often be refactored with a combination of boolean indexing and .loc. My example complex_function could be refactored in this way.
Refactor to Cython. If you have a complex equation and the parameters of the equation are in your dataframe, this might be a good idea. Check out the official pandas user guide for more information.
Use raw=True parameter. Theoretically, this should improve the performance of apply() if you are just applying a NumPy reduction function, because the overhead of pd.Series is removed. Of course, your function has to accept an ndarray. You have to refactor your function to NumPy. By doing this, you will have a huge performance boost.
Use 3rd party packages. The first thing you should try is Numba. I do not know swifter mentioned by #durjoy; and probably many other packages are worth mentioning here.
Try/Fail/Repeat. As mentioned above, map() and applymap() can be faster - depending on the use case. Just time the different versions and choose the fastest. This approach is the most tedious one with the least performance increase.
You don't need a function at all. You can work on a whole column directly.
Example data:
>>> df = pd.DataFrame({'a': [100, 1000], 'b': [200, 2000], 'c': [300, 3000]})
>>> df
a b c
0 100 200 300
1 1000 2000 3000
Half all the values in column a:
>>> df.a = df.a / 2
>>> df
a b c
0 50 200 300
1 500 2000 3000
Although the given responses are correct, they modify the initial data frame, which is not always desirable (and, given the OP asked for examples "using apply", it might be they wanted a version that returns a new data frame, as apply does).
This is possible using assign: it is valid to assign to existing columns, as the documentation states (emphasis is mine):
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
In short:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{'a': 15, 'b': 15, 'c': 5}, {'a': 20, 'b': 10, 'c': 7}, {'a': 25, 'b': 30, 'c': 9}])
In [3]: df.assign(a=lambda df: df.a / 2)
Out[3]:
a b c
0 7.5 15 5
1 10.0 10 7
2 12.5 30 9
In [4]: df
Out[4]:
a b c
0 15 15 5
1 20 10 7
2 25 30 9
Note that the function will be passed the whole dataframe, not only the column you want to modify, so you will need to make sure you select the right column in your lambda.
If you are really concerned about the execution speed of your apply function and you have a huge dataset to work on, you could use swifter to make faster execution, here is an example for swifter on pandas dataframe:
import pandas as pd
import swifter
def fnc(m):
return m*3+4
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
# apply a self created function to a single column in pandas
df["y"] = df.m.swifter.apply(fnc)
This will enable your all CPU cores to compute the result hence it will be much faster than normal apply functions. Try and let me know if it become useful for you.
Let me try a complex computation using datetime and considering nulls or empty spaces. I am reducing 30 years on a datetime column and using apply method as well as lambda and converting datetime format. Line if x != '' else x will take care of all empty spaces or nulls accordingly.
df['Date'] = df['Date'].fillna('')
df['Date'] = df['Date'].apply(lambda x : ((datetime.datetime.strptime(str(x), '%m/%d/%Y') - datetime.timedelta(days=30*365)).strftime('%Y%m%d')) if x != '' else x)
Make a copy of your dataframe first if you need to modify a column
Many answers here suggest modifying some column and assign the new values to the old column. It is common to get the SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. warning. This happens when your dataframe was created from another dataframe but is not a proper copy.
To silence this warning, make a copy and assign back.
df = df.copy()
df['a'] = df['a'].apply('add', other=1)
apply() only needs the name of the function
You can invoke a function by simply passing its name to apply() (no need for lambda). If your function needs additional arguments, you can pass them either as keyword arguments or pass the positional arguments as args=. For example, suppose you have file paths in your dataframe and you need to read files in these paths.
def read_data(path, sep=',', usecols=[0]):
return pd.read_csv(path, sep=sep, usecols=usecols)
df = pd.DataFrame({'paths': ['../x/yz.txt', '../u/vw.txt']})
df['paths'].apply(read_data) # you don't need lambda
df['paths'].apply(read_data, args=(',', [0, 1])) # pass the positional arguments to `args=`
df['paths'].apply(read_data, sep=',', usecols=[0, 1]) # pass as keyword arguments
Don't apply a function, call the appropriate method directly
It's almost never ideal to apply a custom function on a column via apply(). Because apply() is a syntactic sugar for a Python loop with a pandas overhead, it's often slower than calling the same function in a list comprehension, never mind, calling optimized pandas methods. Almost all numeric operators can be directly applied on the column and there are corresponding methods for all of them.
# add 1 to every element in column `a`
df['a'] += 1
# for every row, subtract column `a` value from column `b` value
df['c'] = df['b'] - df['a']
If you want to apply a function that has if-else blocks, then you should probably be using numpy.where() or numpy.select() instead. It is much, much faster. If you have anything larger than 10k rows of data, you'll notice the difference right away.
For example, if you have a custom function similar to func() below, then instead of applying it on the column, you could operate directly on the columns and return values using numpy.select().
def func(row):
if row == 'a':
return 1
elif row == 'b':
return 2
else:
return -999
# instead of applying a `func` to each row of a column, use `numpy.select` as below
import numpy as np
conditions = [df['col'] == 'a', df['col'] == 'b']
choices = [1, 2]
df['new'] = np.select(conditions, choices, default=-999)
As you can see, numpy.select() has very minimal syntax difference from an if-else ladder; only need to separate conditions and choices into separate lists. For other options, check out this answer.
Related
I have a large pandas dataframe with many different types of observations that need different models applied to them. One column is which model to apply, and that can be mapped to a python function which accepts a dataframe and returns a dataframe. One approach would be just doing 3 steps:
split dataframe into n dataframes for n different models
run each dataframe through each function
concatenate output dataframes at the end
This just ends up not being super flexible particularly as models are added and removed. Looking at groupby it seems like I should be able to leverage that to make this look much cleaner code-wise, but I haven't been able to find a pattern that does what I'd like.
Also because of the size of this data, using apply isn't particularly useful as it would drastically slow down the runtime.
Quick example:
df = pd.DataFrame({"model":["a","b","a"],"a":[1,5,8],"b":[1,4,6]})
def model_a(df):
return df["a"] + df["b"]
def model_b(df):
return df["a"] - df["b"]
model_map = {"a":model_a,"b":model_b}
results = df.groupby("model")...
The expected result would look like [2,1,14]. Is there an easy way code-wise to do this? Note that the actual models are much more complicated and involve potentially hundreds of variables with lots of transformations, this is just a toy example.
Thanks!
You can use groupby/apply:
x.name contains the name of the group, here a and b
x contains the sub dataframe
df['r'] = df.groupby('model') \
.apply(lambda x: model_map[x.name](x)) \
.droplevel(level='model')
>>> df
model a b r
0 a 1 1 2
1 b 5 4 1
2 a 8 6 14
Or you can use np.select:
>>> np.select([df['model'] == 'a', df['model'] == 'b'],
[model_a(df), model_b(df)])
array([ 2, 1, 14])
I have a very particular use case where pipeline users are allowed to pass in string expressions that get evaluated by a pipeline via DataFrame.query(). There are obviously far better ways to determine column existence in pandas, however using .query() is my current constraint.
Ideally I'd like to have a query that accepts a single column name and return a dataframe with either 1 column if it exists and no columns if it does not.
Input DataFrame:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
index
a
b
0
1
4
1
2
5
2
3
6
Desired return value when looking for a column that exists:
looking_for = "a"
df.query("#looking_for in columns")
index
a
0
1
1
2
2
3
Desired return value when looking for a column does not exist:
looking_for = "c"
df.query("#looking_for in columns")
index
0
1
2
What I've tried:
This is easy when using the dataframe directly, here is one way. However, after reading pandas query docs and fiddling around I have yet to find a way to do this from the .query() method.
df.loc[:, df.columns.isin(["c"])]
index
0
1
2
query only works with filtering operations. If you're constrained to do this by building string expressions only, you can use df.eval (close sister to df.query):
if df.eval("#looking_for in #df.columns.tolist()"):
print (df.eval("#df[#looking_for]"))
You could also use the top level pd.eval function directly ( pd.eval("df[looking_for]")). More on eval in this post by me.
Without the if check, eval could result in KeyError, so you could alternatively wrap the code inside try-except, this is a bit shorter.
try:
print (df.eval("#df[#looking_for]"))
except KeyError:
# column not present
As commented, I'm not sure why you are insisting on using query, which is not the best in this case. There are several options:
Option 1: `filter:
looking_for = 'c'
df.filter(regex = rf'^{looking_for}$')
Option 2: reindex:
df.reindex([looking_for], axis=1)
Lets say I have a dataframe A with attribute called 'score'.
I can modify the 'score' attribute of the second row by doing:
tmp = A.loc[2]
tmp.score = some_new_value
A.loc[2] = tmp
But I cant do it like this:
A.loc[2].score = some_new_value
Why ?
It will be hard to reproduce your case, because Pandas does not guarantee, when using chained indexing, whether the operation will return a view or a copy of the dataframe.
When you access a "cell" of the dataframe by
A.loc[2].score
you are actually performing two steps: first .loc and then .score (which is essentially chained indexing). The Pandas documentation has a nice post about it here.
The simplest way to prevent this is by consistently using .loc or .iloc to access the rows/columns you need and reassigning the value. Therefore, I would recommend always using either
A.loc[2, "score"] = some_new_value
or
A.at[2, "score"] = some_new_value
This kind of indexing + setting will be translated "under the hood" to:
A.loc.__setitem__((2, 'score'), some_new_value) # modifies A directly
instead of an unreliable chain of __getitem__ and __setitem__.
Let's show an example:
import pandas as pd
dict_ = {'score': [1,2,3,4,5,6], 'other':'a'}
A = pd.DataFrame(dict_)
A
Dataframe:
score other
0 1 a
1 2 a
2 3 a
3 4 a
4 5 a
5 6 a
Now you can do the following, and the values are actually saved:
A.loc[2,'score'] = 'Heyyyy'
A
Dataframe:
score other
0 1 a
1 2 a
2 Heyyyy a
3 4 a
4 5 a
5 6 a
For some reason, the following 2 calls to iloc / loc produce different behavior:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(A=range(3), B=range(3)))
>>> df.iloc[:1]
A B
0 0 0
>>> df.loc[:1]
A B
0 0 0
1 1 1
I understand that loc considers the row labels, while iloc considers the integer-based indices of the rows. But why is the upper bound for the loc call considered inclusive, while the iloc bound is considered exclusive?
Quick answer:
It often makes more sense to do end-inclusive slicing when using labels, because it requires less knowledge about other rows in the DataFrame.
Whenever you care about labels instead of positions, end-exclusive label slicing introduces position-dependence in a way that can be inconvenient.
Longer answer:
Any function's behavior is a trade-off: you favor some use cases over others. Ultimately the operation of .iloc is a subjective design decision by the Pandas developers (as the comment by #ALlollz indicates, this behavior is intentional). But to understand why they might have designed it that way, think about what makes label slicing different from positional slicing.
Imagine we have two DataFrames df1 and df2:
df1 = pd.DataFrame(dict(X=range(4)), index=['a','b','c','d'])
df2 = pd.DataFrame(dict(X=range(3)), index=['b','c','z'])
df1 contains:
X
a 0
b 1
c 2
d 3
df2 contains:
X
b 0
c 1
z 2
Let's say we have a label-based task to perform: we want to get rows between b and c from both df1 and df2, and we want to do it using the same code for both DataFrames. Because b and c don't have the same positions in both DataFrames, simple positional slicing won't do the trick. So we turn to label-based slicing.
If .loc were end-exclusive, to get rows between b and c we would need to know not only the label of our desired end row, but also the label of the next row after that. As constructed, this next label would be different in each DataFrame.
In this case, we would have two options:
Use separate code for each DataFrame: df1.loc['b':'d'] and df2.loc['b':'z']. This is inconvenient because it means we need to know extra information beyond just the rows that we want.
For either dataframe, get the positional index first, add 1, and then use positional slicing: df.iloc[df.index.get_loc('b'):df.index.get_loc('c')+1]. This is just wordy.
But since .loc is end-inclusive, we can just say .loc['b':'c']. Much simpler!
Whenever you care about labels instead of positions, and you're trying to write position-independent code, end-inclusive label slicing re-introduces position-dependence in a way that can be inconvenient.
That said, maybe there are use cases where you really do want end-exclusive label-based slicing. If so, you can use #Willz's answer in this question:
df.loc[start:end].iloc[:-1]
I have a dataframe containing a column type of categorical data, and I have a table (dictionary) of parameter values for each possible type, each entry of which looks like
type1: [x1,x2,x3]
I have working code looking like this:
def foo(df):
[x1,x2,x3] = parameters[df.type]
return (* formula depending on x1,x2,x3,df.A,df.B *)
df['new_variable'] = df.apply(lambda x: foo(x), axis = 1)
Iterating through the rows like this (.apply(..., axis=1)) is of course very slow, and I'd like an efficient solution, but I don't know how to do the table-lookup in a neat manner. For instance, I can't just do
df['new_variable'] = (* formula depending on parameters[df.type][0:3],df.A,df.B *)
as that throws a TypeError: 'Series' objects are mutable, thus they cannot be hashed (I'm naively trying to use a Series as a key, which doesn't work).
I suppose I could make new columns for the parameter values, but that seems inelegant somehow, and I'm sure there is a better way. What's the best way to do this?
EDIT: I just realised I can get a column with the lists of parameters via
df.type.map(parameters)
but I can't access the entries of those lists, as the usual index-conventions don't seem to work. E.g. df.type.map(parameters).loc[:,2] gives an IndexingError: Too many indexers; basically pandas gets confused when having too many dimensions without sticking it all in a MultiIndex. Is there a way to get around this?
EDIT2: a minimal example:
df = pd.DataFrame([['dog',4],['dog',6],['cat',1],['cat',4]],columns = ['type','A'])
parameters = {'dog': [1,2], 'cat': [3,-1]}
def foo(x):
[a,b]=parameters[x.type]
return a * x.A + b
df['new'] = df.apply(foo,axis=1)
produces the desired output
type A new
0 dog 4 6
1 dog 6 8
2 cat 1 2
3 cat 4 11
For a vectorised solution you should split your series of lists, which is what df['type'].map(parameters) gives, into separate columns. You can then leverage efficient NumPy operations:
params = pd.DataFrame(df['type'].map(parameters).values.tolist(),
columns=['a', 'b'])
df['new'] = params['a'] * df['A'] + params['b']
As you note, pd.DataFrame.apply is a thinly veiled, and generally inefficient, loop. It should be avoided wherever possible.