applying a function to a pair of pandas series - python

Suppose I have two series:
s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])
I want to apply a two-argument function to the two series to get a series of results. Now, one way to do it is as follows:
df = pd.concat([s, t], axis=1)
result = df.apply(lambda x: foo(x[0], x[1]), axis=1)
But this seems clunky. Is there any more elegant way?

There are many ways to do what you want.
Depending on the function in question, you may be able to apply it directly to the series. For example, calling s + t returns
0 37
1 40
2 23
dtype: int64
However, if your function is more complicated than simple arithmetic, you may need to get creative. One option is to use the built-in Python map function. For example, calling
list(map(np.add, s, t))
returns
[37, 40, 23]
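If you want the result back as a Series rather than a list, you can wrap it and reuse the index (a small sketch, assuming the two series share an index):
import numpy as np
import pandas as pd
result = pd.Series(list(map(np.add, s, t)), index=s.index)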

If the two series have the same index, you can create a series with list comprehension:
result = pd.Series([foo(xs, xt) for xs, xt in zip(s, t)], index=s.index)
If you can't guarantee that the two series have the same index, concat is the way to go as it helps align the index.
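To make the alignment point concrete, here is a minimal sketch, assuming a hypothetical two-argument foo:
import pandas as pd
def foo(a, b):  # hypothetical function of two scalars
    return a * 10 + b
s = pd.Series([20, 21, 12], index=[0, 1, 2])
t = pd.Series([17, 19, 11], index=[2, 1, 0])  # same labels, different order
df = pd.concat([s, t], axis=1, keys=['s', 't'])  # concat aligns on the index
result = df.apply(lambda row: foo(row['s'], row['t']), axis=1)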

If I understand correctly, you can use this to apply a function over two columns and copy the result into another column:
df['result'] = df.loc[:, ['s', 't']].apply(lambda row: foo(*row), axis=1)

It might be possible to use numpy.vectorize:
from numpy import vectorize
vect_foo = vectorize(foo)
result = vect_foo(s, t)
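One caveat: np.vectorize returns a plain NumPy array, so you may want to wrap the result back into a Series. A sketch, again with a hypothetical scalar foo:
import numpy as np
import pandas as pd
def foo(a, b):  # hypothetical function of two scalars
    return max(a, b)
vect_foo = np.vectorize(foo)
result = pd.Series(vect_foo(s, t), index=s.index)  # vect_foo(s, t) is an ndarray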

Is there a way to use a method/function as an expression for .loc() in pandas?

I've been going crazy trying to figure this out. I'm trying to avoid using df.iterrows() to iterate through the rows of a dataframe, as it's quite time consuming and .loc() is better from what I've seen.
I know this works:
df = df.loc[df.number == 3, :]
And that'll basically set df to be each row where the "number" column is equal to 3.
But, I get an error when I try something like this:
df = df.loc[someFunction(df.number), :]
What I want is to get every row where someFunction() returns True when given that row's "number" value as the parameter.
For some reason, it's passing the entire column (the dataframe's entire "number" column, in this example) instead of each row's value one at a time, like in the previous example.
Again, I know I can just use a for loop and .iterrows(), but I'm working with around 280,000 rows and it just takes longer than I'd like. Also have tried using a lambda function among other things.
Apply is slow. If you can, put the complex logic in a vectorized function that takes whole series as arguments:
import pandas as pd
df = pd.DataFrame()
df['a'] = [7, 6, 5, 4, 3, 2]
df['b'] = [1, 2, 3, 4, 5, 6]
def my_func(series1, series2):
    return (series2 > 3) | (series1 == series2)
df.loc[my_func(df.b, df.a), 'new_column_name'] = True
I think this is what you need:
import pandas as pd
df = pd.DataFrame({"number": [x for x in range(10)]})
def someFunction(row):
    if row > 5:
        return True
    else:
        return False
df = df.loc[df.number.apply(someFunction)]
print(df)
Output:
number
6 6
7 7
8 8
9 9
You can use an anonymous function with .loc
x refers to the dataframe you are indexing
df.loc[lambda x: x.number > 5, :]
Two options I can think of:
Create a new column using the pandas apply() method and a lambda function that returns either True or False depending on someFunction(). Then use loc to filter on the new column you just created.
Use a for loop and df.itertuples(), as it is much faster than iterrows (see the sketch below). Make sure to look up the documentation, as the syntax is slightly different for itertuples.
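As a rough sketch of the itertuples option, assuming someFunction takes a single scalar value:
# collect the index labels of the rows that pass, then select them with .loc
keep_idx = [row.Index for row in df.itertuples() if someFunction(row.number)]
df_filtered = df.loc[keep_idx]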
Something like this will work:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['number'] = np.arange(10)
display(df[df['number'] > 5])
display(df[(df['number'] > 2) & (df['number'] < 7)])

Apply Pandas series string function to the whole dataframe

I want to apply the method pd.Series.str.join() to my whole dataframe
A B
[foo,bar] [1,2]
[bar,foo] [3,4]
Desired output:
A B
foobar 12
barfoo 34
For now I used a quite slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seem to work. For extension of the question, how can I efficiently apply series method to the whole dataframe?
Thank you.
There will always be a problem when trying to join lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
df = pd.DataFrame({'A':[['Foo','Bar'],['Bar','Foo']],'B':[[1,2],[3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df], index=df.columns).T
Which correctly outputs:
A B
FooBar 12
BarFoo 34
import pandas as pd
df = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']], 'B': [[1, 2], [3, 4]]})
# Only needed if 'B' holds lists of integers; otherwise this step can be skipped
df['B'] = df['B'].transform(lambda value: [str(x) for x in value])
df = df.applymap(lambda value: ''.join(value))
Explanation: applymap() applies the given function to every individual value of the dataframe.
I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have quite a big dataframe, so a for loop is not really scalable.
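To make that concrete, a quick sketch on the example frame; note it assumes every cell is a list of strings, so B is converted first:
import pandas as pd
df = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']], 'B': [[1, 2], [3, 4]]})
df['B'] = df['B'].map(lambda xs: [str(x) for x in xs])  # join needs strings
result = df.stack().str.join('').unstack()  # one .str.join call covers every column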

How to avoid excessive lambda functions in pandas DataFrame assign and apply method chains

I am trying to translate a pipeline of manipulations on a dataframe in R over to its Python equivalent. A basic example of the pipeline is as follows, incorporating a few mutate and filter calls:
library(tidyverse)
calc_circle_area <- function(diam) pi / 4 * diam^2
calc_cylinder_vol <- function(area, length) area * length
raw_data <- tibble(cylinder_name=c('a', 'b', 'c'), length=c(3, 5, 9), diam=c(1, 2, 4))
new_table <- raw_data %>%
mutate(area = calc_circle_area(diam)) %>%
mutate(vol = calc_cylinder_vol(area, length)) %>%
mutate(is_small_vol = vol < 100) %>%
filter(is_small_vol)
I can replicate this in pandas without too much trouble but find that it involves some nested lambda calls when using assign to do an apply (first where the dataframe caller is an argument, and subsequently with dataframe rows as the argument). This tends to obscure the meaning of the assign call, where I would like to specify something more to the point (like the R version) if at all possible.
import pandas as pd
import math
calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length
raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})
new_table = (
    raw_data
    .assign(area=lambda df: df.diam.apply(calc_circle_area))
    .assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
I am aware that the .assign(area=lambda df: df.diam.apply(calc_circle_area)) could be written as .assign(area=raw_data.diam.apply(calc_circle_area)) but only because the diam column already exists in the original dataframe, which may not always be the case.
I also realize that the calc_... functions here are vectorizable, meaning I could also do things like
.assign(area=lambda df: calc_circle_area(df.diam))
.assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))
but again, since most functions aren't vectorizable, this wouldn't work in most cases.
TL;DR I am wondering if there is a cleaner way to "mutate" columns on a dataframe that doesn't involve double-nesting lambda statements, like in something like:
.assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
Are there best practices for this type of application or is this the best one can do within the context of method chaining?
The best practice is to vectorize operations.
The reason for this is performance, because apply is very slow. You are already taking advantage of vectorization in the R code, and you should continue to do so in Python. You will find that, because of this performance consideration, most of the functions you need actually are vectorizable.
That will get rid of your inner lambdas. For the outer lambdas over the df, I think what you have is the cleanest pattern. The alternative is to repeatedly reassign to the raw_data variable, or some other intermediate variable(s), but this doesn't fit the method chaining style you are asking for.
There are also Python packages like dfply that aim to mimic the dplyr feel in Python. These do not receive the same level of support as core pandas does, so keep that in mind if you want to go this route.
Or, if you want to just save a bit of typing, and all the functions will be only over columns, you can create a glue function that unpacks the columns for you and passes them along.
def df_apply(col_fn, *col_names):
    def inner_fn(df):
        cols = [df[col] for col in col_names]
        return col_fn(*cols)
    return inner_fn
Then usage ends up looking something like this:
new_table = (
    raw_data
    .assign(area=df_apply(calc_circle_area, 'diam'))
    .assign(vol=df_apply(calc_cylinder_vol, 'area', 'length'))
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
It is also possible to write this without taking advantage of vectorization, in case that does come up.
def df_apply_unvec(fn, *col_names):
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn
I used named functions for extra clarity. But it can be condensed with lambdas into something that looks much like your original format, just generic.
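For example, a condensed version along those lines might look like this; the usage on the diam column is hypothetical:
df_apply_unvec = lambda fn, *col_names: lambda df: df.apply(
    lambda row: fn(*[row[col] for col in col_names]), axis=1)
new_table = raw_data.assign(area=df_apply_unvec(calc_circle_area, 'diam'))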
As @mcskinner has pointed out, vectorized operations are better and faster. If, however, your operation cannot be vectorized and you still want to apply a function, you can use the pipe method, which allows for cleaner method chaining:
import math
def area(df):
    df['area'] = math.pi / 4 * df['diam']**2
    return df

def vol(df):
    df['vol'] = df['area'] * df['length']
    return df
new_table = (raw_data
    .pipe(area)
    .pipe(vol)
    .assign(is_small_vol=lambda df: df.vol < 100)
    .loc[lambda df: df.is_small_vol]
)
new_table
cylinder_name length diam area vol is_small_vol
0 a 3 1 0.785398 2.356194 True
1 b 5 2 3.141593 15.707963 True

Savgol filter over dataframe columns

I'm trying to apply a savgol filter from SciPy to smooth my data. I've successfully applied the filter by selecting each column separately, defining a new y value, and plotting it. However, I want to apply the function more efficiently, across the whole dataframe.
y0 = alldata_raw.iloc[:,0]
w0 = savgol_filter(y0, 41, 1)
My first thought was to create an empty list, write a for loop applying the function to each column, append each result to the list, and finally concatenate them. However, I got an error: 'TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid'
smoothed_array = []
for key,values in alldata_raw.iteritems():
y = savgol_filter(values, 41, 1)
smoothed_array.append(y)
alldata_smoothed = pd.concat(smoothed_array, axis=1)
Instead I tried using the DataFrame apply() method, but I'm having issues with that too. I get the error message: 'TypeError: expected x and y to have same length'
alldata_smoothed = alldata_raw.apply(savgol_filter(alldata_raw, 41, 1), axis=1)
print(alldata_smoothed)
I'm quite new to python so any advice on how to make each method work and which is preferable would be appreciated!
In order to use the filter first create a function that takes a single argument - the column data. Then you can apply it to dataframe columns like this:
from scipy.signal import savgol_filter
def my_filter(x):
    return savgol_filter(x, 41, 1)
alldata_smoothed = alldata_raw.apply(my_filter)
You could also go with a lambda function:
alldata_smoothed = alldata_raw.apply(lambda x: savgol_filter(x,41,1))
Note that axis=1 in apply means the function is applied to dataframe rows; what you need here is the default axis=0, which applies it to each column.
That was pretty general but the docs for savgol_filter tell me that it accepts an axis argument too. So in this specific case you could apply the filter to the whole dataframe at once. This will probably be more performant but I haven't checked =).
alldata_smoothed = pd.DataFrame(savgol_filter(alldata_raw, 41, 1, axis=0),
columns=alldata_raw.columns,
index=alldata_raw.index)

Pandas - apply transformation on index of DataFrame

This is my code
import pandas as pd
x = pd.DataFrame.from_dict({'A':[1,2,3,4,5,6], 'B':[10, 20, 30, 44, 48, 81]})
a = x['A'].apply(lambda t: t%2==0) # works
c = x.index.apply(lambda t: t%2==0) # error
How can I make that code work in the easiest way? I know how to reset_index() and then treat it as a column, but I was curious if it's possible to operate on the index as if it's a regular column.
You have to convert the Index to a Series using to_series if you want to call apply, since Index objects have no apply method:
c = x.index.to_series().apply(lambda t: t%2==0)
There are a limited number of methods and operations for an Index: http://pandas.pydata.org/pandas-docs/stable/api.html#modifying-and-computations
Pandas hasn't implemented pd.Index.apply. You can, for simple calculations, use the underlying NumPy array:
c = x.index.values % 2 == 0
As opposed to lambda-based solutions, this takes advantage of vectorised operations.
Pandas Index objects have a map method:
c = x.index.map(lambda t: t%2==0) # Index([True, False, True, False, True, False], dtype='object')
Note that this returns an Index, not a pandas Series.
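If you want to use the mapped result as a boolean mask, one option (a sketch, not the only way) is to cast it to a plain boolean array first, since an object-dtype Index can be awkward as an indexer:
import numpy as np
mask = np.asarray(x.index.map(lambda t: t % 2 == 0), dtype=bool)
even_rows = x[mask]  # keeps rows 0, 2 and 4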
