Alternatives to looping through a function taking inputs from several Pandas series - python

I have been using Pandas for a while but have not come across a need to do this until now. Here's the setup. I have several Pandas series (with exactly identical indices), say A, B and C, and a complicated function func(). What I am trying to do (in a non-Pandas-efficient way) is iterate through the index of the series, applying func() at each position.
D = pandas.Series(index=A.index) # First create an empty Series
for i in range(len(A)):
    D[i] = func(A[i], B[i], C[i])
Is there a Pandas-efficient way of doing the above that takes into account that this is essentially an array-based operation? I looked at pandas.DataFrame.apply but the examples show application of simple functions such as numpy.sqrt() that take only one series argument.

If func is built from vectorized operations (arithmetic, NumPy ufuncs and the like), you can pass the Series themselves and get a Series back.
Therefore,
D = func(A, B, C)
should yield D as a pd.Series holding the element-wise result over the values of A, B and C.
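If you want a new column on a DataFrame and func only works on scalars, apply it row by row. With axis=1 each row arrives as a Series whose name is its index label, which can be used to look up the matching values in A, B and C:
df.loc[:, 'new column'] = df.apply(
    lambda row: func(row['data column'], A[row.name], B[row.name], C[row.name]),
    axis=1)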
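The two situations described above (a func that already works on whole Series, and one that only accepts scalars) can be sketched as follows. The names func_vectorized and func_scalar and the sample data are made up for illustration; they stand in for the asker's func(), which is not shown.

```python
import numpy as np
import pandas as pd

# Sample data (assumed); all three Series share the same index.
A = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
B = pd.Series([4.0, 5.0, 6.0], index=["a", "b", "c"])
C = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

# Case 1: a hypothetical func made only of vectorized operations.
def func_vectorized(a, b, c):
    return a * b + np.sqrt(c)

D = func_vectorized(A, B, C)  # a Series, no explicit loop

# Case 2: a hypothetical func with scalar-only logic (an if statement).
def func_scalar(a, b, c):
    return a * b if c > 4 else a + b

# zip walks the three Series in parallel; the index is reattached afterwards.
D2 = pd.Series(
    [func_scalar(a, b, c) for a, b, c in zip(A, B, C)],
    index=A.index,
)
```

The zip version relies on the Series being in the same order, which holds here because their indices are identical.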

Related

applying a function to a pair of pandas series

Suppose I have two series:
s = pd.Series([20, 21, 12])
t = pd.Series([17, 19, 11])
I want to apply a two argument function to the two series to get a series of results (as a series). Now, one way to do it is as follows:
df = pd.concat([s, t], axis=1)
result = df.apply(lambda x: foo(x[0], x[1]), axis=1)  # concat of unnamed Series gives columns 0 and 1
But this seems clunky. Is there any more elegant way?
There are many ways to do what you want.
Depending on the function in question, you may be able to apply it directly to the series. For example, calling s + t returns
0 37
1 40
2 23
dtype: int64
However, if your function is more complicated than simple arithmetic, you may need to get creative. One option is to use the built-in Python map function. For example, calling
list(map(np.add, s, t))
returns
[37, 40, 23]
If the two series have the same index, you can create a series with list comprehension:
result = pd.Series([foo(xs, xt) for xs,xt in zip(s,t)], index=s.index)
If you can't guarantee that the two series have the same index, concat is the way to go as it helps align the index.
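As a rough sketch of the concat-based alignment mentioned above: the two Series below carry the same labels in a different order, and foo is a made-up two-argument function standing in for the real one. The keys and join arguments are my choice for the example, not something the answer prescribes.

```python
import pandas as pd

# Hypothetical two-argument function that only works on scalars.
def foo(a, b):
    return a * 10 + b

# Same labels, different order (assumed sample data).
s = pd.Series([20, 21, 12], index=[0, 1, 2])
t = pd.Series([17, 19, 11], index=[2, 1, 0])

# concat lines the values up by label; join="inner" keeps only shared labels.
aligned = pd.concat([s, t], axis=1, keys=["s", "t"], join="inner")

result = pd.Series(
    [foo(a, b) for a, b in zip(aligned["s"], aligned["t"])],
    index=aligned.index,
)
```

After alignment, label 0 pairs 20 with 11 rather than with 17, which is the pairing a plain zip(s, t) would have produced.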
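If I understand correctly, you can apply a function to two columns and store the result in another column. With axis=1 the function receives each row as a Series, so unpack the two values before calling foo:
df['result'] = df.loc[:, ['s', 't']].apply(lambda row: foo(row['s'], row['t']), axis=1)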
It might be possible to use numpy.vectorize:
from numpy import vectorize
vect_foo = vectorize(foo)
result = vect_foo(s, t)
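A slightly fuller sketch of the numpy.vectorize idea, with a made-up scalar-only foo. As far as I know, the vectorized call hands back a plain NumPy array rather than a Series, so the example wraps the result to restore the index. Also note that numpy.vectorize is a convenience wrapper that still calls foo once per element.

```python
import numpy as np
import pandas as pd

# Hypothetical scalar-only function (the branch prevents passing whole Series).
def foo(a, b):
    return a if a - b > 2 else b

s = pd.Series([20, 21, 12], index=["x", "y", "z"])
t = pd.Series([17, 19, 11], index=["x", "y", "z"])

vect_foo = np.vectorize(foo)

# Wrap the output in a Series so the original index is preserved.
result = pd.Series(vect_foo(s, t), index=s.index)
```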

How to do string operations when aggregating a pandas dataframe?

I need to perform some aggregations on a pandas dataframe. I'm using pandas version 1.3.3.
It seems I am only able to use builtin python functions, such as the max function, to aggregate columns that contain strings. Trying to do the same thing using any custom function (even one that only calls the builtin max) causes an error, as shown in the example below.
Can anyone tell me what I'm doing wrong in this example, and what is the correct way to use a custom function for string aggregation?
import pandas as pd
# Define a dataframe with two columns - one with strings (a-e), one with numbers (1-5)
foo = pd.DataFrame(
    data={
        'string_col': ['a', 'b', 'c', 'd', 'e'],
        'num_col': [1, 2, 3, 4, 5]
    }
)
# Custom aggregation function to concatenate strings
def custom_aggregation_funcion(vals):
    return ", ".join(vals)
# This works - gives a pandas Series with string_col = e, and num_col = 5
a = foo.agg(func={'string_col': max, 'num_col': max})
# This crashes with 'ValueError: cannot perform both aggregation and transformation operations simultaneously'
b = foo.agg(func={'string_col': lambda x: max(x), 'num_col': max})
# Crashes with same error
c = foo.agg(func={'string_col': custom_aggregation_funcion, 'num_col': max})
If you try to run:
foo['string_col'].agg(','.join)
you will see that you get back a Series:
0 a
1 b
2 c
3 d
4 e
Name: string_col, dtype: object
Indeed, your custom function is applied per element, not on the whole Series, hence the error "cannot perform both aggregation and transformation operations simultaneously".
You can change your function to:
# Custom aggregation function to concatenate strings
def custom_aggregation_funcion(vals):
    return ", ".join(vals.to_list())
c = foo.agg(func={'string_col': custom_aggregation_funcion, 'num_col': max})
output:
string_col a, b, c, d, e
num_col 5
dtype: object
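If the way agg dispatches a custom function is a concern (it may differ between pandas versions), one alternative is to skip agg and compute each value directly on its column. This is only a sketch of that workaround, not the method from the answer above; the name summary is made up.

```python
import pandas as pd

foo = pd.DataFrame(
    data={
        "string_col": ["a", "b", "c", "d", "e"],
        "num_col": [1, 2, 3, 4, 5],
    }
)

# Each aggregation is computed on the whole column, so nothing is applied
# element by element behind the scenes.
summary = pd.Series(
    {
        "string_col": ", ".join(foo["string_col"]),
        "num_col": foo["num_col"].max(),
    }
)
```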

How to use list comprehensions for dataframe with two or more variables in python?

I have a dataframe df loaded from Excel.
Is something like this possible in any way?
df["A"] = [foo(b, c) for (b, c) in (df["B"], df["C"])]
I need to pass values from different columns of the dataframe to the function.
Thanks.
You can use df.apply() with axis=1 (so the function is called once per row) to get the corresponding values of df["B"] and df["C"] for each row and pass them to foo, as follows:
df['A'] = df.apply(lambda x: foo(x['B'], x['C']), axis=1)
This is the more idiomatic pandas way of achieving the task. Pandas functions often handle NaN values more gracefully than a hand-written list comprehension, although with apply and axis=1 foo still receives each row's raw values, NaN included.
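For completeness, the list comprehension from the question does work once the two columns are paired with zip. Here is a minimal sketch with made-up data and a made-up foo:

```python
import pandas as pd

# Hypothetical two-argument function.
def foo(b, c):
    return b + c

df = pd.DataFrame({"B": [1, 2, 3], "C": [10, 20, 30]})

# zip yields one (b, c) pair per row; iterating over (df["B"], df["C"])
# directly would instead yield the two whole columns.
df["A"] = [foo(b, c) for b, c in zip(df["B"], df["C"])]
```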

Create list in Pandas DataFrame and each list will concatenate column values from prior rows

I would like to create a pandas DataFrame like the one below. Is there any way to do it? Thank you.
df = pd.DataFrame({'Column 1': ['A', 'B', 'C'],
                   'Column 2': [['A'], ['A', 'B'], ['A', 'B', 'C']]})
  Column 1   Column 2
0        A        [A]
1        B     [A, B]
2        C  [A, B, C]
I assume that you have only Column 1 and want to generate Column 2.
Initially I thought about expanding, with application of
a function joining the argument (Series) but it turned out
that expanding requires a numerical argument.
But your task can be performed using "ordinary" apply, with a function
which performs such "cumulation" on its own.
Start by defining a function that adds its argument to an "internal" list
(held as a function attribute) and returns a copy of the list
gathered so far each time it is called:
def addTbl(x):
    addTbl.tbl.append(x)  # append keeps multi-character values whole
    return addTbl.tbl.copy()
Then initialize its tbl attribute to [], apply it to each element of
Column 1 and save the result in Column 2:
addTbl.tbl = []
df['Column 2'] = df['Column 1'].apply(addTbl)
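An alternative that avoids keeping state on a function attribute is to slice a plain list of the column's values. This is just a sketch of a different technique, not the approach described above; the name values is made up.

```python
import pandas as pd

df = pd.DataFrame({"Column 1": ["A", "B", "C"]})

# Plain Python list of the column's values, in row order.
values = df["Column 1"].tolist()

# Row i gets every value from the first row up to and including row i.
df["Column 2"] = [values[: i + 1] for i in range(len(values))]
```

Each slice creates a new list, so the rows do not share one mutable object. The cost grows quadratically with the number of rows, which is inherent in the requested output.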

How do I expand a range between two excel cells in python 3 and add results to a new column?

I would like to use python 3.4 to compare columns.
I have two columns a and b
If A=B print A in column C.
If B > A, print all numbers between A and B including A and B in column C.
The subsequent compared rows would print in column C after the results of the previous test.
Any help is appreciated. My question wording must be off as I'm sure this has been done before, but I just can't find it here or elsewhere.
As brittenb noticed, try the apply function in pandas.
import pandas as pd
df = pd.read_excel("somefile.xlsx")
df['c'] = df.apply(lambda r: list(range(r['a'], r['b']+1)), axis=1)
Update
If you want to add rows, writing in pandas may get complicated. If you don't care much about speed and memory, classic python style seems easier to understand.
ary = []
for i, r in df.iterrows():
    for j in range(r['a'], r['b'] + 1):
        ary.append((r['a'], r['b'], j))
df = pd.DataFrame(ary, columns=['a', 'b', 'c'])
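If your pandas version provides DataFrame.explode (added in pandas 0.25, to my knowledge), the one-row-per-number layout can also be produced without the nested loop. The sketch below uses made-up data in place of the Excel file; long_df is a hypothetical name.

```python
import pandas as pd

# Stand-in for the data read from Excel (assumed values).
df = pd.DataFrame({"a": [1, 3, 7], "b": [1, 5, 8]})

# Build the inclusive range for each row as a list.
df["c"] = [list(range(a, b + 1)) for a, b in zip(df["a"], df["b"])]

# explode turns each list element into its own row, repeating a and b.
long_df = df.explode("c").reset_index(drop=True)
```

After explode the c column has object dtype; a follow-up astype(int) converts it back if numeric operations are needed.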
