How to do string operations when aggregating a pandas dataframe? - python

I need to perform some aggregations on a pandas dataframe. I'm using pandas version 1.3.3.
It seems I am only able to use built-in Python functions, such as max, to aggregate columns that contain strings. Trying to do the same thing with any custom function (even one that just calls the built-in max) raises an error, as shown in the example below.
Can anyone tell me what I'm doing wrong in this example, and what is the correct way to use a custom function for string aggregation?
import pandas as pd
# Define a dataframe with two columns - one with strings (a-e), one with numbers (1-5)
foo = pd.DataFrame(
    data={
        'string_col': ['a', 'b', 'c', 'd', 'e'],
        'num_col': [1, 2, 3, 4, 5]
    }
)
# Custom aggregation function to concatenate strings
def custom_aggregation_function(vals):
    return ", ".join(vals)
# This works - gives a pandas Series with string_col = e, and num_col = 5
a = foo.agg(func={'string_col': max, 'num_col': max})
# This crashes with 'ValueError: cannot perform both aggregation and transformation operations simultaneously'
b = foo.agg(func={'string_col': lambda x: max(x), 'num_col': max})
# Crashes with the same error
c = foo.agg(func={'string_col': custom_aggregation_function, 'num_col': max})

If you try to run:
foo['string_col'].agg(','.join)
you will see that you get back a Series:
0 a
1 b
2 c
3 d
4 e
Name: string_col, dtype: object
Indeed, your custom function is applied per element, not to the whole Series; the result has the same length as the input, hence the "cannot perform both aggregation and transformation operations simultaneously" error.
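You can verify the per-element behavior in a plain Python session; joining a one-character string just returns it, so the element-wise result has the same length as the input and pandas refuses to treat it as an aggregation:
','.join('a')  # -> 'a' (joins the characters of the string, i.e. a no-op here)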
You can change your function to:
# Custom aggregation function to concatenate strings
def custom_aggregation_function(vals):
    return ", ".join(vals.to_list())
c = foo.agg(func={'string_col': custom_aggregation_function, 'num_col': max})
output:
string_col a, b, c, d, e
num_col 5
dtype: object
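As for why .to_list() makes the difference: my reading of the pandas 1.3.x fallback logic (an interpretation, not documented behavior) is that agg first tries the function element-wise; a plain str has no .to_list(), so that attempt raises AttributeError and pandas falls back to passing the whole Series, where the join aggregates as intended:
'a'.to_list()  # AttributeError: 'str' object has no attribute 'to_list'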

Related

PySpark: Sum up columns from array [duplicate]

I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three and put it in a new column ? (in an automatic way, so that I can change the column list and have new results)
Dataframe with result I want:
col1 col2 col3 result
1 2 3 6
TL;DR
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0) portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result", reduce(add, [col(x) for x in df.columns]))
If you have static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to generate the expression col("col1") + col("col2") + col("col3") programmatically. For this, you can use reduce with operator.add to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you actually get (col("col1") + col("col2")) + col("col3") instead of col("col1") + col("col2") + col("col3"), but the effect is the same.
The col(x) ensures that you are adding Column objects; reducing over the bare column names would just concatenate the strings (producing 'col1col2col3').
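A minimal end-to-end sketch of the above (the session setup and sample values are illustrative only):
from functools import reduce
from operator import add
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['col1', 'col2', 'col3'])
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns])).show()
# +----+----+----+------+
# |col1|col2|col3|result|
# +----+----+----+------+
# |   1|   2|   3|     6|
# +----+----+----+------+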
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns is the list of column names in df. This uses Python's built-in sum, which works because adding the integer 0 (sum's default start value) to a PySpark Column yields a Column.
Add multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Built-in Python's sum works for some folks but gives an error for others (most likely because a wildcard import such as from pyspark.sql.functions import * shadows the built-in sum with PySpark's).
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes the expression to be computed as an input:
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
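For cols_list = ['a', 'b', 'c'] the generated string is simply 'a+b+c'. Since expr accepts any Spark SQL expression, a null-safe variant is easy to build as well; a sketch using the standard coalesce function:
# builds 'coalesce(a, 0) + coalesce(b, 0) + coalesce(c, 0)'
expression = ' + '.join(f'coalesce({c}, 0)' for c in cols_list)
df = df.withColumn('sum_cols', expr(expression))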

How to express dataframe operations using symbols?

Suppose I have a SymPy expression; it seems to me I can only substitute symbols with numbers. The question is: can I substitute a symbol with something else, like a pandas Series? For example,
from sympy import Symbol, Function
a_sym = Symbol('a')
b_sym = Symbol('b')
sum_func_sym = Function('sum_func')
expression = sum_func_sym(a_sym+b_sym)
Is there a way for me to substitute a_sym and b_sym with pandas Series, replace sum_func_sym with the Series sum, and then calculate the result?
import pandas as pd
df = pd.DataFrame({'a': [1,2], 'b': [3,4]})
a = df.a
b = df.b
def sum_func(series):
    return series.sum()
When I do the substitution and replacement I get an error:
expression.subs(a_sym, a).subs(b_sym, b).replace(sum_func_sym, sum_func)
AttributeError: 'Add' object has no attribute 'sum'
Building upon this answer, I came up with the following implementation that seems to work for at least fairly simple use cases:
import pandas as pd
from sympy import Symbol, symbols, lambdify, parse_expr

df = pd.DataFrame({'a': range(5), 'b': range(5)})
my_vars = symbols('a b')  # have to have the same names as the DataFrame columns
expr = parse_expr('a + sqrt(b) + 1')
# Create a callable version of the expression
callable_obj = lambdify(my_vars, expr)
# Call the object, passing in the DataFrame columns as parameters,
# and write the result to a new column of the dataframe
df['result'] = callable_obj(**{
    str(a): df[str(a)]        # pass each column as a variable with the same name
    for a in expr.atoms()     # for all atomic expressions...
    if isinstance(a, Symbol)  # ...that are Symbols (not constants)
})
The output is (as expected):
0 1.000000
1 3.000000
2 4.414214
3 5.732051
4 7.000000
dtype: float64
I assume that you have a dataframe with many columns and you want to add two of them, but the names of the columns to be added are variables, unknown beforehand. Here is a solution for this case. (The f-strings require Python 3.6+; for older versions, adjust the string formatting accordingly.)
def sum(a, b):  # note: this shadows the built-in sum
    global df
    df[f'sum_of_{a}_and_{b}'] = df[a] + df[b]
    # for a more general function f than sum:
    # df[f'sum_of_{a}_and_{b}'] = df.apply(lambda x: f(x[a], x[b]), axis=1)
    # where f is the function to apply instead of the sum
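A quick usage sketch (assuming df already holds columns 'a' and 'b'):
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sum('a', 'b')  # adds a new column 'sum_of_a_and_b'
print(df)
#    a  b  sum_of_a_and_b
# 0  1  3               4
# 1  2  4               6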

Is it possible to apply an agg function without listing out all the columns if I only need to apply a different function to one column

Given
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df
a b c d
0 0.569586 0.730646 0.070111 0.226699
1 0.092704 0.828220 0.190215 0.644188
2 0.815397 0.281504 0.690391 0.115763
3 0.614022 0.303781 0.738919 0.551983
I understand we can use df.agg({'a':'sum','b':'mean','c':'max','d':'min'}) to apply multiple functions across multiple columns.
Is it possible to do it without listing out all the columns if I only need to apply one different function? Something like df.agg({'a': 'sum', //df.columns[1:]//: 'mean'})
AFAIK, no, you need explicit column names as keys. However, you can build the dictionary like this:
agg_dict = {'a': 'sum'}
for c in df.columns[1:]:
    agg_dict[c] = 'mean'
df.agg(agg_dict)  # or df.groupby('some_columns').agg(agg_dict) when grouping
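The same dictionary can be built in one line with a dict comprehension (a minor variation on the above, not a different technique):
agg_dict = {c: ('sum' if c == 'a' else 'mean') for c in df.columns}
df.agg(agg_dict)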
pandas' agg also accepts a list of functions, but a list applies every function to every column; to keep one function per column, zip the column names with the list of functions into a dict:
params = ['sum'] + len(df.columns[1:]) * ['mean']
df.agg(dict(zip(df.columns, params)))

Alternatives to looping through a function taking inputs from several Pandas series

I have been using Pandas for a while but have not come across a need to do this until now. Here's the setup: I have several Pandas Series (with their indices exactly identical), say A, B and C, and a complicated function func(). What I am trying to do (in a non-Pandas-efficient way) is iterate through the index of the series, applying func():
D = pandas.Series(index=A.index)  # First create an empty Series
for i in range(len(A)):
    D[i] = func(A[i], B[i], C[i])
Is there a Pandas-efficient way of doing the above that takes into account that this is essentially an array-based operation? I looked at pandas.DataFrame.apply but the examples show application of simple functions such as numpy.sqrt() that take only one series argument.
If func is built from vectorized (element-wise) operations, then passing pd.Series inputs yields a pd.Series output. Therefore,
D = func(A, B, C)
should yield D as a pd.Series, computed as a vectorized operation over the A, B and C values.
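If func contains Python-level logic (branching and the like) that can't be expressed with vectorized operations, numpy's np.vectorize is a convenient fallback; note that it still loops in Python under the hood, so it buys convenience rather than speed:
import numpy as np
D = pd.Series(np.vectorize(func)(A, B, C), index=A.index)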
If you want a new column on a DataFrame you could solve it this way (Series.apply has no axis argument, so the row-wise lambda has to run on the DataFrame; row.name is the row's index label):
df['new column'] = df.apply(
    lambda row: func(row['data column'], A[row.name], B[row.name], C[row.name]),
    axis=1)

How to row-wise concatenate several columns containing strings?

I have a specific series of datasets which come in the following general form:
import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(range(1000), 3), 't0': ['a', 'b', 'c'], 't1': ['d', 'e', 'f'], 't2': ['g', 'h', 'i'], 't3': ['i', 'j', 'k']})
The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30.
My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):
df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3
So far so good. This code may be simple but it becomes clumsy and inflexible as soon as I receive another dataset, where the number of tn columns goes up. This is where my question comes in:
Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:
df['result'] = ' '.join(df.iloc[:, 1:])
Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings:
Concatenate row-wise across specific columns of dataframe
The key to operate in columns (Series) of strings en mass is the Series.str accessor.
I can think of two .str methods to do what you want.
str.cat()
The first is str.cat. You have to start from a series, but you can pass a list of series (unfortunately you can't pass a dataframe) to concatenate with an optional separator. Using your example:
column_names = df.columns[1:]  # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')
Or, in one line:
df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
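Note: if memory serves, pandas 0.23 added support for passing a DataFrame directly to str.cat, which would collapse this to a single call; worth verifying against your pandas version:
df['result'] = df[df.columns[1]].str.cat(df[df.columns[2:]], sep=' ')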
str.join()
The second is the .str.join() method, which works like Python's built-in str.join(), but for which you need a column (Series) of iterables, for example a column of tuples, which we can get by applying tuple row-wise to a sub-dataframe of the columns you're interested in:
tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')
Or, in one line:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
BTW, don't try the above with list instead of tuple. As of pandas-0.20.1, if the function passed into the DataFrame.apply() method returns a list and the returned list has the same number of entries as the columns of the original (sub)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.
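A quick way to see this caveat in action (behavior as described for pandas around 0.20; newer versions may differ):
df[column_names].apply(tuple, axis=1)  # Series of tuples, as needed for .str.join
df[column_names].apply(list, axis=1)   # may expand back into a DataFrame instead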
Other than using apply to concatenate the strings, you can also use agg to do so.
df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]:
0 a d g i
1 b e h j
2 c f i k
dtype: object
Here is a slightly alternative solution:
In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()
In [58]: df
Out[58]:
n t0 t1 t2 t3 result
0 92 a d g i a d g i
1 916 b e h j b e h j
2 363 c f i k c f i k
