A dumb question. How to manipulate columns in Polars?
Specifically, I have a table with 3 columns: N, Survivors, Deaths.
I want to replace Deaths with Deaths * N and Survivors with Survivors * N.
The following code is not working:
table["SURVIVORS"] = table["SURVIVORS"]*table["N"]
I have this error:
TypeError: 'DataFrame' object does not support 'Series' assignment by index. Use 'DataFrame.with_columns'
thank you
Polars isn't pandas.
You can't assign to part of a df. To put that another way, the left side of the equals sign has to be a full df, so forget about the table["SURVIVORS"] = ... syntax.
You'll mainly use the with_column, with_columns, and select methods. The first two add columns to your existing df based on the expressions you feed them, whereas select only returns what you ask for.
In your case, since you want to overwrite SURVIVORS and DEATHS while keeping N you'd do:
table=table.with_columns([
pl.col('SURVIVORS')*pl.col('N'),
pl.col('DEATHS')*pl.col('N')
])
If you wanted to rename the columns then you might think to do this:
table=table.with_columns([
(pl.col('SURVIVORS')*pl.col('N')).alias('SURVIVORS_N'),
(pl.col('DEATHS')*pl.col('N')).alias('DEATHS_N')
])
In this case, since with_columns just adds columns, you'll still have the original SURVIVORS and DEATHS columns.
This brings it back to select, if you want to have explicit control of what is returned, including the order, then do select:
table=table.select([ 'N',
(pl.col('SURVIVORS')*pl.col('N')).alias('SURVIVORS_N'),
(pl.col('DEATHS')*pl.col('N')).alias('DEATHS_N')
])
One note: you can refer to a column by just giving its name, like 'N' in the previous example, as long as you don't want to do anything to it. If you want to do something with it (math, rename, anything) then you have to wrap it in pl.col('column_name') so that it becomes an Expression.
You can use polars.DataFrame.with_column to overwrite/replace the values of a column.
Return a new DataFrame with the column added or replaced.
Here is an example:
import polars as pl
table = pl.DataFrame({"N": [5, 2, 6],
"SURVIVORS": [1, 10, 3],
"Deaths": [0, 3, 2]})
table= table.with_column(pl.Series(name="SURVIVORS",
values=table["SURVIVORS"]*table["N"]))
# Output :
print(table)
shape: (3, 3)
┌─────┬───────────┬────────┐
│ N ┆ SURVIVORS ┆ Deaths │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪════════╡
│ 5 ┆ 5 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6 ┆ 18 ┆ 2 │
└─────┴───────────┴────────┘
Related
If I have a polars Dataframe and want to perform masked operations, I currently see two options:
# create data
df = pl.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], schema = ['a', 'b']).lazy()
# create a second dataframe for added fun
df2 = pl.DataFrame([[8, 6, 7, 5], [15, 16, 17, 18]], schema=["b", "d"]).lazy()
# define mask
mask = pl.col('a').is_between(2, 3)
Option 1: create filtered dataframe, perform operations and join back to the original dataframe
masked_df = df.filter(mask)
masked_df = masked_df.with_columns( # calculate some columns
[
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b")).alias("new_3"),
]
).join( # throw a join into the mix
df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())
Option 2: mask each operation individually
res = df.with_columns( # calculate some columns - we have to add `pl.when(mask).then()` to each column now
[
pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
]
).join( # we have to construct a convoluted back-and-forth join to apply the mask to the join
df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)
print(res.collect())
Output:
shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a ┆ b ┆ new_1 ┆ new_2 ┆ new_3 ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1 ┆ 5 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 6 ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16 │
│ 3 ┆ 7 ┆ 0.14112 ┆ -0.989992 ┆ 0.428571 ┆ 17 │
│ 4 ┆ 8 ┆ null ┆ null ┆ null ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘
Most of the time, option 2 will be faster, but it gets pretty verbose and is generally harder to read than option 1 when any sort of complexity is involved.
Is there a way to apply a mask more generically to cover multiple subsequent operations?
You can avoid the boilerplate by applying your mask to your operations in a helper function.
def with_mask(operations: list[pl.Expr], mask) -> list[pl.Expr]:
    return [
        pl.when(mask).then(operation)
        for operation in operations
    ]
res = df.with_columns(
    with_mask(
        [
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b")).alias("new_3"),
        ],
        mask,
    )
)
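As an aside on the output: pl.when(mask).then(...) without an otherwise defaults the unmasked rows to null, which is exactly what produces the null rows above. A tiny sketch of that default, using the lazy df from the question:
df.with_columns(
    pl.when(pl.col("a") > 2).then(pl.col("a") * 10).alias("masked")
).collect()
# masked: [null, null, 30, 40]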
You can use a struct with an unnest
Your dfs weren't consistent between being lazy and eager, so I'm going to make them both lazy.
df.join(df2, on='b') \
.with_columns(pl.when(mask).then(
pl.struct([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
]).alias('allcols'))).unnest('allcols') \
.with_columns([pl.when(mask).then(pl.col(x)).otherwise(None)
for x in df2.columns if x not in df.columns]) \
.collect()
I think the heart of your question is how to write a when/then with multiple column outputs, which is covered by the first with_columns; the second with_columns then covers the quasi-semi-join value replacement behavior.
Another way you can write it is to first create a list of the columns in df2 that you want to be subject to the mask and then put those in the struct. The unsightly thing is that you then have to exclude those columns before you do the unnest:
df2_mask_cols=[x for x in df2.columns if x not in df.columns]
df.join(df2, on='b') \
.with_columns(pl.when(mask).then(
pl.struct([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
] + df2_mask_cols).alias('allcols'))) \
.select(pl.exclude(df2_mask_cols)) \
.unnest('allcols') \
.collect()
Surprisingly, this approach was fastest:
df.join(df2, on='b') \
.with_columns([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") /pl.col("b").cast(pl.Float64())).alias("new_3")
]) \
.with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()
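If you want to verify the relative timings yourself, a rough harness along these lines should work (a sketch only: timeit is from the standard library, and df, df2 and mask are the lazy frames and mask defined in the question, so results will vary with data size):
import timeit

def option_3():
    return df.join(df2, on='b') \
        .with_columns([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ]) \
        .with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()

print(timeit.timeit(option_3, number=1000))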
I have a Polars dataframe and I want to calculate a weighted sum of a particular column and the weights is just the positive integer sequence, e.g., 1, 2, 3, ...
For example, assume I have the following dataframe.
import polars as pl
df = pl.DataFrame({"a": [2, 4, 2, 1, 2, 1, 3, 6, 7, 5]})
The result I want is
218 (= 2*1 + 4*2 + 2*3 + 1*4 + ... + 7*9 + 5*10)
How can I achieve this by using only general polars expressions? (The reason I want to use just polars expressions to solve the problem is for speed considerations)
Note: The example is just a simple example where there are just 10 numbers there, but in general, the dataframe height can be any positive number.
Thanks for your help.
Such a weighted sum can be calculated using the dot product (the .dot() method). To generate the weights 1 to n, you can use pl.arange(1, n+1).
If you just need to calculate result of weighted sum:
df.select(
pl.col("a").dot(pl.arange(1, pl.count()+1))
) #.item() - to get value (218)
Keep dataframe
df.with_columns(
pl.col("a").dot(pl.arange(1, pl.count()+1)).alias("weighted_sum")
)
┌─────┬──────────────┐
│ a ┆ weighted_sum │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════════════╡
│ 2 ┆ 218 │
│ 4 ┆ 218 │
│ ... ┆ ... │
│ 3 ┆ 218 │
│ 5 ┆ 218 │
└─────┴──────────────┘
In groupby context
df.groupby("some_cat_col", maintain_order=True).agg([
pl.col("a").dot(pl.arange(1, pl.count()+1))
])
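For instance, with a hypothetical some_cat_col added to a small frame, the weights restart at 1 within each group:
df_grouped = pl.DataFrame({
    "some_cat_col": ["x", "x", "y", "y"],
    "a": [2, 4, 2, 1],
})
df_grouped.groupby("some_cat_col", maintain_order=True).agg([
    pl.col("a").dot(pl.arange(1, pl.count() + 1)).alias("weighted_sum")
])
# x: 2*1 + 4*2 = 10, y: 2*1 + 1*2 = 4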
You should be able to dot the series a with a 1-based weight sequence (Polars has no row index, so build the weights as a Series):
import polars as pl
df = pl.DataFrame({"a": [2, 4, 2, 1, 2, 1, 3, 6, 7, 5]})
weights = pl.Series(range(1, df.height + 1))
print(df["a"].dot(weights))
Alternatively, you can use the __matmul__ operator @
print(df["a"] @ weights)
I have a python function which takes a polars dataframe, a column name and a default value. The function will return a polars series (with length the same as the number of rows of the dataframe) based on the column name and default value.
When the column name is None, just return a series of default values.
When the column name is not None, return that column from dataframe as a series.
And I want to achieve this with just a one-line Polars expression.
Below is an example for better illustration.
The function I want has the following signature.
import polars as pl
def f(df, colname=None, value=0):
    pass
And below are the behaviors I want to have.
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
>>> f(df)
shape: (3,)
Series: '' [i64]
[
0
0
0
]
>>> f(df, "a")
shape: (3,)
Series: '' [i64]
[
1
2
3
]
This is what I tried, basically using polars.when:
def f(df, colname=None, value=0):
    return df.select(
        pl.when(colname is None).then(pl.lit(value)).otherwise(pl.col(colname))
    ).to_series()
But the code errors out when colname is None, with the error message: TypeError: argument 'name': 'NoneType' object cannot be converted to 'PyString'.
Another problem is that the code below runs successfully, but it returns a dataframe with shape (1, 1),
>>> colname = None
>>> value = 0
>>> df.select(pl.when(colname is None).then(pl.lit(value)).otherwise(100))
shape: (1, 1)
┌─────────┐
│ literal │
│ --- │
│ i32 │
╞═════════╡
│ 0 │
└─────────┘
the result I want is a dataframe with shape (3, 1), e.g.,
shape: (3, 1)
┌─────────┐
│ literal │
│ --- │
│ i32 │
╞═════════╡
│ 0 │
│ 0 │
│ 0 │
└─────────┘
What am I supposed to do?
Is there a reason you can't implement the if/else logic in Python?
def f(df, colname=None, value=0):
    if colname is None:
        series = pl.Series().extend_constant(value, df.height)
    else:
        series = df.get_column(colname)
    return series
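Usage then matches the behaviors in the question. One caveat, as an assumption to check against your Polars version: an empty pl.Series() may not default to dtype i64, so if the dtype of the default branch matters, build it explicitly, e.g. pl.Series([value] * df.height).
df = pl.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
print(f(df))       # three copies of the default value 0
print(f(df, "a"))  # column "a" as a Series: 1, 2, 3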
Since rank does not handle null values, I want to write a rank function that can handle null values.
import numpy as np
import polars as pl
df = pl.DataFrame({
'group': ['a'] * 3 + ['b'] * 3,
'value': [2, 1, None, 4, 5, 6],
})
df
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪═══════╡
│ a ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 6 │
└───────┴───────┘
It works well when I don't use groupby, since I can use when/then/otherwise to set values.
def valid_rank(expr: pl.Expr, reverse=False):
    """handle null values when rank"""
    FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
    mask = expr.is_null()
    expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX)
    return pl.when(~mask).then(expr.rank(reverse=reverse)).otherwise(None)
df.with_column(valid_rank(pl.col('value')))
shape: (6, 2)
┌───────┬───────┐
│ group ┆ value │
│ --- ┆ --- │
│ str ┆ f32 │
╞═══════╪═══════╡
│ a ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 3.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 4.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ 5.0 │
└───────┴───────┘
However, in a groupby context, the predicate col("value").is_not_null() in when->then->otherwise is not an aggregation, so I will get
ComputeError: the predicate 'not(col("value").is_null())' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the groupby operation would
Usually I have to make some calculations within each group after rank and I am worried about performance if I use partition_by to split the DataFrame. So I hope that Polars can have expressions like np.putmask or similar functions that can set values within each group.
def valid_rank(expr: pl.Expr, reverse=False):
    """handle null values when rank"""
    FLOAT_MAX, FLOAT_MIN = np.finfo(float).max, np.finfo(float).min
    mask = expr.is_null()
    expr = expr.fill_null(FLOAT_MIN) if reverse else expr.fill_null(FLOAT_MAX)
    # return pl.putmask(expr.rank(reverse=reverse), mask, None)  # hope
    # return expr.rank(reverse=reverse).set(mask, None)  # hope
I propose a solution that is minimally invasive to existing code, requires no changes to the Polars API, and allows masking for a wide variety of Expressions.
Decorator: Maskable
The decorator below is one easy way to add masking capabilities to any suitable Expression. The decorator adds two keyword-only parameters to any Expression: mask and mask_fill.
If mask=None (the default), the decorator passes all parameters to the decorated Expression unaltered. There are no changes needed to existing code for this.
If a mask is provided, then the decorator handles the masking, filtering, recombining, and sorting.
Here's the documentation and code for the decorator. The documentation is simply from my docstring of the function. (It helps me track what I'm doing if I keep the docstring with the function as I write code.)
(I suggest skipping directly to the Examples section first, then coming back to look at the code and documentation.)
Overview
from functools import wraps
import polars.internals as pli
import polars.internals.lazy_functions as plz
def maskable(expr: pli.Expr) -> pli.Expr:
"""
Allow masking of values in an Expression
This function is intended to be used as a decorator for Polars Expressions.
For example:
pl.Expr.rolling_mean = maskable(pl.Expr.rolling_mean)
The intended purpose of this decorator is to change the way that an Expression
handles exceptional values (e.g., None, NaN, Inf, -Inf, zero, negative values, etc.)
Usage Notes:
This decorator should only be applied to Expressions whose return value is the
same length as its input (e.g., rank, rolling_mean, ewm_mean, pct_change).
It is not intended for aggregations (e.g., sum, var, count). (For aggregations,
use "filter" before the aggregration Expression.)
Performance Notes:
This decorator adds significant overhead to a function call when a mask is supplied.
As such, this decorator should not be used in places where other methods would
suffice (e.g., filter, when/then/otherwise, fill_null, etc.)
In cases where no mask is supplied, the overhead of this decorator is insignificant.
Operation
---------
A mask is (conceptually) a column/expression/list of boolean values that controls
which values will not be passed to the wrapped expression:
True, Null -> corresponding value will not be passed to the wrapped
expression, and will instead be filled by the mask_fill value after
the wrapped expression has been evaluated.
False -> corresponding value will be passed to the wrapped expression.
"""
Parameters
"""
Parameters
----------
The decorator will add two keyword-only parameters to any wrapped Expression:
mask
In-Stream Masks
---------------
In-stream masks select a mask based on the current state of a chained expression
at the point where the decorated expression is called. (See examples below)
str -> One of {"Null", "NaN", "-Inf", "+Inf"}
list[str] -> two or more of the above, all of which will be filled with the same
mask_fill value
Static Masks
------------
Static masks select a mask at the time the context is created, and do not reflect
changes in values as a chained set of expressions is evaluated (see examples below)
list[bool] -> external list of boolean values to use as mask
pli.Series -> external Series to use as mask
pli.Expr -> ad-hoc expression that evaluates to boolean
Note: for static masks, it is the responsibility of the caller to ensure that the
mask is the same length as the number of values to which it applies.
No Mask
-------
None -> no masking applied. The decorator passes all parameters and values to the
wrapped expression unaltered. There is no significant performance penalty.
mask_fill
Fill value to be used for all values that are masked.
"""
The Decorator Code
Here is the code for the decorator itself.
from functools import wraps
import polars.internals as pli
import polars.internals.lazy_functions as plz
def maskable(expr: pli.Expr) -> pli.Expr:
    @wraps(expr)
    def maskable_expr(
        self: pli.Expr,
        *args,
        mask: str | list[str] | list[bool] | pli.Series | pli.Expr | None = None,
        mask_fill: float | int | str | bool | None = None,
        **kwargs,
    ):
        if mask is None:
            return expr(self, *args, **kwargs)
        if isinstance(mask, str):
            mask = [mask]
        if isinstance(mask, list):
            if len(mask) == 0:
                return expr(self, *args, **kwargs)
            if isinstance(mask[0], bool):
                mask = pli.Series(mask)
            elif isinstance(mask[0], str):
                mask_dict = {
                    "Null": (self.is_null()),
                    "NaN": (self.is_not_null() & self.is_nan()),
                    "+Inf": (self.is_not_null() & self.is_infinite() & (self > 0)),
                    "-Inf": (self.is_not_null() & self.is_infinite() & (self < 0)),
                }
                mask_str, *mask_list = mask
                mask = mask_dict[mask_str]
                while mask_list:
                    mask_str, *mask_list = mask_list
                    mask = mask | mask_dict[mask_str]
        if isinstance(mask, pli.Series):
            mask = pli.lit(mask)
        mask = mask.fill_null(True)
        return (
            expr(self.filter(mask.is_not()), *args, **kwargs)
            .append(plz.repeat(mask_fill, mask.sum()))
            .sort_by(mask.arg_sort())
        )

    return maskable_expr
Examples
The following are examples of usage from the docstring that resides in my library for this decorator function. (It helps me track which use cases I've tested.)
Simple in-stream mask
Here's an example of a simple "in-stream" mask, based on your Stack Overflow question. The mask prevents null values from disturbing the ranking. The mask is calculated at the time that the wrapped Expression (rank) receives the data.
Note that the changes to the code are not terribly invasive. There's no new expression, no new evaluation context required, and no changes to the Polars API. All work is done by the decorator.
Also, note that there's no when/then/otherwise needed to achieve this; thus, the over grouping expression does not complain.
import polars as pl
pl.Expr.rank = maskable(pl.Expr.rank)
df = pl.DataFrame(
{
"group": ["a"] * 4 + ["b"] * 4,
"a": [1, 2, None, 3, None, 1, None, 2],
}
)
(
df.with_columns(
[
pl.col("a")
.rank()
.over("group")
.alias("rank_a"),
pl.col("a")
.rank(mask='Null', mask_fill=float("NaN"))
.over("group")
.alias("rank_a_masked"),
]
)
)
shape: (8, 4)
┌───────┬──────┬────────┬───────────────┐
│ group ┆ a ┆ rank_a ┆ rank_a_masked │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f32 ┆ f64 │
╞═══════╪══════╪════════╪═══════════════╡
│ a ┆ 1 ┆ 2.0 ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 ┆ 3.0 ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ null ┆ 1.0 ┆ NaN │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 3 ┆ 4.0 ┆ 3.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ null ┆ 1.5 ┆ NaN │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 3.0 ┆ 1.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ null ┆ 1.5 ┆ NaN │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 4.0 ┆ 2.0 │
└───────┴──────┴────────┴───────────────┘
Multiple Masked values
This is an example of a convenience built-in: multiple exceptional values can be provided in a list. Note that masked values all receive the same mask_fill value.
This example also shows the mask working in Lazy mode, one side-benefit of using a decorator approach.
import polars as pl
pl.Expr.rolling_mean = maskable(pl.Expr.rolling_mean)
df = pl.DataFrame(
{
"a": [1.0, 2, 3, float("NaN"), 4, None, float("NaN"), 5],
}
).lazy()
(
df.with_columns(
[
pl.col("a")
.rolling_mean(window_size=2).alias("roll_mean"),
pl.col("a")
.rolling_mean(window_size=2, mask=['NaN', 'Null'], mask_fill=None)
.alias("roll_mean_masked"),
]
).collect()
)
shape: (8, 3)
┌──────┬───────────┬──────────────────┐
│ a ┆ roll_mean ┆ roll_mean_masked │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════╪═══════════╪══════════════════╡
│ 1.0 ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ 1.5 ┆ 1.5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 2.5 ┆ 2.5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NaN ┆ NaN ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ NaN ┆ 3.5 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NaN ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0 ┆ NaN ┆ 4.5 │
└──────┴───────────┴──────────────────┘
In-stream versus Static masks
The code below provides an example of the difference between an "in-stream" mask and a "static" mask.
An in-stream mask makes its masking choices at the time the wrapped expression is executed. This includes the evaluated results of all chained expressions that came before it.
By contrast, a static mask makes its masking choices when the context is created, and it never changes.
For most use cases, in-stream masks and static masks will produce the same result. The example below is one example where they will not.
The sqrt function creates new NaN values during the evaluation of the chained expression. The in-stream mask sees these; the static mask sees column a only as it exists at the time the with_columns context is initiated.
import polars as pl
pl.Expr.ewm_mean = maskable(pl.Expr.ewm_mean)
df = pl.DataFrame(
{
"a": [1.0, 2, -2, 3, -4, 5, 6],
}
)
(
df.with_columns(
[
pl.col("a").sqrt().alias('sqrt'),
pl.col('a').sqrt()
.ewm_mean(half_life=4, mask="NaN", mask_fill=None)
.alias("ewm_instream"),
pl.col("a").sqrt()
.ewm_mean(half_life=4, mask=pl.col('a').is_nan(), mask_fill=None)
.alias("ewm_static"),
pl.col("a").sqrt()
.ewm_mean(half_life=4).alias('ewm_no_mask'),
]
)
)
shape: (7, 5)
┌──────┬──────────┬──────────────┬────────────┬─────────────┐
│ a ┆ sqrt ┆ ewm_instream ┆ ewm_static ┆ ewm_no_mask │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════╪══════════╪══════════════╪════════════╪═════════════╡
│ 1.0 ┆ 1.0 ┆ 1.0 ┆ 1.0 ┆ 1.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ 1.414214 ┆ 1.225006 ┆ 1.225006 ┆ 1.225006 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -2.0 ┆ NaN ┆ null ┆ NaN ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 1.732051 ┆ 1.424003 ┆ NaN ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ -4.0 ┆ NaN ┆ null ┆ NaN ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0 ┆ 2.236068 ┆ 1.682408 ┆ NaN ┆ NaN │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.0  ┆ 2.449490 ┆ 1.892994     ┆ NaN        ┆ NaN         │
└──────┴──────────┴──────────────┴────────────┴─────────────┘
Incorporating external masks
Sometimes we want to mask values based on the results of external inputs, for example in code testing, sensitivity testing, or incorporating results from external libraries/functions. External lists are, by definition, static masks. And it is up to the user to make sure that they are the correct length to match the column that they are masking.
The example below also demonstrates that the scope of a mask (in-stream or static) is limited to one expression evaluation. The mask does not stay in effect for other expressions in a chained expression. (However, you can certainly declare masks for other expressions in a single chain.) In the example below, diff does not see the mask that was used for the prior rank step.
import polars as pl
pl.Expr.rank = maskable(pl.Expr.rank)
pl.Expr.diff = maskable(pl.Expr.diff)
df = pl.DataFrame(
{
"trial_nbr": [1, 2, 3, 4, 5, 6],
"response": [1.0, -5, 9, 3, 2, 10],
}
)
pending = [False, True, False, False, False, False]
(
df.with_columns(
[
pl.col("response").rank().alias('rank'),
pl.col("response")
.rank(mask=pending, mask_fill=float("NaN"))
.alias('rank_masked'),
pl.col("response")
.rank(mask=pending, mask_fill=float("NaN"))
.diff()
.alias('diff_rank'),
]
)
)
shape: (6, 5)
┌───────────┬──────────┬──────┬─────────────┬───────────┐
│ trial_nbr ┆ response ┆ rank ┆ rank_masked ┆ diff_rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f32 ┆ f64 ┆ f64 │
╞═══════════╪══════════╪══════╪═════════════╪═══════════╡
│ 1 ┆ 1.0 ┆ 2.0 ┆ 1.0 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ -5.0 ┆ 1.0 ┆ NaN ┆ NaN │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 9.0 ┆ 5.0 ┆ 4.0 ┆ NaN │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 3.0 ┆ 4.0 ┆ 3.0 ┆ -1.0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 2.0 ┆ 3.0 ┆ 2.0 ┆ -1.0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 10.0 ┆ 6.0 ┆ 5.0 ┆ 3.0 │
└───────────┴──────────┴──────┴─────────────┴───────────┘
Apply
This approach also works with apply (but currently only when apply is used with only one column input, not when a struct is used to pass multiple values to apply).
For example, the simple function below will throw an exception if a value greater than 1.0 is passed to my_func. Normally, this would halt execution, and some kind of workaround would be needed, such as setting the value to something else and remembering to set its value back after apply is run. Using a mask, you can side-step the problem conveniently, without such a workaround.
import polars as pl
import math
pl.Expr.apply = maskable(pl.Expr.apply)
def my_func(value: float) -> float:
    return math.acos(value)
df = pl.DataFrame(
{
"val": [0.0, 0.5, 0.7, 0.9, 1.0, 1.1],
}
)
(
df.with_columns(
[
pl.col('val')
.apply(f=my_func,
mask=pl.col('val') > 1.0,
mask_fill=float('NaN')
)
.alias('result')
]
)
)
shape: (6, 2)
┌─────┬──────────┐
│ val ┆ result │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪══════════╡
│ 0.0 ┆ 1.570796 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0.5 ┆ 1.047198 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0.7 ┆ 0.795399 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0.9 ┆ 0.451027 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1.1 ┆ NaN │
└─────┴──────────┘
"""
The Algorithm
The heart of the algorithm is these few lines:
expr(self.filter(mask.is_not()), *args, **kwargs)
.append(plz.repeat(mask_fill, mask.sum()))
.sort_by(mask.arg_sort())
In steps,
The algorithm filters the results of the current state of the chained expression based on the mask, and passes the filtered results to the wrapped expression for evaluation.
The column of returned values from the evaluated expression is then extended to its former length by filling with the mask_fill values.
An argsort on the mask is then used to restore the filled values at the bottom to their place among the returned values.
This last step assumes that the filter step maintains the relative ordering of rows (which it does), and that the mask_fill values are indistinguishable/identical (which they are).
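A minimal standalone sketch of those three steps outside the decorator, using the same era's API (is_not, pl.repeat), ranking a column while masking its null:
import polars as pl

df = pl.DataFrame({"a": [1.0, None, 3.0, 2.0]})
mask = pl.col("a").is_null()
df.select(
    pl.col("a")
    .filter(mask.is_not())                        # step 1: drop the masked value
    .rank()                                       # evaluate the wrapped expression
    .append(pl.repeat(float("NaN"), mask.sum()))  # step 2: pad back to full length
    .sort_by(mask.arg_sort())                     # step 3: restore the original order
)
# -> [1.0, NaN, 3.0, 2.0]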
Benefits and Limitations
Using this approach has some notable benefits:
The impact to code is minimal. No complex workarounds are needed (e.g., partitioning DataFrames, changing values)
There is zero impact to the Polars API. No new expressions. No new context. No new keywords.
Decorated Expressions continue to run in parallel. The Python code in the decorator merely writes expressions and passes them along; the Python code itself does not run calculations on data.
Decorated Expressions retain their familiar names and signatures, with the exception of two additional keyword-only parameters, which default to no-masking.
Decorated Expressions work in both Lazy and Eager mode.
Decorated Expressions can be used just like any other Expression, including chaining Expressions and using over for grouping.
The performance impact when a decorated Expression is used without masking is insignificant. The decorator merely passes the parameters to the wrapped Expression unaltered.
Some limitations do apply:
The type hints (as stated above) may raise errors with linters and IDEs when using decorated Expressions. Some linters will complain that mask and mask_fill are not valid parameters.
Not all Expressions are suitable for masking. Masking will not work for aggregation expressions, in particular. (Nor should they; simple filtering before an aggregating expression will be far faster than masking.)
Performance Impact
Using a mask with an Expression will impact performance. The additional runtime is associated with filtering based on the mask and then sorting to place the mask_fill values back to their proper place in the results. This last step requires sorting, which is O(n log n), in general.
The performance overhead is more or less independent of the expression that is wrapped by the decorator. Instead, the performance impact is a function of the number of records involved, due to the filtering and the sorting steps.
Whether the performance impact outweighs the convenience of this approach is probably better discussed on GitHub (depending on whether this approach is acceptable).
And there may be ways to reduce the O(n log n) complexity at the heart of the algorithm, if the performance impact is deemed too severe. I tried an approach that interleaves the results returned from the wrapped function with the fill values, based on the mask, but it performed no better than the simple sort that is shown above. Perhaps there is a way to interleave the two in a more performant manner.
I would point out one thing, though. Masking will come with a performance cost (no matter what approach is used). Thus, comparing 'no-masking' to 'masking' may not be terribly informative. Instead, 'masking' accomplished with one algorithm versus another is probably the better comparison.
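For reference, a rough way to measure that overhead on your own data might look like this (a sketch only; it assumes maskable from above is already defined, and the column size and null pattern are arbitrary):
import time
import polars as pl

pl.Expr.rank = maskable(pl.Expr.rank)
n = 1_000_000
df_big = pl.DataFrame({"a": [None if i % 10 == 0 else float(i % 997) for i in range(n)]})
for label, kwargs in [("unmasked", {}), ("masked", {"mask": "Null", "mask_fill": float("NaN")})]:
    start = time.perf_counter()
    df_big.select(pl.col("a").rank(**kwargs))
    print(label, time.perf_counter() - start)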
I have the following Python Code with pandas
df['EVENT_DATE'] = df.apply(
lambda row: datetime.date(year=row.iyear, month=row.imonth, day=row.iday).strftime("%Y-%m-%d"), axis=1)
and want to transform it into a valid Polars Code. Does anyone have any idea to solve this?
I will also answer your generic question and not only your specific use case.
For your specific case, as of polars version >= 0.10.18, the recommended method to create what you want is with the pl.date or pl.datetime expression.
Given this dataframe, pl.date is used to format the date as requested.
import polars as pl
df = pl.DataFrame({
"iyear": [2001, 2001],
"imonth": [1, 2],
"iday": [1, 1]
})
df.with_columns([
pl.date("iyear", "imonth", "iday").dt.strftime("%Y-%m-%d").alias("fmt")
])
This outputs:
shape: (2, 4)
┌───────┬────────┬──────┬────────────┐
│ iyear ┆ imonth ┆ iday ┆ fmt │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str │
╞═══════╪════════╪══════╪════════════╡
│ 2001 ┆ 1 ┆ 1 ┆ 2001-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001 ┆ 2 ┆ 1 ┆ 2001-02-01 │
└───────┴────────┴──────┴────────────┘
Other ways to collect other columns in a single expression
Below is a more generic answer to the main question. We can use a map to get multiple columns as Series, or, if we know we want to format a string column, we can use pl.format. The map offers the most utility.
df.with_columns([
# string fmt over multiple expressions
pl.format("{}-{}-{}", "iyear", "imonth", "iday").alias("date"),
# columnar lambda over multiple expressions
pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2"),
])
This outputs
shape: (2, 5)
┌───────┬────────┬──────┬──────────┬──────────┐
│ iyear ┆ imonth ┆ iday ┆ date ┆ date2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞═══════╪════════╪══════╪══════════╪══════════╡
│ 2001 ┆ 1 ┆ 1 ┆ 2001-1-1 ┆ 2001-1-1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2001 ┆ 2 ┆ 1 ┆ 2001-2-1 ┆ 2001-2-1 │
└───────┴────────┴──────┴──────────┴──────────┘
Avoid row-wise operations
Though the accepted answer is correct in its result, it's not the recommended way to apply operations over multiple columns in Polars. Accessing rows is tremendously slow: it incurs a lot of cache misses, needs to run slow Python bytecode, and kills all parallelization/query optimization.
Note
In this specific case, the map creating string data is not recommended:
pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2"). Because of the way memory is laid out, and because we create a new column per string operation, this is actually quite expensive (only with string data). That is why pl.format and pl.concat_str exist.
Polars apply will return the row data as a tuple, so you would need to use numerical indices instead. Example:
import datetime
import polars as pl
df = pl.DataFrame({"iyear": [2020, 2021],
"imonth": [1, 2],
"iday": [3, 4]})
df['EVENT_DATE'] = df.apply(
lambda row: datetime.date(year=row[0], month=row[1], day=row[2]).strftime("%Y-%m-%d"))
In case df contains more columns, or the columns are in a different order, you could use apply on df[["iyear", "imonth", "iday"]] rather than df to ensure the indices refer to the right elements.
There may be better ways to achieve this, but this comes closest to the Pandas code.
On a separate note, my guess is you don't want to store the dates as a string, but rather as a proper pl.Date. You could modify the code in this way:
def days_since_epoch(dt):
    return (dt - datetime.date(1970, 1, 1)).days
df['EVENT_DATE_dt'] = df.apply(
lambda row: days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2])), return_dtype=pl.Date)
where we first convert the Python date to days since Jan 1, 1970, and then convert to a pl.Date using apply's return_dtype argument. The cast to pl.Date needs an int rather than a Python datetime, as it stores the data as an int ultimately. This is most easily seen by simply accessing the dates:
print(type(df["EVENT_DATE_dt"][0])) # >>> <class 'int'>
print(type(df["EVENT_DATE_dt"].dt[0])) # >>> <class 'datetime.date'>
It would be great if the cast could operate on the Python datetime directly.
edit: on the conversation on performance vs Pandas.
For both Pandas and Polars, you could speed this up further if you have many duplicate rows (for year/month/day), by using a cache to speed up the apply. I.e.:
from functools import lru_cache
@lru_cache
def row_to_date(row):
    return days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2]))
df['EVENT_DATE_dt'] = df.apply(row_to_date, return_dtype=pl.Date)
This will improve runtime when there are many duplicate entries, at the expense of some memory. If there are no duplicates, it will probably slow you down.