Weighted sum of a column in Polars dataframe - python

I have a Polars dataframe and I want to calculate a weighted sum of a particular column, where the weights are just the positive integer sequence 1, 2, 3, ...
For example, assume I have the following dataframe.
import polars as pl
df = pl.DataFrame({"a": [2, 4, 2, 1, 2, 1, 3, 6, 7, 5]})
The result I want is
218 (= 2*1 + 4*2 + 2*3 + 1*4 + ... + 7*9 + 5*10)
How can I achieve this by using only general polars expressions? (The reason I want to use just polars expressions to solve the problem is for speed considerations)
Note: The example is just a simple example where there are just 10 numbers there, but in general, the dataframe height can be any positive number.
Thanks for your help.

Such a weighted sum can be calculated with the dot product (the .dot() method). To generate the range of weights from 1 to n, you can use pl.arange(1, n + 1).
If you just need the result of the weighted sum:
df.select(
    pl.col("a").dot(pl.arange(1, pl.count() + 1))
)  # .item() - to get the value (218)
To keep the dataframe:
df.with_columns(
    pl.col("a").dot(pl.arange(1, pl.count() + 1)).alias("weighted_sum")
)
┌─────┬──────────────┐
│ a   ┆ weighted_sum │
│ --- ┆ ---          │
│ i64 ┆ i64          │
╞═════╪══════════════╡
│ 2   ┆ 218          │
│ 4   ┆ 218          │
│ ... ┆ ...          │
│ 3   ┆ 218          │
│ 5   ┆ 218          │
└─────┴──────────────┘
In a groupby context:
df.groupby("some_cat_col", maintain_order=True).agg([
    pl.col("a").dot(pl.arange(1, pl.count() + 1))
])
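For illustration, here is a minimal sketch of the groupby case with a made-up some_cat_col (the column name and data are hypothetical; the expression is the same as above):
df_grouped = pl.DataFrame({
    "some_cat_col": ["x", "x", "y", "y", "y"],
    "a": [2, 4, 2, 1, 2],
})
df_grouped.groupby("some_cat_col", maintain_order=True).agg(
    pl.col("a").dot(pl.arange(1, pl.count() + 1)).alias("weighted_sum")
)
# "x": 2*1 + 4*2 = 10, "y": 2*1 + 1*2 + 2*3 = 10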

You should be able to dot the series a with a 1-based integer range (Polars has no dataframe index, so build the weights explicitly):
import polars as pl
df = pl.DataFrame({"a": [2, 4, 2, 1, 2, 1, 3, 6, 7, 5]})
weights = pl.arange(1, df.height + 1, eager=True)
print(df["a"].dot(weights))
Alternatively, you can use the __matmul__ operator @
print(df["a"] @ weights)

Related

Masking a polars dataframe for complex operations

If I have a polars Dataframe and want to perform masked operations, I currently see two options:
# create data
df = pl.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], schema = ['a', 'b']).lazy()
# create a second dataframe for added fun
df2 = pl.DataFrame([[8, 6, 7, 5], [15, 16, 17, 18]], schema=["b", "d"]).lazy()
# define mask
mask = pl.col('a').is_between(2, 3)
Option 1: create filtered dataframe, perform operations and join back to the original dataframe
masked_df = df.filter(mask)
masked_df = masked_df.with_columns(  # calculate some columns
    [
        pl.col("a").sin().alias("new_1"),
        pl.col("a").cos().alias("new_2"),
        (pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # throw a join into the mix
    df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())
Option 2: mask each operation individually
res = df.with_columns(  # calculate some columns - we have to add `pl.when(mask).then()` to each column now
    [
        pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
        pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
        pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # we have to construct a convoluted back-and-forth join to apply the mask to the join
    df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)
print(res.collect())
Output:
shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a   ┆ b   ┆ new_1    ┆ new_2     ┆ new_3    ┆ d    │
│ --- ┆ --- ┆ ---      ┆ ---       ┆ ---      ┆ ---  │
│ i64 ┆ i64 ┆ f64      ┆ f64       ┆ f64      ┆ i64  │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1   ┆ 5   ┆ null     ┆ null      ┆ null     ┆ null │
│ 2   ┆ 6   ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16   │
│ 3   ┆ 7   ┆ 0.14112  ┆ -0.989992 ┆ 0.428571 ┆ 17   │
│ 4   ┆ 8   ┆ null     ┆ null      ┆ null     ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘
Most of the time, option 2 will be faster, but it gets pretty verbose and is generally harder to read than option 1 when any sort of complexity is involved.
Is there a way to apply a mask more generically to cover multiple subsequent operations?
You can avoid the boilerplate by applying your mask to your operations in a helper function.
def with_mask(operations: list[pl.Expr], mask) -> list[pl.Expr]:
    return [
        pl.when(mask).then(operation)
        for operation in operations
    ]
res = df.with_columns(
    with_mask(
        [
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b")).alias("new_3"),
        ],
        mask,
    )
)
You can use a struct with an unnest.
Your dfs weren't consistent between being lazy and eager, so I'm going to make them both lazy.
df.join(df2, on='b') \
    .with_columns(pl.when(mask).then(
        pl.struct([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ]).alias('allcols'))).unnest('allcols') \
    .with_columns([pl.when(mask).then(pl.col(x)).otherwise(None)
                   for x in df2.columns if x not in df.columns]) \
    .collect()
I think the heart of your question is how to write a when/then with multiple column outputs, which is covered by the first with_columns; the second with_columns covers the quasi-semi-join value replacement behavior.
Another way you can write it is to first create a list of the columns in df2 that you want to be subject to the mask and then put those in the struct. The unsightly thing is that you then have to exclude those columns before you do the unnest:
df2_mask_cols = [x for x in df2.columns if x not in df.columns]
df.join(df2, on='b') \
    .with_columns(pl.when(mask).then(
        pl.struct([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ] + df2_mask_cols).alias('allcols'))) \
    .select(pl.exclude(df2_mask_cols)) \
    .unnest('allcols') \
    .collect()
Surprisingly, this approach was fastest:
df.join(df2, on='b') \
    .with_columns([
        pl.col("a").sin().alias("new_1"),
        pl.col("a").cos().alias("new_2"),
        (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
    ]) \
    .with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()

How can I use polars.when based on whether a column name is None?

I have a python function which takes a polars dataframe, a column name and a default value. The function will return a polars series (with length the same as the number of rows of the dataframe) based on the column name and default value.
When the column name is None, just return a series of default values.
When the column name is not None, return that column from dataframe as a series.
And I want to achieve this with just a one-line polars expression.
Below is an example for better illustration.
The function I want has the following signature.
import polars as pl
def f(df, colname=None, value=0):
    pass
And below are the behaviors I want to have.
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
>>> f(df)
shape: (3,)
Series: '' [i64]
[
0
0
0
]
>>> f(df, "a")
shape: (3,)
Series: '' [i64]
[
1
2
3
]
This is what I tried, basically using polars.when.
def f(df, colname=None, value=0):
    return df.select(pl.when(colname is None).then(pl.lit(value)).otherwise(pl.col(colname))).to_series()
But the code errors out when colname is None, with the error message: TypeError: argument 'name': 'NoneType' object cannot be converted to 'PyString'.
Another problem is that the code below runs successfully, but it returns a dataframe with shape (1, 1),
>>> colname = None
>>> value = 0
>>> df.select(pl.when(colname is None).then(pl.lit(value)).otherwise(100))
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 0       │
└─────────┘
the result I want is a dataframe with shape (3, 1), e.g.,
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i32     │
╞═════════╡
│ 0       │
│ 0       │
│ 0       │
└─────────┘
What am I supposed to do?
Is there a reason you can't implement the if/else logic in Python?
def f(df, colname=None, value=0):
    if colname is None:
        series = pl.Series().extend_constant(value, df.height)
    else:
        series = df.get_column(colname)
    return series
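If you do want a single Polars expression, you can pick the expression in Python and let Polars evaluate it; this is only a sketch, and it assumes a polars version where pl.repeat accepts an expression (such as pl.count()) for its length:
def f(df, colname=None, value=0):
    # choose the expression in Python; pl.repeat broadcasts the default value
    # to one row per row of the frame (pl.count() is the frame height here)
    expr = pl.col(colname) if colname is not None else pl.repeat(value, pl.count())
    return df.select(expr).to_series()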

Manipulating data in Polars

A dumb question: how do I manipulate columns in Polars?
Explicitly, I have a table with 3 columns: N, Survivors, Deaths.
I want to replace Deaths by Deaths * N and Survivors by Survivors * N.
The following code is not working:
table["SURVIVORS"] = table["SURVIVORS"]*table["N"]
I have this error:
TypeError: 'DataFrame' object does not support 'Series' assignment by index. Use 'DataFrame.with_columns'
Thank you.
Polars isn't pandas.
You can't assign to part of a df. To put that another way, the left side of the equals sign has to be a full df, so forget about the table["SURVIVORS"] = ... syntax.
You'll mainly use the with_column, with_columns, and select methods. The first two will add columns to your existing df based on the expression you feed them, whereas select will only return what you ask for.
In your case, since you want to overwrite SURVIVORS and DEATHS while keeping N you'd do:
table = table.with_columns([
    pl.col('SURVIVORS') * pl.col('N'),
    pl.col('DEATHS') * pl.col('N')
])
If you wanted to rename the columns then you might think to do this:
table = table.with_columns([
    (pl.col('SURVIVORS') * pl.col('N')).alias('SURVIVORS_N'),
    (pl.col('DEATHS') * pl.col('N')).alias('DEATHS_N')
])
In this case, since with_columns just adds columns, you'll still have the original SURVIVORS and DEATHS columns.
This brings us back to select: if you want explicit control over what is returned, including the order, then use select:
table = table.select([
    'N',
    (pl.col('SURVIVORS') * pl.col('N')).alias('SURVIVORS_N'),
    (pl.col('DEATHS') * pl.col('N')).alias('DEATHS_N')
])
One note: you can refer to a column by just giving its name, like 'N' in the previous example, as long as you don't want to do anything to it. If you want to do something with it (math, rename, anything), then you have to wrap it in pl.col('column_name') so that it becomes an expression.
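A small illustration of that note, using the column names from the question (the derived name N_doubled is just for this example):
table.select([
    'N',                                   # bare name: passed through unchanged
    (pl.col('N') * 2).alias('N_doubled'),  # any operation needs pl.col(...)
])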
You can use polars.DataFrame.with_column to overwrite/replace the values of a column.
Return a new DataFrame with the column added or replaced.
Here is an example:
import polars as pl
table = pl.DataFrame({"N": [5, 2, 6],
                      "SURVIVORS": [1, 10, 3],
                      "Deaths": [0, 3, 2]})
table = table.with_column(pl.Series(name="SURVIVORS",
                                    values=table["SURVIVORS"] * table["N"]))
# Output :
print(table)
shape: (3, 3)
┌─────┬───────────┬────────┐
│ N   ┆ SURVIVORS ┆ Deaths │
│ --- ┆ ---       ┆ ---    │
│ i64 ┆ i64       ┆ i64    │
╞═════╪═══════════╪════════╡
│ 5   ┆ 5         ┆ 0      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 20        ┆ 3      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 6   ┆ 18        ┆ 2      │
└─────┴───────────┴────────┘

Pandas REPLACE equivalent in Python Polars

Is there an elegant way to recode values in a polars dataframe?
For example
1->0,
2->0,
3->1...
In pandas it is as simple as:
df.replace([1,2,3,4,97,98,99],[0,0,1,1,2,2,2])
Edit 2023-02-12
As of polars >=0.16.4 there is a map_dict expression.
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})
mapper = {
    1: 0,
    2: 0,
    3: 10,
    4: 10
}
df.select(
    pl.all().map_dict(mapper, default=pl.col("a"))
)
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 10  │
│ 10  │
│ 5   │
└─────┘
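The same mapping can also be applied to a single column instead of pl.all(); a sketch, again assuming polars >= 0.16.4:
df.with_columns(
    pl.col("a").map_dict(mapper, default=pl.col("a")).alias("a_mapped")
)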
Before Edit
In polars you can build columnar if/else statements, called if -> then -> otherwise expressions.
So let's say we have this DataFrame.
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})
And we'd like to replace these with the following values:
from_ = [1, 2]
to_ = [99, 12]
We could write:
df.with_column(
    pl.when(pl.col("a") == from_[0])
    .then(to_[0])
    .when(pl.col("a") == from_[1])
    .then(to_[1])
    .otherwise(pl.col("a")).alias("a")
)
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 99  │
├╌╌╌╌╌┤
│ 12  │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 4   │
├╌╌╌╌╌┤
│ 5   │
└─────┘
Don't repeat yourself
Now, this very quickly becomes tedious to write, so we could write a function that generates these expressions for us; we are programmers, aren't we!
So to replace with the values you have suggested, you could do:
from_ = [1, 2, 3, 4, 97, 98, 99]
to_ = [0, 0, 1, 1, 2, 2, 2]
def replace(column, from_, to_):
    # initiate the expression with `pl.when`
    branch = pl.when(pl.col(column) == from_[0]).then(to_[0])
    # for every value add a `when.then`
    for (from_value, to_value) in zip(from_, to_):
        branch = branch.when(pl.col(column) == from_value).then(to_value)
    # finish with an `otherwise`
    return branch.otherwise(pl.col(column)).alias(column)
df.with_column(replace("a", from_, to_))
Which outputs:
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 5   │
└─────┘
Just in case you like pandas-style docstrings as well and want to place this as a utils function somewhere in your repo:
def replace(column: str, mapping: dict) -> pl.internals.expr.Expr:
    """
    Create a polars expression that replaces a column's values.

    Parameters
    ----------
    column : str
        Column name on which values should be replaced.
    mapping : dict
        Can be used to specify different replacement values for different existing values. For example,
        ``{'a': 'b', 'y': 'z'}`` replaces the value 'a' with 'b' and 'y' with 'z'. Values not mentioned in
        ``mapping`` will stay the same.

    Returns
    -------
    pl.internals.expr.Expr
        Expression that contains instructions to replace values in ``column`` according to ``mapping``.

    Raises
    ------
    Exception
        * If ``mapping`` is empty.
    TypeError
        * If ``column`` is not ``str``.
        * If ``mapping`` is not ``dict``.
    polars.exceptions.PanicException
        * When ``mapping`` has keys or values that are not mappable to arrows format. Only catchable via
          BaseException. See also https://pola-rs.github.io/polars-book/user-guide/datatypes.html.

    Examples
    --------
    >>> import polars as pl
    >>> df = pl.DataFrame({'fruit': ['banana', 'apple', 'apple']})
    >>> df
    shape: (3, 1)
    ┌────────┐
    │ fruit  │
    │ ---    │
    │ str    │
    ╞════════╡
    │ banana │
    ├╌╌╌╌╌╌╌╌┤
    │ apple  │
    ├╌╌╌╌╌╌╌╌┤
    │ apple  │
    └────────┘
    >>> df.with_column(replace(column='fruit', mapping={'apple': 'pomegranate'}))
    shape: (3, 1)
    ┌─────────────┐
    │ fruit       │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ banana      │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ pomegranate │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ pomegranate │
    └─────────────┘
    """
    if not mapping:
        raise Exception("Mapping can't be empty")
    elif not isinstance(mapping, dict):
        raise TypeError(f"mapping must be of type dict, but is type: {type(mapping)}")
    if not isinstance(column, str):
        raise TypeError(f"column must be of type str, but is type: {type(column)}")
    branch = pl.when(pl.col(column) == list(mapping.keys())[0]).then(
        list(mapping.values())[0]
    )
    for from_value, to_value in mapping.items():
        branch = branch.when(pl.col(column) == from_value).then(to_value)
    return branch.otherwise(pl.col(column)).alias(column)
You can also use apply with a dict, as long as you specify an exhaustive mapping for each from_ option:
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
from_ = [1, 2, 3, 4, 5]
to_ = [99, 12, 4, 18, 64]
my_map = dict(zip(from_, to_))
df.select(pl.col("a").apply(lambda x: my_map[x]))
which outputs:
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 99  │
├╌╌╌╌╌┤
│ 12  │
├╌╌╌╌╌┤
│ 4   │
├╌╌╌╌╌┤
│ 18  │
├╌╌╌╌╌┤
│ 64  │
└─────┘
It'll be slower than ritchie46's answer but it's quite a bit simpler.
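If your mapping is not exhaustive, a small tweak is to fall back to the original value with dict.get, so missing keys don't raise a KeyError:
# keep values that are not in my_map unchanged
df.select(pl.col("a").apply(lambda x: my_map.get(x, x)))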
Can't use code snippet in comments, so I'll post this slight generalization as an answer.
In case the value to be mapped is missing from the mapping, this accepts a default value if provided, else it will act as if the mapping is the identity mapping.
import polars as pl

def apply_map(
    column: str, mapping: dict, default=None
) -> pl.Expr:
    branch = pl
    for key, value in mapping.items():
        branch = branch.when(pl.col(column) == key).then(value)
    default = pl.lit(default) if default is not None else pl.col(column)
    return branch.otherwise(default).alias(column)
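For example (this call and its mapping are just illustrative):
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
df.with_columns(apply_map("a", {1: 0, 2: 0, 3: 1}))
# rows with a == 4 or a == 5 keep their original value because no default was given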

Combine different values of multiple columns into one column in Polars

I have the following Python code with pandas
df['EVENT_DATE'] = df.apply(
    lambda row: datetime.date(year=row.iyear, month=row.imonth, day=row.iday).strftime("%Y-%m-%d"), axis=1)
and want to transform it into valid Polars code. Does anyone have an idea how to solve this?
I will also answer your generic question and not only your specific use case.
For your specific case, as of polars version >= 0.10.18, the recommended method to create what you want is with the pl.date or pl.datetime expression.
Given this dataframe, pl.date is used to format the date as requested.
import polars as pl
df = pl.DataFrame({
    "iyear": [2001, 2001],
    "imonth": [1, 2],
    "iday": [1, 1]
})
df.with_columns([
    pl.date("iyear", "imonth", "iday").dt.strftime("%Y-%m-%d").alias("fmt")
])
This outputs:
shape: (2, 4)
┌───────┬────────┬──────┬────────────┐
│ iyear ┆ imonth ┆ iday ┆ fmt        │
│ ---   ┆ ---    ┆ ---  ┆ ---        │
│ i64   ┆ i64    ┆ i64  ┆ str        │
╞═══════╪════════╪══════╪════════════╡
│ 2001  ┆ 1      ┆ 1    ┆ 2001-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001  ┆ 2      ┆ 1    ┆ 2001-02-01 │
└───────┴────────┴──────┴────────────┘
Other ways to collect other columns in a single expression
Below is a more generic answer to the main question. We can use a map to get multiple columns as Series, or, if we know we want to format a string column, we can use pl.format. The map offers the most utility.
df.with_columns([
    # string fmt over multiple expressions
    pl.format("{}-{}-{}", "iyear", "imonth", "iday").alias("date"),
    # columnar lambda over multiple expressions
    pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2"),
])
This outputs
shape: (2, 5)
┌───────┬────────┬──────┬──────────┬──────────┐
│ iyear ┆ imonth ┆ iday ┆ date     ┆ date2    │
│ ---   ┆ ---    ┆ ---  ┆ ---      ┆ ---      │
│ i64   ┆ i64    ┆ i64  ┆ str      ┆ str      │
╞═══════╪════════╪══════╪══════════╪══════════╡
│ 2001  ┆ 1      ┆ 1    ┆ 2001-1-1 ┆ 2001-1-1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2001  ┆ 2      ┆ 1    ┆ 2001-2-1 ┆ 2001-2-1 │
└───────┴────────┴──────┴──────────┴──────────┘
Avoid row-wise operations
Though the accepted answer gives the correct result, it is not the recommended way to apply operations over multiple columns in polars. Accessing rows is tremendously slow: it incurs a lot of cache misses, needs to run slow Python bytecode, and kills all parallelization/query optimization.
Note
In this specific case, the map creating string data is not recommended:
pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2"). Because of the way memory is laid out, and because we create a new column per string operation, this is actually quite expensive (only with string data). That is why pl.format and pl.concat_str exist.
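For completeness, here is a sketch of the pl.concat_str variant mentioned above (the keyword is separator in newer polars; older versions call it sep):
df.with_columns(
    pl.concat_str(
        [pl.col("iyear").cast(pl.Utf8), pl.col("imonth").cast(pl.Utf8), pl.col("iday").cast(pl.Utf8)],
        separator="-",  # older polars versions use sep="-"
    ).alias("date3")
)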
Polars apply will return the row data as a tuple, so you would need to use numerical indices instead. Example:
import datetime
import polars as pl

df = pl.DataFrame({"iyear": [2020, 2021],
                   "imonth": [1, 2],
                   "iday": [3, 4]})
df['EVENT_DATE'] = df.apply(
    lambda row: datetime.date(year=row[0], month=row[1], day=row[2]).strftime("%Y-%m-%d"))
In case df contains more columns, or in a different order, you could use apply on df[["iyear", "imonth", "iday"]] rather than df to ensure the indices refer to the right elements.
There may be better ways to achieve this, but this comes closest to the Pandas code.
On a separate note, my guess is you don't want to store the dates as a string, but rather as a proper pl.Date. You could modify the code in this way:
def days_since_epoch(dt):
    return (dt - datetime.date(1970, 1, 1)).days

df['EVENT_DATE_dt'] = df.apply(
    lambda row: days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2])), return_dtype=pl.Date)
where we first convert the Python date to days since Jan 1, 1970, and then convert to a pl.Date using apply's return_dtype argument. The cast to pl.Date needs an int rather than a Python datetime, as it stores the data as an int ultimately. This is most easily seen by simply accessing the dates:
print(type(df["EVENT_DATE_dt"][0])) # >>> <class 'int'>
print(type(df["EVENT_DATE_dt"].dt[0])) # >>> <class 'datetime.date'>
It would be great if the cast could operate on the Python datetime directly.
Edit: on the conversation about performance vs. pandas.
For both pandas and Polars, you could speed this up further if you have many duplicate rows (for year/month/day) by using a cache to speed up the apply, i.e.:
from functools import lru_cache

@lru_cache
def row_to_date(row):
    return days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2]))

df['EVENT_DATE_dt'] = df.apply(row_to_date, return_dtype=pl.Date)
This will improve runtime when there are many duplicate entries, at the expense of some memory. If there are no duplicates, it will probably slow you down.
