I want to calculate, per group, the mean of the sum of a few columns. It is simple in pandas:
df.groupby("id").apply(
lambda s: pd.Series(
{
"m_1 sum": s["m_1"].sum(),
"m_7_8 mean": (s["m_7"] + s["m_8"]).mean(),
}
)
)
I can't find a good way to do this in Polars, since pl.mean needs a string column name ...
I expect to be able to calculate the mean of the sum of two columns in one line ...
You can do:
(
    df.groupby("id").agg(
        [
            pl.col("m_1").sum().alias("m_1 sum"),
            (pl.col("m_7") + pl.col("m_8")).mean().alias("m_7_8 mean"),
        ]
    )
)
E.g.:
In [33]: df = pl.DataFrame({'id': [1, 1, 2], 'm_1': [1, 3.5, 2], 'm_7': [5, 1., 2], 'm_8': [6., 5., 4.]})
In [34]: (
    ...:     df.groupby("id").agg(
    ...:         [
    ...:             pl.col("m_1").sum().alias("m_1 sum"),
    ...:             (pl.col("m_7") + pl.col("m_8")).mean().alias("m_7_8 mean"),
    ...:         ]
    ...:     )
    ...: )
Out[34]:
shape: (2, 3)
┌─────┬─────────┬────────────┐
│ id ┆ m_1 sum ┆ m_7_8 mean │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪═════════╪════════════╡
│ 2 ┆ 2.0 ┆ 6.0 │
│ 1 ┆ 4.5 ┆ 8.5 │
└─────┴─────────┴────────────┘
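If you actually need the mean of the sum of three columns, the same pattern extends directly (a small sketch using the same example df, which has m_1, m_7 and m_8):
(
    df.groupby("id").agg(
        # sum the three columns row-wise, then take the per-group mean
        (pl.col("m_1") + pl.col("m_7") + pl.col("m_8")).mean().alias("m_1_7_8 mean")
    )
)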
I want to strip a dataframe based on each column's data type. If it is a string column, a strip should be executed; if it is not a string column, it should not be stripped. In pandas there is the following approach for this task:
df_clean = df_select.copy()
for col in df_select.columns:
    if df_select[col].dtype == 'object':
        df_clean[col] = df_select[col].str.strip()
How can this be executed in polars?
import polars as pl

df = pl.DataFrame(
    {
        "ID": [1, 1, 1, 1],
        "A": ["foo ", "ham", "spam ", "egg"],
        "L": ["A54", " A12", "B84", " C12"],
    }
)
You don't need a copy; you can use with_columns directly on df_select:
import polars as pl

df_select = pl.DataFrame(
    {
        "ID": [1, 1, 1, 1],
        "A": ["foo ", "ham", "spam ", "egg"],
        "L": ["A54", " A12", "B84", " C12"],
    }
)
df_clean = df_select.with_columns(pl.col(pl.Utf8).str.strip())
Output:
shape: (4, 3)
┌─────┬──────┬─────┐
│ ID ┆ A ┆ L │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪══════╪═════╡
│ 1 ┆ foo ┆ A54 │
│ 1 ┆ ham ┆ A12 │
│ 1 ┆ spam ┆ B84 │
│ 1 ┆ egg ┆ C12 │
└─────┴──────┴─────┘
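A quick check (a sketch using the df_select and df_clean defined above) that no copy is needed: with_columns returns a new DataFrame and leaves the original untouched.
# df_select keeps its original values; only df_clean holds the stripped strings
assert df_select["A"][0] == "foo "  # trailing space still present in the original
assert df_clean["A"][0] == "foo"    # stripped in the new frame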
Starting with
import polars as pl

df = pl.DataFrame({
    'a': [1, 2, 3],
    'b': [4., 2., 6.],
    'c': ['w', 'a', 'r'],
    'd': [4, 1, 1]
})
how can I get the correlation between a and all other numeric columns?
Equivalent in pandas:
In [30]: (
    ...:     pd.DataFrame({
    ...:         'a': [1, 2, 3],
    ...:         'b': [4., 2., 6.],
    ...:         'c': ['w', 'a', 'r'],
    ...:         'd': [4, 1, 1]
    ...:     })
    ...:     .corr()
    ...:     .loc['a']
    ...: )
Out[30]:
a 1.000000
b 0.500000
d -0.866025
Name: a, dtype: float64
I've tried
(
    df.select([pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)])
    .select(pl.pearson_corr('a', pl.exclude('a')))
)
but got
DuplicateError: Column with name: 'a' has more than one occurrences
There is a DataFrame.pearson_corr which you could then filter.
>>> df.select([
...     pl.col(pl.Int64).cast(pl.Float64),
...     pl.col(pl.Float64)]
... ).pearson_corr()
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ a ┆ d ┆ b │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞═══════════╪═══════════╪═════╡
│ 1.0 ┆ -0.866025 ┆ 0.5 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ -0.866025 ┆ 1.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 0.5 ┆ 0.0 ┆ 1.0 │
└───────────┴───────────┴─────┘
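The "then filter" step isn't shown above; one possible sketch (using the same numeric cast, with a helper column named variable introduced here only for illustration) is to attach the column names and keep the row for 'a':
corr = df.select(
    [pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)]
).pearson_corr()

# attach the variable names as a column, then keep the row that belongs to "a"
a_row = corr.with_columns([pl.Series("variable", corr.columns)]).filter(
    pl.col("variable") == "a"
)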
As for your current approach, you could use pl.concat_list():
>>> (
...     df
...     .select([
...         pl.col(pl.Int64).cast(pl.Float64),
...         pl.col(pl.Float64)])
...     .select(
...         pl.concat_list(
...             pl.pearson_corr("a", pl.exclude("a"))
...         )
...     )
... )
shape: (1, 1)
┌──────────────────┐
│ a │
│ --- │
│ list[f64] │
╞══════════════════╡
│ [-0.866025, 0.5] │
└──────────────────┘
You can convert it to a struct and .unnest() to split it into columns:
>>> (
...     df
...     .select([
...         pl.col(pl.Int64).cast(pl.Float64),
...         pl.col(pl.Float64)])
...     .select(
...         pl.concat_list(
...             pl.pearson_corr("a", pl.exclude("a"))
...         ).arr.to_struct())
...     .unnest("a")
... )
shape: (1, 2)
┌───────────┬─────────┐
│ field_0 ┆ field_1 │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═══════════╪═════════╡
│ -0.866025 ┆ 0.5 │
└───────────┴─────────┘
Here's one solution I came up with:
In [52]: numeric = df.select([pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)])
In [53]: numeric.select([pl.pearson_corr('a', col).alias(f'corr_a_{col}') for col in numeric.columns if col != 'a'])
Out[53]:
shape: (1, 2)
┌───────────┬──────────┐
│ corr_a_d ┆ corr_a_b │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═══════════╪══════════╡
│ -0.866025 ┆ 0.5 │
└───────────┴──────────┘
Is there a way to do this without having to assign to a temporary variable?
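One possible way to avoid the temporary variable (a sketch, assuming DataFrame.pipe fits your style) is to keep the intermediate numeric frame inside a pipe call:
(
    df.select([pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)])
    .pipe(
        lambda numeric: numeric.select(
            [
                pl.pearson_corr("a", col).alias(f"corr_a_{col}")
                for col in numeric.columns
                if col != "a"
            ]
        )
    )
)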
Suppose we have a polars dataframe like:
df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]}).lazy()
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 │
└─────┴─────┘
I would like to compute X^T X for the matrix while preserving the sparse matrix format for arrow* - in pandas I would do something like:
pdf = df.collect().to_pandas()
numbers = pdf[["a", "b"]]
(numbers.T @ numbers).melt(ignore_index=False)
  variable  value
a        a     14
b        a     26
a        b     26
b        b     50
I did something like this in polars:
df.select(
    [
        (pl.col("a") * pl.col("a")).sum().alias("aa"),
        (pl.col("a") * pl.col("b")).sum().alias("ab"),
        (pl.col("b") * pl.col("a")).sum().alias("ba"),
        (pl.col("b") * pl.col("b")).sum().alias("bb"),
    ]
).melt().collect()
shape: (4, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════╪═══════╡
│ aa ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ab ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ba ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bb ┆ 50 │
└──────────┴───────┘
This is almost there, but not quite. It is a hack to get around the fact that I can't store lists as the column names (otherwise I could unnest them into two different columns representing the x and y axes of the matrix). Is there a way to get the same format as shown in the pandas example?
*arrow is a columnar data format which means it's performant when scaled across rows but not across columns, which is why I think the sparse matrix representation is better if I want to use the results of the gram matrix chained with pl.LazyFrames later down the graph. I could be wrong though!
Polars doesn't have matrix multiplication, but we can tweak your algorithm slightly to accomplish what we need:
use the built-in dot expression
calculate each inner product only once, since <a, b> = <b, a>. We'll use Python's combinations_with_replacement iterator from itertools to accomplish this (a short example of its output follows this list).
automatically generate the list of expressions that will run in parallel
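For reference, here is what combinations_with_replacement yields for three column names, so each unordered pair of columns appears exactly once:
>>> from itertools import combinations_with_replacement
>>> list(combinations_with_replacement(["a", "b", "c"], 2))
[('a', 'a'), ('a', 'b'), ('a', 'c'), ('b', 'b'), ('b', 'c'), ('c', 'c')]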
Let's expand your data a bit:
from itertools import combinations_with_replacement
import polars as pl
df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [3, 4, 5, 6, 7], "c": [5, 6, 7, 8, 9]}
).lazy()
df.collect()
shape: (5, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 6 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 7 ┆ 9 │
└─────┴─────┴─────┘
The algorithm would be as follows:
expr_list = [
    pl.col(col1).dot(pl.col(col2)).alias(col1 + "|" + col2)
    for col1, col2 in combinations_with_replacement(df.columns, 2)
]

dot_prods = (
    df
    .select(expr_list)
    .melt()
    .with_column(
        pl.col('variable').str.split_exact('|', 1)
    )
    .unnest('variable')
    .cache()
)
result = (
    pl.concat([
        dot_prods,
        dot_prods
        .filter(pl.col('field_0') != pl.col('field_1'))
        .select(['field_1', 'field_0', 'value'])
        .rename({'field_0': 'field_1', 'field_1': 'field_0'})
    ])
    .sort(['field_0', 'field_1'])
)
result.collect()
shape: (9, 3)
┌─────────┬─────────┬───────┐
│ field_0 ┆ field_1 ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪═════════╪═══════╡
│ a ┆ a ┆ 55 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ b ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ c ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ a ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ b ┆ 135 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ c ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ a ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ b ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ c ┆ 255 │
└─────────┴─────────┴───────┘
Couple of notes:
I'm assuming that a pipe would be an appropriate delimiter for your column names.
The use of a Python iterator and list comprehension will not significantly impair performance; they are only used to generate the list of expressions, not to run any calculations.
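If you want to convince yourself of that, you can print the generated list — each entry in expr_list is just a lazy Polars expression, and nothing has been computed at that point:
for expr in expr_list:
    print(expr)  # prints each expression's repr; no data is touched here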
Given the following dataframe, is there some way to select only columns starting with a given prefix? I know I could do e.g. [pl.col(column) for column in df.columns if column.startswith("prefix_")], but I'm wondering if I can do it as part of a single expression.
df = pl.DataFrame(
    {"prefix_a": [1, 2, 3], "prefix_b": [1, 2, 3], "some_column": [3, 2, 1]}
)
df.select(pl.all().<column_name_starts_with>("prefix_"))
Would this be possible to do lazily?
From the documentation for polars.col, the expression can take one of the following arguments:
a single column by name
all columns by using a wildcard “*”
column by regular expression if the regex starts with ^ and ends with $
So in this case, we can use a regex expression to select for the prefix. And this does work in lazy mode.
(
    df
    .lazy()
    .select(pl.col('^prefix_.*$'))
    .collect()
)
shape: (3, 2)
┌──────────┬──────────┐
│ prefix_a ┆ prefix_b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪══════════╡
│ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 │
└──────────┴──────────┘
Note: we can also use polars.exclude with regex expressions:
(
    df
    .lazy()
    .select(pl.exclude('^prefix_.*$'))
    .collect()
)
shape: (3, 1)
┌─────────────┐
│ some_column │
│ --- │
│ i64 │
╞═════════════╡
│ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
└─────────────┘
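One caveat: if the prefix contains regex metacharacters (a dot, brackets, etc.), it is safer to escape it before building the pattern. A small sketch using Python's re.escape (prefix here is just an illustrative variable):
import re

prefix = "prefix_"  # illustrative; escaping guards against regex metacharacters
df.select(pl.col(f"^{re.escape(prefix)}.*$"))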
I have a largish dataframe (5.5M rows, four columns). The first column (let's call it column A) has 235 distinct entries. The second column (B) has 100 distinct entries, integers from 0 to 99, all present in various proportions for each entry in A.
I group by A and aggregate by randomly selecting a value from B. Something like:
df.groupby("A").agg(
pl.col("B").unique().apply(np.random.choice)
)
In doing so, every value in A is assigned a random integer. My goal is to select, from the dataframe, the columns C and D corresponding to the pairs (A, B) so generated.
My approach so far is:
import functools, operator
import numpy as np

choice = df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
).to_numpy()

lchoice = ((df["A"] == arr[0]) & (df["B"] == arr[1]) for arr in choice)
mask = functools.reduce(operator.or_, lchoice)
sdf = df[mask].select(["C", "D"])
It does the job, but does not feel very idiomatic.
My first attempt was
sdf = df.filter(
    functools.reduce(
        operator.or_,
        [(pl.col("A") == arr[0]) & (pl.col("B") == arr[1]) for arr in choice],
    )
)
but it hangs until I kill it (I waited for ~30 minutes, while the other approach takes 1.6 seconds).
df.filter(
    (pl.col("period") == choice[0, 0]) & (pl.col("exp_id") == choice[0, 1])
)
works fine, as expected, and I have used the functools.reduce construct successfully as an argument to filter in the past. Obviously, I do not want to write them all by hand; I could loop over the rows of choice, filter df one pair at a time and then concatenate the resulting dataframes, but that sounds much more expensive than it should be.
Any tip on getting to my sdf "the polars way", without having to create temporary objects, arrays, etc.? As I said, I have a working solution, but it is kind of shaky, and I am interested in learning better polars.
EDIT: some mock data
df = pl.DataFrame({"A": [1.3, 8.9, 6.7]*3 + [3.6, 4.1]*2,
"B": [1]*3 + [2]*3 + [3]*3 + [1]*2 + [2]*2,
"C": [21.5, 24.3, 21.8, 20.8, 23.6, 15.6, 23.5,
16.1, 15.6, 14.8, 14.7, 23.8, 20.],
"D": [6.9, 7.6, 6.4, 6.2, 7.6, 6.2,
6.3, 7.1, 7.8,7.7, 6.5, 6.6, 7.1]})
Slight twist on the accepted answer:
df.sort(by=["A", "B"], in_place=True)

sdf = (
    df.join(
        df.groupby("A", maintain_order=True)
        .agg(
            pl.col("B").unique().sort().shuffle(seed).first().alias("B")
        ),
        on=["A", "B"],
    )
    .select(["C", "D"])
)
I need to perform this operation multiple times, and I'd like to ensure reproducibility of the random generation, hence the sorts and maintain_order=True.
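A quick way to sanity-check that reproducibility (a sketch; seed is whatever fixed integer you choose, 0 below is only illustrative):
seed = 0  # any fixed integer; 0 is only for illustration

def pick(seed):
    # the same query as above, wrapped so it can be run twice
    return (
        df.join(
            df.groupby("A", maintain_order=True).agg(
                pl.col("B").unique().sort().shuffle(seed).first().alias("B")
            ),
            on=["A", "B"],
        )
        .select(["C", "D"])
    )

assert pick(seed).frame_equal(pick(seed))  # same seed -> identical selection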
It looks like you can accomplish what you need with a join.
Let's see if I understand your question. Let's start with this data:
import polars as pl

df = pl.DataFrame(
    {
        "A": ["a", "b", "c", "c", "b", "a"] * 2,
        "B": [1, 2, 3, 2] * 3,
        "C": [1, 2, 3, 4, 5, 6] * 2,
        "D": ["x", "y", "y", "x"] * 3,
    }
).sort(["A", "B"])
print(df)
shape: (12, 4)
┌─────┬─────┬─────┬─────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 1 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 ┆ 6 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 ┆ 6 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 3 ┆ 1 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 1 ┆ 5 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 2 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 2 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 3 ┆ 5 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 1 ┆ 3 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 2 ┆ 4 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 2 ┆ 4 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 3 ┆ y │
└─────┴─────┴─────┴─────┘
Next, we randomly select a value for B for every value of A.
I'm going to change your code slightly to eliminate the use of numpy and instead use Polars' shuffle expression. This way, we get a Polars DataFrame back, which we'll use in the upcoming join. (The shuffle expression uses numpy under the hood if no seed is provided.)
choice_df = df.groupby("A").agg(pl.col("B").unique().shuffle().first())
print(choice_df)
shape: (3, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 1 │
└─────┴─────┘
If I understand your question correctly, we now want to get the values for columns C and D in the original dataset that correspond to the three combinations of A and B that we selected in the previous step. We can accomplish this most simply with a join, followed by a select. (The select is merely to eliminate columns A and B from the result).
df.join(choice_df, on=['A','B']).select(['C','D'])
shape: (4, 2)
┌─────┬─────┐
│ C ┆ D │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 6 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ x │
└─────┴─────┘
Does this accomplish what you need? The resulting code is clean, concise, and typical of the use of the Polars API.
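If you prefer to keep everything as one lazy query, the two steps compose directly; a sketch under the assumption that a fixed seed is acceptable (seed=0 below is only illustrative):
result = (
    df.lazy()
    .join(
        df.lazy()
        .groupby("A")
        .agg(pl.col("B").unique().shuffle(seed=0).first()),
        on=["A", "B"],
    )
    .select(["C", "D"])
    .collect()
)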