Corr of one column with all other numeric ones - python

Starting with
import polars as pl
df = pl.DataFrame({
    'a': [1, 2, 3],
    'b': [4., 2., 6.],
    'c': ['w', 'a', 'r'],
    'd': [4, 1, 1]
})
how can I get the correlation between a and all other numeric columns?
Equivalent in pandas:
In [30]: (
    ...:     pd.DataFrame({
    ...:         'a': [1, 2, 3],
    ...:         'b': [4., 2., 6.],
    ...:         'c': ['w', 'a', 'r'],
    ...:         'd': [4, 1, 1]
    ...:     })
    ...:     .corr()
    ...:     .loc['a']
    ...: )
Out[30]:
a    1.000000
b    0.500000
d   -0.866025
Name: a, dtype: float64
I've tried
(
    df.select([pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)])
    .select(pl.pearson_corr('a', pl.exclude('a')))
)
but got
DuplicateError: Column with name: 'a' has more than one occurrences

There is a DataFrame.pearson_corr which you could then filter.
>>> df.select([
...     pl.col(pl.Int64).cast(pl.Float64),
...     pl.col(pl.Float64)]
... ).pearson_corr()
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ a         ┆ d         ┆ b   │
│ ---       ┆ ---       ┆ --- │
│ f64       ┆ f64       ┆ f64 │
╞═══════════╪═══════════╪═════╡
│ 1.0       ┆ -0.866025 ┆ 0.5 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ -0.866025 ┆ 1.0       ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 0.5       ┆ 0.0       ┆ 1.0 │
└───────────┴───────────┴─────┘
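The rows of that matrix come out in the same order as its columns, so one way to pull out just the correlations for 'a' is to grab the matching row (a minimal sketch relying on that row/column ordering, not a dedicated API):
corr = df.select([
    pl.col(pl.Int64).cast(pl.Float64),
    pl.col(pl.Float64)]
).pearson_corr()
# the row index of 'a' mirrors its column index
corr.row(corr.columns.index('a'))  # approximately (1.0, -0.866025, 0.5)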
As for your current approach - you could use pl.concat_list():
>>> (
...     df
...     .select([
...         pl.col(pl.Int64).cast(pl.Float64),
...         pl.col(pl.Float64)])
...     .select(
...         pl.concat_list(
...             pl.pearson_corr("a", pl.exclude("a"))
...         )
...     )
... )
shape: (1, 1)
┌──────────────────┐
│ a                │
│ ---              │
│ list[f64]        │
╞══════════════════╡
│ [-0.866025, 0.5] │
└──────────────────┘
You can convert it to a struct and .unnest() to split it into columns:
>>> (
...     df
...     .select([
...         pl.col(pl.Int64).cast(pl.Float64),
...         pl.col(pl.Float64)])
...     .select(
...         pl.concat_list(
...             pl.pearson_corr("a", pl.exclude("a"))
...         ).arr.to_struct())
...     .unnest("a")
... )
shape: (1, 2)
┌───────────┬─────────┐
│ field_0   ┆ field_1 │
│ ---       ┆ ---     │
│ f64       ┆ f64     │
╞═══════════╪═════════╡
│ -0.866025 ┆ 0.5     │
└───────────┴─────────┘
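The struct fields get generic names. If you want the original column names back, one option (a sketch that assumes pl.exclude("a") yields the columns in the d, b order seen above) is to rename after unnesting:
(
    df
    .select([
        pl.col(pl.Int64).cast(pl.Float64),
        pl.col(pl.Float64)])
    .select(
        pl.concat_list(
            pl.pearson_corr("a", pl.exclude("a"))
        ).arr.to_struct())
    .unnest("a")
    .rename({"field_0": "d", "field_1": "b"})  # assumed order: d, then b
)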

Here's one solution I came up with:
In [52]: numeric = df.select([pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)])
In [53]: numeric.select([pl.pearson_corr('a', col).alias(f'corr_a_{col}') for col in numeric.columns if col != 'a'])
Out[53]:
shape: (1, 2)
┌───────────┬──────────┐
│ corr_a_d  ┆ corr_a_b │
│ ---       ┆ ---      │
│ f64       ┆ f64      │
╞═══════════╪══════════╡
│ -0.866025 ┆ 0.5      │
└───────────┴──────────┘
Is there a way to do this without having to assign to a temporary variable?
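One possibility is DataFrame.pipe, which hands the intermediate frame to a callable so everything stays in a single chain (a sketch of the same logic as above):
(
    df
    .select([pl.col(pl.Int64).cast(pl.Float64), pl.col(pl.Float64)])
    .pipe(lambda numeric: numeric.select([
        pl.pearson_corr('a', col).alias(f'corr_a_{col}')
        for col in numeric.columns if col != 'a'
    ]))
)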


Polars calculating mean of sum of multiple columns

I want to calculate the mean of a sum of 3 columns based on the aggregate value. It is simple in pandas:
df.groupby("id").apply(
    lambda s: pd.Series(
        {
            "m_1 sum": s["m_1"].sum(),
            "m_7_8 mean": (s["m_7"] + s["m_8"]).mean(),
        }
    )
)
I can't find a good way to do this in polars, as pl.mean needs a string column name. I expect to be able to calculate the mean of a sum of 2 columns in one line.
You can do:
(
    df.groupby("id").agg(
        [
            pl.col("m_1").sum().alias("m_1 sum"),
            (pl.col("m_7") + pl.col("m_8")).mean().alias("m_7_8 mean"),
        ]
    )
)
E.g.:
In [33]: df = pl.DataFrame({'id': [1, 1, 2], 'm_1': [1, 3.5, 2], 'm_7': [5, 1., 2], 'm_8': [6., 5., 4.]})
In [34]: (
    ...:     df.groupby("id").agg(
    ...:         [
    ...:             pl.col("m_1").sum().alias("m_1 sum"),
    ...:             (pl.col("m_7") + pl.col("m_8")).mean().alias("m_7_8 mean"),
    ...:         ]
    ...:     )
    ...: )
Out[34]:
shape: (2, 3)
┌─────┬─────────┬────────────┐
│ id  ┆ m_1 sum ┆ m_7_8 mean │
│ --- ┆ ---     ┆ ---        │
│ i64 ┆ f64     ┆ f64        │
╞═════╪═════════╪════════════╡
│ 2   ┆ 2.0     ┆ 6.0        │
│ 1   ┆ 4.5     ┆ 8.5        │
└─────┴─────────┴────────────┘
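Note that groupby does not guarantee group order (here id 2 comes out before id 1). If the order matters, groupby accepts a maintain_order flag, at some performance cost:
(
    df.groupby("id", maintain_order=True).agg(
        [
            pl.col("m_1").sum().alias("m_1 sum"),
            (pl.col("m_7") + pl.col("m_8")).mean().alias("m_7_8 mean"),
        ]
    )
)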

how to imitate Pandas' index-based querying in Polars?

Any idea what I can do to imitate the below pandas code using polars? Polars doesn't have indexes like pandas, so I couldn't figure out what to do.
df = pd.DataFrame(data = ([21,123], [132,412], [23, 43]), columns = ['c1', 'c2']).set_index("c1")
print(df.loc[[23, 132]])
and it prints
      c2
c1
23    43
132  412
the only polars conversion I could figure out to do is
df = pl.DataFrame(data = ([21,123], [132,412], [23, 43]), schema = ['c1', 'c2'])
print(df.filter(pl.col("c1").is_in([23, 132])))
but it prints
shape: (2, 2)
┌─────┬─────┐
│ c1  ┆ c2  │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 132 ┆ 412 │
│ 23  ┆ 43  │
└─────┴─────┘
which is okay, but the rows are not in the order I gave. I passed [23, 132] and want the output rows in that same order, like the pandas output.
I could sort() afterwards, yes, but the original data I use this on has around 30 million rows, so I'm looking for something that's as fast as possible.
I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)
For example, let's start with this shuffled DataFrame.
nbr_rows = 30_000_000
df = pl.DataFrame({
    'c1': pl.arange(0, nbr_rows, eager=True).shuffle(2),
    'c2': pl.arange(0, nbr_rows, eager=True).shuffle(3),
})
df
shape: (30000000, 2)
┌──────────┬──────────┐
│ c1       ┆ c2       │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 4052015  ┆ 20642741 │
│ 7787054  ┆ 17007051 │
│ 20246150 ┆ 19445431 │
│ 1309992  ┆ 6495751  │
│ ...      ┆ ...      │
│ 10371090 ┆ 4791782  │
│ 26281644 ┆ 12350777 │
│ 6740626  ┆ 24888572 │
│ 22573405 ┆ 14885989 │
└──────────┴──────────┘
And these index values:
nbr_index_values = 10_000
s1 = pl.Series(name='c1', values=pl.arange(0, nbr_index_values, eager=True).shuffle())
s1
shape: (10000,)
Series: 'c1' [i64]
[
	1754
	6716
	3485
	7058
	7216
	1040
	1832
	3921
	1639
	6734
	5560
	7596
	...
	4243
	4455
	894
	7806
	9291
	1883
	9947
	3309
	2030
	7731
	4706
	8528
	8426
]
We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)
import time

start = time.perf_counter()
df2 = (
    s1.to_frame()
    .join(
        df,
        on='c1',
        how='left'
    )
)
print(time.perf_counter() - start)
0.8427023889998964
df2
shape: (10000, 2)
┌──────┬──────────┐
│ c1   ┆ c2       │
│ ---  ┆ ---      │
│ i64  ┆ i64      │
╞══════╪══════════╡
│ 1754 ┆ 15734441 │
│ 6716 ┆ 20631535 │
│ 3485 ┆ 20199121 │
│ 7058 ┆ 15881128 │
│ ...  ┆ ...      │
│ 7731 ┆ 19420197 │
│ 4706 ┆ 16918008 │
│ 8528 ┆ 5278904  │
│ 8426 ┆ 18927935 │
└──────┴──────────┘
Notice how the rows are in the same order as the index values. We can verify this:
s1.series_equal(df2.get_column('c1'), strict=True)
True
And the performance is quite good. On my 32-core system, this takes less than a second.
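If you need this kind of lookup in more than one place, the pattern wraps naturally in a small helper (a sketch; loc_like is a hypothetical name, not a polars API):
def loc_like(df: pl.DataFrame, column: str, values: list) -> pl.DataFrame:
    # Build a one-column frame of lookup values and left-join the data to it;
    # the left join preserves the order of `values`.
    return pl.DataFrame({column: values}).join(df, on=column, how="left")

loc_like(df, "c1", [23, 132])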

Possible to calculate counts and percentage in one chain using polars?

From seeing some of the other polars answers, it seems most things can be completed in a single chain. Is that possible with the below example? Are any simplifications possible?
import polars as pl
scores = pl.DataFrame({
    'zone': ['North', 'North', 'North', 'South', 'East', 'East', 'East', 'East'],
    'score': [78, 39, 76, 56, 67, 89, 100, 55]
})
cnt = scores.groupby("zone").count()
cnt.with_column(
    (100 * pl.col("count") / pl.col("count").sum())
    .round(2)
    .cast(str)
    .str.replace("$", "%")
    .alias("perc")
)
Noticed the post below just after asking this question. Feel free to close as needed.
Best way to get percentage counts in Polars
(
    scores.groupby("zone")
    .agg([pl.count().alias("count")])
    .with_column(
        (pl.col("count") * 100 / pl.sum("count"))
        .round(2)
        .cast(str)
        .str.replace("$", "%")
        .alias("perc")
    )
)
shape: (3, 3)
┌───────┬───────┬───────┐
│ zone  ┆ count ┆ perc  │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ u32   ┆ str   │
╞═══════╪═══════╪═══════╡
│ South ┆ 1     ┆ 12.5% │
│ North ┆ 3     ┆ 37.5% │
│ East  ┆ 4     ┆ 50.0% │
└───────┴───────┴───────┘
EDIT: Slightly simpler/cleaner version
(
    scores.groupby("zone")
    .agg(pl.count().alias("count"))
    .with_column(
        pl.format(
            "{}%",
            (pl.col("count") * 100 / pl.sum("count")).round(2)
        ).alias("perc")
    )
)
shape: (3, 3)
┌───────┬───────┬───────┐
│ zone  ┆ count ┆ perc  │
│ ---   ┆ ---   ┆ ---   │
│ str   ┆ u32   ┆ str   │
╞═══════╪═══════╪═══════╡
│ North ┆ 3     ┆ 37.5% │
│ South ┆ 1     ┆ 12.5% │
│ East  ┆ 4     ┆ 50.0% │
└───────┴───────┴───────┘
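Depending on your polars version, Series.value_counts can also fold the groupby/count into one call; a sketch (assuming the count column comes back named "counts", which has varied between versions):
(
    scores
    .get_column("zone")
    .value_counts()  # DataFrame with "zone" and a count column
    .with_column(
        pl.format(
            "{}%",
            (pl.col("counts") * 100 / pl.col("counts").sum()).round(2)
        ).alias("perc")
    )
)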

How to swap column values on conditions in python polars?

I have a data frame as below:
import numpy as np
import pandas as pd
import polars as pl

df_n = pl.from_pandas(pd.DataFrame({
    'last_name': [np.nan, 'mallesh', 'bhavik'],
    'first_name': ['a', 'b', 'c'],
    'middle_name_or_initial': ['aa', 'bb', 'cc']
}))
Here I would like to find observations where first_name and middle_name_or_initial are not null but last_name is null; in that case, first_name should be moved to last_name, middle_name_or_initial to first_name, and middle_name_or_initial should end up empty.
The expected output would have the names swapped as described. I'm trying this command:
df_n.with_columns([
    pl.when(
        (pl.col('first_name').is_not_null())
        & (pl.col('middle_name_or_initial').is_not_null())
        & (pl.col('last_name').is_null())
    ).then(pl.col('first_name').alias('last_name')).otherwise(pl.col('last_name').alias('first_name')),
    pl.when(
        (pl.col('first_name').is_not_null())
        & (pl.col('middle_name_or_initial').is_not_null())
        & (pl.col('last_name').is_null())
    ).then(pl.col('middle_name_or_initial').alias('first_name')).otherwise('').alias('middle_name_or_initial')
])
But it produces the wrong output. Any help?
You can actually change the values of multiple columns within a single when/then/otherwise statement.
The Algorithm
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
    .drop(name_cols)
    .unnest('name_struct')
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ ---       ┆ ---        ┆ ---                    │
│ str       ┆ str        ┆ str                    │
╞═══════════╪════════════╪════════════════════════╡
│ a         ┆ aa         ┆ null                   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh   ┆ b          ┆ bb                     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik    ┆ c          ┆ cc                     │
└───────────┴────────────┴────────────────────────┘
How it works
To change the values of multiple columns within a single when/then/otherwise statement, we can use structs. But you must observe some rules with structs. In all your then and otherwise statements, your structs must have:
the same field names
in the same order
with the same data type in corresponding fields.
So, in both the then and otherwise statements, I'm going to create a struct with field names in this order:
last_name: string
first_name: string
middle_name_or_initial: string
In our then statement, I'm swapping values and using alias to ensure that my field names are as stated above. (This is important.)
.then(pl.struct([
    pl.col('first_name').alias('last_name'),
    pl.col('middle_name_or_initial').alias('first_name'),
    pl.col('last_name').alias('middle_name_or_initial'),
]))
And in the otherwise statement, we'll simply name the existing columns that we want, in the order that we want - using the list name_cols that I created in a previous step.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
...
.otherwise(pl.struct(name_cols))
Here's the result after the when/then/otherwise statement.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
)
shape: (3, 4)
┌───────────┬────────────┬────────────────────────┬──────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial ┆ name_struct          │
│ ---       ┆ ---        ┆ ---                    ┆ ---                  │
│ str       ┆ str        ┆ str                    ┆ struct[3]            │
╞═══════════╪════════════╪════════════════════════╪══════════════════════╡
│ null      ┆ a          ┆ aa                     ┆ {"a","aa",null}      │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh   ┆ b          ┆ bb                     ┆ {"mallesh","b","bb"} │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik    ┆ c          ┆ cc                     ┆ {"bhavik","c","cc"}  │
└───────────┴────────────┴────────────────────────┴──────────────────────┘
Notice that our new struct name_struct has the values that we want - in the correct order.
Next, we will use unnest to break the struct into separate columns. (But first, we must drop the existing columns so that we don't get 2 sets of columns with the same names.)
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
    .drop(name_cols)
    .unnest('name_struct')
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ ---       ┆ ---        ┆ ---                    │
│ str       ┆ str        ┆ str                    │
╞═══════════╪════════════╪════════════════════════╡
│ a         ┆ aa         ┆ null                   │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh   ┆ b          ┆ bb                     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik    ┆ c          ┆ cc                     │
└───────────┴────────────┴────────────────────────┘
With pl.when().then().otherwise() you create values for only one column (so only one alias at the end is allowed).
In [67]: df_n.with_columns(
    ...:     [
    ...:         # Create temp column with filter, so it does not have to be recalculated 3 times.
    ...:         ((pl.col('first_name').is_not_null()) & (pl.col('middle_name_or_initial').is_not_null()) & (pl.col('last_name').is_null())).alias("swap_names")
    ...:     ]
    ...: ).with_columns(
    ...:     [
    ...:         # Create new columns with the correct value based on the swap_names column.
    ...:         pl.when(pl.col("swap_names")).then(pl.col("first_name")).otherwise(pl.col("last_name")).alias("last_name_new"),
    ...:         pl.when(pl.col("swap_names")).then(pl.col("middle_name_or_initial")).otherwise(pl.col("first_name")).alias("first_name_new"),
    ...:         pl.when(pl.col("swap_names")).then(None).otherwise(pl.col("middle_name_or_initial")).alias("middle_name_or_initial_new"),
    ...:     ]
    ...: )
Out[67]:
shape: (3, 7)
┌───────────┬────────────┬────────────────────────┬────────────┬───────────────┬────────────────┬────────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial ┆ swap_names ┆ last_name_new ┆ first_name_new ┆ middle_name_or_initial_new │
│ ---       ┆ ---        ┆ ---                    ┆ ---        ┆ ---           ┆ ---            ┆ ---                        │
│ str       ┆ str        ┆ str                    ┆ bool       ┆ str           ┆ str            ┆ str                        │
╞═══════════╪════════════╪════════════════════════╪════════════╪═══════════════╪════════════════╪════════════════════════════╡
│ null      ┆ a          ┆ aa                     ┆ true       ┆ a             ┆ aa             ┆ null                       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh   ┆ b          ┆ bb                     ┆ false      ┆ mallesh       ┆ b              ┆ bb                         │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik    ┆ c          ┆ cc                     ┆ false      ┆ bhavik        ┆ c              ┆ cc                         │
└───────────┴────────────┴────────────────────────┴────────────┴───────────────┴────────────────┴────────────────────────────┘
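To finish the swap and return to the original three-column layout, a possible cleanup step (a sketch; df_swapped is a hypothetical name for the frame produced above):
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_swapped
    .drop(name_cols + ["swap_names"])            # drop the originals and the helper flag
    .rename({f"{c}_new": c for c in name_cols})  # restore the original column names
)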

Polars - Perform matrix inner product on lazy frames to produce sparse representation of gram matrix

Suppose we have a polars dataframe like:
df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]}).lazy()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 5   │
└─────┴─────┘
I would like to compute X^T X for the matrix while preserving the sparse matrix format for arrow* - in pandas I would do something like:
pdf = df.collect().to_pandas()
numbers = pdf[["a", "b"]]
(numbers.T @ numbers).melt(ignore_index=False)

  variable  value
a        a     14
b        a     26
a        b     26
b        b     50
I did something like this in polars:
df.select(
    [
        (pl.col("a") * pl.col("a")).sum().alias("aa"),
        (pl.col("a") * pl.col("b")).sum().alias("ab"),
        (pl.col("b") * pl.col("a")).sum().alias("ba"),
        (pl.col("b") * pl.col("b")).sum().alias("bb"),
    ]
).melt().collect()
shape: (4, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ ---      ┆ ---   │
│ str      ┆ i64   │
╞══════════╪═══════╡
│ aa       ┆ 14    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ab       ┆ 26    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ba       ┆ 26    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bb       ┆ 50    │
└──────────┴───────┘
Which is almost there, but not quite. This is a hack to get around the fact that I can't store lists as column names (otherwise I could unnest them into two columns representing the x and y axes of the matrix). Is there a way to get the same format as shown in the pandas example?
*arrow is a columnar data format which means it's performant when scaled across rows but not across columns, which is why I think the sparse matrix representation is better if I want to use the results of the gram matrix chained with pl.LazyFrames later down the graph. I could be wrong though!
Polars doesn't have matrix multiplication, but we can tweak your algorithm slightly to accomplish what we need:
use the built-in dot expression
calculate each inner product only once, since <a, b> = <b, a>. We'll use Python's combinations_with_replacement iterator from itertools to accomplish this.
automatically generate the list of expressions that will run in parallel
Let's expand your data a bit:
from itertools import combinations_with_replacement
import polars as pl
df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [3, 4, 5, 6, 7], "c": [5, 6, 7, 8, 9]}
).lazy()
df.collect()
shape: (5, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 5   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 6   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 7   ┆ 9   │
└─────┴─────┴─────┘
The algorithm would be as follows:
# one dot-product expression per unordered pair of columns,
# with each result named "col1|col2"
expr_list = [
    pl.col(col1).dot(pl.col(col2)).alias(col1 + "|" + col2)
    for col1, col2 in combinations_with_replacement(df.columns, 2)
]

# compute all inner products in one pass, melt to long format,
# then split the "col1|col2" labels back into two key columns
dot_prods = (
    df
    .select(expr_list)
    .melt()
    .with_column(
        pl.col('variable').str.split_exact('|', 1)
    )
    .unnest('variable')
    .cache()
)

# mirror the off-diagonal entries (since <a, b> = <b, a>)
# to recover the full, symmetric gram matrix
result = (
    pl.concat([
        dot_prods,
        dot_prods
        .filter(pl.col('field_0') != pl.col('field_1'))
        .select(['field_1', 'field_0', 'value'])
        .rename({'field_0': 'field_1', 'field_1': 'field_0'}),
    ])
    .sort(['field_0', 'field_1'])
)

result.collect()
shape: (9, 3)
┌─────────┬─────────┬───────┐
│ field_0 ┆ field_1 ┆ value │
│ ---     ┆ ---     ┆ ---   │
│ str     ┆ str     ┆ i64   │
╞═════════╪═════════╪═══════╡
│ a       ┆ a       ┆ 55    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a       ┆ b       ┆ 85    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a       ┆ c       ┆ 115   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b       ┆ a       ┆ 85    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b       ┆ b       ┆ 135   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b       ┆ c       ┆ 185   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c       ┆ a       ┆ 115   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c       ┆ b       ┆ 185   │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c       ┆ c       ┆ 255   │
└─────────┴─────────┴───────┘
A couple of notes:
I'm assuming that a pipe is an appropriate delimiter for your column names.
The use of Python bytecode and an iterator will not significantly impair performance; they are only used to generate the list of expressions, not to run any calculations.
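As a sanity check (a sketch assuming numpy is available; it is not needed by the polars solution itself), the dense gram matrix can be computed directly and compared against the long-format result:
import numpy as np

numbers = df.collect().to_numpy()
print(numbers.T @ numbers)
# [[ 55  85 115]
#  [ 85 135 185]
#  [115 185 255]]
# entry [i][j] matches the "value" row for (df.columns[i], df.columns[j]) above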
