Polars - compute on all other values in a window except this one - python

For each row, I'm trying to compute the standard deviation for the other values in a group excluding the row's value. A way to think about it is "what would the standard deviation for the group be if this row's value was removed". An example may be easier to parse:
df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
I would want to add a column that would have the first row be std([10,7,8]) = 1.527525
I tried to hack something together and ended up with code that is horrible to read and also has a bug that I don't know how to work around:
df.with_column(
    (
        (pl.col("value").sum().over(pl.col("answer")) - pl.col("value"))
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    ).alias("average_other")
).with_column(
    (
        (
            (
                (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
                - (pl.col("value") - pl.col("average_other")).pow(2)
            )
            / (pl.col("value").count().over(pl.col("answer")) - 1)
        ).sqrt()
    ).alias("std_dev_other")
)
I'm not sure I would recommend parsing that, but I'll point out at least one thing that is wrong:
pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
I want to compare "value" in each row of the window to "average_other" from the current row, then square and sum over the window, but instead I am comparing "value" in each row to that same row's "average_other".
My main question is the "what is the best way to get the standard deviation while leaving out this value?" part. But I would also be interested in whether there is a way to do the comparison that I'm getting wrong above. Third would be tips on how to write this in a way that makes it easy to understand what is going on.
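For what it's worth, the leave-one-out standard deviation can also be written with window sums alone, so no join or explode is needed. A sketch (the identity below is standard algebra; the column name std_other is mine, and it assumes the sample standard deviation with at least 3 rows per group):
# Sketch: leave-one-out sample std via window sums.
# For a group with count n, sum S and sum of squares Q, removing value x gives
#   var = (Q - x^2 - (S - x)^2 / (n - 1)) / (n - 2)
n = pl.col("value").count().over("answer")
s = pl.col("value").sum().over("answer")
q = (pl.col("value") ** 2).sum().over("answer")

df.with_column(
    (((q - pl.col("value") ** 2) - (s - pl.col("value")) ** 2 / (n - 1)) / (n - 2))
    .sqrt()
    .alias("std_other")
)
For the example above this gives 1.527525 for the first row, matching std([10, 7, 8]).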

The way I'd come at this (at least at first thought) is to create three helper columns: a row index, a windowed list of the group's values, and a windowed list of the group's row indices. Next I'd explode by the two list columns. With that you can filter out the rows where the actual row index equals the exploded row index. That lets you run std over the values grouped by row index, with each row's own value filtered out. You can then join that result back to the original.
df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
df.with_row_count('i').join(
    df.with_row_count('i')
      .with_columns([
          pl.col('value').list().over('answer').alias('l'),
          pl.col('i').list().over('answer').alias('il'),
      ])
      .explode(['l', 'il'])
      .filter(pl.col('i') != pl.col('il'))
      .groupby('i')
      .agg(pl.col('l').std().alias('std')),
    on='i',
).drop('i')
shape: (7, 3)
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ std │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╡
│ yes ┆ 5 ┆ 1.527525 │
│ yes ┆ 10 ┆ 1.527525 │
│ yes ┆ 7 ┆ 2.516611 │
│ yes ┆ 8 ┆ 2.516611 │
│ maybe ┆ 6 ┆ 0.707107 │
│ maybe ┆ 9 ┆ 2.828427 │
│ maybe ┆ 10 ┆ 2.12132 │
└────────┴───────┴──────────┘

I came up with something similar to @DeanMacGregor's answer:
df = (
    df.with_row_count()
    .join(df.with_row_count(), on="answer")
    .filter(pl.col("row_nr") != pl.col("row_nr_right"))
    .groupby(["answer", "row_nr_right"], maintain_order=True)
    .agg([
        pl.col("value_right").first().alias("value"),
        pl.col("value").std().alias("stdev"),
    ])
    .drop("row_nr_right")
)
Join df, with a row count added, onto itself and remove the rows where the two row counts are identical. Then group by answer and row_nr_right and (1) pick the first item of value_right per group and (2) calculate the standard deviation over the value group.
Result for
df = pl.DataFrame({
    "answer": ["yes", "yes", "yes", "yes", "yes", "maybe", "maybe", "maybe", "maybe"],
    "value": [5, 10, 7, 8, 4, 6, 9, 10, 4],
})
is
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ stdev │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╡
│ yes ┆ 5 ┆ 2.5 │
│ yes ┆ 10 ┆ 1.825742 │
│ yes ┆ 7 ┆ 2.753785 │
│ yes ┆ 8 ┆ 2.645751 │
│ ... ┆ ... ┆ ... │
│ maybe ┆ 6 ┆ 3.21455 │
│ maybe ┆ 9 ┆ 3.05505 │
│ maybe ┆ 10 ┆ 2.516611 │
│ maybe ┆ 4 ┆ 2.081666 │
└────────┴───────┴──────────┘

I am not sure if my solution retains the (performance) advantages of using polars (instead of regular pandas), but I find it easier to maintain and more readable than the other answers.
With all the iterative conversion of data from Series to a list, I expect this won't scale well, but perhaps your use case does not require that.
Starting with the data (thanks to this answer for a minimal working dataset):
import polars as pl
import statistics as stats

df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
df
shape: (7, 2)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
Now define a custom function to do the actual calculation (thanks to this answer for an example of a custom polars aggregation function).
That is: take a list of values, move through the list one by one, and calculate the standard deviation of the values in the list except for the current value:
def custom_std(args: list[pl.Series]) -> pl.Series:
    output = []
    # Iterate over the values within the group
    for idx in range(0, len(args[0])):
        # Convert the Series to a list, as Polars does not have a method that can delete individual elements
        temp = args[0].to_list()
        # Delete the value of the current row
        del temp[idx]
        # Now calculate the std. dev. on the remaining values
        # I arbitrarily chose the sample standard deviation; adjust according to your situation
        result = stats.stdev(temp)
        # Store the result in a list and go to the next iteration
        output.append(result)
    # The dtype is not correct, but I can't find how to specify that this series contains a list of floats
    return pl.Series(output, dtype=pl.Float64)
Use this function in a groupby:
gdf = df.groupby(by=["answer"], maintain_order=True).agg(pl.apply(f=custom_std, exprs=["value"]))
gdf
┌────────┬─────────────────────────────────────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ list [f64] │
╞════════╪═════════════════════════════════════╡
│ yes ┆ [1.247219, 1.247219, ... 2.05480... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ [0.5, 2.0, 1.5] │
└────────┴─────────────────────────────────────┘
To get it in the desired format, explode the resulting DataFrame:
gdf.explode("value")
┌────────┬──────────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ yes ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 0.707107 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 2.828427 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 2.1213 │
└────────┴──────────┘

Related

how to imitate Pandas' index-based querying in Polars?

Any idea what I can do to imitate the below pandas code using polars? Polars doesn't have indexes like pandas, so I couldn't figure out what to do.
df = pd.DataFrame(data = ([21,123], [132,412], [23, 43]), columns = ['c1', 'c2']).set_index("c1")
print(df.loc[[23, 132]])
and it prints
      c2
c1
23    43
132  412
the only polars conversion I could figure out to do is
df = pl.DataFrame(data = ([21,123], [132,412], [23, 43]), schema = ['c1', 'c2'])
print(df.filter(pl.col("c1").is_in([23, 132])))
but it prints
┌─────┬─────┐
│ c1  ┆ c2  │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 132 ┆ 412 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 23  ┆ 43  │
└─────┴─────┘
which is okay, but the rows are not in the order I gave. I gave [23, 132] and want the output rows to be in that same order, like the pandas output.
I can use a sort() later, yes, but the original data I use this on has around 30 million rows, so I'm looking for something that's as fast as possible.
I suggest using a left join to accomplish this. This will maintain the order corresponding to your list of index values. (And it is quite performant.)
For example, let's start with this shuffled DataFrame.
nbr_rows = 30_000_000
df = pl.DataFrame({
    'c1': pl.arange(0, nbr_rows, eager=True).shuffle(2),
    'c2': pl.arange(0, nbr_rows, eager=True).shuffle(3),
})
df
shape: (30000000, 2)
┌──────────┬──────────┐
│ c1 ┆ c2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪══════════╡
│ 4052015 ┆ 20642741 │
│ 7787054 ┆ 17007051 │
│ 20246150 ┆ 19445431 │
│ 1309992 ┆ 6495751 │
│ ... ┆ ... │
│ 10371090 ┆ 4791782 │
│ 26281644 ┆ 12350777 │
│ 6740626 ┆ 24888572 │
│ 22573405 ┆ 14885989 │
└──────────┴──────────┘
And these index values:
nbr_index_values = 10_000
s1 = pl.Series(name='c1', values=pl.arange(0, nbr_index_values, eager=True).shuffle())
s1
shape: (10000,)
Series: 'c1' [i64]
[
1754
6716
3485
7058
7216
1040
1832
3921
1639
6734
5560
7596
...
4243
4455
894
7806
9291
1883
9947
3309
2030
7731
4706
8528
8426
]
We now perform a left join to obtain the rows corresponding to the index values. (Note that the list of index values is the left DataFrame in this join.)
import time

start = time.perf_counter()
df2 = (
    s1.to_frame()
    .join(
        df,
        on='c1',
        how='left'
    )
)
print(time.perf_counter() - start)
df2
>>> print(time.perf_counter() - start)
0.8427023889998964
shape: (10000, 2)
┌──────┬──────────┐
│ c1 ┆ c2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════════╡
│ 1754 ┆ 15734441 │
│ 6716 ┆ 20631535 │
│ 3485 ┆ 20199121 │
│ 7058 ┆ 15881128 │
│ ... ┆ ... │
│ 7731 ┆ 19420197 │
│ 4706 ┆ 16918008 │
│ 8528 ┆ 5278904 │
│ 8426 ┆ 18927935 │
└──────┴──────────┘
Notice how the rows are in the same order as the index values. We can verify this:
s1.series_equal(df2.get_column('c1'), strict=True)
>>> s1.series_equal(df2.get_column('c1'), strict=True)
True
And the performance is quite good. On my 32-core system, this takes less than a second.
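If this pattern comes up often, it can be wrapped in a small helper. This is just a convenience sketch; the name loc_like is mine, not a Polars API:
def loc_like(df: pl.DataFrame, on: str, values: list) -> pl.DataFrame:
    # Emulate pandas .loc[list-of-keys]: left-join the requested keys onto df,
    # which returns the rows in the order the keys were given.
    return pl.DataFrame({on: values}).join(df, on=on, how="left")

# e.g. loc_like(df, "c1", [23, 132]) returns the row for 23 first, then the row for 132.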

Polars Selecting all columns without NaNs

I have a dataframe where a number of the columns consist only of NaNs. I am trying to use Polars to select only the columns that are not made up entirely of NaN values.
I have tried seeing if I could use a similar syntax to how I would proceed in Pandas e.g.
df[df.columns[~df.is_null().all()]]
However the syntax doesn't translate.
I also know that you can use pl.filter, but this only filters rows, not columns, based on the criteria applied within the filter expression.
So this is basically subsetting columns with a boolean mask.
So first let's create some sample data:
import polars as pl
import numpy as np

df = pl.DataFrame(
    {
        "a": [np.nan, np.nan, np.nan, np.nan],
        "b": [3, 4, np.nan, 5],
        "c": [np.nan, np.nan, np.nan, np.nan],
    }
)
Next we need to determine, for each column, whether it does not consist entirely of NaN values:
df.select(pl.all().is_nan().all().is_not())
shape: (1, 3)
┌───────┬──────┬───────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪══════╪═══════╡
│ false ┆ true ┆ false │
└───────┴──────┴───────┘
To get this DataFrame as an iterable, we use the row function:
df.select(pl.all().is_nan().all().is_not()).row(0)
(False, True, False)
This we can now use in the bracket notation
df[:, df.select(pl.all().is_nan().all().is_not()).row(0)]
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
├╌╌╌╌╌┤
│ 4.0 │
├╌╌╌╌╌┤
│ NaN │
├╌╌╌╌╌┤
│ 5.0 │
└─────┘
Since bracket notation is generally not recommended, we can also do this with select (to make it more concise, we use the compress function from itertools):
from itertools import compress
df.select(compress(df.columns, df.select(pl.all().is_nan().all().is_not()).row(0)))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
├╌╌╌╌╌┤
│ 4.0 │
├╌╌╌╌╌┤
│ NaN │
├╌╌╌╌╌┤
│ 5.0 │
└─────┘
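A slightly more direct variant along the same lines, as a sketch, builds the list of column names in plain Python and passes it to select:
# Keep only the columns that are not entirely NaN (column list built in Python).
keep = [
    name
    for name, all_nan in zip(df.columns, df.select(pl.all().is_nan().all()).row(0))
    if not all_nan
]
df.select(keep)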

How to get an index of maximum count of a required string in a list column of polars data frame?

I have a polars dataframe as
df = pl.DataFrame({'doc_id': [
    ['83;45;32;65;13', '7;8;9'],
    ['9;4;5', '4;2;7;3;5;8;10;11'],
    ['1000;2000', '76;34;100001;7474;2924'],
    ['100;200', '200;100'],
    ['3;4;6;7;10;11', '1;2;3;4;5'],
]})
Each list consists of document ids separated by semicolons. The index of the list element with the most semicolons needs to be found, and a new column len_idx_at created and filled with that index.
For example:
['83;45;32;65;13','7;8;9']
This list has two elements: the first element contains 4 semicolons and hence 5 documents, and the second contains 2 semicolons and hence 3 documents.
Here we should take the index of the element with the highest document count; in this case it is index 0, because that element has 4 semicolons.
The expected output is:
shape: (5, 2)
┌─────────────────────────────────────┬────────────┐
│ doc_id ┆ len_idx_at │
│ --- ┆ --- │
│ list[str] ┆ i64 │
╞═════════════════════════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 0 │
└─────────────────────────────────────┴────────────┘
If all elements in a list have equal semicolon counts, index zero is preferred, as shown in the output above.
(
    df.with_columns(
        [
            # Get first and second list of documents as string element.
            pl.col("doc_id").arr.get(0).alias("doc_list1"),
            pl.col("doc_id").arr.get(1).alias("doc_list2"),
        ]
    )
    .with_columns(
        [
            # Split each doc list element on ";" and count number of splits.
            pl.col("doc_list1").str.split(";").arr.lengths().alias("doc_list1_count"),
            pl.col("doc_list2").str.split(";").arr.lengths().alias("doc_list2_count"),
        ]
    )
    .with_column(
        # Get the wanted index based on which list is longer.
        pl.when(pl.col("doc_list1_count") >= pl.col("doc_list2_count"))
        .then(0)
        .otherwise(1)
        .alias("len_idx_at")
    )
)
Out[11]:
shape: (5, 6)
┌─────────────────────────────────────┬────────────────┬────────────────────────┬─────────────────┬─────────────────┬────────────┐
│ doc_id ┆ doc_list1 ┆ doc_list2 ┆ doc_list1_count ┆ doc_list2_count ┆ len_idx_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ str ┆ u32 ┆ u32 ┆ i64 │
╞═════════════════════════════════════╪════════════════╪════════════════════════╪═════════════════╪═════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 83;45;32;65;13 ┆ 7;8;9 ┆ 5 ┆ 3 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 9;4;5 ┆ 4;2;7;3;5;8;10;11 ┆ 3 ┆ 8 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1000;2000 ┆ 76;34;100001;7474;2924 ┆ 2 ┆ 5 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 100;200 ┆ 200;100 ┆ 2 ┆ 2 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 3;4;6;7;10;11 ┆ 1;2;3;4;5 ┆ 6 ┆ 5 ┆ 0 │
└─────────────────────────────────────┴────────────────┴────────────────────────┴─────────────────┴─────────────────┴────────────┘
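The above handles lists with exactly two elements. A more general sketch, assuming arr.eval and arr.arg_max are available in your Polars version (and relying on arg_max returning the first index on ties, which matches the zero-index preference):
# Sketch: count the documents in every list element, then take the index of the maximum.
df.with_column(
    pl.col("doc_id")
    .arr.eval(pl.element().str.split(";").arr.lengths())
    .arr.arg_max()
    .alias("len_idx_at")
)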

Polars - Perform matrix inner product on lazy frames to produce sparse representation of gram matrix

Suppose we have a polars dataframe like:
df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]}).lazy()
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 │
└─────┴─────┘
I would like to compute X^T X on the matrix while preserving the sparse matrix format for arrow* - in pandas I would do something like:
pdf = df.collect().to_pandas()
numbers = pdf[["a", "b"]]
(numbers.T @ numbers).melt(ignore_index=False)
  variable  value
a        a     14
b        a     26
a        b     26
b        b     50
I did something like this in polars:
df.select(
    [
        (pl.col("a") * pl.col("a")).sum().alias("aa"),
        (pl.col("a") * pl.col("b")).sum().alias("ab"),
        (pl.col("b") * pl.col("a")).sum().alias("ba"),
        (pl.col("b") * pl.col("b")).sum().alias("bb"),
    ]
).melt().collect()
shape: (4, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════╪═══════╡
│ aa ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ab ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ba ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bb ┆ 50 │
└──────────┴───────┘
Which is almost there but not quite. This is a hack to get around the fact that I can't store lists as the column names (and then I could unnest them to become two different columns representing the x and y axis of the matrix). Is there a way to get the same format as shown in the pandas example?
*arrow is a columnar data format which means it's performant when scaled across rows but not across columns, which is why I think the sparse matrix representation is better if I want to use the results of the gram matrix chained with pl.LazyFrames later down the graph. I could be wrong though!
Polars doesn't have matrix multiplication, but we can tweak your algorithm slightly to accomplish what we need:
- use the built-in dot expression
- calculate each inner product only once, since <a, b> = <b, a>; we'll use Python's combinations_with_replacement iterator from itertools to accomplish this
- automatically generate the list of expressions, which will run in parallel
Let's expand your data a bit:
from itertools import combinations_with_replacement
import polars as pl

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [3, 4, 5, 6, 7], "c": [5, 6, 7, 8, 9]}
).lazy()
df.collect()
shape: (5, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 6 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 7 ┆ 9 │
└─────┴─────┴─────┘
The algorithm would be as follows:
expr_list = [
    pl.col(col1).dot(pl.col(col2)).alias(col1 + "|" + col2)
    for col1, col2 in combinations_with_replacement(df.columns, 2)
]
dot_prods = (
    df
    .select(expr_list)
    .melt()
    .with_column(
        pl.col('variable').str.split_exact('|', 1)
    )
    .unnest('variable')
    .cache()
)
result = (
    pl.concat([
        dot_prods,
        dot_prods
        .filter(pl.col('field_0') != pl.col('field_1'))
        .select(['field_1', 'field_0', 'value'])
        .rename({'field_0': 'field_1', 'field_1': 'field_0'}),
    ])
    .sort(['field_0', 'field_1'])
)
result.collect()
shape: (9, 3)
┌─────────┬─────────┬───────┐
│ field_0 ┆ field_1 ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪═════════╪═══════╡
│ a ┆ a ┆ 55 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ b ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ c ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ a ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ b ┆ 135 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ c ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ a ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ b ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ c ┆ 255 │
└─────────┴─────────┴───────┘
A couple of notes:
- I'm assuming that a pipe is an appropriate delimiter for your column names.
- The use of a Python-level iterator will not significantly impair performance; it is only used to generate the list of expressions, not to run any calculations.
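If you ever want to inspect the dense form, a pivot on the collected frame recovers the full gram matrix. A sketch; since every (field_0, field_1) pair occurs exactly once, the default aggregation behaviour of pivot is fine here:
# Dense 3 x 3 view of the same gram matrix, for inspection only.
result.collect().pivot(values="value", index="field_0", columns="field_1")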

Polars: switching between dtypes within a DataFrame

I was trying to find whether there is a way to easily change the dtype of strings that contain numbers. For example, the problem I face is as follows:
df = pl.DataFrame({"foo": ["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]})
I could extract the numbers and create a new column {"bar": ["100", "250", "125", ""]}. But then I couldn't find a handy function that converts this column to an Int64 or Float dtype so that the result is [100, 250, 125, null].
Also, vice versa. Sometimes it would be useful to have a handy function that converts the column of [100, 250, 125, 0] to ["100", "250", "125", "0"]. Is it something that already exists?
Thanks!
The easiest way to accomplish this is with the cast expression.
String to Int/Float
To cast from a string to an integer (or float):
import polars as pl
df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_column(pl.col('bar').cast(pl.Int64, strict=False).alias('bar_int'))
shape: (4, 2)
┌─────┬─────────┐
│ bar ┆ bar_int │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════════╡
│ 100 ┆ 100 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 250 ┆ 250 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 125 ┆ 125 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ┆ null │
└─────┴─────────┘
A handy list of available datatypes is here. These are all aliased under polars, so you can refer to them easily (e.g., pl.UInt64).
For the data you describe, I recommend using strict=False to avoid having one mangled number among millions of records result in an exception that halts everything.
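As a sketch for the extraction step mentioned in the question (the regex and the column name bar are my own assumptions), str.extract plus cast gets from the raw strings to integers in one pass:
df = pl.DataFrame({"foo": ["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]})
df.with_column(
    pl.col("foo")
    .str.extract(r"(\d+)CT", 1)        # pull the digits in front of "CT"; no match -> null
    .cast(pl.Int64, strict=False)
    .alias("bar")
)
# bar: [100, 250, 125, null]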
Int/Float to String
The same process can be used to convert numbers to strings - in this case, the utf8 datatype.
Let me modify your dataset slightly:
df = pl.DataFrame({"bar": [100.5, 250.25, 1250000, None]})
df.with_column(pl.col("bar").cast(pl.Utf8, strict=False).alias("bar_string"))
shape: (4, 2)
┌────────┬────────────┐
│ bar ┆ bar_string │
│ --- ┆ --- │
│ f64 ┆ str │
╞════════╪════════════╡
│ 100.5 ┆ 100.5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250.25 ┆ 250.25 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.25e6 ┆ 1250000.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└────────┴────────────┘
If you need more control over the formatting, you can use the apply method and Python's new f-string formatting.
df.with_column(
    pl.col("bar").apply(lambda x: f"This is ${x:,.2f}!").alias("bar_fstring")
)
shape: (4, 2)
┌────────┬────────────────────────┐
│ bar ┆ bar_fstring │
│ --- ┆ --- │
│ f64 ┆ str │
╞════════╪════════════════════════╡
│ 100.5 ┆ This is $100.50! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250.25 ┆ This is $250.25! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.25e6 ┆ This is $1,250,000.00! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└────────┴────────────────────────┘
I found this web page to be a handy reference for those unfamiliar with f-string formatting.
As an addition to @cbilot's answer: you don't need slow Python lambda functions to apply special string formatting to expressions. Polars has a format function for this purpose:
df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_columns([
pl.format("This is {}!", pl.col("bar"))
])
shape: (4, 2)
┌─────┬──────────────┐
│ bar ┆ literal │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════════╡
│ 100 ┆ This is 100! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250 ┆ This is 250! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 125 ┆ This is 125! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ┆ This is ! │
└─────┴──────────────┘
For other data manipulation in polars, like string to datetime, use strptime().
import polars as pl

# df_pandas is an existing pandas DataFrame with string columns "dates_col" and "ticker"
df = pl.DataFrame(df_pandas)
df
shape: (100, 2)
┌────────────┬────────┐
│ dates_col ┆ ticker │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪════════╡
│ 2022-02-25 ┆ RDW │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2008-05-28 ┆ ARTX │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2015-05-21 ┆ CBAT │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2009-02-09 ┆ ANNB │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
Use it like this, converting the string column to datetime:
df.with_column(pl.col("dates_col").str.strptime(pl.Datetime, fmt="%Y-%m-%d"))
shape: (100, 2)
┌─────────────────────┬────────┐
│ dates_col ┆ ticker │
│ --- ┆ --- │
│ datetime[μs] ┆ str │
╞═════════════════════╪════════╡
│ 2022-02-25 00:00:00 ┆ RDW │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2008-05-28 00:00:00 ┆ ARTX │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2015-05-21 00:00:00 ┆ CBAT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2009-02-09 00:00:00 ┆ ANNB │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
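Going back the other way (datetime to string) works through the dt namespace with dt.strftime. A sketch; the output format string and column name are arbitrary:
df.with_column(
    pl.col("dates_col")
    .str.strptime(pl.Datetime, fmt="%Y-%m-%d")
    .dt.strftime("%d %b %Y")      # datetime back to a formatted string
    .alias("dates_str")
)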
