String manipulation in polars - python

I have a dataset in polars which has no proper header yet. The header should come from the first row of the data. Before I promote this row to the header, I want to manipulate its entries.
import polars as pl
# Creating a dictionary with the data
data = {
    "Column_1": ["ID", 4, 4, 4, 4],
    "Column_2": ["LocalValue", "B", "C", "D", "E"],
    "Column_3": ["Data\nField", "Q", "R", "S", "T"],
    "Column_4": [None, None, None, None, None],
    "Column_5": ["Global Value", "G", "H", "I", "J"],
}
# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌──────────┬────────────┬──────────┬──────────┬──────────────┐
│ Column_1 ┆ Column_2 ┆ Column_3 ┆ Column_4 ┆ Column_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ str │
╞══════════╪════════════╪══════════╪══════════╪══════════════╡
│ ID ┆ LocalValue ┆ Data ┆ null ┆ Global Value │
│ ┆ ┆ Field ┆ ┆ │
│ null ┆ B ┆ Q ┆ null ┆ G │
│ null ┆ C ┆ R ┆ null ┆ H │
│ null ┆ D ┆ S ┆ null ┆ I │
│ null ┆ E ┆ T ┆ null ┆ J │
└──────────┴────────────┴──────────┴──────────┴──────────────┘
First, I want to replace line breaks and spaces between words with an underscore. Furthermore, I want to split CamelCase words with an underscore (e.g. TestTest -> Test_Test). Finally, all entries should be lowercase. For this I wrote the following function:
def clean_dataframe_columns(df):
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    for entry in header:
        if entry:
            entry = (
                entry.str.replace("\n", "_")
                .replace("(?<=[a-z])(?=[A-Z])", "_")
                .replace("\s", "_")
                .to_lowercase()
            )
        else:
            entry = "no_column"
        cleaned_headers.append(entry)
    df.columns = cleaned_headers
    return df
Unfortunately I have the following error. What am I doing wrong?
AttributeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 df1 = clean_dataframe_columns(df)
Cell In[12], line 7, in clean_dataframe_columns(df)
4 for entry in header:
5 if entry:
6 entry = (
----> 7 entry.str.replace("\n", "_")
8 .replace("(?<=[a-z])(?=[A-Z])", "_")
9 .replace("\s", "_")
10 .to_lowercase()
11 )
12 else:
13 entry = "no_column"
AttributeError: 'str' object has no attribute 'str'
The goal should be this dataframe:
shape: (4, 5)
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 ┆ str │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4 ┆ B ┆ Q ┆ null ┆ G │
│ 4 ┆ C ┆ R ┆ null ┆ H │
│ 4 ┆ D ┆ S ┆ null ┆ I │
│ 4 ┆ E ┆ T ┆ null ┆ J │
└─────┴─────────────┴────────────┴───────────┴──────────────┘

In for entry in header: you iterate over plain Python strings, so you should use the corresponding string methods (like .lower() instead of .to_lowercase()) and re.sub for the regex replacements.
Rewritten solution:
import re

def get_cols(raw_col):
    if raw_col is None:
        return "no_column"
    raw_col = re.sub("(?<=[a-z])(?=[A-Z])", "_", raw_col)
    return raw_col.replace("\n", "_").replace(" ", "_").lower()

def clean_dataframe_columns(df):
    raw_cols = df.head(1).transpose().to_series().to_list()
    return df.rename({
        col: get_cols(raw_col) for col, raw_col in zip(df.columns, raw_cols)
    }).slice(1).with_column(pl.col("id").fill_null(4).cast(pl.Int32))
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ str │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4 ┆ B ┆ Q ┆ null ┆ G │
│ 4 ┆ C ┆ R ┆ null ┆ H │
│ 4 ┆ D ┆ S ┆ null ┆ I │
│ 4 ┆ E ┆ T ┆ null ┆ J │
└─────┴─────────────┴────────────┴───────────┴──────────────┘

I solved it on my own with this approach:
import re

def clean_select_columns(self, df: pl.DataFrame) -> pl.DataFrame:
    """
    Clean columns from a dataframe.

    :param df: input Dataframe
    :return: Dataframe with cleaned columns

    The function takes a loaded Dataframe and performs the following operations:
    Transposes the first row of the dataframe to get the header.
    Selects the required columns defined in the list required_columns.
    Cleans the header names by:
    1. Replacing special characters with underscores
    2. Converting CamelCase strings to snake_case strings
    3. Converting all columns to lowercase
    4. Naming columns with no names as "no_column_X", where X is a unique integer
    5. Returns the cleaned dataframe.
    """
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    i = 0
    for entry in header:
        if entry:
            entry = (
                re.sub(r"(?i)([\n ?])", "",
                       re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry))
                .lower()
            )
        else:
            entry = f"no_column_{i}"
        cleaned_headers.append(entry)
        i += 1
    df.columns = cleaned_headers
    return df
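For reference, here is a quick check of the two regex steps from the function above on the sample header strings (a minimal sketch using plain Python strings):
import re

def clean(entry: str) -> str:
    # Insert "_" before an interior CamelCase boundary, then strip newlines and spaces.
    entry = re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry)
    return re.sub(r"(?i)([\n ?])", "", entry).lower()

print(clean("LocalValue"))    # local_value
print(clean("Data\nField"))   # data_field
print(clean("Global Value"))  # global_value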

Related

Polars - compute on all other values in a window except this one

For each row, I'm trying to compute the standard deviation for the other values in a group excluding the row's value. A way to think about it is "what would the standard deviation for the group be if this row's value was removed". An example may be easier to parse:
df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
I would want to add a column that would have the first row be std([10,7,8]) = 1.527525
I tried to hack something together and ended up with code that is horrible to read and also has a bug that I don't know how to work around:
df.with_column(
    (
        (pl.col("value").sum().over(pl.col("answer")) - pl.col("value"))
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    ).alias("average_other")
).with_column(
    (
        (
            (
                (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
                - (pl.col("value") - pl.col("average_other")).pow(2)
            )
            / (pl.col("value").count().over(pl.col("answer")) - 1)
        ).sqrt()
    ).alias("std_dev_other")
)
I'm not sure I would recommend parsing that, but I'll point out at least one thing that is wrong:
pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
I want to compare "value" in each row of the window to "average_other" from the current row, then square and sum over the window, but instead I am comparing "value" in each row to the "average_other" of that same row.
My main question is the "what is the best way to get the standard deviation while leaving out this value?" part. But I would also be interested if there is a way to do the comparison that I'm doing wrong above. Third would be tips on how to write this in way that is easy to understand what is going on.
The way I'd come at this (at least at first thought) is to create three helper columns: the first being a row index, the second a windowed list of the values in the group, and the last a windowed list of the row indices. Next I'd explode by the two aforementioned lists. With that you can filter out the rows where the actual row index equals the list row index. That allows you to run std against the values grouped by row index, with each row's own value filtered out. You can join that result back to the original.
df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
df.with_row_count('i').join(
    df.with_row_count('i')
      .with_columns([
          pl.col('value').list().over('answer').alias('l'),
          pl.col('i').list().over('answer').alias('il')])
      .explode(['l', 'il']).filter(pl.col('i') != pl.col('il'))
      .groupby('i').agg(pl.col('l').std().alias('std')),
    on='i').drop('i')
shape: (7, 3)
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ std │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╡
│ yes ┆ 5 ┆ 1.527525 │
│ yes ┆ 10 ┆ 1.527525 │
│ yes ┆ 7 ┆ 2.516611 │
│ yes ┆ 8 ┆ 2.516611 │
│ maybe ┆ 6 ┆ 0.707107 │
│ maybe ┆ 9 ┆ 2.828427 │
│ maybe ┆ 10 ┆ 2.12132 │
└────────┴───────┴──────────┘
I came up with something similar to @DeanMacGregor's answer:
df = (
    df.with_row_count()
    .join(df.with_row_count(), on="answer")
    .filter(pl.col("row_nr") != pl.col("row_nr_right"))
    .groupby(["answer", "row_nr_right"], maintain_order=True).agg([
        pl.col("value_right").first().alias("value"),
        pl.col("value").std().alias("stdev"),
    ])
    .drop("row_nr_right")
)
Join df (with a row count added) onto itself and remove the rows where the two row counts are identical. Then group by answer and row_nr_right and (1) pick the first group item out of value_right and (2) calculate the standard deviation over the value group.
Result for
df = pl.DataFrame({
    "answer": ["yes", "yes", "yes", "yes", "yes", "maybe", "maybe", "maybe", "maybe"],
    "value": [5, 10, 7, 8, 4, 6, 9, 10, 4],
})
is
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ stdev │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╡
│ yes ┆ 5 ┆ 2.5 │
│ yes ┆ 10 ┆ 1.825742 │
│ yes ┆ 7 ┆ 2.753785 │
│ yes ┆ 8 ┆ 2.645751 │
│ ... ┆ ... ┆ ... │
│ maybe ┆ 6 ┆ 3.21455 │
│ maybe ┆ 9 ┆ 3.05505 │
│ maybe ┆ 10 ┆ 2.516611 │
│ maybe ┆ 4 ┆ 2.081666 │
└────────┴───────┴──────────┘
I am not sure if my solution retains the (performance) advantages of using polars (instead of regular pandas), but I find it easier to maintain and more readable than the other answers.
With all the iterative conversion of data from Series to a list, I expect this won't scale well, but perhaps your use case does not require that.
Starting with the data (thanks to this answer for a minimal working dataset):
import polars as pl
import statistics as stats
df = pl.DataFrame(
    {
        "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
        "value": [5, 10, 7, 8, 6, 9, 10],
    }
)
df
shape: (7, 2)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
Now define a custom function to do the actual calculation (thanks to this answer for an example of a custom polars aggregation function).
I.e. take a list of values, move one-by-one through the list and calculate the standard deviation of the values in the list, except for the current value:
def custom_std(args: list[pl.Series]) -> pl.Series:
    output = []
    # Iterate over the values within the group
    for idx in range(0, len(args[0])):
        # Convert the Series to a list, as Polars does not have a method that can delete individual elements
        temp = args[0].to_list()
        # Delete the value of the current row
        del temp[idx]
        # Now calculate the std. dev. on the remaining values
        # I arbitrarily chose the sample standard deviation, adjust accordingly to your situation
        result = stats.stdev(temp)
        # Store the result in a list and go to the next iteration
        output.append(result)
    # The dtype is not correct, but I can't find how to specify that this series contains a list of floats
    return pl.Series(output, dtype=pl.Float64)
Use this function in a groupby:
gdf = df.groupby(by=["answer"], maintain_order=True).agg(pl.apply(f=custom_std, exprs=["value"]))
gdf
┌────────┬─────────────────────────────────────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ list [f64] │
╞════════╪═════════════════════════════════════╡
│ yes ┆ [1.247219, 1.247219, ... 2.05480... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ [0.5, 2.0, 1.5] │
└────────┴─────────────────────────────────────┘
To get it in the desired format explode the resulting DataFrame:
gdf.explode("value")
┌────────┬──────────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ yes ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 0.707107 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 2.828427 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 2.1213 │
└────────┴──────────┘
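As a further note (not part of the answers above): the leave-one-out sample standard deviation can also be written purely with window expressions, using the group count, sum, and sum of squares. This is only a sketch based on the usual algebraic identity for the sample variance of the remaining values; the column name std_dev_other is just an example.
import polars as pl

df = pl.DataFrame({
    "answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
    "value": [5, 10, 7, 8, 6, 9, 10],
})

x = pl.col("value")
n = x.count().over("answer")       # group size
s = x.sum().over("answer")         # group sum
q = x.pow(2).sum().over("answer")  # group sum of squares

# Sample variance of the group with the current row's value removed:
# ((q - x^2) - (s - x)^2 / (n - 1)) / (n - 2)
# Note: groups need at least 3 rows, otherwise the leave-one-out sample std is undefined.
loo_var = ((q - x.pow(2)) - (s - x).pow(2) / (n - 1)) / (n - 2)

df.with_column(loo_var.sqrt().alias("std_dev_other"))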

How to get an index of maximum count of a required string in a list column of polars data frame?

I have a polars dataframe as
pl.DataFrame({'doc_id':[
['83;45;32;65;13','7;8;9'],
['9;4;5','4;2;7;3;5;8;10;11'],
['1000;2000','76;34;100001;7474;2924'],
['100;200','200;100'],
['3;4;6;7;10;11','1;2;3;4;5']
]})
Each list consists of document IDs separated by semicolons. For every row, the index of the list element with the most semicolons (i.e. the most documents) needs to be found, and a new column len_idx_at should be created and filled with that index.
For example:
['83;45;32;65;13','7;8;9']
This list has two elements: the first element contains 4 semicolons and hence 5 documents, and the second element contains 2 semicolons and hence 3 documents.
Here we should take the index of the element with the highest document count - in the above case it is index 0, because that element has 4 semicolons.
The expected output is:
shape: (5, 2)
┌─────────────────────────────────────┬────────────┐
│ doc_id ┆ len_idx_at │
│ --- ┆ --- │
│ list[str] ┆ i64 │
╞═════════════════════════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 0 │
└─────────────────────────────────────┴────────────┘
In case all elements in a list have equal semicolon counts, index zero is preferred, as shown in the output above.
(
    df.with_columns(
        [
            # Get first and second list of documents as string element.
            pl.col("doc_id").arr.get(0).alias("doc_list1"),
            pl.col("doc_id").arr.get(1).alias("doc_list2"),
        ]
    )
    .with_columns(
        [
            # Split each doc list element on ";" and count number of splits.
            pl.col("doc_list1").str.split(";").arr.lengths().alias("doc_list1_count"),
            pl.col("doc_list2").str.split(";").arr.lengths().alias("doc_list2_count"),
        ]
    )
    .with_column(
        # Get the wanted index based on which list is longer.
        pl.when(
            pl.col("doc_list1_count") >= pl.col("doc_list2_count")
        )
        .then(0)
        .otherwise(1)
        .alias("len_idx_at")
    )
)
Out[11]:
shape: (5, 6)
┌─────────────────────────────────────┬────────────────┬────────────────────────┬─────────────────┬─────────────────┬────────────┐
│ doc_id ┆ doc_list1 ┆ doc_list2 ┆ doc_list1_count ┆ doc_list2_count ┆ len_idx_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ str ┆ u32 ┆ u32 ┆ i64 │
╞═════════════════════════════════════╪════════════════╪════════════════════════╪═════════════════╪═════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 83;45;32;65;13 ┆ 7;8;9 ┆ 5 ┆ 3 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 9;4;5 ┆ 4;2;7;3;5;8;10;11 ┆ 3 ┆ 8 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1000;2000 ┆ 76;34;100001;7474;2924 ┆ 2 ┆ 5 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 100;200 ┆ 200;100 ┆ 2 ┆ 2 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 3;4;6;7;10;11 ┆ 1;2;3;4;5 ┆ 6 ┆ 5 ┆ 0 │
└─────────────────────────────────────┴────────────────┴────────────────────────┴─────────────────┴─────────────────┴────────────┘
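As a possible generalization that avoids hard-coding the two list positions, the index could also be computed with a single expression over the list column. This is only a sketch and assumes a polars version where the list namespace is arr and arr.eval / arr.arg_max are available:
df.with_column(
    # Count the documents in every list element, then take the index of the
    # largest count; arg_max returns the first maximum, so ties resolve to index 0.
    pl.col("doc_id")
    .arr.eval(pl.element().str.split(";").arr.lengths())
    .arr.arg_max()
    .alias("len_idx_at")
)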

How to swap column values on conditions in python polars?

I have a data frame as below:
import numpy as np
import pandas as pd
import polars as pl

df_n = pl.from_pandas(pd.DataFrame({'last_name': [np.nan, 'mallesh', 'bhavik'],
                                    'first_name': ['a', 'b', 'c'],
                                    'middle_name_or_initial': ['aa', 'bb', 'cc']}))
Here I would like to find the observations where first and middle name are not NULL but last name is NULL. In this case first_name should be moved to last_name, middle_name_or_initial to first_name, and middle_name_or_initial should become EMPTY.
The expected output is the table with the names swapped accordingly (as shown in the answers below).
I'm trying with this command:
df_n.with_columns([
    pl.when(
        (pl.col('first_name').is_not_null())
        & (pl.col('middle_name_or_initial').is_not_null())
        & (pl.col('last_name').is_null())
    ).then(pl.col('first_name').alias('last_name')).otherwise(pl.col('last_name').alias('first_name')),
    pl.when(
        (pl.col('first_name').is_not_null())
        & (pl.col('middle_name_or_initial').is_not_null())
        & (pl.col('last_name').is_null())
    ).then(pl.col('middle_name_or_initial').alias('first_name')).otherwise('').alias('middle_name_or_initial')
])
However, it throws the wrong output. Any help?
You can actually change the values of multiple columns within a single when/then/otherwise statement.
The Algorithm
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
    .drop(name_cols)
    .unnest('name_struct')
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
How it works
To change the values of multiple columns within a single when/then/otherwise statement, we can use structs. But you must observe some rules with structs. In all your then and otherwise statements, your structs must have:
the same field names
in the same order
with the same data type in corresponding fields.
So, in both the then and otherwise statements, I'm going to create a struct with field names in this order:
last_name: string
first_name: string
middle_name_or_initial: string
In our then statement, I'm swapping values and using alias to ensure that my fields names are as stated above. (This is important.)
.then(pl.struct([
    pl.col('first_name').alias('last_name'),
    pl.col('middle_name_or_initial').alias('first_name'),
    pl.col('last_name').alias('middle_name_or_initial'),
]))
And in the otherwise statement, we'll simply name the existing columns that we want, in the order that we want - using the list name_cols that I created in a previous step.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
...
.otherwise(pl.struct(name_cols))
Here's the result after the when/then/otherwise statement.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
)
shape: (3, 4)
┌───────────┬────────────┬────────────────────────┬──────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial ┆ name_struct │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ struct[3] │
╞═══════════╪════════════╪════════════════════════╪══════════════════════╡
│ null ┆ a ┆ aa ┆ {"a","aa",null} │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb ┆ {"mallesh","b","bb"} │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc ┆ {"bhavik","c","cc"} │
└───────────┴────────────┴────────────────────────┴──────────────────────┘
Notice that our new struct name_struct has the values that we want - in the correct order.
Next, we will use unnest to break the struct into separate columns. (But first, we must drop the existing columns so that we don't get 2 sets of columns with the same names.)
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
    .drop(name_cols)
    .unnest('name_struct')
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
With pl.when().then().otherwise() you create values for only one column (so only one alias at the end is allowed).
In [67]: df_n.with_columns(
...: [
...: # Create temp column with filter, so it does not have to be recalculated 3 times.
...: ((pl.col('first_name').is_not_null()) & (pl.col('middle_name_or_initial').is_not_null()) & (pl.col('last_name').is_null())).alias("swap_names")
...: ]
...: ).with_columns(
...: [
...: # Create new columns with the correct value based on the swap_names column.
...: pl.when(pl.col("swap_names")).then(pl.col("first_name")).otherwise(pl.col("last_name")).alias("last_name_new"),
...: pl.when(pl.col("swap_names")).then(pl.col("middle_name_or_initial")).otherwise(pl.col("first_name")).alias("first_name_new"),
...: pl.when(pl.col("swap_names")).then(None).otherwise(pl.col("middle_name_or_initial")).alias("middle_name_or_initial_new"),
...: ]
...: )
Out[67]:
shape: (3, 7)
┌───────────┬────────────┬────────────────────────┬────────────┬───────────────┬────────────────┬────────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial ┆ swap_names ┆ last_name_new ┆ first_name_new ┆ middle_name_or_initial_new │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool ┆ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╪════════════╪═══════════════╪════════════════╪════════════════════════════╡
│ null ┆ a ┆ aa ┆ true ┆ a ┆ aa ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb ┆ false ┆ mallesh ┆ b ┆ bb │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc ┆ false ┆ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┴────────────┴───────────────┴────────────────┴────────────────────────────┘
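To end up with just the original three column names again, the temporary columns from the snippet above can be dropped and the _new columns renamed. A small follow-up sketch, where result is assumed to be the DataFrame returned by the chained with_columns calls above:
(
    result
    .drop(["last_name", "first_name", "middle_name_or_initial", "swap_names"])
    .rename({
        "last_name_new": "last_name",
        "first_name_new": "first_name",
        "middle_name_or_initial_new": "middle_name_or_initial",
    })
)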

Polars - Perform matrix inner product on lazy frames to produce sparse representation of gram matrix

Suppose we have a polars dataframe like:
df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]}).lazy()
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 │
└─────┴─────┘
I would like to compute X^T X of the matrix while preserving the sparse matrix format for arrow* - in pandas I would do something like:
pdf = df.collect().to_pandas()
numbers = pdf[["a", "b"]]
(numbers.T @ numbers).melt(ignore_index=False)
variable value
a a 14
b a 26
a b 26
b b 50
I did something like this in polars:
df.select(
    [
        (pl.col("a") * pl.col("a")).sum().alias("aa"),
        (pl.col("a") * pl.col("b")).sum().alias("ab"),
        (pl.col("b") * pl.col("a")).sum().alias("ba"),
        (pl.col("b") * pl.col("b")).sum().alias("bb"),
    ]
).melt().collect()
shape: (4, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════╪═══════╡
│ aa ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ab ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ba ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bb ┆ 50 │
└──────────┴───────┘
Which is almost there but not quite. This is a hack to get around the fact that I can't store lists as the column names (and then I could unnest them to become two different columns representing the x and y axis of the matrix). Is there a way to get the same format as shown in the pandas example?
*arrow is a columnar data format which means it's performant when scaled across rows but not across columns, which is why I think the sparse matrix representation is better if I want to use the results of the gram matrix chained with pl.LazyFrames later down the graph. I could be wrong though!
Polars doesn't have matrix multiplication, but we can tweak your algorithm slightly to accomplish what we need:
use the built-in dot expression
calculate each inner product only once, since <a, b> = <b, a>. We'll use Python's combinations_with_replacement iterator from itertools to accomplish this.
automatically generate the list of expressions that will run in parallel
Let's expand your data a bit:
from itertools import combinations_with_replacement
import polars as pl
df = pl.DataFrame(
{"a": [1, 2, 3, 4, 5], "b": [3, 4, 5, 6, 7], "c": [5, 6, 7, 8, 9]}
).lazy()
df.collect()
shape: (5, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 6 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 7 ┆ 9 │
└─────┴─────┴─────┘
The algorithm would be as follows:
expr_list = [
    pl.col(col1).dot(pl.col(col2)).alias(col1 + "|" + col2)
    for col1, col2 in combinations_with_replacement(df.columns, 2)
]

dot_prods = (
    df
    .select(expr_list)
    .melt()
    .with_column(
        pl.col('variable').str.split_exact('|', 1)
    )
    .unnest('variable')
    .cache()
)

result = (
    pl.concat([
        dot_prods,
        dot_prods
        .filter(pl.col('field_0') != pl.col('field_1'))
        .select(['field_1', 'field_0', 'value'])
        .rename({'field_0': 'field_1', 'field_1': 'field_0'})
    ])
    .sort(['field_0', 'field_1'])
)
result.collect()
shape: (9, 3)
┌─────────┬─────────┬───────┐
│ field_0 ┆ field_1 ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪═════════╪═══════╡
│ a ┆ a ┆ 55 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ b ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ c ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ a ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ b ┆ 135 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ c ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ a ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ b ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ c ┆ 255 │
└─────────┴─────────┴───────┘
Couple of notes:
I'm assuming that a pipe would be an appropriate delimiter for your column names.
The use of Python bytecode and iterator will not significantly impair performance. It is only used to generate the list of expressions, not run any calculations.
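As a quick sanity check against the pandas melt output, the long result can also be pivoted back into a dense Gram matrix. This is only a sketch and assumes a polars version with DataFrame.pivot; the lazy frame is collected first:
(
    result
    .collect()
    .pivot(values="value", index="field_0", columns="field_1")
)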

Filtering on a large number (hundreds) of conditions

I have a largish dataframe (5.5M rows, four columns). The first column (let's call it column A) has 235 distinct entries. The second column (B) has 100 distinct entries, integers from 0 to 99, all present in various proportions for each entry in A.
I group by A and aggregate by randomly selecting a value from B. Something like
df.groupby("A").agg(
pl.col("B").unique().apply(np.random.choice)
)
In doing so, every value in A is assigned a random integer. My goal is to select, from the dataframe, the columns C and D corresponding to the pairs A, B generated this way.
My approach so far is
choice = df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
).to_numpy()

lchoice = ((df["A"] == arr[0]) & (df["B"] == arr[1]) for arr in choice)
mask = functools.reduce(operator.or_, lchoice)
sdf = df[mask].select(["C", "D"])
It does the job, but does not feel very idiomatic.
My first attempt was
sdf = df.filter(functools
                .reduce(operator.or_,
                        [(pl.col("A") == arr[0]) & (pl.col("B") == arr[1])
                         for arr in choice]))
but it hangs until I kill it (I waited for ~30 minutes, while the other approach takes 1.6 seconds).
df.filter(
(pl.col("period") == choice[0, 0]) & (pl.col("exp_id") == choice[0, 1])
)
works fine, as expected, and I have used the functools.reduce construct successfully as argument to filter in the past. Obviously, I do not want to write them all by hand; I could loop over the rows of choice, filter df one at a time and then concatenate the dataframes, but it sounds much more expensive than it should be.
Any tip on getting to my sdf "the polars way", without having to create temporary objects, arrays, etc.? As I said, I have a working solution, but it is kind of shaky, and I am interested in learning better polars.
EDIT: some mock data
df = pl.DataFrame({"A": [1.3, 8.9, 6.7]*3 + [3.6, 4.1]*2,
"B": [1]*3 + [2]*3 + [3]*3 + [1]*2 + [2]*2,
"C": [21.5, 24.3, 21.8, 20.8, 23.6, 15.6, 23.5,
16.1, 15.6, 14.8, 14.7, 23.8, 20.],
"D": [6.9, 7.6, 6.4, 6.2, 7.6, 6.2,
6.3, 7.1, 7.8,7.7, 6.5, 6.6, 7.1]})
Slight twist on the accepted answer:
df.sort(by=['A', 'B'], in_place=True)
sdf = (df
       .join(df
             .groupby('A', maintain_order=True)
             .agg(pl.col('B')
                  .unique()
                  .sort()
                  .shuffle(seed)
                  .first()
                  .alias('B')),
             on=['A', 'B'])
       .select(['C', 'D']))
I need to perform this operation multiple times, and I'd like to ensure reproducibility of the random generation, hence the sorts and maintain_order=True.
It looks like you can accomplish what you need with a join.
Let's see if I understand your question. Let's start with this data:
import polars as pl
df = pl.DataFrame(
{
"A": ["a", "b", "c", "c", "b", "a"] * 2,
"B": [1, 2, 3, 2] * 3,
"C": [1, 2, 3, 4, 5, 6] * 2,
"D": ["x", "y", "y", "x"] * 3,
}
).sort(["A", "B"])
print(df)
shape: (12, 4)
┌─────┬─────┬─────┬─────┐
│ A ┆ B ┆ C ┆ D │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 1 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 ┆ 6 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 ┆ 6 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 3 ┆ 1 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 1 ┆ 5 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 2 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 2 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 3 ┆ 5 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 1 ┆ 3 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 2 ┆ 4 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 2 ┆ 4 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 3 ┆ y │
└─────┴─────┴─────┴─────┘
Next, we randomly select a value for B for every value of A.
I'm going to change your code slightly to eliminate the use of numpy and instead use Polars' shuffle expression. This way, we get a Polars DataFrame back, which we'll use in the upcoming join. (The documentation for the shuffle expression is here. It uses numpy if no seed is provided.)
choice_df = df.groupby("A").agg(pl.col("B").unique().shuffle().first())
print(choice_df)
shape: (3, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ b ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 1 │
└─────┴─────┘
If I understand your question correctly, we now want to get the values for columns C and D in the original dataset that correspond to the three combinations of A and B that we selected in the previous step. We can accomplish this most simply with a join, followed by a select. (The select is merely to eliminate columns A and B from the result).
df.join(choice_df, on=['A','B']).select(['C','D'])
shape: (4, 2)
┌─────┬─────┐
│ C ┆ D │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 6 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ x │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ y │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ x │
└─────┴─────┘
Does this accomplish what you need? The resulting code is clean, concise, and typical of the use of the Polars API.
