How to swap column values on conditions in python polars? - python

I have a data frame as below:
import numpy as np
import pandas as pd
import polars as pl

df_n = pl.from_pandas(pd.DataFrame({'last_name': [np.nan, 'mallesh', 'bhavik'],
                                    'first_name': ['a', 'b', 'c'],
                                    'middle_name_or_initial': ['aa', 'bb', 'cc']}))
Here I would like to find rows where first_name and middle_name_or_initial are not null but last_name is null. In those rows, first_name should be moved to last_name, middle_name_or_initial should be moved to first_name, and middle_name_or_initial should then be emptied.
I'm trying with this command:
df_n.with_columns([
    pl.when(
        (pl.col('first_name').is_not_null())
        & (pl.col('middle_name_or_initial').is_not_null())
        & (pl.col('last_name').is_null())
    )
    .then(pl.col('first_name').alias('last_name'))
    .otherwise(pl.col('last_name').alias('first_name')),
    pl.when(
        (pl.col('first_name').is_not_null())
        & (pl.col('middle_name_or_initial').is_not_null())
        & (pl.col('last_name').is_null())
    )
    .then(pl.col('middle_name_or_initial').alias('first_name'))
    .otherwise('')
    .alias('middle_name_or_initial')
])
This produces the wrong output. Any help?

You can actually change the values of multiple columns within a single when/then/otherwise statement.
The Algorithm
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
    .drop(name_cols)
    .unnest('name_struct')
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘
How it works
To change the values of multiple columns within a single when/then/otherwise statement, we can use structs. But you must observe some rules with structs. In all your then and otherwise statements, your structs must have:
the same field names
in the same order
with the same data type in corresponding fields.
So, in both the then and otherwise statements, I'm going to create a struct with field names in this order:
last_name: string
first_name: string
middle_name_or_initial: string
In our then statement, I'm swapping values and using alias to ensure that my field names are as stated above. (This is important.)
.then(pl.struct([
    pl.col('first_name').alias('last_name'),
    pl.col('middle_name_or_initial').alias('first_name'),
    pl.col('last_name').alias('middle_name_or_initial'),
]))
And in the otherwise statement, we'll simply name the existing columns that we want, in the order that we want - using the list name_cols that I created in a previous step.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
...
.otherwise(pl.struct(name_cols))
Here's the result after the when/then/otherwise statement.
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
)
shape: (3, 4)
┌───────────┬────────────┬────────────────────────┬──────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial ┆ name_struct │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ struct[3] │
╞═══════════╪════════════╪════════════════════════╪══════════════════════╡
│ null ┆ a ┆ aa ┆ {"a","aa",null} │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb ┆ {"mallesh","b","bb"} │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc ┆ {"bhavik","c","cc"} │
└───────────┴────────────┴────────────────────────┴──────────────────────┘
Notice that our new struct name_struct has the values that we want - in the correct order.
Next, we will use unnest to break the struct into separate columns. (But first, we must drop the existing columns so that we don't get 2 sets of columns with the same names.)
name_cols = ["last_name", "first_name", "middle_name_or_initial"]
(
    df_n.with_column(
        pl.when(
            (pl.col("first_name").is_not_null())
            & (pl.col("middle_name_or_initial").is_not_null())
            & (pl.col("last_name").is_null())
        )
        .then(pl.struct([
            pl.col('first_name').alias('last_name'),
            pl.col('middle_name_or_initial').alias('first_name'),
            pl.col('last_name').alias('middle_name_or_initial'),
        ]))
        .otherwise(pl.struct(name_cols))
        .alias('name_struct')
    )
    .drop(name_cols)
    .unnest('name_struct')
)
shape: (3, 3)
┌───────────┬────────────┬────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╡
│ a ┆ aa ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┘

With pl.when().then().otherwise() you create values for only one column (so only one alias at the end is allowed).
df_n.with_columns(
    [
        # Create temp column with filter, so it does not have to be recalculated 3 times.
        (
            (pl.col('first_name').is_not_null())
            & (pl.col('middle_name_or_initial').is_not_null())
            & (pl.col('last_name').is_null())
        ).alias("swap_names")
    ]
).with_columns(
    [
        # Create new columns with the correct value based on the swap_names column.
        pl.when(pl.col("swap_names")).then(pl.col("first_name")).otherwise(pl.col("last_name")).alias("last_name_new"),
        pl.when(pl.col("swap_names")).then(pl.col("middle_name_or_initial")).otherwise(pl.col("first_name")).alias("first_name_new"),
        pl.when(pl.col("swap_names")).then(None).otherwise(pl.col("middle_name_or_initial")).alias("middle_name_or_initial_new"),
    ]
)
shape: (3, 7)
┌───────────┬────────────┬────────────────────────┬────────────┬───────────────┬────────────────┬────────────────────────────┐
│ last_name ┆ first_name ┆ middle_name_or_initial ┆ swap_names ┆ last_name_new ┆ first_name_new ┆ middle_name_or_initial_new │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool ┆ str ┆ str ┆ str │
╞═══════════╪════════════╪════════════════════════╪════════════╪═══════════════╪════════════════╪════════════════════════════╡
│ null ┆ a ┆ aa ┆ true ┆ a ┆ aa ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mallesh ┆ b ┆ bb ┆ false ┆ mallesh ┆ b ┆ bb │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bhavik ┆ c ┆ cc ┆ false ┆ bhavik ┆ c ┆ cc │
└───────────┴────────────┴────────────────────────┴────────────┴───────────────┴────────────────┴────────────────────────────┘

Related

String manipulation in polars

I have a DataFrame in polars that has no header yet. The header should come from the first row of the data. Before I promote that row to the header, I want to manipulate its entries.
import polars as pl
# Creating a dictionary with the data
data = {
    "Column_1": ["ID", 4, 4, 4, 4],
    "Column_2": ["LocalValue", "B", "C", "D", "E"],
    "Column_3": ["Data\nField", "Q", "R", "S", "T"],
    "Column_4": [None, None, None, None, None],
    "Column_5": ["Global Value", "G", "H", "I", "J"],
}
# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌──────────┬────────────┬──────────┬──────────┬──────────────┐
│ Column_1 ┆ Column_2 ┆ Column_3 ┆ Column_4 ┆ Column_5 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ str │
╞══════════╪════════════╪══════════╪══════════╪══════════════╡
│ ID ┆ LocalValue ┆ Data ┆ null ┆ Global Value │
│ ┆ ┆ Field ┆ ┆ │
│ null ┆ B ┆ Q ┆ null ┆ G │
│ null ┆ C ┆ R ┆ null ┆ H │
│ null ┆ D ┆ S ┆ null ┆ I │
│ null ┆ E ┆ T ┆ null ┆ J │
└──────────┴────────────┴──────────┴──────────┴──────────────┘
First, I want to replace line breaks and spaces between words with an underscore. Furthermore, I want to split CamelCase words with an underscore (e.g. TestTest -> Test_Test). Finally, all entries should be lowercase. For this I wrote the following function:
def clean_dataframe_columns(df):
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    for entry in header:
        if entry:
            entry = (
                entry.replace("\n", "_")
                .replace("(?<=[a-z])(?=[A-Z])", "_")
                .replace("\s", "_")
                .to_lowercase()
            )
        else:
            entry = "no_column"
        cleaned_headers.append(entry)
    df.columns = cleaned_headers
    return df
Unfortunately I get the following error. What am I doing wrong?
AttributeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 df1 = clean_dataframe_columns(df)
Cell In[12], line 7, in clean_dataframe_columns(df)
4 for entry in header:
5 if entry:
6 entry = (
----> 7 entry.str.replace("\n", "_")
8 .replace("(?<=[a-z])(?=[A-Z])", "_")
9 .replace("\s", "_")
10 .to_lowercase()
11 )
12 else:
13 entry = "no_column"
AttributeError: 'str' object has no attribute 'str'
The goal should be this dataframe:
shape: (4, 5)
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 ┆ str │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4 ┆ B ┆ Q ┆ null ┆ G │
│ 4 ┆ C ┆ R ┆ null ┆ H │
│ 4 ┆ D ┆ S ┆ null ┆ I │
│ 4 ┆ E ┆ T ┆ null ┆ J │
└─────┴─────────────┴────────────┴───────────┴──────────────┘
Here, in for entry in header: you iterate over plain Python strings, so you should use the corresponding str methods (like .lower() instead of .to_lowercase()).
Rewritten solution:
import re

def get_cols(raw_col):
    if raw_col is None:
        return "no_column"
    raw_col = re.sub("(?<=[a-z])(?=[A-Z])", "_", raw_col)
    return raw_col.replace("\n", "_").replace(" ", "_").lower()

def clean_dataframe_columns(df):
    raw_cols = df.head(1).transpose().to_series().to_list()
    return df.rename({
        col: get_cols(raw_col) for col, raw_col in zip(df.columns, raw_cols)
    }).slice(1).with_column(pl.col("id").fill_null(4).cast(pl.Int32))
┌─────┬─────────────┬────────────┬───────────┬──────────────┐
│ id ┆ local_value ┆ data_field ┆ no_column ┆ global_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str         ┆ str        ┆ f64       ┆ str          │
╞═════╪═════════════╪════════════╪═══════════╪══════════════╡
│ 4 ┆ B ┆ Q ┆ null ┆ G │
│ 4 ┆ C ┆ R ┆ null ┆ H │
│ 4 ┆ D ┆ S ┆ null ┆ I │
│ 4 ┆ E ┆ T ┆ null ┆ J │
└─────┴─────────────┴────────────┴───────────┴──────────────┘
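As a sanity check of the cleaning rules in isolation, the same get_cols logic can be exercised on the sample headers as a standalone, pure-Python sketch (no polars needed):

```python
import re

def get_cols(raw_col):
    # Same rules as above: split CamelCase with "_", replace line
    # breaks and spaces with "_", lowercase, default name for None.
    if raw_col is None:
        return "no_column"
    raw_col = re.sub("(?<=[a-z])(?=[A-Z])", "_", raw_col)
    return raw_col.replace("\n", "_").replace(" ", "_").lower()

print(get_cols("LocalValue"))    # local_value
print(get_cols("Data\nField"))   # data_field
print(get_cols("Global Value"))  # global_value
print(get_cols(None))            # no_column
```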
I solved it on my own with this approach:
def clean_select_columns(self, df: pl.DataFrame) -> pl.DataFrame:
    """
    Clean columns from a dataframe.

    :param df: input DataFrame
    :return: DataFrame with cleaned columns

    The function takes a loaded DataFrame and performs the following operations:
    Transposes the first row of the dataframe to get the header.
    Selects the required columns defined in the list required_columns.
    Cleans the header names by:
    1. Replacing special characters with underscores
    2. Converting CamelCase strings to snake_case strings
    3. Converting all columns to lowercase
    4. Naming columns with no names as "no_column_X", where X is a unique integer
    5. Returns the cleaned dataframe.
    """
    header = list(df.head(1).transpose().to_series())
    cleaned_headers = []
    i = 0
    for entry in header:
        if entry:
            entry = re.sub(
                r"(?i)([\n ?])", "",
                re.sub(r"(?<!^)(?=[A-Z][a-z])", "_", entry),
            ).lower()
        else:
            entry = f"no_column_{i}"
        cleaned_headers.append(entry)
        i += 1
    df.columns = cleaned_headers
    return df

Polars - compute on all other values in a window except this one

For each row, I'm trying to compute the standard deviation for the other values in a group excluding the row's value. A way to think about it is "what would the standard deviation for the group be if this row's value was removed". An example may be easier to parse:
df = pl.DataFrame(
{
"answer": ["yes","yes","yes","yes","maybe","maybe","maybe"],
"value": [5,10,7,8,6,9,10],
}
)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
I would want to add a column that would have the first row be std([10,7,8]) = 1.527525
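That target number can be verified with the standard library (sample standard deviation, i.e. the n-1 denominator):

```python
from statistics import stdev

# "yes" group is [5, 10, 7, 8]; drop the first row's value 5
loo = stdev([10, 7, 8])
print(round(loo, 6))  # 1.527525
```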
I tried to hack something together and ended up with code that is horrible to read and also has a bug that I don't know how to work around:
df.with_column(
    (
        (pl.col("value").sum().over(pl.col("answer")) - pl.col("value"))
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    ).alias("average_other")
).with_column(
    (
        (
            (pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
            - (pl.col("value") - pl.col("average_other")).pow(2)
        )
        / (pl.col("value").count().over(pl.col("answer")) - 1)
    )
    .sqrt()
    .alias("std_dev_other")
)
I'm not sure I would recommend parsing that, but I'll point out at least one thing that is wrong:
(pl.col("value") - pl.col("average_other")).pow(2).sum().over(pl.col("answer"))
I want to be comparing "value" in each row to "average_other" from this row then squaring and summing over the window but instead I am comparing "value" in each row to "average_other" in each row.
My main question is the "what is the best way to get the standard deviation while leaving out this value?" part. But I would also be interested if there is a way to do the comparison that I'm doing wrong above. Third would be tips on how to write this in way that is easy to understand what is going on.
The way I'd come at this (at least at first thought) is to create three helper columns: a row index, a windowed list of the values in the group, and a windowed list of the row indices. Next, explode the two list columns. With that, you can filter out the rows where the actual row index is equal to the list row index. That lets you run std against the values grouped by row index, with each row's own value filtered out. Finally, join that result back to the original.
df = pl.DataFrame(
{
"answer": ["yes","yes","yes","yes","maybe","maybe","maybe"],
"value": [5,10,7,8,6,9,10],
}
)
df.with_row_count('i').join(
    df.with_row_count('i')
    .with_columns([
        pl.col('value').list().over('answer').alias('l'),
        pl.col('i').list().over('answer').alias('il'),
    ])
    .explode(['l', 'il'])
    .filter(pl.col('i') != pl.col('il'))
    .groupby('i')
    .agg(pl.col('l').std().alias('std')),
    on='i',
).drop('i')
shape: (7, 3)
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ std │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╡
│ yes ┆ 5 ┆ 1.527525 │
│ yes ┆ 10 ┆ 1.527525 │
│ yes ┆ 7 ┆ 2.516611 │
│ yes ┆ 8 ┆ 2.516611 │
│ maybe ┆ 6 ┆ 0.707107 │
│ maybe ┆ 9 ┆ 2.828427 │
│ maybe ┆ 10 ┆ 2.12132 │
└────────┴───────┴──────────┘
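As a cross-check on the output above, the same leave-one-out computation can be written in plain Python (a sketch for verification, not the polars implementation):

```python
from statistics import stdev

rows = [("yes", 5), ("yes", 10), ("yes", 7), ("yes", 8),
        ("maybe", 6), ("maybe", 9), ("maybe", 10)]

stds = []
for i, (answer, _) in enumerate(rows):
    # every other value in the same group, excluding row i itself
    others = [v for j, (a, v) in enumerate(rows) if a == answer and j != i]
    stds.append(round(stdev(others), 6))

print(stds)
# [1.527525, 1.527525, 2.516611, 2.516611, 0.707107, 2.828427, 2.12132]
```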
I came up with something similar to @DeanMacGregor's answer:
df = (
    df.with_row_count()
    .join(df.with_row_count(), on="answer")
    .filter(pl.col("row_nr") != pl.col("row_nr_right"))
    .groupby(["answer", "row_nr_right"], maintain_order=True)
    .agg([
        pl.col("value_right").first().alias("value"),
        pl.col("value").std().alias("stdev"),
    ])
    .drop("row_nr_right")
)
Join df (with a row count) onto itself and remove the rows where the two row counts are identical. Then group by answer and row_nr_right and (1) pick the first group item out of value_right and (2) calculate the standard deviation over the value group.
Result for
df = pl.DataFrame({
"answer": ["yes", "yes", "yes", "yes", "yes", "maybe", "maybe", "maybe", "maybe"],
"value": [5, 10, 7, 8, 4, 6, 9, 10, 4],
})
is
┌────────┬───────┬──────────┐
│ answer ┆ value ┆ stdev │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╡
│ yes ┆ 5 ┆ 2.5 │
│ yes ┆ 10 ┆ 1.825742 │
│ yes ┆ 7 ┆ 2.753785 │
│ yes ┆ 8 ┆ 2.645751 │
│ ... ┆ ... ┆ ... │
│ maybe ┆ 6 ┆ 3.21455 │
│ maybe ┆ 9 ┆ 3.05505 │
│ maybe ┆ 10 ┆ 2.516611 │
│ maybe ┆ 4 ┆ 2.081666 │
└────────┴───────┴──────────┘
I am not sure if my solution retains the (performance) advantages of using polars (instead of regular pandas), but I find it easier to maintain and more readable than the other answers.
With all the iterative conversion of data from Series to a list I expect this won't scale well, but perhaps your usecase does not require that.
Starting with the data (thanks to this answer for a minimal working dataset):
import polars as pl
import statistics as stats
df = pl.DataFrame(
{
"answer": ["yes", "yes", "yes", "yes", "maybe", "maybe", "maybe"],
"value": [5, 10, 7, 8, 6, 9, 10],
}
)
df
shape: (7, 2)
┌────────┬───────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═══════╡
│ yes ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ yes ┆ 8 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 6 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ maybe ┆ 10 │
└────────┴───────┘
Now define a custom function to do the actual calculation (thanks to this answer for an example of a custom polars aggregation function).
I.e. take a list of values, move one-by-one through the list and calculate the standard deviation of the values in the list, except for the current value:
def custom_std(args: list[pl.Series]) -> pl.Series:
    output = []
    # Iterate over the values within the group
    for idx in range(0, len(args[0])):
        # Convert the Series to a list, as Polars does not have a method
        # that can delete individual elements
        temp = args[0].to_list()
        # Delete the value of the current row
        del temp[idx]
        # Now calculate the std. dev. of the remaining values.
        # I arbitrarily chose the sample standard deviation; adjust
        # according to your situation.
        result = stats.stdev(temp)
        # Store the result in a list and go to the next iteration
        output.append(result)
    # The dtype is not correct, but I can't find how to specify that this
    # series contains a list of floats
    return pl.Series(output, dtype=pl.Float64)
Use this function in a groupby:
gdf = df.groupby(by=["answer"], maintain_order=True).agg(pl.apply(f=custom_std, exprs=["value"]))
gdf
┌────────┬─────────────────────────────────────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ list [f64] │
╞════════╪═════════════════════════════════════╡
│ yes ┆ [1.247219, 1.247219, ... 2.05480... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ [0.5, 2.0, 1.5] │
└────────┴─────────────────────────────────────┘
To get it in the desired format explode the resulting DataFrame:
gdf.explode("value")
┌────────┬──────────┐
│ answer ┆ value │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪══════════╡
│ yes ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 1.527525 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ yes ┆ 2.516611 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 0.707107 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 2.828427 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ maybe ┆ 2.1213 │
└────────┴──────────┘

How to get an index of maximum count of a required string in a list column of polars data frame?

I have a polars dataframe as
pl.DataFrame({'doc_id': [
    ['83;45;32;65;13', '7;8;9'],
    ['9;4;5', '4;2;7;3;5;8;10;11'],
    ['1000;2000', '76;34;100001;7474;2924'],
    ['100;200', '200;100'],
    ['3;4;6;7;10;11', '1;2;3;4;5'],
]})
Each list consists of document IDs separated by semicolons. The index of the element with the highest semicolon count needs to be found, and a new column len_idx_at created and filled with that index number.
For example:
['83;45;32;65;13','7;8;9']
This list has two elements: the first element contains 4 semicolons, hence it has 5 documents; similarly, the second element contains 2 semicolons, which means it has 3 documents.
We should take the index of the element with the highest document count - in the above case it will be index 0, because it has 4 semicolons.
The expected output is:
shape: (5, 2)
┌─────────────────────────────────────┬────────────┐
│ doc_id ┆ len_idx_at │
│ --- ┆ --- │
│ list[str] ┆ i64 │
╞═════════════════════════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 0 │
└─────────────────────────────────────┴────────────┘
If all elements in a list have equal semicolon counts, index zero is preferred, as shown in the output above.
(
    df.with_columns(
        [
            # Get first and second list of documents as string element.
            pl.col("doc_id").arr.get(0).alias("doc_list1"),
            pl.col("doc_id").arr.get(1).alias("doc_list2"),
        ]
    )
    .with_columns(
        [
            # Split each doc list element on ";" and count number of splits.
            pl.col("doc_list1").str.split(";").arr.lengths().alias("doc_list1_count"),
            pl.col("doc_list2").str.split(";").arr.lengths().alias("doc_list2_count"),
        ]
    )
    .with_column(
        # Get the wanted index based on which list is longer.
        pl.when(pl.col("doc_list1_count") >= pl.col("doc_list2_count"))
        .then(0)
        .otherwise(1)
        .alias("len_idx_at")
    )
)
shape: (5, 6)
┌─────────────────────────────────────┬────────────────┬────────────────────────┬─────────────────┬─────────────────┬────────────┐
│ doc_id ┆ doc_list1 ┆ doc_list2 ┆ doc_list1_count ┆ doc_list2_count ┆ len_idx_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ str ┆ u32 ┆ u32 ┆ i64 │
╞═════════════════════════════════════╪════════════════╪════════════════════════╪═════════════════╪═════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 83;45;32;65;13 ┆ 7;8;9 ┆ 5 ┆ 3 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 9;4;5 ┆ 4;2;7;3;5;8;10;11 ┆ 3 ┆ 8 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1000;2000 ┆ 76;34;100001;7474;2924 ┆ 2 ┆ 5 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 100;200 ┆ 200;100 ┆ 2 ┆ 2 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 3;4;6;7;10;11 ┆ 1;2;3;4;5 ┆ 6 ┆ 5 ┆ 0 │
└─────────────────────────────────────┴────────────────┴────────────────────────┴─────────────────┴─────────────────┴────────────┘
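The row-wise logic, including the tie-break, can be sketched in plain Python: max() returns the first maximal element, so equal counts resolve to index 0.

```python
def len_idx_at(doc_id):
    # index of the list element with the most semicolons
    return max(range(len(doc_id)), key=lambda i: doc_id[i].count(";"))

print(len_idx_at(["83;45;32;65;13", "7;8;9"]))     # 0
print(len_idx_at(["9;4;5", "4;2;7;3;5;8;10;11"]))  # 1
print(len_idx_at(["100;200", "200;100"]))          # 0 (tie -> first index)
```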

Polars - Perform matrix inner product on lazy frames to produce sparse representation of gram matrix

Suppose we have a polars dataframe like:
df = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]}).lazy()
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 │
└─────┴─────┘
I would like to compute X^T X on the matrix while preserving the sparse matrix format for arrow* - in pandas I would do something like:
pdf = df.collect().to_pandas()
numbers = pdf[["a", "b"]]
(numbers.T @ numbers).melt(ignore_index=False)
variable value
a a 14
b a 26
a b 26
b b 50
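For reference, the target result is simply every pairwise inner product keyed by column-name pair, which can be written out in plain Python:

```python
cols = {"a": [1, 2, 3], "b": [3, 4, 5]}

# gram[(x, y)] = <column x, column y>
gram = {(x, y): sum(u * v for u, v in zip(cols[x], cols[y]))
        for x in cols for y in cols}

print(gram)
# {('a', 'a'): 14, ('a', 'b'): 26, ('b', 'a'): 26, ('b', 'b'): 50}
```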
I did something like this in polars:
df.select(
    [
        (pl.col("a") * pl.col("a")).sum().alias("aa"),
        (pl.col("a") * pl.col("b")).sum().alias("ab"),
        (pl.col("b") * pl.col("a")).sum().alias("ba"),
        (pl.col("b") * pl.col("b")).sum().alias("bb"),
    ]
).melt().collect()
shape: (4, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════════╪═══════╡
│ aa ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ab ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ba ┆ 26 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bb ┆ 50 │
└──────────┴───────┘
Which is almost there but not quite. This is a hack to get around the fact that I can't store lists as the column names (and then I could unnest them to become two different columns representing the x and y axis of the matrix). Is there a way to get the same format as shown in the pandas example?
*arrow is a columnar data format which means it's performant when scaled across rows but not across columns, which is why I think the sparse matrix representation is better if I want to use the results of the gram matrix chained with pl.LazyFrames later down the graph. I could be wrong though!
Polars doesn't have matrix multiplication, but we can tweak your algorithm slightly to accomplish what we need:
use the built-in dot expression
calculate each inner product only once, since <a, b> = <b, a>. We'll use Python's combinations_with_replacement iterator from itertools to accomplish this.
automatically generate the list of expressions that will run in parallel
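For example, combinations_with_replacement yields each unordered pair of columns (including self-pairs) exactly once:

```python
from itertools import combinations_with_replacement

pairs = list(combinations_with_replacement(["a", "b", "c"], 2))
print(pairs)
# [('a', 'a'), ('a', 'b'), ('a', 'c'), ('b', 'b'), ('b', 'c'), ('c', 'c')]
```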
Let's expand your data a bit:
from itertools import combinations_with_replacement
import polars as pl
df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [3, 4, 5, 6, 7], "c": [5, 6, 7, 8, 9]}
).lazy()
df.collect()
shape: (5, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 5 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 6 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 7 ┆ 9 │
└─────┴─────┴─────┘
The algorithm would be as follows:
expr_list = [
    pl.col(col1).dot(pl.col(col2)).alias(col1 + "|" + col2)
    for col1, col2 in combinations_with_replacement(df.columns, 2)
]

dot_prods = (
    df
    .select(expr_list)
    .melt()
    .with_column(
        pl.col('variable').str.split_exact('|', 1)
    )
    .unnest('variable')
    .cache()
)

result = (
    pl.concat([
        dot_prods,
        dot_prods
        .filter(pl.col('field_0') != pl.col('field_1'))
        .select(['field_1', 'field_0', 'value'])
        .rename({'field_0': 'field_1', 'field_1': 'field_0'}),
    ])
    .sort(['field_0', 'field_1'])
)

result.collect()
result.collect()
shape: (9, 3)
┌─────────┬─────────┬───────┐
│ field_0 ┆ field_1 ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════╪═════════╪═══════╡
│ a ┆ a ┆ 55 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ b ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ c ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ a ┆ 85 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ b ┆ 135 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ c ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ a ┆ 115 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ b ┆ 185 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ c ┆ 255 │
└─────────┴─────────┴───────┘
Couple of notes:
I'm assuming that a pipe would be an appropriate delimiter for your column names.
The use of Python bytecode and an iterator will not significantly impair performance: they are only used to generate the list of expressions, not to run any calculations.

Polars: switching between dtypes within a DataFrame

I was searching for a way to easily change the dtype of strings containing numbers. For example, the problem I face is as follows:
df = pl.DataFrame({"foo": ["100CT pen", "pencils 250CT", "what 125CT soever", "this is a thing"]})
I can extract the digits and create a new column {"bar": ["100", "250", "125", ""]}. But then I couldn't find a handy function that converts this column to the Int64 or Float dtype so that the result is [100, 250, 125, null].
Also, vice versa. Sometimes it would be useful to have a handy function that converts the column of [100, 250, 125, 0] to ["100", "250", "125", "0"]. Is it something that already exists?
Thanks!
The easiest way to accomplish this is with the cast expression.
String to Int/Float
To cast from a string to an integer (or float):
import polars as pl
df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_column(pl.col('bar').cast(pl.Int64, strict=False).alias('bar_int'))
shape: (4, 2)
┌─────┬─────────┐
│ bar ┆ bar_int │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════════╡
│ 100 ┆ 100 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 250 ┆ 250 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 125 ┆ 125 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ ┆ null │
└─────┴─────────┘
A handy list of available datatypes is here. These are all aliased under polars, so you can refer to them easily (e.g., pl.UInt64).
For the data you describe, I recommend using strict=False to avoid having one mangled number among millions of records result in an exception that halts everything.
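In plain-Python terms, strict=False behaves roughly like a try/except around int() that yields a null instead of raising (a sketch of the semantics, not polars' actual implementation):

```python
def cast_int_non_strict(s):
    # roughly what cast(pl.Int64, strict=False) does per value
    try:
        return int(s)
    except ValueError:
        return None

print([cast_int_non_strict(s) for s in ["100", "250", "125", ""]])
# [100, 250, 125, None]
```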
Int/Float to String
The same process can be used to convert numbers to strings - in this case, the utf8 datatype.
Let me modify your dataset slightly:
df = pl.DataFrame({"bar": [100.5, 250.25, 1250000, None]})
df.with_column(pl.col("bar").cast(pl.Utf8, strict=False).alias("bar_string"))
shape: (4, 2)
┌────────┬────────────┐
│ bar ┆ bar_string │
│ --- ┆ --- │
│ f64 ┆ str │
╞════════╪════════════╡
│ 100.5 ┆ 100.5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250.25 ┆ 250.25 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.25e6 ┆ 1250000.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└────────┴────────────┘
If you need more control over the formatting, you can use the apply method and Python's new f-string formatting.
df.with_column(
    pl.col("bar").apply(lambda x: f"This is ${x:,.2f}!").alias("bar_fstring")
)
shape: (4, 2)
┌────────┬────────────────────────┐
│ bar ┆ bar_fstring │
│ --- ┆ --- │
│ f64 ┆ str │
╞════════╪════════════════════════╡
│ 100.5 ┆ This is $100.50! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250.25 ┆ This is $250.25! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.25e6 ┆ This is $1,250,000.00! │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└────────┴────────────────────────┘
I found this web page to be a handy reference for those unfamiliar with f-string formatting.
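The format spec used in the lambda above is ordinary Python, so it can be tried outside polars first:

```python
x = 1250000.0
s = f"This is ${x:,.2f}!"  # comma as thousands separator, two decimals
print(s)  # This is $1,250,000.00!
```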
As an addition to @cbilot's answer:
You don't need to use slow python lambda functions to use special string formatting of expressions. Polars has a format function for this purpose:
df = pl.DataFrame({"bar": ["100", "250", "125", ""]})
df.with_columns([
pl.format("This is {}!", pl.col("bar"))
])
shape: (4, 2)
┌─────┬──────────────┐
│ bar ┆ literal │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════════╡
│ 100 ┆ This is 100! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 250 ┆ This is 250! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 125 ┆ This is 125! │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ┆ This is ! │
└─────┴──────────────┘
For other data manipulation in polars, like string to datetime, use strptime().
import polars as pl
df = pl.DataFrame(df_pandas)
df
shape: (100, 2)
┌────────────┬────────┐
│ dates_col ┆ ticker │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪════════╡
│ 2022-02-25 ┆ RDW │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2008-05-28 ┆ ARTX │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2015-05-21 ┆ CBAT │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2009-02-09 ┆ ANNB │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
Use it like this, converting the string column to datetime:
df.with_column(pl.col("dates_col").str.strptime(pl.Datetime, fmt="%Y-%m-%d").cast(pl.Datetime))
shape: (100, 2)
┌─────────────────────┬────────┐
│ dates_col ┆ ticker │
│ --- ┆ --- │
│ datetime[μs] ┆ str │
╞═════════════════════╪════════╡
│ 2022-02-25 00:00:00 ┆ RDW │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2008-05-28 00:00:00 ┆ ARTX │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2015-05-21 00:00:00 ┆ CBAT │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2009-02-09 00:00:00 ┆ ANNB │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
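The fmt string follows the same directives as Python's own datetime.strptime, which is a handy way to test a format before using it in polars:

```python
from datetime import datetime

dt = datetime.strptime("2022-02-25", "%Y-%m-%d")
print(dt)  # 2022-02-25 00:00:00
```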
