I have a df as follows:
with n known at runtime.
I need to count the 1 and -1 values in each row.
Namely, I need a new df (or new columns in the old one):
Any advice?
As of polars 0.13.60, you can use polars.sum with an Expression to sum horizontally. For example, starting with this data
import polars as pl
data_frame = (
pl.DataFrame({
'col0': [1, -1, 1, -1, 1],
'col1': [1, 1, 1, 1, 1],
'col2': [-1, -1, -1, -1, -1],
'col3': [1, -1, -1, 1, 1],
})
)
data_frame
shape: (5, 4)
┌──────┬──────┬──────┬──────┐
│ col0 ┆ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╡
│ 1 ┆ 1 ┆ -1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ -1 ┆ 1 ┆ -1 ┆ -1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ -1 ┆ -1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ -1 ┆ 1 ┆ -1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ -1 ┆ 1 │
└──────┴──────┴──────┴──────┘
We can sum all columns horizontally, using polars.all.
(
data_frame
.with_columns([
pl.sum(pl.all() > 0).alias('pos'),
pl.sum(pl.all() < 0).alias('neg'),
])
)
shape: (5, 6)
┌──────┬──────┬──────┬──────┬─────┬─────┐
│ col0 ┆ col1 ┆ col2 ┆ col3 ┆ pos ┆ neg │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╪═════╪═════╡
│ 1 ┆ 1 ┆ -1 ┆ 1 ┆ 3 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ -1 ┆ 1 ┆ -1 ┆ -1 ┆ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 ┆ -1 ┆ -1 ┆ 2 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ -1 ┆ 1 ┆ -1 ┆ 1 ┆ 2 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 ┆ -1 ┆ 1 ┆ 3 ┆ 1 │
└──────┴──────┴──────┴──────┴─────┴─────┘
How it works
The above algorithm works because Polars will upcast boolean values to unsigned integers when summing. For example, the expression pl.all() > 0 produces Expressions of type boolean.
(
data_frame
.with_columns([
(pl.all() > 0).keep_name()
])
)
shape: (5, 4)
┌───────┬──────┬───────┬───────┐
│ col0 ┆ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool ┆ bool │
╞═══════╪══════╪═══════╪═══════╡
│ true ┆ true ┆ false ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ true ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ true ┆ false ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ false ┆ true │
└───────┴──────┴───────┴───────┘
polars.sum will then convert these to unsigned integers as it sums them horizontally.
For examples of how to select only certain columns (by name, by type, by regex expression, etc...), see this StackOverflow response.
Related
I am trying to remove rows that are null across a list of selected columns, but I suspect I have not used the with_columns operation correctly. What is the right approach if you want to apply the removal only to selected columns?
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2000": [1, None, 3],
"eat2001": [-2, None, 4],
"eat2002": [None, None, None],
"eat2003": [-9, None, 8],
"eat2004": [None, None, 8]
}
); df
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
col_list = [word for word in df.columns if word.startswith(("eat"))]
(
df
.with_columns([
pl.col(col_list).filter(~pl.fold(True, lambda acc, s: acc & s.is_null(), pl.all()))
])
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The polars.all and polars.any expressions will perform their operations horizontally (i.e., row-wise) if we supply them with a list of Expressions or a polars.col with a regex wildcard expression. Let's use the latter to simplify our work:
(
df
.filter(
~pl.all(pl.col('^eat.*$').is_null())
)
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
The ~ in front of the pl.all stands for negation. Notice that we didn't need the col_list.
One caution: the regex expression in the pl.col must start with ^ and end with $. These cannot be omitted, even if the resulting regex expression is otherwise valid.
Alternately, if you don't like the ~ operator:
(
df
.filter(
pl.any(pl.col('^eat.*$').is_not_null())
)
)
Other Notes
As an aside, polars.sum, polars.min, and polars.max will also operate row-wise when supplied with a list of Expressions or a wildcard expression in col.
Edit - using fold
FYI, here's how to use the fold method, if that is what you'd prefer. Note the use of pl.col with a regex expression.
(
df
.filter(
~pl.fold(True, lambda acc, s: acc & s.is_null(), exprs=pl.col('^eat.*$'))
)
)
shape: (2, 6)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2000 ┆ eat2001 ┆ eat2002 ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ 1 ┆ -2 ┆ null ┆ -9 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 3 ┆ 4 ┆ null ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
(
df
.filter(pl.col("eat2000").is_not_null())
)
I am trying to create a list of new columns based on the latest column. I can achieve this by using with_columns() and simple multiplication. Given I want a long list of new columns, I am thinking to use a loop with an f-string to do it. However, I am not so sure how to apply f-string into polars column names.
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2003": [-9, 3, 8],
"eat2004": [10, 11, 8]
}
); df
┌─────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┘
(
df
.with_columns((pl.col('eat2004') * 2).alias('eat2005'))
.with_columns((pl.col('eat2005') * 2).alias('eat2006'))
.with_columns((pl.col('eat2006') * 2).alias('eat2007'))
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
If you can base each of the newest columns from eat2004, I would suggest the following approach:
expr_list = [
(pl.col('eat2004') * (2**i)).alias(f"eat{2004 + i}")
for i in range(1, 8)
]
(
df
.with_columns(expr_list)
)
shape: (3, 10)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 ┆ eat2008 ┆ eat2009 ┆ eat2010 ┆ eat2011 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 ┆ 160 ┆ 320 ┆ 640 ┆ 1280 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 ┆ 176 ┆ 352 ┆ 704 ┆ 1408 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 ┆ 128 ┆ 256 ┆ 512 ┆ 1024 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
As long as all the Expressions are independent of each other, we can run them in parallel in a single with_columns context (for a nice performance gain). However, if the Expressions are not independent, each must be run in its own successive with_columns context.
I've purposely created the list of Expressions outside of any query context to demonstrate that Expressions can be generated independent of any query. Later, the list can be supplied to with_columns. This approach helps with debugging and keeping code clean, as you build and test your Expressions.
In pandas, the following code will split the string from col1 into many columns. Is there a way to do this in polars?
d = {'col1': ["a/b/c/d", "a/b/c/d"]}
df= pd.DataFrame(data=d)
df[["a","b","c","d"]]=df["col1"].str.split('/',expand=True)
Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.
Let's start with this data. Notice that I've purposely added the empty string "" and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.
import polars as pl
df = pl.DataFrame(
{
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
}
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str │
│ --- │
│ str │
╞═════════════════════════════╡
│ cat │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘
The Algorithm
The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
.with_column(
pl.col('^string_.*$').fill_null("")
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
How it works
We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬────────────────────────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ list[str] │
╞═════╪═════════════════════════════╪════════════════════════════╡
│ 0 ┆ cat ┆ ["cat"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ ["cat", "dog"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ [""] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", ... "frog"] │
└─────┴─────────────────────────────┴────────────────────────────┘
Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╡
│ 0 ┆ cat ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog │
└─────┴─────────────────────────────┴───────────┘
In the next step, we're going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.
I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)
You can substitute your own naming in this step as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id ┆ my_str ┆ split_str ┆ col_nm │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark ┆ string_02 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse ┆ string_03 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘
In the next step, we'll use the pivot function to place each string in its own column.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)
You can use apply() method
import polars as pl
from polars import col
df = pl.DataFrame({
'col1': ["a/b/c/d", "e/f/j/k"]
})
print(df)
df:
shape: (2, 1)
┌─────────┐
│ col1 │
│ --- │
│ str │
╞═════════╡
│ a/b/c/d │
├╌╌╌╌╌╌╌╌╌┤
│ e/f/j/k │
└─────────┘
With apply()
df = df.with_columns([
col('col1'),
*[col('col1').apply(lambda s, i=i: s.split('/')[i]).alias(col_name)
for i, col_name in enumerate(['a', 'b', 'c', 'd'])]
# or without 'for'
# col('col1').apply(lambda s: s.split('/')[0]).alias('a'),
# col('col1').apply(lambda s: s.split('/')[1]).alias('b'),
# col('col1').apply(lambda s: s.split('/')[2]).alias('c'),
# col('col1').apply(lambda s: s.split('/')[3]).alias('d')
])
print(df)
df:
shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1 ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a ┆ b ┆ c ┆ d │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ e/f/j/k ┆ e ┆ f ┆ j ┆ k │
└─────────┴─────┴─────┴─────┴─────┘
It works, but there is probably a better way.
With this approach, you first split the string in col1 into a list of strings. Then you loop over the list positions and use .arr.get to extract each element into a separate column.
(df
.with_column(pl.col("col1").str.split("/"))
.with_columns(
[pl.col("col1").arr.get(i).alias(str(i)) for i in range(len(df[0,"col1"].split('/')))
]
)
)
One challenge is whether you will have the same number of elements in the list in each row. In this solution I've assumed you have and have taken the length of the list in the first row to do the loop.
You can use struct datatype, as described in this post: https://stackoverflow.com/a/74219166:
import polars as pl
df = pl.DataFrame({
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
})
df.select(pl.col('my_str').str.split('/')
.arr.to_struct(n_field_strategy="max_width")).unnest('my_str')
Notice you must use n_field_strategy="max_width", otherwise, unnest() will create only 1 column.
import polars as pl
#Create new column list(can be created dynamically as well)
new_cols=['new_col1','new_col2','new_col3',.....,new_coln]
#Define expression
expr = [pl.col('col1').str.split('/').arr.get(i).alias(col)
for i,col in enumerate(new_cols)
]
#Apply Expression
df.with_columns(expr)
Now I have a dataframe like this:
df = pd.DataFrame({"asset":["a","b","c","a","b","c","b","c"],"v":[1,2,3,4,5,6,7,8],"date":["2017","2011","2012","2013","2014","2015","2016","2010"]})
I can calculate the pct_change by groupby and my function like this:
def fun(df):
    df = df.sort_values(by="date")
    df["pct_change"] = df["v"].pct_change()
    return df
df = df.groupby("asset",as_index=False).apply(fun)
How can I get the same result with polars?
Here are two options. One using window functions, and one using groupby + explode.
You should benchmark and see which is faster on your use case.
preparing data
df = pl.DataFrame({
"asset":["a","b","c","a","b","c","b","c"],
"v":[1,2,3,4,5,6,7,8],
"date":["2017","2011","2012","2013","2014","2015","2016","2010"]
})
using window functions
(
df.sort(["asset", "date"])
.with_columns([
pl.col("v").pct_change().over("asset").alias("pct_change")
])
)
using groupby + explode
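(df.groupby("asset")
    .agg([
        # sort every non-key column by date so the lists line up row for row
        pl.all().sort_by("date"),
        pl.col("v").sort_by("date").pct_change().alias("pct_change")
    ]).explode(["v", "date", "pct_change"])
    .sort(["asset", "date"])
)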
Result
Both output:
shape: (8, 4)
┌───────┬─────┬──────┬────────────┐
│ asset ┆ v ┆ date ┆ pct_change │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═══════╪═════╪══════╪════════════╡
│ a ┆ 4 ┆ 2013 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 1 ┆ 2017 ┆ -0.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 2011 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 5 ┆ 2014 ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 7 ┆ 2016 ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 8 ┆ 2010 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 2012 ┆ -0.625 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 6 ┆ 2015 ┆ 1.0 │
└───────┴─────┴──────┴────────────┘
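The values in the table can be checked by hand, since pct_change is (current - previous) / previous within each asset, taken in date order. The plain-Python sketch below does this for asset b only and does not use polars.

```python
# (asset, v, date) rows from the example above.
rows = [
    ("a", 1, "2017"), ("b", 2, "2011"), ("c", 3, "2012"), ("a", 4, "2013"),
    ("b", 5, "2014"), ("c", 6, "2015"), ("b", 7, "2016"), ("c", 8, "2010"),
]

# Collect the values of asset "b" in date order.
b_values = [v for asset, v, date in sorted(rows, key=lambda r: r[2]) if asset == "b"]

# The first row has no previous value, so its change is undefined (None).
b_pct = [None] + [(cur - prev) / prev for prev, cur in zip(b_values, b_values[1:])]
print(b_pct)
```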
Coming from pandas, I want to assign the values of one column to another column in polars, row by row:
for i in range(len(k2)):
    k2['column1'][i] = k2['column2'][i]
    k2['column3'][i] = k2['column4'][i]
You can use alias to copy & rename columns:
import polars as pl
k2 = pl.DataFrame({"column1": [1,2,3],
"column2": [4,5,6],
"column3": [7,8,9],
"column4": [10,11,12]})
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 1 ┆ 4 ┆ 7 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 5 ┆ 8 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 6 ┆ 9 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘
k2.with_columns([pl.col("column2").alias("column1"), pl.col("column4").alias("column3")])
which prints out
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 4 ┆ 4 ┆ 10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 5 ┆ 11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 6 ┆ 12 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘