In polars (coming from pandas), I want to interchange or assign values between two columns, row by row:
for i in range(len(k2)):
    k2['column1'][i] = k2['column2'][i]
    k2['column3'][i] = k2['column4'][i]
You can use alias to copy & rename columns:
import polars as pl
k2 = pl.DataFrame({"column1": [1,2,3],
"column2": [4,5,6],
"column3": [7,8,9],
"column4": [10,11,12]})
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 1 ┆ 4 ┆ 7 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 5 ┆ 8 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 6 ┆ 9 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘
k2.with_columns([pl.col("column2").alias("column1"), pl.col("column4").alias("column3")])
which prints out
┌─────────┬─────────┬─────────┬─────────┐
│ column1 ┆ column2 ┆ column3 ┆ column4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╡
│ 4 ┆ 4 ┆ 10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 5 ┆ 11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 6 ┆ 12 ┆ 12 │
└─────────┴─────────┴─────────┴─────────┘
I have a small use case and here is a polars dataframe.
df_names = pl.DataFrame({'LN':['Mallesham','Bhavik','Mallesham','Bhavik','Mahesh','Naresh','Sharath','Rakesh','Mallesham'],
'FN':['Yamulla','Yamulla','Yamulla','Yamulla','Dayala','Burre','Velmala','Uppu','Yamulla'],
'SSN':['123','456','123','456','893','111','222','333','123'],
'Address':['A','B','C','D','E','F','G','H','S']})
Here I would like to group on LN, FN and SSN and create a new column holding the number of observations for each group combination.
For example, 'Mallesham','Yamulla','123' appears 3 times, so the LN_FN_SSN_count field is filled with 3 for those rows.
You can use an expression using over (which is like grouping, aggregating and self-joining in other libs, but without the need for the join):
df_names.with_columns(pl.count().over(['LN', 'FN', 'SSN']).alias('LN_FN_SSN_count'))
┌───────────┬─────────┬─────┬─────────┬─────────────────┐
│ LN ┆ FN ┆ SSN ┆ Address ┆ LN_FN_SSN_count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════╪═════════╪═════════════════╡
│ Mallesham ┆ Yamulla ┆ 123 ┆ A ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ B ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ C ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ Yamulla ┆ 456 ┆ D ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Naresh ┆ Burre ┆ 111 ┆ F ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Sharath ┆ Velmala ┆ 222 ┆ G ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rakesh ┆ Uppu ┆ 333 ┆ H ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ Yamulla ┆ 123 ┆ S ┆ 3 │
└───────────┴─────────┴─────┴─────────┴─────────────────┘
I am trying to create a series of new columns, each based on the most recent column. I can achieve this with with_columns() and simple multiplication. Since I want a long list of new columns, I am thinking of using a loop with an f-string. However, I am not sure how to use an f-string for polars column names.
df = pl.DataFrame(
{
"id": ["NY", "TK", "FD"],
"eat2003": [-9, 3, 8],
"eat2004": [10, 11, 8]
}
); df
┌─────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 │
└─────┴─────────┴─────────┘
(
df
.with_columns((pl.col('eat2004') * 2).alias('eat2005'))
.with_columns((pl.col('eat2005') * 2).alias('eat2006'))
.with_columns((pl.col('eat2006') * 2).alias('eat2007'))
)
Expected output:
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┘
If you can base each of the new columns on eat2004, I would suggest the following approach:
expr_list = [
(pl.col('eat2004') * (2**i)).alias(f"eat{2004 + i}")
for i in range(1, 8)
]
(
df
.with_columns(expr_list)
)
shape: (3, 10)
┌─────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ id ┆ eat2003 ┆ eat2004 ┆ eat2005 ┆ eat2006 ┆ eat2007 ┆ eat2008 ┆ eat2009 ┆ eat2010 ┆ eat2011 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ NY ┆ -9 ┆ 10 ┆ 20 ┆ 40 ┆ 80 ┆ 160 ┆ 320 ┆ 640 ┆ 1280 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ TK ┆ 3 ┆ 11 ┆ 22 ┆ 44 ┆ 88 ┆ 176 ┆ 352 ┆ 704 ┆ 1408 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ FD ┆ 8 ┆ 8 ┆ 16 ┆ 32 ┆ 64 ┆ 128 ┆ 256 ┆ 512 ┆ 1024 │
└─────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
As long as all the Expressions are independent of each other, we can run them in parallel in a single with_columns context (for a nice performance gain). However, if the Expressions are not independent, then they must each be run in successive with_columns contexts.
I've purposely created the list of Expressions outside of any query context to demonstrate that Expressions can be generated independent of any query. Later, the list can be supplied to with_columns. This approach helps with debugging and keeping code clean, as you build and test your Expressions.
In pandas, the following code will split the string from col1 into many columns. Is there a way to do this in polars?
import pandas as pd
d = {'col1': ["a/b/c/d", "a/b/c/d"]}
df = pd.DataFrame(data=d)
df[["a","b","c","d"]] = df["col1"].str.split('/', expand=True)
Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.
Let's start with this data. Notice that I've purposely added the empty string "" and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.
import polars as pl
df = pl.DataFrame(
{
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
}
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str │
│ --- │
│ str │
╞═════════════════════════════╡
│ cat │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘
The Algorithm
The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
.with_column(
pl.col('^string_.*$').fill_null("")
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
How it works
We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬────────────────────────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ list[str] │
╞═════╪═════════════════════════════╪════════════════════════════╡
│ 0 ┆ cat ┆ ["cat"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ ["cat", "dog"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ [""] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", ... "frog"] │
└─────┴─────────────────────────────┴────────────────────────────┘
Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╡
│ 0 ┆ cat ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog │
└─────┴─────────────────────────────┴───────────┘
In the next step, we're going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.
I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)
You can substitute your own naming in this step as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id ┆ my_str ┆ split_str ┆ col_nm │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark ┆ string_02 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse ┆ string_03 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘
In the next step, we'll use the pivot function to place each string in its own column.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)
You can use the apply() method:
import polars as pl
from polars import col
df = pl.DataFrame({
'col1': ["a/b/c/d", "e/f/j/k"]
})
print(df)
df:
shape: (2, 1)
┌─────────┐
│ col1 │
│ --- │
│ str │
╞═════════╡
│ a/b/c/d │
├╌╌╌╌╌╌╌╌╌┤
│ e/f/j/k │
└─────────┘
With apply()
df = df.with_columns([
col('col1'),
*[col('col1').apply(lambda s, i=i: s.split('/')[i]).alias(col_name)
for i, col_name in enumerate(['a', 'b', 'c', 'd'])]
# or without 'for'
# col('col1').apply(lambda s: s.split('/')[0]).alias('a'),
# col('col1').apply(lambda s: s.split('/')[1]).alias('b'),
# col('col1').apply(lambda s: s.split('/')[2]).alias('c'),
# col('col1').apply(lambda s: s.split('/')[3]).alias('d')
])
print(df)
df:
shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1 ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a ┆ b ┆ c ┆ d │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ e/f/j/k ┆ e ┆ f ┆ j ┆ k │
└─────────┴─────┴─────┴─────┴─────┘
It works, but there is probably a cleaner way.
With this approach, you split the string to turn col1 into a list of strings. Then you loop over the list positions and use .arr.get to extract each element into a separate column:
(df
.with_column(pl.col("col1").str.split("/"))
.with_columns(
[pl.col("col1").arr.get(i).alias(str(i)) for i in range(len(df[0,"col1"].split('/')))
]
)
)
One challenge is whether you will have the same number of elements in the list in each row. In this solution I've assumed that you do, and I use the length of the list in the first row to drive the loop.
You can use struct datatype, as described in this post: https://stackoverflow.com/a/74219166:
import polars as pl
df = pl.DataFrame({
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
})
df.select(pl.col('my_str').str.split('/')
.arr.to_struct(n_field_strategy="max_width")).unnest('my_str')
Notice you must use n_field_strategy="max_width", otherwise, unnest() will create only 1 column.
import polars as pl
#Create new column list(can be created dynamically as well)
new_cols=['new_col1','new_col2','new_col3',.....,new_coln]
#Define expression
expr = [pl.col('col1').str.split('/').arr.get(i).alias(col)
for i,col in enumerate(new_cols)
]
#Apply Expression
df.with_columns(expr)
Now I have a dataframe like this:
df = pd.DataFrame({"asset":["a","b","c","a","b","c","b","c"],"v":[1,2,3,4,5,6,7,8],"date":["2017","2011","2012","2013","2014","2015","2016","2010"]})
I can calculate the pct_change by groupby and my function like this:
def fun(df):
    df = df.sort_values(by="date")
    df["pct_change"] = df["v"].pct_change()
    return df
df = df.groupby("asset", as_index=False).apply(fun)
How can I get the same result with polars?
Here are two options. One using window functions, and one using groupby + explode.
You should benchmark and see which is faster on your use case.
preparing data
df = pl.DataFrame({
"asset":["a","b","c","a","b","c","b","c"],
"v":[1,2,3,4,5,6,7,8],
"date":["2017","2011","2012","2013","2014","2015","2016","2010"]
})
using window functions
(
df.sort(["asset", "date"])
.with_columns([
pl.col("v").pct_change().over("asset").alias("pct_change")
])
)
using groupby + explode
(df.groupby("asset", maintain_order=True)
.agg([
pl.all().sort_by("date"),
pl.col("v").sort_by("date").pct_change().alias("pct_change")
]).explode(["v", "date", "pct_change"])
)
Result
Both output:
shape: (8, 4)
┌───────┬─────┬──────┬────────────┐
│ asset ┆ v ┆ date ┆ pct_change │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═══════╪═════╪══════╪════════════╡
│ a ┆ 4 ┆ 2013 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 1 ┆ 2017 ┆ -0.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 2011 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 5 ┆ 2014 ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 7 ┆ 2016 ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 8 ┆ 2010 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 2012 ┆ -0.625 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 6 ┆ 2015 ┆ 1.0 │
└───────┴─────┴──────┴────────────┘
For these two dfs, I want to check, for each value i in df1["TS"], whether df["TS"] == df1["TS"], and if so assign the value in "Dr" that corresponds to i to the "mmsi" column:
df = pl.DataFrame({"TS": [1, 2, 3, 4, 5, 6, 7], "mmsi":[11,12,13,14,15,16,17]})
df1 = pl.DataFrame({
"TS": [4, 6, 7],
"Dr": [21,22,23]})
I want the output of df["mmsi"] to be: [11,12,13,21,15,22,23]
I suggest using a "left" join, followed by a fill_null, to fill in values of Dr that are not found.
df.join(
df1,
on="TS",
how="left"
).with_columns(pl.col('Dr').fill_null(pl.col('mmsi')))
shape: (7, 3)
┌─────┬──────┬─────┐
│ TS ┆ mmsi ┆ Dr │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪═════╡
│ 1 ┆ 11 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 12 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 13 ┆ 13 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 14 ┆ 21 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 15 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ 16 ┆ 22 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 7 ┆ 17 ┆ 23 │
└─────┴──────┴─────┘
Your result is in the Dr column. If needed, you can drop/rename columns so that mmsi is the final column.
df = (
df.join(df1, on="TS", how="left")
.with_columns(pl.col("Dr").fill_null(pl.col("mmsi")))
.drop("mmsi")
.rename({"Dr": "mmsi"})
)
print(df)
shape: (7, 2)
┌─────┬──────┐
│ TS ┆ mmsi │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ 13 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 21 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 6 ┆ 22 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 7 ┆ 23 │
└─────┴──────┘
Taken in steps, the "left" join will yield the following.
df.join(
df1,
on="TS",
how="left"
)
shape: (7, 3)
┌─────┬──────┬──────┐
│ TS ┆ mmsi ┆ Dr │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ 1 ┆ 11 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 12 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ 13 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 14 ┆ 21 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5 ┆ 15 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 6 ┆ 16 ┆ 22 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 7 ┆ 17 ┆ 23 │
└─────┴──────┴──────┘
The fill_null step will then fill in any missing values in the Dr column using the corresponding values in the mmsi column.
The performance of this will be much better than iterating over values using a for loop.