I want to add a duration in seconds to a date/time. My data looks like
import polars as pl
df = pl.DataFrame(
{
"dt": [
"2022-12-14T00:00:00", "2022-12-14T00:00:00", "2022-12-14T00:00:00",
],
"seconds": [
1.0, 2.2, 2.4,
],
}
)
df = df.with_column(pl.col("dt").str.strptime(pl.Datetime).cast(pl.Datetime))
Now my naive attempt was to convert the float column to a duration type so that I could add it to the datetime column (as I would do in pandas).
df = df.with_column(pl.col("seconds").cast(pl.Duration).alias("duration0"))
print(df.head())
┌─────────────────────┬─────────┬──────────────┐
│ dt ┆ seconds ┆ duration0 │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ duration[μs] │
╞═════════════════════╪═════════╪══════════════╡
│ 2022-12-14 00:00:00 ┆ 1.0 ┆ 0µs │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 00:00:00 ┆ 2.2 ┆ 0µs │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 00:00:00 ┆ 2.4 ┆ 0µs │
└─────────────────────┴─────────┴──────────────┘
...gives the correct data type, however the values are all zero.
I also tried
df = df.with_column(
pl.col("seconds")
.apply(lambda x: pl.duration(nanoseconds=x * 1e9))
.alias("duration1")
)
print(df.head())
shape: (3, 4)
┌─────────────────────┬─────────┬──────────────┬─────────────────────────────────────┐
│ dt ┆ seconds ┆ duration0 ┆ duration1 │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ duration[μs] ┆ object │
╞═════════════════════╪═════════╪══════════════╪═════════════════════════════════════╡
│ 2022-12-14 00:00:00 ┆ 1.0 ┆ 0µs ┆ 0i64.duration([0i64, 1000000000f... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 00:00:00 ┆ 2.2 ┆ 0µs ┆ 0i64.duration([0i64, 2200000000f... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-12-14 00:00:00 ┆ 2.4 ┆ 0µs ┆ 0i64.duration([0i64, 2400000000f... │
└─────────────────────┴─────────┴──────────────┴─────────────────────────────────────┘
which gives an object type column, which isn't helpful either. The documentation is kind of sparse on the topic. Are there any better options?
Update: The values being zero is a repr formatting issue that has been fixed with this commit.
pl.duration() can be used in this way:
>>> df.with_column(
... pl.col("dt").str.strptime(pl.Datetime)
... + pl.duration(nanoseconds=pl.col("seconds") * 1e9)
... )
shape: (3, 2)
┌─────────────────────────┬─────────┐
│ dt | seconds │
│ --- | --- │
│ datetime[μs] | f64 │
╞═════════════════════════╪═════════╡
│ 2022-12-14 00:00:01 | 1.0 │
├─────────────────────────┼─────────┤
│ 2022-12-14 00:00:02.200 | 2.2 │
├─────────────────────────┼─────────┤
│ 2022-12-14 00:00:02.400 | 2.4 │
└─//──────────────────────┴─//──────┘
I'd like to calculate aggregated metrics with an expanding window. Basically, given the following dataframe:
from datetime import date
import polars as pl
df = pl.DataFrame({"Day":[date(2022, 1, i) for i in range(1,10)], "value":[1,2,3,4,5,6,7,8,9]})
shape: (9, 2)
┌────────────┬───────┐
│ Day ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-02 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-03 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-04 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-06 ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-07 ┆ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-08 ┆ 8 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-09 ┆ 9 │
└────────────┴───────┘
What I'm after is basically this:
|--|
|-----|
|--------|
I tried to use groupby_rolling and groupby_dynamic, but I couldn't get it to fix the initial time of each group to the first timestamp. My current workaround is something like this:
date_range = pl.date_range(df.select("Day").min().row(0)[0], df.select("Day").max().row(0)[0], '1w',)
for timestamp in date_range:
    print(df.filter(pl.col('Day').is_between(date_range[0], timestamp, include_bounds=True)))
shape: (1, 2)
┌────────────┬───────┐
│ Day ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 1 │
└────────────┴───────┘
shape: (8, 2)
┌────────────┬───────┐
│ Day ┆ value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-02 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-03 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-04 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-05 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-06 ┆ 6 │
...
│ 2022-01-07 ┆ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-08 ┆ 8 │
└────────────┴───────┘
This gives me the exact aggregation I'm after, but I feel like there's a much more efficient way of doing this - and I'd especially like to do my aggregations within a groupby context.
Not sure if it's possible with dynamic/rolling but you could create a dataframe from your date range and do a cross join.
>>> start = df.get_column("Day").min()
... end = df.get_column("Day").max()
... date_range = (
... pl.date_range(start, end, interval="1w").to_frame("end")
... .with_row_count(name="group")
... )
>>> date_range
shape: (2, 2)
┌───────┬────────────┐
│ group ┆ end │
│ --- ┆ --- │
│ u32 ┆ date │
╞═══════╪════════════╡
│ 0 ┆ 2022-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2022-01-08 │
└───────┴────────────┘
You can then run your filter and be left with a group identifier:
>>> (
... df
... .join(date_range, left_on="Day", right_on="end", how="cross")
... .with_column(pl.lit(start).alias("start"))
... .filter(
... pl.col("Day").is_between(
... pl.col("start"),
... pl.col("end"),
... include_bounds=True))
... .drop(["start", "end"])
... )
shape: (9, 3)
┌────────────┬───────┬───────┐
│ Day ┆ value ┆ group │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ u32 │
╞════════════╪═══════╪═══════╡
│ 2022-01-01 ┆ 1 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-01 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-02 ┆ 2 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-03 ┆ 3 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-05 ┆ 5 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-06 ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-07 ┆ 7 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2022-01-08 ┆ 8 ┆ 1 │
└────────────┴───────┴───────┘
I have a dataframe like this:
pl.DataFrame({'last_name':['Unknown','Mallesham',np.nan,'Bhavik','Unknown'],
'first_name_or_initial':['U',np.nan,'TRUE','yamulla',np.nan],
'number':['003123490','012457847','100030303','','0023004648'],
'date_of_birth':[np.nan,'12/09/1900','12/09/1900','12/09/1900',np.nan]})
Here I would like to add a new column that lists the names of the fields that hold any information, i.e. that are not NULL/EMPTY/NAN.
For example:
First row: last_name, first_name_or_initial and number hold information and date_of_birth is NULL, so the new column conso_field is filled with these field names: last_name,first_name_or_initial,number. Likewise, I need this for all the rows.
Here is an expected output:
First, let's expand the example to show a row with all null/empty fields (to show how the algorithm handles this case).
import polars as pl
import numpy as np
df = pl.DataFrame(
{
"last_name": ["Unknown", "Mallesham", np.nan, "Bhavik", "Unknown", None],
"first_name_or_initial": ["U", np.nan, "TRUE", "yamulla", np.nan, None],
"number": ["003123490", "012457847", "100030303", "", "0023004648", None],
"date_of_birth": [np.nan, "12/09/1900", "12/09/1900", "12/09/1900", np.nan, None],
}
)
df
shape: (6, 4)
┌───────────┬───────────────────────┬────────────┬───────────────┐
│ last_name ┆ first_name_or_initial ┆ number ┆ date_of_birth │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════╪═══════════════════════╪════════════╪═══════════════╡
│ Unknown ┆ U ┆ 003123490 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mallesham ┆ null ┆ 012457847 ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ TRUE ┆ 100030303 ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bhavik ┆ yamulla ┆ ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Unknown ┆ null ┆ 0023004648 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ null ┆ null │
└───────────┴───────────────────────┴────────────┴───────────────┘
The Algorithm
df = df.with_row_count()
(
df
.join(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
.groupby('row_nr')
.agg(pl.col('variable').alias('conso_field'))
.with_column(pl.col('conso_field').arr.join(','))
,
on='row_nr',
how='left'
)
)
shape: (6, 6)
┌────────┬───────────┬───────────────────────┬────────────┬───────────────┬─────────────────────────────────────┐
│ row_nr ┆ last_name ┆ first_name_or_initial ┆ number ┆ date_of_birth ┆ conso_field │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str │
╞════════╪═══════════╪═══════════════════════╪════════════╪═══════════════╪═════════════════════════════════════╡
│ 0 ┆ Unknown ┆ U ┆ 003123490 ┆ null ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ Mallesham ┆ null ┆ 012457847 ┆ 12/09/1900 ┆ last_name,number,date_of_birth │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ TRUE ┆ 100030303 ┆ 12/09/1900 ┆ first_name_or_initial,number,dat... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ Bhavik ┆ yamulla ┆ ┆ 12/09/1900 ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ Unknown ┆ null ┆ 0023004648 ┆ null ┆ last_name,number │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ null ┆ null ┆ null ┆ null │
└────────┴───────────┴───────────────────────┴────────────┴───────────────┴─────────────────────────────────────┘
Note that the algorithm keeps the last row with all null/empty values.
How it works
To see how it works, let's take it in steps.
First, we'll need to attach a row number to each row. (This is needed in case any row has all null/empty values.)
Then we'll use melt to place each value in each column on a separate row, next to its column name.
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
)
shape: (24, 3)
┌────────┬───────────────┬────────────┐
│ row_nr ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞════════╪═══════════════╪════════════╡
│ 0 ┆ last_name ┆ Unknown │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ last_name ┆ Mallesham │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ last_name ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ last_name ┆ Bhavik │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ date_of_birth ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ date_of_birth ┆ null │
└────────┴───────────────┴────────────┘
Note that column values will be converted to string values in this step.
Next, we'll filter out any rows with null or "" values.
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
)
shape: (14, 3)
┌────────┬───────────────┬────────────┐
│ row_nr ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞════════╪═══════════════╪════════════╡
│ 0 ┆ last_name ┆ Unknown │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ last_name ┆ Mallesham │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ last_name ┆ Bhavik │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ last_name ┆ Unknown │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ number ┆ 0023004648 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ date_of_birth ┆ 12/09/1900 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ date_of_birth ┆ 12/09/1900 │
└────────┴───────────────┴────────────┘
In the next step, we'll aggregate up all the remaining rows by row number, keeping only the column names. These represent columns with non-null, non-empty values.
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
.groupby('row_nr')
.agg(pl.col('variable').alias('conso_field'))
)
shape: (5, 2)
┌────────┬─────────────────────────────────────┐
│ row_nr ┆ conso_field │
│ --- ┆ --- │
│ u32 ┆ list[str] │
╞════════╪═════════════════════════════════════╡
│ 2 ┆ ["first_name_or_initial", "numbe... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ["last_name", "first_name_or_ini... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ ["last_name", "number"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ ["last_name", "first_name_or_ini... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["last_name", "number", "date_of... │
└────────┴─────────────────────────────────────┘
Note that we get a list of column names for each row. (We don't need to worry about the order of the rows at this point: in the last step, the row number and a left join recombine the values with the original DataFrame.)
Then, it's simply a matter of joining the column names into one string:
df = df.with_row_count()
(
df
.melt(id_vars="row_nr")
.filter(pl.col('value').is_not_null() & (pl.col('value') != ""))
.groupby('row_nr')
.agg(pl.col('variable').alias('conso_field'))
.with_column(pl.col('conso_field').arr.join(','))
)
shape: (5, 2)
┌────────┬─────────────────────────────────────┐
│ row_nr ┆ conso_field │
│ --- ┆ --- │
│ u32 ┆ str │
╞════════╪═════════════════════════════════════╡
│ 3 ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ first_name_or_initial,number,dat... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ last_name,number │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ last_name,first_name_or_initial,... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ last_name,number,date_of_birth │
└────────┴─────────────────────────────────────┘
From here, we simply use a "left join" to merge the data back to the original dataset (as shown at the beginning.)
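If you want to sanity-check the result of the pipeline, the same logic can be written as a plain-Python reference. This is only a sketch for cross-checking, not a replacement for the Polars version: it works on a dict of lists, assumes the NaN entries have already become None (as in the printed frame above), and the function name filled_fields is made up.

```python
columns = {
    "last_name": ["Unknown", "Mallesham", None, "Bhavik", "Unknown", None],
    "first_name_or_initial": ["U", None, "TRUE", "yamulla", None, None],
    "number": ["003123490", "012457847", "100030303", "", "0023004648", None],
    "date_of_birth": [None, "12/09/1900", "12/09/1900", "12/09/1900", None, None],
}


def filled_fields(columns):
    """Return, per row, the comma-joined names of the non-empty columns."""
    n_rows = len(next(iter(columns.values())))
    result = []
    for i in range(n_rows):
        names = [
            name for name, values in columns.items() if values[i] not in (None, "")
        ]
        # Mirror the left join: a row with no usable value yields None.
        result.append(",".join(names) if names else None)
    return result


expected = filled_fields(columns)
print(expected)
```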
In pandas, the following code will split the string from col1 into many columns. Is there a way to do this in polars?
d = {'col1': ["a/b/c/d", "a/b/c/d"]}
df= pd.DataFrame(data=d)
df[["a","b","c","d"]]=df["col1"].str.split('/',expand=True)
Here's an algorithm that will automatically adjust for the required number of columns -- and should be quite performant.
Let's start with this data. Notice that I've purposely added the empty string "" and a null value - to show how the algorithm handles these values. Also, the number of split strings varies widely.
import polars as pl
df = pl.DataFrame(
{
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
}
)
df
shape: (5, 1)
┌─────────────────────────────┐
│ my_str │
│ --- │
│ str │
╞═════════════════════════════╡
│ cat │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat/dog/aardvark/mouse/frog │
└─────────────────────────────┘
The Algorithm
The algorithm below may be a bit more than you need, but you can edit/delete/add as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
.with_column(
pl.col('^string_.*$').fill_null("")
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
How it works
We first assign a row number id (which we'll need later), and use split to separate the strings. Note that the split strings form a list.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
)
shape: (5, 3)
┌─────┬─────────────────────────────┬────────────────────────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ list[str] │
╞═════╪═════════════════════════════╪════════════════════════════╡
│ 0 ┆ cat ┆ ["cat"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ ["cat", "dog"] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ [""] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ ["cat", "dog", ... "frog"] │
└─────┴─────────────────────────────┴────────────────────────────┘
Next, we'll use explode to put each string on its own row. (Notice how the id column tracks the original row that each string came from.)
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
)
shape: (10, 3)
┌─────┬─────────────────────────────┬───────────┐
│ id ┆ my_str ┆ split_str │
│ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╡
│ 0 ┆ cat ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog │
└─────┴─────────────────────────────┴───────────┘
In the next step, we're going to generate our column names. I chose to call each column string_XX where XX is the offset with regards to the original string.
I've used the handy zfill expression so that 1 becomes 01. (This makes sure that string_02 comes before string_10 if you decide to sort your columns later.)
You can substitute your own naming in this step as you need.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
)
shape: (10, 4)
┌─────┬─────────────────────────────┬───────────┬───────────┐
│ id ┆ my_str ┆ split_str ┆ col_nm │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ string_00 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ dog ┆ string_01 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ aardvark ┆ string_02 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ mouse ┆ string_03 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ frog ┆ string_04 │
└─────┴─────────────────────────────┴───────────┴───────────┘
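The effect of the zero-padding on a later sort can be seen in plain Python, without any Polars involved. The index values below are arbitrary and chosen only to show the difference.

```python
# Column names without zero-padding sort lexicographically, not numerically.
unpadded = [f"string_{i}" for i in (0, 2, 10)]
padded = [f"string_{str(i).zfill(2)}" for i in (0, 2, 10)]

print(sorted(unpadded))  # ['string_0', 'string_10', 'string_2']
print(sorted(padded))    # ['string_00', 'string_02', 'string_10']
```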
In the next step, we'll use the pivot function to place each string in its own column.
(
df
.with_row_count('id')
.with_column(pl.col("my_str").str.split("/").alias("split_str"))
.explode("split_str")
.with_column(
("string_" + pl.arange(0, pl.count()).cast(pl.Utf8).str.zfill(2))
.over("id")
.alias("col_nm")
)
.pivot(
index=['id', 'my_str'],
values='split_str',
columns='col_nm',
)
)
shape: (5, 7)
┌─────┬─────────────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ id ┆ my_str ┆ string_00 ┆ string_01 ┆ string_02 ┆ string_03 ┆ string_04 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞═════╪═════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 0 ┆ cat ┆ cat ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ cat/dog ┆ cat ┆ dog ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ null ┆ null ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ ┆ ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ cat/dog/aardvark/mouse/frog ┆ cat ┆ dog ┆ aardvark ┆ mouse ┆ frog │
└─────┴─────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
All that remains is to use fill_null to replace the null values with an empty string "". Notice that I've used a regex expression in the col expression to target only those columns whose names start with "string_". (Depending on your other data, you may not want to replace null with "" everywhere in your data.)
You can use the apply() method:
import polars as pl
from polars import col
df = pl.DataFrame({
'col1': ["a/b/c/d", "e/f/j/k"]
})
print(df)
df:
shape: (2, 1)
┌─────────┐
│ col1 │
│ --- │
│ str │
╞═════════╡
│ a/b/c/d │
├╌╌╌╌╌╌╌╌╌┤
│ e/f/j/k │
└─────────┘
With apply()
df = df.with_columns([
col('col1'),
*[col('col1').apply(lambda s, i=i: s.split('/')[i]).alias(col_name)
for i, col_name in enumerate(['a', 'b', 'c', 'd'])]
# or without 'for'
# col('col1').apply(lambda s: s.split('/')[0]).alias('a'),
# col('col1').apply(lambda s: s.split('/')[1]).alias('b'),
# col('col1').apply(lambda s: s.split('/')[2]).alias('c'),
# col('col1').apply(lambda s: s.split('/')[3]).alias('d')
])
print(df)
df:
shape: (2, 5)
┌─────────┬─────┬─────┬─────┬─────┐
│ col1 ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str │
╞═════════╪═════╪═════╪═════╪═════╡
│ a/b/c/d ┆ a ┆ b ┆ c ┆ d │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ e/f/j/k ┆ e ┆ f ┆ j ┆ k │
└─────────┴─────┴─────┴─────┴─────┘
It works, but there is probably a cleaner way.
With this approach, you first split col1 into a list of strings. Then you loop over the list positions and use .arr.get to extract each element into a separate column:
(df
.with_column(pl.col("col1").str.split("/"))
.with_columns(
[pl.col("col1").arr.get(i).alias(str(i)) for i in range(len(df[0,"col1"].split('/')))
]
)
)
One challenge is whether you will have the same number of elements in the list in each row. In this solution I've assumed you have and have taken the length of the list in the first row to do the loop.
You can use the struct datatype, as described in this post: https://stackoverflow.com/a/74219166:
import polars as pl
df = pl.DataFrame({
"my_str": ["cat", "cat/dog", None, "", "cat/dog/aardvark/mouse/frog"],
})
df.select(pl.col('my_str').str.split('/')
.arr.to_struct(n_field_strategy="max_width")).unnest('my_str')
Notice you must use n_field_strategy="max_width", otherwise, unnest() will create only 1 column.
import polars as pl
#Create new column list(can be created dynamically as well)
new_cols=['new_col1','new_col2','new_col3',.....,new_coln]
#Define expression
expr = [pl.col('col1').str.split('/').arr.get(i).alias(col)
for i,col in enumerate(new_cols)
]
#Apply Expression
df.with_columns(expr)
Now I have a dataframe like this:
df = pd.DataFrame({"asset":["a","b","c","a","b","c","b","c"],"v":[1,2,3,4,5,6,7,8],"date":["2017","2011","2012","2013","2014","2015","2016","2010"]})
I can calculate the pct_change by groupby and my function like this:
def fun(df):
    df = df.sort_values(by="date")
    df["pct_change"] = df["v"].pct_change()
    return df
df = df.groupby("asset",as_index=False).apply(fun)
Now I want to know how I can get the same result with polars.
Here are two options. One using window functions, and one using groupby + explode.
You should benchmark and see which is faster on your use case.
preparing data
df = pl.DataFrame({
"asset":["a","b","c","a","b","c","b","c"],
"v":[1,2,3,4,5,6,7,8],
"date":["2017","2011","2012","2013","2014","2015","2016","2010"]
})
using window functions
(
df.sort(["asset", "date"])
.with_columns([
pl.col("v").pct_change().over("asset").alias("pct_change")
])
)
using groupby + explode
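(df.groupby("asset")
.agg([
# keep every value of each group, ordered by date, so the rows line up with pct_change
pl.all().sort_by("date"),
pl.col("v").sort_by("date").pct_change().alias("pct_change")
]).explode(["v", "date", "pct_change"])
)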
Result
Both output:
shape: (8, 4)
┌───────┬─────┬──────┬────────────┐
│ asset ┆ v ┆ date ┆ pct_change │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ f64 │
╞═══════╪═════╪══════╪════════════╡
│ a ┆ 4 ┆ 2013 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 1 ┆ 2017 ┆ -0.75 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 2011 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 5 ┆ 2014 ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 7 ┆ 2016 ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 8 ┆ 2010 ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 2012 ┆ -0.625 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 6 ┆ 2015 ┆ 1.0 │
└───────┴─────┴──────┴────────────┘
I have a problem to merge columns into one. Say I have a dataframe (df) like below:
>> print(df)
shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ a ┆ b_a_1 ┆ b_a_2 ┆ b_a_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪═══════╡
│ 1 ┆ a-- ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ ┆ b-- ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ ┆ ┆ c-- │
└─────┴───────┴───────┴───────┘
And I want to be able to merge the last three (3) columns into one using python-polars. Listing the columns explicitly gives me what I want:
>> out = df.select(pl.concat_str(['b_a_1', 'b_a_2', 'b_a_3']).alias('b_a'))
>> print(out)
shape: (3, 1)
┌─────┐
│ b_a │
│ --- │
│ str │
╞═════╡
│ a-- │
├╌╌╌╌╌┤
│ b-- │
├╌╌╌╌╌┤
│ c-- │
└─────┘
When I use a regex to select the columns, I don't get the above result:
>> out = df.select(pl.concat_str('^b_a_\d$'))
>> print(out)
shape: (3, 3)
┌───────┬───────┬───────┐
│ b_a_1 ┆ b_a_2 ┆ b_a_3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═══════╪═══════╪═══════╡
│ a-- ┆ ┆ │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ┆ b-- ┆ │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ┆ ┆ c-- │
└───────┴───────┴───────┘
and I get nothing when I run
>> out = df.select(pl.concat_str('^b_a_*$'))
>> print(out)
shape: (0, 0)
┌┐
╞╡
└┘
How am I to select the columns with regex and combine them into one?
Thank you very much for your time and suggestion.
Sincerely,
Thi An