How can we resample time series in Polars (Python)?

I'd like to use the bucket expression with groupby to downsample on a monthly basis, as the downsampling function will be deprecated. Is there an easy way to do this? datetime.timedelta only works on days and lower.

With the landing of groupby_dynamic we can now downsample and use the whole expression API for our aggregations. This means we can resample by:
upsampling
downsampling
first upsampling and then downsampling
Let's go through an example:
from datetime import datetime
import polars as pl
df = pl.DataFrame(
{"time": pl.date_range(low=datetime(2021, 12, 16), high=datetime(2021, 12, 16, 3), interval="30m"),
"groups": ["a", "a", "a", "b", "b", "a", "a"],
"values": [1., 2., 3., 4., 5., 6., 7.]
})
print(df)
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime ┆ str ┆ f64 │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ a ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ b ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a ┆ 7 │
└─────────────────────┴────────┴────────┘
Upsampling
Upsampling can be done by defining an interval. This will yield a DataFrame with nulls, which can then be filled with a fill strategy or interpolation.
df.upsample("time", "15m").fill_null("forward")
shape: (13, 3)
┌─────────────────────┬────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime ┆ str ┆ f64 │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:15:00 ┆ a ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:45:00 ┆ a ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:15:00 ┆ b ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:45:00 ┆ a ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a ┆ 7 │
└─────────────────────┴────────┴────────┘
(df.upsample("time", "15m")
.interpolate()
.fill_null("forward") # string columns cannot be interpolated
)
shape: (13, 3)
┌─────────────────────┬────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime ┆ str ┆ f64 │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:15:00 ┆ a ┆ 1.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:45:00 ┆ a ┆ 2.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:15:00 ┆ b ┆ 5.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:45:00 ┆ a ┆ 6.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a ┆ 7 │
└─────────────────────┴────────┴────────┘
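To make the two fill strategies concrete, here is a minimal plain-Python sketch (not how polars implements it) that rebuilds the 15-minute grid for the values column and fills the gaps both ways. The helper names forward_fill and interpolate are made up for this illustration.

```python
from datetime import datetime, timedelta

# Original half-hourly samples from the example frame.
start = datetime(2021, 12, 16)
times = [start + timedelta(minutes=30 * i) for i in range(7)]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
known = dict(zip(times, values))

# Upsample: build the 15-minute grid; new slots have no value yet (None).
step = timedelta(minutes=15)
grid = [start + step * i for i in range(13)]
upsampled = [known.get(t) for t in grid]


def forward_fill(xs):
    """Replace each None with the most recent non-None value."""
    out, last = [], None
    for x in xs:
        last = x if x is not None else last
        out.append(last)
    return out


def interpolate(xs):
    """Linearly interpolate runs of None between two known neighbours."""
    out = list(xs)
    known_idx = [i for i, x in enumerate(xs) if x is not None]
    for lo, hi in zip(known_idx, known_idx[1:]):
        for i in range(lo + 1, hi):
            frac = (i - lo) / (hi - lo)
            out[i] = xs[lo] + frac * (xs[hi] - xs[lo])
    return out


filled = forward_fill(upsampled)
interpolated = interpolate(upsampled)
print(filled)
print(interpolated)
```

Forward fill repeats the previous sample, while interpolation places each new point halfway between its two neighbours.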
Downsampling
This is the powerful one, because we can also combine this with normal groupby keys. It gives us a virtual moving window over the time series (grouped by one or multiple keys) that can be aggregated with the expression API.
(df.groupby_dynamic(
time_column="time",
every="1h",
closed="both",
by="groups",
include_boundaries=True
)
.agg([
pl.col('time').count(),
pl.col("time").max(),
pl.sum("values"),
]))
shape: (4, 7)
┌────────┬────────────┬────────────┬────────────┬────────────┬─────────────────────┬────────────┐
│ groups ┆ _lower_bou ┆ _upper_bou ┆ time ┆ time_count ┆ time_max ┆ values_sum │
│ --- ┆ ndary ┆ ndary ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ --- ┆ --- ┆ datetime ┆ u32 ┆ datetime ┆ f64 │
│ ┆ datetime ┆ datetime ┆ ┆ ┆ ┆ │
╞════════╪════════════╪════════════╪════════════╪════════════╪═════════════════════╪════════════╡
│ a ┆ 2021-12-16 ┆ 2021-12-16 ┆ 2021-12-16 ┆ 3 ┆ 2021-12-16 01:00:00 ┆ 6 │
│ ┆ 00:00:00 ┆ 01:00:00 ┆ 00:00:00 ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2021-12-16 ┆ 2021-12-16 ┆ 2021-12-16 ┆ 1 ┆ 2021-12-16 01:00:00 ┆ 3 │
│ ┆ 01:00:00 ┆ 02:00:00 ┆ 01:00:00 ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2021-12-16 ┆ 2021-12-16 ┆ 2021-12-16 ┆ 2 ┆ 2021-12-16 03:00:00 ┆ 13 │
│ ┆ 02:00:00 ┆ 03:00:00 ┆ 02:00:00 ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2021-12-16 ┆ 2021-12-16 ┆ 2021-12-16 ┆ 2 ┆ 2021-12-16 02:00:00 ┆ 9 │
│ ┆ 01:00:00 ┆ 02:00:00 ┆ 01:00:00 ┆ ┆ ┆ │
└────────┴────────────┴────────────┴────────────┴────────────┴─────────────────────┴────────────┘
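As a rough illustration of what closed="both" means, the sketch below recomputes the three windows of group a in plain Python. It is only a model of the window membership rule, under the assumption that windows start on the full hour; the variable names are hypothetical and this is not the polars implementation.

```python
from datetime import datetime, timedelta

# Rows of group "a" from the example frame: (time, value).
rows_a = [
    (datetime(2021, 12, 16, 0, 0), 1.0),
    (datetime(2021, 12, 16, 0, 30), 2.0),
    (datetime(2021, 12, 16, 1, 0), 3.0),
    (datetime(2021, 12, 16, 2, 30), 6.0),
    (datetime(2021, 12, 16, 3, 0), 7.0),
]

every = timedelta(hours=1)
# Assumption: windows start at 00:00, 01:00 and 02:00.
window_starts = [datetime(2021, 12, 16, h) for h in range(3)]

summary = []
for lower in window_starts:
    upper = lower + every
    # closed="both": a row belongs to the window if lower <= time <= upper.
    members = [(t, v) for t, v in rows_a if lower <= t <= upper]
    if members:
        summary.append({
            "lower": lower,
            "upper": upper,
            "count": len(members),
            "time_max": max(t for t, _ in members),
            "values_sum": sum(v for _, v in members),
        })

for row in summary:
    print(row)
```

Note that the 01:00 row is counted in two windows, because a window closed on both sides shares its boundaries with its neighbours.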

I found a solution for my problem using the round expression and a groupby operation on the date column after it.
Here is a code example:
import polars as pl
df = pl.DataFrame(
{
"A": [
"2020-01-01",
"2020-01-02",
"2020-02-03",
"2020-02-04",
"2020-03-05",
"2020-03-06",
"2020-06-06",
],
"B": [1.0, 8.0, 6.0, 2.0, 16.0, 10.0,2],
"C": [3.0, 6.0, 9.0, 2.0, 13.0, 8.0,2],
"D": [12.0, 5.0, 9.0, 2.0, 11.0, 2.0,2],
}
)
q = (
df.lazy().with_column(pl.col('A').str.strptime(pl.Date, "%Y-%m-%d").dt.round(rule='month',n=1))
.groupby('A').agg(
[pl.col("B").max(),
pl.col("C").min(),
pl.col("D").last()]
)
.sort('A')
)
df = q.collect()
print(df)
prints
┌────────────┬───────┬───────┬────────┐
│ A ┆ B_max ┆ C_min ┆ D_last │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════╪═══════╪════════╡
│ 2020-01-01 ┆ 8 ┆ 3 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-02-01 ┆ 6 ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-03-01 ┆ 16 ┆ 8 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-06-01 ┆ 2 ┆ 2 ┆ 2 │
└────────────┴───────┴───────┴────────┘
Some explanation: first I parse the string column into the pl.Date type, then I use .dt to access the namespace for date types. After that I use the round function of that namespace to round all dates on a monthly basis to the first day of the month. As a result, every date in the same month gets the same date, so I can group them with groupby and apply aggregation functions to the groups. The downside of this method (and of downsampling too) is that you miss the months where no dates exist. For this you can use the following code; it's a little bit messy and I had to use pandas, but I think it works.
import datetime
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta
date_min = df['A'].dt.min()
date_max = df['A'].dt.max()+relativedelta(months=+1)
t_index=pd.date_range(date_min, date_max, freq='M',closed='right').values
t_index = [datetime.datetime.fromisoformat(str(np.datetime_as_string(x, unit='D'))) for x in t_index]
df_ref = pl.DataFrame(t_index,columns='A')
q=(
df_ref.lazy().with_column(pl.col('A').cast(pl.Date).dt.round(rule='month',n=1))
.join(df.lazy(),on='A',how='left')
)
df = q.collect()
print(df)
results in
┌────────────┬───────┬───────┬────────┐
│ A ┆ B_max ┆ C_min ┆ D_last │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════╪═══════╪════════╡
│ 2020-01-01 ┆ 8 ┆ 3 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-02-01 ┆ 6 ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-03-01 ┆ 16 ┆ 8 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-04-01 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-05-01 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2020-06-01 ┆ 2 ┆ 2 ┆ 2 │
└────────────┴───────┴───────┴────────┘
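If you would rather avoid the pandas round trip, the missing months can also be generated with the standard library alone. This is only a sketch: month_starts is a hypothetical helper, and the dictionary simply restates the monthly aggregates printed above.

```python
from datetime import date


def month_starts(first, last):
    """Yield the first day of every month from first to last, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


# Monthly aggregates (B_max, C_min, D_last), keyed by month start.
monthly = {
    date(2020, 1, 1): (8.0, 3.0, 5.0),
    date(2020, 2, 1): (6.0, 2.0, 2.0),
    date(2020, 3, 1): (16.0, 8.0, 2.0),
    date(2020, 6, 1): (2.0, 2.0, 2.0),
}

# Left join against the full month list: months without data map to None.
all_months = list(month_starts(min(monthly), max(monthly)))
filled = [(m, monthly.get(m)) for m in all_months]

for m, aggregates in filled:
    print(m, aggregates)
```

The resulting list of month starts could then be turned into a one-column frame and left-joined exactly as in the snippet above.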

Related

Adding a secondary group by clause using the groupby_dynamic() operation in polars

I would like to group the data in hourly/daily/weekly intervals and further group by certain other clauses. I was able to achieve grouping on an hourly/daily/weekly basis by using the groupby_dynamic option provided by polars.
How do we add a secondary non-datetime groupby clause to the polars dataframe after using the groupby_dynamic operation?
The sample dataframe read from csv is
┌─────────────────────┬──────────┬────────┬─────────┬──────────┐
│ Date                ┆ Item     ┆ Issue  ┆ Channel ┆ ID       │
╞═════════════════════╪══════════╪════════╪═════════╪══════════╡
│ 2023-01-02 01:00:00 ┆ Item ABC ┆ EAAGCD ┆ Twitter ┆ 32513995 │
│ 2023-01-02 01:40:00 ┆ Item ABC ┆ ASDFFF ┆ Web     ┆ 32513995 │
│ 2023-01-02 02:15:00 ┆ Item ABC ┆ WERWET ┆ Web     ┆ 32513995 │
│ 2023-01-02 03:00:00 ┆ Item ABC ┆ BVRTNB ┆ Twitter ┆ 32513995 │
│ 2023-01-03 04:11:00 ┆ Item ABC ┆ VDFGVS ┆ Fax     ┆ 32513995 │
│ 2023-01-03 04:30:00 ┆ Item ABC ┆ QWEDWE ┆ Twitter ┆ 32513995 │
│ 2023-01-03 04:45:00 ┆ Item ABC ┆ BRHMNU ┆ Fax     ┆ 32513995 │
└─────────────────────┴──────────┴────────┴─────────┴──────────┘
I am grouping this data in hourly intervals using the polars groupby_dynamic operation with the code snippet below.
import polars as pl
q = (
pl.scan_csv("Test.csv", parse_dates=True)
.filter(pl.col("Item") == "Item ABC")
.groupby_dynamic("Date", every="1h", closed="right")
.agg([pl.col("ID").count().alias("total")])
.sort(["Date"])
)
df = q.collect()
This code gives me the following result:
┌─────────────────────┬───────┐
│ Date ┆ total │
╞═════════════════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ 2 │
│ 2023-01-02 02:00:00 ┆ 1 │
│ 2023-01-02 03:00:00 ┆ 1 │
│ 2023-01-05 04:00:00 ┆ 3 │
└─────────────────────┴───────┘
But I would like to further group this data by "Channel", and I expect the result to be
┌─────────────────────┬─────────┬───────┐
│ Date ┆ Channel ┆ total │
╞═════════════════════╪═════════╪═══════╡
│ 2023-01-02 01:00:00 ┆ Twitter ┆ 1 │
│ 2023-01-02 01:00:00 ┆ Web ┆ 1 │
│ 2023-01-02 01:00:00 ┆ Web ┆ 1 │
│ 2023-01-02 01:00:00 ┆ Twitter ┆ 1 │
│ 2023-01-03 01:00:00 ┆ Fax ┆ 2 │
│ 2023-01-11 01:00:00 ┆ Twitter ┆ 1 │
└─────────────────────┴─────────┴───────┘
You can specify the by argument:
q = (
pl.scan_csv("Test.csv", parse_dates=True)
.filter(pl.col("Item") == "Item ABC")
.groupby_dynamic("Date", every="1h", closed="right", by="Channel")
.agg([pl.col("ID").count().alias("total")])
.sort(["Date"])
)
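To illustrate what a secondary key does, here is a plain-Python sketch that counts the sample rows per (hour, Channel) pair. It assumes left-closed hourly buckets obtained by truncating to the full hour, which is simpler than closed="right", so rows that sit exactly on a boundary may land in a different bucket than in the polars query.

```python
from collections import Counter
from datetime import datetime

# The sample rows from the question: (Date, Channel).
rows = [
    ("2023-01-02 01:00:00", "Twitter"),
    ("2023-01-02 01:40:00", "Web"),
    ("2023-01-02 02:15:00", "Web"),
    ("2023-01-02 03:00:00", "Twitter"),
    ("2023-01-03 04:11:00", "Fax"),
    ("2023-01-03 04:30:00", "Twitter"),
    ("2023-01-03 04:45:00", "Fax"),
]

totals = Counter()
for stamp, channel in rows:
    # Truncate to the start of the hour, then key on (hour, channel).
    hour = datetime.fromisoformat(stamp).replace(minute=0, second=0, microsecond=0)
    totals[(hour, channel)] += 1

for (hour, channel), total in sorted(totals.items()):
    print(hour, channel, total)
```

Each distinct combination of time bucket and channel gets its own row, which is the shape the by argument produces.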

How to truncate a datetime to just the day

I am trying to move from pandas to polars but I am running into the following issue.
import polars as pl
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
Yields the following dataframe:
>>> df
shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET │
│ 2 ┆ 2010-02-01 01:00:00 CET │
│ 3 ┆ 2010-02-01 02:00:00 CET │
└─────────┴────────────────────────────────┘
As you can see, I transformed the datetime string from UTC to CET successfully. However, when I try to extract the date (using the accepted answer by the polars author in this thread: https://stackoverflow.com/a/73212748/16332690), it seems to extract the date from the UTC string even though it has been transformed, e.g.:
df = df.with_columns(
[
pl.col("date").cast(pl.Date).alias("valueDay")
]
)
>>> df
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date ┆ valueDay │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ date │
╞═════════╪════════════════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 2010-01-31 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘
The valueDay should be 2010-02-01 for all 3 values.
Can anyone help me fix this? By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df or is there a way to chain all of this?
Edit:
I managed to find a quick way around this but it would be nice if the issue above could be addressed. A pandas dt.date like way to approach this would be nice, I noticed that it is missing over here: https://pola-rs.github.io/polars/py-polars/html/reference/series/timeseries.html
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
df = df.with_columns(
[
pl.col("date").dt.day().alias("day"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.year().alias("year"),
]
)
df = df.with_columns(
pl.datetime(year=pl.col("year"), month=pl.col("month"), day=pl.col("day"))
)
df = df.with_columns(
[
pl.col("datetime").cast(pl.Date).alias("valueDay")
]
)
Yields the following:
>>> df
shape: (3, 7)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┬─────────────────────┬────────────┐
│ integer ┆ date ┆ day ┆ month ┆ year ┆ datetime ┆ valueDay │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32 ┆ i32 ┆ datetime[μs] ┆ date │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╪═════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┴─────────────────────┴────────────┘
Would this temporary workaround help? Starting with this data:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"date": pl.date_range(
datetime(2010, 1, 30, 22, 0, 0),
datetime(2010, 2, 1, 2, 0, 0),
"1h",
).dt.with_time_zone("Europe/Amsterdam"),
}
)
df
shape: (29, 1)
┌────────────────────────────────┐
│ date │
│ --- │
│ datetime[μs, Europe/Amsterdam] │
╞════════════════════════════════╡
│ 2010-01-30 23:00:00 CET │
│ 2010-01-31 00:00:00 CET │
│ 2010-01-31 01:00:00 CET │
│ 2010-01-31 02:00:00 CET │
│ 2010-01-31 03:00:00 CET │
│ 2010-01-31 04:00:00 CET │
│ 2010-01-31 05:00:00 CET │
│ 2010-01-31 06:00:00 CET │
│ 2010-01-31 07:00:00 CET │
│ 2010-01-31 08:00:00 CET │
│ 2010-01-31 09:00:00 CET │
│ 2010-01-31 10:00:00 CET │
│ 2010-01-31 11:00:00 CET │
│ 2010-01-31 12:00:00 CET │
│ 2010-01-31 13:00:00 CET │
│ 2010-01-31 14:00:00 CET │
│ 2010-01-31 15:00:00 CET │
│ 2010-01-31 16:00:00 CET │
│ 2010-01-31 17:00:00 CET │
│ 2010-01-31 18:00:00 CET │
│ 2010-01-31 19:00:00 CET │
│ 2010-01-31 20:00:00 CET │
│ 2010-01-31 21:00:00 CET │
│ 2010-01-31 22:00:00 CET │
│ 2010-01-31 23:00:00 CET │
│ 2010-02-01 00:00:00 CET │
│ 2010-02-01 01:00:00 CET │
│ 2010-02-01 02:00:00 CET │
│ 2010-02-01 03:00:00 CET │
└────────────────────────────────┘
You can extract the date using
(
df.with_columns(
pl.col("date")
.dt.cast_time_zone("UTC")
.cast(pl.Date)
.alias("trunc_date")
)
)
shape: (29, 2)
┌────────────────────────────────┬────────────┐
│ date ┆ trunc_date │
│ --- ┆ --- │
│ datetime[μs, Europe/Amsterdam] ┆ date │
╞════════════════════════════════╪════════════╡
│ 2010-01-30 23:00:00 CET ┆ 2010-01-30 │
│ 2010-01-31 00:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 01:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 02:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 03:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 04:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 05:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 06:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 07:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 08:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 09:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 10:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 11:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 12:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 13:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 14:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 15:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 16:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 17:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 18:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 19:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 20:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 21:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 22:00:00 CET ┆ 2010-01-31 │
│ 2010-01-31 23:00:00 CET ┆ 2010-01-31 │
│ 2010-02-01 00:00:00 CET ┆ 2010-02-01 │
│ 2010-02-01 01:00:00 CET ┆ 2010-02-01 │
│ 2010-02-01 02:00:00 CET ┆ 2010-02-01 │
│ 2010-02-01 03:00:00 CET ┆ 2010-02-01 │
└────────────────────────────────┴────────────┘
There's nothing wrong with the cast function per se. It's just not intended to be used for truncating the time out of a datetime.
As it happens the function you're looking for is truncate so this is what you want to do (in the last with_columns chunk)
pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
).with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
).with_columns(
[
pl.col("date").dt.truncate('1d').alias("valueDay")
]
)
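The difference between the two dates can be reproduced with the standard library. The sketch below uses a fixed +01:00 offset as a stand-in for Europe/Amsterdam (an assumption that only holds in winter) and compares the date of the UTC instant with the date on the local wall clock.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed +01:00 offset standing in for Europe/Amsterdam (CET).
CET = timezone(timedelta(hours=1), "CET")

stamps = [
    "2010-01-31T23:00:00+00:00",
    "2010-02-01T00:00:00+00:00",
    "2010-02-01T01:00:00+00:00",
]
utc_times = [datetime.fromisoformat(s) for s in stamps]
local_times = [t.astimezone(CET) for t in utc_times]

# Date of the underlying UTC instant vs. date on the local wall clock.
utc_dates = [t.date() for t in utc_times]
local_dates = [t.date() for t in local_times]

for utc_d, local_t, local_d in zip(utc_dates, local_times, local_dates):
    print(local_t.isoformat(), "utc date:", utc_d, "local date:", local_d)
```

The first timestamp falls on 2010-01-31 in UTC but on 2010-02-01 local time, which is the mismatch seen in the valueDay column of the question.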

Pandas groupby apply is very slow

grouped = data_v1.sort_values(by="Strike_Price").groupby(['dateTime', 'close', 'Index', 'Expiry', 'group'])
def calc_summary(group):
    name = group.name
    # name is the tuple of groupby keys; index 4 is the 'group' key
    if name[4] == "above":
        call_oi = group['Call_OI'].sum()
        call_vol = group['Call_Volume'].sum()
        put_oi = group['Put_OI'].sum()
        put_vol = group['Put_Volume'].sum()
        call_oi_1 = group.head(1)['Call_OI'].sum()
        call_vol_1 = group.head(1)['Call_Volume'].sum()
        put_oi_1 = group.head(1)['Put_OI'].sum()
        put_vol_1 = group.head(1)['Put_Volume'].sum()
    else:
        call_oi = group['Call_OI'].sum()
        call_vol = group['Call_Volume'].sum()
        put_oi = group['Put_OI'].sum()
        put_vol = group['Put_Volume'].sum()
        call_oi_1 = group.tail(1)['Call_OI'].sum()
        call_vol_1 = group.tail(1)['Call_Volume'].sum()
        put_oi_1 = group.tail(1)['Put_OI'].sum()
        put_vol_1 = group.tail(1)['Put_Volume'].sum()
    summary = pd.DataFrame([{'call_oi': call_oi,
                             'call_vol': call_vol,
                             'put_oi': put_oi,
                             'put_vol': put_vol,
                             'call_oi_1': call_oi_1,
                             'call_vol_1': call_vol_1,
                             'put_oi_1': put_oi_1,
                             'put_vol_1': put_vol_1}])
    return summary
result = grouped.apply(calc_summary)
The above code takes too much time to run, given that the dataset is not even that big. Currently, it takes about 23 seconds on my system.
I tried swifter but that doesn't work with groupby objects.
What should I do to make my code faster?
Edit:
The data looks like this
{'dateTime': {0: Timestamp('2023-02-06 09:21:00'),
1: Timestamp('2023-02-06 09:21:00'),
2: Timestamp('2023-02-06 09:21:00'),
3: Timestamp('2023-02-06 09:21:00'),
4: Timestamp('2023-02-06 09:21:00')},
'close': {0: 17780.55, 1: 17780.55, 2: 17780.55, 3: 17780.55, 4: 17780.55},
'Index': {0: 'NIFTY', 1: 'NIFTY', 2: 'NIFTY', 3: 'NIFTY', 4: 'NIFTY'},
'Expiry': {0: '16FEB2023',
1: '23FEB2023',
2: '9FEB2023',
3: '16FEB2023',
4: '23FEB2023'},
'Expiry_order': {0: 'week_2',
1: 'week_3',
2: 'week_1',
3: 'week_2',
4: 'week_3'},
'group': {0: 'below', 1: 'below', 2: 'below', 3: 'below', 4: 'below'},
'Call_OI': {0: nan, 1: 60.0, 2: 4.0, 3: nan, 4: nan},
'Put_OI': {0: 1364.0, 1: 11255.0, 2: 91059.0, 3: 343.0, 4: 153.0},
'Call_Volume': {0: nan, 1: 3.0, 2: 2.0, 3: nan, 4: nan},
'Put_Volume': {0: 84.0, 1: 1246.0, 2: 5197.0, 3: 24.0, 4: 1.0},
'Strike_Price': {0: 16100.0, 1: 16100.0, 2: 16100.0, 3: 16150.0, 4: 16150.0}}
Using your sample data:
import io
import pandas as pd
csv = """
dateTime,close,Index,Expiry,Expiry_order,group,Call_OI,Put_OI,Call_Volume,Put_Volume,Strike_Price
2023-02-06 09:21:00,17780.55,NIFTY,16FEB2023,week_2,below,,1364.0,,84.0,16100.0
2023-02-06 09:21:00,17780.55,NIFTY,23FEB2023,week_3,below,60.0,11255.0,3.0,1246.0,16100.0
2023-02-06 09:21:00,17780.55,NIFTY,9FEB2023,week_1,below,4.0,91059.0,2.0,5197.0,16100.0
2023-02-06 09:21:00,17780.55,NIFTY,16FEB2023,week_2,below,,343.0,,24.0,16150.0
2023-02-06 09:21:00,17780.55,NIFTY,23FEB2023,week_3,below,,153.0,,1.0,16150.0
"""
df = pd.read_csv(io.StringIO(csv))
The output of your calc_summary function:
>>> df.sort_values(by='Strike_Price').groupby(['dateTime', 'close', 'Index', 'Expiry', 'group']).apply(calc_summary)
call_oi call_vol put_oi put_vol call_oi_1 call_vol_1 put_oi_1 put_vol_1
dateTime close Index Expiry group
2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0 0.0 0.0 1707.0 108.0 0.0 0.0 343.0 24.0
23FEB2023 below 0 60.0 3.0 11408.0 1247.0 0.0 0.0 153.0 1.0
9FEB2023 below 0 4.0 2.0 91059.0 5197.0 4.0 2.0 91059.0 5197.0
.agg()
You're performing an aggregation where you conditionally want the head/tail depending on the value of the group column.
You could aggregate both values instead and then do the filtering afterwards.
This allows you to use .agg() directly.
We can use first and last aggregations for head/tail but must first fillna(0) as they handle NaN values differently.
summary = (
df.fillna(0) # needed for first/last as they ignore NaN
.sort_values(by='Strike_Price')
.groupby(['dateTime', 'close', 'Index', 'Expiry', 'group'])
[['Call_OI', 'Call_Volume', 'Put_OI', 'Put_Volume']]
.agg(['first', 'last', 'sum'])
.reset_index()
)
Which produces a multi-indexed column structure like:
dateTime close Index Expiry group Call_OI Call_Volume Put_OI Put_Volume
first last sum first last sum first last sum first last sum
0 2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0.0 0.0 0.0 0.0 0.0 0.0 1364.0 343.0 1707.0 84.0 24.0 108.0
1 2023-02-06 09:21:00 17780.55 NIFTY 23FEB2023 below 60.0 0.0 60.0 3.0 0.0 3.0 11255.0 153.0 11408.0 1246.0 1.0 1247.0
2 2023-02-06 09:21:00 17780.55 NIFTY 9FEB2023 below 4.0 4.0 4.0 2.0 2.0 2.0 91059.0 91059.0 91059.0 5197.0 5197.0 5197.0
To say you want the last values when group != "above" you can:
>>> below = summary.loc[summary['group'] != 'above', summary.columns.get_level_values(1) != 'first']
>>> below
dateTime close Index Expiry group Call_OI Call_Volume Put_OI Put_Volume
last sum last sum last sum last sum
0 2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0.0 0.0 0.0 0.0 343.0 1707.0 24.0 108.0
1 2023-02-06 09:21:00 17780.55 NIFTY 23FEB2023 below 0.0 60.0 0.0 3.0 153.0 11408.0 1.0 1247.0
2 2023-02-06 09:21:00 17780.55 NIFTY 9FEB2023 below 4.0 4.0 2.0 2.0 91059.0 91059.0 5197.0 5197.0
To flatten the column structure similar to your function's output you can:
>>> below.columns = [left.lower() + ('' if right in {'', 'sum'} else '_1') for left, right in below.columns]
>>> below
datetime close index expiry group call_oi_1 call_oi call_volume_1 call_volume put_oi_1 put_oi put_volume_1 put_volume
0 2023-02-06 09:21:00 17780.55 NIFTY 16FEB2023 below 0.0 0.0 0.0 0.0 343.0 1707.0 24.0 108.0
1 2023-02-06 09:21:00 17780.55 NIFTY 23FEB2023 below 0.0 60.0 0.0 3.0 153.0 11408.0 1.0 1247.0
2 2023-02-06 09:21:00 17780.55 NIFTY 9FEB2023 below 4.0 4.0 2.0 2.0 91059.0 91059.0 5197.0 5197.0
There are no examples of above in your data - but you could do the same for those rows using == 'above' and != 'last' and concat both sets of rows into a single dataframe.
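A sketch of that last step, using a hypothetical toy frame with one above and one below group and a single value column (the flatten helper is made up for this example):

```python
import pandas as pd

# Hypothetical toy data: two rows per group, one value column.
df = pd.DataFrame({
    "group": ["above", "above", "below", "below"],
    "Strike_Price": [100.0, 110.0, 80.0, 90.0],
    "Call_OI": [1.0, 2.0, 3.0, 4.0],
})

summary = (
    df.sort_values(by="Strike_Price")
    .groupby("group")[["Call_OI"]]
    .agg(["first", "last", "sum"])
    .reset_index()
)
level1 = summary.columns.get_level_values(1)


def flatten(frame):
    """Collapse the two column levels into call_oi / call_oi_1 style names."""
    out = frame.copy()
    out.columns = [
        left.lower() + ("" if right in {"", "sum"} else "_1")
        for left, right in frame.columns
    ]
    return out


# "above" rows keep first + sum, all other rows keep last + sum.
above = flatten(summary.loc[summary["group"] == "above", level1 != "last"])
below = flatten(summary.loc[summary["group"] != "above", level1 != "first"])

result = pd.concat([above, below], ignore_index=True)
print(result)
```

Because both halves end up with the same flattened column names, concat lines them up without any further renaming.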
Polars
You may also wish to compare how the dataset performs with polars.
One possible approach which generates the same output:
import io
import polars as pl
df = pl.read_csv(io.StringIO(csv))
columns = ["Call_OI", "Call_Volume", "Put_OI", "Put_Volume"]
(
df
.sort("Strike_Price")
.groupby(["dateTime", "close", "Index", "Expiry", "group"], maintain_order=True)
.agg([
pl.col(columns).sum(),
pl.when(pl.col("group").first() == "above")
.then(pl.col(columns).first())
.otherwise(pl.col(columns).last())
.suffix("_1")
])
.fill_null(0)
)
shape: (3, 13)
┌─────────────────────┬──────────┬───────┬───────────┬───────┬─────────┬─────────────┬─────────┬────────────┬───────────┬───────────────┬──────────┬──────────────┐
│ dateTime | close | Index | Expiry | group | Call_OI | Call_Volume | Put_OI | Put_Volume | Call_OI_1 | Call_Volume_1 | Put_OI_1 | Put_Volume_1 │
│ --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- │
│ str | f64 | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 │
╞═════════════════════╪══════════╪═══════╪═══════════╪═══════╪═════════╪═════════════╪═════════╪════════════╪═══════════╪═══════════════╪══════════╪══════════════╡
│ 2023-02-06 09:21:00 | 17780.55 | NIFTY | 16FEB2023 | below | 0.0 | 0.0 | 1707.0 | 108.0 | 0.0 | 0.0 | 343.0 | 24.0 │
│ 2023-02-06 09:21:00 | 17780.55 | NIFTY | 23FEB2023 | below | 60.0 | 3.0 | 11408.0 | 1247.0 | 0.0 | 0.0 | 153.0 | 1.0 │
│ 2023-02-06 09:21:00 | 17780.55 | NIFTY | 9FEB2023 | below | 4.0 | 2.0 | 91059.0 | 5197.0 | 4.0 | 2.0 | 91059.0 | 5197.0 │
└─────────────────────┴──────────┴───────┴───────────┴───────┴─────────┴─────────────┴─────────┴────────────┴───────────┴───────────────┴──────────┴──────────────┘

Python Polars: How to apply an aggregate function for all columns and pass one additional column as argument?

I have a lazy dataframe (using scan_parquet) like below,
region time sen1 sen2 sen3
us 1 10.0 11.0 12.0
us 2 11.0 14.0 13.0
us 3 10.1 10.0 12.3
us 4 13.0 11.1 14.0
us 5 12.0 11.0 19.0
uk 1 10.0 11.0 12.1
uk 2 11.0 14.0 13.0
uk 3 10.1 10.0 12.0
uk 4 13.0 11.1 14.0
uk 5 12.0 11.0 19.0
uk 6 13.7 11.1 14.0
uk 7 12.0 11.0 21.9
I want to find the max and min of every sensor for each region and, while doing so, I also want the time at which the max and min happened.
So, I wrote the below aggregate function,
def my_custom_agg(t, v):
    smax = v.max()
    smin = v.min()
    smax_t = t[v.arg_max()]
    smin_t = t[v.arg_min()]
    return [smax, smin, smax_t, smin_t]
Then I did the groupby as below,
df.groupby('region').agg(
[
pl.col('*').apply(lambda s: my_custom_agg(pl.col('time'),s))
]
)
When I do this, I get the below error,
TypeError: 'Expr' object is not subscribable
Expected result,
region sen1 sen2 sen3
us [13.0,10.0,4,1] [14.0,10.0,2,3] [19.0,12.0,5,1]
uk [13.7,10.0,6,1] [14.0,10.0,2,3] [21.9,12.0,7,3]
# which I will melt and transform to below,
region sname smax smin smax_t smin_t
us sen1 13.0 10.0 4 1
us sen2 14.0 10.0 2 3
us sen3 19.0 12.0 5 1
uk sen1 13.7 10.0 6 1
uk sen2 14.0 10.0 2 3
uk sen3 21.9 12.0 7 3
Could you please tell me how to pass one additional column as an argument? If there is an alternative way to do this, I am happy to hear it since I am flexible with the output format.
Note: In my real dataset I have 8k sensors, so it is better to do with *.
Thanks for your support.
You could .melt() and .sort() first.
Then when you .groupby() you can use .first() and .last() to get the min/max for time and value.
pl.all() can be used instead of pl.col("*")
>>> (
... df
... .melt(["region", "time"], variable_name="sname")
... .sort(pl.all().exclude("time"))
... .groupby(["region", "sname"])
... .agg([
... pl.all().first().suffix("_min"),
... pl.all().last() .suffix("_max"),
... ])
... )
shape: (6, 6)
┌────────┬───────┬──────────┬───────────┬──────────┬───────────┐
│ region ┆ sname ┆ time_min ┆ value_min ┆ time_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 ┆ i64 ┆ f64 │
╞════════╪═══════╪══════════╪═══════════╪══════════╪═══════════╡
│ uk ┆ sen1 ┆ 1 ┆ 10.0 ┆ 6 ┆ 13.7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ uk ┆ sen3 ┆ 3 ┆ 12.0 ┆ 7 ┆ 21.9 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ us ┆ sen1 ┆ 1 ┆ 10.0 ┆ 4 ┆ 13.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ us ┆ sen2 ┆ 3 ┆ 10.0 ┆ 2 ┆ 14.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ uk ┆ sen2 ┆ 3 ┆ 10.0 ┆ 2 ┆ 14.0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ us ┆ sen3 ┆ 1 ┆ 12.0 ┆ 5 ┆ 19.0 │
└────────┴───────┴──────────┴───────────┴──────────┴───────────┘
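The same idea, pairing each value with its time so that the time travels along with the min/max, can be sketched in plain Python. This is only an illustration on the sample rows from the question, not a replacement for the polars query; on ties, tuple comparison picks the earliest time for the minimum and the latest time for the maximum.

```python
# Rows from the question: (region, time, sen1, sen2, sen3).
rows = [
    ("us", 1, 10.0, 11.0, 12.0),
    ("us", 2, 11.0, 14.0, 13.0),
    ("us", 3, 10.1, 10.0, 12.3),
    ("us", 4, 13.0, 11.1, 14.0),
    ("us", 5, 12.0, 11.0, 19.0),
    ("uk", 1, 10.0, 11.0, 12.1),
    ("uk", 2, 11.0, 14.0, 13.0),
    ("uk", 3, 10.1, 10.0, 12.0),
    ("uk", 4, 13.0, 11.1, 14.0),
    ("uk", 5, 12.0, 11.0, 19.0),
    ("uk", 6, 13.7, 11.1, 14.0),
    ("uk", 7, 12.0, 11.0, 21.9),
]
sensors = ["sen1", "sen2", "sen3"]

# "Melt": collect one (value, time) pair per region and sensor.
pairs = {}
for region, time, *readings in rows:
    for sname, value in zip(sensors, readings):
        pairs.setdefault((region, sname), []).append((value, time))

# Tuples compare by value first, so min/max carry the matching time along.
result = {}
for key, observations in pairs.items():
    smin, smin_t = min(observations)
    smax, smax_t = max(observations)
    result[key] = (smax, smin, smax_t, smin_t)

for (region, sname), stats in sorted(result.items()):
    print(region, sname, *stats)
```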

Break a time series dataframe into one with multiple columns, each named after a year

I have the following time series dataframe:
date_time system_load
0 2013-01-01 00:00:00.000000 599.2
1 2013-01-01 00:59:59.999999 759.2
2 2013-01-01 02:00:00.000001 954.5
3 2013-01-01 03:00:00.000000 190.9
4 2013-01-01 03:59:59.999999 465.2
... ... ...
70123 2020-12-31 18:59:59.999999 355.9
70124 2020-12-31 20:00:00.000001 752.1
70125 2020-12-31 21:00:00.000000 928.5
70126 2020-12-31 21:59:59.999999 299.2
70127 2020-12-31 23:00:00.000001 478.5
What I want is a new dataframe as below :
Year2013 Year2014 Year2015 Year2016 Year2017 Year2018 Year2019 Year2020
0 599.2 ... ... ... ... ... ... 355.9
1 759.2 ... ... ... ... ... ... 752.1
2 954.5 ... ... ... ... ... ... 928.5
3 190.9 ... ... ... ... ... ... 299.2
4 465.2 ... ... ... ... ... ... 478.5
... ... ... ... ... ... ... ... ...
8760 ... .... ... ... ... ... ... ...
8761 NaN NaN NaN ... NaN NaN NaN ...
... NaN NaN NaN ... NaN NaN NaN ...
8784 NaN NaN NaN ... NaN NaN NaN ...
with leap years taken into consideration.
Any help getting there would be appreciated.
Thanks in advance.
I'm supposing you have this dataframe:
date_time system_load
0 2013-01-01 00:00:00.000000 599.2
1 2013-01-01 00:59:59.999999 759.2
2 2013-01-01 02:00:00.000001 954.5
3 2013-01-01 03:00:00.000000 190.9
4 2013-01-01 03:59:59.999999 465.2
5 2020-12-31 18:59:59.999999 355.9
6 2020-12-31 20:00:00.000001 752.1
7 2020-12-31 21:00:00.000000 928.5
8 2020-12-31 21:59:59.999999 299.2
9 2020-12-31 23:00:00.000001 478.5
10 2020-12-31 23:00:01.000001 400.0
Then:
df["date_time"] = pd.to_datetime(df["date_time"])
df["year"] = df["date_time"].dt.year
df["index"] = df.groupby("year").transform("cumcount")
print(
df.pivot(columns="year", index="index", values="system_load").add_prefix(
"Year"
)
)
Prints:
year Year2013 Year2020
index
0 599.2 355.9
1 759.2 752.1
2 954.5 928.5
3 190.9 299.2
4 465.2 478.5
5 NaN 400.0
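On the leap-year point: assuming complete hourly data, a leap year contributes 24 more rows than a regular year, which is why the shorter columns end in NaN after the pivot. A small sketch of the arithmetic (hours_in_year is a hypothetical helper):

```python
import calendar


def hours_in_year(year):
    """Number of hourly rows a complete calendar year contributes."""
    return (366 if calendar.isleap(year) else 365) * 24


rows_per_year = {year: hours_in_year(year) for year in range(2013, 2021)}
longest = max(rows_per_year.values())

# Trailing NaN rows each year's column would get after the pivot.
padding = {year: longest - n for year, n in rows_per_year.items()}

print(rows_per_year)
print(padding)
print(sum(rows_per_year.values()))
```

The total matches the 70128 rows of the original frame (index 0 to 70127), so the pivot neither drops nor duplicates rows under this assumption.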
