Currently, when I want to retrieve the date from a polars datetime column, I have to write something similar to:
import datetime as dt
import polars as pl

df = pl.DataFrame({
    'time': [dt.datetime.now()]
})
df = df.select([
    pl.col("*"),
    pl.col("time").apply(lambda x: x.date()).alias("date")
])
Is there a different way, something closer to:
pl.col("time").dt.date().alias("date")
You can cast a Datetime column to a Date column:
import datetime
import polars as pl

df = pl.DataFrame({
    'time': [datetime.datetime.now()]
})
df.with_column(
    pl.col("time").cast(pl.Date)
)
shape: (1, 1)
┌────────────┐
│ time │
│ --- │
│ date │
╞════════════╡
│ 2022-08-02 │
└────────────┘
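Polars also has a dt.date accessor on datetime expressions (assuming a reasonably recent version), which gives exactly the syntax the question was hoping for:
import datetime
import polars as pl

df = pl.DataFrame({'time': [datetime.datetime.now()]})
# dt.date() extracts the calendar date from each datetime value.
df.select(pl.col("time").dt.date().alias("date"))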
I am trying to loop through a Polars recordset using the following code:
import polars as pl
mydf = pl.DataFrame(
    {"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
     "Name": ["John", "Joe", "James"]})
print(mydf)
shape: (3, 2)
┌────────────┬───────┐
│ start_date ┆ Name  │
│ ---        ┆ ---   │
│ str        ┆ str   │
╞════════════╪═══════╡
│ 2020-01-02 ┆ John  │
│ 2020-01-03 ┆ Joe   │
│ 2020-01-04 ┆ James │
└────────────┴───────┘
for row in mydf.rows():
    print(row)
('2020-01-02', 'John')
('2020-01-03', 'Joe')
('2020-01-04', 'James')
Is there a way to specifically reference 'Name' using the named column as opposed to the index? In Pandas this would look something like:
import pandas as pd

mydf = pd.DataFrame(
    {"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
     "Name": ["John", "Joe", "James"]})
for index, row in mydf.iterrows():
    print(mydf['Name'][index])
'John'
'Joe'
'James'
You can specify that you want the rows to be named:
for row in mydf.rows(named=True):
    print(row)
It will give you a dict:
{'start_date': '2020-01-02', 'Name': 'John'}
{'start_date': '2020-01-03', 'Name': 'Joe'}
{'start_date': '2020-01-04', 'Name': 'James'}
You can then look up row['Name'].
Note that:
- previous versions returned namedtuples instead of dicts
- it is less memory-intensive to use iter_rows
- overall, it is not recommended to iterate through the data this way
Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.
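A minimal sketch of the lighter-weight route, assuming a polars version that ships DataFrame.iter_rows:
import polars as pl

mydf = pl.DataFrame(
    {"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
     "Name": ["John", "Joe", "James"]})
# iter_rows yields rows one at a time instead of materialising them all at once.
for row in mydf.iter_rows(named=True):
    print(row["Name"])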
You would use select for that, converting the single-column result to a Series so that iteration yields the values:
names = mydf.select('Name').to_series()
for name in names:
    print(name)
Is there a way to filter data within a period of time (i.e., between a start time and an end time) using polars?
import pandas as pd
import polars as pl
dr = pd.date_range(start='2020-01-01', end='2021-01-01', freq="30min")
df = pd.DataFrame({"timestamp": dr})
pf = pl.from_pandas(df)
The closest I have got was:
pf.filter((pl.col("timestamp").dt.hour() >= 9) & (pl.col("timestamp").dt.minute() >= 30))
but that only keeps rows whose minute component is at least 30, not everything from 9:30 onwards; and if I append another filter after that:
pf.filter((pl.col("timestamp").dt.hour() >= 9) & (pl.col("timestamp").dt.minute() >= 30)).filter(pl.col("timestamp").dt.hour() < 16)
this, however, does not include the slice that falls exactly on 16:00.
The polars API does not seem to deal specifically with the time part of a timestamp (only the date part). Is there a better way to do this in polars?
Good question!
Firstly, we can create this kind of DataFrame in Polars:
from datetime import datetime, time
import polars as pl

start = datetime(2020, 1, 1)
stop = datetime(2021, 1, 1)
df = pl.DataFrame({'timestamp': pl.date_range(low=start, high=stop, interval="30m")})
To work on the time components of a datetime, we cast the timestamp column to the pl.Time dtype.
To filter on a range of times, we then pass the lower and upper boundaries of time to is_between.
In this example I've printed the original timestamp column, the timestamp column cast to pl.Time, and the filter condition.
(
    df
    .select(
        [
            pl.col("timestamp"),
            pl.col("timestamp").cast(pl.Time).alias('time_component'),
            pl.col("timestamp").cast(pl.Time).is_between(
                time(9, 30), time(16), include_bounds=True
            ),
        ]
    )
)
What you are after is:
(
    df
    .filter(
        pl.col("timestamp").cast(pl.Time).is_between(
            time(9, 30), time(16), include_bounds=True
        )
    )
)
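For comparison, the original hour/minute approach can also be rescued with explicit boolean logic at the boundaries; a sketch on the same DataFrame, though the pl.Time cast above reads more cleanly:
# At or after 09:30: the hour is past 9, or it is 9 with minute >= 30.
after_start = (pl.col("timestamp").dt.hour() > 9) | (
    (pl.col("timestamp").dt.hour() == 9) & (pl.col("timestamp").dt.minute() >= 30)
)
# At or before 16:00: the hour is before 16, or it is exactly 16:00.
before_end = (pl.col("timestamp").dt.hour() < 16) | (
    (pl.col("timestamp").dt.hour() == 16) & (pl.col("timestamp").dt.minute() == 0)
)
df.filter(after_start & before_end)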
See the API docs for the syntax on controlling behaviour at the boundaries:
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.is_between.html#polars.Expr.is_between
It is described in the polars book here: https://pola-rs.github.io/polars-book/user-guide/howcani/timeseries/selecting_dates.html#filtering-by-a-date-range
It would look something like this:
start_date = "2022-03-22 00:00:00"
end_date = "2022-03-27 00:00:00"
df = pl.DataFrame(
    {
        "dates": [
            "2022-03-22 00:00:00",
            "2022-03-23 00:00:00",
            "2022-03-24 00:00:00",
            "2022-03-25 00:00:00",
            "2022-03-26 00:00:00",
            "2022-03-27 00:00:00",
            "2022-03-28 00:00:00",
        ]
    }
)
df.with_column(
    pl.col("dates").is_between(start_date, end_date)
).filter(pl.col("is_between"))
shape: (4, 2)
┌─────────────────────┬────────────┐
│ dates ┆ is_between │
│ --- ┆ --- │
│ str ┆ bool │
╞═════════════════════╪════════════╡
│ 2022-03-23 00:00:00 ┆ true │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-24 00:00:00 ┆ true │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-25 00:00:00 ┆ true │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-26 00:00:00 ┆ true │
└─────────────────────┴────────────┘
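That comparison happens on strings, which only works because ISO-formatted timestamps sort lexicographically. A sketch of the more robust route, parsing to Datetime first and filtering on real datetime bounds:
from datetime import datetime
import polars as pl

df = pl.DataFrame({"dates": ["2022-03-22 00:00:00", "2022-03-23 00:00:00",
                             "2022-03-24 00:00:00", "2022-03-28 00:00:00"]})
# Parse the strings into proper Datetime values, then filter on datetime literals.
parsed = df.with_column(pl.col("dates").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
parsed.filter(
    pl.col("dates").is_between(datetime(2022, 3, 22), datetime(2022, 3, 27))
)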
Suppose you have
df = pl.DataFrame(
    {
        "date": ["2022-01-01", "2022-01-02"],
        "hroff": [5, 2],
        "minoff": [1, 2]
    }).with_column(pl.col('date').str.strptime(pl.Date, "%Y-%m-%d"))
and you want to make a new column that adds the hour and minute offsets to the date column. The only thing I saw was the dt.offset_by method. I made an extra column
df = df.with_column((pl.col('hroff') + "h" + pl.col('minoff') + "m").alias('offset'))
and then tried
df.with_column(
    pl.col('date')
    .cast(pl.Datetime).dt.with_time_zone('UTC')
    .dt.offset_by(pl.col('offset')).alias('newdate')
)
but that doesn't work because dt.offset_by only takes a fixed string, not another column.
What's the best way to do that?
Use pl.duration:
import polars as pl

df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Datetime(time_zone="UTC"), "%Y-%m-%d"),
    "hroff": [5, 2],
    "minoff": [1, 2]
})
print(df.select(
    pl.col("date") + pl.duration(hours=pl.col("hroff"), minutes=pl.col("minoff"))
))
shape: (2, 1)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2022-01-01 05:01:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-02 02:02:00 │
└─────────────────────┘
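If you want to keep the original columns and name the result newdate, as the question intended, the same expression works inside with_columns (or with_column on older polars versions); a minimal sketch:
# Append the shifted timestamps as a new column instead of selecting them out.
df.with_columns(
    (pl.col("date") + pl.duration(hours=pl.col("hroff"), minutes=pl.col("minoff")))
    .alias("newdate")
)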
How does one convert a column of i64 epoch seconds into dates in polars?
I've got a column of i64 representing seconds since epoch and I'd like to parse them into polars native datetimes.
Polars' Datetime is represented as a unix epoch in nanoseconds, microseconds, or milliseconds. With that knowledge, we can convert the seconds to milliseconds and cast to Datetime.
Finally, we ensure polars interprets the value with the proper time unit.
import polars as pl

df = pl.DataFrame({
    "epoch_seconds": [1648457740, 1648457740 + 10]
})
MILLISECONDS_IN_SECOND = 1000

df.select(
    (pl.col("epoch_seconds") * MILLISECONDS_IN_SECOND)
    .cast(pl.Datetime)
    .dt.with_time_unit("ms")
    .alias("datetime")
)
shape: (2, 1)
┌─────────────────────┐
│ datetime │
│ --- │
│ datetime[ms] │
╞═════════════════════╡
│ 2022-03-28 08:55:40 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-28 08:55:50 │
└─────────────────────┘
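Newer polars versions also provide a pl.from_epoch helper that handles the unit conversion directly (assuming your version includes it); a minimal sketch:
import polars as pl

df = pl.DataFrame({"epoch_seconds": [1648457740, 1648457740 + 10]})
# from_epoch interprets the integers as seconds since the unix epoch.
df.select(pl.from_epoch("epoch_seconds", time_unit="s").alias("datetime"))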