I've recently been using Polars for a project I'm starting to develop. I've come across several problems, but with the info here and in the docs I have solved those issues.
My issue:
When I save the dataframe it stores datetime data like this:
1900-01-01T18:00:00.000000000
In my console, however, it shows like this (I have checked that the type is object):
1900-01-01 18:00:00
Code:
'''
My column is a string like this: 1234, which means 12:34, so I do the following transformation:
'''
df = df2.with_columns([
    pl.col('initial_band_time').apply(lambda x: datetime.datetime.strptime(x, '%H:%M')),
    pl.col('final_band_time').apply(lambda x: datetime.datetime.strptime(x, '%H:%M')),
])
df = df.drop('version').rename({'day_type': 'day'})
print(df)
print(df.dtypes)
#output: <class 'polars.datatypes.Datetime'>, <class 'polars.datatypes.Datetime'>
'''
I save it with write_csv
'''
df.write_csv('data/trp_occupation_level_emt_cleaned.csv', sep=",")
dfnew = pl.read_csv('data/trp_occupation_level_emt_cleaned.csv')
# print new df
print(dfnew.head())
print(dfnew.dtypes)
# output: <class 'polars.datatypes.Utf8'>, <class 'polars.datatypes.Utf8'>
I know I can read the CSV with parse_dates=True, but I consume this dataframe in a database, so I need to export it with the dates parsed.
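As a side note on where the 1900-01-01 date part comes from: Python's strptime fills in 1900-01-01 when the format contains no date fields. The sketch below uses only the standard library, and it assumes the raw value really is "1234" with no colon, in which case the format would be '%H%M' rather than '%H:%M'. That is an assumption about the data, not something the post confirms.

```python
from datetime import datetime

# Hypothetical raw value, assuming the column holds "HHMM" without a colon.
raw = "1234"

# No date fields in the format, so strptime defaults the date to 1900-01-01.
parsed = datetime.strptime(raw, "%H%M")
print(parsed.isoformat())
```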
Polars does not default to parsing string data as dates automatically.
But you can easily turn it on by setting the parse_dates keyword argument.
pl.read_csv("myfile.csv", parse_dates=True)
It sounds like you want to specify the formatting of Date and Datetime fields in an output csv file - to conform with the formatting requirements of an external application (e.g., database loader).
We can do that easily using the strftime format function. Basically, we will convert the Date/Datetime fields to strings, formatted as we need them, just before we write the csv file. This way, the csv output writer will not alter them.
For example, let's start with this data:
from io import StringIO
import polars as pl
my_csv = """sample_id,initial_band_time,final_band_time
1,2022-01-01T18:00:00,2022-01-01T18:35:00
2,2022-01-02T19:35:00,2022-01-02T20:05:00
"""
df = pl.read_csv(StringIO(my_csv), parse_dates=True)
print(df)
shape: (2, 3)
┌───────────┬─────────────────────┬─────────────────────┐
│ sample_id ┆ initial_band_time   ┆ final_band_time     │
│ ---       ┆ ---                 ┆ ---                 │
│ i64       ┆ datetime[μs]        ┆ datetime[μs]        │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 1         ┆ 2022-01-01 18:00:00 ┆ 2022-01-01 18:35:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ 2022-01-02 19:35:00 ┆ 2022-01-02 20:05:00 │
└───────────┴─────────────────────┴─────────────────────┘
Now, we'll apply the strftime function and the following format specifier %F %T.
df = df.with_column(pl.col(pl.Datetime).dt.strftime(fmt="%F %T"))
print(df)
shape: (2, 3)
┌───────────┬─────────────────────┬─────────────────────┐
│ sample_id ┆ initial_band_time   ┆ final_band_time     │
│ ---       ┆ ---                 ┆ ---                 │
│ i64       ┆ str                 ┆ str                 │
╞═══════════╪═════════════════════╪═════════════════════╡
│ 1         ┆ 2022-01-01 18:00:00 ┆ 2022-01-01 18:35:00 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ 2022-01-02 19:35:00 ┆ 2022-01-02 20:05:00 │
└───────────┴─────────────────────┴─────────────────────┘
Notice that our Datetime fields have been converted to strings (the 'str' in the column header).
And here's a pro tip: notice that I'm using a datatype wildcard expression in the col expression: pl.col(pl.Datetime). This way, you don't need to specify each Datetime field; Polars will automatically convert them all.
Now, when we write the csv file, we get the following output.
df.write_csv('/tmp/tmp.csv')
Output csv:
sample_id,initial_band_time,final_band_time
1,2022-01-01 18:00:00,2022-01-01 18:35:00
2,2022-01-02 19:35:00,2022-01-02 20:05:00
You may need to play around with the format specifier until you find one that your external application will accept. Here's a handy reference for format specifiers.
Here's another trick: you can do this step just before writing the csv file:
df.with_column(pl.col(pl.Datetime).dt.strftime(fmt="%F %T")).write_csv('/tmp/tmp.csv')
This way, your original dataset is not changed ... only the copy that you intend to write to a csv file.
BTW, I use this trick all the time when writing csv files that I intend to use in spreadsheets. I often just want the "%F" (date) part of the datetime, not the "%T" part (time). It just makes parsing easier in the spreadsheet.
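In case the shorthand is unfamiliar: %F stands for %Y-%m-%d and %T stands for %H:%M:%S. Here is a minimal plain-Python sketch (standard library only, not the Polars API) that spells the format out in full. The timestamp is an illustrative value mirroring the first sample row.

```python
from datetime import datetime

# Illustrative value only; it mirrors the first sample row.
ts = datetime(2022, 1, 1, 18, 0, 0)

# Spelled-out equivalent of "%F %T".
formatted = ts.strftime("%Y-%m-%d %H:%M:%S")

# Spelled-out equivalent of "%F" alone (date part only).
date_only = ts.strftime("%Y-%m-%d")

print(formatted)
print(date_only)
```

The spelled-out form is handy when trying the format in a tool that does not accept the shorthand.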
Related
I have a time series dataset that needs to be interpolated such that any gaps of more than 3 minutes are left as null values.
The problem I'm facing is that Polars upsample leads to a lot of nulls even when there is data close to the time period. Here's a snippet of the dataframe.
utc gnd_p gnd_t app_sza azimuth xh2o xair xco2 xch4 xco xch4_s5p
0 2022-06-04 04:49:31 955.081699 293.84 77.009159 -109.292040 4118.807354 0.996515 421.510185 1.878339 0.0 0.0
1 2022-06-04 04:49:46 955.081655 293.84 76.971435 -109.250593 4119.081639 0.996508 421.543444 1.878761 0.0 0.0
Here's the pandas code for the same operation:
output = sensor_dataframe.sort_values(by=['utc'])  # sort according to time
output['utc'] = pd.to_datetime(output['utc'])
# Apply smoothing function for all data columns.
for column in output.columns[1::]:
    output[column] = scipy.signal.savgol_filter(pd.to_numeric(output[column]), 31, 3)
print(output)
output = output.set_index('utc')
output.index = pd.to_datetime(output.index)
output = output.resample(sampling_rate).mean()
sampling_delta = pd.to_timedelta(sampling_rate)
# The interpolating limit is dependent on the sampling rate.
interpolating_limit = int(MAX_DELTA_FOR_INTERPOLATION / sampling_delta)
if interpolating_limit != 0:
    output.interpolate(
        limit=interpolating_limit,
        inplace=True,
        limit_direction='both',
        limit_area='inside',
    )
Here's the output at a 10-second sampling rate.
gnd_p gnd_t app_sza azimuth xh2o xair xco2 xch4 xco xch4_s5p
utc
2022-06-04 04:49:30 955.081699 293.84 77.009159 -109.292040 4118.807354 0.996515 421.510185 1.878339 0.0 0.0
2022-06-04 04:49:40 955.081655 293.84 76.971435 -109.250593 4119.081639 0.996508 421.543444 1.878761 0.0 0.0
Here's the same attempt at a Polars version.
df = pl.from_pandas(sensor_dataframe)
q = df.lazy().with_column(
    pl.col('utc').str.strptime(pl.Datetime, fmt='%F %T').cast(pl.Datetime)
).select([
    pl.col('utc'),
    pl.exclude('utc').map(lambda x: savgol_filter(x.to_numpy(), 31, 3)).explode(),
])
df = q.collect()
df = df.upsample(time_column="utc", every="10s")
Here's the output of the above snippet:
│ 2022-06-04 04:49:31 ┆ 955.081699 ┆ 293.84 ┆ 77.009159 ┆ ... ┆ 421.510185 ┆ 1.878339 ┆ 0.0 ┆ 0.0 │
│ 2022-06-04 04:49:41 ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
│ 2022-06-04 04:49:51 ┆ null ┆ null ┆ null ┆ ... ┆ null ┆ null ┆ null ┆ null │
Polars just spits out a df with a lot of nulls. I would have to interpolate to fill the values, but that would mean interpolating the entire dataset. Unfortunately, interpolate() in Polars takes no arguments or parameters, so every series gets interpolated in full, which is not what I want.
I think the solution should have something to do with masks. Does anyone have experience working with Polars and interpolation?
Reproducible code: https://pastebin.com/gQ1WU4zp
sample csv data: https://0bin.net/paste/3fX2AOM2#uQmEv2KvBK5Xk-2vuWxx2z0QgXlttdnaa78eFt8ra62
I'm not going to download your whole dataset so let's use this as an example:
from datetime import datetime
import numpy as np
import polars as pl

np.random.seed(0)
df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2023, 2, 1),
            high=datetime(2023, 2, 2),
            interval="1m"),
        'data': list(np.random.choice([None, 1, 2, 3, 4], size=1441))
    }).filter(~pl.col('data').is_null())
upsample, by definition, doesn't interpolate. As you've discovered, it just inserts a bunch of nulls to match the periods you want.
If you only want to interpolate when the gap before upsampling is under 3 minutes, make a helper column before the upsample.
Then use a when/then/otherwise expression on the helper column to decide whether or not to interpolate.
df \
    .with_columns(
        (pl.col('time')-pl.col('time').shift()<pl.duration(minutes=3)).alias('small_gap')) \
    .upsample(time_column="time", every="10s") \
    .with_columns(pl.col('small_gap').backward_fill()) \
    .with_columns(
        pl.when(pl.col('small_gap')) \
        .then(pl.exclude(['small_gap']).interpolate()) \
        .otherwise(pl.exclude(['small_gap']))) \
    .select(pl.exclude('small_gap'))
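To make the gap rule concrete, here is a rough plain-Python sketch of gap-limited linear interpolation (standard library only, not the Polars API). The function name upsample_with_gap_limit and the sample values are made up for illustration. Like the expression above, it treats a gap as small only when it is strictly under the limit.

```python
from datetime import datetime, timedelta


def upsample_with_gap_limit(samples, every, max_gap):
    """Resample sorted (timestamp, value) pairs onto a regular grid.

    Grid points that fall between two original samples are linearly
    interpolated only when those samples are closer together than max_gap;
    otherwise they are left as None.
    """
    out = []
    t, end = samples[0][0], samples[-1][0]
    i = 0  # index of the last original sample at or before t
    while t <= end:
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        t0, v0 = samples[i]
        if t == t0:
            out.append((t, v0))  # grid point coincides with a real sample
        else:
            t1, v1 = samples[i + 1]
            if t1 - t0 < max_gap:
                frac = (t - t0) / (t1 - t0)
                out.append((t, v0 + frac * (v1 - v0)))
            else:
                out.append((t, None))  # gap too large: leave as null
        t += every
    return out


# Hypothetical data: a 30-second gap followed by a 4.5-minute gap.
base = datetime(2023, 2, 1)
samples = [
    (base, 0.0),
    (base + timedelta(seconds=30), 3.0),
    (base + timedelta(minutes=5), 10.0),
]
result = upsample_with_gap_limit(samples, timedelta(seconds=10), timedelta(minutes=3))
```

Here the grid points inside the 30-second gap are filled in, while those inside the 4.5-minute gap stay None.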
Is there a way to filter data in a period of time (i.e., start time and end time) using polars?
import pandas as pd
import polars as pl
dr = pd.date_range(start='2020-01-01', end='2021-01-01', freq="30min")
df = pd.DataFrame({"timestamp": dr})
pf = pl.from_pandas(df)
The best try I've got was:
pf.filter((pl.col("timestamp").dt.hour()>=9) & (pl.col("timestamp").dt.minute()>=30))
It only gave me everything after 9:30; and if I append another filter after that:
pf.filter((pl.col("timestamp").dt.hour()>=9) & (pl.col("timestamp").dt.minute()>=30)).filter((pl.col("timestamp").dt.hour()<16))
This however does not give me the slice that falls right on 16:00.
The Polars API does not seem to deal specifically with the time part of a time series (only the date part). Is there a better workaround here using Polars?
Good question!
Firstly, we can create this kind of DataFrame in Polars:
from datetime import datetime, time
import polars as pl
start = datetime(2020,1,1)
stop = datetime(2021,1,1)
df = pl.DataFrame({'timestamp':pl.date_range(low=start, high=stop, interval="30m")})
To work on the time components of a datetime, we cast the timestamp column to the pl.Time dtype.
To filter on a range of times, we then pass the lower and upper time boundaries to is_between.
In this example I've printed the original timestamp column, the timestamp column cast to pl.Time, and the filter condition.
(
    df
    .select(
        [
            pl.col("timestamp"),
            pl.col("timestamp").cast(pl.Time).alias('time_component'),
            (pl.col("timestamp").cast(pl.Time).is_between(
                time(9,30),time(16),include_bounds=True
                )
            )
        ]
    )
)
What you are after is:
(
    df
    .filter(
        pl.col("timestamp").cast(pl.Time).is_between(
            time(9,30),time(16),include_bounds=True
        )
    )
)
See the API docs for the syntax on controlling behaviour at the boundaries:
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.Expr.is_between.html#polars.Expr.is_between
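To see why comparing the whole time of day matters, here is a small plain-Python sketch (standard library only, not the Polars API). It uses a single day of 30-minute timestamps instead of the full year, purely to keep the illustration short.

```python
from datetime import datetime, time, timedelta

start = datetime(2020, 1, 1)
# One day of 30-minute timestamps (illustrative; the question uses a full year).
stamps = [start + timedelta(minutes=30 * i) for i in range(48)]

# Comparing hour and minute separately drops rows such as 10:00,
# because minute 0 fails the ">= 30" test, and hour < 16 drops 16:00.
separate = [
    ts for ts in stamps
    if ts.hour >= 9 and ts.minute >= 30 and ts.hour < 16
]

# Comparing the whole time of day keeps every slot from 09:30 to 16:00 inclusive.
whole = [ts for ts in stamps if time(9, 30) <= ts.time() <= time(16)]

print(len(separate), len(whole))
```

The hour-and-minute version keeps only the half-past slots, which is why casting to a time value and comparing that as a unit is the more robust approach.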
It is described in the polars book here: https://pola-rs.github.io/polars-book/user-guide/howcani/timeseries/selecting_dates.html#filtering-by-a-date-range
It would look something like this:
start_date = "2022-03-22 00:00:00"
end_date = "2022-03-27 00:00:00"
df = pl.DataFrame(
    {
        "dates": [
            "2022-03-22 00:00:00",
            "2022-03-23 00:00:00",
            "2022-03-24 00:00:00",
            "2022-03-25 00:00:00",
            "2022-03-26 00:00:00",
            "2022-03-27 00:00:00",
            "2022-03-28 00:00:00",
        ]
    }
)
df.with_column(pl.col("dates").is_between(start_date,end_date)).filter(pl.col("is_between") == True)
shape: (4, 2)
┌─────────────────────┬────────────┐
│ dates               ┆ is_between │
│ ---                 ┆ ---        │
│ str                 ┆ bool       │
╞═════════════════════╪════════════╡
│ 2022-03-23 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-24 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-25 00:00:00 ┆ true       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-26 00:00:00 ┆ true       │
└─────────────────────┴────────────┘
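Note that this example compares strings, not datetimes. That works here because timestamps in the fixed-width YYYY-MM-DD HH:MM:SS layout sort lexicographically in the same order as chronologically. The output shown excludes both endpoints. A plain-Python sketch (standard library only, not the Polars API) of both the exclusive and inclusive variants:

```python
from datetime import datetime

dates = [f"2022-03-{day} 00:00:00" for day in range(22, 29)]
start_date = "2022-03-22 00:00:00"
end_date = "2022-03-27 00:00:00"

# String order matches chronological order for this fixed-width layout.
by_string = sorted(dates)
by_time = sorted(dates, key=lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))

# Endpoints excluded, matching the four rows in the output shown.
exclusive = [d for d in dates if start_date < d < end_date]

# Endpoints included.
inclusive = [d for d in dates if start_date <= d <= end_date]

print(len(exclusive), len(inclusive))
```

If the strings were not zero-padded or used a different field order, the string comparison would no longer be safe, and parsing to a real datetime first would be the better option.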
Suppose you have
df=pl.DataFrame(
    {
        "date":["2022-01-01", "2022-01-02"],
        "hroff":[5,2],
        "minoff":[1,2]
    }).with_column(pl.col('date').str.strptime(pl.Date,"%Y-%m-%d"))
and you want to make a new column that adds the hour and min offsets to the date column. The only thing I saw was the dt.offset_by method. I made an extra column
df=df.with_column((pl.col('hroff')+"h"+pl.col('minoff')+"m").alias('offset'))
and then tried
df.with_column(pl.col('date') \
    .cast(pl.Datetime).dt.with_time_zone('UTC') \
    .dt.offset_by(pl.col('offset')).alias('newdate'))
but that doesn't work because dt.offset_by only takes a fixed string, not another column.
What's the best way to do that?
Use pl.duration:
import polars as pl
df = pl.DataFrame({
    "date": pl.Series(["2022-01-01", "2022-01-02"]).str.strptime(pl.Datetime(time_zone="UTC"), "%Y-%m-%d"),
    "hroff": [5, 2],
    "minoff": [1, 2]
})
print(df.select(
    pl.col("date") + pl.duration(hours=pl.col("hroff"), minutes=pl.col("minoff"))
))
shape: (2, 1)
┌─────────────────────┐
│ date                │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2022-01-01 05:01:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-01-02 02:02:00 │
└─────────────────────┘
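The same row-wise idea, sketched in plain Python with the standard library (not the Polars API): each row builds its own duration from its hour and minute offsets and adds it to the date. The list-of-dicts layout is only an illustration of the three columns.

```python
from datetime import datetime, timedelta, timezone

# Illustrative rows mirroring the date, hroff and minoff columns.
rows = [
    {"date": datetime(2022, 1, 1, tzinfo=timezone.utc), "hroff": 5, "minoff": 1},
    {"date": datetime(2022, 1, 2, tzinfo=timezone.utc), "hroff": 2, "minoff": 2},
]

# One duration per row, built from that row's own offsets.
new_dates = [
    r["date"] + timedelta(hours=r["hroff"], minutes=r["minoff"])
    for r in rows
]

for d in new_dates:
    print(d)
```

Building a duration value per row avoids assembling offset strings such as "5h1m" and parsing them back.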
How does one convert a column of i64 epoch strings into dates in polars?
I've got a column of i64 representing seconds since epoch and I'd like to parse them into polars native datetimes.
Polars' Datetime is represented as an offset from the Unix epoch in either nanoseconds, microseconds, or milliseconds. With that knowledge, we can convert the seconds to milliseconds and cast to Datetime.
Finally, we ensure Polars uses the proper unit.
df = pl.DataFrame({
    "epoch_seconds": [1648457740, 1648457740 + 10]
})
MILLISECONDS_IN_SECOND = 1000
df.select(
    (pl.col("epoch_seconds") * MILLISECONDS_IN_SECOND).cast(pl.Datetime).dt.with_time_unit("ms").alias("datetime")
)
shape: (2, 1)
┌─────────────────────┐
│ datetime            │
│ ---                 │
│ datetime[ms]        │
╞═════════════════════╡
│ 2022-03-28 08:55:40 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-03-28 08:55:50 │
└─────────────────────┘
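As a sanity check on the arithmetic, the same conversion can be sketched with the Python standard library alone (not the Polars API). The timestamps are interpreted as UTC here, which is an assumption that matches the output shown.

```python
from datetime import datetime, timezone

epoch_seconds = [1648457740, 1648457740 + 10]
MILLISECONDS_IN_SECOND = 1000

# Scale seconds up to the millisecond unit, as in the Polars expression.
epoch_ms = [s * MILLISECONDS_IN_SECOND for s in epoch_seconds]

# Convert back to datetimes, interpreting the values as UTC (assumption).
as_datetimes = [
    datetime.fromtimestamp(ms / MILLISECONDS_IN_SECOND, tz=timezone.utc)
    for ms in epoch_ms
]

for d in as_datetimes:
    print(d)
```

Mixing up the unit is the usual failure mode: treating second values as milliseconds or microseconds yields dates within days of 1970-01-01.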