Python: Finding the input of pandas DatetimeIndex.asof()

I am trying to use pandas.DatetimeIndex.asof() to find the closest value to a certain date. However, what exactly is the input to this function?
The documentation states that the input is a label, but in what format?
To be more specific, I have a DataFrame that looks like this, where the datetime column is set as an index. I want the code to return the index of the row whose datetime is closest to 2018-07-28 13:00:00.
datetime              price
2018-07-28 12:57:13   8.50
2018-07-28 12:59:45   8.60
2018-07-28 13:01:19   8.70
2018-07-28 13:03:27   8.65

Agreed, the use of the word "label" in the documentation is unclear. The label should be in the same format as your datetime values. For example:
# If datetime column is already in datetime format:
df.set_index(df.datetime).asof('2018-07-28 13:00:00')
# If datetime is not already in proper datetime format
df.set_index(pd.to_datetime(df.datetime)).asof('2018-07-28 13:00:00')
This returns a Series for the most recent row at or before the given datetime:
datetime 2018-07-28 12:59:45
price 8.6
Name: 2018-07-28 13:00:00, dtype: object
Alternative solution (better IMO)
I think a better way to do this, though, is simply to subtract your target datetime from the datetime column, find the minimum absolute difference, and extract that row using loc. This way you get the true closest value, including rows that come after the target (asof is limited to the most recent label up to and including the passed label, as noted in the docs you linked).
>>> df.loc[abs(df.datetime - pd.to_datetime('2018-07-28 13:00:00')).idxmin()]
datetime 2018-07-28 12:59:45
price 8.6
Name: 1, dtype: object
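As a third option, here is a minimal sketch (assuming the datetime column is set as the index and a reasonably recent pandas) using Index.get_indexer with method='nearest' to locate the positionally nearest row:
import pandas as pd

df = pd.DataFrame(
    {'price': [8.50, 8.60, 8.70, 8.65]},
    index=pd.to_datetime(['2018-07-28 12:57:13', '2018-07-28 12:59:45',
                          '2018-07-28 13:01:19', '2018-07-28 13:03:27']),
)

# Position of the index entry nearest to the target timestamp
# (the index must be sorted for method='nearest' to work)
pos = df.index.get_indexer([pd.Timestamp('2018-07-28 13:00:00')],
                           method='nearest')[0]
print(df.index[pos])  # 2018-07-28 12:59:45
print(df.iloc[pos])   # price    8.6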

Related

pandas dataframe datetime - convert string to datetime offset

I have a dataframe like:
This time             Time difference
2000-01-01 00:00:00   -3:00
2000-03-01 05:00:00   -5:00
...                   ...
2000-01-24 16:10:00   -7:00
I'd like to convert the 2nd column (-3:00 means minus 3 hours) from a string into something like a time offset that I can use directly to operate on the 1st column (which is already datetime64[ns]).
I thought there was supposed to be something in pandas that does this, but I couldn't find anything straightforward. Does anyone have any clue?
You can use pd.to_timedelta:
df['Time difference'] = pd.to_timedelta(df['Time difference']+':00')
Note: I appended ':00' because the default format for string conversion in pd.to_timedelta is "hh:mm:ss".
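As a sketch of the full round trip (column names taken from the question; the Adjusted result column is made up here):
import pandas as pd

df = pd.DataFrame({
    'This time': pd.to_datetime(['2000-01-01 00:00:00', '2000-03-01 05:00:00']),
    'Time difference': ['-3:00', '-5:00'],
})

# Pad to "hh:mm:ss" so pd.to_timedelta can parse the strings
df['Time difference'] = pd.to_timedelta(df['Time difference'] + ':00')

# Timedeltas add directly to datetime64 columns
df['Adjusted'] = df['This time'] + df['Time difference']
print(df)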

Convert "Q12019" object to datetime64

I have a pandas dataframe, where one column contains a string for the quarter and year in the following format: Q12019
My Question: How do I convert this into datetime format?
You can use Pandas PeriodIndex to accomplish this. Just reformat your quarters column to the expected format %Y-%q (with some help from regex, move the year to the front):
reformatted_quarters = df['QuarterYear'].str.replace(r'(Q\d)(\d+)', r'\2\1', regex=True)
print(reformatted_quarters)
This prints:
0 2019Q1
1 2018Q2
2 2019Q4
Name: QuarterYear, dtype: object
Then, feed this result to PeriodIndex to get the datetime format. Use 'Q' to specify a quarterly frequency:
datetimes = pd.PeriodIndex(reformatted_quarters, freq='Q').to_timestamp()
print(datetimes)
This prints:
DatetimeIndex(['2019-01-01', '2018-04-01', '2019-10-01'], dtype='datetime64[ns]', name='Quarter', freq=None)
Note: Pandas PeriodIndex functionality experienced a regression in behavior (documented here), so for Pandas versions greater than 0.23.4, you'll need to use reformatted_quarters.values instead:
datetimes = pd.PeriodIndex(reformatted_quarters.values, freq='Q').to_timestamp()
Alternatively, in JavaScript:
(quarter) => new Date(quarter.slice(-4), 3 * (quarter.slice(1, 2) - 1), 1)
This will give you the start of every quarter (e.g. Q42019 gives 2019-10-01).
You should probably include some validation, since the Date constructor will just keep rolling months over (e.g. Q52019 = Q12020 = 2020-01-01).
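For comparison, a pure-Python sketch of the same idea (the quarter_to_date helper is hypothetical, assuming the Q<quarter><year> layout from the question):
from datetime import datetime

def quarter_to_date(s: str) -> datetime:
    # "Q12019" -> quarter 1 of 2019 -> first day of that quarter
    quarter = int(s[1])
    year = int(s[2:])
    return datetime(year, 3 * (quarter - 1) + 1, 1)

print(quarter_to_date('Q12019'))  # 2019-01-01 00:00:00
print(quarter_to_date('Q42019'))  # 2019-10-01 00:00:00
Unlike the JavaScript Date constructor, datetime raises a ValueError for an out-of-range month, so an invalid quarter like Q52019 fails loudly instead of silently rolling over.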

Converting a series of DateTime values to the proper format

Currently, I have a series of datetime values that display like so:
0 Datetime
1 20041001
2 20041002
3 20041003
4 20041004
they are within a series named
d['Datetime']
They were originally something like
20041001ABCDEF
but I split the end off to leave just the remaining numbers. How do I go about putting them into the following format?
2004-10-01
You can do the following:
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y%m%d')
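If the goal is the literal string form 2004-10-01 rather than a datetime64 column, here is a small follow-up sketch (the Datetime_str column name is made up):
import pandas as pd

d = pd.DataFrame({'Datetime': ['20041001', '20041002', '20041003', '20041004']})
d['Datetime'] = pd.to_datetime(d['Datetime'], format='%Y%m%d')

# datetime64 values already display as 2004-10-01; convert back to
# strings only if plain text is genuinely required
d['Datetime_str'] = d['Datetime'].dt.strftime('%Y-%m-%d')
print(d)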

Find the earliest and latest date in a list of dates' string representation

How can I convert this to a Python Date so I can find the latest date in a list?
["2018-06-12","2018-06-13", ...] to date
Then:
max(date_objects)
min(date_objects)
Since you want to convert from a list, you'll need to combine the approach from my linked duplicate (datetime.strptime) with a list comprehension:
from datetime import datetime
list_of_string_dates = ["2018-06-12","2018-06-13","2018-06-14","2018-06-15"]
list_of_dates= [datetime.strptime(date,"%Y-%m-%d") for date in list_of_string_dates]
print(max(list_of_dates)) # latest
print(min(list_of_dates)) # earliest
2018-06-15 00:00:00
2018-06-12 00:00:00
Basically, you're converting the string representation of your dates to Python datetime objects using datetime.strptime and then applying max and min, which are implemented for this type of object.
If you only need the date part, convert and then call .date():
import datetime
timestamp = datetime.datetime.strptime("2018-06-12", "%Y-%m-%d")
date_only = timestamp.date()
You can use the datetime module. In particular, since your date is in the standard ISO 8601 format, you can use date.fromisoformat() (Python 3.7+) to get a date object directly from your string; otherwise, you can use the strptime function to read a more complicated string. That function has a few intricacies that you can figure out by looking at the documentation.
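A minimal sketch of that approach, assuming Python 3.7+ where date.fromisoformat handles this layout directly:
from datetime import date

list_of_string_dates = ["2018-06-12", "2018-06-13", "2018-06-14", "2018-06-15"]
date_objects = [date.fromisoformat(s) for s in list_of_string_dates]

print(max(date_objects))  # 2018-06-15 (latest)
print(min(date_objects))  # 2018-06-12 (earliest)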

Pandas groupby datetime function does not preserve dtype

I'm having trouble extracting the .minute property of a Pandas datetime object, in the context of a groupby aggregation.
This post appears to touch on the same root issue, but the accepted answer just explained why the problem was happening (which is fair, as the OP only asked to understand the problem). I'm posting now because I'm hoping to find a solution that doesn't rely on explicitly changing the type of the data I'm aggregating.
Here's some example code:
import pandas as pd
ids = ['a','a','b','b']
dates = ['2017-01-01 01:01:00', '2017-01-01 01:02:00',
         '2017-03-03 01:03:00', '2017-03-03 01:04:00']
dates = pd.to_datetime(pd.Series(dates))
df = pd.DataFrame({'id':ids, 'datetime':dates})
id datetime
0 a 2017-01-01 01:01:00
1 a 2017-01-01 01:02:00
2 b 2017-03-03 01:03:00
3 b 2017-03-03 01:04:00
My goal is to group by id, and then extract the minute, as an integer value, of the earliest timestamp in each datetime group.
For example, to do this across all datetime values, this works:
df.datetime.min().minute # returns 1
I want to mimic that same functionality in a groupby() setting.
When I combine min() and .minute in a UDF, however, the returned minute is coerced back to a datetime, showing up as nanoseconds tacked onto the start of the Unix epoch:
def get_earliest_minute(tstamps):
    return tstamps.min().minute
df.groupby('id').agg({'datetime':get_earliest_minute})
datetime
id
a 1970-01-01 00:00:00.000000001
b 1970-01-01 00:00:00.000000003
The type returned from get_earliest_minute() is an integer:
def get_earliest_minute(tstamps):
    return type(tstamps.min().minute)
df.groupby('id').agg({'datetime':get_earliest_minute})
datetime
id
a <type 'int'>
b <type 'int'>
But the type of datetime, post-aggregation, is <M8[ns]:
df.groupby('id').agg({'datetime':get_earliest_minute}).datetime.dtype # dtype('<M8[ns]')
This answer to the post linked above states that this happens because of a purposeful type coercion, which tries to maintain the type of the original Series object that underwent aggregation. I've looked around a bit but couldn't find any solutions, other than one comment that suggested changing the type of the field to object before performing groupby(), e.g.,
df.datetime = df.datetime.astype(object)
df.groupby('id').agg({'datetime':get_earliest_minute})
and another comment which proposed to convert the output of the function to float before returning, e.g.,
def get_earliest_minute(tstamps):
    return float(tstamps.min().minute)
These workarounds do the job (although for some reason declaring int() does not escape type coercion like float() does), but is there a way to do these groupby manipulations on datetime objects without inserting explicit type conversions (i.e., either generalizing <M8[ns] -> object or converting int -> float)? In particular, in a case where multiple agg() functions are applied to datetime, with some relying on datetime attributes and some not, a pre-groupby conversion wouldn't succeed.
Also, is there a reason why float() type conversion overrides the built-in type coercion, but int() does not?
Thanks in advance!
I'm going to stick with @Jeff on this one. agg is doing what we all want. It is attempting to preserve the dtype because it is intended to aggregate the values of a particular dtype, and when I aggregate data of a particular dtype I expect that same dtype back...
...That said, you can very easily work around this with apply
your problem
def get_earliest_minute(tstamps):
    return tstamps.min().minute
df.groupby('id').agg({'datetime':get_earliest_minute})
datetime
id
a 1970-01-01 00:00:00.000000001
b 1970-01-01 00:00:00.000000003
workaround
def get_earliest_minute(tstamps):
    return tstamps.min().minute
df.groupby('id').datetime.apply(get_earliest_minute)
id
a 1
b 3
Name: datetime, dtype: int64
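For the mixed case the question raises (some aggregations returning datetimes, others integers), here is a sketch that sidesteps agg's dtype coercion by having apply build a Series per group (the summarize helper and its column names are made up):
import pandas as pd

ids = ['a', 'a', 'b', 'b']
dates = pd.to_datetime(pd.Series(['2017-01-01 01:01:00', '2017-01-01 01:02:00',
                                  '2017-03-03 01:03:00', '2017-03-03 01:04:00']))
df = pd.DataFrame({'id': ids, 'datetime': dates})

def summarize(group):
    earliest = group['datetime'].min()
    # Each result column keeps its own dtype: Timestamp and int coexist
    return pd.Series({'earliest': earliest, 'minute': earliest.minute})

print(df.groupby('id').apply(summarize))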
