I have a DataFrame with a timestamp column of dtype datetime64
df:
time timestamp
18053.401736 2019-06-06 09:38:30+00:00
18053.418252 2019-06-06 10:02:17+00:00
18053.424514 2019-06-06 10:11:18+00:00
18053.454132 2019-06-06 10:53:57+00:00
Name: timestamp, dtype: datetime64[ns, UTC]
and a Series of dtype timedelta64
ss:
ref_time
0 0 days 09:00:00
1 0 days 09:00:01
2 0 days 09:00:02
3 0 days 09:00:03
4 0 days 09:00:04
...
21596 0 days 14:59:56
21597 0 days 14:59:57
21598 0 days 14:59:58
21599 0 days 14:59:59
21600 0 days 15:00:00
Name: timeonly, Length: 21601, dtype: timedelta64[ns]
I want to merge the two so that the output DataFrame has values only where the timestamp coincides with one of the Series values:
Desired output:
time timestamp ref_time
NaN NaN 09:00:00
... ... ...
NaN NaN 09:38:29
18053.401736 2019-06-06 09:38:30+00:00 09:38:30
NaN NaN 09:38:31
... ... ...
18053.418252 2019-06-06 10:02:17+00:00 10:02:17
NaN NaN 10:02:18
NaN NaN 10:02:19
... ... ...
18053.424514 2019-06-06 10:11:18+00:00 10:11:18
... ... ...
18053.454132 2019-06-06 10:53:57+00:00 10:53:57
However, if I convert 'timestamp' to a time-only value I get an object dtype and I can't merge it with ss.
df['timestamp'].dtype # --> datetime64[ns, UTC]
df['timeonly'] = df['timestamp'].dt.time
df['timeonly'].dtype # --> object
df.merge(ss, how='outer', on=['timeonly'])
# ValueError: You are trying to merge on object and timedelta64[ns] columns. If you wish to proceed you should use pd.concat
but using concat as suggested doesn't give me the desired output.
How can I merge/join the DataFrame and the Series?
Pandas version 1.1.5
Convert the timestamp to a timedelta by subtracting the date part, then merge:
df1 = pd.DataFrame([pd.Timestamp('2019-06-06 09:38:30+00:00'),pd.Timestamp('2019-06-06 10:02:17+00:00')], columns=['timestamp'])
df2 = pd.DataFrame([pd.Timedelta('09:38:30')], columns=['ref_time'])
df1:
                  timestamp
0 2019-06-06 09:38:30+00:00
1 2019-06-06 10:02:17+00:00

df1.dtypes:
timestamp    datetime64[ns, UTC]
dtype: object

df2:
  ref_time
0 09:38:30

df2.dtypes:
ref_time    timedelta64[ns]
dtype: object
df1['merge_key'] = df1['timestamp'].dt.tz_localize(None) - pd.to_datetime(df1['timestamp'].dt.date)
df_merged = df1.merge(df2, left_on = 'merge_key', right_on = 'ref_time')
Gives:
timestamp merge_key ref_time
0 2019-06-06 09:38:30+00:00 09:38:30 09:38:30
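Applied to the question's own df and ss, the same idea gives the NaN-padded output; this is only a sketch, assuming df is a DataFrame with the tz-aware 'timestamp' column and ss is the timedelta Series named 'timeonly' (merge_key and result are illustrative names):
# strip the timezone and subtract the date part to get a time-of-day timedelta
df['merge_key'] = df['timestamp'].dt.tz_localize(None) - pd.to_datetime(df['timestamp'].dt.date)
# right-merge onto the full time axis so every ref_time keeps a row (NaN/NaT where nothing matches)
result = df.merge(ss.rename('ref_time').to_frame(), left_on='merge_key', right_on='ref_time', how='right')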
The main challenge here is to get everything into compatible dtypes. Using your slightly modified examples as inputs:
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(
"""
time,timestamp
18053.401736,2019-06-06 09:38:30+00:00
18053.418252,2019-06-06 10:02:17+00:00
18053.424514,2019-06-06 10:11:18+00:00
18053.454132,2019-06-06 10:53:57+00:00
"""))
df['timestamp'] = pd.to_datetime(df['timestamp'])
from datetime import timedelta
sdf = pd.read_csv(StringIO(
"""
ref_time
0 days 09:00:00
0 days 09:00:01
0 days 09:00:02
0 days 09:00:03
0 days 09:00:04
0 days 09:38:30
0 days 10:02:17
0 days 14:59:56
0 days 14:59:57
0 days 14:59:58
0 days 14:59:59
0 days 15:00:00
"""))
sdf['ref_time'] = pd.to_timedelta(sdf['ref_time'])
The dtypes here are as in your question, which is important.
First we figure out the base_date, as we need to convert the timedeltas into datetimes. Note we set it to midnight of the relevant date via floor('d') (round('1d') would round an afternoon timestamp up to the next day):
base_date = df['timestamp'].iloc[0].floor('d').to_pydatetime()
base_date
output
datetime.datetime(2019, 6, 6, 0, 0, tzinfo=<UTC>)
Next we add timedeltas from sdf to the base_date:
sdf['ref_dt'] = sdf['ref_time'] + base_date
Now sdf['ref_dt'] and df['timestamp'] are in the same 'units' and of the same type, so we can merge
sdf.merge(df, left_on = 'ref_dt', right_on = 'timestamp', how = 'left')
output
ref_time ref_dt time timestamp
-- --------------- ------------------------- ------- -------------------------
0 0 days 09:00:00 2019-06-06 09:00:00+00:00 nan NaT
1 0 days 09:00:01 2019-06-06 09:00:01+00:00 nan NaT
2 0 days 09:00:02 2019-06-06 09:00:02+00:00 nan NaT
3 0 days 09:00:03 2019-06-06 09:00:03+00:00 nan NaT
4 0 days 09:00:04 2019-06-06 09:00:04+00:00 nan NaT
5 0 days 09:38:30 2019-06-06 09:38:30+00:00 18053.4 2019-06-06 09:38:30+00:00
6 0 days 10:02:17 2019-06-06 10:02:17+00:00 18053.4 2019-06-06 10:02:17+00:00
7 0 days 14:59:56 2019-06-06 14:59:56+00:00 nan NaT
8 0 days 14:59:57 2019-06-06 14:59:57+00:00 nan NaT
9 0 days 14:59:58 2019-06-06 14:59:58+00:00 nan NaT
10 0 days 14:59:59 2019-06-06 14:59:59+00:00 nan NaT
11 0 days 15:00:00 2019-06-06 15:00:00+00:00 nan NaT
and we see the merge happening where needed
Related
I have a Pandas DataFrame df, where time is given in seconds (from the beginning of the day)
df["time"]
0 43200
1 43240
2 43280
3 43320
43200 corresponds to 12:00:00
How can I add a date (2019-07-21) to df["time"], so that the result is
df["time"]
0 2019-07-21 12:00:00
1 2019-07-21 12:00:40
2 2019-07-21 12:01:20
3 2019-07-21 12:02:00
Use to_datetime with unit and origin parameters:
df["time"] = pd.to_datetime(df["time"], unit='s', origin='2019-07-21')
print (df)
time
0 2019-07-21 12:00:00
1 2019-07-21 12:00:40
2 2019-07-21 12:01:20
3 2019-07-21 12:02:00
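An equivalent sketch that makes the addition explicit, assuming df["time"] holds integer seconds since midnight as shown:
# add the seconds-of-day as a timedelta to the chosen date
df["time"] = pd.Timestamp("2019-07-21") + pd.to_timedelta(df["time"], unit="s")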
I want the time without the date in Pandas.
I want to keep the time as dtype datetime64[ns] and not as an object so that I can determine periods between times.
The closest I have gotten is as follows, but it gives back the date in a new column, not the time kept as a datetime dtype as needed.
df_pres_mf['time'] = pd.to_datetime(df_pres_mf['time'], format ='%H:%M', errors = 'coerce') # returns date (1900-01-01) and actual time as a dtype datetime64[ns] format
df_pres_mf['just_time'] = df_pres_mf['time'].dt.date
df_pres_mf['normalised_time'] = df_pres_mf['time'].dt.normalize()
df_pres_mf.head()
Returns the date as 1900-01-01 and not the time that is needed.
Edit: Data
time
1900-01-01 11:16:00
1900-01-01 15:20:00
1900-01-01 09:55:00
1900-01-01 12:01:00
You could do it like Vishnudev suggested, but then you would have dtype object (or even strings, after using dt.strftime), which you said you didn't want.
What you are looking for doesn't exist, but the closest thing I can get you is converting to timedeltas, which won't seem like a solution at first but is actually very useful.
Convert it like this:
# sample df
df
>>
time
0 2021-02-07 09:22:00
1 2021-05-10 19:45:00
2 2021-01-14 06:53:00
3 2021-05-27 13:42:00
4 2021-01-18 17:28:00
df["timed"] = df.time - df.time.dt.normalize()
df
>>
time timed
0 2021-02-07 09:22:00 0 days 09:22:00 # this is just the time difference
1 2021-05-10 19:45:00 0 days 19:45:00 # since midnight, which is essentially the
2 2021-01-14 06:53:00 0 days 06:53:00 # same thing as regular time, except
3 2021-05-27 13:42:00 0 days 13:42:00 # that you can go over 24 hours
4 2021-01-18 17:28:00 0 days 17:28:00
This allows you to calculate periods between times like this:
# subtract the last time from the current
df["difference"] = df.timed - df.timed.shift()
df
Out[48]:
time timed difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 # <-- this is because the last
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 # time was later than the current
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 # (see below)
To get rid of odd differences, make them absolute:
df["abs_difference"] = df.difference.abs()
df
>>
time timed difference abs_difference
0 2021-02-07 09:22:00 0 days 09:22:00 NaT NaT
1 2021-05-10 19:45:00 0 days 19:45:00 0 days 10:23:00 0 days 10:23:00
2 2021-01-14 06:53:00 0 days 06:53:00 -1 days +11:08:00 0 days 12:52:00 ### <<--
3 2021-05-27 13:42:00 0 days 13:42:00 0 days 06:49:00 0 days 06:49:00
4 2021-01-18 17:28:00 0 days 17:28:00 0 days 03:46:00 0 days 03:46:00
Parse with the format matching your data and convert to datetime:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
Then format it to the preferred representation:
df['time'].dt.strftime('%H:%M')
Output
0 11:16
1 15:20
2 09:55
3 12:01
Name: time, dtype: object
I'm doing some resampling on data and I was wondering why resampling 1min data to 5min data creates MORE time intervals than my original dataset?
Also, why does it resample until 2018-12-11 (11 days longer!) than the original dataset?
1-min data:
Result of resampling to 5-min intervals:
This is how I do the resampling:
df1.loc[:,'qKfz_gesamt'].resample('5min').mean()
I was wondering why resampling 1min data to 5min data creates MORE time intervals than my original dataset?
The problem is that even if the original data has gaps, pandas still creates consecutive 5-minute intervals, and the bins with no data are filled with NaN:
df1 = pd.DataFrame({'qKfz_gesamt': range(4)},
index=pd.to_datetime(['2018-11-25 00:00:00','2018-11-25 00:01:00',
'2018-11-25 00:02:00','2018-11-25 00:15:00']))
print (df1)
qKfz_gesamt
2018-11-25 00:00:00 0
2018-11-25 00:01:00 1
2018-11-25 00:02:00 2
2018-11-25 00:15:00 3
print (df1['qKfz_gesamt'].resample('5min').mean())
2018-11-25 00:00:00 1.0
2018-11-25 00:05:00 NaN
2018-11-25 00:10:00 NaN
2018-11-25 00:15:00 3.0
Freq: 5T, Name: qKfz_gesamt, dtype: float64
print (df1['qKfz_gesamt'].resample('5min').mean().dropna())
2018-11-25 00:00:00 1.0
2018-11-25 00:15:00 3.0
Name: qKfz_gesamt, dtype: float64
why does it resample until 2018-12-11 (11 days longer!) than the original dataset?
You need to filter by the maximal value of the index:
rng = pd.date_range('2018-11-25', periods=10)
df1 = pd.DataFrame({'a': range(10)}, index=rng)
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
2018-12-01 6
2018-12-02 7
2018-12-03 8
2018-12-04 9
df1 = df1.loc[:'2018-11-30']
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
Or:
df1 = df1.loc[df1.index <= '2018-11-30']
print (df1)
a
2018-11-25 0
2018-11-26 1
2018-11-27 2
2018-11-28 3
2018-11-29 4
2018-11-30 5
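Putting both parts together, a sketch (assuming the stray timestamps responsible for the extra 11 days all fall after 2018-11-30):
# keep only rows up to the expected end of the data, then resample and drop empty bins
resampled = (df1.loc[:'2018-11-30', 'qKfz_gesamt']
                .resample('5min')
                .mean()
                .dropna())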
I am dealing with financial data which I need to extrapolate over different months. Here is my DataFrame:
invoice_id,date_from,date_to
30492,2019-02-04,2019-09-18
I want to break this up into the different months between date_from and date_to. Hence I need to add rows for each month, from the month's starting date to its ending date. The final output should look like:
invoice_id,date_from,date_to
30492,2019-02-04,2019-02-28
30492,2019-03-01,2019-03-31
30492,2019-04-01,2019-04-30
30492,2019-05-01,2019-05-31
30492,2019-06-01,2019-06-30
30492,2019-07-01,2019-07-31
30492,2019-08-01,2019-08-31
30492,2019-09-01,2019-09-18
The leap-year scenario needs to be taken care of as well. Is there any native method already available in the pandas datetime package which I can use to achieve the desired output?
Use:
print (df)
invoice_id date_from date_to
0 30492 2019-02-04 2019-09-18
1 30493 2019-01-20 2019-03-10
import numpy as np

#added months between date_from and date_to
df1 = pd.concat([pd.Series(r.invoice_id,pd.date_range(r.date_from, r.date_to, freq='MS'))
for r in df.itertuples()]).reset_index()
df1.columns = ['date_from','invoice_id']
#added starts of months - sorting for correct positions
df2 = (pd.concat([df[['invoice_id','date_from']], df1], sort=False, ignore_index=True)
.sort_values(['invoice_id','date_from'])
.reset_index(drop=True))
#month end for all but the last row of each invoice; original date_to for the last row
mask = df2['invoice_id'].duplicated(keep='last')
s = df2['invoice_id'].map(df.set_index('invoice_id')['date_to'])
df2['date_to'] = np.where(mask, df2['date_from'] + pd.offsets.MonthEnd(), s)
print (df2)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
8 30493 2019-01-20 2019-01-31
9 30493 2019-02-01 2019-02-28
10 30493 2019-03-01 2019-03-10
You can use pandas.date_range with the start and end dates, in combination with freq='MS' (month start) and freq='M' (month end). Since date_range only emits complete month boundaries, prepend the original date_from and append the original date_to to cover the partial first and last months:
start = df.iloc[0]['date_from']
end = df.iloc[0]['date_to']
x = [pd.Timestamp(start)] + list(pd.date_range(start=start, end=end, freq='MS'))
y = list(pd.date_range(start=start, end=end, freq='M')) + [pd.Timestamp(end)]
df_new = pd.DataFrame({'date_from': x,
                       'date_to': y})
df_new['invoice_id'] = df.iloc[0]['invoice_id']
print(df_new)
   date_from    date_to  invoice_id
0 2019-02-04 2019-02-28       30492
1 2019-03-01 2019-03-31       30492
2 2019-04-01 2019-04-30       30492
3 2019-05-01 2019-05-31       30492
4 2019-06-01 2019-06-30       30492
5 2019-07-01 2019-07-31       30492
6 2019-08-01 2019-08-31       30492
7 2019-09-01 2019-09-18       30492
Another way, using the resample method of a datetime index:
# melt, so we have start and end dates in 1 column
df = pd.melt(df, id_vars='invoice_id')
# now set the date column as index
df.set_index(inplace=True, keys='value')
# resample to daily level
df = df.resample('D').ffill().reset_index()
# get the yr-month value of each daily row
df['yr_month'] = df['value'].dt.strftime("%Y-%m")
# Now group by month and take min/max day values
output = (df.groupby(['invoice_id', 'yr_month'])['value']
            .agg(date_from='min', date_to='max')
            .reset_index()
            .drop(labels='yr_month', axis=1))
print(output)
invoice_id date_from date_to
0 30492 2019-02-04 2019-02-28
1 30492 2019-03-01 2019-03-31
2 30492 2019-04-01 2019-04-30
3 30492 2019-05-01 2019-05-31
4 30492 2019-06-01 2019-06-30
5 30492 2019-07-01 2019-07-31
6 30492 2019-08-01 2019-08-31
7 30492 2019-09-01 2019-09-18
I have a pandas dataset like this:
user_id datetime
1 13 days 21:50:00
2 0 days 02:05:00
5 10 days 00:10:00
7 2 days 01:20:00
1 3 days 11:50:00
2 1 days 02:30:00
I want to have a column that contains the minutes, so in this case the result would be:
user_id datetime minutes
1 13 days 21:50:00 20030
2 0 days 02:05:00 125
5 10 days 00:10:00 14410
7 2 days 01:20:00 2960
1 3 days 11:50:00 5030
2 1 days 02:30:00 1590
Is there any way to do that without a loop?
Yes, there is a special .dt accessor for datetime/timedelta Series:
df['minutes'] = df['datetime'].dt.total_seconds() / 60
If you only want whole minutes, cast the result using .astype(int).
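A minimal sketch of that cast, assuming the column is already timedelta64 and has no missing values:
# total_seconds() gives a float Series; divide by 60 and truncate to whole minutes
df['minutes'] = (df['datetime'].dt.total_seconds() / 60).astype(int)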
Here is a way with pd.Timedelta:
df['minutes'] = pd.to_timedelta(df.datetime) / pd.Timedelta(1, 'm')
>>> df
user_id datetime minutes
0 1 13 days 21:50:00 20030.0
1 2 0 days 02:05:00 125.0
2 5 10 days 00:10:00 14410.0
3 7 2 days 01:20:00 2960.0
4 1 3 days 11:50:00 5030.0
5 2 1 days 02:30:00 1590.0
If your datetime column is already of dtype timedelta, you can omit the explicit conversion and just use:
df['minutes'] = df.datetime / pd.Timedelta(1, 'm')
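The same division works for other units as well, for example hours (a sketch):
# divide the timedelta column by a one-hour Timedelta to get float hours
df['hours'] = df.datetime / pd.Timedelta(1, 'h')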