I have a DataFrame with a column of dtype datetime64
df:
time timestamp
18053.401736 2019-06-06 09:38:30+00:00
18053.418252 2019-06-06 10:02:17+00:00
18053.424514 2019-06-06 10:11:18+00:00
18053.454132 2019-06-06 10:53:57+00:00
Name: timestamp, dtype: datetime64[ns, UTC]
and a Series of dtype timedelta64
ss:
ref_time
0 0 days 09:00:00
1 0 days 09:00:01
2 0 days 09:00:02
3 0 days 09:00:03
4 0 days 09:00:04
...
21596 0 days 14:59:56
21597 0 days 14:59:57
21598 0 days 14:59:58
21599 0 days 14:59:59
21600 0 days 15:00:00
Name: timeonly, Length: 21601, dtype: timedelta64[ns]
I want to merge the two so that the output DataFrame has values only where the timestamp coincides with one of the Series' times:
Desired output:
time timestamp ref_time
NaN NaN 09:00:00
... ... ...
NaN NaN 09:38:29
18053.401736 2019-06-06 09:38:30+00:00 09:38:30
NaN NaN 09:38:31
... ... ...
18053.418252 2019-06-06 10:02:17+00:00 10:02:17
NaN NaN 10:02:18
NaN NaN 10:02:19
... ... ...
18053.424514 2019-06-06 10:11:18+00:00 10:11:18
... ... ...
18053.454132 2019-06-06 10:53:57+00:00 10:53:57
However, if I convert 'timestamp' to time-only values I get an object dtype, and then I can't merge it with ss:
df['timestamp'].dtype # --> datetime64[ns, UTC]
df['timeonly'] = df['timestamp'].dt.time
df['timeonly'].dtype # --> object
df.merge(ss, how='outer', on=['timeonly'])
# ValueError: You are trying to merge on object and timedelta64[ns] columns. If you wish to proceed you should use pd.concat
but using concat as suggested doesn't give me the desired output.
How can I merge/join the DataFrame and the Series?
Pandas version 1.1.5
Convert the timestamp to timedelta by subtracting the date part and then merge:
df1 = pd.DataFrame([pd.Timestamp('2019-06-06 09:38:30+00:00'),pd.Timestamp('2019-06-06 10:02:17+00:00')], columns=['timestamp'])
df2 = pd.DataFrame([pd.Timedelta('09:38:30')], columns=['ref_time'])
df1:
   timestamp
0  2019-06-06 09:38:30+00:00
1  2019-06-06 10:02:17+00:00

df1.dtypes:
timestamp    datetime64[ns, UTC]
dtype: object

df2:
  ref_time
0 09:38:30

df2.dtypes:
ref_time    timedelta64[ns]
dtype: object
df1['merge_key'] = df1['timestamp'].dt.tz_localize(None) - pd.to_datetime(df1['timestamp'].dt.date)
df_merged = df1.merge(df2, left_on = 'merge_key', right_on = 'ref_time')
Gives:
timestamp merge_key ref_time
0 2019-06-06 09:38:30+00:00 09:38:30 09:38:30
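Applied to the full df and ss from the question, a sketch (assuming time is a regular column — reset_index() first if it is the index); the outer merge plus a sort gives the NaN-padded layout you asked for:
# give df a timedelta key comparable with ss, then outer-merge on it
df['merge_key'] = df['timestamp'].dt.tz_localize(None) - pd.to_datetime(df['timestamp'].dt.date)
out = (df.merge(ss.rename('ref_time').to_frame(),
                left_on='merge_key', right_on='ref_time', how='outer')
         .sort_values('ref_time')
         .reset_index(drop=True))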
The main challenge here is to get everything into compatible dtypes. Using your examples, slightly modified, as inputs:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(
"""
time,timestamp
18053.401736,2019-06-06 09:38:30+00:00
18053.418252,2019-06-06 10:02:17+00:00
18053.424514,2019-06-06 10:11:18+00:00
18053.454132,2019-06-06 10:53:57+00:00
"""))
df['timestamp'] = pd.to_datetime(df['timestamp'])
from datetime import timedelta
sdf = pd.read_csv(StringIO(
"""
ref_time
0 days 09:00:00
0 days 09:00:01
0 days 09:00:02
0 days 09:00:03
0 days 09:00:04
0 days 09:38:30
0 days 10:02:17
0 days 14:59:56
0 days 14:59:57
0 days 14:59:58
0 days 14:59:59
0 days 15:00:00
"""))
sdf['ref_time'] = pd.to_timedelta(sdf['ref_time'])
The dtypes here match those in your question, which is important.
First we figure out base_date, since we need to convert the timedeltas into datetimes. Note we set it to midnight of the relevant date via round('1d'):
base_date = df['timestamp'].iloc[0].round('1d').to_pydatetime()
base_date
output
datetime.datetime(2019, 6, 6, 0, 0, tzinfo=<UTC>)
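One caveat: round('1d') rounds to the nearest midnight, so a first timestamp after noon would round forward to the next day. floor('1d') always truncates to midnight of the same day:
# safer alternative: floor truncates instead of rounding to the nearest day
base_date = df['timestamp'].iloc[0].floor('1d').to_pydatetime()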
Next we add timedeltas from sdf to the base_date:
sdf['ref_dt'] = sdf['ref_time'] + base_date
Now sdf['ref_dt'] and df['timestamp'] have the same 'units' and the same type, so we can merge:
sdf.merge(df, left_on = 'ref_dt', right_on = 'timestamp', how = 'left')
output
ref_time ref_dt time timestamp
-- --------------- ------------------------- ------- -------------------------
0 0 days 09:00:00 2019-06-06 09:00:00+00:00 nan NaT
1 0 days 09:00:01 2019-06-06 09:00:01+00:00 nan NaT
2 0 days 09:00:02 2019-06-06 09:00:02+00:00 nan NaT
3 0 days 09:00:03 2019-06-06 09:00:03+00:00 nan NaT
4 0 days 09:00:04 2019-06-06 09:00:04+00:00 nan NaT
5 0 days 09:38:30 2019-06-06 09:38:30+00:00 18053.4 2019-06-06 09:38:30+00:00
6 0 days 10:02:17 2019-06-06 10:02:17+00:00 18053.4 2019-06-06 10:02:17+00:00
7 0 days 14:59:56 2019-06-06 14:59:56+00:00 nan NaT
8 0 days 14:59:57 2019-06-06 14:59:57+00:00 nan NaT
9 0 days 14:59:58 2019-06-06 14:59:58+00:00 nan NaT
10 0 days 14:59:59 2019-06-06 14:59:59+00:00 nan NaT
11 0 days 15:00:00 2019-06-06 15:00:00+00:00 nan NaT
and we see the merge happening where needed
I have the following df:
date_from date_to birth_date death_date
0 2016-01-10 2019-06-05 2015-02-15 2018-07-25
1 2016-05-11 2020-06-13 2014-03-07 2020-07-11
2 2016-02-23 NaT 2014-03-07 2019-06-08
3 2015-12-08 NaT 2014-03-07 2019-06-08
I'm trying to select all cases where date_to > death_date OR where date_to is NaT.
I've tried the following code:
df = df[(df['date_to'] > df['death_date']) | (df[df['DATE_TO'].isnull()])]
but I get the following error message:
TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
and I don't really know how to work around this problem.
The problem is the extra df[...] around the null check: df[df['DATE_TO'].isnull()] selects a whole DataFrame rather than producing a boolean mask, so it cannot be combined with |. Drop the outer df[...] and use the mask itself.
import pandas as pd
# ..... your data frame df ......
# considering that you have the following types
>>> df.dtypes
date_from datetime64[ns]
date_to datetime64[ns]
birth_date datetime64[ns]
death_date datetime64[ns]
dtype: object
df = df[(df['date_to'] > df['death_date']) | (df['date_to'].isnull())]
>>> df
date_from date_to birth_date death_date
0 2016-01-10 2019-06-05 2015-02-15 2018-07-25
2 2016-02-23 NaT 2014-03-07 2019-06-08
3 2015-12-08 NaT 2014-03-07 2019-06-08
In case the date_to column is not datetime, you can convert it like this:
df['date_to'] = df['date_to'].replace('NaT', pd.NaT)
df['date_to'] = pd.to_datetime(df['date_to'])
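A self-contained sketch of the whole selection, with the dates taken from your example:
import pandas as pd

df = pd.DataFrame({
    'date_to': pd.to_datetime(['2019-06-05', '2020-06-13', None, None]),
    'death_date': pd.to_datetime(['2018-07-25', '2020-07-11', '2019-06-08', '2019-06-08']),
})
# keep rows where date_to is later than death_date, or date_to is missing
print(df[(df['date_to'] > df['death_date']) | (df['date_to'].isnull())])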
I am importing a CSV of 20 variables and 1500 records. There are 5 date columns in UK date format (dd/mm/yyyy), which import as str.
I need to be able to subtract one date from another. They are hospital admissions; I need to subtract the admission date from the discharge date to get the length of stay.
I have had a number of problems.
To illustrate I have used 2 columns.
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv("/Users........csv", usecols=['ADMIDATE', 'DISDATE'])
df
ADMIDATE DISDATE
0 04/02/2018 07/02/2018
1 25/07/2017 1801-01-01
2 28/06/2017 01/07/2017
3 22/06/2017 1801-01-01
4 11/12/2017 15/12/2017
... ... ...
1503 25/01/2019 27/01/2019
1504 31/08/2018 1801-01-01
1505 20/09/2018 05/11/2018
1506 28/09/2018 1801-01-01
1507 21/02/2019 24/02/2019
1508 rows × 2 columns
I removed about 100 records with a DISDATE of 1801-01-01; these are likely bad data from the patient still being in hospital when the data was collected.
To convert the dates to datetime, I used .astype('datetime64[ns]'), because I didn't know how to use pd.to_datetime on multiple columns.
df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null int64
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB
So, the conversion appears to have worked.
However on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
Unnamed: 0 ADMIDATE DISDATE
0 0 2018-04-02 2018-07-02
1 2 2017-06-28 2017-01-07
2 4 2017-11-12 2017-12-15
3 5 2017-09-04 2017-12-04
4 6 2017-05-30 2017-01-06
5 7 2017-02-08 2017-07-08
6 8 2017-11-17 2017-11-18
7 9 2018-03-14 2018-03-20
8 10 2017-04-26 2017-03-05
9 11 2017-05-16 2017-05-17
10 12 2018-01-17 2018-01-19
11 13 2017-12-18 2017-12-20
12 14 2017-02-10 2017-04-10
13 16 2017-03-30 2017-07-04
14 17 2017-01-12 2017-12-18
15 18 2017-12-07 2017-07-14
16 19 2017-05-04 2017-08-04
17 20 2017-10-30 2017-01-11
18 21 2017-06-19 2017-06-22
19 22 2017-04-05 2017-08-05
So when I subtract the ADMIDATE from the DISDATE I am getting negative values.
df['DISDATE'] - df['ADMIDATE']
0 91 days
1 -172 days
2 33 days
3 91 days
4 -144 days
...
1394 188 days
1395 -291 days
1396 2 days
1397 -132 days
1398 3 days
Length: 1399, dtype: timedelta64[ns]
I would like a method that works on all my date columns, keeps the UK format and allows me to do basic operations on the date fields.
After the suggestion from @code-different below, which seems very sensible:
for col in df.columns:
df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
The format is unchanged despite dayfirst=True.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null datetime64[ns]
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB
df.head()
Unnamed: 0 ADMIDATE DISDATE
0 1970-01-01 00:00:00.000000000 2018-04-02 2018-07-02
1 1970-01-01 00:00:00.000000002 2017-06-28 2017-01-07
2 1970-01-01 00:00:00.000000004 2017-11-12 2017-12-15
3 1970-01-01 00:00:00.000000005 2017-09-04 2017-12-04
4 1970-01-01 00:00:00.000000006 2017-05-30 2017-01-06
I have also tried format='%d%m%Y' and still the year is first. Would datetime.strptime be any good?
Just tell pandas.to_datetime to use a specific and adequate format. Note that pandas always displays datetime64 values in ISO order (yyyy-mm-dd) regardless of the input format; what matters is whether day and month were parsed correctly. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017',
'28/06/2017', '22/06/2017', '11/12/2017'],
'DISDATE': ['07/02/2018', '1801-01-01',
'01/07/2017', '1801-01-01', '15/12/2017']}).replace({'1801-01-01': np.datetime64('NaT')})
for col in ['ADMIDATE', 'DISDATE']:
df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')
# df
# ADMIDATE DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25 NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22 NaT
# 4 2017-12-11 2017-12-15
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 ADMIDATE 5 non-null datetime64[ns]
# 1 DISDATE 3 non-null datetime64[ns]
# dtypes: datetime64[ns](2)
Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns, so you just loop over the columns one by one. The strings are in UK format (day-first), so you simply tell to_datetime that:
df = pd.read_csv('/path/to/file.csv', usecols = ['ADMIDATE','DISDATE']).replace({'1801-01-01': pd.NA})
for col in df.columns:
df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
astype('datetime64[ns]') is too inflexible for what you need.
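Once both columns are datetime64, the length of stay the question is ultimately after is a plain subtraction (a sketch; the length_of_stay column name is made up here):
# timedelta64 result; .dt.days gives whole days, NaT rows become NaN
df['length_of_stay'] = (df['DISDATE'] - df['ADMIDATE']).dt.days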
I have a dataframe df_energy2
df_energy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29974 entries, 0 to 29973
Data columns (total 4 columns):
TIMESTAMP 29974 non-null datetime64[ns]
P_ACT_KW 29974 non-null int64
PERIODE_TARIF 29974 non-null object
P_SOUSCR 29974 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 936.8+ KB
with this structure:
df_energy2.head()
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR
2016-01-01 00:00:00 116 HC 250
2016-01-01 00:10:00 121 HC 250
Is there any pandas function which can extract the hour from TIMESTAMP?
I think you need dt.hour:
print (df.TIMESTAMP.dt.hour)
0 0
1 0
Name: TIMESTAMP, dtype: int64
df['hours'] = df.TIMESTAMP.dt.hour
print (df)
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR hours
0 2016-01-01 00:00:00 116 HC 250 0
1 2016-01-01 00:10:00 121 HC 250 0
Given your data:
df_energy2.head()
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR
2016-01-01 00:00:00 116 HC 250
2016-01-01 00:10:00 121 HC 250
You have the timestamp as the index. To extract hours from a timestamp that is the index of the DataFrame:
hours = df_energy2.index.hour
Edit: Yes, jezrael is right. If TIMESTAMP is a regular column instead, pandas has the dt accessor for this:
<dataframe>.<ts_column>.dt.hour
Example in your context - the column with date is TIMESTAMP
df.TIMESTAMP.dt.hour
A similar question - Pandas, dataframe with a datetime64 column, querying by hour
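Since TIMESTAMP appears both as a column and as the index in this thread, a small defensive sketch covering either case (df_energy2 as in the question):
# use the .dt accessor for a column, or the DatetimeIndex attribute directly
if 'TIMESTAMP' in df_energy2.columns:
    hours = df_energy2['TIMESTAMP'].dt.hour
else:
    hours = df_energy2.index.hour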
I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I'm having trouble finding something that will work for a Series rather than an individual int value. Any help would be much appreciated.
If the 'Date_Min' dtype is already datetime then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'Date_Min': [dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)],
                   'Mean_Days': [1, 2, 3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
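pd.to_timedelta is an equivalent way to build the offsets from an integer Series; applied to the frame from the question (a sketch, assuming merged as shown above):
# interpret Mean_Days as whole days and add them to Date_Min
merged['Recency_Date'] = merged['Date_Min'] + pd.to_timedelta(merged['Mean_Days'], unit='D')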