Pandas: Adding varying numbers of days to a date in a dataframe - python

I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I'm having trouble finding something that works on a Series rather than an individual int value. Any help would be much appreciated.

If 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
df = pd.DataFrame({'Date_Min': [dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)],
                   'Mean_Days': [1,2,3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
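An equivalent spelling uses pd.to_timedelta, which also works element-wise on a Series. A minimal sketch, assuming the column names from the question and the first three sample rows:

```python
import pandas as pd

df = pd.DataFrame({'Date_Min': pd.to_datetime(['2013-04-16', '2012-10-16', '2014-06-12']),
                   'Mean_Days': [353, 420, 108]})

# to_timedelta turns the integer day counts into a timedelta64 series,
# which then adds element-wise to the datetime column
df['Recency_Date'] = df['Date_Min'] + pd.to_timedelta(df['Mean_Days'], unit='D')
print(df)
```

The result stays datetime64[ns], so further date arithmetic keeps working on the new column.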

How to change datetime format in pandas. Fastest way?

I import from .csv and get object-type columns:
begin end
0 2019-03-29 17:02:32.838469+00 2019-04-13 17:32:19.134874+00
1 2019-06-13 19:22:19.331201+00 2019-06-13 19:51:21.987534+00
2 2019-03-27 06:56:51.138795+00 2019-03-27 06:56:54.834751+00
3 2019-05-28 11:09:29.320478+00 2019-05-29 06:47:21.794092+00
4 2019-03-24 07:03:03.582679+00 2019-03-24 09:50:32.595199+00
I need to get in datetime format:
begin end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24
What I do (first I convert them to datetime format, then cut to dates only, then convert to datetime format again):
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
df['begin'] = df['begin'].dt.date
df['end'] = df['end'].dt.date
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
Is there any short way to do this without converting 2 times?
You can use .apply() over the 2 columns, calling pd.to_datetime once per column, then dt.normalize() to remove the time info while keeping datetime format. (dt.tz_localize(None) is also used, to remove the timezone info):
df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())
Result:
print(df)
begin end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 begin 5 non-null datetime64[ns] <=== datetime format
1 end 5 non-null datetime64[ns] <=== datetime format
dtypes: datetime64[ns](2)
memory usage: 120.0 bytes
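The same idea works without .apply(), looping over the columns directly. A sketch with one row of the question's data, writing the UTC offsets as +00:00 for portability:

```python
import pandas as pd

df = pd.DataFrame({'begin': ['2019-03-29 17:02:32.838469+00:00'],
                   'end': ['2019-04-13 17:32:19.134874+00:00']})

for col in ['begin', 'end']:
    # parse as UTC, drop the timezone, then truncate to midnight, in one pass
    df[col] = pd.to_datetime(df[col], utc=True).dt.tz_localize(None).dt.normalize()

print(df)
```

Either way there is a single to_datetime call per column rather than the convert/cut/convert round trip in the question.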

Grouping data in DF but keeping all columns in Python

I have a df that includes high and low stock prices by day in 2 minute increments. I am trying to find the high and low for each day. I am able to do so by using the code below but the output only gives me the date and price data. I need to have the time column available as well. I've tried about 100 different ways but cannot get it to work.
high = df.groupby('Date')['High'].max()
low = df.groupby('Date')['Low'].min()
Below are my columns and dtypes.
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 High 4277 non-null float64
1 Low 4277 non-null float64
2 Date 4277 non-null object
3 Time 4277 non-null object
Any suggestions?
Use transform with boolean indexing:
# sample data
np.random.seed(10)
df = pd.DataFrame([pd.date_range('2020-01-01', '2020-01-03', freq='H'),
                   np.random.randint(1, 10000, 49),
                   np.random.randint(1, 10, 49)]).T
df.columns = ['date', 'high', 'low']
df['time'] = df['date'].dt.time
df['date'] = df['date'].dt.date
# transform max and min then assign to a variable
mx = df.groupby('date')['high'].transform(max)
mn = df.groupby('date')['low'].transform(min)
# boolean indexing
high = df[df['high'] == mx]
low = df[df['low'] == mn]
# high
date high low time
4 2020-01-01 9373 9 04:00:00
42 2020-01-02 9647 2 18:00:00
48 2020-01-03 45 5 00:00:00
# low
date high low time
14 2020-01-01 2103 1 14:00:00
15 2020-01-01 3417 1 15:00:00
23 2020-01-01 654 1 23:00:00
27 2020-01-02 2701 1 03:00:00
30 2020-01-02 284 1 06:00:00
36 2020-01-02 6160 1 12:00:00
38 2020-01-02 631 1 14:00:00
40 2020-01-02 3417 1 16:00:00
44 2020-01-02 6860 1 20:00:00
45 2020-01-02 8989 1 21:00:00
47 2020-01-02 2811 1 23:00:00
48 2020-01-03 45 5 00:00:00
Do you want this:
# should use datetime type:
df['Date'] = pd.to_datetime(df['Date'])
df.groupby(df.Date.dt.normalize()).agg({'High': 'max', 'Low': 'min'})
After you apply groupby and min or max function, you can select the columns using loc or iloc:
df.groupby('Date').max().loc[:,['High','Time']]
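Another common pattern for keeping every column, including Time, is to look up the row index of each day's extreme with idxmax/idxmin and pull the full rows with .loc. A sketch with made-up data; unlike the boolean-indexing approach above, it returns exactly one row per day even when there are ties:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
                   'Time': ['09:30', '09:32', '09:30', '09:32'],
                   'High': [10.0, 12.5, 11.0, 10.5],
                   'Low':  [9.5, 10.0, 9.0, 9.8]})

# idxmax/idxmin return the index label of each group's extreme row,
# so .loc recovers the whole row, Time column included
daily_high = df.loc[df.groupby('Date')['High'].idxmax()]
daily_low = df.loc[df.groupby('Date')['Low'].idxmin()]
```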

Datetime fails when setting astype, date mangled

I am importing a csv of 20 variables and 1500 records. There are 5 date columns in UK date format dd/mm/yyyy, which import as str.
I need to be able to subtract one date from another. They are hospital admissions; I need to subtract the admission date from the discharge date to get the length of stay.
I have had a number of problems.
To illustrate I have used 2 columns.
import pandas as pd
import numpy as np
from datetime import datetime
# import the .csv
df = pd.read_csv("/Users........csv", usecols = ['ADMIDATE', 'DISDATE'])
df
ADMIDATE DISDATE
0 04/02/2018 07/02/2018
1 25/07/2017 1801-01-01
2 28/06/2017 01/07/2017
3 22/06/2017 1801-01-01
4 11/12/2017 15/12/2017
... ... ...
1503 25/01/2019 27/01/2019
1504 31/08/2018 1801-01-01
1505 20/09/2018 05/11/2018
1506 28/09/2018 1801-01-01
1507 21/02/2019 24/02/2019
1508 rows × 2 columns
I removed about 100 records with a DISDATE of 1801-01-01, - these are likely bad data from the patient still being in hospital when the data was collected.
To convert the dates to datetime, I have used .astype('datetime64[ns]')
This is because I didn't know how to use pd.to_datetime on multiple columns.
df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null int64
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB
So, the conversion appears to have worked.
However on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
Unnamed: 0 ADMIDATE DISDATE
0 0 2018-04-02 2018-07-02
1 2 2017-06-28 2017-01-07
2 4 2017-11-12 2017-12-15
3 5 2017-09-04 2017-12-04
4 6 2017-05-30 2017-01-06
5 7 2017-02-08 2017-07-08
6 8 2017-11-17 2017-11-18
7 9 2018-03-14 2018-03-20
8 10 2017-04-26 2017-03-05
9 11 2017-05-16 2017-05-17
10 12 2018-01-17 2018-01-19
11 13 2017-12-18 2017-12-20
12 14 2017-02-10 2017-04-10
13 16 2017-03-30 2017-07-04
14 17 2017-01-12 2017-12-18
15 18 2017-12-07 2017-07-14
16 19 2017-05-04 2017-08-04
17 20 2017-10-30 2017-01-11
18 21 2017-06-19 2017-06-22
19 22 2017-04-05 2017-08-05
So when I subtract ADMIDATE from DISDATE I am getting negative values.
df['DISDATE'] - df['ADMIDATE']
0 91 days
1 -172 days
2 33 days
3 91 days
4 -144 days
...
1394 188 days
1395 -291 days
1396 2 days
1397 -132 days
1398 3 days
Length: 1399, dtype: timedelta64[ns]
I would like a method that works on all my date columns, keeps the UK format and allows me to do basic operations on the date fields.
After the suggestion from @code-different below, which seems very sensible:
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
The format is unchanged despite dayfirst=True.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null datetime64[ns]
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB
df.head()
Unnamed: 0 ADMIDATE DISDATE
0 1970-01-01 00:00:00.000000000 2018-04-02 2018-07-02
1 1970-01-01 00:00:00.000000002 2017-06-28 2017-01-07
2 1970-01-01 00:00:00.000000004 2017-11-12 2017-12-15
3 1970-01-01 00:00:00.000000005 2017-09-04 2017-12-04
4 1970-01-01 00:00:00.000000006 2017-05-30 2017-01-06
I have also tried format='%d%m%Y' and still the year is first. Would datetime.strptime be any good?
Just tell pandas.to_datetime to use a specific, adequate format, e.g.:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017',
                                '28/06/2017', '22/06/2017', '11/12/2017'],
                   'DISDATE': ['07/02/2018', '1801-01-01',
                               '01/07/2017', '1801-01-01', '15/12/2017']}
                  ).replace({'1801-01-01': np.datetime64('NaT')})

for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')
# df
# ADMIDATE DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25 NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22 NaT
# 4 2017-12-11 2017-12-15
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 ADMIDATE 5 non-null datetime64[ns]
# 1 DISDATE 3 non-null datetime64[ns]
# dtypes: datetime64[ns](2)
Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns so you just loop over the columns one by one. The strings are in UK format (day-first) so you simply tell to_datetime that:
df = pd.read_csv('/path/to/file.csv', usecols=['ADMIDATE', 'DISDATE']).replace({'1801-01-01': pd.NA})
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
astype('datetime64[ns]') is too inflexible for what you need.
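To see why the dayfirst flag matters here, a small sketch: the same UK-format string parses to two different dates depending on it.

```python
import pandas as pd

s = pd.Series(['04/02/2018'])  # 4 February 2018 in UK (day-first) format

us = pd.to_datetime(s)                  # defaults to month-first: 2 April 2018
uk = pd.to_datetime(s, dayfirst=True)   # day-first: 4 February 2018

print(us.iloc[0], uk.iloc[0])
```

This silent month/day swap is exactly what produced the mangled dates and negative lengths of stay in the question.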

Python to convert different date formats in a column

I am trying to convert a column which has different date formats.
For example:
month
2018-01-01 float64
2018-02-01 float64
2018-03-01 float64
2018-03-01 00:00:00 float64
2018-04-01 01:00:00 float64
2018-05-01 01:00:00 float64
2018-06-01 01:00:00 float64
2018-07-01 01:00:00 float64
I want to convert everything in the column to just month and year. For example I would like Jan-18, Feb-18, Mar-18, etc.
I have tried using this code to first convert my column to datetime:
df['month'] = pd.to_datetime(df['month'], format='%Y-%m-%d')
But it returns a float64:
Out
month
2018-01-01 00:00:00 float64
2018-02-01 00:00:00 float64
2018-03-01 00:00:00 float64
2018-04-01 01:00:00 float64
2018-05-01 01:00:00 float64
2018-06-01 01:00:00 float64
2018-07-01 01:00:00 float64
In my output to CSV the month format has been changed to 01/05/2016 00:00:00. Can you please help me convert to just month and year, e.g. Aug-18.
Thank you
I assume you have a Pandas dataframe. In this case, you can use pd.Series.dt.to_period:
s = pd.Series(['2018-01-01', '2018-02-01', '2018-03-01',
               '2018-03-01 00:00:00', '2018-04-01 01:00:00'])
res = pd.to_datetime(s).dt.to_period('M')
print(res)
0 2018-01
1 2018-02
2 2018-03
3 2018-03
4 2018-04
dtype: object
As you can see, this results in a series of dtype object, which is generally inefficient. A better idea is to set the day to the last of the month and maintain a datetime series internally represented by integers.

Extract hour from timestamp with Python

I have a dataframe df_energy2
df_energy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29974 entries, 0 to 29973
Data columns (total 4 columns):
TIMESTAMP 29974 non-null datetime64[ns]
P_ACT_KW 29974 non-null int64
PERIODE_TARIF 29974 non-null object
P_SOUSCR 29974 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 936.8+ KB
with this structure :
df_energy2.head()
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR
2016-01-01 00:00:00 116 HC 250
2016-01-01 00:10:00 121 HC 250
Is there any Python function which can extract the hour from TIMESTAMP?
Kind regards
I think you need dt.hour:
print (df.TIMESTAMP.dt.hour)
0 0
1 0
Name: TIMESTAMP, dtype: int64
df['hours'] = df.TIMESTAMP.dt.hour
print (df)
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR hours
0 2016-01-01 00:00:00 116 HC 250 0
1 2016-01-01 00:10:00 121 HC 250 0
Given your data:
df_energy2.head()
TIMESTAMP P_ACT_KW PERIODE_TARIF P_SOUSCR
2016-01-01 00:00:00 116 HC 250
2016-01-01 00:10:00 121 HC 250
You have timestamp as the index. For extracting hours from timestamp where you have it as the index of the dataframe:
hours = df_energy2.index.hour
Edit: Yes, jezrael, you're right. As he has stated, a pandas dataframe column has an accessor for this, i.e. .dt:
<dataframe>.<ts_column>.dt.hour
Example in your context - the column with date is TIMESTAMP
df.TIMESTAMP.dt.hour
A similar question - Pandas, dataframe with a datetime64 column, querying by hour
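A self-contained sketch covering both cases the answers mention, hour from a datetime column via .dt and hour from a DatetimeIndex directly:

```python
import pandas as pd

df = pd.DataFrame({'TIMESTAMP': pd.to_datetime(['2016-01-01 00:00:00', '2016-01-01 00:10:00']),
                   'P_ACT_KW': [116, 121]})

# datetime64 column -> use the .dt accessor
df['hours'] = df['TIMESTAMP'].dt.hour

# DatetimeIndex -> .hour is an attribute, no .dt needed
df2 = df.set_index('TIMESTAMP')
hours_from_index = df2.index.hour
```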
