Pandas: Adding varying numbers of days to a date in a dataframe - python
I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I'm having trouble finding something that works on a whole Series rather than an individual int value. Any help would be much appreciated.
If the 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'Date_Min':[dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)], 'Mean_Days':[1,2,3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
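On current pandas versions the same idea is usually written with pd.to_timedelta, which accepts the whole integer column directly. A minimal sketch using the question's column names on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Date_Min': pd.to_datetime(['2013-04-16', '2012-10-16', '2014-06-12']),
    'Mean_Days': [353, 420, 108],
})
# to_timedelta turns the integer Series into day-valued timedeltas,
# which add elementwise to the datetime column
df['Recency_Date'] = df['Date_Min'] + pd.to_timedelta(df['Mean_Days'], unit='D')
print(df['Recency_Date'])
```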
Related
how to change datetime format in pandas. fastest way?
I import from .csv and get object-type columns:

                           begin                            end
0  2019-03-29 17:02:32.838469+00  2019-04-13 17:32:19.134874+00
1  2019-06-13 19:22:19.331201+00  2019-06-13 19:51:21.987534+00
2  2019-03-27 06:56:51.138795+00  2019-03-27 06:56:54.834751+00
3  2019-05-28 11:09:29.320478+00  2019-05-29 06:47:21.794092+00
4  2019-03-24 07:03:03.582679+00  2019-03-24 09:50:32.595199+00

I need to get them in datetime format:

       begin        end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24

What I do (first I convert to datetime format, then cut to dates only, then convert to datetime format again):

df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
df['begin'] = df['begin'].dt.date
df['end'] = df['end'].dt.date
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)

Is there any shorter way to do this without converting twice?
You can use .apply() on the 2 columns, calling pd.to_datetime once per column, and use dt.normalize() to remove the time info while keeping the datetime format (dt.tz_localize(None) is also used to remove the timezone info):

df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())

Result:

print(df)
       begin        end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   begin   5 non-null      datetime64[ns]  <=== datetime format
 1   end     5 non-null      datetime64[ns]  <=== datetime format
dtypes: datetime64[ns](2)
memory usage: 120.0 bytes
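The one-liner above can be checked end to end on a small sample built from the question's values (only two rows here, for brevity):

```python
import pandas as pd

# object-dtype columns as they would come out of read_csv
df = pd.DataFrame({
    'begin': ['2019-03-29 17:02:32.838469+00', '2019-06-13 19:22:19.331201+00'],
    'end':   ['2019-04-13 17:32:19.134874+00', '2019-06-13 19:51:21.987534+00'],
})
# parse, drop the UTC offset, then truncate to midnight - still datetime64
df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())
print(df.dtypes)
```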
Grouping data in DF but keeping all columns in Python
I have a df that includes high and low stock prices by day in 2-minute increments. I am trying to find the high and low for each day. I am able to do so using the code below, but the output only gives me the date and price data. I need to have the time column available as well. I've tried about 100 different ways but cannot get it to work.

high = df.groupby('Date')['High'].max()
low = df.groupby('Date')['Low'].min()

Below are my columns and dtypes.

 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   High    4277 non-null   float64
 1   Low     4277 non-null   float64
 2   Date    4277 non-null   object
 3   Time    4277 non-null   object

Any suggestions?
transform with boolean indexing:

# sample data
np.random.seed(10)
df = pd.DataFrame([pd.date_range('2020-01-01', '2020-01-03', freq='H'),
                   np.random.randint(1, 10000, 49),
                   np.random.randint(1, 10, 49)]).T
df.columns = ['date', 'high', 'low']
df['time'] = df['date'].dt.time
df['date'] = df['date'].dt.date

# transform max and min, then assign to a variable
mx = df.groupby('date')['high'].transform(max)
mn = df.groupby('date')['low'].transform(min)

# boolean indexing
high = df[df['high'] == mx]
low = df[df['low'] == mn]

# high
          date  high  low      time
4   2020-01-01  9373    9  04:00:00
42  2020-01-02  9647    2  18:00:00
48  2020-01-03    45    5  00:00:00

# low
          date  high  low      time
14  2020-01-01  2103    1  14:00:00
15  2020-01-01  3417    1  15:00:00
23  2020-01-01   654    1  23:00:00
27  2020-01-02  2701    1  03:00:00
30  2020-01-02   284    1  06:00:00
36  2020-01-02  6160    1  12:00:00
38  2020-01-02   631    1  14:00:00
40  2020-01-02  3417    1  16:00:00
44  2020-01-02  6860    1  20:00:00
45  2020-01-02  8989    1  21:00:00
47  2020-01-02  2811    1  23:00:00
48  2020-01-03    45    5  00:00:00
Do you want this:

# should use datetime type:
df['Date'] = pd.to_datetime(df['Date'])
df.groupby(df.Date.dt.normalize()).agg({'High': 'max', 'Low': 'min'})
After you apply groupby and the min or max function, you can select the columns using loc or iloc:

df.groupby('Date').max().loc[:, ['High', 'Time']]

Note, though, that this takes the max of each column independently, so the Time shown is the latest time in each day, not the time at which the High occurred.
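If only one row per day is wanted (discarding ties) while keeping every column, idxmax/idxmin is another common pattern; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'Time': ['09:30', '09:32', '09:30', '09:32'],
    'High': [10.0, 12.0, 11.0, 9.0],
    'Low':  [9.0, 11.0, 8.0, 7.0],
})
# idxmax/idxmin return the row label of each group's extreme;
# .loc then pulls the full rows, Time column included
daily_high = df.loc[df.groupby('Date')['High'].idxmax()]
daily_low = df.loc[df.groupby('Date')['Low'].idxmin()]
print(daily_high[['Date', 'Time', 'High']])
```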
Datetime fails when setting astype, date mangled
I am importing a csv of 20 variables and 1500 records. There are 5 date columns in UK date format dd/mm/yyyy, which import as str. I need to be able to subtract one date from another: they are hospital admissions, and I need to subtract the admission date from the discharge date to get the length of stay. I have had a number of problems. To illustrate, I have used 2 columns.

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv("/Users........csv", usecols=['ADMIDATE', 'DISDATE'])
df

        ADMIDATE     DISDATE
0     04/02/2018  07/02/2018
1     25/07/2017  1801-01-01
2     28/06/2017  01/07/2017
3     22/06/2017  1801-01-01
4     11/12/2017  15/12/2017
...          ...         ...
1503  25/01/2019  27/01/2019
1504  31/08/2018  1801-01-01
1505  20/09/2018  05/11/2018
1506  28/09/2018  1801-01-01
1507  21/02/2019  24/02/2019

1508 rows × 2 columns

I removed about 100 records with a DISDATE of 1801-01-01 - these are likely bad data from the patient still being in hospital when the data was collected. To convert the dates to datetime, I used .astype('datetime64[ns]'), because I didn't know how to use pd.to_datetime on multiple columns.

df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1399 non-null   int64
 1   ADMIDATE    1399 non-null   datetime64[ns]
 2   DISDATE     1391 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB

So the conversion appears to have worked. However, on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
    Unnamed: 0    ADMIDATE     DISDATE
0            0  2018-04-02  2018-07-02
1            2  2017-06-28  2017-01-07
2            4  2017-11-12  2017-12-15
3            5  2017-09-04  2017-12-04
4            6  2017-05-30  2017-01-06
5            7  2017-02-08  2017-07-08
6            8  2017-11-17  2017-11-18
7            9  2018-03-14  2018-03-20
8           10  2017-04-26  2017-03-05
9           11  2017-05-16  2017-05-17
10          12  2018-01-17  2018-01-19
11          13  2017-12-18  2017-12-20
12          14  2017-02-10  2017-04-10
13          16  2017-03-30  2017-07-04
14          17  2017-01-12  2017-12-18
15          18  2017-12-07  2017-07-14
16          19  2017-05-04  2017-08-04
17          20  2017-10-30  2017-01-11
18          21  2017-06-19  2017-06-22
19          22  2017-04-05  2017-08-05

So when I subtract ADMIDATE from DISDATE I am getting negative values.

df['DISDATE'] - df['ADMIDATE']
0       91 days
1     -172 days
2       33 days
3       91 days
4     -144 days
         ...
1394   188 days
1395  -291 days
1396     2 days
1397  -132 days
1398     3 days
Length: 1399, dtype: timedelta64[ns]

I would like a method that works on all my date columns, keeps the UK format, and allows me to do basic operations on the date fields.

After the suggestion from @code-different below, which seems very sensible:

for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')

the format is unchanged despite dayfirst=True.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1399 non-null   datetime64[ns]
 1   ADMIDATE    1399 non-null   datetime64[ns]
 2   DISDATE     1391 non-null   datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB

df.head()
                      Unnamed: 0    ADMIDATE     DISDATE
0  1970-01-01 00:00:00.000000000  2018-04-02  2018-07-02
1  1970-01-01 00:00:00.000000002  2017-06-28  2017-01-07
2  1970-01-01 00:00:00.000000004  2017-11-12  2017-12-15
3  1970-01-01 00:00:00.000000005  2017-09-04  2017-12-04
4  1970-01-01 00:00:00.000000006  2017-05-30  2017-01-06

I have also tried format='%d%m%Y' and still the year is first. Would datetime.strptime be any good?
Just tell pandas.to_datetime to use a specific and adequate format, e.g.:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017', '28/06/2017', '22/06/2017', '11/12/2017'],
                   'DISDATE': ['07/02/2018', '1801-01-01', '01/07/2017', '1801-01-01', '15/12/2017']}
                  ).replace({'1801-01-01': np.datetime64('NaT')})

for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')

# df
#     ADMIDATE    DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25        NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22        NaT
# 4 2017-12-11 2017-12-15

#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   ADMIDATE  5 non-null      datetime64[ns]
#  1   DISDATE   3 non-null      datetime64[ns]
# dtypes: datetime64[ns](2)

Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns, so you just loop over the columns one by one. The strings are in UK format (day-first), so you simply tell to_datetime that:

df = pd.read_csv('/path/to/file.csv', usecols=['ADMIDATE', 'DISDATE']).replace({'1801-01-01': pd.NA})
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')

astype('datetime64[ns]') is too inflexible for what you need.
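A minimal runnable sketch of that loop, on two made-up rows in the question's dd/mm/yyyy format, confirming that dayfirst=True parses day-first and that the subtraction then gives a sensible length of stay:

```python
import pandas as pd

# made-up sample in UK dd/mm/yyyy format
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '28/06/2017'],
                   'DISDATE':  ['07/02/2018', '01/07/2017']})
for col in df.columns:
    # dayfirst=True reads dd/mm/yyyy; errors='coerce' turns bad values into NaT
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
df['LOS'] = df['DISDATE'] - df['ADMIDATE']  # length of stay as timedelta64
print(df['LOS'].tolist())
```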
Python to convert different date formats in a column
I am trying to convert a column which has different date formats. For example:

month
2018-01-01             float64
2018-02-01             float64
2018-03-01             float64
2018-03-01 00:00:00    float64
2018-04-01 01:00:00    float64
2018-05-01 01:00:00    float64
2018-06-01 01:00:00    float64
2018-07-01 01:00:00    float64

I want to convert everything in the column to just month and year, e.g. Jan-18, Feb-18, Mar-18, etc. I have tried first converting my column to datetime with:

df['month'] = pd.to_datetime(df['month'], format='%Y-%m-%d')

But it returns float64:

month
2018-01-01 00:00:00    float64
2018-02-01 00:00:00    float64
2018-03-01 00:00:00    float64
2018-04-01 01:00:00    float64
2018-05-01 01:00:00    float64
2018-06-01 01:00:00    float64
2018-07-01 01:00:00    float64

In my output to CSV, the month format has been changed to 01/05/2016 00:00:00. Can you please help me convert to just month and year, e.g. Aug-18? Thank you.
I assume you have a Pandas dataframe. In this case, you can use pd.Series.dt.to_period:

s = pd.Series(['2018-01-01', '2018-02-01', '2018-03-01',
               '2018-03-01 00:00:00', '2018-04-01 01:00:00'])
res = pd.to_datetime(s).dt.to_period('M')
print(res)

0    2018-01
1    2018-02
2    2018-03
3    2018-03
4    2018-04
dtype: object

As you can see, this results in a series of dtype object here (newer pandas versions report a dedicated period[M] dtype), which is generally inefficient. A better idea is to set the day to the last of the month and maintain a datetime series internally represented by integers.
extract hour from timestamp with python
I have a dataframe df_energy2:

df_energy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29974 entries, 0 to 29973
Data columns (total 4 columns):
TIMESTAMP        29974 non-null datetime64[ns]
P_ACT_KW         29974 non-null int64
PERIODE_TARIF    29974 non-null object
P_SOUSCR         29974 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 936.8+ KB

with this structure:

df_energy2.head()
TIMESTAMP            P_ACT_KW  PERIODE_TARIF  P_SOUSCR
2016-01-01 00:00:00  116       HC             250
2016-01-01 00:10:00  121       HC             250

Is there any python function which can extract the hour from TIMESTAMP? Kind regards
I think you need dt.hour:

print(df.TIMESTAMP.dt.hour)
0    0
1    0
Name: TIMESTAMP, dtype: int64

df['hours'] = df.TIMESTAMP.dt.hour
print(df)
            TIMESTAMP  P_ACT_KW PERIODE_TARIF  P_SOUSCR  hours
0 2016-01-01 00:00:00       116            HC       250      0
1 2016-01-01 00:10:00       121            HC       250      0
Given your data:

df_energy2.head()
TIMESTAMP            P_ACT_KW  PERIODE_TARIF  P_SOUSCR
2016-01-01 00:00:00  116       HC             250
2016-01-01 00:10:00  121       HC             250

you have the timestamp as the index. For extracting hours from a timestamp index of the dataframe:

hours = df_energy2.index.hour

Edit: Yes, jezrael, you're right. Putting what he has stated: a pandas datetime column has an accessor for this, i.e. dt:

<dataframe>.<ts_column>.dt.hour

Example in your context - the column with the date is TIMESTAMP:

df.TIMESTAMP.dt.hour

A similar question - Pandas, dataframe with a datetime64 column, querying by hour
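Both cases - the hour of a datetime column and of a DatetimeIndex - in one runnable sketch, using the question's TIMESTAMP column name on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2016-01-01 00:10:00', '2016-01-01 13:20:00']),
    'P_ACT_KW': [121, 130],
})
# column case: the .dt accessor exposes datetime components
df['hours'] = df['TIMESTAMP'].dt.hour
# index case: a DatetimeIndex exposes .hour directly
idx_hours = df.set_index('TIMESTAMP').index.hour
print(df['hours'].tolist(), list(idx_hours))
```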