Pandas: Adding varying numbers of days to a date in a dataframe - python
I have a dataframe with a date column and then a number of days that I want to add to that column. I want to create a new column, 'Recency_Date', with the resulting value.
df:
fan Community Name Count Mean_Days Date_Min
0 855 AAA Games 6 353 2013-04-16
1 855 First Person Shooters 2 420 2012-10-16
2 855 Playstation 3 108 2014-06-12
3 3148 AAA Games 1 0 2015-04-17
4 3148 Mobile Gaming 1 0 2013-01-19
df info:
merged.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4627415 entries, 0 to 4627414
Data columns (total 5 columns):
fan int64
Community Name object
Count int64
Mean_Days int32
Date_Min datetime64[ns]
dtypes: datetime64[ns](1), int32(1), int64(2), object(1)
memory usage: 194.2+ MB
Sample data as csv:
fan,Community Name,Count,Mean_Days,Date_Min
855,AAA Games,6,353,2013-04-16 00:00:00
855,First Person Shooters,2,420,2012-10-16 00:00:00
855,Playstation,3,108,2014-06-12 00:00:00
3148,AAA Games,1,0,2015-04-17 00:00:00
3148,Mobile Gaming,1,0,2013-01-19 00:00:00
3148,Power PCs,2,0,2014-06-17 00:00:00
3148,XBOX,1,0,2009-11-12 00:00:00
3860,AAA Games,1,0,2012-11-28 00:00:00
3860,Minecraft,3,393,2011-09-07 00:00:00
4044,AAA Games,5,338,2010-11-15 00:00:00
4044,Blizzard Games,1,0,2013-07-12 00:00:00
4044,Geek Culture,1,0,2011-06-03 00:00:00
4044,Indie Games,2,112,2013-01-09 00:00:00
4044,Minecraft,1,0,2014-01-02 00:00:00
4044,Professional Gaming,1,0,2014-01-02 00:00:00
4044,XBOX,2,785,2010-11-15 00:00:00
4827,AAA Games,1,0,2010-08-24 00:00:00
4827,Gaming Humour,1,0,2012-05-05 00:00:00
4827,Minecraft,2,10,2012-03-21 00:00:00
5260,AAA Games,4,27,2013-09-17 00:00:00
5260,Indie Games,8,844,2011-06-08 00:00:00
5260,MOBA,2,0,2012-10-27 00:00:00
5260,Minecraft,5,106,2012-02-17 00:00:00
5260,XBOX,1,0,2011-06-15 00:00:00
5484,AAA Games,21,1296,2009-08-01 00:00:00
5484,Free to Play,1,0,2014-12-08 00:00:00
5484,Indie Games,1,0,2014-05-28 00:00:00
5484,Music Games,1,0,2012-09-12 00:00:00
5484,Playstation,1,0,2012-02-22 00:00:00
I've tried:
merged['Recency_Date'] = merged['Date_Min'] + timedelta(days=merged['Mean_Days'])
and:
merged['Recency_Date'] = pd.DatetimeIndex(merged['Date_Min']) + pd.DateOffset(merged['Mean_Days'])
But I'm having trouble finding something that works on a whole Series rather than an individual int value. Any help would be much appreciated.
If the 'Date_Min' dtype is already datetime, then you can construct a TimedeltaIndex from your 'Mean_Days' column and add it:
In [174]:
import datetime as dt
import pandas as pd

df = pd.DataFrame({'Date_Min':[dt.datetime.now(), dt.datetime(2015,3,4), dt.datetime(2011,6,9)], 'Mean_Days':[1,2,3]})
df
Out[174]:
Date_Min Mean_Days
0 2015-09-15 14:02:37.452369 1
1 2015-03-04 00:00:00.000000 2
2 2011-06-09 00:00:00.000000 3
In [175]:
df['Date_Min'] + pd.TimedeltaIndex(df['Mean_Days'], unit='D')
Out[175]:
0 2015-09-16 14:02:37.452369
1 2015-03-06 00:00:00.000000
2 2011-06-12 00:00:00.000000
Name: Date_Min, dtype: datetime64[ns]
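On current pandas versions the same idea is usually written with pd.to_timedelta, which accepts the whole integer column directly. A minimal sketch using the question's column names on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Date_Min': pd.to_datetime(['2013-04-16', '2012-10-16', '2014-06-12']),
    'Mean_Days': [353, 420, 108],
})
# to_timedelta turns the integer Series into day-valued timedeltas,
# which add elementwise to the datetime column
df['Recency_Date'] = df['Date_Min'] + pd.to_timedelta(df['Mean_Days'], unit='D')
print(df['Recency_Date'])
```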
Related
how to change datetime format in pandas. fastest way?
I import from .csv and get object-type columns:

                           begin                            end
0  2019-03-29 17:02:32.838469+00  2019-04-13 17:32:19.134874+00
1  2019-06-13 19:22:19.331201+00  2019-06-13 19:51:21.987534+00
2  2019-03-27 06:56:51.138795+00  2019-03-27 06:56:54.834751+00
3  2019-05-28 11:09:29.320478+00  2019-05-29 06:47:21.794092+00
4  2019-03-24 07:03:03.582679+00  2019-03-24 09:50:32.595199+00

I need to get them in datetime format:

       begin        end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24

What I do (first I convert to datetime format, then cut to dates only, then convert to datetime format again):

df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)
df['begin'] = df['begin'].dt.date
df['end'] = df['end'].dt.date
df['begin'] = pd.to_datetime(df['begin'], dayfirst=False)
df['end'] = pd.to_datetime(df['end'], dayfirst=False)

Is there any shorter way to do this without converting twice?
You can use .apply() on the 2 columns, calling pd.to_datetime once per column, and use dt.normalize() to remove the time info while keeping the datetime format (dt.tz_localize(None) is also used to remove the timezone info):

df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())

Result:

print(df)
       begin        end
0 2019-03-29 2019-04-13
1 2019-06-13 2019-06-13
2 2019-03-27 2019-03-27
3 2019-05-28 2019-05-29
4 2019-03-24 2019-03-24

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   begin   5 non-null      datetime64[ns]  <=== datetime format
 1   end     5 non-null      datetime64[ns]  <=== datetime format
dtypes: datetime64[ns](2)
memory usage: 120.0 bytes
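The one-liner above can be checked end to end on a small sample built from the question's values (only two rows here, for brevity):

```python
import pandas as pd

# object-dtype columns as they would come out of read_csv
df = pd.DataFrame({
    'begin': ['2019-03-29 17:02:32.838469+00', '2019-06-13 19:22:19.331201+00'],
    'end':   ['2019-04-13 17:32:19.134874+00', '2019-06-13 19:51:21.987534+00'],
})
# parse, drop the UTC offset, then truncate to midnight - still datetime64
df = df.apply(lambda x: pd.to_datetime(x).dt.tz_localize(None).dt.normalize())
print(df.dtypes)
```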
Grouping data in DF but keeping all columns in Python
I have a df that includes high and low stock prices by day in 2-minute increments. I am trying to find the high and low for each day. I am able to do so using the code below, but the output only gives me the date and price data. I need to have the time column available as well. I've tried about 100 different ways but cannot get it to work.

high = df.groupby('Date')['High'].max()
low = df.groupby('Date')['Low'].min()

Below are my columns and dtypes.

 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   High    4277 non-null   float64
 1   Low     4277 non-null   float64
 2   Date    4277 non-null   object
 3   Time    4277 non-null   object

Any suggestions?
transform with boolean indexing:

# sample data
np.random.seed(10)
df = pd.DataFrame([pd.date_range('2020-01-01', '2020-01-03', freq='H'),
                   np.random.randint(1, 10000, 49),
                   np.random.randint(1, 10, 49)]).T
df.columns = ['date', 'high', 'low']
df['time'] = df['date'].dt.time
df['date'] = df['date'].dt.date

# transform max and min, then assign to a variable
mx = df.groupby('date')['high'].transform(max)
mn = df.groupby('date')['low'].transform(min)

# boolean indexing
high = df[df['high'] == mx]
low = df[df['low'] == mn]

# high
          date  high  low      time
4   2020-01-01  9373    9  04:00:00
42  2020-01-02  9647    2  18:00:00
48  2020-01-03    45    5  00:00:00

# low
          date  high  low      time
14  2020-01-01  2103    1  14:00:00
15  2020-01-01  3417    1  15:00:00
23  2020-01-01   654    1  23:00:00
27  2020-01-02  2701    1  03:00:00
30  2020-01-02   284    1  06:00:00
36  2020-01-02  6160    1  12:00:00
38  2020-01-02   631    1  14:00:00
40  2020-01-02  3417    1  16:00:00
44  2020-01-02  6860    1  20:00:00
45  2020-01-02  8989    1  21:00:00
47  2020-01-02  2811    1  23:00:00
48  2020-01-03    45    5  00:00:00
Do you want this:

# should use datetime type:
df['Date'] = pd.to_datetime(df['Date'])
df.groupby(df.Date.dt.normalize()).agg({'High': 'max', 'Low': 'min'})
After you apply groupby and the min or max function, you can select the columns using loc or iloc:

df.groupby('Date').max().loc[:, ['High', 'Time']]

Note, though, that this takes the max of each column independently, so the Time shown is the latest time in each day, not the time at which the High occurred.
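If only one row per day is wanted (discarding ties) while keeping every column, idxmax/idxmin is another common pattern; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'Time': ['09:30', '09:32', '09:30', '09:32'],
    'High': [10.0, 12.0, 11.0, 9.0],
    'Low':  [9.0, 11.0, 8.0, 7.0],
})
# idxmax/idxmin return the row label of each group's extreme;
# .loc then pulls the full rows, Time column included
daily_high = df.loc[df.groupby('Date')['High'].idxmax()]
daily_low = df.loc[df.groupby('Date')['Low'].idxmin()]
print(daily_high[['Date', 'Time', 'High']])
```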
Datetime fails when setting astype, date mangled
I am importing a csv of 20 variables and 1500 records. There are 5 date columns in UK date format dd/mm/yyyy, which import as str. I need to be able to subtract one date from another: they are hospital admissions, and I need to subtract the admission date from the discharge date to get the length of stay. I have had a number of problems. To illustrate, I have used 2 columns.

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv("/Users........csv", usecols=['ADMIDATE', 'DISDATE'])
df

        ADMIDATE     DISDATE
0     04/02/2018  07/02/2018
1     25/07/2017  1801-01-01
2     28/06/2017  01/07/2017
3     22/06/2017  1801-01-01
4     11/12/2017  15/12/2017
...          ...         ...
1503  25/01/2019  27/01/2019
1504  31/08/2018  1801-01-01
1505  20/09/2018  05/11/2018
1506  28/09/2018  1801-01-01
1507  21/02/2019  24/02/2019

1508 rows × 2 columns

I removed about 100 records with a DISDATE of 1801-01-01 - these are likely bad data from the patient still being in hospital when the data was collected. To convert the dates to datetime, I used .astype('datetime64[ns]'), because I didn't know how to use pd.to_datetime on multiple columns.

df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1399 non-null   int64
 1   ADMIDATE    1399 non-null   datetime64[ns]
 2   DISDATE     1391 non-null   datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB

So the conversion appears to have worked. However, on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
    Unnamed: 0    ADMIDATE     DISDATE
0            0  2018-04-02  2018-07-02
1            2  2017-06-28  2017-01-07
2            4  2017-11-12  2017-12-15
3            5  2017-09-04  2017-12-04
4            6  2017-05-30  2017-01-06
5            7  2017-02-08  2017-07-08
6            8  2017-11-17  2017-11-18
7            9  2018-03-14  2018-03-20
8           10  2017-04-26  2017-03-05
9           11  2017-05-16  2017-05-17
10          12  2018-01-17  2018-01-19
11          13  2017-12-18  2017-12-20
12          14  2017-02-10  2017-04-10
13          16  2017-03-30  2017-07-04
14          17  2017-01-12  2017-12-18
15          18  2017-12-07  2017-07-14
16          19  2017-05-04  2017-08-04
17          20  2017-10-30  2017-01-11
18          21  2017-06-19  2017-06-22
19          22  2017-04-05  2017-08-05

So when I subtract ADMIDATE from DISDATE I am getting negative values.

df['DISDATE'] - df['ADMIDATE']
0       91 days
1     -172 days
2       33 days
3       91 days
4     -144 days
         ...
1394   188 days
1395  -291 days
1396     2 days
1397  -132 days
1398     3 days
Length: 1399, dtype: timedelta64[ns]

I would like a method that works on all my date columns, keeps the UK format, and allows me to do basic operations on the date fields.

After the suggestion from @code-different below, which seems very sensible:

for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')

the format is unchanged despite dayfirst=True.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  1399 non-null   datetime64[ns]
 1   ADMIDATE    1399 non-null   datetime64[ns]
 2   DISDATE     1391 non-null   datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB

df.head()
                      Unnamed: 0    ADMIDATE     DISDATE
0  1970-01-01 00:00:00.000000000  2018-04-02  2018-07-02
1  1970-01-01 00:00:00.000000002  2017-06-28  2017-01-07
2  1970-01-01 00:00:00.000000004  2017-11-12  2017-12-15
3  1970-01-01 00:00:00.000000005  2017-09-04  2017-12-04
4  1970-01-01 00:00:00.000000006  2017-05-30  2017-01-06

I have also tried format='%d%m%Y' and still the year is first. Would datetime.strptime be any good?
Just tell pandas.to_datetime to use a specific and adequate format, e.g.:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017', '28/06/2017', '22/06/2017', '11/12/2017'],
                   'DISDATE': ['07/02/2018', '1801-01-01', '01/07/2017', '1801-01-01', '15/12/2017']}
                  ).replace({'1801-01-01': np.datetime64('NaT')})

for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')

# df
#     ADMIDATE    DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25        NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22        NaT
# 4 2017-12-11 2017-12-15

#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   ADMIDATE  5 non-null      datetime64[ns]
#  1   DISDATE   3 non-null      datetime64[ns]
# dtypes: datetime64[ns](2)

Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns, so you just loop over the columns one by one. The strings are in UK format (day-first), so you simply tell to_datetime that:

df = pd.read_csv('/path/to/file.csv', usecols=['ADMIDATE', 'DISDATE']).replace({'1801-01-01': pd.NA})
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')

astype('datetime64[ns]') is too inflexible for what you need.
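A minimal runnable sketch of that loop, on two made-up rows in the question's dd/mm/yyyy format, confirming that dayfirst=True parses day-first and that the subtraction then gives a sensible length of stay:

```python
import pandas as pd

# made-up sample in UK dd/mm/yyyy format
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '28/06/2017'],
                   'DISDATE':  ['07/02/2018', '01/07/2017']})
for col in df.columns:
    # dayfirst=True reads dd/mm/yyyy; errors='coerce' turns bad values into NaT
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
df['LOS'] = df['DISDATE'] - df['ADMIDATE']  # length of stay as timedelta64
print(df['LOS'].tolist())
```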
Python to convert different date formats in a column
I am trying to convert a column which has different date formats. For example:

month
2018-01-01             float64
2018-02-01             float64
2018-03-01             float64
2018-03-01 00:00:00    float64
2018-04-01 01:00:00    float64
2018-05-01 01:00:00    float64
2018-06-01 01:00:00    float64
2018-07-01 01:00:00    float64

I want to convert everything in the column to just month and year, e.g. Jan-18, Feb-18, Mar-18, etc. I have tried first converting my column to datetime with:

df['month'] = pd.to_datetime(df['month'], format='%Y-%m-%d')

But it returns float64:

month
2018-01-01 00:00:00    float64
2018-02-01 00:00:00    float64
2018-03-01 00:00:00    float64
2018-04-01 01:00:00    float64
2018-05-01 01:00:00    float64
2018-06-01 01:00:00    float64
2018-07-01 01:00:00    float64

In my output to CSV, the month format has been changed to 01/05/2016 00:00:00. Can you please help me convert to just month and year, e.g. Aug-18? Thank you.
I assume you have a Pandas dataframe. In this case, you can use pd.Series.dt.to_period:

s = pd.Series(['2018-01-01', '2018-02-01', '2018-03-01',
               '2018-03-01 00:00:00', '2018-04-01 01:00:00'])
res = pd.to_datetime(s).dt.to_period('M')
print(res)

0    2018-01
1    2018-02
2    2018-03
3    2018-03
4    2018-04
dtype: object

As you can see, this results in a series of dtype object here (newer pandas versions report a dedicated period[M] dtype), which is generally inefficient. A better idea is to set the day to the last of the month and maintain a datetime series internally represented by integers.
extract hour from timestamp with python
I have a dataframe df_energy2:

df_energy2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29974 entries, 0 to 29973
Data columns (total 4 columns):
TIMESTAMP        29974 non-null datetime64[ns]
P_ACT_KW         29974 non-null int64
PERIODE_TARIF    29974 non-null object
P_SOUSCR         29974 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 936.8+ KB

with this structure:

df_energy2.head()
TIMESTAMP            P_ACT_KW  PERIODE_TARIF  P_SOUSCR
2016-01-01 00:00:00  116       HC             250
2016-01-01 00:10:00  121       HC             250

Is there any python function which can extract the hour from TIMESTAMP? Kind regards
I think you need dt.hour:

print(df.TIMESTAMP.dt.hour)
0    0
1    0
Name: TIMESTAMP, dtype: int64

df['hours'] = df.TIMESTAMP.dt.hour
print(df)
            TIMESTAMP  P_ACT_KW PERIODE_TARIF  P_SOUSCR  hours
0 2016-01-01 00:00:00       116            HC       250      0
1 2016-01-01 00:10:00       121            HC       250      0
Given your data:

df_energy2.head()
TIMESTAMP            P_ACT_KW  PERIODE_TARIF  P_SOUSCR
2016-01-01 00:00:00  116       HC             250
2016-01-01 00:10:00  121       HC             250

you have the timestamp as the index. For extracting hours from a timestamp index of the dataframe:

hours = df_energy2.index.hour

Edit: Yes, jezrael, you're right. Putting what he has stated: a pandas datetime column has an accessor for this, i.e. dt:

<dataframe>.<ts_column>.dt.hour

Example in your context - the column with the date is TIMESTAMP:

df.TIMESTAMP.dt.hour

A similar question - Pandas, dataframe with a datetime64 column, querying by hour
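Both cases - the hour of a datetime column and of a DatetimeIndex - in one runnable sketch, using the question's TIMESTAMP column name on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    'TIMESTAMP': pd.to_datetime(['2016-01-01 00:10:00', '2016-01-01 13:20:00']),
    'P_ACT_KW': [121, 130],
})
# column case: the .dt accessor exposes datetime components
df['hours'] = df['TIMESTAMP'].dt.hour
# index case: a DatetimeIndex exposes .hour directly
idx_hours = df.set_index('TIMESTAMP').index.hour
print(df['hours'].tolist(), list(idx_hours))
```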