I have a data frame:
ID Date Volume
1 2019Q1 9
1 2020Q2 11
2 2019Q3 39
2 2020Q4 23
I want to convert this yyyy-Qn format to datetime.
I have used a dictionary to map the quarters to their corresponding dates, but I need more generalized code for instances where the yyyy changes.
Expected output:
ID Date Volume
1 2019-03 9
1 2020-06 11
2 2019-09 39
2 2020-12 23
Let's use pd.PeriodIndex:
df['Date_new'] = pd.PeriodIndex(df['Date'], freq='Q').strftime('%Y-%m')
Output:
ID Date Volume Date_new
0 1 2019Q1 9 2019-03
1 1 2020Q2 11 2020-06
2 2 2019Q3 39 2019-09
3 2 2020Q4 23 2020-12
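If you also need an actual datetime column rather than the formatted string, the same PeriodIndex converts directly. A minimal sketch, assuming you want quarter-end dates (Quarter_end is just an illustrative column name):
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Date': ['2019Q1', '2020Q2', '2019Q3', '2020Q4'],
                   'Volume': [9, 11, 39, 23]})
periods = pd.PeriodIndex(df['Date'], freq='Q')
df['Date_new'] = periods.strftime('%Y-%m')                       # '2019-03', '2020-06', ...
df['Quarter_end'] = periods.to_timestamp(how='end').normalize()  # 2019-03-31, 2020-06-30, ...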
Here's a simple solution, though not as efficient (which shouldn't be a problem if your dataset is not too large).
Convert the date column to datetime using to_datetime, then add 2 months to each date, since you want the month to be the end-of-quarter month:
df = pd.DataFrame({'date': ["2019Q1" ,"2019Q3", "2019Q2", "2020Q4"], 'volume': [1,2,3, 4]})
df['datetime'] = pd.to_datetime(df['date'])
df['datetime'] = df['datetime'] + pd.DateOffset(months=2)
Output (the datetimes land on the end-of-quarter month):
date volume datetime
0 2019Q1 1 2019-03-01
1 2019Q3 2 2019-09-01
2 2019Q2 3 2019-06-01
3 2020Q4 4 2020-12-01
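If you want the exact YYYY-MM strings from the expected output, a small follow-up (not part of the original answer) is to format the datetime column afterwards:
df['datetime'] = df['datetime'].dt.strftime('%Y-%m')  # '2019-03', '2019-09', '2019-06', '2020-12'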
Related
For example, I have several columns of dates and I want to get the month from them. Is there a way to loop through columns instead of running pd.DatetimeIndex(df['date']).month
multiple times? The example below is simplified. The real dataset has many more columns.
import pandas as pd
import numpy as np
np.random.seed(0)
rng_start = pd.date_range('2015-07-24', periods=5, freq='M')
rng_mid = pd.date_range('2019-06-24', periods=5, freq='M')
rng_end = pd.date_range('2022-03-24', periods=5, freq='M')
df = pd.DataFrame({ 'start_date': rng_start, 'mid_date': rng_mid, 'end_date': rng_end })
df
start_date mid_date end_date
0 2015-07-31 2019-06-30 2022-03-31
1 2015-08-31 2019-07-31 2022-04-30
2 2015-09-30 2019-08-31 2022-05-31
3 2015-10-31 2019-09-30 2022-06-30
4 2015-11-30 2019-10-31 2022-07-31
The intended output would be
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
You answered your question by saying "loop through columns":
for column in df:
    df[column.replace("_date", "_month")] = df[column].dt.month
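If the real frame also contains non-date columns, a variation of the same loop that only touches columns whose names end in _date (assuming that naming convention) would be:
for column in df.filter(like='_date').columns:
    df[column.replace('_date', '_month')] = df[column].dt.month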
An alternative solution (a variation of @BENY's):
df[df.columns.str.replace("_date", "_month")] = df.apply(lambda x: x.dt.month, axis=1)
Try apply
df[['start_month', 'mid_month', 'end_month']] = df.apply(lambda x: x.dt.month, axis=1)
df
Out[244]:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
You can avoid looping using stack:
out = df.join(df.filter(like='_date')    # select _date columns
                .stack()                 # convert to Series
                .dt.month
                .unstack()               # back to DataFrame
                .rename(columns=lambda x: x.replace('_date', '_month'))
             )
Output:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
Quite similar to this solution, but a bit different:
df.join(df.applymap(lambda x: x.month)
          .set_axis(['start_month', 'mid_month', 'end_month'], axis=1))
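Note: in pandas 2.1+ DataFrame.applymap is deprecated in favour of DataFrame.map, so on a recent version the same idea would read:
df.join(df.map(lambda x: x.month)
          .set_axis(['start_month', 'mid_month', 'end_month'], axis=1))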
I am working with a pandas dataframe with a date column. I have converted the dtype of this column from object to datetime using pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, the 11th output after converting to datetime is wrong: the month is swapped with the day. This is affecting my further analysis. How can I sort this out?
Use the dayfirst=True parameter or specify the format, because pandas by default matches the month first, if possible:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')
Method 1
Look into to_datetime: there is a parameter named dayfirst; set it to True.
Method 2
Use the format parameter in the to_datetime function.
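For example, applied to the value that was mis-parsed in the question (row 11, '12-03-2020'), both methods give the intended date:
import pandas as pd
pd.to_datetime('12-03-2020')                      # 2020-12-03 (month matched first)
pd.to_datetime('12-03-2020', dayfirst=True)       # 2020-03-12
pd.to_datetime('12-03-2020', format='%d-%m-%Y')   # 2020-03-12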
I have data that looks like this.
VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag
2 1/1/2018 0:18:50 1/1/2018 12:24:39 AM N
2 1/1/2018 0:30:26 1/1/2018 12:46:42 AM N
2 1/1/2018 0:07:25 1/1/2018 12:19:45 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:32:40 1/1/2018 12:33:41 AM N
2 1/1/2018 0:38:35 1/1/2018 1:08:50 AM N
2 1/1/2018 0:18:41 1/1/2018 12:28:22 AM N
2 1/1/2018 0:38:02 1/1/2018 12:55:02 AM N
2 1/1/2018 0:05:02 1/1/2018 12:18:35 AM N
2 1/1/2018 0:35:23 1/1/2018 12:42:07 AM N
So, I converted df.lpep_pickup_datetime to datetime, but originally it comes in as a string. I'm not sure which one is easier to work with. I want to append 5 fields onto my current dataframe: year, month, day, weekday, and hour.
I tried this:
df['Year']=[d.split('-')[0] for d in df.lpep_pickup_datetime]
df['Month']=[d.split('-')[1] for d in df.lpep_pickup_datetime]
df['Day']=[d.split('-')[2] for d in df.lpep_pickup_datetime]
That gives me this error: AttributeError: 'Timestamp' object has no attribute 'split'
I tried this:
df2 = pd.DataFrame(df.lpep_pickup_datetime.dt.strftime('%m-%d-%Y-%H').str.split('/').tolist(),
columns=['Month', 'Day', 'Year', 'Hour'],dtype=int)
df = pd.concat((df,df2),axis=1)
That gives me this error: AssertionError: 4 columns passed, passed data had 1 columns
Basically, I want to parse df.lpep_pickup_datetime into year, month, day, weekday, and hour, appending each to the same dataframe. How can I do that?
Thanks!!
Here you go. First I create a random dataset and copy the date column under the name you want, so you can just run the code. Pandas has extensive built-in time-series handling, so you don't actually need to import datetime; the pandas time-series documentation has a lot more information about it:
import pandas as pd
date_rng = pd.date_range(start='1/1/2018', end='4/01/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
df['lpep_pickup_datetime'] = df['date']
df['year'] = df['lpep_pickup_datetime'].dt.year
df['month'] = df['lpep_pickup_datetime'].dt.month
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday
df['day'] = df['lpep_pickup_datetime'].dt.day
df['hour'] = df['lpep_pickup_datetime'].dt.hour
print(df)
Output:
date lpep_pickup_datetime year month weekday day hour
0 2018-01-01 00:00:00 2018-01-01 00:00:00 2018 1 0 1 0
1 2018-01-01 01:00:00 2018-01-01 01:00:00 2018 1 0 1 1
2 2018-01-01 02:00:00 2018-01-01 02:00:00 2018 1 0 1 2
3 2018-01-01 03:00:00 2018-01-01 03:00:00 2018 1 0 1 3
4 2018-01-01 04:00:00 2018-01-01 04:00:00 2018 1 0 1 4
... ... ... ... ... ... ... ...
2156 2018-03-31 20:00:00 2018-03-31 20:00:00 2018 3 5 31 20
2157 2018-03-31 21:00:00 2018-03-31 21:00:00 2018 3 5 31 21
2158 2018-03-31 22:00:00 2018-03-31 22:00:00 2018 3 5 31 22
2159 2018-03-31 23:00:00 2018-03-31 23:00:00 2018 3 5 31 23
2160 2018-04-01 00:00:00 2018-04-01 00:00:00 2018 4 6 1 0
EDIT: Since this is not working (as stated in the comments on this answer), I believe your data is formatted incorrectly. Try this before applying anything:
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'], format='%d/%m/%Y %H:%M:%S')
If this format is recognized properly, you should have no trouble using dt.year, dt.month, dt.hour, dt.day, dt.weekday.
Give this a go. Since your dates are in the datetime dtype already, just use the datetime properties to extract each part.
import pandas as pd
from datetime import datetime as dt
# Creating a fake dataset of dates.
dates = [dt.now().strftime('%d/%m/%Y %H:%M:%S') for i in range(10)]
df = pd.DataFrame({'lpep_pickup_datetime': dates})
df['lpep_pickup_datetime'] = pd.to_datetime(df['lpep_pickup_datetime'])
# Parse each date into its parts and store as a new column.
df['month'] = df['lpep_pickup_datetime'].dt.month
df['day'] = df['lpep_pickup_datetime'].dt.day
df['year'] = df['lpep_pickup_datetime'].dt.year
# ... and so on ...
Output:
lpep_pickup_datetime month day year
0 2019-09-24 16:46:10 9 24 2019
1 2019-09-24 16:46:10 9 24 2019
2 2019-09-24 16:46:10 9 24 2019
3 2019-09-24 16:46:10 9 24 2019
4 2019-09-24 16:46:10 9 24 2019
5 2019-09-24 16:46:10 9 24 2019
6 2019-09-24 16:46:10 9 24 2019
7 2019-09-24 16:46:10 9 24 2019
8 2019-09-24 16:46:10 9 24 2019
9 2019-09-24 16:46:10 9 24 2019
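The remaining fields the question asked for follow the same pattern; a short continuation of the sketch above:
df['weekday'] = df['lpep_pickup_datetime'].dt.weekday  # Monday=0 ... Sunday=6
df['hour'] = df['lpep_pickup_datetime'].dt.hour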
I am able to filter a dataframe using a date range:
df[(df['Due Date'] >= '2017-01-01') & (df['Due Date'] <= '2017-02-01')]
but I would like to be able to filter for a year
IIUC you can do it this way:
In [99]: from dateutil.relativedelta import relativedelta
In [100]: today = pd.Timestamp.today()
In [101]: today_next_year = today + relativedelta(years=1)
In [102]: df.loc[df['Due Date'].between(today, today_next_year)]
Out[102]:
Due Date OtherColumn
9 2017-06-30 9
10 2017-09-30 10
11 2017-12-31 11
12 2018-03-31 12
Just to make sure your column is datetime, start with this
df['Due Date'] = pd.to_datetime(df['Due Date'])
Consider the dataframe df
df = pd.DataFrame({
'Due Date': pd.date_range('2015', periods=20, freq='Q'),
'OtherColumn': range(20)
})
you should be able to access the year via the dt accessor:
df[df['Due Date'].dt.year >= 2017]
Due Date OtherColumn
8 2017-03-31 8
9 2017-06-30 9
10 2017-09-30 10
11 2017-12-31 11
12 2018-03-31 12
13 2018-06-30 13
14 2018-09-30 14
15 2018-12-31 15
16 2019-03-31 16
17 2019-06-30 17
18 2019-09-30 18
19 2019-12-31 19
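To restrict the filter to a single year, or to an inclusive range of years, the same dt accessor can be combined with eq or between (a small extension of the example above):
df[df['Due Date'].dt.year.eq(2017)]             # exactly 2017
df[df['Due Date'].dt.year.between(2017, 2018)]  # 2017 through 2018, inclusive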
Or, you can use date filtering on the index
df.set_index('Due Date').loc['2017']
OtherColumn
Due Date
2017-03-31 8
2017-06-30 9
2017-09-30 10
2017-12-31 11
Or
df.set_index('Due Date')['2016':'2017']
OtherColumn
Due Date
2016-03-31 4
2016-06-30 5
2016-09-30 6
2016-12-31 7
2017-03-31 8
2017-06-30 9
2017-09-30 10
2017-12-31 11
Convert df['Due Date'] to datetime and then you can access the year attribute through the dt accessor for filtering. For example:
df['Due Date'] = pd.to_datetime(df['Due Date'], format='%Y-%m-%d')
df[(df['Due Date'].dt.year >= 2017) & (df['Due Date'].dt.year <= 2018)]
So I have a pandas dataframe indexed by date.
I need to grab a value from the dataframe by date...and then grab the value from the dataframe that was the day before...except I can't just subtract a day, since weekends and holidays are missing from the data.
It would be great if I could write:
x = dataframe.ix[date]
and
i = dataframe.ix[date].index
date2 = dataframe[i-1]
I'm not married to this solution. If there is a way to get the date or index number exactly one prior to the date I know, I would be happy...(short of looping through the whole dataframe and testing to see if I have a match, and saving the count...)
Use .get_loc to get the integer position of a label value in the index:
In [51]:
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame(index=pd.date_range(start=dt.datetime(2015, 1, 1), end=dt.datetime(2015, 2, 1)),
                  data={'a': np.arange(32)})
df
Out[51]:
a
2015-01-01 0
2015-01-02 1
2015-01-03 2
2015-01-04 3
2015-01-05 4
2015-01-06 5
2015-01-07 6
2015-01-08 7
2015-01-09 8
2015-01-10 9
2015-01-11 10
2015-01-12 11
2015-01-13 12
2015-01-14 13
2015-01-15 14
2015-01-16 15
2015-01-17 16
2015-01-18 17
2015-01-19 18
2015-01-20 19
2015-01-21 20
2015-01-22 21
2015-01-23 22
2015-01-24 23
2015-01-25 24
2015-01-26 25
2015-01-27 26
2015-01-28 27
2015-01-29 28
2015-01-30 29
2015-01-31 30
2015-02-01 31
Here using .get_loc on the index will return the ordinal position:
In [52]:
df.index.get_loc('2015-01-10')
Out[52]:
9
Pass this value to .iloc to get a row by ordinal position:
In [53]:
df.iloc[df.index.get_loc('2015-01-10')]
Out[53]:
a 9
Name: 2015-01-10 00:00:00, dtype: int32
You can then subtract 1 from this to get the previous row:
In [54]:
df.iloc[df.index.get_loc('2015-01-10') - 1]
Out[54]:
a 8
Name: 2015-01-09 00:00:00, dtype: int32
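Putting the two steps together as a small helper is an illustrative sketch rather than part of the original answer (previous_row is a made-up name, and the date is assumed to exist in the index):
def previous_row(df, date):
    # Integer position of the known date, then step back one row.
    pos = df.index.get_loc(date)
    if pos == 0:
        raise KeyError(f'{date} is the first date in the index')
    return df.iloc[pos - 1]
previous_row(df, '2015-01-10')  # returns the 2015-01-09 row (a == 8)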