Facing a problem converting to datetime - Python

df_non_holidays=pd.read_csv("holidays.csv")
print(df_non_holidays["date"])
0 25-06-21
1 28-06-21
2 29-06-21
3 30-06-21
4 01-07-21
5 02-07-21
6 05-07-21
7 06-07-21
8 07-07-21
9 08-07-21
10 09-07-21
11 12-07-21
12 13-07-21
13 14-07-21
14 15-07-21
15 16-07-21
16 19-07-21
17 20-07-21
18 22-07-21
19 23-07-21
20 26-07-21
21 27-07-21
22 28-07-21
23 29-07-21
24 30-07-21
Name: date, dtype: object
df_non_holidays["date"]= pd.to_datetime(df_non_holidays["date"])
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-01-07
5 2021-02-07
6 2021-05-07
7 2021-06-07
8 2021-07-07
9 2021-08-07
10 2021-09-07
11 2021-12-07
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
After converting, from index 4 onward the day and month are being swapped.
Is anything wrong with my approach to converting the object column to datetime?
Please guide me.

You also need to pass the format string, or you can pass dayfirst=True if you don't want to specify the format. Passing dayfirst may not always work for every kind of datetime string, but passing format will always work as long as you give the right format.
>>> pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-07-01
5 2021-07-02
6 2021-07-05
7 2021-07-06
8 2021-07-07
9 2021-07-08
10 2021-07-09
11 2021-07-12
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
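A small standalone sketch (the values below are assumed for illustration, not taken from the question's data) of that caveat: dayfirst is only a hint that pandas may ignore when day-first parsing is impossible, while an explicit format is strict.
import pandas as pd

# dayfirst is a hint: a value that cannot be parsed day-first is silently
# parsed month-first instead (newer pandas versions emit a warning here).
print(pd.to_datetime("01/13/2021", dayfirst=True))    # 2021-01-13

# An explicit format is strict: a non-matching value raises an error
# (or becomes NaT with errors="coerce").
print(pd.to_datetime("25-06-21", format="%d-%m-%y"))  # 2021-06-25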

Per the documentation, you can use one of the following parameters:
dayfirst, to tell pandas that the first value is the day and not the month:
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], dayfirst=True)
format, to state explicitly that the format is %d-%m-%y:
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
Either gives the proper conversion:
# from
date
0 25-06-21
1 01-07-21
# to
date
0 2021-06-25
1 2021-07-01

Related

Adding a year to a period?

I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working; the dates are unchanged:
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.
This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that, per your logic, the year increases starting in October.
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
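Note that the snippet in the question computes the shifted dates but never assigns them back to the DataFrame, which is why nothing appeared to change; the += above (or an explicit .loc assignment) is what actually modifies the column. A minimal sketch of the assignment, assuming a 'quarter' column with text labels like "Q4_" as in the question:
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-10-21', '2021-10-24', '2021-07-01']),  # assumed sample
    'quarter': ['Q4_', 'Q4_', 'Q3_'],                                    # assumed sample
})

# Assign the shifted dates back to the selected rows.
mask = df['quarter'] == "Q4_"
df.loc[mask, 'date'] = df.loc[mask, 'date'] + pd.offsets.DateOffset(years=1)
print(df)
As a side note on the period approach: a yearly period does accept integer addition, e.g. df['date'].dt.to_period('Y') + 1 shifts each period forward by one year.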

pandas datetime giving wrong output

I am working with a pandas dataframe that has a date column. I have converted the dtype of this column from object to datetime using pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, the output at index 11 is wrong after converting to datetime: the month is swapped with the day. This is affecting my further analysis. How can I sort this out?
Use the dayfirst=True parameter or specify the format, because by default pandas matches the month first when possible:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')
Method 1
pd.to_datetime has a parameter named dayfirst; set it to True.
Method 2
Use the format parameter of the to_datetime function (both methods are shown in the sketch below).
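A minimal sketch of both methods on a tiny assumed sample in the same day-month-year layout:
import pandas as pd

s = pd.Series(["12-03-2020", "13-03-2020"])      # assumed sample values

# Method 1: the dayfirst hint.
print(pd.to_datetime(s, dayfirst=True))

# Method 2: an explicit, strict format.
print(pd.to_datetime(s, format="%d-%m-%Y"))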

How to plot Month name in correct order, using strftime?

Here's a quick peek of my dataframe:
local_date amount
0 2017-08-16 10.00
1 2017-10-26 21.70
2 2017-11-04 5.00
3 2017-11-12 37.20
4 2017-11-13 10.00
5 2017-11-18 31.00
6 2017-11-27 14.00
7 2017-11-29 10.00
8 2017-11-30 37.20
9 2017-12-16 8.00
10 2017-12-17 43.20
11 2017-12-17 49.60
12 2017-12-19 102.50
13 2017-12-19 28.80
14 2017-12-22 72.55
15 2017-12-23 24.80
16 2017-12-24 62.00
17 2017-12-26 12.40
18 2017-12-26 15.50
19 2017-12-26 40.00
20 2017-12-28 57.60
21 2017-12-31 37.20
22 2018-01-01 18.60
23 2018-01-02 12.40
24 2018-01-04 32.40
25 2018-01-05 17.00
26 2018-01-06 28.80
27 2018-01-11 20.80
28 2018-01-12 10.00
29 2018-01-12 26.00
I am trying to plot the monthly sum of transactions, which is fine, except for ugly x-ticks.
I would like to change the ticks to the month name and year (e.g. Jan 2019). So I sort the dates, convert them using strftime and plot again, but the order of the dates is completely messed up.
The code I used to sort the dates and convert them is:
transactions = transactions.sort_values(by='local_date')
transactions['month_year'] = transactions['local_date'].dt.strftime('%B %Y')
#And then groupby that column:
transactions.groupby('month_year').amount.sum().plot(kind='bar')
When doing this, the month_year values end up grouped alphabetically by name, so January 2019 comes right after January 2018, and so on.
I thought sorting by date would fix this, but it doesn't. What's the best way to approach this?
You can convert the column to monthly periods with Series.dt.to_period and then change the PeriodIndex to a custom format with rename:
transactions = transactions.sort_values(by='local_date')
(transactions.groupby(transactions['local_date'].dt.to_period('m'))
             .amount.sum()
             .rename(lambda x: x.strftime('%B %Y'))
             .plot(kind='bar'))
Alternative solution:
transactions = transactions.sort_values(by='local_date')
s = transactions.groupby(transactions['local_date'].dt.to_period('m')).amount.sum()
s.index = s.index.strftime('%B %Y')
s.plot(kind='bar')
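A small standalone sketch (dates assumed for illustration) of why the string labels come out in the wrong order while period keys do not: groupby sorts its keys, and month-name strings sort alphabetically, whereas monthly periods sort chronologically.
import pandas as pd

dates = pd.to_datetime(["2018-01-15", "2018-02-15", "2019-01-15"])

# Month-name strings sort alphabetically: February 2018 < January 2018 < January 2019.
print(sorted(dates.strftime("%B %Y")))

# Monthly periods sort chronologically: 2018-01 < 2018-02 < 2019-01.
print(sorted(dates.to_period("M")))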

How to prevent the .diff() function from producing ridiculous values when applied to a dataframe of datetimes and NaT values in pandas?

I have got a dataframe loc_df where all the values are datetime and some of them are NaT. This is what loc_df looks like:
loc_df = pd.DataFrame({'10101':['2020-01-03','2019-11-06','2019-10-09','2019-09-26','2019-09-19','2019-08-19','2019-08-08','2019-07-05','2019-07-04','2019-06-27','2019-05-21','2019-04-21','2019-04-15','2019-04-06','2019-03-28','2019-02-28'], '10102':['2020-01-03','2019-11-15','2019-11-11','2019-10-23','2019-10-10','2019-10-06','2019-09-26','2019-07-14','2019-05-21','2019-03-15','2019-03-11','2019-02-27','2019-02-25',None,None,None], '10103':['2019-08-27','2019-07-14','2019-06-24','2019-05-21','2019-04-11','2019-03-06','2019-02-11',None,None,None,None,None,None,None,None,None]})
loc_df = loc_df.apply(pd.to_datetime)
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
I want to know the days between the dates for each column, so I have used:
loc_df = loc_df.diff(periods = -1)
The result was:
print(loc_df)
10101 10102 10103
0 58 days 49 days 00:00:00 44 days 00:00:00
1 28 days 4 days 00:00:00 20 days 00:00:00
2 13 days 19 days 00:00:00 34 days 00:00:00
3 7 days 13 days 00:00:00 40 days 00:00:00
4 31 days 4 days 00:00:00 36 days 00:00:00
5 11 days 10 days 00:00:00 23 days 00:00:00
6 34 days 74 days 00:00:00 -88814 days +00:12:43.145224
7 1 days 54 days 00:00:00 0 days 00:00:00
8 7 days 67 days 00:00:00 0 days 00:00:00
9 37 days 4 days 00:00:00 0 days 00:00:00
10 30 days 12 days 00:00:00 0 days 00:00:00
11 6 days 2 days 00:00:00 0 days 00:00:00
12 9 days -88800 days +00:12:43.145224 0 days 00:00:00
13 9 days 0 days 00:00:00 0 days 00:00:00
14 28 days 0 days 00:00:00 0 days 00:00:00
15 NaT NaT NaT
Do you know why I get such huge values near the end of each column? I guess it has something to do with subtracting a NaT from a datetime.
Is there an alternative to my code that prevents this?
Thanks in advance
If you have some initial data:
print(loc_df)
10101 10102 10103
0 2020-01-03 2020-01-03 2019-08-27
1 2019-11-06 2019-11-15 2019-07-14
2 2019-10-09 2019-11-11 2019-06-24
3 2019-09-26 2019-10-23 2019-05-21
4 2019-09-19 2019-10-10 2019-04-11
5 2019-08-19 2019-10-06 2019-03-06
6 2019-08-08 2019-09-26 2019-02-11
7 2019-07-05 2019-07-14 NaT
8 2019-07-04 2019-05-21 NaT
9 2019-06-27 2019-03-15 NaT
10 2019-05-21 2019-03-11 NaT
11 2019-04-21 2019-02-27 NaT
12 2019-04-15 2019-02-25 NaT
13 2019-04-06 NaT NaT
14 2019-03-28 NaT NaT
15 2019-02-28 NaT NaT
You could use DataFrame.ffill to fill in the NaT values before you use diff():
loc_df = loc_df.ffill()
loc_df = loc_df.diff(periods=-1)
print(loc_df)
10101 10102 10103
0 58 days 49 days 44 days
1 28 days 4 days 20 days
2 13 days 19 days 34 days
3 7 days 13 days 40 days
4 31 days 4 days 36 days
5 11 days 10 days 23 days
6 34 days 74 days 0 days
7 1 days 54 days 0 days
8 7 days 67 days 0 days
9 37 days 4 days 0 days
10 30 days 12 days 0 days
11 6 days 2 days 0 days
12 9 days 0 days 0 days
13 9 days 0 days 0 days
14 28 days 0 days 0 days
15 NaT NaT NaT
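If you would rather keep the gaps as missing values instead of the 0-day rows that forward-filling produces, a variant sketch (using the loc_df built in the question, before it is overwritten) is to take the diff on a filled copy and then re-mask every pair that involved an original NaT:
filled = loc_df.ffill()

# A diff at row i compares rows i and i+1; mask it out if either was NaT.
diffs = filled.diff(periods=-1).mask(loc_df.isna() | loc_df.shift(-1).isna())
print(diffs)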

How to find number of days in a given month from a date in Python [duplicate]

This question already has answers here:
Numbers of Day in Month
(4 answers)
Closed 4 years ago.
I have a pandas dataframe with one column of dates, like below:
Date:
2018-10-19
2018-10-20
2018-10-21
(...)
2019-01-31
2019-02-01
Any ideas on how to add an additional column with the number of days in that month, so that I get something like:
Date DaysinMonth
2018-10-19 31
2018-10-20 31
2018-10-21 31
(...)
Thanks
If you have a dataframe df like the following (the dates range is reconstructed here from the output below so the snippet is self-contained):
import numpy as np
import pandas as pd

dates = pd.date_range('2018-01-01', periods=19, freq='20D')
df = pd.DataFrame({'dates': dates, 'vals': np.random.randint(10, size=dates.shape)})
dates vals
0 2018-01-01 5
1 2018-01-21 8
2 2018-02-10 1
3 2018-03-02 9
4 2018-03-22 0
5 2018-04-11 3
6 2018-05-01 8
7 2018-05-21 2
8 2018-06-10 4
9 2018-06-30 2
10 2018-07-20 7
11 2018-08-09 5
12 2018-08-29 3
13 2018-09-18 7
14 2018-10-08 6
15 2018-10-28 5
16 2018-11-17 3
17 2018-12-07 4
18 2018-12-27 1
Getting the days in the month is as simple as the following:
df['daysinmonth'] = df['dates'].dt.daysinmonth
dates vals daysinmonth
0 2018-01-01 5 31
1 2018-01-21 8 31
2 2018-02-10 1 28
3 2018-03-02 9 31
4 2018-03-22 0 31
5 2018-04-11 3 30
6 2018-05-01 8 31
7 2018-05-21 2 31
8 2018-06-10 4 30
9 2018-06-30 2 30
10 2018-07-20 7 31
11 2018-08-09 5 31
12 2018-08-29 3 31
13 2018-09-18 7 30
14 2018-10-08 6 31
15 2018-10-28 5 31
16 2018-11-17 3 30
17 2018-12-07 4 31
18 2018-12-27 1 31
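Applied to the question's own column names, a short sketch (sample rows taken from the question, with an explicit conversion first in case the column is still text):
import pandas as pd

df = pd.DataFrame({'Date': ['2018-10-19', '2018-10-20', '2019-02-01']})
df['Date'] = pd.to_datetime(df['Date'])           # ensure datetime64 dtype
df['DaysinMonth'] = df['Date'].dt.daysinmonth     # 31, 31, 28
print(df)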
