pandas datetime giving wrong output

I am working with a pandas dataframe that has a date column. I have converted the dtype of this column from object to datetime using pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, some rows are wrong after the conversion: in row 11, for example, the month is swapped with the day (12-03-2020 became 2020-12-03). This is affecting my further analysis. How can I sort this out?

Use the dayfirst=True parameter or specify the format explicitly, because by default pandas reads the first number as the month whenever that gives a valid date:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')

Method 1
to_datetime has a parameter named dayfirst; set it to True.
Method 2
Pass the format parameter to the to_datetime function.
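As a minimal sketch of both methods, the snippet below parses a few of the strings from the question. The variable names raw, by_dayfirst and by_format are made up for illustration; only pd.to_datetime, dayfirst and format come from the answers above.

```python
import pandas as pd

# A few day-first strings from the question, including the ambiguous ones.
raw = pd.Series(["30-11-2019", "01-02-2020", "03-02-2020", "12-03-2020"])

# Method 1: tell pandas that the first number is the day.
by_dayfirst = pd.to_datetime(raw, dayfirst=True)

# Method 2: spell out the exact layout (day, month, four-digit year).
by_format = pd.to_datetime(raw, format="%d-%m-%Y")

print(by_format.dt.strftime("%Y-%m-%d").tolist())
```

With either call, 12-03-2020 should come out as 12 March 2020 rather than 3 December 2020.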

Related

How to loop through multiple columns to create multiple new columns in

For example, I have several columns of dates and I want to get the month from them. Is there a way to loop through columns instead of running pd.DatetimeIndex(df['date']).month
multiple times? The example below is simplified. The real dataset has many more columns.
import pandas as pd
import numpy as np
np.random.seed(0)
rng_start = pd.date_range('2015-07-24', periods=5, freq='M')
rng_mid = pd.date_range('2019-06-24', periods=5, freq='M')
rng_end = pd.date_range('2022-03-24', periods=5, freq='M')
df = pd.DataFrame({ 'start_date': rng_start, 'mid_date': rng_mid, 'end_date': rng_end })
df
start_date mid_date end_date
0 2015-07-31 2019-06-30 2022-03-31
1 2015-08-31 2019-07-31 2022-04-30
2 2015-09-30 2019-08-31 2022-05-31
3 2015-10-31 2019-09-30 2022-06-30
4 2015-11-30 2019-10-31 2022-07-31
The intended output would be
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7

You answered your own question by saying "loop through columns":
for column in df:
    df[column.replace("_date", "_month")] = df[column].dt.month
An alternative solution (a variation of @BENY's):
df[df.columns.str.replace("_date", "_month")] = df.apply(lambda x: x.dt.month, axis=1)
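If the real dataset also contains non-date columns, the loop can be restricted to datetime columns first. The following is only a sketch under that assumption; the label column and the name date_cols are hypothetical and not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-31", "2015-08-31"]),
    "end_date": pd.to_datetime(["2022-03-31", "2022-04-30"]),
    "label": ["a", "b"],  # hypothetical non-date column that must be skipped
})

# Collect the datetime columns up front so the loop never touches other dtypes.
date_cols = df.select_dtypes(include="datetime").columns

for col in date_cols:
    # start_date -> start_month, end_date -> end_month
    df[col.replace("_date", "_month")] = df[col].dt.month

print(df)
```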

Try apply
df[['start_month', 'mid_month', 'end_month']] = df.apply(lambda x : x.dt.month,axis=1)
df
Out[244]:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7

You can avoid looping using stack:
out = df.join(df.filter(like='_date') # select _date columns
.stack() # convert to Series
.dt.month
.unstack() # back to DataFrame
.rename(columns=lambda x: x.replace('_date', '_month'))
)
Output:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7

Quite similar to this solution, but a bit different:
df.join(df.applymap(lambda x: x.month).
set_axis(['start_month', 'mid_month', 'end_month'],axis=1))

Adding a year to a period?

I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working....
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.

This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that, per your logic, the year increases for dates from October onward (quarter 4).
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
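One likely reason the expression in the question appeared to do nothing is that its result was never assigned back to the column; the += above does that assignment. Below is a small sketch using the question's own quarter column. The sample values and the name mask are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-21", "2021-07-24"]),
    "quarter": ["Q4_", "Q3_"],  # assumed sample values
})

# Rows that should be shifted by one year.
mask = df["quarter"] == "Q4_"

# The addition returns a new Series, so write it back explicitly.
df.loc[mask, "date"] = df.loc[mask, "date"] + pd.DateOffset(years=1)

# If only the year is needed later, a plain integer column is enough.
df["year"] = df["date"].dt.year

print(df)
```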

Facing problem in converting into datetime

df_non_holidays=pd.read_csv("holidays.csv")
print(df_non_holidays["date"])
0 25-06-21
1 28-06-21
2 29-06-21
3 30-06-21
4 01-07-21
5 02-07-21
6 05-07-21
7 06-07-21
8 07-07-21
9 08-07-21
10 09-07-21
11 12-07-21
12 13-07-21
13 14-07-21
14 15-07-21
15 16-07-21
16 19-07-21
17 20-07-21
18 22-07-21
19 23-07-21
20 26-07-21
21 27-07-21
22 28-07-21
23 29-07-21
24 30-07-21
Name: date, dtype: object
df_non_holidays["date"]= pd.to_datetime(df_non_holidays["date"])
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-01-07
5 2021-02-07
6 2021-05-07
7 2021-06-07
8 2021-07-07
9 2021-08-07
10 2021-09-07
11 2021-12-07
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
After converting, from index 4 onward the day and month are swapped in some rows.
Is there anything wrong with my approach to converting object to datetime?
Please guide me.

You need to pass the format string, or you can pass dayfirst=True if you don't want to spell out the format. dayfirst may not work for every kind of datetime string, whereas passing format will work as long as the format you pass matches the data.
>>> pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-07-01
5 2021-07-02
6 2021-07-05
7 2021-07-06
8 2021-07-07
9 2021-07-08
10 2021-07-09
11 2021-07-12
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
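To check that the format really matches every value, one option is to combine format with errors='coerce' and inspect what failed to parse. This is a sketch only; the sample values, including the deliberately malformed last entry, are made up for illustration.

```python
import pandas as pd

# Day-month-two-digit-year strings plus one value that does not fit the format.
raw = pd.Series(["25-06-21", "01-07-21", "12-07-21", "not a date"])

# Values that do not match the format become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%d-%m-%y", errors="coerce")

# Original strings that could not be parsed with this format.
bad = raw[parsed.isna()]

print(parsed)
print(bad.tolist())
```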

According to the documentation, you can use one of the following parameters:
dayfirst, to indicate that the first value is the day and not the month
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], dayfirst=True)
format, to state explicitly that the layout is %d-%m-%y
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
Either one gives the proper conversion:
# from
date
0 25-06-21
1 01-07-21
# to
date
0 2021-06-25
1 2021-07-01

Convert yyyyQn to yyyy-mm

I have a data frame:
ID Date Volume
1 2019Q1 9
1 2020Q2 11
2 2019Q3 39
2 2020Q4 23
I want to convert the yyyyQn values in Date to a datetime in yyyy-mm form.
I have used a dictionary to map the corresponding dates to the quarters.
But I need a more generalized code in instances where the yyyy changes.
Expected output:
ID Date Volume
1 2019-03 9
1 2020-06 11
2 2019-09 39
2 2020-12 23

Let's use pd.PeriodIndex:
df['Date_new'] = pd.PeriodIndex(df['Date'], freq='Q').strftime('%Y-%m')
Output:
ID Date Volume Date_new
0 1 2019Q1 9 2019-03
1 1 2020Q2 11 2020-06
2 2 2019Q3 39 2019-09
3 2 2020Q4 23 2020-12
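The PeriodIndex approach yields strings. If monthly periods or real timestamps are needed instead, one possible extension is to convert each quarter to its last month explicitly. This is a sketch, not part of the original answer; the names quarters and months and the Date_ts column are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Date": ["2019Q1", "2020Q2", "2019Q3", "2020Q4"],
    "Volume": [9, 11, 39, 23],
})

# Parse the yyyyQn strings as quarterly periods.
quarters = pd.PeriodIndex(df["Date"], freq="Q")

# Re-express each quarter as its last month (how="E" aligns to the period end).
months = quarters.asfreq("M", how="E")

df["Date_new"] = months.strftime("%Y-%m")  # strings such as 2019-03
df["Date_ts"] = months.to_timestamp()      # first day of that month as a timestamp

print(df)
```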

Here's a simple solution, though not as efficient (this shouldn't be a problem if your dataset is not too large).
Convert the date column to datetime using to_datetime, which gives the first month of each quarter.
Then add 2 months to each date, because you want the month to be the end-of-quarter month:
df = pd.DataFrame({'date': ["2019Q1" ,"2019Q3", "2019Q2", "2020Q4"], 'volume': [1,2,3, 4]})
df['datetime'] = pd.to_datetime(df['date'])
df['datetime'] = df['datetime'] + pd.DateOffset(months=2)
The output has the same months, shown as full dates:
date volume datetime
0 2019Q1 1 2019-03-01
1 2019Q3 2 2019-09-01
2 2019Q2 3 2019-06-01
3 2020Q4 4 2020-12-01

Python: how to group by for each user?

I have a dataframe that looks like the following
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 23 3
1 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 20 2
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 6
4 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 3
I would like to groupby for each uid in order to have the sum of count every hour and the average of val
I would like something like the following
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 43 2.5
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 18 4.5

You can try groupby in combination with agg, using a dictionary to define the aggregation for each column:
import pandas as pd
df.groupby(['uid', 'timestamp']).agg({"val": "mean", "count": "sum"})
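A self-contained sketch of the same idea follows, with shortened uid values standing in for the UUIDs in the question. Flooring the timestamp to the hour is an added assumption, so that rows within the same hour group together even when their timestamps are not already on the hour.

```python
import pandas as pd

df = pd.DataFrame({
    "uid": ["u1", "u1", "u1", "u2", "u2"],  # shortened stand-ins for the UUIDs
    "timestamp": pd.to_datetime([
        "2020-03-17 13:00:00", "2020-03-17 13:00:00", "2020-03-17 15:00:00",
        "2020-03-18 09:00:00", "2020-03-18 09:00:00",
    ]),
    "count": [23, 20, 10, 9, 9],
    "val": [3, 2, 5, 6, 3],
})

# Group by user and by hour, then sum the counts and average the values.
hour = df["timestamp"].dt.floor("h")
out = (
    df.groupby(["uid", hour], sort=False)
      .agg(count=("count", "sum"), val=("val", "mean"))
      .reset_index()  # turn the group keys back into ordinary columns
)

print(out)
```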
