I am working with a pandas DataFrame that has a date column. I converted the dtype of this column from object to datetime using pd.to_datetime:
Input:
0 30-11-2019
1 31-12-2019
2 31-12-2019
3 31-12-2019
4 31-12-2019
5 21-01-2020
6 27-01-2020
7 01-02-2020
8 01-02-2020
9 03-02-2020
10 15-02-2020
11 12-03-2020
12 13-03-2020
13 31-03-2020
14 31-03-2020
15 04-04-2020
16 04-04-2020
17 04-04-2020
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'])
Output:
0 2019-11-30
1 2019-12-31
2 2019-12-31
3 2019-12-31
4 2019-12-31
5 2020-01-21
6 2020-01-27
7 2020-01-02
8 2020-01-02
9 2020-03-02
10 2020-02-15
11 2020-12-03
12 2020-03-13
13 2020-03-31
14 2020-03-31
15 2020-04-04
16 2020-04-04
17 2020-04-04
As you can see, some of the outputs (for example, row 11) are wrong after converting to datetime: the month is swapped with the day. This is affecting my further analysis. How can I sort this out?
Use the dayfirst=True parameter or specify the format, because by default pandas tries to match the month first where possible:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], dayfirst=True)
Or:
ta['transaction_date'] = pd.to_datetime(ta['transaction_date'], format='%d-%m-%Y')
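As a small sketch of the two options on a few of the question's values (the four-row ta frame below is made up for illustration), both calls are expected to read 12-03-2020 as 12 March rather than 3 December:

```python
import pandas as pd

# A few day-first strings taken from the question; the frame itself is illustrative.
ta = pd.DataFrame(
    {"transaction_date": ["30-11-2019", "01-02-2020", "03-02-2020", "12-03-2020"]}
)

# Option 1: state the layout explicitly (day-month-year with a 4-digit year).
by_format = pd.to_datetime(ta["transaction_date"], format="%d-%m-%Y")

# Option 2: ask the parser to prefer day-first when a value is ambiguous.
by_dayfirst = pd.to_datetime(ta["transaction_date"], dayfirst=True)

print(by_format.dt.strftime("%Y-%m-%d").tolist())
# ['2019-11-30', '2020-02-01', '2020-02-03', '2020-03-12']
```

Stating the format is the more explicit of the two, since it leaves nothing for the parser to guess.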
Method 1
Look at the pd.to_datetime documentation: there is a parameter named dayfirst; set it to True.
Method 2
Use the format parameter of the to_datetime function.
For example, I have several columns of dates and I want to get the month from them. Is there a way to loop through columns instead of running pd.DatetimeIndex(df['date']).month multiple times? The example below is simplified. The real dataset has many more columns.
import pandas as pd
import numpy as np
np.random.seed(0)
rng_start = pd.date_range('2015-07-24', periods=5, freq='M')
rng_mid = pd.date_range('2019-06-24', periods=5, freq='M')
rng_end = pd.date_range('2022-03-24', periods=5, freq='M')
df = pd.DataFrame({ 'start_date': rng_start, 'mid_date': rng_mid, 'end_date': rng_end })
df
start_date mid_date end_date
0 2015-07-31 2019-06-30 2022-03-31
1 2015-08-31 2019-07-31 2022-04-30
2 2015-09-30 2019-08-31 2022-05-31
3 2015-10-31 2019-09-30 2022-06-30
4 2015-11-30 2019-10-31 2022-07-31
The intended output would be
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
You answered your question by saying "loop through columns":
for column in df:
    df[column.replace("_date", "_month")] = df[column].dt.month
An alternative solution (a variation of @BENY's):
df[df.columns.str.replace("_date", "_month")] = df.apply(lambda x: x.dt.month, axis=1)
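If the real dataset also contains non-date columns, the loop can be limited to datetime columns first. The following is a minimal sketch under that assumption; the label column and the two-row frame are invented for illustration:

```python
import pandas as pd

# Illustrative frame: two datetime columns plus one non-date column.
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-31", "2015-08-31"]),
    "end_date": pd.to_datetime(["2022-03-31", "2022-04-30"]),
    "label": ["a", "b"],
})

# Collect the datetime column names up front, then add one month column per date column.
date_columns = df.select_dtypes(include="datetime").columns
for column in date_columns:
    df[column.replace("_date", "_month")] = df[column].dt.month

print(df.columns.tolist())
# ['start_date', 'end_date', 'label', 'start_month', 'end_month']
```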
Try apply
df[['start_month', 'mid_month', 'end_month']] = df.apply(lambda x : x.dt.month,axis=1)
df
Out[244]:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
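With axis=1 the lambda runs once per row. A column-wise variant, sketched here on a shortened two-row version of the frame, calls it once per column and derives the new names from the old ones; it assumes every date column name contains "_date":

```python
import pandas as pd

# Shortened version of the question's frame (two rows instead of five).
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-07-31", "2015-08-31"]),
    "mid_date": pd.to_datetime(["2019-06-30", "2019-07-31"]),
    "end_date": pd.to_datetime(["2022-03-31", "2022-04-30"]),
})

# apply() defaults to axis=0, so the lambda receives each column as a Series.
months = (
    df.filter(like="_date")
      .apply(lambda col: col.dt.month)
      .rename(columns=lambda name: name.replace("_date", "_month"))
)
out = df.join(months)

print(out[["start_month", "mid_month", "end_month"]].values.tolist())
# [[7, 6, 3], [8, 7, 4]]
```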
You can avoid looping using stack:
out = df.join(df.filter(like='_date')  # select _date columns
                .stack()               # convert to Series
                .dt.month
                .unstack()             # back to DataFrame
                .rename(columns=lambda x: x.replace('_date', '_month'))
              )
Output:
start_date mid_date end_date start_month mid_month end_month
0 2015-07-31 2019-06-30 2022-03-31 7 6 3
1 2015-08-31 2019-07-31 2022-04-30 8 7 4
2 2015-09-30 2019-08-31 2022-05-31 9 8 5
3 2015-10-31 2019-09-30 2022-06-30 10 9 6
4 2015-11-30 2019-10-31 2022-07-31 11 10 7
Quite similar to the solution above, but a bit different:
df.join(df.applymap(lambda x: x.month).
set_axis(['start_month', 'mid_month', 'end_month'],axis=1))
I have a column which I have converted to datetime:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
And I need to add 1 year to this time based on a conditional:
df.loc[df['quarter'] == "Q4_", 'date'] + pd.offsets.DateOffset(years=1)
but it's not working; the dates are unchanged:
date
2021-10-21 00:00:00
2021-10-24 00:00:00
2021-10-25 00:00:00
2021-10-26 00:00:00
I have tried converting it to period since I only need the year to be used in a concatenation later:
df['year'] = df['date'].dt.to_period('Y')
but I cannot add any number to a period.
This appears to be working for me:
import pandas as pd
df = pd.DataFrame({'date':pd.date_range('1/1/2021', periods=50, freq='M')})
print(df.head(24))
Input:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2021-10-31
10 2021-11-30
11 2021-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2022-10-31
22 2022-11-30
23 2022-12-31
Add a year:
df.loc[df['date'].dt.quarter == 4, 'date'] += pd.offsets.DateOffset(years=1)
print(df.head(24))
Note that, per your logic, the year increases for dates from October onward (quarter 4).
Output:
date
0 2021-01-31
1 2021-02-28
2 2021-03-31
3 2021-04-30
4 2021-05-31
5 2021-06-30
6 2021-07-31
7 2021-08-31
8 2021-09-30
9 2022-10-31
10 2022-11-30
11 2022-12-31
12 2022-01-31
13 2022-02-28
14 2022-03-31
15 2022-04-30
16 2022-05-31
17 2022-06-30
18 2022-07-31
19 2022-08-31
20 2022-09-30
21 2023-10-31
22 2023-11-30
23 2023-12-31
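The same idea can be sketched with a string quarter column like the one in the question. The point is that the shifted dates have to be assigned back; an expression that only adds the offset computes new values and then discards them. The three-row frame below is invented for illustration:

```python
import pandas as pd

# Illustrative frame with a string 'quarter' column, as in the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-10-21", "2021-10-24", "2021-07-25"]),
    "quarter": ["Q4_", "Q4_", "Q3_"],
})

# Build the condition once, then write the shifted dates back into the same rows.
mask = df["quarter"] == "Q4_"
df.loc[mask, "date"] = df.loc[mask, "date"] + pd.DateOffset(years=1)

# If only the year is needed later, it is available as a plain integer column.
df["year"] = df["date"].dt.year

print(df["date"].dt.strftime("%Y-%m-%d").tolist())
# ['2022-10-21', '2022-10-24', '2021-07-25']
```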
df_non_holidays=pd.read_csv("holidays.csv")
print(df_non_holidays["date"])
0 25-06-21
1 28-06-21
2 29-06-21
3 30-06-21
4 01-07-21
5 02-07-21
6 05-07-21
7 06-07-21
8 07-07-21
9 08-07-21
10 09-07-21
11 12-07-21
12 13-07-21
13 14-07-21
14 15-07-21
15 16-07-21
16 19-07-21
17 20-07-21
18 22-07-21
19 23-07-21
20 26-07-21
21 27-07-21
22 28-07-21
23 29-07-21
24 30-07-21
Name: date, dtype: object
df_non_holidays["date"]= pd.to_datetime(df_non_holidays["date"])
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-01-07
5 2021-02-07
6 2021-05-07
7 2021-06-07
8 2021-07-07
9 2021-08-07
10 2021-09-07
11 2021-12-07
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
After converting, the day and month are swapped starting from index 4.
Is there anything wrong with my approach to converting object to datetime?
Please guide me.
You need to pass the format string, or, if you don't want to specify the format, you can pass dayfirst=True instead. Note that dayfirst is only a preference and is not strict, so it may not work for every kind of datetime string; passing format always works, provided the format matches the data.
>>> pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
0 2021-06-25
1 2021-06-28
2 2021-06-29
3 2021-06-30
4 2021-07-01
5 2021-07-02
6 2021-07-05
7 2021-07-06
8 2021-07-07
9 2021-07-08
10 2021-07-09
11 2021-07-12
12 2021-07-13
13 2021-07-14
14 2021-07-15
15 2021-07-16
16 2021-07-19
17 2021-07-20
18 2021-07-22
19 2021-07-23
20 2021-07-26
21 2021-07-27
22 2021-07-28
23 2021-07-29
24 2021-07-30
Name: date, dtype: datetime64[ns]
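One consequence of passing format is that values which do not match it are rejected rather than guessed at. A minimal sketch, assuming a hypothetical column in which the third value uses a different layout:

```python
import pandas as pd

# Two values in the expected dd-mm-yy layout and one that does not match it.
dates = pd.Series(["25-06-21", "01-07-21", "2021/07/02"])

# errors="coerce" turns values that do not match the format into NaT.
parsed = pd.to_datetime(dates, format="%d-%m-%y", errors="coerce")

# Without errors="coerce", the mismatching value raises a ValueError.
try:
    pd.to_datetime(dates, format="%d-%m-%y")
    strict_failed = False
except ValueError:
    strict_failed = True

print(parsed.isna().tolist())
# [False, False, True]
```

Checking parsed.isna() after a coerced conversion is a quick way to find the rows that need attention.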
According to the documentation, you can use one of the following parameters:
dayfirst, to indicate that the first value is the day and not the month
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], dayfirst=True)
format, to state explicitly that the layout is %d-%m-%y
df_non_holidays["date"] = pd.to_datetime(df_non_holidays["date"], format='%d-%m-%y')
Either one gives the proper conversion:
# from
date
0 25-06-21
1 01-07-21
# to
date
0 2021-06-25
1 2021-07-01
I have a data frame:
ID Date Volume
1 2019Q1 9
1 2020Q2 11
2 2019Q3 39
2 2020Q4 23
I want to convert this yyyy-Qn format to datetime.
I have used a dictionary to map the corresponding dates to the quarters.
But I need a more generalized code in instances where the yyyy changes.
Expected output:
ID Date Volume
1 2019-03 9
1 2020-06 11
2 2019-09 39
2 2020-12 23
Let's use pd.PeriodIndex:
df['Date_new'] = pd.PeriodIndex(df['Date'], freq='Q').strftime('%Y-%m')
Output:
ID Date Volume Date_new
0 1 2019Q1 9 2019-03
1 1 2020Q2 11 2020-06
2 2 2019Q3 39 2019-09
3 2 2020Q4 23 2020-12
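If an actual datetime column is wanted rather than a string, one possible sketch is to convert each quarter to its last month and then to a timestamp. This assumes calendar quarters ending in December, and the Date_dt column name is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Date": ["2019Q1", "2020Q2", "2019Q3", "2020Q4"],
    "Volume": [9, 11, 39, 23],
})

# Parse the strings as quarterly periods.
quarters = pd.PeriodIndex(df["Date"], freq="Q")

# Move each quarter to its final month, then take the first day of that month.
df["Date_dt"] = quarters.asfreq("M", how="end").to_timestamp()

print(df["Date_dt"].dt.strftime("%Y-%m-%d").tolist())
# ['2019-03-01', '2020-06-01', '2019-09-01', '2020-12-01']
```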
Here's a simple solution, though not as efficient (this shouldn't be a problem unless your dataset is large).
Convert the date column to datetime using to_datetime, which maps each quarter to its first month.
Then add 2 months to each date, because you want the month to be the end-of-quarter month.
df = pd.DataFrame({'date': ["2019Q1" ,"2019Q3", "2019Q2", "2020Q4"], 'volume': [1,2,3, 4]})
df['datetime'] = pd.to_datetime(df['date'])
df['datetime'] = df['datetime'] + pd.DateOffset(months=2)
The months match the expected output:
date volume datetime
0 2019Q1 1 2019-03-01
1 2019Q3 2 2019-09-01
2 2019Q2 3 2019-06-01
3 2020Q4 4 2020-12-01
I have a dataframe that looks like the following
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 23 3
1 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 20 2
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 6
4 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 9 3
I would like to group by uid so that, for every hour, I get the sum of count and the average of val.
I would like something like the following:
uid timestamp count val
0 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 13:00:00 43 2.5
2 ccf7758a-155f-4ebf-8740-68320f279baa 2020-03-17 15:00:00 10 5
3 16162f81-d745-41c2-a7d6-f11486958e36 2020-03-18 09:00:00 18 4.5
You can try groupby in combination with agg using a dictionary style definition of your custom functions:
import pandas as pd
import numpy as np
df.groupby(['uid', 'timestamp']).agg({"val": "mean", "count": "sum"})
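If the timestamps are not already aligned to the hour, they can be floored first. Below is a minimal sketch using named aggregation; the short uid values, the minutes in the timestamps, and the hour column are all made up for illustration:

```python
import pandas as pd

# Illustrative data: same counts and values as the question, with shortened
# uids and minutes added to the timestamps so the flooring is visible.
df = pd.DataFrame({
    "uid": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2020-03-17 13:05:00", "2020-03-17 13:40:00", "2020-03-17 15:00:00",
        "2020-03-18 09:15:00", "2020-03-18 09:50:00",
    ]),
    "count": [23, 20, 10, 9, 9],
    "val": [3, 2, 5, 6, 3],
})

# Round every timestamp down to the start of its hour.
df["hour"] = df["timestamp"].dt.floor("h")

# Sum count and average val within each (uid, hour) group.
out = (
    df.groupby(["uid", "hour"])
      .agg(count=("count", "sum"), val=("val", "mean"))
      .reset_index()
)

print(out[["count", "val"]].values.tolist())
# [[43.0, 2.5], [10.0, 5.0], [18.0, 4.5]]
```

The reset_index() call turns the group keys back into ordinary columns, which matches the flat layout shown in the question.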