I have a dataframe df with a Date column:
Date
--------
Wed 23 Dec
Sat 28 Nov
Thu 26 Nov
Sun 22 Nov
Tue 1 Dec
Wed 2 Dec
The Date column has object dtype. I want to change its format, using something like format="%m-%d-%Y", into yyyy-dd-mm.
Expected output df:
Date
---------
2020-23-12
2020-28-11
2020-26-11
2020-22-11
2020-01-12
2020-02-12
Thanks in advance for the help!
Use to_datetime on the original strings with the year appended and a matching format; the column is then filled with datetimes:
df['Date'] = pd.to_datetime(df['Date']+'2020', format="%a %d %b%Y")
print (df)
Date
0 2020-12-23
1 2020-11-28
2 2020-11-26
3 2020-11-22
4 2020-12-01
5 2020-12-02
If you need a custom format, add Series.dt.strftime, but note that the datetimes are lost and you get strings:
df['Date'] = pd.to_datetime(df['Date']+'2020', format="%a %d %b%Y").dt.strftime("%Y-%d-%m")
print (df)
Date
0 2020-23-12
1 2020-28-11
2 2020-26-11
3 2020-22-11
4 2020-01-12
5 2020-02-12
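To make the difference between the two results concrete, here is a minimal sketch that keeps the parsed datetimes and stores the custom format in a separate string column. The year 2020 is an assumption taken from the expected output in the question, and Date_str is a hypothetical column name used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["Wed 23 Dec", "Sat 28 Nov", "Tue 1 Dec"]})

# Append the assumed year (2020) and parse with a format that matches the text.
parsed = pd.to_datetime(df["Date"] + " 2020", format="%a %d %b %Y")

df["Date"] = parsed                               # stays datetime64, usable with .dt
df["Date_str"] = parsed.dt.strftime("%Y-%d-%m")   # plain strings in yyyy-dd-mm order

print(df.dtypes)
print(df)
```

Keeping both columns means sorting and date arithmetic still work on Date, while Date_str is only for display.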
Related
I am parsing bank statement PDFs that have the full start and end dates in question both in the filename and in the document, but the actual entries corresponding to the transactions only contain the day and month ('%d %b'). Here is what the series looks like for the "Date" column:
1
2 24 Dec
3 27 Dec
4
5
6 30 Dec
7
8 31 Dec
9
10
11 2 Jan
12
13 3 Jan
14 6 Jan
15 14 Jan
16 15 Jan
I have the start and end dates as 2013-12-23 and 2014-01-23. What is an efficient way to populate this series/column with the correct full date given the start and end range? I would like any existing date to forward fill the same date to the next date so:
1
2 24 Dec 2013
3 27 Dec 2013
4 27 Dec 2013
5 27 Dec 2013
6 30 Dec 2013
7 30 Dec 2013
8 31 Dec 2013
9 31 Dec 2013
10 31 Dec 2013
11 2 Jan 2014
12 2 Jan 2014
13 3 Jan 2014
14 6 Jan 2014
15 14 Jan 2014
16 15 Jan 2014
The date format is irrelevant as long as it is a datetime format. I was hoping to use something built into pandas, but I can't figure out what to use. The best I've come up with so far is to check each date against the start and end dates and fill in the year based on where the date fits into the range, but it is inefficient to run this on the whole column. Any help, advice, or tips would be appreciated. Thanks in advance.
EDIT: I am hoping for a general solution that can apply to any start/end date and set of transactions, not just this particular series, although I think this is a good test case because it spans the end of a year.
EDIT2: what I have so far after posting this question is the following, which doesn't seem terribly efficient, but seems to work:
import numpy as np
from datetime import datetime

def add_year(date, start, end):
    # Empty entries become NaN so they can be forward-filled afterwards
    if not date:
        return np.nan
    # Try the start year first; fall back to the end year if out of range
    test_date = datetime.strptime("{} {}".format(date, start.year), '%d %b %Y').date()
    if start <= test_date <= end:
        return test_date
    return datetime.strptime("{} {}".format(date, end.year), '%d %b %Y').date()

df['Date'] = df.Date.map(lambda date: add_year(date, start_date, end_date))
df['Date'] = df['Date'].ffill()
Try:
# convert the string 'nan' to actual NaNs
df['Date']=df['Date'].replace('nan|NaN',float('NaN'),regex=True)
# forward fill the NaNs
df['Date']=df['Date'].ffill()
# check which rows of the Date column contain Dec
c=df['Date'].str.contains('Dec') & df['Date'].notna()
# get the index of the last 'Date' where condition c is satisfied
idx=df[c].index[-1]
# add 2013 to 'Date' up to the last index of c
df.loc[:idx,'Date']=df.loc[:idx,'Date']+' 2013'
# add 2014 to 'Date' from the last index of c + 1 to the end
df.loc[idx+1:,'Date']=df.loc[idx+1:,'Date']+' 2014'
# finally, convert these values to datetime
df['Date']=pd.to_datetime(df['Date'])
Output of df:
Date
0 NaT
1 2013-12-24
2 2013-12-27
3 2013-12-27
4 2013-12-27
5 2013-12-30
6 2013-12-30
7 2013-12-31
8 2013-12-31
9 2013-12-31
10 2014-01-02
11 2014-01-02
12 2014-01-03
13 2014-01-06
14 2014-01-14
15 2014-01-15
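For the general case asked about in the question's first EDIT (any start/end date, no hard-coded years), one possible vectorized sketch is to parse each entry twice, once with the start year and once with the end year, and keep whichever candidate falls inside the statement period. The helper name fill_years is hypothetical, and the sketch assumes the period is shorter than one year, so a given day and month can only fit one of the two years.

```python
import pandas as pd

def fill_years(day_month: pd.Series, start: str, end: str) -> pd.Series:
    """Attach a year to 'DD Mon' strings using a statement period (hypothetical helper)."""
    start_ts, end_ts = pd.Timestamp(start), pd.Timestamp(end)
    # Treat empty strings as missing, then carry the last seen date forward.
    text = day_month.mask(day_month.fillna("").str.strip().eq("")).ffill()
    # Parse every row twice: once with the start year, once with the end year.
    with_start = pd.to_datetime(text + f" {start_ts.year}", format="%d %b %Y", errors="coerce")
    with_end = pd.to_datetime(text + f" {end_ts.year}", format="%d %b %Y", errors="coerce")
    # Keep the start-year candidate only where it lies inside the period.
    return with_start.where(with_start.between(start_ts, end_ts), with_end)

s = pd.Series(["", "24 Dec", "27 Dec", "", "", "30 Dec", "", "31 Dec",
               "", "", "2 Jan", "", "3 Jan", "6 Jan", "14 Jan", "15 Jan"])
result = fill_years(s, "2013-12-23", "2014-01-23")
print(result)
```

Rows before the first date stay NaT, as in the output above. If start and end are in the same year, both candidates are identical, so the same code path still applies.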
Input data is given in this format:
date                         State
Jan 18 2021 7:26:9 PM UTC    True
Jan 18 2021 7:24:56 PM UTC   True
Jan 18 2021 7:23:42 PM UTC   True
Oct 27 2020 9:36:52 PM UTC   False
Oct 27 2020 8:23:16 PM UTC   False
Oct 27 2020 7:48:20 PM UTC   False
Oct 27 2021 6:24:56 PM UTC   True
Oct 27 2021 5:24:56 PM UTC   True
Oct 28 2020 7:48:20 PM UTC   False
As output, I want to know how many hours the system was working (True) and how many hours it was not working (False) on each day.
For example, on Jan 18, True is 7:26:9 - 7:23:42 = 2 min 27 sec.
On Oct 27, False is 9:36:52 - 7:48:20 = 1 hour 48 min 32 sec.
On Oct 27, True is 6:24:56 - 5:24:56 = 1 hour.
How can I do this with pandas? Thanks in advance.
Convert the column to datetimes, then aggregate the min and max per date (via Series.dt.date), using a helper column g that labels runs of consecutive True and False values. Subtracting min from max then gives timedeltas.
df['date'] = pd.to_datetime(df['date'])
df['g'] = df['State'].ne(df['State'].shift()).cumsum()
df = df.groupby([df['date'].dt.date, 'State','g'], sort=False)['date'].agg(['max','min'])
df = df['max'].sub(df['min']).reset_index(level=2, drop=True).reset_index(name='td')
print (df)
date State td
0 2021-01-18 True 0 days 00:02:27
1 2020-10-27 False 0 days 01:48:32
2 2021-10-27 True 0 days 01:00:00
3 2020-10-28 False 0 days 00:00:00
And if you need seconds, add Series.dt.total_seconds:
df['sec'] = df['td'].dt.total_seconds()
print (df)
date State td sec
0 2021-01-18 True 0 days 00:02:27 147.0
1 2020-10-27 False 0 days 01:48:32 6512.0
2 2021-10-27 True 0 days 01:00:00 3600.0
3 2020-10-28 False 0 days 00:00:00 0.0
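If a single day can contain several separate runs of the same state, the per-run durations can be summed to get one total per day and state. The following is a self-contained sketch of that idea on the sample data. The explicit format string and the names day, run, spans and per_day are assumptions introduced here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["Jan 18 2021 7:26:9 PM UTC", "Jan 18 2021 7:24:56 PM UTC",
             "Jan 18 2021 7:23:42 PM UTC", "Oct 27 2020 9:36:52 PM UTC",
             "Oct 27 2020 8:23:16 PM UTC", "Oct 27 2020 7:48:20 PM UTC",
             "Oct 27 2021 6:24:56 PM UTC", "Oct 27 2021 5:24:56 PM UTC",
             "Oct 28 2020 7:48:20 PM UTC"],
    "State": [True, True, True, False, False, False, True, True, False],
})

# Parse with an explicit 12-hour format; the trailing "UTC" is matched literally.
df["date"] = pd.to_datetime(df["date"], format="%b %d %Y %I:%M:%S %p UTC")

# Label runs of consecutive identical states.
run = df["State"].ne(df["State"].shift()).cumsum().rename("run")
day = df["date"].dt.date.rename("day")

# Duration of each run = last timestamp minus first timestamp in that run.
spans = df.groupby([day, "State", run], sort=False)["date"].agg(["min", "max"])
spans["td"] = spans["max"] - spans["min"]

# Sum the run durations per day and state.
per_day = spans.groupby(level=["day", "State"], sort=False)["td"].sum().reset_index()
print(per_day)
```

On this sample each day and state has only one run, so the totals match the td column shown above.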
I have a table that looks like this:
date id
0 11:09:27 Nov. 26 2020 94857
1 10:49:26 Okt. 26 2020 94853
2 10:48:24 Sept. 26 2020 94852
3 9:26:33 Aug. 26 2020 94856
4 9:26:33 Jul. 26 2020 94851
5 9:24:38 Dez. 26 2020 94850
6 9:24:38 Jan. 26 2020 94849
7 9:09:08 Jun. 27 2019 32148
8 9:02:41 Mai 27 2019 32145
9 9:02:19 Apr. 27 2019 32144
10 9:02:05 Mrz. 27 2019 32143
11 9:02:05 Feb. 27 2019 32140
(initial table)
The date column currently has dtype 'object'. I'm trying to convert it to 'datetime' using
df['date'] = pd.to_datetime(df['date'], format ='HH:MM:SS-%mm-%dd-%YYYY', errors='coerce')
and receive only NaT as a result.
The problem is that the month names here are not standard. For example, Mai comes without a trailing dot.
What's the best way to convert this format?
The following format works for most of your data:
format="%H:%M:%S %b. %d %Y"
H stands for hours, M for minutes, S for seconds, b for the abbreviated month name, d for the day of the month, and Y for the four-digit year.
As Justin said in the comments, your month abbreviations are off. Four-character abbreviations are unconventional: if a month is 4 characters long, strip its last character; if it is 3 characters long, leave it as it is.
EDIT:
Note that in your dataset the abbreviations end with a ".", hence the dot in the format string.
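Because the abbreviations in the question are German (Okt., Dez., Mrz., Mai) and inconsistent in length, another option is to avoid %b altogether and map the month tokens to numbers before parsing. This is only a sketch: the MONTH_NUMBERS dictionary covers just the spellings that appear in the question and is an assumption, as are the column name parsed and the regular expression.

```python
import pandas as pd

# Assumed mapping, limited to the spellings seen in the question's table.
MONTH_NUMBERS = {"Jan": "01", "Feb": "02", "Mrz": "03", "Apr": "04",
                 "Mai": "05", "Jun": "06", "Jul": "07", "Aug": "08",
                 "Sept": "09", "Okt": "10", "Nov": "11", "Dez": "12"}

df = pd.DataFrame({"date": ["11:09:27 Nov. 26 2020", "10:49:26 Okt. 26 2020",
                            "10:48:24 Sept. 26 2020", "9:02:41 Mai 27 2019",
                            "9:02:05 Mrz. 27 2019", "9:24:38 Dez. 26 2020"]})

# Split into time, month token (optional trailing dot dropped), day and year.
parts = df["date"].str.extract(
    r"^(?P<time>\d{1,2}:\d{2}:\d{2})\s+(?P<month>[A-Za-z]+)\.?\s+"
    r"(?P<day>\d{1,2})\s+(?P<year>\d{4})$"
)

# Rebuild an unambiguous, zero-padded string and parse it with an explicit format.
iso_text = (parts["year"] + "-" + parts["month"].map(MONTH_NUMBERS) + "-"
            + parts["day"].str.zfill(2) + " " + parts["time"].str.zfill(8))
df["parsed"] = pd.to_datetime(iso_text, format="%Y-%m-%d %H:%M:%S", errors="coerce")
print(df)
```

Rows whose month token is not in the mapping end up as NaT rather than being guessed, which makes gaps in the dictionary easy to spot.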
This works for me, even with the inconsistency:
pd.to_datetime(df.date.str[9:]+' '+df.date.str[0:8])
Input (randomly generated dates; row 7 changed to use Sept.)
date
0 19:06:04 Mar. 19 2020
1 17:27:11 Mar. 05 2020
2 07:17:04 May. 05 2020
3 04:53:50 Sep. 23 2020
4 03:43:20 Jun. 23 2020
5 17:35:00 Mar. 06 2020
6 06:04:48 Jan. 15 2020
7 12:26:14 Sept. 18 2020
8 03:21:10 Jun. 03 2020
9 17:37:00 Aug. 26 2020
output
0 2020-03-19 19:06:04
1 2020-03-05 17:27:11
2 2020-05-05 07:17:04
3 2020-09-23 04:53:50
4 2020-06-23 03:43:20
5 2020-03-06 17:35:00
6 2020-01-15 06:04:48
7 2020-09-18 12:26:14
8 2020-06-03 03:21:10
9 2020-08-26 17:37:00
The following works for all three-letter months (MMM) but fails on the 'Sept.' month. This is strange, because when the date precedes the time, the default parser handles it correctly (i.e. the coerce code seems to work in that order).
pd.to_datetime(df['date'].astype(str), format ="%H:%M:%S %b. %d %Y",
errors='coerce')
Your date column has a complicated format, so just change the format of your pd.to_datetime function:
# 11:09:27 Nov. 26 2020 ---> '%H:%M:%S %b. %d %Y' (%H is the 24-hour clock; the data has no AM/PM marker)
df['date'] = pd.to_datetime(df['date'], format ='%H:%M:%S %b. %d %Y', errors='coerce')
output: 2020-11-26 11:09:27
I am currently working on a dataset of 8,000 rows.
I want to split my date column into day, month, and year. The dtype of the date column is object.
How do I convert the whole date column into day, month, and year?
A sample of the date of my dataset is shown below:
date
01-01-2016
01-01-2016
01-01-2016
01-01-2016
01-01-2016
df=pd.DataFrame(columns=['date'])
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
print(df)
dt=datetime.strptime('date',"%d-%m-%y")
print(dt)
This is the code I am using to split the date, but it raises this error:
ValueError: time data 'date' does not match format '%d-%m-%y'
If you have pandas you can do this:
import pandas as pd
# Recreate your dataframe
df = pd.DataFrame(dict(date=['01-01-2016']*6))
df.date = pd.to_datetime(df.date)
# Create 3 new columns
df[['year','month','day']] = df.date.apply(lambda x: pd.Series(x.strftime("%Y,%m,%d").split(",")))
df
Returns
date year month day
0 2016-01-01 2016 01 01
1 2016-01-01 2016 01 01
2 2016-01-01 2016 01 01
3 2016-01-01 2016 01 01
4 2016-01-01 2016 01 01
5 2016-01-01 2016 01 01
Or without the formatting options:
df['year'],df['month'],df['day'] = df.date.dt.year, df.date.dt.month, df.date.dt.day
df
Returns
date year month day
0 2016-01-01 2016 1 1
1 2016-01-01 2016 1 1
2 2016-01-01 2016 1 1
3 2016-01-01 2016 1 1
4 2016-01-01 2016 1 1
5 2016-01-01 2016 1 1
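The ValueError in the question comes from passing the literal string 'date' to datetime.strptime instead of the column's values. A minimal sketch of the column-wise alternative with an explicit format follows. It assumes the strings are day-first (DD-MM-YYYY), as the question's '%d-%m-%y' attempt suggests, and the sample values other than 01-01-2016 are invented to show the day/month order.

```python
import pandas as pd

# Assumed day-first sample; only 01-01-2016 appears in the question.
df = pd.DataFrame({"date": ["01-01-2016", "15-02-2016", "31-12-2016"]})

# %Y is the four-digit year (%y would expect two digits, e.g. 16).
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")

# The .dt accessor exposes the components as integer columns.
df = df.assign(day=df["date"].dt.day,
               month=df["date"].dt.month,
               year=df["date"].dt.year)
print(df)
```

Stating the format explicitly avoids any guessing about whether 01-02-2016 means 1 February or 2 January.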
I found this but can't get the syntax correct.
time.asctime(time.strptime('2017 28 1', '%Y %W %w'))
I want to add a new column that shows the month in the format "201707" for July. It can be int64 or string; it doesn't have to be an actual readable date.
My dataframe column ['Week'] is in a similar format, 201729, i.e. YYYYWW.
dfAttrition_Billings_KPIs['Day_1'] = \
time.asctime(time.strptime(dfAttrition_Billings_KPIs['Week'].str[:4]
+ dfAttrition_Billings_KPIs['Month'].str[:-2] - 1 + 1', '%Y %W %w'))
So I want rows that have week 201729 to show month 201707 in a new field. The output depends on the row's value in 'Week'.
I have a million records, so I would like to avoid row iteration, lambdas and slow functions where possible :)
Use to_datetime with the format parameter, appending 1 to select the Monday of each week; then use strftime to get the YYYYMM format:
df = pd.DataFrame({'date':[201729,201730,201735]})
df['date1']=pd.to_datetime(df['date'].astype(str) + '1', format='%Y%W%w')
df['date2']=pd.to_datetime(df['date'].astype(str) + '1', format='%Y%W%w').dt.strftime('%Y%m')
print (df)
date date1 date2
0 201729 2017-07-17 201707
1 201730 2017-07-24 201707
2 201735 2017-08-28 201708
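Since the question allows an int64 result, the month key can also be built arithmetically from the parsed dates instead of going through strftime. This is a sketch under the assumption that Week is a string column, as the question's use of .str suggests. The name monday is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Week": ["201729", "201730", "201735"]})

# Append "1" (Monday) so that %Y%W%w identifies one concrete day per week.
monday = pd.to_datetime(df["Week"] + "1", format="%Y%W%w")

# Integer YYYYMM key built from the date components.
df["Month"] = monday.dt.year * 100 + monday.dt.month
print(df)
```

Note that a week spanning two months is assigned to the month of its Monday, which is the same choice the strftime version makes.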
If you need to convert from datetimes to a custom week format:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=10)})
df['date3'] = df['date'].dt.strftime('%Y %W %w')
print (df)
date date3
0 2017-01-01 2017 00 0
1 2017-01-02 2017 01 1
2 2017-01-03 2017 01 2
3 2017-01-04 2017 01 3
4 2017-01-05 2017 01 4
5 2017-01-06 2017 01 5
6 2017-01-07 2017 01 6
7 2017-01-08 2017 01 0
8 2017-01-09 2017 02 1
9 2017-01-10 2017 02 2