I have a table that looks like this:
date id
0 11:09:27 Nov. 26 2020 94857
1 10:49:26 Okt. 26 2020 94853
2 10:48:24 Sept. 26 2020 94852
3 9:26:33 Aug. 26 2020 94856
4 9:26:33 Jul. 26 2020 94851
5 9:24:38 Dez. 26 2020 94850
6 9:24:38 Jan. 26 2020 94849
7 9:09:08 Jun. 27 2019 32148
8 9:02:41 Mai 27 2019 32145
9 9:02:19 Apr. 27 2019 32144
10 9:02:05 Mrz. 27 2019 32143
11 9:02:05 Feb. 27 2019 32140
(initial table)
The date column currently has dtype 'object'. I'm trying to convert it to 'datetime' using
df['date'] = pd.to_datetime(df['date'], format ='HH:MM:SS-%mm-%dd-%YYYY', errors='coerce')
but I get only NaT as a result.
The problem is that the month names are not standard. For example, Mai comes without a trailing dot.
What's the best way to convert this column?
The following format works for most of your data:
format="%H:%M:%S %b. %d %Y"
%H stands for the hour, %M for the minute, %S for the second, %b for the abbreviated month name, %d for the day of the month, and %Y for the four-digit year.
As Justin said in the comments, your month abbreviations are off. Four-character abbreviations such as Sept. are unconventional, so you should preprocess the string: if the month name is 4 characters long, remove its last character; if it is 3 characters long, leave it as it is.
EDIT:
Note that in your dataset the abbreviations end with a ".", hence the dot in the format string.
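Keep in mind that %b matches the abbreviated month names of the active locale, so under an English locale the German abbreviations in the question (Okt., Dez., Mai, Mrz.) would still not match after trimming. One locale-independent workaround, sketched below, is to map the month token to its number before parsing. The MONTHS table and the parse_dates helper are hypothetical names, and the mapping assumes the abbreviations are exactly the ones shown in the question.

```python
import pandas as pd

# Assumption: these are the only month abbreviations that occur (taken from the question).
MONTHS = {"Jan": "01", "Feb": "02", "Mrz": "03", "Apr": "04", "Mai": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sept": "09", "Okt": "10", "Nov": "11", "Dez": "12"}

def parse_dates(col: pd.Series) -> pd.Series:
    """Parse strings like '9:02:41 Mai 27 2019' without relying on the locale."""
    # Split each string into time, month name (optional trailing dot), day and year.
    parts = col.str.extract(
        r"^(?P<time>\S+)\s+(?P<month>[A-Za-z]+)\.?\s+(?P<day>\d{1,2})\s+(?P<year>\d{4})$"
    )
    # Replace the month name by its two-digit number.
    month = parts["month"].map(MONTHS)
    text = parts["time"] + " " + month + " " + parts["day"] + " " + parts["year"]
    # Rows that did not match the pattern end up as NaT.
    return pd.to_datetime(text, format="%H:%M:%S %m %d %Y", errors="coerce")

sample = pd.Series(["11:09:27 Nov. 26 2020", "9:02:41 Mai 27 2019", "9:02:05 Mrz. 27 2019"])
print(parse_dates(sample))
```

With a DataFrame like the one in the question, this would be applied as df['date'] = parse_dates(df['date']).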
This works for me, even with the inconsistent abbreviations (it assumes the time is zero-padded to eight characters):
pd.to_datetime(df.date.str[9:]+' '+df.date.str[0:8])
Input (randomly generated dates; row 7 changed to use Sept.):
date
0 19:06:04 Mar. 19 2020
1 17:27:11 Mar. 05 2020
2 07:17:04 May. 05 2020
3 04:53:50 Sep. 23 2020
4 03:43:20 Jun. 23 2020
5 17:35:00 Mar. 06 2020
6 06:04:48 Jan. 15 2020
7 12:26:14 Sept. 18 2020
8 03:21:10 Jun. 03 2020
9 17:37:00 Aug. 26 2020
Output:
0 2020-03-19 19:06:04
1 2020-03-05 17:27:11
2 2020-05-05 07:17:04
3 2020-09-23 04:53:50
4 2020-06-23 03:43:20
5 2020-03-06 17:35:00
6 2020-01-15 06:04:48
7 2020-09-18 12:26:14
8 2020-06-03 03:21:10
9 2020-08-26 17:37:00
The following works for all three-letter month abbreviations, but fails on the 'Sept.' month. This is strange, because when the date precedes the time, pandas parses it correctly by default:
pd.to_datetime(df['date'].astype(str), format ="%H:%M:%S %b. %d %Y",
errors='coerce')
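If the explicit format is preferred, one option is to rewrite 'Sept.' to 'Sep.' before parsing. This is only a sketch: parse_with_sept is a hypothetical helper, and it assumes English month abbreviations and an English (or default C) locale, since %b is locale-dependent.

```python
import pandas as pd

def parse_with_sept(col: pd.Series) -> pd.Series:
    """Parse 'HH:MM:SS Mon. DD YYYY' strings, tolerating the four-letter 'Sept.'."""
    # Rewrite 'Sept.' as 'Sep.' so that %b can match it.
    normalized = col.str.replace(r"\bSept\.", "Sep.", regex=True)
    return pd.to_datetime(normalized, format="%H:%M:%S %b. %d %Y", errors="coerce")

sample = pd.Series(["12:26:14 Sept. 18 2020", "07:17:04 May. 05 2020"])
print(parse_with_sept(sample))
```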
Your date column has an unusual format, so pass a matching format string to pd.to_datetime:
# 11:09:27 Nov. 26 2020 ---> '%H:%M:%S %b. %d %Y'
# %H is the 24-hour clock; %I would only accept hours 1 to 12
df['date'] = pd.to_datetime(df['date'], format='%H:%M:%S %b. %d %Y', errors='coerce')
Output for the first row: 2020-11-26 11:09:27
I am parsing bank statement PDFs that have the full start and end dates in question both in the filename and in the document, but the actual entries corresponding to the transactions only contain the day and month ('%d %b'). Here is what the series looks like for the "Date" column:
1
2 24 Dec
3 27 Dec
4
5
6 30 Dec
7
8 31 Dec
9
10
11 2 Jan
12
13 3 Jan
14 6 Jan
15 14 Jan
16 15 Jan
I have the start and end dates as 2013-12-23 and 2014-01-23. What is an efficient way to populate this series/column with the correct full date given the start and end range? I would like any existing date to forward fill the same date to the next date so:
1
2 24 Dec 2013
3 27 Dec 2013
4 27 Dec 2013
5 27 Dec 2013
6 30 Dec 2013
7 30 Dec 2013
8 31 Dec 2013
9 31 Dec 2013
10 31 Dec 2013
11 2 Jan 2014
12 2 Jan 2014
13 3 Jan 2014
14 6 Jan 2014
15 14 Jan 2014
16 15 Jan 2014
The date format is irrelevant as long as the result is a datetime type. I was hoping to use something built into pandas, but I can't figure out what to use. The best I've come up with so far is to check each date against the start and end dates and fill in the year based on where the date fits into the range, but it is inefficient to run this on the whole column. Any help, advice or tips would be appreciated. Thanks in advance.
EDIT: I am hoping for a general procedural solution that can apply to any start/end date and any set of transactions, not just this particular series, although I think this is a good test case because it spans the end of a year.
EDIT2: What I have so far after posting this question is the following, which doesn't seem terribly efficient but seems to work:
import numpy as np
from datetime import datetime

def add_year(date, start, end):
    # Empty entries become NaN so they can be forward-filled afterwards.
    if not date:
        return np.nan
    # Try the start year first; fall back to the end year if out of range.
    test_date = datetime.strptime("{} {}".format(date, start.year), '%d %b %Y').date()
    if start <= test_date <= end:
        return test_date
    return datetime.strptime("{} {}".format(date, end.year), '%d %b %Y').date()

df['Date'] = df.Date.map(lambda date: add_year(date, start_date, end_date))
df['Date'] = df['Date'].ffill()
Try:
# convert the strings 'nan'/'NaN' to actual NaN values
df['Date'] = df['Date'].replace('nan|NaN', float('NaN'), regex=True)
# forward-fill the NaN values
df['Date'] = df['Date'].ffill()
# flag the rows whose Date contains 'Dec' (missing values count as False)
c = df['Date'].str.contains('Dec', na=False)
# index of the last row where condition c holds
idx = df[c].index[-1]
# append 2013 to 'Date' up to and including that row
df.loc[:idx, 'Date'] = df.loc[:idx, 'Date'] + ' 2013'
# append 2014 to 'Date' for all later rows
df.loc[idx+1:, 'Date'] = df.loc[idx+1:, 'Date'] + ' 2014'
# finally convert the values to datetime
df['Date'] = pd.to_datetime(df['Date'])
Output of df:
Date
0 NaT
1 2013-12-24
2 2013-12-27
3 2013-12-27
4 2013-12-27
5 2013-12-30
6 2013-12-30
7 2013-12-31
8 2013-12-31
9 2013-12-31
10 2014-01-02
11 2014-01-02
12 2014-01-03
13 2014-01-06
14 2014-01-14
15 2014-01-15
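The answer above appends 2013 and 2014 literally. For an arbitrary start/end pair, a vectorized variant could look like the sketch below. fill_dates is a hypothetical helper; it assumes the range is shorter than one year, that entries look like '24 Dec' with English month abbreviations, and that blank entries are empty strings or missing values.

```python
import pandas as pd

def fill_dates(col: pd.Series, start, end) -> pd.Series:
    """Forward-fill 'DD Mon' strings and attach the year that falls inside [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Treat empty strings as missing, then carry the last seen date forward.
    filled = col.mask(col.eq("")).ffill()
    # Candidate dates using the start year and the end year.
    with_start_year = pd.to_datetime(filled + f" {start.year}", format="%d %b %Y", errors="coerce")
    with_end_year = pd.to_datetime(filled + f" {end.year}", format="%d %b %Y", errors="coerce")
    # Keep the start-year candidate when it lies in the range, otherwise use the end year.
    in_range = with_start_year.between(start, end)
    return with_start_year.where(in_range, with_end_year)

dates = pd.Series([None, "24 Dec", "27 Dec", None, "2 Jan", None, "15 Jan"])
print(fill_dates(dates, "2013-12-23", "2014-01-23"))
```

Rows before the first known date stay NaT, as in the output above.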
I have a dataframe df with Date column:
Date
--------
Wed 23 Dec
Sat 28 Nov
Thu 26 Nov
Sun 22 Nov
Tue 1 Dec
Wed 2 Dec
The Date column has object dtype. I want to change the format, using format="%m-%d-%Y", into yyyy-dd-mm.
Expected output df:
Date
---------
2020-23-12
2020-28-11
2020-26-11
2020-22-11
2020-01-12
2020-02-12
Thanks in advance for the help!
Use to_datetime with a format that matches the original data plus an appended year; this gives a column of datetimes:
df['Date'] = pd.to_datetime(df['Date']+'2020', format="%a %d %b%Y")
print (df)
Date
0 2020-12-23
1 2020-11-28
2 2020-11-26
3 2020-11-22
4 2020-12-01
5 2020-12-02
If you need a custom format, add Series.dt.strftime; note that the datetimes are lost and you get strings:
df['Date'] = pd.to_datetime(df['Date']+'2020', format="%a %d %b%Y").dt.strftime("%Y-%d-%m")
print (df)
Date
0 2020-23-12
1 2020-28-11
2 2020-26-11
3 2020-22-11
4 2020-01-12
5 2020-02-12
I have dates as below:
date
0 Today, 12 Mar
1 Tomorrow, 13 Mar
2 Tomorrow, 13 Mar
3 Tomorrow, 13 Mar
4 Tomorrow, 13 Mar
5 14 Mar 2021
6 14 Mar 2021
7 14 Mar 2021
8 14 Mar 2021
9 15 Mar 2021
How do I parse it as datetime in pandas?
Your dates contain 'Today' and 'Tomorrow', which as far as I know are not part of any valid datetime format, so first replace them with the year (assuming the year is fixed, i.e. 2021):
df['date']=df['date'].str.replace('Today','2021')
df['date']=df['date'].str.replace('Tomorrow','2021')
Now just use the to_datetime() method:
df['date']=pd.to_datetime(df['date'])
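A variant that does not depend on how the default parser reads a string like '2021, 12 Mar' is to drop the 'Today,'/'Tomorrow,' prefix, append the year only where it is missing, and parse with an explicit format. This is a sketch under the same assumption as above: the year is known and passed in as the year parameter. parse_relative_dates is a hypothetical helper name.

```python
import pandas as pd

def parse_relative_dates(col: pd.Series, year: int) -> pd.Series:
    """Parse values such as 'Today, 12 Mar' and '14 Mar 2021' into datetimes."""
    # The labels carry no extra information because the day and month follow them.
    cleaned = col.str.replace(r"^(?:Today|Tomorrow),\s*", "", regex=True)
    # Append the assumed year only where the string does not already end in one.
    has_year = cleaned.str.contains(r"\d{4}$")
    cleaned = cleaned.where(has_year, cleaned + f" {year}")
    return pd.to_datetime(cleaned, format="%d %b %Y")

sample = pd.Series(["Today, 12 Mar", "Tomorrow, 13 Mar", "14 Mar 2021"])
print(parse_relative_dates(sample, 2021))
```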
I have a dataframe which consists of departments, year, the month of invoice, the invoice date and the value.
I have offset the Invoice dates by business days and now what I am trying to achieve is to combine all the months that have the same number of working days (so the 'count' of each month by year) and average the value for each day.
The data I have is as follows:
Department Year Month Invoice Date Value
0 Sales 2019 March 2019-03-25 1000.00
1 Sales 2019 March 2019-03-26 2000.00
2 Sales 2019 March 2019-03-27 3000.00
3 Sales 2019 March 2019-03-28 4000.00
4 Sales 2019 March 2019-03-29 5000.00
... ... ... ... ... ...
2435 Specialist 2020 August 2020-08-27 6000.00
2436 Specialist 2020 August 2020-08-28 7000.00
2437 Specialist 2020 September 2020-09-01 8000.00
2438 Specialist 2020 September 2020-09-02 9000.00
2439 Specialist 2020 September 2020-09-07 1000.00
The count of each month is as follows:
Year Month
2019 April 21
August 21
December 20
July 23
June 20
March 5
May 21
November 21
October 23
September 21
2020 April 21
August 20
February 20
January 22
July 23
June 22
March 22
May 19
September 5
My hope is that using this count I could aggregate the data from the original df and average for example April, August, May, November, September (2019) along with April (2020) as they all have 21 working days in the month.
The result would be one dataframe in which each day of the month holds the average over the combined months, for each number of working days.
I hope that makes sense.
Note: Please ignore the 5 days length, just incomplete data for those months...
Thank you
EDIT: I just realised that the days won't line up for each month, so my plan is to aggregate based on whether it's the first business day of the month, then the second, the third, etc., regardless of the actual date.
ALSO (SORRY): I was hoping it could be by department!
Department Month Length Day Number Average Value
0 Sales 21 1 20000
1 Sales 21 2 5541
2 Sales 21 3 87485
3 Sales 21 4 1863
4 Sales 21 5 48687
5 Sales 21 6 486996
6 Sales 21 7 892
7 Sales 21 8 985
8 Sales 21 9 14169
9 Sales 21 10 20000
10 Sales 21 11 5541
11 Sales 21 12 87485
12 Sales 21 13 1863
13 Sales 21 14 48687
14 Sales 21 15 486996
15 Sales 21 16 892
16 Sales 21 17 985
17 Sales 21 18 14169
......
To explain it a bit better, let's take Sales and all the months that have 21 days in them: for each day in those 21-day months I am hoping to get the average of the value, giving a table that looks like the one above.
So 'day 1' is the average of all 'day 1s' in the 21-day months (as seen in the count df)! This would let me plot a line chart profile showing the average revenue value on each given day in a 21-day month. I hope this is a better explanation, apologies.
I am not really sure whether I understand your question. Maybe you could add an expected df to your question?
In the meantime, would this point you in the right direction?
import pandas as pd
from random import randint
from calendar import month_name
df = pd.DataFrame({'years': [randint(1998, 2020) for x in range(10000)],
'months': [month_name[randint(1, 12)] for x in range(10000)],
'days': [randint(1, 30) for x in range(10000)],
'revenue': [randint(0, 1000) for x in range(10000)]}
)
print(df.groupby(['months', 'days'])['revenue'].mean())
Output is:
months days
April 1 475.529412
2 542.870968
3 296.045455
4 392.416667
5 475.571429
September 26 516.888889
27 539.583333
28 513.500000
29 480.724138
30 456.500000
Name: revenue, Length: 360, dtype: float64
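For the layout the question describes (Department, Month Length, Day Number, Average Value), a possible sketch is shown below. day_number_profile is a hypothetical helper, and it assumes one row per department per business day, so that the row count of a month equals its number of working days and the row position within the month is the business-day number.

```python
import pandas as pd

def day_number_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Average Value per department, month length and business-day number."""
    out = df.copy()
    out["Invoice Date"] = pd.to_datetime(out["Invoice Date"])
    out = out.sort_values("Invoice Date")
    # Calendar month of each invoice, e.g. 2019-03.
    out["_month"] = out["Invoice Date"].dt.to_period("M")
    grouped = out.groupby(["Department", "_month"])
    # 1 for the first invoice day of the month, 2 for the second, and so on.
    out["Day Number"] = grouped.cumcount() + 1
    # Number of invoice days in that department's month.
    out["Month Length"] = grouped["Value"].transform("size")
    return (out.groupby(["Department", "Month Length", "Day Number"])["Value"]
               .mean().rename("Average Value").reset_index())

demo = pd.DataFrame({
    "Department": ["Sales"] * 4,
    "Invoice Date": ["2019-03-01", "2019-03-04", "2019-04-01", "2019-04-02"],
    "Value": [10.0, 20.0, 30.0, 40.0],
})
print(day_number_profile(demo))
```

Filtering the result on Month Length == 21 and one department would then give the series to plot as a line chart.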
df2 = pd.DataFrame({'person_id':[11,11,11,11,11,12,12,13,13,14,14,14,14],
'admit_date':['01/01/2011','01/01/2009','12/31/2013','12/31/2017','04/03/2014','08/04/2016',
'03/05/2014','02/07/2011','08/08/2016','12/31/2017','05/01/2011','05/21/2014','07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is
a) Exclude/filter out records from the data frame if a subject has Dec 31st and Jan 1st in its records. Please note that year doesn't matter.
If a subject has either Dec 31st or Jan 1st, we leave them as is.
But if they have both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that they could have multiple entries with the same date as well, like person_id = 11.
I could only do the below:
# This also excludes subjects who have only Dec 31st, 2017 in their records.
# How can I match on the month and day and not consider the year?
df2_new = df2['dates'] != '2017-12-31'
df2[df2_new]
My expected output is as described below.
For person_id = 11, we drop 12-31 because it had both 12-31 and 01-01 in their records whereas for person_id = 14, we don't drop 12-31 because it has only 12-31 in its records.
We drop 12-31 only when both 12-31 and 01-01 appear in a person's records.
Use:
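import numpy as np
# month-day string for each record, ignoring the year
s = df2['dates'].dt.strftime('%m-%d')
# per person: does any record fall on Jan 1st / on Dec 31st?
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')
# drop the 12-31 rows only for people who have both dates; keep everything else
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]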
Result:
# print(df3)
person_id variable dates
0 11 admit_date 2011-01-01
1 11 admit_date 2009-01-01
4 11 admit_date 2014-04-03
5 12 admit_date 2016-08-04
6 12 admit_date 2014-03-05
7 13 admit_date 2011-02-07
8 13 admit_date 2016-08-08
9 14 admit_date 2017-12-31
10 14 admit_date 2011-05-01
11 14 admit_date 2014-05-21
12 14 admit_date 2016-07-12
Another way:
Convert the dates to day-month strings.
Create a temporary column in which 31 Dec is converted to 01 Jan.
Drop duplicates by person id and the temporary column, keeping the first.
df2['dates']=df2['dates'].dt.strftime('%d %b')
df2=df2.assign(check=np.where(df2.dates=='31 Dec','01 Jan', df2.dates)).drop_duplicates(['person_id', 'variable', 'check'], keep='first').drop(columns=['check'])
Output (shown with the temporary check column, before it is dropped):
person_id variable dates check
0 11 admit_date 01 Jan 01 Jan
4 11 admit_date 03 Apr 03 Apr
5 12 admit_date 04 Aug 04 Aug
6 12 admit_date 05 Mar 05 Mar
7 13 admit_date 07 Feb 07 Feb
8 13 admit_date 08 Aug 08 Aug
9 14 admit_date 31 Dec 01 Jan
10 14 admit_date 01 May 01 May
11 14 admit_date 21 May 21 May
12 14 admit_date 12 Jul 12 Jul