I have dates as below:
date
0 Today, 12 Mar
1 Tomorrow, 13 Mar
2 Tomorrow, 13 Mar
3 Tomorrow, 13 Mar
4 Tomorrow, 13 Mar
5 14 Mar 2021
6 14 Mar 2021
7 14 Mar 2021
8 14 Mar 2021
9 15 Mar 2021
How do I parse it as datetime in pandas?
Your dates contain 'Today' and 'Tomorrow', which are not part of any standard datetime format, so first replace them with the year (assuming the year is fixed, i.e. 2021):
df['date']=df['date'].str.replace('Today','2021')
df['date']=df['date'].str.replace('Tomorrow','2021')
Then convert the column with pd.to_datetime():
df['date']=pd.to_datetime(df['date'])
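A variant of the same idea, as a minimal sketch: instead of swapping the words for the year, strip the 'Today, ' / 'Tomorrow, ' prefix, append the year only where it is missing, and parse with an explicit format. The fixed year 2021 and the small sample frame are assumptions taken from the question, not something pandas infers.

```python
import pandas as pd

# Sample rows in the same shape as the question's data
df = pd.DataFrame({"date": ["Today, 12 Mar", "Tomorrow, 13 Mar",
                            "14 Mar 2021", "15 Mar 2021"]})

year = 2021  # assumption: every relative entry belongs to this year

# Drop the leading "Today, " / "Tomorrow, " label
cleaned = df["date"].str.replace(r"^(Today|Tomorrow),\s*", "", regex=True)

# Append the year only to entries that do not already end in a 4-digit year
has_year = cleaned.str.contains(r"\d{4}$")
cleaned = cleaned.where(has_year, cleaned + f" {year}")

# An explicit format avoids relying on format guessing
df["date"] = pd.to_datetime(cleaned, format="%d %b %Y")
print(df)
```

With an explicit format, any row that does not match raises an error instead of being silently misread.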
I am parsing bank statement PDFs that have the full start and end dates in question both in the filename and in the document, but the actual entries corresponding to the transactions only contain the day and month ('%d %b'). Here is what the series looks like for the "Date" column:
1
2 24 Dec
3 27 Dec
4
5
6 30 Dec
7
8 31 Dec
9
10
11 2 Jan
12
13 3 Jan
14 6 Jan
15 14 Jan
16 15 Jan
I have the start and end dates as 2013-12-23 and 2014-01-23. What is an efficient way to populate this series/column with the correct full date given the start and end range? I would like any existing date to forward fill the same date to the next date so:
1
2 24 Dec 2013
3 27 Dec 2013
4 27 Dec 2013
5 27 Dec 2013
6 30 Dec 2013
7 30 Dec 2013
8 31 Dec 2013
9 31 Dec 2013
10 31 Dec 2013
11 2 Jan 2014
12 2 Jan 2014
13 3 Jan 2014
14 6 Jan 2014
15 14 Jan 2014
16 15 Jan 2014
The date format is irrelevant as long as it is a datetime format. I was hoping to use something internal to pandas, but I can't figure out what to use and right now a check against the start and end dates and filling out the year based on where the date fits into the range is the best I've come up with, but it is inefficient to have to run this on the whole column. Any help/advice/tips would be appreciated, thanks in advance.
EDIT: just wanted to add that I am hoping for a general procedural solution that can apply to any start/end date and set of transactions and not just this particular series included although I think this is a good test case as it has an end of year overlap.
EDIT2: what I have so far after posting this question is the following, which doesn't seem terribly efficient, but seems to work:
import numpy as np
from datetime import datetime

def add_year(date, start, end):
    # start and end are datetime.date objects; blank entries stay missing
    if not date:
        return np.nan
    # Try the start year first
    test_date = datetime.strptime("{} {}".format(date, start.year), '%d %b %Y').date()
    if start <= test_date <= end:
        return test_date
    # Otherwise the entry belongs to the end year
    return datetime.strptime("{} {}".format(date, end.year), '%d %b %Y').date()

df['Date'] = df.Date.map(lambda date: add_year(date, start_date, end_date))
df['Date'] = df['Date'].ffill()
Try:
# Convert the string 'nan'/'NaN' to real NaN values
df['Date']=df['Date'].replace('nan|NaN',float('NaN'),regex=True)
# Forward-fill the NaN values
df['Date']=df['Date'].ffill()
# Flag the rows whose Date contains 'Dec' (missing values count as False)
c=df['Date'].str.contains('Dec', na=False)
# Get the index of the last row where condition c holds
idx=df[c].index[-1]
# Append 2013 to 'Date' up to and including that index
df.loc[:idx,'Date']=df.loc[:idx,'Date']+' 2013'
# Append 2014 to 'Date' for all rows after that index
df.loc[idx+1:,'Date']=df.loc[idx+1:,'Date']+' 2014'
# Finally convert these values to datetime
df['Date']=pd.to_datetime(df['Date'])
output of df:
Date
0 NaT
1 2013-12-24
2 2013-12-27
3 2013-12-27
4 2013-12-27
5 2013-12-30
6 2013-12-30
7 2013-12-31
8 2013-12-31
9 2013-12-31
10 2014-01-02
11 2014-01-02
12 2014-01-03
13 2014-01-06
14 2014-01-14
15 2014-01-15
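The question also asked for a general procedure that works for any start/end range. A minimal vectorized sketch, assuming the statement period is shorter than one year and that blank entries are empty strings: parse every entry with the start year, push anything that lands before the start date into the following year, then forward-fill. The variable names are illustrative.

```python
import pandas as pd

start = pd.Timestamp("2013-12-23")
end = pd.Timestamp("2014-01-23")

# Day/month entries as in the question; blanks are empty strings (assumption)
s = pd.Series(["", "24 Dec", "27 Dec", "", "", "30 Dec", "", "31 Dec",
               "", "", "2 Jan", "", "3 Jan", "6 Jan", "14 Jan", "15 Jan"])

# Parse everything as if it fell in the start year; blanks become NaT
parsed = pd.to_datetime(s + f" {start.year}", format="%d %b %Y", errors="coerce")

# Anything earlier than the start date must belong to the following year
before_start = parsed < start
parsed = parsed.where(~before_start, parsed + pd.DateOffset(years=1))

# Forward-fill so rows without a date inherit the previous one
parsed = parsed.ffill()

# Sanity check: every filled date should fall inside the statement period
assert parsed.dropna().between(start, end).all()
print(parsed)
```

One caveat of this sketch: an entry such as "29 Feb" is coerced to NaT when the start year is not a leap year, so periods spanning a leap day would need extra handling.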
I have a table, that looks like this:
date id
0 11:09:27 Nov. 26 2020 94857
1 10:49:26 Okt. 26 2020 94853
2 10:48:24 Sept. 26 2020 94852
3 9:26:33 Aug. 26 2020 94856
4 9:26:33 Jul. 26 2020 94851
5 9:24:38 Dez. 26 2020 94850
6 9:24:38 Jan. 26 2020 94849
7 9:09:08 Jun. 27 2019 32148
8 9:02:41 Mai 27 2019 32145
9 9:02:19 Apr. 27 2019 32144
10 9:02:05 Mrz. 27 2019 32143
11 9:02:05 Feb. 27 2019 32140
(initial table)
The date column currently has dtype 'object'. I'm trying to convert it to 'datetime' using
df['date'] = pd.to_datetime(df['date'], format ='HH:MM:SS-%mm-%dd-%YYYY', errors='coerce')
but I get only NaT as a result.
The problem is that the month names here are not standard. For example, Mai comes without a trailing dot.
What's the best way to convert the column?
The following format works for most of your data:
format="%H:%M:%S %b. %d %Y"
H stands for hours, M for minutes, S for seconds, b for the abbreviated month name, d for the day of the month, and Y for the four-digit year.
As Justin said in the comments, some of your month abbreviations are unconventional. If a month abbreviation is four characters long, trim its last character before parsing; if it is three characters long, leave it as it is.
EDIT:
Note that in your dataset the abbreviations end with a ".", hence the dot in the format string.
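The %b directive typically matches the month abbreviations of the active locale, so German forms such as Okt., Dez., Mrz. and Mai would not match under an English locale. One way around both that and the inconsistent dots, as a sketch: split each string with a regular expression, translate the month through an explicit lookup table, and parse the rebuilt ISO-style string. The lookup table below is an assumption built from the abbreviations visible in the question; extend it if the data contains other spellings.

```python
import pandas as pd

df = pd.DataFrame({"date": ["11:09:27 Nov. 26 2020", "10:49:26 Okt. 26 2020",
                            "10:48:24 Sept. 26 2020", "9:02:41 Mai 27 2019",
                            "9:02:05 Mrz. 27 2019", "9:24:38 Dez. 26 2020"]})

# Assumed mapping from the abbreviations in the question to month numbers
month_numbers = {"Jan": "01", "Feb": "02", "Mrz": "03", "Apr": "04",
                 "Mai": "05", "Jun": "06", "Jul": "07", "Aug": "08",
                 "Sept": "09", "Okt": "10", "Nov": "11", "Dez": "12"}

# Split into time, month (optional trailing dot is discarded), day, year
parts = df["date"].str.extract(
    r"^(?P<time>\d{1,2}:\d{2}:\d{2})\s+(?P<month>[A-Za-z]+)\.?\s+"
    r"(?P<day>\d{1,2})\s+(?P<year>\d{4})$")

# Rebuild a zero-padded "YYYY-MM-DD HH:MM:SS" string; unknown months give NaN
iso = (parts["year"] + "-" + parts["month"].map(month_numbers) + "-"
       + parts["day"].str.zfill(2) + " " + parts["time"].str.zfill(8))

df["date"] = pd.to_datetime(iso, format="%Y-%m-%d %H:%M:%S", errors="coerce")
print(df)
```

Rows whose month is not in the table end up as NaT, which makes unexpected spellings easy to spot.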
This works for me, even with the inconsistent month abbreviations. Note that the slices assume the time is zero-padded to eight characters:
pd.to_datetime(df.date.str[9:]+' '+df.date.str[0:8])
Input (random generated dates, changed 7 to have Sept.)
date
0 19:06:04 Mar. 19 2020
1 17:27:11 Mar. 05 2020
2 07:17:04 May. 05 2020
3 04:53:50 Sep. 23 2020
4 03:43:20 Jun. 23 2020
5 17:35:00 Mar. 06 2020
6 06:04:48 Jan. 15 2020
7 12:26:14 Sept. 18 2020
8 03:21:10 Jun. 03 2020
9 17:37:00 Aug. 26 2020
output
0 2020-03-19 19:06:04
1 2020-03-05 17:27:11
2 2020-05-05 07:17:04
3 2020-09-23 04:53:50
4 2020-06-23 03:43:20
5 2020-03-06 17:35:00
6 2020-01-15 06:04:48
7 2020-09-18 12:26:14
8 2020-06-03 03:21:10
9 2020-08-26 17:37:00
The following works for every three-letter month abbreviation, but fails on the 'Sept.' row, which is coerced to NaT. Oddly, when the date precedes the time, the default parser seems to handle it correctly:
pd.to_datetime(df['date'].astype(str), format ="%H:%M:%S %b. %d %Y",
errors='coerce')
Your date column has a non-default format, so pass a matching format string to pd.to_datetime. Use %H (24-hour clock) rather than %I so that hours above 12 also parse:
# 11:09:27 Nov. 26 2020 ---> '%H:%M:%S %b. %d %Y'
df['date'] = pd.to_datetime(df['date'], format ='%H:%M:%S %b. %d %Y', errors='coerce')
output: 2020-11-26 11:09:27
I'm trying to sort a grouped data using Pandas
My code :
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust   year  month
astor  2015  Feb      1
             Jan      2
       2016  Feb      2
             Mar      1
       2017  Apr      1
             Mar      1
tor    2015  Feb      2
             Jan      1
       2016  Apr      1
             Mar      1
             May      1
       2017  Apr      1
             Mar      1
How to get output sorted by month?
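Add the parameter sort=False to groupby. This keeps the groups in order of first appearance, so it works when the rows are already in chronological order: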
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust   year  month
astor  2015  Jan      2
             Feb      1
       2016  Feb      2
             Mar      1
       2017  Mar      1
             Apr      1
tor    2015  Jan      1
             Feb      2
       2016  Mar      1
             Apr      1
             May      1
       2017  Mar      1
             Apr      1
Name: price, dtype: int64
If the first solution is not possible, convert the months to datetimes, group, and finally convert them back to month names:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust   year  month
astor  2015  Jan      2
             Feb      1
       2016  Feb      2
             Mar      1
       2017  Mar      1
             Apr      1
tor    2015  Jan      1
             Feb      2
       2016  Mar      1
             Apr      1
             May      1
       2017  Mar      1
             Apr      1
Name: price, dtype: int64
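A third option, sketched here as an alternative rather than as part of the answer above: make month an ordered categorical, so that groupby sorts it in calendar order regardless of the row order in the file. The inline CSV stands in for data3.txt, and observed=True is used to keep only the combinations that actually occur.

```python
import io
import pandas as pd

# Inline copy of the question's data (stands in for ./data3.txt)
csv = """cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
"""
df = pd.read_csv(io.StringIO(csv))

# Calendar order for the month abbreviations
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df["month"] = pd.Categorical(df["month"], categories=month_order, ordered=True)

# Sorting now follows the category order instead of the alphabet
grouped = df.groupby(["cust", "year", "month"], observed=True)["price"].count()
print(grouped)
```

Unlike sort=False, this does not depend on the input rows already being in chronological order.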
df2 = pd.DataFrame({'person_id':[11,11,11,11,11,12,12,13,13,14,14,14,14],
'admit_date':['01/01/2011','01/01/2009','12/31/2013','12/31/2017','04/03/2014','08/04/2016',
'03/05/2014','02/07/2011','08/08/2016','12/31/2017','05/01/2011','05/21/2014','07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'])
What I would like to do is:
a) Exclude records from the data frame if a subject has both Dec 31st and Jan 1st in its records. The year does not matter.
If a subject has only Dec 31st or only Jan 1st, we leave the records as they are.
But if a subject has both Dec 31st and Jan 1st, we remove one of them (either Dec 31st or Jan 1st). Note that a subject can have multiple entries with the same date, like person_id = 11.
I could only come up with the below:
df2_new = df2['dates'] != '2017-12-31' #but this excludes if a subject has only `Dec 31st on 2017`. How can I ignore the dates and not consider `year`
df2[df2_new]
My expected output is like as shown below
For person_id = 11, we drop 12-31 because it had both 12-31 and 01-01 in their records whereas for person_id = 14, we don't drop 12-31 because it has only 12-31 in its records.
We drop 12-31 only when both 12-31 and 01-01 appear in a person's records.
Use:
s = df2['dates'].dt.strftime('%m-%d')
m1 = s.eq('01-01').groupby(df2['person_id']).transform('any')
m2 = s.eq('12-31').groupby(df2['person_id']).transform('any')
m3 = np.select([m1 & m2, m1 | m2], [s.ne('12-31'), True], default=True)
df3 = df2[m3]
Result:
# print(df3)
person_id variable dates
0 11 admit_date 2011-01-01
1 11 admit_date 2009-01-01
4 11 admit_date 2014-04-03
5 12 admit_date 2016-08-04
6 12 admit_date 2014-03-05
7 13 admit_date 2011-02-07
8 13 admit_date 2016-08-08
9 14 admit_date 2017-12-31
10 14 admit_date 2011-05-01
11 14 admit_date 2014-05-21
12 14 admit_date 2016-07-12
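The np.select call above boils down to one rule: drop a row only if it is a Dec 31st row and its person has both Dec 31st and Jan 1st. A sketch of the same filter written as a single boolean mask, with illustrative variable names and an explicit date format that is an assumption based on the sample data:

```python
import pandas as pd

df2 = pd.DataFrame({
    'person_id': [11, 11, 11, 11, 11, 12, 12, 13, 13, 14, 14, 14, 14],
    'admit_date': ['01/01/2011', '01/01/2009', '12/31/2013', '12/31/2017',
                   '04/03/2014', '08/04/2016', '03/05/2014', '02/07/2011',
                   '08/08/2016', '12/31/2017', '05/01/2011', '05/21/2014',
                   '07/12/2016']})
df2 = df2.melt('person_id', value_name='dates')
df2['dates'] = pd.to_datetime(df2['dates'], format='%m/%d/%Y')

# Month-day string, so the year is ignored
md = df2['dates'].dt.strftime('%m-%d')

# Per-person flags, broadcast back to every row of that person
has_jan1 = md.eq('01-01').groupby(df2['person_id']).transform('any')
has_dec31 = md.eq('12-31').groupby(df2['person_id']).transform('any')

# Drop Dec 31st rows only for people who have both dates
df3 = df2[~(has_jan1 & has_dec31 & md.eq('12-31'))]
print(df3)
```

For the sample data this keeps the same rows as the result shown above: person 11 loses both Dec 31st rows, while person 14 keeps its only Dec 31st row.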
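Another way:
Convert the dates to day and month only.
Create a temporary column in which 31 Dec is mapped to 01 Jan.
Drop duplicates on person_id and the temporary column, keeping the first.
Note that this also collapses repeated occurrences of the same day and month for a person. The output below is shown with the temporary check column still in place.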
df2['dates']=df2['dates'].dt.strftime('%d %b')
df2=df2.assign(check=np.where(df2.dates=='31 Dec','01 Jan', df2.dates)).drop_duplicates(['person_id', 'variable', 'check'], keep='first').drop(columns=['check'])
person_id variable dates check
0 11 admit_date 01 Jan 01 Jan
4 11 admit_date 03 Apr 03 Apr
5 12 admit_date 04 Aug 04 Aug
6 12 admit_date 05 Mar 05 Mar
7 13 admit_date 07 Feb 07 Feb
8 13 admit_date 08 Aug 08 Aug
9 14 admit_date 31 Dec 01 Jan
10 14 admit_date 01 May 01 May
11 14 admit_date 21 May 21 May
12 14 admit_date 12 Jul 12 Jul
I have been collecting Twitter data for a couple of days now and, among other things, I need to analyze how content propagates. I created a list of timestamps when users were interested in content and imported twitter timestamps in pandas df with the column name 'timestamps'. It looks like this:
0 Sat Dec 14 05:13:28 +0000 2013
1 Sat Dec 14 05:21:12 +0000 2013
2 Sat Dec 14 05:23:10 +0000 2013
3 Sat Dec 14 05:27:54 +0000 2013
4 Sat Dec 14 05:37:43 +0000 2013
5 Sat Dec 14 05:39:38 +0000 2013
6 Sat Dec 14 05:41:39 +0000 2013
7 Sat Dec 14 05:43:46 +0000 2013
8 Sat Dec 14 05:44:50 +0000 2013
9 Sat Dec 14 05:47:33 +0000 2013
10 Sat Dec 14 05:49:29 +0000 2013
11 Sat Dec 14 05:55:03 +0000 2013
12 Sat Dec 14 05:59:09 +0000 2013
13 Sat Dec 14 05:59:45 +0000 2013
14 Sat Dec 14 06:17:19 +0000 2013
etc. What I want to do is to sample every 10min and count how many users are interested in content in each time frame. My problem is that I have no clue how to process the timestamps I imported from Twitter. Should I use regular expressions or is there any better approach to this? I would appreciate if someone could provide some pointers. Thanks!
That is not ISO 8601 format, but pd.to_datetime can usually parse it without an explicit format:
>>> df[:2]
timestamp
0 Sat Dec 14 05:13:28 +0000 2013
1 Sat Dec 14 05:21:12 +0000 2013
>>> df['timestamp'] = pd.to_datetime(df['timestamp'])
>>> df[:2]
timestamp
0 2013-12-14 05:13:28
1 2013-12-14 05:21:12
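To resample, make the timestamp column the index and call resample. The call below uses the older API, where the aggregation is passed as the second argument; in current pandas the equivalent is df.resample('20min').count():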
>>> df.index = df['timestamp']
>>> df.resample('20Min', 'count')
2013-12-14 05:00:00 timestamp 1
2013-12-14 05:20:00 timestamp 5
2013-12-14 05:40:00 timestamp 8
2013-12-14 06:00:00 timestamp 1
dtype: int64
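For the 10-minute sampling asked about in the question, here is a minimal sketch with the current resample API. The explicit format string is an assumption that matches the sample timestamps (and an English locale for the day and month names); because of %z the result is timezone-aware (UTC).

```python
import pandas as pd

raw = ["Sat Dec 14 05:13:28 +0000 2013", "Sat Dec 14 05:21:12 +0000 2013",
       "Sat Dec 14 05:23:10 +0000 2013", "Sat Dec 14 05:27:54 +0000 2013",
       "Sat Dec 14 05:37:43 +0000 2013", "Sat Dec 14 05:39:38 +0000 2013",
       "Sat Dec 14 05:41:39 +0000 2013", "Sat Dec 14 05:43:46 +0000 2013",
       "Sat Dec 14 05:44:50 +0000 2013", "Sat Dec 14 05:47:33 +0000 2013",
       "Sat Dec 14 05:49:29 +0000 2013", "Sat Dec 14 05:55:03 +0000 2013",
       "Sat Dec 14 05:59:09 +0000 2013", "Sat Dec 14 05:59:45 +0000 2013",
       "Sat Dec 14 06:17:19 +0000 2013"]
df = pd.DataFrame({"timestamp": raw})

# Parse with an explicit format instead of relying on format guessing
df["timestamp"] = pd.to_datetime(df["timestamp"],
                                 format="%a %b %d %H:%M:%S %z %Y")

# Count rows per 10-minute bin; empty bins are reported as 0
counts = df.set_index("timestamp").resample("10min").size()
print(counts)
```

Using size() counts rows per bin regardless of which other columns exist; bins with no tweets appear with a count of 0 rather than being omitted.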