I have the foll. list in pandas:
str = jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
Is there a way to convert it into datetime?
I tried:
pd.to_datetime(pd.Series(str))
You have to specify the format argument while calling pd.to_datetime. Try
pd.to_datetime(pd.Series(s), format='%b_%d')
this gives
0 1900-01-01
1 1900-01-15
2 1900-02-01
3 1900-02-15
4 1900-03-01
5 1900-03-15
6 1900-04-01
7 1900-04-15
8 1900-05-01
9 1900-05-15
For setting the current year, a hack may be required, like
pd.to_datetime(pd.Series(s) + '_2015', format='%b_%d_%Y')
to get
0 2015-01-01
1 2015-01-15
2 2015-02-01
3 2015-02-15
4 2015-03-01
5 2015-03-15
6 2015-04-01
7 2015-04-15
8 2015-05-01
9 2015-05-15
Related
What I have:
A dataframe, df consists of 3 columns (Id, Item and Timestamp). Each subject has unique Id with recorded Item on a particular date and time (Timestamp). The second dataframe, df_ref consists of date time range reference for slicing the df, the Start and the End for each subject, Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the data time range given for each Id (groupby Id) in df_ref and concatenate the sliced data into new dataframe. However, a subject could have more than one date time range (in this example Id=3 has 2 date time range).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while doing my code. I modify the code since it does not have the groupby element which I need.
My code:
from datetime import datetime
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')
x = pd.DataFrame()
for pid in def_ref.Id.unique():
selection = df[(df['Id']== pid) & (df['Timestamp']>= def_ref['Start']) & (df['Timestamp']<= def_ref['End'])]
x = x.append(selection)
Above code give error:
ValueError: Can only compare identically-labeled Series objects
First use merge with default inner join, also it create all combinations for duplicated Id. Then filter by between and DataFrame.loc for filtering by conditions and by df1.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
I have a Pandas Series object (df.Lateness) in which every element is a string but the format is not consistent:
0 00:01:48
1 00:07:38
2 00:04:44
3 00:12:18
4 0
5 0
6 00:01:36
7 0
8 0
9 0
I would like to convert these into datetime.time types where 0's represent 00:00:00 in format %H:%M:%S, but when I execute the following code:
pd.to_datetime(df.Lateness, format = '%H:%M:%S')
I get two exceptions:
TypeError: Unrecognized value type: class 'str'
ValueError: time data '0' does not match format '%H:%M:%S' (match)
Is there a way to get around this problem?
I'm not sure is it suits for you, but in my opinion converting non-format values to NaT will be more convenient:
result = pd.to_datetime(lateness, format='%H:%M:%S', errors='coerce').dt.time
Result:
0 00:01:48
1 00:07:38
2 00:04:44
3 00:12:18
4 NaT
5 NaT
6 00:01:36
7 NaT
8 NaT
9 NaT
If you still want to convert NaT to 00:00:00 using fillna:
result.fillna(pd.to_datetime('00:00:00', format='%H:%M:%S').time())
Result:
0 00:01:48
1 00:07:38
2 00:04:44
3 00:12:18
4 00:00:00
5 00:00:00
6 00:01:36
7 00:00:00
8 00:00:00
9 00:00:00
My data frame has a column 'Date' which is of type object but I want to convert it to pandas time series. So I am using pd.to_datetime function. This function is converting the datatype but giving erratic output.
code:
x1['TS'] = pd.to_datetime(x1['Date'])
x1['Day'] = x1['TS'].dt.dayofweek
x1[['Date', 'TS', 'Day']].iloc[::1430,:]
Now notice the output closely and see the columns Date and TS. it should be same but in some cases, its different.
output :
Date TS Day
0 01-12-2017 2017-01-12 3
1430 01-12-2017 2017-01-12 3
2860 02-12-2017 2017-02-12 6
4290 03-12-2017 2017-03-12 6
5720 04-12-2017 2017-04-12 2
7150 05-12-2017 2017-05-12 4
8580 07-12-2017 2017-07-12 2
10010 08-12-2017 2017-08-12 5
11440 09-12-2017 2017-09-12 1
12870 09-12-2017 2017-09-12 1
14300 10-12-2017 2017-10-12 3
15730 11-12-2017 2017-11-12 6
17160 12-12-2017 2017-12-12 1
18590 13-12-2017 2017-12-13 2
20020 14-12-2017 2017-12-14 3
21450 15-12-2017 2017-12-15 4
22880 16-12-2017 2017-12-16 5
24310 17-12-2017 2017-12-17 6
25740 18-12-2017 2017-12-18 0
27170 19-12-2017 2017-12-19 1
28600 20-12-2017 2017-12-20 2
30030 21-12-2017 2017-12-21 3
31460 22-12-2017 2017-12-22 4
32890 23-12-2017 2017-12-23 5
34320 24-12-2017 2017-12-24 6
35750 25-12-2017 2017-12-25 0
37180 26-12-2017 2017-12-26 1
38610 27-12-2017 2017-12-27 2
40040 28-12-2017 2017-12-28 3
41470 29-12-2017 2017-12-29 4
42900 30-12-2017 2017-12-30 5
44330 31-12-2017 2017-12-31 6
45760 01-01-2018 2018-01-01 0
47190 02-01-2018 2018-02-01 3
48620 03-01-2018 2018-03-01 3
50050 04-01-2018 2018-04-01 6
51480 05-01-2018 2018-05-01 1
52910 06-01-2018 2018-06-01 4
54340 07-01-2018 2018-07-01 6
55770 08-01-2018 2018-08-01 2
57200 09-01-2018 2018-09-01 5
58630 10-01-2018 2018-10-01 0
60060 11-01-2018 2018-11-01 3
61490 12-01-2018 2018-12-01 5
62920 13-01-2018 2018-01-13 5
64350 14-01-2018 2018-01-14 6
65780 15-01-2018 2018-01-15 0
67210 16-01-2018 2018-01-16 1
Oops! Looks like your dates start with the day being first. You'll have to tell pandas to handle that accordingly. Set the dayfirst flag to True when calling to_datetime.
x1['TS'] = pd.to_datetime(x1['Date'], dayfirst=True)
When you pass in a time without specifying the format, Pandas tries to guess at the format in a naive manner. It was assuming that what is your day is actually your month but then when it sees that it is month 13, realizes that can't be the month column and switches.
The following should work but I like #cᴏʟᴅsᴘᴇᴇᴅ's solution better because it is simpler to just raise the dayfirst flag.
To fix this, provide the current format to the to_datetime function.
The documentation gives the following example which you can modify to fit your situation:
pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
See details here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html
Time format conventions (what %Y means and so on) are here: https://docs.python.org/3.2/library/time.html
i have a dataframe date column with below values
2015-01-01
2015-02-01
2015-03-01
2015-07-01
2015-08-01
2015-10-01
2015-11-01
2016-02-01
i want to find the difference of these values in months, as below
date_dt diff_mnts
2015-01-01 0
2015-02-01 1
2015-03-01 1
2015-07-01 4
2015-08-01 1
2015-10-01 2
2015-11-01 1
2016-02-01 3
i tried to use the diff() method to calculate the days and then convert to astype('timedelta64(M)'). but in those cases, when days are less than 30 - its showing month difference values as 0. please let me know, if there is any easy built in function, which i can try in this case.
Option 1
Change the period and call diff.
df
Date
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-07-01
4 2015-08-01
5 2015-10-01
6 2015-11-01
7 2016-02-01
df.Date.dtype
dtype('<M8[ns]')
df.Date.dt.to_period('M').diff().fillna(0)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
Option 2
Alternatively, call diff on dt.month, but you'll need to account for gaps over a year (solution improved thanks to #galaxyan!) -
i = df.Date.dt.year.diff() * 12
j = df.Date.dt.month.diff()
(i + j).fillna(0).astype(int)
0 0
1 1
2 1
3 4
4 1
5 2
6 1
7 3
Name: Date, dtype: int64
Caveat (thanks to for spotting it) is that this wouldn't work for gaps over a year.
Try the following steps
Cast the column into datetime format.
Use the .month method to get the month number
Use the shift() method in pandas to calculate difference
example code will look something like this
df['diff_mnts'] = date_dt.month - date_dt.shift().month
I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then column bind them back to the dataframe, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
maxdates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
dates=pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = DataFrame({'id':[1,1,1,1,1,2,2,2,2,2], 'date':dates})
cols = ['id', 'date']
DF=DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the results of a groupby operation and broadcasts it up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4