I have a dataframe containing a date field stored as text.
I convert the date field into a datetime object using:
df['date'] = pd.to_datetime(df['date'])
Doing:
df['date']
Produces something like this:
0 2012-06-28 09:36:21
1 2013-05-21 14:52:57
2 2011-10-14 16:31:34
3 2011-11-11 12:51:13
4 2013-02-07 15:33:22
5 2013-01-02 14:40:08
6 2013-06-24 14:49:40
7 2013-07-15 15:29:26
8 2011-11-04 12:17:32
9 2013-04-29 17:31:43
10 2013-06-24 15:00:06
11 2012-10-22 18:23:53
12 NaT
13 NaT
14 2011-12-13 10:06:18
Now I convert the date time object into a date object:
df['date'].apply(try_convert_date)
(see below for how try_convert_date is defined). I get:
0 2012-06-28
1 2013-05-21
2 2011-10-14
3 2011-11-11
4 2013-02-07
5 2013-01-02
6 2013-06-24
7 2013-07-15
8 2011-11-04
9 2013-04-29
10 2013-06-24
11 2012-10-22
12 0001-255-255
13 0001-255-255
14 2011-12-13
Where the 'NaT' values have been converted to '0001-255-255'. How do I avoid this and keep 'NA' in these cells?
Thanks in advance
def try_convert_date(obj):
    try:
        return obj.date()
    except:  # AttributeError
        return 'NA'
The problem is that pd.NaT.date() will not raise an error; it returns datetime.date(1, 255, 255), so the part of your code that catches an exception is never reached. You'll have to check whether the value is pd.NaT and in that case return 'NA'. In all other cases you can safely return obj.date(), since the column has datetime64 dtype.
def try_convert(obj):
    if obj is pd.NaT:
        return 'NA'
    else:
        return obj.date()
In [17]: s.apply(try_convert)
Out[17]:
0 2012-06-28
1 2013-05-21
2 2011-10-14
3 2011-11-11
4 2013-02-07
5 2013-01-02
6 2013-06-24
7 2013-07-15
8 2011-11-04
9 2013-04-29
10 2013-06-24
11 2012-10-22
12 NA
13 NA
14 2011-12-13
Name: 1_2, dtype: object
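A vectorized alternative (a sketch, assuming a reasonably recent pandas) avoids apply entirely: extract the dates via the .dt accessor and mask the missing positions with where:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2012-06-28 09:36:21', None, '2011-12-13 10:06:18']))

# .dt.date yields datetime.date objects (NaT becomes NaN);
# where() then swaps the missing positions for the string 'NA'.
out = s.dt.date.astype(object).where(s.notna(), 'NA')
print(out)
```

This keeps the NaT check out of Python-level row iteration, which matters on large frames.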
Related
I wrote a function in which the date column of a Pandas dataframe is traversed & along the way the dates are converted to a format specified by the user. In case any date is invalid or missing, the user can replace it with a value of their choice.
Here's my code:
def date_fun(dfd, col_name, choice, replace_date=None):
    for col in dfd.columns:
        if col.__contains__(col_name):
            dfd[col_name] = dfd[col_name].fillna(value=replace_date)
            date_formats = {1: 'YYYY-MM-DD', 2: 'MM/DD/YYYY', 3: 'DD/MM/YYYY', 4: 'YYYY-MM-DD HH:MM:SS', 5: 'MM/DD/YYYY HH:MM:SS', 6: 'DD/MM/YYYY HH:MM:SS'}
            selection = date_formats[choice]
            formatted_dates = pd.to_datetime(dfd[col], errors='coerce', format=selection)
            dfd[col_name] = formatted_dates
    return dfd
date_fun(dfd, 'joining_Dates', 4, "07/20/1990")
My date_column:
joining_Dates
0 25.09.2019
1 9/16/2015
2 10.12.2017
3 02.12.2014
4 08-Mar-18
5 08-12-2016
6 26.04.2016
7 05-03-2016
8 24.12.2016
9 10-Aug-19
10 abc
11 05-06-2015
12 12-2012-18
13 24-02-2010
14 2008,13,02
15 16-09-2015
16 23-01-1992, 7:45
Expected output:
**joining_Dates**
0 2019-09-25T00:00:00
1 2015-09-16T00:00:00
2 2017-10-12T00:00:00
3 2014-02-12T00:00:00
4 2018-03-08T00:00:00
5 2016-08-12T00:00:00
6 2016-04-26T00:00:00
7 2016-05-03T00:00:00
8 2016-12-24T00:00:00
9 2019-08-10T00:00:00
10 07-20-1990T00:00:00
11 2015-05-06T00:00:00
12 07-20-1990T00:00:00
13 2010-02-24T00:00:00
14 2008-02-01T00:00:00
15 2015-09-16T00:00:00
16 1992-01-23T07:45:00
Output of my code:
joining_Dates
0   NaT
1   NaT
2   NaT
3   NaT
4   NaT
...
Why am I not getting the expected output?
Maybe something like this?
# python 3.9.13
from io import StringIO
import pandas as pd # 1.5.2
df = pd.read_csv(StringIO("""joining_Dates
25.09.2019
9/16/2015
10.12.2017
02.12.2014
08-Mar-18
08-12-2016
26.04.2016
05-03-2016
24.12.2016
10-Aug-19
abc
05-06-2015
12-2012-18
24-02-2010
2008,13,02
16-09-2015
23-01-1992, 7:45"""), sep="\t")
df["new_dates"] = pd.to_datetime(df.joining_Dates, errors="coerce")
df["new_dates"] = df.new_dates.fillna(pd.to_datetime("07/20/1990"))
print(df.new_dates)
0 2019-09-25 00:00:00
1 2015-09-16 00:00:00
2 2017-10-12 00:00:00
3 2014-02-12 00:00:00
4 2018-03-08 00:00:00
5 2016-08-12 00:00:00
6 2016-04-26 00:00:00
7 2016-05-03 00:00:00
8 2016-12-24 00:00:00
9 2019-08-10 00:00:00
10 1990-07-20 00:00:00
11 2015-05-06 00:00:00
12 1990-07-20 00:00:00
13 2010-02-24 00:00:00
14 2008-02-01 00:00:00
15 2015-09-16 00:00:00
16 1992-01-23 07:45:00
Name: new_dates, dtype: datetime64[ns]
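If the default parser guesses some rows wrong (e.g. day-first strings like 08-12-2016), a stricter sketch tries a list of explicit formats in order and keeps the first successful parse per row; the format list and its ordering here are assumptions you would tune to your data:

```python
import pandas as pd

s = pd.Series(['25.09.2019', '9/16/2015', '08-Mar-18', 'abc', '23-01-1992, 7:45'])

# Try each explicit format; rows a format can't parse stay NaT
# and get a chance with the next format.
formats = ['%d.%m.%Y', '%m/%d/%Y', '%d-%b-%y', '%d-%m-%Y', '%d-%m-%Y, %H:%M']
parsed = pd.Series(pd.NaT, index=s.index)
for fmt in formats:
    parsed = parsed.fillna(pd.to_datetime(s, format=fmt, errors='coerce'))

# Anything still unparsed gets the user's replacement date.
parsed = parsed.fillna(pd.to_datetime('07/20/1990'))
```

Because each format is explicit, ambiguous strings are resolved by whichever format you list first, rather than by the parser's heuristics.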
I have a dataset which looks like this:
ID Date
1 3 2016-04-01
2 3 2016-04-02
3 3 2016-04-03
4 3 2016-04-04
5 3 2016-04-05
6 3 2017-04-01
7 3 2017-04-02
8 3 2017-04-03
9 3 2017-04-04
10 3 2017-04-05
11 7 2016-04-01
12 7 2016-04-02
13 7 2016-04-03
14 7 2016-04-04
15 7 2016-04-05
16 7 2017-04-01
17 7 2017-04-02
18 7 2017-04-03
19 7 2017-04-04
20 7 2017-04-05
I want to change the year of the dates given two conditions: the value of the ID and the year of the Date. For example, if ID = 3 and the year is 2016, I want to change it to 2014.
You can use something like this:
def f(x):
    if x['ID'] == 3 and '2016' in x['Date']:
        return x['Date'].replace('2016', '2014')
    else:
        return x['Date']

df['new_column'] = df.apply(f, axis=1)
Depending on how the date is stored, you may have to modify this. This example is for a simple string, but it should be adaptable to other types.
If you want to use a lambda function:
df['new_column'] = df.apply(lambda x: x['Date'].replace('2016', '2014') if x['ID'] == 3 and '2016' in x['Date'] else x['Date'], axis=1)
Similarly, if your data is stored as a datetime object, the corresponding function is x['Date'].replace(year=2014), and your condition is x['Date'].year == 2016
Following the previous answer, the one-liner would look like this:
df['Date'] = df.apply(lambda x: x['Date'].replace(year=2014) if x['ID'] == 3 and x['Date'].year == 2016 else x['Date'], axis=1)
Generally speaking, I'd recommend working with datetime for dates and times.
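On a real datetime64 column, that advice looks roughly like this (a sketch on a tiny made-up frame; Timestamp.replace(year=...) keeps the month, day and time intact):

```python
import pandas as pd

df = pd.DataFrame({'ID': [3, 3, 7],
                   'Date': pd.to_datetime(['2016-04-01', '2017-04-02', '2016-04-01'])})

# Shift year 2016 -> 2014, but only for rows with ID == 3.
df['Date'] = df.apply(
    lambda x: x['Date'].replace(year=2014)
    if x['ID'] == 3 and x['Date'].year == 2016
    else x['Date'],
    axis=1)
```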
data['Date'] = data.apply(lambda x: x['Date'].replace('2016','2014') if (x['ID'] == 3 and "2016" in x['Date']) else x['Date'], axis=1)
Output:
ID Date
0 3 2014-04-01
1 3 2014-04-02
2 3 2014-04-03
3 3 2014-04-04
4 3 2014-04-05
5 3 2017-04-01
6 3 2017-04-02
7 3 2017-04-03
8 3 2017-04-04
9 3 2017-04-05
10 7 2016-04-01
11 7 2016-04-02
12 7 2016-04-03
13 7 2016-04-04
14 7 2016-04-05
15 7 2017-04-01
16 7 2017-04-02
17 7 2017-04-03
18 7 2017-04-04
19 7 2017-04-05
What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with each Item recorded at a particular date and time (Timestamp). A second dataframe, df_ref, consists of the date-time range references for slicing df: a Start and an End for each Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range given for each Id (groupby Id) in df_ref and concatenate the sliced data into a new dataframe. Note that a subject can have more than one date-time range (in this example Id=3 has 2 date-time ranges).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element I need.
My code:
from datetime import datetime
df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')
x = pd.DataFrame()
for pid in df_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= df_ref['Start']) & (df['Timestamp'] <= df_ref['End'])]
    x = x.append(selection)
Above code give error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it also creates all combinations for duplicated Id. Then filter by between, using DataFrame.loc to apply the condition and select df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
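Condensed into a self-contained sketch (tiny made-up frames standing in for df and df_ref):

```python
import pandas as pd

df = pd.DataFrame({'Id': [1, 1, 2],
                   'Item': ['aaa', 'boo', 'ige'],
                   'Timestamp': pd.to_datetime(['2011-03-15 14:21:00',
                                                '2015-12-24 21:53:41',
                                                '2006-06-28 17:09:18'])})
df_ref = pd.DataFrame({'Id': [1, 2],
                       'Start': pd.to_datetime(['2013-03-12', '2005-06-05']),
                       'End': pd.to_datetime(['2016-05-30', '2007-02-24'])})

# The inner merge duplicates each df row once per matching range for its Id;
# between() then keeps only rows inside their own [Start, End] window.
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
```

Because a merged row carries its own Start/End, an Id with several ranges is handled naturally: each range contributes its own copies of the rows, and the filter keeps the ones that fall inside.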
I have a dataframe that looks similar to the following:
df = pd.DataFrame({'Y_M':['201710','201711','201712'],'1':[1,5,9],'2':[2,6,10],'3':[3,7,11],'4':[4,8,12]})
df = df.set_index('Y_M')
Which creates a dataframe looking like this:
        1   2   3   4
Y_M
201710  1   2   3   4
201711  5   6   7   8
201712  9  10  11  12
The columns are the day of the month. They stretch on to the right, going all the way up to 31. (February will have columns 29, 30, and 31 filled with NaN).
The index contains the year and the month (e.g. 201711 referring to Nov 2017)
My question is: How can I make this a single series, with the year/month/day combined? My output would be the following:
Y_M
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
The index can be converted to a datetime. In fact I think it would make it easier.
Use stack to get a Series, then build the datetimes by adding to_datetime of the year-month level to to_timedelta of the day level:
df = df.stack()
df.index = pd.to_datetime(df.index.get_level_values(0), format='%Y%m') + \
pd.to_timedelta(df.index.get_level_values(1).astype(int) - 1, unit='D')
print (df)
2017-10-01 1
2017-10-02 2
2017-10-03 3
2017-10-04 4
2017-11-01 5
2017-11-02 6
2017-11-03 7
2017-11-04 8
2017-12-01 9
2017-12-02 10
2017-12-03 11
2017-12-04 12
dtype: int64
print (df.index)
DatetimeIndex(['2017-10-01', '2017-10-02', '2017-10-03', '2017-10-04',
'2017-11-01', '2017-11-02', '2017-11-03', '2017-11-04',
'2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04'],
dtype='datetime64[ns]', freq=None)
Last, if you need strings in the index (not a DatetimeIndex), add DatetimeIndex.strftime:
df.index = df.index.strftime('%Y%m%d')
print (df)
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
dtype: int64
print (df.index)
Index(['20171001', '20171002', '20171003', '20171004', '20171101', '20171102',
'20171103', '20171104', '20171201', '20171202', '20171203', '20171204'],
dtype='object')
Without bringing date into it.
s = df.stack()
s.index = s.index.map('{0[0]}{0[1]:>02s}'.format)
s
20171001 1
20171002 2
20171003 3
20171004 4
20171101 5
20171102 6
20171103 7
20171104 8
20171201 9
20171202 10
20171203 11
20171204 12
dtype: int64
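The same mapping can be written with a plain lambda instead of the format-string trick, which some may find easier to read (a sketch on a trimmed-down frame):

```python
import pandas as pd

df = pd.DataFrame({'Y_M': ['201710', '201711'],
                   '1': [1, 5], '2': [2, 6]}).set_index('Y_M')
s = df.stack()
# Each index entry is a tuple like ('201710', '1'); zero-pad the day part.
s.index = s.index.map(lambda t: f"{t[0]}{int(t[1]):02d}")
```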
I have the following list in pandas (renamed to s here to avoid shadowing the built-in str):
s = jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
Is there a way to convert it into datetime?
I tried:
pd.to_datetime(pd.Series(s))
You have to specify the format argument while calling pd.to_datetime. Try
pd.to_datetime(pd.Series(s), format='%b_%d')
this gives
0 1900-01-01
1 1900-01-15
2 1900-02-01
3 1900-02-15
4 1900-03-01
5 1900-03-15
6 1900-04-01
7 1900-04-15
8 1900-05-01
9 1900-05-15
For setting the current year, a hack may be required, like
pd.to_datetime(pd.Series(s) + '_2015', format='%b_%d_%Y')
to get
0 2015-01-01
1 2015-01-15
2 2015-02-01
3 2015-02-15
4 2015-03-01
5 2015-03-15
6 2015-04-01
7 2015-04-15
8 2015-05-01
9 2015-05-15
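An alternative to the string-concatenation hack (a sketch): parse with the default 1900 year, then swap in the year you want with Timestamp.replace:

```python
import pandas as pd

s = pd.Series(['jan_1', 'jan_15', 'feb_1', 'feb_15'])
parsed = pd.to_datetime(s, format='%b_%d')             # year defaults to 1900
parsed = parsed.apply(lambda d: d.replace(year=2015))  # set the desired year
```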