I wrote a function that traverses the date column of a Pandas dataframe and converts the dates to a format specified by the user. If any date is invalid or missing, the user can replace it with a value of their choice.
Here's my code:
def date_fun(dfd, col_name, choice, replace_date=None):
    for col in dfd.columns:
        if col.__contains__(col_name):
            dfd[col_name] = dfd[col_name].fillna(value=replace_date)
            date_formats = {1: 'YYYY-MM-DD', 2: 'MM/DD/YYYY', 3: 'DD/MM/YYYY', 4: 'YYYY-MM-DD HH:MM:SS', 5: 'MM/DD/YYYY HH:MM:SS', 6: 'DD/MM/YYYY HH:MM:SS'}
            selection = date_formats[choice]
            formatted_dates = pd.to_datetime(dfd[col], errors='coerce', format=selection)
            dfd[col_name] = formatted_dates
    return dfd
date_fun(dfd, 'joining_Dates', 4, "07/20/1990")
My date_column:
joining_Dates
0 25.09.2019
1 9/16/2015
2 10.12.2017
3 02.12.2014
4 08-Mar-18
5 08-12-2016
6 26.04.2016
7 05-03-2016
8 24.12.2016
9 10-Aug-19
10 abc
11 05-06-2015
12 12-2012-18
13 24-02-2010
14 2008,13,02
15 16-09-2015
16 23-01-1992, 7:45
Expected output:
**joining_Dates**
0 2019-09-25T00:00:00
1 2015-09-16T00:00:00
2 2017-10-12T00:00:00
3 2014-02-12T00:00:00
4 2018-03-08T00:00:00
5 2016-08-12T00:00:00
6 2016-04-26T00:00:00
7 2016-05-03T00:00:00
8 2016-12-24T00:00:00
9 2019-08-10T00:00:00
10 07-20-1990T00:00:00
11 2015-05-06T00:00:00
12 07-20-1990T00:00:00
13 2010-02-24T00:00:00
14 2008-02-01T00:00:00
15 2015-09-16T00:00:00
16 1992-01-23T07:45:00
Output of my code:
joining_Dates
0 NaT
1 NaT
2 NaT
3 NaT
4 NaT
...
Why am I not getting the expected output?
Maybe something like this? Note that the format strings in your dictionary ('YYYY-MM-DD' and so on) are display patterns, not valid strptime codes (those would be '%Y-%m-%d' etc.), so every parse fails and errors='coerce' turns the whole column into NaT. Since the input mixes several formats anyway, it is easier to let pandas infer each date:
# python 3.9.13
from io import StringIO
import pandas as pd # 1.5.2
df = pd.read_csv(StringIO("""joining_Dates
25.09.2019
9/16/2015
10.12.2017
02.12.2014
08-Mar-18
08-12-2016
26.04.2016
05-03-2016
24.12.2016
10-Aug-19
abc
05-06-2015
12-2012-18
24-02-2010
2008,13,02
16-09-2015
23-01-1992, 7:45"""), sep="\t")
df["new_dates"] = pd.to_datetime(df.joining_Dates, errors="coerce")
df["new_dates"] = df.new_dates.fillna(pd.to_datetime("07/20/1990"))
print(df.new_dates)
0 2019-09-25 00:00:00
1 2015-09-16 00:00:00
2 2017-10-12 00:00:00
3 2014-02-12 00:00:00
4 2018-03-08 00:00:00
5 2016-08-12 00:00:00
6 2016-04-26 00:00:00
7 2016-05-03 00:00:00
8 2016-12-24 00:00:00
9 2019-08-10 00:00:00
10 1990-07-20 00:00:00
11 2015-05-06 00:00:00
12 1990-07-20 00:00:00
13 2010-02-24 00:00:00
14 2008-02-01 00:00:00
15 2015-09-16 00:00:00
16 1992-01-23 07:45:00
Name: new_dates, dtype: datetime64[ns]
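To also honor the user's format choice from the original function, one sketch (assuming the same menu numbering) is to map the choice numbers to strptime codes and format the dates after parsing:

```python
import pandas as pd

# strptime equivalents of the menu choices; the original 'YYYY-MM-DD' strings
# are display patterns, not valid strptime codes
strftime_formats = {
    1: "%Y-%m-%d", 2: "%m/%d/%Y", 3: "%d/%m/%Y",
    4: "%Y-%m-%d %H:%M:%S", 5: "%m/%d/%Y %H:%M:%S", 6: "%d/%m/%Y %H:%M:%S",
}

def date_fun(dfd, col_name, choice, replace_date=None):
    # parse first (letting pandas infer each date), then fill failed parses
    parsed = pd.to_datetime(dfd[col_name], errors="coerce")
    if replace_date is not None:
        parsed = parsed.fillna(pd.to_datetime(replace_date))
    # render in the user-selected format as strings
    dfd[col_name] = parsed.dt.strftime(strftime_formats[choice])
    return dfd

df = pd.DataFrame({"joining_Dates": ["25.09.2019", "abc", "26.04.2016"]})
out = date_fun(df, "joining_Dates", 4, "07/20/1990")
```

The dates come back as strings in the chosen format, with invalid entries replaced by the fallback date.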
Related
I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was to create a new data frame c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
These loops, however, are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
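A self-contained run of that chain on two tiny overlapping intervals (column names as in the question), for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2],
    "START": pd.to_datetime(["2017-01-01", "2017-01-03"]),
    "END": pd.to_datetime(["2017-01-04", "2017-01-05"]),
})

# one date_range per row; ranges of unequal length stay a Series of objects,
# so explode() flattens them into one long series of dates to count
counts = (
    df.apply(lambda row: pd.date_range(row["START"], row["END"]), axis=1)
      .explode()
      .value_counts()
      .sort_index()
)
```

Each index entry is a date and each value is the number of intervals covering it.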
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
    'Date': dates,
    'Count': mask.sum(axis=0)
})
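For reference, here is a runnable version of the broadcasting approach on a two-row sample; the mask has one row per interval and one column per date:

```python
import pandas as pd

df = pd.DataFrame({
    "START": pd.to_datetime(["2017-01-01", "2017-01-03"]),
    "END": pd.to_datetime(["2017-01-04", "2017-01-05"]),
})

dates = pd.date_range(df["START"].min(), df["END"].max()).values
start = df["START"].values[:, None]  # shape (n_rows, 1) for broadcasting
end = df["END"].values[:, None]
mask = (start <= dates) & (dates <= end)  # (n_rows, n_dates) membership matrix

# summing over rows gives, for each date, how many intervals contain it
result = pd.DataFrame({"Date": dates, "Count": mask.sum(axis=0)})
```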
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval. (Note: I made a smaller sample to test this solution on.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
                         'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I have a dataset which looks like this:
ID Date
1 3 2016-04-01
2 3 2016-04-02
3 3 2016-04-03
4 3 2016-04-04
5 3 2016-04-05
6 3 2017-04-01
7 3 2017-04-02
8 3 2017-04-03
9 3 2017-04-04
10 3 2017-04-05
11 7 2016-04-01
12 7 2016-04-02
13 7 2016-04-03
14 7 2016-04-04
15 7 2016-04-05
16 7 2017-04-01
17 7 2017-04-02
18 7 2017-04-03
19 7 2017-04-04
20 7 2017-04-05
I want to change the year of the dates given two conditions: the value of the ID and the year of the Date. For example, if ID = 3 and the year is 2016, I want to change it to 2014.
You can use something like this:
def f(x):
    if x['ID'] == 3 and '2016' in x['Date']:
        return x['Date'].replace('2016', '2014')
    else:
        return x['Date']

df['new_column'] = df.apply(f, axis=1)
Depending on how the date is stored you have to modify. This example is for a simple string, but should be adaptable to other types.
If you want to use a lambda function:
df['new_column'] = df.apply(lambda x: x['Date'].replace('2016', '2014') if x['ID'] == 3 and '2016' in x['Date'] else x['Date'], axis=1)
Similarly, if your data is stored as a datetime object, the corresponding replacement is x['Date'].replace(year=2014), and the condition is x['Date'].year == 2016.
Following the previous answer, the one-liner would look like this:
df['Date'] = df.apply(lambda x: x['Date'].replace(year=2014) if x['ID'] == 3 and x['Date'].year == 2016 else x['Date'], axis=1)
Generally speaking, I'd recommend working with datetime for dates and times.
data['Date'] = data.apply(lambda x: x['Date'].replace('2016','2014') if (x['ID'] == 3 and "2016" in x['Date']) else x['Date'], axis=1)
Output:
ID Date
0 3 2014-04-01
1 3 2014-04-02
2 3 2014-04-03
3 3 2014-04-04
4 3 2014-04-05
5 3 2017-04-01
6 3 2017-04-02
7 3 2017-04-03
8 3 2017-04-04
9 3 2017-04-05
10 7 2016-04-01
11 7 2016-04-02
12 7 2016-04-03
13 7 2016-04-04
14 7 2016-04-05
15 7 2017-04-01
16 7 2017-04-02
17 7 2017-04-03
18 7 2017-04-04
19 7 2017-04-05
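If Date is already a datetime64 column, a vectorized alternative (a sketch, avoiding apply) is to build a boolean mask and shift the matching rows with a DateOffset:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [3, 3, 7],
    "Date": pd.to_datetime(["2016-04-01", "2017-04-02", "2016-04-01"]),
})

# rows where ID == 3 and the year is 2016 get moved back to 2014
mask = (df["ID"] == 3) & (df["Date"].dt.year == 2016)
df.loc[mask, "Date"] = df.loc[mask, "Date"] - pd.DateOffset(years=2)
```

Subtracting an offset rather than string-replacing also keeps the column as datetime64.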
What I have:
A dataframe, df, consists of 3 columns (Id, Item and Timestamp). Each subject has a unique Id, with Items recorded at particular dates and times (Timestamp). A second dataframe, df_ref, holds the date-time range references for slicing df: a Start and an End for each Id.
df:
Id Item Timestamp
0 1 aaa 2011-03-15 14:21:00
1 1 raa 2012-05-03 04:34:01
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
5 1 aud 2017-05-10 11:58:02
6 2 boo 2004-06-22 22:20:58
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
9 2 baa 2008-05-22 21:28:00
10 2 boo 2017-06-08 23:31:06
11 3 ige 2011-06-30 13:14:21
12 3 afr 2013-06-11 01:38:48
13 3 gui 2013-06-21 23:14:26
14 3 loo 2014-06-10 15:15:42
15 3 boo 2015-01-23 02:08:35
16 3 afr 2015-04-15 00:15:23
17 3 aaa 2016-02-16 10:26:03
18 3 aaa 2016-06-10 01:11:15
19 3 ige 2016-07-18 11:41:18
20 3 boo 2016-12-06 19:14:00
21 4 gui 2016-01-05 09:19:50
22 4 aaa 2016-12-09 14:49:50
23 4 ige 2016-11-01 08:23:18
df_ref:
Id Start End
0 1 2013-03-12 00:00:00 2016-05-30 15:20:36
1 2 2005-06-05 08:51:22 2007-02-24 00:00:00
2 3 2011-05-14 10:11:28 2013-12-31 17:04:55
3 3 2015-03-29 12:18:31 2016-07-26 00:00:00
What I want:
Slice the df dataframe based on the date-time range(s) given for each Id (grouped by Id) in df_ref, and concatenate the sliced data into a new dataframe. Note that a subject can have more than one date-time range (in this example Id=3 has two).
df_expected:
Id Item Timestamp
0 1 baa 2013-05-08 22:21:29
1 1 boo 2015-12-24 21:53:41
2 1 afr 2016-04-14 12:28:26
3 2 aaa 2005-11-16 07:00:00
4 2 ige 2006-06-28 17:09:18
5 3 ige 2011-06-30 13:14:21
6 3 afr 2013-06-11 01:38:48
7 3 gui 2013-06-21 23:14:26
8 3 afr 2015-04-15 00:15:23
9 3 aaa 2016-02-16 10:26:03
10 3 aaa 2016-06-10 01:11:15
11 3 ige 2016-07-18 11:41:18
What I have done so far:
I referred to this post (Time series multiple slice) while writing my code. I modified the code since it does not have the groupby element which I need.
My code:
from datetime import datetime

df['Timestamp'] = pd.to_datetime(df.Timestamp, format='%Y-%m-%d %H:%M')

x = pd.DataFrame()
for pid in def_ref.Id.unique():
    selection = df[(df['Id'] == pid) & (df['Timestamp'] >= def_ref['Start']) & (df['Timestamp'] <= def_ref['End'])]
    x = x.append(selection)
The above code gives this error:
ValueError: Can only compare identically-labeled Series objects
First use merge with the default inner join; it also creates all combinations for a duplicated Id. Then filter with between, using DataFrame.loc to apply the condition and select df.columns in one step:
df1 = df.merge(df_ref, on='Id')
df2 = df1.loc[df1['Timestamp'].between(df1['Start'], df1['End']), df.columns]
print (df2)
Id Item Timestamp
2 1 baa 2013-05-08 22:21:29
3 1 boo 2015-12-24 21:53:41
4 1 afr 2016-04-14 12:28:26
7 2 aaa 2005-11-16 07:00:00
8 2 ige 2006-06-28 17:09:18
11 3 ige 2011-06-30 13:14:21
13 3 afr 2013-06-11 01:38:48
15 3 gui 2013-06-21 23:14:26
22 3 afr 2015-04-15 00:15:23
24 3 aaa 2016-02-16 10:26:03
26 3 aaa 2016-06-10 01:11:15
28 3 ige 2016-07-18 11:41:18
I have the following list in pandas:
str = jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
Is there a way to convert it into datetime?
I tried:
pd.to_datetime(pd.Series(str))
You have to specify the format argument when calling pd.to_datetime (here using s for the list, since str shadows the built-in). Try
pd.to_datetime(pd.Series(s), format='%b_%d')
this gives
0 1900-01-01
1 1900-01-15
2 1900-02-01
3 1900-02-15
4 1900-03-01
5 1900-03-15
6 1900-04-01
7 1900-04-15
8 1900-05-01
9 1900-05-15
For setting the current year, a hack may be required, like
pd.to_datetime(pd.Series(s) + '_2015', format='%b_%d_%Y')
to get
0 2015-01-01
1 2015-01-15
2 2015-02-01
3 2015-02-15
4 2015-03-01
5 2015-03-15
6 2015-04-01
7 2015-04-15
8 2015-05-01
9 2015-05-15
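Put together as runnable code (with the variable renamed to s):

```python
import pandas as pd

s = "jan_1 jan_15 feb_1 feb_15 mar_1 mar_15".split()

# with no year in the format, strptime defaults to 1900
defaulted = pd.to_datetime(pd.Series(s), format="%b_%d")

# append a fixed year so the dates land where expected
with_year = pd.to_datetime(pd.Series(s) + "_2015", format="%b_%d_%Y")
```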
I have a dataframe containing a dates field as text.
I convert the dates field into a date time object using:
df['date'] = pd.to_datetime(df['date'])
Doing:
df['date']
Produces something like this:
0 2012-06-28 09:36:21
1 2013-05-21 14:52:57
2 2011-10-14 16:31:34
3 2011-11-11 12:51:13
4 2013-02-07 15:33:22
5 2013-01-02 14:40:08
6 2013-06-24 14:49:40
7 2013-07-15 15:29:26
8 2011-11-04 12:17:32
9 2013-04-29 17:31:43
10 2013-06-24 15:00:06
11 2012-10-22 18:23:53
12 NaT
13 NaT
14 2011-12-13 10:06:18
Now I convert the date time object into a date object:
df['date'].apply(try_convert_date)
(see below for how try_convert_date is defined). I get:
0 2012-06-28
1 2013-05-21
2 2011-10-14
3 2011-11-11
4 2013-02-07
5 2013-01-02
6 2013-06-24
7 2013-07-15
8 2011-11-04
9 2013-04-29
10 2013-06-24
11 2012-10-22
12 0001-255-255
13 0001-255-255
14 2011-12-13
Where the 'NaT' values have been converted to '0001-255-255'. How do I avoid this and keep 'NA' in these cells?
Thanks in advance
def try_convert_date(obj):
    try:
        return obj.date()
    except:  # AttributeError
        return 'NA'
The problem is that pd.NaT.date() will not raise an error, it will return datetime.date(1, 255, 255), so the part of your code where you catch an exception will never be reached. You'll have to check if the value is pd.NaT and in that case return 'NA'. In all other cases you can safely return obj.date() since the column has datetime64 dtype.
def try_convert(obj):
    if obj is pd.NaT:
        return 'NA'
    else:
        return obj.date()
In [17]: s.apply(try_convert)
Out[17]:
0 2012-06-28
1 2013-05-21
2 2011-10-14
3 2011-11-11
4 2013-02-07
5 2013-01-02
6 2013-06-24
7 2013-07-15
8 2011-11-04
9 2013-04-29
10 2013-06-24
11 2012-10-22
12 NA
13 NA
14 2011-12-13
Name: 1_2, dtype: object
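A slightly more general guard is pd.isna, which catches NaT as well as None and NaN, so the check does not depend on the exact missing-value type. A sketch:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(
    ["2012-06-28 09:36:21", None, "2011-12-13 10:06:18"]))

def try_convert(obj):
    # pd.isna is True for NaT, None and float NaN alike
    if pd.isna(obj):
        return "NA"
    return obj.date()

out = s.apply(try_convert)
```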