Add Multiple Columns to Pandas Dataframe from Function - python

I have a pandas data frame mydf that has two columns, and both columns are datetime datatypes: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour

def getW(d):  # gives the week number
    return d.isocalendar()[1]

def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday

mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient as it loops through the data frame at least three times. I would just like to know if there's a faster and/or more optimal way to do this. For example, using zip or merge? If, for example, I just create one function that returns three elements, how should I implement this? To illustrate, the function would be:
def getHWd(d, t):
    return t.hour, d.isocalendar()[1], d.weekday()

Here's one approach that does it with a single apply.
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll pull the lambda function out onto a separate line for readability and define it like
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                           x['mydate'].weekday(),
                                           x['mydate'].isocalendar()[1]])
And, apply and store the result to df[['hour', 'weekday', 'weeknum']]
In [66]: df[['hour', 'weekday', 'weeknum']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
mydate mytime hour weekday weeknum
0 2011-01-01 2011-11-14 0 5 52
1 2011-01-02 2011-11-15 0 6 52
2 2011-01-03 2011-11-16 0 0 1
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 2 1
5 2011-01-06 2011-11-19 0 3 1
6 2011-01-07 2011-11-20 0 4 1
7 2011-01-08 2011-11-21 0 5 1
8 2011-01-09 2011-11-22 0 6 1
9 2011-01-10 2011-11-23 0 0 2
10 2011-01-11 2011-11-24 0 1 2
11 2011-01-12 2011-11-25 0 2 2

To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                  x['mydate'].weekday(),
                                  x['mydate'].isocalendar()[1]])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weekday', 'weeknum']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

You can do this in a somewhat cleaner way by having the function you apply return a pd.Series with named elements:
def process(row):
    return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))

my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
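When the per-row computation is simple column arithmetic like this, the same result can be produced without apply at all; a minimal sketch using assign:
new_df = my_df.assign(b=my_df["a"] * 2, c=my_df["a"] + 2)  # vectorized equivalent of process()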

def getWd(d):
    return d.isocalendar()[1], d.weekday()

def getH(t):
    return t.hour

mydf["hour"] = mydf["mytime"].map(getH)
mydf["weeknum"], mydf["weekday"] = zip(*mydf["mydate"].map(getWd))

Related

How to "unroll" time intervals in a dataframe?

I have a dataframe:
df1 = pd.DataFrame(
    [['2011-01-01', '2011-01-03', 'A'],
     ['2011-04-01', '2011-04-01', 'A'],
     ['2012-08-28', '2012-08-30', 'B'],
     ['2015-04-03', '2015-04-05', 'A'],
     ['2015-08-21', '2015-08-21', 'B']],
    columns=['d0', 'd1', 'event'])
d0 d1 event
0 2011-01-01 2011-01-03 A
1 2011-04-01 2011-04-01 A
2 2012-08-28 2012-08-30 B
3 2015-04-03 2015-04-05 A
4 2015-08-21 2015-08-21 B
It contains events A and B that occurred in the specified interval from d0 to d1. (There are actually more events; they are mixed, but their date ranges do not intersect.) The interval can also be a single day (d0 = d1). I need to go from df1 to df2, in which these time intervals are "unrolled" for each event, i.e.:
df2 = pd.DataFrame(
    [['2011-01-01', 'A'], ['2011-01-02', 'A'], ['2011-01-03', 'A'],
     ['2011-04-01', 'A'], ['2012-08-28', 'B'], ['2012-08-29', 'B'],
     ['2012-08-30', 'B'], ['2015-04-03', 'A'], ['2015-04-04', 'A'],
     ['2015-04-05', 'A'], ['2015-08-21', 'B']],
    columns=['Date', 'event'])
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I tried approaches based on resample and on comparing areas where ffill equals bfill, but couldn't come up with anything. What is the simplest way to do this?
We can set_index to event then create date_range per row, then explode to unwind the ranges and reset_index to create the DataFrame:
df2 = (
df1.set_index('event')
.apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
.explode()
.reset_index(name='Date')[['Date', 'event']]
)
df2:
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
Let's try a comprehension to create the pairs of date and event:
pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
for d in pd.date_range(*v)), columns=['Date', 'Event'])
Date Event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
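Another variant along the same lines builds the per-row date ranges with a comprehension, assigns them as a column, and lets explode do the unrolling (a sketch, assuming the df1 from the question):
df2 = (df1.assign(Date=[pd.date_range(a, b) for a, b in zip(df1['d0'], df1['d1'])])
          .explode('Date')[['Date', 'event']]
          .reset_index(drop=True))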
I don't know if this is the simplest, but it's the most intuitive way I can think of: iterate over the rows and unroll them manually into a new dataframe. That is, look at each row, iterate over the dates between d0 and d1, construct a row for each date, and compile them into a dataframe:
from datetime import timedelta

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        event = row['event']
        start = row['d0']
        end = row['d1']
        current = start
        while current != end:
            rows.append(dict(Date=current, event=event))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=event))  # make sure the last date is included
    return pd.DataFrame(rows)
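Since d0 and d1 in df1 hold strings, they need to be parsed before calling this; for example:
df1[['d0', 'd1']] = df1[['d0', 'd1']].apply(pd.to_datetime)  # convert the string columns to datetimes
df2 = unroll_events(df1)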

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the 'master' row. For each group_id, I define the master row as the one with the minimum cust_hierarchy; if there is a tie, use the row with the most recent date.
I have supplied a sample table below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
You can basically do a groupby with an idxmin(), with a little bit of sorting to ensure that the most recent date wins ties in the min operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03','2019-01-01','2019-05-01',
'2019-04-01','2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id':[0,0,1,1,1],
'cust_hierarchy':[2,7,7,6,6,],
'most_recent_date':dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int))
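The idea sketched in the question (sort by the two columns, then flag the first row per group_id) also works directly; a minimal sketch using cumcount, assuming most_recent_date is a datetime or an ISO-formatted string (both sort the same way):
ranked = df.sort_values(['cust_hierarchy', 'most_recent_date'], ascending=[True, False])
# cumcount numbers the rows within each group in sorted order; 0 marks the master row
df['master'] = (ranked.groupby('group_id').cumcount() == 0).astype(int)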

Partially melt hierarchical index

I have a dataframe df that I want to look like df2.
import datetime
import pandas as pd

yesterday = datetime.datetime.today() - datetime.timedelta(days=1)
cols = pd.MultiIndex.from_product(iterables=[['A','B'],['1','2']], names=['letter','number'])
idx = pd.date_range(start=yesterday, periods=2, freq='H')
df = pd.DataFrame(columns=cols, index=idx, data=[[1,2,3,4],[5,6,7,8]])
letter                      A     B
number                      1  2  1  2
2019-09-26 07:45:49.659873  1  2  3  4
2019-09-26 08:45:49.659873  5  6  7  8
df2 = pd.DataFrame(columns=['letter', '1', '2'], index=idx.append(idx), data=[['A',1,2],['A',5,6],['B',3,4],['B',7,8]])
letter 1 2
2019-09-26 07:45:49.659873 A 1 2
2019-09-26 08:45:49.659873 A 5 6
2019-09-26 07:45:49.659873 B 3 4
2019-09-26 08:45:49.659873 B 7 8
So it's sort of like a melt or reset_index, but I haven't found the magic combination to do this on a much larger dataframe in practice. I want to turn level 0 of the columns into an unpivoted column, but leave level 1 unchanged.
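One way to get from df to df2 is to stack the 'letter' level out of the columns and then move it back out of the index; a minimal sketch, assuming the level names from the question:
df2 = (df.stack(level='letter')                     # move 'letter' from the columns into the index
         .reset_index(level='letter')               # turn it into a regular column
         .sort_values('letter', kind='mergesort'))  # stable sort keeps the time order within each letter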

Efficient and elegant way to fill values in pandas column based on each group

df_new = pd.DataFrame(
{
'person_id': [1, 1, 3, 3, 5, 5],
'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA', '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})
What I would like to do is replace/fill the NA-type rows with actual date values from the same group, for which I tried the following:
m1 = df_new['obs_date'].str.contains(r'^\d')
df_new['obs_date'] = df_new.groupby(m1.cumsum())['obs_date'].transform('first')
But this gives unexpected output: the NA row for person_id = 3 (index 2) should have been filled with 11/25/2009, yet it takes its value from the first group (person_id = 1).
How can I get the expected output? Any elegant and efficient solution is welcome, as I am dealing with more than a million records.
First use to_datetime with errors='coerce' to convert non-datetimes to missing values, then use GroupBy.transform('first') to fill the column with each group's first non-missing value:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('first')
#alternative - minimal value per group
#df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('min')
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
Another idea is to use DataFrame.sort_values followed by a per-group ffill:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = (df_new.sort_values(['person_id','obs_date'])
                            .groupby('person_id')['obs_date']
                            .ffill())
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
You can use pd.to_datetime(.., errors='coerce') to turn non-date values into NaT, then ffill and bfill within each group:
df_new['obs_date'] = (df_new.assign(obs_date=pd.to_datetime(df_new['obs_date'], errors='coerce'))
                            .groupby('person_id')['obs_date']
                            .apply(lambda x: x.ffill().bfill()))
print(df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
df_new = df_new.join(df_new.groupby('person_id')["obs_date"].min(),
                     on='person_id',
                     rsuffix="_clean")
Output:
person_id obs_date obs_date_clean
0 1 12/31/2007 12/31/2007
1 1 NA-NA-NA NA:NA:NA 12/31/2007
2 3 NA-NA-NA NA:NA:NA 11/25/2009
3 3 11/25/2009 11/25/2009
4 5 10/15/2019 10/15/2019
5 5 NA-NA-NA NA:NA:NA 10/15/2019
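Note that min() here compares the raw strings; it works for this sample only because 'N' sorts after the digits. A safer variant (an assumption on my part, reusing the parsing idea from the answers above) converts to datetimes first:
clean = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')  # real dates, NaT elsewhere
df_new['obs_date_clean'] = clean.groupby(df_new['person_id']).transform('min')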

Efficiently applying custom functions to groups in Pandas

This question is about efficiently applying a custom function to logical groups of rows in a Pandas dataframe that share a value in some column.
Consider the following example of a dataframe:
import numpy as np
import pandas as pd

sID = [1,1,1,2,4,4,5,5,5]
data = np.random.randn(len(sID))
dates = pd.date_range(start='1/1/2018', periods=len(sID))
mydf = pd.DataFrame({"subject_id": sID, "data": data, "date": dates})
mydf.loc[5, 'date'] += pd.Timedelta('2 days')
which looks like:
data date subject_id
0 0.168150 2018-01-01 1
1 -0.484301 2018-01-02 1
2 -0.522980 2018-01-03 1
3 -0.724524 2018-01-04 2
4 0.563453 2018-01-05 4
5 0.439059 2018-01-08 4
6 -1.902182 2018-01-07 5
7 -1.433561 2018-01-08 5
8 0.586191 2018-01-09 5
Imagine that for each subject_id, we want to subtract from each date the first date encountered for this subject_id. Storing the result in a new column "days_elapsed", the result will look like this:
data date subject_id days_elapsed
0 0.168150 2018-01-01 1 0
1 -0.484301 2018-01-02 1 1
2 -0.522980 2018-01-03 1 2
3 -0.724524 2018-01-04 2 0
4 0.563453 2018-01-05 4 0
5 0.439059 2018-01-08 4 3
6 -1.902182 2018-01-07 5 0
7 -1.433561 2018-01-08 5 1
8 0.586191 2018-01-09 5 2
One natural way of doing this is by using groupby and apply:
g_df = mydf.groupby('subject_id')
mydf.loc[:, "days_elapsed"] = g_df["date"].apply(lambda x: x - x.iloc[0]).astype('timedelta64[D]').astype(int)
However, if the number of groups (subject IDs) is large (e.g. 10^4), say only 10 times smaller than the length of the dataframe, this very simple operation is really slow.
Is there any faster method?
PS: I have also tried setting the index to subject_id and then using the following list comprehension:
def get_first(series, ind):
    "Return the first row in a group within a series which (group) potentially can span multiple rows and corresponds to a given index"
    group = series.loc[ind]
    if hasattr(group, 'iloc'):
        return group.iloc[0]
    else:  # this is for indices with a single element
        return group
hind_df = mydf.set_index('subject_id')
A = pd.concat([hind_df["date"].loc[ind] - get_first(hind_df["date"], ind) for ind in np.unique(hind_df.index)])
However, it's even slower.
You can use GroupBy + transform with first. This should be more efficient as it avoids expensive lambda function calls.
You may also see a performance improvement by working with the NumPy array via pd.Series.values:
first = df.groupby('subject_id')['date'].transform('first').values
df['days_elapsed'] = (df['date'].values - first).astype('timedelta64[D]').astype(int)
print(df)
subject_id data date days_elapsed
0 1 1.079472 2018-01-01 0
1 1 -0.197255 2018-01-02 1
2 1 -0.687764 2018-01-03 2
3 2 0.023771 2018-01-04 0
4 4 -0.538191 2018-01-05 0
5 4 1.479294 2018-01-08 3
6 5 -1.993196 2018-01-07 0
7 5 -2.111831 2018-01-08 1
8 5 -0.934775 2018-01-09 2
mydf['days_elapsed'] = (mydf['date'] - mydf.groupby(['subject_id'])['date'].transform('min')).dt.days
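If the dates within a group are not guaranteed to be sorted, transform('first') matches the question's wording ("the first date encountered") more literally than transform('min'); a minimal variant:
first = mydf.groupby('subject_id')['date'].transform('first')  # first date encountered per subject
mydf['days_elapsed'] = (mydf['date'] - first).dt.days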
