Add Multiple Columns to Pandas Dataframe from Function - python

I have a pandas data frame mydf that has two columns, and both columns are datetime datatypes: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour

def getW(d):  # gives the week number
    return d.isocalendar()[1]

def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday

mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient as it loops through the data frame at least three times. I would just like to know if there's a faster and/or more optimal way to do this. For example, using zip or merge? If, for example, I just create one function that returns three elements, how should I implement this? To illustrate, the function would be:
def getHWd(d, t):
    return t.hour, d.isocalendar()[1], d.weekday()

Here's one approach that does it with a single apply.
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll pull the lambda function out onto a separate line for readability and define it like
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                           x['mydate'].weekday(),
                                           x['mydate'].isocalendar()[1]])
And, apply and store the result to df[['hour', 'weekday', 'weeknum']]
In [66]: df[['hour', 'weekday', 'weeknum']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
mydate mytime hour weekday weeknum
0 2011-01-01 2011-11-14 0 5 52
1 2011-01-02 2011-11-15 0 6 52
2 2011-01-03 2011-11-16 0 0 1
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 2 1
5 2011-01-06 2011-11-19 0 3 1
6 2011-01-07 2011-11-20 0 4 1
7 2011-01-08 2011-11-21 0 5 1
8 2011-01-09 2011-11-22 0 6 1
9 2011-01-10 2011-11-23 0 0 2
10 2011-01-11 2011-11-24 0 1 2
11 2011-01-12 2011-11-25 0 2 2

To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
                                  x['mydate'].weekday(),
                                  x['mydate'].isocalendar()[1]])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weekday', 'weeknum']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

You can do this in a somewhat cleaner way by having the function you apply return a pd.Series with named elements:
def process(row):
    return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))

my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
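When the per-row computation is simple column arithmetic like this, the same result can be produced without apply at all; a minimal sketch using assign:
new_df = my_df.assign(b=my_df["a"] * 2, c=my_df["a"] + 2)  # vectorized equivalent of process()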

def getWd(d):
    return d.isocalendar()[1], d.weekday()

def getH(t):
    return t.hour

mydf["hour"] = mydf["mytime"].map(getH)
mydf["weeknum"], mydf["weekday"] = zip(*mydf["mydate"].map(getWd))

Related

How to "unroll" time intervals in a dataframe?

I have a dataframe:
df1 = pd.DataFrame(
    [['2011-01-01', '2011-01-03', 'A'],
     ['2011-04-01', '2011-04-01', 'A'],
     ['2012-08-28', '2012-08-30', 'B'],
     ['2015-04-03', '2015-04-05', 'A'],
     ['2015-08-21', '2015-08-21', 'B']],
    columns=['d0', 'd1', 'event'])
d0 d1 event
0 2011-01-01 2011-01-03 A
1 2011-04-01 2011-04-01 A
2 2012-08-28 2012-08-30 B
3 2015-04-03 2015-04-05 A
4 2015-08-21 2015-08-21 B
It contains events A and B that occurred in the specified interval from d0 to d1. (There are actually more events; they are mixed, but their date ranges do not intersect.) The interval can also be a single day (d0 = d1). I need to go from df1 to df2, in which these time intervals are "unrolled" for each event, i.e.:
df2 = pd.DataFrame(
    [['2011-01-01', 'A'], ['2011-01-02', 'A'], ['2011-01-03', 'A'],
     ['2011-04-01', 'A'], ['2012-08-28', 'B'], ['2012-08-29', 'B'],
     ['2012-08-30', 'B'], ['2015-04-03', 'A'], ['2015-04-04', 'A'],
     ['2015-04-05', 'A'], ['2015-08-21', 'B']],
    columns=['Date', 'event'])
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
I tried approaches based on resample and on comparing areas where ffill equals bfill, but couldn't come up with anything. What is the simplest way to do this?
We can set_index to event then create date_range per row, then explode to unwind the ranges and reset_index to create the DataFrame:
df2 = (
df1.set_index('event')
.apply(lambda r: pd.date_range(r['d0'], r['d1']), axis=1)
.explode()
.reset_index(name='Date')[['Date', 'event']]
)
df2:
Date event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
Let's try a comprehension to create the pairs of date and event:
pd.DataFrame(((d, c) for (*v, c) in df1.to_numpy()
for d in pd.date_range(*v)), columns=['Date', 'Event'])
Date Event
0 2011-01-01 A
1 2011-01-02 A
2 2011-01-03 A
3 2011-04-01 A
4 2012-08-28 B
5 2012-08-29 B
6 2012-08-30 B
7 2015-04-03 A
8 2015-04-04 A
9 2015-04-05 A
10 2015-08-21 B
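Another variant along the same lines builds the per-row date ranges with a comprehension, assigns them as a column, and lets explode do the unrolling (a sketch, assuming the df1 from the question):
df2 = (df1.assign(Date=[pd.date_range(a, b) for a, b in zip(df1['d0'], df1['d1'])])
          .explode('Date')[['Date', 'event']]
          .reset_index(drop=True))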
I don't know if this is the simplest, but it's the most intuitive way I can think of: iterate over the rows and unroll them manually into a new dataframe. That is, look at each row, iterate over the dates between d0 and d1, construct a row for each date, and compile them into a dataframe:
from datetime import timedelta

def unroll_events(df):
    rows = []
    for _, row in df.iterrows():
        event = row['event']
        start = row['d0']
        end = row['d1']
        current = start
        while current != end:
            rows.append(dict(Date=current, event=event))
            current += timedelta(days=1)
        rows.append(dict(Date=current, event=event))  # make sure the last date is included
    return pd.DataFrame(rows)
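Since d0 and d1 in df1 hold strings, they need to be parsed before calling this; for example:
df1[['d0', 'd1']] = df1[['d0', 'd1']].apply(pd.to_datetime)  # convert the string columns to datetimes
df2 = unroll_events(df1)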

Python Pandas - Selecting specific rows based on the max and min of two columns with the same group id

I am looking for a way to identify the 'master' row. For each group_id, I define the master row as the one with the minimum cust_hierarchy; if there is a tie, use the row with the most recent date.
I have supplied a sample table below:
row_id  group_id  cust_hierarchy  most_recent_date  master (I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
You can basically do a groupby with an idxmin(), with a little bit of sorting to ensure that the most recent date wins ties in the min operation:
import pandas as pd
import numpy as np
# example data
dates = ['2020-01-03','2019-01-01','2019-05-01',
'2019-04-01','2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id':[0,0,1,1,1],
'cust_hierarchy':[2,7,7,6,6,],
'most_recent_date':dates})
# solution
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int))
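The idea sketched in the question (sort by the two columns, then flag the first row per group_id) also works directly; a minimal sketch using cumcount, assuming most_recent_date is a datetime or an ISO-formatted string (both sort the same way):
ranked = df.sort_values(['cust_hierarchy', 'most_recent_date'], ascending=[True, False])
# cumcount numbers the rows within each group in sorted order; 0 marks the master row
df['master'] = (ranked.groupby('group_id').cumcount() == 0).astype(int)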

Partially melt hierarchical index

I have a dataframe df that I want to look like df2.
import datetime
import pandas as pd

yesterday = datetime.datetime.today() - datetime.timedelta(days=1)
cols = pd.MultiIndex.from_product(iterables=[['A','B'],['1','2']], names=['letter','number'])
idx = pd.date_range(start=yesterday, periods=2, freq='H')
df = pd.DataFrame(columns=cols, index=idx, data=[[1,2,3,4],[5,6,7,8]])
letter                      A     B
number                      1  2  1  2
2019-09-26 07:45:49.659873  1  2  3  4
2019-09-26 08:45:49.659873  5  6  7  8
df2 = pd.DataFrame(columns=['letter', '1', '2'], index=idx.append(idx), data=[['A',1,2],['A',5,6],['B',3,4],['B',7,8]])
letter 1 2
2019-09-26 07:45:49.659873 A 1 2
2019-09-26 08:45:49.659873 A 5 6
2019-09-26 07:45:49.659873 B 3 4
2019-09-26 08:45:49.659873 B 7 8
So it's sort of like a melt or reset_index, but I haven't found the magic combination to do this on a much larger dataframe in practice. I want to turn level 0 of the columns into an unpivoted column, but leave level 1 unchanged.
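One way to get from df to df2 is to stack the 'letter' level out of the columns and then move it back out of the index; a minimal sketch, assuming the level names from the question:
df2 = (df.stack(level='letter')                     # move 'letter' from the columns into the index
         .reset_index(level='letter')               # turn it into a regular column
         .sort_values('letter', kind='mergesort'))  # stable sort keeps the time order within each letter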

Efficient and elegant way to fill values in pandas column based on each group

df_new = pd.DataFrame(
{
'person_id': [1, 1, 3, 3, 5, 5],
'obs_date': ['12/31/2007', 'NA-NA-NA NA:NA:NA', 'NA-NA-NA NA:NA:NA', '11/25/2009', '10/15/2019', 'NA-NA-NA NA:NA:NA']
})
What I would like to do is replace/fill the NA-type rows with actual date values from the same group, for which I tried the following:
m1 = df_new['obs_date'].str.contains(r'^\d')
df_new['obs_date'] = df_new.groupby(m1.cumsum())['obs_date'].transform('first')
But this gives unexpected output: the NA row for person_id = 3 (index 2) should have been filled with 11/25/2009, yet it takes its value from the first group (person_id = 1).
How can I get the expected output? Any elegant and efficient solution is welcome, as I am dealing with more than a million records.
First use to_datetime with errors='coerce' to convert non-datetimes to missing values, then use GroupBy.transform('first') to fill the column with each group's first non-missing value:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('first')
#alternative - minimal value per group
#df_new['obs_date'] = df_new.groupby('person_id')['obs_date'].transform('min')
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
Another idea is to use DataFrame.sort_values followed by a per-group ffill:
df_new['obs_date'] = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')
df_new['obs_date'] = (df_new.sort_values(['person_id','obs_date'])
                            .groupby('person_id')['obs_date']
                            .ffill())
print (df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
You can use pd.to_datetime(.., errors='coerce') to turn non-date values into NaT, then ffill and bfill within each group:
df_new['obs_date'] = (df_new.assign(obs_date=pd.to_datetime(df_new['obs_date'], errors='coerce'))
                            .groupby('person_id')['obs_date']
                            .apply(lambda x: x.ffill().bfill()))
print(df_new)
person_id obs_date
0 1 2007-12-31
1 1 2007-12-31
2 3 2009-11-25
3 3 2009-11-25
4 5 2019-10-15
5 5 2019-10-15
df_new = df_new.join(df_new.groupby('person_id')["obs_date"].min(),
                     on='person_id',
                     rsuffix="_clean")
Output:
person_id obs_date obs_date_clean
0 1 12/31/2007 12/31/2007
1 1 NA-NA-NA NA:NA:NA 12/31/2007
2 3 NA-NA-NA NA:NA:NA 11/25/2009
3 3 11/25/2009 11/25/2009
4 5 10/15/2019 10/15/2019
5 5 NA-NA-NA NA:NA:NA 10/15/2019
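Note that min() here compares the raw strings; it works for this sample only because 'N' sorts after the digits. A safer variant (an assumption on my part, reusing the parsing idea from the answers above) converts to datetimes first:
clean = pd.to_datetime(df_new['obs_date'], format='%m/%d/%Y', errors='coerce')  # real dates, NaT elsewhere
df_new['obs_date_clean'] = clean.groupby(df_new['person_id']).transform('min')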

Efficiently applying custom functions to groups in Pandas

This question is about efficiently applying a custom function to logical groups of rows in a Pandas dataframe that share a value in some column.
Consider the following example of a dataframe:
import numpy as np
import pandas as pd

sID = [1,1,1,2,4,4,5,5,5]
data = np.random.randn(len(sID))
dates = pd.date_range(start='1/1/2018', periods=len(sID))
mydf = pd.DataFrame({"subject_id": sID, "data": data, "date": dates})
mydf.loc[5, 'date'] += pd.Timedelta('2 days')
which looks like:
data date subject_id
0 0.168150 2018-01-01 1
1 -0.484301 2018-01-02 1
2 -0.522980 2018-01-03 1
3 -0.724524 2018-01-04 2
4 0.563453 2018-01-05 4
5 0.439059 2018-01-08 4
6 -1.902182 2018-01-07 5
7 -1.433561 2018-01-08 5
8 0.586191 2018-01-09 5
Imagine that for each subject_id, we want to subtract from each date the first date encountered for this subject_id. Storing the result in a new column "days_elapsed", the result will look like this:
data date subject_id days_elapsed
0 0.168150 2018-01-01 1 0
1 -0.484301 2018-01-02 1 1
2 -0.522980 2018-01-03 1 2
3 -0.724524 2018-01-04 2 0
4 0.563453 2018-01-05 4 0
5 0.439059 2018-01-08 4 3
6 -1.902182 2018-01-07 5 0
7 -1.433561 2018-01-08 5 1
8 0.586191 2018-01-09 5 2
One natural way of doing this is by using groupby and apply:
g_df = mydf.groupby('subject_id')
mydf.loc[:, "days_elapsed"] = g_df["date"].apply(lambda x: x - x.iloc[0]).astype('timedelta64[D]').astype(int)
However, if the number of groups (subject IDs) is large (e.g. 10^4), say only 10 times smaller than the length of the dataframe, this very simple operation is really slow.
Is there any faster method?
PS: I have also tried setting the index to subject_id and then using the following list comprehension:
def get_first(series, ind):
    "Return the first row in a group within a series which (group) potentially can span multiple rows and corresponds to a given index"
    group = series.loc[ind]
    if hasattr(group, 'iloc'):
        return group.iloc[0]
    else:  # this is for indices with a single element
        return group
hind_df = mydf.set_index('subject_id')
A = pd.concat([hind_df["date"].loc[ind] - get_first(hind_df["date"], ind) for ind in np.unique(hind_df.index)])
However, it's even slower.
You can use GroupBy + transform with first. This should be more efficient as it avoids expensive lambda function calls.
You may also see a performance improvement by working with the NumPy array via pd.Series.values:
first = df.groupby('subject_id')['date'].transform('first').values
df['days_elapsed'] = (df['date'].values - first).astype('timedelta64[D]').astype(int)
print(df)
subject_id data date days_elapsed
0 1 1.079472 2018-01-01 0
1 1 -0.197255 2018-01-02 1
2 1 -0.687764 2018-01-03 2
3 2 0.023771 2018-01-04 0
4 4 -0.538191 2018-01-05 0
5 4 1.479294 2018-01-08 3
6 5 -1.993196 2018-01-07 0
7 5 -2.111831 2018-01-08 1
8 5 -0.934775 2018-01-09 2
mydf['days_elapsed'] = (mydf['date'] - mydf.groupby(['subject_id'])['date'].transform('min')).dt.days
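If the dates within a group are not guaranteed to be sorted, transform('first') matches the question's wording ("the first date encountered") more literally than transform('min'); a minimal variant:
first = mydf.groupby('subject_id')['date'].transform('first')  # first date encountered per subject
mydf['days_elapsed'] = (mydf['date'] - first).dt.days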
