Pandas dataframe condition on datetime at other rows - python

My dataframe is shown as follows:
User Date Unit
1 A 2000-10-31 1
2 A 2001-10-31 2
3 A 2002-10-31 1
4 A 2003-10-31 2
5 B 2000-07-31 1
6 B 2000-08-31 2
7 B 2001-07-31 1
8 B 2002-06-30 1
9 B 2002-07-31 1
10 B 2002-08-31 1
I want to make the following judgement:
(1) If the 'User' has the 'Unit' in the same month in the past two consecutive years, the row should be classified as 'Routine' with a dummy variable 1.
(2) Otherwise, the row should be classified as 0 in the 'Routine' column.
(3) If the 'User' does not have data for the past two consecutive years, the 'Routine' column should show NaN.
My desired output is:
User Date Unit Routine
1 A 2000-10-31 1 NaN
2 A 2001-10-31 2 NaN
3 A 2002-10-31 1 1
4 A 2003-10-31 2 1
5 B 2000-07-31 1 NaN
6 B 2000-08-31 2 NaN
7 B 2001-07-31 1 NaN
8 B 2002-06-30 1 0
9 B 2002-07-31 1 1
10 B 2002-08-31 1 0
The code for the dataframe is shown as follows:
df = pd.DataFrame({'User': list('AAAABBBBBB'),
                   'Date': ['2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31', '2000-07-31',
                            '2000-08-31', '2001-07-31', '2002-06-30', '2002-07-31', '2002-08-31'],
                   'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])
I want to use the groupby function since there are many users in the dataframe. Thank you.

The code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'User': list('AAAABBBBBB'),
        'Date': [
            '2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31',
            '2000-07-31', '2000-08-31', '2001-07-31', '2002-06-30',
            '2002-07-31', '2002-08-31'],
        'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])

def routine(user, cdate, unit):
    result = np.nan
    # The two calendar years immediately preceding the current row's date.
    two_years = [cdate.year - 1, cdate.year - 2]
    # Restrict the lookup to the same user's records within those two years.
    mask = df.User == user
    mask = mask & df.Date.dt.year.isin(two_years)
    sdf = df[mask]
    years = sdf.Date.dt.year.to_list()
    # Both preceding years must be present, otherwise the result stays NaN.
    got_years = all([y in years for y in two_years])
    result = 0 if (sdf.shape[0] > 0) & got_years else result
    # Upgrade to 1 when a record with the same month and unit exists.
    mask2 = (sdf.Date.dt.month == cdate.month) & (sdf.Unit == unit)
    sdf = sdf[mask2]
    result = 1 if (sdf.shape[0] > 0) & got_years else result
    return result

df['Routine'] = df.apply(
    lambda row: routine(row['User'], row['Date'], row['Unit']), axis=1)
print(df)
Output:
User Date Unit Routine
0 A 2000-10-31 1 NaN
1 A 2001-10-31 2 NaN
2 A 2002-10-31 1 1.0
3 A 2003-10-31 2 1.0
4 B 2000-07-31 1 NaN
5 B 2000-08-31 2 NaN
6 B 2001-07-31 1 NaN
7 B 2002-06-30 1 0.0
8 B 2002-07-31 1 1.0
9 B 2002-08-31 1 0.0
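Since the question specifically asks for a groupby approach, below is a minimal sketch of the same logic routed through groupby, so each lookup only scans that user's rows (routine_group is a hypothetical helper name; the checks mirror the routine() function above):
def routine_group(g):
    # g is one user's sub-frame; check each row against that user's
    # records from the two preceding calendar years.
    def _routine(row):
        two_years = {row.Date.year - 1, row.Date.year - 2}
        past = g[g.Date.dt.year.isin(two_years)]
        if not two_years.issubset(set(past.Date.dt.year)):
            return np.nan  # fewer than two consecutive past years
        same = past[(past.Date.dt.month == row.Date.month) & (past.Unit == row.Unit)]
        return 1.0 if len(same) else 0.0
    return g.apply(_routine, axis=1)

df['Routine'] = df.groupby('User', group_keys=False).apply(routine_group)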

Related

Find the time difference between consecutive rows of two columns for a given value in third column

Let's say we want to compute the variable D in the dataframe below based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes, the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value for the first row where the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
             .dt.total_seconds().div(60))
# Set NaN where the category in A changes
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can use the difference between datetime columns in pandas.
Having
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
Makes the following possible
>>> df['D'] = (df.groupby('A')
...              .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
...              .reset_index(drop=True))
You can always drop these new columns later.
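For example, once D is computed:
df = df.drop(columns=['B_dt', 'C_dt'])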

Number winning streak ID's in pandas

I have a Python pandas dataframe with winning streaks for some teams over several time periods, and I would like to identify the streaks chronologically. So, what I have is:
import pandas as pd
data = pd.DataFrame({'period': list(range(1, 7)) + list(range(1, 6)),
                     'team_id': ['A'] * 6 + ['B'] * 5,
                     'win': [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
                     'streak_length': [1, 2, 3, 0, 1, 2, 1, 0, 0, 1, 2]})
print(data)
And what I would like to have is:
result = pd.DataFrame({'period': list(range(1, 7)) + list(range(1, 6)),
                       'team_id': ['A'] * 6 + ['B'] * 5,
                       'win': [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
                       'streak_length': [1, 2, 3, 0, 1, 2, 1, 0, 0, 1, 2],
                       'streak_id': [1, 1, 1, None, 2, 2, 1, None, None, 2, 2]})
print(result)
I tried to group by team_id and sum over streak_length, but those values can repeat, so I think this would not work. Any help appreciated!
Create consecutive groups with Series.shift, Series.ne and Series.cumsum, filter only the rows where win is 1, and use GroupBy.transform with factorize in a lambda function:
m = data['win'].eq(1)
g = data['win'].ne(data['win'].shift()).cumsum()
data['streak_id'] = g[m].groupby(data['team_id']).transform(
    lambda x: pd.factorize(x)[0] + 1
)
print (data)
period team_id win streak_length streak_id
0 1 A 1 1 1.0
1 2 A 1 2 1.0
2 3 A 1 3 1.0
3 4 A 0 0 NaN
4 5 A 1 1 2.0
5 6 A 1 2 2.0
6 1 B 1 1 1.0
7 2 B 0 0 NaN
8 3 B 0 0 NaN
9 4 B 1 1 2.0
10 5 B 1 2 2.0
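If it helps to see how the pieces fit together, this optional snippet prints the intermediate helpers: m marks the winning rows, g numbers each run of identical win values, and factorize then renumbers the winning runs 1, 2, ... within each team:
# Inspect the helper Series next to the data they were built from.
print(pd.concat([data[['team_id', 'win']], m.rename('m'), g.rename('g')], axis=1))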

Finding mean of specific column and keep all rows that have specific mean values

I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A', 'D', 'M', 'T', 'B', 'C', 'D', 'E', 'A', 'L'],
                   'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 5],
                   'rate': [3.5, 4.5, 2.0, 5.0, 4.0, 1.5, 2.0, 2.0, 1.0, 5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean for every 'id'.
2) Give the number of ids (length) which have a mean >= 3.
3) Give back all rows of the dataframe where the mean of the id is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >=3)
>>df
name id rate
0 A 1 3.0
1 D 1 4.0
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to get the mean of each group broadcast to the same length as the original DataFrame, which makes it possible to filter with boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
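The question also asks for the number of ids whose mean is at least 3; a small sketch of that count (run against the original, unfiltered DataFrame) could be:
means = df.groupby('id')['rate'].mean()
print('Number of ids (length) where mean >= 3:', (means >= 3).sum())
# Number of ids (length) where mean >= 3: 3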

Proportional/Percentage values

I have this data frame:
o d r kz p
1 3 1 5 NaN
1 3 2 0 NaN
1 10 1 7 NaN
1 10 3 1 NaN
1 10 2 2 NaN
I would like to fill up the 'p' column by the proportions of 'kz' values for each pair of 'o' and 'd'. The result should look like:
o d r kz p
1 3 1 5 100%
1 3 2 0 0%
1 10 1 7 70%
1 10 3 1 10%
1 10 2 2 20%
I am thinking of looping through the data frame, collecting a list of lists of kz values, and then filling up the p column from those.
Is there any elegant way to do it e.g. with groupby or Pivot table?
You can do it in several steps:
Compute the sum per group with groupby and agg.
Merge these values with your current dataframe with merge.
Compute the ratio.
Here is the code:
# Import modules
import pandas as pd
import numpy as np
# Data
df = pd.DataFrame(
    [[1, 3, 1, 5, np.nan],
     [1, 3, 2, 0, np.nan],
     [1, 10, 1, 7, np.nan],
     [1, 10, 3, 1, np.nan],
     [1, 10, 2, 2, np.nan]],
    columns=["o", "d", "r", "kz", "p"])
print(df)
# o d r kz p
# 0 1 3 1 5 NaN
# 1 1 3 2 0 NaN
# 2 1 10 1 7 NaN
# 3 1 10 3 1 NaN
# 4 1 10 2 2 NaN
# Compute the sum per group
sum_ = df.groupby(['o', 'd']).agg({'kz': 'sum'})
sum_.reset_index(inplace=True)
print(sum_)
# o d kz
# 0 1 3 5
# 1 1 10 10
# Merge these values with the current dataframe
df = df.merge(sum_, on=['o', 'd'], how="outer", suffixes=('', '_sum'))
print(df)
# o d r kz p kz_sum
# 0 1 3 1 5 NaN 5
# 1 1 3 2 0 NaN 5
# 2 1 10 1 7 NaN 10
# 3 1 10 3 1 NaN 10
# 4 1 10 2 2 NaN 10
# Compute the ratio
df.p = df.kz / df.kz_sum * 100
print(df)
# o d r kz p kz_sum
# 0 1 3 1 5 100.0 5
# 1 1 3 2 0 0.0 5
# 2 1 10 1 7 70.0 10
# 3 1 10 3 1 10.0 10
# 4 1 10 2 2 20.0 10
First sum the 'kz' column grouped by 'o' and 'd' and store the result in 'tmp'. Merge those two data frames. Then calculate the percentage value 'p' using the original value of 'kz' and the summed value of 'kz'. Finally, drop the summed 'kz' column and rename the original column back to 'kz'.
import pandas as pd
d = {'o': pd.Series([1, 1, 1, 1, 1]),
     'd': pd.Series([3, 3, 10, 10, 10]),
     'r': pd.Series([1, 2, 1, 3, 2]),
     'kz': pd.Series([5, 0, 7, 1, 2]),
     'p': pd.Series(None)}
# creates Dataframe.
df = pd.DataFrame(d)
tmp=df.groupby(['o','d'])["kz"].sum()
merge_tmp=pd.merge(df, tmp, on=['o','d'], how='inner',suffixes=('_org','_tmp'))
merge_tmp['p'] = ((merge_tmp['kz_org']/merge_tmp['kz_tmp'])*100)
merge_tmp = merge_tmp.drop('kz_tmp', axis='columns')
merge_tmp = merge_tmp.rename({'kz_org': 'kz'}, axis='columns')
print(merge_tmp)
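As an aside, the same proportion can also be computed in a single step with GroupBy.transform; a minimal sketch, assuming the df defined above:
# Broadcast the per-(o, d) sum back onto each row and divide
df['p'] = df['kz'] / df.groupby(['o', 'd'])['kz'].transform('sum') * 100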

Use Pandas dataframe to add lag feature from MultiIindex Series

I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This however is very slow and memory hungry and I was wondering what are some better ways to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the merged median column (renamed to 'Median' in the full example below) 0 when there is no match, use fillna to change NaNs to 0:
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20, 5)),
                  columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0
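If you prefer to keep working with the MultiIndex Series directly instead of merging, here is a sketch of an alternative: shift the 'Week' level of the median Series by 2 and look each row up with reindex (this assumes saved_groupby is still the MultiIndex Series built in the question, before any reset_index):
# Shift the Week level so that looking up a row's own Week returns
# the median from two weeks earlier.
lagged = saved_groupby.copy()
lagged.index = lagged.index.set_levels(lagged.index.levels[0] + 2, level='Week')
keys = pd.MultiIndex.from_frame(df[['Week', 'ID_1', 'ID_2']])
df['Median'] = lagged.reindex(keys).fillna(0).to_numpy()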
