I have a Python pandas dataframe with winning streaks for some teams over several time periods and I would like to identify the streaks chronologically. So, what I have is:
import pandas as pd
data = pd.DataFrame({'period': list(range(1, 7)) + list(range(1, 6)),
                     'team_id': ['A']*6 + ['B']*5,
                     'win': [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
                     'streak_length': [1, 2, 3, 0, 1, 2, 1, 0, 0, 1, 2]})
print(data)
And what I would like to have is:
result = pd.DataFrame({'period': list(range(1, 7)) + list(range(1, 6)),
                       'team_id': ['A']*6 + ['B']*5,
                       'win': [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
                       'streak_length': [1, 2, 3, 0, 1, 2, 1, 0, 0, 1, 2],
                       'streak_id': [1, 1, 1, None, 2, 2, 1, None, None, 2, 2]})
print(result)
I tried to group by team_id and sum over streak_length, but streak lengths can repeat, so I don't think that would work. Any help appreciated!
Create consecutive groups with Series.shift, Series.ne and Series.cumsum, filter only the rows where win is 1, and use GroupBy.transform with factorize in a lambda function:
m = data['win'].eq(1)
g = data['win'].ne(data['win'].shift()).cumsum()
data['streak_id'] = g[m].groupby(data['team_id']).transform(
    lambda x: pd.factorize(x)[0] + 1
)
print(data)
period team_id win streak_length streak_id
0 1 A 1 1 1.0
1 2 A 1 2 1.0
2 3 A 1 3 1.0
3 4 A 0 0 NaN
4 5 A 1 1 2.0
5 6 A 1 2 2.0
6 1 B 1 1 1.0
7 2 B 0 0 NaN
8 3 B 0 0 NaN
9 4 B 1 1 2.0
10 5 B 1 2 2.0
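Detail: g labels every consecutive run of identical win values, and factorize then renumbers just the winning runs within each team, starting from 1:
print(g.tolist())
[1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5]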
I have this dataframe.
import pandas as pd

df = pd.DataFrame({'name': ['A', 'D', 'M', 'T', 'B', 'C', 'D', 'E', 'A', 'L'],
                   'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 5],
                   'rate': [3.5, 4.5, 2.0, 5.0, 4.0, 1.5, 2.0, 2.0, 1.0, 5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is:
1) find the mean rate of every 'id';
2) give the number of ids (length) which have mean >= 3;
3) give back all rows of the dataframe where the id's mean is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >= 3)
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Use GroupBy.transform to broadcast each group's mean back to the length of the original DataFrame, so you can filter with boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail:
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
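Both solutions cover part (3); for part (2), the count of ids whose mean is >= 3 can be read off the per-id means (a small addition, computed on the original, unfiltered df):
means = df.groupby('id')['rate'].mean()
print('Number of ids (length) where mean >= 3:', (means >= 3).sum())
# Number of ids (length) where mean >= 3: 3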
My dataframe is shown as follows:
User Date Unit
1 A 2000-10-31 1
2 A 2001-10-31 2
3 A 2002-10-31 1
4 A 2003-10-31 2
5 B 2000-07-31 1
6 B 2000-08-31 2
7 B 2001-07-31 1
8 B 2002-06-30 1
9 B 2002-07-31 1
10 B 2002-08-31 1
I want to make the following judgement:
(1) If the 'User' had the same 'Unit' in the same month within the past two consecutive years, the row should be classified as 'Routine' with a dummy variable 1.
(2) Otherwise, the row should be classified as 0 in the 'Routine' column.
(3) For rows that do not have data for the two past consecutive years, the 'Routine' column should show NaN.
My desired output is:
User Date Unit Routine
1 A 2000-10-31 1 NaN
2 A 2001-10-31 2 NaN
3 A 2002-10-31 1 1
4 A 2003-10-31 2 1
5 B 2000-07-31 1 NaN
6 B 2000-08-31 2 NaN
7 B 2001-07-31 1 NaN
8 B 2002-06-30 1 0
9 B 2002-07-31 1 1
10 B 2002-08-31 1 0
The code of the dataframe is shown as follows:
df = pd.DataFrame({'User': list('AAAABBBBBB'),
                   'Date': ['2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31', '2000-07-31',
                            '2000-08-31', '2001-07-31', '2002-06-30', '2002-07-31', '2002-08-31'],
                   'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])
I want to use groupby function since there are many users in the dataframe. Thank you.
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'User': list('AAAABBBBBB'),
        'Date': [
            '2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31',
            '2000-07-31', '2000-08-31', '2001-07-31', '2002-06-30',
            '2002-07-31', '2002-08-31'],
        'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])
def routine(user, cdate, unit):
    result = np.nan
    two_years = [cdate.year - 1, cdate.year - 2]
    # rows of the same user that fall in the two preceding years
    mask = df.User == user
    mask = mask & df.Date.dt.year.isin(two_years)
    sdf = df[mask]
    years = sdf.Date.dt.year.to_list()
    got_years = all(y in years for y in two_years)
    # both preceding years have data -> at least 0, otherwise stay NaN
    result = 0 if (sdf.shape[0] > 0) & got_years else result
    # same month and same unit found in those years -> 1
    mask2 = (sdf.Date.dt.month == cdate.month) & (sdf.Unit == unit)
    sdf = sdf[mask2]
    result = 1 if (sdf.shape[0] > 0) & got_years else result
    return result

df['Routine'] = df.apply(
    lambda row: routine(row['User'], row['Date'], row['Unit']), axis=1)
print(df)
Output:
User Date Unit Routine
0 A 2000-10-31 1 NaN
1 A 2001-10-31 2 NaN
2 A 2002-10-31 1 1.0
3 A 2003-10-31 2 1.0
4 B 2000-07-31 1 NaN
5 B 2000-08-31 2 NaN
6 B 2001-07-31 1 NaN
7 B 2002-06-30 1 0.0
8 B 2002-07-31 1 1.0
9 B 2002-08-31 1 0.0
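Because routine rescans the whole df for every row, this can get slow on a large frame. A hedged sketch of the same logic with precomputed lookups (continuing from the df above; the names tmp, user_years, seen and routine_fast are mine, not part of the original answer) that reproduces the Routine column shown in the output:
tmp = df.assign(year=df['Date'].dt.year, month=df['Date'].dt.month)

# years that have any data, per user
user_years = tmp.groupby('User')['year'].agg(set)
# every (User, year, month, Unit) combination present in the data
seen = set(zip(tmp['User'], tmp['year'], tmp['month'], tmp['Unit']))

def routine_fast(row):
    u, y = row['User'], row['Date'].year
    # NaN unless both of the two preceding years have data for this user
    if not {y - 1, y - 2} <= user_years[u]:
        return np.nan
    # 1 if the same month/unit shows up in either of the two preceding years
    hit = any((u, py, row['Date'].month, row['Unit']) in seen for py in (y - 1, y - 2))
    return 1 if hit else 0

df['Routine'] = df.apply(routine_fast, axis=1)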
I have a Pandas dataset with 3 columns. I need to group by the ID column while finding the sum and count of the other two columns. Also, I have to ignore the zeroes in the columns 'A' and 'B'.
The dataset looks like -
ID A B
1 0 5
2 10 0
2 20 0
3 0 30
What I need -
ID A_Count A_Sum B_Count B_Sum
1 0 0 1 5
2 2 30 0 0
3 0 0 1 30
I have tried this using one column but wasn't able to get both the aggregations in the final dataset.
(df.groupby('ID').agg({'A':'sum', 'A':'count'}).reset_index().rename(columns = {'A':'A_sum', 'A': 'A_count'}))
If you don't pass specific columns to agg, it aggregates all numeric columns by itself.
Since you don't want to count 0, replace the zeros with NaN first:
import numpy as np

df.replace(0, np.nan, inplace=True)
print(df)
ID A B
0 1 NaN 5.0
1 2 10.0 NaN
2 2 20.0 NaN
3 3 NaN 30.0
df = df.groupby('ID').agg(['count', 'sum'])
print(df)
A B
count sum count sum
ID
1 0 0.0 1 5.0
2 2 30.0 0 0.0
3 0 0.0 1 30.0
To remove the MultiIndex columns, you can flatten them with a list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print(df)
A_count A_sum B_count B_sum
ID
1 0 0.0 1 5.0
2 2 30.0 0 0.0
3 0 0.0 1 30.0
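Alternatively (assuming pandas 0.25 or newer), named aggregation produces the flat column names directly, so no flattening step is needed; a self-contained sketch with the same data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3],
                   'A': [0, 10, 20, 0],
                   'B': [5, 0, 0, 30]})

out = (df.replace(0, np.nan)
         .groupby('ID')
         .agg(A_Count=('A', 'count'), A_Sum=('A', 'sum'),
              B_Count=('B', 'count'), B_Sum=('B', 'sum'))
         .reset_index())
print(out)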
I am looking to create a new column in pandas based on the values in each row. My sample data:
df = pd.DataFrame({"A": ['a']*6 + ['b']*4,
                   "Sales": [2, 3, 7, 1, 4, 3, 5, 6, 9, 10],
                   "Week": [1, 2, 3, 4, 5, 11, 1, 2, 3, 4]})
I want a new column "Last3WeekSales" corresponding to each week, having the sum of sales for the previous 3 weeks.
NOTE: Shift() won't work here as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
                         .apply(lambda x: x.shift(1)
                                           .rolling(3, min_periods=1)
                                           .sum())\
                         .fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 11 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
9 b 10 4 20.0
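Note that row 5 (Week 11) comes out as 12.0 rather than the 0 in the question's expected output, because rolling(3) sums the previous three rows, not the previous three calendar weeks. A hedged sketch of a week-aware variant (the helper name last3 is mine, not from the question): it reindexes each group on Week so missing weeks contribute 0, and it reproduces the expected output above:
def last3(g):
    # index sales by week and fill in the missing weeks with 0
    s = g.set_index('Week')['Sales']
    full = s.reindex(range(s.index.min(), s.index.max() + 1), fill_value=0)
    # sum of the three preceding weeks, keeping only the weeks present in g
    out = full.shift(1).rolling(3, min_periods=1).sum().fillna(0)
    return pd.Series(out.reindex(s.index).to_numpy(), index=g.index)

df['Last3WeekSales'] = df.groupby('A', group_keys=False).apply(last3)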
You can use rolling(...).sum() to sum over the last 3 values, and shift(n) to shift your column by n rows (1 in your case).
If we suppose you have a column 'Sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["Sales"].apply(lambda x: x.shift(1).rolling(3).sum())
I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This however is very slow and memory hungry and I was wondering what are some better ways to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2 and rename the Target column:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target': 'Median'})
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the merged Median column 0 when there is no match, use fillna to change the NaNs to 0:
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20,5)),
columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0