Pandas Group By With Overlapping Bins - python

I want to sum up data across overlapping bins. Basically the question here, but instead of the bins being (0-8 years old), (9-17 years old), (18-26 years old), (27-35 years old), and (36-44 years old), I want them to be (0-8 years old), (1-9 years old), (2-10 years old), (3-11 years old), and (4-12 years old).
Starting with a df like this
id  awards  age
1   100     24
1   150     26
1   50      54
2   193     34
2   209     50
I am using the code from this answer to calculate summation across non-overlapping bins.
bins = [9 * i for i in range(0, df['age'].max() // 9 + 2)]
cuts = pd.cut(df['age'], bins, right=False)
print(cuts)
0 [18, 27)
1 [18, 27)
2 [54, 63)
3 [27, 36)
4 [45, 54)
Name: age, dtype: category
Categories (7, interval[int64, left]): [[0, 9) < [9, 18) < [18, 27) < [27, 36) < [36, 45) < [45, 54) < [54, 63)]
df_out = (df.groupby(['id', cuts])
            .agg(total_awards=('awards', 'sum'))
            .reset_index(level=0)
            .reset_index(drop=True)
         )
df_out['age_interval'] = df_out.groupby('id').cumcount()
Result
print(df_out)
id total_awards age_interval
0 1 0 0
1 1 0 1
2 1 250 2
3 1 0 3
4 1 0 4
5 1 0 5
6 1 50 6
7 2 0 0
8 2 0 1
9 2 0 2
10 2 193 3
11 2 0 4
12 2 209 5
13 2 0 6
Is it possible to work off the existing code to do this with overlapping bins?

First pivot_table your data to get a row per id with the ages as columns. Then reindex to get all possible ages, from 0 to at least the max of the age column (here I use the max plus the interval length). Now you can use rolling along the columns. Rename the columns to create meaningful names. Finally, stack and reset_index to get a dataframe with the expected shape.
interval = 9  # include both bounds, e.g. 0 and 8 for the first interval
res = (
    df.pivot_table(index='id', columns='age', values='awards',
                   aggfunc='sum', fill_value=0)
      .reindex(columns=range(0, df['age'].max() + interval), fill_value=0)
      .rolling(interval, axis=1, min_periods=interval).sum()
      .rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)
and you get with the input data provided in the question
print(res)
# id age awards
# 0 1 0-8 y.o. 0.0
# 1 1 1-9 y.o. 0.0
# ...
# 15 1 15-23 y.o. 0.0
# 16 1 16-24 y.o. 100.0
# 17 1 17-25 y.o. 100.0
# 18 1 18-26 y.o. 250.0
# 19 1 19-27 y.o. 250.0
# 20 1 20-28 y.o. 250.0
# 21 1 21-29 y.o. 250.0
# 22 1 22-30 y.o. 250.0
# 23 1 23-31 y.o. 250.0
# 24 1 24-32 y.o. 250.0
# 25 1 25-33 y.o. 150.0
# 26 1 26-34 y.o. 150.0
# 27 1 27-35 y.o. 0.0
# ...
# 45 1 45-53 y.o. 0.0
# 46 1 46-54 y.o. 50.0
# 47 1 47-55 y.o. 50.0
# 48 1 48-56 y.o. 50.0
# 49 1 49-57 y.o. 50.0
# ...

I think the best would be to first compute per-age sums, and then a rolling window to get all 9-year intervals. This only works because all your intervals have the same size; otherwise it would be much harder.
>>> totals = df.groupby('age')['awards'].sum()
>>> totals = totals.reindex(np.arange(0, df['age'].max() + 9)).fillna(0, downcast='infer')
>>> totals
0 6
1 2
2 4
3 6
4 4
..
98 0
99 0
100 0
101 0
102 0
Name: awards, Length: 103, dtype: int64
>>> totals.rolling(9).sum().dropna().astype(int).rename(lambda age: f'{age-8}-{age}')
0-8 42
1-9 43
2-10 45
3-11 47
4-12 47
..
90-98 31
91-99 27
92-100 20
93-101 13
94-102 8
Name: awards, Length: 95, dtype: int64
This is slightly complicated by the fact you also want to group by id, but the idea stays the same:
>>> idx = pd.MultiIndex.from_product([df['id'].unique(), np.arange(0, df['age'].max() + 9)], names=['id', 'age'])
>>> totals = df.groupby(['id', 'age']).sum().reindex(idx).fillna(0, downcast='infer')
>>> totals
awards
1 0 128
1 204
2 136
3 367
4 387
... ...
2 98 0
99 0
100 0
101 0
102 0
[206 rows x 1 columns]
>>> totals.groupby('id').rolling(9).sum().droplevel(0).dropna().astype(int).reset_index('id')
id awards
age
8 1 3112
9 1 3390
10 1 3431
11 1 3609
12 1 3820
.. .. ...
98 2 1786
99 2 1226
100 2 900
101 2 561
102 2 317
[190 rows x 2 columns]
This is the same as Ben.T's answer, except we keep the Series shape while his answer pivots it to a dataframe. At any step you could .stack('age') or .unstack('age') to switch between the two answers' formats.
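A minimal sketch of that stack/unstack switch, on a small hypothetical per-(id, age) Series (illustrative values, not the question's data):

```python
import pandas as pd

# Hypothetical per-(id, age) award totals, just to illustrate the reshaping.
s = pd.Series(
    [0, 250, 50, 193, 209],
    index=pd.MultiIndex.from_tuples(
        [(1, 8), (1, 26), (1, 54), (2, 34), (2, 50)], names=["id", "age"]
    ),
)

wide = s.unstack("age", fill_value=0)  # one row per id, one column per age
long_again = wide.stack()              # back to a MultiIndex Series
```

unstack moves the 'age' level into columns (the wide shape of Ben.T's answer), and stack folds it back into the index (the Series shape above).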

IIUC, you can use pd.IntervalIndex with some list comprehension:
ii = pd.IntervalIndex.from_tuples(
    [
        (s, e)
        for e, s in pd.Series(np.arange(51)).rolling(9).agg('min').dropna().items()
    ]
)
df_out = pd.concat(
    [
        pd.Series(ii.contains(x["age"]) * x["awards"], index=ii)
        for i, x in df[["age", "awards"]].iterrows()
    ],
    axis=1,
).groupby(level=0).sum().T
df_out.stack()
Output:
0 (0.0, 8.0] 0
(1.0, 9.0] 0
(2.0, 10.0] 0
(3.0, 11.0] 0
(4.0, 12.0] 0
...
4 (38.0, 46.0] 0
(39.0, 47.0] 0
(40.0, 48.0] 0
(41.0, 49.0] 0
(42.0, 50.0] 209
Length: 215, dtype: int64

An old way without pd.cut, using a for loop and some masks.
import pandas as pd

max_age = df["age"].max()
interval_length = 8
values = []
for min_age in range(max_age - interval_length + 1):
    upper_age = min_age + interval_length
    awards = df.query("@min_age <= age <= @upper_age").loc[:, "awards"].sum()
    values.append([min_age, upper_age, awards])
df_out = pd.DataFrame(values, columns=["min_age", "max_age", "awards"])
Let me know if this is what you want :)

Let df be a DataFrame:
import pandas as pd
import random
def r(b, e):
    return [random.randint(b, e) for _ in range(300)]
df = pd.DataFrame({'id': r(1, 3), 'awards': r(0, 400), 'age': r(1, 99)})
For binning by age, I would advise creating a new column since it is clearer (and faster):
df['bin'] = df['age'].apply(lambda x: x // 9)
print(df)
The number of awards per id per bin can be obtained using simply:
totals_separate = df.groupby(['id', 'bin'])['awards'].sum()
print(totals_separate)
If I understand correctly, you would like the sum for each window of size 9 rows:
totals_rolling = df.groupby(['id', 'bin'])['awards'].rolling(9, min_periods=1).sum()
print(totals_rolling)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html

How do I create a while loop for this df that has moving average in every stage? [duplicate]

This question already has an answer here: For loop that adds and deducts from pandas columns (1 answer). Closed 1 year ago.
So I want to spread the shipments per ID in the group one by one by looking at avg sales to determine who to give it to.
Here's my dataframe:
ID STOREID BAL SALES SHIP
1 STR1 50 5 18
1 STR2 6 7 18
1 STR3 74 4 18
2 STR1 35 3 500
2 STR2 5 4 500
2 STR3 54 7 500
While SHIP (grouped by ID) is greater than 0: calculate AVG (BAL/SALES), and give the row with the lowest AVG in each group +1 to its BAL column and +1 to its Final column. Then repeat the process until SHIP is 0. The AVG is different at every stage, which is why I wanted it to be a while loop.
Sample output of the first round is below. So do this until SHIP is 0 and the sum of Final per ID equals SHIP:
ID STOREID BAL SALES SHIP AVG Final
1 STR1 50 5 18 10 0
1 STR2 6 4 18 1.5 1
1 STR3 8 4 18 2 0
2 STR1 35 3 500 11.67 0
2 STR2 5 4 500 1.25 1
2 STR3 54 7 500 7.71 0
I've tried a couple of ways in SQL, I thought it would be better to do it in python but I haven't been doing a great job with my loop. Here's what I tried so far:
df['AVG'] = 0
df['FINAL'] = 0
for i in df.groupby(["ID"])['SHIP']:
    if i > 0:
        df['AVG'] = df['BAL'] / df['SALES']
        df['SHIP'] = df.groupby(["ID"])['SHIP'] - 1
        total = df.groupby(["ID"])["FINAL"].transform("cumsum")
        df['FINAL'] = + 1
        df['A'] = + 1
    else:
        df['FINAL'] = 0
This was challenging because more than one row in the group can have the same average calculation. then it throws off the allocation.
This works on the example dataframe, if I understood you correctly.
d = {'ID': [1, 1, 1, 2,2,2], 'STOREID': ['str1', 'str2', 'str3','str1', 'str2', 'str3'],'BAL':[50, 6, 74, 35,5,54], 'SALES': [5, 7, 4, 3,4,7], 'SHIP': [18, 18, 18, 500,500,500]}
df = pd.DataFrame(data=d)
df['AVG'] = 0
df['FINAL'] = 0
def calc_something(x):
    # print(x.iloc[0]['SHIP'])
    for i in range(x.iloc[0]['SHIP'])[0:500]:
        x['AVG'] = x['BAL'] / x['SALES']
        x['SHIP'] = x['SHIP'] - 1
        x = x.sort_values('AVG').reset_index(drop=True)
        # print(x.iloc[0, 2])
        x.iloc[0, 2] = x['BAL'][0] + 1
        x.iloc[0, 6] = x['FINAL'][0] + 1
    return x
df_final = df.groupby('ID').apply(calc_something).reset_index(drop=True).sort_values(['ID', 'STOREID'])
df_final
ID STOREID BAL SALES SHIP AVG FINAL
1 1 STR1 50 5 0 10.000 0
0 1 STR2 24 7 0 3.286 18
2 1 STR3 74 4 0 18.500 0
4 2 STR1 127 3 0 42.333 92
5 2 STR2 170 4 0 42.500 165
3 2 STR3 297 7 0 42.286 243
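For reference, the per-unit loop described in the question can also be sketched with idxmin, handing out one SHIP unit at a time (an illustration for the first ID only, assuming ties in AVG are broken by row order):

```python
import pandas as pd

# Hand one SHIP unit per pass to the row with the lowest BAL/SALES ratio,
# recomputing the ratio after each increment (first ID of the example only).
d = {'ID': [1, 1, 1], 'STOREID': ['STR1', 'STR2', 'STR3'],
     'BAL': [50, 6, 74], 'SALES': [5, 7, 4], 'SHIP': [18, 18, 18]}
df = pd.DataFrame(d)
df['Final'] = 0

for gid, grp in df.groupby('ID'):
    for _ in range(grp['SHIP'].iat[0]):          # one unit of SHIP per pass
        rows = df['ID'] == gid
        avg = df.loc[rows, 'BAL'] / df.loc[rows, 'SALES']
        winner = avg.idxmin()                    # lowest current AVG wins
        df.loc[winner, 'BAL'] += 1
        df.loc[winner, 'Final'] += 1
```

On this group all 18 units end up at STR2 (its ratio never catches up to the others), which matches the STR2 row in the result above.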

Series squeeze() not returning similar outputs

import pandas as pd
year = '2018'
idNumber = '450007'
df = pd.read_csv('FARS_data/FARS' + year + 'NationalCSV/ACCIDENT.csv')
df = df.astype(str)
print (df.loc[df['ST_CASE'] == idNumber].squeeze(axis='index'))
df = pd.read_csv('FARS_data/FARS' + year + 'NationalCSV/PERSON.csv')
df = df.astype(str)
rows = df.loc[df['ST_CASE'] == idNumber]
for i in range(len(rows.index)):
    print (rows.iloc[i].to_frame().transpose().squeeze(axis='index'))
The type of all the items printed is <class 'pandas.core.series.Series'>.
The reason I have a for loop for the second csv is that rows contains two rows, while the first query returns only one, so I had to make the DataFrame produce two Series.
My output is below: (comments added by me)
#First print
STATE 45
ST_CASE 450007
VE_TOTAL 2
VE_FORMS 2
PVH_INVL 0
PEDS 0
PERNOTMVIT 0
PERMVIT 2
PERSONS 2
COUNTY 15
CITY 1072
DAY 3
MONTH 1
YEAR 2018
DAY_WEEK 4
HOUR 9
MINUTE 28
NHS 0
RUR_URB 2
FUNC_SYS 4
RD_OWNER 1
ROUTE 3
TWAY_ID SR-136
TWAY_ID2 nan
MILEPT 0
LATITUDE 32.15663889
LONGITUD -79.1526
SP_JUR 0
HARM_EV 12
MAN_COLL 6
RELJCT1 0
RELJCT2 1
TYP_INT 1
WRK_ZONE 0
REL_ROAD 1
LGT_COND 1
WEATHER1 3
WEATHER2 0
WEATHER 3
SCH_BUS 0
RAIL 0000000
NOT_HOUR 99
NOT_MIN 99
ARR_HOUR 99
ARR_MIN 99
HOSP_HR 99
HOSP_MN 99
CF1 0
CF2 0
CF3 0
FATALS 1
DRUNK_DR 0
Name: 25834, dtype: object
#Second print
STATE 45
ST_CASE 450007
VE_FORMS 2
VEH_NO 1
PER_NO 1
...
P_SF3 0
WORK_INJ 0
HISPANIC 7
RACE 1
LOCATION 0
Name: 64374, Length: 62, dtype: object
#Third print
STATE 45
ST_CASE 450007
VE_FORMS 2
VEH_NO 2
PER_NO 1
...
P_SF3 0
WORK_INJ 8
HISPANIC 0
RACE 0
LOCATION 0
Name: 64375, Length: 62, dtype: object
What I want to happen is for the second two prints to look like the first one - all pretty and formatted, showing all the columns.
The default pd.options.display.max_rows is 60, while the Series built from PERSON.csv has 62 rows (one per column), so it wouldn't display in full. Setting pd.options.display.max_rows = 999 before running any of the code allowed everything to show.
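A minimal sketch of that setting (999 is arbitrary; any value above the Series length works):

```python
import pandas as pd

# Raise the row-display limit so a 62-element Series prints in full
# instead of being truncated around the 60-row default.
pd.set_option('display.max_rows', 999)

s = pd.Series(range(62))
print(s)  # all 62 rows are shown, with no "..." truncation
```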

How can I replace age binning feature into numerical data?

I have created an agebin column from the age column. I have a range of ages, but how can I convert agebin into a numerical data type? I want to check whether agebin is an important feature or not.
I tried following code for age binning:
traindata = data.assign(age_bins = pd.cut(data.age, 4, retbins=False, include_lowest=True))
data['agebin'] = traindata['age_bins']
data['agebin'].unique()
[[16.954, 28.5], (28.5, 40], (40, 51.5], (51.5, 63]]
Categories (4, object): [[16.954, 28.5] < (28.5, 40] < (40, 51.5] < (51.5, 63]]
What I tried :
data['enc_agebin'] = data.agebin.map({[16.954, 28.5]:1,(28.5, 40]:2,(40, 51.5]:3,(51.5, 63]:4})
I tried to map each range and convert it to numerical, but I am getting a syntax error. Please suggest some good technique for converting agebin, which is categorical, to numerical data.
I think you need the labels parameter in cut:
data = pd.DataFrame({'age':[10,20,40,50,44,56,12,34,56]})
data['agebin'] = pd.cut(data.age,bins=4,labels=range(1, 5), retbins=False,include_lowest=True)
print (data)
age agebin
0 10 1
1 20 1
2 40 3
3 50 4
4 44 3
5 56 4
6 12 1
7 34 3
8 56 4
Or use labels=False, then the first bin is 0 and the last is 3 (like range(4)):
data['agebin'] = pd.cut(data.age, bins=4, labels=False, retbins=False, include_lowest=True)
print (data)
age agebin
0 10 0
1 20 0
2 40 2
3 50 3
4 44 2
5 56 3
6 12 0
7 34 2
8 56 3
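One more option, not shown in the answer above: keep the interval labels for readability and derive integer codes afterwards with .cat.codes, which gives the same 0-based encoding as labels=False:

```python
import pandas as pd

# Cut with the default interval labels, then encode the categories as ints.
data = pd.DataFrame({'age': [10, 20, 40, 50, 44, 56, 12, 34, 56]})
data['agebin'] = pd.cut(data.age, bins=4, include_lowest=True)
data['enc_agebin'] = data['agebin'].cat.codes  # 0..3, ordered by bin
```

This keeps the human-readable intervals in agebin while enc_agebin can be fed to a model or feature-importance check.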

How to combine boolean indexer with multi-index in pandas?

I have a multi-indexed dataframe and I wish to extract a subset based on index values and on a boolean criterion. I wish to overwrite the values of a specific column, using multi-index keys and boolean indexers to select the records to modify.
import pandas as pd
import numpy as np
years = [1994,1995,1996]
householdIDs = [ id for id in range(1,100) ]
midx = pd.MultiIndex.from_product( [years, householdIDs], names = ['Year', 'HouseholdID'] )
householdIncomes = np.random.randint( 10000,100000, size = len(years)*len(householdIDs) )
householdSize = np.random.randint( 1,5, size = len(years)*len(householdIDs) )
df = pd.DataFrame( {'HouseholdIncome':householdIncomes, 'HouseholdSize':householdSize}, index = midx )
df.sort_index(inplace = True)
Here's what the sample data looks like...
df.head()
=> HouseholdIncome HouseholdSize
Year HouseholdID
1994 1 23866 3
2 57956 3
3 21644 3
4 71912 4
5 83663 3
I'm able to successfully query the dataframe using the indices and column labels.
This example gives me the HouseholdSize for household 3 in year 1996
df.loc[(1996, 3), 'HouseholdSize']
=> 1
However, I'm unable to combine boolean selection with multi-index queries...
The pandas docs on Multi-indexing says there is a way to combine boolean indexing with multi-indexing and gives an example...
In [52]: idx = pd.IndexSlice
In [56]: mask = dfmi[('a','foo')]>200
In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
...which I can't seem to replicate on my dataframe
idx = pd.IndexSlice
housholdSizeAbove2 = ( df.HouseholdSize > 2 )
df.loc[ idx[ housholdSizeAbove2, 1996, :] , 'HouseholdSize' ]
Traceback (most recent call last):
File "python", line 1, in <module>
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'
In this example I would want to see all the households in 1996 with householdsize above 2
Pandas.query() should work in this case:
df.query("Year == 1996 and HouseholdID > 2")
Demo:
In [326]: with pd.option_context('display.max_rows',20):
...: print(df.query("Year == 1996 and HouseholdID > 2"))
...:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 4
4 11057 1
5 36321 2
6 89469 4
7 35711 2
8 85741 1
9 34758 3
10 56085 2
11 32275 4
12 77096 4
... ... ...
90 40276 4
91 10594 2
92 61080 4
93 65334 2
94 21477 4
95 83112 4
96 25627 2
97 24830 4
98 85693 1
99 84653 4
[97 rows x 2 columns]
UPDATE:
Is there a way to select a specific column?
In [333]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdIncome']
Out[333]:
Year HouseholdID
1996 3 28664
4 11057
5 36321
6 89469
7 35711
8 85741
9 34758
10 56085
11 32275
12 77096
...
90 40276
91 10594
92 61080
93 65334
94 21477
95 83112
96 25627
97 24830
98 85693
99 84653
Name: HouseholdIncome, dtype: int32
and ultimately I want to overwrite the data on the dataframe.
In [331]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdSize'] *= 10
In [332]: df.loc[df.eval("Year == 1996 and HouseholdID > 2")]
Out[332]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 40
4 11057 10
5 36321 20
6 89469 40
7 35711 20
8 85741 10
9 34758 30
10 56085 20
11 32275 40
12 77096 40
... ... ...
90 40276 40
91 10594 20
92 61080 40
93 65334 20
94 21477 40
95 83112 40
96 25627 20
97 24830 40
98 85693 10
99 84653 40
[97 rows x 2 columns]
UPDATE2:
I want to pass a variable year instead of a specific value. Is there a cleaner way to do it than "Year == " + str(year) + " and HouseholdID > " + str(householdSize)?
In [5]: year = 1996
In [6]: household_ids = [1, 2, 98, 99]
In [7]: df.loc[df.eval("Year == @year and HouseholdID in @household_ids")]
Out[7]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 1 42217 1
2 66009 3
98 33121 4
99 45489 3
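Since the original goal was households in 1996 with HouseholdSize above 2 (a column condition rather than an ID condition), a plain boolean mask built from get_level_values is another option; a sketch on synthetic data shaped like the question's:

```python
import pandas as pd
import numpy as np

# Rebuild a dataframe shaped like the question's (random values, seeded).
years = [1994, 1995, 1996]
ids = range(1, 100)
midx = pd.MultiIndex.from_product([years, ids], names=['Year', 'HouseholdID'])
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'HouseholdIncome': rng.integers(10000, 100000, len(midx)),
    'HouseholdSize': rng.integers(1, 5, len(midx)),
}, index=midx)

# Combine an index-level condition with an ordinary column condition.
mask = (df.index.get_level_values('Year') == 1996) & (df['HouseholdSize'] > 2)
subset = df.loc[mask, 'HouseholdSize']
```

The same mask works for assignment, e.g. df.loc[mask, 'HouseholdSize'] *= 10, without needing query or eval strings.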

In Pandas, how to operate between columns with max performance

I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid' I look for the 'usersidid' with the highest 'LoginDaysSum', and I check if there is a usersidid which has the highest LoginDaysSum in more than one clienthostid (for instance, usersidid = 9 is the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 accordingly).
In this case, I want to choose the highest LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in example, it would be rows index 7 and 4).
If the ratio is below 0.8 than I want to drop the row:
index 4- LoginDaysSumLast7Days_ratio = 7/32 < 0.8 //row will drop!
index 7- LoginDaysSumLast7Days_ratio = 7/3 > 0.8 //row will stay!
The same condition also will be applied of LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only couldn't I make it work right, it also ran way too slow :(
My code so far (forgive me if it's written terribly wrong, I only started working for the first time with SQL, Pandas and Python last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
