Pandas Q-cut: Binning Data Using an Expanding Window Approach

This question is somewhat similar to a 2018 question I have found on an identical topic.
I am hoping that if I ask it in a simpler way, someone will be able to figure out a simple fix to the issue that I am currently facing:
I have a timeseries dataframe named "df", which is roughly structured as follows:
V_1 V_2 V_3 V_4
1/1/2000 17 77 15 88
1/2/2000 85 78 6 59
1/3/2000 31 9 49 16
1/4/2000 81 55 28 33
1/5/2000 8 82 82 4
1/6/2000 89 87 57 62
1/7/2000 50 60 54 49
1/8/2000 65 84 29 26
1/9/2000 12 57 53 84
1/10/2000 6 27 70 56
1/11/2000 61 6 38 38
1/12/2000 22 8 82 58
1/13/2000 17 86 65 42
1/14/2000 9 27 42 86
1/15/2000 63 78 18 35
1/16/2000 73 13 51 61
1/17/2000 70 64 75 83
If I wanted to use all the columns to produce daily quantiles, I would follow this approach:
quantiles = df.apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
The output looks like this:
V_1 V_2 V_3 V_4
2000-01-01 1 3 0 4
2000-01-02 4 3 0 3
2000-01-03 2 0 2 0
2000-01-04 4 1 0 0
2000-01-05 0 4 4 0
2000-01-06 4 4 3 3
2000-01-07 2 2 3 2
2000-01-08 3 4 1 0
2000-01-09 0 2 2 4
2000-01-10 0 1 4 2
2000-01-11 2 0 1 1
2000-01-12 1 0 4 2
2000-01-13 1 4 3 1
2000-01-14 0 1 1 4
2000-01-15 3 3 0 1
2000-01-16 4 0 2 3
2000-01-17 3 2 4 4
What I want to do:
I would like to produce quantiles of the data in "df" using only observations that occurred at or before a specific point in time; I do not want to include observations that occurred after that point.
For instance:
To calculate the bins for the 2nd of January 2000, I would like to use only observations from the 1st and 2nd of January 2000, and nothing after those dates;
To calculate the bins for the 3rd of January 2000, I would like to use only observations from the 1st, 2nd and 3rd of January 2000, and nothing after those dates;
To calculate the bins for the 4th of January 2000, I would like to use only observations from the 1st, 2nd, 3rd and 4th of January 2000, and nothing after those dates;
To calculate the bins for the 5th of January 2000, I would like to use only observations from the 1st, 2nd, 3rd, 4th and 5th of January 2000, and nothing after those dates;
In other words, I would like to use this approach to calculate the bins for ALL the datapoints in "df", from the 1st of January 2000 to the 17th of January 2000.
In short, what I want to do is an expanding-window qcut (if there is any such thing). This helps to avoid "look-ahead" bias when dealing with time series data.
This code block below is wrong, but it illustrates exactly what I am trying to accomplish:
quantiles = df.expanding().apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
Does anyone have any ideas on how to do this in a simpler fashion?

I am new, so take this with a grain of salt, but when broken down I believe your question is a duplicate: it only requires simple datetime index slicing, as answered HERE.
lt_jan_5 = df.loc[:'2000-01-05'].apply(lambda x: pd.qcut(x, 5, duplicates='drop', labels=False), axis=0)
print(lt_jan_5)
V_1 V_2 V_3 V_4
2000-01-01 1 2 1 4
2000-01-02 4 3 0 3
2000-01-03 2 0 3 1
2000-01-04 3 1 2 2
2000-01-05 0 4 4 0
Hope this is helpful
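For completeness, the same slicing idea can be looped over every date to get the full expanding-window version the question asks for. This is only a minimal sketch (the function name expanding_qcut and the min_obs guard for very short windows are my own additions): for each date it bins the data available up to and including that date, then keeps the bin labels assigned to that date's row.

import pandas as pd

def expanding_qcut(frame, q=5, min_obs=2):
    # For each date, qcut only the observations up to and including that date,
    # then keep the bin labels of the last (current) row.
    rows = []
    for date in frame.index:
        window = frame.loc[:date]
        if len(window) < min_obs:
            # too few observations to form meaningful bins
            rows.append(pd.Series(float('nan'), index=frame.columns, name=date))
            continue
        binned = window.apply(
            lambda col: pd.qcut(col, q, duplicates='drop', labels=False))
        rows.append(binned.iloc[-1])
    return pd.DataFrame(rows)

quantiles = expanding_qcut(df)  # one row of bin labels per date, no look-ahead bias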

Related

Cumulative sum sorted descending within a group. Pandas

I've faced a problem while applying sort_values() and cumsum() within a group.
I have a dataset (constructed in the code below).
Basically, I need to sort the values within each group, get the cumulative sales, and then select the rows that make up 90% of sales within each region.
I have tried the following, but the last line doesn't work. It returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method.
I've tried apply as well.
import pandas as pd

df = pd.DataFrame({
    'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8',
           'id_1', 'id_2', 'id_3', 'id_4', 'id_5', 'id_6', 'id_7', 'id_8'],
    'region': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],
    'sales': [54, 34, 23, 56, 78, 98, 76, 34, 27, 89, 76, 54, 34, 45, 56, 54]})

df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()  # this line raises the error above
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region','sales'], ascending=[True,False],inplace=True)
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
Output:
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
    df
    .sort_values(by='sales', ascending=False)
    .groupby('region')
    .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
    .reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
This works because keeping all values above the 10th percentile is essentially the same as keeping the top 90% of values.
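For example, for region 1 the sorted sales are [23, 34, 34, 54, 56, 76, 78, 98]; with linear interpolation the 10th percentile is 23 + 0.7 * (34 - 23) = 30.7, so every row except the 23 passes the filter, i.e. the top 90% of rows.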

Find average of every column in a dataframe, grouped by column, excluding one value

I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to groupby and find the average for each label. So far I have
dataset.groupby('Label')['CPU', 'Memory', 'Disk'].mean() which works just fine and gives the results as follows:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label=-1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on filtered result
dataset.groupby('Label')['CPU', 'Memory', 'Disk'].mean()
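As a small aside (not part of the original answer): on recent pandas versions the column selection after groupby has to be passed as a list rather than a bare sequence of labels, and the filter can also be written inline:

# Exclude label -1 and average the remaining columns per label
dataset[dataset['Label'] != -1].groupby('Label')[['CPU', 'Memory', 'Disk']].mean()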

Is it possible to use target row to initialize a Spark UDAF?

I have a problem that I am trying to solve in Spark by defining my own UDAF by trying to mimic the recommendations given here and here. My eventual goal is to apply a series of complex bit-shifting and bit-wise boolean manipulations to a sequence of integers within a given window.
I am having issues since my use case is on a fairly large dataset (~100 million rows, for which I need to perform 6 such bit-wise manipulations on groups ranging from 2-7 elements long), and am therefore trying to implement this in Scala. The problem is that I'm brand new to Scala (my primary language being Python), and while Scala itself doesn't seem that difficult, the combination of a new language plus the specifics of the UDAF class itself as applied to windows is leaving me a little stumped.
Explaining the logic by example in python/pandas
To make the question more concrete, consider a pandas DataFrame:
import numpy as np
import pandas as pd

keep = list(range(30))
for num in (3, 5, 11, 16, 22, 24):
    keep.pop(num)

np.random.seed(100)
df = pd.DataFrame({
    'id': 'A',
    'date': pd.date_range('2018-06-01', '2018-06-30')[keep],
    'num': np.random.randint(low=1, high=100, size=30)[keep]
})
Which produces:
id date num
0 A 2018-06-01 9
1 A 2018-06-02 25
2 A 2018-06-03 68
3 A 2018-06-05 80
4 A 2018-06-06 49
5 A 2018-06-08 95
6 A 2018-06-09 53
7 A 2018-06-10 99
8 A 2018-06-11 54
9 A 2018-06-12 67
10 A 2018-06-13 99
11 A 2018-06-15 35
12 A 2018-06-16 25
13 A 2018-06-17 16
14 A 2018-06-18 61
15 A 2018-06-19 59
16 A 2018-06-21 10
17 A 2018-06-22 94
18 A 2018-06-23 87
19 A 2018-06-24 3
20 A 2018-06-25 28
21 A 2018-06-26 5
22 A 2018-06-28 2
23 A 2018-06-29 14
What I would like to be able to do is, relative to the current row, find the number of days, then perform some bit-wise manipulations based on that value. To demonstrate, staying in pandas (I will have to do a full outer join then filter to demonstrate):
exp_df = (
    df[['id', 'date']].merge(df, on='id')                               # full outer join on 'id'
    .assign(days_diff=lambda d: (d['date_y'] - d['date_x']).dt.days)    # number of days since my date of interest
    .mask(lambda d: (d['days_diff'] > 3) | (d['days_diff'] < 0))        # nulls rows where days_diff isn't between 0 and 3
    .dropna()                                                           # then filters the rows
    .drop('date_y', axis='columns')
    .rename({'date_x': 'date', 'num': 'nums'}, axis='columns')
    .reset_index(drop=True)
)
exp_df[['nums', 'days_diff']] = exp_df[['nums', 'days_diff']].astype('int')
Now I perform my bit-wise shifting and other logic:
# Extra values to add after bit-wise shifting (1 for a shift of 1, 3 for a shift of 2, ...)
additions = {val: sum(2**power for power in range(val))
             for val in exp_df['days_diff'].unique()}
exp_df['shifted'] = (np.left_shift(exp_df['nums'].values, exp_df['days_diff'].values)
                     + exp_df['days_diff'].apply(lambda val: additions[val]))
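(For example, in the second row of exp_df, nums = 25 and days_diff = 1, so shifted = (25 << 1) + 1 = 51, as shown in the table below.)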
After all this, exp_df looks like the following (first 10 rows):
id date nums days_diff shifted
0 A 2018-06-01 9 0 9
1 A 2018-06-01 25 1 51
2 A 2018-06-01 68 2 275
3 A 2018-06-02 25 0 25
4 A 2018-06-02 68 1 137
5 A 2018-06-02 80 3 647
6 A 2018-06-03 68 0 68
7 A 2018-06-03 80 2 323
8 A 2018-06-03 49 3 399
9 A 2018-06-05 80 0 80
Now I can aggregate:
exp_df.groupby('date')['shifted'].agg(lambda group_vals: np.bitwise_and.reduce(group_vals.values))
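(For instance, for 2018-06-01 the shifted values are 9, 51 and 275, and 9 & 51 & 275 = 1, which is the value shown for that date below.)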
And the final result looks like the following (if I join back to the original DataFrame):
id date num shifted
0 A 2018-06-01 9 1
1 A 2018-06-02 25 1
2 A 2018-06-03 68 0
3 A 2018-06-05 80 64
4 A 2018-06-06 49 33
5 A 2018-06-08 95 3
6 A 2018-06-09 53 1
7 A 2018-06-10 99 1
8 A 2018-06-11 54 6
9 A 2018-06-12 67 3
10 A 2018-06-13 99 3
11 A 2018-06-15 35 3
12 A 2018-06-16 25 1
13 A 2018-06-17 16 0
14 A 2018-06-18 61 21
15 A 2018-06-19 59 35
16 A 2018-06-21 10 8
17 A 2018-06-22 94 6
18 A 2018-06-23 87 3
19 A 2018-06-24 3 1
20 A 2018-06-25 28 0
21 A 2018-06-26 5 1
22 A 2018-06-28 2 0
23 A 2018-06-29 14 14
Back to the question
Ok, now that I've demonstrated my logic, I realize that I can essentially do the same thing in Spark - performing a full outer join of the DataFrame on itself, then filtering and aggregating.
What I want to know is if I can avoid performing a full join, and instead create my own UDAF to perform this aggregation over a window function, using the target row as an input. Basically, I need to create the equivalent of the "days_diff" column in order to perform my required logic, which means comparing the target date to each of the other dates within my specified window. Is this even possible?
Also, am I even justified in worrying about using a self-join? I know that Spark does all of its processing lazily, so it's very possible that I wouldn't need to worry. Should I expect the performance to be similar if I were to do all of this using a self-join versus my imaginary UDAF applied over a window? The logic is more sequential and easier to follow using the join-filter-aggregate method, which is a clear advantage.
One thing to know is that I will be performing this logic on multiple windows. In principle, I could cache the largest version of the filtered DataFrame after the join, then use that for subsequent calculations.
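For reference, here is a rough PySpark sketch of the join-filter part of the approach described above (the names anchor and other are illustrative, and the bit-wise aggregation step is only indicated in a comment; this is not a tested implementation):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # the pandas frame built above

anchor = sdf.select('id', F.col('date').alias('anchor_date'))
other = sdf.select('id', 'date', 'num')

pairs = (
    anchor.join(other, on='id')                                  # self-join on 'id'
          .withColumn('days_diff', F.datediff('date', 'anchor_date'))
          .where(F.col('days_diff').between(0, 3))               # keep the 0-3 day forward window
)
# The bit-shifting and bitwise-AND aggregation from the pandas example would then be
# applied per anchor_date, e.g. via F.expr("shiftleft(num, days_diff)") followed by a
# bitwise-AND aggregate or a custom aggregator / UDAF.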

Replacing dataframe values by median value of group

Apologies if this is a repeat, I didn't find a similar answer.
Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median (or whatever metric) from the entire column. My data has labels and I would like to replace each NaN value with the median value for that column from other samples belonging to that label only.
I can do this by splitting the df into one df for each label, imputing over each of those dfs, and combining, but this logic doesn't scale well. I could have up to 20 classes, and I fundamentally don't believe this is the 'right' way to do it.
I would like to do this without copying my df, by using a groupby object in a split-apply-combine technique (or another technique you think would work). I appreciate your help.
Example df:
r1 r2 r3 label
0 12 NaN 58 0
1 34 52 24 1
2 32 4 NaN 1
3 7 89 2 0
4 22 19 12 1
Here, I would like the NaN value at (0, r2) to equal the median of that column for label 0, which is the value 89 (from 3, r2).
I would like the NaN value at (2,r3) to equal the median of that column for label 1, which is median(24, 12), or 18.
Example successful result:
r1 r2 r3 label
0 12 89 58 0
1 34 52 24 1
2 32 4 18 1
3 7 89 2 0
4 22 19 12 1
In [158]: df.groupby('label', group_keys=False) \
.apply(lambda x: x.fillna(x.median()).astype(int))
Out[158]:
r1 r2 r3 label
0 12 89 58 0
3 7 89 2 0
1 34 52 24 1
2 32 4 18 1
4 22 19 12 1
or using transform:
In [149]: df[['label']].join(df.groupby('label')
.transform(lambda x: x.fillna(x.median())).astype(int))
Out[149]:
label r1 r2 r3
0 0 12 89 58
1 1 34 52 24
2 1 32 4 18
3 0 7 89 2
4 1 22 19 12
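A third option (a sketch, not from the answers above): since fillna accepts a DataFrame of replacement values aligned by index and columns, the group-wise medians can be filled in directly:

# fill each NaN with the median of its column within the same label group
df.fillna(df.groupby('label').transform('median'))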

Best approach to create time difference variable by id

I am working with a pandas df that looks like this:
ID time
34 43
2 99
2 20
34 8
2 90
What would be the best approach to create a variable that represents the difference from the most recent time per ID?
ID time diff
34 43 35
2 99 9
2 20 NA
34 8 NA
2 90 70
Here's one possibility: sort by time and take the within-group difference. Since groupby(...).diff() keeps the original index, the result aligns back to df when assigned:
df["diff"] = df.sort_values("time").groupby("ID")["time"].diff()
df
ID time diff
0 34 43 35.0
1 2 99 9.0
2 2 20 NaN
3 34 8 NaN
4 2 90 70.0
