I've faced a problem while applying sort_values() and cumsum() within a group.
I have a dataset (constructed in the code below).
Basically, I need to sort values within a group, get cumulative sales, and select the rows that make up 90% of sales: first compute cumulative sales within each region, and then keep the rows covering 90% of sales in each region.
I have tried the following, but the last line doesn't work. It returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method
I've tried apply as well.
import pandas as pd
df = pd.DataFrame({'id':['id_1',
'id_2','id_3','id_4','id_5','id_6','id_7','id_8', 'id_1',
'id_2','id_3','id_4','id_5','id_6','id_7','id_8'],
'region':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,],
'sales':[54,34,23,56,78,98,76,34,27,89,76,54,34,45,56,54]})
df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()  # <- this line raises the error
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region', 'sales'], ascending=[True, False], inplace=True)
df['%'] = df['sales'] / df.groupby(df['region'])['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
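As a side note, the same result can be produced without modifying df in place, written as a single chain (a rough sketch reusing the columns above; pct stands in for the % column, since assign() keywords must be valid Python identifiers):
out = (
    df.sort_values(['region', 'sales'], ascending=[True, False])
      .assign(pct=lambda d: d['sales'] / d.groupby('region')['sales'].transform('sum'))
      .assign(cummul=lambda d: d.groupby('region')['pct'].cumsum())
      .loc[lambda d: d['cummul'].le(0.9)]
)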
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
Output:
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
    df
    .sort_values(by='sales', ascending=False)
    .groupby('region')
    .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
    .reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
This works because getting all values greater than 10% is essentially the same as getting the top 90%.
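For completeness, since the error message in the question points at apply, a per-group version along those lines could look roughly like this (a sketch, reusing the df and the 90% cutoff from the question):
def top_90(g):
    # sort each region's rows by sales, accumulate the share of the region total, keep rows up to 90%
    g = g.sort_values('sales', ascending=False)
    share = g['sales'].cumsum() / g['sales'].sum()
    return g[share.le(0.9)]

df.groupby('region', group_keys=False).apply(top_90)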
I have a shipping records table with approx. 100K rows.
I want to calculate, for each row and each material, how many qtys were shipped in the last 30 days.
As you can see in the example below, the calculated qty depends on (material, shipping date).
I've tried to write very basic code but couldn't find a way to apply it to all rows.
df[(df['malzeme']==material) & (df['cikistarihi'] < shippingDate) & (df['cikistarihi'] >= (shippingDate-30))]['qty'].sum()
material  shippingDate  qty  shipped qtys in last 30 days
A         23.01.2019      8  0
A         28.01.2019     41  8
A         31.01.2019     66  49 (8+41)
A         20.03.2019     67  0
B         17.02.2019     53  0
B         26.02.2019     35  53
B         11.03.2019      4  88 (53+35)
B         20.03.2019     67  106 (35+4+67)
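For reference, the sample above can be rebuilt as a DataFrame along these lines (a sketch; column names taken from the table):
import pandas as pd

df = pd.DataFrame({
    "material": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "shippingDate": ["23.01.2019", "28.01.2019", "31.01.2019", "20.03.2019",
                     "17.02.2019", "26.02.2019", "11.03.2019", "20.03.2019"],
    "qty": [8, 41, 66, 67, 53, 35, 4, 67],
})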
You can use .groupby with .rolling:
import pandas as pd

# convert shippingDate to datetime:
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)
# sort the values (if they aren't already)
df = df.sort_values(["material", "shippingDate"])
df["shipped qtys in last 30 days"] = (
    df.groupby("material")
    # closed="left" keeps the current row itself out of the 30-day window
    .rolling("30D", on="shippingDate", closed="left")["qty"]
    .sum()
    .fillna(0)
    .values
)
print(df)
Prints:
material shippingDate qty shipped qtys in last 30 days
0 A 2019-01-23 8 0.0
1 A 2019-01-28 41 8.0
2 A 2019-01-31 66 49.0
3 A 2019-03-20 67 0.0
4 B 2019-02-17 53 0.0
5 B 2019-02-26 35 53.0
6 B 2019-03-11 4 88.0
7 B 2019-03-20 67 39.0
EDIT: Add .sort_values() before groupby
I have the following df. I want to dynamically create new columns based on the number of dates (day_number=2), and conditionally fill them based on "code" and "count".
Current format:
code count
id date
ABC1 2019-04-04 1 76
2019-04-05 2 82
Desired matrix-like format:
code count code1_day1 code2_day1 code1_day2 code2_day2
id date
ABC1 2019-04-04 1 76 76 0 0 82
2019-04-05 2 82
I have done this but it fills the same for every column:
code = [1, 2]
for date, new in df.groupby(level=[0]):
    for col in range(day_number):  # day_number=2
        for lvl in code:
            new[f"day{col+1}_code1"] = new['count'].where(new['code'] == 1)
            new[f"day{col+1}_code2"] = new['count'].where(new['code'] == 2)
So many thanks for your help!
A bigger example of the database:
code count new-col1 new_col2 ......
id date
ABC1
2019-04-04 1 76 76 0 79 0 82 0 83 0 88 0 55 3 65 6
2019-04-05 1 79 79 0 82 0 83 0 88 0 55 3 65 6 101 10
2019-04-06 1 82 82 0 83 0 88 0 55 3 65 6 101 10 120 14
2019-04-07 2 83 83 0 88 0 55 3 65 6 101 10 120 14 0 0
2019-04-08 1 88 88 0 55 3 65 6 101 10 120 14 0 0 0 0
2019-04-09 1 55 55 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-09 2 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-10 1 65 101 10 120 14 0 0 0 0 10 0
2019-04-10 2 6 120 14 0 0 0 0 10 0
2019-04-11 1 101 0 0 0 0 10 0
Your sample data is not very usable, so I've simulated some.
Considered differently, the data is grouped, hence groupby() on ID (which is in the index) and code.
apply() after a groupby() gets passed a dataframe, so the required columns are built on that dataframe.
import numpy as np
import pandas as pd

d = pd.date_range("01-jan-2021", "03-jan-2021")
df = pd.concat([
    pd.DataFrame({"ID": "ABC1", "date": d, "code": 1, "count": np.random.randint(20, 50, len(d))}),
    pd.DataFrame({"ID": "ABC1", "date": d, "code": 2, "count": np.random.randint(20, 50, len(d))})
]).sort_values(["ID", "date", "code"], ascending=[True, False, True]).set_index(["ID", "date"])

# pad an array with NaN to the same length as the second iterable
def nppad(a, s):
    return np.pad(a.astype(float), (0, len(s) - len(a)), "constant", constant_values=np.nan)

df2 = df.groupby(["ID", "code"]).apply(lambda dfa: dfa.assign(**{
    f"code{dfa.iloc[0, 0]}_day{i+1}": nppad(dfa["count"].values[i:], dfa)
    for i in range(len(dfa))
}))
Output:
code count code1_day1 code1_day2 code1_day3 code2_day1 code2_day2 code2_day3
ID date
ABC1 2021-01-03 1 40 40.0 38.0 46.0 NaN NaN NaN
2021-01-03 2 37 NaN NaN NaN 37.0 33.0 33.0
2021-01-02 1 38 38.0 46.0 NaN NaN NaN NaN
2021-01-02 2 33 NaN NaN NaN 33.0 33.0 NaN
2021-01-01 1 46 46.0 NaN NaN NaN NaN NaN
2021-01-01 2 33 NaN NaN NaN 33.0 NaN NaN
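If zeros are preferred over NaN, as in the desired matrix in the question, one possible final touch is:
df2 = df2.fillna(0)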
Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want to do is even possible. I want to compare every three rows of Column1, take the highest 2 of those three rows, and assign the corresponding 2 Column2 values to a new column. It does not matter whether the values in Column3 are contiguous or ordered, since I know every 2 rows of Column3 belong to every 3 rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array which groups every 3 values.
Then use GroupBy.nlargest and extract the indices of those values using pd.Index.get_level_values, then assign them to Column3; pandas handles the index alignment.
import numpy as np

n_grps = len(df) // 3
g = np.repeat(np.arange(n_grps), 3)
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get output like you mentioned in the question, with the NaNs pushed to the end of each group, you have to perform this additional step:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x: x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
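As a small variation, if the frame keeps its default RangeIndex, the grouping array can also come from the index itself (an equivalent sketch, no numpy helpers needed):
g = df.index // 3
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
df['Column3'] = df.loc[idx, 'Column2']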
I am working with a pandas df that looks like this:
ID time
34 43
2 99
2 20
34 8
2 90
What would be the best approach to create a variable that represents the difference from the most recent time per ID?
ID time diff
34 43 35
2 99 9
2 20 NA
34 8 NA
2 90 70
Here's one possibility:
df["diff"] = df.sort_values("time").groupby("ID")["time"].diff()
df
ID time diff
0 34 43 35.0
1 2 99 9.0
2 2 20 NaN
3 34 8 NaN
4 2 90 70.0
Assuming I have a Pandas dataframe similar to the below, how would I get the rolling correlation (for 2 days in this example) between 2 specific columns and group by the 'ID' column? I am familiar with the Pandas rolling_corr() function but I cannot figure out how to combine that with the groupby() clause.
What I have:
ID Date Val1 Val2
A 1-Jan 45 22
A 2-Jan 15 66
A 3-Jan 55 13
B 1-Jan 41 12
B 2-Jan 87 45
B 3-Jan 82 66
C 1-Jan 33 34
C 2-Jan 15 67
C 3-Jan 46 22
What I need:
ID Date Val1 Val2 Rolling_Corr
A 1-Jan 45 22
A 2-Jan 15 66 0.1
A 3-Jan 55 13 0.16
B 1-Jan 41 12
B 2-Jan 87 45 0.15
B 3-Jan 82 66 0.05
C 1-Jan 33 34
C 2-Jan 15 67 0.09
C 3-Jan 46 22 0.11
Thanks!
You can actually start with the simple approach here:
Pandas Correlation Groupby
and then add rolling(3) like this:
df.groupby('ID')[['Val1','Val2']].rolling(3).corr()
I've changed the window from 2 to 3 because you'll only get 1 or -1 with a window size of 2. Unfortunately, that output (not shown) is a bit verbose because it outputs a 2x2 correlation matrix when all you need is a scalar.
But with an additional line you can make the output more concise:
df2 = df.groupby('ID')[['Val1','Val2']].rolling(3).corr()
df2.groupby(level=[0,1]).last()['Val1']
ID
A 0 NaN
1 NaN
2 -0.996539
B 3 NaN
4 NaN
5 0.879868
C 6 NaN
7 NaN
8 -0.985529
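If you want this as a Rolling_Corr column like in the question, one option (a sketch, assuming df keeps its default integer index as above) is to drop the ID level and let pandas align on the remaining original index:
corr = df2.groupby(level=[0, 1]).last()['Val1']
df['Rolling_Corr'] = corr.droplevel(0)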