I have a shipping records table with approximately 100K rows.
For each row, I want to calculate how much of that row's material was shipped in the previous 30 days.
As the example below shows, the calculated quantity depends on both the material and the shipping date.
I've tried writing some very basic code, but couldn't find a way to apply it to all rows.
df[(df['malzeme']==material) & (df['cikistarihi'] < shippingDate) & (df['cikistarihi'] >= (shippingDate-30))]['qty'].sum()
material  shippingDate  qty  shipped qtys in last 30 days
A         23.01.2019    8    0
A         28.01.2019    41   8
A         31.01.2019    66   49 (8+41)
A         20.03.2019    67   0
B         17.02.2019    53   0
B         26.02.2019    35   53
B         11.03.2019    4    88 (53+35)
B         20.03.2019    67   106 (35+4+67)
You can use .groupby with .rolling:
import pandas as pd
# convert shippingDate to datetime:
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)
# sort the values (if they aren't already)
df = df.sort_values(["material", "shippingDate"])
df["shipped qtys in last 30 days"] = (
    df.groupby("material")
    .rolling("30D", on="shippingDate", closed="left")["qty"]
    .sum()
    .fillna(0)
    .values
)
print(df)
Prints:
material shippingDate qty shipped qtys in last 30 days
0 A 2019-01-23 8 0.0
1 A 2019-01-28 41 8.0
2 A 2019-01-31 66 49.0
3 A 2019-03-20 67 0.0
4 B 2019-02-17 53 0.0
5 B 2019-02-26 35 53.0
6 B 2019-03-11 4 88.0
7 B 2019-03-20 67 39.0
EDIT: Add .sort_values() before groupby
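As a cross-check on small data, the per-row expression from the question can also be applied to every row with DataFrame.apply. The following is a minimal sketch, not a replacement for the rolling approach: it rebuilds the sample table from the question, the helper name shipped_last_30_days and the column name last_30d_rowwise are made up for illustration, and it scans the whole frame once per row, so it is likely to be slow on 100K rows.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "material": list("AAAABBBB"),
    "shippingDate": [
        "23.01.2019", "28.01.2019", "31.01.2019", "20.03.2019",
        "17.02.2019", "26.02.2019", "11.03.2019", "20.03.2019",
    ],
    "qty": [8, 41, 66, 67, 53, 35, 4, 67],
})
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)


def shipped_last_30_days(row):
    """Sum qty of the same material shipped in the 30 days before this row."""
    window_start = row["shippingDate"] - pd.Timedelta(days=30)
    mask = (
        (df["material"] == row["material"])
        & (df["shippingDate"] < row["shippingDate"])
        & (df["shippingDate"] >= window_start)
    )
    return df.loc[mask, "qty"].sum()


# Hypothetical column name, used only for this comparison.
df["last_30d_rowwise"] = df.apply(shipped_last_30_days, axis=1)
print(df)
```

On this sample the row-wise values agree with the rolling result printed above, including 39 for the last row.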
I have three dataframes
df1 :
Date ID Number ID2 info_df1
2021-12-11 1 34 36 60
2021-12-10 2 33 35 57
2021-12-09 3 32 34 80
2021-12-08 4 31 33 55
df2:
Date ID Number ID2 info_df2
2021-12-10 2 18 20 50
2021-12-11 1 34 36 89
2021-12-10 2 33 35 40
2021-12-09 3 32 34 92
df3:
Date ID Number ID2 info_df3
2021-12-10 2 18 20 57
2021-12-10 2 18 20 63
2021-12-11 1 34 36 52
2021-12-10 2 33 35 33
I need a dataframe with the info columns from df1, df2 and df3, and with Date, ID, Number and ID2 as the index.
The merged dataframe should consist of these columns:
Date ID Number ID2 info_df1 info_df2 info_df3
If you are trying to merge the dataframes based on Date, I think what you need is the merge function:
mergedDf = df1.merge(df2, on="Date").merge(df3, on="Date")
mergedDf.set_index("ID2", inplace=True)
But if you are trying to merge the dataframes based on multiple columns, you can pass a list of column names to the on argument:
mergedDf = df1.merge(df2, on=["Date", "ID", "ID2"]).merge(df3, on=["Date", "ID", "ID2"])
mergedDf.set_index("ID2", inplace=True)
Two steps:
first, pandas.concat(<DFs-list>) all those DFs into a df;
then, define a multi-index with df.set_index(<col-names-list>).
That will do it. Sure, you have to read some docs (here below), but those two steps should be about it.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.set_levels.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_frame.html
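The two steps above can be sketched as follows, using the first two rows of df1 and df2 from the question. One thing to be aware of: concat stacks rows, so each stacked row carries only one of the info columns and NaN in the others. The extra groupby(...).first() step is an addition that is not part of the two-step recipe; it collapses rows sharing the same key, under the assumption that each key appears at most once per source frame.

```python
import pandas as pd

keys = ["Date", "ID", "Number", "ID2"]

# First two rows of df1 and df2 from the question.
df1 = pd.DataFrame({
    "Date": ["2021-12-11", "2021-12-10"],
    "ID": [1, 2], "Number": [34, 33], "ID2": [36, 35],
    "info_df1": [60, 57],
})
df2 = pd.DataFrame({
    "Date": ["2021-12-10", "2021-12-11"],
    "ID": [2, 1], "Number": [18, 34], "ID2": [20, 36],
    "info_df2": [50, 89],
})

# Step 1 and 2: stack the frames, then build the multi-index.
stacked = pd.concat([df1, df2]).set_index(keys)

# Optional extra step (an assumption, see above): one row per key,
# taking the first non-null value of each info column.
combined = stacked.groupby(level=keys).first()
print(combined)
```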
As others have mentioned, you need to merge the dataframes together. Using functools.reduce from the standard library, we can do this dynamically (for any number of dataframes) and easily:
import functools as ft

# assumes the value column of every dataframe is named 'info'
i = 0
def func(x, y):
    global i
    i += 1
    return y.merge(x.rename({'info': f'info_df{i + 1}'}, axis=1), on=['Date', 'ID', 'Number', 'ID2'], how='outer')
dfs = [df1, df2, df3]
new_df = ft.reduce(func, dfs).rename({'info': 'info_df1'}, axis=1)
Output:
>>> new_df
Date ID Number ID2 info_df1 info_df7 info_df6
0 2021-12-10 2 18 20 57.0 50.0 NaN
1 2021-12-10 2 18 20 63.0 50.0 NaN
2 2021-12-11 1 34 36 52.0 89.0 60.0
3 2021-12-10 2 33 35 33.0 40.0 57.0
4 2021-12-09 3 32 34 NaN 92.0 80.0
5 2021-12-08 4 31 33 NaN NaN 55.0
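When the info columns already carry distinct names (info_df1, info_df2, info_df3, as in the question), no renaming or global counter is needed. The following is a minimal sketch under that assumption; it reuses the question's data and chains outer merges on the four key columns with functools.reduce, then sets them as the index.

```python
from functools import reduce

import pandas as pd

keys = ["Date", "ID", "Number", "ID2"]

df1 = pd.DataFrame({
    "Date": ["2021-12-11", "2021-12-10", "2021-12-09", "2021-12-08"],
    "ID": [1, 2, 3, 4], "Number": [34, 33, 32, 31], "ID2": [36, 35, 34, 33],
    "info_df1": [60, 57, 80, 55],
})
df2 = pd.DataFrame({
    "Date": ["2021-12-10", "2021-12-11", "2021-12-10", "2021-12-09"],
    "ID": [2, 1, 2, 3], "Number": [18, 34, 33, 32], "ID2": [20, 36, 35, 34],
    "info_df2": [50, 89, 40, 92],
})
df3 = pd.DataFrame({
    "Date": ["2021-12-10", "2021-12-10", "2021-12-11", "2021-12-10"],
    "ID": [2, 2, 1, 2], "Number": [18, 18, 34, 33], "ID2": [20, 20, 36, 35],
    "info_df3": [57, 63, 52, 33],
})

# Outer-merge the frames pairwise on the shared key columns.
merged = reduce(
    lambda left, right: left.merge(right, on=keys, how="outer"),
    [df1, df2, df3],
)
merged = merged.set_index(keys)
print(merged)
```

Because df3 contains the key (2021-12-10, 2, 18, 20) twice, the merged result has two rows for that key, which matches the six rows in the output above.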
I have the following problem and do not know how to solve it in a performant way:
Input Pandas DataFrame:
timestep  article  volume
35        1        20
37        2        5
123       2        12
155       3        10
178       2        23
234       1        17
478       1        28
Output Pandas DataFrame:
timestep  volume
35        20
37        25
123       32
178       53
234       50
478       61
Calculation Example for timestep 478:
28 (last article 1 volume) + 23 (last article 2 volume) + 10 (last article 3 volume) = 61
What is the best way to do this in pandas?
Try with ffill:
#sort if needed
df = df.sort_values("timestep")
df["volume"] = (df["volume"].where(df["article"].eq(1)).ffill().fillna(0) +
df["volume"].where(df["article"].eq(2)).ffill().fillna(0))
output = df.drop("article", axis=1)
>>> output
timestep volume
0 35 20.0
1 37 25.0
2 123 32.0
3 178 43.0
4 234 40.0
5 478 51.0
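The ffill idea above handles articles 1 and 2 explicitly. A possible generalization to any number of articles is to pivot the articles into columns, forward-fill each column, and sum across the row. This is a sketch under the assumption that each timestep appears only once (pivot raises on duplicate timestep/article pairs); it uses the full input table from the question, including the article 3 row at timestep 155.

```python
import pandas as pd

# Input table from the question.
df = pd.DataFrame({
    "timestep": [35, 37, 123, 155, 178, 234, 478],
    "article": [1, 2, 2, 3, 2, 1, 1],
    "volume": [20, 5, 12, 10, 23, 17, 28],
})

# One column per article, holding the latest known volume at each timestep.
latest = df.pivot(index="timestep", columns="article", values="volume").ffill()

# Sum the latest volumes across articles (articles not yet seen count as 0).
output = latest.sum(axis=1).astype(int).rename("volume").reset_index()
print(output)
```

For timestep 478 this gives 28 + 23 + 10 = 61, as in the question's calculation example; the row for timestep 155 (42) is also present in this result.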
Group By article & Take last element & Sum
df.groupby(['article']).tail(1)["volume"].sum()
You can assign a group number to each run of consecutive rows with the same article using .cumsum(). Then look up the last value of the previous group using .map() with GroupBy.last(). Finally, add that previous group's last value to volume, as follows:
# Get group number of consecutive `article`
g = df['article'].ne(df['article'].shift()).cumsum()
# Add `volume` to previous group last
df['volume'] += g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
Result:
print(df)
timestep article volume
0 35 1 20
1 37 2 25
2 123 2 32
3 178 2 43
4 234 1 40
5 478 1 51
Breakdown of steps
Previous group last values:
g.sub(1).map(df.groupby(g)['volume'].last()).fillna(0, downcast='infer')
0 0
1 20
2 20
3 20
4 43
5 43
Name: article, dtype: int64
Try:
df["new_volume"] = (
    df.loc[df["article"] != df["article"].shift(-1), "volume"]
    .reindex(df.index, method='ffill')
    .shift()
    + df["volume"]
).fillna(df["volume"])
df
Output:
timestep article volume new_volume
0 35 1 20 20.0
1 37 2 5 25.0
2 123 2 12 32.0
3 178 2 23 43.0
4 234 1 17 40.0
5 478 1 28 51.0
Explained:
Find the last record of each group by comparing each row's 'article' with the next row's, then reindex that series to align with the original dataframe, forward-fill it, and shift it so that each group sees the previous group's last 'volume'. Add this to the current row's 'volume', and fill the missing values in the first group with the original 'volume' value.
I've run into a problem while applying sort_values() and cumsum() within a group.
I have a dataset (created in the code below).
Basically, I need to sort the values within each group, compute the cumulative sales, and then select the rows that make up 90% of sales within each region.
I have tried the following, but the last line doesn't work. It returns an error: Cannot access callable attribute 'sort_values' of 'SeriesGroupBy' objects, try using the 'apply' method
I've also tried apply.
import pandas as pd
df = pd.DataFrame({'id':['id_1',
'id_2','id_3','id_4','id_5','id_6','id_7','id_8', 'id_1',
'id_2','id_3','id_4','id_5','id_6','id_7','id_8'],
'region':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,],
'sales':[54,34,23,56,78,98,76,34,27,89,76,54,34,45,56,54]})
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cumul'] = df.groupby(df['region'])['sales'].sort_values(ascending=False).cumsum()
Thank you for any suggestions
You can definitely sort the dataframe first, then do groupby():
df.sort_values(['region','sales'], ascending=[True,False],inplace=True)
df['%']=df['sales']/df.groupby(df['region'])['sales'].transform('sum')
df['cummul'] = df.groupby('region')['%'].cumsum()
# filter
df[df['cummul'].le(0.9)]
Output:
id region sales % cummul
5 id_6 1 98 0.216336 0.216336
4 id_5 1 78 0.172185 0.388521
6 id_7 1 76 0.167770 0.556291
3 id_4 1 56 0.123620 0.679912
0 id_1 1 54 0.119205 0.799117
1 id_2 1 34 0.075055 0.874172
9 id_2 2 89 0.204598 0.204598
10 id_3 2 76 0.174713 0.379310
14 id_7 2 56 0.128736 0.508046
11 id_4 2 54 0.124138 0.632184
15 id_8 2 54 0.124138 0.756322
13 id_6 2 45 0.103448 0.859770
First we use your logic to create the % column, but we multiply by 100 and round to whole numbers.
Then we sort by region and %, no need for groupby.
After we sort, we create the cumul column.
And finally we select those within the 90% range with query:
df['%'] = df['sales'].div(df.groupby('region')['sales'].transform('sum')).mul(100).round()
df = df.sort_values(['region', '%'], ascending=[True, False])
df['cumul'] = df.groupby('region')['%'].cumsum()
df.query('cumul.le(90)')
output
id region sales % cumul
5 id_6 1 98 22.0 22.0
4 id_5 1 78 17.0 39.0
6 id_7 1 76 17.0 56.0
0 id_1 1 54 12.0 68.0
3 id_4 1 56 12.0 80.0
1 id_2 1 34 8.0 88.0
9 id_2 2 89 20.0 20.0
10 id_3 2 76 17.0 37.0
14 id_7 2 56 13.0 50.0
11 id_4 2 54 12.0 62.0
15 id_8 2 54 12.0 74.0
13 id_6 2 45 10.0 84.0
If you just need the sales data without the percentage, this can easily be done with method chaining:
(
    df
    .sort_values(by='sales', ascending=False)
    .groupby('region')
    .apply(lambda x: x[x.sales > x.sales.quantile(.1)])
    .reset_index(level=0, drop=True)
)
Output
id region sales
5 id_6 1 98
4 id_5 1 78
6 id_7 1 76
3 id_4 1 56
0 id_1 1 54
1 id_2 1 34
7 id_8 1 34
9 id_2 2 89
10 id_3 2 76
14 id_7 2 56
11 id_4 2 54
15 id_8 2 54
13 id_6 2 45
12 id_5 2 34
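This keeps every row whose sales exceed the region's 10th percentile, which is roughly the top 90% of rows by sales. Note that this is not the same as the rows that account for 90% of total sales, which is why more rows appear here than in the cumulative-percentage answers.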
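To compare a per-region quantile filter with a cumulative-share filter on the question's data, here is a minimal sketch. The variable names (ranked, cum_share, by_cum_share, by_quantile) are illustrative, and the quantile threshold is computed with transform rather than apply, which is a substitution made here to keep the original row index.

```python
import pandas as pd

ids = [f"id_{n}" for n in range(1, 9)]
df = pd.DataFrame({
    "id": ids + ids,
    "region": [1] * 8 + [2] * 8,
    "sales": [54, 34, 23, 56, 78, 98, 76, 34,
              27, 89, 76, 54, 34, 45, 56, 54],
})

# Cumulative-share filter: rows whose running share of regional sales is <= 90%.
ranked = df.sort_values(["region", "sales"], ascending=[True, False])
share = ranked["sales"] / ranked.groupby("region")["sales"].transform("sum")
cum_share = share.groupby(ranked["region"]).cumsum()
by_cum_share = ranked[cum_share <= 0.9]

# Quantile filter: rows above the 10th percentile of sales in their region.
threshold = df.groupby("region")["sales"].transform(lambda s: s.quantile(0.1))
by_quantile = df[df["sales"] > threshold]

print(len(by_cum_share), len(by_quantile))
```

On this data the cumulative-share filter keeps 6 rows per region and the quantile filter keeps 7, so the two selections differ.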
I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to group by label and find the average for each label. So far I have this:
dataset.groupby('Label')['CPU', 'Memory', 'Disk'].mean()
which works just fine and gives the following results:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label=-1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on the filtered result (select the columns with a list; pandas 2.0+ rejects the bare-tuple form)
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
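A self-contained sketch of the filter-then-group approach, using the sample rows shown in the question (each label has a single row there, so the means equal the raw values). The variable name means is illustrative.

```python
import pandas as pd

# Sample rows from the question.
dataset = pd.DataFrame({
    "CPU": [21, 46, 48, 48, 51, 45, 50],
    "Memory": [28, 53, 45, 52, 54, 50, 83],
    "Disk": [29, 55, 49, 50, 55, 56, 44],
    "Label": [0, 1, 2, 3, 4, 5, -1],
})

# Drop the rows labeled -1, then average each metric per label.
means = (
    dataset.loc[dataset["Label"] != -1]
    .groupby("Label")[["CPU", "Memory", "Disk"]]
    .mean()
)
print(means)
```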
I have a problem that I am trying to solve in Spark by defining my own UDAF by trying to mimic the recommendations given here and here. My eventual goal is to apply a series of complex bit-shifting and bit-wise boolean manipulations to a sequence of integers within a given window.
I am having issues since my use case involves a fairly large dataset (~100 million rows, for which I need to perform 6 such bit-wise manipulations on groups ranging from 2 to 7 elements long), and I am therefore trying to implement this in Scala. The problem is that I'm brand new to Scala (my primary language being Python), and while Scala itself doesn't seem that difficult, the combination of a new language plus the specifics of the UDAF class itself as applied to windows is leaving me a little stumped.
Explaining the logic by example in python/pandas
To make the question more concrete, consider a pandas DataFrame:
import numpy as np
import pandas as pd

keep = list(range(30))
for num in (3, 5, 11, 16, 22, 24):
    keep.pop(num)
np.random.seed(100)
df = pd.DataFrame({
    'id': 'A',
    'date': pd.date_range('2018-06-01', '2018-06-30')[keep],
    'num': np.random.randint(low=1, high=100, size=30)[keep]
})
Which produces:
id date num
0 A 2018-06-01 9
1 A 2018-06-02 25
2 A 2018-06-03 68
3 A 2018-06-05 80
4 A 2018-06-06 49
5 A 2018-06-08 95
6 A 2018-06-09 53
7 A 2018-06-10 99
8 A 2018-06-11 54
9 A 2018-06-12 67
10 A 2018-06-13 99
11 A 2018-06-15 35
12 A 2018-06-16 25
13 A 2018-06-17 16
14 A 2018-06-18 61
15 A 2018-06-19 59
16 A 2018-06-21 10
17 A 2018-06-22 94
18 A 2018-06-23 87
19 A 2018-06-24 3
20 A 2018-06-25 28
21 A 2018-06-26 5
22 A 2018-06-28 2
23 A 2018-06-29 14
What I would like to be able to do is, relative to the current row, find the number of days, then perform some bit-wise manipulations based on that value. To demonstrate, staying in pandas (I will have to do a full outer join then filter to demonstrate):
exp_df = (
    df[['id', 'date']].merge(df, on='id')  # full outer join on 'id'
    .assign(days_diff=lambda df: (df['date_y'] - df['date_x']).dt.days)  # number of days since my date of interest
    .mask(lambda df: (df['days_diff'] > 3) | (df['days_diff'] < 0))  # nulls rows where days_diff isn't between 0 and 3
    .dropna()  # then filters the rows
    .drop('date_y', axis='columns')
    .rename({'date_x': 'date', 'num': 'nums'}, axis='columns')
    .reset_index(drop=True)
)
exp_df[['nums', 'days_diff']] = exp_df[['nums', 'days_diff']].astype('int')
Now I perform my bit-wise shifting and other logic:
# Extra values to add after bit-wise shifting (1 for shift of 1, 3 for shift of 2 ...)
additions = {val: sum(2**power for power in range(val)) for val in exp_df['days_diff'].unique()}
exp_df['shifted'] = np.left_shift(exp_df['nums'].values, exp_df['days_diff'].values) \
+ exp_df['days_diff'].apply(lambda val: additions[val])
After all this, exp_df looks like the following (first 10 rows):
id date nums days_diff shifted
0 A 2018-06-01 9 0 9
1 A 2018-06-01 25 1 51
2 A 2018-06-01 68 2 275
3 A 2018-06-02 25 0 25
4 A 2018-06-02 68 1 137
5 A 2018-06-02 80 3 647
6 A 2018-06-03 68 0 68
7 A 2018-06-03 80 2 323
8 A 2018-06-03 49 3 399
9 A 2018-06-05 80 0 80
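The shifted values in the table can be checked by hand: the addition for a shift of d is 2^d - 1, so each value is (nums << days_diff) + (2^days_diff - 1). The following plain-Python sketch reproduces the three rows for 2018-06-01 and the bit-wise AND applied to them in the aggregation step; the helper name shift_and_fill is made up for illustration.

```python
from functools import reduce
from operator import and_


def shift_and_fill(num: int, days_diff: int) -> int:
    """Shift num left by days_diff bits and set the vacated low bits to 1."""
    return (num << days_diff) + ((1 << days_diff) - 1)


# (nums, days_diff) pairs for date 2018-06-01, taken from the table.
window = [(9, 0), (25, 1), (68, 2)]

shifted = [shift_and_fill(n, d) for n, d in window]
aggregated = reduce(and_, shifted)  # bit-wise AND over the window

print(shifted)     # [9, 51, 275]
print(aggregated)  # 1
```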
Now I can aggregate:
exp_df.groupby('date')['shifted'].agg(lambda group_vals: np.bitwise_and.reduce(group_vals.values))
And the final result looks like the following (if I join back to the original DataFrame):
id date num shifted
0 A 2018-06-01 9 1
1 A 2018-06-02 25 1
2 A 2018-06-03 68 0
3 A 2018-06-05 80 64
4 A 2018-06-06 49 33
5 A 2018-06-08 95 3
6 A 2018-06-09 53 1
7 A 2018-06-10 99 1
8 A 2018-06-11 54 6
9 A 2018-06-12 67 3
10 A 2018-06-13 99 3
11 A 2018-06-15 35 3
12 A 2018-06-16 25 1
13 A 2018-06-17 16 0
14 A 2018-06-18 61 21
15 A 2018-06-19 59 35
16 A 2018-06-21 10 8
17 A 2018-06-22 94 6
18 A 2018-06-23 87 3
19 A 2018-06-24 3 1
20 A 2018-06-25 28 0
21 A 2018-06-26 5 1
22 A 2018-06-28 2 0
23 A 2018-06-29 14 14
Back to the question
Ok, now that I've demonstrated my logic, I realize that I can essentially do the same thing in Spark - performing a full outer join of the DataFrame on itself, then filtering and aggregating.
What I want to know is if I can avoid performing a full join, and instead create my own UDAF to perform this aggregation over a window function, using the target row as an input. Basically, I need to create the equivalent of the "days_diff" column in order to perform my required logic, which means comparing the target date to each of the other dates within my specified window. Is this even possible?
Also, am I even justified in worrying about using a self-join? I know that Spark does all of its processing lazily, so it's very possible that I wouldn't need to worry. Should I expect the performance to be similar if I were to do all of this using a self-join versus my imaginary UDAF applied over a window? The logic is more sequential and easier to follow using the join-filter-aggregate method, which is a clear advantage.
One thing to know is that I will be performing this logic on multiple windows. In principle, I could cache the largest version of the filtered DataFrame after the join, then use that for subsequent calculations.