Python Pandas - sumif in excel - criteria and range in same df column - python

I’ve been trying to code the Python equivalent of Excel’s SUMIF.
Excel:
SUMIF($A$1:$A$20,A1,$C$1:$C$20)
Pandas df:
A C Term
1 10 1
1 20 2
1 10 3
1 10 4
2 30 5
2 30 6
2 30 7
3 20 8
3 10 9
3 10 10
3 10 11
3 10 12
Output df - I want an output df with ‘fwdSum’ as follows:
A C Term fwdSum
1 10 1 50
1 20 2 50
1 10 3 50
1 10 4 50
2 30 5 90
2 30 6 90
2 30 7 90
3 20 8 60
3 10 9 60
3 10 10 60
3 10 11 60
3 10 12 60
I tried creating another df with groupby and sum and then merging it back.
Can anyone suggest the best way to achieve this?

groupby('A') with transform('sum') broadcasts each group's sum back onto every row of that group, which is exactly what SUMIF does here:
df['fwdSum'] = df.groupby('A')['C'].transform('sum')
print(df)
Prints:
A C Term fwdSum
0 1 10 1 50
1 1 20 2 50
2 1 10 3 50
3 1 10 4 50
4 2 30 5 90
5 2 30 6 90
6 2 30 7 90
7 3 20 8 60
8 3 10 9 60
9 3 10 10 60
10 3 10 11 60
11 3 10 12 60
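For comparison, the groupby-then-merge route the question mentions works too; transform just collapses it into one step. A minimal, self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   'C': [10, 20, 10, 10, 30, 30, 30, 20, 10, 10, 10, 10],
                   'Term': range(1, 13)})

# Build a per-group lookup table of sums, then merge it back onto every row.
sums = df.groupby('A', as_index=False)['C'].sum().rename(columns={'C': 'fwdSum'})
df = df.merge(sums, on='A', how='left')
```

The result matches the transform version; transform is simply shorter and avoids the intermediate frame.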

Related

Pandas expand date range with multiple times and forward filling

I have a dataframe like this:
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
01/09/2022 10 20 1 2
01/09/2022 15 25 4 5
01/09/2022 30 50 7 10
05/09/2022 10 20 1 2
05/09/2022 15 25 4 5
07/09/2022 15 25 4 5
I want to expand the dataframe to cover the full date range between the dates in the DATE column, with forward filling. The desired output is:
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
01/09/2022 10 20 1 2
01/09/2022 15 25 4 5
01/09/2022 30 50 7 10
02/09/2022 10 20 1 2
02/09/2022 15 25 4 5
02/09/2022 30 50 7 10
03/09/2022 10 20 1 2
03/09/2022 15 25 4 5
03/09/2022 30 50 7 10
04/09/2022 10 20 1 2
04/09/2022 15 25 4 5
04/09/2022 30 50 7 10
05/09/2022 10 20 1 2
05/09/2022 15 25 4 5
06/09/2022 10 20 1 2
06/09/2022 15 25 4 5
07/09/2022 15 25 4 5
Could you please help me with this?
First convert the values to datetimes. Create a helper counter Series g with GroupBy.cumcount, reshape with DataFrame.set_index and DataFrame.unstack, then use DataFrame.asfreq with method='ffill' and reshape back with DataFrame.stack. Remove the helper level with DataFrame.droplevel, convert the DatetimeIndex back to a column, change the datetime format, and finally restore the original DataFrame's dtypes:
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)
g = df.groupby('DATE').cumcount()

df = (df.set_index(['DATE', g])
        .unstack()
        .asfreq('D', method='ffill')
        .stack()
        .droplevel(-1)
        .reset_index()
        .assign(DATE=lambda x: x['DATE'].dt.strftime('%d/%m/%Y'))
        .astype(df.dtypes))
print(df)
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
0 2022-01-09 10 20 1 2
1 2022-01-09 15 25 4 5
2 2022-01-09 30 50 7 10
3 2022-02-09 10 20 1 2
4 2022-02-09 15 25 4 5
5 2022-02-09 30 50 7 10
6 2022-03-09 10 20 1 2
7 2022-03-09 15 25 4 5
8 2022-03-09 30 50 7 10
9 2022-04-09 10 20 1 2
10 2022-04-09 15 25 4 5
11 2022-04-09 30 50 7 10
12 2022-05-09 10 20 1 2
13 2022-05-09 15 25 4 5
14 2022-06-09 10 20 1 2
15 2022-06-09 15 25 4 5
16 2022-07-09 15 25 4 5
A couple of merges should help with this, and should still be efficient as the data size increases:
Get the unique dates and build a new dataframe from that:
out = df.DATE.drop_duplicates()
dates = pd.date_range(out.min(), out.max(), freq='D')
dates = pd.DataFrame(dates, columns=['dates'])
Merge dates with out, and subsequently merge the outcome with the original dataframe:
(dates
 .merge(out, left_on='dates', right_on='DATE', how='left')
 # faster to fill on a Series than a DataFrame
 .assign(DATE=lambda df: df.DATE.ffill())
 .merge(df, on='DATE', how='left')
 .drop(columns='DATE')
 .rename(columns={'dates': 'DATE'})
)
DATE MIN_AMOUNT MAX_AMOUNT MIN_DAY MAX_DAY
0 2022-09-01 10 20 1 2
1 2022-09-01 15 25 4 5
2 2022-09-01 30 50 7 10
3 2022-09-02 10 20 1 2
4 2022-09-02 15 25 4 5
5 2022-09-02 30 50 7 10
6 2022-09-03 10 20 1 2
7 2022-09-03 15 25 4 5
8 2022-09-03 30 50 7 10
9 2022-09-04 10 20 1 2
10 2022-09-04 15 25 4 5
11 2022-09-04 30 50 7 10
12 2022-09-05 10 20 1 2
13 2022-09-05 15 25 4 5
14 2022-09-06 10 20 1 2
15 2022-09-06 15 25 4 5
16 2022-09-07 15 25 4 5
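For reference, the two-merge approach can be run end to end; here is a self-contained sketch with the sample data rebuilt from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'DATE': ['01/09/2022', '01/09/2022', '01/09/2022',
             '05/09/2022', '05/09/2022', '07/09/2022'],
    'MIN_AMOUNT': [10, 15, 30, 10, 15, 15],
    'MAX_AMOUNT': [20, 25, 50, 20, 25, 25],
    'MIN_DAY': [1, 4, 7, 1, 4, 4],
    'MAX_DAY': [2, 5, 10, 2, 5, 5],
})
df['DATE'] = pd.to_datetime(df['DATE'], dayfirst=True)

# Unique dates present in the data, and the full daily calendar spanning them.
out = df['DATE'].drop_duplicates()
dates = pd.DataFrame({'dates': pd.date_range(out.min(), out.max(), freq='D')})

result = (dates
          .merge(out, left_on='dates', right_on='DATE', how='left')
          # forward-fill so each calendar day points at the last known DATE
          .assign(DATE=lambda d: d['DATE'].ffill())
          .merge(df, on='DATE', how='left')
          .drop(columns='DATE')
          .rename(columns={'dates': 'DATE'}))
```

Each calendar day picks up all rows of the most recent earlier DATE, giving the 17-row frame shown above.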

How to get the number of events in a regular interval of time in a dataframe

Assume I have a dataframe as shown below.
The dataframe records the number of events that occurred in each second.
Time events_occured
1 2
2 3
3 7
4 4
5 6
6 3
7 86
8 26
9 7
10 26
. .
. .
. .
996 56
997 26
998 97
999 58
1000 34
Now I need the cumulative count of events within every 5-second interval.
For example, 22 events occurred in the first 5 seconds, 148 events from second 6 to second 10, and so on.
Like this:
In [647]: df['cumulative'] = df.events_occured.groupby(df.index // 5).cumsum()
In [648]: df
Out[648]:
Time events_occured cumulative
0 1 2 2
1 2 3 5
2 3 7 12
3 4 4 16
4 5 6 22
5 6 3 3
6 7 86 89
7 8 26 115
8 9 7 122
9 10 26 148
If there are missing Time values, using df.index could produce errors in the logic, so group on df['Time'] instead.
This also works if Time starts at any value N and if there are gaps after N:
GROUP_SIZE = 5
df['cumulative'] = (df.events_occured
                    .groupby(df['Time'].sub(df['Time'].min()) // GROUP_SIZE)
                    .cumsum())
print(df)
Time events_occured cumulative
0 1 2 2
1 2 3 5
2 3 7 12
3 4 4 16
4 5 6 22
5 6 3 3
6 7 86 89
7 8 26 115
8 9 7 122
9 10 26 148
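To illustrate the missing-values point, a small sketch where the rows for seconds 3 and 7 are absent (so positional bucketing via df.index would misalign the windows):

```python
import pandas as pd

# Time has gaps (3 and 7 are missing); grouping on the Time values
# themselves keeps each 5-second window intact regardless.
df = pd.DataFrame({'Time': [1, 2, 4, 5, 6, 8, 9, 10],
                   'events_occured': [2, 3, 7, 6, 3, 26, 7, 26]})

GROUP_SIZE = 5
df['cumulative'] = (df['events_occured']
                    .groupby(df['Time'].sub(df['Time'].min()) // GROUP_SIZE)
                    .cumsum())
```

Seconds 1-5 still accumulate together (2, 5, 12, 18) and seconds 6-10 restart the running sum.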

assign a number id for every 4 rows in pandas dataframe

I have a pandas dataframe like this:
pd.DataFrame({'week': ['2019-w01', '2019-w02', '2019-w03', '2019-w04',
                       '2019-w05', '2019-w06', '2019-w07', '2019-w08',
                       '2019-w9', '2019-w10', '2019-w11', '2019-w12'],
              'value': [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14]})
week value
0 2019-w1 11
1 2019-w2 22
2 2019-w3 33
3 2019-w4 34
4 2019-w5 57
5 2019-w6 88
6 2019-w7 2
7 2019-w8 9
8 2019-w9 10
9 2019-w10 1
10 2019-w11 76
11 2019-w12 14
What I need is shown below: I would like to assign a period ID to every 4-week interval.
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
What is the best way to achieve that? Thanks.
Try with:
df['period'] = (pd.to_numeric(df['week'].str.split('-').str[-1].str.replace('w', ''))
                // 4).shift(fill_value=0).add(1)
print(df)
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
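If the rows are already in week order, a positional alternative sidesteps parsing the label entirely (a sketch, not what the answer above does):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'week': [f'2019-w{i:02d}' for i in range(1, 13)],
                   'value': [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14]})

# Integer-divide the row position by 4 and add 1: rows 0-3 -> 1, rows 4-7 -> 2, ...
df['period'] = np.arange(len(df)) // 4 + 1
```

This relies only on row order, so it also survives irregular week labels like '2019-w9'.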

Merge dataframes including extreme values

I have 2 data frames, df1 and df2:
df1
Out[66]:
A B
0 1 11
1 1 2
2 1 32
3 1 42
4 1 54
5 1 66
6 2 16
7 2 23
8 3 13
9 3 24
10 3 35
11 3 46
12 3 51
13 4 12
14 4 28
15 4 39
16 4 49
df2
Out[80]:
B
0 32
1 42
2 13
3 24
4 35
5 39
6 49
I want to merge the dataframes while also keeping the neighboring value on either side of the matched values within each group in column A. This is an example of the desired outcome:
df3
Out[93]:
A B
0 1 2
1 1 32
2 1 42
3 1 54
4 3 13
5 3 24
6 3 35
7 3 46
8 4 28
9 4 39
10 4 49
I'm trying to use merge, but that only keeps the portion of the dataframes that coincides. Does anyone have an idea how to deal with this? Thanks!
Here's one way to do it using merge with indicator, groupby, and rolling:
df1[df1.merge(df2, on='B', how='left', indicator='Ind').eval('Found = Ind == "both"')
    .groupby('A')['Found']
    .apply(lambda x: x.rolling(3, center=True, min_periods=2).max())
    .astype(bool)]
Output:
A B
1 1 2
2 1 32
3 1 42
4 1 54
8 3 13
9 3 24
10 3 35
11 3 46
14 4 28
15 4 39
16 4 49
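The same idea can also be written with GroupBy.rolling instead of apply; a self-contained sketch using the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4],
                    'B': [11, 2, 32, 42, 54, 66, 16, 23,
                          13, 24, 35, 46, 51, 12, 28, 39, 49]})
df2 = pd.DataFrame({'B': [32, 42, 13, 24, 35, 39, 49]})

# Flag rows whose B appears in df2, then keep any row within one position
# of a flagged row inside its A group (centered rolling window of 3).
flag = df1.merge(df2, on='B', how='left', indicator='Ind')['Ind'].eq('both').astype(int)
mask = (flag.groupby(df1['A'])
            .rolling(3, center=True, min_periods=2).max()
            .reset_index(level=0, drop=True)
            .sort_index()
            .astype(bool))
result = df1[mask]
```

The rolling max turns each match into a "keep" signal for itself and its immediate neighbors within the group, which is what pulls in the values just before the first match and just after the last one.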
(pd.concat([df1.groupby('A').min().reset_index(),
            pd.merge(df1, df2, on='B'),
            df1.groupby('A').max().reset_index()])
   .reset_index(drop=True)
   .drop_duplicates()
   .sort_values(['A', 'B']))
A B
0 1 2
4 1 32
5 1 42
1 2 16
2 3 13
7 3 24
8 3 35
3 4 12
9 4 39
10 4 49
Breaking down each part:
#Get Minimum
df1.groupby('A').min().reset_index()
# Merge on B
pd.merge(df1,df2, on="B")
# Get Maximum
df1.groupby('A').max().reset_index()
# Reset the Index and drop duplicated rows since there may be similarities between the Merge and Min/Max. Sort values by 'A' then by 'B'
.reset_index(drop=True).drop_duplicates().sort_values(['A','B'])

How to do a GroupBy without any mean for other features?

I'm working with pandas DataFrames in Python, and I want to group my data by category without aggregating (no mean or median) the other features (PriceBucket, success_rate and products_by_number). My DataFrame looks like this:
PriceBucket success_rate products_by_number category
0 0 6.890 149837 10
1 1 7.240 105447 10
2 2 7.710 145295 10
3 3 8.090 181323 10
4 4 8.930 57187 10
5 5 8.110 133449 10
6 6 7.920 142858 10
7 7 8.230 115109 10
8 8 8.510 121930 10
9 9 8.340 122510 10
10 0 10.520 28105 20
11 1 9.770 27494 20
12 2 10.080 26758 20
13 3 10.180 29973 20
14 4 9.860 29175 20
15 5 9.950 23807 20
16 6 9.550 30520 20
17 7 9.550 23653 20
18 8 8.990 27514 20
19 9 6.710 26152 20
20 0 11.060 39538 60
21 1 10.740 34479 60
22 2 10.700 36133 60
23 3 10.900 34220 60
24 4 11.290 46001 60
25 5 11.130 26705 60
26 6 11.040 37258 60
27 7 11.150 34561 60
28 8 10.845 35495 60
29 9 10.220 35434 60
30 0 8.380 34134 90
31 1 7.920 32160 90
32 2 8.170 29500 90
33 3 8.270 31688 90
34 4 8.395 38977 90
35 5 8.620 27130 90
36 6 8.440 31007 90
37 7 8.570 31005 90
38 8 8.170 32659 90
39 9 7.290 30227 90
And this is exactly what I want :
PriceBucket success_rate products_by_number
category
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
What should I do? Many thanks.
Assuming your dataframe is df, then you want:
print(df.set_index(['category', 'PriceBucket']))
success_rate products_by_number
category PriceBucket
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
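To make it concrete that nothing is aggregated, here is a tiny runnable sketch on a subset of the rows (values taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'PriceBucket': [0, 1, 0, 1],
                   'success_rate': [6.89, 7.24, 10.52, 9.77],
                   'products_by_number': [149837, 105447, 28105, 27494],
                   'category': [10, 10, 20, 20]})

# set_index only moves the two columns into a hierarchical (Multi)Index;
# every row and value survives unchanged -- no mean, median, or other reduction.
out = df.set_index(['category', 'PriceBucket'])
```

If the rows were not already ordered by category, appending .sort_index() would group the display the same way.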
