Average of one column with respect to another (AVERAGEIFS) in Python

I have a pandas DataFrame and would like to calculate the average Rate by DC and by Brand, similar to AVERAGEIF in Excel.
I have tried methods like groupby().mean(), but that does not give the results I expect.

Your question is not clear but you may be looking for:
df.groupby(['DC','Brand'])['Rate'].mean()

AVERAGEIF in Excel, filled down a column, gives one value per row, so the result is the same size as your original data. So I think you're looking for groupby(...).transform():
# Sample DF
Brand Rate
0 A 45
1 B 100
2 C 28
3 A 92
4 B 2
5 C 79
6 A 48
7 B 97
8 C 72
9 D 14
10 D 16
11 D 64
12 E 85
13 E 22
Result:
df['Avg Rate by Brand'] = df.groupby('Brand')['Rate'].transform('mean')
print(df)
Brand Rate Avg Rate by Brand
0 A 45 61.666667
1 B 100 66.333333
2 C 28 59.666667
3 A 92 61.666667
4 B 2 66.333333
5 C 79 59.666667
6 A 48 61.666667
7 B 97 66.333333
8 C 72 59.666667
9 D 14 31.333333
10 D 16 31.333333
11 D 64 31.333333
12 E 85 53.500000
13 E 22 53.500000
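Since the question asks for the average by DC and by Brand, the same idea can be sketched with two grouping keys. The DC values and rates below are made up for illustration, because the original data was not included in the post.

```python
import pandas as pd

# Hypothetical sample data: the DC column and its values are assumptions.
df = pd.DataFrame({
    'DC': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Brand': ['A', 'A', 'B', 'A', 'B', 'B'],
    'Rate': [10, 20, 30, 40, 50, 70],
})

# transform('mean') broadcasts each group's mean back to every row of the group,
# so the new column has the same length as the original frame.
df['Avg Rate by DC and Brand'] = (
    df.groupby(['DC', 'Brand'])['Rate'].transform('mean')
)
print(df)
```

Grouping by a list of columns is roughly what AVERAGEIFS does with two criteria ranges.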

Related

Pandas custom groupby fill

I have this dataset:
menu alternative id varA varB varC
1 NaN A NaN NaN NaN
1 NaN A NaN NaN NaN
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 NaN A NaN NaN NaN
7 NaN A NaN NaN NaN
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 NaN B NaN NaN NaN
6 NaN B NaN NaN NaN
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 NaN C NaN NaN NaN
5 NaN C NaN NaN NaN
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
As you can see, I have some null values which I need to fill, but in a somewhat custom manner. For every id and every menu, I need to fill the null values by randomly selecting the same menu (same menu number) from a different id that has non-null values.
Example: menu 1 in id A has null values. I want to randomly select menu 1 from a different id which has non-null values and use it to fill them. Let it be id B, menu 1. For menu 7 in id A, let it be menu 7 in id C, and so on.
It is somewhat similar to this question, but in my case the filling should happen within the same "subgroups", so to speak.
The final output should be something like this:
menu alternative id varA varB varC
1 82 A 21.34428845 9.901326583 1.053134597
1 91 A 19.04689216 16.29217346 29.56962312
2 94 A 8.089481019 7.07639559 0.90627215
2 89 A 7.52310322 19.49894193 14.4562262
3 79 A 24.79634962 18.91163612 23.85341972
3 95 A 21.10990397 17.00630516 1.09875582
4 47 A 5.681766806 4.136047755 17.38880496
4 62 A 10.39459876 0.997853805 0.045331687
5 58 A 11.91790497 5.696799013 27.21424163
5 23 A 11.71107828 2.165751058 11.56534045
6 57 A 1.068603487 27.95362014 1.334049372
6 100 A 26.31848796 6.757305213 4.742282633
7 17 A 4.319991082 21.29233667 3.516184987
7 8 A 24.09490443 9.507000131 14.93472971
8 24 A 29.99608877 28.49057834 0.14073638
8 7 A 8.749041949 14.17745528 9.604565417
9 64 A 29.4316969 19.57593592 9.174503643
9 60 A 13.53995541 1.898164567 16.49089291
10 85 A 20.1394155 0.995839592 16.18638727
10 22 A 22.68625486 14.26052953 17.79707308
1 82 B 21.34428845 9.901326583 1.053134597
1 91 B 19.04689216 16.29217346 29.56962312
2 35 B 25.44168095 29.00407645 2.246459981
2 100 B 15.79687903 20.37920541 28.45071525
3 44 B 7.359501131 23.66924419 7.198215907
3 41 B 22.65272801 8.66227065 12.05186217
4 59 B 26.67565422 9.608511948 26.45016581
4 53 B 5.64870847 21.83063691 19.20105218
5 48 B 5.591317152 25.17616679 24.30522374
5 16 B 23.85069753 23.12154586 0.781450997
6 57 B 1.068603487 27.95362014 1.334049372
6 100 B 26.31848796 6.757305213 4.742282633
7 68 B 9.334935288 16.39114327 21.17696541
7 41 B 5.841577934 6.901223007 28.38116983
8 35 B 21.20288984 9.665414964 4.472546438
8 96 B 0.451299457 27.66880932 26.2120144
9 84 B 19.67310555 1.993071082 9.08442779
9 65 B 0.475983889 16.72261394 17.17122898
10 40 B 9.553130945 17.88616649 22.17570401
10 40 B 19.70487161 5.898428653 11.25844279
1 19 C 20.47792809 9.344376127 7.855311112
1 59 C 14.59141273 8.090534362 19.6972446
2 19 C 6.624345353 0.192145343 26.31356322
2 67 C 24.483236 6.718856437 25.75609679
3 67 C 27.6408808 24.91014602 25.90758755
3 30 C 26.52738124 10.78363589 4.873602089
4 14 C 3.776964641 21.16561036 24.03153234
4 46 C 16.53719818 23.86634958 25.61504006
5 48 C 5.591317152 25.17616679 24.30522374
5 16 C 23.85069753 23.12154586 0.781450997
6 58 C 28.1357636 15.89359176 0.567406646
6 28 C 0.708229201 12.20641988 0.309303591
7 17 C 4.319991082 21.29233667 3.516184987
7 8 C 24.09490443 9.507000131 14.93472971
8 85 C 19.99606403 21.61509867 0.161222766
8 5 C 6.056082264 25.35186187 5.375641692
9 24 C 19.83904205 24.54037422 11.08571464
9 13 C 4.388769239 7.928106767 4.279531285
10 78 C 13.67598922 5.3140143 15.2710129
10 13 C 12.27642791 16.04610858 1.815260029
Any guidance would be appreciated. Maybe even there is some groupby apply logic which could assist in this.
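You can run fillna() row-wise in apply(), filling each row from a random sample of the rows that have the same menu, a different id, and no nulls. Note that each row is sampled independently, and that apply() returns a new DataFrame, so assign the result:
df = df.apply(lambda row: row.fillna(df[(df['menu'] == row['menu']) & (df['id'] != row['id'])].dropna().sample(n=1).iloc[0]), axis=1)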
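If both rows of an empty menu should come from the same donor id, as in the expected output, one possible sketch is to work per (menu, id) block instead of per row. The helper name fill_from_random_donor and its value_cols and seed parameters are illustrative assumptions, not part of the original answer. The sketch also assumes donor and recipient blocks can be matched row by row.

```python
import numpy as np
import pandas as pd


def fill_from_random_donor(df, value_cols, seed=None):
    """Hypothetical helper: fill all-null (menu, id) blocks from one random donor id."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for (menu, id_), block in df.groupby(['menu', 'id'], sort=False):
        # Only touch blocks whose value columns are entirely missing.
        if not block[value_cols].isna().to_numpy().all():
            continue
        # Candidate rows: same menu, another id, no missing values.
        same_menu = df[(df['menu'] == menu) & (df['id'] != id_)]
        complete = same_menu.dropna(subset=value_cols)
        donors = complete['id'].unique()
        if len(donors) == 0:
            continue  # nothing to copy from; leave the block as it is
        donor = donors[rng.integers(len(donors))]
        donor_rows = complete[complete['id'] == donor]
        # Copy row by row, up to the shorter of the two blocks.
        n = min(len(block), len(donor_rows))
        out.loc[block.index[:n], value_cols] = donor_rows[value_cols].to_numpy()[:n]
    return out


# Small demo with a single possible donor, so the outcome does not depend on the seed.
demo = pd.DataFrame({
    'menu': [1, 1, 1, 1, 2, 2],
    'id': ['A', 'A', 'B', 'B', 'A', 'A'],
    'alternative': [np.nan, np.nan, 82, 91, 94, 89],
    'varA': [np.nan, np.nan, 21.3, 19.0, 8.1, 7.5],
})
filled = fill_from_random_donor(demo, ['alternative', 'varA'], seed=0)
print(filled)
```

Passing a seed makes the random choice reproducible when several donor ids are available.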

Labeling by period

my dataset
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values for these days, but there are too many unique day numbers, so I want to label them.
I want the labels to be consistent.
Is there a quick way to label the days by cutting them every 7 days (one week)?
For example, days 1 to 7 = week 1, days 8 to 14 = week 2, and so on.
The output I want:
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
thank you for reading
Subtract 1, integer-divide by 7, then add 1. Subtracting 1 first makes day 7 fall in week 1 rather than week 2:
df['week'] = (df['day'] - 1) // 7 + 1
print (df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
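As a self-contained check of that arithmetic, the sketch below rebuilds the sample data from the question and compares the result with rounding day / 7 up, which should give the same week numbers for positive days. The variable name week_ceil is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'day': [7, 15, 21, 29, 21, 30, 35, 8, 16],
    'value': [88, 101, 121, 56, 131, 78, 102, 80, 101],
})

# Days 1-7 -> week 1, days 8-14 -> week 2, and so on.
df['week'] = (df['day'] - 1) // 7 + 1

# Equivalent formulation: divide by 7 and round up.
week_ceil = np.ceil(df['day'] / 7).astype(int)

print(df)
print((df['week'] == week_ceil).all())
```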

How to get the value if the value is present in the xlsx sheet without knowing index number?

I have an unstructured xlsx file. I want to get the full row whenever a given value is present anywhere in it. For example:
A B C D F
abc 10 24 32 54
cdf 9 10 34 98
mgl 11 90 21 98
fgd 1 9 2 10
If the value 10 is present anywhere in a row, I want to get that full row:
output =>
abc 10 24 32 54
cdf 9 10 34 98
fgd 1 9 2 10
thanks for the contributions
Use DataFrame.eq with DataFrame.any to test whether each row contains at least one True:
import pandas as pd
df = pd.read_excel('file.xlsx')
df1 = df[df.eq(10).any(axis=1)]
Or:
df1 = df[(df == 10).any(axis=1)]
print (df1)
A B C D F
0 abc 10 24 32 54
1 cdf 9 10 34 98
3 fgd 1 9 2 10
You can use pandas.DataFrame.isin followed by pandas.DataFrame.any:
df[df.isin([10]).any(axis = 1)]
A B C D F
0 abc 10 24 32 54
1 cdf 9 10 34 98
3 fgd 1 9 2 10
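The same filter can be tried without an Excel file by building the sample table in memory. This sketch assumes the cells were read as numbers. If they came in as text, comparing against the integer 10 would not match, and the string '10' would be needed instead.

```python
import pandas as pd

# Sample table from the question, built in memory instead of read from a file.
df = pd.DataFrame({
    'A': ['abc', 'cdf', 'mgl', 'fgd'],
    'B': [10, 9, 11, 1],
    'C': [24, 10, 90, 9],
    'D': [32, 34, 21, 2],
    'F': [54, 98, 98, 10],
})

# Boolean mask: True for rows where any cell equals 10.
mask = df.eq(10).any(axis=1)
rows_with_10 = df[mask]
print(rows_with_10)
```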

DataFrame : Get the top n value of each type

I have a group of data like below
ID Type value_1 value_2
1 A 12 89
2 A 13 78
3 A 11 92
4 A 9 79
5 B 15 83
6 B 34 91
7 B 2 87
8 B 3 86
9 B 7 85
10 C 9 83
11 C 3 85
12 C 2 87
13 C 12 88
14 C 11 82
I want to get the top 3 members of each Type according to value_1. The only solution that occurs to me is to first split each Type into its own dataframe, sort it by value_1 and take the top 3, and then merge the results together.
Is there a simpler way to do this? To make discussion easier, here is my code:
#coding:utf-8
import pandas as pd
_data = [
["1","A",12,89],
["2","A",13,78],
["3","A",11,92],
["4","A",9,79],
["5","B",15,83],
["6","B",34,91],
["7","B",2,87],
["8","B",3,86],
["9","B",7,85],
["10","C",9,83],
["11","C",3,85],
["12","C",2,87],
["13","C",12,88],
["14","C",11,82]
]
head= ["ID","type","value_1","value_2"]
df = pd.DataFrame(_data, columns=head)
You can use sort_values followed by groupby and tail. After an ascending sort, the last 3 rows of each group are the 3 largest:
newdf=df.sort_values(['type','value_1']).groupby('type').tail(3)
newdf
ID type value_1 value_2
2 3 A 11 92
0 1 A 12 89
1 2 A 13 78
8 9 B 7 85
4 5 B 15 83
5 6 B 34 91
9 10 C 9 83
13 14 C 11 82
12 13 C 12 88
Sure! DataFrame.groupby splits a dataframe into parts by the grouping fields, and apply runs a user-defined function on each group:
df.groupby('type', as_index=False, group_keys=False)\
.apply(lambda x: x.sort_values('value_1', ascending=False).head(3))
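A variant of the sorting approach that avoids apply is to sort in descending order once and then take head(3) per group. This is a sketch using the sample data from the question. The name top3 is only illustrative.

```python
import pandas as pd

_data = [
    ["1", "A", 12, 89], ["2", "A", 13, 78], ["3", "A", 11, 92], ["4", "A", 9, 79],
    ["5", "B", 15, 83], ["6", "B", 34, 91], ["7", "B", 2, 87], ["8", "B", 3, 86],
    ["9", "B", 7, 85], ["10", "C", 9, 83], ["11", "C", 3, 85], ["12", "C", 2, 87],
    ["13", "C", 12, 88], ["14", "C", 11, 82],
]
df = pd.DataFrame(_data, columns=["ID", "type", "value_1", "value_2"])

# Sort so the largest value_1 comes first, then keep the first 3 rows of each type.
top3 = df.sort_values('value_1', ascending=False).groupby('type').head(3)

# Optional: order the result by type for readability.
top3 = top3.sort_values(['type', 'value_1'], ascending=[True, False])
print(top3)
```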

Create a column with periodically repeated values in pandas

I have a sample data frame df with one column:
Cost
30
49
98
10
37
20
10
48
70
20
30
40
50
29
90
39
30
29
50
40
and a list, id_list = ["A","B","C","D"], containing 4 different id types. I would like to create a new column in the data frame where the first 5 cost values get "A", the next 5 get "B", and so on, until the last 5 get "D". In other words, I want to repeat each element of id_list 5 times, so that my new df looks like this:
Cost ID
30 A
49 A
98 A
10 A
37 A
20 B
10 B
48 B
70 B
20 B
30 C
40 C
50 C
29 C
90 C
39 D
30 D
29 D
50 D
40 D
My actual data frame has many rows and the actual id_list has many elements.
The row-number is multiple of 5 so there will be an exact fill in the final data frame.
In general I know how to add a column with specific values in pandas data frame
but I don't know how to do this with the repeated values.
Could you suggest how can I do this in python?
Thanks in advance for any help
NumPy has a function for this, repeat:
import numpy as np
df['New']=np.repeat(id_list,5)
df
Out[23]:
Cost New
0 30 A
1 49 A
2 98 A
3 10 A
4 37 A
5 20 B
6 10 B
7 48 B
8 70 B
9 20 B
10 30 C
11 40 C
12 50 C
13 29 C
14 90 C
15 39 D
16 30 D
17 29 D
18 50 D
19 40 D
Numpy free v1
df.assign(ID=sum(zip(*[id_list] * 5), tuple()))
Cost ID
0 30 A
1 49 A
2 98 A
3 10 A
4 37 A
5 20 B
6 10 B
7 48 B
8 70 B
9 20 B
10 30 C
11 40 C
12 50 C
13 29 C
14 90 C
15 39 D
16 30 D
17 29 D
18 50 D
19 40 D
Numpy free v2
df.assign(ID=[x for x in id_list for _ in range(5)])
I would suggest something like this, which takes advantage of Python's list repetition, [item] * n => [item, item, item, ...]:
labels = ['label1', 'label2', 'label3']
num = 5
repeated = []
for i in labels:
    repeated.extend([i] * num)
You can then add repeated as a new column of your dataframe.
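For a longer frame, another possible sketch derives each row's label from its position by integer division, so the block size is stated only once. The names block_size and positions are illustrative assumptions. The approach requires id_list to hold at least as many labels as there are blocks of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cost': [30, 49, 98, 10, 37, 20, 10, 48, 70, 20,
                            30, 40, 50, 29, 90, 39, 30, 29, 50, 40]})
id_list = ["A", "B", "C", "D"]
block_size = 5  # number of consecutive rows that share one id

# Row position 0-4 -> 0, 5-9 -> 1, ... then look the label up by that block number.
positions = np.arange(len(df)) // block_size
df['ID'] = np.asarray(id_list)[positions]
print(df)
```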
