I'm working with pandas DataFrames in Python and I want to group my data by category, without aggregating (no mean or median) the other features (PriceBucket, success_rate and products_by_number). My DataFrame looks like this:
PriceBucket success_rate products_by_number category
0 0 6.890 149837 10
1 1 7.240 105447 10
2 2 7.710 145295 10
3 3 8.090 181323 10
4 4 8.930 57187 10
5 5 8.110 133449 10
6 6 7.920 142858 10
7 7 8.230 115109 10
8 8 8.510 121930 10
9 9 8.340 122510 10
10 0 10.520 28105 20
11 1 9.770 27494 20
12 2 10.080 26758 20
13 3 10.180 29973 20
14 4 9.860 29175 20
15 5 9.950 23807 20
16 6 9.550 30520 20
17 7 9.550 23653 20
18 8 8.990 27514 20
19 9 6.710 26152 20
20 0 11.060 39538 60
21 1 10.740 34479 60
22 2 10.700 36133 60
23 3 10.900 34220 60
24 4 11.290 46001 60
25 5 11.130 26705 60
26 6 11.040 37258 60
27 7 11.150 34561 60
28 8 10.845 35495 60
29 9 10.220 35434 60
30 0 8.380 34134 90
31 1 7.920 32160 90
32 2 8.170 29500 90
33 3 8.270 31688 90
34 4 8.395 38977 90
35 5 8.620 27130 90
36 6 8.440 31007 90
37 7 8.570 31005 90
38 8 8.170 32659 90
39 9 7.290 30227 90
And this is exactly what I want:
PriceBucket success_rate products_by_number
category
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
What should I do? Many thanks.
Assuming your dataframe is df, then you want:
print(df.set_index(['category', 'PriceBucket']))
success_rate products_by_number
category PriceBucket
10 0 6.890 149837
1 7.240 105447
2 7.710 145295
3 8.090 181323
4 8.930 57187
5 8.110 133449
6 7.920 142858
7 8.230 115109
8 8.510 121930
9 8.340 122510
20 0 10.520 28105
1 9.770 27494
2 10.080 26758
3 10.180 29973
4 9.860 29175
5 9.950 23807
6 9.550 30520
7 9.550 23653
8 8.990 27514
9 6.710 26152
60 0 11.060 39538
1 10.740 34479
2 10.700 36133
3 10.900 34220
4 11.290 46001
5 11.130 26705
6 11.040 37258
7 11.150 34561
8 10.845 35495
9 10.220 35434
90 0 8.380 34134
1 7.920 32160
2 8.170 29500
3 8.270 31688
4 8.395 38977
5 8.620 27130
6 8.440 31007
7 8.570 31005
8 8.170 32659
9 7.290 30227
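For reference, here is a minimal, self-contained sketch of the same idea; the data is just a small subset of the rows shown above, and sort_index is added only to keep each category's rows together if the input arrives unordered:
import pandas as pd

df = pd.DataFrame({'PriceBucket': [0, 1, 2, 0, 1, 2],
                   'success_rate': [6.89, 7.24, 7.71, 10.52, 9.77, 10.08],
                   'products_by_number': [149837, 105447, 145295, 28105, 27494, 26758],
                   'category': [10, 10, 10, 20, 20, 20]})

# Move the grouping columns into a MultiIndex instead of aggregating.
print(df.set_index(['category', 'PriceBucket']).sort_index())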
Related
Let's say I have the following dataframe:
ID stop x y z
0 202 9 20 27 4
1 202 2 23 24 13
2 1756 5 5 41 73
3 1756 3 7 42 72
4 1756 4 3 50 73
5 2153 14 121 12 6
6 2153 3 122.5 2 6
7 3276 1 54 33 -12
8 5609 9 -2 44 -32
9 5609 2 8 44 -32
10 5609 5 102 -23 16
I would like to change the ID values so that the smallest becomes 1, the second smallest becomes 2, etc. So for my example, I would get this:
ID stop x y z
0 1 9 20 27 4
1 1 2 23 24 13
2 2 5 5 41 73
3 2 3 7 42 72
4 2 4 3 50 73
5 3 14 121 12 6
6 3 3 122.5 2 6
7 4 1 54 33 -12
8 5 9 -2 44 -32
9 5 2 8 44 -32
10 5 5 102 -23 16
Any idea please?
Thanks in advance!
You can use pd.Series.rank with method='dense':
df['ID'] = df['ID'].rank(method='dense').astype(int)
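For illustration, a small self-contained sketch using only the ID column from the example (the values are taken from the question):
import pandas as pd

ids = pd.Series([202, 202, 1756, 1756, 1756, 2153, 2153, 3276, 5609, 5609, 5609])
# method='dense' assigns consecutive ranks with no gaps, so equal IDs share
# a rank and the smallest ID becomes 1.
print(ids.rank(method='dense').astype(int).tolist())
# [1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5]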
I have a pandas.DataFrame of the form shown below; I'll give a simple example (in reality it consists of hundreds of millions of rows of data).
I want a number that increments whenever the letter in column '2' changes. The numbers in the remaining columns (columns 1, 3, ...) should not change.
df=
index 1 2 3
0 0 a100 1
1 1.04 a100 2
2 32 a100 3
3 5.05 a105 4
4 1.01 a105 5
5 155 a105 6
6 3155.26 a105 7
7 354.12 a100 8
8 5680.13 a100 9
9 125.55 a100 10
10 13.32 a100 11
11 5656.33 a156 12
12 456.61 a156 13
13 23.52 a1235 14
14 35.35 a1235 15
15 350.20 a100 16
16 30. a100 17
17 13.50 a100 18
18 323.13 a231 19
19 15.11 a1111 20
20 11.22 a1111 21
Here is my expected result:
df=
index 1 2 3
0 0 0 1
1 1.04 0 2
2 32 0 3
3 5.05 1 4
4 1.01 1 5
5 155 1 6
6 3155.26 1 7
7 354.12 2 8
8 5680.13 2 9
9 125.55 2 10
10 13.32 2 11
11 5656.33 3 12
12 456.61 3 13
13 23.52 4 14
14 35.35 4 15
15 350.20 5 16
16 30 5 17
17 13.50 5 18
18 323.13 6 19
19 15.11 7 20
20 11.22 7 21
How do I solve this problem?
Create consecutive groups by comparing the column against its shifted values for inequality, take the cumulative sum, and then subtract 1:
# if the column label is the string '2'
df['2'] = df['2'].ne(df['2'].shift()).cumsum().sub(1)
# if the column label is the number 2
df[2] = df[2].ne(df[2].shift()).cumsum().sub(1)
print(df)
index 1 2 3
0 0 0.00 0 1
1 1 1.04 0 2
2 2 32.00 0 3
3 3 5.05 1 4
4 4 1.01 1 5
5 5 155.00 1 6
6 6 3155.26 1 7
7 7 354.12 2 8
8 8 5680.13 2 9
9 9 125.55 2 10
10 10 13.32 2 11
11 11 5656.33 3 12
12 12 456.61 3 13
13 13 23.52 4 14
14 14 35.35 4 15
15 15 350.20 5 16
16 16 30.00 5 17
17 17 13.50 5 18
18 18 323.13 6 19
19 19 15.11 7 20
20 20 11.22 7 21
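To see why this works, here is a small sketch of the same trick on just the letter column (a shortened version of the data in the question):
import pandas as pd

s = pd.Series(['a100', 'a100', 'a105', 'a105', 'a100', 'a156'])
# True wherever the value differs from the previous row; the first row is
# compared against NaN and is therefore True as well.
changes = s.ne(s.shift())
# The cumulative sum numbers each run of identical values starting at 1,
# and subtracting 1 makes the first group 0, as in the expected output.
print(changes.cumsum().sub(1).tolist())
# [0, 0, 1, 1, 2, 3]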
I've been trying to code the Python equivalent of Excel's SUMIF.
Excel:
SUMIF($A$1:$A$20, A1, $C$1:$C$20)
Pandas df:
A C Term
1 10 1
1 20 2
1 10 3
1 10 4
2 30 5
2 30 6
2 30 7
3 20 8
3 10 9
3 10 10
3 10 11
3 10 12
Output df: I want the output df with a 'fwdSum' column, as follows:
A C Term fwdSum
1 10 1 50
1 20 2 50
1 10 3 50
1 10 4 50
2 30 5 90
2 30 6 90
2 30 7 90
3 20 8 60
3 10 9 60
3 10 10 60
3 10 11 60
3 10 12 60
I tried creating another df with groupby and sum and then merging it back later.
Can anyone please suggest the best way to achieve this?
df['fwdSum'] = df.groupby('A')['C'].transform('sum')
print(df)
Prints:
A C Term fwdSum
0 1 10 1 50
1 1 20 2 50
2 1 10 3 50
3 1 10 4 50
4 2 30 5 90
5 2 30 6 90
6 2 30 7 90
7 3 20 8 60
8 3 10 9 60
9 3 10 10 60
10 3 10 11 60
11 3 10 12 60
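The groupby-then-merge route you tried works too; transform('sum') just broadcasts each group total back onto the original rows in one step. A small sketch of both, built from the data in the question (the column name fwdSum2 is only there to keep the two results side by side):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                   'C': [10, 20, 10, 10, 30, 30, 30, 20, 10, 10, 10, 10],
                   'Term': range(1, 13)})

# One step: broadcast each group's total back to every row of that group.
df['fwdSum'] = df.groupby('A')['C'].transform('sum')

# Two steps: aggregate, then merge the totals back in.
totals = df.groupby('A', as_index=False)['C'].sum().rename(columns={'C': 'fwdSum2'})
df = df.merge(totals, on='A')
print(df)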
I have CSV data; the dataset has different latitude locations from 17 to 20, and each location has monthly data, i.e. (1, 2, 3, 4, 5, 6, ...).
I would like to add a new column named N; its value depends on the latitude and the month, so for each row the number associated with that latitude and month should be filled in.
Input data
lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-17/18/1990/3/0.12
-17/18/1990/4/0.06
.
.
.
-17/18/2020/12/0.35
-17/20/1990/1/0.2
-17/20/1990/2/0.2
-17/20/1990/3/0.2
-17/20/1990/4/0.2
.
.
.
-17/20/2020/12/0.08
-18/20/1990/1/0.11
-18/20/1990/2/0.11
-18/20/1990/3/0.11
.
.
.
.
The N values depend on latitude and month (rows are months 1-12, columns are latitudes 17-21):
17 18 19 20 21
1 25 29 13 13 2
2 22 11 1 16 23
3 8 13 10 21 8
4 4 14 16 10 13
5 23 30 8 8 18
6 16 4 7 5 29
7 26 5 10 25 28
8 3 16 2 27 2
9 21 16 23 8 7
10 19 30 10 28 20
11 28 18 12 6 8
12 21 14 26 3 8
EXPECTED OUTPUT
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
.
.
.
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
.
.
.
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
.
.
.
.
Reshape the second DataFrame df2 (the lookup table) into a Series with DataFrame.unstack, rename the MultiIndex levels so they match the helper columns lan1 and month in df, and then combine them with DataFrame.join:
df = pd.read_csv(data, sep="/")   # 'data' is the CSV source from the question
# Reshape the lookup table into a Series indexed by (absolute latitude, month).
s = df2.rename(columns=int).unstack().rename_axis(['lan1', 'month'])
# Helper column with the absolute latitude, to match the lookup table.
df['lan1'] = df['lan'].abs()
df2 = df.join(s.rename('N'), on=['lan1', 'month']).drop('lan1', axis=1)
print(df2)
lan lon year month prec N
0 -17 18 1990 1 0.40 25
1 -17 18 1990 2 0.02 22
2 -17 18 1990 3 0.12 8
3 -17 18 1990 4 0.06 4
4 -17 18 2020 12 0.35 21
5 -17 20 1990 1 0.20 25
6 -17 20 1990 2 0.20 22
7 -17 20 1990 3 0.20 8
8 -17 20 1990 4 0.20 4
9 -17 20 2020 12 0.08 21
10 -18 20 1990 1 0.11 29
11 -18 20 1990 2 0.11 11
12 -18 20 1990 3 0.11 13
print(df2.to_csv(sep='/', index=False))
lan/lon/year/month/prec/N
-17/18/1990/1/0.4/25
-17/18/1990/2/0.02/22
-17/18/1990/3/0.12/8
-17/18/1990/4/0.06/4
-17/18/2020/12/0.35/21
-17/20/1990/1/0.2/25
-17/20/1990/2/0.2/22
-17/20/1990/3/0.2/8
-17/20/1990/4/0.2/4
-17/20/2020/12/0.08/21
-18/20/1990/1/0.11/29
-18/20/1990/2/0.11/11
-18/20/1990/3/0.11/13
print(s)
lan1 month
17 1 25
2 22
3 8
4 4
5 23
6 16
7 26
8 3
9 21
10 19
11 28
12 21
18 1 29
2 11
3 13
4 14
5 30
6 4
7 5
8 16
9 16
10 30
11 18
12 14
19 1 13
2 1
3 10
4 16
5 8
6 7
7 10
8 2
9 23
10 10
11 12
12 26
20 1 13
2 16
3 21
4 10
5 8
6 5
7 25
8 27
9 8
10 28
11 6
12 3
21 1 2
2 23
3 8
4 13
5 18
6 29
7 28
8 2
9 7
10 20
11 8
12 8
dtype: int64
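For completeness, a self-contained sketch of the same pipeline with a tiny inline CSV and a reduced lookup table (io.StringIO stands in for the real file, and only two months and two latitudes are included to keep it short):
import io
import pandas as pd

data = io.StringIO("""lan/lon/year/month/prec
-17/18/1990/1/0.4
-17/18/1990/2/0.02
-18/20/1990/1/0.11""")
df = pd.read_csv(data, sep='/')

# Lookup table: index is the month, columns are the absolute latitudes.
df2 = pd.DataFrame({'17': [25, 22], '18': [29, 11]}, index=[1, 2])

# unstack turns the table into a Series indexed by (latitude, month).
s = df2.rename(columns=int).unstack().rename_axis(['lan1', 'month'])

# Join on the absolute latitude and the month, then drop the helper column.
df['lan1'] = df['lan'].abs()
out = df.join(s.rename('N'), on=['lan1', 'month']).drop('lan1', axis=1)
print(out)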
I have a pandas dataframe like this:
df = pd.DataFrame({'week': ['2019-w01', '2019-w02', '2019-w03', '2019-w04',
                            '2019-w05', '2019-w06', '2019-w07', '2019-w08',
                            '2019-w9', '2019-w10', '2019-w11', '2019-w12'],
                   'value': [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14]})
week value
0 2019-w1 11
1 2019-w2 22
2 2019-w3 33
3 2019-w4 34
4 2019-w5 57
5 2019-w6 88
6 2019-w7 2
7 2019-w8 9
8 2019-w9 10
9 2019-w10 1
10 2019-w11 76
11 2019-w12 14
What I need is shown below: I would like to assign a period ID to every 4-week interval.
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
What is the best way to achieve that? Thanks.
Try with:
df['period'] = (pd.to_numeric(df['week'].str.split('-').str[-1]
                              .str.replace('w', '')) // 4).shift(fill_value=0).add(1)
print(df)
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
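An equivalent, slightly more direct variant (a sketch, assuming every week label ends in the week number) extracts the digits and uses integer division, so weeks 1-4 map to period 1, weeks 5-8 to period 2, and so on:
import pandas as pd

df = pd.DataFrame({'week': ['2019-w01', '2019-w02', '2019-w03', '2019-w04',
                            '2019-w05', '2019-w06', '2019-w07', '2019-w08',
                            '2019-w9', '2019-w10', '2019-w11', '2019-w12'],
                   'value': [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14]})

# Pull the trailing week number out of the label, then group every 4 weeks.
week_num = df['week'].str.extract(r'w(\d+)$', expand=False).astype(int)
df['period'] = (week_num - 1) // 4 + 1
print(df)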