Ranking and subranking rows in pandas - python

I have the following dataset, which I would like to rank by region, and also by store type within each region.
Is there a slick way of coding these two ranking columns in python?
Data:
print (df)
Region ID Location store Type ID Brand share
0 1 Warehouse 1.97
1 1 Warehouse 0.24
2 1 Super Centre 0.21
3 1 Warehouse 0.13
4 1 Mini Warehouse 0.10
5 1 Super Centre 0.07
6 1 Mini Warehouse 0.04
7 1 Super Centre 0.02
8 1 Mini Warehouse 0.02
9 10 Warehouse 0.64
10 10 Mini Warehouse 0.18
11 10 Warehouse 0.13
12 10 Warehouse 0.09
13 10 Super Centre 0.07
14 10 Mini Warehouse 0.03
15 10 Mini Warehouse 0.02
16 10 Super Centre 0.02

Use GroupBy.cumcount (this works because the rows are already sorted by Brand share in descending order within each region):
df['RegionRank'] = df.groupby('Region ID')['Brand share'].cumcount() + 1
cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = df.groupby(cols)['Brand share'].cumcount() + 1
print (df)
Region ID Location store Type ID Brand share RegionRank StoreTypeRank
0 1 Warehouse 1.97 1 1
1 1 Warehouse 0.24 2 2
2 1 Super Centre 0.21 3 1
3 1 Warehouse 0.13 4 3
4 1 Mini Warehouse 0.10 5 1
5 1 Super Centre 0.07 6 2
6 1 Mini Warehouse 0.04 7 2
7 1 Super Centre 0.02 8 3
8 1 Mini Warehouse 0.02 9 3
9 10 Warehouse 0.64 1 1
10 10 Mini Warehouse 0.18 2 1
11 10 Warehouse 0.13 3 2
12 10 Warehouse 0.09 4 3
13 10 Super Centre 0.07 5 1
14 10 Mini Warehouse 0.03 6 2
15 10 Mini Warehouse 0.02 7 3
16 10 Super Centre 0.02 8 2
Or use GroupBy.rank, but note that it returns the same rank for tied values:
df['RegionRank'] = (df.groupby('Region ID')['Brand share']
.rank(method='dense', ascending=False)
.astype(int))
cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = (df.groupby(cols)['Brand share']
.rank(method='dense', ascending=False)
.astype(int))
print (df)
Region ID Location store Type ID Brand share RegionRank StoreTypeRank
0 1 Warehouse 1.97 1 1
1 1 Warehouse 0.24 2 2
2 1 Super Centre 0.21 3 1
3 1 Warehouse 0.13 4 3
4 1 Mini Warehouse 0.10 5 1
5 1 Super Centre 0.07 6 2
6 1 Mini Warehouse 0.04 7 2
7 1 Super Centre 0.02 8 3
8 1 Mini Warehouse 0.02 8 3
9 10 Warehouse 0.64 1 1
10 10 Mini Warehouse 0.18 2 1
11 10 Warehouse 0.13 3 2
12 10 Warehouse 0.09 4 3
13 10 Super Centre 0.07 5 1
14 10 Mini Warehouse 0.03 6 2
15 10 Mini Warehouse 0.02 7 3 <-same value .02
16 10 Super Centre 0.02 7 2 <-same value .02
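Both answers assume the rows are already sorted by Brand share in descending order within each group. A minimal sketch (with tiny made-up data, not the question's) contrasting the two approaches on a tie:

```python
import pandas as pd

# Tiny hypothetical frame with a tie in Brand share
df = pd.DataFrame({'Region ID': [1, 1, 1],
                   'Brand share': [0.5, 0.2, 0.2]})

# cumcount numbers rows positionally, so tied rows get different ranks
df['pos'] = df.groupby('Region ID').cumcount() + 1

# rank(method='dense') gives tied rows the same rank
df['dense'] = (df.groupby('Region ID')['Brand share']
                 .rank(method='dense', ascending=False)
                 .astype(int))
```

Here `df['pos']` is 1, 2, 3 while `df['dense']` is 1, 2, 2: the two 0.2 rows share a dense rank but get distinct positional ranks.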


Select certain columns based on multiple criteria in pandas

I have the following dataset:
my_df = pd.DataFrame({'id':[1,2,3,4,5],
'type':['corp','smb','smb','corp','mid'],
'sales':[34567,2190,1870,22000,10000],
'sales_roi':[.10,.21,.22,.15,.16],
'sales_pct':[.38,.05,.08,.30,.20],
'sales_ln':[4.2,2.1,2.0,4.1,4],
'cost_pct':[22000,1000,900,14000,5000],
'flag':[0,1,0,1,1],
'gibberish':['bla','ble','bla','ble','bla'],
'tech':['lnx','mst','mst','lnx','mc']})
my_df['type'] = pd.Categorical(my_df.type)
my_df
id type sales sales_roi sales_pct sales_ln cost_pct flag gibberish tech
0 1 corp 34567 0.10 0.38 4.2 22000 0 bla lnx
1 2 smb 2190 0.21 0.05 2.1 1000 1 ble mst
2 3 smb 1870 0.22 0.08 2.0 900 0 bla mst
3 4 corp 22000 0.15 0.30 4.1 14000 1 ble lnx
4 5 mid 10000 0.16 0.20 4.0 5000 1 bla mc
And I want to drop all columns that end in "_pct" or "_ln", or that are named "gibberish" or "tech". This is what I have tried:
df_selected = my_df.loc[:, ~my_df.columns.str.endswith('_pct') &
                           ~my_df.columns.str.endswith('_ln') &
                           ~my_df.columns.str.contains('gibberish','tech')]
But it returns me an unwanted column ("tech"):
id type sales sales_roi flag tech
0 1 corp 34567 0.10 0 lnx
1 2 smb 2190 0.21 1 mst
2 3 smb 1870 0.22 0 mst
3 4 corp 22000 0.15 1 lnx
4 5 mid 10000 0.16 1 mc
This is the expected result:
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
Please consider that I have to deal with hundreds of variables and this is just an example of what I need.
Currently, your last condition does not do what you expect: the second positional argument of str.contains is case, so 'tech' is silently consumed as the case flag and never matched. endswith accepts a tuple, so just put all the suffixes and names you are looking for in a single tuple and filter:
my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
I would do it like this:
criterion = ["_pct", "_ln", "gibberish", "tech"]
for column in my_df:
    for criteria in criterion:
        if criteria in column:
            my_df = my_df.drop(column, axis=1)
            break
Of course you can change the if statement to endswith or something of your choice.
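If the list of suffixes and names grows, another option is a single regex passed to str.contains. A sketch on a hypothetical smaller frame with the same kinds of column names (the pattern is an assumption built for this example; the anchors keep suffixes and exact names from matching elsewhere):

```python
import pandas as pd

# Hypothetical smaller frame with the same kinds of column names
my_df = pd.DataFrame({'id': [1, 2], 'sales_pct': [.38, .05],
                      'sales_ln': [4.2, 2.1], 'gibberish': ['bla', 'ble'],
                      'tech': ['lnx', 'mst'], 'flag': [0, 1]})

# One regex: drop columns ending in _pct/_ln or named exactly gibberish/tech
pattern = r'(?:_pct|_ln)$|^(?:gibberish|tech)$'
df_selected = my_df.loc[:, ~my_df.columns.str.contains(pattern)]
```

This keeps only the id and flag columns, matching the expected result above.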

Generate new column based on values in another column and their index

In the df underneath, I want to fill a new column 'cdf_A' based on columns 'A', 'X' and 'cdf_X'. Columns 'X' and 'cdf_X' are connected, so when a value of 'X' appears in column 'A', its 'cdf_X' value should be repositioned to that index of column 'A' in the new column. (Values don't occur twice in a column.)
Example: 'X'=3 at index 0 -> cdf_X=0.05 at index 0 -> '3' appears in column 'A' at index 4 -> cdf_A at index 4 = cdf_X at index 0
Initial df:
A X cdf_X
0 7 3 0.05
1 4 4 0.15
2 11 7 0.27
3 9 9 0.45
4 3 11 0.69
5 13 13 1.00
Desired df:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
Tried code:
import pandas as pd

df = pd.DataFrame({"A": [7, 4, 11, 9, 3, 13],
                   "cdf_X": [0.05, 0.15, 0.27, 0.45, 0.69, 1.00],
                   "X": [3, 4, 7, 9, 11, 13]})
df.loc[:, 'cdf_A'] = df['cdf_X'].where(df['A'] == df['X'])

print(df)
Check with map:
df['cdf_A'] = df.A.map(df.set_index('X')['cdf_X'])
I think you need replace:
df['cdf_A'] = df.A.replace(df.set_index('X')['cdf_X'])
Out[989]:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
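As a self-contained sketch, the map approach applied to the question's data:

```python
import pandas as pd

df = pd.DataFrame({"A": [7, 4, 11, 9, 3, 13],
                   "X": [3, 4, 7, 9, 11, 13],
                   "cdf_X": [0.05, 0.15, 0.27, 0.45, 0.69, 1.00]})

# Build a lookup Series indexed by X, then map every value of A through it
df['cdf_A'] = df['A'].map(df.set_index('X')['cdf_X'])
```

This reproduces the desired cdf_A column: 0.27, 0.15, 0.69, 0.45, 0.05, 1.00.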

Add a column of normalised values based on sections of a dataframe column

I have got a dataframe of several hundred thousand rows. Which is of the following format:
time_elapsed cycle
0 0.00 1
1 0.50 1
2 1.00 1
3 1.30 1
4 1.50 1
5 0.00 2
6 0.75 2
7 1.50 2
8 3.00 2
I want to create a third column giving each row's time_elapsed as a percentage of the maximum time_elapsed in its cycle (i.e. until the next time_elapsed = 0). To give something like:
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
I'm not fussed about the number of decimal places, I've just excluded them for ease here.
I started going along this route, but I keep getting errors.
data['percentage'] = data['time_elapsed'].sub(data.groupby(['cycle'])['time_elapsed'].transform(lambda x: x*100/data['time_elapsed'].max()))
I think it's the lambda function causing errors, but I'm not sure what I should do to change it. Any help is much appreciated :)
Use Series.div for division instead of sub for subtraction; the solution then simplifies. Get only the max per group, multiply by 100 with Series.mul, round with Series.round if necessary, and finally convert to integers with Series.astype:
s = data.groupby(['cycle'])['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)
print (data)
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
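Put together as a runnable sketch on the question's data:

```python
import pandas as pd

data = pd.DataFrame({'time_elapsed': [0.0, 0.5, 1.0, 1.3, 1.5, 0.0, 0.75, 1.5, 3.0],
                     'cycle': [1, 1, 1, 1, 1, 2, 2, 2, 2]})

# Broadcast each cycle's maximum back to its rows, then scale to a percentage
s = data.groupby('cycle')['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)
```

Cycle 1 is scaled by its maximum of 1.5 and cycle 2 by its maximum of 3.0, giving 0, 33, 67, 87, 100, 0, 25, 50, 100.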

How to do a calculation only at some rows of my dataframe?

Let's say I have a dataframe with only two columns and 20 rows, where all values in the first column are equal to 10 and the second column holds random percentage numbers.
Now I want to multiply the first column by the percentage values of the second column plus 1, but only over some interval, and carry the last computed value forward to the following rows.
E.g. I want to do this multiplication operation from row 5 to 10.
The problem is that I don't know how to start and end the calculation at arbitrary spots based on the df's index.
Example input data:
df = pd.DataFrame(np.random.randint(0,10,size=(20, 2)), columns=list('AB'))
df['A'] = 10
df['B'] = df['B'] /100
Which produces:
A B
0 10 0.07
1 10 0.02
2 10 0.05
3 10 0.00
4 10 0.01
5 10 0.09
6 10 0.00
7 10 0.02
8 10 0.03
9 10 0.05
10 10 0.05
11 10 0.03
12 10 0.01
13 10 0.09
14 10 0.06
15 10 0.07
16 10 0.01
17 10 0.01
18 10 0.01
19 10 0.07
The output I would like is where column 'A' goes through a cumulative multiplication only at some rows, like this:
C B
0 10 0.07
1 10 0.02
2 10 0.05
3 10 0.00
4 10 0.01
5 10.90 0.09
6 10.90 0.00
7 11.11 0.02
8 11.45 0.03
9 12.02 0.05
10 12.62 0.05
11 12.62 0.03
12 12.62 0.01
13 12.62 0.09
14 12.62 0.06
15 12.62 0.07
16 12.62 0.01
17 12.62 0.01
18 12.62 0.01
19 12.62 0.07
Thank you!
To get the recursive product you can do the following:
start = 5
end = 10
df['C'] = ((1+df.B)[start:end+1].cumprod().reindex(df.index[:end+1]).fillna(1)*df.A).ffill()
Output:
A B C
0 10 0.07 10.000000
1 10 0.02 10.000000
2 10 0.05 10.000000
3 10 0.00 10.000000
4 10 0.01 10.000000
5 10 0.09 10.900000
6 10 0.00 10.900000
7 10 0.02 11.118000
8 10 0.03 11.451540
9 10 0.05 12.024117
10 10 0.05 12.625323
11 10 0.03 12.625323
12 10 0.01 12.625323
13 10 0.09 12.625323
14 10 0.06 12.625323
15 10 0.07 12.625323
16 10 0.01 12.625323
17 10 0.01 12.625323
18 10 0.01 12.625323
19 10 0.07 12.625323
Explanation:
Calculate the cumulative product of (1 + df.B), which is the factor to multiply df.A by to obtain the recursive product. Do this only over the specified range. reindex and fill the rows before start with 1, so the value remains constant before the range.
Multiply by df.A to get the actual value, forward-filling the values after the specified range.
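A compact, deterministic sketch of the same recipe (hand-picked B values instead of the random ones, and a shorter frame and range, so the result is checkable by hand):

```python
import pandas as pd

# Deterministic stand-in for the random example: constant A, hand-picked B
df = pd.DataFrame({'A': [10] * 8,
                   'B': [0.07, 0.02, 0.05, 0.00, 0.01, 0.10, 0.10, 0.02]})

start, end = 5, 6  # compound only over rows 5..6

factors = (1 + df['B'])[start:end + 1].cumprod()         # 1.10, then 1.21
factors = factors.reindex(df.index[:end + 1]).fillna(1)  # rows before `start` stay at 1
df['C'] = (factors * df['A']).ffill()                    # rows after `end` carry the last value
```

Rows 0 to 4 stay at 10, row 5 becomes 10 * 1.10 = 11.0, row 6 becomes 10 * 1.10 * 1.10 = 12.1, and row 7 keeps 12.1 via the forward fill.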

How to do probit feature engineering from numerical data (cdf and pdf style) on pandas

This question is based on my current understanding (edits with more exact statistical terminology are very welcome). In my assumption, probit is the right terminology. I want to compute probit_pdf and probit_cdf:
probit_pdf is the probability that the variable equals a certain value
probit_cdf is the probability that the variable is less than or equal to that value
Here's my data
Id Value
1 2
2 4
3 2
4 6
5 5
6 4
7 2
8 4
9 2
10 5
To make the question clearer, I give examples for a few Ids.
probit_pdf sample, for Id = 1:
Because the probability of Value = 2 is 0.40 (4 in 10), the probit_pdf is 0.40.
probit_cdf sample, for Id = 5:
Because the probability of Value <= 5 is 0.90 (9 in 10), the probit_cdf is 0.90.
So my expected output is
Id Value probit_pdf probit_cdf
1 2 0.40 0.40
2 4 0.30 0.70
3 2 0.40 0.40
4 6 0.10 1.00
5 5 0.20 0.90
6 4 0.30 0.70
7 2 0.40 0.40
8 4 0.30 0.70
9 2 0.40 0.40
10 5 0.20 0.90
First, for probit_pdf use GroupBy.transform with 'size' and divide by the length of the DataFrame. For probit_cdf, count for each row how many probit_pdf values are greater than or equal to it, then divide the same way:
lens = len(df)
df['probit_pdf'] = df.groupby('Value')['Value'].transform('size').div(lens)
df['probit_cdf'] = df['probit_pdf'].apply(lambda x: df['probit_pdf'].ge(x).sum()).div(lens)
print (df)
Id Value probit_pdf probit_cdf
0 1 2 0.4 0.4
1 2 4 0.3 0.7
2 3 2 0.4 0.4
3 4 6 0.1 1.0
4 5 5 0.2 0.9
5 6 4 0.3 0.7
6 7 2 0.4 0.4
7 8 4 0.3 0.7
8 9 2 0.4 0.4
9 10 5 0.2 0.9
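Note that the probit_cdf line above compares probit_pdf values, which gives the right answer on this data because rarer values also happen to be larger. A sketch that compares Value directly (an alternative formulation, not the answer's code) is more general:

```python
import pandas as pd

df = pd.DataFrame({'Id': range(1, 11),
                   'Value': [2, 4, 2, 6, 5, 4, 2, 4, 2, 5]})

n = len(df)
# Empirical PMF: relative frequency of each distinct value
df['probit_pdf'] = df.groupby('Value')['Value'].transform('size').div(n)
# Empirical CDF: fraction of observations less than or equal to each value
df['probit_cdf'] = df['Value'].apply(lambda x: df['Value'].le(x).mean())
```

On this data both formulations agree with the expected output, e.g. Value = 5 gets probit_pdf 0.2 and probit_cdf 0.9.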
