How to do a calculation only at some rows of my dataframe? - python

Let's say I have a dataframe with only two columns and 20 rows, where all values in the first column are equal to 10, and all values in the second column are random percentage numbers.
Now, I want to multiply the first column by (1 + the percentage value in the second column), but only over some interval of rows, carrying the last computed value forward to the following rows.
E.g. I want to do this multiplication operation from row 5 to 10.
The problem is that I don't know how to start and end the calculation at arbitrary spots based on the df's index.
Example input data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, size=(20, 2)), columns=list('AB'))
df['A'] = 10
df['B'] = df['B'] / 100
Which produces:
A B
0 10 0.07
1 10 0.02
2 10 0.05
3 10 0.00
4 10 0.01
5 10 0.09
6 10 0.00
7 10 0.02
8 10 0.03
9 10 0.05
10 10 0.05
11 10 0.03
12 10 0.01
13 10 0.09
14 10 0.06
15 10 0.07
16 10 0.01
17 10 0.01
18 10 0.01
19 10 0.07
An output I would like to get is one where the first column goes through a cumulative multiplication only at some rows, like this:
C B
0 10 0.07
1 10 0.02
2 10 0.05
3 10 0.00
4 10 0.01
5 10.9 0.09
6 10.9 0.00
7 11.11 0.02
8 11.45 0.03
9 12.02 0.05
10 12.62 0.05
11 12.62 0.03
12 12.62 0.01
13 12.62 0.09
14 12.62 0.06
15 12.62 0.07
16 12.62 0.01
17 12.62 0.01
18 12.62 0.01
19 12.62 0.07
Thank you!

To get the recursive product you can do the following:
start = 5
end = 10
df['C'] = ((1+df.B)[start:end+1].cumprod().reindex(df.index[:end+1]).fillna(1)*df.A).ffill()
Output:
A B C
0 10 0.07 10.000000
1 10 0.02 10.000000
2 10 0.05 10.000000
3 10 0.00 10.000000
4 10 0.01 10.000000
5 10 0.09 10.900000
6 10 0.00 10.900000
7 10 0.02 11.118000
8 10 0.03 11.451540
9 10 0.05 12.024117
10 10 0.05 12.625323
11 10 0.03 12.625323
12 10 0.01 12.625323
13 10 0.09 12.625323
14 10 0.06 12.625323
15 10 0.07 12.625323
16 10 0.01 12.625323
17 10 0.01 12.625323
18 10 0.01 12.625323
19 10 0.07 12.625323
Explanation:
Calculate the cumulative product of (1 + df.B), which is the factor to multiply df.A by to obtain the recursive product. Do this only over the specified range. reindex and fill the rows before start with 1, so the value remains constant before this range.
Multiply by df.A to get the actual value, forward filling values after the range you specify.
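If it helps to see the intermediate pieces, here is the same calculation broken into named steps (just a sketch of the one-liner above, assuming the same start=5 and end=10):
start, end = 5, 10
factor = (1 + df['B']).iloc[start:end + 1].cumprod()   # compounded growth inside the range
factor = factor.reindex(df.index[:end + 1]).fillna(1)  # rows before the range keep a factor of 1
df['C'] = (factor * df['A']).ffill()                   # rows after the range repeat the last value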

Related

How to extract a specific range out of a dataframe and store it in another dataframe and then delete the range out of the original dataframe | pandas

I have some time series of energy consumption, and I can eyeball when someone is on holiday because the consumption stays under a certain range. I have this piece of code to extract said holidays:
dummy data:
import pandas as pd

values = [0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7]
df = pd.DataFrame(values, columns=["values"])
so the df looks like this:
values
0 0.80
1 0.80
2 0.70
3 0.60
4 0.70
5 0.50
6 0.80
7 0.40
8 0.30
9 0.50
10 0.70
11 0.50
12 0.70
13 0.15
14 0.11
15 0.10
16 0.13
17 0.16
18 0.17
19 0.10
20 0.13
21 0.30
22 0.40
23 0.50
24 0.60
25 0.70
Now, given the variables below, I want to detect all runs of consecutive values that stay below value_threshold for at least 5 timesteps:
value_threshold = 0.2
count_threshold = 5
I check which values are under the threshold:
is_under_val_threshold = df["values"] < value_threshold
which gives me this:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 False
22 False
23 False
24 False
25 False
Now I can isolate the values under the threshold:
subset_thre = df.loc[is_under_val_threshold, "values"]
13 0.15
14 0.11
15 0.10
16 0.13
17 0.16
18 0.17
19 0.10
20 0.13
Since this can happen more than once, and not always for 5 or more steps, I put each "sequence" into groups:
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 3
22 3
23 3
24 3
25 3
Now I would like to extract the groups that are under the threshold for 5 or more steps and create new dataframes where the breaks are, so that in this example I end up with three dataframes.
What I tried so far:
Identify where a group switch happens:
identify_switch = thre_grouper.diff().to_frame()
index_of_switch = identify_switch.index[identify_switch['values'] == 1].tolist()
which gives me the index of where the switch happens:
[13, 21]
with this I can for this example at least do the splits as I wish:
holidays_1 = df[index_of_switch[0]:index_of_switch[1]]
split_df_1 = df[:index_of_switch[0]]
split_df_2 = df[index_of_switch[1]:]
My question is: when looping over a series with a variable number of holidays, how can I make sure that I perform all the needed splits?
I have added to your values to give a better idea of how this answer works. The first few rows are under 0.2, but not for 5 or more consecutive steps, so they are not "holidays"; rows 16-18 are the same, and rows 20-24 satisfy the conditions. Therefore the output should be "split_df_1" covering rows 0-19, "holidays_1" rows 20-24, and "split_df_2" rows 25-32.
import pandas as pd
values = [0.1,0.15,0.1,0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.5,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7,0.1,0.15,0.1]
df = pd.DataFrame(values, columns = ["values"])
df
# values
#0 0.10
#1 0.15
#2 0.10
#3 0.80
#4 0.80
#5 0.70
#6 0.60
#7 0.70
#8 0.50
#9 0.80
#10 0.40
#11 0.30
#12 0.50
#13 0.70
#14 0.50
#15 0.70
#16 0.15
#17 0.11
#18 0.10
#19 0.50
#20 0.13
#21 0.16
#22 0.17
#23 0.10
#24 0.13
#25 0.30
#26 0.40
#27 0.50
#28 0.60
#29 0.70
#30 0.10
#31 0.15
#32 0.10
The conditions and other series you created:
# conditions
value_threshold = 0.2
count_threshold = 5
# under value_threshold bool
is_under_val_threshold = df["values"] < value_threshold
# grouped
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()
Calculating the group numbers (in thre_grouper) that satisfy the conditions of values being less than value_threshold and greater than or equal to count_threshold:
# if the first value is less than value_threshold, then start from the first group (index 0)
if df["values"].iloc[0] < value_threshold:
    x = 0
# otherwise start from the second (index 1)
else:
    x = 1

# potential holiday groups are every other group
holidays = thre_grouper[thre_grouper.isin(thre_grouper.unique()[x::2])].value_counts(sort=False)

# get group numbers of those greater than or equal to count_threshold, and add the start of the dataframe and the group above the last
is_holiday = [0] + list(holidays[holidays >= count_threshold].to_frame().index) + [thre_grouper.max() + 1]
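For the sample data above, the only run under the threshold lasting at least count_threshold steps is group 5 (rows 20-24), so this should evaluate to:
print(is_holiday)
# [0, 5, 8]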
Looping through to create dataframes:
# dictionary to add dataframes to
d = {}
for i in range(1, len(is_holiday)):
    # split dataframes are those with group numbers between those in is_holiday list
    d["split_df_" + str(i)] = df.loc[thre_grouper[
        (thre_grouper > is_holiday[i - 1]) &
        (thre_grouper < is_holiday[i])].index]
    # holiday dataframes are those that are in the is_holiday list but not the first or last
    if i not in (0, len(is_holiday) - 1):
        d["holiday_" + str(i)] = df.loc[thre_grouper[
            thre_grouper == is_holiday[i]].index]
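A quick way to sanity-check the result is to loop over the dictionary and print the index range each dataframe covers; for the sample data this should show split_df_1 spanning rows 0-19, holiday_1 rows 20-24 and split_df_2 rows 25-32:
for name, frame in d.items():
    print(name, frame.index.min(), "-", frame.index.max())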

DataFrame Create new column after applying a function on groupby values

I have a dataframe like this. Here is a minimal example:
import numpy as np
import pandas as pd

d = {'Subject': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 3],
     'Pattern': [1, 1, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2],
     'Time': [0.85, 0.92, 1.03, 1.06, 0.89, 0.85, 1.20, 1.03, 1.25, 100.03, 1.97, 0.23, 0.64]}
df = pd.DataFrame(data=d)
Here Subject ranges from 1 to 8 and Pattern from 1 to 3. I want to create a new column where, after grouping by Subject and Pattern, I apply a function that removes outliers from the Time values of each group. Right now I have a solution that works well, but I was wondering if there is a more elegant one, so that I learn how to interact better with DataFrames. Taking the example, it should output:
Subject Pattern Time Time_2
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 # <---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
My current code :
def remove_outliers(arr):
    elements = np.array(arr)
    mean = np.mean(elements)
    sd = np.std(elements)
    return [x if (mean - 2 * sd < x < mean + 2 * sd) else 0 for x in arr]

df_g = df.groupby(['Subject', 'Pattern'])['Time']
times = []
keys = list(df_g.groups.keys())
for i, l in enumerate(df_g.apply(list)):
    times.append((keys[i], remove_outliers(l)))

df['Time_2'] = 0
for k, l in times:
    vals = df[(df['Subject'] == k[0]) & (df['Pattern'] == k[1])].index.values
    df['Time_2'].iloc[vals] = l
Try this -
Use groupby with transform to get the group-wise mean and std for each row.
Use these series to build the check condition from your function: a boolean that is True where a value lies outside mean ± 2*std for its group.
Then use Series.mask to blank out the flagged values and fill them with 0 instead.
grouper = df.groupby(['Subject', 'Pattern'])['Time']
mean = grouper.transform('mean')
std = grouper.transform('std').fillna(0)
check = (df['Time'] < (mean - 2*std)) | (df['Time'] > (mean + 2*std))
df['Time_new'] = df['Time'].mask(check).fillna(0)
print(df)
Subject Pattern Time Time_new
0 1 1 0.85 0.85
1 1 1 0.92 0.92
2 1 2 1.03 1.03
3 1 2 1.06 1.06
4 2 3 0.89 0.89
5 2 3 0.85 0.85
6 3 2 1.20 1.20
7 3 2 1.03 1.03
8 3 2 1.25 1.25
9 3 2 100.03 0.00 #<---
10 3 2 1.97 1.97
11 3 2 0.23 0.23
12 3 2 0.64 0.64
NOTE: Just to add, a 3-std condition is too wide a range for your example. Use 2 std, as above.
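If you would rather keep a reusable function like the one in your question, the same result can also be obtained with a single groupby + transform. This is just a sketch mirroring your remove_outliers (ddof=0 to match np.std; values at most 2 standard deviations from the group mean are kept):
def zero_outliers(s):
    # replace values more than 2 standard deviations from the group mean with 0
    mean, sd = s.mean(), s.std(ddof=0)
    return s.where((s - mean).abs() <= 2 * sd, 0)

df['Time_2'] = df.groupby(['Subject', 'Pattern'])['Time'].transform(zero_outliers)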

Generate new column based on values in another column and their index

In the df underneath, I want to sort the values of column 'cdf_X' based on columns 'A' and 'X'. Columns 'X' and 'cdf_X' are connected, so if a value in 'X' appears in column 'A', the value of 'cdf_X' should be repositioned to that index number of column 'A' in a new column. (Values don't occur twice in the column 'cdf_A'.)
Example: 'X'=3 at index 0 -> cdf_X=0.05 at index 0 -> '3' appears in column 'A' at index 4 -> cdf_A at index 4 = cdf_X at index 0
Initial df:
A X cdf_X
0 7 3 0.05
1 4 4 0.15
2 11 7 0.27
3 9 9 0.45
4 3 11 0.69
5 13 13 1.00
Desired df:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
Tried code:
import pandas as pd
df = pd.DataFrame({"A": [7,4,11,9,3,13],
"cdf_X": [0.05,0.15,0.27,0.45,0.69,1.00],
"X": [3,4,7,9,11,13]})
df.loc[:, 'cdf_A'] = df['cdf_X'].where(df['A'] == df['X'])

print(df)
Check with map
df['cdf_A'] = df.A.map(df.set_index('X')['cdf_X'])
I think you need replace
df['cdf_A'] = df.A.replace(df.set_index('X').cdf_X)
Out[989]:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
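The practical difference between the two: map turns values of A that are missing from the lookup into NaN, while replace leaves them unchanged. A small sketch using the question's column names:
lookup = df.set_index('X')['cdf_X']
df['cdf_A'] = df['A'].map(lookup)        # keys missing from lookup would become NaN
# df['cdf_A'] = df['A'].replace(lookup)  # keys missing from lookup would keep their original value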

Ranking and subranking rows in pandas

I have the following dataset, that I would like to rank by region, and also by store type (within each region).
Is there a slick way of coding these 2 columns in python?
Data:
print (df)
Region ID Location store Type ID Brand share
0 1 Warehouse 1.97
1 1 Warehouse 0.24
2 1 Super Centre 0.21
3 1 Warehouse 0.13
4 1 Mini Warehouse 0.10
5 1 Super Centre 0.07
6 1 Mini Warehouse 0.04
7 1 Super Centre 0.02
8 1 Mini Warehouse 0.02
9 10 Warehouse 0.64
10 10 Mini Warehouse 0.18
11 10 Warehouse 0.13
12 10 Warehouse 0.09
13 10 Super Centre 0.07
14 10 Mini Warehouse 0.03
15 10 Mini Warehouse 0.02
16 10 Super Centre 0.02
Use GroupBy.cumcount:
df['RegionRank'] = df.groupby('Region ID')['Brand share'].cumcount() + 1
cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = df.groupby(cols)['Brand share'].cumcount() + 1
print (df)
Region ID Location store Type ID Brand share RegionRank StoreTypeRank
0 1 Warehouse 1.97 1 1
1 1 Warehouse 0.24 2 2
2 1 Super Centre 0.21 3 1
3 1 Warehouse 0.13 4 3
4 1 Mini Warehouse 0.10 5 1
5 1 Super Centre 0.07 6 2
6 1 Mini Warehouse 0.04 7 2
7 1 Super Centre 0.02 8 3
8 1 Mini Warehouse 0.02 9 3
9 10 Warehouse 0.64 1 1
10 10 Mini Warehouse 0.18 2 1
11 10 Warehouse 0.13 3 2
12 10 Warehouse 0.09 4 3
13 10 Super Centre 0.07 5 1
14 10 Mini Warehouse 0.03 6 2
15 10 Mini Warehouse 0.02 7 3
16 10 Super Centre 0.02 8 2
Or GroupBy.rank, but it returns the same rank for tied values:
df['RegionRank'] = (df.groupby('Region ID')['Brand share']
.rank(method='dense', ascending=False)
.astype(int))
cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = (df.groupby(cols)['Brand share']
.rank(method='dense', ascending=False)
.astype(int))
print (df)
Region ID Location store Type ID Brand share RegionRank StoreTypeRank
0 1 Warehouse 1.97 1 1
1 1 Warehouse 0.24 2 2
2 1 Super Centre 0.21 3 1
3 1 Warehouse 0.13 4 3
4 1 Mini Warehouse 0.10 5 1
5 1 Super Centre 0.07 6 2
6 1 Mini Warehouse 0.04 7 2
7 1 Super Centre 0.02 8 3
8 1 Mini Warehouse 0.02 8 3
9 10 Warehouse 0.64 1 1
10 10 Mini Warehouse 0.18 2 1
11 10 Warehouse 0.13 3 2
12 10 Warehouse 0.09 4 3
13 10 Super Centre 0.07 5 1
14 10 Mini Warehouse 0.03 6 2
15 10 Mini Warehouse 0.02 7 3 <-same value .02
16 10 Super Centre 0.02 7 2 <-same value .02
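Note that cumcount simply numbers rows in their current order, so it only acts as a rank because the data is already sorted by Brand share within each region. If that ordering isn't guaranteed, sorting first should give the same ranking; a sketch using the same column names:
df = df.sort_values(['Region ID', 'Brand share'], ascending=[True, False])
df['RegionRank'] = df.groupby('Region ID').cumcount() + 1
cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = df.groupby(cols).cumcount() + 1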

How to do probit feature engineering from numerical data (cdf and pdf style) on pandas

This question is based on my current understanding (edits toward more exact statistical terminology are very welcome). I assume probit is the right terminology. I want to compute probit_pdf and probit_cdf:
probit_pdf is the probability that the variable equals a certain value
probit_cdf is the probability that the variable is less than or equal to the value
Here's my data
Id Value
1 2
2 4
3 2
4 6
5 5
6 4
7 2
8 4
9 2
10 5
To make the question clearer, here are examples for a few Ids.
probit_pdf sample, for Id = 1:
Here's the expected output: because the probability of Value = 2 is 0.40 (4 in 10), the probit_pdf is 0.40.
probit_cdf sample, for Id = 5:
Because the probability of Value <= 5 is 0.90 (9 in 10), the probit_cdf is 0.90.
So my expected output is
Id Value probit_pdf probit_cdf
1 2 0.40 0.40
2 4 0.30 0.70
3 2 0.40 0.40
4 6 0.10 1.00
5 5 0.20 0.90
6 4 0.30 0.70
7 2 0.40 0.40
8 4 0.30 0.70
9 2 0.40 0.40
10 5 0.20 0.90
First, for probit_pdf, use GroupBy.transform with 'size' and divide by the length of the DataFrame. For probit_cdf, compare each value against all values, sum the matches, and divide the same way:
lens = len(df)
df['probit_pdf'] = df.groupby('Value')['Value'].transform('size').div(lens)
# for each row, count how many values are <= its Value, then divide by the number of rows
df['probit_cdf'] = df['Value'].apply(lambda x: df['Value'].le(x).sum()).div(lens)
print (df)
Id Value probit_pdf probit_cdf
0 1 2 0.4 0.4
1 2 4 0.3 0.7
2 3 2 0.4 0.4
3 4 6 0.1 1.0
4 5 5 0.2 0.9
5 6 4 0.3 0.7
6 7 2 0.4 0.4
7 8 4 0.3 0.7
8 9 2 0.4 0.4
9 10 5 0.2 0.9
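As a side note, the same empirical distribution can be computed directly with value_counts and rank, since a 'max' rank counts how many values are less than or equal to each row's value; a short sketch (not part of the original answer):
df['probit_pdf'] = df['Value'].map(df['Value'].value_counts(normalize=True))
df['probit_cdf'] = df['Value'].rank(method='max').div(len(df))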
