Generate new column based on values in another column and their index - python

In the df underneath, I want to sort the values of column 'cdf_X' based on columns 'A' and 'X'. Columns 'X' and 'cdf_X' are connected, so if a value of 'X' appears in column 'A', the corresponding value of 'cdf_X' should be repositioned to the index of that value in column 'A', in a new column. (Values don't occur twice within a column.)
Example: 'X'=3 at index 0 -> cdf_X=0.05 at index 0 -> '3' appears in column 'A' at index 4 -> cdf_A at index 4 = cdf_X at index 0
Initial df:
A X cdf_X
0 7 3 0.05
1 4 4 0.15
2 11 7 0.27
3 9 9 0.45
4 3 11 0.69
5 13 13 1.00
Desired df:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
Tried code:
import pandas as pd
df = pd.DataFrame({"A": [7,4,11,9,3,13],
"cdf_X": [0.05,0.15,0.27,0.45,0.69,1.00],
"X": [3,4,7,9,11,13]})
df.loc[:, 'cdf_A'] = df['cdf_X'].where(df['A'] == df['X'])

print(df)

Check with map:
df['cdf_A'] = df.A.map(df.set_index('X')['cdf_X'])

I think you need replace:
df['cdf_A'] = df.A.replace(df.set_index('X').cdf_X)
Out[989]:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
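For reference, here is a minimal end-to-end sketch of the map approach, assuming the column is named cdf_X as in the question's setup:

import pandas as pd

df = pd.DataFrame({"A": [7, 4, 11, 9, 3, 13],
                   "X": [3, 4, 7, 9, 11, 13],
                   "cdf_X": [0.05, 0.15, 0.27, 0.45, 0.69, 1.00]})

# build a lookup Series mapping X -> cdf_X, then look up every value of A in it
lookup = df.set_index('X')['cdf_X']
df['cdf_A'] = df['A'].map(lookup)

print(df)
#     A   X  cdf_X  cdf_A
# 0   7   3   0.05   0.27
# 1   4   4   0.15   0.15
# 2  11   7   0.27   0.69
# 3   9   9   0.45   0.45
# 4   3  11   0.69   0.05
# 5  13  13   1.00   1.00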

How to extract a specific range out of a dataframe and store it in another dataframe and then delete the range out of the original dataframe | pandas

I have some time series of energy consumption, and I can eyeball when someone is on holiday because the consumption stays below a certain range. I have this piece of code to extract those holidays:
dummy data:
values = [0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7]
df = pd.DataFrame(values, columns = ["values"])
so the df looks like this:
values
0 0.80
1 0.80
2 0.70
3 0.60
4 0.70
5 0.50
6 0.80
7 0.40
8 0.30
9 0.50
10 0.70
11 0.50
12 0.70
13 0.15
14 0.11
15 0.10
16 0.13
17 0.16
18 0.17
19 0.10
20 0.13
21 0.30
22 0.40
23 0.50
24 0.60
25 0.70
Now, given these variables, I want to detect all runs of consecutive values that are smaller than value_threshold for at least 5 timesteps:
value_threshold = 0.2
count_threshold = 5
I check which values are under the threshold:
is_under_val_threshold = df["values"] < value_threshold
which gives me this:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 True
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 False
22 False
23 False
24 False
25 False
Now I can isolate the values under the threshold:
subset_thre = df.loc[is_under_val_threshold, "values"]
13 0.15
14 0.11
15 0.10
16 0.13
17 0.16
18 0.17
19 0.10
20 0.13
Since this can happen more than once, and not always for 5 or more steps, I put each "sequence" into its own group:
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
11 1
12 1
13 2
14 2
15 2
16 2
17 2
18 2
19 2
20 2
21 3
22 3
23 3
24 3
25 3
Now I would like to extract the groups that are under the threshold for at least 5 steps and split the dataframe at those breaks, so that in this example I end up with three dataframes.
What I tried so far:
Identify where a group switch happens:
identify_switch = thre_grouper.diff().to_frame()
index_of_switch = identify_switch.index[identify_switch['values'] == 1].tolist()
which gives me the index of where the switch happens:
[13, 21]
With this, at least for this example, I can do the splits as I want:
holidays_1 = df[index_of_switch[0]:index_of_switch[1]]
split_df_1 = df[:index_of_switch[0]]
split_df_2 = df[index_of_switch[1]:]
My question is: when looping over a series with a variable number of holiday periods, how do I make sure that all the needed splits are made?
I have added to your values to give a better idea of how this answer works. The first few rows are under 0.2 but not for 5 or more consecutive steps, so they are not "holidays"; rows 16-18 are the same; rows 20-24 satisfy the conditions. Therefore the output should be "split_df_1" (rows 0-19), "holiday_1" (rows 20-24) and "split_df_2" (rows 25-32).
import pandas as pd
values = [0.1,0.15,0.1,0.8,0.8,0.7,0.6,0.7,0.5,0.8,0.4,0.3,0.5,0.7,0.5,0.7,0.15,0.11,0.1,0.5,0.13,0.16,0.17,0.1,0.13,0.3,0.4,0.5,0.6,0.7,0.1,0.15,0.1]
df = pd.DataFrame(values, columns = ["values"])
df
# values
#0 0.10
#1 0.15
#2 0.10
#3 0.80
#4 0.80
#5 0.70
#6 0.60
#7 0.70
#8 0.50
#9 0.80
#10 0.40
#11 0.30
#12 0.50
#13 0.70
#14 0.50
#15 0.70
#16 0.15
#17 0.11
#18 0.10
#19 0.50
#20 0.13
#21 0.16
#22 0.17
#23 0.10
#24 0.13
#25 0.30
#26 0.40
#27 0.50
#28 0.60
#29 0.70
#30 0.10
#31 0.15
#32 0.10
The conditions and other series you created:
# conditions
value_threshold = 0.2
count_threshold = 5
# under value_threshold bool
is_under_val_threshold = df["values"] < value_threshold
# grouped
thre_grouper = is_under_val_threshold.diff().ne(0).cumsum()
Calculate the group numbers (in thre_grouper) that satisfy the conditions: values below value_threshold for at least count_threshold rows:
# if the first value is less than value_threshold, then start from the first group (index 0)
if df["values"].iloc[0] < value_threshold:
    x = 0
# otherwise start from the second (index 1)
else:
    x = 1
# potential holiday groups are every other group
holidays = thre_grouper[thre_grouper.isin(thre_grouper.unique()[x::2])].value_counts(sort=False)
# keep the group numbers with at least count_threshold rows, and add the start of the dataframe and a group above the last
is_holiday = [0] + list(holidays[holidays >= count_threshold].to_frame().index) + [thre_grouper.max() + 1]
Looping through to create dataframes:
# dictionary to add the dataframes to
d = {}
for i in range(1, len(is_holiday)):
    # split dataframes are those with group numbers between consecutive entries in the is_holiday list
    d["split_df_" + str(i)] = df.loc[thre_grouper[
        (thre_grouper > is_holiday[i - 1]) &
        (thre_grouper < is_holiday[i])].index]
    # holiday dataframes are those whose group number is in the is_holiday list, excluding its first and last entries
    if i not in (0, len(is_holiday) - 1):
        d["holiday_" + str(i)] = df.loc[thre_grouper[
            thre_grouper == is_holiday[i]].index]
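As a quick check on the example above (a sketch reusing the d, df and thre_grouper from this answer), printing the index range of each resulting dataframe should match the expected splits: split_df_1 covering rows 0-19, holiday_1 rows 20-24 and split_df_2 rows 25-32.

# inspect the resulting dictionary of dataframes
for name, frame in d.items():
    print(name, frame.index.min(), "-", frame.index.max())
# split_df_1 0 - 19
# holiday_1 20 - 24
# split_df_2 25 - 32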

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

I have a pandas dataframe as follows:
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
polyid value
0 1 0.56
1 1 0.59
2 1 0.62
3 1 0.83
4 2 0.85
5 2 0.01
6 2 0.79
7 3 0.37
8 3 0.99
9 3 0.48
10 3 0.55
11 3 0.06
I need to reclassify the 'value' column separately for each 'polyid'. For the reclassification, I have two dictionaries. One holds the bin edges that define how I want to cut the 'values' of each 'polyid':
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
And one with the ids with which I want to label the resulting bins:
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
I tried to get this answer to work for my use case. I could only come up with applying pd.cut on each 'polyid' subset and then using pd.concat to put all subsets back into one dataframe:
import pandas as pd

def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
    df_lst = []
    for key in df[bin_key_col].unique():
        bins = bins_dic[key]
        names = names_dic[key]
        sub_df = df[df[bin_key_col] == key]
        sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
        df_lst.append(sub_df)
    return pd.concat(df_lst)
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
df = reclass_df_dic(df, bins_dic, ids_dic, 'polyid', 'value', 'id')
This results in my desired output:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
However, the line:
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
raises the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
which I am unable to resolve using .loc. Also, I guess there is generally a more efficient way of doing this without having to loop over each category?
A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
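As a side note on the SettingWithCopyWarning from the original loop: it can be avoided by taking an explicit copy of the subset and cutting the subset's own column instead of the full column. A minimal sketch of the two lines that would change inside reclass_df_dic, assuming the rest of the function stays as written:

# take an explicit copy so the assignment is not made on a slice of df
sub_df = df[df[bin_key_col] == key].copy()
# cut only the subset's values instead of the whole column
sub_df[name_col] = pd.cut(sub_df[val_col], bins, labels=names)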

How to do a calculation only at some rows of my dataframe?

Let's say I have a dataframe with only two columns and 20 rows, where all values in the first column are equal to 10 and all values in the second column are random percentage numbers.
Now I want to cumulatively multiply the first column by the percentage values of the second column plus 1, but only over some interval, and carry the last computed value forward to the following rows.
E.g. I want to do this multiplication operation from row 5 to 10.
The problem is that I don't know how to start and end the calculation at arbitrary spots based on the df's index.
Example input data:
df = pd.DataFrame(np.random.randint(0,10,size=(20, 2)), columns=list('AB'))
df['A'] = 10
df['B'] = df['B'] /100
Which produces:
A B
0 10 0.07
1 10 0.02
2 10 0.05
3 10 0.00
4 10 0.01
5 10 0.09
6 10 0.00
7 10 0.02
8 10 0.03
9 10 0.05
10 10 0.05
11 10 0.03
12 10 0.01
13 10 0.09
14 10 0.06
15 10 0.07
16 10 0.01
17 10 0.01
18 10 0.01
19 10 0.07
The output I would like to get is one where the first column goes through a cumulative multiplication only at some rows, like this:
C B
0 10 0.07
1 10 0.02
2 10 0.05
3 10 0.00
4 10 0.01
5 10.9 0.09
6 10.9 0.00
7 11.11 0.02
8 11.45 0.03
9 12.02 0.05
10 12.62 0.05
11 12.62 0.03
12 12.62 0.01
13 12.62 0.09
14 12.62 0.06
15 12.62 0.07
16 12.62 0.01
17 12.62 0.01
18 12.62 0.01
19 12.62 0.07
Thank you!
To get the recursive product you can do the following:
start = 5
end = 10
df['C'] = ((1+df.B)[start:end+1].cumprod().reindex(df.index[:end+1]).fillna(1)*df.A).ffill()
Output:
A B C
0 10 0.07 10.000000
1 10 0.02 10.000000
2 10 0.05 10.000000
3 10 0.00 10.000000
4 10 0.01 10.000000
5 10 0.09 10.900000
6 10 0.00 10.900000
7 10 0.02 11.118000
8 10 0.03 11.451540
9 10 0.05 12.024117
10 10 0.05 12.625323
11 10 0.03 12.625323
12 10 0.01 12.625323
13 10 0.09 12.625323
14 10 0.06 12.625323
15 10 0.07 12.625323
16 10 0.01 12.625323
17 10 0.01 12.625323
18 10 0.01 12.625323
19 10 0.07 12.625323
Explanation:
Calculate the cumulative product of (1 + df.B), which is the factor to multiply by df.A to obtain the recursive product. Do this only over the range specified. reindex and fill the rows before start with 1, so the value remains constant before this range.
Multiply by df.A to get the actual value, forward filling values after the range you specify.
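For clarity, here is the same computation broken into named steps, a sketch using the same df, start and end as above:

start, end = 5, 10

# cumulative growth factor over the chosen rows only
factor = (1 + df['B']).iloc[start:end + 1].cumprod()

# extend back to the start of the frame, with a factor of 1 before the interval
factor = factor.reindex(df.index[:end + 1]).fillna(1)

# apply the factor to column A; rows after the interval are NaN at this point
df['C'] = factor * df['A']

# carry the last computed value forward past the end of the interval
df['C'] = df['C'].ffill()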

How to do probit feature engineering from numerical data (cdf and pdf style) in pandas

This question is based on my current understanding (edits for more exact statistical terminology are very welcome). In my assumption, probit is the right terminology. I want to compute probit_pdf and probit_cdf:
probit_pdf is the probability that the variable equals a certain value
probit_cdf is the probability that the variable is less than or equal to that value
Here's my data
Id Value
1 2
2 4
3 2
4 6
5 5
6 4
7 2
8 4
9 2
10 5
To make the question clearer, here are examples for a few Ids.
probit_pdf example, for Id = 1:
The probability of Value = 2 is 0.40 (4 in 10), so probit_pdf is 0.40.
probit_cdf example, for Id = 5:
Because the probability of Value <= 5 is 0.90 (9 in 10), probit_cdf is 0.90.
So my expected output is
Id Value probit_pdf probit_cdf
1 2 0.40 0.40
2 4 0.30 0.70
3 2 0.40 0.40
4 6 0.10 1.00
5 5 0.20 0.90
6 4 0.30 0.70
7 2 0.40 0.40
8 4 0.30 0.70
9 2 0.40 0.40
10 5 0.20 0.90
First, for probit_pdf, use GroupBy.transform with 'size' and divide by the length of the DataFrame; for probit_cdf, compare each value against all values, take the sums and divide the same way:
lens = len(df)
df['probit_pdf'] = df.groupby('Value')['Value'].transform('size').div(lens)
df['probit_cdf'] = df['probit_pdf'].apply(lambda x: df['probit_pdf'].ge(x).sum()).div(lens)
print (df)
Id Value probit_pdf probit_cdf
0 1 2 0.4 0.4
1 2 4 0.3 0.7
2 3 2 0.4 0.4
3 4 6 0.1 1.0
4 5 5 0.2 0.9
5 6 4 0.3 0.7
6 7 2 0.4 0.4
7 8 4 0.3 0.7
8 9 2 0.4 0.4
9 10 5 0.2 0.9
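A more direct way to obtain the same probit_cdf is to compare the Value column itself rather than the pdf column; a small sketch reusing df and lens from above:

# empirical cdf: for each row, the share of rows whose Value is less than or equal to this row's Value
df['probit_cdf'] = df['Value'].apply(lambda x: df['Value'].le(x).sum()).div(lens)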

Concat two DataFrames on missing indices

I have two DataFrames and want to use the second one only on the rows whose index is not already contained in the first one.
What is the most efficient way to do this?
Example:
df_1
idx val
0 0.32
1 0.54
4 0.26
5 0.76
7 0.23
df_2
idx val
1 10.24
2 10.90
3 10.66
4 10.25
6 10.13
7 10.52
df_final
idx val
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
Recap: I need to add the rows in df_2 for which the index is not already in df_1.
EDIT
Removed some indices in df_2 to illustrate the fact that all indices from df_1 are not covered in df_2.
You can use reindex with combine_first or fillna:
df = df_1.reindex(df_1.index.union(df_2.index)).combine_first(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
df = df_1.reindex(df_1.index.union(df_2.index)).fillna(df_2)
print (df)
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
You can achieve the wanted output by using the combine_first method of the DataFrame. From the documentation of the method:
Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Example usage:
import pandas as pd
df_1 = pd.DataFrame([0.32,0.54,0.26,0.76,0.23], columns=['val'], index=[0,1,4,5,7])
df_1.index.name = 'idx'
df_2 = pd.DataFrame([10.24,10.90,10.66,10.25,10.13,10.52], columns=['val'], index=[1,2,3,4,6,7])
df_2.index.name = 'idx'
df_final = df_1.combine_first(df_2)
This will give the desired result:
In [7]: df_final
Out[7]:
val
idx
0 0.32
1 0.54
2 10.90
3 10.66
4 0.26
5 0.76
6 10.13
7 0.23
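The recap ("add the rows in df_2 for which the index is not already in df_1") can also be written directly as a boolean index filter; a small sketch reusing the df_1 and df_2 defined above:

# keep df_1 as-is and append only the df_2 rows whose index df_1 does not have
df_final = pd.concat([df_1, df_2[~df_2.index.isin(df_1.index)]]).sort_index()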
