My data frame has 4 columns and looks like the one below.
What I have:
ID start_date end_date active
1,111 6/30/2015 8/6/1904 1 to 10
1,111 6/28/2016 3/30/1905 1 to 10
1,111 7/31/2017 6/6/1905 1 to 10
1,111 7/31/2018 6/6/1905 1 to 9
1,111 5/31/2019 12/4/1904 1 to 9
3,033 3/31/2015 5/18/1908 3 to 7
3,033 3/31/2016 11/24/1905 3 to 7
3,033 3/31/2017 1/20/1906 3 to 7
3,033 3/31/2018 1/8/1906 2 to 7
3,033 4/4/2019 2200,0 2 to 8
I want to generate 10 more columns based on the value of the "active" column, as shown below. Is there a way to populate these efficiently?
What I want to achieve
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1 1 1 1 1 1
1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1 1 1 1 1 1
3,033 3/31/2015 5/18/1908 3 to 7 1 1 1 1 1
3,033 3/31/2016 11/24/1905 3 to 7 1 1 1 1 1
3,033 3/31/2017 1/20/1906 3 to 7 1 1 1 1 1
3,033 3/31/2018 1/8/1906 2 to 7 1 1 1 1 1 1
3,033 4/4/2019 2200,0 2 to 8 1 1 1 1 1 1 1
Use a custom function with np.arange:
import numpy as np
import pandas as pd
def f(x):
    a = list(map(int, x.split(' to ')))
    return pd.Series(1, index=np.arange(a[0], a[1] + 1))
df = df.join(df['active'].apply(f).add_prefix('Type '))
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1.0 1.0 1.0 1.0
1 1,111 6/28/2016 3/30/1905 1 to 10 1.0 1.0 1.0 1.0
2 1,111 7/31/2017 6/6/1905 1 to 10 1.0 1.0 1.0 1.0
3 1,111 7/31/2018 6/6/1905 1 to 9 1.0 1.0 1.0 1.0
4 1,111 5/31/2019 12/4/1904 1 to 9 1.0 1.0 1.0 1.0
5 3,033 3/31/2015 5/18/1908 3 to 7 NaN NaN 1.0 1.0
6 3,033 3/31/2016 11/24/1905 3 to 7 NaN NaN 1.0 1.0
7 3,033 3/31/2017 1/20/1906 3 to 7 NaN NaN 1.0 1.0
8 3,033 3/31/2018 1/8/1906 2 to 7 NaN 1.0 1.0 1.0
9 3,033 4/4/2019 2200,0 2 to 8 NaN 1.0 1.0 1.0
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 NaN
5 1.0 1.0 1.0 NaN NaN NaN
6 1.0 1.0 1.0 NaN NaN NaN
7 1.0 1.0 1.0 NaN NaN NaN
8 1.0 1.0 1.0 NaN NaN NaN
9 1.0 1.0 1.0 1.0 NaN NaN
Similar, but with the missing values filled with 0 and cast to integers:
def f(x):
    a = list(map(int, x.split(' to ')))
    return pd.Series(1, index=np.arange(a[0], a[1] + 1))
df = df.join(df['active'].apply(f).add_prefix('Type ').fillna(0).astype(int))
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1
5 3,033 3/31/2015 5/18/1908 3 to 7 0 0 1 1
6 3,033 3/31/2016 11/24/1905 3 to 7 0 0 1 1
7 3,033 3/31/2017 1/20/1906 3 to 7 0 0 1 1
8 3,033 3/31/2018 1/8/1906 2 to 7 0 1 1 1
9 3,033 4/4/2019 2200,0 2 to 8 0 1 1 1
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 0
4 1 1 1 1 1 0
5 1 1 1 0 0 0
6 1 1 1 0 0 0
7 1 1 1 0 0 0
8 1 1 1 0 0 0
9 1 1 1 1 0 0
Another non-loop solution: the idea is to drop duplicate active values, build indicator columns for the two endpoints with get_dummies, reindex to add the missing columns, and finally fill the range between the endpoints with 1s by multiplying cumulative sums taken from both directions:
df1 = (df.set_index('active', drop=False)
         .pop('active')
         .drop_duplicates()
         .str.get_dummies(' to '))
df1.columns = df1.columns.astype(int)
df1 = df1.reindex(columns=np.arange(df1.columns.min(), df1.columns.max() + 1), fill_value=0)
df1 = (df1.cumsum(axis=1) * df1.iloc[:, ::-1].cumsum(axis=1)).clip(upper=1)
print (df1)
1 2 3 4 5 6 7 8 9 10
active
1 to 10 1 1 1 1 1 1 1 1 1 1
1 to 9 1 1 1 1 1 1 1 1 1 0
3 to 7 0 0 1 1 1 1 1 0 0 0
2 to 7 0 1 1 1 1 1 1 0 0 0
2 to 8 0 1 1 1 1 1 1 1 0 0
df = df.join(df1.add_prefix('Type '), on='active')
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1
5 3,033 3/31/2015 5/18/1908 3 to 7 0 0 1 1
6 3,033 3/31/2016 11/24/1905 3 to 7 0 0 1 1
7 3,033 3/31/2017 1/20/1906 3 to 7 0 0 1 1
8 3,033 3/31/2018 1/8/1906 2 to 7 0 1 1 1
9 3,033 4/4/2019 2200,0 2 to 8 0 1 1 1
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 0
4 1 1 1 1 1 0
5 1 1 1 0 0 0
6 1 1 1 0 0 0
7 1 1 1 0 0 0
8 1 1 1 0 0 0
9 1 1 1 1 0 0
def f(s):
    a, b = map(int, s.split('to'))
    return '|'.join(map(str, range(a, b + 1)))
df.drop('active', axis=1).join(df.active.apply(f).str.get_dummies().add_prefix('Type '))
ID start_date end_date Type 1 Type 10 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8 Type 9
0 1,111 6/30/2015 8/6/1904 1 1 1 1 1 1 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 1 1 1 1 1 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 1 1 1 1 1 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 0 1 1 1 1 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 0 1 1 1 1 1 1 1 1
5 3,033 3/31/2015 5/18/1908 0 0 0 1 1 1 1 1 0 0
6 3,033 3/31/2016 11/24/1905 0 0 0 1 1 1 1 1 0 0
7 3,033 3/31/2017 1/20/1906 0 0 0 1 1 1 1 1 0 0
8 3,033 3/31/2018 1/8/1906 0 0 1 1 1 1 1 1 0 0
9 3,033 4/4/2019 2200,0 0 0 1 1 1 1 1 1 1 0
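One quirk of get_dummies here is that the Type columns come back in lexicographic order, so Type 10 lands between Type 1 and Type 2. A small optional follow-up, assuming the joined result is stored in a variable such as out (a name not used in the original answer), reorders them numerically:
out = df.drop('active', axis=1).join(df.active.apply(f).str.get_dummies().add_prefix('Type '))
# sort the Type columns by their numeric suffix instead of alphabetically
type_cols = sorted((c for c in out.columns if c.startswith('Type ')), key=lambda c: int(c.split()[-1]))
out = out[['ID', 'start_date', 'end_date'] + type_cols]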
Related
I got data describing the number of newly hospitalized persons for specific days and regions.
The number of hospitalized persons is the rolling sum of new hospitalized persons for the last 7 days.
The DataFrame looks like this:
Date Region sum_of_last_7_days
01.01.2020 1 1
02.01.2020 1 2
03.01.2020 1 3
04.01.2020 1 4
05.01.2020 1 5
06.01.2020 1 6
07.01.2020 1 7
08.01.2020 1 7
09.01.2020 1 7
01.01.2020 2 1
02.01.2020 2 2
03.01.2020 2 3
04.01.2020 2 4
05.01.2020 2 5
06.01.2020 2 6
07.01.2020 2 7
08.01.2020 2 7
09.01.2020 2 7
10.01.2020 2 4
The goal output is:
Date Region daily_new
01.01.2020 1 1
02.01.2020 1 1
03.01.2020 1 1
04.01.2020 1 1
05.01.2020 1 1
06.01.2020 1 1
07.01.2020 1 1
08.01.2020 1 0
09.01.2020 1 0
01.01.2020 2 1
02.01.2020 2 1
03.01.2020 2 1
04.01.2020 2 1
05.01.2020 2 1
06.01.2020 2 1
07.01.2020 2 1
08.01.2020 2 0
09.01.2020 2 0
10.01.2020 2 0
The way to do this should be to undo the rolling-sum operation with a 7-day window, but I wasn't able to find any solution.
To get the original daily values, perform a per-region diff and fill the NaN in each group's first row with the original value:
s = df.groupby('Region')['sum_of_last_7_days'].diff()
df['original'] = s.mask(s.isna(), df['sum_of_last_7_days'])
output:
Date Region sum_of_last_7_days original
0 01.01.2020 1 1 1.0
1 02.01.2020 1 2 1.0
2 03.01.2020 1 3 1.0
3 04.01.2020 1 4 1.0
4 05.01.2020 1 5 1.0
5 06.01.2020 1 6 1.0
6 07.01.2020 1 7 1.0
7 08.01.2020 1 7 0.0
8 09.01.2020 1 7 0.0
9 01.01.2020 2 1 1.0
10 02.01.2020 2 2 1.0
11 03.01.2020 2 3 1.0
12 04.01.2020 2 4 1.0
13 05.01.2020 2 5 1.0
14 06.01.2020 2 6 1.0
15 07.01.2020 2 7 1.0
16 08.01.2020 2 7 0.0
17 09.01.2020 2 7 0.0
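Equivalently, the NaN that diff leaves in each region's first row can be filled directly with fillna; this is just a minor variant of the same idea:
df['original'] = (df.groupby('Region')['sum_of_last_7_days'].diff()
                    .fillna(df['sum_of_last_7_days']))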
Below is a script for a simplified version of the df in question:
import pandas as pd
df = pd.DataFrame({
'id' : [1,1,1,1,2,2,2,2,3,3,3,3],
'feature' : ['cd_player', 'sat_nav', 'sub_woofer', 'usb_port','cd_player', 'sat_nav', 'sub_woofer', 'usb_port','cd_player', 'sat_nav', 'sub_woofer', 'usb_port'],
'feature_value' : [1,1,1,0,1,0,0,1,1,1,1,0],
})
df
id feature feature_value
0 1 cd_player 1
1 1 sat_nav 1
2 1 sub_woofer 1
3 1 usb_port 0
4 2 cd_player 1
5 2 sat_nav 0
6 2 sub_woofer 0
7 2 usb_port 1
8 3 cd_player 1
9 3 sat_nav 1
10 3 sub_woofer 1
11 3 usb_port 0
What I would like to do is create a new column that counts the number of 0 values for each feature, as in the df below.
INTENDED DF:
id feature feature_value no_value_count
0 1 cd_player 1 0
1 1 sat_nav 1 1
2 1 sub_woofer 1 1
3 1 usb_port 0 2
4 2 cd_player 1 0
5 2 sat_nav 0 1
6 2 sub_woofer 0 1
7 2 usb_port 1 2
8 3 cd_player 1 0
9 3 sat_nav 1 1
10 3 sub_woofer 1 1
11 3 usb_port 0 2
Any help would be greatly appreciated.
IIUC:
df["count"] = df["id"].nunique() - df.groupby("feature")["feature_value"].transform("sum")
print (df)
id feature feature_value count
0 1 cd_player 1 0
1 1 sat_nav 1 1
2 1 sub_woofer 1 1
3 1 usb_port 0 2
4 2 cd_player 1 0
5 2 sat_nav 0 1
6 2 sub_woofer 0 1
7 2 usb_port 1 2
8 3 cd_player 1 0
9 3 sat_nav 1 1
10 3 sub_woofer 1 1
11 3 usb_port 0 2
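This works because feature_value is 0/1 and every id has exactly one row per feature, so the number of ids missing a feature is the total number of ids minus the sum of feature_value for that feature. A quick check of that assumption on the sample df (an added sanity check, not part of the original answer):
assert df.groupby('feature')['id'].nunique().eq(df['id'].nunique()).all()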
You can map the feature column to the result of a groupby.sum by feature, applied to the indicator of feature_value being equal (eq) to 0:
df['no_value_count'] = df['feature'].map(df['feature_value'].eq(0)
.groupby(df['feature']).sum())
print(df)
id feature feature_value no_value_count
0 1 cd_player 1 0
1 1 sat_nav 1 1
2 1 sub_woofer 1 1
3 1 usb_port 0 2
4 2 cd_player 1 0
5 2 sat_nav 0 1
6 2 sub_woofer 0 1
7 2 usb_port 1 2
8 3 cd_player 1 0
9 3 sat_nav 1 1
10 3 sub_woofer 1 1
11 3 usb_port 0 2
From what I understand, you can try:
df['feature_value'].eq(0).groupby(df['feature']).transform('sum')
0 0.0
1 1.0
2 1.0
3 2.0
4 0.0
5 1.0
6 1.0
7 2.0
8 0.0
9 1.0
10 1.0
11 2.0
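To store that result in the dataframe as in the intended output, a small follow-up assignment (the transform returns floats here, so an astype(int) is added):
df['no_value_count'] = (df['feature_value'].eq(0)
                        .groupby(df['feature'])
                        .transform('sum')
                        .astype(int))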
Task
I have a df where I compute some ratios grouped by date and id. I want to fill column c with NaN if the sum of a and b is 0. Any help would be awesome!
df
date id a b c
0 2001-09-06 1 3 1 1
1 2001-09-07 1 3 1 1
2 2001-09-08 1 4 0 1
3 2001-09-09 2 6 0 1
4 2001-09-10 2 0 0 2
5 2001-09-11 1 0 0 2
6 2001-09-12 2 1 1 2
7 2001-09-13 2 0 0 2
8 2001-09-14 1 0 0 2
Try this:
df['new_c'] = df.c.where(df[['a','b']].sum(axis=1).ne(0))
Out[75]:
date id a b c new_c
0 2001-09-06 1 3 1 1 1.0
1 2001-09-07 1 3 1 1 1.0
2 2001-09-08 1 4 0 1 1.0
3 2001-09-09 2 6 0 1 1.0
4 2001-09-10 2 0 0 2 NaN
5 2001-09-11 1 0 0 2 NaN
6 2001-09-12 2 1 1 2 2.0
7 2001-09-13 2 0 0 2 NaN
8 2001-09-14 1 0 0 2 NaN
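If you want to overwrite c itself rather than add a new column, as the question asks, the same where call can simply be assigned back:
df['c'] = df['c'].where(df[['a', 'b']].sum(axis=1).ne(0))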
It is better to build a new dataframe with the same shape, and then do the following:
import numpy as np
new_df = df.copy()
for i, line in df.iterrows():
    new_df.loc[i, ['date', 'a', 'b']] = line[['date', 'a', 'b']]
    if line['a'] + line['b'] == 0:
        new_df.loc[i, 'c'] = np.nan
I have a series of values (Pandas DF or Numpy Arr):
vals = [0,1,3,4,5,5,4,2,1,0,-1,-2,-3,-2,3,5,8,4,2,0,-1,-3,-8,-20,-10,-5,-2,-1,0,1,2,3,5,6,8,4,3]
df = pd.DataFrame({'val': vals})
I want to classify/group the values into 4 categories:
Increasing above 0
Decreasing above 0
Increasing below 0
Decreasing below 0
My current approach with Pandas is to categorize the values into above/below 0, and then into increasing/decreasing by checking whether the diff values are above/below 0.
df['above_zero'] = np.where(df['val'] >= 0, 1, 0)
df['below_zero'] = np.where(df['val'] < 0, 1, 0)
df['diffs'] = df['val'].diff()
df['diff_above_zero'] = np.where(df['diffs'] >= 0, 1, 0)
df['diff_below_zero'] = np.where(df['diffs'] < 0, 1, 0)
This produces the desired output, but now I am trying to find a way to combine these columns into an ascending group number that increments as soon as one of the 4 conditions changes.
Desired output would look like this (the group column is typed manually and might contain errors compared to calculated values):
id val above_zero below_zero diffs diff_above_zero diff_below_zero group
0 0 1 0 0.0 1 0 0
1 1 1 0 1.0 1 0 0
2 3 1 0 2.0 1 0 0
3 4 1 0 1.0 1 0 0
4 5 1 0 1.0 1 0 0
5 5 1 0 0.0 1 0 0
6 4 1 0 -1.0 0 1 1
7 2 1 0 -2.0 0 1 1
8 1 1 0 -1.0 0 1 1
9 0 1 0 -1.0 0 1 1
10 -1 0 1 -1.0 0 1 2
11 -2 0 1 -1.0 0 1 2
12 -3 0 1 -1.0 0 1 2
13 -2 0 1 1.0 1 0 3
14 3 1 0 5.0 1 0 4
15 5 1 0 2.0 1 0 4
16 8 1 0 3.0 1 0 4
17 4 1 0 -4.0 0 1 5
18 2 1 0 -2.0 0 1 5
19 0 1 0 -2.0 0 1 5
20 -1 0 1 -1.0 0 1 6
21 -3 0 1 -2.0 0 1 6
22 -8 0 1 -5.0 0 1 6
23 -20 0 1 -12.0 0 1 6
24 -10 0 1 10.0 1 0 7
25 -5 0 1 5.0 1 0 7
26 -2 0 1 3.0 1 0 7
27 -1 0 1 1.0 1 0 7
28 0 1 0 1.0 1 0 8
29 1 1 0 1.0 1 0 8
30 2 1 0 1.0 1 0 8
31 3 1 0 1.0 1 0 8
32 5 1 0 2.0 1 0 8
33 6 1 0 1.0 1 0 8
34 8 1 0 2.0 1 0 8
35 4 1 0 -4.0 0 1 9
36 3 1 0 -1.0 0 1 9
Would appreciate any help on how to solve this efficiently. Thanks!
Setup
g1 = ['above_zero', 'below_zero', 'diff_above_zero', 'diff_below_zero']
You can simply index all of your boolean columns, and use shift:
c = df.loc[:, g1]
(c != c.shift().fillna(c)).any(axis=1).cumsum()
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 1
8 1
9 1
10 2
11 2
12 2
13 3
14 4
15 4
16 4
17 5
18 5
19 5
20 6
21 6
22 6
23 6
24 7
25 7
26 7
27 7
28 8
29 8
30 8
31 8
32 8
33 8
34 8
35 9
36 9
dtype: int32
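A small follow-up: to store the result as the group column from the desired output, assign it back to the frame:
c = df[g1]
df['group'] = (c != c.shift().fillna(c)).any(axis=1).cumsum()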
The following code will produce two columns: c1 and c2.
The values of c1 correspond to the following 4 categories:
0 means below zero and increasing
1 means above zero and increasing
2 means below zero and decreasing
3 means above zero and decreasing
And c2 corresponds to the ascending group number that increments as soon as the condition (i.e. c1) changes, as you wanted. Credit to #user3483203 for the shift with cumsum idea.
# calculate difference
df["diff"] = df['val'].diff()
# set first value in column 'diff' to 0 (as previous step sets it to NaN)
df.loc[0, 'diff'] = 0
df["c1"] = (df['val'] >= 0).astype(int) + (df["diff"] < 0).astype(int) * 2
df["c2"] = (df["c1"] != df["c1"].shift().fillna(df["c1"])).astype(int).cumsum()
Result:
val diff c1 c2
0 0 0.0 1 0
1 1 1.0 1 0
2 3 2.0 1 0
3 4 1.0 1 0
4 5 1.0 1 0
5 5 0.0 1 0
6 4 -1.0 3 1
7 2 -2.0 3 1
8 1 -1.0 3 1
9 0 -1.0 3 1
10 -1 -1.0 2 2
11 -2 -1.0 2 2
12 -3 -1.0 2 2
13 -2 1.0 0 3
14 3 5.0 1 4
15 5 2.0 1 4
16 8 3.0 1 4
17 4 -4.0 3 5
18 2 -2.0 3 5
19 0 -2.0 3 5
20 -1 -1.0 2 6
21 -3 -2.0 2 6
22 -8 -5.0 2 6
23 -20 -12.0 2 6
24 -10 10.0 0 7
25 -5 5.0 0 7
26 -2 3.0 0 7
27 -1 1.0 0 7
28 0 1.0 1 8
29 1 1.0 1 8
30 2 1.0 1 8
31 3 1.0 1 8
32 5 2.0 1 8
33 6 1.0 1 8
34 8 2.0 1 8
35 4 -4.0 3 9
36 3 -1.0 3 9
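If descriptive labels are preferred over the 0-3 codes in c1, a small optional mapping using the category names listed above:
labels = {0: 'increasing below 0', 1: 'increasing above 0',
          2: 'decreasing below 0', 3: 'decreasing above 0'}
df['category'] = df['c1'].map(labels)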
Currently I have the following dataframe, where F1-F4 are segment indicators:
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 NaN 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 NaN 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 NaN 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 NaN 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 NaN 0 0 1 0
20:45 1 3 5 7 NaN 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
What is the best approach to achieve the next dataset after manipulations such as the following?
E(06:15) = MEAN( AVG[E(06:00-06:30)], AVG[06:15(A-E)] ) #F1==1
E(20:45) = MEAN( AVG[E(20:45-21:00)], AVG[20:45(A-E)] ) #F4==1
A B C D E F1 F2 F3 F4
06:00 2 4 6 8 1 1 0 0 0
06:15 3 5 7 9 [X0] 1 0 0 0
06:30 4 6 8 7 3 1 0 0 0
06:45 1 3 5 7 [X1] 1 0 0 0
07:00 2 4 6 8 6 0 1 0 0
07:15 4 4 8 8 [X2] 0 1 0 0
---------------------------------------------
20:00 2 4 6 8 [X3] 0 0 1 0
20:15 1 2 3 4 5 0 0 1 0
20:30 8 1 5 9 [X4] 0 0 1 0
20:45 1 3 5 7 [X5] 0 0 0 1
21:00 5 4 6 5 6 0 0 0 1
I was trying to use an idea like the one below, but without success so far:
In[89]: df.groupby(['F1', 'F2', 'F3', 'F4'], as_index=False).median()
Out[89]:
F1 F2 F3 F4 A B C D E
0 0 0 0 1 2.0 3.0 2.0 2.0 0.0
1 0 0 1 0 1.5 2.0 3.0 3.5 1.0
2 0 1 0 0 6.0 7.0 6.0 7.0 9.0
3 1 0 0 0 3.0 4.0 3.0 4.0 4.0
and now I am struggling with accessing the value E == 0.0 via the key F4 == 1.
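One way to get at that value is to keep the flag columns in the index (i.e. drop as_index=False) so each flag combination becomes a MultiIndex key; a minimal sketch of just that lookup step, not a full solution to the interpolation above:
seg = df.groupby(['F1', 'F2', 'F3', 'F4']).median()
# E for the F4 == 1 segment corresponds to the key (0, 0, 0, 1)
e_f4 = seg.loc[(0, 0, 0, 1), 'E']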