Get original values from rolling sum in Pandas DataFrame - python

I have data describing the number of newly hospitalized persons for specific days and regions.
The number of hospitalized persons is the rolling sum of newly hospitalized persons over the last 7 days.
The DataFrame looks like this:
Date Region sum_of_last_7_days
01.01.2020 1 1
02.01.2020 1 2
03.01.2020 1 3
04.01.2020 1 4
05.01.2020 1 5
06.01.2020 1 6
07.01.2020 1 7
08.01.2020 1 7
09.01.2020 1 7
01.01.2020 2 1
02.01.2020 2 2
03.01.2020 2 3
04.01.2020 2 4
05.01.2020 2 5
06.01.2020 2 6
07.01.2020 2 7
08.01.2020 2 7
09.01.2020 2 7
10.01.2020 2 4
The goal output is:
Date Region daily_new
01.01.2020 1 1
02.01.2020 1 1
03.01.2020 1 1
04.01.2020 1 1
05.01.2020 1 1
06.01.2020 1 1
07.01.2020 1 1
08.01.2020 1 0
09.01.2020 1 0
01.01.2020 2 1
02.01.2020 2 1
03.01.2020 2 1
04.01.2020 2 1
05.01.2020 2 1
06.01.2020 2 1
07.01.2020 2 1
08.01.2020 2 0
09.01.2020 2 0
10.01.2020 2 0
The way to do this should be to undo the rolling-sum operation with a 7-day window, but I wasn't able to find any solution.

To get the original values, take a diff per region and fill the first value of each group:
# difference with the previous row, per region (first row of each group becomes NaN)
s = df.groupby('Region')['sum_of_last_7_days'].diff()
# where the diff is NaN (first row per group), fall back to the sum itself
df['original'] = s.mask(s.isna(), df['sum_of_last_7_days'])
output:
Date Region sum_of_last_7_days original
0 01.01.2020 1 1 1.0
1 02.01.2020 1 2 1.0
2 03.01.2020 1 3 1.0
3 04.01.2020 1 4 1.0
4 05.01.2020 1 5 1.0
5 06.01.2020 1 6 1.0
6 07.01.2020 1 7 1.0
7 08.01.2020 1 7 0.0
8 09.01.2020 1 7 0.0
9 01.01.2020 2 1 1.0
10 02.01.2020 2 2 1.0
11 03.01.2020 2 3 1.0
12 04.01.2020 2 4 1.0
13 05.01.2020 2 5 1.0
14 06.01.2020 2 6 1.0
15 07.01.2020 2 7 1.0
16 08.01.2020 2 7 0.0
17 09.01.2020 2 7 0.0
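Note that if the column is an exact 7-day rolling sum, a plain diff yields new[t] - new[t-7] rather than new[t]; an exact inversion needs the recurrence new[t] = sum[t] - sum[t-1] + new[t-7]. A minimal sketch of that inversion (the helper name invert_rolling_sum is mine), assuming the sum was built with min_periods=1 and no data exists before the first row of each region:
import pandas as pd

def invert_rolling_sum(s, window=7):
    # recover x from s = x.rolling(window, min_periods=1).sum()
    out = []
    prev = 0
    for t, val in enumerate(s.tolist()):
        x = val - prev
        if t >= window:
            x += out[t - window]  # add back the value that dropped out of the window
        out.append(x)
        prev = val
    return pd.Series(out, index=s.index)

df['daily_new'] = df.groupby('Region')['sum_of_last_7_days'].transform(invert_rolling_sum)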

Related

Check if cumsum of the column is greater than range value, then increment the element in list

I have a list:
sample_dates = ["10/07/2021", "11/07/2021", "12/07/2021", "13/07/2021",
                "14/07/2021", "15/07/2021", "16/07/2021", "17/07/2021",
                "18/07/2021", "19/07/2021", "20/07/2021", "21/07/2021",
                "22/07/2021", "23/07/2021", "24/07/2021"]
and a dataframe like the one below:
Truckid Tripid kms
1 1 700.3
1 1 608.9
1 1 400.2
1 2 100.2
1 2 140.8
1 3 1580.0
1 3 357.3
1 3 541.5
1 4 421.2
1 4 1694.4
1 4 1585.9
1 5 173.3
1 5 237.4
1 5 83.3
2 1 846.1
2 1 1167.6
2 2 388.8
2 2 70.5
2 2 127.1
2 3 126.7
2 3 262.4
I want to add a Date column based on the cumulative sum of kms: while the cumsum is above 0 and below 2000, the rows should share the same date; once it passes 2000, the date should change; it then stays the same while the cumsum is between 2000 and 3000, and changes again when it passes 3000, and so on.
Also, if Tripid changes, the counting should restart from 0.
I want something like this
Truckid Tripid kms Date
1 1 700.3 10/07/2021
1 1 608.9 10/07/2021
1 1 400.2 10/07/2021
1 2 100.2 11/07/2021
1 2 140.8 11/07/2021
1 3 1580.0 12/07/2021
1 3 357.3 12/07/2021
1 3 541.5 13/07/2021
1 4 421.2 14/07/2021
1 4 1694.4 15/07/2021
1 4 1585.9 16/07/2021
1 5 173.3 17/07/2021
1 5 237.4 17/07/2021
1 5 83.3 17/07/2021
2 1 846.1 18/07/2021
2 1 1167.6 19/07/2021
2 2 388.8 20/07/2021
2 2 70.5 20/07/2021
2 2 127.1 20/07/2021
2 3 126.7 21/07/2021
2 3 262.4 21/07/2021
You can compute the cumsum per group and either cut it manually or use a mathematical trick to form the groups.
Then map your dates:
# round to thousands, clip to get min 1000 km
kms = df.groupby(['Truckid', 'Tripid'])['kms'].cumsum().floordiv(1000).clip(1)

# OR use manual bins
kms = pd.cut(df.groupby(['Truckid', 'Tripid'])['kms'].cumsum(),
             bins=[0, 2000, 3000, 4000])  # etc. up to max wanted value

df['Date'] = (df
    .groupby(['Truckid', 'Tripid', kms]).ngroup()  # get group ID
    .map(dict(enumerate(sample_dates)))            # match to items in order
)
An alternative is to use consecutive days from the starting point:
df['Date'] = pd.to_datetime(df.groupby(['Truckid', 'Tripid', kms]).ngroup(),
                            unit='d', origin='2021-07-10')
output:
Truckid Tripid kms Date
0 1 1 700.3 10/07/2021
1 1 1 608.9 10/07/2021
2 1 1 400.2 10/07/2021
3 1 2 100.2 11/07/2021
4 1 2 140.8 11/07/2021
5 1 3 1580.0 12/07/2021
6 1 3 357.3 12/07/2021
7 1 3 541.5 13/07/2021
8 1 4 421.2 14/07/2021
9 1 4 1694.4 15/07/2021
10 1 4 1585.9 16/07/2021
11 1 5 173.3 17/07/2021
12 1 5 237.4 17/07/2021
13 1 5 83.3 17/07/2021
14 2 1 846.1 18/07/2021
15 2 1 1167.6 19/07/2021
16 2 2 388.8 20/07/2021
17 2 2 70.5 20/07/2021
18 2 2 127.1 20/07/2021
19 2 3 126.7 21/07/2021
20 2 3 262.4 21/07/2021
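As a quick illustration of the floordiv trick (values made up for the example): the running cumsum within a trip maps to a bucket number, so rows keep the same date until the next thousand is crossed:
import pandas as pd

s = pd.Series([700.3, 1309.2, 1850.7, 2478.8])  # cumsum of kms within one trip
print(s.floordiv(1000).clip(1).tolist())        # [1.0, 1.0, 1.0, 2.0]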

How to find out the cumulative count between numbers?

I want to find the cumulative count before there is a change in value, i.e. how many rows since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group of consecutive values:
# the group ID increments at every change of Value
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
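For reference, the ne/shift/cumsum grouper gives each run of equal values its own ID (printed here for the example data):
print(df['Value'].ne(df['Value'].shift()).cumsum().tolist())
# [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6]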
If you want NaN where diff is NaN, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
               .mask(df['diff'].isna()))
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the difference column:
m = df['diff'].eq(0)   # True where the value did not change
b = m.cumsum()         # running count of unchanged rows
# subtract the count as it stood at the last reset so the counter restarts
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print (df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
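For intuition, here are the intermediates on a short made-up diff series; b.mask(m).ffill() freezes the running count at each reset, so the subtraction restarts from 0:
import pandas as pd

diff = pd.Series([float('nan'), -1, 0, 0, -1, 0, 0])
m = diff.eq(0)                        # F F T T F T T
b = m.cumsum()                        # 0 0 1 2 2 3 4
anchor = b.mask(m).ffill().fillna(0)  # 0 0 0 0 2 2 2
print(b.sub(anchor).astype(int).tolist())  # [0, 0, 1, 2, 0, 1, 2]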

Create new columns according to row values in pandas

I have a pandas dataframe that looks like this:
id name total cubierto no_cubierto escuela_id nivel_id
0 1 direccion 1 1 0 420000707 1
1 2 frente_a_alunos 4 4 0 420000707 1
2 3 apoyo 2 2 0 420000707 1
3 4 direccion 2 2 0 840477414 2
4 5 frente_a_alunos 8 8 0 840477414 2
5 6 apoyo 4 3 1 840477414 2
6 7 direccion 7 7 0 918751515 3
7 8 apoyo 37 37 0 918751515 3
8 9 direccion 1 1 0 993683216 1
9 10 frente_a_alunos 7 7 0 993683216 1
The column "name" has 3 unique values:
- direccion
- frente a alunos
- apoyo
and I need to get a new dataframe, grouped by "escuela_id" and "nivel_id" that has the columns:
- direccion_total
- direccion_cubierto
- frente_a_alunos_total
- frente_a_alunos_cubierto
- apoyo_total
- apoyo_cubierto
- escuela_id
- nivel_id
getting the values from columns "total" and "cubierto" respectively.
I don't need the column "no_cubierto".
Is it possible to do it with pandas functions? I am stuck on it and I couldn't find any solution.
The output for the example should look like this:
escuela_id nivel_id apoyo_cubierto apoyo_total direccion_total
0 420000707 1 2 2 1
1 840477414 2 3 4 2
2 918751515 3 37 37 7
3 993683216 1 .. .. 1
direccion_cubierto frente_a_alunos_total frente_a_alunos_cubierto
0 1 4 4
1 2 8 8
2 7 .. ..
3 1 7 7
You need to use pivot_table here:
df = (df.pivot_table(index=['escuela_id', 'nivel_id'],
                     columns='name', values=['total', 'cubierto'])
        .reset_index())
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Output:
escuela_id_ nivel_id_ cubierto_apoyo cubierto_direccion cubierto_frente_a_alunos total_apoyo total_direccion total_frente_a_alunos
0 420000707 1 2.0 1.0 4.0 2.0 1.0 4.0
1 840477414 2 3.0 2.0 8.0 4.0 2.0 8.0
2 918751515 3 37.0 7.0 NaN 37.0 7.0 NaN
3 993683216 1 NaN 1.0 7.0 NaN 1.0 7.0
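If you prefer name-first labels such as direccion_total, as in the question, a small variation (a sketch) is to reverse each tuple when flattening, in place of the '_'.join(col) line above; strip('_') then also removes the stray trailing underscore on the index columns:
df.columns = ['_'.join(reversed(col)).strip('_') for col in df.columns]
# -> escuela_id, nivel_id, apoyo_cubierto, direccion_cubierto, ..., apoyo_total, ...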

Create and populate Pandas data frame columns using two existing columns

My data frame has 4 columns and looks as shown below.
What I have:
ID start_date end_date active
1,111 6/30/2015 8/6/1904 1 to 10
1,111 6/28/2016 3/30/1905 1 to 10
1,111 7/31/2017 6/6/1905 1 to 10
1,111 7/31/2018 6/6/1905 1 to 9
1,111 5/31/2019 12/4/1904 1 to 9
3,033 3/31/2015 5/18/1908 3 to 7
3,033 3/31/2016 11/24/1905 3 to 7
3,033 3/31/2017 1/20/1906 3 to 7
3,033 3/31/2018 1/8/1906 2 to 7
3,033 4/4/2019 2200,0 2 to 8
I want to generate 10 more columns based on the value of the column "active", as below. Is there a way to populate this efficiently?
What I want to achieve
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1 1 1 1 1 1
1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1 1 1 1 1 1
3,033 3/31/2015 5/18/1908 3 to 7 1 1 1 1 1
3,033 3/31/2016 11/24/1905 3 to 7 1 1 1 1 1
3,033 3/31/2017 1/20/1906 3 to 7 1 1 1 1 1
3,033 3/31/2018 1/8/1906 2 to 7 1 1 1 1 1 1
3,033 4/4/2019 2200,0 2 to 8 1 1 1 1 1 1 1
Use a custom function with np.arange:
def f(x):
    a = list(map(int, x.split(' to ')))
    return pd.Series(1, index=np.arange(a[0], a[1] + 1))

df = df.join(df['active'].apply(f).add_prefix('Type '))
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1.0 1.0 1.0 1.0
1 1,111 6/28/2016 3/30/1905 1 to 10 1.0 1.0 1.0 1.0
2 1,111 7/31/2017 6/6/1905 1 to 10 1.0 1.0 1.0 1.0
3 1,111 7/31/2018 6/6/1905 1 to 9 1.0 1.0 1.0 1.0
4 1,111 5/31/2019 12/4/1904 1 to 9 1.0 1.0 1.0 1.0
5 3,033 3/31/2015 5/18/1908 3 to 7 NaN NaN 1.0 1.0
6 3,033 3/31/2016 11/24/1905 3 to 7 NaN NaN 1.0 1.0
7 3,033 3/31/2017 1/20/1906 3 to 7 NaN NaN 1.0 1.0
8 3,033 3/31/2018 1/8/1906 2 to 7 NaN 1.0 1.0 1.0
9 3,033 4/4/2019 2200,0 2 to 8 NaN 1.0 1.0 1.0
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 NaN
5 1.0 1.0 1.0 NaN NaN NaN
6 1.0 1.0 1.0 NaN NaN NaN
7 1.0 1.0 1.0 NaN NaN NaN
8 1.0 1.0 1.0 NaN NaN NaN
9 1.0 1.0 1.0 1.0 NaN NaN
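For reference, this is what f returns for a single value; apply aligns these Series into the Type columns, and positions outside the range become NaN:
print(f('3 to 7'))
# 3    1
# 4    1
# 5    1
# 6    1
# 7    1
# dtype: int64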
Similar, but filling the missing values with 0 and casting to integers:
def f(x):
    a = list(map(int, x.split(' to ')))
    return pd.Series(1, index=np.arange(a[0], a[1] + 1))

df = df.join(df['active'].apply(f).add_prefix('Type ').fillna(0).astype(int))
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1
5 3,033 3/31/2015 5/18/1908 3 to 7 0 0 1 1
6 3,033 3/31/2016 11/24/1905 3 to 7 0 0 1 1
7 3,033 3/31/2017 1/20/1906 3 to 7 0 0 1 1
8 3,033 3/31/2018 1/8/1906 2 to 7 0 1 1 1
9 3,033 4/4/2019 2200,0 2 to 8 0 1 1 1
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 0
4 1 1 1 1 1 0
5 1 1 1 0 0 0
6 1 1 1 0 0 0
7 1 1 1 0 0 0
8 1 1 1 0 0 0
9 1 1 1 1 0 0
Another non-loop solution. The idea is to remove duplicates, create indicator columns with get_dummies, reindex to add the missing columns, and finally fill the 1s between the endpoints by multiplying two cumulative sums:
df1 = (df.set_index('active', drop=False)
         .pop('active')
         .drop_duplicates()
         .str.get_dummies(' to '))
df1.columns = df1.columns.astype(int)
df1 = df1.reindex(columns=np.arange(df1.columns.min(), df1.columns.max() + 1),
                  fill_value=0)
# clip_upper was removed in newer pandas; clip(upper=1) is the equivalent
df1 = (df1.cumsum(axis=1) * df1.iloc[:, ::-1].cumsum(axis=1)).clip(upper=1)
print (df1)
1 2 3 4 5 6 7 8 9 10
active
1 to 10 1 1 1 1 1 1 1 1 1 1
1 to 9 1 1 1 1 1 1 1 1 1 0
3 to 7 0 0 1 1 1 1 1 0 0 0
2 to 7 0 1 1 1 1 1 1 0 0 0
2 to 8 0 1 1 1 1 1 1 1 0 0
df = df.join(df1.add_prefix('Type '), on='active')
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1
5 3,033 3/31/2015 5/18/1908 3 to 7 0 0 1 1
6 3,033 3/31/2016 11/24/1905 3 to 7 0 0 1 1
7 3,033 3/31/2017 1/20/1906 3 to 7 0 0 1 1
8 3,033 3/31/2018 1/8/1906 2 to 7 0 1 1 1
9 3,033 4/4/2019 2200,0 2 to 8 0 1 1 1
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 0
4 1 1 1 1 1 0
5 1 1 1 0 0 0
6 1 1 1 0 0 0
7 1 1 1 0 0 0
8 1 1 1 0 0 0
9 1 1 1 1 0 0
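To see why the product of the two cumulative sums marks exactly the span between the endpoints, here is the trick on a single made-up row (multiplication aligns on column labels, so reversing the columns only changes the direction of the cumsum):
import pandas as pd

row = pd.DataFrame([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]], columns=range(1, 11))  # '3 to 7'
fwd = row.cumsum(axis=1)                # nonzero from the left endpoint onward
bwd = row.iloc[:, ::-1].cumsum(axis=1)  # nonzero up to the right endpoint
print((fwd * bwd).clip(upper=1))        # 1s exactly at columns 3 through 7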
Yet another option is to expand each range into a pipe-joined string and let get_dummies split it:
def f(s):
    a, b = map(int, s.split('to'))
    return '|'.join(map(str, range(a, b + 1)))

df.drop(columns='active').join(df.active.apply(f).str.get_dummies().add_prefix('Type '))
ID start_date end_date Type 1 Type 10 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8 Type 9
0 1,111 6/30/2015 8/6/1904 1 1 1 1 1 1 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 1 1 1 1 1 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 1 1 1 1 1 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 0 1 1 1 1 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 0 1 1 1 1 1 1 1 1
5 3,033 3/31/2015 5/18/1908 0 0 0 1 1 1 1 1 0 0
6 3,033 3/31/2016 11/24/1905 0 0 0 1 1 1 1 1 0 0
7 3,033 3/31/2017 1/20/1906 0 0 0 1 1 1 1 1 0 0
8 3,033 3/31/2018 1/8/1906 0 0 1 1 1 1 1 1 0 0
9 3,033 4/4/2019 2200,0 0 0 1 1 1 1 1 1 1 0
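Note the lexicographic column order in this last output (Type 10 sorts before Type 2). If numeric order matters, one way (a sketch) is to reorder by the trailing number:
out = df.drop(columns='active').join(df.active.apply(f).str.get_dummies().add_prefix('Type '))
type_cols = sorted([c for c in out.columns if c.startswith('Type ')],
                   key=lambda c: int(c.split()[-1]))
out = out[[c for c in out.columns if not c.startswith('Type ')] + type_cols]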

Pandas: Groupby two columns and count the occurrence of all values for 2nd column

I want to group my dataframe by two columns: one is yearmonth (format: 16-10) and the other is the number of customers. Then, if the number of customers is more than six, I want to create one row with number of cust = 6+ that replaces all those rows and carries the sum of their count values.
This is how the data looks:
index month num ofcust count
0 10 1.0 1
1 10 2.0 1
2 10 3.0 1
3 10 4.0 1
4 10 5.0 1
5 10 6.0 1
6 10 7.0 1
7 10 8.0 1
8 11 1.0 1
9 11 2.0 1
10 11 3.0 1
11 12 12.0 1
Output:
index month no of cust count
0 16-10 1.0 3
1 16-10 2.0 6
2 16-10 3.0 2
3 16-10 4.0 3
4 16-10 5.0 4
5 16-10 6+ 4
6 16-11 1.0 4
7 16-11 2.0 3
8 16-11 3.0 2
9 16-11 4.0 1
10 16-11 5.0 3
11 16-11 6+ 5
I believe you need to replace all values >= 6 first and then groupby + aggregate sum:
s = df['num ofcust'].mask(df['num ofcust'] >=6, '6+')
#alternatively
#s = df['num ofcust'].where(df['num ofcust'] <6, '6+')
df = df.groupby(['month', s])['count'].sum().reset_index()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
Detail:
print (s)
0 1
1 2
2 3
3 4
4 5
5 6+
6 6+
7 6+
8 1
9 2
10 3
11 6+
Name: num ofcust, dtype: object
Another very similar solution is to replace the values in the column first:
df.loc[df['num ofcust'] >= 6, 'num ofcust'] = '6+'
df = df.groupby(['month', 'num ofcust'], as_index=False)['count'].sum()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
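Both approaches leave a column that mixes numbers with the string '6+' (object dtype, as the Detail above shows). If a uniform string column is preferred, a variant (a sketch) builds explicit labels first:
import numpy as np
import pandas as pd

# label every row as a string: '1', '2', ..., or '6+'
labels = pd.Series(np.where(df['num ofcust'] >= 6, '6+',
                            df['num ofcust'].astype(int).astype(str)),
                   index=df.index, name='num ofcust')
df = df.groupby(['month', labels])['count'].sum().reset_index()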
