Get original values from rolling sum in Pandas DataFrame - python

I have data describing the number of newly hospitalized persons for specific days and regions.
The number of hospitalized persons is the rolling sum of newly hospitalized persons over the last 7 days.
The DataFrame looks like this:
Date Region sum_of_last_7_days
01.01.2020 1 1
02.01.2020 1 2
03.01.2020 1 3
04.01.2020 1 4
05.01.2020 1 5
06.01.2020 1 6
07.01.2020 1 7
08.01.2020 1 7
09.01.2020 1 7
01.01.2020 2 1
02.01.2020 2 2
03.01.2020 2 3
04.01.2020 2 4
05.01.2020 2 5
06.01.2020 2 6
07.01.2020 2 7
08.01.2020 2 7
09.01.2020 2 7
10.01.2020 2 4
The goal output is:
Date Region daily_new
01.01.2020 1 1
02.01.2020 1 1
03.01.2020 1 1
04.01.2020 1 1
05.01.2020 1 1
06.01.2020 1 1
07.01.2020 1 1
08.01.2020 1 0
09.01.2020 1 0
01.01.2020 2 1
02.01.2020 2 1
03.01.2020 2 1
04.01.2020 2 1
05.01.2020 2 1
06.01.2020 2 1
07.01.2020 2 1
08.01.2020 2 0
09.01.2020 2 0
10.01.2020 2 0
The way to do this should be to undo the rolling-sum operation with a 7-day window, but I wasn't able to find any solution.

To get the original values, take a diff per region and fill the first value of each group:
# difference with the previous row, per region (first row of each group becomes NaN)
s = df.groupby('Region')['sum_of_last_7_days'].diff()
# where the diff is NaN (first row per group), fall back to the sum itself
df['original'] = s.mask(s.isna(), df['sum_of_last_7_days'])
output:
Date Region sum_of_last_7_days original
0 01.01.2020 1 1 1.0
1 02.01.2020 1 2 1.0
2 03.01.2020 1 3 1.0
3 04.01.2020 1 4 1.0
4 05.01.2020 1 5 1.0
5 06.01.2020 1 6 1.0
6 07.01.2020 1 7 1.0
7 08.01.2020 1 7 0.0
8 09.01.2020 1 7 0.0
9 01.01.2020 2 1 1.0
10 02.01.2020 2 2 1.0
11 03.01.2020 2 3 1.0
12 04.01.2020 2 4 1.0
13 05.01.2020 2 5 1.0
14 06.01.2020 2 6 1.0
15 07.01.2020 2 7 1.0
16 08.01.2020 2 7 0.0
17 09.01.2020 2 7 0.0
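Note that if the column is an exact 7-day rolling sum, a plain diff yields new[t] - new[t-7] rather than new[t]; an exact inversion needs the recurrence new[t] = sum[t] - sum[t-1] + new[t-7]. A minimal sketch of that inversion (the helper name invert_rolling_sum is mine), assuming the sum was built with min_periods=1 and no data exists before the first row of each region:
import pandas as pd

def invert_rolling_sum(s, window=7):
    # recover x from s = x.rolling(window, min_periods=1).sum()
    out = []
    prev = 0
    for t, val in enumerate(s.tolist()):
        x = val - prev
        if t >= window:
            x += out[t - window]  # add back the value that dropped out of the window
        out.append(x)
        prev = val
    return pd.Series(out, index=s.index)

df['daily_new'] = df.groupby('Region')['sum_of_last_7_days'].transform(invert_rolling_sum)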

Related

Check if cumsum of the column is greater than range value, then increment the element in list

I have a list:
sample_dates = ["10/07/2021", "11/07/2021", "12/07/2021", "13/07/2021",
                "14/07/2021", "15/07/2021", "16/07/2021", "17/07/2021",
                "18/07/2021", "19/07/2021", "20/07/2021", "21/07/2021",
                "22/07/2021", "23/07/2021", "24/07/2021"]
and a dataframe like the one below:
Truckid Tripid kms
1 1 700.3
1 1 608.9
1 1 400.2
1 2 100.2
1 2 140.8
1 3 1580.0
1 3 357.3
1 3 541.5
1 4 421.2
1 4 1694.4
1 4 1585.9
1 5 173.3
1 5 237.4
1 5 83.3
2 1 846.1
2 1 1167.6
2 2 388.8
2 2 70.5
2 2 127.1
2 3 126.7
2 3 262.4
I want to add a Date column based on the cumulative sum of kms: while the cumsum is above 0 and below 2000, the rows should share the same date; once it passes 2000, the date should change; it then stays the same while the cumsum is between 2000 and 3000, and changes again when it passes 3000, and so on.
Also, if Tripid changes, the counting should restart from 0.
I want something like this
Truckid Tripid kms Date
1 1 700.3 10/07/2021
1 1 608.9 10/07/2021
1 1 400.2 10/07/2021
1 2 100.2 11/07/2021
1 2 140.8 11/07/2021
1 3 1580.0 12/07/2021
1 3 357.3 12/07/2021
1 3 541.5 13/07/2021
1 4 421.2 14/07/2021
1 4 1694.4 15/07/2021
1 4 1585.9 16/07/2021
1 5 173.3 17/07/2021
1 5 237.4 17/07/2021
1 5 83.3 17/07/2021
2 1 846.1 18/07/2021
2 1 1167.6 19/07/2021
2 2 388.8 20/07/2021
2 2 70.5 20/07/2021
2 2 127.1 20/07/2021
2 3 126.7 21/07/2021
2 3 262.4 21/07/2021
You can compute the cumsum per group and either cut it manually or use a mathematical trick to form the groups.
Then map your dates:
# round to thousands, clip to get min 1000 km
kms = df.groupby(['Truckid', 'Tripid'])['kms'].cumsum().floordiv(1000).clip(1)

# OR use manual bins
kms = pd.cut(df.groupby(['Truckid', 'Tripid'])['kms'].cumsum(),
             bins=[0, 2000, 3000, 4000])  # etc. up to max wanted value

df['Date'] = (df
    .groupby(['Truckid', 'Tripid', kms]).ngroup()  # get group ID
    .map(dict(enumerate(sample_dates)))            # match to items in order
)
An alternative is to use consecutive days from the starting point:
df['Date'] = pd.to_datetime(df.groupby(['Truckid', 'Tripid', kms]).ngroup(),
                            unit='d', origin='2021-07-10')
output:
Truckid Tripid kms Date
0 1 1 700.3 10/07/2021
1 1 1 608.9 10/07/2021
2 1 1 400.2 10/07/2021
3 1 2 100.2 11/07/2021
4 1 2 140.8 11/07/2021
5 1 3 1580.0 12/07/2021
6 1 3 357.3 12/07/2021
7 1 3 541.5 13/07/2021
8 1 4 421.2 14/07/2021
9 1 4 1694.4 15/07/2021
10 1 4 1585.9 16/07/2021
11 1 5 173.3 17/07/2021
12 1 5 237.4 17/07/2021
13 1 5 83.3 17/07/2021
14 2 1 846.1 18/07/2021
15 2 1 1167.6 19/07/2021
16 2 2 388.8 20/07/2021
17 2 2 70.5 20/07/2021
18 2 2 127.1 20/07/2021
19 2 3 126.7 21/07/2021
20 2 3 262.4 21/07/2021
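As a quick illustration of the floordiv trick (values made up for the example): the running cumsum within a trip maps to a bucket number, so rows keep the same date until the next thousand is crossed:
import pandas as pd

s = pd.Series([700.3, 1309.2, 1850.7, 2478.8])  # cumsum of kms within one trip
print(s.floordiv(1000).clip(1).tolist())        # [1.0, 1.0, 1.0, 2.0]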

How to find out the cumulative count between numbers?

I want to find the cumulative count before there is a change in value, i.e. how many rows since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group of consecutive values:
# the group ID increments at every change of Value
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
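For reference, the ne/shift/cumsum grouper gives each run of equal values its own ID (printed here for the example data):
print(df['Value'].ne(df['Value'].shift()).cumsum().tolist())
# [1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6]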
If you want NaN where diff is NaN, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
               .mask(df['diff'].isna()))
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the difference column:
m = df['diff'].eq(0)   # True where the value did not change
b = m.cumsum()         # running count of unchanged rows
# subtract the count as it stood at the last reset so the counter restarts
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print (df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
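For intuition, here are the intermediates on a short made-up diff series; b.mask(m).ffill() freezes the running count at each reset, so the subtraction restarts from 0:
import pandas as pd

diff = pd.Series([float('nan'), -1, 0, 0, -1, 0, 0])
m = diff.eq(0)                        # F F T T F T T
b = m.cumsum()                        # 0 0 1 2 2 3 4
anchor = b.mask(m).ffill().fillna(0)  # 0 0 0 0 2 2 2
print(b.sub(anchor).astype(int).tolist())  # [0, 0, 1, 2, 0, 1, 2]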

Create new columns according to row values in pandas

I have a pandas dataframe that looks like this:
id name total cubierto no_cubierto escuela_id nivel_id
0 1 direccion 1 1 0 420000707 1
1 2 frente_a_alunos 4 4 0 420000707 1
2 3 apoyo 2 2 0 420000707 1
3 4 direccion 2 2 0 840477414 2
4 5 frente_a_alunos 8 8 0 840477414 2
5 6 apoyo 4 3 1 840477414 2
6 7 direccion 7 7 0 918751515 3
7 8 apoyo 37 37 0 918751515 3
8 9 direccion 1 1 0 993683216 1
9 10 frente_a_alunos 7 7 0 993683216 1
The column "name" has 3 unique values:
- direccion
- frente a alunos
- apoyo
and I need to get a new dataframe, grouped by "escuela_id" and "nivel_id" that has the columns:
- direccion_total
- direccion_cubierto
- frente_a_alunos_total
- frente_a_alunos_cubierto
- apoyo_total
- apoyo_cubierto
- escuela_id
- nivel_id
getting the values from columns "total" and "cubierto" respectively.
I don't need the column "no_cubierto".
Is it possible to do it with pandas functions? I am stuck on it and I couldn't find any solution.
The output for the example should look like this:
escuela_id nivel_id apoyo_cubierto apoyo_total direccion_total
0 420000707 1 2 2 1
1 840477414 2 3 4 2
2 918751515 3 37 37 7
3 993683216 1 .. .. 1
direccion_cubierto frente_a_alunos_total frente_a_alunos_cubierto
0 1 4 4
1 2 8 8
2 7 .. ..
3 1 7 7
You need to use pivot_table here:
df = (df.pivot_table(index=['escuela_id', 'nivel_id'],
                     columns='name', values=['total', 'cubierto'])
        .reset_index())
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)
Output:
escuela_id_ nivel_id_ cubierto_apoyo cubierto_direccion cubierto_frente_a_alunos total_apoyo total_direccion total_frente_a_alunos
0 420000707 1 2.0 1.0 4.0 2.0 1.0 4.0
1 840477414 2 3.0 2.0 8.0 4.0 2.0 8.0
2 918751515 3 37.0 7.0 NaN 37.0 7.0 NaN
3 993683216 1 NaN 1.0 7.0 NaN 1.0 7.0
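If you prefer name-first labels such as direccion_total, as in the question, a small variation (a sketch) is to reverse each tuple when flattening, in place of the '_'.join(col) line above; strip('_') then also removes the stray trailing underscore on the index columns:
df.columns = ['_'.join(reversed(col)).strip('_') for col in df.columns]
# -> escuela_id, nivel_id, apoyo_cubierto, direccion_cubierto, ..., apoyo_total, ...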

Create and populate Pandas data frame columns using two existing columns

My data frame has 4 columns and looks as shown below.
What I have:
ID start_date end_date active
1,111 6/30/2015 8/6/1904 1 to 10
1,111 6/28/2016 3/30/1905 1 to 10
1,111 7/31/2017 6/6/1905 1 to 10
1,111 7/31/2018 6/6/1905 1 to 9
1,111 5/31/2019 12/4/1904 1 to 9
3,033 3/31/2015 5/18/1908 3 to 7
3,033 3/31/2016 11/24/1905 3 to 7
3,033 3/31/2017 1/20/1906 3 to 7
3,033 3/31/2018 1/8/1906 2 to 7
3,033 4/4/2019 2200,0 2 to 8
I want to generate 10 more columns based on the value of the column "active", as below. Is there a way to populate this efficiently?
What I want to achieve
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1 1 1 1 1 1 1
1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1 1 1 1 1 1
1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1 1 1 1 1 1
3,033 3/31/2015 5/18/1908 3 to 7 1 1 1 1 1
3,033 3/31/2016 11/24/1905 3 to 7 1 1 1 1 1
3,033 3/31/2017 1/20/1906 3 to 7 1 1 1 1 1
3,033 3/31/2018 1/8/1906 2 to 7 1 1 1 1 1 1
3,033 4/4/2019 2200,0 2 to 8 1 1 1 1 1 1 1
Use a custom function with np.arange:
def f(x):
    a = list(map(int, x.split(' to ')))
    return pd.Series(1, index=np.arange(a[0], a[1] + 1))

df = df.join(df['active'].apply(f).add_prefix('Type '))
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1.0 1.0 1.0 1.0
1 1,111 6/28/2016 3/30/1905 1 to 10 1.0 1.0 1.0 1.0
2 1,111 7/31/2017 6/6/1905 1 to 10 1.0 1.0 1.0 1.0
3 1,111 7/31/2018 6/6/1905 1 to 9 1.0 1.0 1.0 1.0
4 1,111 5/31/2019 12/4/1904 1 to 9 1.0 1.0 1.0 1.0
5 3,033 3/31/2015 5/18/1908 3 to 7 NaN NaN 1.0 1.0
6 3,033 3/31/2016 11/24/1905 3 to 7 NaN NaN 1.0 1.0
7 3,033 3/31/2017 1/20/1906 3 to 7 NaN NaN 1.0 1.0
8 3,033 3/31/2018 1/8/1906 2 to 7 NaN 1.0 1.0 1.0
9 3,033 4/4/2019 2200,0 2 to 8 NaN 1.0 1.0 1.0
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1.0 1.0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0 1.0 NaN
4 1.0 1.0 1.0 1.0 1.0 NaN
5 1.0 1.0 1.0 NaN NaN NaN
6 1.0 1.0 1.0 NaN NaN NaN
7 1.0 1.0 1.0 NaN NaN NaN
8 1.0 1.0 1.0 NaN NaN NaN
9 1.0 1.0 1.0 1.0 NaN NaN
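For reference, this is what f returns for a single value; apply aligns these Series into the Type columns, and positions outside the range become NaN:
print(f('3 to 7'))
# 3    1
# 4    1
# 5    1
# 6    1
# 7    1
# dtype: int64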
Similar, but filling the missing values with 0 and casting to integers:
def f(x):
    a = list(map(int, x.split(' to ')))
    return pd.Series(1, index=np.arange(a[0], a[1] + 1))

df = df.join(df['active'].apply(f).add_prefix('Type ').fillna(0).astype(int))
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1
5 3,033 3/31/2015 5/18/1908 3 to 7 0 0 1 1
6 3,033 3/31/2016 11/24/1905 3 to 7 0 0 1 1
7 3,033 3/31/2017 1/20/1906 3 to 7 0 0 1 1
8 3,033 3/31/2018 1/8/1906 2 to 7 0 1 1 1
9 3,033 4/4/2019 2200,0 2 to 8 0 1 1 1
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 0
4 1 1 1 1 1 0
5 1 1 1 0 0 0
6 1 1 1 0 0 0
7 1 1 1 0 0 0
8 1 1 1 0 0 0
9 1 1 1 1 0 0
Another non-loop solution. The idea is to remove duplicates, create indicator columns with get_dummies, reindex to add the missing columns, and finally fill the 1s between the endpoints by multiplying two cumulative sums:
df1 = (df.set_index('active', drop=False)
         .pop('active')
         .drop_duplicates()
         .str.get_dummies(' to '))
df1.columns = df1.columns.astype(int)
df1 = df1.reindex(columns=np.arange(df1.columns.min(), df1.columns.max() + 1),
                  fill_value=0)
# clip_upper was removed in newer pandas; clip(upper=1) is the equivalent
df1 = (df1.cumsum(axis=1) * df1.iloc[:, ::-1].cumsum(axis=1)).clip(upper=1)
print (df1)
1 2 3 4 5 6 7 8 9 10
active
1 to 10 1 1 1 1 1 1 1 1 1 1
1 to 9 1 1 1 1 1 1 1 1 1 0
3 to 7 0 0 1 1 1 1 1 0 0 0
2 to 7 0 1 1 1 1 1 1 0 0 0
2 to 8 0 1 1 1 1 1 1 1 0 0
df = df.join(df1.add_prefix('Type '), on='active')
print (df)
ID start_date end_date active Type 1 Type 2 Type 3 Type 4 \
0 1,111 6/30/2015 8/6/1904 1 to 10 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 to 10 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 to 10 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 to 9 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 to 9 1 1 1 1
5 3,033 3/31/2015 5/18/1908 3 to 7 0 0 1 1
6 3,033 3/31/2016 11/24/1905 3 to 7 0 0 1 1
7 3,033 3/31/2017 1/20/1906 3 to 7 0 0 1 1
8 3,033 3/31/2018 1/8/1906 2 to 7 0 1 1 1
9 3,033 4/4/2019 2200,0 2 to 8 0 1 1 1
Type 5 Type 6 Type 7 Type 8 Type 9 Type 10
0 1 1 1 1 1 1
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 0
4 1 1 1 1 1 0
5 1 1 1 0 0 0
6 1 1 1 0 0 0
7 1 1 1 0 0 0
8 1 1 1 0 0 0
9 1 1 1 1 0 0
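To see why the product of the two cumulative sums marks exactly the span between the endpoints, here is the trick on a single made-up row (multiplication aligns on column labels, so reversing the columns only changes the direction of the cumsum):
import pandas as pd

row = pd.DataFrame([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0]], columns=range(1, 11))  # '3 to 7'
fwd = row.cumsum(axis=1)                # nonzero from the left endpoint onward
bwd = row.iloc[:, ::-1].cumsum(axis=1)  # nonzero up to the right endpoint
print((fwd * bwd).clip(upper=1))        # 1s exactly at columns 3 through 7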
Yet another option is to expand each range into a pipe-joined string and let get_dummies split it:
def f(s):
    a, b = map(int, s.split('to'))
    return '|'.join(map(str, range(a, b + 1)))

df.drop(columns='active').join(df.active.apply(f).str.get_dummies().add_prefix('Type '))
ID start_date end_date Type 1 Type 10 Type 2 Type 3 Type 4 Type 5 Type 6 Type 7 Type 8 Type 9
0 1,111 6/30/2015 8/6/1904 1 1 1 1 1 1 1 1 1 1
1 1,111 6/28/2016 3/30/1905 1 1 1 1 1 1 1 1 1 1
2 1,111 7/31/2017 6/6/1905 1 1 1 1 1 1 1 1 1 1
3 1,111 7/31/2018 6/6/1905 1 0 1 1 1 1 1 1 1 1
4 1,111 5/31/2019 12/4/1904 1 0 1 1 1 1 1 1 1 1
5 3,033 3/31/2015 5/18/1908 0 0 0 1 1 1 1 1 0 0
6 3,033 3/31/2016 11/24/1905 0 0 0 1 1 1 1 1 0 0
7 3,033 3/31/2017 1/20/1906 0 0 0 1 1 1 1 1 0 0
8 3,033 3/31/2018 1/8/1906 0 0 1 1 1 1 1 1 0 0
9 3,033 4/4/2019 2200,0 0 0 1 1 1 1 1 1 1 0
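Note the lexicographic column order in this last output (Type 10 sorts before Type 2). If numeric order matters, one way (a sketch) is to reorder by the trailing number:
out = df.drop(columns='active').join(df.active.apply(f).str.get_dummies().add_prefix('Type '))
type_cols = sorted([c for c in out.columns if c.startswith('Type ')],
                   key=lambda c: int(c.split()[-1]))
out = out[[c for c in out.columns if not c.startswith('Type ')] + type_cols]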

Pandas: Groupby two columns and count the occurrence of all values for 2nd column

I want to group my dataframe by two columns: one is yearmonth (format: 16-10) and the other is the number of customers. Then, if the number of customers is more than six, I want to create one row with number of cust = 6+ that replaces all those rows and carries the sum of their count values.
This is how the data looks:
index month num ofcust count
0 10 1.0 1
1 10 2.0 1
2 10 3.0 1
3 10 4.0 1
4 10 5.0 1
5 10 6.0 1
6 10 7.0 1
7 10 8.0 1
8 11 1.0 1
9 11 2.0 1
10 11 3.0 1
11 12 12.0 1
Output:
index month no of cust count
0 16-10 1.0 3
1 16-10 2.0 6
2 16-10 3.0 2
3 16-10 4.0 3
4 16-10 5.0 4
5 16-10 6+ 4
6 16-11 1.0 4
7 16-11 2.0 3
8 16-11 3.0 2
9 16-11 4.0 1
10 16-11 5.0 3
11 16-11 6+ 5
I believe you need to replace all values >= 6 first and then groupby + aggregate sum:
s = df['num ofcust'].mask(df['num ofcust'] >=6, '6+')
#alternatively
#s = df['num ofcust'].where(df['num ofcust'] <6, '6+')
df = df.groupby(['month', s])['count'].sum().reset_index()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
Detail:
print (s)
0 1
1 2
2 3
3 4
4 5
5 6+
6 6+
7 6+
8 1
9 2
10 3
11 6+
Name: num ofcust, dtype: object
Another very similar solution is to replace the values in the column first:
df.loc[df['num ofcust'] >= 6, 'num ofcust'] = '6+'
df = df.groupby(['month', 'num ofcust'], as_index=False)['count'].sum()
print (df)
month num ofcust count
0 10 1 1
1 10 2 1
2 10 3 1
3 10 4 1
4 10 5 1
5 10 6+ 3
6 11 1 1
7 11 2 1
8 11 3 1
9 12 6+ 1
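Both approaches leave a column that mixes numbers with the string '6+' (object dtype, as the Detail above shows). If a uniform string column is preferred, a variant (a sketch) builds explicit labels first:
import numpy as np
import pandas as pd

# label every row as a string: '1', '2', ..., or '6+'
labels = pd.Series(np.where(df['num ofcust'] >= 6, '6+',
                            df['num ofcust'].astype(int).astype(str)),
                   index=df.index, name='num ofcust')
df = df.groupby(['month', labels])['count'].sum().reset_index()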
