pandas broadcast function to ensure monotonically decreasing series - python

I'm wondering if there's a succinct, broadcast way to take a series/frame such as this:
>>> print(pd.DataFrame({'a': [5, 10, 9, 11, 13, 14, 12]}, index=pd.date_range('2020-12-01', periods=7)))
a
2020-12-01 5
2020-12-02 10
2020-12-03 9
2020-12-04 11
2020-12-05 13
2020-12-06 14
2020-12-07 12
...and turn it into:
>>> print(pd.DataFrame({'a': [5, 9, 9, 11, 12, 12, 12]}, index=pd.date_range('2020-12-01', periods=7)))
a
2020-12-01 5
2020-12-02 9
2020-12-03 9
2020-12-04 11
2020-12-05 12
2020-12-06 12
2020-12-07 12
NB: The important part is that the most recent value is kept, and, working backwards in time, any earlier value that exceeds a later one is replaced by that later value. The result is a monotonically increasing (non-decreasing) series in which no modification ever increases a value.
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
                            any      full
date       areaCode
2020-12-08 E92000001        0.0       0.0
           N92000002        0.0       0.0
           S92000003        0.0       0.0
           W92000004        0.0       0.0
2020-12-09 E92000001    11115.2       0.0
           N92000002      724.6       0.0
           S92000003     3801.8       0.0
           W92000004     1651.4       0.0
...
2021-01-24 E92000001  5727693.0  441684.0
           N92000002   159642.0   22713.0
           S92000003   415402.0    5538.0
           W92000004   270833.0     543.0
2021-01-25 E92000001  5962544.0  443010.0
Here's another way:
df.sort_index(ascending=False).cummin().sort_index()
a
2020-12-01 5
2020-12-02 9
2020-12-03 9
2020-12-04 11
2020-12-05 12
2020-12-06 12
2020-12-07 12
For the MultiIndex, this becomes:
df.sort_index(ascending=False).groupby('areaCode').cummin().sort_index()
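Putting the pieces together, a minimal runnable sketch of the reversed-cummin trick on the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 10, 9, 11, 13, 14, 12]},
                  index=pd.date_range('2020-12-01', periods=7))

# Reverse time, take the running minimum, then restore the original order.
out = df.sort_index(ascending=False).cummin().sort_index()

print(out['a'].tolist())                 # [5, 9, 9, 11, 12, 12, 12]
print(out['a'].is_monotonic_increasing)  # True
```

The final check confirms the result is monotonically non-decreasing going forward in time, which is exactly the property asked for.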

Flipping the time axis and doing a rolling min does the trick (the window just needs to be at least as long as the series):
df.sort_index(ascending=False).rolling(window=1000, min_periods=1).min().sort_index()
produces
a
2020-12-01 5.0
2020-12-02 9.0
2020-12-03 9.0
2020-12-04 11.0
2020-12-05 12.0
2020-12-06 12.0
2020-12-07 12.0
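The same rolling-min idea can be written with the window tied to the length of the frame instead of a magic constant; a minimal sketch on the example frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 10, 9, 11, 13, 14, 12]},
                  index=pd.date_range('2020-12-01', periods=7))

# len(df) guarantees the window always covers the whole (reversed) series.
out = (df.sort_index(ascending=False)
         .rolling(window=len(df), min_periods=1)
         .min()
         .sort_index())

print(out['a'].tolist())  # [5.0, 9.0, 9.0, 11.0, 12.0, 12.0, 12.0]
```

Note that rolling aggregations return floats, unlike the cummin approach, which preserves the integer dtype.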

Related

Python to increment date every week in a dataframe

I am trying to work on this requirement where I need to increment the date in weeks, here is the below code for the same:
import datetime

import pandas as pd
import numpy as np

c = 15
s = {'week': [1, 2, 3, 4, 5, 6, 7, 8], 'Sales': [10, 20, 30, 40, 50, 60, 70, 80]}
p = pd.DataFrame(data=s)
p['week'] = p['week'].apply(
    lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)
Output: weeks 1 through 8 are mapped to the Mondays 2021-01-04 through 2021-02-22.
How would I be able to increment from last row of week column to get next 15 weeks?
Basically, the desired output of week starts from 2021-03-01 and runs for the next 14 weeks.
One option is to use date_range to generate additional dates, then use set_index + reindex to append them:
p = p.set_index('week').reindex(pd.date_range('2021-01-04', periods=8+14, freq='W-MON')).rename_axis(['week']).reset_index()
Output:
week Sales
0 2021-01-04 10.0
1 2021-01-11 20.0
2 2021-01-18 30.0
3 2021-01-25 40.0
4 2021-02-01 50.0
5 2021-02-08 60.0
6 2021-02-15 70.0
7 2021-02-22 80.0
8 2021-03-01 NaN
9 2021-03-08 NaN
10 2021-03-15 NaN
11 2021-03-22 NaN
12 2021-03-29 NaN
13 2021-04-05 NaN
14 2021-04-12 NaN
15 2021-04-19 NaN
16 2021-04-26 NaN
17 2021-05-03 NaN
18 2021-05-10 NaN
19 2021-05-17 NaN
20 2021-05-24 NaN
21 2021-05-31 NaN
You can control the length of the week list with the range() function and your variable c, but you must also pad the Sales list so that both have the same number of elements:
import pandas as pd
import numpy as np
import datetime
c = 15
weeks = list(range(1, c + 1))
sales = [10, 20, 30, 40, 50, 60, 70, 80]
# pad Sales with None so both lists have the same length
s = {'week': weeks, 'Sales': sales + [None] * max(len(weeks) - len(sales), 0)}
p = pd.DataFrame(data=s)
p['week'] = p['week'].apply(
    lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)
print(p)
Another option is DateOffset:
p = pd.concat([p, pd.DataFrame({'week': [p.iloc[-1,0]+pd.DateOffset(weeks=i) for i in range(1,c)]})], ignore_index=True)
>>> p
week Sales
0 2021-01-04 10.0
1 2021-01-11 20.0
2 2021-01-18 30.0
3 2021-01-25 40.0
4 2021-02-01 50.0
5 2021-02-08 60.0
6 2021-02-15 70.0
7 2021-02-22 80.0
8 2021-03-01 NaN
9 2021-03-08 NaN
10 2021-03-15 NaN
11 2021-03-22 NaN
12 2021-03-29 NaN
13 2021-04-05 NaN
14 2021-04-12 NaN
15 2021-04-19 NaN
16 2021-04-26 NaN
17 2021-05-03 NaN
18 2021-05-10 NaN
19 2021-05-17 NaN
20 2021-05-24 NaN
21 2021-05-31 NaN
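A self-contained version of the DateOffset option, starting from the same frame as the question:

```python
import datetime

import pandas as pd

c = 15
p = pd.DataFrame({'week': [1, 2, 3, 4, 5, 6, 7, 8],
                  'Sales': [10, 20, 30, 40, 50, 60, 70, 80]})
p['week'] = p['week'].apply(
    lambda x: datetime.datetime.strptime(f'2021-{x:02}-1', '%Y-%U-%u')
)

# Append c-1 further Mondays after the last existing week; Sales stays NaN.
extra = pd.DataFrame({'week': [p['week'].iloc[-1] + pd.DateOffset(weeks=i)
                               for i in range(1, c)]})
p = pd.concat([p, extra], ignore_index=True)

print(len(p))                     # 22
print(p['week'].iloc[-1].date())  # 2021-05-31
```

This reproduces the 22-row table shown above: 8 original weeks plus 14 appended Mondays.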


resampling a pandas dataframe from almost-weekly to daily

What's the most succinct way to resample this dataframe:
>>> uneven = pd.DataFrame({'a': [0, 12, 19]}, index=pd.DatetimeIndex(['2020-12-08', '2020-12-20', '2020-12-27']))
>>> print(uneven)
a
2020-12-08 0
2020-12-20 12
2020-12-27 19
...into this dataframe:
>>> daily = pd.DataFrame({'a': range(20)}, index=pd.date_range('2020-12-08', periods=3*7-1, freq='D'))
>>> print(daily)
a
2020-12-08 0
2020-12-09 1
...
2020-12-19 11
2020-12-20 12
2020-12-21 13
...
2020-12-27 19
NB: 12 days between the 8th and 20th Dec, 7 days between the 20th and 27th.
Also, to give clarity of the kind of interpolation/resampling I want to do:
>>> print(daily.diff())
a
2020-12-08 NaN
2020-12-09 1.0
2020-12-10 1.0
...
2020-12-19 1.0
2020-12-20 1.0
2020-12-21 1.0
...
2020-12-27 1.0
The actual data is hierarchical and has multiple columns, but I wanted to start with something I could get my head around:
                      first_dose  second_dose
date       areaCode
2020-12-08 E92000001         0.0          0.0
           N92000002         0.0          0.0
           S92000003         0.0          0.0
           W92000004         0.0          0.0
2020-12-20 E92000001    574829.0          0.0
           N92000002     16068.0          0.0
           S92000003     60333.0          0.0
           W92000004     24056.0          0.0
2020-12-27 E92000001    267809.0          0.0
           N92000002     14948.0          0.0
           S92000003     34535.0          0.0
           W92000004     12495.0          0.0
2021-01-03 E92000001    330037.0      20660.0
           N92000002      9669.0       1271.0
           S92000003     21446.0         44.0
           W92000004     14205.0         27.0
I think you need:
df = df.reset_index('areaCode').groupby('areaCode')[['first_dose','second_dose']].resample('D').interpolate()
print (df)
first_dose second_dose
areaCode date
E92000001 2020-12-08 0.000000 0.000000
2020-12-09 47902.416667 0.000000
2020-12-10 95804.833333 0.000000
2020-12-11 143707.250000 0.000000
2020-12-12 191609.666667 0.000000
... ...
W92000004 2020-12-30 13227.857143 11.571429
2020-12-31 13472.142857 15.428571
2021-01-01 13716.428571 19.285714
2021-01-02 13960.714286 23.142857
2021-01-03 14205.000000 27.000000
[108 rows x 2 columns]
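The same resample/interpolate idea, applied to the simple single-column frame from the question, gives exactly the daily series asked for (a minimal sketch):

```python
import pandas as pd

uneven = pd.DataFrame({'a': [0, 12, 19]},
                      index=pd.DatetimeIndex(['2020-12-08', '2020-12-20',
                                              '2020-12-27']))

# Upsample to daily frequency, then fill the gaps by linear interpolation.
daily = uneven.resample('D').interpolate()

print(daily['a'].tolist())  # [0.0, 1.0, 2.0, ..., 19.0]
```

Because 0 to 12 spans 12 days and 12 to 19 spans 7 days, the interpolated step is exactly 1 per day, matching the diff() shown in the question.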

How do I change the starting point of x-ticks (using datetime data) in Matplotlib?

I have the following dataframe:
Date Prod_1 Prod_2 Clients Clients Growth
0 2016-08-01 17768 0.0 17768.0 9.877308
1 2016-09-01 19523 0.0 19523.0 10.295549
2 2016-10-01 21533 0.0 21533.0 7.709098
3 2016-11-01 23193 0.0 23193.0 17.410426
4 2016-12-01 27231 0.0 27231.0 -3.473982
5 2017-01-01 26285 0.0 26285.0 0.604908
6 2017-02-01 26444 0.0 26444.0 26.864317
7 2017-03-01 33548 0.0 33548.0 -12.626684
8 2017-04-01 29312 0.0 29312.0 21.114219
9 2017-05-01 35501 0.0 35501.0 6.577280
10 2017-06-01 37836 0.0 37836.0 3.282588
11 2017-07-01 39078 0.0 39078.0 7.733251
12 2017-08-01 42100 0.0 42100.0 -3.111639
13 2017-09-01 40790 0.0 40790.0 5.339544
14 2017-10-01 42968 0.0 42968.0 -5.797338
15 2017-11-01 40477 0.0 40477.0 13.508906
16 2017-12-01 45945 0.0 45945.0 -11.881598
17 2018-01-01 40486 0.0 40486.0 5.893395
18 2018-02-01 42872 0.0 42872.0 16.323008
19 2018-03-01 49870 0.0 49870.0 -4.958893
20 2018-04-01 47397 0.0 47397.0 13.408022
21 2018-05-01 53752 0.0 53752.0 -12.354889
22 2018-06-01 47111 0.0 47111.0 13.733523
23 2018-07-01 53581 0.0 53581.0 3.939829
24 2018-08-01 55692 0.0 55692.0 -6.834016
25 2018-09-01 51886 0.0 51886.0 9.784913
26 2018-10-01 56963 0.0 56963.0 -0.405526
27 2018-11-01 56732 0.0 56732.0 4.343228
28 2018-12-01 59196 0.0 59196.0 -3.327928
29 2019-01-01 57221 5.0 57226.0 -2.200049
30 2019-02-01 55495 472.0 55967.0 18.189290
31 2019-03-01 65394 753.0 66147.0 -8.984534
32 2019-04-01 59030 1174.0 60204.0 11.718490
33 2019-05-01 64466 2793.0 67259.0 -6.504706
34 2019-06-01 58471 4413.0 62884.0 12.739330
35 2019-07-01 64785 6110.0 70895.0 1.747655
36 2019-08-01 63774 8360.0 72134.0 2.423268
37 2019-09-01 64324 9558.0 73882.0 3.926531
38 2019-10-01 65733 11050.0 76783.0 NaN
And I need to plot a time series of the 'Clients Growth' column.
The 'Date' column is in the pandas datetime format.
So I used the following command:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,4))
ax = fig.add_subplot(1,1,1)
plt.plot('Date', 'Clients Growth', data=test, linewidth=2, color='steelblue')
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
plt.xticks(rotation=45, ha='right');
Output: a line plot of 'Clients Growth' with x-ticks every 3 months, starting at 2016-07.
As you can see, I have changed the x-ticks interval to 3 months.
However, by default, matplotlib has started the x-ticks in 2016-07, and I would like the starting point to be in the first month that I have data (2016-08).
OBS: I know that if I change my interval to 1 month instead of 3, the starting point of the x-ticks will be 2016-08, but I want to keep the interval as 3 months.
How can I solve this problem?
Thanks in advance.
You can provide a list of months to tick:
MonthLocator(bymonth=(2, 5, 8, 11))
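A minimal runnable sketch of this fix, using hypothetical stand-in data shaped like the question's frame (monthly rows starting 2016-08-01):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the 'Date' and 'Clients Growth' columns from the question.
dates = pd.date_range('2016-08-01', periods=39, freq='MS')
growth = range(39)

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(dates, growth, linewidth=2, color='steelblue')

# Tick only Feb/May/Aug/Nov, so the first tick lands on 2016-08 while the
# 3-month spacing is kept.
ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=(2, 5, 8, 11)))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
fig.autofmt_xdate(rotation=45)
```

The key difference from `MonthLocator(interval=3)` is that `bymonth` pins the ticks to specific calendar months rather than counting from wherever the axis happens to start.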

How to find a position of a last ocurrence of certain value in a pandas dataframe?

In a dataframe where one column is datetime and another one is only ones or zeros, how can I find the times of each of the last occurences of 1?
For example:
import numpy as np
import pandas as pd

times = pd.date_range(start="1/1/2015", end="2/1/2015", freq='D')
YN = np.zeros(len(times))
YN[0:8] = 1
YN[12:20] = 1
YN[25:29] = 1
df = pd.DataFrame({"Time": times, "Yes No": YN})
print(df)
Which looks like
Time Yes No
0 2015-01-01 1.0
1 2015-01-02 1.0
2 2015-01-03 1.0
3 2015-01-04 1.0
4 2015-01-05 1.0
5 2015-01-06 1.0
6 2015-01-07 1.0
7 2015-01-08 1.0
8 2015-01-09 0.0
9 2015-01-10 0.0
10 2015-01-11 0.0
11 2015-01-12 0.0
12 2015-01-13 1.0
13 2015-01-14 1.0
14 2015-01-15 1.0
15 2015-01-16 1.0
16 2015-01-17 1.0
17 2015-01-18 1.0
18 2015-01-19 1.0
19 2015-01-20 1.0
20 2015-01-21 0.0
21 2015-01-22 0.0
22 2015-01-23 0.0
23 2015-01-24 0.0
24 2015-01-25 0.0
25 2015-01-26 1.0
26 2015-01-27 1.0
27 2015-01-28 1.0
28 2015-01-29 1.0
29 2015-01-30 0.0
30 2015-01-31 0.0
31 2015-02-01 0.0
How could I extract the dates that have the last occurrence of 1 before another series of zeros, in this case 8/1/2015, 20/1/2015 and 29/1/2015?
This question addresses a similar problem, but I don't want all of the ones, I just want the last one before it changes to zero (and not only the one where it happens for the first time).
you can use Series.shift(-1) in conjunction with Series.diff() methods
In [42]: df.loc[df['Yes No'].shift(-1).diff().eq(-1)]
Out[42]:
Time Yes No
7 2015-01-08 1.0
19 2015-01-20 1.0
28 2015-01-29 1.0
In [43]: df.loc[df['Yes No'].shift(-1).diff().eq(-1), 'Time']
Out[43]:
7 2015-01-08
19 2015-01-20
28 2015-01-29
Name: Time, dtype: datetime64[ns]
Explanation:
In [44]: df['Yes No'].shift(-1).diff()
Out[44]:
0 NaN
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 -1.0
8 0.0
9 0.0
10 0.0
11 1.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 -1.0
20 0.0
21 0.0
22 0.0
23 0.0
24 1.0
25 0.0
26 0.0
27 0.0
28 -1.0
29 0.0
30 0.0
31 NaN
Name: Yes No, dtype: float64
You can use diff with eq for a boolean mask and filter by boolean indexing:
print (df[df['Yes No'].diff(-1).eq(1)])
Time Yes No
7 2015-01-08 1.0
19 2015-01-20 1.0
28 2015-01-29 1.0
print (df.loc[df['Yes No'].diff(-1).eq(1), 'Time'])
7 2015-01-08
19 2015-01-20
28 2015-01-29
Name: Time, dtype: datetime64[ns]
numpy
v = df['Yes No'].values
df[(v - np.append(v[1:], 0) == 1)]
Time Yes No
7 2015-01-08 1.0
19 2015-01-20 1.0
28 2015-01-29 1.0
v = df['Yes No'].values
df.Time[(v - np.append(v[1:], 0) == 1)]
7 2015-01-08
19 2015-01-20
28 2015-01-29
Name: Time, dtype: datetime64[ns]
Here's an approach using pandas groupby.
It could be useful if you plan to do many operations on this kind of data.
def find_consecutive(x, on=None):
    # Group consecutive runs of equal values in column `on`
    if on is None:
        on = x.columns
    return x.groupby([(x[on] != x[on].shift()).cumsum(), x[on]])

grouped = df.pipe(lambda x: find_consecutive(x, on='Yes No'))

# For each run, extract the last row
# (explicitly: .apply(lambda g: g['Time'].iloc[-1]))
last_dates = grouped.last().reset_index(level=1, drop=False)

# A bit of formatting to extract only the dates for "Yes" (there is probably
# a cleaner way to do this)
yes_last_dates = last_dates[last_dates['Yes No'] == 1]['Time'].reset_index(drop=True)
This gives the expected result:
0 2015-01-08
1 2015-01-20
2 2015-01-29
You can inspect grouped doing the following:
for key, group in grouped:
    print(key, group)
(1, 1.0) Time Yes No
0 2015-01-01 1.0
1 2015-01-02 1.0
2 2015-01-03 1.0
3 2015-01-04 1.0
4 2015-01-05 1.0
5 2015-01-06 1.0
6 2015-01-07 1.0
7 2015-01-08 1.0
(2, 0.0) Time Yes No
8 2015-01-09 0.0
9 2015-01-10 0.0
10 2015-01-11 0.0
11 2015-01-12 0.0
(3, 1.0) Time Yes No
12 2015-01-13 1.0
13 2015-01-14 1.0
14 2015-01-15 1.0
15 2015-01-16 1.0
16 2015-01-17 1.0
17 2015-01-18 1.0
18 2015-01-19 1.0
19 2015-01-20 1.0
....
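For completeness, the diff(-1) mask can be checked end to end on the frame from the question (a minimal sketch):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
times = pd.date_range(start="1/1/2015", end="2/1/2015", freq='D')
YN = np.zeros(len(times))
YN[0:8] = 1
YN[12:20] = 1
YN[25:29] = 1
df = pd.DataFrame({"Time": times, "Yes No": YN})

# A row is the last 1 of a run exactly when the value drops by 1 going to
# the next row (diff(-1) compares each value with the one after it).
last_ones = df.loc[df['Yes No'].diff(-1).eq(1), 'Time']

print(last_ones.dt.strftime('%Y-%m-%d').tolist())
# ['2015-01-08', '2015-01-20', '2015-01-29']
```

Note the final row never matches: its diff(-1) is NaN, so a trailing run of ones at the very end of the frame would need separate handling.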
