Pandas percentage difference calculation - Python

I have the Pandas dataframe below. The first column is a date in YYYY-MM-DD format. The data is month on month, but a month's first available date is not necessarily the 1st, and its last available date is not necessarily the 30th/31st (or the 28th/29th in the case of February); it varies. For example, Feb 2020 has data only from 2020-02-03, and the last available date for February is 2020-02-28 (not the 29th).
Date start_Value end_value
2020-01-01 115 120
2020-01-02 122 125
2020-01-03 125.2 126
...
2020-01-31 132 135
2020-02-03 135.5 137
2020-02-04 137.8 138
...
2020-02-28 144 145
My objective is to create a new column that holds the percentage difference between the end value on the previous month's last available date in the dataframe and the end value on the current month's last available date. It should be 0 for all dates except the last available date of each month. For Jan 2020, since we don't have the previous month's data, the percentage difference should be calculated using the end value of that month's first available date.
For Jan 2020, the percentage difference will be calculated between the end value on 2020-01-01 and the end value on 2020-01-31.
For the rest (for example, Feb 2020), the percentage difference is calculated between the end value on 2020-01-31 and the end value on 2020-02-28.
Date start_Value end_value percentage difference
2020-01-01 115 120 0
2020-01-02 122 125 0
2020-01-03 125.2 126 0
...
2020-01-31 132 135 17.4
2020-02-03 135.5 137 0
2020-02-04 137.8 138 0
...
2020-02-28 144 145 7.41
How can I achieve this in Python and pandas?

Check with transform and duplicated:
s = df.Date.dt.strftime('%Y-%m')   # month key
# last end_value of each month divided by that month's first start_Value,
# kept only on the month's last available row (the other rows become NaN;
# multiply by 100 and .fillna(0) if you want the output exactly as requested)
df['pct'] = (df.groupby(s)['end_value'].transform('last')
             / df.groupby(s)['start_Value'].transform('first') - 1
             ).mask(s.duplicated(keep='last'))
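If you want the column laid out exactly as described (previous month's last end_value as the baseline, zeros on every other row, result in percent), a sketch along these lines may help; the column names follow the question, and the fallback for the first month (its own first end_value) is my reading of the requirement:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
month = df['Date'].dt.to_period('M')

# last available end_value of each month, broadcast to every row of that month
last_end = df.groupby(month)['end_value'].transform('last')

# previous month's last end_value; for the first month fall back to that
# month's own first end_value (use the first start_Value instead if that is
# what the 17.4 figure in the question is meant to come from)
prev_last = df.groupby(month)['end_value'].last().shift().reindex(month).to_numpy()
baseline = pd.Series(prev_last, index=df.index).fillna(
    df.groupby(month)['end_value'].transform('first'))

pct = (last_end / baseline - 1) * 100
# non-zero only on each month's last available date
df['percentage difference'] = pct.where(~month.duplicated(keep='last'), 0).round(2)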

Related

Pandas MultiIndex: Partial indexing on second level

I have a data-set open in Pandas with a 2-level MultiIndex. The first level of the MultiIndex is a unique ID (SID) while the second level is time (ISO_TIME). A sample of the data-set is given below.
SEASON NATURE NUMBER
SID ISO_TIME
2020138N10086 2020-05-16 12:00:00 2020 NR 26
2020-05-16 15:00:00 2020 NR 26
2020-05-16 18:00:00 2020 NR 26
2020-05-16 21:00:00 2020 NR 26
2020-05-17 00:00:00 2020 NR 26
2020155N17072 2020-06-02 18:00:00 2020 NR 30
2020-06-02 21:00:00 2020 NR 30
2020-06-03 00:00:00 2020 NR 30
2020-06-03 03:00:00 2020 NR 30
2020-06-03 06:00:00 2020 NR 30
2020327N11056 2020-11-21 18:00:00 2020 NR 103
2020-11-21 21:00:00 2020 NR 103
2020-11-22 00:00:00 2020 NR 103
2020-11-22 03:00:00 2020 NR 103
2020-11-22 06:00:00 2020 NR 103
2020329N10084 2020-11-23 12:00:00 2020 NR 104
2020-11-23 15:00:00 2020 NR 104
2020-11-23 18:00:00 2020 NR 104
2020-11-23 21:00:00 2020 NR 104
2020-11-24 00:00:00 2020 NR 104
I can do df.loc[("2020138N10086")] to select rows with SID=2020138N10086 or df.loc[("2020138N10086", "2020-05-17")] to select rows with SID=2020138N10086 and are on 2020-05-17.
What I want to do, but not able to, is to partially index using the second level of MultiIndex. That is, select all rows on 2020-05-17, irrespective of the SID.
I have read through Pandas MultiIndex / advanced indexing which explains how indexing is done with MultiIndex. But nowhere in it could I find how to do a partial indexing on the second/inner level of a Pandas MultiIndex. Either I missed it in the document or it is not explained in there.
So, is it possible to do a partial indexing in the second level of a Pandas MultiIndex?
If it is possible, how do I do it?
You can do this with slicing; see the pandas documentation on advanced indexing.
Example for your dataframe:
df.loc[(slice(None), '2020-05-17'), :]
Alternatively, reset the index and filter with a boolean mask:
df = df.reset_index()
dates_rows = df[df["ISO_TIME"] == "2020-05-17"]
If you want, you can convert it back to a multi-level index again, like below:
df.set_index(['SID', 'ISO_TIME'], inplace=True)
Or use a cross-section:
df.xs('2020-05-17', level="ISO_TIME")

How to shift timestamp for the rows in dataframe on the specific date

I am working on a churn prediction use case; here is part of the dataset (short version):
ID Timestamp
0 026 2017-07-01
1 026 2017-08-01
2 026 2017-09-01
3 026 2017-10-01
4 026 2017-11-01
... ... ...
283 327 2019-05-01
284 327 2019-06-01
285 327 2019-07-01
... ... ...
528 500 2018-01-01
529 500 2018-02-01
The observation period runs, for example, from the start date 2017-07-01 until 2019-12-01.
First, I have to find all the users whose first date is greater than the start date of the observation period (2017-07-01) and then shift all their rows so that the first row starts on the start date of the observation period.
For example, ID 026 is active from the start of the observation period, so that is fine; no transformation is needed for it.
But IDs 327 and 500 start their activity later than the start of the observation period, so I should shift all their dates (rows) back accordingly.
After the transformation, the dataframe should look like:
ID Timestamp
0 026 2017-07-01
1 026 2017-08-01
2 026 2017-09-01
3 026 2017-10-01
4 026 2017-11-01
... ... ...
283 327 2017-07-01
284 327 2017-08-01
285 327 2017-09-01
... ... ...
528 500 2017-07-01
529 500 2017-08-01
IIUC, you can do a groupby.cumcount and offset:
df.Timestamp = pd.to_datetime(df.Timestamp)
# rebase every ID to the global start date, advancing one month per successive row
df['Timestamp'] = df.Timestamp.min() + pd.DateOffset(months=1) * df.groupby('ID').cumcount()
Output:
ID Timestamp
0 26 2017-07-01
1 26 2017-08-01
2 26 2017-09-01
3 26 2017-10-01
4 26 2017-11-01
283 327 2017-07-01
284 327 2017-08-01
285 327 2017-09-01
528 500 2017-07-01
529 500 2017-08-01
This approach sets all data to continuous months starting from the minimum date. If you just want to shift the dates, then a groupby().transform('min') would do:
df.Timestamp -= df.groupby('ID')['Timestamp'].transform('min') - df.Timestamp.min()
Try using the min() function to find the lowest date for each ID and work from there.
To get a dictionary of how much you need to shift the timestamps in each ID by:
shift_dict = {}
for id in df.ID.unique():
    shift = min(df[df['ID'] == id]['Timestamp']) - min(df[df['ID'] == '026']['Timestamp'])
    shift_dict[id] = shift
As a dict comprehension:
shift_dict = {id: (min(df[df['ID'] == id]['Timestamp']) - min(df[df['ID'] == '026']['Timestamp'])) for id in df.ID.unique()}
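A possible way to apply that dictionary (not part of the original answer) is to map it onto the ID column and subtract the resulting offsets:
# shift each ID's rows back by its own offset so that its first row lands on
# the observation start date; shift_dict holds one Timedelta per ID (see above)
df['Timestamp'] = df['Timestamp'] - df['ID'].map(shift_dict)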

Add different missing dates for groups of rows

Let's suppose that I have a dataset which consists of the following columns:
Stock_id: the id of a stock
Date: a date of 2018 e.g. 25/03/2018
Stock_value: the value of the stock at this specific date
I have some dates, different for each stock, which are entirely missing from the dataset and I would like to fill them in.
By missing dates, I mean that there is not even a row for these dates; it is not that the dates exist in the dataset with the Stock_value simply being NA, etc.
A limitation is that some stocks were introduced to the stock market at some point during 2018, so obviously I do not want to fill in dates for these stocks for the period when they did not exist.
By this I mean that if a stock was introduced to the stock market on 21/05/2018, then I want to fill in any missing dates for this stock from 21/05/2018 to 31/12/2018, but not dates before 21/05/2018.
What is the most efficient way to do this?
I have seen some posts on Stack Overflow (post_1, post_2, etc.), but I think my case is more specialized, so I would like to see an efficient way to do this.
Let me provide an example. Let's limit this to just two stocks and only the week from 01/01/2018 to 07/01/2018, otherwise it won't fit here.
Let's suppose that I initially have the following:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 06/01/2018 150
2 07/01/2018 147
Thus for Stock_id = 1 the date 04/01/2018 is missing.
For Stock_id = 2 the date 05/01/2018 is missing, and since the dates for this stock start at 03/01/2018, the dates before that should not be filled in (because the stock was introduced to the stock market on 03/01/2018).
Hence, I would like to have the following as output:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 04/01/2018 NA
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 05/01/2018 NA
2 06/01/2018 150
2 07/01/2018 147
Use asfreq per group, although with large data performance may be problematic:
# make sure Date is datetime first (the question's dates appear to be day-first)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.asfreq('D'))
        .reset_index()
      )
print(df)
Stock_id Date Stock_value
0 1 2018-01-01 124.0
1 1 2018-01-02 130.0
2 1 2018-01-03 136.0
3 1 2018-01-04 NaN
4 1 2018-01-05 129.0
5 1 2018-01-06 131.0
6 1 2018-01-07 133.0
7 2 2018-01-03 144.0
8 2 2018-01-04 148.0
9 2 2018-01-05 NaN
10 2 2018-01-06 150.0
11 2 2018-01-07 147.0
EDIT:
If you want to extend each group from its minimal datetime up to some fixed maximum datetime, use reindex with date_range:
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20')))
        .reset_index()
      )
Another option is to unstack by stock, forward-fill along the dates, and stack back (note that this fills the gaps with the previous value rather than leaving NaN):
df.set_index(['Date', 'Stock_id']).unstack().fillna(method='ffill').stack().reset_index()
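If the per-group apply turns out to be too slow, a merge-based variant is another option; this is a sketch (not from the original answers), assuming Date is already a datetime column:
# build each stock's own daily calendar, then left-merge the data onto it;
# dates a stock never traded on end up as NaN in Stock_value
calendar = (df.groupby('Stock_id')['Date']
              .apply(lambda s: pd.Series(pd.date_range(s.min(), s.max(), freq='D')))
              .rename('Date')
              .reset_index(level=0)
              .reset_index(drop=True))
out = calendar.merge(df, on=['Stock_id', 'Date'], how='left')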

Automating interpolation of missing values in pandas dataframe

I have a dataframe with airline booking data for the past year for a particular origin and destination. There are hundreds of similar data-sets in the system.
In each data-set, there are holes in the data. In the current example, there are about 85 days of the year for which we don't have booking data.
There are two columns here - departure_date and bookings.
The next step for me would be to include the missing dates in the date column, and set the corresponding values in bookings column to NaN.
I am looking for the best way to do this.
Please find a part of the dataFrame below:
Index departure_date bookings
0 2017-11-02 00:00:00 43
1 2017-11-03 00:00:00 27
2 2017-11-05 00:00:00 27 ********
3 2017-11-06 00:00:00 22
4 2017-11-07 00:00:00 39
.
.
164 2018-05-22 00:00:00 17
165 2018-05-23 00:00:00 41
166 2018-05-24 00:00:00 73
167 2018-07-02 00:00:00 4 *********
168 2018-07-03 00:00:00 31
.
.
277 2018-10-31 00:00:00 50
278 2018-11-01 00:00:00 60
We can see that the data-set is for a one year period (Nov 2, 2017 to Nov 1, 2018). But we have data for 279 days only. For example, we don't have any data between 2018-05-25 and 2018-07-01. I would have to include these dates in the departure_date column and set the corresponding booking values to NaN.
For the second step, I plan to do some interpolation using something like
dataFrame['bookings'].interpolate(method='time', inplace=True)
Please suggest if there are better alternatives in Python.
This resamples to daily frequency and then fills the gaps (note that resample needs departure_date to be the DatetimeIndex):
dataFrame['bookings'].resample('D').pad()
You can find more resampling options on this page (so you can pick the one that best fits your needs):
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
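Putting the two steps from the question together (insert the missing dates as NaN, then interpolate), a sketch starting again from the original dataFrame; departure_date is assumed to already be datetime:
dataFrame = dataFrame.set_index('departure_date')
daily = dataFrame['bookings'].resample('D').asfreq()   # missing days appear as NaN
daily = daily.interpolate(method='time')               # time-weighted interpolation
dataFrame = daily.reset_index()                        # back to two columns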

Pandas get unique monthly data based on date range

I have something like the following dataframe:
d=pd.DataFrame()
d['id']=['a','a','a','b','b','c']
d['version_start']=['2017-01-01','2017-02-12','2017-03-25','2017-01-01','2017-6-15','2017-01-22']
d['version_end']=['2017-02-11','2017-03-24','2017-08-01','2017-06-14','2018-01-01','2018-01-01']
d['version_start']=pd.to_datetime(d.version_start)
d['version_end']=pd.to_datetime(d.version_end)
d['values']=[10,15,20,5,6,200]
print(d)
id version_start version_end values
0 a 2017-01-01 2017-02-11 10
1 a 2017-02-12 2017-03-24 15
2 a 2017-03-25 2017-08-01 20
3 b 2017-01-01 2017-06-14 5
4 b 2017-06-15 2018-01-01 6
5 c 2017-01-22 2018-01-01 200
The version start and version end represent for each ID, the date range for which that row can be considered valid. For example, the total values for a given date would be the records for which that date is between the version start and version end.
I am looking to get for a set of dates (the first of the month for each month in 2017) the sum of the "values" field. I can do this by looping through each month as follows:
df = pd.DataFrame()
for month in pd.date_range('2017-01-01', '2018-01-01', freq='MS'):
    s = d[(d.version_start <= month) & (d.version_end > month)]
    s['month'] = month
    s = s.set_index(['month', 'id'])[['values']]
    df = df.append(s)
print(df.groupby(level='month')['values'].sum())
2017-01-01 15
2017-02-01 215
2017-03-01 220
2017-04-01 225
2017-05-01 225
2017-06-01 225
2017-07-01 226
2017-08-01 206
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
Name: values, dtype: int64
Is there a more elegant / efficient solution that doesn't require looping through this list of dates?
# roll version_start and version_end forward to a month start (unchanged if
# already on the 1st), expand each row to one entry per covered month start,
# then sum the values per month
d.version_start = d.version_start + pd.offsets.MonthBegin(0)
d.version_end = d.version_end + pd.offsets.MonthBegin(0)
d['New'] = d[['version_start', 'version_end']].apply(
    lambda x: pd.date_range(start=x.version_start, end=x.version_end, freq='MS').tolist(), 1)
d.set_index(['id', 'version_start', 'version_end', 'values']).New.apply(pd.Series).stack().reset_index('values').groupby(0)['values'].sum()
Out[845]:
0
2017-01-01 15
2017-02-01 215
2017-03-01 230
2017-04-01 240
2017-05-01 225
2017-06-01 225
2017-07-01 231
2017-08-01 226
2017-09-01 206
2017-10-01 206
2017-11-01 206
2017-12-01 206
2018-01-01 206
Name: values, dtype: int64
I keep thinking there should be a more elegant way to do this, but for now:
s = pd.Series(0, index=pd.date_range('2017-01-01', '2018-01-01', freq='MS'))
for _id, start, end, values in d.itertuples(index=False):
    s[start:end] += values
This returns the proper series, and works with any series for that matter.
If you want the version_end day to be excluded, a quick fix is to add this line before the for loop (it only works if you are using 'MS' as the frequency):
d.version_end = d.version_end.apply(lambda t: t.replace(day=2))
I think the idea of using explicit indexing is cleaner than conditional indexing based on comparisons between dates, which at scale is terribly slow (timestamps are a valid alternative if you are forced to do this on huge arrays).
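A vectorized alternative (not one of the answers above), applied to the original d from the question, that mirrors the question's loop by counting a row towards a month when version_start <= month < version_end:
import pandas as pd

months = pd.date_range('2017-01-01', '2018-01-01', freq='MS')
# rows x months boolean matrix of which record is active at each month start
active = ((d['version_start'].values[:, None] <= months.values) &
          (d['version_end'].values[:, None] > months.values))
totals = pd.Series(active.T.astype(int) @ d['values'].values, index=months)
print(totals)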
