Remove higher level index names after pivot [duplicate] - python

I have the following dataframe:
import datetime
import random
import pandas as pd

dates = [str(datetime.datetime(2020, 1, 1, 0, 0, 0, 0) + datetime.timedelta(days=i)) for i in range(3)]
repetitions = [3, 6, 4]
dates = [i for i, j in zip(dates, repetitions) for k in range(j)]
cities_ = ['Paris', 'Tokyo', 'Sydney', 'New-York', 'Rio', 'Berlin']
cities = [cities_[0: repetitions[i]] for i in range(len(repetitions))]
cities = [i for j in cities for i in j]
temperatures = [round(random.normalvariate(20, 5), 1) for _ in range(len(cities))]
humidities = [round(random.normalvariate(0.5, 0.4), 1) for _ in range(len(cities))]
humidities = [min(i, 1) for i in humidities]
humidities = [max(i, 0) for i in humidities]
df = pd.DataFrame(data=list(zip(dates, cities, temperatures, humidities)), columns=['date', 'city', 'temperature', 'humidity'])
I need to remove the higher-level column indexes after applying the pivot function; the code below
values = ['temperature', 'humidity']
df_ = df.pivot(index='date', columns='city', values=values)
Col = list(set(df['city'].values))
for value in values:
    df_.rename(columns={i: value + '_' + i for i in Col}, inplace=True)
outputs:
temperature ... humidity
city temperature_Berlin temperature_New-York temperature_Paris temperature_Rio ... temperature_Paris temperature_Rio temperature_Sydney temperature_Tokyo
date ...
2020-01-01 00:00:00 NaN NaN 21.2 NaN ... 0.3 NaN 1.0 1.0
2020-01-02 00:00:00 18.4 14.2 19.3 28.7 ... 0.6 0.6 0.1 0.2
2020-01-03 00:00:00 NaN 31.6 25.9 NaN ... 0.8 NaN 0.1 0.0
and I need the following result:
temperature_Paris humidity_Paris temperature_Tokyo humidity_Tokyo temperature_Sydney ... humidity_New-York temperature_Rio humidity_Rio temperature_Berlin humidity_Berlin
2020-01-01 00:00:00 21.2 0.3 17.5 1.0 26.3 ... NaN NaN NaN NaN NaN
2020-01-02 00:00:00 19.3 0.6 15.1 0.2 22.8 ... 0.1 28.7 0.6 18.4 0.4
2020-01-03 00:00:00 25.9 0.8 27.5 0.0 29.7 ... 0.6 NaN NaN NaN NaN
The solutions offered for similar-looking questions, which are essentially:
df_ = df_.reset_index().rename_axis([None, None], axis=1)
do not work here.

Replace:
Col = list(set(df['city'].values))
for value in values:
    df_.rename(columns={i: value + '_' + i for i in Col}, inplace=True)
With:
df_.columns = ['_'.join(i) for i in df_.columns]
Outputs:
temperature_Berlin temperature_New-York ... humidity_Sydney humidity_Tokyo
date
2020-01-01 00:00:00 NaN NaN ... 0.3 0.6
2020-01-02 00:00:00 23.3 26.3 ... 0.8 0.0
2020-01-03 00:00:00 NaN 14.6 ... 0.2 0.6
Edit:
A probably more elegant alternative, as suggested by @Henry Ecker in the comments:
df_.columns = df_.columns.map('_'.join)

You can use Index.map() with an f-string, as follows:
df_.columns = df_.columns.map(lambda x: f'{x[0]}_{x[1]}')
This way, you are free to arrange the order of the combined parts of the MultiIndex as you wish. E.g. if you want the city name first, followed by the word 'temperature' (e.g. Berlin_temperature instead), just reverse x[0] and x[1] in the f-string above, as sketched after the result below.
Result:
print(df_)
temperature_Berlin temperature_New-York temperature_Paris temperature_Rio temperature_Sydney temperature_Tokyo humidity_Berlin humidity_New-York humidity_Paris humidity_Rio humidity_Sydney humidity_Tokyo
date
2020-01-01 00:00:00 NaN NaN 22.8 NaN 24.7 28.8 NaN NaN 1.0 NaN 0.9 0.0
2020-01-02 00:00:00 20.2 21.5 21.6 21.6 4.3 21.5 0.5 0.5 1.0 0.4 0.4 0.0
2020-01-03 00:00:00 NaN 17.3 24.4 NaN 11.3 22.7 NaN 0.4 0.1 NaN 0.0 0.5
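For instance, a minimal sketch of the reversed ordering, applied to the original MultiIndex columns instead of the mapping above (only the f-string order changes):
df_.columns = df_.columns.map(lambda x: f'{x[1]}_{x[0]}')
# yields e.g. 'Berlin_temperature', 'Paris_humidity', ... instead of 'temperature_Berlin'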

Related

Coalesce values only from columns where column matches with data dates

I have a data frame similar to the one below.
Date 20180601T32 20180604T33 20180605T32 20180610T33
2018-06-04 0.1 0.5 4.5 nan
2018-06-05 1.5 0.2 nan 0
2018-06-07 1.1 1.6 nan nan
2018-06-10 0.4 1.1 0 0.3
The values in the columns '20180601', '20180604', '20180605' and '20180610' need to be coalesced into a new column.
I am using the bfill method as below, but it selects the first value in the row.
coalesce_columns = ['20180601', '20180604', '20180605', '20180610']
df['obs'] = df[coalesce_columns].bfill(axis=1).iloc[:, 0]
But instead of taking the value from the first column, the value should be taken from the column whose name matches the row's 'Date'. The expected output should be:
Date 20180601T32 20180604T33 20180605T32 20180610T33 Obs
2018-06-04 0.1 0.5 4.5 nan 0.5
2018-06-05 1.5 0.2 1.7 0 1.7
2018-06-07 1.1 1.6 nan nan nan
2018-06-10 0.4 1.1 0 0.3 0.3
Any suggestions?
Use a lookup, converting the Date column to the same format as the column names:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
If the column names are possibly integers:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.rename(columns=str).reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
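For the column names exactly as posted (with the trailing 'T..' tag), here is a self-contained sketch of the same factorize/reindex lookup, trimming the suffix first; the trimming step is an assumption on my part, not part of the answer above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-06-04', '2018-06-05', '2018-06-07', '2018-06-10'],
    '20180601T32': [0.1, 1.5, 1.1, 0.4],
    '20180604T33': [0.5, 0.2, 1.6, 1.1],
    '20180605T32': [4.5, np.nan, np.nan, 0.0],
    '20180610T33': [np.nan, 0.0, np.nan, 0.3],
})
df['Date'] = pd.to_datetime(df['Date'])

# assumed step: drop the 'T..' tag so column names match the formatted dates
lookup = df.rename(columns=lambda c: c.split('T')[0] if c != 'Date' else c)

# per-row column lookup: factorize the formatted dates, reorder columns, pick by position
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = lookup.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df)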

maximum sum of consecutive n-days using pandas

I've seen solutions in different languages (e.g. SQL, Fortran, or C++) which mainly use for loops.
I am hoping that someone can help me solve this task using pandas instead.
If I have a data frame that looks like this.
date pcp sum_count sumcum
7/13/2013 0.1 3.0 48.7
7/14/2013 48.5
7/15/2013 0.1
7/16/2013
8/1/2013 1.5 1.0 1.5
8/2/2013
8/3/2013
8/4/2013 0.1 2.0 3.6
8/5/2013 3.5
9/22/2013 0.3 3.0 26.3
9/23/2013 14.0
9/24/2013 12.0
9/25/2013
9/26/2013
10/1/2014 0.1 11.0
10/2/2014 96.0 135.5
10/3/2014 2.5
10/4/2014 37.0
10/5/2014 9.5
10/6/2014 26.5
10/7/2014 0.5
10/8/2014 25.5
10/9/2014 2.0
10/10/2014 5.5
10/11/2014 5.5
And I was hoping I could do the following:
STEP 1 : create the sum_count column by determining total count of consecutive non-zeros in the 'pcp' column.
STEP 2 : create the sumcum column and calculate the sum of non-consecutive 'pcp'.
STEP 3 : create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I have previously emphasized that the sum_count should only return the maximum of 3 consecutive pcps. But I mistakenly gave the wrong data frame, so I had to edit it. Sorry.
The sumcum of 135.5 came from 96.0 + 2.5 + 37.0. It is the maximum consecutive 3 pcps within the sum_count 11.
Thank you
Use:
#filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#test NaNs
m = df['pcp'].isna()
#groups by consecutive non NaNs
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#filter no NaNs rows
df = df[~m].copy()
#keep only groups with at least N consecutive rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#get rolling sum per group over N days
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#yearly maximum of the rolling sums, adding any missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
First, convert date to a real datetime dtype and create a boolean mask that keeps rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
# group key: increments whenever a date is not exactly one day after the previous non-null date
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

# first index, size and sum of each consecutive run, joined back onto the run's first row
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

# yearly maximum of the per-run sums
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
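Note that the pivot above reports the full total of each run (210.6 for the 2014 run), whereas the updated question asks for the maximum sum over 3 consecutive days within a run (135.5 = 96.0 + 2.5 + 37.0). A sketch of that extra step, reusing mask and grp from the code above (a 3-row window equals 3 days here, since each run is daily):
rolled = (df[mask]
            .groupby(grp)['pcp']
            .rolling(3)
            .sum()
            .reset_index(level=0, drop=True))
pivot3 = rolled.groupby(df.loc[mask, 'date'].dt.year).max() \
               .rename('max_sum_count').reset_index()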

Drop the rows in front by conditions

df:
Id timestamp data Date sig event1 Start End Timediff2 datadiff2 B
51253 51494 2020-01-27 06:22:08.330 19.5 2020-01-27 -1.0 0.0 NaN 1.0 NaN NaN NaN
51254 51495 2020-01-27 06:22:08.430 19.0 2020-01-27 1.0 1.0 0.0 0.0 0.1 NaN NaN
51255 51496 2020-01-27 07:19:06.297 19.5 2020-01-27 1.0 0.0 1.0 0.0 3417.967 0.0 0.000000
51256 51497 2020-01-27 07:19:06.397 20.0 2020-01-27 1.0 0.0 0.0 0.0 0.1 1.0 0.000293
51259 51500 2020-01-27 07:32:19.587 20.5 2020-01-27 1.0 0.0 0.0 1.0 793.290 1.0 0.001261
I have 2 questions:
I want to drop the rows before the rows where Timediff2 ==0.1.
Add another condition: drop these rows unless, for that row, Start == 1.
I suggest the following: first I create a 'top' flag on the row just before a row where Timediff2 == 0.1, then I filter:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Start": [np.NaN, 0.0, 1.0,0.0, 0.0],
"Timediff2": [np.NaN, 0.1, 3417, 0.1, 793]})
df["top"] = (df["Timediff2"] == 0.1).shift(-1)
df = df.loc[(df["Start"] == 1) | (df["top"] == False), :]
df = df.drop(columns="top")
The result is :
Start Timediff2
1 0.0 0.1
2 1.0 3417.0
3 0.0 0.1
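For comparison, a shorter boolean-mask sketch of the same filter (an alternative, not the answer's code; assuming the same toy frame as constructed above, before filtering). Note that filling the shifted flag with False keeps the final row, whereas the == False comparison above happens to drop it because its shifted value is NaN:
drop_flag = df["Timediff2"].eq(0.1).shift(-1, fill_value=False)  # True on the row just before a Timediff2 == 0.1 row
df_filtered = df[(df["Start"] == 1) | ~drop_flag]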

How to sum values of two dataframes with different shapes when one is multilevel index and the other one is not

I have the following dataframes: eta1.shape = (8004, 29), eta2.shape = (138,)
eta1=
id/uniqueID var0 var1 var2 var3 var4 ... var28
5171/0 10.0 2.8 0.0 5.0 1.0 ... 9.4
5171/1 40.9 2.5 3.4 4.5 1.3 ... 7.7
5171/2 60.7 3.1 5.2 6.6 3.4 ... 1.0
...
5171/57 0.5 1.3 5.1 0.5 0.2 ... 0.4
4567/0 1.5 2.0 1.0 4.5 0.1 ... 0.4
4567/1 4.4 2.0 1.3 6.4 0.1 ... 3.3
4567/2 6.3 3.0 1.5 7.6 1.6 ... 1.6
...
4567/57 0.7 1.4 1.4 0.3 4.2 ... 1.7
...
9584/0 0.3 2.6 0.0 5.2 1.6 ... 9.7
9584/1 0.5 1.2 8.3 3.4 1.3 ... 1.7
9584/2 0.7 3.0 5.6 6.6 3.0 ... 1.0
...
9584/57 0.7 1.3 0.1 0.0 2.0 ... 1.7
eta2=
id var28
5171 67.0
4567 98.9
9584 47.7
...
8707 56.3
In eta2, I have one value per id. I need to add the value for each id in eta2 to all the columns of eta1 with the same id, like eta = eta1 + eta2. So, for example, the result for id=5171 should be as follows:
eta.loc[5171] =
id/uniqueID var0 var1 ... var28
5171/0 10.0+67.0 2.8+67.0 ... 9.4+67.0
5171/1 40.9+67.0 2.5+67.0 ... 7.7+67.0
5171/2 60.7+67.0 3.1+67.0 ... 1.0+67.0
...
5171/57 0.5+67.0 1.3+67.0 ... 0.4+67.0
Doing the sum by eta = eta1.add(eta2) gives wrong results because they don't have the same levels. I can't remove the levels, I need them for later calculations. So instead I tried to add a new level to eta2 and then do the sum, but I'm getting this error:
eta2['uniq_id'] = eta2.groupby('id').cumcount()
eta2 = eta2.set_index(['uniq_id'], append=True)
eta = eta1.add(eta2, level=0)
error: TypeError: Join on level between two MultiIndex objects is ambiguous
How can I do this sum?
Use DataFrame.add with the Series eta2['var28'] and the axis=0, level=0 parameters:
eta = eta1.add(eta2['var28'], axis=0, level=0)
print (eta)
var0 var1 var2 var3 var4 var28
id uniqueID
5171 0 77.0 69.8 67.0 72.0 68.0 76.4
1 107.9 69.5 70.4 71.5 68.3 74.7
2 127.7 70.1 72.2 73.6 70.4 68.0
57 67.5 68.3 72.1 67.5 67.2 67.4
4567 0 100.4 100.9 99.9 103.4 99.0 99.3
1 103.3 100.9 100.2 105.3 99.0 102.2
2 105.2 101.9 100.4 106.5 100.5 100.5
57 99.6 100.3 100.3 99.2 103.1 100.6
9584 0 48.0 50.3 47.7 52.9 49.3 57.4
1 48.2 48.9 56.0 51.1 49.0 49.4
2 48.4 50.7 53.3 54.3 50.7 48.7
57 48.4 49.0 47.8 47.7 49.7 49.4
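A minimal self-contained sketch of the same level-based add, using a small slice of the data from the question (values taken from the tables above):
import pandas as pd

eta1 = pd.DataFrame(
    {'var0': [10.0, 40.9, 1.5, 4.4], 'var28': [9.4, 7.7, 0.4, 3.3]},
    index=pd.MultiIndex.from_tuples(
        [(5171, 0), (5171, 1), (4567, 0), (4567, 1)],
        names=['id', 'uniqueID']))
eta2 = pd.DataFrame({'var28': [67.0, 98.9]},
                    index=pd.Index([5171, 4567], name='id'))

# broadcast eta2's per-id value across every row and column of eta1 sharing that id
eta = eta1.add(eta2['var28'], axis=0, level=0)
print(eta)   # e.g. (5171, 0) -> var0 77.0, var28 76.4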
If a cumulative count level is added to the MultiIndex instead:
eta2['uniq_id'] = eta2.groupby('id').cumcount()
eta2 = eta2.set_index(['uniq_id'], append=True)
eta = eta1.add(eta2['var28'], axis=0)
print (eta)
var0 var1 var2 var3 var4 var28
id uniqueID uniq_id
4567 0 0 100.4 100.9 99.9 103.4 99.0 99.3
1 0 NaN NaN NaN NaN NaN NaN
2 0 NaN NaN NaN NaN NaN NaN
57 0 NaN NaN NaN NaN NaN NaN
5171 0 0 77.0 69.8 67.0 72.0 68.0 76.4
1 0 NaN NaN NaN NaN NaN NaN
2 0 NaN NaN NaN NaN NaN NaN
57 0 NaN NaN NaN NaN NaN NaN
8707 NaN 0 NaN NaN NaN NaN NaN NaN
9584 0 0 48.0 50.3 47.7 52.9 49.3 57.4
1 0 NaN NaN NaN NaN NaN NaN
2 0 NaN NaN NaN NaN NaN NaN
57 0 NaN NaN NaN NaN NaN NaN

Deleting values conditional on large values of another column

I have a timeseries df comprised of daily Rates in column A and the relative change from one day to the next in column B.
DF looks something like the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67
May/27/2019 20.2% 292%
May/28/2019 20.5% 1.4%
May/29/2019 20% -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
I would like to delete all values in column A which occur between large relative shifts, > +/- 50%.
So the above DF should look as the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67
May/27/2019 np.nan 292%
May/28/2019 np.nan 1.4%
May/29/2019 np.nan -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
This is where I've got to so far... I would appreciate some help:
for i, j in df1.iterrows():
    if df1['Shift'][i] > .50:
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50:
        y = df1['IR'][j]
df1['IR'] = np.where(df1['Shift'].between(x, y), df1['Shift'], np.nan)
Error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
We can locate rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values to then mask the entire DataFrame at once.
Setup
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))
IR Shift
May/24/2019 5.9 NaN
May/25/2019 6.0 1.67
May/26/2019 5.9 -1.67
May/27/2019 20.2 292.00
May/28/2019 20.5 1.40
May/29/2019 20.0 -1.60
May/30/2019 5.1 -292.00
May/31/2019 5.1 0.00
Code
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)
# Get the indices between consecutive pairs.
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1
df.loc[m, 'IR'] = np.NaN
# IR Shift
#May/24/2019 5.9 NaN
#May/25/2019 6.0 1.67
#May/26/2019 5.9 -1.67
#May/27/2019 NaN 292.00
#May/28/2019 NaN 1.40
#May/29/2019 NaN -1.60
#May/30/2019 5.1 -292.00
#May/31/2019 5.1 0.00
Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.
IR Shift IR_modified
May/24/2019 5.9 NaN 5.9
May/25/2019 6.0 1.67 6.0
May/26/2019 5.9 -1.67 5.9
May/27/2019 20.2 292.00 NaN
May/28/2019 20.5 1.40 NaN
May/29/2019 20.0 -1.60 NaN
May/30/2019 5.1 -292.00 5.1
May/31/2019 5.1 0.00 5.1
June/1/2019 7.0 415.00 NaN
June/2/2019 17.0 15.00 NaN
June/3/2019 27.0 12.00 NaN
June/4/2019 17.0 315.00 17.0
June/5/2019 7.0 -12.00 7.0
You can also use the np.where function from numpy as follows:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)
In [8]: df
Out[8]:
Date IR Shift
0 2019-05-24 NaN NaN
1 2019-05-25 0.0167 0.0167
2 2019-05-26 NaN -0.0167
3 2019-05-27 2.9200 2.9200
4 2019-05-28 0.0140 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 NaN -2.9200
Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 0.202 2.9200
4 2019-05-28 0.205 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 0.051 -2.9200
df['IR'] = [np.nan if abs(y-z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 NaN -2.9200
Using df.at to access a single value for a row/column label pair.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30), datetime(2019,5,31)],
                   'IR': [5.9, 6, 5.9, 20.2, 20.5, 20, 5.1, 5.1],
                   'Shift': [np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})
print("DataFrame Before :")
print(df)
count = 1
while count < len(df.index):
    if abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50:
        df.at[count, 'IR'] = np.nan
    count = count + 1
print("DataFrame After :")
print(df)
Output of program:
DataFrame Before :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 20.2 292.00
4 2019-05-28 20.5 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 5.1 -292.00
7 2019-05-31 5.1 0.00
DataFrame After :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 NaN 292.00
4 2019-05-28 NaN 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 NaN -292.00
7 2019-05-31 NaN 0.00
As per your description of triggering this on any large shift, positive or negative, you could do this:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 0.051 -2.9200
Steps:
abs(df.Shift) > .5: Finds shifts above +/- 50%.
.cumsum(): Gives a unique number to each period, where the odd-numbered periods are the ones we want to omit.
% 2 == 1: Checks which rows have odd numbers for cumsum().
Note: This does not work if what you want is to constrain this so that every positive spike needs to be followed by a negative spike, or vice versa.
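To make the parity trick concrete, here are the intermediate values for the sample Shift column used above (a short trace, derived directly from the data shown):
abs(df.Shift) > .5                       # False False False True  False False True
(abs(df.Shift) > .5).cumsum()            # 0     0     0     1     1     1     2
(abs(df.Shift) > .5).cumsum() % 2 == 1   # False False False True  True  True  False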
I was not sure about your Shift column, so I calculated it again. Does this work for you?
import pandas as pd
import numpy as np
df.drop(columns=['Shift'], inplace=True)  ## calculated via the method below
df['nextval'] = df['IR'].shift(periods=1)

def shift(current, previous):
    return (current - previous) / previous * 100

indexlist = []  ## to save indexes that will be set to null
prior = 0       ## temporary flag to store the value prior to a peak
flag = False
for index, row in df.iterrows():
    if index == 0:  ## to skip the first row of data
        continue
    if flag == False and (shift(row[1], row[2])) > 50:  ## to check for the start of a peak
        prior = row[2]
        indexlist.append(index)
        flag = True
        continue
    if flag == True:  ## checking until when the peak lasts
        if (shift(row[1], prior)) > 50:
            indexlist.append(index)

df.loc[df.index.isin(indexlist), 'IR'] = np.nan  ## replacing with nan
Output on print(df)
date IR nextval
0 May/24/2019 5.9 NaN
1 May/25/2019 6.0 5.9
2 May/26/2019 5.9 6.0
3 May/27/2019 NaN 5.9
4 May/28/2019 NaN 20.2
5 May/29/2019 NaN 20.5
6 May/30/2019 5.1 20.0
7 May/31/2019 5.1 5.1
df.loc[df['Shift']>0.5,'IR'] = np.nan
