I have a data frame similar to the one below.
Date 20180601T32 20180604T33 20180605T32 20180610T33
2018-06-04 0.1 0.5 4.5 nan
2018-06-05 1.5 0.2 nan 0
2018-06-07 1.1 1.6 nan nan
2018-06-10 0.4 1.1 0 0.3
The values in columns '20180601', '20180604', '20180605' and '20180610' need to be coalesced into a new column.
I am using the bfill method as below, but it selects the first value in the row.
coalesce_columns = ['20180601', '20180604', '20180605', '20180610']
df['obs'] = df[coalesce_columns].bfill(axis=1).iloc[:,0]
But instead of taking the value from the first column, the value should be taken from the column whose name matches the row's 'Date'. The expected output should be:
Date 20180601T32 20180604T33 20180605T32 20180610T33 Obs
2018-06-04 0.1 0.5 4.5 nan 0.5
2018-06-05 1.5 0.2 1.7 0 1.7
2018-06-07 1.1 1.6 nan nan nan
2018-06-10 0.4 1.1 0 0.3 0.3
Any suggestions?
Use a lookup after converting the 'Date' column to the same format as the column names:
import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# factorize maps each row's formatted date to its position among the unique date strings
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
If the column names are possibly integers:
df['Date'] = pd.to_datetime(df['Date'])
idx, cols = pd.factorize(df['Date'].dt.strftime('%Y%m%d'))
df['obs'] = df.rename(columns=str).reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date 20180601 20180604 20180605 20180610 obs
0 2018-06-04 0.1 0.5 4.5 NaN 0.5
1 2018-06-05 1.5 0.2 1.7 0.0 1.7
2 2018-06-07 1.1 1.6 NaN NaN NaN
3 2018-06-10 0.4 1.1 0.0 0.3 0.3
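For reference, the same Date-matched lookup can also be written with a melt plus a merge. This is only a minimal sketch, assuming the columns have already been renamed to plain 'YYYYMMDD' strings as in the frames printed above:
import pandas as pd

# long format: one row per (Date, column-name) pair; run this on the frame before 'obs' is added
long_df = df.melt(id_vars='Date', var_name='col', value_name='obs')
# build a lookup key that matches the column naming scheme
key = df[['Date']].assign(col=df['Date'].dt.strftime('%Y%m%d'))
# the left merge picks, for each row, the value from the column named after that row's Date
df['obs'] = key.merge(long_df, on=['Date', 'col'], how='left')['obs'].to_numpy()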
I've seen solutions in different languages (e.g. SQL, Fortran, or C++) which mainly use for loops.
I am hoping that someone can help me solve this task using pandas instead.
If I have a data frame that looks like this:
date        pcp    sum_count   sumcum
7/13/2013   0.1    3.0         48.7
7/14/2013   48.5
7/15/2013   0.1
7/16/2013
8/1/2013    1.5    1.0         1.5
8/2/2013
8/3/2013
8/4/2013    0.1    2.0         3.6
8/5/2013    3.5
9/22/2013   0.3    3.0         26.3
9/23/2013   14.0
9/24/2013   12.0
9/25/2013
9/26/2013
10/1/2014   0.1    11.0
10/2/2014   96.0               135.5
10/3/2014   2.5
10/4/2014   37.0
10/5/2014   9.5
10/6/2014   26.5
10/7/2014   0.5
10/8/2014   25.5
10/9/2014   2.0
10/10/2014  5.5
10/11/2014  5.5
And I was hoping I could do the following:
STEP 1: create the sum_count column by determining the total count of consecutive non-missing values in the 'pcp' column.
STEP 2: create the sumcum column and calculate the sum of the consecutive 'pcp' values.
STEP 3 : create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I had previously emphasized that sum_count should only return the maximum of 3 consecutive pcps. But I mistakenly gave the wrong data frame, so I had to edit it. Sorry.
The sumcum of 135.5 came from 96.0 + 2.5 + 37.0. It is the maximum sum of 3 consecutive pcps within the run where sum_count is 11.
Thank you
Use:
#filtering + rolling by days
import pandas as pd

N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#test NaNs
m = df['pcp'].isna()
#groups of consecutive non-NaNs
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#filter out NaN rows
df = df[~m].copy()
#keep only groups with at least N rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#get rolling sum per group over N-day windows
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#get yearly maxima and add missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
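For 2014 the maximal rolling window found this way is 96.0 + 2.5 + 37.0 = 135.5 (10/2 through 10/4), which matches the 135.5 described in the updated question.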
First, convert date to a real datetime dtype and create a boolean mask which keeps rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
import pandas as pd

df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()

# start a new group whenever the date is not exactly one day after the previous non-null date
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

# aggregate each run and write the result back onto its first row
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
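Note that 210.6 is the total of the whole 11-day run in 2014. If, as in the updated question, you want the maximum sum over any 3 consecutive days inside each run (135.5 for 2014), a minimal sketch reusing mask and grp from above could be:
# rolling 3-day sum inside each consecutive run of non-null pcp
roll3 = (df.loc[mask]
           .groupby(grp)['pcp']
           .rolling(3, min_periods=3)
           .sum()
           .reset_index(level=0, drop=True))
# yearly maximum of those 3-day sums
pivot3 = (roll3.groupby(df.loc[mask, 'date'].dt.year)
               .max()
               .rename('max_sum_count')
               .reset_index())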
df:
Id timestamp data Date sig event1 Start End Timediff2 datadiff2 B
51253 51494 2020-01-27 06:22:08.330 19.5 2020-01-27 -1.0 0.0 NaN 1.0 NaN NaN NaN
51254 51495 2020-01-27 06:22:08.430 19.0 2020-01-27 1.0 1.0 0.0 0.0 0.1 NaN NaN
51255 51496 2020-01-27 07:19:06.297 19.5 2020-01-27 1.0 0.0 1.0 0.0 3417.967 0.0 0.000000
51256 51497 2020-01-27 07:19:06.397 20.0 2020-01-27 1.0 0.0 0.0 0.0 0.1 1.0 0.000293
51259 51500 2020-01-27 07:32:19.587 20.5 2020-01-27 1.0 0.0 0.0 1.0 793.290 1.0 0.001261
I have 2 questions:
I want to drop the rows just before the rows where Timediff2 == 0.1.
Add another condition: drop these rows, unless for that row Start == 1.
I suggest the following: first I create a 'top' flag for the row just before Timediff2 == 0.1, then I filter:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Start": [np.NaN, 0.0, 1.0,0.0, 0.0],
"Timediff2": [np.NaN, 0.1, 3417, 0.1, 793]})
df["top"] = (df["Timediff2"] == 0.1).shift(-1)
df = df.loc[(df["Start"] == 1) | (df["top"] == False), :]
df = df.drop(columns="top")
The result is :
Start Timediff2
1 0.0 0.1
2 1.0 3417.0
3 0.0 0.1
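Note that shift(-1) leaves the flag undefined on the last row, so that row is dropped as well in the result above. A minimal sketch of the same logic that keeps the last row, using shift's fill_value parameter:
# flag rows that sit immediately before a Timediff2 == 0.1 row; fill_value=False keeps the last row's flag defined
df["top"] = (df["Timediff2"] == 0.1).shift(-1, fill_value=False)
df = df.loc[(df["Start"] == 1) | (~df["top"])].drop(columns="top")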
I have the following dataframes: eta1.shape = (8004, 29), eta2.shape = (138,)
eta1=
id/uniqueID var0 var1 var2 var3 var4 ... var28
5171/0 10.0 2.8 0.0 5.0 1.0 ... 9.4
5171/1 40.9 2.5 3.4 4.5 1.3 ... 7.7
5171/2 60.7 3.1 5.2 6.6 3.4 ... 1.0
...
5171/57 0.5 1.3 5.1 0.5 0.2 ... 0.4
4567/0 1.5 2.0 1.0 4.5 0.1 ... 0.4
4567/1 4.4 2.0 1.3 6.4 0.1 ... 3.3
4567/2 6.3 3.0 1.5 7.6 1.6 ... 1.6
...
4567/57 0.7 1.4 1.4 0.3 4.2 ... 1.7
...
9584/0 0.3 2.6 0.0 5.2 1.6 ... 9.7
9584/1 0.5 1.2 8.3 3.4 1.3 ... 1.7
9584/2 0.7 3.0 5.6 6.6 3.0 ... 1.0
...
9584/57 0.7 1.3 0.1 0.0 2.0 ... 1.7
eta2=
id var28
5171 67.0
4567 98.9
9584 47.7
...
8707 56.3
In eta2, I have one value per id. I need to add the value for each id in eta2 to all the columns of eta1 with the same id, like eta = eta1+eta2. So, for example, the results for id=5171 should be as follows:
eta.loc[5171] =
id/uniqueID var0 var1 ... var28
5171/0 10.0+67.0 2.8+67.0 ... 9.4+67.0
5171/1 40.9+67.0 2.5+67.0 ... 7.7+67.0
5171/2 60.7+67.0 3.1+67.0 ... 1.0+67.0
...
5171/57 0.5+67.0 1.3+67.0 ... 0.4+67.0
Doing the sum by eta = eta1.add(eta2) gives wrong results because they don't have the same levels. I can't remove the levels, I need them for later calculations. So instead I tried to add a new level to eta2 and then do the sum, but I'm getting this error:
eta2['uniq_id'] = eta2.groupby('id').cumcount()
eta2 = eta2.set_index(['uniq_id'], append=True)
eta = eta1.add(eta2, level=0)
error: TypeError: Join on level between two MultiIndex objects is ambiguous
How can I do this sum?
Use DataFrame.add with the Series eta2['var28'] and the axis=0 and level=0 parameters:
eta = eta1.add(eta2['var28'], axis=0, level=0)
print (eta)
var0 var1 var2 var3 var4 var28
id uniqueID
5171 0 77.0 69.8 67.0 72.0 68.0 76.4
1 107.9 69.5 70.4 71.5 68.3 74.7
2 127.7 70.1 72.2 73.6 70.4 68.0
57 67.5 68.3 72.1 67.5 67.2 67.4
4567 0 100.4 100.9 99.9 103.4 99.0 99.3
1 103.3 100.9 100.2 105.3 99.0 102.2
2 105.2 101.9 100.4 106.5 100.5 100.5
57 99.6 100.3 100.3 99.2 103.1 100.6
9584 0 48.0 50.3 47.7 52.9 49.3 57.4
1 48.2 48.9 56.0 51.1 49.0 49.4
2 48.4 50.7 53.3 54.3 50.7 48.7
57 48.4 49.0 47.8 47.7 49.7 49.4
If you have already appended the cumulative count level to the MultiIndex (as in your attempt):
eta2['uniq_id'] = eta2.groupby('id').cumcount()
eta2 = eta2.set_index(['uniq_id'], append=True)
eta = eta1.add(eta2['var28'], axis=0)
print (eta)
var0 var1 var2 var3 var4 var28
id uniqueID uniq_id
4567 0 0 100.4 100.9 99.9 103.4 99.0 99.3
1 0 NaN NaN NaN NaN NaN NaN
2 0 NaN NaN NaN NaN NaN NaN
57 0 NaN NaN NaN NaN NaN NaN
5171 0 0 77.0 69.8 67.0 72.0 68.0 76.4
1 0 NaN NaN NaN NaN NaN NaN
2 0 NaN NaN NaN NaN NaN NaN
57 0 NaN NaN NaN NaN NaN NaN
8707 NaN 0 NaN NaN NaN NaN NaN NaN
9584 0 0 48.0 50.3 47.7 52.9 49.3 57.4
1 0 NaN NaN NaN NaN NaN NaN
2 0 NaN NaN NaN NaN NaN NaN
57 0 NaN NaN NaN NaN NaN NaN
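An equivalent approach, shown here only as a minimal sketch and assuming eta1's index levels are named 'id' and 'uniqueID' as printed above, is to broadcast eta2 across eta1's MultiIndex first and then do a plain row-wise add:
# repeat each id's var28 value for every (id, uniqueID) row of eta1
aligned = eta2['var28'].reindex(eta1.index, level='id')
# ordinary row-wise addition now that the indexes match
eta = eta1.add(aligned, axis=0)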
I have a timeseries df comprised of daily Rates in column A and the relative change from one day to the next in column B.
DF looks something like the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67
May/27/2019 20.2% 292%
May/28/2019 20.5% 1.4%
May/29/2019 20% -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
I would like to delete all values in column A which occur between large relative shifts, > +/- 50%.
So the above DF should look as the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67
May/27/2019 np.nan 292%
May/28/2019 np.nan 1.4%
May/29/2019 np.nan -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
This is where I've got to so far... I would appreciate some help:
for i, j in df1.iterrows():
    if df1['Shift'][i] > .50:
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50:
        y = df1['IR'][j]
df1['IR'] = np.where(df1['Shift'].between(x, y), df1['Shift'], np.nan)
Error ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
We can locate rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values to then mask the entire DataFrame at once.
Setup
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))
IR Shift
May/24/2019 5.9 NaN
May/25/2019 6.0 1.67
May/26/2019 5.9 -1.67
May/27/2019 20.2 292.00
May/28/2019 20.5 1.40
May/29/2019 20.0 -1.60
May/30/2019 5.1 -292.00
May/31/2019 5.1 0.00
Code
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)
# Get the indices between consecutive pairs.
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1
df.loc[m, 'IR'] = np.nan
# IR Shift
#May/24/2019 5.9 NaN
#May/25/2019 6.0 1.67
#May/26/2019 5.9 -1.67
#May/27/2019 NaN 292.00
#May/28/2019 NaN 1.40
#May/29/2019 NaN -1.60
#May/30/2019 5.1 -292.00
#May/31/2019 5.1 0.00
Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.
IR Shift IR_modified
May/24/2019 5.9 NaN 5.9
May/25/2019 6.0 1.67 6.0
May/26/2019 5.9 -1.67 5.9
May/27/2019 20.2 292.00 NaN
May/28/2019 20.5 1.40 NaN
May/29/2019 20.0 -1.60 NaN
May/30/2019 5.1 -292.00 5.1
May/31/2019 5.1 0.00 5.1
June/1/2019 7.0 415.00 NaN
June/2/2019 17.0 15.00 NaN
June/3/2019 27.0 12.00 NaN
June/4/2019 17.0 315.00 17.0
June/5/2019 7.0 -12.00 7.0
You can also use the np.where function from NumPy as follows:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)
In [8]: df
Out[8]:
Date IR Shift
0 2019-05-24 NaN NaN
1 2019-05-25 0.0167 0.0167
2 2019-05-26 NaN -0.0167
3 2019-05-27 2.9200 2.9200
4 2019-05-28 0.0140 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 NaN -2.9200
Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)], 'IR':[0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 0.202 2.9200
4 2019-05-28 0.205 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 0.051 -2.9200
df['IR'] = [np.nan if abs(y-z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 NaN -2.9200
Using df.at to access a single value for a row/column label pair.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30), datetime(2019,5,31)], 'IR':[5.9, 6, 5.9, 20.2, 20.5, 20, 5.1, 5.1], 'Shift':[np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})
print("DataFrame Before :")
print(df)
count = 1
while count < len(df.index):
    if abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50:
        df.at[count, 'IR'] = np.nan
    count = count + 1
print("DataFrame After :")
print(df)
Output of program:
DataFrame Before :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 20.2 292.00
4 2019-05-28 20.5 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 5.1 -292.00
7 2019-05-31 5.1 0.00
DataFrame After :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 NaN 292.00
4 2019-05-28 NaN 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 NaN -292.00
7 2019-05-31 NaN 0.00
As per your description of triggering this on any large shift, positive or negative, you could do this:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 0.051 -2.9200
Steps:
abs(df.Shift) > .5: Finds shifts above +/- 50%.
.cumsum(): Gives unique values to each period, where the odd numbered periods are the ones we want to omit.
% 2 == 1: Checks which rows have odd numbers for cumsum().
Note: This does not work if what you want is to constrain this so that every positive spike needs to be followed by a negative spike, or vice versa.
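To illustrate the masking on the example Shift column above (the inline values are just the intermediate results for this frame):
s = abs(df.Shift) > .5    # False False False True  False False True
s.cumsum()                # 0     0     0     1     1     1     2
s.cumsum() % 2 == 1       # False False False True  True  True  False  -> these IR values become NaN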
I was not sure about your Shift column, so I calculated it again. Does this work for you?
import pandas as pd
import numpy as np
df.drop(columns=['Shift'], inplace=True)  ## calculated via the method below
df['nextval'] = df['IR'].shift(periods=1)

def shift(current, previous):
    return (current - previous) / previous * 100

indexlist = []  ## to save indexes that will be set to null
prior = 0  ## temporary variable to store the value prior to a peak
flag = False
for index, row in df.iterrows():
    if index == 0:  ## skip the first row of data
        continue
    if flag == False and shift(row[1], row[2]) > 50:  ## check for the start of a peak
        prior = row[2]
        indexlist.append(index)
        flag = True
        continue
    if flag == True:  ## checking until when the peak lasts
        if shift(row[1], prior) > 50:
            indexlist.append(index)

df.loc[df.index.isin(indexlist), 'IR'] = np.nan  ## replacing with nan
Output on print(df)
date IR nextval
0 May/24/2019 5.9 NaN
1 May/25/2019 6.0 5.9
2 May/26/2019 5.9 6.0
3 May/27/2019 NaN 5.9
4 May/28/2019 NaN 20.2
5 May/29/2019 NaN 20.5
6 May/30/2019 5.1 20.0
7 May/31/2019 5.1 5.1
df.loc[df['Shift']>0.5,'IR'] = np.nan