Pandas find and interpolate missing value - python

This question is pretty much a follow-up to Pandas pivot or reshape dataframe with NaN.
When decoding videos, some frames go missing and that data needs to be interpolated.
Current df
frame pvol vvol area label
0 NaN 109.8 120 v
2 NaN 160.4 140 v
0 23.1 NaN 110 p
1 24.3 NaN 110 p
2 25.6 NaN 112 p
Expected df
frame pvol vvol p_area v_area
0 23.1 109.8 110 110
1 24.3 135.1 110 111 # Interpolated for label v
2 25.6 160.4 112 120
I know I can do df.interpolate() once the current_df is reshaped for only p frames. The reshaping is the issue.
Note: the frames for label p are a superset of those for label v, i.e. label p will always have all the frames but v can have missing frames.

You can reshape, dropna as in the previous question, except that now you need to specify that you want to drop only empty columns, then interpolate:
out = (df.pivot(index='frame', columns='label')
         .dropna(axis=1, how='all')   # only drop empty columns
         .interpolate()               # interpolate
      )
out.columns = [f'{y}_{x}' for x,y in out.columns]
Output:
p_pvol v_vvol p_area v_area
frame
0 23.1 109.8 110.0 120.0
1 24.3 135.1 110.0 130.0
2 25.6 160.4 112.0 140.0
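For reference, here is a minimal reproducible sketch of the above; the DataFrame construction is not part of the original answer and simply hand-builds the sample data from the question:

import numpy as np
import pandas as pd

# sample data from the question, built by hand for illustration
df = pd.DataFrame({
    'frame': [0, 2, 0, 1, 2],
    'pvol':  [np.nan, np.nan, 23.1, 24.3, 25.6],
    'vvol':  [109.8, 160.4, np.nan, np.nan, np.nan],
    'area':  [120, 140, 110, 110, 112],
    'label': ['v', 'v', 'p', 'p', 'p'],
})

out = (df.pivot(index='frame', columns='label')
         .dropna(axis=1, how='all')   # drop the all-NaN (value, label) columns
         .interpolate())              # linear interpolation along the frame index
out.columns = [f'{y}_{x}' for x, y in out.columns]
print(out)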

Changing the dropna arguments removes the issue:
s = df.set_index(['frame','label']).unstack().dropna(thresh=1,axis=1)
s.columns = s.columns.map('_'.join)
s = s.interpolate()
Out[279]:
pvol_p vvol_v area_p area_v
frame
0 23.1 109.8 110.0 120.0
1 24.3 135.1 110.0 130.0
2 25.6 160.4 112.0 140.0

Related

Pandas - Fillna based on last non-blank value and next column

I have the following pandas dataframe:
A B C
0 100.0 110.0 100
1 90.0 120.0 110
2 NaN 105.0 105
3 NaN 100.0 103
4 NaN NaN 107
5 NaN NaN 110
I need to fill NaNs in all columns in a particular way. Let's take column "A" as an example: the last non-NaN value is row #1 (90.0). So for column "A" I need to fill NaNs with the following formula:
Column_A[row 1] * Column_B[current row] / Column_B[row 1]
For example, the first NaN of column A (row #2) should be filled with: 90 * 105 / 120. The following NaN of column A should be filled with: 90 * 100 / 120.
Please note that column names can change, so I can't reference columns by name.
This is the expected output:
A B C
0 100.00 110.00 100.0
1 90.00 120.00 110.0
2 78.75 105.00 105.0
3 75.00 100.00 103.0
4 NaN 103.88 107.0
5 NaN 106.80 110.0
Any ideas? Thanks
You can fill the first NaN that follows a number using shift on both axes:
df2 = df.combine_first(df.shift().mul(df.div(df.shift()).shift(-1,axis=1)))
output:
A B C
0 100.00 110.000000 100
1 90.00 120.000000 110
2 78.75 105.000000 105
3 NaN 100.000000 103
4 NaN 103.883495 107
5 NaN NaN 110
It is unclear how you get the 75 though; do you want to iterate the process?
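For what it's worth, here is a sketch of one way to fill every NaN with that formula (not just the first one after a number), which also reproduces the 75.00 in the expected output. It anchors each column on its last non-NaN value and uses the original, unfilled values of the next column for the ratio; columns are addressed positionally since the names may change. The data construction and helper names are illustrative, not from the original post:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [100.0, 90.0, np.nan, np.nan, np.nan, np.nan],
                   'B': [110.0, 120.0, 105.0, 100.0, np.nan, np.nan],
                   'C': [100, 110, 105, 103, 107, 110]})

orig = df.copy()   # keep the unfilled values for the ratios
out = df.copy()

for i in range(len(df.columns) - 1):        # positional, since column names can change
    col, nxt = df.columns[i], df.columns[i + 1]
    last = orig[col].last_valid_index()     # row of the last non-NaN value in this column
    anchor = orig.at[last, col]             # e.g. 90.0 for column A
    ref = orig.at[last, nxt]                # e.g. 120.0 (column B at that row)
    mask = orig[col].isna() & (orig.index > last)
    out.loc[mask, col] = anchor * orig.loc[mask, nxt] / ref

print(out.round(2))   # A becomes 78.75, 75.00, NaN, NaN; B becomes 103.88, 106.80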

Deleting values conditional on large values of another column

I have a timeseries df comprised of daily Rates in column A and the relative change from one day to the next in column B.
DF looks something like the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67
May/27/2019 20.2% 292%
May/28/2019 20.5% 1.4%
May/29/2019 20% -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
I would like to delete all values in column A which occur between large relative shifts, > +/- 50%.
So the above DF should look as the below:
IR Shift
May/24/2019 5.9% -
May/25/2019 6% 1.67%
May/26/2019 5.9% -1.67
May/27/2019 np.nan 292%
May/28/2019 np.nan 1.4%
May/29/2019 np.nan -1.6%
May/30/2019 5.1% -292%
May/31/2019 5.1% 0%
This is where I've got to so far.... would appreciate some help
for i, j in df1.iterrows():
    if df1['Shift'][i] > .50:
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50:
        y = df1['IR'][j]
df1['IR'] = np.where(df1['Shift'].between(x, y), df1['Shift'], np.nan)
This raises: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
We can locate rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values to then mask the entire DataFrame at once.
Setup
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))
IR Shift
May/24/2019 5.9 NaN
May/25/2019 6.0 1.67
May/26/2019 5.9 -1.67
May/27/2019 20.2 292.00
May/28/2019 20.5 1.40
May/29/2019 20.0 -1.60
May/30/2019 5.1 -292.00
May/31/2019 5.1 0.00
Code
# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)
# Get the indices between consecutive pairs.
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum()%2==1
df.loc[m, 'IR'] = np.NaN
# IR Shift
#May/24/2019 5.9 NaN
#May/25/2019 6.0 1.67
#May/26/2019 5.9 -1.67
#May/27/2019 NaN 292.00
#May/28/2019 NaN 1.40
#May/29/2019 NaN -1.60
#May/30/2019 5.1 -292.00
#May/31/2019 5.1 0.00
Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.
IR Shift IR_modified
May/24/2019 5.9 NaN 5.9
May/25/2019 6.0 1.67 6.0
May/26/2019 5.9 -1.67 5.9
May/27/2019 20.2 292.00 NaN
May/28/2019 20.5 1.40 NaN
May/29/2019 20.0 -1.60 NaN
May/30/2019 5.1 -292.00 5.1
May/31/2019 5.1 0.00 5.1
June/1/2019 7.0 415.00 NaN
June/2/2019 17.0 15.00 NaN
June/3/2019 27.0 12.00 NaN
June/4/2019 17.0 315.00 17.0
June/5/2019 7.0 -12.00 7.0
You can also use the np.where function from numpy as follows:
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27),
                            datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)
In [8]: df
Out[8]:
Date IR Shift
0 2019-05-24 NaN NaN
1 2019-05-25 0.0167 0.0167
2 2019-05-26 NaN -0.0167
3 2019-05-27 2.9200 2.9200
4 2019-05-28 0.0140 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 NaN -2.9200
Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27),
                            datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 0.202 2.9200
4 2019-05-28 0.205 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 0.051 -2.9200
df['IR'] = [np.nan if abs(y-z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 0.200 -0.0160
6 2019-05-30 NaN -2.9200
Using df.at to access a single value for a row/column label pair.
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({'Date': [datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27),
                            datetime(2019,5,28), datetime(2019,5,29), datetime(2019,5,30), datetime(2019,5,31)],
                   'IR': [5.9, 6, 5.9, 20.2, 20.5, 20, 5.1, 5.1],
                   'Shift': [np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})
print("DataFrame Before :")
print(df)
count = 1
while (count < len(df.index)):
    if (abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50):
        df.at[count, 'IR'] = np.nan
    count = count + 1
print("DataFrame After :")
print(df)
Output of program:
DataFrame Before :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 20.2 292.00
4 2019-05-28 20.5 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 5.1 -292.00
7 2019-05-31 5.1 0.00
DataFrame After :
Date IR Shift
0 2019-05-24 5.9 NaN
1 2019-05-25 6.0 1.67
2 2019-05-26 5.9 -1.67
3 2019-05-27 NaN 292.00
4 2019-05-28 NaN 1.40
5 2019-05-29 20.0 -1.60
6 2019-05-30 NaN -292.00
7 2019-05-31 NaN 0.00
As per your description of triggering this on any large shift, positive or negative, you could do this:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})
df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan
Date IR Shift
0 2019-05-24 0.059 NaN
1 2019-05-25 0.060 0.0167
2 2019-05-26 0.059 -0.0167
3 2019-05-27 NaN 2.9200
4 2019-05-28 NaN 0.0140
5 2019-05-29 NaN -0.0160
6 2019-05-30 0.051 -2.9200
Steps:
abs(df.Shift) > .5: finds shifts of more than +/- 50%.
.cumsum(): gives a unique running number to each period, where the odd-numbered periods are the ones we want to omit.
% 2 == 1: checks which rows have an odd cumsum() value (a small sketch of these intermediate values follows below).
Note: This does not work if what you want is to constrain this so that every positive spike needs to be followed by a negative spike, or vice versa.
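To make the parity step concrete, here is a small sketch of the intermediate values for the sample data above (0.5 corresponds to a 50% shift in decimal form):

import numpy as np
import pandas as pd

shift = pd.Series([np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92])

s = shift.abs() > .5      # spike locations:        F F F T F F T
groups = s.cumsum()       # period counter:         0 0 0 1 1 1 2
mask = groups % 2 == 1    # odd-numbered periods:   F F F T T T F
print(pd.DataFrame({'Shift': shift, 'spike': s, 'group': groups, 'mask': mask}))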
I was not sure about your Shift column, so I calculated it again. Does this work for you?
import pandas as pd
import numpy as np
df.drop(columns=['Shift'], inplace=True) ## calculated via method below
df['nextval'] = df['IR'].shift(periods=1)
def shift(current, previous):
    return (current - previous) / previous * 100

indexlist = []   ## to save index that will be set to null
prior = 0        ## temporary flag to store value prior to a peak
flag = False
for index, row in df.iterrows():
    if index == 0:   ## to skip first row of data
        continue
    if flag == False and (shift(row[1], row[2])) > 50:   ## to check for start of peak
        prior = row[2]
        indexlist.append(index)
        flag = True
        continue
    if flag == True:   ## checking until when the peak lasts
        if (shift(row[1], prior)) > 50:
            indexlist.append(index)
df.loc[df.index.isin(indexlist), 'IR'] = np.nan   ## replacing with nan
Output on print(df)
date IR nextval
0 May/24/2019 5.9 NaN
1 May/25/2019 6.0 5.9
2 May/26/2019 5.9 6.0
3 May/27/2019 NaN 5.9
4 May/28/2019 NaN 20.2
5 May/29/2019 NaN 20.5
6 May/30/2019 5.1 20.0
7 May/31/2019 5.1 5.1
df.loc[df['Shift']>0.5,'IR'] = np.nan

get value from pandas segment and subtract in place

i have a table with values similar to
val1 val2 val3 segVal
0 12.3 88.2
20 0 0
50 14.5 88.7
70 0 0
85 0 0
90 18.2 88.9
For my segVal, I need to use the differences from my val1 column where val2 is known. So my first segment would be zero to 50: I subtract 0 from 50 and apply that (50) to all segVal rows in the segment. My next segment ends at 90, so I subtract 50 from it and apply that (40).
So my output table would be
val1 val2 val3 segVal
0 12.3 88.2 50
20 0 0 50
50 14.5 88.7 50
70 0 0 40
85 0 0 40
90 18.2 88.9 40
My current, somewhat working, method is:
df1 = df[df.val2 != 0]
df1 = df1.copy()
df1.segVal=(df1['val1'].diff(-1))*1
So I'm creating an additional df, calculating the values this way, and then merging the values back into the original df.
It seems there has got to be a better way to do this. My method works, but creating additional df's doesn't seem too efficient.
Here's one way:
df['segVal'] = df.where(df.val2.ne(0)).val1.dropna().diff().reindex(df.index).bfill()
val1 val2 val3 segVal
0 0 12.3 88.2 50.0
1 20 0.0 0.0 50.0
2 50 14.5 88.7 50.0
3 70 0.0 0.0 40.0
4 85 0.0 0.0 40.0
5 90 18.2 88.9 40.0
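A reproducible sketch of that one-liner, hand-building the sample values from the question and annotating what each step does (the comments are illustrative, not from the original answer):

import pandas as pd

df = pd.DataFrame({'val1': [0, 20, 50, 70, 85, 90],
                   'val2': [12.3, 0, 14.5, 0, 0, 18.2],
                   'val3': [88.2, 0, 88.7, 0, 0, 88.9]})

df['segVal'] = (df.where(df.val2.ne(0))   # keep only the rows where val2 is known
                  .val1.dropna()          # the val1 anchors: 0, 50, 90
                  .diff()                 # segment widths: NaN, 50, 40
                  .reindex(df.index)      # spread back over the full index
                  .bfill())               # fill each segment with its width
print(df)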

shifting down rows of specific columns from a specific index in python

I am scraping multiple tables from multiple pages of a website. The issue is there is a row missing from the initial table. Basically, this is how the dataframe looks.
mar2018 feb2018 jan2018 dec2017 nov2017
oct2017 sep2017 aug2017
balls faced 345 561 295 0 645 balls faced 200 58 0
runs scored 156 281 183 0 389 runs scored 50 20 0
strike rate 52.3 42.6 61.1 0 52.2 strike rate 25 34 0
dot balls 223 387 173 0 476 dot balls 125 34 0
fours 8 12 19 0 22 sixes 2 0 0
doubles 20 38 16 0 36 fours 4 2 0
notout 2 0 0 0 4 doubles 2 0 0
notout 4 2 0
The column 'sixes' is missing on the first page and present on the subsequent pages. So I am trying to move the rows from 'fours' to 'notout' down one position and leave NaNs in row 4 for the first 5 columns, mar2018 to nov2017.
I tried the following code but it isn't working. This is moving the values horizontally but not vertically downward.
df.iloc[4][0:6] = df.iloc[4][0:6].shift(1)
and also
df2 = pd.DataFrame(index = 4)
df = pd.concat([df.iloc[:], df2, df.iloc[4:]]).reset_index(drop=True)
did not work.
df['mar2018'] = df['mar2018'].shift(1)
But this moves all the values of that column down by 1 row.
So, I was wondering if it is possible to shift down rows of specific columns from a specific index?
I think you need to reindex by the union of all index values, via numpy.union1d:
idx = np.union1d(df1.index, df2.index)
df1 = df1.reindex(idx)
df2 = df2.reindex(idx)
print (df1)
mar2018 feb2018 jan2018 dec2017 nov2017
balls faced 345.0 561.0 295.0 0.0 645.0
dot balls 223.0 387.0 173.0 0.0 476.0
doubles 20.0 38.0 16.0 0.0 36.0
fours 8.0 12.0 19.0 0.0 22.0
notout 2.0 0.0 0.0 0.0 4.0
runs scored 156.0 281.0 183.0 0.0 389.0
sixes NaN NaN NaN NaN NaN
strike rate 52.3 42.6 61.1 0.0 52.2
print (df2)
oct2017 sep2017 aug2017
balls faced 200 58 0
dot balls 125 34 0
doubles 2 0 0
fours 4 2 0
notout 4 2 0
runs scored 50 20 0
sixes 2 0 0
strike rate 25 34 0
If there are multiple DataFrames in a list, it is possible to use a list comprehension:
from functools import reduce
dfs = [df1, df2]
idx = reduce(np.union1d, [x.index for x in dfs])
dfs1 = [df.reindex(idx) for df in dfs]
print (dfs1)
[ mar2018 feb2018 jan2018 dec2017 nov2017
balls faced 345.0 561.0 295.0 0.0 645.0
dot balls 223.0 387.0 173.0 0.0 476.0
doubles 20.0 38.0 16.0 0.0 36.0
fours 8.0 12.0 19.0 0.0 22.0
notout 2.0 0.0 0.0 0.0 4.0
runs scored 156.0 281.0 183.0 0.0 389.0
sixes NaN NaN NaN NaN NaN
strike rate 52.3 42.6 61.1 0.0 52.2, oct2017 sep2017 aug2017
balls faced 200 58 0
dot balls 125 34 0
doubles 2 0 0
fours 4 2 0
notout 4 2 0
runs scored 50 20 0
sixes 2 0 0
strike rate 25 34 0]
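If the aligned frames are then meant to end up in one combined table (that goal is implied by the question but is not part of the answer above), they can be joined side by side after reindexing. A sketch using a small hand-built subset of the question's data:

from functools import reduce
import numpy as np
import pandas as pd

# illustrative subsets of the two scraped tables from the question
df1 = pd.DataFrame({'mar2018': [345, 156, 8, 2], 'feb2018': [561, 281, 12, 0]},
                   index=['balls faced', 'runs scored', 'fours', 'notout'])
df2 = pd.DataFrame({'oct2017': [200, 50, 2, 4, 4], 'sep2017': [58, 20, 0, 2, 2]},
                   index=['balls faced', 'runs scored', 'sixes', 'fours', 'notout'])

dfs = [df1, df2]
idx = reduce(np.union1d, [d.index for d in dfs])             # union of all row labels
combined = pd.concat([d.reindex(idx) for d in dfs], axis=1)  # missing rows become NaN
print(combined)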

Change all rows in a dataframe, following the occurrence of an item in a row.

Say I have a data frame like this:
Open Close Split
144 144 False
142 143 False
... ... ...
138 139 False
72 73 True
72 74 False
75 76 False
... ... ...
79 78 False
Obviously the dataframe can be quite large, and may contain other columns, but this is the core.
My end goal is to adjust all of the data to account for the split, and so far I've been able to identify the split (that's the "Split" column).
Now, I'm looking for an elegant way to divide everything before the split by 2, or multiply everything after the split by 2.
I thought the best way might be to spread the True down towards the bottom, and then multiply all rows that contain a True in the "Split" column, but is there a more Pythonic way to do it?
Assuming Split is the only boolean column, and that everything else is numeric in nature, you can just take the cumsum and set values with loc accordingly -
m = df.pop('Split').cumsum()
df.loc[m.eq(0)] /= 2 # division before split
df.loc[m.eq(1)] *= 2 # multiplication after split
df
Open Close
0 72.0 72.0
1 71.0 71.5
2 69.0 69.5
3 144.0 146.0
4 144.0 148.0
5 150.0 152.0
6 158.0 156.0
This is by far the most performant option. Another possible option involves np.where -
df[:] = np.where(m.eq(0).values[:, None], df / 2, df * 2)
df
Open Close
0 72.0 72.0
1 71.0 71.5
2 69.0 69.5
3 144.0 146.0
4 144.0 148.0
5 150.0 152.0
6 158.0 156.0
Or,
df.where/df.mask -
(df / 2).where(m.eq(0), df * 2)
Or,
(df / 2).mask(m.ne(0), df * 2)
Open Close Split
0 72.0 72.0 0
1 71.0 71.5 0
2 69.0 69.5 0
3 144.0 146.0 2
4 144.0 148.0 0
5 150.0 152.0 0
6 158.0 156.0 0
These are nowhere near as efficient as the indexing option with loc, because they involve a lot of redundant computation.
Another cumsum-based solution:
columns = ['Open','Close']
df[columns] = df[columns].mul(df.Split.cumsum() + 1, axis=0)
# Open Close Split
#0 144 144 False
#1 142 143 False
#2 138 139 False
#3 144 146 True
#4 144 148 False
#5 150 152 False
#6 158 156 False
# locate the index of the first row where Split is True,
# then select everything from that row onward
split_true = df[df['Split'] == True].index[0]
df.iloc[split_true:, :]
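That fragment only locates and selects the rows from the split onward. A possible continuation (a sketch, not part of the original answer) would be to double the price columns from that point on:

import pandas as pd

df = pd.DataFrame({'Open':  [144, 142, 138, 72, 72, 75, 79],
                   'Close': [144, 143, 139, 73, 74, 76, 78],
                   'Split': [False, False, False, True, False, False, False]})

# index of the first row where Split is True
split_true = df[df['Split'] == True].index[0]

# multiply everything from the split row onward by 2
price_cols = ['Open', 'Close']
df.loc[split_true:, price_cols] = df.loc[split_true:, price_cols] * 2
print(df)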
