I have time series data per row (with columns as time steps) and I'd like to left and right pad each row with 0s based on a conditional row value (i.e. 'Padding amount'). This is what I have:
Padding amount    T1    T2    T3
0                 3     2.9   2.8
1                 2.9   2.8   2.7
1                 2.8   2.3   2.0
2                 4.4   3.3   2.3
And this is what I'd like to produce:
Padding amount    T1    T2    T3    T4    T5
0                 3     2.9   2.8   0     0     (--> padding = 0, so no change)
1                 0     2.9   2.8   2.7   0     (--> shifted one to the right)
1                 0     2.8   2.3   2.0   0
2                 0     0     4.4   3.3   2.3   (--> shifted two to the right)
I see that Keras has sequence padding, but I'm not sure how this would work considering all rows have the same number of entries. I'm looking at shift and np.roll, but I'm sure a solution for this already exists somewhere.
In numpy, you could construct an array of indices for the locations where you want to place your array elements.
Let's say you have
padding = np.array([0, 1, 1, 2])
data = np.array([[3.0, 2.9, 2.8],
[2.9, 2.8, 2.7],
[2.8, 2.3, 2.0],
[4.4, 3.3, 2.3]])
M, N = data.shape
The output array would be
output = np.zeros((M, N + padding.max()))
You can make an index of where the data goes:
rows = np.arange(M)[:, None]
cols = padding[:, None] + np.arange(N)
Since the shape of the index broadcasts to the shape of the data, you can assign the output directly:
output[rows, cols] = data
Not sure how this applies to a DataFrame exactly, but you could probably construct a new one after operating on the values of the old one. Alternatively, you could probably implement all these operations equivalently directly in pandas.
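As a rough sketch of that first option (reusing the arrays built above; the column names T1–T5 and the variable name result are just illustrative, not from the original code), you could rebuild a frame like this:
import pandas as pd

# rough sketch: wrap the padded numpy array back into a DataFrame
new_cols = ['T' + str(i + 1) for i in range(output.shape[1])]
result = pd.DataFrame(output, columns=new_cols)
result.insert(0, 'Padding amount', padding)
print(result)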
This is one way of doing it; I've made the process really flexible in terms of how many time periods/steps it can take:
import pandas as pd
#data
d = {'Padding amount': [0, 1, 1, 2],
     'T1': [3, 2.9, 2.8, 4.4],
     'T2': [2.9, 2.8, 2.3, 3.3],
     'T3': [2.8, 2.7, 2.0, 2.3]}
#create DF
df = pd.DataFrame(data = d)
#get max padding amount
maxPadd = df['Padding amount'].max()
#list of time periods
timePeriodsCols = [c for c in df.columns.tolist() if 'T' in c]
#reverse list
reverseList = timePeriodsCols[::-1]
#number of periods
noOfPeriods = len(timePeriodsCols)
#create new needed columns
for i in range(noOfPeriods + 1, noOfPeriods + 1 + maxPadd):
    df['T' + str(i)] = ''
#loop over records
for i, row in df.iterrows():
    #get padding amount
    padAmount = df.at[i, 'Padding amount']
    #if zero then do nothing
    if padAmount == 0:
        continue
    #else: roll column value by padding amount and set old location to zero
    else:
        for col in reverseList:
            df.at[i, df.columns[df.columns.get_loc(col) + padAmount]] = df.at[i, df.columns[df.columns.get_loc(col)]]
            df.at[i, df.columns[df.columns.get_loc(col)]] = 0
print(df)
   Padding amount   T1   T2   T3   T4   T5
0               0  3.0  2.9  2.8
1               1  0.0  2.9  2.8  2.7
2               1  0.0  2.8  2.3  2.0
3               2  0.0  0.0  4.4  3.3  2.3
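If you want the newly created columns to hold zeros rather than empty strings, as in the desired output above, one optional follow-up (a sketch, not part of the original run) would be:
#optional: turn the '' placeholders in the new columns into zeros
df = df.replace('', 0)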
Related
I have a dataframe with some features. I want to group by 'id' feature. Then for each group I want to identify the row which has 'speed' feature value greater than a threshold and select all the rows before this one.
For example, my threshold is 1.5 for 'speed' feature and my input is:
id  speed  ...
1   1.2    ...
1   1.9    ...
1   1.0    ...
5   0.9    ...
5   1.3    ...
5   3.5    ...
5   0.4    ...
And my desired output is:
id  speed  ...
1   1.2    ...
5   0.9    ...
5   1.3    ...
This should get you the desired results:
import pandas as pd

# Create sample data
df = pd.DataFrame({'id':[1, 1, 1, 5, 5, 5, 5],
                   'speed':[1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]
                   })
df
output:
id speed
0 1 1.2
1 1 1.9
2 1 1.0
3 5 0.9
4 5 1.3
5 5 9.5
6 5 0.4
ther = 1.5
s = df.speed.shift(-1).ge(ther)
df[s]
Output:
id speed
0 1 1.2
4 5 1.3
It took me an hour to figure out, but I got what you need. You need to REVERSE the dataframe and use .cumsum() (cumulative sum) within the grouped ids to find the rows that come after the speed threshold you set. Then drop the speeds above the threshold, along with the rows that do not satisfy the condition. Finally, reverse the dataframe back:
import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({'id':[1, 1, 1, 5, 5, 5, 5],
                   'speed':[1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]
                   })
# Reverse the dataframe
df = df.iloc[::-1]
thre = 1.5
# Find rows with speed more than threshold
df = df.assign(ge=df.speed.ge(thre))
# Groupby and cumsum to get the rows that are after the threshold in with same id
df.insert(0, 'beforethre', df.groupby('id')['ge'].cumsum())
# Drop speed more than threshold
df['ge'] = df['ge'].replace(True, np.nan)
# Drop rows that don't have any speed more than threshold or after threshold
df['beforethre'] = df['beforethre'].replace(0, np.nan)
df = df.dropna(axis=0).drop(['ge', 'beforethre'], axis=1)
# Reverse back the dataframe
df = df.iloc[::-1]
# Voila!
df
Output:
id speed
0 1 1.2
3 5 0.9
4 5 1.3
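If you would rather avoid reversing the dataframe, a shorter sketch of the same idea (not from the original answer; the names over and before_first are just illustrative) keeps, per id, only the rows seen before the first over-threshold speed:
import pandas as pd

df = pd.DataFrame({'id':[1, 1, 1, 5, 5, 5, 5],
                   'speed':[1.2, 1.9, 1.0, 0.9, 1.3, 9.5, 0.4]
                   })
thre = 1.5
# True for every row whose speed exceeds the threshold
over = df['speed'].gt(thre)
# per id, count the exceedances seen so far; rows before the first one still have a count of 0
before_first = over.groupby(df['id']).cumsum().eq(0)
df[before_first]
This should return rows 0, 3 and 4 (speeds 1.2, 0.9 and 1.3), matching the desired output.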
I have the following dataframe:
df = pd.DataFrame({'date' : ['2020-6','2020-07','2020-8'], 'd3_real':[1.2,1.3,0.8], 'd7_real' : [1.5,1.8,1.2], 'd14_real':[1.9,2.1,1.5],'d30_real' : [2.1, 2.2, 1.8],
'd7_mul':[1.12,1.1,1.15],'d14_mul':[1.08, 1.1, 1.14],'d30_mul':[1.23,1.25,1.12]})
The dX_real columns refer to the actual values on day 3, day 7, day 14... and the dX_mul columns are the multipliers for each specific day.
I want to calculate the predictions in the following way: first, I take the target column (d3_real, d7_real...) and then I multiply it by each multiplier, depending on the case. For example, to calculate the prediction from d3_real to d30, I would need to multiply it by the multipliers of D7, D14 and D30.
df['d30_from_d3'] = df.iloc[:,1] * df.iloc[:,5] * df.iloc[:,6] * df.iloc[:,7]
df['d30_from_d7'] = df.iloc[:,2] * df.iloc[:,6] * df.iloc[:,7]
df['d30_from_d14'] = df.iloc[:,3] * df.iloc[:,7]
Is there any way to automate this with a loop? I do not know how to multiply each dX_real column without using a conditional for each case, as the number of multiplications changes.
This is what I have tried, but it is not working as expected, as it only takes the first multiplier:
pos_reals = [1,2,3]
pos_mul = [5,6,7]
clases = ['d3', 'd7','d14']
for target in pos_reals:
    for clase in pos_mul:
        df[f'f{clases}_hm_p_d30'] = df.iloc[:,target]
However, from here, I do not know how to specify which values it needs to multiply based on d3, d7 and d14.
Thanks!
bbb = [[1, 5, 6, 7], [2, 6, 7], [3, 7]]
ddd = ['d30_from_d3', 'd30_from_d7', 'd30_from_d14']
for i in range(0, len(ddd)):
    df[ddd[i]] = df.iloc[:, bbb[i][0]]
    for x in range(1, len(bbb[i])):
        df[ddd[i]] = df[ddd[i]] * df.iloc[:, bbb[i][x]]
Output
date d3_real d7_real d14_real d30_real d7_mul d14_mul d30_mul \
0 2020-6 1.2 1.5 1.9 2.1 1.12 1.08 1.23
1 2020-07 1.3 1.8 2.1 2.2 1.10 1.10 1.25
2 2020-8 0.8 1.2 1.5 1.8 1.15 1.14 1.12
d30_from_d3 d30_from_d7 d30_from_d14
0 1.785370 1.99260 2.337
1 1.966250 2.47500 2.625
2 1.174656 1.53216 1.680
Here the outer loop takes the name of each new column from the 'ddd' list and initializes it with the first value. In the nested loop, the positions of the required columns are taken from the 'bbb' list and the values are multiplied together. Check this against your data, or show the expected result for your example so we can verify it matches.
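If you would rather avoid hard-coded column positions, here is a sketch of the same multiplication driven by column names instead (the real_cols and mul_cols lists are assumptions based on the layout shown in the question):
real_cols = ['d3_real', 'd7_real', 'd14_real']
mul_cols = ['d7_mul', 'd14_mul', 'd30_mul']
for i, real in enumerate(real_cols):
    day = real.split('_')[0]  # 'd3', 'd7', 'd14'
    # multiply the real value by every multiplier from the next period up to d30
    df[f'd30_from_{day}'] = df[real] * df[mul_cols[i:]].prod(axis=1)
This produces the same d30_from_d3, d30_from_d7 and d30_from_d14 columns as the positional version above.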
I'm attempting to get the mean values on one data frame between certain time points that are marked as events in a second data frame.
This is a follow up to this question, where now I have missing/NaN values: Find a subset of columns based on another dataframe?
import pandas as pd
import numpy as np
#example
example_g = [["4/20/21 4:20", 302, 0, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
["2/17/21 9:20",135, 1, 1.4, 1.8, 2, 8, 10],
["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]]
example_g_table = pd.DataFrame(example_g,columns=['Date_Time','CID', 0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
#Example Timestamps
example_s = [["4/20/21 4:20",302,0.0, 0.2, np.NaN],
["2/17/21 9:20",135,0.0, 0.1, 0.4 ],
["2/17/21 9:20",111,0.3, 0.4, 0.5 ]]
example_s_table = pd.DataFrame(example_s,columns=['Date_Time','CID', "event_1", "event_2", "event_3"])
df = pd.merge(left=example_g_table,right=example_s_table,on=['Date_Time','CID'],how='left')
def func(df):
    event_2 = df['event_2']
    event_3 = df['event_3']
    start = event_2 + 2  # this assumes the column named 0 is the third column (positional index 2), column 1 the fourth, and so on
    end = event_3 + 2    # same as above
    total = sum(df.iloc[start:end+1])  # this line is the key: it sums the values of the columns in the range start to end
    avg = total/(end-start+1)  # (end-start+1) gives the count of columns in our range
    return avg
df['avg'] = df.apply(func,axis=1)
I get the following error:
cannot do positional indexing on Index with these indexers [nan] of type float
I have attempted to make sure that the columns are floats and have tried removing the int() command within the definitions of the events.
How can I perform the same calculations as before where possible, while skipping any values that are NaN?
Regarding your question, check if this solution is OK:
def func(row):
    try:
        event_2 = row['event_2']
        event_3 = row['event_3']
        start = int(event_2 + 2)
        end = int(event_3 + 2) + 1
        list_row = row.tolist()[start:end]
        list_row = [x for x in list_row if x == x]  # x == x is False for NaN, so this drops the NaN values
        return sum(list_row)/(end-start)
    except Exception as e:
        return np.NaN
df['avg'] = df.apply(lambda x: func(x), axis=1)
I reduced the function and converted the start and end parameters to integers before taking the subset. The function is called row by row through a lambda in apply, and the NaN values are removed before the average calculation.
You can check if the event values are NaN and if any of the event value is NaN, just return NaN from the function, else return the required value.
You can also modify the function a bit to calculate the values between any two given events, i.e. not necessarily event 2 and event 3. Also, the data you provided in the previous question had integer event values, but this time you have float values like 0.1, 0.2, 0.3, etc. You can simply store the event value columns in a list in increasing order, so they can be accessed via the index values coming from the event columns of the second dataframe.
Additionally, you can directly use np.mean instead of calculating the sum and dividing it manually. The modified version of the function will look like this:
eventCols = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5] # Columns having the value for events
def getMeanValue(row, eN1=2, eN2=3):
    if pd.isna([row[f'event_{eN1}'], row[f'event_{eN2}']]).any():
        return float('nan')
    else:
        requiredEventCols = eventCols[int(row[f'event_{eN1}']):int(row[f'event_{eN2}']+1)]
        return np.mean(row[requiredEventCols])
Now, you can apply this function to the dataframe with axis=1:
df['avg'] = df.apply(getMeanValue,axis=1)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 3.30
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.35
[3 rows x 12 columns]
Additionally, if needed, you can also pass the two event numbers, default values are 2, and 3 which means the value will be calculated for event_2, and event_3
Average between event_1 and event_2:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=2)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN 0.00
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 1.20
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.25
[3 rows x 12 columns]
Average between event_1 and event_3:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=3)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 2.84
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.30
[3 rows x 12 columns]
The format of your data is hard to work with. I would spend some time to rearrange it into a less wide format, then do the work needed.
Here is a quick example, but I did not spend any time making this readable:
base = example_g_table.set_index(['Date_Time','CID']).stack().to_frame()
data = example_s_table.set_index(['Date_Time','CID']).stack().reset_index().set_index(['Date_Time','CID', 0])
base['events'] = data
base = base.reset_index()
base = base.rename(columns={'level_2': 'local_index', 0: 'values'})
This produces a long-format frame with one row per (Date_Time, CID, time point): the measured value sits in a 'values' column and an 'events' column marks where each event falls.
In this format calculating the result is not so hard.
import numpy as np
from functools import partial

def mean_two_events(event1, event2, columns_to_mean, df):
    event_1 = df['events'] == event1
    event_2 = df['events'] == event2
    if any(event_1) and any(event_2):
        return df.loc[event_1.idxmax():event_2.idxmax()][columns_to_mean].mean()
    else:
        return np.nan

mean_event2_and_event3 = partial(mean_two_events, 'event_2', 'event_3', 'values')
mean_event1_and_event3 = partial(mean_two_events, 'event_1', 'event_3', 'values')
base.groupby(['Date_Time','CID']).apply(mean_event2_and_event3).reset_index()
Good luck!
Edit:
Here is an alternative solution that filters out the values BEFORE the groupby.
base['events'] = base.groupby(['Date_Time','CID']).events.ffill()
# This calculates all periods up until the next event. The shift makes the first value of the next event included as well.
# The problem with this approach is that more complex logic will be needed if you need to calculate values between events
# that are not adjacent, i.e. this won't work if you want to calculate between event_1 and event_3.
base['time_periods_to_include'] = ((base.events == 'event_2') | (base.groupby(['Date_Time','CID']).events.shift() == 'event_2'))
# Now we can simply do:
filtered_base = base[base['time_periods_to_include']]
filtered_base.groupby(['Date_Time','CID']).values.mean()
# The benefit is that you can now easily do:
filtered_base.groupby(['Date_Time','CID']).values.rolling(5).mean()
I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
])
averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get it right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        if min_periods is None: min_periods = 0
        if closed is None: closed = 'left'
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what the DataFrame.rolling specifies.
Application:
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
], columns='a b'.split())
df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
np.random.uniform(0, 1000, size=(1_000_000, 2)),
columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering if one could go faster for the case when the rolling involves large windows. Turns out I should have measured Pandas's rolling mean and sum performance first... (Premature optimization, anyone?) See at the end why.
Anyway, the idea was to do a cumsum just once, then take the difference of elements dereferenced by the windows endpoints:
# both below working on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
df.rolling and series.rolling do allow for value-based windows if the index is of type DateTimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages
This is a three-line solution that should have great performance, leveraging the pandas datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (center isn't implemented for datetimelike and offset based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.
I would like to build a data frame from an existing one, where each value in a row depends on the previous one. I have an initial value v0 as a starting point. Let me give an example:
In [126]:import pandas as pd
In [127]: df = pd.DataFrame([1.0, 1.1, 1.2, 1.3])
In [128]: df_result = df.copy()
In [129]: v0 = 10
In [130]: for i in range(1, len(df.index)):
     ...:     df_result.iloc[i, 0] = df.iloc[i, 0]*df_result.iloc[i-1, 0]
     ...:
In [131]: df_result
Out[131]:
0
0 1.000
1 1.100
2 1.320
3 1.716
In [132]:
My question is about the for loop. How can I write this more efficiently?
I believe you need to first insert the value v0 at the first position with numpy.insert and then call numpy.cumprod:
import numpy as np

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
df['n'] = np.cumprod(np.insert(df['r'].values[1:], 0, v0))
print (df)
r n
0 1.0 10.00
1 1.1 11.00
2 1.2 13.20
3 1.3 17.16
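If you prefer to stay in pandas, a one-line sketch of the same idea (assuming, as above, that v0 should replace the first element before the cumulative product is taken) would be:
# replace the first value with v0, then take the running product
df['n'] = df['r'].mask(df.index == 0, v0).cumprod()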