window based weighted average in pandas - python

I am trying to compute a window-based weighted average of two columns.
For example, suppose I have a value column "a" and a weighting column "b":
a b
1: 1 2
2: 2 3
3: 3 4
With a trailing window of 2 (although I'd like this to work with a variable window length), my third column should be the weighted average "c", where rows that do not have enough previous data for a full window are NaN:
c
1: nan
2: (1 * 2 + 2 * 3) / (2 + 3) = 1.6
3: (2 * 3 + 3 * 4) / (3 + 4) = 2.57

For your particular case of a window of 2, you may use prod and shift:
s = df.prod(1)
(s + s.shift()) / (df.b + df.b.shift())
Out[189]:
1 NaN
2 1.600000
3 2.571429
dtype: float64
On sample df2:
a b
0 73.78 51.46
1 73.79 27.84
2 73.79 34.35
s = df2.prod(1)
(s + s.shift()) / (df2.b + df2.b.shift())
Out[193]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
This approach also extends to a variable window length; you just need a list comprehension and sum.
Try it on the sample df2 above:
s = df2.prod(1)
m = 2 #window length 2
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[214]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
On window length 3
m = 3 #window length 3
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[215]:
0 NaN
1 NaN
2 73.785472
dtype: float64
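If you prefer pandas' built-in windowing over shifting manually, the same result can be obtained with rolling sums. This is a sketch using the question's sample data (index 1..3 assumed as in the question):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]}, index=[1, 2, 3])

m = 2  # window length; any m works
prod = df['a'] * df['b']  # elementwise a*b
c = prod.rolling(m).sum() / df['b'].rolling(m).sum()
# c: NaN, 1.6, 18/7 ≈ 2.571429
```

rolling handles the leading NaNs for incomplete windows automatically, so changing the window is just a matter of changing m.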

Related

Is there faster way to get values based on the linear regression model and append it to a new column in a DataFrame?

I created this code below to make a new column in my dataframe to compare the actual values and regressed value:
from sklearn.linear_model import LinearRegression

b = dfSemoga.loc[:, ['DoB', 'AA', 'logtime']]
y = dfSemoga.loc[:, 'logCO2'].values.reshape(-1, 1)
lr = LinearRegression().fit(b, y)
z = lr.coef_[0, 0]
j = lr.coef_[0, 1]
k = lr.coef_[0, 2]
c = lr.intercept_[0]
for i in range(0, len(dfSemoga)):
    dfSemoga.loc[i, 'EF CO2 Predict'] = (c + dfSemoga.loc[i, 'DoB'] * z +
                                         dfSemoga.loc[i, 'logtime'] * k +
                                         dfSemoga.loc[i, 'AA'] * j)
So, I basically regress a column with three variables: 1) AA, 2) logtime, and 3) DoB. But in this code, to get the regressed value in a new column called dfSemoga['EF CO2 Predict'] I assign the coefficient manually, as shown in the for loop.
Is there any fancy one-liner code that I can write to make my work more efficient?
Without sample data I can't confirm, but you should just be able to do
dfSemoga["EF CO2 Predict"] = c + (z * dfSemoga["DoB"]) + (k * dfSemoga["logtime"]) + (j * dfSemoga["AA"])
Demo:
In [4]: df
Out[4]:
a b
0 0 0
1 0 8
2 7 6
3 3 1
4 3 8
5 6 6
6 4 8
7 2 7
8 3 8
9 8 1
In [5]: df["c"] = 3 + 0.5 * df["a"] - 6 * df["b"]
In [6]: df
Out[6]:
a b c
0 0 0 3.0
1 0 8 -45.0
2 7 6 -29.5
3 3 1 -1.5
4 3 8 -43.5
5 6 6 -30.0
6 4 8 -43.0
7 2 7 -38.0
8 3 8 -43.5
9 8 1 1.0
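To illustrate why the one-liner matches the loop, here is a self-contained comparison. The data and coefficients below are made up for illustration (stand-ins for dfSemoga and lr.coef_ / lr.intercept_):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=['DoB', 'AA', 'logtime'])
z, j, k, c = 1.5, -2.0, 0.5, 3.0  # hypothetical coefficients and intercept

# Vectorized: one expression over whole columns
pred_vec = c + z * df['DoB'] + j * df['AA'] + k * df['logtime']

# Equivalent row-by-row loop (the original approach)
pred_loop = pd.Series(
    [c + z * df.loc[i, 'DoB'] + j * df.loc[i, 'AA'] + k * df.loc[i, 'logtime']
     for i in range(len(df))]
)
```

Both produce identical values; the vectorized form simply lets pandas apply the arithmetic to entire columns at once.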

Create column using a sequence of numbers

I have a dataframe as such
Anger Sad Happy Disgust Neutral Scared
0.06754 0.6766 0.4343 0.7732 0.5563 0.76433
0.54434 0.9865 0.6654 0.3334 0.4322 0.54453
...
0.5633 0.67655 0.5444 0.3278 0.9834 0.88569
I would like to create a new column that marks the first 5 rows as 1, the next 5 rows as 2, the next 5 rows 3 and then the next 3 rows as 4, and repeat the same pattern till the end of the dataset. How can I achieve this?
I tried looking into arange but could not get a working implementation.
An example output would be the new column Tperiod
Tperiod
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
1
1
1
1
1
One way to do it would be as follows:
pattern = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 3
no_patterns = len(df)//len(pattern)
remaining = len(df) - (len(pattern) * no_patterns)
new_values = pattern * no_patterns + pattern[:remaining]
df['new_column'] = new_values
You can do:
gr = [5, 5, 5, 3]
mp = {n: next(x + 1 for x in range(len(gr)) if sum(gr[:x + 1]) > n)
      for n in range(sum(gr))}
df['Tperiod'] = range(len(df))
df['Tperiod'] = (df['Tperiod'] % sum(gr)).map(mp)
Here gr holds your group sizes (that's your input), and mp just maps each position within a cycle to its group label so it can be used with pandas.
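A numpy alternative, assuming the cycle is always 5-5-5-3, is to build one 18-row cycle with repeat and tile it to the frame's length (the 20-row frame below is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(20)})  # hypothetical frame

pattern = np.repeat([1, 2, 3, 4], [5, 5, 5, 3])  # one 18-row cycle
reps = -(-len(df) // len(pattern))               # ceiling division
df['Tperiod'] = np.tile(pattern, reps)[:len(df)]
```

Slicing with [:len(df)] handles frames whose length is not a multiple of the cycle, restarting the pattern partway through just as in the expected output.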

Optimizing a pandas apply looking back at the prior row mid calculation

I have a dataframe (pastebin with minimum code to run):
import numpy as np
import pandas as pd

df_dict = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 2, 3, 1, 5],
    'out': np.nan
}
df = pd.DataFrame(df_dict)
I am currently performing some row by row calculations by doing the following:
def transform(row):
    length = 2
    weight = 5
    row_num = int(row.name)
    out = row['A'] / length
    if row_num >= length:
        previous_out = df.at[row_num - 1, 'out']
        out = (row['B'] - previous_out) * weight + previous_out
    df.at[row_num, 'out'] = out

df.apply(lambda x: transform(x), axis=1)
This yields the correct result:
A B out
0 1 5 0.5
1 2 2 1.0
2 3 3 11.0
3 4 1 -39.0
4 5 5 181.0
The breakdown for the correct calculation is as follows:
   A  B    out
0  1  5    0.5
out = a / length
1  2  2    1.0
out = a / length
row_num >= length:
2  3  3   11.0
out = (b - previous_out) * weight + previous_out
out = (3 - 1) * 5 + 1 = 11
3  4  1  -39.0
out = (1 - 11) * 5 + 11 = -39
4  5  5  181.0
out = (5 - (-39)) * 5 + (-39) = 181
Executing this across many columns and looping is slow so I would like to optimize taking advantage of some kind of vectorization if possible.
My current attempt looks like this:
df['out'] = df['A'] / length
df[length:]['out'] = (df[length:]['B'] - df[length:]['out'].shift() ) * weight + df[length:]['out'].shift()
This is not working and I'm not quite sure where to go from here.
Pastebin of the above code to just copy/paste into a file and run
You won't be able to do better than this:
df['out'] = df.A / length
for i in range(len(df)):
    if i >= length:
        df.loc[i, 'out'] = (df.loc[i, 'B'] -
                            df.loc[i - 1, 'out']) * weight + df.loc[i - 1, 'out']
The reason is that "the iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization" (as a commenter here puts it). You can't do a calculation where every result depends on the previous ones in a matrix - there will always be some kind of loop going on behind the scenes.
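That said, the loop itself can usually be made considerably faster by running the recurrence over plain numpy arrays instead of indexing into the DataFrame at every step. A sketch, using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 2, 3, 1, 5]})
length, weight = 2, 5

# Pull columns out as arrays once, loop over the arrays only
a = df['A'].to_numpy(dtype=float)
b = df['B'].to_numpy(dtype=float)
out = a / length
for i in range(length, len(out)):
    out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]
df['out'] = out
# out: [0.5, 1.0, 11.0, -39.0, 181.0]
```

The per-step dependence on the previous result remains, but array element access is far cheaper than repeated df.loc lookups.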

Create a pandas Series

I want to create a pandas Series that contains the first 'n' natural numbers and their respective squares. The first 'n' numbers should appear in the index position, using manual indexing.
Can someone please share code with me?
Use numpy.arange with ** for squares:
n = 5
s = pd.Series(np.arange(n) ** 2)
print (s)
0 0
1 1
2 4
3 9
4 16
dtype: int32
If you want to omit 0:
n = 5
arr = np.arange(1, n + 1)
s = pd.Series(arr ** 2, index=arr)
print (s)
1 1
2 4
3 9
4 16
5 25
dtype: int32
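If the number/square pairing is easier to read written out explicitly, a dict comprehension gives the same Series (values end up as the platform's default integer dtype):

```python
import pandas as pd

n = 5
s = pd.Series({i: i ** 2 for i in range(1, n + 1)})
# index 1..5, values 1, 4, 9, 16, 25
```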

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val * decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test['val'].iloc[0] + (test['val'] * decay).cumsum() - test['val'].iloc[0] * decay
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test['val'].shift()
# Set the first row of the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
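Because the recurrence feeds each result into the next row, it is worth sanity-checking any vectorized version against the loop. The cumsum idea reduces to a closed form that can be verified directly (data from the question):

```python
import pandas as pd

decay = 0.5
test = pd.DataFrame({'val': [4, 5, 7, 9, 1]})

# Loop reference: decayed[i] = decayed[i-1] + val[i] * decay
decayed = [float(test['val'].iloc[0])]
for v in test['val'].iloc[1:]:
    decayed.append(decayed[-1] + v * decay)

# Closed form: the first val counts in full, every value contributes val * decay
closed = test['val'].iloc[0] * (1 - decay) + (test['val'] * decay).cumsum()
# both give [4.0, 6.5, 10.0, 14.5, 15.0]
```

This works because the recurrence only ever adds val * decay to the running total, so the sequence is the first value plus a cumulative sum of scaled increments.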
