window based weighted average in pandas - python

I am trying to compute a window-based weighted average of two columns.
For example, suppose I have a value column "a" and a weighting column "b":
a b
1: 1 2
2: 2 3
3: 3 4
With a trailing window of 2 (although I'd like this to work with a variable window length), my third column should be the weighted average "c", where rows that do not have enough previous data for a full window are NaN:
c
1: nan
2: (1 * 2 + 2 * 3) / (2 + 3) = 1.6
3: (2 * 3 + 3 * 4) / (3 + 4) = 2.57

For your particular case of a window of 2, you may use prod and shift:
s = df.prod(1)
(s + s.shift()) / (df.b + df.b.shift())
Out[189]:
1 NaN
2 1.600000
3 2.571429
dtype: float64
On sample df2:
a b
0 73.78 51.46
1 73.79 27.84
2 73.79 34.35
s = df2.prod(1)
(s + s.shift()) / (df2.b + df2.b.shift())
Out[193]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
This approach also extends to a variable window length; you just need a list comprehension and sum.
Try it on the sample df2 above:
s = df2.prod(1)
m = 2 #window length 2
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[214]:
0 NaN
1 73.783511
2 73.790000
dtype: float64
On window length 3
m = 3 #window length 3
sum([s.shift(x) for x in range(m)]) / sum([df2.b.shift(x) for x in range(m)])
Out[215]:
0 NaN
1 NaN
2 73.785472
dtype: float64
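If you prefer pandas' built-in windowing over shifting manually, the same result can be obtained with rolling sums. This is a sketch using the question's sample data (index 1..3 assumed as in the question):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4]}, index=[1, 2, 3])

m = 2  # window length; any m works
prod = df['a'] * df['b']  # elementwise a*b
c = prod.rolling(m).sum() / df['b'].rolling(m).sum()
# c: NaN, 1.6, 18/7 ≈ 2.571429
```

rolling handles the leading NaNs for incomplete windows automatically, so changing the window is just a matter of changing m.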

Related

Is there faster way to get values based on the linear regression model and append it to a new column in a DataFrame?

I created this code below to make a new column in my dataframe to compare the actual values and regressed value:
from sklearn.linear_model import LinearRegression

b = dfSemoga.loc[:, ['DoB', 'AA', 'logtime']]
y = dfSemoga.loc[:, 'logCO2'].values.reshape(-1, 1)
lr = LinearRegression().fit(b, y)
z = lr.coef_[0, 0]
j = lr.coef_[0, 1]
k = lr.coef_[0, 2]
c = lr.intercept_[0]
for i in range(0, len(dfSemoga)):
    dfSemoga.loc[i, 'EF CO2 Predict'] = (c + dfSemoga.loc[i, 'DoB'] * z +
                                         dfSemoga.loc[i, 'logtime'] * k +
                                         dfSemoga.loc[i, 'AA'] * j)
So, I basically regress a column with three variables: 1) AA, 2) logtime, and 3) DoB. But in this code, to get the regressed value in a new column called dfSemoga['EF CO2 Predict'] I assign the coefficient manually, as shown in the for loop.
Is there any fancy one-liner code that I can write to make my work more efficient?
Without sample data I can't confirm, but you should just be able to do
dfSemoga["EF CO2 Predict"] = c + (z * dfSemoga["DoB"]) + (k * dfSemoga["logtime"]) + (j * dfSemoga["AA"])
Demo:
In [4]: df
Out[4]:
a b
0 0 0
1 0 8
2 7 6
3 3 1
4 3 8
5 6 6
6 4 8
7 2 7
8 3 8
9 8 1
In [5]: df["c"] = 3 + 0.5 * df["a"] - 6 * df["b"]
In [6]: df
Out[6]:
a b c
0 0 0 3.0
1 0 8 -45.0
2 7 6 -29.5
3 3 1 -1.5
4 3 8 -43.5
5 6 6 -30.0
6 4 8 -43.0
7 2 7 -38.0
8 3 8 -43.5
9 8 1 1.0
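To illustrate why the one-liner matches the loop, here is a self-contained comparison. The data and coefficients below are made up for illustration (stand-ins for dfSemoga and lr.coef_ / lr.intercept_):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=['DoB', 'AA', 'logtime'])
z, j, k, c = 1.5, -2.0, 0.5, 3.0  # hypothetical coefficients and intercept

# Vectorized: one expression over whole columns
pred_vec = c + z * df['DoB'] + j * df['AA'] + k * df['logtime']

# Equivalent row-by-row loop (the original approach)
pred_loop = pd.Series(
    [c + z * df.loc[i, 'DoB'] + j * df.loc[i, 'AA'] + k * df.loc[i, 'logtime']
     for i in range(len(df))]
)
```

Both produce identical values; the vectorized form simply lets pandas apply the arithmetic to entire columns at once.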

Create column using a sequence of numbers

I have a dataframe as such
Anger Sad Happy Disgust Neutral Scared
0.06754 0.6766 0.4343 0.7732 0.5563 0.76433
0.54434 0.9865 0.6654 0.3334 0.4322 0.54453
...
0.5633 0.67655 0.5444 0.3278 0.9834 0.88569
I would like to create a new column that marks the first 5 rows as 1, the next 5 rows as 2, the next 5 rows 3 and then the next 3 rows as 4, and repeat the same pattern till the end of the dataset. How can I achieve this?
I tried looking into arange but could not get a working implementation.
An example output would be the new column Tperiod
Tperiod
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
1
1
1
1
1
One way to do it would be as follows:
pattern = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 3
no_patterns = len(df)//len(pattern)
remaining = len(df) - (len(pattern) * no_patterns)
new_values = pattern * no_patterns + pattern[:remaining]
df['new_column'] = new_values
You can do:
gr = [5, 5, 5, 3]
mp = {n: next(x + 1 for x in range(len(gr)) if sum(gr[:x + 1]) > n)
      for n in range(sum(gr))}
df['Tperiod'] = range(len(df))
df['Tperiod'] = (df['Tperiod'] % sum(gr)).map(mp)
Here gr holds your group sizes (that's your input), and mp just maps each position within a cycle to its group label so it can be used with pandas.
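A numpy alternative, assuming the cycle is always 5-5-5-3, is to build one 18-row cycle with repeat and tile it to the frame's length (the 20-row frame below is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(20)})  # hypothetical frame

pattern = np.repeat([1, 2, 3, 4], [5, 5, 5, 3])  # one 18-row cycle
reps = -(-len(df) // len(pattern))               # ceiling division
df['Tperiod'] = np.tile(pattern, reps)[:len(df)]
```

Slicing with [:len(df)] handles frames whose length is not a multiple of the cycle, restarting the pattern partway through just as in the expected output.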

Optimizing a pandas apply looking back at the prior row mid calculation

I have a dataframe (pastebin with minimum code to run):
import numpy as np
import pandas as pd

df_dict = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 2, 3, 1, 5],
    'out': np.nan
}
df = pd.DataFrame(df_dict)
I am currently performing some row by row calculations by doing the following:
def transform(row):
    length = 2
    weight = 5
    row_num = int(row.name)
    out = row['A'] / length
    if row_num >= length:
        previous_out = df.at[row_num - 1, 'out']
        out = (row['B'] - previous_out) * weight + previous_out
    df.at[row_num, 'out'] = out

df.apply(lambda x: transform(x), axis=1)
This yields the correct result:
A B out
0 1 5 0.5
1 2 2 1.0
2 3 3 11.0
3 4 1 -39.0
4 5 5 181.0
The breakdown for the correct calculation is as follows:
   A  B    out
0  1  5    0.5
out = a / length
1  2  2    1.0
out = a / length
row_num >= length:
2  3  3   11.0
out = (b - previous_out) * weight + previous_out
out = (3 - 1) * 5 + 1 = 11
3  4  1  -39.0
out = (1 - 11) * 5 + 11 = -39
4  5  5  181.0
out = (5 - (-39)) * 5 + (-39) = 181
Executing this across many columns and looping is slow so I would like to optimize taking advantage of some kind of vectorization if possible.
My current attempt looks like this:
df['out'] = df['A'] / length
df[length:]['out'] = (df[length:]['B'] - df[length:]['out'].shift() ) * weight + df[length:]['out'].shift()
This is not working and I'm not quite sure where to go from here.
Pastebin of the above code to just copy/paste into a file and run
You won't be able to do better than this:
df['out'] = df.A / length
for i in range(len(df)):
    if i >= length:
        df.loc[i, 'out'] = (df.loc[i, 'B'] -
                            df.loc[i - 1, 'out']) * weight + df.loc[i - 1, 'out']
The reason is that "the iterative nature of the calculation where the inputs depend on results of previous steps complicates vectorization" (as a commenter here puts it). You can't do a calculation where every result depends on the previous ones in a matrix - there will always be some kind of loop going on behind the scenes.
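That said, the loop itself can usually be made considerably faster by running the recurrence over plain numpy arrays instead of indexing into the DataFrame at every step. A sketch, using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 2, 3, 1, 5]})
length, weight = 2, 5

# Pull columns out as arrays once, loop over the arrays only
a = df['A'].to_numpy(dtype=float)
b = df['B'].to_numpy(dtype=float)
out = a / length
for i in range(length, len(out)):
    out[i] = (b[i] - out[i - 1]) * weight + out[i - 1]
df['out'] = out
# out: [0.5, 1.0, 11.0, -39.0, 181.0]
```

The per-step dependence on the previous result remains, but array element access is far cheaper than repeated df.loc lookups.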

Create a pandas Series

I want to create a pandas Series that contains the first 'n' natural numbers and their respective squares. The first 'n' numbers should appear in the index position, using manual indexing.
Can someone please share code with me?
Use numpy.arange with ** for squares:
n = 5
s = pd.Series(np.arange(n) ** 2)
print (s)
0 0
1 1
2 4
3 9
4 16
dtype: int32
If you want to omit 0:
n = 5
arr = np.arange(1, n + 1)
s = pd.Series(arr ** 2, index=arr)
print (s)
1 1
2 4
3 9
4 16
5 25
dtype: int32
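If the number/square pairing is easier to read written out explicitly, a dict comprehension gives the same Series (values end up as the platform's default integer dtype):

```python
import pandas as pd

n = 5
s = pd.Series({i: i ** 2 for i in range(1, n + 1)})
# index 1..5, values 1, 4, 9, 16, 25
```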

Pandas sequentially apply function using output of previous value

I want to compute the "carryover" of a series. This computes a value for each row and then adds it to the previously computed value (for the previous row).
How do I do this in pandas?
decay = 0.5
test = pd.DataFrame(np.random.randint(1,10,12),columns = ['val'])
test
val
0 4
1 5
2 7
3 9
4 1
5 1
6 8
7 7
8 3
9 9
10 7
11 2
decayed = []
for i, v in test.iterrows():
    if i == 0:
        decayed.append(v.val)
        continue
    d = decayed[i-1] + v.val * decay
    decayed.append(d)
test['loop_decay'] = decayed
test.head()
val loop_decay
0 4 4.0
1 5 6.5
2 7 10.0
3 9 14.5
4 1 15.0
Consider a vectorized version with cumsum() where you cumulatively sum (val * decay) with the very first val.
However, you then need to subtract the very first (val * decay) since cumsum() includes it:
test['loop_decay'] = test['val'].iloc[0] + (test['val'] * decay).cumsum() - test['val'].iloc[0] * decay
You can utilize pd.Series.shift() to create a dataframe with val[i] and val[i-1] and then apply your function across a single axis (1 in this case):
# Create a series that shifts the rows by 1
test['val2'] = test['val'].shift()
# Set the first row of the shifted series to 0
test.loc[0, 'val2'] = 0
# Apply the decay formula:
test['loop_decay'] = test.apply(lambda x: x['val'] + x['val2'] * 0.5, axis=1)
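Because the recurrence feeds each result into the next row, it is worth sanity-checking any vectorized version against the loop. The cumsum idea reduces to a closed form that can be verified directly (data from the question):

```python
import pandas as pd

decay = 0.5
test = pd.DataFrame({'val': [4, 5, 7, 9, 1]})

# Loop reference: decayed[i] = decayed[i-1] + val[i] * decay
decayed = [float(test['val'].iloc[0])]
for v in test['val'].iloc[1:]:
    decayed.append(decayed[-1] + v * decay)

# Closed form: the first val counts in full, every value contributes val * decay
closed = test['val'].iloc[0] * (1 - decay) + (test['val'] * decay).cumsum()
# both give [4.0, 6.5, 10.0, 14.5, 15.0]
```

This works because the recurrence only ever adds val * decay to the running total, so the sequence is the first value plus a cumulative sum of scaled increments.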
