python/pandas rolling sum with time-varying windows - python

I have an array
arr = [1,2,3, ..., N]
and a list of windows (of length N)
windows = [2,2,1, ...]
Is it possible to do a rolling sum computation on arr with the time varying windows stored in windows?
For example at t=3, you have arr=[1,2,3] and window=1 so this would indicate returning a 1 day rolling sum such that out[2] = 3
At t=2, you have arr = [1,2] and window=2 so this would indicate a 2 day rolling sum such that out[1]=3

I cannot guarantee the speed, but this will achieve what you need:
df['New']=np.clip(df.index-df.windows+1,a_min=0,a_max=None)
df
Out[626]:
val windows New
0 1 2 0
1 2 2 0
2 3 1 2
3 4 1 3
4 5 3 2
df.apply(lambda x : df.iloc[x['New']:x.name+1,0].sum(),1)
Out[630]:
0 1
1 3
2 3
3 4
4 12
dtype: int64
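If speed matters, the same result can be sketched without a row-wise apply by using prefix sums. This is only a sketch under the assumption that every window is a positive integer; the function name variable_window_sum is made up for illustration. Windows longer than the available history are clipped to the start of the array, as in the answer above.

```python
import numpy as np

def variable_window_sum(arr, windows):
    """Rolling sum where the window length may differ at every position."""
    arr = np.asarray(arr)
    windows = np.asarray(windows)
    # prefix[k] holds sum(arr[:k]); prefix[0] is 0
    prefix = np.concatenate(([0], np.cumsum(arr)))
    # exclusive end position of each window
    ends = np.arange(1, len(arr) + 1)
    # start position of each window, clipped at the beginning of the array
    starts = np.clip(ends - windows, 0, None)
    return prefix[ends] - prefix[starts]

out = variable_window_sum([1, 2, 3, 4, 5], [2, 2, 1, 1, 3])
print(out.tolist())  # [1, 3, 3, 4, 12]
```

The values match the apply-based output shown above for the same val and windows columns.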

This might be what you are after:
arr = [1,2,3]
windows = [2,2,1]
out = [0,0,0]
for t, i in enumerate(windows):
    newarr = arr[:t+1]
    out[t] = sum(newarr[:-(i+1):-1])
    print('t = ' + str(t+1))
    print('arr = ' + str(newarr))
    print('out[' + str(t) + '] = ' + str(out[t]))
    print('\n')
Gives:
t = 1
arr = [1]
out[0] = 1
t = 2
arr = [1, 2]
out[1] = 3
t = 3
arr = [1, 2, 3]
out[2] = 3

Related

How can I fill empty DataFrame based on conditions?

I have following dataframe called condition:
[0] [1] [2] [3]
1 0 0 1 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
For easier reproduction:
import numpy as np
import pandas as pd
n=4
t=3
condition = pd.DataFrame([[0,0,1,0], [0,1,0,0], [0,0,0, 1], [0,0,0, 1]], columns=['0','1', '2', '3'])
condition.index=np.arange(1,n+1)
Further, I have several dataframes that should be filled in a for loop:
df = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) ) #NaN DataFrame
df_2 = pd.DataFrame([],index = range(1,n+1),columns= range(t+1) )
df_3 = pd.DataFrame(3,index = range(1,n+1),columns= range(t+1) )
for i,t in range(t,-1,-1):
    if condition[t]==1:
        df.loc[:,t] = df_3.loc[:,t]**2
        df_2.loc[:,t]=0
    elif (condition == 0 and no 1 in any column after t)
        df.loc[:,t] = 2.5
        ....
    else:
        df.loc[:,t] = 5
        df_2.loc[:,t]= df.loc[:,t+1]
I am aware that this for loop is not correct. What I want to do is check condition elementwise (recursively): where it is 1, fill the dataframe df with the squared value of df_3. Where it is 0 in condition, I need to distinguish two cases.
In the first case, there is no 1 after the 0 (as in rows 1 and 2 of condition); then df = 2.5.
In the second case, there is a 1 after the 0 (as in rows 3 and 4); then df is filled with 5.
So the dataframe df should look something like this
[0] [1] [2] [3]
1 5 5 9 2.5
2 5 9 2.5 2.5
3 5 5 5 9
4 5 5 5 9
The code should include for loop.
Thanks!
I am not sure if this is what you want, but based on your desired output you can do this with only masking operations (which is more efficient than looping over the rows anyway). Your code could look like this:
is_one = condition.astype(bool)
is_after_one = (condition.cumsum(axis=1) - condition).astype(bool)
df = pd.DataFrame(5, index=condition.index, columns=condition.columns)
df_2 = pd.DataFrame(2.5, index=condition.index, columns=condition.columns)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)
df.where(~is_one, other=df_3 * df_3, inplace=True)
df.where(~is_after_one, other=df_2, inplace=True)
which yields:
0 1 2 3
1 5 5 9.0 2.5
2 5 9 2.5 2.5
3 5 5 5.0 9.0
4 5 5 5.0 9.0
EDIT after comment:
If you really want to loop explicitly over the rows and columns, you could do it like this with the same result:
n_rows = condition.index.size
n_cols = condition.columns.size
for row_index in range(n_rows):
    for col_index in range(n_cols):
        cond = condition.iloc[row_index, col_index]
        if col_index < n_cols - 1:
            rest_row = condition.iloc[row_index, col_index + 1:].to_list()
        else:
            rest_row = []
        if cond == 1:
            df.iloc[row_index, col_index] = df_3.iloc[row_index, col_index] ** 2
        elif cond == 0 and 1 not in rest_row:
            # fill whole row at once
            df.iloc[row_index, col_index:] = 2.5
            # stop iterating over the rest
            break
        else:
            df.iloc[row_index, col_index] = 5
            df_2.loc[:, col_index] = df.iloc[:, col_index + 1]
The result is the same, but this is much less efficient and harder to read, so I would not recommend doing it this way.
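For comparison, the three-way rule (squared value at a 1, 2.5 after a 1, otherwise 5) could also be written with numpy.select instead of two where calls. This is a swapped-in technique, not the approach from the answer, and it assumes the same condition and df_3 as in the question.

```python
import numpy as np
import pandas as pd

condition = pd.DataFrame(
    [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    columns=['0', '1', '2', '3'],
    index=np.arange(1, 5),
)
df_3 = pd.DataFrame(3, index=condition.index, columns=condition.columns)

# True where condition is 1
is_one = condition.eq(1).to_numpy()
# True where a 1 appeared in an earlier column of the same row
is_after_one = (condition.cumsum(axis=1) - condition).gt(0).to_numpy()

# np.select takes the first matching choice per cell, else the default
values = np.select(
    [is_one, is_after_one],
    [(df_3 ** 2).to_numpy(), 2.5],
    default=5,
)
df = pd.DataFrame(values, index=condition.index, columns=condition.columns)
print(df)
```

All values come out as floats here, because 2.5 forces a floating-point result.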

pandas for loop for running average does not work

I tried to make a kind of running average - out of 90 rows, every 3 in column A should make an average that would be the same as those rows in column B.
For example:
From this:
df = pd.DataFrame( A B
2 0
3 0
4 0
7 0
9 0
8 0)
to this:
df = pd.DataFrame( A B
2 3
3 3
4 3
7 8
9 8
8 8)
I tried running this code:
x=0
for i in df['A']:
    if x<90:
        y = (df['A'][x]+ df['A'][(x +1)]+df['A'][(x +2)])/3
        df['B'][x] = y
        df['B'][(x+1)] = y
        df['B'][(x+2)] = y
        x=x+3
        print(y)
It does print the correct Y
But does not change B
I know there is a better way to do it, and if anyone knows - it would be great if they shared it. But the more important thing for me is to understand why what I wrote down doesn't have an effect on the df.
You could group by the index divided by 3, then use transform to compute the mean of those values and assign to B:
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0, 0, 0, 0, 0, 0]})
df['B'] = df.groupby(df.index // 3)['A'].transform('mean')
Output:
A B
0 2 3
1 3 3
2 4 3
3 7 8
4 9 8
5 8 8
Note that this relies on the index being of the form 0,1,2,3,4,.... If that is not the case, you could either reset the index (df.reset_index(drop=True)) or use np.arange(df.shape[0]) instead i.e.
df['B'] = df.groupby(np.arange(df.shape[0]) // 3)['A'].transform('mean')
i = 0
batch_size = 3
df = pd.DataFrame({'A':[2,3,4,7,9,8,9,10],'B':[-1] * 8})
while i < len(df):
    j = min(i+batch_size-1,len(df)-1)
    avg = sum(df.loc[i:j,'A']) / (j-i+1)
    df.loc[i:j,'B'] = [avg] * (j-i+1)
    i += batch_size
df
Corner case: when len(df) % batch_size != 0, this takes the average of the leftover rows at the end.
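On why the loop in the question leaves B unchanged: df['B'][x] = y is chained indexing. pandas first evaluates df['B'] and then assigns into that intermediate object, which may be a temporary copy (this is what SettingWithCopyWarning is about), and with Copy-on-Write enabled the original frame is never updated this way. A single .loc call avoids the problem. A minimal sketch, assuming a default 0..n-1 index and a row count divisible by 3:

```python
import pandas as pd

# B starts as float so the averages fit without a dtype change
df = pd.DataFrame({'A': [2, 3, 4, 7, 9, 8], 'B': [0.0] * 6})

for x in range(0, len(df), 3):
    # .loc slices by label and includes both endpoints: rows x, x+1, x+2
    y = df.loc[x:x + 2, 'A'].mean()
    # one indexing call on df itself, so the assignment reaches df
    df.loc[x:x + 2, 'B'] = y

print(df)
```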

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example the result should look like :
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
col increase decrease
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
resetting the count
As I realize your example is ambiguous, here is an additional method in case you want to reset the counts for each increasing/decreasing group, using groupby.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
col inc dec inc2 dec2
0 1 0 0 0 0
1 2 1 0 1 0
2 3 2 0 2 0
3 2 0 1 0 1
4 1 0 2 0 2
5 2 3 0 1 0
6 3 4 0 2 0
7 1 0 3 0 1
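The resetting variant can also be packaged as a small helper that counts consecutive True values in a boolean mask. This is a sketch; the name consecutive_count is hypothetical and not part of pandas.

```python
import pandas as pd

def consecutive_count(mask):
    """Count consecutive True values in a boolean Series, restarting after each False."""
    # every False starts a new group; the True values that follow share its label
    groups = (~mask).cumsum()
    return mask.astype(int).groupby(groups).cumsum()

s = pd.Series([1, 2, 3, 2, 1, 2, 3, 1])
d = s.diff()  # the first element is NaN, which compares as False below
inc = consecutive_count(d.gt(0))
dec = consecutive_count(d.lt(0))
print(inc.tolist())  # [0, 1, 2, 0, 0, 1, 2, 0]
print(dec.tolist())  # [0, 0, 0, 1, 2, 0, 0, 1]
```

These are the same values as the inc2 and dec2 columns above.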
data = {
'input': [1,2,3,2,1]
}
df = pd.DataFrame(data)
diffs = df['input'].diff()
df['a'] = (df['input'] > df['input'].shift(periods=1, axis=0)).cumsum()-(df['input'] > df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
.where(~(df['input'] > df['input'].shift(periods=1, axis=0))) \
.ffill().fillna(0).astype(int)
df['b'] = (df['input'] < df['input'].shift(periods=1, axis=0)).cumsum()-(df['input'] < df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
.where(~(df['input'] < df['input'].shift(periods=1, axis=0))) \
.ffill().fillna(0).astype(int)
print(df)
output
input a b
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
Coding this manually using numpy might look like this:
import numpy as np
input = np.array([1, 2, 3, 2, 1])
# integer counters; np.zeros defaults to float without dtype
increase = np.zeros(len(input), dtype=int)
decrease = np.zeros(len(input), dtype=int)
for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif input[i] < input[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0
increase # array([0, 1, 2, 0, 0])
decrease # array([0, 0, 0, 1, 2])

How to set ranges of rows in pandas?

I have the following working code that sets 1 to "new_col" at the locations pointed by intervals dictated by starts and ends.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1
df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
I want these intervals to come from another dataframe pointer_df.
How to vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
obviously doesn't work and gives
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
It seems all answers use some kind of Python-level for loop.
The question was how to vectorize the operation above.
Is this not doable without for loops or list comprehensions?
You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]
df.loc[indices, 'new_col'] = value
print(df)
Output
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
If you want a method that does not use any for loop or list comprehension and relies only on numpy, you could do:
def indices(start, end, ma=10):
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    np.cumsum(lens, out=lens)
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
for i,j in zip(pointer_df["starts"],pointer_df["ends"]):
    print (i,j)
Apply the same method as in your loop, but on the columns of pointer_df.
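Another loop-free option is a difference array: add +1 at every start and -1 just past every end, then take a cumulative sum to see which rows are covered by at least one interval. This is only a sketch, assuming the index of df is the default 0..n-1 range, ends are inclusive, and every start is smaller than len(df).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(10)})
pointer_df = pd.DataFrame({"starts": [1, 5, 8], "ends": [1, 6, 10]})
value = 1

n = len(df)
starts = pointer_df["starts"].to_numpy()
# ends are inclusive, so stop one past the end; clip ends beyond the last row
stops = np.minimum(pointer_df["ends"].to_numpy() + 1, n)

# +1 where an interval opens, -1 where it closes
delta = np.zeros(n + 1, dtype=int)
np.add.at(delta, starts, 1)   # add.at accumulates correctly for repeated positions
np.add.at(delta, stops, -1)

# the running total is positive exactly on rows inside some interval
covered = np.cumsum(delta[:-1]) > 0
df["new_col"] = np.where(covered, value, 0)
print(df)
```

Overlapping intervals are handled as well, since the running total simply stays above zero while any interval is open.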

print matrix with indices python

I have a matrix in Python defined like this:
matrix = [['A']*4 for i in range(4)]
How do I print it in the following format:
  0 1 2 3
0 A A A A
1 A A A A
2 A A A A
3 A A A A
>>> for i, row in enumerate(matrix):
...     print i, ' '.join(row)
...
0 A A A A
1 A A A A
2 A A A A
3 A A A A
I guess you'll find out how to print out the first line :)
Something like this:
>>> matrix = [['A'] * 4 for i in range(4)]
>>> def solve(mat):
...     print " ", " ".join([str(x) for x in xrange(len(mat))])
...     for i, x in enumerate(mat):
...         print i, " ".join(x)  # or " ".join([str(y) for y in x]) if elements are not strings
...
>>> solve(matrix)
  0 1 2 3
0 A A A A
1 A A A A
2 A A A A
3 A A A A
>>> matrix = [['A'] * 5 for i in range(5)]
>>> solve(matrix)
  0 1 2 3 4
0 A A A A A
1 A A A A A
2 A A A A A
3 A A A A A
4 A A A A A
This function matches your exact output.
>>> def printMatrix(testMatrix):
...     print ' ',
...     for i in range(len(testMatrix[1])):  # Make it work with non square matrices.
...         print i,
...     print
...     for i, element in enumerate(testMatrix):
...         print i, ' '.join(element)
...
>>> matrix = [['A']*4 for i in range(4)]
>>> printMatrix(matrix)
  0 1 2 3
0 A A A A
1 A A A A
2 A A A A
3 A A A A
>>> matrix = [['A']*6 for i in range(4)]
>>> printMatrix(matrix)
  0 1 2 3 4 5
0 A A A A A A
1 A A A A A A
2 A A A A A A
3 A A A A A A
To check for single-length elements and put an & in place of elements with length > 1, you could add a check in the list comprehension. The code would change as follows:
>>> def printMatrix2(testMatrix):
...     print ' ',
...     for i in range(len(testMatrix[1])):
...         print i,
...     print
...     for i, element in enumerate(testMatrix):
...         print i, ' '.join([elem if len(elem) == 1 else '&' for elem in element])
...
>>> matrix = [['A']*6 for i in range(4)]
>>> matrix[1][1] = 'AB'
>>> printMatrix(matrix)
  0 1 2 3 4 5
0 A A A A A A
1 A AB A A A A
2 A A A A A A
3 A A A A A A
>>> printMatrix2(matrix)
  0 1 2 3 4 5
0 A A A A A A
1 A & A A A A
2 A A A A A A
3 A A A A A A
a=[["A" for i in range(4)] for j in range(4)]
for i in range(len(a)):
    print()
    for j in a[i]:
        print("%c "%j,end='')
it will print like this:
A A A A
A A A A
A A A A
A A A A
Use pandas for showing any matrix with indices:
>>> import pandas as pd
>>> pd.DataFrame(matrix)
0 1 2 3
0 A A A A
1 A A A A
2 A A A A
3 A A A A
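Several of the snippets above use the Python 2 print statement. In Python 3 the same layout could be produced roughly like this. It is a sketch; the name format_matrix is made up, and the columns only line up when indices and cells are a single character wide.

```python
def format_matrix(matrix):
    """Return the matrix as text with a column-index header and row-index prefixes."""
    n_cols = len(matrix[0])
    # two leading spaces leave room for the row index and its separator
    header = "  " + " ".join(str(j) for j in range(n_cols))
    rows = [
        f"{i} " + " ".join(str(cell) for cell in row)
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header] + rows)

matrix = [['A'] * 4 for _ in range(4)]
print(format_matrix(matrix))
```

Returning a string instead of printing inside the function makes the result easy to check or write to a file.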
