I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However, the loops take forever, and I wanted to know if there was a faster way. Everybody keeps mentioning shift, but I don't understand how that would even work.
import pandas as pd

df = pd.DataFrame(index=range(500))
df["A"] = 2
df["B"] = 5
df.loc[0, "A"] = 1
for i in range(1, len(df)):
    df.loc[i, "A"] = (df.loc[i - 1, "A"] / 3) - df.loc[i - 1, "B"] + 25
numpy_ext can be used for expanding calculations; see pandas-rolling-apply-using-multiple-columns for reference. I have also included a simpler calculation (a running sum of previous values) to demonstrate the behaviour more simply.
import numpy as np
import pandas as pd
import numpy_ext as npe

df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df.loc[0, "A"] = 1

# original loop, for reference:
# for i in range(1, len(df)):
#     df.loc[i, "A"] = (df.loc[i - 1, "A"] / 3) - df.loc[i - 1, "B"] + 25

# SO example - a function of the previous values in A and B
def f(A, B):
    return np.sum(A[:-1] / 3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]

# much simpler example - a running sum of the previous values
def g(A):
    return np.sum(A[:-1])

df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
Sample output:

|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |    1       |           0 |
|  1 |   2 |   5 |  -29.6667  |           1 |
|  2 |   2 |   5 |  -59       |           3 |
|  3 |   2 |   5 |  -88.3333  |           5 |
|  4 |   2 |   5 | -117.667   |           7 |
|  5 |   2 |   5 | -147       |           9 |
|  6 |   2 |   5 | -176.333   |          11 |
|  7 |   2 |   5 | -205.667   |          13 |
|  8 |   2 |   5 | -235       |          15 |
|  9 |   2 |   5 | -264.333   |          17 |
In one move we can make an element equal to the 2nd-maximum element, and we have to make all elements equal to the minimum element. My code is given below; it works fine, but I want to reduce its time complexity.
def No_Books(arr, n):
    arr = sorted(arr)
    steps = 0
    while arr[0] != arr[arr.index(max(arr))]:
        max1 = max(arr)
        count = arr.count(max1)
        scnd_max = arr.index(max1) - 1
        arr[scnd_max + count] = arr[scnd_max]
        steps += 1
    return steps

n = int(input())
arr = [int(x) for x in input().split()]
print(No_Books(arr, n))
Example run (the first two lines are the input n and the array; the last line is the printed result):

5
4 5 5 2 4
6

Here the minimum number of moves required is 6.
I'm interpreting the question in the following way:
For each element in the array, there is one and only one operation you're allowed to perform, and that operation is to replace an index's value with the array's current second-largest element.
How many operations are necessary to make the entire array's values equal to the initial minimum value?
With the example input 4 5 5 2 4 needing to go through the following steps:
Array - step - comments
4 5 5 2 4 - 0 - start
4 4 5 2 4 - 1 - replace the first 5 with 4 (the second-largest value in the array)
4 4 4 2 4 - 2 - replace the second 5 with 4
2 4 4 2 4 - 3 - replace the first 4 with 2
2 2 4 2 4 - 4
2 2 2 2 4 - 5
2 2 2 2 2 - 6
It took 6 steps, so the result is 6.
If that is correct, then I can change your quadratic solution (O(n^2), where n is the size of the array) to a quasilinear solution (O(n + m log m), where n is the size of the array and m is the number of unique values in the array), as follows.
The approach is to notice that each value needs to be dropped down to the next largest value for each unique value smaller than itself. So if we can track the count of each unique value, we can determine the number of steps without actually doing any array updates.
In pseudocode:

function determineSteps(array):
    define map from integer to integer, defaulting to 0
    for each value in array:                 // Linear in N
        map(value)++
    sort map by key, descending              // M log M
    // largerCount is the number of elements larger than the current second-largest value
    define largerCount, assign 0 to largerCount
    // stepCount is the number of steps required
    define stepCount, assign 0 to stepCount
    for each key in map except the last:     // Linear in M
        largerCount = largerCount + map(key)
        stepCount = stepCount + largerCount
    return stepCount
On your example input:
4 5 5 2 4
Create map { 4: 2, 5: 2, 2: 1 }
Sort map by key, descending: { 5: 2, 4: 2, 2: 1 }
stepCount = 0
largerCount = 0
Examine key = 5, map(key) = 2
largerCount = 0 + 2 = 2
stepCount = 0 + 2 = 2
Examine key = 4, map(key) = 2
largerCount = 2 + 2 = 4
stepCount = 2 + 4 = 6
return 6
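For concreteness, here is a minimal Python sketch of that pseudocode (no_books_fast is a name I made up; collections.Counter plays the role of the map):

from collections import Counter

def no_books_fast(arr):
    counts = Counter(arr)                 # value -> frequency, linear in n
    larger_count = 0                      # elements larger than the current key
    step_count = 0
    # visit the unique values from largest down, skipping the minimum
    for value in sorted(counts, reverse=True)[:-1]:
        larger_count += counts[value]
        step_count += larger_count        # each larger element drops one more level
    return step_count

print(no_books_fast([4, 5, 5, 2, 4]))     # 6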
I have a dataframe like the following:
Anger Sad Happy Disgust Neutral Scared
0.06754 0.6766 0.4343 0.7732 0.5563 0.76433
0.54434 0.9865 0.6654 0.3334 0.4322 0.54453
...
0.5633 0.67655 0.5444 0.3278 0.9834 0.88569
I would like to create a new column that marks the first 5 rows as 1, the next 5 rows as 2, the next 5 rows as 3, and then the next 3 rows as 4, repeating the same pattern until the end of the dataset. How can I achieve this? I tried looking into arange but failed at the implementation.
An example output would be the new column Tperiod
Tperiod
1
1
1
1
1
2
2
2
2
2
3
3
3
3
3
4
4
4
1
1
1
1
1
One way to do it would be as follows:

pattern = [1] * 5 + [2] * 5 + [3] * 5 + [4] * 3
no_patterns = len(df) // len(pattern)
remaining = len(df) - (len(pattern) * no_patterns)
new_values = pattern * no_patterns + pattern[:remaining]
df['new_column'] = new_values
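If numpy is already imported, np.resize would do the tiling and truncation in one step (my suggestion, not part of the answer above), since it fills the requested length with repeated copies of the input:

import numpy as np

df['new_column'] = np.resize(pattern, len(df))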
You can do:

gr = [5, 5, 5, 3]
mp = {n: next(x + 1 for x in range(len(gr)) if sum(gr[:x + 1]) > n) for n in range(sum(gr))}
df['Tperiod'] = range(len(df))
df['Tperiod'] = (df['Tperiod'] % sum(gr)).map(mp)

Here gr holds your group sizes (that's your input), and mp maps each position within one full cycle to its group number so that pandas can apply it with map.
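To make that mapping concrete, printing mp for the gr above shows one full 18-row cycle (positions 0 through 17) mapped to group numbers:

gr = [5, 5, 5, 3]
mp = {n: next(x + 1 for x in range(len(gr)) if sum(gr[:x + 1]) > n)
      for n in range(sum(gr))}
print(mp)
# {0: 1, 1: 1, 2: 1, 3: 1, 4: 1,
#  5: 2, 6: 2, 7: 2, 8: 2, 9: 2,
#  10: 3, 11: 3, 12: 3, 13: 3, 14: 3,
#  15: 4, 16: 4, 17: 4}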
I am new to Python, and the last time I coded was in the mid-'80s, so I appreciate your patient help.
It seems .rolling(window) requires the window to be a fixed integer. I need a rolling window where the window or lookback period is dynamic and given by another column.
In the table below, I seek the Lookbacksum which is the rolling sum of Data as specified by the Lookback column.
d = {'Data': [1, 1, 1, 2, 3, 2, 3, 2, 1, 2],
     'Lookback': [0, 1, 2, 2, 1, 3, 3, 2, 3, 1],
     'LookbackSum': [1, 2, 3, 4, 5, 8, 10, 7, 8, 3]}
df = pd.DataFrame(data=d)
eg:
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
You can create a custom function for use with df.apply, eg:
def lookback_window(row, values, lookback, method='sum', *args, **kwargs):
    # position of the current row within the values Series
    loc = values.index.get_loc(row.name)
    # how many prior rows this row is allowed to look back over
    lb = lookback.loc[row.name]
    # slice the trailing window (inclusive of the current row) and aggregate it
    return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)
Then use it as:
df['new_col'] = df.apply(lookback_window, values=df['Data'], lookback=df['Lookback'], axis=1)
There may be some corner cases but as long as your indices align and are unique - it should fulfil what you're trying to do.
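One such corner case: if a row's Lookback ever exceeded its position, loc - lb would go negative and the iloc slice would silently return the wrong window. A guarded variant (my addition, not part of the answer above) just clamps the start:

def lookback_window_safe(row, values, lookback, method='sum', *args, **kwargs):
    loc = values.index.get_loc(row.name)
    lb = lookback.loc[row.name]
    start = max(loc - lb, 0)  # clamp: a lookback longer than the history starts at row 0
    return getattr(values.iloc[start: loc + 1], method)(*args, **kwargs)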
Here is one with a list comprehension: it enumerates the index and value of the df['Lookback'] column, then builds each slice by reversing the values seen so far and slicing according to the column value:

df['LookbackSum'] = [sum(df.loc[:e, 'Data'][::-1].to_numpy()[:i + 1])
                     for e, i in enumerate(df['Lookback'])]
print(df)
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
An exercise in pain, if you want to try an almost fully vectorized approach. Sidenote: I don't think it's worth it here. At all.
Inspired by Divakar's answer here
Given:
import numpy as np
import pandas as pd
d = {'Data': [1, 1, 1, 2, 3, 2, 3, 2, 1, 2],
     'Lookback': [0, 1, 2, 2, 1, 3, 3, 2, 3, 1],
     'LookbackSum': [1, 2, 3, 4, 5, 8, 10, 7, 8, 3]}
df = pd.DataFrame(data=d)
Using the function from Divakar's answer, but slightly modified
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r, fill_value=np.nan):
    # Concatenate with sliced to cover all rolls
    p = np.full((a.shape[0], a.shape[1] - 1), fill_value)
    a_ext = np.concatenate((p, a, p), axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext, (1, n))[np.arange(len(r)), -r + (n - 1), 0]
Now, we just need to prepare a 2d array for the data and independently shift the rows according to our desired lookback values.
arr = df['Data'].to_numpy().reshape(1, -1).repeat(len(df), axis=0)
shifter = np.arange(len(df) - 1, -1, -1) #+ d['Lookback'] - 1
temp = strided_indexing_roll(arr, shifter, fill_value=0)
out = strided_indexing_roll(temp, (len(df) - 1 - df['Lookback'])*-1, 0).sum(-1)
Output:
array([ 1, 2, 3, 4, 5, 8, 10, 7, 8, 3], dtype=int64)
We can then just assign it back to the dataframe as needed and check.
df['out'] = out
#output:
Data Lookback LookbackSum out
0 1 0 1 1
1 1 1 2 2
2 1 2 3 3
3 2 2 4 4
4 3 1 5 5
5 2 3 8 8
6 3 3 10 10
7 2 2 7 7
8 1 3 8 8
9 2 1 3 3
Say I have a dataframe containing strings, such as:
df = pd.DataFrame({'col1':list('some_string')})
col1
0 s
1 o
2 m
3 e
4 _
5 s
...
I'm looking for a way to apply a rolling window on col1 and join the strings in a certain window size. Say for instance window=3, I'd like to obtain (with no minimum number of observations):
col1
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
I've tried the obvious solutions with rolling which fail at handling object types:
df.col1.rolling(3, min_periods=0).sum()
df.col1.rolling(3, min_periods=0).apply(''.join)
Both raise:
cannot handle this type -> object
Is there a generalisable approach to do so (not using shift to match this specific case of w=3)?
How about shifting the series?
df.col1.shift(2).fillna('') + df.col1.shift().fillna('') + df.col1
Generalizing to any window size (the range is reversed so the oldest shift comes first and each window reads left to right):

pd.concat([df.col1.shift(i).fillna('') for i in range(2, -1, -1)], axis=1).sum(axis=1)
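Wrapped up as a reusable helper (rolling_join is a hypothetical name, not from the answer above):

def rolling_join(s, window):
    # oldest shift first, so each window reads left to right
    parts = [s.shift(i).fillna('') for i in range(window - 1, -1, -1)]
    return pd.concat(parts, axis=1).sum(axis=1)

df['col1_rolled'] = rolling_join(df.col1, 3)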
Rolling works only with numbers, as you can see in pandas' internal _prep_values:

def _prep_values(self, values=None, kill_inf=True):
    if values is None:
        values = getattr(self._selected_obj, 'values', self._selected_obj)
    # GH #12373 : rolling functions error on float32 data
    # make sure the data is coerced to float64
    if is_float_dtype(values.dtype):
        values = ensure_float64(values)
    elif is_integer_dtype(values.dtype):
        values = ensure_float64(values)
    elif needs_i8_conversion(values.dtype):
        raise NotImplementedError...
    ...
So you should construct it manually. Here is one possible variant with a simple list comprehension (maybe a more Pandas-ish way exists):
df = pd.DataFrame({'col1':list('some_string')})
pd.Series([
    ''.join(df.col1.values[max(i - 2, 0): i + 1])
    for i in range(len(df.col1.values))
])
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
dtype: object
Using pd.Series.cumsum also seems to work (although it is a bit inefficient, since the intermediate cumulative strings keep growing with position):
df['col1'].cumsum().str[-3:]
Output:
0 s
1 so
2 som
3 ome
4 me_
5 e_s
6 _st
7 str
8 tri
9 rin
10 ing
Name: col1, dtype: object