Reduce loop time in Python

A loop in Python is taking a long time to produce a result. The DataFrame contains around 100k records.
How can the run time be reduced?
df['loan_agr'] = df['loan_agr'].astype(int)
for i in range(len(df)):
    if df.loc[i,'order_mt']== df.loc[i,'enr_mt']:
        df['new_N_Loan'] = 1
        df['exist_N_Loan'] = 0
        df['new_V_Loan'] = df['loan_agr']
        df['exist_V_Loan'] = 0
    else:
        df['new_N_Loan'] = 0
        df['exist_N_Loan'] = 1
        df['new_V_Loan'] = 0
        df['exist_V_Loan'] = df['loan_agr']

You can use loc and set the new values in a vectorized way. This approach is much faster than using iteration because these operations are performed on entire columns at once, rather than individual values. Check out this article for more on speed optimization in pandas.
For example:
mask = df['order_mt'] == df['enr_mt']
df.loc[mask, ['new_N_Loan', 'exist_N_Loan', 'exist_V_Loan']] = [1, 0, 0]
df.loc[mask, 'new_V_Loan'] = df['loan_agr']
df.loc[~mask, ['new_N_Loan', 'exist_N_Loan', 'new_V_Loan']] = [0, 1, 0]
df.loc[~mask, 'exist_V_Loan'] = df['loan_agr']
Edit:
If the ~ (bitwise not) operator is not supported in your version of pandas, you can make a new mask for the "else" condition, similar to the first condition.
For example:
mask = df['order_mt'] == df['enr_mt']
else_mask = df['order_mt'] != df['enr_mt']
Then use the else_mask for the second set of definitions instead of ~mask.
Sample:
Input:
order_mt enr_mt new_N_Loan exist_N_Loan exist_V_Loan new_V_Loan loan_agr
0 1 1 None None None None 100
1 2 2 None None None None 200
2 3 30 None None None None 300
3 4 40 None None None None 400
Output:
order_mt enr_mt new_N_Loan exist_N_Loan exist_V_Loan new_V_Loan loan_agr
0 1 1 1 0 0 100 100
1 2 2 1 0 0 200 200
2 3 30 0 1 300 0 300
3 4 40 0 1 400 0 400
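As a rough alternative sketch of the same mask idea, the four columns can also be built directly from the boolean mask with numpy.where, without pre-creating them. The sample values below are copied from the table above; everything else is only an illustration, not the original poster's code.

```python
import numpy as np
import pandas as pd

# Sample data taken from the table above
df = pd.DataFrame({
    "order_mt": [1, 2, 3, 4],
    "enr_mt": [1, 2, 30, 40],
    "loan_agr": [100, 200, 300, 400],
})

# True where the loan is "new" (order month equals enrolment month)
mask = df["order_mt"] == df["enr_mt"]

# Counts: 1/0 flags derived straight from the mask
df["new_N_Loan"] = mask.astype(int)
df["exist_N_Loan"] = (~mask).astype(int)

# Values: take loan_agr on one side of the mask and 0 on the other
df["new_V_Loan"] = np.where(mask, df["loan_agr"], 0)
df["exist_V_Loan"] = np.where(mask, 0, df["loan_agr"])

print(df)
```

Each assignment works on a whole column at once, so no Python-level loop over the 100k rows is needed.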

Instead of range(len(...)), you could replace the len call with a fixed value.

Related

In Python pandas, finding where 4 consecutive rows have increasing values

I am trying to figure out how to mark the rows whose price is part of 4 consecutive increasing prices.
The "is_consecutive" column is the mark.
I managed to compute the difference between rows:
df['diff1'] = df['Close'].diff()
But I did not manage to find out which rows are part of a run of 4 increasing prices.
I thought about using df.rolling().
The example df:
On rows 0-3, we need an output of True in the "is_consecutive" column, because the price keeps rising across these 4 consecutive rows.
On rows 8-11, we need an output of False in the "is_consecutive" column, because 'diff1' is zero on these consecutive rows.
Date Price diff1 is_consecutive
0 1/22/20 0 0 True
1 1/23/20 130 130 True
2 1/24/20 144 14 True
3 1/25/20 150 6 True
4 1/27/20 60 -90 False
5 1/28/20 95 35 False
6 1/29/20 100 5 False
7 1/30/20 50 -50 False
8 2/01/20 100 0 False
9 1/02/20 100 0 False
10 1/03/20 100 0 False
11 1/04/20 100 0 False
12 1/05/20 50 -50 False
General example:
If
price = [30,55,60,65,25]
then the differences between consecutive numbers in the list are:
diff1 = [0,25,5,5,-40]
So a positive diff1 means that consecutive prices are increasing.
I need to mark (in the df) the rows that belong to 4 consecutive prices that go up.
Thank you for your help (-:
Try .rolling with a window of size 4 and min_periods=1:
df["is_consecutive"] = (
    df["Price"]
    .rolling(4, min_periods=1)
    .apply(lambda x: (x.diff().fillna(0) >= 0).all())
    .astype(bool)
)
print(df)
Prints:
Date Price is_consecutive
0 1/22/20 0 True
1 1/23/20 130 True
2 1/24/20 144 True
3 1/25/20 150 True
4 1/26/20 60 False
5 1/26/20 95 False
6 1/26/20 100 False
7 1/26/20 50 False
Assuming the dataframe is sorted, one way is to use the cumsum of the signs of the differences to identify the first upward Price move that follows a 3-day upward trend (i.e. 4 days of upward trend).
quant1 = (df['Price'].diff().apply(np.sign) == 1).cumsum()
quant2 = (df['Price'].diff().apply(np.sign) == 1).cumsum().where(~(df['Price'].diff().apply(np.sign) == 1)).ffill().fillna(0).astype(int)
df['is_consecutive'] = (quant1-quant2) >= 3
Note that the above only counts strictly increasing Prices (equal prices do not count).
Then we also set the is_consecutive tag of the previous 3 Prices to True, using the self-defined win_view function:
def win_view(x, size):
    if isinstance(x, list):
        x = np.array(x)
    if isinstance(x, pd.core.series.Series):
        x = x.values
    if isinstance(x, np.ndarray):
        pass
    else:
        raise Exception('wrong type')
    return np.lib.stride_tricks.as_strided(
        x,
        shape=(x.size - size + 1, size),
        strides=(x.strides[0], x.strides[0])
    )
arr = win_view(df['is_consecutive'], 4)
arr[arr[:,3]] = True
Note that we set the values to True in place.
EDIT 1
Inspired by the self-defined win_view function, I realized that the solution can be obtained with win_view alone (without using cumsums), as below:
df['is_consecutive'] = False
arr = win_view(df['Price'].diff(), 4)
arr_ind = win_view(list(df['Price'].index), 4)
mask = arr_ind[np.all(arr[:, 1:] > 0, axis=1)].flatten()
df.loc[mask, 'is_consecutive'] = True
We maintain 2 arrays, one for the returns and one for the indices. We collect the indices where we have 3 consecutive positive returns, np.all(arr[:, 1:] > 0, axis=1) (i.e. 4 upward-moving prices), and we set those rows to True in our original df.
The function returns columns named "consecutive_up", which marks all rows that are part of a series of 5 increases, and "consecutive_down", which marks all rows that are part of a series of 4 decreases.
def c_func(temp_df):
    temp_df['increase'] = temp_df['Price'] > temp_df['Price'].shift()
    temp_df['decrease'] = temp_df['Price'] < temp_df['Price'].shift()
    temp_df['consecutive_up'] = False
    temp_df['consecutive_down'] = False
    # Column positions 4 and 5 used below are 'consecutive_up' and
    # 'consecutive_down'; this assumes the input had only 'Date' and 'Price'
    # and a default 0..n-1 index.
    count = 0  # length of the current run of increases
    for ind, row in temp_df.iterrows():
        if row['increase']:
            count += 1
        else:
            count = 0
        if count == 5:
            temp_df.iloc[ind - 5:ind + 1, 4] = True
        elif count > 5:
            temp_df.iloc[ind, 4] = True
    count = 0  # length of the current run of decreases
    for ind, row in temp_df.iterrows():
        if row['decrease']:
            count += 1
        else:
            count = 0
        if count == 4:
            temp_df.iloc[ind - 4:ind + 1, 5] = True
        elif count > 4:
            temp_df.iloc[ind, 5] = True
    return temp_df
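For comparison, here is a loop-free sketch of the same marking task that labels runs of rises with the shift/cumsum grouping trick. The function name mark_increasing_runs and its parameter n are made up for this illustration; the prices are the ones from the question's example table, and only strict increases are counted.

```python
import pandas as pd

def mark_increasing_runs(prices: pd.Series, n: int = 4) -> pd.Series:
    """Mark rows that belong to at least n consecutive strictly rising prices."""
    up = prices.diff() > 0                      # True where the price rose
    run_id = (up != up.shift()).cumsum()        # new id whenever `up` flips
    rises_in_run = up.groupby(run_id).transform("sum")
    # n rising prices means n - 1 consecutive rises
    in_run = up & (rises_in_run >= n - 1)
    # the row just before the first rise is the first price of the streak
    first_of_run = in_run.shift(-1, fill_value=False) & ~in_run
    return in_run | first_of_run

prices = pd.Series([0, 130, 144, 150, 60, 95, 100, 50, 100, 100, 100, 100, 50])
is_consecutive = mark_increasing_runs(prices, n=4)
print(is_consecutive.tolist())
```

On the example data this marks rows 0-3 only, which matches the is_consecutive column the question asks for.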

Sliding Window and comparing elements of DataFrame to a threshold

Assume I have the following dataframe:
Time Flag1
0 0
10 0
30 0
50 1
70 1
90 0
110 0
My goal is, for each row, to look at the window of rows whose Time is less than that row's Time plus 35; if any Flag1 value in that window is 1, the output for that row should be 1. For example, consider the data above:
The first element of Time is 0, so the threshold is 0 + 35 = 35. In the window of values less than 35 (Time = 0, 10, 30), all the Flag1 values are 0, so the first row is assigned 0. The next window is 10 + 35 = 45, which still includes (0, 10, 30), and the flag is still 0, and so on. So the complete output is:
Time Flag1 Output
0 0 0
10 0 0
30 0 1
50 1 1
70 1 1
90 1 1
110 1 1
To implement this type of problem, I thought I could use two for loops like this:
output = []
for ii in range(Data.shape[0]):
    count = 0
    th = Data.loc[ii,'Time'] + 35
    for jj in range(ii,Data.shape[0]):
        if (Data.loc[jj,'Time'] < th and Data.loc[jj,'Flag1'] == 1):
            count = 1
            break
    output.append(count)
However, this looks tedious, since the inner for loop may continue for the entire length of the data. I am also not sure whether this method handles the boundary cases (out-of-bounds indices) when we reach the end of the dataframe. I would appreciate it if someone could suggest something easier than this. It is like a sliding window operation, except that it compares a number to a threshold.
Edit: I do not want to compare only two consecutive rows. For example, since 30 + 35 = 65, as long as Time is less than 65 and Flag1 is 1, the output is 1.
The second example:
Time Flag1 Output
0 0 0
30 0 1
40 0 1
60 1 1
90 1 1
140 1 1
200 1 1
350 1 1
Assuming a window k rows before and k rows after as mentioned in my comment:
import pandas as pd
Data = pd.DataFrame([[0,0], [10,0], [30,0], [50,1], [70,1], [90,1], [110,1]],
                    columns=['Time', 'Flag1'])
k = 1 # size of window: up to k rows before and up to k rows after
n = len(Data)
output = [0]*n
for i in range(n):
    th = Data['Time'][i] + 35
    j0 = max(0, i - k)
    j1 = min(i + k + 1, n) # the +1 is because range is non-inclusive of end
    output[i] = int(any((Data['Time'][j0 : j1] < th) & (Data['Flag1'][j0 : j1] > 0)))
Data['output'] = output
print(Data)
This gives the same output as the original example, and you can change the size of the window by modifying k.
Of course, if the idea is to check any row afterward, then just use j1 = n in my example.
import pandas as pd
Data = pd.DataFrame([[0,0],[10,0],[30,0],[50,1],[70,1],[90,1],[110,1]],columns=['Time','Flag1'])
output = Data.index.map(lambda x: 1 if any((Data.Time[x+1:]<Data.Time[x]+35)*(Data.Flag1[x+1:]==1)) else 0).values
output[-1] = Data.Flag1.values[-1]
Data['output'] = output
print(Data)
# show
Time Flag1 output
0 0 0
30 0 1
40 0 1
50 1 1
70 1 1
90 1 1
110 1 1
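If Time is sorted in ascending order (an assumption, although it holds for the examples here), the question's double loop can be sketched without Python-level iteration by combining a prefix sum of the flags with numpy.searchsorted. The helper name flag_within_window and the width parameter are made up for this illustration; the window runs from the current row forward, as in the question's own loop.

```python
import numpy as np
import pandas as pd

def flag_within_window(data: pd.DataFrame, width: int = 35) -> np.ndarray:
    """For each row i, 1 if any row j >= i has Time[j] < Time[i] + width and Flag1[j] == 1."""
    time = data["Time"].to_numpy()
    flag = data["Flag1"].to_numpy()
    # prefix[k] = number of flagged rows among the first k rows
    prefix = np.concatenate(([0], np.cumsum(flag == 1)))
    # end[i] = position of the first row whose Time is >= Time[i] + width
    end = np.searchsorted(time, time + width, side="left")
    start = np.arange(len(time))
    # flagged rows inside [start, end) = difference of two prefix sums
    return (prefix[end] - prefix[start] > 0).astype(int)

Data = pd.DataFrame(
    [[0, 0], [10, 0], [30, 0], [50, 1], [70, 1], [90, 1], [110, 1]],
    columns=["Time", "Flag1"],
)
Data["output"] = flag_within_window(Data)
print(Data)
```

The prefix sum turns "is there any flagged row in this range" into a subtraction, so the cost is roughly one sort-based lookup per row instead of an inner loop.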

Vectorized function with counter on pandas dataframe column

Consider this pandas dataframe where the condition column is 1 when value is below 5 (any threshold).
import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df
Out[1]:
value condition
0 30 0
1 100 0
2 4 1
3 0 1
4 80 0
5 0 1
6 1 1
7 4 1
8 70 0
9 70 0
What I want is to have all consecutive values below 5 to have the same id and all values above five have 0 (or NA or a negative value, doesn't matter, they just need to be the same). I want to create a new column called new_id that contains these cumulative ids as follows:
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
In a very inefficient for loop I would do this (which works):
counter = 1  # first id to hand out
for i in range(0,df.shape[0]):
    if (df.loc[df.index[i],'condition'] == 1) & (df.loc[df.index[i-1],'condition']==0):
        new_id = counter # assign new id
        counter += 1
    elif (df.loc[df.index[i],'condition']==1) & (df.loc[df.index[i-1],'condition']!=0):
        new_id = counter-1 # assign current id
    elif (df.loc[df.index[i],'condition']==0):
        new_id = df.loc[df.index[i],'condition'] # assign 0
    df.loc[df.index[i],'new_id'] = new_id
df
But this is very inefficient and I have a very big dataset. I therefore tried different kinds of vectorization, but so far I have failed to keep the id from counting up inside each "cluster" of consecutive points:
# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]
# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]
I also tried using apply() with a custom if else function but it seems like this does not allow me to use a counter.
There is already a ton of similar posts about this but none of them keep the same id for consecutive rows.
Example posts are:
Maintain count in python list comprehension
Pandas cumsum on a separate column condition
Python - keeping counter inside list comprehension
python pandas conditional cumulative sum
Conditional count of cumulative sum Dataframe - Loop through columns
You can use cumsum(), as you did in your first try; just modify it a bit:
# calculate delta (treat the row before the first one as 0)
df['delta'] = df['condition']-df['condition'].shift(1, fill_value=0)
# get rid of -1 for the cumsum (replace it by 0)
df['delta'] = df['delta'].replace(-1,0)
# cumulative sum conditional: multiply with condition column
df['cumsum_x'] = df['delta'].cumsum()*df['condition']
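The delta/cumsum idea above can be condensed into one small helper that needs no temporary columns. This is only a sketch; the function name run_ids is made up for the illustration, and the condition column is assumed to contain only 0 and 1.

```python
import pandas as pd

def run_ids(condition: pd.Series) -> pd.Series:
    """Give each run of consecutive 1s its own id (1, 2, ...); rows with 0 get id 0."""
    is_one = condition == 1
    # a run starts where this row is 1 and the previous row is not
    starts = is_one & ~is_one.shift(1, fill_value=False)
    # counting the starts so far gives the id; blank it out where condition is 0
    return starts.cumsum().where(is_one, 0)

df = pd.DataFrame({
    "value": [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
    "condition": [0, 0, 1, 1, 0, 1, 1, 1, 0, 0],
})
df["new_id"] = run_ids(df["condition"])
print(df)
```

Using fill_value=False in the shift means a series that begins with 1 still gets id 1 for its first run.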
Welcome to SO! Why not just rely on base Python for this?
def counter_func(l):
    new_id = [0] # First value is zero in any case
    counter = 0
    for i in range(1, len(l)):
        if l[i] == 0:
            new_id.append(0)
        elif l[i] == 1 and l[i-1] == 0:
            counter += 1
            new_id.append(counter)
        elif l[i] == l[i-1] == 1:
            new_id.append(counter)
        else:
            new_id.append(None)
    return new_id
df["new_id"] = counter_func(df["condition"])
Looks like this
value condition new_id
0 30 0 0
1 100 0 0
2 4 1 1
3 0 1 1
4 80 0 0
5 0 1 2
6 1 1 2
7 4 1 2
8 70 0 0
9 70 0 0
Edit :
You can also use numba, which sped up the function quite a lot for me: from about 1 s to about 60 ms.
You need to pass NumPy arrays into the function to use it, meaning you'll have to use df["condition"].values.
from numba import njit
import numpy as np
@njit
def func(arr):
    res = np.empty(arr.shape[0])
    counter = 0
    res[0] = 0 # First value is zero anyway
    for i in range(1, arr.shape[0]):
        if arr[i] == 0:
            res[i] = 0
        elif arr[i] and arr[i-1] == 0:
            counter += 1
            res[i] = counter
        elif arr[i] == arr[i-1] == 1:
            res[i] = counter
        else:
            res[i] = np.nan
    return res
df["new_id"] = func(df["condition"].values)

Python DataFrame Accumulator Based on Flag

I have a logic-driven flag column and I need to create a column that increments by 1 when the flag is true and decrements by 1 when the flag is false down to a floor of zero.
I've tried a few different methods and I can't get the Accumulator 'shift' to reference the new value created by the process. I know the method below wouldn't stop at zero anyway, but I was just trying to work through the concept before and this is the most to-the-point example to explain the goal. Do I need a for loop to iterate line-by-line?
df = pd.DataFrame(data=np.random.randint(2,size=10), columns=['flag'])
df['accum'] = 0
df['accum'] = np.where(df['flag'] == 1, df['accum'].shift(1) + 1, df['accum'].shift(1) - 1)
df['dOutput'] = [1,0,1,2,1,2,3,2,1,0] #desired output
df
As far as I know, there's no numpy or pandas vectorized operation to do this, so, you should iterate line-by-line:
def cumsum_with_floor(series):
    acc = 0
    output = []
    accum_list = []
    for val in series:
        val = 1 if val else -1
        acc += val
        accum_list.append(val)
        acc = acc if acc > 0 else 0
        output.append(acc)
    return pd.Series(output, index=series.index), pd.Series(accum_list, index=series.index)
series = pd.Series([1,0,1,1,0,0,0,1])
dOutput, accum = cumsum_with_floor(series)
dOutput
Out:
0 1
1 0
2 1
3 2
4 1
5 0
6 0
7 1
dtype: int64
accum # shifted by one step forward compared with your example
Out:
0 1
1 -1
2 1
3 1
4 -1
5 -1
6 -1
7 1
dtype: int64
But maybe somebody knows a suitable combination of clip and cumsum, or other vectorized operations.
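One vectorized candidate, offered as a sketch rather than a definitive answer: a running sum that is floored at zero equals the plain running sum minus its own running minimum (with that minimum capped at 0). The function name floored_accumulator is made up for this illustration, and the second flag sequence in the usage lines is inferred from the desired output in the question, not taken from it.

```python
import numpy as np
import pandas as pd

def floored_accumulator(flag: pd.Series) -> pd.Series:
    """+1 when flag is 1, -1 otherwise, never dropping below zero."""
    steps = np.where(flag.to_numpy() == 1, 1, -1)
    total = np.cumsum(steps)                      # unconstrained running sum
    # lowest level reached so far, counting the starting level 0
    lowest = np.minimum.accumulate(np.minimum(total, 0))
    # lifting the sum by how far it has dipped below 0 enforces the floor
    return pd.Series(total - lowest, index=flag.index)

series = pd.Series([1, 0, 1, 1, 0, 0, 0, 1])
print(floored_accumulator(series).tolist())

# Flags inferred from the question's desired output (an assumption)
flags = pd.Series([1, 0, 1, 1, 0, 1, 1, 0, 0, 0])
print(floored_accumulator(flags).tolist())
```

For the 8-element series this reproduces the values returned by the loop-based cumsum_with_floor above.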

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where a row's position within its consecutive run is below 2 (or whatever limiting value you like). Then combine this new boolean Series with the original column using &:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby"
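To make the shift-groupby-cumsum-cumcount combination easier to follow, here is the same computation split into named steps. This is a sketch; the intermediate names run_id and pos_in_run and the limit variable are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

# Step 1: a new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points gives every run its own id
run_id = (df["A"] != df["A"].shift()).cumsum()

# Step 2: position of each row inside its run (0, 1, 2, ...)
pos_in_run = df.groupby(run_id).cumcount()

# Step 3: keep a 1 only if it is among the first `limit` rows of its run
df["B"] = ((pos_in_run < limit) & (df["A"] == 1)).astype(int)

print(df)
```

Changing limit to another value trims the runs of 1s to that length instead.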
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the input is not modified
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
