Pandas dataframe if else condition based on previous rows not working - python

I have a pandas dataframe as below:
df = pd.DataFrame({'X': [1, 1, 1, 0, 0]})
df
X
0 1
1 1
2 1
3 0
4 0
Now I want to modify X based on the below condition:
If X == 0, set X to the previous row's X + 1
So, my final output should look like below:
X
0 1
1 1
2 1
3 2
4 3
This can be achieved by iterating over the rows, keeping track of the current and previous row with iloc, and it works as expected:
for i in range(0, len(df)):
    current_row = df.iloc[i]
    if i > 0:
        previous_row = df.iloc[i-1]
    else:
        previous_row = current_row
    if current_row['X'] == 0:
        current_row['X'] = previous_row['X'] + 1
I want a more efficient way of doing this, and I tried the code below, but the output is not what I expected (the value of X in the 5th row should be 3):
conditions = [df["X"] == 0]
values = [df["X"] .shift() + 1]
df['X'] = np.select(conditions, values)
>>> df
X
0 1
1 1
2 1
3 2
4 1
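The reason this fails is that np.select builds its replacement values from the original column: for the second zero, df["X"].shift() still sees the original 0 in the row above, giving 0 + 1 = 1 instead of chaining off the updated value. A vectorized sketch that gets around this (assuming, as in your data, that every run of zeros is preceded by a nonzero value) forward-fills the last nonzero value and adds the position within each zero run:
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 1, 1, 0, 0]})
zero = df['X'].eq(0)
# position within each run of zeros: 1, 2, ...
run = zero.astype(int).groupby((~zero).cumsum()).cumsum()
# last nonzero value, carried forward, plus the run position
df['X'] = np.where(zero, df['X'].mask(zero).ffill() + run, df['X']).astype(int)
print(df['X'].tolist())  # [1, 1, 1, 2, 3]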

You could try the following:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 1, 0, 0]})
# values immediately before a zero
pe_zero = df.X.shift(-1).eq(0) * df.X  # [0 0 1 0 0]
# 1 for each zero value, as you add one to the previous value
eq_zero = df.X.eq(0)
# find consecutive groups of 0
groups = pe_zero + eq_zero
consecutive = (groups.gt(0) != groups.gt(0).shift()).cumsum()
# find cumulative sum by groups
cumulative = groups.groupby(consecutive).cumsum()
# choose from cumulative where X equals zero, else keep the original
result = np.where(eq_zero, cumulative, df.X)
print(result)
Output
[1 1 1 2 3]
UPDATE
For df = pd.DataFrame({'X': [1, 1, 1, 0, 0, 1, 1, 0, 0]})
returns:
[1 1 1 2 3 1 1 2 3]

You could try this:
arr = df.X.values  # extract the column as a numpy array for faster iteration
for i, val in enumerate(arr[1:], start=1):
    if val == 0:
        arr[i] = arr[i-1] + 1
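Depending on your pandas version and copy-on-write settings, values may return either a view or a copy of the underlying column, so it is safest to write the array back explicitly once the loop finishes:
df['X'] = arr  # explicit write-back, in case values returned a copy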


Create a New Column after every for loop iteration
proba = [12, 65, 1, 54]
tau = []
for i in range(len(proba)):
    for j in range(len(proba)):
        if proba[j] >= proba[i]:
            tau.append(1)
        else:
            tau.append(0)
print(tau)
I am getting output like this:
[1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1]
But the output I need looks like this:
proba tau1 tau2 tau3 tau4
12 1 0 1 0
65 1 1 1 1
1 0 0 1 0
54 1 0 1 1
We can also use pandas and numpy to make the code more generic.
You could use a combination of pandas and numpy:
proba = np.array([12,65,1,54])
df = pd.DataFrame(proba, columns=['proba'])
for i in range(len(proba)):
    df = pd.concat([df, pd.Series(proba >= proba[i], name=f'tau{i}').astype(int)], axis=1)
Output:
proba tau0 tau1 tau2 tau3
0 12 1 0 1 0
1 65 1 1 1 1
2 1 0 0 1 0
3 54 1 0 1 1
Built-in data structures such as dictionaries and lists serve well for creating dataframes:
import pandas as pd
proba = [12, 65, 1, 54]
taus = {}
for idx, i in enumerate(proba):
    vals = []
    for j in proba:
        if j >= i:
            vals.append(1)
        else:
            vals.append(0)
    taus[f"tau{idx}"] = vals
df = pd.DataFrame(taus)
df["proba"] = proba

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example the result should look like :
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
col increase decrease
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
resetting the count
As I realize your example is ambiguous, here is an additional method in case you want to reset the counts for each increasing/decreasing group, using groupby.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
col inc dec inc2 dec2
0 1 0 0 0 0
1 2 1 0 1 0
2 3 2 0 2 0
3 2 0 1 0 1
4 1 0 2 0 2
5 2 3 0 1 0
6 3 4 0 2 0
7 1 0 3 0 1
data = {'input': [1, 2, 3, 2, 1]}
df = pd.DataFrame(data)
inc = df['input'] > df['input'].shift()
dec = df['input'] < df['input'].shift()
df['a'] = inc.cumsum() - inc.cumsum().where(~inc).ffill().fillna(0).astype(int)
df['b'] = dec.cumsum() - dec.cumsum().where(~dec).ffill().fillna(0).astype(int)
print(df)
output
input a b
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
Coding this manually using numpy might look like this:
import numpy as np

input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input), dtype=int)
decrease = np.zeros(len(input), dtype=int)
for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif input[i] < input[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0
increase  # array([0, 1, 2, 0, 0])
decrease  # array([0, 0, 0, 1, 2])

How to create a new column through a specific condition?

I have a column like this:
1
0
0
0
0
1
0
0
0
1
0
0
and need as result:
1
1
1
1
1
2
2
2
2
3
3
3
I need a method/algorithm that splits the column into groups running from one 1 to the next and assigns them successive values.
Any idea?
You can loop through the list and use a counter to update the column values, incrementing it every time you find the number 1:
def rank(lst):
    counter = 0
    for i, column in enumerate(lst):
        if column == 1:
            counter += 1
        lst[i] = counter
def fill_arr(arr):
    curr = 1
    for i in range(1, len(arr)):
        arr[i] = curr
        if i < len(arr)-1 and arr[i+1] == 1:
            curr += 1
    return arr
A quick test
arr = [1,0,0,0,1,0,0,0,1,0,0]
fill_arr(arr)
[1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
The idea is as follows:
keep track of the number of 1s we encounter in curr by looking ahead, and increment it as we see new 1s.
set the element at the current index to curr.
we start at index 1 since we know that there is a 1 at index zero. This helps us reduce edge cases and makes the algorithm easier to manage.
What you are looking for is usually called the cumulative sum; as a verb, you want to accumulate the values in the list.
For a python list:
import itertools
l1 = [1,0,0,0,1,0,0,0,1,0,0]
l2 = list(itertools.accumulate(l1))
print(l2)
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
For a numpy array:
import numpy
a1 = numpy.array([1,0,0,0,1,0,0,0,1,0,0])
a2 = a1.cumsum()
print(a2)
# [1 1 1 1 2 2 2 2 3 3 3]
For a column in a pandas dataframe:
import pandas
df = pandas.DataFrame({'col1': [1,0,0,0,1,0,0,0,1,0,0]})
df['col2'] = df['col1'].cumsum()
print(df)
# col1 col2
# 0 1 1
# 1 0 1
# 2 0 1
# 3 0 1
# 4 1 2
# 5 0 2
# 6 0 2
# 7 0 2
# 8 1 3
# 9 0 3
# 10 0 3
Documentation:
itertools.accumulate;
numpy.cumsum;
numpy.ndarray.cumsum;
pandas.DataFrame.cumsum;
pandas.Series.cumsum.

Assign a value to the first row of a sub-dataframe

I have a Pandas dataframe that looks like this:
df = pd.DataFrame({'gp_id': [1, 2, 1, 2], 'A': [1, 2, 3, 4]})
gp_id A
0 1 1
1 2 2
2 1 3
3 2 4
I want to assign the value -1 to the first row of the group with the id 2 (gp_id = 2), to get the following output:
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
To do this, I've tried the following code:
df[df.gp_id == 2].A.iloc[0] = -1
But this doesn't do anything as I'm assigning a value in the sub-dataframe df[df.gp_id == 2] and I'm not modifying the original dataframe df.
Is there an easy way to solve this problem?
You could do:
df.loc[(df.gp_id == 2).argmax(), 'A'] = -1
since pd.Series.argmax returns the position of the first maximum (here, the first True).
If you are not sure that the value is present in the dataframe, you could do:
cond = (df.gp_id == 2)
if cond.sum():
    df.loc[cond.argmax(), 'A'] = -1
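One caveat: pd.Series.argmax returns a positional index, while .loc expects a label; the two coincide here only because of the default RangeIndex. With a non-default index you can map the position back to a label first (a sketch of the same idea):
cond = (df.gp_id == 2)
if cond.any():
    df.loc[df.index[cond.argmax()], 'A'] = -1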
A general solution, for the case where the mask may match no rows, is to chain a second mask built from the cumulative sum of the first (combined with & for bitwise AND) and set values with DataFrame.loc:
m = df.gp_id == 2
df.loc[m & (m.cumsum() == 1), 'A'] = -1
This works well when there is no match: nothing is assigned, with no error and no incorrect assignment:
m = df.gp_id == 7
df.loc[m & (m.cumsum() == 1), 'A'] = -1
If the mask is guaranteed to match at least one row, a simpler solution is:
idx = df[df.gp_id == 2].index[0]
df.loc[idx, 'A'] = -1
print(df)
gp_id A
0 1 1
1 2 -1
2 1 3
3 2 4
If there is no match, this solution raises an error rather than assigning incorrectly.
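Another idiom with the same no-match safety combines the mask with Series.duplicated, which is True everywhere except the first occurrence of each value:
df.loc[(df.gp_id == 2) & ~df.gp_id.duplicated(), 'A'] = -1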

Pandas: create column based on first off occurrences in column B after signal in column A

I have column A with a signal-on flag (== 1) and column B with a signal-off flag (== 1); the rest of the values are zero.
data = {'A': [1, 0, 0, 0, 0, 1, 0],
        'B': [1, 0, 1, 1, 0, 0, 1]}
df = pd.DataFrame.from_dict(data)
I need to create a column C where:
when A == 1, C = 1 (regardless of whether B is 0 or 1);
C stays 1 until the next row where B == 1, then C = 0.
Here what the result should be:
df['C'] = [1, 1, 0, 0, 0, 1, 0]
I used
df.loc[df['A'] == 1, 'C'] = 1
to set C to 1 on the rows where A == 1, but I cannot find a way to get the first nonzero in column B after the signal-on in A, and to fill the remaining rows with zeros until the next 1 in A.
You can use mask together with transform('idxmax'). The mask sets B to 0 wherever A equals 1, since C will be 1 there no matter what value B has.
df['C'] = (df.index < df.B.mask(df.A.eq(1), 0).groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df
A B C
0 1 1 1
1 0 0 1
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
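For clarity, here is the same computation broken into steps, with the intermediate values for the sample data shown as comments:
b = df.B.mask(df.A.eq(1), 0)              # zero out B on signal-on rows:  [0 0 1 1 0 0 1]
grp = df.A.cumsum()                       # one group per signal-on:       [1 1 1 1 1 2 2]
off = b.groupby(grp).transform('idxmax')  # first off-row index per group: [2 2 2 2 2 6 6]
df['C'] = (df.index < off).astype(int)    # 1 strictly before the off row: [1 1 0 0 0 1 0]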
Update
s = df.B.mask(df.A.eq(1), 0)
s = (s == 1) & (s.shift(-1) == 0)
df['C'] = (df.index < s.groupby(df.A.cumsum()).transform('idxmax')).astype(int)
df.loc[df.A == 1, 'C'] = 1
Hello and welcome to Stack Overflow.
This is a case you usually wouldn't use pandas for, as the value of C depends on previous rows; pandas is more about applying "split-apply-combine" to independent measurements.
If it is not runtime-critical I would probably write a plain old loop for this:
In [4]: C = []
   ...: signal = 0
   ...: for _, row in df.iterrows():
   ...:     if (signal == 1) and (row.B == 1):
   ...:         signal = 0
   ...:     elif row.A == 1:
   ...:         signal = 1
   ...:     C.append(signal)
   ...:
In [5]: C
Out[5]: [1, 1, 0, 0, 0, 1, 0]
In [6]: df['C'] = C
In [7]: df
Out[7]:
A B C
0 1 1 1
1 0 0 1
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
This won't have good performance, but imho it is worth it to cleanly express the intent of your code if it is still "fast enough".
A solution based on iterrows (as proposed in one of the other answers) may be too slow.
Define the following function, which computes the output signal for a group of input rows (each group starting at an occurrence of A == 1):
def signal(grp):
    # zero out B on the signal-on row; C is 1 while the running sum of B within the group is still 0
    return pd.Series(np.equal(np.where(grp.A == 1, 0, grp.B).cumsum(), 0).astype(int),
                     index=grp.index)
Then group df and apply this function:
df['C'] = df.groupby(df.A.cumsum()).apply(signal)\
.reset_index(level=0, drop=True)
Edit
An even faster solution, without grouping, is:
sig = df.A.replace(0, np.nan)                                 # 1 at each signal-on row, NaN elsewhere
sig.update(df.A.lt(df.B).astype(int).replace(0, np.nan) - 1)  # 0 at each signal-off row (B == 1 while A == 0)
df['C'] = sig.ffill().fillna(0, downcast='infer')             # carry the last on/off state forward
For a sample of 7000 rows (your data repeated 1000 times) the execution
time of this solution is 14 times shorter than the solution by YOBEN_S.
